Mercurial > vim
comparison runtime/doc/develop.txt @ 625:81fe2ccc1207 v7.0179
updated for version 7.0179
author | vimboss |
---|---|
date | Thu, 12 Jan 2006 23:22:24 +0000 |
parents | 52e76e2b5b65 |
children | 1babf94e0b24 |
comparison
equal
deleted
inserted
replaced
624:91e7d4a7b3b0 | 625:81fe2ccc1207 |
---|---|
1 *develop.txt* For Vim version 7.0aa. Last change: 2005 Sep 01 | 1 *develop.txt* For Vim version 7.0aa. Last change: 2006 Jan 12 |
2 | 2 |
3 | 3 |
4 VIM REFERENCE MANUAL by Bram Moolenaar | 4 VIM REFERENCE MANUAL by Bram Moolenaar |
5 | 5 |
6 | 6 |
380 support). | 380 support). |
381 - For the programs and libraries: Using them as-is would require installing | 381 - For the programs and libraries: Using them as-is would require installing |
382 them separately from Vim. That's mostly not impossible, but a drawback. | 382 them separately from Vim. That's mostly not impossible, but a drawback. |
383 - Performance: A few tests showed that it's possible to check spelling on the | 383 - Performance: A few tests showed that it's possible to check spelling on the |
384 fly (while redrawing), just like syntax highlighting. But the mechanisms | 384 fly (while redrawing), just like syntax highlighting. But the mechanisms |
385 used by other code are much slower. Myspell uses a simplistic hashtable, | 385 used by other code are much slower. Myspell uses a hashtable, for example. |
386 for example. | 386 The affix compression that most spell checkers use makes it slower too. |
387 - For using an external program like aspell a communication mechanism would | 387 - For using an external program like aspell a communication mechanism would |
388 have to be setup. That's complicated to do in a portable way (Unix-only | 388 have to be setup. That's complicated to do in a portable way (Unix-only |
389 would be relatively simple, but that's not good enough). And performance | 389 would be relatively simple, but that's not good enough). And performance |
390 will become a problem (lots of process switching involved). | 390 will become a problem (lots of process switching involved). |
391 - Missing support for words with non-word characters, such as "Etten-Leur" and | 391 - Missing support for words with non-word characters, such as "Etten-Leur" and |
397 and could be a misspelled often-used word. | 397 and could be a misspelled often-used word. |
398 - For making suggestions the speed is less important and requiring to install | 398 - For making suggestions the speed is less important and requiring to install |
399 another program or library would be acceptable. But the word lists probably | 399 another program or library would be acceptable. But the word lists probably |
400 differ, the suggestions may be wrong words. | 400 differ, the suggestions may be wrong words. |
401 | 401 |
402 | |
403 Spelling suggestions *develop-spell-suggestions* | |
404 | |
405 For making suggestions there are two basic mechanisms: | |
406 1. Try changing the bad word a little bit and check for a match with a good | |
407 word. Or go through the list of good words, change them a little bit and | |
408 check for a match with the bad word. The changes are deleting a character, | |
409 inserting a character, swapping two characters, etc. | |
410 2. Perform soundfolding on both the bad word and the good words and then find | |
411 matches, possibly with a few changes like with the first mechanism. | |
412 | |
413 The first is good for finding typing mistakes. After experimenting with | |
414 hashtables and looking at solutions from other spell checkers the conclusion | |
415 was that a trie (a kind of tree structure) is ideal for this. Both for | |
416 reducing memory use and being able to try sensible changes. For example, when | |
417 inserting a character only characters that lead to good words need to be | |
418 tried. Other mechanisms (with hashtables) need to try all possible letters at | |
419 every position in the word. Also, a hashtable has the requirement that word | |
420 boundaries are identified separately, while a trie does not require this. | |
421 That makes the mechanism a lot simpler. | |
422 | |
423 Soundfolding is useful when someone knows how the words sounds but doesn't | |
424 know how it is spelled. For example, the word "dictionary" might be written | |
425 as "daktonerie". The number of changes that the first method would need to | |
426 try is very big, it's hard to find the good word that way. After soundfolding | |
427 the words become "tktnr" and "tkxnry", these differ by only two letters. | |
428 | |
429 To find words by their soundfolded equivalent (soundalike word) we need a list | |
430 of all soundfolded words. A few experiments have been done to find out what | |
431 the best method is. Alternatives: | |
432 1. Do the sound folding on the fly when looking for suggestions. This means | |
433 walking through the trie of good words, soundfolding each word and | |
434 checking how different it is from the bad word. This is very efficient for | |
435 memory use, but takes a long time. On a fast PC it takes a couple of | |
436 seconds for English, which can be acceptable for interactive use. But for | |
437 some languages it takes more than ten seconds (e.g., German, Catalan), | |
438 which is unacceptable slow. For batch processing (automatic corrections) | |
439 it's to slow for all languages. | |
440 2. Use a trie for the soundfolded words, so that searching can be done just | |
441 like how it works without soundfolding. This requires remembering a list | |
442 of good words for each soundfolded word. This makes finding matches very | |
443 fast but requires quite a lot of memory, in the order of 1 to 10 Mbyte. | |
444 For some languages more than the original word list. | |
445 3. Like the second alternative, but reduce the amount of memory by using affix | |
446 compression and store only the soundfolded basic word. This is what Aspell | |
447 does. Disadvantage is that affixes need to be stripped from the bad word | |
448 before soundfolding it, which means that mistakes at the start and/or end | |
449 of the word will cause the mechanism to fail. Also, this becomes slow when | |
450 the bad word is quite different from the good word. | |
451 | |
452 The choice made is to use the second mechanism and use a separate file. This | |
453 way a user with sufficient memory can get very good suggestions while a user | |
454 who is short of memory or just wants the spell checking and no suggestions | |
455 doesn't use so much memory. | |
456 | |
457 | |
458 Word frequency | |
459 | |
460 For sorting suggestions it helps to know which words are common. In theory we | |
461 could store a word frequency with the word in the dictionary. However, this | |
462 requires storing a count per word. That degrades word tree compression a lot. | |
463 And maintaining the word frequency for all languages will be a heavy task. | |
464 Also, it would be nice to prefer words that are already in the text. This way | |
465 the words that appear in the specific text are preferred for suggestions. | |
466 | |
467 What has been implemented is to count words that have been seen during | |
468 displaying. A hashtable is used to quickly find the word count. The count is | |
469 initialized from words listed in COMMON items in the affix file, so that it | |
470 also works when starting a new file. | |
471 | |
472 This isn't ideal, because the longer Vim is running the higher the counts | |
473 become. But in practice it is a noticable improvement over not using the word | |
474 count. | |
475 | |
402 ============================================================================== | 476 ============================================================================== |
403 4. Assumptions *design-assumptions* | 477 4. Assumptions *design-assumptions* |
404 | 478 |
405 Size of variables: | 479 Size of variables: |
406 char 8 bit signed | 480 char 8 bit signed |
407 char_u 8 bit unsigned | 481 char_u 8 bit unsigned |
408 int 16, 32 or 64 bit signed | 482 int 32 or 64 bit signed (16 might be possible with limited features) |
409 unsigned 16, 32 or 64 bit unsigned | 483 unsigned 32 or 64 bit unsigned (16 as with ints) |
410 long 32 or 64 bit signed, can hold a pointer | 484 long 32 or 64 bit signed, can hold a pointer |
411 | 485 |
412 Note that some compilers cannot handle long lines or strings. The C89 | 486 Note that some compilers cannot handle long lines or strings. The C89 |
413 standard specifies a limit of 509 characters. | 487 standard specifies a limit of 509 characters. |
414 | 488 |