Index: third_party/hyphen/README.hyphen |
diff --git a/third_party/hyphen/README.hyphen b/third_party/hyphen/README.hyphen |
deleted file mode 100644 |
index 8aa8c8767922c977e7b9b518b609e0672d78b19c..0000000000000000000000000000000000000000 |
--- a/third_party/hyphen/README.hyphen |
+++ /dev/null |
@@ -1,108 +0,0 @@ |
-Brief explanation of the hyphenation algorithm herein.[1] |
- |
-Raph Levien <raph@acm.org> |
-4 Aug 1998 |
- |
- The hyphenation algorithm is basically the same as Knuth's TeX |
-algorithm. However, the implementation is quite a bit faster. |
- |
- The hyphenation files from TeX can almost be used directly. There |
-is a preprocessing step, however. If you don't do the preprocessing |
-step, you'll get bad hyphenations (i.e. a silent failure). |
- |
- Start with a file such as hyphen.us. This is the TeX ushyph1.tex |
-file, with the exception dictionary encoded using the same rules as |
-the main portion of the file. Any line beginning with % is a comment. |
-Each other line should contain exactly one rule. |
- |
- Then, do the preprocessing - "perl substrings.pl hyphen.us". The |
-resulting file is hyphen.mashed. It's in Perl, and it's fairly slow |
-(it uses brute force algorithms; about 17 seconds on a P100), but it |
-could probably be redone in C with clever algorithms. This would be |
-valuable, for example, if it was handle user-supplied exception |
-dictionaries by integrating them into the rule table.[2] |
- |
- Once the rules are preprocessed, loading them is quite quick - |
-about 200ms on a P100. It then hyphenates at about 40,000 words per |
-second on a P100. I haven't benchmarked it against other |
-implementations (both TeX and groff contain essentially the same |
-algorithm), but expect that it runs quite a bit faster than any of |
-them. |
- |
-Knuth's algorithm |
- |
- This section contains a brief explanation of Knuth's algorithm, in |
-case you missed it from the TeX books. We'll use the semi-word |
-"example" as our running example. |
- |
- Since the beginning and end of a word are special, the algorithm is |
-actually run over the prepared word (prep_word in the source) |
-".example.". Knuths algorithm basically just does pattern matches from |
-the rule set, then applies the matches. The patterns in this case that |
-match are "xa", "xam", "mp", and "pl". These are actually stored as |
-"x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between |
-the letters, they are added in. If two (or more) patterns have numbers |
-in the same place, the highest number wins. Here's the example: |
- |
- . e x a m p l e . |
- x1a |
- x a m3 |
- 4m1p |
- 1p2l2 |
- ----------------- |
- . e x1a4m3p2l2e . |
- |
- Finally, hyphens are placed wherever odd numbers appear. They are, |
-however, suppressed after the first letter and before the last letter |
-of the word (TeX actually suppresses them before the next-to-last, as |
-well). So, it's "ex-am-ple", which is correct. |
- |
- Knuth uses a trie to implement this. I.e. he stores each rule in a |
-trie structure. For each position in the word, he searches the trie, |
-searching for a match. Most patterns are short, so efficiency should |
-be quite good. |
- |
-Theory of the algorithm |
- |
- The algorithm works as a slightly modified finite state machine. |
-There are two kinds of transitions: those that consume one letter of |
-input (which work just like your regular finite state machine), and |
-"fallback" transitions, which don't consume any input. If no |
-transition matching the next letter is found, the fallback is used. |
-One way of looking at this is a form of compression of the transition |
-tables - i.e. it behaves the same as a completely vanilla state |
-machine in which the actual transition table of a node is made up of |
-the union of transition tables of the node itself, plus its fallbacks. |
- |
- Each state is represented by a string. Thus, if the current state |
-is "am" and the next letter is "p", then the next state is "amp". |
-Fallback transitions go to states which chop off one or (sometimes) |
-more letters from the beginning. For example, if none of the |
-transitions from "amp" match the next letter, then it will fall back |
-to "mp". Similarly, if none of the transitions from "mp" match the |
-next letter, it will fall back to "m". |
- |
- Each state is also associated with a (possibly null) "match" |
-string. This represents the union of all patterns which are |
-right-justified substrings of the match string. I.e. the pattern "mp" |
-is a right-justified substring of the state "amp", so it's numbers get |
-added in. The actual calculation of this union is done by the |
-Perl preprocessing script, but could probably be done in C just about |
-as easily. |
- |
- Because each state transition either consumes one input character |
-or shortens the state string by one character, the total number of |
-state transitions is linear in the length of the word. |
- |
-[1] Documentations: |
- |
-Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er. |
-Stanford University, 1983. http://www.tug.org/docs/liang. |
- |
-László Németh: Automatic non-standard hyphenation in OpenOffice.org, |
-TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf |
- |
-[2] There is the C version of pattern converter "substrings.c" |
-in the distribution written by Nanning Buitenhuis. Unfortunatelly, |
-this version hasn't handled the non standard extension of the |
-algorithm, yet. |