OLD | NEW |
| (Empty) |
1 Brief explanation of the hyphenation algorithm herein.[1] | |
2 | |
3 Raph Levien <raph@acm.org> | |
4 4 Aug 1998 | |
5 | |
6 The hyphenation algorithm is basically the same as Knuth's TeX | |
7 algorithm. However, the implementation is quite a bit faster. | |
8 | |
9 The hyphenation files from TeX can almost be used directly. There | |
10 is a preprocessing step, however. If you don't do the preprocessing | |
11 step, you'll get bad hyphenations (i.e. a silent failure). | |
12 | |
13 Start with a file such as hyphen.us. This is the TeX ushyph1.tex | |
14 file, with the exception dictionary encoded using the same rules as | |
15 the main portion of the file. Any line beginning with % is a comment. | |
16 Each other line should contain exactly one rule. | |
17 | |
18 Then, do the preprocessing - "perl substrings.pl hyphen.us". The | |
19 resulting file is hyphen.mashed. It's in Perl, and it's fairly slow | |
20 (it uses brute force algorithms; about 17 seconds on a P100), but it | |
21 could probably be redone in C with clever algorithms. This would be | |
22 valuable, for example, if it was handle user-supplied exception | |
23 dictionaries by integrating them into the rule table.[2] | |
24 | |
25 Once the rules are preprocessed, loading them is quite quick - | |
26 about 200ms on a P100. It then hyphenates at about 40,000 words per | |
27 second on a P100. I haven't benchmarked it against other | |
28 implementations (both TeX and groff contain essentially the same | |
29 algorithm), but expect that it runs quite a bit faster than any of | |
30 them. | |
31 | |
32 Knuth's algorithm | |
33 | |
34 This section contains a brief explanation of Knuth's algorithm, in | |
35 case you missed it from the TeX books. We'll use the semi-word | |
36 "example" as our running example. | |
37 | |
38 Since the beginning and end of a word are special, the algorithm is | |
39 actually run over the prepared word (prep_word in the source) | |
40 ".example.". Knuths algorithm basically just does pattern matches from | |
41 the rule set, then applies the matches. The patterns in this case that | |
42 match are "xa", "xam", "mp", and "pl". These are actually stored as | |
43 "x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between | |
44 the letters, they are added in. If two (or more) patterns have numbers | |
45 in the same place, the highest number wins. Here's the example: | |
46 | |
47 . e x a m p l e . | |
48 x1a | |
49 x a m3 | |
50 4m1p | |
51 1p2l2 | |
52 ----------------- | |
53 . e x1a4m3p2l2e . | |
54 | |
55 Finally, hyphens are placed wherever odd numbers appear. They are, | |
56 however, suppressed after the first letter and before the last letter | |
57 of the word (TeX actually suppresses them before the next-to-last, as | |
58 well). So, it's "ex-am-ple", which is correct. | |
59 | |
60 Knuth uses a trie to implement this. I.e. he stores each rule in a | |
61 trie structure. For each position in the word, he searches the trie, | |
62 searching for a match. Most patterns are short, so efficiency should | |
63 be quite good. | |
64 | |
65 Theory of the algorithm | |
66 | |
67 The algorithm works as a slightly modified finite state machine. | |
68 There are two kinds of transitions: those that consume one letter of | |
69 input (which work just like your regular finite state machine), and | |
70 "fallback" transitions, which don't consume any input. If no | |
71 transition matching the next letter is found, the fallback is used. | |
72 One way of looking at this is a form of compression of the transition | |
73 tables - i.e. it behaves the same as a completely vanilla state | |
74 machine in which the actual transition table of a node is made up of | |
75 the union of transition tables of the node itself, plus its fallbacks. | |
76 | |
77 Each state is represented by a string. Thus, if the current state | |
78 is "am" and the next letter is "p", then the next state is "amp". | |
79 Fallback transitions go to states which chop off one or (sometimes) | |
80 more letters from the beginning. For example, if none of the | |
81 transitions from "amp" match the next letter, then it will fall back | |
82 to "mp". Similarly, if none of the transitions from "mp" match the | |
83 next letter, it will fall back to "m". | |
84 | |
85 Each state is also associated with a (possibly null) "match" | |
86 string. This represents the union of all patterns which are | |
87 right-justified substrings of the match string. I.e. the pattern "mp" | |
88 is a right-justified substring of the state "amp", so it's numbers get | |
89 added in. The actual calculation of this union is done by the | |
90 Perl preprocessing script, but could probably be done in C just about | |
91 as easily. | |
92 | |
93 Because each state transition either consumes one input character | |
94 or shortens the state string by one character, the total number of | |
95 state transitions is linear in the length of the word. | |
96 | |
97 [1] Documentations: | |
98 | |
99 Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er. | |
100 Stanford University, 1983. http://www.tug.org/docs/liang. | |
101 | |
102 László Németh: Automatic non-standard hyphenation in OpenOffice.org, | |
103 TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf | |
104 | |
105 [2] There is the C version of pattern converter "substrings.c" | |
106 in the distribution written by Nanning Buitenhuis. Unfortunatelly, | |
107 this version hasn't handled the non standard extension of the | |
108 algorithm, yet. | |
OLD | NEW |