third_party/hyphen/README.hyphen - Issue 20860003: Remove hyphenation code from Chromium.

Side by Side Diff: third_party/hyphen/README.hyphen

Issue 20860003: Remove hyphenation code from Chromium. (Closed) Base URL: svn://svn.chromium.org/chrome/trunk/src

Patch Set: rebase Created 7 years, 4 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View unified diff | Download patch | Annotate | Revision Log

OLD	NEW
	(Empty)
1 Brief explanation of the hyphenation algorithm herein.[1]

2

3 Raph Levien <raph@acm.org>

4 4 Aug 1998

5

6 The hyphenation algorithm is basically the same as Knuth's TeX

7 algorithm. However, the implementation is quite a bit faster.

8

9 The hyphenation files from TeX can almost be used directly. There

10 is a preprocessing step, however. If you don't do the preprocessing

11 step, you'll get bad hyphenations (i.e. a silent failure).

12

13 Start with a file such as hyphen.us. This is the TeX ushyph1.tex

14 file, with the exception dictionary encoded using the same rules as

15 the main portion of the file. Any line beginning with % is a comment.

16 Each other line should contain exactly one rule.

17

18 Then, do the preprocessing - "perl substrings.pl hyphen.us". The

19 resulting file is hyphen.mashed. It's in Perl, and it's fairly slow

20 (it uses brute force algorithms; about 17 seconds on a P100), but it

21 could probably be redone in C with clever algorithms. This would be

22 valuable, for example, if it was handle user-supplied exception

23 dictionaries by integrating them into the rule table.[2]

24

25 Once the rules are preprocessed, loading them is quite quick -

26 about 200ms on a P100. It then hyphenates at about 40,000 words per

27 second on a P100. I haven't benchmarked it against other

28 implementations (both TeX and groff contain essentially the same

29 algorithm), but expect that it runs quite a bit faster than any of

30 them.

31

32 Knuth's algorithm

33

34 This section contains a brief explanation of Knuth's algorithm, in

35 case you missed it from the TeX books. We'll use the semi-word

36 "example" as our running example.

37

38 Since the beginning and end of a word are special, the algorithm is

39 actually run over the prepared word (prep_word in the source)

40 ".example.". Knuths algorithm basically just does pattern matches from

41 the rule set, then applies the matches. The patterns in this case that

42 match are "xa", "xam", "mp", and "pl". These are actually stored as

43 "x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between

44 the letters, they are added in. If two (or more) patterns have numbers

45 in the same place, the highest number wins. Here's the example:

46

47 . e x a m p l e .

48 x1a

49 x a m3

50 4m1p

51 1p2l2

52 -----------------

53 . e x1a4m3p2l2e .

54

55 Finally, hyphens are placed wherever odd numbers appear. They are,

56 however, suppressed after the first letter and before the last letter

57 of the word (TeX actually suppresses them before the next-to-last, as

58 well). So, it's "ex-am-ple", which is correct.

59

60 Knuth uses a trie to implement this. I.e. he stores each rule in a

61 trie structure. For each position in the word, he searches the trie,

62 searching for a match. Most patterns are short, so efficiency should

63 be quite good.

64

65 Theory of the algorithm

66

67 The algorithm works as a slightly modified finite state machine.

68 There are two kinds of transitions: those that consume one letter of

69 input (which work just like your regular finite state machine), and

70 "fallback" transitions, which don't consume any input. If no

71 transition matching the next letter is found, the fallback is used.

72 One way of looking at this is a form of compression of the transition

73 tables - i.e. it behaves the same as a completely vanilla state

74 machine in which the actual transition table of a node is made up of

75 the union of transition tables of the node itself, plus its fallbacks.

76

77 Each state is represented by a string. Thus, if the current state

78 is "am" and the next letter is "p", then the next state is "amp".

79 Fallback transitions go to states which chop off one or (sometimes)

80 more letters from the beginning. For example, if none of the

81 transitions from "amp" match the next letter, then it will fall back

82 to "mp". Similarly, if none of the transitions from "mp" match the

83 next letter, it will fall back to "m".

84

85 Each state is also associated with a (possibly null) "match"

86 string. This represents the union of all patterns which are

87 right-justified substrings of the match string. I.e. the pattern "mp"

88 is a right-justified substring of the state "amp", so it's numbers get

89 added in. The actual calculation of this union is done by the

90 Perl preprocessing script, but could probably be done in C just about

91 as easily.

92

93 Because each state transition either consumes one input character

94 or shortens the state string by one character, the total number of

95 state transitions is linear in the length of the word.

96

97 [1] Documentations:

98

99 Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er.

100 Stanford University, 1983. http://www.tug.org/docs/liang.

101

102 László Németh: Automatic non-standard hyphenation in OpenOffice.org,

103 TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf

104

105 [2] There is the C version of pattern converter "substrings.c"

106 in the distribution written by Nanning Buitenhuis. Unfortunatelly,

107 this version hasn't handled the non standard extension of the

108 algorithm, yet.

OLD	NEW

« no previous file with comments | « third_party/hyphen/README.compound ('k') | third_party/hyphen/README.nonstandard » ('j') | no next file with comments »