third_party/hyphen/README.hyphen - Issue 20860003: Remove hyphenation code from Chromium.

Unified Diff: third_party/hyphen/README.hyphen

Issue 20860003: Remove hyphenation code from Chromium. (Closed) Base URL: svn://svn.chromium.org/chrome/trunk/src

Patch Set: rebase Created 7 years, 5 months ago

Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.

Jump to:

View side-by-side diff with in-line comments

Index: third_party/hyphen/README.hyphen

diff --git a/third_party/hyphen/README.hyphen b/third_party/hyphen/README.hyphen

deleted file mode 100644

index 8aa8c8767922c977e7b9b518b609e0672d78b19c..0000000000000000000000000000000000000000

--- a/third_party/hyphen/README.hyphen

+++ /dev/null

@@ -1,108 +0,0 @@

-Brief explanation of the hyphenation algorithm herein.[1]

-Raph Levien <raph@acm.org>

-4 Aug 1998

- The hyphenation algorithm is basically the same as Knuth's TeX

-algorithm. However, the implementation is quite a bit faster.

- The hyphenation files from TeX can almost be used directly. There

-is a preprocessing step, however. If you don't do the preprocessing

-step, you'll get bad hyphenations (i.e. a silent failure).

- Start with a file such as hyphen.us. This is the TeX ushyph1.tex

-file, with the exception dictionary encoded using the same rules as

-the main portion of the file. Any line beginning with % is a comment.

-Each other line should contain exactly one rule.

- Then, do the preprocessing - "perl substrings.pl hyphen.us". The

-resulting file is hyphen.mashed. It's in Perl, and it's fairly slow

-(it uses brute force algorithms; about 17 seconds on a P100), but it

-could probably be redone in C with clever algorithms. This would be

-valuable, for example, if it was handle user-supplied exception

-dictionaries by integrating them into the rule table.[2]

- Once the rules are preprocessed, loading them is quite quick -

-about 200ms on a P100. It then hyphenates at about 40,000 words per

-second on a P100. I haven't benchmarked it against other

-implementations (both TeX and groff contain essentially the same

-algorithm), but expect that it runs quite a bit faster than any of

-them.

-Knuth's algorithm

- This section contains a brief explanation of Knuth's algorithm, in

-case you missed it from the TeX books. We'll use the semi-word

-"example" as our running example.

- Since the beginning and end of a word are special, the algorithm is

-actually run over the prepared word (prep_word in the source)

-".example.". Knuths algorithm basically just does pattern matches from

-the rule set, then applies the matches. The patterns in this case that

-match are "xa", "xam", "mp", and "pl". These are actually stored as

-"x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between

-the letters, they are added in. If two (or more) patterns have numbers

-in the same place, the highest number wins. Here's the example:

- . e x a m p l e .

- x1a

- x a m3

- 4m1p

- 1p2l2

- -----------------

- . e x1a4m3p2l2e .

- Finally, hyphens are placed wherever odd numbers appear. They are,

-however, suppressed after the first letter and before the last letter

-of the word (TeX actually suppresses them before the next-to-last, as

-well). So, it's "ex-am-ple", which is correct.

- Knuth uses a trie to implement this. I.e. he stores each rule in a

-trie structure. For each position in the word, he searches the trie,

-searching for a match. Most patterns are short, so efficiency should

-be quite good.

-Theory of the algorithm

- The algorithm works as a slightly modified finite state machine.

-There are two kinds of transitions: those that consume one letter of

-input (which work just like your regular finite state machine), and

-"fallback" transitions, which don't consume any input. If no

-transition matching the next letter is found, the fallback is used.

-One way of looking at this is a form of compression of the transition

-tables - i.e. it behaves the same as a completely vanilla state

-machine in which the actual transition table of a node is made up of

-the union of transition tables of the node itself, plus its fallbacks.

- Each state is represented by a string. Thus, if the current state

-is "am" and the next letter is "p", then the next state is "amp".

-Fallback transitions go to states which chop off one or (sometimes)

-more letters from the beginning. For example, if none of the

-transitions from "amp" match the next letter, then it will fall back

-to "mp". Similarly, if none of the transitions from "mp" match the

-next letter, it will fall back to "m".

- Each state is also associated with a (possibly null) "match"

-string. This represents the union of all patterns which are

-right-justified substrings of the match string. I.e. the pattern "mp"

-is a right-justified substring of the state "amp", so it's numbers get

-added in. The actual calculation of this union is done by the

-Perl preprocessing script, but could probably be done in C just about

-as easily.

- Because each state transition either consumes one input character

-or shortens the state string by one character, the total number of

-state transitions is linear in the length of the word.

-[1] Documentations:

-Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er.

-Stanford University, 1983. http://www.tug.org/docs/liang.

-László Németh: Automatic non-standard hyphenation in OpenOffice.org,

-TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf

-[2] There is the C version of pattern converter "substrings.c"

-in the distribution written by Nanning Buitenhuis. Unfortunatelly,

-this version hasn't handled the non standard extension of the

-algorithm, yet.

« no previous file with comments | « third_party/hyphen/README.compound ('k') | third_party/hyphen/README.nonstandard » ('j') | no next file with comments »