Chromium Code Reviews
chromiumcodereview-hr@appspot.gserviceaccount.com (chromiumcodereview-hr) | Please choose your nickname with Settings | Help | Chromium Project | Gerrit Changes | Sign out
(1167)

Unified Diff: third_party/hyphen/README.hyphen

Issue 20860003: Remove hyphenation code from Chromium. (Closed) Base URL: svn://svn.chromium.org/chrome/trunk/src
Patch Set: rebase Created 7 years, 5 months ago
Use n/p to move between diff chunks; N/P to move between comments. Draft comments are only viewable by you.
Jump to:
View side-by-side diff with in-line comments
Download patch
« no previous file with comments | « third_party/hyphen/README.compound ('k') | third_party/hyphen/README.nonstandard » ('j') | no next file with comments »
Expand Comments ('e') | Collapse Comments ('c') | Show Comments Hide Comments ('s')
Index: third_party/hyphen/README.hyphen
diff --git a/third_party/hyphen/README.hyphen b/third_party/hyphen/README.hyphen
deleted file mode 100644
index 8aa8c8767922c977e7b9b518b609e0672d78b19c..0000000000000000000000000000000000000000
--- a/third_party/hyphen/README.hyphen
+++ /dev/null
@@ -1,108 +0,0 @@
-Brief explanation of the hyphenation algorithm herein.[1]
-
-Raph Levien <raph@acm.org>
-4 Aug 1998
-
- The hyphenation algorithm is basically the same as Knuth's TeX
-algorithm. However, the implementation is quite a bit faster.
-
- The hyphenation files from TeX can almost be used directly. There
-is a preprocessing step, however. If you don't do the preprocessing
-step, you'll get bad hyphenations (i.e. a silent failure).
-
- Start with a file such as hyphen.us. This is the TeX ushyph1.tex
-file, with the exception dictionary encoded using the same rules as
-the main portion of the file. Any line beginning with % is a comment.
-Each other line should contain exactly one rule.
-
- Then, do the preprocessing - "perl substrings.pl hyphen.us". The
-resulting file is hyphen.mashed. It's in Perl, and it's fairly slow
-(it uses brute force algorithms; about 17 seconds on a P100), but it
-could probably be redone in C with clever algorithms. This would be
-valuable, for example, if it was handle user-supplied exception
-dictionaries by integrating them into the rule table.[2]
-
- Once the rules are preprocessed, loading them is quite quick -
-about 200ms on a P100. It then hyphenates at about 40,000 words per
-second on a P100. I haven't benchmarked it against other
-implementations (both TeX and groff contain essentially the same
-algorithm), but expect that it runs quite a bit faster than any of
-them.
-
-Knuth's algorithm
-
- This section contains a brief explanation of Knuth's algorithm, in
-case you missed it from the TeX books. We'll use the semi-word
-"example" as our running example.
-
- Since the beginning and end of a word are special, the algorithm is
-actually run over the prepared word (prep_word in the source)
-".example.". Knuths algorithm basically just does pattern matches from
-the rule set, then applies the matches. The patterns in this case that
-match are "xa", "xam", "mp", and "pl". These are actually stored as
-"x1a", "xam3", "4m1p", and "1p2l2". Whenever numbers appear between
-the letters, they are added in. If two (or more) patterns have numbers
-in the same place, the highest number wins. Here's the example:
-
- . e x a m p l e .
- x1a
- x a m3
- 4m1p
- 1p2l2
- -----------------
- . e x1a4m3p2l2e .
-
- Finally, hyphens are placed wherever odd numbers appear. They are,
-however, suppressed after the first letter and before the last letter
-of the word (TeX actually suppresses them before the next-to-last, as
-well). So, it's "ex-am-ple", which is correct.
-
- Knuth uses a trie to implement this. I.e. he stores each rule in a
-trie structure. For each position in the word, he searches the trie,
-searching for a match. Most patterns are short, so efficiency should
-be quite good.
-
-Theory of the algorithm
-
- The algorithm works as a slightly modified finite state machine.
-There are two kinds of transitions: those that consume one letter of
-input (which work just like your regular finite state machine), and
-"fallback" transitions, which don't consume any input. If no
-transition matching the next letter is found, the fallback is used.
-One way of looking at this is a form of compression of the transition
-tables - i.e. it behaves the same as a completely vanilla state
-machine in which the actual transition table of a node is made up of
-the union of transition tables of the node itself, plus its fallbacks.
-
- Each state is represented by a string. Thus, if the current state
-is "am" and the next letter is "p", then the next state is "amp".
-Fallback transitions go to states which chop off one or (sometimes)
-more letters from the beginning. For example, if none of the
-transitions from "amp" match the next letter, then it will fall back
-to "mp". Similarly, if none of the transitions from "mp" match the
-next letter, it will fall back to "m".
-
- Each state is also associated with a (possibly null) "match"
-string. This represents the union of all patterns which are
-right-justified substrings of the match string. I.e. the pattern "mp"
-is a right-justified substring of the state "amp", so it's numbers get
-added in. The actual calculation of this union is done by the
-Perl preprocessing script, but could probably be done in C just about
-as easily.
-
- Because each state transition either consumes one input character
-or shortens the state string by one character, the total number of
-state transitions is linear in the length of the word.
-
-[1] Documentations:
-
-Franklin M. Liang: Word Hy-phen-a-tion by Com-put-er.
-Stanford University, 1983. http://www.tug.org/docs/liang.
-
-László Németh: Automatic non-standard hyphenation in OpenOffice.org,
-TUGboat (27), 2006. No. 2., http://hunspell.sourceforge.net/tb87nemeth.pdf
-
-[2] There is the C version of pattern converter "substrings.c"
-in the distribution written by Nanning Buitenhuis. Unfortunatelly,
-this version hasn't handled the non standard extension of the
-algorithm, yet.
« no previous file with comments | « third_party/hyphen/README.compound ('k') | third_party/hyphen/README.nonstandard » ('j') | no next file with comments »

Powered by Google App Engine
This is Rietveld 408576698