| 1 Non-standard hyphenation | |
| 2 ------------------------ | |
| 3 | |
| 4 Some languages use non-standard hyphenation, that is, `discretionary' | |
| 5 character changes at hyphenation points. For example, | |
| 6 Catalan: paral·lel -> paral-lel, | |
| 7 Dutch: omaatje -> oma-tje, | |
| 8 German (before the new orthography): Schiffahrt -> Schiff-fahrt, | |
| 9 Hungarian: asszonnyal -> asz-szony-nyal (multiple occurrences!) | |
| 10 Swedish: tillata -> till-lata. | |
| 11 | |
| 12 Using this extended library, you can define | |
| 13 non-standard hyphenation patterns. For example: | |
| 14 | |
| 15 l·1l/l=l | |
| 16 a1atje./a=t,1,3 | |
| 17 .schif1fahrt/ff=f,5,2 | |
| 18 .as3szon/sz=sz,2,3 | |
| 19 n1nyal./ny=ny,1,3 | |
| 20 .til1lata./ll=l,3,2 | |
| 21 | |
| 22 or with narrow boundaries: | |
| 23 | |
| 24 l·1l/l=,1,2 | |
| 25 a1atje./a=,1,1 | |
| 26 .schif1fahrt/ff=,5,1 | |
| 27 .as3szon/sz=,2,1 | |
| 28 n1nyal./ny=,1,1 | |
| 29 .til1lata./ll=,3,1 | |
| 30 | |
| 31 Note: Libhnj uses patterns modified by the substrings.pl preprocessing script. | |
| 32 Unfortunately, this conversion step (non-standard -> standard pattern | |
| 33 conversion) can currently generate bad non-standard patterns, so using | |
| 34 narrow boundaries may be better for recent Libhnj. For example, | |
| 35 substrings.pl generates a few bad patterns for Hungarian hyphenation | |
| 36 patterns, resulting in bad non-standard hyphenation in a few cases. Using narrow | |
| 37 boundaries solves this problem. The Java HyFo module can check for this problem. | |
| 38 | |
| 39 Syntax of the non-standard hyphenation patterns | |
| 40 ------------------------------------------------ | |
| 41 | |
| 42 pat1tern/change[,start,cut] | |
| 43 | |
| 44 If this pattern matches the word, and this pattern wins (see README.hyphen) | |
| 45 in the change region of the pattern, then the pattern[start, start + cut - 1] | |
| 46 substring will be replaced with the "change". | |
| 47 | |
| 48 For example, a German ff -> ff-f hyphenation: | |
| 49 | |
| 50 f1f/ff=f | |
| 51 | |
| 52 or with expansion | |
| 53 | |
| 54 f1f/ff=f,1,2 | |
| 55 | |
| 56 will replace every "ff" with "ff=f" at hyphenation. | |
| 57 | |
| 58 A more realistic example: | |
| 59 | |
| 60 % simple ff -> f-f hyphenation | |
| 61 f1f | |
| 62 % Schiffahrt -> Schiff-fahrt hyphenation | |
| 63 % | |
| 64 schif3fahrt/ff=f,5,2 | |
| 65 | |
| 66 Specification | |
| 67 | |
| 68 - Pattern: matching patterns of Liang's original algorithm | |
| 69 - patterns must contain exactly one hyphenation point in the change region, | |
| 70 marked with a one-digit odd number (1, 3, 5, 7 or 9). | |
| 71 This point may be at the subregion boundaries: schif3fahrt/ff=,5,1 | |
| 72 - only the greater value guarantees the win (don't mix standard and | |
| 73 non-standard patterns with the same value, for example | |
| 74 instead of f3f and schif3fahrt/ff=f,5,2 use f3f and schif5fahrt/ff=f,5,2) | |
| 75 | |
| 76 - Change: new characters. | |
| 77 Arbitrary character sequence. The equal sign (=) marks hyphenation points | |
| 78 for OpenOffice.org (as in the example). (In a possible German LaTeX | |
| 79 preprocessor, ff could be replaced with "ff, for a Hungarian one, ssz | |
| 80 with `ssz, according to the German and Hungarian Babel settings.) | |
| 81 | |
| 82 - Start: starting position of the change region. | |
| 83 - begins with 1 (not 0): schif3fahrt/ff=f,5,2 | |
| 84 - start dot doesn't matter: .schif3fahrt/ff=f,5,2 | |
| 85 - numbers don't matter: .s2c2h2i2f3f2ahrt/ff=f,5,2 | |
| 86 - In UTF-8 encoding, use Unicode character positions: össze/sz=sz,2,3 | |
| 87 ("össze" looks like "Ã¶ssze" in an ISO 8859-1 8-bit editor). | |
| 88 | |
| 89 - Cut: length of the removed character sequence in the original word. | |
| 90 - In UTF-8 encoding, use Unicode character length: paral·1lel/l=l,5,3 | |
| 91 ("paral·lel" looks like "paralÂ·lel" in an ISO 8859-1 8-bit editor). | |
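
The start and cut rules above can be illustrated with a minimal Python sketch. It is not part of the library: parse_pattern() and apply_change() are hypothetical helper names, the default of start = 1 and cut = pattern length is inferred from the f1f/ff=f example, and the sketch ignores pattern competition (which pattern wins) and boundary-dot anchoring, rewriting only the first occurrence of the pattern letters.

```python
import re

def parse_pattern(line):
    """Split 'pat1tern/change[,start,cut]' into (letters, change, start, cut)."""
    pattern, _, rest = line.partition("/")
    # Positions count letters only: boundary dots and digits are ignored.
    letters = re.sub(r"[0-9.]", "", pattern)
    fields = rest.split(",")
    change = fields[0]
    if len(fields) == 3:
        start, cut = int(fields[1]), int(fields[2])
    else:
        # Assumed default, inferred from f1f/ff=f being equal to f1f/ff=f,1,2.
        start, cut = 1, len(letters)
    return letters, change, start, cut

def apply_change(word, line):
    """Replace pattern[start, start + cut - 1] in word with the change string."""
    letters, change, start, cut = parse_pattern(line)
    at = word.find(letters)          # first occurrence only (simplification)
    if at < 0:
        return word                  # pattern does not match: leave word as is
    begin = at + start - 1           # start is 1-based
    return word[:begin] + change + word[begin + cut:]

print(apply_change("schiffahrt", ".schif1fahrt/ff=f,5,2"))  # schiff=fahrt
print(apply_change("schiffahrt", ".schif1fahrt/ff=,5,1"))   # schiff=fahrt
print(apply_change("paral·lel", "paral·1lel/l=l,5,3"))      # paral=lel
```

Python strings are indexed by Unicode character, so the positions agree with the UTF-8 rule above: "össze" has 5 characters but 6 bytes when encoded as UTF-8, and a byte-based implementation would have to convert between the two.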
| 92 | |
| 93 Dictionary development | |
| 94 ---------------------- | |
| 95 | |
| 96 There is no extended PatGen pattern generator for non-standard | |
| 97 hyphenation patterns yet. | |
| 98 | |
| 99 Fortunately, non-standard hyphenation points are forbidden in the PatGen | |
| 100 generated hyphenation patterns, so with a small patch, non-standard | |
| 101 hyphenation patterns can also be developed in this case. | |
| 102 | |
| 103 Warning: If you use UTF-8 Unicode encoding in your patterns, call | |
| 104 substrings.pl with the UTF-8 parameter to calculate the right | |
| 105 character positions for non-standard hyphenation: | |
| 106 | |
| 107 ./substrings.pl input output UTF-8 | |
| 108 | |
| 109 Programming | |
| 110 ----------- | |
| 111 | |
| 112 Use hyphenate2() or hyphenate3() to handle non-standard hyphenation. | |
| 113 See hyphen.h for the documentation of the hyphenate*() functions. | |
| 114 See example.c for processing the output of the hyphenate*() functions. | |
| 115 | |
| 116 Warning: change characters are lowercase in the source, so you may need to | |
| 117 convert the case of the change characters based on the detected case of the input word. | |
| 118 For example, see OpenOffice.org source | |
| 119 (lingucomponent/source/hyphenator/altlinuxhyph/hyphen/hyphenimp.cxx). | |
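
The case conversion mentioned in the warning might look like the following Python sketch. This is only an illustration under simple assumptions, not the OpenOffice.org implementation: match_case() and its at_word_start parameter are hypothetical, and only all-uppercase and initial-capital words are handled.

```python
def match_case(change, word, at_word_start=False):
    """Adapt a lowercase change string to the case of the input word."""
    if word.isupper():
        # All-caps input word: uppercase the whole change string.
        return change.upper()
    if at_word_start and word[:1].isupper():
        # Capitalized word whose change region begins at the first letter.
        return change[:1].upper() + change[1:]
    return change

print(match_case("ff=f", "SCHIFFAHRT"))  # FF=F
print(match_case("ff=f", "Schiffahrt"))  # ff=f
```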
| 120 | |
| 121 László Németh | |
| 122 <nemeth (at) openoffice.org> | |