OLD | NEW |
(Empty) | |
| 1 \input texinfo @c -*- texinfo -*- |
| 2 @c %**start of header |
| 3 @setfilename gperf.info |
| 4 @settitle Perfect Hash Function Generator |
| 5 @c @setchapternewpage odd |
| 6 @c %**end of header |
| 7 |
| 8 @c some day we should @include version.texi instead of defining |
| 9 @c these values at hand. |
| 10 @set UPDATED 12 June 2003 |
| 11 @set EDITION 3.0.1 |
| 12 @set VERSION 3.0.1 |
| 13 @c --------------------- |
| 14 |
| 15 @c remove the black boxes generated in the GPL appendix. |
| 16 @finalout |
| 17 |
| 18 @c Merge functions into the concept index |
| 19 @syncodeindex fn cp |
| 20 @c @synindex pg cp |
| 21 |
| 22 @dircategory Programming Tools |
| 23 @direntry |
| 24 * Gperf: (gperf). Perfect Hash Function Generator. |
| 25 @end direntry |
| 26 |
| 27 @ifinfo |
| 28 This file documents the features of the GNU Perfect Hash Function |
| 29 Generator @value{VERSION}. |
| 30 |
| 31 Copyright @copyright{} 1989-2003 Free Software Foundation, Inc. |
| 32 |
| 33 Permission is granted to make and distribute verbatim copies of this |
| 34 manual provided the copyright notice and this permission notice are |
| 35 preserved on all copies. |
| 36 |
| 37 @ignore |
| 38 Permission is granted to process this file through TeX and print the |
| 39 results, provided the printed document carries a copying permission |
| 40 notice identical to this one except for the removal of this paragraph |
| 41 (this paragraph not being relevant to the printed manual). |
| 42 |
| 43 @end ignore |
| 44 |
| 45 Permission is granted to copy and distribute modified versions of this |
| 46 manual under the conditions for verbatim copying, provided also that the |
| 47 section entitled ``GNU General Public License'' is included exactly as |
| 48 in the original, and provided that the entire resulting derived work is |
| 49 distributed under the terms of a permission notice identical to this |
| 50 one. |
| 51 |
| 52 Permission is granted to copy and distribute translations of this manual |
| 53 into another language, under the above conditions for modified versions, |
| 54 except that the section entitled ``GNU General Public License'' and this |
| 55 permission notice may be included in translations approved by the Free |
| 56 Software Foundation instead of in the original English. |
| 57 |
| 58 @end ifinfo |
| 59 |
| 60 @titlepage |
| 61 @title User's Guide to @code{gperf} @value{VERSION} |
| 62 @subtitle The GNU Perfect Hash Function Generator |
| 63 @subtitle Edition @value{EDITION}, @value{UPDATED} |
| 64 @author Douglas C. Schmidt |
| 65 @author Bruno Haible |
| 66 |
| 67 @page |
| 68 @vskip 0pt plus 1filll |
| 69 Copyright @copyright{} 1989-2003 Free Software Foundation, Inc. |
| 70 |
| 71 |
| 72 Permission is granted to make and distribute verbatim copies of |
| 73 this manual provided the copyright notice and this permission notice |
| 74 are preserved on all copies. |
| 75 |
| 76 Permission is granted to copy and distribute modified versions of this |
| 77 manual under the conditions for verbatim copying, provided also that the |
| 78 section entitled ``GNU General Public License'' is included |
| 79 exactly as in the original, and provided that the entire resulting |
| 80 derived work is distributed under the terms of a permission notice |
| 81 identical to this one. |
| 82 |
| 83 Permission is granted to copy and distribute translations of this manual |
| 84 into another language, under the above conditions for modified versions, |
| 85 except that the section entitled ``GNU General Public License'' may be |
| 86 included in a translation approved by the author instead of in the |
| 87 original English. |
| 88 @end titlepage |
| 89 |
| 90 @ifinfo |
| 91 @node Top, Copying, (dir), (dir) |
| 92 @top Introduction |
| 93 |
| 94 This manual documents the GNU @code{gperf} perfect hash function generator |
| 95 utility, focusing on its features and how to use them, and how to report |
| 96 bugs. |
| 97 |
| 98 @menu |
| 99 * Copying:: GNU @code{gperf} General Public License says |
| 100 how you can copy and share @code{gperf}. |
| 101 * Contributors:: People who have contributed to @code{gperf}. |
| 102 * Motivation:: The purpose of @code{gperf}. |
| 103 * Search Structures:: Static search structures and GNU @code{gperf} |
| 104 * Description:: High-level discussion of how GPERF functions. |
| 105 * Options:: A description of options to the program. |
| 106 * Bugs:: Known bugs and limitations with GPERF. |
| 107 * Projects:: Things still left to do. |
| 108 * Bibliography:: Material Referenced in this Report. |
| 109 |
| 110 * Concept Index:: |
| 111 |
| 112 @detailmenu --- The Detailed Node Listing --- |
| 113 |
| 114 High-Level Description of GNU @code{gperf} |
| 115 |
| 116 * Input Format:: Input Format to @code{gperf} |
| 117 * Output Format:: Output Format for Generated C Code with @code{gp
erf} |
| 118 * Binary Strings:: Use of NUL bytes |
| 119 |
| 120 Input Format to @code{gperf} |
| 121 |
| 122 * Declarations:: Declarations. |
| 123 * Keywords:: Format for Keyword Entries. |
| 124 * Functions:: Including Additional C Functions. |
| 125 * Controls for GNU indent:: Where to place directives for GNU @code{indent}. |
| 126 |
| 127 Declarations |
| 128 |
| 129 * User-supplied Struct:: Specifying keywords with attributes. |
| 130 * Gperf Declarations:: Embedding command line options in the input. |
| 131 * C Code Inclusion:: Including C declarations and definitions. |
| 132 |
| 133 Invoking @code{gperf} |
| 134 |
| 135 * Input Details:: Options that affect Interpretation of the Input
File |
| 136 * Output Language:: Specifying the Language for the Output Code |
| 137 * Output Details:: Fine tuning Details in the Output Code |
| 138 * Algorithmic Details:: Changing the Algorithms employed by @code{gperf} |
| 139 * Verbosity:: Informative Output |
| 140 |
| 141 @end detailmenu |
| 142 @end menu |
| 143 |
| 144 @end ifinfo |
| 145 |
| 146 @node Copying, Contributors, Top, Top |
| 147 @unnumbered GNU GENERAL PUBLIC LICENSE |
| 148 @include gpl.texinfo |
| 149 |
| 150 @node Contributors, Motivation, Copying, Top |
| 151 @unnumbered Contributors to GNU @code{gperf} Utility |
| 152 |
| 153 @itemize @bullet |
| 154 @item |
| 155 @cindex Bugs |
| 156 The GNU @code{gperf} perfect hash function generator utility was |
| 157 written in GNU C++ by Douglas C. Schmidt. The general |
| 158 idea for the perfect hash function generator was inspired by Keith |
| 159 Bostic's algorithm written in C, and distributed to net.sources around |
| 160 1984. The current program is a heavily modified, enhanced, and extended |
| 161 implementation of Keith's basic idea, created at the University of |
| 162 California, Irvine. Bugs, patches, and suggestions should be reported |
| 163 to @code{<bug-gnu-gperf@@gnu.org>}. |
| 164 |
| 165 @item |
| 166 Special thanks is extended to Michael Tiemann and Doug Lea, for |
| 167 providing a useful compiler, and for giving me a forum to exhibit my |
| 168 creation. |
| 169 |
| 170 In addition, Adam de Boor and Nels Olson provided many tips and insights |
| 171 that greatly helped improve the quality and functionality of @code{gperf}. |
| 172 |
| 173 @item |
| 174 Bruno Haible enhanced and optimized the search algorithm. He also rewrote |
| 175 the input routines and the output routines for better reliability, and |
| 176 added a testsuite. |
| 177 @end itemize |
| 178 |
| 179 @node Motivation, Search Structures, Contributors, Top |
| 180 @chapter Introduction |
| 181 |
| 182 @code{gperf} is a perfect hash function generator written in C++. It |
| 183 transforms an @var{n} element user-specified keyword set @var{W} into a |
| 184 perfect hash function @var{F}. @var{F} uniquely maps keywords in |
| 185 @var{W} onto the range 0..@var{k}, where @var{k} >= @var{n-1}. If @var{k} |
| 186 = @var{n-1} then @var{F} is a @emph{minimal} perfect hash function. |
| 187 @code{gperf} generates a 0..@var{k} element static lookup table and a |
| 188 pair of C functions. These functions determine whether a given |
| 189 character string @var{s} occurs in @var{W}, using at most one probe into |
| 190 the lookup table. |
| 191 |
| 192 @code{gperf} currently generates the reserved keyword recognizer for |
| 193 lexical analyzers in several production and research compilers and |
| 194 language processing tools, including GNU C, GNU C++, GNU Java, GNU Pascal, |
| 195 GNU Modula 3, and GNU indent. Complete C++ source code for @code{gperf} is |
| 196 available from @code{http://ftp.gnu.org/pub/gnu/gperf/}. |
| 197 A paper describing @code{gperf}'s design and implementation in greater |
| 198 detail is available in the Second USENIX C++ Conference proceedings |
| 199 or from @code{http://www.cs.wustl.edu/~schmidt/resume.html}. |
| 200 |
| 201 @node Search Structures, Description, Motivation, Top |
| 202 @chapter Static search structures and GNU @code{gperf} |
| 203 @cindex Static search structure |
| 204 |
| 205 A @dfn{static search structure} is an Abstract Data Type with certain |
| 206 fundamental operations, e.g., @emph{initialize}, @emph{insert}, |
| 207 and @emph{retrieve}. Conceptually, all insertions occur before any |
| 208 retrievals. In practice, @code{gperf} generates a @emph{static} array |
| 209 containing search set keywords and any associated attributes specified |
| 210 by the user. Thus, there is essentially no execution-time cost for the |
| 211 insertions. It is a useful data structure for representing @emph{static |
| 212 search sets}. Static search sets occur frequently in software system |
| 213 applications. Typical static search sets include compiler reserved |
| 214 words, assembler instruction opcodes, and built-in shell interpreter |
| 215 commands. Search set members, called @dfn{keywords}, are inserted into |
| 216 the structure only once, usually during program initialization, and are |
| 217 not generally modified at run-time. |
| 218 |
| 219 Numerous static search structure implementations exist, e.g., |
| 220 arrays, linked lists, binary search trees, digital search tries, and |
| 221 hash tables. Different approaches offer trade-offs between space |
| 222 utilization and search time efficiency. For example, an @var{n} element |
| 223 sorted array is space efficient, though the average-case time |
| 224 complexity for retrieval operations using binary search is |
| 225 proportional to log @var{n}. Conversely, hash table implementations |
| 226 often locate a table entry in constant time, but typically impose |
| 227 additional memory overhead and exhibit poor worst case performance. |
| 228 |
| 229 @cindex Minimal perfect hash functions |
| 230 @emph{Minimal perfect hash functions} provide an optimal solution for a |
| 231 particular class of static search sets. A minimal perfect hash |
| 232 function is defined by two properties: |
| 233 |
| 234 @itemize @bullet |
| 235 @item |
| 236 It allows keyword recognition in a static search set using at most |
| 237 @emph{one} probe into the hash table. This represents the ``perfect'' |
| 238 property. |
| 239 @item |
| 240 The actual memory allocated to store the keywords is precisely large |
| 241 enough for the keyword set, and @emph{no larger}. This is the |
| 242 ``minimal'' property. |
| 243 @end itemize |
| 244 |
| 245 For most applications it is far easier to generate @emph{perfect} hash |
| 246 functions than @emph{minimal perfect} hash functions. Moreover, |
| 247 non-minimal perfect hash functions frequently execute faster than |
| 248 minimal ones in practice. This phenomena occurs since searching a |
| 249 sparse keyword table increases the probability of locating a ``null'' |
| 250 entry, thereby reducing string comparisons. @code{gperf}'s default |
| 251 behavior generates @emph{near-minimal} perfect hash functions for |
| 252 keyword sets. However, @code{gperf} provides many options that permit |
| 253 user control over the degree of minimality and perfection. |
| 254 |
| 255 Static search sets often exhibit relative stability over time. For |
| 256 example, Ada's 63 reserved words have remained constant for nearly a |
| 257 decade. It is therefore frequently worthwhile to expend concerted |
| 258 effort building an optimal search structure @emph{once}, if it |
| 259 subsequently receives heavy use multiple times. @code{gperf} removes |
| 260 the drudgery associated with constructing time- and space-efficient |
| 261 search structures by hand. It has proven a useful and practical tool |
| 262 for serious programming projects. Output from @code{gperf} is currently |
| 263 used in several production and research compilers, including GNU C, GNU |
| 264 C++, GNU Java, GNU Pascal, and GNU Modula 3. The latter two compilers are |
| 265 not yet part of the official GNU distribution. Each compiler utilizes |
| 266 @code{gperf} to automatically generate static search structures that |
| 267 efficiently identify their respective reserved keywords. |
| 268 |
| 269 @node Description, Options, Search Structures, Top |
| 270 @chapter High-Level Description of GNU @code{gperf} |
| 271 |
| 272 @menu |
| 273 * Input Format:: Input Format to @code{gperf} |
| 274 * Output Format:: Output Format for Generated C Code with @code{gp
erf} |
| 275 * Binary Strings:: Use of NUL bytes |
| 276 @end menu |
| 277 |
| 278 The perfect hash function generator @code{gperf} reads a set of |
| 279 ``keywords'' from an input file (or from the standard input by |
| 280 default). It attempts to derive a perfect hashing function that |
| 281 recognizes a member of the @dfn{static keyword set} with at most a |
| 282 single probe into the lookup table. If @code{gperf} succeeds in |
| 283 generating such a function it produces a pair of C source code routines |
| 284 that perform hashing and table lookup recognition. All generated C code |
| 285 is directed to the standard output. Command-line options described |
| 286 below allow you to modify the input and output format to @code{gperf}. |
| 287 |
| 288 By default, @code{gperf} attempts to produce time-efficient code, with |
| 289 less emphasis on efficient space utilization. However, several options |
| 290 exist that permit trading-off execution time for storage space and vice |
| 291 versa. In particular, expanding the generated table size produces a |
| 292 sparse search structure, generally yielding faster searches. |
| 293 Conversely, you can direct @code{gperf} to utilize a C @code{switch} |
| 294 statement scheme that minimizes data space storage size. Furthermore, |
| 295 using a C @code{switch} may actually speed up the keyword retrieval time |
| 296 somewhat. Actual results depend on your C compiler, of course. |
| 297 |
| 298 In general, @code{gperf} assigns values to the bytes it is using |
| 299 for hashing until some set of values gives each keyword a unique value. |
| 300 A helpful heuristic is that the larger the hash value range, the easier |
| 301 it is for @code{gperf} to find and generate a perfect hash function. |
| 302 Experimentation is the key to getting the most from @code{gperf}. |
| 303 |
| 304 @node Input Format, Output Format, Description, Description |
| 305 @section Input Format to @code{gperf} |
| 306 @cindex Format |
| 307 @cindex Declaration section |
| 308 @cindex Keywords section |
| 309 @cindex Functions section |
| 310 You can control the input file format by varying certain command-line |
| 311 arguments, in particular the @samp{-t} option. The input's appearance |
| 312 is similar to GNU utilities @code{flex} and @code{bison} (or UNIX |
| 313 utilities @code{lex} and @code{yacc}). Here's an outline of the general |
| 314 format: |
| 315 |
| 316 @example |
| 317 @group |
| 318 declarations |
| 319 %% |
| 320 keywords |
| 321 %% |
| 322 functions |
| 323 @end group |
| 324 @end example |
| 325 |
| 326 @emph{Unlike} @code{flex} or @code{bison}, the declarations section and |
| 327 the functions section are optional. The following sections describe the |
| 328 input format for each section. |
| 329 |
| 330 @menu |
| 331 * Declarations:: Declarations. |
| 332 * Keywords:: Format for Keyword Entries. |
| 333 * Functions:: Including Additional C Functions. |
| 334 * Controls for GNU indent:: Where to place directives for GNU @code{indent}. |
| 335 @end menu |
| 336 |
| 337 It is possible to omit the declaration section entirely, if the @samp{-t} |
| 338 option is not given. In this case the input file begins directly with the |
| 339 first keyword line, e.g.: |
| 340 |
| 341 @example |
| 342 @group |
| 343 january |
| 344 february |
| 345 march |
| 346 april |
| 347 ... |
| 348 @end group |
| 349 @end example |
| 350 |
| 351 @node Declarations, Keywords, Input Format, Input Format |
| 352 @subsection Declarations |
| 353 |
| 354 The keyword input file optionally contains a section for including |
| 355 arbitrary C declarations and definitions, @code{gperf} declarations that |
| 356 act like command-line options, as well as for providing a user-supplied |
| 357 @code{struct}. |
| 358 |
| 359 @menu |
| 360 * User-supplied Struct:: Specifying keywords with attributes. |
| 361 * Gperf Declarations:: Embedding command line options in the input. |
| 362 * C Code Inclusion:: Including C declarations and definitions. |
| 363 @end menu |
| 364 |
| 365 @node User-supplied Struct, Gperf Declarations, Declarations, Declarations |
| 366 @subsubsection User-supplied @code{struct} |
| 367 |
| 368 If the @samp{-t} option (or, equivalently, the @samp{%struct-type} declaration) |
| 369 @emph{is} enabled, you @emph{must} provide a C @code{struct} as the last |
| 370 component in the declaration section from the input file. The first |
| 371 field in this struct must be of type @code{char *} or @code{const char *} |
| 372 if the @samp{-P} option is not given, or of type @code{int} if the option |
| 373 @samp{-P} (or, equivalently, the @samp{%pic} declaration) is enabled. |
| 374 This first field must be called @samp{name}, although it is possible to modify |
| 375 its name with the @samp{-K} option (or, equivalently, the |
| 376 @samp{%define slot-name} declaration) described below. |
| 377 |
| 378 Here is a simple example, using months of the year and their attributes as |
| 379 input: |
| 380 |
| 381 @example |
| 382 @group |
| 383 struct month @{ char *name; int number; int days; int leap_days; @}; |
| 384 %% |
| 385 january, 1, 31, 31 |
| 386 february, 2, 28, 29 |
| 387 march, 3, 31, 31 |
| 388 april, 4, 30, 30 |
| 389 may, 5, 31, 31 |
| 390 june, 6, 30, 30 |
| 391 july, 7, 31, 31 |
| 392 august, 8, 31, 31 |
| 393 september, 9, 30, 30 |
| 394 october, 10, 31, 31 |
| 395 november, 11, 30, 30 |
| 396 december, 12, 31, 31 |
| 397 @end group |
| 398 @end example |
| 399 |
| 400 @cindex @samp{%%} |
| 401 Separating the @code{struct} declaration from the list of keywords and |
| 402 other fields are a pair of consecutive percent signs, @samp{%%}, |
| 403 appearing left justified in the first column, as in the UNIX utility |
| 404 @code{lex}. |
| 405 |
| 406 If the @code{struct} has already been declared in an include file, it can |
| 407 be mentioned in an abbreviated form, like this: |
| 408 |
| 409 @example |
| 410 @group |
| 411 struct month; |
| 412 %% |
| 413 january, 1, 31, 31 |
| 414 ... |
| 415 @end group |
| 416 @end example |
| 417 |
| 418 @node Gperf Declarations, C Code Inclusion, User-supplied Struct, Declarations |
| 419 @subsubsection Gperf Declarations |
| 420 |
| 421 The declaration section can contain @code{gperf} declarations. They |
| 422 influence the way @code{gperf} works, like command line options do. |
| 423 In fact, every such declaration is equivalent to a command line option. |
| 424 There are three forms of declarations: |
| 425 |
| 426 @enumerate |
| 427 @item |
| 428 Declarations without argument, like @samp{%compare-lengths}. |
| 429 |
| 430 @item |
| 431 Declarations with an argument, like @samp{%switch=@var{count}}. |
| 432 |
| 433 @item |
| 434 Declarations of names of entities in the output file, like |
| 435 @samp{%define lookup-function-name @var{name}}. |
| 436 @end enumerate |
| 437 |
| 438 When a declaration is given both in the input file and as a command line |
| 439 option, the command-line option's value prevails. |
| 440 |
| 441 The following @code{gperf} declarations are available. |
| 442 |
| 443 @table @samp |
| 444 @item %delimiters=@var{delimiter-list} |
| 445 @cindex @samp{%delimiters} |
| 446 Allows you to provide a string containing delimiters used to |
| 447 separate keywords from their attributes. The default is ",". This |
| 448 option is essential if you want to use keywords that have embedded |
| 449 commas or newlines. |
| 450 |
| 451 @item %struct-type |
| 452 @cindex @samp{%struct-type} |
| 453 Allows you to include a @code{struct} type declaration for generated |
| 454 code; see above for an example. |
| 455 |
| 456 @item %ignore-case |
| 457 @cindex @samp{%ignore-case} |
| 458 Consider upper and lower case ASCII characters as equivalent. The string |
| 459 comparison will use a case insignificant character comparison. Note that |
| 460 locale dependent case mappings are ignored. |
| 461 |
| 462 @item %language=@var{language-name} |
| 463 @cindex @samp{%language} |
| 464 Instructs @code{gperf} to generate code in the language specified by the |
| 465 option's argument. Languages handled are currently: |
| 466 |
| 467 @table @samp |
| 468 @item KR-C |
| 469 Old-style K&R C. This language is understood by old-style C compilers and |
| 470 ANSI C compilers, but ANSI C compilers may flag warnings (or even errors) |
| 471 because of lacking @samp{const}. |
| 472 |
| 473 @item C |
| 474 Common C. This language is understood by ANSI C compilers, and also by |
| 475 old-style C compilers, provided that you @code{#define const} to empty |
| 476 for compilers which don't know about this keyword. |
| 477 |
| 478 @item ANSI-C |
| 479 ANSI C. This language is understood by ANSI C compilers and C++ compilers. |
| 480 |
| 481 @item C++ |
| 482 C++. This language is understood by C++ compilers. |
| 483 @end table |
| 484 |
| 485 The default is C. |
| 486 |
| 487 @item %define slot-name @var{name} |
| 488 @cindex @samp{%define slot-name} |
| 489 This declaration is only useful when option @samp{-t} (or, equivalently, the |
| 490 @samp{%struct-type} declaration) has been given. |
| 491 By default, the program assumes the structure component identifier for |
| 492 the keyword is @samp{name}. This option allows an arbitrary choice of |
| 493 identifier for this component, although it still must occur as the first |
| 494 field in your supplied @code{struct}. |
| 495 |
| 496 @item %define initializer-suffix @var{initializers} |
| 497 @cindex @samp{%define initializer-suffix} |
| 498 This declaration is only useful when option @samp{-t} (or, equivalently, the |
| 499 @samp{%struct-type} declaration) has been given. |
| 500 It permits to specify initializers for the structure members following |
| 501 @var{slot-name} in empty hash table entries. The list of initializers |
| 502 should start with a comma. By default, the emitted code will |
| 503 zero-initialize structure members following @var{slot-name}. |
| 504 |
| 505 @item %define hash-function-name @var{name} |
| 506 @cindex @samp{%define hash-function-name} |
| 507 Allows you to specify the name for the generated hash function. Default |
| 508 name is @samp{hash}. This option permits the use of two hash tables in |
| 509 the same file. |
| 510 |
| 511 @item %define lookup-function-name @var{name} |
| 512 @cindex @samp{%define lookup-function-name} |
| 513 Allows you to specify the name for the generated lookup function. |
| 514 Default name is @samp{in_word_set}. This option permits multiple |
| 515 generated hash functions to be used in the same application. |
| 516 |
| 517 @item %define class-name @var{name} |
| 518 @cindex @samp{%define class-name} |
| 519 This option is only useful when option @samp{-L C++} (or, equivalently, |
| 520 the @samp{%language=C++} declaration) has been given. It |
| 521 allows you to specify the name of generated C++ class. Default name is |
| 522 @code{Perfect_Hash}. |
| 523 |
| 524 @item %7bit |
| 525 @cindex @samp{%7bit} |
| 526 This option specifies that all strings that will be passed as arguments |
| 527 to the generated hash function and the generated lookup function will |
| 528 solely consist of 7-bit ASCII characters (bytes in the range 0..127). |
| 529 (Note that the ANSI C functions @code{isalnum} and @code{isgraph} do |
| 530 @emph{not} guarantee that a byte is in this range. Only an explicit |
| 531 test like @samp{c >= 'A' && c <= 'Z'} guarantees this.) |
| 532 |
| 533 @item %compare-lengths |
| 534 @cindex @samp{%compare-lengths} |
| 535 Compare keyword lengths before trying a string comparison. This option |
| 536 is mandatory for binary comparisons (@pxref{Binary Strings}). It also might |
| 537 cut down on the number of string comparisons made during the lookup, since |
| 538 keywords with different lengths are never compared via @code{strcmp}. |
| 539 However, using @samp{%compare-lengths} might greatly increase the size of the |
| 540 generated C code if the lookup table range is large (which implies that |
| 541 the switch option @samp{-S} or @samp{%switch} is not enabled), since the length |
| 542 table contains as many elements as there are entries in the lookup table. |
| 543 |
| 544 @item %compare-strncmp |
| 545 @cindex @samp{%compare-strncmp} |
| 546 Generates C code that uses the @code{strncmp} function to perform |
| 547 string comparisons. The default action is to use @code{strcmp}. |
| 548 |
| 549 @item %readonly-tables |
| 550 @cindex @samp{%readonly-tables} |
| 551 Makes the contents of all generated lookup tables constant, i.e., |
| 552 ``readonly''. Many compilers can generate more efficient code for this |
| 553 by putting the tables in readonly memory. |
| 554 |
| 555 @item %enum |
| 556 @cindex @samp{%enum} |
| 557 Define constant values using an enum local to the lookup function rather |
| 558 than with #defines. This also means that different lookup functions can |
| 559 reside in the same file. Thanks to James Clark @code{<jjc@@ai.mit.edu>}. |
| 560 |
| 561 @item %includes |
| 562 @cindex @samp{%includes} |
| 563 Include the necessary system include file, @code{<string.h>}, at the |
| 564 beginning of the code. By default, this is not done; the user must |
| 565 include this header file himself to allow compilation of the code. |
| 566 |
| 567 @item %global-table |
| 568 @cindex @samp{%global-table} |
| 569 Generate the static table of keywords as a static global variable, |
| 570 rather than hiding it inside of the lookup function (which is the |
| 571 default behavior). |
| 572 |
| 573 @item %pic |
| 574 @cindex @samp{%pic} |
| 575 Optimize the generated table for inclusion in shared libraries. This |
| 576 reduces the startup time of programs using a shared library containing |
| 577 the generated code. If the @samp{%struct-type} declaration (or, |
| 578 equivalently, the option @samp{-t}) is also given, the first field of the |
| 579 user-defined struct must be of type @samp{int}, not @samp{char *}, because |
| 580 it will contain offsets into the string pool instead of actual strings. |
| 581 To convert such an offset to a string, you can use the expression |
| 582 @samp{stringpool + @var{o}}, where @var{o} is the offset. The string pool |
| 583 name can be changed through the @samp{%define string-pool-name} declaration. |
| 584 |
| 585 @item %define string-pool-name @var{name} |
| 586 @cindex @samp{%define string-pool-name} |
| 587 Allows you to specify the name of the generated string pool created by |
| 588 the declaration @samp{%pic} (or, equivalently, the option @samp{-P}). |
| 589 The default name is @samp{stringpool}. This declaration permits the use of |
| 590 two hash tables in the same file, with @samp{%pic} and even when the |
| 591 @samp{%global-table} declaration (or, equivalently, the option @samp{-G}) |
| 592 is given. |
| 593 |
| 594 @item %null-strings |
| 595 @cindex @samp{%null-strings} |
| 596 Use NULL strings instead of empty strings for empty keyword table entries. |
| 597 This reduces the startup time of programs using a shared library containing |
| 598 the generated code (but not as much as the declaration @samp{%pic}), at the |
| 599 expense of one more test-and-branch instruction at run time. |
| 600 |
| 601 @item %define word-array-name @var{name} |
| 602 @cindex @samp{%define word-array-name} |
| 603 Allows you to specify the name for the generated array containing the |
| 604 hash table. Default name is @samp{wordlist}. This option permits the |
| 605 use of two hash tables in the same file, even when the option @samp{-G} |
| 606 (or, equivalently, the @samp{%global-table} declaration) is given. |
| 607 |
| 608 @item %switch=@var{count} |
| 609 @cindex @samp{%switch} |
| 610 Causes the generated C code to use a @code{switch} statement scheme, |
| 611 rather than an array lookup table. This can lead to a reduction in both |
| 612 time and space requirements for some input files. The argument to this |
| 613 option determines how many @code{switch} statements are generated. A |
| 614 value of 1 generates 1 @code{switch} containing all the elements, a |
| 615 value of 2 generates 2 tables with 1/2 the elements in each |
| 616 @code{switch}, etc. This is useful since many C compilers cannot |
| 617 correctly generate code for large @code{switch} statements. This option |
| 618 was inspired in part by Keith Bostic's original C program. |
| 619 |
| 620 @item %omit-struct-type |
| 621 @cindex @samp{%omit-struct-type} |
| 622 Prevents the transfer of the type declaration to the output file. Use |
| 623 this option if the type is already defined elsewhere. |
| 624 @end table |
| 625 |
| 626 @node C Code Inclusion, , Gperf Declarations, Declarations |
| 627 @subsubsection C Code Inclusion |
| 628 |
| 629 @cindex @samp{%@{} |
| 630 @cindex @samp{%@}} |
| 631 Using a syntax similar to GNU utilities @code{flex} and @code{bison}, it |
| 632 is possible to directly include C source text and comments verbatim into |
| 633 the generated output file. This is accomplished by enclosing the region |
| 634 inside left-justified surrounding @samp{%@{}, @samp{%@}} pairs. Here is |
| 635 an input fragment based on the previous example that illustrates this |
| 636 feature: |
| 637 |
| 638 @example |
| 639 @group |
| 640 %@{ |
| 641 #include <assert.h> |
| 642 /* This section of code is inserted directly into the output. */ |
| 643 int return_month_days (struct month *months, int is_leap_year); |
| 644 %@} |
| 645 struct month @{ char *name; int number; int days; int leap_days; @}; |
| 646 %% |
| 647 january, 1, 31, 31 |
| 648 february, 2, 28, 29 |
| 649 march, 3, 31, 31 |
| 650 ... |
| 651 @end group |
| 652 @end example |
| 653 |
| 654 @node Keywords, Functions, Declarations, Input Format |
| 655 @subsection Format for Keyword Entries |
| 656 |
| 657 The second input file format section contains lines of keywords and any |
| 658 associated attributes you might supply. A line beginning with @samp{#} |
| 659 in the first column is considered a comment. Everything following the |
| 660 @samp{#} is ignored, up to and including the following newline. A line |
| 661 beginning with @samp{%} in the first column is an option declaration and |
| 662 must not occur within the keywords section. |
| 663 |
| 664 The first field of each non-comment line is always the keyword itself. It |
| 665 can be given in two ways: as a simple name, i.e., without surrounding |
| 666 string quotation marks, or as a string enclosed in double-quotes, in |
| 667 C syntax, possibly with backslash escapes like @code{\"} or @code{\234} |
| 668 or @code{\xa8}. In either case, it must start right at the beginning |
| 669 of the line, without leading whitespace. |
| 670 In this context, a ``field'' is considered to extend up to, but |
| 671 not include, the first blank, comma, or newline. Here is a simple |
| 672 example taken from a partial list of C reserved words: |
| 673 |
| 674 @example |
| 675 @group |
| 676 # These are a few C reserved words, see the c.gperf file |
| 677 # for a complete list of ANSI C reserved words. |
| 678 unsigned |
| 679 sizeof |
| 680 switch |
| 681 signed |
| 682 if |
| 683 default |
| 684 for |
| 685 while |
| 686 return |
| 687 @end group |
| 688 @end example |
| 689 |
| 690 Note that unlike @code{flex} or @code{bison} the first @samp{%%} marker |
| 691 may be elided if the declaration section is empty. |
| 692 |
| 693 Additional fields may optionally follow the leading keyword. Fields |
| 694 should be separated by commas, and terminate at the end of line. What |
| 695 these fields mean is entirely up to you; they are used to initialize the |
| 696 elements of the user-defined @code{struct} provided by you in the |
| 697 declaration section. If the @samp{-t} option (or, equivalently, the |
| 698 @samp{%struct-type} declaration) is @emph{not} enabled |
| 699 these fields are simply ignored. All previous examples except the last |
| 700 one contain keyword attributes. |
| 701 |
| 702 @node Functions, Controls for GNU indent, Keywords, Input Format |
| 703 @subsection Including Additional C Functions |
| 704 |
| 705 The optional third section also corresponds closely with conventions |
| 706 found in @code{flex} and @code{bison}. All text in this section, |
| 707 starting at the final @samp{%%} and extending to the end of the input |
| 708 file, is included verbatim into the generated output file. Naturally, |
| 709 it is your responsibility to ensure that the code contained in this |
| 710 section is valid C. |
| 711 |
| 712 @node Controls for GNU indent, , Functions, Input Format |
| 713 @subsection Where to place directives for GNU @code{indent}. |
| 714 |
| 715 If you want to invoke GNU @code{indent} on a @code{gperf} input file, |
| 716 you will see that GNU @code{indent} doesn't understand the @samp{%%}, |
| 717 @samp{%@{} and @samp{%@}} directives that control @code{gperf}'s |
| 718 interpretation of the input file. Therefore you have to insert some |
| 719 directives for GNU @code{indent}. More precisely, assuming the most |
| 720 general input file structure |
| 721 |
| 722 @example |
| 723 @group |
| 724 declarations part 1 |
| 725 %@{ |
| 726 verbatim code |
| 727 %@} |
| 728 declarations part 2 |
| 729 %% |
| 730 keywords |
| 731 %% |
| 732 functions |
| 733 @end group |
| 734 @end example |
| 735 |
| 736 @noindent |
| 737 you would insert @samp{*INDENT-OFF*} and @samp{*INDENT-ON*} comments |
| 738 as follows: |
| 739 |
| 740 @example |
| 741 @group |
| 742 /* *INDENT-OFF* */ |
| 743 declarations part 1 |
| 744 %@{ |
| 745 /* *INDENT-ON* */ |
| 746 verbatim code |
| 747 /* *INDENT-OFF* */ |
| 748 %@} |
| 749 declarations part 2 |
| 750 %% |
| 751 keywords |
| 752 %% |
| 753 /* *INDENT-ON* */ |
| 754 functions |
| 755 @end group |
| 756 @end example |
| 757 |
| 758 @node Output Format, Binary Strings, Input Format, Description |
| 759 @section Output Format for Generated C Code with @code{gperf} |
| 760 @cindex hash table |
| 761 |
| 762 Several options control how the generated C code appears on the standard |
| 763 output. Two C function are generated. They are called @code{hash} and |
| 764 @code{in_word_set}, although you may modify their names with a command-line |
| 765 option. Both functions require two arguments, a string, @code{char *} |
| 766 @var{str}, and a length parameter, @code{int} @var{len}. Their default |
| 767 function prototypes are as follows: |
| 768 |
| 769 @deftypefun {unsigned int} hash (const char * @var{str}, unsigned int @var{len}) |
| 770 By default, the generated @code{hash} function returns an integer value |
| 771 created by adding @var{len} to several user-specified @var{str} byte |
| 772 positions indexed into an @dfn{associated values} table stored in a |
| 773 local static array. The associated values table is constructed |
| 774 internally by @code{gperf} and later output as a static local C array |
| 775 called @samp{hash_table}. The relevant selected positions (i.e. indices |
| 776 into @var{str}) are specified via the @samp{-k} option when running |
| 777 @code{gperf}, as detailed in the @emph{Options} section below (@pxref{Options}). |
| 778 @end deftypefun |
| 779 |
| 780 @deftypefun {} in_word_set (const char * @var{str}, unsigned int @var{len}) |
| 781 If @var{str} is in the keyword set, returns a pointer to that |
| 782 keyword. More exactly, if the option @samp{-t} (or, equivalently, the |
| 783 @samp{%struct-type} declaration) was given, it returns |
| 784 a pointer to the matching keyword's structure. Otherwise it returns |
| 785 @code{NULL}. |
| 786 @end deftypefun |
| 787 |
| 788 If the option @samp{-c} (or, equivalently, the @samp{%compare-strncmp} |
| 789 declaration) is not used, @var{str} must be a NUL terminated |
| 790 string of exactly length @var{len}. If @samp{-c} (or, equivalently, the |
| 791 @samp{%compare-strncmp} declaration) is used, @var{str} must |
| 792 simply be an array of @var{len} bytes and does not need to be NUL |
| 793 terminated. |
| 794 |
| 795 The code generated for these two functions is affected by the following |
| 796 options: |
| 797 |
| 798 @table @samp |
| 799 @item -t |
| 800 @itemx --struct-type |
| 801 Make use of the user-defined @code{struct}. |
| 802 |
| 803 @item -S @var{total-switch-statements} |
| 804 @itemx --switch=@var{total-switch-statements} |
| 805 @cindex @code{switch} |
| 806 Generate 1 or more C @code{switch} statement rather than use a large, |
| 807 (and potentially sparse) static array. Although the exact time and |
| 808 space savings of this approach vary according to your C compiler's |
| 809 degree of optimization, this method often results in smaller and faster |
| 810 code. |
| 811 @end table |
| 812 |
| 813 If the @samp{-t} and @samp{-S} options (or, equivalently, the |
| 814 @samp{%struct-type} and @samp{%switch} declarations) are omitted, the default |
| 815 action |
| 816 is to generate a @code{char *} array containing the keywords, together with |
| 817 additional empty strings used for padding the array. By experimenting |
| 818 with the various input and output options, and timing the resulting C |
| 819 code, you can determine the best option choices for different keyword |
| 820 set characteristics. |
| 821 |
| 822 @node Binary Strings, , Output Format, Description |
| 823 @section Use of NUL bytes |
| 824 @cindex NUL |
| 825 |
| 826 By default, the code generated by @code{gperf} operates on zero |
| 827 terminated strings, the usual representation of strings in C. This means |
| 828 that the keywords in the input file must not contain NUL bytes, |
| 829 and the @var{str} argument passed to @code{hash} or @code{in_word_set} |
| 830 must be NUL terminated and have exactly length @var{len}. |
| 831 |
| 832 If option @samp{-c} (or, equivalently, the @samp{%compare-strncmp} |
| 833 declaration) is used, then the @var{str} argument does not need |
| 834 to be NUL terminated. The code generated by @code{gperf} will only |
| 835 access the first @var{len}, not @var{len+1}, bytes starting at @var{str}. |
| 836 However, the keywords in the input file still must not contain NUL |
| 837 bytes. |
| 838 |
| 839 If option @samp{-l} (or, equivalently, the @samp{%compare-lengths} |
| 840 declaration) is used, then the hash table performs binary |
| 841 comparison. The keywords in the input file may contain NUL bytes, |
| 842 written in string syntax as @code{\000} or @code{\x00}, and the code |
| 843 generated by @code{gperf} will treat NUL like any other byte. |
| 844 Also, in this case the @samp{-c} option (or, equivalently, the |
| 845 @samp{%compare-strncmp} declaration) is ignored. |
| 846 |
| 847 @node Options, Bugs, Description, Top |
| 848 @chapter Invoking @code{gperf} |
| 849 |
| 850 There are @emph{many} options to @code{gperf}. They were added to make |
| 851 the program more convenient for use with real applications. ``On-line'' |
| 852 help is readily available via the @samp{--help} option. Here is the |
| 853 complete list of options. |
| 854 |
| 855 @menu |
| 856 * Output File:: Specifying the Location of the Output File |
| 857 * Input Details:: Options that affect Interpretation of the Input
File |
| 858 * Output Language:: Specifying the Language for the Output Code |
| 859 * Output Details:: Fine tuning Details in the Output Code |
| 860 * Algorithmic Details:: Changing the Algorithms employed by @code{gperf} |
| 861 * Verbosity:: Informative Output |
| 862 @end menu |
| 863 |
| 864 @node Output File, Input Details, Options, Options |
| 865 @section Specifying the Location of the Output File |
| 866 |
| 867 @table @samp |
| 868 @item --output-file=@var{file} |
| 869 Allows you to specify the name of the file to which the output is written to. |
| 870 @end table |
| 871 |
| 872 The results are written to standard output if no output file is specified |
| 873 or if it is @samp{-}. |
| 874 |
| 875 @node Input Details, Output Language, Output File, Options |
| 876 @section Options that affect Interpretation of the Input File |
| 877 |
| 878 These options are also available as declarations in the input file |
| 879 (@pxref{Gperf Declarations}). |
| 880 |
| 881 @table @samp |
| 882 @item -e @var{keyword-delimiter-list} |
| 883 @itemx --delimiters=@var{keyword-delimiter-list} |
| 884 @cindex Delimiters |
| 885 Allows you to provide a string containing delimiters used to |
| 886 separate keywords from their attributes. The default is ",". This |
| 887 option is essential if you want to use keywords that have embedded |
| 888 commas or newlines. One useful trick is to use -e'TAB', where TAB is |
| 889 the literal tab character. |
| 890 |
| 891 @item -t |
| 892 @itemx --struct-type |
| 893 Allows you to include a @code{struct} type declaration for generated |
| 894 code. Any text before a pair of consecutive @samp{%%} is considered |
| 895 part of the type declaration. Keywords and additional fields may follow |
| 896 this, one group of fields per line. A set of examples for generating |
| 897 perfect hash tables and functions for Ada, C, C++, Pascal, Modula 2, |
| 898 Modula 3 and JavaScript reserved words are distributed with this release. |
| 899 |
| 900 @item --ignore-case |
| 901 Consider upper and lower case ASCII characters as equivalent. The string |
| 902 comparison will use a case insignificant character comparison. Note that |
| 903 locale dependent case mappings are ignored. This option is therefore not |
| 904 suitable if a properly internationalized or locale aware case mapping |
| 905 should be used. (For example, in a Turkish locale, the upper case equivalent |
| 906 of the lowercase ASCII letter @samp{i} is the non-ASCII character |
| 907 @samp{capital i with dot above}.) For this case, it is better to apply |
| 908 an uppercase or lowercase conversion on the string before passing it to |
| 909 the @code{gperf} generated function. |
| 910 @end table |
| 911 |
| 912 @node Output Language, Output Details, Input Details, Options |
| 913 @section Options to specify the Language for the Output Code |
| 914 |
| 915 These options are also available as declarations in the input file |
| 916 (@pxref{Gperf Declarations}). |
| 917 |
| 918 @table @samp |
| 919 @item -L @var{generated-language-name} |
| 920 @itemx --language=@var{generated-language-name} |
| 921 Instructs @code{gperf} to generate code in the language specified by the |
| 922 option's argument. Languages handled are currently: |
| 923 |
| 924 @table @samp |
| 925 @item KR-C |
| 926 Old-style K&R C. This language is understood by old-style C compilers and |
| 927 ANSI C compilers, but ANSI C compilers may flag warnings (or even errors) |
| 928 because of lacking @samp{const}. |
| 929 |
| 930 @item C |
| 931 Common C. This language is understood by ANSI C compilers, and also by |
| 932 old-style C compilers, provided that you @code{#define const} to empty |
| 933 for compilers which don't know about this keyword. |
| 934 |
| 935 @item ANSI-C |
| 936 ANSI C. This language is understood by ANSI C compilers and C++ compilers. |
| 937 |
| 938 @item C++ |
| 939 C++. This language is understood by C++ compilers. |
| 940 @end table |
| 941 |
| 942 The default is C. |
| 943 |
| 944 @item -a |
| 945 This option is supported for compatibility with previous releases of |
| 946 @code{gperf}. It does not do anything. |
| 947 |
| 948 @item -g |
| 949 This option is supported for compatibility with previous releases of |
| 950 @code{gperf}. It does not do anything. |
| 951 @end table |
| 952 |
| 953 @node Output Details, Algorithmic Details, Output Language, Options |
| 954 @section Options for fine tuning Details in the Output Code |
| 955 |
| 956 Most of these options are also available as declarations in the input file |
| 957 (@pxref{Gperf Declarations}). |
| 958 |
| 959 @table @samp |
| 960 @item -K @var{slot-name} |
| 961 @itemx --slot-name=@var{slot-name} |
| 962 @cindex Slot name |
| 963 This option is only useful when option @samp{-t} (or, equivalently, the |
| 964 @samp{%struct-type} declaration) has been given. |
| 965 By default, the program assumes the structure component identifier for |
| 966 the keyword is @samp{name}. This option allows an arbitrary choice of |
| 967 identifier for this component, although it still must occur as the first |
| 968 field in your supplied @code{struct}. |
| 969 |
| 970 @item -F @var{initializers} |
| 971 @itemx --initializer-suffix=@var{initializers} |
| 972 @cindex Initializers |
| 973 This option is only useful when option @samp{-t} (or, equivalently, the |
| 974 @samp{%struct-type} declaration) has been given. |
| 975 It permits to specify initializers for the structure members following |
| 976 @var{slot-name} in empty hash table entries. The list of initializers |
| 977 should start with a comma. By default, the emitted code will |
| 978 zero-initialize structure members following @var{slot-name}. |
| 979 |
| 980 @item -H @var{hash-function-name} |
| 981 @itemx --hash-function-name=@var{hash-function-name} |
| 982 Allows you to specify the name for the generated hash function. Default |
| 983 name is @samp{hash}. This option permits the use of two hash tables in |
| 984 the same file. |
| 985 |
| 986 @item -N @var{lookup-function-name} |
| 987 @itemx --lookup-function-name=@var{lookup-function-name} |
| 988 Allows you to specify the name for the generated lookup function. |
| 989 Default name is @samp{in_word_set}. This option permits multiple |
| 990 generated hash functions to be used in the same application. |
| 991 |
| 992 @item -Z @var{class-name} |
| 993 @itemx --class-name=@var{class-name} |
| 994 @cindex Class name |
| 995 This option is only useful when option @samp{-L C++} (or, equivalently, |
| 996 the @samp{%language=C++} declaration) has been given. It |
| 997 allows you to specify the name of generated C++ class. Default name is |
| 998 @code{Perfect_Hash}. |
| 999 |
| 1000 @item -7 |
| 1001 @itemx --seven-bit |
| 1002 This option specifies that all strings that will be passed as arguments |
| 1003 to the generated hash function and the generated lookup function will |
| 1004 solely consist of 7-bit ASCII characters (bytes in the range 0..127). |
| 1005 (Note that the ANSI C functions @code{isalnum} and @code{isgraph} do |
| 1006 @emph{not} guarantee that a byte is in this range. Only an explicit |
| 1007 test like @samp{c >= 'A' && c <= 'Z'} guarantees this.) This was the |
| 1008 default in versions of @code{gperf} earlier than 2.7; now the default is |
| 1009 to support 8-bit and multibyte characters. |
| 1010 |
| 1011 @item -l |
| 1012 @itemx --compare-lengths |
| 1013 Compare keyword lengths before trying a string comparison. This option |
| 1014 is mandatory for binary comparisons (@pxref{Binary Strings}). It also might |
| 1015 cut down on the number of string comparisons made during the lookup, since |
| 1016 keywords with different lengths are never compared via @code{strcmp}. |
| 1017 However, using @samp{-l} might greatly increase the size of the |
| 1018 generated C code if the lookup table range is large (which implies that |
| 1019 the switch option @samp{-S} or @samp{%switch} is not enabled), since the length |
| 1020 table contains as many elements as there are entries in the lookup table. |
| 1021 |
| 1022 @item -c |
| 1023 @itemx --compare-strncmp |
| 1024 Generates C code that uses the @code{strncmp} function to perform |
| 1025 string comparisons. The default action is to use @code{strcmp}. |
| 1026 |
| 1027 @item -C |
| 1028 @itemx --readonly-tables |
| 1029 Makes the contents of all generated lookup tables constant, i.e., |
| 1030 ``readonly''. Many compilers can generate more efficient code for this |
| 1031 by putting the tables in readonly memory. |
| 1032 |
| 1033 @item -E |
| 1034 @itemx --enum |
| 1035 Define constant values using an enum local to the lookup function rather |
| 1036 than with #defines. This also means that different lookup functions can |
| 1037 reside in the same file. Thanks to James Clark @code{<jjc@@ai.mit.edu>}. |
| 1038 |
| 1039 @item -I |
| 1040 @itemx --includes |
| 1041 Include the necessary system include file, @code{<string.h>}, at the |
| 1042 beginning of the code. By default, this is not done; the user must |
| 1043 include this header file himself to allow compilation of the code. |
| 1044 |
| 1045 @item -G |
| 1046 @itemx --global-table |
| 1047 Generate the static table of keywords as a static global variable, |
| 1048 rather than hiding it inside of the lookup function (which is the |
| 1049 default behavior). |
| 1050 |
| 1051 @item -P |
| 1052 @itemx --pic |
| 1053 Optimize the generated table for inclusion in shared libraries. This |
| 1054 reduces the startup time of programs using a shared library containing |
| 1055 the generated code. If the option @samp{-t} (or, equivalently, the |
| 1056 @samp{%struct-type} declaration) is also given, the first field of the |
| 1057 user-defined struct must be of type @samp{int}, not @samp{char *}, because |
| 1058 it will contain offsets into the string pool instead of actual strings. |
| 1059 To convert such an offset to a string, you can use the expression |
| 1060 @samp{stringpool + @var{o}}, where @var{o} is the offset. The string pool |
| 1061 name can be changed through the option @samp{--string-pool-name}. |
| 1062 |
| 1063 @item -Q @var{string-pool-name} |
| 1064 @itemx --string-pool-name=@var{string-pool-name} |
| 1065 Allows you to specify the name of the generated string pool created by |
| 1066 option @samp{-P}. The default name is @samp{stringpool}. This option |
| 1067 permits the use of two hash tables in the same file, with @samp{-P} and |
| 1068 even when the option @samp{-G} (or, equivalently, the @samp{%global-table} |
| 1069 declaration) is given. |
| 1070 |
| 1071 @item --null-strings |
| 1072 Use NULL strings instead of empty strings for empty keyword table entries. |
| 1073 This reduces the startup time of programs using a shared library containing |
| 1074 the generated code (but not as much as option @samp{-P}), at the expense |
| 1075 of one more test-and-branch instruction at run time. |
| 1076 |
| 1077 @item -W @var{hash-table-array-name} |
| 1078 @itemx --word-array-name=@var{hash-table-array-name} |
| 1079 @cindex Array name |
| 1080 Allows you to specify the name for the generated array containing the |
| 1081 hash table. Default name is @samp{wordlist}. This option permits the |
| 1082 use of two hash tables in the same file, even when the option @samp{-G} |
| 1083 (or, equivalently, the @samp{%global-table} declaration) is given. |
| 1084 |
| 1085 @item -S @var{total-switch-statements} |
| 1086 @itemx --switch=@var{total-switch-statements} |
| 1087 @cindex @code{switch} |
| 1088 Causes the generated C code to use a @code{switch} statement scheme, |
| 1089 rather than an array lookup table. This can lead to a reduction in both |
| 1090 time and space requirements for some input files. The argument to this |
| 1091 option determines how many @code{switch} statements are generated. A |
| 1092 value of 1 generates 1 @code{switch} containing all the elements, a |
| 1093 value of 2 generates 2 tables with 1/2 the elements in each |
| 1094 @code{switch}, etc. This is useful since many C compilers cannot |
| 1095 correctly generate code for large @code{switch} statements. This option |
| 1096 was inspired in part by Keith Bostic's original C program. |
| 1097 |
| 1098 @item -T |
| 1099 @itemx --omit-struct-type |
| 1100 Prevents the transfer of the type declaration to the output file. Use |
| 1101 this option if the type is already defined elsewhere. |
| 1102 |
| 1103 @item -p |
| 1104 This option is supported for compatibility with previous releases of |
| 1105 @code{gperf}. It does not do anything. |
| 1106 @end table |
| 1107 |
| 1108 @node Algorithmic Details, Verbosity, Output Details, Options |
| 1109 @section Options for changing the Algorithms employed by @code{gperf} |
| 1110 |
| 1111 @table @samp |
| 1112 @item -k @var{selected-byte-positions} |
| 1113 @itemx --key-positions=@var{selected-byte-positions} |
| 1114 Allows selection of the byte positions used in the keywords' |
| 1115 hash function. The allowable choices range between 1-255, inclusive. |
| 1116 The positions are separated by commas, e.g., @samp{-k 9,4,13,14}; |
| 1117 ranges may be used, e.g., @samp{-k 2-7}; and positions may occur |
| 1118 in any order. Furthermore, the wildcard '*' causes the generated |
| 1119 hash function to consider @strong{all} byte positions in each keyword, |
| 1120 whereas '$' instructs the hash function to use the ``final byte'' |
| 1121 of a keyword (this is the only way to use a byte position greater than |
| 1122 255, incidentally). |
| 1123 |
| 1124 For instance, the option @samp{-k 1,2,4,6-10,'$'} generates a hash |
| 1125 function that considers positions 1,2,4,6,7,8,9,10, plus the last |
| 1126 byte in each keyword (which may be at a different position for each |
| 1127 keyword, obviously). Keywords |
| 1128 with length less than the indicated byte positions work properly, since |
| 1129 selected byte positions exceeding the keyword length are simply not |
| 1130 referenced in the hash function. |
| 1131 |
| 1132 This option is not normally needed since version 2.8 of @code{gperf}; |
| 1133 the default byte positions are computed depending on the keyword set, |
| 1134 through a search that minimizes the number of byte positions. |
| 1135 |
| 1136 @item -D |
| 1137 @itemx --duplicates |
| 1138 @cindex Duplicates |
| 1139 Handle keywords whose selected byte sets hash to duplicate values. |
| 1140 Duplicate hash values can occur if a set of keywords has the same names, but |
| 1141 possesses different attributes, or if the selected byte positions are not well |
| 1142 chosen. With the -D option @code{gperf} treats all these keywords as |
| 1143 part of an equivalence class and generates a perfect hash function with |
| 1144 multiple comparisons for duplicate keywords. It is up to you to completely |
| 1145 disambiguate the keywords by modifying the generated C code. However, |
| 1146 @code{gperf} helps you out by organizing the output. |
| 1147 |
| 1148 Using this option usually means that the generated hash function is no |
| 1149 longer perfect. On the other hand, it permits @code{gperf} to work on |
| 1150 keyword sets that it otherwise could not handle. |
| 1151 |
| 1152 @item -m @var{iterations} |
| 1153 @itemx --multiple-iterations=@var{iterations} |
| 1154 Perform multiple choices of the @samp{-i} and @samp{-j} values, and |
| 1155 choose the best results. This increases the running time by a factor of |
| 1156 @var{iterations} but does a good job minimizing the generated table size. |
| 1157 |
| 1158 @item -i @var{initial-value} |
| 1159 @itemx --initial-asso=@var{initial-value} |
| 1160 Provides an initial @var{value} for the associate values array. Default |
| 1161 is 0. Increasing the initial value helps inflate the final table size, |
| 1162 possibly leading to more time efficient keyword lookups. Note that this |
| 1163 option is not particularly useful when @samp{-S} (or, equivalently, |
| 1164 @samp{%switch}) is used. Also, |
| 1165 @samp{-i} is overridden when the @samp{-r} option is used. |
| 1166 |
| 1167 @item -j @var{jump-value} |
| 1168 @itemx --jump=@var{jump-value} |
| 1169 @cindex Jump value |
| 1170 Affects the ``jump value'', i.e., how far to advance the associated |
| 1171 byte value upon collisions. @var{Jump-value} is rounded up to an |
| 1172 odd number, the default is 5. If the @var{jump-value} is 0 @code{gperf} |
| 1173 jumps by random amounts. |
| 1174 |
| 1175 @item -n |
| 1176 @itemx --no-strlen |
| 1177 Instructs the generator not to include the length of a keyword when |
| 1178 computing its hash value. This may save a few assembly instructions in |
| 1179 the generated lookup table. |
| 1180 |
| 1181 @item -r |
| 1182 @itemx --random |
| 1183 Utilizes randomness to initialize the associated values table. This |
| 1184 frequently generates solutions faster than using deterministic |
| 1185 initialization (which starts all associated values at 0). Furthermore, |
| 1186 using the randomization option generally increases the size of the |
| 1187 table. |
| 1188 |
| 1189 @item -s @var{size-multiple} |
| 1190 @itemx --size-multiple=@var{size-multiple} |
| 1191 Affects the size of the generated hash table. The numeric argument for |
| 1192 this option indicates ``how many times larger or smaller'' the maximum |
| 1193 associated value range should be, in relationship to the number of keywords. |
| 1194 It can be written as an integer, a floating-point number or a fraction. |
| 1195 For example, a value of 3 means ``allow the maximum associated value to be |
| 1196 about 3 times larger than the number of input keywords''. |
| 1197 Conversely, a value of 1/3 means ``allow the maximum associated value to |
| 1198 be about 3 times smaller than the number of input keywords''. Values |
| 1199 smaller than 1 are useful for limiting the overall size of the generated hash |
| 1200 table, though the option @samp{-m} is better at this purpose. |
| 1201 |
| 1202 If `generate switch' option @samp{-S} (or, equivalently, @samp{%switch}) is |
| 1203 @emph{not} enabled, the maximum |
| 1204 associated value influences the static array table size, and a larger |
| 1205 table should decrease the time required for an unsuccessful search, at |
| 1206 the expense of extra table space. |
| 1207 |
| 1208 The default value is 1, thus the default maximum associated value about |
| 1209 the same size as the number of keywords (for efficiency, the maximum |
| 1210 associated value is always rounded up to a power of 2). The actual |
| 1211 table size may vary somewhat, since this technique is essentially a |
| 1212 heuristic. |
| 1213 @end table |
| 1214 |
| 1215 @node Verbosity, , Algorithmic Details, Options |
| 1216 @section Informative Output |
| 1217 |
| 1218 @table @samp |
| 1219 @item -h |
| 1220 @itemx --help |
| 1221 Prints a short summary on the meaning of each program option. Aborts |
| 1222 further program execution. |
| 1223 |
| 1224 @item -v |
| 1225 @itemx --version |
| 1226 Prints out the current version number. |
| 1227 |
| 1228 @item -d |
| 1229 @itemx --debug |
| 1230 Enables the debugging option. This produces verbose diagnostics to |
| 1231 ``standard error'' when @code{gperf} is executing. It is useful both for |
| 1232 maintaining the program and for determining whether a given set of |
| 1233 options is actually speeding up the search for a solution. Some useful |
| 1234 information is dumped at the end of the program when the @samp{-d} |
| 1235 option is enabled. |
| 1236 @end table |
| 1237 |
| 1238 @node Bugs, Projects, Options, Top |
| 1239 @chapter Known Bugs and Limitations with @code{gperf} |
| 1240 |
| 1241 The following are some limitations with the current release of |
| 1242 @code{gperf}: |
| 1243 |
| 1244 @itemize @bullet |
| 1245 @item |
| 1246 The @code{gperf} utility is tuned to execute quickly, and works quickly |
| 1247 for small to medium size data sets (around 1000 keywords). It is |
| 1248 extremely useful for maintaining perfect hash functions for compiler |
| 1249 keyword sets. Several recent enhancements now enable @code{gperf} to |
| 1250 work efficiently on much larger keyword sets (over 15,000 keywords). |
| 1251 When processing large keyword sets it helps greatly to have over 8 megs |
| 1252 of RAM. |
| 1253 |
| 1254 @item |
| 1255 The size of the generate static keyword array can get @emph{extremely} |
| 1256 large if the input keyword file is large or if the keywords are quite |
| 1257 similar. This tends to slow down the compilation of the generated C |
| 1258 code, and @emph{greatly} inflates the object code size. If this |
| 1259 situation occurs, consider using the @samp{-S} option to reduce data |
| 1260 size, potentially increasing keyword recognition time a negligible |
| 1261 amount. Since many C compilers cannot correctly generate code for |
| 1262 large switch statements it is important to qualify the @var{-S} option |
| 1263 with an appropriate numerical argument that controls the number of |
| 1264 switch statements generated. |
| 1265 |
| 1266 @item |
| 1267 The maximum number of selected byte positions has an |
| 1268 arbitrary limit of 255. This restriction should be removed, and if |
| 1269 anyone considers this a problem write me and let me know so I can remove |
| 1270 the constraint. |
| 1271 @end itemize |
| 1272 |
| 1273 @node Projects, Bibliography, Bugs, Top |
| 1274 @chapter Things Still Left to Do |
| 1275 |
| 1276 It should be ``relatively'' easy to replace the current perfect hash |
| 1277 function algorithm with a more exhaustive approach; the perfect hash |
| 1278 module is essential independent from other program modules. Additional |
| 1279 worthwhile improvements include: |
| 1280 |
| 1281 @itemize @bullet |
| 1282 @item |
| 1283 Another useful extension involves modifying the program to generate |
| 1284 ``minimal'' perfect hash functions (under certain circumstances, the |
| 1285 current version can be rather extravagant in the generated table size). |
| 1286 This is mostly of theoretical interest, since a sparse table |
| 1287 often produces faster lookups, and use of the @samp{-S} @code{switch} |
| 1288 option can minimize the data size, at the expense of slightly longer |
| 1289 lookups (note that the gcc compiler generally produces good code for |
| 1290 @code{switch} statements, reducing the need for more complex schemes). |
| 1291 |
| 1292 @item |
| 1293 In addition to improving the algorithm, it would also be useful to |
| 1294 generate an Ada package as the code output, in addition to the current |
| 1295 C and C++ routines. |
| 1296 @end itemize |
| 1297 |
| 1298 @page |
| 1299 |
| 1300 @node Bibliography, Concept Index, Projects, Top |
| 1301 @chapter Bibliography |
| 1302 |
| 1303 [1] Chang, C.C.: @i{A Scheme for Constructing Ordered Minimal Perfect |
| 1304 Hashing Functions} Information Sciences 39(1986), 187-195. |
| 1305 |
| 1306 [2] Cichelli, Richard J. @i{Author's Response to ``On Cichelli's Minimal Perfect
Hash |
| 1307 Functions Method''} Communications of the ACM, 23, 12(December 1980), 729. |
| 1308 |
| 1309 [3] Cichelli, Richard J. @i{Minimal Perfect Hash Functions Made Simple} |
| 1310 Communications of the ACM, 23, 1(January 1980), 17-19. |
| 1311 |
| 1312 [4] Cook, C. R. and Oldehoeft, R.R. @i{A Letter Oriented Minimal |
| 1313 Perfect Hashing Function} SIGPLAN Notices, 17, 9(September 1982), 18-27. |
| 1314 |
| 1315 [5] Cormack, G. V. and Horspool, R. N. S. and Kaiserwerth, M. |
| 1316 @i{Practical Perfect Hashing} Computer Journal, 28, 1(January 1985), 54-58. |
| 1317 |
| 1318 [6] Jaeschke, G. @i{Reciprocal Hashing: A Method for Generating Minimal |
| 1319 Perfect Hashing Functions} Communications of the ACM, 24, 12(December |
| 1320 1981), 829-833. |
| 1321 |
| 1322 [7] Jaeschke, G. and Osterburg, G. @i{On Cichelli's Minimal Perfect |
| 1323 Hash Functions Method} Communications of the ACM, 23, 12(December 1980), |
| 1324 728-729. |
| 1325 |
| 1326 [8] Sager, Thomas J. @i{A Polynomial Time Generator for Minimal Perfect |
| 1327 Hash Functions} Communications of the ACM, 28, 5(December 1985), 523-532 |
| 1328 |
| 1329 [9] Schmidt, Douglas C. @i{GPERF: A Perfect Hash Function Generator} |
| 1330 Second USENIX C++ Conference Proceedings, April 1990. |
| 1331 |
| 1332 [10] Schmidt, Douglas C. @i{GPERF: A Perfect Hash Function Generator} |
| 1333 C++ Report, SIGS 10 10 (November/December 1998). |
| 1334 |
| 1335 [11] Sebesta, R.W. and Taylor, M.A. @i{Minimal Perfect Hash Functions |
| 1336 for Reserved Word Lists} SIGPLAN Notices, 20, 12(September 1985), 47-53. |
| 1337 |
| 1338 [12] Sprugnoli, R. @i{Perfect Hashing Functions: A Single Probe |
| 1339 Retrieving Method for Static Sets} Communications of the ACM, 20 |
| 1340 11(November 1977), 841-850. |
| 1341 |
| 1342 [13] Stallman, Richard M. @i{Using and Porting GNU CC} Free Software Foundation, |
| 1343 1988. |
| 1344 |
| 1345 [14] Stroustrup, Bjarne @i{The C++ Programming Language.} Addison-Wesley, 1986. |
| 1346 |
| 1347 [15] Tiemann, Michael D. @i{User's Guide to GNU C++} Free Software |
| 1348 Foundation, 1989. |
| 1349 |
| 1350 @node Concept Index, , Bibliography, Top |
| 1351 @unnumbered Concept Index |
| 1352 |
| 1353 @printindex cp |
| 1354 |
| 1355 @contents |
| 1356 @bye |
OLD | NEW |