Org\Heigl\Hyphenator is a package that enables word hyphenation in PHP. It uses
the algorithm described by Franklin Mark Liang in his thesis Word Hy-phen-a-tion
by Com-put-er and the extensions described by László Németh in his work
Automatic non-standard hyphenation in OpenOffice.org.
These algorithms are based on matching words against patterns that describe places inside a word where hyphenation is possible or must not occur. This Hyphenator uses the pattern files from OpenOffice, which are based on the pattern files created for TeX.
Theory of operation
Only words can be hyphenated, and the beginning and the end of a word
are special boundaries that have to be considered for hyphenation. Therefore
the first part of the hyphenation process is to split any string into
words that can be hyphenated and everything else. This is done by
special tokenizers, each of which splits the given string according to its
own task: one uses whitespace characters as split points, whereas another
uses common punctuation characters.
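The splitting step can be sketched in a few lines of plain PHP. This is only an illustration of the idea, not the package's own tokenizer classes: the function name tokenize, the token shape, and the list of punctuation characters are all assumptions made for this example.

```php
<?php
// Hypothetical helper, not part of the package's API: split a string into
// tokens and keep the separators so the text can be reassembled later.
function tokenize(string $text): array
{
    $flags = PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY;

    // First pass: split on runs of whitespace, keeping the whitespace itself.
    $parts = preg_split('/(\s+)/u', $text, -1, $flags);

    $tokens = [];
    foreach ($parts as $part) {
        // Second pass: split each part on runs of common punctuation.
        // The character list is an assumption chosen for this sketch.
        $pieces = preg_split('/([.,;:!?()"]+)/u', $part, -1, $flags);
        foreach ($pieces as $piece) {
            // Only runs of letters count as words that may be hyphenated.
            $isWord = preg_match('/^\p{L}+$/u', $piece) === 1;
            $tokens[] = ['text' => $piece, 'isWord' => $isWord];
        }
    }

    return $tokens;
}
```

Because the separators are kept as tokens of their own, concatenating the text of all tokens reproduces the input string unchanged.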
The next step in the hyphenation process is to determine the possible
hyphenation positions using special hyphenation patterns. These patterns have
been used in TeX for a long time and are widely used in other
open-source projects. The pattern files used here come
from the OpenOffice.org project. They are also based on the TeX patterns, but
are easier to parse than the original TeX files. In some cases they are also
enriched with additional information. These patterns are locale-dependent and
are provided by a dedicated component of the package.
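To make the pattern matching more concrete, here is a minimal sketch of Liang-style matching in plain PHP. It is not the package's implementation: the function names, the leftMin and rightMin parameters, and the small pattern list are assumptions for illustration, and the code handles single-byte (ASCII) words only. In a pattern, a digit between two letters weights that gap; after all patterns have been applied, the highest weight per gap wins, odd weights allow a break, and even weights forbid it. The dots added around the word mark the word boundaries.

```php
<?php
// Parse a TeX-style pattern such as "hy3ph" into its letters ("hyph") and a
// list of weights, one for each gap before, between, and after the letters.
function parsePattern(string $pattern): array
{
    $letters = '';
    $weights = [0];
    foreach (str_split($pattern) as $char) {
        if (ctype_digit($char)) {
            // A digit weights the gap in front of the next letter.
            $weights[count($weights) - 1] = (int) $char;
        } else {
            $letters .= $char;
            $weights[] = 0;
        }
    }
    return [$letters, $weights];
}

// Return the offsets inside $word at which a break is allowed.
// $leftMin and $rightMin are assumed minimum fragment lengths.
function hyphenationPositions(string $word, array $patterns, int $leftMin = 2, int $rightMin = 2): array
{
    $padded = '.' . strtolower($word) . '.';
    $scores = array_fill(0, strlen($padded) + 1, 0);

    foreach ($patterns as $pattern) {
        [$letters, $weights] = parsePattern($pattern);
        $offset = 0;
        // Apply the pattern at every place where its letters occur.
        while (($start = strpos($padded, $letters, $offset)) !== false) {
            foreach ($weights as $i => $weight) {
                $scores[$start + $i] = max($scores[$start + $i], $weight);
            }
            $offset = $start + 1;
        }
    }

    $positions = [];
    for ($i = $leftMin; $i <= strlen($word) - $rightMin; $i++) {
        // Gap $i of the word is gap $i + 1 of the padded string.
        if ($scores[$i + 1] % 2 === 1) {
            $positions[] = $i;
        }
    }
    return $positions;
}

// A small sample in TeX notation, chosen for illustration only.
$samplePatterns = ['hy3ph', 'he2n', 'hena4', 'hen5at', '1na', 'n2at', '1tio', '2io', 'o2n'];
```

With this sample list, hyphenationPositions('hyphenation', $samplePatterns) returns [2, 6], which corresponds to hy-phen-ation.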
After the patterns have been retrieved for a word, the possible hyphenation
positions can be determined. The word is then passed through a filter, which
handles the actual hyphenation. Depending on the selected filter it is, for
instance, possible to mark every possible hyphenation position with a given
character (SimpleFilter). Other filters are possible.
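The following sketch shows two ways a filter could turn a list of hyphenation positions into a result. Both functions are hypothetical helpers written for this example and do not reflect the signatures of the package's filter classes; they assume single-byte words and positions given as offsets into the word.

```php
<?php
// Hypothetical filter: mark every given position with a hyphen character.
function markPositions(string $word, array $positions, string $hyphen = '-'): string
{
    // Work from right to left so that earlier offsets stay valid.
    rsort($positions);
    foreach ($positions as $position) {
        $word = substr($word, 0, $position) . $hyphen . substr($word, $position);
    }
    return $word;
}

// Hypothetical alternative filter: return the fragments as a list instead.
function splitAtPositions(string $word, array $positions): array
{
    sort($positions);
    $parts = [];
    $start = 0;
    foreach ($positions as $position) {
        $parts[] = substr($word, $start, $position - $start);
        $start = $position;
    }
    $parts[] = substr($word, $start);
    return $parts;
}
```

For the word hyphenation and the positions 2 and 6, the first function yields hy-phen-ation and the second yields the fragments hy, phen, and ation. Passing "\u{00AD}" as the third argument of markPositions would insert soft hyphens instead of visible ones.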
The last step is to merge all the bits and pieces the tokenizers left over so that we can get a final hyphenation result. This too is handled by the filters, as the result might differ depending on the filter used.
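A minimal sketch of this merge step, assuming tokens are plain arrays with a text and an isWord entry and that the per-word hyphenation is passed in as a callable. Both the token shape and the function name mergeTokens are assumptions for this example, not the package's API.

```php
<?php
// Hypothetical merge step: hyphenate the word tokens, copy everything else
// through unchanged, and join the pieces into one string.
function mergeTokens(array $tokens, callable $hyphenateWord): string
{
    $result = '';
    foreach ($tokens as $token) {
        $result .= $token['isWord']
            ? $hyphenateWord($token['text'])
            : $token['text'];
    }
    return $result;
}

// Stand-in for the pattern lookup, used only to keep the sketch short.
$lookup = ['hyphenation' => 'hy-phen-ation'];
$hyphenate = function (string $word) use ($lookup): string {
    return $lookup[$word] ?? $word;
};

$tokens = [
    ['text' => 'Try', 'isWord' => true],
    ['text' => ' ', 'isWord' => false],
    ['text' => 'hyphenation', 'isWord' => true],
    ['text' => '!', 'isWord' => false],
];

echo mergeTokens($tokens, $hyphenate), "\n";
```

A filter that returns fragments rather than a marked string would merge differently, for example by collecting a list per token, which is why the merge belongs to the filter and not to the tokenizer.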