Parsing Tokens in Version 5 for English

When .tess files are ingested into Tesserae, tokens are parsed and stored so that Tesserae searches can be performed on them. With the move of the code base from Perl (version 3) to Python (version 5), the token parsing algorithm was changed so that the Python version could parse tokens nearly as efficiently as the Perl version. This change led to some difficulties in constructing the word-matcher and non-word-matcher (both implemented as regular expressions) used to parse the tokens.

Version 3 parses tokens with the following algorithm:

  • While the input string is not empty, repeat the following:
    • Remove anything at the beginning that matches the word-matcher
      • If something was removed, store the normalized string as a token
    • Remove anything at the beginning that matches the non-word-matcher

This can be seen in tesserae/scripts/v3/add_column.pl.
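To make the loop concrete, here is a minimal Python sketch of the version 3 approach. The function name and its matcher arguments are hypothetical names for illustration, not the actual code in add_column.pl, and normalization is reduced to lowercasing.

    def tokenize_v3(text, word, non_word):
        # Version 3 style: repeatedly peel a word, then a non-word
        # sequence, off the front of the string. The two matchers must
        # jointly cover every character, or the loop would stall.
        tokens = []
        while text:
            match = word.match(text)
            if match:
                tokens.append(match.group().lower())  # store the normalized token
                text = text[match.end():]
            match = non_word.match(text)
            if match:
                text = text[match.end():]
        return tokens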

Version 5 parses tokens with the following algorithm:

  • Normalize the input string
  • Break the normalized input string along substrings that match the non-word-matcher
  • For each of the resulting substrings:
    • If the substring matches the word-matcher, store it as a token

This can be seen in BaseTokenizer.tokenize (tesserae-v5/tesserae/tokenizers/base.py).
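For comparison, a sketch of the version 5 approach in the same illustrative style; again the names are placeholders, and the real normalization in BaseTokenizer does more than lowercasing.

    def tokenize_v5(text, word, non_word):
        # Version 5 style: normalize the whole string first, split it on
        # non-word sequences, and keep only the pieces that the
        # word-matcher accepts in full.
        normalized = text.lower()
        pieces = non_word.split(normalized)
        return [piece for piece in pieces if word.fullmatch(piece)]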

The two algorithms produce the same output when the word-matcher and the non-word-matcher are exact complements, that is, when every character is matched by exactly one of the two. If the matchers overlap in the characters they match, however, the two algorithms can produce different outputs.

For example, consider the input string “the hill-top”. Suppose that the word-matcher matches any contiguous sequence of the characters “a” through “z” and that the non-word-matcher matches any contiguous sequence of characters not matched by the word-matcher. Then both algorithms will store the following tokens: “the”, “hill”, “top”.
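Plugging these matchers into the sketches above shows the agreement:

    import re

    word = re.compile(r"[a-z]+")       # the word-matcher from the example
    non_word = re.compile(r"[^a-z]+")  # its exact complement

    tokenize_v3("the hill-top", word, non_word)  # ['the', 'hill', 'top']
    tokenize_v5("the hill-top", word, non_word)  # ['the', 'hill', 'top']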

However, suppose the word-matcher matches any contiguous sequence of “a” through “z” as well as “-”, while the non-word-matcher remains the same as before. The two matchers now overlap on “-”. The version 3 algorithm would then store the tokens “the” and “hill-top”, but the version 5 algorithm would still store “the”, “hill”, “top”. This is because the non-word-matcher finds the “-” between “hill” and “top” and breaks the two apart before the word-matcher can confirm that “hill-top” is a valid word.
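The same sketches reproduce the divergence once the word-matcher also accepts hyphens:

    import re

    word = re.compile(r"[a-z-]+")      # hyphens now count as word characters
    non_word = re.compile(r"[^a-z]+")  # unchanged, so the matchers overlap on "-"

    tokenize_v3("the hill-top", word, non_word)  # ['the', 'hill-top']
    tokenize_v5("the hill-top", word, non_word)  # ['the', 'hill', 'top']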

The difference in outputs caused by this asymmetry between the word-matcher and non-word-matcher posed a problem when attempting to re-create English capabilities for version 5, because the English word-matcher and non-word-matcher do share characters that they match on. To overcome this problem, the non-word-matcher had to be engineered very carefully so that the characters on which the version 3 matchers overlapped were special-cased. In particular, lookahead and lookbehind assertions were used to make sure the overlapping characters really should be considered part of a non-word sequence.
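The following is an illustrative non-word-matcher (not the actual version 5 pattern) showing the idea: lookarounds let a hyphen count as a word break only when it is not flanked by letters on both sides.

    import re

    word = re.compile(r"[a-z-]+")
    # A hyphen is a non-word character only when it lacks a letter on
    # one side or the other; intra-word hyphens survive the split.
    non_word = re.compile(r"(?:[^a-z-]|(?<![a-z])-|-(?![a-z]))+")

    tokenize_v5("the hill-top", word, non_word)    # ['the', 'hill-top']
    tokenize_v5("-dash at start", word, non_word)  # ['dash', 'at', 'start']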

An edge case of particular difficulty arose when multiple hyphens appeared next to each other, as in “Deception innocent--give ample space” (Cowper, Task 1.353). The version 3 algorithm handles this case easily: it finds “innocent” as a word, decides that “--” is a non-word sequence, and then finds “give” as a word. An earlier attempt at constructing an effective non-word-matcher for version 5 mistakenly parsed “innocent--give” as one token. The solution was to add the multiple-hyphen case explicitly as a non-word sequence.
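A hedged sketch of the shape of that fix: place a run of two or more hyphens first in the alternation so it is always treated as a non-word sequence. The actual pattern in version 5 differs, but the principle is the same.

    import re

    word = re.compile(r"[a-z-]+")
    # Two or more hyphens in a row are explicitly a word break, checked
    # before the single-hyphen lookaround cases.
    non_word = re.compile(r"(?:-{2,}|[^a-z-]|(?<![a-z])-|-(?![a-z]))+")

    tokenize_v5("deception innocent--give ample space", word, non_word)
    # ['deception', 'innocent', 'give', 'ample', 'space']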