Parsing Tokens in Version 5 for English

When .tess files are ingested into Tesserae, tokens are parsed and stored so that Tesserae searches can be performed on them. With the move in code base from Perl (version 3) to Python (version 5), the token parsing algorithm was changed so that the Python version could parse tokens nearly as efficiently as the Perl version. This led to some difficulties in constructing the word-matcher and non-word-matcher (implemented as regular expressions) used to parse the tokens.

Version 3 parses tokens with the following algorithm:

  • While the input string is not empty, repeat the following:
    • Remove anything at the beginning that matches the word-matcher
      • If something was removed, store the normalized string as a token
    • Remove anything at the beginning that matches the non-word-matcher

This can be seen in tesserae/scripts/v3/add_column.pl.

Version 5 parses tokens with the following algorithm:

  • Normalize the input string
  • Break the normalized input string along substrings that match the non-word-matcher
  • For each substring that remains from the broken normalized input string:
    • If the substring matches the word-matcher, store it as a token

This can be seen in BaseTokenizer.tokenize (tesserae-v5/tesserae/tokenizers/base.py).

The two algorithms will produce the same output if the word-matcher matches on all characters that are not part of the non-word-matcher. However, if the word-matcher and non-word-matcher overlap in characters that they match on, the two algorithms can produce different outputs.

For example, consider the input string “the hill-top”. Suppose that the word-matcher matches any contiguous sequence of the characters “a” to “z” and that the non-word-matcher matches any contiguous sequence of characters outside that range. Then both algorithms will store the following tokens: “the”, “hill”, “top”.

However, suppose the word-matcher matches any contiguous sequence of the characters “a” to “z” and “-”, while the non-word-matcher remains the same as before. In this case, the two matchers overlap on “-”. The version 3 algorithm would then store the following tokens: “the”, “hill-top”. But the version 5 algorithm would still store “the”, “hill”, “top”. This is because the non-word-matcher would find the “-” between “hill” and “top” and break the two apart before the word-matcher could confirm that “hill-top” is a valid word.
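The divergence described above can be reproduced with a small sketch. This is not the Tesserae code: the function names, the three regular expressions, and the use of lowercasing as the normalization step are assumptions made for illustration only.

```python
import re

# Hypothetical matchers taken from the worked example, not from Tesserae itself.
WORD_PLAIN = re.compile(r"[a-z]+")         # letters only
WORD_WITH_HYPHEN = re.compile(r"[a-z-]+")  # letters and "-"
NONWORD = re.compile(r"[^a-z]+")           # anything that is not a letter


def tokenize_v3_style(text, word_re, nonword_re):
    """Alternately strip a word match, then a non-word match, from the front."""
    tokens = []
    pos = 0
    while pos < len(text):
        start = pos
        match = word_re.match(text, pos)
        if match and match.end() > pos:
            # Normalization is assumed to be simple lowercasing here.
            tokens.append(match.group().lower())
            pos = match.end()
        match = nonword_re.match(text, pos)
        if match and match.end() > pos:
            pos = match.end()
        if pos == start:
            # Neither matcher consumed anything; skip one character
            # so the loop cannot run forever.
            pos += 1
    return tokens


def tokenize_v5_style(text, word_re, nonword_re):
    """Normalize, split on non-word matches, keep pieces that are words."""
    normalized = text.lower()
    pieces = nonword_re.split(normalized)
    return [piece for piece in pieces if word_re.fullmatch(piece)]


# Matchers that do not overlap: both styles agree.
print(tokenize_v3_style("the hill-top", WORD_PLAIN, NONWORD))
print(tokenize_v5_style("the hill-top", WORD_PLAIN, NONWORD))

# Matchers that overlap on "-": the two styles diverge.
print(tokenize_v3_style("the hill-top", WORD_WITH_HYPHEN, NONWORD))
print(tokenize_v5_style("the hill-top", WORD_WITH_HYPHEN, NONWORD))
```

With the overlapping matchers, the version 3 style keeps “hill-top” whole because the word-matcher gets the first chance at the hyphen, while the version 5 style has already split on it before the word-matcher is consulted.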

The difference in algorithm outputs caused by overlap between the word-matcher and non-word-matcher posed a problem when attempting to re-create English capabilities for version 5, because the version 3 word-matcher and non-word-matcher for English shared characters that they matched on. To overcome this problem, the non-word-matcher had to be engineered very carefully so that the overlapping characters were special-cased. In particular, lookahead and lookbehind assertions were used to make sure that an overlapping character is treated as part of a non-word sequence only where it really should be.

An edge case of particular difficulty was when multiple hyphens were next to each other, as in “Deception innocent--give ample space” (Cowper Task 1.353). The version 3 algorithm handles this case easily because it will find “innocent” as a word, then decide that “--” is a non-word, and find “give” as a word. In an earlier attempt at constructing an effective non-word-matcher for version 5, the algorithm would mistakenly parse “innocent--give” as one token. The solution was to add the multiple hyphen case explicitly as a non-word sequence.
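As a rough illustration of how lookaround assertions and an explicit multiple-hyphen case might interact, consider the sketch below. The patterns are hypothetical and much simpler than whatever the English tokenizer in tesserae-v5 actually uses; the names WORD, NAIVE_NONWORD, FIXED_NONWORD, and split_and_filter are invented for this example.

```python
import re

# Hypothetical word-matcher: letters and hyphens.
WORD = re.compile(r"[a-z-]+")

# Naive non-word-matcher: a hyphen counts as non-word only when it has
# no letter on either side (lookbehind and lookahead both succeed).
NAIVE_NONWORD = re.compile(r"(?:[^a-z-]|(?<![a-z])-(?![a-z]))+")

# Same as above, but a run of two or more hyphens is listed explicitly
# as a non-word sequence.
FIXED_NONWORD = re.compile(r"(?:-{2,}|[^a-z-]|(?<![a-z])-(?![a-z]))+")


def split_and_filter(text, word_re, nonword_re):
    """Version 5 style: normalize, split on non-words, keep the words."""
    pieces = nonword_re.split(text.lower())
    return [piece for piece in pieces if word_re.fullmatch(piece)]


line = "Deception innocent--give ample space"

# Each hyphen in "--" touches a letter on one side, so the naive
# matcher never fires and the pair of words stays glued together.
print(split_and_filter(line, WORD, NAIVE_NONWORD))

# With the explicit multiple-hyphen case, "--" separates the words,
# while a single hyphen inside a word is still left alone.
print(split_and_filter(line, WORD, FIXED_NONWORD))
print(split_and_filter("the hill-top", WORD, FIXED_NONWORD))
```

In this sketch the lookarounds alone are not enough, because each hyphen of the pair is adjacent to a letter on one side; listing the hyphen run as its own alternative is what separates “innocent” from “give”.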

Tesserae Version 5 Local Installation Instructions

As long as the standalone version is unavailable, this document will guide the brave in installing the software necessary to run Tesserae on their own machines. Even after the standalone version becomes available, these instructions should shed light on the assumptions upon which the standalone version was built.

Prerequisites:

The following software will need to be installed on your machine before you can install Tesserae:

  • MongoDB (we developed for 4.0)
  • Python (we developed with 3.6; I’m running it now with 3.8)
    • It is recommended to install virtual environment support (Ubuntu, for example, does not distribute Python with virtual environment support by default)
  • git
  • nodejs and npm (installing nodejs should give you npm as well; the LTS version of nodejs is recommended)
  • A web browser (developed primarily in Firefox and Chrome)

Additionally, you will want about 5 GB of free space on your hard drive.

Backend Installation Instructions:

Start by opening a terminal window and creating a Python virtual environment where you will install the necessary Python packages. In Ubuntu, the following command creates a virtual environment called “tessenv” in your current working directory:

python3 -m venv tessenv

Next, activate the virtual environment. In Ubuntu, the following command does this:

source tessenv/bin/activate

Next, install the Tesserae API (this will also install tesserae-v5, among other things):

pip install --upgrade git+https://github.com/tesserae/apitess#egg=apitess

Now, download the script available at https://raw.githubusercontent.com/tesserae/apitess/master/example/example_launcher.py.

You will now want to edit the file you just downloaded. If you have special credentials set up for your MongoDB installation, change values in db_config to match your credentials. Otherwise, make sure that both values associated with MONGO_USER and MONGO_PASSWORD are set to the empty string ''. Finally, set DB_NAME to 'tesserae'.

Now, download the database dump available at https://www.wjscheirer.com/misc/tesserae/archivedbasedump20200731.gz. This may take a while.

You will then want to install the database dump to your database. This will definitely take some time. In Ubuntu, the command (assuming there are no credentials you need for your installation of MongoDB and that it is running on the default port, 27017) is:

mongorestore --port 27017 --gzip --archive=archivedbasedump20200731.gz \
                           --nsFrom="base.*" --nsTo="tesserae.*"

Now, run the Python script you downloaded with the environment variable “ADMIN_INSTANCE” set to “true”. In Ubuntu, the following command does this:

ADMIN_INSTANCE=true python3 example_launcher.py

The startup message should indicate the URL where the API is being served. On my machine, it was at http://localhost:5000. To make sure that the API is running, point your web browser to http://localhost:5000/languages/. This should return some information about what languages are installed in the database (“greek” and “latin” at this time).
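If you would rather check the API from Python than from a browser, a minimal sketch using only the standard library follows. It assumes the API is served at http://localhost:5000 and that the /languages/ endpoint responds with JSON; the helper names languages_url and fetch_languages are made up for this example and are not part of apitess.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def languages_url(base_url):
    """Build the /languages/ endpoint URL from the API base URL."""
    return urljoin(base_url.rstrip("/") + "/", "languages/")


def fetch_languages(base_url="http://localhost:5000", timeout=10):
    """Request the languages endpoint and decode the body as JSON.

    Raises urllib.error.URLError if the backend is not reachable.
    """
    with urlopen(languages_url(base_url), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the backend running, calling fetch_languages() from a Python prompt should return the same information the browser shows. A connection error most likely means the launcher script is not running or is serving on a different port.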

If this is working, then you’ve got the backend set up.

Frontend Installation Instructions:

The frontend code is available at https://github.com/jeffkinnison/tesserae-frontend. The following steps install and run it.

First, open up a new terminal window and clone the repository:

git clone https://github.com/jeffkinnison/tesserae-frontend.git

Then, change your directory to the repository you just cloned:

cd tesserae-frontend

Install the JavaScript dependencies:

npm install

The repository should contain a file called “package.json”. Open it and add "homepage": "./" to the top-level object in that file. If you did this correctly, the bottom of the file should look something like this:

    "not op_mini all"
  ],
  "homepage": "./"
}
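If you prefer not to edit the file by hand, the same change can be scripted. The sketch below uses Python's standard json module; the function name set_homepage is hypothetical, and note that rewriting the file this way re-indents it with two spaces, which may differ slightly from the original formatting.

```python
import json
from pathlib import Path


def set_homepage(package_json_path, homepage="./"):
    """Add or overwrite the top-level "homepage" key in package.json."""
    path = Path(package_json_path)
    package = json.loads(path.read_text(encoding="utf-8"))
    package["homepage"] = homepage
    # Key order is preserved; a new key is appended at the end.
    path.write_text(json.dumps(package, indent=2) + "\n", encoding="utf-8")
    return package
```

Run from inside the cloned repository, set_homepage("package.json") should leave every other key untouched and place "homepage" at the bottom of the object if it was not already present.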

Also open the file “.env” in the repository. Change the value following the equals sign after “REACT_APP_REST_API_URL” to wherever your backend API server is running. In my case, this was 'http://localhost:5000'. Also change the value following the equals sign after “REACT_APP_MODE” to 'ADMIN'. This will enable some of the administrative features in the frontend, like adding and deleting texts in the database.

Now, run npm start. This should open the web browser and load up the frontend. If the backend is still running, then you should see the web page pop up.

Starting Tesserae After Installation:

If you’ve already installed everything, then here are the Ubuntu commands to get Tesserae up and running.

In one terminal window:

cd <place you put example_launcher.py>
ADMIN_INSTANCE=true python3 example_launcher.py

In another terminal window:

cd <repository location of tesserae-frontend>
npm start

Next Steps:

Now that you’ve installed Tesserae, you can run searches locally on your computer. You can also add/delete texts through the “Corpus” button near the top right. If you don’t like how the frontend looks or works, you could build your own on top of the API (API documentation is available at https://tesserae.caset.buffalo.edu/docs/api/).