On Saturday, April 12, 2014, Christopher Forstall and James Gawley conducted a workshop on Digital Text Analysis for Humanists using the R software package. The workshop took place at the University at Buffalo (UB) and was sponsored by the Digital Scholarship and Culture Committee of the UB Techne Institute.
The program for the workshop is here: Digital Text Analysis for Humanists Workshop – Program
The data files are here: R_workshop
Tesserae search begins by matching a minimum of two words in one text with two words in another. The words can be matched either by their exact forms or by their dictionary headwords. Using headword matching permits, for instance, the Latin tuli to match latus, both forms of the headword fero.
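The matching rule above can be sketched in a few lines of Python. This is only an illustration, not the Tesserae code itself: the tiny `LEMMATA` table and the helper names `headwords`, `shared_words`, and `is_parallel` are hypothetical, and a real search would draw on a full morphological dictionary.

```python
# Hypothetical lemma table: each inflected form maps to its possible headwords.
LEMMATA = {
    "tuli": {"fero"},
    "latus": {"fero", "latus"},
    "arma": {"arma", "armo"},
    "armis": {"arma", "armus"},
}


def headwords(form):
    """Return the possible headwords of a form; unknown forms stand for themselves."""
    return LEMMATA.get(form.lower(), {form.lower()})


def shared_words(phrase_a, phrase_b, by_headword=True):
    """Return the words of phrase_a that match at least one word of phrase_b."""
    matched = set()
    for a in phrase_a:
        for b in phrase_b:
            if by_headword:
                hit = bool(headwords(a) & headwords(b))
            else:
                hit = a.lower() == b.lower()
            if hit:
                matched.add(a.lower())
    return matched


def is_parallel(phrase_a, phrase_b, by_headword=True):
    """A candidate parallel needs at least two matching words on each side."""
    return (
        len(shared_words(phrase_a, phrase_b, by_headword)) >= 2
        and len(shared_words(phrase_b, phrase_a, by_headword)) >= 2
    )
```

With this table, `is_parallel(["arma", "tuli"], ["armis", "latus"])` holds under headword matching but not under exact-form matching, since neither pair shares an identical form.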
For comparisons of even moderate-sized texts, basic matching produces thousands of results. We have therefore created a scoring system to rank results by their likely interest.
Higher scores are given to parallels where the matched words in each text are closer together and where the matched words are rarer. Our testing has found that the top results produced by this method correspond well with the parallels noted by commentators. In other words, preliminary tests show that the current Tesserae identification and scoring processes help substantially to identify the most meaningful results.
Full testing of this system is still in progress, however, as are efforts to improve it further. In the meantime, the following description gives a somewhat more detailed account of its function.
First, the frequency of each matching term is calculated by dividing its count within its respective text by the total number of words in that text.
The frequency of a word will thus be different in the source and target texts. In a lemma-based search (the default), the count for a word includes every occurrence of an inflected form with which it shares one or more possible lemmata. These frequencies (very small fractions, even for the most common words) are then inverted and the results are added together across both phrases. The result is a very large number. This sum is then divided by the distance covered by the matching words in the source and target phrases.
Distance in each phrase is calculated as the number of tokens spanned by (and including) the two least frequent matching words. The distances from the source and target phrases are added together to make the overall distance. Finally, the natural logarithm of the quotient is taken. This helps to bring the exponential differences in word frequencies that occur in natural language into a more linear and human-interpretable range. For a given parallel, the rarer the words are, and the closer they are together in their respective texts, the higher its score will be.
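As a rough illustration, the scoring steps described above might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the production Tesserae scorer: the function names are hypothetical, frequencies are counted on exact forms rather than on shared lemmata, and each phrase is assumed to contain at least two matching words, each given as a word mapped to its token position in the phrase.

```python
import math
from collections import Counter


def frequencies(tokens):
    """Relative frequency of each word: its count divided by the text's word total."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count / total for word, count in counts.items()}


def span(positions, freq):
    """Tokens spanned by (and including) the two least frequent matching words.

    `positions` maps each matching word to its token index within the phrase.
    """
    rarest = sorted(positions, key=lambda word: freq[word])[:2]
    first, second = (positions[word] for word in rarest)
    return abs(first - second) + 1


def score(source_positions, target_positions, source_freq, target_freq):
    """Natural log of (sum of inverse frequencies / combined distance)."""
    inverse_sum = sum(1 / source_freq[w] for w in source_positions)
    inverse_sum += sum(1 / target_freq[w] for w in target_positions)
    distance = span(source_positions, source_freq) + span(target_positions, target_freq)
    return math.log(inverse_sum / distance)
```

For example, with made-up frequencies of 0.01 and 0.02 for two adjacent source words, and 0.005 and 0.02 for two target words that span four tokens, the inverse frequencies sum to 400, the combined distance is 6, and the score is ln(400 / 6), roughly 4.2.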
Adding new texts to the current version of Tesserae is relatively simple. Once we have a text properly formatted, we can run a program to process it, make a few other adjustments (like adding it to drop-down lists and entering its information into our list of sources), and it will be ready to search.
So achieving our goal of adding all Perseus classical Greek and Latin texts this year should be a walk in the park, no?
In fact, there is a bottleneck: getting the texts properly formatted for addition. We need a plain text (in .txt format) that has the proper section markers at the beginning of each line (of poetry) or section (of prose). For Perseus texts, this means automatically stripping out the XML information and inserting the section markers.
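To give a sense of what this conversion involves, here is a minimal Python sketch that strips the markup from a small XML fragment and prefixes each line of verse with a section marker. Everything specific in it is an assumption made for illustration: the element names (`div1`, `l`), the sample fragment, the `to_plain_lines` helper, and the marker layout (a label, then book.line, then a tab). The instructions mentioned below define the format Tesserae actually requires.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; real Perseus files are larger and structured differently.
SAMPLE = """<text><body>
<div1 type="book" n="1">
<l n="1">Arma virumque <hi>cano</hi>, Troiae qui primus ab oris</l>
<l n="2">Italiam fato profugus</l>
</div1>
</body></text>"""


def to_plain_lines(xml_string, label):
    """Return one plain-text line per <l> element, each prefixed with a marker."""
    root = ET.fromstring(xml_string)
    lines = []
    for book in root.iter("div1"):
        for line in book.iter("l"):
            # itertext() drops nested tags; split/join normalizes whitespace.
            text = " ".join("".join(line.itertext()).split())
            marker = f"{label} {book.get('n')}.{line.get('n')}"
            lines.append(f"{marker}\t{text}")
    return lines


if __name__ == "__main__":
    for row in to_plain_lines(SAMPLE, "aen."):
        print(row)
```

Prose texts would need the same treatment applied per section rather than per line.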
We’re moving forward with this work, but would welcome help. A set of instructions gives further detail on how to put texts in the correct format. Anyone who wants to pitch in should email Tesserae Fellow James Gawley for further advice. Note that, although the instructions give some examples of English texts, we have no team members working on English at the moment, and so don’t currently have the capacity to process other English texts.