Lemma + Semantic Matching: Capture More Parallels

A “lemma-based search” matches occurrences of the same word across its different inflected forms, and this is the basis of version 3.0 of the Tesserae software. In a benchmark test of Pharsalia 1 against the Aeneid, a lemma-based search retrieved 55% of the parallels previously noted by commentators.

Previous posts have discussed our method of generating metonym (synonym, antonym, hyponym, etc.) dictionaries through topic modeling. To capture more of our target intertexts, we worked to generate the most accurate metonym dictionary possible, then combined it with lemmatization so that a single search captures both the different inflections of a word and its metonyms.

In a repeat of the benchmark test above, the new ‘synonym + lemma’ feature retrieved commentator parallels at a rate of 68% (other search settings remained the same).
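To illustrate the idea, here is a minimal sketch of lemma + synonym matching in Python. The tiny `LEMMAS` and `METONYMS` dictionaries below are hypothetical stand-ins for Tesserae’s full lemmatizer and topic-model-derived metonym dictionary, not its actual data.

```python
# Minimal sketch of combined lemma + metonym matching between two phrases.

# Hypothetical lemma dictionary: inflected form -> lemma.
LEMMAS = {
    "arma": "arma", "armis": "arma",
    "bella": "bellum", "bellum": "bellum",
    "cano": "cano", "canimus": "cano",
}

# Hypothetical metonym dictionary: lemma -> semantically related lemmas.
METONYMS = {
    "bellum": {"proelium", "arma"},
}

def features(token):
    """The matchable features of a token: its lemma plus that lemma's metonyms."""
    lemma = LEMMAS.get(token, token)
    return {lemma} | METONYMS.get(lemma, set())

def shared_features(phrase_a, phrase_b):
    """Tokens of phrase_a that match phrase_b: same lemma in any
    inflection, or a recorded metonym of a lemma in phrase_b."""
    feats_b = set()
    for tok in phrase_b.split():
        feats_b |= features(tok)
    return {tok for tok in phrase_a.split() if features(tok) & feats_b}

print(shared_features("arma cano", "bellum canimus"))
```

A pure inflection match (arma ~ armis) and a metonym match (bellum ~ proelium) both fall out of the same feature-overlap test, which is the point of combining the two dictionaries in one search.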

[Figure: comparison of benchmark retrieval rates by matching feature]

View a version 3.1 synonym + lemma benchmark test.

View a version 3.0 lemma-based benchmark test.

A Greek to Latin Dictionary

As mentioned in previous posts, the Tesserae team has been working to create a digital Greek to Latin dictionary to aid in the retrieval of cross-language text reuse. Tesserae interns Nathaniel Durant, Theresa Mullin, and Elizabeth Hunter collectively assessed 1,000 Greek words to determine which, if any, of our methods for producing a cross-language dictionary yielded accurate translations. The winner proved to be an enhanced version of Chris Forstall’s topic-model-based ‘pivot’ method.

I crunched the numbers from the 1,000 translations tested by our faithful collaborators and used them to generate the best possible Greek-to-Latin dictionary. Chris’s algorithm produced up to two Latin translations for each Greek word, with a similarity value attached to each translation. I set out to find a good ‘cutoff’ for that similarity value, balancing precision and recall according to the following criteria:

  1. It was very important to us that we retain at least 1 accurate translation.
  2. It was very important to us that we avoid retaining inaccurate translations.

Because most words have two possible translations, it proved best to use two different similarity-score cutoffs, one for translation A and one for translation B. The result is a Greek-to-Latin dictionary that correlates 34,000 Greek words with at least one semantically related Latin word. We have reason to believe that this dictionary is accurate at a rate of 75%-80%, according to our own parameters for accuracy (because we are searching for allusions, this is not a ‘translation’ dictionary; we consider antonyms and all other metonyms to be valid associations).
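The cutoff selection can be sketched as a small grid search. Everything below is illustrative: the candidate data, the 0-1 similarity grid, and the scoring rule that penalizes retained inaccurate translations twice as heavily as it rewards accurate ones are assumptions for the example, not the project’s actual numbers or weighting.

```python
from itertools import product

# Hypothetical evaluation data: for each Greek headword, two candidate
# Latin translations as (similarity, is_accurate) pairs; translation A first.
CANDIDATES = [
    [(0.91, True), (0.45, False)],
    [(0.72, True), (0.60, True)],
    [(0.38, False), (0.20, False)],
    [(0.80, False), (0.77, True)],
]

def evaluate(cut_a, cut_b):
    """Count (accurate, inaccurate) translations retained when translation A
    needs similarity >= cut_a and translation B needs similarity >= cut_b."""
    kept_good = kept_bad = 0
    for entry in CANDIDATES:
        for (sim, ok), cut in zip(entry, (cut_a, cut_b)):
            if sim >= cut:
                if ok:
                    kept_good += 1
                else:
                    kept_bad += 1
    return kept_good, kept_bad

def score(cuts):
    # Criterion 2 (avoid inaccurate translations) weighted above criterion 1.
    good, bad = evaluate(*cuts)
    return good - 2 * bad

# Grid-search a separate cutoff for each translation slot.
grid = [x / 20 for x in range(21)]
best = max(product(grid, grid), key=score)
print(best, evaluate(*best))
```

With these toy numbers the search settles on a stricter cutoff for translation A than for translation B, the same shape of result as using two different cutoffs described above.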

Publications on our methodology are forthcoming. For now, please experiment with the tool at http://tesserae.caset.buffalo.edu/cross.php. We welcome your feedback.

Achilleid course at the University of Geneva

Greetings from Geneva, Switzerland!

We are delighted to announce that a group of Master’s students at the University of Geneva will be using Tesserae in a course entitled «Travestissement et transgénéricité: l’Achilléide de Stace» (“Cross-dressing and trans-genericity: Statius’ Achilleid”) to explore this difficult and fascinating work from an intertextual point of view. In this poem, written under the sign of ambiguitas, the author pushes the reader to reflect on the relationship between the choice of a character at a specific moment in his story (Achilles disguised as a woman at the court of King Lycomedes) and the author’s own questions about style, genre, and gender. Under the direction of Lavinia Galli Milic, we are examining, in particular, the role (or roles) played by intertextuality in defining the literary genre of this poem and in pushing beyond the genre’s limits.


Digital Classics Association Panel at 2014 APA / AIA


Participants in the 2014 DCA APA / AIA Panel “Getting Started with Digital Classics.” From left to right: Monica Berti, Neil Bernstein, Adam Rabinowitz, Neil Coffee, Diane Cline, Hugh Cayless (partially obscured), Gregory Crane, and Francesco Mambrini. Also presenting were Ryan Baumann and Joshua Sosin.

The first APA / AIA session hosted by the Digital Classics Association was held at the meetings in Chicago on January 3, 2014. The topic was “Getting Started with Digital Classics.” The presentations highlighted some very interesting projects and offered perspectives on the future of digital classics research. Tesserae collaborator Neil Bernstein gave a terrific talk on his work with Kyle Gervais and Wei Lin using Tesserae to compare rates of overall intertextuality across the Latin corpus. There was a lot of good energy at the session, which continued into an informal DCA reception the following day.

(Screencasts are available on the Tesserae YouTube channel.)

The panel came away with accolades in the snap email poll on the day’s panels conducted by the APA. Neil Coffee tied for best presider of the day, and the session’s papers swept the best afternoon session category. The DC3 team of Ryan Baumann, Hugh Cayless, and Joshua Sosin tied with Gregory Crane for best paper, with honorable mentions going to the presentations of Neil Bernstein, Neil Coffee, and Diane Cline.

Tesserae Work Used in Lausanne Digital Humanities Course

Experimental work conducted by the Tesserae team using machine learning to improve allusion detection has been incorporated into the DH 101 course taught by Frederic Kaplan of the Digital Humanities Laboratory of the University of Lausanne. The Tesserae DH 2013 abstract, “Modeling the Interpretation of Literary Allusion with Machine Learning Techniques,” was discussed on the course blog as one of several approaches using machine learning to develop new humanities perspectives.

APA Blog Series on Digital Classics

I’ve been asked to write some blog posts on digital classics for the website of the American Philological Association (soon to be the Society for Classical Studies). In the first post, “Digits and Dactyls,” I try to give some sense of the possibilities of digital humanities approaches to classicists who may not be familiar with them. A fellow APA blogger, Tim Whitmarsh, writes about the mixed blessing of the need for UK classicists to search out funding sources these days (“Taking Classics for Granted”). Despite the pressures, he finds a silver lining in the collaborations produced, and suggests that many of these are moving in a digital direction. It’s exciting to see the APA (ok, SCS) opening up further to the digital world.

Tesserae at Digital Humanities 2013

Chris and Walter represented Tesserae at Digital Humanities 2013 in Lincoln, Nebraska. We presented our ongoing work on the Tesserae scoring system in the new electronic poster format, which was perfectly suited to the material. We got positive feedback and suggestions for new directions from both the text re-use/stylometry side and the machine learning/pattern recognition side.


We saw some excellent papers, although with such a diverse conference and up to six parallel sessions each day we had to make some difficult choices. David Hoover spoke about using various subsets of a feature set versus the whole set in authorship attribution, a subject that interests us as we continue to work on methods for feature selection. Christof Schöch discussed working in stylometry with a corpus unbalanced as to form (e.g. tragedy/comedy, poetry/prose), which caused us to rethink the usefulness of prose works as a comparandum for poetry. Jean-Gabriel Ganascia showed promising new techniques for the detection of text re-use in literature. Maciej Eder addressed the problem of open-set attribution, an important reality all too often overlooked in authorship and style studies. David Bamman and Adam Anderson presented an analysis of social networks in Old Assyrian correspondence, which beautifully demonstrated how Digital Humanities projects can make rigorous scientific analysis both approachable and relevant for a Humanities audience. Graham Sack demonstrated results from an intriguing model for the structuring of narrative attention in novels. We’re looking forward to the remainder of the conference and will post a more detailed report on our return home.

As we head home from Nebraska we take with us new ideas and new perspectives, already planning our proposal(s) for DH 2014 in Lausanne.

All Perseus Latin Added, plus some Greek and English

When the Tesserae tool is demonstrated to Classics researchers, the most frequently asked question by a wide margin is: “When can you add my text?” It’s a fair request, and a testament to the interest in computer-assisted investigation of intertextuality. Our answer to date has always been “we’re working on it.” This has not been a brush-off; in fact, the rapid addition of new texts to our searchable corpus has been one of the top priorities, if not the top priority, of the 2012-2013 academic year. I am pleased to report that we have increased the size of our searchable corpus by a factor of ten, from approximately eight hundred thousand words at the beginning of the 2012-2013 academic year to over eight million words at the time of this writing.

Before I launch into a detailed account of our progress, there are two things I want to make clear:

  1. Other development teams on the Tesserae project never stopped expanding the functionality of the system while my team worked to extend its reach. The results can be seen in (among other things) the new multi-text search and the much, much faster back-end we now enjoy.
  2. The massive increase in the Tesserae corpus is the result of a team effort. Veterans of our scoring team returned and were joined by fresh faces who contributed to a deep talent pool, as attested on our personnel page. We were also given a much-needed boost by Chris Forstall, as I will explain.

Our goals for the year included the incorporation of the entirety of the Perseus classical corpus into Tesserae and the addition of important works in the English language. In order to add the texts from the Perseus database, it was crucial to preserve the hierarchy of text, book, and line number with which the works were already annotated. Tesserae makes use of these markers, and stripping the information out would be a waste. Yet there were several obstacles.

First, Perseus texts were added and annotated by many different researchers over a period of several years. Each text presented unique problems to its annotator. Where these problems repeat themselves, a single annotator might solve them the same way each time, but different annotators developed unique solutions. This complicated Chris Forstall’s task of creating a universal parser for the Perseus XML.

In addition, some of the variation in the XML structure of the annotated texts is a natural result of differences in the structure of the works themselves. Plays are not organized like novels, which are not organized like histories. The variety of structures is hinted at in Chris Forstall’s blog post. Worse, the organization of several authors’ texts has been complicated by their textual tradition. Take Cicero, for example.

At one point, Cicero’s works were organized by text and chapter. These chapter numbers were based on the page numbers of a very early print publication, and they often break up the text mid-sentence. Later tradition re-divided the works into text, book, chapter, and line, often interrupting the old divisions. Modern texts still include the old chapter numbers as well as the new. The researchers at Perseus, in their effort to faithfully reproduce the information contained in a print volume, include both numbering systems simultaneously in the XML structure of the text. Untangling these conflicting annotations correctly is time-consuming; luckily we were able to rely on the wise counsel of John Dugan and the tireless efforts of Anna Glenn in order to incorporate every single text in Cicero’s oeuvre into Tesserae. For those who aren’t aware, that’s a good chunk of Latin: Cicero’s works contain nearly half of the words in the existing canon of classical Latin.
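As a rough illustration of the untangling problem, the sketch below parses a schematic TEI-style fragment (invented for this example, not actual Perseus markup) in which the modern book/chapter/line hierarchy lives in nested div elements, while the legacy chapter numbers survive only as milestone markers that can interrupt the text mid-sentence.

```python
import xml.etree.ElementTree as ET

# Schematic TEI-style fragment (hypothetical, not real Perseus XML): the
# modern hierarchy is nested <div> elements; the legacy numbering appears
# only as out-of-hierarchy <milestone> markers.
XML = """
<text>
  <div type="book" n="1">
    <div type="chapter" n="1">
      <milestone unit="oldchapter" n="3"/>
      <l n="1">first line of book 1 chapter 1</l>
      <l n="2">second line</l>
    </div>
    <div type="chapter" n="2">
      <l n="1">first line of chapter 2</l>
    </div>
  </div>
</text>
"""

def extract_loci(root):
    """Walk the modern book/chapter/line hierarchy, ignoring legacy
    <milestone> markers, and return (locus, text) pairs."""
    loci = []
    for book in root.findall("div[@type='book']"):
        for chapter in book.findall("div[@type='chapter']"):
            for line in chapter.findall("l"):
                locus = f'{book.get("n")}.{chapter.get("n")}.{line.get("n")}'
                loci.append((locus, line.text))
    return loci

for locus, text in extract_loci(ET.fromstring(XML)):
    print(locus, text)
```

Because the legacy numbers sit outside the nesting, a walk over the divs recovers clean loci without them; real Perseus files vary far more than this tidy example, which is what made a universal parser genuinely hard to build.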

In fact, the project was able to dramatically increase the size of the corpus in all aspects. Some numbers:

# of words in the Tesserae corpus circa August, 2012: 795,141
# of words in the Tesserae corpus circa June, 2013: 8,198,402


That’s a more than tenfold increase (up 931%). It was made possible by the initial work of Chris Forstall, who developed a universal XML-parsing tool for use on the Perseus corpus, and by the sustained efforts of our force of volunteers.

The increase has been so dramatic that we are currently considering new methods of organization to relieve the now-overburdened menu system. In addition, still more texts are being processed even as I write, so if you’d like to make sure your particular text will be incorporated into the corpus, feel free to drop us a line, and be assured: we’re working on it.