Discovering Intertextuality with Sequence Alignment

Our approach to intertextuality begins from the time-tested technique of word-level n-gram matching, i.e., matching words in one text with those in another. I recently had a chance to meet with Peter Leonard of Yale, who reacquainted me with another approach that he was involved with at the University of Chicago, called sequence alignment. This work was led by Mark Olsen as part of the ARTFL project at Chicago. Rather than matching whole words, this method searches adaptively for matching sequences of letters, which makes it more flexible. It’s useful to keep in mind as another important approach for comparison and evaluation.

More information is available in this slide deck.

Tesserae Work Used in Lausanne Digital Humanities Course

Experimental work conducted by the Tesserae team using machine learning to improve allusion detection has been incorporated into the DH 101 course taught by Frederic Kaplan of the Digital Humanities Laboratory of the University of Lausanne. The Tesserae DH 2013 abstract, “Modeling the Interpretation of Literary Allusion with Machine Learning Techniques,” was discussed on the course blog as one of several approaches using machine learning to develop new humanities perspectives.

Intertextual Methodology Workshop – Introduction

On February 13-15, 2014, the Fondation Hardt outside Geneva in Vandoeuvres will host a workshop entitled “Intertextualité et humanités numériques: approches, méthodes, tendances. Intertextuality and digital humanities: approaches, methods, trends.”

The goal of the workshop is to develop a better understanding of:

  • what emerging digital methods for approaching intertextuality are and how they are likely to develop
  • the potential and limitations of these methods
  • the practical consequences for the interpretation of literature and language
  • the theoretical consequences for conceptions of intertextuality

The workshop will bring together representatives from several teams developing digital approaches to intertextuality to discuss their work and research plans. Joining them will be scholars of Latin literature with long experience discovering and interpreting instances of intertextuality, as well as defining the larger phenomenon.

This page is intended to host blog posts from participants relating to the conference.

APA Blog Series on Digital Classics

I’ve been asked to write some blog posts on digital classics for the website of the American Philological Association (soon to be the Society for Classical Studies). In the first post, “Digits and Dactyls,” I try to give some sense of the possibilities of digital humanities approaches to classicists who may not be familiar with them. A fellow APA blogger, Tim Whitmarsh, writes about the mixed blessing of the need for UK classicists to search out funding sources these days (“Taking Classics for Granted”). Despite the pressures, he finds a silver lining in the collaborations produced, and suggests that many of these are moving in a digital direction. It’s exciting to see the APA (ok, SCS) opening up further to the digital world.

Data for Claudian – Lucan Study

Chris and I recently submitted an article, “Claudian’s Engagement with Lucan in his Historical and Mythological Hexameters,” based on our presentations in Geneva in November 2012, for inclusion in a conference volume to be published by Winter Verlag. It focuses on Claudian’s creation of intertexts (high-scoring bigram lemma matches found by Tesserae) consisting of phrases that are unique between two of his poems, De Raptu Proserpinae and De Consulatu Stilichonis, and Lucan’s Civil War. The idea is that phrases that are unique to Claudian and Lucan could be of particular interest in their intertextual relationship. “Unique” in this case means that the phrases do not appear in the other authors prior to Claudian in our corpus of Latin poetry at the time the article was produced (this included all canonical poets, but the corpus has since grown). To put Claudian’s intertextuality in context, we also produced similar comparisons of the intertextual relationships between prior epic poets and the Aeneid and Civil War respectively.

The data for these comparisons is available in folders through the following links:

Comparison of later epic poets with Vergil’s Aeneid

Comparison of later epic poets with Lucan’s Civil War

WARNING: the files are large, from 4 to 70 MB.


Ranking Results: The Scoring System

Tesserae search begins by matching a minimum of two words in one text with two words in another. The words can be matched either by their exact forms or by their dictionary headwords. Using headword matching permits, for instance, the Latin tuli to match latus, both forms of the headword fero.
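In code, the headword criterion can be sketched as follows. The lemma dictionary here is a toy stand-in for real morphological data, not Tesserae’s actual tables:

```python
# Sketch of headword (lemma) matching.  The LEMMATA entries are invented
# for illustration; Tesserae draws on full Latin morphological data.
LEMMATA = {
    "tuli": {"fero"},
    "latus": {"fero", "latus"},  # ambiguous: participle of fero, or the noun "side"
    "arma": {"arma"},
    "virum": {"vir"},
}

def match_by_headword(word_a, word_b):
    """Two inflected forms match if they share at least one possible headword."""
    lemmas_a = LEMMATA.get(word_a, {word_a})
    lemmas_b = LEMMATA.get(word_b, {word_b})
    return bool(lemmas_a & lemmas_b)

match_by_headword("tuli", "latus")   # True: both can come from fero
match_by_headword("arma", "virum")   # False: no shared headword
```

Because latus is morphologically ambiguous, a form may carry several possible lemmata; sharing any one of them counts as a match.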

For comparisons of even moderate-sized texts, basic matching produces thousands of results. We have therefore created a scoring system to sort results by their likely interest.

Higher scores are given to parallels where the matched words in each text are closer together and where the matched words are more rare. Our testing has found that the top results produced by this method correspond well with the results found by commentators. In other words, preliminary tests show the current Tesserae identification and scoring processes help substantially to identify the most meaningful results.

Full testing of this system is still in progress, however, as are efforts to improve it further. In the meantime, the following description gives a somewhat more detailed account of its function.

First, the frequency of each matching term is calculated by dividing its count within its respective text by the total number of words in that text.

The frequency of a word will thus be different in the source and target texts. In a lemma-based search (the default), the count for a word includes every occurrence of an inflected form with which it shares one or more possible lemmata. These frequencies (very small fractions, even for the most common words) are then inverted and the results added together across both phrases, yielding a very large number. This is then divided by the distance covered by the matching words in the source and target phrases.

Distance in each phrase is calculated as the number of tokens spanned by (and including) the two rarest matching words. The distances from the source and target phrases are added together to make the overall distance. Finally, the natural logarithm of the result is taken. This helps to bring the exponential differences in word frequency that occur in natural language into a more linear, human-interpretable range. For a given parallel, the rarer the matched words are, and the closer together they sit in their respective texts, the higher its score.
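Putting the steps together, the computation might be sketched like this. This is an illustration of the description above, not the exact Tesserae implementation, and the example frequencies are invented:

```python
import math

def score(frequencies, span_source, span_target):
    """Illustrative scoring: sum the inverted frequencies of the matched
    words, divide by the combined token span, and take the natural log."""
    inverse_sum = sum(1.0 / f for f in frequencies)  # rarer words -> larger sum
    distance = span_source + span_target             # spans of both phrases combined
    return math.log(inverse_sum / distance)

# Rare words close together outscore common words spread far apart
# (the frequencies here are invented for illustration):
rare_and_close = score([1 / 50000, 1 / 60000], 2, 2)
common_and_far = score([1 / 200, 1 / 300], 8, 10)
# rare_and_close > common_and_far
```

The logarithm at the end is what compresses the enormous spread of inverse frequencies into the small, comparable range of scores users see.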

Porte-Parole

Chris and I returned last Sunday from the Lucan – Claudian conference held outside Geneva at the Fondation Hardt. I had a great time expanding my knowledge of both poets and meeting colleagues from a variety of institutions and backgrounds. I learned from Professor Jean-Louis Charlet and others various ways in which Claudian was far more than a porte-parole (mouthpiece) for Stilicho.

Our presentations on Tesserae were something of a novelty for this group, but the idea of using computing to trace intertextuality seemed to go over well. One distinguished Italian scholar of Lucan encouraged us with the exhortation vivant Tesserae! (“long live Tesserae!”), and an American scholar generously asked how he and others could help. We also had a productive meeting with our hosts Damien Nelis, Valery Berlincourt, and Lavinia Galli-Milic about further collaboration. They were terrifically gracious to everyone. We hope to see them again before long and continue our discussions.

Chris, Damien Nelis, Neil, Valery Berlincourt, and Yannick Zannetti in front at the Fondation Hardt

Adding New Texts

Adding new texts to the current version of Tesserae is relatively simple. Once we have a text properly formatted, we can run a program to process it, make a few other adjustments (like adding it to drop-down lists and entering its information into our list of sources), and it will be ready to search.

So achieving our goal of adding all Perseus classical Greek and Latin texts this year should be a walk in the park, no?

In fact, there is a bottleneck: getting the texts properly formatted for addition. We need a plain text (in .txt format) that has the proper section markers at the beginning of each line (of poetry) or section (of prose). For Perseus texts, this means automatically stripping out the XML information and inserting the section markers.

We’re moving forward with this work, but would welcome help. A set of instructions gives further detail on how to put texts in the correct format. Anyone who wants to pitch in should email Tesserae Fellow James Gawley for further advice. Note that, although the instructions give some examples of English texts, we have no team members working on English at the moment, and so don’t currently have the capacity to process other English texts.