A Greek to Latin Dictionary

As mentioned in previous posts, the Tesserae team has been working to create a digital Greek to Latin dictionary to aid in the retrieval of cross-language text reuse. Tesserae interns Nathaniel Durant, Theresa Mullin, and Elizabeth Hunter collectively assessed 1,000 Greek words, determining which method, if any, for producing a cross-language dictionary yielded accurate translations. The winner proved to be an enhanced version of Chris Forstall’s topic-model-based ‘pivot’ method.

I crunched the numbers we got from the 1,000 translations tested by our faithful collaborators, and used them to generate the best possible Greek-to-Latin dictionary. Chris’s algorithm produced up to two Latin translations for each Greek word, with a similarity value attached to each translation. I set out to find a good ‘cutoff’ value for the probability of a translation. I balanced precision and recall according to the following criteria:

  1. It was very important to us that we retain at least 1 accurate translation.
  2. It was very important to us that we avoid retaining inaccurate translations.

Because most words have two possible translations, it proved best to use two different similarity-score cutoffs for translations A and B. The result is a Greek-to-Latin dictionary which correlates 34,000 Greek words with at least one semantically related Latin word. We have reason to believe that this dictionary is accurate at a rate of 75%–80%, according to our own parameters for accuracy (because we are searching for allusions, this is not a ‘translation’ dictionary; we consider antonyms and all other metonyms to be valid associations).
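The dual-cutoff filtering described above can be sketched in a few lines of Python. The threshold values and the entry format here are illustrative assumptions, not the ones we actually settled on:

```python
# Sketch of a two-cutoff filter over translation candidates.
# The cutoff values below are hypothetical placeholders.
CUTOFF_A = 0.5   # threshold for the top-ranked translation
CUTOFF_B = 0.7   # stricter threshold for the second-ranked translation

def filter_entry(candidates, cutoff_a=CUTOFF_A, cutoff_b=CUTOFF_B):
    """candidates: list of (latin_word, similarity) pairs, best first.
    Keep translation A if it clears cutoff_a, and translation B if it
    clears cutoff_b; return the surviving translations."""
    kept = []
    if len(candidates) > 0 and candidates[0][1] >= cutoff_a:
        kept.append(candidates[0][0])
    if len(candidates) > 1 and candidates[1][1] >= cutoff_b:
        kept.append(candidates[1][0])
    return kept

# e.g. filter_entry([("amo", 0.9), ("diligo", 0.6)]) -> ["amo"]
```

Using separate cutoffs lets the second candidate be held to a higher standard of evidence, which is how we balanced the two criteria above.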

Publications on our methodology are forthcoming. For now, please experiment with the tool at http://tesserae.caset.buffalo.edu/cross.php. We welcome your feedback.

Augustine vs. The Rhetoricians

The following data is the basis for an article entitled “Paul is the New Cicero: Repurposing Roman Rhetoric in Augustine’s De Doctrina Christiana,” under review with the journal Mouseion. These files are archived here for the benefit of readers who wish to inspect the results of Tesserae comparisons in greater detail than is possible in the article. The first file contains the results of a comparison of Augustine’s De Doctrina Christiana to Cicero’s Orator.

The following links lead to comma-separated-value (CSV) files which can be opened in any spreadsheet editor.

Below are links to tab-separated-value files, whose contents represent raw data collected in a batch Tesserae search:

  • runs: each line represents a single comparison and its details.
  • scores: coded by the numbers found in the ‘runs’ file, each row represents the number of results returned at a given score level.

Knauer’s list of parallels between Aeneid (Book 1) and the Iliad

Knauer’s original commentary on the Aeneid listed places of parallelism with Homer’s Iliad, but did not specify criteria for intertextuality. The Google Docs spreadsheet below pairs Knauer’s citations of Aeneid Book I with his citations of the Iliad and lists the verbal correspondences between the Latin and the Greek. The work began in September 2013 and was edited and expanded intermittently until June 2014, by which point it was largely complete.

It is free to use with credit to Tesserae and Konnor Clark, who compiled the list.

https://docs.google.com/spreadsheet/ccc?key=0AmBfs72ChHaodDJPV2s1Mk1EeW5lRm5HNnRLN1hHV2c&usp=sharing

Discovering Intertextuality with Sequence Alignment

Our approach to intertextuality begins from the time-tested technique of word-level n-gram matching, i.e., matching words in one text with those in another. I recently had a chance to meet with Peter Leonard of Yale, who reacquainted me with another approach he was involved with at the University of Chicago, called sequence alignment. This work was led by Mark Olsen as part of the ARTFL project at Chicago. The method searches for sequences of letters in a more flexible, adaptive way. It’s useful to keep it in mind as another important approach for comparison and evaluation.

More information is available in this slide deck.

How the text-alignment method works

As explained by Chris Forstall in his earlier post, we are currently experimenting with a new cross-language detection feature over on the Tesserae Development server. We are using two different approaches, and the naïve Bayesian alignment approach bears a little explanation. The purpose of this post is to provide a simple introduction to the theory behind the algorithm; a link to my Perl script, which aligns two texts in Tesserae format, will be provided at the end.

To begin with, let’s assume we have a corpus which consists of the same text in two languages. Let’s further assume that our texts are perfectly aligned, sentence-by-sentence (the difficulty of finding texts like this has led us to use the New Testament for our experiments). We want to know which word in language A corresponds to which word in language B. Initially, we assign each word an equal probability. Here’s a simple example sentence in Greek and Latin:

Sentence A (Language A): Amo libros legere
Sentence B (Language B): Φιλω βιβλους ἀναγιγνωσκειν

We’re going to try to figure out which word is a translation of Amo. First we assign an equal probability to all translation candidates. Because there are three words in Sentence B, the probability that Amo corresponds to Φιλω is 0.33, and the probability that it corresponds to βιβλους is also 0.33 (remember that a probability of 1.0 means that something is definitely true). The key to correctly lining up Latin words with their Greek translations is repetition. Let’s add another aligned sentence to our comparison:

Sentence A (Language A): Amo philosophiam
Sentence B (Language B): Φιλω φιλοσοφιαν

This time, the sentence from language B doesn’t contain βιβλους or ἀναγιγνωσκειν, so it’s less likely that either of those are legitimate translations for Amo. Φιλω has also appeared again, so the probability assigned to a possible Amo/Φιλω alignment is increased.
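The intuition behind these two examples can be reduced to simple co-occurrence counting. The sketch below is a toy illustration (not the actual alignment script): it tallies how often each Greek word appears in sentences aligned with each Latin word, then normalizes the counts for Amo into probabilities:

```python
from collections import defaultdict

# The two aligned sentence pairs from the examples above.
pairs = [
    ("Amo libros legere", "Φιλω βιβλους ἀναγιγνωσκειν"),
    ("Amo philosophiam", "Φιλω φιλοσοφιαν"),
]

# Count co-occurrences of every (Latin word, Greek word) pair.
cooc = defaultdict(lambda: defaultdict(int))
for latin, greek in pairs:
    for l in latin.split():
        for g in greek.split():
            cooc[l][g] += 1

# Normalize the counts for "Amo" into probabilities.
amo = cooc["Amo"]
total = sum(amo.values())
probs = {g: n / total for g, n in amo.items()}
# Φιλω now carries the most weight (2 of 5 co-occurrences = 0.4),
# while βιβλους, ἀναγιγνωσκειν, and φιλοσοφιαν each fall to 0.2.
```

Repetition across many more sentence pairs is what eventually separates the true translation from words that merely happen to share a sentence with it.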

The equation that smooths out the probabilities of each conceivable alignment over the course of many, many sentences is called Bayes’ theorem. It looks like this:

P(A|B) = P(B|A) × P(A) / P(B)

Here’s what the first part, P(A|B), means to us: “the probability that word A in language A is a correct translation of word B in language B.” The next part, P(B|A), means “the probability that word B in language B is the correct translation of word A in language A.” You’ll notice that putting these two statements on opposite sides of an ‘equals’ sign looks a little like circular logic. The key here is that Bayes’ theorem works backward in order to more appropriately weight the probability associated with each possible translation candidate. This will become clearer in the next paragraph. The rest of the equation has to do with ‘smoothing’ the results; remember that our goal is to correctly weight these probabilities according to the pattern which emerges through repetition. The next two parts, P(A) and P(B), mean, for our purposes, “the probability of word A occurring in language A” and “the probability of word B occurring in language B.” For these probabilities we substitute “the number of occurrences of word B in the ‘language B’ (or word A in the ‘language A’) text, divided by the total number of words in that text.”

Because Bayes’ theorem works backward from translation to antecedent, the application of this theorem in text alignment can look a bit complicated. This is how it works: to determine P(A|B) for any given Latin word, the program looks at all the sentences (actually Bible verses in our corpus) which contain that word. We’ll call this Verse Group 1. The program then gathers up all the Greek words in the corresponding Greek verses. These Greek words are our translation candidates, and we look at each of them in turn. To calculate P(B|A) (the probability of the original Latin word, given the current Greek translation candidate), the program looks at all the Greek verses which contain the translation candidate. We can call this group of verses ‘Verse Group 2.’ The program then gathers up all the Latin words in the Latin versions of Verse Group 2. The important factor here is that we’re grabbing a different set of verses than those in Verse Group 1. The amount of overlap between Verse Group 1 and Verse Group 2 depends on how good a translation candidate we’re looking at. In other words, when we look back from Greek to Latin, we may find verses that don’t contain the original Latin word under scrutiny. This is especially true if the Greek translation candidate is not actually the word we ultimately want; if we are looking at the wrong Greek word, we’ll end up gathering a bunch of Latin verses which don’t contain our original word and that will lower the value of P(B|A).
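For the curious, here is a simplified sketch of that two-step lookup. The data structures (parallel lists of tokenized verses) are assumptions for illustration; the actual script operates on Tesserae-format texts, and on a real corpus repetition across thousands of verses is what makes these estimates meaningful:

```python
def align_probs(latin_word, latin_verses, greek_verses):
    """Estimate P(A|B) for every Greek translation candidate of
    latin_word, given aligned lists of tokenized verses."""
    n_latin = sum(len(v) for v in latin_verses)
    n_greek = sum(len(v) for v in greek_verses)
    # P(A): frequency of the Latin word across the whole Latin text
    p_a = sum(v.count(latin_word) for v in latin_verses) / n_latin

    # Verse Group 1: Latin verses containing the word under scrutiny
    group1 = [i for i, lv in enumerate(latin_verses) if latin_word in lv]
    # Translation candidates: every Greek word in the aligned Greek verses
    candidates = {w for i in group1 for w in greek_verses[i]}

    scores = {}
    for cand in candidates:
        # Verse Group 2: Greek verses containing the candidate
        group2 = [i for i, gv in enumerate(greek_verses) if cand in gv]
        # Gather the Latin words from the Latin side of Verse Group 2
        pool = [w for i in group2 for w in latin_verses[i]]
        # P(B|A): how often the original Latin word shows up in that pool
        p_b_given_a = pool.count(latin_word) / len(pool)
        # P(B): frequency of the candidate across the whole Greek text
        p_b = sum(v.count(cand) for v in greek_verses) / n_greek
        # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
        scores[cand] = p_b_given_a * p_a / p_b
    return scores
```

If the candidate is a poor translation, Verse Group 2 diverges from Verse Group 1, the original Latin word becomes rare in the gathered pool, and P(B|A) drops accordingly.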

The rest of the program is what my high-school physics teacher used to call “plug and chug.” ‘Probabilities’ are really just the number of times that a given word appears divided by the total number of words in the group in which it appears. An important feature of this approach is that for each word we examine, the program returns the probability of an alignment between that word and each possible translation word, just like in the first set of sentences at the top of this post. Many tools for this type of operation can be found online; a popular one is mGIZA. My own code for this project can be found on GitHub.

Feel free to ask questions or leave feedback in the comments section.


Latin-Greek search: competing methods

Given the indebtedness of many Latin literary forms to earlier Greek originals, it has long been a goal of ours at Tesserae to one day implement a Latin-Greek search on our site. Currently, word-level n-grams form the foundation of the principal search algorithm. To apply this system where a Latin text alludes to Greek, Tesserae requires a translation dictionary linking Greek lemmata to associated Latin terms.

James Gawley and I are currently working on two different methods for producing such a dictionary. James is working on the “parallel texts” method. This method compares the Greek New Testament with Jerome’s Latin text to probabilistically assign a Latin translation (actually, several likely candidates) to each Greek word. James is writing an algorithm for machine text alignment based on Bayes’ theorem. This algorithm, similar to more complex models such as the IBM methods for machine alignment, looks at the frequency with which each Latin word appears in the same verses as each Greek word.

My method, the “dictionary method,” uses English as a pivot language. Expanding on a method developed by Jeff Rydberg-Cox at Perseus, I compare entries in the Liddell-Scott Greek-English lexicon with entries in the Lewis and Short Latin-English lexicon using the Gensim topic modelling package. The similarity of a given Greek and Latin headword is determined based on the similarity of their English definitions in the two dictionaries.
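As a rough illustration of the pivot idea (a stripped-down stand-in, not the actual Gensim topic-model pipeline), one can score a Greek and a Latin headword by the cosine similarity of bag-of-words vectors built from their English definitions. The definitions below are toy examples, not real lexicon entries:

```python
from collections import Counter
from math import sqrt

def cosine(def_a, def_b):
    """Cosine similarity of two English definitions, treated as
    bags of lowercased words."""
    va, vb = Counter(def_a.lower().split()), Counter(def_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (sqrt(sum(c * c for c in va.values()))
            * sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

# Toy glosses standing in for a Greek and a Latin dictionary entry:
# overlapping English wording ("to love") yields a high score.
sim = cosine("to love regard with affection",
             "to love to like be fond of")  # roughly 0.45
```

The real implementation works the same way in spirit, but compares topic-model representations of full Liddell-Scott and Lewis and Short entries rather than raw word overlap.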

Each method produces its own Greek-Latin translation set. These are used to “translate” Tesserae’s existing Greek lemma indices, which can then be searched against the Latin indices. The success of this method depends a lot on how many Greek lemmata we can successfully link with Latin translations (a better term might be “related words”). While it’s still in the alpha stage, it shows a lot of promise.

For example, in the opening of Vergil’s poem, the narrator asks his Muse about the causes of the Trojans’ trials as they wandered with Aeneas:

Musa, mihi causas memora, quo numine laeso (Aen. 1.8)
Muse, remind me of the causes, on account of which god’s anger…

Compare the words of Priam to Helen, as, gazing from the wall at the warriors below, he reflects on the source of the Trojans’ suffering:

οὔ τί μοι αἰτίη ἐσσί, θεοί νύ μοι αἴτιοί εἰσιν (Il. 3.164)
To me, you are not the cause; to me, the gods are the causes…

In this case, the dictionary method allows Tesserae to detect the parallel based on the correspondences numine (“god”) ~ θεοί (“gods”) and causas (“causes”) ~ αἰτίη/αἴτιοι (“cause”/“causes”).

We’re pitting the two methods against each other, head to head. They’ll be tested by their ability to detect a subset of Aeneid–Iliad parallels collated from G. N. Knauer’s Die Aeneis und Homer by Konnor Clark and Amy Miu, a set similar to our Lucan–Vergil benchmark set. For now, you can test them on our development site here. (NB: once you’re at the development page, links lead to other development pages. To leave the development branch, click on the blog link in the upper right.)

While each of the two methods on its own can identify significant Latin-Greek allusions, we ultimately aim to combine their output in a single feature set. We’re excited to be presenting this work at DHCS 2013 in Chicago this December 5–7.

Data for Claudian – Lucan Study

Chris and I recently submitted an article, “Claudian’s Engagement with Lucan in his Historical and Mythological Hexameters,” based on our presentations in Geneva in November 2012, for inclusion in a conference volume to be published by Winter Verlag. It focuses on Claudian’s creation of intertexts (high-scoring bigram lemma matches found by Tesserae) consisting of phrases that are unique between two of his poems, De Raptu Proserpinae and De Consulatu Stilichonis, and Lucan’s Civil War. The idea is that phrases unique to Claudian and Lucan could be of particular interest in their intertextual relationship. “Unique” in this case means that the phrases do not appear in any other author prior to Claudian in our corpus of Latin poetry as it stood when the article was produced (it included all canonical poets, but the corpus has since grown). To put Claudian’s intertextuality in context, we also produced similar comparisons of the intertextual relationships between prior epic poets and the Aeneid and Civil War respectively.

The data for these comparisons is available in folders through the following links:

Comparison of later epic poets with Vergil’s Aeneid

Comparison of later epic poets with Lucan’s Civil War

WARNING: the files are large, from 4 to 70 MB.


Benchmark Data

*** See our updated benchmark data on our recent blog post “Collected Benchmark Sets” ***

Here is the data produced by our two surveys (in 2010 and 2012) of intertexts between Lucan, Bellum Civile 1, and Vergil, Aeneid.

The 2010 spreadsheet lists parallels reported from six different sources: four professional commentaries and two versions of Tesserae. Each parallel was hand ranked by members of our team of graduate student and faculty readers.

This is the source of the data reported in the 2012 TAPA and LLC articles.

The 2012 spreadsheet lists all parallels returned by a Version 3 search of the same two texts, plus any parallels found in the commentaries but not returned by Tesserae. The presence of a given parallel in one or more of the four commentaries is represented by the commentators’ initials. This sheet gives hand ranks from both the 2012 and 2010 tests.

Click to Download:
Tesserae 2010 Benchmark
Tesserae 2012 Benchmark

Please feel welcome to contact us with comments or questions on these data.

Slight Score Change

We’ve recently fixed a small bug in the scoring system, and you may notice that some scores are higher than they used to be. Scores are calculated as floating-point values, but displayed as integers in the web interface. Until now the decimal part of the score was simply truncated, so that every score was effectively rounded down to the whole number below it. From now on we will use the more customary rounding rules: fractional parts of .5 and above are rounded up. If you compare the results of a search done now to those of the same search before the change, you can expect about half of the scores to be one point higher.
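The change can be shown in a few lines of Python (an illustration of the two rounding behaviors, not the Tesserae code itself):

```python
import math

def old_display(score):
    """Old behavior: truncate the decimal part."""
    return int(score)               # 7.9 -> 7

def new_display(score):
    """New behavior: customary round-half-up. Python's built-in
    round() uses banker's rounding (round(7.5) == 8 but round(6.5)
    == 6), so floor(x + 0.5) is used instead."""
    return math.floor(score + 0.5)  # 7.9 -> 8, 7.5 -> 8, 7.4 -> 7
```

Any score whose fractional part is .5 or greater now displays one point higher than before.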

As we continue our research on quantifying the literary significance of allusions, scores may change again, and perhaps significantly.  We will post notice and explanations of any such changes here. If for any reason you need access to a previous version of the software in order to replicate older results, please just let us know and we can help you.  Every version of Tesserae, once published on our web site, is archived and can be retrieved.