Lemma + Semantic Matching: Capture More Parallels

A “lemma-based search” identifies the co-occurrence of the same word with different inflections, and this is the basis for version 3.0 of the Tesserae software. In a benchmark test of Pharsalia 1 vs. Aeneid, 55% of parallels previously noted by commentators were retrieved by lemma-based search.

Previous posts have discussed the method of generating metonym (synonym, antonym, hyponym, etc.) dictionaries through the application of topic modeling. In order to capture more of our target intertexts, we worked to generate the most accurate possible metonym dictionary, then combined it with lemmatization in order to simultaneously capture different inflections of the same word and metonyms of that word.
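As a rough illustration of the difference between the two search modes, here is a minimal sketch. The tiny `LEMMAS` and `METONYMS` tables and all function names are invented for this example; they are not Tesserae's data or code, and the real dictionaries are generated by the topic-modeling method described in earlier posts.

```python
# Toy lookup tables, invented for illustration only.
LEMMAS = {
    "arma": "arma", "armis": "arma",
    "tela": "telum", "telis": "telum",
    "virum": "vir", "viri": "vir",
}
METONYMS = {
    "arma": {"telum"},
    "telum": {"arma"},
}

def lemma_of(token):
    """Look up a token's lemma, falling back to the lowercased token."""
    return LEMMAS.get(token.lower(), token.lower())

def lemma_match(a, b):
    """Lemma-only matching: two tokens match if they share a lemma."""
    return lemma_of(a) == lemma_of(b)

def lemma_or_metonym_match(a, b):
    """Combined matching: same lemma, or one lemma listed as a metonym of the other."""
    la, lb = lemma_of(a), lemma_of(b)
    return la == lb or lb in METONYMS.get(la, set()) or la in METONYMS.get(lb, set())

def matching_pairs(target, source, match):
    """List every (target word, source word) pair accepted by `match`."""
    return [(t, s) for t in target for s in source if match(t, s)]

target = ["armis", "viri"]
source = ["tela", "virum"]
lemma_only = matching_pairs(target, source, lemma_match)
combined = matching_pairs(target, source, lemma_or_metonym_match)
```

On this toy input the lemma-only matcher pairs only `viri` with `virum`, while the combined matcher also pairs `armis` with `tela`, which is the kind of extra parallel the new feature is meant to catch.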

In a repeat of the benchmark test above, the new ‘synonym + lemma’ feature retrieved commentator parallels at a rate of 68% (other search settings remained the same).

[Figure: Feature comparison]

View a version 3.1 synonym + lemma benchmark test.

View a version 3.0 lemma-based benchmark test.

A Greek to Latin Dictionary

As mentioned in previous posts, the Tesserae team has been working to create a digital Greek-to-Latin dictionary to aid in the retrieval of cross-language text reuse. Tesserae interns Nathaniel Durant, Theresa Mullin, and Elizabeth Hunter collectively assessed 1,000 Greek words to determine which method for producing a cross-language dictionary, if any, yielded accurate translations. The winner proved to be an enhanced version of Chris Forstall’s topic-model-based ‘pivot’ method.

I crunched the numbers we got from the 1,000 translations tested by our faithful collaborators, and used them to generate the best possible Greek-to-Latin dictionary. Chris’s algorithm produced up to two Latin translations for each Greek word, with a similarity score attached to each translation. I set out to find a good ‘cutoff’ value for that score, balancing precision and recall according to the following criteria:

  1. It was very important to us that we retain at least 1 accurate translation.
  2. It was very important to us that we avoid retaining inaccurate translations.

Because we have two possible translations for most words, it proved best to use two different similarity-score cutoffs for translations A and B. The result is a Greek-to-Latin dictionary which correlates 34,000 Greek words with at least one semantically related Latin word. We have reason to believe that this dictionary is accurate at a rate of 75%-80%, according to our own parameters for accuracy (because we are searching for allusions, this is not a ‘translation’ dictionary; we consider antonyms and all other metonyms to be valid associations).
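To make the two-cutoff idea concrete, here is a minimal sketch of the filtering step. The cutoff values, the similarity scores, and the function names are all placeholders chosen for illustration; the values actually used are not given in this post.

```python
# Hypothetical cutoffs for the first (A) and second (B) candidate translation.
CUTOFF_A = 0.30
CUTOFF_B = 0.50

def filter_entry(candidates, cutoff_a=CUTOFF_A, cutoff_b=CUTOFF_B):
    """Keep each candidate whose score meets the cutoff for its position.

    `candidates` is a list of up to two (latin_word, similarity) pairs,
    translation A first, translation B second.
    """
    cutoffs = (cutoff_a, cutoff_b)
    return [word for (word, score), cutoff in zip(candidates, cutoffs)
            if score >= cutoff]

def build_dictionary(raw):
    """Drop Greek headwords that are left with no translation after filtering."""
    kept = {}
    for greek, candidates in raw.items():
        translations = filter_entry(candidates)
        if translations:
            kept[greek] = translations
    return kept

# Toy input with invented scores.
raw = {
    "θάλασσα": [("mare", 0.82), ("pontus", 0.61)],
    "λόγος": [("verbum", 0.45), ("ratio", 0.20)],
    "ξένος": [("hospes", 0.10)],
}
dictionary = build_dictionary(raw)
```

Raising a cutoff serves the second criterion (fewer inaccurate translations) at the expense of the first (more headwords left with nothing), which is why treating A and B separately gives more room to balance the two.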

Publications on our methodology are forthcoming. For now, please experiment with the tool at http://tesserae.caset.buffalo.edu/cross.php. We welcome your feedback.

Stylometry and Genre, II

Having seen that the texts of our corpus, when located in a feature space defined by word frequencies, cluster by genre (epic and elegy), we next cut the poems into sections of 30 lines in order to see how stylistic variation within individual texts compares with the variation among texts and between genres.
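The sectioning step can be sketched as follows, assuming sections of 30 lines and plain whitespace tokenization. The function names and the toy "poem" are invented for this example; this is not the code used for the analysis.

```python
from collections import Counter

def split_into_sections(lines, size=30):
    """Split a list of verse lines into consecutive sections of `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def relative_frequencies(section, vocabulary):
    """Relative frequency of each vocabulary word within one section."""
    words = [w.lower() for line in section for w in line.split()]
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total if total else 0.0 for w in vocabulary]

# Toy poem: 60 lines of one kind followed by 15 of another.
poem = ["et arma"] * 60 + ["et amor"] * 15
vocabulary = ["et", "arma", "amor"]

sections = split_into_sections(poem)
vectors = [relative_frequencies(s, vocabulary) for s in sections]
```

Each section becomes one point in the same word-frequency space as the whole texts. Note that the final section here has only 15 lines; whether to keep or drop such short remainders is a choice this sketch leaves open.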

[Figure: genre-with-met]

Keep reading →

Tesserae: Validation, Recall and Precision, Ach. 1.1-396

We now have more or less complete data for roughly the first 400 lines, as far as the end of Uccellini’s commentary. The numbers at this point closely resemble what we saw in the first weeks of the course: Tesserae results, after validation by the participants in the class, include only 10% or less of the intertexts noted in the commentaries; at the same time, they add about 20% to the total number of valid intertexts.
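The two percentages measure different things, and a small set-based sketch may help keep them apart. The identifiers and counts below are invented to mirror the rough proportions reported here, and the sketch assumes that the "added" share is measured against the commentators' total; neither is our actual data or procedure.

```python
def compare(commentary, tesserae_valid):
    """Return (recall, gain) for two sets of intertext identifiers.

    recall: share of commentary intertexts also found among validated Tesserae results.
    gain:   validated Tesserae intertexts absent from the commentaries,
            expressed as a share of the commentary total (an assumption).
    """
    shared = commentary & tesserae_valid
    new = tesserae_valid - commentary
    return len(shared) / len(commentary), len(new) / len(commentary)

# Toy data: 100 commentary intertexts, 30 validated Tesserae results,
# 10 of which overlap with the commentaries.
commentary = set(range(100))
tesserae_valid = set(range(90, 120))

recall, gain = compare(commentary, tesserae_valid)
```

With these toy sets, recall is 10% while the validated results add 20% on top of the commentators' list, so a low recall does not prevent a sizable contribution of new intertexts.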

[Figure: Venn diagram of validated intertexts, Ach. 1.1-396]

Keep reading →

Stylometry and Genre, I

Early in this course, we considered the hypothesis that elegy and epic employ different poetic languages, and that the first book of the Achilleid might reveal its generically hybrid nature by the ways in which it re-uses the distinctive elements of each. In this approach we were inspired by, among other things, our reading of Francis Cairns’ chapter “Dido and the Elegiac Tradition,” in which Cairns assesses the specifically elegiac diction and topoi of Aeneid 4. In this and future posts we will explore some simple ways to test our hypothesis, comparing the language employed by the Achilleid with that of the other works in our corpus of source texts. As a complement to our close reading, we will be looking here at large-scale statistical properties rather than at specific borrowings from source text to target text.

Keep reading →

Achilleid, Week Two

We now have (more or less) complete results for the first 197 lines of the Achilleid. These include all references to the four source corpora (Vergil’s Aeneid, Ovid’s elegies, Ovid’s Metamorphoses, and Statius’ Thebaid) noted by our four commentaries, as well as Tesserae results for the same four sources, hand-checked by participants in the seminar. These verses thus present our first opportunity to make a direct comparison of Tesserae’s output against the work of the commentators.

Keep reading →

Augustine vs. The Rhetoricians

The following data is the basis for an article entitled “Paul is the New Cicero: Repurposing Roman Rhetoric in Augustine’s De Doctrina Christiana,” under review with the journal Mouseion. These files are archived here for the benefit of readers who wish to inspect the results of Tesserae comparisons in greater detail than is possible in the article. The first file contains the results of a comparison of Augustine’s De Doctrina Christiana to Cicero’s Orator.

The following links lead to comma-separated-value (CSV) files which can be opened in any spreadsheet editor.

Below are links to tab-separated-value files, whose contents represent raw data collected in a batch Tesserae search:

  • runs: each line represents a single comparison and its details.
  • scores: coded by the numbers found in the ‘runs’ file, each row represents the number of results returned at a given score level.

Achilleid course at the University of Geneva

Greetings from Geneva, Switzerland!

We are delighted to announce that a group of Master’s students at the University of Geneva will be using Tesserae in a course entitled «Travestissement et transgénéricité: l’Achilléide de Stace» (“Cross-dressing and Transgenericity: Statius’ Achilleid”) to explore this difficult and fascinating work from an intertextual point of view. In this poem, written under the sign of ambiguitas, the author prompts the reader to consider how the choice of a character and of a specific moment in his story (Achilles disguised as a woman at the court of King Lycomedes) relates to the author’s own questions about style, genre, and gender. Under the direction of Lavinia Galli Milic, we are examining in particular the role (or roles) played by intertextuality in defining the literary genre of this poem and in pushing beyond the limits of genre.

Keep reading →