Stylometry and genre, II

Having seen how the texts of our corpus can be located in a feature space defined by word frequencies, in a manner that illustrates the separation of the two genres, epic and elegy, we next cut the poems into 30-line sections in order to see how stylistic heterogeneity within individual texts compares with the variation among texts and between genres.
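As a rough illustration of the sampling step, here is a minimal Python sketch, assuming each poem is already available as a list of verse lines. The function names, the whitespace tokenizer, and the three-word vocabulary are our own placeholders for this example, not the code used in the course.

```python
from collections import Counter

def chunk_lines(lines, size=30):
    """Split a list of verse lines into consecutive samples of `size` lines."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def relative_frequencies(sample, vocabulary):
    """Return the relative frequency of each vocabulary word in one sample."""
    tokens = [tok for line in sample for tok in line.lower().split()]
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on an empty sample
    return [counts[word] / total for word in vocabulary]

# Toy stand-in for a poem: 65 identical "verses" (hypothetical data).
poem = ["arma virumque cano"] * 65
samples = chunk_lines(poem)
vectors = [relative_frequencies(s, ["arma", "cano", "amor"]) for s in samples]
```

Each sample becomes one point in the word-frequency space, so the spread of points from a single poem can be set beside the distance between poems or between genres. Note that the last sample of a poem may be shorter than 30 lines.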



Tesserae: validation, recall and precision, Ach. 1.1-396

We now have more or less complete data for roughly the first 400 lines, as far as the end of Uccellini’s commentary. The numbers at this point are very similar to what we saw in the first weeks of the course: Tesserae results, after validation by the participants in the class, include only 10% or less of the intertexts noted in the commentaries; at the same time, they add about 20% to the total number of valid intertexts.
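For readers who want to see how such figures can be computed, here is a small Python sketch. It assumes each intertext is reduced to an identifier and that the commentaries serve as the reference set; the identifiers and counts below are invented toy values chosen only to show the arithmetic, not our course data, and `added_share` reflects just one possible reading of "adding to the total".

```python
def recall(found, reference):
    """Share of the reference intertexts that also appear in `found`."""
    return len(found & reference) / len(reference) if reference else 0.0

def precision(validated, returned):
    """Share of the returned results that survived hand validation."""
    return len(validated & returned) / len(returned) if returned else 0.0

def added_share(found, reference):
    """Valid intertexts missing from the reference, relative to its size."""
    return len(found - reference) / len(reference) if reference else 0.0

# Hypothetical identifiers: c* noted by commentaries, t* found only by Tesserae.
commentaries = {f"c{i}" for i in range(10)}
tesserae_returned = {"c0", "t1", "t2"} | {f"noise{i}" for i in range(9)}
tesserae_validated = {"c0", "t1", "t2"}

r = recall(tesserae_validated, commentaries)
p = precision(tesserae_validated, tesserae_returned)
extra = added_share(tesserae_validated, commentaries)
```

With these toy sets, one of ten commentary intertexts is recovered, three of twelve returned results are valid, and two new intertexts are added to the ten already known.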



Stylometry and genre, I

Early in this course, we considered the hypothesis that elegy and epic employ different poetic languages, and that the first book of the Achilleid might reveal its intertextual relationships to these two genres, and so its generically hybrid nature, by the ways in which it re-uses the distinctive elements of each. In this approach we were inspired by, among other things, our reading of Francis Cairns’ chapter “Dido and the Elegiac Tradition,” in which Cairns interprets the presence of elegiac diction and topoi in Aeneid 4. In this and future posts we will explore some simple ways to test our hypothesis by comparing the language of the Achilleid with that of the other works in our corpus of source texts. As a complement to our close reading, however, here we will be looking at large-scale statistical properties rather than specific borrowings.


Achilleid, week two

We now have (more or less) complete results for the first 197 lines of the Achilleid. These include all intertexts to the four source corpora (Vergil’s Aeneid, Ovid’s elegies, Ovid’s Metamorphoses, and Statius’ Thebaid) noted by our four commentaries, as well as Tesserae results, hand-checked by participants in the seminar, for the same four sources. These verses thus present our first opportunity to make a direct comparison of Tesserae’s output against the work of the commentators.


Achilleid course at the University of Geneva

Greetings from Geneva, Switzerland!

We are delighted to announce that a group of Master’s students at the University of Geneva will use Tesserae in a course entitled «Travestissement et transgénéricité: l’Achilléide de Stace» (“Cross-dressing and transgenericity: Statius’ Achilleid”) to explore this difficult and fascinating work from an intertextual point of view. In this poem, written under the sign of ambiguitas, the author prompts the reader to ask what relationship can be established between the choice of a character and of a specific moment in his story (Achilles disguised as a woman at the court of King Lycomedes) and the author’s own questions about style, genre, and gender. Under the direction of Lavinia Galli Milic, we are examining in particular the role (or roles) played by intertextuality in defining the literary genre of this poem, as well as in moving beyond the limits of genre.


R Workshop


On Saturday, April 12, 2014, Christopher Forstall and James Gawley conducted a workshop on Digital Text Analysis for Humanists, using the R software package. The workshop took place at the University at Buffalo (UB), and was sponsored by the Digital Scholarship and Culture Committee of the UB Techne Institute.

The program for the workshop is here: Digital Text Analysis for Humanists Workshop – Program

The data files are here: R_workshop


Latin-Greek search: competing methods

Given the indebtedness of many Latin literary forms to earlier Greek originals, it has long been a goal of ours at Tesserae to one day implement a Latin-Greek search on our site. Currently, word-level n-grams form the foundation of the principal search algorithm. To apply this system where a Latin text alludes to Greek, Tesserae requires a translation dictionary linking Greek lemmata to associated Latin terms.

James Gawley and I are currently working on two different methods for producing such a dictionary. James is working on the “parallel texts” method. This method compares the Greek New Testament with Jerome’s Latin text to probabilistically assign a Latin translation (actually, several likely candidates) to each Greek word. James is writing an algorithm for machine text alignment based on Bayes’ theorem. This algorithm, similar to more complex models such as the IBM methods for machine alignment, looks at the frequency with which each Latin word appears in the same verses as each Greek word.
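To make the idea of verse-level co-occurrence concrete, here is a deliberately simplified Python sketch. It is not James’s algorithm: it only counts how often words share a verse and applies Bayes’ theorem to those counts, and the transliterated three-verse “corpus” is invented for the example.

```python
from collections import Counter
from itertools import product

def translation_candidates(verse_pairs, top_n=1):
    """Rank Latin words for each Greek word by P(latin | greek)."""
    n = len(verse_pairs)
    greek_count, latin_count, joint_count = Counter(), Counter(), Counter()
    for greek_verse, latin_verse in verse_pairs:
        # Count each word once per verse, however often it occurs there.
        g_words, l_words = set(greek_verse.split()), set(latin_verse.split())
        greek_count.update(g_words)
        latin_count.update(l_words)
        joint_count.update(product(g_words, l_words))
    scores = {}
    for (g, l), joint in joint_count.items():
        p_g_given_l = joint / latin_count[l]
        p_l = latin_count[l] / n
        p_g = greek_count[g] / n
        # Bayes' theorem: P(l | g) = P(g | l) * P(l) / P(g)
        scores.setdefault(g, {})[l] = p_g_given_l * p_l / p_g
    return {g: sorted(c, key=c.get, reverse=True)[:top_n]
            for g, c in scores.items()}

# Hypothetical aligned verses (transliterated Greek, Latin).
pairs = [
    ("theos logos", "deus verbum"),
    ("theos phos", "deus lux"),
    ("logos phos", "verbum lux"),
]
candidates = translation_candidates(pairs)
```

On real text, very frequent Latin words would co-occur with almost everything, so a workable version needs some correction for word frequency; that is one of the things the more complex alignment models address.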

My method, the “dictionary method,” uses English as a pivot language. Expanding on a method developed by Jeff Rydberg-Cox at Perseus, I compare entries in the Liddell-Scott Greek-English lexicon with entries in the Lewis and Short Latin-English lexicon using the Gensim topic modelling package. The similarity of a given Greek and Latin headword is determined based on the similarity of their English definitions in the two dictionaries.
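The comparison of definitions can be sketched in a few lines of Python. This version deliberately swaps in a plain bag-of-words cosine similarity written with the standard library in place of the Gensim models; the stopword list is a placeholder, and the English glosses are invented paraphrases for illustration, not quotations from either lexicon.

```python
import math
from collections import Counter

STOPWORDS = {"a", "an", "the", "of", "or", "to", "and"}

def bag_of_words(definition):
    """Lowercase, strip punctuation, and count the non-stopword tokens."""
    tokens = [t.strip(".,;:()").lower() for t in definition.split()]
    return Counter(t for t in tokens if t and t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def best_latin_matches(greek_defs, latin_defs, top_n=1):
    """For each Greek headword, rank Latin headwords by definition similarity."""
    latin_bags = {lemma: bag_of_words(d) for lemma, d in latin_defs.items()}
    result = {}
    for lemma, definition in greek_defs.items():
        g_bag = bag_of_words(definition)
        ranked = sorted(latin_bags,
                        key=lambda l: cosine(g_bag, latin_bags[l]),
                        reverse=True)
        result[lemma] = ranked[:top_n]
    return result

# Hypothetical glosses keyed by transliterated headword.
greek_defs = {"theos": "a god, deity",
              "aitia": "cause, responsibility, blame"}
latin_defs = {"numen": "divine will, a god, deity",
              "causa": "a cause, reason, motive",
              "arma": "arms, weapons"}
matches = best_latin_matches(greek_defs, latin_defs)
```

Raw word overlap is brittle (“cause” and “reason” share nothing), which is part of the motivation for using a topic-modelling package that can relate definitions with different wording.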

Each method produces its own Greek-Latin translation set. These are used to “translate” Tesserae’s existing Greek lemma indices, which can then be searched against the Latin indices. The success of this method depends a lot on how many Greek lemmata we can successfully link with Latin translations (a better term might be “related words”). While it’s still in the alpha stage, it shows a lot of promise.

For example, in the opening of Vergil’s Aeneid, the narrator asks his Muse about the causes of the Trojans’ trials as they wandered with Aeneas:

Musa, mihi causas memora, quo numine laeso (Aen. 1.8)
Muse, remind me of the causes, on account of which god’s anger…

Compare the words of Priam to Helen, as, gazing from the wall at the warriors below, he reflects on the source of the Trojans’ suffering:

οὔ τί μοι αἰτίη ἐσσί, θεοί νύ μοι αἴτιοί εἰσιν (Il. 3.164)
To me, you are not the cause; to me, the gods are the causes…

In this case, the dictionary method allows Tesserae to detect the parallel based on the correspondences numine (“god”) ~ θεοί (“gods”) and causas (“causes”) ~ αἰτίη/αἴτιοι (“cause”/“causes”).
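The way a translated index might surface this parallel can be sketched as follows. Everything here is illustrative: the lemmatizations, the tiny translation table, and the threshold of two shared lemmata are assumptions made for the example rather than a description of Tesserae’s actual code or data.

```python
THEOS, AITIOS = "θεός", "αἴτιος"

def translate_line(greek_lemmata, translations):
    """Map each Greek lemma to its candidate Latin lemmata; skip unknown words."""
    latin = set()
    for lemma in greek_lemmata:
        latin.update(translations.get(lemma, []))
    return latin

def find_parallels(greek_index, latin_index, translations, min_shared=2):
    """Return (latin_ref, greek_ref, shared) for lines sharing enough lemmata."""
    hits = []
    for g_ref, g_lemmata in greek_index.items():
        translated = translate_line(g_lemmata, translations)
        for l_ref, l_lemmata in latin_index.items():
            shared = translated & set(l_lemmata)
            if len(shared) >= min_shared:
                hits.append((l_ref, g_ref, sorted(shared)))
    return hits

# Hypothetical translation table and lemma indices for the two lines above.
translations = {THEOS: ["deus", "numen"], AITIOS: ["causa"]}
greek_index = {"Il. 3.164": ["οὐ", "τις", "ἐγώ", AITIOS, "εἰμί", THEOS, "νυ"]}
latin_index = {"Aen. 1.8": ["musa", "ego", "causa", "memoro", "qui",
                            "numen", "laedo"]}
hits = find_parallels(greek_index, latin_index, translations)
```

Because each Greek lemma can map to several Latin candidates, the translated line is a set of “related words” rather than a single rendering, which is why coverage of the translation table matters so much.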

We’re pitting the two methods against each other, head to head. They’ll be tested by their ability to detect a subset of Aeneid-Iliad parallels collated from G. N. Knauer’s Die Aeneis und Homer by Konnor Clark and Amy Miu, and similar to our Lucan-Vergil benchmark set. For now, you can test them on our development site here. (NB: once you’re at the development page, links lead to other development pages. To leave the develop branch, click on the blog link in the upper right.)

While each of the two methods on its own can identify significant Latin-Greek allusions, we ultimately aim to combine their output in a single feature set. We’re excited to be presenting this work at DHCS 2013 in Chicago this December 5–7.

Tesserae at Digital Humanities 2013

Chris and Walter represented Tesserae at Digital Humanities 2013 in Lincoln, Nebraska.  We presented our ongoing work on the Tesserae scoring system in the new electronic poster format, which was perfectly suited to the material. We got some positive feedback and suggestions for new directions both from the text re-use/stylometry side and from the machine learning/pattern recognition side.


We saw some excellent papers, although with such a diverse conference and up to six parallel sessions each day we had to make some difficult choices. David Hoover spoke about the use of various subsets vs. all of a feature set in authorship attribution, a subject which is interesting to us as we continue to work on methods for feature selection. Christof Schöch discussed working with a corpus unbalanced as to form (e.g. tragedy/comedy, poetry/prose) in stylometry, which caused us to rethink the usefulness of prose works as a comparandum for poetry. Jean-Gabriel Ganascia showed promising new techniques for the detection of text re-use in literature. We’re looking forward to the remainder of the conference, and will post a more detailed report on our return home. Maciej Eder addressed the problem of open-set attribution, an important reality all too often overlooked in authorship and style studies. David Bamman and Adam Anderson presented an analysis of social networks in Old Assyrian correspondence, which beautifully demonstrated how Digital Humanities projects can make rigorous scientific analysis both approachable and relevant for a Humanities audience. Graham Sack demonstrated results from an intriguing model for the structuring of narrative attention in novels.

As we head home from Nebraska we take with us new ideas and new perspectives, already planning our proposal(s) for DH 2014 in Lausanne.

Benchmark Data

*** See our updated benchmark data on our recent blog post “Collected Benchmark Sets” ***

Here are the data produced by our two surveys (in 2010 and 2012) of intertexts between Lucan, Bellum Civile 1, and Vergil, Aeneid.

The 2010 spreadsheet lists parallels reported from six different sources: four professional commentaries and two versions of Tesserae. Each parallel was hand-ranked by members of our team of graduate student and faculty readers.

This is the source of the data reported in the 2012 TAPA and LLC articles.

The 2012 spreadsheet lists all parallels returned by a Version 3 search of the same two texts, plus any parallels found in the commentaries but not returned by Tesserae. The presence of a given parallel in one or more of the four commentaries is represented by the commentators’ initials. This sheet gives hand ranks from both the 2012 and 2010 tests.

Click to Download:
Tesserae 2010 Benchmark
Tesserae 2012 Benchmark

Please feel welcome to contact us with comments or questions on these data.