Tesserae: validation, recall and precision, Ach. 1.1-396
Stylometry and genre, I
Achilleid, second week
Summary: first week
Hello to the participants in the Statius course: here are the results of the first week of your research, lines 1 to 103 of the Achilleid. I have made a few corrections to what I presented in class on March 12, but for the most part the data are the same.
Commentators’ preferences
Achilleid course at the University of Geneva
Greetings from Geneva, Switzerland!
We are delighted to announce that a group of Master’s students at the University of Geneva will use Tesserae in a course entitled «Travestissement et transgénéricité: l’Achilléide de Stace» (“Cross-dressing and transgenericity: Statius’ Achilleid”) to explore this difficult and fascinating work from an intertextual point of view. In this poem, written under the sign of ambiguitas, the author prompts the reader to ask what relationship can be established between the choice of a character at a specific moment of his story (Achilles disguised as a woman at the court of King Lycomedes) and the author’s own questions about style, genre, and gender. Under the direction of Lavinia Galli Milic, we examine in particular the role(s) played by intertextuality in defining the literary genre of this poem and in transcending the limits of genre.
R Workshop
On Saturday, April 12, 2014 Christopher Forstall and James Gawley conducted a workshop on Digital Text Analysis for Humanists, using the R software package. The workshop took place at the University at Buffalo (UB), and was sponsored by the Digital Scholarship and Culture Committee of the UB Techne Institute.
The program for the workshop is here: Digital Text Analysis for Humanists Workshop – Program
The data files are here: R_workshop
Latin-Greek search: competing methods
Given the indebtedness of many Latin literary forms to earlier Greek originals, it has long been a goal of ours at Tesserae to one day implement a Latin-Greek search on our site. Currently, word-level n-grams form the foundation of the principal search algorithm. To apply this system where a Latin text alludes to Greek, Tesserae requires a translation dictionary linking Greek lemmata to associated Latin terms.
James Gawley and I are currently working on two different methods for producing such a dictionary. James is working on the “parallel texts” method. This method compares the Greek New Testament with Jerome’s Latin text to probabilistically assign a Latin translation (actually, several likely candidates) to each Greek word. James is writing an algorithm for machine text alignment based on Bayes’ theorem. This algorithm, similar to more complex models such as the IBM methods for machine alignment, looks at the frequency with which each Latin word appears in the same verses as each Greek word.
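The verse-level counting described above can be sketched in a few lines of Python. This is only an illustration of the co-occurrence idea, not James’s algorithm: the function name `candidate_translations`, the scoring (the share of a Greek word’s verses that also contain a given Latin word), and the three toy “verses” are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def candidate_translations(aligned_verses, top_n=2):
    """Rank Latin words by how often they share a verse with each Greek word.

    aligned_verses: list of (greek_words, latin_words) pairs, one per verse.
    Returns {greek_word: [(latin_word, score), ...]} with the top_n candidates,
    where score = verses containing both / verses containing the Greek word.
    """
    greek_counts = Counter()
    pair_counts = defaultdict(Counter)
    for greek_words, latin_words in aligned_verses:
        for g in set(greek_words):
            greek_counts[g] += 1
            for l in set(latin_words):
                pair_counts[g][l] += 1

    table = {}
    for g, latin_counter in pair_counts.items():
        scored = [(l, n / greek_counts[g]) for l, n in latin_counter.items()]
        # Highest score first; ties broken alphabetically for stable output.
        scored.sort(key=lambda item: (-item[1], item[0]))
        table[g] = scored[:top_n]
    return table


# Toy aligned "verses" (hypothetical, already reduced to lemmata).
verses = [
    (["θεός", "λόγος"], ["deus", "verbum"]),
    (["θεός", "ἀγάπη"], ["deus", "caritas"]),
    (["λόγος", "ζωή"], ["verbum", "vita"]),
]

table = candidate_translations(verses)
for greek, candidates in table.items():
    print(greek, candidates)
```

A raw score like this favors very frequent Latin words, which is one reason a probabilistic alignment model is preferable on real data.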
My method, the “dictionary method,” uses English as a pivot language. Expanding on a method developed by Jeff Rydberg-Cox at Perseus, I compare entries in the Liddell-Scott Greek-English lexicon with entries in the Lewis and Short Latin-English lexicon using the Gensim topic modelling package. The similarity of a given Greek and Latin headword is determined based on the similarity of their English definitions in the two dictionaries.
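To make the pivot idea concrete, here is a minimal sketch that compares English glosses with a plain bag-of-words cosine similarity. It is a deliberately simple stand-in for the Gensim-based comparison, not the actual pipeline: the helper names, the stopword list, and the short glosses are invented for illustration and are not quoted from Liddell-Scott or Lewis and Short.

```python
import math
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "of", "or", "and", "to", "for", "that", "which"}


def bag_of_words(definition):
    """Lowercase an English definition and count its non-stopword tokens."""
    tokens = re.findall(r"[a-z]+", definition.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_latin_match(greek_definition, latin_definitions):
    """Return the Latin headword whose gloss is most similar to the Greek one."""
    greek_vec = bag_of_words(greek_definition)
    return max(
        latin_definitions,
        key=lambda hw: cosine(greek_vec, bag_of_words(latin_definitions[hw])),
    )


# Hypothetical, heavily abbreviated glosses.
greek_gloss = "culpable, responsible, being the cause"
latin_glosses = {
    "causa": "a cause, reason, motive",
    "numen": "divine will, power of the gods",
    "murus": "a wall",
}

match = best_latin_match(greek_gloss, latin_glosses)
print(match)
```

With real dictionary entries, a weighting scheme or topic model would do much more work than raw token overlap does here.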
Each method produces its own Greek-Latin translation set. These are used to “translate” Tesserae’s existing Greek lemma indices, which can then be searched against the Latin indices. The success of this approach depends heavily on how many Greek lemmata we can link with Latin translations (a better term might be “related words”). While the Latin-Greek search is still in the alpha stage, it shows a lot of promise.
For example, in the opening of Vergil’s Aeneid, the narrator asks his Muse about the causes of the Trojans’ trials as they wandered with Aeneas:
Muse, remind me of the causes, on account of which god’s anger…
Compare the words of Priam to Helen in the Iliad, as, gazing from the wall at the warriors below, he reflects on the source of the Trojans’ suffering:
To me, you are not the cause; to me, the gods are the causes…
In this case, the dictionary method allows Tesserae to detect the parallel based on the correspondences numine (“god”) ~ θεοί (“gods”) and causas (“causes”) ~ αἰτίη/αἴτιοι (“cause”/“causes”).
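The matching step in this example can be sketched as follows. The sketch assumes that a candidate parallel needs at least two shared lemmata once the Greek side has been “translated”; the function names, the tiny dictionary, and the partial lemma sets are illustrative assumptions rather than Tesserae’s actual data or code.

```python
def translate_lemmata(greek_lemmata, dictionary):
    """Map each Greek lemma to its related Latin lemmata and pool the results."""
    translated = set()
    for lemma in greek_lemmata:
        translated.update(dictionary.get(lemma, ()))
    return translated


def shared_lemmata(latin_lemmata, greek_lemmata, dictionary):
    """Latin lemmata that also appear among the translated Greek lemmata."""
    return set(latin_lemmata) & translate_lemmata(greek_lemmata, dictionary)


def is_candidate_parallel(latin_lemmata, greek_lemmata, dictionary, min_shared=2):
    """True when the two passages share at least min_shared (assumed: 2) lemmata."""
    return len(shared_lemmata(latin_lemmata, greek_lemmata, dictionary)) >= min_shared


# Hypothetical Greek-to-Latin "related words" entries.
dictionary = {
    "θεός": {"deus", "numen"},
    "αἴτιος": {"causa"},
}

# Partial, hand-made lemma sets for the two passages discussed above.
latin_line = {"musa", "causa", "memoro", "numen"}
greek_line = {"θεός", "αἴτιος"}

shared = shared_lemmata(latin_line, greek_line, dictionary)
print(sorted(shared), is_candidate_parallel(latin_line, greek_line, dictionary))
```

Because each Greek lemma may map to several Latin candidates, a broader dictionary raises the chance of finding a match but also the chance of spurious ones.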
We’re pitting the two methods against each other, head to head. They’ll be tested by their ability to detect a subset of Aeneid–Iliad parallels collated from G. N. Knauer’s Die Aeneis und Homer by Konnor Clark and Amy Miu, and similar to our Lucan-Vergil benchmark set. For now, you can test them on our development site here. (NB: once you’re at the development page, links lead to other development pages. To leave the develop branch click on the blog link in the upper right.)
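One simple way to score such a head-to-head test is recall (the share of benchmark parallels a method returns) alongside precision (the share of returned parallels that are in the benchmark). The sketch below assumes parallels are represented as pairs of locus identifiers; the identifiers and both result sets are placeholders, not real benchmark data or real results.

```python
def recall(returned, benchmark):
    """Fraction of benchmark parallels that the search returned."""
    if not benchmark:
        return 0.0
    return len(returned & benchmark) / len(benchmark)


def precision(returned, benchmark):
    """Fraction of returned parallels that appear in the benchmark."""
    if not returned:
        return 0.0
    return len(returned & benchmark) / len(returned)


# Placeholder (latin_locus, greek_locus) pairs standing in for benchmark entries.
benchmark = {("L1", "G1"), ("L2", "G2"), ("L3", "G3"), ("L4", "G4")}

# Placeholder outputs for the two methods.
parallel_texts_hits = {("L1", "G1"), ("L2", "G2"), ("L9", "G9")}
dictionary_hits = {("L1", "G1"), ("L3", "G3"), ("L4", "G4"), ("L7", "G7"), ("L8", "G8")}

for name, hits in [("parallel texts", parallel_texts_hits), ("dictionary", dictionary_hits)]:
    print(name, round(recall(hits, benchmark), 2), round(precision(hits, benchmark), 2))
```

Reporting both figures matters because a method can raise recall simply by returning more results, at the cost of precision.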
While each of the two methods on its own can identify significant Latin-Greek allusions, we ultimately aim to combine their output in a single feature set. We’re excited to be presenting this work at DHCS 2013 in Chicago this December 5–7.
Tesserae at Digital Humanities 2013
Chris and Walter represented Tesserae at Digital Humanities 2013 in Lincoln, Nebraska. We presented our ongoing work on the Tesserae scoring system in the new electronic poster format, which was perfectly suited to the material. We got some positive feedback and suggestions for new directions both from the text re-use/stylometry side and from the machine learning/pattern recognition side.
We saw some excellent papers, although with such a diverse conference and up to six parallel sessions each day we had to make some difficult choices. David Hoover spoke about the use of various subsets vs. all of a feature set in authorship attribution, a subject which is interesting to us as we continue to work on methods for feature selection. Christof Schöch discussed working with a corpus unbalanced as to form (e.g. tragedy/comedy, poetry/prose) in stylometry, which caused us to rethink the usefulness of prose works as a comparandum for poetry. Jean-Gabriel Ganascia showed promising new techniques for the detection of text re-use in literature. We’re looking forward to the remainder of the conference, and will post a more detailed report on our return home. Maciej Eder addressed the problem of open-set attribution, an important reality all too often overlooked in authorship and style studies. David Bamman and Adam Anderson presented an analysis of social networks in Old Assyrian correspondence, which beautifully demonstrated how Digital Humanities projects can make rigorous scientific analysis both approachable and relevant for a Humanities audience. Graham Sack demonstrated results from an intriguing model for the structuring of narrative attention in novels.
As we head home from Nebraska we take with us new ideas and new perspectives, already planning our proposal(s) for DH 2014 in Lausanne.
Benchmark Data
*** See our updated benchmark data on our recent blog post “Collected Benchmark Sets” ***
Here are the data produced by our two surveys (in 2010 and 2012) of intertexts between Lucan, Bellum Civile 1, and Vergil, Aeneid.
The 2010 spreadsheet lists parallels reported from six different sources: four professional commentaries and two versions of Tesserae. Each parallel was hand-ranked by members of our team of graduate student and faculty readers.
This is the source of the data reported in the 2012 TAPA and LLC articles.
The 2012 spreadsheet lists all parallels returned by a Version 3 search of the same two texts, plus any parallels found in the commentaries but not returned by Tesserae. The presence of a given parallel in one or more of the four commentaries is represented by the commentators’ initials. This sheet gives hand ranks from both the 2012 and 2010 tests.
Click to Download:
Tesserae 2010 Benchmark
Tesserae 2012 Benchmark
Please feel welcome to contact us with comments or questions on these data.