Data for Claudian – Lucan Study

Chris and I recently submitted an article, “Claudian’s Engagement with Lucan in his Historical and Mythological Hexameters,” based on our presentations in Geneva in November 2012, for inclusion in a conference volume to be published by Winter Verlag. It focuses on Claudian’s creation of intertexts (high-scoring bigram lemma matches found by Tesserae) consisting of phrases shared exclusively between two of his poems, De Raptu Proserpinae and De Consulatu Stilichonis, and Lucan’s Civil War. The idea is that phrases unique to Claudian and Lucan could be of particular interest for their intertextual relationship. “Unique” in this case means that the phrases do not appear in other authors prior to Claudian in our corpus of Latin poetry as it stood when the article was produced (it included all the canonical poets, but has since grown). To put Claudian’s intertextuality in context, we also produced similar comparisons of the intertextual relationships between prior epic poets and the Aeneid and the Civil War respectively.

The data for these comparisons is available in folders through the following links:

Comparison of later epic poets with Vergil’s Aeneid

Comparison of later epic poets with Lucan’s Civil War

WARNING: the files are large, ranging from 4 MB to 70 MB.

 

Benchmark Data

*** See our updated benchmark data on our recent blog post “Collected Benchmark Sets” ***

Here is the data produced by our two surveys (in 2010 and 2012) of intertexts between Lucan, Bellum Civile 1, and Vergil, Aeneid.

The 2010 spreadsheet lists parallels reported from six different sources: four professional commentaries and two versions of Tesserae. Each parallel was hand ranked by members of our team of graduate student and faculty readers.

This is the source of the data reported in the 2012 TAPA and LLC articles.

The 2012 spreadsheet lists all parallels returned by a Version 3 search of the same two texts, plus any parallels found in the commentaries but not returned by Tesserae. The presence of a given parallel in one or more of the four commentaries is represented by the commentators’ initials. This sheet gives hand ranks from both the 2012 and 2010 tests.

Click to Download:
Tesserae 2010 Benchmark
Tesserae 2012 Benchmark

Please feel welcome to contact us with comments or questions on these data.

Slight Score Change

We’ve recently fixed a small bug in the scoring system, and you may notice that some scores are higher than they used to be. Scores are calculated as floating-point values but displayed as integers in the web interface. Until now the decimal part of the score was simply truncated, so that every score was effectively rounded down to the next lower integer. From now on we will use the more customary rounding rules, so that fractional parts of .5 and above are rounded up. If you compare the results of a search done now with those of the same search before the change, you can expect about half of the scores to be one point higher.
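The difference amounts to this (a minimal sketch, not the actual Tesserae code; the raw score values are hypothetical):

```python
import math

def display_old(score):
    """Old behavior: truncate the decimal part of the score."""
    return int(score)

def display_new(score):
    """New behavior: customary rounding, with .5 and above rounding up."""
    return math.floor(score + 0.5)
```

A hypothetical raw score of 7.8 used to display as 7 and now displays as 8, while 7.4 displays as 7 either way. Note the explicit floor(x + 0.5): Python’s built-in round() sends exact halves to the nearest even integer, which is not the customary rule described above.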

As we continue our research on quantifying the literary significance of allusions, scores may change again, and perhaps significantly.  We will post notice and explanations of any such changes here. If for any reason you need access to a previous version of the software in order to replicate older results, please just let us know and we can help you.  Every version of Tesserae, once published on our web site, is archived and can be retrieved.

 

Synsets vs. Similarities

Harry Diakoff has shared with us a set of Greek synsets—groups of words purported to be mutually synonymous—while Tesserae has an algorithm which is supposed to measure semantic similarity between any two words. While both approaches ultimately are based upon the Perseus XML version of the Liddell-Scott-Jones lexicon, they employ very different methods. We hope that by comparing the two we can improve upon both.

Synsets

I don’t really know how these were created—maybe by translating entries in the English WordNet?  Harry, can you fill in any details here?  The important characteristics of the synonym sets are:

  • each set has a unique id
  • within a set, all relationships are presumed to be mutual and symmetrical
  • words can belong to more than one set

Similarities

Tesserae calculates word similarities using the Python topic modelling package Gensim.  We treat every entry in LSJ as a “document,” which is digested to produce a bag of English words used in defining the Greek headword.  These English words are TF-IDF weighted and used to create a feature vector describing the headword.  Headwords are compared using gensim.similarities.Similarity()—for any query word this returns a score between 0 and 1 for every other word in the corpus.  In addition to this absolute similarity score, we can also sort all results by score and consider the rank position of a given result some measure of its relationship to the query word.

  • each pair of words has a unique similarity score
    • some words within a synset can be more alike than others;
    • but homonyms are flattened
  • this similarity is symmetrical, but the rank positions aren’t:
    • the rank of result B given query A is not the same as that of A given B
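The pipeline can be sketched in miniature without depending on Gensim itself. The glosses below are invented stand-ins for real LSJ definitions, not actual dictionary data:

```python
import math
from collections import Counter

# Toy "lexicon": each Greek headword maps to a bag of English gloss
# words.  These glosses are invented stand-ins, not real LSJ entries.
lexicon = {
    "ἵππος": ["horse", "steed", "animal"],
    "κέλης": ["riding", "horse", "racehorse"],
    "ξίφος": ["sword", "blade", "weapon"],
}

n_docs = len(lexicon)
# Document frequency: how many entries each gloss word appears in.
df = Counter(w for bag in lexicon.values() for w in set(bag))

def tfidf(bag):
    """TF-IDF weight each gloss word in a headword's bag of words."""
    tf = Counter(bag)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts)."""
    norm = lambda x: math.sqrt(sum(t * t for t in x.values()))
    if norm(u) == 0.0 or norm(v) == 0.0:
        return 0.0
    return sum(u[w] * v.get(w, 0.0) for w in u) / (norm(u) * norm(v))

vectors = {w: tfidf(bag) for w, bag in lexicon.items()}
```

Sorting every headword in the corpus by its similarity to a query word then yields the rank positions discussed above.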

You can see the code I used to calculate these metrics in the synonymy directory of the tess.experiments git repository. But be patient: it’s still quite rough; please feel free to improve it.

Current problems

Both of these datasets have their difficulties.  Each set is probably missing some headwords.  The synsets include some false positives and negatives.  The similarity scores can’t be turned into lists of synonyms without a threshold—either a similarity or rank position—that defines synonymy.

Ultimately, we need ground-truthing.  What follows is merely a first attempt to compare the two approaches and figure out to what degree they are in agreement, and to get some ideas about where and in what ways they differ.

A first glance

What I’ve done here is to break Harry’s synsets down into pairwise relationships, and then measure similarity and rank position (in each direction) for all of the pairs that exist in the Tesserae similarity matrix.
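The decomposition step can be sketched like this. The synsets here are abbreviated stand-ins (the ids and the first pair come from the sample data below; the third member of the second set is invented to show how larger sets multiply into pairs):

```python
from itertools import combinations

# Toy synsets: id -> member words.  The third member of the second
# set is hypothetical, added to illustrate multi-word sets.
synsets = {
    454121: ["θαλασσουργία", "ἁλιεία"],
    4353803: ["συμπάρειμι", "ἔησθα", "πάρειμι"],
}

# Break each synset into unordered pairs, recording which synset
# id(s) each pair came from (a pair can appear in several sets).
pairs = {}
for sid, members in synsets.items():
    for a, b in combinations(sorted(members), 2):
        pairs.setdefault((a, b), set()).add(sid)
```

A two-member set yields one pair, while an n-member set yields n(n−1)/2, which is how 17,342 synsets can balloon into 235,702 pairs.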

Harry sent us 17,342 synsets, which decomposed into 235,702 unique pairs of words.  Of these, 174,816 pairs returned results from a Tesserae similarity query.  In the remaining cases, one or both of Harry’s words didn’t exist in our corpus.  Although we both used the same dictionary, we each had our troubles reading it; more about this in another post.

Initial results look like this:

KEYPAIR SIM RANKA RANKB SYNSETS
θαλασσουργία->ἁλιεία 0.522157 71 216 454121;453935
κόπρος->σπατίλη 0.301455 38 55 14853947
συμπάρειμι->ἔησθα 0.239762 228 136 4353803;7959016;5861067;…
τρωπάω->ἐξυπτιάζω 0.139253 1250 1090 346532;7423365;457382;…
κηλιδόω->μελαντήριονστίγμα None None None 6794666

SIM gives the similarity score for each pair; RANKA gives the rank position of the righthand member among results when the lefthand member is used as the query; RANKB, the rank position of the lefthand member when the righthand one is the query; and SYNSETS gives the id number(s) of the synsets in which this pair appears. In the bottom row you see what happens when Tesserae can’t find one or both of the words—in this case it seems that one member of the pair is actually a phrase, although there are other cases where Tesserae can’t find a word that clearly should be in the dictionary. You can download the full dataset here.

Similarity and synonymy

Given that all the word pairs extracted from the synsets are supposed to be synonyms, and that similarity is supposed to be a measure of synonymy, we might hope that most of the pairs would have high SIM scores. This didn’t turn out to be the case: while a significant number scored 1, the majority of the pairs scored 0; among the rest, there seemed to be a preference for low scores over high.

[Figure: distribution of SIM scores across all word pairs]

Rank position and synonymy

On the other hand, rank position did better.  I added together RANKA and RANKB to flatten out weird asymmetries for now, and found that a large majority of word pairs had high ranks:

[Figure: distribution of combined rank position (RANKA + RANKB) across all word pairs]

It seems safe to say we’re not interested in results that ranked 50,000th in an ordered list of most similar words.  Here’s a closeup of just the top-ranked (i.e. furthest left on the x-axis) according to RANKA only.  It does pretty much what we had hoped for:

[Figure: close-up of the top-ranked pairs according to RANKA]

So it seems on a first pass as though rank is working out of the box, while similarity needs work.  What if we use rank as a filter on similarity?  Here is the distribution of similarity scores among pairs whose combined RANKA + RANKB is less than 100.  Not only are these pairs high ranking, but they’re also relatively symmetrical, given that the two ranks can differ by no more than 99 in each case.  Here, the huge spike at SIM=0 is gone; the spike at 1 is preserved, and the rest form a nice curve around the middle of the similarity spectrum.

[Figure: distribution of SIM scores among pairs with RANKA + RANKB < 100]
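The filter itself is trivial. Using the example rows from the sample table above (with their real SIM and rank values), it looks like this:

```python
# (pair, SIM, RANKA, RANKB) rows, taken from the sample table above.
records = [
    ("θαλασσουργία->ἁλιεία", 0.522157, 71, 216),
    ("κόπρος->σπατίλη", 0.301455, 38, 55),
    ("τρωπάω->ἐξυπτιάζω", 0.139253, 1250, 1090),
]

# Keep the similarity scores of only those pairs whose combined
# rank position is under 100.
filtered = [sim for _, sim, ra, rb in records if ra + rb < 100]
```

Of these three example pairs, only κόπρος->σπατίλη survives the cut.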

Clearly more work to be done here, but this seems to be an exciting start!

Mapping the diversity of Perseus texts

Adding texts to Tesserae’s searchable database requires ensuring that every line has a human-readable locus associated with it.  Checking through Perseus TEI and selecting one, consistent numbering system to apply to all the lines of a text is no easy job—as James and his team of interns will readily attest.

One way in which this might be made easier is processing similarly-structured texts in batches.  But TEI is flexible enough that texts with the same structure in print (e.g. Book–Poem–Line, or Book–Chapter–Section) don’t necessarily have the same XML structure.  To take a simple example, some poetic texts enclose each line in <l> tags, with line numbers coded as attributes of the line elements, as in the case of Ovid’s Metamorphoses:

<l>In nova fert animus mutatas dicere formas</l>
<l>corpora; di, coeptis (nam vos mutastis et illas)</l>
<l>adspirate meis primaque ab origine mundi</l>
<l>ad mea perpetuum deducite tempora carmen.</l>
<l n="5">Ante mare et terras et quod tegit omnia caelum</l>
<l>unus erat toto naturae vultus in orbe,</l>
<l>quem dixere chaos: rudis indigestaque moles</l>
<l>nec quicquam nisi pondus iners congestaque eodem</l>
<l>non bene iunctarum discordia semina rerum.</l>

Other texts encode the same structure by interspersing numbered line breaks throughout a block of text, as in Silius Italicus’ Punica:

<lb rend="displayNum" n="1" />Ordior arma, quibus caelo se gloria tollit
<lb rend="displayNum" n="2" />Aeneadum, patiturque ferox Oenotria iura
<lb rend="displayNum" n="3" />Carthago. da, Musa, decus memorare laborum
<lb rend="displayNum" n="4" />antiquae Hesperiae, quantosque ad bella crearit
<lb rend="displayNum" n="5" />et quot Roma uiros, sacri cum perfida pacti
<lb rend="displayNum" n="6" />gens Cadmea super regno certamina mouit
<lb rend="displayNum" n="7" />quaesitumque diu, qua tandem poneret arce
<lb rend="displayNum" n="8" />terrarum Fortuna caput. ter Marte sinistro
<lb rend="displayNum" n="9" />iuratumque Ioui foedus conuentaque patrum
<lb rend="displayNum" n="10" />Sidonii fregere duces, atque impius ensis
<lb rend="displayNum" n="11" />ter placitam suasit temerando rumpere pacem.

When it comes to automatically adding the correct locus to each line of text, these two encodings demand different treatments, as in one case the line number is an attribute of the parent element, whereas in the other the line number is an attribute of a sibling.
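A sketch of the two treatments with Python’s standard-library xml.etree (the snippets are trimmed versions of the passages above, with the TEI wrapper elements and the rend attribute omitted):

```python
import xml.etree.ElementTree as ET

def lines_from_l_tags(root):
    """Scheme 1: each verse sits in its own <l> element, and the line
    number, when present, is an attribute of that element."""
    return [(l.get("n"), (l.text or "").strip()) for l in root.iter("l")]

def lines_from_lb_tags(root):
    """Scheme 2: numbered <lb/> milestones are interspersed in a text
    block, and each verse is the tail text of its <lb/> sibling."""
    return [(lb.get("n"), (lb.tail or "").strip()) for lb in root.iter("lb")]

ovid = ET.fromstring(
    "<div><l>In nova fert animus mutatas dicere formas</l>"
    '<l n="5">Ante mare et terras et quod tegit omnia caelum</l></div>')
silius = ET.fromstring(
    '<div><lb n="1"/>Ordior arma, quibus caelo se gloria tollit\n'
    '<lb n="2"/>Aeneadum, patiturque ferox Oenotria iura</div>')
```

A batch-processing tool would need to pick the right extractor (among these and others) for each file.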

I thought it might be interesting to see whether we could automatically classify texts based on the type of XML tags used in encoding them. This could identify which texts would need similar treatment without making assumptions based on the way the print texts were structured.

I decided to try a rough classification of documents based solely on what kinds of nodes they contained and the hierarchical arrangement of those nodes. For example, you can guess which of the following paths occurs in Cicero’s Letters to Atticus, and which in Plautus’ Menaechmi:

TEI.2/text/body/div1[@type='book']/div2[@type='letter']/opener/salute
TEI.2/text/body/div1[@type='act']/div2[@type='scene']/sp/speaker

I generated a list of all unique paths from root to leaf in each text. I kept attribute values in only two cases, the @type of <div1>, <div2>, etc. and the @unit of <milestone>, because these attributes may carry important information about the structure of the text.

In this first experiment I didn’t even bother considering how many instances of each path a text contained; I just set the feature to 1 if the path was present and 0 if not. Each text was ultimately represented by a vector of 1095 binary features, one for each of the unique paths that occurred anywhere in the corpus.
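The path extraction can be sketched like this, using a toy document shaped like the Letters to Atticus example above (an illustration, not the code we actually used):

```python
import xml.etree.ElementTree as ET

def leaf_paths(elem, prefix=""):
    """Collect every root-to-leaf tag path, keeping attribute values
    only for the @type of <div1>, <div2>, ... and the @unit of
    <milestone>."""
    label = elem.tag
    if elem.tag.startswith("div") and elem.get("type"):
        label += "[@type='%s']" % elem.get("type")
    elif elem.tag == "milestone" and elem.get("unit"):
        label += "[@unit='%s']" % elem.get("unit")
    path = prefix + "/" + label if prefix else label
    children = list(elem)
    if not children:
        return {path}
    return set().union(*(leaf_paths(child, path) for child in children))

doc = ET.fromstring(
    "<text><body><div1 type='book'><div2 type='letter'>"
    "<opener><salute/></opener></div2></div1></body></text>")
paths = leaf_paths(doc)

# Binary feature vector over the corpus-wide path vocabulary.
vocabulary = sorted(paths | {"text/body/div1[@type='act']/sp/speaker"})
vector = [int(p in paths) for p in vocabulary]
```

In the real corpus the vocabulary is the 1095 unique paths found anywhere, and each text gets one such 0/1 vector.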

Here we see the texts represented by the first two principal components of those feature vectors. The points have also been colored according to an independent, k-means classification of the original vectors into 8 classes.

[Figure: texts plotted on the first two principal components of the feature vectors, colored by k-means class]
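The projection step can be sketched with plain NumPy (random stand-in data replaces the real 1095-feature matrix, and the original analysis may have used different tools):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the real data: 20 texts x 12 binary path features.
X = rng.integers(0, 2, size=(20, 12)).astype(float)

# Principal components via SVD of the mean-centered feature matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
coords = Xc @ Vt[:2].T  # each text projected onto the first two PCs
```

The k-means classes used for coloring are computed separately, on the original binary vectors rather than on the projected coordinates.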

For me, three things jump out immediately: first, that drama is set apart from all the other texts; second, that Cicero manages to cover almost the entire feature space; third, that the remaining genres do cluster, but overall tend to show a gradient of characteristics.

Even as we continue to work on a universal text-parsing tool, this line of investigation could potentially speed the addition of Perseus texts.  I think the next step will be to add information about who edited the digital text to the feature vector.  This should help move classification beyond the primarily genre-driven distinctions we could have predicted, to include TEI coding idiosyncrasies such as the difference in line numbering illustrated above, which we wouldn’t have been able to guess without examining all the files by hand.

Reading Thebaid 2

Kyle Gervais of the University of Otago is working on a commentary on Statius, Thebaid 2, and emailed us comments on his use of Tesserae. It’s encouraging to see scholars putting the system to use in this way, and to get such perceptive feedback.

I’ve been using Tesserae in writing my commentary on Statius, Thebaid 2. Of course, it’s not my primary tool for tracking down intertexts, since it doesn’t understand context and doesn’t do synonyms or sound-alike words very well (although I understand that these are areas under development). I typically use it after I’ve written notes on a hundred lines or so, to help me catch any intertexts I’ve missed through traditional methods. I work at a slow pace (no more than two lines of poetry per day) and am very thorough in searching for intertexts (constant searches of the PHI database, consulting half a dozen ancient and modern commentaries and editions, trolling through papers on Statius and commentaries on other authors, and of course my own knowledge of the ancient sources)–so it’s impressive how many new intertexts Tesserae picks up. An example:

After finishing Theb. 2.1-101, I ran the lines against the Aeneid on Tesserae (using the basic search mode). I got 740 hits, and within 30-45 min. skimmed through to find 10 promising hits that I hadn’t found in the traditional ways (I’m sure I could have cut out a lot of the poor quality hits by manipulating the search settings, but I worry about missing things, and find it just as easy to skim). Of the ten, four led nowhere. Of the remaining six:

One reinforced an intertextual frame I already recognized (Hector’s epiphany in Aen. 2 as a frame for Laius’ epiphany): Theb. 2.101 pectora et has uisus fatorum expromere uoces, Aen. 2.280 compellare virum et maestas expromere voces. Obviously no one (including me) had thought to search for expromere uoces.

One helped to flesh out Laius’ role as an agent of discord: Theb. 2.99 infula per crines, glaucaeque innexus oliuae [/ uittarum prouenit honos], Aen. 6.281 ‘[Discordia] vipereum crinem vittis innexa cruentis‘. On a slow day, I might have searched the PHI for innex-, but on most days it would have seemed like a waste of time. Even if I had, I might have skimmed by Aen. 6.281 (since crinem wouldn’t have been highlighted).

Two revealed a subtle link between the underworld at Theb. 2.48ff. and Priam’s palace at Aen. 2.486ff.: 2.49 uacua atria ditat, 2.528 uacua atria lustrat; 2.51 stridor ibi et gemitus poenarum, atroque tumultu…, 2.486  at domus interior gemitu miseroque tumultu…. Never thought to search for uacua atria; never would have searched for gemit– + tumult-.

Two were really exciting:

Baccho + matres pointed to: Theb. 2.79f. ipse etiam gaudens nemorosa per auia sanas / impulerat matres Baccho meliore Cithaeron and Aen. 7.580ff. tum quorum attonitae Baccho nemora avia matres / insultant thiasis (neque enim leue nomen Amatae) / undique collecti coeunt Martemque fatigant. A clear intertext, and more importantly, a good (very modern and very much in Statius’ style) explanation for Baccho meliore, which has been a crux: Bacchus is ‘better’ than he was in the Aeneid.

Finally, Theb. 2.42 (a mountain’s shadow on the water) exigit atque ingens medio natat umbra profundo and Aen. 5.422f. (Entellus) magna ossa lacertosque / exuit atque ingens media consistit harena (note the added correspondence between exigit and exuit, which Tesserae can’t [yet?] pick up). It’s a genuine and interesting intertext, I think, but I never would have found it myself: the contexts aren’t obviously similar, I wouldn’t have had time to search the PHI for atque, ingens, or medius (too many hits), and it wouldn’t have occurred to me to search for combinations of any of those three words. It’s most exciting to me because it’s the kind of intertext that always gets missed since we’re not very good at thinking in the proper way (my comment on the link:  ‘An intertext perhaps best *read in reverse*, as an augmentation of Virgil: thanks to Statius the mighty Entellus casts a shadow big as a mountain’).

Visualizing Sound Patterns in Homer

In his 1974 article “Sound-Patterns in Homer,” David W. Packard compared a wide range of critical opinions about the artistic use of sound in the poetics of the Iliad and Odyssey with a statistical analysis of letter frequencies.  This is a seminal paper in digital humanities not only because Packard was a pioneer in designing the hardware and software necessary to digitize ancient Greek texts, but also because it addresses the interface between empirical data and critical interpretation, a problem that persists forty years on, despite huge advances in many areas of the field.

In the DHIB Textual Analysis Working Group, projects such as Tesserae attempt to adapt methods designed for such cold-blooded forensic purposes as authorship attribution and plagiarism detection to the humanistic goals of literary criticism.  This means not only digitizing and analyzing, but also being able to return from statistics and data to subjective appreciation, and creating new value for readers.  Here I want to show some preliminary results from my dissertation research, which benefits greatly from the intellectual cross-fertilization among the working group’s various efforts.  I’ll draw some parallels to Packard’s work, trying to emphasize methods that I hope show the potential for digital interpretation as well as digital analysis of literary works.

The Iliad and Odyssey are, in one way or another, the products of a long oral tradition.  Despite the uncertainty that intervening changes in both pronunciation and spelling impose on any understanding we can have of these poems’ first-millennium realization, it’s clear that sound was a vital component of their composition and appreciation.  Packard was primarily investigating the question of whether sound patterns were the result of deliberate poetic artistry, but others have argued that they may have served an unconscious mnemonic role, allowing illiterate singers to store vast texts in memory using a sort of data compression.

In either case, digital analysis can aid us by providing the statistics to test theories about what sort of patterns exist.  But can it also help us  “read” the sounds of the poem in new ways, perhaps pointing us to new hypotheses we wouldn’t otherwise have formed?

Digital Analysis

Following Packard, I begin by breaking the poems down into an alphabet of sounds, most of which have one-to-one correspondence with orthographic characters.  From these atoms we can work up hierarchically to lines, either via words and n-grams, or via syllables and feet.  But for now, let’s just consider the sounds themselves.  The question I want to examine is, do some sounds show an interesting distribution in the poems, and, if so, what does that look like?

I downloaded the texts of the Iliad and the Odyssey from the Perseus Digital Library, concatenated them, then split them into 20-line samples.  In order to get a feel for what kind of variation you might expect to see by chance alone, I created a control set where the lines of the two poems were randomly shuffled before splitting into 20-line samples.  In fact, I did that ten different times.  These ten control sets, then, represent a sort of background noise against which any pattern must clearly distinguish itself.
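The sampling and the shuffled controls can be sketched like this (numbered stand-in lines take the place of the real text):

```python
import random

def samples(lines, size=20):
    """Split a list of verse lines into consecutive size-line samples."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

lines = ["line %d" % i for i in range(100)]  # stand-in for Il. + Od.
real = samples(lines)

# Ten control sets: shuffle the lines before sampling, destroying any
# large-scale ordering while preserving line-level statistics.
controls = []
for seed in range(10):
    shuffled = lines[:]
    random.Random(seed).shuffle(shuffled)
    controls.append(samples(shuffled))
```

Because shuffling happens at the line level, each control set contains exactly the same lines as the original, just redistributed among samples.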

The graph below looks at the distribution of every unique pair of adjacent sounds that occurs in the two poems.  The y-axis shows the portion of all samples in which a pair is found.  Sound-pairs are ranged along the x-axis from most common (on average across the ten control sets) at the left, to least common at the right.  The most common sound pairs occur in all samples, the least common in only one or two.

There are ten superimposed red curves, one for each of the control sets.  The black curve represents the poem in its proper order.  You can see that the black falls away from the red in places.  Here, a sound-pair is found in rather fewer samples than you’d expect by chance alone.  This means that in the original version it’s clumping up in some samples, leaving others bare.

Here’s a close-up showing two prime candidates for interesting behavior, hι and δυ. (I transliterated initial /h/ with a Latin “h” because it has no Greek letter.)

While this chart gives us a clue about which sounds might be interesting, it is a far cry from “interpretable” in a literary sense.  Packard’s approach is similar.  He begins with a chart showing, for each sound, the number of lines in which it does not occur at all, the number of lines in which it occurs once, twice, and so on (e.g. his Table 1).  In another giant table, he lists all the lines in which a given sound occurs unusually frequently (e.g. his Table 3).

These tables serve two functions for Packard.  First, where a critic has claimed that a particular line is notable for the density of some sound or other, Packard can tell at a glance how many and which other lines share the same characteristic.  Second, he can survey the most “interesting” single lines to see whether they tend to be particularly charged with literary significance.  But can these data be reintegrated into a new reading?  Can computational techniques be turned from analysis to interpretation?

Digital Interpretation

Packard makes an exciting attempt in this direction, although he cautions that as it stands it is overly simplistic, undertaken “purely as an experiment.”  He turns to the work of Dionysius of Halicarnassus, a scholar of the first century BCE who assessed the relative “harshness” of every letter of the Greek alphabet and used this as the basis for poetic criticism.  Assigning to every sound a numerical value based on Dionysius’ rankings, Packard calculates for every line in the Iliad and Odyssey a “Dionysian” harshness metric.

My approach to reintegrating sound frequencies into a subjective appreciation of the larger poem draws on techniques I used when I studied satellite image processing in the Earth and Environmental Science department at Lehigh University.  There we would visualize three variables from a larger set simultaneously by assigning them to red, green, and blue intensities respectively. In the following figures, each square represents twenty lines of text.  The texts proceed from left to right, top to bottom, beginning with the first line of the Iliad.
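The mapping from sound densities to color channels can be sketched as follows (a simplified illustration; the scale factor is hypothetical and would be tuned to the data):

```python
def density(sample_text, target):
    """Occurrences of a sound string per character of a sample."""
    return sample_text.count(target) / max(len(sample_text), 1)

def sample_color(sample_text, red="hι", green="ιπ", blue="ππ", scale=5000):
    """Map three sound-string densities to an (R, G, B) triple in 0-255.
    The scale factor is a hypothetical tuning parameter."""
    channel = lambda s: min(int(density(sample_text, s) * scale), 255)
    return (channel(red), channel(green), channel(blue))
```

Each 20-line sample then becomes one colored square, laid out left to right, top to bottom.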

In this first image, the red value represents density of the sound-pair hι, green represents ιπ, and blue represents ππ.  These sounds are all components of the word ἵππος, “horse,” and the biggest bright stripe (a little more than halfway down, on the left) represents the chariot race in Iliad Book 23.  Compare the picture above with the one below, made in the same way but using the first control set.

The control set shows the same variability among samples, but no large-scale patterns like the bright stripe in the first picture.

In my first experiment, the three variables used to create the colors tended to co-vary, being parts of the same relatively common word.  In the next example, they show more independence.  Here I used sound triplets: red shows the density of the string δυσ, green represents χιλ, and blue represents τυδ.  The frequencies of these strings are dominated by the presence of three main characters, Odysseus, Achilles, and Diomedes (“son of Tydeus”).

The huge red region at the bottom is books 5-24 of the Odyssey.  The green region in the middle is where Achilles returns to the fighting in the later part of the Iliad.  Near the beginning is a blue section corresponding to the Aristeia of Diomedes.

For now, this analysis remains relatively crude, and limited to showing content-driven patterns in sound, rather than purely stylistic ones.  My original aim was to perform principal components analysis on all the sound frequencies together, then assign the three color intensities to the first three principal components.  So far, though, it’s turned up nothing appreciably different from what you see in the control sets.

Instead, let me close with a tribute to Packard’s approach.  Here I’ve calculated his “Dionysian” score for each of my samples and assigned it to a grey scale value.  Brighter samples are harsher sounding, to Dionysius of Halicarnassus’ ear, at any rate, while the black squares represent the most mellifluous passages.

But Packard’s metric was designed to examine the sound of individual lines.  Perhaps it would be better read in this way:

The graphs above were made using R; the other pictures, using Processing.  I used Perl for everything in between.  I’d appreciate advice or comments on any aspect of this from one and all.

Originally posted to the DHIB blog.