Slight Score Change

Posted on June 14, 2013 by Chris Forstall

We’ve recently fixed a small bug in the scoring system, and you may notice that some scores are higher than they used to be. Scores are calculated as floating-point values, but displayed as integers in the web interface. Until now the decimal part of the score has been truncated, so that all scores were effectively rounded down to the next lowest integer. From now on we will use the more customary rounding rules, so that scores with a fractional part of .5 or above will be rounded up. If you compare the results of a search done now to those of the same search before the change, you can expect about half of the scores to be one higher.
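The difference between the two display rules can be sketched in a few lines of Python. This is only an illustration, not the Tesserae code itself: the function names are made up for the example, and it assumes scores are never negative. Python’s built-in round() sends exact halves to the nearest even integer, so the sketch uses floor(x + 0.5) to get the customary rule described above.

```python
import math

def display_old(score):
    """Old behaviour: drop the decimal part (assumes score >= 0)."""
    return int(score)

def display_new(score):
    """New behaviour: round half up, so fractional parts >= .5 go up."""
    return math.floor(score + 0.5)

# Print each raw score with its old and new displayed value.
for s in (6.0, 6.49, 6.5, 6.99):
    print(s, display_old(s), display_new(s))
```

Only scores whose fractional part is .5 or more change, which is why roughly half of the displayed values move up by one.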

As we continue our research on quantifying the literary significance of allusions, scores may change again, and perhaps significantly. We will post notice and explanations of any such changes here. If for any reason you need access to a previous version of the software in order to replicate older results, please just let us know and we can help you. Every version of Tesserae, once published on our web site, is archived and can be retrieved.

Synsets vs. Similarities

Posted on April 23, 2013 by Chris Forstall

Harry Diakoff has shared with us a set of Greek synsets—groups of words purported to be mutually synonymous—while Tesserae has an algorithm which is supposed to measure semantic similarity between any two words. While both approaches ultimately are based upon the Perseus XML version of the Liddell-Scott-Jones lexicon, they employ very different methods. We hope that by comparing the two we can improve upon both.

Synsets

I don’t really know how these were created—maybe by translating entries in the English WordNet? Harry, can you fill in any details here? The important characteristics of the synonym sets are:

- each set has a unique id
- within a set, all relationships are presumed to be mutual and symmetrical
- words can belong to more than one set

Similarities

Tesserae calculates word similarities using the Python topic modelling package Gensim. We treat every entry in LSJ as a “document,” which is digested to produce a bag of English words used in defining the Greek headword. These English words are TF-IDF weighted and used to create a feature vector describing the headword. Headwords are compared using gensim.similarities.Similarity()—for any query word this returns a score between 0 and 1 for every other word in the corpus. In addition to this absolute similarity score, we can also sort all results by score and consider the rank position of a given result some measure of its relationship to the query word.

- each pair of words has a unique similarity score
  - some words within a synset can be more alike than others;
  - but homonyms are flattened
- this similarity is symmetrical, but the rank positions aren’t:
  - the rank of result B given query A is not necessarily the same as that of A given B
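The properties in the list above can be seen in a toy example. What follows is a sketch in plain Python, not the Gensim pipeline itself: the four “entries” and their English glosses are invented, and the TF-IDF formula is a common textbook variant that may differ from Gensim’s defaults.

```python
import math
from collections import Counter

# Toy stand-ins for dictionary entries: headword -> English definition words.
# These glosses are invented for illustration, not taken from LSJ.
entries = {
    "A": "sea fishing fishing work".split(),
    "B": "fishing net".split(),
    "C": "sea water salt".split(),
    "D": "net trap snare".split(),
}

def tfidf_vectors(docs):
    """Sparse TF-IDF vector (word -> weight) for each headword."""
    n = len(docs)
    df = Counter(w for words in docs.values() for w in set(words))
    vecs = {}
    for head, words in docs.items():
        tf = Counter(words)
        vecs[head] = {w: tf[w] * math.log(n / df[w]) for w in tf}
    return vecs

def cosine(u, v):
    """Cosine similarity of two sparse vectors."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

vecs = tfidf_vectors(entries)

def rank(query, target):
    """1-based position of target among the other headwords, best match first."""
    others = sorted((h for h in vecs if h != query),
                    key=lambda h: cosine(vecs[query], vecs[h]), reverse=True)
    return others.index(target) + 1

print(cosine(vecs["A"], vecs["C"]), cosine(vecs["C"], vecs["A"]))
print(rank("A", "C"), rank("C", "A"))
```

The similarity of A and C is the same in both directions, but C is only the second-best match for A while A is the best match for C, because each query word has its own set of competitors.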

You can see the code I used to calculate these metrics in the synonymy directory of the tess.experiments git repository. But please be patient—it’s still quite rough; please feel free to improve it…

Current problems

Both of these datasets have their difficulties. Each set is probably missing some headwords. The synsets include some false positives and negatives. The similarity scores can’t be turned into lists of synonyms without a threshold—either a similarity or rank position—that defines synonymy.

Ultimately, we need ground-truthing. What follows is merely a first attempt to compare the two approaches and figure out to what degree they are in agreement, and to get some ideas about where and in what ways they differ.

A first glance

What I’ve done here is to break Harry’s synsets down into pairwise relationships, and then measure similarity and rank position (in each direction) for all of the pairs that exist in the Tesserae similarity matrix.

Harry sent us 17,342 synsets, which decomposed into 235,702 unique pairs of words. Of these, 174,816 pairs returned results from a Tesserae similarity query. In the remaining cases, one or both of Harry’s words didn’t exist in our corpus. Although we both used the same dictionary, we each had our troubles reading it; more about this in another post.
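Breaking synsets into unique pairs might look something like the sketch below. The input format (a mapping from synset id to member words) and the ids and words themselves are assumptions made up for the example; the real script is in the repository mentioned above and may work differently.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical input: synset id -> member words (ids and members invented).
synsets = {
    101: ["alpha", "beta", "gamma"],
    102: ["beta", "gamma"],
    103: ["delta"],  # a single-member set yields no pairs
}

def synsets_to_pairs(synsets):
    """Map each unordered word pair to the ids of the synsets containing it."""
    pairs = defaultdict(list)
    for sid, words in synsets.items():
        # Sorting makes (a, b) and (b, a) collapse to the same key.
        for a, b in combinations(sorted(set(words)), 2):
            pairs[(a, b)].append(sid)
    return dict(pairs)

pairs = synsets_to_pairs(synsets)
for pair, ids in sorted(pairs.items()):
    print(pair, ids)
```

A pair that occurs in several synsets is stored once, with all of its synset ids attached, which matches the SYNSETS column in the results table.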

Initial results look like this:

KEYPAIR	SIM	RANKA	RANKB	SYNSETS
θαλασσουργία->ἁλιεία	0.522157	71	216	454121;453935
κόπρος->σπατίλη	0.301455	38	55	14853947
συμπάρειμι->ἔησθα	0.239762	228	136	4353803;7959016;5861067;…
τρωπάω->ἐξυπτιάζω	0.139253	1250	1090	346532;7423365;457382;…
κηλιδόω->μελαντήριονστίγμα	None	None	None	6794666

SIM gives the similarity score for each pair; RANKA gives the rank position of the righthand member among results when the lefthand member is used as the query; RANKB, the rank position of the lefthand member when the righthand one is the query; and SYNSETS gives the id number(s) of synsets in which this pair appears. In the bottom row you see what happens when Tesserae can’t find one or both of the words. In this case it seems that one member of the pair is actually a phrase, although there are other cases where Tesserae can’t find a word that clearly should be in the dictionary. You can download the full dataset here.

Similarity and synonymy

Given that all the word pairs extracted from the synsets are supposed to be synonyms, and that similarity is supposed to be a measure of synonymy, we might hope that most of the pairs will have high SIM scores. This didn’t turn out to be the case: while a significant number scored 1, the majority of the pairs scored 0; among the rest, there seemed to be a preference for low scores over high.

Rank position and synonymy

On the other hand, rank position did better. I added together RANKA and RANKB to flatten out weird asymmetries for now, and found that a large majority of word pairs had high ranks:

It seems safe to say we’re not interested in results that ranked 50,000th in an ordered list of most similar words. Here’s a closeup of just the top-ranked pairs (i.e. furthest left on the x-axis) according to RANKA only. It does pretty much what we had hoped for:

So it seems on a first pass as though rank is working out of the box, while similarity needs work. What if we use rank as a filter on similarity? Here is the distribution of similarity scores among pairs whose combined RANKA + RANKB is less than 100. Not only are these pairs high ranking, but they’re also relatively symmetrical, given that the two ranks can differ by no more than 99 in each case. Here, the huge spike at SIM=0 is gone; the spike at 1 is preserved, and the rest form a nice curve around the middle of the similarity spectrum.
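The rank filter can be sketched as follows, using the five sample rows from the table earlier in this post as input. The function names and the dictionary layout are assumptions for the example, not the code actually used to make the plots.

```python
import csv
import io

# The five sample rows shown earlier, tab-separated as in the original table.
sample = (
    "KEYPAIR\tSIM\tRANKA\tRANKB\tSYNSETS\n"
    "θαλασσουργία->ἁλιεία\t0.522157\t71\t216\t454121;453935\n"
    "κόπρος->σπατίλη\t0.301455\t38\t55\t14853947\n"
    "συμπάρειμι->ἔησθα\t0.239762\t228\t136\t4353803;7959016;5861067;…\n"
    "τρωπάω->ἐξυπτιάζω\t0.139253\t1250\t1090\t346532;7423365;457382;…\n"
    "κηλιδόω->μελαντήριονστίγμα\tNone\tNone\tNone\t6794666\n"
)

def parse_rows(text):
    """Read the table, skipping pairs that returned no similarity result."""
    rows = []
    for rec in csv.DictReader(io.StringIO(text), delimiter="\t"):
        if rec["SIM"] == "None":
            continue
        rows.append({
            "pair": rec["KEYPAIR"],
            "sim": float(rec["SIM"]),
            "rank_a": int(rec["RANKA"]),
            "rank_b": int(rec["RANKB"]),
        })
    return rows

def filter_by_combined_rank(rows, limit=100):
    """Keep pairs whose RANKA + RANKB is below the limit."""
    return [r for r in rows if r["rank_a"] + r["rank_b"] < limit]

rows = parse_rows(sample)
kept = filter_by_combined_rank(rows)
print([(r["pair"], r["sim"]) for r in kept])
```

Of the sample rows, only the second (38 + 55 = 93) survives the filter; the distribution described above is simply the SIM values of all surviving pairs in the full dataset.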

Clearly more work to be done here, but this seems to be an exciting start!

Mapping the diversity of Perseus texts

Posted on March 22, 2013 by Chris Forstall

Adding texts to Tesserae’s searchable database requires ensuring that every line has a human-readable locus associated with it. Checking through Perseus TEI and selecting one, consistent numbering system to apply to all the lines of a text is no easy job—as James and his team of interns will readily attest.

One way in which this might be made easier is processing similarly-structured texts in batches. But TEI is flexible enough that texts with the same structure in print (e.g. Book–Poem–Line, or Book–Chapter–Section) don’t necessarily have the same XML structure. To take a simple example, some poetic texts enclose each line in <l> tags, with line numbers coded as attributes of the line elements, as in the case of Ovid’s Metamorphoses:

<l>In nova fert animus mutatas dicere formas</l>
<l>corpora; di, coeptis (nam vos mutastis et illas)</l>
<l>adspirate meis primaque ab origine mundi</l>
<l>ad mea perpetuum deducite tempora carmen.</l>
<l n="5">Ante mare et terras et quod tegit omnia caelum</l>
<l>unus erat toto naturae vultus in orbe,</l>
<l>quem dixere chaos: rudis indigestaque moles</l>
<l>nec quicquam nisi pondus iners congestaque eodem</l>
<l>non bene iunctarum discordia semina rerum.</l>

Other texts encode the same structure by interspersing numbered line breaks throughout a block of text, as in Silius Italicus’ Punica:

<lb rend="displayNum" n="1" />Ordior arma, quibus caelo se gloria tollit
<lb rend="displayNum" n="2" />Aeneadum, patiturque ferox Oenotria iura
<lb rend="displayNum" n="3" />Carthago. da, Musa, decus memorare laborum
<lb rend="displayNum" n="4" />antiquae Hesperiae, quantosque ad bella crearit
<lb rend="displayNum" n="5" />et quot Roma uiros, sacri cum perfida pacti
<lb rend="displayNum" n="6" />gens Cadmea super regno certamina mouit
<lb rend="displayNum" n="7" />quaesitumque diu, qua tandem poneret arce
<lb rend="displayNum" n="8" />terrarum Fortuna caput. ter Marte sinistro
<lb rend="displayNum" n="9" />iuratumque Ioui foedus conuentaque patrum
<lb rend="displayNum" n="10" />Sidonii fregere duces, atque impius ensis
<lb rend="displayNum" n="11" />ter placitam suasit temerando rumpere pacem.

When it comes to automatically adding the correct locus to each line of text, these two encodings demand different treatments, as in one case the line number is an attribute of the parent element, whereas in the other the line number is an attribute of a sibling.
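To make the difference concrete, here is a rough sketch of how the two encodings might be read with Python’s standard ElementTree module. It is not the Tesserae parser: the wrapping <body> element is added just to make the fragments well-formed, and the sketch assumes that line numbers are plain integers and that unnumbered <l> lines simply continue the count from the last numbered one.

```python
import xml.etree.ElementTree as ET

def loci_from_l(xml_text):
    """Lines wrapped in <l>: the number is an attribute of the enclosing
    element, but only some lines carry it, so count forward in between."""
    root = ET.fromstring(xml_text)
    out, n = [], 0
    for el in root.iter("l"):
        n = int(el.get("n")) if el.get("n") else n + 1
        out.append((n, "".join(el.itertext()).strip()))
    return out

def loci_from_lb(xml_text):
    """Lines following <lb n="..."/>: the number sits on a sibling element,
    and the verse is the text trailing that element (its 'tail')."""
    root = ET.fromstring(xml_text)
    return [(int(lb.get("n")), (lb.tail or "").strip()) for lb in root.iter("lb")]

ovid = (
    "<body>"
    "<l>In nova fert animus mutatas dicere formas</l>"
    "<l>corpora; di, coeptis (nam vos mutastis et illas)</l>"
    "<l>adspirate meis primaque ab origine mundi</l>"
    "<l>ad mea perpetuum deducite tempora carmen.</l>"
    '<l n="5">Ante mare et terras et quod tegit omnia caelum</l>'
    "<l>unus erat toto naturae vultus in orbe,</l>"
    "</body>"
)
silius = (
    "<body>"
    '<lb rend="displayNum" n="1" />Ordior arma, quibus caelo se gloria tollit\n'
    '<lb rend="displayNum" n="2" />Aeneadum, patiturque ferox Oenotria iura\n'
    "</body>"
)

print(loci_from_l(ovid))
print(loci_from_lb(silius))
```

The two functions return the same kind of (number, text) list, but they have to look in different places to build it, which is exactly why the two groups of texts can’t be processed in one batch.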

I thought it might be interesting to see whether we could automatically classify texts based on the type of XML tags used in encoding them. This could identify which texts would need similar treatment without making assumptions based on the way the print texts were structured.

I decided to try a rough classification of documents based solely on what kinds of nodes they contained and the hierarchical arrangement of those nodes. For example, you can guess which of the following paths occurs in Cicero’s Letters to Atticus, and which in Plautus’ Menaechmi:

TEI.2/text/body/div1[@type='book']/div2[@type='letter']/opener/salute
TEI.2/text/body/div1[@type='act']/div2[@type='scene']/sp/speaker

I generated a list of all unique paths from root to leaf in each text. I only kept attribute values in two cases, the @type of <divn> and the @unit of <milestone>. This is because important information about the structure of the text may be in the attributes here.

In this first experiment I didn’t even bother considering how many instances of each path a text contained; I just set the feature to 1 if the path was present and 0 if not. Each text was ultimately represented by a vector of 1095 binary features, one for each of the unique paths that occurred anywhere in the corpus.
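A minimal sketch of this feature extraction, again using Python’s standard ElementTree module, might look like the following. The two tiny documents are invented to mirror the example paths above, and the function names are hypothetical; the real experiment ran over full Perseus files.

```python
import re
import xml.etree.ElementTree as ET

def step(el):
    """Path step for one element, keeping only div*/@type and milestone/@unit."""
    if re.fullmatch(r"div\d*", el.tag) and el.get("type"):
        return f"{el.tag}[@type='{el.get('type')}']"
    if el.tag == "milestone" and el.get("unit"):
        return f"{el.tag}[@unit='{el.get('unit')}']"
    return el.tag

def leaf_paths(el, prefix=""):
    """Set of unique root-to-leaf element paths at or below el."""
    path = f"{prefix}/{step(el)}" if prefix else step(el)
    children = list(el)
    if not children:
        return {path}
    paths = set()
    for child in children:
        paths |= leaf_paths(child, path)
    return paths

def binary_features(docs):
    """docs: name -> XML string. Returns (sorted path list, name -> 0/1 vector)."""
    per_doc = {name: leaf_paths(ET.fromstring(xml)) for name, xml in docs.items()}
    all_paths = sorted(set().union(*per_doc.values()))
    vectors = {name: [int(p in ps) for p in all_paths] for name, ps in per_doc.items()}
    return all_paths, vectors

docs = {
    "letters": "<TEI.2><text><body><div1 type='book'><div2 type='letter'>"
               "<opener><salute>s.</salute></opener></div2></div1></body></text></TEI.2>",
    "play": "<TEI.2><text><body><div1 type='act'><div2 type='scene'>"
            "<sp><speaker>x</speaker></sp></div2></div1></body></text></TEI.2>",
}
all_paths, vectors = binary_features(docs)
print(all_paths)
print(vectors)
```

With only two documents there are two features; over the whole corpus the same procedure yields one column per unique path.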

Here we see the texts represented by the first two principal components of those feature vectors. The points have also been colored according to an independent, k-means classification of the original vectors into 8 classes.

For me, three things jump out immediately: first, that drama is set apart from all the other texts; second, that Cicero manages to cover almost the entire feature space; third, that the remaining genres do cluster, but overall tend to show a gradient of characteristics.

Even as we continue to work on a universal text-parsing tool, this line of investigation could potentially speed the addition of Perseus texts. I think the next step will be to add information about who edited the digital text to the feature vector. This will help move classification from primarily genre-driven, identifying differences we could have predicted, to include TEI coding idiosyncrasies such as the difference in line numbering illustrated above, which we wouldn’t have been able to guess without examining all the files by hand.

Tesserae at DHCS 2012

Posted on December 3, 2012 by Chris Forstall

This year’s Chicago Colloquium on Digital Humanities and Computer Science was hosted by the University of Chicago, November 17–19. Tesserae researchers presented two posters:

James Gawley, Christopher Forstall and Neil Coffee, “Evaluating the literary significance of text re-use in Latin poetry,” which showcased Tesserae’s scoring system; and,

Christopher Forstall and Walter Scheirer, “Revealing hidden patterns in the meter of Homer’s Iliad,” which presented results from Chris and Walter’s work on sound in Greek poetry.

While all the presentations were excellent, particularly interesting from our point of view were a number of papers which took a network view of intertextual relationships.

Hoyt Long illustrated literary coteries in Modernist Japanese poetry by analyzing the networks created when poets published in the same journals. He suggested some intriguing comparisons of similar networks from the same period in the USA and China. You can read more here.

Ryan Cordell and David Smith used some exciting methods in text alignment to locate stories reprinted with modification in antebellum American newspapers, even in very noisy texts, and then used network tools to analyze the connections between publishers. There’s a bit more here. Both this and the previous talk made exciting connections between geo-social networks in the real world and the literary networks of intertextual connections.

Mark Wolff showed a prototype interface to a database of text re-use in French western novels which allows users to visualize self-plagiarism and other text re-use as a web of connections. Try it here; read more here. This is particularly exciting for us, as our own multi-text search could perhaps one day feature a similar interface.

A lesson to be taken from all of these talks was that new light can be shed on intertextual relationships if one moves away from a binary or hierarchical framework toward something more complex and nuanced.

Martin Mueller’s keynote had particular resonance for digital Classics, reminding us that even as methods of analysis move forward, we continue to rely on old and poorly curated texts, in large part because our discipline no longer rewards editing and curation as it once did. This is a message that certainly resonates with all of us at Tesserae who have worked with adding texts…the labor involved in preparing digital texts is enormous, even when one has the benefit of the high quality data so generously provided by Perseus. It is astonishing that editing these texts is no longer acknowledged as serious scholarly work. Until academics are appropriately rewarded for their efforts in this domain, we will continue to find ourselves applying cutting-edge technology to shamefully outdated and noisy texts.

Visualizing Sound Patterns in Homer

Posted on October 16, 2012 by Chris Forstall

In his 1974 article “Sound-Patterns in Homer,” David W. Packard compared a wide range of critical opinions about the artistic use of sound in the poetics of the Iliad and Odyssey with a statistical analysis of letter frequencies. This is a seminal paper in digital humanities not only because Packard was a pioneer in designing the hardware and software necessary to digitize ancient Greek texts, but also because it addresses the interface between empirical data and critical interpretation, a problem that persists forty years on, despite huge advances in many areas of the field.

In the DHIB Textual Analysis Working Group, projects such as Tesserae attempt to adapt methods designed for such cold-blooded forensic purposes as authorship attribution and plagiarism detection to the humanistic goals of literary criticism. This means not only digitizing and analyzing, but also being able to return from statistics and data to subjective appreciation, and creating new value for readers. Here I want to show some preliminary results from my dissertation research, which benefits greatly from the intellectual cross-fertilization among the various efforts of Text Analysis. I’ll draw some parallels to Packard’s work, trying to emphasize methods that I hope show the potential for digital interpretation as well as digital analysis of literary works.

The Iliad and Odyssey are, in one way or another, the products of a long oral tradition. Despite the uncertainty that intervening changes in both pronunciation and spelling impose on any understanding we can have of these poems’ first-millennium realization, it’s clear that sound was a vital component of their composition and appreciation. Packard was primarily investigating the question of whether sound patterns were the result of deliberate poetic artistry, but others have argued that they may have served an unconscious mnemonic role, allowing illiterate singers to store vast texts in memory using a sort of data compression.

In either case, digital analysis can aid us by providing the statistics to test theories about what sort of patterns exist. But can it also help us “read” the sounds of the poem in new ways, perhaps pointing us to new hypotheses we wouldn’t otherwise have formed?

Digital Analysis

Following Packard, I begin by breaking the poems down into an alphabet of sounds, most of which have one-to-one correspondence with orthographic characters. From these atoms we can work up hierarchically to lines, either via words and n-grams, or via syllables and feet. But for now, let’s just consider the sounds themselves. The question I want to examine is, do some sounds show an interesting distribution in the poems, and, if so, what does that look like?

I downloaded the texts of the Iliad and the Odyssey from the Perseus Digital Library, concatenated them, then split them into 20-line samples. In order to get a feel for what kind of variation you might expect to see by chance alone, I created a control set where the lines of the two poems were randomly shuffled before splitting into 20-line samples. In fact, I did that ten different times. These ten control sets, then, represent a sort of background noise against which any pattern must clearly distinguish itself.
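In outline, the sampling step looks something like the sketch below. I did the real work in Perl; this Python version is only an illustration, and the fixed seed, the handling of a short final sample, and the placeholder “lines” are all assumptions made for the example.

```python
import random

def split_samples(lines, size=20):
    """Consecutive, non-overlapping chunks of `size` lines (the last may be shorter)."""
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def control_sets(lines, n_sets=10, size=20, seed=0):
    """Shuffle all the lines n_sets times and split each shuffle into samples."""
    rng = random.Random(seed)  # seeded only so the example is repeatable
    sets = []
    for _ in range(n_sets):
        shuffled = lines[:]    # copy, so the original order is left intact
        rng.shuffle(shuffled)
        sets.append(split_samples(shuffled, size))
    return sets

# Placeholder text: 100 numbered "lines" standing in for the two poems.
lines = [f"line {i}" for i in range(1, 101)]
real = split_samples(lines)
controls = control_sets(lines)
print(len(real), len(controls), len(controls[0]))
```

Each control set contains exactly the same lines as the real text, so any difference between them comes from the order of the lines alone.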

The graph below looks at the distribution of every unique pair of adjacent sounds that occurs in the two poems. The y-axis shows the portion of all samples in which a pair is found. Sound-pairs are ranged along the x-axis from most common (on average across the ten control sets) at the left, to least common at the right. The most common sound pairs occur in all samples, the least common in only one or two.

There are ten superimposed red curves, one for each of the control sets. The black curve represents the poem in its proper order. You can see that the black falls away from the red in places. Here, a sound-pair is found in rather fewer samples than you’d expect by chance alone. This means that in the original version it’s clumping up in some samples, leaving others bare.
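The quantity on the y-axis can be computed along these lines. This is a simplified Python sketch rather than the script I actually used: it treats every character as one sound and only pairs sounds within a line, and the toy samples are made up.

```python
from collections import Counter

def pair_sample_fraction(samples):
    """Fraction of samples in which each adjacent-sound pair occurs at least once."""
    seen = Counter()
    for sample in samples:
        pairs = set()  # a pair counts once per sample, however often it recurs
        for line in sample:
            sounds = list(line)
            pairs.update(zip(sounds, sounds[1:]))
        seen.update(pairs)
    return {"".join(p): c / len(samples) for p, c in seen.items()}

# Three toy samples; each sample is a list of lines.
samples = [["abab", "ba"], ["cc"], ["ab"]]
frac = pair_sample_fraction(samples)
print(frac)
```

Running the same function over the real samples and over each control set gives the black and red curves; a pair whose fraction is lower in the real text than in every control set is a candidate for clumping.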

Here’s a close-up showing two prime candidates for interesting behavior, hι and δυ. (I transliterated initial /h/ with a Latin “h” because it has no Greek letter.)

While this chart gives us a clue about which sounds might be interesting, it is a far cry from “interpretable” in a literary sense. Packard’s approach is similar. He begins with a chart showing, for each sound, the number of lines in which it does not occur at all, the number of lines in which it occurs once, twice, and so on (e.g. his Table 1). In another giant table, he lists all the lines in which a given sound occurs unusually frequently (e.g. his Table 3).

These tables serve two functions for Packard. First, where a critic has claimed that a particular line is notable for the density of some sound or other, Packard can tell at a glance how many and which other lines share the same characteristic. Second, he can survey the most “interesting” single lines to see whether they tend to be particularly charged with literary significance. But can these data be reintegrated into a new reading? Can computational techniques be turned from analysis to interpretation?

Digital Interpretation

Packard makes an exciting attempt in this direction, although he cautions that as it stands it is overly simplistic, undertaken “purely as an experiment.” He turns to the work of Dionysius of Halicarnassus, a scholar of the first century BCE who assessed the relative “harshness” of every letter of the Greek alphabet and used this as the basis for poetic criticism. Assigning to every sound a numerical value based on Dionysius’ rankings, Packard calculates for every line in the Iliad and Odyssey a “Dionysian” harshness metric.

My approach to reintegrating sound frequencies into a subjective appreciation of the larger poem draws on techniques I used when I studied satellite image processing in the Earth and Environmental Science department at Lehigh University. There we would visualize three variables from a larger set simultaneously by assigning them to red, green, and blue intensities respectively. In the following figures, each square represents twenty lines of text. The texts proceed from left to right, top to bottom, beginning with the first line of the Iliad.

In this first image, the red value represents density of the sound-pair hι, green represents ιπ, and blue represents ππ. These sounds are all components of the word ἵππος, “horse,” and the biggest bright stripe (a little more than halfway down, on the left) represents the chariot race in Iliad Book 23. Compare the picture above with the one below, made in the same way but using the first control set.

The control set shows the same variability among samples, but no large-scale patterns like the bright stripe in the first picture.
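The mapping from sound densities to colors can be sketched as below. The actual images were drawn with Processing; this Python version only shows the arithmetic, and the normalization (each channel scaled to its own maximum), the function names, and the three toy samples are assumptions for the example.

```python
def densities(sample_lines, strings):
    """Occurrences of each string per line in one sample.
    Uses simple non-overlapping substring counts."""
    n_lines = max(len(sample_lines), 1)
    return [sum(line.count(s) for line in sample_lines) / n_lines for s in strings]

def to_rgb(samples, strings):
    """One (r, g, b) tuple per sample, each channel scaled to its own maximum."""
    dens = [densities(s, strings) for s in samples]
    # Guard against a channel that never occurs, to avoid dividing by zero.
    maxima = [max(d[i] for d in dens) or 1.0 for i in range(3)]
    return [tuple(round(255 * d[i] / maxima[i]) for i in range(3)) for d in dens]

# Red, green and blue channels, as in the first image.
strings = ("hι", "ιπ", "ππ")
# Toy one-line samples, with initial /h/ written as a Latin "h".
samples = [["hιππος hιππος"], ["ανδρα"], ["hιππος"]]
rgb = to_rgb(samples, strings)
print(rgb)
```

A sample dense in all three sounds comes out white, one with none of them black, and samples in which only one or two of the sounds are frequent take on a tint.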

In my first experiment, the three variables used to create the colors tended to co-vary, being part of the same relatively common word. In the next example, they show more independence. Here I used sound triplets: red shows the density of the string δυσ, green represents χιλ, and blue represents τυδ. The frequencies of these strings are dominated by the presence of three main characters, Odysseus, Achilles, and Diomedes (“son of Tydeus”).

The huge red region at the bottom is books 5-24 of the Odyssey. The green region in the middle is where Achilles returns to the fighting in the later part of the Iliad. Near the beginning is a blue section corresponding to the Aristeia of Diomedes.

For now, this analysis remains relatively crude, and limited to showing content-driven patterns in sound, rather than purely stylistic ones. My original aim was to perform principal components analysis on all the sound frequencies together, then assign the three color intensities to the first three principal components. So far, though, it’s turned up nothing appreciably different from what you see in the control sets.

Instead, let me close with a tribute to Packard’s approach. Here I’ve calculated his “Dionysian” score for each of my samples and assigned it to a grey scale value. Brighter samples are harsher sounding, to Dionysius of Halicarnassus’ ear, at any rate, while the black squares represent the most mellifluous passages.

But Packard’s metric was designed to examine the sound of individual lines. Perhaps it would be better read in this way:

The graphs above were made using R; the other pictures, using Processing. I used Perl for everything in between. I’d appreciate advice/comments on any aspect of this from one and all…

Originally posted to the DHIB blog.