How to Calculate the Relative Influence of an Author

At the end of the first century, Quintilian asked “Is it not sufficient to model our every utterance on Cicero? For my own part, I should consider it sufficient, if I could always imitate him successfully. But what harm is there in occasionally borrowing the vigour of Caesar, the vehemence of Caelius, the precision of Pollio or the sound judgment of Calvus?”

As philologists of the 21st century, we might ask “How often did Roman authors actually borrow phrases from Caesar as opposed to Cicero?”

Caitlin Diddams and I recently published an article in Digital Scholarship in the Humanities which lays out the best practices for determining:

  1. Which phrases shared between two authors did not come from a second possible source
  2. How to measure the relative strength of an “intertextual signal”
  3. How to compare the relative influence of multiple authors on a cross-section of literature

As a test-case, we compared the influence of Cicero and Caesar during the early imperial and late imperial periods.

The methodology we outline in this article can be used on any number of source and target authors, regardless of language. Our formula for calculating the strength of an intertextual signal can be used with any tool for detecting intertextuality (not just Tesserae).

To read the abstract and obtain the full article, visit the Oxford Journals website: https://academic.oup.com/dsh/article-abstract/doi/10.1093/llc/fqx038/4061474/Comparing-the-intertextuality-of-multiple-authors

In our methodology, relative influence is compared according to the ‘rate of intertextuality,’ a normalized representation of the number of results returned by a Tesserae search. Normalization is necessary because the length of a work influences the number of results obtained. Previous methods of normalization assumed that Tesserae’s scoring algorithm would perform consistently across various authors and genres of literature. We propose that best practice should avoid such assumptions wherever possible.

Our normalization method in brief (the following is excerpted from a pre-print copy of the article):

The number of results of two searches cannot be meaningfully compared until we consider how many results each search could have produced. The number of search results depends on two factors: the level of engagement between the authors and the length of the texts being compared. Longer texts create more sentence-by-sentence comparisons, and therefore more opportunities for unique intertexts to occur. The figure which can be meaningfully compared is not the number of unique results of a Tesserae search, but the ratio of the results found to the results that could have been found. We normalize the number of results according to the following formula:
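In symbols (my own reconstruction from the definition below; the letters R, I, S, and T do not appear in the article itself):

R = |I| / |S × T| = |I| / ( |S| × |T| )

where R is the rate of intertextuality, I is the set of unique search results, and S and T are the sets of sentences in the source and target texts.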

We define the rate of intertextuality as the number of connected phrases per pair of phrases considered. This is derived by dividing the absolute value of the set of results by the absolute value of the cross-product of the sets of sentences in source and target texts. This cross-multiplication is necessary because Tesserae compares every sentence in a source text to all of the sentences in a target text. Therefore the number of possible results in a comparison of any source and target is the product of the number of sentences in the source and the number of sentences in the target.

Appendix to “Measuring the Presence of Roman Rhetoric: An Intertextual Analysis of Augustine’s De Doctrina Christiana IV”

This appendix contains the intertextual parallels that inform the paper “Measuring the Presence of Roman Rhetoric: An Intertextual Analysis of Augustine’s De Doctrina Christiana IV” published in Mouseion Vol. 14 No. 3, Open Digital Corpora of Greek and Latin. The search parameters for these comparisons are listed at the beginning of each file. Please direct any questions to Caitlin Diddams at acstaab@buffalo.edu or James Gawley at jamesgaw@buffalo.edu.

Vita Washingtonii vs. DDC

Germania vs. DDC IV

Bello Gallico vs. DDC IV

Dialogus vs. DDC IV

Orator vs. DDC IV

Institutio Oratoria vs. DDC IV


Abstract:

This paper examines the intertextual relationship between Augustine’s De Doctrina Christiana IV and Cicero’s Orator. We use quantitative methods to compare Augustine’s level of engagement with Orator against his engagement with other handbooks of classical Latin rhetoric. Our results inform a close reading of the text-as-body metaphor in DDC 4.13. Augustine incorporates Ciceronian colometry into his presentation of the epistles to demonstrate Paul’s eloquence. We argue that Augustine’s comparatively heavy use of Cicero is an attempt to justify the use of rhetoric in Christian teaching while adapting that rhetoric to Christian purposes.

Lemma + Semantic Matching: Capture More Parallels

A “lemma-based search” identifies the co-occurrence of the same word with different inflections, and this is the basis for version 3.0 of the Tesserae software. In a benchmark test of Pharsalia 1 vs. Aeneid, 55% of parallels previously noted by commentators were retrieved by lemma-based search.

Previous posts have discussed the method of generating metonym (synonym, antonym, hyponym, etc.) dictionaries through the application of topic modeling. In order to capture more of our target intertexts, we worked to generate the most accurate possible metonym dictionary, then combined it with lemmatization in order to simultaneously capture different inflections of the same word and metonyms of that word.
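As a rough illustration of how the two features combine, here is a schematic Python sketch (not the Tesserae implementation; the lemmatizer and metonym dictionary are assumed inputs standing in for Tesserae’s own resources):

```python
def tokens_match(token_a, token_b, lemmatizer, metonyms):
    """Return True if two inflected tokens share a lemma, or if a lemma of
    one token is listed as a metonym (synonym, antonym, hyponym, ...) of a
    lemma of the other.

    `lemmatizer` maps an inflected form to its possible lemmas;
    `metonyms` maps a lemma to the set of lemmas associated with it."""
    lemmas_a = set(lemmatizer.get(token_a, [token_a]))
    lemmas_b = set(lemmatizer.get(token_b, [token_b]))
    if lemmas_a & lemmas_b:
        return True  # same word, different inflection (lemma match)
    expanded_a = set()
    for lemma in lemmas_a:
        expanded_a |= metonyms.get(lemma, set())
    return bool(expanded_a & lemmas_b)  # metonym of the same word
```

In a Tesserae-style search, a pair of phrases sharing at least two such matched words would then be reported as a parallel.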

In a repeat of the benchmark test above, the new ‘synonym + lemma’ feature retrieved commentator parallels at a rate of 68% (other search settings remained the same).

[Figure: feature comparison of benchmark retrieval rates]

View a version 3.1 synonym + lemma benchmark test.

View a version 3.0 lemma-based benchmark test.

A Greek to Latin Dictionary

As mentioned in previous posts, the Tesserae team has been working to create a digital Greek to Latin dictionary to aid in the retrieval of cross-language text reuse. Tesserae interns Nathaniel Durant, Theresa Mullin, and Elizabeth Hunter collectively assessed 1,000 Greek words, determining which, if any, method for producing a cross-language dictionary produced accurate translations. The winner proved to be an enhanced version of Chris Forstall’s topic-model-based ‘pivot’ method.

I crunched the numbers we got from the 1,000 translations tested by our faithful collaborators, and used them to generate the best possible Greek-to-Latin dictionary. Chris’s algorithm produced up to two Latin translations for each Greek word, with a similarity value attached to each translation. I set out to find a good ‘cutoff’ value for the probability of a translation. I balanced precision and recall according to the following criteria:

  1. It was very important to us that we retain at least 1 accurate translation.
  2. It was very important to us that we avoid retaining inaccurate translations.

Because we have two possible translations for most words, it proved best to use two different similarity-score cutoffs for translations A and B. The result is a Greek-to-Latin dictionary which correlates 34,000 Greek words with at least one semantically related Latin word. We have reason to believe that this dictionary is accurate at a rate of 75%–80%, according to our own parameters for accuracy (because we are searching for allusions, this is not a ‘translation’ dictionary; we consider antonyms and all other metonyms to be valid associations).
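In outline, the filtering step looks something like the following Python sketch (the cutoff values and data structures are illustrative placeholders, not the figures we actually chose):

```python
# Illustrative cutoffs only; the real values were chosen by balancing the
# two criteria above against the 1,000 hand-checked translations.
CUTOFF_A = 0.5   # threshold for the best-ranked translation
CUTOFF_B = 0.7   # stricter threshold for the second-ranked translation

def build_dictionary(candidates):
    """`candidates` maps a Greek lemma to up to two (latin_lemma, similarity)
    pairs, best first, as produced by the topic-model 'pivot' method."""
    dictionary = {}
    for greek, pairs in candidates.items():
        kept = []
        for rank, (latin, score) in enumerate(pairs[:2]):
            cutoff = CUTOFF_A if rank == 0 else CUTOFF_B
            if score >= cutoff:
                kept.append(latin)
        if kept:
            dictionary[greek] = kept
    return dictionary
```

In this sketch, a Greek word whose candidates both fall below their cutoffs is simply left out of the dictionary.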

Publications on our methodology are forthcoming. For now, please experiment with the tool at http://tesserae.caset.buffalo.edu/cross.php. We welcome your feedback.

Augustine vs. The Rhetoricians

The following data is the basis for an article entitled “Paul is the New Cicero: Repurposing Roman Rhetoric in Augustine’s De Doctrina Christiana,” under review with the journal Mouseion. These files are archived here for the benefit of readers who wish to inspect the results of Tesserae comparisons in greater detail than is possible in the article. The first file contains the results of a comparison of Augustine’s De Doctrina Christiana to Cicero’s Orator.

The following links lead to comma-separated-value (CSV) files which can be opened in any spreadsheet editor.

Below are links to tab-separated-value files, whose contents represent raw data collected in a batch Tesserae search:

  • runs: each line represents a single comparison and its details.
  • scores: coded by the numbers found in the ‘runs’ file, each row represents the number of results returned at a given score level.

Knauer’s list of parallels between Aeneid (Book 1) and the Iliad

Knauer’s original commentary on the Aeneid listed places of parallelism with Homer’s Iliad, but did not specify criteria for intertextuality. The Google Docs spreadsheet below pairs Knauer’s citations of Aeneid Book I with his citations of the Iliad and lists the verbal correspondences between the Latin and the Greek. This work began in September 2013 and was intermittently edited and expanded until June 2014, by which time it was largely complete.

It is free to use with credit to Tesserae and Konnor Clark, who compiled the list.

https://docs.google.com/spreadsheet/ccc?key=0AmBfs72ChHaodDJPV2s1Mk1EeW5lRm5HNnRLN1hHV2c&usp=sharing

How the text-alignment method works

As explained by Chris Forstall in his earlier post, we are currently experimenting with a new cross-language detection feature over on the Tesserae Development server. We are using two different approaches, and the naïve Bayesian alignment approach bears a little explanation. The purpose of this post is to provide a simple introduction to the theory behind the algorithm; a link to my Perl script which aligns two texts in Tesserae format will be provided at the end.

To begin with, let’s assume we have a corpus which consists of the same text in two languages. Let’s further assume that our texts are perfectly aligned, sentence-by-sentence (the difficulty of finding texts like this has led us to use the New Testament for our experiments). We want to know which word in language A corresponds to which word in language B. Initially, we assign each word an equal probability. Here’s a simple example sentence in Greek and Latin:

Sentence A (Language A): Amo libros legere
Sentence B (Language B): Φιλω βιβλους ἀναγιγνωσκειν

We’re going to try to figure out which word is a translation of Amo. First we assign an equal probability to all translation candidates. Because there are three words in Sentence B, the probability that Amo corresponds to Φιλω is 0.33, and the probability that it corresponds to βιβλους is also 0.33 (remember that a probability of 1.0 means that something is definitely true). The key to correctly lining up Latin words with their Greek translations is repetition. Let’s add another aligned sentence to our comparison:

Sentence A (Language A): Amo philosophiam
Sentence B (Language B): Φιλω φιλοσοφιαν

This time, the sentence from language B doesn’t contain βιβλους or ἀναγιγνωσκειν, so it’s less likely that either of those are legitimate translations for Amo. Φιλω has also appeared again, so the probability assigned to a possible Amo/Φιλω alignment is increased.

The equation that smooths out the probabilities of each conceivable alignment over the course of many, many sentences is called Bayes’ theorem. It looks like this:

P(A|B) = P(B|A) × P(A) / P(B)

Here’s what the first part, P(A|B), means to us: “the probability that word A in language A is a correct translation of word B in language B.” The next part, P(B|A), means “the probability that word B in language B is the correct translation of word A in language A.” You’ll notice that putting these two statements on opposite sides of an ‘equals’ sign looks a little like circular logic. The key here is that Bayes’ theorem works backward in order to more appropriately weight the probability associated with each possible translation candidate. This will become clearer in the next paragraph. The rest of the equation has to do with ‘smoothing’ the results; remember that our goal is to correctly weight these probabilities according to the pattern which emerges through repetition. The next two parts, P(A) and P(B), mean, for our purposes, “the probability of word A occurring in language A” and “the probability of word B occurring in language B.” For these probabilities we substitute “the number of occurrences of word B in the ‘language B’ (or word A in the ‘language A’) text, divided by the total number of words in that text.”

Because Bayes’ theorem works backward from translation to antecedent, the application of this theorem in text alignment can look a bit complicated. This is how it works: to determine P(A|B) for any given Latin word, the program looks at all the sentences (actually Bible verses in our corpus) which contain that word. We’ll call this Verse Group 1. The program then gathers up all the Greek words in the corresponding Greek verses. These Greek words are our translation candidates, and we look at each of them in turn. To calculate P(B|A) (the probability of the original Latin word, given the current Greek translation candidate), the program looks at all the Greek verses which contain the translation candidate. We can call this group of verses ‘Verse Group 2.’ The program then gathers up all the Latin words in the Latin versions of Verse Group 2. The important factor here is that we’re grabbing a different set of verses than those in Verse Group 1. The amount of overlap between Verse Group 1 and Verse Group 2 depends on how good a translation candidate we’re looking at. In other words, when we look back from Greek to Latin, we may find verses that don’t contain the original Latin word under scrutiny. This is especially true if the Greek translation candidate is not actually the word we ultimately want; if we are looking at the wrong Greek word, we’ll end up gathering a bunch of Latin verses which don’t contain our original word and that will lower the value of P(B|A).

The rest of the program is what my high school physics teacher used to call “plug and chug.” ‘Probabilities’ are really just the number of times that a given word appears divided by the total number of words in the group in which it appears. An important feature of this approach is that for each word we examine, the program returns the probability of an alignment between that word and each possible translation word–just like in the first set of sentences at the top of this post. Many tools for this type of operation can be found online; a popular one is mGIZA. My own code for this project can be found on GitHub.
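To make the procedure concrete, here is a minimal Python sketch that follows the verse-group steps described above (the function name and data structures are mine; it illustrates the approach, not the Perl script itself):

```python
from collections import Counter

def alignment_scores(latin_verses, greek_verses, latin_word):
    """Score every Greek translation candidate for one Latin word.
    latin_verses and greek_verses are parallel lists of token lists,
    aligned verse by verse."""
    latin_tokens = [w for v in latin_verses for w in v]
    greek_tokens = [w for v in greek_verses for w in v]
    latin_freq = Counter(latin_tokens)
    greek_freq = Counter(greek_tokens)

    # Verse Group 1: verses whose Latin side contains the word under scrutiny.
    group1 = [i for i, v in enumerate(latin_verses) if latin_word in v]
    # Translation candidates: every Greek word in the corresponding Greek verses.
    candidates = {w for i in group1 for w in greek_verses[i]}

    p_a = latin_freq[latin_word] / len(latin_tokens)   # P(A): corpus frequency
    scores = {}
    for candidate in candidates:
        # Verse Group 2: verses whose Greek side contains the candidate.
        group2 = [i for i, v in enumerate(greek_verses) if candidate in v]
        group2_latin = [w for i in group2 for w in latin_verses[i]]
        # The post's P(B|A): occurrences of the Latin word divided by the
        # total number of Latin words gathered from Verse Group 2.
        p_b_given_a = group2_latin.count(latin_word) / len(group2_latin)
        p_b = greek_freq[candidate] / len(greek_tokens)  # P(B): corpus frequency
        # Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
        scores[candidate] = p_b_given_a * p_a / p_b
    return scores

# Toy call on the two aligned sentences above; real runs need many verses
# before the pattern created by repetition emerges.
latin = [["amo", "libros", "legere"], ["amo", "philosophiam"]]
greek = [["φιλω", "βιβλους", "αναγιγνωσκειν"], ["φιλω", "φιλοσοφιαν"]]
print(alignment_scores(latin, greek, "amo"))
```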

Feel free to ask questions or leave feedback in the comments section.


All Perseus Latin Added, plus some Greek and English

When the Tesserae tool is demonstrated to Classics researchers, the most frequently asked question by a wide margin is: “When can you add my text?” It’s a fair request, and a testament to the interest in computer-assisted investigation of intertextuality. Our answer to date has always been “we’re working on it.” This has not been a brush-off; in fact the rapid addition of new texts to our searchable corpus has been one of the top priorities of the 2012-2013 academic year, if not the top priority. I am pleased to report that we have increased the size of our searchable corpus by a factor of 10, from approximately eight hundred thousand words at the beginning of the 2012-2013 academic year to over eight million words at the time of this writing.

Before I launch into a detailed account of our progress, there are two things I want to make clear:

  1. There are other development teams on the Tesserae project which never stopped growing the functionality of the system while my team worked to expand its reach. The results can be seen in (among other things) the new multi-text search, and the much, much faster back-end we now enjoy.
  2. The massive increase in the Tesserae corpus is the result of a team effort. Veterans of our scoring team returned and were joined by fresh faces who contributed to a deep talent-pool as attested on our personnel page. We were also given a much-needed boost by Chris Forstall, as I will explain.

Our goals for the year included the incorporation of the entirety of the Perseus classical corpus into Tesserae and the addition of important work in the English language. In order to add the texts from the Perseus database, it was crucial to preserve the hierarchy of text, book, and line-number with which the works were already annotated. Tesserae makes use of these markers and to strip the information out would be a waste. Yet there were several obstacles.

First, Perseus texts were added and annotated by many different researchers over a period of several years. Each text presented unique problems to its annotator. Where these problems repeat themselves, a single annotator might solve them the same way each time–but different annotators developed unique solutions. This complicated Chris Forstall’s task of creating a universal parser for the Perseus XML.

In addition, some of the variation in the XML structure of the annotated texts is a natural result of differences in the structure of the works themselves. Plays are not organized like novels, which are not organized like histories. The variety of structures is hinted at in Chris Forstall’s blog post. What is worse, the texts of several authors have been made more complicated by their textual tradition. Take Cicero, for example.

At one point, Cicero’s works were organized by text and chapter. These chapter numbers were based on the page numbers of a very early print publication, and they often break up the text mid-sentence. Later tradition re-divided the work into text, book, chapter, and line–often interrupting the old divisions. Modern texts still include the old chapter numbers as well as the new. The researchers at Perseus, in their effort to faithfully reproduce the information contained in a print volume, include both numbering systems simultaneously in the XML structure of the text. Untangling these conflicting annotations correctly is time-consuming; luckily we were able to rely on the wise counsel of John Dugan and the tireless efforts of Anna Glenn in order to incorporate every single text in Cicero’s oeuvre into Tesserae. For those who aren’t aware, that’s a good chunk of Latin. Cicero’s works contain nearly half of the words in the existing canon of classical Latin.

In fact, the project was able to dramatically increase the size of the corpus in all aspects. Some numbers:

# of words in the Tesserae corpus circa August, 2012: 795,141
# of words in the Tesserae corpus circa June, 2013: 8,198,402


That’s a more than tenfold increase. It was made possible by the initial work of Chris Forstall, who developed a universal XML-parsing tool for use on the Perseus corpus, and the sustained efforts of our force of volunteers.

The increase has been so dramatic that we are currently considering new methods of organization to relieve the now-overburdened menu system. In addition, still more texts are being processed even as I write, so if you’d like to make sure your particular text will be incorporated into the corpus, feel free to drop us a line, and be assured: we’re working on it.