How to Calculate the Relative Influence of an Author

At the end of the first century, Quintilian asked “Is it not sufficient to model our every utterance on Cicero? For my own part, I should consider it sufficient, if I could always imitate him successfully. But what harm is there in occasionally borrowing the vigour of Caesar, the vehemence of Caelius, the precision of Pollio or the sound judgment of Calvus?”

As philologists of the 21st century, we might ask “How often did Roman authors actually borrow phrases from Caesar as opposed to Cicero?”

Caitlin Diddams and I recently published an article in Digital Scholarship in the Humanities which lays out best practices for:

  1. Determining which phrases shared between two authors did not come from a second possible source
  2. Measuring the relative strength of an “intertextual signal”
  3. Comparing the relative influence of multiple authors on a cross-section of literature

As a test-case, we compared the influence of Cicero and Caesar during the early imperial and late imperial periods.

The methodology we outline in this article can be used on any number of source and target authors, regardless of language. Our formula for calculating the strength of an intertextual signal can be used with any tool for detecting intertextuality (not just Tesserae).

To read the abstract and obtain the full article, visit the Oxford Journals website:

In our methodology, relative influence is compared using the ‘rate of intertextuality,’ a normalized representation of the number of results returned by a Tesserae search. Normalization is necessary because the length of a work influences the number of results obtained. Previous methods of normalization assumed that Tesserae’s scoring algorithm would perform consistently across various authors and genres of literature. We propose that best practice should avoid such assumptions wherever possible.

Our normalization method in brief (the following is excerpted from a pre-print copy of the article):

The number of results of two searches cannot be meaningfully compared until we consider how many results each search could have produced. The number of search results depends on two factors: the level of engagement between the authors and the length of the texts being compared. Longer texts create more sentence-by-sentence comparisons, and therefore more opportunities for unique intertexts to occur. The figure that can be meaningfully compared is not the raw number of unique results of a Tesserae search, but the ratio of the results found to the results that could have been found. We normalize the number of results according to the following formula:
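Expressed symbolically, with R the set of unique results and S and T the sets of sentences in the source and target texts (the notation here is ours, reconstructed from the prose description):

```latex
\text{rate of intertextuality} = \frac{|R|}{|S \times T|} = \frac{|R|}{|S|\,|T|}
```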

We define the rate of intertextuality as the number of connected phrases per pair of phrases considered. This is derived by dividing the cardinality of the set of results by the cardinality of the Cartesian product of the sets of sentences in the source and target texts. This cross-multiplication is necessary because Tesserae compares every sentence in a source text to all of the sentences in a target text. Therefore the number of possible results in a comparison of any source and target is the product of the number of sentences in the source and the number of sentences in the target.
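The calculation itself is simple. Here is a minimal sketch in Python (the function name and the toy numbers are ours, not Tesserae’s):

```python
def rate_of_intertextuality(results, source_sentences, target_sentences):
    """Number of connected phrase pairs per pair of phrases compared.

    results: set of (source_sentence_index, target_sentence_index) matches
    """
    possible_pairs = len(source_sentences) * len(target_sentences)
    return len(results) / possible_pairs

# Toy example: 3 unique matches found between a 100-sentence source
# and a 200-sentence target.
source = ["s%d" % i for i in range(100)]
target = ["t%d" % i for i in range(200)]
matches = {(0, 5), (17, 42), (99, 199)}
rate = rate_of_intertextuality(matches, source, target)  # 3 / 20000 = 0.00015
```

Because both searches are divided by their own number of possible pairs, the resulting rates can be compared across texts of very different lengths.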

R Workshop


On Saturday, April 12, 2014, Christopher Forstall and James Gawley conducted a workshop on Digital Text Analysis for Humanists, using the R software package. The workshop took place at the University at Buffalo (UB), and was sponsored by the Digital Scholarship and Culture Committee of the UB Techne Institute.

The program for the workshop is here: Digital Text Analysis for Humanists Workshop – Program

The data files are here: R_workshop


Ranking Results: The Scoring System

Tesserae search begins by matching a minimum of two words in one text with two words in another. The words can be matched either by their exact forms or by their dictionary headwords. Using headword matching permits, for instance, the Latin tuli to match latus, both forms of the headword fero.
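Headword matching can be illustrated with a toy lemma table (the dictionary below is a stand-in for a full Latin lemmatizer; the function names are ours):

```python
# Toy lemma dictionary; a real system consults a complete Latin lexicon.
# A form may have more than one possible headword (morphological ambiguity).
LEMMATA = {
    "tuli": {"fero"},
    "latus": {"fero", "latus"},   # ambiguous: 'carried' or 'side/flank'
    "fert": {"fero"},
}

def headwords(form):
    """Possible dictionary headwords for an inflected form."""
    return LEMMATA.get(form, {form})

def forms_match(a, b):
    """Two forms match if they share at least one possible headword."""
    return bool(headwords(a) & headwords(b))

print(forms_match("tuli", "latus"))  # True: both can resolve to 'fero'
```

Note that ambiguous forms like latus match under every headword they could belong to, which is why lemma-based searches cast a wider net than exact-form searches.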

For comparisons of even moderately sized texts, basic matching produces thousands of results. We have therefore created a scoring system to sort results by their likely interest.

Higher scores are given to parallels whose matched words are rarer and sit closer together in each text. Our testing has found that the top results produced by this method correspond well with those identified by commentators. In other words, preliminary tests show that Tesserae’s current identification and scoring processes substantially help to surface the most meaningful results.

Full testing of this system is still in progress, however, as are efforts to improve it further. In the meantime, the following description gives a somewhat more detailed account of its function.

First, the frequency of each matching term is calculated by dividing its count within its respective text by the total number of words in that text.

The frequency of a word will thus differ between the source and target texts. In a lemma-based search (the default), the count for a word includes every occurrence of any inflected form with which it shares one or more possible lemmata. These frequencies (very small fractions, even for the most common words) are then inverted, and the inverse frequencies are summed across both phrases. The result is a very large number, which is then divided by the distance covered by the matching words in the source and target phrases.

Distance in each phrase is calculated as the number of tokens spanned by (and including) the two most-infrequent matching words. The distances from the source and target phrases are added together to make the overall distance. Finally, the natural logarithm of the result is taken. This helps to bring the exponential differences in word frequencies that occur in natural language into a more linear and human-interpretable range. For a given parallel, the rarer the words are, and the closer they are together in their respective texts, the higher its score will be.
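Putting the steps above together, the score can be sketched as follows. This is a simplified stand-in for Tesserae’s actual implementation; the frequencies and spans in the example are invented:

```python
import math

def score(freqs, dist_source, dist_target):
    """Score a parallel as described above: the natural log of the summed
    inverse frequencies of the matched words, divided by the combined
    span of the match in the source and target phrases.

    freqs: frequency of each matched word, relative to its own text
    dist_source, dist_target: tokens spanned by (and including) the two
        most-infrequent matching words in each phrase
    """
    inverse_sum = sum(1.0 / f for f in freqs)
    return math.log(inverse_sum / (dist_source + dist_target))

# Rare words packed tightly together score higher than
# common words spread across their phrases.
tight_rare = score([1/5000, 1/4000, 1/6000, 1/4500], 2, 2)
loose_common = score([1/200, 1/150, 1/300, 1/250], 8, 9)
print(tight_rare > loose_common)  # True
```

The logarithm at the end is what compresses the huge spread of inverse frequencies into the roughly 0–10 range users see in the Tesserae interface.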

Adding New Texts

Adding new texts to the current version of Tesserae is relatively simple. Once we have a text properly formatted, we can run a program to process it, make a few other adjustments (like adding it to drop-down lists and entering its information into our list of sources), and it will be ready to search.

So achieving our goal of adding all Perseus classical Greek and Latin texts this year should be a walk in the park, no?

In fact, there is a bottleneck: getting the texts properly formatted for addition. We need a plain text (in .txt format) that has the proper section markers at the beginning of each line (of poetry) or section (of prose). For Perseus texts, this means automatically stripping out the XML information and inserting the section markers.
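As a rough illustration of that stripping step, here is a sketch using Python’s standard XML parser. The TEI fragment and the marker format shown are simplified stand-ins; consult the instructions document for the exact conventions Tesserae expects:

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for a Perseus-style TEI fragment; real files are much
# larger and vary in their element and attribute conventions.
SAMPLE = """<div type="book" n="1">
  <l n="1">Arma virumque cano, Troiae qui primus ab oris</l>
  <l n="2">Italiam fato profugus Laviniaque venit</l>
</div>"""

def to_tesserae_lines(xml_string, abbrev):
    """Strip markup and prefix each verse with a section marker."""
    root = ET.fromstring(xml_string)
    book = root.get("n")
    lines = []
    for l in root.iter("l"):
        marker = "<%s %s.%s>" % (abbrev, book, l.get("n"))
        lines.append("%s\t%s" % (marker, "".join(l.itertext()).strip()))
    return lines

for line in to_tesserae_lines(SAMPLE, "verg. aen."):
    print(line)
```

The real work lies in handling the many variations in Perseus markup, which is why this step remains the bottleneck.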

We’re moving forward with this work, but would welcome help. A set of instructions gives further detail on how to put texts in the correct format. Anyone who wants to pitch in should email Tesserae Fellow James Gawley for further advice. Note that, although the instructions give some examples of English texts, we have no team members working on English at the moment, and so don’t currently have the capacity to process other English texts.