Estimating the Size of the Corpus

We recently had the opportunity to assess where our corpus stands and thought it could be useful for users to know its aggregate numbers. A convenient point of comparison is the largest publication environment for open source texts in Greek and Latin: the Scaife Viewer, which includes Open Greek and Latin texts and all CTS-compliant texts from the Perseus Digital Library. 

The following is an estimated word count for the Tesserae corpus, broken down into a number of steps to make it clear how the calculation was made. The result is a potentially interesting overview of the corpus. 

1.) Total corpus word-count for Version 5, Greek and Latin: 19,700,723 words

2.) Total word-count for Tesserae texts not included in the Scaife Viewer (“Tesserae-only texts”): 469,270 words

  • Incidentally, the “Tesserae-only texts” are all Latin.
  • The number here is relatively small (about 2.4% of the corpus as a whole) because the overwhelming majority of texts in the Tesserae corpus are drawn from the same repositories as Scaife (OGL, Perseus DL, CSEL, First 1K Greek, etc.).

3.) Total currently available in the Scaife Viewer: 67,900,000 words (including 30,300,000 Greek and 16,500,000 Latin)

4.) Difference between the Tesserae corpus and what’s in Scaife (discounting the extra materials in Tesserae, that is, the Scaife total minus the Tesserae words that Scaife also holds): 48,668,547 words
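
For readers who want to check the arithmetic, the four steps above can be sketched in a few lines of Python. The function and variable names are ours, chosen for illustration. The sketch assumes, as the steps do, that every Tesserae word outside the “Tesserae-only texts” is also available in Scaife.

```python
# Figures copied from steps 1-3 above.
TESSERAE_TOTAL = 19_700_723   # step 1: Tesserae Version 5, Greek and Latin
TESSERAE_ONLY = 469_270       # step 2: Tesserae texts not in the Scaife Viewer
SCAIFE_TOTAL = 67_900_000     # step 3: total available in the Scaife Viewer


def shared_words(tesserae_total: int, tesserae_only: int) -> int:
    """Tesserae words assumed to be available in Scaife as well."""
    return tesserae_total - tesserae_only


def corpus_gap(tesserae_total: int, tesserae_only: int, scaife_total: int) -> int:
    """Step 4: Scaife words that Tesserae does not yet include."""
    return scaife_total - shared_words(tesserae_total, tesserae_only)


def tesserae_only_share(tesserae_total: int, tesserae_only: int) -> float:
    """Fraction of the Tesserae corpus made up of Tesserae-only texts."""
    return tesserae_only / tesserae_total


shared = shared_words(TESSERAE_TOTAL, TESSERAE_ONLY)
gap = corpus_gap(TESSERAE_TOTAL, TESSERAE_ONLY, SCAIFE_TOTAL)
share = tesserae_only_share(TESSERAE_TOTAL, TESSERAE_ONLY)

print(f"Shared with Scaife:  {shared:,} words")
print(f"Gap to Scaife:       {gap:,} words")
print(f"Tesserae-only share: {share:.2%}")
```

The gap is computed against the shared words rather than the full Tesserae total, which is why it is slightly larger than a plain subtraction of step 1 from step 3 would give.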

To search the entire body of texts available in the Scaife Viewer, Tesserae would need to add roughly 50,000,000 words (48,668,547 by the count above) of Greek and Latin from the Open Greek and Latin corpus and its associated repositories. For Tesserae, there is plenty of room for growth in this new and evolving environment of open source Greek and Latin texts.
