Sound Features in Tesserae Version 5

Tesserae compares any two texts of the user’s choice, either line by line or phrase by phrase. The user can choose to match on any one of five “features”: form, lemmata, semantic, combined lemma and semantic, or sound. Pairs of lines with the highest scores appear at the top of the results with the “matching” words highlighted. All of this functionality was available in version 3, and now we are proud to make it available, through a different implementation, in version 5. Among the most recently implemented of these is the sound feature.

The idea of the sound feature is to score texts based on how much they sound alike.  Using this feature, researchers might find that some authors echoed each other not in the terms they used or even their sentiment, but in sound.  This analysis might reveal shared alliterative choices or even wordplay.  

We chose to represent sound features with character-level trigrams. While we briefly considered something more phonetically precise, such as IPA (International Phonetic Alphabet) transcription or dividing words into syllables, we concluded that these methods were more elaborate than users needed. Latin and Greek have more regular orthographic systems than English, so the standard spelling of a word serves as a reasonable phonetic representation of it.

We did, however, omit all diacritical marks from the sound features. Our first reason is that a diacritic can be stored as a separate character from the letter it modifies (in Unicode, a combining mark). A set of three letters in which one letter carries a diacritical mark would then be processed as a sequence of four characters. When that sequence of four characters is divided into two trigrams (our trigrams are produced by a 3-character window that moves forward one character at a time until the end of the window meets the end of the word), the diacritic might be present in one trigram while the character it is meant to modify is not. Such a trigram would carry little information.
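The sliding window and the diacritic problem described above can be sketched in a few lines of Python. This is a minimal illustration, not the Tesserae code itself: the function names `strip_diacritics` and `trigrams` are hypothetical, and the choice to produce no trigrams for words shorter than three characters is an assumption of the sketch.

```python
import unicodedata


def strip_diacritics(word):
    """Decompose the word (Unicode NFD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def trigrams(word):
    """Slide a 3-character window across the word, one character at a time.

    Assumption of this sketch: words shorter than 3 characters yield no trigrams.
    """
    return [word[i:i + 3] for i in range(len(word) - 2)]


# "θεά" is three letters, but in decomposed form the accent is a fourth character.
decomposed = unicodedata.normalize("NFD", "θεά")
print(len(decomposed))  # 4

# The second trigram of the decomposed word holds the accent without the theta;
# after stripping, the word yields a single clean trigram.
print(len(trigrams(decomposed)))           # 2
print(trigrams(strip_diacritics("θεά")))   # ['θεα']
```

With the diacritic removed, the three-letter word produces exactly one trigram, and that trigram can match the same letter sequence wherever it occurs, accented or not.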

This brings us to our second reason for omitting diacritics. The more distinct characters there are, the more possible kinds of trigrams there are. The more kinds of trigrams there are, the less likely it is that any two trigrams will match. Many pairs of sequences that do, in truth, produce the same sound would not be identified as matches because one sequence is missing a diacritic. While diacritical marks supply additional phonetic and morphological information to readers, they mostly supply noise to a matching algorithm like ours.

The sound features are stored in the database with the word type they belong to, just like the form, lemmata, and semantic features, and they are matched the same way too. The one exception is scoring. While those other features are scored by the frequency of the words they belong to and the distance between those words, we chose to score sound features according to the frequency and distance of the trigrams themselves. Since sound similarity will be of greatest interest to users who are searching on this feature, it seemed most appropriate to represent it with the rarity and spacing of the sound segments rather than the rarity and spacing of their words, which are not likely to be the same words as in the matched line.
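To make the idea of trigram frequency concrete, here is a small Python sketch. It is only an illustration: the text above does not say how Tesserae normalizes frequencies, so the sketch assumes that a trigram's frequency is its count divided by the total number of trigrams in the text. The function names are hypothetical.

```python
from collections import Counter


def trigrams(word):
    """Slide a 3-character window across the word, one character at a time."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def trigram_frequencies(words):
    """Relative frequency of each trigram in a list of words.

    Assumption of this sketch: frequency = count / total trigrams in the text.
    """
    counts = Counter(t for word in words for t in trigrams(word))
    total = sum(counts.values())
    return {t: count / total for t, count in counts.items()}


# Five trigrams in all: arm, rma, arm, rmi, mis
freqs = trigram_frequencies(["arma", "armis"])
print(freqs["arm"])  # 0.4
print(freqs["mis"])  # 0.2
```

Under this assumption a rare trigram has a low frequency and therefore a high inverse frequency, which is what lets rare sound segments contribute more to a score than common ones.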

A pair of lines may have many matches, but only the distance of the rarest pair of trigrams will be calculated. This is where choosing to score based on the trigrams themselves makes the biggest difference. If scoring were word-based, then matched trigrams occurring within the same word would receive a distance of 0, because the distance between a word and itself is 0. This is significant because, although a pair with a smaller distance tends to receive a higher score, a pair with a distance of 0 is discarded. Pairs that receive the highest scores on account of distance when scoring by trigram would not even make it to the output when scoring by word. The distance from the source and the distance from the target are added together. The final scoring formula for each pair of lines is the same as the default:

score = ln( (sum of the inverse frequencies of the matched sound features, calculated from both source and target) / distance )
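The scoring formula can be written out as a short Python function. This is a sketch under stated assumptions rather than the Tesserae implementation: the function name `sound_score` and its parameters are hypothetical, the frequencies and distances are taken as already computed, and the zero-distance check is applied to the combined distance.

```python
import math


def sound_score(source_freqs, target_freqs, source_distance, target_distance):
    """score = ln(sum of inverse frequencies / distance).

    source_freqs, target_freqs: frequencies of the matched trigrams in each text.
    source_distance, target_distance: distance between the rarest pair of
    matched trigrams in each text; the two are added together.
    Assumption of this sketch: a pair is discarded (None) when the combined
    distance is 0.
    """
    distance = source_distance + target_distance
    if distance == 0:
        return None
    inverse_sum = sum(1 / f for f in source_freqs) + sum(1 / f for f in target_freqs)
    return math.log(inverse_sum / distance)


# Two matched trigrams with frequencies 0.5 and 0.25 in both texts,
# distances of 1 (source) and 2 (target): ln((2 + 4 + 2 + 4) / 3) = ln(4)
print(sound_score([0.5, 0.25], [0.5, 0.25], 1, 2))
```

In this example the inverse frequencies sum to 12 and the combined distance is 3, so the score is ln 4, roughly 1.386. Rarer trigrams raise the numerator and closer spacing lowers the denominator, so both push the score up.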

While we do not yet have screenshots of sound matching in action from the website, below are screenshots of some unsorted demo results:

From excerpts of Vergil’s Aeneid and Lucan’s Pharsalia:

From excerpts of Homer’s Iliad and Plato’s Gorgias: