
Cluster analysis and "Plain Sermons": my experience at DHSI
Robert Ellison, Marshall University

Can computer tools help identify the contributors to an anonymously-published volume of Victorian sermons?

This is the question I explored at the Digital Humanities Summer Institute, which took place June 4-8, 2018, at the University of Victoria. My class, taught by David Hoover of NYU, was called “Out-of-the-Box Text Analysis for Digital Humanities.” The title had a double meaning: we would (1) use free or low-cost programs right out of the box, with no programming required, and (2) think outside the box to use these programs in innovative ways.

The text I chose to explore was Volume VII of Plain Sermons, by Contributors to the "Tracts for the Times," a collection I have used in research projects on Tractarianism and the Oxford Movement (I skipped to Volume VII because it has multiple contributors and is one of the first volumes to come up in a search of the Internet Archive).

Like the other 9 volumes in the series, it was published anonymously, but we now know that the contributors were John Keble, George Prevost, Robert Francis Wilson, and Isaac Williams (contributors to other volumes included John's brother Thomas Keble, E. B. Pusey, and John Henry Newman).

Authorship attribution, a branch of "stylometry," is based on statistical analyses of word frequencies and other stylistic features of a group of texts. The work we did in class involved 4 steps:

  1. Preparing the texts to be analyzed
  2. Generating word frequencies via the Intelligent Archive, a free, Java-based program developed at the University of Newcastle in Australia
  3. Doing some minor clean-up and data preparation in Excel
  4. Generating cluster analyses in Minitab, a statistics program that offers reasonably priced 6- and 12-month licenses for faculty and students

Steps 2-4 are described in detail on one of Prof. Hoover's websites. Rather than trying to replicate his work, I will simply summarize how I conducted my experiment.

I started by downloading a plain-text version of Volume VII from the Internet Archive. As you can see, the OCR is not very good, but we decided not to take the time to try to clean it up. All I did was put <div> and </div> tags around each sermon so they would be analyzed as separate texts.
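I added the tags by hand, but this step could also be scripted. Here is a minimal Python sketch, assuming each sermon in the Internet Archive text begins with a heading like "SERMON I." (the heading pattern and the file name PS7.txt are illustrative assumptions, not features of my actual workflow):

    import re

    # Read the raw OCR text downloaded from the Internet Archive.
    with open("PS7.txt", encoding="utf-8") as f:
        raw = f.read()

    # Split at each sermon heading, keeping the heading with its sermon.
    parts = re.split(r"(?=SERMON\s+[IVXL]+\.)", raw)
    sermons = [p for p in parts if p.strip()]

    # Wrap each sermon in <div> tags so the Intelligent Archive
    # treats it as a separate text.
    with open("PS7_marked.txt", "w", encoding="utf-8") as f:
        for s in sermons:
            f.write("<div>\n" + s.strip() + "\n</div>\n")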

I then added the text to the Intelligent Archive and generated the word frequencies. In the image below, PS7 is the file name (shorthand for "Plain Sermons, Volume 7"); the columns are the sermons, as designated by the <div> tags; and the frequencies are expressed as proportions of the whole.
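The Intelligent Archive is a point-and-click program, but a rough Python equivalent of the table it produces might look like the sketch below. Its exact tokenization and word-list settings are not reproduced here, and the 100-word cutoff is simply an assumption for illustration:

    import re
    from collections import Counter

    def tokenize(text):
        return re.findall(r"[a-z]+", text.lower())

    # `sermons` is the list built in the previous sketch.
    counts = [Counter(tokenize(s)) for s in sermons]
    totals = [sum(c.values()) for c in counts]

    # Pool all the sermons to find the most frequent words overall.
    pooled = Counter()
    for c in counts:
        pooled.update(c)
    top_words = [w for w, _ in pooled.most_common(100)]

    # For each top word, the proportion of each sermon's tokens that
    # the word accounts for -- the "proportions of the whole."
    freqs = {w: [c[w] / t for c, t in zip(counts, totals)]
             for w in top_words}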

Finally, I did the clean-up in Excel, pasted the data into Minitab, and got a dendrogram of the results. (In brief, a dendrogram is a graphic representation of the correlations among sets of data, in this case texts: the sets that are most similar are grouped together, and the shorter the connecting lines, the greater the similarity between them.)
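Minitab is likewise a graphical program, but the same kind of hierarchical cluster analysis can be sketched in Python with scipy. Ward linkage is one common choice in stylometric work; Minitab's defaults may differ, so treat this as an approximation rather than a replication of what I actually did:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy.cluster.hierarchy import linkage, dendrogram

    # Rows = sermons, columns = the top words from the previous sketch.
    matrix = np.array([[freqs[w][i] for w in top_words]
                       for i in range(len(sermons))])

    labels = [f"Sermon {i + 1}" for i in range(len(sermons))]
    Z = linkage(matrix, method="ward")
    dendrogram(Z, labels=labels, leaf_rotation=90)
    plt.tight_layout()
    plt.show()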

It appears, then, that the sermons do in fact cluster into groups, so the next step was to see if they were correctly grouped by author. I went back into Minitab, added authorship information to the column with the sermon numbers, and generated the dendrogram again. This is what I saw:

Success! It turns out that Minitab was 100% accurate in grouping the sermons by author. It's interesting, though, that the Keble sermons were grouped into 2 clusters rather than 1. It might be worthwhile to reexamine the collection and try to figure out what sermons 14-19 and 20-25 have in common (this is, in fact, one of the principal tenets of computational analysis: the software can provide data, but there still needs to be a human reader who will return to the texts to posit answers and craft an argument).
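In the Python sketch above, regenerating the labeled dendrogram amounts to swapping the bare sermon numbers for author-plus-number labels. The mapping below is only a placeholder; the real assignments come from the published attributions:

    # Placeholder mapping from sermon number to author; in practice this
    # would list every attribution (Keble, Prevost, Wilson, Williams).
    authors = {14: "Keble", 15: "Keble"}  # ...and so on

    labels = [f"{authors.get(i + 1, '?')} {i + 1}"
              for i in range(len(sermons))]
    dendrogram(Z, labels=labels, leaf_rotation=90)
    plt.show()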

Prof. Hoover and I were both surprised that the experiment worked so well, given the poor quality of the texts we had to work with. The next step, which we did not have time to do in class, would be to add in other sermons known to be written by these contributors to see whether we would get similar correlations.

Even without that next step, we have a viable "proof of concept." As Prof. Hoover noted, if computational analysis can work accurately with a known set of texts, we can have a reasonable expectation that it can help identify the authorship of unknown sets. A conclusion should not, of course, be based on a single iteration of a single analysis, but if multiple experiments yield similar results, we have strong grounds for believing that our conclusions are correct.
