
Unsupervised Compositor Attribution in Historical Documents

Maria Ryskina (mryskina) 1

1. Introduction

In this project, we consider the problem of compositor attribution in historical printed documents: clustering the pages of the document by the individual (a compositor) who set the type. There were multiple compositors working on the book in parallel, and bibliographers distinguish between them by looking at their orthographic preferences (word spelling choices) as well as visual evidence (e.g. spacing before and after punctuation).

This is an important task for literary analysis. The attribution has traditionally been performed by hand, which is a time-consuming and painstaking process. We propose a novel unsupervised system for automatic compositor attribution. Our preliminary experiments show that even incorporating the observations scholars made into a simple generative model gives up to 85% agreement with the authoritative scholarly attribution. In this project we propose a nonparametric generative model that will overcome the limitations of the simple one, and tackle the challenges that arise in the inference process.

Due to the difficulty of manual compositor attribution, authoritative attributions currently exist for just a handful of books. The most important example is the First Folio of Shakespeare, the earliest printed collection of Shakespeare's works, which we will use as a case study in this project.

2. Related work

Unfortunately, there has been no prior work in automatic compositor attribution. Bibliographers have been performing it manually, using automated techniques only for gathering the statistics on which their observations rely. The closest area of NLP is stylometry, which focuses on attribution of the authorship of a text (Holmes, 1994; Hope, 1994; Juola, 2006; Koppel et al., 2009; Jockers & Witten, 2010). Authorship attribution is usually done as a supervised task. There is little unsupervised work: in the simplest case it is just a document clustering problem (Layton et al., 2013), and more complex versions include document segmentation according to author, which is irrelevant for us (Koppel et al., 2011). Stylometry methods are not applicable to our task, first of all because we need a completely different set of features: we condition on the modern version of the text, so our model needs to explain orthographic preferences, not lexical or syntactic choices as in stylometry. Conditioning also makes the model more complex, because spacing and orthography features have to be parameterized differently.

1 Carnegie Mellon University, Pittsburgh, PA 15213, USA.

10708 Class Project, Spring, 2017.

In (Ryskina et al., 2017), we propose a new model designed specifically for the task. Our model design is informed by compositor studies of Shakespeare's First Folio, drawing on the methods proposed by Hinman (1963), Howard-Hill (1973), and Taylor (1981).

Hinman's landmark 1963 study identified five different compositors based on variations in spelling among three common words. The Folio was printed before spellings were standardized, so for most words compositors were free to choose how to spell them. He relied on the assumption that the compositors would be consistent in their spelling preferences for the convenience of the typesetting process, and treated the spelling preferences (qualitatively) as separate multinomial outcomes.

Subsequent studies looked at larger sets of words and more general orthographic preferences (e.g. the preference to terminate words with -ie instead of -y), leading to modifications of Hinman's original analysis (Howard-Hill, 1973; Taylor, 1981). Another important addition from those works is the whitespace evidence: the patterns of adding whitespace before and after punctuation can also be characteristic.

We have been able to model all the patterns described above. However, some of the observations made by scholars require a more complicated approach. One of them is the number of compositors: it has been debated by bibliographers for decades, and the true number is still not known, so we do not want to constrain our model by specifying it in advance. Another important observation was the presence of multiple compositors on some pages: a page is split into a sequence of segments, each set by a different person. Our goal is to design models that reflect these assumptions and perform inference on them.


(a) Parametric BASIC model. Each compositor has a categorical distribution over spellings for each of the modern word types, and the spellings occurring on the compositor's pages are drawn from these.

(b) Parametric FEAT model. Here we model individual edit operations as well as spelling choices and incorporate those features into the model in a log-linear fashion. Spacing preferences are also modeled separately and parameterized in the same way.

Figure 1. Baseline models introduced in (Ryskina et al., 2017).

For this task, we will borrow techniques from related work on non-parametric topic models (Teh et al., 2004). Our own modeling problem is similar in that we need multiple mixture components because there could be multiple compositors per page, and at the same time we do not know the total number of compositors in advance.

3. Baseline description

We will use our previous parametric model as a baseline (Ryskina et al., 2017).

The generative models are presented in Figure 1. We assume that we have a diplomatic transcription of the text (a transcription faithful to the original orthography), automatically aligned to the modernized version of the text using word-level Levenshtein distance. Both the modern and diplomatic words $m_{ij}$ and $d_{ij}$ in each pair are observed. Each of the $I$ pages is generated independently, and the compositor $c_i$ is the latent variable for which we compute a maximum likelihood estimate. Whitespace pixel lengths are extracted from the page scans and parameterized with a multinomial distribution, conditioned on $c_i$.

BASIC model variant: The first version of the parametric model, inspired by the earliest methods proposed by scholars, serves as a simple baseline. Here we only model the spelling preferences of the compositors. Figure 1(a) shows the generative model: for page $i$, compositor $c_i$ is generated from a multinomial prior, and for each of the modern word types $m_{ij}$ on the page a diplomatic spelling is generated from a categorical distribution for this modern word and compositor:

$$P(d \mid m, c) = \mathrm{Cat}\big(\gamma_m^{(c)}\big)$$

We use the EM algorithm for inference in this model.
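As a concrete illustration, one EM iteration for this kind of page-level mixture of categorical spelling distributions can be sketched as follows. This is a minimal sketch, not the authors' implementation; the array layout (pages x modern word types x spelling variants) and the small smoothing constant are our own assumptions.

```python
import numpy as np

def em_step(counts, pi, gamma):
    """One EM iteration for a page-level mixture of categoricals.

    counts: array (I, M, S): how often page i uses spelling s for word m.
    pi: (C,) mixture prior over compositors.
    gamma: (C, M, S) per-compositor categorical spelling parameters.
    Returns updated (pi, gamma) and the posterior over compositors per page.
    """
    # E-step: log-likelihood of each page under each compositor.
    log_lik = np.einsum('ims,cms->ic', counts, np.log(gamma))
    log_post = np.log(pi) + log_lik
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)           # (I, C) responsibilities
    # M-step: expected-count updates for the prior and the categoricals.
    pi_new = post.sum(axis=0) / post.shape[0]
    gamma_new = np.einsum('ic,ims->cms', post, counts) + 1e-6  # avoid 0/0
    gamma_new /= gamma_new.sum(axis=2, keepdims=True)
    return pi_new, gamma_new, post
```

Iterating `em_step` to convergence and taking an argmax over `post` per page gives the hard compositor assignment.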

Trying to explain all possible diplomatic spellings for all words in the vocabulary would make the process extremely time-consuming and noisy. We started by considering only the three words selected by Hinman: do, go and here (referred to as HINMAN). For a more thorough analysis we constructed an automatically filtered AUTO word list, consisting of words that appear frequently enough (at least 70 occurrences) and exhibit sufficient variance in diplomatic spellings (the most common variant accounts for no more than 80% of the occurrences). Names were automatically filtered out due to noisy matching. For the more complex FEAT model, only the AUTO word list is used.
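The filtering step described above is straightforward to sketch. The function name and input format (a map from each modern word to its diplomatic spelling counts) are hypothetical; the thresholds follow the values quoted in the text.

```python
def build_auto_list(occurrences, min_count=70, max_dominance=0.8):
    """Filter an AUTO-style word list.

    Keep modern words that occur at least min_count times and whose most
    common diplomatic variant covers at most max_dominance of occurrences.
    occurrences: dict mapping modern word -> {diplomatic spelling: count}.
    """
    kept = []
    for word, variants in occurrences.items():
        total = sum(variants.values())
        if total < min_count:
            continue                         # too rare to be informative
        if max(variants.values()) / total > max_dominance:
            continue                         # spelling is effectively fixed
        kept.append(word)
    return kept
```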

FEAT model variant: For the model shown in Figure 1(b), for page $i$, compositor $c_i$ is generated from a multinomial prior. Then, each diplomatic word $d_{ij}$ is generated conditioned on $c_i$ and the corresponding modern word $m_{ij}$, from a distribution parameterized by weight vector $w_c$. Finally, each medial comma spacing width (measured in pixels), $s_{ik}$, is generated conditioned on $c_i$ from a distribution parameterized by $\theta_c$.

In this setup we generalize across different words (for example, a compositor's habit of replacing -y with -ie), and also incorporate compositors' whitespace preferences. Those features are incorporated into the model in a log-linear fashion:

$$P(d \mid m, c; w) \propto \exp\big(w_c^\top f(m, d)\big)$$

where $f(m, d)$ is a feature function defined on modern word $m$ paired with diplomatic word $d$, while $w_c$ is an edit feature weight vector corresponding to compositor $c$. Whitespace pixel lengths are generated by a multinomial distribution parameterized by $\theta_c$ and are added in a similar fashion. To do inference in this setup, we use the feature-enhanced EM algorithm proposed in (Berg-Kirkpatrick et al., 2010). The E-step is accomplished via a tractable sum over compositor assignments, while the M-step for $w_c$ is accomplished via gradient ascent. The M-step for the spacing parameters, $\theta_c$, uses the standard multinomial update. Predicting compositor groups is accomplished via an independent argmax over each $c_i$.
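The log-linear parameterization amounts to a softmax over candidate diplomatic spellings of each modern word. A minimal sketch, with a hypothetical feature function covering the -ie/-y preference mentioned earlier (the real model uses a richer set of edit features):

```python
import math

def loglinear_spelling_probs(modern, candidates, w, features):
    """P(d | m, c; w) proportional to exp(w . f(m, d)) over candidates.

    features(m, d) returns a dict of feature name -> value; w maps feature
    names to weights (conceptually one such vector per compositor).
    """
    scores = [sum(w.get(name, 0.0) * val
                  for name, val in features(modern, d).items())
              for d in candidates]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]   # numerically stable softmax
    z = sum(exps)
    return {d: e / z for d, e in zip(candidates, exps)}

def suffix_features(m, d):
    # Hypothetical feature function: word-identity feature plus -ie ending.
    return {('word', m, d): 1.0, 'ends_ie': 1.0 if d.endswith('ie') else 0.0}
```

A compositor whose `ends_ie` weight is positive will put more mass on the -ie variants, which is exactly the kind of generalization across words the FEAT model is after.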

Other baselines: Our goal is to avoid specifying the number of compositors in advance. We propose a non-parametric model to overcome this restriction. Another way to avoid a pre-specified maximum number of compositors, as suggested in the proposal feedback, is varying the number of compositors in the parametric model and then performing cross-validation. We do not have definitive arguments for choosing a non-parametric model over this; our hypothesis, however, is that a non-parametric one might be more successful. To verify this, we will use the holdout-validated parametric model as a more sophisticated baseline.

4. Proposed method

In this project we propose a non-parametric model that uses a Chinese Restaurant Process (CRP) prior on the compositor variable $c_i$. Our approach to modeling the scholarly observations is described in the previous section, but we would need a completely different set of methods for inference.

The CRP prior is defined as follows:

$$p(c_i = k \mid c_{-i}; \beta) = \begin{cases} \dfrac{I_{-i}^{(k)}}{i + \beta - 1}, & \text{if compositor } k \text{ has been seen before} \\[6pt] \dfrac{\beta}{i + \beta - 1}, & \text{if compositor } k \text{ is new} \end{cases}$$

Here $c_{-i}$ are the compositor assignments for all pages up to $i$, $I$ is the total number of pages, and $I_{-i}^{(k)}$ is the number of pages previously assigned to compositor $k$.
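In code, the CRP prior amounts to counting how many pages each existing compositor already has. A small sketch under the definitions above (the function name and return format are our own):

```python
def crp_prior(assignments, beta):
    """CRP probabilities for the next page's compositor.

    assignments: compositor ids of the previous pages (c_{-i}).
    Returns (probs_by_compositor, prob_new); the page index i is
    len(assignments) + 1, matching the 1 / (i + beta - 1) normalization.
    """
    i = len(assignments) + 1          # index of the page being assigned
    denom = i + beta - 1
    counts = {}
    for c in assignments:
        counts[c] = counts.get(c, 0) + 1
    probs = {c: n / denom for c, n in counts.items()}  # rich get richer
    return probs, beta / denom                         # mass for a new one
```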

BASIC model and collapsed Gibbs sampling: First, let us consider a nonparametric version of the BASIC model in Figure 2(a), where we only look at diplomatic spellings and $p(d \mid m, c; \gamma)$ is a multinomial distribution for each compositor. Here we will use a collapsed Gibbs sampler for an infinite mixture of multinomial components, with a Dirichlet prior on the compositor parameters $\gamma^{(c)}$. Each mixture component is a collection of multinomials, one for each modern word type, and the pages can be viewed as lists of collections of multinomial outcomes for each of those types.

For convenience, let us introduce new notation: each compositor $c$ will have a set of parameters $\gamma^{(c)} = \{\gamma_m^{(c)}\}$, where $\gamma_m^{(c)}$ are the parameters of the multinomial distribution over spellings of each modern word type $m \in M$. The collection of diplomatic spellings of modern word type $m$ occurring on page $i$ is denoted by $d_{m,i} = (d_{m,i,1}, \ldots, d_{m,i,J_{m,i}})$. We will also write $p(d_{m,i,j} \mid m, c; \gamma^{(c)})$ as $p(d_{m,i,j}; \gamma_m^{(c)}) = \mathrm{Mult}(\gamma_m^{(c)})$ to avoid redundancy in notation.

The joint distribution takes the following form (here $d_i$ and $m_i$ denote the whole sequence of diplomatic and modern words on page $i$):

$$p(\{c_i\}_I, \{d_{m,i}\}_{M,I}, \{\gamma_m^{(c)}\}_{M,C}; \beta) \propto p(\{c_i\}_I; \beta) \cdot \prod_{i=1}^{I} \prod_{m \in M} \prod_{j=1}^{J_{m,i}} p(d_{m,i,j}; \gamma_m^{(c_i)}),$$

where $p(\{c_i\}_I; \beta)$ is defined by the CRP:

$$p(\{c_i\}_I; \beta) = \prod_{i=1}^{I} p(c_i \mid c_{-i}; \beta)$$

We would like to sample the new multinomial parameters and compositor assignments from the posterior $p(\{c_i\}_I, \{\gamma_m^{(c)}\}_{M,C} \mid \{d_{m,i}\}_{M,I}; \beta)$. For that, we can use a sampling method described in (Neal, 2000). This method consists of iterating between two sampling steps: given a current Markov chain state $(\{c_i\}_I, \{\gamma_m^{(c)}\}_{M,C})$, we sequentially resample the values of $c_i$ and then the values of the parameters.

The two steps are as follows. Given an assignment of pages to compositors, we sample the next value of the compositor parameters from the posterior. Because we use a conjugate prior, we get:

$$\gamma_m^{(c)} \sim \mathrm{Dir}\big(\alpha_m^{(c)}\big)$$
$$\gamma_m^{(c)} \mid \{d_{m,i} : c_i = c\} \sim \mathrm{Dir}\big(\alpha_m^{(c)\prime}\big),$$

where $\alpha_m^{(c)\prime} = \alpha_m^{(c)} + \sum_{i : c_i = c} \sum_{j=1}^{J_{m,i}} \delta_{m,i,j}$ and $\delta_{m,i,j}$ is a one-hot vector over the spellings of $m$ (for each occurrence of spelling $d_{m,i,j}$ of modern word $m$ on a page set by compositor $c$ we add 1 to the corresponding component of $\alpha$).

Now, given the parameters for each compositor, we reassign the pages to compositors by sampling from a new distribution:

$$p(c_i = k \mid c_{-i}, \{d_{m,i}\}_M; \gamma_m^{(k)}, \beta) \propto \begin{cases} \dfrac{I_{-i}^{(k)}}{i + \beta - 1} \cdot \prod_{m \in M} \prod_{j=1}^{J_{m,i}} p(d_{m,i,j}; \gamma_m^{(k)}), & k \text{ seen before} \\[6pt] \dfrac{\beta}{i + \beta - 1} \cdot \prod_{m \in M} \prod_{j=1}^{J_{m,i}} p(d_{m,i,j}), & k \text{ new} \end{cases}$$

In the case of a new compositor we marginalize over their multinomial parameters. However, we do not need the actual values of $\gamma_m^{(c)}$ at all, so we can do the same for the existing compositors. That means we directly use the posterior predictive distribution:

$$d_{m,i} \mid d_{m,-i}, c_i \sim \mathrm{DirMult}\big(d_{m,i} \mid \alpha_m^{(c_i)\prime}\big)$$

(a) Non-parametric BASIC model. The collections of diplomatic spellings on pages are generated by an infinite mixture of categorical distributions for $(c, m)$ pairs. We introduce a Chinese Restaurant Process prior on the compositors and a conjugate Dirichlet prior on the multinomial parameters.

(b) Non-parametric FEAT model. Again, we use a CRP prior on compositors, and the feature weights are generated by an isotropic Gaussian distribution. As in the parametric model, the features are incorporated into the model using a log-linear parameterization.

Figure 2. Proposed non-parametric models.

So the Gibbs sampler will perform the following update (on the first, sequential, iteration $I$ should be replaced with $i$):

$$p(c_i = k \mid c_{-i}, \{d_{m,i}\}_M, \{d_{m,-i}\}_M; \beta) \propto \begin{cases} \dfrac{I_{-i}^{(k)}}{I + \beta - 1} \cdot \prod_{m \in M} \prod_{j=1}^{J_{m,i}} p_k(d_{m,i,j} \mid d_{m,-i}), & k \text{ seen before} \\[6pt] \dfrac{\beta}{I + \beta - 1} \cdot \prod_{m \in M} \prod_{j=1}^{J_{m,i}} p_k(d_{m,i,j}), & k \text{ new} \end{cases}$$

where the distribution $p_k(d \mid d_{m,-i})$ is a Dirichlet-multinomial posterior predictive for spellings of word $m$ set by compositor $k$, with parameters $\alpha_m^{(k)\prime}$.
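A sketch of this collapsed update: score one page against each existing compositor and against a fresh one, using the sequential (chain-rule) form of the Dirichlet-multinomial predictive with a symmetric prior. The data structures (count dictionaries, a variants map) are simplifications of our own, not the actual implementation.

```python
import math

def page_loglik(page, cluster_counts, variants, alpha):
    """Log Dirichlet-multinomial predictive of a page under one compositor.

    page: list of (modern, diplomatic) tokens; cluster_counts: dict
    (modern, diplomatic) -> count over the compositor's other pages;
    variants: dict modern -> list of possible spellings; alpha: symmetric
    Dirichlet concentration. Computed token by token via the chain rule.
    """
    counts = dict(cluster_counts)
    totals = {}
    for (m, d), n in counts.items():
        totals[m] = totals.get(m, 0) + n
    ll = 0.0
    for m, d in page:
        s = len(variants[m])
        ll += math.log((counts.get((m, d), 0) + alpha)
                       / (totals.get(m, 0) + alpha * s))
        counts[(m, d)] = counts.get((m, d), 0) + 1   # condition on this token
        totals[m] = totals.get(m, 0) + 1
    return ll

def gibbs_assign_logprobs(page, clusters, variants, alpha, beta, n_pages):
    """Unnormalized log-probs for reassigning one page: CRP prior times
    predictive, for each existing compositor plus a brand-new one."""
    out = {}
    denom = math.log(n_pages + beta - 1)
    for k, (size, ccounts) in clusters.items():
        out[k] = math.log(size) - denom + page_loglik(page, ccounts, variants, alpha)
    out['new'] = math.log(beta) - denom + page_loglik(page, {}, variants, alpha)
    return out
```

Sampling the new assignment then just means exponentiating, normalizing, and drawing from these scores.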

FEAT model: Figure 2(b) shows the proposed non-parametric version of the FEAT model. For compositors we still use a CRP prior, but since feature weights do not form a probability distribution, we change the prior on the weights to an isotropic Gaussian: $w_c \sim \mathcal{N}(0, \sigma^2 I)$.

In this case our model loses conjugacy, so we need to introduce a new sampling procedure. Here we will be sampling from the following distribution:

$$p(c_i = k \mid c_{-i}, d_i; w_k, \beta) \propto \begin{cases} \dfrac{I_{-i}^{(k)}}{i + \beta - 1} \cdot \prod_{j=1}^{J_i} p(d_{ij} \mid m_{ij}; w_k), & k \text{ seen before} \\[6pt] \dfrac{\beta}{i + \beta - 1} \cdot \prod_{j=1}^{J_i} \displaystyle\int p(d_{ij} \mid m_{ij}; w)\, p(w)\, dw, & k \text{ new} \end{cases}$$

Without conjugacy, we cannot compute the integral above directly. We approximate it using a Monte Carlo estimate with samples from the posterior. We also cannot sample from the posterior directly, so we will use importance sampling (Rasmussen & Ghahramani, 2003) with a Gaussian proposal.

In the second step of the Gibbs sampling procedure we also sample new values of $w_c$ from the posterior $p(w_c \mid \{\{d_{m,i}\}_M : c_i = c\})$. For that we can use Metropolis-Hastings sampling with a symmetric proposal distribution $q(w^* \mid w) = \mathcal{N}(w, \sigma^2 I)$.

The newly sampled parameter value $w_c^*$ is then accepted with probability

$$a(w_c^*, w_c) = \min\left(1,\; \frac{p(\{\{d_{m,i}\}_M : c_i = c\}; w_c^*)\; p(w_c^*; \sigma)}{p(\{\{d_{m,i}\}_M : c_i = c\}; w_c)\; p(w_c; \sigma)}\right)$$

The rest of the sampling procedure stays the same as in the non-marginalized version of the BASIC model.
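One Metropolis-Hastings update with the symmetric Gaussian proposal can be sketched as below. Here `log_target` stands for the log of the unnormalized posterior, i.e. the log of $p(\text{data}; w)\, p(w; \sigma)$, which the surrounding model would supply; everything else is generic MH.

```python
import math
import random

def mh_step(w, log_target, sigma, rng=random):
    """One Metropolis-Hastings update with a symmetric Gaussian proposal.

    w: current weight vector (list of floats). With a symmetric proposal,
    the acceptance probability reduces to min(1, target ratio), computed
    in log space for stability. Returns (new_w, accepted).
    """
    proposal = [wi + rng.gauss(0.0, sigma) for wi in w]
    log_a = log_target(proposal) - log_target(w)
    if math.log(rng.random()) < min(0.0, log_a):
        return proposal, True
    return w, False
```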


(a) Number of compositors recovered by the marginalized version of the model. The two bottom plots correspond to the HINMAN word list; the top plots show the 160-word AUTO list. The effect of the CRP parameter $\beta$ is also demonstrated: the higher it is, the more likely the model is to form new clusters.

(b) Number of compositors recovered by the non-marginalized version of the model. For the HINMAN setup the result stays approximately the same. With the AUTO word list, the model has too many parameters and starts by forming around 700 clusters, then gradually reduces the number.

Figure 3. Number of compositors identified by the nonparametric BASIC model vs. number of Gibbs sampling iterations, averaged over 5 random restarts.

5. Experimental setup

5.1. Dataset

We are using the Bodleian digitized copy of the First Folio¹. The modern play texts are taken from the MIT Complete Works of Shakespeare² and aligned with the Bodleian diplomatic transcriptions by running a word-level edit distance calculation. We choose to use a common modern edition of Shakespeare's plays rather than a modernized diplomatic version to test the robustness of the model.
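Word-level alignment by edit distance can be sketched with standard dynamic programming. This is a generic DP sketch, not the paper's alignment code, and the substitution cost function is a placeholder supplied by the caller.

```python
def align(modern, diplomatic, cost):
    """Word-level Levenshtein alignment of two token sequences.

    Returns a list of (modern_word_or_None, diplomatic_word_or_None) pairs.
    cost(a, b) scores substituting a with b (0 for a perfect match);
    insertions and deletions cost 1.
    """
    n, m = len(modern), len(diplomatic)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = float(i)
    for j in range(1, m + 1):
        D[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + 1,                       # delete
                          D[i][j - 1] + 1,                       # insert
                          D[i - 1][j - 1]
                          + cost(modern[i - 1], diplomatic[j - 1]))
    # Trace back to recover the pairing.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and D[i][j] == D[i - 1][j - 1]
                + cost(modern[i - 1], diplomatic[j - 1])):
            pairs.append((modern[i - 1], diplomatic[j - 1])); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            pairs.append((modern[i - 1], None)); i -= 1
        else:
            pairs.append((None, diplomatic[j - 1])); j -= 1
    return pairs[::-1]
```

With a cost function that rewards orthographic similarity, modern "do" pairs with diplomatic "doe" while extra or missing tokens align to `None`.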

5.2. Model selection

In the parametric case, we ran a number of random restarts for each model and then chose the one with the highest log marginal likelihood over all the pages for evaluation. However, in the non-parametric case evaluating the log marginal likelihood becomes challenging, so instead we choose to evaluate the log marginal likelihood of holdout data. We randomly select 100 out of 885 pages for the holdout set, learn the compositor parameters and assignments from the other pages (choosing the best of 5 random restarts) and then marginalize over compositor assignments on the holdout set. To reconstruct the compositor assignment on the holdout part for evaluation, we choose the compositor for each page by maximum likelihood estimation, independently per page. This is not fully correct in the non-parametric case, because we restrict the model to the number of compositors it recovered in the training data.

¹ http://firstfolio.bodleian.ox.ac.uk/
² http://shakespeare.mit.edu/

For the parametric baseline with holdout validation we perform the same procedure: for each number of compositors we choose the random restart that gives the best log marginal likelihood for the training set, evaluate the log marginal likelihood on the holdout set and choose the number of compositors that produced the best holdout likelihood.

5.3. Evaluation

The purpose of our experiments is to measure agreement between the compositor attributions recovered by our models and the ones proposed by bibliographers. We evaluate against an authoritative attribution compiled by Peter Blayney (Blayney, 1996) which includes the work of various scholars (Hinman, 1963; Howard-Hill, 1973; 1976; 1980; Taylor, 1981; O'Connor, 1975; Werstine, 1982). We also compare to an earlier, highly influential model proposed by Hinman (1963), which we approximate by reverting certain compositor divisions in Blayney's attribution. Hinman's attribution posited five compositors, while Blayney's posited eight; in the parametric case, we set the maximum number of compositors to C = 8 and C = 5 respectively for evaluation. Since we are interested in modeling specific observations, we will compare the predictions of the models on the HINMAN word list to Hinman's attribution and on the larger AUTO word list to Blayney's attribution.

For baseline evaluation we used one-to-one and many-to-one accuracy, mapping the recovered page groups to the gold compositors to maximize accuracy, as is standard for many unsupervised clustering tasks, e.g. POS induction (see Christodoulopoulos et al. (2010)). For the non-parametric case, computing one-to-one accuracy might not be possible because the number of compositors varies, so we only use many-to-one accuracy.
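Many-to-one accuracy is simple to compute: each predicted cluster votes for its majority gold compositor, and the induced labeling is scored. A minimal sketch:

```python
from collections import Counter

def many_to_one_accuracy(predicted, gold):
    """Many-to-one accuracy: map each predicted cluster to its most frequent
    gold compositor, then score the induced labeling. Works for any number
    of predicted clusters, unlike one-to-one matching."""
    by_cluster = {}
    for p, g in zip(predicted, gold):
        by_cluster.setdefault(p, Counter())[g] += 1
    # Each cluster contributes the count of its majority gold label.
    correct = sum(c.most_common(1)[0][1] for c in by_cluster.values())
    return correct / len(gold)
```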

However, mapping our clusters to the authoritative attribution might not be as informative in our setting, since we do not impose restrictions on the number of clusters. Another metric we can use here is the pair-counting F1 measure (Achtert et al., 2012), where we take each possible pair of pages and check whether the two pages are in the same cluster in each of the two attributions.
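The pair-counting F1 can be sketched directly from its definition (quadratic in the number of pages, which is unproblematic at this scale):

```python
from itertools import combinations

def pair_f1(predicted, gold):
    """Pair-counting F1: over all page pairs, a pair is positive when both
    pages share a cluster. Precision and recall compare the predicted
    co-clustering against the gold attribution."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(gold)), 2):
        same_pred = predicted[i] == predicted[j]
        same_gold = gold[i] == gold[j]
        if same_pred and same_gold:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_gold:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```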

5.4. Hyperparameters

For all the experiments, we use $\alpha = (0.1, 0.1, \ldots, 0.1)$ for the prior on each of the spelling multinomials. We used the CRP strength parameter $\beta = 0.1$ and also looked at the effect of varying $\beta$ in Subsection 6.1.

6. Results and analysis

6.1. Number of compositors

The most interesting part of the analysis is to see how many compositors our models predict on the data. For the parametric models we set this number beforehand; however, for Blayney's 8-compositor case our parametric models split all the pages into 5 clusters, leaving out the other 3 compositors. We also ran the holdout validation model for maximum numbers of compositors between 2 and 10; surprisingly, it shows the best marginal likelihood for 2 compositors in both the Hinman and Blayney attribution cases.

For the non-parametric models we looked at how the number of compositors changes across iterations of Gibbs sampling (by iteration we mean one cycle of resampling all the variables). Figure 3(a) shows the results for the marginalized BASIC model. The CRP model captures exactly the same pattern as the scholars did: when it only looks at the spellings of the three words Hinman studied, it predicts 3-5 compositors, while Hinman posited 5. For the longer word list it predicts 7-9 compositors, and the current authoritative attribution posits 8. We also look at the effect of the strength parameter $\beta$: increasing it makes the model more likely to create new clusters, but the final number of compositors only varies slightly.

Figure 3(b) demonstrates the same dependency for the non-marginalized model, where we sample all the compositor parameters explicitly. Nothing changes for the HINMAN word list, but for AUTO the number of parameters ends up being so high that the model is confused. At the first iteration it usually predicts 600-700 compositors, and then the number gradually decreases. We ran 1000 sampling iterations, by which point the number goes down to only 35-40.

6.2. Comparison to manual attribution

The accuracy measures for all the experiments are presented in Table 1. In the chosen setting, the parametric HOLDOUT model predicted 2 compositors for both cases, NP-BASIC predicted 6 and 36 compositors respectively, and NP-BASIC-MARGINALIZED predicted 4 and 7 compositors respectively. The non-parametric models tend to show a better F-1 score than the parametric ones, but the many-to-one accuracy drops. However, these metrics do not seem very informative for comparing models with a variable number of compositors.

Model Setup                Hinman Attr       Blayney Attr
                           F-1    M-to-1     F-1    M-to-1
RANDOM                     19.5   49.6       28.4   49.6
P-BASIC                    54.0   73.2       54.4   81.4
P-HOLDOUT                  71.1   70.3       68.9   64.2
NP-BASIC                   57.8   67.1       66.0   74.0
NP-BASIC-MARGINALIZED      59.7   66.6       70.6   77.6
P-FEAT w/ EDIT + WORD      70.9   81.1       70.6   80.6
P-FEAT w/ ALL              75.6   83.7       69.8   83.4

Table 1. The experimental results for both parametric and nonparametric models. The table shows the pair-counting F-1 measure and the many-to-one accuracy of mapping the predicted clusters to the compositors in the manual attribution. A random baseline is included for comparison.

References

Achtert, Elke, Goldhofer, Sascha, Kriegel, Hans-Peter, Schubert, Erich, and Zimek, Arthur. Evaluation of clusterings: metrics and visual support. In Data Engineering (ICDE), 2012 IEEE 28th International Conference on, pp. 1285-1288. IEEE, 2012.

Berg-Kirkpatrick, Taylor, Bouchard-Côté, Alexandre, DeNero, John, and Klein, Dan. Painless unsupervised learning with features. In Proceedings of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2010.

Blayney, Peter W. M. (ed.). The First Folio of Shakespeare: The Norton Facsimile. Norton, 1996.

Christodoulopoulos, Christos, Goldwater, Sharon, and Steedman, Mark. Two decades of unsupervised POS induction: How far have we come? In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2010.

Hinman, Charlton. The Printing and Proof-Reading of the First Folio of Shakespeare, volume 1. Oxford: Clarendon Press, 1963.

Holmes, David I. Authorship attribution. Computers and the Humanities, 28(2):87-106, 1994.

Hope, Jonathan. The Authorship of Shakespeare's Plays: A Socio-Linguistic Study. Cambridge University Press, 1994.

Howard-Hill, Trevor H. The compositors of Shakespeare's Folio Comedies. Studies in Bibliography, 26:61-106, 1973.

Howard-Hill, Trevor H. Compositors B and E in the Shakespeare First Folio and Some Recent Studies. Self-published, 1976.

Howard-Hill, Trevor H. New light on compositor E of the Shakespeare First Folio. The Library, 6(2):156-178, 1980.

Jockers, Matthew L. and Witten, Daniela M. A comparative study of machine learning methods for authorship attribution. Literary and Linguistic Computing, 25(2):215-223, 2010.

Juola, Patrick. Authorship attribution. Foundations and Trends in Information Retrieval, 1(3):233-334, 2006.

Koppel, Moshe, Schler, Jonathan, and Argamon, Shlomo. Computational methods in authorship attribution. Journal of the American Society for Information Science and Technology, 60(1):9-26, January 2009.

Koppel, Moshe, Akiva, Navot, Dershowitz, Idan, and Dershowitz, Nachum. Unsupervised decomposition of a document into authorial components. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1, pp. 1356-1364. Association for Computational Linguistics, 2011.

Layton, Robert, Watters, Paul, and Dazeley, Richard. Automated unsupervised authorship analysis using evidence accumulation clustering. Natural Language Engineering, 19(01):95-120, 2013.

Neal, Radford M. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9(2):249-265, 2000.

O'Connor, John. Compositors D and F of the Shakespeare First Folio. Studies in Bibliography, 28:81-117, 1975.

Rasmussen, Carl Edward and Ghahramani, Zoubin. Bayesian Monte Carlo. In Advances in Neural Information Processing Systems, pp. 505-512, 2003.

Ryskina, Maria, Alpert-Abrams, Hannah, Garrette, Dan, and Berg-Kirkpatrick, Taylor. Automatic compositor attribution in the First Folio of Shakespeare. arXiv preprint arXiv:1704.07875, 2017.

Taylor, Gary. The shrinking compositor A of the Shakespeare First Folio. Studies in Bibliography, 34:96-117, 1981.

Teh, Yee Whye, Jordan, Michael I., Beal, Matthew J., and Blei, David M. Sharing clusters among related groups: Hierarchical Dirichlet processes. In NIPS, pp. 1385-1392, 2004.

Werstine, Paul. Cases and compositors in the Shakespeare First Folio Comedies. Studies in Bibliography, 35:206-234, 1982.

