
Shakespeare His Contemporaries:
Collaborative Curation of EEBO-TCP Texts with AnnoLex

By Martin Mueller (Northwestern University)

1. What is AnnoLex?

AnnoLex is a collaborative data curation tool for use with EEBO-TCP texts. It is useful for the identification and correction of incompletely or incorrectly transcribed words. It can also be used for the manual correction of algorithmically applied lemmatization and part-of-speech tagging. It is built using the Python-based Django framework and stores its data in a MySQL database. AnnoLex has been developed by Craig Berry under a grant from Academic Research Technologies at Northwestern University.

This document explains how to use AnnoLex for the second-phase curation of "Shakespeare His Contemporaries," a corpus of approximately 500 non-Shakespearean plays between 1576 and 1642 that underwent a first-phase curation at the hands of five Northwestern undergraduates working with me during the summer of 2013. Nayoon Ahn, Hannah Bredar, Madeline Burg, Nicole Sheriko, and Melina Yeh used AnnoLex to fix ~36,000 of approximately 56,000 manifest textual errors, which is not bad for a first-round rough cleanup. They focused on errors that could be corrected with a high degree of confidence by consulting EEBO images accessible from within AnnoLex. Most of the residual errors require a look at the original printed page to be fixed with confidence.

This document addresses you as a person who can be persuaded to think of an EEBO-TCP text as a "collaboratively curatable object" (CCO) and who wants to contribute to the improvement of this or that text by looking at the pages of the printed source, preferably the copy in the Rare Book Library that provided the microfilm source for the digital scan from which the text was transcribed. Looking at the printed original will in many cases make it easy to spot a transcriptional error that could not be identified from the digital scan. There are 465 plays that still contain one or more known errors. Half of them contain fewer than twenty problems, most of them simple and fixable in an hour or so.

A play that has gone through a second-phase clean-up and from which all or most known errors have been removed is not a perfect diplomatic edition, let alone a critical one. It may contain philological “cruxes” that stubbornly resist solution. Was it an "Indian" or "Iudean" that "threw away a pearl richer than all his tribe"? A lot of philological good is done if all or most of the mundane and solvable problems are fixed first, leaving the more interesting problems for further and possibly endless speculation. In the interim a text with only its hard problems unsolved will be a text that is good enough for most readerly purposes. And a dramatic corpus that has undergone this level of curation is unlikely to bias or distort corpus-wide inquiries.

2. How to use AnnoLex for second-phase curation

The current instance of AnnoLex builds on last summer's work and contains only the text of pages that still contain one or more known error(s). A known error is a place in the text where the transcriber could not identify a letter, word, or passage and marked it as a gap. AnnoLex uses the following symbols for different types of gaps:

  1. The black dot (●) is used for missing letters on 6,687 occasions
  2. The lozenge (◊) is used for missing words on 4,524 occasions
  3. The ellipsis (…) is used for a "span" of indeterminate length, but of fewer than three words, on 1,377 occasions
  4. The black square (■) is used for an ambiguous punctuation mark on 7,318 occasions

The transcribers were conscientious in their counting of different gaps. Thus a spelling like 'W●●t' probably means that the transcriber accurately counted two missing letters. But it is not certain. Black dots at the end of a word not infrequently denote punctuation marks that should have been transcribed as black squares. The reverse error is very rare.
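The gap symbols listed above can be recognized mechanically. The following Python sketch is only an illustration and is not taken from AnnoLex itself; the names `GAP_SYMBOLS`, `count_gaps`, and `ends_with_letter_gap` are hypothetical. It counts the gaps in a token and flags a trailing black dot, which, as noted above, may really stand for a punctuation mark.

```python
from collections import Counter

# Gap symbols as listed above, mapped to the kind of gap they mark.
GAP_SYMBOLS = {
    "●": "missing letter",         # black dot
    "◊": "missing word",           # lozenge
    "…": "span",                   # ellipsis
    "■": "ambiguous punctuation",  # black square
}


def count_gaps(token: str) -> Counter:
    """Count each kind of gap symbol that occurs in a token."""
    return Counter(GAP_SYMBOLS[ch] for ch in token if ch in GAP_SYMBOLS)


def ends_with_letter_gap(token: str) -> bool:
    """Flag a trailing black dot, which may really be a punctuation mark."""
    return token.endswith("●")


print(count_gaps("W●●t")["missing letter"])  # 2
print(ends_with_letter_gap("hono●"))         # True
```

A token flagged by `ends_with_letter_gap` is only a candidate for closer inspection; the printed page decides whether the dot hides a letter or a punctuation mark.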

3. How Does AnnoLex Work?

AnnoLex has two major views: Correct and Review. They are accessible from the top menu bar. You may look at either, but you must be logged in to suggest corrections in the Correct view, and you must have special editorial privileges to approve corrections in the Review view. Unless you are a reviewer, you can ignore the Review panel. A future revision of AnnoLex may include a feature that lets you review and delete or amend your own corrections. But this useful feature is not yet available.

A future revision of AnnoLex will also allow you to create your own user account. For the time being you can only get a user account by asking me for it via email at martinmueller@northwestern.edu.

Neither a correction nor its approval changes the underlying source text. Think of a correction as an annotation attached to a place in the text and of its approval as an additional annotation about the status of that correction. The actual correction of the source texts is a separate process. However, if a correction has been approved, AnnoLex will display the corrected text so that the same error will not be corrected multiple times.
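The relationship between source text, corrections, and display can be pictured with a small Python sketch. It is a hypothetical model, not AnnoLex's implementation: the token ids, the dictionary layout, and the function `displayed_spelling` are invented for illustration, and the second correction is made-up sample data. The sketch also assumes that a suggestion still awaiting approval is not displayed, which this document does not state either way.

```python
# Hypothetical in-memory stand-ins for the source text and the correction table.
source_text = {"tok-1": "staunderous", "tok-2": "W●●t"}

corrections = [
    {"token_id": "tok-1", "new_spelling": "slaunderous", "approved": True},
    {"token_id": "tok-2", "new_spelling": "What", "approved": False},
]


def displayed_spelling(token_id: str) -> str:
    """Return the approved correction if one exists, else the source spelling."""
    for correction in corrections:
        if correction["token_id"] == token_id and correction["approved"]:
            return correction["new_spelling"]
    return source_text[token_id]


print(displayed_spelling("tok-1"))  # slaunderous (approved correction is shown)
print(displayed_spelling("tok-2"))  # W●●t (suggestion not yet approved)
print(source_text["tok-1"])         # staunderous (source text is unchanged)
```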

3.1 The Correct View

In the Correct View your browser window is divided into three parts. The entire right half is taken up by a display panel. The upper left part is a search panel in which you define what you are looking for. The lower left part is an edit panel where you make your suggestions.

The display panel shows you a Spelling in Context with the spelling highlighted between the left and right context. In separate columns it shows the spelling, lemma, and POS tag. The last column contains an Edit button, which activates the Edit panel. For the purpose of this second-phase curation, the lemma and POS values are irrelevant.

3.1.1 The Search Panel

The Search panel gives you various options for constraining your search. There are nine options, which you may combine in any way, including some that are unlikely to produce useful results. For the purpose of correcting the residual errors in a given play the two critical options are Text and Filter.

The Text option has a drop-down menu that lets you choose from plays listed by author and title. Ignore the All option, which does not return coherent results with the current data set of AnnoLex.

The Filter option lets you choose between All and Preselected. The Preselected filter selects all tokens that contain a black dot, a black square, an ellipsis, or a lozenge. In other words, it selects all the "tokens of interest" for this curation phase. If you correct all of them and your corrections are approved, the play you curated joins the list of plays with no known errors (which is not the same thing as a play with no errors).
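In effect, the Preselected filter is a membership test on the four gap symbols. A minimal Python sketch of that test follows; it is an illustration only, and `GAP_CHARS` and `is_preselected` are hypothetical names, not part of AnnoLex.

```python
# The four gap symbols: black dot, lozenge, ellipsis, black square.
GAP_CHARS = set("●◊…■")


def is_preselected(token: str) -> bool:
    """True if the token contains at least one gap symbol."""
    return any(ch in GAP_CHARS for ch in token)


tokens = ["The", "W●●t", "◊", "king", "said■"]
print([t for t in tokens if is_preselected(t)])  # ['W●●t', '◊', 'said■']
```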

4. The Edit Panel

The Edit panel occupies the lower left part of your browser window. You must be logged in to save your corrections. If, after clicking the Edit button, you do not see a button that says "View EEBO Image", you are not logged in.

If you click on the Edit button on the right edge of the display panel, the top line of the Edit panel changes and displays a command of the following kind:

Edit word 17-b-3790 from Jacob and Esau

"17-b-3790" is a three-part unique identifier where

  1. The first part, consisting of one or more digits, is a digital image identifier and retrieves that image, which is nearly always a double page. N.B. The image number refers to the EEBO image set. It is not the page number of the printed original.
  2. The second part ('a' or 'b') tells you whether the word is found on the left ('a') or right ('b') side of a double page image.
  3. The third part is a word counter incrementing by ten. In this case it identifies the word as word 379.
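The three parts of the identifier can be taken apart with a few lines of Python. This is a sketch under the assumption that identifiers always follow the digits-letter-digits pattern shown above; `WordId` and `parse_word_id` are hypothetical names, not AnnoLex functions.

```python
from typing import NamedTuple


class WordId(NamedTuple):
    image: int    # EEBO image number (not the printed page number)
    side: str     # 'a' = left side, 'b' = right side of the double page image
    counter: int  # word counter, incrementing by ten

    @property
    def word_number(self) -> int:
        """Ordinal position of the word, e.g. counter 3790 -> word 379."""
        return self.counter // 10


def parse_word_id(word_id: str) -> WordId:
    """Split an identifier such as '17-b-3790' into its three parts."""
    image, side, counter = word_id.split("-")
    if side not in ("a", "b"):
        raise ValueError(f"unexpected page side {side!r} in {word_id!r}")
    return WordId(int(image), side, int(counter))


word = parse_word_id("17-b-3790")
print(word.image, word.side, word.word_number)  # 17 b 379
```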

Your first task is to use the image number to find the page number of your print original. If the original had page numbers they will show up on the digital scan, and finding the page is trivial. If the original has no page number, it may take a little ingenuity to find the right page quickly.

The word counter will tell you whether to look for the word towards the top, the middle, or the bottom of your page. A look at the size of the page and the type font will let you calculate the rough number of words on a page: typically between 250 and 350, but as many as 1,000 in the case of double-column folio texts, such as the 1647 edition of Beaumont and Fletcher.
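The rough calculation can be written out as follows. This Python sketch is an illustration only: `page_region` is a hypothetical helper, the division of the page into thirds is an arbitrary choice, and the words-per-page figure is whatever estimate you arrive at for your own copy.

```python
def page_region(word_number: int, words_on_page: int) -> str:
    """Guess whether a word sits near the top, middle, or bottom of a page."""
    if words_on_page <= 0:
        raise ValueError("words_on_page must be positive")
    # Cap at 1.0 in case the page estimate is too low for this word number.
    fraction = min(word_number / words_on_page, 1.0)
    if fraction < 1 / 3:
        return "top"
    if fraction < 2 / 3:
        return "middle"
    return "bottom"


print(page_region(50, 300))    # top
print(page_region(379, 1000))  # middle
print(page_region(290, 300))   # bottom
```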

When you click on the Edit button something else happens: the labels for the Spelling, Lemma, and POS fields of the Edit panel will be populated with data from the row whose Edit button you clicked. Those values will stay the same until you click another Edit button and populate the labels with new values.

4.1 What happens when you correct an error?

In order to correct an error, you must have a user account and log in to AnnoLex. You cannot create your own user account but must request it from martinmueller@northwestern.edu.

It is important to have a clear sense of what happens or does not happen when you spot an error and correct it. It is impossible for you to overwrite the original text of the source. A correction you make is a suggestion that is recorded as a distinct transaction and passed on to an editor for review and approval.

You can suggest new values for Spelling, Lemma, or POS, separately or together. You may ignore the values for Lemma or POS, but do make use of Annotation where appropriate. It is a text field that lets you enter free text of any kind. If you cannot decide on the proper reading of a word, a simple entry like 'crux' creates a record saying that a word has been looked at but no solution has been found. That is useful.

Be sure to click the "Save" button to save your correction or annotation. Clicking the Save button enters a user transaction in a separate correction table that automatically records:

  1. your user id
  2. a time stamp for the transaction
  3. the token id associated with the correction
  4. your suggested new value for the spelling
  5. an optional annotation indicating the rationale for the change
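The five items above can be pictured as one record per transaction. The Python sketch below is a hypothetical model of such a record; the actual schema of the AnnoLex correction table is not given in this document, so the class name `Correction`, the field names, and the sample user id are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class Correction:
    """One saved transaction; field names are illustrative, not AnnoLex's."""
    user_id: str
    token_id: str
    new_spelling: Optional[str] = None  # None if only an annotation is left
    annotation: Optional[str] = None    # optional rationale or note
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


# A spelling correction with a rationale.
fix = Correction("curator1", "17-b-3790", new_spelling="slaunderous",
                 annotation="checked against print original")

# A word that was examined but could not be resolved.
crux = Correction("curator1", "17-b-3790", annotation="crux")

print(fix.new_spelling, crux.new_spelling)  # slaunderous None
```

The record is frozen to mirror the point that a correction is a distinct transaction: a later change of mind would be a new record rather than an edit of the old one.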

You can see a record of that transaction if you switch from the Correct to the Review view. There you see a curation log, and its five-column table tells you something about the workflow of AnnoLex. The first column shows the correction with the original text in strike-through mode and the replacement in bold. You see who made the correction and when. You see whether the correction has been approved and by whom. You also see whether the approved correction has been "applied," that is, incorporated into the source text.

4.2 Three types of curation: Update, Insert, Delete

In the lower left corner of the Edit field you see a drop-down selector that lets you choose from three values: Update, Insert, Delete. They refer to three different modes of curation. The default setting is Update. Keep in mind that this selector sets a mode of operation, but does not perform any action itself. It is the Save button that executes the operation.

4.2.1 Update

In an Update operation you change the value of a spelling, lemma, or POS tag, but this change does not affect the sequence of tokens in the underlying text. Changing "staunderous" to "slaunderous" is an update. This is by far the most common form of curation, and it is a simple and single-step operation.

4.2.2 Insert and Delete

Words that are wrongly joined or split are quite common in Early Modern drama texts. Most of the cases were caught and fixed in the first-phase curation. The current AnnoLex procedures for insert and delete operations are cumbersome and error prone. If you come across wrongly joined or split words, the simplest thing to do is to leave a note in the annotation field saying 'wrongly joined' or 'wrongly split'.

4.3 A digression about long "s"

In some TCP texts long "s" is transcribed as such. In others it is recorded as an ordinary "s." Practice is consistent within a text but inconsistent across the corpus. The AnnoLex data tables show all forms of "s" in the normalized form. But in the texts from which those tables are derived long "s" is preserved if it was transcribed in the first place. Use the modern "s" in all your corrections. There will be ways of adjusting spellings for those texts originally encoded with long "s."

5. What AnnoLex cannot do

AnnoLex is a tool for the collaborative curation of the most common errors in TCP texts. It does not help with missing paragraphs or pages for several reasons, including the simple fact that the page images may be missing. The transcribers of TCP texts were instructed to ignore passages written in other alphabets, mainly Greek or Hebrew. Transcription of such passages requires quite specialized palaeographical skills. The same is true of texts with a lot of mathematical, scientific, or musical notation. Successful curation of texts with such passages probably requires an approach in which you begin with a census of what is missing and seek to match texts with interested and qualified curators. Such "brokering" is likely to be more important than the choice of a particular curation tool.

The TCP transcriptions include some, but not very many, errors in TEI-encoding. Prose and verse are not always encoded in <p> or <l> tags, a stage direction may be tagged as a speaker label, and so forth. AnnoLex is of no help for such cases. There are important forms of curation that go beyond the correction of obvious errors. For instance, only about half of Early Modern plays have cast lists and are clearly divided into acts and scenes. AnnoLex is not a proper tool for the enrichment of texts with metadata, although it is an excellent tool for the review and correction of algorithmically supplied linguistic metadata.

In summary, AnnoLex is quite good at what it does, but what it does is quite limited. However, and to repeat an earlier point: if the scholarly community that works with Early Modern data took to collaborative and dispersed curation, using AnnoLex to fix all or most of the little things that can be fixed with it, many of the EEBO-TCP texts would be in much better shape and, excepting texts with sizable lacunae, most of them would be good enough for most purposes. And the great thing about "good enough" is that it is good enough.