Chapter 1— Making Technology Work for Scholarship: Investing in the Data

Text Encoding Initiative

The humanities computing community was among the early adopters of SGML, for two simple reasons: humanities primary source texts can be very complex, and they need to be shared and used by different scholars. Such texts can be in different languages and writing systems and can contain textual variants, nonstandard characters, annotations and emendations, multiple parallel texts, and hypertext links, as well as complex canonical reference systems. In electronic form, these texts can serve many different purposes, including the preparation of new editions, word and phrase searches, stylistic analyses, and research on syntax and other linguistic features. By 1987 it was clear that many encoding schemes existed for humanities electronic texts, but that none was powerful enough to accommodate all the different features that might be of interest. Following a planning meeting attended by representatives of leading humanities computing projects, a major international project, the Text Encoding Initiative (TEI), was launched.[7] Sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing, the TEI enlisted the help of volunteers all over the world to define what features might be of interest to humanities scholars working with electronic text. It built on the expertise of groups already working with SGML, such as the Perseus Project (then at Harvard, now at Tufts University), the Brown University Women Writers Project, and the Alfa Informatica group in Groningen, Netherlands, to create SGML tags that could be used for many different types of text.

The TEI published its Guidelines for the Encoding and Interchange of Electronic Texts in May 1994, after more than six years' work. The guidelines identify some four hundred tags, but of course no list of tags can be truly comprehensive, and so the TEI builds its DTDs in a way that makes them easy for users to modify. The TEI SGML application is built on the assumption that all texts share a common core of features, to which tags for specific application areas can be added. Very few tags are mandatory, and most of those that are concern documenting the text; they are discussed further below. The TEI Guidelines are just that: guidelines. They help the encoder identify features of interest, and they provide the DTDs with which the encoder will work. The core consists of the header, which documents the text, plus basic structural tags and common features such as lists, abbreviations, bibliographic citations, quotations, simple names and dates, and so on. The user selects a base tag set: prose, verse, drama, dictionaries, spoken texts, or terminological data. To this are added one or more additional tag sets; the options include simple analytic mechanisms, linking and hypertext, transcription of primary sources, critical apparatus, names and dates, and some methods of handling graphics. The TEI has also defined a method of handling nonstandard alphabets through a user-specified Writing System Declaration; the same mechanism serves nonalphabetic writing systems such as Japanese. Building a TEI DTD has been likened to preparing a pizza: the base tag set is the crust, the core tags are the tomato and cheese, and the additional tag sets are the toppings.
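In the SGML of the period, this pizza-style selection was made with parameter entities in the document type declaration: each tag set is switched on by declaring its entity as INCLUDE. A minimal sketch follows; the entity and DTD file names reflect common TEI P3 conventions, but the exact names should be checked against the version in use:

```sgml
<!DOCTYPE TEI.2 SYSTEM "tei2.dtd" [
  <!-- base tag set (the "crust"): here, prose -->
  <!ENTITY % TEI.prose    'INCLUDE'>
  <!-- additional tag sets (the "toppings") -->
  <!ENTITY % TEI.linking  'INCLUDE'>
  <!ENTITY % TEI.analysis 'INCLUDE'>
]>
```

The core tags, the tomato and cheese, need no declaration: they are always available once a base tag set is chosen.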

One of the issues addressed at the TEI planning meeting was the need for documentation of an electronic text. Many electronic texts now exist about which little is known: what source text they were taken from, what decisions were made in encoding them, or what changes have been made to them. All this information is extremely important to a scholar wanting to work on the text, since it will determine the academic credibility of his or her work. Unknown sources are unreliable at best and lead to inferior work. Experience has shown that electronic texts are more likely than printed ones to contain errors or missing passages, and such defects are more difficult to detect than in printed material. One of the main reasons for this lack of documentation seems to have been simply that there was no common methodology for providing it.

The TEI examined various models for documenting electronic texts and concluded that a set of SGML elements placed as a header at the beginning of the electronic text file would be the most appropriate way of providing this information. Because the header is part of the electronic text file, it is more likely to remain with the file throughout its life, and it can be processed by the same software as the rest of the text. The TEI header contains four major sections.[8] The first is a bibliographic description of the electronic text file, using SGML elements that map closely onto some MARC fields. The electronic text is an intellectual object distinct from the source from which it was created, so the source is also identified in the header. The encoding description section records the principles used in encoding the text, for example whether the spelling has been normalized and how end-of-line hyphens were treated. For spoken texts, the header provides a way of identifying the participants in a conversation and of attaching a simple identifier to each participant, which can then be used as an attribute on each utterance. Finally, the header provides a revision history of the text, indicating who made what changes to it and when.
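In the markup itself, the four sections correspond to four child elements of the teiHeader element. The following skeleton is a hedged illustration, with invented content; element names follow the TEI guidelines, but real headers carry far more detail:

```sgml
<teiHeader>
  <fileDesc>       <!-- bibliographic description of the electronic file -->
    <titleStmt>
      <title>An Example Text: an electronic transcription</title>
    </titleStmt>
    <publicationStmt>
      <p>Distributed by a (hypothetical) electronic text center.</p>
    </publicationStmt>
    <sourceDesc>   <!-- the source is identified separately from the file -->
      <p>Transcribed from the 1898 London edition.</p>
    </sourceDesc>
  </fileDesc>
  <encodingDesc>   <!-- encoding principles -->
    <editorialDecl>
      <p>Spelling not normalized; end-of-line hyphens silently removed.</p>
    </editorialDecl>
  </encodingDesc>
  <profileDesc>    <!-- e.g. participants in a spoken text -->
    <particDesc>
      <person id="P1"><p>First speaker.</p></person>
    </particDesc>
  </profileDesc>
  <revisionDesc>   <!-- revision history: who changed what, and when -->
    <change>
      <date>1995-06-01</date>
      <respStmt><name>A. Editor</name><resp>corrections</resp></respStmt>
      <item>Corrected transcription errors in chapter 2.</item>
    </change>
  </revisionDesc>
</teiHeader>
```

The id attribute on person is what allows each utterance in the transcribed conversation to point back to its speaker.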

As far as can be ascertained, the TEI header is the first systematic attempt to provide documentation for an electronic text as part of the text file itself. A good many projects are now using it, but experience has shown that it would perhaps benefit from some revision. Scholars find it hard to create good headers: some elements are very obvious, but the relative importance of the remaining elements is not so clear. At some institutions, librarians are creating TEI headers, but they need training in the use and importance of the nonbibliographic sections and in how the header is used by computer software other than the bibliographic tools with which they are familiar.

