Making Technology Work for Scholarship
Investing in the Data
The introduction of any kind of new technology is often a painful and time-consuming process, at least for those who must incorporate it into their everyday lives. This is particularly true of computing technology, where the learning curve can be steep, what is learned changes rapidly, and ever more new and exciting things seem to be perpetually on the horizon. How can the providers and consumers of electronic information make the best use of this new medium and ensure that the information they create and use will outlast the current system on which it is used? In this chapter we examine some of these issues, concentrating on the humanities, where the nature of the information studied by scholars can be almost anything and where the information can be studied for almost any purpose.
Today's computer programs are not sophisticated enough to process raw data sensibly. This situation will remain true until artificial intelligence and natural language processing research has made very much more progress. Early on in my days as a humanities computing specialist, I saw a library catalog that had been typed into the computer without anything to separate the fields in the information. There was no way of knowing what was the author, title, publisher, or call number of any of the items. The catalog could be printed out, but the titles could not be searched at all, nor could the items in the catalog be sorted by author name. Although a human can tell which is the author or title from reading the catalog, a computer program cannot. Something must be inserted in the data to give the program more information. This situation is a very simple example of markup, or encoding, which is needed to make computers work better for us. Since we are so far from having the kind of intelligence we really need in computer programs, we must put that intelligence in the data so that computer programs can be informed by it. The more intelligence there is in our data, the better our programs will perform. But what should that intelligence look like? How can we ensure that we make the right decisions in creating it so that computers can really do what we
want? Some scholarly communication and digital library projects are beginning to provide answers to these questions.
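The catalog example can be made concrete with a minimal sketch. The records, field names, and functions below are hypothetical and invented purely for illustration; the point is only that sorting by author and searching titles become trivial once each field is explicitly labelled in the data.

```python
# Hypothetical catalog records; every name, title, and call number here is
# invented for illustration and does not describe a real catalog.
catalog = [
    {"author": "Smyth, Jack", "title": "A History of Ink",
     "publisher": "Example Press", "call_number": "Z100 .S6"},
    {"author": "Adams, Mary", "title": "Paper and Print",
     "publisher": "Sample Books", "call_number": "Z200 .A3"},
    {"author": "Brown, Ann", "title": "Ink on Vellum",
     "publisher": "Example Press", "call_number": "Z110 .B7"},
]


def sort_by_author(records):
    """Return the records ordered by the explicitly labelled author field."""
    return sorted(records, key=lambda record: record["author"])


def search_titles(records, word):
    """Return the records whose title field contains the word (case-insensitive)."""
    word = word.lower()
    return [record for record in records if word in record["title"].lower()]


if __name__ == "__main__":
    for record in sort_by_author(catalog):
        print(record["author"], "|", record["title"])
    print([record["title"] for record in search_titles(catalog, "ink")])
```

Had the same records been typed as undifferentiated strings, neither function could be written without guessing where one field ends and the next begins.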
New Technology or Old?
Many current technology and digital library projects use the new technology as an access mechanism to deliver the old technology. These projects rest on the assumption that the typical scholarly product is an article or monograph and that it will be read in a sequential fashion as indeed we have done for hundreds of years, ever since these products began to be produced on paper and be bound into physical artifacts such as books. The difference is that instead of going only to the library or bookstore to obtain the object, we access it over the network, and then almost certainly have to print a copy of it in order to read it. Of course there is a tremendous saving of time for those who have instant access to the network, can find the material they are looking for easily, and have high-speed printers. I want to argue here that delivering the old technology via the new is only a transitory phase and that it must not be viewed as an end in itself. Before we embark on the large-scale compilation of electronic information, we must consider how future scholars might use this information and what are the best ways of ensuring that the information will last beyond the current technology.
The old (print) technology developed into a sophisticated model over a long period of time. Books consist of pages bound up in sequential fashion, delivering the text in a single linear sequence. Page numbers and running heads are used for identification purposes. Books also often include other organizational aids, such as tables of contents and back-of-the-book indexes, which are conventionally placed at the beginning and end of the book respectively. Footnotes, bibliographies, illustrations, and so forth, provide additional methods of cross-referencing. A title page provides a convention for identifying the book and its author and publication details. The length of a book is often determined by publishers' costs or requirements rather than by what the author really wants to say about the subject. Journal articles exhibit similar characteristics, also being designed for reproduction on pieces of paper. Furthermore, the ease of reading printed books and journals is determined by their typography, which is designed to help the reader by reinforcing what the author wants to say. Conventions of typography (headings, italic, bold, etc.) make things stand out on the page.
When we put information into electronic form, we find that we can do many more things with it than we can with a printed book. We can still read it, though not as well as we can read a printed book. The real advantage of the electronic medium is that we can search and manipulate the information in many different ways. We are no longer dependent on the back-of-the-book index to find things within the text, but can search for any word or phrase using retrieval software. We no longer need the whole book to look up one paragraph but can just access the piece of information we need. We can also access several different pieces of information at the same time and make links between them. We can find a bibliographic reference and go immediately to the place to which it points. We can merge different representations of the same material into a coherent whole and we can count instances of features within the information. We can thus begin to think of the material we want as "information objects."
To reinforce the arguments I am making here, I call electronic images of printed pages "dead text" and use the term "live text" for searchable representations of text. For dead text we can use only those retrieval tools that were designed for finding printed items, and even then this information must be added as searchable live text, usually in the form of bibliographic references or tables of contents. Of course most of the dead text produced over the past fifteen or so years began its life as live text in the form of word-processed documents. The obvious question is, how can the utility of that live text be retained and not lost forever?
Electronic Text and Data Formats
Long before digital libraries became popular, live electronic text was being created for many different purposes, most often, as we have seen, with word-processing or typesetting programs. Unfortunately this kind of live electronic text is normally searchable only by the word-processing program that produced it and then only in a very simple way. We have all encountered the problems involved in moving from one word-processing program to another. Although some of these problems have been solved in more recent versions of the software, maintaining an electronic document as a word-processing file is not a sensible option for the long term unless the creator of the document is absolutely sure that this document will be needed only in the short-term future and only for the purposes of word processing by the program that created it. Word-processed documents contain typographic markup, or codes, to specify the formatting. If there were no markup, the document would be much more difficult to read. However, typesetting markup is ambiguous and thus cannot be used sensibly by any retrieval program. For example, italics can be used for titles of books, or for emphasized words, or for foreign words. With typographic markup, we cannot distinguish titles of books from foreign words, which we may, at some stage, want to search for separately.
Other electronic texts were created for the purposes of retrieval and analysis. Many such examples exist, ranging from the large text databases of legal statutes to humanities collections such as the Thesaurus Linguae Graecae (TLG) and the Trésor de la langue française. The scholars working on these projects all realized that they needed to put some intelligence into the data in order to search it effectively. Most project staff devised markup schemes that focus on ways of identifying the reference citations for items that have been retrieved; for example, in the TLG, those items would be the name of the author, work, book, and chapter number. Such
markup schemes do not easily provide for representing items of interest within a text, for example, foreign words or quotations. Most of these markup schemes are specific to one or two computer programs, and texts prepared in them are not easily interchangeable. A meeting in 1987 examined the many markup schemes for humanities electronic texts and concluded that the present situation was "chaos." No existing markup scheme satisfied the needs of all users, and much time was being wasted converting from one deficient scheme to another.
Another commonly used method of storing and retrieving information is a relational database such as, for example, Microsoft Access or dBASE or the mainframe program Oracle. In a relational database, data is assumed to take the form of one or more tables consisting of rows and columns, that is, the form of rectangular structures. A simple table of biographical information may have rows representing people and columns holding information about those people, for example, name, date of birth, occupation, and so on. When a person has more than one occupation, the data becomes clumsy and the information is best represented in two tables, in which the second has a row for each occupation of each person. The tables are linked, or related, by the person. A third table may hold information about the occupations. It is not difficult for a human to conceptualize the data structures of a relational database or for a computer to process them. Relational databases work well for some kinds of information, for example, an address list, but in reality not much data in the real world fits well into rectangular structures. Such a structure means that the information is distorted when it is entered into the computer, and processing and analyses are carried out on the distorted forms, whose distortion tends to be forgotten. Relational databases also force the allocation of information to fixed data categories, whereas, in the humanities at any rate, much of the information is subject to scholarly debate and dispute, requiring multiple views of the material to be represented. Furthermore, getting information out of a relational database for use by other programs usually requires some programming knowledge.
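The two-table arrangement described above can be sketched with Python's built-in sqlite3 module. The table names, column names, and sample people are assumptions made for illustration; the sketch is not tied to any of the database products named in the text.

```python
import sqlite3

# An in-memory database; all table names, column names, and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT,
        birth_year INTEGER
    );
    -- One row per occupation of each person, linked back by person_id.
    CREATE TABLE occupation (
        person_id INTEGER REFERENCES person(id),
        occupation TEXT
    );
    """
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [(1, "Jack Smyth", 1801), (2, "Mary Adams", 1815)],
)
conn.executemany(
    "INSERT INTO occupation VALUES (?, ?)",
    [(1, "printer"), (1, "bookseller"), (2, "editor")],
)


def occupations_of(name):
    """Follow the link between the two tables to list a person's occupations."""
    rows = conn.execute(
        "SELECT o.occupation FROM person p "
        "JOIN occupation o ON o.person_id = p.id "
        "WHERE p.name = ? ORDER BY o.occupation",
        (name,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(occupations_of("Jack Smyth"))
```

Even this small example shows the rigidity the text describes: a disputed birth year, or an occupation that scholars disagree about, has no natural place in these fixed columns.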
The progress of too many retrieval and database projects can be characterized as follows. A project group decides that it wants to "make a CD-ROM." It finds that it has to investigate possible software programs for delivery of the results and chooses the one that has the most seductive user interface or most persuasive salesperson. If the data include some nonstandard characters, the highest priority is often given to displaying those characters on the screen; little attention is paid to the functions needed to manipulate those characters. Data are then entered directly into this software over a period of time during which the software interface begins to look outmoded as technology changes. By the time the data have been entered for the project, the software company has gone out of business, leaving the project staff with a lot of valuable information in a proprietary software format that is no longer supported. More often than not, the data are lost and much time and money has been wasted. The investment is clearly in the data, and it makes
sense to ensure that these data are not dependent on one particular program but can be used by other programs as well.
Standard Generalized Markup Language (SGML)
Given the time and effort involved in creating electronic information, it makes sense to step back and think about how to ensure that the information can outlast the computer system on which it is created and can also be used for many different purposes. These are the two main principles of the Standard Generalized Markup Language (SGML), which became an international standard (ISO 8879) in 1986. SGML was designed as a general purpose markup scheme that can be applied to many different types of documents and in fact to any electronic information. It consists of plain ASCII files, which can easily be moved from one computer system to another. SGML is a descriptive language. Most encoding schemes prior to SGML use prescriptive markup. One example of prescriptive markup is word-processing or typesetting codes embedded in a text that give instructions to the computer such as "center the next line" or "print these words in italic." Another example is fielded data that is specific to a retrieval program, for example, reference citations or author's names, which must be in a specific format for the retrieval program to recognize them as such. By contrast, a descriptive markup language merely identifies what the components of a document are. It does not give specific instructions to any program. In it, for example, a title is encoded as a title, or a paragraph as a paragraph. This very simple approach ultimately allows much more flexibility. A printing program can print all the titles in italic, a retrieval program can search on the titles, and a hypertext program can link to and from the titles, all without making any changes to the data.
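A minimal sketch of this idea follows. It uses Python's standard XML parser as a stand-in for an SGML parser, which is an assumption of convenience: the sample sentence and both function names are hypothetical. The same unchanged data serves a "printing" function, which renders titles in italic, and a "retrieval" function, which lists them.

```python
from html import escape
import xml.etree.ElementTree as ET

# Hypothetical sample: titles are marked as titles, not as italic text.
doc = ET.fromstring(
    "<p>She read <title>Pride and Prejudice</title> and then "
    "<title>Emma</title> that winter.</p>"
)


def titles(root):
    """Retrieval use: list every title, whatever it may look like in print."""
    return [element.text for element in root.iter("title")]


def to_html(root):
    """Display use: decide at output time that titles are shown in italic."""
    parts = [escape(root.text or "")]
    for child in root:
        text = escape(child.text or "")
        parts.append("<i>" + text + "</i>" if child.tag == "title" else text)
        parts.append(escape(child.tail or ""))
    return "<p>" + "".join(parts) + "</p>"


if __name__ == "__main__":
    print(titles(doc))
    print(to_html(doc))
```

If a later publication style called for titles in bold or in quotation marks, only `to_html` would change; the encoded text and the retrieval function would not.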
Strictly speaking, SGML itself is not a markup scheme, but a kind of computer language for defining markup, or encoding, schemes. SGML markup schemes assume that each document consists of a collection of objects that nest within each other or are related to each other in some other way. These objects or features can be almost anything. Typically they are structural components such as title, chapter, paragraph, heading, act, scene, speech, but they can also be interpretive information such as parts of speech, names of people and places, quotations (direct and indirect), and even literary or historical interpretation. The first stage of any SGML-based project is document analysis, which identifies all the textual features that are of interest and identifies the relationships between them. This step can take some time, but it is worth investing the time since a thorough document analysis can ensure that data entry proceeds smoothly and that the documents are easily processable by computer programs.
In SGML terms, the objects within a document are called elements. They are identified by a start and end tag as follows: <title>Pride and Prejudice</title>.
The SGML syntax allows the document designer to specify all the possible elements as a Document Type Definition (DTD), which is a kind of formal model of the document structure. The DTD indicates which elements are contained within other elements, which are optional, which can be repeated, and so forth. For example, in simple terms a journal article consists of a title, one or more author names, an optional abstract, and an optional list of keywords, followed by the body of the article. The body may contain sections, each with a heading followed by one or more paragraphs of text. The article may finish with a bibliography. The paragraphs of text may contain other features of interest, including quotations, lists, and names, as well as links to notes. A play has a rather different structure: title; author; cast list; one or more acts, each containing one or more scenes, which in turn contain one or more speeches and stage directions; and so on.
SGML elements may also have attributes that further specify or modify the element. One use of attributes may be to normalize the spelling of names for indexing purposes. For example, the name Jack Smyth could be encoded as <name norm="SmithJ">Jack Smyth</name>, but indexed under S as if it were Smith. Attributes can also be used to normalize date forms for sorting, for example, <date norm=19970315>the Ides of March 1997</date>. Another important function of attributes is to assign a unique identifier to each instance of each SGML element within a document. These identifiers can be used as a cross-reference by any kind of hypertext program. The list of possible attributes for an element may be defined as a closed set, allowing the encoder to pick from a list, or it may be entirely open.
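The following sketch shows how a program might use such normalizing attributes. It again relies on Python's XML parser as a stand-in for SGML, so the attribute values are quoted throughout (SGML permits the unquoted form shown above, XML does not). The wrapper element, the extra names and dates, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; only the Jack Smyth name and the Ides of March date
# come from the examples in the text, the rest is invented.
fragment = ET.fromstring(
    "<text>"
    '<name norm="SmithJ">Jack Smyth</name>'
    '<name norm="AdamsM">Mary Adams</name>'
    '<name norm="SmithA">Ann Smith</name>'
    '<date norm="19970315">the Ides of March 1997</date>'
    '<date norm="19970101">the first day of 1997</date>'
    "</text>"
)


def name_index(root):
    """Index names under their normalized form, keeping the spelling as written."""
    return sorted((element.get("norm"), element.text) for element in root.iter("name"))


def dates_in_order(root):
    """Sort dates chronologically by the normalized value, not by their wording."""
    dates = sorted(root.iter("date"), key=lambda element: element.get("norm"))
    return [element.text for element in dates]


if __name__ == "__main__":
    print(name_index(fragment))
    print(dates_in_order(fragment))
```

The text as transcribed is never altered; the normalized form travels alongside it and is consulted only when a program needs to sort or index.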
SGML has another very useful feature. Any piece of information can be given a name and be referred to by that name in an SGML document. These names are called entities, and a reference to one begins with an ampersand and ends with a semicolon. One use is for nonstandard characters, where, for example, é can be encoded as &eacute;, thus ensuring that it can be transmitted easily across networks and from one machine to another. A standard list of these characters exists, but the document encoder can also create more. Entity references can also be used for any boilerplate text. This use avoids repetitive typing of words and phrases that are repeated, thus also reducing the chance of errors. An entity reference can be resolved to any amount of text from a single letter up to the equivalent of an entire chapter.
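Entity resolution can be sketched as a simple table lookup. The entity table below is an assumption: `eacute` follows the standard character entity name, while `ceth` is a hypothetical boilerplate entity invented for this illustration. A real SGML parser reads entity declarations from the DTD rather than from a Python dictionary.

```python
import re

# Hypothetical entity table: one character entity and one boilerplate entity.
ENTITIES = {
    "eacute": "\u00e9",
    "ceth": "Center for Electronic Texts in the Humanities",
}

# An entity reference: an ampersand, a name, and a closing semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, entities=ENTITIES):
    """Replace each known entity reference with the text it stands for."""
    def resolve(match):
        # Unknown references are left untouched so the problem stays visible.
        return entities.get(match.group(1), match.group(0))

    return ENTITY_REF.sub(resolve, text)


if __name__ == "__main__":
    print(expand("A caf&eacute; near the &ceth;."))
```

Because the stored file contains only the plain reference, it can pass through any system that handles ASCII, and the replacement text can be corrected in one place.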
The formal structure of SGML means that the encoding of a document can be validated automatically, a process known as parsing. The parser makes use of the SGML DTD to determine the structure of the document and can thus help to eliminate whole classes of encoding errors before the document is processed by an application program. For example, an error can be detected if the DTD specifies that a journal article must have one or more authors, but the author's name has been omitted accidentally. Mistyped element names can be detected as errors, as can elements that are wrongly nested, for example, an act within a scene when the DTD specifies that acts contain scenes. Attributes can also be validated when there is a closed set of possible values. The validation process can also detect unresolved cross-references that use SGML's built-in identifiers. The SGML document structure and validation process mean that any application program can operate more efficiently because it derives information from the DTD about what to expect in the document. It follows that the stricter the DTD, the easier it is to process the document. However, very strict DTDs may force the document encoder to make decisions that simplify what is being encoded. Looser DTDs might better reflect the nature of the information but usually require more processing. Another advantage of SGML is very apparent here. Once a project is under way, if a document encoder finds a new feature of interest, that feature can simply be added to the DTD without the need to restructure work that has already been done. Many documents can be encoded and processed with the same DTD.
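The kind of checking a parser performs can be imitated in a few lines. The sketch below is not an SGML parser and does not read a real DTD: it assumes a hypothetical, hand-written table of content models for the journal article described earlier, expressed as regular expressions over the sequence of child element names, and it ignores attributes and text content.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models standing in for DTD declarations. Each child
# element name is followed by a space so names cannot run together.
CONTENT_MODELS = {
    "article": r"title (author )+(abstract )?(keywords )?body (bibliography )?",
    "body": r"(section )*",
    "section": r"heading (p )+",
}


def validate(element, errors=None):
    """Collect a message for every element whose children break its model."""
    if errors is None:
        errors = []
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            errors.append(f"<{element.tag}>: unexpected content '{children.strip()}'")
    for child in element:
        validate(child, errors)
    return errors


good = ET.fromstring(
    "<article><title>T</title><author>A</author><body><section>"
    "<heading>H</heading><p>Text.</p></section></body></article>"
)
# The author has been omitted, and a heading is mistyped as <headng>.
bad = ET.fromstring(
    "<article><title>T</title><body><section>"
    "<headng>H</headng><p>Text.</p></section></body></article>"
)

if __name__ == "__main__":
    print(validate(good))
    print(validate(bad))
```

Adding a newly noticed feature means extending one entry in the table, for example allowing a list alongside paragraphs within a section, and documents encoded earlier remain valid.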
Text Encoding Initiative
The humanities computing community was among the early adopters of SGML, for two very simple reasons. Humanities primary source texts can be very complex, and they need to be shared and used by different scholars. They can be in different languages and writing systems and can contain textual variants, nonstandard characters, annotations and emendations, multiple parallel texts, and hypertext links, as well as complex canonical reference systems. In electronic form, these texts can be used for many different purposes, including the preparation of new editions, word and phrase searches, stylistic analyses and research on syntax, and other linguistic features. By 1987 it was clear that many encoding schemes existed for humanities electronic texts, but none was sufficiently powerful to allow for all the different features that might be of interest. Following a planning meeting attended by representatives of leading humanities computing projects, a major international project called the Text Encoding Initiative (TEI) was launched. Sponsored by the Association for Computers and the Humanities, the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing, the TEI enlisted the help of volunteers all over the world to define what features might be of interest to humanities scholars working with electronic text. It built on the expertise of groups such as the Perseus Project (then at Harvard, now at Tufts University), the Brown University Women Writers Project, the Alfa Informatica Group in Groningen, Netherlands, and others who were already working with SGML, to create SGML tags that could be used for many different types of text.
The TEI published its Guidelines for the Encoding and Interchange of Electronic Texts in May 1994 after more than six years' work. The guidelines identify some four hundred tags, but of course no list of tags can be truly comprehensive, and so the TEI builds its DTDs in a way that makes it easy for users to modify them. The TEI SGML application is built on the assumption that all texts share some common core of features to which can be added tags for specific application areas. Very few tags are mandatory, and most of these are concerned with documenting the text
and will be further discussed below. The TEI Guidelines are simply guidelines. They serve to help the encoder identify features of interest, and they provide the DTDs with which the encoder will work. The core consists of the header, which documents the text, plus basic structural tags and common features, such as lists, abbreviations, bibliographic citations, quotations, simple names and dates, and so on. The user selects a base tag set: prose, verse, drama, dictionaries, spoken texts, or terminological data. To this are added one or more additional tag sets. The options here include simple analytic mechanisms, linking and hypertext, transcription of primary sources, critical apparatus, names and dates, and some methods of handling graphics. The TEI has also defined a method of handling nonstandard alphabets by using a Writing System Declaration, which the user specifies. This method can also be used for nonalphabetic writing systems, for example, Japanese. Building a TEI DTD has been likened to the preparation of a pizza, where the base tag set is the crust, the core tags are the tomato and cheese, and the additional tag sets are the toppings.
One of the issues addressed at the TEI planning meeting was the need for documentation of an electronic text. Many electronic texts now exist about which little is known, that is, what source text they were taken from, what decisions were made in encoding the text, or what changes have been made to the text. All this information is extremely important to a scholar wanting to work on the text, since it will determine the academic credibility of his or her work. Unknown sources are unreliable at best and lead to inferior work. Experience has shown that electronic texts are more likely to contain errors or have bits missing, but these are more difficult to detect than with printed material. It seems that one of the main reasons for this lack of documentation for electronic texts was simply that there was no common methodology for providing it.
The TEI examined various models for documenting electronic texts and concluded that some SGML elements placed as a header at the beginning of an electronic text file would be the most appropriate way of providing this information. Since the header is part of the electronic text file, it is more likely to remain with that file throughout its life. It can also be processed by the same software as the rest of the text. The TEI header contains four major sections. One section is a bibliographic description of the electronic text file using SGML elements that map closely on to some MARC fields. The electronic text is an intellectual object different from the source from which it was created, and the source is thus also identified in the header. The encoding description section provides information about the principles used in encoding the text, for example, whether the spelling has been normalized, treatment of end-of-line hyphens, and so forth. For spoken texts, the header provides a way of identifying the participants in a conversation and of attaching a simple identifier to each participant that can then be used as an attribute on each utterance. The header also provides a revision history of the text, indicating who made what changes to it and when.
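Because the header is ordinary markup, the same software that processes the text can inspect it. The sketch below assumes the four section element names used in the TEI Guidelines (fileDesc, encodingDesc, profileDesc, revisionDesc), which are not given in the discussion above, and a heavily abbreviated, hypothetical header that would not satisfy the full TEI DTD. It reports which sections a header lacks.

```python
import xml.etree.ElementTree as ET

# Section names as assumed from the TEI Guidelines; the header content below
# is invented and far shorter than a valid TEI header would be.
HEADER_SECTIONS = ["fileDesc", "encodingDesc", "profileDesc", "revisionDesc"]

header = ET.fromstring(
    "<teiHeader>"
    "<fileDesc><titleStmt><title>Sample essay: an electronic transcription"
    "</title></titleStmt></fileDesc>"
    "<encodingDesc><p>Spelling not normalized; end-of-line hyphens removed."
    "</p></encodingDesc>"
    "<revisionDesc><change>Header created.</change></revisionDesc>"
    "</teiHeader>"
)


def missing_sections(root):
    """List the major header sections that are absent from this header."""
    present = {child.tag for child in root}
    return [name for name in HEADER_SECTIONS if name not in present]


if __name__ == "__main__":
    print(missing_sections(header))
```

A check of this kind could be run over a whole collection to find texts whose documentation is incomplete.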
As far as can be ascertained, the TEI header is the first systematic attempt to provide documentation for an electronic text that is a part of the text file itself. A good many projects are now using it, but experience has shown that it would perhaps benefit from some revision. Scholars find it hard to create good headers. Some elements in the header are very obvious, but the relative importance of the remaining elements is not so clear. At some institutions, librarians are creating TEI headers, but they need training in the use and importance of the nonbibliographic sections and in how the header is used by computer software other than the bibliographic tools that they are familiar with.
Encoded Archival Description (EAD)
Another SGML application that has attracted a lot of attention in the scholarly community and archival world is the Encoded Archival Description (EAD). First developed by Daniel Pitti at the University of California at Berkeley and now taken over by the Library of Congress, the EAD is an SGML application for archival finding aids. Finding aids are very suitable for SGML because they are basically hierarchic in structure. In simple terms, a collection is divided into series, which consist of boxes, which contain folders, and so on. Prior to the EAD, there was no effective standard way of preparing finding aids. Typical projects created a collection level record in one of the bibliographic utilities, such as RLIN, and used their own procedures, often a word-processing program, for creating the finding aid. Possibilities now exist for using SGML to link electronic finding aids with electronic representations of the archival material itself. One such experiment, conducted at the Center for Electronic Texts in the Humanities (CETH), has created an EAD-encoded finding aid for part of the Griffis Collection at Rutgers University and has encoded a small number of the items in the collection (nineteenth-century essays) in the TEI scheme. The user can work with the finding aid to locate the item of interest and then move directly to the encoded text and an image of the text to study the item in more detail. The SGML browser program Panorama allows the two DTDs to exist side by side and in fact uses an extended pointer mechanism devised by the TEI to move from one to the other.
Other Applications of SGML
SGML is now being widely adopted in the commercial world as companies see the advantage of investment in data that will move easily from one computer system to another. It is worth noting that the few books on SGML that appeared early in its life were intended for an academic audience. More recent books are intended for a commercial audience and emphasize the cost savings involved in SGML as well as the technical requirements. This is not to say that these books are not of any value to academic users. The SGML Web pages list many projects in the areas of health, legal documents, electronic journals, rail and air transport, semiconductors, the U.S. Internal Revenue Service, and more. SGML is extremely useful
for technical documentation, as can be evidenced by the list of customers on the Web page of one of the major SGML software companies, INSO/EBT. This list includes United Airlines, Novell, British Telecom, AT&T, Shell, Boeing, Nissan, and Volvo.
SGML need not be used only with textual data. It can be used to describe almost anything. SGML should not therefore be seen as an alternative to Acrobat, PostScript, or other document formats but as a way of describing and linking together documents in these and other formats, forming the "underground tunnels" that make the documents work for users. SGML can be used to encode the searchable textual information that must accompany images or other formats in order to make them useful. With SGML, the searchable elements can be defined to fit the data exactly and can be used by different systems. This encoding is in contrast with storing image data in some proprietary database system, which is common practice. We can imagine a future situation in which a scholar wants to examine the digital image of a manuscript and also have available a searchable text. He or she may well find something of interest on the image and want to go to occurrences of the same feature elsewhere within the text. In order to do this, the encoded version of the text must know what that feature of interest is and where it occurs on the digital image. Knowing which page it is on is not enough. The exact position on the page must be encoded. This information can be represented in SGML, which thus provides the sophisticated kind of linking needed for scholarly applications. SGML structures can also point to places within a recording of speech or other sound and can be used to link the sound to a transcription of the conversation, again enabling the sound and text to be studied together. Other programs exist that can perform these functions, but the problem with all of them is that they use a proprietary data format that cannot be used for any other purpose.
SGML, HTML, and XML
The relationship between SGML and the Hypertext Markup Language (HTML) needs to be clearly understood. Although not originally designed as such, HTML is now an SGML application, even though many HTML documents exist that cannot be validated according to the rules of SGML. HTML consists of a set of elements that are interpreted by Web browsers for display purposes. The HTML tags were designed for display and not for other kinds of analysis, which is why only crude searches are possible on Web documents. HTML is a rather curious mixture of elements. Larger ones, such as <body>; <h1>, <h2>, and so on for head levels; <p> for paragraph; and <ul> for unordered list, are structural, but the smaller elements, such as <b> for bold and <i> for italic, are typographic, which, as we have seen above, are ambiguous and thus cannot be searched effectively. HTML version 3 attempts to rectify this ambiguity somewhat by introducing a few semantic level elements, but these are very few in comparison with those
identified in the TEI core set. HTML can be a good introduction to structured markup. Since it is so easy to create, many project managers begin by using HTML and graduate to SGML once they become used to working with structured text and begin to see the weakness of HTML for anything other than the display of text. SGML can easily be converted automatically to HTML for delivery on the Web, and Web clients have been written for the major SGML retrieval programs.
The move from HTML to SGML can be substantial, and in 1996 work began on XML (Extensible Markup Language), which is a simplified version of SGML for delivery on the Web. It is "an extremely simple dialect of SGML," the goal of which "is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML" (see http://www.w3.org/TR/REC-xml ). XML is being developed under the auspices of the World Wide Web Consortium, and the first draft of the specification for it was available by the SGML conference in December 1996. Essentially XML is SGML with some of the more complex and esoteric features removed. It has been designed for interoperability with both SGML and HTML, that is, to fill the gap between HTML, which is too simple, and full-blown SGML, which can be complicated. As yet there is no specific XML software, but the work of this group has considerable backing and the design of XML has proceeded quickly.
SGML and New Models of Scholarship
SGML's objectlike structures make it possible for scholarly communication to be seen as "chunks" of information that can be put together in different ways. Using SGML, we no longer have to squeeze the product of our research into a single linear sequence of text whose size is often determined by the physical medium in which it will appear; instead we can organize it in many different ways, privileging some for one audience and others for a different audience. Some projects are already exploiting this potential, and I am collaborating in two that are indicative of the way I think humanities scholarship will develop in the twenty-first century. Both projects make use of SGML to create information objects that can be delivered in many different ways.
The Model Editions Partnership (MEP) is defining a set of models for electronic documentary editions. Directed by David Chesnutt of the University of South Carolina, with the TEI Editor, C. Michael Sperberg-McQueen, and myself as co-coordinators, the MEP also includes seven documentary editing projects. Two of these projects are creating image editions, and the other five are preparing letterpress publications. These documentary editions provide the basic source material for the study of American history by adding the historical context that makes the material meaningful to readers. Much of this source material consists of letters, which often refer to people and places by words that only the author and recipient understand. A good deal of the source material is in handwriting that can be read only by scholars specializing in the field. Documentary editors prepare the material for publication by transcribing the documents, organizing the sources into a coherent sequence that tells the story (the history) behind them, and annotating them with information to help the reader understand them. However, the printed page is not a very good vehicle for conveying the information that documentary editors need to convey. It forces one organizing principle on the material (the single linear sequence of the book) when the material could well be organized in several different ways (chronologically, for example, or by recipient of letters). Notes must appear at the end of an item to which they refer or at the end of the book. When the same note (for example, a short biographical sketch of somebody mentioned in the sources) is needed in several places, it can appear only once, and after that it is cross-referenced by page numbers, often to earlier volumes. Something that has been crossed out and rewritten in a source document can only be represented clumsily in print even though it may reflect a change of mind that altered the course of history.
At the beginning of the MEP project, the three coordinators visited all seven partner projects, showed the project participants some very simple demonstrations, and then invited them to "dream" about what they would like to do in this new medium. The ideas collected during these visits were incorporated into a prospectus for electronic documentary editions. The MEP sees SGML as the key to providing all the functionality outlined in the prospectus. The MEP has developed an SGML DTD for documentary editions that is based on the TEI and has begun to experiment with delivery of samples from the partner projects. The material for the image editions is wrapped up in an "SGML envelope" that provides the tools to access the images. This envelope can be generated automatically from the relational databases in which the image access information is now stored. For the letterpress editions, many more possibilities are apparent. If desired, it will be possible to merge material from different projects that are working on the same period of history. It will be possible to select subsets of the material easily by any of the tagged features. This means that editions for high school students or the general public could be created almost automatically from the archive of scholarly material. With a click of a mouse, the user can go from a diplomatic edition to a clear reading text and thus trace the author's thoughts as the document was being written. The documentary editions also include very detailed conceptual indexes compiled by the editors. It will be possible to use these indexes as an entry point to the text and also to merge indexes from different projects. The MEP sees the need for making dead text image representations of existing published editions available quickly and believes that these can be made much more useful by wrapping them in SGML and using the conceptual indexes as an entry point to them.
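The move from a diplomatic edition to a clear reading text can be illustrated with a small sketch. It is not the MEP's software: it uses XML as a stand-in for SGML (Python's standard library includes an XML parser but no SGML parser), and the sample sentence and the bracket conventions are invented for illustration. The element names del and add follow TEI usage for deletions and additions.

```python
import xml.etree.ElementTree as ET

# Invented transcription: <del> marks cancelled text, <add> marks an insertion.
SAMPLE = "<p>We shall <del>retreat</del><add>advance</add> at dawn.</p>"

def reading_text(elem):
    """Clear reading text: omit deletions and show additions silently."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "del":
            parts.append(reading_text(child))
        parts.append(child.tail or "")
    return "".join(parts)

def diplomatic_text(elem):
    """Diplomatic view: deletions in [brackets], additions in {braces}."""
    parts = [elem.text or ""]
    for child in elem:
        inner = diplomatic_text(child)
        if child.tag == "del":
            inner = "[" + inner + "]"
        elif child.tag == "add":
            inner = "{" + inner + "}"
        parts.append(inner)
        parts.append(child.tail or "")
    return "".join(parts)

root = ET.fromstring(SAMPLE)
print(reading_text(root))     # We shall advance at dawn.
print(diplomatic_text(root))  # We shall [retreat]{advance} at dawn.
```

Both views are derived from the same encoded transcription, so neither has to be prepared or stored separately.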
The second project is even more ambitious than the MEP, since it is dealing with entirely new material and has been funded for five years. The Orlando Project at the Universities of Alberta and Guelph is a major collaborative research initiative funded by the Canadian Social Sciences and Humanities Research Council. Directed by Patricia Clements, the project is to create an Integrated History of Women's Writing in the British Isles, which will appear in print and electronic formats. A team of graduate research assistants is carrying out basic research for the project in libraries and elsewhere. The research material they are assembling is being encoded in SGML so that it can be retrieved in many different ways. SGML DTDs have been designed to reflect the biographical details for each woman writer as well as her writing history, other historical events that influenced her writing, a thesaurus of keyword terms, and so forth. The DTDs are based on the TEI but they incorporate much descriptive and interpretive information, reflecting the nature of the research and the views of the literary scholars in the team. Tag sets have been devised for topics such as the issues of authorship and attribution, genre issues, and issues of reception of an author's work.
The Orlando Project is thus building up an SGML-encoded database of many different kinds of information about women's writing in the British Isles. The SGML encoding, for example, greatly assists in the preparation of a chronology by allowing the project to pull out all chronology items from the different documents and sort them by their dates. It facilitates an overview of where the women writers lived, their social background, and what external factors influenced their writing. It helps with the creation and consistency of new entries, since the researchers can see immediately if similar information has already been encountered. The authors of the print volumes will draw on this SGML archive as they write, but the archive can also be used to create many different hypertext products for research and teaching.
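The chronology example can be sketched in a few lines. This is only an illustration under stated assumptions, not the Orlando Project's actual tooling: XML stands in for SGML, and the element name chronitem, the date attribute, and the sample entries are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Two invented documents of different kinds, each containing dated items.
DOCS = [
    "<biography><p>Prose.</p>"
    "<chronitem date='1847-10-16'>Writer A publishes a first novel.</chronitem>"
    "</biography>",
    "<events>"
    "<chronitem date='1818-01-01'>Writer C publishes a novel.</chronitem>"
    "<chronitem date='1792-03-01'>Writer B publishes a treatise.</chronitem>"
    "</events>",
]

def chronology(docs):
    """Collect every <chronitem> from every document and sort by date."""
    items = []
    for source in docs:
        root = ET.fromstring(source)
        for item in root.iter("chronitem"):
            text = "".join(item.itertext()).strip()
            items.append((item.get("date"), text))
    # ISO-style dates (YYYY-MM-DD) sort correctly as plain strings.
    return sorted(items)

for date, text in chronology(DOCS):
    print(date, text)
```

The sketch depends on the dates being recorded in a consistent form in the markup; free-text dates would need normalizing first.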
Both Orlando and the MEP are, essentially, working with pieces of information, which can be linked in many different ways. The linking, or rather the interpretation that gives rise to the linking, is what humanities scholarship is about. When the information is stored as encoded pieces of information, it can be put together in many different ways and used for many different purposes, of which creating a print publication is only one. We can expect other projects to begin to work in this way as they see the advantages of encoding the features of interest in their material and of manipulating them in different ways.
It is useful to look briefly at some other possibilities. Dictionary publishers were among the first to use SGML. (Although not strictly SGML since it does not have a DTD, the Oxford English Dictionary was the first academic project to use structured markup.) When well designed, the markup enables the dictionary publishers to create spin-off products for different audiences by selecting a subset of the tagged components of an entry. A similar process can be used for other kinds of reference works. Tables of contents, bibliographies, and indexes can all be compiled automatically from SGML markup and can also be cumulative across volumes or collections of material.
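As a rough sketch of how a spin-off product might be derived, the following selects a subset of the tagged components of one entry. The entry, its element names, and the choice of components for a "concise" edition are all invented for illustration, and XML again stands in for SGML.

```python
import xml.etree.ElementTree as ET

# Invented dictionary entry with hypothetical element names.
ENTRY = (
    "<entry><hw>markup</hw><pron>MARK-up</pron><pos>n.</pos>"
    "<etym>invented etymology note</etym>"
    "<def>Codes added to a text.</def>"
    "<quote>An invented citation.</quote></entry>"
)

def spin_off(entry_xml, keep):
    """Return the entry with only the components named in `keep`."""
    root = ET.fromstring(entry_xml)
    for child in list(root):  # copy the list because we remove while looping
        if child.tag not in keep:
            root.remove(child)
    return ET.tostring(root, encoding="unicode")

# A hypothetical concise edition keeps headword, part of speech, definition.
print(spin_off(ENTRY, {"hw", "pos", "def"}))
```

The full entry is left untouched, so several different products can be generated from the same source.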
The MEP is just one project that uses SGML for scholarly editions. Another notable example is the CD-ROM of Chaucer's Wife of Bath's Prologue, prepared by Peter Robinson and published by Cambridge University Press in 1996. This CD-ROM contains all fifty-eight pre-1500 manuscripts of the text, with encoding for all the variant readings as well as digitized images of every page of all the manuscripts. Software programs provided with the CD-ROM can manipulate the material in many different ways, enabling a scholar to collate manuscripts, move immediately from one manuscript to another, and compare transcriptions, spellings, and readings. All the material is encoded in SGML, and the CD-ROM includes more than one million hypertext links generated by a computer program, which means that the investment in the project's data is carried forward from one delivery system to another, indefinitely, into the future.
Making SGML Work Effectively
Getting started with SGML can seem to be a big hurdle to overcome, but in fact the actual mechanics of working with SGML are nowhere near as difficult as is often assumed. SGML tags are rarely typed in, but are normally inserted by software programs such as Author/Editor. These programs can incorporate a template that is filled in with data. Like other SGML software, these programs make use of the DTD. They know which tags are valid at any position in the document and can offer only those tags to the user, who can pick from a menu. They can also provide a pick list of attributes and their values if the values are a closed set. These programs ensure that what is produced is a valid SGML document. They can also toggle the display of tags on and off very easily; Author/Editor, for example, encloses them in boxes that are very easy to see. The programs also incorporate style sheets that define the display format for every element.
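The idea of offering only the valid tags can be sketched as follows. This is not how Author/Editor is implemented, and the content model is a hypothetical, heavily simplified one: a real DTD also constrains the order and repetition of elements, which this sketch ignores.

```python
# Hypothetical content model: each element and the children allowed inside it.
CONTENT_MODEL = {
    "letter": ["opener", "p", "closer"],
    "opener": ["dateline", "salute"],
    "p": ["name", "place", "date"],
    "closer": ["signed"],
}

def menu_for(open_elements):
    """Tags an editor could offer, given the stack of currently open elements."""
    if not open_elements:
        return ["letter"]  # only the document element may start a document
    return CONTENT_MODEL.get(open_elements[-1], [])

def insert_tag(open_elements, tag):
    """Open `tag` if it is valid here; otherwise refuse."""
    if tag not in menu_for(open_elements):
        raise ValueError(f"<{tag}> is not allowed at this position")
    return open_elements + [tag]

stack = insert_tag([], "letter")
stack = insert_tag(stack, "p")
print(menu_for(stack))  # ['name', 'place', 'date']
```

Because invalid choices are never offered, the person doing the encoding does not need to memorize the DTD.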
Nevertheless, inserting tags in this way can be rather cumbersome, and various software tools exist to help in the translation of "legacy" data to SGML. Of course, these tools cannot add intelligence to data if it was not there in the legacy format, but they can do a reasonable and low-cost job of converting material for large-scale projects in which only broad structural information is needed. For UNIX users, the shareware program sgmls and its successor, sp, are excellent tools for validating SGML documents and can be incorporated in processing programs. There are also ways in which the markup can be minimized. End tags can be omitted in some circumstances, for example, in a list where the start of a new list item implies that the previous one has ended.
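The list example can be made concrete with a toy normalizer that restores the end tags a parser would infer. It is a sketch only: a real SGML parser such as sp derives this behavior from the DTD, whereas here the rule for a hypothetical item element is written by hand and nested lists are not handled.

```python
import re

def restore_item_end_tags(text):
    """Insert the </item> end tags implied by a following <item> or </list>."""
    out = []
    item_open = False
    # Split into tags and text, keeping the tags as separate tokens.
    for token in re.split(r"(<[^>]+>)", text):
        if token in ("<item>", "</list>") and item_open:
            out.append("</item>")  # the previous item has implicitly ended
            item_open = False
        if token == "<item>":
            item_open = True
        elif token == "</item>":
            item_open = False
        out.append(token)
    return "".join(out)

minimized = "<list><item>apples<item>pears</list>"
print(restore_item_end_tags(minimized))
# <list><item>apples</item><item>pears</item></list>
```

Input that already carries explicit end tags passes through unchanged.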
There is no doubt that SGML is considered expensive by some project managers, but further down the line the payoff can be seen many times over. The quick and dirty solution to a computing problem does not last very long, and history has shown how much time can be wasted converting from one system to another or how much data can be lost because they are in a proprietary system. It is rather surprising that the simple notion of encoding what the parts of a document are, rather than what the computer is supposed to do with them, took so long to catch on. Much of the investment in any computer project is in the data, and SGML is the best way we know so far of ensuring that the data will last for a long time and that they can be used and reused for many different purposes. It also ensures that the project is not dependent on one software vendor.
The amount of encoding is obviously a key factor in the cost, and so any discussion about the cost-effectiveness of an SGML project should always be made with reference to the specific DTD in use and the level of markup to be inserted. Statements that SGML costs x dollars per page are not meaningful without further qualification. Unfortunately, such qualification is rarely given at present, and misconceptions often result. It is quite possible, although clearly not sensible, to have a valid SGML document that consists of one start tag at the beginning and one corresponding end tag at the end with no other markup in between. At the other extreme, each word (or even letter) in the document could have several layers of markup attached to it. What is clear is that the more markup there is, the more useful the document is and the more expensive it is to create. As far as I am aware, little research has been done on the optimum level of markup, but at least with SGML it is possible to add markup to a document later without prejudicing what is already encoded.
In my view, it is virtually impossible to make any general cost statements for SGML-based work. Each project needs to be assessed differently depending on its current situation and its objectives. However, I will attempt to discuss some of the issues and the items that make up the overall cost. Many of the costs of an SGML-based project are no different from those of other computer-based projects in that both have start-up costs and ongoing costs.
Start-up costs can depend on how much computing experience and expertise there already is in the organization. Projects that are being started now have the advantage of not being encumbered by large amounts of legacy data and proprietary systems, but they also will need to be started from scratch with the three things that make any computer project work: hardware, software, and skilled people. Hardware costs are insignificant these days, and SGML software will work on almost any current PC or UNIX-based hardware. It does not need an expensive proprietary system. An individual scholar can acquire PC software for creating and viewing SGML-encoded text for under $500. Public domain UNIX tools cost nothing to acquire. That leaves what is, in my view, the most essential component of any computing project, namely, people with good technical skills. Unfortunately, these people are expensive. The market is such that they can expect higher salaries than librarians and publishers receive at the equivalent stages in their careers. However, I think that it is unwise for any organization to embark on a computer-based project without having staff with the proper skills to do the work. Like people in all other disciplines, computer people specialize in one or two areas, and so it is important to hire staff with the right computing skills and thus important for the person doing the hiring to understand what those skills should be. There are still not many SGML specialists around, but someone with a good basic background in computing could be trained in SGML at a commercial or academic course in a week or so, with some follow-up time for experimentation. This person can then use mailing lists and the SGML Web site to keep in touch with new developments. Having the right kind of technical person around early on in any computing project also means that there is somebody who can advise on the development of the system and ensure that expensive mistakes are not made by decision makers who have had little previous involvement with computing systems. The technical person will also be able to see immediately where costs can be saved by implementing shortcuts.
The one specific start-up cost with SGML is the choice or development of the DTD. Many digital library projects are utilizing existing DTDs (for example, the cut-down version of the TEI called TEI Lite), either with no changes at all or with only light modifications. However, I think that it is important for project managers to look hard at an existing DTD to see whether it really satisfies their requirements rather than just decide to use it because everyone else they know is using it. A project in a very specialized area may need to have its own DTD developed. This could mean the hiring of SGML consultants for a few days plus time spent by the project's own staff in specifying the objectives of the project in great detail and in defining and refining the features of interest within the project's documents.
Computer-based projects seldom proceed smoothly, and in the start-up phase, time must be allowed for false starts and revisions. SGML is no different here, but by its nature it does force project managers to consider very many aspects at the beginning and thus helps prevent the project from going a long way down a wrong road. SGML elements can also be used to assist with essential aspects of project administration, for example, tags for document control and management.
Ongoing costs are largely concerned with document creation and encoding, but they also include general maintenance, upgrades, and revisions. If the material is not already in electronic form, it may be possible to convert it by optical character recognition (OCR). The accuracy of the result will depend on the quality of the type fonts and paper of the original, but the document will almost certainly need to be proofread and edited to reach the level of quality acceptable to the scholarly community. OCR also yields a typographic representation of a document, which is ambiguous for other kinds of computer processing. Whether it comes from word processors or OCR, typographic encoding needs to be converted to SGML. It is possible to write programs or purchase software tools to do this, but only those features that can be unambiguously defined can be converted in this way. Any markup that requires interpretive judgment must be inserted manually at the cost of human time. Most electronic text projects in the humanities have had the material entered directly by keyboarding, not only to attain higher levels of accuracy than with OCR, but also to insert markup at the same time. More often than not, project managers employ graduate students for this work, supervised by a textbase manager who keeps records of decisions made in the encoding and often assisted by a programmer who can identify shortcuts and write programs where necessary to handle these shortcuts.
There are also, of course, costs associated with delivering SGML-encoded material once it has been created. These costs fall into much the same categories as the costs for creating the material. Start-up costs include the choice and installation of delivery software. In practice, most digital library projects use the Opentext search engine, which is affordable for a library or a publisher. The search engine also needs a Web client, which need not be a big task for a programmer. Naturally it takes longer to write a better Web client, but a better client may save end users much time as they sort through the results of a query. Opentext is essentially a retrieval program, and it does not provide much in the way of hypertext linking. INSO/EBT's suite of programs, including DynaText and DynaWeb, provides a model of a document that is much more like an electronic book with hypertext links. INSO's Higher Education Grant Program has enabled projects like MEP and Orlando to deliver samples of their material without the need to purchase SGML delivery software. INSO offers some technical support as part of the grant package, but skilled staff are once again the key component for getting a delivery system up and running. When any delivery system is functioning well, the addition of new SGML-encoded material to the document database can be fully automated with little need for human intervention unless something goes wrong.
Experience has shown that computer-based projects are rarely, if ever, finished. They will always need maintenance and upgrades and will incur ongoing costs more or less forever if the material is not to be lost. SGML seems to me to be the best way of investing for the future, since there are no technical problems in migrating it to new systems. However, I find it difficult to envisage a time when there will be no work and no expense involved with maintaining and updating electronic information. It is as well to understand this and to budget for these ongoing costs at the beginning of a project rather than have them come out gradually as the project proceeds.
SGML does have one significant weakness. It assumes that each document is a single hierarchic structure, but in the real world (at least of the humanities) very few documents are as simple as this. For example, a printed edition of a play has one structure of acts, scenes, and speeches and another of pages and line numbers. A new act or scene does not normally start on a new page, and so there is no relationship between the pages and the act and scene structure. It is simply an accident of the typography. The problem arises even with paragraphs in prose texts, since a new page does not necessarily start with a new paragraph, nor a new paragraph with a new page. For well-known editions the page numbers are important, but they cannot easily be encoded in SGML other than as "empty" tags that simply indicate a point in the text, not the beginning and end of a piece of information. The disadvantage here is that the processing of information marked by empty tags cannot make full use of SGML's capabilities. Another example of the same problem is quotations that span paragraphs. They have to be closed and then opened again with attributes to indicate that they are really all the same quotation.
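Both workarounds can be shown in a short sketch, with XML once more standing in for SGML. The sample texts are invented; the empty element pb follows TEI usage for page breaks, while the chain attribute used to link quotation fragments is a hypothetical name chosen for this example.

```python
import xml.etree.ElementTree as ET

# A speech that runs across a page boundary, marked by an empty <pb/> tag.
PLAY = ("<scene><sp>First speech begins <pb n='12'/>and ends.</sp> "
        "<sp>Second speech.</sp></scene>")

def text_by_page(root, first_page):
    """Group running text by page, treating <pb/> as a boundary marker."""
    pages = {first_page: []}
    current = first_page

    def walk(elem):
        nonlocal current
        if elem.tag == "pb":
            current = elem.get("n")
            pages[current] = []
        elif elem.text:
            pages[current].append(elem.text)
        for child in elem:
            walk(child)
            if child.tail:
                pages[current].append(child.tail)

    walk(root)
    return {page: "".join(chunks) for page, chunks in pages.items()}

# One quotation split over two paragraphs, linked by a shared attribute value.
PROSE = ("<text><p>He said, <q chain='q1'>This begins here.</q></p>"
         "<p><q chain='q1'>And it continues.</q></p></text>")

def joined_quotations(root):
    """Reassemble quotation fragments that share the same chain value."""
    quotes = {}
    for q in root.iter("q"):
        quotes.setdefault(q.get("chain"), []).append("".join(q.itertext()))
    return {key: " ".join(parts) for key, parts in quotes.items()}

print(text_by_page(ET.fromstring(PLAY), "11"))
print(joined_quotations(ET.fromstring(PROSE)))
```

In both cases the program, not the parser, has to reconstruct the second structure, which is exactly the extra work that the single-hierarchy assumption imposes.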
For many scholars, SGML is exciting to work with because it opens up so many more possibilities for working with source material. We now have a much better way than ever before of representing in electronic form the kinds of interpretation and discussion that are the basis of scholarship in the humanities. But as we begin to understand these new possibilities, some new challenges appear. What happens when documents from different sources (and thus different DTDs) are merged into the same database? In theory, computers make it very easy to do this, but how do we merge material that has been encoded according to different theoretical perspectives and retain the identification and individuality of each perspective? It is possible to build some kind of "mega-DTD," but the mega-DTD may become so free in structure that it is difficult to do any useful processing of the material.
Attention must now turn to making SGML work more effectively. Finding better ways of adding markup to documents is a high priority. The tagging could be speeded up by a program that can make intelligent tagging guesses based on information it has derived from similar material that has already been tagged, in much the same way that some word class tagging programs "learn" from text that has already been tagged manually. We also need to find ways of linking encoded text to digital images of the same material without the need for hand coding. Easier ways must be found for handling multiple parallel structures. All research leading to better use of SGML could benefit from a detailed analysis of documents that have already been encoded in SGML. The very fact that they are in SGML makes this analysis easy to do.
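A very simple form of such "learning" can be sketched as follows: count how each word was tagged in material already encoded by hand, then suggest the most frequent tag for the same word in new text. This is an assumption-laden toy, not a description of any existing tagging program; the tag names and training examples are invented, and a practical tool would also need to use context.

```python
from collections import Counter, defaultdict

def learn(tagged_examples):
    """Build a word -> most frequent tag table from hand-tagged examples."""
    counts = defaultdict(Counter)
    for word, tag in tagged_examples:
        counts[word.lower()][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def suggest(model, words):
    """Pair each word with a suggested tag, or None if the word is unseen."""
    return [(word, model.get(word.lower())) for word in words]

# Invented training data drawn from "already tagged" material.
TRAINING = [
    ("Boston", "place"), ("Boston", "place"), ("Boston", "name"),
    ("Adams", "name"),
]

model = learn(TRAINING)
print(suggest(model, ["Adams", "left", "Boston"]))
# [('Adams', 'name'), ('left', None), ('Boston', 'place')]
```

The suggestions are only guesses for a human encoder to confirm or reject, which keeps the interpretive judgment with the scholar.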