SGML and New Models of Scholarship
SGML's objectlike structures make it possible for scholarly communication to be seen as "chunks" of information that can be put together in different ways. Using SGML, we no longer have to squeeze the product of our research into a single linear sequence of text whose size is often determined by the physical medium in which it will appear; instead we can organize it in many different ways, privileging some for one audience and others for a different audience. Some projects are already exploiting this potential, and I am collaborating in two that are indicative of the way I think humanities scholarship will develop in the twenty-first century. Both projects make use of SGML to create information objects that can be delivered in many different ways.
The Model Editions Partnership (MEP) is defining a set of models for electronic documentary editions. Directed by David Chesnutt of the University of South Carolina, with the TEI Editor, C. Michael Sperberg-McQueen, and myself as co-coordinators, the MEP also includes seven documentary editing projects. Two of these projects are creating image editions, and the other five are preparing letterpress publications. These documentary editions provide the basic source material for the study of American history by adding the historical context that makes the material meaningful to readers. Much of this source material consists of letters, which often refer to people and places by words that only the author and recipient understand. A good deal of the source material is in handwriting that can be read only by scholars specializing in the field. Documentary editors prepare the material for publication by transcribing the documents, organizing the sources into a coherent sequence that tells the story (the history) behind them, and annotating them with information to help the reader understand them. However, the printed page is not a very good vehicle for conveying what documentary editors need to say. It forces one organizing principle on the material (the single linear sequence of the book) when the material could well be organized in several different ways (chronologically, for example, or by recipient of letters). Notes must appear either at the end of the item to which they refer or at the end of the book. When the same note (for example, a short biographical sketch of somebody mentioned in the sources) is needed in several places, it can appear only once, and after that it is cross-referenced by page numbers, often to earlier volumes. Something that has been crossed out and rewritten in a source document can be represented only clumsily in print, even though it may reflect a change of mind that altered the course of history.
At the beginning of the MEP project, the three coordinators visited all seven partner projects, showed the project participants some very simple demonstrations, and then invited them to "dream" about what they would like to do in this new medium. The ideas collected during these visits were incorporated into a prospectus for electronic documentary editions. The MEP sees SGML as the key to providing all the functionality outlined in the prospectus. It has developed an SGML DTD for documentary editions that is based on the TEI and has begun to experiment with delivery of samples from the partner projects. The material for the image editions is wrapped up in an "SGML envelope" that provides the tools to access the images. This envelope can be generated automatically from the relational databases in which the image access information is now stored. For the letterpress editions, many more possibilities are apparent. If desired, it will be possible to merge material from different projects that are working on the same period of history. It will be possible to select subsets of the material easily by any of the tagged features. This means that editions for high school students or the general public could be created almost automatically from the archive of scholarly material. With a click of a mouse, the user will be able to go from a diplomatic edition to a clear reading text and thus trace the author's thoughts as the document was being written. The documentary editions also include very detailed conceptual indexes compiled by the editors. It will be possible to use these indexes as an entry point to the text and also to merge indexes from different projects. The MEP sees the need for making "dead text" image representations of existing published editions available quickly and believes that these can be made much more useful by wrapping them in SGML and using the conceptual indexes as an entry point to them.
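The move between a diplomatic edition and a clear reading text can be illustrated with a minimal sketch. It is not the MEP's software. It assumes that the transcription is available in well-formed, XML-style syntax (so that Python's standard library can parse it) and that cancelled and inserted words are tagged with the TEI element names del and add. The sample sentence, the function name render, and the bracket conventions are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical transcription. <del> and <add> follow TEI naming, but the
# sentence itself and the XML-style syntax are assumptions for this sketch.
SAMPLE = "<p>We shall <del>retreat</del><add>advance</add> at dawn.</p>"


def render(elem, view):
    """Return the text of elem as a 'diplomatic' or a 'reading' view."""
    parts = [elem.text or ""]
    for child in elem:
        inner = render(child, view)
        if child.tag == "del":
            # The diplomatic view shows cancelled words in square brackets;
            # the reading view leaves them out.
            if view == "diplomatic":
                parts.append("[" + inner + "]")
        elif child.tag == "add":
            # The diplomatic view marks insertions with braces;
            # the reading view keeps only the words.
            parts.append("{" + inner + "}" if view == "diplomatic" else inner)
        else:
            parts.append(inner)
        # Text that follows the child element belongs to the parent.
        parts.append(child.tail or "")
    return "".join(parts)


doc = ET.fromstring(SAMPLE)
diplomatic = render(doc, "diplomatic")
reading = render(doc, "reading")
print(diplomatic)
print(reading)
```

Both views are derived from the same encoded source, so the editor transcribes the document once and the choice of presentation is left to the reader.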
The second project is even more ambitious than the MEP, since it is dealing with entirely new material and has been funded for five years. The Orlando Project at the Universities of Alberta and Guelph is a major collaborative research initiative funded by the Canadian Social Sciences and Humanities Research Council. Directed by Patricia Clements, the project is to create an Integrated History of Women's Writing in the British Isles, which will appear in print and electronic formats. A team of graduate research assistants is carrying out basic research for the project in libraries and elsewhere. The research material they are assembling is being encoded in SGML so that it can be retrieved in many different ways. SGML DTDs have been designed to reflect the biographical details for each woman writer as well as her writing history, other historical events that influenced her writing, a thesaurus of keyword terms, and so forth. The DTDs are based on the TEI but they incorporate much descriptive and interpretive information, reflecting the nature of the research and the views of the literary scholars in the team. Tag sets have been devised for topics such as the issues of authorship and attribution, genre issues, and issues of reception of an author's work.
The Orlando Project is thus building up an SGML-encoded database of many different kinds of information about women's writing in the British Isles. The SGML encoding, for example, greatly assists in the preparation of a chronology by allowing the project to pull out all chronology items from the different documents and sort them by their dates. It facilitates an overview of where the women writers lived, their social background, and what external factors influenced their writing. It helps with the creation and consistency of new entries, since the researchers can see immediately if similar information has already been encountered. The authors of the print volumes will draw on this SGML archive as they write, but the archive can also be used to create many different hypertext products for research and teaching.
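The chronology example can be sketched in a few lines. This is an illustration only and does not use Orlando's DTDs: the element name chronitem, the attributes writer and date, and the sample entries are all hypothetical, and the documents are assumed to be in well-formed, XML-style syntax with dates written as year-month-day so that they sort correctly as plain strings.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded documents, one per writer. The tag and attribute
# names are invented for this sketch and are not taken from Orlando's DTDs.
DOCS = [
    '<entry writer="Writer A">'
    '<chronitem date="1801-03-02">Born.</chronitem>'
    '<chronitem date="1829-11-15">First novel published.</chronitem>'
    '</entry>',
    '<entry writer="Writer B">'
    '<chronitem date="1815-06-20">First poems circulated.</chronitem>'
    '</entry>',
]


def build_chronology(docs):
    """Collect every chronology item from every document, sorted by date."""
    items = []
    for source in docs:
        root = ET.fromstring(source)
        writer = root.get("writer")
        for item in root.iter("chronitem"):
            # Normalize whitespace so line breaks in the source do not matter.
            text = " ".join((item.text or "").split())
            items.append((item.get("date"), writer, text))
    # Year-month-day strings sort chronologically when compared as text.
    return sorted(items)


chronology = build_chronology(DOCS)
for date, writer, text in chronology:
    print(date, writer, text)
```

The items come out interleaved across writers in date order, which is the effect described above: the chronology is a by-product of the encoding rather than a separately maintained list.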
Both Orlando and the MEP are, essentially, working with pieces of information that can be linked in many different ways. The linking, or rather the interpretation that gives rise to the linking, is what humanities scholarship is about. When the information is stored as encoded pieces, it can be put together in many different ways and used for many different purposes, of which creating a print publication is only one. We can expect other projects to begin to work in this way as they see the advantages of encoding the features of interest in their material and of manipulating them in different ways.
It is useful to look briefly at some other possibilities. Dictionary publishers were among the first to use SGML. (Although not strictly SGML since it does not have a DTD, the Oxford English Dictionary was the first academic project to use structured markup.) When well designed, the markup enables the dictionary publishers to create spin-off products for different audiences by selecting a subset of the tagged components of an entry. A similar process can be used for other kinds of reference works. Tables of contents, bibliographies, and indexes can all be compiled automatically from SGML markup and can also be cumulative across volumes or collections of material.
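Selecting a subset of the tagged components of an entry might look something like the following sketch. The entry, its tag names (headword, pron, etym, sense, quote), and the function name select are all hypothetical, and the markup is assumed to be in well-formed, XML-style syntax rather than in any publisher's actual scheme.

```python
import xml.etree.ElementTree as ET

# Hypothetical dictionary entry. The tag names and the content are invented
# for this sketch and do not reproduce any real dictionary's markup.
ENTRY = """<entry>
  <headword>markup</headword>
  <pron>MARK-up</pron>
  <etym>from mark + up</etym>
  <sense>Codes added to a text to describe its structure.</sense>
  <quote>An invented illustrative quotation.</quote>
</entry>"""


def select(entry_xml, keep):
    """Return (tag, text) pairs for the components named in keep, in order."""
    root = ET.fromstring(entry_xml)
    return [(child.tag, child.text) for child in root if child.tag in keep]


# A concise edition keeps only the headword and the definition.
concise = select(ENTRY, {"headword", "sense"})
# A fuller edition also keeps the pronunciation and the etymology.
fuller = select(ENTRY, {"headword", "pron", "etym", "sense"})
print(concise)
print(fuller)
```

The same encoded entry serves both products; only the list of components to keep changes from one audience to the next.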
The MEP is just one project that uses SGML for scholarly editions. Another notable example is the CD-ROM of Chaucer's Wife of Bath's Prologue, prepared by Peter Robinson and published by Cambridge University Press in 1996. This CD-ROM contains all fifty-eight pre-1500 manuscripts of the text, with encoding for all the variant readings as well as digitized images of every page of all the manuscripts. Software programs provided with the CD-ROM can manipulate the material in many different ways, enabling a scholar to collate manuscripts, move immediately from one manuscript to another, and compare transcriptions, spellings, and readings. All the material is encoded in SGML, and the CD-ROM includes more than one million hypertext links generated by a computer program. As a result, the investment in the project's data is carried forward from one delivery system to another, indefinitely, into the future.