Electronic Text and Data Formats
Long before digital libraries became popular, live electronic text was being created for many different purposes, most often, as we have seen, with word-processing or typesetting programs. Unfortunately this kind of live electronic text is normally searchable only by the word-processing program that produced it and then only in a very simple way. We have all encountered the problems involved in moving from one word-processing program to another. Although some of these problems have been solved in more recent versions of the software, maintaining an electronic document as a word-processing file is not a sensible option for the long term unless the creator of the document is absolutely sure that this document will be needed only in the short-term future and only for the purposes of word processing by the program that created it. Word-processed documents contain typographic markup, or codes, to specify the formatting. If there were no markup, the document would be much more difficult to read. However, typesetting markup is ambiguous and thus cannot be used sensibly by any retrieval program. For example, italics can be used for titles of books, or for emphasized words, or for foreign words. With typographic markup, we cannot distinguish titles of books from foreign words, which we may, at some stage, want to search for separately.
Other electronic texts were created for the purposes of retrieval and analysis. Many such examples exist, ranging from the large text databases of legal statutes to humanities collections such as the Thesaurus Linguae Graecae (TLG) and the Trésor de la langue française. The scholars working on these projects all realized that they needed to put some intelligence into the data in order to search it effectively. Most project staff devised markup schemes that focus on ways of identifying the reference citations for items that have been retrieved; for example, in the TLG, those items would be the name of the author, work, book, and chapter number. Such
markup schemes do not easily provide for representing items of interest within a text, for example, foreign words or quotations. Most of these markup schemes are specific to one or two computer programs, and texts prepared in them are not easily interchangeable. A meeting in 1987 examined the many markup schemes for humanities electronic texts and concluded that the present situation was "chaos." No existing markup scheme satisfied the needs of all users, and much time was being wasted converting from one deficient scheme to another.
Another commonly used method of storing and retrieving information is a relational database such as, for example, Microsoft Access or dBASE or the mainframe program Oracle. In a relational database, data is assumed to take the form of one or more tables consisting of rows and columns, that is, the form of rectangular structures. A simple table of biographical information may have rows representing people and columns holding information about those people, for example, name, date of birth, occupation, and so on. When a person has more than one occupation, the data becomes clumsy and the information is best represented in two tables, in which the second has a row for each occupation of each person. The tables are linked, or related, by the person. A third table may hold information about the occupations. It is not difficult for a human to conceptualize the data structures of a relational database or for a computer to process them. Relational databases work well for some kinds of information, for example, an address list, but in reality not much data in the real world fits well into rectangular structures. Such a structure means that the information is distorted when it is entered into the computer, and processing and analyses are carried out on the distorted forms, whose distortion tends to be forgotten. Relational databases also force the allocation of information to fixed data categories, whereas, in the humanities at any rate, much of the information is subject to scholarly debate and dispute, requiring multiple views of the material to be represented. Furthermore, getting information out of a relational database for use by other programs usually requires some programming knowledge.
The progress of too many retrieval and database projects can be characterized as follows. A project group decides that it wants to "make a CD-ROM." It finds that it has to investigate possible software programs for delivery of the results and chooses the one that has the most seductive user interface or most persuasive salesperson. If the data include some nonstandard characters, the highest priority is often given to displaying those characters on the screen; little attention is paid to the functions needed to manipulate those characters. Data are then entered directly into this software over a period of time during which the software interface begins to look outmoded as technology changes. By the time the data have been entered for the project, the software company has gone out of business, leaving the project staff with a lot of valuable information in a proprietary software format that is no longer supported. More often than not, the data are lost and much time and money has been wasted. The investment is clearly in the data, and it makes
sense to ensure that these data are not dependent on one particular program but can be used by other programs as well.