Technology and Scholarly Communication

Making SGML Work Effectively

Getting started with SGML can seem to be a big hurdle to overcome, but in fact the actual mechanics of working with SGML are nowhere near as difficult as is often assumed. SGML tags are rarely typed in, but are normally inserted by software programs such as Author/Editor. These programs can incorporate a template that is filled in with data. Like other SGML software, these programs make use of the DTD. They know which tags are valid at any position in the document and can offer only those tags to the user, who can pick from a menu. They can also provide a pick list of attributes and their values if the values are a closed set. These programs ensure that what is produced is a valid SGML document. They can also toggle the display of tags on and off very easily; Author/Editor, for example, encloses them in boxes that are very easy to see. The programs also incorporate style sheets that define the display format for every element.
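
The way such an editor narrows its menu can be sketched in a few lines of Python. This is only an illustration, not a description of how Author/Editor is implemented. The element names, the `CONTENT_MODEL` table, and the `ATTRIBUTE_VALUES` table are hypothetical stand-ins for a DTD, and real SGML content models also constrain order and repetition, which the sketch ignores.

```python
# Hypothetical, simplified stand-in for a DTD: each element maps to the
# set of child elements it may contain. Real content models also
# constrain order and occurrence; this sketch ignores both.
CONTENT_MODEL = {
    "play": {"act"},
    "act": {"scene"},
    "scene": {"speech", "stage"},
    "speech": {"speaker", "line"},
    "speaker": set(),
    "line": set(),
    "stage": set(),
}

# Hypothetical closed sets of attribute values, keyed by (element, attribute).
ATTRIBUTE_VALUES = {
    ("stage", "type"): ["entrance", "exit", "business"],
}


def valid_tags(open_elements):
    """Return the tags an editor could offer inside the innermost open element."""
    if not open_elements:
        return ["play"]  # only the document element may start the document
    return sorted(CONTENT_MODEL[open_elements[-1]])


def attribute_choices(element, attribute):
    """Return the pick list for an attribute, or [] if its values are not a closed set."""
    return ATTRIBUTE_VALUES.get((element, attribute), [])


# With the cursor inside <play><act><scene>, only these tags go on the menu.
print(valid_tags(["play", "act", "scene"]))
print(attribute_choices("stage", "type"))
```

Because the menu is computed from the content model rather than typed by the user, an invalid tag can never be inserted at that position, which is the sense in which these programs guarantee a valid document.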

Nevertheless, inserting tags in this way can be rather cumbersome, and various software tools exist to help in the translation of "legacy" data to SGML. Of course, these tools cannot add intelligence to data if it was not there in the legacy format, but they can do a reasonable and low-cost job of converting material for large-scale projects in which only broad structural information is needed. For UNIX users, the freely available program sgmls and its successor, sp, are excellent tools for validating SGML documents and can be incorporated in processing programs. There are also ways in which the markup itself can be minimized. Where the DTD permits, end tags can be omitted in some circumstances, for example, in a list where the start of a new list item implies that the previous one has ended.
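
The list example can be made concrete with a small Python sketch that supplies the end tags a parser would infer. It is a toy under stated assumptions: the element names `list` and `item` are hypothetical, lists are not nested, and tags carry no attributes. A real SGML parser derives the same inference from the DTD rather than from hard-coded names.

```python
import re

# Matches a start tag such as <item> or an end tag such as </list>.
# Attributes are deliberately not handled in this sketch.
TAG = re.compile(r"<(/?)([a-z]+)>")


def supply_item_end_tags(text):
    """Insert the </item> end tags implied by a following <item> or </list>.

    Assumes flat (non-nested) lists whose elements are named "list" and "item".
    """
    out = []
    pos = 0
    item_open = False
    for match in TAG.finditer(text):
        closing = match.group(1) == "/"
        name = match.group(2)
        out.append(text[pos:match.start()])  # text between the previous tag and this one
        pos = match.end()
        # A new <item> or the closing </list> implies the open item has ended.
        implies_end = (name == "item" and not closing) or (name == "list" and closing)
        if item_open and implies_end:
            out.append("</item>")
            item_open = False
        if name == "item":
            item_open = not closing
        out.append(match.group(0))
    out.append(text[pos:])
    return "".join(out)


print(supply_item_end_tags("<list><item>hardware<item>software<item>people</list>"))
```

The person keying the text types only the start tags, and the full structure is still recoverable without ambiguity.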

There is no doubt that SGML is considered expensive by some project managers, but further down the line the payoff can be seen many times over. The quick and dirty solution to a computing problem does not last very long, and history has shown how much time can be wasted converting from one system to another or how much data can be lost because they are in a proprietary system. It is rather surprising that the simple notion of encoding what the parts of a document are, rather than what the computer is supposed to do with them, took so long to catch on. Much of the investment in any computer project is in the data, and SGML is the best way we know so far of ensuring that the data will last for a long time and that they can be used and reused for many different purposes. It also ensures that the project is not dependent on one software vendor.

The amount of encoding is obviously a key factor in the cost, and so any discussion about the cost-effectiveness of an SGML project should always be made with reference to the specific DTD in use and the level of markup to be inserted. Statements that SGML costs x dollars per page are not meaningful without further qualification. Unfortunately, at present such qualification is rarely supplied, and misconceptions often result. It is quite possible, although clearly not sensible, to have a valid SGML document that consists of one start tag at the beginning and one corresponding end tag at the end with no other markup in between. At the other extreme, each word (or even letter) in the document could have several layers of markup attached to it. What is clear is that the more markup there is, the more useful the document is and the more expensive it is to create. As far as I am aware, little research has been done on the optimum level of markup, but at least with SGML it is possible to add markup to a document later without prejudicing what is already encoded.
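
One rough way to attach such a qualification to a cost figure is to report how densely a sample is tagged. The Python sketch below is an illustration only: the measure (tags per 1,000 words) and the function name `markup_density` are assumptions introduced here, not an established metric, and the regular expression is a crude approximation of SGML tag syntax.

```python
import re

# Crude approximation of a tag: anything between angle brackets.
TAG = re.compile(r"<[^<>]+>")


def markup_density(sgml_text):
    """Return the number of tags per 1,000 words of text content."""
    tags = TAG.findall(sgml_text)
    words = TAG.sub(" ", sgml_text).split()
    if not words:
        return 0.0
    return 1000 * len(tags) / len(words)


# The two extremes described above: a single enclosing element versus
# markup on every word.
minimal = "<text>It was the best of times, it was the worst of times.</text>"
detailed = (
    "<text><p><s><w>It</w> <w>was</w> <w>the</w> <w>best</w> "
    "<w>of</w> <w>times</w></s></p></text>"
)
print(round(markup_density(minimal)), round(markup_density(detailed)))
```

A per-page cost quoted alongside a figure of this kind, and the name of the DTD, would at least allow two projects to be compared on like terms.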

In my view, it is virtually impossible to make any general cost statements for SGML-based work. Each project needs to be assessed differently depending on its current situation and its objectives.^[15] However, I will attempt to discuss some of the issues and the items that make up the overall cost. Many of the costs of an SGML-based project are no different from those of other computer-based projects in that both have start-up costs and ongoing costs.

Start-up costs can depend on how much computing experience and expertise there already is in the organization. Projects that are being started now have the advantage of not being encumbered by large amounts of legacy data and proprietary systems, but they also will need to be started from scratch with the three things that make any computer project work: hardware, software, and skilled people. Hardware costs are insignificant these days, and SGML software will work on almost any current PC or UNIX-based hardware. It does not need an expensive proprietary system. An individual scholar can acquire PC software for creating and viewing SGML-encoded text for under $500. Public domain UNIX tools cost nothing to acquire. That leaves what is, in my view, the most essential component of any computing project, namely, people with good technical skills. Unfortunately, these people are expensive. The market is such that they can expect higher salaries than librarians and publishers receive at the equivalent stages in their careers. However, I think that it is unwise for any organization to embark on a computer-based project without having staff with the proper skills to do the work. Like people in all other disciplines, computer people specialize in one or two areas, and so it is important to hire staff with the right computing skills and thus important for the person doing the hiring to understand what those skills should be. There are still not many SGML specialists around, but someone with a good basic background in computing could be trained in SGML at a commercial or academic course in a week or so, with some follow-up time for experimentation. This person can then use mailing lists and the SGML Web site to keep in touch with new developments. Having the right kind of technical person around early on in any computing project also means that there is somebody who can advise on the development of the system and ensure that expensive mistakes are not made by decision makers who have had little previous involvement with computing systems. The technical person will also be able to see immediately where costs can be saved by implementing shortcuts.

The one specific start-up cost with SGML is the choice or development of the DTD. Many digital library projects are utilizing existing DTDs (for example, the cut-down version of the TEI called TEILite), either with no changes at all or with only light modifications. However, I think that it is important for project managers to look hard at an existing DTD to see whether it really satisfies their requirements rather than just decide to use it because everyone else they know is using it. A project in a very specialized area may need to have its own DTD developed. This could mean the hiring of SGML consultants for a few days plus time spent by the project's own staff in specifying the objectives of the project in great detail and in defining and refining the features of interest within the project's documents.

Computer-based projects seldom proceed smoothly, and in the start-up phase, time must be allowed for false starts and revisions. SGML is no different here, but by its nature it does force project managers to consider very many aspects at the beginning and thus help prevent the project from going a long way down a wrong road. SGML elements can also be used to assist with essential aspects of project administration, for example, tags for document control and management.

Ongoing costs are largely concerned with document creation and encoding, but they also include general maintenance, upgrades, and revisions. If the material is not already in electronic form, it may be possible to convert it by optical character recognition (OCR). The accuracy of the result will depend on the quality of the type fonts and paper of the original, but the document will almost certainly need to be proofread and edited to reach the level of quality acceptable to the scholarly community. OCR also yields a typographic representation of a document, which is ambiguous for other kinds of computer processing. Whether it comes from word processors or OCR, typographic encoding needs to be converted to SGML. It is possible to write programs or purchase software tools to do this, but only those features that can be unambiguously defined can be converted in this way. Any markup that requires interpretive judgment must be inserted manually at the cost of human time. Most electronic text projects in the humanities have had the material entered directly by keyboarding, not only to attain higher levels of accuracy than with OCR, but also to insert markup at the same time. More often than not, project managers employ graduate students for this work, supervised by a textbase manager who keeps records of decisions made in the encoding and often assisted by a programmer who can identify shortcuts and write programs where necessary to handle these shortcuts.
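
The division of labor between program and person in converting typographic encoding can be sketched in Python. The input conventions are assumptions made for the example: paragraphs are separated by blank lines and italics are marked with underscores. Paragraph boundaries are unambiguous, so the program converts them outright. An italic phrase might be a title, an emphasis, or a foreign word, so the program only records the typography, here with a TEI-style `<hi rend="italic">` tag, and lists each phrase for a human encoder to interpret. The sketch does not escape markup characters already present in the input.

```python
import re

# Assumed input convention: _underscores_ mark italic type.
ITALIC = re.compile(r"_([^_]+)_")


def convert(plain_text):
    """Convert typographic plain text to a lightly tagged form.

    Returns the tagged text and the list of italic phrases that still
    need interpretive judgment from a human encoder.
    """
    needs_review = []
    paragraphs = []
    for block in re.split(r"\n\s*\n", plain_text.strip()):
        text = " ".join(block.split())  # normalize line breaks within the paragraph
        if not text:
            continue
        needs_review.extend(ITALIC.findall(text))
        body = ITALIC.sub(r'<hi rend="italic">\1</hi>', text)
        paragraphs.append(f"<p>{body}</p>")
    return "\n".join(paragraphs), needs_review


tagged, review = convert("I read _Hamlet_ twice.\n\nIt was _very_\nlong.")
print(tagged)
print(review)
```

The review list is where the human time goes: deciding that the first phrase is a title and the second an emphasis is exactly the kind of judgment no conversion program can make.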

There are also, of course, costs associated with delivering SGML-encoded material once it has been created. These costs fall into much the same categories as the costs for creating the material. Start-up costs include the choice and installation of delivery software. In practice, most digital library projects use the Opentext search engine, which is affordable for a library or a publisher. The search engine also needs a Web client, and writing one need not be a big task for a programmer. Naturally it takes longer to write a better Web client, but a better client may save end users much time as they sort through the results of a query. Opentext is essentially a retrieval program, and it does not provide much in the way of hypertext linking. INSO/EBT's suite of programs, including DynaText and DynaWeb, provides a model of a document that is much more like an electronic book with hypertext links. INSO's Higher Education Grant Program has enabled projects like MEP and Orlando to deliver samples of their material without the need to purchase SGML delivery software. INSO offers some technical support as part of the grant package, but skilled staff are once again the key component for getting a delivery system up and running. When any delivery system is functioning well, the addition of new SGML-encoded material to the document database can be fully automated with little need for human intervention unless something goes wrong.

Experience has shown that computer-based projects are rarely, if ever, finished. They will always need maintenance and upgrades and will incur ongoing costs more or less forever if the material is not to be lost. SGML seems to me to be the best way of investing for the future, since there are no technical problems in migrating it to new systems. However, I find it difficult to envisage a time when there will be no work and no expense involved with maintaining and updating electronic information. It is as well to understand this and to budget for these ongoing costs at the beginning of a project rather than have them come out gradually as the project proceeds.

SGML does have one significant weakness. It assumes that each document is a single hierarchic structure, but in the real world (at least of the humanities) very few documents are as simple as this.^[16] For example, a printed edition of a play has one structure of acts, scenes, and speeches and another of pages and line numbers. A new act or scene does not normally start on a new page, and so there is no relationship between the pages and the act and scene structure. It is simply an accident of the typography. The problem arises even with paragraphs in prose texts, since a new page does not start with a new paragraph or a new paragraph with a new page. For well-known editions the page numbers are important, but they cannot easily be encoded in SGML other than as "empty" tags that simply indicate a point in the text, not the beginning and end of a piece of information. The disadvantage here is that the processing of information marked by empty tags cannot make full use of SGML's capabilities. Another example of the same problem is quotations that span paragraphs. Such a quotation has to be closed at the end of each paragraph and reopened in the next, with attributes to indicate that the pieces are really all the same quotation.
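
The consequence of encoding pages as empty tags can be seen in a short Python sketch. It assumes a TEI-style `<pb n="...">` page-break tag and a `<speech>` element, both used here purely for illustration. Because a page is a point in the text rather than an element that contains anything, the program cannot simply ask which page encloses a speech. It has to walk through the text and remember the most recent page break itself.

```python
import re

# Either a page-break milestone (capturing its number) or a speech start tag.
TOKEN = re.compile(r'<pb n="(\d+)">|<speech>')


def page_of_each_speech(text, first_page):
    """Return the page number on which each <speech> element starts."""
    page = first_page
    pages = []
    for match in TOKEN.finditer(text):
        if match.group(1) is not None:
            page = int(match.group(1))  # milestone: a new page starts at this point
        else:
            pages.append(page)  # a speech starts on whatever page is current
    return pages


sample = (
    '<speech>To be,<pb n="12"> or not to be</speech>'
    "<speech>Soft you now</speech>"
)
print(page_of_each_speech(sample, first_page=11))
```

The first speech begins on one page and ends on the next, which is precisely the overlap that a single hierarchy cannot express with ordinary paired tags.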

For many scholars, SGML is exciting to work with because it opens up so many more possibilities for working with source material. We now have a much better way than ever before of representing in electronic form the kinds of interpretation and discussion that are the basis of scholarship in the humanities. But as we begin to understand these new possibilities, some new challenges appear.^[17] What happens when documents from different sources (and thus different DTDs) are merged into the same database? In theory, computers make it very easy to do this, but how do we merge material that has been encoded according to different theoretical perspectives and retain the identification and individuality of each perspective? It is possible to build some kind of "mega-DTD," but the mega-DTD may become so free in structure that it is difficult to do any useful processing of the material.

Attention must now turn to making SGML work more effectively. Finding better ways of adding markup to documents is a high priority. The tagging could be speeded up by a program that can make intelligent tagging guesses based on information it has derived from similar material that has already been tagged, in much the same way that some word class tagging programs "learn" from text that has already been tagged manually. We also need to find ways of linking encoded text to digital images of the same material without the need for hand coding. Easier ways must be found for handling multiple parallel structures. All research leading to better use of SGML could benefit from a detailed analysis of documents that have already been encoded in SGML. The very fact that they are in SGML makes this analysis easy to do.
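
The idea of a program that learns tagging guesses from material already encoded can be sketched very simply in Python. This is a deliberately naive illustration, not a description of any existing tool: it records which tag each string received in previously tagged text and proposes the most frequent one. The tag names `placeName` and `persName` are borrowed from TEI usage as examples, and a practical system would also need to take context into account.

```python
from collections import Counter, defaultdict


def learn(tagged_examples):
    """Count how often each string was given each tag in already-encoded text."""
    counts = defaultdict(Counter)
    for text, tag in tagged_examples:
        counts[text.lower()][tag] += 1
    return counts


def guess(counts, text):
    """Return the most frequent tag seen for this string, or None if it is unseen."""
    seen = counts.get(text.lower())
    if not seen:
        return None  # no evidence: leave the decision to the human encoder
    return seen.most_common(1)[0][0]


# Hypothetical training pairs extracted from documents tagged by hand.
examples = [
    ("London", "placeName"),
    ("London", "placeName"),
    ("London", "persName"),
    ("Hamlet", "persName"),
]
model = learn(examples)
print(guess(model, "London"), guess(model, "Elsinore"))
```

Guesses of this kind would be offered to the encoder for confirmation rather than applied silently, so that the interpretive decision remains with a person while the keystrokes are saved.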