Chapter 1— Making Technology Work for Scholarship Investing in the Data

Standard Generalized Markup Language (SGML)

Given the time and effort involved in creating electronic information, it makes sense to step back and think about how to ensure that the information can outlast the computer system on which it is created and can also be used for many different purposes. These are the two main principles of the Standard Generalized Markup Language (SGML), which became an international standard (ISO 8879) in 1986.[6] SGML was designed as a general-purpose markup scheme that can be applied to many different types of documents and in fact to any electronic information. SGML documents are plain ASCII files, which can easily be moved from one computer system to another. SGML is a descriptive language. Most encoding schemes prior to SGML used prescriptive markup. One example of prescriptive markup is word-processing or typesetting codes embedded in a text that give instructions to the computer such as "center the next line" or "print these words in italic." Another example is fielded data that is specific to a retrieval program, for example, reference citations or authors' names, which must be in a specific format for the retrieval program to recognize them as such. By contrast, a descriptive markup language merely identifies what the components of a document are. It does not give specific instructions to any program. In it, for example, a title is encoded as a title, or a paragraph as a paragraph. This very simple approach ultimately allows much more flexibility. A printing program can print all the titles in italic, a retrieval program can search on the titles, and a hypertext program can link to and from the titles, all without making any changes to the data.
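The flexibility of descriptive markup can be illustrated with a minimal sketch in Python. Everything here is an assumption made for illustration: the sample fragment, its element names, and the three small functions standing in for a printing program, a retrieval program, and a hypertext program. The fragment is written so that it is also well-formed XML, which lets Python's standard xml.etree.ElementTree module read it; a real SGML document would need a full SGML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptively encoded fragment: titles are marked as titles,
# paragraphs as paragraphs, with no processing instructions of any kind.
DOC = (
    "<doc>"
    "<title>Pride and Prejudice</title>"
    "<para>Text of the first work.</para>"
    "<title>Emma</title>"
    "<para>Text of the second work.</para>"
    "</doc>"
)


def titles(doc_text):
    """Return the text of every title element, in document order."""
    root = ET.fromstring(doc_text)
    return [t.text for t in root.iter("title")]


def print_view(doc_text):
    """Stand-in for a printing program: mark titles as italic (*...*)."""
    return ["*" + t + "*" for t in titles(doc_text)]


def search_titles(doc_text, word):
    """Stand-in for a retrieval program: case-insensitive title search."""
    return [t for t in titles(doc_text) if word.lower() in t.lower()]


def link_targets(doc_text):
    """Stand-in for a hypertext program: give each title an anchor name."""
    return {t: "#title-%d" % i for i, t in enumerate(titles(doc_text), start=1)}
```

All three functions read the same unchanged DOC string; only the program's interpretation of the title element differs.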

Strictly speaking, SGML itself is not a markup scheme, but a kind of computer language for defining markup, or encoding, schemes. SGML markup schemes assume that each document consists of a collection of objects that nest within each other or are related to each other in some other way. These objects or features can be almost anything. Typically they are structural components such as title, chapter, paragraph, heading, act, scene, speech, but they can also be interpretive information such as parts of speech, names of people and places, quotations (direct and indirect), and even literary or historical interpretation. The first stage of any SGML-based project is document analysis, which identifies all the textual features that are of interest and identifies the relationships between them. This step can take some time, but it is worth investing the time since a thorough document analysis can ensure that data entry proceeds smoothly and that the documents are easily processable by computer programs.

In SGML terms, the objects within a document are called elements. Each is identified by a start tag and an end tag, as follows: <title>Pride and Prejudice</title>.



The SGML syntax allows the document designer to specify all the possible elements as a Document Type Definition (DTD), which is a kind of formal model of the document structure. The DTD indicates which elements are contained within other elements, which are optional, which can be repeated, and so forth. For example, in simple terms a journal article consists of a title, one or more author names, an optional abstract, and an optional list of keywords, followed by the body of the article. The body may contain sections, each with a heading followed by one or more paragraphs of text. The article may finish with a bibliography. The paragraphs of text may contain other features of interest, including quotations, lists, and names, as well as links to notes. A play has a rather different structure: title; author; cast list; one or more acts, each containing one or more scenes, which in turn contain one or more speeches and stage directions; and so on.
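The journal-article structure just described can be sketched as a set of content models. The following Python is an illustration only, not how an SGML parser works internally: each content model is approximated by a regular expression over the sequence of child element names, and all element names (article, author, body, section, heading, para, and so on) are assumptions chosen to match the prose. The sample documents are written as well-formed XML so that the standard xml.etree.ElementTree module can read them.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical content models, one per element. Each child name is followed
# by a space, so "(author )+" means "one or more author elements".
# A roughly equivalent DTD declaration for the first entry would be:
#   <!ELEMENT article (title, author+, abstract?, keywords?, body, bibliography?)>
CONTENT_MODELS = {
    "article": r"title (author )+(abstract )?(keywords )?body (bibliography )?",
    "body": r"(section )*",
    "section": r"heading (para )+",
}


def conforms(element):
    """Check an element and all its descendants against CONTENT_MODELS.

    Elements with no entry in CONTENT_MODELS (title, author, para, ...)
    are left unconstrained in this sketch.
    """
    model = CONTENT_MODELS.get(element.tag)
    if model is not None:
        children = "".join(child.tag + " " for child in element)
        if re.fullmatch(model, children) is None:
            return False
    return all(conforms(child) for child in element)


GOOD = (
    "<article><title>T</title><author>A</author><author>B</author>"
    "<body><section><heading>H</heading><para>P</para></section></body>"
    "</article>"
)
NO_AUTHOR = "<article><title>T</title><body></body></article>"

print(conforms(ET.fromstring(GOOD)))       # article follows the model
print(conforms(ET.fromstring(NO_AUTHOR)))  # author is required but missing
```

The optional and repeatable parts of the description map directly onto the ?, + and * operators, which is also how they are written in a DTD.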

SGML elements may also have attributes that further specify or modify the element. One use of attributes may be to normalize the spelling of names for indexing purposes. For example, the name Jack Smyth could be encoded as <name norm="SmithJ">Jack Smyth</name> and so indexed under S as if it were Smith. Attributes can also be used to normalize date forms for sorting, for example, <date norm="19970315">the Ides of March 1997</date>. Another important function of attributes is to assign a unique identifier to each instance of each SGML element within a document. These identifiers can be used as a cross-reference by any kind of hypertext program. The list of possible values for an attribute may be defined as a closed set, allowing the encoder to pick from a list, or it may be entirely open.
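As a rough illustration of how a program might use such normalizing attributes, the sketch below builds a name index and a chronological list of dates from the norm values rather than from the text as written. The sample paragraph, the second name, the second date, and the function names are all assumptions made for the example; the fragment is well-formed XML so that Python's standard xml.etree.ElementTree module can read it.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoded paragraph using the norm attribute on names and dates.
TEXT = (
    "<p>"
    '<name norm="SmithJ">Jack Smyth</name> wrote to '
    '<name norm="AustenJ">Jane Austen</name> on '
    '<date norm="19970315">the Ides of March 1997</date> and on '
    '<date norm="19970101">the first of January 1997</date>.'
    "</p>"
)


def index_entries(root):
    """Return (normalized form, form as written) pairs, sorted for an index."""
    return sorted((n.get("norm"), n.text) for n in root.iter("name"))


def dates_in_order(root):
    """Return the dates as written, sorted by their normalized YYYYMMDD value."""
    return [d.text for d in sorted(root.iter("date"), key=lambda d: d.get("norm"))]


root = ET.fromstring(TEXT)
print(index_entries(root))
print(dates_in_order(root))
```

The text that a reader sees is untouched; the index and the sort order come entirely from the attribute values.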

SGML has another very useful feature. Any piece of information can be given a name and be referred to by that name in an SGML document. These named pieces of information are called entities, and a reference to an entity is written as the name preceded by an ampersand and followed by a semicolon. One use is for nonstandard characters, where, for example, é can be encoded as &eacute; thus ensuring that it can be transmitted easily across networks and from one machine to another. A standard list of these characters exists, but the document encoder can also create more. Entity references can also be used for any boilerplate text. This use avoids repetitive typing of words and phrases, thus also reducing the chance of errors. An entity reference can be resolved to any amount of text from a single letter up to the equivalent of an entire chapter.
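Resolving entity references amounts to a table lookup, which the following minimal Python sketch tries to show. The entity table, the boilerplate entity named proj, and the expand function are hypothetical, and the pattern used for entity names is a simplification rather than the full SGML name syntax.

```python
import re

# Hypothetical entity declarations: one character entity and one piece of
# boilerplate text. A real project would declare these in its DTD.
ENTITIES = {
    "eacute": "\u00e9",
    "proj": "The Example Text Project",
}

# Simplified pattern for an entity reference: ampersand, name, semicolon.
ENTITY_REF = re.compile(r"&([A-Za-z][A-Za-z0-9.-]*);")


def expand(text, entities=ENTITIES):
    """Replace each entity reference in text with its declared value."""

    def replace(match):
        name = match.group(1)
        if name not in entities:
            # An undeclared entity is an encoding error, so fail loudly.
            raise KeyError("undeclared entity: " + name)
        return entities[name]

    return ENTITY_REF.sub(replace, text)


print(expand("A caf&eacute; near the offices of &proj;."))
```

Because the stored text contains only the ASCII reference, it can move between machines unchanged, and a correction to the boilerplate needs to be made in one place only.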

The formal structure of SGML means that the encoding of a document can be validated automatically, a process known as parsing. The parser uses the SGML DTD to determine the structure of the document and can thus help to eliminate whole classes of encoding errors before the document is processed by an application program. For example, an error can be detected if the DTD specifies that a journal article must have one or more authors, but the author's name has been omitted accidentally. Mistyped element names can be detected as errors, as can elements that are wrongly nested, for example, an act within a scene when the DTD specifies that acts contain scenes. Attributes can also be validated when there is a closed set of possible values. The validation process can also detect unresolved cross-references that use SGML's built-in identifiers. The SGML document structure and validation process mean that any application program can operate more efficiently because it derives information from the DTD about what to expect in the document. It follows that the stricter the DTD, the easier it is to process the document. However, very strict DTDs may force the document encoder to make decisions that simplify what is being encoded. Looser DTDs might better reflect the nature of the information but usually require more processing. Another advantage of SGML is very apparent here. Once a project is under way, if a document encoder finds a new feature of interest, that feature can simply be added to the DTD without the need to restructure work that has already been done. Many documents can be encoded and processed with the same DTD.
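The kinds of error that validation catches can be sketched in a few lines of Python. This is a hand-written checker for illustration, not an SGML parser: the nesting rules, the closed set of values for a type attribute on a stage element, and the id and target attribute names are all assumptions, and the sample documents are well-formed XML so that the standard xml.etree.ElementTree module can read them.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules that a DTD for plays might express.
ALLOWED_CHILDREN = {
    "play": {"act"},
    "act": {"scene"},
    "scene": {"speech", "stage"},
}
STAGE_TYPES = {"entrance", "exit"}  # closed set of values for stage/@type


def validate(doc_text):
    """Return a list of error messages; an empty list means no errors found."""
    root = ET.fromstring(doc_text)
    errors = []
    declared_ids = {el.get("id") for el in root.iter() if el.get("id")}
    for element in root.iter():
        # Wrong nesting; a mistyped element name is caught here too,
        # because it will not appear in the allowed set.
        allowed = ALLOWED_CHILDREN.get(element.tag)
        for child in element:
            if allowed is not None and child.tag not in allowed:
                errors.append("%s not allowed inside %s" % (child.tag, element.tag))
        # Attribute value outside the closed set.
        kind = element.get("type")
        if element.tag == "stage" and kind is not None and kind not in STAGE_TYPES:
            errors.append("invalid stage type: " + kind)
        # Cross-reference to an identifier that does not exist.
        target = element.get("target")
        if target is not None and target not in declared_ids:
            errors.append("unresolved reference: " + target)
    return errors


GOOD = (
    '<play><act id="a1"><scene id="s1">'
    '<speech>Hello <ref target="s1"/></speech><stage type="exit"/>'
    "</scene></act></play>"
)
BAD = (
    "<play><act><scene><act/>"
    '<stage type="dance"/><speech><ref target="s9"/></speech>'
    "</scene></act></play>"
)

print(validate(GOOD))
for message in validate(BAD):
    print(message)
```

Elements that have no entry in ALLOWED_CHILDREN, such as speech, are left unconstrained here, which is the trade-off described above: a looser model accepts more documents but tells an application less about what to expect.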

