A guide to metadata by
the Metadata Advisory Group of the MIT Libraries
TEI (Text Encoding Initiative) Metadata
Definition:
Text Encoding Initiative: defines a general-purpose scheme
that makes it possible to encode different textual views.
“Grew out of technology based textual analysis applications
employed by Humanities scholars”[1]
e.g., tracing the use of the word ‘love’ in poems of a
given genre within a specific historical period. Focus has
been on text capture (putting into electronic form a text that
already exists in another medium) rather than text creation (where
no other copy of the text exists). [2] Assumes that texts
and works on texts have a common core of textual features.
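The kind of analysis mentioned above (tracing a word across poems of different genres) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the verse lines are invented, and using the type attribute of the lg (line group) element to record genre is a choice made for this example, not a practice prescribed by the source.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample: <lg> groups verse lines, <l> marks a single line.
# Recording the genre in the "type" attribute is an assumption of this sketch.
SAMPLE = """<text>
  <body>
    <lg type="sonnet">
      <l>I sing of love and love alone</l>
      <l>The river runs below the stone</l>
    </lg>
    <lg type="ballad">
      <l>O love, return to me</l>
    </lg>
  </body>
</text>"""


def count_word_by_genre(xml_text, word):
    """Count whole-word, case-insensitive matches of `word` per <lg type>."""
    root = ET.fromstring(xml_text)
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    counts = {}
    for group in root.iter("lg"):
        genre = group.get("type", "unknown")
        for line in group.iter("l"):
            text = "".join(line.itertext())
            counts[genre] = counts.get(genre, 0) + len(pattern.findall(text))
    return counts


counts = count_word_by_genre(SAMPLE, "love")
print(counts)
```

Because the markup separates verse lines from everything else in the file, the count is restricted to the poetry itself rather than to headers or editorial notes.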
Constituency:
Originally a joint project of the
• Association for Computers and the Humanities
• Association for Computational Linguistics
• Association for Literary and Linguistic Computing
TEI addresses many of the needs of the “language technology
community which is amassing substantial multi-lingual, multi-modal
corpora of spoken and written texts and lexicons in order
to advance research in human language, understanding, production,
and translation.” [3]
History of use:
Begun in 1987 as an international project for the encoding
of electronic textual materials. Planning conference held
at Vassar College in 1987 led to an agreement on basic design
goals. From the initial stages, there has been a relationship
between TEI and MARC bibliographic records. The TEI Header
was based on ISBD and intended to supply information suitable
to create a catalog record. Similar to MARC, TEI makes a distinction
between required, recommended, and optional encoding practices
and provides a mechanism for user-defined extensions to the
scheme.
E-text centers and MARC communities have fostered communication.
Library of Congress MARC to SGML crosswalk located on LC’s
web site: http://lcweb.loc.gov/marc/marcsgml.html
• 1990: First draft version of the TEI Header and Guidelines
was distributed.
• May 1994: “Guidelines for the
encoding and interchange of Machine-Readable Texts”
was issued. Guidelines provide conventions for describing
physical and logical structures of text types for research
in language technology, computational linguistics, and the
humanities.
• June 1998: TEI and XML in Digital Libraries workshop
sponsored by LC and the Digital Library Federation charged
a working group to recommend some best practices for TEI header
content.
• June 2001 revision: TEI P4 disseminated to provide
equal support for XML and SGML applications. Next revision
expected early 2002.
Prerequisites:
TEI is an interchange format independent of application.
Progress towards standardization:
Joint project of the Associations resulted in an extensible
SGML ‘document type definition’. TEI Guidelines
published. TEI continues to develop and maintain encoding
standards. Has a specific mark-up syntax as well as a large
well-defined tag set, but few tags are mandatory. Similar
to MARC in specifying an input standard for each tag.
Responsibility divided between two committees: Committee on
Text Representation and Committee on Text Interpretation and
Analysis. Standard feature: the TEI Header, which provides bibliographic
history, provenance information, and information about the
text and its creation (encoder, file size, file availability,
encoding practices).
TEI/MARC best practices for TEI Headers distributed by University
of Michigan:
http://www-personal.umich.edu/~jaheim/teiguide.html
Encoding:
SGML (ISO 8879) and ISO 646 (7-bit character set standard).
Encodings for different views of text; alternative encodings
for the same text features; mechanisms for user-defined extensions
to the scheme. The Guidelines make it possible to encode many
different views of the text, simultaneously if necessary.
TEI Guidelines are not prescriptive: few features are mandatory,
but the Guidelines define a core set of tags. Extensible.
The focus is on the capture of text that already exists in
another medium rather than text creation.
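As an illustration of encoding more than one view of the same text, the sketch below stores an original spelling and a regularized spelling side by side and renders either view on request. The sentence is invented, and the choice, orig and reg element names follow later TEI (P5) practice rather than anything stated in this guide, so treat them as assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: <choice>/<orig>/<reg> follow later TEI (P5) practice.
# The sentence itself is invented for illustration.
SAMPLE = (
    "<p>She spake of <choice><orig>loue</orig><reg>love</reg></choice>"
    " and duty.</p>"
)


def render_view(element, prefer):
    """Flatten `element` to text, keeping only the `prefer` child of each <choice>."""
    parts = [element.text or ""]
    for child in element:
        if child.tag == "choice":
            chosen = child.find(prefer)
            if chosen is not None:
                parts.append(render_view(chosen, prefer))
        else:
            parts.append(render_view(child, prefer))
        # Text that follows the child element belongs to every view.
        parts.append(child.tail or "")
    return "".join(parts)


paragraph = ET.fromstring(SAMPLE)
diplomatic = render_view(paragraph, "orig")
regularized = render_view(paragraph, "reg")
print(diplomatic)
print(regularized)
```

Both readings live in one file; which one a reader sees is decided when the text is processed, not when it is encoded.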
TEI Header is a set of descriptions prefixed to a
TEI encoded document that specifies four components:
• file description (a full bibliographic description),
• encoding description (level of detail of the analysis; the
aim or purpose for which an electronic file was encoded;
editorial principles and practices used during the encoding
of the text),
• text profile (classificatory and contextual information
such as the text’s subject matter; the languages and
sublanguages used, the situation in which it was produced,
the participants and their setting),
• revision history (history of changes during the
electronic file’s development).
The Header contains bibliographic
information supporting resource discovery, and data management
portions supporting use of the resource.
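The four components correspond to the header elements fileDesc, encodingDesc, profileDesc and revisionDesc. A minimal Python sketch of such a skeleton follows. The function name, the sample values and the simplified child content are assumptions made for illustration; a real header would carry far more detail and should be checked against the Guidelines.

```python
import xml.etree.ElementTree as ET


def build_header(title, source_note, encoding_note, language, change_note):
    """Build a skeletal header with the four components (simplified content)."""
    header = ET.Element("teiHeader")

    # 1. File description: the bibliographic description of the file.
    file_desc = ET.SubElement(header, "fileDesc")
    title_stmt = ET.SubElement(file_desc, "titleStmt")
    ET.SubElement(title_stmt, "title").text = title
    publication = ET.SubElement(file_desc, "publicationStmt")
    ET.SubElement(publication, "p").text = "Unpublished sample file."
    source_desc = ET.SubElement(file_desc, "sourceDesc")
    ET.SubElement(source_desc, "p").text = source_note

    # 2. Encoding description: purpose and editorial practices.
    encoding_desc = ET.SubElement(header, "encodingDesc")
    ET.SubElement(encoding_desc, "p").text = encoding_note

    # 3. Text profile: here only the language used.
    profile_desc = ET.SubElement(header, "profileDesc")
    lang_usage = ET.SubElement(profile_desc, "langUsage")
    ET.SubElement(lang_usage, "language").text = language

    # 4. Revision history: one simplified change entry.
    revision_desc = ET.SubElement(header, "revisionDesc")
    ET.SubElement(revision_desc, "change").text = change_note
    return header


header = build_header(
    title="Sample Poems: an electronic edition",
    source_note="Transcribed from a print edition.",
    encoding_note="Line breaks preserved; spelling not regularized.",
    language="English",
    change_note="Initial encoding.",
)
print(ET.tostring(header, encoding="unicode"))
```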
If the TEI Header is similar to the information contained in a MARC
record, why didn’t the scholarly community simply use
MARC?
Workflow is the primary answer: the TEI drafters envisioned
that the individuals who marked up the electronic texts would
be creating the metadata for them and shouldn’t be expected
to know cataloging rules, but the Header was deliberately
designed to provide a trained cataloger the information necessary
to create a good cataloging record. The difference is that
the rules for obtaining and representing the content are not
prescribed and, consequently, catalogers find that the data
is usable only to the extent that the encoder followed cataloging
rules.[4]
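To make the cataloger's position concrete, here is a rough Python sketch that copies a few header values into MARC-style fields. The mapping table is illustrative only and is not the Library of Congress crosswalk; the header content, the function name and the choice of fields (100 for author, 245 for title, 260 for publisher) are assumptions of this example.

```python
import xml.etree.ElementTree as ET

# Invented header fragment for illustration.
HEADER = """<teiHeader>
  <fileDesc>
    <titleStmt>
      <title>Sample Poems: an electronic edition</title>
      <author>Doe, Jane</author>
    </titleStmt>
    <publicationStmt>
      <publisher>Example Text Center</publisher>
    </publicationStmt>
    <sourceDesc><p>Transcribed from a print edition.</p></sourceDesc>
  </fileDesc>
</teiHeader>"""

# Illustrative mapping only: header path -> (MARC field, subfield).
CROSSWALK = {
    "fileDesc/titleStmt/title": ("245", "a"),
    "fileDesc/titleStmt/author": ("100", "a"),
    "fileDesc/publicationStmt/publisher": ("260", "b"),
}


def header_to_marc(header_xml):
    """Copy mapped header values verbatim into (field, subfield, value) tuples."""
    root = ET.fromstring(header_xml)
    record = []
    for path, (field, subfield) in CROSSWALK.items():
        node = root.find(path)
        if node is not None and node.text:
            record.append((field, subfield, node.text.strip()))
    return sorted(record)


record = header_to_marc(HEADER)
for field, subfield, value in record:
    print(field, "$" + subfield, value)
```

The values are copied exactly as the encoder typed them. Whether the author's name arrives in catalog-entry form depends entirely on the encoder, which is the limitation described above.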
Over time, the progression of the TEI header has been towards
greater consistency and compatibility with traditional library
cataloging and greater syntactical congruence with MARC. It
is conceivable that the TEI header will evolve such that it
would carry detailed encoding, profile and revision information,
but would point to a MARC record that would contain the bibliographic
description.
The TEI Header supports a number of field categories which
cannot be captured in MARC, e.g., the change history section
provides a structure for logging changes made to an electronic
text, including date, responsible party and the nature of
the change. The source description (sourceDesc) within the file description (fileDesc)
allows for a detailed and richly content-designated description,
particularly for non-print sources. The encoding description (encodingDesc)
provides for a lengthy and detailed description of the
encoding of the electronic file including the data about the
project, the purpose for which it was created, the editorial
decisions that were made, and the transcription practices
that were used.
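The change history is easy to process because each entry records the date, the responsible party and the nature of the change. The Python sketch below reads such a log. The entries are invented, the date/respStmt/item layout is assumed to follow the P4-era pattern, and ISO-style dates are assumed so that sorting by string gives chronological order.

```python
import xml.etree.ElementTree as ET

# Invented change log; the <date>/<respStmt>/<item> layout is an assumption.
REVISION = """<revisionDesc>
  <change>
    <date>2001-06-15</date>
    <respStmt><name>J. Doe</name></respStmt>
    <item>Corrected transcription errors in stanza 2.</item>
  </change>
  <change>
    <date>2001-05-02</date>
    <respStmt><name>A. Smith</name></respStmt>
    <item>Initial encoding of the text.</item>
  </change>
</revisionDesc>"""


def read_change_log(revision_xml):
    """Return change entries as dicts, oldest first (assumes ISO dates)."""
    root = ET.fromstring(revision_xml)
    log = []
    for change in root.findall("change"):
        log.append({
            "date": change.findtext("date", default="").strip(),
            "who": change.findtext("respStmt/name", default="").strip(),
            "what": change.findtext("item", default="").strip(),
        })
    return sorted(log, key=lambda entry: entry["date"])


log = read_change_log(REVISION)
for entry in log:
    print(entry["date"], entry["who"], entry["what"])
```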
Unlike METS, the TEI Guidelines do not specify a particular
approach to the problem of fidelity to the source text and
recoverability of the original, but they do provide for typographic
and linguistic characteristics of the text rather than a detailed
mark up of the layout or fine distinctions of the manuscript.
TEI does not restrict combining objective and subjective information
in the encoding. The Guidelines provide a means of encoding
both text representation and text interpretation
and analysis.
Implementations:
(see http://www.tei-c.org/applications/index.html)
• University of Michigan Digital
Library Program
• Making of America Project
• Women Writers Online: Brown University
• Electronic Text Center: University of Virginia
Useful links:
• TEI home page: http://www.tei-c.org/
• TEI Guidelines: http://www.tei-c.org/P4X/
• International metadata initiative: http://lcweb.loc.gov/catdir/bibcontrol/caplan_paper.html
• Guidelines for electronic encoding and interchange:
http://www.hti.umich.edu/t/tei/
• Teach yourself TEI: http://www.tei-c.org/Tutorials/
• Projects using the TEI: http://www.tei-c.org/Applications/
[1] OCLC Systems and Services, v. 17, no.
3, p. 117.
[2] Guidelines for Electronic Text Encoding
and Interchange (TEI P3), p. 2
[3] TEI Guidelines as posted by the University
of Illinois at Chicago: http://www.uic.edu/orgs/tei/p3/
[4] Priscilla Caplan, “International
Metadata Initiatives: Lessons in Bibliographic Control”,
p. 2 Conference on Bibliographic Control in the New Millennium
(Library of Congress) http://lcweb.loc.gov/catdir/bibcontrol/caplan_paper.html