Encoding

A method of digitally processing oral history transcripts so that a computer can analyze interview data

Summary

  • By converting oral history interviews from Microsoft Word into XML files, the DDHI renders them machine-readable
  • The DDHI uses TEI, a pre-existing encoding standard, in order to create a Basic Layer suited specifically to the needs of our project
  • XML files can be accessed by both people and computers so that the DDHI can keep track of data like people, places, and events that are mentioned in oral history interviews

[Figure: Transcript marked up with TEI]

Key Terms and Tools

The DDHI encoding process relies on a set of standards and tools that support the encoding of oral history interview transcripts.

  • XML: The DDHI uses the Extensible Markup Language (XML) to encode oral history interview transcripts. XML defines a set of rules for encoding documents so that both people and computers can read them. Interviews, which begin as Microsoft Word documents, are converted into XML files before encoding begins.
  • TEI: The Text Encoding Initiative (TEI) is an XML-based encoding standard. Using TEI, the DDHI has created a “Basic Layer” which allows us to tag entities within our XML transcripts. TEI was designed as a digital humanities tool that lets users mark up texts with predefined tags, and it is used in a wide range of digital humanities projects, which makes it well suited to the DDHI’s work.
  • The Basic Layer: The DDHI’s encoding schema adopts a “layered” approach to tagging data. The core of the schema is the DDHI’s “Basic Layer,” which defines the five categories of information that the DDHI tags in oral history transcripts: people, dates, organizations, places, and events. These categories reflect the kinds of data that are of particular interest to oral historians. Because oral historians ask questions about people, space, time, chronology, and institutional and social contexts, the Basic Layer supports the explorations and analyses they want to undertake. See the table below for examples of the Basic Layer.
  • The OH Encoder: Developed by the DDHI in coordination with Agile Humanities, the OH Encoder is a software bundle that turns a “raw” plain-text transcript into a well-formed TEI document. Its output currently includes automated tags that adhere to our Basic Layer schema.
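
To make the idea of turning a plain-text transcript into well-formed XML more concrete, here is a minimal sketch in Python. It is not the OH Encoder's code and does not reproduce its behavior. The "SPEAKER: words" input format, the function name transcript_to_xml, the use of the TEI <u> (utterance) element, and the way the who value is built are all assumptions made for illustration. The output is only a bare skeleton: it has no teiHeader and no entity tags.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
# Serialize the TEI namespace as the default namespace (no prefix).
ET.register_namespace("", TEI_NS)


def transcript_to_xml(plain_text):
    """Wrap 'SPEAKER: words' lines in a bare TEI-style skeleton.

    Assumption: every utterance sits on one line and starts with a
    speaker label followed by a colon. Lines that do not fit are skipped.
    """
    root = ET.Element(f"{{{TEI_NS}}}TEI")
    text = ET.SubElement(root, f"{{{TEI_NS}}}text")
    body = ET.SubElement(text, f"{{{TEI_NS}}}body")
    for line in plain_text.splitlines():
        speaker, sep, words = line.partition(":")
        if not sep or not words.strip():
            continue  # blank line, or no speaker label
        # Hypothetical convention: "#narrator" style pointer for the speaker.
        who = "#" + "_".join(speaker.split()).lower()
        utterance = ET.SubElement(body, f"{{{TEI_NS}}}u", {"who": who})
        utterance.text = words.strip()  # special characters are escaped on output
    return ET.tostring(root, encoding="unicode")


raw = "INTERVIEWER: Where were you in 1968?\nNARRATOR: In Saigon, with A & B company."
print(transcript_to_xml(raw))
```

Building the tree with an XML library rather than by joining strings means that characters such as "&" and "<" in the transcript are escaped automatically, so the result stays well-formed.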

The DDHI Basic Layer

Data category   TEI Element/Attribute      Example
Place           <placeName>                <placeName>Saigon</placeName>
Person          <persName>                 <persName>Colin Powell</persName>
Organization    <orgName>                  <orgName>Vietnam Veterans Against the War</orgName>
Date            <date when="yyyy-mm-dd">   <date when="1968-01-31">January 31, 1968</date>
Event           <name type="event">        <name type="event">Vietnam War</name>
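
Because the Basic Layer uses a small, fixed set of elements, a program can collect the tagged entities from a transcript with very little code. The sketch below uses Python's standard xml.etree.ElementTree module to show one way this might look. The sample utterance, the <u> wrapper, and the function name extract_entities are hypothetical and are not taken from the DDHI's own tooling.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample: one utterance tagged with all five Basic Layer categories.
SAMPLE = f"""<TEI xmlns="{TEI_NS}">
  <text><body>
    <u who="#narrator">I landed in <placeName>Saigon</placeName> on
      <date when="1968-01-31">January 31, 1968</date>, during the
      <name type="event">Vietnam War</name>. Years later I met
      <persName>Colin Powell</persName> and joined
      <orgName>Vietnam Veterans Against the War</orgName>.</u>
  </body></text>
</TEI>"""

# Elements whose tag name alone identifies the category.
TAG_TO_CATEGORY = {
    "placeName": "place",
    "persName": "person",
    "orgName": "organization",
}


def extract_entities(tei_xml):
    """Group tagged entities by Basic Layer category."""
    found = {"place": [], "person": [], "organization": [], "date": [], "event": []}
    root = ET.fromstring(tei_xml)
    for el in root.iter():
        local = el.tag.split("}")[-1]  # drop the "{namespace}" prefix, if any
        label = " ".join("".join(el.itertext()).split())  # collapse whitespace
        if local in TAG_TO_CATEGORY:
            found[TAG_TO_CATEGORY[local]].append(label)
        elif local == "date":
            # Keep the machine-readable value next to the spoken form.
            found["date"].append((el.get("when"), label))
        elif local == "name" and el.get("type") == "event":
            found["event"].append(label)
    return found


for category, values in extract_entities(SAMPLE).items():
    print(category, values)
```

Dates are returned as pairs so that the normalized when value and the wording used in the interview stay together.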

Workflow

Encoding is the first part of the DDHI workflow. All oral history interview transcripts begin as Microsoft Word documents and are converted into well-formed TEI documents by the OH Encoder. Once the interviews are in XML format, they are ready to be tagged using the DDHI’s encoding schema.

At this stage, the DDHI tags entities according to the Basic Layer: places, persons, organizations, dates, and events each receive their own tag. Some entities are pre-tagged by the OH Encoder, but many tags still need to be added or corrected by our team. To apply tags to interview transcripts, the DDHI uses Oxygen, an XML editor. Each interview undergoes two rounds of encoding by two different DDHI associates in order to catch tagging errors.
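
Some tagging errors are mechanical and can be caught by a script before or alongside human review. The sketch below is a hypothetical check, not part of the DDHI's documented workflow. It reports whether a transcript is well-formed XML and whether each <date> element carries a when attribute in the yyyy-mm-dd pattern shown in the Basic Layer table. TEI itself also allows less precise values such as a bare year, so the strict pattern here is an assumption. The function name check_transcript is made up for this example.

```python
import re
import xml.etree.ElementTree as ET
from datetime import date

WHEN_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_transcript(tei_xml):
    """Return a list of problems found; an empty list means none were found."""
    try:
        root = ET.fromstring(tei_xml)
    except ET.ParseError as err:
        # For example, an unquoted attribute value such as when=1968-01-31.
        return [f"not well-formed XML: {err}"]

    problems = []
    for el in root.iter():
        if el.tag.split("}")[-1] != "date":
            continue
        label = " ".join("".join(el.itertext()).split())
        when = el.get("when")
        if when is None:
            problems.append(f"<date> '{label}' has no when attribute")
        elif not WHEN_PATTERN.fullmatch(when):
            problems.append(f"<date> '{label}': when='{when}' is not yyyy-mm-dd")
        else:
            try:
                date.fromisoformat(when)
            except ValueError:
                problems.append(f"<date> '{label}': when='{when}' is not a real date")
    return problems


sample = '<u>It began on <date when="1968-01-31">January 31, 1968</date>.</u>'
print(check_transcript(sample))
```

A check like this only covers form. Whether the right span was tagged, or tagged with the right category, still needs a human reader.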

Once encoded and peer-reviewed, transcripts are ready to undergo the next phase of the DDHI’s workflow: data linking.