Data Linking

A method for linking encoded data to external databases

Summary

  • Encoding every entity in an interview lets the computer recognize each one as a person, place, organization, event, or date.
  • Named Entity Linking is the process we use to connect the encoded entities in our interviews to external databases. It allows us to attach additional information to each encoded entity.
  • Linking an encoded entity to an external information source associates it with a unique identifier and with other useful data about the entity. Linking also provides a way to standardize the data into consistent (“tidy”) formats.

Key Terms and Tools

Once the user is satisfied that every entity in a transcript has been properly encoded, a software script reprocesses the XML file and compiles the encoded entities into lists, one per entity type. Each of the five lists is then automatically written back into its own section of the XML file.
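The compiling step described above can be sketched in a few lines of Python. This is only an illustration, not the DDHI script itself: the tag names in ENTITY_TAGS, the sample transcript, and the function name compile_entity_lists are all assumptions made for the example, and the real encoding scheme may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical mapping from XML tag to entity type; the actual
# tag names used in DDHI transcripts may differ.
ENTITY_TAGS = {
    "persName": "person",
    "placeName": "place",
    "orgName": "organization",
    "event": "event",
    "date": "date",
}

# Invented sample transcript, used only for illustration.
SAMPLE = """<transcript>
  <u>I met <persName>Jane Doe</persName> in <placeName>Hanover</placeName>
  in <date>1969</date>. Later <persName>Jane Doe</persName> joined the
  <orgName>Peace Corps</orgName>.</u>
</transcript>"""


def compile_entity_lists(xml_text):
    """Return one de-duplicated list of entity names per entity type."""
    root = ET.fromstring(xml_text)
    lists = defaultdict(list)
    for elem in root.iter():
        kind = ENTITY_TAGS.get(elem.tag)
        if kind is None:
            continue  # not an encoded entity
        # Join the element's text and collapse internal whitespace.
        name = " ".join("".join(elem.itertext()).split())
        if name and name not in lists[kind]:
            lists[kind].append(name)
    return dict(lists)


entity_lists = compile_entity_lists(SAMPLE)
for kind, names in entity_lists.items():
    print(kind, names)
```

Repeated mentions of the same name collapse into a single list entry, so each entity needs to be linked only once.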

  • Wikidata: Our primary source for external data is Wikidata, a large, free, and openly licensed knowledge base. Beyond free access, Wikidata has features that are especially useful for our purposes. Each Wikidata entry contains additional data about its subject, such as latitude and longitude coordinates for places or time data for dates and events.
  • QID: Every Wikidata entry has a unique identifier known as a QID. Once an entity is linked to a QID, the computer can recognize cases in which different encoded entities refer to the same real-world person, place, organization, or event. Linking to a QID therefore enables the computer to recognize every mention of the same entity.
  • DDHI Data Linker: The DDHI linking process is facilitated by our DDHI Data Linker, a software bundle that produces lists (in .TSV format) of the encoded entities contained in a document. After the encoded entities have been linked to entries in Wikidata, the Data Linker writes the linked data back into the XML version of the transcript.
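To illustrate why a shared QID matters, the following minimal Python sketch groups differently worded mentions by the QID they were linked to. The mention strings and the function name group_mentions_by_qid are invented for this example; Q60 and Q30 are the Wikidata identifiers for New York City and the United States.

```python
from collections import defaultdict

# Illustrative linked mentions: (text as it appears in a transcript, QID).
linked_mentions = [
    ("New York City", "Q60"),
    ("the States", "Q30"),
    ("NYC", "Q60"),
    ("United States", "Q30"),
]


def group_mentions_by_qid(mentions):
    """Collect every surface form that was linked to the same QID."""
    groups = defaultdict(list)
    for surface_form, qid in mentions:
        groups[qid].append(surface_form)
    return dict(groups)


groups = group_mentions_by_qid(linked_mentions)
for qid, forms in groups.items():
    print(qid, forms)
```

Because the grouping key is the QID rather than the text, "NYC" and "New York City" end up in the same group even though the strings differ.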

Workflow

The DDHI linking workflow is an example of Named Entity Linking.

The user first runs the Data Linker on the XML version of an encoded transcript to generate lists (in .TSV format) of each type of encoded entity contained in the document. The user then links the individual entities on those lists to their corresponding entries in Wikidata, so that each entity can be associated with its QID, location coordinates, and other relevant data. Finally, the user runs the Data Linker a second time to write the linked data back into the XML transcript file.
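The three steps above can be sketched as a small round trip in Python. This is a sketch under stated assumptions, not the Data Linker's actual behavior: the function names export_tsv and write_back, the TSV column names, the placeName tag, and the ref attribute used to store the QID are all hypothetical. The manual linking step is simulated by filling in one QID (Q60, New York City) by hand.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Invented sample transcript, used only for illustration.
SAMPLE = (
    "<transcript><u>We moved to <placeName>NYC</placeName> "
    "from <placeName>Hanover</placeName>.</u></transcript>"
)


def export_tsv(root, tag):
    """Step 1: list each distinct entity of one type, with an empty QID column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["name", "qid"])  # hypothetical column names
    seen = set()
    for elem in root.iter(tag):
        name = (elem.text or "").strip()
        if name and name not in seen:
            seen.add(name)
            writer.writerow([name, ""])
    return buffer.getvalue()


def write_back(root, tag, linked_tsv):
    """Step 3: copy each QID from the completed TSV onto matching elements."""
    reader = csv.DictReader(io.StringIO(linked_tsv), delimiter="\t")
    qids = {row["name"]: row["qid"] for row in reader if row["qid"]}
    for elem in root.iter(tag):
        qid = qids.get((elem.text or "").strip())
        if qid:
            elem.set("ref", qid)  # hypothetical attribute for the QID


root = ET.fromstring(SAMPLE)
blank_tsv = export_tsv(root, "placeName")

# Step 2 (normally done by the user): fill in the QID for one entity.
linked_tsv = blank_tsv.replace("NYC\t\n", "NYC\tQ60\n")

write_back(root, "placeName", linked_tsv)
result = ET.tostring(root, encoding="unicode")
print(result)
```

Entities whose QID column is left empty are written back unchanged, so a partially linked list does not damage the transcript.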

After the Data Linker has written the linked data back into the transcript file, the encoded data in the transcript is ready for visualization.