Named Entity Recognition and Disambiguation

Named entity recognition and disambiguation (NERD) is the task of identifying the entities mentioned in a text and resolving them against a knowledge base. Identifying and resolving named entities, such as persons or locations, has many practical applications. For instance, users can extract lists of people, map different texts, generate timelines, and provide enhanced search. This is of great importance not only for research but also for the publishing process.

OAPEN is testing the integration of the NERD API into the workflow of publishing platforms to enhance the discoverability and usage of open access monographs. The data for several thousand monographs and chapters are currently available as a CSV (comma-separated values) text file here:

Download nerd_oapen_response_database.7z

Entity-fishing, the NERD implementation developed by INRIA, is a service available within the DARIAH-EU infrastructure. The partners of the OPERAS HIRMEOS project use it to enrich open access digital monographs published on five digital platforms.

Description of the data

The data is divided into the following columns:

OAPEN_ID: Unique ID of the publication in the OAPEN Library
rawName: The entity as it appears in the text
nerd_score: Disambiguation confidence score
nerd_selection_score: Selection confidence score, indicating how certain it is that the disambiguated entity is actually valid for the text mention
wikipediaExternalRef: ID of the Wikipedia page
wiki_URL: URL of the Wikipedia page
type: NER class of the entity
domains: Description of subject domain
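
As a minimal sketch of how these columns might be read, the Python example below parses rows with the standard csv module and keeps only those above a chosen selection score. It assumes that the CSV file has a header row using exactly the column names listed above, that fields are separated by commas, and that both scores are decimal numbers. The sample rows, the function name load_rows and the threshold are invented for illustration and are not taken from the actual file.

```python
import csv
import io

# Hypothetical sample rows; all values are invented for illustration only.
SAMPLE_CSV = """OAPEN_ID,rawName,nerd_score,nerd_selection_score,wikipediaExternalRef,wiki_URL,type,domains
1000001,Amsterdam,0.82,0.91,844,https://en.wikipedia.org/wiki/Amsterdam,LOCATION,Geography
1000001,Rembrandt,0.45,0.30,4254144,https://en.wikipedia.org/wiki/Rembrandt,PERSON,Art
"""


def load_rows(handle, min_selection_score=0.0):
    """Yield rows as dicts, keeping those at or above the score threshold."""
    for row in csv.DictReader(handle):
        # Assumption: both score columns hold decimal numbers.
        row["nerd_score"] = float(row["nerd_score"])
        row["nerd_selection_score"] = float(row["nerd_selection_score"])
        if row["nerd_selection_score"] >= min_selection_score:
            yield row


# In practice the handle would be the extracted CSV file opened in text mode.
rows = list(load_rows(io.StringIO(SAMPLE_CSV), min_selection_score=0.5))
for row in rows:
    print(row["OAPEN_ID"], row["rawName"], row["nerd_selection_score"])
```

A threshold on nerd_selection_score is only one possible way to discard uncertain matches; a suitable value would have to be chosen by inspecting the data.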

Each book may contain more than one occurrence of the same entity, and the nerd_score and the nerd_selection_score may vary between occurrences. This allows researchers to count the occurrences of an entity and use that count as an additional way to assess the contents of the book. The OAPEN_ID refers to the identifier of the title in the OAPEN Library. A full description of all books in the OAPEN Library is available here.
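
Counting occurrences per book could be sketched as follows in Python. The example assumes a header row with the column names OAPEN_ID and wikipediaExternalRef and treats two rows as the same entity when both values match. The sample data and the function name count_entities are hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical sample rows; all values are invented for illustration only.
SAMPLE_CSV = """OAPEN_ID,rawName,wikipediaExternalRef
1000001,Amsterdam,844
1000001,Amsterdam,844
1000001,Rembrandt,4254144
1000002,Amsterdam,844
"""


def count_entities(handle):
    """Count rows per (OAPEN_ID, wikipediaExternalRef) pair."""
    counts = Counter()
    for row in csv.DictReader(handle):
        counts[(row["OAPEN_ID"], row["wikipediaExternalRef"])] += 1
    return counts


counts = count_entities(io.StringIO(SAMPLE_CSV))
# The most frequent (book, entity) pair and its number of occurrences.
top = counts.most_common(1)
```

Grouping on the Wikipedia page ID rather than on rawName means that different surface forms resolved to the same page are counted together.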

For more information about the entity-fishing query processing service see