The RareDis corpus: A corpus annotated with rare diseases, their signs and symptoms

Martinez De Miguel, Claudia; Segura-Bedmar, Isabel; Chacon Solano, Esteban Gonzalo; Guerrero Aspizua, Sara

Identifiers

URI: https://hdl.handle.net/10016/36462

ISSN: 1532-0464

DOI: https://doi.org/10.1016/j.jbi.2021.103961

UXXI: AR/0000030803

Files

RareDis_JBI_2022.pdf (4.53 MB)

Publication date

2022-01

Authors

Martinez De Miguel, Claudia

Segura-Bedmar, Isabel

Chacon Solano, Esteban Gonzalo

Guerrero Aspizua, Sara

Publisher

Elsevier

Abstract

Rare diseases affect a small number of people compared to the general population. However, more than 6,000 different rare diseases exist and, in total, they affect more than 300 million people worldwide. A central problem shared by rare diseases is the delay in diagnosis and the sparse information available to researchers, clinicians, and patients. Obtaining a diagnosis can be a very long and frustrating experience for patients and their families. The average diagnostic delay is between 6 and 8 years. Many of these diseases manifest differently from patient to patient, which further hampers their detection and the choice of the correct treatment. Therefore, there is an urgent need to increase the scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) can help to extract relevant information about rare diseases to facilitate their diagnosis and treatment, but most NLP techniques require manually annotated corpora. Therefore, our goal is to create a gold standard corpus annotated with rare diseases and their clinical manifestations. It could be used to train and test NLP approaches, and the information extracted through NLP could enrich the knowledge of rare diseases and thereby help to reduce the diagnostic delay and improve the treatment of rare diseases. The paper describes the selection of the 1,041 texts included in the corpus, the annotation process, and the annotation guidelines. The entities (disease, rare disease, symptom, sign and anaphor) and the relationships (produces, is a, is acron, is synon, increases risk of, anaphora) were annotated. The RareDis corpus contains annotations for more than 5,000 rare diseases and almost 6,000 clinical manifestations. Moreover, the Inter Annotator Agreement evaluation shows a relatively high agreement (F1-measure equal to 83.5% under exact match criteria for the entities and equal to 81.3% for the relations).
Based on these results, this corpus is of high quality and represents a significant step for the field, since there is a scarcity of available corpora annotated with rare diseases. This could open the door to further NLP applications, which would facilitate the diagnosis and treatment of these rare diseases and, therefore, would dramatically improve the quality of life of these patients.
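
The abstract reports inter-annotator agreement as an F1-measure under exact match criteria. The following is a minimal sketch of how such a score is commonly computed between two annotators; it is not taken from the paper, whose exact procedure is not described in this record. The `Entity` type, the `exact_match_f1` function, the label names, and the character offsets are all hypothetical and chosen only for illustration. One annotator is treated as the reference and the other as the response, and two annotations count as a match only when their start offset, end offset, and label are all identical.

```python
from typing import Iterable, NamedTuple


class Entity(NamedTuple):
    """Hypothetical entity annotation: character span plus a type label."""
    start: int
    end: int
    label: str


def exact_match_f1(annotator_a: Iterable[Entity],
                   annotator_b: Iterable[Entity]) -> float:
    """Pairwise F1 between two annotators under exact match.

    Annotator A is treated as the reference and B as the response.
    F1 is symmetric, so swapping the roles yields the same score.
    """
    set_a, set_b = set(annotator_a), set(annotator_b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty annotation sets agree fully
    matched = len(set_a & set_b)  # identical start, end, and label
    precision = matched / len(set_b) if set_b else 0.0
    recall = matched / len(set_a) if set_a else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Illustrative (made-up) annotations of the same text by two annotators.
ann_a = [Entity(0, 15, "RAREDISEASE"),
         Entity(30, 35, "SIGN"),
         Entity(50, 58, "SYMPTOM")]
ann_b = [Entity(0, 15, "RAREDISEASE"),
         Entity(30, 36, "SIGN"),      # boundary differs by one character
         Entity(50, 58, "SYMPTOM")]

f1 = exact_match_f1(ann_a, ann_b)
print(f"Exact-match F1: {f1:.3f}")
```

In this made-up example, two of the three annotations coincide exactly, so precision and recall are both 2/3 and so is the F1 score. The one-character boundary disagreement on the second entity counts as a full miss, which is why exact match is the stricter of the usual matching criteria.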

Keywords

Gold-standard corpus, Named entity recognition, Relation extraction, Rare diseases

Bibliographic citation

Martínez-de Miguel, C., Segura-Bedmar, I., Chacón-Solano, E. & Guerrero-Aspizua, S. (2022). The RareDis corpus: A corpus annotated with rare diseases, their signs and symptoms. Journal of Biomedical Informatics, 125, 103961.

Collections

DBIAB - TERMEG - Journal Articles
DI - LABDA - Artículos de Revistas
