The RareDis corpus: A corpus annotated with rare diseases, their signs and symptoms

Martinez De Miguel, Claudia; Segura-Bedmar, Isabel; Chacon Solano, Esteban Gonzalo; Guerrero Aspizua, Sara

dc.affiliation.dpto	UC3M. Departamento de Bioingeniería	es
dc.affiliation.dpto	UC3M. Departamento de Informática	es
dc.affiliation.grupoinv	UC3M. Grupo de Investigación: Tissue Engineering and Regenerative Medicine (TERMeG)	es
dc.affiliation.grupoinv	UC3M. Grupo de Investigación: Human Language and Accessibility Technologies (HULAT)	es
dc.contributor.author	Martinez De Miguel, Claudia
dc.contributor.author	Segura-Bedmar, Isabel
dc.contributor.author	Chacon Solano, Esteban Gonzalo
dc.contributor.author	Guerrero Aspizua, Sara
dc.contributor.funder	Comunidad de Madrid	es
dc.contributor.funder	Ministerio de Economía y Competitividad (España)	es
dc.contributor.funder	Universidad Carlos III de Madrid	es
dc.date.accessioned	2023-02-03T14:11:14Z
dc.date.available	2023-02-03T14:11:14Z
dc.date.issued	2022-01
dc.description.abstract	Rare diseases affect a small number of people compared to the general population. However, more than 6,000 different rare diseases exist and, in total, they affect more than 300 million people worldwide. A central problem shared by rare diseases is the delay in diagnosis and the sparse information available to researchers, clinicians, and patients. Finding a diagnosis can be a very long and frustrating experience for patients and their families. The average diagnostic delay is 6–8 years. Many of these diseases manifest differently among patients, which further hampers their detection and the choice of the correct treatment. Therefore, there is an urgent need to increase the scientific and medical knowledge about rare diseases. Natural Language Processing (NLP) can help to extract relevant information about rare diseases to facilitate their diagnosis and treatment, but most NLP techniques require manually annotated corpora. Therefore, our goal is to create a gold standard corpus annotated with rare diseases and their clinical manifestations. It could be used to train and test NLP approaches, and the information extracted through NLP could enrich the knowledge of rare diseases and thereby help to reduce the diagnostic delay and improve the treatment of rare diseases. The paper describes the selection of 1,041 texts to be included in the corpus, the annotation process and the annotation guidelines. The entities (disease, rare disease, symptom, sign and anaphor) and the relationships (produces, is a, is acron, is synon, increases risk of, anaphora) were annotated. The RareDis corpus contains more than 5,000 annotated rare diseases and almost 6,000 annotated clinical manifestations. Moreover, the Inter Annotator Agreement evaluation shows a relatively high agreement (F1-measure equal to 83.5% under exact match criteria for the entities and equal to 81.3% for the relations). Based on these results, the corpus is of high quality and represents a significant step for the field, given the scarcity of available corpora annotated with rare diseases. This could open the door to further NLP applications, which would facilitate the diagnosis and treatment of these rare diseases and, therefore, would dramatically improve the quality of life of these patients.	en
dc.description.sponsorship	This work was supported by the Madrid Government (Comunidad de Madrid) under the Multiannual Agreement with UC3M in the line of "Fostering Young Doctors Research" (NLP4RARE-CM-UC3M) and in the context of the V PRICIT (Regional Programme of Research and Technological Innovation); the Multiannual Agreement with UC3M in the line of "Excellence of University Professors (EPUC3M17)"; and a grant from the Spanish Ministry of Economy and Competitiveness (SAF2017-86810-R).	en
dc.format.extent	12
dc.identifier.bibliographicCitation	Martínez-de Miguel, C., Segura-Bedmar, I., Chacón-Solano, E. & Guerrero-Aspizua, S. (2022). The RareDis corpus: A corpus annotated with rare diseases, their signs and symptoms. Journal of Biomedical Informatics, 125, 103961.	en
dc.identifier.doi	https://doi.org/10.1016/j.jbi.2021.103961
dc.identifier.issn	1532-0464
dc.identifier.publicationfirstpage	1
dc.identifier.publicationissue	103961
dc.identifier.publicationlastpage	12
dc.identifier.publicationtitle	Journal of Biomedical Informatics	en
dc.identifier.publicationvolume	125
dc.identifier.uri	https://hdl.handle.net/10016/36462
dc.identifier.uxxi	AR/0000030803
dc.language.iso	eng
dc.publisher	Elsevier	en
dc.relation.dataset	https://doi.org/10.21950/DEURZF
dc.relation.projectID	Gobierno de España. SAF2017-86810-R	es
dc.relation.projectID	Comunidad de Madrid. NLP4RARE-CM-UC3M	es
dc.relation.projectID	Comunidad de Madrid. EPUC3M17	es
dc.rights	© 2021 The Author(s).	en
dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Spain	*
dc.rights.accessRights	open access	en
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/es/	*
dc.subject.eciencia	Biología y Biomedicina	es
dc.subject.eciencia	Informática	es
dc.subject.eciencia	Medicina	es
dc.subject.other	Gold-standard corpus	en
dc.subject.other	Named entity recognition	en
dc.subject.other	Relation extraction	en
dc.subject.other	Rare diseases	en
dc.title	The RareDis corpus: A corpus annotated with rare diseases, their signs and symptoms	en
dc.type	research article	*
dc.type.hasVersion	VoR	*
dspace.entity.type	Publication