Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization

Rivera Zavala, Renzo; Martínez Fernández, Paloma

Identifiers

URI: https://hdl.handle.net/10016/37680

ISSN: 1471-2105

DOI: https://doi.org/10.1186/s12859-021-04247-9

UXXI: AR/0000030763

Files

analyzing_BMCB_2021.pdf (1.63 MB)

Publication date

2021-12-17

Authors

Rivera Zavala, Renzo

Martínez Fernández, Paloma

Publisher

BMC

Abstract

Background: The volume of biomedical literature and clinical data is growing at an exponential rate. Efficient access to the data described in unstructured biomedical texts is therefore a crucial task for the biomedical industry and research. Named Entity Recognition (NER) is the first step in information and knowledge acquisition from unstructured texts. Recent NER approaches use contextualized word representations as input for a downstream classification task. However, distributed word vectors (embeddings) are scarce for Spanish, and even more so for the biomedical domain.

Methods: In this work, we develop several biomedical Spanish word representations, and we introduce two deep learning approaches for the recognition of pharmaceutical, chemical, and other biomedical entities in Spanish clinical case texts and biomedical texts: one based on a Bi-LSTM-CRF model and the other on a BERT-based architecture.

Results: Several Spanish biomedical embeddings, together with the two deep learning models, were evaluated on the PharmaCoNER and CORD-19 datasets. The PharmaCoNER dataset is composed of a set of Spanish clinical cases annotated with drugs, chemical compounds, and pharmacological substances; on it, our extended Bi-LSTM-CRF model obtains an F-score of 85.24% on entity identification and classification, and the BERT model obtains an F-score of 88.80%. For the entity normalization task, the extended Bi-LSTM-CRF model achieves an F-score of 72.85% and the BERT model achieves 79.97%. The CORD-19 dataset consists of scholarly articles written in English and annotated with biomedical concepts such as disorder, species, chemical or drug, gene and protein, enzyme, and anatomy. On the CORD-19 dataset, the Bi-LSTM-CRF and BERT models obtain F-scores of 78.23% and 78.86%, respectively, on entity identification and classification.

Conclusion: These results show that deep learning models with in-domain knowledge learned from large-scale datasets substantially improve named entity recognition performance. Moreover, contextualized representations help to capture the complexity and ambiguity inherent in biomedical texts. Embeddings based on words, concepts, senses, etc. for languages other than English are required to improve NER in those languages.

Keywords

natural language processing, clinical texts, deep learning, contextual information

Bibliographic citation

Rivera-Zavala, R.M., Martínez, P. Analyzing transfer learning impact in biomedical cross-lingual named entity recognition and normalization. BMC Bioinformatics 22 (Suppl 1), 601 (2021). https://doi.org/10.1186/s12859-021-04247-9

Collections

DI - LABDA - Artículos de Revistas
