Linguistic Representation enhanced LSTM-CRF for Disease Named Entity Recognition in Biomedical Literature

  • 楊 馥謙

Student thesis: Master's Thesis


Recognizing disease named entities in biomedical literature is a crucial task in biomedical research, as it facilitates downstream work such as disease-chemical relation extraction. Most disease named entity recognition systems rely heavily on hand-crafted features and domain knowledge, and are therefore usually built on machine learning methods such as conditional random fields. However, this approach has high computational complexity at the training stage. The diversity of disease names also makes recognition more difficult. We therefore propose L-LSTM-CRF, a system based on a neural network architecture with an effective linguistic representation. The representation consists of pre-trained word embeddings, character and stem-character representations obtained from a convolutional neural network, and a feature group embedding composed of three powerful features commonly used in current systems: dictionary lookup, disease ending words, and abbreviation detection. The linguistic representation is then passed to an LSTM-CRF layer, which predicts the labels of each sentence and extracts the disease named entities. For evaluation, we collected four disease-related corpora: the CDR corpus from the BioCreative V CDR task, the NCBI disease corpus, the miRNA corpus, and the DISAE corpus. Our approach achieves state-of-the-art performance on these disease extraction corpora, reaching 91.16% on the CDR corpus.
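The feature group named in the abstract (dictionary lookup, disease ending word, abbreviation detection) can be illustrated with a minimal sketch that produces three binary flags per token. Everything concrete here is an assumption made for illustration and is not taken from the thesis: the tiny dictionary, the list of ending words, the reading of "disease ending word" as a word that commonly closes a disease mention, the abbreviation heuristic, and all function names. In a system like the one described, such flags would presumably be mapped to an embedding and combined with the word and character representations.

```python
import re

# Hypothetical resources: the abstract does not list the actual dictionary
# or ending words, so these small sets are placeholders.
DISEASE_DICTIONARY = {"hypertension", "diabetes", "hepatitis"}
DISEASE_ENDING_WORDS = {"disease", "syndrome", "disorder", "cancer"}

# Assumed heuristic: 2 to 6 characters, uppercase letters and digits only,
# starting with a letter.
ABBREVIATION_PATTERN = re.compile(r"[A-Z][A-Z0-9]{1,5}")


def dictionary_flag(token):
    """1 if the lowercased token appears in the disease dictionary."""
    return int(token.lower() in DISEASE_DICTIONARY)


def ending_word_flag(token):
    """1 if the token is a word that commonly ends a disease mention."""
    return int(token.lower() in DISEASE_ENDING_WORDS)


def abbreviation_flag(token):
    """1 if the whole token looks like an abbreviation."""
    return int(ABBREVIATION_PATTERN.fullmatch(token) is not None)


def feature_group(tokens):
    """Return one (dictionary, ending word, abbreviation) triple per token."""
    return [
        (dictionary_flag(t), ending_word_flag(t), abbreviation_flag(t))
        for t in tokens
    ]


if __name__ == "__main__":
    sentence = ["Patients", "with", "CKD", "and", "hypertension"]
    for token, flags in zip(sentence, feature_group(sentence)):
        print(token, flags)
```

Keeping each feature as a separate function makes it easy to replace a placeholder with a real resource, for example a full disease lexicon, without touching the other two.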
Date of Award: 2018 Aug 31
Original language: English
Supervisor: Hung-Yu Kao
