Curation-oriented Recognition and Retrieval from Biomedical Literature

  • 徐 禕佑

Student thesis: Doctoral Thesis

Abstract

With the rapid growth of biomedical literature, there is an increasing need to integrate text mining and machine learning into biological databases. Many databases collect specific topics and their corresponding resources, such as experimental data and research literature. However, processing unstructured data through text mining is a complex and dynamic area that interests several disciplines (e.g., chemists, biologists, and computer scientists). To automatically extract knowledge from texts and effectively confirm the knowledge recorded in biological databases, biomedical named-entity recognition (NER) and document triage are considered among the more challenging tasks. This dissertation therefore focuses on these two major topics.

Determining the semantic relatedness of two biomedical terms is an important task for many text-mining applications in the biomedical field. Previous studies, such as those using ontology-based and corpus-based approaches, measured semantic relatedness by using information from the structure of biomedical literature, but these methods are limited by the small size of training resources. To increase the size of training datasets, the outputs of search engines have been used extensively to analyze the lexical patterns of biomedical terms. In this work, we propose the Mutually Reinforcing Lexical Pattern Ranking (ReLPR) algorithm for learning and exploring the lexical patterns of synonym pairs in biomedical text. ReLPR employs lexical patterns and their pattern containers to assess the semantic relatedness of biomedical terms. By combining sentence structures with the linking activities between containers and lexical patterns, our algorithm can explore the correlation between two biomedical terms.

NER plays an important role in the development of biological databases. However, existing NER tools produce a wide variety of named entities, which may include both curatable and non-curatable markers. Classifying curatable named entities is a straightforward way to accelerate the biocuration workflow. Co-occurrence Interaction Nexus with Named-entity Recognition (CoINNER) is a web-based tool that allows users to identify gene, chemical, disease, and action-term mentions in the Comparative Toxicogenomics Database (CTD). We developed CoINNER by extending our previous system. First, the pre-tagging results of CoINNER were built on the state-of-the-art named-entity recognition tools from BioCreative III. Next, a method based on conditional random fields (CRFs) is proposed to predict chemical and disease mentions in the articles. Finally, action-term mentions were collected by latent Dirichlet allocation (LDA). The results of CoINNER were significantly superior to those of previous methods.

In recent years, the number of medical articles has grown rapidly: the number of articles in PubMed has increased exponentially, and the workload of biocurators has increased with it. Under these circumstances, a system that can automatically determine in advance which articles have a higher priority for curation can effectively reduce the workload of biocurators. Determining how to effectively find the articles required by biocurators has become an important task, the Article Classification Task (ACT). In the BioCreative 2012 workshop, we proposed the Co-occurrence Interaction Nexus (CoIN) for learning and exploring relations in articles. We constructed a co-occurrence analysis system that is applicable to PubMed articles and suitable for gene, chemical, and disease queries. CoIN uses co-occurrence features and their network centralities to assess the influence of curatable articles from the Comparative Toxicogenomics Database. The experimental results show that our network-based approach, combined with co-occurrence features, can effectively classify curatable and non-curatable articles. CoIN also allows biocurators to retrieve the articles related to specific queries without reviewing meaningless information. In the BioCreative CTD ACT task, CoIN achieved a mean average precision of 0.778 in the triage task, finishing in second place among all participants.
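For readers unfamiliar with the metric quoted for the triage task, the following is a minimal sketch of how mean average precision is commonly computed over ranked article lists. It is not the thesis's or the BioCreative organizers' evaluation code: the function names and the toy labels are hypothetical, and it assumes each ranked list contains every relevant article, so average precision is normalized by the number of relevant articles found in the list.

```python
from typing import Sequence


def average_precision(ranked_labels: Sequence[int]) -> float:
    """Average precision for one ranked list.

    ranked_labels holds 1 for a curatable (relevant) article and 0 otherwise,
    ordered from the highest-ranked article to the lowest.
    Assumption: all relevant articles appear somewhere in the list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(rankings: Sequence[Sequence[int]]) -> float:
    """Mean of the per-query average precision values."""
    if not rankings:
        return 0.0
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Toy example with two hypothetical queries (not data from the thesis).
toy_rankings = [[1, 0, 1, 0], [0, 1]]
toy_map = mean_average_precision(toy_rankings)
```

A ranking that places curatable articles near the top scores close to 1, which is why the metric suits triage, where biocurators read from the top of the list.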
Date of Award: 2015 Feb 6
Original language: English
Supervisor: Hung-Yu Kao
