
Named Entity Recognition and Relationship Extraction in Biomedicine


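Basic Information

Grant number: 9796762
Principal investigator: Zhiyong Lu
Amount: $2.2549 million
Host institution: NATIONAL LIBRARY OF MEDICINE
Host institution country: United States
Project category:
Fiscal year:
Funding country: United States
Project period:
Project status: Ongoing (not yet closed)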

Source:
https://reporter.nih.gov/project-details/9796762
Keywords:
Address; Algorithms; Area; Benchmarking; Biological; Biological Neural Networks; Catalogs; Chemicals; Classification; Clinical; Color; Data; Data Set; Databases; Dependence; Diagnostic radiologic examination; Disease; Drug Interactions; Evaluation; Fundus; Gene Proteins; Genes; Goals; Human; Image Analysis; Knowledge; Label; Learning; Literature; Machine Learning; Manuals; Medical; Medical Imaging; Methods; Mining; Names; National Human Genome Research Institute; Natural Language Processing; Online Systems; Output; Paper; Pattern; Performance; Pharmaceutical Preparations; Process; PubMed; Radiology Specialty; Recurrence; Reporting; Research; Running; Supervision; SwissProt; System; Techniques; Technology; Text; Thoracic Radiography; Time; Training; Triage; Uncertainty; Voting; Work; annotation system; base; deep learning; genetic variant; genome wide association study; genomic variation; human-in-the-loop; interest; learning strategy; novel; recurrent neural network; success; text searching

Project Abstract

Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, and the relationships among them.
Synonyms pose another challenge for high-quality relevance search. This is a problem for ordinary words, but it is even more difficult for entities that can be named in a number of different ways. LitVar addresses this problem for genetic variants. For example, searching for any one of A146T, c.436G>A, or rs121913527 also finds instances of the other two. The goal is to extend this ability to other entity types.
We participated in the CHEMPROT track at BioCreative VI, which aims to assess the state of the art in automatically extracting chemical-protein relations from running text (PubMed abstracts). We proposed an ensemble of three systems: a support vector machine, a convolutional neural network, and a recurrent neural network. Their outputs are combined using majority voting or stacking to produce the final predictions. Our system obtained a precision of 0.7266 and a recall of 0.5735, for an F-score of 0.6410, the highest performance among all team submissions during the challenge.
In addition to tackling relation extraction with supervised machine-learning methods, we proposed a novel adversarial learning algorithm for unsupervised domain adaptation, where no labeled data are available in the target domain. We show that domain-invariant features can be learned in the latest neural networks, so that classifiers trained for one relation type (protein-protein) can be repurposed for others (drug-drug). Compared to prior convolutional and recurrent NN-based relation classification methods without domain adaptation, we achieve improvements as high as 30% in F1-score.
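The two simplest quantitative ideas in the CHEMPROT description above, majority voting over three systems and the F-score computed from precision and recall, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the relation labels (`"inhibitor"`, `"activator"`, `"none"`) and the tie-breaking rule are hypothetical and are not taken from the challenge or from the authors' system, and stacking is not shown.

```python
from collections import Counter
from typing import List


def f_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def majority_vote(predictions: List[str]) -> str:
    """Return the label chosen by the most systems.

    Assumption: ties are broken in favor of the system listed first.
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(predictions)
    best = max(counts.values())
    # Walk the predictions in order so a tie goes to the earliest system.
    for label in predictions:
        if counts[label] == best:
            return label
    raise AssertionError("unreachable")


if __name__ == "__main__":
    # Hypothetical outputs of the SVM, CNN, and RNN for one candidate pair.
    print(majority_vote(["inhibitor", "none", "inhibitor"]))
    # Precision and recall as reported in the abstract.
    print(f"{f_score(0.7266, 0.5735):.4f}")
```

Plugging in the reported precision (0.7266) and recall (0.5735) reproduces the reported F-score of about 0.6410, which confirms that the figure is the balanced (F1) measure.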
To further assist NLP tasks that lack pre-existing training data, we developed ezTag, a web-based annotation tool that lets users annotate text and produce training data with humans in the loop. ezTag supports both abstracts in PubMed and full-text articles in PubMed Central.
Negative and uncertain medical findings are frequent in radiology reports, but discriminating them from positive findings remains challenging for information extraction. We propose a new algorithm, NegBio, to detect negative and uncertain findings in radiology reports. Unlike previous rule-based methods, NegBio uses patterns over universal dependencies to identify the scope of triggers that indicate negation or uncertainty. We evaluated NegBio on four datasets: two public benchmarking corpora of radiology reports, a new radiology corpus that we annotated for this work, and a public corpus of general clinical texts. Evaluation on these datasets demonstrates that NegBio is highly accurate at detecting negative and uncertain findings and compares favorably to the current state of the art.
One promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we applied automated deep learning techniques to the literature triage process of UniProtKB/Swiss-Prot and of the NHGRI-EBI GWAS Catalog for genomic variation, in collaboration with their database curators. Both manual curation teams confirmed that our method achieved higher precision than their previous query-based triage methods without compromising recall. Both results show that our method is more efficient and can replace the traditional query-based triage methods of manually curated databases. Our method can give human curators more time to focus on more challenging tasks, such as the curation itself and the discovery of novel papers and experimental techniques to consider for inclusion.
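To make the NegBio idea of scoping a trigger over dependency structure more concrete, here is a minimal Python sketch. It is not NegBio's implementation and does not reproduce any of its rules: the two trigger patterns, the finding vocabulary, and the hand-written dependency edges are all hypothetical, and a real system would obtain the parse from a dependency parser. The sketch only shows why a dependency-based scope can differ from a fixed word window.

```python
from typing import List, Set, Tuple

# A dependency edge: (dependent index, relation, head index).
Edge = Tuple[int, str, int]

# Hypothetical trigger patterns: trigger word -> relation that links it to its head.
NEGATION_TRIGGERS = {"no": "det", "without": "case"}

# Hypothetical vocabulary of finding terms.
FINDINGS = {"effusion", "pneumothorax", "cardiomegaly"}


def negated_findings(tokens: List[str], edges: List[Edge]) -> Set[str]:
    """Return finding terms that fall inside the scope of a negation trigger.

    Scope (an assumption of this sketch): the trigger's head word plus any
    words coordinated with that head through a `conj` relation.
    """
    negated: Set[str] = set()
    for dep, rel, head in edges:
        if NEGATION_TRIGGERS.get(tokens[dep].lower()) != rel:
            continue
        scope = {head}
        scope |= {d for d, r, h in edges if r == "conj" and h == head}
        negated |= {tokens[i] for i in scope if tokens[i].lower() in FINDINGS}
    return negated


if __name__ == "__main__":
    # "no effusion or pneumothorax": both coordinated findings are negated.
    tokens_a = ["no", "effusion", "or", "pneumothorax"]
    edges_a = [(0, "det", 1), (2, "cc", 3), (3, "conj", 1)]
    print(negated_findings(tokens_a, edges_a))

    # "cardiomegaly without effusion": only the finding governed by the trigger.
    tokens_b = ["cardiomegaly", "without", "effusion"]
    edges_b = [(1, "case", 2), (2, "nmod", 0)]
    print(negated_findings(tokens_b, edges_b))
```

In the second example the trigger sits next to both findings, yet only "effusion" is marked, because the scope follows the dependency edge rather than word distance. Uncertainty triggers could be handled the same way with a second pattern table.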
Deep learning, a class of machine learning algorithms, has shown impressive results in several of our recent studies, as the FY18 work described above illustrates. In addition to its applications in natural language processing, we have also seen it succeed in our medical image analysis work, such as processing chest X-ray images and color fundus photographs.
