
Extraction of biomedical knowledge from literature and its systematization

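Basic Information

Grant number:
12208001
Principal investigator:
TAKAGI Toshihisa
Amount:
$1.1725 million
Host institution:
The University of Tokyo
Host institution country:
Japan
Project category:
Grant-in-Aid for Scientific Research on Priority Areas
Fiscal year:
2000
Funding country:
Japan
Project period:
2000 to 2004
Project status:
Completed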

Source:
https://kaken.nii.ac.jp/en/grant/KAKENHI-PROJECT-12208001/
Keywords:
ontology; genome databases; signal transduction; information extraction from literature; natural language processing; tagged corpus; gene dictionary; pathway databases; kinase database

Project Abstract

To understand living systems systematically from the flood of biological data, such as genome sequences, gene expression profiles, and molecular interactions, it is indispensable to build databases of gene and protein interactions and their functions extracted from the literature. From this perspective, we have tackled two challenges: 1) automatically extracting knowledge of biological functions from the literature, and 2) representing and utilizing the extracted knowledge on computers. Our efforts are briefly described below.

a) We developed a knowledge extraction system. We have largely established a method for extracting information on interactions among genes, proteins, and chemical compounds from the literature. Our system achieved a recall of about 50% and a precision of about 90%.

b) We developed dictionaries of gene names and gene family names that are used to identify those names in the literature. GENA, one of the dictionaries, stores about 880,000 gene names and, depending on the organism, covers 90-95% of all the genes appearing in the literature. Using the dictionaries and the extraction system described above, we developed and published an interaction database called PRIME and a dictionary of biological functional terms. PRIME stores about three million interactions for six eukaryotes, including human and rat.

c) We prepared a corpus and an ontology for knowledge extraction. Developing and evaluating a knowledge extraction system requires a tagged corpus and an ontology that defines domain-specific terms. We therefore developed and published the GENIA corpus, which consists of 2,000 MEDLINE abstracts whose terms are annotated with semantic and part-of-speech tags. In addition, we developed the GENIA ontology, which is used to add semantic tags to terms in the literature.

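The abstract reports recall and precision for the extraction system and describes dictionaries used to identify gene names in text. As a rough illustration of how those two ideas relate, the sketch below tags gene names by simple dictionary lookup and scores the result against a hand-labeled set. It is not the project's method: the toy dictionary, the example sentence, and the function names `tag_gene_names` and `precision_recall` are all assumptions made for this example, and real systems handle synonyms, multi-word names, and ambiguity that this sketch ignores.

```python
import re

# Hypothetical toy dictionary; these entries are illustrative and are
# not taken from GENA or any other resource named in the abstract.
GENE_DICTIONARY = {"tp53", "brca1", "mapk1"}


def tag_gene_names(text, dictionary):
    """Return the dictionary entries that occur as whole tokens in text."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return {token for token in tokens if token in dictionary}


def precision_recall(predicted, gold):
    """Compute precision and recall of a predicted set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Assumed example sentence and gold annotation (MDM2 is deliberately
# missing from the toy dictionary, so it cannot be found).
sentence = "TP53 interacts with BRCA1 and MDM2."
gold_names = {"tp53", "brca1", "mdm2"}

predicted_names = tag_gene_names(sentence, GENE_DICTIONARY)
precision, recall = precision_recall(predicted_names, gold_names)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case every tagged name is correct, so precision is high, while the name missing from the dictionary lowers recall. That is the same general pattern as the high-precision, lower-recall figures quoted in the abstract, though the numbers here come only from the made-up example.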