Named Entity Recognition and Relationship Extraction in Biomedicine
Basic information
- Grant number: 9362446
- Principal investigator:
- Amount: $1.4039 million
- Host institution:
- Host institution country: United States
- Project category:
- Fiscal year:
- Funding country: United States
- Project period: to
- Project status: Not concluded
- Source:
- Keywords: Adoption; Agreement; Amino Acid Sequence Databases; Area; Bioinformatics; Biological; Biology; Biomedical Research; Chemicals; Clinical; Communities; Country; Data; Databases; Development; Disease; Drug Interactions; Epidemiology; Event; Gene Proteins; Genes; Goals; Gold; Human; Human Genome; Individual; International; Internet; Investments; Java; Joints; Knowledge; Literature; Machine Learning; Maintenance; Manuals; Methodology; Methods; Mining; Modeling; Names; Pharmaceutical Preparations; Process; Production; PubMed; Reporting; Research; Research Infrastructure; Research Personnel; Research Project Grants; Science; Services; Software Tools; Source Code; SwissProt; System; Systems Development; Techniques; Technology; Text; Time; Training; Variant; Work; base; crowdsourcing; data format; design; flexibility; improved; interest; knowledge base; open source; text searching; tool; web services
Project abstract
Mining useful knowledge from the biomedical literature holds potential for improving literature search, automating biological data curation, and supporting many other scientific tasks. Hence, it is important to be able to recognize various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals. Indeed, our previous PubMed log analysis revealed that people search for certain biomedical concepts more often than others and that there are strong associations between different concepts. For example, a disease name often co-occurs with gene/protein and drug names. To assess the state of the art in biomedical entity recognition and relation extraction, we organized a science competition at BioCreative V, an international challenge event for evaluating advances in text mining research for biology. Specifically, we designed two challenge tasks: disease named entity recognition (DNER) and chemical-induced disease (CID) relation extraction. To assist system development and assessment, we created a large annotated text corpus consisting of human annotations of chemicals, diseases, and their interactions in 1500 PubMed articles. In total, 34 teams worldwide participated in the chemical-disease relation (CDR) task: 16 in DNER and 18 in CID. The best systems achieved an F-score of 86.46% for the DNER task, a result that approaches the human inter-annotator agreement (0.8875), and an F-score of 57.03% for the CID task, the highest results ever reported for such tasks. Given the level of participation and the team results, we found our task to be successful in engaging the text-mining research community, producing a large annotated corpus, and improving the results of automatic disease recognition and CDR extraction.
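For readers unfamiliar with the F-score figures quoted above, the following is a minimal sketch of how precision, recall, and their harmonic mean are commonly computed when system output is compared against gold-standard annotations. The matching criterion (exact set membership), the function name `f_score`, and all data are assumptions made for illustration; the abstract does not state the scoring rules actually used in the challenge.

```python
from typing import Hashable, Set, Tuple


def f_score(gold: Set[Hashable], predicted: Set[Hashable]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting an item as correct only on exact match."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations: (document id, annotated item). Values are made up.
gold = {("doc1", "itemA"), ("doc1", "itemB"), ("doc2", "itemC"), ("doc2", "itemD")}
predicted = {("doc1", "itemA"), ("doc1", "itemB"), ("doc2", "itemX")}

precision, recall, f1 = f_score(gold, predicted)
print(f"precision={precision:.4f} recall={recall:.4f} F1={f1:.4f}")
```

In this toy case two of the three predictions are correct and two of the four gold items are found, so the F-score falls between the precision and the recall. The same calculation applied to two human annotators' outputs is one common way to express inter-annotator agreement on the same scale.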
In addition to organizing the BioCreative task, we continued our own development of biomedical named entity taggers in 2015-2016. First and foremost, we created a general toolkit called TaggerOne: the first machine learning model for joint named entity recognition and normalization. TaggerOne is an all-purpose tagger (i.e. not specific to any entity type), requiring only annotated training data and a corresponding lexicon, and has been optimized for high throughput. We validated TaggerOne with multiple gold-standard corpora containing both mention- and concept-level annotations. Its results compare favorably to the previous state of the art, notwithstanding the greater flexibility of the model. TaggerOne is implemented in Java and its source code has been made publicly available to the research community.
However, large-scale use of open-source tools sometimes requires a significant investment in infrastructure and maintenance time. These costs not only impede the continued adoption of text mining tools, but also reduce the ability of individual researchers to explore applying text mining to problems in their research area. In contrast, web services provide on-demand access to software tools through the Internet using straightforward interfaces and data formats. Providing text mining tools as web services therefore lowers the barrier to use for biocurators and bioinformatics researchers not working specifically in text mining, allowing free exploration and the ability to focus on results rather than methodology.
Therefore, in 2015 we developed the NCBI text-mining web services, an online version of our text mining tool suite for biomedical concept recognition and information extraction. Our service incorporates multiple state-of-the-art tools for identifying critical entity types: DNorm (diseases), GNormPlus (genes and proteins), SR4GN (species), tmChem (chemicals and drugs), and tmVar (variants). Since its inception, our web service has processed over 60 million requests from researchers in 46 countries, supporting research projects in biocuration, crowdsourcing, and translational bioinformatics. We anticipate that providing text mining tools as web services will greatly expand their utility to the biomedical research community.
Finally, as mentioned earlier, one promising application area for text mining research is assisting manual literature curation, a highly time-consuming and labor-intensive process. In this regard, we continued to improve our existing curation-assistance tool PubTator and to collaborate with domain experts, in this case human database curators. As a result of these efforts, our PubTator system is used daily in the production curation pipelines of two external databases:
1. HuGE Navigator, a CDC knowledge base of human genome epidemiology
2. SwissProt, an annotated database of protein sequence and functional information
Project outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other grants by Zhiyong Lu
Query Log Analysis for Improving User Access to NCBI Web Services
- Grant number: 9564626
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Machine Learning and Natural Language Processing for Biomedical Applications
- Grant number: 10927050
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
- Grant number: 10007525
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Automatic Analysis and Annotation of Document Keywords in Biomedical Literature
- Grant number: 8149607
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
- Grant number: 9796762
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
- Grant number: 8558092
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
- Grant number: 8344934
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
- Grant number: 8943212
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
- Grant number: 8943240
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
- Grant number: 8558091
- Fiscal year:
- Funding amount: $1.4039 million
- Project category:
Similar overseas grants
A study for cross borders Indonesian nurses and care workers: Case of Japan-Indonesia Economic Partnership Agreement
- Grant number: 22KJ0334
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category: Grant-in-Aid for JSPS Fellows
NSF-NOAA Interagency Agreement (IAA) for the Global Oscillations Network Group (GONG)
- Grant number: 2410236
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category: Cooperative Agreement
Conditions for U.S. Agreement on the Closure of Contested Overseas Bases: Relations of Threat, Alliance and Base Alternatives
- Grant number: 23K18762
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category: Grant-in-Aid for Research Activity Start-up
MSI Smart Manufacturing Data Hub – Open Calls Grant Funding Agreement
- Grant number: 900240
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category: Collaborative R&D
Challenges of the Paris Agreement Exposed by the Energy Shift by External Factors: The Case of Renewable Energy Policies in Japan, the U.S., and the EU
- Grant number: 23H00770
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category: Grant-in-Aid for Scientific Research (B)
Continuation of Cooperative Agreement between U.S. Food and Drug Administration and S.C. Department of Health and Environmental Control (DHEC) for MFRPS Maintenance.
- Grant number: 10829529
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category:
National Ecological Observatory Network Governing Cooperative Agreement
- Grant number: 2346114
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category: Cooperative Agreement
The Kansas Department of Agriculture's Flexible Funding Model Cooperative Agreement for MFRPS Maintenance, FPTF, and Special Project.
- Grant number: 10828588
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category:
Robust approaches for the analysis of agreement between clinical measurements: development of guidance and software tools for researchers
- Grant number: MR/X029301/1
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category: Research Grant
Doctoral Dissertation Research: Linguistic transfer in a contact variety of Spanish: Gender agreement production and attitudes
- Grant number: 2234506
- Fiscal year: 2023
- Funding amount: $1.4039 million
- Project category: Standard Grant


















