Machine Learning and Natural Language Processing for Biomedical Applications

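Basic Information

  • Grant number:
    10927050
  • Principal investigator:
  • Amount:
    $3.8734 million
  • Host institution:
  • Host institution country:
    United States
  • Project category:
  • Fiscal year:
  • Funding country:
    United States
  • Project period:
  • Project status:
    Ongoing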

Project Abstract

Over the last decade, online search for biological information has progressed rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's updated online PubMed system. However, finding data relevant to a user's information need is not always easy. Improving our understanding of the growing population of Entrez users, their information needs, and the ways in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. For instance, queries with similar information needs tend to have similar document clicks, especially in biomedical literature search engines, where queries are generally short and the top documents account for most of the total clicks. Motivated by this, we present a novel architecture for biomedical literature search, Log-Augmented DEnse Retrieval (LADER), a simple plug-in module that augments a dense retriever with the click logs retrieved from similar training queries. Specifically, LADER uses a dense retriever to find both documents and training queries that are similar to the given query. LADER then scores the relevant (clicked) documents of the similar queries, weighted by the similarity of those queries to the input query. Our results demonstrate that LADER achieves new state-of-the-art (SOTA) performance on TripClick, a recently released benchmark for biomedical literature retrieval.
Using advanced machine-learning and NLP techniques, we are able to provide enhanced access to special topics in the biomedical literature. One such example is tracking variant-related information in the relevant genomic literature, a crucial task for genomic research and precision medicine.
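The LADER scoring step described above can be illustrated with a small sketch. This is not the authors' implementation: the toy vectors, the softmax weighting, the interpolation weight `lam`, and the function name `log_augmented_scores` are all assumptions made for illustration. The abstract states only that clicked documents of similar queries are scored according to the similarity of those queries to the input query.

```python
import math


def dot(u, v):
    """Inner product, standing in for the dense retriever's similarity."""
    return sum(a * b for a, b in zip(u, v))


def softmax(xs):
    """Turn raw similarities into weights that sum to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def log_augmented_scores(query_vec, doc_vecs, train_queries, k=2, lam=0.5):
    """Blend dense scores with click-log scores from similar training queries.

    doc_vecs:      {doc_id: embedding}
    train_queries: list of (embedding, set of clicked doc_ids)
    k, lam:        hypothetical hyperparameters (neighbours kept, blend weight)
    """
    doc_ids = list(doc_vecs)
    # Dense term: similarity of the query to every document, normalised.
    dense = dict(zip(doc_ids, softmax([dot(query_vec, doc_vecs[d]) for d in doc_ids])))

    # Retrieve the k most similar training queries with the same similarity.
    ranked = sorted(train_queries, key=lambda tq: dot(query_vec, tq[0]), reverse=True)[:k]
    weights = softmax([dot(query_vec, vec) for vec, _ in ranked])

    # Log term: each clicked document inherits the weight of its query.
    log_score = {d: 0.0 for d in doc_ids}
    for w, (_, clicked) in zip(weights, ranked):
        for d in clicked:
            if d in log_score:
                log_score[d] += w

    return {d: (1 - lam) * dense[d] + lam * log_score[d] for d in doc_ids}


# Toy data: three documents and three logged training queries.
docs = {"d1": [1.0, 0.0], "d2": [0.0, 1.0], "d3": [0.7, 0.7]}
train = [([1.0, 0.1], {"d3"}), ([0.9, 0.0], {"d3"}), ([0.0, 1.0], {"d2"})]
query = [1.0, 0.0]

scores = log_augmented_scores(query, docs, train)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

With these toy inputs the dense scores alone rank d1 first, while the click logs of the two most similar training queries move d3 to the top of the blended ranking.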
We previously developed LitVar, a semantic search system that uses advanced text- and data-mining techniques to identify and normalize variant information in full-length articles. In 2022, we launched LitVar 2.0, a significantly improved system with several major expansions over its predecessor: (1) improved variant recognition accuracy; (2) the inclusion of variant information from article supplementary data; (3) more powerful search capabilities; and (4) a redesigned user interface for more convenient navigation of results.
Another successful example is LitCovid, a literature database of COVID-19-related papers in PubMed that was first launched in 2020. To date, LitCovid has accumulated over 360,000 articles and millions of accesses. In 2023, several thousand new articles have been added to LitCovid each month. In response to the continuing evolution of the COVID-19 pandemic, we have made significant updates to LitCovid over the last two years. First, we introduced the long COVID collection, which consists of articles on COVID-19 survivors experiencing ongoing multisystemic symptoms, including respiratory issues, cardiovascular disease, cognitive impairment, and profound fatigue. Second, we provided new annotations for the latest COVID-19 strains and vaccines mentioned in the literature. Third, we improved several existing features with more accurate machine learning algorithms for annotating topics and classifying articles relevant to COVID-19.
In addition to providing enhanced access to specific literature information as discussed above, directly extracting useful knowledge from the biomedical literature holds potential for accelerating literature-based discovery, automating biological data curation, and supporting many other scientific tasks.
We have therefore focused on recognizing various types of biological entities in free text, such as genes/proteins, diseases/conditions, and drugs/chemicals, as well as their relationships. Manually labeling training data for building biomedical named entity recognition (BioNER) algorithms is costly because accurate annotation requires significant domain expertise. As a result, current BioNER approaches are prone to overfitting and suffer from limited generalizability. In response, we proposed a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to enhance the accuracy and stability of BioNER models. Specifically, we introduced AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO scheme. We evaluated AIONER on 14 BioNER benchmark tasks and showed that it is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrated the practical utility of AIONER in three independent tasks that require recognizing entity types not previously seen in the training data, as well as its advantages over existing methods for processing biomedical text at a large scale (e.g., all of PubMed).
Since its release in late 2022, ChatGPT, a general-purpose chatbot developed by OpenAI, has been widely reported to have the potential to revolutionize how people interact with information online. Like other large language models (LLMs), ChatGPT has been trained on a large text corpus to predict probable words from the surrounding context. ChatGPT, however, has received substantial popular attention for generating human-like conversational responses, and new developments are occurring rapidly. Recent work has discussed applications of ChatGPT for medical education and clinical decision support.
However, health care professionals should be aware of the drawbacks and limitations, as well as the potential capabilities, of using ChatGPT and similar LLMs to interact with medical knowledge. In a recent perspective, we envision that a "retrieve, summarize, and verify" paradigm could greatly benefit biomedical information seeking. By combining LLMs with search engines, this approach leverages the impressive capability of LLMs to generate high-level summaries while minimizing the risk of directly using false or fabricated information.
Augmenting LLMs with domain-specific tools such as database utilities is another way to facilitate easier and more precise access to specialized knowledge. To this end, we developed GeneGPT, a novel method for teaching LLMs to use the Web APIs of the National Center for Biotechnology Information (NCBI) to answer genomics questions. Specifically, we prompt Codex to solve the GeneTuring tests with NCBI Web APIs by using in-context learning and an augmented decoding algorithm that can detect and execute API calls. Experimental results show that GeneGPT achieves state-of-the-art performance on eight tasks in the GeneTuring benchmark with an average score of 0.83, substantially outperforming retrieval-augmented LLMs such as the new Bing (0.44), biomedical LLMs such as BioMedLM (0.08) and BioGPT (0.04), as well as GPT-3 (0.16) and ChatGPT (0.12).
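The idea of a decoding loop that detects and executes API calls, as described for GeneGPT above, can be sketched as follows. This is a minimal illustration under stated assumptions, not the GeneGPT code: the `[url]->` call marker, the function names, the stub model, and the canned response are hypothetical, and the gene names are placeholders. Only the E-utilities base URL is a real NCBI endpoint, and the sketch never contacts it.

```python
import re
from urllib.request import urlopen

# Assumed convention: the model emits "[<url>]->" when it wants a call executed.
CALL_PATTERN = re.compile(r"\[(https?://[^\]\s]+)\]->$")


def http_fetch(url, timeout=10):
    """Real fetcher for live use (not called in the offline demo below)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def answer_with_tools(prompt, generate, fetch, max_calls=5):
    """Alternate between text generation and API execution.

    generate(text) -> continuation that stops after "->" or at the final answer
    fetch(url)     -> response body as a string
    """
    text = prompt
    for _ in range(max_calls):
        text += generate(text)
        match = CALL_PATTERN.search(text)
        if match is None:
            return text  # no pending call, so generation is finished
        # Execute the detected call and append its result to the context.
        text += "[" + fetch(match.group(1)) + "]\n"
    return text


# Offline stand-ins for the language model and the Web API.
EXAMPLE_URL = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
               "esearch.fcgi?db=gene&term=EXAMPLE_ALIAS")
calls = []


def stub_generate(text):
    # First turn: request an API call. Second turn: answer from the result.
    if "]->[" not in text:
        return "[" + EXAMPLE_URL + "]->"
    return "Answer: EXAMPLE_SYMBOL"


def stub_fetch(url):
    calls.append(url)
    return "symbol=EXAMPLE_SYMBOL"  # canned placeholder, not real NCBI output


result = answer_with_tools("Question: official symbol of EXAMPLE_ALIAS?\n",
                           stub_generate, stub_fetch)
print(result)
```

Passing `http_fetch` instead of `stub_fetch`, together with a real model wrapper for `generate`, would turn the same loop into a live one. Keeping the fetcher injectable makes the control flow testable without network access.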

Project Outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Zhiyong Lu

Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    9362446
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
  • Grant number:
    9564626
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    10007525
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:
Automatic Analysis and Annotation of Document Keywords in Biomedical Literature
  • Grant number:
    8149607
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    8558092
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    9796762
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
  • Grant number:
    8344934
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
  • Grant number:
    8943212
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    8943240
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
  • Grant number:
    8558091
  • Fiscal year:
  • Funding amount:
    $3.8734 million
  • Project category:

Similar Overseas Grants

Improving access to information in perinatal women: Creating and piloting a needs-based information tools
  • Grant number:
    23K16469
  • Fiscal year:
    2023
  • Funding amount:
    $3.8734 million
  • Project category:
    Grant-in-Aid for Early-Career Scientists
Is Better Access to Information Effective in Improving Labor Market Outcomes? Experimental Evidence
  • Grant number:
    1954016
  • Fiscal year:
    2019
  • Funding amount:
    $3.8734 million
  • Project category:
    Standard Grant
Collaborative Research: CESER: EAGER: "FabWave" - A Pilot Manufacturing Cyberinfrastructure for Shareable Access to Information Rich Product Manufacturing Data
  • Grant number:
    1812687
  • Fiscal year:
    2018
  • Funding amount:
    $3.8734 million
  • Project category:
    Standard Grant
Collaborative Research: CESER: EAGER: "FabWave" - A Pilot Manufacturing Cyberinfrastructure for Shareable Access to Information Rich Product Manufacturing Data
  • Grant number:
    1812675
  • Fiscal year:
    2018
  • Funding amount:
    $3.8734 million
  • Project category:
    Standard Grant
Is Better Access to Information Effective in Improving Labor Market Outcomes? Experimental Evidence
  • Grant number:
    1824465
  • Fiscal year:
    2018
  • Funding amount:
    $3.8734 million
  • Project category:
    Standard Grant
Index Herbariorum Upgrade: A Project to Improve Access to Information about the World's Plant and Fungal Collections Assets
  • Grant number:
    1600051
  • Fiscal year:
    2016
  • Funding amount:
    $3.8734 million
  • Project category:
    Standard Grant
TC: Large: Collaborative Research: Facilitating Free and Open Access to Information on the Internet
  • Grant number:
    1540066
  • Fiscal year:
    2015
  • Funding amount:
    $3.8734 million
  • Project category:
    Continuing Grant
INEQUALITY IN HIGHER EDUCATION OUTCOMES IN THE UK: SUBJECTIVE EXPECTATIONS, PREFERENCES, AND ACCESS TO INFORMATION
  • Grant number:
    ES/M008622/1
  • Fiscal year:
    2015
  • Funding amount:
    $3.8734 million
  • Project category:
    Research Grant
Collaborative Access to Information about Physical Objects via See-Through Displays
  • Grant number:
    413142-2011
  • Fiscal year:
    2013
  • Funding amount:
    $3.8734 million
  • Project category:
    Strategic Projects - Group
Study on the social system for guaranteeing equal access to information in Scandinavia as human rights protection system
  • Grant number:
    24530777
  • Fiscal year:
    2012
  • Funding amount:
    $3.8734 million
  • Project category:
    Grant-in-Aid for Scientific Research (C)