Automatic Analysis and Annotation of Document Keywords in Biomedical Literature


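Basic Information

  • Grant number:
    8149607
  • Principal investigator:
  • Amount:
    $391,700
  • Host institution:
  • Host institution country:
    United States
  • Project category:
  • Fiscal year:
  • Funding country:
    United States
  • Start and end dates:
  • Project status:
    Ongoing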

Project Abstract

As a document retrieval system, PubMed aims to provide efficient access to millions of scientific documents. To do so, it matches user queries against keywords and semantic representations of PubMed documents. One type of semantic representation used in MEDLINE citations is Medical Subject Headings (MeSH) indexing terms, which are assigned by professional human indexers at the National Library of Medicine. Author keywords, provided by authors when submitting an article, instead capture the essence of a document's topic from the authors' perspective. Finally, readers have their own opinions about which words are important to an article, and these may or may not agree with either the MeSH terms or the author keywords of the same article.
PubMed relies on human indexers to assign the appropriate MeSH indexing terms to PubMed articles, a very time- and labor-intensive process. As a result, these terms are not immediately available for new articles. In fact, our analysis shows that on average it takes over 90 days for a PubMed citation to be manually annotated with MeSH terms. In response, we have developed a machine learning algorithm that automatically predicts MeSH terms using a set of novel features. When compared to other state-of-the-art methods, our approach achieved significantly better performance. We are currently exploring its potential for assisting the manual MeSH curation process in practice.
Whereas MeSH terms require human curation, author keywords can be obtained freely from journal articles when they are available. We conducted a first study of author keywords in biomedical articles, in which we described the growth of author keywords in biomedical journal articles and presented a comparative study of author keywords and MeSH indexing terms. A similarity metric from our past study was used to automatically assess the relatedness between pairs of author keywords and MeSH indexing terms. Furthermore, a set of 300 pairs was manually reviewed to evaluate the metric and characterize the relationships between the two term types. Results show that author keywords are increasingly available in biomedical articles and that over 60% of author keywords can be linked to a closely related indexing term. Results of this work have implications for both MEDLINE document indexing and MeSH terminology development.
Finally, by comparison, we found that neither MeSH terms nor author keywords overlap significantly with the words that are important from the users' point of view, which motivated us to learn what characteristics make document words important from a collective user perspective. Specifically, we applied machine learning to identify document keywords that are likely to be used frequently in user queries. Each word was represented by a set of features covering different types of information, such as semantic type, part-of-speech tag, TF-IDF weight, and location in the abstract. We examined both traditional features such as TF-IDF and novel ones such as named entity, which had not been explored before in this context. We identified the most important features and evaluated our model using months of real-world PubMed log data. Our results suggest that, in addition to carrying high TF-IDF weight, important words from the users' perspective tend to be biomedical entities, to appear in article titles, and to occur repeatedly in article abstracts. This study enabled us to automatically predict words likely to appear in user queries that lead to document clicks. The relative importance of predicted words can also play a role in ranking documents by relevance.
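
To make the per-word feature representation described in the abstract more concrete, the following is a minimal sketch, not the project's actual implementation. It covers only the features that need no external tools: TF-IDF weight, presence in the title, repetition in the abstract, and location in the abstract. The function names (`tokenize`, `word_features`), the regular-expression tokenizer, the smoothed IDF formula, and the `doc_freq` and `n_docs` inputs are all assumptions made for illustration. Semantic type, part-of-speech tag, and named-entity features are omitted because they would require a tagger or entity recognizer.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def word_features(title, abstract, doc_freq, n_docs):
    """Return {word: feature dict} for every word in one article.

    doc_freq maps a word to the number of corpus documents containing it,
    and n_docs is the corpus size; both are hypothetical inputs.
    """
    title_tokens = set(tokenize(title))
    abstract_tokens = tokenize(abstract)
    counts = Counter(abstract_tokens)

    # Index of the first occurrence of each word in the abstract.
    first_pos = {}
    for i, tok in enumerate(abstract_tokens):
        first_pos.setdefault(tok, i)

    features = {}
    for word in title_tokens | set(abstract_tokens):
        # Term frequency counts the title as one extra occurrence.
        tf = counts[word] + (1 if word in title_tokens else 0)
        # Smoothed IDF; one common variant, not necessarily the one used in the study.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(word, 0))) + 1
        features[word] = {
            "tfidf": tf * idf,
            "in_title": word in title_tokens,
            "abstract_count": counts[word],
            # Relative position of first occurrence, or None for title-only words.
            "first_position": (
                first_pos[word] / len(abstract_tokens) if word in first_pos else None
            ),
        }
    return features


if __name__ == "__main__":
    feats = word_features(
        "MeSH indexing",
        "Indexing of MeSH terms. MeSH terms help retrieval.",
        doc_freq={"of": 9},
        n_docs=9,
    )
    # Rank words by TF-IDF weight, highest first.
    for word, f in sorted(feats.items(), key=lambda kv: -kv[1]["tfidf"]):
        print(word, f)
```

In a setup like the one the abstract describes, feature dictionaries of this kind would be paired with labels derived from query logs (for example, whether a word appeared in queries that led to a click on the article) and passed to a classifier; that labeling step is not shown here.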

Project Outputs

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other Grants by Zhiyong Lu

Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    9362446
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
  • Grant number:
    9564626
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:
Machine Learning and Natural Language Processing for Biomedical Applications
  • Grant number:
    10927050
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    10007525
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    9796762
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    8558092
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
  • Grant number:
    8344934
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
  • Grant number:
    8943212
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:
Named Entity Recognition and Relationship Extraction in Biomedicine
  • Grant number:
    8943240
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:
Query Log Analysis for Improving User Access to NCBI Web Services
  • Grant number:
    8558091
  • Fiscal year:
  • Funding amount:
    $391,700
  • Project category:

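Similar NSFC Grants

R&D and Industrial Application of a Software Framework and Parking Functions for In-Vehicle Central Computing Platforms
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project
Platform Software for the Design and Regulation of Low-Altitude Aircraft and Their Airspace
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project
R&D and Industrialization of High-Power, High-Voltage GaN Devices Based on Diamond Packaging for Efficient Heat Dissipation
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project
Development and Industrialization of Equipment for High-Performance Precision Components of New-Energy Intelligent Vehicles
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project
Key Technologies and Equipment Development for High-Efficiency, Intelligent Ultra-Low-Wind-Speed Wind Turbines
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project
R&D of Key Technologies and Equipment for Green Hydrogen Production, Storage, and Refueling
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project
Research and Application of Key Technologies for Ultra-Precision Machining and Inspection of Complex Electronic Products
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project
R&D of New Anti-Peptic-Ulcer Drugs
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project
Synthetic-Biology-Based Optimization of Animal Chassis Breeds and Pilot-Scale Application Research
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project
Development of "Yujiang Paidu Heji" (鱼酱排毒合剂), a Class 1.1 Innovative Traditional Chinese Medicine Drug
  • Grant number:
  • Approval year:
    2025
  • Funding amount:
    CNY 0
  • Project category:
    Provincial/municipal-level project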

Similar Overseas Grants

A Knowledge-aware Multi-tasks-based Disease Network Construction on Biomedical Literature
  • Grant number:
    24K15097
  • Fiscal year:
    2024
  • Funding amount:
    $391,700
  • Project category:
    Grant-in-Aid for Scientific Research (C)
Decolonization, Appropriation and the Materials of Literature in Africa and its Diaspora
  • Grant number:
    EP/Y024516/1
  • Fiscal year:
    2024
  • Funding amount:
    $391,700
  • Project category:
    Research Grant
Transnational Utopia: Cosmopolitanism and Empire in Wartime Japanese-Language Literature
  • Grant number:
    24K15977
  • Fiscal year:
    2024
  • Funding amount:
    $391,700
  • Project category:
    Grant-in-Aid for Early-Career Scientists
Narrating War in Meiji Japan: Investigating the relationship between journalism and literature via the writing of dispatched war reporters
  • Grant number:
    24K15983
  • Fiscal year:
    2024
  • Funding amount:
    $391,700
  • Project category:
    Grant-in-Aid for Early-Career Scientists
Tuning Large language models to read biological literature
  • Grant number:
    BB/Y514032/1
  • Fiscal year:
    2024
  • Funding amount:
    $391,700
  • Project category:
    Research Grant
Minor Literature, Identity, and Transnational Communities in Imperial Japanese Hansen's Disease Sanatoria
  • Grant number:
    24K03629
  • Fiscal year:
    2024
  • Funding amount:
    $391,700
  • Project category:
    Grant-in-Aid for Scientific Research (C)
Modernism's East Asia: Semi-Asiatic Literature and Global Modernity
  • Grant number:
    DE240101070
  • Fiscal year:
    2024
  • Funding amount:
    $391,700
  • Project category:
    Discovery Early Career Researcher Award
Unruly heroines and the man-eating giantesses: Representations of Saracen Women in Medieval French and English literature, 1100 - 1400
  • Grant number:
    2886726
  • Fiscal year:
    2023
  • Funding amount:
    $391,700
  • Project category:
    Studentship
A comprehensive historical study on the correlation between modern Japanese literature and historical stylistic concepts
  • Grant number:
    23K00313
  • Fiscal year:
    2023
  • Funding amount:
    $391,700
  • Project category:
    Grant-in-Aid for Scientific Research (C)
Literature and War: The Yugoslavia conflict in German literature with special reference to the texts by Peter Handke and Saša Stanišić
  • Grant number:
    23K00442
  • Fiscal year:
    2023
  • Funding amount:
    $391,700
  • Project category:
    Grant-in-Aid for Scientific Research (C)