Automatic Analysis and Annotation of Document Keywords in Biomedical Literature

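Basic information

  • Grant number:
    8558117
  • Principal investigator:
  • Amount:
    $260,300
  • Host institution:
  • Host institution country:
    United States
  • Project category:
  • Fiscal year:
  • Funding country:
    United States
  • Start and end dates:
  • Project status:
    Not concluded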

Project abstract

1) Electronic Textbook and PubMed Central Indexing. Current processing of the electronic textbook material involves a number of steps designed to identify the most meaningful phrases in the text for use as reference points. The first task is to identify grammatically reasonable phrases. We use a version of the Brill transformation-based tagger, rewritten in C++, for part-of-speech tagging. This forms the basis for determining grammatically reasonable phrases. A significant post-processing step removes phrases that involve inappropriate references to context (e.g., "different cells", "final mutation"). After finding grammatically reasonable phrases, we attempt to eliminate those that are too common or generic to be useful (e.g., "significant result", "short time"). The next step is to compare a phrase with previously rated phrases that have been collected over the life of the project. The final stage is to estimate the importance of a phrase in the passage where it is found in a textbook. This estimate is based on the frequency of the phrase in the passage and the size of the passage, compared with the frequency of the phrase throughout the book and the overall size of the book. To improve the estimate, we attempt to take into account not only the phrase itself but any phrase that represents the same concept. For this purpose we use the UMLS Metathesaurus together with stemming, and combine the two approaches into a consistent picture of the concept as it occurs in the text. The result of this processing is a scored list of phrase/book-section pairs for each textbook. These are used to guide the response to general searches in the books. When a user types in a phrase that is on our curated list, the first results given are the highly rated book sections for that phrase. We are now applying a similar indexing scheme to the text of articles in PubMed Central. This allows us to give a list of highly rated phrases for each article as an enhanced reference point for searchers.
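
The abstract does not give the formula behind the passage-level importance estimate. As a minimal sketch only, one way to compare a phrase's frequency in a passage with its frequency in the whole book is a log ratio of the two occurrence rates. The function name `phrase_importance`, its parameters, and the log-ratio form are all assumptions made for illustration, not the project's actual scoring method.

```python
import math


def phrase_importance(passage_count, passage_size, book_count, book_size):
    """Hypothetical importance score (an assumption, not the project's formula).

    Returns the natural log of the ratio between the phrase's occurrence
    rate in the passage and its occurrence rate in the whole book.
    Positive values mean the phrase is over-represented in the passage.
    Sizes are measured in tokens and must be positive.
    """
    if passage_count == 0 or book_count == 0:
        # A phrase that never occurs carries no evidence either way.
        return 0.0
    passage_rate = passage_count / passage_size
    book_rate = book_count / book_size
    return math.log(passage_rate / book_rate)


if __name__ == "__main__":
    # Illustrative numbers only: 5 hits in a 1,000-token passage versus
    # 10 hits in a 100,000-token book.
    print(phrase_importance(5, 1000, 10, 100000))
```

Counts for concept-equivalent phrases (for example, variants merged through the UMLS Metathesaurus or stemming, as described above) could be summed before calling the function, so that the score reflects the concept and not a single surface form.
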
2) A significant fraction of queries in PubMed are multiterm queries, and PubMed generally handles them as a Boolean conjunction of the terms. However, analysis of queries in PubMed indicates that many such queries are meaningful phrases, rather than simply collections of terms. We have examined whether it makes a difference, in terms of retrieval quality, if such queries are interpreted as a phrase or as a conjunction of query terms and, if it does, what the optimal way of searching with such queries is. To address the question, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that the class of records that contain all the search terms, but not the phrase, differs qualitatively from the class of records containing the phrase. We also show that the difference is systematic, depending on the proximity of the query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching. The important insight here for indexing is that in some cases where the words of a phrase occur in text, but not as the phrase, the phrase may still be an appropriate concept to use in indexing the text.
3) Currently we are studying how good phrases can be recognized by their characteristics, such as frequency, tendency to be repeated in documents where they occur, and other numerical properties. These features allow one to predict which phrases are of high quality. We have found such predictions to be useful in studying different kinds of terms that may appear in text and how an ontology might be extracted from text.
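
The proximity effect described in item 2 suggests a retrieval order in which records containing the exact phrase come first, followed by records in which the query terms lie closer together. The sketch below illustrates that idea under simple assumptions: a regex tokenizer, a shortest-window measure of proximity, and the hypothetical helper names `tokenize`, `min_span`, `contains_phrase`, and `rank_records`. It is not the evaluation method or ranking used in the project.

```python
import re


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def min_span(tokens, terms):
    """Length of the shortest token window containing every term.

    Returns None if the term list is empty or some term is missing.
    """
    needed = set(terms)
    if not needed:
        return None
    counts = {}  # needed terms currently inside the window
    best = None
    left = 0
    for right, tok in enumerate(tokens):
        if tok in needed:
            counts[tok] = counts.get(tok, 0) + 1
        # Shrink the window from the left while it still covers all terms.
        while len(counts) == len(needed):
            width = right - left + 1
            if best is None or width < best:
                best = width
            dropped = tokens[left]
            if dropped in counts:
                counts[dropped] -= 1
                if counts[dropped] == 0:
                    del counts[dropped]
            left += 1
    return best


def contains_phrase(tokens, terms):
    """True if the terms occur contiguously and in order."""
    n = len(terms)
    return any(tokens[i:i + n] == terms for i in range(len(tokens) - n + 1))


def rank_records(query, records):
    """Order records: exact phrase matches first, then by tighter proximity.

    Records missing any query term are dropped (Boolean conjunction).
    """
    terms = tokenize(query)
    scored = []
    for rec in records:
        tokens = tokenize(rec)
        span = min_span(tokens, terms)
        if span is None:
            continue
        key = (0 if contains_phrase(tokens, terms) else 1, span)
        scored.append((key, rec))
    scored.sort(key=lambda pair: pair[0])
    return [rec for _, rec in scored]
```

Sorting on the pair (phrase flag, window width) keeps the Boolean-conjunction result set intact and only changes its order, which matches the abstract's framing of establishing a retrieval order for records that already contain all the terms.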

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other grants by Willy Wilbur

General and Semi-supervised Machine Learning Applied to Bioinformatics
  • Grant number:
    8558105
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:
Natural Language Processing Techniques To Enhance Information Access.
  • Grant number:
    8943224
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:
Automatic Analysis and Annotation of Document Keywords in Biomedical Literature
  • Grant number:
    8344960
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:
A Document Processing System
  • Grant number:
    8344939
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:
PubMed Query Log Analysis and Use in Access Enhancement
  • Grant number:
    7969244
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:
Automatic Bayesian Methods In Text Retrieval
  • Grant number:
    8149591
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:
A Document Processing System
  • Grant number:
    8149592
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:
General and Semi-supervised Machine Learning Applied to Bioinformatics
  • Grant number:
    8149602
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:
A Document Processing System
  • Grant number:
    9160906
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:
A Document Processing System
  • Grant number:
    7969199
  • Fiscal year:
  • Funding amount:
    $260,300
  • Project category:

Similar overseas grants

Rational design of rapidly translatable, highly antigenic and novel recombinant immunogens to address deficiencies of current snakebite treatments
  • Grant number:
    MR/S03398X/2
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    Fellowship
Re-thinking drug nanocrystals as highly loaded vectors to address key unmet therapeutic challenges
  • Grant number:
    EP/Y001486/1
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    Research Grant
CAREER: FEAST (Food Ecosystems And circularity for Sustainable Transformation) framework to address Hidden Hunger
  • Grant number:
    2338423
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    Continuing Grant
Metrology to address ion suppression in multimodal mass spectrometry imaging with application in oncology
  • Grant number:
    MR/X03657X/1
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    Fellowship
CRII: SHF: A Novel Address Translation Architecture for Virtualized Clouds
  • Grant number:
    2348066
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    Standard Grant
BIORETS: Convergence Research Experiences for Teachers in Synthetic and Systems Biology to Address Challenges in Food, Health, Energy, and Environment
  • Grant number:
    2341402
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    Standard Grant
The Abundance Project: Enhancing Cultural & Green Inclusion in Social Prescribing in Southwest London to Address Ethnic Inequalities in Mental Health
  • Grant number:
    AH/Z505481/1
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    Research Grant
ERAMET - Ecosystem for rapid adoption of modelling and simulation METhods to address regulatory needs in the development of orphan and paediatric medicines
  • Grant number:
    10107647
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    EU-Funded
Ecosystem for rapid adoption of modelling and simulation METhods to address regulatory needs in the development of orphan and paediatric medicines
  • Grant number:
    10106221
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    EU-Funded
Recite: Building Research by Communities to Address Inequities through Expression
  • Grant number:
    AH/Z505341/1
  • Fiscal year:
    2024
  • Funding amount:
    $260,300
  • Project category:
    Research Grant