Automatic Analysis and Annotation of Document Keywords in Biomedical Literature

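Basic Information

Grant number:
8558117
Principal investigator:
Willy Wilbur
Amount:
$260,300
Host institution:
NATIONAL LIBRARY OF MEDICINE
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Start and end dates:
to
Project status:
Ongoing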

Project Abstract

1) Electronic Textbook and PubMed Central Indexing. Processing of the electronic textbook material currently involves a number of steps designed to extract the most meaningful phrases in the text for use as reference points. The first task is to identify grammatically reasonable phrases. We use a version of the Brill transformation-based tagger, rewritten in C++, for part-of-speech tagging; its output forms the basis for determining grammatically reasonable phrases. A significant post-processing step then removes phrases that involve inappropriate references to context (e.g., "different cells", "final mutation"). After finding grammatically reasonable phrases, we attempt to eliminate those that are too common or generic to be useful (e.g., "significant result", "short time"). The next step is to compare each phrase with previously rated phrases that have been collected over the life of the project. The final stage is to estimate the importance of a phrase in the textbook passage where it is found. This estimate is based on the frequency of the phrase in the passage and the size of the passage, compared with the frequency of the phrase throughout the book and the overall size of the book. To improve the estimate, we attempt to take into account not only the phrase itself but any phrase that represents the same concept. For this purpose we use the UMLS Metathesaurus together with stemming, and combine the two approaches into a consistent picture of the concept as it occurs in the text. The result of this processing is a scored list of (phrase, book section) pairs for each textbook. These pairs guide the response to general searches in the books: when a user types in a phrase that is on our curated list, the first results returned are the highly rated book sections for that phrase. We are now applying a similar indexing scheme to the text of articles in PubMed Central. This allows us to provide a list of highly rated phrases for each article as an enhanced reference point for searchers.
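The abstract does not give the formula used for the importance estimate, only its inputs: the phrase's frequency in the passage, the passage size, the phrase's frequency in the whole book, and the book size. The following is a minimal sketch of one plausible reading, assuming the score is the ratio of the phrase's rate in the passage to its rate in the book. The function names `count_phrase` and `passage_importance` and the ratio itself are illustrative assumptions, not the project's actual method.

```python
def count_phrase(tokens, phrase):
    """Count occurrences of a multi-word phrase in a token list."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def passage_importance(phrase_in_passage, passage_size, phrase_in_book, book_size):
    """Assumed score: (phrase rate in passage) / (phrase rate in book).

    Values above 1 mean the phrase is more concentrated in this passage
    than in the book as a whole. Sizes are token counts.
    """
    if passage_size <= 0 or book_size <= 0:
        raise ValueError("passage_size and book_size must be positive")
    if phrase_in_book == 0:
        return 0.0  # phrase never occurs in the book, so there is nothing to score
    # Algebraically equal to (phrase_in_passage / passage_size) / (phrase_in_book / book_size)
    return (phrase_in_passage * book_size) / (passage_size * phrase_in_book)


# Hypothetical numbers: 4 occurrences in a 200-token passage,
# 10 occurrences in a 10,000-token book.
score = passage_importance(4, 200, 10, 10000)
print(score)  # 20.0
```

Under this assumption, merging phrases that represent the same concept (via the UMLS Metathesaurus and stemming, as described above) would amount to summing their counts before calling `passage_importance`.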
2) A significant fraction of PubMed queries are multiterm queries, and PubMed generally handles them as a Boolean conjunction of the terms. However, analysis of PubMed queries indicates that many such queries are meaningful phrases rather than simply collections of terms. We have examined whether it makes a difference to retrieval quality if such queries are interpreted as a phrase or as a conjunction of query terms and, if it does, what the optimal way of searching with such queries is. To address the question, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that the classes of records that contain all the search terms, but not the phrase, differ qualitatively from the class of records containing the phrase. We also show that the difference is systematic, depending on the proximity of the query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching. The important insight for indexing is that in some cases where the words of a phrase occur in a text, but not as the phrase, the phrase may still be an appropriate concept to use in indexing that text.
3) We are currently studying how good phrases can be recognized by their characteristics, such as frequency, tendency to be repeated in the documents where they occur, and other numerical properties. These features allow one to predict which phrases are of high quality. We have found such predictions useful in studying the different kinds of terms that may appear in text and how an ontology might be extracted from text.
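The phrase characteristics named in part 3 (frequency and the tendency to be repeated in documents where the phrase occurs) can be computed from simple counts. The sketch below is an illustration only: the feature names and the definition of repetition as mean occurrences per containing document are assumptions, and the abstract does not say how the project defines these features or which model consumes them.

```python
def count_phrase(tokens, phrase):
    """Count occurrences of a multi-word phrase in a token list."""
    n = len(phrase)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == phrase)


def phrase_features(docs, phrase):
    """Compute illustrative numerical features of a phrase over a corpus.

    docs is a list of token lists; phrase is a token list.
    """
    per_doc = [count_phrase(doc, phrase) for doc in docs]
    total = sum(per_doc)                               # overall frequency
    doc_freq = sum(1 for c in per_doc if c > 0)        # documents containing the phrase
    mean_repeats = total / doc_freq if doc_freq else 0.0  # assumed repetition measure
    return {
        "total_frequency": total,
        "document_frequency": doc_freq,
        "mean_repeats": mean_repeats,
    }


# Hypothetical toy corpus.
docs = [
    ["gene", "expression", "of", "gene", "expression"],
    ["gene", "expression", "data"],
    ["protein"],
]
print(phrase_features(docs, ["gene", "expression"]))
```

A feature dictionary like this could serve as input to any standard classifier for predicting phrase quality; the choice of classifier is left open here because the source does not specify one.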