
Query Log Analysis for Improving User Access to NCBI Web Services


Basic Information

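Grant number:
8558091
Principal investigator:
Zhiyong Lu
Amount:
$260,300
Host institution:
NATIONAL LIBRARY OF MEDICINE
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Project period:
to
Project status:
Not yet concluded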

Project Abstract

Over the last decade, the online search for biological information has progressed rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. Improving our understanding of the growing population of Entrez users, their information needs, and the ways in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. One resource for understanding and characterizing the patrons of search engines is their transaction logs. Our previous investigation of PubMed query logs led us to develop and deploy several useful applications that assist users with search and retrieval, such as the PubMed query formulation features Related Queries and Query Autocomplete. Inspired by their success, we have continued using log analysis to identify research problems that are closely related to NCBI operations.

Among all Entrez databases, PubMed is the most used and often serves as an entry point for people to access related data in other Entrez databases. In 2011-2012, we studied the usage of PubMed articles with regard to their citations. The number of citations an article receives has long been an important measure of its quality and impact. Recently there has been increasing interest in the correlation between citations and the number of downloads, investigating whether the latter can act as a predictive indicator or an alternative measure for evaluation. Our experiments, based on the citation and query logs of PubMed, show that there is a strong correlation between the citation count and the number of full-text accesses for PubMed articles.
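The access-versus-citation comparison can be illustrated with a small sketch. The abstract does not say which correlation coefficient was used, so Pearson's r is an assumption here, as are the function names, the `(accesses, citations)` pair layout, and the toy numbers, none of which come from PubMed logs. The only detail taken from the abstract is the exclusion of articles with fewer than 2 citations.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def access_citation_correlation(articles, min_citations=2):
    """Correlate full-text accesses with citations.

    `articles` is a list of (accesses, citations) pairs (hypothetical layout).
    Articles with fewer than `min_citations` citations are skipped.
    """
    kept = [(a, c) for a, c in articles if c >= min_citations]
    accesses = [a for a, _ in kept]
    citations = [c for _, c in kept]
    return pearson(accesses, citations)


# Made-up toy data for illustration only.
toy_articles = [(120, 5), (300, 12), (80, 2), (40, 0), (500, 20), (60, 1)]
r = access_citation_correlation(toy_articles)
print(round(r, 3))
```

A rank-based coefficient such as Spearman's could be substituted for `pearson` without changing the filtering step, if the counts are heavily skewed.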
The highest correlation, 0.6, was obtained when 6-month total full-text accesses and 2-year total citations were counted and articles with fewer than 2 citations were excluded. As there is generally a lag between when an article is published and when it is cited in another article, we found that the best correlation occurs when citations are computed 3 months after publication. We also analyzed the public PLoS usage data and found that the correlation between their citations (from CrossRef) and total PDF downloads is 0.655, which is very similar to the result on our PubMed dataset.

Another query log study we conducted in 2011-2012 was the development of search filters using PubMed click-through data to enable topic-specific literature searches. Search filters have been developed and shown to provide better access to the immense and ever-growing body of publications in the biomedical domain. However, to date the number of filters remains quite limited because current filter development methods require significant human involvement. We therefore developed an automated method to build topic-specific filters on the basis of users' search logs from PubMed. Specifically, for a given topic, we first detect relevant user queries and use their corresponding clicks to construct a topic-relevant article set. Next, we use statistics to identify the terms that best represent the topic-relevant document set. Lastly, the selected representative terms are combined with Boolean operators and evaluated on benchmark datasets to derive the final filter with the best performance. We applied our method to develop filters for four different clinical topics: nephrology, diabetes, pregnancy and depression. For the nephrology filter, our method obtained performance comparable to the state of the art (sensitivity of 91.3%, specificity of 98.7%, precision of 94.6%, accuracy of 97.2%). Similarly, high-performing results (over 90% in all measures) were obtained for the other three search filters.
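The filter-building steps described above (relevant queries, clicked articles, representative terms, Boolean combination, benchmark evaluation) can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the project's actual method: keyword matching for query detection, a smoothed document-frequency ratio as the term statistic, and a plain OR combination are all stand-ins, and every function name and data value is hypothetical. The four reported measures are computed from their standard confusion-matrix definitions.

```python
def topic_article_ids(click_log, topic_keyword):
    """Collect articles clicked after queries that mention the topic keyword."""
    return {pmid for query, pmid in click_log if topic_keyword in query.lower()}


def representative_terms(articles, topic_ids, top_k=2):
    """Rank terms by how much more often they occur in topic articles than elsewhere."""
    topic_docs = [w for pmid, w in articles.items() if pmid in topic_ids]
    other_docs = [w for pmid, w in articles.items() if pmid not in topic_ids]

    def doc_freq(term, docs):
        # Add-one smoothing keeps the ratio defined when a term is absent.
        return (sum(term in d for d in docs) + 1) / (len(docs) + 1)

    vocab = set().union(*topic_docs) if topic_docs else set()
    ranked = sorted(
        vocab,
        key=lambda t: (-doc_freq(t, topic_docs) / doc_freq(t, other_docs), t),
    )
    return ranked[:top_k]


def boolean_or_filter(terms):
    """Return a predicate matching any article that contains at least one term."""
    return lambda words: any(t in words for t in terms)


def evaluate(predicate, benchmark):
    """Compute sensitivity, specificity, precision and accuracy on labeled articles."""
    tp = fp = tn = fn = 0
    for words, is_relevant in benchmark:
        hit = predicate(words)
        if hit and is_relevant:
            tp += 1
        elif hit:
            fp += 1
        elif is_relevant:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


# Made-up toy data: (query, clicked article id) pairs and article word sets.
click_log = [("diabetes treatment", 1), ("type 2 diabetes", 2), ("kidney stones", 3)]
articles = {
    1: {"insulin", "glucose", "trial"},
    2: {"insulin", "glycemic", "cohort"},
    3: {"renal", "calculi", "trial"},
    4: {"depression", "cohort"},
}
benchmark = [
    ({"insulin", "pump"}, True),
    ({"glucose", "monitor"}, True),
    ({"metformin"}, True),
    ({"renal", "trial"}, False),
    ({"depression", "cohort"}, False),
]

topic_ids = topic_article_ids(click_log, "diabetes")
terms = representative_terms(articles, topic_ids)
metrics = evaluate(boolean_or_filter(terms), benchmark)
print(terms, metrics)
```

A fuller version would try several Boolean combinations (AND, OR, NOT) of the candidate terms and keep the one that scores best on the benchmark, as the abstract describes; the sketch evaluates only a single OR filter.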
