
Query Log Analysis for Improving User Access to NCBI Web Services

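Basic Information

Grant number:
9564626
Principal investigator:
Zhiyong Lu
Amount:
$1.6063 million
Host institution:
NATIONAL LIBRARY OF MEDICINE
Host institution country:
United States
Project category:
Fiscal year:
Funding country:
United States
Start and end dates:
to
Project status:
Not yet concluded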

Project Abstract

Over the last decade, online search for biological information has progressed rapidly and has become an integral part of the scientific discovery process. Today, it is virtually impossible to conduct R&D in biomedicine without relying on the kind of Web resources developed and maintained by the NCBI. Indeed, each day millions of users search for biological information via NCBI's online Entrez system. However, finding data relevant to a user's information need is not always easy in Entrez. Improving our understanding of the growing population of Entrez users, their information needs, and the ways in which they meet these needs opens opportunities to improve the information services and information access provided by NCBI. Among all Entrez databases, PubMed is the most used and often serves as an entry point for people to access related data in other databases.

One resource for understanding and characterizing users of the PubMed search engine is its transaction logs. Our previous investigation of PubMed search logs led us to develop and deploy several useful applications that assist user searches and retrieval, such as query formulation in PubMed: namely, Related Queries, Query Autocomplete, and Author Name Disambiguation. Inspired by past success, we have continued using log analysis to improve access to NCBI resources. For example, we have used user clicks to identify articles that users considered relevant to their own queries. In 2016-2017, we used deep learning models to understand the relationship between the query and the content of potentially relevant articles. This approach is robust and outperforms both traditional IR algorithms and related shallow and deep models based on continuous representations of text, with better results on the under-specified query and term mismatch problems.

Of course, multiple factors indicate whether an article is relevant to the searcher. These include the connection between the query and the content, how recent the article is, whether other people found the article relevant, etc. PubMed's new Best Match sort order (using a Learning to Rank algorithm) combines a number of different scores and sources of information to identify the most relevant articles. This has significantly improved our relevance rankings since Spring 2017.

We are continuing the effort begun with our work on TermVariants. When a term is used in a query, documents using equivalent terms are usually also desired. A seemingly trivial example is singular and plural terms. But care must be taken to avoid retrieving irrelevant articles. For example, naively applying plural rules to abbreviations is often not helpful. Guidelines are being developed to show where these expansions will be helpful.

To better understand queries, we developed a Field Sensor to completely identify the portions and aims of a query. In other words, we identify which part of the query is an author name, a journal title, a date, or key phrases describing knowledge the searcher would like to uncover. One practical use for this tool is reminding those looking for information, rather than specific articles, about our improved relevance searching.

We continue to improve our handling and understanding of author names in PubMed articles. Principal Investigators on NIH-funded grants make up a particularly important subset of PubMed authors. Additional information about these authors is available from their grants. Information in grants about published papers allows us to do a better job of connecting papers and authors. These authors can be more reliably identified across different institutional affiliations and changes in research focus, and even different names for the same author can be connected.
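
The abstract's caution about singular and plural term variants can be illustrated with a small sketch. This is not the TermVariants implementation, which the abstract does not describe. The function names, the abbreviation heuristic (any uppercase letter after the first character), and the suffix rules are all assumptions made for illustration, and the rules are deliberately simplistic and will mishandle many irregular words.

```python
def looks_like_abbreviation(term: str) -> bool:
    """Assumed heuristic: an uppercase letter after the first character
    (e.g. 'HIV', 'mRNA') suggests an abbreviation."""
    return any(ch.isupper() for ch in term[1:])


def plural_variants(term: str) -> set[str]:
    """Return the term plus one naive singular/plural counterpart.

    Abbreviations are returned unchanged, since stripping or adding
    a suffix there tends to produce unrelated strings.
    """
    variants = {term}
    if looks_like_abbreviation(term):
        return variants

    lower = term.lower()
    if lower.endswith("ies") and len(lower) > 4:
        variants.add(term[:-3] + "y")        # therapies -> therapy
    elif lower.endswith(("ss", "us", "is")):
        pass                                 # likely irregular; left alone
    elif lower.endswith("s"):
        variants.add(term[:-1])              # tumors -> tumor
    elif len(lower) > 2 and lower.endswith("y") and lower[-2] not in "aeiou":
        variants.add(term[:-1] + "ies")      # therapy -> therapies
    elif lower.endswith(("x", "ch", "sh")):
        variants.add(term + "es")            # reflex -> reflexes
    else:
        variants.add(term + "s")             # tumor -> tumors
    return variants


def expand_query(query: str) -> list[list[str]]:
    """Expand each whitespace-separated token into a sorted list of variants."""
    return [sorted(plural_variants(token)) for token in query.split()]


if __name__ == "__main__":
    print(expand_query("ALS therapy"))
```

Without the abbreviation check, the query term "ALS" would be treated as a plural and reduced to "AL", which is the kind of unhelpful expansion the abstract warns about. A production system would presumably rely on curated guidelines rather than suffix rules like these.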
