
III: Small: Reliable and Generalizable Neural Search Engine Architectures

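Basic information

Award number:
1815528
Principal investigator:
Jamie Callan
Amount:
$499,700 (approx.)
Institution:
Carnegie-Mellon University
Institution country:
United States
Award type:
Standard Grant
Fiscal year:
2018
Funding country:
United States
Period:
2018-09-01 to 2023-08-31
Status:
Completed

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1815528&HistoricalAwards=false
Keywords: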
III Small Reliable Generalizable Neural

Project abstract

Scientists frequently need to search the scientific literature on the subject they are studying. Despite the availability of papers and citation databases on the Web, the enormous growth of scientific publications in all disciplines makes this a daunting task. Traditional commercial search engines, such as Google, often fail to include the most important documents in the first few pages of returned results; in other words, they do not do a good enough job of ranking scientific papers for a given query. Recently, new algorithms for search based on artificial neural network techniques have emerged as an alternative to traditional search architectures. These new neural search architectures are more accurate, but must first be trained with millions of example queries and answers from user interactions; this limits their usefulness for many tasks. This project will overcome this problem by developing new methods of training neural search engines that reduce the need for training examples by integrating explicit knowledge resources for a given discipline. The new techniques will be disseminated in freely available open-source search software for both university and industry researchers, thus broadly benefiting scientific advancement. In addition, the project will broaden participation by under-represented groups by creating research opportunities for female and undergraduate students and technology transfer opportunities for industry.
This research develops new methods of training neural ranking architectures when a massive amount of training data is not available for the target application; integrates external knowledge resources to provide more information for making accurate ranking decisions; and applies the architecture to a domain-specific search task such as retrieving tabular data from scientific documents. This collection of problems is chosen to increase the practicality of neural ranking architectures outside of high-traffic commercial search environments, and to investigate and exploit the strengths of neural ranking architectures at using attention mechanisms to manage evidence, soft-matching across different types of evidence, and learning sophisticated nonlinear decision models. This research furthers the development of neural ranking architectures that are generally applicable and more reliable than current systems due to their ability to integrate a broader range of evidence in a predictable manner. Neural ranking architectures have generated much excitement and skepticism during the last several years. This research extends a recently developed neural ranking system that is already able to beat strong learning-to-rank systems under specific conditions. It addresses one of the main obstacles to wider use of these models: the availability of large amounts of training data. It integrates information from external semi-structured knowledge resources, because such information is effective in other ranking architectures and because it is likely to benefit from how neural ranking architectures manage and use diverse evidence of varying quality. Finally, it stress-tests the architecture by applying it to a domain-specific task such as table retrieval from scientific documents, which requires the search engine to use several parts of the document selectively, rather than the entire document. These activities are designed to produce a neural ranking architecture capable of managing diverse evidence and document structure so as to provide greater knowledge about the particular strengths and weaknesses of neural ranking architectures.
The project website (http://www.cs.cmu.edu/~callan/Projects/IIS-1815528/) describes recent activities and provides access to research publications, experimental results, datasets, and open-source software produced by the project.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project outputs

Journal articles (25)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline

DOI:
10.1007/978-3-030-72240-1_26
Published:
2021-01
Journal:
Impact factor:
0
Authors:
Luyu Gao; Zhuyun Dai; Jamie Callan
Corresponding authors:
Luyu Gao; Zhuyun Dai; Jamie Callan

Summarizing and Exploring Tabular Data in Conversational Search

DOI:
10.1145/3397271.3401205
Published:
2020-05
Journal:
Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
Impact factor:
0
Authors:
Shuo Zhang; Zhuyun Dai; K. Balog; Jamie Callan
Corresponding authors:
Shuo Zhang; Zhuyun Dai; K. Balog; Jamie Callan

Precise Zero-Shot Dense Retrieval without Relevance Labels

DOI:
10.18653/v1/2023.acl-long.99
Published:
2023
Journal:
Association for Computational Linguistics
Impact factor:
0
Authors:
Gao, Luyu; Ma, Xueguang; Lin, Jimmy; Callan, Jamie
Corresponding author:
Callan, Jamie

Tevatron: An Efficient and Flexible Toolkit for Neural Retrieval

DOI:
10.1145/3539618.3591805
Published:
2023-07
Journal:
Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
Impact factor:
0
Authors:
Luyu Gao
Corresponding author:
Luyu Gao

Efficiency Implications of Term Weighting for Passage Retrieval

DOI:
10.1145/3397271.3401263
Published:
2020-07
Journal:
Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval
Impact factor:
0
Authors:
J. Mackenzie; Zhuyun Dai; L. Gallagher; Jamie Callan
Corresponding authors:
J. Mackenzie; Zhuyun Dai; L. Gallagher; Jamie Callan