III: Small: Reliable and Generalizable Neural Search Engine Architectures
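Basic information
- Award number: 1815528
- Principal investigator:
- Award amount: $499,700
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2018
- Funding country: United States
- Period: 2018-09-01 to 2023-08-31
- Status: Completed
- Source:
- Keywords:
Project abstract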
Scientists need to frequently search the scientific literature on the subject they are studying. Despite the availability of papers and citation databases on the Web, the enormous growth of scientific publications in all disciplines makes this a daunting task. Traditional commercial search engines, such as Google, often fail to include the most important documents in the first few pages of returned results - in other words, they do not do a good enough job of ranking scientific papers for a given query. Recently, new algorithms for search based on artificial neural network techniques have emerged as an alternative to traditional search architectures. These new neural search architectures are more accurate, but must first be trained with millions of example queries and answers from user interactions; this limits their usefulness for many tasks. This project will overcome this problem by developing new methods of training neural search engines that reduce the need for training examples by integrating explicit knowledge resources for a given discipline. The new techniques will be disseminated in freely available open-source search software for both university and industry researchers, thus broadly benefiting scientific advancement. In addition, the project will broaden participation by under-represented groups by creating research opportunities for female and undergraduate students and technology transfer opportunities for industry.
This research develops new methods of training neural ranking architectures when a massive amount of training data is not available for the target application; integrates external knowledge resources to provide more information for making accurate ranking decisions; and applies the architecture to a domain-specific search task such as retrieving tabular data from scientific documents.
This collection of problems is chosen to increase the practicality of neural ranking architectures outside of high-traffic commercial search environments, and to investigate and exploit the strengths of neural ranking architectures at using attention mechanisms to manage evidence, soft-matching across different types of evidence, and learning sophisticated nonlinear decision models. This research furthers the development of neural ranking architectures that are generally applicable and more reliable than current systems due to their ability to integrate a broader range of evidence in a predictable manner. Neural ranking architectures have generated much excitement and skepticism during the last several years. This research extends a recently-developed neural ranking system that is already able to beat strong learning-to-rank systems under specific conditions. It addresses one of the main obstacles to wider use of these models -- the availability of large amounts of training data. It integrates information from external semi-structured knowledge resources, because such information is effective in other ranking architectures and because it is likely to benefit from how neural ranking architectures manage and use diverse evidence of varying quality. Finally, it stress tests the architecture by applying it to a domain-specific task such as table retrieval from scientific documents, that requires the search engine to use several parts of the document selectively, rather than the entire document. These activities are designed to produce a neural ranking architecture capable of managing diverse evidence and document structure so as to provide greater knowledge about the particular strengths and weaknesses of neural ranking architectures. 
The project website (http://www.cs.cmu.edu/~callan/Projects/IIS-1815528/) describes recent activities and provides access to research publications, experimental results, datasets, and open-source software produced by the project.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles (25)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Rethink Training of BERT Rerankers in Multi-Stage Retrieval Pipeline
- DOI: 10.1007/978-3-030-72240-1_26
- Published: 2021-01
- Journal:
- Impact factor: 0
- Authors: Luyu Gao; Zhuyun Dai; Jamie Callan
- Corresponding authors: Luyu Gao; Zhuyun Dai; Jamie Callan
Summarizing and Exploring Tabular Data in Conversational Search
- DOI: 10.1145/3397271.3401205
- Published: 2020-05
- Journal:
- Impact factor: 0
- Authors: Shuo Zhang; Zhuyun Dai; K. Balog; Jamie Callan
- Corresponding authors: Shuo Zhang; Zhuyun Dai; K. Balog; Jamie Callan
Precise Zero-Shot Dense Retrieval without Relevance Labels
- DOI: 10.18653/v1/2023.acl-long.99
- Published: 2023
- Journal:
- Impact factor: 0
- Authors: Gao, Luyu; Ma, Xueguang; Lin, Jimmy; Callan, Jamie
- Corresponding author: Callan, Jamie
Tevatron: An Efficient and Flexible Toolkit for Neural Retrieval
- DOI: 10.1145/3539618.3591805
- Published: 2023-07
- Journal:
- Impact factor: 0
- Authors: Luyu Gao
- Corresponding author: Luyu Gao
Efficiency Implications of Term Weighting for Passage Retrieval
- DOI: 10.1145/3397271.3401263
- Published: 2020-07
- Journal:
- Impact factor: 0
- Authors: J. Mackenzie; Zhuyun Dai; L. Gallagher; Jamie Callan
- Corresponding authors: J. Mackenzie; Zhuyun Dai; L. Gallagher; Jamie Callan
Other publications by Jamie Callan
Pruning long documents for distributed information retrieval
- DOI: 10.1145/584792.584847
- Published: 2002
- Journal:
- Impact factor: 0
- Authors: Jie Lu; Jamie Callan
- Corresponding author: Jamie Callan
Language processing technologies for electronic rulemaking: a project highlight
- DOI:
- Published: 2005
- Journal:
- Impact factor: 0
- Authors: Stuart W. Shulman; E. Hovy; Jamie Callan; S. Zavestoski
- Corresponding author: S. Zavestoski
Passage-retrieval evidence in document retrieval
- DOI:
- Published: 1994
- Journal:
- Impact factor: 0
- Authors: Jamie Callan
- Corresponding author: Jamie Callan
Metric-based ontology learning
- DOI: 10.1145/1458484.1458486
- Published: 2008
- Journal:
- Impact factor: 3
- Authors: G. Yang; Jamie Callan
- Corresponding author: Jamie Callan
An effective and efficient results merging strategy for multilingual information retrieval in federated search environments
- DOI: 10.1007/s10791-007-9036-6
- Published: 2007-11
- Journal:
- Impact factor: 2.8
- Authors: Jamie Callan; Luo Si
- Corresponding author: Luo Si
Other grants held by Jamie Callan
CRI: CI-SUSTAIN: Collaborative Research: Sustaining Lemur Project Resources for the Long-Term
- Award number: 1822975
- Fiscal year: 2018
- Award amount: $499,700
- Award type: Standard Grant
III: Small: Using Knowledge Resources to Improve Information Retrieval
- Award number: 1422676
- Fiscal year: 2014
- Award amount: $499,700
- Award type: Standard Grant
CI-EN-Collaborative Research: Supporting Research and Teaching for Next-Generation Search Engines in Lemur
- Award number: 1405045
- Fiscal year: 2014
- Award amount: $499,700
- Award type: Standard Grant
III: Medium: Selective Search of Large-Scale Text Collections
- Award number: 1302206
- Fiscal year: 2013
- Award amount: $499,700
- Award type: Standard Grant
III: Medium: Collaborative Research: Connecting the Ephemeral and Archival Information Networks
- Award number: 1160862
- Fiscal year: 2012
- Award amount: $499,700
- Award type: Continuing Grant
CI-ADDO-EN: Collaborative Proposal: Supporting Web-Scale Experimentation Using the Lemur Toolkit
- Award number: 0934358
- Fiscal year: 2010
- Award amount: $499,700
- Award type: Continuing Grant
III: Small: Modeling and Predicting Term Mismatch for Full-Text Retrieval
- Award number: 1018317
- Fiscal year: 2010
- Award amount: $499,700
- Award type: Standard Grant
DC: Small: An Integrated Architecture for Federated Search
- Award number: 0916553
- Fiscal year: 2009
- Award amount: $499,700
- Award type: Standard Grant
Preservation and Access for ClueWeb09 Image Data
- Award number: 0948856
- Fiscal year: 2009
- Award amount: $499,700
- Award type: Standard Grant
SGER: Multi-Tier Indexing for Web Search Engines
- Award number: 0841275
- Fiscal year: 2008
- Award amount: $499,700
- Award type: Standard Grant
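Similar grants from the National Natural Science Foundation of China
Forensic application of circadian-rhythmic small RNAs in estimating the time of bloodstain formation
- Award number:
- Year approved: 2024
- Award amount: CNY 0
- Award type: Provincial or municipal project
Mechanism by which tRNA-derived small RNA upregulates the YBX1/CCL5 pathway in bortezomib-induced chronic pain
- Award number: n/a
- Year approved: 2022
- Award amount: CNY 100,000
- Award type: Provincial or municipal project
Small RNA regulation of the type I-F CRISPR-Cas adaptive immune response and its molecular mechanism
- Award number: 32000033
- Year approved: 2020
- Award amount: CNY 240,000
- Award type: Young Scientists Fund
Mechanisms by which small RNAs regulate the biocontrol function of Bacillus amyloliquefaciens FZB42
- Award number: 31972324
- Year approved: 2019
- Award amount: CNY 580,000
- Award type: General Program
Mechanism by which Streptococcus mutans small RNAs link LuxS quorum sensing to biofilm formation
- Award number: 81900988
- Year approved: 2019
- Award amount: CNY 210,000
- Award type: Young Scientists Fund
Molecular mechanism of pigeon milk secretion analyzed by small RNA sequencing
- Award number: 31802058
- Year approved: 2018
- Award amount: CNY 260,000
- Award type: Young Scientists Fund
Function and mechanism of key gut bacterial small RNAs in the onset and progression of Crohn's disease
- Award number: 31870821
- Year approved: 2018
- Award amount: CNY 560,000
- Award type: General Program
Pathogenic mechanism of rice grassy stunt virus regulated by small RNA-mediated DNA methylation
- Award number: 31772128
- Year approved: 2017
- Award amount: CNY 600,000
- Award type: General Program
Immune regulatory mechanism of acupuncture treatment for Hashimoto's thyroiditis based on small RNA-seq
- Award number: 81704176
- Year approved: 2017
- Award amount: CNY 200,000
- Award type: Young Scientists Fund
Regulation of small RNA biogenesis by rice OsSGS3 and OsHEN1 and its effect on disease resistance
- Award number: 91640114
- Year approved: 2016
- Award amount: CNY 850,000
- Award type: Major Research Plan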
Similar overseas grants
NSF-BSF: CNS Core: Small: Reliable and Zero-Power Timekeepers for Intermittently Powered Computing Devices via Stochastic Magnetic Tunnel Junctions
- Award number: 2400463
- Fiscal year: 2023
- Award amount: $499,700
- Award type: Standard Grant
Peer-to-Peer Energy Trading over Reliable Small Cell Networks
- Award number: RGPIN-2017-03995
- Fiscal year: 2022
- Award amount: $499,700
- Award type: Discovery Grants Program - Individual
Peer-to-Peer Energy Trading over Reliable Small Cell Networks
- Award number: RGPIN-2017-03995
- Fiscal year: 2021
- Award amount: $499,700
- Award type: Discovery Grants Program - Individual
Collaborative Research: NeTS: Small: Reliable Task Offloading in Mobile Autonomous Systems Through Semantic MU-MIMO Control
- Award number: 2134973
- Fiscal year: 2021
- Award amount: $499,700
- Award type: Standard Grant
Collaborative Research: NeTS: Small: Reliable Task Offloading in Mobile Autonomous Systems Through Semantic MU-MIMO Control
- Award number: 2134567
- Fiscal year: 2021
- Award amount: $499,700
- Award type: Standard Grant
NSF-BSF: CNS Core: Small: Reliable and Zero-Power Timekeepers for Intermittently Powered Computing Devices via Stochastic Magnetic Tunnel Junctions
- Award number: 2106562
- Fiscal year: 2021
- Award amount: $499,700
- Award type: Standard Grant
SHF: Small: Reliable Storage and Computation in Memory Technologies
- Award number: 2113914
- Fiscal year: 2021
- Award amount: $499,700
- Award type: Standard Grant
RI: Small: Reliable Machine Learning in Hyperbolic Spaces
- Award number: 2008102
- Fiscal year: 2020
- Award amount: $499,700
- Award type: Standard Grant
Peer-to-Peer Energy Trading over Reliable Small Cell Networks
- Award number: RGPIN-2017-03995
- Fiscal year: 2020
- Award amount: $499,700
- Award type: Discovery Grants Program - Individual
Peer-to-Peer Energy Trading over Reliable Small Cell Networks
- Award number: RGPIN-2017-03995
- Fiscal year: 2019
- Award amount: $499,700
- Award type: Discovery Grants Program - Individual