III: Medium: Generative Neural Information Retrieval Models

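Basic Information

Award number:
1956221
Principal investigator:
Ellie Pavlick
Amount:
$999,800 (rounded)
Host institution:
Brown University
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2020
Funding country:
United States
Start and end dates:
2020-09-01 to 2024-08-31
Project status:
Completed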

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1956221&HistoricalAwards=false
Keywords:
III Medium Generative Neural Information

Project Abstract

Search engines have become modern society's main sources of information. They put vast amounts of knowledge about virtually any topic at our fingertips wherever we go. To do so, a search engine studies our search query and tries to form an understanding of what we were looking for in the first place when querying for "boston tea party reason", "IRS form 1040" or "pizza near me". This internal representation is then compared with billions of webpages, books, or news articles in an attempt to find the best possible information for the searcher's query. This project will support and innovate this process in two ways. First, it will develop a more accurate understanding of search queries using insights into the way that humans use language, rather than just comparing queries and documents word-by-word. Second, using these improved representations of query meaning, the researchers will develop a fundamentally different way of searching for information. Instead of comparing our query with every possible match, they let the search engine come up with an idealized response to the query and then try to find those webpages that are most similar to this optimal answer. The expected consequences will be better search results and faster computation for the machines running the search engines (which, in turn, can lead to reduced electricity demand and CO2 emissions). To ensure a lasting impact even after this project has concluded, the research team will actively reach out to researchers and engineers at search engine companies to facilitate widespread adoption of the technology developed in this project. To broaden participation in computer science beyond traditionally well-represented demographics (e.g., in terms of genders, ethnicities, or socio-economic backgrounds), the study team will host a range of technology literacy outreach events among student populations at the college and middle school level. During these events, the researchers, supported by undergraduates from diverse backgrounds, will inform the student participants about effective information search tools and strategies, the research goals of this project, and college-level computer science education in general.

Deep learning and representation learning have brought promising improvements to various Information Retrieval (IR) tasks. Existing neural IR models estimate a matching score between the information need, such as a query or question, and the documents, using semantic similarities between terms learned from a large set of relevance information. In contrast to classical IR models, where the estimation of matching scores is constrained to only those documents containing the query terms, neural IR models need to scan all documents, or instead re-rank the top-retrieved documents obtained from a classical IR model. In addition, since neural IR models are often based on purely distributional representations of term meaning, they lack a grounded understanding of language subtleties such as gradable terms. The objective of this project is to design generative information retrieval models enhanced by distributed representations of gradable terms. To accomplish this, the research team has set the following concrete objectives. (1) Generative IR models: Instead of computing matching scores for each query-document pair, a document generative model can effectively approximate a representation in the relevance sub-space for a given query, facilitating efficient fully-neural document retrieval. The investigators will explore generative models to approximate hierarchical representations of relevant documents, and use efficient nearest-neighbor algorithms to find and retrieve the most suitable organic documents in the collection. (2) Distributed representations of gradable terms: The often intangible meaning of gradable terms can be resolved by considering the global context of each term. The project will study a probabilistic formulation of gradable terms based on their hypothetical value ranges and frames of reference, estimated from the collection. (3) Incorporation of gradable term representations into generative retrieval systems: The integration of grounded representations of gradable terms in the generative retrieval model will provide better understanding and support of information needs. The project will study this effect on information needs with and without gradable terms. The expected artifacts produced by the project are peer-reviewed scientific publications, open-source implementations of the proposed models, pre-trained word and phrase embeddings, logged retrieval runs in trec_eval format, a manually annotated Subjective Entailment dataset, and a suite of middle school search literacy education materials. All of these will be shared on the project website under the Creative Commons CC0 License.

This project is jointly funded by the Information Integration & Informatics Program in the Division of Information & Intelligent Systems, and the Established Program to Stimulate Competitive Research (EPSCoR). This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
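
The retrieval idea in objective (1), producing an idealized response representation for a query and then looking up its nearest neighbors in the collection, can be illustrated with a toy sketch. This is not the project's model. The averaging function below is a hypothetical stand-in for a learned generative model, the three-dimensional vectors and document names are invented for illustration, and the exhaustive sort stands in for the efficient nearest-neighbor index that the abstract mentions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def generate_ideal_vector(query_terms, term_vectors):
    """Hypothetical stand-in for a generative model: average the known query term vectors."""
    dims = len(next(iter(term_vectors.values())))
    known = [term_vectors[t] for t in query_terms if t in term_vectors]
    if not known:
        return [0.0] * dims
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def retrieve(query_terms, term_vectors, doc_vectors, k=2):
    """Return the ids of the k documents closest to the generated 'ideal' vector."""
    ideal = generate_ideal_vector(query_terms, term_vectors)
    # Brute-force nearest-neighbor search; a real system would use an index instead.
    ranked = sorted(doc_vectors, key=lambda d: cosine(ideal, doc_vectors[d]), reverse=True)
    return ranked[:k]


# Invented toy data: one dimension each for "pizza", "near", and "tax".
TERM_VECTORS = {
    "pizza": [1.0, 0.0, 0.0],
    "near": [0.0, 1.0, 0.0],
    "tax": [0.0, 0.0, 1.0],
}
DOC_VECTORS = {
    "local_pizzeria": [0.9, 0.8, 0.0],
    "pizza_history": [1.0, 0.0, 0.1],
    "irs_form_1040": [0.0, 0.1, 1.0],
}

print(retrieve(["pizza", "near"], TERM_VECTORS, DOC_VECTORS))
```

The query is turned into a single target vector once, and documents are then ranked by proximity to that target, rather than running a scoring model on every query-document pair. Any savings in a real system would depend on the nearest-neighbor index used, which this sketch omits.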

Project Outcomes

Journal articles (11)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

CATS: Customizable Abstractive Topic-based Summarization

DOI:
10.1145/3464299
Publication date:
2021-10
Journal:
ACM Transactions on Information Systems (TOIS)
Impact factor:
0
Authors:
Seyed Ali Bahrainian;George Zerveas;F. Crestani;Carsten Eickhoff
Corresponding authors:
Seyed Ali Bahrainian;George Zerveas;F. Crestani;Carsten Eickhoff

IsoScore: Measuring the Uniformity of Embedding Space Utilization

DOI:
10.18653/v1/2022.findings-acl.262
Publication date:
2021-08
Journal:
Impact factor:
0
Authors:
W. Rudman;Nate Gillman;T. Rayne;Carsten Eickhoff
Corresponding authors:
W. Rudman;Nate Gillman;T. Rayne;Carsten Eickhoff

TripClick: The Log Files of a Large Health Web Search Engine

DOI:
10.1145/3404835.3463242
Publication date:
2021-03
Journal:
Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
Impact factor:
0
Authors:
Navid Rekabsaz;Oleg Lesota;M. Schedl;J. Brassey;Carsten Eickhoff
Corresponding authors:
Navid Rekabsaz;Oleg Lesota;M. Schedl;J. Brassey;Carsten Eickhoff

Self-Supervised Neural Topic Modeling