III: Medium: Generative Neural Information Retrieval Models


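Basic Information

  • Award number:
    1956221
  • Principal investigator:
  • Amount:
    $999,800
  • Host institution:
  • Host institution country:
    United States
  • Project category:
    Continuing Grant
  • Fiscal year:
    2020
  • Funding country:
    United States
  • Project period:
    2020-09-01 to 2024-08-31
  • Project status:
    Completed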

Project Abstract

Search engines have become modern society's main sources of information. They put vast amounts of knowledge about virtually any topic at our fingertips wherever we go. To do so, a search engine studies our search query and tries to form an understanding of what we were looking for in the first place when querying for "boston tea party reason", "IRS form 1040" or "pizza near me". This internal representation is then compared with billions of webpages, books, or news articles in an attempt to find the best possible information for the searcher's query. This project will support and innovate this process in two ways. First, it will develop a more accurate understanding of search queries using insights into the way that humans use language, rather than just comparing queries and documents word-by-word. Second, using these improved representations of query meaning, the researchers will develop a fundamentally different way of searching for information. Instead of comparing the query with every possible match, they let the search engine come up with an idealized response to the query and then try to find those webpages that are most similar to this optimal answer. The expected consequences are better search results and faster computation for the machines running the search engines (which, in turn, can lead to reduced electricity demand and CO2 emissions). To ensure a lasting impact even after this project has concluded, the research team will actively reach out to researchers and engineers at search engine companies to facilitate widespread adoption of the technology developed in this project. To broaden participation in computer science beyond traditionally well-represented demographics (e.g., in terms of genders, ethnicities, or socio-economic backgrounds), the study team will host a range of technology literacy outreach events among student populations at the college and middle school level. During these events, the researchers, supported by undergraduates from diverse backgrounds, will inform the student participants about effective information search tools and strategies, the research goals of this project, and college-level computer science education in general.

Deep and representation learning have brought promising improvements to various Information Retrieval (IR) tasks. Existing neural IR models estimate a matching score between the information need (such as a query or question) and the documents, using semantic similarities between terms, learned from a large set of relevance information. In contrast to classical IR models, where the estimation of matching scores is constrained to only those documents containing the query terms, neural IR models need to trace over all documents, or instead re-rank the top-retrieved documents obtained from a classical IR model. In addition, since neural IR models are often based on purely distributional representations of term meaning, they lack a grounded understanding of language subtleties such as gradable terms. The objective of this project is to design generative information retrieval models enhanced by distributed representations of gradable terms. To accomplish this, the research team plans the following concrete objectives.

(1) Generative IR models: Instead of computing matching scores for each query-document pair, a document generative model can effectively approximate a representation in the relevance sub-space for a given query, facilitating efficient fully-neural document retrieval. The investigators will explore generative models to approximate hierarchical representations of relevant documents, and use efficient nearest-neighbor algorithms to find and retrieve the most suitable organic documents in the collection.

(2) Distributed representations of gradable terms: The often intangible meaning of gradable terms can be resolved by considering the global context of each term. The project will study a probabilistic formulation of gradable terms based on their hypothetical value ranges and frames of reference, estimated from the collection.

(3) Incorporation of gradable term representations into generative retrieval systems: The integration of grounded representations of gradable terms in the generative retrieval model will provide better understanding and support of information needs. The project will study this effect on information needs with and without gradable terms.

The expected artifacts produced by the project are peer-reviewed scientific publications, open-source implementations of the proposed models, pre-trained word and phrase embeddings, logged retrieval runs in trec_eval format, a manually annotated Subjective Entailment dataset, and a suite of middle school search literacy education materials. All of these will be shared on the project website under the Creative Commons CC0 License.

This project is jointly funded by the Information Integration & Informatics Program in the Division of Information & Intelligent Systems, and the Established Program to Stimulate Competitive Research (EPSCoR). This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
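The retrieve-by-generation idea in objective (1) can be illustrated with a minimal sketch. It is not the project's model: the linear map standing in for a learned generative model, the toy two-dimensional vectors, the brute-force cosine search standing in for an efficient nearest-neighbor index, and the names `generate_ideal_vector`, `retrieve`, and `to_trec_run` are all assumptions made for illustration. The only part taken from outside the abstract is the six-column run-line layout that trec_eval reads (query id, the literal `Q0`, document id, rank, score, run tag).

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def generate_ideal_vector(query_vec, weights):
    """Hypothetical stand-in for a generative model: a linear map from the
    query representation to an 'idealized relevant document' representation."""
    return [sum(w * x for w, x in zip(row, query_vec)) for row in weights]


def retrieve(query_vec, weights, doc_vecs, k=2):
    """Generate one idealized vector, then return the k documents nearest to it.
    Brute-force search is used here; a real system would use an approximate
    nearest-neighbor index instead."""
    ideal = generate_ideal_vector(query_vec, weights)
    scored = [(cosine(ideal, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best score first, ties by id
    return [(doc_id, score) for score, doc_id in scored[:k]]


def to_trec_run(query_id, ranked, tag="sketch"):
    """Format a ranking as trec_eval run lines: qid Q0 docno rank score tag."""
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.4f} {tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]


if __name__ == "__main__":
    # Toy data, invented for this example.
    identity = [[1.0, 0.0], [0.0, 1.0]]
    docs = {"d1": [0.9, 0.1], "d2": [0.0, 1.0], "d3": [0.5, 0.5]}
    for line in to_trec_run("q1", retrieve([1.0, 0.0], identity, docs, k=2)):
        print(line)
```

The point of the design is that the query is processed once to produce a single target vector, after which retrieval reduces to a nearest-neighbor lookup rather than a scoring pass over every query-document pair.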

Project Outcomes

Journal articles (11)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
CATS: Customizable Abstractive Topic-based Summarization
  • DOI:
    10.1145/3464299
  • Published:
    2021-10
  • Journal:
  • Impact factor:
    0
  • Authors:
    Seyed Ali Bahrainian;George Zerveas;F. Crestani;Carsten Eickhoff
  • Corresponding authors:
    Seyed Ali Bahrainian;George Zerveas;F. Crestani;Carsten Eickhoff
IsoScore: Measuring the Uniformity of Embedding Space Utilization
  • DOI:
    10.18653/v1/2022.findings-acl.262
  • Published:
    2021-08
  • Journal:
  • Impact factor:
    0
  • Authors:
    W. Rudman;Nate Gillman;T. Rayne;Carsten Eickhoff
  • Corresponding authors:
    W. Rudman;Nate Gillman;T. Rayne;Carsten Eickhoff
TripClick: The Log Files of a Large Health Web Search Engine
Self-Supervised Neural Topic Modeling
  • DOI:
    10.18653/v1/2021.findings-emnlp.284
  • Published:
    2021
  • Journal:
  • Impact factor:
    0
  • Authors:
    Seyed Ali Bahrainian;Martin Jaggi;Carsten Eickhoff
  • Corresponding authors:
    Seyed Ali Bahrainian;Martin Jaggi;Carsten Eickhoff
NEWTS: A Corpus for News Topic-Focused Summarization
  • DOI:
    10.48550/arxiv.2205.15661
  • Published:
    2022-05
  • Journal:
  • Impact factor:
    0
  • Authors:
    Seyed Ali Bahrainian;Sheridan Feucht;Carsten Eickhoff
  • Corresponding authors:
    Seyed Ali Bahrainian;Sheridan Feucht;Carsten Eickhoff

Other publications by Ellie Pavlick

Inducing Lexical Style Properties for Paraphrase and Genre Differentiation
  • DOI:
    10.3115/v1/n15-1023
  • Published:
    2015
  • Journal:
  • Impact factor:
    0
  • Authors:
    Ellie Pavlick;A. Nenkova
  • Corresponding author:
    A. Nenkova
Self-play for Data Efficient Language Acquisition
  • DOI:
  • Published:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Charles Lovering;Ellie Pavlick
  • Corresponding author:
    Ellie Pavlick
How well do NLI models capture verb veridicality?
Extracting Structured Information via Automatic + Human Computation
Compositionality as Directional Consistency in Sequential Neural Networks
  • DOI:
  • Published:
    2019
  • Journal:
  • Impact factor:
    0
  • Authors:
    Christopher Potts;Christopher D. Manning;Ellie Pavlick;Ian Tenney
  • Corresponding author:
    Ian Tenney


Similar Overseas Grants

RII Track-4:@NASA: Bluer and Hotter: From Ultraviolet to X-ray Diagnostics of the Circumgalactic Medium
  • Award number:
    2327438
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Standard Grant
Collaborative Research: Topological Defects and Dynamic Motion of Symmetry-breaking Tadpole Particles in Liquid Crystal Medium
  • Award number:
    2344489
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Standard Grant
Collaborative Research: AF: Medium: The Communication Cost of Distributed Computation
  • Award number:
    2402836
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Continuing Grant
Collaborative Research: AF: Medium: Foundations of Oblivious Reconfigurable Networks
  • Award number:
    2402851
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Continuing Grant
Collaborative Research: CIF: Medium: Snapshot Computational Imaging with Metaoptics
  • Award number:
    2403122
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Award number:
    2403134
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Standard Grant
Collaborative Research: CyberTraining: Implementation: Medium: Training Users, Developers, and Instructors at the Chemistry/Physics/Materials Science Interface
  • Award number:
    2321102
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Standard Grant
Collaborative Research: CyberTraining: Implementation: Medium: Transforming the Molecular Science Research Workforce through Integration of Programming in University Curricula
  • Award number:
    2321045
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Standard Grant
Collaborative Research: CyberTraining: Implementation: Medium: Training Users, Developers, and Instructors at the Chemistry/Physics/Materials Science Interface
  • Award number:
    2321103
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Standard Grant
Collaborative Research: CPS: Medium: Automating Complex Therapeutic Loops with Conflicts in Medical Cyber-Physical Systems
  • Award number:
    2322534
  • Fiscal year:
    2024
  • Amount:
    $999,800
  • Project category:
    Standard Grant