III: Medium: Selective Search of Large-Scale Text Collections

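Basic information

  • Award number:
    1302206
  • Principal investigator:
  • Amount:
    $1.0834 million
  • Host institution:
  • Host institution country:
    United States
  • Project category:
    Standard Grant
  • Fiscal year:
    2013
  • Funding country:
    United States
  • Period:
    2013-09-15 to 2018-08-31
  • Project status:
    Completed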

Project abstract

This project develops an alternative architecture for large-scale text search in which the document corpus is decomposed into index shards that are expected to have skewed utility distributions, thus enabling most index partitions to be ignored for most queries. This selective search architecture is as effective as conventional search engine architectures, but has far lower computational costs and reveals new challenges and opportunities in large-scale search. The decomposition process creates text collections, thus inviting research on what characteristics are desired or to be avoided in a text collection to enable accurate search. New resource selection algorithms are developed to address efficiency problems in existing algorithms and dynamically adjust search costs based on query difficulty. The project includes collaboration with three research groups at other universities, to help their research, leverage their expertise in designing new approaches to problems, and investigate the effectiveness of our research in more varied situations. The result is an "off-the-shelf" method that provides an order of magnitude reduction in search costs over the current state-of-the-art, especially on corpora of more than a billion documents, and that can be easily customized or extended to support varied needs.
Selective search is significant in part because it provides a new perspective on how to organize a very large collection of documents so that it can be searched accurately and efficiently. This new understanding reveals new research problems and undiscovered weaknesses in existing algorithms that will have impact within the scientific community. Text search is one of the most widely used computer science technologies; hence selective search is of practical significance. The state-of-the-art in many areas of industry and science is increasingly associated with large-scale datasets, which makes it difficult for organizations with modest computational resources to compete. This project reduces the computational costs of searching large-scale text collections by an order of magnitude or more. It has the potential to reduce the energy and other costs associated with the data centers of large search providers, which has important economic and societal benefits. Research results from this project are disseminated via the project web site (http://www.cs.cmu.edu/~callan/Projects/IIS-1302206/); in research publications; in the Lemur Project's open-source search engines, which are used by a broad international scientific community; and in the Lemur Project's ClueWeb public search services, which integrate research and education by enabling scientists and classroom students to do experiments on large, state-of-the-art text corpora.
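The selective search idea described in the abstract (partition the corpus into shards, pick the few shards likely to matter for a query, and search only those) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and is not taken from the project: the names `Shard`, `select_shards`, and `selective_search` are hypothetical, the shard assignment is done by hand, and both the shard-ranking score and the document ranker are toy term-count heuristics rather than the project's resource selection algorithms.

```python
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberate simplification)."""
    return text.lower().split()


class Shard:
    """One index partition: a small inverted index plus shard-level term counts."""

    def __init__(self, name):
        self.name = name
        self.docs = {}                    # doc_id -> list of tokens
        self.postings = defaultdict(set)  # term -> ids of documents containing it
        self.term_counts = Counter()      # term -> occurrences across the shard

    def add(self, doc_id, text):
        tokens = tokenize(text)
        self.docs[doc_id] = tokens
        for term in tokens:
            self.postings[term].add(doc_id)
            self.term_counts[term] += 1

    def search(self, query):
        """Toy ranker: score = number of query-term occurrences in the document."""
        terms = tokenize(query)
        candidates = set().union(*(self.postings.get(t, set()) for t in terms))
        return [(d, sum(self.docs[d].count(t) for t in terms)) for d in candidates]


def select_shards(shards, query, k):
    """Toy resource selection: rank shards by relative frequency of query terms."""
    terms = tokenize(query)
    scored = []
    for shard in shards:
        total = sum(shard.term_counts.values())
        hits = sum(shard.term_counts[t] for t in terms)
        if total and hits:  # shards with no matching terms are never searched
            scored.append((hits / total, shard))
    scored.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [shard for _, shard in scored[:k]]


def selective_search(shards, query, k=1, top_n=10):
    """Search only the k highest-ranked shards and merge their results."""
    selected = select_shards(shards, query, k)
    merged = [hit for shard in selected for hit in shard.search(query)]
    merged.sort(key=lambda hit: (-hit[1], hit[0]))
    return merged[:top_n], [shard.name for shard in selected]


# Hypothetical hand-partitioned corpus: each key is one topical shard.
corpus = {
    "sports": {"d1": "football match score goal", "d2": "tennis match final"},
    "cooking": {"d3": "pasta recipe tomato sauce", "d4": "bread recipe flour"},
    "finance": {"d5": "stock market score index"},
}
shards = []
for name, docs in corpus.items():
    shard = Shard(name)
    for doc_id, text in docs.items():
        shard.add(doc_id, text)
    shards.append(shard)

results, searched = selective_search(shards, "pasta recipe", k=1)
print(searched, results)
```

In this toy setup the query "pasta recipe" touches only the "cooking" shard, so two of the three partitions are never searched. The parameter `k` is the knob that trades search cost against the risk of missing relevant documents in unselected shards.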

Project outcomes

Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)

Other publications by Jamie Callan

Pruning long documents for distributed information retrieval
  • DOI:
    10.1145/584792.584847
  • Publication date:
    2002
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jie Lu;Jamie Callan
  • Corresponding author:
    Jamie Callan
Language processing technologies for electronic rulemaking: a project highlight
  • DOI:
  • Publication date:
    2005
  • Journal:
  • Impact factor:
    0
  • Authors:
    Stuart W. Shulman;E. Hovy;Jamie Callan;S. Zavestoski
  • Corresponding author:
    S. Zavestoski
Passage-retrieval evidence in document retrieval
  • DOI:
  • Publication date:
    1994
  • Journal:
  • Impact factor:
    0
  • Authors:
    Jamie Callan
  • Corresponding author:
    Jamie Callan
Metric-based ontology learning
  • DOI:
    10.1145/1458484.1458486
  • Publication date:
    2008
  • Journal:
  • Impact factor:
    3
  • Authors:
    G. Yang;Jamie Callan
  • Corresponding author:
    Jamie Callan
An effective and efficient results merging strategy for multilingual information retrieval in federated search environments
  • DOI:
    10.1007/s10791-007-9036-6
  • Publication date:
    2007-11
  • Journal:
  • Impact factor:
    2.8
  • Authors:
    Jamie Callan;Luo Si
  • Corresponding author:
    Luo Si

Other grants held by Jamie Callan

III: Small: Reliable and Generalizable Neural Search Engine Architectures
  • Award number:
    1815528
  • Fiscal year:
    2018
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
CRI: CI-SUSTAIN: Collaborative Research: Sustaining Lemur Project Resources for the Long-Term
  • Award number:
    1822975
  • Fiscal year:
    2018
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
III: Small: Using Knowledge Resources to Improve Information Retrieval
  • Award number:
    1422676
  • Fiscal year:
    2014
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
CI-EN-Collaborative Research: Supporting Research and Teaching for Next-Generation Search Engines in Lemur
  • Award number:
    1405045
  • Fiscal year:
    2014
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
III: Medium: Collaborative Research: Connecting the Ephemeral and Archival Information Networks
  • Award number:
    1160862
  • Fiscal year:
    2012
  • Funding amount:
    $1.0834 million
  • Project category:
    Continuing Grant
CI-ADDO-EN: Collaborative Proposal: Supporting Web-Scale Experimentation Using the Lemur Toolkit
  • Award number:
    0934358
  • Fiscal year:
    2010
  • Funding amount:
    $1.0834 million
  • Project category:
    Continuing Grant
III: Small: Modeling and Predicting Term Mismatch for Full-Text Retrieval
  • Award number:
    1018317
  • Fiscal year:
    2010
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
DC: Small: An Integrated Architecture for Federated Search
  • Award number:
    0916553
  • Fiscal year:
    2009
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
Preservation and Access for ClueWeb09 Image Data
  • Award number:
    0948856
  • Fiscal year:
    2009
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
SGER: Multi-Tier Indexing for Web Search Engines
  • Award number:
    0841275
  • Fiscal year:
    2008
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant

Similar overseas grants

RII Track-4:@NASA: Bluer and Hotter: From Ultraviolet to X-ray Diagnostics of the Circumgalactic Medium
  • Award number:
    2327438
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
Collaborative Research: Topological Defects and Dynamic Motion of Symmetry-breaking Tadpole Particles in Liquid Crystal Medium
  • Award number:
    2344489
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
Collaborative Research: AF: Medium: The Communication Cost of Distributed Computation
  • Award number:
    2402836
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Continuing Grant
Collaborative Research: AF: Medium: Foundations of Oblivious Reconfigurable Networks
  • Award number:
    2402851
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Continuing Grant
Collaborative Research: CIF: Medium: Snapshot Computational Imaging with Metaoptics
  • Award number:
    2403122
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Award number:
    2403134
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
Collaborative Research: CyberTraining: Implementation: Medium: Training Users, Developers, and Instructors at the Chemistry/Physics/Materials Science Interface
  • Award number:
    2321102
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
Collaborative Research: CyberTraining: Implementation: Medium: Transforming the Molecular Science Research Workforce through Integration of Programming in University Curricula
  • Award number:
    2321045
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
Collaborative Research: CyberTraining: Implementation: Medium: Training Users, Developers, and Instructors at the Chemistry/Physics/Materials Science Interface
  • Award number:
    2321103
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant
Collaborative Research: CPS: Medium: Automating Complex Therapeutic Loops with Conflicts in Medical Cyber-Physical Systems
  • Award number:
    2322534
  • Fiscal year:
    2024
  • Funding amount:
    $1.0834 million
  • Project category:
    Standard Grant