III: Medium: Dataset Search and Ranking for Data Augmentation and Explanation

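Basic Information

  • Award number:
    2106888
  • Principal investigator:
  • Amount:
    $1.0932 million
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2021
  • Funding country:
    United States
  • Period:
    2021-09-01 to 2025-08-31
  • Project status:
    Not yet concluded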

Project Abstract

There has been an explosion in the volume of data that is being collected and cataloged about the environment, society, and populace. Moreover, with the push towards transparency and open data, scientists, governments, and organizations are increasingly making these data available on the Web. Combined with advances in analytics and machine learning, such growing access to data should in theory allow for progress on many of the world’s most important scientific and societal questions. However, this opportunity is often missed due to a central technical barrier: it is currently nearly impossible for domain experts to weed through the vast amount of publicly available information to discover datasets that are needed for their specific application. Data repository platforms, such as CKAN and Dataverse, and dataset search engines, such as Google Dataset Search, aim to make it easy to share and find datasets. But these systems only support simple, keyword-based queries and metadata search, which are insufficient for users to properly specify their information needs. The investigators envision a new kind of dataset search engine that unlocks the untapped value in open data by supporting a richer set of findability queries that cater to the needs of analytics tasks, and aid in the construction and refinement of machine learning models. By empowering scientists and practitioners with the ability to discover relevant data, the project has great potential to stimulate data reuse both within and across domains.

The project will develop methods where the user’s existing data forms the basis of a query that retrieves additional, related data from a large collection of datasets and attributes. There are many technical hurdles to overcome to support such queries. One primary challenge is computational efficiency: this project will develop novel algorithms for rapidly computing and searching for dataset relationships. The investigators will build on a rich variety of tools, including randomized sketching and hashing algorithms, and contribute new theoretical analyses to understand these methods. The algorithms contributed will address both highly structured data (e.g., spatio-temporal) as well as generic numerical or categorical data. A second challenge is usability: the project will develop novel methods for assessing the significance of discovered data relationships, for pruning out coincidental or spurious relationships, and for ranking and presenting datasets to the end-user. Finally, the project will contribute a formalism for the dataset search problem that supports a wide range of findability queries based on dataset relationships. Active plans for engagement in STEM-related activities for high-school students are detailed.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
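
The abstract names randomized sketching and hashing as the tools for rapidly finding dataset relationships, but it does not spell out an algorithm. As a rough illustration only, and not the investigators' method, the Python sketch below shows one common idea in this family: each table keeps only the k rows whose join keys hash to the smallest values, so two independently built sketches tend to retain the same keys, and a correlation can then be estimated from the overlap without joining the full tables. The function names, the `k` parameter, the choice of BLAKE2 as the hash, and the representation of a table as a key-to-value dictionary are all assumptions made for this example.

```python
from __future__ import annotations

import hashlib
import heapq
import math


def key_hash(key: str) -> int:
    """Deterministic 64-bit hash, so every table ranks a given key identically."""
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def build_sketch(table: dict[str, float], k: int) -> dict[str, float]:
    """Keep the k rows whose keys have the smallest hash values (hypothetical helper)."""
    smallest_keys = heapq.nsmallest(k, table, key=key_hash)
    return {key: table[key] for key in smallest_keys}


def pearson(xs: list[float], ys: list[float]) -> float | None:
    """Plain Pearson correlation; returns None if either side has zero variance."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def estimate_join_correlation(
    sketch_a: dict[str, float], sketch_b: dict[str, float]
) -> float | None:
    """Estimate the post-join correlation using only keys present in both sketches."""
    shared = sorted(sketch_a.keys() & sketch_b.keys())
    if len(shared) < 2:
        return None  # too little overlap to say anything
    return pearson([sketch_a[key] for key in shared],
                   [sketch_b[key] for key in shared])


if __name__ == "__main__":
    # Two toy tables sharing the same join keys (made-up data).
    table_a = {f"k{i}": float(i) for i in range(1000)}
    table_b = {f"k{i}": 2.0 * i + 1.0 for i in range(1000)}
    sketch_a = build_sketch(table_a, k=50)
    sketch_b = build_sketch(table_b, k=50)
    print(estimate_join_correlation(sketch_a, sketch_b))
```

Because both sketches rank keys with the same hash function, the retained rows form a coordinated sample: the overlap between sketches is far larger than it would be if each table were sampled independently. A real system would also need to handle repeated keys, missing values, and the significance testing the abstract mentions, none of which this sketch attempts.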

Project Outputs

Journal articles (3)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Simple Analysis of Priority Sampling
Weighted Minwise Hashing Beats Linear Sketching for Inner Product Estimation
A Sketch-based Index for Correlated Dataset Search

Other publications by Juliana Freire

Retention in fissure resin-based sealants in schoolchildren: the etching step importance
  • DOI:
  • Published:
    2023
  • Journal:
  • Impact factor:
    0
  • Authors:
    T. Pereira; L. Clementino; Juliana Freire; P. Martins‐Júnior
  • Corresponding author:
    P. Martins‐Júnior
Editorial for Special Issue: VLDB 2022
  • DOI:
    10.1007/s00778-025-00930-y
  • Published:
    2025-06-23
  • Journal:
  • Impact factor:
    3.800
  • Authors:
    Juliana Freire; Fatma Özcan; Xuemin Lin
  • Corresponding author:
    Xuemin Lin
SIGMOD Executive Committee:
  • DOI:
  • Published:
    2020
  • Journal:
  • Impact factor:
    0
  • Authors:
    Juliana Freire; Ihab F. Ilyas; Fatma Ozcan; Rada Chirkova; C. Dyreson; Joe Hellerstein; Michael Franklin; Renée Miller; John Wilkes; Chris Olsten; A. Doan; Tamer M. Özsu; G. Weikum; Stefano Ceri; T. Sellis; Stratos Idreos
  • Corresponding author:
    Stratos Idreos
Managing XML data: an abridged overview
VisCareTrails: Visualizing Trails in the Electronic Health Record with Timed Word Trees, a Pancreas Cancer Use Case
  • DOI:
  • Published:
    2011
  • Journal:
  • Impact factor:
    0
  • Authors:
    L. Lins; Marta Heilbrun; Juliana Freire; Cláudio T. Silva
  • Corresponding author:
    Cláudio T. Silva

Other grants by Juliana Freire

D-ISN/Collaborative Research: An Interdisciplinary Approach to the Discovery, Analysis, and Disruption of Wildlife Trafficking Networks
  • Award number:
    2146306
  • Fiscal year:
    2022
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
CI-EN: Enhancing and Supporting a Community-Based Data Analysis, Visualization, and Provenance Platform
  • Award number:
    1405927
  • Fiscal year:
    2014
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
CAREER: Storing, Querying and Re-Using Provenance of Computational Tasks
  • Award number:
    1142013
  • Fiscal year:
    2011
  • Amount:
    $1.0932 million
  • Project type:
    Continuing Grant
III: EAGER: Collaborative Research: A Community Experiment Platform for Reproducibility and Generalizability
  • Award number:
    1139832
  • Fiscal year:
    2011
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
III: EAGER: Collaborative Research: A Community Experiment Platform for Reproducibility and Generalizability
  • Award number:
    1050422
  • Fiscal year:
    2010
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
III: Medium: Provenance Analytics: Exploring Computational Tasks and their History
  • Award number:
    0905385
  • Fiscal year:
    2009
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
CAREER: Storing, Querying and Re-Using Provenance of Computational Tasks
  • Award number:
    0746500
  • Fiscal year:
    2008
  • Amount:
    $1.0932 million
  • Project type:
    Continuing Grant
III-COR: Discovering and Organizing Hidden-Web Sources
  • Award number:
    0713637
  • Fiscal year:
    2007
  • Amount:
    $1.0932 million
  • Project type:
    Continuing Grant
XML Data Management: Taking Order and Updates into Account
  • Award number:
    0534628
  • Fiscal year:
    2006
  • Amount:
    $1.0932 million
  • Project type:
    Continuing Grant
CT-T: A Laboratory Workbench for Security Research
  • Award number:
    0524096
  • Fiscal year:
    2005
  • Amount:
    $1.0932 million
  • Project type:
    Continuing Grant

Similar overseas grants

Collaborative Research: CyberTraining: Implementation: Medium: Training Users, Developers, and Instructors at the Chemistry/Physics/Materials Science Interface
  • Award number:
    2321102
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
RII Track-4:@NASA: Bluer and Hotter: From Ultraviolet to X-ray Diagnostics of the Circumgalactic Medium
  • Award number:
    2327438
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
Collaborative Research: Topological Defects and Dynamic Motion of Symmetry-breaking Tadpole Particles in Liquid Crystal Medium
  • Award number:
    2344489
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
Collaborative Research: AF: Medium: The Communication Cost of Distributed Computation
  • Award number:
    2402836
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Continuing Grant
Collaborative Research: AF: Medium: Foundations of Oblivious Reconfigurable Networks
  • Award number:
    2402851
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Continuing Grant
Collaborative Research: CIF: Medium: Snapshot Computational Imaging with Metaoptics
  • Award number:
    2403122
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Differentiable Hardware Synthesis
  • Award number:
    2403134
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Enabling Graphics Processing Unit Performance Simulation for Large-Scale Workloads with Lightweight Simulation Methods
  • Award number:
    2402804
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
Collaborative Research: CIF-Medium: Privacy-preserving Machine Learning on Graphs
  • Award number:
    2402815
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant
Collaborative Research: SHF: Medium: Tiny Chiplets for Big AI: A Reconfigurable-On-Package System
  • Award number:
    2403408
  • Fiscal year:
    2024
  • Amount:
    $1.0932 million
  • Project type:
    Standard Grant