EAGER: Algorithms for Data Set Versioning: Store or Re-create?
EAGER:数据集版本控制算法:存储还是重新创建?
基本信息
- 批准号:1655073
- 负责人:
- 金额:$ 7.5万
- 依托单位:
- 依托单位国家:美国
- 项目类别:Standard Grant
- 财政年份:2016
- 资助国家:美国
- 起止时间:2016-09-01 至 2017-08-31
- 项目状态:已结题
- 来源:
- 关键词:
项目摘要
Technologically facilitated access to large data sets is increasingly emerging as key to scientific research in areas ranging from medicine to climate change with teams of researchers simultaneously engaged in accessing, modifying and cleaning data sets. Not surprisingly, such collaborative data-use has engendered substantial challenges related to data management. Indeed, the continuous modification of large-scale data sets frequently results in the creation of thousands of versions of data sets over time, especially as multiple users? access and edit the data over time. Such proliferation raises some basic questions: Should all versions of a document be saved? While this is certainly convenient, the storage costs may be prohibitively high. Alternatively, should only a certain version be saved? In this case, while the storage costs are low, the cost of recreating a particular version can rise significantly due to the effort involved in making changes to an existing version. This project focuses on the fundamental challenges arising from balancing storage needs with efficient retrieval of information in the context of big data. Thus the primary research goal of this proposal is to design provably good algorithms that will not only result in a deeper understanding of the storage and re-creation tradeoff but will also contribute to the development of effective data storage systems that are based on a sound theoretical foundation.In previous NSF-funded projects, the PI has collaborated extensively and successfully with women and high school students and this project will also involve similar collaborations. Over the course of the past five years, the PI has graduated three women PhDs and is currently advising another three. He has also worked with several women undergraduates who are now pursuing doctoral degrees. Additionally, the PI has played a key role establishing connections with the national Braid project, supporting the departmental chapter of the Association of Women in Computing and organizing events and activities focused on bringing in established women computer scientists as role models for current students. This fundamental problem can be modeled within a graph theoretic framework, as a directed weighted graph. Each node denotes a version. In the general form each edge (a,b) has two associated parameters - a weight denoting the storage cost to generate version b, given a copy of a and a cost denoting the cost to actually perform the computation of converting a to b. While both these are closely related, they could be different. In addition, the edge weights and costs can be wildly asymmetric. The primary reason for this is that when a new version is created by deleting data, we can simply specify that a significant portion of the data is deleted, however the reverse operation of insertion needs to actually specify the data to be inserted. In this framework, the goal is to compute a rooted tree and the structure and depth of the tree controls the storage and re-creation trade-off. While there exists a deep understanding of this problem for undirected graphs, none of those methods work effectively for directed graphs. This project will develop a deeper understanding of this basic problem.
在从医学到气候变化等领域,利用技术手段获取大型数据集日益成为科学研究的关键,研究人员团队同时参与获取、修改和清理数据集。毫不奇怪,这种协作使用数据的做法在数据管理方面带来了巨大挑战。事实上,大规模数据集的不断修改经常导致随着时间的推移创建数千个版本的数据集,特别是作为多个用户?随着时间的推移访问和编辑数据。这样的扩散提出了一些基本问题:是否应该保存文档的所有版本?虽然这当然很方便,但存储成本可能高得令人望而却步。或者,是否应该只保存某个版本?在这种情况下,虽然存储成本很低,但由于对现有版本进行更改所涉及的工作量,重新创建特定版本的成本可能会显著增加。该项目侧重于在大数据环境中平衡存储需求与有效检索信息所带来的根本挑战。因此,本提案的主要研究目标是设计可证明良好的算法,不仅会导致对存储和重新创建权衡的更深入理解,而且还将有助于开发基于良好理论基础的有效数据存储系统。PI与妇女和高中生进行了广泛和成功的合作,该项目也将涉及类似的合作。在过去的五年中,PI有三名女博士毕业,目前正在为另外三名提供咨询。他还与几位正在攻读博士学位的女大学生一起工作。此外,PI还发挥了关键作用,与国家Braid项目建立联系,支持妇女计算机协会的部门分会,并组织各种活动,重点是将知名的女计算机科学家作为在校学生的榜样。这个基本问题可以在图论框架内建模,作为有向加权图。 每个节点表示一个版本。在一般形式中,每个边(a,B)具有两个相关联的参数--权重和成本,权重表示在给定a的副本的情况下生成版本B的存储成本,成本表示实际执行将a转换为B的计算的成本。虽然这两种情况密切相关,但它们可能不同。此外,边权重和成本可以是非常不对称的。这样做的主要原因是,当通过删除数据创建新版本时,我们可以简单地指定删除数据的重要部分,但是插入的反向操作需要实际指定要插入的数据。在这个框架中,目标是计算一棵有根树,树的结构和深度控制存储和重新创建的权衡。虽然对无向图的这个问题有深入的理解,但这些方法都不能有效地用于有向图。本项目将加深对这一基本问题的理解。
项目成果
期刊论文数量(0)
专著数量(0)
科研奖励数量(0)
会议论文数量(0)
专利数量(0)
数据更新时间:{{ journalArticles.updateTime }}
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
数据更新时间:{{ journalArticles.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ monograph.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ sciAawards.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ conferencePapers.updateTime }}
{{ item.title }}
- 作者:
{{ item.author }}
数据更新时间:{{ patent.updateTime }}
Samir Khuller其他文献
M ay 2 00 2 Balancing Minimum Spanning Trees and Shortest-Path Trees
May 2 00 2 平衡最小生成树和最短路径树
- DOI:
- 发表时间:
- 期刊:
- 影响因子:0
- 作者:
Samir Khuller - 通讯作者:
Samir Khuller
To Store or Not to Store: a graph theoretical approach for Dataset Versioning
存储还是不存储:数据集版本控制的图论方法
- DOI:
- 发表时间:
2024 - 期刊:
- 影响因子:0
- 作者:
Anxin Guo;Jingwei Li;Pattara Sukprasert;Samir Khuller;Amol Deshpande;Koyel Mukherjee - 通讯作者:
Koyel Mukherjee
Facility Location with Dynamic Distance Functions
- DOI:
10.1023/a:1009796525600 - 发表时间:
1998-09-01 - 期刊:
- 影响因子:1.100
- 作者:
Randeep Bhatia;Sudipto Guha;Samir Khuller;Yoram J. Sussmann - 通讯作者:
Yoram J. Sussmann
Approximation algorithms for data placement on parallel disks
并行磁盘上数据放置的近似算法
- DOI:
10.1145/1597036.1597037 - 发表时间:
2009 - 期刊:
- 影响因子:0
- 作者:
L. Golubchik;Sanjeev Khanna;Samir Khuller;R. Thurimella;An Zhu - 通讯作者:
An Zhu
Geometric knapsack problems
- DOI:
10.1007/bf01769706 - 发表时间:
1993-11-01 - 期刊:
- 影响因子:0.700
- 作者:
Esther M. Arkin;Samir Khuller;Joseph S. B. Mitchell - 通讯作者:
Joseph S. B. Mitchell
Samir Khuller的其他文献
{{
item.title }}
{{ item.translation_title }}
- DOI:
{{ item.doi }} - 发表时间:
{{ item.publish_year }} - 期刊:
- 影响因子:{{ item.factor }}
- 作者:
{{ item.authors }} - 通讯作者:
{{ item.author }}
{{ truncateString('Samir Khuller', 18)}}的其他基金
REU Site: CAAR: Combinatorial Algorithms Applied Research
REU 网站:CAAR:组合算法应用研究
- 批准号:
1262805 - 财政年份:2013
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
AF: Small:Efficient Data Management Algorithms
AF:小:高效的数据管理算法
- 批准号:
1217890 - 财政年份:2012
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
Collaborative Research: Broader Impacts for Research and Discovery Summit
协作研究:研究和发现峰会的更广泛影响
- 批准号:
1033192 - 财政年份:2010
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
Optimization Algorithms for Large-scale, Thermal-aware Storage Systems
大规模热感知存储系统的优化算法
- 批准号:
0937865 - 财政年份:2009
- 资助金额:
$ 7.5万 - 项目类别:
Continuing Grant
CCF: Fundamental Algorithms for Data Management
CCF:数据管理的基本算法
- 批准号:
0728839 - 财政年份:2007
- 资助金额:
$ 7.5万 - 项目类别:
Continuing Grant
ITR/SY: Algorithms for Data Storage and Movement
ITR/SY:数据存储和移动算法
- 批准号:
0113192 - 财政年份:2001
- 资助金额:
$ 7.5万 - 项目类别:
Continuing Grant
Designing Algorithms for NP-Hard Graph Problems
NP 难图问题的算法设计
- 批准号:
9820965 - 财政年份:1999
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
CAREER: Approximation Algorithms for Graph-Theoretic Problems
职业:图论问题的近似算法
- 批准号:
9501355 - 财政年份:1995
- 资助金额:
$ 7.5万 - 项目类别:
Continuing Grant
相似海外基金
EAGER: Algorithms for Analyzing Faulty Data Using Domain Information
EAGER:使用域信息分析错误数据的算法
- 批准号:
2414736 - 财政年份:2024
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
XTRIPODS: Algorithms and Machine Learning in Data Intensive Models
XTRIPODS:数据密集型模型中的算法和机器学习
- 批准号:
2342527 - 财政年份:2024
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
CAREER: Data Structures and Streaming Algorithms
职业:数据结构和流算法
- 批准号:
2339942 - 财政年份:2024
- 资助金额:
$ 7.5万 - 项目类别:
Continuing Grant
Data-Driven Algorithms for Data Acquisition
用于数据采集的数据驱动算法
- 批准号:
EP/Y037200/1 - 财政年份:2024
- 资助金额:
$ 7.5万 - 项目类别:
Research Grant
RII Track-4: NSF: Extracting Pan Genomic Information from Metagenomic Data: Distributed Algorithms and Scalable Software
RII Track-4:NSF:从宏基因组数据中提取泛基因组信息:分布式算法和可扩展软件
- 批准号:
2327456 - 财政年份:2024
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
Collaborative Research: CIF: Small: Versatile Data Synchronization: Novel Codes and Algorithms for Practical Applications
合作研究:CIF:小型:多功能数据同步:实际应用的新颖代码和算法
- 批准号:
2312872 - 财政年份:2023
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
ATD: Efficient and Effective Algorithms for Detection of Anomalies in High-dimensional Spatiotemporal Data with Large Amounts of Missing Data
ATD:高效且有效的高维时空数据异常检测算法
- 批准号:
2318925 - 财政年份:2023
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
RTML: Large: Collaborative: Harmonizing Predictive Algorithms and Mixed-Signal/Precision Circuits via Computation-Data Access Exchange and Adaptive Dataflows
RTML:大型:协作:通过计算数据访问交换和自适应数据流协调预测算法和混合信号/精密电路
- 批准号:
2400511 - 财政年份:2023
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
Collaborative Research: III: Medium: Algorithms for scalable inference and phylodynamic analysis of tumor haplotypes using low-coverage single cell sequencing data
合作研究:III:中:使用低覆盖率单细胞测序数据对肿瘤单倍型进行可扩展推理和系统动力学分析的算法
- 批准号:
2415562 - 财政年份:2023
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant
CAREER: Data Valuation in the Wild: Theories, Algorithms, and Applications
职业:野外数据评估:理论、算法和应用
- 批准号:
2239622 - 财政年份:2023
- 资助金额:
$ 7.5万 - 项目类别:
Standard Grant