
EAGER: Algorithms for Data Set Versioning: Store or Re-create?


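Basic information

Award number:
1655073
Principal investigator:
Samir Khuller
Amount:
$75,000
Host institution:
University of Maryland, College Park
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2016
Funding country:
United States
Project period:
2016-09-01 to 2017-08-31
Project status:
Completed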

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1655073&HistoricalAwards=false
Keywords:
EAGER Algorithms Data Set Versioning

Project abstract

Technologically facilitated access to large data sets is increasingly emerging as key to scientific research in areas ranging from medicine to climate change, with teams of researchers simultaneously accessing, modifying, and cleaning data sets. Not surprisingly, such collaborative data use has created substantial data-management challenges. Indeed, the continuous modification of large-scale data sets frequently results in thousands of versions over time, especially as multiple users access and edit the data. Such proliferation raises some basic questions: Should all versions of a document be saved? While this is certainly convenient, the storage costs may be prohibitively high. Alternatively, should only a certain version be saved? In this case, while the storage costs are low, the cost of re-creating a particular version can rise significantly due to the effort involved in making changes to an existing version. This project focuses on the fundamental challenges arising from balancing storage needs with efficient retrieval of information in the context of big data. Thus the primary research goal of this proposal is to design provably good algorithms that will not only result in a deeper understanding of the storage and re-creation trade-off but will also contribute to the development of effective data storage systems that are based on a sound theoretical foundation.

In previous NSF-funded projects, the PI has collaborated extensively and successfully with women and high school students, and this project will also involve similar collaborations. Over the course of the past five years, the PI has graduated three women PhDs and is currently advising another three. He has also worked with several women undergraduates who are now pursuing doctoral degrees. Additionally, the PI has played a key role in establishing connections with the national Braid project, supporting the departmental chapter of the Association of Women in Computing, and organizing events and activities focused on bringing in established women computer scientists as role models for current students.

This fundamental problem can be modeled within a graph-theoretic framework, as a directed weighted graph. Each node denotes a version. In the general form, each edge (a, b) has two associated parameters: a weight denoting the storage cost of generating version b given a copy of a, and a cost denoting the cost of actually performing the computation that converts a to b. While the two are closely related, they can differ. In addition, the edge weights and costs can be wildly asymmetric. The primary reason is that when a new version is created by deleting data, we can simply specify that a significant portion of the data is deleted, whereas the reverse operation of insertion must actually specify the data to be inserted. In this framework, the goal is to compute a rooted tree, and the structure and depth of the tree control the storage and re-creation trade-off. While there exists a deep understanding of this problem for undirected graphs, none of those methods work effectively for directed graphs. This project will develop a deeper understanding of this basic problem.
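The graph model described in the abstract can be illustrated with a small Python sketch. It is not the project's algorithm; it only evaluates two hand-picked rooted trees over the same version graph to show the trade-off. Several things here are assumptions for illustration: the dummy node ROOT (an edge from ROOT to a version is taken to mean "store that version in full"), the version names, and every cost value.

```python
from typing import Dict, Tuple

# Hypothetical dummy root: an edge ROOT -> v means "version v is stored in full".
ROOT = "root"

# (source, target) -> (storage_cost, recreation_cost). All numbers are made up.
EDGES: Dict[Tuple[str, str], Tuple[int, int]] = {
    (ROOT, "v1"): (100, 100),
    (ROOT, "v2"): (105, 105),
    (ROOT, "v3"): (110, 110),
    ("v1", "v2"): (10, 20),  # insertion: the delta must carry the new data
    ("v2", "v1"): (1, 5),    # deletion: cheap to describe, hence asymmetric
    ("v2", "v3"): (8, 15),
}


def evaluate(parent: Dict[str, str]) -> Tuple[int, Dict[str, int]]:
    """Score a rooted tree given as a child -> parent map.

    Returns the total storage cost (sum of storage weights on tree edges)
    and, for each version, the re-creation cost (sum of recreation costs
    along the path from ROOT to that version).
    """
    total_storage = sum(EDGES[(p, v)][0] for v, p in parent.items())

    def recreate(v: str) -> int:
        if v == ROOT:
            return 0
        p = parent[v]
        return recreate(p) + EDGES[(p, v)][1]

    return total_storage, {v: recreate(v) for v in parent}


# Shallow tree: every version is stored in full.
store_all = {"v1": ROOT, "v2": ROOT, "v3": ROOT}
# Deep tree: only v1 is stored in full, the rest are deltas in a chain.
chain = {"v1": ROOT, "v2": "v1", "v3": "v2"}

print(evaluate(store_all))
print(evaluate(chain))
```

With these made-up costs, the shallow tree uses far more storage but re-creates each version in one step, while the chain stores much less and pays for it with longer re-creation paths. Choosing a tree between these extremes on a directed, asymmetric graph is the problem the abstract describes.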
