III: Medium: Collaborative Research: DataHub - A Collaborative Dataset Management Platform for Data Science


Basic Information

Project Abstract

The rise of the Internet, smartphones, and wireless sensors has produced a vast trove of data about all aspects of our lives, from our social interactions to our personal preferences to our vital signs and medical records. Increasingly, "data science" teams want to analyze these datasets collaboratively, to understand trends and to extract actionable business, scientific, or social insights. Unfortunately, while tools exist to support data analysis, much-needed underlying infrastructure and data management capabilities are missing. To this end, "DataHub", a collaborative platform for cleaning, storing, understanding, sharing, and publishing datasets, will be developed. DataHub will be a publicly accessible platform that hosts private user datasets as well as public datasets retrieved from online sources. It will serve as the common substrate for data science, freeing end users from tedious dataset bookkeeping tasks and instead supporting them in their search for useful insights. DataHub will be deployed on a large scale at MIT; partnerships with organizations and groups from a variety of sectors will be leveraged to show benefits for real data scientists and to ensure that the proposed techniques meet real-world big data challenges. The curriculum development part of this project will lead to the training of new data scientists, and the project will also give graduate and undergraduate students opportunities to participate in research and to learn how to do collaborative research.

Unlike most systems, which focus on improving performance or on supporting ever more sophisticated analyses, DataHub will focus on simplifying and automating many of the fundamental bookkeeping operations that are a prerequisite to data science. Key features of DataHub will include: (1) a flexible, source-code-control-like versioning system for data that efficiently branches, merges, and diffs datasets; (2) new data ingest, cleaning, and wrangling tools designed to automate the data cleaning process; (3) the ability to search for "related" tables and to integrate them into the analysis process; and (4) the ability to selectively share and collaborate on datasets across users and teams. Overall, DataHub will significantly reduce the effort data scientists spend preparing, analyzing, sharing, and managing data.

For more information, see the project website at: http://data-hub.org
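
To make feature (1) concrete, the following is a minimal sketch of what diffing and merging dataset versions could look like. It is an illustration only, not DataHub's actual design: the abstract does not describe the implementation, and the names `Dataset`, `diff`, and `merge`, as well as the choice to model a dataset as a dictionary from primary key to row, are assumptions made for this example.

```python
from typing import Any, Dict, Set, Tuple

Row = Tuple[Any, ...]
Dataset = Dict[str, Row]  # hypothetical model: primary key -> row


def diff(old: Dataset, new: Dataset) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return the keys that were (added, removed, changed) from old to new."""
    added = {k for k in new if k not in old}
    removed = {k for k in old if k not in new}
    changed = {k for k in old.keys() & new.keys() if old[k] != new[k]}
    return added, removed, changed


def merge(base: Dataset, ours: Dataset, theirs: Dataset) -> Dataset:
    """Three-way merge of two branches that share the ancestor `base`.

    A missing key is represented as None. Raises ValueError when both
    branches changed the same key in different ways.
    """
    merged: Dataset = {}
    for key in base.keys() | ours.keys() | theirs.keys():
        b, o, t = base.get(key), ours.get(key), theirs.get(key)
        if o == t:
            value = o  # both branches agree (including "both deleted")
        elif o == b:
            value = t  # only their branch touched this key
        elif t == b:
            value = o  # only our branch touched this key
        else:
            raise ValueError(f"conflict on key {key!r}")
        if value is not None:
            merged[key] = value
    return merged


# Two branches created from the same base version.
base = {"u1": ("Ann", 34), "u2": ("Bob", 51)}
branch_a = {**base, "u3": ("Cy", 29)}   # adds a row
branch_b = {**base, "u2": ("Bob", 52)}  # edits a row

added, removed, changed = diff(base, branch_a)
merged = merge(base, branch_a, branch_b)
```

Here the two branches touch different keys, so the merge succeeds and contains both the added row and the edited row. If both branches had edited `u2` differently, `merge` would raise `ValueError` instead of silently picking a side.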

Project Outcomes

Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Optimally Leveraging Density and Locality for Exploratory Browsing and Sampling
  • DOI:
    10.1145/3209900.3209903
  • Published:
    2018
  • Journal:
  • Impact factor:
    0
  • Authors:
    Kim, Albert;Xu, Liqi;Siddiqui, Tarique;Huang, Silu;Madden, Samuel;Parameswaran, Aditya
  • Corresponding author:
    Parameswaran, Aditya

Other publications by Aditya Parameswaran

OrpheusDB: bolt-on versioning for relational databases (extended version)
  • DOI:
    10.1007/s00778-019-00594-5
  • Published:
    2019-12-20
  • Journal:
  • Impact factor:
    3.800
  • Authors:
    Silu Huang;Liqi Xu;Jialin Liu;Aaron J. Elmore;Aditya Parameswaran
  • Corresponding author:
    Aditya Parameswaran

Other grants of Aditya Parameswaran

FW-HTF-R: Human-Machine Teaming for Effective Data Work at Scale: Upskilling Defense Lawyers Working with Police and Court Process Data
  • Award number:
    2129008
  • Fiscal year:
    2021
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
AitF: Collaborative Research: Fast, Accurate, and Practical: Adaptive Sublinear Algorithms for Scalable Visualization
  • Award number:
    1940759
  • Fiscal year:
    2019
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
CAREER: Advancing Open-Ended Crowdsourcing: The Next Frontier in Crowdsourced Data Management
  • Award number:
    1940757
  • Fiscal year:
    2019
  • Funding amount:
    $333,000
  • Award type:
    Continuing Grant
AitF: Collaborative Research: Fast, Accurate, and Practical: Adaptive Sublinear Algorithms for Scalable Visualization
  • Award number:
    1733878
  • Fiscal year:
    2017
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
CAREER: Advancing Open-Ended Crowdsourcing: The Next Frontier in Crowdsourced Data Management
  • Award number:
    1652750
  • Fiscal year:
    2017
  • Funding amount:
    $333,000
  • Award type:
    Continuing Grant

Similar overseas grants

III: Medium: Collaborative Research: From Open Data to Open Data Curation
  • Award number:
    2420691
  • Fiscal year:
    2024
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
Collaborative Research: III: Medium: Designing AI Systems with Steerable Long-Term Dynamics
  • Award number:
    2312865
  • Fiscal year:
    2023
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
Collaborative Research: III: MEDIUM: Responsible Design and Validation of Algorithmic Rankers
  • Award number:
    2312932
  • Fiscal year:
    2023
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
III: Medium: Collaborative Research: Integrating Large-Scale Machine Learning and Edge Computing for Collaborative Autonomous Vehicles
  • Award number:
    2348169
  • Fiscal year:
    2023
  • Funding amount:
    $333,000
  • Award type:
    Continuing Grant
Collaborative Research: III: Medium: Algorithms for scalable inference and phylodynamic analysis of tumor haplotypes using low-coverage single cell sequencing data
  • Award number:
    2415562
  • Fiscal year:
    2023
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
Collaborative Research: III: Medium: New Machine Learning Empowered Nanoinformatics System for Advancing Nanomaterial Design
  • Award number:
    2347592
  • Fiscal year:
    2023
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
Collaborative Research: III: Medium: Knowledge discovery from highly heterogeneous, sparse and private data in biomedical informatics
  • Award number:
    2312862
  • Fiscal year:
    2023
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
Collaborative Research: III: MEDIUM: Responsible Design and Validation of Algorithmic Rankers
  • Award number:
    2312930
  • Fiscal year:
    2023
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
Collaborative Research: III: Medium: VirtualLab: Integrating Deep Graph Learning and Causal Inference for Multi-Agent Dynamical Systems
  • Award number:
    2312501
  • Fiscal year:
    2023
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant
Collaborative Research: III: Medium: Graph Neural Networks for Heterophilous Data: Advancing the Theory, Models, and Applications
  • Award number:
    2406648
  • Fiscal year:
    2023
  • Funding amount:
    $333,000
  • Award type:
    Standard Grant