EAGER: Lifecycle Management of Collaborative Analysis Workflows through Provenance Capture and Analysis

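Basic information

  • Award number:
    1650755
  • Principal investigator:
  • Amount:
    $256,900
  • Host institution:
  • Host institution country:
    United States
  • Project type:
    Standard Grant
  • Fiscal year:
    2016
  • Funding country:
    United States
  • Project period:
    2016-09-01 to 2018-08-31
  • Project status:
    Completed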

Project abstract

Data-driven methods and products have shown tremendous promise and are becoming increasingly common in a variety of communities, including science, education, economics, government, and social and web analytics. This trend, popularly referred to as "big data" or "data science", has resulted in a pressing need for sustainable and scalable tools that facilitate the end-to-end collaborative data analysis process; this process is often ad hoc, typically featuring highly unstructured datasets, an amalgamation of different tools and techniques, significant back-and-forth among the members of a team, and trial-and-error to identify the right analysis tools, algorithms, models, and parameters. Although there is much prior and ongoing work on developing tools to perform specific data analysis tasks, there is no easy way to capture and reason about ad hoc data science pipelines, many of which are often spread across a collection of analysis scripts. Metadata or provenance information about how datasets were generated, including the programs or scripts used for generating them and/or values of any crucial parameters, is often lost. Similarly, it is hard to keep track of any dependencies between the datasets, or information about how they evolved over time. This project is building a unified provenance and metadata management system to support end-to-end lifecycle management of complex collaborative "data science workflows" that arise in big data applications. The system features a flexible and intuitive data model that can capture a variety of different types of data and metadata, including versioning and provenance information, derivation information, parameters used during experiments or modeling, statistics gathered to make decisions, analysis scripts, notes or tags, etc. It provides novel mechanisms for making it easy to capture such information with minimal burden on the users. 
The system also features a rich, high-level domain-specific query language that enables unified querying over such data, as well as a web browser-based visualization tool for formulating queries, and for exploring the search results. By continuously analyzing and exploiting such provenance information, the system also enables a host of new features including: searching for relevant data science workflows or analysis scripts for a given task, comparing end results of multiple pipelines to identify key similarities and differences, and quickly and automatically detecting problems or anomalies during model development and/or deployment. The system will transform the way in which data scientists manage provenance information and metadata while performing data analysis, and will allow them to more quickly derive actionable and useful insights or knowledge from the data. By lowering the barrier to sharing and reusing the work done by others, the system will lead to new insights that may not have been achievable beforehand. This project provides research opportunities for graduate and undergraduate students, and is aligned with several undergraduate and graduate courses offered by the PI.

Project outcomes

Journal articles (5)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
ProvDB: Lifecycle Management of Collaborative Analysis Workflows
Towards Unified Data and Lifecycle Management for Deep Learning
DEX: Query Execution in a Delta-based Storage System
RStore: A Distributed Multi-Version Document Store

Other publications by Amol Deshpande

MEDLINE/PubMed
  • DOI:
    10.1007/978-0-387-39940-9_3039
  • Publication date:
    2004
  • Journal:
  • Impact factor:
    3.8
  • Authors:
    Cornelia Caragea;V. Honavar;P. Boncz;P. Larson;S. Dietrich;Gonzalo Navarro;Bhavani Thuraisingham;Yan Luo;Ouri E. Wolfson;S. Beitzel;Eric C. Jensen;Ophir Frieder;Christian S. Jensen;N. Tradisauskas;Ethan V. Munson;A. Wun;K. Goda;Stephen E. Fienberg;Jiashun Jin;Guimei Liu;Nick Craswell;T. Pedersen;Cesare Pautasso;M. Moro;S. Manegold;B. Carminati;Marina Blanton;Sara Bouchenak;Noël de Palma;Wei Tang;Christoph Quix;M. Jeusfeld;R. K. Pon;David J. Buttler;W. Meng;P. Zezula;Michal Batko;Vlastislav Dohnal;J. Domingo;Denilson Barbosa;Ioana Manolescu;Jeffrey Xu Yu;Emmanuel Cecchet;Vivien Quéma;Xifeng Yan;G. Santucci;D. Zeinalipour;Panos K. Chrysanthis;Amol Deshpande;Carlos Guestrin;Samuel Madden;Carson Kai;R. H. Güting;Amarnath Gupta;Heng Tao Shen;G. Weikum;Ramesh Jain;Jeffrey Xu Yu;Paolo Ciaccia;K. Candan;M. Sapino;C. Meghini;F. Sebastiani;U. Straccia;F. Nack;V. S. Subrahmanian;Maria Vanina Martinez;D. Reforgiato;T. Westerveld;M. Sebillo;G. Vitiello;Maria De Marsico;K. Voruganti;C. Parent;S. Spaccapietra;Christelle Vangenot;Esteban Zimányi;Prasan Roy;S. Sudarshan;E. Puppo;Peer Kröger;Matthias Renz;H. Schuldt;Solmaz Kolahi;A. Unwin;W. Cellary
  • Corresponding author:
    W. Cellary
To Store or Not to Store: a graph theoretical approach for Dataset Versioning
  • DOI:
  • Publication date:
    2024
  • Journal:
  • Impact factor:
    0
  • Authors:
    Anxin Guo;Jingwei Li;Pattara Sukprasert;Samir Khuller;Amol Deshpande;Koyel Mukherjee
  • Corresponding author:
    Koyel Mukherjee
Moment
  • DOI:
  • Publication date:
    2009
  • Journal:
  • Impact factor:
    0
  • Authors:
    Cornelia Caragea;V. Honavar;P. Boncz;Per;Suzanne W. Dietrich;Gonzalo Navarro;B. Thuraisingham;Yan Luo;Ouri Wolfson;S. Beitzel;Eric C. Jensen;O. Frieder;C. S. Jensen;N. Tradisauskas;E. Munson;A. Wun;K. Goda;Stephen E. Fienberg;Jiashun Jin;Guimei Liu;Nick Craswell;T. Pedersen;Cesare Pautasso;M. Moro;S. Manegold;B. Carminati;Marina Blanton;S. Bouchenak;Noël de Palma;Wei Tang;C. Quix;M. Jeusfeld;R. K. Pon;David J. Buttler;Weiyi Meng;P. Zezula;Michal Batko;Vlastislav Dohnal;J. Domingo;Denilson Barbosa;I. Manolescu;Jeffrey Xu Yu;E. Cecchet;Vivien Quéma;Xifeng Yan;G. Santucci;D. Zeinalipour;P. Chrysanthis;Amol Deshpande;Carlos Guestrin;S. Madden;C. Leung;R. H. Güting;Amarnath Gupta;Heng Tao Shen;G. Weikum;Ramesh Jain;Jeffrey Xu Yu;P. Ciaccia;K. Candan;M. Sapino;C. Meghini;Fabrizio Sebastiani;U. Straccia;F. Nack;V. S. Subrahmanian;Maria Vanina Martinez;D. Reforgiato;T. Westerveld;M. Sebillo;G. Vitiello;Maria De Marsico;K. Voruganti;Christine Parent;S. Spaccapietra;C. Vangenot;E. Zimányi;Prasan Roy;S. Sudarshan;Enrico Puppo;Peer Kröger;M. Renz;H. Schuldt;Solmaz Kolahi;A. Unwin;W. Cellary
  • Corresponding author:
    W. Cellary
Application of Packed Bed Chemical Looping (Unmixed) Combustion for water heating: Modelling and CFD simulation for Reduction cycle
  • DOI:
    10.1016/j.cep.2023.109569
  • Publication date:
    2023-12-01
  • Journal:
  • Impact factor:
  • Authors:
    Amina Faizal;Amol Deshpande
  • Corresponding author:
    Amol Deshpande
108 – The Prevalence and Use of Cannabis by Patients with Inflammatory Bowel Disease
  • DOI:
    10.1016/s0016-5085(19)36842-8
  • Publication date:
    2019-05-01
  • Journal:
  • Impact factor:
  • Authors:
    Lillian Du;Amol Deshpande;Laura Yang;Shlomit Boguslavsky;Kenneth Croitoru;Zane Gallinger;Vivian Huang;Mark S. Silverberg;Adam V. Weizman;Geoffrey C. Nguyen;A. Hillary Steinhart
  • Corresponding author:
    A. Hillary Steinhart

Other grants by Amol Deshpande

III: Medium: Collaborative Research: DataHub - A Collaborative Dataset Management Platform for Data Science
  • Award number:
    1513972
  • Fiscal year:
    2015
  • Funding amount:
    $256,900
  • Project type:
    Continuing Grant
III: Small: Enabling Declarative Querying and Analytics over Large Dynamic Information Networks
  • Award number:
    1319432
  • Fiscal year:
    2013
  • Funding amount:
    $256,900
  • Project type:
    Continuing Grant
III: Small: Collaborative Proposal: Towards Robust Uncertain Data Management
  • Award number:
    1218367
  • Fiscal year:
    2012
  • Funding amount:
    $256,900
  • Project type:
    Continuing Grant
III: Small: Managing Large-scale Uncertain Data Repositories
  • Award number:
    0916736
  • Fiscal year:
    2009
  • Funding amount:
    $256,900
  • Project type:
    Continuing Grant
CAREER: MauveDB: Model-Based User Views over Sensor Data
  • Award number:
    0546136
  • Fiscal year:
    2006
  • Funding amount:
    $256,900
  • Project type:
    Continuing Grant
CSR-EHS: Collaborative Research: A General, Efficient and Robust Platform for Enabling Control Applications in Sensor Networks
  • Award number:
    0509220
  • Fiscal year:
    2005
  • Funding amount:
    $256,900
  • Project type:
    Standard Grant

Similar overseas grants

Privacy-preserving machine learning through secure management of data's lifecycle in distributed systems: REMINDER
  • Award number:
    EP/Y036301/1
  • Fiscal year:
    2024
  • Funding amount:
    $256,900
  • Project type:
    Research Grant
Study on data lifecycle management for open access to social survey data
  • Award number:
    23K17577
  • Fiscal year:
    2023
  • Funding amount:
    $256,900
  • Project type:
    Grant-in-Aid for Challenging Research (Exploratory)
A Software-as-a-Service for loan origination and full lifecycle loan management with full online and self-service onboarding, configuration and support
  • Award number:
    10060461
  • Fiscal year:
    2023
  • Funding amount:
    $256,900
  • Project type:
    Collaborative R&D
Excellence in Research/Collaborative Research: Smart Technology-enabled Nutrient Lifecycle and Supply Chain Management for Microgreens
  • Award number:
    2000244
  • Fiscal year:
    2020
  • Funding amount:
    $256,900
  • Project type:
    Standard Grant
Excellence in Research/Collaborative Research: Smart Technology-enabled Nutrient Lifecycle and Supply Chain Management for Microgreens
  • Award number:
    2000266
  • Fiscal year:
    2020
  • Funding amount:
    $256,900
  • Project type:
    Standard Grant
Developing an AI platform to automate and track the lifecycle of solid waste management for Property Managers and waste stream stakeholders in Kenya
  • Award number:
    82349
  • Fiscal year:
    2020
  • Funding amount:
    $256,900
  • Project type:
    Collaborative R&D
Excellence in Research/Collaborative Research: Smart Technology-enabled Nutrient Lifecycle and Supply Chain Management for Microgreens
  • Award number:
    2000229
  • Fiscal year:
    2020
  • Funding amount:
    $256,900
  • Project type:
    Standard Grant
Applying a sex and gender-based lens to prescription drug lifecycle management
  • Award number:
    416989
  • Fiscal year:
    2019
  • Funding amount:
    $256,900
  • Project type:
    Operating Grants
Applying an SGBA+ lens to medical device lifecycle management
  • Award number:
    416987
  • Fiscal year:
    2019
  • Funding amount:
    $256,900
  • Project type:
    Operating Grants
Sustainable Lifecycle Management for Scientific Software (SuLMaSS) - Software Dissemination and Infrastructure Development Driven by a Cardiac Electrophysiology Simulator
  • Award number:
    391128822
  • Fiscal year:
    2018
  • Funding amount:
    $256,900
  • Project type:
    Research data and software (Scientific Library Services and Information Systems)