III: Large: Collaborative Research: Analysis Engineering for Robust End-to-End Data Science

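Basic Information

Award number:
1900991
Principal investigator:
Arvind Satyanarayan
Amount:
$712,500
Host institution:
Massachusetts Institute of Technology
Host institution country:
United States
Award type:
Continuing Grant
Fiscal year:
2019
Funding country:
United States
Project period:
2019-10-01 to 2024-09-30
Project status:
Completed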

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1900991&HistoricalAwards=false
Keywords:
III Large Collaborative Research Analysis

Abstract

From poor statistical practices leading to retractions of scientific "discoveries" to low-level spreadsheet errors subverting high-stakes analyses, failures of data analysis can have catastrophic consequences. The rapid growth of data science practice in the last decade has led to large collaborative efforts to develop new data processing, machine learning, and analytics tools that put more advanced data analysis into the hands of a wider audience of practitioners, from students to scientists to designers. The dominant tool for data science is code, where cutting-edge algorithms can be applied from existing libraries. However, although this democratization of data science has lowered the barrier to using advanced methods, using these tools safely under sound statistical practice remains as difficult as ever. To facilitate more robust data science, this project investigates models and tools for analysis engineering by data scientists who write programs. The focus is on the complete end-to-end process of data analysis performed with code: the iterative, and often exploratory, steps that analysts go through to turn data into results. This project will contribute insights and characterizations of analytic work and novel methods for capturing and analyzing data science activities, and will develop new programming tools and visualization methods for authoring and validating analyses. If successful, this project will augment people's ability to conduct and assess data analyses, promoting more robust results and reducing the gap between novice and expert analysts. The findings and tools from the project will be incorporated into educational efforts, including classroom teaching and tutorials, and made available as open-source software integrated into popular analytical environments (e.g., Jupyter).

Data analysis is a central activity in scientific research, yet is too often conducted in an undisciplined fashion. This project treats the entire analytic process as its central phenomenon of study. The project will employ mixed methods to study and characterize common analysis practices and pitfalls, including direct observations of data analysts, large-scale analysis of computational notebooks, and instrumentation of analytic programming environments like JupyterLab. The project will contribute new methods for specifying and safeguarding analyses, including domain-specific languages and program synthesis methods to guide users to preferred next steps. It will also explore "multiverse" workflows to manage and assess a diversity of analysis decisions. Analogues of debugging and testing tools will be developed to flag problems and perform error analysis, while the capture and visualization of analytic provenance will aid reproducibility, verification, and collaborative review. The work will be evaluated through controlled studies, classroom use, and open-source deployment for wide-scale field use.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Outputs

Journal papers (11)

Monographs (0)

Research awards (0)

Conference papers (0)

Patents (0)

Intuitively Assessing ML Model Reliability through Example-Based Explanations and Editing Model Inputs

DOI:
10.1145/3490099.3511160
Published:
2021-02
Journal:
Proceedings of the 27th International Conference on Intelligent User Interfaces
Impact factor:
0
Authors:
Harini Suresh; Kathleen M. Lewis; J. Guttag; Arvind Satyanarayan
Corresponding authors:
Harini Suresh; Kathleen M. Lewis; J. Guttag; Arvind Satyanarayan

B2: Bridging Code and Interactive Visualization in Computational Notebooks

DOI:
10.1145/3379337.3415851
Published:
2020-10
Journal:
Proceedings of the 33rd Annual ACM Symposium on User Interface Software and Technology
Impact factor:
0
Authors:
Yifan Wu; J. Hellerstein; Arvind Satyanarayan
Corresponding authors:
Yifan Wu; J. Hellerstein; Arvind Satyanarayan

Accessible Visualization via Natural Language Descriptions: A Four-Level Model of Semantic Content

DOI:
10.1109/tvcg.2021.3114770
Published:
2021-09
Journal:
IEEE Transactions on Visualization and Computer Graphics
Impact factor:
5.2
Authors:
Alan Lundgard; Arvind Satyanarayan
Corresponding authors:
Alan Lundgard; Arvind Satyanarayan

Striking a Balance: Reader Takeaways and Preferences when Integrating Text and Charts

DOI:
10.1109/tvcg.2022.3209383
Published:
2023
Journal:
IEEE Transactions on Visualization and Computer Graphics
Impact factor:
5.2
Authors:
Stokes, Chase; Setlur, Vidya; Cogley, Bridget; Satyanarayan, Arvind; Hearst, Marti A.
Corresponding author:
Hearst, Marti A.

Beyond Expertise and Roles: A Framework to Characterize the Stakeholders of Interpretable Machine Learning and their Needs

DOI:
10.1145/3411764.3445088
Published:
2021-01
Journal:
Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems
Impact factor:
0
Authors:
Harini Suresh; Steven R. Gomez; K. Nam; Arvind Satyanarayan
Corresponding authors:
Harini Suresh; Steven R. Gomez; K. Nam; Arvind Satyanarayan

Other publications by Arvind Satyanarayan

Visual Debugging Techniques for Reactive Data Visualization

DOI:
10.1111/cgf.12903
Published:
2016
Journal:
Computer Graphics Forum
Impact factor:
2.5
Authors:
J. Hoffswell; Arvind Satyanarayan; Jeffrey Heer
Corresponding author:
Jeffrey Heer

Varv: Reprogrammable Interactive Software as a Declarative Data Structure

DOI:
10.1145/3491102.3502064
Published:
2022
Journal:
Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems
Impact factor:
0
Authors:
Marcel Borowski; Luke Murray; Rolf Bagge; Bager Kristensen; Arvind Satyanarayan; C. Klokmose
Corresponding author:
C. Klokmose

Umwelt: Accessible Structured Editing of Multimodal Data Representations

DOI:
10.1145/3613904.3641996
Published:
2024
Journal:
Proceedings of the CHI Conference on Human Factors in Computing Systems
Impact factor:
0
Authors:
Jonathan Zong; Isabella Pedraza Pineros; Mengzhu Katie Chen; Daniel Hajas; Arvind Satyanarayan
Corresponding author:
Arvind Satyanarayan

“Customization is Key”: Reconfigurable Textual Tokens for Accessible Data Visualizations

DOI:
Published:
2023
Journal:
International Conference on Human Factors in Computing Systems
Impact factor:
0
Authors:
Shuli Jones; Isabella Pedraza Pineros; Daniel Hajas; Jonathan Zong; Arvind Satyanarayan
Corresponding author:
Arvind Satyanarayan