Interdisciplinary Scientific Data Management
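Basic information
- Award number: 1244820
- Principal investigator:
- Amount: $1 million
- Host institution:
- Host institution country: United States
- Award type: Standard Grant
- Fiscal year: 2012
- Funding country: United States
- Period: 2012-10-01 to 2016-09-30
- Status: Completed
- Source:
- Keywords:
Project abstract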
Scientists are increasingly limited by their ability to analyze the large amounts of complex data available. These data sets are generated not only by instruments but also by computational experiments; the sizes of the largest numerical simulations are on par with data collected by instruments, crossing the petabyte threshold this year. Large synthetic data sets are becoming increasingly important as scientists compare their experiments to reference simulations. All disciplines need a new "instrument for data" that can deal not only with large data sets but also with the cross product of large and diverse data sets. While the largest data sets have captured most of the public attention, they represent only the tip of the iceberg. What is often missed is that scientific data sets have a power law distribution. At one end are the very large data collections compiled by hundreds of scientists collaborating over multiple years. These projects typically have coherent data management plans and organization to ensure that the data products are accessible to a wide community. Nevertheless, the long-term curation of the data is still an unsolved problem. At the other end of the distribution, in the "long tail", are the very large numbers of small data sets, such as the images, spreadsheets and tables collected in laboratories and field studies. While the individual files are small, their numbers add up; in fact, there is as much data aggregated in these small items as in the biggest collections. On the other hand, these data sets are often not as well documented as their bigger counterparts. For most scientists there is little reward in becoming a data management expert and devoting the time required to document the data for later reuse. In fact, the process of manually cleaning data sets has been called the strip mining of big data: an ugly and resource-intensive effort that leaves big scars.
Scientists at the Johns Hopkins University have built innovative frameworks to publish scientific data across a wide range of disciplines, from astronomy to turbulence and environmental science. These projects already share some common components for data management. This project will connect more of the existing independent components into a coherent whole, explore how to scale the data services to deal with the "long tail" of the data distribution, and demonstrate the overlap in the basic data management tasks across disciplines. The project has four parts: (i) continue and enhance the efforts on the Sloan Digital Sky Survey, (ii) turn large numerical simulations into easy-to-use numerical laboratories, (iii) enhance an existing end-to-end system for environmental sensors and integrate it with other field data, and (iv) enhance and generalize a set of core collaborative tools, and apply these to help with the challenge of the "long tail" of scientific data. The projects involve the Sloan Digital Sky Survey (SDSS), the world's most used astronomy facility, and its CASJOBs/MyDB collaborative environment. The framework will be extended to other areas of science, such as in-situ environmental monitoring and field biology. This will be demonstrated by integrating soil ecology data from the Baltimore Ecosystem Study project with data collected automatically via a wireless sensor network. The project will also test how a simple, "DropBox"-like interface (i.e., online storage and sharing) can be used to overcome some of the barriers that prevent scientists from publishing much of their value-added data. Finally, the project will explore how smaller and larger numerical simulations can be placed into interactive, publicly accessible numerical laboratories, using current data sets from turbulence and astronomy.
The funds will support people: a combination of data scientists, database administrators, postdoctoral fellows, students and programmers working together to "connect the dots" and bring additional data sets on line. The project will enhance the public interfaces of several publicly available data sets, prototype an easy-to-use environment to upload small user data into a collaborative environment, and create a framework for a new citizen-science project in environmental science.
Project outcomes
Journal articles (1)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other publications by Alexander Szalay
Über die in Mg durch Po-α-Teilchen erregten γ-Strahlung
- DOI: 10.1007/bf01475288
- Published: 1940-10-01
- Journal:
- Impact factor: 2.100
- Authors: Alexander Szalay
- Corresponding author: Alexander Szalay
Other grants held by Alexander Szalay
Collaborative Research: Building the Community for the Open Storage Network
- Award number: 1747493
- Fiscal year: 2018
- Amount: $1 million
- Award type: Standard Grant
Cyberinfrastructure (CI) for NSF Large Facilities Workshop
- Award number: 1608742
- Fiscal year: 2015
- Amount: $1 million
- Award type: Standard Grant
CIF21 DIBBs: Long Term Access to Large Scientific Data Sets: The SkyServer and Beyond
- Award number: 1261715
- Fiscal year: 2013
- Amount: $1 million
- Award type: Cooperative Agreement
Collaborative Research: 100G Connectivity for Data-Intensive Computing at JHU
- Award number: 1137045
- Fiscal year: 2012
- Amount: $1 million
- Award type: Standard Grant
Collaborative Research: CDI-Type II: Towards Analyzing Complex Petascale Datasets: The Milky Way Laboratory
- Award number: 1124403
- Fiscal year: 2011
- Amount: $1 million
- Award type: Standard Grant
RI: Large: Collaborative Research: A Robotic Network for Locating and Removing Invasive Carp from Inland Lakes
- Award number: 1111507
- Fiscal year: 2011
- Amount: $1 million
- Award type: Standard Grant
MRI: Development of Data-Scope - A Multi-Petabyte Generic Data Analysis Environment for Science
- Award number: 1040114
- Fiscal year: 2010
- Amount: $1 million
- Award type: Standard Grant
Collaborative: Balanced Scalable Architectures for Data-Intensive Supercomputing
- Award number: 0937947
- Fiscal year: 2009
- Amount: $1 million
- Award type: Standard Grant
The Structure of the Universe Beyond 100 Mpc
- Award number: 0407308
- Fiscal year: 2004
- Amount: $1 million
- Award type: Standard Grant
ITR - ASE - (int+sim): Exploring the Lagrangian Structure of Complex Flows with 100 Terabyte Datasets
- Award number: 0428325
- Fiscal year: 2004
- Amount: $1 million
- Award type: Standard Grant
Similar overseas grants
Data-enabled Pathways to Equity in Cyberinfrastructure Utilization for Scientific Discovery
- Award number: 2346631
- Fiscal year: 2024
- Amount: $1 million
- Award type: Standard Grant
CRII: CSR: Enhancing Eventual Data Consistency in Multidimensional Scientific Computing through Lightweight In-Memory Distributed Ledger System.
- Award number: 2348330
- Fiscal year: 2024
- Amount: $1 million
- Award type: Standard Grant
PFI-TT: A Hybrid Scalable Data Management System Providing Deep Access to the Scientific Knowledge in Data Science
- Award number: 2345794
- Fiscal year: 2024
- Amount: $1 million
- Award type: Continuing Grant
Research Infrastructure: CC* Data Storage: Broadening UMBCs Data Storage footprint to Advance Scientific Research and Discovery
- Award number: 2346667
- Fiscal year: 2024
- Amount: $1 million
- Award type: Standard Grant
Collaborative Research: OAC Core: Topology-Aware Data Compression for Scientific Analysis and Visualization
- Award number: 2313124
- Fiscal year: 2023
- Amount: $1 million
- Award type: Standard Grant
Travel: Workshop on Clusters, Clouds, and Data Analytics for Scientific Computing 2024
- Award number: 2336813
- Fiscal year: 2023
- Amount: $1 million
- Award type: Standard Grant
Collaborative Research: CDS&E: An experimentally validated, interactive, data-enabled scientific computing platform for cardiac tissue ablation characterization and monitoring
- Award number: 2245152
- Fiscal year: 2023
- Amount: $1 million
- Award type: Standard Grant
Collaborative Research: OAC Core: Topology-Aware Data Compression for Scientific Analysis and Visualization
- Award number: 2313122
- Fiscal year: 2023
- Amount: $1 million
- Award type: Standard Grant
CSR: Small: Accelerating Data Intensive Scientific Workflows with Consistency Contracts
- Award number: 2317556
- Fiscal year: 2023
- Amount: $1 million
- Award type: Standard Grant
Collaborative Research: OAC Core: Topology-Aware Data Compression for Scientific Analysis and Visualization
- Award number: 2313123
- Fiscal year: 2023
- Amount: $1 million
- Award type: Standard Grant