CCRI: ENS: Collaborative Research: Open Computer System Usage Repository and Analytics Engine
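Basic information
- Award number: 2016704
- Principal investigator:
- Amount: $1.1839 million
- Host institution:
- Host institution country: United States
- Project type: Standard Grant
- Fiscal year: 2020
- Funding country: United States
- Period: 2020-10-01 to 2023-09-30
- Status: Completed
- Source:
- Keywords:
Project abstract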
In science and engineering research, large-scale, centrally managed computing clusters or “supercomputers” have been instrumental in enabling the kinds of resource-intensive simulations, analyses, and visualizations that have been used in computer-aided drug discovery, high-strength materials design for cars and jet engines, and disease vector analysis, to name a few. Such clusters are complex systems composed of several hundred to several thousand computer servers with fast network connections between them, various data storage resources, and highly optimized scientific software shared among several hundred researchers from diverse domains. Consequently, the overall dependability of such systems relies on the dependability of these individual, highly interconnected elements as well as on the characteristics of cascading failures. While computer systems researchers and practitioners have been at the forefront of designing and deploying dependable computing cluster systems, this task has been hampered by the lack of publicly available, real-world failure data from supercomputers currently in operation. Prior practice has largely involved tedious, manual collection and curation of small sets of data for use in specific analyses. This project will establish seamless, automated pipelines for acquiring, processing, and curating continuous, detailed system usage, monitoring, and failure data from large computing clusters at two organizations, Purdue University and the University of Texas at Austin. This data will be disseminated through a publicly accessible portal and complemented by a suite of in-situ analytics capabilities that will support and spur research in dependable computing systems. The data acquisition pipeline and analytics software will be made open-source and designed for ease of federation, extension, and adoption on cluster systems operated by other organizations.
Cluster computing systems are a key resource in time-sensitive, computationally intensive research such as virus structure modeling and drug discovery, and have been at the forefront of efforts to tackle global pandemics. Both unanticipated system downtime and a lack of actionable feedback to researchers on computational failures can have adverse effects on research timeliness and efficiency. This project will allow the practitioners and administrators of these systems to develop data-backed best practices for ensuring high availability and utilization of their clusters. The resulting large, public data repository, consisting of data from clusters with diverse workloads spanning traditional high-performance computing, modern accelerator-based computing (for example, on graphics processing units (GPUs)), and cloud-style applications, will allow the systems research community to consider forward-looking research questions based on real system data. The project will train a cadre of students in data analysis on live production systems, providing them with a unique learning experience in which they interface with a variety of stakeholders.
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.
Project outcomes
Journal articles: 9
Monographs: 0
Research awards: 0
Conference papers: 0
Patents: 0
ORION and the Three Rights: Sizing, Bundling, and Prewarming for Serverless DAGs
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Mahgoub, Ashraf; Yi, Edgardo Barsallo; Shankar, Karthick; Elnikety, Sameh; Chaterji, Somali; Bagchi, Saurabh
- Corresponding author: Bagchi, Saurabh
Root Cause Analysis of Failures in Microservices through Causal Discovery
- DOI:
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Azam Ikram; Sarthak Chakraborty; Subrata Mitra; S. Saini; S. Bagchi; Murat Kocaoglu
- Corresponding author: Azam Ikram; Sarthak Chakraborty; Subrata Mitra; S. Saini; S. Bagchi; Murat Kocaoglu
SONIC: Application-aware Data Passing for Chained Serverless Applications
- DOI:
- Published: 2021
- Journal:
- Impact factor: 0
- Authors: Ashraf Y. Mahgoub; K. Shankar; S. Mitra; Ana Klimovic; S. Chaterji; S. Bagchi
- Corresponding author: Ashraf Y. Mahgoub; K. Shankar; S. Mitra; Ana Klimovic; S. Chaterji; S. Bagchi
Closing-the-Loop: A Data-Driven Framework for Effective Video Summarization
- DOI: 10.1109/ism.2020.00042
- Published: 2020-12
- Journal:
- Impact factor: 0
- Authors: Ran Xu; Haoliang Wang; Stefano Petrangeli; Viswanathan Swaminathan; S. Bagchi
- Corresponding author: Ran Xu; Haoliang Wang; Stefano Petrangeli; Viswanathan Swaminathan; S. Bagchi
An Automated Approach to Re-Hosting Embedded Firmware Through Removing Hardware Dependencies.
- DOI: 10.2172/2006057
- Published: 2022
- Journal:
- Impact factor: 0
- Authors: Ketterer, Austin; Shekar, Asha; Yi, Edgardo; Bagchi, Saurabh; Clements, Abraham
- Corresponding author: Clements, Abraham
Other publications by Saurabh Bagchi
Intrusion detection in voice over IP environments
- DOI: 10.1007/s10207-008-0071-0
- Published: 2008-12-16
- Journal:
- Impact factor: 3.200
- Authors: Yu-Sung Wu; Vinita Apte; Saurabh Bagchi; Sachin Garg; Navjot Singh
- Corresponding author: Navjot Singh
Erratum to: ‘MicroRNA target prediction using thermodynamic and sequence curves’
- DOI: 10.1186/s12864-016-2367-1
- Published: 2016-03-09
- Journal:
- Impact factor: 3.700
- Authors: Asish Ghoshal; Raghavendran Shankar; Saurabh Bagchi; Ananth Grama; Somali Chaterji
- Corresponding author: Somali Chaterji
A Survey Article on Wormhole Attack Detection and Security in Wireless Sensor Networks
- DOI: 10.5120/ijca2017915666
- Published: 2017
- Journal:
- Impact factor: 0
- Authors: Gaurav Tejpal; Sonal Sharma; Khalil; Issa; Saurabh Bagchi; N. Shroff; S. Krishnamurthy
- Corresponding author: S. Krishnamurthy
Reliable and Efficient Distributed Checkpointing System for Grid Environments
- DOI: 10.1007/s10723-014-9297-4
- Published: 2014-05-20
- Journal:
- Impact factor: 2.900
- Authors: Tanzima Zerin Islam; Saurabh Bagchi; Rudolf Eigenmann
- Corresponding author: Rudolf Eigenmann
Other grants held by Saurabh Bagchi
NSF Workshop on State-of-the-Art and Challenges in Resilience
- Award number: 2140139
- Fiscal year: 2021
- Amount: $1.1839 million
- Project type: Standard Grant
NSF Workshop on State-of-the-Art and Challenges in Resilience
- Award number: 1845192
- Fiscal year: 2018
- Amount: $1.1839 million
- Project type: Standard Grant
CI-NEW: Collaborative Research: Computer System Failure Data Repository to Enable Data-Driven Dependability
- Award number: 1513197
- Fiscal year: 2015
- Amount: $1.1839 million
- Project type: Standard Grant
CSR: Small: Diagnosing Performance and Correctness Errors in Parallel Applications at Large Scales
- Award number: 1527262
- Fiscal year: 2015
- Amount: $1.1839 million
- Project type: Standard Grant
CI-P: Computer System Failure Data Repository to Enable Data-Driven Dependability Research
- Award number: 1405906
- Fiscal year: 2014
- Amount: $1.1839 million
- Project type: Standard Grant
NeTS: Medium: Collaborative Research: Tango: Performance and Fault Management in Cellular Networks through Device-Network Cooperation
- Award number: 1409506
- Fiscal year: 2014
- Amount: $1.1839 million
- Project type: Continuing Grant
Travel Grants for Attending the 29th IEEE Symposium on Reliable Distributed Systems (SRDS)
- Award number: 1047647
- Fiscal year: 2010
- Amount: $1.1839 million
- Project type: Standard Grant
CSR: Small: Monitoring for Error Detection in Today's High Throughput Applications
- Award number: 0916337
- Fiscal year: 2009
- Amount: $1.1839 million
- Project type: Standard Grant
NeTS-NOSS: Robust Sensor Network Architecture through Neighborhood Monitoring and Isolation
- Award number: 0626830
- Fiscal year: 2006
- Amount: $1.1839 million
- Project type: Standard Grant
Sensors: Smart RF Antennas for Reliable and Real-Time Sensor Networks
- Award number: 0330016
- Fiscal year: 2003
- Amount: $1.1839 million
- Project type: Standard Grant
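Similar National Natural Science Foundation of China grants
Mechanism of electroacupuncture in treating functional dyspepsia, explored through tryptophan-metabolism regulation of the ENS pathway
- Award number: JCZRLH202500075
- Year approved: 2025
- Amount: CNY 0
- Project type: Provincial/municipal project
Mechanism by which Baizhu Qiwu granules regulate the ENS-ICC-SMC network to treat STC of the qi-yin deficiency type, based on the GDNF/PI3K/AKT signaling pathway
- Award number: 2025JJ90111
- Year approved: 2025
- Amount: CNY 0
- Project type: Provincial/municipal project
Molecular mechanism by which the rice EnS150 gene regulates seed dormancy and germination
- Award number: 32301853
- Year approved: 2023
- Amount: CNY 300,000
- Project type: Young Scientists Fund project
Fucosylation-modified MSCs mediate GDNF positive feedback to regulate enteric neuron pyroptosis and ENPC autophagy, promoting ENS reconstruction
- Award number: n/a
- Year approved: 2023
- Amount: CNY 0
- Project type: Provincial/municipal project
Mechanism by which Clostridium sporogenes regulates ENPC autophagy via the “IPA-AHR-mTOR” axis and participates in ENS reconstruction in diabetes
- Award number: 82300616
- Year approved: 2023
- Amount: CNY 300,000
- Project type: Young Scientists Fund project
Mechanism by which lycopene improves intestinal motility through regulation of gut microbiota/5-HT/ENS
- Award number:
- Year approved: 2021
- Amount: CNY 300,000
- Project type: Young Scientists Fund project
Role and mechanism of MSC extracellular vesicles regulating the SETD2/H3K36 axis of ENPCs in ENS reconstruction in diabetes
- Award number:
- Year approved: 2021
- Amount: CNY 300,000
- Project type: Young Scientists Fund project
Mechanism by which active components of Arisaema (Tiannanxing) inhibit mitochondrial fission and promote M2 microglial polarization to improve ischemic stroke, based on lncRNA Ens6
- Award number: 82003976
- Year approved: 2020
- Amount: CNY 240,000
- Project type: Young Scientists Fund project
Role and mechanism of fucosylation in MSC-mediated ENS reconstruction
- Award number: 81974068
- Year approved: 2019
- Amount: CNY 550,000
- Project type: General Program project
Regulatory mechanism of patchouli active components on enteric neural homeostasis in IBS-D, explored through cross-talk between muscularis macrophages (MM) and the ENS
- Award number: 81973586
- Year approved: 2019
- Amount: CNY 550,000
- Project type: General Program project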
Similar overseas grants
Collaborative Research: Research Infrastructure: CCRI: ENS: Enhanced Open Networked Airborne Computing Platform
- Award number: 2235160
- Fiscal year: 2023
- Amount: $1.1839 million
- Project type: Standard Grant
Collaborative Research: Research Infrastructure: CCRI: ENS: Enhanced Open Networked Airborne Computing Platform
- Award number: 2235157
- Fiscal year: 2023
- Amount: $1.1839 million
- Project type: Standard Grant
Collaborative Research: Research Infrastructure: CCRI: ENS: Enhanced Open Networked Airborne Computing Platform
- Award number: 2235158
- Fiscal year: 2023
- Amount: $1.1839 million
- Project type: Standard Grant
Collaborative Research: Research Infrastructure: CCRI: ENS: Enhanced Open Networked Airborne Computing Platform
- Award number: 2235159
- Fiscal year: 2023
- Amount: $1.1839 million
- Project type: Standard Grant
Collaborative Research: CCRI: ENS: Boa 2.0: Enhancing Infrastructure for Studying Software and its Evolution at a Large Scale
- Award number: 2120448
- Fiscal year: 2021
- Amount: $1.1839 million
- Project type: Standard Grant
Collaborative Research: CCRI: ENS: Boa 2.0: Enhancing Infrastructure for Studying Software and its Evolution at a Large Scale
- Award number: 2120386
- Fiscal year: 2021
- Amount: $1.1839 million
- Project type: Standard Grant
Collaborative Research: CCRI: ENS: Boa 2.0: Enhancing Infrastructure for Studying Software and its Evolution at a Large Scale
- Award number: 2120345
- Fiscal year: 2021
- Amount: $1.1839 million
- Project type: Standard Grant
CCRI: ENS: Collaborative Research: ns-3 Network Simulation for Next-Generation Wireless
- Award number: 2016379
- Fiscal year: 2020
- Amount: $1.1839 million
- Project type: Standard Grant
CCRI: ENS: Collaborative Research: ns-3 Network Simulation for Next-Generation Wireless
- Award number: 2016381
- Fiscal year: 2020
- Amount: $1.1839 million
- Project type: Standard Grant
CCRI: ENS: Collaborative Research: Enabling Automated Language Support for the srcML Infrastructure
- Award number: 2016452
- Fiscal year: 2020
- Amount: $1.1839 million
- Project type: Standard Grant