
III: Small: Automated Event Classification and Decision Making in Massive Data Streams


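Basic Information

Award number:
1118041
Principal investigator:
Stanislav Djorgovski
Amount:
$500,000
Host institution:
California Institute of Technology
Host institution country:
United States
Award type:
Standard Grant
Fiscal year:
2011
Funding country:
United States
Project period:
2011-08-01 to 2014-07-31
Project status:
Completed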

Source:
https://www.nsf.gov/awardsearch/showAward?AWD_ID=1118041&HistoricalAwards=false
Keywords:
III Small Automated Event Classification

Abstract

As the exponential growth of data volumes and complexity continues in all sciences (and indeed in all other fields of modern society, economy, commerce, security, etc.), there is a growing need for powerful new tools and methodologies which can help us extract knowledge and understanding from these massive data sets and data streams. The newly gained knowledge is often used to guide our actions, and in science that typically means follow-up studies and measurements, as the research cycle continues. As the data rates and volumes increase, it becomes necessary to take humans out of the loop, and develop automated methods for time-critical knowledge extraction and optimized response to anomalous or interesting events found by the data processing pipelines. This proposal is to develop a system that will be an example of a new generation of scientific experiments and methods that involve real-time mining of massive data streams, and dynamical follow-up strategies. The system would be developed and validated in the context of real scientific situations from the emerging field of time-domain astronomy. A new generation of synoptic sky surveys covers the sky repeatedly, detecting variable or transient phenomena, over a broad range of astrophysics, from the Solar system and stellar evolution, to cosmology and extreme relativistic objects; from extrasolar planets to gamma-ray bursts and supernovae as probes of dark energy. As we explore the observable parameter space, there is a real possibility of discovery of new types of objects and phenomena. The system will enable exciting new astrophysics, and facilitate discovery. The key to this is a fully automated classification and prioritization of the transient events, and their follow-up observations.
This poses some interesting challenges for applied computer science, especially in the area of Machine Learning, including automated classification where only sparse, incomplete, and heterogeneous data are available, and contextual information and domain expertise must be folded into the process. The process must be dynamic, incorporating new data as they become available, and revising the classifications accordingly. The system would then automatically generate decisions for an optimal follow-up of the most interesting events, given the limited assets and resources available. This project will aid the entire astronomical community in developing new scientific strategies and procedures in the era of large synoptic sky surveys, facilitate data sharing and re-use, and stimulate further development of Virtual Observatory capabilities. The methods and experiences gained here will be described in the open literature, so that they may find a broader use outside astronomy, wherever similar time-critical situations occur, thus fostering constructive new synergies between applied computer science and other domains. The proposers will train undergraduate and graduate students and postdocs in the methods of scientific computing and computational thinking, and develop effective EPO materials, touching on both the new science and computation.

The challenges posed by knowledge extraction in the era of data abundance become even sharper in time-critical situations where we mine information from massive data streams, especially when the phenomena under study are short-lived, and/or a rapid follow-up reaction is needed. Potentially interesting phenomena and events must be identified, classified, and prioritized in real time, typically using some combination of the new measurements, and existing archival data and models.
Then an optimal decision has to be made as to what is the best follow-up that will provide the essential new information in any given individual case; this can be critical if the follow-up assets are scarce or costly. If the time scales are short, and data rates large, the implication is that humans should be taken out of the loop, and that the classification, prioritization, and follow-up decision process must be fully automated. Machine learning (ML) and machine intelligence tools become a necessity. This proposal is to develop a novel, ML-based system for a real-time classification and prioritization of transient events, using the newly emerging field of time-domain astronomy and synoptic sky surveys as a scientific testbed. The classification problem here is different from the usual situations: the data are sparse and/or incomplete, heterogeneous, and evolving as the new measurements come in; the decision process has to take into account the uncertainties of the classification process, and the available assets; and so on. While the sky surveys detect transient cosmic events, the scientific returns come from their directed follow-up. It is essential to be able to classify and prioritize interesting events, especially as we move from the present Terascale data streams and tens of candidate events per night, to the future Petascale data regime, with literally millions of candidates, only a handful of which can be followed. Given the problem of data incompleteness and sparsity, the proposers will explore the use of Bayesian techniques that can operate on a set of expert-developed and ML-based priors, using the currently best available data. Some of the methodological challenges include incorporation of the contextual information and human expertise and optimal combination of separate classifier outputs, as well as new methods developed in this project. 
All of the algorithmic developments will be done keeping the robustness and scalability in mind, and tested on real scientific use cases.
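
The abstract describes three steps: Bayesian classification from sparse or incomplete measurements, revision of the classification as new data arrive, and a follow-up decision under limited resources. The sketch below is only an illustration of that general idea, not the project's actual method. The class names, priors, Gaussian feature models, feature names (`delta_mag`, `color`), and the greedy score-per-cost rule are all assumptions made up for this example.

```python
import math

# Hypothetical class priors (stand-ins for expert-developed or ML-based priors).
PRIORS = {"supernova": 0.2, "variable_star": 0.7, "unknown": 0.1}

# Hypothetical per-class Gaussian models: feature -> (mean, std).
MODELS = {
    "supernova": {"delta_mag": (3.0, 1.0), "color": (0.2, 0.5)},
    "variable_star": {"delta_mag": (1.0, 0.5), "color": (0.8, 0.5)},
    "unknown": {"delta_mag": (2.0, 2.0), "color": (0.5, 1.0)},
}


def gaussian_pdf(x, mean, std):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def update_posterior(belief, measurement):
    """Apply one Bayesian update to `belief` (class -> probability).

    Features whose value is None (not yet measured) are skipped, which
    amounts to marginalizing them out under the naive assumption that
    features are independent given the class.
    """
    unnormalized = {}
    for cls, prob in belief.items():
        likelihood = 1.0
        for feature, value in measurement.items():
            if value is None:
                continue
            mean, std = MODELS[cls][feature]
            likelihood *= gaussian_pdf(value, mean, std)
        unnormalized[cls] = prob * likelihood
    total = sum(unnormalized.values())
    if total == 0.0:
        # Numerical underflow for every class: keep the previous belief.
        return dict(belief)
    return {cls: value / total for cls, value in unnormalized.items()}


def prioritize(events, budget):
    """Greedily select events by score per unit cost within `budget`.

    Each event is a dict with keys "id", "score", and "cost"; the score
    could be, for instance, the posterior probability of a class of interest.
    """
    ranked = sorted(events, key=lambda e: e["score"] / e["cost"], reverse=True)
    chosen, spent = [], 0.0
    for event in ranked:
        if spent + event["cost"] <= budget:
            chosen.append(event["id"])
            spent += event["cost"]
    return chosen


# Usage: the first observation has no color; a later one supplies it.
belief = update_posterior(PRIORS, {"delta_mag": 3.0, "color": None})
belief = update_posterior(belief, {"color": 0.2})
```

Feeding each posterior back in as the next prior is what lets the classification be revised as measurements arrive. The greedy rule is a simple stand-in for the optimal follow-up decision the abstract calls for; it does not account for classification uncertainty beyond the score itself.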
