Methods for Hypothesis-driven Analysis of Sequential Data (HydrAS)

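Basic information

Grant number:
438232455
Principal investigator:
Professor Dr. Andreas Hotho
Amount:
--
Host institution:
Lehrstuhl für Informatik X: Data Science
Host institution country:
Germany
Project category:
Research Grants
Fiscal year:
Funding country:
Germany
Project period:
Project status:
Not yet concluded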

Source:
https://gepris.dfg.de/gepris/projekt/438232455?language=en
Keywords:
Methods Hypothesis driven Analysis Sequential

Project abstract

The increased availability of large-scale digital trace data on human behavior requires the development of suitable algorithmic approaches in computer and data science. Such data often comes in the form of sequences, e.g., sequences of visited websites or of locations in cities. To analyze this kind of data and extract knowledge at scale, the applicants and others presented a novel computational approach that uses a Bayesian framework to compare hypotheses (derived from intuition, previous studies, or social theories) with respect to their plausibility given the observed sequences. In this project, we will develop fundamentally new data analysis methods in that direction that overcome current shortcomings. To this end, we will (1) systematize and simplify the process of hypothesis elicitation by integrating (semi-)automatic procedures for deriving interpretable base hypotheses from background knowledge and for combining base hypotheses with each other. Additionally, we aim to (2) develop methods that partition data sequences such that each part of the data can be succinctly described in terms of background information on the features and the transition behavior in each partition can be explained by given hypotheses, in order to account for heterogeneity in the data. Finally, we will (3) extend the general framework of hypothesis-based analysis of sequential data, which currently focuses on simple first-order Markov chain models, to more complex models such as hidden Markov models, continuous-time Markov chain models, or neural networks for sequential data. This would allow us to formalize more complex and more fine-grained hypotheses, to pick the models most suitable for a specific scenario, and to integrate additional information (e.g., time information) in an easily understandable way.
In contrast to many recently proposed methods in data science and machine learning, our research will not focus on methods that yield the maximum predictive power. Instead, we concentrate on finding potential explanations of the data generation process that human domain experts can understand, by incorporating their hypotheses directly into the analysis process. The project will thus provide unique opportunities to integrate hypothesis-driven data analysis with advanced machine learning techniques in order to support the understanding of the underlying processes generating the observed sequences. While this project focuses on developing new data science methods for analyzing human behavior, we expect the results to be easily transferable to other application areas featuring sequential data.
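
The Bayesian comparison of hypotheses over first-order Markov chain models mentioned in the abstract can be illustrated with a minimal sketch. It is not the applicants' implementation. It assumes that each hypothesis is a row-stochastic matrix of believed transition probabilities, that each row gets a Dirichlet prior with pseudo-counts `1 + concentration * belief`, and that hypotheses are ranked by their log marginal likelihood (evidence). The function names, the `concentration` parameter, the toy sequences, and the three hypotheses are all hypothetical and chosen only for illustration.

```python
from math import lgamma


def transition_counts(sequences, states):
    """Count observed first-order transitions a -> b over all sequences."""
    index = {s: k for k, s in enumerate(states)}
    counts = [[0] * len(states) for _ in states]
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[index[a]][index[b]] += 1
    return counts


def log_evidence(counts, hypothesis, concentration=5.0):
    """Log marginal likelihood of the transitions under one hypothesis.

    Assumed prior: one Dirichlet distribution per row of the transition
    matrix, with pseudo-counts alpha = 1 + concentration * belief.
    The result is the closed-form Dirichlet-multinomial evidence.
    """
    total = 0.0
    for n_row, h_row in zip(counts, hypothesis):
        alpha = [1.0 + concentration * h for h in h_row]
        total += lgamma(sum(alpha)) - lgamma(sum(alpha) + sum(n_row))
        total += sum(lgamma(a + n) - lgamma(a) for a, n in zip(alpha, n_row))
    return total


# Toy data (hypothetical): users mostly move A -> B -> C -> A.
states = ["A", "B", "C"]
sequences = [["A", "B", "C", "A", "B", "C", "A"], ["B", "C", "A", "B"]]
counts = transition_counts(sequences, states)

# Hypothetical hypotheses expressed as belief matrices (rows sum to 1).
hypotheses = {
    "cycle": [[0, 1, 0], [0, 0, 1], [1, 0, 0]],
    "uniform": [[1 / 3] * 3 for _ in states],
    "self_loop": [[1, 0, 0], [0, 1, 0], [0, 0, 1]],
}

# Rank hypotheses by evidence; higher means more plausible given the data.
ranking = sorted(hypotheses, key=lambda h: log_evidence(counts, hypotheses[h]), reverse=True)
for name in ranking:
    print(name, round(log_evidence(counts, hypotheses[name]), 3))
```

The difference between two log evidences is a log Bayes factor, so the ranking states how strongly the observed transitions favor one hypothesis over another under the assumed prior scheme. Larger values of `concentration` express stronger belief in a hypothesis and penalize it more when the data disagree.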
