Approximate Inference for Latent Position Models

Basic Information

Grant number: RGPIN-2022-03012
Principal investigator: Smith, Aaron
Amount: approx. $22,600
Host institution: University of Ottawa
Host institution country: Canada
Program: Discovery Grants Program - Individual
Fiscal year: 2022
Funding country: Canada
Project period: 2022-01-01 to 2023-12-31
Project status: Completed

Source: https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=759217
Keywords: Approximate Inference; Latent Position Models

Project Abstract

Toy Example and Introduction: Fans of competitive games, from hockey to videogames such as DOTA, enjoy ranking teams based on their performance. A very simple statistical model for ranking might assume that every team has a single unobserved latent characteristic (their "true skill"), then model the chance of winning a match as a simple function of the "true skills" of the teams playing. A statistician could use this model to infer the ranking and "true skills" of various teams based on the observed games. Furthermore, the statistician could use tools such as Markov chain Monte Carlo (MCMC) to calculate the uncertainty of each part of the estimated ranking - they might be 98% sure that Tampa Bay is better than Columbus, but only 54% sure that it is better than Vegas. In practice, statisticians develop more complicated models that incorporate many types of team skill, but the basic principles and goals are the same. The NHL has only 32 teams, and so it is easy to fit very complicated models. On the other hand, over 400,000 people play DOTA every day. An algorithm that runs in minutes on your phone for NHL data could take months on your desktop for DOTA data. This discrepancy becomes even worse for calculating certainty estimates. The fundamental "big data" problem illustrated by this example is: the computational cost of fitting latent-position models grows very quickly in the size of the dataset, making many natural statistical analyses computationally intractable. The computational costs grow much more quickly than linearly in the size of the dataset, which means that the problem can't easily be solved by simply buying a slightly better computer. The primary goal of this proposal is to develop algorithms that ameliorate this problem, allowing researchers to fit sophisticated models to substantially larger datasets. The secondary goal is to develop a deeper understanding of the limits of this approach - when one must "give up" and try a different approach. 
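
As a rough illustration of the toy example above, the following Python sketch fits a one-skill-per-team ranking model with a basic sampler. It assumes a logistic (Bradley-Terry-style) win probability, a Gaussian prior on skills, and a random-walk Metropolis sampler. None of these choices comes from the proposal itself, and the game results for the hypothetical teams A, B, and C are made up for illustration.

```python
import math
import random


def win_prob(skill_a, skill_b):
    """Assumed logistic link: chance that a team with skill_a beats one with skill_b."""
    return 1.0 / (1.0 + math.exp(-(skill_a - skill_b)))


def log_posterior(skills, games, prior_sd=2.0):
    """Log posterior up to a constant: Gaussian prior plus win/loss likelihood."""
    lp = -sum(s * s for s in skills) / (2.0 * prior_sd ** 2)
    # Every evaluation visits every observed game, so the cost of a single
    # MCMC step already grows with the size of the dataset.
    for winner, loser in games:
        lp += math.log(win_prob(skills[winner], skills[loser]))
    return lp


def metropolis(games, n_teams, n_steps=20000, step_sd=0.3, seed=0):
    """Random-walk Metropolis over the skill vector, one coordinate per step."""
    rng = random.Random(seed)
    skills = [0.0] * n_teams
    current_lp = log_posterior(skills, games)
    samples = []
    for _ in range(n_steps):
        i = rng.randrange(n_teams)
        proposal = list(skills)
        proposal[i] += rng.gauss(0.0, step_sd)
        proposal_lp = log_posterior(proposal, games)
        # Accept with probability min(1, posterior ratio).
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            skills, current_lp = proposal, proposal_lp
        samples.append(list(skills))
    return samples[n_steps // 2:]  # discard the first half as burn-in


def prob_better(samples, a, b):
    """Fraction of posterior samples in which team a has higher skill than team b."""
    return sum(1 for s in samples if s[a] > s[b]) / len(samples)


# Hypothetical results as (winner, loser) pairs; teams are A=0, B=1, C=2.
games = (
    [(0, 1)] * 8 + [(1, 0)] * 2    # A beat B in 8 of 10 games
    + [(0, 2)] * 5 + [(2, 0)] * 5  # A and C split 10 games
    + [(2, 1)] * 6 + [(1, 2)] * 4  # C beat B in 6 of 10 games
)
samples = metropolis(games, n_teams=3)
p_a_over_b = prob_better(samples, 0, 1)
p_a_over_c = prob_better(samples, 0, 2)
print(f"P(A better than B) ~ {p_a_over_b:.2f}")
print(f"P(A better than C) ~ {p_a_over_c:.2f}")
```

The two printed values play the role of the "98% sure" and "54% sure" statements in the text: they are posterior probabilities for pairwise comparisons, estimated from the sampled skills. With only three teams and thirty games this runs quickly. The sketch is not meant to reproduce the scaling behaviour of the richer latent-position models the proposal targets.
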
Impact: Latent-position models (LPMs) that are almost identical to the "ranking" model described above are not primarily used for sports analysis. They are ubiquitous in cybersecurity (for detecting and prioritizing anomalies), neuroscience (for visualizing functional relationships), and many other areas. Achieving the primary goal would allow more sophisticated versions of these models to be applied to larger datasets, improving inference.
Methodology and Relation to Existing Literature: Achieving the primary goal requires new (i) point estimators related to LPMs and (ii) methods for incorporating such point estimators into MCMC algorithms. The secondary goal is based on new probabilistic "anti-concentration" bounds. Both rely on expertise in MCMC theory. In the long term, solutions to the "big data" problem for MCMC and LPMs will lead to solutions for other MCMC "big data" problems, such as those arising in tensor models.
