Data Exploration and Predictive Analytics for Music Publishing

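Basic Information

Grant number:
EP/M507076/1
Principal investigator:
Christoforos Anagnostopoulos
Amount:
approx. $148,300
Host institution:
Imperial College London
Host institution country:
United Kingdom
Project category:
Research Grant
Fiscal year:
2014
Funding country:
United Kingdom
Start and end dates:
2014 to (no data)
Project status:
Completed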

Source:
https://gtr.ukri.org/projects?ref=EP%2FM507076%2F1
Keywords:
Data Exploration; Predictive Analytics; Music

Project Abstract

The PDRA will liaise with the developers at Sentric Music to ensure that a broad array of diverse data sources is linked and preprocessed in a statistically sound manner, and that the final version of the data is in a format conducive to machine learning and statistical inference (e.g., unstructured data will need to be pre-parsed into structured data). The PDRA will need to use a broad suite of "data science" skills to achieve this, including computing skills as well as statistical expertise.

The second objective will involve representing the problem from a statistical viewpoint, as a problem of predicting the future value of a quantity of interest (in this case, earnings) on the basis of attributes of the artist and/or their songs, such as past earnings, genre, fan base, etc. To choose an appropriate model, two types of considerations come into play: the format of the data, and our expectations about the types of relationships we are trying to capture. We discuss both in turn.

With regard to data format, this particular application is likely to give rise to a large number of attributes of various types (e.g., each song or artist will be represented in numeric ways, placed into categories, or rated according to possibly different scales, etc.). Automatic feature selection techniques will be required to ensure that information-poor attributes are excluded from consideration, to avoid contaminating the results. Moreover, there is a natural hierarchical structure to this problem, introduced by the relationship between an artist and their songs. Both these aspects challenge off-the-shelf statistical models and require a bespoke model.

With regard to the choice of model, it is known that typically in Big Data, as the data set size increases, so does the heterogeneity in the data, and failing to account for this can lead to over-confident and inaccurate predictions. One solution is to employ a "divide and conquer" approach by using decision trees, which segment the initial dataset and fit a separate statistical model in each segment. This approach achieves flexibility without compromising on computational efficiency. Notably, the output of such models remains interpretable by the end user because it closely resembles the manual segmentation already used extensively in marketing and, currently, by Sentric. The difference is that the segmentation rules are extracted from the data in a principled, automatic fashion. Another consideration in choosing the model is its ability to output the confidence of its own predictions. Failure to do so can introduce risks, since only confident predictions should be used for decision-making. Adopting a Bayesian framework is a natural way to achieve this objective. Our favored approach overall is the framework of Bayesian Dynamic Trees, which combines flexibility, statistical soundness, scalability using cutting-edge methods, as well as a built-in ability to adapt to data evolution at no extra computational cost [Anagnostopoulos, 2013]. This framework will have to be extended to handle this problem, specifically: the hierarchical relationship between artists and their songs; the diversity of available attributes; and the need to produce forecasts over possibly longer time horizons.

Finally, the PDRA will supervise and contribute to the deployment of the model within Sentric, as well as the design of the User Interface that will be made available to the artists. The former will involve scalability considerations, and the latter will involve innovation in visualisation and communication of uncertainty.
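
To make the "divide and conquer" idea in the abstract concrete, the following is a minimal sketch, not the project's method and not Bayesian Dynamic Trees. It assumes a single numeric attribute and a single split (a "stump"), fits a separate mean model in each segment, and attaches a predictive spread to each prediction so that only confident predictions are flagged for use. The spread used here, s * sqrt(1 + 1/n), is the scale of the Student-t posterior predictive for a normal model under a standard noninformative prior. All names (`fan_base`, `earnings`, `fit_stump`, `max_rel_sd`) and all data values are hypothetical and invented for illustration.

```python
import math
import statistics


def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = statistics.fmean(ys)
    return sum((y - m) ** 2 for y in ys)


def best_split(xs, ys, min_leaf=3):
    """Return the threshold on x minimising total within-segment SSE."""
    best = None
    for t in sorted(set(xs))[:-1]:  # the largest value cannot split anything
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t)
    return None if best is None else best[1]


def fit_leaf(ys):
    """Per-segment model: mean plus a predictive spread (needs len(ys) >= 2)."""
    n = len(ys)
    mean = statistics.fmean(ys)
    # Observation noise plus uncertainty about the segment mean.
    pred_sd = statistics.stdev(ys) * math.sqrt(1 + 1 / n)
    return {"n": n, "mean": mean, "pred_sd": pred_sd}


def fit_stump(xs, ys):
    """Segment the data with one data-driven rule, then fit each segment."""
    t = best_split(xs, ys)
    if t is None:  # no admissible split: keep a single segment
        return {"threshold": None, "left": fit_leaf(ys), "right": None}
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    return {"threshold": t, "left": fit_leaf(left), "right": fit_leaf(right)}


def predict(model, x, max_rel_sd=0.5):
    """Return (mean, predictive spread, confident?) for attribute value x."""
    t = model["threshold"]
    leaf = model["left"] if t is None or x <= t else model["right"]
    # Hypothetical decision rule: trust a prediction only if its spread
    # is small relative to the predicted value.
    confident = leaf["pred_sd"] <= max_rel_sd * abs(leaf["mean"])
    return leaf["mean"], leaf["pred_sd"], confident


# Invented toy data: fan-base size per artist and last-period earnings.
fan_base = [100, 150, 200, 250, 5000, 6000, 7000, 8000]
earnings = [10, 12, 11, 13, 480, 520, 500, 510]

model = fit_stump(fan_base, earnings)
print("split at fan_base <=", model["threshold"])
print("small artist:", predict(model, 180, max_rel_sd=0.1))
print("large artist:", predict(model, 6500, max_rel_sd=0.1))
```

The learned threshold plays the role of the manual segmentation rule mentioned in the abstract, except that it is extracted from the data. A full treatment would grow deeper trees, fit richer per-segment models, and place a prior over tree structures, none of which is attempted here.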
