Utilizing unlabeled data for machine learning tasks - theoretical analysis

Basic information

Grant number:
RGPIN-2015-04654
Principal investigator:
BenDavid, Shai
Amount:
approximately $26,200
Host institution:
University of Waterloo
Host institution country:
Canada
Program:
Discovery Grants Program - Individual
Fiscal year:
2016
Funding country:
Canada
Period:
2016-01-01 to 2017-12-31
Status:
Completed

Source:
https://www.nserc-crsng.gc.ca/ase-oro/Details-Detailles_eng.asp?id=613090
Keywords:
Utilizing unlabeled data machine learning

Project abstract

Mainstream machine learning tools depend on the availability of human-annotated training data. Nowadays, in what is called the "big data" era, applications of machine learning have access to very large amounts of un-annotated (a.k.a. unlabeled) data. Consequently, there is great interest in designing machine learning tools that can use such big pools of raw, unannotated data to reduce the need for human intervention in the learning process. Various recently arising machine learning paradigms address this issue. These include Clustering, Active Learning, Semi-Supervised Learning, Domain Adaptation and Transfer Learning, as well as Learning from "Weak Teachers" (such as supervision obtained via crowdsourcing). To cope with such scenarios, learning practitioners have developed heuristics that, while apparently working reasonably well in practice, are not supported by existing mathematical analyses. In the past couple of decades, machine learning has provided a resounding demonstration of the impact of theoretical analysis on the development of practical applications. Algorithmic paradigms like Support-Vector-Machines, Decision-Trees and Boosting grew from theoretical models into popular and widely applicable software packages. Can the success of theoretical analysis of machine learning be extended to the modern big-data, minimal-human-intervention scenarios? The proposed research aims to provide a basis for such developments by building mathematical support for those emerging machine learning and data mining paradigms. Some examples of recent initiatives taken by my team in that direction include a program that aims to provide tools for guiding users who wish to cluster big data sets on how to choose appropriate clustering algorithms and parameter settings. Such choices are critical to the success of clustering applications, and yet have so far been made in an ad hoc fashion.
The PhD thesis of my student Margareta Ackerman took first steps in this direction, but there are still big challenges to overcome, both in developing such tools and in raising the data mining community's awareness of the significance of matching clustering tasks with the algorithms employed to address them. Another project addresses the task of utilizing "weak supervision". The use of annotation by novice supervisors to help collect training data for classification prediction tasks has been drawing research attention recently due to the growing popularity of crowdsourcing. In contrast with much of the current research in this direction, we consider the scenario in which the weak supervision is used in cooperation with expert human supervision, aiming to reduce (rather than eliminate) calls to the expert. We are developing novel mathematical models to zoom in on the instances for which novice-generated labels cannot be trusted and need to be scrutinized by a human expert.
