Performance, Portability, and Productivity for Deep Learning Applications on Multi- and Many-Core Architectures (PPP-DL)

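Basic information

Project number:
470527619
Principal investigator:
Professor Dr. Sergei Gorlatch
Funding amount:
--
Host institution:
School of Informatics
Host institution country:
Germany
Project type:
Research Grants
Fiscal year:
Funding country:
Germany
Duration:
Project status:
Ongoing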

Source:
https://gepris.dfg.de/gepris/projekt/470527619?language=en
Keywords:
Performance, Portability, Productivity, Deep Learning

Project abstract

Deep Learning (DL) is currently the most popular machine-learning method and is used to solve a great variety of real-world problems in academia and industry. The success of DL applications critically depends on the quality of the software that implements DL algorithms for modern parallel architectures such as multi-core CPUs, Graphics Processing Units (GPUs), and Field-Programmable Gate Arrays (FPGAs). State-of-the-art DL frameworks such as TensorFlow and PyTorch rely for high performance on general-purpose libraries provided by vendors such as Intel or NVIDIA, which causes major weaknesses in three fundamental aspects: i) suboptimal performance, because many DL-specific optimizations are not applicable given the libraries' focus on general-purpose usage; ii) a lack of both functional and performance portability, because the libraries are designed and optimized for the architectures of particular vendors only; iii) restricted user productivity, because the libraries are limited to a fixed set of pre-implemented algorithms (e.g., matrix multiplication and convolutions), and integrating high-performance libraries into DL frameworks is cumbersome. This project will develop a novel, holistic approach to automatic code generation and optimization for DL applications targeting modern parallel architectures; its overall goal is to address, in one combined approach, three major research challenges in high-performance computing for DL: Performance, Portability, and Productivity (PPP).
We plan to achieve the goal of the project through the following new contributions: 1) a new algebraic formalism and a formalism-based Domain-Specific Language (DSL) for conveniently expressing and implementing established and emerging DL applications at a high level of abstraction, thereby contributing to programmer productivity; 2) a uniform low-level programming model for DL applications, which enables functional portability of code because it can be straightforwardly lowered to executable code in state-of-practice parallel programming approaches such as OpenMP, CUDA, and OpenCL; 3) a code generation mechanism for our DSL that enables high, portable performance across various architectures and input/output characteristics by automatically generating auto-tunable code in our low-level programming model; 4) a systematic process that integrates our code generation mechanism into modern DL frameworks, based on the emerging MLIR framework; 5) a new auto-tuning system that fully automatically optimizes our generated code via combined numerical search techniques; 6) a new analytical cost model that predicts, for different architectures, the run time of DL applications expressed in our DSL, in order to accelerate the auto-tuning process. We will experimentally compare our approach with state-of-the-art approaches in terms of all three criteria (performance, portability, and productivity) for a broad range of DL applications, parallel architectures, and real-world DL data sets.
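
To make contributions 5) and 6) more concrete, the following is a minimal sketch of how an analytical cost model can speed up auto-tuning: the model ranks candidate configurations cheaply, and only a short list is actually timed. It is a toy illustration, not the project's system. The abstract does not describe the DSL, the tuner, or the cost model, so the names `tiled_matmul`, `predicted_cost`, and `autotune`, the single tuning parameter (a square tile size), and the constants in the cost formula are all hypothetical assumptions.

```python
import time


def tiled_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        a_ik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a_ik * row_b[j]
    return c


def predicted_cost(n, tile, cache_elems=4096):
    """Toy analytical cost model (hypothetical formula and constants).

    Charges the arithmetic work, a fixed overhead per tile iteration, and a
    penalty when the three tiles in flight exceed an assumed cache capacity.
    """
    tiles_per_dim = -(-n // tile)              # ceiling division
    tile_iterations = tiles_per_dim ** 3
    arithmetic = n ** 3
    loop_overhead = 50 * tile_iterations       # assumed cost per tile iteration
    working_set = 3 * tile * tile              # one tile each of A, B, and C
    spill = max(0, working_set - cache_elems)  # elements that do not fit
    return arithmetic + loop_overhead + spill * tile_iterations


def measure(a, b, n, tile):
    """Return the wall-clock time of one tiled multiplication."""
    start = time.perf_counter()
    tiled_matmul(a, b, n, tile)
    return time.perf_counter() - start


def autotune(n, candidates, keep=3):
    """Rank candidates with the cost model, then time only the best `keep`."""
    a = [[float((i + j) % 7) for j in range(n)] for i in range(n)]
    b = [[float((i * j) % 5) for j in range(n)] for i in range(n)]
    shortlist = sorted(candidates, key=lambda t: predicted_cost(n, t))[:keep]
    timings = {t: measure(a, b, n, t) for t in shortlist}
    best = min(timings, key=timings.get)
    return best, timings


if __name__ == "__main__":
    best, timings = autotune(64, [1, 2, 4, 8, 16, 32, 64], keep=3)
    print("measured candidates:", sorted(timings))
    print("best tile size:", best)
```

In this sketch the cost model only prunes the search space; the final choice still rests on measured run times, so an inaccurate model costs tuning quality but never correctness.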