GPU-based Machine Learning System for fundamental biological research
Basic Information
- Grant number: BB/V019805/1
- Principal investigator:
- Amount: $517,800
- Host institution:
- Host institution country: United Kingdom
- Project category: Research Grant
- Fiscal year: 2021
- Funding country: United Kingdom
- Duration: 2021 to (no data)
- Project status: Completed
- Source:
- Keywords:
Project Summary
In the past decade, the biological sciences have witnessed a major shift towards data-driven research. Consequently, high-performance computing has become a standard research tool in the life sciences. Biological data, however, is not only large but also highly complex. This complexity requires alternatives to conventional data analysis. Machine learning (ML) has emerged as a powerful methodology that can successfully tackle the analysis of complex biological data.

It is a hopeless task to attempt to develop a mathematical model of an elephant. Yet a three-year-old child can easily point to an elephant in a photo. The child was once shown a picture of an elephant and told that the object in the picture was an elephant. In other words, she learnt to recognise an elephant by seeing photos of it, and now she can identify one on her own. ML emulates this learning process on a computer. Instead of building a precise description of patterns, the computer is "taught" to recognise them. This is a paradigm shift from conventional computing and thus brings its own challenges. Notably, the brain is well suited to learning by example, yet it performs poorly at long division. Computers, on the other hand, have been designed to perform numerical operations with great speed and precision. It is therefore not surprising that emulating an inherently heuristic process such as learning on a computer requires substantial computational effort. With recent advances in Graphics Processing Unit (GPU) and Solid State Drive (SSD) technologies, the necessary computing power has become broadly available. Traditional High-Performance Computing facilities, however, are not well suited to ML applications.

ML has been used successfully in biology for more than two decades. An excellent example is the prediction of the viability of cancer cells when exposed to a drug. The idea is to associate a response (e.g., whether a cancer cell survives or not) with a set of characteristics, or features (e.g., which genes are mutated and what the chemical properties of the drug are). In so-called supervised learning, the machine is presented with a large set of training data that contains the correct responses for given input parameters. Based on that data, the machine learns to predict the response for new, previously unseen parameters. A major challenge is that it is often not easy to identify the appropriate features. Cells are very complex, and it is often unclear which features are most relevant to a specific response, e.g., mutations in which genes should be considered. An expert is therefore required to prepare an appropriate training set.

In recent years, so-called deep learning techniques have revolutionised the learning process by allowing the machine to extract the key features from raw data automatically. This is achieved by a set of model neurons, inspired by biological neural cells, organised in a layered network (i.e., a neural network). Information propagates through the layers of the network, which enables each layer to capture progressively more abstract features in the data. This drastically reduces the need for carefully tailored training sets and makes ML applicable to a wider range of problems, especially those for which expert-made training sets are unavailable or too costly to produce. Deep learning approaches, however, require substantial computational resources.
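To make the supervised-learning and deep-learning ideas above concrete, the following is a minimal sketch of the kind of workload involved. It is purely illustrative and not project code: it assumes PyTorch is available (the proposal does not name a framework), and the feature matrix, labels, and layer sizes are invented placeholders standing in for, say, gene-mutation flags and a survive/die response.

```python
# Purely illustrative sketch (not project code): a small layered network trained
# in a supervised fashion on made-up "features -> response" data. Assumes PyTorch.
import torch
import torch.nn as nn

# Train on a GPU when one is present, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Hypothetical training data: 1024 samples, each with 128 input features
# (e.g. gene-mutation flags and drug descriptors) and a binary response
# (e.g. cell survives / dies). The numbers are random placeholders.
x = torch.randn(1024, 128, device=device)
y = torch.randint(0, 2, (1024,), device=device)

# A small layered ("deep") network; each layer builds on the previous one.
model = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 2),
).to(device)

optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(200):            # supervised training loop
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)    # compare predictions with the known responses
    loss.backward()                # gradients are computed on the chosen device
    optimiser.step()

print("final training loss:", loss.item())
```

Even this toy network is trained fastest on a GPU; the `device` line simply moves the model and data onto one when it is available.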
Typical deep learning neural networks contain tens to hundreds of layers, thousands of neurons, and hundreds of thousands of links between them. Training them therefore requires hardware that operates at TFLOPS speeds (trillions of operations per second) and can access data at several GB/s.

The aim of this proposal is to build a dedicated GPU-based system for applying deep-learning ML methods in fundamental biological research at the University of Dundee.
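As a rough illustration of the TFLOPS figure quoted above, the sketch below times a large matrix multiplication, the core operation of neural-network training, and reports the achieved throughput. It is an assumption-laden micro-benchmark, not a specification of the proposed system: it assumes PyTorch, uses an arbitrary matrix size, and falls back to the CPU if no GPU is present.

```python
# Rough, illustrative throughput check (not a rigorous benchmark): times a large
# matrix multiplication and reports the achieved TFLOPS. Assumes PyTorch.
import time
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n = 4096
a = torch.randn(n, n, device=device)
b = torch.randn(n, n, device=device)

# Warm-up so one-off initialisation costs are excluded from the timing.
torch.matmul(a, b)
if device.type == "cuda":
    torch.cuda.synchronize()

reps = 10
start = time.perf_counter()
for _ in range(reps):
    torch.matmul(a, b)
if device.type == "cuda":
    torch.cuda.synchronize()
elapsed = time.perf_counter() - start

flops = 2 * n**3 * reps   # multiplications plus additions in an n x n matmul
print(f"approx. {flops / elapsed / 1e12:.2f} TFLOPS on {device}")
```

The gap between a CPU and a GPU on this measurement is typically orders of magnitude, which is the motivation for a dedicated GPU-based system.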
Project Outcomes
Journal articles (0)
Monographs (0)
Research awards (0)
Conference papers (0)
Patents (0)
Other Publications by Rastko Sknepnek
Vertex model with internal dissipation enables sustained flows
- DOI: 10.1038/s41467-025-55820-2
- Publication date: 2025-01-09
- Journal:
- Impact factor: 15.700
- Authors: Jan Rozman; KVS Chaithanya; Julia M. Yeomans; Rastko Sknepnek
- Corresponding author: Rastko Sknepnek
Cell-Level Modelling of Homeostasis in Confined Epithelial Monolayers
- DOI: 10.1007/s10659-025-10120-0
- Publication date: 2025-02-24
- Journal:
- Impact factor: 1.400
- Authors: KVS Chaithanya; Jan Rozman; Andrej Košmrlj; Rastko Sknepnek
- Corresponding author: Rastko Sknepnek
Similar NSFC Grants
Data-driven Recommendation System Construction of an Online Medical Platform Based on the Fusion of Information
- Grant number:
- Year approved: 2024
- Amount:
- Project category: Research Fund for International Young Scientists
Incentive and governance mechanism study of corporate greenwashing behavior in China: based on an integrated view of the configuration of environmental authority and decoupling logic
- Grant number:
- Year approved: 2024
- Amount:
- Project category: Research Fund for International Scientists
Exploring the Intrinsic Mechanisms of CEO Turnover and Market Reaction: An Explanation Based on Information Asymmetry
- Grant number: W2433169
- Year approved: 2024
- Amount:
- Project category: Research Fund for International Scientists
In situ dynamic study of the nucleation and growth mechanisms of TCP phases in advanced Re- and Ru-containing Ni-based single-crystal superalloys
- Grant number: 52301178
- Year approved: 2023
- Amount: CNY 300,000
- Project category: Young Scientists Fund
Study of the effect of chemical inhomogeneity on the irradiation behaviour of NbZrTi-based multi-principal-element alloys
- Grant number: 12305290
- Year approved: 2023
- Amount: CNY 300,000
- Project category: Young Scientists Fund
Population-based epidemiological study of how the ocular surface microbiota influences the development of dry eye in diabetic patients
- Grant number: 82371110
- Year approved: 2023
- Amount: CNY 490,000
- Project category: General Program
Study of the evolution mechanisms of irradiation-induced dislocation loops in Ni-based UNS N10003 alloy and their effect on mechanical properties
- Grant number: 12375280
- Year approved: 2023
- Amount: CNY 530,000
- Project category: General Program
Study of the structural characteristics and structure-property relationships of CuAgSe-based thermoelectric materials
- Grant number: 22375214
- Year approved: 2023
- Amount: CNY 500,000
- Project category: General Program
A study on a prototype flexible multifunctional graphene foam-based sensing grid
- Grant number:
- Year approved: 2020
- Amount: CNY 200,000
- Project category:
Quantitative big-data study of the impact of urbanisation on seasonal influenza transmission in China and its underlying mechanisms
- Grant number: 82003509
- Year approved: 2020
- Amount: CNY 240,000
- Project category: Young Scientists Fund
Similar Overseas Grants
CAREER: Mitigating the Lack of Labeled Training Data in Machine Learning Based on Multi-level Optimization
- Grant number: 2339216
- Fiscal year: 2024
- Amount: $517,800
- Project category: Continuing Grant
Investigating the potential for developing self-regulation in foreign language learners through the use of computer-based large language models and machine learning
- Grant number: 24K04111
- Fiscal year: 2024
- Amount: $517,800
- Project category: Grant-in-Aid for Scientific Research (C)
STTR Phase II: Optimized manufacturing and machine learning based automation of Endothelium-on-a-chip microfluidic devices for drug screening applications.
- Grant number: 2332121
- Fiscal year: 2024
- Amount: $517,800
- Project category: Cooperative Agreement
A Novel Contour-based Machine Learning Tool for Reliable Brain Tumour Resection (ContourBrain)
- Grant number: EP/Y021614/1
- Fiscal year: 2024
- Amount: $517,800
- Project category: Research Grant
Synergising Process-Based and Machine Learning Models for Accurate and Explainable Crop Yield Prediction along with Environmental Impact Assessment
- Grant number: BB/Y513763/1
- Fiscal year: 2024
- Amount: $517,800
- Project category: Research Grant
SBIR Phase I: An inclusive machine learning-based digital platform to credential soft skills
- Grant number: 2317077
- Fiscal year: 2024
- Amount: $517,800
- Project category: Standard Grant
Adaptive Ising-machine-based Solvers for Large-scale Real-world Geospatial Optimization Problems
- Grant number: 24K20779
- Fiscal year: 2024
- Amount: $517,800
- Project category: Grant-in-Aid for Early-Career Scientists
DeepMARA - Deep Reinforcement Learning based Massive Random Access Toward Massive Machine-to-Machine Communications
- Grant number: EP/Y028252/1
- Fiscal year: 2024
- Amount: $517,800
- Project category: Fellowship
Toxicology-testing platform integrating immunocompetent in vitro/ex vivo modules with real-time sensing and machine learning based in silico models for life cycle assessment and SSbD
- Grant number: 10100967
- Fiscal year: 2024
- Amount: $517,800
- Project category: EU-Funded
Machine learning-based prediction models for morbidity and mortality risk of cardiometabolic diseases in post-disaster residents by using the Fukushima longitudinal health data
- Grant number: 24K13482
- Fiscal year: 2024
- Amount: $517,800
- Project category: Grant-in-Aid for Scientific Research (C)