The Helmholtz Analytics Toolkit (HEAT): A scientific Big Data Library for HPC

Krajsek, Kai and Comito, Claudia and Götz, Markus and Hagemeier, Björn and Knechtges, Philipp and Siggel, Martin (2018) The Helmholtz Analytics Toolkit (HEAT): A scientific Big Data Library for HPC. Extreme Data Workshop 2018, 18.-19. Sep. 2018, Forschungszentrum Jülich, Germany.

Abstract

We present the Helmholtz Analytics Toolkit (HeAT), a scientific big data analytics library for HPC systems. The rapid progress in big data analytics in general, and in machine learning/deep learning (ML/DL) in particular, has been driven considerably by well-designed open source libraries such as Hadoop, Spark, Storm, Disco, scikit-learn, H2O.ai, Mahout, TensorFlow, PaddlePaddle, PyTorch, Caffe, Keras, MXNet, CNTK, BigDL, Theano, Neon, Chainer, DyNet, Dask and Intel DAAL, to mention only a few. Despite the large number of existing data analytics frameworks, a library that takes the specific needs of scientific big data analytics into consideration is still missing. For instance, no pre-existing library operates on heterogeneous hardware such as GPU/CPU systems while allowing transparent computation on distributed systems. Typical big data analytics frameworks like Spark are designed for distributed memory systems and consequently do not fully exploit the shared memory architecture or the network technology of HPC systems. ML/DL frameworks like Theano or Chainer focus on single-node computations; those that do provide mechanisms for distributed computation, such as TensorFlow or PyTorch, leave the details of the distributed computation to the programmer. Libraries designed for HPC, like Dask and Intel DAAL, do not provide any GPU support. The presented library, HeAT, is designed for the specific needs of big data analytics in the scientific context. It is based on a tensor data object that supports the operations necessary for most data analytics algorithms, such as basic scalar functions, linear algebra algorithms, slicing and broadcasting. Tensor data objects reside either on the CPU or on the GPU and, if needed, are distributed over different nodes. Operations on tensor objects are transparent to the user, i.e. they stay the same irrespective of whether the tensor object resides on a single node or is distributed over several nodes, which makes it convenient to port algorithms from single nodes to multiple nodes or from CPUs to GPUs. HeAT builds on top of PyTorch, which already provides many required features such as automatic differentiation, CPU and GPU support, linear algebra operations and basic MPI functionality, as well as an imperative programming paradigm that allows the fast prototyping essential in scientific research. In addition to basic tensor operations, HeAT implements several common data analytics algorithms, e.g. k-means, logistic regression and neural networks, optimized for large-scale distributed systems. After motivating the framework and specifying its scope, we describe its concept and its realization in detail. We demonstrate its usage by means of several typical examples from data analytics, including benchmarks comparing HeAT to pre-existing frameworks. The presentation closes with a discussion of the downsides, further developments and future challenges of HeAT.
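
The idea of operations that stay the same whether a tensor lives on one node or is split over several can be illustrated with a small toy model. The sketch below is not the HeAT API: the class `SplitVector`, its methods and the function `sum_of_squared_deviations` are hypothetical names invented for this illustration, the "nodes" are simulated by plain Python lists, and the reduction stands in for what would be a collective operation (for example an MPI allreduce) on a real cluster.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SplitVector:
    """Toy 1-D tensor whose elements are partitioned across simulated nodes.

    Hypothetical illustration only; this is not the HeAT API.
    """
    chunks: List[List[float]]  # chunks[i] holds the data owned by "node" i

    @classmethod
    def from_list(cls, data: List[float], nodes: int = 1) -> "SplitVector":
        # Distribute the data in contiguous blocks, as evenly as possible.
        base, extra = divmod(len(data), nodes)
        chunks, start = [], 0
        for rank in range(nodes):
            size = base + (1 if rank < extra else 0)
            chunks.append(list(data[start:start + size]))
            start += size
        return cls(chunks)

    def map(self, fn: Callable[[float], float]) -> "SplitVector":
        # Element-wise operations need no communication: each node works locally.
        return SplitVector([[fn(x) for x in chunk] for chunk in self.chunks])

    def sum(self) -> float:
        # Reductions combine per-node partial results; on a real cluster this
        # step would be a collective operation such as an MPI allreduce.
        partial = [sum(chunk) for chunk in self.chunks]
        return sum(partial)

    def mean(self) -> float:
        count = sum(len(chunk) for chunk in self.chunks)
        return self.sum() / count

    def to_list(self) -> List[float]:
        return [x for chunk in self.chunks for x in chunk]


def sum_of_squared_deviations(v: SplitVector) -> float:
    """User code: written once, unaware of how `v` is distributed."""
    mu = v.mean()
    return v.map(lambda x: (x - mu) ** 2).sum()


data = [float(i) for i in range(10)]
single = sum_of_squared_deviations(SplitVector.from_list(data, nodes=1))
multi = sum_of_squared_deviations(SplitVector.from_list(data, nodes=4))
print(single, multi)  # the same user code yields the same value for both layouts
```

In this toy model the user-level function never inspects `chunks`; only the container decides whether a step is purely local (`map`) or needs a combination of partial results (`sum`). With general floating-point data, the per-node partial sums can differ from a single-node sum in the last bits because the order of additions changes.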

Item URL in elib:https://elib.dlr.de/124422/
Document Type:Conference or Workshop Item (Speech)
Title:The Helmholtz Analytics Toolkit (HEAT): A scientific Big Data Library for HPC
Authors:
Authors | Institution or Email of Authors | Authors ORCID iD
Krajsek, Kai | Forschungszentrum Jülich | https://orcid.org/0000-0003-3417-161X
Comito, Claudia | Forschungszentrum Jülich | UNSPECIFIED
Götz, Markus | Karlsruher Institut für Technologie (KIT) | https://orcid.org/0000-0002-2233-1041
Hagemeier, Björn | Forschungszentrum Jülich | https://orcid.org/0000-0003-1528-0933
Knechtges, Philipp | Philipp.Knechtges (at) dlr.de | https://orcid.org/0000-0002-4849-0593
Siggel, Martin | martin.siggel (at) dlr.de | https://orcid.org/0000-0002-3952-4659
Date:19 September 2018
Refereed publication:Yes
Open Access:Yes
Gold Open Access:No
In SCOPUS:No
In ISI Web of Science:No
Status:Published
Keywords:Distributed, Machine Learning, Big Data, Data Analytics, GPU, HPC, HeAT
Event Title:Extreme Data Workshop 2018
Event Location:Forschungszentrum Jülich, Germany
Event Type:Workshop
Event Dates:18.-19. Sep. 2018
HGF - Research field:Aeronautics, Space and Transport
HGF - Program:Space
HGF - Program Themes:Space Technology
DLR - Research area:Space
DLR - Program:R SY - Space System Technology
DLR - Research theme (Project):R - Project SISTEC
Location: Köln-Porz
Institutes and Institutions:Institute of Simulation and Software Technology > High Performance Computing
Deposited By: Siggel, Dr. Martin
Deposited On:06 Dec 2018 14:19
Last Modified:31 Jul 2019 20:22
