High-Performance Data Analysis with the Helmholtz Analytics Toolkit

Siggel, Martin and Debus, Charlotte and Rüttgers, Alexander and Krajsek, Kai and Knechtges, Philipp and Götz, Markus and Comito, Claudia and Hagemeier, Björn (2019) High-Performance Data Analysis with the Helmholtz Analytics Toolkit. Computational Sciences and AI in Industry (CSAI), 2019-06-12 - 2019-06-14, Jyväskylä, Finland.


Abstract

This work introduces the Helmholtz Analytics Toolkit (HeAT), a scientific big data analytics library for High-Performance Computing (HPC) and High-Performance Data Analytics (HPDA) systems. The rapid progress in big data analytics in general, and in machine learning/deep learning (ML/DL) in particular, has been driven considerably by well-designed open source libraries such as Hadoop, Spark, Storm, Disco, scikit-learn, H2O.ai, Mahout, TensorFlow, PaddlePaddle, PyTorch, Caffe, Keras, MXNet, CNTK, BigDL, Theano, Neon, Chainer, DyNet, Dask and Intel DAAL, to mention only a few. Despite the large number of existing data analytics frameworks, a library that takes the specific needs of scientific big data analytics into account is still missing. For instance, no pre-existing library operates on heterogeneous hardware such as GPU/CPU systems while allowing transparent computation on distributed systems. Typical big data analytics frameworks like Spark are designed for distributed memory systems and consequently do not fully exploit the shared memory architecture or the network technology of HPC systems. ML/DL frameworks like Theano or Chainer focus on single-node computations; those that do provide mechanisms for distributed computation, such as TensorFlow or PyTorch, leave the details of the distributed computation to the programmer. Libraries designed for HPC like Dask and Intel DAAL do not provide any GPU support. The presented library, HeAT, is designed for the specific needs of big data analytics in the scientific context. It is based on a distributed tensor data object that supports the operations needed by most data analytics algorithms, such as basic scalar functions, linear algebra algorithms, slicing and broadcasting. The tensor data objects reside either on the CPU or on the GPU and, if needed, are distributed over several nodes. Operations on tensor objects are transparent to the user, i.e. they remain the same irrespective of whether the tensor object resides on a single node or is distributed over several nodes, which makes it convenient to port algorithms from single nodes to multiple nodes or from CPUs to GPUs. HeAT's tensor module offers a Python-based API almost identical to NumPy, which allows a fast transition from vectorized NumPy code to parallel and distributed HeAT code. HeAT builds on top of PyTorch, which already provides many required features such as automatic differentiation, CPU and GPU support, linear algebra operations and basic MPI functionality, as well as an imperative programming paradigm that allows for the fast prototyping essential in scientific research. In addition to basic tensor operations, HeAT implements several common data analytics algorithms, e.g. k-means, logistic regression and neural networks, optimized for large-scale distributed systems. We demonstrate the runtime performance of HeAT by clustering 8 GB of image data of rocket engine combustion recorded with high-speed cameras. Compared to k-means clustering in MATLAB or with the serial HeAT implementation, the distributed computation with 16 MPI ranks speeds up the runtime by a factor of about 20. Since serial k-means clustering also has very high memory requirements for such large data sets, the distributed computation additionally enables the processing of even larger data sets, which would not be possible with single-node, shared-memory-only computation.
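The idea behind a distributed k-means, as mentioned in the abstract, can be illustrated with a minimal sketch. It is not HeAT's implementation and does not use HeAT's API: the MPI ranks are simulated by plain Python lists, `allreduce` is a stand-in for an MPI sum reduction, and all function names are hypothetical. Each simulated rank assigns only its own rows to the nearest centroid and contributes partial sums and counts, so no rank needs to hold the full data set.

```python
# Sketch only: MPI ranks are simulated by plain Python lists ("chunks").
# This is not HeAT's implementation; all names here are hypothetical.

def split_rows(data, n_ranks):
    """Split rows into n_ranks contiguous chunks (row-wise distribution)."""
    base, extra = divmod(len(data), n_ranks)
    chunks, start = [], 0
    for rank in range(n_ranks):
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def local_step(chunk, centroids):
    """Per-rank work: assign local points, return partial sums and counts."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for point in chunk:
        k = min(range(len(centroids)), key=lambda j: sq_dist(point, centroids[j]))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts

def allreduce(partials):
    """Stand-in for an MPI sum reduction over the per-rank partial results."""
    first_sums, first_counts = partials[0]
    sums = [row[:] for row in first_sums]
    counts = first_counts[:]
    for s, c in partials[1:]:
        for k in range(len(counts)):
            counts[k] += c[k]
            for d in range(len(sums[k])):
                sums[k][d] += s[k][d]
    return sums, counts

def kmeans(data, centroids, n_ranks=1, n_iter=10):
    """Lloyd's iterations; only sums and counts cross rank boundaries."""
    chunks = split_rows(data, n_ranks)
    centroids = [list(c) for c in centroids]
    for _ in range(n_iter):
        sums, counts = allreduce([local_step(ch, centroids) for ch in chunks])
        centroids = [
            [s / counts[k] for s in sums[k]] if counts[k] else centroids[k]
            for k in range(len(centroids))
        ]
    return centroids

data = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
        [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
init = [[0.0, 0.0], [10.0, 10.0]]
serial = kmeans(data, init, n_ranks=1)
distributed = kmeans(data, init, n_ranks=3)
```

In this sketch the caller only changes `n_ranks`; the algorithm itself is unchanged, which mirrors the transparency the abstract describes. With real MPI, each rank would hold a single chunk and the reduction would be a collective call rather than a loop.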

elib URL of this record:

https://elib.dlr.de/132418/

Document type:

Conference contribution (talk)

Title:

High-Performance Data Analysis with the Helmholtz Analytics Toolkit

Authors:

Authors	Institution or email address	Author's ORCID iD	ORCID Put Code
Siggel, Martin	martin.siggel (at) dlr.de	https://orcid.org/0000-0002-3952-4659	Not specified
Debus, Charlotte	Charlotte.Debus (at) dlr.de	https://orcid.org/0000-0002-7156-2022	Not specified
Rüttgers, Alexander	Alexander.Ruettgers (at) dlr.de	https://orcid.org/0000-0001-6347-9272	Not specified
Krajsek, Kai	Forschungszentrum Jülich	https://orcid.org/0000-0003-3417-161X	Not specified
Knechtges, Philipp	Philipp.Knechtges (at) dlr.de	https://orcid.org/0000-0002-4849-0593	Not specified
Götz, Markus	Karlsruher Institut für Technologie (KIT)	https://orcid.org/0000-0002-2233-1041	Not specified
Comito, Claudia	Forschungszentrum Jülich	Not specified	Not specified
Hagemeier, Björn	Forschungszentrum Jülich	https://orcid.org/0000-0003-1528-0933	Not specified

Date:

14 June 2019

Peer-reviewed publication:

Open Access:

Gold Open Access:

No

In SCOPUS:

No

In ISI Web of Science:

No

Status:

Published

Keywords:

HeAT, Data Analytics, Distributed, HPDA, MPI, Combustion

Event title:

Computational Sciences and AI in Industry (CSAI)

Event location:

Jyväskylä, Finland

Event type:

International conference

Event start:

12 June 2019

Event end:

14 June 2019

Organizer:

University of Jyväskylä

HGF - Research field:

Aeronautics, Space and Transport

HGF - Program:

Space

HGF - Program theme:

Space System Technology

DLR - Research area:

Space

DLR - Program:

R SY - Space System Technology

DLR - Research theme (Project):

R - Vorhaben SISTEC (old)

Location:

Köln-Porz

Institutes and Institutions:

Institut für Simulations- und Softwaretechnik
Institut für Simulations- und Softwaretechnik > High Performance Computing

Deposited by:

Siggel, Dr. Martin

Deposited on:

10 Dec 2019 11:45

Last modified:

24 Apr 2024 20:35
