
Accelerating massive data processing in Python with Heat (Tutorial)

Comito, Claudia and Hoppe, Fabian (2024) Accelerating massive data processing in Python with Heat (Tutorial). 4th conference for Research Software Engineering in Germany deRSE24, 2024-03-05 - 2024-03-07, Würzburg, Germany.

Full text not available from this repository.

Abstract

Manipulating and processing massive data sets is challenging. For the vast majority of research communities, the standard approach involves setting up Python pipelines to break up and analyze data in smaller chunks, an inefficient and error-prone process. The problem is exacerbated on GPUs because of their smaller available memory. Popular solutions for distributing NumPy/SciPy computations are based on task parallelism, which introduces significant runtime overhead, complicates implementation, and often limits GPU support to specific vendors.

In this tutorial, we will show you an alternative based on data parallelism. The open-source library Heat [1] builds on PyTorch and mpi4py to simplify porting NumPy/SciPy-based code to GPUs (CUDA and ROCm), including multi-GPU, multi-node clusters. Under the hood, Heat distributes massive memory-intensive operations and algorithms via MPI communication, achieving significant speed-ups compared to task-distributed frameworks. On the surface, however, Heat implements a NumPy-like API, is largely interoperable with the Python array ecosystem, and can be employed seamlessly as a backend to accelerate existing single-CPU pipelines, as well as to develop new HPC applications from scratch.

You will get an overview of:
- Heat's basics: getting started with distributed I/O, data decomposition scheme, array operations
- Existing functionalities: multi-node linear algebra, statistics, signal processing, machine learning...
- DIY how-to: using existing Heat infrastructure to build your own multi-node, multi-GPU research software.

We'll also touch upon Heat's implementation roadmap and possible paths to collaboration.
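To make the idea of data parallelism more concrete, here is a minimal sketch in plain Python. It does not use Heat, MPI, or a GPU, and it is not taken from the tutorial: the "ranks" are simulated in a single process, and the helper names (`split_evenly`, `local_stats`, `global_mean`) are hypothetical. It only illustrates the general pattern in which each process owns one contiguous slice of the data, computes a small partial result locally, and a reduction step combines the partial results.

```python
from typing import List, Sequence, Tuple


def split_evenly(data: Sequence[float], n_ranks: int) -> List[Sequence[float]]:
    """Cut data into n_ranks contiguous chunks whose sizes differ by at most one.

    Hypothetical helper: stands in for a data decomposition along one axis.
    """
    base, extra = divmod(len(data), n_ranks)
    chunks = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks receive one additional element.
        size = base + (1 if rank < extra else 0)
        chunks.append(data[start:start + size])
        start += size
    return chunks


def local_stats(chunk: Sequence[float]) -> Tuple[float, int]:
    """Work done independently on each simulated rank: partial sum and count."""
    return sum(chunk), len(chunk)


def global_mean(data: Sequence[float], n_ranks: int) -> float:
    """Combine the per-rank partial results, as a reduction step would."""
    partials = [local_stats(chunk) for chunk in split_evenly(data, n_ranks)]
    total = sum(partial_sum for partial_sum, _ in partials)
    count = sum(partial_count for _, partial_count in partials)
    return total / count


if __name__ == "__main__":
    values = [float(i) for i in range(10)]
    print([len(chunk) for chunk in split_evenly(values, 3)])
    print(global_mean(values, 3))
```

In a real distributed setting, each chunk would live in the memory of a different process or device, and only the small partial results would be communicated. That is the property that lets data sets larger than a single node's memory be processed.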

Item URL in elib: https://elib.dlr.de/203210/
Document Type: Conference or Workshop Item (Other)
Title: Accelerating massive data processing in Python with Heat (Tutorial)
Authors:
Authors | Institution or Email of Authors | Author's ORCID iD | ORCID Put Code
Comito, Claudia | Forschungszentrum Jülich | UNSPECIFIED | UNSPECIFIED
Hoppe, Fabian | UNSPECIFIED | https://orcid.org/0000-0002-4501-6829 | UNSPECIFIED
Date: 6 March 2024
Refereed publication: Yes
Open Access: No
Gold Open Access: No
In SCOPUS: No
In ISI Web of Science: No
Status: Published
Keywords: HPC, Data Analytics, Machine Learning, Heat
Event Title: 4th conference for Research Software Engineering in Germany deRSE24
Event Location: Würzburg, Germany
Event Type: national Conference
Event Start Date: 5 March 2024
Event End Date: 7 March 2024
Organizer: Universität Würzburg / de-RSE e.V. - Gesellschaft für Forschungssoftware
HGF - Research field: Aeronautics, Space and Transport
HGF - Program: Space
HGF - Program Themes: Space System Technology
DLR - Research area: Raumfahrt
DLR - Program: R SY - Space System Technology
DLR - Research theme (Project): R - HPDA basic software
Location: Köln-Porz
Institutes and Institutions: Institute of Software Technology > High-Performance Computing
Institute of Software Technology
Deposited By: Hoppe, Fabian
Deposited On: 18 Mar 2024 10:12
Last Modified: 09 Apr 2025 09:11
