Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs

Ernst, Dominik and Hager, Georg and Thies, Jonas and Wellein, Gerhard (2020) Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs. In: 13th International Conference on Parallel Processing and Applied Mathematics, PPAM 2019, 120 (43). Springer. PPAM 2019, 2019-09-08 - 2019-09-11, Bialystok, Poland. doi: 10.1007/978-3-030-43229-4_43. ISBN 978-303043221-8. ISSN 0302-9743.

Abstract

General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. Nvidia's current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables a simultaneously flexible and specialized implementation with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an Nvidia Volta GPGPU.
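The roofline limit mentioned in the abstract can be illustrated with a short calculation. The sketch below assumes the kernel computes C = A^T B for a K x M matrix A and a K x N matrix B with K much larger than M and N, in double precision, with every input element read from memory exactly once and the small result kept on chip. The operation shape, the function names, and the peak and bandwidth figures are illustrative assumptions and are not taken from this record.

```python
def arithmetic_intensity(m, n, bytes_per_element=8):
    """Flops per byte for C = A^T B with tall & skinny A (K x m) and B (K x n).

    Assumption: per row of A and B, the kernel performs 2*m*n flops
    (one multiply and one add per entry of C) and loads m + n elements.
    K cancels out, so the intensity depends only on m and n.
    """
    flops_per_row = 2 * m * n
    bytes_per_row = bytes_per_element * (m + n)
    return flops_per_row / bytes_per_row


def roofline_limit(m, n, peak_flops, mem_bandwidth, bytes_per_element=8):
    """Roofline bound in flop/s: min(peak, intensity * bandwidth)."""
    intensity = arithmetic_intensity(m, n, bytes_per_element)
    return min(peak_flops, intensity * mem_bandwidth)


if __name__ == "__main__":
    # Hypothetical device figures, chosen only to make the example concrete.
    peak = 7.0e12       # flop/s, assumed double-precision peak
    bandwidth = 8.0e11  # byte/s, assumed memory bandwidth
    for width in (1, 4, 16, 64):
        limit = roofline_limit(width, width, peak, bandwidth)
        print(f"M = N = {width:3d}: roofline limit {limit / 1e9:.0f} Gflop/s")
```

Under these assumptions the intensity for M = N is N/8 flop/byte, so narrow matrices are limited by memory bandwidth and only wider ones approach the compute peak.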

elib URL of this record:

https://elib.dlr.de/130199/

Document type:

Conference contribution (talk)

Title:

Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs

Authors:

Authors	Institution or e-mail address	Author ORCID iD	ORCID Put Code
Ernst, Dominik	Dominik.Ernst (at) fau.de	not specified	not specified
Hager, Georg	Georg.Hager (at) fau.de	not specified	not specified
Thies, Jonas	Jonas.Thies (at) dlr.de	not specified	not specified
Wellein, Gerhard	Erlangen Regional Computing Center	not specified	not specified

Date:

2020

Published in:

13th International Conference on Parallel Processing and Applied Mathematics, PPAM 2019

Peer-reviewed publication:

Open Access:

Gold Open Access:

No

In SCOPUS:

In ISI Web of Science:

No

Volume:

120

DOI:

10.1007/978-3-030-43229-4_43

Publisher:

Springer

Series name:

Lecture Notes in Computer Science

ISSN:

0302-9743

ISBN:

978-303043221-8

Status:

published

Keywords:

linear algebra, high performance computing, Graphics Processing Units, memory-bounded operations

Event title:

PPAM 2019

Event location:

Bialystok, Poland

Event type:

international conference

Event start:

8 September 2019

Event end:

11 September 2019

HGF - Research field:

Aeronautics, Space and Transport

HGF - Program:

Space

HGF - Program topic:

Space System Technology

DLR - Research area:

Space

DLR - Program:

R SY - Space System Technology

DLR - Research theme (project, activity):

R - Project SISTEC (old)

Location:

Köln-Porz

Institutes & facilities:

Institut für Simulations- und Softwaretechnik

Deposited by:

Thies, Jonas

Deposited on:

21 Nov 2019 09:24

Last modified:

24 Apr 2024 20:33

Available versions of this record

Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs. (deposited 21 Nov 2019 09:24) [Currently displayed]