Ernst, Dominik and Hager, Georg and Thies, Jonas and Wellein, Gerhard (2020) Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs. In: 13th International Conference on Parallel Processing and Applied Mathematics, PPAM 2019, 120 (43). Springer. PPAM 2019, 2019-09-08 - 2019-09-11, Bialystok, Poland. doi: 10.1007/978-3-030-43229-4_43. ISBN 978-303043221-8. ISSN 0302-9743.
This is the latest version of this record.
Abstract
General matrix-matrix multiplications (GEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. Nvidia's current CUBLAS implementation delivers only a fraction of the potential performance (as given by the roofline model) in this case. We describe the challenges and key properties of an implementation that can achieve perfect performance. We further evaluate different approaches of parallelization and thread distribution, and devise a flexible, configurable mapping scheme. A code generation approach enables a simultaneously flexible and specialized implementation with autotuning. This results in perfect performance for a large range of matrix sizes in the domain of interest, and at least 2/3 of maximum performance for the rest on an Nvidia Volta GPGPU.
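The roofline limit mentioned in the abstract can be illustrated with a short calculation. The sketch below is not taken from the paper: it assumes the product has the form C = A^T B with A of size K x M and B of size K x N, K much larger than M and N, double precision operands, and memory traffic dominated by reading A and B once. The function names and the hardware numbers in the usage lines are hypothetical placeholders, not measured values for any specific GPU.

```python
def tsmm_intensity(m, n, bytes_per_element=8):
    """Arithmetic intensity (flop/byte) of C = A^T B for tall & skinny A, B.

    Assumption: A is K x m and B is K x n with K >> m, n, so the traffic
    for the small m x n result C is neglected. Per row of A and B the
    kernel performs 2*m*n flops and loads (m + n) elements.
    """
    return 2.0 * m * n / ((m + n) * bytes_per_element)


def roofline_limit(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: the lower of the compute peak and the bandwidth ceiling."""
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical machine parameters (placeholders, not vendor data):
PEAK_GFLOPS = 7000.0    # assumed double-precision peak in Gflop/s
BANDWIDTH_GBS = 800.0   # assumed memory bandwidth in GB/s

for width in (8, 64, 128):
    i = tsmm_intensity(width, width)
    limit = roofline_limit(i, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"M = N = {width}: intensity {i} flop/byte, limit {limit} Gflop/s")
```

Under these assumptions the intensity for M = N grows as M/8 flop/byte, so narrow matrices are bound by memory bandwidth and only wider ones reach the compute ceiling.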
elib URL of this record: | https://elib.dlr.de/130199/ |
---|---|
Document type: | Conference contribution (talk) |
Title: | Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs |
Authors: | Ernst, Dominik; Hager, Georg; Thies, Jonas; Wellein, Gerhard |
Date: | 2020 |
Published in: | 13th International Conference on Parallel Processing and Applied Mathematics, PPAM 2019 |
Refereed publication: | Yes |
Open Access: | Yes |
Gold Open Access: | No |
In SCOPUS: | Yes |
In ISI Web of Science: | No |
Volume: | 120 |
DOI: | 10.1007/978-3-030-43229-4_43 |
Publisher: | Springer |
Series name: | Lecture Notes in Computer Science |
ISSN: | 0302-9743 |
ISBN: | 978-303043221-8 |
Status: | published |
Keywords: | linear algebra, high performance computing, Graphics Processing Units, memory-bounded operations |
Event title: | PPAM 2019 |
Event location: | Bialystok, Poland |
Event type: | international conference |
Event start: | 8 September 2019 |
Event end: | 11 September 2019 |
HGF - Research field: | Aeronautics, Space and Transport |
HGF - Program: | Space |
HGF - Program theme: | Space System Technology |
DLR - Research area: | Space |
DLR - Research field: | R SY - Space System Technology |
DLR - Research theme (project): | R - Vorhaben SISTEC (old) |
Location: | Köln-Porz |
Institutes and institutions: | Institute for Simulation and Software Technology |
Deposited by: | Thies, Jonas |
Deposited on: | 21 Nov 2019 09:24 |
Last modified: | 24 Apr 2024 20:33 |
Available versions of this record
- Performance Engineering for a Tall & Skinny Matrix Multiplication Kernel on GPUs. (deposited 21 Nov 2019 09:24) [Currently displayed]