Multimodal Transformer Models for Human Action Classification

Varga, Zoltán and Mascaro, Esteve Valls and Sliwowski, Daniel Jan and Lee, Dongheui (2025) Multimodal Transformer Models for Human Action Classification. In: 12th International Conference on Robot Intelligence Technology and Applications, RiTA 2024, 1419, pp. 52-63. The 12th International Conference on Robot Intelligence Technology and Applications (RiTA 2024), 2024-12-04 - 2024-12-07, Ulsan, Korea. doi: 10.1007/978-3-031-92011-0_5. ISBN 978-303192010-3. ISSN 2367-3370.

This archive cannot provide the full text.

Official URL: https://link.springer.com/chapter/10.1007/978-3-031-92011-0_5

Abstract

Most research in deep learning focuses on a single modality, such as image, text, or proprioception data. However, humans benefit from leveraging information from diverse senses on a daily basis for richer information acquisition. Inspired by this, we design a transformer-based multimodal model for human action recognition and thoroughly evaluate its performance and robustness. Furthermore, we explore fusion methods to assess how modalities are best combined. Lastly, a model is trained to infer (generate) a missing modality. Our study shows that multimodal transformers perform better than their modality-specific equivalents. We achieve an improvement of 10.1% when using multiple data modalities over our vision-only baseline and outperform current state-of-the-art approaches by 32.8%. Furthermore, we evaluate a mean square error of 9.6% in the tactile force reconstruction task. The implemented model can be applied in scenarios where robotic assistance depends on recognising human actions for decision-making, tackling situations where vision is limited or audio and other modalities are required for deeper understanding.

elib URL of the record:

https://elib.dlr.de/223611/

Document type:

Conference contribution (talk)

Title:

Multimodal Transformer Models for Human Action Classification

Authors:

Authors	Institution or email address	Author's ORCID iD	ORCID Put Code
Varga, Zoltán	NOT SPECIFIED	NOT SPECIFIED	NOT SPECIFIED
Mascaro, Esteve Valls	TU Wien	NOT SPECIFIED	NOT SPECIFIED
Sliwowski, Daniel Jan	NOT SPECIFIED	NOT SPECIFIED	NOT SPECIFIED
Lee, Dongheui	Dongheui.Lee (at) dlr.de	https://orcid.org/0000-0003-1897-7664	NOT SPECIFIED

Date:

16 July 2025

Published in:

12th International Conference on Robot Intelligence Technology and Applications, RiTA 2024

Refereed publication:

Open Access:

No

Gold Open Access:

No

In SCOPUS:

In ISI Web of Science:

Volume:

1419

DOI:

10.1007/978-3-031-92011-0_5

Page range:

pp. 52-63

Series name:

Lecture Notes in Networks and Systems

ISSN:

2367-3370

ISBN:

978-303192010-3

Status:

published

Keywords:

human action classification

Event title:

The 12th International Conference on Robot Intelligence Technology and Applications (RiTA 2024)

Event location:

Ulsan, Korea

Event type:

international conference

Event start:

4 December 2024

Event end:

7 December 2024

HGF - Research field:

Aeronautics, Space and Transport

HGF - Program:

Space

HGF - Program theme:

Robotics

DLR - Research area:

Space

DLR - Program:

R RO - Robotics

DLR - Research theme (project):

R - Basic Technologies [RO]

Location:

Oberpfaffenhofen

Institutes and institutions:

Institute of Robotics and Mechatronics (since 2013)

Deposited by:

Strobl, Dr.-Ing. Klaus H.

Deposited on:

24 Mar 2026 14:27

Last modified:

25 Mar 2026 09:59
