Golendukhina, Valentina and Felderer, Michael (2024) Unveiling Data Preprocessing Patterns in Computational Notebooks. In: 50th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2024, pp. 114-121. IEEE. SEAA 2024, 2024-08-28 - 2024-08-30, Paris, France. doi: 10.1109/SEAA64295.2024.00025. ISBN 979-835038026-2. ISSN 2640-592X.
Abstract
Data preprocessing, which includes data integration, cleaning, and transformation, is often a time- and effort-intensive step of fundamental importance. This phase is essential for ensuring the quality and suitability of data for subsequent stages, such as feature engineering and model training in machine-learning-enabled and data-driven systems. This paper provides an extensive overview of data preprocessing functions in Python and examines their application and prevalence by analyzing 149,048 computational notebooks collected from Kaggle. Despite the crucial role of data preprocessing in model performance, our results expose a significant lack of emphasis on data preprocessing activities in the examined notebooks. Notably, the highest-ranked users tend to skip data preprocessing steps and focus on model-related activities. Although other users apply data preprocessing methods more often, the overall prevalence remains relatively limited. We found that data preparation practices such as missing-value handling are present in 20% to 60% of the notebooks, depending on the competition, whereas outlier handling is present in less than 20% of the analyzed scripts. The most frequently and consistently applied practices are data transformation methods.
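To make the idea of measuring the prevalence of preprocessing functions concrete, the following is a minimal sketch of how calls could be counted in a notebook's Python code using the standard `ast` module. It is an illustration under stated assumptions, not the authors' analysis pipeline: the `CATEGORIES` mapping, the helper names `called_name` and `count_categories`, and the example source are all hypothetical, and the mapping is far smaller than any realistic function catalogue.

```python
import ast
from collections import Counter
from typing import Optional

# Hypothetical mapping from call names to preprocessing categories.
# Illustrative only; this is NOT the function catalogue used in the paper.
CATEGORIES = {
    "fillna": "missing values",
    "dropna": "missing values",
    "SimpleImputer": "missing values",
    "clip": "outliers",
    "StandardScaler": "transformation",
    "MinMaxScaler": "transformation",
    "get_dummies": "transformation",
}


def called_name(node: ast.Call) -> Optional[str]:
    """Return the rightmost name of a call, e.g. 'fillna' for df.fillna(0)."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return None


def count_categories(source: str) -> Counter:
    """Count preprocessing calls per category in one piece of Python source."""
    counts = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            category = CATEGORIES.get(called_name(node))
            if category is not None:
                counts[category] += 1
    return counts


# Hypothetical notebook code; it is only parsed here, never executed.
example = """
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("train.csv")
df = df.fillna(df.median())
df["age"] = df["age"].clip(0, 100)
X = StandardScaler().fit_transform(df)
"""

counts = count_categories(example)
print(dict(counts))
```

Matching on the bare call name is a deliberate simplification: it cannot tell `DataFrame.clip` from an unrelated method that happens to be called `clip`. Real notebooks would also need their code cells extracted from the `.ipynb` JSON first, and cells containing IPython magics would have to be cleaned before `ast.parse` accepts them.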
| Field | Value |
|---|---|
| elib URL of the record: | https://elib.dlr.de/211398/ |
| Document type: | Conference contribution (talk) |
| Title: | Unveiling Data Preprocessing Patterns in Computational Notebooks |
| Authors: | Golendukhina, Valentina; Felderer, Michael |
| Date: | 2024 |
| Published in: | 50th Euromicro Conference on Software Engineering and Advanced Applications, SEAA 2024 |
| Peer-reviewed publication: | Yes |
| Open Access: | Yes |
| Gold Open Access: | No |
| In SCOPUS: | Yes |
| In ISI Web of Science: | Yes |
| DOI: | 10.1109/SEAA64295.2024.00025 |
| Page range: | pp. 114-121 |
| Publisher: | IEEE |
| ISSN: | 2640-592X |
| ISBN: | 979-835038026-2 |
| Status: | published |
| Keywords: | Data Engineering, Computational Notebooks, Data Analysis |
| Event title: | SEAA 2024 |
| Event location: | Paris, France |
| Event type: | international conference |
| Event start: | 28 August 2024 |
| Event end: | 30 August 2024 |
| HGF - Research field: | Aeronautics, Space and Transport |
| HGF - Program: | Space |
| HGF - Program topic: | Space System Technology |
| DLR - Focus area: | Space |
| DLR - Research area: | R SY - Space System Technology |
| DLR - Sub-area (project, undertaking): | R - Digital Transformation in Space [SY] |
| Location: | Köln-Porz |
| Institutes & facilities: | Institut für Softwaretechnologie (Institute of Software Technology) |
| Deposited by: | Felderer, Michael |
| Deposited on: | 12 Jan 2026 08:11 |
| Last modified: | 12 Jan 2026 08:11 |