
Fusing Convolution and Vision Transformer Encoders for Object Height Estimation from Monocular Satellite and Aerial Images

Gültekin, Furkan and Koz, Alper and Bahmanyar, Reza and Azimi, Seyedmajid and Lütfi Süzen, Mehmet (2026) Fusing Convolution and Vision Transformer Encoders for Object Height Estimation from Monocular Satellite and Aerial Images. In: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025, pp. 3768-3777. ICCV - 3D VAST, 2025-10-19 - 2025-10-20, Honolulu, Hawaii. doi: 10.1109/ICCVW69036.2025.00393. ISBN 979-833158988-2. ISSN 2473-9944.


Official URL: https://ieeexplore.ieee.org/document/11374341

Abstract

Accurate height estimation from aerial and satellite imagery is crucial for large-scale 3D scene modeling, which has applications in urban planning, environmental monitoring, and disaster management. In this work, we propose integrating convolutional neural networks (CNNs) and vision transformers (ViTs) to leverage both local and global feature extraction. Our experiments show that using a combination of CNN and ViT encoders significantly improves accuracy compared to relying on either one alone, as CNNs capture fine details while ViTs enhance contextual understanding. Additionally, we incorporate a segmentation head to enhance pixel-level precision, particularly at object boundaries. Evaluated on the DFC2019 and DFC2023 datasets, our proposed fusion approach outperforms baseline methods across multiple metrics. For instance, root-mean-squared error is reduced by 5%–13%, and accuracy is improved by 4%–9% in the delta threshold metric. The results also demonstrate strong generalizability across diverse sensors, acquisition altitudes, viewing angles, and real-world scenarios. Our models are released at https://github.com/Furkangultekin/FusedHE
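The two metrics named in the abstract can be illustrated with a minimal sketch. It assumes the delta threshold metric follows the definition commonly used in monocular depth estimation (the fraction of pixels whose ratio max(pred/target, target/pred) is below 1.25); the abstract does not state the threshold, so the value 1.25, the function names, the `eps` masking of ground-level pixels, and the sample values are all illustrative assumptions rather than the authors' implementation.

```python
import math


def rmse(pred, target):
    """Root-mean-squared error over paired height values (same units)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equally long")
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


def delta_accuracy(pred, target, threshold=1.25, eps=1e-6):
    """Fraction of valid pixels with max(pred/target, target/pred) < threshold.

    Assumption: pixels whose ground-truth height is <= eps are skipped,
    because the ratio is undefined there. Predictions are clamped to eps
    so that a zero prediction counts as a miss instead of dividing by zero.
    """
    if len(pred) != len(target):
        raise ValueError("pred and target must be equally long")
    valid = [(max(p, eps), t) for p, t in zip(pred, target) if t > eps]
    if not valid:
        return float("nan")
    hits = sum(1 for p, t in valid if max(p / t, t / p) < threshold)
    return hits / len(valid)


# Hypothetical flattened height maps in metres (not data from the paper).
pred = [10.0, 21.0, 3.0, 0.0]
target = [10.0, 20.0, 5.0, 0.0]

print("RMSE:", rmse(pred, target))
print("delta < 1.25:", delta_accuracy(pred, target))
```

In this toy example the last pixel is ground level and is excluded from the delta metric, and two of the three remaining pixels fall within the threshold. RMSE is computed over all four pixels.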

Item URL in elib: https://elib.dlr.de/218107/
Document Type: Conference or Workshop Item (Poster)
Title: Fusing Convolution and Vision Transformer Encoders for Object Height Estimation from Monocular Satellite and Aerial Images
Authors:
Authors | Institution or Email of Authors | Author's ORCID iD | ORCID Put Code
Gültekin, Furkan | Locdus AI | UNSPECIFIED | UNSPECIFIED
Koz, Alper | Center for Image Analysis, METU | UNSPECIFIED | UNSPECIFIED
Bahmanyar, Reza | reza.bahmanyar (at) dlr.de | UNSPECIFIED | UNSPECIFIED
Azimi, Seyedmajid | Seyedmajid.Azimi (at) dlr.de | UNSPECIFIED | UNSPECIFIED
Lütfi Süzen, Mehmet | Geological Engineering Department, METU | UNSPECIFIED | UNSPECIFIED
Date: 23 February 2026
Journal or Publication Title: 2025 IEEE/CVF International Conference on Computer Vision Workshops, ICCV-W 2025
Refereed publication: Yes
Open Access: Yes
Gold Open Access: No
In SCOPUS: Yes
In ISI Web of Science: No
DOI: 10.1109/ICCVW69036.2025.00393
Page Range: pp. 3768-3777
ISSN: 2473-9944
ISBN: 979-833158988-2
Status: Published
Keywords: Remote sensing
Event Title: ICCV - 3D VAST
Event Location: Honolulu, Hawaii
Event Type: International Conference
Event Start Date: 19 October 2025
Event End Date: 20 October 2025
Organizer: ICCV International Conference on Computer Vision
HGF - Research field: Aeronautics, Space and Transport
HGF - Program: Transport
HGF - Program Themes: Road Transport
DLR - Research area: Transport
DLR - Program: V ST Straßenverkehr
DLR - Research theme (Project): V - ACT4Transformation - Automated and Connected Technologies for Mobility Transformation, V - SaiNSOR
Location: Oberpfaffenhofen
Institutes and Institutions: Remote Sensing Technology Institute > Photogrammetry and Image Analysis
Deposited By: Azimi, Seyedmajid
Deposited On: 31 Oct 2025 12:06
Last Modified: 20 Apr 2026 08:18
