Thomas, Riya (2024) Towards the Automatic Diagnosis of State Divergence using Visual Language Models for Planning-based Robots. Master's thesis, Technische Universität München.
PDF (17 MB) - only accessible within DLR
Abstract
Service robots are designed and built to operate as versatile tools capable of performing a wide range of tasks in various environments, from household assistance to space applications. In many cases, especially in dynamic or complex environments, a high level of autonomy is essential for robots to function effectively without constant human supervision. To accomplish this, robots must be able to chain pre-programmed actions together to achieve specified goals, necessitating the use of task and motion planners. Planners typically act on a given initial state and output a sequence of actions that transitions the world from its initial state to the desired goal state. Executing these actions requires the robot to maintain a precise understanding of the current world state and to continuously track it throughout task execution. While the assumption of certainty in the robot's perceived current state may hold in static environments, it can prove unreliable in dynamic environments, especially when perception is limited by the robot's sensing capabilities. If the robot's perception of its environment is outdated, the robot will be unable to execute the preplanned actions; that is, divergence between the belief state and the real state of the world results in a failure of task execution. This thesis aims to mitigate such problems by outlining a framework that enables a robot to investigate and address the underlying reason for these failures. The module that equips the robot with reasoning capabilities is a multi-modal model (such as a large language model or a vision-language model). In particular, the introduction of a large language model (LLM) into the framework allows the human end-user to be included in the investigation, with the AI-based framework bearing the brunt of the investigative effort and the human user querying the framework for explanations. The framework serves scenarios in which the robot has detected the failure of an action and is tasked with identifying the cause and explaining it to a human operator. To do this, the robot compares its expected state of the world with the data received from its sensors and interprets the discrepancies between them. In particular, the framework considers visual images from the robot's perspective and performs a spot-the-difference-style comparison with rendered versions of the expected world state. We test our approach with data collected in a space-assistance scenario and a household scenario. In the course of the thesis, we built "specialized assistants" for certain subtasks within the natural-language pipeline, implementing and developing some ourselves and incorporating others from external sources. In particular, we discuss an anomaly-detection module that is trained to determine whether two images, differing in visual characteristics such as lighting, quality, and texture, fundamentally align in representing the same scene. We introduce and evaluate different hyperparameters that influence this model's performance. Other assistants in the pipeline include a dedicated object-detection module and an object spatial-relation predictor. One of the newest entrants into the vision-language-model space, Pixtral, shows promising preliminary results when introduced into our framework, which we primarily designed around the vision-language model LLaVA.
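The abstract's core mechanism, comparing a rendered image of the robot's belief state against the live camera view by querying a vision-language model, can be sketched as follows. This is a hypothetical illustration: the function and parameter names (diagnose_divergence, query_vlm) and the prompt wording are assumed, not the thesis's actual API, and the real pipeline additionally involves the anomaly-detection, object-detection, and spatial-relation assistants described above.

```python
# Minimal sketch of the spot-the-difference diagnosis step, assuming a
# generic VLM backend (e.g. a LLaVA or Pixtral endpoint). All names and the
# prompt text are illustrative, not taken from the thesis.
from typing import Callable, List


def diagnose_divergence(
    belief_render: str,
    camera_image: str,
    query_vlm: Callable[[str, List[str]], str],
) -> str:
    """Ask a VLM to explain any divergence between belief and world state."""
    prompt = (
        "Image 1 is a rendering of the robot's expected world state; "
        "image 2 is the live camera view. Ignoring differences in lighting, "
        "image quality, and texture, do the two scenes fundamentally align? "
        "If not, list the object-level differences that could explain a "
        "failed action."
    )
    return query_vlm(prompt, [belief_render, camera_image])


if __name__ == "__main__":
    # Stub standing in for a real VLM endpoint, so the sketch runs as-is.
    def stub_vlm(prompt: str, images: List[str]) -> str:
        return "No: the mug expected on the table is absent from the camera view."

    print(diagnose_divergence("belief_state.png", "camera_view.png", stub_vlm))
```

In this reading of the framework, the VLM's answer is what the LLM layer would then relay to the human operator as an explanation of the failed action.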
| elib URL of this entry: | https://elib.dlr.de/210191/ |
|---|---|
| Document Type: | Thesis (Master's thesis) |
| Title: | Towards the Automatic Diagnosis of State Divergence using Visual Language Models for Planning-based Robots |
| Authors: | Thomas, Riya |
| Date: | November 2024 |
| Open Access: | No |
| Number of Pages: | 139 |
| Status: | published |
| Keywords: | Robotics, Task and Motion Planning, Large Language Model, Visual Language Model |
| Institution: | Technische Universität München |
| Department: | School of Computation, Information and Technology - Informatics |
| HGF - Research Area: | Aeronautics, Space and Transport |
| HGF - Program: | Space |
| HGF - Program Theme: | Robotics |
| DLR - Focus Area: | Space |
| DLR - Research Area: | R RO - Robotics |
| DLR - Sub-Area (Project, Mission): | R - On-Orbit Servicing [RO] |
| Location: | Oberpfaffenhofen |
| Institutes & Institutions: | Institut für Robotik und Mechatronik (ab 2013) > Autonomie und Fernprogrammierung |
| Deposited By: | Bauer, Adrian Simon |
| Deposited On: | 10 Dec 2024 07:50 |
| Last Modified: | 10 Dec 2024 07:50 |