Feedback-Guided Autonomous Driving

Jimuyang Zhang, Zanming Huang, Arijit Ray, Eshed Ohn-Bar
Boston University
CVPR 2024 (Highlight)


Abstract

While behavior cloning has recently emerged as a highly successful paradigm for autonomous driving, humans rarely learn to perform complex tasks, such as driving, via imitation or behavior cloning alone. In contrast, learning in humans often involves additional detailed guidance throughout the interactive learning process, i.e., where feedback, often via language, provides detailed information as to which part of their trial was performed incorrectly or suboptimally and why. Motivated by this observation, we introduce an efficient feedback-based framework for improving behavior-cloning-based training of sensorimotor driving agents. Our key insight is to leverage recent advances in Large Language Models (LLMs) to provide corrective fine-grained feedback regarding the underlying reason behind driving prediction failures. Moreover, our introduced network architecture is efficient, enabling the first sensorimotor end-to-end training and evaluation of LLM-based driving models. The resulting agent achieves state-of-the-art performance in open-loop evaluation on nuScenes, outperforming prior state-of-the-art by over 8.1% and 57.1% in accuracy and collision rate, respectively. In CARLA, our camera-based agent improves by 16.6% in driving score over prior LIDAR-based approaches.

Method

We propose FeD, a feedback-guided end-to-end sensorimotor driving agent that employs a multimodal large language model (MLLM), leveraging its rich language interface for user control and refinement. Our three key improvements are:
  1. Language-based feedback refinement, trained using autogenerated feedback data; hence, our approach requires no additional data collection.
  2. Training the model via distillation from a privileged agent with a Bird's Eye View (BEV) of the scene, allowing our model to robustly use only RGB data at test time.
  3. Predicting driving waypoints in a masked-token fashion from the waypoint tokens' internal representations, i.e., without relying on slow sequential generation.

Model Architecture of FeD

Our proposed FeD is the first sensorimotor end-to-end LLM-based autonomous driving model. FeD enables efficient closed-loop evaluation compared with existing LLM-based methods, which often rely on slow and costly inference. Our goal is to train a sensorimotor agent that maps front camera images (orange) and ego vehicle state information (blue), encoded as language tokens, to a set of future waypoints. This is accomplished by introducing new waypoint tokens (green) as part of the input prompt. The introduced tokens also let us leverage the LLM's rich output embeddings for the prompt to perform direct waypoint prediction, i.e., as opposed to slow and inefficient sequential generation. Training proceeds in two stages. First, to ease the challenging sensorimotor learning task, we introduce a privileged agent that additionally takes ground-truth environmental information (purple) and provides rich supervision for training the sensorimotor agent through feature distillation. Subsequently, the sensorimotor agent is fine-tuned with prompt-based feedback to enable efficient failure reasoning, i.e., effective reflection on its own mistakes.
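
To make the two-stage recipe concrete, below is a minimal PyTorch sketch of the first stage, feature distillation from the privileged agent. The function, module, and batch-key names (sensor_agent, privileged_agent, gt_waypoints, etc.), as well as the exact feature-matching loss, are illustrative assumptions rather than the released implementation.

import torch
import torch.nn.functional as F

def distillation_step(sensor_agent, privileged_agent, batch, optimizer, lam=1.0):
    # One illustrative training step for stage 1 (privileged-agent distillation).
    # The sensorimotor agent sees only the front-camera image and ego-state prompt;
    # the privileged agent additionally receives ground-truth environment tokens.
    # Both are assumed to return (waypoints, hidden_features).
    with torch.no_grad():  # the privileged teacher is assumed frozen here
        _, teacher_feats = privileged_agent(batch["image"], batch["privileged_prompt"])

    pred_wps, student_feats = sensor_agent(batch["image"], batch["sensor_prompt"])

    # Supervise waypoints with L1 and match intermediate features to the teacher.
    wp_loss = F.l1_loss(pred_wps, batch["gt_waypoints"])
    distill_loss = F.l1_loss(student_feats, teacher_feats)
    loss = wp_loss + lam * distill_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()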

An End-to-End LLM-based Driver

We closely follow the MLLM architecture of LLaVA, with an additional waypoint prediction head consisting of a multi-layer perceptron. This enables FeD to be initialized from pre-trained LLaVA-7B weights, benefiting from the large and diverse image-text corpus used to train MLLMs. However, we find off-the-shelf LLaVA to perform poorly on intricate spatial reasoning tasks, which we address below.
  1. Token Prediction Mechanism: We note that our proposed architecture does not leverage generative sequence prediction as in most related approaches, but instead draws inspiration from more efficient methodologies based on masked token prediction.
  2. Vision Encoder: The front camera image is processed by a CLIP ViT vision encoder, whose output image features are converted into visual embeddings by a trainable projection matrix.
  3. Language Encoder: Given the language prompt, we first compute language embeddings, which are concatenated with the visual embeddings. We then encode the concatenated embeddings with an LLM.
  4. Waypoint Prediction Head: We directly compute the waypoints from the output embeddings of the waypoint tokens (see the sketch after this list). This bypasses the need for recursive token-by-token generation and expensive sampling strategies such as beam search, leading to a more efficient inference process.
  5. Prompt Design for Sensorimotor Agent: We wrap the ego-vehicle speed v and short-term goal g with flag tokens indicating the beginning and end of each text span. We further provide the categorical driving command as natural language.
  6. Prompt Design for Privileged Agent: For the privileged agent, we additionally provide parameterized environmental information.
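
As a concrete illustration of items 2-4, the sketch below shows one way the single-pass forward computation could be organized: CLIP image features are projected into the LLM embedding space, concatenated with the prompt embeddings and the introduced waypoint tokens, and the LLM output embeddings at the waypoint-token positions are decoded by an MLP head. All module names, shapes, and the HuggingFace-style inputs_embeds interface are assumptions for illustration, not the authors' released code.

import torch
import torch.nn as nn

class WaypointLLMDriver(nn.Module):
    # Illustrative FeD-style forward pass (all names and shapes are assumptions).

    def __init__(self, vision_encoder, llm, embed_dim, n_waypoints=4):
        super().__init__()
        self.vision_encoder = vision_encoder  # e.g., a CLIP ViT backbone
        self.llm = llm  # decoder-only LLM assumed to accept inputs_embeds (HuggingFace-style)
        self.proj = nn.Linear(vision_encoder.out_dim, embed_dim)  # trainable projection
        # Learnable embeddings for the introduced waypoint tokens.
        self.waypoint_tokens = nn.Parameter(torch.randn(n_waypoints, embed_dim))
        # MLP head mapping each waypoint-token output embedding to an (x, y) waypoint.
        self.waypoint_head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim), nn.GELU(), nn.Linear(embed_dim, 2)
        )

    def forward(self, image, prompt_embeds):
        # 1) Encode the front-camera image and project it into the LLM token space.
        vis = self.proj(self.vision_encoder(image))                # (B, N_img, D)
        # 2) Append the waypoint tokens after the image and text prompt embeddings.
        wp = self.waypoint_tokens.unsqueeze(0).expand(image.shape[0], -1, -1)
        tokens = torch.cat([vis, prompt_embeds, wp], dim=1)
        # 3) A single forward pass through the LLM; no autoregressive sampling.
        hidden = self.llm(inputs_embeds=tokens).last_hidden_state
        # 4) Decode waypoints directly from the waypoint-token output embeddings.
        return self.waypoint_head(hidden[:, -wp.shape[1]:, :])     # (B, n_waypoints, 2)

For the prompt itself, the sensorimotor agent's input could look roughly like "<speed> 6.2 </speed> <goal> (5.1, 12.3) </goal> Command: turn left at the next intersection.", with the privileged agent's prompt additionally listing parameterized surrounding-object states; the exact flag tokens and wording are assumptions here, following the design described above.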

Feedback-Guided Fine-tuning

We propose to incorporate feedback fine-tuning by leveraging fine-grained textual feedback regarding waypoint prediction errors. This enables the sensorimotor agent to effectively learn from experience, including failures, which can provide a highly informative supervision signal. In FeD, we guide the waypoint predictions with structured critique and reasoning provided as language prompts. Given the ground-truth surrounding object states and the original waypoint predictions, we define a rich taxonomy over five failure cases and generate a corresponding feedback prompt for each failure case. To ensure the agent's ability to both generate informative failure feedback and rectify mispredicted waypoints, we supervise the model by applying a Cross-Entropy (CE) loss on the LLaMA outputs over the generated language feedback, together with an L1 loss over the corrected waypoints. The optimization objective for the proposed feedback fine-tuning procedure is hence a weighted sum of the waypoint L1 loss and the language CE loss.
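
As a rough sketch of this objective, assuming the fine-tuning forward pass returns language logits over the feedback text alongside the corrected waypoints (the function signature and the loss weight below are illustrative assumptions):

import torch.nn.functional as F

def feedback_finetune_loss(lm_logits, feedback_token_ids, pred_waypoints,
                           gt_waypoints, lambda_wp=1.0):
    # Weighted sum of the language CE loss over the generated feedback text
    # and the L1 loss over the corrected waypoints (the weight is an assumption).
    ce = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)),  # (B*T, vocab_size)
        feedback_token_ids.reshape(-1),             # (B*T,)
        ignore_index=-100,                          # mask out non-feedback positions
    )
    l1 = F.l1_loss(pred_waypoints, gt_waypoints)
    return ce + lambda_wp * l1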

Results

In our experiments, we demonstrate state-of-the-art performance in both open-loop and closed-loop evaluation settings, as shown in the tables below, improving over prior methods by over 16% and particularly benefiting from the additional autogenerated language-based feedback. Notably, FeD reduces infractions by over 33%, with almost zero collisions with objects in CARLA.

Quantitative Evaluation in CARLA

Open-Loop Evaluation on nuScenes

Qualitative Examples

BibTeX

@inproceedings{zhang2024feedback,
  title={Feedback-Guided Autonomous Driving},
  author={Zhang, Jimuyang and Huang, Zanming and Ray, Arijit and Ohn-Bar, Eshed},
  booktitle={CVPR},
  year={2024}
}

Acknowledgments

We thank the Red Hat Collaboratory for supporting this research.