Looking Beyond Text:

Reducing Language Bias in Large Vision-Language Models via Multimodal Dual-Attention and Soft-Image Guidance

Peking University, UIUC, THU
* Equal Contribution
† Corresponding Authors

Abstract

Large vision-language models (LVLMs) have achieved impressive results on a wide range of vision-language tasks. However, despite this promising performance, LVLMs suffer from hallucinations caused by language bias, which diminishes their focus on images and leads to ineffective visual comprehension. We identify two primary reasons for this bias: (1) the different scales of training data between the LLM pretraining stage and the multimodal alignment stage, and (2) the inference bias learned from the short-term dependency of text data. Therefore, we propose LACING, a systematic framework designed to address the language bias of LVLMs with a muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (SIG). Specifically, MDA introduces a parallel dual-attention mechanism that enhances the integration of visual inputs across the model. SIG introduces a learnable soft visual prompt, used during training and inference to replace the visual inputs and thereby compel LVLMs to prioritize text inputs. Building on this prompt, SIG further proposes a novel decoding strategy that mitigates the model's over-reliance on adjacent text inputs. Comprehensive experiments demonstrate that our method effectively debiases LVLMs from their language bias, enhancing visual comprehension and reducing hallucinations without requiring additional training resources or data.

LACING

LVLMs often produce erroneous or hallucinatory responses that are irrelevant to the input images. The main reason behind these hallucinations is referred to as language bias: the models sometimes “ignore” the visual input and generate the response based solely on the text inputs. We suggest that this bias potentially emerges for the following two reasons:
(1) The learned inference bias due to the short-term dependency of text data.
(2) The different scales of training data between the LLM pretraining stage and the multimodal alignment stage.

Therefore, we propose LACING, a systematic framework designed to address the language bias of LVLMs with the muLtimodal duAl-attention meChanIsm (MDA) aNd soft-image Guidance (SIG).
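MDA is described above only at a high level: a parallel dual-attention mechanism that strengthens the integration of visual inputs across the model. The exact formulation is given in the paper; the snippet below is a minimal, hypothetical sketch of one way a parallel visual/textual attention split could look, where attention scores are normalized separately over visual and textual keys so that text tokens cannot crowd out the image tokens. The function name, the mask handling, and the simple summation of the two context streams are illustrative assumptions (causal masking and multi-head plumbing are omitted for brevity).

```python
import torch
import torch.nn.functional as F

def dual_attention(q, k, v, is_visual, scale):
    """One attention head with separate normalization over visual and textual keys.

    q, k, v:    (seq_len, head_dim) query/key/value projections
    is_visual:  (seq_len,) bool mask, True at image-token positions
    scale:      typically 1 / sqrt(head_dim)
    """
    scores = (q @ k.T) * scale                      # (seq_len, seq_len)

    # Normalize attention separately over the visual and the textual key sets.
    vis_scores = scores.masked_fill(~is_visual, float("-inf"))
    txt_scores = scores.masked_fill(is_visual, float("-inf"))

    vis_ctx = F.softmax(vis_scores, dim=-1) @ v     # context from image tokens only
    txt_ctx = F.softmax(txt_scores, dim=-1) @ v     # context from text tokens only

    # Illustrative combination: simply sum the two context streams.
    return vis_ctx + txt_ctx

# Example: 576 image tokens followed by 64 text tokens, head_dim = 128
L, d = 576 + 64, 128
q, k, v = (torch.randn(L, d) for _ in range(3))
is_visual = torch.arange(L) < 576
out = dual_attention(q, k, v, is_visual, scale=d ** -0.5)   # (L, d)
```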



Comparison of attention allocation between a standard LVLM (LLaVA-1.5) and our model trained with the Multimodal Dual-Attention (MDA) mechanism.
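One simple way to quantify the attention allocation contrasted in this figure (an illustrative metric, not necessarily the exact one used for the plot) is to average, over layers, heads, and query positions, the attention mass that falls on image-token key positions:

```python
import torch

def visual_attention_share(attn, is_visual):
    """Fraction of attention mass assigned to image tokens.

    attn:      (num_layers, num_heads, seq_len, seq_len) post-softmax attention weights
    is_visual: (seq_len,) bool mask marking image-token key positions
    """
    vis_mass = attn[..., is_visual].sum(dim=-1)   # (layers, heads, queries)
    return vis_mass.mean().item()                 # scalar in [0, 1]
```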

Experiments

We evaluate the effectiveness of our proposed method across various benchmarks.

The baseline models are categorized into three groups:

  1. LVLMs obtained after multimodal alignment training of foundational LLMs and the visual encoders
  2. Training-free methods designed to mitigate hallucinations in LVLMs
  3. Reinforcement learning-based methods aimed at aligning LVLM outputs with human intentions

Comparison of baselines across multiple benchmarks

Analysis Results

Effect of Soft-Image Guidance from the Decoding Perspective.


Comparison of SIG with training-free methods designed to mitigate hallucinations across different decoding strategies.

Our analysis reveals that, in contrast to previous training-free approaches, which exhibit performance gains exclusively under Nucleus Sampling, SIG shows consistent improvements under both Greedy Decoding and Nucleus Sampling across all benchmarks.
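The exact SIG decoding rule is detailed in the paper; the snippet below is only a hedged, contrastive-style sketch of how soft-image guidance might adjust next-token logits, the point being that the same guided logits can then feed either greedy decoding or nucleus sampling. The guidance formula, the `alpha` scale, and the function names are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sig_guided_logits(logits_with_image, logits_with_soft_prompt, alpha=1.0):
    """Contrastive-style guidance using the learnable soft visual prompt.

    Hypothetical formulation: the branch conditioned on the soft visual prompt
    approximates "image-free" behaviour, so subtracting it (scaled by alpha)
    from the image-conditioned branch amplifies the real visual contribution.
    """
    return (1 + alpha) * logits_with_image - alpha * logits_with_soft_prompt

def sample_next_token(logits, strategy="greedy", top_p=0.9):
    """The same guided logits can feed either greedy decoding or nucleus sampling."""
    if strategy == "greedy":
        return logits.argmax(dim=-1)
    probs = F.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = probs.sort(descending=True, dim=-1)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative - sorted_probs < top_p            # tokens inside the nucleus
    sorted_probs = sorted_probs * keep
    sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx.gather(-1, choice).squeeze(-1)
```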

How do LVLMs treat visual inputs with the Multimodal Dual-Attention mechanism?


Performance comparison between LLaVA-1.5 and models trained with MDA, with and without FastV.

The comparison reveals that employing FastV in the model with MDA leads to a substantial performance drop. This finding indicates that the model indeed leverages visual inputs across all layers, not just the shallow ones.
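FastV accelerates inference by dropping a large fraction of visual tokens after a shallow layer, on the premise that deeper layers barely attend to them; the performance drop observed here suggests the MDA model violates that premise. The snippet below is a rough, illustrative re-implementation of that pruning idea (not the official FastV code), keeping only the visual tokens that attract the most attention:

```python
import torch

def fastv_style_prune(hidden_states, attn_to_visual, is_visual, keep_ratio=0.5):
    """Drop the least-attended visual tokens before the deeper layers.

    hidden_states:  (seq_len, hidden_dim) activations entering the next layer
    attn_to_visual: (seq_len,) average attention each token received in the
                    previous layer (only visual positions are used)
    is_visual:      (seq_len,) bool mask for image-token positions
    """
    vis_idx = is_visual.nonzero(as_tuple=True)[0]
    num_keep = max(1, int(keep_ratio * vis_idx.numel()))

    # Rank visual tokens by the attention they attract; keep only the top ones.
    top = attn_to_visual[vis_idx].topk(num_keep).indices

    keep_mask = ~is_visual                # keep all text tokens
    keep_mask[vis_idx[top]] = True        # plus the most-attended visual tokens
    return hidden_states[keep_mask]
```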

Ablation Study

Ablation study




Case Studies

Case Studies of LACING on LLaVABench.

Comparison of Attention Allocation with Standard LVLMs


Acknowledgement

This website is adapted from ArxivCap, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.