Enhanced OoD Detection through Cross-Modal Alignment of Multi-modal Representations

Seoul National University of Science and Technology
🎉 CVPR 2025 🎉
Teaser image

Bridging the modality gap between image and text embeddings via cross-modal alignment

Abstract

Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance.

Although some recent works have demonstrated the impact of fine-tuning methods on OoDD, there remains significant room for performance improvement. We investigate the limitations of naive fine-tuning methods, examining why they fail to fully leverage pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings.

To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.

The Key Idea: Cross-Modal Alignment

Figure 1. ZS | Figure 2. FLYP | Figure 3. CMA

The figures above show DOSNES visualizations of embeddings from the ImageNet-1k validation set (ID) and the MOS benchmark datasets (OoD) for zero-shot CLIP (ZS), CLIP-style fine-tuning (FLYP), and our method (CMA). Blue and orange denote ID image and ID text embeddings, respectively, while green and red denote OoD image and OoD text embeddings.

As shown in the figures, the pretrained image-text representations exhibit a clear modality gap, and naive CLIP-style fine-tuning (FLYP) does not resolve it.

The key idea behind CMA has two components: 1) aligning the image and text modalities of the ID data so that negative text embeddings are effectively separated, and 2) preserving the correspondence between matching ID image-text pairs to maintain ID accuracy. To implement these ideas, we combine the standard contrastive loss for each modality with an additional CMA regularization loss:

\[ \mathcal{L}^k_{\text{imageCMA}} = -\log \sum_{j=1}^{B} \exp\bigl(i_k \cdot t_j / \tau\bigr), \quad \mathcal{L}^k_{\text{textCMA}} = -\log \sum_{j=1}^{B} \exp\bigl(i_j \cdot t_k / \tau\bigr), \]
\[ \mathcal{L}_{\text{CMA}} = \mathcal{L}_{\text{CLIP}} + \frac{\lambda}{2B} \sum_{k=1}^{B} \bigl(\mathcal{L}^k_{\text{imageCMA}} + \mathcal{L}^k_{\text{textCMA}}\bigr). \]

The proposed CMA losses work by globally increasing the similarity between ID image and text pairs, causing the ID modalities to be well-aligned on the hypersphere. As a result, the separability from the pretrained negative concepts can be enhanced, allowing for more effective OoDD.
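
A minimal PyTorch sketch of this objective is given below, assuming CLIP-style image and text encoders whose outputs are unit-normalized; the function name, temperature tau, and regularization weight lam are illustrative, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def cma_objective(image_emb, text_emb, tau=0.01, lam=0.1):
    """Sketch of L_CMA = L_CLIP + (lambda / 2B) * sum_k (L^k_imageCMA + L^k_textCMA).

    image_emb, text_emb: (B, d) paired ID embeddings from the two encoders.
    tau and lam are illustrative hyperparameter values.
    """
    # Project embeddings onto the unit hypersphere, as in CLIP.
    i = F.normalize(image_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = i @ t.T / tau                          # (B, B) pairwise similarities
    labels = torch.arange(i.size(0), device=i.device)

    # Standard symmetric CLIP contrastive loss over the batch.
    l_clip = 0.5 * (F.cross_entropy(logits, labels) +
                    F.cross_entropy(logits.T, labels))

    # CMA regularizer: -log sum_j exp(i_k . t_j / tau) and its text counterpart.
    l_image_cma = -torch.logsumexp(logits, dim=1)   # one term per image k
    l_text_cma = -torch.logsumexp(logits, dim=0)    # one term per text k
    l_cma_reg = 0.5 * (l_image_cma + l_text_cma).mean()

    return l_clip + lam * l_cma_reg
```

Minimizing the regularizer pushes every in-batch image-text similarity upward, which is what tightens cross-modal alignment on the hypersphere.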

Relation to EBMs

The proposed CMA objective is closely related to energy-based models (EBMs). Minimizing the CMA loss is equivalent to maximizing the log-likelihood of a joint distribution \( q_\theta(i, t) \) in a hyperspherical embedding space:

\[ \max_\theta \mathbb{E}_p \left[ \log q_\theta(i,t) \right] = \max_\theta \frac{1}{2} \mathbb{E}_p \left[ \log q_\theta(t|i) + \log q_\theta(i) \right] + \frac{1}{2} \mathbb{E}_p \left[ \log q_\theta(i|t) + \log q_\theta(t) \right] \]

In this formulation, the discriminative term \( \mathbb{E}_p[\log q_\theta(t|i)] \) corresponds to a contrastive loss, while the generative term \( \mathbb{E}_p[\log q_\theta(i)] \) involves marginal density estimation over image embeddings. The generative term can be written as:

\[ \mathbb{E}_p[\log q_\theta(i)] = -\mathbb{E}_p[E_\theta(i)] - \log Z(\theta) \]

The von Mises-Fisher (vMF) distribution is a probability distribution on the hypersphere in \( \mathbb{R}^p \) for \( p \)-dimensional unit vectors. Letting \( t \) be the mean direction of the vMF, the probability density function of the image embeddings is defined as \( q_\theta(i|t) = C_p(1/\tau) \exp(-E_\theta(i,t)/\tau) \), where \( C_p(1/\tau) \) denotes the normalization factor. Thus, the log marginal density of \( i \) and its empirical estimate can be written as:

\[ \log q_\theta(i) = \log \int_t q_\theta(i|t) \, q_\theta(t) \, dt \approx \log \sum_{t \in B} q_\theta(i|t) = \log \sum_{t \in B} \exp \bigl( -E_\theta(i,t)/\tau \bigr) + C \]

where \( C = \log C_p(1/\tau) \) is a constant. Therefore, maximizing \( \log q_\theta(i) \) is equivalent to minimizing the generation-aware loss \( \mathcal{L}^k_{\text{imageCMA}} \). While traditional contrastive losses only consider discriminative terms (i.e., \( \log q_\theta(t|i) \)), CMA integrates both discriminative and generative modeling for enhanced OoDD performance.
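
To make this equivalence explicit, take the energy to be the negative inner product of the unit-normalized embeddings, \( E_\theta(i, t) = -\, i \cdot t \) (an assumption consistent with the similarity-based losses above). Then the empirical estimate of \( \log q_\theta(i_k) \) over a batch satisfies

\[ \log q_\theta(i_k) \approx \log \sum_{j=1}^{B} \exp\bigl(-E_\theta(i_k, t_j)/\tau\bigr) + C = \log \sum_{j=1}^{B} \exp\bigl(i_k \cdot t_j/\tau\bigr) + C = -\mathcal{L}^k_{\text{imageCMA}} + C, \]

so maximizing it is exactly minimizing \( \mathcal{L}^k_{\text{imageCMA}} \) up to the constant \( C \); swapping the roles of \( i \) and \( t \) recovers \( \mathcal{L}^k_{\text{textCMA}} \).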

Main Results

On the ImageNet-1k OoD benchmarks (MOS and OpenOOD v1.5), our method, combined with post-hoc OoDD approaches that leverage pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods while achieving leading ID accuracy.
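
As a rough illustration of how such a post-hoc detector consumes the aligned embeddings, below is a simplified sketch of a NegLabel-style score: the softmax mass an image assigns to ID class labels versus mined negative labels. The actual NegLabel method additionally mines and groups negative labels from a large corpus, which is omitted here; the function name and temperature are ours.

```python
import torch
import torch.nn.functional as F

def neglabel_style_score(image_emb, id_text_emb, neg_text_emb, tau=0.01):
    """Simplified NegLabel-style OoDD score. Higher means more ID-like.

    image_emb:    (N, d) test image embeddings
    id_text_emb:  (K, d) ID class-name text embeddings
    neg_text_emb: (M, d) negative-label text embeddings
    """
    i = F.normalize(image_emb, dim=-1)
    t_id = F.normalize(id_text_emb, dim=-1)
    t_neg = F.normalize(neg_text_emb, dim=-1)

    # Similarities to ID and negative labels, softmaxed jointly for numerical stability.
    logits = torch.cat([i @ t_id.T, i @ t_neg.T], dim=-1) / tau
    probs = logits.softmax(dim=-1)
    return probs[:, : t_id.size(0)].sum(dim=-1)   # (N,) scores in (0, 1)
```

With better cross-modal alignment of the ID embeddings, ID images concentrate their softmax mass on the ID labels, while OoD images leak mass to the negative labels, improving separability.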


Conclusion

In this paper, we introduce cross-modal alignment (CMA), a novel multi-modal fine-tuning (MMFT) method that achieves state-of-the-art performance in both OoDD and ID accuracy. We establish a theoretical connection between CMA and EBMs by incorporating the generative term into contrastive learning. Our experimental results show how the CMA regularizer enhances the hyperspherical structure of the embedding space, reduces the modality gap, and strengthens alignment, leading to better OoD detection and ID classification. We plan to further explore the effectiveness of using auxiliary negative labels in MMFT training.

Key Contributions

  • We introduce CMA, a novel MMFT method that aligns image–text embeddings on a hypersphere to improve ID accuracy and OoDD performance.
  • We demonstrate that minimizing our objective is equivalent to maximizing the log-likelihood of a joint EBM in hyperspherical space.
  • We achieve state-of-the-art results on MOS and OpenOOD v1.5 and provide an in-depth analysis of how hyperspherical alignment enhances OoDD.

BibTeX

@article{kim2025enhanced,
  title={Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations},
  author={Kim, Jeonghyeon and Hwang, Sangheum},
  journal={arXiv preprint arXiv:2503.18817},
  year={2025}
}