CDAM: Class Distribution-induced Attention Map for Open-vocabulary Semantic Segmentations

1Dept. of Electrical and Computer Engineering, 2INMC & IPAI
Seoul National University, Korea

Motivation

Similarity of class distributions between patches. We argue that CLIP-based prior works yield patch-wise noisy class predictions while having highly correlated class distributions within each object. The class distributions of patches P1 and P2, which lie on the same object, are more similar to each other than those of patches P1 and P3.

Abstract

Open-vocabulary semantic segmentation is a challenging task that assigns seen or unseen class labels to individual pixels. While recent works with vision-language models (VLMs) have shown promising results in zero-shot semantic segmentation, they still struggle to accurately localize class-related objects. In this work, we argue that CLIP-based prior works yield patch-wise noisy class predictions while having highly correlated class distributions for each object. We then propose the Class Distribution-induced Attention Map, dubbed CDAM, which is generated from the Jensen-Shannon divergence between the class distributions of two patches that belong to the same (class) object. This CDAM can be used for open-vocabulary semantic segmentation by integrating it into the final layer of CLIP, enhancing the ability to accurately localize desired classes. Our class distribution-induced attention scheme can easily work with multi-scale image patches as well as augmented text prompts to further enhance the attention maps. By exploiting the class distribution, we also propose a robust entropy-based background thresholding scheme for semantic segmentation inference. Interestingly, the core idea of our proposed method does not conflict with other prior works on zero-shot semantic segmentation, so they can be used together synergistically, yielding substantial performance improvements across popular semantic segmentation benchmarks.
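To make the distance used above concrete, here is a minimal sketch (not taken from the paper's released code) of the Jensen-Shannon divergence between the class distributions of two patches. The distributions p1, p2, p3 are hypothetical stand-ins for softmax-normalized CLIP similarity rows, chosen to mirror the P1/P2/P3 example in the Motivation above.

```python
import numpy as np

def jensen_shannon_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """JSD(p || q) between two discrete class distributions."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)  # mixture distribution
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical class distributions: P1 and P2 lie on the same object,
# P3 on a different one, so we expect JSD(P1, P2) << JSD(P1, P3).
p1 = np.array([0.6, 0.3, 0.1])
p2 = np.array([0.55, 0.35, 0.1])
p3 = np.array([0.1, 0.2, 0.7])
print(jensen_shannon_divergence(p1, p2))  # small: similar distributions
print(jensen_shannon_divergence(p1, p3))  # large: dissimilar distributions
```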

Method

The overall pipeline of our proposed CDAM. During inference, the class distribution-induced attention map (CDAM) is constructed by measuring the distance between the class distributions of each patch in the initial similarity map \(S\). The CDAM is then integrated with the last attention layer of CLIP, highlighting the class-specific regions in the input image. CDAM with multi-scale image patches and augmented text prompts can further enhance the quality of the attention map. Finally, we dynamically adjust the foreground-background threshold based on the entropy of the class distribution. A sketch of these steps follows below.
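The following is a minimal PyTorch sketch of the two pipeline steps described above, written under assumed shapes: an initial similarity map \(S\) of shape (N patches, C classes). The function names `build_cdam` and `entropy_background_threshold`, the temperature `tau`, and the mean-entropy threshold are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn.functional as F

def build_cdam(S: torch.Tensor, tau: float = 1.0, eps: float = 1e-12) -> torch.Tensor:
    """Build a class distribution-induced attention map from an initial
    patch-text similarity map S of shape (N, C).

    Returns an (N, N) map where patch pairs with similar class
    distributions (low Jensen-Shannon divergence) get high attention.
    """
    P = F.softmax(S / tau, dim=-1)                    # (N, C) class distribution per patch
    M = 0.5 * (P.unsqueeze(1) + P.unsqueeze(0))       # (N, N, C) pairwise mixtures
    logP = torch.log(P + eps)
    logM = torch.log(M + eps)
    kl_pm = (P.unsqueeze(1) * (logP.unsqueeze(1) - logM)).sum(-1)  # KL(P_i || M_ij)
    kl_qm = (P.unsqueeze(0) * (logP.unsqueeze(0) - logM)).sum(-1)  # KL(P_j || M_ij)
    jsd = 0.5 * (kl_pm + kl_qm)                       # (N, N), symmetric JSD
    return F.softmax(-jsd, dim=-1)                    # low divergence -> high attention

def entropy_background_threshold(P: torch.Tensor) -> torch.Tensor:
    """Entropy-based foreground/background split: patches whose class
    distribution has high entropy (no confident class) are treated as
    background. The mean-entropy cutoff here is an illustrative choice,
    not necessarily the rule used in the paper."""
    H = -(P * torch.log(P + 1e-12)).sum(-1)           # (N,) per-patch entropy
    return H > H.mean()                               # True where background

# Toy usage with random similarities: 196 patches (14x14 grid), 21 classes.
S = torch.randn(196, 21)
cdam = build_cdam(S)                                  # (196, 196) attention map
bg = entropy_background_threshold(F.softmax(S, dim=-1))
```

In the full method, this map is fused with the last self-attention layer of CLIP; the exact fusion rule (e.g., adding CDAM to the attention logits with a weighting factor) should be taken from the paper rather than this sketch.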

Quantitative Result

Comparison with state-of-the-art methods on benchmark datasets with a background class. We evaluate open-vocabulary semantic segmentation methods on VOC21, Context60, and COCO-Obj. SD stands for Stable Diffusion; methods marked with \(^\dagger\) were reproduced following the unified evaluation protocol with renaming tricks removed. Performance improvements from CDAM are indicated in parentheses. The evaluation metric is mIoU (\(\%\)).

Qualitative Result

Qualitative segmentation results of CDAM from inaccurate initial predictions. Our proposed CDAM demonstrates its ability to generate high-quality attention maps (\(\text{Attn}_\text{MS}\)) even when starting from inaccurate predictions produced by prior methods, which leads to significantly reduced noise in the final predictions. Notably, CDAM captures fine-grained details within images, such as the doors of a train and the fence in front of sheep.

BibTeX

@inproceedings{kangclass,
  title={Class Distribution-induced Attention Map for Open-vocabulary Semantic Segmentations},
  author={Kang, Dong Un and Kim, Hayeon and Chun, Se Young},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}