HyFL-CLIP: Hyperbolic Fine-Tuning of CLIP for Robust Long-Context Understanding

1 Dept. of Electrical and Computer Engineering, 2 IPAI, 3 INMC & AIIS
Seoul National University, Korea
ECCV 2026
Keywords: Long-context CLIP, Hyperbolic fine-tuning, Robust retrieval, Caption perturbation

Motivation

Robust long-context retrieval under caption perturbation

Long-context captions often preserve the same overall semantics even when sentences are reordered, summarized, or partially omitted. Existing long-context CLIP variants can still fail under these semantic-preserving perturbations because Euclidean contrastive learning mainly enforces strict one-to-one matching. HyFL-CLIP addresses this by modeling part–whole relationships between long descriptions, their constituent short textual components, and images in hyperbolic space.
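To make the perturbations concrete, the following is a minimal sketch of sentence reordering and partial omission applied to a long caption. It is illustrative only and is not the evaluation protocol of the paper: the helper names (`split_sentences`, `reorder`, `omit`), the punctuation-based sentence splitter, the `keep_ratio` parameter, and the example caption are all assumptions. Summarization is left out because it requires a separate language model.

```python
import random
import re


def split_sentences(caption):
    """Split a caption on sentence-ending punctuation (a simple heuristic)."""
    parts = re.split(r"(?<=[.!?])\s+", caption.strip())
    return [p for p in parts if p]


def reorder(caption, seed=0):
    """Shuffle the sentence order while keeping every sentence."""
    sentences = split_sentences(caption)
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)


def omit(caption, keep_ratio=0.5, seed=0):
    """Drop a random subset of sentences, preserving the original order."""
    sentences = split_sentences(caption)
    n_keep = max(1, round(len(sentences) * keep_ratio))
    kept = sorted(random.Random(seed).sample(range(len(sentences)), n_keep))
    return " ".join(sentences[i] for i in kept)


# Hypothetical long caption used only for illustration.
caption = (
    "A cyclist rides down a wet city street. "
    "Parked bicycles line the sidewalk. "
    "Tall buildings rise in the background. "
    "The sky is overcast."
)
reordered = reorder(caption)
shortened = omit(caption, keep_ratio=0.5)
```

A retrieval model that is robust in the sense described above should rank the same image highly for `caption`, `reordered`, and `shortened`.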

Abstract

The recent success of CLIP has established it as a de facto paradigm for aligning images and text. However, handling long-context descriptions beyond 77 tokens remains challenging because CLIP relies on absolute positional encodings and is trained mostly on short captions. In long contexts, sentences may be reordered, summarized, or partially omitted, and existing methods that extend positional encodings can still suffer from degraded image-text alignment under these perturbations.

We propose HyFL-CLIP, a hyperbolic fine-tuning framework that distills the well-established image-text alignment of Euclidean CLIP into hyperbolic space via cross-manifold similarity distillation. HyFL-CLIP introduces hierarchical semantic modeling that connects summarized token-wise features, long-context descriptions, and their constituent short textual components with the visual modality. By capturing part–whole relationships through hyperbolic entailment with Einstein midpoint aggregation, HyFL-CLIP achieves robust long-context understanding across cross-modal retrieval, caption perturbation robustness, text-to-text retrieval, and text-to-image generation with SDXL.

Method

Overview of HyFL-CLIP framework

Overview of HyFL-CLIP. The framework transfers Euclidean image-text alignment into hyperbolic space through short-text guided cross-manifold similarity distillation. Then, it optimizes long text-image pairs with a hyperbolic geodesic contrastive loss and models semantic abstraction through hierarchical entailment. Einstein midpoint aggregation summarizes token-wise information within each modality, while an entropy regularizer stabilizes the embedding distribution.
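As a rough illustration of Einstein midpoint aggregation, the sketch below averages token embeddings in hyperbolic space. It assumes the Poincaré ball with curvature -1 and computes the midpoint in Klein coordinates, which is the standard textbook definition; the page does not state which hyperbolic model or curvature HyFL-CLIP uses, so these choices, the function names, and the toy 2-D points are assumptions.

```python
import math


def poincare_to_klein(p):
    """Map a Poincare-ball point to the Klein model (curvature -1)."""
    sq = sum(v * v for v in p)
    return [2.0 * v / (1.0 + sq) for v in p]


def klein_to_poincare(k):
    """Map a Klein-model point back to the Poincare ball."""
    sq = sum(v * v for v in k)
    return [v / (1.0 + math.sqrt(1.0 - sq)) for v in k]


def einstein_midpoint(points, weights=None):
    """Weighted Einstein midpoint of points given in the Poincare ball.

    Each point is weighted by its Lorentz factor in Klein coordinates,
    so points closer to the boundary pull the midpoint more strongly.
    """
    if weights is None:
        weights = [1.0] * len(points)
    klein = [poincare_to_klein(p) for p in points]
    gammas = [1.0 / math.sqrt(1.0 - sum(v * v for v in k)) for k in klein]
    coeffs = [g * w for g, w in zip(gammas, weights)]
    total = sum(coeffs)
    dim = len(points[0])
    mid = [sum(c * k[d] for c, k in zip(coeffs, klein)) / total for d in range(dim)]
    return klein_to_poincare(mid)


# Hypothetical token embeddings inside the unit ball.
tokens = [[0.1, 0.2], [0.6, -0.1], [-0.3, 0.4]]
summary = einstein_midpoint(tokens)
```

The result stays inside the unit ball by construction, since it is a convex combination in Klein coordinates, which is one reason this midpoint is a convenient choice for summarizing token-wise features.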

The final objective combines distillation, geodesic contrastive learning, hierarchical entailment, and entropy regularization: \[ \mathcal{L} = \lambda_1\mathcal{L}_{\mathrm{distill}} + \lambda_2\mathcal{L}_{\mathrm{itc}} + \lambda_3\mathcal{L}_{\mathrm{ent}} + \lambda_4\mathcal{L}_{\mathrm{reg}}. \]
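The sketch below shows one plausible form of a geodesic contrastive term and the weighted sum above. It assumes the Poincaré ball with curvature -1, a symmetric InfoNCE loss whose logits are negative geodesic distances, and placeholder values for the weights; the exact loss definitions, temperature, and weights used by HyFL-CLIP are not given on this page, so every name and number here is an assumption.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance in the Poincare ball (curvature -1)."""
    sq_x = sum(v * v for v in x)
    sq_y = sum(v * v for v in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))


def _cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i should select column i."""
    loss = 0.0
    for idx, row in enumerate(logits):
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        loss += log_z - row[idx]
    return loss / len(logits)


def geodesic_contrastive_loss(images, texts, temperature=0.1):
    """Symmetric InfoNCE with logits = -distance / temperature."""
    logits = [
        [-poincare_distance(i, t) / temperature for t in texts] for i in images
    ]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diagonal(logits) + _cross_entropy_diagonal(transposed))


def total_objective(losses, lambdas):
    """Weighted sum of the four loss terms."""
    return sum(lambdas[name] * losses[name] for name in lambdas)


# Hypothetical embeddings: matched pairs sit close together.
images = [[0.5, 0.0], [-0.5, 0.0]]
texts = [[0.45, 0.05], [-0.45, 0.05]]
itc = geodesic_contrastive_loss(images, texts)

# Placeholder loss values and weights, not the paper's settings.
losses = {"distill": 1.0, "itc": itc, "ent": 0.3, "reg": 0.05}
lambdas = {"distill": 1.0, "itc": 1.0, "ent": 0.1, "reg": 0.01}
loss = total_objective(losses, lambdas)
```

Using negative distance as the logit means closer image-text pairs receive higher scores, mirroring the role of cosine similarity in Euclidean CLIP.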

Quantitative Results

Table 1: Comparison of zero-shot long-caption cross-modal retrieval. HyFL-CLIP achieves the best results in most settings across datasets and model architectures. The best and second-best results are highlighted in bold and underline, respectively.

Backbone Model DOCCI DCI Long-DCI Urban-1k
(each dataset: I2T T2I)
ViT-B/16 Long-CLIP 63.10 71.49 59.88 61.28 42.21 48.38 79.40 79.60
TULIP 50.20 50.60 88.10 86.60
HiMo-CLIP* 77.37 79.35 71.09 69.93 58.59 57.00 89.20 89.20
FineLIP* 77.16 79.14 69.38 68.03 57.18 55.22 89.30 86.90
LongD-CLIP 87.20 87.30
SmartCLIP 77.40 78.00 64.90 64.00 53.40 52.80 90.00 87.40
Fix-CLIP 59.70 63.00 80.90 81.10
HyFL-CLIP (Ours) 78.41 81.12 71.54 71.79 59.00 58.75 91.80 91.10
ViT-L/14 Long-CLIP 66.78 78.61 64.13 67.83 46.55 54.25 82.40 86.20
TULIP 77.90 79.10 55.70 56.40 90.10 91.10
HiMo-CLIP 82.35 84.59 74.59 74.54 62.06 61.94 93.00 93.20
FineLIP 82.20 83.10 60.80 60.70 93.20 93.00
LongD-CLIP 91.90 90.80
SmartCLIP 81.60 82.50 68.20 69.80 57.60 58.50 93.30 90.10
Fix-CLIP 65.10 66.70 86.80 87.70
HyFL-CLIP (Ours) 82.12 85.39 74.74 76.19 61.92 63.93 94.60 94.30

Rows listing fewer than eight values report results on only a subset of the datasets.

* indicates results from our implementation; indicates checkpoints provided by the original authors.

Embeddings and Token Weight Visualization

Embeddings and token weight visualization. We visualize the embedding distributions of the image, long-text representation, and text summary token using HoroPCA, and compare token-level contribution weights computed from their similarity to the image. HyFL-CLIP keeps semantically related image and text representations well aligned in hyperbolic space, while assigning larger weights to visually grounded tokens such as street, bicycles, and city.
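One simple way to turn similarity to the image into token-level weights is a softmax over negative geodesic distances, sketched below. This is an assumption for illustration: the page does not specify the similarity function or normalization behind the visualization, and the Poincaré ball with curvature -1, the `temperature` parameter, and the toy embeddings are hypothetical.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance in the Poincare ball (curvature -1)."""
    sq_x = sum(v * v for v in x)
    sq_y = sum(v * v for v in y)
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))


def token_weights(image, tokens, temperature=1.0):
    """Softmax over negative distances: closer tokens get larger weights."""
    scores = [-poincare_distance(image, t) / temperature for t in tokens]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical 2-D embeddings; "street" and "city" sit near the image.
image = [0.4, 0.3]
tokens = {"street": [0.38, 0.28], "the": [-0.1, 0.05], "city": [0.3, 0.35]}
weights = dict(zip(tokens, token_weights(image, list(tokens.values()))))
```

Under this sketch, a visually grounded token such as `street` receives a larger weight than a function word such as `the`, simply because its embedding lies closer to the image embedding.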

BibTeX

@inproceedings{Jang2026hyflclip,
  title     = {HyFL-CLIP: Hyperbolic Fine-Tuning of CLIP for Robust Long-Context Understanding},
  author    = {Ji Ha Jang and Hayeon Kim and Chulwon Lee and Junghun James Kim and Se Young Chun},
  booktitle = {European Conference on Computer Vision},
  year      = {2026}
}