HyFL-CLIP: Hyperbolic Fine-Tuning of CLIP for Robust Long-Context Understanding

Motivation

Robust long-context retrieval under caption perturbation

Long-context captions often preserve the same overall semantics even when sentences are reordered, summarized, or partially omitted. Existing long-context CLIP variants can still fail under these semantic-preserving perturbations because Euclidean contrastive learning mainly enforces strict one-to-one matching. HyFL-CLIP addresses this by modeling part–whole relationships between long descriptions, their constituent short textual components, and images in hyperbolic space.

Abstract

The recent success of CLIP has established it as the de facto paradigm for aligning images and text. However, handling long-context descriptions beyond 77 tokens remains challenging because CLIP relies on absolute positional encodings and is trained mostly on short captions. In long contexts, sentences may be reordered, summarized, or partially omitted, and existing methods that extend positional encodings can still suffer from degraded image-text alignment under these perturbations.

We propose HyFL-CLIP, a hyperbolic fine-tuning framework that distills the well-established image-text alignment of Euclidean CLIP into hyperbolic space via cross-manifold similarity distillation. HyFL-CLIP introduces hierarchical semantic modeling that connects summarized token-wise features, long-context descriptions, and their constituent short textual components with the visual modality. By capturing part–whole relationships through hyperbolic entailment with Einstein midpoint aggregation, HyFL-CLIP achieves robust long-context understanding across cross-modal retrieval, caption perturbation robustness, text-to-text retrieval, and text-to-image generation with SDXL.

Method

Overview of HyFL-CLIP. The framework transfers Euclidean image-text alignment into hyperbolic space through short-text guided cross-manifold similarity distillation. Then, it optimizes long text-image pairs with a hyperbolic geodesic contrastive loss and models semantic abstraction through hierarchical entailment. Einstein midpoint aggregation summarizes token-wise information within each modality, while an entropy regularizer stabilizes the embedding distribution.
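As a rough illustration of the aggregation step, the sketch below computes a weighted Einstein midpoint in Klein-ball coordinates with curvature -1. The Klein model, the fixed curvature, and the function names `lorentz_factor` and `einstein_midpoint` are assumptions made for illustration; the description above does not specify which hyperbolic model or token weighting HyFL-CLIP uses.

```python
import math


def lorentz_factor(point):
    """Lorentz factor 1 / sqrt(1 - ||x||^2) of a point in the unit Klein ball.

    Assumes curvature -1; this is an illustrative choice, not taken from the source.
    """
    sq_norm = sum(v * v for v in point)
    if sq_norm >= 1.0:
        raise ValueError("point must lie strictly inside the unit ball")
    return 1.0 / math.sqrt(1.0 - sq_norm)


def einstein_midpoint(points, weights=None):
    """Weighted Einstein midpoint of points given in Klein coordinates.

    midpoint = sum_i(w_i * gamma_i * x_i) / sum_i(w_i * gamma_i),
    where gamma_i is the Lorentz factor of x_i.
    Weights default to uniform (hypothetical default).
    """
    if not points:
        raise ValueError("at least one point is required")
    if weights is None:
        weights = [1.0] * len(points)
    coeffs = [w * lorentz_factor(p) for w, p in zip(weights, points)]
    total = sum(coeffs)
    dim = len(points[0])
    return [
        sum(c * p[d] for c, p in zip(coeffs, points)) / total
        for d in range(dim)
    ]


# Toy "token" features inside the unit ball (made-up values).
tokens = [[0.9, 0.0], [0.0, 0.0], [0.1, 0.2]]
summary = einstein_midpoint(tokens)
```

Because the Lorentz factor grows toward the boundary of the ball, points far from the origin pull the midpoint more strongly than they would in a plain Euclidean average.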

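The final objective combines distillation, geodesic contrastive learning, hierarchical entailment, and entropy regularization: \[ \mathcal{L} = \lambda_1\mathcal{L}_{\mathrm{distill}} + \lambda_2\mathcal{L}_{\mathrm{itc}} + \lambda_3\mathcal{L}_{\mathrm{ent}} + \lambda_4\mathcal{L}_{\mathrm{reg}}. \] The terms appear in the order listed: \(\mathcal{L}_{\mathrm{distill}}\) is the cross-manifold similarity distillation loss, \(\mathcal{L}_{\mathrm{itc}}\) the geodesic image-text contrastive loss, \(\mathcal{L}_{\mathrm{ent}}\) the hierarchical entailment loss, and \(\mathcal{L}_{\mathrm{reg}}\) the entropy regularizer. The scalars \(\lambda_1, \dots, \lambda_4\) weight the four terms.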
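To make the structure of the objective concrete, the following is a minimal sketch of one plausible geodesic contrastive term and of the weighted sum. It assumes the unit Poincaré ball with curvature -1, a symmetric InfoNCE form with logits equal to the negative geodesic distance divided by a temperature, and placeholder loss weights; none of these choices, nor the function names, are confirmed by the description above.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance in the unit Poincare ball (curvature -1 assumed)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))


def _cross_entropy_on_diagonal(logits):
    """Mean cross-entropy where row k's correct class is column k."""
    total = 0.0
    for k, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(v - row_max) for v in row))
        total += log_z - row[k]
    return total / len(logits)


def geodesic_contrastive_loss(image_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE using -distance / temperature as logits.

    The temperature value is a hypothetical placeholder.
    """
    logits = [
        [-poincare_distance(img, txt) / temperature for txt in text_embs]
        for img in image_embs
    ]
    image_to_text = _cross_entropy_on_diagonal(logits)
    text_to_image = _cross_entropy_on_diagonal([list(c) for c in zip(*logits)])
    return 0.5 * (image_to_text + text_to_image)


def total_loss(distill, itc, ent, reg, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four terms; the default weights are placeholders."""
    l1, l2, l3, l4 = lambdas
    return l1 * distill + l2 * itc + l3 * ent + l4 * reg


# Toy batch of two image/text pairs inside the unit ball (made-up values).
images = [[0.1, 0.0], [0.0, 0.5]]
texts = [[0.1, 0.0], [0.0, 0.5]]
itc = geodesic_contrastive_loss(images, texts)
```

In this sketch the distillation, entailment, and regularization terms are passed to `total_loss` as precomputed scalars, since their exact form is not given here.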

Quantitative Results

Table 1: Comparison of zero-shot long-caption cross-modal retrieval. HyFL-CLIP achieves the best score in 14 of the 16 backbone, dataset, and retrieval-direction settings; the two exceptions are I2T on DOCCI and on Long-DCI with ViT-L/14. The best and second-best results are highlighted in bold and underline, respectively.

Backbone	Model	DOCCI I2T	DOCCI T2I	DCI I2T	DCI T2I	Long-DCI I2T	Long-DCI T2I	Urban-1k I2T	Urban-1k T2I
ViT-B/16	Long-CLIP^†	63.10	71.49	59.88	61.28	42.21	48.38	79.40	79.60
	TULIP	–	–	–	–	50.20	50.60	88.10	86.60
	HiMo-CLIP*	77.37	79.35	71.09	69.93	58.59	57.00	89.20	89.20
	FineLIP*	77.16	79.14	69.38	68.03	57.18	55.22	89.30	86.90
	LongD-CLIP	–	–	–	–	–	–	87.20	87.30
	SmartCLIP	77.40	78.00	64.90	64.00	53.40	52.80	90.00	87.40
	Fix-CLIP	–	–	59.70	63.00	–	–	80.90	81.10
	HyFL-CLIP (Ours)	78.41	81.12	71.54	71.79	59.00	58.75	91.80	91.10
ViT-L/14	Long-CLIP^†	66.78	78.61	64.13	67.83	46.55	54.25	82.40	86.20
	TULIP	77.90	79.10	–	–	55.70	56.40	90.10	91.10
	HiMo-CLIP^†	82.35	84.59	74.59	74.54	62.06	61.94	93.00	93.20
	FineLIP	82.20	83.10	–	–	60.80	60.70	93.20	93.00
	LongD-CLIP	–	–	–	–	–	–	91.90	90.80
	SmartCLIP	81.60	82.50	68.20	69.80	57.60	58.50	93.30	90.10
	Fix-CLIP	–	–	65.10	66.70	–	–	86.80	87.70
	HyFL-CLIP (Ours)	82.12	85.39	74.74	76.19	61.92	63.93	94.60	94.30

* indicates results from our implementation; ^† indicates checkpoints provided by the original authors.

Embeddings and Token Weight Visualization

Embeddings and token weight visualization. We visualize the embedding distributions of the image, long-text representation, and text summary token using HoroPCA, and compare token-level contribution weights computed from their similarity to the image. HyFL-CLIP keeps semantically related image and text representations well aligned in hyperbolic space, while assigning larger weights to visually grounded tokens such as street, bicycles, and city.
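One way such token-level weights could be derived is sketched below: each token is scored by its negative geodesic distance to the image embedding, and the scores are normalized with a softmax. The Poincaré-ball distance, the softmax normalization, the temperature, and the toy coordinates are all assumptions for illustration; the caption only states that the weights are computed from similarity to the image.

```python
import math


def poincare_distance(x, y):
    """Geodesic distance in the unit Poincare ball (curvature -1 assumed)."""
    sq_diff = sum((a - b) ** 2 for a, b in zip(x, y))
    sq_x = sum(a * a for a in x)
    sq_y = sum(b * b for b in y)
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_x) * (1.0 - sq_y)))


def token_weights(token_embs, image_emb, temperature=0.5):
    """Softmax over negative token-to-image distances.

    Tokens closer to the image receive larger weights. The temperature is a
    hypothetical placeholder.
    """
    scores = [-poincare_distance(t, image_emb) / temperature for t in token_embs]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return [e / norm for e in exps]


# Made-up 2-D embeddings: the first token sits near the image, the second far away.
image = [0.35, 0.05]
tokens = {"street": [0.30, 0.00], "the": [-0.60, 0.20]}
weights = dict(zip(tokens, token_weights(list(tokens.values()), image)))
```

With these toy coordinates the token placed near the image receives the larger weight, which mirrors the qualitative behavior described in the caption.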