Abstract
In autonomous driving, pedestrian appearance cues from cameras can be unreliable under glare, fog, or low illumination, while LiDAR provides complementary geometric signals. Building on uncertainty-aware, CLIP-based modality modeling, this work introduces a fusion framework that integrates vision–language embeddings with LiDAR-derived shape descriptors through uncertainty-gated feature mixing. The gating module increases reliance on LiDAR when visual uncertainty rises and preserves vision–language dominance in clear conditions. Experiments are conducted on multi-sensor datasets covering 210,000 paired camera–LiDAR observations and 26,000 identities. Baselines include camera-only ReID (OSNet, TransReID), CLIP-based ReID, and naive concatenation fusion. The proposed method improves overall mAP by 3.3%–4.8%, yields larger gains of 6.0%–7.5% on low-light subsets, and adds less than 8% inference latency compared with camera-only models.
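The uncertainty-gated mixing described above can be illustrated with a minimal PyTorch sketch. The module name, layer sizes, and the scalar visual-uncertainty input are illustrative assumptions, not the authors' implementation; the sketch only shows the general idea of shifting weight toward the LiDAR descriptor as visual uncertainty grows.

```python
import torch
import torch.nn as nn

class UncertaintyGatedFusion(nn.Module):
    """Sketch: blend a vision-language embedding with a LiDAR shape
    descriptor using a gate driven by a visual-uncertainty estimate."""

    def __init__(self, vis_dim=512, lidar_dim=256, out_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, out_dim)      # project CLIP-style embedding
        self.lidar_proj = nn.Linear(lidar_dim, out_dim)  # project LiDAR shape descriptor
        # Gate maps a scalar uncertainty to a mixing weight in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid()
        )

    def forward(self, vis_emb, lidar_feat, vis_uncertainty):
        # vis_emb: (B, vis_dim), lidar_feat: (B, lidar_dim), vis_uncertainty: (B, 1)
        v = self.vis_proj(vis_emb)
        l = self.lidar_proj(lidar_feat)
        w = self.gate(vis_uncertainty)        # high uncertainty -> w near 1 -> rely on LiDAR
        fused = (1.0 - w) * v + w * l         # vision-language dominates when uncertainty is low
        return nn.functional.normalize(fused, dim=-1)

# Example usage with random tensors
fusion = UncertaintyGatedFusion()
vis, lidar, unc = torch.randn(4, 512), torch.randn(4, 256), torch.rand(4, 1)
print(fusion(vis, lidar, unc).shape)  # torch.Size([4, 512])
```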
