[Teaser image]

We introduce two novel losses for 3D visual grounding: a visual-level offset loss on regressed vector offsets from each instance to the ground-truth referred instance, and a language-related span loss on predictions for the word-level span of the referred instance in the description. These auxiliary losses, integrated into our network AsphaltNet, aid 3D visual grounding and yield results competitive with the state of the art on the ReferIt3D benchmark.
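To make the two losses concrete, here is a minimal PyTorch sketch under simple assumptions: the offset target for each instance is taken as the difference between the referred instance's center and that instance's center, and the span is supervised with cross-entropy on predicted start/end word positions. Function and variable names are illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def offset_loss(pred_offsets, inst_centers, gt_center):
    """Visual-level offset loss (sketch).

    pred_offsets: (N, 3) regressed vectors from each instance to the referred one.
    inst_centers: (N, 3) centers of the candidate instances.
    gt_center:    (3,)   center of the ground-truth referred instance.
    """
    # Assumed target: vector pointing from each instance to the referred instance.
    target = gt_center.unsqueeze(0) - inst_centers
    return F.l1_loss(pred_offsets, target)

def span_loss(start_logits, end_logits, gt_start, gt_end):
    """Language-related span loss (sketch).

    start_logits, end_logits: (B, T) per-word logits over the description.
    gt_start, gt_end:         (B,)   ground-truth span boundary indices.
    """
    # Cross-entropy on the start and end positions of the referred span.
    return F.cross_entropy(start_logits, gt_start) + F.cross_entropy(end_logits, gt_end)
```

Both terms would be added to the main grounding objective as auxiliary losses; the exact weighting and target construction are detailed in the paper.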


[Figure: Overview of the method.]

BibTeX

@misc{dey2024finegrainedspatialverballosses,
      title={Fine-Grained Spatial and Verbal Losses for 3D Visual Grounding}, 
      author={Sombit Dey and Ozan Unal and Christos Sakaridis and Luc Van Gool},
      year={2024},
      eprint={2411.03405},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2411.03405}, 
}