https://arxiv.org/abs/2408.16314
ResVG: Enhancing Relation and Semantic Understanding in Multiple Instances for Visual Grounding
Visual grounding aims to localize the object referred to in an image based on a natural language query. Although progress has been made recently, accurately localizing target objects within multiple-instance distractions (multiple objects of the same categ..
(Methodology..