Blog

March 5, 2025

Publication

PSAIR: A Neuro-Symbolic Approach to Zero-Shot Visual Grounding

September 9, 2024

Pan, Yi; Zhang, Yujia; Kampffmeyer, Michael Christian; Zhao, Xiaoguang.

Paper abstract

Supervised methods for visual grounding often require costly annotations of paired sentences and images with ground-truth boxes. Recent zero-shot approaches such as ReCLIP and ChatRef aim to avoid this annotation cost. However, these approaches rely on an inflexible detect-then-reason paradigm, which incurs a notable loss of semantic information, and they depend heavily on predefined keywords or on the potentially inconsistent reasoning capabilities of Large Language Models (LLMs). To address these limitations, we propose PSAIR, a neuro-symbolic visual grounding method that incorporates two novel mechanisms: Parallel Scoring and Active Information Retrieval. PSAIR equips an LLM with external, encapsulated reasoning functions that the LLM can invoke, ensuring a flexible and stable reasoning process. The Parallel Scoring mechanism reframes the sequential reasoning process found in prior approaches, making it robust to noise in the detection stage. The Active Information Retrieval mechanism then addresses the loss of semantic information by retrieving essential visual information on demand, resulting in a new detect-reason-retrieve paradigm. These innovations yield superior performance and robustness across three public datasets compared to recent state-of-the-art zero-shot visual grounding methods.
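To make the Parallel Scoring idea concrete, here is a minimal, purely illustrative sketch (not the paper's implementation): every detected candidate box is scored against the referring expression independently and the best one is selected, instead of filtering candidates one sequential step at a time, where a single noisy detection can eliminate the true target early. The `relation_score` helper and its weighting are hypothetical stand-ins for the encapsulated reasoning functions an LLM might invoke.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A detected candidate region with its class label and detector confidence."""
    label: str
    score: float


def relation_score(box: Box, expression: str) -> float:
    # Toy stand-in for an LLM-invoked reasoning function:
    # reward boxes whose label appears in the referring expression.
    return 1.0 if box.label in expression else 0.0


def parallel_score(boxes: list[Box], expression: str) -> Box:
    # Score all candidates in parallel (independently), then take the argmax.
    # A sequential filter would instead prune candidates step by step and
    # could discard the true target on one noisy detection.
    scores = [0.5 * b.score + 0.5 * relation_score(b, expression) for b in boxes]
    return boxes[max(range(len(boxes)), key=scores.__getitem__)]


boxes = [Box("dog", 0.9), Box("cat", 0.6), Box("chair", 0.8)]
best = parallel_score(boxes, "the cat")
# best.label == "cat": the relation score outweighs the dog's higher
# detector confidence, so a noisy high-confidence detection does not win.
```

The weighting between detector confidence and relational reasoning is an assumption of this sketch; the actual mechanism in PSAIR combines the LLM's invoked reasoning functions rather than a fixed linear blend.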