The conventional data diagnosis and augmentation pipeline begins with an original (biased) dataset. Existing methods address these biases via object frequency calibration [52], metadata analysis [8], or traditional augmentation techniques [59, 5]. In contrast, our framework models visual data as a knowledge graph of concepts, with orange nodes representing classes and blue nodes representing concepts, facilitating a systematic diagnosis of class-concept imbalances for debiasing object co-occurrences in vision datasets.

Blog

Visual Data Diagnosis and Debiasing with Concept Graphs

March 6, 2025

By Rwiddhi Chakraborty, PhD Candidate at SFI Visual Intelligence/UiT The Arctic University of Norway.

Modern deep learning classifiers frequently pick up spurious correlations in the training data. This reliance leads to poor performance on downstream generalization tasks.

The dominant mode in research today is to debias the model rather than the data it trains on. The dataset is considered to be a static, inflexible object that exists only as an input for a task.

In this work, we adopt a different view: we recognise that the dataset is the most important component for any work on generalization, and that the model and the dataset exist in a co-dependent setup, where debiasing one should automatically debias the other.

In this paper, we address the issue of data diagnosis, i.e. directly analyzing the dataset for inherent biases, rather than using the model that trains on it as a proxy. We propose a novel end-to-end framework called ConBias that diagnoses the data for spurious correlations, discovers imbalanced (biased) concept correlations, and generates a synthetic dataset using a uniform distribution of the learned concepts, effectively debiasing the data.
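To make the diagnosis step concrete, here is a minimal sketch of a class-concept co-occurrence graph, stored as weighted edges between class nodes and concept nodes. It is not the ConBias implementation: the toy annotations, the function names `build_class_concept_graph` and `concept_imbalance`, and the max-minus-min imbalance score are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy annotations: (class label, set of concepts seen in the image).
images = [
    ("landbird", {"tree", "grass"}),
    ("landbird", {"tree"}),
    ("landbird", {"tree", "sky"}),
    ("waterbird", {"water", "sky"}),
    ("waterbird", {"water"}),
    ("waterbird", {"tree"}),
]


def build_class_concept_graph(samples):
    """Return edge weights {(class, concept): count} of a bipartite graph."""
    edges = Counter()
    for label, concepts in samples:
        for concept in concepts:
            edges[(label, concept)] += 1
    return edges


def concept_imbalance(edges):
    """For each concept, return the gap between its most and least frequent class.

    This max-minus-min score is an illustrative choice, not the paper's metric.
    """
    classes = sorted({cls for cls, _ in edges})
    concepts = sorted({concept for _, concept in edges})
    gaps = {}
    for concept in concepts:
        counts = [edges.get((cls, concept), 0) for cls in classes]
        gaps[concept] = max(counts) - min(counts)
    return gaps


edges = build_class_concept_graph(images)
gaps = concept_imbalance(edges)

# Print concepts from most to least imbalanced across classes.
for concept, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(concept, gap)
```

In this toy data, "tree" co-occurs far more often with one class than the other, which is the kind of class-concept imbalance a graph-based diagnosis is meant to surface, while "sky" is evenly spread and would not be flagged.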

Our results show that training a base classifier on the augmented, debiased dataset yields state-of-the-art performance across a variety of benchmark datasets and tasks. These tasks include single- and multi-shortcut mitigation, out-of-distribution robustness, and downstream generalization.

We are optimistic that approaches that represent data as graphs of biases will become more popular in the future.

Illustration provided by Rwiddhi Chakraborty (UiT).

Publication

Visual Data Diagnosis and Debiasing with Concept Graphs

September 26, 2024

Chakraborty, Rwiddhi; Wang, Yinong; Gao, Jialu; Zheng, Runkai; Zhang, Cheng; De la Torre, Fernando

Paper abstract

The widespread success of deep learning models today is owed to the curation of extensive datasets significant in size and complexity. However, such models frequently pick up inherent biases in the data during the training process, leading to unreliable predictions. Diagnosing and debiasing datasets is thus a necessity to ensure reliable model performance. In this paper, we present CONBIAS, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets. CONBIAS represents visual datasets as knowledge graphs of concepts, enabling meticulous analysis of spurious concept co-occurrences to uncover concept imbalances across the whole dataset. Moreover, we show that by employing a novel clique-based concept balancing strategy, we can mitigate these imbalances, leading to enhanced performance on downstream tasks. Extensive experiments show that data augmentation based on a balanced concept distribution augmented by CONBIAS improves generalization performance across multiple datasets compared to state-of-the-art methods.
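The abstract does not spell out the clique-based balancing strategy, so the following is only a rough sketch of one way such a step could look, under stated assumptions: a "clique" is taken to be a group of `k` concepts that appear together in an image, and balancing means bringing every class up to the highest per-class count for that group. The toy data and the names `clique_counts` and `balancing_plan` are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy annotations: (class label, set of concepts seen in the image).
samples = [
    ("landbird", {"tree", "grass"}),
    ("landbird", {"tree", "grass"}),
    ("landbird", {"tree", "grass", "sky"}),
    ("waterbird", {"tree", "grass"}),
    ("waterbird", {"water", "sky"}),
]


def clique_counts(samples, k):
    """Count how often each k-concept group co-occurs with each class."""
    counts = Counter()
    for label, concepts in samples:
        # Sorting gives every concept group one canonical tuple form.
        for clique in combinations(sorted(concepts), k):
            counts[(label, clique)] += 1
    return counts


def balancing_plan(samples, k=2):
    """Return {(class, clique): images to add} so each clique is uniform over classes.

    Assumption: the target for a clique is its largest per-class count.
    """
    counts = clique_counts(samples, k)
    classes = sorted({label for label, _ in samples})
    cliques = sorted({clique for _, clique in counts})
    plan = {}
    for clique in cliques:
        per_class = {cls: counts.get((cls, clique), 0) for cls in classes}
        target = max(per_class.values())
        for cls, n in per_class.items():
            if n < target:
                plan[(cls, clique)] = target - n
    return plan


plan = balancing_plan(samples, k=2)
for (cls, clique), deficit in sorted(plan.items()):
    print(f"generate {deficit} image(s) of class {cls!r} with concepts {clique}")
```

The resulting plan only lists under-represented class and concept-group pairs; in a full pipeline, each entry would drive the synthesis of new images, which is outside the scope of this sketch.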
