AnomalyVFM - Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
Abstract
Zero-shot anomaly detection aims to detect and localise abnormal regions in an image without access to any in-domain training images. While recent approaches leverage vision–language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based purely on vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as the backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points.
Contributions
- Synthetic Dataset Generation: A new modular synthetic dataset generation scheme exploiting pretrained image generation models, such as FLUX. Upon acceptance we will also release a large dataset of synthetic images to enable faster development of future methods.
- VFM Adaptation Framework: A new framework (AnomalyVFM) that adapts pretrained transformer VFMs, leveraging the synthetic dataset and parameter-efficient finetuning, for zero-shot anomaly detection.
Synthetic Dataset Generation
Anomaly-free images are created using an image generation model and then modified via inpainting to produce anomalous versions within a targeted region. Corresponding masks are generated by comparing feature-level differences between the normal and anomalous images, which also serves to filter out samples where the defect failed to generate.
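To make the mask-derivation step concrete, here is a minimal PyTorch sketch of one plausible instantiation: patch features of the normal and inpainted image, extracted by the same frozen backbone, are compared by cosine distance, thresholded into a defect mask, and near-empty masks are used to discard pairs where generation failed. The function names and the `threshold`/`min_area` parameters are hypothetical illustrations, not values from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def defect_mask_from_features(feats_normal, feats_anomalous, threshold=0.2):
    """Derive a defect mask by comparing patch features of a normal image
    and its inpainted (anomalous) counterpart from the same frozen backbone.

    feats_normal, feats_anomalous: (C, H, W) feature maps.
    Returns a binary (H, W) mask and the raw cosine-distance map.
    """
    # Cosine distance per spatial location: ~0 where the two images agree,
    # larger where the inpainted defect changed the local appearance.
    similarity = F.cosine_similarity(feats_normal, feats_anomalous, dim=0)
    distance = 1.0 - similarity                  # (H, W)
    mask = (distance > threshold).float()
    return mask, distance

def is_valid_sample(mask, min_area=16):
    """Filter out samples where inpainting failed to produce a visible
    defect: a (near-)empty mask means the image pair is discarded."""
    return mask.sum().item() >= min_area
```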
Examples of Generated Images
AnomalyVFM
AnomalyVFM adapts a pretrained backbone by injecting LoRA-based feature adaptation modules into the transformer attention layers to refine internal representations. It utilises a convolutional decoder and a confidence-weighted loss to generate segmentation masks, a combination specifically designed to remain robust against noise in synthetic training labels.
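The two adaptation components can be sketched as follows, assuming a ViT-style backbone and standard PyTorch. The LoRA wrapper and the confidence weighting shown here (each pixel's BCE term scaled by the detached prediction confidence) are one plausible instantiation; the exact formulation used by AnomalyVFM may differ, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Wrap a frozen linear projection (e.g. the q/k/v projections inside a
    transformer attention layer) with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A and B of rank r."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def confidence_weighted_pixel_loss(logits, target):
    """Per-pixel BCE weighted by the (detached) prediction confidence, so
    pixels where the network is unsure, often those with noisy synthetic
    labels, contribute less to the gradient."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    confidence = (p - 0.5).abs() * 2.0       # 0 at the decision boundary
    return (confidence.detach() * bce).mean()
```

In training, `LoRALinear` would replace the attention projections of the frozen backbone, so only the low-rank matrices and the convolutional decoder receive gradients; the `rank` and `alpha` values above are illustrative defaults, not the paper's settings.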
Generalisation across various backbones
Experiments across various backbones demonstrate the generalisation of AnomalyVFM. In the figure, SD stands for Synthetic Dataset and FA for Feature Adapters.
Results
BibTeX
@article{fucka2026anomalyvfm,
  title={AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors},
  author={Fučka, Matic and Zavrtanik, Vitjan and Skočaj, Danijel},
  journal={arXiv preprint arXiv:2601.20524},
  year={2026}
}