AnomalyVFM - Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

University of Ljubljana, Faculty of Computer and Information Science
arXiv 2026

*Indicates Equal Contribution


TL;DR: AnomalyVFM is a framework for transforming any transformer-based VFM into a strong zero-shot anomaly detector. It does so by leveraging synthetic images and parameter-efficient fine-tuning, and achieves state-of-the-art results across 9 industrial inspection datasets.

Abstract

Zero-shot anomaly detection aims to detect and localise abnormal regions in an image without access to any in-domain training images. While recent approaches leverage vision–language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points.

Contributions

  • Synthetic Dataset Generation A new modular synthetic dataset generation scheme that exploits pretrained image generation models, such as FLUX. Upon acceptance, we will also release a large dataset of synthetic images to enable faster development of future methods.
  • VFM Adaptation Framework A new framework (AnomalyVFM) that adapts pretrained transformer VFMs for zero-shot anomaly detection by leveraging the synthetic dataset and parameter-efficient fine-tuning.

Synthetic Dataset Generation


Anomaly-free images are created with an image generation model and then modified via inpainting to produce anomalous versions within a targeted region. Corresponding masks are generated by comparing feature-level differences between the normal and anomalous images; this comparison also filters out samples in which the defect failed to generate.
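The mask-generation step above can be sketched as follows. This is a minimal illustration, not the paper's exact procedure: `pseudo_mask` is a hypothetical helper, the cosine-distance metric and the fixed threshold are assumptions, and the patch features would in practice come from a frozen VFM backbone rather than toy arrays.

```python
import numpy as np

def pseudo_mask(feats_normal, feats_anomalous, thresh=0.2):
    """Derive a pseudo ground-truth mask from patch-feature differences.

    feats_*: (H, W, D) patch embeddings of the normal / inpainted image
             from a frozen backbone (hypothetical input shape).
    Returns (mask, keep): a binary (H, W) mask of changed patches, and
    keep=False when no patch changed, i.e. the defect failed to generate.
    """
    # L2-normalise so the per-patch comparison becomes a cosine distance.
    fn = feats_normal / np.linalg.norm(feats_normal, axis=-1, keepdims=True)
    fa = feats_anomalous / np.linalg.norm(feats_anomalous, axis=-1, keepdims=True)
    dist = 1.0 - (fn * fa).sum(axis=-1)   # (H, W) cosine distance per patch
    mask = dist > thresh                  # patches altered by inpainting
    keep = bool(mask.any())               # drop samples with no visible defect
    return mask.astype(np.uint8), keep

# Toy usage: perturbing one patch yields a one-patch mask.
feats = np.random.default_rng(0).normal(size=(4, 4, 8))
perturbed = feats.copy()
perturbed[1, 2] = -perturbed[1, 2]        # simulate an inpainted defect
mask, keep = pseudo_mask(feats, perturbed)
```

Filtering on `keep` is what removes failed generations: if the inpainting model reproduced the normal image, no patch exceeds the threshold and the sample is discarded.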

Examples of Generated Images

AnomalyVFM


AnomalyVFM adapts a pretrained backbone by injecting LoRA-based feature adaptation modules into the transformer attention layers to refine internal representations. It utilizes a convolutional decoder and a confidence-weighted loss to generate segmentation masks, a combination specifically designed to remain robust against noise in synthetic training labels.
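The low-rank adaptation idea can be illustrated with a minimal sketch of a LoRA-style wrapper around a frozen linear layer, as would sit inside an attention block. The class name, rank, and scaling follow common LoRA practice and are assumptions, not the paper's exact implementation:

```python
import numpy as np

class LoRALinear:
    """Sketch of a low-rank adapter around a frozen linear weight W.

    W stays frozen; only the low-rank factors A and B are trained, so
    rank * (d_in + d_out) parameters are updated instead of d_in * d_out.
    """
    def __init__(self, W, rank=4, alpha=8, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = W.shape
        self.W = W                                    # frozen pretrained weight
        self.A = rng.normal(0, 0.02, (rank, d_in))    # trainable down-projection
        self.B = np.zeros((d_out, rank))              # trainable up-projection (zero init)
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * x A^T B^T
        # Zero-initialised B makes the adapter an identity correction at the
        # start of training, so the backbone's behaviour is preserved.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T
```

Because `B` starts at zero, injecting these modules leaves the pretrained VFM's outputs unchanged until fine-tuning on the synthetic dataset begins to steer the internal representations.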

Generalisation across various backbones


Experiments across various backbones demonstrate the generalisation of AnomalyVFM. In the figure, SD stands for Synthetic Dataset and FA for Feature Adapters.

Results


BibTeX

@article{fucka2026anomalyvfm,
        title={AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors},
        author={Fučka, Matic and Zavrtanik, Vitjan and Skočaj, Danijel},
        journal={arXiv preprint arXiv:2601.20524},
        year={2026}
}