AnomalyVFM - Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors
Abstract
Zero-shot anomaly detection aims to detect and localise abnormal regions in an image without access to any in-domain training images. While recent approaches leverage vision–language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based purely on vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as the backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points.
Contributions
- Synthetic Dataset Generation: A new modular synthetic dataset generation scheme exploiting pretrained image generation models, such as FLUX. Upon acceptance we will also release a large dataset of synthetic images to enable faster development of future methods.
- VFM Adaptation Framework: A new framework (AnomalyVFM) that adapts pretrained transformer VFMs, leveraging the synthetic dataset and parameter-efficient finetuning, for zero-shot anomaly detection.
Synthetic Dataset Generation
Anomaly-free images are created using an image generation model and then modified via inpainting to produce anomalous versions within a targeted region. Corresponding masks are generated by comparing feature-level differences between the normal and anomalous images, which also serves to filter out samples where the defect failed to generate.
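To make the mask-derivation step concrete, here is a minimal PyTorch sketch of one plausible instantiation: patch features of the normal and inpainted image, extracted by the same frozen backbone, are compared by cosine distance, thresholded into a defect mask, and near-empty masks are used to discard pairs where generation failed. The function names and the `threshold`/`min_area` parameters are hypothetical illustrations, not values from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def defect_mask_from_features(feats_normal, feats_anomalous, threshold=0.2):
    """Derive a defect mask by comparing patch features of a normal image
    and its inpainted (anomalous) counterpart from the same frozen backbone.

    feats_normal, feats_anomalous: (C, H, W) feature maps.
    Returns a binary (H, W) mask and the raw cosine-distance map.
    """
    # Cosine distance per spatial location: ~0 where the two images agree,
    # larger where the inpainted defect changed the local appearance.
    similarity = F.cosine_similarity(feats_normal, feats_anomalous, dim=0)
    distance = 1.0 - similarity                  # (H, W)
    mask = (distance > threshold).float()
    return mask, distance

def is_valid_sample(mask, min_area=16):
    """Filter out samples where inpainting failed to produce a visible
    defect: a (near-)empty mask means the image pair is discarded."""
    return mask.sum().item() >= min_area
```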
Examples of Generated Images
AnomalyVFM
AnomalyVFM adapts a pretrained backbone by injecting LoRA-based feature adaptation modules into the transformer attention layers to refine internal representations. It utilises a convolutional decoder and a confidence-weighted loss to generate segmentation masks, a combination specifically designed to remain robust against noise in synthetic training labels.
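The two adaptation components can be sketched as follows, assuming a ViT-style backbone and standard PyTorch. The LoRA wrapper and the confidence weighting shown here (each pixel's BCE term scaled by the detached prediction confidence) are one plausible instantiation; the exact formulation used by AnomalyVFM may differ, and all names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Wrap a frozen linear projection (e.g. the q/k/v projections inside a
    transformer attention layer) with a trainable low-rank update:
    y = W x + (alpha / r) * B A x, with A and B of rank r."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # start as an identity update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

def confidence_weighted_pixel_loss(logits, target):
    """Per-pixel BCE weighted by the (detached) prediction confidence, so
    pixels where the network is unsure, often those with noisy synthetic
    labels, contribute less to the gradient."""
    bce = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    p = torch.sigmoid(logits)
    confidence = (p - 0.5).abs() * 2.0       # 0 at the decision boundary
    return (confidence.detach() * bce).mean()
```

In training, `LoRALinear` would replace the attention projections of the frozen backbone, so only the low-rank matrices and the convolutional decoder receive gradients; the `rank` and `alpha` values above are illustrative defaults, not the paper's settings.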
Generalisation across various backbones
Experiments across various backbones demonstrate the generalisation of AnomalyVFM. In the figure, SD stands for Synthetic Dataset and FA for Feature Adapters.
Results
BibTeX
@article{fucka2026anomalyvfm,
  title={AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors},
  author={Fučka, Matic and Zavrtanik, Vitjan and Skočaj, Danijel},
  journal={arXiv preprint arXiv:2601.20524},
  year={2026}
}