A Hybrid Vision Transformer Model for Efficient Waste Classification
Dublin Core
Title
A Hybrid Vision Transformer Model for Efficient Waste Classification
Subject
Deep Learning, Fine-Tuning, Hybrid Approach, ResNet50, Vision Transformers, Waste Classification
Description
The rapid and accurate sorting of municipal waste is essential for efficient recycling and sustainable
resource recovery. Most existing AI solutions focus only on four common materials (plastic, paper,
metal, and glass), overlooking many other routinely encountered waste types and losing accuracy when
applied to the mixed waste compositions seen in operational environments. We introduce HR-ViT, a
hybrid network that combines ResNet50 residual blocks, which capture fine-grained local cues, with
Vision Transformer global self-attention. Trained on a balanced six-class benchmark of about 775
images per class (plastic, paper, organic, metal, glass, batteries), HR-ViT attains 98.27 % accuracy and a macro-averaged F1-score of 0.98, outperforming a pure ViT, VT-MLH-CNN, and Garbage FusionNet by up to five percentage points in both metrics. Gains arise from selective fine-tuning of the last ten ResNet layers, lightweight ViT hyper-parameter optimisation, and targeted data augmentation that mitigates cluttered backgrounds, uneven lighting, and object deformation. These results show that hybrid attention-residual architectures provide reliable predictions under complex imaging conditions.
Future work will extend the method to multi-object scenes and domain-adaptive deployment in smart-city recycling systems.
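The description reports overall accuracy and a macro-averaged F1-score across the six classes. As a reading aid, here is a minimal sketch of how those two metrics are conventionally computed. It is not the authors' evaluation code: the function names and the toy labels are illustrative assumptions, and scoring a class with no true or predicted samples as 0.0 is one common convention rather than something the record specifies.

```python
from collections import Counter

# Class names are taken from the description; everything else is illustrative.
CLASSES = ["plastic", "paper", "organic", "metal", "glass", "batteries"]


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred, classes=CLASSES):
    """Unweighted mean of the per-class F1 scores."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = []
    for c in classes:
        # F1 = 2TP / (2TP + FP + FN); assumed to be 0.0 when undefined.
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(classes)


# Made-up toy labels (not data from the paper): one battery is mislabelled.
y_true = ["plastic", "paper", "organic", "metal", "glass", "batteries"]
y_pred = ["plastic", "paper", "organic", "metal", "glass", "metal"]

print(f"accuracy = {accuracy(y_true, y_pred):.4f}")
print(f"macro F1 = {macro_f1(y_true, y_pred):.4f}")
```

Because macro averaging weights every class equally, a single poorly recognised class lowers the score as much as any other, which is why it is a natural companion to accuracy on a balanced benchmark such as the one described.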
Creator
Amir Mahmud Husein, Baren Baruna Harahap, Tio Fulalo Simatupang, Karunia Syukur Baeha, Bintang Keitaro Sinambela
Source
DOI: http://dx.doi.org/10.21609/jiki.v18i2.1489
Publisher
Faculty of Computer Science UI
Date
2025-02-26
Contributor
Sri Wahyuni
Rights
ISSN: 2502-9274
Format
PDF
Language
English
Type
Text
Citation
Amir Mahmud Husein, Baren Baruna Harahap, Tio Fulalo Simatupang, Karunia Syukur Baeha, Bintang Keitaro Sinambela, “A Hybrid Vision Transformer Model for Efficient Waste Classification,” Repository Horizon University Indonesia, accessed January 11, 2026, https://repository.horizon.ac.id/items/show/9887.