
A Hybrid Neuro-Fuzzy Approach for Heterogeneous Patch Encoding in ViTs Using Contrastive Embeddings and Deep Knowledge Dispersion

Shah, S Muhammad Ahmed Hassan; Khan, Muhammad Qasim; Ghadi, Yazeed Yasin; Jan, Sana Ullah; Mzoughi, Olfa; Hamdi, Monia


Abstract

Vision Transformers (ViTs) are widely used in image recognition and related applications. A ViT delivers impressive results when it is pre-trained on massive volumes of data and then transferred to mid-sized or small-scale image recognition benchmarks such as ImageNet and CIFAR-100. Fundamentally, it splits images into patches, and a patch-encoding module then produces latent embeddings via linear projection and positional embedding. In this work, the patch-encoding module is modified to produce heterogeneous embeddings through a new weighted encoding scheme. Whereas a traditional transformer uses two embeddings, linear projection and positional embedding, the proposed model replaces these with a weighted combination of the linear projection embedding, the positional embedding, and three additional embeddings: Spatial Gated, Fourier Token Mixing, and Multi-Layer Perceptron (MLP) Mixture. Secondly, a Divergent Knowledge Dispersion (DKD) mechanism is proposed to propagate earlier latent information deep into the transformer network, ensuring that this latent knowledge is available to the multi-headed attention for efficient patch encoding. Four benchmark datasets (MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100) are used for comparative performance evaluation. The proposed model is named SWEKP-based ViT, where SWEKP stands for Stochastic Weighted Composition of Contrastive Embeddings & Divergent Knowledge Dispersion (DKD) for Heterogeneous Patch Encoding. The experimental results show that adding the extra embeddings to the transformer and integrating the DKD mechanism improves performance on the benchmark datasets. The ViT was trained separately with each combination of these embeddings for encoding. Conclusively, the spatial gated embedding combined with the default embeddings outperforms the Fourier Token Mixing and MLP-Mixture embeddings.
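The abstract describes replacing the standard two-term patch encoding with a weighted composition of five embeddings. The paper's exact formulation is not reproduced here; the following is a minimal NumPy sketch of the idea, in which all shapes, initializations, and the specific forms of the spatial-gated, Fourier token-mixing, and MLP-mixture terms are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 16 patches, each flattened to a 48-dim vector, projected to d_model = 32
num_patches, patch_dim, d_model = 16, 48, 32
patches = rng.standard_normal((num_patches, patch_dim))

# Default ViT embeddings: linear projection + (learned) positional embedding
W_proj = rng.standard_normal((patch_dim, d_model)) * 0.02
linear_proj = patches @ W_proj
pos_embed = rng.standard_normal((num_patches, d_model)) * 0.02

# Assumed spatial-gated term: split channels and gate one half with the other
u, v = np.split(linear_proj, 2, axis=-1)
spatial_gated = np.concatenate([u, u * v], axis=-1)

# Assumed Fourier token-mixing term: real part of an FFT over the token axis
fourier_mix = np.fft.fft(linear_proj, axis=0).real

# Assumed MLP-mixture term: a token-mixing MLP applied across the patch axis
W_tok = rng.standard_normal((num_patches, num_patches)) * 0.02
mlp_mix = np.maximum(W_tok @ linear_proj, 0.0)  # ReLU

# Stochastic weighted composition: softmax-normalized weights over the five terms
logits = rng.standard_normal(5)
w = np.exp(logits) / np.exp(logits).sum()
embeddings = [linear_proj, pos_embed, spatial_gated, fourier_mix, mlp_mix]
encoded = sum(wi * e for wi, e in zip(w, embeddings))

print(encoded.shape)  # one latent embedding per patch: (16, 32)
```

Because every term has the same shape as the default linear-projection embedding, the weighted sum drops into the transformer unchanged, and training the ViT with different subsets of the five terms (as the abstract describes) amounts to masking entries of `w`.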

Journal Article Type: Article
Acceptance Date: Jul 31, 2023
Online Publication Date: Aug 4, 2023
Publication Date: 2023
Deposit Date: Aug 8, 2023
Publicly Available Date: Aug 8, 2023
Journal: IEEE Access
Electronic ISSN: 2169-3536
Publisher: Institute of Electrical and Electronics Engineers
Peer Reviewed: Yes
Volume: 11
Pages: 83171-83186
DOI: https://doi.org/10.1109/access.2023.3302253
Keywords: vision transformer, patch encoding, spatial gated unit, Fourier token mixing, MLP-mixture embedding, computer vision
