
A Hybrid Neuro-Fuzzy Approach for Heterogeneous Patch Encoding in ViTs Using Contrastive Embeddings and Deep Knowledge Dispersion

Shah, S Muhammad Ahmed Hassan; Khan, Muhammad Qasim; Ghadi, Yazeed Yasin; Jan, Sana Ullah; Mzoughi, Olfa; Hamdi, Monia


Abstract

Vision Transformers (ViTs) are widely used in image recognition and related applications. A ViT delivers impressive results when it is pre-trained on massive volumes of data and then transferred to mid-sized or small-scale image recognition benchmarks such as ImageNet and CIFAR-100. Fundamentally, it splits images into patches, and a patch-encoding module then produces latent embeddings via linear projection and positional embedding. In this work, the patch-encoding module is modified to produce heterogeneous embeddings through a new weighted encoding scheme. Whereas a traditional transformer uses two embeddings, linear projection and positional embedding, the proposed model replaces these with a weighted combination of the linear projection embedding, the positional embedding, and three additional embeddings: Spatial Gated, Fourier Token Mixing, and Multi-Layer Perceptron (MLP) Mixture. Secondly, a Divergent Knowledge Dispersion (DKD) mechanism is proposed to propagate earlier latent information deep into the transformer network, ensuring that this latent knowledge is available to the multi-headed attention for efficient patch encoding. Four benchmark datasets (MNIST, Fashion-MNIST, CIFAR-10, and CIFAR-100) are used for comparative performance evaluation. The proposed model is named SWEKP-based ViT, where SWEKP stands for Stochastic Weighted Composition of Contrastive Embeddings & Divergent Knowledge Dispersion (DKD) for Heterogeneous Patch Encoding. The experimental results show that adding the extra embeddings to the transformer and integrating the DKD mechanism improves performance on the benchmark datasets. The ViT was trained separately with each combination of these embeddings for encoding. Conclusively, the spatial gated embedding combined with the default embeddings outperforms the Fourier Token Mixing and MLP-Mixture embeddings.
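The abstract describes replacing the standard two-term patch encoding with a weighted composition of five embeddings. The paper's exact formulation is not reproduced here; the following is a minimal NumPy sketch of the idea, in which all shapes, initializations, and the specific forms of the spatial-gated, Fourier token-mixing, and MLP-mixture terms are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: 16 patches, each flattened to a 48-dim vector, projected to d_model = 32
num_patches, patch_dim, d_model = 16, 48, 32
patches = rng.standard_normal((num_patches, patch_dim))

# Default ViT embeddings: linear projection + (learned) positional embedding
W_proj = rng.standard_normal((patch_dim, d_model)) * 0.02
linear_proj = patches @ W_proj
pos_embed = rng.standard_normal((num_patches, d_model)) * 0.02

# Assumed spatial-gated term: split channels and gate one half with the other
u, v = np.split(linear_proj, 2, axis=-1)
spatial_gated = np.concatenate([u, u * v], axis=-1)

# Assumed Fourier token-mixing term: real part of an FFT over the token axis
fourier_mix = np.fft.fft(linear_proj, axis=0).real

# Assumed MLP-mixture term: a token-mixing MLP applied across the patch axis
W_tok = rng.standard_normal((num_patches, num_patches)) * 0.02
mlp_mix = np.maximum(W_tok @ linear_proj, 0.0)  # ReLU

# Stochastic weighted composition: softmax-normalized weights over the five terms
logits = rng.standard_normal(5)
w = np.exp(logits) / np.exp(logits).sum()
embeddings = [linear_proj, pos_embed, spatial_gated, fourier_mix, mlp_mix]
encoded = sum(wi * e for wi, e in zip(w, embeddings))

print(encoded.shape)  # one latent embedding per patch: (16, 32)
```

Because every term has the same shape as the default linear-projection embedding, the weighted sum drops into the transformer unchanged, and training the ViT with different subsets of the five terms (as the abstract describes) amounts to masking entries of `w`.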

Journal Article Type: Article
Acceptance Date: Jul 31, 2023
Online Publication Date: Aug 4, 2023
Publication Date: 2023
Deposit Date: Aug 8, 2023
Publicly Available Date: Aug 8, 2023
Journal: IEEE Access
Electronic ISSN: 2169-3536
Publisher: Institute of Electrical and Electronics Engineers
Peer Reviewed: Yes
Volume: 11
Pages: 83171-83186
DOI: https://doi.org/10.1109/access.2023.3302253
Keywords: vision transformer, patch encoding, spatial gated unit, Fourier token mixing, MLP-mixture embedding, computer vision
