Theses and Dissertations

Date of Award


Document Type


Degree Name

Master of Science in Engineering (MSE)


Department

Electrical Engineering

First Advisor

Dimah Dera

Second Advisor

Rogelio Soto

Third Advisor

Yong Zhou


Abstract

Transformer neural networks have emerged as the predominant architecture for a wide range of Natural Language Processing (NLP) applications, such as machine translation, speech recognition, sentiment analysis, and text anomaly detection. This success in NLP has sparked growing interest in applying Transformer models to computer vision tasks. The Vision Transformer (ViT) captures long-range dependencies efficiently by employing a self-attention mechanism that transforms image data into meaningful representations. ViT models have recently exhibited remarkable performance on image classification problems, surpassing Convolutional Neural Networks (CNNs). However, deterministic ViT models are vulnerable to noise and adversarial attacks and cannot provide a reliable measure of confidence (or uncertainty) in their output predictions. Developing a robust ViT model that quantifies the confidence level of its predictions is therefore of significant importance for vision applications with high-risk implications, such as autonomous vehicles and medical imaging. To ensure the dependability of ViT models in such critical applications, we use Bayesian inference to generate probabilistic predictions. In this work, we develop a robust image classification framework based on the Bayesian Vision Transformer (Bayes-ViT) and the Bayesian Compact Convolutional Transformer (Bayes-CCT), which produce output predictions together with the uncertainty associated with those predictions. The proposed models incorporate a variational inference framework and optimize the variational posterior distribution over the model parameters using the evidence lower bound (ELBO) loss function.
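As a rough illustration of the variational-inference setup described above (a generic toy sketch, not the thesis's actual implementation), the ELBO for a mean-field Gaussian posterior over a single weight in a hypothetical linear model can be estimated as follows; the model, prior, and data here are all assumed for illustration:

```python
import numpy as np

def elbo(mu, log_sigma, x, y, n_samples=1000, noise_std=0.1, rng=None):
    """Monte Carlo estimate of the ELBO for a toy model y = w * x + noise,
    with variational posterior q(w) = N(mu, sigma^2) and prior p(w) = N(0, 1).

    ELBO = E_q[log p(y | x, w)] - KL(q(w) || p(w)).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = np.exp(log_sigma)
    # Draw weight samples from the variational posterior q(w).
    w = rng.normal(mu, sigma, size=n_samples)
    # Expected log-likelihood under q(w), estimated by Monte Carlo
    # (Gaussian observation noise; constant terms dropped).
    resid = y[None, :] - w[:, None] * x[None, :]
    exp_log_lik = (-0.5 * np.sum(resid ** 2, axis=1) / noise_std ** 2).mean()
    # KL divergence between two univariate Gaussians, in closed form.
    kl = np.log(1.0 / sigma) + (sigma ** 2 + mu ** 2) / 2.0 - 0.5
    return exp_log_lik - kl
```

Maximizing this quantity with respect to (mu, log_sigma) fits the variational posterior; in the Bayes-ViT setting the same trade-off between data fit and KL regularization applies, but over the full set of Transformer parameters.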
Variational moments are propagated through the Bayesian Vision Transformer's sequential, non-linear layers using a first-order Taylor series approximation. The output of the proposed architecture is a predictive distribution: the mean serves as the output prediction, and the covariance matrix captures the uncertainty associated with that prediction. Extensive experiments on benchmark datasets demonstrate (1) the superior robustness of the proposed models under noise and adversarial attacks compared to deterministic ViT models and (2) a capacity for self-evaluation: the discernible increase in prediction uncertainty when the models encounter high levels of random noise or adversarial attacks acts as a warning sign in critical image classification tasks.
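The first-order Taylor propagation of moments through a non-linear layer can be sketched in miniature (a generic construction under the standard linearization, not the thesis's code): for a nonlinearity f, an input with mean m and variance v maps approximately to an output with mean f(m) and variance f'(m)² · v. Using tanh as the example nonlinearity:

```python
import numpy as np

def propagate_tanh(mean_in, var_in):
    """First-order Taylor propagation of a distribution's first two moments
    through an element-wise tanh layer:
        E[f(x)]   ~ f(mu)
        Var[f(x)] ~ f'(mu)^2 * Var[x]
    """
    mean_out = np.tanh(mean_in)
    deriv = 1.0 - np.tanh(mean_in) ** 2   # d/dx tanh(x) evaluated at the mean
    var_out = deriv ** 2 * var_in
    return mean_out, var_out
```

Applying this rule layer by layer carries a variance (or covariance) alongside the mean through the network, which is how moment-propagation schemes arrive at a predictive mean and covariance at the output. Note that where tanh saturates, the derivative shrinks and so does the propagated variance.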


Copyright 2023 Fazlur Rahman Bin Karim. All Rights Reserved.