Authors - Parekh Rikita Dhaval, Hiteishi M. Diwanji

Abstract - Visual Question Answering (VQA) is an emerging multidisciplinary research field that intersects computer vision and natural language processing. Medical Visual Question Answering is one of the prominent areas of VQA. A medical image and a clinical question related to that image are given as input to the VQA model, and the model responds with the corresponding answer in natural language. The aim of Medical VQA is to enhance the interpretability of medical image data, thereby improving diagnostic accuracy, clinical decision making, and patient care. This paper presents a novel framework that integrates a Vision Transformer (ViT), a language transformer (BERT), and a Convolutional Autoencoder (CAE) to improve performance on the Medical VQA task. The Vision Transformer captures complex visual features from medical images, while BERT processes the corresponding clinical question to understand its context and generate meaningful language embeddings. To further enhance visual feature extraction, a Convolutional Autoencoder [1], [2] is incorporated to preprocess and denoise the medical images, capturing essential patterns and compressing the images without losing key features, thereby providing cleaner input to the ViT. The combined use of these three components enables the model to effectively align visual features with textual information, leading to more precise and context-aware answers. We evaluate the proposed ViT+BERT+CAE model on the benchmark medical VQA dataset MEDVQA-2019, showing significant improvements over traditional methods based solely on convolutional or recurrent networks. The results demonstrate a significant increase in accuracy, precision, recall, F1-score, and WUPS score after applying the Convolutional Autoencoder in the preprocessing stage.
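The pipeline described above (CAE denoising, ViT visual encoding, BERT question encoding, and feature fusion) can be summarized with the following minimal sketch. It assumes PyTorch and the Hugging Face transformers library; the CAE layout, the fusion head, the pretrained checkpoint names, and the answer-vocabulary size are illustrative placeholders, not the authors' exact implementation.

```python
# Minimal sketch of the ViT+BERT+CAE pipeline (assumptions: PyTorch +
# Hugging Face transformers; CAE layout, fusion head, checkpoints, and
# num_answers are illustrative, not the paper's exact design).
import torch
import torch.nn as nn
from transformers import ViTModel, BertModel

class ConvAutoencoder(nn.Module):
    """Denoises/compresses the medical image before it reaches the ViT."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 3, stride=2, padding=1, output_padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        # Reconstructed (denoised) image at the original resolution.
        return self.decoder(self.encoder(x))

class MedicalVQA(nn.Module):
    """CAE -> ViT visual features + BERT question features -> fused answer classifier."""
    def __init__(self, num_answers=1000):
        super().__init__()
        self.cae = ConvAutoencoder()
        self.vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        self.fusion = nn.Sequential(
            nn.Linear(self.vit.config.hidden_size + self.bert.config.hidden_size, 512),
            nn.ReLU(),
            nn.Linear(512, num_answers),
        )

    def forward(self, pixel_values, input_ids, attention_mask):
        clean = self.cae(pixel_values)                     # denoised/compressed image
        img = self.vit(pixel_values=clean).pooler_output   # pooled visual embedding
        txt = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).pooler_output  # question embedding
        # Concatenate visual and textual embeddings and classify over the answer vocabulary.
        return self.fusion(torch.cat([img, txt], dim=-1))
```

In this sketch the answer is predicted as a classification over a fixed answer vocabulary, a common formulation for VQA; the CAE would typically be pretrained with a reconstruction loss on the medical images before the full model is trained end to end on question-answer pairs.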