Authors - Vedant Patil, Bhargavi Bhende, Omkar Jadhav, Gitanjali Shinde, Kavita Moholkar

Abstract - The increasing realism of AI-generated faces, driven by advances in Generative Adversarial Networks (GANs) such as StyleGAN and ProGAN, poses significant challenges for security, identity verification, and digital forensics. Current detection methods, which rely primarily on Convolutional Neural Networks (CNNs), struggle to identify subtle artifacts in high-quality synthetic imagery. This paper proposes a hybrid model that combines a Vision Transformer (ViT) and XceptionNet in a soft-voting ensemble. ViT captures global spatial patterns, while XceptionNet excels at detecting localized texture inconsistencies. The ensemble achieves 92.3% accuracy, 92.5% precision, and an F1-score of 0.922 on a dataset of 188,800 real and AI-generated faces. Extensive experiments demonstrate the model’s robustness against diverse deepfake architectures, including those that leave minimal artifacts. This approach offers a state-of-the-art solution for distinguishing real from AI-generated faces, with significant implications for fraud prevention, content moderation, and digital forensics.
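To make the soft-voting step concrete, the following is a minimal sketch of how two classifiers' class probabilities could be combined. It is an illustration under stated assumptions, not the authors' implementation: the function name `soft_vote`, the equal weights, the column order [real, AI-generated], and the example probabilities are all hypothetical, since the abstract does not specify them.

```python
import numpy as np


def soft_vote(prob_vit, prob_xception, weights=(0.5, 0.5)):
    """Combine two models' class probabilities by weighted averaging.

    Each input has shape (n_images, n_classes). Equal weights are an
    assumption; the abstract does not state how the models are weighted.
    """
    p_vit = np.asarray(prob_vit, dtype=float)
    p_xcp = np.asarray(prob_xception, dtype=float)
    if p_vit.shape != p_xcp.shape:
        raise ValueError("both models must score the same images and classes")

    w_vit, w_xcp = weights
    # Weighted mean of the two probability matrices (soft voting).
    combined = (w_vit * p_vit + w_xcp * p_xcp) / (w_vit + w_xcp)
    # Predicted class is the one with the highest averaged probability.
    labels = combined.argmax(axis=1)
    return combined, labels


# Hypothetical probabilities; columns are [real, AI-generated].
vit_probs = np.array([[0.80, 0.20], [0.40, 0.60], [0.55, 0.45]])
xcp_probs = np.array([[0.70, 0.30], [0.10, 0.90], [0.30, 0.70]])

combined, labels = soft_vote(vit_probs, xcp_probs)
print(labels.tolist())  # [0, 1, 1]
```

In the third example the two models disagree: the ViT scores lean towards "real" (0.55) while the XceptionNet scores lean towards "AI-generated" (0.70). Averaging the probabilities, rather than counting hard votes, lets the more confident model decide the outcome.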