Friday January 31, 2025 3:00pm - 5:00pm IST

Authors - Dinesh Rajput, Prajwal Nimbone, Siddhesh Kasat, Mousami Munot, Rupesh Jaiswal
Abstract - We introduce a neural-network-based system that combines real-time avatar functionality with text-to-speech (TTS) synthesis. The system can produce speech in the voices of various speakers, including ones that were not seen during training. To generate a speaker embedding from a brief reference voice sample, the system uses a dedicated encoder trained on a large volume of voice data. Conditioned on this embedding, the synthesizer converts text into a mel-spectrogram, and a vocoder turns the spectrogram into an audio waveform. Concurrently, the generated speech is synchronized with a three-dimensional avatar that produces matching lip motions in real time. This approach transfers the speaker variability learned by the encoder to the TTS task, enabling the system to mimic natural speech in the voices of unseen speakers. Integration of the avatar system with a web interface ensures precise lip syncing of speech with facial movements. We also demonstrate that training the encoder on a diverse speaker dataset markedly improves the system's ability to adapt to novel voices. In addition, by using random speaker embeddings, the model can generate unique voices distinct from those heard during training while retaining smooth synchronization with the avatar's visual output, further showcasing its capacity to deliver high-quality, interactive voice-cloning experiences.
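The abstract describes a three-stage voice-cloning pipeline (speaker encoder → text-to-mel synthesizer → vocoder) with an avatar lip-sync stage driven by the resulting audio. The sketch below is a minimal, hypothetical illustration of that data flow only; the class names, interfaces, and placeholder internals (SpeakerEncoder, Synthesizer, Vocoder, Avatar, the 256-dim embedding, 80 mel bins, hop length 256) are assumptions for illustration, not the authors' actual implementation.

```python
# Minimal sketch of the pipeline described in the abstract. Every class
# here is a hypothetical stand-in; the real system uses trained networks.
import numpy as np

class SpeakerEncoder:
    """Maps a short reference utterance to a fixed-size speaker embedding."""
    def embed(self, reference_wav: np.ndarray) -> np.ndarray:
        # Placeholder: derive a deterministic unit-norm vector from the audio.
        seed = abs(int(reference_wav.sum() * 1e6)) % 2**32
        e = np.random.default_rng(seed).standard_normal(256)
        return e / np.linalg.norm(e)

class Synthesizer:
    """Converts text plus a speaker embedding into a mel-spectrogram."""
    def text_to_mel(self, text: str, speaker_embedding: np.ndarray) -> np.ndarray:
        n_frames = 10 * len(text)        # rough frames-per-character stand-in
        return np.zeros((80, n_frames))  # 80 mel bins is a common choice

class Vocoder:
    """Renders a mel-spectrogram into an audio waveform."""
    def mel_to_wav(self, mel: np.ndarray, hop_length: int = 256) -> np.ndarray:
        return np.zeros(mel.shape[1] * hop_length)

class Avatar:
    """Drives 3-D lip motion in sync with the audio (stub)."""
    def play_with_lipsync(self, wav: np.ndarray, sample_rate: int = 22050) -> None:
        print(f"Animating lips for {len(wav) / sample_rate:.2f}s of audio")

def clone_and_speak(text: str, reference_wav: np.ndarray) -> np.ndarray:
    embedding = SpeakerEncoder().embed(reference_wav)  # voice identity
    mel = Synthesizer().text_to_mel(text, embedding)   # content + identity
    wav = Vocoder().mel_to_wav(mel)                    # audible speech
    Avatar().play_with_lipsync(wav)                    # synced visuals
    return wav

if __name__ == "__main__":
    reference = np.random.default_rng(0).standard_normal(22050 * 3)  # 3 s clip
    clone_and_speak("Hello from the cloned voice.", reference)
```

As the abstract notes, the system can also synthesize entirely novel voices: in terms of this sketch, that would mean passing a randomly sampled unit-norm vector to Synthesizer.text_to_mel in place of an encoder-derived embedding.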
Paper Presenter
Virtual Room A, Pune, India
