
How to convert text to speech using new interface of FPT.AI Voice Maker

January 6, 2025


FPT.AI Voice Maker is a platform that lets users automatically convert text to natural speech. The solution is equipped with state-of-the-art Text to Speech technology, new-generation AceSound voices, and many advanced editing features. Users can easily customize and save their audio as MP3 files. To improve the user experience, FPT.AI has recently introduced a new, friendly interface with professional editing tools for easy, free text to speech conversion. Below is what you should know about Text to Speech technology and how to use the new version of FPT.AI Voice Maker.

What is Text to Speech?

Text to Speech (TTS), also known as text-to-voice conversion, is a technology that transforms written text into audio output. The primary goal of Text to Speech AI is to simulate natural human speech, enabling users to consume information by listening instead of reading.

TTS integrates artificial intelligence, deep learning, and natural language processing (NLP) to produce high-quality, natural-sounding AI voices that replicate human speech’s tone, emotion, intonation, and speed.

The process of generating AI voices has been simplified thanks to modern tools. Users can access these applications via web browsers or iOS and Android devices, select a language, input a script, customize elements like voice style and tone, and create AI-generated speech within seconds.

This technology unlocks new creative possibilities and offers practical applications in everyday life, including chatbots, callbots, audiobooks, navigation systems, and virtual assistants like Siri, Alexa, Cortana, and Google Assistant.

How Does Text to Speech Technology Work?

The operation of Text to Speech (TTS) involves three main steps: Natural Language Processing (NLP), Acoustic Modeling, and Speech Synthesis using a Vocoder. Below is an overview of the process:

Natural Language Processing (NLP)

This is the first step, responsible for analyzing and preparing input text for subsequent stages. It involves:

Expanding abbreviations: Transforming abbreviations (e.g., “NYC” to “New York City”) for better recognition.
Removing special characters: Cleaning unnecessary symbols (&, %, @) from the text.
Normalization: Converting numbers (“123” to “one hundred twenty-three”) and standardizing language formats.
Linguistic analysis: Identifying word types (noun, verb, adjective), phonemes (smallest sound units), and assigning stress and intonation to ensure natural pronunciation.

The result is a detailed transcription of the input text, including phonemes, stress patterns, intonation, and rhythm.
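The first three preprocessing steps above can be sketched in a few lines of Python. This is only an illustration, not FPT.AI's implementation: the abbreviation table, the set of special characters, and the function names `number_to_words` and `normalize` are assumptions made for this example, and the number speller only handles 0 to 999.

```python
import re

# Hypothetical abbreviation table; a real system would use a far larger lexicon.
ABBREVIATIONS = {"NYC": "New York City"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer from 0 to 999 in English."""
    if not 0 <= n <= 999:
        raise ValueError("this sketch only handles 0..999")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    hundreds, rest = divmod(n, 100)
    words = ONES[hundreds] + " hundred"
    return words + (" " + number_to_words(rest) if rest else "")


def normalize(text: str) -> str:
    """Expand abbreviations, drop special characters, spell out numbers."""
    # 1. Expand abbreviations, matching whole words only.
    for abbr, full in ABBREVIATIONS.items():
        text = re.sub(rf"\b{re.escape(abbr)}\b", full, text)
    # 2. Remove the special characters named in the article.
    text = re.sub(r"[&%@]", " ", text)
    # 3. Spell out standalone numbers of up to three digits.
    text = re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(normalize("NYC has 123 shops @ & more"))
```

Linguistic analysis (part-of-speech tagging, phoneme conversion, stress assignment) is left out here because it normally relies on trained models or pronunciation lexicons rather than a handful of rules.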

[Figure: NLP analyzes and prepares input text for subsequent stages]


Acoustic Modeling

In this step, processed text data is converted into acoustic parameters that simulate the characteristics of human speech.

The model utilizes Mel-Spectrograms, which visually represent sound frequencies, to encode features like pitch, duration, and energy.
Acoustic models, trained on real-world data, predict how a Mel-Spectrogram should be structured based on the input text, ensuring accurate and context-appropriate speech synthesis.
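A Mel-Spectrogram places its frequency bands on the mel scale, which follows perceived pitch rather than raw hertz. The sketch below uses one widely used conversion formula, mel = 2595 · log10(1 + f / 700). Which variant FPT.AI's models use is not stated in this article, so treat the formula choice, the helper names, and the band count as assumptions for illustration.

```python
import math


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels (a common 2595 * log10 variant)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_bands: int, f_min: float, f_max: float) -> list:
    """Return n_bands frequencies in Hz, evenly spaced on the mel scale."""
    if n_bands < 2:
        raise ValueError("need at least two bands")
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands - 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands)]


# Bands are dense at low frequencies and sparse at high ones.
for freq in mel_center_frequencies(5, 0.0, 8000.0):
    print(round(freq, 1))
```

Because the spacing is even in mels, the gap in hertz between neighbouring bands grows with frequency, so more of the spectrogram's resolution goes to the lower frequencies.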

[Figure: The acoustic model turns linguistic information into parameters that simulate the characteristics of human speech]

Speech Synthesis using a Vocoder

Finally, the Mel-Spectrogram is passed into a Vocoder (e.g., HiFi-GAN, WaveNet) to generate actual audio signals.

Vocoders convert Mel-Spectrograms into waveforms that humans can hear.
Modern vocoders fine-tune elements such as intonation, emphasis, and speed, delivering natural and expressive voices.

[Figure: The Mel-Spectrogram is fed into the speech synthesis model (Vocoder), where it is converted into audio signals]

Powered by artificial intelligence and deep learning, this end-to-end process ensures faster and more realistic voice synthesis. TTS systems today are integral in diverse applications, from assisting visually impaired individuals to enhancing user experiences in smart devices and improving the efficiency of automated customer service systems. The continuous evolution of Text to Speech technology promises not only improved voice quality but also groundbreaking potential for future applications.

How to use the new version of FPT.AI Voicemaker to create synthetic speech

To use this platform, you will need an FPT ID. Register at https://id.fpt.ai/accounts/signin/?next=/accounts/profile/
Then visit https://console.fpt.ai to create a project, turn on the Text to Speech API, and enable the project.


Next, visit https://voicemaker.fpt.ai/, or choose Voicemaker under Applications, to start using the text to speech app.


Set up features on Voicemaker.fpt.ai:

Select a language to convert Text to Speech

The Voice Maker platform supports two languages: English and Vietnamese.

Click the globe icon in the right corner to choose a language.


Select a project

You need to choose a project to start.

If you already have a project, click on (1) to select a project, then choose one of the projects you created at (2).

If you have no project yet, you will need to create one. Click on Create new project (3). You will be forwarded to console.fpt.ai. Each FPT ID can create up to 3 free Text to Speech projects.

FPT.AI Console is the platform for managing all of FPT.AI’s services and viewing aggregated statistics about them. You can create a new project there, then go back to Voicemaker.fpt.ai to continue. To give users a smooth experience, FPT.AI Voicemaker has introduced a new, friendly interface with professional editing tools for easy text-to-speech conversion.


Add your text link

Paste the link of any web page whose text you want to convert to speech into the URL box.

Click Process, and the system will analyze the text on that page.

The text from the page will then appear in the editing interface.


Preview and choose a voice

Listen to the voices and choose a suitable one from the top bar.

FPT.AI Text to Speech now offers 13 high-quality voices that vary by region (North, Central, South) and gender (male/female) to meet customers’ different needs and purposes.


Customize text

After choosing a voice and an appropriate speed, you can refine the text with more specialized features to produce the high-quality audio file you want.

Dictionary

With this dictionary, you can teach the machine to pronounce difficult or foreign words by transliterating them into Vietnamese.

For example, suppose an article contains the proper noun HoSE. It is a difficult word, so the machine may mispronounce it. You can transliterate it into Vietnamese, then click Add to teach the machine.
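Conceptually, a pronunciation dictionary like this is a whole-word substitution applied to the text before synthesis. The sketch below shows that idea only: how Voicemaker stores and applies entries internally is not described here, the function name `apply_dictionary` is hypothetical, and the transliteration "hô sê" is an illustrative spelling rather than an official one.

```python
import re


def apply_dictionary(text: str, dictionary: dict) -> str:
    """Replace each dictionary word with its transliteration (whole words only)."""
    if not dictionary:
        return text
    # Try longer entries first so overlapping entries resolve predictably.
    words = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(w) for w in words) + r")\b")
    return pattern.sub(lambda m: dictionary[m.group(1)], text)


# Illustrative entry; the spelling is an assumption for this example.
custom_words = {"HoSE": "hô sê"}
print(apply_dictionary("HoSE opened higher today.", custom_words))
```

Matching on word boundaries keeps an entry such as HoSE from rewriting part of a longer, unrelated word.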

Insert break time

All voices of FPT.AI Text to Speech have natural breaks like human voices. However, if you want the machine to pause somewhere for a longer time, you can insert break time with this feature.

Place the cursor after the word where you want a longer pause, click Insert break time at cursor place, then set the pause length by entering the time in the Break time box.
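Inserting a break amounts to splicing a pause marker into the text at the cursor position. The sketch below uses an SSML-style `<break time="..."/>` tag as the marker. That is an assumption for illustration: this article does not say what markup the Voicemaker editor uses internally, and the function name `insert_break` is hypothetical.

```python
def insert_break(text: str, cursor: int, seconds: float) -> str:
    """Insert an SSML-style pause marker at the given cursor position."""
    if not 0 <= cursor <= len(text):
        raise ValueError("cursor is outside the text")
    if seconds <= 0:
        raise ValueError("break time must be positive")
    # Assumed marker format, expressed in milliseconds.
    tag = f'<break time="{int(round(seconds * 1000))}ms"/>'
    return text[:cursor] + tag + text[cursor:]


sentence = "Hello world"
print(insert_break(sentence, len("Hello"), 0.5))
```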

Choose another voice

Within the same text, you may need different voices for dialogue or to highlight quotes. To set up another voice, highlight the passage that needs it, click the Choose another voice button, then select a voice and adjust the speed.


Find and replace

To find a word/a phrase, type it into the Search box and click Find. All places where that word/phrase appears will be highlighted.


You can replace it by entering another word into the Replace box.

Click Replace to replace occurrences one by one.

Click Replace all to replace every occurrence in the whole text.


In the above example, I replaced “FLC” with “DIG” and clicked Replace to change the occurrences one by one.

To clear the highlighting from the word or phrase you searched for, click Clear.
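The difference between Find, Replace, and Replace all can be illustrated with plain Python string operations. This is a sketch of the behavior described above, not the editor's actual code, and the function names are hypothetical.

```python
def find_all(text: str, term: str) -> list:
    """Return the start index of every non-overlapping occurrence of term."""
    if not term:
        return []
    positions = []
    start = text.find(term)
    while start != -1:
        positions.append(start)
        start = text.find(term, start + len(term))
    return positions


def replace_next(text: str, term: str, replacement: str) -> str:
    """Replace only the first occurrence, like one click on Replace."""
    return text.replace(term, replacement, 1)


def replace_all(text: str, term: str, replacement: str) -> str:
    """Replace every occurrence, like Replace all."""
    return text.replace(term, replacement)


article = "FLC rose. FLC fell."
print(find_all(article, "FLC"))
print(replace_next(article, "FLC", "DIG"))
print(replace_all(article, "FLC", "DIG"))
```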

Choose a voice and preview

To preview how the machine reads a word, phrase, or passage, highlight it, choose a voice and speed, and click the play button to preview.


Undo and Redo

To undo an action, you can click on the Undo (1) button on the toolbar.

To redo an action, click on the Redo (2) button.

You can also use the keyboard shortcuts Ctrl + Z to undo and Ctrl + Y to redo.


Download audio files

To download your audio, click Convert To MP3.


See history

To see your history, click on the History button.

You will see the time, request, status, and audio link for each conversion. You can download your previous files here, so there is no need to convert the same text again.

Buy more characters

FPT.AI Text to Speech gives you 100,000 characters per month for free. If you want a higher limit and faster conversion, you can buy a paid package by clicking Buy more.


Click the package you want, and the system will redirect you to a payment portal to process your payment.


Applications of Text to Speech Technology

Creating Engaging and Automated Advertising and Video Content

Automatically generating advertising content without the need for manual voice recording is how businesses apply text-to-speech technology in the media industry. Advertisements, blog posts, product tutorials, and social media videos can be converted into audio formats that are clear and easy to understand. This enables businesses to reach new audiences, including those who have limited time to read.

Automated Dubbing and Narration in Multiple Languages and Intonations

Text to speech tools can produce narrations and dubbing for videos, films, and TV shows without requiring live voice actors. Users can adjust speed, volume, and pauses between sentences, teach the system to pronounce difficult words, and transcribe specialized terms or unique phonetics to create tailored voiceovers for film reviews or other content.

This technology helps YouTube channels, educational video producers, and broadcast platforms save costs, speed up content production, and make updates or edits easily without re-recording entire segments. It also supports AI voices in multiple languages, enhancing global reach and audience engagement.

Frequently Asked Questions About Using Text-to-Speech Tools

What is “Ban Mai Voice,” and why is it popular?

“Ban Mai Voice” (also known as “Google Voice”) is a standout AI voice from FPT.AI Voicemaker. It features a gentle, natural, and expressive Northern Vietnamese female tone that is easy to listen to. Ban Mai Voice is widely used in movie reviews, audiobooks, and short narrated content on social media platforms like TikTok, Facebook, and YouTube. This AI voice allows content creators to convey messages clearly and engagingly, attracting listeners without the need for complex post-production editing.


Is FPT.AI VoiceMaker a free text to speech tool?

FPT.AI offers a free trial feature, allowing you to input text and preview the AI-generated voice. For additional advanced features or to apply the tool in your creative projects, you can consider upgrading to paid service packages.

What is the maximum text length that can be converted into speech?

FPT.AI VoiceMaker supports converting up to 1,000 characters of text per voice generation session. Additionally, the synthesis duration is capped at 10 minutes per session to ensure optimal performance for your applications.
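For scripts longer than the 1,000-character limit, one practical approach is to split the text into chunks at sentence boundaries and convert each chunk in its own session. The sketch below shows one way to do that. The function name `split_into_chunks` and the simple sentence-splitting rule are assumptions for illustration, and the sketch does not address the 10-minute duration cap.

```python
import re

MAX_CHARS = 1000  # per-session character limit stated in the answer above


def split_into_chunks(text: str, max_chars: int = MAX_CHARS) -> list:
    """Split text into chunks of at most max_chars, preferring sentence ends."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single over-long sentence is cut hard at the limit.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


script = "This is a sentence. " * 120  # 2,400 characters
parts = split_into_chunks(script)
print(len(parts), [len(p) for p in parts])
```

Splitting at sentence boundaries keeps pauses and intonation natural at the joins when the resulting audio files are played back to back.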

In conclusion, Text to Speech (TTS) technology has transformed the way users interact with digital content, offering seamless text-to-audio conversion that mimics natural-sounding human speech. By leveraging advanced AI, deep learning, and NLP, TTS provides high-quality and versatile voice outputs that serve various purposes across education, healthcare, entertainment, and more.

Platforms like FPT.AI Voice Maker elevate this experience further by integrating cutting-edge TTS technology, customizable AceSound voices, and professional editing tools within an intuitive interface. With features such as speed adjustment, voice selection, and specialized editing capabilities, users can easily create and save natural-sounding audio files for diverse applications. As TTS technology continues to evolve, solutions like FPT.AI Voice Maker ensure that users can harness its full potential to enhance accessibility, productivity, and user experience.

_____________________

Try Voice Maker right now at voicemaker.fpt.ai

Experience other products of #FPT_AI at https://fpt.ai/vi

Address: 7th floor, FPT Tower, 10 Pham Van Bach Street, Cau Giay District, Hanoi; 3rd floor, Pijico Tower, 186 Dien Bien Phu Street, Ward 6, District 3, Ho Chi Minh City

Hotline: 1900 638399

Email: support@fpt.ai
