
Spring AI's Text-to-Speech API

Photo by Jeffrey Hamilton on Unsplash

Spring AI's Text-to-Speech (TTS) API helps convert text documents into audio files. Currently, Spring AI supports integration only with the OpenAI TTS API.

TTS has applications in multiple fields, including e-learning and multilingual support. Numerous industries are adopting the technology to improve communication, accessibility, and operational efficiency.

Spring AI provides a good abstraction on top of the underlying OpenAI TTS services. Let's learn more.

TTS API Key Classes

Before we can build a program using Spring AI's TTS API, let's get familiar with a few important components of the library:

Spring AI TTS API Classes

The Spring AI framework supports auto-configuration for various model client classes, allowing integration with the underlying LLM service APIs. This functionality is driven by special configuration properties, which we'll discuss in the upcoming sections. Additionally, the library provides classes such as OpenAiAudioSpeechOptions and OpenAiAudioSpeechOptions.Builder that help build model clients with fine-grained control.

Further, OpenAiAudioSpeechModel is the client model class implementing the SpeechModel interface for invoking OpenAI's TTS APIs. In the future, we may see additional implementations of the SpeechModel interface. The overloaded versions of the OpenAiAudioSpeechModel#call() method facilitate sending prompts to the OpenAI TTS service. The service supports response formats including MP3, OPUS, AAC, FLAC, WAV, and PCM. We receive the audio as a byte array, which can be serialized and saved to the file system.

Prerequisites

In this section, we'll look at a few important prerequisites before we can use the APIs.

Key OpenAI Configurations

First, we must have an OpenAI account and a subscription to generate API keys. It also helps to be familiar with OpenAI's TTS API configurations. In the Spring application, we can declare them in the application properties file:

spring.ai.openai.api-key=sk-proj-XXX

spring.ai.openai.audio.speech.api-key=sk-proj-XXX
spring.ai.openai.audio.speech.options.model=gpt-4o-mini-tts
spring.ai.openai.audio.speech.options.voice=fable
spring.ai.openai.audio.speech.options.response-format=mp3
spring.ai.openai.audio.speech.options.speed=1.0

In our sample program, we've defined all the properties in the application-tts.properties file. This setup primarily covers the key configurations such as the API key, TTS model, voice, response format, and speed.

Additionally, we must avoid storing API keys in the properties files. Hence, we encourage you to store the key in the OPENAI_API_KEY environment variable and read it from there.
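For instance, with the key exported as OPENAI_API_KEY, the properties can reference it through Spring's property placeholder syntax instead of a literal value:

```properties
# Resolve the key from the OPENAI_API_KEY environment variable
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.audio.speech.api-key=${OPENAI_API_KEY}
```

This way, the secret never lands in version control, and each environment can supply its own key.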

Currently, OpenAI offers three TTS models: gpt-4o-mini-tts, tts-1, and tts-1-hd. We've used the latest gpt-4o-mini-tts model in our sample code.

Maven Dependencies

We must import the Spring AI OpenAI starter library in the Spring Boot application's pom.xml:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
    <version>1.0.0-M8</version>
</dependency>

Ideally, we recommend using the Spring Initializr online web tool to import the necessary libraries.

Auto-Configured OpenAI Client

When the Spring AI framework comes across the configuration properties with the namespace spring.ai.openai.audio.speech, it instantiates the OpenAiAudioSpeechModel class. Further, we can autowire it and use it to invoke the OpenAI speech model:

@SpringBootTest
public class SpringAiTtsLiveTest {

    @Autowired
    private SpeechModel openAiAudioSpeechModel;

    @Test
    void givenAutoConfiguredOpenAiAudioSpeechModel_whenCalled_thenCreateAudioFile() {
        assertInstanceOf(OpenAiAudioSpeechModel.class, openAiAudioSpeechModel);
        try {
            // Read the poem text and send it to the OpenAI TTS service
            byte[] audioBytes = openAiAudioSpeechModel
              .call(FileWriterUtil.readFile(Paths.get("poem.txt")));
            // Persist the returned audio bytes as an MP3 file
            FileWriterUtil.writeFile(audioBytes,
              Paths.get("tts-output/twinkle-auto.mp3"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

In our sample program, which was developed using Spring Boot Test, we use a custom FileWriterUtil class to read the string from the file poem.txt, which contains the poem "Twinkle, Twinkle, Little Star". Then, we invoke the OpenAiAudioSpeechModel#call() method with the string to call the underlying speech model service. Finally, we invoke FileWriterUtil#writeFile() to persist the resulting byte array in the target/test-classes/tts-output/twinkle-auto.mp3 audio file:
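The article doesn't list FileWriterUtil itself; a minimal sketch, assuming the readFile/writeFile signatures implied by their usage in the tests (the real helper in the article's repository may differ), could look like:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch of the custom FileWriterUtil used in the tests.
// Method names and signatures are assumed from their usage above.
public class FileWriterUtil {

    // Reads the whole text file into a single String (UTF-8)
    public static String readFile(Path path) throws IOException {
        return Files.readString(path);
    }

    // Creates parent directories if needed and writes the audio bytes
    public static void writeFile(byte[] bytes, Path path) throws IOException {
        if (path.getParent() != null) {
            Files.createDirectories(path.getParent());
        }
        Files.write(path, bytes);
    }
}
```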

Listen to Twinkle Twinkle Little Star

Programmatically Configured OpenAI Client

Typically, applications require greater flexibility and control over configuring and creating the model client class. Therefore, we'll often find ourselves programmatically building the OpenAiAudioSpeechModel class:

@Service
public class TtsService {
    @Value("${spring.ai.openai.audio.speech.api-key}")
    private String API_KEY;

    public byte[] textToSpeech(String text, String instruction) throws IOException {
        OpenAiAudioApi openAiAudioApi = OpenAiAudioApi.builder()
          .apiKey(API_KEY)
          .build();
        OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
          .model(OpenAiAudioApi.TtsModel.TTS_1_HD.value)
          .voice(OpenAiAudioApi.SpeechRequest.Voice.CORAL)
          .speed(0.75f)
          .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
          .input(text)
          .build();
        OpenAiAudioSpeechModel openAiAudioSpeechModel 
          = new OpenAiAudioSpeechModel(openAiAudioApi, speechOptions);
        SpeechPrompt speechPrompt = new SpeechPrompt(instruction, speechOptions);
        return openAiAudioSpeechModel.call(speechPrompt).getResult().getOutput();
    }
}

First, in the TtsService#textToSpeech() method, we create the OpenAiAudioApi object. Next, we set the configuration in the OpenAiAudioSpeechOptions object. Then, we use both objects to instantiate the OpenAiAudioSpeechModel object. Finally, we invoke its call() method with the prompt to fetch the audio bytes.

In real-world applications, API keys are usually fetched from a more secure source, such as a vault or a secret manager. Moreover, additional logic often fetches the configurations from a storage layer such as a database or a cache.
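As a simple step in that direction, a hypothetical helper (not part of the article's code) can resolve the key from the environment at runtime and fail fast when it's missing:

```java
// Hypothetical helper: resolves the OpenAI API key from the environment
// instead of a property literal, failing fast if it isn't configured.
public class ApiKeyResolver {

    public static String resolve() {
        String key = System.getenv("OPENAI_API_KEY");
        if (key == null || key.isBlank()) {
            throw new IllegalStateException("OPENAI_API_KEY is not set");
        }
        return key;
    }
}
```

The resolved value can then be passed to OpenAiAudioApi.builder().apiKey(...) in place of the @Value-injected field.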

Moving on, we can autowire the TtsService service class in our program and use it to convert the contents of a text file into an audio file:

@SpringBootTest
public class SpringAiTtsLiveTest {
    @Autowired
    private TtsService ttsService;

    @Test
    void givenManuallyConfiguredOpenAiAudioSpeechModel_whenCalled_thenCreateAudioFile() {
        byte[] audioBytes;
        try {
            final String instruction = "Read the poem with a calm and soothing voice.";
            audioBytes = ttsService
              .textToSpeech(FileWriterUtil.readFile(Paths.get("poem.txt")), instruction);

            FileWriterUtil.writeFile(audioBytes, 
              Paths.get("tts-output/twinkle-manual.mp3"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

We've autowired the TtsService class in the program and later invoked its textToSpeech() method with the contents of poem.txt. Finally, we store the byte array received from the OpenAI model service into the file system at target/test-classes/tts-output/twinkle-manual.mp3:

Listen to Twinkle Twinkle Little Star
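Once either test has run, a small pure-JDK check (a hypothetical helper, not part of the article's code) can confirm that the audio file was actually written and is non-empty:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: quick sanity check that a generated audio
// file exists and contains at least one byte.
public class AudioFileCheck {

    public static boolean isNonEmptyAudio(Path path) throws IOException {
        return Files.exists(path) && Files.size(path) > 0;
    }
}
```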

Conclusion

In this article, we've learned about the important components of the Spring AI TTS API. The library is still early in its support for LLM service providers beyond OpenAI. Nevertheless, the current OpenAI support works well and serves its purpose. As the library matures, we should watch out for minor API changes.

Visit our GitHub repository to access the article's source code.
