
Spring AI's Text-to-Speech API

Photo by Jeffrey Hamilton on Unsplash

Spring AI's Text-to-Speech (TTS) API helps convert text documents into audio files. Currently, Spring AI supports integration only with the OpenAI TTS API.

TTS has applications in multiple fields, including e-learning and multilingual support. Numerous industries are adopting the technology to improve communication, accessibility, and operational efficiency.

Spring AI provides a good abstraction on top of the underlying OpenAI TTS services. Let's learn more.

TTS API Key Classes

Before we can build a program using Spring AI's TTS API, let's get familiar with a few important components of the library:

Spring AI TTS API Classes

The Spring AI framework supports auto-configuration for various model client classes, allowing integration with the underlying LLM service APIs. This functionality is driven by special configuration properties, which we'll discuss in the upcoming sections. Additionally, the library provides classes such as OpenAiAudioSpeechOptions and OpenAiAudioSpeechOptions.Builder that help build model clients with fine-grained control.

Further, OpenAiAudioSpeechModel is the client model class implementing the SpeechModel interface for invoking OpenAI's TTS APIs. In the future, we may see additional implementations of the SpeechModel interface. The overloaded versions of the OpenAiAudioSpeechModel#call() method facilitate sending prompts to the OpenAI TTS service. The service supports response formats including MP3, OPUS, AAC, FLAC, WAV, and PCM. We receive the audio as a byte array, which can be serialized and saved to the file system.

Prerequisites

In this section, we'll look at a few important prerequisites before we can use the APIs.

Key OpenAI Configurations

First, we must have an OpenAI account and a subscription to generate API keys. It also helps to be familiar with OpenAI's TTS API configurations. In the Spring application, we can declare them in the application properties file:

spring.ai.openai.api-key=sk-proj-XXX

spring.ai.openai.audio.speech.api-key=sk-proj-XXX
spring.ai.openai.audio.speech.options.model=gpt-4o-mini-tts
spring.ai.openai.audio.speech.options.voice=fable
spring.ai.openai.audio.speech.options.response-format=mp3
spring.ai.openai.audio.speech.options.speed=1.0

In our sample program, we've defined all the properties in the application-tts.properties file. This setup primarily covers the key configurations such as the API key, TTS model, voice, response format, and speed.

Additionally, we must avoid storing API keys in the properties files. Hence, we encourage you to store the key in the OPENAI_API_KEY environment variable and read it from there.
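For instance, with the key exported as OPENAI_API_KEY, the properties can reference it through Spring's property placeholder syntax instead of a literal value:

```properties
# Resolve the key from the OPENAI_API_KEY environment variable
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.audio.speech.api-key=${OPENAI_API_KEY}
```

This way, the secret never lands in version control, and each environment can supply its own key.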

Currently, OpenAI offers three TTS models: gpt-4o-mini-tts, tts-1, and tts-1-hd. We've used the latest gpt-4o-mini-tts model in our sample code.

Maven Dependencies

We must import the Spring AI OpenAI starter library in the Spring Boot application's pom.xml:

<dependency>
    <groupId>org.springframework.ai</groupId>
    <artifactId>spring-ai-starter-model-openai</artifactId>
    <version>1.0.0-M8</version>
</dependency>

Ideally, we recommend using the Spring Initializr online web tool to import the necessary libraries.

Auto-Configured OpenAI Client

When the Spring AI framework comes across the configuration properties with the namespace spring.ai.openai.audio.speech, it instantiates the OpenAiAudioSpeechModel class. Further, we can autowire it and use it to invoke the OpenAI speech model:

@SpringBootTest
public class SpringAiTtsLiveTest {

    @Autowired
    private SpeechModel openAiAudioSpeechModel;

    @Test
    void givenAutoConfiguredOpenAiAudioSpeechModel_whenCalled_thenCreateAudioFile() {
        assertInstanceOf(OpenAiAudioSpeechModel.class, openAiAudioSpeechModel);
        try {
            // Read the poem text and send it to the OpenAI TTS service
            byte[] audioBytes = openAiAudioSpeechModel
              .call(FileWriterUtil.readFile(Paths.get("poem.txt")));
            // Persist the returned audio bytes as an MP3 file
            FileWriterUtil.writeFile(audioBytes,
              Paths.get("tts-output/twinkle-auto.mp3"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

In our sample program, which was developed using Spring Boot Test, we use a custom FileWriterUtil class to read the string from the file poem.txt, which contains the poem "Twinkle, Twinkle, Little Star". Then, we invoke the OpenAiAudioSpeechModel#call() method with the string to call the underlying speech model service. Finally, we invoke FileWriterUtil#writeFile() to persist the resulting byte array in the target/test-classes/tts-output/twinkle-auto.mp3 audio file:
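The article doesn't list FileWriterUtil itself; a minimal sketch, assuming the readFile/writeFile signatures implied by their usage in the tests (the real helper in the article's repository may differ), could look like:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Minimal sketch of the custom FileWriterUtil used in the tests.
// Method names and signatures are assumed from their usage above.
public class FileWriterUtil {

    // Reads the whole text file into a single String (UTF-8)
    public static String readFile(Path path) throws IOException {
        return Files.readString(path);
    }

    // Creates parent directories if needed and writes the audio bytes
    public static void writeFile(byte[] bytes, Path path) throws IOException {
        if (path.getParent() != null) {
            Files.createDirectories(path.getParent());
        }
        Files.write(path, bytes);
    }
}
```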

Listen to Twinkle Twinkle Little Star

Programmatically Configured OpenAI Client

Typically, applications require greater flexibility and control over configuring and creating the model client class. Therefore, we'll often find ourselves programmatically building the OpenAiAudioSpeechModel class:

@Service
public class TtsService {
    @Value("${spring.ai.openai.audio.speech.api-key}")
    private String API_KEY;

    public byte[] textToSpeech(String text, String instruction) throws IOException {
        OpenAiAudioApi openAiAudioApi = OpenAiAudioApi.builder()
          .apiKey(API_KEY)
          .build();
        OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
          .model(OpenAiAudioApi.TtsModel.TTS_1_HD.value)
          .voice(OpenAiAudioApi.SpeechRequest.Voice.CORAL)
          .speed(0.75f)
          .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
          .input(text)
          .build();
        OpenAiAudioSpeechModel openAiAudioSpeechModel 
          = new OpenAiAudioSpeechModel(openAiAudioApi, speechOptions);
        SpeechPrompt speechPrompt = new SpeechPrompt(instruction, speechOptions);
        return openAiAudioSpeechModel.call(speechPrompt).getResult().getOutput();
    }
}

First, in the TtsService#textToSpeech() method, we create the OpenAiAudioApi object. Next, we set the configuration in the OpenAiAudioSpeechOptions object. Then, we use both objects to instantiate the OpenAiAudioSpeechModel object. Finally, we invoke its call() method with the prompt to fetch the audio bytes.

In real-world applications, API keys are usually fetched from a more secure source, such as a vault or a secret manager. Moreover, additional logic often fetches the configurations from a storage layer such as a database or a cache.
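As a simple step in that direction, a hypothetical helper (not part of the article's code) can resolve the key from the environment at runtime and fail fast when it's missing:

```java
// Hypothetical helper: resolves the OpenAI API key from the environment
// instead of a property literal, failing fast if it isn't configured.
public class ApiKeyResolver {

    public static String resolve() {
        String key = System.getenv("OPENAI_API_KEY");
        if (key == null || key.isBlank()) {
            throw new IllegalStateException("OPENAI_API_KEY is not set");
        }
        return key;
    }
}
```

The resolved value can then be passed to OpenAiAudioApi.builder().apiKey(...) in place of the @Value-injected field.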

Moving on, we can autowire the TtsService service class in our program and use it to convert the contents of a text file into an audio file:

@SpringBootTest
public class SpringAiTtsLiveTest {
    @Autowired
    private TtsService ttsService;

    @Test
    void givenManuallyConfiguredOpenAiAudioSpeechModel_whenCalled_thenCreateAudioFile() {
        byte[] audioBytes;
        try {
            final String instruction = "Read the poem with a calm and soothing voice.";
            audioBytes = ttsService
              .textToSpeech(FileWriterUtil.readFile(Paths.get("poem.txt")), instruction);

            FileWriterUtil.writeFile(audioBytes, 
              Paths.get("tts-output/twinkle-manual.mp3"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}

We've autowired the TtsService class in the program and later invoked its textToSpeech() method with the contents of poem.txt. Finally, we store the byte array received from the OpenAI model service into the file system at target/test-classes/tts-output/twinkle-manual.mp3:

Listen to Twinkle Twinkle Little Star
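Once either test has run, a small pure-JDK check (a hypothetical helper, not part of the article's code) can confirm that the audio file was actually written and is non-empty:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: quick sanity check that a generated audio
// file exists and contains at least one byte.
public class AudioFileCheck {

    public static boolean isNonEmptyAudio(Path path) throws IOException {
        return Files.exists(path) && Files.size(path) > 0;
    }
}
```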

Conclusion

In this article, we've learned about the important components of the Spring AI TTS API. The library is still early in its support for LLM service providers beyond OpenAI. Nevertheless, the current OpenAI support works well and serves its purpose. As the library matures, we should watch out for minor API changes.

Visit our GitHub repository to access the article's source code.
