Photo by Jeffrey Hamilton on Unsplash
Spring AI's Text-to-Speech (TTS) API helps convert text documents into audio files. Currently, Spring AI supports integration only with the OpenAI TTS API.
Further, TTS has applications in multiple fields, including e-learning and multilingual support. Numerous industries are adopting this technology to improve communication, accessibility, and operational efficiency.
Spring AI provides a good abstraction on top of the underlying OpenAI TTS services. Let's learn more.
TTS API Key Classes
Before we can build a program using Spring AI's TTS API, let's get familiar with a few important components of the library:
The Spring AI framework supports auto-configuration of various model client classes, allowing integration with the underlying LLM service APIs. This functionality is driven by special configuration properties, which we'll discuss in the upcoming sections. Additionally, the library provides classes such as OpenAiAudioSpeechOptions and OpenAiAudioSpeechOptions.Builder that help build model clients with fine-grained control.
Further, OpenAiAudioSpeechModel is the client model class implementing the SpeechModel interface for invoking OpenAI's TTS APIs. In the future, we may see additional implementations of the SpeechModel interface. The overloaded versions of the OpenAiAudioSpeechModel#call() method facilitate sending prompts to the OpenAI TTS service. The OpenAI TTS service supports response formats including MP3, OPUS, AAC, FLAC, WAV, and PCM. We receive the audio in byte-array form, which can be written to the file system.
Prerequisites
Key OpenAI Configurations
First, we must have an OpenAI account with an active subscription to generate API keys. It's also helpful to know OpenAI's TTS API configurations, which we can declare in the Spring application's properties file:
spring.ai.openai.api-key=sk-proj-XXX
spring.ai.openai.audio.speech.api-key=sk-proj-XXX
spring.ai.openai.audio.speech.options.model=gpt-4o-mini-tts
spring.ai.openai.audio.speech.options.voice=fable
spring.ai.openai.audio.speech.options.response-format=mp3
spring.ai.openai.audio.speech.options.speed=1.0
In our sample program, we've defined all the properties in the application-tts.properties file. This setup primarily covers the key configurations such as the API key, TTS model, voice, response format, and speed.
Additionally, we should avoid storing API keys in properties files. Instead, we encourage you to store the key in the OPENAI_API_KEY environment variable and read it from there.
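For instance, Spring's property placeholder syntax can reference the environment variable directly, so the key never appears in the file itself:

```properties
# Resolve the key from the OPENAI_API_KEY environment variable at startup
spring.ai.openai.api-key=${OPENAI_API_KEY}
spring.ai.openai.audio.speech.api-key=${OPENAI_API_KEY}
```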
Currently, OpenAI offers three TTS models: gpt-4o-mini-tts, tts-1, and tts-1-hd. We've used the latest gpt-4o-mini-tts model in our sample code.
Maven Dependencies
We must import Spring AI's OpenAI starter library in the Spring Boot application's pom.xml:
<dependency>
<groupId>org.springframework.ai</groupId>
<artifactId>spring-ai-starter-model-openai</artifactId>
<version>1.0.0-M8</version>
</dependency>
Ideally, we recommend using the Spring Initializr online web tool to import the necessary libraries.
Auto-Configured OpenAI Client
When the Spring AI framework comes across the configuration properties with the namespace spring.ai.openai.audio.speech, it instantiates the OpenAiAudioSpeechModel class. Further, we can autowire it and use it to invoke the OpenAI speech model:
@SpringBootTest
public class SpringAiTtsLiveTest {

    @Autowired
    private SpeechModel openAiAudioSpeechModel;

    @Test
    void givenAutoConfiguredOpenAiAudioSpeechModel_whenCalled_thenCreateAudioFile() {
        assertInstanceOf(OpenAiAudioSpeechModel.class, openAiAudioSpeechModel);
        try {
            byte[] audioBytes = openAiAudioSpeechModel
              .call(FileWriterUtil.readFile(Paths.get("poem.txt")));
            FileWriterUtil.writeFile(audioBytes, Paths.get("tts-output/twinkle-auto.mp3"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
In our sample program, which was developed using Spring Boot Test, we use a custom FileWriterUtil class to read the string from the file poem.txt, which contains the poem Twinkle, Twinkle, Little Star. Then, we invoke the OpenAiAudioSpeechModel#call() method with the string to call the underlying speech model service. Finally, we invoke FileWriterUtil#writeFile() to persist the resulting byte array in the target/test-classes/tts-output/twinkle-auto.mp3 audio file.
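The article doesn't show the FileWriterUtil helper itself, so here's a minimal sketch of what it might look like (the implementation below is our assumption, not the article's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch of the FileWriterUtil helper used in the tests
public class FileWriterUtil {

    // Reads the whole text file into a single String
    public static String readFile(Path path) throws IOException {
        return Files.readString(path);
    }

    // Creates any missing parent directories, then writes the audio bytes
    public static void writeFile(byte[] bytes, Path path) throws IOException {
        if (path.getParent() != null) {
            Files.createDirectories(path.getParent());
        }
        Files.write(path, bytes);
    }
}
```

Using java.nio.file.Files keeps the helper short and lets Files.createDirectories() ensure the tts-output directory exists before writing.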
Programmatically Configured OpenAI Client
Typically, applications require greater flexibility and control over configuring and creating the model client class. Therefore, we'll often find ourselves programmatically building the OpenAiAudioSpeechModel class:
@Service
public class TtsService {

    @Value("${spring.ai.openai.audio.speech.api-key}")
    private String apiKey;

    public byte[] textToSpeech(String text, String instruction) throws IOException {
        OpenAiAudioApi openAiAudioApi = OpenAiAudioApi.builder()
          .apiKey(apiKey)
          .build();
        OpenAiAudioSpeechOptions speechOptions = OpenAiAudioSpeechOptions.builder()
          .model(OpenAiAudioApi.TtsModel.TTS_1_HD.value)
          .voice(OpenAiAudioApi.SpeechRequest.Voice.CORAL)
          .speed(0.75f)
          .responseFormat(OpenAiAudioApi.SpeechRequest.AudioResponseFormat.MP3)
          .input(text)
          .build();
        OpenAiAudioSpeechModel openAiAudioSpeechModel
          = new OpenAiAudioSpeechModel(openAiAudioApi, speechOptions);
        SpeechPrompt speechPrompt = new SpeechPrompt(instruction, speechOptions);
        return openAiAudioSpeechModel.call(speechPrompt).getResult().getOutput();
    }
}
First, in the TtsService#textToSpeech() method, we create the OpenAiAudioApi object. Next, we set the configuration in the OpenAiAudioSpeechOptions object. Then, we use both objects to instantiate the OpenAiAudioSpeechModel object. Finally, we invoke its call() method with the prompt to fetch the audio bytes.
In real-world applications, API keys are normally fetched from a more secure source, such as a vault or a secrets manager. Moreover, additional logic typically fetches the configurations from a storage layer such as a database or a cache.
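As a minimal sketch of this idea, we could prefer a key from the environment over the injected property value (the ApiKeyResolver class below is hypothetical, not part of Spring AI; a production version would call a vault or secrets-manager client instead):

```java
// Hypothetical helper: prefers the OPENAI_API_KEY environment variable
// over a fallback value injected from the properties file
public class ApiKeyResolver {

    public static String resolve(String fallbackFromProperties) {
        String fromEnv = System.getenv("OPENAI_API_KEY");
        // Fall back to the property value only when the variable is absent or blank
        return (fromEnv == null || fromEnv.isBlank()) ? fallbackFromProperties : fromEnv;
    }
}
```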
Moving on, we can autowire the TtsService service class in our program and use it to convert the contents of a text file into an audio file:
@SpringBootTest
public class SpringAiTtsLiveTest {

    @Autowired
    private TtsService ttsService;

    @Test
    void givenManuallyConfiguredOpenAiAudioSpeechModel_whenCalled_thenCreateAudioFile() {
        try {
            String instruction = "Read the poem with a calm and soothing voice.";
            byte[] audioBytes = ttsService
              .textToSpeech(FileWriterUtil.readFile(Paths.get("poem.txt")), instruction);
            FileWriterUtil.writeFile(audioBytes, Paths.get("tts-output/twinkle-manual.mp3"));
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
We've autowired the TtsService class in the program and later invoked its textToSpeech() method with the contents of poem.txt. Finally, we store the byte array received from the OpenAI model service in the file system at target/test-classes/tts-output/twinkle-manual.mp3.
Conclusion
In this article, we learned about the important components of the Spring AI TTS API. The library is still in its early days, supporting only OpenAI among the LLM service providers. Nevertheless, the current OpenAI support works well and serves its purpose, and we should watch out for minor modifications in the API as the library evolves.
Visit our GitHub repository to access the article's source code.