How to use AWS Polly (text-to-speech) with NodeJS

Introduction

Amazon Web Services (AWS) offers a ton of cognitive services as part of its platform. One of them is Polly, a text-to-speech service exposed through an API.

In this article, we’ll use it in NodeJS. There are several ways to get the raw audio output and process it further, but here we’ll use the presigner’s instance method, which returns a temporary URL to the output in whatever format we request (JSON, MP3, Ogg Vorbis, or PCM).

SDK setup

The first step is to install the AWS SDK and configure it before using Polly. Here’s a short article if your environment is not set up yet.

Make sure you choose a service region that suits you: not all regions support AWS Polly, and prices may differ between regions. You can check this in the AWS dashboard.

Initialize and use Polly

We need to initialize the Polly instance. Here we can set the AWS region we want to use for that instance.

const AWS = require('aws-sdk'); // AWS SDK for JavaScript (v2)
// using the ap-south-1 region
const Polly = new AWS.Polly({region: 'ap-south-1'});

Here’s a list of regions supported by Polly; you can use whichever fits your use case. Neural TTS sounds far better than standard TTS, but it is only available in some regions.

Here’s a list of regions where Neural voices are supported:

  • US East (N. Virginia): us-east-1
  • US West (Oregon): us-west-2
  • Africa (Cape Town): af-south-1
  • Asia Pacific (Tokyo): ap-northeast-1
  • Asia Pacific (Seoul): ap-northeast-2
  • Asia Pacific (Mumbai): ap-south-1
  • Asia Pacific (Singapore): ap-southeast-1
  • Asia Pacific (Sydney): ap-southeast-2
  • Canada (Central): ca-central-1
  • Europe (Frankfurt): eu-central-1
  • Europe (Ireland): eu-west-1
  • Europe (London): eu-west-2
  • Europe (Paris): eu-west-3
  • AWS GovCloud (US-West): us-gov-west-1
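
If the region is chosen at runtime, a small guard can catch a neural request in an unsupported region before Polly is called. This is only a minimal sketch: the region set is copied from the list above and may be out of date, and supportsNeural and pickEngine are hypothetical helper names, not part of the AWS SDK.

```javascript
// Regions copied from the list above; check the AWS docs for the current set.
const NEURAL_REGIONS = new Set([
  'us-east-1', 'us-west-2', 'af-south-1', 'ap-northeast-1', 'ap-northeast-2',
  'ap-south-1', 'ap-southeast-1', 'ap-southeast-2', 'ca-central-1',
  'eu-central-1', 'eu-west-1', 'eu-west-2', 'eu-west-3', 'us-gov-west-1',
]);

// Hypothetical helper: true if the region appears in the list above.
function supportsNeural(region) {
  return NEURAL_REGIONS.has(region);
}

// Hypothetical helper: fall back to the standard engine where neural is not listed.
function pickEngine(region) {
  return supportsNeural(region) ? 'neural' : 'standard';
}

console.log(pickEngine('ap-south-1')); // neural
console.log(pickEngine('sa-east-1'));  // standard (not in the list above)
```

The result of pickEngine can be passed as the Engine value in the request parameters.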

Request body parameters:

// List of parameters in input.

{
   "Engine": "string",  // standard or neural
   "LanguageCode": "string",  // en-US, en-GB, hi-IN
   "LexiconNames": [ "string" ],
   "OutputFormat": "string",  // json | mp3 | ogg_vorbis | pcm
   "OutputS3BucketName": "string",
   "OutputS3KeyPrefix": "string",
   "SampleRate": "string",
   "SnsTopicArn": "string",
   "SpeechMarkTypes": [ "string" ],
   "Text": "string",
   "TextType": "string",  // ssml | text
   "VoiceId": "string"  // Aditi | Amy | Astrid | Bianca | Brian, etc.
}

If you want to see all the parameters, here is a reference link. Only the necessary ones are used here. Here’s the list of voices.

const input = {
   Engine: "neural",
   Text: "Hi, how are you doing?",
   OutputFormat: "ogg_vorbis",
   VoiceId: "Amy",
   LanguageCode: "en-GB" // Amy is a British English voice
};
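
Since a typo in one of these values only surfaces as an error from the API, it can help to check the input locally first. The following is a rough sketch, assuming that Text, OutputFormat and VoiceId are the required fields and that the allowed values are the ones shown in the parameter listing above; validatePollyInput is a hypothetical helper, not an SDK function.

```javascript
// Allowed values taken from the parameter listing above.
const OUTPUT_FORMATS = ['json', 'mp3', 'ogg_vorbis', 'pcm'];
const ENGINES = ['standard', 'neural'];
const TEXT_TYPES = ['ssml', 'text'];

// Hypothetical helper: returns a list of problems, empty if the input looks fine.
function validatePollyInput(input) {
  const problems = [];
  // Assumption: these three fields are the required ones.
  for (const key of ['Text', 'OutputFormat', 'VoiceId']) {
    if (typeof input[key] !== 'string' || input[key].length === 0) {
      problems.push(`${key} is required`);
    }
  }
  if (input.OutputFormat && !OUTPUT_FORMATS.includes(input.OutputFormat)) {
    problems.push(`unknown OutputFormat: ${input.OutputFormat}`);
  }
  if (input.Engine !== undefined && !ENGINES.includes(input.Engine)) {
    problems.push(`unknown Engine: ${input.Engine}`);
  }
  if (input.TextType !== undefined && !TEXT_TYPES.includes(input.TextType)) {
    problems.push(`unknown TextType: ${input.TextType}`);
  }
  return problems;
}

console.log(validatePollyInput({ Text: 'Hello', OutputFormat: 'ogg_vorbis', VoiceId: 'Amy' }));
// []
console.log(validatePollyInput({ Text: 'Hello', OutputFormat: 'wav' }));
// [ 'VoiceId is required', 'unknown OutputFormat: wav' ]
```

This does not check that the voice exists or supports the chosen engine; Polly itself remains the final authority on that.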

Create a signer instance.

const signer = new AWS.Polly.Presigner({ service: Polly }); // reuse the configured Polly instance

Use the getSynthesizeSpeechUrl method to get the output as a URL.

signer.getSynthesizeSpeechUrl(input, function (err, data) {
   if (err) {
      // Handle error here
      console.log(err, err.stack);
   } else {
      // data contains the output URL
      const outputUrl = data;
   }
});
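
If the rest of your code uses async/await, the callback can be wrapped in a Promise. The sketch below is one possible way to do it: getSpeechUrl is a hypothetical helper, it assumes the method also accepts an expiry in seconds as its second argument, and fakeSigner is a stand-in object so the example runs without AWS credentials. In real use you would pass the presigner created earlier.

```javascript
// Hypothetical helper: wraps the callback-style call in a Promise.
// `signer` is any object exposing getSynthesizeSpeechUrl(params, expires, callback).
function getSpeechUrl(signer, params, expiresInSeconds = 3600) {
  return new Promise((resolve, reject) => {
    signer.getSynthesizeSpeechUrl(params, expiresInSeconds, (err, url) => {
      if (err) reject(err);
      else resolve(url);
    });
  });
}

// Stand-in signer for illustration only; it never contacts AWS.
const fakeSigner = {
  getSynthesizeSpeechUrl(params, expires, callback) {
    callback(null, `https://example.invalid/speech?voice=${params.VoiceId}&expires=${expires}`);
  },
};

getSpeechUrl(fakeSigner, { VoiceId: 'Amy' }, 600)
  .then((url) => console.log(url))
  .catch((err) => console.error('Polly request failed:', err));
// prints https://example.invalid/speech?voice=Amy&expires=600
```

A shorter expiry keeps the link from being reusable for long, which may matter if the URL is sent to a chat client.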

The output is a temporary URL that can be used anywhere in your application. Just make sure you request a suitable output format in the input parameters.
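
When the URL is handed to another part of the application, it is often useful to know what kind of content it points to. Here is a small lookup sketch. The content types are the ones Polly reports for each OutputFormat as I understand the AWS API reference; the file extensions and the describeFormat helper are my own assumptions for illustration.

```javascript
// Content type per OutputFormat (per the AWS API reference, as I understand it).
// The file extensions are illustrative choices, not something Polly returns.
const FORMAT_INFO = new Map([
  ['mp3', { contentType: 'audio/mpeg', extension: 'mp3' }],
  ['ogg_vorbis', { contentType: 'audio/ogg', extension: 'ogg' }],
  ['pcm', { contentType: 'audio/pcm', extension: 'pcm' }],
  ['json', { contentType: 'application/x-json-stream', extension: 'json' }],
]);

// Hypothetical helper: look up the details for a requested OutputFormat.
function describeFormat(outputFormat) {
  const info = FORMAT_INFO.get(outputFormat);
  if (!info) {
    throw new Error(`Unsupported OutputFormat: ${outputFormat}`);
  }
  return info;
}

console.log(describeFormat('ogg_vorbis').contentType); // audio/ogg
```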

There are other ways to get the output and then process it as audio. For my use case, a chatbot, the temporary URL was exactly what I was looking for.

Thanks for reading! I hope this article helped you. Check out more articles on development on my blog, and follow me on Twitter 🙂