WebSocket Streaming Speech Synthesis

Product Overview

  • This API performs streaming speech synthesis over a WebSocket connection. The client sends JSON text frames, and the server returns task initialization, audio chunk, task completion, or error events.
  • It is suitable for real-time scenarios where multiple synthesis requests are sent over the same connection and response events are grouped by task ID.

Apply for services

The Synthesis API is self-service. Sign up on the LiveData official website (https://www.ilivedata.com/), then create an application in the console. An appId and a service key (the secretKey used to sign requests) will be assigned to you.

You can also activate other services on the Management Console - Overview Page.

Integration Flow

  1. Call the token issuing API, signing the request with appId and secretKey, to obtain a WebSocket token.
  2. Append the token as a query parameter to the returned wsUrl and establish the WebSocket connection.
  3. After the connection is established, send each synthesis request as a JSON text message over the WebSocket.
  4. The server returns init, audio, done, and error events over the same connection.

Get WebSocket Token

Service Endpoint

https://tts.ilivedata.com/api/v1/speech/synthesis/ws-token

HTTP Request Headers

| Header | Value | Description |
| --- | --- | --- |
| X-AppId | Example: 81900001 | Unique identifier of the project or application |
| X-TimeStamp | Example: 2024-07-01T07:59:59Z | UTC timestamp of the request. The timestamp must follow the W3C date and time format |
| Authorization | Example: Njl86M/jY6zZaZoGhZdGO+GI/8+yGFECusGH1yQHUFE= | Signature token |

Request Method: GET

Request Signature

When requesting the token issuing API, use appId and secretKey to sign the request. The API verifies the signature with the same algorithm. If the signature is invalid, authentication fails.

Signature Calculation

  1. Construct StringToSign ("\n" stands for ASCII newline character):
StringToSign = HTTPMethod + "\n" +
               HostHeaderInLowercase + "\n" +
               HTTPRequestURI + "\n" +
               "X-AppId:" + SAME_APPID_IN_HEADER + "\n" +
               "X-TimeStamp:" + SAME_TIMESTAMP_IN_HEADER

The token issuing API uses the GET method and usually has no request body. HTTPRequestURI is the absolute path of the request URI without the query string.

  2. Compute the HMAC-SHA256 of StringToSign, using secretKey as the key.

  3. Encode the resulting digest as a Base64 string.

  4. Put the Base64 string into the Authorization HTTP request header.

Signature Example

GET
tts.ilivedata.com
/api/v1/speech/synthesis/ws-token
X-AppId:81900001
X-TimeStamp:2024-11-01T07:59:59Z
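
The signing steps above can be sketched in Python as follows. This is a minimal sketch rather than an official SDK: the function names are illustrative, the secret key passed in is a placeholder, and UTF-8 encoding of both secretKey and StringToSign is an assumption that the text does not state.

```python
import base64
import hashlib
import hmac
from datetime import datetime, timezone
from typing import Dict, Optional

HOST = "tts.ilivedata.com"
PATH = "/api/v1/speech/synthesis/ws-token"


def build_string_to_sign(app_id: str, timestamp: str) -> str:
    # Five fields joined by "\n", with no trailing newline.
    return "\n".join([
        "GET",
        HOST.lower(),
        PATH,
        f"X-AppId:{app_id}",
        f"X-TimeStamp:{timestamp}",
    ])


def sign(secret_key: str, string_to_sign: str) -> str:
    # HMAC-SHA256 over StringToSign, then Base64 of the raw digest.
    # Assumption: both inputs are encoded as UTF-8.
    digest = hmac.new(
        secret_key.encode("utf-8"),
        string_to_sign.encode("utf-8"),
        hashlib.sha256,
    ).digest()
    return base64.b64encode(digest).decode("ascii")


def build_headers(app_id: str, secret_key: str,
                  timestamp: Optional[str] = None) -> Dict[str, str]:
    # The header values must be the same ones used in StringToSign.
    if timestamp is None:
        timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    signature = sign(secret_key, build_string_to_sign(app_id, timestamp))
    return {
        "X-AppId": app_id,
        "X-TimeStamp": timestamp,
        "Authorization": signature,
    }
```

Calling build_headers("81900001", "<your secretKey>") returns the three headers to send with the GET request to the token issuing API.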

Request Sample

curl -X GET 'https://tts.ilivedata.com/api/v1/speech/synthesis/ws-token' \
  -H 'X-AppId: 81900001' \
  -H 'X-TimeStamp: 2024-11-01T07:59:59Z' \
  -H 'Authorization: {signature}'

Response Sample

{
  "token": "eyJhbGciOiJSUzI1NiIsInR5cCI6IkpXVCJ9...",
  "expiresIn": 60,
  "expiresAt": 1782359144,
  "wsUrl": "wss://tts.ilivedata.com/api/v1/speech/synthesis/ws"
}

Response Fields

| Field Name | Type | Description |
| --- | --- | --- |
| token | String | WebSocket JWT signed with RS256 |
| expiresIn | Number | Token lifetime in seconds. The token must be used to establish the WebSocket connection before it expires |
| expiresAt | Number | Token expiration time in Unix seconds |
| wsUrl | String | TTS WebSocket connection URL without the token parameter |

Establish WebSocket Connection

WebSocket URL

wss://tts.ilivedata.com/api/v1/speech/synthesis/ws?token={token}

The token is used only for WebSocket handshake authentication. After the connection is established, the current connection will not be closed automatically when the token expires. If the connection is closed and the client needs to reconnect, obtain a new token.

TTS verifies the JWT signature, expiration time, aud, iss, scope, path, and appId. The appId in the request message must match the appId in the token.
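
As an illustration of how the token response maps to the connection URL, the sketch below appends the token query parameter to wsUrl and refuses to build a URL from a token that has already expired. The function name build_ws_url is hypothetical, and comparing expiresAt against the local clock is an assumption (it ignores clock skew between client and server).

```python
import time
from typing import Optional
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_ws_url(token_response: dict, now: Optional[float] = None) -> str:
    """Return wsUrl with the token query parameter appended."""
    now = time.time() if now is None else now
    # expiresAt is documented as Unix seconds.
    if now >= token_response["expiresAt"]:
        raise ValueError("token has expired; request a new one")
    parts = urlsplit(token_response["wsUrl"])
    # Keep any query parameters already present on wsUrl.
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("token", token_response["token"]))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

The returned string can be passed to whichever WebSocket client the application uses.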

WebSocket Request Parameters

After the connection is established, the client sends request JSON through WebSocket text messages.

Top-level Parameters

| Field Name | Requirement | Type | Description |
| --- | --- | --- | --- |
| appId | Conditionally required | Number | Required if request.appId is omitted. The top-level appId has priority |
| sessionId | Optional | String | Business session ID. If omitted, the same WebSocket connection reuses a connection-level default sessionId |
| request | Required | Object | Synthesis request body, same structure as SynthesisRequest |

request

| Field Name | Requirement | Type | Description |
| --- | --- | --- | --- |
| appId | Conditionally required | Number | Required if the top-level appId is omitted |
| text | Required | String | Text to synthesize. It must not be empty after trimming leading and trailing spaces |
| language | Optional | String | Language of the text. Passing this parameter is recommended. If omitted, the language is detected automatically. For supported languages, see Language List |
| voice | Optional | VoiceSetting | Voice configuration |
| output | Optional | OutputSetting | Output audio configuration |

VoiceSetting

| Field Name | Requirement | Type | Description |
| --- | --- | --- | --- |
| name | Optional | String | Voice name from Prebuilt Voices or Voice Registration |
| audio | Optional | String | Audio file used for voice cloning when the voice name is not specified |
| emotion | Optional | String | Emotional expression |

OutputSetting

| Field Name | Requirement | Type | Description |
| --- | --- | --- | --- |
| format | Optional | String | Output audio format. Supported values are pcm, wav, mp3, and opus. The default value is wav |

Request Sample

{
  "appId": 81900001,
  "request": {
    "appId": 81900001,
    "text": "Hello, this is a WebSocket streaming speech synthesis example.",
    "language": "en",
    "voice": {
      "name": "juvenile"
    },
    "output": {
      "format": "mp3"
    }
  }
}

If the client needs to specify a business session, pass sessionId explicitly:

{
  "appId": 81900001,
  "sessionId": "biz-session-001",
  "request": {
    "appId": 81900001,
    "text": "The first message in the same business session.",
    "voice": {
      "name": "juvenile"
    },
    "output": {
      "format": "mp3"
    }
  }
}
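
A request message with the shape shown above could be assembled as in the following sketch. The helper name build_synthesis_message and its keyword arguments are illustrative, not part of the API. It sends only the top-level appId (which the parameter table says takes priority) and performs the documented checks on text and format on the client side before sending.

```python
import json
from typing import Optional

ALLOWED_FORMATS = {"pcm", "wav", "mp3", "opus"}


def build_synthesis_message(app_id: int, text: str,
                            language: Optional[str] = None,
                            voice_name: Optional[str] = None,
                            audio_format: Optional[str] = None,
                            session_id: Optional[str] = None) -> str:
    """Return the JSON text frame for one synthesis request."""
    # text must not be empty after trimming.
    if not text.strip():
        raise ValueError("text must not be empty")
    if audio_format is not None and audio_format not in ALLOWED_FORMATS:
        raise ValueError(f"unsupported format: {audio_format}")

    request = {"text": text}
    # Optional fields are left out entirely when not provided.
    if language is not None:
        request["language"] = language
    if voice_name is not None:
        request["voice"] = {"name": voice_name}
    if audio_format is not None:
        request["output"] = {"format": audio_format}

    message = {"appId": app_id, "request": request}
    if session_id is not None:
        message["sessionId"] = session_id
    return json.dumps(message, ensure_ascii=False)
```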

WebSocket Response Events

init Event

Indicates that the server has accepted the task and returned task identifiers.

| Field Name | Type | Description |
| --- | --- | --- |
| event | String | Fixed value: init |
| taskId | String | Task ID generated by the server |
| sessionId | String | Session ID. If omitted by the client, the server generates a connection-level session ID |
| status | String | Fixed value: init |
| taskStatus | Number | Task status |

{
  "event": "init",
  "taskId": "bj_ws_1b3d21549d3841d3b4400829403a4fff",
  "sessionId": "bj_ws_5f45c8fa85814b159741c80620f705bf",
  "status": "init",
  "taskStatus": 1
}

audio Event

Indicates an audio chunk.

| Field Name | Type | Description |
| --- | --- | --- |
| event | String | Fixed value: audio |
| taskId | String | Task ID |
| sessionId | String | Session ID |
| seq | Number | Audio chunk sequence number |
| itemIndex | Number | Text chunk index |
| itemDone | Boolean | Whether the current itemIndex is completed |
| sampleRate | Number | Sample rate of the current audio chunk |
| durationMs | Number | Duration of the current audio chunk in milliseconds |
| audioBase64 | String | Base64 string of the audio binary chunk |
| status | String | Fixed value: streaming |

{
  "event": "audio",
  "taskId": "bj_ws_1b3d21549d3841d3b4400829403a4fff",
  "sessionId": "bj_ws_5f45c8fa85814b159741c80620f705bf",
  "seq": 12,
  "itemIndex": 0,
  "itemDone": false,
  "sampleRate": 22050,
  "durationMs": 120,
  "audioBase64": "...",
  "status": "streaming"
}

done Event

Indicates that the task is completed and the full audio file has been uploaded.

| Field Name | Type | Description |
| --- | --- | --- |
| event | String | Fixed value: done |
| taskId | String | Task ID |
| sessionId | String | Session ID |
| status | String | Fixed value: done |
| url | String | URL of the uploaded audio file |

{
  "event": "done",
  "taskId": "bj_ws_1b3d21549d3841d3b4400829403a4fff",
  "sessionId": "bj_ws_5f45c8fa85814b159741c80620f705bf",
  "status": "done",
  "url": "https://xxx.cos.accelerate.myqcloud.com/tts/.../bj_ws_1b3d21549d3841d3b4400829403a4fff.mp3"
}

error Event

Indicates that the task failed.

| Field Name | Type | Description |
| --- | --- | --- |
| event | String | Fixed value: error |
| taskId | String | Task ID. It may be empty if the request has not entered the task initialization phase |
| sessionId | String | Session ID. It may be empty if the request has not entered the task initialization phase |
| status | String | Fixed value: error |
| errorCode | Number | Error code |
| errorMessage | String | Error message |

{
  "event": "error",
  "taskId": "bj_ws_1b3d21549d3841d3b4400829403a4fff",
  "sessionId": "bj_ws_5f45c8fa85814b159741c80620f705bf",
  "status": "error",
  "errorCode": 3003,
  "errorMessage": "Invalid voice name."
}
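
Because several tasks can share one connection, a client typically groups incoming events by taskId. The sketch below shows one way to do that. TaskState and EventRouter are hypothetical names. Joining chunks in ascending seq order is an assumption about how seq is meant to be used, and whether the joined bytes are directly playable depends on the output format, which the event descriptions do not specify.

```python
import base64
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TaskState:
    task_id: str
    session_id: str = ""
    status: str = ""
    chunks: List[Tuple[int, bytes]] = field(default_factory=list)
    url: Optional[str] = None
    error: Optional[Tuple[int, str]] = None

    def audio(self) -> bytes:
        # Assumption: ascending seq gives the intended chunk order.
        ordered = sorted(self.chunks, key=lambda chunk: chunk[0])
        return b"".join(data for _, data in ordered)


class EventRouter:
    def __init__(self) -> None:
        self.tasks: Dict[str, TaskState] = {}

    def handle(self, raw: str) -> TaskState:
        """Apply one server event and return the task it belongs to."""
        event = json.loads(raw)
        # taskId may be empty for errors raised before task initialization;
        # such events are collected under the "" key.
        task_id = event.get("taskId") or ""
        task = self.tasks.setdefault(task_id, TaskState(task_id))
        task.session_id = event.get("sessionId") or task.session_id
        task.status = event.get("status", task.status)

        kind = event.get("event")
        if kind == "audio":
            data = base64.b64decode(event["audioBase64"])
            task.chunks.append((event["seq"], data))
        elif kind == "done":
            task.url = event.get("url")
        elif kind == "error":
            task.error = (event.get("errorCode"), event.get("errorMessage"))
        return task
```

Feeding every received text frame to EventRouter.handle keeps per-task audio, the final url, and any error separate even when events from different tasks interleave.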

taskId and sessionId Rules

  • The client does not need to pass taskId. The server generates a unique taskId for each message and returns it in init, audio, done, and error events.
  • If the client passes taskId, the server ignores it and still uses the server-generated taskId.
  • If the client does not pass sessionId, the server generates a connection-level default sessionId when the WebSocket connection is established. Multiple requests on the same connection return the same sessionId.
  • If the client explicitly passes sessionId, the server uses the client-provided sessionId.

Disconnection Handling

  • After the WebSocket connection is closed, the client needs to establish a new connection and resend the synthesis request.
  • The server does not use the client-provided taskId for task-level recovery.
  • If the business needs to associate requests before and after reconnection, the client may pass the same sessionId as the business session identifier.
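
One way to keep requests associated across reconnections is to choose a business sessionId on the client and attach it to every request, as sketched below. BusinessSession is a hypothetical helper, and the "biz-" prefix with a random UUID is only an illustrative naming scheme. The text does not define any required format for sessionId.

```python
import json
import uuid
from typing import Optional


class BusinessSession:
    """Holds a client-chosen sessionId that outlives any single connection."""

    def __init__(self, session_id: Optional[str] = None) -> None:
        # Assumption: any unique string is acceptable as a sessionId.
        self.session_id = session_id or "biz-" + uuid.uuid4().hex

    def tag(self, message: str) -> str:
        # Set (or overwrite) the top-level sessionId of a request message.
        payload = json.loads(message)
        payload["sessionId"] = self.session_id
        return json.dumps(payload, ensure_ascii=False)
```

After a reconnect, the client resends the request through the same BusinessSession instance, so both the original and the resent request carry the same sessionId while each receives its own server-generated taskId.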

Client Handling Recommendations

  • The token expires after expiresIn seconds, so establish the WebSocket connection immediately after obtaining it.
  • After receiving init, record the returned taskId and sessionId for troubleshooting and for grouping later events.
  • Decode audioBase64 before playing or caching audio chunks.
  • Use the done event as the final source of the uploaded audio file URL.
  • If the connection is closed, establish a new WebSocket connection and resend the request. The new request will generate a new taskId.
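
The recommendations above can be combined into a small request loop, sketched here under explicit assumptions. The connection is any object with blocking send(str) and recv() -> str methods (the TextConnection protocol is defined only for this example). Only one request is in flight on the connection at a time. Chunks are joined in arrival order. SynthesisError and synthesize are hypothetical names.

```python
import base64
import json
from typing import List, Optional, Protocol, Tuple


class TextConnection(Protocol):
    # Minimal interface assumed of the WebSocket client object.
    def send(self, message: str) -> None: ...
    def recv(self) -> str: ...


class SynthesisError(Exception):
    def __init__(self, code: Optional[int], message: Optional[str]) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def synthesize(conn: TextConnection, message: str) -> Tuple[str, bytes, str]:
    """Send one request and return (taskId, audio bytes, uploaded file URL)."""
    conn.send(message)
    task_id: Optional[str] = None
    chunks: List[bytes] = []
    while True:
        event = json.loads(conn.recv())
        kind = event.get("event")
        if kind == "error":
            # taskId may be empty here, so the error is not filtered by task.
            raise SynthesisError(event.get("errorCode"), event.get("errorMessage"))
        if kind == "init":
            # Record the server-generated taskId for grouping later events.
            task_id = event["taskId"]
        elif kind == "audio" and event.get("taskId") == task_id:
            chunks.append(base64.b64decode(event["audioBase64"]))
        elif kind == "done" and event.get("taskId") == task_id:
            # done is the final source of the uploaded audio file URL.
            return task_id, b"".join(chunks), event["url"]
```

If the connection closes while this loop is waiting, the underlying client will raise its own exception. The caller would then obtain a new token, reconnect, and call synthesize again with the same message.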