I am using the latest Microsoft Neural HD voices, such as “en-US:Steffan:DragonHDLatestNeural,” for text-to-speech conversion. These enhanced HD voices are highly expressive, dynamically interpret the spoken text, and produce slightly different results each time they are used. As a result, the length of the generated audio tracks can vary every time the “convert captions to audio” feature is applied.
To create a more professional impression and avoid overly fast speech, it is often useful to insert short pauses between spoken texts. I achieve this by leaving small gaps between the captions in the timeline.
However, this presents a challenge with these HD voices. Because the generated tracks differ with each conversion, and AP9 automatically adjusts the closed captions to match the generated audio, my previously fixed gaps either disappear or become longer or shorter than I originally intended.
How could this behaviour be improved in the future? Perhaps by introducing a “pause time” setting in the CC dialog, where a constant pause following each spoken text could be defined?
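Until such a setting exists, one possible interim workaround, if the audio can be generated outside AP9 and imported afterwards, is to encode a constant pause directly in SSML via the break element. Below is a minimal sketch using the Azure Speech SDK for Python; the subscription key, region, caption strings, and output filename are placeholders, and whether the Dragon HD voices fully honor the break element is an assumption on my part.

```python
# Minimal sketch: synthesize two captions with a fixed pause between them
# using SSML and the Azure Speech SDK. YOUR_KEY, YOUR_REGION, the caption
# texts, and the output filename are placeholders, not values from AP9.
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="YOUR_KEY", region="YOUR_REGION")
# Write the result to a WAV file that can then be imported into the editor.
audio_config = speechsdk.audio.AudioOutputConfig(filename="captions.wav")
synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config,
                                          audio_config=audio_config)

# A constant 700 ms pause between captions; <break> support in the
# HD voices may be limited, so this is an assumption to verify.
ssml = """
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <voice name="en-US-Steffan:DragonHDLatestNeural">
    First caption text.
    <break time="700ms"/>
    Second caption text.
  </voice>
</speak>
"""

result = synthesizer.speak_ssml_async(ssml).get()
if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
    print("Synthesis finished; audio written to captions.wav")
```

A built-in pause setting would of course be much more convenient, since this route bypasses the “convert captions to audio” workflow entirely.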
Thank you
Armin