Google’s New Text-to-Speech AI Is so Good We Bet You Can’t Tell It From a Real Human

Can you tell the difference between AI-generated computer speech and a real, live human being? Maybe you’ve always thought you could. Maybe you’re fond of Alexa and Siri but believe you would never confuse either of them with an actual woman.

Things are about to get a lot more interesting. Google engineers have been hard at work creating a text-to-speech system called Tacotron 2. According to a paper they published this month, the system first creates a spectrogram of the text, a visual representation of how the speech should sound. That image is put through Google’s existing WaveNet algorithm, which uses the image to produce extremely natural sounding human speech.

Using this method, the researchers report, “Our model achieves a mean opinion score (MOS) of 4.53 comparable to a MOS of 4.58 for professionally recorded speech.” (A mean opinion score is a measure borrowed from telecommunications: human listeners rate how natural the audio sounds, typically on a scale of 1 to 5, and the ratings are averaged.)
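Since a MOS is just an average of listener ratings, the arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper name `mean_opinion_score` is hypothetical, the ratings are made up rather than taken from Google’s paper, and the 1-to-5 range is the conventional scale, assumed here.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average a list of listener ratings on the conventional 1-5 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be between 1 and 5")
    return mean(ratings)


# Illustrative ratings only; these are NOT data from the Tacotron 2 paper.
example_ratings = [5, 4, 5, 4.5, 4]
print(round(mean_opinion_score(example_ratings), 2))
```

Seen this way, the gap between 4.53 and 4.58 is a small fraction of a single rating step.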

As Google’s audio samples demonstrate, Tacotron 2 can detect from context the difference between the noun “desert” and the verb “desert,” as well as the noun “present” and the verb “present,” and alter its pronunciation accordingly. It can place emphasis on capitalized words and apply the proper inflection when asking a question rather than making a statement.

And it can generate audio that sounds so similar to human speech that it’s difficult or impossible to know the difference. If you want to see just how hard it is, go to Google’s audio samples page, and scroll down to the last set of samples, titled “Tacotron 2 or Human?” There you’ll find Tacotron 2 and a real person each saying sentences such as, “That girl did a video about Star Wars lipstick.”

SPOILER ALERT: To test yourself, listen to the samples and guess which is which before reading the rest of this column.

So which samples are text-to-speech and which are a real human voice? Google’s engineers aren’t saying, but they’ve left a very big clue. Each of the .wav file samples has a file name containing either the term “gen” or “gt.” Based on the paper, it’s highly probable that “gen” indicates speech generated by Tacotron 2, and “gt” is real human speech. (“GT” likely stands for “ground truth,” a machine learning term that basically means “the real deal.”)
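The file-name clue can also be checked mechanically. Here is a minimal Python sketch that assumes, as this column does, that “gen” marks generated speech and “gt” marks ground truth. The function name `guess_source` and the file names are hypothetical placeholders, not the actual names on Google’s page.

```python
import re


def guess_source(filename):
    """Guess whether a sample is synthetic or human from its file name.

    Assumes the unconfirmed convention described in the text:
    "gen" = generated by Tacotron 2, "gt" = ground truth (human).
    """
    # Split into alphanumeric tokens so "gt" and "gen" only match as whole tokens.
    tokens = re.split(r"[^a-z0-9]+", filename.lower())
    if "gen" in tokens:
        return "Tacotron 2"
    if "gt" in tokens:
        return "Real human"
    return "Unknown"


# Hypothetical file names for illustration only.
for name in ["sample_gen_1.wav", "sample_gt_1.wav", "other.wav"]:
    print(name, "->", guess_source(name))
```

Either way, the labels rest on a guess about what “gen” and “gt” mean.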

Assuming this is correct, here are the answers to the test:

“That girl did a video about Star Wars lipstick.”

Sample 1: Real human

Sample 2: Tacotron 2

“She earned a doctorate in sociology from Columbia University.”

Sample 1: Tacotron 2

Sample 2: Real human

“George Washington was the first President of the United States.”

Sample 1: Tacotron 2

Sample 2: Real human

“I’m too busy for romance.”

Sample 1: Real human

Sample 2: Tacotron 2

How many did you get right? And could you really tell the difference, or did you just have to guess?

Source: Get Real
