Proving human authenticity of recorded voices

ASU researchers Visar Berisha, Daniel Bliss and Julie Liss develop a special microphone to verify human speech

By TJ Triolo
May 1, 2024

Fulton Schools Professor Visar Berisha sitting in front of a microphone with a small circuit board attached to it.

Visar Berisha, a professor of electrical engineering in the Ira A. Fulton Schools of Engineering at Arizona State University with a joint appointment in ASU’s College of Health Solutions, records speech with OriginStory technology. OriginStory, which won the U.S. Federal Trade Commission Voice Cloning Challenge, uses a special microphone with sensors that detect qualities of speech produced only by humans, helping verify that voice recordings are not generated by artificial intelligence. Image courtesy of Visar Berisha

“Deepfakes” have become a major societal concern with the advent of video and audio content generated by artificial intelligence, or AI. Deepfakes are convincing imitations that blur the line between fantasy and reality. They can make it difficult to determine, for example, whether a politician actually made a troubling statement or was sabotaged by those seeking to interfere in an election.

“Until recently, the sound of a recorded voice was universally accepted as genuinely human,” says Visar Berisha, a professor of electrical engineering in the Ira A. Fulton Schools of Engineering at Arizona State University with a joint appointment in the university’s College of Health Solutions. “There was no reason to doubt its authenticity. With the advent of voice cloning technology, this trust is eroding and skepticism, rather than trust, will become the new norm.”

Because deepfake audio has the potential to ruin reputations and erode faith in institutions, the U.S. Federal Trade Commission, or FTC, held the FTC Voice Cloning Challenge, which invited creative multidisciplinary methods to combat AI-generated deepfake audio, with winners sharing $35,000 in prize money.

One of the contest’s winners is OriginStory, a project that uses a new kind of microphone, one that first verifies that a human speaker is producing recorded speech, then watermarks the speech as authentically human. The watermark can be shown to listeners, establishing a chain of trust from recording to retrieval.

OriginStory’s development is heavy on ASU involvement; the project was developed with ASU resources and patented through SkySong Innovations.

Berisha leads the development team, which includes fellow ASU faculty members Daniel Bliss, a professor of electrical engineering in the School of Electrical, Computer and Energy Engineering, part of the Fulton Schools, and Julie Liss, ASU College of Health Solutions associate dean and professor of speech and hearing science.

Human biology to the rescue

Although human and AI-generated speech can sound similar to the untrained ear, the ways these signals are generated are markedly different. Deepfakes are algorithmically generated using neural networks, a type of machine learning technology.

On the other hand, the biological human speech production mechanism includes intermediate biosignals such as vocal cord vibrations and movements of articulators, which are the body parts used to form speech such as the lips, tongue and nasal cavity.

OriginStory uses sensor technology already present in a variety of electronics to detect these biosignals while the microphone performs its normal function of recording speech. Because the biosignals and speech are recorded at the same time, OriginStory can confirm the authenticity of a recorded human voice.

The presence of the biosignals indicates that a distinctly human speech production mechanism generated the speech. OriginStory also protects the privacy of those recorded, as the biosignals it verifies distinguish human speech from AI-generated speech but do not distinguish one individual from another.

The resulting audio file is embedded with a watermark verifying its legitimacy. Anyone who later retrieves the recording can then confirm that it is authentically human, which helps maintain public trust.
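
The verify-then-watermark flow described above can be illustrated with a toy sketch. The article does not describe OriginStory's actual sensors, algorithm or watermark format, so everything here is an assumption made for illustration: the function names, the frame size, the energy-correlation test and the threshold are all hypothetical, and a keyed hash (HMAC) stands in for a real embedded audio watermark.

```python
import hashlib
import hmac
import math
import struct

FRAME = 160  # hypothetical frame size: 10 ms of samples at 16 kHz


def frame_energy(signal, frame=FRAME):
    """Mean squared amplitude of each non-overlapping frame."""
    return [
        sum(x * x for x in signal[i:i + frame]) / frame
        for i in range(0, len(signal) - frame + 1, frame)
    ]


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def looks_human(audio, biosignal, threshold=0.8):
    """Assumed check: biosignal activity must rise and fall with the speech."""
    e_audio, e_bio = frame_energy(audio), frame_energy(biosignal)
    n = min(len(e_audio), len(e_bio))
    if n < 2:
        return False
    return pearson(e_audio[:n], e_bio[:n]) >= threshold


def sign_recording(audio, key):
    """Stand-in for a watermark: an HMAC-SHA256 tag over the raw samples."""
    payload = struct.pack(f"<{len(audio)}f", *audio)
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def record(audio, biosignal, key):
    """Return a tag only when the simultaneous biosignal check passes."""
    if not looks_human(audio, biosignal):
        return None
    return sign_recording(audio, key)
```

In this sketch, a recording whose sensor channel stays flat while audio is present gets no tag, while one whose sensor activity tracks the speech does. A listener-side tool holding the same hypothetical key could recompute the tag to check that the audio has not been altered since recording.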

Addressing threats in a new AI-powered era

Inspiration for the idea came from a news story Berisha saw in 2023 about a mother living in the Phoenix area who received a call from a scammer claiming to have kidnapped her daughter. However, the teenage girl was safe and sound; what was supposedly her voice on the phone was an AI clone.

“It was really scary to read, and it hit home in a personal way because I have kids about the same age,” Berisha says.

Liss, an expert in speech physiology and speech acoustics, joined the project because she shares Berisha’s concern about the dangers of AI voice cloning technology. She says developing protection against AI-generated speech is crucial to ensuring world security.

The project is the latest in more than 10 years of collaboration between the pair on projects transcending boundaries between engineering and health applications.

“To translate innovative ideas into practical solutions, interdisciplinary collaborations are crucial,” Liss says. “ASU expects its faculty to imagine and try bold and innovative approaches to solving the world’s challenges. It’s baked into the culture here.”

With the Voice Cloning Challenge award under its belt, the OriginStory team aims to continue refining the technology for eventual commercialization. The team members will work with Drena Kusari, vice president of product at Microsoft, leveraging her expertise in developing tech products and bringing them to market.

For Berisha, the FTC’s selection of OriginStory as one of its winners underscores how important widespread use of the technology could be for society.

He says, “Our selection serves as further validation for our central thesis: We need new technology to establish a chain of trust that a voice is authentically human from the moment it is recorded to when it is listened to.”