Research: A Glimpse of Conversation

by Sara Cody

Study reveals the impact of harmonic sound on the cocktail party problem  

One of the biggest mysteries of the brain is the cocktail party problem: how we are able to carry on a one-on-one conversation in a crowded room, tuning out the noisy background to focus on a single voice. MIT researchers have devised a method for manipulating the frequencies of voice recordings to probe how the harmonic frequencies in speech shape our ability to make sense of natural sound in the real world.

When vocal cords produce speech, they open and close at a regular rate, producing pulses of air that occur at regularly spaced time intervals. The ear decomposes sounds according to frequency, and the regular pulses in time translate into frequencies that are harmonic – integer multiples of the pulse rate, which is also known as the fundamental frequency. Speech remains harmonic even when the fundamental frequency is increased or decreased, because the harmonics increase or decrease along with the fundamental frequency. It had been hypothesized that these harmonic relations provide a clue that the brain uses to extract speech signals from mixtures with other sounds. A key prediction is that this ability should be adversely affected if the frequencies are randomly perturbed to be inharmonic. However, until recently it was not possible to manipulate natural speech in this way.
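
To make the relationship concrete, here is a minimal synthesis sketch in Python with NumPy. It illustrates the harmonic/inharmonic distinction only; it is not the Kawahara-based system the study used, and the sample rate, fundamental, harmonic count, and jitter range are arbitrary illustrative choices. The size of the jitter is what controls how inharmonic the result is.

```python
# Toy illustration: a voiced sound modeled as a sum of sinusoids at integer
# multiples of a fundamental f0 (harmonic), versus the same components
# randomly jittered away from those multiples (inharmonic).
# Illustrative parameters; not the study's resynthesis method.
import numpy as np

SR = 16000          # sample rate in Hz (illustrative)
F0 = 200.0          # fundamental frequency in Hz (illustrative)
N_HARMONICS = 20
DURATION = 0.5      # seconds

t = np.arange(int(SR * DURATION)) / SR

def complex_tone(freqs):
    """Sum equal-amplitude sinusoids at the given frequencies, normalized."""
    tone = sum(np.sin(2 * np.pi * f * t) for f in freqs)
    return tone / np.max(np.abs(tone))

# Harmonic: components at exact integer multiples of F0. Scaling F0 up or
# down shifts every component with it, so the sound stays harmonic.
harmonic_freqs = F0 * np.arange(1, N_HARMONICS + 1)

# Inharmonic: each component perturbed by up to +/-30% of F0; the size of
# this jitter sets the degree of inharmonicity.
rng = np.random.default_rng(0)
inharmonic_freqs = harmonic_freqs + rng.uniform(-0.3, 0.3, N_HARMONICS) * F0

harmonic_tone = complex_tone(harmonic_freqs)
inharmonic_tone = complex_tone(inharmonic_freqs)
```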

“In this study, we were interested in embracing the richness of natural audio. Historically, these types of experiments have been done using simple artificial sounds because there was no way of taking speech apart to mess with its individual components,” says Joshua McDermott, an associate professor in MIT’s Department of Brain and Cognitive Sciences and senior author on the paper. “We came up with a method to do that, and we think the results of this study provide some explanation of why speech is voiced and why we don’t walk around whispering.”

Other authors on the paper include Sara Popham, a former research assistant in the McDermott lab who is currently a graduate student at UC Berkeley; Dana Boebinger of BCS and Harvard University; Hideki Kawahara of Wakayama University; and Dan Ellis of Google Research.

McDermott and his research team devised a series of experiments in which they used a modified version of a signal processing method, developed by collaborator Hideki Kawahara, that facilitates the analysis and manipulation of natural speech. The team presented participants with speech samples rendered harmonic or inharmonic by the system, as well as whispered words or phrases. In some cases participants heard one speech sample at a time; in others, a speech sample was superimposed on noise or on another speech sample to simulate the sound of multiple people talking. Wearing headphones in a soundproof booth, participants reported what they heard.
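
The superposition itself can be done in a few lines; the sketch below shows one conventional way to mix a target sample with a background (noise or another talker) at a chosen signal-to-noise ratio. The function name and levels are assumptions for illustration, not the paper’s stimulus-generation code.

```python
# Mix a target signal with a background at a requested SNR (in dB).
# `target` and `background` are 1-D sample arrays at the same sample rate.
# Hypothetical helper, not the study's code.
import numpy as np

def mix_at_snr(target, background, snr_db):
    """Scale `background` so the mixture has the requested SNR, then add."""
    background = background[:len(target)]    # assumes background is at least as long
    p_target = np.mean(target ** 2)
    p_background = np.mean(background ** 2)
    # Choose gain so that 10*log10(p_target / (gain**2 * p_background)) == snr_db.
    gain = np.sqrt(p_target / (p_background * 10 ** (snr_db / 10)))
    return target + gain * background

# e.g., mixture = mix_at_snr(speech_a, speech_b, snr_db=0.0)  # equal-power mix
```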

Kawahara’s method enabled McDermott’s team to vary the degree of inharmonicity of their speech samples, allowing them to measure the effect of the mistuned frequencies on listeners. “When you listen to inharmonic speech, it sounds unnatural and you can tell it was made by a computer, but on its own it is fully intelligible,” says McDermott. “However, if you have to listen to an inharmonic speech sample that is superimposed on another speech sample, it is more difficult to understand.”

The biggest effect, however, occurred when the researchers simulated whispered speech by swapping in noise for the normally discrete frequency components of speech. Unlike either harmonic or inharmonic speech, concurrent samples of whispered speech were nearly impossible to understand. “We found that while you do see negative effects of making speech inharmonic, it’s not nearly as bad as changing it to noise,” says McDermott. “This suggests that when you have discrete frequency components, regardless of their relationship, your brain is able to catch a glimpse of the conversation and put the pieces together to understand what is being said. When you whisper, that is not the case, and everything pretty much falls apart.”
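
A crude stand-in for that whispered condition is noise confined to a speech-like frequency band, so the spectrum contains no discrete components at all. The band edges below are illustrative guesses; the study replaced the discrete frequency components of real speech with noise rather than shaping noise from scratch.

```python
# Whisper-like control stimulus: band-limited noise with no discrete
# frequency components. Illustrative only; band edges are assumptions.
import numpy as np

SR = 16000
DURATION = 0.5
rng = np.random.default_rng(1)

noise = rng.standard_normal(int(SR * DURATION))
spectrum = np.fft.rfft(noise)
freqs = np.fft.rfftfreq(len(noise), d=1 / SR)
spectrum[(freqs < 100) | (freqs > 4000)] = 0.0   # keep a speech-like band
whisper_like = np.fft.irfft(spectrum, n=len(noise))
whisper_like /= np.max(np.abs(whisper_like))     # normalize to [-1, 1]
```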

With this insight, McDermott sees potential applications in better hearing-aid algorithms that improve a device’s ability to tune out background noise.

“You might notice that when you take a parent or grandparent with impaired hearing to a crowded restaurant, they may have such a hard time hearing you that they take their hearing aids out altogether,” says McDermott. “This is a problem the field has been grappling with for a long time. We are interested in understanding how a person with normal hearing solves this problem, so we can build systems that replicate our abilities and then figure out what you could actually do to a sound signal to make it easier to hear in this ‘cocktail party’ situation.”