My experiences creating an in-kitchen voice application

Daniel Bigham

2009-02-26 17:43:02 UTC

The context here is that last year I wrote an application named "Grace" that
runs on a computer in the kitchen and can be interacted with via voice.

I have been using "Grace" for a couple of months now, so it's time to do some
evaluation of the technologies I've used to make this application work.

Here are the biggest challenges, the things that don't work well:

##Did you say something?##

An aspect of SAPI (5.1) that I have found very frustrating of late is how
increasingly often it interprets/recognizes non-vocal noise as if it were
speech. Back in January when I first started using Grace, this was a
significant but manageable issue, but in the last couple of weeks it has made
Grace almost unusable in the noisy kitchen environment. Simply walking across
the room or opening a drawer causes SAPI to recognize the command "Grace".
Put a glass on the kitchen counter and it will recognize the sentence "Open
my inbox". This is where I draw the line: Behavior like that is ridiculous,
especially the latter example. A couple of days ago I was making some bread
and chatting with Meredith, and apparently it heard the sentence "Open my
inbox" about 5 times.

I expect that one of the culprits here is that SAPI tries to "learn" over
time, adjusting its internal probabilities so that words or phrases that it
has heard more often are more likely to be recognized. The obvious problem
with this approach is that once you have used a command or phrase a few dozen
times, it becomes weighted so heavily that more and more often background
noise will match the word or phrase, to the point that you start seeing
behavior like I have described above. I believe there is a way to disable
this adaptation, which I will likely have to do, but there is a downside to
doing this, because I expect that for the most part, this adaptation has a
positive effect on recognition rates.
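
To make the suspicion above concrete, here is a toy model in Python. It is
not SAPI's actual algorithm, which I don't have visibility into: the scoring
formula, the threshold, and every number below are assumptions chosen only to
show how a usage-based prior can push a fixed noise match over the acceptance
line.

```python
import math


def phrase_score(acoustic_log_likelihood, times_heard, total_heard, vocabulary_size):
    """Toy score: acoustic match plus a log prior that grows with usage.

    The prior uses add-one smoothing so unheard phrases keep a non-zero weight.
    """
    prior = (times_heard + 1) / (total_heard + vocabulary_size)
    return acoustic_log_likelihood + math.log(prior)


def is_accepted(score, threshold=-6.0):
    """Accept a recognition when its score reaches the (made-up) threshold."""
    return score >= threshold


# Assumed acoustic score for a glass set down on the counter. It never changes;
# only the prior for the phrase does.
NOISE_MATCH = -4.0

# Fresh profile: the phrase has never been heard, 50 phrases in the grammar.
fresh = phrase_score(NOISE_MATCH, times_heard=0, total_heard=0, vocabulary_size=50)

# Well-used profile: the phrase accounts for 60 of 200 recognitions so far.
adapted = phrase_score(NOISE_MATCH, times_heard=60, total_heard=200, vocabulary_size=50)

print(is_accepted(fresh), is_accepted(adapted))
```

In this model the same noise is rejected on a fresh profile and accepted once
the phrase has been used heavily, which is the pattern I've been seeing.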

Overall, this is a commentary on where voice recognition technology is at
for use in environments that aren't perfectly quiet. If I were to assign a
grade on how well SAPI protects itself from recognizing noise as speech, it
would have to be an "F". More research needs to be done in this area.

##Keeping the monitor off##

Grace is primarily a voice interface: You speak a command or query, and it
speaks back the answer. To make this work, the computer needs to be running,
but there is no need for the monitor to be on unless and until information
needs to be displayed to the user. Indeed, in today's world where the
environment and energy conservation are important issues, it would be very
wasteful to have a computer monitor on all day when it's not needed.

There are Windows APIs that a program can use to put an LCD monitor into and
then later out of sleep mode, and at first glance, this seems to solve the
problem: The software can keep the monitor off until information needs to be
displayed, at which point, it can turn the monitor on. LCDs can come out of
standby mode within a second or two -- perfect, right?

Unfortunately, SAPI contains a "feature" whereby audio input automatically
takes the monitor out of standby mode. The reasoning is that if a computer is
employing a voice interface, audio input is the equivalent of a mouse
movement or keyboard key press. Thus, if you're in the kitchen and open a
cupboard or even shift in your chair, the monitor turns back on.

The only workaround that I've come up with is to run a loop that tells the
monitor to go to standby mode 20 times a second, so that when SAPI goes to
bring the monitor out of standby mode, the software immediately overrides it.
I worry though that this may be causing additional stress on the hardware.
And even with this workaround in place, the software needs to make sure that
a black window is completely obscuring the screen, otherwise when you move
around in the kitchen the monitor is constantly flickering as it comes out of
and then back into standby mode, displaying the Windows desktop for a
fraction of a second each time. Gross.
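
A minimal sketch of that loop in Python, for illustration only. It assumes
the power request is made by sending WM_SYSCOMMAND with SC_MONITORPOWER,
which is one commonly used Win32 route and not necessarily the call Grace
uses. The function names and the `should_stop` callback are hypothetical.

```python
import ctypes
import sys
import time

# Win32 constants for the monitor power request.
HWND_BROADCAST = 0xFFFF
WM_SYSCOMMAND = 0x0112
SC_MONITORPOWER = 0xF170
MONITOR_OFF = 2  # lParam values: -1 = on, 1 = low power, 2 = off


def request_monitor_off():
    """Ask Windows to power the monitor down (does nothing on other platforms)."""
    if sys.platform == "win32":
        ctypes.windll.user32.SendMessageW(
            HWND_BROADCAST, WM_SYSCOMMAND, SC_MONITORPOWER, MONITOR_OFF
        )


def hold_monitor_off(should_stop, request=request_monitor_off,
                     rate_hz=20, sleep=time.sleep):
    """Re-issue the power-down request rate_hz times a second.

    should_stop is a hypothetical callback that returns True once the
    application wants the screen back. Returns the number of requests sent.
    """
    sent = 0
    while not should_stop():
        request()
        sent += 1
        sleep(1.0 / rate_hz)
    return sent
```

The `request` and `sleep` parameters exist only so the loop can be exercised
without touching real hardware. The full-screen black window still has to be
handled separately.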

Microsoft: The conclusion here is that for SAPI to be used in an always-on
environment where electricity needs to be conserved by keeping a monitor in
standby mode, this behavior needs to be configurable. Until that time, ugly,
ugly hacks are required.

##When to listen##

Another challenge is for the software to know when to listen and when not to
listen. For example, if you are playing some music in the kitchen, you
obviously don't want SAPI listening. Fortunately, iTunes offers a COM
interface that allows the software to know when music starts and stops, so
recognition can be enabled or disabled.
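
The logic amounts to a small gate driven by play/stop notifications. The
Python sketch below shows only that gate; hooking it up to the iTunes COM
events and to the recognizer is left out, and `RecognitionGate`, `on_play`,
`on_stop`, and `set_listening` are hypothetical names, not part of any API.

```python
class RecognitionGate:
    """Tracks whether the recognizer should be listening."""

    def __init__(self, set_listening):
        # set_listening is a hypothetical callback that enables (True) or
        # disables (False) the speech recognizer.
        self._set_listening = set_listening
        self._playing = False
        set_listening(True)  # start out listening

    def on_play(self):
        """Call when the media player reports that playback has started."""
        if not self._playing:
            self._playing = True
            self._set_listening(False)

    def on_stop(self):
        """Call when the media player reports that playback has stopped."""
        if self._playing:
            self._playing = False
            self._set_listening(True)


# Usage: record each state change instead of driving a real recognizer.
states = []
gate = RecognitionGate(states.append)
gate.on_play()
gate.on_play()  # a repeated notification does not toggle the recognizer again
gate.on_stop()
print(states)
```

Ignoring repeated notifications keeps the recognizer from being toggled more
often than necessary if the player reports the same state twice.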

Unfortunately, I'm not currently aware of any integrations for Windows Media
Player, so there doesn't seem to be any way of being smart about
stopping/starting recognition while listening to a radio station through
Media Player. Perhaps there is a more direct way to accomplish this via
DirectShow, etc.

...

Ok, so those are the challenges, the things that don't work very well. Here
are the things that work pretty well, but have room for improvement:

...

##Recognition accuracy##

While far from perfect, I'm relatively happy with recognition accuracy, that
is, when you are actually speaking to the software. Grace uses a fairly
complex command and control grammar that allows for natural language commands
and queries, and accuracy isn't bad. I'm sure this is an area of research
that will improve over time, but I can live with where things are at.

One area that hasn't worked that well is numbers. For example, the
recognizer seems to have a lot of difficulty distinguishing between words
like "seventy" and "seventeen".

Occasionally it will recognize completely bizarre statements that are
nothing even close to what I said, but this doesn't happen too often.
Interestingly, accuracy seems to be improved when commands and queries are
longer rather than shorter. For instance, playing a song by saying "play the song
Chariots of Fire" will result in fewer mis-recognitions than if the grammar
allowed for "play Chariots of Fire". This is a nice attribute to have for a
system that prefers commands and queries be spoken in natural language, but
sometimes it does make more sense for a command to be short and concise, and
it's frustrating when that translates to more mis-recognitions.

##iTunes##

It has turned out that iTunes has been an important component of a kitchen
computer: Music playback, yes, but more importantly video podcasts. I can
watch the nightly news by saying "Play the ABC news podcast", likewise the
NASA podcast, and TED podcast.

Having a COM interface has made interfacing with iTunes possible. Without a
COM interface, there would have been some serious problems such as knowing
when to listen and when not to listen. And as it turns out, many podcasts
seem to have a relatively low volume, so the software can also adjust the
system volume to an appropriate level when a podcast is being viewed, and
then restore it to the default level when it stops being played.
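
The raise-then-restore step can be sketched in Python as a context manager.
This is an illustration under assumptions: `get_volume` and `set_volume`
stand in for whatever system volume calls are available and are hypothetical,
as is the level of 85.

```python
from contextlib import contextmanager


@contextmanager
def temporary_volume(get_volume, set_volume, level):
    """Set the volume to level, then restore the previous value on exit."""
    previous = get_volume()
    set_volume(level)
    try:
        yield previous
    finally:
        # Restore even if playback handling raises an error.
        set_volume(previous)


# Usage with a stand-in for the real system volume.
system = {"volume": 40}


def read_volume():
    return system["volume"]


def write_volume(value):
    system["volume"] = value


with temporary_volume(read_volume, write_volume, 85):
    during = system["volume"]  # the podcast would play here
after = system["volume"]
```

In an event-driven program the same two steps would simply run from the
podcast start and stop notifications instead of a `with` block.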

While iTunes has been a very important piece, there are frustrations: For
instance, if the Windows tray opens an information balloon, video playback
drops to about 0.2 frames per second, and you have to get up and fight with
the computer trying to close the darn thing before you can continue watching
your program. It also seems impossible to make the video full screen via the
COM API, which is unfortunate.

...

And finally, things that have worked very well:

...

##iMac##

The iMac hardware is really ideal for a kitchen installation. It's very
quiet, pretty, and compact, all of which are very important. And of course,
it now runs Windows.

What I'm most impressed by is how quiet it is: Probably an order of
magnitude quieter than many desktop computers I've owned, and ends up being
virtually silent in the kitchen environment. This can easily be a show stopper
for a kitchen installation since a noisy fan is extremely tiring to listen
to, and many people, myself included, wouldn't have patience for it.

I also love how the iMac looks: The screen is a beautiful glossy black when
it's off, which looks great in the corner of the kitchen, and the anodized
aluminum looks similarly nice. I wouldn't want an ugly computer in the corner
of my kitchen, so this is an important attribute for it to have.

The compactness: I couldn't be more pleased with how compact it is. It saved
me drilling a hole in my kitchen counter which would have been required if I
had used a desktop + LCD monitor. Even the keyboard is understated. Perfect.

And finally, the Apple remote! What a wonderful gadget, and this turns out
to be quite important because there's no way to pause audio or video, skip
tracks, or adjust the volume using a voice interface because SAPI isn't going
to be able to hear you over the audio that the computer is playing.

My one gripe has been that the wireless adapter appears to have gone flaky
and then died on me -- and what's with Apple mice? I replaced the standard
mouse with a wireless Microsoft mouse.

Anyway, the iMac has been a very important component of this project and has
worked remarkably well. It was Meredith's idea too, so good thinking Meredith!

##VoiceTracker Array Microphone##

I'm very happy with this purchase: It's an array microphone that even works
from 12 feet away, albeit with moderate performance at times from that
distance.

A project like this is only really possible with a high quality array
microphone. I experimented with Bluetooth headsets, but:

1. Who wants to wear one around the house? Not me.

2. Recognition accuracy sucked.

Another alternative would have been to use a high quality wireless
microphone, but the whole idea here is for the system to be hands free,
because when you're in the kitchen, you're often busy doing things, or have
wet or grimy hands and don't want to have to stop what you're doing to handle a
device.

So bravo to the VoiceTracker team!

My one beef here is that the USB adapter they send you has been gimped so
that it only produces 1/10th the volume that it would by default. This makes
recognition from 12 feet lousy. I would normally just bypass this and plug
the microphone directly into the iMac, but as I discovered, the iMac doesn't
have a microphone input. How's that for frustrating! I ended up purchasing
something called an 'iBooster' to get around this, but I'm unclear as to how
well this is working. I wonder whether it is causing clipping when I'm
actually close to the computer, and I'm a bit confused because the line input
volume seems to jump around: Does Windows automatically adjust line input
volume when it's used for speech recognition (SR)? I'll need to do some more
playing around with this.

...

I'm still in the process of creating a website to share my experiences:

www.platoai.com

Daniel Bigham

2009-05-02 12:31:01 UTC

I thought I'd post some additional information on the primary trouble I had
with this project, which was the computer thinking that I had said something
when it was just background noise.

I found that re-training my SAPI profile and turning off background
adaptation made a huge difference. I wrote up my findings here:

http://www.platoai.com/background_adaptation.htm
