Announcing CQCVoice - CQC's All Local Voice Control System
#1
Once again I'm starting a new thread on this, partly because yet again a lot of water has passed under the bridge and I don't want newly interested folks to have to wade through it, and partly because of a name change before it goes public. It turns out that there is a trademark for Jarvis in the context of digital assistants, held by, of all people, Marvel. It is apparently for mobile devices, which doesn't apply to our stuff. But, just to avoid any potential issues, the name will become CQCVoice as of 5.0.914 and forward. That's generic enough that it shouldn't cause any issues.

---------

The upcoming 5.1 release will contain a new feature, called CQCVoice, which is an all-local voice control mechanism. All-local means it is not cloud based like the Echo; it works entirely within your system. Both the Echo and CQCVoice have their own pros and cons, and I will try to lay them out logically here so that folks can make a reasonable comparison. But first a little background...

We've supported the Echo for some time, and it works well with CQC now that it's been sorted out over time. But it is cloud based, which means that it is subject to disruption, and it also has security implications because it requires a connection back into your home network from the Amazon servers. Some folks are more concerned with one of those than the other, and some are very concerned about both. It also requires considerable effort to set up.

So we wanted to have a local, easier option. This involved quite a dive into a new world of speech recognition, audio data processing, command and control grammars (more about that below), and microphone hardware. The first few rounds were a bit disappointing, but good old desperation and inability to spell defeet has finally allowed us to make a lot of progress and it's getting pretty sorted out now as well. It's one of those systems that could be indefinitely expanded, but it's getting pretty close to something that is ready for an inaugural release.

I don't want to duplicate the help information, but here is a little semi-technical background...

Types of Voice Recognition

There are roughly two types of voice recognition. There is dictation and there is command and control. The former almost always requires training and only works for people who have trained for it, and the program has to know who is talking. And you typically have to speak sort of slowly with a solid division between words, because the program has to recognize any possible word you might speak, a freakishly difficult task. Command and control style recognition is based on a limited 'grammar', which defines what phrases can be recognized.

Both the Echo and CQCVoice are of the latter type, though the Echo combines a little of both. With the Echo we can define a phrase like:

Code:
Set the {whatever} to bongo mode

The Echo can be told that {whatever} can be from 1 to x number of words, and they can be any words. But the rest of the phrase has to be fixed, though variations can be defined. So the Echo is recognizing small numbers of arbitrary words, which is sort of dictation-like. It is able to do this without training by basically having enormous computing resources available to it on its servers, and reams of memory, and probably quite advanced recognition algorithms. Though, even so, it still gets things wrong sometimes.

CQCVoice is a pure command and control grammar, in that all of the phrases have to be predefined. However, that doesn't mean that the grammar cannot be created on the fly and then compiled, and that is what we do. We have the basic grammar, which we load, and then we have rules in that grammar that are placeholders into which we can plug the names of your lights, your playlists, your security zones, and so forth. So it's fixed in one sense, but still somewhat dynamic.
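As a rough illustration of the placeholder idea (this is not the actual CQCVoice grammar format or code), here is a minimal Python sketch. The template syntax, the rule names `state`, `light`, and `playlist`, and the sample values are all made up for the example.

```python
from itertools import product

# Hypothetical base grammar: fixed phrase templates with named placeholder rules.
BASE_GRAMMAR = [
    "turn {state} the {light}",
    "play the {playlist} playlist",
]

def build_phrases(templates, slots):
    """Expand every template with every combination of its placeholder values."""
    phrases = []
    for template in templates:
        # Work out which placeholder rules this template actually uses.
        names = [n for n in slots if "{" + n + "}" in template]
        for combo in product(*(slots[n] for n in names)):
            phrases.append(template.format(**dict(zip(names, combo))))
    return phrases

# Values that would come from the user's own configuration (invented here).
slots = {
    "state": ["on", "off"],
    "light": ["kitchen light", "night light"],
    "playlist": ["jazz"],
}

phrases = build_phrases(BASE_GRAMMAR, slots)
```

After the expansion, `phrases` is a closed list of complete sentences. Nothing outside that list can be matched until the names change and the list is rebuilt.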

Still, once we've done that and rebuilt the grammar, those are the only phrases that are recognized. So, unlike above, in the Echo case, where Amazon will hand us back whatever text it thought was spoken for {whatever}, all CQCVoice can do is tell us the fixed phrase in the grammar that it thought was matched, along with confidence level information to help us make decisions about how trustworthy that match is. 

This is also a security 'feature' in that we cannot hear what you are saying. We can only be told what phrase matched closest to what you are saying. So if you are just talking in the room, we'll just get a string of very low confidence matches that we'll just ignore. 
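To make the confidence idea concrete, here is a minimal Python sketch, again not the real implementation. The `Match` type, the 0.0 to 1.0 confidence scale, and the 0.6 cutoff are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Match:
    phrase: str        # the fixed grammar phrase the recognizer thinks it heard
    confidence: float  # assumed scale: 0.0 (no idea) to 1.0 (certain)

MIN_CONFIDENCE = 0.6  # hypothetical cutoff, not a CQCVoice setting

def accept(match: Match, threshold: float = MIN_CONFIDENCE) -> Optional[str]:
    """Return the matched phrase if it is trustworthy enough, else None."""
    return match.phrase if match.confidence >= threshold else None

# Background chatter tends to produce weak matches, which are simply dropped.
heard = [
    Match("turn on the kitchen light", 0.91),
    Match("play the jazz playlist", 0.22),
]
accepted = [p for p in (accept(m) for m in heard) if p is not None]
```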

Ease of Setup

One huge advantage of CQCVoice is the ease of setup. Since it's all local, there's no setup of any servers or port forwarding or security certificate stuff involved as there is with the Echo. And, since it is based on the Room/System configuration data that you may already be using to auto-generate templates, basically you sort of get it for free. You just have to make some simple configuration changes to the room configuration to associate a CQCVoice host with a room. The CQCVoice program on that machine looks through the rooms for one that has its host name configured, and so it knows which room is its own.
  • There is also new 'spoken name' configuration for things like lights and security zones. You may have given a light the name MBR NiteLt, but you'll never match that when speaking. So, if the actual title is of that non-spoken form, you can now also provide an 'as spoken' name that will be used by CQCVoice.
This also means that many commands don't require you to identify which thing you are talking about. It's implied by the fact that those thermostats, renderers, repositories, weather sources, and such are associated with that CQCVoice's room. This can greatly simplify many operations, something that is very tricky to do with the Echo at best.
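A small Python sketch of the room lookup and the spoken name fallback might look like this. The dictionary layout and the field names (`voice_host`, `title`, `spoken`) are hypothetical and are not the real Room/System configuration format.

```python
# Hypothetical room configuration; field names are invented for illustration.
rooms = [
    {
        "name": "Master Bedroom",
        "voice_host": "MBR-PC",
        "lights": [
            {"title": "MBR NiteLt", "spoken": "night light"},
            {"title": "Lamp", "spoken": ""},
        ],
    },
    {"name": "Kitchen", "voice_host": "", "lights": []},
]

def find_room(rooms, host):
    """Return the room configured for this host name, or None if there is none."""
    host = host.lower()
    for room in rooms:
        voice_host = room.get("voice_host", "")
        if voice_host and voice_host.lower() == host:
            return room
    return None

def spoken_names(room):
    """Use the 'as spoken' name when one is set, otherwise the regular title."""
    return [light["spoken"] or light["title"] for light in room["lights"]]

my_room = find_room(rooms, "mbr-pc")
names = spoken_names(my_room)
```

In a real program the host name would come from the machine itself, for example via `socket.gethostname()`. It is passed in explicitly here so the sketch stays deterministic.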

Of course the down side is that you mostly only have control over those things that are defined by the system/room configuration. We will obviously want to expand this over time, but still it's not as open ended as the Echo is, where you have more control over the operations involved and what is controllable. There is a little open ended'ness currently, in that you can define a set of 'room modes', which are just global actions that you can invoke via voice control.

It also means we cannot have a command like "Zira, remind me in 10 minutes to do X", because we would have to define every possible variation of X. We could, and probably will, allow you to define such a list for yourself, so there are always ways to get around it. But we cannot get the arbitrary words you spoke for X and then later say, "You asked me to remind you to do X." We have to have a defined set of things for X that can be matched. That obviously makes it less spontaneously useful.
  • I'll use Zira as the keyword in any examples here, but this is configurable so you can call it what you want.

Conversations vs. Commands

Another advantage to CQCVoice is that it is more conversational. The Echo is really just a voice driven remote control, which has no idea what you are talking about, it's just matching phrases. CQCVoice does know what you are talking about, so it can ask you for clarifications, allow you to implicitly refer to things you were just talking about, and be more interactive in general. It knows that some pieces of information you say are more important than others, and can be safer by requiring higher confidence matches for those things, and lower for less critical things.
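Here is a rough Python sketch of what requiring different confidence levels for different pieces of information, and asking for clarification, could look like. The importance labels, the thresholds, and the tuple layout are hypothetical and not taken from CQCVoice itself.

```python
# Hypothetical minimum confidence per importance level.
REQUIRED = {"critical": 0.85, "normal": 0.50}

def first_doubtful(parts):
    """Return the text of the first part that misses its threshold, or None.

    parts is a list of (text, confidence, importance) tuples.
    """
    for text, confidence, importance in parts:
        if confidence < REQUIRED[importance]:
            return text
    return None

def respond(parts):
    """Carry out the command, or ask about the part that was not heard clearly."""
    doubtful = first_doubtful(parts)
    if doubtful is None:
        return "ok"
    return f"Did you say '{doubtful}'?"

# The same 0.7 confidence is fine for a light but not for a security mode.
light_cmd = [("turn on", 0.70, "normal"), ("night light", 0.70, "normal")]
arm_cmd = [("arm the system", 0.70, "normal"), ("away mode", 0.70, "critical")]
```

With these made-up numbers, `respond(light_cmd)` goes ahead, while `respond(arm_cmd)` asks about the security mode before doing anything.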

I will admit that what is actually possible is not as nice as what I'd originally hoped for. My original vision had CQC/CQCVoice winning the Presidential Courage award for calling emergency services on behalf of house members and talking the kids through medical procedures via conversation with emergency personnel while simultaneously guiding them to the home and relaying vital statistics.

Sadly this is beyond the means of any reasonable voice recognition system we could afford to use. The problem becomes ambiguity in the grammar. You can only support so many phrases and variations of phrases before you drown in ambiguity and the accuracy of recognition drops substantially. 

So we have cut the grammar down from the original form. But still, it is quite conversational and can guide you to getting the operation done correctly, unlike the Echo which either gets it right or not. We didn't reduce the number of commands, just reduced the 'floweriness' of the available phrases you can say.

It also means that you can invoke multiple commands in a row without having to constantly say the key word. Once you start a conversation, you can continue to just invoke commands in a more natural manner. When you are done, you can say things like "That will be all" or "That's all for now" or "You are dismissed", as you would say to dismiss a real butler. If you don't say anything for 10'ish seconds, it will ask if you want anything else, to give you a chance to continue the conversation. If you say nothing still it will dismiss itself.


Hardware

The Echo obviously is its own hardware. CQCVoice can use any microphone that can be configured to show up as the recording input in Windows. There are limitations of course. For distant recognition, you really need a microphone array type product. These, using fancy signal processing algorithms, allow the device to create a 'virtual directional' microphone. I.e. it can hear where the loudest sound is coming from, and selectively emphasize sound from that area, rejecting 'off axis' noise. This greatly improves recognition accuracy from a distance.
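The textbook technique behind this kind of 'virtual directional' microphone is usually called delay-and-sum beamforming. Whether any particular product works exactly this way is an assumption on my part, so treat the following Python sketch as an illustration of the general idea only. The delays are given in whole samples and the three-mic pulse data is invented.

```python
def delay_and_sum(channels, delays):
    """Shift each channel by its delay (in samples) and average the results.

    Sound arriving from the steered direction lines up and adds coherently,
    while sound from other directions is smeared out and reduced.
    """
    length = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(length)
    ]

# A single pulse that reaches each microphone one sample later than the last.
channels = [
    [0, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0],
]

steered = delay_and_sum(channels, [0, 1, 2])    # aimed at the source
unsteered = delay_and_sum(channels, [0, 0, 0])  # aimed somewhere else
```

When the delays match the direction of the source, the pulse comes through at full strength. With the wrong delays it is spread across samples at a third of the level.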

The Microsoft Kinect 2 is such a product, and one of the less expensive ones, though it also includes various other features that we aren't making use of. It's too bad they don't split out just the microphone array part. There are also various, more expensive to varying degrees, array mics designed mostly for the conference room market. These tend to be larger and have more mics than the Kinect, which means better reception and more off axis rejection. Some of them also have nice signal processing software on board that can do reverb cancellation if you have a more echo'y room.

One such company is Acoustic Magic, which has the Tracker 1 and Tracker 2 products, for non-crazy prices. The 1 version is the one you'd want for this application. The 2 has other features that would not be useful for CQCVoice. Of course we cannot guarantee any particular performance, not having been in a position to try one ourselves, so make sure you can return any such products if they don't perform as expected. This is just one company that we've found that is selling a reasonably priced mic array product, the specs of which would seem to indicate a real advantage over the Kinect.

Others can get quite pricey, though they probably also have substantially increased performance as well, and might have some placement advantages. I.e. some are roundish and designed to sit on a table top, which might be optimal for a living room based setup (assuming it won't be subject to serious abuse in such a setting.) Others are 'linear' arrays that are long and skinny and designed to mount on the wall, which may also keep them out of harm's way better.

For close up work, if you wanted to do voice control while sitting at your desk working on your computer, a small desktop microphone or a headset would likely work just fine. The front of the computer mic might work just fine as well when you are sitting right there in front of it.

Anyway, the moral of the story is, we've now done away with the Kinect requirement, so it's up to you what you want to use. You can try anything you want and see if it works. 

Summary

So, that is it in a nutshell. It's coming along nicely, and folks are now reporting good accuracy, as long as you use the appropriate hardware, and don't drink too much before issuing commands (IVB this means you.) There are recognition files for various English accents as well, which can help.
Dean Roddey
Explorans limites defectum
#2
Do you want us to post here on our CQCVoice experiences, or in the 5.1 Beta thread?
#3
Here would be fine I guess, to keep it in one place, and it won't be beta for much longer anyway.
Dean Roddey
#4
To make CQCVoice a bit easier to get along with, I've made some changes for 5.0.914:

Now, it only enters conversation mode if you start off with one of the 'wakeup' phrases, e.g. "Hello, [keyword]", "Help me, [keyword]" or "Wakeup, [keyword]". If you just start with a command it does that command, then it goes back to the top level loop waiting for another (prefixed) command, or conversation.

This way, folks who prefer a conversation mode can do it that way, and those that just want to do single shot commands can do that. Or, you can do one or the other based on whether you plan to do a number of things or just one.

If you do enter conversation mode, it no longer asks "will there be anything else"? It will just wait about 12 seconds for another command. If it doesn't get one, it will just announce it's going away and go back to waiting for another command or conversation. This should make it less clingy.
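The behavior described above can be sketched as a small state machine. This is only an illustration in Python, not the actual code. The `Listener` class, the returned strings, and the use of an explicit `now` timestamp (in seconds) are assumptions, and Zira stands in for whatever keyword you have configured.

```python
WAKE_PHRASES = ("hello", "help me", "wakeup")
KEYWORD = "zira"   # configurable in the real product; fixed here
TIMEOUT = 12.0     # roughly how long conversation mode waits for a command

class Listener:
    def __init__(self):
        self.in_conversation = False
        self.last_heard = 0.0

    def tick(self, now):
        """Leave conversation mode if nothing has been said for too long."""
        if self.in_conversation and now - self.last_heard > TIMEOUT:
            self.in_conversation = False

    def hear(self, phrase, now):
        """Handle one recognized phrase and report what was done with it."""
        self.tick(now)
        text = phrase.lower().strip()
        if self.in_conversation:
            # Inside a conversation no keyword prefix is needed.
            self.last_heard = now
            return "run: " + text
        if any(text == f"{wake}, {KEYWORD}" for wake in WAKE_PHRASES):
            self.in_conversation = True
            self.last_heard = now
            return "conversation started"
        prefix = KEYWORD + ", "
        if text.startswith(prefix):
            # Single shot command: do it and stay at the top level.
            return "run: " + text[len(prefix):]
        return "ignored"
```

A prefixed command runs once and leaves the listener at the top level, while a wakeup phrase opens a conversation that quietly ends after the timeout.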
Dean Roddey
#5
I'm SO used to conversation mode. I keep trying to talk to Voice and it's not talking back, and I can't figure out why, and it's because it's waiting for the keyword again.
Dean Roddey
#6
Version 5.0.914 was just posted and has the above changes in it. It also has some other improvements to help improve its performance, so give this one a try if you are testing CQCVoice.
Dean Roddey
#7
Anyone had a chance to try 914 to see if the improvements made any difference for them?
Dean Roddey
#8
(04-15-2017, 09:19 AM)Dean Roddey Wrote: Anyone had a chance to try 914 to see if the improvements made any difference for them?

Dean - I figured I would dive in to this. I installed everything and this is how far I got. What should be my logical next steps?
It says to see the logs, but I cannot locate a specific log for the speech tray app. Nothing is reported in the main log.


[Attachment: speech fail.PNG]
Thanks,
Dave Bruner
#9
Did you install the Speech Platform 11 and SR/TTS voices? That would typically be why it would fail. I'll look at why nothing got logged. I may have made that debug mode only or something.
Dean Roddey
#10
So, just as a sanity check, I hacked in some code to write out the incoming audio (at the level that it would be heard by Voice) to a WAV file so that I could play it back and listen to it. Everything sounds good. Well, it's not like it sounds like a voice of god studio recording. It sounds like me talking from a slight distance in my horrible sounding room. But it's not picking up any of its own TTS output at all and there's no distortion or dropped audio or anything like that.

So I think that, from an audio processing point of view, it's doing the right thing. Worst case, I could give someone having some issues an off to the side version of one of the DLLs that contains this code and they could use it to validate what Voice is hearing in their setup.
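For anyone who wants to try the same kind of check on their own audio, here is a minimal Python sketch of dumping samples to a WAV file using the standard `wave` module. It is not the code that went into Voice. The 16 kHz mono 16-bit format, the file name, and the generated test tone standing in for captured audio are all assumptions.

```python
import math
import os
import struct
import tempfile
import wave

def write_wav(path, samples, sample_rate=16000):
    """Write signed 16-bit mono samples to a WAV file for later playback."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)          # mono
        wav.setsampwidth(2)          # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(struct.pack("<%dh" % len(samples), *samples))

# Stand-in for captured audio: one second of a 440 Hz tone.
RATE = 16000
tone = [int(12000 * math.sin(2 * math.pi * 440 * n / RATE)) for n in range(RATE)]

DUMP_PATH = os.path.join(tempfile.gettempdir(), "voice_input_dump.wav")
write_wav(DUMP_PATH, tone, RATE)
```

Playing the resulting file back in any media player is a quick way to hear distortion, dropouts, or TTS bleed in the input.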
Dean Roddey