The Composite Intelligence of Virtual Assistants

As early as 1968, influenced by visionary movies such as Stanley Kubrick’s 2001: A Space Odyssey, with its fictional computer character HAL 9000, we envisioned omniscient, intelligent machines that could easily contain the whole range of human knowledge. Then, in 1987, the Knowledge Navigator video from Apple Computer struck our imaginations and greatly boosted our expectations of what computers would be able to do for us in a few years’ time.

It is now 20 years later, but technology is still very far from fulfilling such hopes regarding virtual assistants:

Voice recognition isn’t immediate.
People adapt to machines’ lexicons rather than the other way around.
Computers cannot easily transcend the scope for which they have been programmed.
Self-learning machines are still experimental.
The generation of high-quality graphics representing emotionally rich, 3-D avatars is far from being real time, cheap, or extensively available.

The technology for fulfilling such high expectations simply does not yet exist. However, while we wait for technology to catch up with our dreams, the gap between our expectations and the reality of what we can now achieve has inspired two important trends in the development of virtual assistants:

simulation of generic conversations
application-specific virtual assistants

From the Generic to the Application Specific

Unfortunately, the first approach, a computer that can simulate a generic conversation with a real person, remains an unrealized dream. However, a series of intelligence-simulation experiments has culminated in the Loebner Prize in Artificial Intelligence, “the first formal instantiation of a Turing Test.” The competition is meant to test a computer’s intelligence and promises a gold medal and a $100,000 prize “for the first computer whose responses were indistinguishable from a human’s.” Every year, Hugh Loebner awards a bronze medal to the computer software that best simulates a generic conversation with a real person. Interestingly, Marvin Minsky, co-founder of MIT’s Artificial Intelligence (AI) Laboratory and author of several texts on AI, is one of the strongest critics of the Loebner Prize. He has even provocatively offered a monetary reward to anyone who can successfully convince Loebner to revoke his prize.

On the other hand, virtual assistant applications are moving toward becoming application specific, with the aim of responding appropriately to specific users’ needs by providing precise and relevant information.

While the Turing Test approach might offer the fastest results in successfully simulating generic human conversations, in reality, these simulations are often nothing more than escamotages, or sleights of hand, and are not able to achieve a real understanding of users’ ideas. Such experiments concentrate on the use of various tricks to make users believe a virtual assistant has understood them, while in reality, it is more about

turning users’ questions back on them—as a psychiatrist would do, for instance
suddenly changing the mood of the virtual assistant to surprise—covering the fact that it simply hadn’t understood the user’s input in the first place
switching the topic of conversation to one the application can handle better—by pretending not to be interested in what a user has said
building strong characters that tend to impose their will—thus getting control of the flow of a conversation to ensure it remains on safe ground

On the other hand, the second approach, application-specific virtual assistants, evolves more slowly and requires considerable investment and the use of diverse software technologies. However, its aim is to build applications that can actually understand users’ input. Circumscribing an application’s context simplifies the understanding of users’ needs, thus reducing the amount of intelligence the application needs to understand and communicate effectively. This has the double effect of reducing technology requirements to a level we can realistically attempt and increasing the level of specialization each virtual assistant can achieve. We can concentrate all our efforts on an application’s main purpose rather than trying to support the broad knowledge that responding to generic conversations would require. Because this approach limits an application’s context far more than the wide-open context of the first approach, it is generally less appealing to users’ imaginations, but it is much more effective in responding to users’ requests.

While the tricks inherent in the first approach are definitely worth considering, with their goal of increasing the level of human-like interaction virtual assistant applications can offer, I believe only the second approach enables us to build applications that really meet users’ needs. Typically, users rather quickly become accustomed to a virtual assistant, but get annoyed by any useless chatter, so the ability to carry on a generic conversation might not be desirable anyway. Also, the goal of the second approach is to build business applications, so in my opinion, this will be the approach that leads us down the path toward intelligent virtual assistants.

The Intelligence of Software

What we mean by intelligent virtual assistants often points directly to our definition of artificial intelligence. Each of us has some perception of what AI is, but if we try to define it unambiguously, we rapidly fall into a series of borderless conceptions. These range from science fiction reminiscences (the ability of the HAL 9000 to perceive the emotions of its interlocutors and understand all of their requests), to pragmatic requirements (machines that can automate tasks requiring complex decisions), to cutting-edge technologies (self-learning machines), to your impressions the first time you ever saw a computer providing an answer to a query of yours. So, as you can see, it is more a matter of people’s perception of what intelligence really is. When an AI application actually succeeds, it is common to hear comments like “It was not that hard after all,” or “It’s not real AI anyway.” According to AI pioneer Raymond Kurzweil, such responses are inherent in the nature of AI itself. As he said in a CNET interview:

“It turns out to be easy to replicate most adult achievements, like playing chess or guiding cruise missiles, et cetera. At least a lot of them we can replicate in machines. Every time a machine does that, it’s the nature of artificial intelligence to say, ‘Well, that wasn’t very difficult after all.’ The hard things are what children do, recognizing parents or playing with a dog.”

You can easily imagine, therefore, why the definition of artificial intelligence has been cause for discussion for decades now.

However, it is not the purpose of this article to discuss what artificial intelligence is or how we can emulate human intelligence. Rather, my intent is to understand how virtual assistants can approach intelligence—strictly from a pragmatic point of view.

Since software is the primary factor in making a machine act intelligently, I prefer to use the term software intelligence instead of artificial intelligence. This keeps people from being misled by the expectations surrounding the latter and lets us concentrate on the practical considerations involved in creating intelligent virtual assistants.

I have grouped the software intelligence we need to build virtual assistants into five levels. Together, they compose what users perceive, within the context of their experience, as a single intelligence (thus, the use of the term composite intelligence in this article’s title). The five levels of software intelligence are:

input intelligence
emotional intelligence
logical intelligence
knowledge intelligence
output intelligence

Input Intelligence

There are numerous input devices that enable a user to input his requests to a virtual assistant—for instance, keyboards, touch screens, microphones, even motion capture systems. Various types of input intelligence—the first level of software intelligence—enable a virtual assistant to immediately interpret such inputs. Some examples include the following:

semantic intelligence—A semantic engine can immediately parse the text a user enters via a keyboard and identify the user’s needs by disambiguating the semantic meaning of the user’s words, thus avoiding the misinterpretation of ambiguous requests.
voice-recognition intelligence—Software can interpret the speed or change of tone in a user’s speech as emotional values that the other levels of intelligence can then use to more properly adapt to the user’s requests. Thus, voice recognition encompasses not only the ability to capture our words as we speak, but the intelligence enabling the software to perceive and interpret changes in speed and tone.
facial-recognition intelligence—Software can also perceive a user smiling and interpret a smile as an emotional value that the other levels of intelligence can use.

Input intelligence is, therefore, able to add a lot of metadata to the software’s understanding of a user’s intended inputs—the kinds of metadata that typically enhance human conversations.

Emotional Intelligence

Input intelligence can understand both a user’s actual communications and the metadata relating to a user’s requests, which have intrinsic emotional value that emotional intelligence can interpret. For example, if a user angrily shook his arms while pronouncing the sentence “Yeah, you helped me alright,” emotional intelligence would correctly disambiguate the sentence. Rather than perceiving the remark to be of a positive and satisfied nature, which the user’s words alone would suggest, it would understand that, in reality, the user is unsatisfied with his experience.

Moreover, by keeping track of a discussion’s evolution, software with emotional intelligence would gain a better understanding of the overall direction in which the dialogue was going—whether satisfying or dissatisfying to a user. Finally, emotional intelligence can also communicate a virtual assistant’s emotional status, which sets the ground for a more human-like type of interaction.

Thus, emotional intelligence can disambiguate mixed input signals, while providing an emotional dimension to the user experience by tracking the emotional status of both the user and the virtual assistant.
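The disambiguation of mixed signals described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `InputSignal` record, the sentiment scores in the range -1 to 1, the 0.7 weighting, and the 0.2 thresholds are all hypothetical values chosen for the example, not part of any real emotion-recognition system.

```python
from dataclasses import dataclass


@dataclass
class InputSignal:
    """Hypothetical record produced by the input level."""
    text: str                  # what the user said or typed
    text_sentiment: float      # assumed score in [-1, 1] from the words alone
    nonverbal_sentiment: float # assumed score in [-1, 1] from gestures or tone


def interpret_emotion(signal: InputSignal, nonverbal_weight: float = 0.7) -> str:
    """Blend verbal and nonverbal scores; nonverbal cues weigh more by assumption."""
    combined = (
        (1 - nonverbal_weight) * signal.text_sentiment
        + nonverbal_weight * signal.nonverbal_sentiment
    )
    if combined > 0.2:
        return "satisfied"
    if combined < -0.2:
        return "dissatisfied"
    return "neutral"


# Positive words, angry gestures: the nonverbal cue decides the reading.
sarcastic = InputSignal("Yeah, you helped me alright", 0.6, -0.9)
print(interpret_emotion(sarcastic))
```

Weighting the nonverbal channel more heavily is just one possible design choice; a real system would need to tune or learn that balance.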

Logical Intelligence

The software that provides logical intelligence evaluates all available information and analyzes a user’s requests. It then continues the flow of the discussion by providing the virtual assistant’s next comment or question, or the answer to the user’s query, if it has already gathered all required information from the user.

Logical intelligence comprises the set of technologies that a virtual assistant uses to perform the necessary troubleshooting operations and manage the flow of discussion, such as rule-based engines, case-based reasoning, and Bayesian or neural networks.
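As a rough illustration of how a rule-based approach might manage the flow of a discussion, the sketch below asks for each missing piece of information in turn and answers only once everything has been gathered. The slot names, the questions, and the `next_move` function are hypothetical and stand in for whatever rules a real application would define.

```python
# Hypothetical information a troubleshooting dialogue must gather,
# paired with the question that asks for it.
REQUIRED_SLOTS = {
    "device": "Which device are you using?",
    "symptom": "What problem are you seeing?",
}


def next_move(gathered: dict) -> str:
    """Return the assistant's next question, or an answer if nothing is missing."""
    for slot, question in REQUIRED_SLOTS.items():
        if slot not in gathered:
            return question
    # All required information is available, so the dialogue can conclude.
    return f"Looking up a fix for '{gathered['symptom']}' on your {gathered['device']}."


print(next_move({}))
print(next_move({"device": "router"}))
print(next_move({"device": "router", "symptom": "no signal"}))
```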

Knowledge Intelligence

While logical intelligence performs the core tasks of troubleshooting and managing dialogue, it often needs to rely upon data to successfully perform such tasks. It is not enough to merely analyze a user’s needs and identify the related solution. It is of utmost importance that the data a virtual assistant gathers from its knowledge bases matches the information the logical intelligence requests.

For example, suppose the logical intelligence has identified that a user wants to know about Polish furniture, that is, furniture manufactured in Poland. If the knowledge base were to reply with information on how to polish furniture, that is, how to clean it, it would have failed to provide an intelligent response to the user’s query. In this case, integrating a semantic search solution with the logical intelligence would have provided the correct information instead.

Knowledge intelligence lets a virtual assistant correctly gather data from knowledge bases. Not every interaction requires consulting knowledge intelligence. For example, a single user input might be just a greeting, which logical intelligence alone could handle very well.
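The furniture example can be made concrete with a toy comparison between a plain keyword search and a lookup keyed by word sense. Everything here is an assumption made for illustration: the two-document knowledge base, the sense labels, and both lookup functions are hypothetical, and a real semantic search solution would be far more elaborate.

```python
from typing import List

# Hypothetical knowledge base in which each document is tagged with a word sense.
KNOWLEDGE_BASE = [
    {"sense": "polish/nationality", "title": "Polish furniture makers"},
    {"sense": "polish/clean", "title": "How to polish furniture"},
]


def keyword_lookup(word: str) -> List[str]:
    """Naive search: matches the surface word and ignores its meaning."""
    return [d["title"] for d in KNOWLEDGE_BASE if word.lower() in d["title"].lower()]


def sense_lookup(sense: str) -> List[str]:
    """Sense-aware search: returns only documents tagged with the requested meaning."""
    return [d["title"] for d in KNOWLEDGE_BASE if d["sense"] == sense]


# The keyword search cannot tell the two meanings apart.
print(keyword_lookup("polish"))
# The sense identified by the logical level selects the right document.
print(sense_lookup("polish/nationality"))
```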

Output Intelligence

Output intelligence determines how a virtual assistant presents its response to a user’s request. It has at its disposal both the information the logical intelligence generated and that which the knowledge intelligence ultimately gathered. This information might be something the virtual assistant should say, but also might relate to documents or videos the user should view. Output intelligence also knows the emotional status of both the user and the virtual assistant.

Therefore, output intelligence integrates every bit of information and metadata that the other levels of software intelligence have computed, in order to build the best possible representation of the information. This includes graphic representations of facial expressions and gestures, voice patterns—which might reflect, for instance, the rush a user is in or the emotional status of the virtual assistant—and all of the interface design issues relating to the optimization of this representation.

Using Software Intelligence

Attentive readers will already have figured out that these five levels of software intelligence operate sequentially. Each successive step enriches the interpretation of a user’s input, until the software intelligence finds the data that matches the information the user has requested and presents it to the user. All necessary levels of intelligence process every user request, to progressively identify what data the user is asking for. This is a process of input targeting, or identifying and disambiguating a user’s requests, combined with a gradual process of output targeting, or computing the emotional and informational response that the software presents to satisfy a user’s request.
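One way to picture this sequential enrichment is as a pipeline in which each level adds metadata to a shared record. The sketch below is a deliberately simplified assumption of how that could look: the `Turn` record, the five level functions, and their toy heuristics (all-caps text read as agitation, the word "hello" read as a greeting) are hypothetical placeholders for real components.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """Hypothetical shared record that every level reads and enriches."""
    raw_text: str
    metadata: dict = field(default_factory=dict)
    trace: list = field(default_factory=list)


def input_level(turn):
    turn.metadata["tokens"] = turn.raw_text.lower().strip("!?. ").split()
    turn.trace.append("input")
    return turn


def emotional_level(turn):
    # Toy heuristic: text typed entirely in capitals is read as agitation.
    turn.metadata["mood"] = "agitated" if turn.raw_text.isupper() else "calm"
    turn.trace.append("emotional")
    return turn


def logical_level(turn):
    is_greeting = "hello" in turn.metadata["tokens"]
    turn.metadata["intent"] = "greeting" if is_greeting else "question"
    turn.trace.append("logical")
    return turn


def knowledge_level(turn):
    # A greeting needs no data; only questions trigger a lookup.
    if turn.metadata["intent"] == "question":
        turn.metadata["facts"] = ["<data fetched from a knowledge base>"]
    turn.trace.append("knowledge")
    return turn


def output_level(turn):
    if turn.metadata["intent"] == "greeting":
        reply = "Hello! How can I help?"
    else:
        reply = "Here is what I found: " + "; ".join(turn.metadata["facts"])
    if turn.metadata["mood"] == "agitated":
        reply = "Sorry for the trouble. " + reply
    turn.metadata["reply"] = reply
    turn.trace.append("output")
    return turn


LEVELS = [input_level, emotional_level, logical_level, knowledge_level, output_level]


def run_pipeline(text: str) -> Turn:
    turn = Turn(text)
    for level in LEVELS:
        turn = level(turn)
    return turn


result = run_pipeline("Hello!")
print(result.trace)
print(result.metadata["reply"])
```

Passing a single record through every level keeps each component simple while letting later levels use whatever earlier levels discovered.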

These five levels of software intelligence can, in my opinion, make the dream of virtual assistants a reality. Collectively, they make up the concept of composite intelligence, which comprises various software components—each gifted with some moderate degree of intelligence. Under the appropriate creative and technological conditions, this composite intelligence could provide a coherent solution that enables a degree of intelligence that might even be superior to the sum of its components. To fulfill the potential of such a solution, the individual components require a means of communication—including a common communications protocol—that can optimize and boost their overall performance.

I hope my attempt to classify the software components that together support the perceived intelligence of virtual assistants helps clarify the diverse professional domains that building such applications touches. Most people can readily understand and accept that software that can compute and rationally respond to a user’s requests is a vital part of the composition of a virtual assistant. However, it rapidly becomes apparent that we can achieve this capability only if we first deeply understand the nature of a user’s needs, which involves

using a combination of input technologies and emotional algorithms
ensuring the appropriate integration of technology and data sources
providing a rich user experience that gives users an overall feeling of confidence in virtual assistants

Ultimately, if we can achieve all of this, users will gain the necessary degree of familiarity with virtual assistants that will, in the end, open their minds to letting virtual assistants meet their needs.

Originally published on UXmatters, October 8th, 2007.