
Location Grounding in Multimodal Local Search

Patrick Ehlen AT&T

201 Mission Street San Francisco, CA 94105

[email protected]

Michael Johnston AT&T Labs Research

180 Park Ave Florham Park, NJ 07932

[email protected]

ABSTRACT

Computational models of dialog context have often focused on unimodal spoken dialog or text, using the language itself as the primary locus of contextual information. But as we move from spoken interaction to situated multimodal interaction on mobile platforms supporting a combination of spoken dialog with graphical interaction, touch-screen input, geolocation, and other non-linguistic contextual factors, we will need more sophisticated models of context that capture the influence of these factors on semantic interpretation and dialog flow. Here we focus on how users establish the location they deem salient from the multimodal context by grounding it through interactions with a map-based query system. While many existing systems rely on geolocation to establish the location context of a query, we hypothesize that this approach often ignores the grounding actions users make, and provide an analysis of log data from one such system that reveals errors that arise from that faulty treatment of grounding. We then explore and evaluate, using live field data from a deployed multimodal search system, several different context classification techniques that attempt to learn the location contexts users make salient by grounding them through their multimodal actions.

Categories and Subject Descriptors

H.5.2 [Information Interfaces and Presentation (e.g. HCI)]: User Interfaces—input devices and strategies (e.g. mouse, touchscreen), natural language
I.2.7 [Artificial Intelligence]: Natural Language Processing

General Terms

Algorithms, Design, Experimentation, Human Factors

Keywords

Multimodal, Speech, Gesture, Dialog, Location-based, Search

1 GROUNDING LOCATIONS IN LOCAL SEARCH

In recent years, the capabilities of mobile devices and the data networks supporting them have advanced to the point where it is possible to offer multimodal local search capabilities to mobile consumers. For example, applications such as Speak4it℠ [18]
allow people to find businesses by using spoken queries, and then browse the results on a graphical interface. Figure 1 shows the result of one such interaction, where a user pressed the “Speak / Draw” button and spoke the query, “Italian restaurants in San Francisco, California.” The query results are plotted on a dynamic, pan-able and zoom-able map display. The user can then browse business details, and call or get directions to one. An important feature of this system, and applications with similar local search functionality, such as Google Mobile [21] and Vlingo [22], is their ability to use GPS, cell tower triangulation, or WiFi IP to determine the approximate location of the device in order to constrain the results returned so they are relevant to the user’s presumed local context. When a user says “gas stations,” the system will return a map showing gas stations in the immediate vicinity of the location of the device. This strategy allows users to conduct searches even when they do not know the name or pronunciation of the town they are in, and, like other kinds of multimodal input, is likely to reduce the complexity of their queries, simplifying recognition and understanding (cf. [16]).
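
In effect, this default strategy reduces to ranking candidate businesses by their distance from the device. A minimal sketch of that ranking step, assuming a hypothetical Business record and a haversine distance helper (neither is part of Speak4it's published interfaces):

```java
import java.util.Comparator;
import java.util.List;

// Hypothetical record type for illustration; the name and fields are assumptions.
record Business(String name, double lat, double lon) {}

class ProximityRanker {
    // Great-circle distance in kilometers (haversine formula).
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 6371.0 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    // Sort matches for a topic-only query ("gas stations") by distance to the device location.
    static List<Business> rankByDeviceLocation(List<Business> matches, double deviceLat, double deviceLon) {
        return matches.stream()
                .sorted(Comparator.comparingDouble(
                        (Business b) -> distanceKm(b.lat(), b.lon(), deviceLat, deviceLon)))
                .toList();
    }
}
```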

Figure 1. Speak4it℠ results displayed on map

But, as interactive multimodal dialog capabilities are added, and a broader set of use cases is considered, we hypothesize that the ‘brute force’ approach of assuming that the most salient location for the user is always the current physical location of the device may not be sufficient. If the device has a touchscreen with a map, the salient location may be a location the user has explicitly touched. If the map is pan-able, through touch or graphical interface controls, the most salient location may be the last location panned to. Or if the user is able to refer to locations by voice, as in “Show the Empire State Building” or “Chinese restaurants on the Upper West Side,” then the relevant location referent may have been introduced as part of that spoken dialog.


In short, by interacting with the system the user may have established a series of actions aimed at grounding some location [3,4,5]. Thus the user would now view that grounded location as salient, and as the location reference to be inferred when the location is otherwise left ambiguous or unspecified.

To make this grounding problem concrete, consider the following scenario: a user is interacting with a GPS-enabled mobile device in Manhattan, and is currently located in the Lower East Side but browsing to find a Thai restaurant up near Central Park. The user says “Show Central Park” and then scrolls and zooms the map to view a four-block square area on the Upper West Side next to Central Park. If the user then says, “Thai restaurants,” most people would agree that this user seeks information about Thai restaurants in the four-block zone of the Upper West Side now displayed on the device, since the user’s speech and actions have laid down a trail of contextual traces that lead to the Upper West Side as the grounded location, for at least the duration of that interaction. But a system that solely uses the device GPS to establish the location for a query would fail that simple test of human understanding, and would instead display restaurants in the user’s immediate vicinity of the Lower East Side—probably undoing the user’s map navigation actions in the process, and losing the established context of interaction.

So we face a simple yet vital question: When people have access to multimodal devices in a mobile context, how can we establish the grounded location—that is, the location a person believes has been established as mutually salient with the search system when issuing some request—from among the many possible locations that could also be relevant?

To get a handle on this question, and also to get a better sense of the scope of the problem, we first conducted an empirical investigation of the existing Speak4it logs. Since the queries handled by Speak4it only cover descriptions of categories or names of businesses, the queries it receives tend to be short and not grammatically complex. So when someone makes an effort to speak the name of a location in a query, it is safe to assume that the uttered location is salient to that person for that query. But for the majority of cases, we found that people do not explicitly state a location, revealing a need for some mechanism to determine the intended location. Our logs of Speak4it queries reveal people making references to specific locations (e.g., “Police Department in Jessup, Maryland” or “Office Depot Linden Boulevard”) only 18% of the time. The remaining majority of utterances are “unlocated” topic phrase queries, such as “gas station” or “Saigon restaurant,” in which users leave resolution of their perceived salient location in the hands of some unseen intelligence. This may be because they expect the GPS to provide that information, or they assume the map already provides a grounded context, or they believe they must utter queries in a hurry—or due to many other reasons.

In any case, the question arises: When people do bother to specify the name of a location, why do they do it? How should a grounded location be handled, in terms of prioritizing one location context over another? And how long does a grounded location remain salient, and at what rate does its salience decay?
One way to get a glimpse of how location grounding works is to look for cases where it breaks down—that is, cases where people “repair” their queries by reformulating them with locations added or omitted. For example, we frequently found individual user query patterns like this:

Serendipity

…followed shortly thereafter by:

Serendipity Dallas Texas

Here a query has evidently failed because the user assumed a grounded location that was not reflected in the search results they received. In our log analysis, 8% of queries with references to locations arise from query restatements of this type. To put it in other terms, nearly a tenth of the cases where people speak locations can be attributed to a faulty initial assumption about the location the user thought had been grounded. While it is always difficult to speculate about what transpires in a user’s mind, a more compelling example shows a user introducing a location in one query, and later appearing to assume that the earlier location context is still grounded:

Starbucks Cape Girardeau

…followed six minutes later by:

Lowes

…and then right away:

Lowes Cape Girardeau

Query reformulations such as these examples from the Speak4it logs—where people reformulate a query by adding a location that was missing from a previous query—prompted us to believe that location grounding problems deserve further investigation, and that more advanced multimodal interfaces will demand an approach to modeling and handling these contextual factors intelligently, one that better meets users’ expectations.

In order to address the limitations of Speak4it and other contemporary local search systems, and to further investigate the challenge of location grounding, we developed and fielded a new prototype that supports true multimodal commands in which the user can combine speech and touch, for example by saying, “Chinese restaurants near here” while touching or circling a location on the map. We also deployed an initial location grounding algorithm that provides a more flexible determination of the grounded location than using only the physical device location. We then instrumented the prototype and its supporting architecture with a multimodal disambiguation mechanism to capture the “ground truth” of users’ intentions about the locations they believed had been grounded when they issued their queries.

Section 2 describes the capabilities of the extended Speak4it prototype and outlines its underlying architecture. Section 3 describes a range of different strategies for modeling location grounding and discusses related work on multimodal reference resolution. Section 4 describes the multimodal approach used for capturing the ground truth of each user’s intended grounded location, and presents an empirical evaluation of the location grounding models outlined in Section 3.

2 SPEAK4IT APPLICATION

Though it may have gone by other names, ‘local search’ has long been used as a test bed application for investigation of multimodal interface techniques and processing models for multimodal language and dialog. The MATCH system [2,13] enabled users to interact using speech and drawing to search and browse for restaurants and other businesses in major U.S. cities. Local search and navigation was one of the key domains for the SmartKom
Mobile prototype [18]. The AdApt system provided a multimodal interface for searching for apartments in Stockholm [10], and users interacting with dynamic maps for a real estate search task showed significant advantages in terms of task performance and preference for multimodal compared to unimodal interfaces in a Wizard-of-Oz simulation conducted by Oviatt [15]. QuickSet [6] addresses a related dynamic map task of laying down the locations of map features and entities. Thanks to a number of technical advances—including the capabilities of contemporary mobile computing devices such as smartphones and tablets, the availability of high speed mobile networks, and access to speech and language processing capabilities “in the cloud” [7]—it is now possible to put true multimodal interaction into the hands of customers for use in their daily lives. To the best of our knowledge, the application described here is the first commercially deployed mobile local search application to support true multimodal interaction, where users can issue commands that integrate simultaneous inputs from speech and gesture. Versions of Speak4it for both the iPhone and iPad are available for free download [20]. Speak4it is a consumer-oriented mobile search application that leverages multimodal input and output to allow users to search for local business information. On launching the application, users see a map of their local area (Figure 2, Left), and can either click a “Speak / Draw” button or lift the device to the ear to initiate a speech-based query.

Figure 2. Speak4it interaction

Figure 3. Radar view

Users can search for businesses based on the name of the business, a category, or associated keywords. For example, in Figure 2 (Left), the user presses the ‘Speak / Draw’ button and says “coffee shops.” The system zooms in and displays the closest coffee shops to the user’s current location as selectable pins on the
map (Figure 2, Middle). The results can also be viewed as a spin-able text list (Figure 2, Right) or in a ‘radar’ view (Figure 3). In the radar view, depending on the orientation of the device, results appear either as green markers on a compass view (Figure 3, Left) or superimposed over the camera viewfinder, providing an augmented reality effect where tags describing businesses float in line with the user’s perspective (Figure 3, Right). In all of the views, users can tap on businesses to access more detailed information and initiate actions such as making a call, getting directions, or sending business details by email or text. Search over all U.S. business names is supported, including both chains and local businesses. For example, users can say “Starbucks” or “Ninth Street Espresso.” Users can explicitly state a location in a query, for instance, by saying, “real estate attorneys in Detroit, Michigan,” or “Italian restaurants near the empire state building.” Queries can also be entered as text by tapping on the text window at the top of the display. Users can manipulate the map using drag (pan) and pinch (zoom) touch gestures, or by using spoken commands. For example, if the user says “San Francisco,” the map will pan and zoom to show San Francisco.

Figure 4. Speak4it gesture inputs

In addition to these unimodal spoken commands for business search and map manipulation, the system also supports multimodal commands where the location is specified directly by drawing on the map (cf. [2,6,13,18]). The system supports point, area, and line gestures (Figure 4). For example, the user can say, “French restaurants here” in conjunction with a point or area gesture. For the point gesture, the system returns French restaurants closest to the point. For an area gesture, results within the area are returned. For a line gesture, businesses closest to the line are returned. For example, the user might say “gas stations” and trace a route on the map (Figure 4, Right). Speak4it will show the gas stations along that route.

Figure 5 lays out the underlying architecture supporting Speak4it. The user interacts with a client application on the iPhone or iPad which communicates over HTTP with a multimodal search platform that performs speech and gesture recognition, query parsing, geocoding, search, and grounded location modeling. The client application captures and encodes the user’s speech and touch input. One of the design challenges in adding multimodality to this application was working out how to capture ink input on a map. The established paradigm for map manipulation on the iPhone and similar devices is to use touch gestures to perform pan and zoom operations. In fact, touch gestures on the map cannot be accessed in the way they are for other UI elements. Thus, for multimodal inputs, we needed to establish (a) a method to capture and display the ink touch gestures when users make them; (b) a paradigm where users could either manipulate the map in the
traditional way or draw on the map using the newer ink gestures; and (c) a method of determining when the user wants to use touch gestures for referring to areas of the display rather than to directly manipulate the display. Our solution was to use the “Push to Speak” button as essentially a “Push to Interact” button (now “Speak / Draw”) that also works to override the default map-touch paradigm. When the user presses the button, touch gestures on the map are interpreted as referential pointing and drawing actions rather than direct map manipulation. After the user makes a gesture, clicks stop, or stops speaking, the map ceases to work as a drawing canvas and returns to the direct map manipulation mode. The user’s ink gesture points are streamed over HTTP along with their encoded speech input to the multimodal platform. This stream also contains additional information used for location grounding and search, such as current GPS coordinates of the device, and a history of recent movements of the map display and utterances spoken. This multimodal data stream is received and decoded by a Multimodal Interaction Manager component. The user’s ink trace is passed to a gesture recognition component, which uses a classifier based on Rubine’s template-based gesture recognizer [17] to classify the input as point, line, or area. The audio stream is forwarded to a speech platform which first performs speech recognition using a statistical language model trained on previous query data. From here, the speech recognition output is passed to a natural language parser (NLU) that parses the query into a topic phrase that designates the user’s desired search subject (e.g., “pizza”) and, if applicable, a location phrase (e.g., “San Francisco”) that designates a desired location [9]. In cases where there is an explicit location phrase—like “pizza restaurants in San Francisco”—the location phrase is geo-coded so search results from the topic phrase may be sorted and displayed according to their proximity to that location. If the location is not stated explicitly in the query, the Interaction Manager passes a series of features that pertain to possible salient locations in the current
interaction to the location grounding component, which uses those features as input to attempt to determine the current grounded location. The Multimodal Interaction Manager component is also responsible for making decisions regarding explicit disambiguation requests made to the user, as when, for example, there are several possible locations named “Springfield” and the user has not specified which was intended. These options for disambiguation are passed to the client and overlaid on the map for the user to select one. This mechanism also underlies the disambiguation method used in the experiment we will soon describe in more detail in Section 4. Most of the information passed throughout this architecture is written to logs that can be analyzed for behavioral trends. In addition, audio recordings of each user query are manually transcribed for evaluation.
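
As a concrete illustration of the multimodal data stream just described, the sketch below shows what a per-query payload might look like, along with the topic/location split produced by the NLU. The type and field names are assumptions for illustration, not Speak4it's actual wire format; the fields simply follow the items listed above (encoded speech, ink trace, device coordinates, and recent map-movement and utterance history).

```java
import java.util.List;

// Hypothetical payload types for illustration; names and fields are assumptions,
// not Speak4it's actual wire format.
record GeoPoint(double lat, double lon) {}

record InkPoint(double x, double y, long timestampMs) {}

// type is "pan" or "zoom"; used as history for location grounding.
record MapMovement(String type, GeoPoint newCenter, double zoomLevel, long timestampMs) {}

record MultimodalQuery(
        byte[] encodedAudio,              // user's speech input
        List<InkPoint> inkTrace,          // touch gesture points, if any
        GeoPoint deviceLocation,          // current GPS / cell tower / WiFi fix
        List<MapMovement> recentMapMoves, // history of recent map display movements
        List<String> recentUtterances,    // recently spoken queries
        GeoPoint mapCenter,               // current map view center
        double mapZoomLevel) {}

// Produced server-side by the NLU parser: topic phrase plus optional location phrase.
record ParsedQuery(String topicPhrase, String locationPhrase) {} // locationPhrase may be null
```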

3 MULTIMODAL CONTEXT MODELS FOR LOCATION GROUNDING

We explore and compare three different techniques for modeling location grounding: Fixed Local Salience, Threshold-Rank, and Grounded Location Classification. All assume a mobile multimodal system in which users interact with maps or other complex geographic displays using four main sources of potentially salient location referents, listed and exemplified in Table 1.

Fixed Local Salience. The simplest approach—which is essentially the strategy taken by every deployed local search system we know of, including the initial deployment of Speak4it—is to always use the device location for queries in which the location is not explicitly stated. We use this as a baseline model in our experiments. To a large extent, this approach is supported by our existing log data, where 82% of queries show no explicit location specified in the spoken query and query reformulations are relatively infrequent (8%). Keep in mind, however, that the patterns found in our data reflect the current capabilities of a fielded system which always uses the device location as the ‘default’ location—and this design in itself will influence user behavior. Another consideration is that the default behavior of the system is to initially display a map of the location determined from GPS, which is likely to be interpreted by the user as a location grounding action.

Figure 5. Multimodal architecture

Table 1. Location types

Touch    User touches or circles a location on the map.
Spoken   User spoke the name of a location in a previous query: e.g., “Portuguese restaurants in Newark New Jersey” or “Show downtown Kansas City.”
Map      Location shown on the map display.
GPS      The current geospatial location of the device, determined using GPS sensor, cell tower triangulation, or WiFi.

Threshold-Rank. The second approach we developed allows for four kinds of grounded locations, and utilizes a fixed temporal threshold for each, combined with an overall rank preference order on location reference modes (Figure 6). We will refer to this as the Threshold-Rank model.

Touch > Map > Spoken > GPS

Figure 6. Location preference order

In this model, the system maintains a multimodal context model that stores the last location referent specified by the user for each mode of location reference (Touch, Spoken, Map, and GPS). Each mode is associated with a fixed temporal threshold, after which the location is removed from the model. When a grounded location referent is needed, the system checks the current state of the multimodal context model. If more than one location is available, a location is chosen based on the mode preference order listed in Figure 6. The temporal thresholds ideally would be set, or better still continually adapted, according to usage data. Intuitively, thresholds for touch should be very short, while spoken location references are more persistent, and the map view location referent which results from scrolling and panning actions would lie somewhere between the two: not as ephemeral as touch, but not as persistent as speech. Our initial model for Speak4it was a simplified variant of Threshold-Rank. If the user gestured within the current turn, the location of the gesture is assumed to be the most salient. Based on development data from an initial pilot we found a threshold of 6 seconds for Map location referents to work well. In the fielded system, the last spoken location was not available, so that was not an option. A related approach was adopted by Kehler et al [14], who describe a Wizard-of-Oz data collection with a travel guide system. Their study examines the more general problem of reference resolution in multimodal interfaces, rather than location reference in particular. Their work also pre-dates the current ubiquity of GPS, and the widespread deployment of touchscreens for easy manipulation of map displays on mobile devices. For their application, they found that a set of decision rules similar in spirit to Figure 6 accounted for all of their data. In cases where type constraints imposed by the content of the referring expression leave more than one possible referent, their model gives preference to a simultaneous gesture, then to the currently
selected object, before other referents. Similarly, in our suggested model, recent touch gestures are primary, followed by what is selected (in our case the current map view), and then referents introduced in the spoken dialog. Our suggested approach differs in its introduction of explicit temporal thresholds for each reference mode. Another related approach which does incorporate temporal thresholds is found in Huls et al [12], who expand on an approach introduced by Alshawi [1], which assigns significance weights to a series of context factors determined for each referent, along with decay functions that reduce those weights over time. The relative salience of different referential entities is calculated by summing the context factors, each multiplied by a significance weight. Huls et al use this model to account for reference in a multimodal system that combines typed input with mouse actions to interact with a visualization of a file store (containing emails, reports, etc.). Like the Threshold-Rank model, Huls et al make deictic gesture primary, though this is done by assigning it a very high weight. They also capture temporal thresholding through the decay functions, where each function is a list of weight targets that determine how the weight will change over dialog turns (e.g., 30, 1, 0 seconds for deictic gestures).

Grounded Location Classification. As data become available, it would be preferable to cast location grounding as a machine learning problem rather than pursuing hand-coded heuristic strategies. Grounded location determination can be treated as a multi-class classification problem over features such as the time from the last map movement or spoken utterance, the type of query, history features, recognition confidence, and so on. It is interesting to note that [14] was a Wizard-of-Oz system and [12] used typed input, and neither had to deal with the challenges of speech recognition errors. In an interactive system that is used in mobile and potentially noisy environments, the multimodal context mechanism must take account of any uncertainty in establishing and ranking grounded location hypotheses. In our third strategy below, we evaluate the performance of a classifier operating over a series of nominal and numeric features automatically derived from our interaction logs, to predict the most likely grounded location for queries.
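
To make the Threshold-Rank strategy described above concrete, here is a minimal sketch that stores one referent per mode, applies a fixed expiry threshold per mode, and falls back through the Touch > Map > Spoken > GPS preference order of Figure 6. Only the 6-second Map threshold comes from the pilot described above; the other threshold values are illustrative placeholders.

```java
import java.util.EnumMap;
import java.util.Map;

// Sketch of the Threshold-Rank context model; threshold values other than the
// 6 s Map threshold are illustrative placeholders, not reported settings.
class ThresholdRankModel {
    enum Mode { TOUCH, MAP, SPOKEN, GPS }   // declaration order encodes the preference order

    record Referent(double lat, double lon, long timestampMs) {}

    private static final Map<Mode, Long> THRESHOLD_MS = Map.of(
            Mode.TOUCH, 3_000L,       // placeholder: touch referents expire quickly
            Mode.MAP, 6_000L,         // 6 s threshold found to work well in the pilot
            Mode.SPOKEN, 60_000L,     // placeholder: spoken locations persist longer
            Mode.GPS, Long.MAX_VALUE  // device location never expires
    );

    private final Map<Mode, Referent> lastReferent = new EnumMap<>(Mode.class);

    // Record the most recent referent for a mode (touch, map move, spoken location, GPS fix).
    void update(Mode mode, Referent referent) {
        lastReferent.put(mode, referent);
    }

    // Choose the grounded location: walk modes in preference order and return the
    // first stored referent whose age is within that mode's temporal threshold.
    Referent groundedLocation(long nowMs) {
        for (Mode mode : Mode.values()) {
            Referent r = lastReferent.get(mode);
            if (r != null && nowMs - r.timestampMs() <= THRESHOLD_MS.get(mode)) {
                return r;
            }
        }
        return null;  // no grounded location available
    }
}
```

As the text notes, in a deployed system these thresholds would ideally be tuned, or adapted continually, from usage data.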

4 EVALUATION

If users establish their intended location references in multimodal search transactions through a process of grounding—that is, by taking actions to make them salient in common ground [3,4,5]—how can we evaluate the effectiveness of models that predict the location that a user believes has been grounded? One traditional method for evaluating salient aspects of common ground is to use a Schelling task [3,5], in which the independent judgments of two people about aspects of common ground are compared to determine which aspects they find to be most salient. If two people agree on the same aspect of a situation, we can posit that their agreement indicates some salience of that aspect in common ground. But the Schelling task is difficult to implement in a search system like Speak4it that is deployed to users in the wild, because search is a highly subjective activity and two independent judges in similar situations may have quite different purposes and intentions that are not captured by the context used in the Schelling task.

If we are not able to establish salient aspects of common ground by comparing judgments of two independent users, another
approach is to establish the ground-truth intentions of a large collection of users, hoping to obtain enough data to create a generalized model of how people ground their salient locations over a variety of situations. To divine users’ ground-truth intentions, we chose the simplest method: Ask them.

As we mentioned earlier, when people do not explicitly indicate a location, there are up to four locations that may be salient in common ground: the user’s current location, the location context displayed on the map, the last location they explicitly spoke, and the last location they touched on the map. Rather than guessing which of these locations the user had in mind, we sought to gather ground truth by presenting a disambiguation screen that asked users which location context they meant. When a user issued a query without speaking a location, they would be prompted with a dialogue box that asked, for example, “Which location did you want to search for coffee shops in?” The box contained up to four buttons, labeled “My current location,” “Area shown on map,” “The place I touched,” and “The last place I mentioned.” There was also a “None of the above” button, in case the user’s intentions did not match any of these.

The four location options were presented in an order that was randomized across users and queries, to prevent bias arising from the locations of the buttons. The exception was the “None of the above” option, which was always at the bottom. In addition, we presented only options that were relevant to the user’s recent behavior. So, for example, if a user had not touched the map and had not recently mentioned a location, they would see only two options (current location and map), not four.
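
A minimal sketch of how such an option list could be assembled, following the rules above: offer only options relevant to recent behavior, shuffle the relevant options, and always append "None of the above" last. The recency flags are assumptions about how relevance would be determined.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of option-list construction for the disambiguation prompt.
// The recency predicates are assumptions; the label strings follow the paper.
class DisambiguationPrompt {
    static List<String> buildOptions(boolean gpsAvailable,
                                     boolean mapViewAvailable,
                                     boolean recentlyTouchedMap,
                                     boolean recentlySpokeLocation) {
        List<String> options = new ArrayList<>();
        if (gpsAvailable)          options.add("My current location");
        if (mapViewAvailable)      options.add("Area shown on map");
        if (recentlyTouchedMap)    options.add("The place I touched");
        if (recentlySpokeLocation) options.add("The last place I mentioned");

        Collections.shuffle(options);        // randomize order to avoid position bias
        options.add("None of the above");    // always presented last
        return options;
    }
}
```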

Figure 7. Multimodal disambiguation interface

While it is a great thing for researchers to be able to ask users what they mean when their queries are ambiguous, we realize it is not a great experience for users to have to explain themselves every time they make a query, even if only by pressing a button. In fact, presenting this type of disambiguation screen every time users fail to specify a location might lead them to catch on to the fact that they are asked to perform an extra action (pressing a button) when they have not explicitly uttered a location, and change their behavior accordingly, leading to more explicit locations in their queries than they would issue otherwise. To compensate, we put this disambiguation screen presentation on a “bucket-throttle” schedule, so users would only see it during a fraction of the queries they issued. During the period of our experiment, this throttling varied so that only a small fraction of queries that didn’t contain an explicit location were presented with the disambiguation prompt.

In short, for a random subset of queries we received from users that did not contain an explicit location in the query, those requests would be presented with our disambiguation screen before they received search results, and the user’s subsequent indication of the intended location was recorded, along with all of the data we could gather that might relate to the user’s query and its context. In order to maintain a cooperative search system, we also performed searches for each of the options presented in these requests, and would immediately display the results that corresponded to the user’s choice. So if a user’s query led to separate results for both the user’s current location and the area shown on the map, the server returned sets of results for both searches, and the user’s button selection would determine which of those result sets the application subsequently displayed. For the cases where users did not receive the disambiguation screen, we used a simplified variant of the Threshold-Rank algorithm, as described above, to determine which location to use.
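
The throttling and pre-fetching just described might be sketched as follows. The sampling rate, the search interface, and the result structure are illustrative assumptions rather than reported implementation details.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Sketch of throttled disambiguation with pre-fetched results per location option.
// The sampling rate and the Search interface are placeholders.
class ThrottledDisambiguation {
    private static final double PROMPT_FRACTION = 0.1;  // placeholder: small fraction of eligible queries
    private final Random random = new Random();

    interface Search { List<String> run(String topic, double lat, double lon); }

    // For a query with no explicit location: with some probability, prompt the user and
    // pre-fetch results for every candidate location so the chosen set can be shown at once.
    Map<String, List<String>> handle(String topic, Map<String, double[]> candidateLocations, Search search) {
        Map<String, List<String>> resultsByOption = new LinkedHashMap<>();
        if (random.nextDouble() < PROMPT_FRACTION) {
            for (Map.Entry<String, double[]> e : candidateLocations.entrySet()) {
                double[] coords = e.getValue();   // {lat, lon} for this option
                resultsByOption.put(e.getKey(), search.run(topic, coords[0], coords[1]));
            }
        }
        return resultsByOption;  // empty map means: no prompt, fall back to the default grounding model
    }
}
```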

4.1 Empirical Support for Location Referents

The first goal of the study was to seek empirical support for our hypothesis that locations other than the current user location are salient in local search tasks on mobile devices. Did users always choose “My current location”—in keeping with the default location strategies of many current search systems—or were other location contexts frequently salient for the user? Table 2 shows how frequently users chose each option in a random sample of 3000 queries without an explicitly stated location.

Table 2. Frequency of location context choices

Location context option         Choice frequency
“My current location”           53.87 %
“Area shown on map”             36.85 %
“The place I touched”            3.28 %
“The last place I mentioned”     6.00 %

While “My current location” was chosen in a slim majority of queries (53.87%), we can see that it was not chosen to the exclusion of all else. Users did choose it more frequently than “Area shown on map,” but the map view was considered salient in over a third of cases, making it a viable contender as the grounded salient location in many queries. The remaining two options, “The place I touched” and “The last place I mentioned,” were selected relatively infrequently, at 3.28% and 6.00%, respectively.

Before we dismiss the user’s map touches and spoken locations as potentially insignificant, we should note that each of these location context options was only presented in the disambiguation interface when it was relevant to the user’s recent actions, and therefore each option was presented a different number of times from the others. So actions that were relevant only infrequently would be underrepresented among user selections in the overall frequency analysis in Table 2. The question remains: when those infrequent options were presented, how frequently did users select them as salient? In some ways, this statistic provides a better representation of the strength of salience, since a context option might arise infrequently, yet be highly relevant and salient when it does occur.

Table 3 shows, for each context option, how frequently users chose it relative to how frequently it was presented to them. The frequency proportions here are similar to those in Table 2 for user selections of their
current location, of the area shown on the map, and of their last spoken location. But users’ ink gestures on the map, while very infrequent, were considered salient over two-thirds of the time, painting a different picture from the previous table. When users touch the screen, they often consider those touches as grounding actions.

Table 3. Proportion of location selections

Location context option         Frequency when presented
“My current location”           64.37 %
“Area shown on map”             38.04 %
“The place I touched”           69.29 %
“The last place I mentioned”    13.59 %

The first lesson to take home here is that a mobile, multimodal search system that always chooses the user’s current location from GPS or WiFi has a fair chance of failing to correctly identify the user’s intended location context. In fact, multiple sources of grounding for the location context are not only possible, but frequent. So the next question becomes, if choosing only the user’s location is not the best model to determine the location a user views as grounded and salient, then what is?

4.2 Evaluation of Location Grounding Models

The second goal of this study was to evaluate the effectiveness of different kinds of location grounding models. Because we had access to user-provided ground-truth data about each user’s intended salient location for a large number of queries spoken in the wild, we were able to use this data to evaluate several different strategies. For each model, we compared the model’s prediction for each query to the ground-truth data collected from the user for that query, and marked the prediction as “successful” if the two matched, or “unsuccessful” if they did not. We tested three models: the baseline Fixed Local Salience model; the currently deployed version of the Threshold-Rank model; and, for the third strategy, a classifier trained using a held-out training set of user responses to the disambiguation prompt as gold-standard annotations.

From approximately 3000 observations sampled from our log data, we extracted features for the classifier which included continuous values such as the time since the last map movement, time since the last gesture, time since the last spoken location, and categorical values such as whether there was a gesture in the current turn, the type of the last map movement (pan, zoom), the type of gesture (point, line, area), and whether the device location fell within the current map view. Given the combination of numeric and categorical features, we chose a decision tree classifier, specifically the J48 implementation in the WEKA toolkit [19]. For the evaluation, we conducted a 10-fold cross validation. In each fold the data were randomized and split into training and test sets. The classifier was trained on the training set. All three models were evaluated on the test set for each fold. The results in the table below are averages over all ten folds.
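
For reference, a minimal sketch of the classifier-evaluation step using WEKA's J48 implementation and 10-fold cross-validation. The ARFF file name is a placeholder, and this sketch covers only the classifier; the paper's evaluation additionally scored the two heuristic models on each fold's test set.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: 10-fold cross-validation of a J48 decision tree over the extracted
// grounding features. The ARFF file name is a placeholder.
public class GroundingClassifierEval {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("grounding_features.arff"); // nominal + numeric features
        data.setClassIndex(data.numAttributes() - 1);                // class = grounded-location label

        J48 tree = new J48();                                        // WEKA's C4.5-style decision tree
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1));      // 10 folds, fixed seed

        System.out.printf("Mean accuracy over 10 folds: %.2f%%%n", eval.pctCorrect());
    }
}
```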

Table 4. Mean model accuracy over 10 folds

Location salience model           Accuracy    p
Baseline: Fixed Local Salience    53.01 %     n/a
Threshold-Rank                    55.74 %     0.031
J48 Decision Tree Classifier      56.45 %     0.006

Given the distribution of the data, the baseline performs well at 53.01%. The heuristic Threshold-Rank strategy gains us 2.7%
with the temporal threshold for map movement set at 6 seconds. The classifier ekes out an additional 0.71% at 56.45%. Significance values show each model compared to the baseline model, based on a 2-tailed, homoscedastic t-test. The difference between Threshold-Rank and J48 was not significant (p = 0.523).
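
The reported significance test (a 2-tailed, homoscedastic t-test) can be reproduced over per-fold accuracies, for example with Apache Commons Math; the fold scores below are placeholders, not the paper's values.

```java
import org.apache.commons.math3.stat.inference.TTest;

// Sketch: two-tailed homoscedastic (equal-variance) t-test comparing per-fold
// accuracies of two models. The arrays are placeholders, not the paper's fold scores.
public class FoldSignificance {
    public static void main(String[] args) {
        double[] baselineFoldAccuracy = {52.1, 53.5, 54.0, 52.8, 53.2, 53.9, 52.5, 53.0, 52.7, 52.4};
        double[] modelFoldAccuracy    = {55.0, 56.2, 57.1, 55.8, 56.5, 56.9, 55.4, 56.0, 56.3, 55.3};

        double p = new TTest().homoscedasticTTest(baselineFoldAccuracy, modelFoldAccuracy);
        System.out.printf("two-tailed p = %.4f%n", p);
    }
}
```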

A few issues should be clarified: For one, as a classification problem, the classes we are trying to predict are not independent. In fact, they are highly confounded. The user’s current location may or may not be within the area currently shown on the map view (in our data, the overlap is around 70%). The last spoken location—especially if it was in the last turn—may also overlap with the map view, or with the user’s current location. And of course, user touches on the map are always within the current map view, so the “touch” class is a full subset of the map view class.

Given this degree of possible overlap, the most important question for classification is not how often the system identifies the correct class, but rather, how often does the system do something that would appear sensible to the user (even if it misclassified the location context)? We calculated a second accuracy measure to give a better sense of when the system would ‘do the right thing.’ Specifically, a hypothesis label (Map, GPS, Touch, Spoken) is counted as correct if it matches the reference, or if the hypothesis and reference are both either Map or GPS and the GPS location coordinates are within the current map view. Table 5 shows this ‘concept’ accuracy for each model. Because of the overlap of many of the location contexts, we can see the user-perceived performance of the system would be higher overall for all models, with a solid 7.23% improvement of the J48 model over the baseline. Differences in concept accuracy for all three models are statistically significant at p < 0.001.
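
The relaxed 'concept accuracy' criterion can be stated compactly: a hypothesis counts as correct if it matches the reference label, or if both labels are drawn from {Map, GPS} and the device's GPS coordinates fall within the current map view. A minimal sketch, assuming a simple bounding-box representation of the map view:

```java
// Sketch of the relaxed 'concept accuracy' match; the MapView bounding box is a
// hypothetical representation of the current map display.
class ConceptAccuracy {
    enum Label { TOUCH, MAP, SPOKEN, GPS }

    record MapView(double minLat, double maxLat, double minLon, double maxLon) {
        boolean contains(double lat, double lon) {
            return lat >= minLat && lat <= maxLat && lon >= minLon && lon <= maxLon;
        }
    }

    // Exact match, or Map/GPS confusion when the device location is inside the map view.
    static boolean conceptCorrect(Label hypothesis, Label reference,
                                  MapView mapView, double gpsLat, double gpsLon) {
        if (hypothesis == reference) return true;
        boolean bothMapOrGps = (hypothesis == Label.MAP || hypothesis == Label.GPS)
                            && (reference == Label.MAP || reference == Label.GPS);
        return bothMapOrGps && mapView.contains(gpsLat, gpsLon);
    }
}
```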

Table 5. Mean concept accuracy over 10 folds

Location salience model           Accuracy    p
Baseline: Fixed Local Salience    77.87 %     n/a
Threshold-Rank                    81.32 %     0.0004
J48 Decision Tree Classifier      85.10 %     < .0001

5 CONCLUSION

As multimodal local search systems evolve from the simple one-step search interaction paradigm to support multimodality, iterative query dialog, and sensor inputs such as GPS and orientation, more sophisticated models for dialog context and relevance are required (cf. [8]). Specifically, our hypothesis here was that the assumption made by contemporary multimodal local search systems, that the device location is always salient, is too simplistic. We found initial support for this claim by investigating the query logs, which showed that users repeat queries and add locations to overcome errors arising from that assumption. We then conducted an empirical evaluation of a multimodal local search system that supported a richer model of location grounding and found strong evidence that, in addition to the last location touched and the device location, the current location shown on the map is frequently salient (37% of explicitly grounded queries). We compared a range of different kinds of location salience models. Defaulting to the device location remains a strong baseline, as our data show this location is salient a little over half of the time. We demonstrated a significant improvement over the majority baseline (Fixed Local Salience) and the initial heuristic strategy (Threshold-Rank) by training statistical classifiers for
location grounding, using the ground-truth annotation data collected in the field from users.

However, significant challenges and avenues for further exploration remain. Our mechanism for explicit confirmation of ground truth from users may be problematic, given the frequent overlap of the map view and the device location. Users may not be consistent in their choices, and the two classes are confounded in the training data for classification. One avenue for future effort may be to offer multimodal disambiguation as an explicit part of the grounding strategy; that is, rather than always making a hard decision between the four possible location referents, if the system is truly uncertain then it should explicitly ask the user for assistance in grounding the location. There may also be benefits to operating over a richer set of features. For example, the recent sequence and magnitude of map movements may help in predicting whether the map is salient over GPS. Also, the current model generalizes all users into a single model. But there may be significant individual differences among users, and we might benefit from a personalized and adaptive approach to grounding location salience and other contextual features in interactive multimodal systems. In a random sample of individual users and their queries from the Speak4it logs, only 33.6% ever spoke a location reference at all, while 7.7% spoke locations in at least 80% of their queries, a clear indication of individual differences in these behaviors.

Finally, we are up against the ongoing changes in people’s assumptions about context and grounding caused by rapid changes in technologies and their modes of interaction. In our case, a newly-introduced method of multimodal interaction (speaking while gesturing on a map) had yet to catch on with many users. As a result, our users have yet to provide as much data about grounding this location context as we see for the contexts that are more familiar to them. Learning how to provide better education to users about the use of that mode—perhaps through a situated multimodal help approach [11]—will be necessary in future work.

6 ACKNOWLEDGEMENTS

Thanks to Jay Lieske, Clarke Retzer, Brant Vasilieff, Diamantino Caseiro, Junlan Feng, Srinivas Bangalore, Claude Noshpitz, Barbara Hollister, Remi Zajac, Mazin Gilbert, and Linda Roberts for their contributions to Speak4it.

7 REFERENCES

[1] Alshawi, H. 1987. Memory and Context for Language Interpretation. Cambridge: Cambridge University Press.
[2] Bangalore, S. and Johnston, M. 2009. Robust Understanding in Multimodal Interfaces. Computational Linguistics 35:3, 345-397.
[3] Clark, H., Schreuder, R., and Buttrick, S. 1983. Common Ground and the Understanding of Demonstrative Reference. Journal of Verbal Learning and Verbal Behavior 22, 245-258.
[4] Clark, H. and Brennan, S. E. 1991. Grounding in Communication. In L. B. Resnick, J. M. Levine, and S. D. Teasley (eds.), Perspectives on Socially Shared Cognition. American Psychological Association.
[5] Clark, H. 1996. Using Language. Cambridge: Cambridge University Press.
[6] Cohen, P. R., Johnston, M., McGee, D., Oviatt, S. L., Pittman, J., Smith, I., Chen, L., and Clow, J. 1998. Multimodal Interaction for Distributed Interactive Simulation. In M. Maybury and W. Wahlster (eds.), Readings in Intelligent Interfaces. Morgan Kaufmann, San Francisco, CA, 562-571.
[7] Di Fabbrizio, G., Okken, T., and Wilpon, J. 2009. A Speech Mashup Framework for Multimodal Mobile Services. In Proceedings of the 2009 International Conference on Multimodal Interfaces, 71-78.
[8] Ehlen, P., Zajac, R., and Rao, B. P. 2009. Location and Relevance. In E. Wilde, S. Boll, K. Cheverst, P. Fröhlich, R. Purves, and J. Schöning (eds.), Proceedings of the Second International Workshop on Location and the Web (LocWeb 2009), Boston, Massachusetts, April 2009, 17-19.
[9] Feng, J., Bangalore, S., and Gilbert, M. 2009. Role of Natural Language Understanding in Voice Local Search. In Proceedings of Interspeech 2009, 1859-1862.
[10] Gustafson, J., Bell, L., Beskow, J., Boye, J., Carlson, R., Edlund, J., Granström, B., House, D., and Wirén, M. 2000. AdApt – A Multimodal Conversational Dialogue System in an Apartment Domain. In Proceedings of ICSLP 2000, Vol. 2, 134-137.
[11] Hastie, H., Johnston, M., and Ehlen, P. 2002. Context-Sensitive Help for Multimodal Dialog. In Proceedings of the 4th IEEE Conference on Multimodal Interfaces, 93-101.
[12] Huls, C., Bos, E., and Claassen, W. 1995. Automatic Referent Resolution of Deictic and Anaphoric Expressions. Computational Linguistics 21:1, 59-79.
[13] Johnston, M., Bangalore, S., Vasireddy, G., Stent, A., Ehlen, P., Walker, M., Whittaker, S., and Maloor, P. 2002. MATCH: An Architecture for Multimodal Dialogue Systems. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 376-383.
[14] Kehler, A., Martin, J.-C., Cheyer, A., Julia, L., Hobbs, J., and Bear, J. 1998. On Representing Salience and Reference in Multimodal Human-Computer Interaction. In AAAI'98 Workshop on Representations for Multi-Modal Human-Computer Interaction, Madison, WI, 33-39.
[15] Oviatt, S. L. 1997. Multimodal Interactive Maps: Designing for Human Performance. Human-Computer Interaction 12, 93-129.
[16] Oviatt, S. L. and Kuhn, K. 1998. Referential Features and Linguistic Indirection in Multimodal Language. In Proceedings of ICSLP 1998, 286-304.
[17] Rubine, D. 1991. Specifying Gestures by Example. Computer Graphics 25:4, 329-337.
[18] Wahlster, W. 2006. SmartKom: Foundations of Multimodal Dialogue Systems (Cognitive Technologies). Springer-Verlag, New York.
[19] Witten, I. H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco.
[20] http://speak4it.com/
[21] http://www.google.com/mobile/apple/app.html
[22] http://www.vlingo.com

