Easier retrieval of verbal commands using multimodal interfaces
Patent Information
- Authority / Receiving Office
- DE · DE
- Patent Type
- Patents
- Current Assignee / Owner
- ADOBE INC
- Filing Date
- 2019-12-20
- Publication Date
- 2026-07-02
Abstract
Description
[0001] The discoverability (knowledge and understanding) of suitable verbal commands poses a persistent problem for users of speech-based interfaces. Discoverability, in the sense of users not knowing which verbal commands are available (knowledge) and / or how to formulate commands so that they are understood by the system supporting the interface (understanding), ranks second only to speech recognition accuracy among the obstacles that users of speech-based interfaces encounter. Users often try to guess verbal commands that they believe the supporting system might recognize, and / or use formulations, speech patterns, or dialects to which they are accustomed but which the system may not understand; both often lead to execution errors and frustration.
[0002] One approach to addressing these discoverability problems in current systems is to present users with a list of sample commands as part of an introductory procedure, since this is the natural time to introduce users to the operations and commands supported by a speech-based system. However, such lists presented during the introductory procedure are often closed by users before they have thoroughly reviewed them, unless they are actively engaged in a specific task or action. Even when a user has thoroughly reviewed a list of sample commands, they have often forgotten the presented commands by the time they attempt to use a specific command while performing an action or task.
[0003] To make users aware of newly supported and / or infrequently used commands, some solutions send users messages reminding them of available commands or announcing that new commands will become available. Similarly, some solutions send weekly emails reminding users of available commands and updates to those commands. However, presenting sample command suggestions only at specific intervals is not effective, as users tend to have forgotten these commands by the time they are performing actions and / or tasks using speech-based systems.
Summary
[0004] Embodiments of the present disclosure include, among others, a framework for generating and presenting examples of verbal commands to facilitate the discoverability of relevant verbal commands understood by systems supporting multimodal interfaces. The framework described here additionally enables users to become familiar with available verbal commands over time. The described framework facilitates command discoverability by providing exemplary verbal command suggestions when non-verbal input (such as direct manipulation) is used. A target associated with direct manipulation input (such as touch, keyboard, or mouse input) received from a user via a multimodal user interface is determined, and one or more exemplary verbal command suggestions relevant to the target are generated. At least some of the generated verbal command suggestions are made available for presentation in conjunction with the multimodal user interface using one of three interface variants. These variants include an interface that presents verbal command suggestions using a list-based approach, an interface that uses context-sensitive pop-up windows to present verbal command suggestions, and an interface that presents verbal command suggestions embedded within a graphical user interface (GUI). Each of the presented interface variants facilitates the user's understanding of the verbal commands that the system supporting the multimodal interface can execute and simultaneously guides the user on how to invoke available verbal commands (for example, through appropriate wording variations and multimodal interactions).
[0005] This summary is provided to introduce, in simplified form, a selection of concepts that are described in further detail below. It is not intended to identify key features or essential features of the claimed invention, nor is it intended to be used as an aid in determining the scope of the claimed invention.
Brief description of the drawings
[0006] The present invention is described in detail below with reference to the accompanying drawing figures.
Fig. 1 is a schematic diagram illustrating an abstracted overview of a command suggestion generation framework according to implementations of the present disclosure.
Fig. 2 is a block diagram illustrating an exemplary system for facilitating the discovery of verbal commands in multimodal user interfaces according to implementations of the present disclosure.
Fig. 3 is a schematic diagram illustrating an exemplary list of formulation templates associated with a set of operations that can be selected when a target implies image editing, according to implementations of the present disclosure.
Fig. 4 is an exemplary screen display illustrating an exhaustive interface according to implementations of the present disclosure.
Fig. 5A to Fig. 5E are exemplary screen displays illustrating an adaptive interface according to implementations of the present disclosure.
Fig. 6A to Fig. 6D are exemplary screen displays illustrating an embedded interface according to implementations of the present disclosure.
Fig. 7 is a schematic diagram illustrating an exemplary procedure for facilitating the discovery of verbal commands in multimodal interfaces according to implementations of the present disclosure.
Fig. 8 is a schematic diagram illustrating an exemplary procedure for facilitating the discovery of verbal commands in multimodal interfaces according to implementations of the present disclosure.
Fig. 9 is a block diagram of an exemplary computing environment suitable for use in implementations of the present disclosure.
Detailed description
[0007] The subject matter of the invention disclosed herein is described with specificity in order to satisfy statutory requirements. However, the description itself is not intended to limit the scope of the present patent. Rather, it is contemplated that the claimed subject matter may also be embodied in other ways and may include other steps or combinations of steps similar to those described in this document, in conjunction with other current or future technologies. Although the terms "step" and / or "block" may be used here to denote different elements of the methods employed, they should not be understood as implying a specific sequence among or between the various steps disclosed herein, unless the sequence of individual steps is explicitly stated.
[0008] Discovering available verbal commands and formulating them in a way that supporting systems can understand remains a persistent challenge for users of natural language interfaces (NLIs). Improvements in speech-to-text engines and the proliferation of commercial speech interfaces, as part of speech-only and multimodal solutions, have exposed more end users to this modality. However, the "invisible" nature of speech (or other verbal input) relative to other GUI elements makes learning and using it particularly difficult for users. In this context, the term "discoverability" encompasses not only making users aware of operations that can be performed using verbal commands (i.e., knowledge), but also guiding users on how to formulate these verbal commands so that the system can interpret them correctly (i.e., understanding). A lack of support in discovering verbal commands often forces users to guess supported verbal commands and / or formulations. Since such guesses are likely to be misinterpreted and therefore lead to more frequent errors, users confronted with such systems may become frustrated with voice input, regardless of the system they are using.
[0009] Multimodal interfaces that support verbal input and at least one form of direct manipulation input (such as touch, keyboard, mouse, eye tracking, spatial gestures, or similar) offer advantages over purely verbal interfaces. Since the strengths of multiple input modalities can complement each other, direct manipulation input can support users of verbal input more effectively, and vice versa. For example, in a document reader with a purely verbal interface, it might be difficult for a user to ask for the correct pronunciation of a word: the user would have to guess the pronunciation of the very word they want the system to pronounce. With a multimodal interface that supports both speech and touch, a user can point to a word and ask for its pronunciation. Conversely, verbal input can support interfaces that accept direct manipulation input. Instead of learning where and how to invoke operations within a GUI, a user can, for example, simply point to a word and say, "Say this." As applications increasingly support intelligent features (such as entity recognition in images), the opportunities for multimodal interactions grow. In a multimodal image editor, for instance, a user can point to a person in an image and give the command, "Remove the shadow from the face." However, the questions remain: how can a user find out what they can say, and how should they say it?
[0010] Embodiments of the present disclosure address the problem of guiding users of multimodal user interfaces regarding which commands they can use to achieve desired results, as well as the precise manner of issuing such commands (e.g., their formulation and the like) so that the system supporting the multimodal interface understands the desired results. To this end, embodiments of the present disclosure facilitate the discovery of verbal commands (e.g., verbal commands in natural language) in multimodal user interfaces by allowing users to interactively select targets via a direct manipulation modality (e.g., touch, keyboard, mouse, and the like) and, in response, presenting exemplary verbal commands in conjunction with the multimodal user interface. In this way, non-verbal modalities can help users narrow the abstract question "What can I say?" to the more specific question "What can I say here and now?" Implementations also facilitate the discovery of verbal commands in multimodal user interfaces by providing relevant command suggestions in situ. This is achieved by presenting exemplary verbal command suggestions within the interface as it is being used by the user. Three interface variants are considered. The first variant is an interface that presents suggestions using a list-based approach (referred to below as the "exhaustive" interface). The second variant is an interface that uses contextual pop-up windows to present suggestions (referred to below as the "adaptive" interface). The third variant is an interface that embeds commands within the GUI (referred to below as the "embedded" interface). These interface variants make it easier to inform users about the operations that the system supporting the multimodal user interface can perform, and simultaneously guide users on how to invoke available verbal commands (such as appropriate phrasing options and multimodal interactions).
[0011] Fig. 1 shows a schematic diagram illustrating an abstracted overview of a command suggestion generation framework 100 according to implementations of the present disclosure. Given a target (that is, an area of a multimodal user interface that is the object of a direct manipulation input) for which verbal commands are to be suggested, the framework iterates through a list or catalog of operations available for the target (that is, system actions that can be performed), see 110. A subset of operations for which exemplary verbal command suggestions are to be generated is selected, see 112. Such a selection can be based, for example, on one or more of: a type associated with the target, see 114; the relevance of an operation to a workflow in which the user is engaged, see 116; the number of times a verbal command for an operation has been issued by the user (or by a set of users, for example, all users) ("count of given operations"), see 118; and the number of times an operation has previously been presented to the user (or to a set of users, for example, all users) in suggested commands ("count of shown operations"), see 120.
[0012] For the selected operations, the framework then iterates through a predefined list or catalog of formulation templates, see 122, and selects at least one for presentation, see 124. Such a formulation template selection can be based, for example, on one or more of: a type associated with the received direct manipulation input (that is, how the input that leads to the generation of verbal command suggestions was invoked), see 126; the complexity of the formulation template (that is, the number of parameters needed to complete the template), see 128; the number of times a formulation template for the selected operation has been issued by a specific user (or by a set of users, for example, all users) ("count of given templates"), see 130; and the number of times the formulation template has been presented to a specific user (or to a set of users, for example, all users) in suggested commands ("count of shown templates"), see 132.
[0013] Finally, the framework populates, see 134, any modifiable parameters included in the selected templates (that is, attributes for which more than one value may be suitable, such as color names, filter names, tool names, and the like) with sample parameter values to generate the final exemplary verbal command suggestions to be presented to the user, see 136. The modifiable parameters can be populated, for example, based on one or more of: relevance to a workflow in which the user is engaged, see 138; and an active state of the target, see 140.
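The three stages of Fig. 1 (operation selection, template selection, parameter population) can be illustrated with a minimal sketch. All names here (`Operation`, `suggest`, `sample_values`, `max_ops`) are hypothetical and not taken from the disclosure; the sketch uses only the target type (114), template complexity (128), and the active state of the target (140) as selection factors.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    name: str
    target_types: set   # target types the operation applies to (factor 114)
    templates: list     # formulation templates with {placeholder} parameters


def suggest(target_type, target_state, operations, sample_values, max_ops=2):
    """Return example command strings for a target (hypothetical sketch)."""
    # Stages 110/112: keep only operations applicable to the target type.
    applicable = [op for op in operations if target_type in op.target_types]
    suggestions = []
    for op in applicable[:max_ops]:
        # Stages 122/124: pick the template with the fewest parameters (128).
        template = min(op.templates, key=lambda t: t.count("{"))
        # Stage 134: fill each parameter with a sample value that differs
        # from the target's active state (140) when such a value exists.
        values = {}
        for param, candidates in sample_values.items():
            if "{" + param + "}" in template:
                current = target_state.get(param)
                values[param] = next(
                    (v for v in candidates if v != current), candidates[0]
                )
        suggestions.append(template.format(**values))
    return suggestions


operations = [
    Operation("fill color", {"shape"},
              ["Change the fill color to {color}", "Fill with {color}"]),
    Operation("filter", {"image"}, ["Apply a {filter} filter"]),
]
sample_values = {"color": ["green", "red"], "filter": ["grayscale", "sepia"]}
result = suggest("shape", {"color": "green"}, operations, sample_values)
```

For a green shape, only the fill color operation applies, and the suggested color is one other than green.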
[0014] Fig. 2 shows a block diagram illustrating an exemplary system 200 for facilitating the discovery of natural language commands in multimodal user interfaces. It should be understood that these and other arrangements described here are merely examples. Other arrangements and elements (such as machines, interfaces, functions, sequences and groupings of functions, and the like) may be used in addition to or in place of those shown, and some elements may be omitted entirely. Many of the elements described here are also functional entities that may be implemented as discrete or distributed components, or in conjunction with other components, in any suitable combination and location. Various functions described here as being performed by one or more entities may be executed by hardware, firmware, and / or software. For example, various functions can be performed by a processor that executes instructions stored in memory.
[0015] The system 200 is an example of a suitable architecture for implementing certain aspects of the present disclosure. Among other components not shown, the system 200 includes a user computing device 210 that interacts with a verbal command detection engine 212 to facilitate the discovery of verbal commands using multimodal user interfaces. Each of the components shown in Fig. 2 can be provided on one or more computing devices, such as the computing device 900 of Fig. 9, explained below. As shown in Fig. 2, the user computing device 210 and the verbal command detection engine 212 can communicate over a network 214, which can include one or more local area networks (LANs) and / or wide area networks (WANs). Such networked environments are common in offices, enterprise computer networks, intranets, and the internet. It should be understood that any number of user devices and verbal command detection engines can be present in the system 200 within the scope of this disclosure. Each can comprise a single device or multiple devices working together in a distributed environment. The verbal command detection engine 212 can be provided, for example, by multiple server devices that collectively provide the functionality of the verbal command detection engine 212 as described here. Other components not shown may also be included in the network environment.
[0016] The verbal command detection engine 212 is generally configured to facilitate the discovery of verbal commands in multimodal user interfaces. Multimodal user interfaces are user interfaces that support more than one input mode. In certain aspects, exemplary multimodal interfaces support verbal input (for example, voice input) and direct manipulation input (for example, input received via touch, keyboard, eye tracking, spatial gestures, mouse, or other non-verbal input). The user computing device 210 can access and communicate with the verbal command detection engine 212 via a web browser or other application running on the user computing device 210. Alternatively, the verbal command detection engine 212 can be installed on the user computing device 210 so that access via the network 214 is not necessary.
[0017] The verbal command detection engine 212 includes a direct manipulation input receiving component 216, a target-setting component 218, an operation determination component 220, an operation subset selection component 222, a verbal command suggestion generation component 224, and a presentation component 226. The direct manipulation input receiving component 216 is configured to receive direct manipulation input from a user via a multimodal interface associated with the user computing device 210. Direct manipulation inputs can include, for example, touch inputs, keyboard inputs, mouse click inputs, and hover inputs.
[0018] The target-setting component 218 is configured to determine a target associated with a received direct manipulation input. A target is an area of a multimodal user interface that is the object of a direct manipulation input. A target can therefore be an object, an application, a user interface element, an image, text, or the like, located in a multimodal interface near the location at which a direct manipulation input is received. For example, if a received direct manipulation input is a touch input received in conjunction with an image, the target could be an object in that image (such as a background image, a person, a shape, and the like) that was under the user's finger when the touch input was received. A target can also be a widget, an icon, a toolbar, a toolbar function, or the like. If, for example, a received direct manipulation input is a mouse click received in conjunction with a function indicator located in a toolbar, the target can be the function indicator itself and, consequently, the function. Any object, element, application, image, or the like associated with a multimodal interface can be a target if it is associated with a received direct manipulation input.
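One simple way to determine a target, sketched here under the assumption that interface elements can be described by axis-aligned bounding boxes with a stacking order, is a hit test that returns the topmost element under the input location. The names `Element` and `determine_target` are hypothetical and do not come from the disclosure, which does not prescribe a particular hit-testing method.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Element:
    name: str
    kind: str          # e.g. "image", "shape", "widget"
    x: float
    y: float
    width: float
    height: float
    z: int = 0         # stacking order; larger values are drawn on top

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


def determine_target(elements, px, py) -> Optional[Element]:
    """Return the topmost element under the input location, if any."""
    hits = [e for e in elements if e.contains(px, py)]
    return max(hits, key=lambda e: e.z) if hits else None


elements = [
    Element("background", "image", 0, 0, 100, 100, z=0),
    Element("rectangle", "shape", 10, 10, 20, 20, z=1),
]
target = determine_target(elements, 15, 15)
```

A touch at (15, 15) lands on both elements, and the rectangle is chosen because it is stacked above the background image.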
[0019] The operation determination component 220 is configured to determine multiple available operations that can be performed on the target of a direct manipulation input. The list of operations is generally predefined by the system 200 and stored in conjunction with the verbal command detection engine 212 (or in a separate data store, not shown, that is accessible to it). The operation subset selection component 222 is configured to select a subset of the operations determined by the operation determination component 220 on which to focus the generated verbal command suggestions. Selecting a suitable subset of operations can be based on a number of factors. A first exemplary factor is the relevance of an operation to the type of target for which the verbal command suggestions are generated (114 in Fig. 1). When generating verbal command suggestions that are relevant, for example, to a shape presented in the multimodal interface, the system 200 will probably select the "Fill Color" operation, as this is relevant to the type of target (the shape), and will probably not select the "Filter" operation, as this is not relevant to the type of target.
[0020] A second exemplary factor is the relevance of an operation to a workflow in which the user is engaged (116 in Fig. 1). For the purposes of this document, the term "workflow" refers to a set of operations that can assist a user in completing a task or action. Workflows are generally predefined by the system 200, but can also be defined by a user. For example, if a user is engaged in a workflow that uses an image editing application to modify a color image by converting it to black and white and changing the border color from black to white, a relevant workflow may imply the operations "Apply a grayscale filter" and "Change the border color." In embodiments of the present disclosure, if the system 200 determines that the user is engaged in a workflow, operations relevant to the workflow are taken into account by the operation subset selection component 222 when selecting the subset of operations.
[0021] A third exemplary factor that the operation subset selection component 222 can use to select a suitable subset of operations is the count of given operations (118 in Fig. 1). For the purposes of this document, the term "count of given operations" refers to the number of times a verbal command for an operation has been issued by a specific user (or, in some embodiments, by a set of users, for example, all users). In some embodiments, priority may be given to operations for which verbal commands are frequently issued, since such operations may represent actions that the user frequently performs in connection with the specific target. In other embodiments, priority may be given to operations for which verbal commands are not frequently issued, since suggesting these can make the user aware of operations that the system 200 can perform and with which the user may be less familiar.
[0022] A fourth exemplary factor that the operation subset selection component 222 can use to select a suitable subset of operations is the count of shown operations (120 in Fig. 1). For the purposes of this document, the term "count of shown operations" refers to the number of times an operation has previously been presented to a specific user (or, in some embodiments, to a set of users, for example, all users) in verbal command suggestions. In some embodiments, priority may be given to operations for which verbal commands are frequently presented, since such operations may represent actions that users frequently perform in connection with the specific target. In other embodiments, priority may be given to operations for which verbal commands are not frequently presented, since suggesting these can make the user aware of operations that the system 200 can perform and with which the user may be less familiar.
[0023] The operation subset selection component 222 includes an operation rank evaluation component 228. The operation rank evaluation component 228 is configured to rank the multiple available operations relative to each other in order to generate a suggestion ranking. In embodiments, one or more of the factors listed above (that is, target type, workflow relevance, count of given operations, and count of shown operations) can be used by the operation rank evaluation component 228 to generate the suggestion ranking according to a predetermined set of priority rules. Once a suggestion ranking has been generated, the operation subset selection component 222 is configured to use the suggestion ranking, at least in part, to select the subset of operations on which the generated verbal command suggestions are focused.
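The ranking step can be sketched as a lexicographic sort over the four factors. The particular priority order below (target type first, then workflow relevance, then the two counts) and the names `OperationStats`, `rank_operations`, and `select_subset` are assumptions made for illustration; the disclosure only states that a predetermined set of priority rules is applied. The `prefer_unfamiliar` flag switches between the two embodiments described above (favoring rarely used or frequently used operations).

```python
from dataclasses import dataclass


@dataclass
class OperationStats:
    name: str
    matches_target_type: bool   # factor 114
    in_workflow: bool           # factor 116
    given_count: int            # factor 118
    shown_count: int            # factor 120


def rank_operations(ops, prefer_unfamiliar=True):
    """Order operations by an assumed lexicographic priority rule."""
    sign = 1 if prefer_unfamiliar else -1

    def key(op):
        # False sorts before True, so matching operations come first.
        return (not op.matches_target_type,
                not op.in_workflow,
                sign * op.given_count,
                sign * op.shown_count,
                op.name)

    return sorted(ops, key=key)


def select_subset(ops, k, prefer_unfamiliar=True):
    """Take the top-k ranked operations that apply to the target type."""
    ranked = rank_operations(ops, prefer_unfamiliar)
    return [op.name for op in ranked if op.matches_target_type][:k]


ops = [
    OperationStats("fill color", True, False, given_count=5, shown_count=2),
    OperationStats("border color", True, True, given_count=9, shown_count=9),
    OperationStats("filter", False, True, given_count=0, shown_count=0),
    OperationStats("opacity", True, False, given_count=1, shown_count=4),
]
subset = select_subset(ops, 2)
```

Here the workflow-relevant operation ranks first despite its high counts, and the less frequently issued of the remaining operations follows.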
[0024] The verbal command suggestion generation component 224 is configured to generate multiple verbal command suggestions relevant to the subset of operations selected by the operation subset selection component 222. The verbal command suggestion generation component 224 includes a formulation template selection component 230, a formulation template subset selection component 232, and a parameter assignment component 234. Formulation templates are generally predefined by the system 200, although in some implementations they can also be predefined by a user. By way of example, Fig. 3 shows a list of formulation templates associated with a set of operations that can be selected by the operation subset selection component 222 when a target implies image editing. It should be understood by a person of ordinary skill in the art that such a list is purely exemplary and is not intended to limit implementations in any way. Similarly, it should be understood that the framework and system described here are not specific to image editing tools and can also be used by other multimodal systems to improve verbal command discoverability.
[0025] The formulation template selection component 230 is configured to select multiple formulation templates that are relevant to the subset of operations selected by the operation subset selection component 222, generally by iterating through a predefined list of formulation templates. The formulation template subset selection component 232 is configured to select a formulation template for each operation in the selected subset of operations. In certain embodiments, the formulation template subset selection component 232 can consider four exemplary factors when selecting formulation templates. The first exemplary factor is the type of input received (126 in Fig. 1), that is, the way in which the direct manipulation input that leads to the generation of verbal command suggestions was invoked. A second exemplary factor is the complexity of the formulation template (128 in Fig. 1), as expressed by the number of parameters required to complete the template. In some embodiments, the default setting of the system 200 is to select the formulation templates with the lowest complexity (that is, the fewest modifiable parameters). In certain embodiments, the complexity of selected formulation templates can, for example, be increased by one parameter each time a user performs an operation more than once, until a predefined maximum number of parameters is reached. In this way, in certain embodiments, users are presented with more complex verbal commands over time as they learn to perform basic operations.
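The gradual increase in template complexity can be sketched as follows. This is one possible reading of the rule above, in which each performance of an operation beyond the first raises the allowed parameter count by one up to a maximum; the function names, the `{placeholder}` template syntax, and the default `max_params=3` are assumptions for illustration.

```python
import re


def parameter_count(template: str) -> int:
    """Complexity of a template: its number of {placeholder} parameters."""
    return len(re.findall(r"\{\w+\}", template))


def pick_template(templates, times_performed, max_params=3):
    """Pick the most complex template allowed at the user's current level."""
    base = min(parameter_count(t) for t in templates)   # lowest complexity
    extra = max(0, times_performed - 1)                  # one step per repeat
    # Never drop below the simplest template, never exceed the maximum.
    limit = max(base, min(max_params, base + extra))
    eligible = [t for t in templates if parameter_count(t) <= limit]
    return max(eligible, key=parameter_count)


templates = [
    "Apply a {filter} filter",
    "Apply a {filter} filter at {strength} percent",
    "Apply a {filter} filter at {strength} percent to the {entity}",
]
first_time = pick_template(templates, times_performed=0)
after_two = pick_template(templates, times_performed=2)
```

A new user sees the single-parameter template, and the two-parameter template appears once the operation has been performed twice.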
[0026] A third exemplary factor is the count of given templates (130 in Fig. 1), that is, the number of times a formulation template for the selected operation has been issued by a specific user (or by a set of users, for example, all users). A fourth exemplary factor is the count of shown templates (132 in Fig. 1), that is, the number of times the formulation template has been presented to a specific user (or to a set of users, for example, all users). In some implementations, formulation templates with a low count of given templates and a low count of shown templates are ranked higher than those with a high count of given templates and a high count of shown templates.
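A minimal sketch of count-based template ranking is shown below. Summing the two counts into a single score is an assumption made for illustration, as are the names `TemplateUsage`, `rank`, and `suggest`; the disclosure only states that templates with low counts rank higher.

```python
from collections import Counter


class TemplateUsage:
    """Tracks how often each template was issued by or shown to a user."""

    def __init__(self):
        self.given = Counter()   # factor 130: times the user issued it
        self.shown = Counter()   # factor 132: times it was suggested

    def rank(self, templates):
        # Lower combined count ranks higher; sorted() is stable, so ties
        # keep the order of the predefined template list.
        return sorted(templates, key=lambda t: self.given[t] + self.shown[t])

    def suggest(self, templates):
        choice = self.rank(templates)[0]
        self.shown[choice] += 1
        return choice


usage = TemplateUsage()
templates = ["Apply a {filter} filter", "Add a {filter} effect"]
first = usage.suggest(templates)
second = usage.suggest(templates)
```

Because each suggestion increments the shown count, repeated calls rotate through the available formulations instead of repeating one.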
[0027] Formulation templates often contain at least one modifiable parameter. Accordingly, the parameter assignment component 234 of the verbal command suggestion generation component 224 is configured to populate templates containing parameters with sample parameter values. In certain embodiments, if the verbal command detection engine 212 determines that the user is engaged in a workflow, the parameter assignment component 234 can select parameter values that are oriented toward the workflow. In certain implementations, the parameter assignment component 234 can select parameter values that differ from the current state of the target. For example, if the determined target is a green rectangle, the fill command suggested when the green rectangle is touched will name a color other than green.
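The two value-selection rules can be combined in a short sketch. Giving the workflow value precedence over the differ-from-current-state rule is an assumption, as is the function name `choose_value`; the disclosure presents the two rules as separate embodiments without ordering them.

```python
def choose_value(candidates, current=None, workflow_value=None):
    """Pick a sample parameter value for a template placeholder."""
    # Assumed precedence: a value called for by the active workflow wins.
    if workflow_value is not None and workflow_value in candidates:
        return workflow_value
    # Otherwise prefer a value that differs from the target's current state.
    for value in candidates:
        if value != current:
            return value
    # Fall back to the first candidate if every value matches the state.
    return candidates[0]


colors = ["green", "red", "blue"]
suggestion = "Change the fill color to " + choose_value(colors, current="green")
```

For a green rectangle, the resulting suggestion names red rather than green.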
[0028] The presentation component 226 is configured to present the determined, filtered, ranked, and populated verbal command suggestions in conjunction with a multimodal user interface. Three interface variants are provided for presentation: an "exhaustive" interface, an "adaptive" interface, and an "embedded" interface. Each interface variant facilitates users' in-situ discovery of commands but makes different trade-offs and presents different aspects of the verbal command suggestion space to support command knowledge and understanding. The exhaustive interface presents a list of all available operations and sample commands for each operation. The adaptive interface presents command suggestions using contextual pop-ups that appear when users directly manipulate the active window or parts of the interface. These suggestions appear alongside the target of the direct manipulation input. The embedded interface, finally, presents suggestions alongside one or more GUI elements. By varying when, where, and which sample commands are presented, the different interfaces promote different types of discovery and mapping between verbal commands and interface elements.
[0029] Fig. 4 shows an exemplary screen display 400 illustrating an exhaustive interface. The exhaustive interface is modeled on conventional command menus, displaying a list of available commands for all operations. In certain implementations, users can select a suitable on-screen trigger (for example, a microphone indicator) to be presented with a comprehensive list of available operations and sample commands for each operation. A portion of the resulting list is shown in Fig. 4. To aid readability, commands can be grouped by operation (as shown), and users can be allowed to expand / collapse groups of operations in order to focus on operations of interest. In certain implementations, the exhaustive interface can use contextual information to de-emphasize operations and commands that are not applicable to the active state of the interface. For example, if an image editing application is used and there are no shapes in the active window, commands for operations that correspond to shapes (such as fill color, border size, and the like) can be hidden (or otherwise de-emphasized). The exhaustive interface helps users discover the range of commands applicable to an active state of the interface.
[0030] Fig. 5A to Fig. 5E show exemplary screen displays illustrating an adaptive interface according to implementations of the present disclosure. In certain embodiments, the adaptive interface uses tooltip-like overlays to suggest verbal commands relating to a direct manipulation input target. Fig. 5A shows the presentation of sample command suggestions when a user provides direct manipulation input on a person in the image in the active window. Fig. 5B shows the presentation of sample command suggestions when a user provides direct manipulation input on a dropdown menu, for example in a properties panel of the multimodal interface. Fig. 5C shows the presentation of sample command suggestions when a user provides direct manipulation input on an entity detection button (that is, a selectable button that, when invoked, detects entities in the active window), which is presented, for example, in a toolbar. Fig. 5D shows the presentation of sample command suggestions when a user provides direct manipulation input on a microphone trigger (the trigger used for speaking). Fig. 5E shows the presentation of sample command suggestions when a user provides direct manipulation input on a shape shown in the active window of the multimodal interface.
[0031] To access command suggestions using the adaptive interface, users can long-press (for example, hold for more than 1 second) various parts of the interface, including the active window, widgets and buttons in the properties panel and toolbar, or the talk button. Suggestions are presented as pop-ups next to the user's finger. Suggestions can be specific to something directly under the user's finger (for example, a shape or image) or can apply more generally to the interface. To avoid the pop-ups being obscured by the hand when a touch-based interface is used, they can appear above the user's finger in the active window and be positioned to the left or right of the properties panel and toolbar.
[0032] In certain implementations, suggestions in the adaptive interface are context-sensitive, based on the target under the user's finger. If the target is a widget, the suggestions relate to the widget. If the user touches the active window, the suggestions relate to the object under the user's finger (for example, a background image, a person, a shape, and so on, if an image is in the active window). Suggestions to apply filters (for example, "Apply a grayscale filter") may appear, for instance, when a user long-presses a widget's add-effect button (that is, a selectable button that, when selected, invokes the ability to add effects to a widget) in the properties panel, or when a user directly manipulates an object in an image.
[0033] The system can suggest any number of exemplary command suggestions for any number of available operations within the scope of embodiments of the present disclosure. In certain embodiments, the system suggests one exemplary command per applicable operation. Command formulations and parameter values vary over time. For example, the user may initially see "Apply a sepia effect here" and later "Add a morph filter." To help users become familiar with using speech, the system initially suggests simpler formulations with fewer parameters and gradually presents users with more complex formulations with more parameters. This progression adapts to the end user's "learning." For example, once the user has issued simple commands frequently enough, the system switches to multi-parameter commands.
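The gradual move from simple to multi-parameter formulations can be sketched as follows. This is only an illustration under stated assumptions: the `Template` record, the `step` threshold of three issued commands, and the two-parameter template text are hypothetical and are not specified by the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    text: str          # formulation with named placeholders
    parameters: tuple  # names of the modifiable parameters


def parameter_limit(issued_count, step=3):
    """Hypothetical policy: allow one more parameter per `step` issued commands."""
    return 1 + issued_count // step


def pick_template(templates, issued_count, step=3):
    """Pick the most complex template the user is assumed to be ready for."""
    limit = parameter_limit(issued_count, step)
    eligible = [t for t in templates if len(t.parameters) <= limit]
    if not eligible:
        # Nothing is simple enough: fall back to the simplest template.
        return min(templates, key=lambda t: len(t.parameters))
    return max(eligible, key=lambda t: len(t.parameters))


def fill(template, values):
    """Assign parameter values to the modifiable parameters."""
    return template.text.format(**values)


SIMPLE = Template("Add a {filter} filter", ("filter",))
# Hypothetical two-parameter formulation, for illustration only.
COMPLEX = Template("Add a {filter} filter to the {target}", ("filter", "target"))
```

For a new user, `pick_template([SIMPLE, COMPLEX], issued_count=0)` returns the single-parameter template; after three issued commands the two-parameter template becomes eligible.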
[0034] As explained above, a workflow, as defined here, is a set of operations that assist a user in completing a task. For example, if a user is using an image editing application to modify a color image by converting it to black and white and changing the border color from black to white, a relevant workflow might include the operations "Apply a grayscale filter" and "Change the border color to white." As the user follows a workflow, the adaptive interface limits the number of suggestions it presents and prioritizes commands that are appropriate to the workflow. For instance, a single verbal command to apply the sepia filter might be suggested if that is the next step in the predefined workflow. If no predefined workflow is available, the system, in certain implementations, defaults to the strategy of suggesting one command per applicable operation.
[0035] Figs. 6A to 6D show exemplary screen displays illustrating an embedded interface according to implementations of the present disclosure. The embedded interface is similar to the adaptive interface, with two main differences. First, it creates a visual mapping between GUI elements and their corresponding verbal commands by "extending" the GUI widgets with command suggestions. Second, it may not take the user's task (that is, the workflow) into account when selecting relevant examples. The adaptive interface is thus of the "high precision, low recall" type, whereas the embedded interface is of the "lower precision, high recall" type.
[0036] In certain implementations, the embedded interface presents command suggestions alongside the application's GUI. To view command suggestions, users can long-press various parts of the interface. For example, if the user long-presses the active window, the system may present command suggestions within the properties panel (Fig. 6B and Fig. 6C). In exemplary embodiments, highlighted text (for example, text displayed in color, in bold, or the like) can correspond to the verbal command suggestions that extend the GUI widgets in the properties panel. In certain embodiments, users can also long-press directly on the toolbar (Fig. 6A) or on the properties panel (instead of only long-pressing objects in the active window). A long press on the talk button displays sample command suggestions that correspond to objects in the active window (Fig. 6D) and also embeds commands in the toolbar and the properties panel.
[0037] Because the embedded interface extends the existing GUI widgets, it uses command templates instead of command examples. For example, the command template "Change border color to" might appear next to a drop-down menu for changing the border color. In certain implementations, the system displays the same template throughout a session to provide consistency and to give users confidence in how to phrase a command. Since the toolbar offers limited space for embedding text commands, in certain implementations, suggestions for tools in the toolbar might take the form of command examples rather than templates, similar to the adaptive interface. The examples presented when the user activates a microphone trigger follow the same procedure as in the adaptive interface.
[0038] In certain embodiments, instead of presenting exemplary commands in a user interface so that they can be read by the user, the system can present command suggestions to the user verbally (that is, it can "speak" the command via a loudspeaker connected to a user computing device, such as the user computing device 210 of Fig. 2). In certain exemplary embodiments, a selectable "Speak Commands" button (not shown) can be selected by the user to request that the system provide command suggestions verbally. In embodiments, the system's default setting may include verbal presentation. Any and all such variations, and any combination thereof, are within the scope of the embodiments of this disclosure.
[0039] In certain implementations, once a user issues a verbal command, a combination of a template-based and a lexicon-based parser can be used to interpret the received verbal command. Speech parsers are known to a person of ordinary skill in the art and are therefore not described further here. Operations, targets, and parameters of the verbal command can be identified by comparing the interpreted verbal input with predefined templates. If the interpreted verbal input does not match a template, the system can tokenize the verbal command string and search for specific keywords to infer this information. In cases where the verbal command does not specify a target, the system can infer the target from the interface state (for example, from which objects were previously selected) or from direct manipulation input (for example, from which object was pointed to when the verbal command was given). In this way, direct manipulation can be used to specify parts of a verbal command (or to resolve its ambiguity).
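The combination of template matching, keyword fallback, and target inference described in this paragraph can be sketched as follows. This is a minimal sketch under stated assumptions, not the parser of the disclosure: the `TEMPLATES` patterns, the `LEXICON` table, the operation names, and the rule that a pointed-at object takes priority over a previously selected one are all hypothetical.

```python
import re

# Hypothetical predefined templates: (operation name, pattern with named parameters).
TEMPLATES = [
    ("add_filter", re.compile(r"(?:apply|add) an? (?P<filter>\w+) (?:filter|effect)")),
    ("set_border_color", re.compile(r"change the border color to (?P<color>\w+)")),
    ("set_fill_color", re.compile(r"change the fill color to (?P<color>\w+)")),
]

# Hypothetical lexicon for the fallback: keyword -> (operation, parameter name).
LEXICON = {
    "sepia": ("add_filter", "filter"),
    "grayscale": ("add_filter", "filter"),
    "morph": ("add_filter", "filter"),
}


def parse(utterance, pointed_target=None, selected_target=None):
    """Interpret a transcribed command; returns operation, params, and target."""
    text = utterance.lower().strip().rstrip(".!?")
    operation, params = None, {}
    for name, pattern in TEMPLATES:
        match = pattern.fullmatch(text)
        if match:
            operation, params = name, match.groupdict()
            break
    else:
        # No template matched: tokenize and look for known keywords.
        for token in re.findall(r"[a-z]+", text):
            if token in LEXICON:
                operation, param_name = LEXICON[token]
                params = {param_name: token}
                break
    # The command names no target, so infer it. Assumption: the object being
    # pointed at wins over the previously selected object.
    target = pointed_target or selected_target
    return {"operation": operation, "params": params, "target": target}
```

For example, `parse("make it sepia", pointed_target="photo")` matches no template but is resolved through the keyword "sepia", with the pointed-at object supplying the target.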
[0040] In certain implementations, the system includes a feedback mechanism for cases where a verbal command has not been successfully interpreted. For all three interfaces, a feedback area can be displayed below the text box and can also show sample command suggestions generated in a manner similar to that described above. However, instead of responding to direct manipulation input, the presented suggestions are given in response to unrecognized verbal input. To suggest sample commands in this area, the system infers the most likely error type, for example, based on heuristics (heuristics are known to a person of ordinary skill in the art and are therefore not described further here). The first type of error is a phrasing error. A phrasing error is identified when a command contains a valid parameter but is inconsistent with the grammar or lacks keywords (for example, "Make sepia"). In such cases, the system may suggest a sample command using that parameter value (for example, "Add a sepia filter"). A second type of error is a parameter error. A parameter error is detected when a valid operation exists but a parameter value is missing or unsupported (for example, "Change the fill color" or "Add the retro filter"). In cases of parameter errors, the feedback indicates that the command is incomplete and presents a list of supported values with an example (for example, "Change the fill color to green"). A third type of error, namely an operation-object mapping error, occurs when the system infers both an operation and parameters, but the command is directed at an unsupported object (for example, saying "Apply a Morph filter" while pointing at a rectangle). In this case, the feedback can list the applicable object types (in this example, images). Finally, if the system can infer neither the operation nor the parameter in a command, it interprets this as a fourth type of error, namely an operation recognition error, and indicates to the user that they should try one of the suggested verbal commands.
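The four error types and the corresponding feedback can be sketched as a simple decision procedure. This is a hedged illustration only: the disclosure mentions heuristics without specifying them, so the `SUPPORTED` capability table, the operation names, the order of the checks, and the feedback wording are assumptions made for this example.

```python
from enum import Enum


class ErrorType(Enum):
    PHRASING = "phrasing"
    PARAMETER = "parameter"
    OPERATION_OBJECT = "operation-object mapping"
    OPERATION_RECOGNITION = "operation recognition"


# Hypothetical capability table: supported values, object types, sample phrasing.
SUPPORTED = {
    "add_filter": {"values": {"sepia", "grayscale", "morph"},
                   "targets": {"image"},
                   "example": "Add a {value} filter"},
    "set_fill_color": {"values": {"green", "red", "white"},
                       "targets": {"shape"},
                       "example": "Change the fill color to {value}"},
}


def classify(operation, value, target_type):
    """Return the inferred ErrorType, or None if the command is valid.

    `operation` and `value` are whatever was inferred from speech (None if
    nothing was inferred); `target_type` is the type of the addressed object.
    """
    spec = SUPPORTED.get(operation)
    if spec is None:
        known_value = any(value in s["values"] for s in SUPPORTED.values())
        return ErrorType.PHRASING if known_value else ErrorType.OPERATION_RECOGNITION
    if value not in spec["values"]:
        return ErrorType.PARAMETER  # value missing or unsupported
    if target_type not in spec["targets"]:
        return ErrorType.OPERATION_OBJECT
    return None


def feedback(error, operation, value):
    """Build the feedback text for the feedback area."""
    if error is ErrorType.PHRASING:
        op = next(n for n, s in SUPPORTED.items() if value in s["values"])
        return "Try: " + SUPPORTED[op]["example"].format(value=value)
    if error is ErrorType.PARAMETER:
        spec = SUPPORTED[operation]
        values = sorted(spec["values"])
        return ("Incomplete command. Supported values: " + ", ".join(values)
                + ". Example: " + spec["example"].format(value=values[0]))
    if error is ErrorType.OPERATION_OBJECT:
        return "This command applies to: " + ", ".join(sorted(SUPPORTED[operation]["targets"]))
    return "Command not recognized. Try one of the suggested commands."
```

With this table, "Make sepia" (a value but no operation) is classified as a phrasing error and yields the suggestion "Add a sepia filter", mirroring the example in the paragraph above.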
[0041] In certain implementations, the system includes a feedback mechanism for situations where a user relies solely on direct manipulation to perform a task or action. For example, if a user selects a color in a dialog box using only a mouse, the system can inform the user (for example, in the feedback area below the dialog box) that the command "Change the color to red" can be spoken instead of using the mouse. This proactive approach not only makes the user aware that verbal commands are available but also guides the user with example commands and appropriate command formulations.
[0042] Fig. 7 shows a schematic diagram illustrating an exemplary method 700 for facilitating the discovery of verbal commands in a multimodal interface. As indicated at block 710, a target associated with a direct manipulation input received from a user via a multimodal interface associated with a user computing device (for example, the user computing device 210 of Fig. 2) is determined (for example, by the target determination component 218 of the verbal command detection engine 212 of Fig. 2). As indicated at block 712, a set of operations on which the verbal command suggestions are to focus is selected (for example, by the operation subset selection component 222 of the verbal command detection engine 212 of Fig. 2). As indicated at block 714, one or more verbal command suggestions relevant to the selected set of operations are generated (for example, by the verbal command suggestion generation component 224 of the verbal command detection engine 212 of Fig. 2). Finally, as indicated at block 716, at least some of the generated verbal command suggestions are provided for presentation in conjunction with the multimodal user interface (for example, by the presentation component 226 of the verbal command detection engine 212 of Fig. 2) in order to facilitate the discovery of verbal commands that the system understands.
[0043] Fig. 8 shows a schematic diagram illustrating another exemplary method 800 for facilitating the discovery of verbal commands in a multimodal interface. As indicated at block 810, a target associated with a direct manipulation input received from a user of a multimodal user interface is determined (for example, by the target determination component 218 of the verbal command detection engine 212 of Fig. 2). As indicated at block 812, a plurality of operations associated with the determined target is determined (for example, by the operation determination component 220 of the verbal command detection engine 212 of Fig. 2). As indicated at block 814, the operations of the plurality of operations are ranked relative to each other to generate a suggestion ranking (for example, by the operation ranking component 228 of the operation subset selection component 222 of the verbal command detection engine 212 of Fig. 2). As indicated at block 816, a subset of the plurality of operations is selected, at least in part using the suggestion ranking (for example, by the operation subset selection component 222 of the verbal command detection engine 212 of Fig. 2). As indicated at block 818, one or more verbal command suggestions relevant to the subset of operations are generated (for example, by the verbal command suggestion generation component 224 of the verbal command detection engine 212 of Fig. 2). Finally, as indicated at block 820, at least some of the generated verbal command suggestions are provided for presentation in conjunction with the multimodal user interface (for example, by the presentation component 226 of the verbal command detection engine 212 of Fig. 2) in order to facilitate the discovery of verbal commands that the system understands.
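The sequence of blocks 812 to 820 can be sketched as a small pipeline. This is an illustrative sketch under stated assumptions: the `Operation` record, the function names, and in particular the scoring rule (workflow relevance first, then operations the user has issued least often, then those displayed least often) are hypothetical; the disclosure names ranking factors but does not fix how they are weighted.

```python
from dataclasses import dataclass


@dataclass
class Operation:
    name: str
    target_type: str       # type of target the operation applies to
    template: str          # formulation with a {value} placeholder
    default_value: str     # parameter value used in the suggestion
    issued_count: int = 0  # times the user has issued this operation
    shown_count: int = 0   # times this operation has been suggested


def determine_operations(target_type, catalog):
    """Block 812: operations associated with the determined target."""
    return [op for op in catalog if op.target_type == target_type]


def rank(operations, workflow_next=None):
    """Block 814: hypothetical ranking; a lower key means a higher rank."""
    def key(op):
        return (op.name != workflow_next, op.issued_count, op.shown_count)
    return sorted(operations, key=key)


def suggest(target_type, catalog, workflow_next=None, limit=2):
    """Blocks 812 to 820 for an already determined target type."""
    ranked = rank(determine_operations(target_type, catalog), workflow_next)
    subset = ranked[:limit]              # block 816: select a subset
    suggestions = []
    for op in subset:                    # block 818: generate suggestions
        op.shown_count += 1
        suggestions.append(op.template.format(value=op.default_value))
    return suggestions                   # block 820: hand over for presentation


# Hypothetical catalog for illustration.
CATALOG = [
    Operation("add_filter", "image", "Apply a {value} filter", "grayscale", issued_count=5),
    Operation("set_border_color", "image", "Change the border color to {value}", "white"),
    Operation("set_fill_color", "shape", "Change the fill color to {value}", "green"),
]
```

Favoring rarely issued and rarely displayed operations is one plausible way to surface commands the user has not yet discovered; a different weighting of the same factors would fit the description equally well.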
[0044] Accordingly, embodiments of the present disclosure relate to computing systems for facilitating the discovery of verbal commands using multimodal user interfaces. Such computing systems may include one or more processors and one or more computer storage media that store computer-usable instructions which, when used by the one or more processors, cause the one or more processors to perform certain functions. In certain embodiments, such functions may include: determining a target associated with a direct manipulation input received from a user via a multimodal user interface; selecting a set of operations relevant to the determined target; generating one or more verbal command suggestions relevant to the selected set of operations and the determined target; and providing at least a portion of the generated one or more verbal command suggestions for presentation in conjunction with the multimodal user interface in such a way as to facilitate the discovery of verbal commands understood by the system.
[0045] Further embodiments of the present disclosure relate to computer-implemented methods for facilitating the discovery of verbal commands using multimodal interfaces. Such computer-implemented methods may include: determining a target associated with a direct manipulation input from a user of a multimodal user interface; determining a plurality of operations associated with the determined target; ranking the operations of the plurality of operations relative to each other to generate a suggestion ranking; selecting a subset of the plurality of operations relevant to the determined target, at least in part using the suggestion ranking; generating one or more verbal command suggestions relevant to the selected subset of operations and to the determined target; and providing at least a portion of the generated one or more verbal command suggestions for presentation in conjunction with the multimodal user interface in such a way that the user's ability to discover verbal commands is facilitated.
[0046] Some embodiments of the present disclosure relate to computing systems for facilitating the discovery of verbal commands using multimodal interfaces. Such computing systems may include means for generating one or more verbal command suggestions relevant to a target of a direct manipulation input received from a user via a multimodal user interface; and means for providing at least a portion of the one or more verbal command suggestions for presentation in conjunction with the multimodal user interface such that the discovery of verbal commands understood by the system is facilitated.
[0047] Having described implementations of the present disclosure, an exemplary operating environment in which embodiments of the present disclosure may be implemented is described below in order to provide a general context for various aspects. Fig. 9 shows an exemplary operating environment for implementing embodiments of the present disclosure, which is designated generally as computing device 900. The computing device 900 is merely one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Nor should the computing device 900 be interpreted as having any dependency or requirement relating to any one of the components shown or to any combination thereof.
[0048] Embodiments can be described in the general context of computer code or machine-usable instructions, including computer-executable instructions such as program modules, that are executed by a computer or other machine, such as a personal digital assistant or other handheld device. In general, program modules, including routines, programs, objects, components, data structures, and the like, refer to code that performs specific tasks or implements specific abstract data types. Embodiments of the present disclosure can be implemented in a variety of system configurations, including handheld devices, consumer electronics devices, general-purpose computers, specialized computing devices, and the like. Embodiments of the present disclosure can also be implemented in distributed computing environments, where tasks are performed by remote processing devices connected via a communication network.
[0049] With continued reference to Fig. 9, the computing device 900 includes a bus 910 that directly or indirectly couples the following devices: a memory 912, one or more processors 914, one or more presentation components 916, input/output (I/O) ports 918, I/O components 920, and an illustrative power supply 922. The bus 910 represents what may be one or more buses (such as an address bus, a data bus, or a combination thereof). Although the various blocks of Fig. 9 are shown with lines for the sake of clarity, in reality the delineation of the various components is not so clear, and, figuratively speaking, the lines would more accurately be gray and blurred. For example, a presentation component, such as a display device, can be considered an I/O component. Furthermore, processors have memory. It is recognized that such is the nature of the art, and it is reiterated that the diagram of Fig. 9 is merely illustrative of an exemplary computing device that can be used in conjunction with one or more embodiments of the present invention. No distinction is made between categories such as "workstation", "server", "laptop", "handheld device", and the like, since all of them are within the scope of Fig. 9 and are covered by the term "computing device".
[0050] The computing device 900 typically includes a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computing device 900 and include both volatile and non-volatile media, as well as removable and non-removable media. By way of example, and not limitation, computer-readable media can include computer storage media and communication media. Computer storage media include both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storing information, such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, DVD or other optical disk storage, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be accessed by the computing device 900. Computer storage media do not include signals per se. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal, such as a carrier wave or other transport mechanism, and include any information delivery media. The term "modulated data signal" refers to a signal in which one or more properties are set or modified such that information is encoded in the signal. By way of example, and not limitation, communication media include wired media, such as a wired network or a direct-wired connection, as well as wireless media, such as acoustic, radio-frequency, infrared, and other wireless media. Combinations of any of the foregoing are also included within the scope of computer-readable media.
[0051] The memory 912 includes computer storage media in the form of volatile and/or non-volatile memory. The memory can be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state drives, hard drives, optical drives, and the like. The computing device 900 includes one or more processors that read data from various entities, such as the memory 912 or the I/O components 920. The presentation component(s) 916 present data to a user or other device. Examples of presentation components include a display device, a loudspeaker, a printing component, a vibration component, and the like.
[0052] The I/O ports 918 allow the computing device 900 to be logically coupled to other devices, including the I/O components 920, some of which may be built in. Illustrative components include a microphone, joystick, gamepad, satellite dish, scanner, printer, wireless device, and the like. The I/O components 920 can provide a natural user interface (NUI) that processes air gestures, speech, or other physiological inputs generated by a user. In some cases, inputs can be transmitted to a suitable network element for further processing. An NUI can implement any combination of speech recognition, touch and stylus recognition, facial recognition, biometric recognition, gesture recognition both on and adjacent to a screen, air gestures, head and eye tracking, and touch recognition associated with displays on the computing device 900. The computing device 900 can be equipped with depth cameras, such as stereoscopic camera systems, infrared camera systems, RGB camera systems, and combinations thereof, for gesture detection and recognition. Additionally, the computing device 900 can be equipped with accelerometers or gyroscopes that enable the detection of motion.
[0053] As described above, implementations of the present disclosure relate to a framework for generating and presenting examples of verbal commands (for example, commands in natural language) in order to facilitate the discovery of relevant verbal commands that systems supporting multimodal interfaces understand and to enable users to learn the available verbal commands over time. The present disclosure has been described with reference to particular embodiments, which are intended in all respects to be illustrative rather than restrictive. Alternative embodiments will be apparent to a person of ordinary skill in the art to which the present invention pertains, without departing from its scope.
[0054] It is evident from the foregoing that the present disclosure is well suited to attaining all the ends and objects set forth above, together with further advantages that are obvious or inherent to the system and the method. It should be understood that certain features and subcombinations are useful and can be employed without reference to other features and subcombinations. This is contemplated by, and is within the scope of, the claims.
Claims
[1] A computing system, comprising: one or more processors; and one or more computer storage media storing computer-usable instructions that, when used by the one or more processors, cause the one or more processors to: determine a target associated with a direct manipulation input received from a user via a multimodal user interface; select a set of operations relevant to the determined target; generate one or more verbal command suggestions relevant to the selected set of operations and the determined target; and provide at least a portion of the generated one or more verbal command suggestions for presentation in conjunction with the multimodal user interface in such a way as to facilitate the discovery of verbal commands that the system understands.
[2] The computing system according to claim 1, wherein the computer-usable instructions, when used by the one or more processors, cause the one or more processors to select the set of operations relevant to the determined target by: determining a plurality of operations associated with the determined target; ranking the operations of the plurality of operations relative to each other to generate a suggestion ranking; and selecting the set of operations relevant to the determined target at least in part using the suggestion ranking.
[3] The computing system according to claim 2, wherein the computer-usable instructions, when used by the one or more processors, cause the one or more processors to rank the operations of the plurality of operations relative to each other, to generate the suggestion ranking, based on at least one of: a type associated with the determined target, a relevance to a workflow in which the user is engaged, a count of issued operations, and a count of displayed operations.
[4] The computing system according to any one of the preceding claims, wherein the computer-usable instructions, when used by the one or more processors, cause the one or more processors to generate the one or more verbal command suggestions relevant to the selected set of operations by: selecting a set of formulation templates relevant to the selected set of operations; and generating the one or more verbal command suggestions using the set of formulation templates.
[5] The computing system according to claim 4, wherein the computer-usable instructions, when used by the one or more processors, cause the one or more processors to select the set of formulation templates based on at least one of: a type of the received direct manipulation input, a complexity of each formulation template included in the set of formulation templates, a count of issued templates, and a count of displayed templates.
[6] The computing system according to claim 4 or 5, wherein at least one formulation template of the set of formulation templates includes a modifiable parameter, and wherein the computer-usable instructions, when used by the one or more processors, further cause the one or more processors to assign a parameter value to the modifiable parameter in order to generate the one or more verbal command suggestions.
[7] The computing system according to claim 6, wherein the computer-usable instructions, when used by the one or more processors, cause the one or more processors to assign the parameter value to the modifiable parameter based on at least one of an active state of the target and a relevance to a workflow in which the user is engaged.
[8] The computing system according to any one of the preceding claims, wherein the received direct manipulation input is a touch input, a keyboard input, an eye-tracking input, a gesture input, or a mouse input.
[9] The computing system according to any one of the preceding claims, wherein at least one of the one or more verbal command suggestions is a command suggestion in natural language.
[10] A computer-implemented method, comprising: determining a target associated with a direct manipulation input received from a user of a multimodal user interface; determining a plurality of operations associated with the determined target; ranking the operations of the plurality of operations relative to each other to generate a suggestion ranking; selecting a subset of the plurality of operations relevant to the determined target, at least in part using the suggestion ranking; generating one or more verbal command suggestions relevant to the subset of operations and the determined target; and providing at least a portion of the generated one or more verbal command suggestions for presentation in conjunction with the multimodal user interface in such a way that the user's ability to discover verbal commands is facilitated.
[11] The computer-implemented method according to claim 10, wherein the operations of the plurality of operations are ranked relative to each other based on at least one of: a type associated with the determined target, a relevance to a workflow in which the user is engaged, a count of issued operations, and a count of displayed operations.
[12] The computer-implemented method according to claim 10 or 11, wherein the one or more verbal command suggestions relevant to the subset of operations and to the determined target are generated by: selecting a set of formulation templates relevant to the subset of operations; and generating the one or more verbal command suggestions using the set of formulation templates.
[13] The computer-implemented method according to claim 12, wherein the set of formulation templates relevant to the subset of operations is selected based on at least one of: a type of the received direct manipulation input, a complexity of each formulation template included in the set of formulation templates, a count of issued templates, and a count of displayed templates.
[14] The computer-implemented method according to claim 12 or 13, wherein at least one formulation template of the set of formulation templates includes a modifiable parameter, and wherein the method further comprises: assigning a parameter value to the modifiable parameter in order to generate the one or more verbal command suggestions.
[15] The computer-implemented method according to claim 14, wherein the parameter value is assigned to the modifiable parameter based on at least one of an active state of the target and a relevance to a workflow in which the user is engaged.
[16] The computer-implemented method according to any one of claims 10 to 15, wherein the received direct manipulation input is a touch input, a keyboard input, an eye-tracking input, a gesture input, or a mouse input.
[17] The computer-implemented method according to any one of claims 10 to 16, wherein at least one of the one or more verbal command suggestions is a command suggestion in natural language.
[18] A computing system, comprising: means for generating one or more verbal command suggestions relevant to a target of a direct manipulation input received from a user via a multimodal user interface; and means for providing at least a portion of the one or more verbal command suggestions for presentation in conjunction with the multimodal user interface in such a way as to facilitate the discovery of verbal commands that the system understands.
[19] The computing system according to claim 18, further comprising: means for selecting a set of operations relevant to the target of the direct manipulation input, wherein the means for generating the one or more verbal command suggestions comprises: means for generating one or more verbal command suggestions relevant to the set of operations and to the target of the direct manipulation input.
[20] The computing system according to claim 19, further comprising: means for selecting a set of formulation templates relevant to the set of operations, wherein the means for generating the one or more verbal command suggestions comprises: means for generating the one or more verbal command suggestions using the set of formulation templates.