Voice-based interaction method, electronic device, storage medium, and program product

CN122201299A · Pending · Publication Date: 2026-06-12 · KE COM (BEIJING) TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
KE COM (BEIJING) TECHNOLOGY CO LTD
Filing Date
2026-04-21
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing voice interaction technologies suffer from low operational efficiency and poor flexibility in human-computer interaction scenarios. When the user's hands are restricted or the scenario is complex, this easily leads to long interaction latency and interaction failure.

Method used

By synchronously acquiring timestamped voice data streams and screen image sequences, and combining voice parsing and screen image recognition, semantic matching and location matching are performed to determine the target element corresponding to the voice command, thereby achieving contactless interaction.

Benefits of technology

It improves operational efficiency and flexibility, enhances the intuitiveness and trustworthiness of interaction, and can be applied to various visual interfaces, including H5 pages and custom controls, thus lowering the barrier to entry.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure provides a voice-based interaction method, an electronic device, a storage medium and a program product. The method of the present disclosure comprises: synchronously acquiring a voice data stream and a screen image sequence with timestamps; determining a starting time and an ending time of a voice instruction according to the voice data stream; selecting screen images corresponding to the starting time and the ending time from the screen image sequence; performing recognition processing on the voice instruction to obtain a voice analysis result containing an operation intention and target description information; performing recognition processing on the screen images to obtain an element set containing at least one interface element; wherein each interface element is associated with element coordinates and element attributes; performing semantic matching and position matching based on the voice analysis result and the element set to determine a target element corresponding to the voice instruction; and performing an operation on the target element according to the operation intention.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and more particularly to voice-based interaction methods, electronic devices, storage media, and program products.

Background Technology

[0002] With the development of voice technology, voice interaction technology is being applied in more and more scenarios, such as human customer service.

[0003] In existing human-computer interaction scenarios, voice participation in interactive operations is very limited. Direct touch operation or external mouse control is often used, requiring staff to physically touch the screen or receive assistance from a specialist. When business demonstration scenarios are complex or hands are restricted, problems such as low operational efficiency, poor flexibility, long interaction delays, and even eventual interaction failure can easily occur.

Summary of the Invention

[0004] This disclosure provides voice-based interaction methods, electronic devices, storage media, and program products.

[0005] According to a first aspect of this disclosure, a voice-based interaction method is provided. The method specifically includes: synchronously acquiring a timestamped voice data stream and a screen image sequence; determining the start and end times of a voice command based on the voice data stream; selecting a screen image from the screen image sequence corresponding to the start and end times; performing recognition processing on the voice command to obtain a voice parsing result containing operation intent and target description information; performing recognition processing on the screen image to obtain an element set containing at least one interface element; wherein each interface element is associated with element coordinates and element attributes; performing semantic matching and position matching based on the voice parsing result and the element set to determine the target element corresponding to the voice command; and executing an operation on the target element according to the operation intent.

[0006] According to the above scheme, recognition based on screen images avoids dependence on underlying interfaces or control trees, so the method can be applied to various visual interfaces, including H5 pages, video interfaces, and custom controls, and is therefore highly versatile. Furthermore, combining semantic matching and position matching allows complex instructions to be understood accurately, which improves positioning accuracy and facilitates contactless interaction. In scenarios where the user's hands are restricted, during demonstrations, or when accessibility assistance is needed, operational efficiency and flexibility are significantly improved. Users do not need to learn complex instructions; they can control the interface simply by using natural language, which lowers the barrier to entry and enhances the intuitiveness and trustworthiness of the interaction.

[0007] According to at least one embodiment of this disclosure, a screen image is processed to obtain an element set containing at least one interface element, including: performing at least one of the following processing methods: performing optical character recognition processing on the screen image to identify text regions in the screen image and extracting the text content and corresponding bounding box coordinates of each text region; performing visual feature recognition processing on the screen image to identify interactive elements in the screen image and extracting the element type identifier and corresponding bounding box coordinates of each interactive element; generating an element set based on the element attributes obtained from at least one of the text content and element type identifier, and the element coordinates obtained from the bounding box coordinates.

[0008] According to the above scheme, by performing optical character recognition (OCR) and/or visual feature recognition on screen images, element attributes and coordinates of interface elements can be extracted at the pixel level. This eliminates the dependence on the application's underlying interface or control tree and can be applied to any visual interface. Furthermore, using text content and element type identifiers as two sources of element attributes ensures accurate recognition of both text buttons and graphic icons, significantly improving the coverage and accuracy of element recognition.

[0009] According to at least one embodiment of this disclosure, voice commands are processed to obtain a voice parsing result containing operation intent and target description information, including: performing keyword recognition and semantic analysis on the voice commands to identify operation intent keywords and entity description keywords; determining an initial operation intent based on the operation intent keywords, wherein the initial operation intent includes at least one of click, input, and scroll; extracting at least one of initial position description information and initial text description information based on the entity description keywords; and generating a voice parsing result based on the initial operation intent and target description information.

[0010] According to the above scheme, by performing keyword recognition and semantic analysis on voice commands, the initial operation intent, initial location description information, and initial text description information can be accurately extracted, realizing the structured processing of natural language commands. This structured voice parsing result avoids erroneous operations caused by intent confusion in traditional schemes, significantly improving the accuracy and intelligence level of interaction.

[0011] According to at least one embodiment of this disclosure, the target description information includes location description information and / or text description information; determining the target element corresponding to the voice command by performing semantic matching and location matching based on the speech parsing result and the element set includes: performing at least one of the following processing methods: performing text matching on the element set based on the text description information to obtain a semantic matching result; performing spatial location matching on the element set based on the location description information to obtain a location matching result; and determining the target element based on at least one of the semantic matching result and the location matching result.

[0012] According to the above scheme, semantic matching results are obtained by performing text matching based on text description information, and/or location matching results are obtained by performing spatial location matching based on location description information, so that elements are located comprehensively by fusing multiple dimensions. That is, when a unique target cannot be determined from the semantic matching results alone, this embodiment can apply the location matching results as a secondary constraint to accurately locate buttons in specific areas, significantly improving the accuracy of target element determination.

[0013] According to at least one embodiment of this disclosure, text matching is performed on an element set based on text description information to obtain a semantic matching result, including: calculating the semantic similarity between the element attributes of each element in the element set and the text description information; sorting each element according to the semantic similarity to obtain a first candidate element list corresponding to the semantic matching result. Determining a target element based on at least one of the semantic matching result and the positional matching result includes: determining the candidate element with the highest semantic similarity in the first candidate element list as the target element; if there are multiple candidate elements with the same semantic similarity in the first candidate element list, and the target description information includes positional description information, then a secondary screening is performed on the first candidate element list based on the positional description information to determine the target element.

[0014] According to the above scheme, by calculating semantic similarity and generating a list of first candidate elements, the degree of matching between interface elements and user instructions can be quantitatively evaluated, ensuring that the elements with the closest semantics are selected first.
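The ranking and tie-breaking described in [0013] can be sketched as follows. This is a minimal illustration, not the disclosure's implementation: `difflib.SequenceMatcher` stands in for the semantic similarity measure, which the disclosure does not specify, and the element dictionaries, `first_candidate_list`, `pick_target`, and `in_top_left` are hypothetical names.

```python
from difflib import SequenceMatcher
from typing import Callable, Optional


def similarity(a: str, b: str) -> float:
    """Stand-in for semantic similarity (assumption): character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def first_candidate_list(elements: list[dict], text_description: str) -> list[tuple[float, dict]]:
    """Score every element and sort by descending similarity; ties keep input order."""
    scored = [(similarity(e["text"], text_description), e) for e in elements]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def pick_target(
    elements: list[dict],
    text_description: str,
    position_filter: Optional[Callable[[dict], bool]] = None,
) -> Optional[dict]:
    ranked = first_candidate_list(elements, text_description)
    if not ranked:
        return None
    best_score = ranked[0][0]
    tied = [e for score, e in ranked if score == best_score]
    # Secondary screening by position only when the top score is shared.
    if len(tied) > 1 and position_filter is not None:
        narrowed = [e for e in tied if position_filter(e)]
        if narrowed:
            return narrowed[0]
    return tied[0]


# Hypothetical element set with (x, y, width, height) bounding boxes.
elements = [
    {"text": "Save", "bbox": (900, 700, 80, 30)},
    {"text": "Cancel", "bbox": (20, 60, 80, 30)},
    {"text": "Save", "bbox": (20, 20, 80, 30)},
]


def in_top_left(element: dict) -> bool:
    x, y, _, _ = element["bbox"]
    return x < 500 and y < 400


target = pick_target(elements, "save", in_top_left)
```

Here the two "Save" elements tie on similarity, so the positional predicate selects the one in the upper-left region; without the predicate the first tied element is returned.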

[0015] According to at least one embodiment of this disclosure, spatial location matching of an element set based on location description information is performed to obtain a location matching result, including: parsing the location description information to obtain spatial location constraints, which include screen area range or relative positional relationships; filtering candidate elements that meet the spatial location constraints based on the element coordinates of each interface element to obtain a second candidate element list. Determining a target element based on at least one of the semantic matching result and the location matching result includes: determining the unique candidate element in the second candidate element list as the target element; if the second candidate element list contains multiple candidate elements, and the target description information includes text description information, performing a secondary filtering of the second candidate element list based on the text description information to determine the target element.

[0016] According to the above scheme, by first obtaining a second candidate element list based on location description information, and then performing a secondary filtering based on text description information when the list contains multiple candidate elements, the technical problem that single location matching cannot distinguish multiple elements within the same area can be effectively solved. This significantly improves the positioning accuracy in dense interface layouts and avoids misoperations caused by overlapping elements in different areas.
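A minimal sketch of the location-first flow in [0015], under stated assumptions: the phrase-to-region table `REGIONS`, the use of the bounding-box center as the test point, and the substring check used for the secondary text filter are all illustrative choices that the disclosure does not prescribe.

```python
from typing import Optional

# Hypothetical mapping from spoken phrases to normalized screen regions,
# given as (x0, y0, x1, y1) fractions of the screen width and height.
REGIONS = {
    "top left corner": (0.0, 0.0, 0.5, 0.5),
    "top right corner": (0.5, 0.0, 1.0, 0.5),
    "bottom left corner": (0.0, 0.5, 0.5, 1.0),
    "bottom right corner": (0.5, 0.5, 1.0, 1.0),
}


def center(bbox):
    x, y, w, h = bbox
    return x + w / 2, y + h / 2


def second_candidate_list(elements, position_description, screen_w, screen_h):
    """Keep the elements whose center falls inside the described screen region."""
    x0, y0, x1, y1 = REGIONS[position_description]
    result = []
    for element in elements:
        cx, cy = center(element["bbox"])
        if x0 * screen_w <= cx < x1 * screen_w and y0 * screen_h <= cy < y1 * screen_h:
            result.append(element)
    return result


def pick_target(elements, position_description, screen_w, screen_h,
                text_description: Optional[str] = None):
    candidates = second_candidate_list(elements, position_description, screen_w, screen_h)
    # Secondary filtering by text only when the region holds several candidates.
    if len(candidates) > 1 and text_description is not None:
        wanted = text_description.lower()
        narrowed = [e for e in candidates if wanted in (e.get("text") or "").lower()]
        if narrowed:
            candidates = narrowed
    return candidates[0] if candidates else None


elements = [
    {"text": None, "type": "menu icon", "bbox": (120, 20, 30, 30)},
    {"text": "Save", "bbox": (20, 20, 80, 30)},
    {"text": "Save", "bbox": (800, 700, 80, 30)},
]
target = pick_target(elements, "top left corner", 1000, 800, "save")
```

With a 1000 x 800 screen, the menu icon and the first "Save" button both fall in the top-left region, and the text description resolves the ambiguity.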

[0017] According to at least one embodiment of this disclosure, determining a target element based on at least one of semantic matching results and position matching results includes: performing text matching on an element set based on text description information to obtain a first candidate element list corresponding to the semantic matching result, and performing spatial position matching on the element set based on position description information to obtain a second candidate element list corresponding to the position matching result; determining the intersection of the first candidate element list and the second candidate element list; determining the unique element in the intersection as the target element; if the intersection contains multiple elements, calculating a comprehensive matching confidence score using semantic similarity and position similarity, and determining the candidate element with the highest comprehensive matching confidence score as the target element.

[0018] According to the above scheme, by determining the intersection of the first and second candidate element lists, strong verification based on both text and position constraints is achieved, significantly reducing the false match rate. Especially when the intersection contains multiple elements, ranking them by calculating the comprehensive matching confidence score effectively resolves the ambiguity issues that still exist under the dual constraints.
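The intersection step in [0017] can be illustrated with the sketch below. The disclosure does not state how the comprehensive matching confidence score is computed; the weighted sum with equal weights, the per-element score dictionaries, and the name `determine_target` are assumptions made for illustration.

```python
from typing import Optional


def determine_target(
    semantic_scores: dict[str, float],
    position_scores: dict[str, float],
    w_semantic: float = 0.5,
    w_position: float = 0.5,
) -> Optional[str]:
    """Return the id of the target element, or None if the lists do not intersect.

    semantic_scores: element id -> semantic similarity (first candidate list)
    position_scores: element id -> position similarity (second candidate list)
    """
    # Sorted so that the result is deterministic when confidence scores tie.
    common = sorted(set(semantic_scores) & set(position_scores))
    if not common:
        return None
    if len(common) == 1:
        return common[0]

    def confidence(element_id: str) -> float:
        # Assumed scoring rule: weighted sum of the two similarities.
        return (w_semantic * semantic_scores[element_id]
                + w_position * position_scores[element_id])

    return max(common, key=confidence)


target_id = determine_target(
    {"save_a": 0.9, "save_b": 0.9, "cancel": 0.4},
    {"save_a": 0.3, "save_b": 0.8},
)
```

Both "save" elements survive the intersection, and the one with the higher position similarity wins on the combined score.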

[0019] According to at least one embodiment of this disclosure, before performing an operation on a target element according to the operation intention, the method further includes: rendering a visual feedback effect on the target element on a display interface, the visual feedback effect including at least one of a highlight box, a pulse animation, and a color gradient mark; performing an operation on the target element according to the operation intention includes: performing the operation on the target element according to the operation intention after the visual feedback effect is rendered.

[0020] According to the above scheme, by rendering a visual feedback effect for the target element on the display interface before performing the operation according to the user's intention, and only executing the operation after the visual feedback effect has been rendered, the transparency and visualization of the interaction process are achieved. This embodiment, by introducing a visual feedback step, allows the user to intuitively see the target element locked by the system (through a highlighted box, pulse animation, or color gradient mark), providing an opportunity for perception and confirmation before the operation is executed. This significantly enhances the user's sense of control and trust in the non-contact interaction process.

[0021] According to a second aspect of this disclosure, an electronic device is provided, comprising: a memory storing execution instructions; and a processor executing the execution instructions stored in the memory, such that the processor performs the method of the first aspect according to any embodiment of this disclosure.

[0022] According to a third aspect of this disclosure, a readable storage medium is provided, wherein execution instructions are stored therein, which, when executed by a processor, implement the method of the first aspect according to any embodiment of this disclosure.

[0023] According to a fourth aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method of the first aspect according to any embodiment of this disclosure.

Description of the Drawings

[0024] The accompanying drawings illustrate exemplary embodiments of the present disclosure and, together with the description thereof, serve to explain the principles of the present disclosure. These drawings are included to provide a further understanding of the present disclosure and are incorporated in and constitute a part of this specification.

[0025] Figure 1 is a flowchart illustrating the voice-based interaction method provided in an embodiment of this disclosure.

[0026] Figure 2 is a schematic flowchart of a screen image acquisition method provided in an embodiment of the present disclosure.

[0027] Figure 3 is a schematic flowchart of a screen image recognition method provided in an embodiment of the present disclosure.

[0028] Figure 4 is a schematic flowchart of the speech parsing method provided in an embodiment of the present disclosure.

[0029] Figure 5 is a flowchart illustrating a method for determining a target element according to an embodiment of the present disclosure.

[0030] Figure 6 is a flowchart illustrating another method for determining target elements provided in an embodiment of this disclosure.

[0031] Figure 7 is a flowchart illustrating the location matching method provided in an embodiment of the present disclosure.

[0032] Figure 8 is a flowchart illustrating another method for determining a target element provided in an embodiment of this disclosure.

[0033] Figure 9 is a schematic diagram of a voice-based interaction process provided in an embodiment of this disclosure.

[0034] Figure 10 is a schematic block diagram of a voice-based interactive device according to one embodiment of the present disclosure.

[0035] Figure 11 is a schematic block diagram of an electronic device according to one embodiment of the present disclosure.

Detailed Implementation

[0036] The present disclosure will now be described in further detail with reference to the accompanying drawings and examples. It should be understood that the specific examples described herein are for illustrative purposes only and are not intended to limit the scope of the disclosure. Furthermore, it should be noted that, for ease of description, only the parts relevant to the present disclosure are shown in the accompanying drawings.

[0037] It should be noted that, where there is no conflict, the embodiments and features described in this disclosure can be combined with each other. The technical solutions of this disclosure will now be described in detail with reference to the accompanying drawings and embodiments.

[0038] Figure 1 is a flowchart illustrating a voice-based interaction method provided in an embodiment of this disclosure. The method shown in Figure 1 includes steps S101 to S105. This method can be applied to terminal devices, such as smartphones, computers, and other terminal devices with display functions.

[0039] Specifically, the method shown in Figure 1 includes step S101: in response to a user's voice command, acquiring a screen image.

[0040] It should be noted that the voice commands mentioned here refer to natural language voice signals containing operational requirements, input by the user through audio acquisition devices such as microphones. The screen images mentioned here refer to a visual snapshot of the user's terminal device's display interface at a specific moment (e.g., an image obtained through screenshotting), usually existing in the form of bitmaps or frame sequences.

[0041] When the system detects a user initiating voice interaction (e.g., triggered by a wake word or button press), it synchronously or nearly synchronously captures the screen image of the current terminal device's display interface. The key is to ensure that the acquired screen image is consistent with the interface state when the user issues the voice command; that is, to ensure that the captured screen image is exactly what the user wants to manipulate when speaking, avoiding subsequent positioning failures and operation malfunctions due to dynamic changes in interface elements. In practical applications, the screen capture module can be triggered simultaneously with the voice activation signal to acquire a timestamped screen image, ensuring strict consistency in data acquisition time.

[0042] Step S102: Recognize and process the voice command to obtain a voice parsing result containing the operation intention and target description information.

[0043] It should be noted that the voice parsing results mentioned here refer to the structured data generated after recognizing and analyzing voice commands. It includes at least the user's intended operation (such as clicking, inputting, scrolling, etc.) and target description information for locating the target (such as "top left corner", "save button", "top left corner save button", etc.).

[0044] In practical applications, Automatic Speech Recognition (ASR) technology can be used to convert audio signals into text, and then Natural Language Processing (NLP) technology can be used to parse the text structure. For example, when a user says "Click the save button in the upper left corner," the system recognizes the intention as "click," and the target description information includes the location description "upper left corner" and the text description "save button." Converting unstructured natural language into structured instructions that computers can understand provides a basis for subsequent matching. In an alternative approach, a local speech recognition engine can be invoked to complete the transcription, and a pre-trained intent recognition model can be used to extract keywords and entity information.
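To make the structure of the voice parsing result concrete, here is a minimal keyword-based sketch that operates on text already produced by an ASR step. The keyword tables, the `VoiceParseResult` fields, and `parse_command` are hypothetical; the disclosure leaves the choice of ASR engine and intent model open, and a real system would likely use a trained model rather than fixed word lists.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical keyword tables (assumptions for illustration only).
INTENT_KEYWORDS = {
    "click": ("click", "tap", "press"),
    "input": ("type", "enter", "input"),
    "scroll": ("scroll", "swipe"),
}
POSITION_PHRASES = (
    "upper left corner", "upper right corner",
    "lower left corner", "lower right corner",
)
FILLER_WORDS = {"the", "a", "an", "in", "at", "on", "of"}


@dataclass
class VoiceParseResult:
    intent: Optional[str]
    position_description: Optional[str]
    text_description: Optional[str]


def parse_command(transcript: str) -> VoiceParseResult:
    """Split a transcribed command into intent, position and text descriptions."""
    text = transcript.lower().strip().rstrip(".!?")
    words = text.split()

    # The operation intent is taken from the leading verb, if it is a known keyword.
    intent = None
    if words:
        for name, keywords in INTENT_KEYWORDS.items():
            if words[0] in keywords:
                intent = name
                text = " ".join(words[1:])
                break

    # The position description is the first known position phrase found.
    position = None
    for phrase in POSITION_PHRASES:
        if phrase in text:
            position = phrase
            text = text.replace(phrase, " ")
            break

    # Whatever remains, minus filler words, is treated as the text description.
    remaining = [w for w in text.split() if w not in FILLER_WORDS]
    return VoiceParseResult(intent, position, " ".join(remaining) or None)


result = parse_command("Click the save button in the upper left corner.")
```

For the example sentence from the paragraph above, `result` holds the intent "click", the position description "upper left corner", and the text description "save button".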

[0045] Step S103: Perform recognition processing on the screen image to obtain an element set containing at least one interface element; wherein each interface element is associated with element coordinates and element attributes.

[0046] It should be noted that the interface elements mentioned here refer to the smallest visual unit in the screen image that can be recognized and interacted with, such as buttons, icons, text boxes, and text paragraphs. The element coordinates mentioned here refer to the position information of the interface element in the screen image, which can be represented by bounding box coordinates (such as the coordinates of the top left corner and width and height) or center point coordinates. The element attributes mentioned here refer to the characteristic description information of the interface element, including but not limited to text content obtained through optical character recognition, element type identifiers obtained through visual feature recognition (such as "button", "input box", "icon"), etc., and also the supported interaction types of the interface element (such as click, input, etc.) identified through the backend code.

[0047] In practical applications, direct visual recognition processing of screen images can be performed without relying on the application's internal code structure. Specifically, as an alternative, Optical Character Recognition (OCR) technology can be used to detect text regions in an image, extract the text content as element attributes, and record its bounding box as element coordinates. Of course, some buttons do not have text content. For icons or buttons without text, visual feature recognition technology can be used to detect their shape, color, and other features, extract the element type identifier as an element attribute, and record its position as element coordinates. This can generate an element set containing all interactive elements on the interface. This vision-based recognition method has good versatility and can effectively solve the problem of lacking visual-level precise positioning.

[0048] Step S104: Based on the speech parsing results and the element set, perform semantic matching and position matching to determine the target element corresponding to the speech command.

[0049] The system compares the target description information from the speech analysis results with the element attributes and coordinates of each interface element in the element set. Semantic matching is mainly used to compare the text content in the text description and element attributes, such as calculating the similarity between the word "save" in the target description and various text areas on the interface; positional matching is mainly used to compare the position description and element coordinates, such as determining which elements are located in the "top left corner" area. By combining the results of semantic matching and positional matching, the system can filter out the target element that best matches the user's operation intent from the element set. This dual matching method can effectively solve the ambiguity problems that may arise from a single matching method. For example, when there are multiple "save" texts on the interface, combining the position information can accurately locate the button in the top left corner; when there are multiple top left corner elements on the interface, combining the text information can accurately locate the "save" button.

[0050] Step S105: Perform the operation on the target element according to the operation intention.

[0051] Once the target element and its coordinates are determined, corresponding input events can be simulated. For example, if the intended action is a click, a simulated click event is sent to the target element's coordinates; if the intended action is scrolling, the interface is scrolled to bring the target element into the visible area.
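As a rough sketch of this dispatch step, the example below converts a bounding box into a click point and hands an event to an injected callback. The `send_event` callback, the event names, and the choice of the bounding-box center as the click point are assumptions; the actual input-injection mechanism depends on the platform and is not specified here.

```python
from typing import Callable

BBox = tuple[int, int, int, int]  # (x, y, width, height)


def center_of(bbox: BBox) -> tuple[int, int]:
    x, y, w, h = bbox
    return x + w // 2, y + h // 2


def perform(intent: str, bbox: BBox, send_event: Callable[[str, int, int], None]) -> None:
    """Translate an operation intent into an input event at the element's center."""
    x, y = center_of(bbox)
    if intent == "click":
        send_event("click", x, y)
    elif intent == "scroll":
        # Hypothetical event asking the platform to bring this point into view.
        send_event("scroll_to", x, y)
    else:
        raise ValueError(f"unsupported intent: {intent}")


# A list stands in for the platform's input-injection layer.
events: list[tuple[str, int, int]] = []
perform("click", (20, 20, 80, 30), lambda kind, x, y: events.append((kind, x, y)))
```

Passing the event sink as a parameter keeps the matching logic independent of any particular operating system API.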

[0052] According to the scheme of this disclosure described above, recognition based on screen images avoids dependence on underlying interfaces or control trees, so the method can be applied to various visual interfaces, including H5 pages, video interfaces, and custom controls, and is therefore highly versatile. Furthermore, combining semantic matching and position matching allows complex commands to be understood accurately, which improves positioning accuracy and facilitates contactless interaction. This significantly enhances operational efficiency and flexibility in scenarios where the user's hands are restricted, during demonstrations, or when accessibility assistance is needed. Users do not need to learn complex commands; they can control the interface using only natural language, which lowers the barrier to entry and enhances the intuitiveness and trustworthiness of the interaction.

[0053] Figure 2 is a schematic flowchart illustrating a screen image acquisition method provided in an embodiment of this disclosure. As shown in Figure 2, in one or more embodiments of this disclosure, acquiring a screen image in response to a user's voice command includes: Step S201: Synchronously acquiring a voice data stream with timestamps and a sequence of screen images. Step S202: Determining the start and end times of the voice command based on the voice data stream. Step S203: Selecting the screen image corresponding to the start and end times from the sequence of screen images.

[0054] In practical applications, the system can launch the voice acquisition channel and the screen capture channel in parallel, ensuring that both are timestamped based on the same system clock source. For example, the voice acquisition module acquires audio at a sampling rate of 16kHz and timestamps the acquisition time of each audio frame, while the screen capture module captures screen images at a frequency of 1 frame per second or higher and timestamps the capture time of each screen image. This synchronous acquisition mechanism ensures that the voice content and screen state are comparable on the timeline, providing a basis for subsequent data alignment. Voice acquisition can be performed as a voice data stream, that is, a sequence of audio data continuously acquired through audio acquisition devices such as microphones. Screen image sequence acquisition can be performed by continuously capturing a sequence of screen image frames at a predetermined frequency using the screen capture module.

[0055] The system performs Voice Activity Detection (VAD) on the voice data stream, identifying the start and end points of valid voice segments. If the detected voice energy exceeds a preset threshold, that moment is recorded as the start time; if the voice energy then falls below the preset threshold and stays there for a certain duration, that moment is recorded as the end time. If no valid voice is detected, the input is determined to be an invalid instruction, and the process terminates or waits again. In this way, the continuous voice data stream is converted into instruction segments with clearly defined time boundaries.
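A minimal energy-threshold sketch of this boundary detection is shown below. It assumes the stream has already been reduced to `(timestamp, energy)` pairs per audio frame; `detect_command` and `min_silence_frames` are hypothetical names. Two details are assumptions because the text leaves them open: the end time is taken as the start of the sustained silent run, and a stream that ends while speech is still active is closed at its last frame.

```python
from typing import Optional


def detect_command(
    frames: list[tuple[float, float]],
    threshold: float,
    min_silence_frames: int,
) -> Optional[tuple[float, float]]:
    """Return (start_time, end_time) of the first voice command, or None.

    frames: (timestamp, energy) pairs in chronological order.
    """
    start = None
    silence_run = 0
    silence_start = None
    for timestamp, energy in frames:
        if start is None:
            # Waiting for the first frame whose energy exceeds the threshold.
            if energy > threshold:
                start = timestamp
            continue
        if energy < threshold:
            if silence_run == 0:
                silence_start = timestamp
            silence_run += 1
            if silence_run >= min_silence_frames:
                # Silence persisted long enough: the command ended where it began.
                return start, silence_start
        else:
            # Speech resumed, so a short pause does not end the command.
            silence_run = 0
    if start is None:
        return None  # No valid voice detected: invalid instruction.
    return start, frames[-1][0]


frames = [
    (0.0, 0.1), (0.1, 0.9), (0.2, 0.8), (0.3, 0.1),
    (0.4, 0.7), (0.5, 0.1), (0.6, 0.1), (0.7, 0.1),
]
boundaries = detect_command(frames, threshold=0.5, min_silence_frames=3)
```

The brief dip at 0.3 s is ignored because speech resumes at 0.4 s; the command is closed only after three consecutive low-energy frames starting at 0.5 s.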

[0056] Furthermore, the system iterates through the timestamps of the screen image sequence, searching for a target image that matches the start and end times. Specifically, if an image in the screen image sequence has a timestamp exactly equal to the start or end time, it is selected directly; if no matching image exists, the screen image with the timestamp closest to the end time is selected, or the screen image with a timestamp between the start and end times is selected. This is because the end time of the user's command usually represents the moment the intent is determined, and the screen state at this time best reflects the interface seen by the user. If no image in the screen image sequence has a timestamp within a reasonable time window, the acquisition is considered a failure, and a recapture or error message may be triggered.
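The selection rule can be sketched as below, assuming frames are `(timestamp, image)` pairs stamped from the same clock as the audio. `select_screen_image` and `max_gap` (the "reasonable time window") are hypothetical names. Of the fallbacks the paragraph allows, this sketch implements only the nearest-to-end-time option, and it checks for an exact match at the end time before the start time, which is an assumption about precedence.

```python
from typing import Any, Optional


def select_screen_image(
    frames: list[tuple[float, Any]],
    start: float,
    end: float,
    max_gap: float,
) -> Optional[tuple[float, Any]]:
    """Pick the frame that best represents the screen during the voice command."""
    by_timestamp = {timestamp: image for timestamp, image in frames}

    # 1. An exact timestamp match is selected directly.
    for t in (end, start):
        if t in by_timestamp:
            return t, by_timestamp[t]

    if not frames:
        return None

    # 2. Otherwise take the frame closest to the end of the command.
    timestamp, image = min(frames, key=lambda frame: abs(frame[0] - end))

    # 3. Outside the allowed window, report failure so the caller can recapture.
    if abs(timestamp - end) > max_gap:
        return None
    return timestamp, image


frames = [(0.0, "frame0"), (1.0, "frame1"), (2.0, "frame2")]
selected = select_screen_image(frames, start=0.4, end=1.3, max_gap=0.5)
```

A `None` result corresponds to the failure case in which the caller triggers a recapture or an error message.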

[0057] It should be noted that although this embodiment preferably uses synchronous acquisition to maximize time consistency, serial acquisition can also be used in other feasible implementations. That is, the screen image acquisition step is triggered only after the user's voice command is received and recognized. In practical applications, users can choose to acquire screen images and voice commands in synchronous parallel or serial manner as needed.

[0058] According to the scheme of this disclosure described above, by simultaneously acquiring timestamped voice data streams and screen image sequences, and selecting the corresponding screen images according to the start and end times of the voice command, it is ensured that the screen images used for recognition are highly consistent with the interface state when the user issues the voice command. Through timestamp alignment, the screen state at the time of the user's speech is accurately located, significantly improving the accuracy of element positioning and the reliability of the interaction.

[0059] Figure 3 is a schematic flowchart illustrating a screen image recognition method provided in an embodiment of this disclosure. As shown in Figure 3, in one or more embodiments of this disclosure, performing recognition processing on the screen image to obtain an element set containing at least one interface element includes performing at least one of the following processing methods: Step S301: Perform optical character recognition processing on the screen image to identify text regions in the screen image and extract the text content and corresponding bounding box coordinates of each text region. Step S302: Perform visual feature recognition processing on the screen image to identify interactive elements in the screen image and extract the element type identifier and corresponding bounding box coordinates of each interactive element. Step S303: Generate an element set based on the element attributes obtained from at least one of the text content and element type identifier, and the element coordinates obtained from the bounding box coordinates.

[0060] In practical applications, the system can use the OCR engine to recognize screen images. If a text region is detected in the image, the system recognizes the text content within that region and records the bounding box coordinates of that text region. If no text region is detected, this step is skipped or an empty result is generated. For example, when the screen image contains button text such as "Save" or "Cancel," the OCR processing will output the text "Save" and its corresponding bounding box coordinates. Accurate recognition of elements with text labels on the interface ensures that text information can be accurately extracted as the basis for subsequent matching.

[0061] In an alternative approach, the system can analyze the screen image using a pre-trained visual model. If a visual object with interactive features (such as an icon of a specific shape or button style) is detected, its element type identifier is identified, and its bounding box coordinates are recorded. If no such interactive element is detected, this step is skipped. For example, when the screen image contains a textless "close" icon (×), the visual feature recognition process will output the element type identifier "icon" or "close button" and its corresponding bounding box coordinates. This approach addresses the problem that pure OCR processing cannot recognize purely graphic elements on the interface that lack text but possess visual features.

[0062] In one alternative approach, both processing methods can be selected simultaneously; that is, optical character recognition (OCR) and visual feature recognition (VCR) can be performed concurrently, and the results of both can be combined. In practical applications, users can choose any one or both image recognition schemes according to their actual needs.

[0063] Subsequently, based on at least one element attribute obtained from the text content and element type identifier, and the element coordinates obtained from the bounding box coordinates, an element set is generated. The system uniformly encapsulates the extracted text content or element type identifier into element attributes, encapsulates the bounding box coordinates into element coordinates, and integrates all identified elements into a single element set. If no interface element is detected during the recognition process, an empty element set is generated, and a re-capture or error reporting process is triggered.
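The encapsulation step can be sketched as follows. This is an illustrative data layout only: the `Element` class, its field names, and the `(label, box)` tuple shape assumed for the OCR and visual recognition outputs are hypothetical and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in screen pixels


@dataclass
class Element:
    box: Box                      # element coordinates
    text: Optional[str] = None    # element attribute from OCR, if any
    kind: Optional[str] = None    # element attribute from visual recognition, if any


def build_element_set(ocr_results, visual_results):
    """Merge OCR and visual recognition outputs into a single element set.

    Assumption: each input is a list of (label, box) tuples. An empty
    return value corresponds to the empty element set in the text.
    """
    elements = [Element(box=box, text=text) for text, box in ocr_results]
    elements += [Element(box=box, kind=kind) for kind, box in visual_results]
    return elements
```

A caller would treat an empty list as the signal to re-capture the screen or report an error.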

[0064] It should be noted that element information can also be obtained based on the UI tree (such as the Accessibility Node). However, the UI tree-based approach relies on the application exposing its underlying structural information, and often fails to obtain effective data for custom-drawn interfaces, video content, or cross-platform H5 pages.

[0065] Based on the solution disclosed above, by performing optical character recognition (OCR) and / or visual feature recognition (VCR) on screen images, element attributes and coordinates of interface elements can be extracted at the pixel level, generating a comprehensive set of elements. This eliminates the reliance on the application's underlying interface or control tree, making the method applicable to any visual interface. Furthermore, using both text content and element type identifiers as sources of element attributes ensures accurate recognition of both text buttons and graphic icons, significantly improving the coverage and accuracy of element recognition. This provides a data foundation for subsequent semantic and positional matching, thereby achieving precise WYSIWYG interaction.

[0066] In one or more embodiments of this disclosure, Figure 4 is a flowchart of the speech parsing method provided in an embodiment of this disclosure. As shown in Figure 4, performing recognition processing on the voice command to obtain a voice parsing result containing the operation intent and target description information includes: Step S401: Perform keyword recognition and semantic analysis on the voice command to identify operation intent keywords and entity description keywords. Step S402: Based on the operation intent keywords, determine the initial operation intent, which includes at least one of click, input, and scroll. Step S403: Based on the entity description keywords, extract at least one of initial location description information and initial text description information. Step S404: Generate the voice parsing result based on the initial operation intent and the target description information.

[0067] In practical applications, the system uses a natural language processing model to scan the text converted from the voice command. If a word matching a preset intent library is detected, it is marked as an operation intent keyword; if a word matching an entity library is detected, it is marked as an entity description keyword. If no operation intent keyword is detected, the system can default to a view or click intent, or prompt the user to provide additional instructions; if no entity description keyword is detected, the entity description keyword is empty, and subsequent matching relies solely on location or global context. For example, when a user says "Click the save button in the upper left corner," the system recognizes the operation intent keyword "click" and the entity description keywords "upper left corner" and "save."

[0068] Further, based on the identified operation intent keywords, an initial operation intent is determined, which includes at least one of click, input, and scrolling. The system maps the identified operation intent keywords to standard operation types. It should be noted that "click" here can also refer to similar or related click operations such as selection, triggering, pressing, double-clicking, and right-clicking; "input" can also refer to synonyms such as filling in, typing, and entering; and "scrolling" can also refer to synonyms such as swiping, page turning, and scrolling down. Through this synonym-based extended mapping, the system can understand the user's diverse natural expressions. If the operation intent keywords cannot be mapped to any of the above initial operation intents, it is determined to be an unknown intent, and the process can be terminated or enter a chat mode. This standardizes diverse natural language expressions into machine-executable operation types.
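The synonym-based mapping can be pictured as a small lookup table. The sketch below uses only the synonyms listed in the paragraph above; the table name `INTENT_SYNONYMS` and the function `normalize_intent` are hypothetical, and a real intent library would presumably be larger.

```python
# Synonym table built from the examples in the text (illustrative, not exhaustive).
INTENT_SYNONYMS = {
    "click": {"click", "select", "trigger", "press", "double-click", "right-click"},
    "input": {"input", "fill in", "type", "enter"},
    "scroll": {"scroll", "swipe", "page turn", "scroll down"},
}


def normalize_intent(keyword):
    """Map an operation intent keyword to a standard operation type.

    Returns None for an unknown intent, which the caller may handle by
    terminating the process or entering a chat mode.
    """
    word = keyword.strip().lower()
    for intent, synonyms in INTENT_SYNONYMS.items():
        if word in synonyms:
            return intent
    return None
```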

[0069] The system analyzes the semantic category of entity description keywords. If a keyword contains locative words (such as "top left," "middle," or "bottom"), it is extracted as initial location description information. If a keyword contains specific text content (such as "save," "confirm," or "username"), it is extracted as initial text description information. If an entity description keyword contains both locative words and text content, both initial location description information and initial text description information are extracted simultaneously. If it contains only one of these, only that type of information is extracted. If it contains neither, both are empty. For example, for the instruction "click save," the extracted initial text description information is "save," and the initial location description information is empty; for the instruction "click top left corner," the extracted initial location description information is "top left corner," and the initial text description information is empty.

[0070] The system uses the determined initial operation intent as the operation intent field, and integrates and encapsulates the extracted initial location description information and / or initial text description information into a target description information field, forming a structured speech parsing result. If no valid information is found during the extraction process, an empty speech parsing result is generated and an exception handling process is triggered. This completes the conversion from unstructured natural language to structured instruction data, providing standardized input for subsequent element matching.
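As a rough illustration of the structured parsing result, the sketch below separates entity description keywords into a location field and a text field. The `ParseResult` class, the `LOCATION_WORDS` set, and the exact-match test for locative words are assumptions made for the example; the disclosure leaves the semantic classification to a natural language processing model.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical locative vocabulary taken from the examples in the text.
LOCATION_WORDS = {"top left", "top left corner", "upper left corner", "middle", "bottom"}


@dataclass
class ParseResult:
    intent: str
    location: Optional[str] = None  # initial location description information
    text: Optional[str] = None      # initial text description information


def build_parse_result(intent, entity_keywords):
    """Classify each entity keyword as a location or a text description."""
    location = text = None
    for keyword in entity_keywords:
        k = keyword.strip().lower()
        if k in LOCATION_WORDS:
            location = location or k   # keep the first locative phrase
        elif k:
            text = text or k           # keep the first text description
    return ParseResult(intent, location, text)
```

For the instruction "Click the save button in the upper left corner", the keywords `["upper left corner", "save"]` fill both fields, while "click save" leaves the location field empty.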

[0071] Based on the solution disclosed above, keyword recognition and semantic analysis of voice commands can accurately extract the initial operation intent, initial location description information, and initial text description information, achieving structured processing of natural language commands. Furthermore, extracting location and text descriptions separately allows subsequent matching steps to flexibly employ semantic matching, location matching, or a combination of both, effectively supporting the parsing of complex commands and providing a reliable basis for precise, "what you say is what you get" interaction. This structured voice parsing result avoids misoperations caused by intent confusion in traditional solutions, significantly improving the accuracy and intelligence of the interaction.

[0072] In one or more embodiments of this disclosure, Figure 5 is a flowchart of a method for determining a target element provided in an embodiment of this disclosure. As shown in Figure 5, the target description information includes at least one of location description information and text description information. Performing semantic matching and location matching based on the speech parsing result and the element set to determine the target element corresponding to the voice command includes: Step S501: Perform at least one of the following: text matching on the element set based on the text description information to obtain a semantic matching result; spatial location matching on the element set based on the location description information to obtain a location matching result. Step S502: Determine the target element based on at least one of the semantic matching result and the location matching result.

[0073] In practical applications, if the target description information contains only text description information, only text matching is performed; if the target description information contains only location description information, only location matching is performed; if it contains both text description information and location description information, text matching and location matching can both be performed.

[0074] The specific matching process is as follows: If the target description information contains text description information, then text matching is performed on the element set based on the text description information to obtain semantic matching results. The system traverses each interface element in the element set, extracts the text content from the element attributes of each interface element, and calculates its text similarity with the text description information (for example, it can be calculated through edit distance or word vector similarity). If there are interface elements with a similarity higher than a preset threshold, these elements are included in the semantic matching results, for example, generating a first candidate element list sorted by similarity; if there are no interface elements with a similarity higher than the preset threshold, the semantic matching result is empty. If the target description information does not contain text description information, the text matching step is skipped, and the semantic matching result is empty. This is mainly used to locate interface elements with clear text identifiers, for example, when a user says "click save", the corresponding button is found by matching the text "save".
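The thresholded text matching can be sketched as follows. Note the substitution: instead of edit distance or word vector similarity, the example uses `difflib.SequenceMatcher` from the Python standard library as a simple stand-in similarity measure. The dictionary layout of an element and the threshold value 0.6 are assumptions for illustration.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Case-insensitive similarity in [0, 1]; a stand-in for edit distance."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_match(elements, description, threshold=0.6):
    """Return (score, element) pairs above the threshold, best first.

    Assumption: each element is a dict whose optional "text" key holds
    the text content from its element attributes.
    """
    scored = [
        (text_similarity(e["text"], description), e)
        for e in elements
        if e.get("text")
    ]
    hits = [pair for pair in scored if pair[0] > threshold]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return hits
```

An empty return value corresponds to the empty semantic matching result; the sorted list corresponds to the first candidate element list.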

[0075] In one alternative approach, if the target description information includes location description information, spatial location matching is performed on the element set based on the location description information to obtain the location matching result. The system parses the location description information and converts it into screen area constraints, for example, converting "top left corner" into the top left quarter of the screen area. Then, it iterates through the element set, filtering interface elements located within the screen area constraints based on the element coordinates of each interface element. If interface elements that meet the location constraints exist, these elements are included in the location matching result, for example, generating a second candidate element list; if no interface elements that meet the location constraints exist, the location matching result is empty. If the target description information does not contain location description information, the spatial location matching step is skipped, and the location matching result is empty. This is used to locate interface elements with clear directional characteristics, such as when a user says "click the top left corner," elements in the corresponding area are found by filtering by coordinates.

[0076] Furthermore, the system makes logical judgments based on the combination of results obtained from the first two steps. If neither the semantic matching result nor the location matching result is empty, the intersection or a combined score of the two is calculated, and the element that matches both the text description and the location description is identified as the target element. If only the semantic matching result is not empty, the element with the highest similarity in the semantic matching result is directly identified as the target element. If only the location matching result is not empty, the element with the best location match in the location matching result is directly identified as the target element. If both the semantic matching result and the location matching result are empty, the matching is deemed to have failed, and re-recognition or a prompt asking the user to clarify the instruction may be triggered. Through this flexible combination logic, the system can effectively process different types of voice instructions, whether pure text instructions, pure location instructions, or compound instructions.

[0077] Based on the solution disclosed above, a multi-dimensional fusion element localization mechanism is achieved by obtaining semantic matching results through text matching based on text description information and / or location matching results through spatial location matching based on location description information. That is, when a unique target cannot be determined from the semantic matching results alone, this embodiment can apply the location matching results as a secondary constraint to accurately locate the button in a specific area; conversely, when multiple elements exist in the same area of the interface and the location matching results alone cannot distinguish them, this embodiment can apply the semantic matching results for filtering. This significantly improves the accuracy and robustness of target element determination.

[0078] In one or more embodiments of this disclosure, Figure 6 is a flowchart of another method for determining a target element provided in an embodiment of this disclosure. As shown in Figure 6, performing text matching on the element set based on the text description information to obtain the semantic matching result includes: Step S601: Calculate the semantic similarity between the element attributes of each element in the element set and the text description information. Step S602: Sort the elements according to semantic similarity to obtain a first candidate element list corresponding to the semantic matching result. Determining the target element based on at least one of the semantic matching result and the location matching result includes: Step S603: Determine the candidate element with the highest semantic similarity in the first candidate element list as the target element. Step S604: If there are multiple candidate elements with the same semantic similarity in the first candidate element list, and the target description information contains location description information, perform secondary filtering on the first candidate element list based on the location description information to determine the target element.

[0079] The system iterates through each interface element in the element set, extracts the text content or element type identifier from its element attributes, and compares it with the text description information in the voice command. For example, if the text description information is "Save" and the element attribute is "Save Button," then the semantic similarity score is calculated. If the element attribute and text description information are completely identical, the similarity score is the highest; if they are partially identical or semantically similar, the score decreases accordingly; if they are completely unrelated, the score is zero.

[0080] Furthermore, the elements are sorted according to semantic similarity to obtain a list of first candidate elements corresponding to the semantic matching results. The system uses the calculated similarity score as the sorting criterion, arranging the elements in the element set from highest to lowest. If the similarity score of all elements is zero, the list of first candidate elements is empty. Subsequently, it is determined whether the candidate element with the highest semantic similarity in the list of first candidate elements is unique. If the candidate element with the highest semantic similarity is unique, then that candidate element is directly determined as the target element. For example, if only one element in the list has the highest similarity score and is much higher than other elements, then that element is directly selected, and the process ends.

[0081] If multiple candidate elements with the same semantic similarity exist in the first candidate element list, the system further determines whether the target description information contains location description information. If the target description information does not contain location description information, secondary filtering cannot be performed using location, and the system can default to selecting the element ranked first in the list as the target element, or prompt the user to supplement location information. If the target description information contains location description information, secondary filtering is performed on the first candidate element list based on the location description information to determine the target element. Specifically, the secondary filtering process includes: first, parsing the location description information and converting it into specific screen area constraints, such as parsing "top left corner" as a rectangular area with a horizontal coordinate of 0 to half the width and a vertical coordinate of 0 to half the height; then, traversing multiple candidate elements with the same semantic similarity in the first candidate element list to obtain the element coordinates of each candidate element; finally, determining which candidate elements' element coordinates fall within the screen area constraints, and identifying the candidate elements falling within this area as the target elements. If multiple elements still exist after secondary filtering, the element size can be further considered, or the element closest to the center of the area can be selected by default; if no element falls within the area after secondary filtering, the area constraints can be relaxed, or a matching failure message can be displayed.

[0082] Based on the solution disclosed above, by calculating semantic similarity and generating a first candidate element list, the degree of matching between interface elements and user commands can be quantitatively evaluated, ensuring that the semantically closest elements are selected first. In particular, when multiple candidate elements with the same semantic similarity exist in the first candidate element list, determining whether the target description information contains location description information, and performing secondary filtering based on it if it does, effectively solves the problem of inaccurate positioning caused by multiple elements with identical text in the interface.

[0083] In one or more embodiments of this disclosure, Figure 7 is a flowchart of the location matching method provided in an embodiment of this disclosure. As shown in Figure 7, performing spatial location matching on the element set based on the location description information to obtain the location matching result includes: Step S701: Parse the location description information to obtain spatial location constraints, which include a screen area range or relative positional relationships. Step S702: Based on the element coordinates of each interface element, filter the candidate elements that meet the spatial location constraints to obtain a second candidate element list. Determining the target element based on at least one of the semantic matching result and the location matching result includes: Step S703: Determine the unique candidate element in the second candidate element list as the target element. Step S704: If the second candidate element list contains multiple candidate elements, and the target description information contains text description information, further filter the second candidate element list based on the text description information to determine the target element.

[0084] The system maps natural language location descriptions to specific regions in the screen coordinate system. For example, if the location description is "top left corner," the system can use half the screen width and half the screen height as boundaries to define the top left quarter area as the screen region. If the location description is "below the save button," the system uses the coordinates of the "save button" element as the anchor point and defines the vertical area below it as the relative position. If parsing fails or the location description cannot be recognized, the spatial location constraint is empty, and the process can skip the location matching step or provide an error message to the user.

[0085] Furthermore, based on the element coordinates of each interface element, candidate elements that meet the spatial location constraints are filtered to obtain a second candidate element list. The system traverses the element set, reads the element coordinates of each interface element, and determines whether it falls within the area defined by the spatial location constraints. If the element coordinates are within the area, the element is added to the second candidate element list; if the element coordinates are outside the area, it is excluded. If the second candidate element list is empty after the traversal, the location matching is considered to have failed, and the constraints may be relaxed or a prompt may be made to the user. Utilizing location information quickly narrows the search scope and eliminates a large number of irrelevant elements.
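Steps S701 and S702 can be sketched as a phrase-to-rectangle lookup followed by a containment test. The sketch assumes a top-left screen origin, covers only absolute regions (not relative constraints such as "below the save button"), and treats an element as inside a region when the center of its bounding box is inside; the disclosure does not specify the containment rule, and the function names are hypothetical.

```python
def region_for(description, width, height):
    """Map a location phrase to a rectangle (x1, y1, x2, y2), or None if unknown."""
    regions = {
        "top left corner": (0, 0, width / 2, height / 2),
        "left half screen": (0, 0, width / 2, height),
    }
    return regions.get(description.strip().lower())


def center(box):
    """Center point of a bounding box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)


def filter_by_region(elements, region):
    """Keep elements whose box center lies inside the region (assumed rule)."""
    rx1, ry1, rx2, ry2 = region
    kept = []
    for element in elements:
        cx, cy = center(element["box"])
        if rx1 <= cx <= rx2 and ry1 <= cy <= ry2:
            kept.append(element)
    return kept
```

The list returned by `filter_by_region` plays the role of the second candidate element list; a `None` from `region_for` corresponds to an empty spatial location constraint.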

[0086] Next, the system determines the number of candidate elements in the second candidate element list. If the second candidate element list contains only one candidate element, then that single candidate element is directly identified as the target element. For example, if only one button is identified in the "top left corner" area, that button is directly selected without further calculation. If the second candidate element list contains multiple candidate elements, the system enters the ambiguity handling logic. At this point, the system further determines whether the target description information contains text description information. If the target description information does not contain text description information, then secondary filtering using text is not possible. The system can either default to selecting the element ranked first in the second candidate element list as the target element or prompt the user to supplement the text information. If the target description information contains text description information, then secondary filtering of the second candidate element list is performed based on the text description information to determine the target element.

[0087] The specific process of secondary filtering is as follows: The system traverses multiple candidate elements in the second candidate element list, extracts the text content or element type identifier from the element attributes of each candidate element, and calculates its semantic similarity with the text description information. For example, if the second candidate element list contains a "Save" button and a "Cancel" button, and the text description information is "Save", the system calculates the similarity between both and "Save", and determines the candidate element with the highest similarity as the target element. If multiple elements with the same similarity still exist after secondary filtering, the element with the closest distance can be selected by combining the distance between the element coordinates and the center of the region; if no matching element is found after secondary filtering, the matching is deemed to have failed.

[0088] Based on the solution disclosed above, by first obtaining a second candidate element list based on location description information, and then performing secondary filtering based on text description information when the list contains multiple candidate elements, the technical problem that location matching alone cannot distinguish multiple elements within the same area can be effectively solved. This significantly improves positioning accuracy in dense interface layouts and avoids misoperations caused by multiple elements sharing the same area.

[0089] In one or more embodiments of this disclosure, Figure 8 is a flowchart of another method for determining a target element provided in an embodiment of this disclosure. As shown in Figure 8, determining the target element based on at least one of the semantic matching result and the location matching result includes: Step S801: Perform text matching on the element set based on text description information to obtain a first candidate element list corresponding to the semantic matching result, and perform spatial location matching on the element set based on location description information to obtain a second candidate element list corresponding to the location matching result. Step S802: Determine the intersection of the first candidate element list and the second candidate element list. Step S803: Determine the unique element in the intersection as the target element. Step S804: If the intersection contains multiple elements, calculate the comprehensive matching confidence using semantic similarity and positional similarity, and determine the candidate element with the highest comprehensive matching confidence as the target element.

[0090] First, text matching is performed on the element set based on textual description information to obtain a first candidate element list corresponding to the semantic matching results. Then, spatial location matching is performed on the element set based on location description information to obtain a second candidate element list corresponding to the location matching results. The system performs text matching and spatial location matching in parallel, generating two independent candidate lists respectively. The first candidate element list contains all interface elements whose element attributes have a similarity to the textual description information higher than a threshold, and the second candidate element list contains all interface elements whose element coordinates meet the spatial location constraints.

[0091] Furthermore, the intersection of the first and second candidate element lists is determined. The system iterates through both lists to find interface elements that appear in both lists. Then, the number of elements in the intersection is counted. If the intersection contains a unique element, that unique element is directly identified as the target element. This is because this element satisfies both text and position constraints, resulting in the highest matching confidence, and no further calculation is needed to determine the target element.

[0092] If the intersection contains multiple elements, a comprehensive matching confidence score is calculated using semantic similarity and positional similarity. The candidate element with the highest comprehensive matching confidence score is then selected as the target element. Specifically, the system obtains, for each element in the intersection, its semantic similarity score from text matching and its positional similarity score from spatial location matching. For example, the distance between the element's coordinates and the center of the target region is normalized to obtain the positional similarity score. The comprehensive matching confidence score is then calculated using a weighted sum or a product. For instance, with the text similarity weight set to 0.6 and the positional similarity weight set to 0.4, the system calculates the total score, sorts the results, and selects the element with the highest score. In this way, when ambiguity remains even under the dual constraints, a fine-grained ranking criterion ensures that the element that best matches the user's intent is selected.
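The weighted-sum variant can be written as $C = 0.6 \cdot s_{text} + 0.4 \cdot s_{pos}$. The sketch below implements it under one assumed normalization: the positional score is 1.0 at the region center and falls linearly to 0.0 at the region's half-diagonal. The disclosure only says the distance is normalized, so this particular formula and the function names are hypothetical.

```python
import math


def position_score(box, region):
    """Positional similarity in [0, 1] from distance to the region center."""
    bx, by = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    rx, ry = (region[0] + region[2]) / 2, (region[1] + region[3]) / 2
    half_diagonal = math.hypot(region[2] - region[0], region[3] - region[1]) / 2
    if half_diagonal == 0:
        return 0.0
    return max(0.0, 1.0 - math.hypot(bx - rx, by - ry) / half_diagonal)


def combined_confidence(semantic, positional, w_text=0.6, w_pos=0.4):
    """Weighted sum using the example weights from the text."""
    return w_text * semantic + w_pos * positional


def pick_target(candidates, region):
    """Pick the best element from (element, semantic_score) pairs.

    Assumption: candidates are the elements already in the intersection,
    each a dict with a "box" key.
    """
    return max(
        candidates,
        key=lambda c: combined_confidence(c[1], position_score(c[0]["box"], region)),
    )[0]
```

In the example below, a centered element with a slightly lower text score can outrank an off-center element with a higher one, which is the behavior the weighting is meant to allow.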

[0093] Furthermore, if the intersection is empty, an exception handling process is executed. The system determines whether rematching is allowed. If allowed, the spatial location constraints are relaxed or the text similarity requirement is lowered, and the matching step is re-executed. For example, if the initial location constraint of "top left corner" leads to no intersection, it can be relaxed to "left half screen"; if the initial text similarity threshold of 0.9 leads to no intersection, it can be lowered to 0.7. After regenerating the first candidate element list and / or the second candidate element list, the intersection is calculated again. If the intersection is still empty after rematching, the matching is deemed a failure, and the user is prompted to re-enter the information.
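The rematching loop can be sketched as a sequence of progressively looser attempts. The function below is a hypothetical outline: `text_match` and `location_match` are caller-supplied callables standing in for the two matching steps, and the schedule of `(threshold, location phrase)` attempts is an assumption, since the disclosure allows relaxing either constraint.

```python
def match_with_relaxation(text_match, location_match, attempts):
    """Retry intersection matching with progressively relaxed constraints.

    attempts: sequence of (similarity_threshold, location_phrase) pairs,
    strictest first. Returns the first non-empty intersection, or an
    empty list if every attempt fails (matching failure).
    """
    for threshold, phrase in attempts:
        first = text_match(threshold)      # first candidate element list
        second = location_match(phrase)    # second candidate element list
        intersection = [e for e in first if e in second]
        if intersection:
            return intersection
    return []
```

Using the figures from the paragraph above, a caller might pass `[(0.9, "top left corner"), (0.9, "left half screen"), (0.7, "left half screen")]` so that the location constraint is relaxed before the text threshold.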

[0094] Based on the solution disclosed above, by determining the intersection of the first and second candidate element lists, strong validation under both text and position constraints is achieved, significantly reducing the false match rate. In particular, when the intersection contains multiple elements, ranking them by the comprehensive matching confidence score effectively resolves the ambiguity that remains under the dual constraints.

[0095] In one or more embodiments of this disclosure, before performing an operation on a target element according to the operational intent, the method further includes: rendering a visual feedback effect on the target element on a display interface, the visual feedback effect including at least one of a highlighted box, a pulse animation, and a color gradient mark. Performing the operation on the target element according to the operational intent includes: performing the operation on the target element according to the operational intent after the visual feedback effect has been rendered.

[0096] In practical applications, before performing any operation on the target element according to the intended operation, a visual feedback effect for the target element is rendered on the display interface. The system obtains the element coordinates of the target element and creates a new rendering layer or overlay at the corresponding position on the display interface. If the system is configured to use a highlight box, a border of a specific color is drawn at the bounding box coordinates of the target element; if the system is configured to use a pulse animation, a periodically changing animation effect is started in that area; if the system is configured to use a color gradient marker, the area is filled with color and a gradient transition is performed. If the system supports multiple effects, at least one can be selected for rendering; if an error occurs during rendering that prevents the generation of a visual feedback effect, the system can skip this step and directly execute the operation or terminate the process and report an error.

[0097] Further, the system determines whether the visual feedback effect has been rendered successfully. If the visual feedback effect has been rendered successfully, for example, the highlighted box has been stably displayed for the preset duration, or the pulse animation has completed a full cycle, then proceed to the next step; if the visual feedback effect has not been rendered successfully, for example, the animation is still in progress or the drawing process is blocked, then the system waits until rendering is complete or the timeout threshold is reached. If the rendering is not completed even after the timeout threshold is reached, it is determined that the rendering has failed, and the system can cancel the visual feedback and directly execute the operation or terminate the process.
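
The wait-with-timeout behavior described above can be sketched as follows. This is a minimal sketch, assuming the rendering layer exposes a pollable completion check; `is_rendered`, `timeout_s`, `poll_s`, and the `on_failure` policy names are hypothetical and not taken from the disclosure.

```python
import time


def wait_for_render(is_rendered, timeout_s=0.5, poll_s=0.01):
    """Poll is_rendered() until it returns True or the timeout threshold is reached."""
    deadline = time.monotonic() + timeout_s
    while True:
        if is_rendered():
            return True
        if time.monotonic() >= deadline:
            return False  # rendering is judged to have failed
        time.sleep(poll_s)


def confirm_then_execute(is_rendered, execute, on_failure="execute", **wait_options):
    """Execute the operation only after feedback is rendered, or apply the failure policy."""
    if wait_for_render(is_rendered, **wait_options):
        execute()
        return "executed"
    if on_failure == "execute":
        execute()  # cancel the feedback and execute the operation directly
        return "executed_without_feedback"
    return "terminated"
```

Whether a render failure should fall back to direct execution or terminate the process is left to configuration in the text above, so the sketch exposes it as a parameter rather than choosing one.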

[0098] After the visual feedback effect is rendered, the system executes the operation on the target element according to the user's intent. Based on the intent in the voice analysis result, the system sends corresponding control commands to the element's coordinates. For example, if the intent is to click, a mouse click event is simulated; if the intent is to input, input focus is activated.
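
Mapping an operation intent to a control command can be sketched as follows. The command dictionaries, the event names, and the choice of the bounding-box center as the dispatch point are assumptions made for illustration; the disclosure only states that control commands are sent to the element's coordinates.

```python
def bbox_center(bbox):
    """Return the integer center of a (left, top, right, bottom) bounding box."""
    left, top, right, bottom = bbox
    return (left + right) // 2, (top + bottom) // 2


def build_command(intent, bbox):
    """Translate an operation intent into a hypothetical control command."""
    x, y = bbox_center(bbox)
    events = {
        "click": "mouse_click",   # simulate a mouse click event
        "input": "focus_input",   # activate input focus
        "scroll": "scroll",
    }
    if intent not in events:
        raise ValueError(f"unsupported operation intent: {intent!r}")
    return {"event": events[intent], "x": x, "y": y}


print(build_command("click", (10, 20, 110, 60)))
```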

[0099] With the solution disclosed above, a visual feedback effect for the target element is rendered on the display interface before the operation is executed, and the operation is executed only after the visual feedback effect has been rendered, which makes the interaction process transparent and visible. The visual feedback step lets the user see directly which target element the system has locked onto (through a highlight box, pulse animation, or color gradient mark) and provides an opportunity to perceive and confirm the target before the operation is executed. This significantly enhances the user's sense of control over, and trust in, the contactless interaction process.

[0100] To facilitate understanding, the technical solution of this disclosure is described in detail below through a specific embodiment. Figure 9 is a schematic diagram illustrating a voice-based interaction process provided in an embodiment of this disclosure. As shown in Figure 9, the process includes the following steps. Step 91: The user sends a command to the voice input module to start voice acquisition, which opens the voice acquisition channel.

[0101] Step 92: The user sends a command to the screen capture module to start screen capture, triggering the opening of the screen capture channel.

[0102] Step 93: The screen capture module starts its own screen capture loop and continuously captures screen images at a predetermined frequency.

[0103] Step 94: The system enters the voice processing loop. In the loop, the user continuously inputs voice into the voice input module, that is, the user continues to speak.

[0104] Step 95: The voice input module performs endpoint detection (i.e., voice activity detection) in the loop to determine the start time and end time of the user's speech.

[0105] Step 96: The screen capture module sends the captured screen image data to the OCR processing module.

[0106] Step 97: After the OCR processing module completes the recognition processing of the screen image, it sends an OCR processing completion signal to the element matching module. This signal contains the set of recognized elements (each interface element is associated with element coordinates and element attributes).

[0107] Step 98: After the voice input module completes the ASR (Automatic Speech Recognition) conversion, it sends an ASR conversion completion signal to the semantic understanding module. This signal contains the converted voice command text.

[0108] Step 99: After parsing the voice command text, the semantic understanding module sends a semantic parsing completion signal to the element matching module. This signal contains the voice parsing result (including operation intent and target description information).

[0109] Step 910: After receiving both the OCR processing completion signal and the semantic parsing completion signal, the element matching module performs semantic matching and position matching based on the voice parsing result and the element set to determine the target element, and sends the matching result to the visual feedback module.

[0110] Step 911: After receiving the matching result, the visual feedback module renders the visual feedback effect for the target element on the display interface, that is, it displays to the user at least one of the highlight box, pulse animation, and color gradient mark mentioned above.

[0111] Step 912: The element matching module sends an execution operation instruction to the control execution module. The instruction contains the operation intent and the element coordinates of the target element.

[0112] Step 913: The control execution module performs the operation on the target element according to the operation intent, for example a simulated click, input, or scroll operation.

[0113] Step 914: After the control execution module completes the operation, it sends an operation completion feedback to the user, informing the user that the interaction process has ended.
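
The synchronization point in Step 910, where matching starts only after both the OCR branch and the speech branch have finished, can be sketched as follows. Both branches are stubs that return fixed data, and the exact-text comparison stands in for the semantic and position matching; all function names and data shapes are hypothetical.

```python
import asyncio


async def ocr_branch(screen_image):
    """Stub for Steps 96-97: returns a recognized element set."""
    await asyncio.sleep(0)
    return [{"text": "Submit", "bbox": (10, 10, 90, 40)}]


async def speech_branch(audio):
    """Stub for Steps 98-99: returns a voice parsing result."""
    await asyncio.sleep(0)
    return {"intent": "click", "target_text": "Submit"}


async def run_pipeline(screen_image, audio):
    # Step 910: wait for both completion signals before matching
    elements, parsed = await asyncio.gather(
        ocr_branch(screen_image), speech_branch(audio)
    )
    wanted = parsed["target_text"].lower()
    target = next((e for e in elements if e["text"].lower() == wanted), None)
    if target is None:
        return None
    # Step 912: the instruction carries the intent and the element coordinates
    return {"intent": parsed["intent"], "bbox": target["bbox"]}


print(asyncio.run(run_pipeline(b"", b"")))
```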

[0114] Based on any of the above embodiments, this disclosure also provides a voice-based interactive device. Figure 10 is a schematic block diagram illustrating the structure of a voice-based interactive device according to one embodiment of this disclosure. As shown in Figure 10, the voice-based interactive device includes: an acquisition module 1001, used to acquire a screen image in response to a user's voice command; a semantic recognition module 1002, used to process the voice command to obtain a voice parsing result containing an operation intent and target description information; an image recognition module 1003, used to process the screen image to obtain an element set containing at least one interface element, wherein each interface element is associated with element coordinates and element attributes; a matching module 1004, used to perform semantic matching and position matching based on the voice parsing result and the element set to determine the target element corresponding to the voice command; and an execution module 1005, used to execute an operation on the target element according to the operation intent.

[0115] The acquisition module 1001 is used to synchronously acquire a voice data stream with timestamps and a screen image sequence; determine the start and end times of the voice command based on the voice data stream; and select the screen image corresponding to the start and end times from the screen image sequence.
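
The acquisition steps can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the energy-threshold endpoint detector, the nearest-timestamp selection policy, and the names `detect_endpoints`, `select_frames`, `energy_threshold`, and `min_silence_frames` are all assumptions introduced here.

```python
from bisect import bisect_left


def detect_endpoints(frames, energy_threshold=0.1, min_silence_frames=3):
    """Return (start_time, end_time) of the first utterance, or None.

    frames: (timestamp_seconds, energy) pairs in time order.
    """
    start = None
    last_voiced = None
    silence = 0
    for timestamp, energy in frames:
        if energy >= energy_threshold:
            if start is None:
                start = timestamp
            last_voiced = timestamp
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= min_silence_frames:
                break  # enough trailing silence: the utterance has ended
    if start is None:
        return None
    return start, last_voiced


def nearest_frame(timestamps, t):
    """Index of the screen image whose timestamp is closest to t (earlier one on ties)."""
    if not timestamps:
        raise ValueError("screen image sequence is empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    before, after = timestamps[i - 1], timestamps[i]
    return i if (after - t) < (t - before) else i - 1


def select_frames(timestamps, start_time, end_time):
    """Indices of the screen images matching the start and end times."""
    return nearest_frame(timestamps, start_time), nearest_frame(timestamps, end_time)
```

Because both streams carry timestamps from the same clock, the selection reduces to a binary search over the sorted image timestamps.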

[0116] The image recognition module 1003 is used to perform optical character recognition processing on the screen image, identify text regions in the screen image, extract the text content of each text region and the corresponding bounding box coordinates; and / or, perform visual feature recognition processing on the screen image, identify interactive elements in the screen image, extract the element type identifier of each interactive element and the corresponding bounding box coordinates; and generate an element set based on the element attributes obtained from the text content and / or element type identifier, and the element coordinates obtained from the bounding box coordinates.
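
One possible shape for the element set is sketched below. The `InterfaceElement` class, its field names, and the tuple layouts of the recognition outputs are hypothetical; the disclosure only requires that each interface element be associated with element coordinates and element attributes.

```python
from dataclasses import dataclass


@dataclass
class InterfaceElement:
    coordinates: tuple  # (left, top, right, bottom) bounding box
    attributes: dict    # e.g. {"text": ...} or {"type": ...}


def build_element_set(text_regions=(), detections=()):
    """Merge OCR text regions and visual detections into one element set.

    text_regions: (text_content, bbox) pairs from optical character recognition.
    detections: (element_type_identifier, bbox) pairs from visual feature recognition.
    """
    elements = []
    for text, bbox in text_regions:
        elements.append(InterfaceElement(bbox, {"text": text}))
    for type_id, bbox in detections:
        elements.append(InterfaceElement(bbox, {"type": type_id}))
    return elements
```

Either input may be empty, which mirrors the "and / or" wording of the paragraph above.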

[0117] The semantic recognition module 1002 is used to perform keyword recognition and semantic analysis on voice commands, identify operation intent keywords and entity description keywords; determine the initial operation intent based on the operation intent keywords, the initial operation intent including at least one of click, input, and scroll; extract initial position description information and / or initial text description information based on the entity description keywords; and generate voice parsing results based on the initial operation intent and target description information.
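
Keyword-based parsing of a recognized command can be sketched as follows. The keyword tables, the filler-word list, and the output dictionary are illustrative assumptions for English commands; the disclosure does not enumerate keywords, and a production system would likely use a richer semantic model.

```python
import re

INTENT_KEYWORDS = {
    "click": ("click", "tap", "press"),
    "input": ("input", "type", "enter"),
    "scroll": ("scroll", "swipe"),
}
# Longer phrases come first so that "top right" wins over "top"
POSITION_PHRASES = ("top left", "top right", "bottom left", "bottom right",
                    "top", "bottom", "left", "right")
FILLER_WORDS = {"the", "a", "an", "in", "on", "at", "of", "corner"}


def parse_command(text):
    """Return the operation intent plus position and text descriptions."""
    lowered = text.lower()
    words = re.findall(r"[a-z0-9]+", lowered)
    intent = next(
        (name for name, keywords in INTENT_KEYWORDS.items()
         if any(word in keywords for word in words)),
        None,
    )
    position = next(
        (p for p in POSITION_PHRASES
         if re.search(rf"\b{re.escape(p)}\b", lowered)),
        None,
    )
    remainder = lowered
    if position is not None:
        remainder = re.sub(rf"\b{re.escape(position)}\b", " ", lowered)
    skip = FILLER_WORDS | {k for kws in INTENT_KEYWORDS.values() for k in kws}
    description = " ".join(
        w for w in re.findall(r"[a-z0-9]+", remainder) if w not in skip
    )
    return {"intent": intent, "position": position, "text": description or None}


print(parse_command("Click the Submit button in the top right"))
```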

[0118] Optionally, the target description information includes location description information and / or text description information; the matching module 1004 is used to perform text matching on the element set based on the text description information to obtain a semantic matching result; and / or, to perform spatial location matching on the element set based on the location description information to obtain a location matching result; and to determine the target element based on the semantic matching result and / or the location matching result.

[0119] The matching module 1004 is used to calculate the semantic similarity between the element attributes of each element in the element set and the text description information, and to sort the elements by semantic similarity to obtain the first candidate element list corresponding to the semantic matching result. Determining the target element according to the semantic matching result and / or the position matching result includes: determining the candidate element with the highest semantic similarity in the first candidate element list as the target element; and, if multiple candidate elements in the first candidate element list share the same semantic similarity and the target description information contains position description information, performing a second screening on the first candidate element list based on the position description information to determine the target element.
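
Ranking by similarity with a position-based second screening on ties can be sketched as follows. Note the substitution: `difflib.SequenceMatcher` measures character-level similarity and is used here only as a stand-in for the semantic similarity, which the disclosure does not define. The element dictionaries and the `position_filter` callback are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Lexical similarity in [0, 1]; a stand-in for semantic similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_by_text(elements, description):
    """First candidate list: (score, element) pairs, best first (stable on ties)."""
    scored = [(similarity(e["text"], description), e) for e in elements]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


def choose_target(elements, description, position_filter=None):
    ranked = rank_by_text(elements, description)
    if not ranked:
        return None
    best_score = ranked[0][0]
    tied = [e for score, e in ranked if score == best_score]
    if len(tied) == 1 or position_filter is None:
        return tied[0]
    # Second screening: keep only tied candidates that satisfy the position description
    screened = [e for e in tied if position_filter(e)]
    return screened[0] if screened else tied[0]


elements = [
    {"text": "Cancel", "bbox": (0, 0, 50, 20)},
    {"text": "Submit", "bbox": (0, 100, 50, 120)},
    {"text": "Submit", "bbox": (0, 500, 50, 520)},
]
lower_half = lambda e: e["bbox"][1] >= 400
print(choose_target(elements, "submit", lower_half)["bbox"])
```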

[0120] The matching module 1004 is used to parse the location description information to obtain spatial location constraints, which include a screen area range or relative positional relationships, and to filter, based on the element coordinates of each interface element, the candidate elements that meet the spatial location constraints to obtain a second candidate element list. Determining the target element based on the semantic matching result and / or the location matching result includes: determining the unique candidate element in the second candidate element list as the target element; and, if the second candidate element list contains multiple candidate elements and the target description information contains text description information, performing a second filtering on the second candidate element list based on the text description information to determine the target element.
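
Filtering by a screen area range can be sketched as follows. The mapping from position phrases to halves and quadrants of the screen, and the rule that an element matches when its bounding-box center lies inside the region, are assumptions; relative positional relationships (for example "below the search box") are not covered by this sketch.

```python
def region_for(position, screen_w, screen_h):
    """Map a position phrase to a (left, top, right, bottom) screen region."""
    half_w, half_h = screen_w / 2, screen_h / 2
    regions = {
        "top": (0, 0, screen_w, half_h),
        "bottom": (0, half_h, screen_w, screen_h),
        "left": (0, 0, half_w, screen_h),
        "right": (half_w, 0, screen_w, screen_h),
        "top left": (0, 0, half_w, half_h),
        "top right": (half_w, 0, screen_w, half_h),
        "bottom left": (0, half_h, half_w, screen_h),
        "bottom right": (half_w, half_h, screen_w, screen_h),
    }
    return regions[position]


def filter_by_region(elements, region):
    """Second candidate list: elements whose bounding-box center lies in the region."""
    left, top, right, bottom = region
    kept = []
    for element in elements:
        l, t, r, b = element["bbox"]
        cx, cy = (l + r) / 2, (t + b) / 2
        if left <= cx <= right and top <= cy <= bottom:
            kept.append(element)
    return kept


elements = [
    {"id": "a", "bbox": (10, 10, 110, 50)},
    {"id": "b", "bbox": (900, 10, 1000, 50)},
    {"id": "c", "bbox": (900, 700, 1000, 760)},
]
print([e["id"] for e in filter_by_region(elements, region_for("top right", 1000, 800))])
```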

[0121] The matching module 1004 is used to perform text matching on the element set based on text description information to obtain a first candidate element list corresponding to the semantic matching result, and to perform spatial location matching on the element set based on location description information to obtain a second candidate element list corresponding to the location matching result; determine the intersection of the first candidate element list and the second candidate element list; determine the unique element in the intersection as the target element; if the intersection contains multiple elements, calculate the comprehensive matching confidence using semantic similarity and location similarity, and determine the candidate element with the highest comprehensive matching confidence as the target element.

[0122] The execution module 1005 is used to render a visual feedback effect for the target element on the display interface, the visual feedback effect including at least one of a highlight box, a pulse animation, and a color gradient mark, and to execute the operation on the target element according to the operation intent after the visual feedback effect has been rendered.

[0123] The specific implementation process of the functions and roles of each module in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.

[0124] The entity executing the voice-based interaction method in the specific embodiments of this disclosure may be an electronic device such as a server (including a local server or a cloud computing platform).

[0125] Therefore, based on any of the above embodiments, this disclosure also provides an electronic device that can execute the voice-based interaction method of any of the embodiments described above.

[0126] Figure 11 is a schematic block diagram of an electronic device according to one embodiment of the present disclosure.

[0127] The hardware architecture of the electronic device 1000 can be implemented using a bus architecture. The bus architecture can include any number of interconnect buses and bridges, depending on the specific application of the hardware and overall design constraints. Bus 1100 connects various circuits, including one or more processors 1200, memory 1300, and / or hardware modules. Bus 1100 can also connect various other circuits 1400, such as peripheral devices, voltage regulators, power management circuits, external antennas, etc.

[0128] Bus 1100 can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, only one connection line is shown in the figure, but this does not imply that there is only one bus or only one type of bus.

[0129] This disclosure also provides a readable storage medium storing a computer program that, when executed by a processor, implements the methods described above. A "readable storage medium" can be any means capable of containing, storing, communicating, propagating, or transmitting a program for use by or in conjunction with an instruction execution system, apparatus, or device. More specific examples of a readable storage medium include: an electrical connection with one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM), etc.

[0130] This disclosure also provides a computer program product. The methods described above can be implemented wholly or partially through software, hardware, firmware, or any combination thereof. When implemented in software, they can be implemented wholly or partially in the form of a computer program product. The computer program product includes one or more computer programs or instructions. When the computer programs or instructions are loaded and executed, all or part of the processes or functions of this disclosure are performed.

[0131] Computer programs or instructions can be stored in a readable storage medium or transferred from one readable storage medium to another. For example, the computer programs or instructions can be transferred from one website, computer, server, or data center to another website, computer, server, or data center via wired or wireless means. The readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium can be a magnetic medium, such as a floppy disk, hard disk, or magnetic tape; an optical medium, such as a digital video disc (DVD); or a semiconductor medium, such as a solid-state drive. The computer-readable storage medium can be a volatile or non-volatile storage medium, or it can include both volatile and non-volatile types of storage media.

[0132] Those skilled in the art will understand that embodiments of this disclosure can be provided as methods, systems, or computer program products. Therefore, this disclosure can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this disclosure can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0133] This disclosure is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to this disclosure. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0134] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0135] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0136] In the description of this specification, the references to terms such as "one embodiment / mode," "some embodiments / modes," "example," "specific example," or "some examples," etc., refer to specific features, structures, or characteristics described in connection with that embodiment / mode or example, which are included in at least one embodiment / mode or example of this disclosure. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment / mode or example. Moreover, the specific features, structures, or characteristics described may be combined in any suitable manner in one or more embodiments / modes or examples. Furthermore, without contradiction, those skilled in the art can combine and integrate the different embodiments / modes or examples described in this specification, as well as the features of different embodiments / modes or examples.

[0137] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this disclosure, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.

[0138] Those skilled in the art should understand that the above embodiments are merely for illustrating the present disclosure and are not intended to limit the scope of the disclosure. Those skilled in the art can make other changes or modifications based on the above disclosure, and these changes or modifications still fall within the scope of the present disclosure.

Claims

1. A voice-based interaction method, characterized in that the method comprises: synchronously acquiring a timestamped voice data stream and a screen image sequence; determining a start time and an end time of a voice command based on the voice data stream; selecting, from the screen image sequence, the screen image corresponding to the start time and the end time; performing recognition processing on the voice command to obtain a voice parsing result containing an operation intent and target description information; performing recognition processing on the screen image to obtain an element set containing at least one interface element, wherein each interface element is associated with element coordinates and element attributes; performing semantic matching and position matching based on the voice parsing result and the element set to determine a target element corresponding to the voice command; and performing an operation on the target element according to the operation intent.

2. The voice-based interaction method according to claim 1, characterized in that the step of recognizing the screen image to obtain an element set containing at least one interface element comprises: performing at least one of the following: performing optical character recognition processing on the screen image to identify text regions in the screen image and extract the text content and corresponding bounding box coordinates of each text region; and performing visual feature recognition processing on the screen image to identify interactive elements in the screen image and extract the element type identifier and corresponding bounding box coordinates of each interactive element; and generating the element set based on the element attributes obtained from at least one of the text content and the element type identifier, and the element coordinates obtained from the bounding box coordinates.

3. The voice-based interaction method according to claim 1, characterized in that the target description information includes at least one of location description information and text description information; and the step of performing semantic matching and position matching based on the voice parsing result and the element set to determine the target element corresponding to the voice command comprises: performing at least one of the following: performing text matching on the element set based on the text description information to obtain a semantic matching result; and performing spatial location matching on the element set based on the location description information to obtain a location matching result; and determining the target element based on at least one of the semantic matching result and the location matching result.

4. The voice-based interaction method according to claim 3, characterized in that the step of performing text matching on the element set based on the text description information to obtain a semantic matching result comprises: calculating the semantic similarity between the element attributes of each element in the element set and the text description information; and sorting the elements according to the semantic similarity to obtain a first candidate element list corresponding to the semantic matching result; and the step of determining the target element based on at least one of the semantic matching result and the location matching result comprises: determining the candidate element with the highest semantic similarity in the first candidate element list as the target element; and, if multiple candidate elements in the first candidate element list have the same semantic similarity and the target description information includes the location description information, further filtering the first candidate element list based on the location description information to determine the target element.

5. The voice-based interaction method according to claim 4, characterized in that the step of performing spatial location matching on the element set based on the location description information to obtain a location matching result comprises: parsing the location description information to obtain spatial location constraints, which include a screen area range or relative positional relationships; and filtering, based on the element coordinates of each interface element, the candidate elements that meet the spatial location constraints to obtain a second candidate element list; and the step of determining the target element based on at least one of the semantic matching result and the location matching result comprises: determining the unique candidate element in the second candidate element list as the target element; and, if the second candidate element list contains multiple candidate elements and the target description information includes the text description information, further filtering the second candidate element list based on the text description information to determine the target element.

6. The voice-based interaction method according to claim 4, characterized in that the step of determining the target element based on at least one of the semantic matching result and the location matching result comprises: performing text matching on the element set based on the text description information to obtain a first candidate element list corresponding to the semantic matching result, and performing spatial location matching on the element set based on the location description information to obtain a second candidate element list corresponding to the location matching result; determining the intersection of the first candidate element list and the second candidate element list; determining the unique element in the intersection as the target element; and, if the intersection contains multiple elements, calculating a comprehensive matching confidence using the semantic similarity and a location similarity, and determining the candidate element with the highest comprehensive matching confidence as the target element.

7. The voice-based interaction method according to claim 1, characterized in that, before performing the operation on the target element according to the operation intent, the method further comprises: rendering a visual feedback effect for the target element on a display interface, the visual feedback effect including at least one of a highlight box, a pulse animation, and a color gradient mark; and the step of performing the operation on the target element according to the operation intent comprises: performing the operation on the target element according to the operation intent after the visual feedback effect has been rendered.

8. An electronic device, characterized in that it comprises: a memory storing execution instructions; and a processor that executes the execution instructions stored in the memory, causing the processor to perform the method of any one of claims 1 to 7.

9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any one of claims 1 to 7.

10. A computer program product, comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.