A voice control method, device, apparatus and storage medium

By taking screenshots to recognize text information and generating voice commands when the page changes, the problem of voice control failure caused by inconsistencies between the displayed information on the page and the control attribute text is solved, realizing a voice control method that allows users to speak what they can see.

CN116302228B · Active · Publication Date: 2026-06-26 · BEIJING UNISOUND INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: BEIJING UNISOUND INFORMATION TECH CO LTD
Filing Date: 2022-09-06
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

In existing technologies, inconsistencies exist between the information displayed on the page and the text information in the control properties, leading to voice control failures.

Method used

By taking a screenshot when a page change is detected, the system identifies the text information and its attributes in the page image, generates and registers voice commands, and performs corresponding operations based on the voice input.

Benefits of technology

This ensures that voice control can be successfully executed even if the text information on the page and the text information in the control properties are inconsistent, thus avoiding voice control failure.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a voice control method, apparatus, device, and storage medium. The method comprises the following steps: when it is detected that a display element of a page has changed, taking a screenshot of the page to obtain a page picture corresponding to the page; identifying text information in the page picture and determining attribute information of the text information; generating and registering a voice instruction according to the text information in the page picture; and, after receiving voice input information matching the voice instruction, executing the voice instruction according to the attribute information corresponding to the voice instruction. By taking a screenshot, the application identifies the text visible to the user in the resulting picture and registers that text as a voice instruction, realizing a "say what you see" voice control mode. Even if the text information on the page and the text information in the attributes of a control are inconsistent, voice control does not fail.

Description

Technical Field

[0001] This invention relates to the field of computer technology, and in particular to a voice control method, apparatus, device, and storage medium.

Background Technology

[0002] In some application scenarios, manually operating the terminal device is inconvenient for users, so voice commands are needed for control. For example, a user might want to open a car music app while driving to listen to music. However, driving requires concentration, and glancing at the screen, swiping to find the music app, and then manually selecting a song is time-consuming and potentially dangerous. In such cases, using voice commands to open the music app and play the desired song is clearly safer.

[0003] Currently, in user-visible voice control methods, control properties are pre-added to page controls, containing text information corresponding to the voice command. For example, if the page displays "OK," and the control property includes the text information corresponding to "OK," the user saying "OK" will be recognized as a voice command, and the page control will perform the corresponding operation. However, in some cases, there is a discrepancy between the information displayed on the page and the text information in the control property. The user can only see the information displayed on the page and cannot know the text information in the control property. Even if the user says the information displayed on the page, the corresponding operation cannot be performed because the control property does not contain that information, leading to voice control failure. For example, if the page displays "Play," but the control property's text information is "Open Audio," the user saying "Play" will not complete the voice control operation because the control property does not contain the text "Play."

Summary of the Invention

[0004] The main objective of this invention is to provide a voice control method, apparatus, device, and storage medium to solve the problem of inconsistency between the information displayed on the page and the text information in the control properties in the prior art, which leads to voice control failure.

[0005] To address the aforementioned technical problems, the embodiments of the present invention are implemented through the following technical solutions:

[0006] This invention provides a voice control method, comprising: when a change is detected in the display elements of a page, performing a screenshot on the page to obtain a page image corresponding to the page; recognizing text information in the page image and determining the attribute information of the text information; generating and registering a voice command based on the text information in the page image; and executing the voice command based on the attribute information corresponding to the voice command after receiving voice input information matching the voice command.

[0007] The step of generating and registering voice commands based on the text information in the page image includes: when the text information includes symbols, querying the matching text of the symbol mapping in a preset symbol text mapping table, and generating and registering voice commands based on the matching text; wherein the symbol text mapping table is used to record the mapping relationship between symbols and matching text.

[0008] The attribute information of the text information includes: the page location where the text information is located.

[0009] The step of executing the voice command based on the attribute information corresponding to the voice command includes: querying the page position of the text information corresponding to the voice command; and triggering a preset click operation event at the page position so as to trigger the control at the page position to perform the operation corresponding to the voice command.

[0010] The step of executing the voice command according to the attribute information corresponding to the voice command after receiving voice input information that matches the voice command includes: if there are multiple voice commands that match the voice input information, then by performing an instruction selection interaction operation, determining the selected voice command among the multiple voice commands, and executing the selected voice command according to the attribute information corresponding to the selected voice command.

[0011] The step of selecting a voice command from multiple voice commands by executing an instruction selection interaction operation includes: issuing an attribute information selection prompt; receiving secondary voice input information; wherein the secondary voice input information is attribute information input according to the attribute information selection prompt; determining the attribute information that matches the secondary voice input information from the attribute information corresponding to the multiple voice commands respectively, and taking the voice command corresponding to the attribute information that matches the secondary voice input information as the selected voice command.

[0012] This invention also provides a voice control device, comprising: a detection and screenshot module, configured to perform screenshot processing on the page when a change in the display elements of the page is detected, thereby obtaining a page image corresponding to the page; a recognition and determination module, configured to recognize text information in the page image and determine the attribute information of the text information; a generation and registration module, configured to generate and register voice commands based on the text information in the page image; and a command execution module, configured to execute the voice command based on the attribute information corresponding to the voice command after receiving voice input information matching the voice command.

[0013] The attribute information of the text information includes: the page position where the text information is located; the instruction execution module is used to query the page position where the text information corresponding to the voice instruction is located; and to trigger a preset click operation event at the page position so as to trigger the control at the page position to perform the operation corresponding to the voice instruction.

[0014] This invention also provides a voice control device, which includes a processor and a memory; the processor is used to execute a voice control program stored in the memory to implement the voice control method described above.

[0015] This invention also provides a storage medium storing one or more programs, which can be executed by one or more processors to implement the voice control method described above.

[0016] The beneficial effects of this invention are as follows:

[0017] In this embodiment of the invention, when a change in the displayed elements of a page is detected, a screenshot is taken of the page to obtain a corresponding page image. Text information is identified within the page image, and the attribute information of the text information is determined. Based on the text information in the page image, a voice command is generated and registered. Upon receiving voice input information matching the voice command, the voice command is executed according to the attribute information corresponding to the voice command. This embodiment of the invention identifies user-visible text in an image through screenshots and registers this text as voice commands. This achieves a voice control method where what the user sees can be spoken, and even if the text information on the page and the text information in the control attributes are inconsistent, voice control will not fail.

Brief Description of the Drawings

[0018] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, illustrate exemplary embodiments of the invention and, together with their description, serve to explain the invention and do not constitute an undue limitation thereof. In the drawings:

[0019] Figure 1 is a flowchart of a voice control method according to an embodiment of the present invention;

[0020] Figure 2 is a schematic diagram of voice control logic according to an embodiment of the present invention;

[0021] Figure 3 is a structural diagram of a voice control device according to an embodiment of the present invention;

[0022] Figure 4 is a structural diagram of a voice control device according to an embodiment of the present invention.

Detailed Description

[0023] To make the objectives, technical solutions, and advantages of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0024] According to an embodiment of the present invention, a voice control method is provided. Figure 1 is a flowchart of a voice control method according to an embodiment of the present invention.

[0025] Step S101: When a change in the display elements of the page is detected, a screenshot is taken of the page to obtain the corresponding page image.

[0026] The types of displayed elements include, but are not limited to: the position and text content of page controls, and the position and text content of text boxes.
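One way to detect a change in the display elements is to fingerprint their positions and text content and compare the fingerprint with the previous one. The following is a minimal sketch under that assumption; the `Element` snapshot format, `fingerprint`, and `ChangeDetector` are hypothetical names introduced for illustration, and the embodiment does not prescribe any particular detection mechanism.

```python
import hashlib
from typing import Iterable, Optional, Tuple

# Hypothetical snapshot format: (element id, (x, y) position, text content).
Element = Tuple[str, Tuple[int, int], str]


def fingerprint(elements: Iterable[Element]) -> str:
    """Hash the position and text content of every display element."""
    digest = hashlib.sha256()
    for elem_id, (x, y), text in sorted(elements):
        digest.update(f"{elem_id}|{x}|{y}|{text}\n".encode("utf-8"))
    return digest.hexdigest()


class ChangeDetector:
    """Reports True when the display elements differ from the previous check."""

    def __init__(self) -> None:
        self._last: Optional[str] = None

    def changed(self, elements: Iterable[Element]) -> bool:
        current = fingerprint(elements)
        is_changed = current != self._last
        self._last = current
        return is_changed
```

A caller would take a screenshot only when `changed(...)` returns True, which corresponds to the trigger condition of step S101.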

[0027] Step S102: In the page image, identify text information and determine the attribute information of the text information.

[0028] Text information includes, but is not limited to: characters, words, sentences, and symbols.

[0029] The attribute information of the text information includes, but is not limited to, the page location where the text information is located.

[0030] Text recognition processing is performed on the page image to identify the text information within it. Furthermore, since multiple text regions may exist within the page image, layout analysis can be used to identify the text information within each region separately. Thus, the page position can be determined by the location of the text region and the relative positions of multiple text pieces.

[0031] Furthermore, OCR (optical character recognition) algorithms can be used to identify text information in page images and determine the page location of the text information.

[0032] In this embodiment, the text information and its attribute information are stored accordingly so that the attribute information can be queried later through the text information.
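The storage described above can be sketched as an index from recognized text to its text regions. This is an illustrative sketch only: it assumes the OCR step yields each piece of text with a pixel bounding box, and `TextRegion` and `build_text_index` are hypothetical names, not part of the embodiment or of any OCR library.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class TextRegion:
    text: str
    box: Tuple[int, int, int, int]  # left, top, right, bottom, in pixels

    @property
    def center(self) -> Tuple[int, int]:
        """Center of the bounding box, usable later as a click target."""
        left, top, right, bottom = self.box
        return ((left + right) // 2, (top + bottom) // 2)


def build_text_index(regions: List[TextRegion]) -> Dict[str, List[TextRegion]]:
    """Map each recognized text to every region in which it appears."""
    index: Dict[str, List[TextRegion]] = defaultdict(list)
    for region in regions:
        key = region.text.strip()
        if key:  # skip regions where OCR produced only whitespace
            index[key].append(region)
    return dict(index)
```

Keeping a list of regions per text, rather than a single region, preserves the case where the same text appears more than once on the page.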

[0033] Step S103: Generate and register voice commands based on the text information in the page image.

[0034] Voice commands are used to execute operations specified by voice input. Since voice commands are generated from text information, the attribute information corresponding to the text information is also the attribute information corresponding to the voice command.

[0035] If a page image includes a single piece of text information, that piece of text information is generated and registered as a voice command. If a page image includes two or more pieces of text information, each piece is generated and registered as a voice command separately.

[0036] Furthermore, multiple voice commands can be generated and registered for each piece of text information. This is because a piece of text information may have synonyms; when generating and registering voice commands, the synonyms of the text information are also generated and registered as corresponding voice commands.
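Registering a piece of text together with its synonyms might look like the sketch below. The synonym table, the `register_text` function, and the dictionary used as the command library are assumptions made for illustration; the embodiment does not specify how synonyms are obtained. Each registered phrase shares the page position of the original text, since the attribute information of the text is also the attribute information of the command.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]


def register_text(
    text: str,
    position: Position,
    synonyms: Dict[str, List[str]],
    library: Dict[str, List[Position]],
) -> List[str]:
    """Register the text and each of its synonyms as voice commands."""
    phrases = [text, *synonyms.get(text, [])]
    for phrase in phrases:
        # Every phrase inherits the position of the text it was derived from.
        library.setdefault(phrase, []).append(position)
    return phrases
```

For example, with a hypothetical synonym table `{"Play": ["Start playback"]}`, registering "Play" at position (140, 220) stores both "Play" and "Start playback" with that position.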

[0037] Since the text information may include symbols, when such text information includes symbols, a matching text for the symbol mapping can be queried in a preset symbol-text mapping table, and a voice command can be generated and registered based on the matching text. The symbol-text mapping table records the mapping relationship between preset symbols and matching text. In this way, even if the page includes symbol buttons, the user can still control those symbol buttons with voice.

[0038] Furthermore, a single symbol mapping can match multiple texts, and a voice command can be generated and registered based on each matching text to avoid missing commands.

[0039] For example, the symbol “√” maps to the text “OK” and “Confirm”, generating and registering “OK” and “Confirm” as voice commands respectively.
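The symbol-text mapping table can be sketched as a dictionary from a symbol to its matching texts. In the sketch below only the "√" entry is taken from the example above; the "×" entry, the `commands_for_text` function, and the simplification that a recognized text is either exactly one symbol or ordinary text are assumptions for illustration.

```python
from typing import Dict, List

# Only the "√" entry comes from the description; "×" is a hypothetical addition.
SYMBOL_TEXT_MAP: Dict[str, List[str]] = {
    "√": ["OK", "Confirm"],
    "×": ["Cancel", "Close"],
}


def commands_for_text(
    text: str, symbol_map: Dict[str, List[str]] = SYMBOL_TEXT_MAP
) -> List[str]:
    """Return the voice commands to register for one recognized text."""
    cleaned = text.strip()
    if not cleaned:
        return []
    if cleaned in symbol_map:
        # One symbol may map to several matching texts; register all of them.
        return list(symbol_map[cleaned])
    return [cleaned]
```

Calling `commands_for_text("√")` yields the two commands "OK" and "Confirm", while ordinary text such as "Play" is registered unchanged.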

[0040] The generated and registered voice commands can be stored in a preset target voice command library. This target voice command library can be independent of the original voice command library, thereby distinguishing the voice commands of this embodiment from the voice commands of the original voice interaction program.

[0041] Step S104: After receiving voice input information that matches the voice command, execute the voice command according to the attribute information corresponding to the voice command.

[0042] After voice input information is received, the preset target voice command library is queried for a voice command that matches the voice input information. If one exists, the voice command is executed according to its corresponding attribute information; if none exists, the normal interaction program (the original voice interaction program) is executed.

[0043] After receiving voice input information that matches the voice command, the system queries the page location of the text information corresponding to the voice command; and triggers a preset click operation event at the page location to trigger the control at the page location to perform the operation corresponding to the voice command.
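The lookup and click logic of step S104 can be sketched as follows. The `click` and `fallback` callables stand in for the platform's click-event injection and for the original voice interaction program; both, along with `handle_voice_input` and the lowercase normalization of library keys, are assumptions for illustration rather than details given in the embodiment.

```python
from typing import Callable, Dict, Tuple

Position = Tuple[int, int]


def handle_voice_input(
    utterance: str,
    target_library: Dict[str, Position],
    click: Callable[[int, int], None],
    fallback: Callable[[str], None],
) -> bool:
    """Execute a registered command, or defer to the original program.

    Assumes the keys of target_library were lowercased at registration time.
    Returns True if a registered voice command was executed.
    """
    position = target_library.get(utterance.strip().lower())
    if position is None:
        fallback(utterance)  # no match: run the normal interaction program
        return False
    click(*position)  # trigger the click event at the text's page position
    return True
```

Because the click is delivered to the page position rather than matched against control properties, the control responds even when its property text differs from what is displayed.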

[0044] In this embodiment, although the text information on the page may be inconsistent with the text information in the control properties, both belong to the same control, so the correspondence between them is inherent. As long as the control is triggered by the voice command of this embodiment, the control performs its operation, which is the operation corresponding to the voice command. In other words, the operation triggered by the voice command generated from the on-page text is the same operation that the voice command corresponding to the control-property text would trigger.

[0045] In this embodiment, if there are multiple voice commands that match the voice input information, the selected voice command is determined from the multiple voice commands by executing the command selection interaction operation, and the selected voice command is executed according to the attribute information corresponding to the selected voice command.

[0046] Furthermore, the instruction selection interaction operation includes: issuing an attribute information selection prompt; receiving secondary voice input information; wherein, the secondary voice input information is attribute information input according to the attribute information selection prompt; among the attribute information corresponding to the multiple voice instructions respectively, determining the attribute information that matches the secondary voice input information, and taking the voice instruction corresponding to the attribute information that matches the secondary voice input information as the selected voice instruction.

[0047] The instruction selection interaction operation can be a voice interaction. The attribute information selection prompt is used to prompt the user to input attribute information of the text information, such as the text's location on the page.
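The instruction selection interaction can be sketched as a prompt followed by a match on the reply. In this sketch the attribute information is a position label such as "top left", the `ask` callable stands in for issuing the prompt and receiving the secondary voice input, and the prompt wording and `select_command` name are hypothetical.

```python
from typing import Callable, List, Optional, Tuple

# (voice command text, position label such as "top left")
Candidate = Tuple[str, str]


def select_command(
    candidates: List[Candidate], ask: Callable[[str], str]
) -> Optional[Candidate]:
    """Pick one candidate using the attribute spoken in the secondary input."""
    if not candidates:
        return None
    if len(candidates) == 1:
        return candidates[0]  # nothing to disambiguate
    labels = ", ".join(label for _, label in candidates)
    reply = ask(f"Which one? Options: {labels}").strip().lower()
    matches = [c for c in candidates if c[1].lower() == reply]
    # Select only when the reply identifies exactly one candidate.
    return matches[0] if len(matches) == 1 else None
```

If two "Play" buttons are registered at "top left" and "bottom right" and the user answers "bottom right", the second candidate is returned and can then be executed by its page position.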

[0048] In this embodiment, when a change in the displayed elements of a page is detected, a screenshot is taken to obtain a corresponding page image. Text information is identified within the page image, and its attribute information is determined. Based on the text information in the page image, a voice command is generated and registered. Upon receiving voice input information matching the voice command, the voice command is executed according to its corresponding attribute information. This embodiment identifies user-visible text in an image through screenshots and registers this text as voice commands, thus achieving a voice control method where what the user sees can be spoken. Even if the text information on the page and the text information in the control attributes are inconsistent, voice control will not fail.

[0049] The following is a more specific example to illustrate the embodiments of the present invention. Figure 2 is a schematic diagram of voice control logic according to an embodiment of the present invention.

[0050] In this example, the position attribute is the position of the text area. For example: top left, bottom right, center, etc.
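A coarse position label of this kind could be derived from the center of a text area by dividing the screen into a three-by-three grid. The grid, the labels other than "top left", "bottom right", and "center", and the `region_label` function are assumptions for illustration; the example itself does not say how the labels are computed.

```python
def region_label(x: float, y: float, width: float, height: float) -> str:
    """Name the cell of a 3x3 screen grid that contains the point (x, y).

    Assumes 0 <= x <= width and 0 <= y <= height.
    """
    col = min(int(3 * x / width), 2)   # 0 = left, 1 = middle, 2 = right
    row = min(int(3 * y / height), 2)  # 0 = top, 1 = middle, 2 = bottom
    vertical = ["top", "", "bottom"][row]
    horizontal = ["left", "", "right"][col]
    label = f"{vertical} {horizontal}".strip()
    return label or "center"
```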

[0051] Step S201: Detect whether the screen content has changed; if yes, proceed to step S202; if no, repeat step S201.

[0052] Step S202: Take a screenshot of the page on the screen to obtain a page image.

[0053] Step S203: Identify the text content in the page image.

[0054] Step S204: Determine the location of the text area of the text content.

[0055] Step S205: Maintain the mapping relationship between the text region position and the text content.

[0056] Step S206: Generate and register the text content as a voice command. The voice command is automatically mapped to the location of the text region.

[0057] Step S207: Receive voice input information from the user.

[0058] Step S208: Using the registered voice command as the query basis, determine whether the voice input information has been registered as a voice command; if yes, proceed to step S209; if no, proceed to step S211.

[0059] Step S209: Based on the maintained mapping relationship between text region position and text content, find the text region position corresponding to the voice command.

[0060] Step S210: A click operation event is triggered at the location of the text area so that the control here can perform the corresponding operation, thereby completing the voice control.

[0061] Step S211: Execute the normal voice interaction process.
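Steps S202 to S211 can be tied together in a small controller, sketched below. The `recognize`, `click`, and `fallback` callables are placeholders for the OCR engine, the click-event injection, and the normal voice interaction process; they and the `VoiceController` class are hypothetical, and screen-change detection (S201) is assumed to happen in the caller.

```python
from typing import Callable, Dict, List, Tuple

Position = Tuple[int, int]
OcrResult = List[Tuple[str, Position]]  # (text content, text area position)


class VoiceController:
    def __init__(
        self,
        recognize: Callable[[bytes], OcrResult],
        click: Callable[[int, int], None],
        fallback: Callable[[str], None],
    ) -> None:
        self._recognize = recognize
        self._click = click
        self._fallback = fallback
        self._commands: Dict[str, Position] = {}

    def on_screen_changed(self, screenshot: bytes) -> None:
        """S203-S206: rebuild the text-to-position mapping for the new page."""
        self._commands = dict(self._recognize(screenshot))

    def on_voice_input(self, utterance: str) -> None:
        """S207-S211: click the mapped position, or run the normal process."""
        position = self._commands.get(utterance)  # S208: registered?
        if position is not None:
            self._click(*position)  # S209-S210
        else:
            self._fallback(utterance)  # S211
```

Replacing the whole mapping on every screen change means commands from a previous page stop matching, so a stale position is never clicked.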

[0062] This invention also provides a voice control device. Figure 3 is a structural diagram of a voice control device according to an embodiment of the present invention.

[0063] The voice control device includes:

[0064] The detection and screenshot module 301 is used to perform screenshot processing on the page when it detects that the display elements of the page have changed, and obtain the page image corresponding to the page.

[0065] The identification and determination module 302 is used to identify text information and determine the attribute information of the text information in the page image.

[0066] The generation and registration module 303 is used to generate and register voice commands based on the text information in the page image.

[0067] The instruction execution module 304 is used to execute the voice instruction according to the attribute information corresponding to the voice instruction after receiving voice input information that matches the voice instruction.

[0068] The attribute information of the text information includes: the page position where the text information is located; the instruction execution module 304 is used to query the page position where the text information corresponding to the voice instruction is located; and to trigger a preset click operation event at the page position so as to trigger the control at the page position to perform the operation corresponding to the voice instruction.

[0069] The generation and registration module 303 is used to query the matching text of the symbol mapping in a preset symbol text mapping table when the text information includes symbols, and generate and register voice commands based on the matching text; wherein the symbol text mapping table is used to record the mapping relationship between symbols and matching text.

[0070] The instruction execution module 304 is configured to, if there are multiple voice instructions that match the voice input information, perform an instruction selection interaction operation to determine the selected voice instruction among the multiple voice instructions, and execute the selected voice instruction according to the attribute information corresponding to the selected voice instruction.

[0071] The instruction execution module 304 is further configured to issue an attribute information selection prompt; receive secondary voice input information; wherein the secondary voice input information is attribute information input based on the attribute information selection prompt; among the attribute information corresponding to the multiple voice instructions respectively, determine the attribute information that matches the secondary voice input information, and take the voice instruction corresponding to the attribute information that matches the secondary voice input information as the selected voice instruction.

[0072] The functions of the device described in the embodiments of the present invention have been described in the above method embodiments. Therefore, for any parts not described in detail in this embodiment, please refer to the relevant descriptions in the foregoing embodiments, which will not be repeated here.

[0073] This embodiment also provides a voice control device. Figure 4 is a structural diagram of a voice control device according to an embodiment of the present invention.

[0074] In this embodiment, the voice control device includes, but is not limited to, a processor 401 and a memory 402.

[0075] The processor 401 is used to execute the voice control program stored in the memory 402 to implement the above-mentioned voice control method.

[0076] Specifically, the processor 401 is used to execute the voice control program stored in the memory 402 to perform the following steps: when a change in the display elements of the page is detected, a screenshot is taken of the page to obtain a page image corresponding to the page; in the page image, text information is identified and the attribute information of the text information is determined; based on the text information in the page image, a voice command is generated and registered; after receiving voice input information that matches the voice command, the voice command is executed according to the attribute information corresponding to the voice command.

[0077] The step of generating and registering voice commands based on the text information in the page image includes: when the text information includes symbols, querying the matching text of the symbol mapping in a preset symbol text mapping table, and generating and registering voice commands based on the matching text; wherein the symbol text mapping table is used to record the mapping relationship between symbols and matching text.

[0078] The attribute information of the text information includes: the page location where the text information is located.

[0079] The step of executing the voice command based on the attribute information corresponding to the voice command includes: querying the page position of the text information corresponding to the voice command; and triggering a preset click operation event at the page position so as to trigger the control at the page position to perform the operation corresponding to the voice command.

[0080] The step of executing the voice command according to the attribute information corresponding to the voice command after receiving voice input information that matches the voice command includes: if there are multiple voice commands that match the voice input information, then by performing an instruction selection interaction operation, determining the selected voice command among the multiple voice commands, and executing the selected voice command according to the attribute information corresponding to the selected voice command.

[0081] The step of selecting a voice command from multiple voice commands by executing an instruction selection interaction operation includes: issuing an attribute information selection prompt; receiving secondary voice input information; wherein the secondary voice input information is attribute information input according to the attribute information selection prompt; determining the attribute information that matches the secondary voice input information from the attribute information corresponding to the multiple voice commands respectively, and taking the voice command corresponding to the attribute information that matches the secondary voice input information as the selected voice command.

[0082] This invention also provides a storage medium. The storage medium stores one or more programs. The storage medium may include volatile memory, such as random access memory; it may also include non-volatile memory, such as read-only memory, flash memory, hard disk, or solid-state drive; and it may also include combinations of the above types of memory.

[0083] The one or more programs in the storage medium can be executed by one or more processors to implement the above-mentioned voice control method.

[0084] Specifically, the processor is used to execute a voice control program stored in the memory to perform the following steps: when a change in the display elements of the page is detected, a screenshot is taken of the page to obtain a page image corresponding to the page; in the page image, text information is identified and the attribute information of the text information is determined; based on the text information in the page image, a voice command is generated and registered; after receiving voice input information that matches the voice command, the voice command is executed according to the attribute information corresponding to the voice command.

[0085] The step of generating and registering voice commands based on the text information in the page image includes: when the text information includes symbols, querying the matching text of the symbol mapping in a preset symbol text mapping table, and generating and registering voice commands based on the matching text; wherein the symbol text mapping table is used to record the mapping relationship between symbols and matching text.

[0086] The attribute information of the text information includes: the page location where the text information is located.

[0087] The step of executing the voice command based on the attribute information corresponding to the voice command includes: querying the page position of the text information corresponding to the voice command; and triggering a preset click operation event at the page position so as to trigger the control at the page position to perform the operation corresponding to the voice command.

[0088] The step of executing the voice command according to the attribute information corresponding to the voice command after receiving voice input information that matches the voice command includes: if there are multiple voice commands that match the voice input information, then by performing an instruction selection interaction operation, determining the selected voice command among the multiple voice commands, and executing the selected voice command according to the attribute information corresponding to the selected voice command.

[0089] The step of selecting a voice command from multiple voice commands by executing an instruction selection interaction operation includes: issuing an attribute information selection prompt; receiving secondary voice input information; wherein the secondary voice input information is attribute information input according to the attribute information selection prompt; determining the attribute information that matches the secondary voice input information from the attribute information corresponding to the multiple voice commands respectively, and taking the voice command corresponding to the attribute information that matches the secondary voice input information as the selected voice command.

[0090] The above description is merely an embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of the claims of the present invention.

Claims

1. A voice control method, characterized in that it comprises:
when a change in the display elements of a page is detected, taking a screenshot of the page to obtain a page image corresponding to the page;
identifying text information in the page image, determining attribute information of the text information, and storing the text information in association with its attribute information;
generating and registering voice commands based on the text information in the page image;
upon receiving voice input information that matches a voice command, executing the voice command according to the attribute information corresponding to the voice command, wherein the attribute information corresponding to the voice command is the same as the attribute information corresponding to the text information;
wherein the step of generating and registering voice commands based on the text information in the page image comprises: when the text information includes a symbol, querying a preset symbol-text mapping table for the matching text mapped to the symbol, and generating and registering a voice command based on the matching text; wherein the symbol-text mapping table records the mapping relationship between symbols and matching text;
wherein the step of executing the voice command according to the attribute information corresponding to the voice command comprises: querying the page location of the text information corresponding to the voice command; and triggering a preset click event at the page location so that the control at the page location performs the operation corresponding to the voice command;
wherein the step of executing the voice command according to the attribute information corresponding to the voice command after receiving voice input information that matches the voice command comprises: if multiple voice commands match the voice input information, determining a selected voice command from the multiple voice commands by performing a command selection interaction, and executing the selected voice command according to the attribute information corresponding to the selected voice command;
wherein the step of determining the selected voice command from the multiple voice commands by performing the command selection interaction comprises: issuing an attribute information selection prompt, which prompts the user to input attribute information in text format; receiving secondary voice input information, wherein the secondary voice input information is attribute information input in response to the attribute information selection prompt; and, among the attribute information corresponding to the multiple voice commands, determining the attribute information that matches the secondary voice input information, and taking the voice command corresponding to that attribute information as the selected voice command.
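The command generation and selection steps of claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation: the contents of `SYMBOL_TEXT`, the `Command` fields, the `label` attribute (a short textual descriptor such as "top" or "bottom") and the `ask` callback are all hypothetical names chosen for illustration. The claim only requires that a preset symbol-text mapping table exists and that a secondary voice input is matched against the stored attribute information.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical preset symbol-text mapping table (contents are an assumption).
SYMBOL_TEXT: Dict[str, str] = {"+": "plus", ">": "next", "<": "back"}


def to_phrase(text: str) -> str:
    """Replace mapped symbols with their matching text and normalize case/spacing."""
    for symbol, spoken in SYMBOL_TEXT.items():
        text = text.replace(symbol, f" {spoken} ")
    return " ".join(text.lower().split())


@dataclass(frozen=True)
class Command:
    phrase: str                 # phrase the user is expected to say
    position: Tuple[int, int]   # attribute information: page location (x, y)
    label: str                  # attribute information in text form (assumed)


class CommandRegistry:
    """Stores registered voice commands, keyed by their spoken phrase."""

    def __init__(self) -> None:
        self._commands: Dict[str, List[Command]] = {}

    def register(self, text: str, position: Tuple[int, int], label: str) -> Command:
        command = Command(to_phrase(text), position, label)
        self._commands.setdefault(command.phrase, []).append(command)
        return command

    def match(self, spoken: str) -> List[Command]:
        return list(self._commands.get(to_phrase(spoken), []))


def select(candidates: List[Command], ask: Callable[[str], str]) -> Optional[Command]:
    """Return the single match, or ask for attribute information to disambiguate."""
    if len(candidates) <= 1:
        return candidates[0] if candidates else None
    prompt = "Several matches; say one of: " + ", ".join(c.label for c in candidates)
    answer = to_phrase(ask(prompt))  # secondary voice input, already transcribed
    for candidate in candidates:
        if to_phrase(candidate.label) == answer:
            return candidate
    return None


registry = CommandRegistry()
registry.register("+", (10, 20), "top")
registry.register("OK", (50, 300), "top")
registry.register("OK", (50, 900), "bottom")
chosen = select(registry.match("ok"), lambda prompt: "bottom")
```

In this sketch the recognized symbol "+" is registered under the phrase "plus", and the two "OK" entries are told apart by the attribute the user names in the secondary input. Speech recognition itself is assumed to happen outside the sketch.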

2. The method according to claim 1, characterized in that the attribute information of the text information includes the page location where the text information is located.

3. A voice control device employing the method according to any one of claims 1 to 2, characterized in that it comprises:
a detection and screenshot module, configured to take a screenshot of a page when a change in the display elements of the page is detected, so as to obtain a page image corresponding to the page;
an identification and determination module, configured to identify text information in the page image and determine attribute information of the text information;
a generation and registration module, configured to generate and register voice commands based on the text information in the page image; and
an instruction execution module, configured to execute a voice command according to the attribute information corresponding to the voice command after receiving voice input information that matches the voice command.

4. The device according to claim 3, characterized in that the attribute information of the text information includes the page location where the text information is located; and the instruction execution module is configured to query the page location of the text information corresponding to the voice command, and to trigger a preset click event at the page location so that the control at the page location performs the operation corresponding to the voice command.
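The four modules of claims 3 and 4 can be pictured as one small pipeline. The Python sketch below is illustrative only: `capture`, `recognize` and `click` are hypothetical callables standing in for the platform's screenshot, text recognition and click-injection facilities, none of which the claims name. Disambiguation of duplicate texts is omitted for brevity.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, int]
Recognized = Tuple[str, Point]  # (text information, page location)


class VoiceControlDevice:
    """Illustrative wiring of the four claimed modules (all names are assumptions)."""

    def __init__(
        self,
        capture: Callable[[], object],
        recognize: Callable[[object], List[Recognized]],
        click: Callable[[int, int], None],
    ) -> None:
        self._capture = capture
        self._recognize = recognize
        self._click = click
        self._commands: Dict[str, Point] = {}

    def on_page_changed(self) -> None:
        # Detection and screenshot module: called when display elements change.
        image = self._capture()
        # Identification and determination module: text plus its page location.
        items = self._recognize(image)
        # Generation and registration module: rebuild the command table.
        self._commands = {text.strip().lower(): pos for text, pos in items}

    def on_voice_input(self, spoken: str) -> bool:
        # Instruction execution module: look up the location, then click there.
        position = self._commands.get(spoken.strip().lower())
        if position is None:
            return False
        self._click(*position)
        return True


clicks: List[Point] = []
device = VoiceControlDevice(
    capture=lambda: "fake-image",
    recognize=lambda image: [("Play", (120, 40)), ("Settings", (300, 40))],
    click=lambda x, y: clicks.append((x, y)),
)
device.on_page_changed()
handled = device.on_voice_input("play")
```

Because the click is dispatched at the recognized page location rather than through a control's attribute text, the sketch does not depend on the control's own text matching what is displayed.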

5. A voice control apparatus, characterized in that the voice control apparatus comprises a processor and a memory, the processor being configured to execute a voice control program stored in the memory to implement the voice control method according to any one of claims 1 to 2.

6. A storage medium, characterized in that the storage medium stores one or more programs executable by one or more processors to implement the voice control method according to any one of claims 1 to 2.