Voice data-based interaction control method and device, equipment and storage medium

By performing semantic structuring on text data and assigning weight values, a semantic understanding model is trained, which solves the problem of low accuracy in voice interaction control and achieves more efficient voice data recognition and control command execution.

CN117953895B · Active · Publication Date: 2026-06-19 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date: 2022-10-31
Publication Date: 2026-06-19

Application Information

Patent Timeline

31 Oct 2022: Application
19 Jun 2026: Publication

CN117953895B

IPC: G10L15/26; G10L15/22; G10L15/18; G10L15/16; G10L15/06

AI Tagging

Application Domain

Speech recognition

AI Technical Summary

Technical Problem

In existing technologies, the method of identifying control commands through the physical spectrum characteristics of sound results in low accuracy of voice interaction control, and there is a lack of effective improvement solutions.

Method used

By acquiring text data and performing semantic structuring, assigning weight values to control commands, and training a semantic understanding model based on weighted semantic data, the accuracy of converting speech data into text data and recognizing control commands is improved.

Benefits of technology

It improves the accuracy of voice interaction control, saves computing resources, and enhances recognition efficiency during the interaction control process.

Abstract

This application provides an interactive control method, apparatus, device, and storage medium based on voice data. The method includes: acquiring text data for training samples, wherein the text data includes multiple control instruction texts; performing semantic structuring processing on the text data to obtain semantic structure data for each control instruction; acquiring a weight value corresponding to each control instruction; labeling the semantic structure data of each control instruction based on the weight value of each control instruction to obtain weighted semantic data; and training a semantic understanding model based on the weighted semantic data, wherein the trained semantic understanding model is used to convert voice data into text data and recognize the control instructions corresponding to the text data. This application improves the accuracy of interactive control of terminal devices using voice data.

Description

Technical Field

[0001] This application relates to artificial intelligence technology, and more particularly to an interactive control method, apparatus, device, and storage medium based on voice data.

Background Technology

[0002] Artificial Intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or computer-controlled machines to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. Key technologies in speech technology include automatic speech recognition, speech synthesis, and voiceprint recognition. Enabling computers to hear, see, speak, and feel is the future direction of human-computer interaction, with speech being one of the most promising methods for future human-computer interaction.

[0003] Related technologies identify keywords in control commands by analyzing the physical spectrum characteristics of sound, thereby determining the corresponding control commands and enabling the terminal device to execute these commands for interactive control. However, this method relies on offline voice keyword recognition, which results in a low keyword matching rate, thus affecting the accuracy of voice-based interactive control.

[0004] Currently, there is no good technical solution for improving the accuracy of interactive control of terminal devices through voice data.

Summary of the Invention

[0005] This application provides an interactive control method, device, electronic device, computer-readable storage medium, and computer program product based on voice data, which can improve the accuracy of interactive control of terminal devices through voice data.

[0006] The technical solution of this application embodiment is implemented as follows:

[0007] This application provides an interactive control method based on voice data, including:

[0008] Obtain text data for use as training samples, wherein the text data includes multiple control instruction texts;

[0009] The text data is subjected to semantic structuring processing to obtain the semantic structure data of each control instruction;

[0010] Obtain the weight value corresponding to each of the control commands;

[0011] The semantic structure data of each control instruction is labeled based on the weight value of each control instruction to obtain weighted semantic data;

[0012] The semantic understanding model is trained based on the weighted semantic data, wherein the trained semantic understanding model is used to convert the speech data into text data and identify the control commands corresponding to the text data.
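The data flow of the steps above can be sketched as follows. This is a minimal illustration only, not the implementation of this application: the names `WeightedSample` and `build_weighted_semantic_data`, the pre-tokenized input, and the toy structuring and weighting functions passed in at the end are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class WeightedSample:
    """One training sample: semantic structure data labeled with a weight (hypothetical layout)."""
    template: str
    weight: float


def build_weighted_semantic_data(
    instructions: Sequence[Sequence[str]],
    structure_fn: Callable[[Sequence[str]], str],
    weight_fn: Callable[[Sequence[str]], float],
) -> List[WeightedSample]:
    """Structure each tokenized control instruction text and label it with its weight value."""
    return [WeightedSample(structure_fn(words), weight_fn(words)) for words in instructions]


# Toy stand-ins for illustration; they are NOT the structuring or weighting defined in this application.
samples = build_weighted_semantic_data(
    [["play"], ["I", "want", "play"]],
    structure_fn=lambda ws: " ".join("$Action" if w == "play" else w for w in ws),
    weight_fn=lambda ws: 1.0 / len(ws),
)
```

Passing the structuring and weighting steps in as functions keeps the sketch independent of how either step is actually realized; the resulting list would then be handed to whatever training routine is used for the semantic understanding model.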

[0013] This application provides an interactive control method based on voice data, the method comprising:

[0014] Display virtual scenes in the human-computer interaction interface;

[0015] Acquire voice data;

[0016] Based on the voice data, a semantic understanding model is invoked to perform semantic recognition processing to determine the control command corresponding to the voice data. The semantic understanding model is trained using the voice data-based interactive control method of this application embodiment.

[0017] Execute the control command.
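The recognition-then-execution flow above might look roughly like the following sketch. It starts from already transcribed text, since the speech-to-text step is outside what a short example can show. The function `recognize_command` is a hypothetical rule-based stand-in for the trained semantic understanding model, and the command names and handlers are assumptions for illustration.

```python
from typing import Callable, Dict


def recognize_command(text: str) -> str:
    """Hypothetical stand-in for the trained semantic understanding model."""
    rules = {"play a card": "PLAY_CARD", "move left": "MOVE_LEFT", "jump": "JUMP"}
    return rules.get(text.strip().lower(), "UNKNOWN")


# Hypothetical handlers; a real client would update the virtual scene instead.
HANDLERS: Dict[str, Callable[[], str]] = {
    "PLAY_CARD": lambda: "card played",
    "MOVE_LEFT": lambda: "moved left",
    "JUMP": lambda: "jumped",
}


def handle_utterance(text: str) -> str:
    """Determine the control command for the text and execute its handler, if any."""
    command = recognize_command(text)
    handler = HANDLERS.get(command)
    return handler() if handler else "ignored"
```

Unrecognized input falls through to `"ignored"` rather than raising, which is one reasonable choice for a voice interface where misrecognition is expected.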

[0018] This application provides an interactive control device based on voice data, including:

[0019] The sample acquisition module is configured to acquire text data for use as training samples, wherein the text data includes multiple control instruction texts;

[0020] The sample processing module is configured to perform semantic structuring processing on the text data to obtain semantic structure data for each control instruction;

[0021] The sample processing module is configured to obtain the weight value corresponding to each of the control commands;

[0022] The sample processing module is configured to annotate the semantic structure data of each control instruction based on the weight value of each control instruction to obtain weighted semantic data.

[0023] The model training module is configured to train the semantic understanding model based on the weighted semantic data, wherein the trained semantic understanding model is used to convert the speech data into text data and recognize the control commands corresponding to the text data.

[0024] This application provides an interactive control device based on voice data, the device comprising:

[0025] The display module is configured to display a virtual scene in the human-computer interaction interface;

[0026] The voice acquisition module is configured to acquire voice data;

[0027] The recognition module is configured to call a semantic understanding model to perform semantic recognition processing based on the voice data, and determine the control command corresponding to the voice data. The semantic understanding model is trained by the voice data-based interactive control method of this application embodiment.

[0028] The display module is also configured to execute the control commands.

[0029] This application provides an electronic device, including:

[0030] A memory, configured to store computer-executable instructions;

[0031] A processor, configured to implement the interactive control method based on voice data provided in the embodiments of this application when executing the computer-executable instructions stored in the memory.

[0032] This application provides a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the interactive control method based on voice data provided in this application.

[0033] This application provides a computer program product, including a computer program or computer executable instructions, which, when executed by a processor, implements the voice data-based interactive control method provided in this application.

[0034] The embodiments of this application have the following beneficial effects:

[0035] By converting the text of control commands into semantic structure data and labeling the semantic structures with weights, weighted semantic data is generated based on the labeled language structure data. This structuring of text data improves the accuracy of training sample labeling, enhances the accuracy of training the semantic understanding model, and consequently improves the accuracy of the semantic understanding model in recognizing semantics and control commands during interactive control, while saving computational resources required for interactive control.

Attached Figure Description

[0036] Figure 1 is a schematic diagram of the application mode of the interactive control method based on voice data provided in the embodiments of this application;

[0037] Figure 2A is a schematic diagram of the server structure provided in an embodiment of this application;

[0038] Figure 2B is a schematic diagram of the structure of the semantic understanding model provided in the embodiments of this application;

[0039] Figures 3A to 3G are flowcharts illustrating the interactive control method based on voice data provided in an embodiment of this application;

[0040] Figures 4A to 4C are schematic diagrams of the human-computer interaction interface corresponding to the terminal device provided in the embodiments of this application;

[0041] Figure 4D is a flowchart illustrating the interactive control method based on voice data provided in an embodiment of this application;

[0042] Figure 5 is a flowchart illustrating the interactive control method based on voice data provided in an embodiment of this application;

[0043] Figure 6A is a flowchart illustrating the interactive control method based on voice data provided in an embodiment of this application;

[0044] Figure 6B is a schematic diagram illustrating the application scenario provided in the embodiments of this application;

[0045] Figure 6C is a flowchart illustrating the interactive control method based on voice data provided in an embodiment of this application;

[0046] Figure 6D is a schematic diagram of training a speech recognition model in an embodiment of this application;

[0047] Figure 6E and Figure 6F are flowcharts illustrating the interactive control method based on voice data provided in an embodiment of this application;

[0048] Figure 7A is a schematic diagram of the data structure of the ASR speech recognition engine in an embodiment of this application;

[0049] Figure 7B is a schematic diagram of the data structure of the semantic understanding model in an embodiment of this application;

[0050] Figure 7C is a comparison table of the effects of the embodiments of this application.

Detailed Implementation

[0051] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0052] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0053] In the following description, the terms "first, second, third" are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that "first, second, third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0054] It should be noted that in the embodiments of this application, user information, user feedback data and other related data are involved. When the embodiments of this application are applied to specific products or technologies, user permission or consent is required, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0055] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0056] Before providing a further detailed description of the embodiments of this application, the nouns and terms involved in the embodiments of this application will be explained, and the nouns and terms involved in the embodiments of this application shall be interpreted as follows.

[0057] 1) Automatic Speech Recognition (ASR): A technology that converts human speech into text, with the goal of converting the vocabulary of human language into computer-readable input, such as keystrokes, binary codes, or character sequences.

[0058] 2) Semantics: The meaning of a word, symbol, action, etc. Semantic understanding uses a series of AI algorithms to parse text into structured, machine-readable intent and slot information.

[0059] 3) Natural Language Understanding (NLU): This is a general term for all methods, models, or tasks that support machines in understanding text content.

[0060] 4) Classification Model: Based on the classification criteria and specific categories in the sample data, a classification model predicts which category a given object belongs to. For example, based on a sample, predicting the probability that the sample belongs to a different type.

[0061] This application provides an interactive control method, an interactive control device, an electronic device, a computer-readable storage medium, and a computer program product based on voice data, which can improve the accuracy of interactive control of terminal devices through voice data.

[0062] The following describes exemplary applications of the electronic devices provided in the embodiments of this application. These electronic devices can be implemented as various types of user terminals, such as laptops, tablets, desktop computers, set-top boxes, mobile devices (e.g., mobile phones, portable music players, personal digital assistants, dedicated messaging devices, portable gaming devices), and in-vehicle terminals, or as servers. The following will describe exemplary applications when the electronic device is implemented as a server.

[0063] Referring to Figure 1, Figure 1 is a schematic diagram illustrating the application mode of the interactive control method based on voice data provided in the embodiments of this application. For example, the system in Figure 1 involves a training server 200-1, a speech recognition server 200-2, a network 300, and a terminal device 400. The training server 200-1 and the speech recognition server 200-2 communicate through the network 300 or through other means. The terminal device 400 connects to the speech recognition server 200-2 through the network 300. The network 300 can be a wide area network, a local area network, or a combination of both.

[0064] In some embodiments, a game application, such as a card game, runs on the terminal device 400; the speech recognition server 200-2 is the server of the game platform, which runs a speech recognition service; the training server 200-1 generates corresponding training samples based on the control command text in the game, and trains the semantic understanding model. The user can be a player. The following is an explanation based on the above examples.

[0065] For example, training server 200-1 trains a semantic understanding model using the voice data-based interactive control method of the embodiments of this application, and synchronizes the semantic understanding model to speech recognition server 200-2. When terminal device 400 receives the user's voice, it converts the voice signal into voice data and sends the voice data to speech recognition server 200-2. Speech recognition server 200-2 calls the semantic understanding model based on the voice data to determine the semantics and the corresponding control command, and sends the control command and semantics to terminal device 400. Terminal device 400 executes the corresponding control command and displays the corresponding game screen, thereby improving the efficiency of voice interactive control.

[0066] In some embodiments, the interactive control method based on voice data of this application can also be applied in the following application scenarios:

[0067] (1) Autonomous driving: The user speaks control commands related to autonomous driving to the terminal device. The terminal device recognizes the user's voice, converts the sound signal into text data, and uses the semantic understanding model trained by the interactive control method based on voice data in the embodiments of this application to determine the control commands and control the vehicle to execute the corresponding control commands.

[0068] (2) Application control: For example, the application is online conferencing software. Users can say control commands such as "enter the meeting" and "exit the meeting" to the terminal device. The terminal device recognizes the user's voice, converts the sound signal into text data, and calls the semantic understanding model trained by the voice data-based interactive control method in this application embodiment to determine the control command and control the online conferencing software to execute the corresponding entry or exit control command.

[0069] (3) Game interaction control: For example, if the game is a card game, the user says the control command "play a card" to the terminal device. The terminal device recognizes the user's voice, converts the sound signal into text data, and calls the semantic understanding model trained by the voice data-based interaction control method in the embodiment of this application to determine that the control command is to play a card. The card game is controlled to play cards automatically, and the card result is displayed in the human-computer interaction interface of the terminal device.

[0070] For example, a client (e.g., a game application) runs on a terminal device. During the operation of the client, it outputs a virtual scene that includes role-playing elements. The virtual scene can be an environment in which game characters interact, such as a plain, street, or valley where game characters fight. The first virtual object can be a game character controlled by the user, that is, the first virtual object is controlled by the real user. The user speaks control commands such as "move left" or "jump" to the terminal device to control the movement of the virtual object. The terminal device recognizes the user's voice, converts the sound signal into text data, and calls the semantic understanding model trained by the voice data-based interactive control method of this application embodiment to determine the corresponding control command, such as moving left or jumping. The virtual object is then controlled to execute the command, and the movement process of the virtual object is displayed on the human-computer interaction interface of the terminal device.

[0071] This application embodiment can be implemented using blockchain technology. The semantic understanding model trained in this application embodiment can be uploaded to the blockchain for storage, and the reliability of the semantic understanding model can be guaranteed through a consensus algorithm. Blockchain is a new application model of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. Essentially, a blockchain is a decentralized database, a chain of data blocks linked using cryptographic methods. Each data block contains information about a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and generate the next block. A blockchain can include a blockchain underlying platform, a platform product service layer, and an application service layer.

[0072] This application embodiment can be implemented using database technology. A database, simply put, can be viewed as an electronic filing cabinet storing electronic files, where users can perform operations such as adding, querying, updating, and deleting data. A "database" is a collection of data stored together in a certain way, capable of being shared by multiple users, having minimal redundancy, and being independent of application programs.

[0073] A Database Management System (DBMS) is a computer software system designed to manage databases, generally possessing basic functions such as storage, retrieval, security, and backup. DBMSs can be classified according to the database model they support, such as relational or XML (Extensible Markup Language); or according to the type of computer they support, such as server clusters or mobile devices; or according to the query language used, such as Structured Query Language (SQL) or XQuery; or according to performance priorities, such as maximum scale or maximum operating speed; or other classification methods. Regardless of the classification method used, some DBMSs can cross categories, for example, simultaneously supporting multiple query languages.

[0074] This application embodiment can also be implemented using cloud technology. Cloud technology is a general term for network technology, information technology, integration technology, management platform technology, and application technology based on cloud computing business models. It can form a resource pool, available on demand, offering flexibility and convenience. Cloud computing technology will become a crucial support. Backend services of technical network systems require substantial computing and storage resources, such as video websites, image websites, and many portal websites. With the rapid development and application of the internet industry, and driven by demands for search services, social networks, mobile commerce, and open collaboration, every item may eventually possess its own hash-coded identification mark, requiring transmission to a backend system for logical processing. Data at different levels will be processed separately, and various industry data will require robust system support, which can only be achieved through cloud computing.

[0075] In some embodiments, training server 200-1 and speech recognition server 200-2 can be integrated into a single physical server.

[0076] In some embodiments, the training server 200-1 or the speech recognition server 200-2 can be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. Electronic devices can be smartphones, tablets, laptops, desktop computers, smart speakers, smartwatches, etc., but are not limited to these. Terminal devices and servers can be directly or indirectly connected via wired or wireless communication, which is not limited in this embodiment of the invention.

[0077] Referring to Figure 2A, Figure 2A is a schematic diagram of the server structure provided in an embodiment of this application. The training server 200-1 shown in Figure 2A includes at least one processor 410, memory 450, and at least one network interface 420. The various components in the training server 200-1 are coupled together via a bus system 440. It is understood that the bus system 440 is used to implement communication between these components. In addition to a data bus, the bus system 440 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all buses are labeled as bus system 440 in Figure 2A.

[0078] The processor 410 can be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a digital signal processor (DSP), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor, etc.

[0079] The memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid-state storage, hard disk drives, optical disk drives, etc. The memory 450 may optionally include one or more storage devices physically located away from the processor 410.

[0080] The memory 450 may include volatile memory or non-volatile memory, or both. The non-volatile memory may be read-only memory (ROM), and the volatile memory may be random access memory (RAM). The memory 450 described in this application embodiment is intended to include any suitable type of memory.

[0081] In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules, and data structures or subsets or supersets thereof, as illustrated below.

[0082] Operating system 451 includes system programs for handling various basic system services and performing hardware-related tasks, such as the framework layer, core library layer, driver layer, etc., for implementing various basic business functions and handling hardware-based tasks;

[0083] The network communication module 452 is used to reach other electronic devices via one or more (wired or wireless) network interfaces 420, exemplary network interfaces 420 including: Bluetooth, WiFi, and Universal Serial Bus (USB), etc.

[0084] In some embodiments, the voice data-based interactive control device provided in this application can be implemented in software. Figure 2A shows an interactive control device 455 based on voice data stored in memory 450. This device can be software in the form of programs and plug-ins, and includes the following software modules: a sample acquisition module 4551, a sample processing module 4552, and a model training module 4553. These modules are logical, and can therefore be arbitrarily combined or further split according to the functions they implement. The functions of each module will be described below.

[0085] The interactive control method based on voice data provided in this application will be described in conjunction with exemplary applications and implementations of the server provided in the embodiments of this application.

[0086] Referring to Figure 3A, Figure 3A is a flowchart illustrating the interactive control method based on voice data provided in the embodiments of this application, which is described below in conjunction with the steps shown in Figure 3A.

[0087] In step 301, text data is obtained to be used as training samples.

[0088] For example, the text data includes multiple control instruction texts. Based on the business requirements corresponding to the interactive control (e.g., games, autonomous driving, smart homes, applications, etc.), multiple control instruction texts corresponding to the business requirements can be obtained, and the obtained text data can be used as training samples. For example, if the business of the interactive control is a card game, the control instructions would be "play a card," "bid for the landlord," etc. As another example, if the business of the interactive control is autonomous driving, the control instructions would be "turn left," "pull over," "accelerate," etc. The embodiments of this application will be explained below using a card game as an example.

[0089] In step 302, the text data is semantically structured to obtain the semantic structure data of each control instruction.

[0090] For example, semantics refers to the meaning of words in a language, and semantic structuring is the structural transformation of text data based on its semantics. In some embodiments, referring to Figure 3B, Figure 3B is a flowchart illustrating the interactive control method based on voice data provided in the embodiments of this application; step 302 can be implemented through the following steps 3021 to 3022, which are described in detail below.

[0091] In step 3021, the following processing is performed for each control instruction text: obtain the vocabulary attributes of each word in the control instruction text.

[0092] For example, lexical attributes include entity words and non-entity words. Entity words can be character names, object names, verbs, etc., while non-entity words can be modal particles. For example, in the control command "I want to play the Six of Hearts," "I" is a character, "want" is a non-entity word, "play" is an action (a verb), "hearts" is the suit of the card, and "six" is the card's sequence number (or value). As another example, in "I want to rob the landlord," "I" is a character, "want" is a non-entity word, "rob" is an action, and "landlord" is a character.

[0093] In step 3022, entity words are replaced with lexical attribute tags corresponding to each entity word in the control instruction text, while retaining each non-entity word in the control instruction text, to obtain the semantic structure data of each control instruction.

[0094] For example, a lexical attribute tag is a tag that names the attribute of a word. Taking a card game as an example, the attributes of entity words in control instructions include: sequence number, character, action, suit, shape, size, etc. Continuing with the control instruction text example above, each entity word in "I want to play the Six of Hearts" is replaced with its corresponding lexical attribute tag, while the non-entity word is retained, to obtain the semantic structure data "$Character 1 wants $Action $Suit $Size".
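Steps 3021 and 3022 can be illustrated with a short sketch. It assumes the control instruction text has already been split into words and that a lookup table of entity words is available; the table `ENTITY_TAGS`, the function name `structure`, and the exact tag spellings are hypothetical and chosen only to mirror the example above.

```python
# Hypothetical lexicon mapping entity words to lexical attribute tags.
# The application does not specify a lexicon format; this table is an assumption.
ENTITY_TAGS = {
    "i": "$Character",
    "play": "$Action",
    "hearts": "$Suit",
    "six": "$Size",
}


def structure(words):
    """Replace entity words with their attribute tags and keep non-entity words unchanged."""
    return " ".join(ENTITY_TAGS.get(word.lower(), word) for word in words)


print(structure(["I", "want", "play", "hearts", "six"]))
# prints: $Character want $Action $Suit $Size
```

Words missing from the table, such as "want", pass through untouched, which corresponds to retaining non-entity words.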

[0095] In this embodiment of the application, text data is converted into semantic structure data, making the text data easier for computer devices to process, improving the efficiency of training models, thereby improving the accuracy of training models and improving the interaction efficiency of interactive control.

[0096] Continuing to refer to Figure 3A, in step 303, the weight value corresponding to each control command is obtained.

[0097] In some embodiments, reference Figure 3C , Figure 3C This is a flowchart illustrating the interactive control method based on voice data provided in this application embodiment; step 303 can be implemented through the following steps 3031 to 3033, which are described in detail below.

[0098] In step 3031, the following processing is performed for each control instruction text: the number of words in the control instruction text is obtained.

[0099] Continuing with the card game example above, for instance, the control command "I want to play the six of hearts" includes five words (I, want, play, hearts, six); the control command "play a card" is a single word used to control the game client to automatically play cards.

[0100] In step 3032, when the number of words is 1, the probability of occurrence of the words corresponding to the control instruction text in the text data is obtained, and the probability of occurrence is used as the weight value corresponding to the control instruction.

[0101] For example, the text data is the text data used as training samples in step 301. When the number of words corresponding to the control instruction text is 1, the probability of the word corresponding to the control instruction text appearing in all the text data used as training samples is obtained. The probability of occurrence is the ratio of the following parameters: the frequency of the word appearing in all the text data and the total number of words in the text data.

[0102] In step 3033, when the number of words is greater than 1, each word in the control instruction text is combined into a word sequence, and the word sequence is subjected to probability prediction processing to obtain the word sequence probability. The word sequence probability is used as the weight value of the control instruction.

[0103] For example, when the number of words is greater than 1, taking "I want to play the six of hearts" as an example, the combination is the word sequence [I, want, play, hearts, six]. The probability of the word sequence [I, want, play, hearts, six] can be predicted using an N-gram language model. The principle of N-gram language model prediction is: taking the first N-1 words in the word sequence as history, predicting the probability of the Nth word appearing. The probability of the word sequence [I, want, play, hearts, six] can be represented as the product of the probabilities of each word appearing in the word sequence.

[0104] For example, steps 3031 to 3033 can be represented by the following formula (1).

[0105]

[0106] Wherein uni is a single word in the dictionary; count(x) is the statistical word frequency of a single word; SUM_CNT is the sum of the frequencies of all words; N-gram_prob is the probability score predicted by an N-gram model; and gram_count is the number of words included in the control instruction.
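Read together with steps 3032 and 3033 and the symbol definitions above, the scoring rule of formula (1) can be sketched as a piecewise function. The notation below is a reconstruction under that assumption, with the symbols written as score, count, SUM_CNT, N-gram_prob and gram_count; it is not the original typesetting.

$$
\mathrm{score}(x)=
\begin{cases}
\dfrac{\mathrm{count}(x)}{\mathrm{SUM\_CNT}}, & \mathrm{gram\_count}(x)=1,\ x\in \mathrm{uni}\\[6pt]
\text{N-gram\_prob}(x), & \mathrm{gram\_count}(x)>1
\end{cases}
$$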

[0107] For example: 1. The control command text is "Play Card", which includes one word. The frequency of the word "Play Card" (i.e., the number of times it occurs in the text data used as training samples) is 60, and the total number of words in the text data is 10,000. Therefore, using the frequency of the control command text as the score, score('Play Card') = 60 / 10,000 = 0.006. 2. The control command text is "Six of Hearts", which consists of two words, so gram_count equals 2. Using the N-gram model to calculate the probability, N-gram_prob('hearts', 'six') = 0.018.
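The two cases of the worked example can be sketched as follows. This is a minimal illustration that assumes an unsmoothed bigram (N = 2) model estimated from raw counts; the count tables `UNIGRAMS` and `BIGRAMS` are hypothetical values chosen only so that the results match the numbers above, and the function names are not from the source.

```python
from collections import Counter

# Hypothetical counts; the unigram counts sum to 10,000 words in total.
UNIGRAMS = Counter({"play card": 60, "hearts": 200, "six": 300, "<other>": 9440})
BIGRAMS = Counter({("hearts", "six"): 180})


def unigram_score(word, unigrams):
    """Weight of a one-word command: its frequency over the total word count."""
    return unigrams[word] / sum(unigrams.values())


def bigram_sequence_prob(words, unigrams, bigrams):
    """Chain-rule probability of a word sequence under an unsmoothed bigram model."""
    prob = unigrams[words[0]] / sum(unigrams.values())
    for prev, cur in zip(words, words[1:]):
        # P(cur | prev) estimated as count(prev, cur) / count(prev).
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


def command_weight(words, unigrams, bigrams):
    """Choose the scoring rule by the number of words in the command."""
    if len(words) == 1:
        return unigram_score(words[0], unigrams)
    return bigram_sequence_prob(words, unigrams, bigrams)


print(round(command_weight(["play card"], UNIGRAMS, BIGRAMS), 6))
print(round(command_weight(["hearts", "six"], UNIGRAMS, BIGRAMS), 6))
```

Without smoothing, an unseen word pair receives probability zero and an unseen history word causes a division by zero, so a practical N-gram toolkit would add smoothing or back-off.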

[0108] Continuing to refer to Figure 3A, in step 304, the semantic structure data of each control instruction is labeled based on the weight value of each control instruction to obtain weighted semantic data.

[0109] In some embodiments, more weighted semantic data can be generated for training by using the semantic structure data of existing training samples as templates. Referring to Figure 3D, which is a schematic flowchart of the interaction control method based on voice data provided by an embodiment of the present application, step 304 can be implemented through the following steps 3041 to 3045, which are specifically described below.

[0110] In step 3041, multiple entity words associated with each lexical attribute tag are obtained, as well as the occurrence frequency corresponding to each entity word, and the occurrence frequency corresponding to each entity word is used as the weight value of the entity word.

[0111] Exemplarily, the method for obtaining the occurrence frequency has been described in step 3032 above.

[0112] In step 3042, the following processing is performed on the semantic structure data of each control instruction: the semantic structure data of the control instruction is used as a semantic template.

[0113] For example, the semantic structure data "$Character 1 wants $Action $Suit $Size" corresponding to "I want to play the six of hearts" is used as a semantic template (PAT), and each lexical attribute tag included in the semantic template (PAT) can be used as a keyword (Slot).

[0114] In step 3043, the multiple entity words associated with the lexical attribute tags included in the semantic template are combined to obtain multiple segments of new semantic structure data.

[0115] In some embodiments, step 3043 can be implemented in the following manner: multiple combination processes are performed based on the semantic template to obtain multiple different word sequences, where each combination process includes: extracting one target entity word from the multiple entity words associated with each lexical attribute tag; and combining the target entity words into a word sequence according to the order of the lexical attribute tags in the semantic template. The non-entity words in the semantic template are then combined with each word sequence to obtain multiple segments of new semantic structure data.

[0116] For example, assume that word set A = {word a, word b} and word set B = {word 0, word 1, word 2}. Combining each word in A with each word in B yields the word sequence set {(word a, word 0), (word a, word 1), (word a, word 2), (word b, word 0), (word b, word 1), (word b, word 2)}. Assuming that the non-entity word in the semantic template is "wants", combining on the basis of "I want to play the six of hearts" yields instructions corresponding to new semantic structure data, such as "I want to play the three of spades" and "I want to play the two of diamonds".

[0117] In some embodiments, before combining non-entity words in the semantic template with each word sequence to obtain new multi-segment semantic structure data, the new semantic structure data can be trimmed in the following way (Beam algorithm): Perform the following processing for each word sequence: multiply the weight values corresponding to each entity word included in the word sequence in turn to obtain the word sequence probability corresponding to the word sequence; sort each word sequence in descending order based on the word sequence probability, and retain at least one word sequence at the head of the result of the descending sorting, wherein at least one word sequence is used to generate new semantic structure data.

[0118] For example, step 3043 can be represented by the following formulas (2.1), (2.2), and (2.3).

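[0119] $$\mathrm{decare}(A,B)=\{(x_i,\,y_i,\,w_{a_i}\cdot w_{b_i})\mid x_i\in A,\ y_i\in B,\ w_{a_i}\in A,\ w_{b_i}\in B\}\tag{2.1}$$

[0120] $$\mathrm{beam}(\{c_1,c_2,\dots,c_n\})=\{c_i\mid 1\le i\le n,\ c_i\ge \mathrm{beam\_threshold}\}\tag{2.2}$$

[0121] $$\mathrm{sementic}_{gen}(\{seq\})=\{\mathrm{beam}(\mathrm{decare}(seq_i,\,seq_{i+1}))\mid seq_i\in seq,\ 1\le i\le n-1\}\tag{2.3}$$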

[0122] Here, decare(A,B) represents the weighted expansion of adjacent word sequences A and B using the Cartesian product algorithm. The Cartesian product algorithm is as follows: Assuming set A = {a, b} and set B = {0, 1, 2}, then the Cartesian product of the two sets is {(a, 0), (a, 1), (a, 2), (b, 0), (b, 1), (b, 2)}.

[0123] In formulas (2.2) and (2.3), sementic_gen({seq}) denotes using the Beam algorithm to dynamically sort and prune the expanded sequences, generating weighted semantic data. The Beam algorithm is a pruning algorithm that obtains the probabilities of word sequences and retains the sequences with the highest probabilities. For instance, if there are a total of 3*3=9 candidates and the Beam algorithm retains 6, then it keeps the 6 candidates with the highest probabilities out of these 9.
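The weighted Cartesian expansion and the Beam pruning can be sketched together as follows. This is a sketch under stated assumptions: the slot sets `SUITS` and `SIZES` and their weights are illustrative values, `expand` is a hypothetical helper name, and pruning keeps a fixed number of top candidates (as in the 9-candidates-keep-6 description) rather than applying the threshold form of formula (2.2).

```python
from itertools import product

# Hypothetical slot sets: each entity word carries its occurrence frequency
# as a weight. The values are illustrative only.
SUITS = {"hearts": 0.3, "spades": 0.25, "diamonds": 0.25, "clubs": 0.2}
SIZES = {"six": 0.5, "nine": 0.3, "ace": 0.2}


def decare(left, right):
    """Weighted Cartesian product of two adjacent sets of word sequences."""
    return {
        l_words + r_words: l_weight * r_weight
        for (l_words, l_weight), (r_words, r_weight)
        in product(left.items(), right.items())
    }


def beam(candidates, beam_size):
    """Keep the beam_size candidates with the highest weights."""
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    return dict(ranked[:beam_size])


def expand(slot_sets, beam_size):
    """Fold decare over the slots in template order, pruning after each step."""
    current = {(word,): weight for word, weight in slot_sets[0].items()}
    for slot in slot_sets[1:]:
        nxt = {(word,): weight for word, weight in slot.items()}
        current = beam(decare(current, nxt), beam_size)
    return current


top = expand([SUITS, SIZES], beam_size=6)
for words, weight in top.items():
    # Re-attach the non-entity part of the template to each surviving sequence.
    print("I want to play " + " ".join(words), round(weight, 4))
```

Pruning after every pairwise expansion keeps the candidate set bounded even when a template contains many slots, at the cost of possibly discarding a sequence that would have ranked highly later.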

[0124] In step 3044, the weight value of each segment of new semantic structure data is determined based on the weight values of the entity words included in each segment of new semantic structure data.

[0125] For example, the weight value of each new semantic structure data segment can be the product of the following parameters: the weight value of each entity word included in the new semantic structure data.

[0126] In step 3045, each new semantic structure data segment is labeled to obtain weighted semantic data.

[0127] For example, weight values are labeled into the corresponding new semantic structure data to obtain weighted semantic data. For example: I want to play Hearts: 0.0225.

[0128] In this embodiment of the application, a large amount of new semantic data is generated by using existing semantic structure data as templates, which saves the computational resources required for labeling training samples, improves the efficiency of training semantic understanding models, and thus improves the accuracy of training models.

[0129] Continuing to refer to Figure 3A, in step 305, a semantic understanding model is trained based on the weighted semantic data.

[0130] For example, a semantic understanding model can be a combination of different models, or it can be end-to-end. The trained semantic understanding model is used to convert speech data into text data and recognize the control commands corresponding to the text data.

[0131] In some embodiments, refer to Figure 2B, which is a schematic diagram of the structure of the semantic understanding model provided in an embodiment of this application; the semantic understanding model 201C includes a speech recognition model 202C and a domain classification model 203C. Refer also to Figure 3E, which is a flowchart of the interactive control method based on voice data provided in the embodiments of this application; step 305 can be implemented through steps 3051E to 3053E, as detailed below.

[0132] In step 3051E, the weighted semantic data is normalized to obtain normalized weighted semantic data.

[0133] In step 3052E, a speech recognition model is trained based on normalized weighted semantic data, and a domain classification model is trained based on normalized weighted semantic data.

[0134] For example, a speech recognition model is used to convert speech data into text data and semantics, while a domain classification model is used to predict the control commands corresponding to the semantics. Different training tasks are performed for each of them.

[0135] For example, step 3052E can be implemented in the following ways: based on normalized weighted semantic data, call the speech recognition model to perform the training task of predicting the semantics corresponding to the control command text; based on normalized weighted semantic data, call the domain classification model to perform the training task of predicting the control commands corresponding to the semantic data.

[0136] For example, normalized weighted semantic data is used as supervision information to update the parameters of the speech recognition model, enabling the speech recognition model to convert speech data into text data and text data into corresponding semantic data.

[0137] For example, a domain classification model is trained synchronously by inputting normalized weighted semantic data into the model, causing it to perform the following processing: using each segment of normalized weighted semantic data as a matching template, obtaining any segment of semantic data, performing fuzzy matching on the characters of the arbitrary segment of semantic data and each segment of normalized weighted semantic data, obtaining the control instructions corresponding to the matched normalized weighted semantic data, determining the cross-entropy loss of the domain classification model based on the difference between the matched control instructions and the control instructions corresponding to the semantic data, and updating the parameters of the domain classification model based on the cross-entropy loss.

[0138] In step 3053E, the trained speech recognition model is combined with the trained domain classification model to obtain the trained semantic understanding model.

[0139] For example, the output of the trained speech recognition model is used as the input of the trained domain classification model, and then the two are combined.

[0140] In some embodiments, the trained speech recognition model can be applied to an automatic speech recognition engine, and the normalized weighted semantic data can be stored as a bin file. The domain classification model and the normalized weighted semantic data can be applied to the semantic understanding algorithm module. The output of the automatic speech recognition engine is used as the input to the semantic understanding algorithm module to achieve model combination.

[0141] For example, different models can be used to implement interactive control based on voice data. The training of different models can be carried out simultaneously, saving the time required for model training and improving the efficiency of model training. By training a domain classification model separately, the efficiency and accuracy of matching control commands are improved, thereby enhancing the accuracy and efficiency of interactive control based on voice data.

[0142] In some embodiments, refer to Figure 3F, which is a flowchart of the interactive control method based on voice data provided in the embodiments of this application; the model can be trained in an end-to-end manner, and step 305 can be implemented through steps 3051F to 3053F, as described in detail below.

[0143] In step 3051F, based on the weighted semantic data corresponding to each control instruction, the semantic understanding model is invoked to perform instruction prediction processing to obtain the predicted instruction.

[0144] For example, the semantic understanding model matches weighted semantic data with the semantic data corresponding to the stored control instructions (obtaining the number of overlapping characters between semantic data, or obtaining the similarity between semantic data), and uses the instruction corresponding to the semantic data with the highest matching degree as the predicted instruction.

[0145] In step 3052F, the first prediction loss of the semantic understanding model is determined based on the difference between the predicted instruction and the control instruction corresponding to the weighted semantic data.

[0146] For example, the first prediction loss takes the difference between the predicted instruction and the control instruction corresponding to the weighted semantic data as a factor, and can be any of various types of loss functions, such as mean absolute error or cross-entropy loss.

[0147] Taking cross-entropy loss as an example, cross-entropy can be used as a loss function in neural networks (machine learning). In this embodiment, the cross-entropy loss function can measure the similarity between the actual control command and the predicted control command, and can be characterized by the following formula (3):

[0148]

[0149] Where C represents the number of control command types, with each control command corresponding to one type; N represents the number of training samples; y_ij indicates whether the i-th training sample belongs to the j-th type of control instruction; and p_ij represents the probability that the i-th training sample is predicted to be of type j, with values in the range [0, 1].
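With these symbols, a standard multi-class cross-entropy has the form L = -(1/N) Σ_i Σ_j y_ij · log(p_ij), summed over the N samples and C command types. The sketch below computes that quantity; the averaging over N, the `eps` guard, and the sample data are assumptions made for illustration, and the function name is hypothetical.

```python
import math


def cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy over N samples and C command types.

    y_true[i][j] is 1 if sample i belongs to command type j, else 0.
    y_pred[i][j] is the predicted probability of type j for sample i.
    """
    n = len(y_true)
    total = 0.0
    for truth_row, pred_row in zip(y_true, y_pred):
        for y_ij, p_ij in zip(truth_row, pred_row):
            # eps guards against log(0) when a probability is exactly zero.
            total += y_ij * math.log(max(p_ij, eps))
    return -total / n


# Two samples and three hypothetical command types.
labels = [[1, 0, 0], [0, 1, 0]]
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
print(cross_entropy(labels, probs))
```

The loss shrinks toward zero as the probability assigned to the correct command type approaches 1, which is what makes it usable as a training signal in step 3053F.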

[0150] In step 3053F, the semantic understanding model is backpropagated based on the first prediction loss to obtain the trained semantic understanding model.

[0151] For example, the parameters of the semantic understanding model are updated based on the first prediction loss to obtain the trained semantic understanding model.

[0152] By performing end-to-end model training, this application embodiment saves the computational resources required for training the model and achieves higher accuracy when the model predicts control commands, thus improving the accuracy of interactive control.

[0153] In some embodiments, after step 305, the semantic structure data corresponding to each control instruction is encapsulated to obtain semantic frame data corresponding to each control instruction; the following correspondence is stored: each control instruction and the semantic frame data corresponding to each control instruction.

[0154] For example, the frame header of the semantic structure data is determined based on the type of entity words in the semantic structure data, and the frame header and data ontology are encapsulated into frame data. For example, the weighted semantic data is "I want to play | play $suit $size: 0.9", where the types of entity words are $suit $size. The corresponding frame data is: {[{domain:'game name',type:'template',name:'chupai1',regular expression:'I want to [play | play] $suit $size',intent:'play',slot:['suit','size'],weight:0.9},…,]}.
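The encapsulation can be sketched as a small helper that wraps one weighted template into a frame record. The field names mirror the example frame above; the function `build_frame` and the way slot names are extracted from the template are assumptions for illustration, not the embodiment's actual encoding.

```python
import re


def build_frame(domain, name, pattern, intent, weight):
    """Wrap one weighted semantic template into a frame record."""
    # Slot names are the tag names in the template, without the "$" prefix.
    slots = re.findall(r"\$(\w+)", pattern)
    return {
        "domain": domain,
        "type": "template",
        "name": name,
        "regular expression": pattern,
        "intent": intent,
        "slot": slots,
        "weight": weight,
    }


frame = build_frame(
    domain="game name",
    name="chupai1",
    pattern="I want to [play | play] $suit $size",
    intent="play",
    weight=0.9,
)
print(frame["slot"])
```

Storing the slot list alongside the template lets a lookup recover both the matched instruction and the entity types it expects without re-parsing the template text.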

[0155] In some embodiments, refer to Figure 3G, which is a flowchart of the interactive control method based on voice data provided in this application embodiment; after step 305, with the server as the execution entity, the control commands corresponding to the voice data are identified through steps 306 to 307.

[0156] In step 306, in response to receiving voice data, the semantic understanding model is invoked based on the voice data to perform the following processing: converting the voice data into text data, performing fuzzy matching processing on the text data and semantic frame data, and obtaining the search confidence corresponding to each segment of semantic frame data.

[0157] For example, voice data is sent to the server by the terminal device after recognizing the user's voice. Fuzzy matching is used to compare two or more records and calculate the probability that they belong to the same entity. It calculates the similarity between each string and the target string, and takes the string with the highest similarity as the fuzzy match result (search confidence). The most common approach to calculating the similarity between strings is to use the edit distance algorithm.

[0158] In step 307, the control instruction corresponding to the semantic frame data with the highest search confidence is obtained and executed.
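Steps 306 and 307 can be sketched with an edit-distance-based similarity. This is a minimal illustration under assumptions: the stored templates in `FRAMES` are hypothetical and already expanded into plain text, and the mapping from edit distance to a confidence in [0, 1] is one common choice rather than the one prescribed by the embodiment.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        cur = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def confidence(text, template):
    """Map edit distance to a similarity in [0, 1]; 1.0 means identical."""
    longest = max(len(text), len(template)) or 1
    return 1.0 - edit_distance(text, template) / longest


def best_command(text, frames):
    """Return (confidence, command) for the best-matching stored template."""
    scored = [(confidence(text, f["text"]), f["command"]) for f in frames]
    return max(scored)


# Hypothetical stored templates and command identifiers.
FRAMES = [
    {"text": "play a card", "command": "PLAY_CARD"},
    {"text": "call the landlord", "command": "CALL_LANDLORD"},
]
print(best_command("play a car", FRAMES))
```

A recognition result that differs from a stored template by one character still maps to the intended command, which is the tolerance that fuzzy matching is meant to provide.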

[0159] In this embodiment, semantics is used as the matching retrieval data. Fuzzy matching saves the computing resources required for retrieval control commands, ensures the accuracy of matching control commands, improves the control efficiency of interactive control, and can improve the response speed of voice control.

[0160] This application also proposes an interactive control method based on voice data with a terminal device as the execution subject. Refer to Figure 4D, which is a flowchart of the interactive control method based on voice data provided in the embodiments of this application; the method is explained below in combination with the steps shown in Figure 4D.

[0161] In step 401D, a virtual scene is displayed in the human-computer interaction interface.

[0162] For example, a virtual scene can be a game scene or a screen of an application.

[0163] In step 402D, voice data is acquired.

[0164] For example, the sound pickup device (e.g., a microphone) in the terminal device captures the sound emitted by the user, converts the sound from an acoustic signal into an electrical signal, and then converts the electrical signal into voice data.

[0165] In step 403D, the semantic understanding model is invoked based on the speech data to perform semantic recognition processing and determine the control command corresponding to the speech data.

[0166] For example, the semantic understanding model is trained using the voice data-based interactive control method of this application embodiment.

[0167] In step 404D, control instructions are executed.

[0168] For example, refer to Figures 4A to 4C, which are schematic diagrams of the human-computer interaction interface of the terminal device provided in this application embodiment. Taking a card game as an example, the human-computer interaction interface displays prompt message 401A "Please speak" to prompt the user (player) to speak a voice control command. The control command can be "play a card". The terminal device picks up the user's voice, converts the voice into corresponding voice data, and sends the voice data to the server for voice recognition. At the same time, it displays prompt message 402A "Recognizing" to indicate to the user that voice recognition is in progress. When the control command corresponding to the voice data is recognized, the server returns the corresponding semantics and control command to the terminal device. The terminal device displays prompt message 403A "Automatically played a card" and execution result 404A, which is the result of executing the control command to play a card; Figure 4C shows the two cards played (both J).

[0169] Refer to Figure 5, which is a flowchart of the interactive control method based on voice data provided in an embodiment of this application. Figure 5 illustrates the process in which server 200 and terminal device 400 collaboratively implement the voice data-based interactive control method of this application embodiment. Server 200 includes training server 200-1 and voice recognition server 200-2.

[0170] Terminal device 400 executes step 501, converts the received voice into voice data, and sends the voice data to server 200.

[0171] In step 502, server 200 converts the voice data into text data and identifies the control commands corresponding to the text data. In step 503, server 200 sends the control commands to terminal device 400.

[0172] Terminal device 400 executes step 504, executes the control command, and displays the screen corresponding to the control command.

[0173] By converting control commands into semantic structure data and labeling the semantic structure data with weight values, weighted semantic data is generated based on the labeled semantic structure data. This improves the accuracy of the labeling of training samples, enhances the accuracy of training the semantic understanding model, and ultimately improves the accuracy of the semantic understanding model in recognizing semantics and control commands during interactive control.

[0174] The following will describe an exemplary application of the voice data-based interactive control method of this application in a real-world application scenario.

[0175] Related technologies utilize the physical spectrum characteristics of sound, employing a small on-device speech recognition system to identify game control keywords. These keywords are then used to determine and execute control commands. However, this approach suffers from poor accuracy, limited quantity, and slow speed in offline speech keyword recognition, impacting the hit rate of keyword matching commands and resulting in low interactive control efficiency. Taking game control via speech recognition as an example, related technologies use an ASR (Automatic Speech Recognition) system to recognize text, converting it into a vector sequence. This sequence is then compared and matched with the vector sequences corresponding to the game's control commands to derive the final control commands. However, text vectors are incompatible with different contexts; retraining the text vectors is necessary for new application scenarios, leading to low business support efficiency. Furthermore, text vectors have limited support for different semantic sentence structures, further contributing to low interactive control efficiency.

[0176] The interactive control method based on voice data proposed in this application converts the control command text corresponding to business requirements (e.g., games, map software) in application scenarios into semantic templates (PATs) and semantic keywords (Slots). It has the following advantages: 1. Wide application scope: The scope of control command requirements is represented and covered using semantic templates, and the objects involved in the control commands can be represented and covered using semantic keywords. 2. High update efficiency: If the scope of control command requirements changes, only the corresponding semantic template needs to be updated; if the objects involved in the control commands change, only the corresponding semantic keywords need to be updated. 3. Reusable semantic data: Semantic templates and semantic entities (semantic keywords) can automatically construct weighted semantic data and semantics.

[0177] This application embodiment trains a semantic understanding model capable of speech recognition using weighted semantic data generated from personalized semantic templates and semantic entities. It has the following advantages: 1. High automation efficiency in converting control commands into personalized ASR model data. Control commands use semantically structured representations to automatically generate data and train personalized ASR recognition resources. 2. Control commands and automatic speech recognition are integrated through personalized semantic data, significantly improving speech recognition accuracy and interactive control efficiency.

[0178] This application's embodiments create weighted semantic frame data from semantic templates and semantic entities, enabling semantic understanding algorithms to recognize control commands. It has the following advantages: 1. High efficiency in converting control commands into personalized semantic frame data. The semantic structured representation of control commands is converted into semantic frame data with high efficiency. 2. Control commands and semantic understanding are integrated through personalized (structured) semantic frame data, significantly improving the accuracy of semantic understanding.

[0179] Refer to Figure 6A, which is a flowchart of the interactive control method based on voice data provided in this application embodiment. With the server as the execution entity, the method is explained below in combination with the steps shown in Figure 6A.

[0180] In step 601A, the semantic data is formatted.

[0181] For example, prior to step 601A, the text of the control instruction is obtained based on business requirements.

[0182] Refer to Figure 6C, which is a flowchart of the interactive control method based on voice data provided in an embodiment of this application.

[0183] In step 601C, the text of the control instructions is semantically formatted to obtain a semantic template and semantic keywords.

[0184] For example, if the business is a card game, then the control instructions include: play a card, call the landlord, and bid for the landlord. The entity words with the same variable attributes in the sentence structure of the control instructions are used as semantic keyword fragments and replaced with attribute tags; the remaining content, which serves as the semantic template, remains unchanged.

[0185] The semantic templates and semantic keyword lists are used to fully represent the control instruction text. Semantic templates (Pat-class) and semantic keyword lists (Slot-class) are constructed to conform to the control instruction text. A semantic template is a fragment of key sentence structure, with variable keyword positions left blank and replaced with symbols representing keyword category information. For example, the semantic templates (Pat-class) and semantic keyword lists (Slot-class) created for control instruction text based on a voice dialogue in a card game (business-related instruction text) are shown in Tables 1 and 2 below.

[0186]
| Control command text | Corresponding semantic template (Pat-class) |
| --- | --- |
| "Call the landlord", "Can't afford it", "Let's play another round" | $Action 2 (no subject or object) |
| "Landlord, play your card!", "The player before you, hurry up!" | $Character 2 (others) $Action 3 (others) |
| "I want to rob the landlord!", "Cancel the trusteeship!" | $Character 1 (self) $Action 2 (no subject or object) |
| "I want to play the Ace", "I want to play the King" | I want to play $Size |
| "I want to play the six of hearts", "I want to play the nine of clubs" | I want to [play \| play] $Suit $Size |

[0187] Table (1)

[0188]
| Semantic keyword type (Slot-class) | Keyword set for each type (examples) |
| --- | --- |
| Action 2 (no subject or object) | "Call the landlord", "Can't afford it", "Let's play another round" |
| Character 2 (others) | "Upstream player", "Downstream player", "Landlord" |
| Action 3 (others) | "Play your cards!", "Hurry up!", "Speed!" |
| Character 1 (self) | "I", "_" |
| Size | "A", "One", "Six", "Nine", "Old K" |
| Suit | "Hearts", "Spades", "Diamonds", "Clubs" |

[0189] Table (2)

[0190] For example, in step 601A, the semantic template and semantic keywords are formatted to obtain weighted semantic data and structured semantic frame data.

[0191] Following step 601A, step 602A is executed to train the speech recognition model, and step 604A is executed to train the domain classification model. Step 602A will be explained first. Steps 601A to 602A can be implemented through steps 601C to 603C. In step 602C, the weight values corresponding to the semantic template and semantic keywords are calculated. In step 603C, the weight values are labeled for the semantic template and semantic keywords.

[0192] In some embodiments, semantic templates and semantic keywords can be scored using the following formula (1), and the scores are used as the weights corresponding to the semantic data.

[0193]

[0194] Wherein uni is a single word in the dictionary; count(x) is the statistical word frequency of a single word; SUM_CNT is the sum of the frequencies of all words; N-gram_prob is the probability score predicted by an N-gram model; and gram_count is the number of words included in the control instruction.

[0195] For example: 1. The control command text is "Play Card", which includes one word. The frequency of the word "Play Card" is 60, and the total frequency of all words is 10,000. Therefore, using the frequency of the control command text as the score, score('Play Card') = 60 / 10,000 = 0.006. 2. The control command text is "Six of Hearts", which consists of two words, so gram_count equals 2. Using the N-gram model to calculate the probability, N-gram_prob('hearts', 'six') = 0.018.

[0196] To address the issue of an excessive number of free parameters, the Markov assumption is introduced: the probability of any word appearing depends only on the finite number of words preceding it (n-1 words). The statistical language model based on this assumption is called the N-gram language model. That is, it uses the preceding N-1 words as history to estimate the probability of the current (Nth) word.
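Written out, the chain rule and its Markov approximation take the following standard form, where w_i denotes the i-th word of an m-word sequence (these symbols are introduced here for illustration and do not appear in the original text):

$$
P(w_1,\dots,w_m)=\prod_{i=1}^{m}P(w_i\mid w_1,\dots,w_{i-1})\approx\prod_{i=1}^{m}P(w_i\mid w_{i-N+1},\dots,w_{i-1})
$$

For N = 2 (a bigram model), each factor reduces to P(w_i | w_{i-1}), so the number of parameters to estimate grows with the number of word pairs rather than with the number of full histories.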

[0197] Weight values are assigned to semantic keywords or semantic templates in Table (1) or Table (2), which can be represented in the following form:

[0198] PAT "I want to play $Size": 0.9;

[0199] PAT "$Character 2 (others) $Action 3 (others)": 0.6;

[0200] SLOT "Call the Landlord": 0.7, "Can't Win": 0.8, "Play Another Round": 0.2.

[0201] For example, a large amount of weighted semantic data is generated based on semantic templates and semantic keywords, and this weighted semantic data can be used to train a semantic understanding model. First, the semantic template and the set of keywords contained within it are obtained. For example, if the keyword type included in the semantic template is $HERO, where $HERO is an entity-word tag, the corresponding keyword set could be the names of heroes (character A, character B).

[0202] For example, according to the semantic feature form of the semantic template, the keyword tags in it are pointed to the corresponding keyword set to obtain a large amount of weighted text data. The calculation process can be represented by the following formulas (2.1), (2.2), and (2.3).

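[0203] $$\mathrm{decare}(A,B)=\{(x_i,\,y_i,\,w_{a_i}\cdot w_{b_i})\mid x_i\in A,\ y_i\in B,\ w_{a_i}\in A,\ w_{b_i}\in B\}\tag{2.1}$$

[0204] $$\mathrm{beam}(\{c_1,c_2,\dots,c_n\})=\{c_i\mid 1\le i\le n,\ c_i\ge \mathrm{beam\_threshold}\}\tag{2.2}$$

[0205] $$\mathrm{semantic\_gen}(\{seq\})=\{\mathrm{beam}(\mathrm{decare}(seq_i,\,seq_{i+1}))\mid seq_i\in seq,\ 1\le i\le n-1\}\tag{2.3}$$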

[0206] Here, decare(A,B) represents the weighted expansion of adjacent word sequences A and B using the Cartesian product algorithm. Cartesian product algorithm: Assuming set A = {a,b} and set B = {0,1,2}, then the Cartesian product of the two sets is {(a,0),(a,1),(a,2),(b,0),(b,1),(b,2)}.

[0207] In formulas (2.2) and (2.3), semantic_gen({seq}) denotes using the Beam algorithm to dynamically sort and prune the expanded sequences, generating weighted semantic data. The Beam algorithm is a pruning algorithm that obtains the probabilities of word sequences and retains the sequences with the highest probabilities. For instance, if there are 3*3=9 candidates in total and the Beam algorithm retains 6, it keeps the 6 candidates with the highest probabilities out of these 9.
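The expansion and pruning described by formulas (2.1) to (2.3) can be sketched as follows. This is an illustrative reading under stated assumptions: adjacent keyword sets are expanded left to right, each candidate is a (word sequence, weight) pair, and the `top_k` parameter is added here to mirror the "retain 6 of 9" example; the slot contents and weights are hypothetical.

```python
from typing import List, Optional, Tuple

Weighted = Tuple[Tuple[str, ...], float]  # (word sequence, weight)


def decare(a: List[Weighted], b: List[Weighted]) -> List[Weighted]:
    """Weighted Cartesian product: concatenate sequences, multiply weights."""
    return [(xa + xb, wa * wb) for xa, wa in a for xb, wb in b]


def beam(
    candidates: List[Weighted],
    beam_threshold: float = 0.0,
    top_k: Optional[int] = None,
) -> List[Weighted]:
    """Keep candidates at or above the threshold, highest weight first."""
    kept = [c for c in candidates if c[1] >= beam_threshold]
    kept.sort(key=lambda c: c[1], reverse=True)
    return kept if top_k is None else kept[:top_k]


def semantic_gen(
    seq: List[List[Weighted]],
    beam_threshold: float = 0.0,
    top_k: Optional[int] = None,
) -> List[Weighted]:
    """Expand adjacent keyword sets pairwise, pruning after each expansion."""
    if not seq:
        return []
    expanded = seq[0]
    for nxt in seq[1:]:
        expanded = beam(decare(expanded, nxt), beam_threshold, top_k)
    return expanded


# Hypothetical keyword sets for the tags $suit and $size.
suits = [(("hearts",), 0.5), (("spades",), 0.3), (("clubs",), 0.2)]
sizes = [(("six",), 0.5), (("seven",), 0.3), (("eight",), 0.2)]

all_candidates = decare(suits, sizes)              # 3 * 3 = 9 candidates
result = semantic_gen([suits, sizes], top_k=6)     # keep the best 6
for words, weight in result:
    print(" ".join(words), round(weight, 4))
```

Pruning after every pairwise expansion keeps the candidate list bounded even when a template contains several keyword tags.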

[0208] For example, weighted semantic data generated based on semantic templates and semantic keywords can be represented as shown in the following table (3).

[0209]

[0210] Table (3)

[0211] Referring to Figure 6E, Figure 6E is a flowchart of the interactive control method based on voice data provided in an embodiment of this application.

[0212] In step 601E, the weighted semantic data is normalized to obtain normalized weighted semantic data. In step 602E, an N-gram language model is trained based on the normalized weighted semantic data. In step 603E, an automatic speech recognition model is constructed based on the trained N-gram language model.

[0213] For example, an N-gram model (domain classification model) is trained using the normalized weighted semantic data, and a recognition state machine resource carrying business classification information is created based on the N-gram model. The N-gram model is used to collect statistics on word sequences of length N or greater. Using these statistics and maximum likelihood estimation, the probability that the nth word follows the first n-1 words to form a word sequence is calculated.
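A weighted maximum-likelihood estimate of this kind can be sketched for the bigram case (N = 2). The sketch makes two assumptions the text does not spell out: "normalization" is taken to mean rescaling the weights so they sum to 1, and each sentence contributes its weight, rather than a count of 1, to the statistics. The function names and the training sentences are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize(weighted: List[Tuple[str, float]]) -> List[Tuple[str, float]]:
    """Rescale weights so that they sum to 1 (assumed meaning of normalization)."""
    total = sum(w for _, w in weighted)
    return [(sentence, w / total) for sentence, w in weighted]


def train_bigram(weighted: List[Tuple[str, float]]) -> Dict[Tuple[str, str], float]:
    """Maximum-likelihood bigram probabilities P(cur | prev) from weighted sentences."""
    pair_mass: Dict[Tuple[str, str], float] = defaultdict(float)
    history_mass: Dict[str, float] = defaultdict(float)
    for sentence, weight in weighted:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            pair_mass[(prev, cur)] += weight
            history_mass[prev] += weight
    return {pair: mass / history_mass[pair[0]] for pair, mass in pair_mass.items()}


# Hypothetical weighted semantic data (sentence, unnormalized weight).
data = [("play hearts six", 3.0), ("play spades six", 1.0), ("pass", 1.0)]
normalized = normalize(data)
model = train_bigram(normalized)
print(model[("<s>", "play")], model[("play", "hearts")])
```

With these inputs, four fifths of the weight starts with "play", and three quarters of that mass continues with "hearts".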

[0214] For example, referring to Figure 6D, Figure 6D is a schematic diagram of training a speech recognition model in an embodiment of the present application. The figure shows state nodes 0 to 5 of the state machine. The arrows between the state nodes represent the data jump directions between the state nodes, and each arrow is marked with a semantic keyword and the weight value corresponding to that semantic keyword. For example, the data jump direction of the vocabulary sequence [out, plum blossom, six] is from state node 0 through state node 1 and state node 2 to state node 5. The data flow between the state nodes represents the process of obtaining the probability corresponding to the vocabulary sequence. The probability corresponding to the vocabulary sequence [out, plum blossom, six] is the product of the probability of each word along this path and the value 0.6 associated with state node 5. The 0.6 corresponding to state node 5 represents the merged jump probability produced when packing the semantic understanding model into the FST storage structure.
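The path computation can be illustrated with a small weighted state machine. Only the node numbers, the word sequence, and the value 0.6 at state node 5 come from the description; the arc weights 0.9, 0.25, and 0.1 and the function name `path_probability` are hypothetical, since the figure's values are not reproduced in the text.

```python
from typing import Dict, Sequence, Tuple

# (source state, word) -> (target state, arc weight)
Arcs = Dict[Tuple[int, str], Tuple[int, float]]


def path_probability(
    arcs: Arcs,
    final_weights: Dict[int, float],
    words: Sequence[str],
    start: int = 0,
) -> float:
    """Multiply arc weights along the path, then the weight of the final state."""
    state, prob = start, 1.0
    for word in words:
        if (state, word) not in arcs:
            return 0.0  # the sequence is not accepted by the state machine
        state, weight = arcs[(state, word)]
        prob *= weight
    return prob * final_weights.get(state, 0.0)


# Hypothetical arc weights; only the 0.6 at state node 5 is taken from the text.
arcs: Arcs = {
    (0, "out"): (1, 0.9),
    (1, "plum blossom"): (2, 0.25),
    (2, "six"): (5, 0.1),
}
final_weights = {5: 0.6}

p = path_probability(arcs, final_weights, ["out", "plum blossom", "six"])
rejected = path_probability(arcs, final_weights, ["out", "six"])
print(p, rejected)
```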

[0215] In an embodiment of the present application, an identification state machine resource with service classification information is made based on the N-gram model. The state machine resource has a corresponding data flow, which can improve the efficiency of interactive control based on voice data, save computing resources, and thus improve the interactive efficiency.

[0216] Continuing to refer to Figure 6A, in step 603A, an ASR speech recognition engine is constructed.

[0217] Exemplarily, an automatic speech recognition engine is built based on the trained N-gram model. Referring to Figure 7A, Figure 7A is a schematic diagram of the data structure of the ASR speech recognition engine in an embodiment of the present application. The data structure of the ASR speech recognition engine 701A includes service configuration information, a general model, and multiple trained speech recognition models (speech recognition model 1, speech recognition model N-1, speech recognition model N). FST is the storage format of the model file.

[0218] Continuing to refer to Figure 6A, in step 605A, a semantic understanding model is constructed.

[0219] Exemplarily, the weighted semantic structured data is converted into weighted semantic frame data, and a semantic understanding strategy model resource is trained and built, so that the semantic understanding model can determine each control instruction in the game based on voice data. Referring to Figure 6F, Figure 6F is a schematic diagram of the process of the interactive control method based on voice data provided in an embodiment of the present application. Step 605A can be implemented through step 601F and step 602F.

[0220] In step 601F, the weighted semantic data is converted into weighted semantic frame data. In step 602F, the weighted semantic frame data is normalized, and a semantic understanding model is generated based on the normalized weighted semantic frame data.

[0221] For example, the semantic representation data of game control command text is converted into weighted semantic frame data according to a fixed format.

[0222] The semantic understanding model loads weighted semantic frame data according to the control instruction text, performs fuzzy matching on the input text data and weighted semantic frame data, and gives the search hit confidence of the text data for each weighted semantic frame data. The one with the highest hit confidence is selected as the weighted semantic frame data corresponding to the recognized text, and the terminal device executes the control instruction corresponding to the weighted semantic frame data.

[0223] For example, the weighted semantic data is: I want to [play|hit]$suit$size: 0.9

[0224] The corresponding frame data is: {[{domain:'Game Name',Type:'Template',Name:'chupai1',Regular Expression:'I want to [play|hit]$suit$size',Intent:'Play Card',Slot:['Suit','Size'],Weight:0.9},…,]}.

[0225] The semantic keywords are: $suit (hearts: 0.25, spades: 0.25, clubs: 0.25...)

[0226] The corresponding frame data is: {domain:'Game Name', Type:'Slot', Key:'Suit', Value:['Hearts:0.25', 'Spades:0.25', 'Clubs:0.25'...]}
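The matching step described in paragraph [0222] can be sketched with Python's standard `difflib` module. This is a simplified illustration, not the application's matcher: the `Examples` field (template expansions), the rule "confidence = best string similarity × frame weight", and the function names are assumptions made for the example; only the `Intent` and `Weight` fields follow the frame data shown above.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def hit_confidence(text: str, frame: Dict) -> float:
    """Best fuzzy similarity against the frame's examples, scaled by its weight."""
    best = max(
        SequenceMatcher(None, text.lower(), example.lower()).ratio()
        for example in frame["Examples"]
    )
    return best * frame["Weight"]


def match_command(text: str, frames: List[Dict]) -> Tuple[float, str]:
    """Return (confidence, intent) of the frame with the highest hit confidence."""
    scored = [(hit_confidence(text, frame), frame["Intent"]) for frame in frames]
    return max(scored, key=lambda item: item[0])


# Hypothetical frames; "Examples" stands in for expansions of each template.
frames = [
    {
        "Intent": "Play Card",
        "Examples": ["I want to play hearts six", "I want to hit hearts six"],
        "Weight": 0.9,
    },
    {
        "Intent": "Call the Landlord",
        "Examples": ["call the landlord"],
        "Weight": 0.7,
    },
]

# A slightly misrecognized input still matches the "Play Card" frame.
confidence, intent = match_command("i want to play heart six", frames)
print(intent, round(confidence, 3))
```

The terminal device would then execute the control instruction associated with the returned intent.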

[0227] Continuing to refer to Figure 6A, in step 605A, semantic recognition and intent extraction are performed.

[0228] Referring to Figure 7B, Figure 7B is a schematic diagram of the data structure of the semantic understanding model in an embodiment of this application. The semantic understanding model 701B can be used for semantic recognition and intent understanding. Frame data can be stored in bin format, and the semantic understanding model 701B includes multiple semantic templates (semantic template 1, semantic template 2...semantic template N) and multiple semantic keywords (semantic keyword 1, semantic keyword 2...semantic keyword N).

[0229] Referring to Figure 6B, Figure 6B is a schematic diagram of an application scenario provided in an embodiment of this application, in which the voice interaction control service 601B runs on server 200-2 in Figure 1 and client 602B runs on terminal device 400. Referring to Figures 4A to 4C, Figures 4A to 4C are schematic diagrams of the human-computer interaction interface of the terminal device provided in an embodiment of this application. Taking a card game as an example, the human-computer interaction interface displays prompt message 401A "Please speak" to prompt the user (player) to speak a voice control command. The control command can be "play a card". The terminal device receives the user's voice, converts it into corresponding voice data, and sends the voice data to the server for voice recognition. At the same time, it displays prompt message 402A "Recognizing" to inform the user that voice recognition is in progress. When the control command corresponding to the voice data is recognized, the server returns the corresponding semantics and control command to the terminal device. The terminal device displays prompt message 403A "Automatically played a card" and execution result 404A, which is the result of executing the control command to play a card: the two cards played in Figure 4C (both numbered J).

[0230] The embodiments of this application outperform competing solutions in the flexibility and efficiency of supporting speech recognition services, and also in speech recognition effectiveness and command achievement rate. Abstract business control command texts are structured and concretized, supporting rapid customization and data sharing, thus improving business support efficiency. The ASR speech recognition engine uses a semantic understanding model, improving the accuracy of the recognized text at the source. Semantic enhancement runs through the entire system chain, with each link working together to improve the speech recognition service, ultimately enhancing the interactive control efficiency of the speech recognition service. Referring to Figure 7C, Figure 7C is a comparison table of the effects of the embodiments of this application. It can be seen that the semantic understanding model of the interactive control method based on voice data provided in the embodiments of this application is superior to the full-domain model of related technologies in terms of perplexity, recognition accuracy, and interactive control efficiency (command achievement rate).

[0231] The following continues to describe the exemplary structure of the voice data-based interactive control device 455 provided in the embodiments of this application, implemented as software modules. In some embodiments, as shown in Figure 2A, the software modules of the voice data-based interactive control device 455 stored in the memory 450 may include: a sample acquisition module 4551, configured to acquire text data used as training samples, wherein the text data includes multiple control instruction texts; a sample processing module 4552, configured to perform semantic structuring processing on the text data to obtain semantic structure data for each control instruction; the sample processing module 4552 is further configured to acquire the weight value corresponding to each control instruction; the sample processing module 4552 is further configured to annotate the semantic structure data of each control instruction based on the weight value of each control instruction to obtain weighted semantic data; and a model training module 4553, configured to train a semantic understanding model based on the weighted semantic data, wherein the trained semantic understanding model is used to convert voice data into text data and recognize the control instructions corresponding to the text data.

[0232] In some embodiments, the sample processing module 4552 is configured to perform the following processing for each control instruction text: obtain the lexical attributes of each word in the control instruction text, wherein the lexical attributes include: entity words and non-entity words; replace entity words with lexical attribute tags corresponding to each entity word in the control instruction text, and retain each non-entity word in the control instruction text, to obtain the semantic structure data of each control instruction.
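The structuring step can be sketched as a tag substitution over the words of a control instruction text. The lexicon below and the function name `structure` are hypothetical; in practice the lexical attributes would come from whatever word-attribute analysis the embodiment uses.

```python
from typing import Dict, List


def structure(words: List[str], entity_tags: Dict[str, str]) -> List[str]:
    """Replace each entity word with its lexical attribute tag.

    Words found in entity_tags are treated as entity words; all other
    words are treated as non-entity words and kept unchanged.
    """
    return [entity_tags.get(word, word) for word in words]


# Hypothetical lexicon mapping entity words to lexical attribute tags.
entity_tags = {
    "hearts": "$suit",
    "spades": "$suit",
    "six": "$size",
    "seven": "$size",
}

template = structure(["I", "want", "to", "play", "hearts", "six"], entity_tags)
print(" ".join(template))
```

Two commands that differ only in their entity words, such as "hearts six" and "spades seven", collapse to the same semantic structure, which is what allows it to serve as a semantic template.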

[0233] In some embodiments, the sample processing module 4552 is configured to perform the following processing for each control instruction text: obtain the number of words in the control instruction text; when the number of words is 1, obtain the probability of occurrence of the corresponding word in the text data, and use the probability of occurrence as the weight value of the control instruction; when the number of words is greater than 1, combine each word in the control instruction text into a word sequence, perform probability prediction processing on the word sequence to obtain the word sequence probability, and use the word sequence probability as the weight value of the control instruction.

[0234] In some embodiments, the sample processing module 4552 is configured to obtain multiple entity words associated with each lexical attribute tag and the occurrence frequency of each entity word, and use the occurrence frequency of each entity word as the weight value of the entity word; and perform the following processing on the semantic structure data of each control instruction: use the semantic structure data of the control instruction as a semantic template; combine the multiple entity words associated with the lexical attribute tags included in the semantic template to obtain multiple segments of new semantic structure data; determine the weight value of each segment of new semantic structure data based on the weight values of the entity words included in that segment; and perform annotation processing on each segment of new semantic structure data to obtain weighted semantic data.

[0235] In some embodiments, the sample processing module 4552 is configured to perform multiple combination processes based on a semantic template to obtain multiple different word sequences. The combination process includes: extracting a target entity word from multiple entity words associated with each word attribute label; combining each target entity word into a word sequence according to the order of each word attribute label in the semantic template; and combining non-entity words in the semantic template with each word sequence to obtain new multi-segment semantic structure data.

[0236] In some embodiments, the sample processing module 4552 is configured to perform the following processing on each word sequence before combining the non-entity words in the semantic template with each word sequence to obtain multiple segments of new semantic structure data: multiply the weight values corresponding to the entity words included in the word sequence in sequence to obtain the word sequence probability corresponding to the word sequence; sort the word sequences in descending order based on the word sequence probability, and retain at least one word sequence at the head of the descending sorting result, wherein the at least one word sequence is used to generate new semantic structure data.

[0237] In some embodiments, the semantic understanding model includes a speech recognition model and a domain classification model; the model training module 4553 is configured to normalize the weighted semantic data to obtain normalized weighted semantic data; train the speech recognition model based on the normalized weighted semantic data, and train the domain classification model based on the normalized weighted semantic data; and combine the trained speech recognition model and the trained domain classification model to obtain the trained semantic understanding model.

[0238] In some embodiments, the model training module 4553 is configured to, based on normalized weighted semantic data, call a speech recognition model to perform a training task to predict the semantics corresponding to the control instruction text; and based on normalized weighted semantic data, call a domain classification model to perform a training task to predict the control instructions corresponding to the semantic data.

[0239] In some embodiments, the model training module 4553 is configured to call the semantic understanding model to perform instruction prediction processing based on the weighted semantic data corresponding to each control instruction, thereby obtaining a predicted instruction; determine the first prediction loss of the semantic understanding model based on the difference between the predicted instruction and the control instruction corresponding to the weighted semantic data; and perform backpropagation processing on the semantic understanding model based on the first prediction loss to obtain the trained semantic understanding model.

[0240] In some embodiments, the model training module 4553 is configured to encapsulate the semantic structure data corresponding to each control instruction after training the semantic understanding model based on weighted semantic data to obtain the semantic frame data corresponding to each control instruction; and store the correspondence between the following data: each control instruction and the semantic frame data corresponding to each control instruction.

[0241] In some embodiments, the model training module 4553 is configured to, after storing the correspondence between the above data and in response to receiving voice data, call the semantic understanding model based on the voice data to perform the following processing: convert the voice data into text data; perform fuzzy matching processing on the text data and the semantic frame data to obtain the search confidence corresponding to each segment of semantic frame data; and obtain the control instruction corresponding to the semantic frame data with the highest search confidence, and execute the control instruction.

[0242] This application also proposes an interactive control device based on voice data. The device includes: a display module configured to display a virtual scene in a human-computer interaction interface; a voice acquisition module configured to acquire voice data; and a recognition module configured to call a semantic understanding model based on the voice data to perform semantic recognition processing and determine the control command corresponding to the voice data, wherein the semantic understanding model is trained by the interactive control method based on voice data in this application; the display module is further configured to execute the control command.

[0243] This application provides a computer program product, which includes a computer program or computer-executable instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer-executable instructions from the computer-readable storage medium and executes the computer-executable instructions, causing the computer device to perform the voice data-based interactive control method described above in this application.

[0244] This application provides a computer-readable storage medium storing computer-executable instructions. When these computer-executable instructions are executed by a processor, they cause the processor to execute the voice data-based interactive control method provided in this application, for example, the interactive control method based on voice data shown in Figure 3A.

[0245] In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, optical disk, or CD-ROM; or it may be a variety of devices including one or any combination of the above-mentioned memories.

[0246] In some embodiments, computer-executable instructions may take the form of programs, software, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.

[0247] As an example, computer-executable instructions may, but do not necessarily, correspond to files in a file system. They may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a HyperText Markup Language (HTML) document, in a single file dedicated to the program in question, or in multiple co-located files (e.g., files that store one or more modules, subroutines, or code sections).

[0248] As an example, computer-executable instructions can be deployed to execute on a single electronic device, or on multiple electronic devices located at one location, or on multiple electronic devices distributed across multiple locations and interconnected via a communication network.

[0249] In summary, by converting control commands into semantic structure data and labeling the semantic structure with weight values, this application's embodiments improve the accuracy of training sample labeling, enhance the accuracy of training the semantic understanding model, and ultimately improve the accuracy of the semantic understanding model in recognizing semantics and control commands during interactive control.

[0250] The above description is merely an embodiment of this application and is not intended to limit the scope of protection of this application. Any modifications, equivalent substitutions, and improvements made within the spirit and scope of this application are included within the scope of protection of this application.

Claims

1. A method of interactive control based on voice data, characterized by, The method includes: Obtain text data for use as training samples, wherein the text data includes multiple control instruction texts; The text data is semantically structured to obtain semantic structure data for each control instruction text. This semantic structure processing includes: performing the following processing on each control instruction text: obtaining the lexical attributes of each word in the control instruction text, wherein the lexical attributes include entity words and non-entity words; entity words include any one of the following: role nouns, object nouns, and verbs; non-entity words include any one of the following: modal particles and modal verbs; replacing the entity words with lexical attribute tags corresponding to each entity word in the control instruction text, while retaining each non-entity word in the control instruction text, to obtain the semantic structure data for each control instruction text. Obtain the weight value corresponding to each of the control instruction texts; The semantic structure data of each control instruction text is annotated based on its weight value to obtain weighted semantic data. 
This annotation process includes: acquiring multiple entity words associated with each lexical attribute tag and the frequency of occurrence of each entity word; using the frequency of occurrence of each entity word as its weight value; performing the following processing on the semantic structure data of each control instruction text: using the semantic structure data of the control instruction text as a semantic template; combining multiple entity words associated with the lexical attribute tags included in the semantic template to obtain new segments of new semantic structure data; determining the weight value of each segment of new semantic structure data based on the weight values of the entity words included in each segment; and annotating each segment of new semantic structure data to obtain weighted semantic data. A semantic understanding model is trained based on the weighted semantic data, wherein the trained semantic understanding model is used to convert the speech data into text data and identify the control command text corresponding to the text data.

2. The method of claim 1, wherein, The step of obtaining the weight value corresponding to each of the control instruction texts includes: For each of the control command texts, the following processing is performed: Obtain the number of words in the control instruction text; When the number of words is 1, the probability of occurrence of the words corresponding to the control instruction text in the text data is obtained, and the probability of occurrence is used as the weight value corresponding to the control instruction text. When the number of words is greater than 1, each word in the control instruction text is combined into a word sequence, and the word sequence is subjected to probability prediction processing to obtain the word sequence probability. The word sequence probability is used as the weight value of the control instruction text.

3. The method of claim 1, wherein, The process of combining multiple entity words associated with the lexical attribute tags included in the semantic template to obtain new multi-segment semantic structure data includes: Multiple combination processes are performed based on the semantic template to obtain multiple different word sequences. The combination process includes: extracting a target entity word from multiple entity words associated with each word attribute tag; and combining each target entity word into a word sequence according to the order of each word attribute tag in the semantic template. The non-entity words in the semantic template are combined with each of the word sequences to obtain new multi-segment semantic structure data.

4. The method of claim 3, wherein, Before combining the non-entity words in the semantic template with each of the word sequences to obtain new multi-segment semantic structure data, the method further includes: For each of the word sequences, the following processing is performed: the weight values corresponding to each entity word included in the word sequence are multiplied sequentially to obtain the word sequence probability corresponding to the word sequence; Each word sequence is sorted in descending order based on the word sequence probability, and at least one word sequence at the head of the result of the descending order sorting is retained, wherein the at least one word sequence is used to generate new semantic structure data.

5. The method of claim 1, wherein, The semantic understanding model includes a speech recognition model and a domain classification model; The training of the semantic understanding model based on the weighted semantic data includes: The weighted semantic data is normalized to obtain normalized weighted semantic data; The speech recognition model is trained based on the normalized weighted semantic data, and the domain classification model is trained based on the normalized weighted semantic data. The trained speech recognition model is combined with the trained domain classification model to obtain the trained semantic understanding model.

6. The method of claim 5, wherein, The training of the speech recognition model based on the normalized weighted semantic data, and the training of the domain classification model based on the normalized weighted semantic data, include: Based on the normalized weighted semantic data, the speech recognition model is invoked to perform the training task of predicting the semantics corresponding to the control instruction text; Based on the normalized weighted semantic data, the domain classification model is invoked to perform the training task of predicting the control instruction text corresponding to the semantic data.

7. The method of claim 1, wherein, The training of the semantic understanding model based on the weighted semantic data includes: Based on the weighted semantic data corresponding to each control instruction text, a semantic understanding model is invoked to perform instruction prediction processing to obtain the predicted instruction. Based on the difference between the predicted instruction and the control instruction text corresponding to the weighted semantic data, the first prediction loss of the semantic understanding model is determined; The semantic understanding model is backpropagated based on the first prediction loss to obtain the trained semantic understanding model.

8. The method of claim 1, wherein, After training the semantic understanding model based on the weighted semantic data, the method further includes: The semantic structure data corresponding to each control instruction text is encapsulated to obtain semantic frame data corresponding to each control instruction text. The following correspondences are stored: each control instruction text and the semantic frame data corresponding to each control instruction text.

9. The method of claim 8, wherein, After storing the correspondence between the following data, the method further includes: In response to receiving voice data, the semantic understanding model is invoked to perform the following processing based on the voice data: The speech data is converted into text data, and the text data is subjected to fuzzy matching with the semantic frame data to obtain the search confidence score corresponding to each segment of the semantic frame data. Obtain the control instruction text corresponding to the semantic frame data with the highest search confidence, and execute the control instruction text.

10. An interactive control method based on voice data, characterized in that the method includes: displaying a virtual scene in a human-computer interaction interface; acquiring voice data; invoking, based on the voice data, a semantic understanding model to perform semantic recognition processing to determine the control command text corresponding to the voice data, wherein the semantic understanding model is trained by the interactive control method based on voice data as described in any one of claims 1 to 9; and executing the control command text.

11. An interactive control device based on voice data, characterized by The device includes: The sample acquisition module is configured to acquire text data for use as training samples, wherein the text data includes multiple control instruction texts; The sample processing module is configured to perform semantic structuring processing on the text data to obtain semantic structure data for each control instruction text. This semantic structuring processing is implemented as follows: for each control instruction text, the following processing is performed: Lexical attributes of each word in the control instruction text are obtained, whereby the lexical attributes include entity words and non-entity words. Entity words include any one of the following: role nouns, object nouns, and verbs. Non-entity words include any one of the following: modal particles and modal verbs. The entity words are replaced with their corresponding lexical attribute tags, while retaining each non-entity word in the control instruction text, thus obtaining the semantic structure data for each control instruction text. The sample processing module is further configured to obtain the weight value corresponding to each of the control instruction texts; The sample processing module is further configured to annotate the semantic structure data of each control instruction text based on the weight value of each control instruction text to obtain weighted semantic data. This annotation process is implemented as follows: multiple entity words associated with each lexical attribute tag and the frequency of occurrence of each entity word are obtained; the frequency of occurrence of each entity word is used as the weight value of the entity word. 
The semantic structure data of each control instruction text is then processed as follows: the semantic structure data of the control instruction text is used as a semantic template; multiple entity words associated with the lexical attribute tags included in the semantic template are combined to obtain new segments of new semantic structure data; the weight value of each segment of new semantic structure data is determined based on the weight values of the entity words included in each segment; and each segment of new semantic structure data is annotated to obtain weighted semantic data. The model training module is configured to train a semantic understanding model based on the weighted semantic data. The trained semantic understanding model is used to convert the speech data into text data and identify the control command text corresponding to the text data.

12. The apparatus according to claim 11, characterized in that the sample processing module is further configured to perform multiple combination processes based on the semantic template to obtain multiple different word sequences, wherein each combination process includes: extracting one target entity word from the multiple entity words associated with each lexical attribute tag; and combining the target entity words into a word sequence according to the order of the lexical attribute tags in the semantic template; and to combine the non-entity words in the semantic template with each word sequence to obtain multiple segments of new semantic structure data.
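One way to picture the combination process of claim 12 is as a Cartesian product over the tag slots of a template. The sketch below is illustrative only: the template, the word lists, and the `expand` helper are hypothetical, and exhaustive enumeration is an assumption (the claim only requires multiple combination processes, not all of them).

```python
from itertools import product


def expand(template, words_by_tag):
    """Fill each tag slot in the template with every associated entity word.

    Returns (word_sequence, filled_text) pairs. The word sequence follows the
    order of the tags in the template; non-entity words stay in place.
    """
    slots = [i for i, token in enumerate(template) if token in words_by_tag]
    choices = [words_by_tag[template[i]] for i in slots]
    results = []
    for sequence in product(*choices):
        filled = list(template)
        for index, word in zip(slots, sequence):
            filled[index] = word
        results.append((sequence, " ".join(filled)))
    return results


# Hypothetical template and entity words, for illustration only.
template = ["[ROLE]", "please", "[VERB]", "[OBJECT]"]
words_by_tag = {
    "[ROLE]": ["soldier", "medic"],
    "[VERB]": ["throw"],
    "[OBJECT]": ["grenade", "bandage"],
}
expanded = expand(template, words_by_tag)
```

With two role nouns, one verb, and two object nouns, the template yields four word sequences, each recombined with the retained non-entity word `please`.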

13. The apparatus according to claim 12, characterized in that the sample processing module is further configured to perform the following processing for each word sequence: multiplying together the weight values of the entity words included in the word sequence to obtain the word sequence probability of the word sequence; and to sort the word sequences in descending order of word sequence probability and retain at least one word sequence from the head of the sorted result, wherein the at least one retained word sequence is used to generate the new semantic structure data.
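The scoring and pruning step of claim 13 can be sketched as follows. The weight values, the candidate sequences, and the `keep` parameter are hypothetical; the claim only requires that at least one sequence at the head of the descending order be retained.

```python
from math import prod


def top_sequences(sequences, weights, keep=1):
    """Score each word sequence by the product of its entity-word weights,
    sort in descending order of that score, and keep the first `keep`."""
    scored = [(prod(weights[word] for word in seq), seq) for seq in sequences]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:keep]


# Hypothetical entity-word weights (for example, relative frequencies).
weights = {
    "soldier": 0.6, "medic": 0.4,
    "throw": 0.5, "use": 0.5,
    "grenade": 0.7, "bandage": 0.3,
}
candidates = [
    ("medic", "use", "bandage"),      # 0.4 * 0.5 * 0.3
    ("soldier", "throw", "grenade"),  # 0.6 * 0.5 * 0.7
    ("soldier", "use", "bandage"),    # 0.6 * 0.5 * 0.3
]
best = top_sequences(candidates, weights, keep=2)
```

Pruning by word sequence probability keeps unlikely combinations out of the training data, which matters because the number of combinations grows multiplicatively with the number of tag slots.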

14. An interactive control device based on voice data, characterized in that the device comprises: a display module, configured to display a virtual scene in a human-computer interaction interface; a voice acquisition module, configured to acquire voice data; and a recognition module, configured to call a semantic understanding model to perform semantic recognition processing based on the voice data and determine the control command text corresponding to the voice data, wherein the semantic understanding model is trained by the interactive control method based on voice data as described in any one of claims 1 to 9; the display module being further configured to execute the control command text.
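The module arrangement of claim 14 can be pictured with a small Python sketch. Everything here is hypothetical: the class names, the `recognize` and `execute` methods, the command table, and the stub model (which returns a fixed string instead of running a trained semantic understanding model) are placeholders for illustration, not the patent's implementation.

```python
from typing import Protocol


class SemanticModel(Protocol):
    """Hypothetical interface for a trained semantic understanding model."""

    def recognize(self, voice_data: bytes) -> str: ...


class StubModel:
    """Stand-in model that always returns the same control command text."""

    def recognize(self, voice_data: bytes) -> str:
        return "throw grenade"


class VirtualScene:
    """Stand-in for the display module: shows the scene, executes commands."""

    def __init__(self):
        self.log = []
        # Hypothetical mapping from control command text to a scene action.
        self.handlers = {
            "throw grenade": lambda: self.log.append("grenade thrown"),
        }

    def execute(self, command_text: str) -> bool:
        handler = self.handlers.get(command_text)
        if handler is None:
            return False  # unrecognized command text: nothing is executed
        handler()
        return True


def handle_voice(scene: VirtualScene, model: SemanticModel, voice_data: bytes) -> bool:
    """Recognition module output is passed to the display module for execution."""
    return scene.execute(model.recognize(voice_data))


scene = VirtualScene()
executed = handle_voice(scene, StubModel(), b"\x00\x01")
```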

15. An electronic device, characterized in that the electronic device comprises: a memory, configured to store computer-executable instructions; and a processor, configured to implement, when executing the computer-executable instructions stored in the memory, the interactive control method based on voice data as described in any one of claims 1 to 10.

16. A computer-readable storage medium storing computer-executable instructions, characterized in that, when the computer-executable instructions are executed by a processor, the interactive control method based on voice data as described in any one of claims 1 to 10 is implemented.

17. A computer program product comprising a computer program or computer-executable instructions, characterized in that, when the computer program or the computer-executable instructions are executed by a processor, the interactive control method based on voice data as described in any one of claims 1 to 10 is implemented.