Speech processing method, apparatus, and device
By utilizing a model to process speech in a speech response relationship network to obtain semantic features and using a similarity algorithm to determine the response, the problem of low speech processing accuracy in existing technologies is solved, and higher speech processing precision is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INDUSTRIAL AND COMMERCIAL BANK OF CHINA
- Filing Date
- 2023-04-20
- Publication Date
- 2026-06-16
AI Technical Summary
In existing technologies, when determining responses by keyword matching of user-asked voice questions, there is a problem of low accuracy in voice processing due to significant semantic differences.
By acquiring the voice response relationship network, the first model is used to process the voice to obtain the semantic features of the keywords, and a similarity algorithm is used to determine the candidate questions in the network and output the target response, thus avoiding situations where the keywords are the same or similar but have large semantic differences.
This improves the accuracy of voice processing and ensures that responses better match user intent.
Figure CN116467407B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to a speech processing method, apparatus, and device.
Background Technology
[0002] During business transactions, companies can use AI devices to automatically answer users' questions, thereby saving human resources and improving work efficiency.
[0003] In related technologies, user-generated voice inquiries can be processed as follows: After acquiring the user's voice inquiry, the AI device can perform text extraction to obtain at least one keyword corresponding to the inquiry. Based on this keyword, the question with the highest similarity to the inquiry is identified in a database. The corresponding response is then sent to the client.
[0004] In the above process, the question text with the highest similarity to the question voice is determined using only the at least one keyword corresponding to the question voice. However, that question text may differ significantly in meaning from the question voice. As a result, the response text determined from the question text may not be the correct response to the question voice, leading to low accuracy in voice processing.
Summary of the Invention
[0005] This application provides a speech processing method, apparatus, and device to solve the problem of low accuracy in speech processing.
[0006] In a first aspect, embodiments of this application provide a voice processing method, including:
[0007] Obtain a first speech;
[0008] Obtain a voice response relationship network, which includes multiple question groups and responses corresponding to each question group, wherein the similarity of questions in the question group is greater than or equal to a first threshold.
[0009] The first speech is processed by the first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech;
[0010] Based on the at least one semantic feature, multiple candidate questions are determined in the voice response relationship network, wherein the similarity between the candidate questions and the at least one semantic feature is greater than or equal to a second threshold.
[0011] Based on the candidate questions, the target response corresponding to the first voice is determined in the voice response relationship network, and the target response is output.
[0012] In a second aspect, embodiments of this application provide a voice processing apparatus, the apparatus comprising:
[0013] The first acquisition module is used to acquire the first speech;
[0014] The second acquisition module is used to acquire a voice response relationship network, which includes multiple question groups and responses corresponding to each question group, wherein the similarity of questions in the question group is greater than or equal to a first threshold.
[0015] The processing module is used to process the first speech using a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech;
[0016] A first determining module is configured to determine multiple candidate questions in the voice response relationship network based on the at least one semantic feature, wherein the similarity between the candidate questions and the at least one semantic feature is greater than or equal to a second threshold.
[0017] The second determining module is used to determine the target response corresponding to the first voice in the voice response relationship network based on the candidate questions, and output the target response.
[0018] In a third aspect, embodiments of this application provide a voice processing device, including:
[0019] At least one processor; and
[0020] A memory communicatively connected to the at least one processor; wherein,
[0021] The memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method described in any implementation of the first aspect.
[0022] In a fourth aspect, embodiments of this application provide a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method described in any implementation of the first aspect.
[0023] In a fifth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the method described in any implementation of the first aspect.
[0024] The speech processing method, apparatus, and device provided in the embodiments of this application acquire a first speech and a speech response relationship network. The first speech is processed using a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech. A first similarity algorithm is used to determine the similarity between the at least one semantic feature and each question in the speech response relationship network. The similarities are sorted from largest to smallest to rank the questions in the speech response relationship network, resulting in a sorted set of questions. The top K questions in the sorted set are defined as a first question set, and the remaining questions are defined as a second question set. Based on the at least one semantic feature, the first question set, and the second question set, multiple candidate questions are determined. Based on the candidate questions, a target response corresponding to the first speech is determined in the speech response relationship network, and the target response is output. In the above process, since the first speech can be processed using a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech, the same semantic feature is used to indicate all words with the same or similar semantics. Multiple candidate questions are determined through the semantic features corresponding to the keywords. Based on the multiple candidate questions, the target response corresponding to the first speech is determined in the speech response relationship network. This avoids situations where the first speech and the candidate questions in the database have the same or similar keywords but different semantics, thus improving the accuracy of speech processing.
Attached Figure Description
[0025] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0026] Figure 1 is a schematic diagram of an application scenario provided in an embodiment of this application;
[0027] Figure 2 is a schematic flowchart of a speech processing method provided in an embodiment of this application;
[0028] Figure 3 is a schematic diagram of the process of acquiring the first voice provided in an embodiment of this application;
[0029] Figure 4 is a schematic flowchart of another speech processing method provided in an embodiment of this application;
[0030] Figure 5 is a schematic diagram of the structure of the voice response relationship network provided in an embodiment of this application;
[0031] Figure 6 is a schematic diagram of the speech processing process provided in an embodiment of this application;
[0032] Figure 7 is a schematic diagram of the structure of a voice processing device provided in an embodiment of this application;
[0033] Figure 8 is a schematic diagram of another voice processing device provided in an embodiment of this application;
[0034] Figure 9 is a schematic diagram of the structure of the voice processing device provided in an embodiment of this application.
[0035] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0036] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0037] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0038] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with relevant laws, regulations and standards, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0039] It should be noted that the speech processing method and apparatus of this application can be used in the field of artificial intelligence, or in any field other than artificial intelligence. The application field of the speech processing method and apparatus of this application is not limited.
[0040] To facilitate understanding, the application scenarios applicable to the embodiments of this application are described below with reference to Figure 1.
[0041] Figure 1 is a schematic diagram of an application scenario provided in an embodiment of this application. Referring to Figure 1, the scenario includes a terminal device 101 and a voice processing device 102. The terminal device 101 can be a mobile phone, computer, etc., and the voice processing device 102 can be a server. Users can ask questions through an application provided by the terminal device 101. The terminal device 101 acquires the user's voice question and sends it to the voice processing device 102. Based on the voice question sent by the terminal device 101, the voice processing device 102 determines the corresponding response in its database and sends the response back to the terminal device 101. The terminal device 101 can display or play the response so that the user receives a reply corresponding to their question.
[0042] In related technologies, user-generated voice inquiries can be processed as follows: After acquiring the user's voice inquiry, an AI device can perform text extraction to obtain at least one keyword corresponding to the voice inquiry. Based on this keyword, the question with the highest similarity to the voice inquiry is identified in a database. The corresponding response is then sent to the client. However, this process relies solely on at least one keyword to determine the question with the highest similarity. This means that the question with the highest similarity might have a significant semantic difference from the voice inquiry. Therefore, the response determined based on the question may not correspond to the voice inquiry, leading to low accuracy in voice processing.
[0043] In this embodiment, a first voice recording corresponding to a user's question and a voice response relationship network are obtained. The first voice recording is processed using a first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice recording. Based on the at least one semantic feature, multiple candidate questions with the highest similarity to the first voice recording are determined in the voice response relationship network. Based on the candidate questions, the target response corresponding to the first voice recording is determined in the voice response relationship network and output. In the above process, since the first voice recording can be processed using a first model to obtain at least one semantic feature corresponding to at least one keyword in the first voice recording, the same semantic feature is used to indicate all words with the same or similar semantics. Multiple candidate questions are determined by the semantic features corresponding to the keywords. Based on the multiple candidate questions, the target response corresponding to the first voice recording is determined in the voice response relationship network, avoiding situations where the keywords of the first voice recording and the candidate questions in the database are the same or similar but have significantly different semantics, thus improving the accuracy of voice processing.
[0044] The method described in this application will now be illustrated through specific embodiments. It should be noted that the following embodiments may exist independently or in combination with each other; identical or similar content will not be repeated in different embodiments.
[0045] Figure 2 is a schematic flowchart of a speech processing method provided in an embodiment of this application. Referring to Figure 2, the method may include:
[0046] S201, Obtain the first voice recording.
[0047] The execution entity in this application embodiment can be a voice processing device or a voice processing apparatus installed within a voice processing device. The voice processing apparatus can be implemented through software or a combination of software and hardware. The voice processing device can be a server.
[0048] Users can ask questions through prompts displayed on the application within their terminal device. The terminal device's audio capture device obtains the user's question and sends it to the voice processing device. The terminal device can be a mobile phone, tablet, etc.
[0049] The process of acquiring the first speech is explained below with reference to Figure 3. Figure 3 is a schematic diagram of the process of acquiring the first voice recording provided in an embodiment of this application. Referring to Figure 3, it includes interfaces 301 and 302. Interfaces 301 and 302 are query pages provided by the application in the terminal device. Referring to interface 301, the user opens the query page in the terminal device application by clicking. A dialog box is displayed on the query page to prompt the user to perform corresponding operations. When the user performs an information query, they can click and hold the speak button on interface 301. The terminal device responds to the user's click operation and begins recording the user's question through a recording device. Referring to interface 302, after recording the user's question, the terminal device sends the first voice recording corresponding to the question to the voice processing device, and simultaneously displays the text corresponding to the first voice recording on the query page to notify the user that the question has been received.
[0050] S202, Obtain the voice response relationship network.
[0051] The voice response relationship network includes multiple question groups and the corresponding responses for each question group, where the similarity of questions in the question group is greater than or equal to a first threshold.
[0052] Based on the voice data acquired within a historical time period and the corresponding responses, a voice response relationship network can be established using a knowledge graph, and this network can be stored in the preset storage space of the voice processing device.
[0053] Multiple questions in a question group can have the same or similar semantics. For example, multiple questions in a question group can be shown in Table 1:
[0054] Table 1
[0055]
Question | Question content
Question 1 | How much money is left in the account?
Question 2 | Check current account balance
Question 3 | How much money is left in your account?
[0056] S203. Process the first speech using the first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech.
[0057] The first model can be a word vector (word2vec) model.
[0058] At least one semantic feature corresponding to at least one keyword in the first speech can be obtained in the following way: perform speech recognition processing and word segmentation processing on the first speech to obtain the first text corresponding to the first speech, the first text including at least one sentence text corresponding to the first speech and keyword tags corresponding to the sentence text, the keyword tags being used to indicate the part-of-speech classification to which the keyword belongs; process the first text through a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech.
[0059] The parts of speech classification includes at least the following: nouns, verbs, adjectives, prepositions, pronouns, numerals, conjunctions, auxiliary words, time words, stative words, locative words, and punctuation marks.
[0060] Semantic features can be represented using word vectors. A word vector can be an n-dimensional vector, where each element indicates the semantic feature corresponding to a keyword in the first text.
[0061] Semantic features are used to indicate multiple keywords with the same or similar semantics. For example, balance, remaining balance, amount left, and how much money is left can all be represented by the same semantic feature.
[0062] For example, after acquiring the first speech, the speech processing device performs speech recognition processing, resulting in the text: "How much money is left in my account?" The text is then segmented into words, resulting in the following first text: "I" (pronoun), "in my account" (noun), "how much" (adverb), "money" (noun). Each keyword in the first text is input into the first model, which processes it to obtain a word vector A representing the at least one semantic feature corresponding to the at least one keyword in the first speech. Specifically, the word vector can be A = (a, b, c, d)^T, where each element indicates the semantic feature corresponding to one keyword in the first text.
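The idea that words with the same or similar meaning share one semantic feature can be illustrated with a minimal sketch. It assumes a hypothetical hand-written table of toy vectors in place of a trained word2vec model, and it uses cosine similarity with an arbitrary cutoff of 0.99; neither the vectors, the cutoff, nor the function names come from this application.

```python
import math

# Hypothetical toy vectors; a trained word2vec model would supply real ones.
TOY_VECTORS = {
    "balance": (0.90, 0.10, 0.00),
    "remaining balance": (0.88, 0.12, 0.01),
    "amount left": (0.86, 0.14, 0.02),
    "repayment date": (0.05, 0.20, 0.95),
}


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def same_semantic_feature(word_a, word_b, cutoff=0.99):
    """Treat two keywords as sharing a semantic feature when their vectors
    are nearly parallel. The cutoff is an illustrative assumption."""
    return cosine_similarity(TOY_VECTORS[word_a], TOY_VECTORS[word_b]) >= cutoff
```

With these toy vectors, "balance", "remaining balance", and "amount left" fall under one semantic feature, while "repayment date" does not.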
[0063] S204. Based on at least one semantic feature, identify multiple candidate questions in the voice response relationship network.
[0064] The similarity between the candidate question and at least one semantic feature is greater than or equal to a second threshold.
[0065] Multiple candidate questions can be identified in a voice response relationship network as follows: Using a first similarity algorithm, determine the similarity between at least one semantic feature and each question in the voice response relationship network; sort the multiple similarities from largest to smallest to rank the questions in the voice response relationship network, resulting in a ranked set of questions; determine the first K questions from the ranked set as a first question set and the remaining questions (excluding the first K questions) as a second question set, where K is an integer greater than or equal to 1; and determine multiple candidate questions based on at least one semantic feature, the first question set, and the second question set.
[0066] The first similarity algorithm can be the Word Centroid Distance (WCD) algorithm.
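As a rough sketch of how a centroid-based distance works: each text is reduced to the mean of its word vectors, and the two means are compared by Euclidean distance. The version below weights every word equally, which is a simplifying assumption, and the conversion from distance to a similarity score is likewise an assumption, since this application does not state one. All function names are hypothetical.

```python
import math


def centroid(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(component) / count for component in zip(*vectors)]


def word_centroid_distance(query_vectors, question_vectors):
    """Euclidean distance between the two centroids; smaller is more similar."""
    return math.dist(centroid(query_vectors), centroid(question_vectors))


def similarity_from_distance(distance):
    """Assumed mapping of a non-negative distance to a similarity in (0, 1]."""
    return 1.0 / (1.0 + distance)
```

Because only one centroid per text is compared, the cost per question is linear in the vector dimension, which suits a first pass over every question in the network.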
[0067] From the first question set and the second question set, the K questions whose similarity is greater than or equal to the second threshold and greater than the similarity of all other questions are identified as the multiple candidate questions.
[0068] For example, consider a voice response relationship network with 100 questions, assuming K is 10. Using the first similarity algorithm, determine the similarity between the at least one semantic feature and each of the 100 questions in the voice response relationship network. Sort these 100 similarities from highest to lowest, thus ranking the questions in the voice response relationship network and producing a sorted set of questions. The top 10 questions (questions 1-10) from the sorted set are defined as the first question set. The remaining 90 questions (questions 11-100) are defined as the second question set. Based on the at least one semantic feature, the first question set, and the second question set, the 10 questions with a similarity greater than or equal to a second threshold of 90% are selected as the multiple candidate questions.
[0069] If the similarity of each of questions 1-10 is greater than or equal to the similarity of every one of questions 11-100, and the similarity of each of questions 1-10 is greater than or equal to the second threshold, then questions 1-10 in the first question set are identified as the multiple candidate questions. If any question among questions 11-100 has a similarity greater than that of one of questions 1-10, the ranking of all questions is updated until there are 10 questions whose similarity is greater than or equal to the second threshold and greater than that of the remaining 90 questions. These 10 questions are identified as the multiple candidate questions.
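The selection rule can be sketched as follows, assuming each question already has a single similarity score. This simplification skips the re-ranking step, and the function name and sample scores are hypothetical.

```python
def select_candidates(similarities, k, second_threshold):
    """Return up to k question ids, highest similarity first, keeping only
    those whose similarity meets the second threshold.

    similarities: dict mapping question id -> similarity in [0, 1].
    """
    ranked = sorted(similarities, key=similarities.get, reverse=True)
    return [q for q in ranked[:k] if similarities[q] >= second_threshold]


# Hypothetical scores for four questions.
scores = {"Q1": 0.97, "Q2": 0.99, "Q3": 0.91, "Q4": 0.85}
candidates = select_candidates(scores, k=3, second_threshold=0.90)
```

If fewer than K questions meet the threshold, this sketch simply returns fewer candidates; how that case should be handled is not specified here.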
[0070] S205. Based on the candidate questions, determine the target response corresponding to the first voice in the voice response relationship network, and output the target response.
[0071] For example, the speech processing device determines multiple candidate questions in the speech response relationship network based on at least one semantic feature, as shown in Table 2:
[0072] Table 2
[0073]
Question | Question content
Question 1 | When is my next payment due?
Question 2 | What is the repayment period?
Question 3 | Before what date will it be possible to make a payment without being overdue?
[0074] Based on the candidate questions, the voice processing device determines the target response corresponding to the first voice message in the voice response relationship network. This target response can be the repayment date, which is the 5th of each month. The voice processing device can then send the target response to the terminal device, which can directly play or display the target response.
[0075] The speech processing method provided in this application embodiment obtains a first speech and a speech response relationship network. The first speech is processed using a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech. Based on the at least one semantic feature, multiple candidate questions are determined in the speech response relationship network. Based on the candidate questions, a target response corresponding to the first speech is determined in the speech response relationship network and output. In the above process, since the first speech can be processed using the first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech, and the same semantic feature is used to indicate all words with the same or similar semantics, multiple candidate questions are determined through the semantic features corresponding to the keywords. Based on the multiple candidate questions, the target response corresponding to the first speech is determined in the speech response relationship network, avoiding situations where the keywords of the first speech and the candidate questions in the database are the same or similar but have significantly different semantics, thus improving the accuracy of speech processing.
[0076] Based on any of the above embodiments, the speech processing process is described in detail below with reference to Figure 4.
[0077] Figure 4 is a schematic flowchart of another speech processing method provided in an embodiment of this application. Referring to Figure 4, the method includes:
[0078] S401, Obtain the first speech.
[0079] It should be noted that S401 is performed in the same way as S201, and the details are not repeated here.
[0080] S402, Obtain the voice response relationship network.
[0081] The voice response relationship network includes multiple question groups and the corresponding responses for each question group, where the similarity of questions in the question group is greater than or equal to a first threshold.
[0082] Before acquiring the voice response relationship network, a voice response relationship network can be established based on multiple questions acquired within a historical time period and the corresponding responses for each question. This voice response relationship network is then stored in the preset storage space of the voice processing device.
[0083] The voice response relationship network can be determined as follows: obtain multiple voice questions and the corresponding response for each voice question; classify the multiple voice questions to obtain multiple question groups, where the similarity of the questions in each question group is greater than or equal to a first threshold; determine the response for each question group; and generate the voice response relationship network based on the multiple voice questions, the multiple question groups, and the response for each question group.
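The grouping step can be sketched as a greedy pass over question and response pairs. Several choices here are assumptions made only for illustration: the greedy strategy itself, using the response of the first question in a group as the group's response, and the token-overlap (Jaccard) function standing in for whatever similarity measure is actually used. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QuestionGroup:
    response: str
    questions: list = field(default_factory=list)


def build_network(qa_pairs, similarity, first_threshold):
    """Greedily place each question into the first group in which it is at
    least first_threshold similar to every member; otherwise start a group."""
    groups = []
    for question, response in qa_pairs:
        for group in groups:
            if all(similarity(question, other) >= first_threshold
                   for other in group.questions):
                group.questions.append(question)
                break
        else:
            # No existing group fits: this question seeds a new group and
            # (by assumption) its response becomes the group's response.
            groups.append(QuestionGroup(response=response, questions=[question]))
    return groups


def jaccard(a, b):
    """Stand-in similarity: share of distinct lowercase tokens in common."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


pairs = [
    ("check account balance", "Your balance is shown on the account page."),
    ("check my account balance", "Your balance is shown on the account page."),
    ("when is my repayment date", "The 5th of each month."),
]
network = build_network(pairs, jaccard, first_threshold=0.7)
```

The resulting list of groups, each holding its questions and one response, is a flat stand-in for the network; links between groups of the same business type are not modeled.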
[0084] The structure of the voice response relationship network is explained below with reference to Figure 5. Figure 5 is a schematic diagram of the structure of the voice response relationship network provided in an embodiment of this application. Referring to Figure 5, it includes a voice response relationship network 501, which is stored in a preset storage space of the voice processing device. The voice response relationship network 501 includes five question groups: question group 1, question group 2, question group 3, question group 4, and question group 5. It also includes the corresponding response for each question group: response 1, response 2, response 3, response 4, and response 5. Each question group contains multiple questions, and the similarity between the questions in each group is greater than or equal to a first threshold of 95%.
[0085] Associations can be established between multiple question groups of the same type. An association indicates that the questions in the associated question groups belong to the same business type. For example, question groups 1 and 2 shown in Figure 5 are associated (indicated by dashed boxes); the business type corresponding to each question in question groups 1 and 2 is account information inquiry. Question groups 3 and 4 are also associated; the business type corresponding to each question in question groups 3 and 4 is repayment period inquiry.
[0086] When determining the relationships between question groups, multiple dimensions and ranges of business types can be pre-defined, and relationships between multiple question groups can be established based on these dimensions. For example, for questions related to querying business information, the dimensions corresponding to the business types can be set to include querying account information, querying account type, and querying account deposit and withdrawal limits. In this way, multi-dimensional and multi-level relationships can be established between question groups based on multiple questions.
[0087] S403. Process the first speech using the first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech.
[0088] Before processing the first speech using the first model, the first model can be trained based on the speech acquired in historical time periods and the corresponding responses to the speech, in order to improve the accuracy of the first model's output.
[0089] The first model can be trained as follows: a training set is obtained, which includes multiple second speech samples and at least one semantic feature corresponding to each second speech sample; speech recognition and word segmentation are performed on each second speech sample to obtain the second text corresponding to that sample and the keyword tags corresponding to the second text, where the keyword tags indicate the part-of-speech classification of the keywords; in the i-th iteration, the i-th intermediate model is trained on the second speech samples to obtain the (i+1)-th intermediate model, where i is 1, 2, 3, ..., until the i-th intermediate model converges; when i is greater than or equal to N, the i-th intermediate model is determined as the first model, where N is a preset number of iterations and is an integer greater than 1. The first intermediate model is the initial model.
[0090] The (i+1)-th intermediate model can be obtained as follows: feature extraction is performed on each second text using the i-th intermediate model to obtain at least one predicted semantic feature corresponding to each second text; the loss value is determined based on the at least one predicted semantic feature corresponding to each second text and the corresponding at least one semantic feature in the training set; the model parameters of the i-th intermediate model are updated based on the loss value to obtain the (i+1)-th intermediate model, where the model parameters include the dimension of the word vectors and the window size.
[0091] The model converges when the loss value is less than or equal to a preset threshold. That is, the loss computed between the at least one predicted semantic feature and the corresponding at least one semantic feature in the training set is less than or equal to the preset threshold.
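The stopping logic of this iterative training can be sketched in a few lines. The sketch reads the two conditions (loss at or below the preset threshold, or the iteration count reaching N) as alternatives, which is an interpretation rather than something stated explicitly. The "model" is a single number updated by a toy rule, purely so the loop can run; it is not a word vector model, and all names are hypothetical.

```python
def train_until_converged(initial_model, update, loss_of, loss_threshold,
                          max_iterations):
    """Apply `update` repeatedly until the loss is at or below loss_threshold
    or max_iterations updates have been made.

    Returns (final_model, number_of_updates_performed).
    """
    model = initial_model
    for iteration in range(1, max_iterations + 1):
        if loss_of(model) <= loss_threshold:
            return model, iteration - 1
        model = update(model)  # i-th intermediate model -> (i+1)-th
    return model, max_iterations


# Toy setup: the "model" is one weight that should approach 3.0.
def toy_loss(w):
    return (w - 3.0) ** 2


def toy_update(w):
    # One gradient step on toy_loss with learning rate 0.25.
    return w - 0.25 * 2.0 * (w - 3.0)


final_model, updates = train_until_converged(
    0.0, toy_update, toy_loss, loss_threshold=0.01, max_iterations=100)
```

In this toy run the loop stops on the loss condition well before the iteration budget is used; lowering `max_iterations` makes the budget the binding condition instead.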
[0092] The method for determining the loss value is the same as the method for determining the similarity of multiple candidate problems described below, and will not be repeated here.
[0093] S404. Using the first similarity algorithm, determine the similarity between at least one semantic feature and each question in the speech response relationship network.
[0094] When determining similarity using the first similarity algorithm, since it is necessary to determine the similarity between all questions in the speech response relationship network and at least one semantic feature, a similarity algorithm with lower time complexity can be used to improve computational efficiency.
[0095] S405. Sort the multiple similarities from largest to smallest to sort the questions in the voice response relationship network, and obtain the sorted questions.
[0096] For example, the speech processing device uses a first similarity algorithm to determine the similarity between at least one semantic feature and 10 questions in a speech response relationship network. The 10 similarities are then sorted from highest to lowest to rank the 10 questions in the speech response relationship network, resulting in the following 10 ranked questions: Question 2, Question 1, Question 4, Question 6, Question 3, Question 9, Question 8, Question 10, Question 5, and Question 7.
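Steps S404 and S405 can be illustrated with a small sketch. The claims name a word centroid distance algorithm as the first similarity algorithm; the sketch below assumes uniform word weights, Euclidean distance, and a `1 / (1 + distance)` conversion from distance to similarity, none of which are specified in the text. All function names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def centroid(vectors: List[Vector]) -> List[float]:
    """Mean of the word vectors (uniform word weights are an assumption)."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def to_similarity(distance: float) -> float:
    """Assumed conversion: smaller distance gives similarity closer to 1."""
    return 1.0 / (1.0 + distance)


def rank_by_centroid(
    query: List[Vector], questions: Dict[str, List[Vector]]
) -> List[Tuple[str, float]]:
    """Score every question by centroid distance and sort from largest to smallest."""
    q_centroid = centroid(query)
    scored = [
        (name, to_similarity(euclidean(q_centroid, centroid(vecs))))
        for name, vecs in questions.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical two-dimensional word vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
questions = {
    "Q1": [[1.0, 1.0]],
    "Q2": [[0.5, 0.5], [0.5, 0.5]],
    "Q3": [[3.0, 4.5]],
}
ranked = rank_by_centroid(query, questions)
```

Because each question is reduced to a single centroid, the cost per question is linear in the number of words, which is why such an algorithm suits a pass over every question in the network.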
[0097] S406. Determine the first K questions among the sorted questions as the first question set, and determine the remaining questions, excluding the first K questions, as the second question set.
[0098] K is an integer greater than or equal to 1. The value of K can be determined based on the number of questions that have a similarity to at least one semantic feature greater than or equal to a second threshold.
[0099] For example, if the second threshold is 95% and three questions have a similarity to the at least one semantic feature greater than or equal to the second threshold, then K can be determined to be 3.
[0100] For example, suppose K is 3. Then, based on the 10 questions sorted as shown in the example above, the first set of questions includes the first 3 questions, namely questions 2, 1, and 4. The second set of questions includes the last 7 questions, namely questions 6, 3, 9, 8, 10, 5, and 7.
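The choice of K and the split of S406 can be sketched as below. The similarity values are hypothetical placeholders, and falling back to K = 1 when no question reaches the threshold is an assumption made only so that K stays greater than or equal to 1.

```python
from typing import List, Tuple

Ranked = List[Tuple[str, float]]


def split_by_threshold(ranked: Ranked, second_threshold: float) -> Tuple[Ranked, Ranked]:
    """Split an already sorted list into the first and second question sets."""
    # K is the number of questions whose similarity reaches the second threshold.
    k = sum(1 for _, similarity in ranked if similarity >= second_threshold)
    k = max(k, 1)  # assumption: keep at least one question so that K >= 1
    return ranked[:k], ranked[k:]


# Hypothetical similarities, already sorted from largest to smallest.
ranked = [
    ("Question 2", 0.98),
    ("Question 1", 0.97),
    ("Question 4", 0.96),
    ("Question 6", 0.93),
    ("Question 3", 0.90),
]
first_set, second_set = split_by_threshold(ranked, 0.95)
```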
[0101] S407. Using the second similarity algorithm, determine the first similarity between at least one semantic feature and each question in the first question set.
[0102] The second similarity algorithm can be the Word Mover's Distance algorithm.
[0103] For example, based on the first set of questions illustrated above, the second similarity algorithm determines the first similarity between at least one semantic feature and each question in the first set of questions, as shown in Table 3:
[0104] Table 3
[0105]
[0106]
[0107] S408. Using the third similarity algorithm, determine the second similarity between at least one semantic feature and each question in the second question set.
[0108] The third similarity algorithm can be the Relaxed Word Mover's Distance (RWMD) algorithm.
[0109] After the word mover's distance between the at least one semantic feature and each question is determined using the aforementioned similarity algorithms, the similarity between the at least one semantic feature and each question can be determined based on that distance.
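As an illustration of the third similarity algorithm, the following sketch computes a relaxed word mover's distance in its common form: each word moves entirely to its nearest word in the other text, and the larger of the two one-sided costs is taken. Uniform word weights, Euclidean distance, and the `1 / (1 + distance)` conversion to a similarity are assumptions; the text only states that similarity is derived from the distance.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_sided_cost(source: List[Vector], target: List[Vector]) -> float:
    """Average distance from each source word to its nearest target word."""
    return sum(min(euclidean(s, t) for t in target) for s in source) / len(source)


def relaxed_wmd(a: List[Vector], b: List[Vector]) -> float:
    """Relaxed word mover's distance: the tighter of the two one-sided bounds."""
    return max(one_sided_cost(a, b), one_sided_cost(b, a))


def to_similarity(distance: float) -> float:
    """Assumed conversion from distance to similarity."""
    return 1.0 / (1.0 + distance)


# Hypothetical word vectors for a query and one question.
query_vectors = [[0.0, 0.0], [1.0, 0.0]]
question_vectors = [[0.0, 0.0], [4.0, 0.0]]
distance = relaxed_wmd(query_vectors, question_vectors)
similarity = to_similarity(distance)
```

The relaxed distance never exceeds the exact word mover's distance, so it is cheaper to compute at the cost of being a looser estimate.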
[0110] S409. Determine whether there is a target similarity among multiple second similarities.
[0111] The target similarity is greater than each first similarity.
[0112] If so, execute S410.
[0113] If not, execute S411.
[0114] S410. Determine the target question corresponding to the target similarity, update the first question set according to the target question, and determine multiple candidate questions according to the first question set.
[0115] The first question set can be updated based on the target question as follows: Determine the similarity between at least one semantic feature and the target question using the second similarity algorithm; Sort the questions in the first question set and the target question in descending order of similarity; Update the first question set with the sorted top K questions.
[0116] For example, based on the second set of questions illustrated above, the second similarity between at least one semantic feature and each question in the second set can be determined using the third similarity algorithm, as shown in Table 4:
[0117] Table 4
[0118]
Second question set | Second similarity
Question 6 | 95.2%
Question 3 | 91.5%
Question 9 | 90.0%
Question 8 | 88.0%
Question 10 | 85.0%
Question 5 | 83.0%
Question 7 | 75.0%
[0119] Based on the first similarities shown in Table 3 and the second similarities shown in Table 4, it can be determined that the second similarity of question 6 is greater than the first similarity of question 4. Therefore, it can be determined that a target similarity of 95.2% exists among the multiple second similarities. The second similarity algorithm is then used to determine that the similarity between the at least one semantic feature and question 6 is 98.0%. The questions in the first question set and the target question are sorted in descending order of similarity, resulting in the order: Question 6, Question 2, Question 1, and Question 4. The first question set is then updated to the first three questions in the sorted list. That is, the first question set includes Question 6, Question 2, and Question 1.
[0120] After S410, execute S412.
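The check in S409 and the update in S410 can be sketched as follows. The sketch takes the wording "greater than each first similarity" literally (compared against the largest first similarity) and, if several second similarities qualify, picks the largest as the target; both choices are assumptions. The question names and values are hypothetical and are not the values of Table 3 or Table 4.

```python
from typing import Callable, Dict, List, Tuple


def refine_first_set(
    first: Dict[str, float],
    second: Dict[str, float],
    exact_similarity: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Return the (possibly updated) first question set, sorted, of size K."""
    k = len(first)
    best_first = max(first.values())
    # S409: second similarities that exceed every first similarity.
    targets = {q: s for q, s in second.items() if s > best_first}
    merged = dict(first)
    if targets:
        # S410: rescore the target question with the second similarity algorithm.
        target = max(targets, key=targets.get)
        merged[target] = exact_similarity(target)
    ranked = sorted(merged.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]  # keep the top K questions


# Hypothetical similarities.
first = {"A": 0.95, "B": 0.94, "C": 0.93}
rescored = {"D": 0.98}
updated = refine_first_set(first, {"D": 0.952, "E": 0.90}, rescored.__getitem__)
unchanged = refine_first_set(first, {"D": 0.932, "E": 0.90}, rescored.__getitem__)
```

Rescoring only the target question keeps the more expensive second algorithm limited to K + 1 questions.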
[0121] S411. Determine multiple candidate questions based on the first question set.
[0122] Multiple candidate questions can be determined based on the first set of questions in the following way: for any question in the first set of questions, determine the question group to which the question belongs in the voice response relationship network; if the question groups to which all questions in the first set of questions belong are the same, then multiple questions in the first set of questions are determined as multiple candidate questions.
[0123] For example, based on the second set of questions illustrated above, the second similarity between at least one semantic feature and each question in the second set can be determined using the third similarity algorithm, as shown in Table 5:
[0124] Table 5
[0125]
Second question set | Second similarity
Question 6 | 93.2%
Question 3 | 91.5%
Question 9 | 90.0%
Question 8 | 88.0%
Question 10 | 85.0%
Question 5 | 83.0%
Question 7 | 75.0%
[0126] Based on the first similarities shown in Table 3 and the second similarities shown in Table 5, it can be determined that no target similarity exists among the multiple second similarities. In this case, the question groups to which questions 2, 1, and 4 in the first question set shown in Table 3 belong are determined in the voice response relationship network. If questions 2, 1, and 4 all belong to question group 2, then questions 2, 1, and 4 in the first question set are determined as the multiple candidate questions.
[0127] Combining the three similarity algorithms to identify the candidate questions allows an algorithm with lower time complexity to be used where the amount of computation is large, filtering out questions with low similarity to reduce computation time and improve efficiency. Conversely, where the amount of computation is small, a more complex but more accurate algorithm is used to improve the accuracy of similarity determination.
[0128] S412. Based on the candidate questions, determine the target response corresponding to the first voice in the voice response relationship network, and output the target response.
[0129] For example, the voice processing device identifies the multiple candidate questions as Question 2, Question 1, and Question 4, as shown in the example above. In the voice response relationship network, it determines that Question 2, Question 1, and Question 4 belong to Question Group 2. The target response for Question Group 2 is determined to be a repayment date of April 12th of each month. The voice processing device sends the target response to the terminal device, which then displays or plays the target response.
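The group check of S411 and the response lookup of S412 can be sketched together. The dictionaries below are a hypothetical, flattened stand-in for the voice response relationship network, and returning `None` when the questions fall into different groups is an assumption, since the text does not describe that case.

```python
from typing import Dict, List, Optional


def target_response(
    first_question_set: List[str],
    group_of: Dict[str, str],
    response_of: Dict[str, str],
) -> Optional[str]:
    """Return the response of the shared question group, if there is one."""
    groups = {group_of[question] for question in first_question_set}
    if len(groups) != 1:
        return None  # assumption: this case is not specified in the text
    # All questions share one group, so they are the candidate questions.
    return response_of[groups.pop()]


# Hypothetical network content.
group_of = {
    "Question 1": "Question Group 2",
    "Question 2": "Question Group 2",
    "Question 4": "Question Group 2",
    "Question 6": "Question Group 1",
}
response_of = {
    "Question Group 1": "response for group 1",
    "Question Group 2": "response for group 2",
}
answer = target_response(["Question 2", "Question 1", "Question 4"], group_of, response_of)
mixed = target_response(["Question 2", "Question 6"], group_of, response_of)
```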
[0130] The speech processing method provided in this application embodiment obtains a first speech and a speech response relationship network. The first speech is processed using a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech. A first similarity algorithm is used to determine the similarity between the at least one semantic feature and each question in the speech response relationship network. Multiple similarities are sorted from largest to smallest to sort the questions in the speech response relationship network, resulting in a sorted set of questions. The top K questions in the sorted set are defined as a first question set, and the remaining questions are defined as a second question set. Based on the at least one semantic feature, the first question set, and the second question set, multiple candidate questions are determined. Based on the candidate questions, a target response corresponding to the first speech is determined in the speech response relationship network, and the target response is output. In the above process, since the first speech can be processed using a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech, the same semantic feature is used to indicate all words with the same or similar semantics. Multiple candidate questions are determined by the semantic features corresponding to the keywords. Based on multiple candidate questions, the target response corresponding to the first speech is determined in the speech response relationship network. This avoids situations where the first speech and the candidate questions in the database have the same or similar keywords but different semantics, thus improving the accuracy of speech processing.
[0131] Based on any of the above embodiments, the detailed speech processing procedure is illustrated below by way of example with reference to Figure 6.
[0132] Figure 6 is a schematic diagram illustrating the speech processing procedure provided in an embodiment of this application. Referring to Figure 6, the system includes a terminal device 601 and a voice processing device 602. The terminal device 601 can be a mobile phone, computer, etc., and the voice processing device 602 can be a server. The voice processing device 602 is equipped with a first algorithm, and the preset storage space of the voice processing device 602 stores a voice response relationship network.
[0133] The user opens the query page in the application on terminal device 601 by clicking, and then performs corresponding input and selection operations according to the prompts displayed on the query page. In response to the user's click, terminal device 601 begins recording the user's question via a recording device. After recording the user's question, terminal device 601 sends the first voice corresponding to the question to voice processing device 602, and simultaneously displays corresponding text or prompt information on the query page to notify the user that the question has been received. The first voice may be, for example, "the current repayment date of the account".
[0134] The speech processing device 602 performs speech recognition and word segmentation on the first speech to obtain the first text corresponding to the first speech. The first text includes account (noun), current (noun), of (particle), and repayment time (time word). The speech processing device 602 processes the first speech using a first model to obtain at least one semantic feature $A = (a, b, c)^{T}$ corresponding to at least one keyword in the first speech, where each element indicates the semantic feature corresponding to one keyword in the first text.
[0135] The speech processing device 602 retrieves a speech response relationship network from a preset storage space and determines the similarity between at least one semantic feature and each question in the speech response relationship network using a first similarity algorithm. The speech processing device 602 sorts the multiple similarities from largest to smallest, thus sorting the questions in the speech response relationship network, resulting in a sorted set of questions including questions 11, 2, 7, 3, 6, 1, 5, 8, 10, 4, 12, 9, and 13. Assuming K is 5, the first question set includes the first 5 sorted questions, namely questions 11, 2, 7, 3, and 6. The second question set includes the last 8 sorted questions, namely questions 1, 5, 8, 10, 4, 12, 9, and 13. The speech processing device 602 determines the first similarity between at least one semantic feature and each question in the first question set using the second similarity algorithm, as shown in Table 5.
[0136] Table 5
[0137]
First question set | First similarity
Question 11 | 98.2%
Question 2 | 97.5%
Question 7 | 97.0%
Question 3 | 96.0%
Question 6 | 95.0%
[0138] The speech processing device 602 determines the second similarity between at least one semantic feature and each question in the second question set using a third similarity algorithm, as shown in Table 6:
[0139] Table 6
[0140]
Second question set | Second similarity
Question 1 | 94.2%
Question 5 | 94.0%
Question 8 | 93.5%
Question 10 | 92.0%
Question 4 | 90.0%
Question 12 | 88.0%
Question 9 | 86.3%
Question 13 | 80.0%
[0141] Based on the first similarity shown in Table 5 and the second similarity shown in Table 6, the voice processing device 602 determines that there is no target similarity among the multiple second similarities. The voice processing device 602 determines in the voice response relationship network that the question groups containing questions 11, 2, 7, 3, and 6 in the first question set shown in Table 5 are all question group 1. At this time, the voice processing device 602 identifies questions 11, 2, 7, 3, and 6 in the first question set as multiple candidate questions. Based on the candidate questions, the voice processing device 602 determines in the voice response relationship network that the target response corresponding to the first voice is the current repayment date of April 12th, and outputs the target response. The voice processing device 602 sends the target response to the terminal device 601, and the terminal device 601 displays or plays the target response through an application.
[0143] Figure 7 is a schematic diagram of a voice processing device provided in an embodiment of this application. Referring to Figure 7, the voice processing device 10 may include:
[0144] The first acquisition module 11 is used to acquire the first voice;
[0145] The second acquisition module 12 is used to acquire a voice response relationship network, which includes multiple question groups and responses corresponding to each question group, wherein the similarity of questions in the question group is greater than or equal to a first threshold.
[0146] Processing module 13 is used to process the first speech through the first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech;
[0147] The first determining module 14 is used to determine multiple candidate questions in the voice response relationship network based on the at least one semantic feature, wherein the similarity between the candidate questions and the at least one semantic feature is greater than or equal to a second threshold.
[0148] The second determining module 15 is used to determine the target response corresponding to the first voice in the voice response relationship network according to the candidate question, and output the target response.
[0149] In one possible implementation, the first determining module 14 is specifically used for:
[0150] The similarity between the at least one semantic feature and each question in the voice response relationship network is determined using a first similarity algorithm.
[0151] The questions in the voice response relationship network are sorted by similarity from largest to smallest to obtain a sorted set of questions.
[0152] The first K questions among the sorted questions are determined as the first question set, and the remaining questions, excluding the first K questions, are determined as the second question set, where K is an integer greater than or equal to 1;
[0153] The plurality of candidate questions are determined based on the at least one semantic feature, the first question set, and the second question set.
[0154] In one possible implementation, the first determining module 14 is specifically used for:
[0155] The first similarity between the at least one semantic feature and each question in the first question set is determined using a second similarity algorithm.
[0156] The second similarity between the at least one semantic feature and each question in the second question set is determined using a third similarity algorithm.
[0157] The plurality of candidate questions are determined from the first question set and the second question set based on a plurality of first similarity scores and a plurality of second similarity scores.
[0158] In one possible implementation, the first determining module 14 is specifically used for:
[0159] Determine whether a target similarity exists among the plurality of second similarities, wherein the target similarity is greater than each first similarity;
[0160] If so, then determine the target question corresponding to the target similarity, update the first question set according to the target question, and determine the plurality of candidate questions according to the first question set;
[0161] If not, then the plurality of candidate questions are determined based on the first set of questions.
[0162] In one possible implementation, the first determining module 14 is specifically used for:
[0163] According to the second similarity algorithm, the similarity between the at least one semantic feature and the target question is determined;
[0164] Sort the questions in the first question set and the target question in descending order of similarity;
[0165] Update the first question set to the sorted top K questions.
[0166] In one possible implementation, the first determining module 14 is specifically used for:
[0167] For any question in the first set of questions, determine the question group to which the question belongs in the voice response relationship network;
[0168] If all questions in the first question set belong to the same question group, then multiple questions in the first question set are identified as the multiple candidate questions.
[0169] In one possible implementation, the processing module 13 is specifically used for:
[0170] The first speech is subjected to speech recognition processing and word segmentation processing to obtain the first text corresponding to the first speech. The first text includes at least one sentence text corresponding to the first speech and keyword tags corresponding to the sentence text. The keyword tags are used to indicate the part-of-speech category to which the keyword belongs.
[0171] The first text is processed by the first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech.
[0172] The voice processing device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.
[0173] Figure 8 is a schematic diagram of another voice processing device provided in an embodiment of this application. Based on the embodiment shown in Figure 7, and referring to Figure 8, the voice processing device 10 also includes a generation module 16.
[0174] The generation module 16 is used for:
[0175] Retrieve multiple voice questions and the corresponding responses for each voice question;
[0176] Classify the multiple voice questions to obtain multiple question groups, wherein the similarity of the questions in each question group is greater than or equal to a first threshold;
[0177] Determine the response corresponding to each question group;
[0178] Generate the voice response relationship network based on the multiple voice questions, the multiple question groups, and the response corresponding to each question group.
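The work of the generation module can be sketched as a simple grouping pass. This is a minimal illustration only: the greedy grouping strategy, the word-overlap similarity used as a stand-in for a semantic similarity, the choice of the first question's response as the group response, and all names are assumptions. The knowledge graph associations between question groups of the same business type are not modeled here.

```python
from typing import Callable, Dict, List, Tuple


def build_network(
    qa_pairs: List[Tuple[str, str]],
    similarity: Callable[[str, str], float],
    first_threshold: float,
) -> List[Dict]:
    """Greedily group questions so every pair in a group meets the first threshold."""
    groups: List[Dict] = []
    for question, response in qa_pairs:
        for group in groups:
            if all(similarity(question, member) >= first_threshold
                   for member in group["questions"]):
                group["questions"].append(question)
                break
        else:
            # No compatible group: start a new one; its response is taken from
            # this first question (an assumption).
            groups.append({"questions": [question], "response": response})
    return groups


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of words, a placeholder for a semantic similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) / len(words_a | words_b)


# Hypothetical historical questions and responses.
pairs = [
    ("when is my repayment date", "R1"),
    ("when is my repayment day", "R1"),
    ("how do I open an account", "R2"),
]
network = build_network(pairs, word_overlap, 0.6)
```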
[0179] The voice processing device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.
[0180] Figure 9 is a schematic diagram of the structure of the voice processing device provided in an embodiment of this application. Referring to Figure 9, the voice processing device 20 may include a memory 21 and a processor 22. Exemplarily, the memory 21 and the processor 22 are interconnected via a bus 23.
[0181] Memory 21 is used to store program instructions;
[0182] The processor 22 is used to execute the program instructions stored in the memory, so that the voice processing device 20 performs the method shown in the above method embodiment.
[0183] The voice processing device provided in this application embodiment can execute the technical solution shown in the above method embodiment. Its implementation principle and beneficial effects are similar, and will not be described again here.
[0184] This application provides a computer-readable storage medium storing computer-executable instructions, which are used to implement the above-described method when executed by a processor.
[0185] This application embodiment may also provide a computer program product, including a computer program that, when executed by a processor, can implement the above-described method.
[0186] All or part of the steps in the above-described method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a readable memory. When the program is executed, it performs the steps of the above-described method embodiments; and the aforementioned memory (storage medium) includes: read-only memory (ROM), random access memory (RAM), flash memory, hard disk, solid-state drive, magnetic tape, floppy disk, optical disk, and any combination thereof.
[0187] This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processing unit of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processing unit of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0188] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0189] These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0190] Obviously, those skilled in the art can make various modifications and variations to the embodiments of this application without departing from the spirit and scope of this application. Therefore, if these modifications and variations to the embodiments of this application fall within the scope of the claims of this application and their equivalents, this application also intends to include these modifications and variations.
Claims
1. A speech processing method, characterized in that it comprises:
acquiring a first speech;
acquiring a voice response relationship network, which includes multiple question groups and a response corresponding to each question group, wherein the similarity of the questions in each question group is greater than or equal to a first threshold;
processing the first speech by a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech, wherein the first model is a word vector model;
determining, based on the at least one semantic feature, multiple candidate questions in the voice response relationship network, wherein the similarity between the candidate questions and the at least one semantic feature is greater than or equal to a second threshold; wherein the voice response relationship network is established by a knowledge graph based on voices and corresponding responses obtained in historical time periods, and an association is established between multiple question groups of the same type, the association being used to indicate that the questions in the question groups belong to the same business type;
determining, based on the candidate questions, the target response corresponding to the first speech in the voice response relationship network, and outputting the target response;
wherein determining, based on the at least one semantic feature, multiple candidate questions in the voice response relationship network comprises:
determining the similarity between the at least one semantic feature and each question in the voice response relationship network by a first similarity algorithm, wherein the first similarity algorithm is a word centroid distance algorithm;
sorting the questions in the voice response relationship network by similarity from largest to smallest to obtain sorted questions;
determining the first K questions among the sorted questions as a first question set, and the remaining questions, excluding the first K questions, as a second question set, where K is an integer greater than or equal to 1;
determining the first similarity between the at least one semantic feature and each question in the first question set by a second similarity algorithm, wherein the second similarity algorithm is a word mover's distance algorithm;
determining the second similarity between the at least one semantic feature and each question in the second question set by a third similarity algorithm, wherein the third similarity algorithm is a relaxed word mover's distance algorithm;
determining whether a target similarity exists among the plurality of second similarities, wherein the target similarity is greater than each first similarity;
if so, determining the target question corresponding to the target similarity, and determining the similarity between the at least one semantic feature and the target question according to the second similarity algorithm; sorting the questions in the first question set and the target question in descending order of similarity; updating the first question set to the sorted top K questions, and determining the plurality of candidate questions based on the first question set;
if not, determining the plurality of candidate questions based on the first question set;
wherein determining the plurality of candidate questions based on the first question set comprises:
for any question in the first question set, determining the question group to which the question belongs in the voice response relationship network;
if all questions in the first question set belong to the same question group, determining the questions in the first question set as the plurality of candidate questions.
2. The method according to claim 1, characterized in that processing the first speech by the first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech comprises: performing speech recognition processing and word segmentation processing on the first speech to obtain the first text corresponding to the first speech, wherein the first text includes at least one sentence text corresponding to the first speech and keyword tags corresponding to the sentence text, and the keyword tags are used to indicate the part-of-speech category to which the keyword belongs; and processing the first text by the first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech.
3. The method according to claim 2, characterized in that, before acquiring the voice response relationship network, the method further comprises: acquiring multiple voice questions and the response corresponding to each voice question; classifying the multiple voice questions to obtain multiple question groups, wherein the similarity of the questions in each question group is greater than or equal to a first threshold; determining the response corresponding to each question group; and generating the voice response relationship network based on the multiple voice questions, the multiple question groups, and the response corresponding to each question group.
4. A voice processing device, characterized in that, The device includes: The first acquisition module is used to acquire the first voice message; The second acquisition module is used to acquire a voice response relationship network, which includes multiple question groups and responses corresponding to each question group. The similarity of questions in the question group is greater than or equal to a first threshold. The voice response relationship network is established by using a knowledge graph based on voices acquired in historical time periods and their corresponding responses. It also establishes associations between multiple question groups of the same type, and these associations are used to indicate that the questions in the question group belong to the same business type. The processing module is used to process the first speech using a first model to obtain at least one semantic feature corresponding to at least one keyword in the first speech; wherein, the first model is a word vector model; A first determining module is configured to determine multiple candidate questions in the voice response relationship network based on the at least one semantic feature, wherein the similarity between the candidate questions and the at least one semantic feature is greater than or equal to a second threshold. 
A second determining module is configured to determine, based on the candidate questions, the target response corresponding to the first speech in the voice response relationship network, and to output the target response. The first determining module is specifically configured to: determine the similarity between the at least one semantic feature and each question in the voice response relationship network using a first similarity algorithm, wherein the first similarity algorithm is a word centroid distance algorithm; sort the questions in the voice response relationship network in descending order of similarity to obtain a sorted question set; determine the first K questions in the sorted question set as a first question set and the remaining questions as a second question set, where K is an integer greater than or equal to 1; determine a first similarity between the at least one semantic feature and each question in the first question set using a second similarity algorithm, wherein the second similarity algorithm is a word mover's distance algorithm; determine a second similarity between the at least one semantic feature and each question in the second question set using a third similarity algorithm, wherein the third similarity algorithm is a relaxed word mover's distance algorithm; determine whether a target similarity exists among the multiple second similarities, the target similarity being greater than each first similarity; if so, determine the target question corresponding to the target similarity, determine the similarity between the at least one semantic feature and the target question according to the second similarity algorithm, sort each question in the first question set and the target question in descending order of similarity, update the first question set to the top K questions after sorting, and determine the multiple candidate questions according to the first question set; if not, determine the multiple candidate questions according to the first question set.
The first determining module is further specifically configured to: for any question in the first question set, determine the question group to which the question belongs in the voice response relationship network; and, if all questions in the first question set belong to the same question group, determine the multiple questions in the first question set as the multiple candidate questions.
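The three-stage selection performed by the first determining module can be sketched as below. This is an illustration of the control flow only, under stated assumptions: the three measures are passed in as callables that already return a similarity (larger means closer), so any conversion from word centroid distance, word mover's distance, or relaxed word mover's distance to a similarity is left to the caller; "greater than each first similarity" is read as exceeding the largest first similarity computed before any update; and the function names `select_top_k` and `all_in_same_group` are hypothetical.

```python
from typing import Callable, Dict, Hashable, List

# Similarity between the query's semantic features and one stored question.
Similarity = Callable[[str], float]


def select_top_k(
    questions: List[str],
    k: int,
    coarse_sim: Similarity,   # stands in for the word centroid distance stage
    exact_sim: Similarity,    # stands in for the word mover's distance stage
    relaxed_sim: Similarity,  # stands in for the relaxed word mover's distance stage
) -> List[str]:
    # Rank every question by the cheap measure, largest similarity first.
    ranked = sorted(questions, key=coarse_sim, reverse=True)
    first_set, second_set = ranked[:k], ranked[k:]
    if not first_set:
        return []

    # First similarities: exact measure on the first K questions only.
    scores: Dict[str, float] = {q: exact_sim(q) for q in first_set}
    best_first = max(scores.values())

    # Second similarities: relaxed measure on the remaining questions.
    # A question is a target only if it beats every first similarity.
    targets = [q for q in second_set if relaxed_sim(q) > best_first]

    # Re-score targets with the exact measure, then keep the K best overall.
    for q in targets:
        scores[q] = exact_sim(q)
    return sorted(scores, key=scores.get, reverse=True)[:k]


def all_in_same_group(first_set: List[str], group_of: Callable[[str], Hashable]) -> bool:
    """True when every question in the first set maps to one question group."""
    return len({group_of(q) for q in first_set}) == 1
```

The expensive measure is evaluated only for the first K questions and for the targets that pass the relaxed check, which is the point of the staged design. What happens when the questions in the first set belong to different groups is not covered by this passage, so the sketch leaves that decision to the caller.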
5. A voice processing device, characterized by comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 3.
6. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause a computer to perform the method of any one of claims 1 to 3.
7. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method of any one of claims 1 to 3.