Representative verbal skill fragment extraction device and method based on seat voice segmentation
A voice-segmentation technology in the field of speech recognition. It addresses the problems that learning from recordings is time-consuming and labor-intensive, that agents must spend a great deal of time listening to them, and that unsorted recordings are difficult to use as a reference for agents to improve the quality of their own speech. The effects are a higher reference value of the extracted fragments and time saved for the agents' self-improvement and learning.
Pending Publication Date: 2020-07-31
FUDAN UNIV
AI-Extracted Technical Summary
Problems solved by technology
However, these recordings vary widely in speech quality and are not labeled as good or bad, so an agent must spend a great deal of time listening to and sifting through them, which makes improving speaking skills very inefficient.
[0005] Even if the agent recordings are labeled in advance (that is, quality scores are assigned beforehand) so that the agent can purposefully filter out record...
Method used
According to the representative speech-skill fragment extraction device and method based on agent voice segmentation provided by this embodiment, each agent voice is fed into the speech-skill prediction model to obtain its speech-skill label and the confidence of each speech-skill level, and the agent voices are then classified by label to obtain representative samples for each category. The extraction of the more representative agent voices therefore also takes the accuracy of the prediction (high confidence) into account. Further, MFCC features are extracted from each representative sample, cutting points are obtained from those features, and each representative sample is divided into voice segments at the cutting points, so representative voice fragments are extracted from the complete agent voice: the less useful parts are removed, and the speaking skills contained in the agent voice are reflected most directly and truthfully. Finally, time-domain and frequency-domain features are extracted for each voice segment and a heuristic algorithm is used to screen the segments to obtain the final speech-skill fragments, which guarantees that the agent obtains a diverse set of segments. It avoids a large number ...
Abstract
The invention provides a representative speech-skill fragment extraction device and method based on agent (seat) voice segmentation, used to extract the most representative voice fragments from agent voices as speech-skill fragments. The device comprises: an agent voice acquisition part for acquiring a plurality of agent voices to be processed; a speech-skill label prediction part for predicting each agent voice in turn with a preset speech-skill prediction model and outputting the speech-skill label and confidence of that voice; a voice sample classification part for classifying the agent voices into representative sample sets according to the label and confidence; a speech feature extraction part that extracts MFCC features from each representative sample; a cutting point acquisition part that obtains the cutting points of each representative sample; a voice segment cutting part that cuts each representative sample at the cutting points to form voice segments; a segment feature extraction part that extracts time-domain and frequency-domain features from the voice segments; and a speech-skill fragment acquisition part that extracts the speech-skill fragments with a heuristic algorithm.
Application Domain
Natural language data processing, Neural architectures (+4 more)
Technology Topic
Speech classification, Frequency domain (+4 more)
Examples
- Experimental program(1)
Example Embodiment
[0022] In order to make the technical means, creative features, objectives and effects of the present invention easy to understand, the following describes the representative speech segment extraction device and method based on agent voice segmentation of the present invention in detail with reference to the embodiments and drawings.
[0023]
[0024] Figure 1 is a structural block diagram of the representative speech fragment extraction device based on agent voice segmentation in an embodiment of the present invention.
[0025] As shown in Figure 1, the representative speech fragment extraction device 100 based on agent voice segmentation has an agent voice acquisition unit 101, a speech tag prediction unit 102, a speech sample classification unit 103, a speech feature extraction unit 104, a cutting point acquisition unit 105, a voice segment cutting unit 106, a segment feature extraction unit 107, a speech fragment acquisition unit 108, a communication unit 109, and a control unit 110 that controls the above units.
[0026] Among them, the communication unit 109 is used to exchange data between the components of the representative speech fragment extraction device 100 and between the device 100 and other devices or systems, and the control unit 110 stores a computer program that controls the operation of each component of the device 100.
[0027] The agent voice acquisition unit 101 is used to acquire multiple agent voices to be processed.
[0028] In this embodiment, the agent voice acquisition unit 101 acquires the agent voices to be processed from a pre-prepared agent voice data set. The data set can be formed in advance by recording and collecting the service calls of each agent.
[0029] For example, the representative speech fragment extraction device 100 of this embodiment can be installed in a server that communicates with a plurality of agent terminals. Each agent terminal is a computer held by an agent, equipped with a microphone for recording while the agent provides service and a terminal-side communication unit that sends the resulting agent voices to the server. In this way, the agent voices recorded by each terminal every day can be collected by the server to form the agent voice data set, from which the representative speech fragment extraction device 100 further organizes and extracts representative voice fragments.
[0030] The speech tag prediction unit 102 uses a preset speech-skill prediction model to predict, in turn, the speech-skill level corresponding to each agent voice, and outputs the speech tag indicating that level together with the confidence of the prediction.
[0031] Among them, the speech skill level is the level used to evaluate the quality of the agent's speech skill, and each level corresponds to a speech skill tag. In this embodiment, the speech skill level is divided into low, intermediate, and high, so there are three corresponding tags.
[0032] In this embodiment, the speech prediction model is pre-built and stored in the speech tag prediction unit 102. The input of the speech prediction model is the agent's voice, and the output is the speech tag and confidence level corresponding to the agent's voice.
[0033] Figure 2 is a schematic diagram of the structure of the speech prediction model in the embodiment of the present invention.
[0034] As shown in Figure 2, the speech prediction model 40 outputs the speech tag and confidence corresponding to the input agent voice. The speech prediction model 40 includes an input module 41, a multi-view extraction module 42, a feature weight extraction module 43, a prediction module 44, and an output fusion module 45.
[0035] The input module 41 is used to input the seat voice.
[0036] The multi-view extraction module 42 is configured to perform multi-view feature extraction on the seat voice and obtain a multi-view feature corresponding to the seat voice.
[0037] In this embodiment, the multi-view features are text features, time domain features, and frequency domain features of speech, and the multi-view extraction module 42 specifically includes a text processing extraction unit 42-1 and a speech processing extraction unit 42-2.
[0038] The text processing extraction unit 42-1 is configured to convert the voice data into preprocessed words and extract text features corresponding to the resulting text information.
[0039] In this embodiment, the text processing extraction part 42-1 has a text conversion part 42-1a, a preprocessing part 42-1b, a vectorization part 42-1c, and a text feature extraction part 42-1d.
[0040] The text conversion part 42-1a is used to convert the seat voice into text information.
[0041] In this embodiment, the text conversion part 42-1a uses conventional voice recognition technology (for example, calling Baidu Voice and other voice transcription tools through API) to convert voice information into text information.
[0042] The preprocessing part 42-1b is used for preprocessing the text information including at least word segmentation and denoising to obtain preprocessed words.
[0043] In this embodiment, the preprocessing part 42-1b segments the text information into multiple words and removes useless words among them, finally obtaining the preprocessed words composed of the remaining vocabulary.
[0044] The vectorization part 42-1c is used to vectorize multiple preprocessed words through a preset word2vec model to obtain multiple corresponding text vectors.
[0045] The text feature extraction part 42-1d is used to input the text vectors into the preset LSTM model and use the output of the last hidden state of the LSTM model as the text feature.
[0046] In this embodiment, the LSTM model is trained in advance on labeled samples in a supervised manner, with the category tags as supervision.
[0047] In this embodiment, the word2vec model and the LSTM model are conventional language analysis models. The LSTM model is a single-layer LSTM consisting of (1) an embedding layer (batch = 32, input_length = 500, dimension = dictionary dimension); (2) an LSTM layer (128 hidden neurons); and (3) a softmax output layer (sigmoid activation) whose output dimension equals the number of speech skill levels.
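As a rough illustration only, the following Keras sketch mirrors the single-layer LSTM text model described above. The vocabulary size and embedding dimension are placeholders (the patent only fixes input_length = 500, 128 hidden units, and an output size equal to the number of skill levels), and a plain softmax output is used even though the text mentions a sigmoid activation.

```python
# Minimal sketch of the text-feature LSTM; vocab_size and embed_dim are assumptions.
from tensorflow.keras import layers, Model

vocab_size, embed_dim, num_levels = 20000, 100, 3   # placeholders, 3 skill levels

inp = layers.Input(shape=(500,))                     # input_length = 500 word indices
x = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)(inp)
x = layers.LSTM(128)(x)                              # 128 hidden neurons; last hidden state
out = layers.Dense(num_levels, activation="softmax")(x)

model = Model(inp, out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])

# The text feature used downstream is the 128-dimensional LSTM output.
feature_extractor = Model(inp, x)
```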
[0048] The voice processing and extracting unit 42-2 is configured to process the voice of the agent to extract the time domain features and frequency domain features corresponding to the voice of the agent.
[0049] In this embodiment, the speech processing extraction unit 42-2 has a speech conversion part 42-2a and a speech feature extraction part 42-2b.
[0050] The voice conversion part 42-2a is used to convert the agent voice into Mel frequency cepstrum coefficients.
[0051] The speech feature extraction part 42-2b performs feature and index extraction based on the Mel frequency cepstrum coefficient to obtain time domain features and frequency domain features.
[0052] Specifically, the voice conversion part 42-2a first performs pre-emphasis (filtering) on the continuous voice data, then frames the signal and applies a window (to increase the continuity at the left and right ends of each frame), performs a fast Fourier transform, and passes the spectrum through a Mel frequency filter bank to smooth it and eliminate harmonics. The voice feature extraction part 42-2b then computes the logarithmic energy output by each filter, and finally the MFCC coefficients are obtained through the discrete cosine transform (DCT).
[0053] In this embodiment, the time domain features include form factor, pulse factor, kurtosis, skewness, margin factor, and peak value; frequency domain features include center of gravity frequency, mean square frequency, root mean square frequency, frequency variance, and frequency standard deviation.
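The sketch below shows one plausible way to compute the MFCCs and the listed time-domain and frequency-domain indicators with librosa and numpy. The window settings, the number of MFCC coefficients, and the exact indicator formulas are assumptions, not the patent's specification.

```python
import numpy as np
import librosa
from scipy.stats import kurtosis, skew

def mfcc_features(path, n_mfcc=13):
    """Pre-emphasis + MFCC extraction; returns waveform, sample rate, (frames x coeffs)."""
    y, sr = librosa.load(path, sr=None)
    y = librosa.effects.preemphasis(y)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, n_frames)
    return y, sr, mfcc.T

def time_domain_features(y):
    rms, mean_abs, peak = np.sqrt(np.mean(y ** 2)), np.mean(np.abs(y)), np.max(np.abs(y))
    return {
        "form_factor": rms / mean_abs,
        "pulse_factor": peak / mean_abs,
        "kurtosis": kurtosis(y),
        "skewness": skew(y),
        "margin_factor": peak / np.mean(np.sqrt(np.abs(y))) ** 2,
        "peak": peak,
    }

def frequency_domain_features(y, sr):
    spec = np.abs(np.fft.rfft(y))
    freqs = np.fft.rfftfreq(len(y), d=1.0 / sr)
    p = spec / spec.sum()
    centroid = np.sum(freqs * p)                  # center-of-gravity frequency
    msf = np.sum(freqs ** 2 * p)                  # mean-square frequency
    var = np.sum((freqs - centroid) ** 2 * p)     # frequency variance
    return {
        "centroid_frequency": centroid,
        "mean_square_frequency": msf,
        "rms_frequency": np.sqrt(msf),
        "frequency_variance": var,
        "frequency_std": np.sqrt(var),
    }
```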
[0054] Through the above processing, the text feature, time domain feature, and frequency domain feature of each agent's voice are obtained.
[0055] The feature weight extraction module 43 regresses and normalizes the multi-view features based on the L1 norm (Lasso) and obtains the feature weight corresponding to each agent's voice.
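A minimal sketch of the Lasso-based weighting, assuming the multi-view features are stacked into a matrix X (one row per agent voice) and the regression target y is the numeric speech-skill level; the alpha value and the normalization to a probability vector are assumptions.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def feature_weights(X, y, alpha=0.01):
    """L1-regularized regression; absolute coefficients normalized into feature weights."""
    X = StandardScaler().fit_transform(X)
    coef = Lasso(alpha=alpha).fit(X, y).coef_
    w = np.abs(coef)
    return w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))
```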
[0056] The prediction module 44 includes a predetermined number of base classifiers, which are respectively used to predict multi-view features and obtain respective intermediate prediction results.
[0057] In this embodiment, the base classifier is the XGBoost model. Each base classifier predicts on the multi-view features and outputs its own intermediate prediction result; that is, each agent voice yields a predetermined number of intermediate prediction results.
[0058] Figure 3 is a flowchart of the construction process of the base classifiers in the embodiment of the present invention.
[0059] As shown in Figure 3, the construction of the base classifiers includes the following steps:
[0060] Step S1-1: Obtain training speech. In this embodiment, the training voice is an agent voice prepared in advance and used for training.
[0061] Step S1-2: Perform multi-view feature extraction for each training speech and obtain training multi-view features corresponding to the training speech.
[0062] Step S1-3: Regress and normalize the multi-view features based on the L1 norm (Lasso), and obtain the feature weight corresponding to each training speech.
[0063] In this embodiment, the processing methods of the above steps S1-2 and S1-3 are the same as the multi-view extraction module 42 and the feature weight extraction module 43 respectively, and will not be repeated here.
[0064] Step S1-4: Probabilistic sampling of the multi-view features for training based on the feature weights to obtain a predetermined number of feature subsets for training.
[0065] In this embodiment, ten feature subsets are extracted in step S1-4 with a feature extraction ratio of 0.5, and ten base classifiers are correspondingly trained in step S1-5, which makes the prediction fused by the output fusion module more stable and accurate. In other solutions of the present invention, the number of feature subsets and base classifiers can be adjusted according to actual requirements, and the feature extraction ratio can be adjusted within the range (0, 1). A sketch of steps S1-4 and S1-5 is given after the step list below.
[0066] Step S1-5: Train and construct a base classifier based on each feature subset for training to obtain a predetermined number of base classifiers.
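A sketch of steps S1-4 and S1-5 under stated assumptions: feature columns are sampled without replacement with probability proportional to the feature weights (ratio 0.5, ten subsets), and one XGBoost classifier is trained per subset. The hyper-parameters are illustrative, not the patent's.

```python
import numpy as np
from xgboost import XGBClassifier

def build_base_classifiers(X, y, weights, n_clf=10, ratio=0.5, seed=0):
    """Weight-biased probabilistic column sampling + one XGBoost model per subset."""
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    k = max(1, int(ratio * n_feat))
    ensemble = []
    for _ in range(n_clf):
        idx = rng.choice(n_feat, size=k, replace=False, p=weights)  # sample feature columns
        clf = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="mlogloss")
        clf.fit(X[:, idx], y)
        ensemble.append((idx, clf))
    return ensemble
```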
[0067] The output fusion module 45 fuses the intermediate prediction results output by each base classifier based on the feature weight.
[0068] In this embodiment, the intermediate prediction result is the probability of each speech skill level output by a base classifier. When fusing the intermediate results, the output fusion module 45 groups the probabilities predicted by each base classifier by skill level and averages them, obtaining the average probability of each level for the voice sample; the maximum of these averages is then taken as the confidence of the sample. For example, in a three-class setting, suppose the predicted probabilities (intermediate prediction results) of a voice sample on categories 1-3 from two base classifiers are 0.3, 0.3, 0.4 and 0.2, 0.2, 0.6; the average probabilities are then 0.25, 0.25, 0.5 and the confidence is 0.5. Likewise, if the average probabilities of two samples on categories 1-3 are 0.3, 0.3, 0.4 for sample one and 0.1, 0.1, 0.8 for sample two, both samples are predicted as the third speech skill level, but sample two has the higher confidence.
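A short sketch of this fusion rule, assuming the ensemble produced by the construction sketch above: average the class probabilities over all base classifiers, take the argmax as the predicted tag and the maximum average probability as the confidence (matching the 0.25/0.25/0.5 example).

```python
import numpy as np

def fuse_predictions(ensemble, x):
    """ensemble: list of (feature_index_array, classifier); x: 1-D multi-view feature vector."""
    probs = np.mean([clf.predict_proba(x[idx].reshape(1, -1))[0]
                     for idx, clf in ensemble], axis=0)
    return int(np.argmax(probs)), float(np.max(probs))   # (speech-skill tag, confidence)
```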
[0069] Based on the above-mentioned speech prediction model, the speech label prediction unit 102 can predict the speech label corresponding to each agent's speech (that is, the speech level corresponding to the agent's speech) and the confidence of each agent's speech.
[0070] The speech sample classification unit 103 is configured to sort and classify the agent speech into a plurality of representative sample sets according to the speech label and the confidence level.
[0071] In this embodiment, the speech sample classification unit 103 groups the agent voices by speech tag and, for each tag, takes the top n voices with the highest confidence as representative samples, thereby forming a representative sample set for each speech tag. Each representative sample set contains all representative samples corresponding to its tag.
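A minimal sketch of this grouping and top-n selection, assuming the predictions are available as (voice_id, tag, confidence) triples.

```python
from collections import defaultdict

def representative_samples(predictions, n=20):
    """Group voices by predicted tag and keep the n most confident ones per tag."""
    by_tag = defaultdict(list)
    for voice_id, tag, conf in predictions:
        by_tag[tag].append((conf, voice_id))
    return {tag: [v for _, v in sorted(items, reverse=True)[:n]]
            for tag, items in by_tag.items()}
```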
[0072] The speech feature extraction unit 104 sequentially extracts the MFCC (Mel frequency cepstral coefficient) features of each representative sample.
[0073] The cutting point acquiring unit 105 sequentially acquires the cutting point of each representative sample according to the MFCC feature of each representative sample based on a preset cutting point acquisition method.
[0074] In this embodiment, the cutting point acquisition method is:
[0075] Model the speech sequence represented by the MFCC of the representative sample (the speech signal) as an independent multivariate Gaussian process: x_i ~ N(μ_i, Σ_i), where the dimension of each feature point x_i is d and the number of feature points (i.e., the length of the speech sequence) is N. Based on the continuity of the Gaussian process and the Bayesian information criterion (BIC), it is recursively determined whether each feature point x_i is a cutting point. If the feature point x_i is a jump point (cutting point), then:
[0076] Model 0: x_1 ... x_N ~ N(μ, Σ) (continuous)
[0077] Model 1: x_1 ... x_i ~ N(μ_1, Σ_1); x_{i+1} ... x_N ~ N(μ_2, Σ_2) (discontinuous, two-segment Gaussian model)
[0078] where Σ is the covariance matrix of all the data, Σ_1 is the covariance matrix of {x_1 ... x_i}, Σ_2 is the covariance matrix of {x_{i+1} ... x_N}, μ is the mean of the speech sequence before cutting, and μ_1 and μ_2 are the means of the two sequences obtained by cutting the speech sequence at feature point x_i.
[0079] Let |Σ| denote the determinant of the matrix Σ; the log maximum-likelihood ratio between model 0 and model 1 is then:
[0080] R(i) = N·ln|Σ| − N_1·ln|Σ_1| − N_2·ln|Σ_2|
[0081] where N_1 and N_2 are the lengths of the two sequences obtained by cutting the speech sequence at the cutting point x_i.
[0082] Then, the BIC score BIC(i) of the feature point xi is:
[0083] BIC(i) = R(i) − λP
[0084] where λ is the penalty coefficient; the larger λ is, the greater the penalty on declaring a difference between the two segments after cutting. P is:
[0085] P = (1/2)·(d + d(d+1)/2)·ln N
[0086] Finally, the goal of finding the cutting point is:
[0087] i_cut = argmax_i BIC(i)
[0088] The cutting point acquisition unit 105 computes max_i BIC(i) through the above process and decides on the cutting point of the representative sample based on this value: if max_i BIC(i) > 0, the corresponding feature point x_i is a cutting point; if max_i BIC(i) < 0, it is not a cutting point.
[0089] The Gaussian models described by the above formulas judge the cutting point, and the judgment is based on optimizing the BIC. Intuitively, the larger the BIC, the greater the difference between the distributions of the speech on either side of the candidate cutting point; an ideal segmentation is one in which the two resulting segments differ substantially, and that difference is measured by R(i).
[0090] The segmentation process is as follows (a sketch of this procedure is given after the list below). First, fix the size of a sliding window and regard the voice to be cut as a sequence; the span from the starting point to the end point of the sliding window is the first iteration, and the BIC criterion is used to determine whether a cutting point exists.
[0091] (1) If a cutting point is detected in the window, the starting point of the sliding window is moved to the cutting point and the forward iteration continues; the sliding window size does not change.
[0092] (2) If no cutting point is detected in the window, the starting point of the window remains unchanged and the end point is extended forward until a cutting point is detected.
[0093] (3) Repeat (1) and (2) until the last window (whose end point reaches the end of the speech sequence) has been processed.
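The sketch below implements the BIC score and the sliding-window search described above on an (N × d) MFCC matrix. The window size, growth step, minimum segment length, and λ are assumptions chosen only for illustration.

```python
import numpy as np

def bic_score(X, i, lam=1.0):
    """BIC(i) = R(i) - lam*P for splitting the (N x d) MFCC sequence X at index i."""
    N, d = X.shape
    n1, n2 = i, N - i
    _, ld = np.linalg.slogdet(np.cov(X, rowvar=False))       # ln|Sigma|
    _, ld1 = np.linalg.slogdet(np.cov(X[:i], rowvar=False))  # ln|Sigma_1|
    _, ld2 = np.linalg.slogdet(np.cov(X[i:], rowvar=False))  # ln|Sigma_2|
    R = N * ld - n1 * ld1 - n2 * ld2
    P = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(N)
    return R - lam * P

def detect_cut_points(mfcc, win=200, grow=100, min_seg=20, lam=1.0):
    """Sliding-window BIC segmentation over an (N x d) MFCC matrix."""
    cuts, start, end, N = [], 0, 200, len(mfcc)
    end = win
    while start + 2 * min_seg < N:
        end = min(end, N)
        X = mfcc[start:end]
        cands = list(range(min_seg, len(X) - min_seg))
        scores = [bic_score(X, i, lam) for i in cands] if cands else []
        if scores and max(scores) > 0:                       # cutting point found
            cut = start + cands[int(np.argmax(scores))]
            cuts.append(cut)
            start, end = cut, cut + win                      # move window start to the cut
        elif end == N:
            break                                            # last window processed
        else:
            end += grow                                      # no cut: extend the end point
    return cuts
```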
[0094] The voice segment cutting part 106 is used for cutting each representative sample according to the cutting point to form a voice segment corresponding to each representative sample.
[0095] In this embodiment, after the cutting point acquisition unit 105 obtains n cutting points for a representative sample, the voice segment cutting unit 106 divides the representative sample (voice) into n+1 segments accordingly. After that, all segments cut from all voices under the same speech tag are pooled together for the subsequent extraction of that tag's representative subset.
[0096] The segment feature extraction unit 107 extracts the time domain feature and the frequency domain feature of the corresponding speech segment based on the MFCC feature of each representative sample.
[0097] In this embodiment, the extraction method of the time domain feature and the frequency domain feature performed by the segment feature extraction unit 107 is the same as the extraction method of the speech feature extraction unit 42-2b, and will not be repeated here.
[0098] The speech fragment acquisition unit 108 uses a heuristic algorithm to construct an optimal representative subset for each speech skill level based on the time-domain and frequency-domain features of each voice segment, and takes the voice segments in the optimal representative subsets as the speech-skill fragments.
[0099] In this embodiment, for each speech tag, the speech fragment acquisition unit 108 uses a heuristic algorithm to optimize over all the voice segments of that tag and form the optimal representative subset corresponding to it; each voice segment in that subset is a speech-skill fragment of the corresponding tag.
[0100] In this embodiment, the speech fragment acquisition unit 108 uses the simulated annealing algorithm to construct the optimal representative subset. The simulated annealing algorithm sets a goal and a candidate set, continuously replaces objects (i.e., voice segments) in the set, and computes the sum of similarities among the objects in the set; the higher the sum of similarities, the worse the diversity. The goal of the iterative optimization is to make the similarity index as small as possible, and the iteration stops when the convergence condition is reached. Therefore, in this embodiment, the training objective of each optimal representative subset is:
[0101] min Σ_{i=1..m} Σ_{j=1..T, j≠i} cos(r_i, r_j)
[0102] where T is the total number of voice segments under all speech tags, m is the number of voice segments to be extracted under each speech tag, cos is the cosine similarity, r_i is the i-th voice segment characterized by its time-domain features, and r_j is the j-th voice segment characterized in the same way.
[0103] In this embodiment, the essence of constructing the optimal representative subset described above is to find, within each category, the most distinctive and non-substitutable voice segments. Such a voice segment:
[0104] (1) must be distinguished from the voice segments of the other categories to the greatest extent; for example, identical or similar voice segments that appear in every category cannot serve as representative samples, since they have no discriminative power;
[0105] (2) must also be distinguished from the other voice segments within its own category; that is, the m extracted voice segments are the most distinctive of that category.
[0106] Therefore, the optimization goal is to minimize the sum of the within-class sample similarity and the between-class sample similarity. In this embodiment, the voice segments, represented by their time-domain and frequency-domain features, form a matrix: each row is the feature-value data of one voice segment and each column is one time-frequency-domain feature. During optimization, cos(r_i, r_j) is the cosine similarity between the feature vectors of two voice segments. The specific algorithm flow is:
[0107] For each category of voice segments, m initial samples are selected, where m is the preset number of samples to extract. In each iteration, part of the m samples are replaced and the total similarity is recomputed; over multiple iterations the total similarity gradually decreases. The m samples obtained when the iteration finally stops constitute the optimal representative subset, i.e., the representative fragments of that category (speech tag).
[0108] In this embodiment, the convergence condition is that the difference between the values of the min objective in successive iterations does not exceed a set threshold (typically 0.001), or that the number of iterations reaches a set upper limit (the iteration stops after 1000 iterations).
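A sketch of the subset search under stated assumptions: the feature matrix F covers all T segments, cand_idx holds the indices of one tag's segments (assumed to have more than m elements), and the single-element swap plus linear cooling schedule are illustrative choices, not taken from the patent.

```python
import numpy as np

def cosine_matrix(F):
    """Pairwise cosine similarity of the row vectors in F."""
    Fn = F / (np.linalg.norm(F, axis=1, keepdims=True) + 1e-12)
    return Fn @ Fn.T

def anneal_subset(F, cand_idx, m, iters=1000, tol=1e-3, t0=1.0, seed=0):
    """Pick m segments of one tag that minimize summed cosine similarity to all T segments."""
    rng = np.random.default_rng(seed)
    S = cosine_matrix(F)
    np.fill_diagonal(S, 0.0)                      # exclude the j == i terms
    cost = lambda sel: S[sel, :].sum()            # within- + between-class similarity
    sel = list(rng.choice(cand_idx, m, replace=False))
    best, best_cost, prev = sel[:], cost(sel), np.inf
    for k in range(iters):
        temp = t0 * (1 - k / iters) + 1e-9        # simple linear cooling (assumption)
        cand = sel[:]
        cand[rng.integers(m)] = rng.choice([c for c in cand_idx if c not in cand])
        delta = cost(cand) - cost(sel)
        if delta < 0 or rng.random() < np.exp(-delta / temp):
            sel = cand                            # accept better or occasionally worse swaps
        if cost(sel) < best_cost:
            best, best_cost = sel[:], cost(sel)
        if abs(prev - best_cost) < tol:           # convergence: change below threshold
            break
        prev = best_cost
    return best
```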
[0109] Through the above process, the representative speech fragment extraction device 100 obtains the representative fragments corresponding to each speech tag; these fragments can be stored on the server, indexed by speech tag, and provided to the agents. For example, an agent can use his or her own computer to select the desired speech tag, and the server retrieves the stored representative fragments for that tag and sends them to the agent terminal for the agent to view.
[0110] Figure 4 is a flowchart of the speech fragment extraction process in the embodiment of the present invention.
[0111] As shown in Figure 4, after the agent voice data set to be processed is input into the representative speech fragment extraction device 100, the device starts the speech fragment extraction process, which includes the following steps:
[0112] Step S1: the agent voice acquisition unit 101 acquires the agent voices to be processed from the agent voice data set, and then proceeds to step S2;
[0113] Step S2: The speech tag prediction unit 102 sequentially inputs each agent's speech into the speech prediction model and outputs the speech tag and confidence corresponding to the agent's speech, and then proceeds to step S3;
[0114] In step S3, the speech sample classification unit 103 classifies the agent voices according to the speech tags output in step S2 and, for each tag, takes the top n voices with the highest confidence as representative samples, thereby forming a representative sample set for each speech tag, and then proceeds to step S4;
[0115] Step S4, the speech feature extraction unit 104 sequentially extracts the MFCC features of each representative sample, and then proceeds to step S5;
[0116] In step S5, the cutting point acquisition part 105 acquires the cutting point of each representative sample based on the MFCC feature extracted in step S4 by the cutting point acquisition method, and then proceeds to step S6;
[0117] Step S6, the voice segment cutting unit 106 cuts each representative sample according to the cutting point obtained in step S5 to form a corresponding voice segment, and then proceeds to step S7;
[0118] Step S7, the segment feature extraction unit 107 extracts time domain features and frequency domain features of the corresponding speech segment based on the MFCC features of the representative sample, and then proceeds to step S8;
[0119] In step S8, the speech fragment acquisition unit 108 uses a heuristic algorithm to construct the optimal representative subset corresponding to each speech skill level based on the time-domain and frequency-domain features of each voice segment, takes the voice segments in the optimal representative subsets as the speech-skill fragments, and then enters the end state. A sketch of this end-to-end flow is given after this step list.
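Purely as an orientation aid, the sketch below strings steps S1-S8 together. The helper names model.predict, extract_mfcc, cut_segments, and segment_features are hypothetical stand-ins for the components sketched earlier; restricting the subset search to one tag's segments is also a simplification.

```python
import numpy as np
# extract_mfcc, cut_segments, segment_features, model.predict and anneal_subset /
# detect_cut_points are hypothetical helpers corresponding to the earlier sketches.

def extract_skill_fragments(agent_voices, model, n_top=20, m=5):
    """Illustrative end-to-end flow of steps S1-S8 (not the patent's exact implementation)."""
    preds = [(v, *model.predict(v)) for v in agent_voices]       # S2: (voice, tag, confidence)
    fragments = {}
    for tag in {t for _, t, _ in preds}:
        reps = sorted((p for p in preds if p[1] == tag),         # S3: top-n most confident
                      key=lambda p: p[2], reverse=True)[:n_top]
        segments = []
        for voice, _, _ in reps:
            mfcc = extract_mfcc(voice)                           # S4: MFCC features
            cuts = detect_cut_points(mfcc)                       # S5: BIC cutting points
            segments += cut_segments(voice, cuts)                # S6: voice segments
        feats = np.array([segment_features(s) for s in segments])  # S7: time/frequency features
        idx = anneal_subset(feats, list(range(len(segments))), m)  # S8: representative subset
        fragments[tag] = [segments[i] for i in idx]
    return fragments
```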
[0120] Example function and effect
[0121] According to the representative speech fragment extraction device and method based on agent voice segmentation provided in this embodiment, the speech prediction model is used to predict each agent voice and obtain the speech-skill label and confidence corresponding to its speech-skill level, and the agent voices are classified by label to obtain the representative samples of each category; therefore, while the more representative agent voices are extracted, the accuracy of the prediction (high confidence) is also taken into account. Furthermore, MFCC features are extracted from each representative sample, the cutting points of each representative sample are obtained from the MFCC features, and each representative sample is cut into voice segments at those points, so representative voice fragments are extracted from the complete agent voice: the less useful parts of the voice are removed, and the speaking skills contained in the agent voice are reflected most directly and truthfully. Finally, time-domain and frequency-domain features are extracted for each voice segment and a heuristic algorithm is used to screen the segments to obtain the final speech-skill fragments, which guarantees the diversity of the fragments the agent obtains, avoids a large number of homogeneous fragments among them, and improves their reference value. The representative speech fragment extraction device and method of the present invention can directly extract and organize the most representative voice fragments in the agent voices and provide them to the agents, which not only saves the agents the time of self-improvement and learning based on scored recordings, but also lets them learn more efficiently thanks to the diversity of the speech-skill fragments.
[0122] In the embodiment, the speech cutting points are determined by the BIC criterion and a multivariate Gaussian process, and the voice segments are characterized by time-domain and frequency-domain features, so characteristics of the agent voice such as fluctuation and intonation are fully reflected, and representative voice segments can be extracted from the agent voice more accurately.
[0123] In the embodiment, the simulated annealing algorithm is used to quickly obtain voice fragments that satisfy both accuracy (high confidence) and distinctiveness (the representative fragments of each speech-skill level differ from one another). Without such an optimization algorithm, the complexity of pairwise comparison of voice segments would be too high, which not only consumes a lot of computing resources but also leads to a poor screening result.
[0124] In the embodiment, because the LSTM model is used to extract text features during text analysis, the sequential dependencies in the context are effectively captured, so the characterization of the text information is more accurate.
[0125] In the embodiment, the speech prediction model extracts the multi-view features of the agent voice through the multi-view extraction module, extracts feature weights from those features through the feature weight extraction module, inputs the multi-view features into multiple pre-built base classifiers to obtain intermediate prediction results, and fuses the intermediate results with the output fusion module based on the feature weights. Through such a prediction process, the speech-skill level corresponding to the agent voice can be predicted accurately and stably.
[0126] Further, in the embodiment, since the multi-view features include text, time-domain, and frequency-domain features, the model can predict the speech-skill level of the agent voice from multiple aspects such as wording and intonation, comprehensively and accurately evaluating both "how" the agent speaks and "what" the agent says in each voice, and finally obtaining a speech-skill level and confidence that better match objective judgment.
[0127] In the embodiment, the time-frequency-domain features are further extracted from the MFCC features, i.e., computed on the MFCC matrix after it is compressed into a single row (the standard deviation of each feature vector is taken as the representative of that feature dimension, compressing the MFCC matrix into one row), together with the mean, variance, and waveform characteristics of each segment. The final feature dimension therefore becomes much lower, which is equivalent to dimensionality reduction and abstracts the fluctuation pattern of a long sequence, so the originally high-dimensional MFCC features (tens or hundreds of thousands of values) can be used for model learning while the voice characteristics are preserved.
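A small sketch of this compression step, assuming an (n_frames × n_coefficients) MFCC matrix; the particular summary statistics appended after the per-coefficient standard deviations are an assumption.

```python
import numpy as np

def compress_mfcc(mfcc):
    """Compress an (n_frames x n_coeff) MFCC matrix into one fixed-length row."""
    per_coeff_std = mfcc.std(axis=0)                  # one value per MFCC coefficient
    summary = np.array([mfcc.mean(), mfcc.var(),      # overall mean and variance
                        np.ptp(mfcc.mean(axis=1))])   # rough measure of waveform spread
    return np.concatenate([per_coeff_std, summary])
```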
[0128] The foregoing embodiments are only used to illustrate specific implementations of the present invention, and the present invention is not limited to the description scope of the foregoing embodiments.