A text feature extraction method and system
By segmenting text strings and combining phoneme features to construct a two-dimensional feature matrix, the problem of insufficient text speech feature extraction in existing technologies is solved, achieving efficient and accurate phoneme-level feature extraction and improving the effect of pronunciation correction training.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- PEKING UNION MEDICAL COLLEGE HOSPITAL
- Filing Date
- 2025-12-23
- Publication Date
- 2026-06-12
Figure CN121366567B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of text processing technology, and in particular to a method and system for extracting text features.
Background Technology
[0002] Pronunciation correction is an important component of speech therapy, aiming to help children and adults with language disorders overcome pronunciation difficulties and improve their social communication skills. Effective pronunciation correction relies on selecting appropriate text samples as training materials. Ideal text samples should meet the following requirements: cover the key phonetic features needed for training, such as combinations of vowels and consonants; be suitable for the target group's age, language level, and type of pronunciation problems; and employ a language style and content close to everyday communication, such as nursery rhymes, tongue twisters, or simple stories, to enhance the fun and practicality of training. However, how to efficiently select and analyze the phonetic features of text samples has become a key challenge in the field of pronunciation correction.
[0003] In existing technologies, the extraction of text-to-speech features mainly relies on the following methods: one type of method synthesizes the text into a speech signal and then uses signal processing techniques to extract features, such as analyzing the frequency components of speech through short-time Fourier transform or simulating the human ear's frequency perception characteristics through Mel-frequency cepstral coefficients; another type of method utilizes deep learning models, such as Transformer-based bidirectional encoder representations, to extract semantic features of the text. These methods are primarily designed for speech recognition or semantic understanding tasks and struggle to fully capture the speech details and phoneme features required for pronunciation correction, thus limiting their applicability in pronunciation correction scenarios. Therefore, there is an urgent need for a method that can effectively extract text-to-speech features to better support pronunciation correction training.
Summary of the Invention
[0004] In view of the above problems, embodiments of this application are proposed to provide a method and system for extracting text features that overcome or at least partially solve the above problems.
[0005] To address the aforementioned problems, this application discloses a method for extracting text features, the method comprising:
[0006] Get the text string to be processed;
[0007] The text string is segmented to obtain the segmentation result;
[0008] Generate a phoneme sequence based on the word segmentation results;
[0009] The phoneme sequence is separated into phoneme feature combinations;
[0010] Feature data of the text string is generated based on the phoneme feature combinations.
[0011] Optionally, the step of performing word segmentation on the text string to obtain the word segmentation result includes:
[0012] The text string is segmented into single characters and / or phrases according to dictionary rules and / or thesaurus rules.
[0013] Optionally, after performing word segmentation on the text string to obtain the segmentation result, the method further includes:
[0014] If the text string contains symbols, then the symbols in the word segmentation result are removed to obtain standardized text data.
[0015] Optionally, generating a phoneme sequence based on the word segmentation result includes:
[0016] The word segmentation results are converted into a pinyin sequence according to the dictionary pinyin and / or the thesaurus pinyin.
[0017] Optionally, separating the phoneme sequence into phoneme feature combinations includes:
[0018] The pinyin sequence is separated into phoneme feature combinations consisting of consonants and vowels.
[0019] Optionally, the step of generating the feature data of the text string based on the phoneme feature combination includes:
[0020] Construct a two-dimensional feature matrix based on the combination of phoneme features;
[0021] Wherein, one dimension of the two-dimensional feature matrix is the number of times the consonant appears, and the other dimension of the two-dimensional feature matrix is the number of times the vowel appears.
[0022] Optionally, constructing a two-dimensional feature matrix based on the phoneme feature combination includes:
[0023] The values at all positions of the two-dimensional feature matrix are initialized to 0;
[0024] Iterate through the phoneme feature combinations and count the occurrence frequency of each consonant and vowel.
[0025] The frequency of occurrence of the consonants and vowels is filled into the corresponding positions of the two-dimensional feature matrix to obtain the feature map matrix.
[0026] Optionally, after constructing a two-dimensional feature matrix based on the phoneme feature combination, the method further includes:
[0027] The two-dimensional feature matrix is normalized.
[0028] The similarity between the text string and the target string is evaluated based on the normalized two-dimensional feature matrix.
[0029] Optionally, after constructing a two-dimensional feature matrix based on the phoneme feature combination, the method further includes:
[0030] The normalized two-dimensional feature matrix is used as training sample data to generate pronunciation correction text.
[0031] This application also discloses a text feature extraction system, the system comprising:
[0032] The text string acquisition module is used to acquire the text string to be processed.
[0033] The text string segmentation module is used to perform segmentation processing on the text string to obtain the segmentation result;
[0034] A phoneme sequence generation module is used to generate a phoneme sequence based on the word segmentation results;
[0035] A phoneme sequence separation module is used to separate the phoneme sequence into phoneme feature combinations;
[0036] The feature data generation module is used to generate feature data for the text string based on the phoneme feature combination.
[0037] Optionally, the text string segmentation module is used to segment the text string into single characters and / or phrases according to dictionary rules and / or thesaurus rules.
[0038] Optionally, the system further includes:
[0039] The symbol removal module is used to remove symbols from the text string segmentation result after the text string segmentation module has processed the text string into a segmentation result, so as to obtain standardized text data.
[0040] Optionally, the phoneme sequence generation module is used to convert the word segmentation result into a pinyin sequence according to dictionary pinyin and / or thesaurus pinyin.
[0041] Optionally, the phoneme sequence separation module is used to separate the pinyin sequence into phoneme feature combinations composed of consonants and vowels.
[0042] Optionally, the feature data generation module is used to construct a two-dimensional feature matrix based on the phoneme feature combination;
[0043] Wherein, one dimension of the two-dimensional feature matrix is the number of times the consonant appears, and the other dimension of the two-dimensional feature matrix is the number of times the vowel appears.
[0044] Optionally, the feature data generation module includes:
[0045] An initialization module is used to initialize the values at all positions of the two-dimensional feature matrix to 0;
[0046] The frequency counting module is used to traverse the phoneme feature combinations and count the occurrence frequency of the consonants and vowels respectively;
[0047] The frequency input module is used to input the frequency of occurrence of the consonant and the vowel into the corresponding positions of the two-dimensional feature matrix to obtain the feature map matrix.
[0048] Optionally, the system further includes:
[0049] The normalization module is used to normalize the two-dimensional feature matrix after the feature data generation module constructs the two-dimensional feature matrix based on the phoneme feature combination.
[0050] The similarity evaluation module is used to evaluate the similarity between the text string and the target string based on the normalized two-dimensional feature matrix.
[0051] Optionally, the system further includes:
[0052] The corrected text generation module is used to generate pronunciation corrected text by using the normalized two-dimensional feature matrix as training sample data after the feature data generation module constructs a two-dimensional feature matrix based on the phoneme feature combination.
[0053] The embodiments of this application have the following advantages:
[0054] This application provides a text feature extraction scheme, which involves obtaining a text string to be processed; performing word segmentation on the text string to obtain the word segmentation result; generating a phoneme sequence based on the word segmentation result; separating the phoneme sequence into phoneme feature combinations; and generating feature data of the text string based on the phoneme feature combinations.
[0055] This application's embodiments accurately extract phoneme-level features from text by generating phoneme sequences and separating phoneme feature combinations. This directly adapts to the need for detailed speech analysis in pronunciation correction scenarios, overcoming the limitations of traditional methods that focus on speech signal processing or semantic understanding. Simultaneously, it eliminates the need for complex speech synthesis or signal processing, significantly reducing computational complexity and improving feature extraction efficiency. Furthermore, the generated feature data supports diverse post-processing applications, such as pronunciation feedback or text generation, providing more efficient and flexible technical support for speech therapy, thereby enhancing the effectiveness and practicality of pronunciation correction training.
Attached Figure Description
[0056] Figure 1 is a flowchart illustrating the steps of a text feature extraction method according to an embodiment of this application;
[0057] Figure 2 is a schematic diagram of a two-dimensional feature matrix according to an embodiment of this application;
[0058] Figure 3 is a schematic diagram illustrating the principle of an application scheme for text features according to an embodiment of this application;
[0059] Figure 4 is a structural block diagram of a text feature extraction system according to an embodiment of this application.
Detailed Implementation
[0060] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0061] This application proposes a text feature extraction scheme. It involves segmenting the input text string into single characters or phrases; then removing punctuation and other symbols to standardize the text data; next, converting the text into a phoneme sequence and separating it into phoneme feature combinations, such as consonants and vowels; constructing a two-dimensional feature data matrix based on this, counting the frequency of phoneme combinations, and generating feature data; optionally, normalizing the feature data and supporting post-processing applications, such as pronunciation feedback or generating corrective text. This application emphasizes phoneme-level feature extraction, is applicable to Chinese and other languages, and significantly improves the targeting and practicality of pronunciation correction training.
[0062] Referring to Figure 1, a flowchart of a text feature extraction method according to an embodiment of this application is shown. This text feature extraction method can be applied to a text feature extraction system or a similar feature extraction system (hereinafter referred to as the "system"). Specifically, the text feature extraction method may include the following steps:
[0063] Step 101: Obtain the text string to be processed.
[0064] In practical applications, the text string can be natural language text in any form, such as Chinese nursery rhymes, tongue twisters, simple stories, or daily conversation content, depending on the needs of the pronunciation correction target group, such as children, adults, or users with specific language disorders. The system obtains the text string through user input, file reading, or database call, etc., ensuring that its format is a character sequence that can be processed. For example, the text string can be "Chongqing is a very important city", which contains Chinese characters, punctuation marks, or other characters. The goal of this step is to ensure that the input text string is complete and accurate, providing a reliable data source for subsequent word segmentation processing and feature extraction.
[0065] The acquisition process may involve preprocessing operations, such as unifying character encoding to support multilingual text, or performing preliminary cleaning on the input text to remove irrelevant control characters. The system supports multiple input methods, including manual input, speech-to-text conversion, or external text file import, to meet the needs of different application scenarios. The length and content type of the text string can be flexibly adjusted according to the actual application. For example, short sentences are used for beginners' pronunciation training, and long texts are used for advanced language therapy. During the acquisition process, the system needs to ensure the integrity of the text string, avoiding data loss or format errors, to ensure the processing accuracy of subsequent steps. In addition, this step also needs to consider the diversity of text sources, such as supporting the input of Chinese, English, or other languages.
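The preprocessing described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `preprocess` is hypothetical, and the choice of Unicode NFC normalization plus removal of all control and format characters is an assumption about what "unifying character encoding" and "removing irrelevant control characters" might mean in practice.

```python
import unicodedata


def preprocess(raw: str) -> str:
    """Normalize a raw input string before word segmentation (illustrative only)."""
    # Assumption: NFC is used so that visually identical characters share one encoding.
    text = unicodedata.normalize("NFC", raw)
    # Drop every character whose Unicode category starts with "C"
    # (control, format, unassigned, ...). This also removes tabs and newlines.
    kept = (ch for ch in text if unicodedata.category(ch)[0] != "C")
    return "".join(kept).strip()


# A byte-order mark, a NUL byte, and a zero-width space are removed.
clean = preprocess("\ufeff重庆市\x00是\u200b非常重要的城市")
print(clean)
```

Because line breaks are also control characters, a multi-line text would need to be split into lines before calling this sketch.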
[0066] Step 102: Perform word segmentation on the text string to obtain the word segmentation result.
[0067] In this step, the system divides the text string into word segmentation units of single characters and / or phrases based on dictionary rules and / or lexical rules. For example, for the text string "Chongqing is a very important city", the word segmentation result may be "Chongqing / is / very / important / of / city" after word segmentation. The word segmentation rules usually rely on a pre-built dictionary library, which contains common words, phrases, and their grammar rules to ensure the accuracy of segmentation. The system uses natural language processing techniques, such as statistical-based word segmentation algorithms or rule-based word segmentation methods, to identify the word boundaries in the text string. For Chinese, word segmentation is particularly important because Chinese text does not have natural space separators like English, and ambiguity needs to be resolved through dictionary matching, such as distinguishing "Chongqing" as a whole place name rather than the individual "Chong" and "Qing".
[0068] The word segmentation process also needs to consider multilingual scenarios. For example, English text can be directly segmented by spaces, and the validity of words can be verified by combining with a dictionary. The word segmentation results are usually represented as a sequence of units marked by delimiters (such as " / "), which is convenient for subsequent processing. Word segmentation can effectively solve the problem of polyphonic characters and lay a foundation for the generation of phoneme sequences. For example, the character "重" has different pronunciations in "重庆市" and "重要", and the semantic context needs to be clarified through word segmentation. In addition, the word segmentation process needs to ensure the integrity and consistency of the results to avoid deviations in the subsequent generation of phoneme sequences caused by incorrect segmentation.
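As one possible illustration of dictionary-based segmentation, the sketch below uses forward maximum matching, a common rule-based technique. The application does not name a specific algorithm, so this choice is an assumption, and `segment`, `LEXICON`, and `max_len` are hypothetical names with toy data.

```python
def segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon entry."""
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback unit.
            if size == 1 or candidate in lexicon:
                units.append(candidate)
                i += size
                break
    return units


# Hypothetical toy lexicon; a real system would load a large dictionary library.
LEXICON = {"重庆市", "非常", "重要", "城市"}

units = segment("重庆市是非常重要的城市", LEXICON)
print(" / ".join(units))
```

With this toy lexicon, "重庆市" is kept as one unit, so the polyphonic character "重" can later be resolved at the phrase level rather than character by character.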
[0069] Step 103: Generate a phoneme sequence according to the word segmentation results.
[0070] In this step, the system converts single characters or phrases in the word segmentation results into corresponding phoneme sequences based on dictionary and / or lexicon queries. For example, for the word segmentation result "重庆市 / 是 / 非常 / 重要 / 的 / 城市", the system generates the phoneme sequence "chong qing shi / shi / fei chang / zhong yao / de / cheng shi" after querying the lexicon. Phoneme sequences are composed of phonemes (the smallest speech units in a language), which usually appear as pinyin representations in the Chinese scenario. The dictionary library contains the mapping relationship between words and their phoneme representations, supporting the accurate processing of polyphonic characters. For example, the character "重" is pronounced as "chong" in "重庆市" and "zhong" in "重要", and the system selects the correct phoneme representation based on the semantic context of the word segmentation results.
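The lookup described in this step might be sketched as follows, assuming a phrase-level table that is consulted before a character-level table. The table names, the function `to_pinyin`, and all table contents are hypothetical toy data, not part of the application.

```python
# Hypothetical phrase-level pinyin table (resolves polyphonic characters by phrase).
PHRASE_PINYIN = {
    "重庆市": ["chong", "qing", "shi"],
    "非常": ["fei", "chang"],
    "重要": ["zhong", "yao"],
    "城市": ["cheng", "shi"],
}
# Hypothetical character-level table used as a fallback.
CHAR_PINYIN = {"是": "shi", "的": "de"}


def to_pinyin(units):
    """Map each segmentation unit to a list of pinyin syllables."""
    result = []
    for unit in units:
        if unit in PHRASE_PINYIN:
            # The phrase entry decides the reading of every character in the unit.
            result.append(list(PHRASE_PINYIN[unit]))
        else:
            # Per-character fallback; a KeyError signals a missing dictionary entry.
            result.append([CHAR_PINYIN[ch] for ch in unit])
    return result


pinyin = to_pinyin(["重庆市", "是", "非常", "重要", "的", "城市"])
print(" / ".join(" ".join(syllables) for syllables in pinyin))
```

Looking up whole phrases first is what lets "重" come out as "chong" in "重庆市" and "zhong" in "重要" without any additional disambiguation logic.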
[0071] Step 104: Separate the phoneme sequence into phoneme feature combinations.
[0072] In this step, the system decomposes the phoneme sequence into phoneme feature combinations, taking the combination of consonants and vowels as an example for illustration. For example, for the phoneme sequence "chong", the system separates it into the consonant "ch" and the vowel "ong"; for a phoneme containing only a vowel, such as "o" (corresponding to the Chinese character "哦"), the separation result is "no consonant" and the vowel "o". Consonants include no consonant, b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s, y, and vowels include a, o, e, i, u, ü, ai, ei, ui, ao, ou, iu, ie, üe, er, an, en, in, un, ün, ang, eng, ing, ong, ia, iao, ian, iang, iong, ua, uo, uai, uan, uang, ueng, üan. The system performs the separation through predefined phoneme classification rules, which may involve special mapping processing. For example, the pinyin "yao" is decomposed into "no consonant" and the vowel "iao" to handle special phonemes such as y and w.
[0073] The separation process must ensure that each phoneme sequence unit can be accurately decomposed into its corresponding phoneme feature combination, avoiding omissions or errors. This step helps extract more detailed speech features, supporting the analysis of phoneme details in pronunciation correction. The system may employ lookup tables or algorithms to achieve separation, such as phoneme decomposition algorithms based on Chinese Pinyin rules, ensuring the standardization of the separation results. The separated phoneme feature combinations are stored in ordered pairs or sequences for easy generation of subsequent feature data.
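A rule-based separation of this kind could look like the sketch below. The consonant list is the one enumerated in the text; the special-mapping table contains only the "yao" example given above, and how other y- or w-initial syllables should be mapped is not specified, so the sketch simply falls through to the general rule for them. The names `INITIALS`, `SPECIAL`, and `split_syllable` are hypothetical.

```python
NO_CONSONANT = "-"

# Consonants as listed in the text; two-letter ones come first so that
# "zh", "ch", "sh" are matched before "z", "c", "s".
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y"]

# Special mappings are checked first; only the "yao" entry comes from the text.
SPECIAL = {"yao": (NO_CONSONANT, "iao")}


def split_syllable(syllable):
    """Split one pinyin syllable into a (consonant, vowel) pair."""
    if syllable in SPECIAL:
        return SPECIAL[syllable]
    for initial in INITIALS:
        # Require a non-empty remainder so the vowel part is never empty.
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return (initial, syllable[len(initial):])
    return (NO_CONSONANT, syllable)


pairs = [split_syllable(s) for s in ["chong", "o", "yao", "shi"]]
print(pairs)
```

A lookup table keyed by whole syllables would be an equally valid design; the prefix rule is used here only because it keeps the example short.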
[0074] Step 105: Generate feature data for the text string based on phoneme feature combinations.
[0075] In this step, the system constructs a two-dimensional feature data matrix based on the separated phoneme feature combinations. One dimension represents one type of phoneme feature (such as consonants), and the other dimension represents another type of phoneme feature (such as vowels). For example, the system initializes a two-dimensional matrix with all position values set to zero; then it iterates through the phoneme sequence, counts the frequency of each phoneme feature combination (such as "ch+ong") in the text, and fills the statistical results into the corresponding positions in the matrix to form feature data. Each cell of the matrix represents the frequency of a specific phoneme feature combination, reflecting the distribution of speech features in the text.
[0076] The system may employ data structures (such as arrays or matrices) to store feature data, ensuring computational efficiency and accuracy. The process of generating feature data must ensure that all phoneme feature combinations are statistically analyzed to avoid omissions and support special case handling, such as the correct mapping of combinations without consonants. The generated feature data can be used for various post-processing applications, such as comparing the similarity between the feature data and standard pronunciation feature data using vector dot product or cosine similarity to provide feedback for pronunciation practice; or using the feature data as input to a deep learning model to generate corrected text that conforms to the target phoneme features.
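The counting procedure in this step can be sketched with plain nested lists. The row and column labels are the consonant and vowel lists given in the text; the function name `build_matrix` and the sample pairs (the syllables of the running example, with "yao" mapped to "no consonant" + "iao") are illustrative assumptions.

```python
CONSONANTS = ["-", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k", "h",
              "j", "q", "x", "zh", "ch", "sh", "r", "z", "c", "s", "y"]
VOWELS = ["a", "o", "e", "i", "u", "ü", "ai", "ei", "ui", "ao", "ou", "iu",
          "ie", "üe", "er", "an", "en", "in", "un", "ün", "ang", "eng", "ing",
          "ong", "ia", "iao", "ian", "iang", "iong", "ua", "uo", "uai", "uan",
          "uang", "ueng", "üan"]


def build_matrix(pairs):
    """Count (consonant, vowel) pairs into a len(CONSONANTS) x len(VOWELS) matrix."""
    row = {c: i for i, c in enumerate(CONSONANTS)}
    col = {v: j for j, v in enumerate(VOWELS)}
    # Initialize every cell to 0, then add 1 for each observed combination.
    matrix = [[0] * len(VOWELS) for _ in CONSONANTS]
    for consonant, vowel in pairs:
        matrix[row[consonant]][col[vowel]] += 1
    return matrix


# Pairs for "chong qing shi / shi / fei chang / zhong yao / de / cheng shi".
pairs = [("ch", "ong"), ("q", "ing"), ("sh", "i"), ("sh", "i"), ("f", "ei"),
         ("ch", "ang"), ("zh", "ong"), ("-", "iao"), ("d", "e"),
         ("ch", "eng"), ("sh", "i")]
matrix = build_matrix(pairs)
sh_i = matrix[CONSONANTS.index("sh")][VOWELS.index("i")]
print(sh_i)
```

A pair whose consonant or vowel is not in the lists raises a `KeyError` in this sketch, which makes omissions visible instead of silently dropping them.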
[0077] This application's embodiments accurately extract phoneme-level features from text by generating phoneme sequences and separating phoneme feature combinations. This directly adapts to the need for detailed speech analysis in pronunciation correction scenarios, overcoming the limitations of traditional methods that focus on speech signal processing or semantic understanding. Simultaneously, it eliminates the need for complex speech synthesis or signal processing, significantly reducing computational complexity and improving feature extraction efficiency. Furthermore, the generated feature data supports diverse post-processing applications, such as pronunciation feedback or text generation, providing more efficient and flexible technical support for speech therapy, thereby enhancing the effectiveness and practicality of pronunciation correction training.
[0078] In one exemplary embodiment of this application, one method of obtaining word segmentation results by segmenting a text string is as follows: the text string is divided into single characters and / or phrases according to dictionary rules and / or thesaurus rules.
[0079] During implementation, the system uses a pre-built dictionary library that contains common words, phrases, and their grammar rules, and identifies word boundaries in text strings through a matching algorithm. For example, for the input text string "Chongqing is a very important city", the system divides it into "Chongqing / is / very / important / of / city" according to the dictionary rules, generating a word segmentation result that includes single characters (such as "is", "of") and phrases (such as "Chongqing", "very", "important", "city"). Dictionary rules are usually based on the grammar and lexical structure of the language, while thesaurus rules rely on a large-scale word database to ensure the accuracy and semantic consistency of the segmentation. For example, the system treats "Chongqing" as a whole place name rather than as the individual characters "Chong" and "Qing", thus providing the correct semantic context for subsequent phoneme sequence generation (such as "chong qing shi"). The system may adopt natural language processing techniques to improve the accuracy of word segmentation and to handle ambiguity and polyphonic characters (such as the different pronunciations of "重" in "重庆市" and "重要"). For multilingual scenarios, the implementation method supports splitting English by spaces combined with dictionary verification, or segmenting other languages according to their grammar rules. The word segmentation result is stored as a sequence of units marked by delimiters (such as " / "), ensuring the processing efficiency and consistency of subsequent steps.
[0080] This implementation method generates a structured word segmentation result through precise word segmentation processing, effectively solving the problem of word boundary recognition for languages without clear delimiters such as Chinese, significantly improving the accuracy of phoneme sequence generation, providing a reliable input basis for the pronunciation correction scenario, and at the same time supporting multilingual expansion, enhancing the generality and practicality of the method.
[0081] In an exemplary embodiment of the present application, after obtaining the word segmentation result by performing word segmentation processing on the text string, one implementation method is: if there are symbols in the text string, remove the symbols in the word segmentation result to obtain standardized text data.
[0082] Based on the description of the foregoing embodiment, word segmentation processing divides the text string (such as "Chongqing is a very important city.") into a word segmentation result (such as "Chongqing / is / very / important / of / city / ."), which may contain punctuation marks (such as commas, periods) or special symbols (such as hyphens, quotation marks). In this implementation method, the system scans the word segmentation result, identifies and removes these symbols. For example, it processes "Chongqing / is / very / important / of / city / ." into "Chongqing / is / very / important / of / city", removing the period, and generating standardized text data. This process may adopt regular expression matching or rule-based symbol filtering algorithms to ensure that all punctuation and special symbols are accurately removed while retaining the semantic integrity of single characters and phrases.
[0083] The generation of standardized text data is crucial because symbols usually do not carry phonetic information. If retained, they may interfere with the subsequent generation of phoneme sequences (such as "chong qing shi shi fei chang zhong yao de cheng shi"), resulting in feature extraction biases. This implementation supports multilingual scenarios. For example, in English text, commas, full stops, etc. are removed, or in other languages, specific punctuation marks are processed to ensure the generality of the output data. The system may integrate a symbol recognition rule library covering common punctuation marks and special characters to improve processing efficiency and accuracy.
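The regular-expression filtering mentioned above might be sketched as follows. The pattern treats every character that is not a Unicode letter or digit as a symbol, which is an assumption; the application only says that punctuation and special symbols are removed. The names `_SYMBOLS` and `remove_symbols` are hypothetical.

```python
import re

# Any run of characters that are not Unicode letters or digits
# (punctuation, whitespace, underscores) is treated as a symbol.
_SYMBOLS = re.compile(r"[\W_]+")


def remove_symbols(units):
    """Strip symbols from each segmentation unit and drop units that become empty."""
    cleaned = (_SYMBOLS.sub("", unit) for unit in units)
    return [unit for unit in cleaned if unit]


standardized = remove_symbols(["重庆市", "是", "非常", "重要", "的", "城市", "。"])
print(" / ".join(standardized))
```

Note that this pattern also removes hyphens and apostrophes inside English words, so a multilingual deployment would likely need a narrower, language-specific symbol list.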
[0084] This implementation generates clean standardized text data by removing symbol interference in the word segmentation results, significantly improving the accuracy and consistency of phoneme sequence generation, providing a high-quality input basis for the pronunciation correction scenario, while enhancing the multilingual applicability of the method and the reliability of feature extraction.
[0085] In an exemplary embodiment of the present application, an implementation of generating a phoneme sequence according to the word segmentation result is as follows: convert the word segmentation result into a pinyin sequence according to the dictionary pinyin and / or the thesaurus pinyin.
[0086] In this implementation, the system uses a pre-built dictionary pinyin and / or thesaurus pinyin database to map single characters or phrases in the word segmentation result to the corresponding pinyin sequence through querying. For example, it converts "重庆市 / 是 / 非常 / 重要 / 的 / 城市" to "chong qing shi / shi / fei chang / zhong yao / de / cheng shi". Dictionary pinyin usually contains the pinyin mapping of single characters, while thesaurus pinyin covers the pinyin representations of phrases or words, using the semantic context to handle polyphonic characters. For example, "重" is pronounced as "chong" in "重庆市" and "zhong" in "重要". The system selects the correct pinyin representation through the semantic information of the word segmentation result.
[0087] The conversion process may adopt natural language processing techniques to ensure the accuracy of the pinyin sequence. The pinyin sequence is stored in a standardized pinyin format, usually separated by spaces. This implementation particularly emphasizes the handling of polyphonic characters: through the phrase-level semantic information in the thesaurus pinyin, it resolves the pronunciation ambiguity caused by different contexts in Chinese. In addition, this implementation supports multilingual expansion. For example, the English word segmentation result can be converted into an International Phonetic Alphabet sequence, and other languages generate corresponding phoneme representations according to their phonetic rules.
[0088] This implementation generates accurate pinyin sequences by converting between dictionary pinyin and / or thesaurus pinyin, effectively solving the problem of polyphonic characters, improving the accuracy and consistency of phoneme sequences, providing reliable input for phoneme-level feature extraction in pronunciation correction scenarios, and enhancing the method's ability to analyze speech details and its multilingual applicability.
[0089] In one exemplary embodiment of this application, one implementation of separating a phoneme sequence into a phoneme feature combination is as follows: separating a pinyin sequence into a phoneme feature combination composed of consonants and vowels.
[0090] In this implementation, the system decomposes each phoneme unit in the pinyin sequence into a combination of consonants and vowels using predefined phoneme classification rules. The separation process may employ a decomposition algorithm based on Chinese pinyin rules or a lookup table to ensure that each phoneme unit is correctly parsed into a combination of consonants and vowels. The results are stored in ordered pairs for easy generation of subsequent feature data.
[0091] The system supports multilingual extensions; for example, English phonetic sequences can be separated into voiced and voiceless consonant combinations, while other languages are decomposed according to their phonological rules, ensuring the method's universality. The resulting phoneme feature combinations provide accurate speech feature input for generating feature data, avoiding analytical biases caused by errors in phoneme decomposition.
[0092] This implementation method separates the pinyin sequence into phoneme features of consonants and vowels, enabling the system to accurately extract the speech details of the text. This significantly improves the targeting and accuracy of phoneme-level analysis in pronunciation correction scenarios, provides high-quality input for subsequent feature data generation, and supports multilingual applications, enhancing the flexibility and practicality of the method.
[0093] In one exemplary embodiment of this application, one implementation method for generating feature data of a text string based on phoneme feature combinations is as follows: a two-dimensional feature matrix is constructed based on the phoneme feature combinations; wherein, one dimension of the two-dimensional feature matrix is the frequency of consonant occurrences, and the other dimension is the frequency of vowel occurrences. Specifically, the values at all positions in the two-dimensional feature matrix are initialized to 0; the phoneme feature combinations are traversed, and the frequency of consonant and vowel occurrences is counted respectively; the frequency of consonant and vowel occurrences is filled into the corresponding positions in the two-dimensional feature matrix to obtain a feature map matrix.
[0094] In this implementation, the system first initializes a two-dimensional feature matrix, setting all element values to zero. For example, as shown in Figure 2, the matrix rows correspond to consonants (including no consonant (represented by -), b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s, y), and the columns correspond to vowels (including a, o, e, i, u, ü, ai, ei, ui, ao, ou, iu, ie, üe, er, an, en, in, un, ün, ang, eng, ing, ong, ia, iao, ian, iang, iong, ua, uo, uai, uan, uang, ueng, üan). Subsequently, the system traverses the phoneme sequence, counts the frequency of each phoneme feature combination (e.g., "ch+ong") in the text string, and fills the statistical results into the corresponding positions to generate the two-dimensional feature matrix shown in Figure 2. For example, the intersection of the "ch" row and the "ong" column records the occurrence of "ch+ong" as 0 times, and the intersection of the "ch" row and the "a" column records the occurrence of "ch+a" as 6 times. The system may use array or matrix data structures to implement the statistics, ensuring computational efficiency and accuracy, and handling special cases, such as combinations of "no consonant" and vowels (e.g., "iao"), to ensure the integrity of the matrix. The generated two-dimensional feature matrix serves as feature data, supporting subsequent applications, such as comparing the similarity between the feature data and standard pronunciation feature data through vector dot product or cosine similarity to provide feedback for pronunciation practice, or serving as input to a deep learning model to generate corrective text.
[0095] This implementation constructs a two-dimensional feature matrix with the frequency of consonant and vowel occurrences as dimensions. The system can accurately capture the phoneme-level feature distribution of text in a structured manner, significantly improving the pertinence and efficiency of speech feature analysis in pronunciation correction scenarios. At the same time, it supports diverse post-processing applications, enhancing the practicality and multilingual applicability of the method.
[0096] In one exemplary embodiment of this application, after constructing a two-dimensional feature matrix based on phoneme feature combinations, one implementation method is to: normalize the two-dimensional feature matrix; and evaluate the similarity between the text string and the target string based on the normalized two-dimensional feature matrix.
[0097] In this implementation, the system first normalizes the two-dimensional feature matrix, for example using L2 norm normalization, which divides each element of the matrix by the square root of the sum of the squares of all elements in the matrix so that the feature values fall within a uniform range. For example, it converts the original frequency values into relative values between 0 and 1. This process eliminates differences in scale, enhances the comparability of feature data, and facilitates subsequent analysis. The normalized two-dimensional feature matrix serves as feature data, used to evaluate the similarity between the input text string (e.g., "Chongqing is a very important city") and the target string (e.g., a standard pronunciation sample). Specific evaluation methods include calculating the vector inner product or cosine similarity. For example, the normalized two-dimensional feature matrix is flattened into a vector and compared with the feature vector of the target string to obtain a similarity score. This similarity evaluation can provide feedback for pronunciation practice, such as indicating the deviation between the user's pronunciation and the target pronunciation.
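A minimal sketch of this normalization and similarity evaluation follows, assuming NumPy and small toy matrices in place of the full consonant-vowel matrix. The function names `l2_normalize` and `similarity` are hypothetical, and the handling of an all-zero matrix is an assumption the application does not address.

```python
import numpy as np


def l2_normalize(matrix):
    """Divide every element by the square root of the sum of squares of all elements."""
    matrix = np.asarray(matrix, dtype=np.float64)
    norm = np.sqrt(np.sum(matrix ** 2))
    # Assumption: an all-zero matrix (empty text) is returned unchanged
    # to avoid dividing by zero.
    return matrix if norm == 0 else matrix / norm


def similarity(matrix_a, matrix_b):
    """Flatten two normalized matrices and take their inner product."""
    a = l2_normalize(matrix_a).ravel()
    b = l2_normalize(matrix_b).ravel()
    # For L2-normalized vectors the inner product equals the cosine similarity.
    return float(np.dot(a, b))


text_counts = np.array([[3, 0], [0, 4]])
target_counts = np.array([[6, 0], [0, 8]])   # same distribution, twice as long
other_counts = np.array([[0, 5], [5, 0]])    # no shared combinations

print(similarity(text_counts, target_counts))  # approximately 1.0
print(similarity(text_counts, other_counts))   # 0.0
```

Because both matrices are normalized first, a longer text with the same phoneme distribution scores the same as a shorter one.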
[0098] The system may also employ a high-efficiency matrix operation library for normalization and integrate similarity calculation algorithms to ensure accuracy. This implementation supports multilingual scenarios; for example, feature matrices in English or other languages can also be normalized and similarity evaluated.
[0099] This implementation method normalizes the two-dimensional feature matrix and evaluates its similarity. Based on standardized feature data, the system can accurately quantify the differences in speech features between the text string and the target string, significantly improving the accuracy and specificity of feedback in pronunciation correction scenarios. At the same time, it reduces the impact of differences in feature value dimensions, enhancing the versatility of the method and the flexibility of post-processing applications.
[0100] In one exemplary embodiment of this application, one implementation method after constructing a two-dimensional feature matrix based on phoneme feature combinations is to use the normalized two-dimensional feature matrix as training sample data to generate pronunciation correction text.
[0101] In this implementation, the system uses a normalized two-dimensional feature matrix as training sample data, inputting it into a deep learning model to generate pronunciation correction text that matches the target phoneme features. For example, for the input text string "Chongqing is a very important city," its normalized two-dimensional feature matrix reflects the frequency distribution of phoneme feature combinations. The deep learning model generates targeted correction text, such as adjusted phrases or sentences, based on a specified range of phoneme features (e.g., emphasizing certain consonants or vowels), to help users practice specific pronunciations. The generation process may involve a sequence generation model, using the normalized feature matrix as input features and outputting text that matches the target pronunciation features. This pronunciation correction text generation can be used in speech therapy to help users improve pronunciation details. The system needs to ensure the quality of the training sample data; normalization processing ensures that feature values are within a uniform range, enhancing the stability of model training. Furthermore, the system may combine a predefined pronunciation rule library or standard pronunciation templates to ensure the relevance and practicality of the generated text.
[0102] This implementation generates pronunciation correction text by using the normalized two-dimensional feature matrix as training sample data. The system can accurately customize pronunciation training content based on phoneme-level features, significantly improving the pertinence and effectiveness of pronunciation correction. At the same time, it uses normalized data to improve the training efficiency of deep learning models, enhancing the applicability of the method in multilingual scenarios and the practicality of speech therapy.
[0103] Based on the above description of the text feature extraction method embodiments, an application scheme of text features is introduced below. Refer to Figure 3, which illustrates the principle of an application scheme for text features according to an embodiment of this application.
[0104] This application solution mainly includes the following parts:
[0105] Step 1: Word segmentation:
[0106] The input text string is segmented into smaller units, such as single characters or phrases. For example, the original text "Chongqing City is a very important city." becomes "Chongqing City / is / very / important / city" after word segmentation. Segmenting the text into words before conversion, rather than converting Chinese characters one by one, can effectively solve the problem of polyphonic characters.
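One way to picture dictionary-based segmentation is forward maximum matching. This is a common technique swapped in here for illustration; the application does not name a specific algorithm. The Chinese sentence is reconstructed from the pinyin given in paragraph [0112], and the phrase list and the function name `segment` are hypothetical.

```python
def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: at each position take the longest vocabulary entry."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when no longer entry matches.
            if size == 1 or candidate in vocabulary:
                tokens.append(candidate)
                i += size
                break
    return tokens


vocabulary = {"重庆市", "非常", "重要", "城市"}  # toy phrase list, hypothetical
print(segment("重庆市是非常重要的城市。", vocabulary))
# ['重庆市', '是', '非常', '重要', '的', '城市', '。']
```

Keeping "重庆市" and "重要" as whole units is what later allows a word-level pinyin lookup to choose the correct reading of the shared character.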
[0107] Step 2: Symbol processing:
[0108] The results of word segmentation are processed to remove punctuation marks and special symbols in order to standardize the text data.
[0109] Following the previous example, the result after symbol processing is "Chongqing City is a very important city".
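The symbol processing of Step 2 might look like the sketch below. It assumes that "punctuation marks and special symbols" can be identified by their Unicode general category, which is an assumption rather than something the application specifies; `remove_symbols` is a hypothetical name.

```python
import unicodedata


def remove_symbols(tokens):
    """Drop punctuation and symbol characters from each token; discard empty tokens."""
    cleaned = []
    for token in tokens:
        # Unicode categories starting with "P" are punctuation, "S" are symbols.
        kept = "".join(ch for ch in token
                       if unicodedata.category(ch)[0] not in ("P", "S"))
        if kept.strip():
            cleaned.append(kept)
    return cleaned


print(remove_symbols(["重庆市", "是", "城市", "。"]))
# ['重庆市', '是', '城市']
```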
[0110] Step 3: Pinyin generation:
[0111] The processed text is converted into a pinyin representation (which can be achieved by various methods, such as a combined dictionary and thesaurus lookup). The result after processing is:
[0112] “chong qing shi shi fei chang zhong yao de cheng shi”
[0113] Note that introducing a dictionary here can effectively solve the problem of polyphonic characters. For example, the same character is pronounced "chong" in "Chongqing City" but "zhong" in "important", as the pinyin above shows.
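The role of the dictionary in resolving polyphonic characters can be illustrated with a word-first lookup. The two tables below are toy, hypothetical data covering only the example sentence (reconstructed in Chinese from the pinyin in paragraph [0112]); `to_pinyin` is likewise a hypothetical name, and a real system would load complete resources.

```python
# Toy dictionaries for illustration only.
WORD_PINYIN = {
    "重庆市": ["chong", "qing", "shi"],
    "重要": ["zhong", "yao"],
}
CHAR_PINYIN = {
    "是": "shi", "非": "fei", "常": "chang", "的": "de",
    "城": "cheng", "市": "shi", "重": "zhong",
}


def to_pinyin(tokens):
    """Prefer a word-level reading; fall back to per-character readings."""
    syllables = []
    for token in tokens:
        if token in WORD_PINYIN:
            syllables.extend(WORD_PINYIN[token])
        else:
            # Raises KeyError for characters missing from the toy table.
            syllables.extend(CHAR_PINYIN[ch] for ch in token)
    return syllables


tokens = ["重庆市", "是", "非常", "重要", "的", "城市"]
print(" ".join(to_pinyin(tokens)))
# chong qing shi shi fei chang zhong yao de cheng shi
```

The word-level entry for "重庆市" overrides the character-level default reading "zhong", which is exactly the case a per-character conversion would get wrong.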
[0114] Step 4: Separate vowels and consonants:
[0115] Separate the vowels and consonants in the pinyin, which helps to extract more detailed speech features.
[0116] Taking Mandarin pinyin as an example, the consonants are: no consonant, b, p, m, f, d, t, n, l, g, k, h, j, q, x, zh, ch, sh, r, z, c, s, y; the vowels are: a, o, e, i, u, ü, ai, ei, ui, ao, ou, iu, ie, üe, er, an, en, in, un, ün, ang, eng, ing, ong, ia, iao, ian, iang, iong, ua, uo, uai, uan, uang, ueng, üan.
[0117] For example, for the pinyin chong, after the vowel-consonant separation operation, it gets ch + ong. There may also be a situation without a consonant. For example, for "o", which only contains the vowel "o", the separation operation gets: "no consonant" + "o".
[0118] For consonants such as y and w, there are some special cases. For example, the pinyin yao actually corresponds to "no consonant" + vowel "iao" after splitting; special mapping processing is required for such cases.
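The separation in Step 4, including the special handling of y and w, could be sketched as below. The application only gives "yao" to "iao" as an example; the remaining table entries follow standard Hanyu Pinyin spelling rules and are an assumption about what the "special mapping processing" covers. The names `INITIALS`, `ZERO_INITIAL`, and `split_syllable` are hypothetical.

```python
# Two-letter initials must be checked before one-letter ones.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s"]

# Syllables written with y or w, mapped to "no consonant" plus a vowel.
# Only "yao" -> "iao" is stated in the source; the rest is assumed.
ZERO_INITIAL = {
    "yi": "i", "ya": "ia", "ye": "ie", "yao": "iao", "you": "iu",
    "yan": "ian", "yin": "in", "yang": "iang", "ying": "ing", "yong": "iong",
    "yu": "ü", "yue": "üe", "yuan": "üan", "yun": "ün",
    "wu": "u", "wa": "ua", "wo": "uo", "wai": "uai", "wei": "ui",
    "wan": "uan", "wen": "un", "wang": "uang", "weng": "ueng",
}


def split_syllable(syllable):
    """Return (consonant, vowel); "-" marks a syllable with no consonant."""
    if syllable in ZERO_INITIAL:
        return "-", ZERO_INITIAL[syllable]
    for initial in INITIALS:
        if syllable.startswith(initial) and len(syllable) > len(initial):
            return initial, syllable[len(initial):]
    return "-", syllable


print(split_syllable("chong"))  # ('ch', 'ong')
print(split_syllable("o"))      # ('-', 'o')
print(split_syllable("yao"))    # ('-', 'iao')
```

This sketch does not handle the convention of writing ü as u after j, q, and x; a fuller version would need an additional rule for that.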
[0119] Step 5: Generate a feature map:
[0120] Generate a feature map based on the separated vowel and consonant information to represent the features of the speech signal. Construct a two-dimensional matrix, where one dimension is for consonants and the other dimension is for vowels. Then, any pinyin can find its corresponding position in this two-dimensional matrix. Initialize all the position values in the matrix to 0, traverse the text pinyin, count the number of times each pinyin appears in the text and fill it into the corresponding position in the matrix, and thus obtain the feature map matrix corresponding to this text.
[0121] Step 6: Feature map normalization:
[0122] The generated feature map is normalized to ensure that the feature values are within a uniform range. This step is optional in general but required by some post-processing applications. There are various methods for normalizing the feature map matrix, such as L2 norm normalization, which divides each element x_i of the matrix by the matrix's L2 norm to obtain the normalized feature map matrix. The L2 norm is the square root of the sum of the squares of all elements.
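Written out, and assuming the n matrix elements are indexed x_1 to x_n after flattening (the indexing is not specified in the application), the L2 norm normalization described above is:

```latex
\hat{x}_i = \frac{x_i}{\lVert x \rVert_2},
\qquad
\lVert x \rVert_2 = \sqrt{\sum_{j=1}^{n} x_j^{2}}
```

For instance, a matrix whose only non-zero counts are 3 and 4 has an L2 norm of 5, so the normalized values are 0.6 and 0.8.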
[0123] Step 7: Post-processing:
[0124] Based on the text feature maps obtained in step 5 or step 6, different post-processing applications can be performed.
[0125] For example:
[0126] Post-processing application example 1: Evaluate the similarity between text and target based on feature maps.
[0127] By utilizing text feature maps, this method provides language learners with feedback for pronunciation practice. By comparing a learner's pronunciation feature map with a standard pronunciation feature map, the differences in pronunciation are displayed in an intuitive and visual way, helping learners correct pronunciation errors. The degree of similarity / deviation between different feature maps can be quantified using vector dot product or cosine similarity.
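Beyond a single similarity score, the per-position differences between two feature maps can be listed directly, which is one possible basis for the visual feedback mentioned above. This is a sketch under assumptions: NumPy is available, both maps share the same row and column labels, and the function name `largest_deviations` and the toy values are hypothetical.

```python
import numpy as np


def largest_deviations(learner, standard, row_labels, col_labels, top_k=3):
    """List the cells where two feature maps differ most, largest first."""
    diff = np.asarray(learner, dtype=float) - np.asarray(standard, dtype=float)
    # Sort flat indices by absolute difference, descending.
    order = np.argsort(-np.abs(diff), axis=None, kind="stable")[:top_k]
    report = []
    for flat_index in order:
        i, j = np.unravel_index(flat_index, diff.shape)
        if diff[i, j] != 0:
            report.append((row_labels[i] + "+" + col_labels[j], float(diff[i, j])))
    return report


rows, cols = ["zh", "ch"], ["a", "ong"]
learner_map = [[0.0, 0.25], [0.5, 0.125]]
standard_map = [[0.0, 0.75], [0.5, 0.0]]
print(largest_deviations(learner_map, standard_map, rows, cols))
# [('zh+ong', -0.5), ('ch+ong', 0.125)]
```

A negative value marks a combination that is under-represented relative to the standard map, and a positive value one that is over-represented.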
[0128] Post-processing application example 2: Using feature maps as input to a deep learning model to generate target text.
[0129] After the speech therapist sets the target training focus and range of vowels and consonants, the system automatically constructs pronunciation correction text that meets the target characteristics.
[0130] The text feature extraction method proposed in this application can more comprehensively analyze the distribution of pronunciation features in text, providing more effective support for pronunciation correction, and also providing new possibilities for the application expansion of deep learning models.
[0131] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of this application are not limited to the described order of actions, because according to the embodiments of this application, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this application.
[0132] Refer to Figure 4, which illustrates a structural block diagram of a text feature extraction system according to an embodiment of this application. Specifically, the text feature extraction system may include the following modules.
[0133] Text string acquisition module 41 is used to acquire the text string to be processed;
[0134] The text string segmentation module 42 is used to perform segmentation processing on the text string to obtain the segmentation result;
[0135] Phoneme sequence generation module 43 is used to generate a phoneme sequence based on the word segmentation result;
[0136] Phoneme sequence separation module 44 is used to separate the phoneme sequence into phoneme feature combinations;
[0137] The feature data generation module 45 is used to generate feature data of the text string based on the phoneme feature combination.
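How modules 41 to 45 chain together can be sketched as a small class that receives each stage as a callable. Everything here is illustrative: the class name `TextFeatureExtractor`, its field names, and the placeholder callables are hypothetical, and the feature data is held as a sparse counter rather than the full matrix for brevity.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TextFeatureExtractor:
    segment: Callable[[str], List[str]]            # module 42
    to_phonemes: Callable[[List[str]], List[str]]  # module 43
    split: Callable[[str], Tuple[str, str]]        # module 44

    def run(self, text: str) -> Counter:
        tokens = self.segment(text)                # text is supplied by module 41
        syllables = self.to_phonemes(tokens)
        pairs = [self.split(s) for s in syllables]
        # Module 45: sparse feature data keyed by (consonant, vowel).
        return Counter(pairs)


# Placeholder callables so the sketch runs on already-romanized input.
extractor = TextFeatureExtractor(
    segment=str.split,
    to_phonemes=lambda tokens: list(tokens),
    split=lambda s: (s[0], s[1:]),
)
features = extractor.run("ba ma ba")
print(features[("b", "a")], features[("m", "a")])  # 2 1
```

Passing the stages in as callables mirrors the modular structure of the system, so a different segmenter or pinyin converter can be substituted without touching the other stages.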
[0138] In one exemplary embodiment of this application, the text string segmentation module 42 is used to segment the text string into single characters and / or phrases according to dictionary rules and / or thesaurus rules.
[0139] In one exemplary embodiment of this application, the system further includes:
[0140] The symbol removal module is used to remove symbols from the text string after the text string segmentation module 42 has processed the text string into a segmentation result, so as to obtain standardized text data.
[0141] In one exemplary embodiment of this application, the phoneme sequence generation module 43 is used to convert the word segmentation result into a pinyin sequence according to dictionary pinyin and / or thesaurus pinyin.
[0142] In one exemplary embodiment of this application, the phoneme sequence separation module 44 is used to separate the pinyin sequence into phoneme feature combinations composed of consonants and vowels.
[0143] In one exemplary embodiment of this application, the feature data generation module 45 is used to construct a two-dimensional feature matrix based on the phoneme feature combination;
[0144] Wherein, one dimension of the two-dimensional feature matrix is the number of times the consonant appears, and the other dimension of the two-dimensional feature matrix is the number of times the vowel appears.
[0145] In one exemplary embodiment of this application, the feature data generation module 45 includes:
[0146] An initialization module is used to initialize the values at all positions of the two-dimensional feature matrix to 0;
[0147] The frequency counting module is used to traverse the phoneme feature combinations and count the occurrence frequency of the consonants and vowels respectively;
[0148] The frequency input module is used to input the frequency of occurrence of the consonant and the vowel into the corresponding positions of the two-dimensional feature matrix to obtain the feature map matrix.
[0149] In one exemplary embodiment of this application, the system further includes:
[0150] The normalization module is used to normalize the two-dimensional feature matrix after the feature data generation module 45 constructs the two-dimensional feature matrix based on the phoneme feature combination.
[0151] The similarity evaluation module is used to evaluate the similarity between the text string and the target string based on the normalized two-dimensional feature matrix.
[0152] In one exemplary embodiment of this application, the system further includes:
[0153] The text correction generation module is used to generate pronunciation correction text by using the normalized two-dimensional feature matrix as training sample data after the feature data generation module 45 constructs a two-dimensional feature matrix based on the phoneme feature combination.
[0154] As the system implementation is basically similar to the method implementation, it is described in a relatively simple way. For relevant details, please refer to the description of the method implementation.
[0155] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0156] Those skilled in the art will understand that embodiments of this application can be provided as methods, apparatus, or computer program products. Therefore, embodiments of this application can take the form of entirely hardware embodiments, entirely software embodiments, or embodiments combining software and hardware aspects. Furthermore, embodiments of this application can take the form of computer program products implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0157] This application describes embodiments with reference to flowchart illustrations and / or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of this application. It should be understood that each process and / or block of the flowchart illustrations and / or block diagrams, and combinations of processes and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing terminal device, produce means for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.
[0158] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.
[0159] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal equipment, causing a series of operational steps to be performed on the computer or other programmable terminal equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable terminal equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.
[0160] Although preferred embodiments of the present application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of the embodiments of the present application.
[0161] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.
[0162] The foregoing has provided a detailed description of a text feature extraction method and a text feature extraction system provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of this application. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of this application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A method for extracting text features, characterized in that the method includes: obtaining a text string to be processed; performing word segmentation on the text string to obtain a word segmentation result; generating a phoneme sequence based on the word segmentation result; separating the phoneme sequence into phoneme feature combinations, wherein the separation process involves special mapping processing for special cases of the consonants y and w; and generating feature data of the text string based on the phoneme feature combinations, wherein the feature data reflects the distribution of the speech features of the text; the step of generating the feature data of the text string based on the phoneme feature combinations includes: constructing a two-dimensional feature matrix based on the phoneme feature combinations, wherein one dimension of the two-dimensional feature matrix is the number of times consonants appear, and the other dimension of the two-dimensional feature matrix is the number of times vowels appear; the step of constructing a two-dimensional feature matrix based on the phoneme feature combinations includes: initializing the values of all positions in the two-dimensional feature matrix to 0; traversing the phoneme feature combinations and counting the occurrence counts of the consonants and vowels respectively; and filling the occurrence counts of the consonants and vowels into the corresponding positions in the two-dimensional feature matrix to obtain a feature map matrix; after constructing the two-dimensional feature matrix based on the phoneme feature combinations, the method further includes: normalizing the two-dimensional feature matrix; and evaluating the similarity between the text string and a target string based on the normalized two-dimensional feature matrix.
2. The method according to claim 1, characterized in that the step of performing word segmentation on the text string to obtain the word segmentation result includes: segmenting the text string into single characters and / or phrases according to dictionary rules and / or thesaurus rules.
3. The method according to claim 1 or 2, characterized in that, after performing word segmentation on the text string to obtain the word segmentation result, the method further includes: if the text string contains symbols, removing the symbols in the word segmentation result to obtain standardized text data.
4. The method according to claim 1, characterized in that the step of generating a phoneme sequence based on the word segmentation result includes: converting the word segmentation result into a pinyin sequence according to dictionary pinyin and / or thesaurus pinyin.
5. The method according to claim 4, characterized in that the step of separating the phoneme sequence into phoneme feature combinations includes: separating the pinyin sequence into phoneme feature combinations consisting of consonants and vowels.
6. The method according to claim 1, characterized in that, after constructing a two-dimensional feature matrix based on the phoneme feature combinations, the method further includes: using the normalized two-dimensional feature matrix as training sample data to generate pronunciation correction text.
7. A text feature extraction system, characterized in that the system includes: a text string acquisition module, used to acquire the text string to be processed; a text string segmentation module, used to perform segmentation processing on the text string to obtain the segmentation result; a phoneme sequence generation module, used to generate a phoneme sequence based on the word segmentation result; a phoneme sequence separation module, used to separate the phoneme sequence into phoneme feature combinations, wherein the separation process involves special mapping processing for special cases of the consonants y and w; and a feature data generation module, used to generate feature data for the text string based on the phoneme feature combinations, wherein the feature data reflects the distribution of the speech features of the text; the feature data generation module is used to construct a two-dimensional feature matrix based on the phoneme feature combinations, wherein one dimension of the two-dimensional feature matrix is the number of times consonants appear, and the other dimension of the two-dimensional feature matrix is the number of times vowels appear; the feature data generation module includes: an initialization module, used to initialize the values of all positions in the two-dimensional feature matrix to 0; a frequency counting module, used to traverse the phoneme feature combinations and count the occurrence frequency of the consonants and vowels respectively; and a frequency filling module, used to fill the occurrence frequency of the consonants and vowels into the corresponding positions in the two-dimensional feature matrix to obtain a feature map matrix; the system further includes: a normalization module, used to normalize the two-dimensional feature matrix after the feature data generation module constructs the two-dimensional feature matrix based on the phoneme feature combinations; and a similarity evaluation module, used to evaluate the similarity between the text string and the target string based on the normalized two-dimensional feature matrix.