Method, device and electronic equipment for recognizing unregistered words
The method segments the text, obtains a word vector for each word, and converts the word vectors into frequency-domain signals. This addresses the inability of existing techniques to identify out-of-vocabulary words that are composed of multiple words, and improves recognition accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- WEBANK (CHINA)
- Filing Date
- 2021-12-17
- Publication Date
- 2026-06-12
AI Technical Summary
Existing techniques cannot effectively identify out-of-vocabulary words composed of multiple words, so their identification results are inaccurate.
The text to be processed is segmented into words, and a word vector is obtained for each word. Each word vector is converted into a time-domain signal, which is then transformed into a frequency-domain signal. The frequency-domain signals express how closely the words are related, and out-of-vocabulary words are identified on that basis.
The method improves the accuracy of out-of-vocabulary (OOV) word identification and can identify OOV words that consist of multiple words, especially those consisting of more than three words.
Figure CN114239573B_ABST
Claims
1. A method for identifying out-of-vocabulary words, characterized in that it comprises:
segmenting the text to be processed to obtain multiple words;
obtaining the word vector corresponding to each of the multiple words, and performing modulo processing on each word vector to obtain the time-domain signal corresponding to each of the multiple words;
transforming the time-domain signal corresponding to each of the multiple words to obtain the frequency-domain signal corresponding to each of the multiple words;
determining out-of-vocabulary words from the multiple words based on the frequency-domain signals corresponding to each of the multiple words;
wherein the step of determining out-of-vocabulary words from the multiple words based on their respective frequency-domain signals includes:
determining, from the multiple words, multiple target words whose distance is less than a preset threshold, based on the frequency-domain signals corresponding to each of the multiple words and the positions of each of the multiple words in the text to be processed;
for each target word, determining the out-of-vocabulary word from the multiple words based on the frequency-domain signal corresponding to the word preceding the target word, the frequency-domain signal corresponding to the target word, and the frequency-domain signal corresponding to the word following the target word;
wherein the step of determining the out-of-vocabulary word from the multiple words based on the frequency-domain signals corresponding to the preceding word, the target word, and the following word includes:
determining a first difference between the frequency-domain signal corresponding to the preceding word and the frequency-domain signal corresponding to the target word, and determining a second difference between the frequency-domain signal corresponding to the target word and the frequency-domain signal corresponding to the following word;
determining the out-of-vocabulary word from the multiple words based on the first difference and the second difference;
wherein the step of determining the out-of-vocabulary word from the multiple words based on the first difference and the second difference includes:
if the first difference is greater than a first threshold and the second difference is less than a second threshold, determining the preceding word, the target word, and the following word as the out-of-vocabulary word; or,
if the first difference is less than the first threshold and the second difference is greater than the second threshold, determining the preceding word, the target word, and the following word as the out-of-vocabulary word.
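The signal conversion and the threshold rule in claim 1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the claim does not define "modulo processing", the transform, or how a "difference" between two signals is measured, so the sketch assumes an element-wise absolute value, a discrete Fourier transform (magnitudes only), and the Euclidean distance respectively. All function names are hypothetical, and the selection of target words by distance is omitted.

```python
import cmath
import math


def time_domain_signal(word_vector):
    # Assumption: "modulo processing" is read as the element-wise absolute value.
    return [abs(x) for x in word_vector]


def frequency_domain_signal(signal):
    # Assumption: the unspecified transform is a discrete Fourier transform,
    # and only the magnitude of each frequency bin is kept.
    n = len(signal)
    return [
        abs(sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n)
    ]


def spectral_difference(a, b):
    # Assumption: the "difference" between two signals is their Euclidean distance.
    return math.dist(a, b)


def is_oov_triple(prev_sig, target_sig, next_sig, first_threshold, second_threshold):
    # Threshold rule from claim 1: exactly one of the two differences is large
    # relative to its threshold and the other is small.
    first = spectral_difference(prev_sig, target_sig)
    second = spectral_difference(target_sig, next_sig)
    return (first > first_threshold and second < second_threshold) or (
        first < first_threshold and second > second_threshold
    )
```

Under these assumptions, a caller would compute `frequency_domain_signal(time_domain_signal(v))` for the preceding, target, and following words and pass the three results to `is_oov_triple`; when it returns true, the three words are treated together as one out-of-vocabulary word.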
2. The method according to claim 1, characterized in that the step of obtaining the word vectors corresponding to each of the multiple words includes: determining the initial vector corresponding to each of the multiple words based on their respective positions in the text to be processed; for each word in the multiple words, determining a target vector corresponding to the word based on the initial vectors corresponding to the other words in the multiple words and the number of the other words, and mapping the target vector into a preset range to obtain the word vector corresponding to the word.
3. The method according to claim 2, characterized in that the step of determining the target vector corresponding to the word based on the initial vectors corresponding to the other words among the multiple words and the number of the other words includes: determining the sum vector of the initial vectors corresponding to the other words; and determining the ratio of the sum vector to the number of the other words as the target vector corresponding to the word.
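The word-vector construction in claims 2 and 3 amounts to averaging the initial vectors of all other words and rescaling the result. A minimal sketch follows, assuming a one-hot initial vector per position and min-max scaling into a range of [-1, 1]. Neither choice is stated in the claims, and the function names are hypothetical.

```python
def initial_vectors(words):
    # Assumption: the position-based initial vector is a one-hot vector
    # whose single 1.0 marks the word's index in the segmented text.
    n = len(words)
    return [[1.0 if j == i else 0.0 for j in range(n)] for i in range(n)]


def target_vector(initial, index):
    # Claim 3: sum the initial vectors of the other words,
    # then divide by the number of other words.
    others = [v for i, v in enumerate(initial) if i != index]
    if not others:
        raise ValueError("at least two words are needed")
    dim = len(initial[0])
    summed = [sum(v[d] for v in others) for d in range(dim)]
    return [s / len(others) for s in summed]


def map_to_range(vector, low=-1.0, high=1.0):
    # Assumption: "mapped within a preset range" is min-max scaling to [low, high].
    v_min, v_max = min(vector), max(vector)
    if v_max == v_min:
        return [(low + high) / 2 for _ in vector]
    scale = (high - low) / (v_max - v_min)
    return [low + (x - v_min) * scale for x in vector]


def word_vectors(words, low=-1.0, high=1.0):
    initial = initial_vectors(words)
    return [map_to_range(target_vector(initial, i), low, high) for i in range(len(words))]
```

With one-hot initial vectors, each word's vector ends up encoding its context (every position except its own), which is one plausible reading of why the claim builds the target vector from the other words only.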
4. The method according to claim 1, characterized in that the method further includes: determining the information entropy corresponding to the out-of-vocabulary word and the word frequency of the out-of-vocabulary word in the text to be processed; if the information entropy is greater than an information entropy threshold and the word frequency is greater than a word frequency threshold, determining the out-of-vocabulary word as a target out-of-vocabulary word.
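The filter in claim 4 can be sketched as below. The claim does not say how the information entropy of a candidate is computed. This sketch assumes the neighbour (branching) entropy often used in new-word discovery: the Shannon entropy of the tokens seen immediately to the left and to the right of the candidate, taking the smaller of the two. Word frequency is assumed to be a raw occurrence count, and all names are hypothetical.

```python
import math
from collections import Counter


def occurrences(tokens, candidate):
    # Start indices at which the candidate token sequence appears.
    k = len(candidate)
    target = tuple(candidate)
    return [i for i in range(len(tokens) - k + 1) if tuple(tokens[i:i + k]) == target]


def entropy(counter):
    # Shannon entropy (base 2) of a frequency table.
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def neighbour_entropy(tokens, candidate):
    # Assumption: the candidate's "information entropy" is the smaller of
    # its left-neighbour and right-neighbour entropies.
    k = len(candidate)
    left, right = Counter(), Counter()
    for i in occurrences(tokens, candidate):
        if i > 0:
            left[tokens[i - 1]] += 1
        if i + k < len(tokens):
            right[tokens[i + k]] += 1
    return min(entropy(left), entropy(right))


def is_target_oov(tokens, candidate, entropy_threshold, frequency_threshold):
    # Claim 4: keep the candidate only if both values exceed their thresholds.
    frequency = len(occurrences(tokens, candidate))
    return (
        neighbour_entropy(tokens, candidate) > entropy_threshold
        and frequency > frequency_threshold
    )
```

A candidate that appears often and in varied contexts passes both checks, while one that always appears next to the same neighbour has zero entropy on that side and is rejected.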
5. The method according to claim 1, characterized in that the method further includes: re-segmenting the text to be processed based on the out-of-vocabulary words.
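One way to picture the re-segmentation in claim 5 is to merge every run of adjacent tokens that matches an identified out-of-vocabulary word into a single token. The sketch below is only an illustration under that assumption. It represents each out-of-vocabulary word as a tuple of its constituent tokens and joins them without a separator, as would suit Chinese text. The function name is hypothetical.

```python
def resegment(tokens, oov_words):
    # Try longer out-of-vocabulary words first so that they win over shorter ones.
    patterns = sorted({tuple(w) for w in oov_words}, key=len, reverse=True)
    result, i = [], 0
    while i < len(tokens):
        for pattern in patterns:
            if pattern and tuple(tokens[i:i + len(pattern)]) == pattern:
                # Assumption: constituent tokens are joined with no separator.
                result.append("".join(pattern))
                i += len(pattern)
                break
        else:
            # No pattern matched at this position; keep the original token.
            result.append(tokens[i])
            i += 1
    return result
```

For example, `resegment(["web", "ank", "is", "good"], [("web", "ank")])` merges the first two tokens and leaves the rest unchanged.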
6. A device for identifying out-of-vocabulary words, characterized in that it comprises:
a word segmentation unit, used to segment the text to be processed into multiple words;
a first processing unit, used to obtain the word vector corresponding to each of the multiple words, and to perform modulo processing on each word vector to obtain the time-domain signal corresponding to each of the multiple words;
a second processing unit, used to transform the time-domain signal corresponding to each of the multiple words to obtain the frequency-domain signal corresponding to each of the multiple words;
a determining unit, configured to determine out-of-vocabulary words from the multiple words based on the frequency-domain signals corresponding to each of the multiple words;
wherein the determining unit includes a first determining module and a second determining module;
the first determining module is used to determine, from the multiple words, multiple target words whose distance is less than a preset threshold, based on the frequency-domain signals corresponding to each of the multiple words and the positions of each of the multiple words in the text to be processed;
the second determining module is used to determine, for each target word, the out-of-vocabulary word from the multiple words based on the frequency-domain signal corresponding to the word preceding the target word, the frequency-domain signal corresponding to the target word, and the frequency-domain signal corresponding to the word following the target word;
the second determining module is specifically used to determine a first difference between the frequency-domain signal corresponding to the preceding word and the frequency-domain signal corresponding to the target word, to determine a second difference between the frequency-domain signal corresponding to the target word and the frequency-domain signal corresponding to the following word, and to determine the out-of-vocabulary word from the multiple words based on the first difference and the second difference;
the second determining module is specifically used to determine the preceding word, the target word, and the following word as the out-of-vocabulary word if the first difference is greater than a first threshold and the second difference is less than a second threshold; or, to determine the preceding word, the target word, and the following word as the out-of-vocabulary word if the first difference is less than the first threshold and the second difference is greater than the second threshold.
7. An electronic device, characterized in that it comprises a memory and a processor; the memory is used to store a computer program; the processor is configured to read the computer program stored in the memory and, based on that computer program, execute the method for identifying out-of-vocabulary words according to any one of claims 1-5.
8. A readable storage medium, characterized in that the readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method for identifying out-of-vocabulary words according to any one of claims 1-5.
9. A computer program product, characterized in that the computer program product includes a computer program that, when executed, implements the method for identifying out-of-vocabulary words according to any one of claims 1-5.