Method, device and electronic equipment for recognizing unregistered words

The text is segmented into words, and each word's vector is converted into a time-domain signal and then transformed into a frequency-domain signal. This addresses the inability of existing techniques to identify out-of-vocabulary words composed of multiple words and improves recognition accuracy.

CN114239573B · Active · Publication Date: 2026-06-12 · WEBANK (CHINA)

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: WEBANK (CHINA)
Filing Date: 2021-12-17
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies cannot effectively identify out-of-vocabulary words composed of multiple words, resulting in poor accuracy of the identification results.

Method used

The text to be processed is segmented into words. The word vector of each word is obtained, converted into a time-domain signal, and transformed into a frequency-domain signal. The frequency-domain signals express how closely related the words are, which allows out-of-vocabulary words to be identified.

Benefits of technology

It improves the accuracy of identifying out-of-vocabulary (OOV) words and can identify OOV words consisting of multiple words, especially those consisting of more than three words.

Abstract

Embodiments of the present application provide a method and device for identifying out-of-vocabulary words, and an electronic device. To identify out-of-vocabulary words in a text to be processed, the text is first segmented to obtain a plurality of words. The word vector corresponding to each of the words is obtained, and modulo processing is performed on each word vector to obtain the time-domain signal corresponding to each word. The time-domain signals are then transformed to obtain the frequency-domain signal corresponding to each word. Because the frequency domain can effectively express how closely related the words are, out-of-vocabulary words can be accurately determined from the text to be processed based on the frequency-domain signals of the words, which improves the accuracy of the identification result.
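The vector-to-signal steps described in the abstract can be sketched as follows. This is only an illustrative reading, not the patented implementation: the text does not define "modulo processing" or name the transform, so the sketch assumes an element-wise absolute value and a naive discrete Fourier transform. The function names and the sample vectors are hypothetical.

```python
import cmath
import math


def to_time_signal(word_vector):
    """Assumed reading of "modulo processing": element-wise absolute value."""
    return [abs(x) for x in word_vector]


def to_frequency_signal(time_signal):
    """Magnitude spectrum of a naive discrete Fourier transform (assumed transform)."""
    n = len(time_signal)
    spectrum = []
    for k in range(n):
        coeff = sum(
            time_signal[t] * cmath.exp(-2j * math.pi * k * t / n)
            for t in range(n)
        )
        spectrum.append(abs(coeff))
    return spectrum


# Hypothetical 4-dimensional word vectors, for illustration only.
word_vectors = {
    "micro": [0.5, -0.5, 0.5, -0.5],
    "bank": [1.0, 1.0, 1.0, 1.0],
}

# One frequency-domain signal per word.
frequency_signals = {
    word: to_frequency_signal(to_time_signal(vec))
    for word, vec in word_vectors.items()
}
```

Each word ends up with one frequency-domain signal of the same length as its vector, so the signals of neighbouring words can be compared directly.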

Claims

1. A method for identifying out-of-vocabulary words, characterized in that it comprises:
segmenting the text to be processed to obtain a plurality of words;
obtaining the word vector corresponding to each of the plurality of words, and performing modulo processing on each word vector to obtain the time-domain signal corresponding to each of the plurality of words;
transforming the time-domain signal corresponding to each of the plurality of words to obtain the frequency-domain signal corresponding to each of the plurality of words;
determining out-of-vocabulary words from the plurality of words based on the frequency-domain signals corresponding to the plurality of words;
wherein determining out-of-vocabulary words from the plurality of words based on their frequency-domain signals comprises:
determining, from the plurality of words, a plurality of target words whose distance is less than a preset threshold, based on the frequency-domain signals corresponding to the plurality of words and the positions of the plurality of words in the text to be processed;
for each target word, determining the out-of-vocabulary word from the plurality of words based on the frequency-domain signal corresponding to the word preceding the target word, the frequency-domain signal corresponding to the target word, and the frequency-domain signal corresponding to the word following the target word;
wherein determining the out-of-vocabulary word based on the frequency-domain signals corresponding to the preceding word, the target word, and the following word comprises:
determining a first difference between the frequency-domain signal corresponding to the preceding word and the frequency-domain signal corresponding to the target word, and determining a second difference between the frequency-domain signal corresponding to the target word and the frequency-domain signal corresponding to the following word;
determining the out-of-vocabulary word from the plurality of words based on the first difference and the second difference;
wherein determining the out-of-vocabulary word from the plurality of words based on the first difference and the second difference comprises:
if the first difference is greater than a first threshold and the second difference is less than a second threshold, determining the preceding word, the target word, and the following word as the out-of-vocabulary words; or,
if the first difference is less than the first threshold and the second difference is greater than the second threshold, determining the preceding word, the target word, and the following word as the out-of-vocabulary words.
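The final decision rule of claim 1 can be illustrated with a minimal sketch. The claim does not say how a "difference" between two frequency-domain signals is measured, so Euclidean distance is assumed here. The function names and threshold values are hypothetical, and the earlier step of selecting target words is not covered.

```python
import math


def signal_difference(signal_a, signal_b):
    """Assumed difference measure: Euclidean distance between two signals."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(signal_a, signal_b)))


def is_oov_triple(prev_signal, target_signal, next_signal,
                  first_threshold, second_threshold):
    """Apply the two threshold conditions recited in claim 1."""
    first = signal_difference(prev_signal, target_signal)
    second = signal_difference(target_signal, next_signal)
    # Condition A: first difference above its threshold, second below its threshold.
    cond_a = first > first_threshold and second < second_threshold
    # Condition B: first difference below its threshold, second above its threshold.
    cond_b = first < first_threshold and second > second_threshold
    return cond_a or cond_b


# Hypothetical signals: the target matches the following word but not the preceding one.
flag = is_oov_triple([0.0, 0.0], [3.0, 4.0], [3.0, 4.0],
                     first_threshold=1.0, second_threshold=1.0)
```

When the function returns true, the claim treats the preceding word, the target word, and the following word together as out-of-vocabulary.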

2. The method according to claim 1, characterized in that obtaining the word vector corresponding to each of the plurality of words comprises:
determining the initial vector corresponding to each of the plurality of words based on the position of each word in the text to be processed;
for each word in the plurality of words, determining a target vector corresponding to the word based on the initial vectors corresponding to the other words in the plurality of words and the number of the other words, and mapping the target vector corresponding to the word into a preset range to obtain the word vector corresponding to the word.

3. The method according to claim 2, characterized in that determining the target vector corresponding to the word based on the initial vectors corresponding to the other words among the plurality of words and the number of the other words comprises:
determining the sum vector of the initial vectors corresponding to the other words;
determining the target vector corresponding to the word from the ratio of the sum vector to the number of the other words.
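The computation in claims 2 and 3 can be sketched as follows, reading the ratio of the sum vector to the number of other words as the mean of the other words' initial vectors. The claims do not specify how initial vectors are built from positions or how the mapping into the preset range works, so the sample vectors are hypothetical and min-max scaling into [0, 1] is an assumption.

```python
def target_vector(initial_vectors, index):
    """Mean of the initial vectors of all words except the one at `index`.

    Needs at least two words, otherwise there are no "other words".
    """
    others = [vec for i, vec in enumerate(initial_vectors) if i != index]
    dim = len(initial_vectors[0])
    sum_vector = [sum(vec[d] for vec in others) for d in range(dim)]
    return [component / len(others) for component in sum_vector]


def map_to_range(vector, low=0.0, high=1.0):
    """Assumed mapping into the preset range: min-max scaling to [low, high]."""
    smallest, largest = min(vector), max(vector)
    if largest == smallest:
        # A constant vector carries no spread; place every component at `low`.
        return [low for _ in vector]
    scale = (high - low) / (largest - smallest)
    return [low + (x - smallest) * scale for x in vector]


# Hypothetical initial vectors for a three-word text.
initial = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
word_vector = map_to_range(target_vector(initial, 0))
```

Because each word's vector is built only from the other words, it reflects the word's context rather than the word itself.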

4. The method according to claim 1, characterized in that the method further comprises:
determining the information entropy corresponding to the out-of-vocabulary word and the word frequency of the out-of-vocabulary word in the text to be processed;
if the information entropy is greater than an information entropy threshold and the word frequency is greater than a word frequency threshold, identifying the out-of-vocabulary word as the target out-of-vocabulary word.
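The filter in claim 4 can be illustrated with a short sketch. The claim does not define how the information entropy of a candidate is computed, so the sketch assumes the Shannon entropy of the words observed next to the candidate, a common choice in new-word discovery. The function names and thresholds are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(neighbors):
    """Shannon entropy (in bits) of the distribution of neighbouring words."""
    counts = Counter(neighbors)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy


def is_target_oov(neighbors, word_frequency,
                  entropy_threshold, frequency_threshold):
    """Keep a candidate only if both entropy and frequency exceed their thresholds."""
    return (shannon_entropy(neighbors) > entropy_threshold
            and word_frequency > frequency_threshold)


# Hypothetical candidate seen 5 times next to four different neighbouring words.
keep = is_target_oov(["a", "b", "c", "d"], word_frequency=5,
                     entropy_threshold=1.5, frequency_threshold=3)
```

A candidate that occurs often and appears next to many different words is more likely to be a free-standing word than an accidental collocation.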

5. The method according to claim 1, characterized in that the method further comprises: re-segmenting the text to be processed based on the out-of-vocabulary words.
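One possible form of the re-segmentation in claim 5 is to merge runs of adjacent tokens that match an identified out-of-vocabulary word. The claim gives no procedure, so the greedy longest-match merge below, the joining of tokens without a separator, and the sample tokens are all assumptions.

```python
def resegment(tokens, oov_sequences):
    """Merge each run of tokens that matches an identified OOV token sequence."""
    # Try longer sequences first so the longest match wins.
    patterns = sorted((tuple(seq) for seq in oov_sequences), key=len, reverse=True)
    result = []
    i = 0
    while i < len(tokens):
        for pattern in patterns:
            if pattern and tuple(tokens[i:i + len(pattern)]) == pattern:
                # Assumed joining rule: concatenate without a separator.
                result.append("".join(pattern))
                i += len(pattern)
                break
        else:
            # No OOV sequence starts here; keep the original token.
            result.append(tokens[i])
            i += 1
    return result


# Hypothetical segmentation in which three tokens form one OOV word.
merged = resegment(["we", "bank", "app", "is", "new"], [["we", "bank", "app"]])
```

After merging, the identified out-of-vocabulary word is a single token in the segmentation result.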

6. A device for identifying out-of-vocabulary words, characterized in that it comprises:
a word segmentation unit, used to segment the text to be processed to obtain a plurality of words;
a first processing unit, used to obtain the word vector corresponding to each of the plurality of words, and to perform modulo processing on each word vector to obtain the time-domain signal corresponding to each of the plurality of words;
a second processing unit, used to transform the time-domain signal corresponding to each of the plurality of words to obtain the frequency-domain signal corresponding to each of the plurality of words;
a determining unit, configured to determine out-of-vocabulary words from the plurality of words based on the frequency-domain signals corresponding to the plurality of words;
wherein the determining unit includes a first determining module and a second determining module;
the first determining module is used to determine, from the plurality of words, a plurality of target words whose distance is less than a preset threshold, based on the frequency-domain signals corresponding to the plurality of words and the positions of the plurality of words in the text to be processed;
the second determining module is used, for each target word, to determine the out-of-vocabulary word from the plurality of words based on the frequency-domain signal corresponding to the word preceding the target word, the frequency-domain signal corresponding to the target word, and the frequency-domain signal corresponding to the word following the target word;
the second determining module is specifically used to determine a first difference between the frequency-domain signal corresponding to the preceding word and the frequency-domain signal corresponding to the target word, to determine a second difference between the frequency-domain signal corresponding to the target word and the frequency-domain signal corresponding to the following word, and to determine the out-of-vocabulary word from the plurality of words based on the first difference and the second difference;
the second determining module is specifically used to determine the preceding word, the target word, and the following word as out-of-vocabulary words if the first difference is greater than a first threshold and the second difference is less than a second threshold; or, to determine the preceding word, the target word, and the following word as out-of-vocabulary words if the first difference is less than the first threshold and the second difference is greater than the second threshold.

7. An electronic device, characterized in that it comprises a memory and a processor; the memory is used to store a computer program; the processor is configured to read the computer program stored in the memory and, based on the computer program in the memory, execute the method for identifying out-of-vocabulary words according to any one of claims 1-5.

8. A readable storage medium, characterized in that the readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method for identifying out-of-vocabulary words according to any one of claims 1-5.

9. A computer program product, characterized in that the computer program product includes a computer program which, when executed, implements the method for identifying out-of-vocabulary words according to any one of claims 1-5.