A method, recognition system and storage medium for minority language text recognition
A text recognition and language identification technology, applied in the field of machine learning, that addresses the problem of ineffective recognition of minority languages.
Examples
Embodiment 1
[0051] As shown in Figure 1, the overall technical framework of the method provided by the present invention is as follows:
[0052] 1. Build a training text set
[0053] The training text comes from the corresponding language datasets on Wikipedia. One language is selected to provide the positive samples, and datasets of other languages provide the negative samples, with positive and negative samples in a 1:1 ratio. Taking Uyghur (ISO 639-1 code ug) as an example, 1 million Uyghur texts are extracted from the training set as positive samples; 800,000 texts from similar languages such as Arabic and Turkish, together with 200,000 texts randomly selected from other language families, are used as negative samples. The positive and negative samples together constitute the training text set.
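The sampling scheme in [0053] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `build_training_set`, its parameters, and the use of in-memory lists are all assumptions made for the example. It draws the positive samples from the target language and the negative samples from two pools (similar languages and other language families), and enforces the 1:1 ratio.

```python
import random

def build_training_set(positive_pool, related_pool, other_pool,
                       n_positive, n_related, n_other, seed=0):
    """Return a shuffled list of (text, label) pairs.

    Label 1 marks the target language, label 0 marks every other language.
    All names here are hypothetical; the pools are plain lists of strings.
    """
    # The description requires as many negatives as positives (1:1 ratio).
    if n_related + n_other != n_positive:
        raise ValueError("negatives must equal positives for a 1:1 ratio")

    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    positives = rng.sample(positive_pool, n_positive)
    negatives = (rng.sample(related_pool, n_related)
                 + rng.sample(other_pool, n_other))

    samples = [(text, 1) for text in positives]
    samples += [(text, 0) for text in negatives]
    rng.shuffle(samples)
    return samples
```

With the figures from the Uyghur example, the call would use `n_positive=1_000_000`, `n_related=800_000` and `n_other=200_000`.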
[0054] 2. Data preprocessing
[0055] The original training data often contains a considerable amount of erroneous or redundant information, so operations such as data cleaning and deduplication are p...
Embodiment 2
[0127] This embodiment provides a minority language recognition system that applies the method of Embodiment 1. The system includes a training text set construction module; a feature extraction module for extracting byte-based N-gram rank features, metric features based on mutual information, and probability features based on transition probability; a classifier training module for training the classifier; and the classifier itself.
[0128] In this embodiment, the feature extraction module includes a first feature extraction module for extracting byte-based N-gram rank features, a second feature extraction module for extracting metric features based on mutual information, and a third feature extraction module for extracting probability features based on transition probability.
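To make the first feature type concrete, the sketch below shows one common way to build byte-based N-gram rank features. The excerpt does not give the patent's exact formulation, so this follows the well-known rank-profile ("out-of-place" distance) approach as a stand-in; the function names, the `top_k` cutoff and the `n_max` limit are assumptions for illustration only.

```python
from collections import Counter

def byte_ngrams(text, n_max=3):
    """Yield every byte n-gram of length 1..n_max from the UTF-8 encoding."""
    data = text.encode("utf-8")
    for n in range(1, n_max + 1):
        for i in range(len(data) - n + 1):
            yield data[i:i + n]

def rank_profile(texts, top_k=300, n_max=3):
    """Map the top_k most frequent byte n-grams to their rank (0 = most frequent)."""
    counts = Counter()
    for text in texts:
        counts.update(byte_ngrams(text, n_max))
    # Sort by descending count; break ties on the bytes so ranks are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ordered[:top_k])}

def out_of_place_distance(text, profile, n_max=3):
    """Sum of rank differences between a text's n-grams and a language profile.

    N-grams missing from the profile receive the maximum penalty len(profile).
    A smaller distance means the text looks more like the profiled language.
    """
    doc = rank_profile([text], top_k=len(profile), n_max=n_max)
    max_penalty = len(profile)
    return sum(abs(rank - profile[gram]) if gram in profile else max_penalty
               for gram, rank in doc.items())
```

Working on bytes rather than characters avoids any dependence on script-specific tokenization. In a system like the one described, such a distance (or the ranks themselves) would be one input to the classifier alongside the mutual-information and transition-probability features.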
[0129] Meanwhile, this embodiment also provides a storage medium, which stores a computer program inside, and executes the method ...