A method, recognition system and storage medium for minority language text recognition

A text and language recognition technology, applied in the field of machine learning, that addresses the ineffective recognition of minority languages.

Active Publication Date: 2021-08-06
NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT GUANGDONG BRANCH +1
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

[0005] To address the technical defect that existing technology cannot effectively identify minority languages, the present invention provides a method for minority language text recognition.

Examples

Embodiment 1

[0051] As shown in Figure 1, the overall technical framework of the method provided by the present invention is as follows:

[0052] 1. Build a training text set

[0053] The training texts come from the corresponding language datasets on Wikipedia. One language is selected as the positive class, and other related language datasets are selected as negative samples. The ratio of positive to negative samples is 1:1. Taking Uyghur (ISO 639-1: ug) as an example, 1 million Uyghur texts are extracted from the training set as positive samples; 800,000 texts are extracted from similar languages such as Arabic and Turkish, and 200,000 texts are randomly selected from other language families, and together these serve as negative samples. The positive and negative samples constitute the training text set.
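The sampling scheme in [0053] can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `build_training_set`, its parameters, and the toy text pools are hypothetical and do not come from the patent, and the counts are scaled down to keep the same 10 : 8 : 2 proportions as the 1,000,000 : 800,000 : 200,000 split.

```python
import random


def build_training_set(positive, similar, other, n_similar, n_other, seed=0):
    """Return shuffled (text, label) pairs with a 1:1 positive-to-negative ratio.

    positive: texts in the target language (label 1)
    similar:  texts from related languages (label 0)
    other:    texts from unrelated language families (label 0)
    """
    rng = random.Random(seed)
    # Negative samples: a fixed number from related languages plus a random
    # draw from other language families.
    negatives = rng.sample(similar, n_similar) + rng.sample(other, n_other)
    # Draw as many positives as negatives to enforce the 1:1 ratio.
    positives = rng.sample(positive, len(negatives))
    samples = [(t, 1) for t in positives] + [(t, 0) for t in negatives]
    rng.shuffle(samples)
    return samples


# Toy pools standing in for the Wikipedia datasets (hypothetical data).
uyghur = [f"ug-{i}" for i in range(100)]
similar = [f"ar-tr-{i}" for i in range(100)]
other = [f"other-{i}" for i in range(100)]

train = build_training_set(uyghur, similar, other, n_similar=8, n_other=2)
```

Fixing the random seed makes the sampled set reproducible, which helps when comparing classifiers trained on the same split.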

[0054] 2. Data preprocessing

[0055] The original training data often contains a considerable amount of erroneous or redundant information, so operations such as data cleaning and deduplication are p...

Embodiment 2

[0127] This embodiment provides a minority language recognition system that applies the method of Embodiment 1. The system includes a training text set construction module; a feature extraction module for extracting byte-based N-gram rank features, mutual-information-based metric features, and transition-probability-based probability features; a classifier training module for training the classifier; and the classifier itself.

[0128] In this embodiment, the feature extraction module includes a first feature extraction module for extracting byte-based N-gram rank features, a second feature extraction module for extracting metric features based on mutual information, and a third feature extraction module for extracting probability features based on transition probability.
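As a rough illustration of what the first feature extraction module might compute, the sketch below builds a byte-level N-gram rank profile and scores a text by how far its own N-gram ranks fall from that profile. The excerpt does not define the rank feature precisely, so this uses the well-known "out-of-place" rank distance as a stand-in; the function names, `top_k`, and `n_max` are assumptions made for the example.

```python
from collections import Counter


def byte_ngrams(text, n_max=3):
    """Yield every byte n-gram (n = 1..n_max) of the UTF-8 encoded text."""
    data = text.encode("utf-8")
    for n in range(1, n_max + 1):
        for i in range(len(data) - n + 1):
            yield data[i:i + n]


def rank_profile(texts, top_k=300, n_max=3):
    """Map the top_k most frequent byte n-grams to their frequency rank."""
    counts = Counter()
    for text in texts:
        counts.update(byte_ngrams(text, n_max))
    # Sort by descending count; break ties on the bytes for determinism.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
    return {gram: rank for rank, (gram, _) in enumerate(ranked)}


def out_of_place(profile, text, top_k=300, n_max=3):
    """Mean rank displacement between the text's n-grams and a profile.

    N-grams missing from the profile receive the maximum penalty top_k.
    Lower values mean the text looks more like the profiled language.
    """
    doc = rank_profile([text], top_k, n_max)
    if not doc:
        return float(top_k)
    total = sum(
        abs(rank - profile[gram]) if gram in profile else top_k
        for gram, rank in doc.items()
    )
    return total / len(doc)
```

Working on bytes rather than characters means no tokenizer or script-specific handling is needed, which suits languages with limited tooling. One profile per language would yield one distance per language, and those distances could be passed to the classifier as features.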

[0129] Meanwhile, this embodiment also provides a storage medium, which stores a computer program inside, and executes the method ...

Abstract

The present invention relates to a method for minority language text recognition, comprising the following steps: S1. building a training text set drawn from different languages; S2. extracting byte-based N-gram rank features from the texts in the training text set; S3. extracting mutual-information-based metric features from the texts in the training text set, that is, calculating the information metric of all information bytes of a text within a single language; S4. extracting transition-probability-based probability features from the texts in the training text set, that is, calculating the probability that all adjacent bytes of a text can express complete information within a single language; S5. training a classifier with the features extracted in steps S2-S4; S6. extracting features from the text to be recognized according to steps S2-S4 and inputting them into the classifier, which outputs the language recognition result.
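One possible reading of the transition-probability feature in step S4 is a byte-level bigram model: estimate how likely each byte is to follow the previous one in a given language, then average the log-probabilities over a text. The sketch below follows that reading only. The class name `ByteTransitionModel`, the add-one smoothing, and the use of a mean log-probability are assumptions for illustration, not details taken from the patent.

```python
import math
from collections import Counter


class ByteTransitionModel:
    """Byte-to-byte transition statistics for one language (illustrative)."""

    def __init__(self, texts):
        self.pair_counts = Counter()   # counts of adjacent byte pairs (a, b)
        self.first_counts = Counter()  # counts of a as the first byte of a pair
        for text in texts:
            data = text.encode("utf-8")
            self.pair_counts.update(zip(data, data[1:]))
            self.first_counts.update(data[:-1])

    def log_prob(self, a, b):
        """log P(b | a), add-one smoothed over the 256 possible byte values."""
        return math.log(
            (self.pair_counts[(a, b)] + 1) / (self.first_counts[a] + 256)
        )

    def mean_log_prob(self, text):
        """Average log transition probability over all adjacent byte pairs."""
        data = text.encode("utf-8")
        pairs = list(zip(data, data[1:]))
        if not pairs:
            return 0.0
        return sum(self.log_prob(a, b) for a, b in pairs) / len(pairs)
```

A text in the modelled language should score closer to zero than a text in an unrelated language, so the value could serve as one input to the classifier trained in step S5. Averaging keeps the feature comparable across texts of different lengths.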

Description

Technical field

[0001] The present invention relates to the technical field of machine learning, and more specifically to a method, recognition system and storage medium for minority language text recognition.

Background

[0002] With the rapid development of the mobile Internet, the volume of data has grown dramatically, and a large amount of text log information is generated every day. How to extract valuable information from massive data is a subject of increasing concern. The present invention approaches the problem from the angle of language identification: it identifies the language of large amounts of text data and derives group attributes from the result.

[0003] Language recognition, or language detection, is essentially a text-processing task over information data. When the text contains multiple languages such as Chinese, English, and Japanese, it sometimes cannot all be processed at the same time. In that case, it is necessary to judge the specif...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06K9/62
Inventor: 李高翔周小敏石易鲍青波黄彦龙宋宜昌周晓阳林建树林佳涛周神保
Owner: NAT COMP NETWORK & INFORMATION SECURITY MANAGEMENT CENT GUANGDONG BRANCH