
Dictionary entry extraction and identification method for low-resource language and general language

A recognition method in the field of language technology, applied to character and pattern recognition, electric digital data processing, natural language data processing, and related areas, achieving a wide range of adaptability

Inactive Publication Date: 2020-09-11
GUANGDONG UNIVERSITY OF FOREIGN STUDIES
5 Cites, 3 Cited by

AI Technical Summary

Problems solved by technology

Existing technology can already recognize text in most scenes, book text, and specific content such as ID cards, but it can only extract and recognize all the words in a picture without regard to the typesetting



Examples


Embodiment

[0058] Example: As shown in Figures 1 through 9, the present invention is a method for extracting and identifying dictionary entries of low-resource languages and general languages. The overall structure of the method is shown in Figure 1 and includes the following steps:

[0059] S1: First determine whether the input image needs to be preprocessed and corrected. If identification methods such as inspecting the image parameters show that it is a non-grayscale image or a non-scanned image, proceed to step S2; otherwise step S2 can be skipped;
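The S1 decision can be illustrated with a minimal sketch. It assumes that "non-grayscale" means at least one pixel has unequal color channels and that whether the image was scanned is already known as a flag; the function name `needs_preprocessing` and the `is_scanned` parameter are hypothetical and do not come from the patent text.

```python
def needs_preprocessing(pixels, is_scanned):
    """Return True when step S2 should run for this image.

    pixels: rows of pixels; each pixel is an int (one gray channel)
            or an (r, g, b) tuple.
    is_scanned: hypothetical flag, e.g. derived from image metadata.
    """
    def is_gray(pixel):
        if isinstance(pixel, int):
            return True
        r, g, b = pixel
        return r == g == b

    grayscale = all(is_gray(p) for row in pixels for p in row)
    # S1: a non-grayscale or non-scanned image goes on to S2;
    # a grayscale scan can skip S2.
    return (not grayscale) or (not is_scanned)
```

A grayscale scan returns False and skips S2, while a color photograph of a dictionary page returns True.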

[0060] S2: Preprocess and rectify the input dictionary image: perform edge detection on the image and pass the detection result to geometric correction, which aligns the four corners of the physical dictionary's edges with the four corners of the image, and use the text direction detection n...
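The corner alignment in S2 can be sketched as a projective mapping from the four detected page corners to the four image corners. The patent text shown here does not name the transform, so the homography below is an assumption, and the helper names `solve_linear`, `homography`, and `warp_point` are hypothetical. The sketch covers only the coordinate mapping, not edge detection or pixel resampling.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def homography(src, dst):
    """Return 8 coefficients mapping four src corners onto four dst corners."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    return solve_linear(a, b)


def warp_point(h, x, y):
    """Apply the projective mapping h to the point (x, y)."""
    w = h[6] * x + h[7] * y + 1.0
    return ((h[0] * x + h[1] * y + h[2]) / w,
            (h[3] * x + h[4] * y + h[5]) / w)
```

For example, `homography(page_corners, image_corners)` followed by `warp_point` on each page corner should return the matching image corner up to floating-point error. In practice an image library would also resample the pixels under this mapping.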



Abstract

The invention discloses a dictionary entry extraction and recognition method for low-resource languages and general languages. The method comprises the following steps: after training of a basic network model is complete, performing geometric correction and binarization on the input dictionary image; detecting how the dictionary is divided into columns (two columns or multiple columns), and treating the text boxes that are smaller than a certain threshold T as a unified column; cutting the image into entries within each column; and passing the target entry images obtained by cutting to a text recognition module. In this method, the input dictionary image is preprocessed and corrected, text detection is performed on it, and column detection is then performed on the dictionary. The entry text images obtained by cutting are passed to the text recognition module, and finally the recognition results are normalized according to the corpus format and automatically imported into the specified corpus, which greatly improves the efficiency of importing dictionary entries into the corpus.
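The column step can be illustrated with a small sketch. The abstract does not say which quantity is compared against the threshold T, so this example assumes it is the horizontal distance between the left edges of neighbouring text boxes. The function name `group_into_columns` and the `(x, y, w, h)` box layout are likewise hypothetical.

```python
def group_into_columns(boxes, t):
    """Group text boxes (x, y, w, h) into columns by their left edge.

    Assumption: a box joins the current column when its left edge lies
    less than t pixels to the right of the previous box's left edge
    (boxes are visited in order of increasing x).
    """
    columns = []
    for box in sorted(boxes, key=lambda b: b[0]):
        if columns and box[0] - columns[-1][-1][0] < t:
            columns[-1].append(box)
        else:
            columns.append([box])
    # Within each column, order the boxes top to bottom for entry cutting.
    return [sorted(col, key=lambda b: b[1]) for col in columns]
```

With four boxes whose left edges are near x = 10 and x = 310 and a threshold of 30, the function returns two columns, each ordered from top to bottom.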

Description

Technical field

[0001] The invention relates to a method for extracting and identifying dictionary entries, in particular to a method for extracting and identifying dictionary entries for low-resource languages and general languages.

Background technique

[0002] Natural language processing technology is continually being updated, and more and more minor languages are coming within the scope of scientific research. With the popularity of machine learning in China, it is increasingly necessary to establish correspondences between minor languages and general languages such as Chinese, and to further enrich the target-language corpus, in order to facilitate subsequent natural language processing and lay a solid foundation for the use of various deep learning models.

[0003] At present, no technology or work on the market extracts and identifies dictionary entries for minor languages and automatically connects to and imports them into corpora. The present invent...


Application Information

IPC(8): G06K9/00, G06K9/32, G06K9/62, G06N3/04, G06F40/242
CPC: G06F40/242, G06V30/413, G06V10/24, G06N3/045, G06F18/254, G06F18/259
Inventor: 颜学明, 薛海威, 蒋盛益, 刘建明
Owner GUANGDONG UNIVERSITY OF FOREIGN STUDIES