New Chinese word recognition method based on graph structure

A new word recognition and graph structure technology, applied in fields such as instruments, computing, and electrical digital data processing. It addresses the problems that existing methods cannot recognize long words and are unsuitable for network data with variable structure, and it achieves accurate discovery and recognition of new words.

Status: Inactive. Publication Date: 2014-08-06
CHINA INFORMATION TECH SECURITY EVALUATION CENT +1
4 Cites, 17 Cited by

AI Technical Summary

Problems solved by technology

The above documents all describe representative new word discovery and recognition algorithms, and each meets the needs of new word discovery from a certain angle. However, they are not suitable for network data with variable structure, and because they require the word length to be determined in advance, they cannot recognize long words.

Examples

Embodiment Construction

[0036] The present invention will be further described below through specific embodiments and accompanying drawings.

[0037] Figure 1 is a flow chart of the steps of the graph-based new word recognition method of the present invention, which comprises the following steps:

[0038] Step 102 performs word segmentation preprocessing on the document set. If a word segmentation program is available, it is used directly; otherwise, each character is treated as a word by default.

[0039] Step 104 abstracts the document set into a word graph; see Figure 2 for the specific implementation.

[0040] Step 106 traverses the graph, performing discovery and analysis of candidate (alternative) new words for each node.

[0041] Step 108 is the candidate new word discovery process for each node; see Figure 3 for the specific implementation.

[0042] Step 110 summarizes the phased results, collecting all candidate words.

[0043] 112 ...
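The first steps above (segmentation, word graph construction, and per-node candidate discovery) can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `segment`, `build_word_graph`, and `candidate_edges` are hypothetical, the edge weight is assumed to be the number of times one token directly follows another, and the co-occurrence rate is assumed to be that count divided by the frequency of the preceding token. The excerpt does not define either quantity.

```python
from collections import defaultdict


def segment(text, segmenter=None):
    """Step 102 (sketch): use a supplied segmenter, else one character per token."""
    if segmenter is not None:
        return list(segmenter(text))
    # Fallback: every non-whitespace character is its own token.
    return [ch for ch in text if not ch.isspace()]


def build_word_graph(docs, segmenter=None):
    """Step 104 (sketch): build a weighted digraph from adjacency relations.

    ASSUMPTION: weight[u][v] is the number of times token v directly
    follows token u; count[u] is the total frequency of token u.
    """
    weight = defaultdict(lambda: defaultdict(int))
    count = defaultdict(int)
    for doc in docs:
        tokens = segment(doc, segmenter)
        for tok in tokens:
            count[tok] += 1
        for u, v in zip(tokens, tokens[1:]):
            weight[u][v] += 1
    return weight, count


def candidate_edges(weight, count, threshold):
    """Steps 106-108 (sketch): visit every node and keep strong successors.

    ASSUMPTION: co-occurrence rate of (u, v) = weight[u][v] / count[u].
    """
    candidates = []
    for u, successors in weight.items():
        for v, w in successors.items():
            rate = w / count[u]
            if rate >= threshold:
                candidates.append((u, v, rate))
    return candidates


docs = ["给力啊", "真给力", "给力"]
weight, count = build_word_graph(docs)
print(candidate_edges(weight, count, threshold=0.8))
```

In this toy corpus the pair 给 followed by 力 passes the threshold, but so does 真 followed by 给, because 真 occurs only once. A rate-only criterion is therefore not enough on its own, which is consistent with the method applying a further filtering stage.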

Abstract

The invention relates to a new Chinese word recognition method based on a graph structure. The method comprises the following steps: (1) a document set is abstracted into a weighted digraph according to the adjacency relations between words; (2) all nodes of the weighted digraph are traversed, and candidate (alternative) new words are selected for each node based on the co-occurrence rate; (3) the candidate new words undergo path expansion to find maximum-weight paths along which the co-occurrence rate always remains above a threshold, yielding complete candidate new words; (4) the complete candidate new words are filtered according to information entropy to obtain the final candidate new word set. The invention is the first to propose abstracting a document set into a graph structure for new word discovery and recognition. It converts new word discovery into maximum-weight path discovery in a weighted digraph, making good use of the properties of digraphs, and the resulting method has low time complexity and high recall and accuracy.
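The path expansion and entropy filtering stages can be illustrated with a short Python sketch. Everything here is an assumption made for illustration, since the abstract gives no formulas: the names `build_graph`, `expand_path`, `neighbor_entropy`, and `discover` are hypothetical; the co-occurrence rate is taken to be edge weight divided by the frequency of the preceding token; path expansion is simplified to greedily following the heaviest outgoing edge (a stand-in for a true maximum-weight path search); and the entropy filter uses the Shannon entropy of the tokens adjacent to each occurrence, a common boundary test in new word discovery.

```python
import math
from collections import Counter, defaultdict


def build_graph(token_docs):
    """Weighted digraph: weight[u][v] counts how often v directly follows u."""
    weight = defaultdict(Counter)
    count = Counter()
    for tokens in token_docs:
        count.update(tokens)
        for u, v in zip(tokens, tokens[1:]):
            weight[u][v] += 1
    return weight, count


def expand_path(start, weight, count, threshold, max_len=20):
    """Greedy path expansion (ASSUMED simplification of max-weight path search).

    Follows the heaviest outgoing edge while its assumed co-occurrence rate
    weight[u][v] / count[u] stays at or above the threshold.
    """
    path = [start]
    while len(path) < max_len:
        u = path[-1]
        successors = weight.get(u)
        if not successors:
            break
        v, w = successors.most_common(1)[0]
        # Stop on a weak edge; also stop on a revisit to avoid cycles
        # (a simplification: words with repeated tokens are cut short).
        if w / count[u] < threshold or v in path:
            break
        path.append(v)
    return path


def neighbor_entropy(candidate, token_docs):
    """Shannon entropy (bits) of the tokens left and right of each occurrence.

    ASSUMPTION: document boundaries count as the pseudo-tokens <s> and </s>.
    """
    n = len(candidate)
    left, right = Counter(), Counter()
    for tokens in token_docs:
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == candidate:
                left[tokens[i - 1] if i > 0 else "<s>"] += 1
                right[tokens[i + n] if i + n < len(tokens) else "</s>"] += 1

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counter.values())

    return entropy(left), entropy(right)


def discover(token_docs, threshold=0.8, min_entropy=0.5):
    """Expand a path from every node, then keep paths with varied neighbors."""
    weight, count = build_graph(token_docs)
    found = []
    for start in list(weight):
        path = expand_path(start, weight, count, threshold)
        if len(path) < 2 or path in found:
            continue
        if min(neighbor_entropy(path, token_docs)) >= min_entropy:
            found.append(path)
    return ["".join(p) for p in found]


token_docs = [list("我刷微博了"), list("他玩微博吗"), list("你看微博吧")]
print(discover(token_docs))  # prints ['微博']
```

In this toy corpus, expansion from 我 yields 我刷微博, but that string has only one left and one right neighbor, so its entropy is zero and it is discarded. 微博 appears in three different contexts and survives. The thresholds 0.8 and 0.5 are arbitrary illustrative values.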

Description

Technical field

[0001] The invention relates to natural language processing, specifically to the field of Chinese information processing. It is a graph-based new word recognition method that uses co-occurrence rate and information entropy and can accurately recognize long new words.

Background art

[0002] According to the "Modern Chinese Common Words List" published by the Commercial Press, more than 50,000 words are in common use in today's society. However, with the continuous development of society, and especially the rapid development of the Internet, new words are constantly being created. On the one hand, these words emerge and surge in popularity along with particular events. They are hot words in public discussion and often carry the public's attitude toward current events, which gives them great analytical value. On the other hand, in the field of Chinese information processing, due to the characteristics of Chin...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/28
Inventor: 武嘉怡, 陈薇, 王腾蛟
Owner: CHINA INFORMATION TECH SECURITY EVALUATION CENT