
Technology and system for automatically recognizing Chinese new words in single-word-string mode and affix mode

A technology for the automatic recognition of new words, applied in the fields of electrical digital data processing, instruments, and calculation. It addresses the problems of data sparseness, difficulty, and low extraction accuracy, and improves the accuracy rate.

Status: Inactive
Publication Date: 2013-03-06
EAST CHINA NORMAL UNIVERSITY
3 Cites, 25 Cited by

AI Technical Summary

Problems solved by technology

[0004] Rule-based methods offer high accuracy and are well targeted, but the rules are very difficult to establish and maintain.
Moreover, the rules are generally tied to particular fields, so their portability and adaptability are relatively poor.
Statistics-based methods are flexible, adaptable, and portable, but they require a large-scale corpus for training.
And because relatively few features can be counted, they generally suffer from sparse data and low extraction accuracy.




Detailed Description of Embodiments

[0014] The technical scheme of the present invention is now described in detail with reference to the accompanying drawings:

[0015] Figure 1 is a flowchart of a method for automatically recognizing new Chinese words according to a specific embodiment of the present invention.

[0016] First, step S101 is executed to perform word segmentation on large-scale short texts. The present invention uses short texts as the corpus for new word recognition. In this embodiment, because web news is being analyzed, the news headlines on the web pages are captured, and the captured headlines are word-segmented using ICTCLAS.
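As a rough illustration of step S101, the sketch below applies a pluggable segmenter to a list of headlines. The names `segment_headlines` and `char_segment` are hypothetical, and `char_segment` is only a stand-in that splits text into single characters; it is not ICTCLAS, whose interface is not described in this text. In practice a wrapper around the real segmenter would be passed in its place.

```python
from typing import Callable, List


def char_segment(text: str) -> List[str]:
    """Stand-in segmenter (assumption, NOT ICTCLAS): one token per non-space character."""
    return [ch for ch in text if not ch.isspace()]


def segment_headlines(
    headlines: List[str],
    segment: Callable[[str], List[str]] = char_segment,
) -> List[List[str]]:
    """Apply the given segmenter to every headline and return the token lists."""
    return [segment(headline) for headline in headlines]


# Hypothetical headlines used only for illustration.
headlines = ["新词 识别", "中文分词"]
segmented = segment_headlines(headlines)
```

Keeping the segmenter as a parameter means the rest of the pipeline does not depend on which word segmentation tool is actually used.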

[0017] Next, step S102 is executed to store the news headlines and the word-segmented news headlines in a local database. Those skilled in the art will understand that, specifically, in step S102, that is, once Chinese word segmentation has been performed on the large-scale short texts, the word-segmentation fragments are first stored in the database on physical disk. In the ...
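One possible way to sketch the storage in step S102 is with Python's built-in `sqlite3` module. The table name `headlines`, its columns, and the helper names are assumptions made for illustration; the text does not say which database or schema the embodiment uses. The demo uses an in-memory database, and passing a file path to `sqlite3.connect` would persist the data on disk instead.

```python
import sqlite3
from typing import Iterable, List, Tuple


def store_headlines(
    conn: sqlite3.Connection,
    rows: Iterable[Tuple[str, List[str]]],
) -> None:
    """Store each raw headline with its tokens joined by single spaces."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS headlines ("
        "id INTEGER PRIMARY KEY, title TEXT NOT NULL, segmented TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO headlines (title, segmented) VALUES (?, ?)",
        [(title, " ".join(tokens)) for title, tokens in rows],
    )
    conn.commit()


def load_segmented(conn: sqlite3.Connection) -> List[List[str]]:
    """Read the segmented headlines back as token lists, in insertion order."""
    cursor = conn.execute("SELECT segmented FROM headlines ORDER BY id")
    return [row[0].split(" ") for row in cursor]


# Hypothetical data; ":memory:" keeps the demo self-contained.
conn = sqlite3.connect(":memory:")
store_headlines(conn, [("中文新词识别", ["中文", "新词", "识别"])])
```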


Abstract

The invention provides a technology and a system for automatically recognizing Chinese new words in a single-word-string mode and an affix mode, that is, a method for recognizing Chinese new words of these two modes on the basis of a large-scale short-text corpus. The method combines statistics with rules, and combines the formation patterns of new words with their word-frequency statistics. New words are divided into single-word-string-mode new words and affix-mode new words, a different extraction method and a different filtering method are adopted for each, and the Chinese new words of both modes are extracted by combining these methods with word-frequency information.
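The abstract does not spell out the extraction rules, so the following is only a minimal sketch under stated assumptions, not the patented method. It assumes that a "single-word-string" candidate is a run of two or more consecutive single-character tokens in the segmenter output, and that an "affix" candidate is a multi-character token immediately followed by a token from a given suffix list. The function names, the suffix list, and the `min_freq` threshold are all hypothetical, and the only filter applied here is a word-frequency cutoff.

```python
from collections import Counter
from typing import Iterable, List, Set


def single_char_string_candidates(tokens: List[str]) -> List[str]:
    """Join each maximal run of two or more consecutive single-character tokens."""
    candidates: List[str] = []
    run: List[str] = []
    for tok in tokens + [""]:  # the empty sentinel flushes a run at the end
        if len(tok) == 1:
            run.append(tok)
        else:
            if len(run) >= 2:
                candidates.append("".join(run))
            run = []
    return candidates


def affix_candidates(tokens: List[str], suffixes: Set[str]) -> List[str]:
    """Join a multi-character token with an immediately following suffix token."""
    return [
        prev + cur
        for prev, cur in zip(tokens, tokens[1:])
        if cur in suffixes and len(prev) >= 2
    ]


def extract_new_words(
    corpus: Iterable[List[str]], suffixes: Set[str], min_freq: int = 2
) -> Set[str]:
    """Count candidates from both modes and keep those seen at least min_freq times."""
    counts: Counter = Counter()
    for tokens in corpus:
        counts.update(single_char_string_candidates(tokens))
        counts.update(affix_candidates(tokens, suffixes))
    return {word for word, count in counts.items() if count >= min_freq}


# Hypothetical segmented headlines and suffix list, for illustration only.
corpus = [
    ["比赛", "给", "力"],
    ["表现", "给", "力"],
    ["月光", "族", "增多"],
    ["月光", "族", "生活"],
    ["上班", "族"],
]
new_words = extract_new_words(corpus, suffixes={"族"})
```

A frequency cutoff alone would let through runs that include common function characters, which is presumably why the abstract pairs each mode with its own filtering method.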

Description

Technical field

[0001] The invention relates to the field of natural language processing, in particular to a control method for the automatic recognition and extraction of new Chinese words and a corresponding control system.

Background

[0002] Automatic Chinese word segmentation is the basis of Chinese natural language processing. However, with the rapid development of information, the Chinese language has changed tremendously across a wide range of fields, and more and more new words keep appearing on the Internet. This poses a great challenge to building the dictionaries of Chinese word segmentation tools and inevitably lowers segmentation accuracy. New word recognition has therefore become a bottleneck in the field of Chinese information processing. Automatic new word recognition is of great help in improving the accuracy of Chinese word segmentation. In addition, the new word automatic discovery can ...


Application Information

IPC(8): G06F17/27
Inventor: 吕钊, 蒋鑫, 曹艳娇
Owner: EAST CHINA NORMAL UNIVERSITY