New network word discovery method in combination with internal polymerization degree and external discrete information entropy

A technique based on discrete information entropy and degree of aggregation, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems such as increased time complexity, failure to consider the internal and external structural features of words, and the restriction to shorter new words.

Inactive Publication Date: 2013-02-13
ZHEJIANG UNIV
5 Cites, 48 Cited by

AI Technical Summary

Problems solved by technology

Statistics-based methods for discovering new network words make good use of statistical information, but they do not consider the internal and external structural characteristics of words. They also perform poorly on words that occur with low frequency, and their time complexity rises sharply when discovering longer new words, so statistics-based new word discovery is generally limited to identifying shorter new words.



Embodiment Construction

[0033] It is well known that a candidate word string that is a new network word obeys the following rule: it is used on the Internet with a certain frequency rather than appearing only occasionally. On this basis, the inventors of the present invention further observed two rules: (a) the probability that the candidate word string appears in the network is significantly greater than the probability that its component parts combine at random to form it; (b) the candidate word string carries the same meaning when it appears, as an independent unit, in multiple different contexts. Accordingly, and unlike the prior art, the present invention considers the three factors involved in these rules simultaneously when judging whether a candidate word string is a new network word, and proposes for the first time the other two key factors: the ...
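Rules (a) and (b) can be written down roughly as follows. This is only a sketch under assumptions: the excerpt is truncated before it defines its measures, so the ratio form for rule (a), the neighbour entropy for rule (b), and every symbol here (w, u, v, C(w), L(w), R(w), H) are illustrative choices rather than the patent's own formulas.

```latex
% Rule (a): w is a candidate string and (u, v) is any split of w into two
% adjacent parts. The observed probability of w should far exceed the
% probability of the parts meeting by chance, even for the most likely split.
\[
  C(w) \;=\; \min_{w = uv} \frac{P(w)}{P(u)\,P(v)} \;\gg\; 1
\]
% Rule (b): L(w) and R(w) are the sets of tokens observed immediately to the
% left and right of w. A string used as an independent unit in many different
% contexts has high entropy on both sides.
\[
  H_{\mathrm{left}}(w) = -\sum_{x \in L(w)} P(x \mid w)\,\log P(x \mid w),
  \qquad
  H_{\mathrm{right}}(w) = -\sum_{y \in R(w)} P(y \mid w)\,\log P(y \mid w)
\]
```

Under this reading, a candidate would need both a large C(w) and large entropies on both sides, in addition to the frequency requirement stated first.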


Abstract

The invention discloses a method for discovering new network words that combines internal polymerization degree (degree of aggregation) with external discrete information entropy. The method comprises the following steps: performing word segmentation on all text sentences in a network corpus and taking every distinct segmented word string as a candidate word string; then, for each candidate word string whose frequency of occurrence in the corpus exceeds a fixed threshold, calculating its internal polymerization degree and external discrete information entropy, and judging from these two values whether the candidate word string is a new network word. The method proposes two key factors for this judgment, namely the internal polymerization degree and the external discrete information entropy of the candidate word string, and thereby takes the stability, independence, and completeness of the candidate word string into account together, so that new network words can be discovered effectively.
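The steps in the abstract can be sketched in Python as below. This is a minimal illustration, not the patented procedure: the abstract does not give its formulas, so the sketch assumes that the input is already segmented into token lists, that candidates are runs of adjacent tokens, that the internal measure is a probability ratio against the most likely chance combination, and that the external measure is the smaller of the left and right neighbour entropies. The function names and all threshold values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (in nats) of a Counter of neighbour tokens."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def discover_new_words(sentences, min_freq=3, min_cohesion=2.0,
                       min_entropy=0.5, max_n=2):
    """Return candidate strings that pass all three checks.

    sentences: list of token lists (output of some word segmenter; assumed).
    """
    counts = Counter()            # frequency of every n-gram, n = 1..max_n
    left = defaultdict(Counter)   # tokens seen immediately before an n-gram
    right = defaultdict(Counter)  # tokens seen immediately after an n-gram
    total = 0                     # number of token positions in the corpus
    for tokens in sentences:
        total += len(tokens)
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                counts[gram] += 1
                if i > 0:
                    left[gram][tokens[i - 1]] += 1
                if i + n < len(tokens):
                    right[gram][tokens[i + n]] += 1

    found = []
    for gram, freq in counts.items():
        # Only multi-token candidates above the frequency threshold are scored.
        if len(gram) < 2 or freq < min_freq:
            continue
        p = freq / total
        # Internal measure (assumed form): observed probability divided by the
        # most likely chance combination over all two-way splits.
        chance = max((counts[gram[:k]] / total) * (counts[gram[k:]] / total)
                     for k in range(1, len(gram)))
        cohesion = p / chance
        # External measure (assumed form): the weaker of the two sides.
        external = min(entropy(left[gram]), entropy(right[gram]))
        if cohesion >= min_cohesion and external >= min_entropy:
            found.append("".join(gram))
    return found
```

Taking the minimum of the two entropies is a deliberately strict choice: a string that always follows or always precedes the same token is treated as part of a longer unit rather than as an independent word.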

Description

Technical field

[0001] The invention relates to a method for discovering new network words and belongs to the field of computer natural language processing.

Background technique

[0002] With the rapid development of the Internet and the continuous growth in the number of Internet users, a large number of new words appear on the Internet and quickly enter people's daily life, which has become a language phenomenon. At the same time, in many Chinese information processing fields such as information retrieval, automatic word segmentation, dictionary compilation, and machine translation, the quality of new word discovery greatly affects the results, and its effect on Chinese automatic word segmentation is the most significant. Because of the characteristics of Chinese itself, there are no explicit spaces between words as there are in English. How to accurately segment the emerging new words is already a crucial step i...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 林怀忠, 陈泽锋, 李鹏飞
Owner: ZHEJIANG UNIV