Chinese semantic library new word generation method

A new-word generation technique for a semantic library, applied in natural language data processing, special data processing applications, and related fields. It addresses the time and manpower consumed by adding words to dictionaries manually.

Inactive Publication Date: 2018-08-21
山东爱城市网信息技术有限公司

AI Technical Summary

Problems solved by technology

As society develops and culture evolves, the Chinese vocabulary keeps expanding. Adding new words to dictionaries manually consumes a lot of time and manpower.



Examples


Embodiment 1

[0019] As shown in Figure 1, a method for generating new words for a Chinese semantic library works as follows: a text block is set up as the corpus and scanned; words that appear adjacently are formed into a set; if the set is not in the dictionary, its number of occurrences is counted; if that number exceeds a threshold, the adjacent words are identified as a new word and added to the semantic library.
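The steps in [0019] can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `find_new_words`, the sample corpus, the sample dictionary, and the threshold value are all hypothetical, and the sketch treats each single character as a "word" and only considers pairs of adjacent characters.

```python
from collections import Counter


def find_new_words(text, dictionary, threshold):
    """Return adjacent character pairs that are absent from `dictionary`
    and occur more than `threshold` times in `text`.

    Hypothetical helper: pairs of single characters stand in for the
    "adjacently appearing words" of the description.
    """
    counts = Counter()
    for left, right in zip(text, text[1:]):
        candidate = left + right
        if candidate not in dictionary:  # only count sets missing from the dictionary
            counts[candidate] += 1
    return {word for word, n in counts.items() if n > threshold}


# Illustrative data only; none of it comes from the patent.
corpus = "网红打卡网红直播网红经济"
dictionary = {"直播", "经济"}

new_words = find_new_words(corpus, dictionary, threshold=2)
dictionary |= new_words  # add the recognized new words to the dictionary
```

Here the pair "网红" occurs three times, is not in the sample dictionary, and exceeds the threshold of 2, so it is the only candidate promoted to a new word.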

[0020] The text block is composed of a single-word set, a double-word set, and a set formed by combining the single-word set with the double-word set.

[0021] Adjacent words in the text block are counted from the offset of each word in the text: an offset vector is established recording every position at which each word appears, and the offset vectors are then tallied to count the number of occurrences of adjacent words.
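One way to read the offset-vector idea in [0021] is sketched below. It assumes that an "offset vector" is the list of positions at which a unit appears, and that two units are adjacent when their offsets differ by exactly one; the function names and the sample text are hypothetical.

```python
from collections import defaultdict


def build_offset_vectors(text):
    """Map each character to the list of offsets at which it appears."""
    offsets = defaultdict(list)
    for position, unit in enumerate(text):
        offsets[unit].append(position)
    return offsets


def count_adjacent(offsets, first, second):
    """Count how often `second` appears immediately after `first`,
    using only the two offset vectors."""
    second_positions = set(offsets.get(second, ()))
    return sum(1 for p in offsets.get(first, ()) if p + 1 in second_positions)


# Illustrative text only; it does not come from the patent.
offsets = build_offset_vectors("网红打卡网红直播网红经济")
pair_count = count_adjacent(offsets, "网", "红")
```

With this sample text the offset vectors are [0, 4, 8] for "网" and [1, 5, 9] for "红", so `pair_count` is 3. Working from offsets means the text itself is scanned once, and later adjacency counts only touch the stored vectors.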

[0022] The word collection is obtained from user logs and database f...

Embodiment 2

[0024] In Chinese, a single character can form a word, so a single character is used as the basic word unit; let a given character be Wn. All Chinese characters form the set W = {W1, W2, W3, ..., Wn}, containing n different Chinese characters.

[0025] Let Y be the set of double-character words, with a given word denoted Ym: Y = {Y1, Y2, Y3, ..., Ym}, where Ym = {Wi-Wj, Wj-Wi}, i and j are between 1 and n, and the symbol '-' represents a connection. For example, the characters 'beauty' (美) and 'person' (人) have two combinations, 美人 ('beauty') and 人美 ('renmei'). Y contains all meaningful words composed of two characters.

[0026] Similarly, all double-character words and single-character words can be combined as N = {Wi-Yj, Yj-Wi}, where i and j are between 1 and n and the symbol '-' represents a connection, for example Love-Beauty and Beauty-Love.
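The set constructions in [0025] and [0026] can be illustrated with a short sketch. The function names are hypothetical, and the Chinese characters are assumed stand-ins for the translated examples 'beauty', 'person', and 'love'. The first function enumerates every ordered pair of characters; deciding which pairs are meaningful words, as Y requires, is outside the sketch.

```python
from itertools import permutations, product


def two_char_candidates(chars):
    """Enumerate Wi-Wj and Wj-Wi for every pair of distinct characters."""
    return {a + b for a, b in permutations(chars, 2)}


def combine(singles, doubles):
    """Build N = {Wi-Yj, Yj-Wi} from single characters and double-character words."""
    combos = set()
    for w, y in product(singles, doubles):
        combos.add(w + y)  # Wi-Yj
        combos.add(y + w)  # Yj-Wi
    return combos


# Assumed example characters: 美 (beauty), 人 (person), 爱 (love).
Y_candidates = two_char_candidates(["美", "人"])
N = combine(["爱"], ["美人"])
```

For these inputs `Y_candidates` holds the two orderings 美人 and 人美, and `N` holds 爱美人 and 美人爱, mirroring the Love-Beauty and Beauty-Love example.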

[0027] Collect all the text as one text block, then scan it a first time, recording the offset of each word to establish each word's offset vector.

[0028] Then c...


Abstract

The invention discloses a Chinese semantic library new word generation method. The method comprises the following steps of: setting a text block as a corpus; scanning the text block so as to form adjacent words into a set; if the set is not in a dictionary, carrying out statistical analysis on an occurrence frequency of the set; and if the occurrence frequency exceeds a threshold value, considering the adjacent words as new words and adding the new words into a semantic library. According to the method, the similarities between Chinese texts can be accurately compared, and new words can be rapidly recognized, so that the dictionary is supplemented; along with the continuous development of the society and the continuous evolution of the culture, the Chinese vocabulary is continuously extended; through using the method, recognized new words are added into the dictionary, so that plenty of time and labor are saved.

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to a method for generating new words in a Chinese semantic library, which can accurately compare the similarity of Chinese texts.

Background art

[0002] The dictionary is the basis of dictionary-based word segmentation. New words and ambiguous words are the main difficulty in word segmentation, and statistical knowledge must be introduced to identify new words.

[0003] A corpus stores language material that has actually appeared in real use of the language; it is the basic resource that carries language knowledge on a computer; raw corpus material needs to be processed (analyzed and processed) before it can become a useful resource.

[0004] Chinese is the most widely used language in the world. It has strong expressive ability, and its grammar is relatively casual and simple. Compared with English and other Latin language...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
CPC: G06F16/334, G06F40/216
Inventor: 姜明鲁
Owner: 山东爱城市网信息技术有限公司