Construction method of Chinese word bank, Chinese word bank and application

The construction method and word bank technology, applied to Chinese text structuring, address the failure of existing approaches to make effective use of features shared among words, and improve the accuracy and efficiency of feature extraction while saving time and labor.

Pending Publication Date: 2021-06-18
CENT SOUTH UNIV

AI Technical Summary

Problems solved by technology

Existing word bank construction methods fail to make effective use of a common feature of the words in a corpus: many words contain, or are contained in, other words.



Examples


Embodiment Construction

[0061] The present invention will be further described below in conjunction with the accompanying drawings and specific preferred embodiments, but the protection scope of the present invention is not limited thereby.

[0062] First, the theoretical basis of the present invention is explained. In some corpora, sentences are divided into clauses by separators such as conjunctions and punctuation marks, and words can then be distinguished within those clauses. However, a corpus also contains many runs of consecutive words that are not divided by any such separator. By constructing a multi-way tree, the containment relationships among multiple words (which words contain, or are contained in, which others) can be extracted, and the words at the root nodes of the multi-way tree that are not included by other words in the corpus can then be quickly identified; these are called reduced words. A reduced word can refer to some key features contained in ot...
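The containment idea in this paragraph can be illustrated with a minimal sketch. The paragraph is truncated and its wording leaves the direction of containment ambiguous, so the code below rests on explicit assumptions: a word's parent is taken to be the longest other word that occurs inside it, and a word that contains no other word from the set is treated as a root (a "reduced word" in this sketch). The function name `build_containment_forest` and the sample words are hypothetical and do not come from the patent.

```python
from collections import defaultdict


def build_containment_forest(words):
    """Arrange words into a forest based on substring containment.

    Assumption (not stated in the source): the parent of a word is the
    longest strictly shorter word that occurs inside it. Words with no
    such parent become roots, which this sketch treats as reduced words.
    """
    # Shortest first, so every possible parent is visited before its children.
    unique = sorted(set(words), key=lambda w: (len(w), w))
    children = defaultdict(list)
    roots = []
    for word in unique:
        # Candidate parents: strictly shorter words contained in `word`.
        candidates = [w for w in unique if len(w) < len(word) and w in word]
        if not candidates:
            roots.append(word)
        else:
            # Ties in length are broken alphabetically to stay deterministic.
            parent = max(candidates, key=lambda w: (len(w), w))
            children[parent].append(word)
    return roots, dict(children)


sample = ["网络", "网络故障", "网络故障报修", "宽带", "宽带欠费"]
roots, children = build_containment_forest(sample)
print(roots)
print(children)
```

The quadratic scan over all word pairs keeps the sketch short; a large corpus would likely need an index such as a trie or suffix structure, which the excerpt does not describe.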


Abstract

The invention discloses a construction method of a Chinese word bank, the Chinese word bank, and an application thereof. The method comprises the following steps: S1.1, preprocessing a corpus set, dividing each sentence in the corpus set into segmented words, and generating a simplified word bank from the segmented words based on a multi-way tree method; S1.2, calculating the completeness probability of the words in the simplified word bank and constructing a subdivision-field simplified word bank, wherein the completeness probability of each word in the subdivision-field simplified word bank is smaller than a preset threshold; and S1.3, for the segmented words of each sentence, splitting them with the words in the subdivision-field simplified word bank as boundaries, and generating a subdivision-field pattern-matching word bank from the resulting pieces. The method has the advantages of convenient and efficient word bank construction, reliable feature extraction and the like.
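Steps S1.2 and S1.3 can be sketched roughly as follows. This excerpt does not define how the completeness probability is computed, so the sketch takes it as a caller-supplied function; `completeness`, `split_on_boundaries`, `build_pattern_word_bank`, the sample probabilities and the threshold value are all hypothetical. The sketch also assumes that the boundary words themselves are left out of the pattern-matching word bank, which the abstract does not specify.

```python
import re


def split_on_boundaries(clause, boundary_words):
    """Return the non-empty pieces of `clause` lying between boundary words."""
    if not boundary_words:
        return [clause] if clause else []
    # Longer boundary words come first so they win over shorter ones they contain.
    ordered = sorted(boundary_words, key=lambda w: (-len(w), w))
    pattern = "|".join(re.escape(w) for w in ordered)
    return [piece for piece in re.split(pattern, clause) if piece]


def build_pattern_word_bank(clauses, simplified_words, completeness, threshold):
    """Sketch of S1.2 and S1.3 under the assumptions stated above."""
    # S1.2: keep words whose completeness probability is below the threshold.
    domain_words = {w for w in simplified_words if completeness(w) < threshold}
    # S1.3: split every clause at the retained words and collect the pieces.
    pattern_bank = set()
    for clause in clauses:
        pattern_bank.update(split_on_boundaries(clause, domain_words))
    return domain_words, pattern_bank


# Hypothetical probabilities, used only to exercise the sketch.
probabilities = {"网络": 0.2, "宽带": 0.3, "报修": 0.9}
domain, bank = build_pattern_word_bank(
    ["网络故障报修", "宽带欠费停机"],
    ["网络", "宽带", "报修"],
    probabilities.get,
    0.5,
)
print(domain)
print(bank)
```

Passing the probability in as a function keeps the sketch independent of whatever formula the full description defines.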

Description

Technical field

[0001] The invention relates to the technical field of Chinese text structuring, and in particular to a construction method of a Chinese word bank, a Chinese word bank and an application thereof.

Background

[0002] The rapid development of information technology has driven informatization across all industries nationwide, and national policy support has laid a solid foundation for establishing information systems in those industries. This has produced a large amount of data in professional fields, among which Chinese corpora composed of text data have received extensive attention. A corpus is an important information resource generated in production activities; it may consist of a large volume of comment data from social networks, or of customer service data from the customer service centers of shopping websites. Digging out valuable information from the complex corpus will greatly promote the development of all walks of li...


Application Information

IPC(8): G06F40/211, G06F40/284, G06F40/289
CPC: G06F40/211, G06F40/284, G06F40/289
Inventors: 何世文, 章桐
Owner: CENT SOUTH UNIV