Electric power professional lexicon construction method based on hybrid model and clustering algorithm

The method applies a clustering algorithm to lexicon construction in the field of artificial intelligence. It addresses the inability of existing approaches to keep up with the continual emergence of new words, and it yields a richer professional vocabulary.

Pending Publication Date: 2021-11-05
JINCHENG POWER SUPPLY COMPANY OF STATE GRID SHANXI ELECTRIC POWER
13 Cites, 1 Cited by

AI Technical Summary

Problems solved by technology

The word segmentation methods in common use are based on a manually compiled thesaurus, which can manually collect some...


Examples


Embodiment Construction

[0026] The method of the present invention for constructing an electric power professional thesaurus, based on a hybrid model and a clustering algorithm, comprises the following steps:

[0027] Step 1. Preprocess the electric power text and the parallel corpus: delete spaces, punctuation marks, special characters, and characters or words that carry no substantive meaning from the initial text data, to obtain qualified input text data;

[0028] Step 2. Segment the electric power text and the non-electric-power parallel corpus with the word segmentation model to obtain an electric power text lexicon and a parallel corpus lexicon. Compare the electric power text lexicon with the parallel corpus lexicon to obtain the feature corpus words of the electric power field;

[0029] Step 3. Because the feature corpus words still contain non-electric-power vocabulary, select electric power professional vocabulary from the feature corpus words as seed words; at the same...
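Steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the stop-word set, the function names `clean_text`, `drop_stop_words` and `feature_words`, and the toy token lists are all hypothetical, and the lexicon "comparison" is assumed here to be a plain set difference. Tokens are assumed to come from whatever segmentation model is in use.

```python
import re

# Hypothetical stop-word set; the source does not list the words it removes.
STOP_WORDS = {"的", "了", "和"}

def clean_text(text: str) -> str:
    """Step 1 sketch: delete whitespace, punctuation and special characters."""
    # \W matches anything that is not a letter, digit or underscore
    # (CJK characters count as letters), so only word characters survive.
    return re.sub(r"[\W_]+", "", text)

def drop_stop_words(tokens):
    """Remove tokens assumed to carry no substantive meaning."""
    return [t for t in tokens if t not in STOP_WORDS]

def feature_words(power_tokens, parallel_tokens):
    """Step 2 sketch: words in the power lexicon absent from the parallel lexicon.

    Treating the comparison as a set difference is an assumption.
    """
    return set(power_tokens) - set(parallel_tokens)

# Toy data, invented for illustration only.
power = drop_stop_words(["变压器", "的", "油温", "过高", "运行"])
parallel = drop_stop_words(["天气", "的", "温度", "过高", "运行"])

print(clean_text("变压器 的 油温, 过高!"))
print(sorted(feature_words(power, parallel)))
```

A set difference is the simplest possible comparison. It keeps any word that never occurs in the parallel corpus, which is consistent with Step 3's remark that the feature corpus words still contain non-professional vocabulary and need further filtering.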


Abstract

The invention relates to the field of artificial intelligence, and in particular to a method for constructing an electric power professional lexicon based on a hybrid model and a clustering algorithm. The method comprises: preprocessing the electric power text and the parallel corpus; segmenting them with a word segmentation model, in which words in the Jieba segmentation result are combined using mutual information with left and right entropy and the TextRank algorithm, text keywords are extracted from the Jieba segmentation result using the TF-IDF algorithm and a Word2Vec word clustering algorithm, and the text is also segmented directly by an information entropy segmentation algorithm; summarizing the results and comparing them to obtain feature corpus words; selecting electric power professional words from the feature corpus words as seed words; meanwhile, using the exported electric power text lexicon as candidate words to segment the electric power text, and then converting the words into word vectors with the word2vec algorithm; obtaining similar words through clustering; and obtaining the electric power professional lexicon through rule filtering. With a single clustering model, the invention filters out most words that do not belong to the electric power field, and the resulting professional lexicon is relatively complete.
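The word-combination idea named in the abstract (mutual information plus left and right entropy) can be sketched as below. This is an illustrative sketch only: the function names `pmi_and_entropy` and `merge_candidates`, the thresholds, and the toy sentences are assumptions, and the source does not state how it estimates probabilities or which cutoffs it uses. Adjacent tokens are scored by pointwise mutual information (how strongly they co-occur) and by the entropy of their left and right neighbours (how freely the pair appears in different contexts).

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Shannon entropy (bits) of a neighbour-count distribution; 0.0 if empty."""
    total = sum(counter.values())
    return float(sum(c / total * math.log2(total / c) for c in counter.values()))

def pmi_and_entropy(sentences):
    """Return {(a, b): (pmi, left_entropy, right_entropy)} for adjacent token pairs."""
    unigrams, bigrams = Counter(), Counter()
    left = defaultdict(Counter)   # tokens seen immediately before the pair
    right = defaultdict(Counter)  # tokens seen immediately after the pair
    for toks in sentences:
        unigrams.update(toks)
        for i in range(len(toks) - 1):
            pair = (toks[i], toks[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left[pair][toks[i - 1]] += 1
            if i + 2 < len(toks):
                right[pair][toks[i + 2]] += 1
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    scores = {}
    for (a, b), c in bigrams.items():
        # Assumed estimator: relative frequencies over unigram and bigram totals.
        pmi = math.log2((c / n_bi) / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))
        scores[(a, b)] = (pmi, entropy(left[(a, b)]), entropy(right[(a, b)]))
    return scores

def merge_candidates(scores, min_pmi, min_entropy):
    """Keep pairs that are cohesive (high PMI) and free on both sides (high entropy)."""
    return ["".join(pair) for pair, (pmi, le, re_) in scores.items()
            if pmi >= min_pmi and min(le, re_) >= min_entropy]

# Toy segmented sentences, invented for illustration.
sentences = [
    ["检查", "继电", "保护", "装置"],
    ["更换", "继电", "保护", "设备"],
    ["调试", "继电", "保护", "系统"],
]
scores = pmi_and_entropy(sentences)
print(merge_candidates(scores, min_pmi=2.0, min_entropy=1.0))
```

In this toy corpus only the pair 继电 + 保护 passes both thresholds: it always occurs together yet is preceded and followed by three different tokens, so it is proposed as the combined word 继电保护. On real corpora the thresholds would need tuning.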

Description

Technical field

[0001] The invention relates to the field of artificial intelligence, and in particular to a method for constructing an electric power professional thesaurus based on a hybrid model and a clustering algorithm.

Background art

[0002] In Chinese, a single character has weak expressive power and a diffuse meaning, whereas a word expresses meaning more strongly and describes a thing more accurately. In natural language processing, therefore, words (including single characters) are usually the most basic processing unit. In languages written in the Latin alphabet, such as English, spaces mark the boundaries between words, so words can be extracted simply and accurately. In Chinese, apart from punctuation marks, words run together without obvious boundaries, so extracting them is difficult. Chinese word segmentation methods fall roughly into three types: dictionary-based segmentation, statistical model...


Application Information

IPC(8): G06F40/242, G06F40/284, G06F16/35, G06Q50/06
CPC: G06F40/242, G06F40/284, G06F16/35, G06Q50/06
Inventor: 陈文刚, 宰洪涛, 刘建国, 张轲, 许泳涛, 何洪英, 罗滇生, 尹希浩, 奚瑞瑶, 符芳育, 方杰
Owner JINCHENG POWER SUPPLY COMPANY OF STATE GRID SHANXI ELECTRIC POWER