A patent text modeling method based on word2vec and semantic similarity

A patent text modeling method based on word2vec and semantic similarity, applicable to semantic analysis, natural language data processing, and related fields. It addresses the lack of semantic information and the sparse feature dimensions of vector space models, as well as the inability to mine the internal regularities of patent texts in depth. It achieves a stable and significant clustering effect and a strong text representation ability.

Active Publication Date: 2019-02-22
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

[0003] However, the problems of sparse feature dimensions and lack of semantic information in the traditional vector space model have not been well resolved. Existing patent text analysis methods still do not consider the full text of a patent, and cannot deeply mine the internal regularities of patent texts in the same field.


Embodiment Construction

[0041] The accompanying drawings are for illustrative purposes only and cannot be construed as limiting the patent;

[0042] In order to better illustrate this embodiment, some parts in the drawings will be omitted, enlarged or reduced, and do not represent the size of the actual product;

[0043] For those skilled in the art, it is understandable that some well-known structures and descriptions thereof may be omitted in the drawings.

[0044] The technical solutions of the present invention will be further described below in conjunction with the accompanying drawings and embodiments.

[0045] Figure 1 is a flow chart of the patent text modeling method based on word2vec and semantic similarity of the present invention. This embodiment models Chinese patent texts in the communication field according to the flow chart.

[0046] Step 1: Crawl the patent text collection in the communication field, and preprocess the patent text collection. In this example,...

Abstract

The invention relates to the field of text modeling and provides a patent text modeling method based on word2vec and semantic similarity. The method includes: crawling a patent text set and preprocessing it; calculating the TF-IDF value of each word in the patent text set, then sorting and selecting to obtain a feature word set; importing the text set into a word2vec model and training it to obtain word vectors; calculating cosine similarity to obtain the similar word set wordC_1; calculating word2vec similarity to obtain the similar word set textC_1; importing the text set into the text processing system for training to obtain semantic similarity, and selecting the similar word set wordC_2; calculating semantic similarity to obtain the similar word set textC_2; calculating the mixed similarity to obtain the extended word set textC_f; and calculating weights to form new text representations, which completes the modeling. By adding part of the information between words to the traditional vector space model, from the statistical angle of word2vec and the semantic angle of semantic similarity, the invention reduces the sparsity of the text matrix to a certain extent, makes the clustering effect more pronounced and stable, and gives a stronger text representation ability.
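The steps named in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation. The abstract gives no formulas, so the TF-IDF variant, the linear blend used for the mixed similarity, the `alpha` and `threshold` parameters, and all function names are assumptions made for illustration. The word vectors are hand-written toy values standing in for trained word2vec vectors, and the `semantic` table is a hypothetical stand-in for scores from a semantic similarity resource.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {word: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = Counter(word for doc in docs for word in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            w: (c / len(doc)) * math.log(n_docs / df[w])
            for w, c in tf.items()
        })
    return scores


def top_feature_words(doc_scores, k):
    """Sort words by TF-IDF (descending, ties by word) and keep the top k."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked[:k]]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mixed_similarity(w2v_sim, sem_sim, alpha=0.5):
    """Assumed linear blend of word2vec and semantic similarity."""
    return alpha * w2v_sim + (1 - alpha) * sem_sim


def expand(word, vectors, semantic, alpha=0.5, threshold=0.8):
    """Return words whose mixed similarity to `word` reaches the threshold."""
    out = []
    for other, vec in vectors.items():
        if other == word:
            continue
        w2v = cosine(vectors[word], vec)
        sem = semantic.get(frozenset((word, other)), 0.0)
        if mixed_similarity(w2v, sem, alpha) >= threshold:
            out.append(other)
    return sorted(out)


# Toy, already-tokenized "patent texts" (hypothetical data).
docs = [
    ["antenna", "signal", "antenna", "method"],
    ["router", "network", "method"],
    ["signal", "network", "method"],
]
# Toy word vectors and semantic scores (hypothetical values).
vectors = {"antenna": (1.0, 0.0), "aerial": (0.9, 0.1), "router": (0.0, 1.0)}
semantic = {frozenset(("antenna", "aerial")): 0.9}

features = top_feature_words(tf_idf(docs)[0], 2)
expanded = expand("antenna", vectors, semantic)
```

In this sketch `features` holds the two highest-scoring feature words of the first toy document, and `expanded` holds the words added to the feature word "antenna". In practice the vectors would come from a word2vec model trained on the preprocessed patent text set, and the blend weight would need tuning.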

Description

Technical field

[0001] The present invention relates to the field of text modeling, and more specifically to a patent text modeling method based on word2vec and semantic similarity.

Background

[0002] In text modeling of patent texts, scholars have tried many methods to improve on traditional text modeling, such as representing patent texts as text vectors carrying patent semantic weight information and word frequency weight information, proposing a patent term extraction scheme based on conditional random fields (CRF), and using a Latent Semantic Indexing (LSI) model to realize a multilingual vector space. Besides improving the vector space model of traditional text modeling, many scholars have also constructed text modeling methods other than the vector space model to improve the representation of patent texts.

[0003] However, the problems of sparse feature dimensions and lack of semantic informa...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F16/332, G06Q50/18
CPC: G06Q50/184, G06F40/242, G06F40/247, G06F40/289, G06F40/30
Inventors: 路永和, 刘小桦
Owner: SUN YAT SEN UNIV