Semantic similarity analysis method based on text clustering

A semantic similarity and text clustering technology, applied in semantic analysis, text database clustering/classification, and unstructured text data retrieval. It addresses the lack of semantic information, the high dimensionality of feature word vectors, and the difficulty of obtaining similarity quickly and accurately. Its effects are improved text clustering and semantic similarity analysis methods, simpler classification networks, and better accuracy and analysis speed.

Inactive Publication Date: 2020-02-21
CHINA UNIV OF PETROLEUM (EAST CHINA)

AI Technical Summary

Problems solved by technology

[0003] Efficient similarity analysis and detection for text in specific fields has been basically solved, but it remains difficult to obtain similarity quickly and accurately in a multi-field, large text database. Problems include high dimensionality of word vectors, sparse data, neglect of low-frequency words, and lack of semantic information, as well as professional vocabulary in the text and synony...



Embodiment Construction

[0040] The present invention will be further described in detail below in conjunction with the drawings and specific implementations:

[0041] A semantic similarity analysis method based on text clustering. Figure 1 shows a research flow chart of the method of the present invention. The method includes:

[0042] S1, data preprocessing. Preprocess the input raw text sequence: text denoising, stop-word removal, encoding conversion, and Chinese word segmentation. The original data comes from the scientific research results declared by the marine oil production plant over the years, which are divided into 4 categories. The scientific research results from actual work serve as the sample database, and newly declared scientific research result files are processed in real time.
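The preprocessing steps named in S1 can be sketched as a small pipeline. This is a minimal illustration, not the patent's implementation: the stop-word list, the default `gbk` encoding, and all function names are assumptions, and `segment` is a placeholder that splits on whitespace where a real system would call a Chinese word segmenter.

```python
import re
import unicodedata

# Hypothetical stop-word list; a real system would load a full Chinese list.
STOP_WORDS = {"的", "了", "和", "是"}

def to_text(raw, encoding="gbk"):
    """Encoding conversion: decode bytes if needed, then normalize to NFKC."""
    text = raw.decode(encoding, errors="ignore") if isinstance(raw, bytes) else raw
    return unicodedata.normalize("NFKC", text)

def denoise(text):
    """Denoising: drop markup-like tags and punctuation, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def segment(text):
    """Placeholder segmentation: assumes the input is already space-delimited."""
    return text.split()

def preprocess(raw, encoding="gbk"):
    """Run conversion, denoising, segmentation, then stop-word removal."""
    tokens = segment(denoise(to_text(raw, encoding)))
    return [t for t in tokens if t not in STOP_WORDS]
```

Calling `preprocess` on either a `str` or encoded `bytes` returns a list of tokens ready for word vector training.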

[0043] S2, word vector training. Use Skip-gram and Softmax models to tra...
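The description of S2 is cut off in the source, so the patent's actual training setup is not known here. As a generic illustration of the two named ingredients only, the sketch below generates Skip-gram (center, context) training pairs and computes a numerically stable softmax. The window size and the function names are assumptions.

```python
import math

def skipgram_pairs(tokens, window=2):
    """Pair each center word with every word within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

def softmax(scores):
    """Turn raw scores into probabilities; subtract the max for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

In a Skip-gram model, the softmax is taken over the vocabulary to score how likely each word is to appear as context for a given center word.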


Abstract

The invention discloses a semantic similarity analysis method based on text clustering. The method comprises the following steps: taking unprocessed text data as input; performing word frequency statistics on the preprocessed text, adding the word frequency statistics to text clustering as prior knowledge, proposing a posterior judgment criterion, and performing unsupervised clustering with the word frequency statistics as a classifier, to improve the accuracy and timeliness of the text clustering result; performing synonym disambiguation on the processed text and carrying out semantic role labeling; and generating a semantic vector fused with context features, processing the text sequences with two LSTMs of identical structure and parameters, adding the product and variance of the results to amplify the similarities and differences between the texts, and calculating the final similarity analysis result. The method can be applied to real text similarity analysis scenarios in many different fields and handles different types of text data well.
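The final step of the abstract can be sketched as follows. Two LSTMs with identical structure and parameters amount to one shared encoder applied to both texts, so the sketch takes a single `encoder` callable. Reading "product and variance" as the element-wise product and the element-wise squared difference of the two outputs is an assumption, as are the function names. The encoder is a stand-in, not an LSTM.

```python
def combine(h1, h2):
    """Concatenate element-wise product and squared difference of two vectors."""
    if len(h1) != len(h2):
        raise ValueError("encoded vectors must have the same length")
    product = [a * b for a, b in zip(h1, h2)]         # emphasizes shared features
    sq_diff = [(a - b) ** 2 for a, b in zip(h1, h2)]  # emphasizes differences
    return product + sq_diff

def siamese_features(encoder, x1, x2):
    """Apply the same encoder (shared parameters) to both inputs, then combine."""
    return combine(encoder(x1), encoder(x2))
```

The combined vector would then feed whatever layer produces the similarity score, which the abstract does not specify.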

Description

Technical field

[0001] The invention belongs to the field of natural language processing, and relates to a semantic similarity analysis method based on text clustering.

Background technique

[0002] Text clustering and semantic similarity detection have always been important research topics in the field of natural language processing. Automatically and accurately determining text category, extracting semantics, and comparing similarity are crucial to the handling and application of text data. In recent years, as research fields have matured, the number of reports and scientific research results has increased sharply, making it important to summarize them and provide an overview. The similarity analysis method based on text clustering has received more and more attention. This method can be divided into two stages: text clustering and text similarity analysis. The current methods basically pay more attention to word frequency information, while igno...


Application Information

IPC(8): G06F16/35, G06F40/289, G06F40/30, G06F40/247, G06K9/62, G06N3/04
CPC: G06F16/355, G06N3/049, G06N3/045, G06F18/22
Inventor: 唐昱润, 宫法明, 马玉辉, 司朋举, 李昕
Owner CHINA UNIV OF PETROLEUM (EAST CHINA)