Text similarity computing method and system based on improved LDA topic model

A text similarity and topic model technology, applied in computing, instrumentation, electrical digital data processing, etc. It addresses problems such as high dimensionality, wasted space, and failure to fully exploit the differences in word usage across different types of text, achieving lower dimensionality and less wasted space.

Pending Publication Date: 2018-11-16
CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY
Cites: 6; Cited by: 20

AI Technical Summary

Problems solved by technology

[0004] The technical problem to be solved by the present invention is therefore to overcome the shortcomings of existing text similarity calculation methods: high dimensionality, serious waste of space, excessive focus on the word level, and failure to fully mine and exploit the inherent differences in word usage between different types of text.

Examples

Embodiment Construction

[0058] The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art on the basis of these embodiments, without creative effort, fall within the protection scope of the present invention.

[0059] In the description of the present invention, it should be noted that terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" indicate an orientation or positional relationship based on that shown in the drawings. They are used only for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the referred device or element must have a specific orientation, ...

Abstract

The invention relates to a text similarity computing method and system based on an improved LDA topic model. The method comprises the following steps: acquiring a plurality of text sets in a WMF_LDA topic model; computing the similarity between words in a preprocessed word set using a word2vec word vector model, generating a plurality of word similarity values; generating a field topic word set according to the similarity between the words; obtaining, by means of the LDA topic model, the probability distribution over topics of each document after word semantic merging; and determining the topic distribution similarity between any two texts, thereby obtaining the text similarity. The method first screens the words, which reduces the number of words in the topic word set. It then maps synonyms and words from the same field to a unified representation before modeling the probability distributions of the texts. As a result, the computation dimensionality when comparing two texts is low, wasted space is reduced, and the problems of excessive focus on the word level and failure to fully mine and exploit the differences in word usage between different types of text are solved.
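Two of the steps in the abstract, the unified mapping of similar words and the comparison of topic distributions, can be illustrated with a minimal Python sketch. Everything named here is an assumption for illustration: the toy vectors stand in for word2vec output, the 0.8 threshold and the greedy grouping rule are arbitrary, and Jensen-Shannon divergence is one common way to compare topic distributions, not necessarily the measure used in the invention.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def merge_similar_words(vectors, threshold=0.8):
    """Map each word to a representative word with a similar vector.

    `vectors` is a dict word -> vector (a toy stand-in for word2vec output).
    `threshold` is an assumed cut-off, not a value taken from the source.
    """
    mapping = {}
    representatives = []
    for word in sorted(vectors):  # sorted so the result is deterministic
        for rep in representatives:
            if cosine(vectors[word], vectors[rep]) >= threshold:
                mapping[word] = rep  # reuse an existing representative
                break
        else:
            representatives.append(word)  # word starts its own group
            mapping[word] = word
    return mapping

def js_similarity(p, q):
    """1 - Jensen-Shannon divergence (base-2 logs); result lies in [0, 1]."""
    def kl(a, b):
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)
    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 1.0 - 0.5 * (kl(p, m) + kl(q, m))

# Hypothetical 2-dimensional word vectors.
toy_vectors = {"car": [1.0, 0.1], "automobile": [0.9, 0.2], "bank": [0.0, 1.0]}
word_map = merge_similar_words(toy_vectors)

# Hypothetical per-document topic distributions, e.g. as produced by LDA.
doc_a = [0.7, 0.2, 0.1]
doc_b = [0.6, 0.3, 0.1]
similarity = js_similarity(doc_a, doc_b)
```

In this sketch "car" and "automobile" collapse to one representative, so the vocabulary passed to the topic model shrinks, and the two documents are then compared over a handful of topics instead of over every word in the vocabulary.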

Description

Technical field

[0001] The invention relates to the technical field of language processing, in particular to a text similarity calculation method based on an improved LDA topic model.

Background technique

[0002] Text similarity research is an important research topic in natural language processing. The traditional VSM method uses TF-IDF as a feature to construct a vector, and uses cosine distance to calculate the similarity of text, but this method simply uses word frequency as a feature, without considering the semantic features of words and text.

[0003] Text similarity is an important topic that has been widely studied in the fields of linguistics, psychology and information theory. At present, there are various methods for calculating text similarity, and many achievements have been made. The traditional calculation method of text similarity is characterized by word frequency, and the text is expressed as a vector. The dimension of the vector is the number of all wo...
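The traditional VSM baseline described in the background can be sketched in a few lines of Python. This is an illustrative sketch only: the smoothed IDF formula is one of several common variants and is an assumption, as are the toy documents, which are assumed to be already tokenised and non-empty.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, TF-IDF vectors) for a list of tokenised documents."""
    vocab = sorted({w for doc in docs for w in doc})
    n = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed IDF (an assumed variant; the source does not specify one).
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append([tf[w] / len(doc) * idf[w] for w in vocab])
    return vocab, vectors

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical toy corpus.
docs = [
    ["car", "engine", "repair"],
    ["automobile", "motor", "repair"],
    ["car", "engine", "repair"],
]
vocab, vecs = tfidf_vectors(docs)
same = cosine_similarity(vecs[0], vecs[2])       # identical wording
synonyms = cosine_similarity(vecs[0], vecs[1])   # same meaning, different words
```

Each vector has one entry per vocabulary word, so its length grows with the corpus. The first two documents score low against each other even though they describe the same thing, because the baseline has no notion that "car" and "automobile" are related.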

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 芦天亮, 杜彦辉, 曹金璇, 蔡满春, 张建岭, 张璐
Owner: CHINESE PEOPLE'S PUBLIC SECURITY UNIVERSITY