
Document similarity recognition method and device based on latent semantic analysis

A document-similarity and semantic-analysis technology, applied in semantic analysis, text database querying, natural language data processing, etc., that addresses the loss of natural-language attributes in purely statistical comparison and achieves a good recognition effect.

Pending Publication Date: 2020-05-19
山东旗帜信息有限公司
4 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

Many theoretical methods exist for comparing document similarity, but most are based on statistics and are essentially unrelated to semantics. This amounts to natural language processing without natural-language attributes. Such methods have some effect, but the approach is like climbing a tree to look for fish: it searches in the wrong place.



Examples


Embodiment Construction

[0026] To clearly illustrate the technical features of the present solution, the present application is described in detail below through specific embodiments and with reference to the accompanying drawings.

[0027] In the first embodiment, as shown in Figure 1, the method includes the following steps:

[0028] S101. Build an original document library, where the original document library includes several original texts;

[0029] S102. Preprocess each original text to obtain the original-text bag-of-words vector corresponding to that original text;

[0030] The preprocessing proceeds as follows: first, obtain the bag-of-words model;

[0031] Build a word-text matrix, and assign a value to each word in the matrix according to the TF-IDF method;

[0032] Determine a threshold, and use singular value decomposition (SVD) of the matrix for dimensionality reduction;

[0033] Obtain the final word-text matrix, and from it the bag-of-words vector;
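The preprocessing steps above (bag-of-words model, TF-IDF weighting, thresholded SVD reduction) can be sketched roughly as follows. This is only an illustrative sketch, not the applicant's implementation: the function names `tfidf_matrix` and `reduce_with_svd`, the whitespace tokenizer, the smoothed IDF formula, and the reading of "threshold" as a fraction of retained singular-value energy are all assumptions, and NumPy is assumed to be available.

```python
import math
from collections import Counter

import numpy as np


def tfidf_matrix(docs):
    """Build a word-text matrix (rows = words, columns = texts) with TF-IDF weights."""
    # Assumption: simple lowercase whitespace tokenization stands in for real segmentation.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    index = {w: i for i, w in enumerate(vocab)}
    n_docs = len(docs)
    # Document frequency: number of texts that contain each word.
    df = Counter(w for toks in tokenized for w in set(toks))
    matrix = np.zeros((len(vocab), n_docs))
    for j, toks in enumerate(tokenized):
        for w, count in Counter(toks).items():
            tf = count / len(toks)
            # Assumption: "+ 1.0" keeps words that occur in every text from dropping to zero.
            idf = math.log(n_docs / df[w]) + 1.0
            matrix[index[w], j] = tf * idf
    return vocab, matrix


def reduce_with_svd(matrix, energy=0.9):
    """Keep the smallest rank k whose squared singular values reach `energy` of the total."""
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = min(int(np.searchsorted(cumulative, energy)) + 1, len(s))
    # Each column of doc_vectors is one text expressed in the k-dimensional latent space.
    doc_vectors = np.diag(s[:k]) @ vt[:k, :]
    return u[:, :k], s[:k], doc_vectors
```

With a small library such as `["cat sat mat", "cat sat hat", "stock market price"]`, `tfidf_matrix` returns a 7-row, 3-column word-text matrix, and `reduce_with_svd` returns one reduced vector per text. A lower `energy` value gives a coarser latent space.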

[0034] S103. O...


Abstract

A document similarity recognition method and device based on latent semantic analysis. The method comprises the following steps: an original document library comprising a plurality of original texts is constructed; the original texts are preprocessed to obtain original-text bag-of-words vectors in one-to-one correspondence with the original texts; an input text is obtained and preprocessed to obtain an input-text bag-of-words vector; and the degree of similarity between the input-text bag-of-words vector and each original-text bag-of-words vector is calculated to find the original text most similar to the input text. The document library serves as the base text and the input text serves as the text to be compared; documents similar to the input text are found in the base text by means of the bag-of-words vectors. Because semantics are taken into account in the bag-of-words vectors, a better document similarity recognition effect is obtained on the basis of latent semantics.
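The final comparison step in the abstract can be sketched as below. The abstract does not say which similarity measure is used, so cosine similarity is an assumption here, and the names `cosine_similarity` and `most_similar` are hypothetical helpers introduced only for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(input_vector, original_vectors):
    """Return (index, score) of the original text whose vector is closest to the input."""
    scores = [cosine_similarity(input_vector, v) for v in original_vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

For example, given original vectors `[[1, 0, 0], [0.6, 0.8, 0], [0, 0, 1]]` and an input vector `[0.5, 0.9, 0]`, `most_similar` selects the second original text (index 1), since its vector points in nearly the same direction as the input.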

Description

Technical field

[0001] The present application relates to a method and device for identifying document similarity based on latent semantic analysis.

Background

[0002] As computer processing power grows, digitizing natural language has become an important goal, because only digitized natural language lends itself to fast computer processing.

[0003] With the massive accumulation of network information, the number of existing documents is already very large. From a certain point of view, it is often sufficient, at least at the application level, to classify a new document against existing documents, for example product information and reviews in online shopping, and this involves comparing document similarity. Many theoretical methods exist for document similarity comparison, but most are based on statistics and are essentially unrelated to semantics. This amounts to natural language processing without natural ...


Application Information

IPC(8): G06F40/194, G06F40/30, G06F16/33
CPC: G06F16/3347, G06F16/3344
Inventor 于文才, 杜志诚, 杜明本, 钟琴隆, 王秀芹, 朱习文, 董林林, 叶玏
Owner 山东旗帜信息有限公司