
Text similarity matching method based on subject terms

A text similarity matching method for fast retrieval of similar articles. It addresses the unsatisfactory accuracy and insufficient retrieval efficiency of existing methods, improving the efficiency and accuracy of duplicate checking and reducing wasted manpower.

Pending Publication Date: 2020-05-05
TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING) +1

AI Technical Summary

Problems solved by technology

In the field of similarity retrieval, existing methods are either insufficient in retrieval efficiency or unsatisfactory in accuracy.

Method used

Documents are first screened with an inverted index; word vectors are weighted by the combined tf-idf and TextRank values of the keywords to form document vectors, which are compared by cosine similarity; sentence vectors of the two similar documents are then compared pairwise against a threshold.


Examples


Embodiment Construction

[0022] In order to make the object, technical solution and advantages of the present invention clearer, the present invention will be further described in detail below in conjunction with the embodiments and accompanying drawings.

[0023] As shown in Figure 1, the flow of the text similarity matching method based on keywords includes the following steps:

[0024] Step 10 fragments the text, imports texts in various formats into the database, and cleans the data to produce texts in a unified format;

[0025] Step 20 performs word segmentation and stop-word removal on the text, and stores the document id and word segmentation results in the database;

[0026] Step 30 uses the inverted index algorithm to perform statistical calculations on all word-segmented texts in the database to form a word-document list matrix, and stores the results in the database;

[0027] Step 40 extracts the keywords of each text through the tf-idf algorithm and calculates the t...
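Steps 20 through 40 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the stop-word list, the toy corpus, whitespace tokenization (standing in for a real word segmenter), and the function names are all assumptions made for illustration, and the storage in a database is replaced by in-memory dictionaries.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stop-word list and toy corpus, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "is", "in", "on"}
RAW_DOCS = {
    1: "the cat sat on the mat",
    2: "the dog chased the cat",
    3: "stock prices fell in asia",
}

def tokenize(text):
    """Step 20 (sketch): lowercase, split on whitespace, drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]

def build_inverted_index(docs):
    """Step 30 (sketch): map each word to {doc_id: term count}."""
    index = defaultdict(dict)
    for doc_id, tokens in docs.items():
        for word, count in Counter(tokens).items():
            index[word][doc_id] = count
    return index

def candidate_docs(index, query_tokens):
    """Screen documents: keep those sharing at least one word with the query."""
    hits = set()
    for word in query_tokens:
        hits.update(index.get(word, {}))
    return hits

def tf_idf_keywords(docs, index, doc_id, top_k=3):
    """Step 40 (sketch): rank one document's words by tf-idf."""
    tokens = docs[doc_id]
    n_docs = len(docs)
    scores = {}
    for word, count in Counter(tokens).items():
        tf = count / len(tokens)
        idf = math.log(n_docs / len(index[word]))
        scores[word] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

docs = {doc_id: tokenize(text) for doc_id, text in RAW_DOCS.items()}
index = build_inverted_index(docs)
candidates = candidate_docs(index, tokenize("a cat on a roof"))
keywords = tf_idf_keywords(docs, index, 1, top_k=2)
```

Screening through the index means only documents that share a term with the query reach the more expensive vector comparison, which is the efficiency argument the method makes.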


Abstract

The invention discloses a text similarity matching method based on subject terms. The method comprises: screening documents with an inverted index, so that only the relevant documents are filtered out of a large batch and passed to similarity comparison, which greatly improves retrieval efficiency; weighting the word vectors by the combined tf-idf and TextRank weight values of the keywords, computing document vectors from the weighted word vectors, and comparing them by cosine similarity; and finally, computing sentence vectors for the two similar documents, computing the similarity of every pair of sentences across the two documents, and, when a pair's similarity exceeds a set threshold, judging the sentences similar and marking them in red. The method is used for similarity checking in corpus systems in various fields; it improves the efficiency and accuracy of duplicate checking and reduces wasted human effort.
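The vector comparison described in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: the three-dimensional word vectors are made up, the `weights` dictionary stands in for the combined tf-idf and TextRank keyword weights (the source does not say how the two are combined), and the 0.8 threshold is an arbitrary example value.

```python
import math

# Hypothetical 3-dimensional word vectors; a real system would load trained embeddings.
WORD_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.9, 0.1, 0.0],
    "stock": [0.0, 0.0, 1.0],
    "market": [0.0, 0.1, 0.9],
}

def weighted_vector(tokens, weights):
    """Weighted average of word vectors; words without a keyword weight count as 1.0."""
    dim = len(next(iter(WORD_VECTORS.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        vec = WORD_VECTORS.get(tok)
        if vec is None:
            continue  # skip out-of-vocabulary words
        w = weights.get(tok, 1.0)
        total = [t + w * v for t, v in zip(total, vec)]
        weight_sum += w
    return [t / weight_sum for t in total] if weight_sum else total

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similar_sentence_pairs(sents_a, sents_b, weights, threshold=0.8):
    """Return (i, j, score) for each sentence pair scoring above the threshold."""
    pairs = []
    for i, sa in enumerate(sents_a):
        va = weighted_vector(sa, weights)
        for j, sb in enumerate(sents_b):
            score = cosine(va, weighted_vector(sb, weights))
            if score > threshold:
                pairs.append((i, j, score))
    return pairs

# Each document is a list of tokenized sentences; weights are assumed keyword weights.
doc_a = [["cat"], ["stock"]]
doc_b = [["dog"], ["market"]]
pairs = similar_sentence_pairs(doc_a, doc_b, weights={"cat": 2.0, "stock": 1.5})
```

The same `weighted_vector` and `cosine` functions serve the document-level comparison if a whole document's tokens are passed in place of one sentence; the returned pairs are the ones a checking system would highlight.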

Description

Technical field

[0001] The invention relates to the technical field of text data mining and computer information processing, in particular to a keyword-based text similarity matching method for quickly retrieving similar articles from a large-scale corpus database.

Background technique

[0002] With the popularity of natural language processing applications such as computer text information mining, demand for document retrieval systems based on text similarity is increasing, and higher requirements are being placed on computer text processing. Natural language processing often involves measuring the similarity between two texts. Text occupies a high-dimensional semantic space; how to abstract and decompose it so that its similarity can be quantified from a mathematical perspective is the focus of this method. In the field of similarity retrieval, the existing similarity retrieval methods...


Application Information

IPC(8): G06F40/216, G06F40/289, G06F40/30, G06F16/31, G06K9/62
CPC: G06F16/319, G06F18/22, Y02D10/00
Inventor: 杨雷, 段飞虎, 吕强, 印东敏, 冯自强, 张宏伟
Owner TONGFANG KNOWLEDGE NETWORK TECH CO LTD (BEIJING)