Big data text deduplication technology based on improved Simhash algorithm

A big data text technology applied to text deduplication. It addresses data duplication and redundancy, computes similarity more accurately, reduces the number of comparisons, and improves the efficiency of the algorithm.

Pending Publication Date: 2021-11-05

HARBIN UNIV OF SCI & TECH

Problems solved by technology

[0003] To address the large amount of redundant and duplicated data generated in the current big data era, the present invention discloses a big data text deduplication technology based on an improved Simhash algorithm.

Embodiment Construction

[0024] In order to clearly and completely describe the technical solutions in the embodiments of the present invention, the present invention will be further described in detail below in conjunction with the drawings in the embodiments.

[0025] The flow of the big data text deduplication technology based on the improved Simhash algorithm in this embodiment is shown in Figure 1 and includes the following steps.

[0026] Step 1: the duplicate-text dataset is obtained as follows:

[0027] The data come from the Sogou news corpus (https://www.sogou.com/labs/resource/ca.php): 5,000 Chinese news texts divided into ten categories ('Auto', 'Finance', 'Technology', 'Health', 'Sports', 'Education', 'Culture', 'Military', 'Entertainment', 'Fashion'), each with 500 similar texts, mixed with 2,000 irrelevant texts.

[0028] Step 2: word segmentation and feature weight calculation for the text set proceed as follows:

[0029] Step 2-1 uses the NLPIR-ICTCLAS w...
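
The abstract states that part of speech and word length are considered when keyword weights are calculated. The following is only a minimal sketch of that idea, not the patent's formula: the function name `keyword_weights`, the multipliers in `POS_FACTOR`, and the length cap are all hypothetical, and the input is assumed to be a list of (word, part-of-speech tag) pairs already produced by a segmentation tool.

```python
from collections import Counter

# Hypothetical part-of-speech multipliers; the source does not give actual values.
POS_FACTOR = {"n": 1.5, "v": 1.2, "a": 1.0}
DEFAULT_POS_FACTOR = 0.5  # hypothetical value for all other tags


def keyword_weights(tagged_tokens):
    """Return {word: weight} from a list of (word, pos_tag) pairs.

    weight = term frequency * part-of-speech factor * length factor.
    This combination is an assumption made for illustration only.
    """
    counts = Counter(word for word, _ in tagged_tokens)
    pos_of = {word: pos for word, pos in tagged_tokens}  # last tag seen wins
    weights = {}
    for word, tf in counts.items():
        pos_factor = POS_FACTOR.get(pos_of[word], DEFAULT_POS_FACTOR)
        # Hypothetical length factor: longer words count more, capped at 4 characters.
        length_factor = min(len(word), 4) / 2.0
        weights[word] = tf * pos_factor * length_factor
    return weights


if __name__ == "__main__":
    tokens = [("经济", "n"), ("增长", "v"), ("经济", "n"), ("的", "u")]
    print(keyword_weights(tokens))
```

Under these assumed factors a repeated two-character noun outweighs a single-character function word, which is the qualitative behaviour the abstract describes.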

Abstract

The invention discloses a big data text deduplication technology based on an improved Simhash algorithm and relates to the field of natural language processing. The technology comprises the following steps: (1) segmenting the text with a word segmentation tool; (2) assigning weights to the resulting keywords; (3) computing a document content signature and an article abstract signature from the keyword weights; and (4) computing similarity and identifying similar documents. The invention builds on the classical Simhash algorithm. First, a better word segmentation tool is selected so that segmentation is more accurate. Second, part of speech and word length are taken into account in the weight calculation stage. Third, in the signature matching stage, a secondary hash based on the idea of bucket sorting is applied. Finally, a new formula for comparing the Hamming distance of Simhash signature values is proposed, based on the feature vectors of the article content and the abstract content. The technology is well suited to deduplicating big data texts: accuracy and recall are improved, and deduplication is faster.
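
For orientation, the classical Simhash pipeline that the invention builds on can be sketched as follows. This is a sketch under stated assumptions, not the patented method: it shows the standard weighted signature, the Hamming distance, and a common band-based bucketing trick for cutting down pairwise comparisons. The choice of MD5 for token hashing, the 64-bit width, the four bands, the distance threshold of 3, and all function names are assumptions. The patent's own secondary hash and its combined content/abstract distance formula are not given in this excerpt and are not reproduced here.

```python
import hashlib
from collections import defaultdict

BITS = 64  # assumed signature width


def token_hash(token):
    """Stable 64-bit hash of a token (MD5 is an arbitrary choice here)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def simhash(weights):
    """Classical Simhash: weights is {token: weight}; returns a 64-bit int."""
    totals = [0.0] * BITS
    for token, weight in weights.items():
        h = token_hash(token)
        for i in range(BITS):
            # Add the weight where the token's bit is 1, subtract it where it is 0.
            totals[i] += weight if (h >> i) & 1 else -weight
    signature = 0
    for i, total in enumerate(totals):
        if total > 0:
            signature |= 1 << i
    return signature


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def band_keys(signature, bands=4):
    """Split the signature into equal bands; each (index, value) is a bucket key."""
    width = BITS // bands
    mask = (1 << width) - 1
    return [(i, (signature >> (i * width)) & mask) for i in range(bands)]


def find_near_duplicates(signatures, max_distance=3, bands=4):
    """signatures: {doc_id: signature}. Returns a sorted list of (id_a, id_b) pairs.

    Only documents that share at least one band are compared. With 4 bands,
    any pair within distance 3 must agree on some band (pigeonhole principle).
    """
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for key in band_keys(sig, bands):
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if hamming(signatures[a], signatures[b]) <= max_distance:
                    pairs.add((a, b))
    return sorted(pairs)
```

Bucketing by band means each document is compared only with documents that collide with it in at least one band, rather than with the whole collection, which is one way to reduce the number of comparisons.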

Description

Technical field

[0001] The invention discloses a big data text deduplication technology based on an improved Simhash algorithm and relates to the field of natural language processing.

Background technique

[0002] Since the beginning of the 21st century, human activity has produced a large amount of data, and the development of networks and big data has drawn more and more researchers to the study of big data. Before big data can be studied, it must first be preprocessed, and data deduplication is the first step of that preprocessing. Deduplication removes a large amount of duplicate data, which can greatly speed up data queries, reduce storage space, and save storage costs. Data deduplication technology can find and remove duplicate parts in the data, transmit and store the deduplication result data, and use pointers to point the stored data objects to duplicate data, so as to delete duplicate data or even have only one copy of the sam...

Application Information

IPC(8): G06F16/31, G06F16/34, G06F40/216, G06F40/289, G06F40/194

CPC: G06F16/345, G06F16/325, G06F40/289, G06F40/216, G06F40/194

Inventors: 梁超, 张宇

Owner: HARBIN UNIV OF SCI & TECH
