Big data text deduplication technology based on improved Simhash algorithm

A big data text deduplication technology that addresses data redundancy and duplication. It aims to compute text similarity more accurately, reduce the number of signature comparisons, and improve the efficiency of the algorithm.

Pending Publication Date: 2021-11-05
HARBIN UNIV OF SCI & TECH
Cites: 0. Cited by: 1.

AI Technical Summary

Problems solved by technology

[0003] To address the large amount of redundant and duplicate data generated in the big data era, the present invention discloses a big data text deduplication technology based on an improved Simhash algorithm.


Examples


Embodiment Construction

[0024] In order to clearly and completely describe the technical solutions in the embodiments of the present invention, the present invention will be further described in detail below in conjunction with the drawings in the embodiments.

[0025] The flow of the big data text deduplication technology based on the improved Simhash algorithm in this embodiment is shown in Figure 1 and includes the following steps.

[0026] Step 1: Obtain the duplicate-text dataset as follows:

[0027] Sogou news data (https://www.sogou.com/labs/resource/ca.php): 5000 Chinese news texts divided into ten categories ('auto', 'finance', 'technology', 'health', 'sports', 'education', 'culture', 'military', 'entertainment', 'fashion'), each with 500 similar texts, mixed with 2000 irrelevant texts.

[0028] Step 2: Perform word segmentation and feature weight calculation on the text set as follows:

[0029] Step 2-1 uses the NLPIR-ICTCLAS w...


Abstract

The invention discloses a big data text deduplication technology based on an improved Simhash algorithm, and relates to the field of natural language processing. The technology comprises the following steps: (1) segmenting the text into words with a word segmentation tool; (2) assigning a weight to each extracted keyword; (3) computing a document content signature and an article abstract signature from the keyword weights; and (4) computing similarity and finding similar documents. The invention improves on the classical Simhash algorithm for big data text deduplication in four ways. First, a more accurate word segmentation tool is selected. Second, part of speech and word length are taken into account in the weight calculation stage. Third, in the signature matching stage, a secondary hash based on the idea of bucket sort is applied. Finally, a new formula for comparing the Hamming distance of Simhash signature values is proposed, based on the feature vectors of the article content and the abstract content. The technology is well suited to deduplicating big data texts: the accuracy and recall are improved, and deduplication is faster.
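The signature and matching stages described above can be illustrated with a minimal Python sketch of the classical Simhash scheme. It is not the patented method: the keyword weights are taken as given (the invention's weight formula based on part of speech and word length is not stated here), plain Hamming distance stands in for the invention's combined content and abstract formula, and the names `token_hash`, `simhash`, `bands`, and `SimhashIndex` are hypothetical. The bucketing step splits each 64-bit signature into bands so that only documents sharing at least one band are compared, which is one common way to reduce the number of comparisons.

```python
import hashlib
from collections import defaultdict

BITS = 64  # signature length; an assumption of this sketch


def token_hash(token: str) -> int:
    """Stable 64-bit hash of a token (first 8 bytes of its MD5 digest)."""
    return int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")


def simhash(weighted_tokens: dict) -> int:
    """Classical Simhash: each token votes +weight or -weight on every bit."""
    acc = [0.0] * BITS
    for token, weight in weighted_tokens.items():
        h = token_hash(token)
        for i in range(BITS):
            acc[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, value in enumerate(acc):
        if value > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two signatures differ."""
    return bin(a ^ b).count("1")


def bands(fingerprint: int, n_bands: int) -> list:
    """Split a signature into (band index, band value) bucket keys."""
    width = BITS // n_bands
    mask = (1 << width) - 1
    return [(i, (fingerprint >> (i * width)) & mask) for i in range(n_bands)]


class SimhashIndex:
    """Bucketed lookup: only documents sharing a band are compared."""

    def __init__(self, max_distance: int = 3, n_bands: int = 4):
        # Pigeonhole principle: if two signatures differ in at most
        # max_distance bits and n_bands > max_distance, at least one
        # band is identical, so no near duplicate is missed.
        if n_bands <= max_distance:
            raise ValueError("n_bands must exceed max_distance")
        self.max_distance = max_distance
        self.n_bands = n_bands
        self.buckets = defaultdict(set)
        self.fingerprints = {}

    def add(self, doc_id: str, fingerprint: int) -> None:
        self.fingerprints[doc_id] = fingerprint
        for key in bands(fingerprint, self.n_bands):
            self.buckets[key].add(doc_id)

    def near_duplicates(self, fingerprint: int) -> list:
        candidates = set()
        for key in bands(fingerprint, self.n_bands):
            candidates |= self.buckets.get(key, set())
        return sorted(
            doc_id for doc_id in candidates
            if hamming(self.fingerprints[doc_id], fingerprint) <= self.max_distance
        )
```

In use, each document would be segmented and weighted first, then `index.add(doc_id, simhash(weights))` registers it and `index.near_duplicates(simhash(new_weights))` returns the stored documents within the distance threshold. The threshold of 3 and the four bands are illustrative defaults, not values taken from the invention.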

Description

Technical field

[0001] The invention discloses a big data text deduplication technology based on an improved Simhash algorithm, and relates to the field of natural language processing.

Background art

[0002] Since the start of the 21st century, human activity has produced a large amount of data. The development of networks and big data has also led more and more researchers to study big data. Before big data can be studied, it must first be preprocessed, and data deduplication is the first step of that preprocessing. Removing large amounts of duplicate data greatly speeds up data queries, reduces storage space, and saves storage costs. Data deduplication technology can find and remove duplicate parts in the data, transmit and store the deduplication result data, and use pointers to point the stored data objects to duplicate data, so as to delete duplicate data or even have only one copy of the sam...
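The pointer-based storage idea in the background above can be sketched for the exact-duplicate case as follows. This is a minimal illustration, not part of the invention: the class name `DedupStore` and its methods are hypothetical, and SHA-256 content hashing is an assumed way of detecting identical data.

```python
import hashlib


class DedupStore:
    """Stores each distinct payload once; names act as pointers to it."""

    def __init__(self):
        self.blobs = {}  # content digest -> payload, stored once
        self.refs = {}   # object name -> content digest (the "pointer")

    def put(self, name: str, data: bytes) -> bool:
        """Store data under name; return True if the payload was new."""
        digest = hashlib.sha256(data).hexdigest()
        is_new = digest not in self.blobs
        if is_new:
            self.blobs[digest] = data
        self.refs[name] = digest
        return is_new

    def get(self, name: str) -> bytes:
        """Follow the pointer from name to the single stored copy."""
        return self.blobs[self.refs[name]]
```

Exact hashing only catches byte-identical data. Near-duplicate texts that differ by a few words produce different digests, which is the gap that similarity signatures such as Simhash are meant to cover.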


Application Information

IPC(8): G06F16/31, G06F16/34, G06F40/216, G06F40/289, G06F40/194
CPC: G06F16/345, G06F16/325, G06F40/289, G06F40/216, G06F40/194
Inventors: 梁超, 张宇
Owner: HARBIN UNIV OF SCI & TECH