Spark platform-based group news data preprocessing method

A news data preprocessing technology, applied in the computer field. It addresses the problems of excessive overall processing time, slowed application scenarios, and lack of availability, and achieves a low error rate, easy adoption, and strong practicability.

Inactive Publication Date: 2017-11-03
SHANDONG INSPUR GENESOFT INFORMATION TECH CO LTD

AI Technical Summary

Problems solved by technology

[0003] Because the existing method provides neither in-memory computation nor in-memory storage of intermediate results, too much time is wasted overall during processing, which slows down the entire application scenario.

Embodiment Construction

[0029] In order to make those skilled in the art better understand the solution of the present invention, the present invention will be further described in detail below with reference to specific embodiments. Obviously, the described embodiments are only some, but not all, embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative efforts shall fall within the protection scope of the present invention.

[0030] As shown in Figure 1, a group news data preprocessing method based on the Spark platform achieves more accurate and efficient denoising and deduplication.

[0031] The specific implementation process is as follows:

[0032] Collection operator: collects the group news data;

[0033] Denoising operator: denoises the collected group news data; the denoising operator is completed...
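As a rough illustration of what a denoising operator might do, the sketch below filters records in plain Python. The noise categories (text shorter than a threshold, advertisement content, automatic replies) are the ones named in the background section; the threshold value, the keyword lists, and the function names `is_noise` and `denoise` are hypothetical, and the source does not give the details of the patented operator. On Spark, a predicate like this could be passed to an RDD `filter` call.

```python
# Hypothetical noise rules: the threshold and keyword lists are
# illustrative assumptions, not values taken from the patent.
MIN_LENGTH = 10
AD_KEYWORDS = ("buy now", "discount", "click here")
AUTO_REPLY_PHRASES = ("auto-reply", "out of office")


def is_noise(text: str) -> bool:
    """Return True if the record looks like noise under the assumed rules."""
    cleaned = text.strip().lower()
    if len(cleaned) < MIN_LENGTH:          # text length below the threshold
        return True
    if any(k in cleaned for k in AD_KEYWORDS):   # advertisement content
        return True
    return any(p in cleaned for p in AUTO_REPLY_PHRASES)  # automatic reply


def denoise(records):
    """Keep only the records that are not classified as noise."""
    return [r for r in records if not is_noise(r)]
```

Because each record is judged independently, a rule set of this shape parallelizes naturally across partitions.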

Abstract

The invention discloses a Spark platform-based group news data preprocessing method. The method comprises the following steps: setting a collection operator for collecting group news data; setting a denoising operator for denoising the collected group news data, wherein the denoising operator is implemented on the Spark platform; setting a deduplication operator for deduplicating the denoised data; and finally setting a Hamming distance threshold, and treating texts whose Hamming distance is smaller than the set threshold during deduplication as approximate texts. Compared with the prior art, the method has the following advantages: the processing speed is high, in that the denoising of one hundred million records can be finished in milliseconds and the deduplication of ten million records can be finished in minutes; the accuracy is high, in that the denoising accuracy can reach 96.4% and the deduplication accuracy can reach 90.3%; and the method is highly practical, widely applicable, and easy to popularize.
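The Hamming distance test described in the abstract can be sketched as follows, assuming each text has already been reduced to an integer binary signature. The function names and the default threshold of 3 are hypothetical; the source does not state the threshold value it uses.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two integer signatures differ."""
    return bin(a ^ b).count("1")


def is_approximate(a: int, b: int, threshold: int = 3) -> bool:
    """Treat two signatures as approximate texts when their Hamming
    distance is smaller than the threshold (threshold is an assumed value)."""
    return hamming_distance(a, b) < threshold
```

For example, `is_approximate(0b1011, 0b1001)` compares two signatures that differ in a single bit, so they count as approximate under the assumed threshold.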

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a method for preprocessing group news data based on a Spark platform.

Background technique

[0002] Existing data denoising algorithms are mainly applied on a single machine. They mainly target data whose text length does not meet the threshold, advertisement content, and automatic reply data. In the existing data deduplication algorithm, a module performs word frequency statistics on the text according to the word segmentation result to convert it into a dimensional vector, and operates on the dimensional vector to obtain the bit binary signature. The deduplication operation module then performs the following operations: segment the bit binary signature according to the set parameters, establish an inverted index according to the segmentation result, and segmentally retrieve t...
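The signature-and-index scheme outlined in the background resembles the widely used SimHash technique, and a minimal sketch of that technique is given here for orientation. It is not the patent's own algorithm: the 64-bit width, the use of SHA-256 as the token hash, the four-way segmentation, and all function names are assumptions, and tokens are taken as an already segmented word list.

```python
import hashlib
from collections import Counter, defaultdict

BITS = 64  # assumed signature width


def simhash(tokens):
    """Build a 64-bit signature from token frequencies (SimHash-style)."""
    weights = [0] * BITS
    for token, freq in Counter(tokens).items():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the token
        for i in range(BITS):
            # Add the frequency where the hash bit is 1, subtract where it is 0.
            weights[i] += freq if (h >> i) & 1 else -freq
    return sum(1 << i for i in range(BITS) if weights[i] > 0)


def segments(signature, parts=4):
    """Split the signature into equal-width (position, value) segments."""
    width = BITS // parts
    mask = (1 << width) - 1
    return [(i, (signature >> (i * width)) & mask) for i in range(parts)]


def build_index(signatures, parts=4):
    """Inverted index: segment -> ids of documents containing that segment."""
    index = defaultdict(set)
    for doc_id, sig in signatures.items():
        for seg in segments(sig, parts):
            index[seg].add(doc_id)
    return index


def candidates(index, signature, parts=4):
    """Ids of documents sharing at least one segment with the query."""
    found = set()
    for seg in segments(signature, parts):
        found |= index.get(seg, set())
    return found
```

With four segments, any two signatures that differ in fewer than four bits must agree exactly on at least one segment, so the index lookup cannot miss them; the candidates it returns would then be confirmed with a full Hamming distance comparison.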

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/183, G06F16/215, G06F16/254, G06F40/289
Inventor: 李腾
Owner: SHANDONG INSPUR GENESOFT INFORMATION TECH CO LTD