An article deduplication method, device and equipment and a storage medium

An article deduplication technology, applied in the field of data processing, that addresses the poor deduplication results of existing methods

Pending Publication Date: 2021-03-19
BEIJING GRIDSUM TECH CO LTD

AI Technical Summary

Problems solved by technology

[0003] Existing article deduplication methods usually deduplicate article data based on each article's URL (Uniform Resource Locator) ...

Embodiment Construction

[0054] Most existing technical solutions deduplicate article data by the article's URL (specifically, the URL character string). However, the repetition rate of the resulting article data remains high: many articles with identical content survive deduplication, so the deduplication effect is poor.

[0055] The inventor found through research that URLs and article content are not in one-to-one correspondence. The same article may exist at multiple locations on the network, for example when it is published on several network platforms, so that one article actually corresponds to multiple different URLs. When article data is deduplicated by URL, the URLs differ but the content of the corresponding articles is still the same,...
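The mismatch between URLs and content can be illustrated with a minimal Python sketch. The sample records, the `example` host names, and the `dedup_by_url` helper are all hypothetical and are not taken from the application; they only show why comparing URL strings leaves content duplicates behind.

```python
# Hypothetical sample: the same article text published at two different URLs.
articles = [
    {"url": "https://site-a.example/news/1", "content": "Market opens higher."},
    {"url": "https://site-b.example/post/42", "content": "Market opens higher."},
    {"url": "https://site-a.example/news/2", "content": "Rain expected tomorrow."},
]


def dedup_by_url(items):
    """Keep the first article seen for each distinct URL string."""
    seen = set()
    kept = []
    for item in items:
        if item["url"] not in seen:
            seen.add(item["url"])
            kept.append(item)
    return kept


kept = dedup_by_url(articles)
# Count how many distinct article bodies remain after URL-based deduplication.
distinct_contents = {a["content"] for a in kept}
print(len(kept), len(distinct_contents))
```

All three records survive because their URLs differ, although only two distinct article bodies exist among them.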

Abstract

The invention discloses an article deduplication method, device, equipment and storage medium. The method comprises: obtaining the target articles to be deduplicated; determining an article attribute corresponding to each article in the target articles, the article attribute being used to uniquely identify the article; and deduplicating the target articles according to the determined article attribute of each article. Because article attributes generally correspond one-to-one with articles, deduplicating based on them yields articles that differ from one another, which reduces the repetition rate among the remaining articles, improves the uniqueness of the deduplicated article data, and thus improves the deduplication effect.
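The three steps in the abstract (obtain the articles, determine an identifying attribute per article, deduplicate by that attribute) can be sketched as follows. The abstract does not say which attribute is used, so this sketch assumes, purely for illustration, a SHA-256 digest of the whitespace-normalized, lowercased body text. The function names `article_attribute` and `dedup_by_attribute` and the sample data are hypothetical.

```python
import hashlib


def article_attribute(article):
    """Hypothetical attribute: SHA-256 of the normalized body text.

    This is an assumed stand-in; the source only requires that the
    attribute uniquely identify the article.
    """
    normalized = " ".join(article["content"].split()).lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedup_by_attribute(articles, attribute=article_attribute):
    """Keep the first article seen for each distinct attribute value."""
    seen = set()
    kept = []
    for article in articles:
        key = attribute(article)
        if key not in seen:
            seen.add(key)
            kept.append(article)
    return kept


# Hypothetical input: one article republished under a second URL.
target_articles = [
    {"url": "https://site-a.example/news/1", "content": "Market opens higher."},
    {"url": "https://site-b.example/post/42", "content": "Market  opens higher."},
    {"url": "https://site-a.example/news/2", "content": "Rain expected tomorrow."},
]

unique_articles = dedup_by_attribute(target_articles)
print([a["url"] for a in unique_articles])
```

Passing the attribute function as a parameter keeps the deduplication loop independent of how the attribute is chosen, so a different identifying attribute can be swapped in without changing the loop.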

Description

Technical field

[0001] The present application relates to the technical field of data processing, in particular to a method, device, equipment and storage medium for deduplication of articles.

Background technique

[0002] In the processing of article data, data cleaning and denoising is a relatively important step. The quality of denoising determines the quality of the article data finally used, which in turn affects the accuracy of the results obtained when the article data is later analyzed and processed. Deduplication, which refers to removing articles with duplicate content from the article data, is an important part of denoising.

[0003] Existing article deduplication methods usually deduplicate article data based on the URL (Uniform Resource Locator) of each article. However, the effect of this deduplication met...

Application Information

IPC(8): G06F16/951, G06F16/9535, G06F16/955, G06F16/33
CPC: G06F16/33, G06F16/951, G06F16/9535, G06F16/955
Inventor 任志伟
Owner BEIJING GRIDSUM TECH CO LTD