Method and device for data cleaning

A data cleaning technology, applied in electrical digital data processing, special data processing applications, instruments, etc., that addresses problems such as noise interference in analysis and distorted statistical results.

Active Publication Date: 2016-10-05
北京秒针信息咨询有限公司

AI Technical Summary

Problems solved by technology

When URLs are grouped, the URLs of automatic HTTP requests are included as well, which causes the analysis to be disturbed by no...



Embodiment Construction

[0047] To make the technical problem to be solved, the technical solution, and the advantages of the present invention clearer, a detailed description follows with reference to the drawings and specific embodiments.

[0048] To address the difficulty of cleaning automatic HTTP requests in the prior art, the invention proposes a data cleaning method. The prior art commonly uses two methods to clean automatic HTTP requests. The first is to add corresponding parameters to the URL when the website initiates a request, and to identify whether the URL is an automatic request from the parameters it carries. However, adding parameters depends on the media itself, and not all media add URL parameters that identify automatic requests; even when such parameters exist, the formats used by different websites differ, and it is very difficult to obtain th...


Abstract

The invention provides a method and a device for data cleaning. The method includes: acquiring pre-collected user internet behavior data, which includes a unique user identifier together with the request_url field and the referer field of the current HTTP request, both of which contain URL content; counting, for each URL, a first value expressing its occurrence frequency in the request_url field of the user internet behavior data and a second value expressing its occurrence frequency in the referer field, then calculating the ratio of the second value to the first value to obtain a first ratio; establishing, from the user internet behavior data belonging to the same user, a behavior tree comprising multiple leaf nodes, each corresponding to one URL of the request_url field; and judging whether the first ratio of the URL corresponding to each leaf node is smaller than a preset threshold, and if so, deleting the user internet behavior data whose request_url field contains that URL. The method cleans useless data effectively.
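
The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the record layout (dicts with `user_id`, `request_url`, `referer`), the function name `clean_records`, the threshold value, and in particular the rule used to find leaf nodes (a request_url that never appears as a referer for the same user) are all hypothetical, since the abstract does not spell out how the behavior tree is built.

```python
from collections import Counter, defaultdict


def clean_records(records, threshold):
    """Drop records whose request_url looks like an automatic request.

    records: list of dicts with keys 'user_id', 'request_url', 'referer'
    (a hypothetical layout; the source only names the fields).
    threshold: the preset threshold; the source gives no concrete value.
    """
    # First value: occurrences of each URL in the request_url field.
    request_counts = Counter(r["request_url"] for r in records)
    # Second value: occurrences of each URL in the referer field.
    referer_counts = Counter(r["referer"] for r in records if r["referer"])
    # First ratio = second value / first value, per requested URL.
    first_ratio = {
        url: referer_counts[url] / count
        for url, count in request_counts.items()
    }

    # Group the records by user so leaves are determined per user.
    by_user = defaultdict(list)
    for r in records:
        by_user[r["user_id"]].append(r)

    kept = []
    for rows in by_user.values():
        # ASSUMPTION: a URL is a leaf of this user's behavior tree when it
        # never acts as the referer of another request by the same user.
        parents = {r["referer"] for r in rows if r["referer"]}
        for r in rows:
            url = r["request_url"]
            is_leaf = url not in parents
            if is_leaf and first_ratio[url] < threshold:
                continue  # delete: rarely or never leads to further requests
            kept.append(r)
    return kept


# Hypothetical sample data: two page URLs and one tracking pixel.
A, B, T = "http://a.example/", "http://b.example/", "http://t.example/pixel"
sample = [
    {"user_id": "u1", "request_url": A, "referer": ""},
    {"user_id": "u1", "request_url": B, "referer": A},
    {"user_id": "u1", "request_url": T, "referer": A},
    {"user_id": "u2", "request_url": B, "referer": ""},
    {"user_id": "u2", "request_url": A, "referer": B},
    {"user_id": "u2", "request_url": T, "referer": B},
]
cleaned = clean_records(sample, threshold=0.5)
print([r["request_url"] for r in cleaned])
```

In this sample the pixel URL is requested twice but never appears as a referer, so its first ratio is 0 and both of its records are dropped, while the four page views (first ratio 1) are kept.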

Description

Technical field

[0001] The invention relates to the technical field of data cleaning, in particular to a method and device for cleaning data.

Background

[0002] In the era of Internet big data, visiting websites is the mainstream mode of Internet access, and analyzing Internet users' access behavior is of vital significance to many companies. Data cleaning is a necessary step before data analysis. Screening out valuable data benefits a company's marketing plans and development planning. Conversely, if the screening yields a large amount of useless data, the company not only has to spend manpower and material resources analyzing that data, but may also be misled in its operating direction by wrong analysis results, resulting in huge losses.

[0003] But in a real media environment, when a user visits a certain page, multiple automatic HTTP requests may be generated. The http automat...


Application Information

IPC(8): G06F17/30
Inventors: 陈家耀, 李长刚, 冯是聪, 吴明辉
Owner: 北京秒针信息咨询有限公司