Web click counting method based on web crawler behavior identification and buffering updating strategies

A web crawler identification and click counting technique, applied in network data retrieval, network data indexing, computation, and similar fields. It addresses problems such as a large state space, a similarity value range that is hard to estimate, and a similarity threshold that is hard to determine, so that counts are updated in a timely manner and are not artificially inflated.

Active Publication Date: 2014-03-26
深圳前海财信云科技有限公司

AI Technical Summary

Problems solved by technology

[0005] User identification relies solely on the client's IP address and Agent string. Because a crawler program cannot be forced to identify itself, this method cannot distinguish crawlers from network users.
The click sequence model uses a statistical model to describe the conditional probability between two consecutive clicks, and can describe the jump relationships between different links in user browsing behavior [3]. Although it captures, through statistical characteristics, the selective behavior of network users when browsing websites, it runs into the following problems when actually applied to distinguishing crawlers from network users: (1) For a network forum such as the stock bar website, the number of posts is huge, so the state space that a statistical model of jump relationships must handle is very large, which leaves a large amount of sparse information in the model.
However, because the similarity value is a likelihood value, its range is hard to estimate, which makes the similarity threshold hard to determine, so the approach still has serious practical problems.
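The sparsity problem described in (1) can be illustrated with a small sketch. It is not taken from the patent: the function names, the sample sessions, and the page count of 1000 are all hypothetical, and the model is assumed to be a plain first-order transition table (previous page to next page).

```python
from collections import Counter


def transition_counts(sessions):
    """Count first-order transitions (previous page -> next page)."""
    counts = Counter()
    for clicks in sessions:
        for prev, nxt in zip(clicks, clicks[1:]):
            counts[(prev, nxt)] += 1
    return counts


def fill_ratio(counts, num_pages):
    """Fraction of the num_pages x num_pages transition table that was observed."""
    return len(counts) / (num_pages * num_pages)


# Hypothetical click sequences on a forum assumed to have 1000 posts.
sessions = [["p1", "p2", "p3"], ["p2", "p3", "p7"], ["p1", "p2"]]
counts = transition_counts(sessions)
ratio = fill_ratio(counts, num_pages=1000)
```

The table grows with the square of the number of posts, while the number of observed transitions grows only with the number of clicks, so almost every cell stays empty on a large forum.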




Detailed Description of the Embodiments

[0036] The present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments.

[0037] The web click counting method based on web crawler behavior identification and a buffer update strategy provided by the present invention divides the crawler identification and buffer update computation into two parts: a main process for online behavior detection and buffer updating, and client identification threads. The processing of a specific click sequence is completed through the creation, execution, and termination of a client thread. The details are as follows.
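One possible reading of this two-part division is sketched below. The sketch rests on assumptions the text does not confirm: that the main process creates one thread per client, feeds it clicks through a queue, and terminates it with a sentinel value. The names `Dispatcher`, `client_worker`, `on_click`, and `finish` are hypothetical, and the worker only collects the click sequence rather than classifying it.

```python
import queue
import threading


def client_worker(client_id, clicks, results):
    """Client thread: collect one client's click sequence until told to stop."""
    sequence = []
    while True:
        page = clicks.get()
        if page is None:  # sentinel sent by the main process to end the thread
            break
        sequence.append(page)
    results[client_id] = sequence


class Dispatcher:
    """Main-process side: creates, feeds, and terminates client threads."""

    def __init__(self):
        self.queues = {}
        self.threads = {}
        self.results = {}

    def on_click(self, client_id, page):
        # Create the client thread on the first click from this client.
        if client_id not in self.queues:
            q = queue.Queue()
            t = threading.Thread(
                target=client_worker, args=(client_id, q, self.results)
            )
            self.queues[client_id] = q
            self.threads[client_id] = t
            t.start()
        self.queues[client_id].put(page)

    def finish(self, client_id):
        # Terminate the client thread and return the sequence it processed.
        self.queues.pop(client_id).put(None)
        self.threads.pop(client_id).join()
        return self.results.pop(client_id)


dispatcher = Dispatcher()
dispatcher.on_click("c1", "a.html")
dispatcher.on_click("c2", "x.html")
dispatcher.on_click("c1", "b.html")
seq_c1 = dispatcher.finish("c1")
seq_c2 = dispatcher.finish("c2")
```

Keeping one queue per client preserves the order of that client's clicks, which matters if the thread later compares the sequence against the page structure.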

[0038] 1. Build a logical structure diagram of the page

[0039] (1) Input the page files of the entire website, initialize the list LS, and process each page file F as follows:

[0040] (1.1) Extract the next string S that an "href=" attribute in the file points to.

[0041] (1.2) If every "href=" in the file has been processed, go to step (2); otherwise continue with step (1.3).

[0042] (...
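Steps (1.1) and (1.2) above can be sketched as follows. Because the remaining steps are truncated in the source, the shape of LS is an assumption: here it is taken to map each page file to the list of href strings found in it. The names `HrefCollector` and `build_link_lists` are hypothetical, and the sketch uses the standard-library `html.parser` module rather than whatever string scanning the patent specifies.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect the value of every href attribute in one page file."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "href" and value:
                self.hrefs.append(value)


def build_link_lists(pages):
    """pages maps a file name to its HTML text; returns file name -> href list.

    The returned dict stands in for the list LS (an assumed structure).
    """
    ls = {}
    for file_name, html in pages.items():
        collector = HrefCollector()
        collector.feed(html)
        collector.close()
        ls[file_name] = collector.hrefs
    return ls


# Hypothetical two-page site.
pages = {
    "index.html": '<a href="post1.html">P1</a> <a href="post2.html">P2</a>',
    "post1.html": '<a href="index.html">Home</a> <a name="top">no link</a>',
}
LS = build_link_lists(pages)
```

A parser-based extraction avoids false matches on "href=" text that appears outside a tag, at the cost of departing from the literal string search the steps describe.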


Abstract

The invention belongs to the technical field of Web design and specifically relates to a Web click counting method based on web crawler behavior identification and a buffered update strategy. The method first builds a logical representation of the website's page organization structure, then identifies crawler behavior by combining client identity identification, matching against the page logical structure, and time attributes. Click counts are then updated according to the identification results, using a buffered counting structure. With this method, crawler click behavior can be identified correctly, artificially inflated counts are prevented, and genuine Web click counts are still updated in a timely manner. The method is suitable for any application that needs link or page click counting.
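The buffered counting idea can be sketched as below. The patent's actual buffer structure is not visible in this excerpt, so the sketch assumes one plausible design: clicks are held per client until that client has been classified, then either committed to the page counts or discarded. The class and method names are hypothetical, and the crawler decision is passed in as a flag instead of being computed.

```python
from collections import defaultdict


class BufferedClickCounter:
    """Hold clicks per client until the client is classified (assumed design)."""

    def __init__(self):
        self.counts = defaultdict(int)    # committed per-page click counts
        self.pending = defaultdict(list)  # client id -> buffered page clicks

    def click(self, client_id, page):
        # Buffer the click instead of incrementing the public count at once.
        self.pending[client_id].append(page)

    def resolve(self, client_id, is_crawler):
        # Commit buffered clicks for a human client, drop them for a crawler.
        pages = self.pending.pop(client_id, [])
        if not is_crawler:
            for page in pages:
                self.counts[page] += 1
        return len(pages)


counter = BufferedClickCounter()
counter.click("user-1", "post/1")
counter.click("bot-1", "post/1")
counter.click("user-1", "post/2")
dropped = counter.resolve("bot-1", is_crawler=True)
committed = counter.resolve("user-1", is_crawler=False)
```

Buffering trades a short delay in the displayed count for the ability to withhold clicks that are later attributed to a crawler.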

Description

Technical field

[0001] The invention belongs to the technical field of Web design and relates to a novel method for counting Web clicks, in particular to a novel calculation method based on the analysis of web crawler behavior patterns combined with a buffer update strategy.

Background technique

[0002] On many websites, Web click counts play an increasingly important role. For example, on the Oriental Fortune Stock Bar website, the number of clicks on each post is listed to the left of the post. These click counts have a certain guiding value for investors seeking information, and wrong counts are likely to mislead them.

[0003] Most Web page counters on the Internet currently use a simple counting method: each click increases the count by one. Counts produced this way are continually distorted by web crawlers, which makes the click counts attributed to Internet users increasingly unreliable. With the de...


Application Information

IPC(8): G06F17/30, G06F9/44
CPC: G06F16/951, G06F16/958
Inventor: 曾剑平, 罗邦慧
Owner: 深圳前海财信云科技有限公司