System and method for identifying and automatically acquiring webpage information

A technology for identifying the automatic collection of web page information, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problem that existing approaches only increase the difficulty of web page collection and analysis without fundamentally preventing search engine collection, and achieves the effect of eliminating such collection.

Active Publication Date: 2013-07-24
国科(上海)企业发展有限公司
Cites: 4; Cited by: 26

AI Technical Summary

Problems solved by technology

[0007] 2. The information on some websites is private or copyrighted, and many web pages contain information such as back-end databases, user privacy data, and passwords.
The parsing process is shown in Figure 2. As...


Examples

Embodiment Construction

[0036] Referring to the accompanying drawings, a system and method for identifying the automatic acquisition of webpage information and preventing crawling comprises an anti-collection classifier building module, an automatic collection identification module, and an anti-collection online processing module. The anti-collection classifier building module mainly uses a computer program to learn from historical web access information and to distinguish automatic collection from normal web page access behavior; it provides the trained model used for automatic collection identification. The automatic collection identification module loads the anti-collection classifier to automatically identify the automatic collection behavior of search engine programs, and adds the IP segment where an identified collection program is located to a blacklist, which is used for subsequent online interception of automatic collection behavior. The anti-collection online...
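
The flow described in [0036] (learn from historical access behavior, identify automatic collection, blacklist the collector's IP segment, then intercept online) can be illustrated with a minimal sketch. The visible text does not specify the classifier type or its features, so everything concrete here is an assumption made for illustration: the two session features (requests per minute, fraction of requests without a Referer header), the nearest-centroid classifier, the /24 segment size, and all function and class names.

```python
import ipaddress
from statistics import mean

# ASSUMPTION: each session is described by two hypothetical features,
# (requests per minute, fraction of requests without a Referer header).

def train_centroids(samples):
    """Build a nearest-centroid model from labelled historical sessions.

    samples: list of (features, label) pairs, label is "robot" or "human".
    Returns a dict mapping each label to the mean feature vector.
    """
    centroids = {}
    for label in {lbl for _, lbl in samples}:
        rows = [feats for feats, lbl in samples if lbl == label]
        centroids[label] = tuple(mean(col) for col in zip(*rows))
    return centroids

def classify(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean)."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(centroids, key=lambda lbl: dist2(centroids[lbl], features))

def ip_segment(ip, prefix_len=24):
    """Map an address to its enclosing segment (ASSUMPTION: a /24)."""
    return ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)

class Blacklist:
    """Set of blocked IP segments consulted during online interception."""

    def __init__(self):
        self.segments = set()

    def add(self, ip):
        self.segments.add(ip_segment(ip))

    def blocks(self, ip):
        addr = ipaddress.ip_address(ip)
        return any(addr in seg for seg in self.segments)

# Hypothetical labelled history standing in for real access logs.
history = [
    ((120.0, 0.95), "robot"),
    ((90.0, 0.90), "robot"),
    ((3.0, 0.10), "human"),
    ((5.0, 0.20), "human"),
]
centroids = train_centroids(history)

# Identification step: classify a visitor and blacklist its segment.
blacklist = Blacklist()
visitor_ip, visitor_features = "203.0.113.57", (100.0, 0.9)
if classify(centroids, visitor_features) == "robot":
    blacklist.add(visitor_ip)
```

With this sketch, a later request from any address in the same segment would be refused by `blacklist.blocks(...)`, while addresses outside it are unaffected. Blocking a whole segment rather than a single address trades some precision for resistance to collectors that rotate addresses within one range.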

Abstract

The invention discloses a system and a method for identifying the automatic acquisition of webpage information. The system comprises an anti-acquisition classifier constructing module, an automatic acquisition identifying module, and an anti-acquisition online processing module. The anti-acquisition classifier constructing module mainly uses a computer program to learn from historical web access information and to distinguish automatic acquisition from normal webpage access behavior. The automatic acquisition identifying module uses the anti-acquisition classifier built in the previous step to automatically identify the automatic acquisition behavior of search engine programs, and adds the IP (Internet Protocol) segment where an identified acquisition program is located to a blacklist. The anti-acquisition online processing module mainly judges and handles visiting users automatically online. The system and method overcome deficiencies in the prior art: the historical webpage access behavior of a website is analyzed, an automatic acquisition classifier is established, automatic acquisition by robots is identified, and webpage anti-crawling is realized through this identification.

Description

Technical field

[0001] The invention relates to the technical field of web page dynamic analysis, and in particular to a system and method capable of identifying the automatic acquisition of web page information.

Background art

[0002] With the development of the Internet, more and more Internet sites have appeared in a wide variety of forms, such as news, blogs, forums, SNS, and Weibo. According to the latest statistics from CNNIC this year, China now has 485 million Internet users and more than 1.3 million domain names of various sites. In today's Internet information explosion, search engines have become the most important tool for people to find Internet information.

[0003] Search engines mainly crawl website information automatically, preprocess it, and build indexes after word segmentation. After the user enters a search term, the search engine can automatically find the most relevant results. After more than ten years of development, the search engine technolo...

Application Information

IPC(8): G06F17/30
Inventor 张炜金军吴杨梓江岩
Owner 国科(上海)企业发展有限公司