System and method for real-time intelligent capturing of article

An intelligent article-capture technology, applied in the field of web crawling within Internet technology. It addresses inaccurate article extraction, heavy consumption of network and hardware resources, and low usability of crawled articles. Its effects are improved news coverage and real-time performance, together with fast near-duplicate removal.

Active Publication Date: 2013-08-14
凤凰在线(北京)信息技术有限公司

AI Technical Summary

Problems solved by technology

[0003] 1) Crawling systems that use machine-generated extraction wrappers can capture a large number of articles but cannot extract them accurately, so the usability of the crawled articles is low;
[0004] 2) Crawling systems that use manually written extraction wrappers extract articles accurately, but the wrappers for thousands of websites on the Internet must be generated, updated and maintained. Ordinary vertical crawlers cannot handle this work well and can only rely on a large amount of human effort;
When highly real-time crawling is required, link and download requests must be sent frequently to the server of the crawled website. This puts heavy pressure on that server, which may in turn adopt blocking strategies such as denying access to keep itself working normally, and the crawl then fails.
At the same time, highly real-time crawling consumes a lot of network, server and other hardware resources, which increases costs.

Examples

Embodiment Construction

[0080] The crawling system consists of five modules or subsystems, as shown in Figure 1: a real-time crawling module, a web page extraction system, a near-duplicate document removal module, an automatic document classification module, and an article publishing module.
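The five modules listed in [0080] can be pictured as stages of one pipeline. The following is a minimal sketch, not the patented implementation: the stage order is an assumption based on the order in which the modules are listed, and every function name, the `Document` type, the keyword-based classifier and the exact-match deduplication are hypothetical placeholders chosen only to keep the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Document:
    url: str
    html: str
    title: str = ""
    body: str = ""
    category: str = ""


def crawl(urls: Iterable[str], fetch: Callable[[str], str]) -> List[Document]:
    """Real-time crawling module: download each page with the given fetcher."""
    return [Document(url=u, html=fetch(u)) for u in urls]


def extract(doc: Document) -> Document:
    """Web page extraction system (placeholder): first line is the title,
    the remaining lines are the body."""
    lines = doc.html.strip().splitlines()
    doc.title = lines[0] if lines else ""
    doc.body = "\n".join(lines[1:])
    return doc


def deduplicate(docs: Iterable[Document]) -> List[Document]:
    """Near-duplicate removal module (placeholder): drops documents whose
    case- and whitespace-normalized body was already seen. The source does
    not describe its actual near-duplicate method in this excerpt."""
    seen = set()
    unique = []
    for doc in docs:
        key = " ".join(doc.body.lower().split())
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique


def classify(doc: Document) -> Document:
    """Automatic classification module (placeholder): a single keyword rule."""
    doc.category = "sports" if "match" in doc.body.lower() else "general"
    return doc


def publish(docs: Iterable[Document]) -> List[str]:
    """Article publishing module (placeholder): returns printable headlines."""
    return [f"[{d.category}] {d.title}" for d in docs]


def run_pipeline(urls: Iterable[str], fetch: Callable[[str], str]) -> List[str]:
    docs = [extract(d) for d in crawl(urls, fetch)]
    docs = [classify(d) for d in deduplicate(docs)]
    return publish(docs)
```

Passing the fetcher in as a parameter keeps the sketch testable without network access; a dictionary lookup can stand in for real downloads.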

[0081] The overall data flow of the system is shown in Figure 2. The specific steps are as follows:

[0082] Step 1: submit a job or a batch of jobs to the real-time crawling module of the system. The real-time crawling module is divided into two main parts: a job parsing and scheduling module and a crawler download module (task download module);

[0083] Step 2: the job parsing and scheduling module of the real-time crawling module interprets each job as a set of rules defined by the system. These rules specify the concrete crawling logic of the crawler module in the next step. Each job is then scheduled and distributed to a suitable server so that it can be crawled faster ...
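Step 2 can be illustrated with a small sketch that turns a job into rules and then assigns the rules to servers. This is an illustration under stated assumptions, not the method of the patent: the job format, the `CrawlRule` fields, the default values and the "least-loaded server" criterion for what counts as a suitable server are all hypothetical, since the excerpt does not define them.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class CrawlRule:
    # Hypothetical rule fields; the source does not list the real ones.
    seed_url: str
    max_depth: int
    interval_seconds: int


def parse_job(job: Dict[str, Any]) -> List[CrawlRule]:
    """Interpret one job as a list of rules, one per seed URL."""
    depth = int(job.get("max_depth", 1))
    interval = int(job.get("interval_seconds", 300))
    return [CrawlRule(url, depth, interval) for url in job["seeds"]]


def schedule(rules: List[CrawlRule],
             server_load: Dict[str, int]) -> Dict[str, List[CrawlRule]]:
    """Assign each rule to the currently least-loaded server.

    Assumption: "suitable" means "fewest assigned rules"; ties are broken
    by server name so the result is deterministic.
    """
    load = dict(server_load)  # copy so the caller's mapping is not mutated
    assignment: Dict[str, List[CrawlRule]] = {name: [] for name in load}
    for rule in rules:
        target = min(load, key=lambda name: (load[name], name))
        assignment[target].append(rule)
        load[target] += 1
    return assignment
```

For example, three seed URLs scheduled onto servers with initial loads `{"s1": 0, "s2": 1}` would place the first two rules on `s1` and the third on `s2`.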

Abstract

The invention discloses a system for real-time intelligent capturing of articles. The system comprises a real-time capturing module, a webpage extraction system, a near-duplicate document removal module, an automatic document classification module and an article publishing module. The real-time capturing module further comprises seven modules that run online: a task extraction module, a task analysis module, a task capturing time range test module, a task capturing time interval test module, a task scheduling module, a task downloading module and a task capturing frequency regulation module. The real-time capturing module also comprises three modules that run offline: a task capturing time range discovery module, a task capturing time interval discovery module, and a free proxy collection and verification module.
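The abstract names the time range test, time interval test and frequency regulation modules without describing their logic. One plausible reading, sketched below purely as an assumption, is that a task is downloaded only when the current hour falls inside its capturing time range and enough time has passed since its last capture, and that the interval is lengthened when the target site blocks requests. All names, fields, and the backoff factors (2.0 and 0.9) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CrawlTask:
    url: str
    start_hour: int            # inclusive, 0-23
    end_hour: int              # exclusive, 0-23
    interval_seconds: float
    last_crawled: float = float("-inf")  # timestamp of the last capture


def in_time_range(task: CrawlTask, hour: int) -> bool:
    """Time range test; supports ranges that wrap past midnight (e.g. 22 to 6).
    A task with start_hour == end_hour has an empty range."""
    if task.start_hour <= task.end_hour:
        return task.start_hour <= hour < task.end_hour
    return hour >= task.start_hour or hour < task.end_hour


def interval_elapsed(task: CrawlTask, now: float) -> bool:
    """Time interval test: has the configured interval passed?"""
    return now - task.last_crawled >= task.interval_seconds


def is_due(task: CrawlTask, now: float, hour: int) -> bool:
    return in_time_range(task, hour) and interval_elapsed(task, now)


def regulate_frequency(task: CrawlTask, blocked: bool,
                       min_interval: float = 60.0,
                       max_interval: float = 3600.0) -> None:
    """Frequency regulation (assumed policy): double the interval after a
    blocked request, otherwise shorten it slightly, within fixed bounds."""
    factor = 2.0 if blocked else 0.9
    new_interval = task.interval_seconds * factor
    task.interval_seconds = min(max_interval, max(min_interval, new_interval))
```

Backing off on a block and recovering slowly is a common politeness pattern for crawlers; whether the patented module works this way is not stated in the excerpt.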

Description

Technical field

[0001] The invention relates to crawling technology, web mining technology, information extraction technology, and natural language processing technology within Internet technology. It can be applied to Internet fields such as portal websites and search engine websites that require large-scale, accurate, and real-time crawling of articles.

Background

[0002] Internet portal websites need to reprint a large number of articles every day and have high requirements for article quality. Many existing crawling systems can meet this demand, but they all suffer from the following three problems:

[0003] 1) Crawling systems that use machine-generated extraction wrappers can capture a large number of articles but cannot extract them accurately, so the usability of the crawled articles is low;

[0004] 2) The article extraction results of the crawling system using the artificially ge...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F17/30
Inventor: 吴华鹏, 曾明, 厉锟, 陈大伟
Owner: 凤凰在线(北京)信息技术有限公司