Intelligent extraction system and intelligent extraction method for article type web pages

An extraction technology for article-type web pages, applied in fields such as special data processing applications, instruments, and electrical digital data processing. It addresses three problems: articles cannot be extracted accurately, the usability of captured articles is low, and existing approaches rely heavily on human effort.

Active Publication Date: 2014-06-11
凤凰在线(北京)信息技术有限公司

AI Technical Summary

Problems solved by technology

[0003] 1) A crawling system that uses machine-generated extraction wrappers can capture a large number of articles, but it cannot extract them accurately, so the usability of the crawled articles is low;
[0004] 2) A crawling system that uses manually written extraction wrappers produces accurate extraction results, but extraction wrappers must be generated, updated and maintained for thousands of websites on the Internet. Ordinary vertical crawlers cannot handle this workload well and can only rely on a large amount of human effort;
[0017] 1) An extraction system that uses machine-generated extraction wrappers can capture a large number of articles, but it cannot extract them accurately, so the usability of the captured articles is low;
[0018] 2) An extraction system that uses manually written extraction wrappers produces accurate extraction results, but extraction wrappers must be generated, updated and maintained for thousands of websites on the Internet, and this work can only rely on a large amount of human effort;

Examples

Embodiment Construction

[0215] The real-time intelligent crawling system consists of five modules or subsystems, as shown in Figure 1: a real-time crawling module, the article-type web page intelligent extraction system, an approximate document deduplication module, an automatic document classification module, and an article publishing module.
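
The five modules listed in [0215] can be read as stages of a pipeline. The following is a minimal sketch of that reading, not the patent's implementation: the source does not specify how the modules are connected, so the stage order, the `Article` type, and every function and parameter name here are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Article:
    """Hypothetical record passed between stages."""
    url: str
    title: str
    body: str
    category: Optional[str] = None


def run_pipeline(
    crawled_pages: Iterable[Tuple[str, str]],           # (url, html) pairs from the crawling module
    extract: Callable[[str, str], Optional[Article]],   # article-type web page extraction
    is_near_duplicate: Callable[[Article, List[Article]], bool],  # approximate deduplication
    classify: Callable[[Article], str],                 # automatic classification
    publish: Callable[[Article], None],                 # article publishing
) -> List[Article]:
    """Run each crawled page through the assumed stage order and return what was published."""
    published: List[Article] = []
    for url, html in crawled_pages:
        article = extract(url, html)
        if article is None:
            continue  # extraction failed, so there is nothing to publish
        if is_near_duplicate(article, published):
            continue  # skip articles that closely match one already published
        article.category = classify(article)
        publish(article)
        published.append(article)
    return published
```

Passing each stage in as a callable keeps the sketch independent of any particular crawler, deduplication algorithm, or classifier, none of which the excerpt describes.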

[0216] Detailed technical scheme of the article type web page intelligent extraction system of the present invention

[0217] There are many technical solutions in the field of information extraction, the core of which is how to generate and maintain extraction wrappers. Technically, there are two main categories:

[0218] 1) An extraction system that uses machine-generated extraction wrappers can capture a large number of articles, but it cannot extract them accurately, so the usability of the captured articles is low;

[0219] 2) The extraction system adopts artificially generated extraction wrapper techn...

Abstract

An intelligent extraction system for article-type web pages comprises a module for loading web pages to be extracted, a wrapper query module, a web page extraction module, a module for collecting web pages that failed extraction, a learning judgment module, a web page learning module, and an extraction wrapper management module.
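
One way the seven modules might interact is sketched below. The abstract only names the modules, so the control flow is an assumption: look up a wrapper by site, try it, collect failures, and learn a new wrapper once enough failures accumulate. The same applies to the failure threshold, the `<h1>`-based toy learner, and all class and function names, which are hypothetical and not taken from the patent.

```python
import re
from collections import defaultdict
from typing import Callable, Dict, List, Optional
from urllib.parse import urlparse

# A wrapper is modeled here as a function from HTML to extracted fields (or None on failure).
Wrapper = Callable[[str], Optional[dict]]


class WrapperStore:
    """Stand-in for the wrapper query and extraction wrapper management modules."""

    def __init__(self) -> None:
        self._by_site: Dict[str, Wrapper] = {}

    def query(self, url: str) -> Optional[Wrapper]:
        return self._by_site.get(urlparse(url).netloc)

    def save(self, site: str, wrapper: Wrapper) -> None:
        self._by_site[site] = wrapper


def toy_learner(pages: List[str]) -> Optional[Wrapper]:
    """Toy stand-in for the web page learning module: accept an <h1> rule
    only if it matches every collected page (an assumption for illustration)."""
    pattern = re.compile(r"<h1>(.*?)</h1>", re.S)
    if not all(pattern.search(p) for p in pages):
        return None

    def wrapper(html: str) -> Optional[dict]:
        m = pattern.search(html)
        return {"title": m.group(1).strip()} if m else None

    return wrapper


class Extractor:
    def __init__(self, store: WrapperStore,
                 learn: Callable[[List[str]], Optional[Wrapper]],
                 min_failures: int = 3) -> None:
        self.store = store
        self.learn = learn
        self.min_failures = min_failures          # assumed threshold for the learning judgment
        self.failed: Dict[str, List[str]] = defaultdict(list)  # failed pages per site

    def process(self, url: str, html: str) -> Optional[dict]:
        site = urlparse(url).netloc
        wrapper = self.store.query(url)            # wrapper query
        result = wrapper(html) if wrapper else None  # web page extraction
        if result is not None:
            return result
        self.failed[site].append(html)             # collect the failed page
        if len(self.failed[site]) >= self.min_failures:  # learning judgment
            new_wrapper = self.learn(self.failed[site])  # web page learning
            if new_wrapper is not None:
                self.store.save(site, new_wrapper)       # wrapper management
                self.failed[site].clear()
        return None
```

In this sketch a page that triggers learning still returns `None`; only later pages from the same site benefit from the newly stored wrapper. Re-extracting the collected pages after learning would be a natural extension, but the excerpt does not say whether the patented system does so.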

Description

Technical field

[0001] The invention relates to a system and method for intelligently capturing Internet articles in real time. It can be applied to Internet fields such as portal websites and search engine websites that require large-scale, accurate, real-time capture of articles.

Background technique

[0002] Internet portal websites have a large daily demand for reprinted articles and place high requirements on article quality. Many existing crawling systems can meet this demand, but they all suffer from the following three problems:

[0003] 1) A crawling system that uses machine-generated extraction wrappers can capture a large number of articles, but it cannot extract them accurately, so the usability of the crawled articles is low;

[0004] 2) A crawling system that uses manually written extraction wrappers produces accurate extraction results, but it is necessar...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30, G06F17/27
Inventor: 吴华鹏, 曾明, 厉锟
Owner: 凤凰在线(北京)信息技术有限公司