Web page information extraction system and method

A web page information extraction technology, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses the problems of dissimilar page structures, fast website updates, and long running times, with the effects of improved accuracy, improved recall, and fast extraction.

Active Publication Date: 2009-06-24
INST OF COMPUTING TECH CHINESE ACAD OF SCI
0 Cites, 64 Cited by

AI Technical Summary

Problems solved by technology

[0011] First, when crawling similar web pages within a website, current web crawlers decide whether pages are similar by checking whether they are under the same URL path. However, websites today contain a large number of dynamic URLs, and the URL paths of pages can differ greatly in structure even when the pages are of the same kind. As a result, the wrapper file generated from the training web pages fails to extract the pages in the web page collection that were generated by different web page templates.
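The weakness of URL-path grouping described in [0011] can be illustrated with a small sketch. The URLs below are hypothetical, and `group_by_url_path` is an illustrative helper, not part of the patent; it only shows how a path-based check might behave on dynamic URLs.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import posixpath


def group_by_url_path(urls):
    """Group URLs by (host, directory part of the path), ignoring the query string."""
    groups = defaultdict(list)
    for url in urls:
        parts = urlsplit(url)
        directory = posixpath.dirname(parts.path) or "/"
        groups[(parts.netloc, directory)].append(url)
    return dict(groups)


# Hypothetical URLs: the first two are dynamic URLs that share a path but are
# assumed to render different kinds of page; the last two are assumed to be
# the same kind of page stored under different paths.
urls = [
    "http://example.com/view.php?type=product&id=17",
    "http://example.com/view.php?type=forum&id=17",
    "http://example.com/a/2008/item17.html",
    "http://example.com/b/archive/item18.html",
]
groups = group_by_url_path(urls)
for key, members in groups.items():
    print(key, len(members))
```

Under these assumptions, the path-based grouping merges the two dynamic URLs into one group and splits the two static pages into separate groups, which is the mismatch the paragraph describes.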
[0012] Second, even when web pages are generated by the same web page template, they contain many non-template nodes, and these nodes differ from page to page. A wrapper file generated from only a few training pages often cannot cover all of these differences, so it cannot extract that part of the pages. The traditional remedy is to submit the pages that cannot be extracted correctly to the user, have the user mark the data areas in them, and then feed them back to the extraction program as training pages to regenerate the wrapper.
[0013] Third, current web page extraction systems all face a trade-off among accuracy, degree of automation, and the manual intervention required. For example, systems with higher accuracy that need fewer training samples often run longer and cannot meet the needs of online real-time extraction, whereas systems that are more efficient in the extraction stage often need more training pages and manual intervention to generate wrapper files with good precision and recall.
[0014] Fourth, websites are updated quickly. After a correct wrapper file has been generated, a revision of the website means that the wrapper file generated from the old version of the pages can no longer extract pages from the revised site.
[0015] Fifth, many current web page extraction techniques target only certain types of websites. For example, they can extract only news pages, or only certain attributes of a certain kind of object, such as the price and name of a product.


Examples


Embodiment Construction

[0072] The present invention will be further described in detail below in conjunction with the accompanying drawings.

[0073] The system structure of the present invention is shown in Figure 1 and includes:

[0074] The template generation module 101 is configured to select webpages to be automatically marked from the webpage collection, classify the webpages to be automatically marked according to the training webpages marked by the user, and generate webpage templates corresponding to the categories of the training webpages.
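The classification step in [0074] can be sketched as follows. The source does not state how pages are compared, so this sketch assumes a simple measure: the Jaccard similarity between the sets of root-to-node tag paths of two pages. The names `tag_paths`, `jaccard`, `classify`, and the `threshold` parameter are all hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class TagPathCollector(HTMLParser):
    """Collect the set of root-to-node tag paths, e.g. 'html/body/div/h1'."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify(page_html, training_pages, threshold=0.6):
    """Return the category of the most similar training page, or None."""
    paths = tag_paths(page_html)
    best_category, best_similarity = None, 0.0
    for category, html in training_pages.items():
        similarity = jaccard(paths, tag_paths(html))
        if similarity > best_similarity:
            best_category, best_similarity = category, similarity
    return best_category if best_similarity >= threshold else None


# Hypothetical training pages, one per user-labeled category.
training_pages = {
    "product": "<html><body><div><h1>A</h1><table><tr><td>1</td></tr></table></div></body></html>",
    "news": "<html><body><div><h1>T</h1><p>x</p><p>y</p></div></body></html>",
}
page = "<html><body><div><h1>B</h1><table><tr><td>2</td></tr><tr><td>3</td></tr></table></div></body></html>"
unrelated = "<html><body><ul><li>x</li></ul></body></html>"
print(classify(page, training_pages), classify(unrelated, training_pages))
```

Pages that match no training page closely enough are left unclassified here; how the described system treats such pages is not stated in this excerpt.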

[0075] The webpage homogenization module 102 is used for shielding, according to the webpage template of the category, the difference between the webpage to be automatically marked belonging to the category and the webpage template of the category.
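One possible reading of the shielding step in [0075] is sketched below. It assumes, purely for illustration, that a template is a set of tag paths and that a page is a list of `(tag_path, text)` nodes; nodes whose path is absent from the template are set aside. The function `homogenize` and the sample data are hypothetical.

```python
def homogenize(page_nodes, template_paths):
    """Split a page's nodes into those covered by the template and those shielded."""
    kept, shielded = [], []
    for path, text in page_nodes:
        if path in template_paths:
            kept.append((path, text))
        else:
            shielded.append((path, text))
    return kept, shielded


# Hypothetical template for one category, expressed as tag paths.
template_paths = {
    "html",
    "html/body",
    "html/body/div",
    "html/body/div/h1",
    "html/body/div/span",
}
# Hypothetical page of that category with two nodes the template lacks.
page_nodes = [
    ("html/body/div/h1", "Lamp"),
    ("html/body/div/span", "12.00"),
    ("html/body/div/iframe", "ad banner"),
    ("html/body/ul/li", "related item"),
]
kept, shielded = homogenize(page_nodes, template_paths)
print(kept)
print(shielded)
```

After this step every page of the category exposes only nodes the template also has, so a wrapper learned from the training pages sees a uniform structure.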

[0076] The automatic labeling module 103 is used to parse the training webpages of the category, generate a first wrapper file, and automatically label the webpage to be automatically labe...


Abstract

The invention relates to a system and method for extracting web page information. The system comprises a template generation module, a web page homogenization module, an automatic tagging module, a wrapper file generation module and an online extraction module. The template generation module selects web pages to be automatically tagged from a web page collection and classifies them according to training web pages tagged by a user, so as to generate a web page template for each category. The web page homogenization module shields the differences between the web pages to be automatically tagged and the web page template of the category to which they belong. The automatic tagging module analyzes the training web pages of the category to generate a first wrapper file, and automatically tags the web pages to be automatically tagged according to the first wrapper file, so as to generate new training web pages. The wrapper file generation module analyzes all the training web pages and generates a second wrapper file. The online extraction module applies the second wrapper file to extract information from the unselected web pages in the web page collection. The invention allows multiple templates to be generated for inhomogeneous web pages, and can extract multiple records from a web page and multiple attributes of each record.
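The data flow between the tagging, wrapper generation, and online extraction modules can be sketched as below. This is a toy under stated assumptions, not the patented method: a page is a dict from tag path to text, a wrapper is a dict from field name to the set of tag paths where labeled values were seen, and classification and homogenization are assumed to have already happened. All function names and sample data are hypothetical.

```python
def learn_wrapper(training_pages):
    """Map each field to the set of tag paths where labeled values were found."""
    wrapper = {}
    for page, labels in training_pages:
        for field, value in labels.items():
            for path, text in page.items():
                if text == value:
                    wrapper.setdefault(field, set()).add(path)
    return wrapper


def apply_wrapper(wrapper, page):
    """Extract field values using the first known path present in the page."""
    record = {}
    for field, paths in wrapper.items():
        for path in sorted(paths):
            if path in page:
                record[field] = page[path]
                break
    return record


def run_pipeline(user_labeled, selected, remaining):
    """user_labeled: {category: [(page, labels)]}; selected: {category: [page]}."""
    all_training = []
    for category, training in user_labeled.items():
        first_wrapper = learn_wrapper(training)  # per-category first wrapper
        all_training.extend(training)
        for page in selected.get(category, []):
            labels = apply_wrapper(first_wrapper, page)  # automatic tagging
            if labels:
                all_training.append((page, labels))  # new training page
    second_wrapper = learn_wrapper(all_training)  # built from all training pages
    return [apply_wrapper(second_wrapper, page) for page in remaining]


# Hypothetical data: two page layouts ("A" and "B") for product pages.
user_labeled = {
    "A": [({"html/body/div/h1": "Lamp", "html/body/div/span": "12.00"},
           {"name": "Lamp", "price": "12.00"})],
    "B": [({"html/body/h2": "Desk", "html/body/table/tr/td": "80.00"},
           {"name": "Desk", "price": "80.00"})],
}
selected = {"A": [{"html/body/div/h1": "Chair", "html/body/div/span": "35.00"}]}
remaining = [
    {"html/body/h2": "Shelf", "html/body/table/tr/td": "49.00"},
    {"html/body/div/h1": "Stool", "html/body/div/span": "15.00"},
]
records = run_pipeline(user_labeled, selected, remaining)
print(records)
```

In this toy the second wrapper covers both layouts because it is learned from the training pages of every category, which mirrors the abstract's point about handling inhomogeneous pages. Matching labels by exact text equality is a simplification chosen to keep the sketch short.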

Description

Technical field

[0001] The invention belongs to the field of network information processing, and in particular relates to a system and method for extracting web page information.

Background

[0002] Current web page extraction technology can be divided, according to the application field, into domain-specific web page extraction technology and general web page extraction technology.

[0003] In web page extraction technology for a specific field, it is usually necessary to make some assumptions about the content to be extracted, for example extracting the body text of news web pages, or extracting specific attributes in web pages such as product prices. This type of method often extracts web pages according to the characteristics of the object to be extracted, either through statistical methods or by summarizing heuristic rules. However, due to the particularity of the extracted objects, the generality of this method and the ty...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 吴博, 王宇, 张刚, 丁国栋, 程学旗
Owner: INST OF COMPUTING TECH CHINESE ACAD OF SCI