Webpage information extracting method and system

A web page information extraction technology, applied in the computer field, that addresses missed web page information, inconvenient detection of web page information, and poor monitoring of site template changes, and that improves extraction accuracy.

Active Publication Date: 2012-08-29
SHENZHEN SHI JI GUANG SU INFORMATION TECH

AI Technical Summary

Problems solved by technology

[0004] A large site may use multiple sets of templates, and the internal structure of a single template can itself be very complicated. This makes templates difficult to configure and can lead to web page information being omitted.
Moreover, site templates change locally fairly often. Configured templates must be updated frequently to keep the extraction of web page element information accurate, and changes to site templates are not easy to monitor.
Existing web page information extraction methods therefore make it inconvenient to detect web page information, and they do not handle the pages of certain types of sites well.
For vertical search in particular, the information in a web page must be extracted accurately and the extracted information must be classified accurately by type; existing automatic extraction methods based on machine learning become difficult to apply or perform poorly.
[0005] Existing web page information extraction technology thus needs to improve both its extraction accuracy for relatively large sites and the recall rate of the extracted information.



Examples


Embodiment Construction

[0022] The embodiment of the present invention provides a technical solution that accurately extracts different types of information from web pages by configuring templates and using DOM (Document Object Model) trees. It addresses the poor results of automatic web page extraction on websites and, in particular, the problem of precise extraction in vertical search.

[0023] In the embodiment of the present invention, a corresponding template set may be defined in advance for each website, and a template set may include one or more templates. A template may be defined in an XML (Extensible Markup Language) file or in another file format suited to web page extraction. Different templates correspond to different web page information organization structures (also called web page frames), and a template is used to extract content blocks from a web page based on the corresponding web page frame. The extracted content blocks may be on...
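As a rough illustration of a per-site template set stored as XML, the sketch below loads a set containing two templates. The source does not give the template schema, so every element and attribute name here (`templateset`, `template`, `frame`, `block`, `type`), the site name, and the helper `load_template_set` are assumptions made for the example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical template set for one website. The element and attribute
# names are illustrative assumptions, not the schema used by the patent.
TEMPLATE_SET_XML = """
<templateset site="news.example.com">
  <template name="article" frame="div#main">
    <block tag="h1" type="title"/>
    <block tag="p" type="body"/>
  </template>
  <template name="listing" frame="ul#items">
    <block tag="li" type="item"/>
  </template>
</templateset>
"""


def load_template_set(xml_text):
    """Parse a template set and return (site, templates).

    templates maps each template name to its web page frame and to the
    list of (tag, type) pairs describing the content blocks it extracts.
    """
    root = ET.fromstring(xml_text.strip())
    templates = {}
    for tpl in root.findall("template"):
        blocks = [(b.get("tag"), b.get("type")) for b in tpl.findall("block")]
        templates[tpl.get("name")] = {"frame": tpl.get("frame"), "blocks": blocks}
    return root.get("site"), templates


site, templates = load_template_set(TEMPLATE_SET_XML)
for name, tpl in templates.items():
    print(site, name, tpl["frame"], tpl["blocks"])
```

Keeping one template per web page frame means a local change to a site layout only requires editing the template for that frame, not the whole set.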


Abstract

The invention discloses a webpage information extracting method and system. The method comprises the following steps: parsing a webpage to be extracted into a document object model (DOM) tree and obtaining the template corresponding to that webpage; traversing the DOM tree according to the webpage division granularity defined by the template and dividing the webpage into content blocks; and outputting the content and type information of the content blocks according to an output rule defined by the template. The method and system can improve the precision of webpage information extraction.
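The three steps in the abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the division fineness (granularity) is modeled, as an assumption, by a `block_tags` mapping that names the tags at which traversal stops and a content block is emitted, and the output rule is modeled as the type label attached to each tag. The names `Node`, `DomBuilder`, `extract_blocks`, and `TEMPLATE` are hypothetical.

```python
from html.parser import HTMLParser


class Node:
    """A minimal DOM node: a tag plus ordered child nodes and text strings."""

    def __init__(self, tag, parent=None):
        self.tag, self.parent = tag, parent
        self.children = []

    def text(self):
        parts = [c if isinstance(c, str) else c.text() for c in self.children]
        return " ".join(p.strip() for p in parts if p.strip())


class DomBuilder(HTMLParser):
    """Step 1: parse HTML into a simple DOM tree."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in self.VOID:  # void elements never get children
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the matching open element; ignore stray close tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.children.append(data)


def extract_blocks(html, template):
    """Steps 2 and 3: traverse the tree, split it into blocks, emit type and content."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    blocks = []

    def walk(node):
        block_type = template["block_tags"].get(node.tag)
        if block_type is not None:
            blocks.append({"type": block_type, "content": node.text()})
            return  # do not divide below the configured granularity
        for child in node.children:
            if isinstance(child, Node):
                walk(child)

    walk(builder.root)
    return blocks


# Hypothetical template: which tags form blocks and the type to output for each.
TEMPLATE = {"block_tags": {"h1": "title", "p": "body"}}
PAGE = ("<html><body><div><h1>Hello</h1>"
        "<p>First <b>bold</b> text.</p><p>Second.</p></div></body></html>")

result = extract_blocks(PAGE, TEMPLATE)
for block in result:
    print(block["type"], "->", block["content"])
```

Stopping the traversal at a block tag keeps nested markup such as `<b>` inside its enclosing block, which is one simple way to make the granularity configurable per template.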

Description

Technical field

[0001] The invention relates to the field of computer technology, in particular to a web page information extraction method and system thereof.

Background technique

[0002] Web information extraction refers to extracting target information from web pages, which is a basic link in search engines. Web pages themselves are structured data, so there are many features that can be used to extract information from them. For example, a browser can render the source code of a web page into a beautiful web page through the structural information in the web page. For a search engine, it is not only necessary to use structured information to extract the information, but also to perform further processing on the extracted information, such as classifying and labeling the extracted content.

[0003] In the face of massive and ever-changing data, the mainstream of current web page extraction methods is automatic extraction and classification based on machine learning. M...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
Inventor: 王传刚, 杨巍, 张立明
Owner: SHENZHEN SHI JI GUANG SU INFORMATION TECH