System, method and program for extracting web page core content based on web page layout

A technology for a core content extraction system, applied in fields such as instruments, computing, and electrical digital data processing. It addresses the problem that existing extraction of the core content of web pages is unsatisfactory.

Inactive Publication Date: 2006-06-14
IBM CN
Cites: 0; Cited by: 35

AI Technical Summary

Problems solved by technology

[0015] According to the above analysis, it can be seen that the existing methods for extracting the core content of web pag



Embodiment Construction

[0079] The following detailed description of specific embodiments of the present invention uses a number of terms. To aid understanding of this disclosure, these terms are explained together below:

[0080] 1) Tags related to tables

[0081] "Table-related tags" are HTML tags and include <table>, <tbody>, <tfoot>, <tr>, <th>, <td> and the like. Among them:

[0084]

[0085] <table> is used to create a data table, <tbody> represents the body of the table, <tfoot> denotes the footer of a table, <tr> represents a data row of a table, <th> defines header cells, and <td> creates data cells.
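
As a small illustration of how table-related tags can be recognized in a document, the following sketch tallies them with Python's standard `html.parser` module. The tag set `TABLE_TAGS` is an assumption based on the standard HTML element names (table, tbody, tfoot, tr, th, td), and the names `TableTagCounter` and `count_table_tags` are hypothetical; none of this code comes from the patent itself.

```python
from collections import Counter
from html.parser import HTMLParser

# Assumed set of table-related tags (standard HTML element names).
TABLE_TAGS = {"table", "tbody", "tfoot", "tr", "th", "td"}


class TableTagCounter(HTMLParser):
    """Counts opening occurrences of table-related tags."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag in TABLE_TAGS:
            self.counts[tag] += 1


def count_table_tags(html):
    """Return a dict mapping each table-related tag to its count."""
    parser = TableTagCounter()
    parser.feed(html)
    parser.close()
    return dict(parser.counts)


sample = (
    "<table><tbody>"
    "<tr><th>Name</th></tr>"
    "<tr><td>Ada</td></tr>"
    "</tbody></table>"
)
print(count_table_tags(sample))
```

Because the parser normalizes tag names to lowercase, the same function handles pages written with uppercase tags such as `<TABLE>` and `<TD>`.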

[0082] 2) Basic structure

[0083] "Basic structure" refers to the HTML tags included in the and , or the HTML tag pair and information items within. The information items may be images, text or image links, plain text, table structures, and the like. A basic structure can have another basic structure nested within it.

3) Table stru...


Abstract

The invention provides a system and method for extracting the core content of a web page. The system receives an HTML document (web page) and extracts its core content. It comprises: a text block analyzer, which uses HTML tags as delimiters to divide the text fragments in each available basic structure of the input HTML document into one or more independent text blocks and connects all the text blocks in order for output, where the available basic structures are those containing the core content of the web page; and a text block checker, which removes the text blocks that do not contain core content and outputs the rest as the core content of the web page. The invention determines whether each text block contains advertisements or navigation information, so it can accurately determine the core content of the web page while also improving processing efficiency.
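
The two-stage flow described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the patented implementation: it treats every HTML tag as a delimiter but does not select "available basic structures" first, and the rule used by the checker (drop link text and blocks shorter than `min_words` words) is an assumed stand-in, since the abstract does not say how advertisements and navigation information are detected. The names `TextBlockAnalyzer`, `check_blocks`, `extract_core_content`, and `min_words` are hypothetical.

```python
from html.parser import HTMLParser


class TextBlockAnalyzer(HTMLParser):
    """Splits page text into blocks, treating every HTML tag as a delimiter."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # (text, is_link_text) pairs in document order
        self._in_link = 0   # nesting depth inside <a> elements
        self._skip = 0      # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link += 1
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self._in_link -= 1
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        text = " ".join(data.split())  # collapse runs of whitespace
        if text and not self._skip:
            self.blocks.append((text, self._in_link > 0))


def check_blocks(blocks, min_words=5):
    """Assumed heuristic: keep blocks that are not link text and are long enough."""
    return [
        text
        for text, is_link in blocks
        if not is_link and len(text.split()) >= min_words
    ]


def extract_core_content(html, min_words=5):
    """Run the analyzer, then the checker, and join the surviving blocks."""
    analyzer = TextBlockAnalyzer()
    analyzer.feed(html)
    analyzer.close()
    return "\n".join(check_blocks(analyzer.blocks, min_words))


page = (
    "<div><a href='/'>Home</a> | <a href='/news'>News</a></div>"
    "<div><p>The World Wide Web has become a very large source of information.</p></div>"
    "<div>Buy now</div>"
)
print(extract_core_content(page))
```

In this sample the navigation links, the separator, and the short "Buy now" block are discarded, leaving only the sentence in the paragraph. A real checker would need stronger signals than block length and link membership.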

Description

Technical field

[0001] In general, the present invention relates to a system and method for extracting the core content of a web page, and to a computer program product for implementing the method. Specifically, the present invention relates to a system and method for extracting the core content of a web page by using the layout of the web page, and to a computer program product for implementing the method.

Background

[0002] With its rapid growth, the World Wide Web has become the largest source of information in many fields. How to extract information from the Internet effectively and automatically is one of the most active topics in the field of knowledge management. To make information on the Internet easy for users to read and browse, it is generally presented in the form of Hypertext Markup Language (HTML) files. HTML files not only include the information that users care about (called the core content of the web page), but also include standard...


Application Information

IPC(8): G06F17/30, G06F17/22
Inventors: 马立, 苏中, 刘世霞, 潘越
Owner: IBM CN