Method and device for extracting contents of bodies of web pages

A webpage body-text extraction technology that addresses the low efficiency of existing webpage text extraction, improving accuracy and efficiency and offering strong versatility.

Active Publication Date: 2014-06-11

CHINA MOBILE COMM GRP CO LTD

5 Cites, 21 Cited by


Problems solved by technology

[0014] The present invention aims to overcome the low efficiency of webpage text content extraction in the prior art. According to one aspect of the present invention, a method for extracting webpage text content is proposed.

Embodiment Construction

[0033] The specific embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings, but it should be understood that the protection scope of the present invention is not limited by the specific embodiments.

[0034] The basic principles of the technical solution of the present invention:

[0035] (1) The webpage text content extraction method and device provided by the present invention are based on the HTML DOM tree. DOM stands for Document Object Model; a DOM-based parser converts a web page document into a set of object models, represented as a tree of nodes called the DOM tree.
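As an illustration of what a DOM tree looks like, the following is a minimal sketch that builds a node tree from an HTML string using Python's standard `html.parser` module. It is not the parser described in the patent; the names `DomNode`, `DomTreeBuilder`, `parse_dom`, and `outline` are hypothetical, and the list of void tags is deliberately partial.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag (partial list, enough for the sketch).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class DomNode:
    """One tree node: an element, or a '#text' leaf holding text."""

    def __init__(self, tag, parent=None, text=""):
        self.tag = tag
        self.parent = parent
        self.text = text
        self.children = []
        self.depth = 0 if parent is None else parent.depth + 1
        if parent is not None:
            parent.children.append(self)


class DomTreeBuilder(HTMLParser):
    """Builds a DomNode tree while the HTML is being parsed."""

    def __init__(self):
        super().__init__()
        self.root = DomNode("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = DomNode(tag, self.current)
        if tag not in VOID_TAGS:
            self.current = node  # descend into the new element

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():  # whitespace-only runs are not kept as text leaves
            DomNode("#text", self.current, data.strip())


def parse_dom(html):
    builder = DomTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, indent=0):
    """Return the tree as indented lines, one per node."""
    label = repr(node.text) if node.tag == "#text" else f"<{node.tag}>"
    lines = ["  " * indent + label]
    for child in node.children:
        lines.extend(outline(child, indent + 1))
    return lines


sample = "<html><body><h1>Title</h1><p>Hello <b>DOM</b> tree</p></body></html>"
tree = parse_dom(sample)
print("\n".join(outline(tree)))
```

In the printed outline, every piece of text appears as a `#text` leaf, while elements such as `<p>` and `<b>` are interior nodes.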

[0036] (2) From the characteristics of the DOM tree, it follows that the text must reside in the leaf nodes of the DOM tree, although not every leaf node contains text; the area containing all the text of the web page must be a subtree of the DOM tree, and this The re...

Abstract

The invention discloses a method and a device for extracting the body content of a web page. The method includes: parsing the web page document to be extracted to generate a document object model (DOM) tree structure, and collecting the leaf nodes of the DOM tree corresponding to the web page into a node set; finding, in the DOM tree, the parent nodes of certain leaf nodes, namely those positioned in the deepest layer of the node set; merging those leaf nodes into their parent nodes, and merging leaf nodes that share the same parent node with one another; if a merged leaf node in the node set meets a preset condition, determining that the zone covered by that leaf node is the zone where the body content of the web page is located; and removing the web page tags in the determined zone and extracting the body content of the web page. With the method and the device, the zone where the body of an HTML (hypertext markup language) page is located can be quickly and effectively located and separated from noise content, and the efficiency of acquiring body content information can be improved.
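The steps in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the "preset condition" is not given in the text available here, so the sketch substitutes a hypothetical rule (a merged node qualifies once it covers at least `ratio` of all text characters on the page). The names `Node`, `Builder`, `find_body_zone`, and `extract_body_text` are likewise hypothetical, and skipping `script` and `style` content is an added assumption.

```python
from collections import defaultdict
from html.parser import HTMLParser

VOID = {"br", "hr", "img", "input", "link", "meta"}  # partial list
SKIP = {"script", "style"}  # assumption: their contents are not page text


class Node:
    def __init__(self, tag, parent=None, text="", attrs=None):
        self.tag, self.parent, self.text = tag, parent, text
        self.attrs = attrs or {}
        self.children = []
        self.depth = 0 if parent is None else parent.depth + 1
        if parent is not None:
            parent.children.append(self)


class Builder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = self.cur = Node("#document")

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.cur, attrs=dict(attrs))
        if tag not in VOID:
            self.cur = node

    def handle_endtag(self, tag):
        node = self.cur
        while node.parent is not None and node.tag != tag:
            node = node.parent
        if node.parent is not None:
            self.cur = node.parent

    def handle_data(self, data):
        if data.strip() and self.cur.tag not in SKIP:
            Node("#text", self.cur, data.strip())


def text_of(node):
    """Text of a subtree with all tags removed."""
    if node.tag == "#text":
        return node.text
    return " ".join(t for t in map(text_of, node.children) if t)


def find_body_zone(root, ratio=0.6):
    # Node set: each text leaf mapped to the number of characters it covers.
    node_set, stack = {}, [root]
    while stack:
        n = stack.pop()
        if n.tag == "#text":
            node_set[n] = len(n.text)
        stack.extend(n.children)
    total = sum(node_set.values())
    if total == 0:
        return None
    while True:
        deepest = max(n.depth for n in node_set)
        if deepest == 0:
            return root
        # Merge every deepest node into its parent; siblings merge together.
        merged = defaultdict(int)
        for n in [n for n in node_set if n.depth == deepest]:
            merged[n.parent] += node_set.pop(n)
        for parent, chars in merged.items():
            node_set[parent] = node_set.get(parent, 0) + chars
            # Hypothetical preset condition: covers enough of the page text.
            if node_set[parent] >= ratio * total:
                return parent


def extract_body_text(html, ratio=0.6):
    builder = Builder()
    builder.feed(html)
    builder.close()
    zone = find_body_zone(builder.root, ratio)
    return "" if zone is None else text_of(zone)


page = """<html><head><title>T</title><script>var x = 1;</script></head><body>
<div id="nav"><ul><li><a href="/">Home</a></li><li><a href="/n">News</a></li></ul></div>
<div id="content"><p>First paragraph of the article body text.</p>
<p>Second paragraph with more body text.</p></div>
<div id="footer">Copyright</div>
</body></html>"""
print(extract_body_text(page))
```

On this sample page the merge proceeds upward layer by layer until the `content` div accumulates enough text to satisfy the assumed threshold, so navigation and footer text are left out. A different preset condition would change which subtree is selected.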

Description

Technical Field

[0001] The present invention relates to Internet technology within the field of communications, and in particular to a method and a device for extracting webpage text content.

Background Art

[0002] With its rapid development, the Internet has become an important way for people to obtain information, communicate with others, and share information. How to retrieve useful information on the Web more accurately, more quickly, and more comprehensively has become a research hotspot. In addition to the subject content, the webpages we browse every day also contain a large amount of content irrelevant to the subject, such as navigation information, copyright information, advertisements, and related links, which we call "noise" content. This noise content reduces retrieval efficiency and accuracy.

[0003] For the extraction of webpage text, there are three major types of mains...


Application Information


IPC(8): G06F17/30

CPC: G06F16/951

Inventor: 毛雅琴张远田冬吴淑燕

Owner: CHINA MOBILE COMM GRP CO LTD
