Method and device for extracting contents of bodies of web pages

A technology for extracting the body text of web pages. It addresses the low efficiency of existing body-text extraction and achieves improved accuracy, improved efficiency, and strong versatility.

Active Publication Date: 2014-06-11
CHINA MOBILE COMM GRP CO LTD
Cites: 5; Cited by: 21

AI Technical Summary

Problems solved by technology

[0014] The present invention aims to overcome the low efficiency of web page body-text extraction in the prior art. According to one aspect of the present invention, a method for extracting the body text of a web page is proposed.



Examples


Embodiment Construction

[0033] The specific embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings, but it should be understood that the protection scope of the present invention is not limited by the specific embodiments.

[0034] The basic principles of the technical solution of the present invention:

[0035] (1) The web page body-text extraction method and device provided by the present invention are based on the HTML DOM tree. DOM is the abbreviation of Document Object Model; a DOM-based parser converts a web page document into a set of object models, represented as a tree of nodes called the DOM tree.
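As an illustration of the DOM tree described in [0035], the following minimal sketch builds a simplified node tree from an HTML string using Python's standard `html.parser` module. The names `Node`, `DomBuilder`, `parse_dom`, and `outline` are hypothetical helpers introduced here for illustration; they are not taken from the patent, and the tree they build is far simpler than a browser DOM (no attributes, no error recovery beyond skipping unmatched end tags).

```python
from html.parser import HTMLParser

# Elements that never have children or an end tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Simplified DOM node (hypothetical helper, not from the patent)."""

    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []
        if parent is not None:
            parent.children.append(self)


class DomBuilder(HTMLParser):
    """Builds a Node tree while the standard-library parser emits events."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:  # void elements cannot contain children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only runs between tags
            Node("#text", self.current, data.strip())


def parse_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def outline(node, indent=0):
    """Return the tree as indented lines, one line per node."""
    label = repr(node.text) if node.tag == "#text" else node.tag
    lines = ["  " * indent + label]
    for child in node.children:
        lines.extend(outline(child, indent + 1))
    return lines


root = parse_dom("<html><body><h1>Title</h1>"
                 "<p>Some <b>bold</b> text<br>here</p></body></html>")
print("\n".join(outline(root)))
```

In this sketch each run of text becomes its own `#text` leaf, so the text of the page always sits on leaf nodes, while elements such as `br` are leaves that carry no text.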

[0036] (2) From the characteristics of the DOM tree, it follows that the body text must be distributed over leaf nodes of the DOM tree, although not every leaf node contains body text; the region containing all the body text of the web page must be a subtree of the DOM tree, and this The re...
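The subtree property in [0036] can be illustrated with a small sketch: given a set of leaf nodes assumed to hold body text, the smallest subtree that contains all of them is rooted at their lowest common ancestor. The tree below is built by hand, text is stored directly on leaf elements for brevity, and the rule "leaves under `p` are body text" is only a stand-in for whatever selection the method actually uses. All names (`Node`, `leaves`, `ancestors`, `lowest_common_ancestor`) are hypothetical.

```python
class Node:
    """Simplified DOM node: a tag name, optional text, and child links."""

    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []
        if parent is not None:
            parent.children.append(self)


def leaves(node):
    """Collect the leaf nodes below `node` in document order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def ancestors(node):
    """Chain from the node itself up to the root, nearest first."""
    chain = []
    while node is not None:
        chain.append(node)
        node = node.parent
    return chain


def lowest_common_ancestor(nodes):
    """Root of the smallest subtree containing every node in `nodes`."""
    common = ancestors(nodes[0])
    for node in nodes[1:]:
        chain = set(ancestors(node))
        common = [a for a in common if a in chain]
    return common[0]


# Hand-built example page: a navigation bar plus an article.
html = Node("html")
body = Node("body", html)
nav = Node("nav", body)
Node("a", nav, "Home")
Node("a", nav, "About")
article = Node("article", body)
Node("p", article, "First paragraph of the story.")
Node("img", article)  # a leaf node that carries no text
Node("p", article, "Second paragraph of the story.")

all_leaves = leaves(html)
text_leaves = [n for n in all_leaves if n.text]
body_leaves = [n for n in text_leaves if n.tag == "p"]  # stand-in selection rule

print(len(all_leaves), len(text_leaves))
print(lowest_common_ancestor(body_leaves).tag)
print(lowest_common_ancestor(text_leaves).tag)
```

Here the `img` leaf shows that not every leaf holds text, and the paragraphs chosen as body text share the `article` subtree, whereas all text-bearing leaves together only share the much larger `body` subtree, which also contains the navigation noise.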


Abstract

The invention discloses a method and a device for extracting the body content of web pages. The method includes: parsing the web page document to be extracted to generate a document object model (DOM) tree structure, and combining the leaf nodes of the DOM tree corresponding to the web page into a node set; searching the DOM tree for the parent nodes of certain leaf nodes, namely the leaf nodes located in the deepest layer of the node set; merging those leaf nodes into their parent nodes, and merging leaf nodes that share the same parent node with one another; if the merged leaf nodes in the node set meet a preset condition, determining that the region covered by those leaf nodes is the region where the body content of the web page is located; and removing the web page tags in the determined region and extracting the body content. With the method and the device, the region where the body of an HTML (hypertext markup language) page is located can be quickly and effectively located and separated from noise content, and the efficiency of acquiring body content information can be improved.
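The steps in the abstract can be sketched roughly as follows. This is an illustrative reading, not the patented implementation: the abstract does not state the preset condition, so the sketch assumes a hypothetical one (a merged node is accepted once it holds at least `ratio` of all text characters on the page). The names `Node`, `DomBuilder`, `merge_deepest`, and `extract_body`, the `ratio` parameter, and the decision to ignore text inside `script` and `style` are all assumptions made for this example.

```python
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
SKIP_TEXT_IN = {"script", "style"}  # assumption: never body text


class Node:
    def __init__(self, tag, parent=None, text=""):
        self.tag, self.parent, self.text = tag, parent, text
        self.children = []
        self.depth = 0 if parent is None else parent.depth + 1
        if parent is not None:
            parent.children.append(self)


class DomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip() and self.current.tag not in SKIP_TEXT_IN:
            Node("#text", self.current, " ".join(data.split()))


def leaves(node):
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


def merge_deepest(entries):
    """Merge the deepest entries into their parents; siblings share one entry."""
    deepest = max(node.depth for node, _ in entries)
    merged, slot = [], {}
    for node, pieces in entries:
        parent = node.parent
        if node.depth != deepest or parent is None:
            merged.append((node, pieces))
        elif parent in slot:  # same parent seen before: merge together
            merged[slot[parent]][1].extend(pieces)
        else:
            slot[parent] = len(merged)
            merged.append((parent, list(pieces)))
    return merged


def extract_body(html, ratio=0.6):
    """Return tag-free text of the first merged node meeting the assumed condition."""
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    # Node set: every leaf with the text pieces it carries (tags already dropped).
    entries = [(leaf, [leaf.text] if leaf.text else [])
               for leaf in leaves(builder.root)]
    total = sum(len(p) for _, pieces in entries for p in pieces)
    if total == 0:
        return ""
    while True:
        for node, pieces in entries:
            if sum(len(p) for p in pieces) >= ratio * total:  # assumed condition
                return " ".join(pieces)
        if entries[0][0].parent is None:  # only the root is left
            return " ".join(entries[0][1])
        entries = merge_deepest(entries)


PAGE = """
<html><head><title>Demo</title><style>p {margin: 0}</style></head>
<body>
  <div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
  <div id="content">
    <p>The first paragraph of the article carries most of the page text.</p>
    <p>The second paragraph continues the <b>main</b> story at length.</p>
  </div>
  <div id="footer">Copyright notice</div>
</body></html>
"""
print(extract_body(PAGE))
```

On this sample page the loop merges text leaves upward layer by layer until the `content` block holds more than the assumed 60% of the page text, at which point its text is returned without the navigation and footer strings. A different preset condition would change where the merging stops.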

Description

Technical field

[0001] The present invention relates to Internet technology in the field of communications, and in particular to a method and a device for extracting the body content of a web page.

Background

[0002] With its rapid development, the Internet has become an important way for people to obtain information, communicate with others, and share information. How to retrieve useful information on the Web more accurately, quickly, and comprehensively has become a research hotspot. In addition to the subject content, the web pages we browse every day also contain a large amount of content irrelevant to the subject, such as navigation information, copyright information, advertisements, and related links, which we call "noise" content. This noise content reduces retrieval efficiency and accuracy.

[0003] For the extraction of webpage text, there are three major types of mains...


Application Information

IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 毛雅琴张远田冬吴淑燕
Owner: CHINA MOBILE COMM GRP CO LTD