Web page content extraction method and apparatus, and computing device

A web page content extraction technology for computing devices, applied in the field of the Internet, which addresses problems such as high time consumption and cost

Active Publication Date: 2017-07-14
QILIN HESHENG NETWORK TECH INC
Cites: 5; Cited by: 13

AI Technical Summary

Problems solved by technology

However, in this way, a complete DOM tree needs to be created and traversed eve

Example Embodiment

[0035] Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

[0036] Figure 1 shows a schematic diagram of a web content extraction system 100 according to an embodiment of the present invention. As shown in Figure 1, the web content extraction system 100 includes a computing device 200, a server 310, and a server 320. The web content extraction system 100 in Figure 1 is only exemplary. In specific practical situations, there may be a different number of com...


Abstract

The present invention discloses a web page content extraction method and apparatus, and a computing device. The method is suitable for execution in the computing device, which comprises a data storage device. The method comprises: obtaining an HTML document of a web page to be processed; according to the domain name of the web page to be processed, obtaining the corresponding node matching rule from the data storage device, wherein the node matching rule is generated based on a DOM tree of a source web page associated with the web page to be processed; constructing a target DOM tree, wherein the target DOM tree is initialized to be empty; processing the HTML document by using the node matching rule so as to update the target DOM tree; and obtaining each node in the updated target DOM tree so as to extract the content of the web page to be processed.
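The steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the rule format (a set of tag and class pairs), the in-memory `RULE_STORE` standing in for the data storage device, and the names `RuleDrivenExtractor` and `extract` are all assumptions made for illustration. The sketch streams the HTML through the standard-library `html.parser` and adds a node to an initially empty target tree only when the node matches a rule, so no complete DOM tree of the page is built.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical stand-in for the data storage device:
# domain name -> node matching rule, here a set of (tag, class) pairs.
RULE_STORE = {
    "news.example.com": {("h1", "title"), ("div", "article")},
}

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=()):
        self.tag = tag
        self.attrs = dict(attrs)
        self.text = []       # text found anywhere inside this node
        self.children = []   # matched descendants only


class RuleDrivenExtractor(HTMLParser):
    """Streams HTML and keeps only nodes that match the rule."""

    def __init__(self, rules):
        super().__init__(convert_charrefs=True)
        self.rules = rules
        self.root = Node("#target")  # target tree, initially empty
        self.stack = []              # (tag, Node or None) per open element
        self.open_matches = []       # matched nodes that are still open

    def _matches(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        return any((tag, c) in self.rules for c in classes)

    def handle_starttag(self, tag, attrs):
        node = None
        if self._matches(tag, attrs):
            node = Node(tag, attrs)
            parent = self.open_matches[-1] if self.open_matches else self.root
            parent.children.append(node)  # update the target tree
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, node))
        if node is not None:
            self.open_matches.append(node)

    def handle_endtag(self, tag):
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return  # stray closing tag, ignore it
        while self.stack:
            open_tag, node = self.stack.pop()
            if node is not None:
                self.open_matches.pop()
            if open_tag == tag:
                break

    def handle_data(self, data):
        if self.open_matches:
            self.open_matches[-1].text.append(data)


def walk(node):
    """Pre-order traversal of the nodes below `node`."""
    for child in node.children:
        yield child
        yield from walk(child)


def extract(url, html):
    """Returns (tag, text) for every node in the updated target tree."""
    rules = RULE_STORE.get(urlparse(url).hostname)
    if rules is None:
        return []
    parser = RuleDrivenExtractor(rules)
    parser.feed(html)
    parser.close()
    return [(n.tag, "".join(n.text).strip()) for n in walk(parser.root)]
```

Calling `extract(url, html)` returns one `(tag, text)` pair per matched node, in document order. A page whose domain has no stored rule yields an empty list in this sketch; how the actual method handles that case is not stated in the text above.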

Description

Technical field

[0001] The invention relates to the technical field of the Internet, and in particular to a web page content extraction method, apparatus, and computing device.

Background

[0002] Each website on the Internet has its own web pages, and the structure and layout of these pages differ considerably. Parsing a web page and extracting its content is a tedious and time-consuming task. At present, most methods for extracting web page content are based on DOM trees: the web page content is organized into a DOM tree, the DOM tree is traversed, and the information in the required nodes is collected to form the web page content to be extracted.

[0003] DOM stands for Document Object Model. It uses the tag information of an HTML document, such as Table and List, to logically parse the document into a tree structure in which each node is an object. After the DOM tree is built, it traverses each nod...
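For comparison, the conventional DOM-based approach described in the background can be sketched as follows. This is only an illustrative sketch using Python's standard-library `html.parser`, not any particular browser's DOM implementation; the names `DomNode`, `DomBuilder`, `build_dom`, `iter_nodes`, and `texts_by_tag` are hypothetical. The point it shows is that every element and text run in the page becomes a node, and every node is visited, even when only a few are wanted.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomNode:
    def __init__(self, tag, attrs=None, text=""):
        self.tag = tag
        self.attrs = attrs or {}
        self.text = text      # only used by "#text" nodes
        self.children = []


class DomBuilder(HTMLParser):
    """Builds a complete tree: one node per element and per text run."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = DomNode("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = DomNode(tag, dict(attrs))
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.stack[-1].children.append(DomNode("#text", text=data))


def build_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def iter_nodes(node):
    """Pre-order traversal that visits every node in the tree."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def texts_by_tag(root, tag):
    """Traverses the whole tree and collects the text under each `tag` element."""
    return [
        "".join(n.text for n in iter_nodes(el) if n.tag == "#text").strip()
        for el in iter_nodes(root)
        if el.tag == tag
    ]
```

With this approach the cost of building and traversing the tree grows with the size of the whole page, regardless of how little of it is actually extracted.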

Application Information

IPC(8): G06F17/30
CPC: G06F16/9577, G06F16/986
Inventor 李涛
Owner QILIN HESHENG NETWORK TECH INC