Method and apparatus for grabbing content of target page

A technology for capturing the content of a target page, applied in the field of web pages. It addresses problems such as inconsistent absolute paths, the inability to execute JS scripts, and capture scripts that lack a parsing function, so as to improve capture efficiency and simplify the configuration of capture rules.

Inactive Publication Date: 2015-10-07
BEIJING QIHOO TECH CO LTD +1

AI Technical Summary

Problems solved by technology

[0005] The first is the configured string matching method, which directly matches the start and end characters of the target element in the web document. In practice, however, web pages vary greatly from article to article and the content area is complex, so simple string matching alone can hardly distinguish the information characteristics of the target element accurately, and the captured results are difficult to analyze.
[0006] The second is the configured absolute path matching method. This method starts matching fr...



Examples


Embodiment 1

[0061] Referring to Figure 1, a schematic flow chart of a method for grabbing target page content according to the present invention is shown. The method may specifically include:

[0062] Step 110, obtaining the webpage document of the target page;

[0063] In the embodiment of the present invention, before the content grabbing process for the target page is executed, that is, before the grabbing script is enabled, the relevant information for the target page is configured, such as the link address of the target page and the relative path information for the target page. The relative path information is used to locate the content of the target page within the web document of the target page.
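As a rough illustration of the configuration described in [0063], the sketch below holds a link address and a relative path in one record. The class names `CaptureConfig` and `RelativePath`, the function `parse_relative_path`, and the example URL are hypothetical. The `tag[attr=value]` notation is assumed from the form `ul[class=znewsList]` that appears in Embodiment 3, not from a format the patent specifies.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RelativePath:
    """Attribute-based location of a DOM node (hypothetical structure)."""
    tag: str
    attr: str
    value: str


@dataclass(frozen=True)
class CaptureConfig:
    """Information configured before the grabbing script is enabled."""
    link_address: str
    relative_path: RelativePath


# Assumed notation: tag[attr=value], e.g. ul[class=znewsList].
_RELATIVE_PATH = re.compile(r"^(\w+)\[([\w-]+)=([^\]]+)\]$")


def parse_relative_path(text: str) -> RelativePath:
    """Split a tag[attr=value] string into its three parts."""
    match = _RELATIVE_PATH.match(text.strip())
    if match is None:
        raise ValueError(f"not a tag[attr=value] expression: {text!r}")
    tag, attr, value = match.groups()
    # Tolerate optional quotes around the attribute value.
    return RelativePath(tag, attr, value.strip("'\""))


# Hypothetical target page and relative path, for illustration only.
config = CaptureConfig(
    link_address="http://example.com/news/1.html",
    relative_path=parse_relative_path("div[class=article]"),
)
```

Keeping the relative path as structured data rather than a raw string means a malformed entry is rejected at configuration time, before any page is fetched.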

[0064] Of course, the embodiment of the present invention can provide a configuration interface, which includes a configuration field for the link address of the target page, a configuration field for the relative path, and so on. After the user confirms, th...

Embodiment 2

[0104] Referring to Figure 2, a schematic flow chart of a method for grabbing target page content according to the present invention is shown. The method may specifically include:

[0105] Step 210, obtaining the webpage document of the target page;

[0106] Step 220, according to the relative path information for the target page, searching the web document for the document object model node at the relative path; wherein the relative path information is constructed based on the attribute-related information of the document object model node;

[0107] Step 230, extracting the content of the target page from the document object model node according to preset regular expressions and/or matching expressions for the text before and after the content of the target page.

[0108] In the embodiment of the present invention, when extracting the content of the target page from the found document object model nodes, in order to more accurately extract the content required by ...
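Steps 220 and 230 can be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions, not the patent's implementation: the names `NodeFinder`, `find_node_text`, `extract_regex` and `extract_between` are hypothetical, the sample page is invented, and the attribute comparison is an exact string match (a node with several class names would need extra handling). Step 210 (downloading the page) is replaced by an inline string so the example runs offline.

```python
import re
from html.parser import HTMLParser


class NodeFinder(HTMLParser):
    """Collect the text of the first element matching tag[attr=value]."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0        # nesting depth inside the matched element
        self.found = False    # True once a matching element was opened
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        if self.depth > 0:
            self.depth += 1   # nested element with the same tag name
        elif not self.found and dict(attrs).get(self.attr) == self.value:
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def find_node_text(html, tag, attr, value):
    """Step 220: locate the node by its attribute-based relative path."""
    finder = NodeFinder(tag, attr, value)
    finder.feed(html)
    finder.close()
    return "".join(finder.chunks) if finder.found else None


def extract_regex(text, pattern):
    """Step 230, regular-expression variant."""
    match = re.search(pattern, text, re.S)
    if match is None:
        return None
    return match.group(1) if match.groups() else match.group(0)


def extract_between(text, before, after):
    """Step 230, before/after matching variant."""
    start = text.find(before)
    if start == -1:
        return None
    start += len(before)
    end = text.find(after, start)
    return None if end == -1 else text[start:end].strip()


# Invented sample page standing in for the downloaded web document.
PAGE = """
<html><body>
<div class="nav">Home | News</div>
<div class="article"><h1>Title</h1>
<p>Source: Example Site Date: 2015-01-02</p>
<p>Body text.</p></div>
</body></html>
"""

node_text = find_node_text(PAGE, "div", "class", "article")
date = extract_regex(node_text, r"Date:\s*(\d{4}-\d{2}-\d{2})")
source = extract_between(node_text, "Source:", "Date:")
```

Because the node is located by its own attributes rather than by a path from the document root, changes elsewhere on the page (for example a new banner above the navigation bar) do not affect the lookup.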

Embodiment 3

[0118] Referring to Figure 3, a schematic flow chart of a method for grabbing target page content according to the present invention is shown. The method may specifically include:

[0119] Step 310, according to the link address of the list page, obtain the webpage document of the list page;

[0120] In this step, the list page is the target page of the first embodiment; that is, the target page is a list page.

[0121] Before step 310 in the embodiment of the present invention, the link address of the list page, the relative path information for the list area in the list page, the extraction rules for DOM nodes, and so on can be configured first. As shown in Figure 3A, the user can open the list configuration page of Figure 3A and configure the website name "Site", the column name of the website "Zone", and the URL of the list page of this column, http://ng.d.cn/wushuangjianji/news/list_walkthrough_1.html. The relative path information ul[class=znewsList] of the URL link with the resource page...
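A minimal sketch of how the list area configured above might be used: find the `ul[class=znewsList]` region in the list page and collect the links it contains, resolving each against the list page URL. The names `ListLinkCollector` and `resource_links` are hypothetical, the sample HTML and the `detail_*.html` file names are invented for illustration, and the page is given inline rather than downloaded.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListLinkCollector(HTMLParser):
    """Collect href values of <a> elements inside tag[attr=value] regions."""

    def __init__(self, tag, attr, value):
        super().__init__()
        self.tag, self.attr, self.value = tag, attr, value
        self.depth = 0   # nesting depth inside a matched list area
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == self.tag:
            if self.depth > 0:
                self.depth += 1          # nested list inside the list area
            elif attrs.get(self.attr) == self.value:
                self.depth = 1           # entered the configured list area
        elif tag == "a" and self.depth > 0 and attrs.get("href"):
            self.hrefs.append(attrs["href"])

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1


def resource_links(list_html, list_url, tag, attr, value):
    """Return absolute resource-page URLs found in the list area."""
    collector = ListLinkCollector(tag, attr, value)
    collector.feed(list_html)
    collector.close()
    return [urljoin(list_url, href) for href in collector.hrefs]


LIST_URL = "http://ng.d.cn/wushuangjianji/news/list_walkthrough_1.html"

# Invented markup; the real list page is not reproduced in this document.
SAMPLE_LIST_PAGE = """
<ul class="menu"><li><a href="/about.html">About</a></li></ul>
<ul class="znewsList">
  <li><a href="detail_1.html">First</a></li>
  <li><a href="/wushuangjianji/news/detail_2.html">Second</a></li>
</ul>
"""

links = resource_links(SAMPLE_LIST_PAGE, LIST_URL, "ul", "class", "znewsList")
```

Links outside the configured list area, such as the menu entry in the sample, are ignored, and relative links are turned into absolute addresses that a later step could fetch as resource pages.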


Abstract

The invention discloses a method and an apparatus for grabbing the content of a target page and relates to the technical field of web pages. The method comprises the steps of: obtaining a page document of the target page; according to relative path information for the target page, finding the document object model node at the relative path in the page document, wherein the relative path information is constructed based on attribute-related information of the document object model node; and extracting the content of the target page from the found document object model node. The invention solves two problems: in the string matching mode, it is very difficult to accurately distinguish the information characteristics of target elements and to analyze the grabbed results; in the absolute path matching mode, absolute paths are inconsistent and the grabbing process cannot be executed normally because the web page delays loading of JS (JavaScript). The beneficial effects are that grabbing the target page content is not disturbed by small-scale page changes, the grabbing rules are simple to configure, and grabbing efficiency is improved.

Description

Technical field

[0001] The invention relates to the technical field of web pages, in particular to a method and device for grabbing target page content.

Background

[0002] With the development of the Internet, more and more users obtain various kinds of information through the Internet. There are many websites on the Internet, and the webpages within those websites are even more numerous. If a user wants to learn about a certain topic, the user may need to visit multiple webpages on multiple websites to browse the required content.

[0003] In view of this situation, and to make it easier for users to access the content of target pages, the target page content on a given topic is captured from each website by a collection tool, so that users can browse the collected content directly and do not need to visit the pages one by one.

[0004] However, in the prior art, commonly used acquisition tools such as octopus ...


Application Information

IPC(8): G06F17/30
CPC: G06F16/9535
Inventor: 黄钊
Owner: BEIJING QIHOO TECH CO LTD