Method and apparatus for grabbing content of target page

A technology for capturing the content of a target page, applied in the field of web pages. It addresses problems such as inconsistent absolute paths, the inability to execute JS scripts, and capture scripts that lack a browser's parsing function, so as to improve capture efficiency and simplify the configuration of capture rules.

Inactive Publication Date: 2015-10-07

BEIJING QIHOO TECH CO LTD +1

8 Cites, 28 Cited by

Problems solved by technology

[0005] The first is the configured string-matching method, which directly matches the start and end characters of the target element in the web document. In practice, however, web pages vary greatly from article to article and the content area is complex, so simple string matching alone can rarely distinguish the information characteristics of the target element accurately, and the captured results are difficult to analyze.
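As an illustration of why start/end string matching is fragile, here is a minimal Python sketch. The function name `extract_between`, the two sample pages, and the markers are hypothetical and are not taken from the patent; they only show how a small change in the markup defeats a literal start marker.

```python
from typing import Optional


def extract_between(html: str, start: str, end: str) -> Optional[str]:
    """Return the text between the first `start` marker and the next `end` marker."""
    begin = html.find(start)
    if begin == -1:
        return None  # the start marker does not occur literally
    begin += len(start)
    stop = html.find(end, begin)
    if stop == -1:
        return None  # no end marker after the start marker
    return html[begin:stop]


# Hypothetical pages: the second one only adds an extra class to the same element.
page_a = '<div class="post"><p>Patch notes</p></div>'
page_b = '<div class="post featured"><p>Patch notes</p></div>'

START, END = '<div class="post">', '</div>'

print(extract_between(page_a, START, END))  # <p>Patch notes</p>
print(extract_between(page_b, START, END))  # None
```

The rule still describes the right element on the second page, but the literal start string no longer occurs, so nothing is captured.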

[0006] The second is the configured absolute-path matching method, which starts matching from the body of the web document. This method can capture the content of the target page more accurately. However, web pages load JS scripts lazily, and the crawling script lacks the parsing function of a browser kernel: it can only obtain the initial HTML code and cannot execute the JS scripts in that code. The content of the target page obtained by the crawling script therefore differs from what an actual browser obtains, which makes the absolute paths inconsistent and prevents the fetching process from working properly.
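The mismatch can be sketched in Python with the standard library's `xml.etree.ElementTree`, assuming for simplicity that the page is well-formed markup (real HTML usually needs a more lenient parser). Both documents and the path are hypothetical: `RENDERED` stands for what a browser shows after a script has inserted a banner, and `INITIAL` for what a crawling script without a browser kernel downloads.

```python
import xml.etree.ElementTree as ET

# Hypothetical page as a browser displays it after a lazily loaded script ran.
RENDERED = ("<html><body><div>banner</div><div>nav</div>"
            "<div><p>Article text</p></div></body></html>")
# The same page as the crawling script fetches it: the script has not run.
INITIAL = "<html><body><div>nav</div><div><p>Article text</p></div></body></html>"

# Absolute path written while looking at the rendered page:
# the third <div> under <body>, then its <p>.
ABSOLUTE_PATH = "./body/div[3]/p"


def text_at(document, path):
    """Return the text of the node at `path`, or None if the path matches nothing."""
    node = ET.fromstring(document).find(path)
    return None if node is None else node.text


print(text_at(RENDERED, ABSOLUTE_PATH))  # Article text
print(text_at(INITIAL, ABSOLUTE_PATH))   # None
```

The positional path depends on every sibling that precedes the target, so one node added or missing before it shifts the index and the lookup fails.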

Examples

Embodiment 1

[0061] Referring to Figure 1, which shows a schematic flow chart of a method for grabbing target page content according to the present invention, the method may specifically include:

[0062] Step 110, obtaining the webpage document of the target page;

[0063] In the embodiment of the present invention, before the content grabbing process is executed for the target page, that is, before the grabbing script is enabled, the relevant information for the target page is configured, such as the link address of the target page and the relative path information for the target page. The relative path information is used to find the location of the target page's content in the web document of the target page.

[0064] Of course, an embodiment of the present invention can provide a configuration interface that includes a configuration field for the link address of the target page, a configuration field for the relative path, etc. After the user confirms, th...
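The two configured items named above, the link address and the relative path information, could be held in a small record like the following Python sketch. The class name `GrabConfig`, its field names, the validation rules, and the example values are all assumptions made for illustration; the patent does not specify a data format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GrabConfig:
    """Hypothetical record of what is configured before the grabbing script runs."""
    link_address: str   # link address of the target page
    relative_path: str  # where the content sits in the web document

    def __post_init__(self):
        # Assumed sanity checks; the source does not describe any validation.
        if not self.link_address.startswith(("http://", "https://")):
            raise ValueError("link_address must be an http(s) URL")
        if not self.relative_path.strip():
            raise ValueError("relative_path must not be empty")


# Example values are placeholders, not taken from the patent.
config = GrabConfig(
    link_address="http://example.com/news/1.html",
    relative_path="div[class=article]",
)
print(config.relative_path)  # div[class=article]
```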

Embodiment 2

[0104] Referring to Figure 2, which shows a schematic flow chart of a method for grabbing target page content according to the present invention, the method may specifically include:

[0105] Step 210, obtaining the webpage document of the target page;

[0106] Step 220, according to the relative path information for the target page, searching the web document for the document object model node under the relative path information, wherein the relative path information is constructed based on attribute-related information of the document object model node;

[0107] Step 230, extracting the content of the target page from the document object model node according to preset regular-expression matches and/or matching expressions for the text before and after the content of the target page.
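Steps 220 and 230 can be sketched in Python as follows, again assuming well-formed markup so that the standard library's `xml.etree.ElementTree` can parse it. The sample document, the helper names `find_node` and `extract`, and the way the before/after markers and the regular expression are combined are assumptions for illustration, not the patent's implementation.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical web document, standing in for the result of Step 210.
DOC = """<html><body>
<div class="ad">Sponsored</div>
<div class="article"><h1>Title: Boss guide</h1><p>Updated 2015-01-02. Use fire spells.</p></div>
</body></html>"""


def find_node(document, tag, attr, value):
    """Step 220: locate a node by an attribute, wherever it sits in the tree."""
    return ET.fromstring(document).find(f".//{tag}[@{attr}='{value}']")


def extract(node, pattern=None, before=None, after=None):
    """Step 230: narrow the node's text by before/after markers and/or a regex."""
    text = "".join(node.itertext())
    if before is not None and after is not None:
        start = text.find(before)
        if start == -1:
            return None
        start += len(before)
        end = text.find(after, start)
        if end == -1:
            return None
        text = text[start:end]
    if pattern is not None:
        match = re.search(pattern, text)
        return match.group(1) if match else None
    return text


node = find_node(DOC, "div", "class", "article")
print(extract(node, before="Title: ", after="Updated"))      # Boss guide
print(extract(node, pattern=r"Updated (\d{4}-\d{2}-\d{2})"))  # 2015-01-02
```

Because the node is found by its attribute rather than by its position, the extra `div` before it does not affect the lookup.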

[0108] In the embodiment of the present invention, when extracting the content of the target page from the found document object model nodes, in order to more accurately extract the content required by ...

Embodiment 3

[0118] Referring to Figure 3, which shows a schematic flow chart of a method for grabbing target page content according to the present invention, the method may specifically include:

[0119] Step 310, according to the link address of the list page, obtaining the webpage document of the list page;

[0120] In this step, the list page is the target page mentioned in the first embodiment; that is, the target page is a list page.

[0121] Before step 310 in the embodiment of the present invention, the link address of the list page, the relative path information for the list area in the list page, the extraction rules for DOM nodes, etc. can be configured first. For example, as shown in Figure 3A, the user can open the list configuration page and configure the website name "Site", the website's column name "Zone", and the URL of the list page of this column, http://ng.d.cn/wushuangjianji/news/list_walkthrough_1.html. The relative path information ul[class=znewsList] of the URL link with the resource page...
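A minimal Python sketch of how the configured relative path `ul[class=znewsList]` might be used to find the list area and collect links from it. The list page URL and the relative path come from the text above; the sample markup, the link targets, and the helper names `to_xpath` and `resource_links` are hypothetical, and the `tag[attr=value]` parsing is only an assumed reading of the path syntax. The markup is kept well-formed so the standard library parser can handle it, and no network request is made.

```python
import re
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

LIST_URL = "http://ng.d.cn/wushuangjianji/news/list_walkthrough_1.html"
RELATIVE_PATH = "ul[class=znewsList]"

# Hypothetical list page markup; the real page is not reproduced in the source.
LIST_PAGE = """<html><body>
<ul class="nav"><li><a href="/">Home</a></li></ul>
<ul class="znewsList">
<li><a href="/wushuangjianji/news/1.html">Guide 1</a></li>
<li><a href="2.html">Guide 2</a></li>
</ul>
</body></html>"""


def to_xpath(relative_path):
    """Translate an assumed tag[attr=value] path into an ElementTree XPath."""
    match = re.fullmatch(r"(\w+)\[(\w+)=([^\]]+)\]", relative_path)
    if match is None:
        raise ValueError(f"unsupported relative path: {relative_path!r}")
    tag, attr, value = match.groups()
    return f".//{tag}[@{attr}='{value}']"


def resource_links(document, relative_path, base_url):
    """Collect absolute link URLs found inside the list area only."""
    area = ET.fromstring(document).find(to_xpath(relative_path))
    if area is None:
        return []
    return [urljoin(base_url, a.get("href")) for a in area.iter("a") if a.get("href")]


for url in resource_links(LIST_PAGE, RELATIVE_PATH, LIST_URL):
    print(url)
```

Links in the navigation list are ignored because only the node matching the configured attribute is searched, and relative links are resolved against the list page URL.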

Abstract

The invention discloses a method and an apparatus for grabbing the content of a target page and relates to the technical field of web pages. The method comprises the steps of: obtaining a page document of the target page; according to relative path information for the target page, finding in the page document a document object model node under the relative path information, wherein the relative path information is established based on attribute-related information of the document object model node; and extracting the content of the target page from the found document object model node. The invention solves two problems: in the character-string matching mode, it is very difficult to accurately distinguish the information characteristics of the target element and to analyze the grabbed results; in the absolute-path matching mode, the delayed loading of JS (JavaScript) on a web page leads to inconsistent absolute paths, so the grabbing process cannot execute normally. The beneficial effects are that grabbing the target page content is not disturbed by small-range page changes, the grabbing rules are simple to configure, and grabbing efficiency is improved.

Description

Technical field

[0001] The invention relates to the technical field of web pages, in particular to a method and device for grabbing target page content.

Background technique

[0002] With the development of the Internet, more and more users obtain various information through the Internet. There are many websites on the Internet, and the webpages within those websites are even more numerous. A user who wants to learn about a particular topic may need to visit multiple webpages on multiple websites to find the content he needs.

[0003] In view of this situation, and to make the content of target pages easier to access, content on a given topic is captured from the target pages of various websites by a collection tool, and the user can then browse the collected content directly. Users do not need to visit pages one by one to browse the content of the target page.

[0004] However, in the prior art, commonly used acquisition tools such as octopus ...

Application Information

IPC(8): G06F17/30

CPC: G06F16/9535

Inventor: 黄钊

Owner: BEIJING QIHOO TECH CO LTD
