A big data web crawler page selection method and system

A web crawler and big data technique applied to big data web crawler paging selection. It addresses the problem that label information cannot be located, so web page data cannot be crawled cyclically. The stated effects are preventing interruption of the crawl process, improving processing efficiency, and improving crawling efficiency.

Active Publication Date: 2019-11-12

CHENGDU SEFON SOFTWARE CO LTD

Problems solved by technology

When crawling data, the HTML structure of the paging buttons on some web pages changes, so the corresponding label information can no longer be located and web page data cannot be crawled cyclically. For example, on some pages the HTML structure of the "Next Page" button changes after it has been clicked several times. Once the structure has changed, the "Next Page" button cannot be located through the original HTML locator, and web page data cannot be crawled in a loop.
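The failure described above can be reproduced in a few lines. The following is a minimal sketch, not the patent's implementation: the two HTML snippets, the function names, and the idea of contrasting a position-based locator with a text-based one are all assumptions made for illustration. It uses only the Python standard library.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element, in document order."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._in_anchor = False
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_anchor = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._in_anchor:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_anchor:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._in_anchor = False


def collect_anchors(html):
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return parser.anchors


def locate_by_position(html, index):
    """Fixed locator: always take the anchor at a given position."""
    anchors = collect_anchors(html)
    return anchors[index][0] if index < len(anchors) else None


def locate_by_text(html, pattern):
    """Text locator: take the first anchor whose text matches the pattern."""
    for href, text in collect_anchors(html):
        if re.search(pattern, text, re.IGNORECASE):
            return href
    return None


# Hypothetical pager markup: on later pages extra links appear before "Next Page".
FIRST_PAGE = '<div id="pager"><a href="?p=1">1</a><a href="?p=2">Next Page</a></div>'
LATER_PAGE = (
    '<div id="pager"><a href="?p=1">First</a>'
    '<a href="?p=2">Previous Page</a><a href="?p=4">Next Page</a></div>'
)
```

On `FIRST_PAGE` both locators agree. On `LATER_PAGE` the position-based locator returns the "Previous Page" link, so a crawler relying on it would bounce between pages instead of advancing, while the text-based locator still finds the correct link.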

Embodiment

[0112] Based on the configuration steps of the present invention and corresponding configuration modules, the crawler script of the paging configuration part is as follows:

    name: 'nextpage',
    css: '#ess_ctrl193591_ListC_AspNetPager>table>tbody>tr>td:nth-child(2)>a',
    type: 'list',
    regex: 'next page',
    rule: {
        name: 'Href',
        keys: [
            {
                name: 'Href',
                type: 'pagelink',
                css: 'a'
            },
            {
                name: 'title',
                type: 'text',
                css: 'a'
            },
            {
                name: 'txt',
                type: 'text',
                css: 'a'
            }
        ]
    }
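One way to read the paging configuration is sketched below. The source does not define how the `regex`, `pagelink`, and `text` fields are interpreted, so the semantics here are assumptions: `regex` filters candidate elements by their visible text, `pagelink` yields the element's link resolved against the current page URL, and `text` yields the element's text. The function name `apply_paging_rule` and the `candidates` input (standing in for the elements the outer `css` selector matched) are hypothetical.

```python
import re
from urllib.parse import urljoin

# The paging configuration from the embodiment, transcribed as a Python dict.
PAGING_CONFIG = {
    "name": "nextpage",
    "css": "#ess_ctrl193591_ListC_AspNetPager>table>tbody>tr>td:nth-child(2)>a",
    "type": "list",
    "regex": "next page",
    "rule": {
        "name": "Href",
        "keys": [
            {"name": "Href", "type": "pagelink", "css": "a"},
            {"name": "title", "type": "text", "css": "a"},
            {"name": "txt", "type": "text", "css": "a"},
        ],
    },
}


def apply_paging_rule(config, candidates, base_url):
    """Keep candidates whose text matches config['regex'] and extract the rule keys.

    `candidates` is a list of {"href": ..., "text": ...} dicts. Assumed semantics:
    'pagelink' -> absolute URL of the link, 'text' -> the element's text.
    """
    pattern = re.compile(config["regex"], re.IGNORECASE)
    results = []
    for element in candidates:
        if not pattern.search(element["text"]):
            continue  # not a paging label, skip it
        record = {}
        for key in config["rule"]["keys"]:
            if key["type"] == "pagelink":
                record[key["name"]] = urljoin(base_url, element["href"])
            elif key["type"] == "text":
                record[key["name"]] = element["text"]
        results.append(record)
    return results
```

Under these assumptions the text match, rather than the element's position, decides which candidate is the paging link, so extra pager links appearing on later pages do not change the result.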

[0136] The crawler script is as follows:

    name: 'liuyugaikuang',
    url: 'http://www.gdwater.gov.cn/yszx/ysgk/lygk',
    keys: [{
        name: 'news',
        css: 'body>div.wrap>div>div.glcom.clearfix>div.gl-right>ul>li',

[0142...

Abstract

The invention discloses a big data web crawler paging selection method and system. The method comprises the following steps: a crawler script is parsed; a matching character is obtained and used to match label information in the crawler script contents; the feature values of the matched labels are stored in a URL queue; a URL link address is taken from the URL queue and verified; the verified URL link address is used for address matching; and the web page at each URL address that matches successfully is parsed to obtain the paging information. The system comprises a first analysis module, a first matching module, a storage module, an acquisition module, a second matching module, a second analysis module and a configuration module. The invention solves the problem that web page data cannot be crawled cyclically when the HTML structure of a page button in the current web page changes: the paging labels of the data are recognized accurately, interruption of the cyclic data crawl is prevented, and the crawling efficiency for web page data is increased.
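The steps in the abstract can be read as a queue-driven loop. The sketch below is one possible interpretation, not the patented implementation. In it, "verification" is assumed to mean a syntactic check of the URL, and "address matching" is assumed to mean staying on the host of the start URL. The names `is_valid_url`, `crawl_pages`, `fetch`, `find_next`, and `max_pages` are hypothetical. `fetch` and `find_next` are passed in as callables, so the loop can run without network access.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def is_valid_url(url):
    """Assumed verification step: require an http(s) scheme and a host."""
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and bool(parts.netloc)


def crawl_pages(start_url, fetch, find_next, max_pages=100):
    """Follow paging links from start_url and return the URLs visited, in order.

    fetch(url) -> page content; find_next(content) -> href of the next page or None.
    """
    allowed_host = urlparse(start_url).netloc
    queue = deque([start_url])      # the URL queue
    seen = set()                    # guards against paging loops
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        if url in seen or not is_valid_url(url):
            continue                # failed verification
        if urlparse(url).netloc != allowed_host:
            continue                # failed address matching (assumed: same host)
        seen.add(url)
        content = fetch(url)        # retrieve and parse the matched page
        visited.append(url)
        next_href = find_next(content)
        if next_href:
            queue.append(urljoin(url, next_href))
    return visited
```

The `seen` set and the `max_pages` cap are additions of this sketch. They keep the loop finite even if a page's "next" link points back to an earlier page.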

Description

Technical field

[0001] The invention relates to the technical field of big data analysis and processing, and in particular to a big data web crawler paging selection method and system.

Background

[0002] With the rapid development of the network, data of every kind is generated on the World Wide Web at every moment. At present, China has about 4.54 million websites and more than 200 billion web pages. This surge of data holds great value, and extracting and using it effectively has become a major challenge: how to make this complicated and disorderly Internet data generate value, how to turn the World Wide Web into one's own database, and how to let enterprises easily handle this massive data in order to innovate and quickly spot business opportunities. Search engines, such as the traditional general-purpose search engines Google and Baidu, as a tool to assist people in retrievin...

Application Information

Patent Type & Authority: Patents (China)

IPC(8): G06F16/951, G06F16/955

CPC: G06F16/951, G06F16/9566

Inventor: 张志成, 王纯斌, 覃进学, 刘佳

Owner: CHENGDU SEFON SOFTWARE CO LTD
