
A big data web crawler page selection method and system

The method applies web crawler and big data technology to paging selection for big data web crawlers. It addresses cases where the label information of page buttons cannot be located and web page data therefore cannot be crawled cyclically, with the effect of preventing process interruption and improving processing and crawling efficiency.

Active Publication Date: 2019-11-12
CHENGDU SEFON SOFTWARE CO LTD

AI Technical Summary

Problems solved by technology

When crawling data, changes in the HTML structure of the paging buttons on some web pages mean that the corresponding label information can no longer be located, so web page data cannot be crawled cyclically. For example, on some web pages the HTML structure of the "Next Page" button changes after it has been clicked several times. Once the structure changes, the original HTML locator no longer finds the "Next Page" button, and web page data cannot be crawled in a loop.



Examples


Embodiment

[0112] Based on the configuration steps of the present invention and corresponding configuration modules, the crawler script of the paging configuration part is as follows:

[0113]-[0135]

    name: 'nextpage',
    css: '#ess_ctrl193591_ListC_AspNetPager>table>tbody>tr>td:nth-child(2)>a',
    type: 'list',
    regex: 'next page',
    rule: {
        name: 'Href',
        keys: [
            {
                name: 'Href',
                type: 'pagelink',
                css: 'a'
            },
            {
                name: 'title',
                type: 'text',
                css: 'a'
            },
            {
                name: 'txt',
                type: 'text',
                css: 'a'
            }
        ]
    }
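The paging configuration above pairs a positional CSS selector with a regex on the link text ('next page'). The following Python sketch illustrates why a text-based match can survive a change in the surrounding HTML structure. It is an illustration only, not the patent's implementation: the helper names `AnchorCollector` and `find_paging_link`, the sample HTML, and the case-insensitive matching are all assumptions.

```python
import re
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a> element in a document."""

    def __init__(self):
        super().__init__()
        self.anchors = []     # finished (href, text) pairs
        self._inside = False  # True while an <a> element is open
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._inside = True
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._inside:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._inside:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._inside = False


def find_paging_link(html, pattern):
    """Return the href of the first link whose text matches `pattern`, else None.

    Matching on link text (an assumption about how `regex` is used) does not
    depend on where the link sits in the page structure.
    """
    collector = AnchorCollector()
    collector.feed(html)
    regex = re.compile(pattern, re.IGNORECASE)
    for href, text in collector.anchors:
        if href and regex.search(text):
            return href
    return None


# Two hypothetical layouts of the same pager: a table cell, then a div/span.
page_a = ('<table><tr><td><a href="/p/1">Prev</a></td>'
          '<td><a href="/p/3">Next Page</a></td></tr></table>')
page_b = '<div class="pager"><span><a href="/p/4">Next Page</a></span></div>'

link_a = find_paging_link(page_a, "next page")
link_b = find_paging_link(page_b, "next page")
```

In this sketch a selector such as `table>tbody>tr>td:nth-child(2)>a` would only apply to the first layout, while the text match locates the link in both.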

[0136] The crawler script is as follows:

[0137]-[0141]

    name: 'liuyugaikuang',
    url: 'http://www.gdwater.gov.cn/yszx/ysgk/lygk',
    keys: [{
        name: 'news',
        css: 'body>div.wrap>div>div.glcom.clearfix>div.gl-right>ul>li',

[0142...


Abstract

The invention discloses a big data web crawler paging selection method and system. The method comprises the following steps: a crawler script is parsed; a matching character is obtained and used to match label information in the crawler script contents; the feature values of the matched labels are stored in a URL queue; a URL link address is taken from the URL queue and verified; the verified URL link address is used for address matching; and the web page at each URL address that matches successfully is parsed to obtain its paging information. The system comprises a first analysis module, a first matching module, a storage module, an acquisition module, a second matching module, a second analysis module and a configuration module. The invention solves the problem that web page data cannot be crawled cyclically when the HTML structure of a page button in the current web page changes: paging labels are recognized accurately, interruption of cyclic data crawling is effectively prevented, and the crawling efficiency of web page data is increased.
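The steps listed in the abstract can be sketched as a queue-driven loop. The Python below is a minimal illustration under stated assumptions, not the patented implementation: pages come from a hypothetical in-memory dictionary instead of HTTP requests, "verification" is taken to mean an http(s) URL that has not yet been crawled, "address matching" is taken to mean the same host as the start URL, and anchors are extracted with a simplified regex. The names `PAGES`, `ANCHOR` and `crawl_pages` are invented for the example.

```python
import re
from collections import deque
from urllib.parse import urljoin, urlparse

# Hypothetical in-memory "site" standing in for real HTTP responses.
PAGES = {
    "http://example.com/list/1": '<a href="/list/2">Next Page</a>',
    "http://example.com/list/2": ('<div><a href="/list/1">Prev</a>'
                                  '<span><a href="/list/3">Next Page</a></span></div>'),
    "http://example.com/list/3": '<a href="/list/2">Prev</a>',
}

# Simplified anchor extraction; a real crawler would use an HTML parser.
ANCHOR = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.S)


def crawl_pages(script, pages):
    """Follow paging links starting from script['url']; return URLs in crawl order."""
    label_regex = re.compile(script["regex"], re.IGNORECASE)  # the matching character
    allowed_host = urlparse(script["url"]).netloc
    queue = deque([script["url"]])                            # the URL queue
    visited = []
    while queue:
        url = queue.popleft()
        parts = urlparse(url)
        # Verification (assumed meaning): http(s) scheme and not crawled yet.
        if parts.scheme not in ("http", "https") or url in visited:
            continue
        # Address matching (assumed meaning): same host and page available.
        if parts.netloc != allowed_host or url not in pages:
            continue
        visited.append(url)
        # Parse the page and queue the href of every label whose text matches.
        for href, text in ANCHOR.findall(pages[url]):
            if label_regex.search(text):
                queue.append(urljoin(url, href))
    return visited


# The script is assumed to be already parsed into a dictionary.
script = {"name": "nextpage", "url": "http://example.com/list/1", "regex": "next page"}
crawled = crawl_pages(script, PAGES)
```

The loop ends on its own once a page has no link matching the pattern, and the visited check keeps "Prev"-style cycles from re-queuing pages that were already crawled.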

Description

Technical field

[0001] The invention relates to the technical field of big data analysis and processing, and in particular to a big data web crawler paging selection method and system.

Background

[0002] With the rapid development of the network, data of all kinds is generated on the World Wide Web every moment. At present, China has about 4.54 million websites and more than 200 billion web pages. This surge of data holds great value, and extracting and using it effectively has become a huge challenge: how to make this complicated and disorderly Internet data generate value, how to turn the World Wide Web into one's own database, and how to let enterprises easily handle this massive data to innovate and quickly spot business opportunities. Search engines, such as the traditional general-purpose search engines Google and Baidu, as a tool to assist people in retrievin...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/9566
Inventor: 张志成, 王纯斌, 覃进学, 刘佳
Owner CHENGDU SEFON SOFTWARE CO LTD