
Method of collecting Internet data

A technology in the Internet and data field, applied to network data retrieval, network data indexing, electronic digital data processing, and similar areas. It addresses data duplication and poor matching among captured data, with the effects of meeting user needs, avoiding repeated capture, and a wide range of application.

Inactive Publication Date: 2017-05-31
SHANDONG INSPUR CLOUD SERVICE INFORMATION TECH CO LTD
3 Cites, 10 Cited by

AI Technical Summary

Problems solved by technology

Internet webpage data collection is the process of obtaining Internet webpage content, generally by means of web crawlers. In existing crawling processes, however, the same URL is often crawled repeatedly, captured data is duplicated, and captured data items match one another poorly. To address this, a method for collecting Internet data is provided: the data content required by users is extracted from web pages through analysis, then converted and processed in content and format, and stored to meet users' needs.


Examples


Embodiment Construction

[0021] The present invention will be further described below in conjunction with specific examples.

[0022] A method for collecting Internet data according to the present invention first collects data according to rules configured in advance by users, including web page download rules, web page analysis rules, and content extraction rules.
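The three kinds of user-configured rules can be pictured as a small configuration object. The following is only a minimal sketch: the source does not specify a rule format, so the class `CollectionRules`, its field names, the use of regular expressions, and the `example.com` URL are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class CollectionRules:
    """User-configured rules. All field names here are hypothetical."""
    # Web page download rule: which URLs the crawler is allowed to fetch.
    download_url_pattern: str
    # Web page analysis rule: which region of the page to analyse.
    analysis_region_pattern: str
    # Content extraction rules: field name -> regex with one capture group.
    extraction_patterns: Dict[str, str] = field(default_factory=dict)


def may_download(rules: CollectionRules, url: str) -> bool:
    """Apply the download rule to a candidate URL."""
    return re.match(rules.download_url_pattern, url) is not None


def extract(rules: CollectionRules, html: str) -> Dict[str, str]:
    """Apply the analysis rule, then each content extraction rule."""
    region = re.search(rules.analysis_region_pattern, html, re.DOTALL)
    if region is None:
        return {}
    body = region.group(1)
    record = {}
    for name, pattern in rules.extraction_patterns.items():
        match = re.search(pattern, body, re.DOTALL)
        if match:
            record[name] = match.group(1).strip()
    return record


# Example rule set for a hypothetical news site.
rules = CollectionRules(
    download_url_pattern=r"https://example\.com/news/\d+$",
    analysis_region_pattern=r"<article>(.*?)</article>",
    extraction_patterns={"title": r"<h1>(.*?)</h1>", "body": r"<p>(.*?)</p>"},
)

sample_html = "<html><article><h1> Hello </h1><p>World</p></article></html>"
record = extract(rules, sample_html)
```

Regular expressions keep the sketch dependency-free; a real deployment would more likely express the analysis and extraction rules against a proper HTML parser.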

[0023] In the present invention, the process of Internet web page big data collection and processing mainly includes 4 aspects:

[0024] 1) Web crawlers. Crawl the page content from the network and extract the required data content from it.

[0025] 2) Data processing. Process the content extracted by the web crawler.

[0026] 3) Crawl URL queue. Provides the web crawler with the URL addresses of the websites from which data is to be extracted.

[0027] 4) Data. The data covers three aspects: ① the URL information of the websites whose data is to be captured, ② the data extracted from web pages by the web crawler, and ③ the data processed by t...
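The way the four aspects fit together can be sketched as a small data flow: the queue feeds the crawler, the crawler fills a store of extracted data, and a processing step produces the processed data. This is an illustrative sketch only. The function names, the in-memory lists standing in for storage, the `<title>` extraction, and the `FAKE_PAGES` dictionary (used in place of real HTTP downloads so the example runs offline) are assumptions, not details from the source.

```python
import re
from collections import deque
from typing import Callable, Deque, Dict, List

Row = Dict[str, str]


def crawl(url_queue: Deque[str], fetch: Callable[[str], str], raw_store: List[Row]) -> None:
    """Web crawler: take each URL from the queue, fetch the page, extract data."""
    while url_queue:
        url = url_queue.popleft()
        html = fetch(url)
        match = re.search(r"<title>(.*?)</title>", html, re.DOTALL)
        if match:
            raw_store.append({"url": url, "title": match.group(1)})


def process(raw_store: List[Row], processed_store: List[Row]) -> None:
    """Data processing: here, simply normalise whitespace in the extracted text."""
    for row in raw_store:
        cleaned = " ".join(row["title"].split())
        processed_store.append({"url": row["url"], "title": cleaned})


# Hypothetical pages standing in for real downloads.
FAKE_PAGES = {
    "https://example.com/a": "<title>  First   page </title>",
    "https://example.com/b": "<title>Second page</title>",
}


def fake_fetch(url: str) -> str:
    return FAKE_PAGES[url]


url_queue: Deque[str] = deque(FAKE_PAGES)  # (1) URLs of the sites to capture
raw_store: List[Row] = []                   # (2) data extracted by the crawler
processed_store: List[Row] = []             # (3) data after processing

crawl(url_queue, fake_fetch, raw_store)
process(raw_store, processed_store)
```

Passing `fetch` in as a parameter keeps the crawler independent of any particular HTTP client, which also makes it easy to exercise without network access.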


Abstract

The invention discloses a method of collecting Internet data. The method is implemented as follows: first, a crawl URL queue is set up to provide the web crawler with the URL addresses of the websites from which data is to be extracted, that is, those URL addresses are stored in the queue; the web crawler obtains the URL information from the crawl URL queue, fetches the corresponding page for each URL, and extracts the data the user needs; the web crawler writes the extracted data into a database; and a data processing module is designed to process the data in the database. Compared with the prior art, the method processes data through link filtering, data rearrangement, and data integration, which eliminates duplicate data and avoids repeated capture, and yields a high degree of matching among the integrated data. User needs are therefore better met, and the method is highly practical, widely applicable, and easy to popularize.
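The abstract names link filtering and the elimination of repeated data but does not say how they are carried out. One common approach, shown here purely as an assumed illustration, is to normalise each URL and keep a set of those already seen, and to fingerprint record content with a hash so that duplicates can be dropped. The names `normalize_url`, `LinkFilter`, and `drop_duplicate_records`, and the `"content"` key, are hypothetical.

```python
import hashlib
from typing import Dict, List, Set
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case scheme and host, drop the fragment, default the path to '/'."""
    parts = urlsplit(url)
    path = parts.path or "/"
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


class LinkFilter:
    """Link filtering: admit each normalised URL at most once."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def should_crawl(self, url: str) -> bool:
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def drop_duplicate_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep the first record for each distinct content fingerprint."""
    seen: Set[str] = set()
    unique = []
    for record in records:
        # Collapse whitespace and case so trivially different copies collide.
        canonical = " ".join(record["content"].split()).lower()
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


link_filter = LinkFilter()
first = link_filter.should_crawl("https://Example.com/page#top")
repeat = link_filter.should_crawl("https://example.com/page")

records = [{"content": "Hello  World"}, {"content": "hello world"}, {"content": "Other"}]
unique_records = drop_duplicate_records(records)
```

Exact hashing only catches copies that are identical after normalisation; near-duplicate detection would need a different technique and is outside this sketch.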

Description

Technical field

[0001] The invention relates to the field of computer application technology, in particular to a highly practical method for collecting Internet data.

Background art

[0002] Big data refers to volumes of data too large to be managed and analyzed with general-purpose software tools. We have now entered the era of big data, which, like the invention of the Internet, has set off a new wave in the field of information technology. Big data can support industry analysis and bring new business value and opportunities to enterprises, but it also poses challenges to enterprise IT systems. To obtain data from the Internet, a data collection service method must be developed and backed by corresponding technical support.

[0003] Internet webpage data has the characteristics of big data, such as wide distribution, varied formats, and lack of structure, so Internet page data must be collected, processed, and stored in a specific way. Internet webpage data co...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/951, G06F16/9566
Inventors: 王利鑫, 王洪添
Owner: SHANDONG INSPUR CLOUD SERVICE INFORMATION TECH CO LTD