Internet data acquisition method with high matching degree

The method belongs to the field of data collection and Internet technology and applies to network data indexing, network data retrieval, and other database retrieval. It addresses the problems of poorly matched and duplicated captured data, avoids repeated capture, better meets user needs, and has a wide range of application.

Inactive Publication Date: 2019-04-19
河南大瑞物联网科技有限公司

AI Technical Summary

Problems solved by technology

[0002] Internet webpage data collection is the process of obtaining Internet webpage content, generally by means of web crawlers. In existing crawling processes, however, the same URL is often crawled repeatedly, captured data is duplicated, and the captured data matches poorly. To address this, an Internet data collection method with a high matching degree is provided: it extracts the data content required by the user from the web page through analysis, converts the content and format of the extracted data, and processes and stores it to meet user needs.

Examples

Embodiment Construction

[0011] A method for collecting Internet data with a high degree of matching is implemented as follows. First, a crawl URL list is created: the URLs of the websites from which data is to be extracted are supplied to the web crawler and stored in the crawl URL list. The web crawler then obtains the URL of each website to be processed from the crawl URL list, fetches the page content at that URL, and extracts the keyword information required by the user. The web crawler writes the extracted data into the database. Finally, a data analysis and comparison module is designed, and the data in the database is processed through that module.
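The steps in [0011] can be sketched as a small pipeline. This is only an illustrative sketch, not the patented implementation: the function names run_crawl and analyze, the hits table layout, and the keyword-counting form of "extraction" are all assumptions, and the page fetcher is passed in as a parameter so the example runs without network access.

```python
import sqlite3
from collections import deque


def run_crawl(seed_urls, fetch, keywords, conn):
    """Crawl every URL in the list and store keyword hits in the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits (url TEXT, keyword TEXT, count INTEGER)"
    )
    crawl_list = deque(seed_urls)        # step 1: build the crawl URL list
    while crawl_list:
        url = crawl_list.popleft()       # step 2: take the next URL from the list
        page = fetch(url).lower()        # step 3: obtain the page content
        for kw in keywords:              # step 3: extract the keyword information
            n = page.count(kw.lower())
            if n:
                # step 4: write the extracted data into the database
                conn.execute("INSERT INTO hits VALUES (?, ?, ?)", (url, kw, n))
    conn.commit()


def analyze(conn):
    """Step 5 stand-in: a trivial 'analysis' that totals hits per keyword."""
    return conn.execute(
        "SELECT keyword, SUM(count) FROM hits GROUP BY keyword ORDER BY keyword"
    ).fetchall()


# Hypothetical in-memory pages stand in for real HTTP responses.
pages = {
    "http://a.example": "Crawler data and more data",
    "http://b.example": "no match",
}
conn = sqlite3.connect(":memory:")
run_crawl(pages.keys(), pages.__getitem__, ["data", "crawler"], conn)
print(analyze(conn))
```

Passing fetch as a callable keeps the download step replaceable, so a real HTTP client could be substituted without touching the rest of the pipeline.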

[0012] The web crawler collects data according to rules configured in advance by the user. The configured rules include web page download rules, web page parsing rules, and content extraction rules.
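One possible way to represent the three rule groups named in [0012] is as plain configuration objects. The source does not say what each rule contains, so every field below (timeout_s, max_retries, user_agent, encoding, strip_tags, patterns) and the helper apply_rules are hypothetical and serve only to show the shape of such a configuration.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DownloadRule:
    # Hypothetical download settings; not applied below because no network is used.
    timeout_s: float = 10.0
    max_retries: int = 2
    user_agent: str = "example-crawler/0.1"


@dataclass
class ParseRule:
    # Hypothetical parsing settings: how raw bytes become searchable text.
    encoding: str = "utf-8"
    strip_tags: bool = True


@dataclass
class ExtractRule:
    # Hypothetical extraction settings: field name -> regex with one capture group.
    patterns: Dict[str, str] = field(default_factory=dict)


@dataclass
class CrawlRules:
    download: DownloadRule
    parse: ParseRule
    extract: ExtractRule


def apply_rules(raw: bytes, rules: CrawlRules) -> Dict[str, List[str]]:
    """Apply the parsing and extraction rules to an already downloaded page."""
    text = raw.decode(rules.parse.encoding, errors="replace")
    if rules.parse.strip_tags:
        text = re.sub(r"<[^>]+>", " ", text)  # crude tag removal, sketch only
    return {
        name: re.findall(pattern, text)
        for name, pattern in rules.extract.patterns.items()
    }


rules = CrawlRules(
    download=DownloadRule(),
    parse=ParseRule(),
    extract=ExtractRule(patterns={"price": r"Price:\s*(\d+)"}),
)
print(apply_rules(b"<p>Price: 42 USD</p><p>Price: 7 USD</p>", rules))
```

Keeping the three rule groups separate mirrors the wording of the paragraph and lets a user change, say, the extraction patterns without touching the download settings.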

[0013]...

Abstract

The invention discloses an Internet data acquisition method with a high matching degree. The method comprises the following steps: first, a crawl URL list is created, the URLs of the websites from which data is to be extracted are supplied to a web crawler, and those URLs are stored in the crawl URL list; the web crawler obtains the URL of each website to be processed from the crawl URL list; the web crawler obtains the page content at the corresponding URL and extracts the keyword information required by the user; the web crawler writes the extracted data into a database; and a data analysis and comparison module is designed, through which the data in the database is processed. Compared with the prior art, the method processes the data through link filtering, data rearrangement, and integration. This eliminates repeated data, avoids repeated capture, and yields highly integrated and well-matched data, so that the method better meets the requirements of users, is highly practical, has a wide range of application, and is easy to popularize.
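The abstract names link filtering and the elimination of repeated data but does not describe how either is done. The following is a minimal sketch of one common approach, assuming URL normalization with a seen-set for link filtering and content hashing for duplicate records; normalize_url, LinkFilter, and dedupe_records are hypothetical names, not taken from the patent.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Lower-case scheme and host, drop the fragment and any trailing slash."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), path, parts.query, "")
    )


class LinkFilter:
    """Remembers normalized URLs so that each page is crawled at most once."""

    def __init__(self):
        self._seen = set()

    def is_new(self, url: str) -> bool:
        key = normalize_url(url)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


def dedupe_records(records):
    """Keep the first occurrence of each record, ignoring case and whitespace runs."""
    seen = set()
    unique = []
    for text in records:
        canonical = " ".join(text.split()).lower()
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(text)
    return unique


link_filter = LinkFilter()
for candidate in ["http://Example.com/a/", "http://example.com/a#top", "http://example.com/b"]:
    print(candidate, link_filter.is_new(candidate))
print(dedupe_records(["Hello  World", "hello world", "Other"]))
```

Exact hashing only catches records that are identical after normalization; near-duplicate detection would need a different technique and is outside this sketch.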

Description

Technical field

[0001] The invention relates to the field of computer application technology, and in particular to a highly practical method for collecting Internet data.

Background technique

[0002] Internet webpage data collection is the process of obtaining Internet webpage content, generally by means of web crawlers. In existing crawling processes, however, the same URL is often crawled repeatedly, captured data is duplicated, and the captured data matches poorly. To address this, an Internet data collection method with a high matching degree is provided: it extracts the data content required by the user from the web page through analysis, converts the content and format of the extracted data, and processes and stores it to meet the needs of users.

Contents of the invention

[0003] The technical task of the present invention is to provide an Internet data collection method with strong practicability and a high matching degree, aiming at the abo...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951, G06F16/955, G06F16/903
Inventor: 韩金花
Owner: 河南大瑞物联网科技有限公司