Webpage data crawling method, device and system and computer readable storage medium

A webpage data and database technology, applied in network data indexing, network data retrieval, and other database retrieval. It addresses the problem that crawled data is mixed with redundant and repeated data, which affects subsequent data analysis.

Pending Publication Date: 2019-06-21
PING AN TECH (SHENZHEN) CO LTD
AI Technical Summary

Problems solved by technology

Because Internet content is widely reprinted and published across multiple websites, most of the data extracted by web crawlers is mixed with a large amount of redundant and repeated data, which affects subsequent data analysis.

Examples

Embodiment Construction

[0050] It should be understood that the specific embodiments described here are only used to explain the present invention, not to limit the present invention.

[0051] Please refer to Figure 1, which is a schematic diagram of the terminal structure of the hardware operating environment involved in the solution of the embodiment of the present invention.

[0052] The terminal in this embodiment of the present invention is a control server, which may be a terminal device such as a PC (personal computer), a notebook computer, or a server.

[0053] As shown in Figure 1, the terminal may include: a processor 1001, such as a CPU (central processing unit), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to realize connection and communication between these components; the user interface 1003 can include a display s...

Abstract

The invention provides a webpage data crawling method and relates to the field of data crawling. The method is applied to a webpage data crawling system that comprises a control server and a plurality of crawler servers connected to the control server. The method comprises: when the control server receives first webpage data sent by a crawler server, performing feature extraction on the first webpage data to obtain a first text feature vector; inputting the first text feature vector into a pre-trained fingerprint generation model to obtain a first webpage fingerprint; calculating similarity values between the first webpage fingerprint and the webpage fingerprints already stored in a preset storage database, and judging whether any of the similarity values is greater than a preset threshold; and if not, storing the first webpage data in the preset storage database. The invention further provides a webpage data crawling device and system and a computer-readable storage medium. The method, device, system, and computer-readable storage medium can reduce the repetition rate of webpage data obtained by web crawlers.
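
The control-server flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the hashed bag-of-words features, the fixed random projection standing in for the pre-trained fingerprint generation model, the bit-agreement similarity measure, and the names `extract_features`, `fingerprint`, `similarity`, `PageStore`, `DIM`, `BITS`, and `THRESHOLD` are all hypothetical choices made for the example.

```python
import hashlib
import random
import re

DIM = 256        # length of the text feature vector (assumed)
BITS = 64        # fingerprint length in bits (assumed)
THRESHOLD = 0.9  # preset similarity threshold (assumed)


def extract_features(text):
    """Hashed bag-of-words; a stand-in for the feature extraction step."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    return vec


# Fixed random hyperplanes stand in for the pre-trained fingerprint model.
_rng = random.Random(0)
_PLANES = [[_rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(BITS)]


def fingerprint(vec):
    """Map a feature vector to a tuple of BITS sign bits."""
    return tuple(
        int(sum(w * x for w, x in zip(plane, vec)) >= 0.0) for plane in _PLANES
    )


def similarity(fp_a, fp_b):
    """Fraction of matching bits (1 minus the normalized Hamming distance)."""
    return sum(a == b for a, b in zip(fp_a, fp_b)) / len(fp_a)


class PageStore:
    """In-memory stand-in for the preset storage database."""

    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self.fingerprints = []
        self.pages = []

    def submit(self, page_text):
        """Store the page unless a stored fingerprint is too similar."""
        fp = fingerprint(extract_features(page_text))
        if any(similarity(fp, old) > self.threshold for old in self.fingerprints):
            return False  # treated as a duplicate and discarded
        self.fingerprints.append(fp)
        self.pages.append(page_text)
        return True


store = PageStore()
first = store.submit("Breaking news: markets rally on rate cut")
repeat = store.submit("breaking news:   MARKETS rally on rate cut")
```

In this sketch the second submission tokenizes to the same words as the first, so its fingerprint is identical and it is rejected. A real deployment would compare against fingerprints held in the storage database rather than a Python list, and would use the trained model in place of the random projection.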

Description

Technical field

[0001] The present invention relates to the technical field of data crawling, in particular to a method, device, system and computer-readable storage medium for webpage data crawling.

Background

[0002] With the rapid development of network technology, the Internet has become an important way for people to obtain information resources, and web crawlers have become the mainstream means of obtaining webpage data. However, because Internet content is widely reprinted and published across multiple websites, most of the data extracted by web crawlers is mixed with a large amount of redundant and repeated data, which affects subsequent data analysis. Therefore, there is an urgent need for a method of removing duplicate data from the webpage data crawled by web crawlers, so as to reduce the repetition rate of the crawled webpage data.

Summary of the invention

[0003] The main purpose of the present invention i...

Application Information

IPC(8): G06F16/951, G06F16/955
CPC: G06F16/951, G06F16/955
Inventors: 吴启, 王雪春
Owner: PING AN TECH (SHENZHEN) CO LTD