Python-based real-time crawling system for unstructured data and method for using the same

A real-time crawling technology for unstructured data, applied in the field of network big data. It addresses problems such as information explosion and user information overload, with the effects of reducing interference and improving and safeguarding work efficiency.

Pending Publication Date: 2020-10-30
广西美立方工程咨询有限公司

AI Technical Summary

Problems solved by technology

The explosion of information has brought users the problem of information overload. How to quickly select what they need from the massive amount of information is an increasingly urgent problem.

Examples

Embodiment Construction

[0030] Referring to Figure 1, the Python-based real-time crawling system for unstructured data proposed by the present invention includes a crawler cluster, a temporary storage database, a data migration module and a target database.
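
The data flow between the four components in [0030] can be pictured with a small in-memory sketch. This is only an illustration under stated assumptions: the class and field names (`Record`, `CrawlPipeline`, `staging`, `target`) are hypothetical, plain Python containers stand in for the two databases, and grouping records by source is an assumed arrangement that the text does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One crawled item; field names are illustrative, not from the patent."""
    source: str   # which crawling object the item came from
    payload: str  # raw unstructured content (e.g. HTML or free text)


@dataclass
class CrawlPipeline:
    """Hypothetical wiring of the four components named in [0030]."""
    crawlers: dict = field(default_factory=dict)  # crawler cluster: source -> callable
    staging: list = field(default_factory=list)   # stand-in for the temporary storage database
    target: dict = field(default_factory=dict)    # stand-in for the target database

    def crawl_once(self) -> None:
        # Each crawler writes its raw results into temporary storage.
        for source, fetch in self.crawlers.items():
            for payload in fetch():
                self.staging.append(Record(source, payload))

    def migrate(self) -> None:
        # Data migration module: move staged records into the target,
        # grouped by source (an assumed grouping key).
        while self.staging:
            record = self.staging.pop(0)
            self.target.setdefault(record.source, []).append(record)
```

After `crawl_once()` followed by `migrate()`, the staging list is empty and every record lives only in `target`.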

[0031] The crawler cluster includes multiple web crawlers, each set up for a different crawling object and used to crawl unstructured data from that object in real time. In this embodiment, the crawlers target unstructured data, which avoids restricting the kinds of data crawled and helps ensure the breadth of crawling, and thus the richness and comprehensiveness of the crawled data.
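
One way to picture a cluster with one crawler per crawling object is a thread pool that runs each crawler concurrently. The sketch below is a minimal illustration, not the patented implementation: `make_stub_crawler` and `run_cluster` are hypothetical names, the crawlers return canned text instead of fetching pages, and continuous real-time scheduling is left out.

```python
from concurrent.futures import ThreadPoolExecutor


def make_stub_crawler(name, pages):
    """Return a hypothetical crawler for one crawling object.

    A real crawler would fetch pages over the network; this stub returns
    canned text so the sketch runs offline.
    """
    def crawl():
        return [(name, page) for page in pages]
    return crawl


def run_cluster(crawlers):
    """Run every crawler in its own worker thread and collect all results."""
    results = []
    with ThreadPoolExecutor(max_workers=max(1, len(crawlers))) as pool:
        # pool.map yields batches in the same order as the crawlers list.
        for batch in pool.map(lambda crawl: crawl(), crawlers):
            results.extend(batch)
    return results


cluster = [
    make_stub_crawler("news_site", ["<html>headline</html>"]),
    make_stub_crawler("forum", ["free-text post", "another post"]),
]
items = run_cluster(cluster)
```

Because each crawler is just a callable, adding a new crawling object means appending one more entry to `cluster`.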

[0032] Specifically, in this embodiment, the crawler cluster includes at least some open-source web crawlers, so as to facilitate adjustment of the web crawlers according to data requirements, thereby improving the applicability and flexibility of the system.

[0033] The temporary sto...

Abstract

The invention discloses a Python-based real-time crawling system for unstructured data. The system comprises a crawler cluster, a temporary storage database, a data migration module and a target database, wherein the data migration module partitions and arranges the data stored in the temporary storage database and migrates the arranged data to the target database for storage. Because the data migration module arranges the unstructured data in the temporary storage database and then migrates it to the target database, redundant storage of the same data in both databases is avoided. Meanwhile, the storage transition provided by the temporary storage database reduces the data arrangement pressure on the data migration module and ensures the logical integrity of the data in the target database, which in turn ensures the efficiency of retrieving information through the target database.
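
The arrange-then-migrate step described in the abstract can be sketched with two SQLite databases from the Python standard library. This is a hedged illustration only: the table names (`staged`, `archive`), the column layout, and the choice of `source` as the partition key are assumptions, since the abstract does not say how the data is partitioned or which database engines are used.

```python
import sqlite3


def migrate(staging, target):
    """Move staged rows into the target, ordered by the assumed partition key.

    Returns the number of rows migrated.
    """
    rows = staging.execute(
        "SELECT id, source, payload FROM staged ORDER BY source, id"
    ).fetchall()
    if not rows:
        return 0
    with target:  # commit the inserts as one transaction
        target.executemany(
            "INSERT INTO archive (source, payload) VALUES (?, ?)",
            [(source, payload) for _, source, payload in rows],
        )
    last_id = max(row[0] for row in rows)
    with staging:  # clear only the rows that were copied, after the target commit
        staging.execute("DELETE FROM staged WHERE id <= ?", (last_id,))
    return len(rows)


staging = sqlite3.connect(":memory:")
staging.execute(
    "CREATE TABLE staged (id INTEGER PRIMARY KEY, source TEXT, payload TEXT)"
)
staging.executemany(
    "INSERT INTO staged (source, payload) VALUES (?, ?)",
    [("forum", "post text"), ("news", "article text"), ("forum", "reply text")],
)
staging.commit()

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE archive (source TEXT, payload TEXT)")

moved = migrate(staging, target)
```

Deleting the migrated rows from the staging table is what keeps the same data from being stored in both databases.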

Description

Technical field

[0001] The invention relates to the technical field of network big data, in particular to a Python-based real-time crawling system for unstructured data and a method for using the same.

Background

[0002] With its rapid development, the Internet has penetrated all aspects of people's lives; everything from obtaining information to meeting material needs can be done through the Internet.

[0003] With the explosive growth of information, hundreds of millions of websites continue to emerge, and the number of web pages indexed by search engines is also increasing rapidly.

[0004] The abundant information on the Internet brings great convenience to people. Through the Internet, people can obtain all kinds of information efficiently and quickly. However, the explosion of information has also brought the problem of information overload to users. How to quickly select what they need from the massive amount of information is an increasingly ...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/951
CPC: G06F16/951
Inventors: 官鲁卫, 陈霞
Owner: 广西美立方工程咨询有限公司