The invention discloses a distributed system for the rapid collection of internet data. The system comprises five layers: a seed website setting node, a hyperlink collection layer, a real-time queue, a webpage downloading and parsing layer, and a webpage data storage layer, wherein the seed website setting node is used for setting and storing the parameters and extraction rules of each data source; the
hyperlink collection layer is used for requesting the hyperlink list webpage of the data source and extracting the hyperlinks of target webpages; the real-time queue is used for storing the URL (Uniform Resource Locator) hyperlinks extracted by the hyperlink collection layer, the extraction rule corresponding to each URL hyperlink, and the URL hyperlinks that have already been accessed; the webpage downloading and
parsing layer is used for requesting and parsing the URL hyperlinks in the real-time queue that have not yet been accessed and for extracting specific data in a formatted manner; and the webpage data storage layer is used for storing the target data obtained by the formatted extraction carried out by the webpage downloading and parsing layer. With this system, data collection is carried out in a distributed, layered and cooperative manner, and application requirements such as large collection volumes, numerous data sources and strict real-time demands can be met.
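
The cooperation of the five layers described above can be illustrated by a minimal single-process sketch. It is not the disclosed implementation: the names SeedSite, RealTimeQueue, collect_hyperlinks and download_and_parse, the use of regular expressions as extraction rules, and the in-memory PAGES dictionary standing in for real HTTP requests are all assumptions made for illustration. In a distributed deployment each layer would presumably run on separate nodes and the queue would be a shared service.

```python
import re
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class SeedSite:
    """Seed website setting node: parameters and rules of one data source (hypothetical)."""
    list_url: str      # hyperlink list webpage of the data source
    link_pattern: str  # regex that extracts target-webpage hyperlinks
    data_pattern: str  # regex extraction rule with named groups for the target data


@dataclass
class RealTimeQueue:
    """Real-time queue: pending (URL, rule) pairs plus the set of accessed URLs."""
    pending: deque = field(default_factory=deque)
    accessed: set = field(default_factory=set)

    def push(self, url, rule):
        # Skip URLs that were already accessed or are already waiting.
        if url in self.accessed or any(url == u for u, _ in self.pending):
            return
        self.pending.append((url, rule))

    def pop_unaccessed(self):
        # Return the next URL that has not been accessed, marking it accessed.
        while self.pending:
            url, rule = self.pending.popleft()
            if url not in self.accessed:
                self.accessed.add(url)
                return url, rule
        return None


def collect_hyperlinks(seed, fetch, queue):
    """Hyperlink collection layer: request the list page and enqueue target links."""
    html = fetch(seed.list_url)
    for url in re.findall(seed.link_pattern, html):
        queue.push(url, seed.data_pattern)


def download_and_parse(fetch, queue, storage):
    """Downloading and parsing layer: fetch unaccessed URLs and extract formatted data."""
    while (item := queue.pop_unaccessed()) is not None:
        url, rule = item
        match = re.search(rule, fetch(url))
        if match:
            # Webpage data storage layer: here simply a list of records.
            storage.append({"url": url, **match.groupdict()})


# Stand-in for the network: a fixed mapping from URL to page content (assumption).
PAGES = {
    "http://example.test/list": (
        '<a href="http://example.test/a">A</a> '
        '<a href="http://example.test/b">B</a> '
        '<a href="http://example.test/a">A again</a>'
    ),
    "http://example.test/a": "<h1>Alpha</h1>",
    "http://example.test/b": "<h1>Beta</h1>",
}

seed = SeedSite(
    list_url="http://example.test/list",
    link_pattern=r'href="([^"]+)"',
    data_pattern=r"<h1>(?P<title>[^<]+)</h1>",
)
queue = RealTimeQueue()
storage = []

collect_hyperlinks(seed, PAGES.__getitem__, queue)
download_and_parse(PAGES.__getitem__, queue, storage)
print(storage)
```

In this sketch the duplicate link to the same target page is enqueued only once, which reflects the role of the accessed-URL record kept by the real-time queue.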