Structured processing method of distributed network information

A technology for the structured collection and processing of distributed network information. It addresses problems such as large storage requirements and barriers to information use, and achieves effective classification of webpages and convenient analysis and processing of the extracted information.

Active Publication Date: 2015-05-06
ZHEJIANG UNIV

AI Technical Summary

Problems solved by technology

Such a storage method has the following disadvantages. First, storing webpages in their original form requires a large amount of storage space. Second, the stored information contains a large amount of irrelevant content, such as advertisements. Third, it is a semi-structured storage method; compared with structured storage, semi-structured storage creates obstacles to further use of the information.



Examples


Embodiment

[0057] For an electric vehicle website, the user wants to obtain the news webpages and the car model parameter webpages. For each of these two types, one typical webpage is selected as the target webpage, forming the set of target webpages {C1, C2}.

[0058] In step 2), distributed crawling is performed on the electric vehicle website to obtain its webpage data. The time required depends on the scale of the website being crawled and the scale of the cluster executing the crawling task. For a cluster of ten nodes, the crawling rate at full load can reach 100,000 pages/hour.
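As a rough illustration of the rate quoted above, the following sketch estimates how long a crawl would take. It assumes the cluster sustains the full-load rate for the whole crawl, and the site size of 2,500,000 pages is a hypothetical figure that does not come from the patent.

```python
def estimated_crawl_hours(total_pages: int, pages_per_hour: int = 100_000) -> float:
    """Hours needed to crawl total_pages at a steady cluster-wide rate.

    The default rate is the full-load figure quoted for a ten-node cluster.
    Assumption: the rate stays constant for the whole crawl.
    """
    if pages_per_hour <= 0:
        raise ValueError("pages_per_hour must be positive")
    return total_pages / pages_per_hour


# Hypothetical site size, chosen only for illustration.
print(estimated_crawl_hours(2_500_000))  # 25.0
```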

[0059] In step 3), the webpages obtained from the electric vehicle website are divided by clustering into two types, news and car model parameters. The clustering accuracy in this process can exceed 95%.
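The abstract states that structural clustering uses tree edit distance. The sketch below shows one way such a comparison could work: it computes a unit-cost edit distance between ordered tag trees and assigns a page to the closest target webpage. The tree representation, the unit costs, the nearest-target rule, and all names are assumptions made for illustration; the patent text available here does not specify them.

```python
from functools import lru_cache

# Assumed representation: a tag tree is (label, children),
# where children is a tuple of tag trees.


def size(tree):
    """Number of nodes in a tag tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def forest_distance(f, g):
    """Unit-cost edit distance between two ordered forests (tuples of trees)."""
    if not f and not g:
        return 0
    if not f:
        return sum(size(t) for t in g)  # insert every node of g
    if not g:
        return sum(size(t) for t in f)  # delete every node of f
    (f_label, f_kids), (g_label, g_kids) = f[-1], g[-1]
    # Delete the root of the rightmost tree in f; its children move up.
    delete = forest_distance(f[:-1] + f_kids, g) + 1
    # Insert the root of the rightmost tree in g.
    insert = forest_distance(f, g[:-1] + g_kids) + 1
    # Match the two rightmost roots, renaming if the labels differ.
    match = (forest_distance(f[:-1], g[:-1])
             + forest_distance(f_kids, g_kids)
             + (0 if f_label == g_label else 1))
    return min(delete, insert, match)


def tree_edit_distance(a, b):
    return forest_distance((a,), (b,))


def nearest_target(page, targets):
    """Name of the target tree with the smallest edit distance to page."""
    return min(targets, key=lambda name: tree_edit_distance(page, targets[name]))


def leaf(label):
    return (label, ())


# Hypothetical tag trees for two target pages and one crawled page.
news_target = ("html", (("body", (leaf("h1"), leaf("p"))),))
spec_target = ("html", (("body", (("table", (leaf("tr"), leaf("tr"))),)),))
page = ("html", (("body", (leaf("h1"), leaf("p"), leaf("p"))),))

targets = {"news": news_target, "model_parameters": spec_target}
print(nearest_target(page, targets))  # news
```

The recursion is exponential without memoization, so `lru_cache` is what keeps it usable on small trees. Real webpages have far larger DOM trees and would need a more efficient algorithm.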

[0060] In step 4), structured extraction is carried out for these two types of webpages. For news webpages, extract title, text, release date, sour...
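A minimal sketch of what extracting structured fields from a news webpage could look like, using only Python's standard `html.parser`. It covers just the title and the text. The assumption that the title sits in an `h1` element and the body text in `p` elements is hypothetical, as is the sample page; the patent's actual extraction rules are not given in the text available here.

```python
from html.parser import HTMLParser


class NewsExtractor(HTMLParser):
    """Collects text from <h1> (assumed title) and <p> (assumed body text)."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.paragraphs = []
        self._current = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        if tag in ("h1", "p"):
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current == "h1":
            self.title_parts.append(text)
        elif self._current == "p":
            self.paragraphs.append(text)


def extract_news(html):
    """Return a structured record; text outside <h1>/<p> is dropped."""
    parser = NewsExtractor()
    parser.feed(html)
    parser.close()
    return {
        "title": " ".join(parser.title_parts),
        "text": " ".join(parser.paragraphs),
    }


# Hypothetical page containing an advertisement block.
sample = (
    '<html><body><div class="ad">Buy now</div>'
    "<h1>New EV launched</h1>"
    "<p>First paragraph.</p><p>Second paragraph.</p>"
    "</body></html>"
)
print(extract_news(sample))
```

Because only the selected elements are kept, the advertisement text in the sample never reaches the stored record, which is the kind of irrelevant content the structured approach is meant to discard.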


Abstract

The invention discloses a structured processing method for distributed network information. The method comprises the following steps: configuring a network information acquisition task, and saving the webpages the user is interested in, by category, as target webpages; acquiring the network information by cooperatively crawling webpages through multiple Map/Reduce processes, performing structured processing, and saving the results in HDFS (Hadoop Distributed File System); performing structural clustering on the processed webpages using tree edit distance; and performing structured extraction on the clustered webpage information and saving it in a database. Because a distributed architecture is adopted, very large volumes of network data can be processed using the computing and storage capacity of a cluster of inexpensive computers; the webpages are classified effectively; and the network information is extracted and saved in structured form, which facilitates further analysis and processing.
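To make the Map/Reduce model mentioned above concrete, here is a small in-memory sketch of its three phases (map, shuffle, reduce), counting crawled pages per host. It is plain Python and does not use Hadoop or any Hadoop API; the job itself (counting pages per host), the function names, and the example URLs are assumptions chosen for illustration, not the jobs described in the patent.

```python
from collections import defaultdict
from urllib.parse import urlparse


def map_phase(urls):
    """Emit one (host, 1) pair per URL."""
    for url in urls:
        yield urlparse(url).netloc, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_pages_per_host(urls):
    return reduce_phase(shuffle(map_phase(urls)))


# Hypothetical URLs.
urls = [
    "http://ev.example.com/news/1",
    "http://ev.example.com/news/2",
    "http://ev.example.com/models/7",
    "http://other.example.org/",
]
print(count_pages_per_host(urls))
```

In a real cluster the map and reduce functions run on different nodes and the shuffle moves data across the network, which is where the trade-off between single-node computing power and inter-node communication arises.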

Description

Technical field

[0001] The invention relates to a network information processing method in the field of network information collection, in particular to a structured collection and processing method for distributed network information.

Background

[0002] A distributed system is a system that efficiently organizes clusters of inexpensive computers to perform large-scale data computation and storage.

[0003] Distributed systems differ from stand-alone systems. Using a computer cluster for data computation and storage requires balancing single-node computing power against the cost of inter-node communication. At the same time, issues such as system availability and data recovery after node failures in the cluster must also be considered. Hadoop and the HDFS distributed file system are open-source distributed computing and storage systems designed and developed on the basis of the Map/Reduce computing model proposed by Google. Because of its effe...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/13, G06F16/16, G06F16/182
Inventor: 常鹏飞, 伍赛, 陈珂, 寿黎但, 陈刚
Owner: ZHEJIANG UNIV