Hadoop cluster-based large-scale Web information extraction method and system

Hadoop cluster and information extraction technology, applied to large-scale Web information extraction, addresses problems such as low access efficiency and one-sided data extraction.

Active Publication Date: 2014-03-12
华夏文广传媒集团股份有限公司

AI Technical Summary

Problems solved by technology

[0003] In view of the above problems in the prior art, the present invention addresses the shortcomings of existing large-scale Web information extraction: one-sided data extraction and low access efficiency.



Examples


Embodiment Construction

[0031] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0032] In an embodiment of the present invention, the large-scale Web information extraction system is built on a Hadoop cluster. In the overall architecture, the database and the services are separated into two independent clusters. To keep the system stable, the service cluster hosts both regular services and internal services. The internal services mainly keep the data extraction service running 24 hours a day, 7 days a week, so that it does not stop when regular services occupy resources. The service cluster also needs a guard (watchdog) process that monitors the running status of each service; when a service fails, the server must be switched or restarted promptly so that the system stays stable. In order to...
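The monitoring process described in [0032] can be illustrated with a minimal Python sketch. This is not the patent's implementation: the `Service` class, the `is_alive` and `restart` callbacks, and the polling interval are all hypothetical names and values chosen for illustration, and a real deployment would probe actual cluster services and might switch to a standby server rather than restart in place.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Service:
    """A monitored service (hypothetical shape, not taken from the patent)."""
    name: str
    is_alive: Callable[[], bool]   # health probe supplied by the caller
    restart: Callable[[], None]    # recovery action supplied by the caller


def watchdog_pass(services: List[Service]) -> List[str]:
    """Check every service once and restart the ones that fail.

    Returns the names of the services that were restarted in this pass.
    """
    restarted = []
    for svc in services:
        try:
            healthy = svc.is_alive()
        except Exception:
            # A probe that raises is treated the same as a failed probe.
            healthy = False
        if not healthy:
            svc.restart()
            restarted.append(svc.name)
    return restarted


def run_watchdog(services: List[Service],
                 interval_seconds: float = 5.0,
                 max_passes: Optional[int] = None) -> None:
    """Poll the services repeatedly; max_passes=None means run until stopped."""
    passes = 0
    while max_passes is None or passes < max_passes:
        watchdog_pass(services)
        passes += 1
        time.sleep(interval_seconds)
```

Keeping the single pass (`watchdog_pass`) separate from the polling loop makes the recovery logic easy to exercise without waiting on timers.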



Abstract

The invention discloses a Hadoop cluster-based large-scale Web information extraction method and system, addressing the problem that a single node cannot meet the demands of large-scale Web information extraction. The method comprises the following steps: an aggregation processing node extracts the website seeds to be queried according to predetermined conditions, splits them in a load-balanced way according to the processing capacity of each query node, and transmits the seeds to the query nodes; each query node performs Web extraction locally on the seeds it received and reports the results to the aggregation processing node; the aggregation processing node aggregates the reported information to obtain the large-scale Web information. With this method and system, massive data extraction is performed by a Hadoop cluster and the data are processed by a high-efficiency HBase-type memory database; extraction efficiency is greatly improved compared with a single machine and a traditional relational database, and reliability and scalability are high.
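The split, extract, and aggregate steps in the abstract can be sketched in plain Python. This is only an illustration under stated assumptions: the abstract does not say how the load-balanced split is computed, so the proportional split with largest-remainder rounding below is an assumed choice, and `partition_seeds`, `extract_locally`, and `aggregate` are hypothetical names. The per-node extraction is a placeholder, and no Hadoop or HBase API is used.

```python
from typing import Dict, List, Tuple


def partition_seeds(seeds: List[str],
                    capacities: Dict[str, float]) -> Dict[str, List[str]]:
    """Split seeds across query nodes in proportion to each node's capacity.

    Uses largest-remainder rounding so every seed is assigned exactly once.
    """
    total = sum(capacities.values())
    if total <= 0:
        raise ValueError("total capacity must be positive")
    nodes = list(capacities)
    # Ideal fractional share per node, then its integer part.
    quotas = {n: len(seeds) * capacities[n] / total for n in nodes}
    counts = {n: int(quotas[n]) for n in nodes}
    # Hand the seeds lost to rounding to the nodes with the largest remainders.
    leftover = len(seeds) - sum(counts.values())
    by_remainder = sorted(nodes, key=lambda n: quotas[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    assignment, start = {}, 0
    for n in nodes:
        assignment[n] = seeds[start:start + counts[n]]
        start += counts[n]
    return assignment


def extract_locally(node: str, seeds: List[str]) -> List[Tuple[str, dict]]:
    """Placeholder for a query node's local Web extraction (assumption)."""
    return [(seed, {"node": node, "url": seed}) for seed in seeds]


def aggregate(reports: List[List[Tuple[str, dict]]]) -> Dict[str, dict]:
    """Merge the per-node reports into one result keyed by seed."""
    merged = {}
    for report in reports:
        for seed, record in report:
            merged[seed] = record
    return merged
```

A caller acting as the aggregation processing node would run `partition_seeds`, pass each slice to its query node, and feed the returned reports to `aggregate`.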

Description

Technical field

[0001] The invention relates to the field of information retrieval in computer networks, in particular to a large-scale Web information extraction method and system based on Hadoop clusters.

Background

[0002] With the rapid development of network information technology, data on the Web is growing exponentially, making the Web the largest data collection in the world. Large-scale Web information extraction has therefore long been a research focus for scholars at home and abroad. Among early work on large-scale Web information extraction, the early Google crawler, designed at Stanford University, consisted of five modules; URLs were read from a URL list and sent to the machines running the crawler. Using asynchronous I/O, the system maintained hundreds of connections at once, and the fetched pages were compressed and stored on a storage server. However, such architecture lacks stabili...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 施佺马松玉邵叶秦施振佺丁卫平徐露李冬冬
Owner: 华夏文广传媒集团股份有限公司