Hadoop cluster-based large-scale Web information extraction method and system

A Hadoop cluster and information extraction technology, applied in the field of large-scale Web information extraction, that addresses problems such as low access efficiency and one-sided data extraction.

Active Publication Date: 2014-03-12

华夏文广传媒集团股份有限公司

2 Cites, 17 Cited by

Problems solved by technology

[0003] In view of the above-mentioned problems in the prior art, the present invention addresses the shortcomings of existing large-scale Web information extraction, namely one-sided data extraction and low access efficiency.

Embodiment Construction

[0031] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0032] In an embodiment of the present invention, the large-scale Web information extraction system is built on a Hadoop cluster-based Web information extraction architecture. In this overall architecture, the database and the services are separated into two independent clusters. To ensure stable operation of the system, the service cluster hosts two kinds of services: regular services and internal services. The internal services exist mainly to keep the data extraction service running 24 hours a day, 7 days a week, so that it does not stop when regular services occupy resources. The service cluster also needs a protection process that monitors the running status of each service. When a service fails, the server must be switched or restarted promptly to keep the system stable. In order to...
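The protection process described in [0032] can be illustrated with a minimal sketch. The patent text shown here does not specify how monitoring is implemented, so everything below is an assumption made for illustration: the `Service` class, the `check_once` function, and the `max_restarts` threshold are hypothetical names, and the health check is simulated with a boolean flag rather than a real probe.

```python
from dataclasses import dataclass


@dataclass
class Service:
    """Hypothetical stand-in for one service in the service cluster."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def is_healthy(self) -> bool:
        # Assumption: a real probe (heartbeat, port check) would go here.
        return self.healthy

    def restart(self) -> None:
        # Simulated restart: count it and mark the service healthy again.
        self.restarts += 1
        self.healthy = True


def check_once(services, max_restarts=3):
    """Run one monitoring pass over all services.

    A failed service is restarted until it has used up `max_restarts`;
    after that it is reported as needing a switch to another server.
    Returns (restarted_names, switch_names).
    """
    restarted, switch = [], []
    for svc in services:
        if svc.is_healthy():
            continue
        if svc.restarts < max_restarts:
            svc.restart()
            restarted.append(svc.name)
        else:
            switch.append(svc.name)
    return restarted, switch
```

In a deployment, a loop or scheduler would call `check_once` at a fixed interval; the choice between restarting and switching servers after repeated failures is a design assumption here, not something the excerpt states.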


Abstract

The invention discloses a Hadoop cluster-based large-scale Web information extraction method and system, addressing the problem that a single node cannot meet the demands of large-scale Web information extraction. The method comprises the following steps: an aggregation processing node extracts the website seeds to be queried according to predetermined conditions, partitions them with load balancing according to the processing capacity of each query node, and transmits the seeds to each query node; each query node performs Web extraction locally on the seeds it receives and reports the results to the aggregation processing node; the aggregation processing node aggregates the reported information to obtain the large-scale Web information. With this method and system, massive data extraction is performed by a Hadoop cluster, and data are processed by a high-efficiency HBase-type in-memory database; extraction efficiency is greatly improved compared with a single machine and a traditional relational database, and reliability and scalability are high.
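The split, extract, and aggregate steps in the abstract can be sketched as follows. This is a single-process illustration under stated assumptions, not the patented implementation: the abstract does not say how capacity is measured or how the split is computed, so the proportional largest-remainder split, the function names (`split_seeds`, `extract_locally`, `run_query_node`, `aggregate`), and the placeholder extraction are all hypothetical. No Hadoop or HBase APIs are used.

```python
def split_seeds(seeds, capacities):
    """Partition seeds across query nodes in proportion to capacity.

    `capacities` maps node name -> positive processing-capacity weight.
    Uses the largest-remainder method so every seed is assigned once.
    """
    total = sum(capacities.values())
    if total <= 0:
        raise ValueError("total capacity must be positive")
    nodes = list(capacities)
    quotas = {n: len(seeds) * capacities[n] / total for n in nodes}
    counts = {n: int(quotas[n]) for n in nodes}
    leftover = len(seeds) - sum(counts.values())
    # Hand the remaining seeds to the nodes with the largest fractional part.
    by_remainder = sorted(nodes, key=lambda n: quotas[n] - counts[n], reverse=True)
    for n in by_remainder[:leftover]:
        counts[n] += 1
    assignment, start = {}, 0
    for n in nodes:
        assignment[n] = seeds[start:start + counts[n]]
        start += counts[n]
    return assignment


def extract_locally(seed):
    # Placeholder for real Web extraction: records only the seed itself.
    return {"seed": seed, "chars": len(seed)}


def run_query_node(node_seeds):
    """What one query node does: extract each assigned seed locally."""
    return [extract_locally(s) for s in node_seeds]


def aggregate(reports):
    """Aggregation node: merge per-node reports into one result list."""
    merged = []
    for records in reports.values():
        merged.extend(records)
    return merged


if __name__ == "__main__":
    seeds = [f"http://example.com/{i}" for i in range(7)]
    plan = split_seeds(seeds, {"node-a": 2, "node-b": 1})
    reports = {node: run_query_node(s) for node, s in plan.items()}
    print({node: len(s) for node, s in plan.items()}, len(aggregate(reports)))
```

In a real cluster the per-node work would run as distributed tasks and the reports would be written to shared storage; here the nodes are plain dictionary keys so the load-balancing arithmetic stays visible.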

Description

Technical field

[0001] The invention relates to the field of information retrieval in computer networks, and in particular to a large-scale Web information extraction method and system based on Hadoop clusters.

Background

[0002] With the rapid development of network information technology, the volume of data on the Web is growing exponentially, making the Web the largest data collection in the world. Information extraction from the large-scale Web has therefore long been a research focus for scholars at home and abroad. In research on large-scale Web information extraction technology, the early Google crawler was designed at Stanford University and consisted of five modules; URLs were read from a URL list and sent to the machines on which the crawlers ran. Using asynchronous I/O, hundreds of connections were maintained across the whole system, and the captured pages were compressed and stored on the storage server. However, such architecture lacks stabili...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

CPC: G06F16/951

Inventor: 施佺, 马松玉, 邵叶秦, 施振佺, 丁卫平, 徐露, 李冬冬

Owner: 华夏文广传媒集团股份有限公司
