Method and system for extracting Web information based on Nutch

An information extraction technology built around an information extraction module, applied in the field of computer Web information retrieval. It addresses problems such as low access efficiency and one-sided data extraction, and achieves the effects of saving user time and improving work efficiency.

Status: Inactive
Publication Date: 2015-04-15
NANTONG UNIVERSITY
2 Cites, 26 Cited by

AI Technical Summary

Problems solved by technology

[0004] Aiming at the existing large-scale web information extraction



Examples


Embodiment

[0047] The invention is further described below with reference to the accompanying drawings and specific embodiments:

[0048] Figure 1 is an overall schematic diagram of the Nutch-based Web information extraction system of the present invention. As shown in Figure 1, the system builds on the parallel computing and distributed storage capabilities of a Hadoop cluster: it uses the Nutch framework to fetch massive amounts of web information from the network, and distributes the complex, compute-intensive extraction process across multiple nodes of the cluster. The captured data is stored in distributed form in the HDFS file system or stored in MySQL, and is indexed by Solr for easy retrieval. As the test results show, the system of the present invention combines the Nutch framework with the MySQL database, realizes the crawling, indexing and retrieval of web pages, improves the efficiency of information retrieval, a...
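The extract-then-store data flow described in [0048] can be illustrated with a toy sketch. It is not the patented implementation: in the described system Nutch performs the fetching and parsing on a Hadoop cluster and the results go to HDFS or MySQL. Here, as loudly labeled assumptions, Python's standard `html.parser` stands in for the parsing step, an in-memory `sqlite3` database stands in for MySQL, and the names `TextExtractor`, `extract`, `open_store`, `store` and the `pages` table are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the page title and visible text, skipping <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.title = ""
        self.chunks = []
        self._skip_depth = 0
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # ignore whitespace and script/style content
        if self._in_title:
            self.title = text
        else:
            self.chunks.append(text)


def extract(html):
    """Return (title, body_text) for one fetched page."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, " ".join(parser.chunks)


def open_store():
    """Assumption: in-memory SQLite as a stand-in for the MySQL store."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE pages (url TEXT PRIMARY KEY, title TEXT, body TEXT)"
    )
    return conn


def store(conn, url, html):
    """Filter one page and persist the extracted fields, keyed by URL."""
    title, body = extract(html)
    conn.execute(
        "INSERT OR REPLACE INTO pages (url, title, body) VALUES (?, ?, ?)",
        (url, title, body),
    )
    conn.commit()
```

Keying the table on the URL means a re-crawled page overwrites its earlier row instead of duplicating it, which is one simple way to keep the store from growing with repeated crawls.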



Abstract

The invention discloses a system for extracting Web information based on Nutch. The system comprises an information extraction module, a storage module, an index module and a retrieval module. The information extraction module captures webpage data from the Internet through the Nutch framework and parses the data; the storage module stores the webpage extraction files obtained after the webpage data is filtered; the index module sends the webpage information collected by Nutch to Solr to build an index; and the retrieval module uses Solr to respond to user query requests and presents the query results to the user as an XML page. The response speed, running speed, stability and scalability of information extraction are improved, the storage space occupied by the program is reduced, and users are assured of obtaining effective information in time.
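The retrieval step in the abstract, in which Solr answers a query and returns XML, can be sketched roughly as follows. This is an illustration under assumptions, not the patent's code: the base URL, the core name `nutch`, the helper names `build_select_url` and `parse_solr_xml`, and the sample response are all hypothetical. The `select` handler with the `q`, `rows` and `wt=xml` parameters and the `<result name="response">` layout follow Solr's standard XML response format; whether the URL contains a core name depends on the Solr version and setup.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10):
    """Build a Solr select URL that asks for an XML response."""
    params = urlencode({"q": query, "rows": rows, "wt": "xml"})
    return f"{base_url}/solr/{core}/select?{params}"


def parse_solr_xml(xml_text):
    """Return (numFound, list of field dicts) from a Solr XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result[@name='response']")
    docs = []
    for doc in result.findall("doc"):
        fields = {}
        for field in doc:
            name = field.get("name")
            if field.tag == "arr":
                # multi-valued field: collect the text of each child element
                fields[name] = [item.text for item in field]
            else:
                fields[name] = field.text
        docs.append(fields)
    return int(result.get("numFound")), docs


# Hypothetical sample response, shaped like Solr's XML output.
SAMPLE = """<response>
  <lst name="responseHeader"><int name="status">0</int></lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="url">http://example.com/</str>
      <str name="title">Demo</str>
    </doc>
  </result>
</response>"""

url = build_select_url("http://localhost:8983", "nutch", "nutch")
num_found, docs = parse_solr_xml(SAMPLE)
```

In a live setup the XML would come from an HTTP GET on `url` rather than from the embedded sample; the sketch keeps the network call out so that it runs without a Solr server.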

Description

Technical field

[0001] The invention relates to the field of computer Web information retrieval, in particular to a Nutch-based Web information extraction method and system.

Background

[0002] With the rapid development of network information technology, the service functions provided by computer software are becoming more and more complete, so the data associated with each software program is expanding rapidly. Web pages have become the most important information resource on the Internet. However, a webpage contains a large amount of content irrelevant to its subject, so the main information of the page is often hidden among irrelevant content and structure, which limits the availability of web information. Meanwhile, the data on the Web is growing exponentially, and this rapid growth has made the Web the largest data collection in the world. Therefore, information extraction ba...


Application Information

IPC(8): G06F17/30
CPC: G06F16/951
Inventor: 施佺, 徐露, 丁卫泽, 程显毅, 丁卫平, 李冬冬, 孙鸿艳
Owner: NANTONG UNIVERSITY