Method and system for extracting Web information based on Nutch
The information extraction method and information extraction module, applied in the field of computer Web information retrieval, address problems such as low access efficiency and one-sided data extraction, and achieve the effects of saving user time and improving work efficiency.
Embodiment
[0047] This patent is further described in conjunction with the accompanying drawings and specific embodiments:
[0048] Figure 1 is an overall schematic diagram of the Nutch-based Web information extraction system of the present invention. As shown in Figure 1, the system builds on the parallel computing and distributed storage capabilities of a Hadoop cluster: it uses the Nutch framework to fetch massive amounts of web information from the network, and distributes the complex extraction process, which consumes a large amount of computing resources, across multiple nodes of the Hadoop cluster. The captured data is either stored in distributed form in the HDFS file system or stored in MySQL, and is indexed by Solr for easy retrieval. The test results show that the system of the present invention combines the Nutch framework with the MySQL database, realizes the crawling, indexing and retrieval of web pages, improves the efficiency of information retrieval, a...
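The data flow described in [0048] (fetch pages, extract in parallel, store structured records) can be illustrated with a minimal sketch. This is not the patented implementation and it does not use Nutch, Hadoop, HDFS, MySQL or Solr. As stated assumptions, a local thread pool stands in for the Hadoop nodes, an in-memory SQLite table stands in for the MySQL store, and the names `extract`, `run_pipeline` and `SAMPLE_PAGES` are hypothetical, introduced only for illustration.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser


class TitleAndTextParser(HTMLParser):
    """Collects the <title> text and the remaining text of one page."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data
        elif data.strip():
            self.text_parts.append(data.strip())


def extract(page):
    """Map step: turn one (url, html) pair into a (url, title, body) record."""
    url, html = page
    parser = TitleAndTextParser()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return url, parser.title.strip(), " ".join(parser.text_parts)


def run_pipeline(pages, workers=4):
    """Extract all pages in parallel, then store the records in a SQL table."""
    # Assumption: the thread pool is a local stand-in for cluster nodes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        records = list(pool.map(extract, pages))
    # Assumption: in-memory SQLite is a stand-in for the MySQL store.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE page (url TEXT PRIMARY KEY, title TEXT, body TEXT)")
    db.executemany("INSERT INTO page VALUES (?, ?, ?)", records)
    db.commit()
    return db


# Hypothetical crawled pages; in the described system these come from Nutch.
SAMPLE_PAGES = [
    ("http://example.com/a",
     "<html><head><title>Page A</title></head><body><p>Hello</p></body></html>"),
    ("http://example.com/b",
     "<html><head><title>Page B</title></head><body><p>World</p></body></html>"),
]

if __name__ == "__main__":
    db = run_pipeline(SAMPLE_PAGES)
    for row in db.execute("SELECT url, title, body FROM page ORDER BY url"):
        print(row)
```

Each page is processed independently of the others, which is what makes the extraction step suitable for distribution across nodes; the stored records could then be passed to a search index for retrieval.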