Webpage spider theme type search system based on improved cloud platform

A web spider theme search system built on cloud platform technology. It addresses the inability of traditional single-machine web spiders to meet performance and scalability requirements.

Pending Publication Date: 2021-04-02
荆门汇易佳信息科技有限公司

AI Technical Summary

Problems solved by technology

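In view of this, thematic search engines and web spider technologies for specific fields or topics have been widely used. However, with the exponential growth of network information, traditional web spiders that rely only on a single computer for crawling cannot meet the performance and scalability requirements of thematic search in a big data environment.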



Embodiment Construction

[0111] The technical solution of the improved cloud platform-based web spider theme search system provided by the present invention is further described below with reference to the accompanying drawings, so that those skilled in the art can better understand and implement the invention.

[0112] The main tasks of the present invention are twofold. First, it improves the link-structure-based web page analysis algorithm HITS and the topic similarity calculation based on the VSM vector space model, and proposes an improved web spider model algorithm that evaluates the comprehensive value of the information. Second, it proposes an improved task allocation algorithm for the cloud platform web spider that takes into account both uniform distribution and the load of each crawling sub-node, so as to improve system performance, optimize resource allocation, and improve the crawling rate and accuracy of the clou...
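The two ingredients named in paragraph [0112] can be illustrated with a minimal sketch. It uses the textbook HITS iteration and plain term-frequency cosine similarity, not the improved variants claimed by the invention, which this excerpt does not spell out. The function names and the weighting parameter `alpha` in `combined_score` are assumptions made only for illustration.

```python
import math
from collections import Counter


def cosine_similarity(doc_terms, topic_terms):
    """Cosine similarity of two term-frequency vectors (plain VSM)."""
    a, b = Counter(doc_terms), Counter(topic_terms)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def hits(graph, iterations=20):
    """Textbook HITS on a dict {page: [pages it links to]}."""
    pages = set(graph) | {q for targets in graph.values() for q in targets}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority of a page: sum of hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in graph.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub score of a page: sum of authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in graph.get(p, [])) for p in pages}
        # L2-normalise both score vectors so they stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth


def combined_score(authority, similarity, alpha=0.5):
    """Hypothetical linear blend of link-based and content-based evidence."""
    return alpha * authority + (1 - alpha) * similarity
```

A topic-focused crawler could rank candidate pages by `combined_score(auth[page], cosine_similarity(page_terms, topic_terms))` and fetch the highest-scoring ones first. A linear blend is only one possible way to combine the two signals.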


Abstract

The webpage spider theme search system based on an improved cloud platform improves two web page analysis algorithms, the link-structure-based HITS algorithm and the topic similarity calculation based on the VSM vector space model, and provides an improved web spider model algorithm. An overall framework model for a web spider on the Hadoop cloud platform is provided; the storage structure of the cloud platform web spider is designed and implemented on the HDFS file system, and each functional module, following the module division, is implemented with MapReduce. An improved task allocation algorithm is provided that takes into account both uniform allocation and the load of each crawling child node, improving the crawling efficiency and accuracy of the cloud platform web spider system. The results show that the Hadoop-based cloud platform web spider system provided and implemented by the invention is feasible and effective, can greatly improve the accuracy and efficiency of theme search, and can retrieve theme-related information comprehensively, quickly, and accurately.
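The abstract does not give the task allocation rule itself, so the following is only a sketch of one way to balance uniform allocation against node load. It hashes each URL's host to a preferred crawl node and falls back to the least-loaded node when the preferred one is well above the mean load. The function name `assign_urls` and the parameters `tolerance` and `slack` are hypothetical and do not come from the patent.

```python
import hashlib
from urllib.parse import urlparse


def assign_urls(urls, nodes, load, tolerance=1.25, slack=1):
    """Assign URLs to crawl nodes (illustrative, not the patented rule).

    nodes: list of node names.
    load:  dict mapping node name to its current number of pending tasks.
    """
    load = dict(load)  # work on a copy so the caller's dict is untouched
    assignment = {n: [] for n in nodes}
    for url in urls:
        # Uniform part: hash the host so pages of one site share a node.
        host = urlparse(url).netloc
        digest = hashlib.sha256(host.encode("utf-8")).hexdigest()
        preferred = nodes[int(digest, 16) % len(nodes)]
        # Load-aware part: divert when the preferred node is overloaded.
        mean_load = sum(load[n] for n in nodes) / len(nodes)
        if load[preferred] > tolerance * mean_load + slack:
            target = min(nodes, key=lambda n: load[n])
        else:
            target = preferred
        assignment[target].append(url)
        load[target] += 1
    return assignment
```

Hashing on the host keeps requests to one site on one node, which simplifies politeness limits, while the load check stops a single busy node from becoming a bottleneck. In a Hadoop deployment the same decision could be made in a custom partitioner, although the excerpt does not say whether the invention does so.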

Description

Technical field

[0001] The invention relates to a web spider theme search system, in particular to a web spider theme search system based on an improved cloud platform, and belongs to the technical field of theme search systems.

Background technique

[0002] With the increasing popularity and rapid development of Internet technology, and in the face of such large and diverse volumes of information, general-purpose search engines, although the main means of obtaining information, are far from meeting people's retrieval needs for information related to specific fields or topics. In view of this, thematic search engines and web spider technologies for specific fields or topics have been widely used. However, with the exponential growth of network information, traditional web spiders that rely only on a single computer for crawling cannot meet the performance and scalability requirements of thematic search in a big data environment, and the computing framework of the Hadoop cloud platform can solve this...


Application Information

IPC(8): G06F16/951, G06F16/955, G06F16/182
CPC: G06F16/182, G06F16/951, G06F16/955
Inventor: 扆亮海
Owner: 荆门汇易佳信息科技有限公司