A Dynamic Crawling Method Based on Viterbi Algorithm for Web Page Classification and Sorting

A Viterbi algorithm and web page classification technology, applied in the field of network data mining, that addresses the low accuracy and low efficiency of existing crawlers and allows topic-relevant web pages to be acquired more accurately and efficiently.

Active Publication Date: 2022-02-08
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0003] The present invention provides a dynamic crawler method for classifying and sorting web pages based on the Viterbi algorithm. The method filters out as many irrelevant web pages as possible and retains the topic websites the user requires, addressing the low precision and low efficiency of existing crawler methods.

Examples

Embodiment 1

[0048] Embodiment 1: As shown in Figures 1-9, a dynamic crawler method for classifying and sorting web pages based on the Viterbi algorithm comprises the following specific steps:

[0049] Step1. Obtain the link relationship network. First take any web page related to the topic as the seed URL, then crawl the hyperlinks of the seed page to obtain its child links, which yields the relationship graph between parent links and child links. The link structure flow diagram is shown in Figure 2;
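Step1 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the names `LinkExtractor`, `build_link_graph`, `fetch` and `max_depth`, and the in-memory `PAGES` dictionary that stands in for real HTTP requests, are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the absolute targets of <a href=...> tags on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative hrefs against the parent page URL.
                self.links.append(urljoin(self.base_url, value))


def build_link_graph(seed_url, fetch, max_depth=1):
    """Return {parent_url: [child_url, ...]} down to max_depth levels.

    `fetch` is a hypothetical callable mapping a URL to its HTML text.
    """
    graph = {}
    frontier = [seed_url]
    for _ in range(max_depth):
        next_frontier = []
        for parent in frontier:
            if parent in graph:
                continue  # already expanded
            parser = LinkExtractor(parent)
            parser.feed(fetch(parent))
            children = list(dict.fromkeys(parser.links))  # dedupe, keep order
            graph[parent] = children
            next_frontier.extend(children)
        frontier = next_frontier
    return graph


# Hypothetical in-memory "web" used instead of real network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="b.html">B</a>',
    "http://example.com/a": '<a href="/">home</a>',
}
graph = build_link_graph(
    "http://example.com/", lambda url: PAGES.get(url, ""), max_depth=2
)
```

Each key of `graph` is a parent link and each value is its list of child links, which is the parent-child relationship the step describes. In a real crawler, `fetch` would be replaced by an HTTP client with error handling and politeness delays.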

[0050] Step2. Calculate the value LV of webpage links;

[0051] Step2.1. Calculate the value LV of the web page link. The formula for calculating LV is:

[0052] Here, LN is the current number of incoming links of the web page. The number of incoming links is a dynamic value: as the crawl goes deeper, the in-link counts of some web pages increase and gradually approach the in-link counts of those pages in the real...
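The dynamic nature of LN can be sketched as follows. This only tracks the in-link count; the LV formula itself is not reproduced in this text, so it is deliberately not computed here. The class name `InLinkCounter` and its methods are hypothetical.

```python
from collections import defaultdict


class InLinkCounter:
    """Tracks LN, the number of distinct parents seen so far for each URL."""

    def __init__(self):
        self._parents = defaultdict(set)

    def add_edges(self, parent, children):
        """Record that `parent` links to each URL in `children`."""
        for child in children:
            if child != parent:  # ignore self-links
                self._parents[child].add(parent)

    def ln(self, url):
        """Current in-link count of `url`; it can grow as crawling continues."""
        return len(self._parents.get(url, ()))


counter = InLinkCounter()
counter.add_edges("seed", ["a", "b"])
ln_b_first = counter.ln("b")   # only "seed" is known to link to "b"
counter.add_edges("a", ["b", "a"])
ln_b_second = counter.ln("b")  # "a" has now been crawled and also links to "b"
```

Storing parents as a set means a page that links to the same child several times is counted once, which is one reasonable reading of "number of incoming links"; the source does not say whether repeated links count separately.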

Abstract

The invention relates to a dynamic crawler method for classifying and sorting web pages based on the Viterbi algorithm, and belongs to the technical field of network data mining. Given a seed URL, the method crawls the seed URL as the parent link and obtains its child links. It calculates the number of incoming links of each child link from the link structure, then fetches the content of each child page and calculates the similarity between that content and the topic. It then calculates a comprehensive evaluation value for each web page, eliminates the pages with low evaluation values, and uses the remaining pages as parent links from which to crawl new links. This process repeats until no new web pages are added, at which point crawling stops. Through this dynamic web crawler based on the Viterbi algorithm, a user who specifies a topic can efficiently and accurately obtain the important websites on that topic.
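The loop described in the abstract can be sketched in Python under strong assumptions. The text available here does not give the formula for the comprehensive evaluation value, nor how the Viterbi algorithm enters the method, so this sketch substitutes a plain weighted sum of a normalized in-link count and a bag-of-words cosine similarity, and it does not implement Viterbi decoding. The names `crawl`, `cosine_similarity`, `alpha`, `threshold` and the in-memory `PAGES` data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(text, topic):
    """Bag-of-words cosine similarity (an assumed stand-in measure)."""
    a, b = Counter(text.lower().split()), Counter(topic.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def crawl(seeds, pages, topic, threshold=0.6, alpha=0.5):
    """pages: hypothetical in-memory web, url -> (text, [outgoing links])."""
    parents = defaultdict(set)  # url -> set of parents seen so far
    seen = set(seeds)
    kept = set(seeds)
    frontier = list(seeds)
    while frontier:  # stops once a round adds no new pages
        candidates = []
        for parent in frontier:
            for child in pages.get(parent, ("", []))[1]:
                parents[child].add(parent)
                if child not in seen:
                    seen.add(child)
                    candidates.append(child)
        max_in = max((len(p) for p in parents.values()), default=1)
        frontier = []
        for url in candidates:
            text = pages.get(url, ("", []))[0]
            # Assumed evaluation value: weighted sum, NOT the patent's formula.
            score = alpha * len(parents[url]) / max_in + (
                1 - alpha
            ) * cosine_similarity(text, topic)
            if score >= threshold:  # low-scoring pages are eliminated
                kept.add(url)
                frontier.append(url)
    return kept


PAGES = {
    "seed": ("web crawler topic", ["a", "b"]),
    "a": ("focused web crawler ranking", ["c", "b"]),
    "b": ("knitting yarn fabric", []),
    "c": ("web crawler link", []),
}
result = crawl(["seed"], PAGES, "web crawler")
```

With this toy data the off-topic page `b` scores below the threshold and is never expanded, while `a` and `c` survive and are used as parents in later rounds. The weight `alpha` and the threshold would need tuning on real data.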

Description

Technical field

[0001] The invention relates to a dynamic crawler method for classifying and sorting web pages based on the Viterbi algorithm, and belongs to the technical field of network data mining.

Background

[0002] With the rapid development of the Internet, network information resources have expanded rapidly. According to statistics from CNNIC (China Internet Network Information Center), as of December 2016 there were 4.82 million Chinese websites and 236 billion web pages. Searching for the required information efficiently and quickly has therefore become an important problem for network users. Because of the sheer volume of network information resources and the noise that crawlers pick up, traditional general-purpose crawlers cannot meet users' needs. Topic-oriented search engines have consequently become a new research direction. A topic crawler is given a specific topic and crawls in a targeted manner, which greatly reduces the n...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/951, G06F16/9535, G06F16/955
CPC: G06F16/9535
Inventor: 邵玉斌, 张鸿飞, 龙华, 杜庆治
Owner: KUNMING UNIV OF SCI & TECH