Viterbi algorithm based web page sorting dynamic crawling method

A technique combining the Viterbi algorithm with web page classification, applied in computing and special-purpose data processing. It addresses the low efficiency and low precision of existing crawlers, improving both the efficiency and the accuracy with which relevant pages are acquired.

Active Publication Date: 2018-05-08
KUNMING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0003] The present invention provides a dynamic crawler method for classifying and sorting web pages based on the Viterbi algorithm. It filters out as many irrelevant web pages as possible and screens out the topic websites that users need, addressing the low precision and low efficiency of existing crawler methods.

Examples

Embodiment 1

[0048] Embodiment 1: As shown in Figures 1-9, a dynamic crawler method for classifying and sorting web pages based on the Viterbi algorithm comprises the following steps:

[0049] Step1. Obtain the link relationship network. First take any web page related to the topic as the seed URL, then crawl the hyperlinks of the seed page to obtain its outbound child links, which gives the relationship graph between parent links and child links. The link structure is shown in Figure 2;
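Step1 can be illustrated with a minimal Python sketch, under stated assumptions. The names `LinkExtractor`, `build_link_graph`, `fetch`, `max_pages` and the in-memory `PAGES` table are hypothetical and do not come from the patent; the breadth-first order is also an assumption. The sketch only shows how a parent-to-child link graph could be collected starting from a seed URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the parent page's URL.
                self.links.append(urljoin(self.base_url, value))


def build_link_graph(seed_url, fetch, max_pages=100):
    """Breadth-first crawl from seed_url; returns {parent: [child, ...]}.

    fetch is a caller-supplied function url -> HTML string (hypothetical).
    """
    graph = {}
    queue = [seed_url]
    while queue and len(graph) < max_pages:
        parent = queue.pop(0)
        if parent in graph:
            continue  # already crawled
        parser = LinkExtractor(parent)
        parser.feed(fetch(parent))
        # dict.fromkeys removes duplicate links while keeping their order.
        children = list(dict.fromkeys(parser.links))
        graph[parent] = children
        queue.extend(c for c in children if c not in graph)
    return graph


# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://example.com/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.com/a": '<a href="/b">B</a>',
    "http://example.com/b": '<a href="/">home</a>',
}

graph = build_link_graph("http://example.com/", lambda url: PAGES.get(url, ""))
```

Passing `fetch` as a parameter keeps the graph-building logic separate from network access, so a real HTTP client could be substituted for the in-memory table.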

[0050] Step2. Calculate the value LV of webpage links;

[0051] Step2.1. Calculate the value LV of the web page link. The formula for calculating LV is:

[0052] Here, LN is the current number of incoming links of the web page. This number is dynamic: as the crawl goes deeper, the number of incoming links of some web pages increases and gradually approaches the number of incoming links of web pages in the real...
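The dynamic nature of LN can be sketched as follows. This is only an illustration of how an incoming-link count grows as more parent pages are crawled; it does not compute LV, whose formula is not reproduced in this excerpt. The class name `InLinkCounter`, its methods, and the choice to ignore self-links are assumptions.

```python
from collections import defaultdict


class InLinkCounter:
    """Tracks LN: the number of distinct crawled pages linking to each URL."""

    def __init__(self):
        self._parents = defaultdict(set)

    def add_page(self, parent, children):
        """Record the out-links of a newly crawled parent page."""
        for child in children:
            if child != parent:  # assumption: self-links are not counted
                self._parents[child].add(parent)

    def ln(self, url):
        """Current incoming-link count for url (0 if never linked to)."""
        return len(self._parents.get(url, ()))


counter = InLinkCounter()
counter.add_page("seed", ["a", "b"])
ln_first = counter.ln("b")   # only "seed" links to "b" so far
counter.add_page("a", ["b"])
ln_second = counter.ln("b")  # the count rises once "a" has been crawled
```

Storing parents in a set means that repeated links from the same page are counted once.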

Abstract

The invention belongs to the technical field of network data mining and relates to a dynamic crawling method for sorting web pages based on the Viterbi algorithm. The method includes: providing a seed URL (uniform resource locator) and taking it as a parent link to crawl downwards and acquire its outbound sublinks; calculating the number of inbound links of each sublink on the basis of the link structure; acquiring the web page content of each sublink and calculating its similarity to the theme; calculating a comprehensive assessment value for each web page, eliminating web pages with low assessment values, and taking the remaining web pages as parent links to crawl downwards and obtain new links; and repeating the process until no new web page is added during crawling, at which point crawling stops. The advantage of the method is that, for a given theme, a user can efficiently and accurately acquire the important websites under that theme through Viterbi algorithm based dynamic web crawling.
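The loop described in the abstract can be sketched roughly as below. This is a simplified illustration, not the patented method: the bag-of-words cosine similarity, the weighted sum used as the comprehensive assessment value, and the parameters `alpha` and `threshold` are all assumptions chosen for the example, and the sketch does not attempt to reproduce the Viterbi-specific details, which this excerpt does not spell out. The `links` and `content` dictionaries are hypothetical stand-ins for real page fetching.

```python
import math
import re
from collections import Counter


def cosine_similarity(text, topic):
    """Cosine similarity between bag-of-words vectors of two strings."""
    a = Counter(re.findall(r"\w+", text.lower()))
    b = Counter(re.findall(r"\w+", topic.lower()))
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def focused_crawl(seed, topic, links, content, threshold=0.6, alpha=0.5):
    """Layer-by-layer crawl that drops low-scoring pages before expanding.

    links:   dict url -> list of child urls (stand-in for fetching pages)
    content: dict url -> page text
    threshold, alpha: hypothetical tuning parameters, not from the source
    """
    kept = {seed}
    in_parents = {}          # url -> set of parents seen so far (inbound links)
    frontier = [seed]
    while frontier:          # stops when a layer adds no new page
        candidates = []
        for parent in frontier:
            for child in links.get(parent, []):
                in_parents.setdefault(child, set()).add(parent)
                if child not in kept and child not in candidates:
                    candidates.append(child)
        max_in = max((len(p) for p in in_parents.values()), default=1)
        frontier = []
        for url in candidates:
            link_score = len(in_parents[url]) / max_in
            sim = cosine_similarity(content.get(url, ""), topic)
            # Assumed combination rule: weighted sum of the two signals.
            score = alpha * link_score + (1 - alpha) * sim
            if score >= threshold:
                kept.add(url)
                frontier.append(url)
    return kept


links = {"seed": ["p1", "p2"], "p1": ["p3"], "p2": [], "p3": ["seed"]}
content = {
    "p1": "viterbi crawler topic pages",
    "p2": "cooking recipes pasta",
    "p3": "topic crawler ranking",
}
result = focused_crawl("seed", "topic crawler", links, content)
```

In this toy data the off-topic page `p2` falls below the threshold and is never expanded, while `p1` and `p3` are kept and crawled further.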

Description

Technical field

[0001] The invention relates to a dynamic crawler method for classifying and sorting web pages based on the Viterbi algorithm, and belongs to the technical field of network data mining.

Background

[0002] With the rapid development of the Internet, network information resources have expanded rapidly. According to statistics from CNNIC (China Internet Network Information Center), as of December 2016 there were 4.82 million Chinese websites and 236 billion web pages. Searching for required information efficiently and quickly has therefore become an important problem for network users. Because of the sheer volume of network information resources and the noise that crawlers collect, traditional general-purpose crawlers cannot meet users' needs. Topic-oriented search engines have therefore become a new research direction. The theme crawler sets a specific theme and crawls in a targeted manner, which greatly reduces the n...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9535
Inventors: 邵玉斌, 张鸿飞, 龙华, 杜庆治
Owner: KUNMING UNIV OF SCI & TECH