
A Topic Crawler Method Based on Incremental Bayesian Algorithm

An incremental Bayesian algorithm combined with topic crawler technology, applied in fields such as computing, computer components, and instruments. It addresses three problems: crawlers that fail to meet users' needs for intelligent retrieval, errors in the relevance calculation, and unsuitability for scenarios where new data arrives in a steady stream. Stated benefits: improved predictive power, increased accuracy, and an increased impact-weight effect.

Active Publication Date: 2020-04-14
NANJING UNIV
Cites: 5 | Cited by: 0

AI Technical Summary

Problems solved by technology

[0003] With the scale of network information growing exponentially, traditional web crawlers are limited in collection speed, value density, and domain specialization. The web pages they return are usually accompanied by a large amount of worthless information, which cannot meet users' needs for intelligent retrieval.
[0004] Current topic crawler technology usually computes the priority of a link by weighting the relevance of its anchor text and the relevance of the web page text. In addition, when the classification algorithm computes the relevance between text and topic, it ignores the fact that the distribution of the original web page sample space changes as new data arrives. If the relevance is always computed with the same classification model, the results carry large errors, and the approach is unsuitable for real scenarios where new data arrives continuously.
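The weighted link priority described in [0004] can be illustrated with a minimal sketch. The function name `link_priority`, the parameter `anchor_weight`, and the assumption that both relevance scores lie in [0, 1] are hypothetical choices made for illustration; the text does not give the actual weights or formula.

```python
def link_priority(anchor_relevance: float,
                  page_relevance: float,
                  anchor_weight: float = 0.5) -> float:
    """Combine two relevance scores into one link priority.

    Both scores are assumed to lie in [0, 1]; anchor_weight is a
    hypothetical tuning parameter, not a value taken from the patent.
    """
    if not 0.0 <= anchor_weight <= 1.0:
        raise ValueError("anchor_weight must be in [0, 1]")
    # Convex combination: the remaining weight goes to the page text.
    return anchor_weight * anchor_relevance + (1.0 - anchor_weight) * page_relevance
```

With a fixed classifier behind the two relevance scores, this priority inherits whatever error the classifier accumulates as the page distribution shifts, which is the weakness the paragraph points out.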



Embodiment Construction

[0039] To make the technical content of the present invention easier to understand, specific embodiments are described below with reference to the accompanying drawings.

[0040] Figure 1 shows the flow chart of incremental Bayesian algorithm training. A Bayesian classifier with incremental learning makes full use of the information in new samples: the original classifier predicts and classifies each new sample, and whether that classification is correct is used to improve the model.
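The incremental training idea in [0040] can be sketched as a multinomial Naive Bayes classifier whose counts are updated one sample at a time. This is only an illustration under stated assumptions: the class `IncrementalNaiveBayes`, the helper `train_incrementally`, the Laplace smoothing parameter `alpha`, and the rule of folding every labelled incremental sample into the counts are hypothetical, since the excerpt does not spell out the exact update rule.

```python
import math
from collections import Counter, defaultdict


class IncrementalNaiveBayes:
    """Multinomial Naive Bayes whose counts can grow sample by sample."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                        # Laplace smoothing (assumed)
        self.class_counts = Counter()             # documents seen per class
        self.word_counts = defaultdict(Counter)   # token counts per class
        self.vocabulary = set()

    def update(self, tokens, label):
        """Fold one labelled sample into the counts without retraining."""
        self.class_counts[label] += 1
        self.word_counts[label].update(tokens)
        self.vocabulary.update(tokens)

    def predict(self, tokens):
        """Return the class with the highest smoothed log-posterior."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary) or 1
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + self.alpha * vocab_size
            score = math.log(doc_count / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_incrementally(model, initial_set, incremental_set):
    """Train on the initial set, then classify and absorb each new sample.

    Returns how many incremental samples the current model classified
    correctly before they were added to it.
    """
    for tokens, label in initial_set:
        model.update(tokens, label)
    correct = 0
    for tokens, label in incremental_set:
        if model.predict(tokens) == label:
            correct += 1
        model.update(tokens, label)
    return correct
```

Because the model stores only counts, absorbing a new sample costs time proportional to its length, so the classifier can follow a changing page distribution without being rebuilt from the full training set.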

[0041] Figure 2 shows the structure of the topic crawler based on the incremental Bayesian algorithm in this embodiment of the present invention. Its main modules are a web page repository, a web page downloader, a web page parser, a classifier, a link priority queue, and an invalid link filter.
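Two of the modules listed in [0041], the link priority queue and the invalid link filter, can be sketched together. The class `LinkPriorityQueue` and its validity rule (only `http` and `https` URLs with a host are accepted) are assumptions made for illustration; the excerpt does not define what counts as an invalid link.

```python
import heapq
from urllib.parse import urlparse


class LinkPriorityQueue:
    """Max-priority queue of links that rejects invalid and repeated URLs."""

    def __init__(self):
        self._heap = []       # entries are (-priority, insertion order, url)
        self._seen = set()    # every URL ever accepted
        self._counter = 0     # tie-breaker that keeps insertion order

    @staticmethod
    def is_valid(url: str) -> bool:
        """Assumed filter: keep only http(s) URLs that name a host."""
        parts = urlparse(url)
        return parts.scheme in ("http", "https") and bool(parts.netloc)

    def push(self, url: str, priority: float) -> bool:
        """Queue the link; return False if it was filtered out."""
        if not self.is_valid(url) or url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1
        return True

    def pop(self):
        """Remove and return (url, priority) for the best queued link."""
        neg_priority, _, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)
```

Filtering at insertion time keeps the queue small and means the downloader never receives a `mailto:` or `javascript:` link or a URL it has already been given.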

[0042] Figure 3 shows the workflow of the topic crawler based on the incremental Bayesian algorithm. The inve...


Abstract

A topic crawling method based on an incremental Bayesian algorithm, comprising a step of training a Bayesian classifier using the idea of incremental learning and a step of online topic crawling based on the incremental Bayesian algorithm. The method inputs an initial training set and an incremental training set; performs word segmentation and other preprocessing on both sets; trains an initial classifier from the initial training set according to the Naive Bayes principle; uses the initial classifier to classify the samples in the incremental training set and updates the classification model according to the classification results; initializes the priority queue, the visited link set, and the incremental Bayesian classifier, and adds the initial web page links to the priority queue; and checks whether the web page links contain topic keywords, updating the incremental Bayesian classification model if they do. Each time, the web page with the highest priority in the priority queue is selected for download, and the above steps are repeated until the conditions are met.
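The online crawling loop in the abstract can be sketched as follows. Everything named here is an assumption for illustration: `crawl` takes a hypothetical `fetch(url)` callback returning `(text, links)` or `None`, a `relevance(anchor_text)` scoring function standing in for the classifier, and an `on_relevant(text)` callback standing in for the model update. The abstract is terse about where the keyword check is applied; this sketch applies it to the downloaded page text, and it uses a page budget as the stopping condition.

```python
import heapq


def crawl(seed_urls, fetch, relevance, on_relevant, topic_keywords, max_pages=10):
    """Best-first crawl: always download the highest-priority queued link.

    Returns the URLs in the order they were downloaded.
    """
    # Seeds start with the top priority; heapq is a min-heap, so negate.
    frontier = [(-1.0, i, url) for i, url in enumerate(seed_urls)]
    heapq.heapify(frontier)
    counter = len(frontier)   # tie-breaker for equal priorities
    visited = set()
    order = []
    while frontier and len(order) < max_pages:
        _, _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        if page is None:      # download failed or page is missing
            continue
        text, links = page
        order.append(url)
        # Assumed trigger: a page mentioning a topic keyword updates the model.
        if any(keyword in text.lower() for keyword in topic_keywords):
            on_relevant(text)
        for link_url, anchor_text in links:
            if link_url not in visited:
                heapq.heappush(frontier, (-relevance(anchor_text), counter, link_url))
                counter += 1
    return order
```

Passing the downloader, scorer, and model update as callbacks keeps the loop independent of any particular classifier, so an incrementally trained Naive Bayes model can be plugged in for both `relevance` and `on_relevant`.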

Description

Technical field

[0001] The present invention relates to a topic crawler technology based on an incremental Bayesian algorithm, especially suitable for application scenarios that require real-time incremental crawling of web pages.

Background

[0002] The size and complexity of the network pose many challenges for obtaining web page information. Traditional web crawlers are programs or scripts that automatically fetch information on the World Wide Web according to certain rules, spreading gradually from the initial web page links to the entire Internet. Their main purpose is to obtain a large amount of Internet data within a given period of time.

[0003] With network information growing exponentially, traditional web crawlers are limited in collection speed, value density, and domain specialization. The web pages they return are usually accompanied by a large amount of worthless information, which cannot meet users' needs for intelligent retrieval.

[0004] The cu...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/35, G06F16/9535, G06F16/951, G06K9/62
CPC: G06F18/24155
Inventor: 张雷, 王姗姗, 许磊, 吴和生, 陆恒杨
Owner: NANJING UNIV