A Method of Topic Crawling Based on Improved Shark Search

A topic crawling technology applied in the fields of network data indexing, network data retrieval, and other database retrieval. It addresses problems such as unsatisfactory retrieval results and unretrievable data, with the effects of reducing the error rate, improving crawling coverage, and mitigating the myopia problem.

Active Publication Date: 2021-05-04
NANJING UNIV
5 Cites, 0 Cited by

AI Technical Summary

Problems solved by technology

Search engines have certain limitations. Traditional search engines cover only about 40% of network resources, so most data cannot be retrieved. Users with different backgrounds often have different retrieval needs, and when they search for specific aspects of content, the results returned contain a large number of web pages they are not interested in. For a specific field, the retrieval results are often unsatisfactory.

Examples

Embodiment Construction

[0082] The present invention is further illustrated below with reference to the accompanying drawings and specific embodiments. It should be understood that these examples only illustrate the invention and are not intended to limit its scope. After reading the present disclosure, modifications of the invention in various equivalent forms made by those skilled in the art all fall within the scope defined by the appended claims of the present application.

[0083] A topic crawling method based on improved shark search constructs topic word vectors by introducing word vectors and a topic model, thereby expanding the semantics of the topic words. It improves the TF-IDF algorithm by combining it with the semi-structured features of the web page, extracts the keywords of the page, and transforms the relevance between the web page and the topic into the relevance between the page keywords and the topic words. On this basis, the webpag...
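As a rough illustration of the idea in [0083], the sketch below weights term frequency by the page field a token appears in, picks the top-scoring keywords, and scores page relevance as the similarity between keyword vectors and topic word vectors. The field names, the weights in `FIELD_WEIGHTS`, the smoothed IDF formula, and the "best match, then average" relevance rule are all assumptions made for this example; the visible text of the patent does not specify them.

```python
import math
from collections import Counter

# Hypothetical field weights: the source only says that semi-structured
# page features are used, not which fields or which weights.
FIELD_WEIGHTS = {"title": 3.0, "heading": 2.0, "body": 1.0}


def weighted_tf(page):
    """page maps a field name to a list of tokens; returns normalized weighted TF."""
    tf = Counter()
    for field, tokens in page.items():
        weight = FIELD_WEIGHTS.get(field, 1.0)
        for token in tokens:
            tf[token] += weight
    total = sum(tf.values())
    return {t: c / total for t, c in tf.items()} if total else {}


def idf(corpus):
    """Smoothed inverse document frequency over a list of pages (an assumed variant)."""
    n = len(corpus)
    df = Counter()
    for page in corpus:
        seen = set()
        for tokens in page.values():
            seen.update(tokens)
        df.update(seen)
    return {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}


def top_keywords(page, idf_table, k=3):
    """Return the k tokens with the highest weighted TF-IDF (ties broken alphabetically)."""
    tf = weighted_tf(page)
    scores = {t: tf[t] * idf_table.get(t, 1.0) for t in tf}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def relevance(keywords, topic_words, vectors):
    """Average, over keywords, of the best cosine match against any topic word."""
    sims = []
    for kw in keywords:
        if kw not in vectors:
            continue
        best = max(
            (cosine(vectors[kw], vectors[t]) for t in topic_words if t in vectors),
            default=0.0,
        )
        sims.append(best)
    return sum(sims) / len(sims) if sims else 0.0
```

In practice the `vectors` table would come from a trained word-embedding model, and the topic words from a topic model; here it is just a plain dictionary from word to list of floats so the sketch stays self-contained.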

Abstract

The invention discloses a topic crawling method based on improved shark search, comprising the following steps: 1) a seed URL configuration and keyword configuration stage; 2) a web page download stage; 3) a topic discrimination stage; 4) a crawler search stage: (a) calculating the link content score and the URL clustering score; (b) judging whether the parent web page is a hub page; (c) calculating the search depth of the link; (d) adding the link to the URL priority queue and adjusting its position in the queue according to the link score and search depth. By using topic word vectors, a URL clustering algorithm, and hub page discrimination, the invention addresses the problems of inaccurate topic discrimination and insufficient crawling coverage in topic crawlers.
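The crawler search stage in step 4 can be pictured with a small priority-queue sketch. The depth rule for relevant and irrelevant parents follows the classic shark-search idea (reset the depth budget after a relevant page, decrement it otherwise). Everything else is an assumption for illustration: the constants `MAX_DEPTH` and `ALPHA`, the linear blend in `link_score`, and the choice to leave the depth unchanged when the parent is a hub page are not taken from the patent text.

```python
import heapq
import itertools

MAX_DEPTH = 3  # hypothetical depth budget granted after a relevant page
ALPHA = 0.7    # hypothetical weight of content score vs. URL clustering score


def link_score(content_score, cluster_score, alpha=ALPHA):
    """Blend the two scores from step (a); a linear mix is assumed here."""
    return alpha * content_score + (1 - alpha) * cluster_score


def child_depth(parent_depth, parent_relevant, parent_is_hub):
    """Step (c): depth budget for links found on a parent page."""
    if parent_relevant:
        return MAX_DEPTH        # shark search: relevant parent resets the budget
    if parent_is_hub:
        return parent_depth     # assumption: hub pages do not consume budget
    return parent_depth - 1     # shark search: irrelevant parent shrinks it


class Frontier:
    """Step (d): URL priority queue ordered by link score, then search depth."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps insertion order
        self._seen = set()

    def push(self, url, score, depth):
        """Queue a URL unless its depth budget is spent or it was already seen."""
        if depth <= 0 or url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so negate score and depth to pop the largest first.
        heapq.heappush(self._heap, (-score, -depth, next(self._counter), url))
        return True

    def pop(self):
        """Return (url, score, depth) of the highest-priority link."""
        neg_score, neg_depth, _, url = heapq.heappop(self._heap)
        return url, -neg_score, -neg_depth

    def __len__(self):
        return len(self._heap)
```

A crawl loop would pop a URL, download and classify the page, then call `child_depth` and `link_score` for each outgoing link before pushing it back onto the `Frontier`.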

Description

Technical field

[0001] The invention relates to a topic crawling method based on improved shark search, which addresses the problems of inaccurate topic discrimination and low crawling coverage in topic crawling systems.

Background art

[0002] With the rapid development of network and mobile network technology, the Internet penetration rate continues to increase. As of December 2018, the number of Internet users in China had reached 820 million, and the Internet penetration rate was 59.6%. According to the 43rd "Statistical Report on Internet Development in China" released by the China Internet Network Information Center in 2019, the total number of domain names in China is 37.928 million, of which 21.243 million are ".CN" domain names, an increase of 31% compared with 2015. The explosive growth of Internet pages not only allows information to spread more quickly, but also satisfies the various information needs of users. On the other hand, th...

Application Information

Patent Type & Authority: Patents (China)
IPC: IPC(8): G06F16/951
CPC: G06F16/951
Inventor: 吴骏, 谈志文, 张哲成, 王崇骏
Owner: NANJING UNIV