Webpage classification recognition method based on comprehensive subject term vertical search and focused crawler

A focused-crawler and web page classification technology, applied in the field of web search engines, that addresses the low recognition rate of web pages, the curse of dimensionality, and the weak ability of existing crawlers to obtain structured information. Its stated effects are high application value and improved efficiency.

Inactive Publication Date: 2017-05-10
HUAIHAI INST OF TECH

AI Technical Summary

Problems solved by technology

[0016] This method faces several difficulties: it is hard for the focused crawler to select, from the queue of URLs to be crawled, those that are closely related to the topic; during URL extraction, the web crawler uses search strategies such as depth-first and breadth-first, which easily lead to the "curse of dimensionality" problem; many existing open-source crawler systems are weak at obtaining structured information from crawled webpages; and existing focused-crawler strategies adapt poorly to dynamic changes in the content and structure of webpages.
To sum up, traditional focused-crawler technology has a low recognition rate for webpages of different categories, so another method is needed.

Examples

Embodiment

[0075] The invention proposes a technical framework capable of effectively identifying various URLs in dynamic webpages, and provides a detailed algorithm. The system is divided into three layers, from top to bottom: acquisition layer, analysis layer and presentation layer.

[0076] 1. Web page data collection layer

[0077] Function: The main function of this layer is to realize the collection of dynamic webpage data and hand it over to the upper layer for content analysis.

[0078] Interface: This layer is the interface between the focused crawler and the network, and is responsible for providing the web page source code to the upper layer as string input.
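
The collection layer's contract (a URL goes in, a source-code string comes out) can be illustrated with a minimal Python sketch. It is not the patent's implementation: the names `fetch_source` and `collect`, the `User-Agent` string, and the injectable `fetch` parameter are all assumptions made for illustration.

```python
from typing import Callable
from urllib.request import Request, urlopen


def fetch_source(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its source as a string (hypothetical helper)."""
    request = Request(url, headers={"User-Agent": "focused-crawler-sketch"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def collect(url: str, fetch: Callable[[str], str] = fetch_source) -> dict:
    """Collection layer: pass the raw page source up to the analysis layer."""
    return {"url": url, "source": fetch(url)}
```

Passing the fetcher as a parameter keeps the layer boundary explicit and lets the analysis layer be exercised with canned page source instead of live network access.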

[0079] 2. Web page content analysis layer

[0080] Function: This layer is the core layer of the entire design. It mainly analyzes the content of the pages collected by the web page data collection layer, obtains effective hyperlinks according to the weight of the subject words, and builds the URL queue sequence list to be crawle...
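
As a rough illustration of the step described in [0080], the sketch below extracts hyperlinks from page source, scores each one by the weights of the subject words found in its anchor text, and returns the URLs in priority order. The scoring rule (a plain sum of weights with substring matching), the `min_score` cutoff, and every name in the code are assumptions; the source text does not give the patent's actual weighting formula.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect (absolute URL, anchor text) pairs from page source."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def link_score(anchor_text, weights):
    """Sum the weights of subject words that occur in the anchor text.

    Assumption: simple case-insensitive substring matching.
    """
    text = anchor_text.lower()
    return sum(w for word, w in weights.items() if word in text)


def build_queue(source, base_url, weights, min_score=0.0):
    """Return URLs ordered from highest to lowest score, dropping weak links."""
    parser = LinkExtractor(base_url)
    parser.feed(source)
    heap = []
    for order, (url, text) in enumerate(parser.links):
        score = link_score(text, weights)
        if score > min_score:
            # Negate the score because heapq is a min-heap; `order` breaks ties.
            heapq.heappush(heap, (-score, order, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]
```

With hypothetical weights such as `{"crawler": 2.0, "vertical search": 3.0}`, a link whose anchor text mentions both terms is queued ahead of one that mentions only "crawler", and a link that mentions neither is discarded.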

Abstract

The invention discloses a webpage classification and recognition method based on comprehensive subject-term vertical search and a focused crawler, belonging to the technical field of webpage search engines. The method addresses webpage classification and recognition in a subject-term vertical search engine where webpages change dynamically, and focuses on judging whether a dynamically changing webpage is related to a subject term. By computing the subject-term correlation degree of a webpage, URLs highly related to the comprehensive subject terms are screened out and placed in the crawl queue. Classification information for webpages is obtained through vertical search and focused-crawler technologies, and a webpage classification recognition model and algorithm are designed. Recognizing dynamically changing webpages yields the different categories of URLs, provides users with accurate webpage search, and can further provide the webpage classification of an unknown URL. The method has broad significance and high application value for the classification and recognition of dynamic webpages.
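
The abstract does not state how the subject-term correlation degree is computed. One common stand-in, shown here purely as an assumption, is the cosine similarity between a page's term counts and a vector of subject-term weights, followed by a threshold test that decides which URLs enter the crawl queue. The function names, the single-word tokenization, and the `0.3` threshold are all hypothetical.

```python
import math
import re
from collections import Counter


def correlation(page_text, weights):
    """Cosine similarity between page term counts and subject-term weights.

    Assumption: subject terms are single words matched after lowercasing.
    """
    counts = Counter(re.findall(r"\w+", page_text.lower()))
    dot = sum(counts[term] * w for term, w in weights.items())
    norm_page = math.sqrt(sum(c * c for c in counts.values()))
    norm_topic = math.sqrt(sum(w * w for w in weights.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0
    return dot / (norm_page * norm_topic)


def screen(pages, weights, threshold=0.3):
    """Keep URLs whose page text scores at or above the (assumed) threshold."""
    return [url for url, text in pages.items()
            if correlation(text, weights) >= threshold]
```

A page that never mentions a subject term scores zero and is screened out, while a page dominated by subject terms scores close to one.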

Description

Technical field

[0001] The invention relates to the technical field of webpage search engines, in particular to a method for classifying and identifying webpages based on vertical search of comprehensive subject words and focused crawlers.

Background technique

[0002] With the increasing popularity of vertical search engines, the focused crawler, a key technology of vertical search engines, is becoming more and more important. A focused crawler is a program that automatically downloads web pages. It selectively visits web pages and related links on the World Wide Web according to an established crawling target to obtain the required information. The main object the crawler processes is the URL: it obtains the required file content and then processes it further.

[0003] With the rapid growth of the Internet, the amount of information on the Internet is also growing explosively. People pay special attention to how to obtain effective information from the massive amount of...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/9566, G06F16/951
Inventor: 掌明, 卢艳宏, 杨瑞, 樊纪山, 王经卓, 宋永献, 孙巧榆, 张金学, 洪露
Owner: HUAIHAI INST OF TECH