
A method and system for quickly identifying web page types through links

A method and system for identifying web page types, applied in the field of network communication, that addresses the low generality and heavy system resource usage of existing approaches and improves work efficiency.

Active Publication Date: 2017-11-24
SHANGHAI ZHANGMEN TECH

AI Technical Summary

Problems solved by technology

However, this method must first download and analyze the web page, which occupies a lot of system resources. It also requires the link to contain a specified characteristic string in order to be identified, so its generality is poor.



Examples


Embodiment Construction

[0031] To give a more specific understanding of the technical content, characteristics and effects of the present invention, it is described in detail below in conjunction with the illustrated embodiment:

[0032] The present invention first constructs a link normalization dictionary, which records the link (URL) normalization method required by each web page type. The specific method is as follows:

[0033] First, for each website to be crawled, analyze the URL naming rules of the web page types to be crawled. For example, the URLs of all book display pages (contentpage) of Boku.com (www.bookuu.com) have the form:

[0034] http://www.bookuu.com/kgsm/ts/2010/07/13/1786270.shtml

[0035] http://www.bookuu.com/kgsm/ts/2010/09/21/1827795.shtml

[0036] http://www.bookuu.com/kgsm/ts/2009/12/08/1644478.shtml

[0037] That is, the URLs share the same prefix, while some parts (the last number string in the above example) vary.
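The observation in [0037] can be illustrated with a minimal Python sketch. The normalization rule used here (replacing every run of digits in the path with the placeholder `#`) is an assumption made for illustration only; the visible text does not state the rule the invention actually uses. The names `normalize` and `link_dictionary` are likewise hypothetical.

```python
import re
from urllib.parse import urlsplit

# Assumption: the variable parts of a link are its digit runs.
DIGITS = re.compile(r"\d+")

def normalize(url: str) -> str:
    """Replace every run of digits in the URL path with the placeholder '#'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{DIGITS.sub('#', parts.path)}"

# The three example links from paragraphs [0034] to [0036].
examples = [
    "http://www.bookuu.com/kgsm/ts/2010/07/13/1786270.shtml",
    "http://www.bookuu.com/kgsm/ts/2010/09/21/1827795.shtml",
    "http://www.bookuu.com/kgsm/ts/2009/12/08/1644478.shtml",
]

# All three links collapse to a single normalized pattern.
patterns = {normalize(u) for u in examples}

# Hypothetical dictionary layout: normalized pattern -> web page type.
link_dictionary = {pattern: "contentpage" for pattern in patterns}
print(link_dictionary)
```

Under this assumed rule, the three book display pages map to one dictionary entry, which is what lets a later comparison treat them as the same page type.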

[0038] Then, ac...


Abstract

The invention discloses a method for quickly identifying web page types through links, comprising the steps of: 1) constructing a link normalization dictionary; 2) extracting links; 3) reading the link normalization dictionary to generate prefix character strings; 4) quickly predicting the link prefix type; 5) discriminating the link type; 6) passing the effective links to the web spider. The invention also discloses a system for realizing the above method, including: a link normalization dictionary, a link extraction module, a prefix extraction module, a type prediction module and a type discrimination module. The system and method use the naming rules of web page link addresses to extract a prefix character string and a normalized character string from each link address, and compare these strings to quickly determine the web page type, thereby improving the accuracy and speed of web page type identification and the efficiency of web spiders.
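The six steps in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the patented implementation: the dictionary contents, the `#` placeholder, the digit-run normalization rule, the way a prefix is derived from a pattern, and all function names are assumptions. Link extraction (step 2) is replaced by a hard-coded list of made-up links.

```python
import re
from urllib.parse import urlsplit

DIGITS = re.compile(r"\d+")

# Step 1 (assumed layout): normalized link pattern -> web page type.
LINK_DICTIONARY = {
    "http://www.bookuu.com/kgsm/ts/#/#/#/#.shtml": "contentpage",
}

def normalize(url: str) -> str:
    """Assumed rule: replace digit runs in the path with '#'."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{DIGITS.sub('#', parts.path)}"

# Step 3: derive prefix strings (the literal text before the first placeholder).
PREFIXES = {pattern.split("#", 1)[0] for pattern in LINK_DICTIONARY}

def classify(url: str):
    """Return the page type of a link, or None if it is not of interest."""
    # Step 4: cheap prediction by prefix comparison only.
    if not any(url.startswith(prefix) for prefix in PREFIXES):
        return None
    # Step 5: discrimination by comparing the full normalized string.
    return LINK_DICTIONARY.get(normalize(url))

def effective_links(links):
    """Step 6: keep only links whose type was recognized."""
    result = []
    for url in links:
        page_type = classify(url)
        if page_type is not None:
            result.append((url, page_type))
    return result

# Step 2 stand-in: made-up links, as if extracted from a downloaded page.
extracted = [
    "http://www.bookuu.com/kgsm/ts/2011/01/02/1900001.shtml",
    "http://www.bookuu.com/kgsm/ts/index.shtml",
    "http://www.bookuu.com/about.html",
]
to_crawl = effective_links(extracted)
print(to_crawl)
```

In this sketch the classification uses only string operations on the link itself, so no page needs to be downloaded before its type is decided.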

Description

Technical field

[0001] The invention relates to the field of network communication, in particular to a method for quickly identifying web page types through links. The invention also relates to a system for implementing the above method.

Background technique

[0002] A web spider (Spider) is a program that search engines use to automatically crawl web pages. It starts from a certain page (usually the home page) of a website, reads the content of the web page, finds other link addresses in the web page, and uses these link addresses to find the next web page, and so on, until all the web pages of the website are crawled.

[0003] Using the above principles, web spiders can crawl all web pages on the Internet. However, due to the huge number of web pages on the Internet, the number of web pages that a web spider can crawl in a given time is limited, and for a specific application of the web spider, it is only necessary to crawl its Therefore, how to effectively schedule the we...
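The crawl loop described in [0002] can be sketched as a breadth-first traversal. To keep the example self-contained, the website is a hypothetical in-memory mapping from page to outgoing links (`SITE`) rather than real HTTP requests; `crawl` and `get_links` are illustrative names, not taken from the patent.

```python
from collections import deque

# Hypothetical site: page -> links found on that page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/b", "/c"],
    "/b": ["/"],
    "/c": [],
}

def crawl(start, get_links):
    """Visit every page reachable from `start`, each exactly once."""
    seen = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)            # "read the content of the web page"
        for link in get_links(page):  # "find other link addresses"
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order

visited = crawl("/", lambda page: SITE.get(page, []))
print(visited)
```

A spider that only needs certain page types could filter `get_links(page)` before enqueueing, which is where a link-based type check would fit.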


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F17/30
Inventor: 陈运文
Owner: SHANGHAI ZHANGMEN TECH