Enriched URL (uniform resource locator) recognition method and apparatus

An identification method and identification device, applied in special data processing applications, instruments, electrical digital data processing, etc. It addresses problems such as the increased burden on search engines and wasted bandwidth, with the effects of improving coverage and timeliness, saving bandwidth, and reducing the amount of crawling.

Inactive Publication Date: 2015-10-07
BEIJING QIHOO TECH CO LTD +1
4 Cites, 6 Cited by

AI Technical Summary

Problems solved by technology

[0006] However, there is a phenomenon of enrichment in this solution. Each URL has its own characteristics. The quality of web pages with similar URLs is v



Examples


Example Embodiment

[0075] Hereinafter, exemplary embodiments of the present disclosure will be described in more detail with reference to the accompanying drawings. Although exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided to enable a more thorough understanding of the present disclosure and to fully convey the scope of the present disclosure to those skilled in the art.

[0076] Referring to Figure 1, a flow chart of the steps of an embodiment of a method for identifying enriched URLs according to the present invention is shown, which may specifically include the following steps:

[0077] Step 101: Extract one or more URLs;

[0078] In practical applications, websites of various types may generate numerous web pages every day, and each web page has a URL.
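Step 101 can be illustrated with a minimal sketch. It assumes the URLs are taken from the `<a href>` links of an already downloaded page and that each link's visible text is kept as its anchor text; the excerpt does not say how the extraction is actually performed. The names `LinkExtractor` and `extract_links` are hypothetical and introduced only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (absolute URL, anchor text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # list of (url, anchor_text)
        self._href = None    # href of the <a> tag currently open, if any
        self._text = []      # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page URL.
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the anchor text.
            anchor = " ".join("".join(self._text).split())
            self.links.append((self._href, anchor))
            self._href = None


def extract_links(html, base_url):
    """Hypothetical helper: return all (url, anchor_text) pairs in an HTML string."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

Keeping the anchor text next to each URL at extraction time is a design choice made here because the later steps compare anchor texts; links without an `href` are skipped.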

[0079] Applying the e...


Abstract

An embodiment of the invention provides a method and apparatus for recognizing enriched URLs (uniform resource locators). The method comprises the steps of: extracting one or more URLs; selecting candidate URLs from the one or more URLs; associating each candidate URL with its anchor text; calculating the similarity of the anchor texts; and identifying enriched URLs from the candidate URLs according to the similarity. According to the embodiment, a search engine can avoid crawling spam and duplicate web pages, which greatly reduces the bandwidth wasted during crawling; the reduced crawl volume lowers the burden on the search engine; and the search engine can instead crawl other good-quality web pages, which increases its coverage and improves its timeliness when indexing web pages.
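The steps listed in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, since the excerpt does not show how candidates are selected or which similarity measure is used. It assumes that candidates are URLs sharing a pattern (host plus path with digit runs masked) that occurs at least `min_group` times, that similarity is the mean pairwise Jaccard similarity of the anchor texts' word sets, and that a group is flagged when that mean reaches `threshold`. All function names and parameters are hypothetical.

```python
import re
from collections import defaultdict
from itertools import combinations
from urllib.parse import urlsplit


def url_pattern(url):
    """Assumed grouping key: host plus path with every digit run replaced by 'N'."""
    parts = urlsplit(url)
    return parts.netloc + re.sub(r"\d+", "N", parts.path)


def jaccard(a, b):
    """Jaccard similarity of two anchor texts, compared as lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def find_enriched(links, min_group=3, threshold=0.5):
    """links: iterable of (url, anchor_text). Returns the set of flagged URLs."""
    groups = defaultdict(dict)
    for url, anchor in links:
        # Associate each URL with its anchor text, grouped by URL pattern.
        groups[url_pattern(url)][url] = anchor

    enriched = set()
    for members in groups.values():
        # Assumed candidate rule: the pattern must occur often enough
        # (at least 2 so that a pairwise similarity exists).
        if len(members) < max(min_group, 2):
            continue
        pairs = list(combinations(members.values(), 2))
        mean_sim = sum(jaccard(a, b) for a, b in pairs) / len(pairs)
        if mean_sim >= threshold:
            enriched.update(members)
    return enriched
```

With three product URLs that differ only in a number and carry near-identical anchor texts, the whole group would be flagged, while three news URLs with unrelated anchor texts would not. The thresholds are placeholders and would need tuning against real crawl data.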

Description

Technical field

[0001] The invention relates to the technical field of computer processing, in particular to a method for identifying enriched URLs and a device for identifying enriched URLs.

Background technique

[0002] With the rapid development of the network, it has become a carrier of a large amount of information. To extract and use this information effectively, a search engine usually downloads web pages from the network through a web crawler.

[0003] The web crawler starts from the URLs (Uniform Resource Locators) of one or several initial webpages, obtains the URLs on those initial webpages, and continues until a certain stopping condition of the system is met.

[0004] Web crawlers can discover a large number of newly generated URLs in the network every day. However, the number of URLs in the network is massive, and the number of URLs that search engines can actually crawl every day is limited. Sort the discover...
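The crawl loop described in the background (start from seed URLs, collect the URLs found on each page, stop when a condition is met) can be sketched as a breadth-first traversal. This is a generic illustration, not the patent's crawler: `get_links` is a hypothetical caller-supplied function that returns the URLs found on a page, and `max_pages` stands in for the unspecified stopping condition.

```python
from collections import deque


def crawl(seeds, get_links, max_pages=100):
    """Breadth-first crawl from seed URLs; returns URLs in visiting order.

    get_links(url) is a hypothetical function returning the URLs on that page.
    max_pages is an assumed stopping condition (a simple crawl budget).
    """
    seeds = list(dict.fromkeys(seeds))   # drop duplicate seeds, keep order
    queue = deque(seeds)
    seen = set(seeds)
    visited = []
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in get_links(url):
            if link not in seen:         # enqueue each URL at most once
                seen.add(link)
                queue.append(link)
    return visited
```

Because the budget caps how many pages are visited, every page spent on a spam or duplicate URL is one not spent on a useful page, which is the motivation for filtering such URLs before they reach the queue.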


Application Information

IPC(8): G06F17/30
CPC: G06F16/00, G06F16/9566
Inventor: 王智广
Owner: BEIJING QIHOO TECH CO LTD