Webpage uniform resource locator (URL) classification and identification method and device

A webpage URL address classification and identification method and device, applied in the field of network technology. It addresses the uncertainty of existing approaches, the low matching depth of domain-name matching, and the heavy workload of directory maintenance, so as to reduce the required data volume, improve processing efficiency, and ensure classification accuracy and depth.

Active Publication Date: 2015-07-01
XINYANG BRANCH HENAN CO LTD OF CHINA MOBILE COMM CORP

AI Technical Summary

Problems solved by technology

However, the matching depth of domain name matching is low. Generally, it can only identify which website a URL belongs to; it cannot determine which channel or category the URL belongs to.
[0008] The fourth is to match the URL address of the obtained web page with the preset directory. If the match is successful, attribute the URL addre

Examples

Embodiment Construction

[0023] In order to reduce the amount of data required for URL address classification as much as possible, improve the accuracy and classification depth of URL classification, and improve processing efficiency, the embodiments of the present invention provide a method and device for classifying web page URL addresses.

[0024] The general components of a URL address are: the transmission protocol used (for example, http or ftp), the host domain name (host), and the path. The path is a string divided by zero or more "/" symbols, and generally represents the address of a directory or file on the server.

[0025] For example, in the URL address http://www.ceocio.com.cn/net/, www.ceocio.com.cn is the domain name of the host, that is, the domain name of the server of the web page, and net is a directory on that server.
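The decomposition described in [0024] and [0025] can be illustrated with a minimal sketch using Python's standard `urllib.parse` module. The helper name `split_url` is hypothetical and is not part of the patent text; it only shows how the protocol, host domain name, and path segments might be separated.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Return (protocol, host, path segments) for a URL.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(url)
    # Drop the empty strings produced by leading, trailing, or doubled "/".
    segments = [s for s in parts.path.split("/") if s]
    return parts.scheme, parts.hostname, segments

protocol, host, segments = split_url("http://www.ceocio.com.cn/net/")
print(protocol, host, segments)  # http www.ceocio.com.cn ['net']
```

A URL with no path, such as a bare host, simply yields an empty segment list, which matches the "zero or more" wording in [0024].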

[0026] Preferred embodiments of the present invention will be described in detail below in conjunction with the accompanying drawings...

Abstract

The invention discloses a webpage uniform resource locator (URL) classification and identification method and device, intended to improve the accuracy and depth of URL classification and to improve efficiency while keeping the amount of data required for classification as small as possible. The method includes: analyzing a plurality of specific URLs that share the same domain name and determining the main path and sub-paths they contain; building a directory tree in which the determined main path and sub-paths serve as the directories of each level; assigning each URL to the corresponding directories in the directory tree according to all the paths the URL contains; for the directory at any level of the tree, determining the keywords of the webpages that the URLs in that directory correspond to; and, when the ratio of the number of webpages sharing the same first keyword to the total number of webpages in the directory is higher than a set threshold, taking the class of that first keyword as the class of the URLs under the directory.
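The steps in the abstract can be sketched roughly as follows. This is an illustrative reading, not the patented implementation. It makes several assumptions: each page's first keyword has already been extracted, the `keyword_class` lookup table and the 0.6 threshold are made up, a final path segment not followed by "/" is treated as a file instead of a directory, and www.example.com is a placeholder domain.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit

def directories_of(url):
    """Yield every directory prefix of the URL's path, main path first."""
    path = urlsplit(url).path
    segments = [s for s in path.split("/") if s]
    # Assumption: a last segment without a trailing "/" names a file.
    if segments and not path.endswith("/"):
        segments = segments[:-1]
    for depth in range(1, len(segments) + 1):
        yield "/" + "/".join(segments[:depth])

def classify_directories(pages, keyword_class, threshold=0.6):
    """pages: (url, first_keyword) pairs that share one domain name.
    keyword_class: hypothetical keyword -> class lookup table.
    Returns {directory: class} for directories with a dominant keyword."""
    keywords_in = defaultdict(Counter)
    for url, keyword in pages:
        # A URL is counted in every directory level its path passes through.
        for directory in directories_of(url):
            keywords_in[directory][keyword] += 1

    result = {}
    for directory, counts in keywords_in.items():
        keyword, hits = counts.most_common(1)[0]
        ratio = hits / sum(counts.values())
        if ratio > threshold and keyword in keyword_class:
            result[directory] = keyword_class[keyword]
    return result

pages = [
    ("http://www.example.com/sports/nba/1.html", "basketball"),
    ("http://www.example.com/sports/nba/2.html", "basketball"),
    ("http://www.example.com/sports/nba/3.html", "basketball"),
    ("http://www.example.com/sports/cba/1.html", "basketball"),
    ("http://www.example.com/sports/soccer/1.html", "football"),
    ("http://www.example.com/news/1.html", "finance"),
    ("http://www.example.com/news/2.html", "weather"),
]
keyword_class = {"basketball": "Sports", "football": "Sports",
                 "finance": "Finance", "weather": "News"}
result = classify_directories(pages, keyword_class)
```

In this toy data, four of the five pages under /sports share the keyword "basketball" (ratio 0.8), so the directory is labeled Sports, while /news has no keyword above the threshold and stays unclassified. Once a directory has a class, a new URL under it could be classified by path lookup alone, without crawling the page.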

Description

Technical field

[0001] The present invention relates to the field of network technology, in particular to a method and device for classifying and identifying web page URL addresses.

Background technique

[0002] With the rapid development of the Internet, the amount of network data has increased sharply. Facing the huge amount of webpage information resources, it is necessary to classify and organize the huge amount of webpage information.

[0003] At present, classifying webpages according to their Uniform Resource Locators (URLs) is one of the more common webpage classification methods.

[0004] Traditional URL classification identification mainly has the following methods:

[0005] The first method is to use web crawler technology to crawl the content of the webpage according to the URL address after obtaining the URL address of the webpage, and determine the set number of keywords (Keywords), and determine the category to which the webpage belongs according to the det...

Application Information

IPC(8): G06F17/30
Inventor 崔洪涛李明李远邵杰黄伟张杰
Owner XINYANG BRANCH HENAN CO LTD OF CHINA MOBILE COMM CORP