A method for normalizing URLs

A vector-based normalization technique, applied in the field of URL normalization, that addresses problems of existing methods such as poor generality, URLs that cannot be handled correctly, and reliance on the cooperation of website developers, while achieving good stability.
CN110298005A · Inactive · Publication Date: 2019-10-01 · SHANGHAI GUAN AN INFORMATION TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHANGHAI GUAN AN INFORMATION TECH
Publication Date
2019-10-01
Estimated Expiration
Not applicable · inactive patent


Abstract

The invention discloses a method for normalizing URLs (Uniform Resource Locators) that aims to solve the problems of existing URL normalization methods. The method comprises two steps. Step 1: encode the original URL into a numerical vector using a deep learning method, so that URLs with the same path but different parameters lie very close together in the vector space after encoding. Step 2: merge URLs whose numerical vectors are similar, thereby achieving normalization. The method does not require writing complex regular expressions; the parameter part is identified accurately regardless of its length, so the URL is normalized accurately. The method uses an autoencoder, which is an unsupervised learning algorithm, so no manual annotation is needed. The method does not require maintaining a URL mapping table or a directory structure, and it remains stable when new URLs appear after a small-scale revision of the website.
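The second step (merging URLs whose vectors are close) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `encode` function below is a stand-in that counts character bigrams, not the autoencoder the abstract describes, and the function names, the greedy grouping strategy, and the `threshold` value are all hypothetical choices made for the example.

```python
import math
from collections import Counter


def encode(url: str) -> Counter:
    """Stand-in encoder (assumption): a bag of character bigrams.

    The abstract describes an autoencoder here; this is NOT that model.
    """
    return Counter(url[i:i + 2] for i in range(len(url) - 1))


def cosine_distance(a: Counter, b: Counter) -> float:
    """Return 1 - cosine similarity of two sparse count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def merge_similar(urls, threshold=0.3):
    """Greedy grouping: a URL joins the first group whose representative
    vector is within `threshold`; otherwise it starts a new group."""
    groups = []  # list of (representative_vector, member_urls)
    for url in urls:
        vec = encode(url)
        for rep, members in groups:
            if cosine_distance(rep, vec) <= threshold:
                members.append(url)
                break
        else:
            groups.append((vec, [url]))
    return [members for _, members in groups]


# Hypothetical URLs, invented for this example.
sample_urls = [
    "/shop/item/detail?id=1001",
    "/shop/item/detail?id=2002",
    "/user/login",
]
merged = merge_similar(sample_urls)
print(merged)
```

With these sample URLs, the two `/shop/item/detail` requests fall into one group and `/user/login` into another. A bigram count vector is dominated by the parameter text when parameters are long, so it does not share the length-independence the abstract claims for the learned encoding; it serves here only to make the merge step concrete.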

Description

Technical field

[0001] The invention relates to the field of URL normalization, in particular to a method for normalizing URLs.

Background technique

[0002] When analyzing web logs, we often need to compute statistics about web pages, such as a page's number of visits per hour, the number of visiting IPs, or the distribution of response status codes. These statistics can be modeled as time series, or used as features in a more complex anomaly detection model, in order to discover pages that were accessed abnormally within a certain period of time. In practice, however, we cannot see the actual page a user visited; the access log shows only the URL (the address of a standard resource on the Internet) that the user requested. Strictly speaking, therefore, the object of our analysis is not the "page" but the "URL".
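A small Python sketch shows why analyzing by raw URL is a problem for per-page statistics. The URLs below are hypothetical, and stripping the query string with `urllib.parse.urlsplit` is used only as a naive baseline for comparison; it is not the method this patent proposes.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical request URLs as they might appear in an access log.
logged_urls = [
    "/news/article?id=17",
    "/news/article?id=42",
    "/news/article?id=42&ref=home",
    "/about",
]

# Counting raw URLs spreads one page's visits across several keys.
raw_counts = Counter(logged_urls)

# Naive baseline (assumption for illustration): count by path only.
path_counts = Counter(urlsplit(url).path for url in logged_urls)

print(len(raw_counts))               # number of distinct raw URLs
print(path_counts["/news/article"])  # visits attributed to one page
```

Here the three requests for the same article page appear as three separate entries when counted by raw URL, and as a single entry with three visits when counted by path. Path stripping only helps when parameters live in the query string; identifiers embedded in the path itself (for example a hypothetical `/news/article/42`) are not merged by it.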

[0003] Regardless of whether the server uses Apache, Nginx or IIS, the log format they rec...
