A method for normalizing URLs

A URL normalization technique based on vector encoding, applied in the field of URL normalization. It addresses the problems of existing methods, such as poor generality, URLs that cannot be handled correctly, and website developers' unwillingness to cooperate, and it achieves good stability.

Inactive Publication Date: 2019-10-01
SHANGHAI GUAN AN INFORMATION TECH
8 Cites, 3 Cited by

AI Technical Summary

Problems solved by technology

[0005] Current normalization methods are as follows. First, website developers cooperate by providing the site's soft-routing logic, from which a mapping table is built; this table is then used to restore the URLs in the logs during analysis. In practice, website developers and security operations personnel (the people who need to analyze the logs) often belong to different departments, and developers are often unwilling to cooperate.
In addition, many companies' business systems now change very quickly, with websites updated weekly or even daily, which makes a mapping table very difficult to maintain.
Second, experts define filtering rules based on experience and encode them as regular expressions that filter out the parameters in the URL path. Such expert-written regular expressions often produce many false positives and false negatives: there is no fixed standard for the length of URL parameters or the characters they use, so the rules do not generalize well.
Third, URLs are collected over a long period to construct the website's directory tree, and the number of child nodes is counted for each node. This only handles the case where the last segment of the URL is a parameter. For a URL such as /avatar/user01/12313123123, the last two segments are a user ID and a random number, and the URL cannot be correctly reduced to /avatar.
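The weakness of the second, rule-based approach can be illustrated with a small sketch. The rule below (replace any all-digit path segment with a placeholder) is a hypothetical example of an expert rule, not one taken from the source; the function name and the `/error/404` URL are likewise assumptions made for illustration.

```python
import re

# Hypothetical expert rule: a path segment made only of digits is a parameter.
NUMERIC_SEGMENT = re.compile(r"^\d+$")


def normalize_with_rule(url):
    """Replace every all-digit path segment with the placeholder {param}."""
    segments = url.split("/")
    return "/".join(
        "{param}" if NUMERIC_SEGMENT.match(seg) else seg for seg in segments
    )


# False negative: "user01" mixes letters and digits, so the rule misses it.
print(normalize_with_rule("/avatar/user01/12313123123"))

# Possible false positive: if "404" names a fixed page rather than a
# parameter, the rule rewrites it anyway.
print(normalize_with_rule("/error/404"))
```

Tightening the pattern to catch `user01` would in turn start matching ordinary path names that happen to contain digits, which is the trade-off the text describes.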


Examples

Embodiment 1

[0026] A method for normalizing URLs, with the following specific steps:

[0027] Step 1: Encode the original URL into a numerical vector using an autoencoder, with a long short-term memory (LSTM) network as the base network. An LSTM is a type of recurrent neural network (RNN) that processes its input as a sequence. Each unit is a single-layer or multi-layer neural network, and all units share exactly the same structure; the input to each unit is the output of the previous unit together with the character input at the current step. The autoencoder has two parts: an encoder and a decoder. Each character of the URL is fed into the encoder in turn to produce a vector, and that vector is then passed to the decoder. At each step the decoder takes its output from the previous step and the character input from the previous step, with the goal of reconstructing the original URL. Through this step, the LSTM autoe...
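The encoder half of this step can be sketched in plain Python. This is only an illustration under stated assumptions, not the patent's implementation: the weights are random and untrained, the decoder and the training loop are omitted, and the alphabet, `HIDDEN_SIZE`, and all function names are hypothetical choices. It shows each character being fed in turn into an LSTM unit whose input is the previous unit's output plus the current character, with the final hidden state serving as the URL's vector.

```python
import math
import random
import string

# Assumed character set; any other character shares one "unknown" slot.
ALPHABET = string.ascii_letters + string.digits + "/._-?=&%:"
CHAR_TO_ID = {ch: i for i, ch in enumerate(ALPHABET)}
UNK_ID = len(ALPHABET)
VOCAB_SIZE = len(ALPHABET) + 1
HIDDEN_SIZE = 8  # illustrative size, not taken from the source


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_weights(seed=0):
    """Random (untrained) parameters for the four LSTM gates."""
    rng = random.Random(seed)

    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        gate: {
            "W": matrix(HIDDEN_SIZE, VOCAB_SIZE),   # input-to-hidden
            "U": matrix(HIDDEN_SIZE, HIDDEN_SIZE),  # hidden-to-hidden
            "b": [0.0] * HIDDEN_SIZE,
        }
        for gate in ("i", "f", "o", "g")
    }


def lstm_step(weights, char_id, h_prev, c_prev):
    """One LSTM unit: combines the previous output with the current character."""

    def pre_activation(gate):
        p = weights[gate]
        # With a one-hot input, W @ x is just column char_id of W.
        return [
            p["W"][r][char_id]
            + sum(p["U"][r][k] * h_prev[k] for k in range(HIDDEN_SIZE))
            + p["b"][r]
            for r in range(HIDDEN_SIZE)
        ]

    i = [sigmoid(v) for v in pre_activation("i")]    # input gate
    f = [sigmoid(v) for v in pre_activation("f")]    # forget gate
    o = [sigmoid(v) for v in pre_activation("o")]    # output gate
    g = [math.tanh(v) for v in pre_activation("g")]  # candidate cell state
    c = [f[k] * c_prev[k] + i[k] * g[k] for k in range(HIDDEN_SIZE)]
    h = [o[k] * math.tanh(c[k]) for k in range(HIDDEN_SIZE)]
    return h, c


def encode_url(url, weights):
    """Feed the URL character by character; return the final hidden state."""
    h = [0.0] * HIDDEN_SIZE
    c = [0.0] * HIDDEN_SIZE
    for ch in url:
        h, c = lstm_step(weights, CHAR_TO_ID.get(ch, UNK_ID), h, c)
    return h
```

Calling `encode_url("/avatar/user01/12313123123", init_weights())` returns a list of `HIDDEN_SIZE` floats. With untrained weights the vector carries no useful similarity structure; in the described method that structure would come from training the encoder together with a decoder to reconstruct the input URL.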

Abstract

The invention discloses a method for normalizing URLs (Uniform Resource Locators) that aims to solve the problems of existing URL normalization methods. The method comprises the following steps: step 1, encoding each original URL into a numerical vector with a deep learning method, so that URLs with the same path but different parameters lie very close together in the vector space after encoding; and step 2, merging URLs whose numerical vectors are similar, thereby achieving normalization. The method does not require complex regular expressions to be written, identifies the parameter part accurately no matter how long it is, and normalizes URLs accurately. It uses an autoencoder, which is an unsupervised learning algorithm, so no manual annotation is needed. It does not require a URL mapping table or a directory structure to be maintained, and it remains more stable when new URLs appear after small-scale revisions to the website.
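Step 2 can be sketched as grouping URLs whose vectors lie close together. The visible text does not say which merging procedure is used, so the greedy threshold grouping below is an assumption, as are the Euclidean distance, the threshold value, and the function names. The two-dimensional vectors are made-up toy values, not output from any trained model.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_by_distance(vectors, threshold):
    """Greedy grouping: a URL joins the first group whose representative
    vector is within `threshold`; otherwise it starts a new group."""
    groups = []  # list of (representative_vector, member_urls)
    for url, vec in vectors.items():
        for rep, members in groups:
            if euclidean(rep, vec) <= threshold:
                members.append(url)
                break
        else:
            groups.append((vec, [url]))
    return [members for _, members in groups]


# Toy vectors, invented for illustration only.
toy_vectors = {
    "/avatar/user01/12313123123": [0.10, 0.90],
    "/avatar/user02/99887766": [0.12, 0.88],
    "/login": [0.80, 0.20],
}

groups = group_by_distance(toy_vectors, threshold=0.1)
print(groups)
```

Here the two `/avatar/...` URLs fall into one group and `/login` into another, so each group can be reported under a single normalized URL. A greedy pass depends on input order; a standard clustering algorithm could be substituted without changing the idea.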

Description

Technical field

[0001] The invention relates to the field of URL normalization, in particular to a method for normalizing URLs.

Background

[0002] When analyzing web logs, we often need to compute statistics for web pages, such as a page's visits per hour, the number of visiting IP addresses, and the distribution of response status codes. These statistics are either fitted with a time series model or used as features in a more complex anomaly detection model, in order to discover pages that were accessed abnormally within a certain period. In actual analysis, however, we cannot see the page the user actually visited; the access log shows only the URL (Uniform Resource Locator, the address of a resource on the Internet) that the user requested. Strictly speaking, therefore, the object of our analysis is not the "page" but the "URL".

[0003] Regardless of whether the server uses apache, nginx or IIS, the log format they rec...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/958, G06N3/04, G06N3/08
CPC: G06F16/958, G06N3/08, G06N3/044, G06N3/045
Inventors: 陈曦, 魏国富, 辜乘风, 汲丽, 钟丹阳
Owner: SHANGHAI GUAN AN INFORMATION TECH