A malicious URL detection system and method based on automatic feature extraction

A detection system technology based on feature extraction, applied in transmission systems, special data processing applications, instruments, etc. It addresses problems such as the lack of widely adopted URL detection software, with the effects of broadening the scope of application, improving accuracy, avoiding manual errors, and improving adaptability.

Active Publication Date: 2018-12-14
SHANGHAI JIAO TONG UNIV
5 Cites 15 Cited by

AI-Extracted Technical Summary

Problems solved by technology

Although the new technology of deep learning has been extensively resea...

Method used

To make full use of the information provided by the URL, reduce manual errors, and tie the three selected feature types (URL structure features, webpage text features, webpage image features) more closely together, a Softmax layer that fully connects the three models is added after the single-layer training models. In this way, the correlation of information among the three is maximized, and the utilization rate of various information...

Abstract

The invention discloses a malicious URL detection system and method based on automatic feature extraction, and relates to the field of malicious URL detection. The system comprises a preprocessing module, a parallel learning module, and a detection classification module. The preprocessing module takes the URL of a web page as input and converts the URL structural features, the web page text content and structural features, and the image features extracted during preprocessing into three digital matrices containing feature vectors. The parallel learning module uses three independent deep learning networks, each based on a different algorithm, to process the three digital matrices and obtain three probability matrices. The detection classification module feeds the three probability matrices into a fully connected network for further processing and gives the final classification result. The invention combines deep learning models for text and images with malicious URL detection, comprehensively extracts the various kinds of information a web page provides, and improves the application scope and accuracy of the detection method.

Example Embodiment

[0041] Hereinafter, a number of preferred embodiments of the present invention will be introduced with reference to the accompanying drawings in the specification to make the technical content clearer and easier to understand. The present invention can be embodied by many different forms of embodiments, and the protection scope of the present invention is not limited to the embodiments mentioned in the text.
[0042] Figure 1 shows a schematic structural diagram of a malicious URL detection system based on automatic feature extraction in an embodiment of the present invention. The system provided by this embodiment is composed of a preprocessing module, a parallel learning module, and a detection classification module. For an input URL, the system determines whether it is malicious and gives its category. The preprocessing module converts different types of data sources (character strings, webpage text, and webpage images) into three digital matrices that carry URL structural features, text features, and image features. Because these three matrices have different characteristics, the parallel learning module uses three different deep learning networks to learn the features separately: an n-gram convolutional network, TextCNN, and an image convolutional network. The detection classification module combines the learning results of the three branches of the parallel learning module to obtain the final detection result, which is returned to the client. Having the computer extract features fully automatically and integrating the three kinds of features to reach a conclusion is one of the core innovations of the present invention. The processing and learning of URL structural features, webpage text features, and webpage image features proceed as follows:
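The three-module flow described above can be sketched as a small driver function. This is only an illustrative skeleton: the names `detect`, `stub_preprocess`, `stub_branch`, and `stub_classify` are hypothetical, and the stubs stand in for the trained networks, which the source does not specify in code.

```python
from typing import Callable, List, Sequence

Matrix = List[List[float]]
Probabilities = List[float]


def detect(
    url: str,
    preprocess: Callable[[str], Sequence[Matrix]],
    branches: Sequence[Callable[[Matrix], Probabilities]],
    classify: Callable[[Sequence[Probabilities]], int],
) -> int:
    """Run preprocessing, the learning branches, and classification in order."""
    matrices = preprocess(url)  # URL-structure, text, and image matrices
    if len(matrices) != len(branches):
        raise ValueError("expected one matrix per learning branch")
    # The branches do not depend on each other, so they could run in parallel.
    probabilities = [branch(matrix) for branch, matrix in zip(branches, matrices)]
    return classify(probabilities)


# Hypothetical stand-ins so the sketch runs without trained models.
def stub_preprocess(url: str) -> List[Matrix]:
    return [[[float(len(url))]], [[0.0]], [[0.0]]]


def stub_branch(matrix: Matrix) -> Probabilities:
    return [0.9, 0.1]  # fixed two-class output, for illustration only


def stub_classify(probabilities: Sequence[Probabilities]) -> int:
    summed = [sum(column) for column in zip(*probabilities)]
    return summed.index(max(summed))


result = detect("http://example.com", stub_preprocess, [stub_branch] * 3, stub_classify)
```

Passing the stages in as callables keeps the driver independent of any particular network library, which is a design choice of this sketch rather than something the source prescribes.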
[0043] URL structural features: Traditional URL structural feature extraction relies on human experience. Inspired by the way word2vec converts text into word vectors and computes their associations, the present invention abandons manual extraction of URL structural features. Figure 2 shows how the system converts a character string into multi-dimensional vectors in the embodiment of the present invention: each character in the URL string corresponds to one multi-dimensional vector, so a URL string is converted into a digital matrix. Similar characters lie closer together in the multi-dimensional space, and dissimilar characters lie farther apart. In the embodiment of the present invention, the experimental results show that the system treats symbols as similar to one another, lowercase letters as similar to one another, and uppercase letters as similar to one another. After the character string is transformed into multi-dimensional vectors, the next step is to use neural networks to learn features. Figure 3 shows the convolution process of a fan-shaped window over the multi-dimensional vectors. In the embodiment of the present invention, convolution windows of sizes 3, 4, and 5 are each used to convolve the character vectors. The convolutional network first automatically summarizes pattern features from a large number of labeled URL character matrices. Then, when a new URL is input, the neural network performs pattern matching on it through convolution. The pattern matching can be understood through the following example: if the neural network finds a capital letter followed by a number or a control character, it automatically compares this against the pattern feature set to see whether it matches an existing pattern. The result of pattern matching is the learning result for the URL structural features.
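The character-to-vector conversion and the windowed convolution can be illustrated with a minimal pure-Python sketch. Several things here are assumptions and not taken from the source: the embeddings and kernels are random instead of learned, max pooling over positions is used to reduce each feature map to one value, a too-short input yields 0.0, and all function names, the alphabet, and the dimension of 8 are hypothetical.

```python
import random


def build_char_embeddings(alphabet, dim, seed=0):
    """Give each character a vector. Random here; a real system would learn them."""
    rng = random.Random(seed)
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in alphabet}


def url_to_matrix(url, embeddings, dim):
    """One row per character; characters outside the alphabet map to zeros."""
    zero = [0.0] * dim
    return [embeddings.get(ch, zero) for ch in url]


def conv_max_pool(matrix, kernel):
    """Slide a (window x dim) kernel along the character axis, keep the max response."""
    window = len(kernel)
    if len(matrix) < window:
        return 0.0  # assumption: inputs shorter than the window contribute nothing
    best = float("-inf")
    for start in range(len(matrix) - window + 1):
        score = sum(
            kernel[i][j] * matrix[start + i][j]
            for i in range(window)
            for j in range(len(kernel[i]))
        )
        best = max(best, score)
    return best


def extract_features(url, embeddings, dim, kernels_by_window):
    """Apply every kernel of every window size and collect the pooled responses."""
    matrix = url_to_matrix(url, embeddings, dim)
    return [
        conv_max_pool(matrix, kernel)
        for window in sorted(kernels_by_window)
        for kernel in kernels_by_window[window]
    ]


ALPHABET = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:/.-_?=&%"
DIM = 8
embeddings = build_char_embeddings(ALPHABET, DIM)
rng = random.Random(1)
# Two random kernels for each of the window sizes 3, 4, and 5 named in the text.
kernels = {
    w: [[[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(w)] for _ in range(2)]
    for w in (3, 4, 5)
}
features = extract_features("http://example.com/login?id=1", embeddings, DIM, kernels)
```

Each kernel acts as one learned "pattern": a high pooled response means some run of 3, 4, or 5 consecutive characters in the URL resembles that pattern.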
[0044] Webpage text features: Convolutional neural networks are traditionally used for image processing, where they perform well. Intuitively, the way a convolutional neural network scans from left to right and top to bottom closely resembles the way we process images. However, this does not mean that it cannot be used for text processing. The basic algorithm of the text convolutional neural network (TextCNN) is the same as that of the convolutional neural network described above. The difference is that in natural language processing, the feature extraction window must have the same width as the input matrix, while the height of the window is optional, with typical values of 2 to 5. In actual operation, we selected three windows with heights of 3, 4, and 5 and set the number of filters for each window to 128, so that more comprehensive features can be extracted, which helps to improve the accuracy of the final result. In general, our extraction of text features can be divided into two parts: a word2vec word vector conversion part and a TextCNN word vector processing part. Given a segment of web page body text, word2vec converts each word in the text into a word vector, so that the entire text yields a digitized matrix. Taking this digitized matrix as the input of TextCNN, we obtain a probability matrix for the text, which contains the classification features of the text. The implementation framework of the entire text extraction process is shown in Figure 4.
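The window geometry above determines the size of every intermediate result. The following sketch works through that arithmetic, assuming a stride of 1, no padding, and max-over-time pooling (common TextCNN choices that the source does not state); the function name and the example sizes of 50 words and 100 dimensions are hypothetical.

```python
def textcnn_shapes(num_words, embed_dim, window_heights=(3, 4, 5), filters_per_window=128):
    """Return per-window kernel and feature-map shapes plus the pooled vector length."""
    shapes = {}
    for height in window_heights:
        if num_words < height:
            raise ValueError("text is shorter than the convolution window")
        shapes[height] = {
            # Kernel width equals the embedding dimension, as the text requires.
            "kernel": (height, embed_dim),
            # Stride 1, no padding: one output per valid window position.
            "feature_map": (num_words - height + 1, filters_per_window),
        }
    # Pooling keeps one value per filter, so the lengths simply add up.
    pooled_length = filters_per_window * len(window_heights)
    return shapes, pooled_length


shapes, pooled_length = textcnn_shapes(num_words=50, embed_dim=100)
```

Under these assumptions, a 50-word text with 100-dimensional word vectors gives feature maps with 48, 47, and 46 positions for window heights 3, 4, and 5, and a pooled vector of 3 × 128 = 384 values.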
[0045] Webpage image features: The image feature data source of this project is the image information of the webpage corresponding to the malicious URL. After preprocessing such as clipping and filtering, the image is adapted to the input requirements of the deep image convolutional neural network. The deep image convolutional neural network is then used to learn the image features.
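The source does not say which clipping or filtering operations are used. As one possible reading, the sketch below center-crops a grayscale image to a square and smooths it with a 3x3 mean filter; both operations, the function names, and the nested-list image representation are assumptions made for illustration.

```python
def center_crop(image, size):
    """Cut a size x size square from the middle of a 2-D list of pixel values."""
    height, width = len(image), len(image[0])
    if size > height or size > width:
        raise ValueError("crop size exceeds image dimensions")
    top = (height - size) // 2
    left = (width - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


def mean_filter(image):
    """3x3 mean filter; border pixels average only the neighbours that exist."""
    height, width = len(image), len(image[0])
    smoothed = []
    for r in range(height):
        row = []
        for c in range(width):
            neighbours = [
                image[rr][cc]
                for rr in range(max(0, r - 1), min(height, r + 2))
                for cc in range(max(0, c - 1), min(width, c + 2))
            ]
            row.append(sum(neighbours) / len(neighbours))
        smoothed.append(row)
    return smoothed


# A 6x6 test image whose pixel at (r, c) has value r * 6 + c.
image = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
prepared = mean_filter(center_crop(image, 4))
```

A fixed output size matters because a convolutional network with fully connected layers expects every input image to have the same dimensions.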
[0046] To make full use of the information provided by the URL, reduce manual errors, and tie the three selected features (URL structure feature, web page text feature, web page image feature) more closely together, a Softmax layer that fully connects the three models is added after the single-layer training models. In this way, the association of information among the three is maximized, and the utilization of the various kinds of information is maximized. At the same time, because there is less manual intervention, the error of feature extraction can be further reduced. The learning result of the fully connected layer is the system's final judgment of the URL. We divide URLs into 7 categories: normal URLs form one category, and malicious URLs are divided into 6 categories. Finally, the system gives a classification report for the entered URL; the specific classification is shown in Figure 5.
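A fully connected Softmax layer over the three branch outputs can be sketched as follows. The sketch assumes each branch emits a 7-element probability vector, which the source implies but does not state. The weights shown simply sum the branches per class and are a hand-picked placeholder for trained weights; the class labels are also placeholders, since the source does not name the six malicious categories.

```python
import math

NUM_CLASSES = 7  # one normal class plus six malicious classes, per the text
LABELS = ["normal"] + ["malicious_%d" % i for i in range(1, NUM_CLASSES)]  # placeholder names


def softmax(logits):
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(url_probs, text_probs, image_probs, weights, biases):
    """Concatenate the three branch outputs and apply one dense layer plus softmax."""
    features = list(url_probs) + list(text_probs) + list(image_probs)  # 21 values
    logits = [
        sum(w * x for w, x in zip(row, features)) + bias
        for row, bias in zip(weights, biases)
    ]
    return softmax(logits)


def summing_weights(num_classes=NUM_CLASSES, branches=3):
    """Placeholder weights: class k sums the k-th entry of every branch."""
    return [
        [1.0 if col % num_classes == k else 0.0 for col in range(num_classes * branches)]
        for k in range(num_classes)
    ]


branch_output = [0.05] * NUM_CLASSES
branch_output[2] = 0.7  # all three branches favour class index 2 in this example
fused = fuse(branch_output, branch_output, branch_output,
             summing_weights(), [0.0] * NUM_CLASSES)
predicted_label = LABELS[fused.index(max(fused))]
```

With trained weights, the dense layer can also learn cross-branch interactions, for example trusting the image branch more for one category than another, which a fixed per-class sum cannot express.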
[0047] The preferred embodiments of the present invention are described in detail above. It should be understood that those of ordinary skill in the art can make many modifications and changes according to the concept of the present invention without creative work. Therefore, all technical solutions that those skilled in the art can obtain through logical analysis, reasoning, or limited experiments based on the concept of the present invention and the prior art should fall within the protection scope determined by the claims.
