Word splitting method and device aiming at URL

A word segmentation method and device for URLs, applied in natural language data processing, special data processing applications, and the retrieval of web data using information identifiers, achieving efficient segmentation and improved task accuracy.

Active Publication Date: 2018-06-29
INST OF INFORMATION ENG CAS

AI Technical Summary

Problems solved by technology

There is currently no word segmentation method designed specifically for URLs.



Examples


Embodiment Construction

[0032] To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to specific drawings.

[0033] The present invention provides a word segmentation method for URLs. The flow of the method is shown in Figure 1, and the main steps include:

[0034] (1) Hierarchical segmentation. First, the URL, which is semi-structured data, is segmented according to its internal hierarchical structure to obtain five hierarchical parts;

[0035] (2) Symbol segmentation and regular expression filtering. These are carried out on each hierarchical part in turn: the part is segmented at special symbols, regular expression filtering is applied to fields with specific formats, such as IP addresses, dates, and numbers, and non-alphabetic characters in the URL are further cleaned out;

[0036] (3) String segmentation, using the bidirectional maximum matching algorithm and a probability model ...
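
Steps (1) and (2) above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names `hierarchical_split` and `symbol_split`, the mapping of host labels to free, secondary, and primary domain names (last label, second-to-last label, everything before), and the concrete regular expressions are all assumptions, since the text shown here does not specify them.

```python
import re
from urllib.parse import urlsplit

# Assumed patterns for fields with a specific format; the source names the
# field types (IP addresses, dates, numbers) but not the expressions.
FORMATTED = re.compile(
    r"\d{1,3}(?:\.\d{1,3}){3}"        # IPv4-like address
    r"|\d{4}[-/]\d{1,2}[-/]\d{1,2}"   # date such as 2018-06-29 or 2018/06/29
    r"|\d+"                           # any remaining run of digits
)


def hierarchical_split(url):
    """Step 1 (sketch): split a URL into five hierarchical parts.

    Assumption: the primary domain is the last host label, the secondary
    domain is the label before it, and the free domain is everything else.
    Hosts that are IP addresses are not handled specially here.
    """
    parts = urlsplit(url)
    labels = [label for label in (parts.hostname or "").split(".") if label]
    return {
        "protocol": parts.scheme,
        "free_domain": ".".join(labels[:-2]),
        "secondary_domain": labels[-2] if len(labels) >= 2 else "",
        "primary_domain": labels[-1] if labels else "",
        "path": parts.path,
    }


def symbol_split(text):
    """Step 2 (sketch): drop formatted fields, then split at non-letters."""
    cleaned = FORMATTED.sub(" ", text)
    return [tok.lower() for tok in re.split(r"[^A-Za-z]+", cleaned) if tok]
```

For a hypothetical URL such as `http://news.example.com/2018/06/29/url-word_split.html`, `hierarchical_split` yields `news`, `example`, and `com` as the three domain parts, and `symbol_split` applied to the path drops the date and returns the remaining letter strings, which would then be passed to step (3).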


Abstract

The invention relates to a word segmentation method and device for URLs. The method comprises: (1) dividing a URL address according to its internal hierarchical structure to obtain several hierarchical parts; (2) performing symbol segmentation and regular expression filtering on each hierarchical part in turn; (3) segmenting the character strings obtained from step 2 to obtain the URL word segmentation sequence. In step 1 the URL address is divided into five hierarchical parts: protocol type, free domain name, secondary domain name, primary domain name, and path. In step 3 the character strings are segmented using a bidirectional maximum matching algorithm and a probability model. The method makes full use of the URL's own hierarchical structure, segments URLs efficiently, and retains as much of the useful information in the URL address as possible. The resulting word segmentation sequence can be used for feature analysis in tasks such as web page classification and phishing URL detection, and can effectively improve task accuracy.
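
The bidirectional maximum matching named in step 3 can be illustrated with the sketch below. It is a generic dictionary-based version, not the patent's own procedure: the vocabulary is a hypothetical toy set, and the tie-breaking rule (fewer tokens, then fewer single-character tokens, otherwise the backward result) is a commonly used heuristic standing in for the probability model, whose details are not given in the text shown here.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy left-to-right matching, longest dictionary word first."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, vocab, max_len):
    """Greedy right-to-left matching, longest dictionary word first."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_max_match(text, vocab):
    """Run both directions and keep the cheaper result (heuristic, assumed)."""
    max_len = max(map(len, vocab), default=1)
    fwd = forward_max_match(text, vocab, max_len)
    bwd = backward_max_match(text, vocab, max_len)

    def cost(tokens):
        # Fewer tokens first, then fewer single-character tokens.
        return (len(tokens), sum(len(tok) == 1 for tok in tokens))

    return fwd if cost(fwd) < cost(bwd) else bwd
```

With the toy vocabulary `{"pay", "pal", "login"}`, both directions agree on `paypallogin`. When they disagree, the heuristic picks the result with fewer fragments, which is where a probability model could make a better-informed choice.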

Description

Technical field

[0001] The present invention relates to the technical field of word segmentation of network security data, in particular to a method and device for word segmentation of URLs. While retaining the unique hierarchical structure of the URL, the method obtains the word segmentation sequence of the URL string; the result can be used for feature analysis in tasks such as web page classification and phishing URL detection.

Background technique

[0002] A URL (uniform resource locator) is the address of a standard resource on the Internet, through which information resources can be accessed and retrieved. A URL uses a subset of ASCII characters to represent the address, its syntax is extensible, and its standard structure is as follows:

[0003] protocol type:[//server address[:port number]][/path][?query][#fragment]

[0004] Most URLs include three main parts: protocol type (scheme), server address (domain) and path (path). The protocol typ...
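
The standard structure given in [0003] can be inspected with Python's standard-library `urllib.parse.urlsplit`. The sketch below is only an illustration of those components; the helper name `url_components` and the example URL are hypothetical.

```python
from urllib.parse import urlsplit


def url_components(url):
    """Return the components named in the standard URL structure."""
    parts = urlsplit(url)
    return {
        "protocol_type": parts.scheme,     # scheme, e.g. "https"
        "server_address": parts.hostname,  # host without the port
        "port_number": parts.port,         # int, or None if absent
        "path": parts.path,
        "query": parts.query,
        "fragment": parts.fragment,
    }


components = url_components("https://www.example.com:8080/a/b.html?id=1#top")
```

Here `components["server_address"]` is `www.example.com` and `components["port_number"]` is `8080`; the optional parts come back empty (or `None` for the port) when the URL omits them.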


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/955, G06F40/284, G06F40/289
Inventors: 亚静, 柳厅文, 张盼盼, 李全刚, 时金桥, 郭莉
Owner: INST OF INFORMATION ENG CAS