Multi-key-word matching method for rapidly analyzing content

A keyword matching technique for content analysis. It addresses the inability of existing methods to let the text matching window make continuous multi-byte jumps, and thereby speeds up text scanning.

Inactive Publication Date: 2009-02-04
BEIJING VENUS INFORMATION TECH

AI Technical Summary

Problems solved by technology

Compared with other keyword matching methods, this method takes greater account of the characteristics specific to the virus detection field, and shows a better scanni...

Examples

Embodiment 1

[0042] Suppose there are K original keywords to be searched, denoted P = {P_1, P_2, ..., P_K}. In practice, the original keywords differ in length. To allow multiple keywords to be matched in parallel, the present invention cuts all keywords to equal length: a substring length W is chosen, and each original keyword P_i in the set P is cut to a substring M_i of W bytes. This W-byte substring M_i is called the feature string of the original keyword, and the set of all feature strings M_i is the keyword feature string set M. Note that W cannot be greater than the length of the shortest keyword in the original keyword set. The simplest clipping method is to take the W-byte prefix or suffix of each keyword as the keyword feature string of the ori...
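The prefix or suffix clipping described in [0042] can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name `clip_feature_strings` and the `use_suffix` flag are hypothetical.

```python
def clip_feature_strings(keywords, w, use_suffix=False):
    """Cut every keyword to a W-byte feature string (prefix or suffix).

    Hypothetical helper: the name and signature are assumptions.
    """
    if w < 1:
        raise ValueError("W must be at least 1")
    shortest = min(len(k) for k in keywords)
    if w > shortest:
        # W may not exceed the length of the shortest keyword.
        raise ValueError(f"W={w} exceeds shortest keyword length {shortest}")
    if use_suffix:
        return [k[-w:] for k in keywords]
    return [k[:w] for k in keywords]


if __name__ == "__main__":
    sample = ["abcdefg", "abcopq", "wyzopq"]
    print(clip_feature_strings(sample, 3))
    print(clip_feature_strings(sample, 3, use_suffix=True))
```

Plain prefix clipping can give several keywords the same feature string, since keywords that share a W-byte prefix collapse to one entry.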

Embodiment approach

[0061] When implementing the present invention, step A1 of preprocessing stage A can adopt the following preferred implementation: for each keyword P_i in the keyword set P = {P_1, P_2, ..., P_K}, the extracted keyword feature string M_i is the keyword substring with the fewest occurrences in the entire keyword set.

[0062] The following method can be used to ensure that the extracted keyword feature string M_i is the keyword substring with the fewest occurrences in the entire keyword set:

[0063] a) Establish a hash table for storing all possible keyword substrings with a length of W;

[0064] b) Any original keyword P_i of length n_i can be divided into (n_i - W + 1) keyword substrings of length W. For each such substring, first determine whether it is in the keyword substring hash table: if it is not in the hash table, create a new keyword substring entry and set its counter value to 1; if ...
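The counting idea in steps a) and b) can be sketched as follows, with a Python `Counter` standing in for the hash table. The remaining steps are not visible in this text, so the selection rule is an assumption: each keyword takes its least-counted W-byte substring, and ties go to the leftmost one. When several substrings are equally rare, the result can therefore differ from the sets listed elsewhere in the document. The function names are hypothetical.

```python
from collections import Counter


def windows(keyword, w):
    """All substrings of length w, left to right (len(keyword) - w + 1 of them)."""
    return [keyword[i:i + w] for i in range(len(keyword) - w + 1)]


def least_frequent_feature_strings(keywords, w):
    """Pick, for each keyword, its W-byte substring that is rarest in the whole set.

    Assumption: ties are broken by leftmost position; the source's own
    tie-breaking rule is not shown.
    """
    if any(len(k) < w for k in keywords):
        raise ValueError("W must not exceed the shortest keyword length")
    counts = Counter()  # plays the role of the substring hash table
    for kw in keywords:
        counts.update(windows(kw, w))
    # min() returns the first minimal element, which gives the leftmost tie-break.
    return [min(windows(kw, w), key=lambda s: counts[s]) for kw in keywords]


if __name__ == "__main__":
    print(least_frequent_feature_strings(["abcdefg", "abcopq", "wyzopq"], 3))
```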

Embodiment 2

[0071] The entire technical solution of the present invention will be further described below through an embodiment.

[0072] Suppose the keyword set is P={abcdefg, abcopq, wyzopq}, and the text to be matched is bcgilmnom.

[0073] The preprocessing stage of the inventive method proceeds as follows:

[0074] First, the length of the keyword feature string is determined and the feature string corresponding to each keyword is cut out. Here, the feature string length is chosen to be 3 bytes, and the feature string of each keyword is selected according to the principle of least occurrence among keyword substrings, giving the keyword feature string set M = {bcd, cop, wyz}.

[0075] Then, the jump step of the text matching window is set and the keyword feature fragment set is determined. Here, the jump step of the text matching window is chosen to be 2 bytes, so the corresponding keyword feature fragment set K = {bc, cd, co, op, wy, yz} can be obtained...
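The fragment set in [0075] can be reproduced by collecting every 2-byte substring of each 3-byte feature string. The sketch below assumes the fragment length is passed in explicitly, because the general relation between jump step and fragment length is not visible in this text; the function name is hypothetical.

```python
def feature_fragments(feature_strings, fragment_len):
    """Collect every substring of length fragment_len from each feature string.

    Hypothetical helper; fragment_len is supplied by the caller (assumption).
    """
    fragments = set()
    for m in feature_strings:
        for i in range(len(m) - fragment_len + 1):
            fragments.add(m[i:i + fragment_len])
    return fragments


if __name__ == "__main__":
    print(sorted(feature_fragments(["bcd", "cop", "wyz"], 2)))
```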

Abstract

The invention discloses a multi-keyword matching method for fast content analysis, comprising a preprocessing stage and a pattern matching stage. The preprocessing stage includes clipping the keyword feature strings, establishing the keyword feature fragment set, constructing a Bloom filter from the keyword feature fragment set, and constructing a linear table of the original keyword set. In the pattern matching stage, the Bloom filter is used to judge quickly that the text string in the current window matches no keyword feature fragment; exact matching between the text string and the candidate keywords is performed by string comparison only when this quick judgment fails; and the text matching window jumps forward at high speed by several consecutive bytes at a time. The method exploits the very low rate of successful matches between the text to be matched and the keywords to achieve high-speed matching with large keyword sets, which makes it well suited to online scanning applications such as virus detection.
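The two stages named in the abstract can be sketched end to end as below. This is a simplified illustration under stated assumptions, not the patented procedure: the fragment length is taken as `w - step + 1` (chosen so that every aligned window falls inside any feature string it overlaps), a Python dict stands in for the linear table, and the Bloom filter sizes are illustrative. All class and function names are hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter; sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting BLAKE2b with a seed byte.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(item, digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def preprocess(feature_offsets, w, step):
    """feature_offsets maps each keyword (bytes) to the offset of its W-byte feature string."""
    frag_len = w - step + 1  # assumption, see lead-in
    if frag_len < 1:
        raise ValueError("step must not exceed w")
    bloom, table = BloomFilter(), {}
    for keyword, offset in feature_offsets.items():
        for j in range(step):  # a W-byte string has exactly `step` such fragments
            start = offset + j
            fragment = keyword[start:start + frag_len]
            bloom.add(fragment)
            # Remember where the fragment sits inside the full keyword.
            table.setdefault(fragment, []).append((keyword, start))
    return bloom, table, frag_len


def scan(text, bloom, table, frag_len, step):
    """Return (position, keyword) pairs for every keyword occurrence in text."""
    matches = []
    for pos in range(0, len(text) - frag_len + 1, step):
        block = text[pos:pos + frag_len]
        if block not in bloom:
            continue  # quick rejection: jump ahead by `step` bytes
        # Bloom hit (possibly a false positive): verify by exact comparison.
        for keyword, frag_offset in table.get(block, ()):
            begin = pos - frag_offset
            if begin >= 0 and text[begin:begin + len(keyword)] == keyword:
                matches.append((begin, keyword))
    return matches


if __name__ == "__main__":
    features = {b"abcdefg": 1, b"abcopq": 2, b"wyzopq": 0}
    bloom, table, frag_len = preprocess(features, w=3, step=2)
    print(scan(b"bcgilmnom", bloom, table, frag_len, step=2))
    print(scan(b"xxabcopqyy", bloom, table, frag_len, step=2))
```

Because successful matches are rare in the target workload, most windows are rejected by the Bloom filter test alone and the exact comparison runs only occasionally.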

Description

Technical field

[0001] The invention relates to the technical field of content analysis, in particular to a multi-keyword matching method for fast content analysis.

Background

[0002] Multiple pattern string matching is one of the basic problems in computer science. The problem it solves is to judge quickly whether a given data block contains one or more keywords from a keyword set. Multi-keyword matching technology is widely used in text processing, network content analysis, intrusion detection, information retrieval, virus detection and other fields.

[0003] At present, a large number of multi-keyword matching algorithms have emerged, including Aho-Corasick[1], Wu-Manber[2], and E2XB[3] (all cited references are located at the end of the background section). Each of these multi-keyword matching algorithms has its own ideal application conditions. For example, the best application condition of the Aho-Corasi...

Application Information

IPC(8): G06F17/30
Inventor: 叶润国, 华东明, 李博, 胡振宇
Owner: BEIJING VENUS INFORMATION TECH