Large-scale keyword matching method

A keyword matching technology, applied in special data processing applications, instruments, electrical digital data processing and related fields. It addresses hash functions that lack randomness, the resulting high false positive rate, and the loss of HASH-AV keyword matching efficiency that follows. The effect is a lower false positive rate of judgment, higher matching efficiency, and higher retrieval efficiency.

Inactive Publication Date: 2009-04-01
BEIJING VENUS INFORMATION TECH

AI Technical Summary

Problems solved by technology

Using a Bloom filter to determine whether an element belongs to a specified element set produces no false negatives, but it may produce false positives, especially when the set of elements represented by the Bloom filter is large.
In theory, false positives can be reduced by increasing the bit string size of the Bloom filter, but this is difficult to achieve in practice, because the hash functions of Bloom filters constructed in real situations do not have good randomness.
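The theoretical trade-off between bit string size and false positives can be illustrated with the standard Bloom filter approximation. This is a minimal sketch, not part of the patent text: the function name and the parameter values (other than the 100,000-keyword set size mentioned in this section) are assumptions, and the formula presumes ideally random hash functions, which is exactly the condition the text says is hard to meet in practice.

```python
import math


def bloom_false_positive_rate(m_bits: int, n_items: int, k_hashes: int) -> float:
    """Standard approximation (1 - e^(-k*n/m))^k for a Bloom filter.

    Assumes ideal, uniformly random hash functions (an idealization).
    """
    return (1.0 - math.exp(-k_hashes * n_items / m_bits)) ** k_hashes


# Illustrative values only: 100,000 keywords, 3 hash functions,
# and three assumed bit string sizes.
for m_bits in (1 << 20, 1 << 22, 1 << 24):
    rate = bloom_false_positive_rate(m_bits, n_items=100_000, k_hashes=3)
    print(f"m = {m_bits} bits -> estimated false positive rate {rate:.6f}")
```

Under this idealized model the estimated rate falls as the bit string grows; real hash functions with poor randomness may not follow the estimate.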
The HASH-AV method uses a Bloom filter to represent the set of keywords to be searched. We found in experiments that when the keyword set searched by HASH-AV exceeds 100,000 entries, the judgment made with a single Bloom filter that the current text does not match any keyword has a high false positive rate, which directly reduces the keyword matching efficiency of HASH-AV. In addition, each time the text matching window moves, the HASH-AV method re-executes every Bloom filter hash function on the current text, without exploiting the fact that the current text string is mostly the same as the text string in the previous window.
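The overlap between consecutive windows is what a recursive (rolling) hash exploits. The sketch below uses a Rabin-Karp style polynomial hash as a stand-in; the hash function the patent actually uses is not given in this excerpt, so `BASE`, `MOD`, and both function names are assumptions chosen for illustration.

```python
BASE = 257            # assumed polynomial base
MOD = (1 << 61) - 1   # assumed modulus (a Mersenne prime)


def window_hash(data: bytes) -> int:
    """Hash a whole window from scratch: O(W) work per window."""
    h = 0
    for byte in data:
        h = (h * BASE + byte) % MOD
    return h


def rolling_hashes(text: bytes, w: int):
    """Yield (offset, hash) for every W-byte window of text.

    After the first window, each hash is derived from the previous one
    in O(1) by removing the outgoing byte and appending the incoming one.
    """
    if w < 1 or len(text) < w:
        return
    top = pow(BASE, w - 1, MOD)   # weight of the byte leaving the window
    h = window_hash(text[:w])
    yield 0, h
    for i in range(1, len(text) - w + 1):
        outgoing, incoming = text[i - 1], text[i + w - 1]
        h = ((h - outgoing * top) * BASE + incoming) % MOD
        yield i, h
```

Each shifted window costs a constant number of arithmetic operations instead of a full pass over W bytes, which matters when several hash functions must be evaluated at every text position.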

Examples

Embodiment 1

[0040] Suppose there are K original keywords to be searched, expressed as P = {P_1, P_2, ..., P_K}. In practical applications, the original keywords differ in length. To facilitate parallel matching of multiple keywords, the present invention cuts all keywords to equal length: a keyword substring length W is selected, and each original keyword P_i in the set P is cut into a keyword substring M_i of W bytes. This W-byte substring M_i is called the keyword feature string of the original keyword. The set composed of the extracted keyword feature strings M_i is the set M of keyword feature strings. Note that the value of W cannot be greater than the length of the shortest keyword in the original keyword set. The simplest clipping method is to take the W byte prefix or suffix of each keyword as the keyword feature string of the original...
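The simplest clipping method described above (W-byte prefix or suffix) might look like the following minimal sketch. The function name `clip_feature_strings` and the `use_suffix` flag are hypothetical, introduced only for illustration; the keywords in the usage lines are the ones used in Embodiment 2.

```python
def clip_feature_strings(keywords, w, use_suffix=False):
    """Cut every keyword down to a W-byte feature string (prefix or suffix).

    W must not exceed the length of the shortest keyword.
    """
    if w < 1:
        raise ValueError("W must be at least 1")
    shortest = min(len(keyword) for keyword in keywords)
    if w > shortest:
        raise ValueError(f"W={w} exceeds shortest keyword length {shortest}")
    if use_suffix:
        return [keyword[-w:] for keyword in keywords]
    return [keyword[:w] for keyword in keywords]


keywords = ["abcdefghijk", "abcopqrst", "wyzopqhijk"]
print(clip_feature_strings(keywords, 6))
print(clip_feature_strings(keywords, 6, use_suffix=True))
```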

Embodiment approach

[0054] When implementing the present invention, step A1 of the preprocessing stage A can adopt the following preferred implementation: for each keyword P_i in the keyword set P = {P_1, P_2, ..., P_K}, the extracted keyword feature string M_i is the keyword substring with the fewest occurrences in the entire keyword set.

[0055] The following method can be used to make the extracted keyword feature string M_i the keyword substring with the fewest occurrences in the entire keyword set:

[0056] a) Establish a hash table for storing all possible keyword substrings with a length of W;

[0057] b) Any original keyword P_i of length n_i can be divided into (n_i -W) keyword substrings of length W. For each such substring, first judge whether it already appears in the keyword substring hash table: if it is not in the hash table, create a new keyword substring table entry, and the counter value i...
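The least-occurrence selection can be sketched as below. Because step b) is truncated in this excerpt, the exact counting rule is an assumption: the sketch counts every W-length window of every keyword in a `Counter` (standing in for the hash table of step a), then picks, for each keyword, the window with the smallest count, taking the first one on ties. Both function names are hypothetical.

```python
from collections import Counter


def substrings(keyword, w):
    """All contiguous W-length windows of one keyword, left to right."""
    return [keyword[i:i + w] for i in range(len(keyword) - w + 1)]


def least_frequent_feature_strings(keywords, w):
    """Pick, per keyword, the W-length substring rarest across the whole set."""
    if w < 1 or w > min(len(keyword) for keyword in keywords):
        raise ValueError("W must be between 1 and the shortest keyword length")
    counts = Counter()                       # plays the role of the hash table
    for keyword in keywords:
        counts.update(substrings(keyword, w))
    # min() keeps the first window among ties (an assumed tie-break rule).
    return [min(substrings(keyword, w), key=lambda s: counts[s])
            for keyword in keywords]


print(least_frequent_feature_strings(["abcd", "abce", "xbcd"], 3))
```

Choosing rare substrings keeps the feature string set as distinct as possible, so fewer keywords share a hash table entry during exact matching.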

Embodiment 2

[0069] The entire technical solution of the present invention will be further described below through an embodiment.

[0070] Suppose the keyword set is P={abcdefghijk, abcopqrst, wyzopqhijk}, and the text to be matched is bcgilmnommlmloptrstuvabc.

[0071] The preprocessing stage of the method proceeds as follows:

[0072] First, the length of the keyword feature string is determined and the keyword feature string corresponding to each keyword is cut out. Here, the keyword feature string length is selected to be 6 bytes, and the feature string of each keyword is selected according to the least-occurrence principle for keyword substrings. The resulting keyword feature string set is M={bcdefg, copqrs, pqhijk} (note that several keyword substrings may satisfy the least-occurrence principle, and one of them can be randomly selected in practical applications).

[0073] Then, start to construct three simple Bloom filters based on the set M...

Abstract

The invention provides a matching method for large-scale keyword sets, comprising a preprocessing stage and a pattern matching stage. The preprocessing stage comprises cutting out the keyword feature strings, constructing a plurality of simple Bloom filters based on the keyword feature string set, and constructing a hash table based on the keyword feature string set. The pattern matching stage comprises the following steps: the previously constructed series of simple Bloom filters is used to quickly judge that the text string in the current window does not match any keyword feature string; when that judgment fails, exact matching against the candidate keywords is executed; during text scanning, the hash values of the current text for all simple Bloom filters are quickly calculated by a recursive algorithm. The method makes full use of two facts: the success rate of matching the text against the keywords is extremely low, and recursive hash computation is highly efficient. It achieves high-speed matching with large-scale keyword sets and is well suited to online scanning applications such as virus detection.

Description

Technical field

[0001] The invention relates to the technical field of computer content analysis, in particular to a multi-keyword matching method for rapid content analysis.

Background art

[0002] Multiple pattern string matching addresses the problem of quickly judging whether a given data block contains one or more keywords from a keyword set. Multi-keyword matching technology is widely used in text processing, network content analysis, intrusion detection, information retrieval, virus detection and other fields.

[0003] Traditional multi-keyword matching methods include literature [A.V. Aho, M.J. Corasick. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, 1975, 18(6): 333-340], literature [S. Wu, U. Manber. A Fast Algorithm For Multi-Pattern Searching. Technical Report TR-94-17, University of A...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 叶润国周涛华东明孙海波骆拥政焦玉峰
Owner: BEIJING VENUS INFORMATION TECH