Pattern search method, pattern search apparatus and computer program therefor, and storage medium thereof

A pattern search technology, applied in the field of pattern search methods, pattern search apparatus, computer programs therefor, and storage media thereof. It addresses the large data size that a suffix tree requires, the time required for binary (dichotomous) search calculations on a suffix array, and the resulting difficulty of employing suffix trees for large text databases.

Inactive Publication Date: 2002-09-05
IBM CORP
0 Cites, 75 Cited by

AI Technical Summary

Problems solved by technology

First, when a suffix tree is employed as a data structure, the data structure itself becomes very large.
It is therefore difficult to employ a suffix tree for a large text database.
By contrast, if the size of the alphabet is a constant, the time required for a binary (dichotomous) search of a suffix array is enormous compared with the linear time that a suffix tree search requires.
However, when a large number of character types (s) is employed, the data array that must be maintained as a table is extremely large, which is not practical.
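
For reference, the conventional binary (dichotomous) search of a suffix array mentioned above can be sketched as follows. This is a minimal illustration, not the patent's code: the function names and the naive suffix array construction are assumptions made for the example. Each of the O(log n) steps compares up to m characters of the pattern, which is where the extra cost over a suffix tree's linear-time search comes from.

```python
def build_suffix_array(text):
    """Naive construction for illustration only: sort suffix start positions."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def binary_search_range(text, sa, pattern):
    """Return the half-open range [start, end) of sa whose suffixes begin with pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


text = "banana$"  # "$" is a terminator that sorts before the letters
sa = build_suffix_array(text)
start, end = binary_search_range(text, sa, "ana")
print(sorted(sa[start:end]))  # [1, 3]
```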

Examples

Example 2

[0121] Case where a character is represented by two bytes, and there are 65536 (=2^16) character types. Japanese text corresponds to this case. When k=65536 is set, the total size of the tables F, L, G and C is 8n bytes, and the total data size, even including the text T and the suffix array SA, is only 14n bytes. Note that in this case a small value of k, such as k=256, is not preferable because it increases the data size.
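
The 14n figure can be checked with simple arithmetic. The sketch below assumes 4-byte suffix array entries, which this excerpt does not state but which is the value that makes the totals agree; the function name is hypothetical.

```python
def total_bytes_per_character(table_bytes, text_bytes, sa_entry_bytes=4):
    """Bytes of data per character of text, i.e. the multiplier of n.

    sa_entry_bytes=4 is an assumption; the excerpt does not give the entry size.
    """
    return table_bytes + text_bytes + sa_entry_bytes


# Example 2: tables F, L, G and C take 8n bytes; each character of T takes 2 bytes.
example_2_total = total_bytes_per_character(table_bytes=8, text_bytes=2)
print(example_2_total)  # 14
```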

Example 3

[0122] Case of a DNA sequence (the number of character types is four). If the use of 2-bit characters and 4-bit characters is permitted, with k=4 the total data size for the tables F, L, G and C, the text T and the suffix array SA will be approximately 8.75n bytes. Further, when k=16, the total data size is about 5.375n bytes. The data size, especially in the second case, is substantially no different from the size of the suffix array SA.

[0123] An example of measuring the search speed for an actual DNA sequence will now be explained. In this example, the calculation times of the search method of this embodiment and of the conventional binary search of the suffix array SA are compared, with the same query repeated 10,000,000 times against the entire DNA sequence of a colon bacillus (Escherichia coli). The calculations were performed on an RS/6000 (an IBM workstation) equipped with a 333 MHz PowerPC CPU.

[0124] Search pattern P="CACATAA"

[0125] Search ti...

Abstract

A fast search of a large text database is performed while suppressing an increase in the size of the data structure used for the process. A pattern search method for searching a target character string for a desired pattern includes a range search step and a character string extraction step. At the range search step, intermediate patterns are obtained by adding characters one by one, in order from the last character of the pattern to the first, and for each intermediate pattern a range is determined in the suffix array corresponding to the target character string, namely the range in which the first character of that intermediate pattern is present. Then, at the character string extraction step, the elements of the character string that correspond to the elements included in the range of the suffix array are designated, and character string segments are extracted that begin at those elements and have the same number of elements as the pattern.
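
The two steps in the abstract can be illustrated with a small sketch. It is not the patented implementation: the tables F, L, G and C are not described in this excerpt, so the sketch swaps in a plain occurrence-count table of the kind used in FM-index-style backward search. All function and variable names are hypothetical, and the text is assumed to end with a unique terminator "$" that sorts before every other character and never appears in the pattern.

```python
from collections import Counter


def build_index(text):
    """Build a suffix array and two helper tables (illustrative stand-ins only)."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])  # naive construction
    # Character that precedes each sorted suffix (wraps to the terminator at position 0).
    prev_char = [text[p - 1] for p in sa]
    # first_row[c]: index in sa of the first suffix that starts with c.
    counts = Counter(text)
    first_row, total = {}, 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in prev_char[:i].
    occ = {c: [0] for c in counts}
    for ch in prev_char:
        for c, column in occ.items():
            column.append(column[-1] + (c == ch))
    return sa, first_row, occ


def range_search(pattern, sa, first_row, occ):
    """Range search step: extend the pattern from its last character to its first."""
    lo, hi = 0, len(sa)  # half-open range covering every suffix
    for c in reversed(pattern):
        if c not in first_row:
            return 0, 0
        lo = first_row[c] + occ[c][lo]
        hi = first_row[c] + occ[c][hi]
        if lo >= hi:
            return 0, 0  # no suffix starts with this intermediate pattern
    return lo, hi


def extract_segments(text, sa, lo, hi, length):
    """Extraction step: read `length` characters at each position in the range."""
    return [(p, text[p:p + length]) for p in sorted(sa[lo:hi])]


text = "banana$"
sa, first_row, occ = build_index(text)
lo, hi = range_search("ana", sa, first_row, occ)
matches = extract_segments(text, sa, lo, hi, len("ana"))
print(matches)  # [(1, 'ana'), (3, 'ana')]
```

Unlike a binary search, each added character here costs two table lookups regardless of the text length; the price is the extra memory for the helper tables, which is the trade-off the abstract refers to.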

Description

[0001] 1. Field of the Invention

[0002] The present invention relates to a data structure used to search an array for a frequently appearing segment, such as a character string, or to search for an array segment that is common to two or more arrays, and to a pattern search method using this data structure.

[0003] 2. Related Art

[0004] A suffix tree is a well known data structure that can be effectively employed to perform a quick search of character strings for a frequently appearing segment or for a character string segment used in common in two or more character strings. A suffix tree is one in which all the suffixes in a character string are represented by adding, to the end of a process target character string, the character $, which is not present in the character strings that are processed. The leaf nodes (nodes, at the ends of edges, to which no edges are connected) of a suffix tree correspond to individual suffixes.

[0005] When a specific character is designated in a predetermine...
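
As a small illustration of paragraph [0004], the suffixes that a suffix tree represents can be listed directly once the terminator $ is appended. This sketch only enumerates the suffixes (one per leaf node); it does not build the tree, and the function name is hypothetical.

```python
def suffixes_with_terminator(s, terminator="$"):
    """List every suffix of s after appending a terminator not present in s."""
    if terminator in s:
        raise ValueError("terminator must not occur in the string")
    t = s + terminator
    return [t[i:] for i in range(len(t))]


leaves = suffixes_with_terminator("abab")
print(leaves)  # ['abab$', 'bab$', 'ab$', 'b$', '$']
```

Because of the terminator, no suffix is a prefix of another, so each one ends at its own leaf.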

Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G06F17/30
CPC: G06F17/30985, G06F17/30988, Y10S707/99936, Y10S707/99933, G06F16/90344, G06F16/90348
Inventor: SHIBUYA, TETSUO
Owner: IBM CORP