Pattern search method, pattern search apparatus and computer program therefor, and storage medium thereof

Pattern search technology, applied to pattern search methods, pattern search apparatus and computer programs therefor, addresses the problem that a suffix tree requires too much storage to be practical for a large text database, and achieves the effect of reducing the data size of the data structure used for the search.

Inactive Publication Date: 2006-03-21
INT BUSINESS MASCH CORP
Cites: 4. Cited by: 13.

AI Technical Summary

Benefits of technology

[0021] It is, therefore, one object of the present invention to perform a fast search of a large text database, while suppressing an increase in the data size of a data structure used for the process.
[0023] This pattern search method can be used to search a character string consisting of various characters, such as alphabetical or Japanese text characters. In particular, when a desired pattern is to be extracted from a character string consisting of a small number of character types, such as binary data or a genetic array, the size of the data structure used for the search can be reduced.
[0029] The table can be generated so that it reflects the locations of elements only at every predetermined count in the suffix array. That is, the data size of the data structure can be reduced by thinning out the data to be stored in the table.

Problems solved by technology

First, when a suffix tree is employed as a data structure, a large amount of storage is required (20n to 40n bytes for a text of n characters).
Therefore, it is difficult to employ a suffix tree for a large text database.


Examples

Example 1

[0122] Case in which a character is represented by one byte and there are 256 types of characters (255 types when the end character $ is to be represented at the same time). In general, English text corresponds to this case.

[0123] When the character count of the text T is defined as n, the size of the text T is n bytes, and the size of the suffix array SA is 4n bytes. For example, when k=65536 (=2^16) is employed, the numbers equal to or smaller than k can be represented by two bytes. Thus, the total size of the tables F, L, G and C is a little more than 2n bytes. Therefore, the data size, even including the text T and the suffix array SA, is only a little over 7n bytes, which is only about one third of the size of the suffix tree (20n to 40n bytes) that corresponds to the text T. Since the search time is proportional to log k, the search can be made faster by reducing the value of k. When, for example, k=256 (=2^8) is set, twice the search speed can be expected compared with when k...

Example 2

[0124] Case in which a character is represented by two bytes, and there are 65536 (=2^16) character types. Japanese text corresponds to this case. When k=65536 is set, the total size of the tables F, L, G and C is 8n bytes, and the total data size, even including the text T and the suffix array SA, is only 14n bytes. It should be noted that in this case a small value of k, such as k=256, is not preferable because the data size will be increased.
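
The totals quoted in Examples 1 and 2 can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the patent text: the function name `bytes_per_char` and the constant names are hypothetical, each value is the coefficient of n taken from the paragraphs above, and the "a little more than 2n" table size of Example 1 is approximated as exactly 2n.

```python
def bytes_per_char(text_bytes, sa_bytes, table_bytes):
    """Total data size per text character, in bytes (the coefficient of n)."""
    return text_bytes + sa_bytes + table_bytes


# Example 1: 1-byte characters, k = 65536; tables F, L, G and C taken as about 2n bytes.
example1 = bytes_per_char(text_bytes=1, sa_bytes=4, table_bytes=2)

# Example 2: 2-byte characters, k = 65536; tables F, L, G and C are 8n bytes.
example2 = bytes_per_char(text_bytes=2, sa_bytes=4, table_bytes=8)

# Suffix tree size range quoted in the text: 20n to 40n bytes.
SUFFIX_TREE_MIN, SUFFIX_TREE_MAX = 20, 40

print("Example 1 total (x n bytes):", example1)
print("Example 2 total (x n bytes):", example2)
print("Example 1 relative to suffix tree:",
      example1 / SUFFIX_TREE_MAX, "to", example1 / SUFFIX_TREE_MIN)
```

Under these assumptions the sums reproduce the 7n and 14n figures given in the text, and the Example 1 total is at most about one third of the quoted suffix tree size.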

Example 3

[0125] Case of a DNA array (the number of character types is four). If the use of 2-bit characters and 4-bit characters is permitted, with k=4 the total data size for the tables F, L, G and C, the text T and the suffix array SA will be approximately 8.75n bytes. Further, when k=16, the total data size is about 5.375n bytes. The data size, especially in the second case, is substantially no different from the size of the suffix array SA.

[0126] An example of measuring the search speed for an actual DNA array will be explained. In this example, the calculation times are compared between the search method of this embodiment and the conventional method, a binary search of the suffix array SA, when the same query is repeated 10000000 times for all the arrays of a colon bacillus (Escherichia coli). It should be noted that an RS/6000 (a workstation by IBM), which was equipped with a 333 MHz PowerPC as the CPU, was employed for the calculations.
[0127] Search pattern P=“CACATAA”
[0128] Search time req...
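
The conventional baseline named in [0126], a binary search of the suffix array SA, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the suffix array is built naively, the text is a short made-up string rather than the E. coli data, and the function names `build_suffix_array` and `binary_search_range` are hypothetical. It makes no claim about the timings reported in the patent.

```python
def build_suffix_array(text):
    """Naive construction by sorting all suffixes; adequate for a short demo string."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def binary_search_range(text, sa, pattern):
    """Return the half-open range [start, end) of sa whose suffixes begin with pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


# Made-up demo text (an assumption), terminated by the end character $.
text = "GCACATAACACATAAT$"
sa = build_suffix_array(text)
start, end = binary_search_range(text, sa, "CACATAA")
positions = sorted(sa[j] for j in range(start, end))
print(positions)
```

Each of the two binary searches performs O(log n) string comparisons of up to m characters, where n is the text length and m the pattern length.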

Abstract

A fast search is performed of a large text database, while suppressing an increase in the data size of the data structure used for the process. A pattern search method for searching a target character string for a desired pattern includes: a range search step and a character string extraction step. At the range search step, intermediate patterns are obtained by adding characters in order, one by one, from the last character of the pattern to the first, and a range is determined for a suffix array, which corresponds to the target character string, wherein the first character of each of the intermediate patterns is present. Then, at the character string extraction step, elements of the character string are designated that correspond to elements included in the range of the suffix array, and character string segments are extracted consisting of the same number of elements as the elements of the pattern and having the elements of the character string as their first characters.
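
The two steps in the abstract can be illustrated with a short Python sketch. It is an assumption-laden illustration, not the patent's implementation: the layout of the tables F, L, G and C is not given in this excerpt, so the sketch substitutes the standard backward-search recurrence known from FM-index style indexes, with naive counting instead of precomputed tables. The function names and the demo text are hypothetical, and the text is assumed to end with a unique end character $ that sorts before every other character.

```python
from collections import Counter


def build_suffix_array(text):
    """Naive construction by sorting all suffixes; adequate for a short demo string."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def range_search(text, sa, pattern):
    """Range search step: grow the pattern from its last character to its first,
    narrowing the half-open range [lo, hi) of the suffix array at each step."""
    # preceding[j] is the character just before suffix sa[j] (wraps to '$' for sa[j] == 0).
    preceding = [text[i - 1] for i in sa]

    # first[ch] = number of characters in text that sort before ch.
    first, total = {}, 0
    for ch, count in sorted(Counter(text).items()):
        first[ch] = total
        total += count

    lo, hi = 0, len(sa)
    for ch in reversed(pattern):
        if ch not in first:
            return 0, 0
        # Naive counting; a real index would use precomputed tables here.
        lo = first[ch] + preceding[:lo].count(ch)
        hi = first[ch] + preceding[:hi].count(ch)
        if lo >= hi:
            return 0, 0
    return lo, hi


def extract_segments(text, sa, lo, hi, length):
    """Extraction step: for each suffix array element in the range, take the
    segment of the text that starts there and is as long as the pattern."""
    return [(sa[j], text[sa[j]:sa[j] + length]) for j in range(lo, hi)]


# Made-up demo text (an assumption), terminated by the end character $.
text = "GCACATAACACATAAT$"
pattern = "CACATAA"
sa = build_suffix_array(text)
lo, hi = range_search(text, sa, pattern)
matches = sorted(extract_segments(text, sa, lo, hi, len(pattern)))
print(matches)
```

The loop visits the pattern once, so the number of range updates equals the pattern length; the cost of each update depends on how the counts are stored, which is where the choice of tables matters.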

Description

BACKGROUND OF THE INVENTION
[0001] 1. Field of the Invention
[0002] The present invention relates to a data structure used to search an array for a frequently appearing segment, such as a character string, or to search for an array segment that is common to two or more arrays, and to a pattern search method using this data structure.
[0003] 2. Related Art
[0004] A suffix tree is a well known data structure that can be effectively employed to perform a quick search of character strings for a frequently appearing segment or for a character string segment used in common in two or more character strings. A suffix tree is one in which all the suffixes in a character string are represented by adding, to the end of the character string to be processed, the character $, which is not present in the character strings that are processed. The leaf nodes (nodes, at the ends of edges, to which no edges are connected) of a suffix tree correspond to individual suffixes.
[0005] When a specific character is design...


Application Information

Patent Type & Authority: Patents (United States)
IPC: IPC(8): G06F17/30
CPC: G06F17/30985, G06F17/30988, Y10S707/99933, Y10S707/99936, G06F16/90344, G06F16/90348
Inventor: SHIBUYA, TETSUO
Owner: INT BUSINESS MASCH CORP