Method for identifying sub-sequences of interest in a sequence

A technique for identifying sub-sequences of interest in a sequence, applied in the field of sequence-of-interest identification. It addresses three problems: the purposes of different parts of a genome are currently unknown, meaningful or interesting sequences within a genome are difficult to identify, and traditional methods of identifying such sequences are increasingly inadequate.

Inactive Publication Date: 2005-12-08
GENERAL ELECTRIC CO
5 Cites, 22 Cited by

AI Technical Summary

Benefits of technology

[0007] In accordance with a further embodiment of the present technique, a method is provided for identifying a biological sequence of interest. The method comprises analyzing a biological polymer sequence based on a grammar comprising at least an initial grammar. A minimum description length heuristic for each sub-sequence of the analyzed biological polymer sequence may be calculated. A selected minimum description length heuristic may be compared with one or more reference conditions. The grammar and the biological polymer sequence may be updated with a symbol representing a sub-sequence corresponding to the selected minimum description length heuristic based upon a non-termination result of the comparison. Alternatively, the sub-sequence may be identified as a biological sequence of interest based upon a termination result of the comparison. Code stored on tangible, machine-readable media may afford functionality of the type defined by these methods and is provided for by the present technique.

Problems solved by technology

For genomes that are known or are being sequenced, the purposes of different parts of the genome are currently unknown.
Hence, the identification of meaningful or interesting sequences within a genome may pose a challenge.
Furthermore, it is increasingly difficult to identify meaningful sequences of interest employing traditional techniques.
In particular, the vast amount of data, such as genome data, is difficult to analyze in a computationally efficient manner using traditional techniques.
In addition, existing computational techniques to determine meaningful information may be inadequate for the identification of sequences of interest.
For example, existing techniques may fail to identify DNA sequences in a genome that are known to be of interest, such as sequences experimentally demonstrated to be of interest.



Embodiment Construction

[0014] In many fields, such as genomic sequencing and analysis, it may be desirable to identify repetitive sequences, either to assist in compression and manipulation or to facilitate analysis. In particular, it may be desirable to identify such sequences in a computationally efficient manner. The techniques discussed herein address some or all of these issues.

[0015] Turning now to the drawings, and referring to FIG. 1, a flow chart 10 depicts steps for identifying a sequence of interest, according to one aspect of the present technique. As suggested by the flow chart 10, a given data series 12 may be provided, within which may be one or more sequences of interest to be identified. The data series 12 may be constructed from a grammar 14. As will be appreciated by those of ordinary skill in the art, a grammar 14 may comprise terminals, i.e., uncombined symbols, and variables, i.e., combinations of terminals or of terminals and other variables. For example, for a numeric data series 12, ...
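The distinction between terminals and variables can be illustrated with a small sketch. The representation below (a dictionary mapping variable names to lists of symbols, and the names `S1`, `S2`, `expand`, and `expand_series`) is hypothetical and chosen only for illustration; the source does not specify a data structure.

```python
from typing import Dict, List

# Hypothetical representation (an assumption, not taken from the source):
# a terminal is any symbol that has no entry in the grammar; a variable is
# a name mapped to a list of symbols, each a terminal or another variable.
Grammar = Dict[str, List[str]]


def expand(symbol: str, grammar: Grammar) -> str:
    """Recursively expand one symbol into its string of terminals."""
    if symbol not in grammar:  # no rule for it, so it is a terminal
        return symbol
    return "".join(expand(s, grammar) for s in grammar[symbol])


def expand_series(series: List[str], grammar: Grammar) -> str:
    """Expand a whole data series written in terms of the grammar."""
    return "".join(expand(s, grammar) for s in series)


grammar: Grammar = {
    "S1": ["a", "c"],         # a combination of terminals
    "S2": ["S1", "g", "S1"],  # a combination of a terminal and other variables
}

series = ["S2", "t", "S1"]
print(expand_series(series, grammar))  # acgactac
```

A series written with variables is shorter than its fully expanded form, which is what makes the grammar useful for compression-style analysis.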


PUM

Property
symbol compression ratio
compression ratio
description length

Abstract

The present technique provides for the analysis of a data series to identify sequences of interest within the series. The analysis may be used to iteratively update a grammar used to analyze the data series or updated versions of the data series. Furthermore, the technique provides for the calculation of a minimum description length heuristic, such as a symbol compression ratio, for each sub-sequence of the analyzed data sequence. The technique may then compare a selected heuristic value against one or more reference conditions to determine if additional iteration is to be performed. The grammar and the data sequence may be updated between iterations to include a symbol representing a string corresponding to the selected heuristic value based upon a non-termination result of the comparison. Alternatively, the string corresponding to the selected heuristic value may be identified as a sequence of interest based upon a termination result of the comparison.
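The iterative procedure in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented method: the `ratio` function is a stand-in heuristic (symbols needed after substitution divided by symbols covered before), not the patent's definition of symbol compression ratio, and the termination conditions (`threshold`, `max_iters`), the candidate length limit `max_len`, and all function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Sub = Tuple[str, ...]


def count_non_overlapping(series: List[str], sub: Sub) -> int:
    """Count non-overlapping occurrences of sub, scanning left to right."""
    k, i, count = len(sub), 0, 0
    while i + k <= len(series):
        if tuple(series[i:i + k]) == sub:
            count += 1
            i += k
        else:
            i += 1
    return count


def ratio(length: int, repeats: int) -> float:
    # Stand-in heuristic (an assumption, not the patent's formula):
    # cost after = store the sub-sequence once + one symbol per occurrence;
    # cost before = every occurrence spelled out. Lower is better.
    return (length + repeats) / (length * repeats)


def best_candidate(series: List[str], max_len: int) -> Optional[Tuple[float, Sub]]:
    """Score every repeated sub-sequence and return the lowest-ratio one."""
    best: Optional[Tuple[float, Sub]] = None
    seen = set()
    for k in range(2, max_len + 1):
        for i in range(len(series) - k + 1):
            sub = tuple(series[i:i + k])
            if sub in seen:
                continue
            seen.add(sub)
            repeats = count_non_overlapping(series, sub)
            if repeats < 2:
                continue
            score = ratio(k, repeats)
            if best is None or score < best[0]:
                best = (score, sub)
    return best


def substitute(series: List[str], sub: Sub, symbol: str) -> List[str]:
    """Replace each non-overlapping occurrence of sub with the new symbol."""
    out, i, k = [], 0, len(sub)
    while i < len(series):
        if tuple(series[i:i + k]) == sub:
            out.append(symbol)
            i += k
        else:
            out.append(series[i])
            i += 1
    return out


def analyze(data: str, max_len: int = 8, threshold: float = 1.0,
            max_iters: int = 10) -> Tuple[List[str], Dict[str, Sub]]:
    """Iteratively update the grammar and the series until termination."""
    series: List[str] = list(data)
    grammar: Dict[str, Sub] = {}
    for n in range(max_iters):
        cand = best_candidate(series, max_len)
        # Hypothetical termination result: nothing repeats, or no gain.
        if cand is None or cand[0] >= threshold:
            break
        name = f"<V{n}>"  # new variable; cannot collide with 1-char terminals
        grammar[name] = cand[1]
        series = substitute(series, cand[1], name)
    return series, grammar


series, grammar = analyze("abcabcabcxy")
print(series)   # ['<V0>', '<V0>', '<V0>', 'x', 'y']
print(grammar)  # {'<V0>': ('a', 'b', 'c')}
```

In this sketch the sub-sequences recorded in `grammar` are the candidates for sequences of interest; how a termination result maps onto a reported sequence of interest is one possible reading of the abstract.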

Description

BACKGROUND

[0001] The invention relates generally to algorithmic information theory, and more specifically, to the identification of sequences of interest in a given data series.

[0002] In various applications, such as information theory, data compression, and intrusion detection, it may be desirable to identify sequences of interest within a data series. It may be advantageous to identify such sequences of interest in order to extract meaningful information from the identified sequences or to allow easier manipulation or analysis of the data series. For example, identification of repetitive sequences in a data series may allow easier or more effective compression of the data.

[0003] Similarly, in the field of genetics, biologically interesting phrases or sequences in a genome, such as the human genome, may have higher redundancy than non-meaningful phrases, as nature tends to repeat or emphasize important sequences more frequently than unimportant sequences. However, for the genome...


Application Information

Patent Type & Authority: Applications (United States)
IPC(8): G16B30/00, G01N33/48, G01N33/50, G06F17/30
CPC: G16B30/00
Inventor: EVANS, SCOTT; BUSH, STEPHEN; TORRES, ANDREW
Owner: GENERAL ELECTRIC CO