Method and system for efficient indexed storage for unstructured content

An indexing technique for unstructured content, applied in the field of storage. It addresses the lack of robust, scalable indexing techniques for unstructured content such as multimedia, and the resulting inability to insert and query through efficient indexes, with the aim of avoiding a loss of indexing efficiency.

Active Publication Date: 2010-01-19
NAHAVA
3 Cites, 36 Cited by

AI Technical Summary

Problems solved by technology

Unstructured content, for example, multimedia, does not fit well in conventional databases.
Conventional databases perform admirably on structured content, but for unstructured content they lack the ability to insert and query via efficient indexes.
This presents a problem.
The absence of robust, scalable indexing techniques distinguishes unstructured content from the structured content.
This presents a problem.
Adding “features” takes effort.
This presents a problem.
First, when the number of items is large, it is often impractical to apply features manually, a practice commonly referred to as “hand-tagging.” Second, content might be manually tagged once, but it can be impractical to revisit it and tag it again for another purpose.
For example, when a new inquiry arises, it may be impractical to rescan an entire collection of images to annotate for a particular mole near the nose or for a scar on the forehead.
This presents a problem.


Examples


Example 1

Token Sequence Example #1

Text

[0025]Given a block of text, tokenize it by mapping words, word breaks, punctuation, and other formatting into a sequence of tokens. The tokenizer performs the desired mapping from text to tokens. From the above discussion, it follows that a block of text corresponds to a probability transition matrix. In other words, associate a block of text with its corresponding vector in an inner product space. A distance function between two vectors can be expressed as,

Distance(A,B)=sqrt(<A−B,A−B>)

[0026]This distance is also known as the “norm,”

Distance(A,B)=∥A−B∥=sqrt(<A−B,A−B>)

[0027]According to this metric, we say that two text blocks are similar if the distance between their corresponding probability transition matrices is small. In practice, we have found that this numerical measure corroborates our own ordinary concept of “similar” text. For example, when the distance between two text blocks is small, the text is satisfyingly similar to a human reader. Empirically, we...
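The pipeline described above (tokenize, build a transition matrix, measure distance) can be sketched roughly as follows. This is a minimal illustration and not the patent's implementation: the regex tokenizer, the first-order token-to-token transitions, the entry-wise inner product, and the names `tokenize`, `transition_matrix`, `inner`, and `distance` are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase words and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def transition_matrix(tokens):
    # Sparse, row-normalized matrix stored as {(prev, next): P(next | prev)}.
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    matrix = {}
    for prev, row in counts.items():
        total = sum(row.values())
        for nxt, n in row.items():
            matrix[(prev, nxt)] = n / total
    return matrix


def inner(a, b):
    # Assumed inner product: sum of products of matching entries.
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def distance(a, b):
    # Distance(A,B) = sqrt(<A-B, A-B>)
    diff = {k: a.get(k, 0.0) - b.get(k, 0.0) for k in a.keys() | b.keys()}
    return math.sqrt(inner(diff, diff))


if __name__ == "__main__":
    a = transition_matrix(tokenize("the cat sat on the mat."))
    b = transition_matrix(tokenize("the cat sat on the rug."))
    c = transition_matrix(tokenize("stock prices fell sharply today."))
    print(distance(a, b), distance(a, c))
```

In this toy case the two sentences that differ by one word end up closer to each other than either is to the unrelated sentence, which is the behavior the paragraph describes.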

Example 2

Token Sequence Example #2

Genomic Data

[0029]Another kind of unstructured content is genomic data. Using an embodiment of the present invention, DNA sequences of nucleotides can be parsed and transformed into sparse probabilistic tuple-to-token transition matrices. For example, the VIIth and XIVth chromosomes of Baker's yeast (Saccharomyces cerevisiae) consist of 563,000 and 784,000 bases, respectively. In one embodiment of the invention, tuple widths of 8, 15, and 30 have been used to perform the “splitting” operation. The operations to compute the average and deviation each take 40 seconds to 3 minutes, depending on the tuple width, for each chromosome on a vintage 2002 desktop PC. This confirms that long token sequences associated with raw nucleotide data can be mapped to an inner product vector space.
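As a rough illustration of a sparse tuple-to-token transition matrix, the sketch below maps each run of `width` consecutive bases to the distribution of the base that follows it. It assumes this sliding-window reading of "tuple-to-token"; the function names `tuple_to_token_matrix` and `inner` are hypothetical, and the patent's "splitting," average, and deviation operations are not reproduced here.

```python
from collections import Counter, defaultdict


def tuple_to_token_matrix(sequence, width):
    # Count, for every length-`width` tuple, which base follows it.
    counts = defaultdict(Counter)
    for i in range(len(sequence) - width):
        counts[sequence[i:i + width]][sequence[i + width]] += 1
    # Normalize each row so its entries are probabilities P(base | tuple).
    # Only observed (tuple, base) pairs are stored, so the matrix is sparse.
    return {
        (tup, base): n / sum(row.values())
        for tup, row in counts.items()
        for base, n in row.items()
    }


def inner(a, b):
    # Assumed inner product: sum of products of matching sparse entries.
    if len(a) > len(b):
        a, b = b, a
    return sum(v * b.get(k, 0.0) for k, v in a.items())


if __name__ == "__main__":
    m = tuple_to_token_matrix("ACGTACGA", width=3)
    for key in sorted(m):
        print(key, m[key])
```

A dictionary keyed by observed pairs keeps storage proportional to the number of distinct transitions seen, which matters at widths such as 8, 15, or 30, where the space of possible tuples is far larger than any one chromosome.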

[0030]Image Blocks

[0031]The techniques of the present invention may be used on image blocks, as similarity between two images can be represented by computing an inner product over “s...


Abstract

A method and apparatus for efficient indexed storage for unstructured content have been disclosed.

Description

RELATED APPLICATION

[0001]This patent application claims priority of U.S. Provisional Application Ser. No. 60/656,521 filed Feb. 24, 2005 titled “Method and Apparatus for Efficient Indexed Storage for Unstructured Content”, which is hereby incorporated herein by reference.

FIELD OF THE INVENTION

[0002]The present invention pertains to storage. More particularly, the present invention relates to a method and apparatus for efficient indexed storage for unstructured content.

BACKGROUND OF THE INVENTION

[0003]Unstructured content, for example, multimedia, does not fit well in conventional databases. Conventional databases perform admirably on structured content, but for unstructured content they lack the ability to insert and query via efficient indexes. This presents a problem.

[0004]Unstructured content includes, among other things, text, multimedia and cutting-edge data types such as genomic sequences. Text covers documents, emails, blogs, etc. Multimedia encompasses images, music, voice, ...


Application Information

Patent Type & Authority: Patents (United States)
IPC(8): G06N5/02
CPC: G06F17/3002, G06F17/3069, G06F17/30625, G06F17/30333, G06F16/3347, G06F16/322, G06F16/41, G06F16/2264
Inventor: NAKANO, RUSSELL TOSHIO
Owner: NAHAVA