Method and system for efficient indexed storage for unstructured content

What is Al technical title?
Al technical title is built by PatSnap Al team. It summarizes the technical point description of the patent document.
a technology of unstructured content and indexing techniques, applied in the field of storage, can solve the problems of unstructured content, multimedia, lack of robust, scalable indexing techniques, and inability to insert and query through efficient indexes, and achieve the effect of distinguishing unstructured content from structured content, and avoiding the loss of indexing efficiency

Active Publication Date: 2010-01-19

NAHAVA

View PDF3 Cites 36 Cited by

Summary
Abstract
Description
Claims
Application Information

AI Technical Summary
This helps you quickly interpret patents by identifying the three key elements:
Problems solved by technology
Method used
Benefits of technology

Problems solved by technology

Unstructured content, for example, multimedia, does not fit well in conventional databases.

Conventional databases perform admirably on structured content, but for unstructured content they lack the ability to insert and query via efficient indexes.

This presents a problem.

The absence of robust, scalable indexing techniques distinguishes unstructured content from the structured content.

This presents a problem.

Adding “features” takes effort.

This presents a problem.

First, when the number of items is large it is often impractical to manually apply features, commonly referred to as, “hand-tagging.” Second, content might be manually tagged once, but it can be impractical to revisit them to tag them for another reason.

However, when a new inquiry arises, it may be impractical to rescan the entire collection of images to annotate for a particular mole near the nose or for a scar on the forehead.

This presents a problem.

Method used

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine

Image

Smart Image Click on the blue labels to locate them in the text.

Viewing Examples

Smart Image

Examples

Experimental program

Comparison scheme

Effect test

example # 1

Token Sequence Example #1

Text

[0025]Given a block of text, tokenize it by mapping words, word breaks, punctuation, and other formatting into a sequence of tokens. The tokenizer performs the desired mapping from text to tokens. From the above discussion, it follows that a block of text corresponds to a probability transition matrix. In other words, associate a block of text with its corresponding vector in an inner product space. A distance function expressed between two vectors as,

Distance(A,B)=sqrt(A−B,A−B>)

[0026]This distance is also known as the “norm,”

Distance(A,B)=∥A−B∥=sqrt(A−B,A−B>)

[0027]According to this metric, we say that two text blocks are similar if the distance between their corresponding probability transition matrices is small. In practice, we have found that this numerical measure corroborates our own ordinary concept of “similar” text. For example, when the distance between two text blocks is small, the text is satisfyingly similar to a human reader. Empirically, we...

example # 2

Token Sequence Example #2

Genomic Data

[0029]Another kind of unstructured content is genomic data. Using an embodiment of the present invention, DNA sequences of nucleotides can be parsed and transformed into sparse probabilistic tuple-to-token transition matrices. For example, the VIIth and XIVth chromosomes of Baker's yeast (Saccharomyces cerevisiae) consist of 563,000 and 784,000 bases, respectively. In one embodiment of the invention, a tuple width of 8, 15, and 30 have been used to perform the “splitting” operation. The operations to compute the average and deviation each take 40 seconds to 3 minutes, depending on the tuple width, for each chromosome on a vintage 2002 desktop PC. This confirms that long token sequences associated with raw nucleotide data can be mapped to an inner product vector space.

[0030]Image Blocks

[0031]The techniques of the present invention may be used on image blocks, as similarity between two images can be represented by computing an inner product over “s...

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine

Login to View More

PUM

Login to View More

Abstract

A method and apparatus for efficient indexed storage for unstructured content have been disclosed.

Description

RELATED APPLICATION[0001]This patent application claims priority of U.S. Provisional Application Ser. No. 60 / 656,521 filed Feb. 24, 2005 titled “Method and Apparatus for Efficient Indexed Storage for Unstructured Content”, which is hereby incorporated herein by reference.FIELD OF THE INVENTION[0002]The present invention pertains to storage. More particularly, the present invention relates to a method and apparatus for efficient indexed storage for unstructured content.BACKGROUND OF THE INVENTION[0003]Unstructured content, for example, multimedia, does not fit well in conventional databases. Conventional databases perform admirably on structured content, but for unstructured content they lack the ability to insert and query via efficient indexes. This presents a problem.[0004]Unstructured content includes, among other things, text, multimedia and cutting-edge data types such as genomic sequences. Text covers documents, emails, blogs, etc. Multimedia encompasses images, music, voice, ...

Claims

the structure of the environmentally friendly knitted fabric provided by the present invention; figure 2 Flow chart of the yarn wrapping machine for environmentally friendly knitted fabrics and storage devices; image 3 Is the parameter map of the yarn covering machine

Login to View More

Application Information

Patent Timeline

Login to View More

Patent Type & Authority Patents(United States)

IPC IPC(8): G06N5/02

CPCG06F17/3002G06F17/3069G06F17/30625G06F17/30333G06F16/3347G06F16/322G06F16/41G06F16/2264

Inventor NAKANO, RUSSELL TOSHIO

Owner NAHAVA

Features

R&D
Intellectual Property
Life Sciences
Materials
Tech Scout

Why Patsnap Eureka

Unparalleled Data Quality
Higher Quality Content
60% Fewer Hallucinations

Social media

Patsnap Eureka Blog

Learn More

Browse by: Latest US Patents, China's latest patents, Technical Efficacy Thesaurus, Application Domain, Technology Topic, Popular Technical Reports.

Method and system for efficient indexed storage for unstructured content

AI Technical Summary This helps you quickly interpret patents by identifying the three key elements: Problems solved by technologyMethod usedBenefits of technology

Problems solved by technology

Method used

Image

Examples

example # 1

example # 2

PUM

Abstract

Description

Claims

Application Information

AI Technical Summary
This helps you quickly interpret patents by identifying the three key elements:
Problems solved by technology
Method used
Benefits of technology