Method for quickly looking for feature character strings in text sequential data

A technique for feature character strings and text sequences, applied in electrical digital data processing, special data processing applications, instruments, etc., that addresses problems such as slow search speed and poor adaptability of data sequences.

Status: Inactive; Publication Date: 2016-06-08

CHANGSHU RES INSTITUE OF NANJING UNIV OF SCI & TECH


Problems solved by technology

[0004] The prior art has two problems: similarity search over text sequence data is too slow, and data sequences must match completely, which results in poor adaptability. To solve them, the present invention proposes a method for quickly searching for feature character strings.


Embodiment

[0022] Define the suffix array: given a text sequence $S = e_1 \ldots e_n$ and a set of mutually independent hash functions $H = \{h_1, \ldots, h_n\}$, let $h_i(S)$ denote the sequence of hash results, $h_i(S) = h_i(e_1) \ldots h_i(e_n)$. The suffix matrix of $S$ is $M_{S,m}$, whose rows are the suffix arrays of the $h_i(S)$. The suffix arrays can be formed in many ways, so many different suffix matrices can be generated.
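As an illustration of the definition in [0022], the following minimal Python sketch builds one suffix array per hash function and stacks them as rows. It is not the patented implementation: the names `make_hash`, `suffix_array`, and `suffix_matrix` are hypothetical, the seeded BLAKE2b hashes merely stand in for the unspecified independent hash functions, and the elements of the sequence are assumed to be whitespace-separated tokens.

```python
import hashlib


def make_hash(seed):
    """Return a deterministic hash function keyed by `seed` (assumed stand-in for h_i)."""
    def h(token):
        data = f"{seed}:{token}".encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
    return h


def suffix_array(seq):
    """Start positions of all suffixes of `seq`, sorted lexicographically.

    Naive construction that compares whole suffixes; fine for a small example.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def suffix_matrix(tokens, hash_funcs):
    """Row i is the suffix array of h_i(S) = h_i(e_1) ... h_i(e_n)."""
    rows = []
    for h in hash_funcs:
        hashed = [h(t) for t in tokens]
        rows.append(suffix_array(hashed))
    return rows


tokens = "the cat sat on the mat".split()
hash_funcs = [make_hash(seed) for seed in range(3)]
matrix = suffix_matrix(tokens, hash_funcs)
for row in matrix:
    print(row)
```

Each row is a permutation of the positions 0 to n-1; the rows differ because each hash function orders the elements differently.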

[0023] Search in the suffix array: the search has two steps. First, potentially similar segments are found in the suffix matrix; then they are filtered directly by similarity.
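The filtering step could be sketched as follows. The text does not name the similarity measure, so Jaccard similarity over token sets is assumed here purely for illustration; `jaccard`, `filter_candidates`, and `threshold` are hypothetical names, and the candidate start positions are taken as given.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sequences, treated as sets (assumed measure)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def filter_candidates(tokens, query, starts, threshold):
    """Keep candidate segments whose similarity to the query reaches `threshold`.

    Each candidate is the window of len(query) tokens beginning at a start position.
    Results are sorted with the most similar segment first.
    """
    kept = []
    for start in starts:
        segment = tokens[start:start + len(query)]
        score = jaccard(segment, query)
        if score >= threshold:
            kept.append((start, score))
    return sorted(kept, key=lambda item: item[1], reverse=True)


tokens = "the cat sat on the mat and the dog sat on the rug".split()
query = "the dog sat on".split()
print(filter_candidates(tokens, query, starts=[0, 4, 7], threshold=0.5))
# prints [(7, 1.0), (0, 0.6)]
```

The segment at position 0 differs from the query in one token and is still kept, which is the point of filtering by similarity rather than requiring a complete match.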

[0024] Given a set of mutually independent hash functions and a query sequence, generate a suffix matrix. The query is then resolved by binary search, with each row of the suffix matrix searched in turn. If a field appears a specified number of times in the set of binary search results, the field is considered a candidate field.
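A minimal sketch of the row-by-row binary search and counting described in [0024] follows. It makes several assumptions that the text does not confirm: each row is searched for suffixes that begin with the hashed query, a "field" is identified by its start position, and `min_votes` (a hypothetical parameter) plays the role of the specified number of times. The helper names are also hypothetical, and the rows are rebuilt on every call only to keep the example short.

```python
import hashlib
from collections import Counter


def make_hash(seed):
    """Deterministic hash keyed by `seed` (assumed stand-in for h_i)."""
    def h(token):
        data = f"{seed}:{token}".encode("utf-8")
        return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")
    return h


def suffix_array(seq):
    """Naive suffix array: suffix start positions in lexicographic order."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def prefix_range(seq, sa, pattern):
    """Binary search for the half-open range [lo, hi) of `sa` whose suffixes start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-element prefix is >= pattern
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-element prefix is > pattern
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def find_candidates(tokens, query, hash_funcs, min_votes):
    """Return start positions reported by at least `min_votes` rows."""
    votes = Counter()
    for h in hash_funcs:
        hashed = [h(t) for t in tokens]
        row = suffix_array(hashed)  # one row of the suffix matrix
        pattern = [h(t) for t in query]
        lo, hi = prefix_range(hashed, row, pattern)
        votes.update(row[lo:hi])
    return sorted(pos for pos, n in votes.items() if n >= min_votes)


tokens = "the cat sat on the mat and the cat ran".split()
hash_funcs = [make_hash(seed) for seed in range(4)]
print(find_candidates(tokens, ["the", "cat"], hash_funcs, min_votes=3))
```

With these collision-resistant hashes every row reports the same positions, so the count only guards against accidental collisions; how the rows are made to disagree on near matches depends on the choice of hash functions, which this excerpt does not detail.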

[0025] The following program sequence shows the process of...


Abstract

The invention discloses a method for quickly looking for feature character strings in text sequential data. The method comprises the following steps: (1) acquiring a text sequence, namely a character string, from the information; (2) generating a suffix array; and (3) searching in the suffix array, resolving the query by binary search. In the third step, each row of the suffix matrix is searched in turn; if a field occurs a designated number of times in the set of binary search results, the similarity of the two fields is calculated, and the field with the closest similarity is the candidate field. The sequential nature of the original data is effectively utilized, so the complicated data analysis and slow speed caused by the LSH algorithm's limitation to unordered data are overcome. In addition, after the fuzzy search, deletion and selection can be conducted directly, and candidates can be filtered directly by similarity calculation, which overcomes the requirement of similarity search algorithms that a sub-sequence be fully matched.

Description

Technical field

[0001] The invention relates to a method for quickly searching for character strings, in particular for finding continuous or discontinuous similar texts in a large amount of data.

Background technique

[0002] Sequence data is now common in real-life applications, including bioinformatics, system security, and network connectivity, and similarity search is a basic technique in sequence data management. Many effective methods exist for symbolic sequences and time series data, such as DNA sequences, stocks, network data packets, and video streams. Text search currently falls into two main categories. One is locality-sensitive hashing with min-hash (Locality-Sensitive Hashing with Min-Hash, abbreviated below as LSH); the other is similar-segment search based on hash indexes, suffix trees, and suffix arrays. Both have limitations on text sequence data. The LSH algorithm is l...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/30

Inventors: 李涛, 张晟骁, 李千目, 侯君, 徐建

Owner: CHANGSHU RES INSTITUE OF NANJING UNIV OF SCI & TECH
