
Method for quickly looking for feature character strings in text sequential data

A technology for finding characteristic character strings in text sequences, applied in the fields of electrical digital data processing, special data processing applications, and instruments. It addresses the slow search speed and poor adaptability of existing methods for data sequences.

Inactive Publication Date: 2016-06-08
CHANGSHU RES INSTITUE OF NANJING UNIV OF SCI & TECH

AI Technical Summary

Problems solved by technology

[0004] In the prior art, similarity search over text sequence data is too slow, and the data sequence must match completely, which results in poor adaptability. To solve these problems, the present invention proposes a method for quickly searching for characteristic strings.



Examples


Embodiment

[0022] Define the suffix array: Given a text sequence S = e_1 … e_n and a set of mutually independent hash functions H = {h_1, …, h_n}, let h_i(S) denote the sequence of hash results, h_i(S) = h_i(e_1) … h_i(e_n). The suffix matrix of S, M_{s,m}, is formed from the suffix arrays of the hashed sequences h_i(S). Because a suffix array can be formed in many ways, many different suffix matrices can be generated.
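The definition in [0022] can be illustrated with a minimal Python sketch. It assumes that each row of the suffix matrix is the suffix array of one hashed sequence h_i(S); the text does not give the matrix layout explicitly, so this is an interpretation. The hash family in `make_hash_functions`, the `buckets` parameter, and all function names are hypothetical choices made for illustration, and the suffix array is built naively rather than with an efficient construction algorithm.

```python
import random
import zlib

P = (1 << 61) - 1  # large prime used by the illustrative hash family (assumption)


def make_hash_functions(m, buckets=1 << 16, seed=0):
    """Return m hash functions h_i(e) = ((a*x + b) mod P) mod buckets (hypothetical family)."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(m)]

    def make(a, b):
        # crc32 gives a stable integer for any element, independent of PYTHONHASHSEED
        return lambda e: ((a * zlib.crc32(str(e).encode("utf-8")) + b) % P) % buckets

    return [make(a, b) for a, b in coeffs]


def hash_sequence(S, h):
    """h(S) = h(e_1) ... h(e_n): apply one hash function to every element of S."""
    return [h(e) for e in S]


def suffix_array(seq):
    """Start indices of all suffixes of seq, in lexicographic order (naive construction)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def suffix_matrix(S, hash_functions):
    """One row per hash function: the suffix array of the hashed sequence h_i(S)."""
    return [suffix_array(hash_sequence(S, h)) for h in hash_functions]


if __name__ == "__main__":
    hs = make_hash_functions(3)
    for row in suffix_matrix("banana", hs):
        print(row)
```

Each row is a permutation of the positions 0 … n-1, ordered by the hashed suffix that starts at that position. Different hash functions generally give different orderings, which is one way the rows of the matrix can differ.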

[0023] Search in the suffix array: the search has two steps. First, potentially similar segments are found in the suffix matrix; then the candidates are filtered directly by similarity.
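The second step, filtering by similarity, might look like the following Python sketch. The text does not say which similarity measure is used, so `difflib.SequenceMatcher.ratio()` from the standard library stands in here as an assumption. The function name `filter_by_similarity`, the `threshold` parameter, and the idea of comparing the query with a segment of equal length are also hypothetical. The candidate positions are taken as given, as if produced by the first step.

```python
from difflib import SequenceMatcher


def filter_by_similarity(text, query, candidate_positions, threshold):
    """Keep candidates whose segment is similar enough to the query.

    Returns (position, score) pairs sorted from most to least similar.
    The similarity measure is an assumption, not taken from the patent text.
    """
    kept = []
    for pos in candidate_positions:
        segment = text[pos:pos + len(query)]
        score = SequenceMatcher(None, segment, query).ratio()
        if score >= threshold:
            kept.append((pos, score))
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"
    # Positions 0 and 10 play the role of candidates found in the suffix matrix.
    print(filter_by_similarity(text, "brown", [0, 10], threshold=0.8))
```

Because the filter scores each candidate rather than requiring equality, a segment that differs from the query in a few elements can still be kept, provided its score clears the threshold.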

[0024] Given a set of mutually independent hash functions and a query sequence, generate the suffix matrix. Then search each row of the suffix matrix in turn by binary search. If a field appears a specified number of times in the set of binary search results, the field is considered a candidate field.
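The candidate search in [0024] can be sketched in Python as below, under several assumptions that the text does not confirm: each row holds the suffix array of one hashed copy of the data, the query is hashed with the same function as the row being searched, and a position counts as a candidate when it is found in at least `min_rows` rows. The hash family, the small `buckets` value (which makes each row a coarse view of the data), and all names are hypothetical.

```python
import random
from collections import Counter

P = (1 << 61) - 1  # large prime for the illustrative hash family (assumption)


def make_hashes(m, buckets, seed=0):
    """m hypothetical hash functions over single characters."""
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(m)]
    return [lambda e, a=a, b=b: ((a * ord(e) + b) % P) % buckets for a, b in coeffs]


def suffix_array(seq):
    """Naive suffix array: suffix start indices in lexicographic order."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def build_rows(text, hashes):
    """For each hash function, keep the hashed text and its suffix array."""
    rows = []
    for h in hashes:
        hashed = [h(e) for e in text]
        rows.append((hashed, suffix_array(hashed)))
    return rows


def _bound(hashed, sa, pattern, upper):
    """Binary search for the first suffix whose prefix is >= pattern (or > if upper)."""
    lo, hi = 0, len(sa)
    k = len(pattern)
    while lo < hi:
        mid = (lo + hi) // 2
        prefix = hashed[sa[mid]:sa[mid] + k]
        if prefix < pattern or (upper and prefix == pattern):
            lo = mid + 1
        else:
            hi = mid
    return lo


def find_candidates(rows, hashes, query, min_rows):
    """Positions matched by the hashed query in at least min_rows rows."""
    counts = Counter()
    for (hashed, sa), h in zip(rows, hashes):
        pattern = [h(e) for e in query]
        lo = _bound(hashed, sa, pattern, upper=False)
        hi = _bound(hashed, sa, pattern, upper=True)
        counts.update(sa[lo:hi])  # each position appears at most once per row
    return sorted(pos for pos, c in counts.items() if c >= min_rows)


if __name__ == "__main__":
    text = "the quick brown fox jumps over the lazy dog"
    hashes = make_hashes(m=4, buckets=8)
    rows = build_rows(text, hashes)
    print(find_candidates(rows, hashes, "brown", min_rows=3))
```

An exact occurrence of the query matches in every row. With a small number of buckets, a segment that differs from the query in a few characters can still match in the rows where the differing characters collide, so lowering `min_rows` admits approximate matches at the cost of more false candidates.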

[0025] The following program sequence shows the process of...


Abstract

The invention discloses a method for quickly searching for feature character strings in text sequence data. The method comprises the following steps: (1) acquiring a text sequence, namely a character string, from the information; (2) generating a suffix array; and (3) searching in the suffix array, resolving by binary search. In the third step, each row of the suffix matrix is searched in turn; if a field occurs a designated number of times in the set of binary search results, the similarity of the two fields is calculated, and the field with the closest similarity is the candidate field. The method makes effective use of the fact that the original data is in sequence, and so overcomes the complicated analysis and slow speed that result from the LSH algorithm being limited to unordered data. In addition, after the fuzzy check, deletion and selection can be performed directly, and candidates can be filtered directly by similarity calculation, which overcomes the requirement of similarity search algorithms that a sub-sequence be fully matched.

Description

Technical field

[0001] The invention relates to a method for quickly searching for character strings, especially for searching for continuous or discontinuous similar texts in a large amount of data.

Background technique

[0002] Sequence data is now common in real-life applications, including bioinformatics, system security, and network connectivity, and similarity search is a basic technique in sequence data management. There are many effective methods for symbolic sequence and time series data, such as DNA sequences, stocks, network data packets and video streams. For text search, current methods fall mainly into two categories. One is locality-sensitive hashing with min-hash (Locality-Sensitive Hashing with Min-Hash, abbreviated as LSH below), and the other is similar-segment search based on hash indexes, suffix trees and suffix tree columns. Both have limitations for text sequence data. The LSH algorithm is l...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 李涛, 张晟骁, 李千目, 侯君, 徐建
Owner: CHANGSHU RES INSTITUE OF NANJING UNIV OF SCI & TECH