A suffix array indexing method and apparatus for real-time data stream

A suffix array indexing technique for the field of data indexing. It addresses the sensitivity of inverted-index accuracy to word segmentation and shortens response time.

Active Publication Date: 2019-02-01
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

[0005] To address the problems that inverted-index accuracy is easily affected by word segmentation and that heterogeneous data is difficult to index, the present invention provides a suffix array indexing method for real-time data streams and a device that uses this method. The method indexes heterogeneous data in real time without word segmentation and generates the index asynchronously to shorten response time.

Examples

Embodiment 1

[0066] In the method for suffix array indexing of real-time data streams described in this embodiment, the process of creating a suffix array index can be divided into two parts: data processing and storage, and generation of a suffix array index.

[0067] A. Data processing and storage. As shown in Figure 1 and Figure 2, each piece of source data corresponds to a document. A document contains multiple fields, and the field is the unit of data storage. A field contains multiple segments, which are divided into temporary segments, dynamic segments, and persistent segments; segmenting improves indexing efficiency. Each segment is an independent suffix array index that maintains its own source data and index information. The data processing and storage process includes the following steps:

[0068] A101. The client submits an index request to the server through an HTTP request, records the name of the index library and other information through the request l...
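The document, field, and segment hierarchy described in [0067] can be pictured with a small data model. This is only an illustrative sketch: the class names `SegmentKind`, `Segment`, `IndexField`, and `Document`, and the helper `document_from_source`, are hypothetical and do not come from the patent. Placing newly arrived data in a temporary segment is likewise an assumption made for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class SegmentKind(Enum):
    TEMPORARY = "temporary"
    DYNAMIC = "dynamic"
    PERSISTENT = "persistent"


@dataclass
class Segment:
    """One independent suffix array index with its own source data."""
    kind: SegmentKind
    source: bytes = b""
    suffix_array: list[int] = field(default_factory=list)


@dataclass
class IndexField:
    """Unit of data storage; a field holds several segments."""
    name: str
    segments: list[Segment] = field(default_factory=list)


@dataclass
class Document:
    """One piece of source data, split into named fields."""
    fields: dict[str, IndexField] = field(default_factory=dict)


def document_from_source(record: dict[str, bytes]) -> Document:
    """Hypothetical helper: wrap each field's payload in a temporary segment."""
    doc = Document()
    for name, payload in record.items():
        segment = Segment(SegmentKind.TEMPORARY, payload)
        doc.fields[name] = IndexField(name, [segment])
    return doc


doc = document_from_source({"title": b"suffix arrays", "body": b"real-time streams"})
```

Because every segment carries its own source data and suffix array, a segment can be built, searched, or replaced without touching its siblings.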

Embodiment 2

[0093] A suffix array indexing method for real-time data streams, using temporary segments to improve indexing efficiency.

[0094] Assume that the real-time data stream arrives at the server in three batches, from which source data A, source data B, and source data C are extracted in turn. Each piece of source data is 100MB. The real-time data stream can be indexed in two ways:

[0095] Implementation 1: Without the use of temporary segments

[0096] Because a suffix array can only be constructed for a complete segment at a time, splicing new data onto the end of the old data and then indexing causes the suffix array index for the old data to be rebuilt repeatedly, as shown in Figure 4.

[0097] Time T1: Create a suffix array index for source data A (100MB);

[0098] Time T2: Source data B is spliced at the end of source data A, and a suffix array index is created f...
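The cost of repeated rebuilding can be estimated with a little arithmetic. The sketch below counts how many megabytes are handed to suffix array construction in total. It assumes that, without temporary segments, every rebuild covers all data received so far, and that with one segment per batch each batch is indexed exactly once. Both function names are hypothetical, and the second scenario is an assumption about how temporary segments behave, not a statement taken from the excerpt.

```python
def mb_indexed_with_rebuild(batch_sizes_mb):
    """Total MB indexed when each new batch is appended to the old data
    and the whole accumulated segment is indexed again."""
    total = 0
    accumulated = 0
    for size in batch_sizes_mb:
        accumulated += size   # old data plus the new batch
        total += accumulated  # the entire accumulated segment is re-indexed
    return total


def mb_indexed_once_per_batch(batch_sizes_mb):
    """Total MB indexed when each batch is indexed only once (assumed
    behaviour when every batch lands in its own segment)."""
    return sum(batch_sizes_mb)


batches = [100, 100, 100]  # source data A, B, C
print(mb_indexed_with_rebuild(batches))    # 600
print(mb_indexed_once_per_batch(batches))  # 300
```

Under these assumptions the rebuild approach processes 100 + 200 + 300 = 600MB for 300MB of input, and the gap grows quadratically with the number of batches.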

Embodiment 3

[0108] A suffix array indexing method for real-time data streams, in which the suffix array index is composed of segment source data, a segment suffix array, and segment information. As shown in Figure 6, the suffix array index retrieval process includes the following steps:

[0109] C101. The client initiates a search request, specifying the name of the target index library, the domain to be retrieved, and the search content; if the target index library and the domain to be retrieved are not specified, it defaults to all index libraries and all domains;

[0110] C102. The server receives and parses the retrieval request, determines the target index library, and obtains the corresponding domain object according to the domain to be retrieved;

[0111] C103. Each domain object starts an independent thread to complete data retrieval, reads all segments of the domain to be retrieved (including temporary segments, dynamic segments and persistent segments), and retrieves each segment independ...
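One way to picture per-domain threads and independent per-segment retrieval is sketched below. It is not the patent's implementation: the excerpt does not say how a segment is searched, so the example assumes the standard binary search over a suffix array, builds the array with a naive sort, and represents a segment as a `(source, suffix_array)` tuple. All function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def build_suffix_array(data: bytes) -> list[int]:
    # Naive construction by sorting suffixes; for illustration only.
    return sorted(range(len(data)), key=lambda i: data[i:])


def search_segment(data: bytes, sa: list[int], pattern: bytes) -> list[int]:
    """Return the sorted start offsets of `pattern` within one segment."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-byte prefix is >= pattern
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-byte prefix is > pattern
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def search_domain(segments, pattern):
    """Search every segment of one domain; hits are (segment_index, offset)."""
    hits = []
    for seg_id, (data, sa) in enumerate(segments):
        hits.extend((seg_id, off) for off in search_segment(data, sa, pattern))
    return hits


def search(domains: dict, pattern: bytes) -> dict:
    """Run one worker thread per domain and collect hits by domain name."""
    with ThreadPoolExecutor(max_workers=max(1, len(domains))) as pool:
        futures = {name: pool.submit(search_domain, segs, pattern)
                   for name, segs in domains.items()}
        return {name: fut.result() for name, fut in futures.items()}


def make_segment(data: bytes):
    return (data, build_suffix_array(data))


domains = {"body": [make_segment(b"banana"), make_segment(b"bandana")]}
print(search(domains, b"ana"))  # {'body': [(0, 1), (0, 3), (1, 4)]}
```

Because each segment is searched on its own, a match that straddles two segments would not be found by this sketch; how the method treats segment boundaries is not visible in the excerpt.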

Abstract

The invention discloses a suffix array indexing method for a real-time data stream. The method comprises the following steps: a server receives the real-time data stream, extracts source data, and preprocesses the source data into documents; each document is parsed and distributed by domain, and each domain receives its source data and starts an independent thread to index and store it; a domain consists of a plurality of segments, and after receiving the source data, the domain object writes the source data directly into the segments, sets the segment source data update signal, and returns a response; once all domains of the document have returned a response, the response information is returned to the client; a suffix array construction tool listens for the segment source data update signal in the background, automatically constructs the suffix array for the segment source data, and generates the segment suffix array; the segment source data, the segment suffix array, and the segment information are linked into a complete suffix array index, at which point the source data has been indexed successfully. The invention can index heterogeneous data in real time without word segmentation and generates the index asynchronously to shorten response time. The invention is suitable for the field of data indexing.
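The asynchronous flow in the abstract (write the source data, set an update signal, respond at once, and let a background tool build the suffix array) might look roughly like the following. This is a minimal sketch under stated assumptions: `threading.Event` stands in for the segment source data update signal, the suffix array is rebuilt from scratch with a naive sort, and the names `Segment`, `write`, `indexed_len`, and `builder_loop` are hypothetical.

```python
import threading


class Segment:
    """Holds source data, a suffix array, and an update signal."""

    def __init__(self):
        self.source = b""
        self.suffix_array = []
        self.indexed_len = 0              # bytes covered by suffix_array
        self.updated = threading.Event()  # stands in for the update signal
        self.lock = threading.Lock()

    def write(self, data: bytes) -> str:
        """Store the data, raise the signal, and respond without waiting."""
        with self.lock:
            self.source += data
        self.updated.set()
        return "ok"


def builder_loop(segment: Segment, stop: threading.Event) -> None:
    """Background construction tool: rebuild the suffix array on each signal."""
    while not stop.is_set():
        if not segment.updated.wait(timeout=0.05):
            continue  # no update yet; check the stop flag again
        segment.updated.clear()
        with segment.lock:
            snapshot = segment.source
        # Naive build outside the lock so that writers are not blocked.
        sa = sorted(range(len(snapshot)), key=lambda i: snapshot[i:])
        with segment.lock:
            segment.suffix_array = sa
            segment.indexed_len = len(snapshot)


seg, stop = Segment(), threading.Event()
threading.Thread(target=builder_loop, args=(seg, stop), daemon=True).start()
print(seg.write(b"banana"))  # responds immediately; the index follows shortly after
```

The write path only appends and sets a flag, so the response time does not depend on how long construction takes. The trade-off in this sketch is that a search issued right after a write may see a suffix array that does not yet cover the newest bytes, which is why `indexed_len` is tracked.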

Description

Technical field

[0001] The present invention relates to the field of data indexing, and more specifically, to a real-time data stream suffix array indexing method and device.

Background art

[0002] With the development of informatization and the advent of the era of big data, the amount of data is growing explosively. In order to support the rapid retrieval of data in a massive data environment, the design of data index has become a crucial link.

[0003] In the field of data indexing, inverted indexing has been widely used, but its indexing accuracy for non-natural language data is easily affected by word segmentation, and it is difficult to guarantee a 100% recall rate. Unlike the inverted index, the suffix array index does not need to segment the data, and can create an index for heterogeneous data without distinction. It is not only suitable for natural language data such as text, but also for non-natural language such as binary data, biological information, and n...

Application Information

IPC(8): G06F16/2457, G06F16/2455
Inventor 陈浩宇, 农革, 徐文涛
Owner SUN YAT SEN UNIV