Suffix array indexing method and device for real-time data stream

A suffix array data indexing technology, applied in the field of data indexing, that addresses the sensitivity of inverted index accuracy to word segmentation and speeds up response time.

Active Publication Date: 2021-11-30
SUN YAT SEN UNIV

AI Technical Summary

Problems solved by technology

[0005] To solve the problem that the accuracy of an inverted index is easily affected by word segmentation, and the problem of indexing heterogeneous data, the present invention provides a suffix array indexing method for real-time data streams and a device using the method. The method indexes heterogeneous data in real time without word segmentation and generates the index asynchronously to speed up response time.



Examples


Embodiment 1

[0066] In the method for suffix array indexing of real-time data streams described in this embodiment, the process of creating a suffix array index can be divided into two parts: data processing and storage, and generation of a suffix array index.

[0067] A. Data processing and storage. As shown in Figure 1 and Figure 2, each piece of source data corresponds to a document. A document contains multiple fields (called domains in the later embodiments), and the field is the unit of data storage. A field contains multiple segments, which are divided into temporary segments, dynamic segments, and persistent segments. Segments improve indexing efficiency: each segment is an independent suffix array index that maintains its own source data and index information. The data processing and storage process includes the following steps:

[0068] A101. The client submits an index request to the server through an HTTP request, records the name of the index library and other information through the request l...
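The document, field, and segment hierarchy described in [0067] can be sketched as a few Python data classes. This is only an illustrative model under stated assumptions: the class and attribute names (`Document`, `Field`, `Segment`, `SegmentKind`, `build`) are hypothetical and do not come from the patent, and the suffix array is built with a naive sort rather than whatever construction tool the invention uses.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Optional


class SegmentKind(Enum):
    # The three segment types named in [0067].
    TEMPORARY = "temporary"
    DYNAMIC = "dynamic"
    PERSISTENT = "persistent"


@dataclass
class Segment:
    """An independent suffix array index holding its own source data."""
    kind: SegmentKind
    source: bytes = b""
    suffix_array: Optional[List[int]] = None  # None until the index is built

    def build(self) -> None:
        # Naive construction for illustration only (sorts all suffixes).
        self.suffix_array = sorted(
            range(len(self.source)), key=lambda i: self.source[i:]
        )


@dataclass
class Field:
    """The data storage unit; it owns several segments."""
    name: str
    segments: List[Segment] = field(default_factory=list)


@dataclass
class Document:
    """One piece of source data, split into named field values."""
    values: Dict[str, bytes]


# Hypothetical usage: one document with one field, stored in a temporary segment.
doc = Document(values={"body": b"banana"})
body = Field(name="body")
segment = Segment(kind=SegmentKind.TEMPORARY, source=doc.values["body"])
segment.build()
body.segments.append(segment)
print(segment.suffix_array)  # [5, 3, 1, 0, 4, 2]
```

Keeping the source data and the suffix array together in each `Segment` mirrors the statement that every segment independently maintains its source data and index information.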

Embodiment 2

[0093] A suffix array indexing method for real-time data streams, using temporary segments to improve indexing efficiency.

[0094] Assume that the real-time data stream arrives at the server in three batches, from which source data A, source data B, and source data C are extracted in turn. Each piece of source data is 100MB. There are two ways to index the real-time data stream:

[0095] Implementation 1: Without the use of temporary segments

[0096] Because a suffix array can only be constructed for a complete segment at a time, splicing new data onto the end of old data and then indexing the result causes the suffix array index for the old data to be created repeatedly, as shown in Figure 4.

[0097] Time T1: Create a suffix array index for source data A (100MB);

[0098] Time T2: Source data B is spliced at the end of source data A, and a suffix array index is created f...
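The cost of repeated re-indexing can be made concrete with a small calculation. The sketch below counts how many megabytes are passed through suffix array construction in each implementation. It rests on two assumptions not spelled out in the text above: that splicing re-indexes the whole accumulated data at every step, and that the temporary-segment implementation indexes each batch once as its own segment. Any later cost of merging segments is ignored, and the function names are hypothetical.

```python
def mb_indexed_with_splicing(batch_sizes_mb):
    """Total MB indexed when each batch is appended and everything is re-indexed."""
    total = 0
    accumulated = 0
    for size in batch_sizes_mb:
        accumulated += size   # old data plus the newly spliced batch
        total += accumulated  # the whole spliced data is indexed again
    return total


def mb_indexed_with_temporary_segments(batch_sizes_mb):
    """Total MB indexed when each batch gets its own temporary segment."""
    return sum(batch_sizes_mb)


batches = [100, 100, 100]  # source data A, B, and C, 100MB each
print(mb_indexed_with_splicing(batches))            # 600
print(mb_indexed_with_temporary_segments(batches))  # 300
```

Under these assumptions the splicing approach grows quadratically with the number of batches, while per-batch segments grow linearly.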

Embodiment 3

[0108] A suffix array indexing method for real-time data streams, in which the suffix array index is composed of segment source data, a segment suffix array, and segment information. As shown in Figure 6, the suffix array index retrieval process includes the following steps:

[0109] C101. The client initiates a search request, specifying the name of the target index library, the domain to be retrieved, and the search content; if the target index library and the domain to be retrieved are not specified, it defaults to all index libraries and all domains;

[0110] C102. The server receives and parses the retrieval request, determines the target index library, and obtains the corresponding domain object according to the domain to be retrieved;

[0111] C103. Each domain object starts an independent thread to complete data retrieval, reads all segments of the domain to be retrieved (including temporary segments, dynamic segments and persistent segments), and retrieves each segment independ...
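Steps C101 to C103 can be sketched as follows: one worker per domain, each searching all of its segments independently with a binary search over the segment's suffix array. This is a minimal illustration, not the patented implementation. The function names, the `(source, suffix_array)` tuple layout, and the use of `ThreadPoolExecutor` are assumptions, and the naive suffix array construction is a stand-in for the real construction tool.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def build_suffix_array(data: bytes) -> List[int]:
    # Naive construction for illustration only.
    return sorted(range(len(data)), key=lambda i: data[i:])


def find_occurrences(data: bytes, sa: List[int], pattern: bytes) -> List[int]:
    """Return the sorted start offsets of pattern in data."""
    m = len(pattern)
    # Lower bound: first suffix whose m-byte prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    # Upper bound: first suffix whose m-byte prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


def search_domain(segments: List[Tuple[bytes, List[int]]], pattern: bytes):
    """Search every segment of one domain; return (segment_no, offset) hits."""
    hits = []
    for seg_no, (data, sa) in enumerate(segments):
        hits.extend((seg_no, off) for off in find_occurrences(data, sa, pattern))
    return hits


def search(domains: Dict[str, list], pattern: bytes) -> Dict[str, list]:
    """Run one independent worker per domain, as in step C103."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(search_domain, segs, pattern)
            for name, segs in domains.items()
        }
        return {name: fut.result() for name, fut in futures.items()}


# Hypothetical domain "title" with two segments.
domains = {
    "title": [(s, build_suffix_array(s)) for s in (b"banana", b"bandana")]
}
print(search(domains, b"ana"))  # {'title': [(0, 1), (0, 3), (1, 4)]}
```

Because each segment is searched on its own, a match that would straddle the boundary between two segments is not reported by this sketch.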


Abstract

The invention discloses a suffix array indexing method for real-time data streams. The method comprises the following steps: the server receives the real-time data stream, extracts the source data, and preprocesses it into a document; the document is parsed and distributed by domain, and each domain receives its source data and starts an independent thread for data indexing and storage; a domain is composed of multiple segments, and after receiving the source data the domain object writes it directly into a segment, sets the segment source data update signal, and then returns a response; once all domains of the document have returned a response, the response information is returned to the client; a suffix array construction tool listens for the segment source data update signal in the background, automatically constructs a suffix array for the segment source data, and generates the segment suffix array; the segment source data, segment suffix array, and segment information are combined into a complete suffix array index, at which point the source data has been indexed successfully. The invention can index heterogeneous data in real time without word segmentation and generates the index asynchronously to speed up response time. The invention is applicable to the field of data indexing.
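The asynchronous flow in the abstract (write the source data, set the update signal, respond immediately, build the suffix array in the background) can be sketched with a `threading.Event` acting as the update signal. This is only a sketch under stated assumptions: `AsyncSegment`, `builder_loop`, and the `indexed` event are hypothetical names, the suffix array is built naively, and a single in-memory segment stands in for the real storage.

```python
import threading
from typing import List, Optional


class AsyncSegment:
    """A segment whose suffix array is built after the write has returned."""

    def __init__(self) -> None:
        self.source = b""
        self.suffix_array: Optional[List[int]] = None
        self.updated = threading.Event()  # segment source data update signal
        self.indexed = threading.Event()  # set when the index matches the source
        self.lock = threading.Lock()

    def write(self, data: bytes) -> str:
        with self.lock:
            self.source += data
            self.indexed.clear()
            self.updated.set()
        return "ok"  # the response is returned before the index exists


def builder_loop(segment: AsyncSegment, stop: threading.Event) -> None:
    """Background worker: wait for the update signal, then rebuild the index."""
    while not stop.is_set():
        if not segment.updated.wait(timeout=0.05):
            continue
        with segment.lock:
            segment.updated.clear()
            snapshot = segment.source
        # Naive construction for illustration only, done outside the lock.
        sa = sorted(range(len(snapshot)), key=lambda i: snapshot[i:])
        with segment.lock:
            # Publish only if no newer write arrived during construction;
            # otherwise the signal is set again and the loop rebuilds.
            if snapshot == segment.source:
                segment.suffix_array = sa
                segment.indexed.set()


stop = threading.Event()
seg = AsyncSegment()
worker = threading.Thread(target=builder_loop, args=(seg, stop), daemon=True)
worker.start()

response = seg.write(b"banana")  # returns immediately
seg.indexed.wait(timeout=5)      # a reader could wait here for the index
stop.set()
worker.join()
print(response, seg.suffix_array)
```

Separating the write path from index construction is what lets the response time stay short even when suffix array construction is slow.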

Description

Technical field

[0001] The present invention relates to the field of data indexing, and more specifically, to a suffix array indexing method and device for real-time data streams.

Background art

[0002] With the development of informatization and the advent of the era of big data, the amount of data is growing explosively. To support rapid retrieval in a massive data environment, the design of the data index has become a crucial link.

[0003] In the field of data indexing, the inverted index has been widely used, but its indexing accuracy for non-natural language data is easily affected by word segmentation, and a 100% recall rate is difficult to guarantee. Unlike the inverted index, the suffix array index does not need to segment the data and can index heterogeneous data without distinction. It is suitable not only for natural language data such as text, but also for non-natural language data such as binary data, biological information, and n...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/2457, G06F16/2455
Inventor: 陈浩宇, 农革, 徐文涛
Owner: SUN YAT SEN UNIV