Structured data deduplication method and device based on MapDB, equipment and medium

A technique involving structured data and serialization, applied in the field of data processing, that addresses problems such as data integration job exceptions, data integration failures, and system downtime, and thereby supports data integration.

Pending Publication Date: 2020-09-11
ENJOYOR COMPANY LIMITED

AI Technical Summary

Problems solved by technology

In the existing technology, Kettle is generally used to deduplicate data in memory, but when the amount of data exceeds the memory capacity, the data integration job throws an exception and the integration fails. The alternative, deduplicating at the data source, puts heavy pressure on the source database; when the source data is relatively large, there is a risk of system downtime.



Examples


Embodiment 1

[0041] This embodiment provides a method for deduplicating structured data based on MapDB. Using the MapDB database, the data to be processed is cached on disk, and memory mapping is used to locate the corresponding address on disk directly from the memory-mapped address. This preserves read speed while removing the memory-capacity limit, and avoids data integration failure when the amount of data exceeds memory. In addition, when the data to be processed is cached, a secondary-storage mechanism is adopted: based on the principle of memory-mapped files, the linear space on disk is mapped through a two-level index, and an operating-system temporary file serves as the physical storage medium of the two-level index. This not only improves the efficiency of data access and processing, but also triggers the recycling mechanism to process temporary files when the program is abnorm...
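The memory-mapping idea in this paragraph can be illustrated with a minimal sketch. It uses Python's standard `mmap` and `tempfile` modules rather than MapDB itself, and the fixed record width `RECORD_SIZE`, the helper names `write_record` and `read_record`, and the sample payloads are all assumptions made for illustration; the patent does not specify a record layout.

```python
import mmap
import tempfile

RECORD_SIZE = 64  # hypothetical fixed record width in bytes (not from the patent)


def write_record(mm, pointer, payload):
    """Store payload in the slot addressed by pointer inside the mapped file."""
    if len(payload) > RECORD_SIZE:
        raise ValueError("payload larger than one record slot")
    offset = pointer * RECORD_SIZE
    # Slice assignment on an mmap needs an exact-length value, so pad with NULs.
    mm[offset:offset + RECORD_SIZE] = payload.ljust(RECORD_SIZE, b"\x00")


def read_record(mm, pointer):
    """Read the slot addressed by pointer; the offset is computed, not searched."""
    offset = pointer * RECORD_SIZE
    return bytes(mm[offset:offset + RECORD_SIZE]).rstrip(b"\x00")


def demo():
    # An operating-system temporary file is the physical storage medium.
    # Python deletes it when the file object is closed.
    with tempfile.TemporaryFile() as backing_file:
        backing_file.truncate(RECORD_SIZE * 16)  # reserve a linear space of 16 slots
        with mmap.mmap(backing_file.fileno(), 0) as mm:
            write_record(mm, 0, b'{"id": 1, "name": "Alice"}')
            write_record(mm, 7, b'{"id": 8, "name": "Bob"}')
            return read_record(mm, 0), read_record(mm, 7)
```

Because each pointer maps to a computed byte offset, a lookup touches only the page that holds the record, which is what lets the data set exceed available memory in this kind of design.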

Embodiment 2

[0075] This embodiment corresponds to the MapDB-based structured data deduplication method of Embodiment 1 and discloses a MapDB-based structured data deduplication device, which is the virtual device structure of Embodiment 1. As shown in Figure 4, the device includes:

[0076] A data acquisition module 410, configured to acquire messages and deduplication conditions of the messages;

[0077] A data deduplication module 420, configured to generate a first Key value according to the deduplication condition and traverse the primary index with it. If the first Key value is found in the Key-Value records stored in the primary index, the message corresponding to the first Key value is deduplicated against the message; otherwise, the global pointer is incremented, the incremented global pointer is used as the first Value associated with the first Key value, and the first Key value and the associated first Value are stored in the primary ind...
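The text does not say how the first Key value is derived from the deduplication condition. One possible reading, sketched below, treats the deduplication condition as a list of field names and hashes the selected field values. The function name `make_key`, the field-list representation, and the use of SHA-256 are assumptions for illustration only.

```python
import hashlib
import json


def make_key(message, dedup_fields):
    """Derive a Key from the fields named by a (hypothetical) deduplication condition.

    message      -- one structured record, represented here as a dict
    dedup_fields -- field names that define when two records count as duplicates
    """
    # Pair each field name with its value so the field order is part of the key.
    selected = [[field, message.get(field)] for field in dedup_fields]
    canonical = json.dumps(selected, ensure_ascii=False, default=str)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this reading, two messages that agree on every field in the deduplication condition produce the same Key even when their other fields differ, so the primary index lookup only needs the Key and never the full message.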

Embodiment 3

[0080] Figure 5 is a schematic structural diagram of an electronic device provided by Embodiment 3 of the present invention. As shown in Figure 5, the electronic device includes a processor 510, a memory 520, an input device 530, and an output device 540. The electronic device may have one or more processors 510; Figure 5 takes one processor 510 as an example. The processor 510, memory 520, input device 530, and output device 540 may be connected by a bus or by other means; Figure 5 takes connection via a bus as an example.

[0081] The memory 520, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the MapDB-based structured data deduplication method in the embodiment of the present invention (for example, the data acquisition module 410 and the data deduplication module 420 in the structured data deduplicati...


Abstract

The invention discloses a MapDB-based structured data deduplication method, relating to the technical field of data processing, which improves data deduplication efficiency. The method comprises the following steps: acquiring a message and a deduplication condition for the message; generating a Key value according to the deduplication condition and traversing the primary index; if the Key value is found among the records stored in the primary index, performing deduplication on the message corresponding to the Key value and the acquired message; otherwise, incrementing the global pointer, taking the incremented global pointer as the Value associated with the Key value, storing the Key value and the associated Value in the primary index, and storing the Value and the message in secondary storage in Key-Value form. The invention further discloses a MapDB-based structured data deduplication device, an electronic device, and a computer storage medium.
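The steps in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: plain Python dictionaries stand in for the disk-backed MapDB maps, the class name `TwoLevelDeduplicator` and its methods are hypothetical, and the `keep_latest` switch is an assumed policy, since the abstract does not say which of two duplicate messages is retained.

```python
class TwoLevelDeduplicator:
    """In-memory stand-in for the primary index and secondary storage."""

    def __init__(self):
        self.primary_index = {}      # Key -> Value (the pointer)
        self.secondary_storage = {}  # Value (the pointer) -> message
        self.global_pointer = 0

    def offer(self, key, message, keep_latest=False):
        """Process one message; return True if it was new, False if a duplicate."""
        pointer = self.primary_index.get(key)
        if pointer is not None:
            # Key already recorded: deduplicate against the stored message.
            # Assumed policy: keep the first message unless keep_latest is set.
            if keep_latest:
                self.secondary_storage[pointer] = message
            return False
        # Key not found: increment the global pointer and use it as the Value.
        self.global_pointer += 1
        pointer = self.global_pointer
        self.primary_index[key] = pointer
        self.secondary_storage[pointer] = message
        return True

    def results(self):
        """Return the surviving messages in pointer (first-seen) order."""
        return [self.secondary_storage[p] for p in sorted(self.secondary_storage)]
```

Keeping only the Key and a small integer pointer in the primary index keeps that index compact, while the full messages live in the secondary storage addressed by the pointer.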

Description

Technical field

[0001] The present invention relates to the technical field of data processing, in particular to a MapDB-based method, device, equipment, and medium for deduplicating structured data.

Background

[0002] With the advent of the digital economy, digital business has gradually become a focus, and many industries have digitized their business. However, business digitization has produced a large number of data islands, which has become a common pain point for the continued development of digital business. All industries urgently need data integration to connect and avoid data islands, consolidate data resources, and effectively exploit the associated value between data.

[0003] Integrating heterogeneous data sources is a problem that data integration often faces. In the process of integrating data resources, data deduplication is a common processing step. In the existing technology, Kettle technology is generally used to deduplicate data based on...


Application Information

Patent Type & Authority: Applications (China)
IPC (8): G06F16/215, G06F16/2455, G06F16/22
CPC: G06F16/215, G06F16/24562, G06F16/2272
Inventors: 王超群, 李建元, 刘飞黄, 于德军, 王丰
Owner: ENJOYOR COMPANY LIMITED