Cloud storage similar data detection method and system based on meta-semantic embedding

A data detection and cloud storage technology, applied in file metadata retrieval, digital data information retrieval, file systems, and related fields. It addresses the problems of unstable feature value extraction, heavy computation, and low detection efficiency, with the stated effects of a better user experience, improved accuracy, and reduced computational overhead.

Pending Publication Date: 2022-06-14
NANHUA UNIV

AI Technical Summary

Problems solved by technology

[0004] One of the main problems solved by the present invention is that the existing cloud storage deduplicat



Examples


Embodiment 1

[0041] In Embodiment 1, the similar data detection and deduplication method consists of feature extraction and similarity search, followed by differential compression once the similarity search is complete.
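The differential compression step can be illustrated with a minimal sketch. The source does not name a delta encoder, so this example substitutes zlib's preset-dictionary feature as a stand-in: the similar block found by the search acts as the dictionary for compressing the new block. The function names `delta_compress` and `delta_decompress` are hypothetical.

```python
import zlib


def delta_compress(base: bytes, target: bytes) -> bytes:
    """Compress `target` using the similar block `base` as a preset dictionary.

    Assumption: zlib with a preset dictionary stands in for the delta
    encoder, which the source does not specify.
    """
    comp = zlib.compressobj(level=9, zdict=base)
    return comp.compress(target) + comp.flush()


def delta_decompress(base: bytes, delta: bytes) -> bytes:
    """Rebuild the target block from the same base block and the stored delta."""
    decomp = zlib.decompressobj(zdict=base)
    return decomp.decompress(delta) + decomp.flush()
```

Only the delta and a reference to the base block would need to be stored. The closer the two blocks are, the more of the new block can be expressed as back-references into the base.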

[0042] To simplify processing and improve the effect of semantic embedding, the method uses a neural network and exploits the contextual semantic information of each data block in both stages, feature extraction and similarity search, to improve similar data detection.

[0043] The specific workflow is as follows:

[0044] Generation of meta-semantic model:

[0045] (1) All data in the data domain stored on the server is partitioned with CDC (content-defined chunking, a technique that places block boundaries according to the data content).

[0046] (2) Generate a feature vector for the CDC block obtained in step (1):

[0047] Step 1: divide the data block into K blocks of fixed size;

[0048] Ste...
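The chunking in step (1) and the split in Step 1 can be sketched as follows. This is a minimal illustration, not the patented procedure: the gear-style rolling hash, the size limits, and the names `cdc_chunks` and `split_into_k` are all assumptions, and "K blocks of fixed size" is read here as K contiguous pieces of equal length, with the last piece possibly shorter.

```python
import hashlib
from typing import List

# Assumption: a 256-entry table of pseudo-random 32-bit values for a
# gear-style rolling hash; the source does not specify the hash.
GEAR = [int.from_bytes(hashlib.sha256(bytes([i])).digest()[:4], "big")
        for i in range(256)]


def cdc_chunks(data: bytes, min_size: int = 64, avg_bits: int = 8,
               max_size: int = 1024) -> List[bytes]:
    """Cut `data` where the top `avg_bits` bits of the rolling hash are zero."""
    chunks = []
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # The 32-bit hash depends only on the most recent 32 bytes.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= min_size and (h >> (32 - avg_bits)) == 0
        if at_boundary or length >= max_size:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder, may be short
    return chunks


def split_into_k(block: bytes, k: int) -> List[bytes]:
    """Split one CDC block into k contiguous pieces of equal length.

    The last pieces may be shorter (or empty) when len(block) is not a
    multiple of k.
    """
    size = -(-len(block) // k)  # ceiling division
    return [block[j * size:(j + 1) * size] for j in range(k)]
```

Because boundaries depend on content rather than on offsets, inserting bytes near the start of a file tends to change only the nearby chunks, which is the property that makes CDC attractive for deduplication.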


Abstract

The invention provides a cloud storage similar data detection method and system based on meta-semantic embedding. The method comprises the following steps: performing CDC partitioning on all data in a cloud storage data domain; extracting feature vectors of all the CDC blocks with a MinHash algorithm; processing the context feature vectors of any CDC block with a Mask algorithm, and feeding all the processed context feature vectors into a neural network model for training to obtain a meta-semantic model of the cloud storage data domain; extracting semantic feature vectors of new data uploaded to the cloud storage data domain; and feeding the semantic feature vectors of the new data into a new neural network model initialized from the meta-semantic model for similarity detection. The method embeds full-text semantics through meta-semantic embedding, which makes data feature extraction more reliable and avoids repeated training of the neural network, thereby reducing computational overhead.
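The MinHash feature extraction named in the abstract can be sketched as below. This is a generic MinHash under stated assumptions, not the patent's exact construction: the byte-shingle width, the number of hash functions, the use of salted BLAKE2b as the hash family, and the names `shingles`, `minhash`, and `estimate_similarity` are all hypothetical.

```python
import hashlib
from typing import List, Set


def shingles(block: bytes, width: int = 8) -> Set[bytes]:
    """Return the set of overlapping byte windows of length `width`."""
    if len(block) < width:
        return {block}
    return {block[i:i + width] for i in range(len(block) - width + 1)}


def minhash(block: bytes, num_hashes: int = 64, width: int = 8) -> List[int]:
    """Build a MinHash signature: one minimum per salted hash function."""
    items = shingles(block, width)
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "big")  # a distinct salt gives a distinct hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big")
            for s in items))
    return signature


def estimate_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)
```

The signature is a fixed-length vector regardless of block size, which is what allows it to serve as a feature vector for a downstream model.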

Description

Technical field

[0001] The invention relates to the technical field of artificial intelligence, and in particular to a cloud storage similar data detection method and system based on meta-semantic embedding.

Background

[0002] With the increasing popularity of cloud storage, the amount of data in data centers keeps growing. Deduplicating data across users is critical to reducing storage costs for cloud providers, and data similarity detection plays a crucial role in that deduplication.

[0003] Currently, data similarity detection technologies widely used in data deduplication include fixed-size partitioning (FSP) and content-defined chunking (CDC). These technologies create dependencies between files that share data blocks. The loss or corruption of a few key data blocks may cause the loss or corruption of multiple files, reducing the reliability of the storage system. For this reason, some researchers have introduced redundant replic...


Application Information

IPC(8): G06F3/06, G06F16/14, G06F16/174, G06F16/182, G06N3/04, G06N3/08
CPC: G06F3/0641, G06F3/067, G06F16/152, G06F16/174, G06F16/182, G06N3/04, G06N3/08
Inventor: 田纹龙, 李柏松, 李宇圣, 万亚平, 欧阳纯萍, 刘永彬, 李跃
Owner NANHUA UNIV