Methods and systems for distributed data storage with enhanced security, resilience, and control.

JP7880916B2 · Active · Publication Date: 2026-06-26 · MYOTA INC

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
MYOTA INC
Filing Date
2024-06-05
Publication Date
2026-06-26


Abstract

To disclose a method and system for encrypting and reconstructing data files, including related metadata. SOLUTION: The method separately encrypts data and metadata as chaining processes and integrates a plurality of encryption/encoding techniques with strategic storage distribution and parsing techniques, yielding the combined benefits of the collection of techniques. As disclosed, content data is separated from its metadata, and encryption keys may be embedded in the metadata. In the content data encryption chaining process, the method chunks, encrypts, shards, and stores the content data, and it separately shards and stores the metadata in a flexible, distributed, and efficient manner, at least in part to assure improved resiliency. The processes are preferably implemented locally, including at the site of the content data or at a proxy server. SELECTED DRAWING: Figure 1

Description

Background Art

[0001] (Cross-reference to Related Applications) This application claims priority to U.S. Provisional Patent Application No. 62/851,146, filed on May 22, 2019, which is currently pending, and the entire disclosure of which is incorporated herein by reference.

[0002] Data protection is a well-known issue in storage technology from the perspectives of security and resilience. Among conventional solutions for improving error correction, well-known examples include erasure codes, widely used in CDs, DVDs, QR Codes (registered trademark), and the like, and Shamir's Secret Sharing Scheme (SSSS), which protects secrets using polynomial interpolation. Their (t,n) threshold property requires at least t of n data pieces, called shares (or shards), to reconstruct the original data. This property resembles keeping n replicas but adds the constraint t: up to n - t storage nodes can fail without service interruption, which improves data resilience when reconstructing the original. From the perspective of data protection, the (t,n) threshold property also means that the original data is revealed only when at least t shares are accessible and valid.

[0003] Erasure codes aim to correct bit errors in data while maximizing transmission or storage efficiency. Thus, most applications rely solely on erasure codes such as the Reed-Solomon (RS) code. In computer storage, erasure codes have been used to implement, in particular, levels 5 and 6 of Redundant Array of Independent Disks (RAID), providing reliable storage under different levels of failure.

[0004] Large-scale data storage systems present new technical challenges, namely the management and protection of metadata. To achieve flexibility and scalability, data is stored in distributed storage along with its metadata. Metadata contains information about the location of necessary data pieces. Therefore, a separate data protection layer is typically required to ensure the secure and reliable storage of metadata.

[0005] For example, although both SSSS and RS codes have a (t,n) threshold property, requiring at least t of n data shares to reconstruct the original data, they have been used for different purposes: SSSS for data security (encryption) and RS codes for error correction.

[0006] SSSS is designed as an encryption technique that stores a secret in n shares without using an encryption key. Because SSSS relies on polynomial interpolation to provide theoretical (information-theoretic) security, no method is known to break SSSS with fewer than t shares.
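
The (t,n) threshold behavior described above can be illustrated with a minimal sketch of Shamir-style sharing over a prime field. The modulus, the function names `split_secret` and `recover_secret`, and the use of small integer share indices are assumptions made for illustration only and are not taken from the disclosure.

```python
import secrets

# Prime modulus for the finite field; 2**127 - 1 is a Mersenne prime (assumed choice).
PRIME = 2**127 - 1

def split_secret(secret: int, t: int, n: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares such that any t of them reconstruct it."""
    if not (1 <= t <= n) or not (0 <= secret < PRIME):
        raise ValueError("need 1 <= t <= n and 0 <= secret < PRIME")
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With t = 2 and n = 3, any two of the three shares returned by `split_secret` yield the original value through `recover_secret`, while a single share on its own is consistent with every possible secret.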

[0007] [Equation image: JPEG0007880916000001.jpg]

[0008] Distributed data storage emerged because of its scalability and cost-effectiveness. One of the most prominent distributed data storage systems is the Hadoop Distributed File System (HDFS), designed for very large data center storage systems that run parallel data workloads such as MapReduce. HDFS keeps three replicas of data: two stored on two different nodes in the same rack, and the third stored on a node in a different rack (location). This strategy improves data accessibility by exploiting the locality of failures. More recently, object storage solutions have been used to simplify I/O queries using key-value pairs.

[0009] Distributed storage systems face direct challenges. The first concerns metadata management. Because data content is distributed across multiple storage nodes, the addresses of the distributed content must be maintained in a secure and reliable location, which becomes a single point of failure and a performance bottleneck. Metadata storage has a significant impact on system performance because it mainly serves directory services and metadata lookup operations, where performance bottlenecks arise. For example, List and Stat are called more frequently than Read and Write. Ceph [1] proposed building a metadata server farm and a metadata placement scheme to distribute metadata request queries across servers more efficiently. Ceph's hash function is designed to minimize shared paths of metadata queries in the server farm. ([1] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. E. Long, and C. Maltzahn, “Ceph: A Scalable, High-Performance Distributed File System,” 7th Symposium on Operating Systems Design and Implementation (OSDI), Nov. 2006.)

[0010] Beyond performance issues, security and resilience are also challenging topics as a result of isolation. Protecting metadata using data encryption techniques incurs additional computational costs and performance degradation.

[0011] Importantly, in conventional solutions, storage and retrieval are performed under a synchronous protocol, whereas in the present invention, storage and retrieval are performed asynchronously, as detailed below.

[0012] Another challenge is the architectural limitation on end-to-end solutions. Most distributed storage systems are designed for clients in the same data center, with network latency of less than approximately 1 ms, and this assumption works against multi-data-center solutions. For example, in a client-centric architecture where client devices are mobile, the client devices may connect to storage nodes and metadata server nodes over the internet. Because the clients are mobile or outside the data center, system performance cannot match that of storage services within the data center.

Summary of the Invention

[0013] This application relates to a method and system for separately encrypting data and metadata as chaining processes, using network-equipped devices and network-equipped storage nodes for secure storage, such that the process and system are reliable and resilient beyond currently available levels. The method and system of the present invention integrate various encryption/encoding techniques with strategic storage and parsing techniques, yielding the combined benefits of the collection of techniques. The present invention separates content data from its metadata. In the content data encryption chaining process, it chunks, encrypts, shards, and stores the content data, and it separately shards and stores the metadata, which is enhanced with information related to the content data encryption chaining process. The method uses both computational cryptography and theoretical cryptography. Furthermore, the process is preferably carried out locally, including at the site of the content data or at a proxy server.

[0014] In a preferred embodiment, content data is chunked, and each chunk is encrypted with AES-256 (or equivalent) using a randomly generated key, then RS-encoded (or equivalent) and divided into shards, or "sharded," where a shard is a portion of a file formed by parsing the file after encryption and encoding. The metadata is modified by introducing SSSS-encrypted chunk IDs and is then sharded in combination with key shards; the SSSS-encrypted key shards are introduced during the metadata encryption process. Note that at least two encryption methods are used: (1) AES+RS for creating data shards, and (2) SSSS for the chunk IDs and AES keys stored in the metadata shards.
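
The order of operations in the content chain (chunk, then encrypt, then encode into shards) can be sketched as follows. This is an illustration under loudly stated assumptions, not the disclosed implementation: a SHA-256 counter keystream stands in for AES-256 and must not be used for real protection, a two-halves-plus-XOR-parity code with t = 2 and n = 3 stands in for Reed-Solomon coding, and deriving the chunk ID from a hash of the ciphertext is a hypothetical choice. All function names and the tiny chunk size are likewise invented for the example. The per-chunk keys are held in plain records here; in the embodiment they would be protected with SSSS inside the metadata shards.

```python
import hashlib
import secrets

CHUNK_SIZE = 8  # bytes; deliberately tiny for the example (assumed value)

def chunk(data: bytes, size: int = CHUNK_SIZE) -> list[bytes]:
    """Divide the file into pieces without otherwise manipulating it."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """STAND-IN for AES-256: XOR with a SHA-256 counter keystream (toy only)."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        block = hashlib.sha256(key + nonce + offset.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)

def xor_bytes(p: bytes, q: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(p, q))

def encode_2_of_3(blob: bytes) -> list[bytes]:
    """STAND-IN for RS coding: two halves plus an XOR parity shard (t=2, n=3)."""
    if len(blob) % 2:
        blob += b"\x00"  # pad to even length; the true length is kept in the record
    half = len(blob) // 2
    a, b = blob[:half], blob[half:]
    return [a, b, xor_bytes(a, b)]

def decode_2_of_3(shards: dict[int, bytes], length: int) -> bytes:
    """Rebuild the encrypted chunk from any two shards, keyed by index 0..2."""
    if len(shards) < 2:
        raise ValueError("at least t = 2 shards are required")
    a = shards[0] if 0 in shards else xor_bytes(shards[1], shards[2])
    b = shards[1] if 1 in shards else xor_bytes(shards[0], shards[2])
    return (a + b)[:length]

def store_file(data: bytes) -> list[dict]:
    """Chunk, encrypt each chunk under a fresh random key, then encode to shards."""
    records = []
    for piece in chunk(data):
        key, nonce = secrets.token_bytes(32), secrets.token_bytes(12)
        cipher = keystream_xor(key, nonce, piece)
        records.append({
            "chunk_id": hashlib.sha256(cipher).hexdigest(),  # hypothetical ID scheme
            "key": key, "nonce": nonce, "length": len(cipher),
            "shards": encode_2_of_3(cipher),
        })
    return records

def load_file(records: list[dict], lost_shard: int) -> bytes:
    """Reassemble the file while pretending one shard index is unavailable."""
    out = b""
    for r in records:
        available = {i: s for i, s in enumerate(r["shards"]) if i != lost_shard}
        cipher = decode_2_of_3(available, r["length"])
        out += keystream_xor(r["key"], r["nonce"], cipher)
    return out
```

Because encoding happens after encryption, every shard carries only ciphertext, and the loss of any single shard per chunk still allows `load_file` to return the original bytes.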

[0015] In short, the present invention comprises multiple forms of encryption plus encoding, and distributed storage of encrypted (and encoded for some data) files.

[0016] This method allows for improved security and resilience compared to conventional solutions, enables faster recovery, and is controllable based on user preferences for storage management and configuration for data access control.

Brief Description of the Drawings

[0017] [Figure 1] Figure 1 shows the inference and event detection processes of the present invention.

[0018] [Figure 2] Figure 2 shows the event log collection and training of the present invention.

[0019] [Figure 3] Figure 3 shows the file and metadata / key encryption chain of the present invention.

[0020] [Figure 4] Figure 4 shows system components, interactions, and process steps.

[0021] [Figure 5] Figure 5 shows how the data path and the control path are separated in the present invention.

[0022] [Figure 6] Figure 6 shows the step-by-step procedure of the file store of the present invention.

[0023] [Figure 7] Figure 7 shows blacklisting a lost client and setting up a new client.

[0024] [Figure 8] Figure 8 shows the file store procedure when a data storage failure occurs.

[0025] [Figure 9] Figure 9 shows the file store procedure when a metadata/key storage failure occurs.

[0026] [Figure 10] Figure 10 shows the encoding of metadata into "replicated" data and "encrypted" data in the present invention.

[0027] [Figure 11] Figure 11 shows examples of file encryption and metadata encryption.

Best Mode for Carrying Out the Invention

[0028] The present invention solves the aforementioned encryption/storage challenges by combining a data encryption chaining module, based at least in part on computational encryption technology for data encryption/encoding, with a separate metadata/key encryption chaining module that follows the data encryption chaining module and is based on theoretical encryption technology. Others store and configure metadata separately from content data, but they focus merely on storing content and metadata separately and lack many of the important and beneficial attributes and approaches described herein. Reliability and security are also important priorities for metadata/key storage. The present invention advances the technology by providing a practical implementation in which content data and metadata/keys are encrypted separately (but remain interrelated) in a chaining process, using at least computational encryption technology for content and at least theoretical encryption technology for metadata/keys, thereby implementing an architecture not previously used for secure storage. Among other advantages, this solution provides significant improvements in speed, resilience, recovery, and security for individual users, "group" users (such as businesses with shared storage), and multi-data-center users (such as users with multiple data centers in the backend of their services), thereby enabling simultaneous service to multiple user types. In contrast to computational encryption techniques, some encryption algorithms have been proven to be theoretical encryption algorithms that an attacker cannot break mathematically. A computational encryption algorithm is, in practice, chosen based on the amount of time required to recover the original data, which must be long enough for the intended use. Theoretical encryption algorithms, on the other hand, provide a solution in which breaking the encrypted data is mathematically impossible unless the necessary conditions are met.

[0029] As used herein, the term “encoding” refers to the use of an RS code (or equivalent) to generate data shards from each encrypted chunk, and “encryption” refers to SSSS (or equivalent) and/or chunk encryption. The term “chunk” refers to a piece of a file obtained by dividing the file without further manipulation. The term “shard” is used when the output has a (t,n) threshold property. When encrypted chunks are encoded into data shards, the encryption is preserved in the output. A “metadata shard” is defined as data containing encrypted chunk IDs and file attributes. A “key shard” is defined as the encrypted data of each chunk encryption key, produced using SSSS or equivalent. This application defines a client as a user-facing device such as a desktop, laptop, or mobile device. An extended definition of a client includes a server machine in the user domain.

[0030] In this invention, data is first chunked, encrypted, and then sharded. Metadata and content data are encrypted and encoded separately and stored separately. The encryption / encoding scheme of this invention dramatically improves storage efficiency, data resilience, and security, based in part on the encryption scheme, encoding scheme, and storage scheme, and on how metadata remains accessible to authorized users. The method of this invention maintains or improves system reliability, meaning protection from physical or cyberattacks, and resilience, meaning the ability to recover files after file corruption.

[0031] The present invention's approach includes a novel system architecture for encryption and storage, and novel encryption and encoding approaches for both content and metadata. Improvements provided by these approaches include, but are not limited to, overcoming remote storage latency, advancing storage distribution technologies using AI (artificial intelligence), and, in particular, benefits in monitoring, controlling, and managing unstructured data.

[0032] This solution creates two separate encryption chains: one for the file content and one for its metadata/keys. Each is encrypted separately, and the file content chain uses a different approach that includes encoding. The file content encoding algorithm includes a performance-oriented information dispersal algorithm (a computational cryptographic algorithm), such as RS encoding, which is performed only after the content has been encrypted using a well-known algorithm such as AES-256. During file content encryption, one or more randomly generated encryption keys, potentially together with a nonce (including an initialization vector, typically chosen at random in this invention), are used to encrypt the file. To store the encryption keys securely, they are kept in the metadata file, or otherwise separately from the data shards, rather than alongside the data shards as is common. Separating the encryption keys from the content data protects the content data from attackers, even if an attacker has already obtained data storage permissions and/or somehow obtained the keys. Because the metadata is modified to include this additional important information, this invention applies a security-oriented information dispersal algorithm (a theoretical cryptographic algorithm) to the metadata. This invention encrypts each chunk reference (chunk ID) and each encryption key using only SSSS or equivalents. The theory of SSSS ensures that the metadata can be reconstructed only when a sufficient number of shards (in this example, >=2) are available.

[0033] Compared to conventional approaches, this solution addresses existing challenges, improves system performance, and reduces the amount of storage traditionally needed. The most important goals that a storage system must achieve are data validity, resilience, and reliability. The most common way to improve data validity and resilience is data backup, which typically requires at least two data storage areas to hold redundant data. This invention requires less storage space, typically n/t times the size of the original data, where t is the number of shards required and n is the total number of stored shards. Because only t of the n shards, rather than all n, are required, resilience is dramatically improved. In conventional solutions, RAID (Redundant Array of Independent Disks) improves storage efficiency while ensuring nearly the same resilience as data backup by adding error correction codes to the data. These earlier state-of-the-art storage solutions (prior to this invention) use distributed storage nodes to achieve both validity and resilience. They use error correction codes, such as erasure codes, to store data pieces on distributed storage nodes and tolerate a certain number of storage node failures depending on the encoding parameters. However, these solutions only address the need for separate storage of metadata and content data.
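
The n/t storage figure can be checked with a short calculation. The sketch below assumes an RS-style code in which each of the n shards is roughly 1/t the size of the encrypted chunk; the function names and the example parameters are illustrative and do not come from the disclosure.

```python
def storage_overhead(t: int, n: int) -> float:
    """Stored bytes divided by original bytes for a (t, n) threshold code."""
    if not 1 <= t <= n:
        raise ValueError("need 1 <= t <= n")
    return n / t

def tolerated_failures(t: int, n: int) -> int:
    """Number of shards (or storage nodes) that may be lost without data loss."""
    return n - t

# Three full replicas behave like t = 1, n = 3; a (4, 6) code survives the same two losses.
for label, t, n in [("3 replicas", 1, 3), ("(4, 6) shards", 4, 6)]:
    print(f"{label}: {storage_overhead(t, n):.2f}x storage, "
          f"survives {tolerated_failures(t, n)} lost nodes")
```

Under these assumptions, the (4, 6) configuration tolerates the same two node failures as three full replicas while storing half as much data.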

[0034] This invention separates metadata/key storage from content data storage while maintaining the relationship between the two. Furthermore, this solution provides users with an optimal, time-improved configuration of diverse storage backends, connecting different types of backend storage, such as cloud storage services from different vendors and the user's own storage nodes. In other words, this invention enables the simultaneous use of multiple storage types, regardless of storage type (and location), even for single-computer or single-user storage. This approach offers additional advantages because conventional solutions either were not designed to organize diverse backend storage or were unsuccessful at doing so. Because of this diversity, configuring backend storage to improve system performance and efficiency is typically a highly complex task that must be performed by experts, which adds an unstable layer and risk to conventional solutions. The organization layer in this solution optionally uses an AI-assisted optimization module to provide an abstraction of the configuration task. Optimal configuration includes, but is not limited to, cost optimization, storage efficiency, security improvements, simplified policy configuration, advanced monitoring metrics, and alert management.

[0035] Furthermore, this solution reduces the latency of metadata operations on high-latency networks, at least in part through how metadata is configured for storage and how storage is selected. In conventional solutions, recovery latency is a major common problem because of the storage approach used. File directories and file statistics must be retrieved frequently to maintain file system integrity and keep file data up to date. That is, remote backups are performed not just once but regularly and automatically, updating the stored content and metadata. The performance of metadata operations is therefore directly tied to overall system performance and user experience. Because distributed storage solutions require operations to retrieve metadata, metadata servers located across high-latency networks become a system performance bottleneck. Unlike conventional solutions, this solution is intended to use backend storage on high-latency networks such as the internet without degrading performance. It reduces the latency of metadata operations by separating operational attributes from content references, both stored in distributed metadata storage, which may include storage on the local machine.

[0036] This solution further improves the user experience of file content operations, such as reading and writing, on high-latency networks by implementing an asynchronous approach. High-latency networks typically degrade the performance of file content operations. Performance depends on the bottleneck link, which often lies in the network between the user device and the backend storage. In the context of this invention, the user device can be a server, a mobile device, or a standalone computer equipped with a network interface to the storage of the user's data. Here, the user can be an individual or a group (such as a company). The typical synchronous approach directly harms the user experience because the user has to wait for a response from the backend storage over a high-latency network. This solution uses an asynchronous approach to absorb the delay between the user device and the backend storage. Instead of waiting for a response from the backend storage, this approach returns an early response once the request has been carried out locally as an intermediate state and scheduled for synchronization as a batch process, so that the results are updated asynchronously later.

[0037] The following describes additional, unique, and novel solutions of the present invention to solve the aforementioned problems and overcome the limitations of how others store data.

[0038] Artificial intelligence (AI) assisted optimal configuration: This invention uses AI to optimize backend configurations and provides an abstracted level of control over diverse backend storage and configurations. The solution provides a human-friendly interface (including a graphical user interface (GUI) and/or a file system interface) and a language that functions as an inference module and interpreter for deriving detailed storage configurations. See Figure 1. The applications of AI include, but are not limited to, (i) optimizing storage costs through optimal data allocation, (ii) optimizing data access latency based on user data access patterns and on user and storage location information, and (iii) enhancing security by dynamically changing the number of shards for data reconfiguration. When using the AI algorithm, the solution collects anonymous logs of user file operation events. See Figure 2. The file operation event logs may be stored for analysis by a pattern analyzer to train the AI algorithm. Once the algorithm is trained, the model and parameters are deployed to an AI-assisted module, where the algorithm is actually executed. The AI-assisted module receives events from user devices and performs optimal configuration, anomaly detection, and the like. Data locations are stored within the encrypted metadata and updated based on AI-driven adjustments.

[0039] Examples of additional AI applications in the context of the present invention include the following:

[0040] (1) Optimal storage selection - performance

[0041] The system collects upload and download events for each storage (and potentially more general access events) to measure the experienced speed of each backend storage. When storing shards, assuming there are more storages than shards to be stored, the system stores more shards on the faster backend storage to minimize data storage latency. Since latency is determined by the slowest storage used, which is the bottleneck, a min-max algorithm reduces overall data upload latency by minimizing the maximum shard storage latency across storages. When fetching shards to rebuild a file, the min-max algorithm likewise selects the fastest t storages out of the n where the shards are stored, minimizing the maximum latency across storages.
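
A minimal sketch of this min-max selection follows, assuming one shard per storage and a single measured latency per backend. The backend names, the latency figures, and the function `select_min_max` are hypothetical and serve only to illustrate the selection rule.

```python
def select_min_max(latency_ms: dict[str, float], k: int) -> tuple[list[str], float]:
    """Pick k storages so that the slowest chosen one is as fast as possible."""
    if not 0 < k <= len(latency_ms):
        raise ValueError("k must be between 1 and the number of storages")
    # Taking the k lowest latencies minimizes the maximum latency among the chosen set.
    chosen = sorted(latency_ms, key=latency_ms.__getitem__)[:k]
    return chosen, max(latency_ms[name] for name in chosen)

# Hypothetical measured per-shard latencies (milliseconds) for five backends.
measured = {"cloud-a": 120.0, "cloud-b": 45.0, "nas": 8.0,
            "cloud-c": 300.0, "cloud-d": 70.0}

# Upload: n = 3 shards, one per backend, onto the three fastest backends.
upload_targets, upload_bottleneck = select_min_max(measured, 3)

# Download: t = 2 shards suffice, so fetch from the two fastest holders.
holders = {name: measured[name] for name in upload_targets}
download_sources, download_bottleneck = select_min_max(holders, 2)
```

With these figures the upload is bounded by the 70 ms backend and the download by the 45 ms backend; a real deployment would presumably refresh the measurements from the collected upload and download events.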

[0042] (2) Optimal storage selection - cost

[0043] The system collects file access frequencies and moves the least accessed files to cold storage. For the sake of discussion, assume two tiers of storage: hot and cold. Hot storage is fast but expensive, while cold storage is slow but cost-effective. Existing services provide a simple per-storage policy for deciding which data to move to cold storage, based on how long the data has been stored or when it was last accessed. Because the present invention stores n shards and requires t of the n to reconstruct the original data, the decision is not a binary choice between cold and hot; it concerns how many shards, which portions of shards, or which shards are directed to cold and hot storage. The system periodically collects shard access events and frequencies to calculate the estimated storage cost, including the cost of moving data to cold storage. The system reduces configuration complexity when multiple parameters from different types of storage must be considered. By also considering performance metrics, the algorithm can then move data from cold storage back to hot storage based on shard access patterns.
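
The non-binary tiering decision can be sketched as a small cost search over how many of the n shards stay in hot storage. The cost model, the prices, and the function name `best_hot_count` are assumptions made purely for illustration: each read is assumed to fetch t shards, hot shards first, and a per-gigabyte fee is assumed for every shard read from cold storage.

```python
def best_hot_count(n: int, t: int, shard_gb: float, reads_per_month: float,
                   hot_price: float, cold_price: float, cold_read_fee: float) -> int:
    """Return the number of shards (0..n) to keep hot that minimizes monthly cost."""
    def monthly_cost(hot: int) -> float:
        storage = shard_gb * (hot * hot_price + (n - hot) * cold_price)
        # Each read needs t shards; those not available hot come from cold storage.
        cold_fetches = max(0, t - hot) * shard_gb * cold_read_fee * reads_per_month
        return storage + cold_fetches
    return min(range(n + 1), key=monthly_cost)

# Hypothetical prices per GB-month and per GB read; n = 5 shards, t = 3 required.
prices = dict(hot_price=0.02, cold_price=0.004, cold_read_fee=0.01)
rarely_read = best_hot_count(5, 3, shard_gb=1.0, reads_per_month=0, **prices)
often_read = best_hot_count(5, 3, shard_gb=1.0, reads_per_month=100, **prices)
```

Under this model a file that is never read keeps no shards hot, whereas a frequently read file keeps exactly t shards hot and the remaining n - t in cold storage.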

[0044] (3) Detection of abnormal file access

[0045] Autocorrelation is one feature of the workload of a networked system that can be applied in the present invention. For example, studies of network traffic show regular (for example, daily) patterns based on the temporal and spatial similarity of traffic sources and destinations. In one embodiment of the present invention, the system uses autocorrelation based on file access patterns, which show similarity on daily and weekly cycles. This feature enables the development of predictive algorithms using deep learning and regression methods. The system can thus identify irregularities or deviations from these patterns, such as statistically significant ones, and can therefore warn system administrators of unusual file access by malicious users or malware.
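
As a rough illustration of how periodicity in access counts could support anomaly warnings, the sketch below computes a lag autocorrelation and flags values that deviate from earlier values at the same phase of the cycle. It is a simple statistical stand-in, not the deep learning or regression predictor mentioned above; the function names, the threshold, and the minimum history length are assumptions.

```python
from statistics import mean, pstdev

def autocorrelation(series: list[float], lag: int) -> float:
    """Sample autocorrelation of `series` at the given lag."""
    m = mean(series)
    var = sum((x - m) ** 2 for x in series)
    if var == 0:
        return 0.0
    cov = sum((series[i] - m) * (series[i + lag] - m)
              for i in range(len(series) - lag))
    return cov / var

def anomalies(series: list[float], period: int, threshold: float = 3.0) -> list[int]:
    """Flag indices that deviate from same-phase history by more than threshold std devs."""
    flagged = []
    for i in range(len(series)):
        history = [series[j] for j in range(i % period, i, period)]
        if len(history) < 3:  # not enough earlier cycles to judge
            continue
        spread = pstdev(history) or 1.0  # avoid a zero scale for constant history
        if abs(series[i] - mean(history)) > threshold * spread:
            flagged.append(i)
    return flagged
```

For example, for a series of access counts that repeats every four samples, `autocorrelation(series, 4)` is high, and a sudden burst at one position is returned by `anomalies(series, 4)` while the regular peaks are not.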

[0046] Reducing latency for metadata operations on high-latency networks: The present solution stores fragmented files across multiple backend storages, with metadata enhanced by the information needed to reconstruct the original file. Metadata typically also stores file attributes such as file size, modification time, and access time. Observations of file operation call frequency show that metadata operations are called more frequently than file content operations. Consequently, file systems are designed on the assumption that metadata operations have low latency. Traditional solutions therefore required metadata storage (or servers) on the local area network, introducing unnecessary risk of loss due to failure. This solution, however, designs metadata that is suitable for high-latency networks while maintaining the characteristics of distributed data encryption. The metadata in this solution consists of "replicated" data and "encrypted" data (see Figure 10). Replicated data holds information unrelated to the file content: the file name, size, modification time, and other file attributes. This allows the system to retrieve that data without collecting multiple metadata shards and decrypting them. Meanwhile, information related to file content, namely chunk IDs and chunk encryption keys, is stored as encrypted data. To maintain the characteristics of distributed encryption for metadata and encryption keys, this solution uses SSSS or equivalents, achieving a stronger level of security for metadata than the present invention requires for file content. Since SSSS does not require an encryption key, decryption requires only the collection of multiple data shards. Therefore, the present invention relies on distributed storage, with the variety of authentication and security solutions provided by each storage solution, as the basis of trust.

[0047] Distributing metadata shares using the separation of replicated and encrypted data improves the performance of metadata operations by storing one of the encrypted metadata shards on a local device, such as a user device on a local area network or a metadata server. Furthermore, this solution allows metadata shares to be stored in different locations as redundant copies and encrypted data shares.

[0048] OpenDir and Stat (metadata operation examples): When a user opens a directory, the file system should return information about its children, i.e., a list of files and directories. To list the information for each child of the target directory, metadata storage needs to provide metadata selection based on object name or prefix. The present invention can implement a directory system using a native directory structure, key-value storage, and/or a database system. Since a set of metadata shares is stored locally, child directories and files are identified without accessing remote metadata files. In other words, the user can "unlock" the metadata stored on the local device and identify only the files they wish to recover.

[0049] The stat operation returns file attributes, which are stored as replicated data in the set of metadata shares. The stat operation is therefore lightweight: it simply looks up the corresponding metadata share stored on the local device.
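
The OpenDir and stat behavior described above can be sketched with a local key-value store of metadata shares. The class names, field names, and path layout are hypothetical; the encrypted part of each share is treated as an opaque byte string because these operations never need to decrypt it.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetadataShare:
    # Replicated part: identical in every share and readable without decryption.
    name: str
    size: int
    mtime: float
    # Encrypted part: this share's portion of the chunk IDs and keys (opaque here).
    share_index: int
    encrypted_part: bytes

class LocalMetadataStore:
    """Key-value store of the metadata shares kept on the local device."""

    def __init__(self) -> None:
        self._shares: dict[str, MetadataShare] = {}

    def put(self, path: str, share: MetadataShare) -> None:
        self._shares[path] = share

    def stat(self, path: str) -> dict:
        """Return file attributes from the replicated part of the local share."""
        share = self._shares[path]
        return {"name": share.name, "size": share.size, "mtime": share.mtime}

    def list_dir(self, prefix: str) -> list[str]:
        """List immediate children of a directory by prefix match; prefix ends with '/'."""
        children = set()
        for path in self._shares:
            if path.startswith(prefix):
                children.add(path[len(prefix):].split("/", 1)[0])
        return sorted(children)
```

Neither `stat` nor `list_dir` touches `encrypted_part` or contacts a remote server, which is why these frequent operations stay fast even when the other shares sit behind a high-latency network.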

[0050] Read (file content operation example): In this invention, since the chunk ID is encrypted using a technique such as SSSS (the encryption technique is not limited to SSSS), two or more metadata shares are required to decrypt the chunk ID. This means that at least one metadata share must be obtained from a remote metadata server, which takes longer than a simple metadata lookup operation. However, the time it takes to download a metadata share from remote storage is significantly shorter than the time it takes to download the file content. Also, unlike metadata operations, file content operations are not requested as frequently. Therefore, the extra delay of downloading metadata from the remote server is not a significant factor in the download completion time of a file read operation.

[0051] Asynchronous content transfer to remote storage: This solution improves the user experience by staging encrypted content on the user's device before sending it to remote storage. The solution thus absorbs latency by returning results to the user early and sending data to backend storage asynchronously. For example, when the file system interface of this solution receives a request for a file content operation, such as writing a file, the operation stores the encoded content in a local buffer with a staged status before returning the result to the user interface. Staged content is batch-scheduled to complete asynchronously in the background. This design significantly improves the user experience when writing files by decoupling the latency between the user device and remote storage from user interaction.

[0052] Pre-fetching and caching of file content from remote storage: Because of the significant latency gap between remote and local storage, pre-fetching and caching improve the completion time of read operations and the user experience. Unlike file write operations, file read operations are typically on-demand operations that require data content to be delivered immediately after the user requests the file. To reduce the latency of downloading the necessary data pieces from remote storage, this solution pre-fetches them based on the user's file access patterns, which are calculated by the AI module of the present invention. This module uses temporal autocorrelation, user ID, application type, cache storage capacity, and the like to determine the lifetime and replacement of cached data, the pre-fetching of data, and so on.
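
A minimal sketch of the staged write follows, assuming a single background worker and an in-memory staging queue. The class name `StagedWriter`, the `upload` callback that stands in for the remote backend, and the `"staged"` return value are illustrative assumptions; retries and persistence of the staging buffer are not modeled.

```python
import queue
import threading

class StagedWriter:
    """Stage encoded content locally and upload it in the background."""

    def __init__(self, upload) -> None:
        self._upload = upload  # callable(name, data); stands in for remote storage
        self._staged: queue.Queue = queue.Queue()
        self.failed: list[tuple[str, Exception]] = []
        threading.Thread(target=self._drain, daemon=True).start()

    def write(self, name: str, data: bytes) -> str:
        """Return immediately once the content is staged locally."""
        self._staged.put((name, data))
        return "staged"

    def _drain(self) -> None:
        # Background loop: send staged items to the backend one at a time.
        while True:
            name, data = self._staged.get()
            try:
                self._upload(name, data)
            except Exception as exc:  # record the failure and keep the worker alive
                self.failed.append((name, exc))
            finally:
                self._staged.task_done()

    def flush(self) -> None:
        """Block until everything staged so far has been attempted."""
        self._staged.join()
```

The caller sees the result of `write` without waiting on the network; `flush` exists only so that a program, or a test, can wait for the background transfer to finish.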

[0053] Conventional solutions have focused on distributed storage systems that utilize multiple backend storage and provide an integration layer by using erasure code or SSSS (or variations thereof). While some have deployed diverse backend storage across WANs and LANs, improving manageability and efficiency remains an unresolved issue. This solution addresses these and other problems in distributed storage systems deployed across WANs and LANs, improving their configurability and performance (i.e., latency). To overcome the complexity of diverse backend storage configurations in terms of cost, performance, and security, the present invention uses an AI module including an event log collector (interface), a data analyzer (algorithm generation), an algorithm trainer (parameter tuner), a model deployer (batch processing), and an implementer (processor). (See Figures 1 and 2).

[0054] Furthermore, this solution addresses a new challenge in distributed storage solutions: the high latency of metadata and file content operations when backend storage and metadata servers are deployed over high-latency networks (e.g., the internet). This solution improves the user experience by reducing the latency of metadata operations, which are called more frequently than content operations. This is achieved by allowing some of the metadata replicated to local storage to be stored / retrieved at once. Meanwhile, the solution encrypts content-related metadata (e.g., chunk IDs) using SSSS (or Equivalents), keeping the metadata secure in a distributed manner. Asynchronously transferring file content to remote storage when writing files decouples the data storage procedure from the user's procedure, resulting in faster response times to the user interface before completing the content upload task to remote storage. AI-assisted pre-fetching and caching when reading files provides better predictions for placing the necessary data content on the local device based on user file access patterns, application type, etc.

[0055] Furthermore, this solution encrypts the content and then encodes it using RS coding (or equivalent); the RS coding is applied to the encrypted chunks. Such encoding is useful, at least, for storage efficiency. Therefore, instead of using another algorithm like SSSS, which provides stronger encryption but also introduces more overhead, this solution encrypts the chunked content using AES-256 (or equivalent) and stores its encryption key separately in the metadata.

[0056] RS coding is efficient in terms of storage and computational overhead compared to other approaches such as SSSS. This solution already overcomes the security weaknesses of RS coding by encrypting the content before coding, so other similar algorithms focused on efficiency and performance can be used.

[0057] SSSS (or equivalent) is used for metadata encryption. The metadata serves as the root key for content encryption. While other algorithms can be used if they provide the same (t,n) or similar threshold properties, this invention requires and uses a strong encryption algorithm to protect metadata that is encrypted and stored separately from content data. SSSS is secure in the information-theoretic sense, making a brute-force attack impossible if an attacker does not have enough shares. Since the overall size of metadata is much smaller than file content, the encryption overhead is negligible.
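
As a non-limiting illustration of the (t,n) threshold property, the following is a minimal sketch of Shamir's scheme over a prime field, applied to a randomly generated 256-bit value standing in for a content encryption key. The function names, the choice of prime, and the use of Python 3.8 or later (for the modular inverse via pow) are assumptions of this sketch, not details taken from the disclosure.

```python
import secrets

PRIME = 2**521 - 1  # a Mersenne prime, larger than any 256-bit secret


def split_secret(secret, t, n):
    """Split an integer secret into n shares; any t of them recover it."""
    if not (1 <= t <= n) or not (0 <= secret < PRIME):
        raise ValueError("need 1 <= t <= n and 0 <= secret < PRIME")
    # Random polynomial of degree t - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):       # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


key = secrets.randbits(256)              # stand-in for a per-chunk key
shares = split_secret(key, t=2, n=3)
```

With (t,n) = (2,3), any two of the three shares reconstruct the key, while a single share on its own reveals nothing about it.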

[0058] In the case of content encryption, SSSS incurs n times the storage overhead, while RS incurs only n / t times the storage overhead. However, RS has limited randomness in its algorithm because it was not designed for encryption (it is static and relatively easy to reverse). By also applying AES-256 (or other encryption) to the content chunks before RS coding, the solution improves randomness while keeping the storage overhead at n / t times. To protect the encryption key for the AES-256 (or equivalent) encryption, the second chain uses SSSS to encrypt the key and stores the key shards in the metadata shards.
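
As a worked illustration of the overhead comparison, the following sketch computes the total bytes stored for one chunk under each scheme. The 4 MiB chunk size and the parameters (t,n) = (2,3) are example values assumed here, and the function names are hypothetical.

```python
def stored_bytes_ssss(size, n):
    # Each of the n SSSS shares is as large as the original data.
    return n * size


def stored_bytes_rs(size, t, n):
    # RS splits the data into t pieces and stores n pieces of that size.
    shard = -(-size // t)                # ceiling division
    return n * shard


chunk = 4 * 1024 * 1024                  # hypothetical 4 MiB chunk
t, n = 2, 3                              # example coding parameters
ssss_total = stored_bytes_ssss(chunk, n)     # 12 MiB, i.e. n times
rs_total = stored_bytes_rs(chunk, t, n)      # 6 MiB, i.e. n / t times
```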

[0059] File content is chunked for several reasons. First, chunking makes it possible to identify duplicated content and, as a result, improves storage efficiency by storing only one copy of the content along with references to it. Second, chunking improves security: an attacker needs to know the references to the chunks required to retrieve the file content. Third, it improves the flexibility of data storage and its location.
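
As a non-limiting illustration of chunking with duplicate detection, the following sketch stores each distinct chunk once and keeps an ordered list of chunk references per file. Fixed-size chunking, SHA-256 as the content-derived chunk ID, and the function names are assumptions of this sketch; the disclosure does not prescribe them.

```python
import hashlib


def chunk_file(data, chunk_size):
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_deduplicated(data, chunk_size, chunk_store):
    """Store each distinct chunk once; return the ordered list of chunk IDs."""
    chunk_ids = []
    for chunk in chunk_file(data, chunk_size):
        chunk_id = hashlib.sha256(chunk).hexdigest()   # content-derived ID
        chunk_store.setdefault(chunk_id, chunk)        # keep only one copy
        chunk_ids.append(chunk_id)
    return chunk_ids


def rebuild(chunk_ids, chunk_store):
    # Concatenate the referenced chunks in their original order.
    return b"".join(chunk_store[cid] for cid in chunk_ids)


store = {}
data = b"AAAA" + b"BBBB" + b"AAAA"
ids = store_deduplicated(data, 4, store)
```

Here three chunk references are recorded but only two chunks are stored, and the file cannot be rebuilt without the ordered list of references.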

[0060] The system of the present invention is further designed to implement end-to-end security regardless of the storage architecture and environment. File encryption / decryption operations are integrated with metadata and data read / write operations, thereby minimizing vulnerabilities to man-in-the-middle attacks and performance degradation. The system architecture of the present invention also enhances end-to-end security by separating the control path from the data path.

[0061] See Figure 3. The file encoding algorithm of the present invention, called an encryption chain, aims to integrate data / metadata encryption with a data storage strategy. This is a combination of information-theoretic cryptography, which cannot be broken even if the adversary has unlimited computing power, and computational cryptography, which cannot be broken with current computing technology within a timeframe short enough to be practical.

[0062] Unlike conventional solutions, the architecture of the present invention does not have a single point of aggregation between the client (for example, the term "client" is used when referring to a proxy server, and the term "user device," which can be a standalone server, computer, or other computing device, is expected to include various types of clients here) and the data / metadata / key storage, thereby eliminating vulnerability to "man-in-the-middle" attacks. The encryption chain is initiated on the user device without a proxy server. The encryption chain is seamlessly integrated with metadata and file I / O operations to minimize modifications to existing systems and reduce changes to the user experience. The encryption chain does not require any modification to metadata and file operations, except when collecting data from storage nodes.

[0063] The encryption chain of the present invention consists of two parts: a file encryption chain and a metadata / key encryption chain. The file encryption chain involves chunking the content file. The method of the present invention encrypts each chunk and then shards the encrypted chunk. Each chunk is typically a slice of the content file that can be used to identify duplicate pieces (KyoungSoo Park, Sunghwan Ihm, Mic Bowman, and Vivek S. Pai, “Supporting practical content-addressable caching with CZIP compression” Proceedings of the 2007 USENIX Annual Technical Conference, Santa Clara, CA, USA, June 17-22, 2007). In this method, only one copy of each duplicate piece is stored to save storage space, with its locations recorded in the metadata (this technique is called data deduplication). Each chunk is encoded as multiple shards using RS code. Since RS code is not designed for encryption, each chunk is encrypted with at least one encryption key, which is randomly generated for one-time use, before the chunk is encoded as shards. The encryption key is securely stored within the metadata / key encryption chain. The key and chunk identifier (chunk ID) are encrypted by SSSS. Each set of chunk ID shards and each set of encryption key shards are distributed to the metadata storage nodes and key storage nodes, respectively, in the form of metadata and key shard files. This process does not require a centralized component to compute the allocation of metadata, keys, and data shards across multiple storage nodes. The following sections describe the file encryption chain and the metadata / key encryption chain in detail, referring to Figure 4. Figure 10 provides a further example.

[0064] (Image: JPEG0007880916000002.jpg)

[0065] (Image: JPEG0007880916000003.jpg)

[0066] (Image: JPEG0007880916000004.jpg)

[0067] (Image: JPEG0007880916000005.jpg)

[0068] Decryption of the chain. Data decryption is the reverse procedure of the encryption chain. The key shards necessary to decrypt the encryption key and chunk ID are collected by gathering encrypted metadata / keys from different metadata / storage locations, a procedure required for normal file operations. Next, the data shards necessary to regenerate the encrypted chunks are collected. Finally, the original file is reconstructed after the chunks have been decrypted and concatenated in sequence.

[0069] Figure 4 shows an example of a system architecture including system components, interactions, and processing steps. The user device encrypts (and decrypts) content and metadata and synchronizes data between the user device and storage. The logical I / O module is the interface that communicates with the user. When the logical I / O module receives a file I / O request such as open, write, or read, an event handler intercepts the native file operation event, processes the request, and implements an add-on encryption chain for file processing. To ensure end-to-end security, the present invention preferably implements flush, Fsync, and close handlers as locations to perform file encryption before storing data content in storage.

[0070] There are several usable approaches to storage selection, including, but not limited to, round-robin, random, and minimum-maximum algorithms. In one embodiment of the present invention, the minimum-maximum algorithm uses empirically measured data transfer rates to minimize the maximum transfer time across the storage nodes when transferring data to storage. When uploading, as implemented here, the minimum-maximum algorithm stores more shards in faster storage if more storage is available than the coding parameter n, which is the number of shards to store for each chunk. When downloading, the minimum-maximum algorithm is even more useful: it selects the t fastest storage nodes, where t is the number of shards needed to reconstruct a chunk, from the n storage nodes where the corresponding shards are stored. That is, to reconstruct a chunk distributed across n shards, t shards are needed.
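
As a non-limiting illustration of minimum-maximum selection, the following sketch assigns equally sized shards greedily so that the largest per-storage transfer time stays as small as possible, and picks the t fastest holders for download. The function names, the storage names, and the transfer rates are hypothetical values assumed for this sketch.

```python
def assign_uploads(shard_size, n, rates):
    """Greedy min-max: place each shard where the finish time stays lowest.

    rates maps a storage name to its measured transfer rate (bytes/second).
    """
    load = {name: 0 for name in rates}   # bytes already assigned per storage
    plan = []
    for _ in range(n):
        best = min(rates, key=lambda s: (load[s] + shard_size) / rates[s])
        load[best] += shard_size
        plan.append(best)
    return plan


def pick_downloads(t, holders, rates):
    """Choose the t fastest storages among those holding a shard."""
    return sorted(holders, key=lambda s: rates[s], reverse=True)[:t]


rates = {"a": 100.0, "b": 40.0, "c": 10.0}   # hypothetical measurements
plan = assign_uploads(shard_size=100, n=3, rates=rates)
fastest = pick_downloads(t=2, holders=["a", "b", "c"], rates=rates)
```

In this example the fastest storage receives two shards and the slowest receives none. Placing several shards of one chunk on the same storage trades resilience for speed, so a practical policy would likely cap the number of shards per storage.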

[0071] As a linked chain of file encryption chains, the metadata / key encryption chain generates multiple metadata and key shards. These shards contain encryption information such as one or more encrypted chunk IDs and SSSS encryption keys in the metadata shard. One of the encrypted metadata files is stored in the metadata storage of the user device. Specifically, a copy of the metadata shard file is stored on the local device to reduce latency for metadata operations. The other metadata and key shard files are stored in metadata / key storage, which can be configured on a single storage node or on logically / physically independent storage nodes, depending on the user's preference.

[0072] The synchronization processing unit 411 calculates the timing of data transmission between the user device and storage based on its knowledge base. The synchronization processing unit also uses its knowledge base to select / identify the locations of shards and encrypted metadata files. This task aims at cost optimization, performance optimization, security optimization, and other objectives.

[0073] Data transmission requests are pushed to a request queue, and the connector fetches the corresponding requests and performs actual data storage over the network. Response messages from storage are pushed to a response queue that serializes asynchronous responses to data storage requests. Responses are fetched, and the shard and encrypted metadata storage status are updated. If an I / O request requires data transmission synchronization, the I / O event handler waits until the corresponding response is collected.

[0074] This system provides end-to-end security for stored data by integrating encryption chains and applying information dispersal theory. End-to-end data security is challenging because (1) latency between end-user devices and storage backend locations is much greater than latency between machines within a data center, (2) system performance is limited by the most constrained component, and (3) control over resources and environment configuration is limited. High network latency between clients and metadata servers impacts the performance of metadata operations.

[0075] Because metadata contains critical information necessary to assemble file content, storing the metadata in its entirety on the end-user's device can be extremely risky. When metadata is stored on a representative server (or servers), metadata lookup operations, which are called more frequently than data lookup operations, become a system performance bottleneck. One example is Ceph's methodology, which involves storing metadata on distributed servers that are logically separate from content storage, while balancing overhead across metadata servers. A challenge in end-to-end solutions is that latency between the client and server is not predictable enough to design a system that guarantees optimal or near-optimal performance. Decomposing functions such as encoding / decoding, identifying duplicated content (commonly known as data deduplication), and designing data / control channels requires careful design of system functionality and performance, taking into account hardware computing capacity, expected network latency, and the frequency of operations.

[0076] The "t out of n" approach is applied multiple times and in multiple ways in the present invention. First, an item may be parsed into n units many times; in each case, n may be a different value. Similarly, each t may be a different value (however, two or more such t and / or two or more such n may be the same value). The "t out of n" approach in the present invention is preferably applied to the number of parsed data content pieces, separately to the number of parsed metadata pieces, and separately to the number of data shard pieces in each encrypted chunk of content data.

[0077] Regarding reconstruction, the "t out of n" approach likewise applies multiple times.

[0078] The formulation of the file encryption chain and the metadata / key encryption chain is typically a computational task performed by the data processing unit. As mentioned earlier, the file encryption chain encodes / decodes the data shards, followed by the metadata / key encryption chain, which provides similar functionality. The data shards are temporarily stored in a shard buffer until they are scheduled to be synchronized to data storage.

[0079] Again, Figure 4 shows an overview of the system. This client-centric architecture is an example of a possible deployment scheme, showing components deployed on end-user devices that create file and metadata / key encryption chains and distribute encrypted data to a storage backend. The client is not limited to end-user devices such as PCs, laptops, or mobile devices, but may also be an enterprise server, for example.

[0080] Figure 4 shows the relationship between the user device 401 and the storage pool 402 in the context of the present invention. The user device 401 may be a processor or group of processors programmed to perform at least the functions shown. As shown, the user device 401 also performs encryption, chaining, and decryption roles. The user device 401 includes input / output means, which include an I / O request engine 423 (process step 1), an input / output event handler 403, at least one data processing engine / unit 405, a storage engine 408 and relational storage, a synchronization processing unit or engine 411, a request queuing engine 415, a network interface engine 416, and a response queuing engine 422. The input / output event handler 403 has (2) at least one input / output logical module 404 that sends file content. The data processing engine / unit 405 includes (3) a file encryption module 406 that performs functions including file fragmentation, encoding, decoding, encryption, and decryption of chunks and data shards, and (4) a metadata / key encryption module 407 that has functions including metadata file and key fragmentation, encoding, decoding, encryption, and decryption of shards. The storage engine 408 and relational storage include (5) metadata / key storage 410 for identifying metadata / key shards and data shards for upload and download, and a shard buffer 409. The synchronization processing unit or engine 411 includes (6) a scheduler 412 for queuing requests in batches and a storage selector 413. The request queuing engine 415 (7) allocates requests. The network interface engine 416 has connectors for data 418 and metadata / key storage 417, and (8) sends data requests to the network. The response queuing engine 422 (9) sends data results and (10) updates the shard and encrypted metadata / key status. User devices 401 may be distributed and communicate with various remote external storage pools 402, including data 419 and metadata / key 420 storage, as well as backup storage 421.

[0081] The reconstruction process is the reverse of the chunking, encryption, sharding, and distribution processes, and can be implemented as a user application with a graphical user interface (GUI) and / or using a common file system interface (e.g., POSIX). In this invention, it is preferable that the GUI or file system interface lists files by file name and modification time. Other more common file system interfaces are also supported (it is preferable that file modification time is stored as a file attribute in metadata). These interfaces require essential file attributes such as file name, modification time, access time, and size. Thus, all files displayed in the interface have attributes that allow the user interface and the system interface, respectively, to identify the file.

[0082] File reconstruction first requires reconstruction of the metadata. At least t of the n metadata pieces need to be identified for the metadata to be reconstructed. Since the metadata contains the chunk information for the content data, at least t shards need to be identified for each chunk in order to reconstruct it (here, each t and each n may differ from chunk to chunk, and need not be the same as the t out of n used for the metadata). Each chunk is reconstructed and decrypted using at least the relevant key previously stored, in encrypted form, in the metadata. After each chunk is reconstructed, the chunks are arranged in their original order to reconstruct the entire file and make it usable again.

[0083] As mentioned above, there are numerous storage options available in the present invention, and in a preferred embodiment, the items required for reconstruction are stored in a more accessible (potentially more expensive) area. Furthermore, such items can be moved from one location to another based on cost considerations, as in the example above. Consequently, algorithms may be implemented for the ongoing storage relocation of the parsed data content and metadata elements. Nevertheless, a reconstruction process involving multiple t out of n approaches remains a preferred embodiment.

[0084] (Image: JPEG0007880916000006.jpg)

[0085] (Image: JPEG0007880916000007.jpg)

[0086] The file name and modification time pair is the initial combination required for file reconstruction. Referring to Figure 4, to ensure end-to-end security, reconstruction is integrated with a file open operation that specifies the file name and modification time. The metadata / key encryption module 407 requests the synchronization processing unit 411 to collect metadata and key shards. The storage selector module 413 selects target metadata / key storage based on optimization parameters, including but not limited to latency and cost. If no preferred parameters are set, storage is selected randomly. The metadata / key encryption module 407 decrypts the chunk IDs and encryption keys for the corresponding chunks. The file encryption module 406 requests the collection of data shards specified by the chunk ID. Storage selection for data shards is the same as for encryption. The file encryption module 406 reconstructs the encrypted chunks using the data shards. The encrypted chunks are then decrypted into plain chunks of the file using the encryption key.

[0087] The control server monitors clients to control and monitor their status. The control server also configures data storage, metadata / key storage, and metadata / key backup storage, but it does not act as a proxy for the storage backend. The control server also provides an administrator portal for controlling the overall configuration, including data storage settings, metadata storage settings, key storage settings, status monitoring, policy configuration, and access control. The control server is also responsible for initial authentication between users and devices. The control server can also integrate authentication procedures with existing components such as LDAP.

[0088] Data storage is the location where user data is actually stored. No code is executed on data storage. Therefore, data storage is not limited to cloud services and can include any legacy storage node reachable over a network. Metadata storage and key storage are locations where file metadata and encryption keys are stored. Data storage, metadata storage, and key storage can be configured with separate (t,n) parameters. Since metadata and key storage have similar requirements to data storage, data storage nodes can also be used for metadata and key storage. Storage can be configured according to performance and reliability requirements, as well as data management policies. Metadata / key backup storage stores copies of the same metadata / key shards as those on client devices. Since metadata and key shards are encrypted by SSSS, replicating the same set of shards does not increase the risk of data breaches. Data storage, metadata / key storage, and metadata / key backup storage can be deployed over a LAN, the Internet, or a hybrid of the two, but there are guidelines for optimal deployment: the control server may reside in the cloud or on the LAN; metadata / key backup storage on the LAN; and data storage and metadata / key storage in the cloud or in a hybrid of cloud and LAN.

[0089] Separation of data path and control path. Figure 5 provides an overview of how the data path and control path are separated in this system. In addition to metadata separation, the control path between the control server and the client (long dashed line) is logically or physically separated from the data path between the client and the data storage (solid line).

[0090] Separating the data path from the control path prevents even the most privileged administrator of the control server from accessing user data. Each data path between the client and each data storage is independently protected by utilizing the various security mechanisms provided by each data storage node. The independence of the control path from the data path makes the deployment of the control server a flexible process without impacting security and performance configurations.

[0091] Data shard storage. To ensure end-to-end security, the I / O event handler intercepts Flush and Fsync filesystem call events (Flush and Fsync are filesystem calls that synchronize data in main memory to a physical storage device; Fsync is a lower-level system call than Flush; www.man7.org/linux/man-pages/man2/fdatasync.2.html) and performs file encryption before saving the data content to the storage node. The data shards of the encrypted chunks are buffered in a shard buffer until they are scheduled to be propagated to data storage. Thus, the present invention guarantees intermediate data encryption after a Flush call. The scheduler determines the location and timing of data shard propagation based on configurations such as cost optimization, performance optimization, and security optimization. For example, a consistent hashing algorithm minimizes shard relocation costs when attaching / detaching data storage. More advanced algorithms can be developed and deployed.
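
As a non-limiting illustration of why consistent hashing limits relocation, the following sketch builds a hash ring with virtual points and compares shard placement before and after a storage node is attached. The class name HashRing, the number of virtual points, and the node and shard names are assumptions of this sketch.

```python
import bisect
import hashlib


def _point(key):
    # Map a string to a position on the ring.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, replicas=64):
        # Each node gets several virtual points to even out the distribution.
        self._ring = sorted(
            (_point(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    def locate(self, shard_id):
        # The first virtual point clockwise from the shard's position owns it.
        idx = bisect.bisect(self._points, _point(shard_id)) % len(self._ring)
        return self._ring[idx][1]


shards = [f"shard-{i}" for i in range(1000)]
before = HashRing(["s1", "s2", "s3"])
after = HashRing(["s1", "s2", "s3", "s4"])
moved = [s for s in shards if before.locate(s) != after.locate(s)]
```

Only shards that now map to the newly attached node change location; shards owned by the existing nodes stay where they are.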

[0092] Storing encrypted metadata / keys. The metadata / key encryption chain is triggered after the file encryption chain is complete. The encryption key shards are held in local metadata / key storage until they are scheduled to be propagated to key storage. Unlike storing data and key shards on data and key storage nodes, metadata storage is a synchronous process tied to Flush, Fsync, or Close calls. Therefore, if storing encrypted metadata fails, Flush, Fsync, or Close will return a failure code.

[0093] Staged data. Staging data on end-user devices before uploading shards to the storage backend not only absorbs upload delays but also improves the user experience by increasing the flexibility of scheduling data storage. In the storage of this invention, staged data has six states. Note that Stage 4 needs to be synchronized with Stage 3 because the metadata must be stored before the process can continue.

[0094] Stage 0: Ready to start

[0095] Stage 1: Encrypt chunk content with a randomly generated encryption key; encode data shards (Block 1 of the process)

[0096] Stage 2: Encryption of Chunk ID (Block 2 complete)

[0097] Stage 3: Storage of metadata and key shards (Block 3 complete)

[0098] Stage 4: Storage of data shards (Revise block 1, complete)

[0099] Stage 5: Completed
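
As a non-limiting illustration, the six staged states listed above can be modeled as an ordered enumeration with a guard that enforces the dependency of Stage 4 on Stage 3. The member and function names are hypothetical and chosen for this sketch only.

```python
from enum import IntEnum


class Stage(IntEnum):
    READY = 0                  # Stage 0: ready to start
    CHUNKS_ENCODED = 1         # Stage 1: chunks encrypted, data shards encoded
    CHUNK_IDS_ENCRYPTED = 2    # Stage 2: chunk IDs encrypted
    METADATA_KEYS_STORED = 3   # Stage 3: metadata and key shards stored
    DATA_SHARDS_STORED = 4     # Stage 4: data shards stored
    COMPLETED = 5              # Stage 5: completed


def advance(stage):
    """Move to the next stage; stages are only ever taken in order."""
    if stage is Stage.COMPLETED:
        raise ValueError("already completed")
    return Stage(stage + 1)


def may_store_data_shards(stage):
    # Stage 4 may begin only after metadata and key shards are stored (Stage 3).
    return stage >= Stage.METADATA_KEYS_STORED
```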

[0100] Metadata operations. Because encrypted metadata shards are stored in multiple locations, metadata lookup operations can directly read file attributes from local encrypted metadata shards. Directory and file attribute operations do not cause performance degradation regardless of metadata / key storage and data storage latency. Since metadata file writing is primarily related to data operations, metadata write delays can be ignored compared to other data storage and recovery operations.

[0101] Metadata storage selection. Unlike distributing data shards, metadata shards are stored in pre-configured metadata storage. The guideline for metadata encoding is (t,n)=(2,3). Metadata definition:

[0102] (Image: JPEG0007880916000008.jpg)

[0103] (Image: JPEG0007880916000009.jpg)

[0104] Synchronization of metadata and data content. The metadata and data content of all files are synchronized periodically. This process calculates a reference counter for chunks according to the chunk's state, i.e., local only, remote only, or an intermediate state between local and remote. The reference counter is used to schedule chunk encoding and shard distribution. This process also identifies unreferenced chunks that can be completely deleted.

[0105] Deleting data. Metadata updates are stored to track history, so deletion does not immediately remove metadata or content data. When a file is updated, the system stores the updated metadata without deleting previous versions until the number of previous versions exceeds a predefined number. When a metadata file does need to be deleted, the system breaks the reference link between the metadata and its chunks. When the system identifies a chunk with a reference count of zero, the system finally deletes the chunk from backend storage.
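
As a non-limiting illustration of version retention and reference counting, the following sketch keeps a bounded history of chunk-ID lists per file and reports chunks whose reference count has reached zero. The class name ChunkIndex and the max_versions parameter (interpreted here as the total number of retained versions) are assumptions of this sketch.

```python
class ChunkIndex:
    """Track which metadata versions reference which chunks."""

    def __init__(self, max_versions):
        self.max_versions = max_versions
        self.versions = {}     # file name -> list of chunk-ID lists, oldest first
        self.refcount = {}     # chunk ID -> number of versions referencing it

    def _release(self, chunk_ids):
        # Break reference links; return chunks whose count dropped to zero.
        dead = []
        for cid in chunk_ids:
            self.refcount[cid] -= 1
            if self.refcount[cid] == 0:
                del self.refcount[cid]
                dead.append(cid)
        return dead

    def update(self, name, chunk_ids):
        """Record a new version; return chunk IDs that can now be deleted."""
        for cid in chunk_ids:
            self.refcount[cid] = self.refcount.get(cid, 0) + 1
        history = self.versions.setdefault(name, [])
        history.append(list(chunk_ids))
        dead = []
        while len(history) > self.max_versions:
            dead += self._release(history.pop(0))
        return dead


index = ChunkIndex(max_versions=2)
first = index.update("f", ["c1", "c2"])
second = index.update("f", ["c2", "c3"])
third = index.update("f", ["c3", "c4"])
```

In this example the third update pushes the oldest version out of the history, so chunk c1 becomes unreferenced and is reported for deletion, while c2 survives because a retained version still references it.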

[0106] Normal mode operation. Figure 6 shows the step-by-step procedure of the file storage process of the present invention. As shown in the figure, there are eight related steps:

[0107] 601: Fsync - performed on the client

[0108] 602: Encoding - performed on the client

[0109] 603: Staging - performed on the client

[0110] 604: Metadata shard storage - performed on the client

[0111] 605: Metadata / key shard storage - over the Internet to metadata / key storage

[0112] 606: Metadata / key shard backup storage - over the LAN to metadata / key backup storage

[0113] 607: Scheduled push - performed on the client

[0114] 608: File shard storage - over the Internet to data storage

[0115] Whenever a specific system call event, such as Fsync or Flush, is received, the client begins encoding the corresponding file into an encryption chain. Once the file encryption chaining process is complete, the data shards are staged (ready to be pushed to data storage). Next, the metadata shards are stored in the client, metadata storage, and metadata backup storage. Key shards are also stored in key storage and key backup storage. Finally, the staged data is scheduled to be stored in data storage when the scheduler triggers an execution.

[0116] File fetching is the reverse procedure of file storage. Even in the event of a certain level of storage failure (at least t of the n storage nodes remain available, where t is the threshold parameter of the RS or SSSS code), the file fetch operation is performed in normal mode (failures are logged). If the number of errors exceeds a configurable threshold (fewer than t storage nodes are available), file fetch returns a fetch error to the user.
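
As a non-limiting illustration of fetching under partial failure, the following sketch collects shards until t have been obtained, logs failed storages, and raises an error only when fewer than t shards are available. The FetchError class, the connector callables, and the use of OSError to signal an unavailable storage are assumptions of this sketch.

```python
class FetchError(Exception):
    pass


def fetch_shards(storages, t, log):
    """Collect t shards from n storages, tolerating up to n - t failures.

    storages is a list of zero-argument callables (hypothetical connectors)
    that return a shard or raise OSError when the storage is unavailable.
    """
    shards = []
    for index, read in enumerate(storages):
        if len(shards) == t:
            break                         # enough shards to reconstruct
        try:
            shards.append(read())
        except OSError as exc:
            log.append(f"storage {index} failed: {exc}")   # logged, not fatal
    if len(shards) < t:
        raise FetchError(f"only {len(shards)} of {t} required shards available")
    return shards


def down():
    # Simulated unavailable storage.
    raise OSError("unreachable")


log = []
got = fetch_shards([down, lambda: b"s1", lambda: b"s2"], t=2, log=log)
```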

[0117] While not limited to this, it is sometimes important to blacklist clients, such as lost clients. Figure 7 shows the procedure for adding an old device to the blacklist and registering a new client. If a user loses a client, the user and / or administrator report the client to the control server. The steps in the illustrated procedure include:

[0118] 701. Blacklist lost clients - implemented by the control server.

[0119] 702. Authentication session expired - enforced by the control server

[0120] 703-5. Access denied

[0121] 706. Client Registration - Performed by the Control Server

[0122] 707. Metadata Recovery Command - Performed by the Control Server

[0123] 708. Fetch metadata from backup storage - performed by the new client.

[0124] 709. Authentication session reconfiguration - performed by the control server

[0125] 710-12. Access permitted

[0126] The control server blacklists the client information and terminates all sessions used for authentication in data storage, metadata / key storage, and metadata / key backup storage. If a user recovers or replaces their client device, the new client must be authorized by the control server in order to recover the files. The control server then sends a command message to recover the metadata using the metadata / key backup storage. Finally, the control server provides new client access information to data storage, metadata / key storage, and metadata / key backup storage.

[0127] Failure mode operation. Failure mode operation allows users to continue using the system if the number of storage failures does not exceed a threshold. Unlike file fetching, which does not require a failure mode, file storage requires a mechanism to handle backend-side upload failure errors in order to keep the system in a controlled and operational state.

[0128] Figure 8 illustrates the file storage procedure in the event of a data storage failure. The steps in this process include:

[0129] 801. Fsync

[0130] 802. Encoding

[0131] 803. Staging

[0132] 804. Metadata shard storage

[0133] 805. Metadata / key shard storage

[0134] 806. Metadata / key shard backup

[0135] 807. Scheduled push

[0136] 808. File shard storage

[0137] 809. Error detection

[0138] 810. Storage failure; shard retained locally

[0139] 811. Push retried at the next scheduled cycle

[0140] This procedure is the same as a normal mode file storage operation until the data shard is pushed to data storage. If the client detects an upload error while storing a shard, the client keeps the shard locally. The shard is managed in the same way as a staged shard. The scheduler (within the client) then reschedules the shard together with other newly staged shards in the next push cycle.
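
As a non-limiting illustration of the retry behavior, the following sketch runs a push cycle over staged shards, keeps any shard whose upload fails, and includes it in the next cycle together with newly staged shards. The function names and the simulated backend that fails on its first call are assumptions of this sketch.

```python
def push_cycle(staged, upload):
    """Try to upload every staged shard; return those that must be retried.

    upload is a hypothetical callable that raises OSError on failure.
    """
    retained = []
    for shard_id, payload in staged:
        try:
            upload(shard_id, payload)
        except OSError:
            retained.append((shard_id, payload))   # keep the shard locally
    return retained


remote = {}
attempts = {"count": 0}


def flaky_upload(shard_id, payload):
    # Simulated backend: the very first upload attempt fails.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise OSError("upload failed")
    remote[shard_id] = payload


staged = [("s1", b"x"), ("s2", b"y")]
staged = push_cycle(staged, flaky_upload)                    # first push cycle
after_first = list(staged)
staged = push_cycle(staged + [("s3", b"z")], flaky_upload)   # next push cycle
```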

[0141] Figure 9 illustrates the file storage procedure in the event of a metadata / key storage failure. This is far more critical than a data storage failure. The steps in this process include:

[0142] 901. Fsync

[0143] 902. Encoding

[0144] 903. Staging

[0145] 904. Metadata shard storage

[0146] 905. Metadata / key shard backup storage

[0147] 906. Metadata / key shard storage

[0148] 907. Error detection

[0149] 908. Rollback

[0150] Unlike data storage failures, metadata / key storage failures prevent the system from continuing file storage operations. Instead, ongoing data storage operations are rolled back. All previously stored files remain accessible in read-only mode until metadata / key storage is recovered.
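
As a non-limiting illustration of rollback on a metadata / key storage failure, the following sketch writes one shard per store and, if any store fails, restores the value each earlier store held before the write. The DictStore class, the MetadataStoreError exception, and the restore mechanism are assumptions of this sketch; the disclosure does not specify how rollback is implemented.

```python
class MetadataStoreError(Exception):
    pass


class DictStore:
    """In-memory stand-in for a metadata / key storage location."""

    def __init__(self, fail=False):
        self.items = {}
        self.fail = fail

    def put(self, name, shard):
        if self.fail:
            raise MetadataStoreError("store unavailable")
        previous = self.items.get(name)
        self.items[name] = shard
        return previous                  # remembered so the write can be undone

    def restore(self, name, previous):
        if previous is None:
            self.items.pop(name, None)
        else:
            self.items[name] = previous


def store_metadata(name, shards, stores):
    """Write one shard per store; roll every write back if any store fails."""
    done = []
    try:
        for store, shard in zip(stores, shards):
            done.append((store, store.put(name, shard)))
    except MetadataStoreError:
        for store, previous in reversed(done):
            store.restore(name, previous)
        return False
    return True


local, backup = DictStore(), DictStore()
remote = DictStore(fail=True)
local.put("f", b"old-1")
backup.put("f", b"old-2")
ok = store_metadata("f", [b"m1", b"m2", b"m3"], [local, backup, remote])
```

After the failed attempt, the local and backup stores again hold the earlier shards, which corresponds to previously stored files remaining readable.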

[0151] Figure 3 shows an embodiment of the encryption chain creation method of the present invention. Here, the file encoding approach, called an encryption chain, aims to integrate data / metadata / key encryption with a data / metadata / key storage strategy. The steps of the method include:

[0152] 1. Creating a file encryption chain

[0153] Each data file is parsed into chunks, forming an encryption chain.

[0154] ○The encryption chain is preferably initiated on the user device rather than on a centralized device.

[0155] Two separate encryption chains are created: a data file encryption chain and a metadata file encryption chain, which is usually created later. The metadata file contains information including, but not limited to, how the data file encryption chain is encrypted and other information relating to the encryption and / or distribution of the metadata file.

[0156] ○In this embodiment, the data file is first chunked and then encrypted.

[0157] ○When chunked, each chunk is assigned an ID, and this assigned ID is included in the metadata.

[0158] ○Then, each encrypted chunk is divided into shards.

[0159] ○ Shards are ultimately sent to storage, and each shard may move to a different storage medium in a different location.

[0160] ○It is preferable that there is no encrypted metadata within the data shard (however, chunk identifiers are embedded in the metadata shard).

[0161] 2. Data file encryption uses conventional file encryption followed by Reed-Solomon (RS) or an equivalent code for sharding.

[0162] ○Each data file is parsed into an encryption chain: the file is divided into chunks, and each chunk is divided into shards.

[0163] Each chunk has a specific ID. This specific ID can be determined by calculation.

[0164] The metadata is augmented with each assigned ID, and therefore includes various file attributes (including, but not limited to, name, size, modification time, and access time) and IDs. Each ID relates to a specific data file chunk.

[0165] The chunk ID is inserted into the associated metadata.

[0166] ○Data file chunks are encrypted and then encoded using RS or an equivalent.

[0167] ○Then, the encrypted chunks are sharded.

[0168] Since RS code is not designed for encryption, chunks may be encrypted, before being encoded as shards, with an encryption key that is determined by the processor of the present invention and randomly generated for one-time use.

[0169] ○A single key may be used for the chunking, encryption, and storage processes of the entire data file, a different key may be used for each chunk, or something in between may be used. The determination of the key size is performed by the processor of the present invention, and the result is stored in metadata for chunking and other purposes.

[0170] 3. Encrypt the chunk identifier (chunk ID) using SSSS or an equivalent for the metadata, which stores the references to the required content.

[0171] The metadata shard file stores the chunk ID shard.

[0172] ○Each encryption key itself is sharded.

[0173] ○In addition, each encryption key (for chunk encryption) is encrypted using SSSS or an equivalent.

[0174] ○An encryption method other than SSSS can be used instead.

[0175] ○ Users can determine the minimum number of shards required to rebuild the file.

[0176] 4. A set of chunk ID shards is stored in a metadata shard file along with replicated file attributes such as size and modification time. The encryption key shard is associated with the corresponding chunk ID.

[0177] 5. Chunked data, metadata, and encryption key shards are stored on physically or logically distributed storage / mediums.

[0178] 6. This process does not require a centralized component to calculate the allocation of data, metadata, and key shards across multiple storage units.

[0179] 7. Various algorithms can be applied to select storage for storing / fetching shards, in order to improve storage efficiency and performance.

[0180] Figure 11 shows examples of file encryption and metadata / key encryption. The configurable parameters t and n are set to 2 and 3, respectively, for both file and metadata / key encryption. In this example, a file with the content "abcdefgh" is stored across three storage nodes while tolerating one storage failure. The file is chunked into two pieces, "abcd" and "efgh". To create a reference to each chunk (called a chunk ID), the SHA-256 hash of the chunk content is calculated. In this example, the chunk IDs are 8c3f=SHA-256("abcd") and a3dc=SHA-256("efgh"). These chunk IDs are stored in metadata (in JSON format). The chunk contents "abcd" and "efgh" are encrypted using the randomly generated keys "y2gt" and "5xkn", respectively, yielding the encrypted chunk contents "X?2#" and "&$cK". Next, the encrypted chunk content is encoded using Reed-Solomon (RS) code. The encrypted chunk content "X?2#" is encoded into three shards: "kg", "dh", and "%f". Any two of the three shards suffice to reconstruct "X?2#". The encrypted chunk "&$cK" is encoded in the same way. Finally, the data shards are stored in the data storage nodes.
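The chunking, chunk ID, and (2,3) encoding steps of this example can be illustrated with a short sketch. Several things here are assumptions rather than part of the disclosure: the function names are hypothetical, the chunk encryption step is omitted, and a single XOR parity shard stands in for the Reed-Solomon code (it has the same "any two of three" property for t=2, n=3 but is not RS itself). The real SHA-256 digests are 64 hexadecimal characters; the four-character IDs in the example above are shortened for illustration.

```python
import hashlib
from typing import List, Optional


def chunk(data: bytes, size: int) -> List[bytes]:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def chunk_id(content: bytes) -> str:
    """Chunk ID as the SHA-256 hash of the chunk content."""
    return hashlib.sha256(content).hexdigest()


def _xor(x: bytes, y: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(x, y))


def encode_2_of_3(block: bytes) -> List[bytes]:
    """Split a block into halves a, b and add the parity shard a XOR b."""
    if len(block) % 2:
        raise ValueError("this sketch needs an even-length block")
    half = len(block) // 2
    a, b = block[:half], block[half:]
    return [a, b, _xor(a, b)]


def decode_2_of_3(shards: List[Optional[bytes]]) -> bytes:
    """Rebuild the block from any two shards; a lost shard is None."""
    if sum(s is None for s in shards) > 1:
        raise ValueError("at least two shards are required")
    a, b, parity = shards
    if a is None:
        a = _xor(b, parity)
    elif b is None:
        b = _xor(a, parity)
    return a + b
```

For instance, `encode_2_of_3(b"X?2#")` yields three two-byte shards, and `decode_2_of_3` recovers the four bytes when any one of them is replaced by `None`.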

[0181] The key used to encrypt the chunk content is associated with the corresponding chunk. The chunk reference information (chunk ID) is encrypted using SSSS to protect it, so that two of the three shards are required to decrypt it. Chunk ID "8c3f" is encrypted into "ct1d", "jfy2", and "7g72". The other chunk ID is encoded in the same way. Chunk ID shards are stored separately in metadata shard files. The encryption key "y2gt" is also encrypted using SSSS into "3cd2", "ziaj", and "pzc8". The other encryption key "5xkn" is encoded in the same way. Finally, metadata and keys are protected by storing the three different metadata shard files and key shard files in different locations.
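A (2,3) Shamir split of a chunk ID or key can be sketched as follows. This is a textbook construction over a prime field, shown under assumptions of this sketch: the field modulus `PRIME`, the share layout `(x, y)`, and the function names are illustrative and are not taken from the disclosure. The shares are integers, so they will not resemble the four-character strings in the example above.

```python
import secrets
from typing import List, Tuple

PRIME = 2**127 - 1          # a Mersenne prime; field size is an assumption
Share = Tuple[int, int]     # (x, y) point on the secret polynomial


def split_secret(secret: bytes, t: int, n: int) -> List[Share]:
    """Split a secret into n shares, any t of which recover it."""
    s = int.from_bytes(secret, "big")
    if s >= PRIME:
        raise ValueError("secret too large for the field")
    # Random polynomial of degree t-1 whose constant term is the secret.
    coeffs = [s] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):      # Horner evaluation mod PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares: List[Share], length: int) -> bytes:
    """Lagrange interpolation at x = 0 over the given shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret.to_bytes(length, "big")
```

With `t=2` and `n=3`, `split_secret(b"8c3f", 2, 3)` produces three shares, and any two passed to `recover_secret(..., 4)` return `b"8c3f"`. A single share on its own reveals nothing about the secret, which is the property that makes SSSS suitable for chunk IDs and keys. The modular inverse via `pow(den, -1, PRIME)` requires Python 3.8 or later.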

[0182] The chunk IDs "8c3f" and "a3dc" can only be obtained if two of the three metadata files are accessible. These chunk IDs can be used to find the data / key shard and reconstruct the encrypted chunk content "X?2#" and "&$cK". Finally, the encrypted chunk content is decrypted using the encryption key, and the decrypted chunks are concatenated to obtain the original content "abcdefgh".

[0183] (Data Integrity Verification) Storing multiple shards while tolerating a certain level of failure requires a process for calculating the data retention status. This is typically an I / O-intensive task. To make data integrity verification more efficient, the system of the present invention calculates the data retention status using only the list-object (or list-file) operations available in typical storage and operating systems:

(1) Fetch a list of metadata shard objects from metadata storage, including file path, modification time, and file status.

(2) Fetch a list of data shard objects from data storage, including chunk IDs.

(3) Fetch a list of key shard objects from key storage, including associated chunk IDs.

(4) Count the metadata files that appear in the lists, grouping them into sets that share the same file path and modification time. If the number of metadata files in a set is n, the metadata has full resilience against storage failure. If the number is less than n but greater than or equal to t, the corresponding metadata is decryptable, and the set can be restored to full resilience. If the number is less than t, the metadata files are corrupted.

(5) Count the data / key shards that appear in the lists, grouping them by chunk ID. If the number of shards in a set is n, then n - t storage failures can be tolerated for each chunk, metadata, and encryption key. This is the maximum tolerance specified by parameters t and n. If the number is less than n but greater than or equal to t, the chunk is decryptable, and the set can be restored to full resilience. If the number is less than t, the chunk is corrupted.

This process does not read the contents of metadata files to find the mapping between files and chunks, so it cannot identify which files are corrupted. However, the overall integrity and data storage status are calculated with a small number of list-object operations on storage. The process can be performed from each client device as well as from centralized entities such as control servers.
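The counting rule in steps (4) and (5) can be sketched as follows. The function names and status labels are hypothetical, and each storage listing is modeled as a plain Python list of set keys: a chunk ID for data / key shards, or a (file path, modification time) pair for metadata files.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def classify(count: int, t: int, n: int) -> str:
    """Map a shard count to a retention status for a (t, n) threshold."""
    if count >= n:
        return "full"          # tolerates n - t further storage failures
    if count >= t:
        return "recoverable"   # decryptable; can be restored to n shards
    return "corrupted"         # fewer than t shards remain


def retention_status(listings: Iterable[Iterable[Hashable]],
                     t: int, n: int) -> Dict[Hashable, str]:
    """Count, per key, how many storage listings contain a shard for it."""
    counts: Counter = Counter()
    for listing in listings:
        counts.update(set(listing))   # one shard per key per storage
    return {key: classify(c, t, n) for key, c in counts.items()}
```

A key that appears in no listing at all is not reported, since the sketch, like the process described above, works only from what the list operations return.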

[0184] In summary, the present invention encompasses numerous areas of novelty and originality, including:

[0185] • File and metadata / key encryption chain; RS code and SSSS are applied to encrypted chunks of files and chunk identifiers / key content to provide integration of file encoding and metadata / key encoding.

[0186] • End-to-end security; integrate file and metadata / key encryption chains into file system operations to prevent security vulnerabilities between the file system interface and the storage backend.

[0187] • System implementation; design and implement system components while considering long-latency networks (e.g., the Internet and WANs) and user experience.

[0188] - A client-centric architecture ensures the design and implementation of end-to-end data protection solutions.

[0189] - Encryption chain; content encryption and metadata encryption using (t,n) threshold retention properties.

[0190] Content encryption is preferable due to its storage efficiency and minimal error correction code size.

[0191] Metadata encryption requires randomness and theoretical cryptography.

[0192] - AI-assisted configuration and anomaly monitoring and detection

[0193] (Client-centric architecture) Based on the client definition, the solution architecture is designed to achieve client-centric implementation, ensuring direct communication between the client and data / metadata / storage. From the client side to the storage side, the client uses protocols and channels provided by different types of storage. The diversity of protocols and channels is implemented at the client level with minimal or zero code implementation at the backend.

[0194] Implementing a client-centric architecture in a distributed storage solution is more difficult than implementing it on the server side, because the client is not a shared component like a server. Therefore, the client must perform an efficient synchronization process to overcome the limitation of lacking a shared component. This solution directly accesses metadata in distributed storage that is not designed for shared and centralized resources, and overcomes the resulting performance limitations, including metadata access latency, by partially encoding the metadata and storing a version on the client.

[0195] To implement a client, a network-enabled client device requires a user data I / O interface, a data processing unit, hardware storage, a synchronization processing unit, and a network interface. In this example, the data I / O interface receives data I / O requests such as read, write, and list. This solution implements a POSIX file interface as the data I / O interface, but is not limited to this. The data I / O interface can implement key-value storage, CRUD (Create-Read-Update-Delete) interfaces, etc. The data processing unit encrypts and encodes data into shards by implementing the file and metadata / key encryption chains. Hardware storage stores intermediate status and data being processed before they are sent to storage. Hardware storage requires access control to prevent unauthorized entities from accessing intermediate status and data. The synchronization processing unit is responsible for sending and receiving shards. It schedules send / receive tasks based on a knowledge base that stores the empirical performance and configuration of the client and storage. It also determines the location of shards among the available storage nodes, again based on the knowledge base. The synchronization processing unit implements an AI engine to optimize parameters based on user configurations. Because the send / receive tasks in the synchronization processing unit are asynchronous, the client can respond to the user before sending data to storage, which absorbs delays and leaves room to extend the scheduling algorithm in the future.

[0196] This solution defines three types of storage: data storage, metadata / key storage, and metadata / key backup storage. Storage provides clients with authentication and data I / O interfaces. Data storage requires a cost-effective and scalable solution, while metadata and key storage require high-speed access. The requirements for metadata / key backup storage are the same as for metadata / key storage, but it resides in the user domain.

[0197] The control server is a portal for configuring backend storage, managing users / devices / policies, and sending commands to clients. The control server is completely isolated from data transmission channels to prevent interception of user data along the way. The control server deploys configurations to clients, allowing them to obtain necessary parameters and request redirects to initiate processes.

[0198] (Artificial intelligence for configuration) Due to the complexity of backend interfaces and the diversity of services, configuring an optimal setup within budget while maximizing user satisfaction is challenging. This invention provides an abstraction of the configuration layer to reduce the time and effort required for backend configuration. The invention aims to optimize operational cost, optimize performance, and monitor and detect anomalies based on empirical data about backend storage cost and performance and on user behavior profiles. The client collects event data and performs preprocessing such as anonymization and reformatting. After collecting the event data, the client sends it to the data collection server.

[0199] Optimizing the configuration to reduce operational costs overcomes the complexity of backend storage configurations and reduces operational costs by distributing shards to the optimal backend storage based on data storage / access costs, empirical storage performance, peer group usage profiles, and predefined static models.

[0200] This solution also improves responsiveness by leveraging the advantages of the implemented architecture. It overcomes the complexity of backend storage while reducing data access and storage latency. Unlike optimizing operational costs, distributing more shards to high-speed storage should have a higher priority than storage cost. In addition to these two cases, the system can be configured to achieve a balanced setting between cost-optimal and performance-optimal values, for example, by using a simple weighted sum equation.
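The "simple weighted sum" mentioned above could take a form like the following sketch. The disclosure does not give the equation, so everything here is an assumption: the `Backend` fields, the min-max normalization, and the single weight `w_cost` are illustrative choices, and lower scores are treated as better.

```python
from typing import List, NamedTuple


class Backend(NamedTuple):
    name: str
    cost: float       # e.g. storage / access cost in some unit (hypothetical)
    latency: float    # e.g. empirical transfer time per shard (hypothetical)


def _normalize(values: List[float]) -> List[float]:
    """Min-max scale values to [0, 1]; identical values all map to 0."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def rank_backends(backends: List[Backend], w_cost: float) -> List[str]:
    """Rank backends by w_cost * cost + (1 - w_cost) * latency, best first.

    w_cost = 1 gives the cost-optimal order, w_cost = 0 the
    performance-optimal order, and values in between a balance.
    """
    if not 0.0 <= w_cost <= 1.0:
        raise ValueError("w_cost must be in [0, 1]")
    costs = _normalize([b.cost for b in backends])
    lats = _normalize([b.latency for b in backends])
    scores = {
        b.name: w_cost * c + (1.0 - w_cost) * l
        for b, c, l in zip(backends, costs, lats)
    }
    return sorted(scores, key=scores.get)
```

Shards would then be distributed preferentially to the backends at the front of the returned list; how many shards go to each backend is outside the scope of this sketch.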

[0201] In this invention, the AI algorithm for behavioral analysis does not examine user data to detect anomalies in the system. While such algorithms are widely used to detect unknown attacks, normal conditions must be defined correctly to reduce false positives. Anomalies are discovered using behavioral analysis algorithms. Tightly fitted models exhibit low precision, while loosely fitted models exhibit low recall. Based on data collected from clients, the system adaptively updates the classification metrics that separate normal from abnormal conditions. This invention leverages the characteristics of data access patterns from individual users and user groups.

[0202] The following are the parameters optimized by this invention.

[0203] - Optimization 1: Indicators for shard storage that minimize data storage costs and data access costs

[0204] - Optimization 2: Indicators for shard storage that minimize data upload / download completion time

[0205] - Optimization 3: Minimizing the cost of shard redistribution when deploying Optimization 1 or Optimization 2.

[0206] - Optimization 4: Classification metrics for determining normal and abnormal data access

[0207] - Optimization 5: Classification metrics for determining normal and abnormal storage access

[0208] - Optimization 6: Classification metrics for determining normal and abnormal errors from clients

[0209] To achieve these optimizations, the present invention incorporates the following:

[0210] - Backend storage costs and (quantitative) Service Level Agreements (SLAs)

[0211] - Empirical throughput of backend storage for each client

[0212] - Timestamps of file content operations

[0213] - Operation name

[0214] - Number of shard accesses

[0215] - Anonymous file identifier

[0216] - Anonymous client identifier

[0217] While some common applications of the present invention have been described above, it should be clearly understood that the present invention can be integrated with any network application to increase security, fault tolerance, anonymity, or any suitable combination of the aforementioned or other relevant attributes. Furthermore, other combinations, additions, substitutions, and modifications will be apparent to those skilled in the art in consideration of the disclosure herein. Thus, the present invention is not intended to be limited by the recitation of preferred embodiments.

[0218] While the above invention is described in some detail for clarity, it will be apparent that certain changes and modifications can be made without departing from the principles of the present invention. It should be noted that there are many alternative ways of carrying out both the process and apparatus of the present invention. Therefore, this embodiment should be considered illustrative and not limiting, and the present invention should not be limited to the specific details given herein. The embodiments described herein can be embodied as systems, methods, or computer-readable media.

[0219] The described embodiments can be implemented in hardware, software (including firmware, etc.), or a combination thereof. Some embodiments can be implemented in a computer-readable medium containing computer-readable instructions for execution by a processor. Any combination of one or more computer-readable media can be used. The computer-readable medium may include computer-readable signal media and / or computer-readable storage media. The computer-readable storage medium may include any tangible medium capable of storing computer programs used by a programmable processor to perform the functions described herein by operating on input data and producing outputs. A computer program is a set of instructions that can be used directly or indirectly in a computer system to perform a particular function or determine a particular result.

[0220] Some embodiments can be provided to end users via a cloud computing infrastructure. Cloud computing generally involves providing scalable computing resources as a service over a network (e.g., the Internet). While several methods and systems are described herein, it is intended that a single system or method may include two or more of the above-described subjects. Therefore, multiple of the above-described systems and methods can be used together in a single system or method.

[0221] The embodiments disclosed in this application should be considered in all respects to be illustrative and not limiting. The scope of the invention is indicated not by the foregoing description but by the appended claims, and all modifications that fall within the meaning and scope of equivalence of the claims are intended to be encompassed therein.

[0222] The flowcharts and / or block diagrams in the figures illustrate the implementable architecture, functions, and operations of systems, methods, and computer program products according to various exemplary embodiments of the concepts of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or part of an instruction comprising one or more implementable instructions for performing a specified logical function. In alternative implementations, the functions described in a block may be performed in a different order than shown in the figure. For example, two blocks shown consecutively may actually be executed substantially simultaneously, or sometimes in reverse order depending on the function the block contains. It should also be noted that each block in a block diagram and / or flowchart, and combinations of blocks in a block diagram and / or flowchart, may be performed by a special-purpose hardware-based system that performs a specified function or operation, or a combination of special-purpose hardware and computer instructions.

Claims

1. A method for a processor-type server to securely and reliably reconstruct an encrypted computer file from storage, wherein the computer file comprises data and associated file metadata, and wherein, in the encryption, the file is parsed into n1 chunks, each of the n1 chunks has its own chunk ID, each chunk is encrypted with its own key, each of the n1 chunks is parsed into n2 content shards that are stored separately in multiple storage devices, each of the n2 content shards comprises data content of the at least one parsed encrypted file, and the chunk IDs and the encryption keys are then parsed into n3 metadata portions, each of the n3 metadata portions holding metadata of the at least one encrypted computer file using computational and theoretical cryptographic methods, and each metadata portion being stored separately, the method being executed by the processor and comprising the steps of: sending a list of encrypted files available for selection for reconstruction; upon receiving a selection of a file for reconstruction, identifying at least t3 of the n3 metadata portions related to the selected file, where t3 metadata portions are required to reconstruct the metadata and t3 is less than n3; recovering the identified t3 metadata portions; reconstructing the metadata for the file in storage using the recovered t3 metadata portions; identifying, for each chunk, at least t2 of the n2 content shards from the respective chunk ID in the reconstructed metadata, where t2 content shards are required to rebuild each chunk, t2 can vary from chunk to chunk, and each t2 is less than the associated n2 for the chunk corresponding to the chunk ID; recovering the identified t2 content shards of each chunk; rebuilding each of the n1 chunks using the recovered content shards; decrypting each rebuilt chunk using the respective key and reassembling the decrypted chunks into a content data file; and sending the reconstructed data file to a user.

2. The method according to claim 1, wherein the number of content shards is user-configurable at the time of encryption, and t and n are not limited to t=2 and n=3.

3. The method according to claim 1, wherein the processor-type server controls input data using a user interface, and the list of encrypted files is sent to the user interface by the processor for selection.

4. The method according to claim 1, wherein the file has associated file attributes, each metadata portion further comprises the file attributes, and the step of identifying at least t3 metadata portions related to the selected file comprises the step of identifying metadata portions having the file attributes for the selected file.

5. The method according to claim 4, wherein the file attributes include a file name, file size, and modification time.

6. The method according to claim 4, wherein the step of sending a list of encrypted files available for selection for reconstruction is performed by the processor, which accesses a metadata portion stored in the server's storage and presents at least the file name of the file attribute of the accessed metadata portion.

7. The method according to claim 6, wherein at least one of the identified t3 metadata portions is recovered by the processor from storage located away from the server.

8. The method according to claim 4, wherein the processor identifies, from the respective chunk ID in the reconstructed metadata, more than t2 content shards for each chunk, and selects the t2 content shards to recover from among the more than t2 content shards of each chunk based on recovery latency and recovery cost.

9. The method according to claim 1, wherein the metadata of the stored encrypted file is parsed into metadata portions using Shamir's Secret Sharing Scheme (SSSS), the chunks of the stored encrypted file are encoded using AES-256, and the chunks of the stored encrypted file are parsed into content shards using the Reed-Solomon Code.

10. A method for a processor for securely and reliably reconstructing an encrypted file located in storage, wherein the file in pre-encrypted form has content data and associated metadata, wherein at least one storage location for the encrypted and encoded content data is different from at least one storage location for the encrypted and encoded metadata, and wherein the encryption comprises: parsing the content data portion of the file into a chain of n content chunks, each of which is assigned a chunk ID; encrypting each of the content chunks using at least one encryption key per chunk; encoding and parsing each of the content chunks into multiple content shards; encrypting the chunk IDs; augmenting the metadata with the encrypted chunk IDs to generate augmented metadata; and parsing the augmented metadata into multiple metadata shards for encryption, thereby modifying the metadata; the method comprising the step of reversing the steps of an algorithm performed based on a combination of computational and theoretical cryptography, the algorithm comprising the steps of: parsing the content data portion of the file into a chain of n content chunks, each of which is assigned a chunk ID; encrypting each of the content chunks using at least one encryption key per chunk; encrypting and parsing each of the content chunks into multiple content shards; and encrypting the chunk IDs.

11. The method according to claim 10, wherein the algorithm to be reversed further comprises the steps of: augmenting the metadata with the encrypted chunk IDs; parsing the at least one key into multiple key shards; encrypting the multiple key shards; and adding the encrypted multiple key shards to the augmented metadata.

12. The method according to claim 11, wherein the encrypted key shard and the chunk ID are stored separately.

13. The method according to claim 10, wherein the encryption step includes the use of Shamir's Secret Sharing Scheme (SSSS).

14. The method according to claim 10, wherein the encryption comprises computational cryptography and includes the use of the Reed-Solomon Code.

15. The method according to claim 10, wherein the step of encrypting each of the content chunks includes the use of AES-256.

16. The method according to claim 10, wherein the number of metadata storage, key storage, and data storage is configurable and not limited to three.

17. The method according to claim 10, wherein the parameters t and n of the metadata shard, key shard, and data shard can be set separately and are not limited to t=2 and n=3, where t is the number of content shards required for reconstruction and n is the number of content shards to store.