Separation of logical and physical storage in distributed database systems

Dividing a logical database file into slices, and further subdividing the slices into stripes and stride units, addresses the low query processing efficiency and limited scalability of existing distributed database systems on large datasets, yielding more efficient storage and better query performance.

CN117795498B · Active · Publication Date: 2026-06-19 · MICROSOFT TECHNOLOGY LICENSING LLC

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
MICROSOFT TECHNOLOGY LICENSING LLC
Filing Date
2022-06-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing distributed database systems suffer from low query processing efficiency, and from scalability and availability issues, when handling scan-intensive, purely analytical workloads on large datasets, especially when logical and physical file storage are tightly coupled.

Method used

The logical database file is divided into slices, which are distributed to multiple page servers through endpoint mapping. The slices are further subdivided into stripes and stride units to enable parallel storage access and improve performance.

Benefits of technology

It improves the query processing efficiency of the database system, reduces the time for replica re-creation, enhances the throughput of log recording, parallelizes I/O operations, optimizes the storage configuration of hot and cold data, and improves the scalability and performance of the system.


Abstract

A distributed database system comprising compute nodes and page servers is described herein, which implements the separation of logical and physical storage of database files within the distributed database system. The distributed database system includes page servers and compute nodes, and is configured to store logical database files, which contain data and are associated with file identifiers. Each page server can be configured to store slices (i.e., sub-parts) of the logical database files. Compute nodes are coupled to multiple page servers and are configured to store logical database files in response to received commands. In one aspect, such storage may include slicing the data comprising the logical database files into a set of slices, where each slice is associated with a corresponding page server, maintaining an endpoint mapping for each slice in a first set of slices, and sending each slice to its associated page server for storage.

Description

Background Technology

[0001] A typical distributed database system divides storage and compute workloads among multiple distributed components. Such a system may include, for example, one or more compute nodes / servers, page servers, and storage components. This system partitions system functions between compute and storage. Compute nodes handle all incoming user queries and query processing activities, while page servers are coupled to storage components to provide a horizontally scalable storage engine, with each page server responsible for a subset of the database's pages. In this configuration, page servers are limited to serving pages to compute nodes and updating the corresponding pages based on ongoing transaction activity.

[0002] This architecture enables scaling to databases of 100+ TB, rapid database recovery, near-instantaneous backups, and the ability to quickly expand and shrink. This configuration provides flexibility, scalability, and performance for online transaction processing and / or hybrid analytical processing workloads that require high transaction throughput while also supporting real-time analytics.

[0003] However, such a system may not be optimal for scan-intensive, purely analytical workloads on very large datasets, as query processing is performed on (multiple) compute nodes, requiring the migration of large amounts of data from the page servers to the compute nodes for processing. Furthermore, scalability and availability issues arise when logical and physical file storage are tightly or completely coupled.

Summary of the Invention

[0004] This summary is provided to introduce a selection of concepts in a simplified form, which will be further described in the detailed description below. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[0005] A distributed database system is provided herein, configured to decouple logical database files from their physical storage. In one aspect, the logical database file comprises data and is associated with a file identifier, and the distributed database system includes multiple page servers, each configured to store slices (i.e., sub-parts) of the logical database file. The distributed database system also includes compute nodes coupled to the multiple page servers and configured to store the logical database file in response to received commands (e.g., import commands, or as a result of query commands). In one aspect, such storage can include slicing the data comprising the logical database file into a set of slices, wherein each slice is associated with a corresponding page server, maintaining an endpoint mapping for each slice in the set of slices, and sending each slice to its associated page server for storage.

[0006] In another aspect, the endpoint mapping includes a database file identifier, a slice identifier, and the endpoint address of the page server associated with the corresponding slice. Furthermore, different logical database files can be stored using different storage configurations.
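The slicing and endpoint-mapping steps summarized above can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `EndpointMapping`, `slice_file`, and `endpoint_for`, the fixed slice size, the round-robin assignment of slices to page servers, and the example addresses are all assumptions made for the sketch. Only the three mapping fields (file identifier, slice identifier, endpoint address) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointMapping:
    file_id: int    # identifier of the logical database file
    slice_id: int   # position of the slice within that file
    endpoint: str   # address of the page server that stores the slice


def slice_file(file_id, data, slice_size, page_servers):
    """Split `data` into slices and assign each slice to a page server.

    Fixed-size slices and round-robin assignment are assumptions of this
    sketch. Returns (mappings, slices); slices[i] is the payload of slice i.
    """
    if slice_size <= 0 or not page_servers:
        raise ValueError("slice_size must be positive and page_servers non-empty")
    slices = [data[i:i + slice_size] for i in range(0, len(data), slice_size)]
    mappings = [
        EndpointMapping(file_id, i, page_servers[i % len(page_servers)])
        for i in range(len(slices))
    ]
    return mappings, slices


def endpoint_for(mappings, file_id, slice_id):
    """Look up which page server holds a given slice of a given file."""
    for m in mappings:
        if m.file_id == file_id and m.slice_id == slice_id:
            return m.endpoint
    raise KeyError((file_id, slice_id))
```

A compute node would consult such a mapping to route a page request to the page server that holds the relevant slice; the logical file itself never needs to exist as one physical file.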

[0007] In another aspect, the distributed database system is configured to change the storage of the logical database file from one configuration to another by moving slices of the logical database file to new page servers with different configurations and updating the endpoint mappings accordingly, or by changing the hardware configuration of the page server where the slice is currently located.

[0008] Further features and advantages, as well as the structure and operation of various examples, are described in detail below with reference to the accompanying drawings. Note that the ideas and techniques are not limited to the specific examples described herein. Such examples presented herein are for illustrative purposes only. Additional examples will be apparent to those skilled in the art based on the teachings contained herein.

Attached Figure Description

[0009] The accompanying drawings, which are incorporated herein and form part of the specification, illustrate embodiments of the present application and, together with the specification, further serve to explain the principles of the embodiments and enable those skilled in the art to make and use these embodiments.

[0010] Figure 1 depicts a block diagram of a distributed database system configured to separate the logical storage from the physical storage of database files stored in the system, according to an example embodiment.

[0011] Figure 2 depicts a block diagram of an example compute node in the distributed database system of Figure 1, configured to separate the logical storage from the physical storage of database files, according to an example embodiment.

[0012] Figure 3 depicts a block diagram illustrating example endpoint mappings for database files stored in the distributed database system of Figure 1, according to an example embodiment.

[0013] Figure 4 depicts a block diagram of an example distribution of units of a database file across multiple page servers in the distributed database system of Figure 1, according to an example embodiment.

[0014] Figure 5 depicts a block diagram of an example data organization in a logical database file, from consecutive data chunks to cells, striped cells, and striped/strided cells, according to an example embodiment.

[0015] Figure 6 depicts a flowchart of a method for separating logical storage from physical storage of a database file stored in a distributed database system, according to an example embodiment.

[0016] Figure 7 depicts a flowchart that refines the method of Figure 6 to provide multiple physical storage configurations, according to an example embodiment.

[0017] Figure 8 depicts a flowchart that refines the method of Figure 7 to change the physical storage configuration of data stored in a distributed database, according to an example embodiment.

[0018] Figure 9 is a block diagram of an example computer system in which embodiments can be implemented.

[0019] The features and advantages of the embodiments will become more apparent from the following detailed description when taken in conjunction with the accompanying drawings, in which similar reference numerals consistently identify corresponding elements. In the drawings, similar reference numerals generally indicate identical, functionally similar, and/or structurally similar elements. The drawing in which an element first appears is indicated by the leftmost digit(s) of the corresponding reference numeral.

Detailed Implementation

[0020] I. Introduction

[0021] This specification and accompanying drawings disclose one or more embodiments incorporating the features of the invention. The scope of the invention is not limited to the disclosed embodiments. The disclosed embodiments are merely illustrative, and modifications of the disclosed embodiments are also covered by the invention. The embodiments of the invention are defined by the appended claims.

[0022] References to "one embodiment," "an embodiment," "exemplary embodiment," etc., in the specification indicate that the described embodiment may include a particular feature, structure, or characteristic, but not every embodiment necessarily includes that particular feature, structure, or characteristic. Furthermore, such phrases do not necessarily refer to the same embodiment. Additionally, when a particular feature, structure, or characteristic is described in connection with an embodiment, it is assumed that implementing such a feature, structure, or characteristic in conjunction with other embodiments is within the knowledge of those skilled in the art, whether explicitly described or not.

[0023] In this discussion, unless otherwise stated, adjectives (such as “substantially” and “approximately”) that modify one or more features of embodiments of this disclosure are understood to mean that the condition or feature is defined within an acceptable tolerance for operation of the embodiment for the intended application.

[0024] Several exemplary embodiments are described below. It should be noted that any section / subsection headings provided herein are not intended to be limiting. Embodiments are described throughout this document, and any type of embodiment may be included under any section / subsection. Furthermore, embodiments disclosed in any section / subsection may be combined in any manner with any other embodiments described in the same section / subsection and / or different sections / subsections.

[0025] Section II below describes an example embodiment for separating the logical and physical storage of database files in a distributed database system. Section III below describes an example computing device embodiment that can be used to implement the features of the embodiments described herein. Section IV below describes other examples and advantages, and Section V provides some concluding comments.

[0026] II. Example Implementation

[0027] As described above, there exist distributed database systems where compute and storage resources are isolated, such that one or more compute nodes (i.e., servers dedicated to performing "compute" functions) are coupled to, for example, n page servers, where each page server manages access to and/or modification of one-nth of the pages containing data stored in the database. In such a system, the compute nodes handle all incoming user queries and query processing activity, while the page servers are coupled to storage components to provide a scale-out storage engine, with each page server responsible for a subset of the pages in the database. Some such systems employ a storage subsystem that may have limited or no separation between logical file storage and physical file storage. That is, logical files may correspond one-to-one with physical files residing on a single storage device (e.g., a single drive/spindle). In such cases, scalability issues may arise, which can impair storage performance and/or increase the costs associated with delivering services at the desired or required performance level.

[0028] The embodiments described herein decouple logical file storage from the underlying physical storage of that file, while improving performance through the following:

[0029] Using endpoint mapping to redirect a logical database file to multiple page servers, thereby dividing the logical database file into slices.

[0030] Subdividing the slices into smaller stripes and stride units to improve overall performance.

[0031] These aspects of the embodiments can be better understood in the context of an example distributed database system in which the embodiments can be implemented. For example, Figure 1 depicts a block diagram of a distributed database system 100 configured, according to an example embodiment, to separate the logical storage from the physical storage of database files stored in the system. Distributed database system 100 is an example system in which embodiments may be implemented, but is not intended to be limiting. Embodiments may be implemented in other types of distributed database systems, as will become apparent to those skilled in the art from the teachings herein.

[0032] Distributed database system 100 manages one or more databases where data is stored across different physical locations. The devices of system 100 can be located in the same physical location (e.g., a data center) or can be distributed across a network of interconnected computers. System 100 can manage the databases according to any suitable database model (e.g., relational or XML) and can implement any suitable query language(s) to access the databases, including SQL (Structured Query Language) or XQuery. As shown in Figure 1, distributed database system 100 includes one or more compute nodes 102, a log server 110, one or more page servers 108, and storage device 136. The compute nodes 102 include a primary compute node 104 and a set of secondary compute nodes 106-1 to 106-N. Similarly, the page servers 108 include a set of page servers 108-1 to 108-N. The log server 110 includes a log cache 112. These features of system 100 are described in further detail below.

[0033] Any number of user devices 101 can access data managed by the distributed database system 100. The user devices 101 are coupled to one or more compute nodes and provide workloads to the distributed database system 100 in the form of transactions and other queries. The primary and secondary compute nodes of compute nodes 102 are coupled to a log server 110 and one or more page servers 108. Each of the user devices 101 can be any type of fixed or mobile computing device, including a mobile computer or mobile computing device (e.g., a personal digital assistant (PDA), a laptop, a notebook, a tablet such as an Apple iPad™, or a netbook), a mobile phone, a wearable computing device, or another type of mobile device, or a fixed computing device such as a desktop computer or PC (personal computer), or a server.

[0034] Note that although embodiments may sometimes be described herein in the context of user devices (such as user devices 101) providing a query or a workload of queries and receiving returned query results, embodiments are not limited to operation using or through user devices (such as user devices 101). In fact, the embodiments described herein can execute queries for or on behalf of any source of such queries and provide query results to the same or some other source or entity. For example, queries can be generated by a computing component (not shown in Figure 1) and provided to the embodiment for execution. Thereafter, the embodiment may execute the query in the manner described herein and provide the results directly back to the query source, or to some other location, entity, or component as appropriate.

[0035] Compute nodes 102, page servers 108, and log server 110 may include any number of computing devices (e.g., servers) including hardware (e.g., processors, memory, storage devices, networking components) and software (e.g., database management system (DBMS) software), configured to interact with user devices 101 and manage access to stored data in one or more databases (including reading, writing, modifying, etc.).

[0036] In one embodiment, each page server of page servers 108 is configured as a separate shard storing a single database file. Although storage device 136 and the files 122 stored therein are described as a monolithic storage device shared among page servers 108, the embodiments are not limited thereto. In an alternative embodiment (not shown), each page server of page servers 108 may be coupled to a dedicated storage device that includes only the files 122 managed and stored by that particular page server. Similarly, embodiments can be configured anywhere in between, where some files corresponding to a given page server are stored at one location on a single storage unit, while other pages are stored at other locations.

[0037] As mentioned above, the distributed database system 100 can be configured to perform transaction processing. Implementations of the distributed database system 100 are ACID compliant. As is known in the art, ACID is an acronym for a set of properties that ensure data persistently stored in the database is valid, even if errors occur due to factors such as hardware failure or power failure. ACID stands for Atomicity, Consistency, Isolation, and Durability. Transactions executed by the distributed database system 100 are ACID compliant because the logically corresponding operations collectively satisfy the ACID properties.

[0038] The atomicity property of a transaction requires that it either succeeds completely or fails completely. Complete failure of a transaction means that the database remains unchanged. For example, suppose a transaction involves transferring money from account A to account B. The entire transaction consists of multiple steps, such as: the funds being deducted from account A, the funds being transferred to account B, and the funds being credited to account B. In this case, atomicity guarantees that no funds were deducted from account A if, for any reason, the funds were not credited to account B.
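The account-transfer example can be illustrated with a small sketch. It uses Python's standard `sqlite3` module purely as a stand-in for a transactional store; the `accounts` table, the `transfer` helper, and the balances are hypothetical and are not part of the system described here.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; return True on success."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            cur = conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            if cur.rowcount != 1:
                # The credit did not happen, so the debit must not persist.
                raise LookupError(f"unknown account: {dst}")
        return True
    except LookupError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("A", 100), ("B", 0)])
conn.commit()
```

If the credit step fails, the debit already issued inside the same transaction is rolled back, so account A's balance is left unchanged.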

[0039] Consistency properties ensure that transactions comply with all applicable rules governing the storage of data, enabling transactions to transform the database from one valid state to another.

[0040] Isolation properties ensure that different transactions executed concurrently keep the database in the same state that would be obtained if the transactions were executed serially.

[0041] Finally, the durability property guarantees that once a transaction is committed (i.e., completed and persisted to the database in an ACID-compliant manner), the transaction will remain committed, and no hardware, system, or power failure will cause the transaction to be lost or cause the database to enter an otherwise inconsistent state. With further reference to Figure 1, the ACID properties of transactions executed by the distributed database system 100 are ensured in part by the use of the log server 110, as follows.

[0042] In one embodiment, the primary compute node 104 is configured to perform both read and write operations, while the secondary compute nodes 106-1 to 106-N are configured to perform read-only operations. Therefore, only the primary compute node 104 can execute transactions that change the database state. To maintain the ACID properties of transactions, the primary compute node 104 can be configured to generate a transaction log record upon transaction commit and store this record locally in the transaction log before any database modifications caused by the transaction are written to disk.

[0043] The log entries for committed transactions include all the information needed to redo the transaction in the event that a problem (e.g., a power failure) occurs before the data modified by the transaction can be stored (e.g., in file 122 on storage device 136). The log entries may include information including, but not limited to, transaction identifiers, log sequence numbers, timestamps, and information indicating which data object(s) were modified and how they were modified.

[0044] Regarding log sequence numbers, the transaction log operates logically as if it were a sequence of log records, each identified by a Log Sequence Number (LSN). Each new log record is written to the logical end of the transaction log with an LSN higher than that of its preceding record. Log records are stored in serial order as they are created, such that if LSN2 is greater than LSN1, the change described by the log record referenced by LSN2 occurs after the change described by log record LSN1. Each log record also includes a transaction identifier for the transaction to which it belongs. That is, the transaction identifier uniquely identifies the transaction corresponding to the log record (e.g., a Universally Unique Identifier (UUID) or a Globally Unique Identifier (GUID)).
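The ordering guarantee described above can be sketched with an append-only log that assigns strictly increasing LSNs. The class and field names (`LogRecord`, `TransactionLog`, `payload`, `records_after`) are hypothetical, and keeping the log in an in-memory list is an assumption made only to keep the example short.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    lsn: int            # log sequence number, strictly increasing
    txn_id: uuid.UUID   # identifies the transaction the record belongs to
    payload: str        # description of the change (placeholder)


class TransactionLog:
    def __init__(self):
        self._records = []
        self._next_lsn = 1

    def append(self, txn_id, payload):
        """Write a record at the logical end of the log with the next LSN."""
        record = LogRecord(self._next_lsn, txn_id, payload)
        self._records.append(record)
        self._next_lsn += 1
        return record

    def records_after(self, lsn):
        """Return, in order, every record whose LSN is greater than `lsn`."""
        return [r for r in self._records if r.lsn > lsn]
```

A consumer that remembers the last LSN it applied can call `records_after` to catch up, which is the property that lets replicas and page servers replay changes in the order they occurred.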

[0045] In one embodiment, the log record corresponding to the transaction is then forwarded to log server 110, which is configured to provide a logging service. The logging service on log server 110 accepts log records from the primary compute node 104, stores them in log cache 112, and subsequently forwards the log records to the remaining compute replicas (i.e., secondary compute nodes 106-1 to 106-N) so that they can update their local log caches. Log server 110 also forwards log records to the relevant page servers (i.e., page servers that manage data modified by the transaction) so that the data can be updated there.

[0046] In this way, all data changes from the primary compute node 104 are propagated to all secondary compute nodes and page servers via the log service. Finally, the log entries are pushed to long-term storage, such as storage device 136. In addition to transaction commits, other types of operations can also be logged at the primary compute node 104 and subsequently forwarded, including but not limited to transaction start, extent and page allocation or deallocation, creating or dropping a table or index, and every data or schema modification.

[0047] As mentioned above, several problems arise when logical file storage is substantially or entirely coextensive with the physical storage of the file. These problems include and/or relate to: a) replica re-creation, b) remote storage I/O limitations, and c) data archiving (i.e., storage tier modification). Each will now be described in turn.

[0048] The main problem with re-creating a replica is file size. For example, suppose a single 1TB file is stored in files 122 and managed by page server 108-1 of page servers 108. If page server 108-1 fails (or, for example, the SQL instance hosted on it fails, or any other failure prevents page server 108-1 from fulfilling its role), another page server and SQL instance (e.g., page server 108-N) must be started to replace the failed page server. The 1TB data file must then be read from files 122 into the local cache of page server 108-N. While the cached copy of the 1TB data file is being re-created, page server 108-N must continue to respond to read requests and apply file modifications from the log records it receives from log server 110. During cache recovery, query performance (i.e., the read workload) at page server 108-N may suffer significantly because page server 108-N cannot simultaneously keep up with the volume of changes coming from log server 110 (i.e., the write workload). As a further consequence, page server 108-N may cause throttling of the log pipeline, which in turn reduces the throughput of log server 110.

[0049] The embodiments address the replica re-creation problem by dividing the 1TB logical file into multiple file slices, each independently maintained by a different instance of page servers 108. The amount of the 1TB data file "owned" by any one page server is therefore configurable, allowing for much faster replica re-creation because the amount of data owned by any single page server is significantly smaller. Furthermore, since each page server is responsible for only a portion of a single file, the throughput of log application is enhanced, because log records corresponding to modifications of data not owned by that page server can be ignored.
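Both effects described above can be sketched in a few lines. The dictionary layout of a log record and the helper names `records_for_page_server` and `bytes_to_recreate` are hypothetical, and equal-sized slices are an assumption; the source only states that the amount of data owned by a page server is configurable.

```python
def records_for_page_server(log_records, owned_slices):
    """Keep only the log records that touch a slice this page server owns.

    log_records:  iterable of dicts with "file_id", "slice_id", and "lsn" keys
    owned_slices: set of (file_id, slice_id) pairs owned by the page server
    """
    return [
        r for r in log_records
        if (r["file_id"], r["slice_id"]) in owned_slices
    ]


def bytes_to_recreate(file_size_bytes, slice_count):
    """Data one replacement page server must re-read, assuming equal slices."""
    return -(-file_size_bytes // slice_count)  # ceiling division
```

Under these assumptions, a 1 TiB file split into 128 equal slices leaves each page server with 8 GiB to re-read after a failure, instead of the full file.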

[0050] Remote storage I/O limits are strained when large logical files are maintained as large physical files. This is especially true for online transaction processing (“OLTP”) workloads, which typically involve modifications to many small rows. In such cases, the number of changes directed at a given page server can overwhelm the I/O capacity of storage device 136. For a standard storage configuration, typical throughput can be between 500 and 1000 I/O operations per second. To improve transaction throughput, one could have the files reside on more powerful hardware. However, even the best hardware available today has limits, and such hardware is extremely expensive regardless.

[0051] The embodiments address storage I/O limitations by further dividing the slices described above into stripes and stride units, wherein different units reside in the page server cache, on storage device 136, or on different physical storage devices (i.e., different drives/spindles). As will be further described below, such a configuration allows I/O operations to be performed in parallel, thereby improving performance.
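The idea of placing units on different devices so that they can be read in parallel can be sketched as below. This passage does not define the exact stripe and stride layout, so the round-robin placement used here (in the style of RAID-0 striping), the fixed unit size, and the names `stripe_units` and `read_slice` are all assumptions. In-memory lists stand in for physical devices.

```python
from concurrent.futures import ThreadPoolExecutor


def stripe_units(slice_data, unit_size, device_count):
    """Distribute fixed-size units of a slice round-robin over devices.

    Returns one list per device; each entry is (unit_index, unit_bytes).
    """
    devices = [[] for _ in range(device_count)]
    for n, offset in enumerate(range(0, len(slice_data), unit_size)):
        unit = slice_data[offset:offset + unit_size]
        devices[n % device_count].append((n, unit))
    return devices


def read_slice(devices):
    """Read all devices concurrently, then reassemble the units in order."""
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        # `list` copies each device's contents; it stands in for real I/O.
        per_device = list(pool.map(list, devices))
    units = sorted(u for dev in per_device for u in dev)
    return b"".join(chunk for _, chunk in units)
```

Because each device holds only every k-th unit, a scan of the whole slice issues k independent reads at once rather than one long sequential read against a single drive.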

[0052] The parallelization of I/O described above also addresses issues related to data archiving and/or storage tier modification. In data warehouse scenarios, it is common to have "hot" and "cold" data. Hot data is the data within the database that is the target of a large number of queries (whether reads or writes) during the current workload. Cold data, by contrast, is data that is not currently needed and may be the subject of very few queries.

[0053] For example, a database application might be configured to monitor financial data over time. In this case, query activity might focus more on the most recent financial data (i.e., “hot” data). However, as time goes on and the data ages, there might be fewer or no queries targeting such data, which is then referred to as “cold” data. Because such data is rarely accessed, it’s best to use a different storage configuration to store it. For example, a higher-performance and more expensive cache memory or solid-state drive (“SSD”) cache dedicated to such cold data could be reassigned to hot data. Alternatively, cold data could be moved to a cheaper and slower storage subsystem (e.g., from an SSD drive to a spindle-based drive, or from a spindle-based drive to tape).

[0054] However, migrating a single large file residing on a single storage device from one storage configuration or tier to another is slow because the file is so large. The embodiments, by contrast, break the large logical file into a large number of small physical files, as described above. This allows the copy to be parallelized so that it becomes a constant-time operation. After the data has been copied to the new configuration, a new page server instance pointing to that data can be created, and compute node 104 can, for example, create new endpoint mappings for the new page server and remove the old endpoint mappings. This redirects future read traffic to the new page server, all without taking the ongoing workload offline. The concept of "endpoint mapping" will now be described with reference to Figure 2 and Figure 3.
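The copy-then-repoint sequence described above can be sketched as follows. The dictionary-based `mappings` and `stores` structures, the `migrate_slice` helper, and the endpoint names in the usage note are hypothetical stand-ins for the endpoint mappings and page servers; a real system would also have to handle writes that arrive during the copy, which this sketch ignores.

```python
def migrate_slice(mappings, stores, key, new_endpoint):
    """Copy one slice to `new_endpoint`, then repoint its endpoint mapping.

    mappings: dict (file_id, slice_id) -> endpoint address
    stores:   dict endpoint address -> dict (file_id, slice_id) -> bytes
    Returns the endpoint that previously held the slice.
    """
    old_endpoint = mappings[key]
    if old_endpoint == new_endpoint:
        return old_endpoint  # nothing to move
    # 1. Copy the data to the new configuration first.
    stores.setdefault(new_endpoint, {})[key] = stores[old_endpoint][key]
    # 2. Repoint the mapping so future reads go to the new page server.
    mappings[key] = new_endpoint
    # 3. Retire the old copy only after the mapping has been switched.
    del stores[old_endpoint][key]
    return old_endpoint
```

For example, calling `migrate_slice(mappings, stores, (1, 0), "hdd-ps")` on a slice currently mapped to `"ssd-ps"` would move a cold slice to a cheaper tier. Because each slice moves independently, many such calls can run in parallel, one per slice.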

[0055] Figure 2 depicts a block diagram of a system 200, according to an example embodiment. System 200 includes a storage device 136, a plurality of page servers 108, and a compute node 104, and is configured to separate the logical storage from the physical storage of database files stored in the distributed database system 100 of Figure 1. System 200 is described as follows.

[0056] System 200 includes a compute node 104 coupled to page servers 108, which in turn are coupled to a storage device 136, as also shown in Figure 1. Compute node 104 may include any type of server or computing system, as mentioned elsewhere in this document or otherwise known, including but not limited to cloud-based systems, on-premises servers, distributed network architectures, etc.

[0057] As shown in Figure 2, compute node 104 includes processor 204, memory/storage device 206, network interface 228, operation processor 222, and storage manager 238. Storage device 136 includes file 122. Storage manager 238 includes endpoint manager 220 and optionally includes data slicer 210 (the optional nature is indicated by the dashed outline of data slicer 210 in the figure). These components of system 200 are described below.

[0058] It is contemplated herein that, in various embodiments, any component of compute node 104 can be grouped, combined, separated, etc., with any other component, and that the example of compute node 104 illustrated in Figure 2 is not limiting in its configuration, the number of its components, or their arrangement. Furthermore, it should be understood that components (such as, for example, processor(s) 204, memory(s) 206, and/or network interface(s) 228) may include multiple instances of such components, whether physical or virtual.

[0059] Processor(s) 204 and memory/storage device(s) 206 may respectively be any type of processor circuitry/system and memory as described herein and/or as would be understood by those skilled in the art having the benefit of this disclosure. Processor(s) 204 and memory/storage device(s) 206 may each respectively include one or more processors or memories, different types of processors or memories (e.g., one or more types/numbers of caches for query processing, allocations for data storage, etc.), remote processors or memories, and/or distributed processors or memories. Processor(s) 204 may be multi-core processors configured to concurrently execute more than one processing thread. Processor(s) 204 may include circuitry configured to execute and/or process computer program instructions, such as, but not limited to, embodiments of storage manager 238 and/or data slicer 210, including one or more of their components as described herein, which may be implemented as computer program instructions. For example, processor(s) 204 may execute program instructions in performing any of flowcharts 600, 700, and/or 800 of Figure 6, Figure 7, and Figure 8, as described in detail below.

[0060] In embodiments, the operation processor 222 may be part of a query processor or database server/system, configured to perform database operations, such as executing queries against a database. In embodiments, the operation processor 222 may include program instructions executed by processor(s) 204, or may be a hardware-based processing device as described herein.

[0061] Memory(ies)/storage 206 include volatile storage portions (such as random access memory (RAM)) and/or persistent storage portions (such as hard disk drives, non-volatile RAM, etc.) that store, or are configured to store, computer program instructions/code for separating the logical storage of database files in a distributed database system from their physical storage, as described herein. In various embodiments, memory(ies)/storage 206 also store other information and data described in this disclosure, including but not limited to embodiments of storage manager 238 and/or data slicer 210, including one or more of the components described herein.

[0062] Storage device 136 may be internal and/or external storage of any type, such as those disclosed herein. In embodiments, storage device 136 stores one or more files 122, which include database objects or database files and can be accessed only by or through a page server of page servers 108. In embodiments, storage device 136 may also store files 122 and/or portions of files provided from one or more page servers in response to requests from compute node 104.

[0063] Network interface 228 can be any type or number of wired and/or wireless network adapters, modems, etc., configured to enable compute node 104 to communicate among its own components and to communicate with other devices and/or systems over a network, such as communication between compute node 104 and the other devices, systems, and hosts of system 100 of Figure 1.

[0064] According to an embodiment, compute node 104 also includes additional components (not shown for brevity and clarity), including but not limited to components and sub-components of other devices and/or systems herein, as well as those described below with respect to Figure 9, such as an operating system.

[0065] Endpoint manager 220 of storage manager 238 of compute node 104, as depicted in system 200 of Figure 2, will now be described in conjunction with Figure 3. Figure 3 depicts a block diagram 300 of an example endpoint mapping 216 for logical database files 318-328 stored in database 302 of distributed database system 100 of Figure 1, according to an example embodiment.

[0066] Block diagram 300 also includes page server A 314 and page server C 316, which are instances of page servers 108. Block diagram 300 also includes storage device 136 shown in Figure 1 and Figure 2.

[0067] Each of logical database files 318-322 is part of filegroup 304. Similarly, logical database files 324-328 are depicted in Figure 3 as part of filegroup 306. Each of logical database files 318-328 is also associated with a file ID, as shown in the depiction of each database file. More specifically, logical database files 318-328 are associated with file IDs M, 11, 10, N, 2, and 8, respectively.

[0068] With continued reference to Figure 2, endpoint manager 220 of storage manager 238 is configured to create and maintain an endpoint mapping such as endpoint mapping 216 shown in Figure 3. More specifically, when a logical database file is created in database 302 (e.g., through a query, a command, an import, a migration, a copy, or another means of placing data into a new database file), an embodiment of storage manager 238 of compute node 104 can be configured to operate in conjunction with, for example, operation processor 222 to slice the logical database file into slices, each of which is managed by one of page servers 108. Thus, a slice comprises a contiguous sub-part of the logical database file. A slice is the basic unit of storage allocation for a logical file. In different embodiments, slices can have different sizes. Likewise, a single embodiment may support slices of more than one size (e.g., both 16GB and 128GB slices). Of course, a logical database file can be smaller than, for example, 16GB, in which case the file fits within a single slice and is managed by only a single page server. However, as will be discussed further below, embodiments further decompose the data stored in slices into smaller physical storage units to achieve the previously described benefits of parallelism.

[0069] With continued reference to Figure 3, each slice is given an entry in endpoint mapping 216. Each such entry includes an identifier that uniquely identifies the logical database file to which the slice belongs. These identifiers are shown, for example, as entries in the file ID 308 column of endpoint mapping 216.

[0070] The entries in endpoint mapping 216 also include a range identifier, which corresponds to the slice in question and indicates, for example, the range of pages covered by that slice and thus the location of the data for that slice within the logical database file. For example, the column representing range 310 in endpoint mapping 216 includes the range identifier. Note that the page range does not need to correspond 1:1 to the slice size.

[0071] Finally, each entry in endpoint mapping 216 includes the endpoint address of the page server assigned to manage that slice. For example, the endpoint address 312 column of endpoint mapping 216 shown in Figure 3 holds these endpoint address entries. The entries in endpoint address 312 are formatted to specify the protocol (i.e., "TCP") and the page server name (e.g., "Page Server A"). However, it should be understood that endpoint addresses need not follow such a format, and any type of identifier that uniquely identifies the page server in a given entry will suffice.

[0072] Furthermore, the slice size need not be constant. For example, logical database files 322 and 328 of database 302 shown in Figure 3 are divided into ranges of 1000, 8000, or 24000 storage units. In different embodiments, the storage unit in which ranges are expressed can differ. For example, a storage unit may be a block of a fixed number of gigabytes. Alternatively, the storage unit of a range may be a data page (e.g., a typical 8-kilobyte data page, as in Microsoft SQL Server).

[0073] With continued reference to Figure 2, endpoint manager 220 is responsible for assigning each slice to a specific page server. For example, in endpoint mapping 216 depicted in Figure 3, the first 8000 storage units of logical database file 328 (i.e., the file with file ID=8) are assigned to endpoint TCP://Page Server C, as shown in the third row of endpoint mapping 216. In one embodiment, storage manager 238 is configured to receive or possess information about the performance capabilities of each existing or potentially created page server, and may be further configured to assign larger ranges to more capable page server instances (e.g., those with more processing power, RAM, cache, network bandwidth, etc.).

[0074] In different embodiments, endpoint mapping 216 is used for various purposes. For example, compute node 104 can be invoked to satisfy a query that requires reading data from logical file 328. In such a case, operation processor 222 of compute node 104 shown in Figure 2 can determine that data from one or more pages in the range [0, 7999] needs to be retrieved. Endpoint mapping 216 is then consulted to determine which page server holds the relevant data, and the request (e.g., a GetPage call) can be sent to the correct page server. In the example above, such a request would be sent to page server C 316.
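The lookup described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `EndpointEntry` type, the `find_endpoint` function, and the linear scan are assumptions made for illustration. Only the row for file ID 8 with range [0, 7999] and page server C comes from the description; the other rows are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class EndpointEntry:
    file_id: str      # identifies the logical database file the slice belongs to
    first_unit: int   # first storage unit (e.g., page) covered by the slice, inclusive
    last_unit: int    # last storage unit covered by the slice, inclusive
    endpoint: str     # address of the page server that manages the slice


def find_endpoint(mapping: List[EndpointEntry], file_id: str, unit: int) -> str:
    """Return the endpoint of the page server whose slice covers `unit`."""
    for entry in mapping:
        if entry.file_id == file_id and entry.first_unit <= unit <= entry.last_unit:
            return entry.endpoint
    raise KeyError(f"no slice covers unit {unit} of file {file_id}")


ENDPOINT_MAPPING = [
    # From the description: the first 8000 units of file ID 8 go to page server C.
    EndpointEntry("8", 0, 7999, "TCP://Page Server C"),
    # Hypothetical rows, present only so the lookup has several candidates.
    EndpointEntry("8", 8000, 31999, "TCP://Page Server A"),
    EndpointEntry("10", 0, 999, "TCP://Page Server A"),
]

# A GetPage-style request for unit 4242 of file ID 8 would be sent here.
print(find_endpoint(ENDPOINT_MAPPING, "8", 4242))
```

A production mapping would more plausibly use a sorted structure with binary search per file; the linear scan is kept only for readability.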

[0075] In another embodiment, and with reference to Figure 1, log server 110 can be configured to receive and use a copy of endpoint mapping 216 to send log updates for a given database update only to the relevant page server (in contrast to all page servers receiving all log updates and each page server selectively ignoring those log updates that fall outside its own range).
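As a rough sketch of this routing idea, the following Python groups log records by the page server whose slice covers the affected page, so that each server receives only its own records. The tuple layouts, the function name `route_log_records`, and the sample mapping rows other than file ID 8, range [0, 7999], page server C are assumptions for illustration only.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

MappingRow = Tuple[str, int, int, str]   # (file_id, first_page, last_page, endpoint)
LogRecord = Tuple[str, int, str]         # (file_id, page, payload)


def route_log_records(mapping: List[MappingRow],
                      records: List[LogRecord]) -> Dict[str, List[LogRecord]]:
    """Group log records by the endpoint whose slice covers the record's page."""
    routed: Dict[str, List[LogRecord]] = defaultdict(list)
    for record in records:
        file_id, page, _payload = record
        for row_file, first, last, endpoint in mapping:
            if row_file == file_id and first <= page <= last:
                routed[endpoint].append(record)
                break
        else:
            raise KeyError(f"no slice covers page {page} of file {file_id}")
    return dict(routed)


mapping = [
    ("8", 0, 7999, "TCP://Page Server C"),       # from the description
    ("8", 8000, 31999, "TCP://Page Server A"),   # hypothetical row
]
records = [("8", 10, "update-1"), ("8", 9000, "update-2"), ("8", 11, "update-3")]
routed = route_log_records(mapping, records)
```

With this grouping, page server C would receive two records and page server A one, rather than both servers receiving all three and discarding the irrelevant ones.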

[0076] As described above, embodiments theoretically support slices of any size. For example, in one embodiment, the slice size could be 16GB or 128GB, and a slice could (as generally described above) be further processed to break it into smaller units, to rearrange the order of its data in physical storage, or both. For example, a slice could include a cell. A cell is a unit of consistency that can be maintained independently of other cells. That is, a cell includes not only the data stored within it, but also metadata that allows the page server to perform I/O operations on that data. Therefore, cells from the same logical file need not be managed by the same page server. For example, consider Figure 4.

[0077] Figure 4 depicts an example distribution 400 of cells 404-1 to 404-6 of database file 402 among multiple page servers 108-1 to 108-4 of distributed database system 100 of Figure 1, according to an example embodiment. As described above, slices of logical database files are distributed across different page servers for management. In the distribution depicted in Figure 4, each cell is also a slice. This is because a slice must include at least one cell (but can have more), and in the example of Figure 4, each slice comprises exactly one cell. Therefore, in the depicted embodiment, a single cell can hold 16GB, as this is the capacity of a single slice. On the other hand, when a slice is configured to contain 128GB, such a slice will comprise 8 cells (i.e., since 128GB / 16GB = 8). In that instance, all cells of the slice will be managed by the same page server, because a slice is mapped 1:1 to a page server instance (note: this configuration is not shown in Figure 4). Of course, in some other embodiments, a slice may be larger or smaller than 128GB and may include an appropriate number of cells depending on the cell size chosen for the embodiment.
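The slice and cell arithmetic above can be checked with a few lines of Python. This is only a sketch under the running example's assumption of 16GB cells; the helper names `cells_per_slice` and `slices_needed` are illustrative, and the 96GB and 10GB file sizes in the usage lines are hypothetical.

```python
GB = 1024 ** 3
CELL_SIZE = 16 * GB  # cell size used in the running example


def cells_per_slice(slice_size: int, cell_size: int = CELL_SIZE) -> int:
    """Number of cells in a slice; a slice must hold a whole number of cells."""
    if slice_size < cell_size or slice_size % cell_size != 0:
        raise ValueError("slice size must be a positive multiple of the cell size")
    return slice_size // cell_size


def slices_needed(file_size: int, slice_size: int) -> int:
    """Number of slices needed for a file; even a tiny file occupies one slice."""
    return max(1, -(-file_size // slice_size))  # ceiling division


print(cells_per_slice(128 * GB))          # 8 cells, as in 128GB / 16GB = 8
print(slices_needed(96 * GB, 16 * GB))    # hypothetical 96GB file: 6 slices
print(slices_needed(10 * GB, 16 * GB))    # a file under 16GB fits in 1 slice
```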

[0078] In summary, embodiments are configured to store logical database files across different page servers by breaking them down into 16GB or 128GB slices, where such slices themselves comprise one or more 16GB cells. In one embodiment, a cell can be further broken down into four 4GB stripes (i.e., blobs or files), and such stripes can also be strided to provide further performance enhancements. A stride is a sequential block of data read from (or written to) one stripe before (or simultaneously with) continuing the sequential operation on the next stripe. Such a storage configuration can be better understood in the context of Figure 5.

[0079] Figure 5 depicts an example data organization of contiguous blocks 502 of logical database file 402 into a cell 504, a striped cell 506, and a striped/strided cell 524, according to an example embodiment. It is assumed that each of contiguous blocks 502, numbered 1 to 16, comprises 1GB of data. Thus, all of contiguous blocks 502 numbered 1 to 16 can be stored together in a single cell (since a cell holds 16GB in the running example), as depicted by cell 504.

[0080] However, if cell 504 is stored as a single physical file, reading data from, for example, block 6 would require reading the entire cell 504. To address this issue, embodiments physically store each cell in a collection of different physical files called stripes, where each stripe is stored on a different physical storage device (e.g., SSD or HDD). In one embodiment, the stripes forming a cell may all have the same size, although in alternative embodiments, the stripes forming a cell may have different sizes. Such an arrangement is depicted as striped cell 506, where the blocks of cell 504 are divided among stripes 508 to 514. Arranging data as shown in striped cell 506 has the advantage that reading a random block (e.g., block 6) requires reading only the stripe containing that block, which in this example is only 1/4 the size of the entire cell. Furthermore, since each stripe is stored on a different device, reading the entire cell can be up to 4 times faster due to parallel reading. However, further optimizations are possible.

[0081] The problem with striped cell 506 is that parallel reads are not guaranteed when only a contiguous portion of the cell is read. For example, a read of blocks 1-4 of striped cell 506 cannot be parallelized, because all of those blocks are stored in stripe 508 and therefore on a single storage device. Similarly, a read of blocks 3-6 can proceed in parallel across only the two storage devices corresponding to stripes 508 and 510.

[0082] However, embodiments can stride the blocks across stripes, for example in the manner shown by striped/strided cell 524 of Figure 5. A stride is a sequential block/segment of data read from one disk (stripe) before the sequential read operation continues to the next disk (stripe). Striped/strided cell 524 organizes blocks 1-16 of contiguous blocks 502 into a cell comprising 4 stripes, but also strides the data across those stripes in a round-robin manner to produce strided stripes 516 to 522. As shown in Figure 5, the data stored in strided stripes 516 to 522 is distributed such that consecutive blocks are *always* stored in different stripes (and therefore on different storage devices). Therefore, sequential reads will always involve multiple stripes and thus multiple physical storage devices.

[0083] For example, consider reading blocks 1-4 as described above. In that case, each stripe of striped/strided cell 524 can be read in parallel, which should be 4× faster than performing the otherwise identical read using striped cell 506. Similarly, reading blocks 3-6 would also involve all four stripes of striped/strided cell 524, instead of just the two stripes of striped cell 506 as described above, to achieve a 2× speedup. In the context of Figure 5, each block is a stride. Therefore, block 1 is read from strided stripe 516 (e.g., on its own disk), and the sequential read operation then continues by reading block 2 from strided stripe 518.
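The difference between the two layouts can be made concrete with a small Python sketch. It assumes the running example of 16 one-gigabyte blocks and 4 stripes per cell, and it numbers stripes 0 to 3 rather than using the figure's reference numerals; the function names are illustrative.

```python
BLOCKS_PER_CELL = 16
STRIPES_PER_CELL = 4
BLOCKS_PER_STRIPE = BLOCKS_PER_CELL // STRIPES_PER_CELL


def striped_index(block: int) -> int:
    """Plain striping: blocks 1-4 in stripe 0, blocks 5-8 in stripe 1, and so on."""
    return (block - 1) // BLOCKS_PER_STRIPE


def strided_index(block: int) -> int:
    """Round-robin striding: consecutive blocks land in consecutive stripes."""
    return (block - 1) % STRIPES_PER_CELL


def stripes_touched(first_block: int, last_block: int, layout) -> set:
    """Set of stripes (hence devices) a sequential read of the blocks involves."""
    return {layout(b) for b in range(first_block, last_block + 1)}


for first, last in [(1, 4), (3, 6)]:
    plain = stripes_touched(first, last, striped_index)
    strided = stripes_touched(first, last, strided_index)
    print(f"blocks {first}-{last}: striped uses {len(plain)} stripe(s), "
          f"strided uses {len(strided)}")
```

For blocks 1-4 the plain layout touches one stripe and the strided layout four; for blocks 3-6 the counts are two and four, which matches the 4× and 2× figures in the paragraph above under the assumption that read time scales inversely with the number of devices read in parallel.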

[0084] Note that the stride size of striped/strided cell 524 is depicted in Figure 5 as being the same as the block size: 1GB. However, embodiments can use strides of smaller granularity (e.g., 1MB), which would significantly increase the probability that extended or random reads use multiple stripes (and therefore multiple physical devices). For example, each 1MB portion of each of contiguous blocks 502 could be strided across the strided stripes in a round-robin manner.
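To illustrate why a finer stride helps, the following Python sketch maps a byte offset within a cell to a stripe and an offset inside that stripe, assuming round-robin striding over 4 stripes. The function names and the 3MB sample read are hypothetical; only the 1MB and 1GB stride sizes come from the description.

```python
MB = 1024 ** 2
GB = 1024 ** 3


def locate(offset: int, stride_size: int = MB, stripe_count: int = 4):
    """Map a byte offset within a cell to (stripe index, byte offset in stripe)."""
    stride_number, within_stride = divmod(offset, stride_size)
    round_number, stripe = divmod(stride_number, stripe_count)
    return stripe, round_number * stride_size + within_stride


def stripes_for_read(offset: int, length: int,
                     stride_size: int = MB, stripe_count: int = 4) -> set:
    """Stripes involved in reading `length` bytes starting at `offset`."""
    first = offset // stride_size
    last = (offset + length - 1) // stride_size
    return {stride % stripe_count for stride in range(first, last + 1)}


# A hypothetical 3MB read starting half a megabyte into the cell.
print(stripes_for_read(MB // 2, 3 * MB, stride_size=MB))  # all four stripes
print(stripes_for_read(MB // 2, 3 * MB, stride_size=GB))  # a single stripe
```

With 1GB strides the same small read stays on one device, whereas with 1MB strides it is spread over all four.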

[0085] In embodiments, distributed database system 100 of Figure 1 (including page servers 108 and compute node 104 as depicted in Figure 2) can operate in various ways to separate the logical storage and physical storage of database files. For example, Figure 6 depicts a flowchart 600 of a method for separating the logical and physical storage of database files stored in distributed database system 100 of Figure 1, according to an example embodiment. Flowchart 600 is described with continued reference to Figure 1 and Figure 2. However, based on the following discussion of flowchart 600, other structural and operational embodiments will be apparent to those skilled in the art.

[0086] Flowchart 600 begins at step 602. In step 602, data comprising a first logical database file is sliced into a first set of slices, each slice being associated with a corresponding page server among a plurality of page servers. For example, and with continued reference to distributed database system 100 and page servers 108 of Figure 1 and compute node 104 of Figure 2, data slicer 210 of storage manager 238 of compute node 104 shown in Figure 2 can be configured to slice the database file into slices in the manner described above.

[0087] More specifically, data slicer 210 can slice the database file into slices of a predetermined size (e.g., 16GB or 128GB in some embodiments). Also as described above herein, each such slice is subsequently assigned to a page server of page server 108 for storage and management. Flowchart 600 continues to step 604.

[0088] At step 604, an endpoint mapping for each slice in the first set of slices is maintained. For example, and with continued reference to distributed database system 100 and page servers 108 of Figure 1 and compute node 104 of Figure 2, endpoint manager 220 of storage manager 238 of compute node 104 can be configured to create and manage endpoint mapping 216.

[0089] As described above, endpoint mapping 216 may include a lookup table or other data structure that includes, for each slice, a file identifier, a slice range, and an endpoint address. The file identifier identifies the database file from which the slice was created. The slice range specifies the location of the slice's data within the logical database file corresponding to the file identifier. The endpoint address is the address of the page server to which the slice is assigned. Flowchart 600 ends at step 606.

[0090] At step 606, the data corresponding to each slice is sent to the page server associated with that slice for storage. For example, and with continued reference to distributed database system 100 and page servers 108 of Figure 1 and compute node 104 of Figure 2, one or more of storage manager 238 and/or operation processor 222 can be configured to cause the data corresponding to each slice to reach its corresponding page server as indicated by endpoint mapping 216.

[0091] In one embodiment, such data may come directly from compute node 104, while in other embodiments, compute node 104 may indirectly deliver such data to the target page server. For example, compute node 104 may be configured to cause one or more auxiliary compute nodes 106-1 to 106-N or one or more page servers 108-1 to 108-N to directly pass such data to the target page server through a query operation, a push-down query, a query fragment, or some other operation.
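The three steps of flowchart 600 (slice, maintain the mapping, send) can be sketched end to end in Python with an in-memory stand-in for the page servers. Everything here is illustrative: the function names, the tuple layout, the round-robin assignment of slices to endpoints, and the tiny 4-byte slice size are assumptions, not details taken from the description.

```python
from typing import Dict, List, Tuple

# (file_id, first_byte, last_byte, endpoint); byte ranges stand in for page ranges.
MappingRow = Tuple[str, int, int, str]
# endpoint -> {(file_id, first_byte): slice data}; simulates page server storage.
Servers = Dict[str, Dict[Tuple[str, int], bytes]]


def store_logical_file(file_id: str, data: bytes, slice_size: int,
                       endpoints: List[str], servers: Servers) -> List[MappingRow]:
    """Slice `data`, record one mapping row per slice, and 'send' each slice."""
    mapping: List[MappingRow] = []
    for i, start in enumerate(range(0, len(data), slice_size)):
        chunk = data[start:start + slice_size]                     # step 602: slice
        endpoint = endpoints[i % len(endpoints)]                   # assumed policy
        mapping.append((file_id, start, start + len(chunk) - 1, endpoint))  # step 604
        servers.setdefault(endpoint, {})[(file_id, start)] = chunk          # step 606
    return mapping


def read_logical_file(mapping: List[MappingRow], servers: Servers) -> bytes:
    """Reassemble the logical file by consulting the mapping in range order."""
    rows = sorted(mapping, key=lambda row: row[1])
    return b"".join(servers[ep][(fid, first)] for fid, first, _last, ep in rows)


servers: Servers = {}
data = bytes(range(10))
mapping = store_logical_file("8", data, 4, ["TCP://A", "TCP://B"], servers)
assert read_logical_file(mapping, servers) == data
```

The logical file remains a single byte range from the reader's point of view, while its physical placement is determined entirely by the mapping rows.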

[0092] Flowchart 600 of Figure 6 illustrates a method for separating the logical and physical storage of a database file according to an embodiment. As described above, in embodiments, different database files can be stored using different storage configurations, as shown in flowchart 700 of Figure 7. Flowchart 700 depicts a refinement of the method of flowchart 600 of Figure 6 that provides multiple physical storage configurations. In one embodiment, distributed database system 100 of Figure 1 is configured to operate according to flowchart 700. Further structural and operational examples will be apparent to those skilled in the art based on the following description.

[0093] Flowchart 700 begins at step 702. In step 702, the first logical database file is stored using a first storage configuration. For example, and with continued reference to distributed database system 100 and page servers 108 of Figure 1 and compute node 104 of Figure 2, embodiments can be configured in various ways to use different hardware and/or hardware configurations, as described above.

[0094] More specifically, a page server of page servers 108 can be configured to store data for its slices both locally and remotely. For example, page server A 314 is shown in Figure 3 as being configured to locally cache data corresponding to slices of database file 322, in addition to remotely storing such data in storage device 136. In other embodiments, other storage configurations are of course possible. For example, a storage configuration may utilize inexpensive commodity storage hardware, or it may instead utilize state-of-the-art solid-state devices and supporting infrastructure to achieve extremely high transaction rates. Flowchart 700 ends at step 704.

[0095] At step 704, the second logical database file is stored using a second storage configuration. For example, and with continued reference to distributed database system 100 and page servers 108 of Figure 1 and compute node 104 of Figure 2, embodiments can be configured in various ways to use different hardware and/or configurations for different database files or filegroups, as described above. For example, as shown in Figure 3, the storage of file 328 is managed by page server C 316. Page server C 316 does not include a local cache. The absence of a local cache will generally result in slower access times, but will also result in lower cost.

[0096] In embodiments, the storage configuration of a database file (such as one established per flowchart 700 of Figure 7) can be changed, and such changes can be implemented in various ways. For example, Figure 8 depicts a flowchart 800 of a refinement to flowchart 700 of Figure 7 that provides a method for changing the physical storage configuration of data stored in a distributed database, according to an example embodiment. In one embodiment, distributed database system 100 of Figure 1 can be configured to operate according to flowchart 800. Flowchart 800 is described with continued reference to Figure 1 and Figure 2. However, based on the following discussion of flowchart 800, other structural and operational embodiments will be apparent to those skilled in the art.

[0097] Flowchart 800 includes step 802. In step 802, the storage of the first logical database file or the second logical database file is changed to use a third storage configuration by a) moving the slices that make up the corresponding logical database file to one or more new endpoint addresses and updating the endpoint mapping of each such slice to one of the one or more new endpoint addresses, or b) changing the hardware configuration of the first storage configuration or the second storage configuration, respectively, to the third storage configuration.

[0098] For example, and with continued reference to distributed database system 100 and page servers 108 of Figure 1 and compute node 104 of Figure 2, the storage configuration for different files or filegroups can be changed in the various ways described above. For example, slices can be moved to one or more different page servers with different configurations (e.g., with or without local caching). With reference to Figure 3, database file 328, currently managed by page server C 316, can be copied to page server A 314, which includes a local cache, and once such copying is complete, endpoint mapping 216 is updated to reflect the new endpoint.
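The copy-then-remap order described above can be sketched in Python as follows. This is a simplified, single-threaded illustration: the dictionary-based mapping and server stores, the function name `move_slice`, and the removal of the old copy after the switch are assumptions rather than details given in the description.

```python
from typing import Dict, Tuple

SliceKey = Tuple[str, int]                 # (file_id, first unit of the slice)
Mapping = Dict[SliceKey, str]              # slice -> endpoint address
Servers = Dict[str, Dict[SliceKey, bytes]]  # endpoint -> slices it holds


def move_slice(mapping: Mapping, servers: Servers,
               key: SliceKey, new_endpoint: str) -> None:
    """Copy a slice to a new page server, then point the mapping at it."""
    old_endpoint = mapping[key]
    if old_endpoint == new_endpoint:
        return
    # 1. Copy first, so readers can keep using the old endpoint meanwhile.
    servers.setdefault(new_endpoint, {})[key] = servers[old_endpoint][key]
    # 2. Only once the copy is complete, update the endpoint mapping.
    mapping[key] = new_endpoint
    # 3. Assumed cleanup step: drop the now-unreferenced old copy.
    del servers[old_endpoint][key]


key = ("8", 0)
mapping: Mapping = {key: "TCP://Page Server C"}
servers: Servers = {"TCP://Page Server C": {key: b"slice-data"}}
move_slice(mapping, servers, key, "TCP://Page Server A")
```

After the call, lookups for the slice resolve to page server A, and the logical file's identifier and range are unchanged.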

[0099] Alternatively, a slice can continue to be managed by the same page server of page servers 108, while the storage configuration that page server uses for the slice is changed. For example, and with reference to Figure 3, the storage configuration for database file 322, which is managed by page server A 314, can be changed by disabling the local cache, while the stored slices remain in place.

[0100] III. Example Computer System Implementation

[0101] The embodiments described herein, including but not limited to distributed database system 100 of Figure 1, master compute node 104, auxiliary compute nodes 106-1 to 106-N, log server 110, page servers 108-1 to 108-N of page servers 108, or storage device 136, or storage manager 238, data slicer 210, or endpoint manager 220 of compute node 104 of Figure 2, together with any of their components and/or subcomponents, and any operations and portions of the flowcharts/flow diagrams described herein and/or other examples described herein, may be implemented in hardware, or in hardware combined with software and/or firmware. This includes implementation as computer program code/instructions configured to be executed in one or more processors and stored in a computer-readable storage medium, or as hardware logic/electronic circuitry, such as in a system-on-a-chip (SoC), field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), trusted platform module (TPM), etc. An SoC may include an integrated circuit chip that includes one or more of a processor (e.g., a microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and/or additional circuitry and/or embedded firmware to perform its functions.

[0102] The embodiments described herein can be implemented in one or more computing devices similar to mobile systems and / or computing devices in fixed or mobile computer embodiments, including one or more features of the mobile systems and / or computing devices described herein, as well as alternative features. The description of the computing devices provided herein is provided for illustrative purposes and is not intended to be limiting. Embodiments can be implemented in other types of computer systems, such as those known to those skilled in the art.

[0104] Figure 9 depicts an exemplary implementation of a computing device 900 in which embodiments may be implemented. For example, the embodiments described herein may be implemented in one or more computing devices similar to computing device 900 in fixed or mobile computer embodiments, including one or more features and/or alternative features of computing device 900. The description of computing device 900 provided herein is for illustrative purposes and is not intended to be limiting. Embodiments may be implemented in other types of computer systems, as will be known to those skilled in the art.

[0105] As shown in Figure 9, computing device 900 includes one or more processors (referred to as processor circuitry 902), system memory 904, and a bus 906 coupling various system components, including system memory 904, to processor circuitry 902. Processor circuitry 902 is electrical and/or optical circuitry implemented in one or more physical hardware electronic circuit device elements and/or integrated circuit devices (semiconductor material chips or dies) as a central processing unit (CPU), microcontroller, microprocessor, and/or other physical hardware processor circuitry. Processor circuitry 902 can execute program code stored in a computer-readable medium, such as program code of operating system 930, application programs 932, other programs 934, etc. Bus 906 represents any one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of various bus architectures. System memory 904 includes read-only memory (ROM) 908 and random access memory (RAM) 910. Basic input/output system 912 (BIOS) is stored in ROM 908.

[0106] The computing device 900 also includes one or more of the following drives: a hard disk drive 914 for reading from and writing to a hard disk, a disk drive 916 for reading from or writing to a removable disk 918, and an optical disc drive 920 for reading from or writing to a removable optical disc 922 (such as a CD-ROM, DVD-ROM, or other optical media). The hard disk drive 914, disk drive 916, and optical disc drive 920 are connected to the bus 906 via a hard disk drive interface 924, a disk drive interface 926, and an optical drive interface 928, respectively. The drives and their associated computer-readable media provide the computer with non-volatile storage of computer-readable instructions, data structures, program modules, and other data. Although hard disks, removable disks, and removable optical discs are described, other types of hardware-based computer-readable storage media may also be used to store data, such as flash memory cards, digital video discs, RAM, ROM, and other hardware storage media.

[0107] Multiple program modules may be stored on a hard disk, magnetic disk, optical disk, ROM, or RAM. These programs include an operating system 930, one or more application programs 932, other programs 934, and program data 936. Application programs 932 or other programs 934 may include, for example, computer program logic (e.g., computer program code or instructions) for implementing the embodiments described herein, such as, but not limited to, distributed database system 100 of Figure 1, master compute node 104, auxiliary compute nodes 106-1 to 106-N, log server 110, page servers 108-1 to 108-N, or storage device 136, or storage manager 238, data slicer 210, or endpoint manager 220 of compute node 104 of Figure 2, together with any of their components and/or subcomponents, and the flowcharts/flow diagrams described herein, including portions thereof, and/or other examples described herein.

[0108] Users can input commands and information into computing device 900 using input devices such as keyboard 938 and pointing device 940. Other input devices (not shown) may include microphones, joysticks, game controllers, satellite antennas, scanners, touchscreens and / or touchpads, voice recognition systems for receiving voice input, gesture recognition systems for receiving gesture input, etc. These and other input devices are typically connected to processor circuitry 902 via serial port interface 942 coupled to bus 906, but may also be connected via other interfaces such as parallel ports, game ports, or Universal Serial Bus (USB).

[0109] Display screen 944 is also connected to bus 906 via an interface (such as video adapter 946). Display screen 944 can be external to computing device 900 or incorporated into computing device 900. Display screen 944 can display information and can serve as a user interface for receiving user commands and / or other information (e.g., via touch, finger gestures, virtual keyboard, etc.). In addition to display screen 944, computing device 900 may also include other peripheral output devices (not shown), such as speakers and printers.

[0110] Computing device 900 is connected to network 948 (e.g., the Internet) via an adapter or network interface 950, modem 952, or other components for establishing communication over the network. Modem 952 (which may be internal or external) can be connected to bus 906 via serial port interface 942, as shown in Figure 9, or can be connected to bus 906 using another interface type, including a parallel interface.

[0111] As used herein, the terms "computer program medium," "computer-readable medium," and "computer-readable storage medium" refer to physical hardware media, such as a hard disk associated with hard disk drive 914, removable disk 918, removable optical disk 922, other physical hardware media such as RAM, ROM, flash memory cards, digital video disks, zip disks, MEMs, nanotechnology-based storage devices, and other types of physical/tangible hardware storage media. Such computer-readable storage media are distinct from and do not overlap with communication media and propagating signals (that is, they exclude communication media and propagating signals). Communication media embody computer-readable instructions, data structures, program modules, or other data in modulated data signals (such as carrier waves). The term "modulated data signal" refers to a signal that has one or more of its characteristics set or altered in a manner that encodes information in the signal. By way of example and not limitation, communication media include wireless media, such as acoustic, RF, infrared, and other wireless media, as well as wired media. Embodiments also pertain to such communication media, which are separate from and do not overlap with embodiments pertaining to computer-readable storage media.

[0112] As described above, computer programs and modules (including application program 932 and other programs 934) can be stored on a hard disk, magnetic disk, optical disk, ROM, RAM, or other hardware storage medium. Such computer programs can also be received via network interface 950, serial port interface 942, or any other interface type. When executed or loaded by an application, such computer programs enable computing device 900 to implement the features of the embodiments described herein. Such computer programs therefore represent controllers of computing device 900.

[0113] The embodiments also relate to computer program products that include computer code or instructions stored on any computer-readable medium. Such computer program products include hard disk drives, optical disk drives, memory device packages, portable memory sticks, memory cards, and other types of physical storage hardware.

[0114] IV. Additional Example Implementations

[0115] This document provides a distributed database system comprising a distributed database configured to store a first logical database file, the first logical database file including data and being associated with a file identifier. The distributed database system includes: a plurality of page servers, each configurable to store at least one slice including a portion of the first logical database file; and a compute node coupled to the plurality of page servers and configured to store the first logical database file in response to a received command. The storing includes: slicing the data comprising the first logical database file into a first set of slices, each slice being associated with a corresponding page server among the plurality of page servers; maintaining an endpoint mapping for each slice in the first set of slices; and sending the data corresponding to each slice to the page server associated with that slice for storage therein.
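By way of illustration only, the three storing steps above (slicing, maintaining an endpoint mapping per slice, and sending each slice to its page server) can be sketched as follows. This is a minimal sketch under stated assumptions, not the disclosed implementation: the fixed slice size, the round-robin assignment of slices to page servers, the in-memory dictionaries standing in for page servers, and all names (`EndpointMapping`, `store_logical_file`, the `ps-0:1433` style addresses) are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EndpointMapping:
    file_id: int    # identifier of the logical database file
    slice_id: int   # position of the slice within that file
    endpoint: str   # address of the page server holding the slice


def store_logical_file(file_id, data, slice_size, page_servers):
    """Slice `data`, record one mapping per slice, and send each slice out.

    `page_servers` maps a hypothetical endpoint address to a dict that
    stands in for that page server's storage.
    """
    endpoints = sorted(page_servers)
    mappings = []
    for slice_id, start in enumerate(range(0, len(data), slice_size)):
        chunk = data[start:start + slice_size]            # contiguous portion
        endpoint = endpoints[slice_id % len(endpoints)]   # assumed round-robin
        mappings.append(EndpointMapping(file_id, slice_id, endpoint))
        page_servers[endpoint][(file_id, slice_id)] = chunk  # "send" the slice
    return mappings


# Two stand-in page servers and a 10-byte logical file cut into 4-byte slices.
servers = {"ps-0:1433": {}, "ps-1:1433": {}}
table = store_logical_file(file_id=1, data=bytes(range(10)), slice_size=4,
                           page_servers=servers)
```

Because the compute node keeps only the mapping table, the logical file can be reassembled by visiting the mappings in slice order, regardless of which page server physically holds each slice.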

[0116] In one embodiment of the aforementioned distributed database system, the endpoint mapping includes: a database file identifier; a slice identifier, which specifies the location of the data corresponding to the corresponding slice within the logical database file corresponding to the database file identifier; and an endpoint address corresponding to the page server associated with the corresponding slice.
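As a hedged illustration of how such a three-field endpoint mapping might be consulted, the sketch below resolves a page number of a logical file to the page server that stores it. The fixed slice width `PAGES_PER_SLICE`, the derivation of the slice identifier from the page number, and the names `build_index` and `resolve_endpoint` are assumptions introduced for this example; the embodiments do not prescribe them.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class EndpointMapping(NamedTuple):
    file_id: int    # database file identifier
    slice_id: int   # locates the slice's data within the logical file
    endpoint: str   # address of the page server that stores the slice


PAGES_PER_SLICE = 1000  # assumed fixed slice width, in pages


def build_index(
    mappings: Iterable[EndpointMapping],
) -> Dict[Tuple[int, int], EndpointMapping]:
    """Index the mappings by (file identifier, slice identifier)."""
    return {(m.file_id, m.slice_id): m for m in mappings}


def resolve_endpoint(index, file_id: int, page_no: int) -> str:
    """Return the endpoint address of the page server holding `page_no`."""
    slice_id = page_no // PAGES_PER_SLICE
    return index[(file_id, slice_id)].endpoint


index = build_index([
    EndpointMapping(7, 0, "pageserver-a"),
    EndpointMapping(7, 1, "pageserver-b"),
])
first = resolve_endpoint(index, 7, 999)    # last page of slice 0
second = resolve_endpoint(index, 7, 1000)  # first page of slice 1
```

Under these assumptions, a request for a page is routed by a single dictionary lookup, and no physical file location appears anywhere in the compute node's view of the logical file.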

[0117] In one embodiment of the aforementioned distributed database system, the compute node is further configured to: store the first logical database file using a first storage configuration; and store a second logical database file in response to a received command, wherein the second logical database file is stored using a second storage configuration.

[0118] In one embodiment of the aforementioned distributed database system, each endpoint address of each endpoint mapping corresponds to one of the first storage configuration or the second storage configuration.

[0119] In the aforementioned embodiments of the distributed database system, the second storage configuration has a slower access time than the first storage configuration.

[0120] In the aforementioned embodiments of the distributed database system, the compute node is further configured to: change the storage of the first logical database file or the second logical database file to use a third storage configuration, either by moving the slices comprising the corresponding logical database file to one or more new endpoint addresses and updating the endpoint mapping of each slice to one of the one or more new endpoint addresses, or by changing the hardware configuration of the first storage configuration or the second storage configuration to the third storage configuration accordingly.
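The first alternative above (moving slices to new endpoint addresses and updating their endpoint mappings) can be sketched as follows. This is an illustrative sketch only: the dictionary-based mapping table, the in-memory stand-ins for page servers, the round-robin placement on the new endpoints, the copy-then-repoint-then-delete ordering, and the names `move_file`, `ssd-0`, `hdd-0`, and `hdd-1` are all assumptions made for the example.

```python
def move_file(file_id, mapping, stores, new_endpoints):
    """Move every slice of `file_id` to `new_endpoints` and update the mapping.

    mapping: dict of (file_id, slice_id) -> endpoint address
    stores:  dict of endpoint address -> dict of (file_id, slice_id) -> bytes
    """
    slice_keys = sorted(key for key in mapping if key[0] == file_id)
    for i, key in enumerate(slice_keys):
        source = mapping[key]
        target = new_endpoints[i % len(new_endpoints)]  # assumed round-robin
        if source == target:
            continue
        stores[target][key] = stores[source][key]  # copy the slice first
        mapping[key] = target                      # then repoint the mapping
        del stores[source][key]                    # finally drop the old copy


# File 1 starts on a fast tier and is moved to two slower endpoints;
# file 2 is left where it is.
stores = {
    "ssd-0": {(1, 0): b"aa", (1, 1): b"bb", (2, 0): b"cc"},
    "hdd-0": {},
    "hdd-1": {},
}
mapping = {(1, 0): "ssd-0", (1, 1): "ssd-0", (2, 0): "ssd-0"}
move_file(1, mapping, stores, ["hdd-0", "hdd-1"])
```

Only the endpoint addresses in the mapping change; the file identifier and slice identifiers that describe the logical file are untouched, which is what allows a storage configuration to be changed per file.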

[0121] In the aforementioned embodiment of the distributed database system, the data of each slice is a contiguous portion of the first logical database file.

[0122] In the aforementioned embodiments of the distributed database system, each slice includes a set of units, and each unit includes a logically consistent storage unit that can be maintained independently of other units in that set.

[0123] In the aforementioned embodiment of the distributed database system, each unit in the set of units includes a set of stripes, and each stripe includes a single physical file stored on a physical device different from the physical device on which each of the other stripes is stored.

[0124] In the aforementioned embodiment of the distributed database system, each stripe includes a set of blocks, each block corresponding to a portion of the corresponding slice that is non-contiguous, within that slice, with any other block in the set.
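One layout that satisfies the non-contiguity property described above is round-robin striping, sketched below for the byte range of a single unit. The round-robin rule, the fixed block size, and the names `stripe_unit` and `locate` are assumptions chosen for illustration; the embodiments require only that the blocks within a stripe be non-contiguous within the slice.

```python
def stripe_unit(unit_data: bytes, block_size: int, n_stripes: int) -> list:
    """Distribute consecutive blocks of `unit_data` across stripes in turn.

    Each returned bytes object stands in for one stripe, i.e. one physical
    file on its own device.
    """
    stripes = [bytearray() for _ in range(n_stripes)]
    for block_no, start in enumerate(range(0, len(unit_data), block_size)):
        stripes[block_no % n_stripes] += unit_data[start:start + block_size]
    return [bytes(s) for s in stripes]


def locate(offset: int, block_size: int, n_stripes: int) -> tuple:
    """Map a logical offset to (stripe number, offset inside that stripe)."""
    block_no, within = divmod(offset, block_size)
    row, stripe_no = divmod(block_no, n_stripes)
    return stripe_no, row * block_size + within


data = b"AAAABBBBCCCCDDDDEEEE"            # five 4-byte blocks
stripes = stripe_unit(data, block_size=4, n_stripes=2)
stripe_no, stripe_offset = locate(9, block_size=4, n_stripes=2)
```

With two or more stripes, logically adjacent blocks always land in different stripes, so a sequential scan of the unit can issue I/O to several physical devices in parallel.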

[0125] This document also provides a method for storing a first logical database file across multiple page servers in a distributed database system. The method includes: slicing the data comprising the first logical database file into a first set of slices, each slice being associated with a corresponding page server among the multiple page servers; maintaining an endpoint mapping for each slice in the first set of slices; and sending the data corresponding to each slice to the page server associated with that slice for storage therein.

[0126] In another embodiment of the aforementioned method, the endpoint mapping includes: a database file identifier; a slice identifier, which specifies the location of the data corresponding to the corresponding slice within the logical database file corresponding to the database file identifier; and an endpoint address corresponding to the page server associated with the corresponding slice.

[0127] Another embodiment of the aforementioned method further includes: using a first storage configuration to store a first logical database file; and using a second storage configuration to store a second logical database file.

[0128] In another embodiment of the aforementioned method, each endpoint address of each endpoint mapping corresponds to one of the first storage configuration or the second storage configuration.

[0129] In another embodiment of the aforementioned method, the second storage configuration has a slower access time than the first storage configuration.

[0130] Another embodiment of the aforementioned method further includes: changing the storage of the first logical database file or the second logical database file to use a third storage configuration, either by moving the slices comprising the corresponding logical database file to one or more new endpoint addresses and updating the endpoint mapping of each slice to one of the one or more new endpoint addresses, or by changing the hardware configuration of the first storage configuration or the second storage configuration to the third storage configuration accordingly.

[0131] In another embodiment of the aforementioned method, the data of each slice is a contiguous portion of the first logical database file.

[0132] In another embodiment of the aforementioned method, each slice includes a set of units, each unit including a logically consistent storage unit that can be maintained independently of other units in that set of units.

[0133] In another embodiment of the aforementioned method, each unit in the set of units includes a set of stripes, each stripe including a single physical file stored on a physical device different from the physical device on which each of the other stripes is stored.

[0134] In another embodiment of the aforementioned method, each stripe includes a set of blocks, each block corresponding to a portion of the corresponding slice that is non-contiguous, within that slice, with any other block in the set.

[0135] V. Conclusion

[0136] While various embodiments of the disclosed subject matter have been described above, it should be understood that they are presented by way of example only and not limitation. Those skilled in the art will understand that various changes in form and detail may be made therein without departing from the spirit and scope of the embodiments defined in the appended claims. Therefore, the breadth and scope of the disclosed subject matter should not be limited by any of the exemplary embodiments described above, but should be defined solely by the appended claims and their equivalents.

Claims

1. A distributed database system comprising a distributed database configured to store a first logical database file, the first logical database file including data and being associated with a file identifier, the distributed database system comprising: a plurality of page servers, each configurable to store at least one slice including a portion of the first logical database file; and a compute node coupled to the plurality of page servers and configured to store the first logical database file in response to a received command, the storing comprising: slicing the data comprising the first logical database file into a first group of units, each unit being associated with a corresponding page server among the plurality of page servers; maintaining a plurality of endpoint mappings for a set of contiguous portions of the first logical database file, each endpoint mapping being maintained for a corresponding contiguous portion within the set of contiguous portions; and sending the data corresponding to each unit to the page server associated with that unit for storage therein; wherein each endpoint mapping includes: a database file identifier corresponding to the first logical database file; a range identifier specifying the location of the corresponding contiguous portion within the first logical database file; and an endpoint address of the page server associated with the corresponding range identifier.

2. The distributed database system according to claim 1, wherein the compute node is further configured to: store the first logical database file using a first storage configuration; and store a second logical database file using a second storage configuration.

3. The distributed database system according to claim 2, wherein each endpoint address of each endpoint mapping corresponds to one of the first storage configuration or the second storage configuration.

4. The distributed database system according to claim 2, wherein the second storage configuration has a slower access time than the first storage configuration.

5. The distributed database system according to claim 4, wherein the compute node is further configured to: change the storage of the first logical database file to use a third storage configuration, either by moving the first group of units to one or more new endpoint addresses and updating the plurality of endpoint mappings corresponding to the first group of units to correspond to the one or more new endpoint addresses, or by changing the hardware configuration of the first storage configuration to the third storage configuration.

6. The distributed database system according to claim 1, wherein each contiguous portion of the first logical database file comprises a subset of the first group of units.

7. The distributed database system of claim 6, wherein each unit in the first group of units comprises a logically consistent storage unit that can be maintained independently of other units in the first group of units.

8. The distributed database system of claim 7, wherein each unit in the first group of units comprises a set of stripes, each stripe being stored on a different physical device than the physical device on which each of the other stripes is stored.

9. The distributed database system of claim 8, wherein each stripe comprises a set of blocks, each block corresponding to a portion of the first logical database file that is non-contiguous, within the first logical database file, with any other block in the set of blocks.

10. The distributed database system according to claim 1, wherein the compute node is further configured to: receive a query requesting a block from the first logical database file; determine, based on the maintained plurality of endpoint mappings, the endpoint address of the page server storing the block; and send a request for the page containing the block to the determined endpoint address.

11. A method for storing a first logical database file across a plurality of page servers in a distributed database system, the first logical database file comprising data and being associated with a first file identifier, the method comprising: slicing the data comprising the first logical database file into a first group of units, each unit being associated with a corresponding page server among the plurality of page servers; maintaining a plurality of endpoint mappings for a set of contiguous portions of the first logical database file, each endpoint mapping being maintained for a corresponding contiguous portion in the set of contiguous portions; and sending the data corresponding to each unit to the page server associated with that unit for storage therein; wherein each endpoint mapping includes: a database file identifier corresponding to the first logical database file; a range identifier specifying the location of the corresponding contiguous portion within the first logical database file; and an endpoint address of the page server associated with the corresponding range identifier.

12. The method of claim 11, further comprising: storing the first logical database file using a first storage configuration; and storing a second logical database file using a second storage configuration.

13. The method of claim 12, wherein each endpoint address of each endpoint mapping corresponds to one of the first storage configuration or the second storage configuration.

14. The method of claim 12, wherein the second storage configuration has a slower access time than the first storage configuration.

15. The method of claim 14, further comprising: changing the storage of the first logical database file to use a third storage configuration, either by moving the first group of units to one or more new endpoint addresses and updating the plurality of endpoint mappings corresponding to the first group of units to correspond to the one or more new endpoint addresses, or by changing the hardware configuration of the first storage configuration to the third storage configuration.

16. The method of claim 11, wherein each contiguous portion of the first logical database file comprises a subset of the first group of units.

17. The method of claim 16, wherein each unit in the first group of units comprises a logically consistent storage unit that can be maintained independently of other units in the first group of units.

18. The method of claim 17, wherein each unit in the first group of units comprises a set of stripes, each stripe being stored on a physical device different from the physical device on which each of the other stripes is stored.

19. The method of claim 18, wherein each stripe comprises a set of blocks, each block corresponding to a portion of the first logical database file that is non-contiguous, within the first logical database file, with any other block in the set of blocks.

20. The method of claim 18, further comprising: receiving a query requesting a block from the first logical database file; determining, based on the maintained plurality of endpoint mappings, the endpoint address of the page server storing the block; and sending a request for the page containing the block to the determined endpoint address.