Relational database organization for sharding
By selecting partitioned key-value sets and sharded databases in the database system, the problems of high computational resource requirements for big data storage and insufficient features of NoSQL databases are solved, realizing high-performance and scalable distributed database management, which is suitable for OLTP systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ORACLE INT CORP
- Filing Date
- 2016-09-30
- Publication Date
- 2026-06-16
AI Technical Summary
Existing database systems require excessive computing resources for a single database instance when storing large amounts of data. NoSQL databases cannot provide the full features of relational databases, leading to performance bottlenecks and management complexity.
By selecting one or more columns as partition keys for a set of tables, multiple key-value sets are generated and sharded databases are established. The partition key-value sets are associated with shards, and the shard coordinator is used to store the mapped data and execute queries. It supports SQL queries and automatic propagation of schema changes, thereby achieving distributed storage and management of data.
It improves the scalability and performance of database systems, reduces points of failure, supports SQL queries, simplifies the management of database schema changes, and is suitable for OLTP systems such as e-commerce, mobile, and SaaS.
Smart Images

Figure CN114780651B_ABST
Abstract
Description
[0001] This application is a divisional application of the invention patent application filed on September 30, 2016, entitled "Organization of a Fragmented Relational Database" and with application number 201680067124.9. Technical Field
[0002] This disclosure relates to database systems. More specifically, this disclosure relates to relational database organization for sharding. Background Technology
[0003] Database systems storing ever-increasing volumes of data are becoming increasingly prevalent. For example, online transaction processing (OLTP) systems, such as e-commerce, mobile, social, and Software-as-a-Service (SaaS) systems, typically require large database storage. Example applications of OLTP systems include, but are not limited to, large billing systems, ticketing systems, online financial services, media companies, online information services, and social media companies. Given the massive amounts of data stored in these database systems, storing all the data in a single database instance is impractical, as this volume would require significant computational resources such as processors, memory, and storage devices.
[0004] Furthermore, non-relational databases (sometimes called NoSQL databases) do not offer the full benefits of relational databases. For example, NoSQL databases do not provide the same schema control, atomicity, consistency, isolation, durability, or full features of relational databases in their Structured Query Language (SQL).
[0005] The methods described in this section are methods that can be followed, but are not necessarily methods that have been previously conceived or followed. Therefore, unless otherwise stated, it should not be assumed that any method described in this section is prior art simply because it is included in this section. Summary of the Invention
[0006] According to one aspect of this disclosure, a method is provided, comprising: selecting one or more columns from a set of tables as partition keys; establishing a plurality of key-value sets, wherein each key-value set in the plurality of key-value sets includes the value of the partition key that does not belong to any key-value set in other key-value sets; generating partitions for each key-value set in the plurality of key-value sets for each table in the set of tables; creating a sharded database comprising a plurality of shards; wherein each shard in the plurality of shards is associated with a different key-value set in the plurality of key-value sets; distributing the partitions of the table in the plurality of shards in the sharded database for each table in the set of tables; wherein, for each table, the partition associated with each key-value set is distributed to the shard associated with that key-value set in the plurality of shards; and storing mapping data that maps each key-value set in the plurality of key-value sets to the shard associated with that key-value set in the plurality of shards.
[0007] According to another aspect of this disclosure, a method is provided, comprising: storing a plurality of mapping entries by a sharding coordinator; wherein each of the plurality of mapping entries maps a different set of shard key values to a shard in a plurality of shards of a sharded database; wherein each of the plurality of shards includes one or more database servers; receiving a first query from a client application at the sharding coordinator, wherein the first query includes a first shard key value; in response to receiving the first query, performing: identifying a specific mapping entry among the plurality of mapping entries corresponding to a set of shard key values including the first shard key value; using the specific mapping entry to identify a target shard from the plurality of shards; generating a first result for the first query by using the target shard corresponding to the set of shard key values including the first shard key value; and sending a response to the first query based on the first result to the client application.
[0008] According to another aspect of this disclosure, a method is provided comprising: performing a first operation relating to data residing in both a first shard and a second shard of a sharded database by: causing a client application to send a first request to the first shard to cause the first shard to perform a first portion of the first operation relating to data in the first shard; and causing the client application to send a second request to a second shard to cause the second shard to perform a second portion of the first operation relating to data in the second shard; performing a second operation relating to data residing in both the first shard and the second shard of the sharded database by: causing the client application to send a third request to a shard coordinator; and in response to the third request, the shard coordinator causing: the first shard to perform the first portion of the second operation relating to data in the first shard; and the second shard to perform the second portion of the second operation relating to data in the second shard.
[0009] According to another aspect of this disclosure, a method is provided, comprising: maintaining pattern data by a sharding directory indicating a pattern used by each of a plurality of shards in a sharded database; receiving a pattern modification instruction for modifying the pattern; automatically performing, in response to receiving the pattern modification instruction, updating the pattern data at the sharding directory based on the pattern modification instruction; and causing all shards in the plurality of shards to modify the pattern by performing the following operations for each specific shard in the plurality of shards: creating a database connection to the specific shard; sending the pattern modification instruction to the specific shard; and executing the pattern modification instruction on the specific shard.
[0010] According to another aspect of this disclosure, a method is provided, comprising: maintaining a sharded database comprising a plurality of shards; wherein the plurality of shards are grouped into a plurality of shard spaces; wherein each of the plurality of shard spaces comprises at least one shard from the plurality of shards; performing partitioning at one or more levels on a table using one or more levels of partitioning criteria to produce a first plurality of partitions; receiving user-specified code from a user; selecting a shard space from the plurality of shard spaces for each of the first plurality of partitions based at least in part on the user-specified code; and distributing each of the first plurality of partitions to the shard space selected for that partition based on the user-specified code.
[0011] According to another aspect of this disclosure, a non-transitory computer-readable medium is provided that stores one or more instructions, which, when executed by one or more hardware processors, cause the methods of this disclosure to be performed.
[0012] According to another aspect of this disclosure, an apparatus is provided that includes one or more devices configured to perform the methods of this disclosure. Attached Figure Description
[0013] One or more exemplary embodiments of the present invention are shown by way of example and not limitation in the accompanying drawings, in which similar reference numerals refer to similar elements, and wherein:
[0014] Figure 1 Examples of unsharded and sharded databases according to one embodiment are shown.
[0015] Figure 2 This is a block diagram of a system for a sharded database according to one embodiment.
[0016] Figure 3A and Figure 3B An example of a table family according to one embodiment is shown.
[0017] Figure 4A and Figure 4B An example of a repeating table according to one embodiment is shown.
[0018] Figure 5 An example of sharding as a form of distributed partitioning, according to one embodiment, is shown.
[0019] Figure 6A and Figure 6B An example of block configuration in a sharded database according to one embodiment is shown.
[0020] Figure 7 An example is shown of using a consistent hash to assign a range of values to blocks, according to one embodiment.
[0021] Figure 8 An example of system-managed sharding according to one embodiment is shown.
[0022] Figure 9 An example of a mapping from partitions to tablespaces and shards according to one embodiment is shown.
[0023] Figure 10 An example of a sharded database using composite sharding is shown according to one embodiment.
[0024] Figure 11 This is an example of a sharded database with sub-partitions according to one embodiment.
[0025] Figure 12A , Figure 12B and Figure 12C An example of re-fragmentation according to one embodiment is shown.
[0026] Figure 13 This is a flowchart of a process for re-fragmentation according to one embodiment.
[0027] Figure 14 This is a flowchart of a proxy-based query routing process according to one embodiment.
[0028] Figure 15A and Figure 15B This is a flowchart of a proxy-based query routing process according to one embodiment.
[0029] Figure 16 This is a flowchart of a query routing process according to one embodiment.
[0030] Figure 17 It is a block diagram of a computing device in which one or more exemplary embodiments of the present invention can be embodied.
[0031] Figure 18 It is a block diagram of the basic software system used to control the operation of computing devices. Detailed Implementation
[0032] In the following description, numerous specific details are set forth for purposes of explanation in order to provide a thorough understanding of one or more exemplary embodiments of the invention. However, it will be appreciated that one or more exemplary embodiments may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form to avoid unnecessarily obscuring one or more exemplary embodiments.
[0033] General Overview
[0034] Horizontal partitioning is a technique that breaks down a single, large table into smaller, more manageable subsets of information (called "partitions"). Sharding is a data layer architecture in which data is horizontally partitioned across independent database instances, each of which is called a "shard." The collection of shards together forms a single logical database called a "sharded database" ("SDB"). Logically, a sharded database can be accessible to client applications just like a traditional unsharded database. However, tables in a sharded database are horizontally partitioned across shards.
[0035] Figure 1 Examples of unsharded database 100 and sharded database 110 are shown. Unsharded database 100 is a relational database and includes table 102. All contents of table 102 are stored in the same unsharded database 100 and therefore use the same computing resources, such as processors, memory, and disk space.
[0036] However, sharded database 110 depicts an alternative configuration using sharding technology. Sharded database 110 includes three shards 112, 114, and 116. Each of shards 112, 114, and 116 is its own database instance and includes its own tables 113, 115, and 117, respectively. However, in sharded database 110, table 102 has been horizontally partitioned across shards 112, 114, and 116 into tables 113, 115, and 117. Horizontal partitioning in a sharded database involves splitting a database table (such as table 102) across shards such that each shard contains a subset of the rows of table 102. In this example, tables 113, 115, and 117 each contain a subset of the rows of table 102. In general, tables 113, 115, and 117 can be referred to as “sharded tables”. Data stored in tables 113, 115, and 117 is equivalent to data stored in table 102. The sharded database 110 is logically treated as a single database, and therefore can be accessed by client applications just like the unsharded database 100.
[0037] In one embodiment, sharding is a database architecture where nothing is shared, because shards 112, 114, and 116 do not need to share physical resources such as processors, memory, and / or disk storage devices. Shards 112, 114, and 116 are loosely coupled in software and do not require running clusterware. From the database administrator's perspective, sharded database 110 consists of multiple database instances that can be managed jointly or individually. However, from the perspective of client applications, sharded database 110 logically appears as a single database. Therefore, the number of shards included in sharded database 110 and the distribution of data across these shards are completely transparent to client applications.
[0038] The configuration of sharded database 110 offers various benefits. For example, in one embodiment, sharded database 110 improves scalability by eliminating performance bottlenecks and making it possible to increase system performance and capacity through the addition of additional shards and the distribution of load across shards. Sharded database 110 can be implemented with a no-shared architecture, so each shard in the sharded database is its own database instance, and shards do not need to share hardware such as processors, memory, and / or disk storage devices.
[0039] In this embodiment, the sharded database 110 provides fault containment because it eliminates individual points of failure such as shared disks, shared storage area networks, clustering components, shared hardware, etc. Instead, sharding provides strong fault isolation because the failure of a single shard does not affect the availability of other shards.
[0040] In this embodiment, sharded database 110 provides enhancements to global data distribution. Sharding makes it possible to store specific data in locations physically close to its customers. When data is legally required to be located within a specific jurisdiction, it may be necessary to physically locate the data close to the customer's storage by placing the shards used for that specific data within that jurisdiction to meet regulatory requirements. Storing data physically close to the customer can also provide performance benefits by improving latency between the customer and the underlying data stored in the shards.
[0041] In this embodiment, the sharded database 110 allows for rolling upgrades of the system. In a sharded data architecture, changes made to one shard do not affect the content of other shards in the sharded database, thereby allowing database administrators to first attempt changes to a small subset of data stored in a single shard and then roll those changes over to the remaining shards in the sharded database.
[0042] In this embodiment, sharded database 110 provides simplicity in cloud deployment. Given that shard sizes can be made arbitrarily small, database administrators can easily deploy sharded databases in a cloud comprised of low-end commodity servers with local storage.
[0043] In this embodiment, sharded database 110 is most effective in applications that are well-partitioned, primarily access data within a single shard, and do not have stringent performance and consistency requirements for cross-shard operations. Therefore, sharding is particularly suitable for OLTP systems such as e-commerce, mobile, social, and SaaS.
[0044] In this embodiment, sharded database 110 provides an improvement in the automatic propagation of database schema changes across shards. Instead of requiring database administrators to manually apply database schema changes to each individual shard, sharded database 110 can automatically propagate such schema changes to shards from a single entry point.
[0045] In this embodiment, the sharded database 110 supports the traditional Structured Query Language (SQL), thus leveraging all available SQL syntax and keywords. Furthermore, given that the sharded database 110 supports SQL, it can be easily integrated with existing client applications configured to access relational databases via SQL.
[0046] In this embodiment, the sharded database 110 provides the full advantages of relational databases, including schema control, atomicity, consistency, isolation, and durability.
[0047] In some embodiments, the sharded database 110 provides direct routing of queries to shards without requiring intermediate components to route the queries. This direct routing improves system latency by reducing the number of network hops required to process queries.
[0048] General system architecture
[0049] Figure 2 This is a block diagram of a database system according to one embodiment. Client application 210 is any type of client application that needs to access data stored in the database. In one embodiment, client application 210 may be a client in an OLTP setup (such as e-commerce, mobile, social, or SaaS). Client application 210 may be communicatively coupled to sharded database 250.
[0050] Sharded database 250 is a logical database in which data is horizontally partitioned across independent database instances. Specifically, data stored in sharded database 250 is horizontally partitioned and stored in shards 260A, 260B, and 260C. A sharded database can include any number of shards, and the number of shards in a sharded database can vary over time. According to one embodiment, each of shards 260A, 260B, and 260C is its own database instance, which needs to share physical resources, such as processors, memory, and / or storage devices, with the other shards in sharded database 250.
[0051] Partial Directory
[0052] Sharded database 250 includes shard directory 230. Sharded directory 230 is a special database used to store configuration data for sharded database 250. In one embodiment, the shard directory can be replicated to provide improved availability and scalability. Configuration data stored in shard directory 230 may include: a routing table that maps which shards store data blocks corresponding to a given value, range of values, or set of values for a shard key; shard topology data that describes the overall configuration of sharded database 250, information about the configuration of shards 260A, 260B, and 260C; information about the configuration of shard director 220, information about client application 210, information about the schema of data horizontally partitioned across shards 260A, 260B, and 260C, a historical log of pending and completed schema modification instructions for shards 260A, 260B, and 260C, and all other information related to the configuration of sharded database 250.
[0053] In one embodiment, the sharded directory maintains a routing table storing mapping data comprising multiple mapping entries. Each of the multiple mapping entries maps a distinct set of key-value pairs for one or more shard keys to shards within a sharded database. In another embodiment, each of the multiple mapping entries maps a distinct set of key-value pairs for one or more shard keys to data blocks on shards within a shard in a sharded database. In yet another embodiment, each of the multiple mapping entries maps a distinct set of key-value pairs for one or more shard keys to a shard space comprising one or more shards in a sharded database. In one embodiment, the key-value pairs may be a series of partition key values. In another embodiment, the key-value pairs may be a list of partition key values. In yet another embodiment, the key-value pairs may be a collection of hash values.
[0054] Therefore, for queries that require access to data with a specific shard key value, a routing table can be used to find which shard in the sharded database contains the data blocks needed to process the query.
[0055] Segmented Guide
[0056] Sharded database 260 includes shard director 220. Sharded director 220 coordinates various functions across sharded database 250, including but not limited to: routing database requests to shards, propagating database schema changes to shards, monitoring shard status, receiving status updates from shards, receiving notifications from client applications, sending notifications to shards, sending notifications to client applications, and / or coordinating various operations affecting the configuration of sharded database 260, such as resharding operations. Sharded director 220 is communicatively coupled to shard directory 230, client application 210, and shards 260A, 260B, and 260C.
[0057] Although depicted as a single shard director 220, in one embodiment, the shard database 260 may include multiple shard directors 220. For example, in one embodiment, the shard database 260 may include three shard directors 220. Having multiple shard directors 220 can allow load balancing of coordination tasks performed by the shard directors 220, thereby improving performance. In the case of multiple shard directors 220, in one embodiment, one of the shard directors 220 may be selected as the manager of the shard directors 220, which is responsible for managing the remaining shard directors 220, including load balancing.
[0058] Table Family
[0059] To reap the benefits of sharding, the schema of a sharded database should be designed to maximize the number of data requests that can be executed on a single shard without requiring cross-shard joins. By executing queries on a single shard, cross-shard joins are eliminated, thus improving efficiency.
[0060] Database tables often have parent-child relationships. For example, a reference constraint can exist between a parent and child table. Specifically, with foreign key constraints, a column in the child table can be constrained to store only values that exist in the parent table's primary key. Multiple tables linked through this relationship typically form a tree-like hierarchical structure where each child has a single parent. A "table family" is a representation of this hierarchical structure of related tables. The table in a table family that has no parent is called the "root table." There is only one root table in a table family.
[0061] For example, Figure 3AAn example of table family 310 is shown. Table family 310 includes tables 312, 314, and 316. Tables 312, 314, and 316 illustrate the classic Customers / Orders / LineItems schema. Table 312 is a table that includes customer information. The primary key of customer table 312 is the "Customer No." field. Customer table 312 is the root table of table family 310.
[0062] The Orders table (314) contains order information. The primary key of the Orders table (314) is the "Order No." The foreign key of the Orders table (314) is the "Customer No." field, which references the primary key of the Customer table (312). This means that the "Customer No." field in the Orders table (314) is only allowed to store values that exist in the "Customer No." field of the Customer table (312). Therefore, the Orders table (314) is a child of the Customer table (312).
[0063] Row entry table 316 is a table that contains row entry information. The primary key of row entry table 316 is the "Row Number" field. The foreign key of row entry table 316 is the "Order Number" field, which references the primary key of order table 314. This indicates that the "Order Number" field of table 316 is only allowed to store values that exist in the "Order Number" field of order table 314. Therefore, row entry table 316 is a child of order table 314.
[0064] According to one embodiment, when sharding tables belonging to a table family, all tables in the table family are partitioned by inheriting the partition key from the root table of the table family. Therefore, the partition key of the root table is used as the partition key for all tables in the table family. In one embodiment, the partition key is equivalent to the shard key in a sharded database. In the example of table family 310, the "Customer ID" field can be used as both the partition key and the shard key for all tables in table family 310. Corresponding partitions from all tables in the table family are always stored in the same shard. Since join operations in a sharded database are typically performed on related data from tables within the table family, storing corresponding partitions from the table family in the same shard improves the execution of such joins, as join operations can be performed on a single shard and cross-shard queries are not required.
[0065] To illustrate, Figure 3BThis example illustrates how data from table family 310 can be sharded into three shards 320, 330, and 340. Shard 320 includes tables 322, 324, and 326, which have already been partitioned based on the "Customer ID" sharding key. Specifically, tables 322, 324, and 326 each contain corresponding data for the Customer table, Order table, and Line Entry table, where Customer ID = 123. Shard 330 includes tables 332, 334, and 336. Shard 340 includes tables 342, 344, and 346, which have also been partitioned based on the "Customer ID" sharding key. Specifically, tables 332, 334, and 336 each contain corresponding data from the Customer table, Order table, and Line Entry table, where Customer ID = 456. Specifically, tables 342, 344, and 346 each contain corresponding data from the Customer table, Order table, and Line Entry table, where Customer ID = 999.
[0066] Overall, tables 322, 332, and 342 are sharded table representations of customer table 312; however, the content is now distributed across three shards. Similarly, tables 324, 334, and 344 collectively contain the same data as order table 314; however, the content is now distributed across three shards. Furthermore, tables 326, 336, and 346 collectively contain the same data as row entry table 316; however, the content is now distributed across three shards. By sharding the content of table family 310 using the same sharding key across all tables 312, 314, and 316, each shard will be able to independently perform joins on related data for a given table family, thereby improving system performance and reducing the frequency of cross-shard queries.
[0067] Table 1 provides example code for creating sharded tables for the customer table, order table, and line entry table:
[0068] Table 1
[0069] CREATE SHARDED TABLE Customers
[0070] (CustNo NUMBER NOT NULL
[0071] Name VARCHAR2(50)
[0072] Address VARCHAR2(250)
[0073] ,CONSTRAINT RootPK PRIMARY KEY(CustNo) )
[0075] PARTITION BY CONSISTENT HASH(CustNo)
[0076] PARTITIONS AUTO
[0077] TABLESPACE SET ts1 ;
[0079] CREATE SHARDED TABLE Orders
[0080] (OrderNo NUMBER NOT NULL
[0081] ,CustNo NUMBER NOT NULL
[0082] ,OrderDate DATE
[0083] ,CONSTRAINT OrderPK PRIMARY KEY(OrderNo)
[0084] ,CONSTRAINT CustFK FOREIGN KEY(CustNo)REFERENCES Customers(CustNo) )
[0086] PARTITION BY REFERENCE(CustFK) ;
[0088] CREATE SHARDED TABLE LineItems
[0089] (CustNo NUMBER NOT NULL
[0090] ,LineNo NUMBER(2)NOT NULL
[0091] ,OrderNo NUMBER(5)NOT NULL
[0092] ,StockNo NUMBER(4)
[0093] ,Quantity NUMBER(2)
[0094] ,CONSTRAINT LinePK PRIMARY KEY(CustNo,OrderNo,LineNo)
[0095] ,CONSTRAINT LineFK FOREIGN KEY(CustNo,OrderNo)REFERENCES Orders(CustNo,OrderNo) )
[0097] PARTITION BY REFERENCE(LineFK) ;
[0099] In the syntax example above, the Customers table is partitioned across shards using the CustNo (customer ID) field as the partition key. The Customers table is the root table in the table family. The Orders and LineItems tables are child and grandchild tables of the Customers table, respectively. The Orders and LineItems tables are partitioned according to foreign key references to the partition key that indirectly points back to the Customers table. Therefore, the Orders and LineItems tables are also partitioned based on the same partition key as the Customers table. Thus, the Customers, Orders, and LineItems tables are a table family that are similarly partitioned across shards for the same partition key.
[0100] In another embodiment, even if the table schema does not explicitly articulate constraints (such as foreign keys) defining parent-child relationships between the tables in the table set, the table set is partitioned across shards as if they were a table family. In this case, the database administrator can recognize that these tables are related to each other, even if they are not an explicit "table family." The database administrator can identify one or more partition keys to partition the table set across shards such that partitions of the table set include relevant data that might be used to satisfy a single query, and such partitions across the table set will be included on the same shard. Therefore, even without an explicit table family formed based on constraints in the schema, the database administrator can substantially simulate the logical grouping of the table set. For example, in one embodiment, the table set includes a customer table and an order table. The customer table and the order table may each include a "customer number" column; however, there is no reference constraint in the schema linking the customer table to the order table. In this example, even without forming a formal table family, the database administrator can use the customer number field as at least one partition key to partition the table set. Therefore, data blocks can be created using logical grouping of sets of tables, as will be described later.
[0101] Duplicate tables
[0102] Some reference tables are relatively small, do not change frequently, and / or are often accessed along with sharded tables. An example of a reference table could be a U.S. postal code table. This postal code information is relatively small and static, but may be frequently accessed along with other sharded tables involving addresses. Reference tables are a good candidate to be implemented as "duplicated tables" that are copied across all shards in a sharded database. By storing copies of the duplicated table across multiple shards, the system can ensure that more queries can be processed via a single shard without requiring cross-shard queries, thereby improving system efficiency.
[0103] Figure 4A An example of an unsharded database 410 is shown. Unsharded database 410 includes a customer table 412, an order table 414, and a line entry table 416, which again form a table family. Unsharded database 410 also includes a products table 418, which is an example of a reference table.
[0104] Figure 4B An example of a sharded database including shards 420, 430, and 440 is shown. The sharding in Tables 412, 414, and 416 is compared with... Figure 3B The process is similar. Specifically, shard 420 includes tables 422, 424, and 426. Shard 430 includes tables 432, 434, and 436. Shard 440 includes tables 442, 444, and 446. However, each of shards 420, 430, and 440 also includes a duplicate of table 418. In this way, each of shards 420, 430, and 440 can access the contents of table 418, thereby minimizing the need to perform cross-shard queries to access the contents of the reference table.
[0105] Table 2 provides example code for creating duplicate tables:
[0106] Table 2
[0107] CREATE DUPLICATED TABLE Products(
[0108] SKU NUMBER PRIMARY KEY,
[0109] Product VARCHAR2(20), ) ;
[0112] In one embodiment, materialized view replication is used to synchronize duplicate tables on each shard. In one embodiment, the duplicate tables on each shard are read-only materialized views. The parent table of the materialized views resides in the shard directory. In one embodiment, materialized views on all shards are automatically refreshed at a configurable frequency. In one embodiment, the CREATEDUPLICATED TABLE command automatically creates the parent table, materialized views, and other objects required for materialized view replication.
[0113] As a shard in a distributed partition
[0114] Although from a database administrator's perspective, a sharded database appears as a single database to client applications and developers, it is actually a collection of shards, where each shard is a discrete database instance. Sharded tables are partitioned across all shards of the sharded database. Therefore, table partitioning on each shard of a sharded database is no different from regular partitioning used in a non-sharded database.
[0115] Figure 5 An example of sharding as a distributed partition is shown. Unsharded database 500 consists of a single table broken down into partitions 1-20. Sharded database 520 has these same partitions 1-20 distributed as partitions across partitions 521, 522, 523, 524, and 525. From the perspective of the client application, the structure and format of the partitions are the same; however, in the case of sharded database 520, the partitions are distributed across five different partitions, thus distributing the computational load across the computational resources of each partition.
[0116] Tablespaces and blocks
[0117] A tablespace is a collection of storage containers (e.g., data files) used to store data for database objects within a database. Database objects are objects managed by the database system. Examples of database objects include tables, indexes, and code modules that can be executed by the database system. A database object can be referred to as residing in a specific tablespace when it holds the data of that object. Each partition of a sharded table is stored in a separate tablespace. Therefore, a tablespace is the physical unit of data distribution in a sharded database. A "set of tablespaces" is a collection of tablespaces.
[0118] It is possible to create different tables from a set of tables or a family of tables in different tablespace sets. For example, a customer table might be created in tablespace set TS1, and an order table might be created in tablespace set TS2. However, to minimize the number of required cross-shard queries, it is essential to ensure that the tablespace storing the customer table (partition_1) and the corresponding tablespace storing the order table (partition_1) always reside in the same shard. To achieve this, a "chunk" can be defined as a dataset that stores the corresponding partitions of a table. For example, in one embodiment, a chunk can be defined as a set of tablespaces that stores the corresponding partitions of all tables in a table family. A chunk contains a single partition from each table in the relevant table set or family.
[0119] Figure 6A An example of block 601 according to one embodiment is shown. Block 601 stores corresponding partitions of the customer table, order table, and line item table. Specifically, block 601 stores a partition Customers_P1 covering customers 1 to 1,000,000 (1M) in the customer table based on a range of partition key values. Block 601 also stores a partition Orders_P1 containing the corresponding rows in the order table for customers in Customers_P1, using the same partition key. Block 601 also stores a partition LineItems_P1 containing the corresponding rows in the line item table for orders in Orders_P1, using the same partition key. Thus, partitions Customers_P1, Orders_P1, and LineItems_P1 are all partitioned based on the same partition key. Therefore, join operations requiring data from the customer table, order table, and line item table for customers falling within the set of partition key values can be served by a single shard containing block 601 without cross-shard queries.
[0120] Figure 6BAn example of shard 620 is shown. Shard 620 includes blocks 601, 606, and 611, which contain partitions from a family of tables including a customer table, an order table, and a line entry table. Each of blocks 601, 606, and 611 contains a single partition from each table in the table family, based on different ranges of partition key values. In this example, the partition key is the customer ID column of the customer table. For example, block 601 includes data from the customer table, order table, and line entry table for customer IDs in the range of 1 to 1,000,000. Block 606 covers data from the customer table, order table, and line entry table for customer IDs in the range of 5,000,001 to 6,000,000. Block 611 includes data from the customer table, order table, and line entry table for customer IDs in the range of 1,000,0001 to 1,100,000. Shard 620 also includes table 630, which is a duplicate copy of the product table as a reference table. Therefore, queries that require data from corresponding rows of Customers, Orders, or LineItems tables that share the same partition key value can be served by a single shard containing blocks for the partitions of these three tables. Such queries do not require cross-shard joins.
[0121] System-managed sharding
[0122] In one embodiment, system-managed sharding is a sharding technique in which the system automatically determines the mapping of data to shards in a sharded database. As will be described herein, data is automatically distributed across shards using partitioning via a hashing strategy. Therefore, system-managed sharding distributes data evenly and randomly across shards. Because data is distributed randomly and evenly across shards, system-managed sharding reduces the likelihood of creating hotspots of data requests that require processing a disproportionate amount of data requests compared to other blocks in the sharded database. Furthermore, because data is distributed randomly and evenly across shards, system-managed sharding provides consistent performance across shards. In one embodiment, a sharded database using system-managed sharding can maintain a balanced distribution of data when shards are added to or removed from the sharded database.
[0123] In one embodiment, in system-managed sharding, data is allocated to shards based on a hash function applied to the shard key. For example, in one embodiment, conventional hashing is a partitioning strategy that can be used in system-managed sharding. In conventional hashing, a hash function is applied to the shard key as follows to determine the block to which the data should be assigned: hash(key) % N. Here, hash() is the hash function, and N is a constant representing the number of blocks. Therefore, by applying conventional hashing to the shard key, the system can determine which block the data associated with the shard key should be assigned to.
[0124] In one embodiment, consistent hashing is a partitioning strategy that can be used in system-managed sharding. In consistent hashing, a hash function is applied to the shard key to obtain a hash value. The possible range of values for the hash function is (e.g., from 0 to 2). 32 The data is divided into a set of N adjacent intervals, each representing a different block. Therefore, by applying a consistent hash to the shard key, the system can determine which block the data associated with the shard key should be assigned to.
[0125] For example, Figure 7 An example of a range of values assigned to a block is shown. In this example, the sharded database contains 1024 blocks, and each block is assigned 2. 22 The range of hash values. Therefore, block 710A is assigned from 0 to 4194304 (e.g., 2). 22 The range of hash values for block 710B is assigned from 419304 (e.g., 2). 22 ) to 8388608 (e.g., 2*2) 22 The range, etc., until block 710N is assigned from 4290772992 (e.g., 1023*2). 22 ) to 4294967296 (e.g., 1024*2) 22 Therefore, partitioning by consistent hashing is essentially partitioning by the range of hash values.
[0126] Assuming all shards have the same computing power, the same number of blocks are assigned to each shard in the sharded database. For example, if 1024 blocks are created in a sharded database containing 16 shards, then each shard will contain 64 blocks. In one embodiment, blocks are randomly assigned to shards to provide a random and uniform distribution of data across shards.
[0127] In the case of resharding, when a shard is added to or removed from the shard database, a certain number of blocks move between shards to maintain a uniform distribution of blocks across shards. During this resharding process, the content of the blocks does not change; therefore, rehashing is not required.
[0128] The number of blocks in a sharded database with system-managed sharding can be specified in the `CREATE SHARDCATALOG` command. Once the sharded database is deployed, the number of blocks can only be changed by splitting them. When a block is split, its hash value range is divided into two ranges, but nothing needs to be done for the remaining blocks. Any block can be split independently at any time.
[0129] A block can contain only one tablespace from a given set of tablespaces. However, the same set of tablespaces can be used for multiple tables belonging to the same table family. In this case, each tablespace in the set will store multiple partitions, one partition per table. Alternatively, each table in a table family can be stored in a separate set of tablespaces. In this case, the block contains multiple tablespaces: one tablespace per table, and each tablespace stores a single partition.
[0130] Figure 8 An example of system-managed sharding is shown, illustrating the relationship between partitions, tablespaces, and shards in a simplified case with a single sharded table. In this case, each block contains a single tablespace, and each tablespace stores a single partition. The sharded database includes shards 820, 830, 840, and 850. Partitions P_1 to P_480 reside in tablespaces tbs1-1 to tbs1-480. Tablespace set 810 includes tablespaces tbs1-1 to tbs1-480.
[0131] Table 3 provides example code for creating sharded tables using system-managed sharding:
[0132] Table 3
[0133] CREATE SHARDED TABLE customers
[0134] (cust_id NUMBER NOT NULL
[0135] name VARCHAR2(50)
[0136] ,address VARCHAR2(250)
[0137] ,location_id VARCHAR2(20)
[0138] ,class VARCHAR2(3)
[0139] ,signup DATE
[0140] CONSTRAINT cust_pk PRIMARY KEY(cust_id) )
[0142] PARTITION BY CONSISTENT HASH(cust_id)
[0143] PARTITIONS AUTO
[0144] TABLESPACE SET ts1 ;
[0146] In this example, a sharded table "customers" is created, which uses a consistent hash across the shard key cust_id. The statement includes "PARTITIONS AUTO", which instructs that the number of partitions will be automatically set to the number of tablespaces in the tablespace set ts1. Therefore, the number of blocks will also be equal to the number of tablespaces in the tablespace set ts1.
[0147] User-defined sharding
[0148] In one embodiment, user-defined sharding is a sharding technique that allows users, such as database administrators, to explicitly specify a user-defined mapping of data to shards. User-defined sharding can be used when it is necessary to store certain data on specific shards and when the database administrator needs full control over moving data between shards. For example, user-defined sharding can be beneficial when performance constraints require certain data to reside on specific shards. In another example, user-defined sharding can be necessary when regulatory requirements necessitate that specific data be stored on specific shards. Another benefit of user-defined sharding is that in the event of planned or unplanned shard outages, the database administrator knows exactly which data becomes unavailable due to the outage. The disadvantage of user-defined sharding is that the database administrator needs to monitor the shards and manually maintain a balanced distribution of data and workload across shards.
[0149] Sharded tables can be partitioned by range or list using user-defined sharding. Aside from the additional requirement that each partition should be stored in a separate tablespace, the CREATE TABLE syntax for sharded tables is similar to that for creating regular tables. Table 4 provides example code for creating a table using user-defined sharding for the accounts table:
[0150] Table 4
[0151] CREATE SHARDED TABLE accounts
[0152] (id NUMBER
[0153] ,account_number NUMBER
[0154] customer_id NUMBER
[0155] branch_id NUMBER
[0156] ,state VARCHAR(2)
[0157] ,status VARCHAR2(1) )
[0159] PARTITION BY LIST(state)
[0160] (PARTITION p_northwest VALUES('OR','WA')TABLESPACE tbs1
[0161] ,PARTITION p_southwest VALUES('AZ','UT','NM')TABLESPACE tbs2
[0162] ,PARTITION p_northcentral VALUES('SD','WI')TABLESPACE tbs3
[0163] ,PARTITION p_southcentral VALUES('OK','TX')TABLESPACE tbs4
[0164] ,PARTITION p_northeast VALUES('NY','VR','NJ')TABLESPACE tbs5
[0165] ,PARTITION p_southeast VALUES('FL','GA')TABLESPACE tbs6 ) ;
[0168] In the example shown in Table 4, the `accounts` table is created as a sharded table and partitioned by a list using `state` as the partition key. The user is defining which partitions are assigned to specific tablespaces based on the partition key. There are no tablespaces set up for user-defined sharding. Each tablespace must be created explicitly. A "shard space" is a collection of shards that store data corresponding to a range or list of key values. Specifically, in the case of user-defined sharding, a shard space consists of a collection of shards or fully replicated shards. Table 5 provides exemplary code for creating tablespaces for the `accounts` table:
[0169] Table 5
[0170] CREATE TABLESPACE tbs1 IN SHARDSPACE west;
[0171] CREATE TABLESPACE tbs2 IN SHARDSPACE west;
[0172] CREATE TABLESPACE tbs3 IN SHARDSPACE central;
[0173] CREATE TABLESPACE tbs4 IN SHARDSPACE central;
[0174] CREATE TABLESPACE tbs5 IN SHARDSPACE east;
[0175] CREATE TABLESPACE tbs6 IN SHARDSPACE east;
[0176] Therefore, in the example shown in Table 5 above, the user manually defines the position of each tablespace within the shard space. Table 6 provides exemplary code for creating shard spaces and populating shards for the accounts table:
[0177] Table 6
[0178] ADD SHARDSPACE–SHARDSPACE west, central, east;
[0179] ADD SHARD–CONNECT shard-1–SHARDSPACE west;
[0180] ADD SHARD–CONNECT shard-2–SHARDSPACE central;
[0181] ADD SHARD–CONNECT shard-3–SHARDSPACE east;
[0182] Therefore, in the example shown in Table 6 above, the user manually defines what fragment space fills each fragment.
[0183] Figure 9 This example illustrates sharding that maps partitions to tablespaces and tablespaces to the `accounts` table. The `accounts` table is sharded across shards 920, 930, and 940. Shard 920 is associated with the West shard space, shard 930 with the Middle shard space, and shard 940 with the East shard space.
[0184] In one embodiment, user-defined shards can be used to segregate data across specific shards based on performance or other constraints. For example, in one embodiment, by using user-defined shards, a first dataset can be assigned to a first set of shards accessible to the cloud environment, and a second dataset can be assigned to a second set of shards inaccessible to the cloud environment. Segregating data can be helpful due to privacy or regulatory concerns regarding data inaccessible via the cloud environment. In other embodiments, data segregation can facilitate archiving purposes, as archived data is segregated to shards with lower performance capabilities, while actively used data is segregated to shards with higher performance capabilities.
[0185] Composite fragmentation
[0186] In one embodiment, composite sharding is an algorithm for combining the characteristics of user-defined shards with the characteristics of system-managed shards.
[0187] System-managed sharding distributes data randomly across shards, providing better load balancing compared to user-defined sharding that partitions by range or list. However, system-managed sharding does not give database administrators any control over the assignment of data to shards.
[0188] User-defined sharding provides the ability to allocate specific data to specific shards, which can be necessary to meet various requirements. For example, regulatory requirements may mandate that certain data be stored on computer hardware located in a specific geographic area. Similarly, sharding can be implemented on different computing devices with varying hardware performance characteristics, including but not limited to processors, memory, and storage devices. Furthermore, data can be differentiated based on classification, where the classification is determined based on specific columns in a table. For example, in the case of customer-related data, some customers might be classified as "GOLD" customers, while others might be classified as "SILVER". In this example, the data for GOLD customers might need to be stored in one shard set, while the data for SILVER customers might need to be stored in a different shard set.
[0189] Composite sharding allows data to be partitioned across more than one level of partitioning. For example, data can first be partitioned across multiple shard spaces by list or range, and then further partitioned across multiple shards within each shard space using a consistent hash. Given user-defined mapped data, multi-level sharding provided by composite sharding allows the system to automatically maintain a balanced distribution of data across shards within each shard space, while simultaneously partitioning the data across shard spaces. Composite sharding can be performed using a Data Definition Language (DDL).
[0190] For example, suppose a database administrator wants to allocate three shards hosted on a faster server to GOLD customers and four shards hosted on a slower machine to SILVER customers. Within each shard set, customers must be distributed using partitioning via a consistent hash of customer_id. Therefore, in this example, data is partitioned across the shard space using a first shard key indicating the customer category via a range or list. Data is further partitioned across shards within the shard space using a second shard key for customer_id via a consistent hash. In one embodiment, the first shard key may be referred to as the super shard key. The super shard key is a type of shard key used for higher-level partitioning in a composite sharding configuration. In one embodiment, a routing table stores mapping data for a combination of the first and second shard key values to the location of blocks containing data in a specific shard within a specific shard space. Therefore, the routing table can be used to determine the location of blocks containing data for a given key combination on a specific shard within a specific shard space using a combination of the first and second shard keys.
[0191] Figure 10 An example of a sharded database 1000 including shard space 1001 and shard space 1002 is shown. Shard space 1001 is a shard space dedicated to the GOLD client and includes shards 1020, 1030, and 1040 covering tablespace set 1010. Shard space 1002 is a shard space dedicated to the SILVER client and includes shards 1050, 1060, 1070, and 1080 covering tablespace set 1011.
[0192] Table 7 provides exemplary code for creating fragment spaces 1001 and 1002:
[0193] Table 7
[0194] ADD SHARDSPACE–SHARDSPACE shspace1,shspace2;
[0195] ADD SHARD–CONNECT shard1–SHARDSPACE shspace1;
[0196] ADD SHARD–CONNECT shard2–SHARDSPACE shspace1;
[0197] ADD SHARD–CONNECT shard3–SHARDSPACE shspace1;
[0198] ADD SHARD–CONNECT shard4–SHARDSPACE shspace2;
[0199] ADD SHARD–CONNECT shard5–SHARDSPACE shspace2;
[0200] ADD SHARD–CONNECT shard6–SHARDSPACE shspace2;
[0201] ADD SHARD–CONNECT shard7–SHARDSPACE shspace2;
[0202] In Table 7 above, the user provides a mapping from multiple shards (shard1, shard2, shard3, shard4, shard5, shard6, and shard7) to multiple shard spaces (shspace1 and shspace2). Table 8 provides example code for creating tablespace sets 1010 and 1011:
[0203] Table 8
[0204] CREATE TABLESPACE SET tbs1 IN SHARDSPACE shspace1;
[0205] CREATE TABLESPACE SET tbs2 IN SHARDSPACE shspace2;
[0206] In the example in Table 8 above, the user provided a mapping from tablespace sets (tbs1 and tbs2) to shard spaces (shspace1 and shspace2). Once the shard spaces and tablespace sets are created, sharded tables can be created, partitioning them into two partition sets: GOLD and SILVER, based on service categories. Table 9 provides exemplary code for creating sharded tables for sharded database 1000:
[0207] Table 9
[0208] CREATE SHARDED TABLE customers
[0209] (cust_id NUMBER NOT NULL
[0210] name VARCHAR2(50)
[0211] ,address VARCHAR2(250)
[0212] ,location_id VARCHAR2(20)
[0213] ,class VARCHAR2(3)
[0214] ,signup_date DATE
[0215] ,CONSTRAINT cust_pk PRIMARY KEY(cust_id))
[0216] PARTITIONSET BY LIST(class)
[0217] PARTITION BY CONSISTENT HASH(cust_id)
[0218] PARTITIONS AUTO
[0219] (PARTITIONSET gold VALUES('gld')TABLESPACE SET tbs1,PARTITIONSET silver VALUES('slv')TABLESPACE SET tbs2) ;
[0221] In the example in Table 9 above, the customer table is created as a sharded table. The customer table is first partitioned using the "class" column as the partition key. In Table 9, a user-provided mapping maps the GOLD customer to tablespace set tbs1 and the SILVER customer to tablespace set tbs2. This is an example of user-defined sharding. The customer table is further partitioned using "cust_id" as the partition key. This further partitioning uses a consistent hash and is an example of system-managed sharding. Therefore, the final result in Table 9 is a composite sharding that combines the characteristics of user-defined sharding and system-managed sharding.
[0222] Composite sharding offers the benefits of user-defined sharding because it allows users to customize the allocation of data to the shard space based on user-defined constraints. These user-defined constraints can vary from application to application and can include constraints such as regulatory restrictions on the physical location of data, distinctions based on data categories, or allocation of data across different hardware resources. Composite sharding also provides the benefits of system-managed sharding by offering automatic and randomized distribution of data within shards within the same shard space. Therefore, composite sharding is a multi-level sharding approach that combines the benefits of user-defined sharding and system-managed sharding.
[0223] In one embodiment, multi-level composite sharding is a strategy for distributing data in a sharded database that can include any number of N-level partitions and is not limited to two-level partitioning. For example, in one embodiment, the first N-1 levels are each partitioned based on a range or list, and can each be partitioned using a different partition key. Furthermore, in one embodiment, the Nth level is partitioned based on a hash function such as a consistent hash function. Therefore, such multi-level composite sharding can be used to simulate a multi-level hierarchical organization of data across one or more shard spaces, tablespace sets, and / or shards. This multi-level composite sharding strategy provides the benefits of both user-defined sharding and system-managed sharding.
[0224] In one embodiment, when using composite sharding, there exists a shard key for each level of sharding. Multiple shard key values can be stored in a key-value set to maintain mapping data, thereby allowing the location of specific data fragments residing on shards within a multi-level composite sharding database. In one embodiment, to determine which shard contains data for a specific database request, the client application will need to include the values of all shard keys within the database request.
[0225] Subpartition
[0226] In one embodiment, subpartitioning can be combined with sharding to allow partitions included in a shard to be further subpartitioned. Subpartitioning breaks each partition into smaller parts. Subpartitioning offers various benefits. For example, subpartitioning allows for multi-level organization of data across shards. For instance, within a shard, data can be organized into subpartitions based on date, customer category, or some other criterion derived from the data. Subpartitioning is beneficial for efficient parallel execution within shards, especially when sharding by range or list might be performed when the number of partitions in each shard is likely small. Subpartitioned data can be partitioned by list, range, or hash function (such as consistent hashing or traditional hashing).
[0227] Table 10 provides exemplary code for system-defined sharding using consistent hashing and sub-partitioning by range:
[0228] Table 10
[0229] CREATE SHARDED TABLE customers
[0230] (cust_id NUMBER NOT NULL
[0231] name VARCHAR2(50)
[0232] ,address VARCHAR2(250)
[0233] ,location_id VARCHAR2(20)
[0234] ,class VARCHAR2(3)
[0235] ,signup_date DATE
[0236] ,CONSTRAINT cust_pk PRIMARY KEY(cust_id,signup_date) )
[0238] TABLESPACE SET tbs1
[0239] PARTITION BY CONSISTENT HASH(cust_id)
[0240] SUBPARTITION BY RANGE(signup_date)
[0241] Subparitation TEMPLATE
[0242] (SUBPARTITION per1 VALUES LESS THAN(TO_DATE('01 / 01 / 2000','DD / MM / YYYY'))
[0243] ,SUBPARTITION per2 VALUES LESS THAN(TO_DATE('01 / 01 / 2010','DD / MM / YYYY'))
[0244] ,SUBPARTITION per3 VALUES LESS THAN(TO_DATE('01 / 01 / 2020','DD / MM / YYYY'))
[0245] ,SUBPARTITION future VALUES LESS THAN(MAXVALUE)) )
[0247] PARTITIONS AUTO ;
[0249] In this example, a sharded table named Customers is created. The Customers sharded table is first partitioned using a consistent hash using cust_id, and then subpartitioned into four subpartitions using signup_date. Figure 11An example of the result of applying this code is shown. Partitions are created in tablespace set tbs1 1110. Partitions across shards 1120, 1130, and 1140 are partitioned using a consistent hash. Each of these partitions resides in its own tablespace. Furthermore, each partition comprises four subpartitions, which are partitioned using signup_date and stored in the parent tablespace.
[0250] Subpartitions can be combined with system-managed shards, user-defined shards, or composite shards. If subpartitions are combined with composite shards, the data in the table will ultimately be organized into three levels: partition sets, partitions, and subpartitions. Each level of partition can have a different partition key, and the routing table can be updated to include mapping data for all three partition keys.
[0251] Example application areas for subpartitions could be storing data about customer tables that comply with regulatory and archiving requirements. Regulatory requirements might mandate that data about a specific customer be stored in shards physically located within the customer's geographic jurisdiction. Archiving requirements might require that data about older customers be stored separately from data about newer customers. For example, data about older customers might need to be stored separately for archiving to backup storage. User-defined shards can be used to create partitions for sharded tables, such that each shard contains data about customers geographically located within the same jurisdiction as the shard, and subpartitions can be created to group customers based on their service registration date to meet these requirements. Therefore, subpartitions allow for multi-level organization of data in a sharded database.
[0252] copy
[0253] Since each shard in a sharded database is its own database instance, this system can use existing replication technologies to replicate the data stored in the shards. For example, in active / active replication, two peer database instances can store the same dataset that can be changed and accessed by client applications. Changes made to one peer database instance are replicated to the other, and vice versa. Under active / active replication, there is no master replica, and the peer database instances remain identical.
[0254] In master / slave replication, there exists a master database instance and multiple slave database instances. The master replica is the source of information and can accept both read and write transactions. Changes made to the master database instance are replicated to the slave database instances. The slave database instances are read-only.
[0255] Therefore, in one embodiment, data in a block stored in a first shard can be copied to one or more additional shards. This copying of data in a block can be performed using active / active replication or master / slave replication.
[0256] Re-sharding
[0257] Resharding is the process of redistributing data among shards in a sharded database. Resharding can be triggered by various conditions. Resharding involves moving a block or a portion of a block to a new shard. As mentioned earlier, a block is the unit of data migration between shards. By storing related data from different shard tables in the same block and moving the block as an atomic unit between shards, resharding ensures that related data from different shard tables are moved together.
[0258] In one embodiment, resharding can be triggered when a new shard is added to the sharding database. In this case, resharding can be automatic, and one or more blocks will be automatically redistributed in the sharding database to move blocks from the old shards to the new shards.
[0259] In one embodiment, a resharding process can be triggered when a shard is removed from the sharding database. In this case, the resharding process can be automatic and will automatically redistribute one or more blocks from the shard being removed from the sharding database to another shard in the sharding database that is not being removed.
[0260] In one embodiment, resharding can be triggered when a database administrator uses commands to manually break down or split a specific block. This can occur, for example, when the database administrator recognizes that a particular block is too large and needs to be split into smaller blocks and redistributed. The oversized block can be split into two or more smaller blocks, which are then redistributed into shards in the sharded database.
[0261] In one embodiment, a resharding process can be triggered when the sharding database detects significant load on a particular shard. To balance the load across the sharding database, the sharding database can identify one or more blocks on a particular shard that need to be redistributed to other shards in the sharding database in response to the detection of significant load on that shard. In one embodiment, the identification of the one or more blocks used for redistribution is random.
[0262] In one embodiment, the ability to perform resharding provides elastic scalability to a sharded database by allowing the database to reconfigure the distribution of data in response to triggering events such as newly added or removed shards, manual intervention by database management, or detection of significant load on a particular shard.
[0263] Figure 12A , Figure 12B and Figure 12CThis describes an example of resharding a sharded database when a new shard is added to it. Figure 12A In the current sharding database, there are shards 1220, 1230, 1240, and 1250. Shard 1220 currently includes blocks 1201, 1202, 1203, and 1204. Shard 1230 includes blocks 1205, 1206, 1207, and 1208. Shard 1240 includes blocks 1209, 1210, 1211, and 1212. Furthermore, shard 1250 has just been added to the sharding database. However, since this shard is new, it does not yet contain any blocks.
[0264] exist Figure 12B In this process, blocks 1204, 1208, and 1212 from fragments 1220, 1230, and 1240, respectively, are copied. Copies of blocks 1204, 1208, and 1212 are stored in fragment 1250, and the original copies of these blocks are deleted from their original fragments 1220, 1230, and 1240.
[0265] Figure 12C An example of the final result of the resharding process is shown. After resharding is complete, the blocks in the shard database are redistributed as follows: shard 1220 includes blocks 1201, 1202, and 1203; shard 1230 includes blocks 1205, 1206, and 1207; shard 1240 includes blocks 1209, 1210, and 1211; and the new shard 1250 includes blocks 1204, 1208, and 1212, redistributed from the other shards in the shard database. Therefore, the shard database has automatically redistributed its data in response to the addition of the new shard 1250. This automatic redistribution in response to the addition of data in the new shard provides load balancing within the shard database.
[0266] Figure 13 A flowchart of a re-sharding process 1300 according to one embodiment is shown. In step 1302, a sharding director identifies a source block to be moved from a source shard to a destination shard. As previously described, this can be done automatically in response to detecting a significant load on the source shard or in response to detecting a new shard added to the sharding database. In this case, the source block to be moved can be identified based on a random selection from all blocks located in the source shard. In another embodiment, the source block can be identified in response to a database command (such as a command to split a block or move a block) provided by a database administrator. Process 1300 can then proceed to step 1304.
[0267] In step 1304, the sharding director may optionally determine whether the sharded database uses active / active replication. If the sharded database uses active / active replication, then a duplicate copy of the source block will exist in the replica, which can be used to serve database queries that need to access data in the source block when resharding processing continues. If the sharded database does not use active / active replication, then a duplicate copy of the source block may not exist in the replica, which can be used to serve database queries that need to access data in the source block when resharding processing continues. If the sharded database uses active / active replication, then process 1300 can proceed to step 1306. If the sharded database does not use active / active replication, then process 1300 can proceed to step 1312.
[0268] In step 1306, since the fragmentation director knows that a copy of the source block is available from the replicated source, the fragmentation director updates the routing table to indicate that the source block is unavailable on the source, but the data contained in the source block is available from the replicated fragment containing the replicated copy of the source block. Process 1300 can then proceed to step 1308.
[0269] In step 1308, the sharding director notifies client applications of the update to the routing table. This notification will also include mapping data for the new mapping, indicating that the data contained in the source block is now available from the replicated shard. Therefore, this lets client applications know that for queries that need to access the source block when it is unavailable on the source shard, they can continue to route the queries directly to the replicated shard. This ensures that client applications do not experience downtime accessing the data they need from the source block when it is migrated to the new destination shard. Processing 1300 can then proceed to step 1310.
[0270] In step 1310, the fragmentation director instructs the fragmentation to copy the source block from the source fragment to the destination fragment. Processing 1300 can then proceed to step 1320.
[0271] In step 1312, the fragmentation director instructs the destination fragment to store a fuzzy copy of the source block in the destination fragment. A fuzzy copy is a copy generated from the source block while the source block is still available to process further queries. Therefore, if the source block is modified during the copying process, the fuzzy copy of the source block may not be identical to the source block. Any additional modifications or updates made to the source block will not be reflected in the fuzzy copy. Process 1300 can then proceed to step 1314.
[0272] In step 1314, process 1300 determines whether there are any additional changes to the source block on the source fragment that have not yet been reflected in the fuzzy copy of the source block stored in the destination fragment. If there are additional changes that have not yet been reflected in the fuzzy copy, then process 1300 may proceed to step 1316. If there are no additional changes that have not yet been reflected in the fuzzy copy, then the fuzzy copy of the source block and the source block are now identical, and process 1300 proceeds to step 1318.
[0273] In step 1316, process 1300 copies any additional changes made to the source block but not yet reflected in the blurred copy into the blurred copy. Process 1300 then returns to step 1314.
[0274] In step 1318, the fuzzy copy of the source block is now identical to the source block, so the fragment director makes the source block on the source fragment read-only. Therefore, the source fragment can still be used for read queries, but cannot be updated. The process then proceeds to step 1320.
[0275] In step 1320, the fragmentation director updates the routing table to indicate that the source block is now available on the destination fragment and is no longer available at the source fragment. Process 1300 then proceeds to step 1322.
[0276] In step 1322, the sharding director notifies the client application that the routing table has been updated. The sharding director sends updated mapping data from the routing table to indicate that the source block is now available at the destination shard instead of the source shard. Therefore, if a query requests access to data stored in the block, the client application can use the updated mapping data included in the notification to route subsequent queries directly to the destination shard. Processing 1300 can then proceed to step 1324.
[0277] In step 1324, the fragmentation director can instruct the source fragment to remove the source block from the source fragment, thereby freeing up storage space on the source fragment. Processing 1300 can then end.
[0278] Route lookup
[0279] Most queries in typical OLTP client applications are short and must be processed with millisecond latency. The additional network hops and resolution during query routing to the appropriate shards can introduce unacceptable latency for OLTP client applications. This disclosure provides techniques for minimizing latency when routing queries sent from client applications.
[0280] return Figure 2Client application 210 will need to make data requests to sharded database 250. In some cases, data requests from client application 210 will require data from a single shard. This type of data request is called a single-shard query. Single-shard queries will represent most data requests in a typical OLTP client application because shards 260A, 260B, and 260C have been configured such that blocks in each shard contain corresponding partitions of tables from a table family. Therefore, most queries that depend on data from a table family will likely be served by a single shard because the relevant data for that table family is collocated on the same shard. Similarly, using duplicate tables for relatively small and / or static reference tables increases the likelihood that queries will be processed as single-shard queries.
[0281] In other cases, data requests from client application 210 will require data from multiple shards. This type of data request is called a cross-shard query. Processing cross-shard queries is generally slower than processing single-shard queries because it requires joining data from multiple shards. Cross-shard queries are often used to generate reports and collect statistics that require data from multiple shards. Such cross-shard queries are usually not performance-critical.
[0282] As previously described, sharded directory 230 maintains a routing table that maps the list of blocks hosted by each shard to a range of hash values associated with each block. Therefore, the routing table can be used to determine which shard contains blocks that include data for the shard key, or a set of shard key values. In one embodiment, where the database is sharded via composite sharding, the routing table may also include mapping information for combinations of shard keys and supershard keys. Thus, in the case of a composite sharded database, for a given set of shard key values, the routing table can be used to determine which shard contains blocks that include data for that set of shard key values.
[0283] In one embodiment, the routing table maintained by shard directory 230 is copied and accessible by the shard coordinator. The shard coordinator is a component that helps route queries to the appropriate shards. In one embodiment, the shard coordinator is implemented as part of shard directories 220. In another embodiment, the shard coordinator is implemented in client application 210. In another embodiment, the shard coordinator is implemented on one or more of each individual shard 260 in shard database 250. In another embodiment, the shard coordinator is implemented as a component separate from shard directories 220 and the individual shards 260. For example, the shard coordinator can be implemented as a software component external to shard directories 220 and the individual shards 260. In one embodiment, this software component can be part of shard database 250. In another embodiment, the software component can be external to shard database 250. In one embodiment, the software component can be external to both shard database 250 and client application 210.
[0284] In one embodiment, the sharding coordinator functionality can be distributed across multiple software components S1 to SN existing between the client application 210 and the sharding database 250. The software components S1 to SN can have varying levels of accessibility to the client application 210 and / or the sharding database 260. This accessibility reflects various communication characteristics, including but not limited to physical proximity, bandwidth, availability of computing resources, workload, and other characteristics that will affect the accessibility of the software components S1 to SN.
[0285] In one embodiment, software component S1 may be more easily accessible to client application 210 than software component S2. Similarly, software component S2 may be more easily accessible to client application 210 than software component S3, and so on. Therefore, software component S1 is considered closest to client application 210 because it is the most easily accessible to client application 210, and software component S1 is considered farthest from client application 210 because it is the least accessible to client application 210. In one embodiment, when a database request is created at client application 210 requiring processing by a sharding coordinator, the available software component closest to client application 210 is used as the sharding coordinator to attempt to process the database request. If the available software component closest to client application 210 cannot process the database request, then the next closest software component is tried, and so on, until the database request is successfully processed by the sharding coordinator. For example, if a software component does not have enough mapping data to correctly route the database request, then it may be unable to process the database request. By using the available software component closest to the client application 210 to act as the sharding coordinator that will handle database requests, the system can provide improved performance when processing requests because the closest available software component has improved accessibility compared to other software components.
[0286] Proxy-based query routing for single-shard queries with known keys
[0287] In one embodiment, client application 210 cannot directly route queries in a database request to a shard because it cannot independently determine which target shard(s) contains the data required to process the query. Client application 210 can use a shard coordinator to perform proxy-based routing to one or more destination shard(s).
[0288] Figure 14 A flowchart of process 1400 for proxy-based query routing for a single-shard query according to one embodiment is shown. In step 1402, client application 210 sends a database request including a single-shard query to the shard coordinator. The database request includes a shard key and, optionally, a supershard key. Process 1400 then proceeds to step 1404.
[0289] In step 1404, the shard coordinator can use the key and routing table to determine which target shard contains the data required to process a single-shard query for a database request. Processing 1400 then proceeds to step 1406.
[0290] In step 1406, the sharding coordinator directly redirects client application 210 to the target shard containing the data required to process a single-shard query. By redirecting client application 210 to the target shard instead of simply returning the target shard's identifier to client application 210, process 1400 removes unnecessary hops in processing single-shard queries, thereby improving system latency. Process 1400 then proceeds to step 1408.
[0291] In step 1408, client application 210 can directly execute a single-shard query on the target shard and directly receive the query results from the target shard. Processing 1400 then proceeds to step 1410.
[0292] In step 1410, the target shard may optionally return mapping data identifying all key ranges stored in the specific shard. This mapping data may be cached directly by the client application 210 or cached in a connection pool accessible to the client application 210. The mapping data will allow the client application 210 to route subsequent queries with shard keys matching the cached mapping data directly to the target shard without consulting the shard coordinator. This will improve the performance of subsequent database requests to the target shard. Processing 1400 can then conclude.
[0293] Proxy-based query routing for unknown sharding keys
[0294] Figure 15 illustrates a flowchart of process 1500 for proxy-based query routing when a database request sent by client application 210 does not include a shard key. For example, this process 1500 can be used when accessing a sharded database using a legacy client application and the database request has not yet been updated to include the shard key. Alternatively, this process 1500 can be used when client application 210 is performing a cross-shard query that requires data from multiple shards.
[0295] In step 1502, client application 210 sends a database request, including a query, to the sharding coordinator, which does not include the sharding key. Processing 1500 then proceeds to step 1504.
[0296] In step 1504, the sharding coordinator parses the query in the database request and determines the value of the sharding key for the query. The sharding coordinator can then compare the sharding key with the routing table to determine the target shard(s) containing the data required to process the query. Processing 1500 then proceeds to step 1506.
[0297] In step 1506, the shard coordinator sends a query to each shard identified in step 1504. Processing 1500 then proceeds to step 1508.
[0298] In step 1508, each shard in (one or more) of the shards processes the query and returns the result of the processed query to the shard coordinator. Processing then proceeds to step 1510.
[0299] In step 1510, the shard coordinator aggregates and filters the results received in step 1508 from one or more shards. Furthermore, the shard coordinator returns the aggregated results to the client application 210. Processing 1500 can then terminate.
[0300] In another embodiment, Figure 15B A flowchart is shown for process 1501 of proxy-based query routing when a database request sent by client application 210 does not include a sharding key. In step 1522, client application 210 sends a database request including the query to the sharding coordinator that does not include the sharding key. Process 1501 then proceeds to step 1524.
[0301] In step 1524, the sharding coordinator parses the query in the database request and determines the value of the sharding key for the query. The sharding coordinator then compares the sharding key with the routing table to determine the shard(s) containing the data needed to process the query. Process 1501 then proceeds to step 1526.
[0302] In step 1526, the shard coordinator sends the queried predicate to each shard identified in step 1524. Processing 1501 then proceeds to step 1528.
[0303] In step 1528, each of the (one or more) shards processes the predicate and sends the predicate result to the shard coordinator. Processing then proceeds to step 1530.
[0304] In step 1530, the shard coordinator receives predicate results from one or more shards and then applies the query to the predicate results. The final result of applying the query to the predicate results can then be returned to the client application 210. Processing 1501 can then end.
[0305] Direct routing to shards using cached mapped data.
[0306] In one embodiment, client application 210 may use cached mapping data to route queries directly to a specific shard 260. In this example, client application 210 may use mapping data cached within client application 210 or in a connection pool accessible to client application 210 to determine which specific shard 260 contains data for a specific query. For example, when a client connection pool for client application 210 first connects to a specific shard, it may need to use a shard coordinator to route the first query because client application 210 does not know which specific shard is needed to handle the first query. The specific shard can handle the first query and return a response. In addition to providing a response to the first query, the specific shard also returns mapping data identifying all key ranges stored in that specific shard. Client application 210 or the connection pool may cache the key range mappings for that specific shard.
[0307] Therefore, for subsequent queries that require data stored on a specific shard, client application 210 can use cached mapping data to route the query directly to that specific shard without accessing an intermediate shard coordinator. Thus, in the case of a single-shard query, client application 210 can route subsequent queries directly to the specific shard and return the results directly to client application 210. Direct routing to the shard using a known key provides improved latency because client application 210 can directly access the shard without consulting an intermediate shard coordinator to process the query.
[0308] In one embodiment, when a query is routed directly to a shard, predicates can be automatically added to the query to restrict the query output by the shard to the provided key(s)(s). For example, if "cust_id=123" is specified when establishing a database connection, and then "SELECT * FROM customers" is executed through that connection, the query can be rewritten as "SELECT * FROM customers where cust_id=123". The query can be rewritten by the client application, connection pool, or shard coordinator. This behavior is optional and can be controlled by user-defined parameters used for sharded databases.
[0309] Further details regarding direct routing queries between client application 210 and shards can be found in U.S. Patent Application No. 15 / 133,972 (“SYSTEM AND METHOD FOR PROVIDING DIRECT ACCESS TO A SHARDED DATABASE”), filed April 20, 2016, and U.S. Patent Application No. 15 / 133,979 (“SYSTEM AND METHOD FOR PROVIDING ACCESS TO A SHARDED DATABASE USING ACACHE AND A SHARD TECHNOLOGY”), filed April 20, 2016, the contents of each of which are incorporated herein by reference as if fully set forth herein.
[0310] Direct routes to shards are resolved by the client based on the query.
[0311] In one embodiment, client application 210 can parse a query to determine the shard(s) ...
[0312] Combined processing for query routing
[0313] In one embodiment, various proxy-based routing techniques and direct routing techniques can be combined into a combined processing for query routing. This combined processing uses direct routing when the client application 210 is able to perform direct routing, and uses proxy-based routing when the client application 210 is unable to perform direct routing. This combined processing provides the improved latency benefits of direct routing while retaining the ability to perform cross-shard queries using proxy-based routing and the ability to access one or more shards when the client application 210 does not know which shard(s)(s)(s) contain the data necessary to process the query.
[0314] Figure 16An example flowchart for the combined processing of query routing is shown. In step 1601, client application 210 determines whether it can access the sharding key. If the client application can access the sharding key, then process 1600 can proceed to step 1602. If client application 210 cannot access the sharding key, then process 1600 can proceed to step 1612.
[0315] In step 1602, client application 210 determines whether it can access cached mapping data for the sharding key. The cached mapping data is mapping data that includes a range of key values for one or more specific shards in the sharded database. If client application 210 can use the sharding key to determine from the cached mapping data which shard(s)(s) contains the data necessary to process the query, then process 1600 can proceed to step 1604. If client application 210 cannot use the sharding key to determine from the cached mapping data which shard(s)(s) contains the data necessary to process the query, then process 1600 can proceed to step 1606.
[0316] In step 1604, client application 210 can directly connect to one or more shards containing the data necessary to process the query and execute the query directly on one or more shards. The one or more shards return the query results directly to client application 210. Processing 1600 can then end.
[0317] In step 1606, client application 210 can determine whether the query is simple enough to determine which shard(s)(s) contains the data necessary to process the query. For example, if the query includes a "SELECT *" statement, then client application 210 can determine that the query needs to be processed by all(one or more) shard(s) in the sharded database. If client application 210 determines that the query is simple enough, then process 1600 can proceed to step 1608; otherwise, process 1600 can proceed to step 1612.
[0318] In step 1608, client application 210 can analyze the query to determine which target shard(s) contains the data necessary to process the query. Processing 1600 can then proceed to step 1610.
[0319] In step 1610, client application 210 can directly connect to one or more shards containing the data necessary to process the query and execute the query directly on one or more shards. The one or more shards return the query results directly to client application 210. Processing 1600 can then terminate.
[0320] In step 1612, client application 210 can send the query to the shard coordinator to assist in proxy-based routing of the query. Processing 1600 can then proceed to step 1614.
[0321] In step 1614, the sharding coordinator can analyze the query to determine the sharding key. The sharding coordinator can use the sharding key and the routing table to determine which shard(s)(s) contain the data necessary to process the query. Process 1600 can then proceed to step 1616.
[0322] In step 1616, the shard coordinator can determine whether the query is a single-shard query. If the query is a single-shard query, then process 1600 and proceed to step 1620. If the query is a cross-shard query, then process 1600 and proceed to step 1618.
[0323] In step 1618, the shard coordinator facilitates query processing for cross-shard queries. In one embodiment, the shard coordinator sends a query to each of one or more target shards, and one or more target shards process the query. The one or more target shards return the results to the shard coordinator, which aggregates and filters the results. The shard coordinator sends the final aggregated result of the cross-shard query back to the client application 210. Processing 1600 can then terminate.
[0324] In step 1620, the shard coordinator can redirect client application 210 to the target shard containing the data necessary to process a single-shard query. Process 1600 can then proceed to step 1622.
[0325] In step 1622, client application 210 can directly connect to the target shard and execute a query on the target shard. The result of the query is returned directly from the target shard to client application 210. Processing can then terminate at step 1600.
[0326] Automatic propagation of pattern changes
[0327] Database schema changes can be made via Data Definition Language (DDL). In sharded databases with a large number of shards, an efficient way to propagate schema changes across shards is crucial. In some conventional sharded databases, cross-shard schema changes are accomplished using scripts or applications manually written by the client. By manually connecting to each shard and applying code or scripts to each shard, the application code or script is used to propagate the schema change across all shards in the sharded database. This application code or script is not as uniform or robust as DDL. Moreover, this code or script is prone to user errors. In other sharded databases, cross-shard schema changes are manually accomplished by the database administrator applying the DDL for schema change individually to each shard. This approach is undesirable because some sharded databases contain hundreds or thousands of shards, and this manual work by the database administrator is time-consuming and error-prone.
[0328] This system provides automatic propagation of database schema changes using DDL to shards within a sharded database. In one embodiment, shard directory 230 can retain a history of received schema modification instructions. Each schema modification instruction is written as a DDL request. These DDL requests can be directly entered by the database administrator at shard directory 230.
[0329] Shard directory 230 can maintain a historical log indicating the status of applying DDL requests to each specific shard in the sharded database. Therefore, historical logic indicates which DDL requests have been applied to shards and which are awaiting application. In some cases, DDL requests may have been applied to some shards, but not all. In one embodiment, the database administrator can instruct a schedule on when pending DDL requests should be applied to the sharded database. The schedule can be sent to shard director 220, and shard director 220 can delay the execution of pending DDL requests until the time specified in the schedule.
[0330] In one embodiment, shard directory 230 may push pending DDL requests to shard directories 220. Shard directories 220 are responsible for applying pending DDL requests at shard 260. Shard directories 220 may have access to historical logs indicating which shards have had any historical DDL requests applied and the history of pending DDL requests.
[0331] Shard director 220 can act as a database client associated with shard 260. Therefore, shard director 220 can receive pending DDL requests from shard directory 230, connect to each of the shards 260, and execute the pending DDL requests for each of the shards 260. In one embodiment, the execution of schema changes on each shard is performed in parallel. In one embodiment, shard 260 can return status messages to shard director 220 to indicate whether pending DDL requests were successfully executed on a given shard. Shard director 220 can relay the received status messages to shard directory 230 so that database administrators can view them. Similarly, historical logs can be updated to indicate whether DDL requests have been successfully applied to a specific shard.
[0332] In one embodiment, shard director 220 can act as a listener capable of receiving notifications providing status information from shard 260. The status information may include whether a shard is available or unavailable. In one embodiment, shard director 220 may delay applying pending DDL requests to a shard until that shard becomes available. For example, if the sharded database has three shards A, B, and C, shards A and B are available, but shard C is unavailable. Shard director 220 may immediately apply DDL requests to shards A and B and update the history log accordingly. The shard director may then wait until shard C becomes available, as it will receive status updates from shard C when it becomes available. Once shard C becomes available, shard director 220 can use the history log to identify all pending DDL requests that have not yet been applied to shard C and apply them sequentially until the schema of shard C is fully updated with all outstanding schema modification instructions.
[0333] In another embodiment, the status information may include an indication of the workload the shard is experiencing. In one embodiment, the shard director 220 may delay applying pending DDL requests to shards until all shards in the shard database have a workload less than a threshold setting. By delaying the application of pending DDLs, the shard director 220 can ensure that DDLs are applied to shards in the shard database only if the performance impact is acceptable.
[0334] When the sharding director 220 executes a given DDL request on a shard in a sharded database, the shard has sufficient metadata stored on it to identify that the shard is simply a database instance within a larger sharded database. Therefore, the shard can execute a DDL request by only executing the portion of the DDL request that will affect the shard. For example, suppose the DDL request is a "CREATE SHARDED TABLE" request. In a normal database instance, a "CREATE TABLE" request would create a fully-formed, standalone database table along with the associated metadata for the table data objects. However, in the shards of this system, when it receives a "CREATE SHARDED TABLE" DDL request, instead of creating a fully-formed, standalone database table, the shard can use the stored metadata about the configuration of the sharded database as a whole to create partitions of the sharded table. These partitions will be stored in blocks found on the shard itself, without having to create the rest of the table, which would be stored in blocks located in different shards. In this way, DDL propagation can improve system performance by allowing shards to execute only the parts of their own shards affected by DDL requests, without having to execute the full content of the DDL requests.
[0335] This system allows database administrators to automatically perform schema changes across large sets of shards in a sharded database. The administrator simply provides the DDL specifying the schema change to the shard directory, and these changes are automatically propagated to every shard in the sharded database for execution. This is not done using application code or scripts to propagate the DDL requests, nor does it require the database administrator to manually connect to each individual shard to propagate the DDL schema changes.
[0336] grammar
[0337] While this disclosure provides various examples of the syntax for creating, managing, and manipulating sharded databases, these examples are merely illustrative. This system can be implemented using existing relational database coding languages or query languages such as Structured Query Language (SQL). This means that legacy systems can be easily upgraded, migrated, or connected to systems that include the sharded database teachings described herein, as no major changes to SQL will be required. The use of Data Manipulation Language (DML) requires no modification to take advantage of the system's benefits. Furthermore, the use of DDL requires only minor changes to support the keywords necessary to implement the sharded organization of the sharded database.
[0338] Database Overview
[0339] The embodiments of the present invention are used in the context of a database management system (DBMS). Therefore, a description of an example DBMS is provided.
[0340] Generally, a server, such as a database server, is a combination of integrated software components and the allocation of computing resources (such as memory, nodes, and processes on those nodes for executing the integrated software components), where the combination of software and computing resources is dedicated to providing specific types of functionality on behalf of clients. A database server governs and facilitates access to a specific database, thereby handling client requests to access the database.
[0341] A database comprises data and metadata stored on persistent storage mechanisms, such as collections of hard disks. For example, depending on the relational and / or object-relational database structure, this data and metadata can be logically stored in the database.
[0342] Users interact with the database server by submitting commands to the database server of the DBMS. These commands cause the database server to perform operations on the data stored in the database. A user can be one or more applications running on the client computer that interacts with the database server. Multiple users may also be collectively referred to as users in this document.
[0343] Database commands can take the form of database statements. For a database server to process a database statement, the statement must conform to a database language supported by the database server. A non-limiting example of a database language supported by many database servers is SQL, including proprietary forms of SQL supported by database servers such as Oracle (e.g., Oracle Database 11g). SQL Data Definition Language (“DDL”) instructions are issued to the database server to create or configure database objects, such as tables, views, or complex types. Data Manipulation Language (“DML”) instructions are issued to the DBMS to manage data stored within database structures. For example, SELECT, INSERT, UPDATE, and DELETE are common examples of DML instructions found in some SQL implementations. SQL / XML is a common extension of SQL used when manipulating XML data in object-relational databases.
[0344] Generally, data is stored in a database within one or more data containers, each container containing records, and the data within each record is organized into one or more fields. In relational database systems, data containers are typically called tables, records are called rows, and fields are called columns. In object-oriented databases, data containers are typically called object classes, records are called objects, and fields are called attributes. Other database architectures may use other terms. The systems implementing this invention are not limited to any particular type of data container or database architecture. However, for purposes of explanation, the examples and terminology used herein will be those commonly associated with relational databases or object-relational databases. Therefore, the terms “table,” “row,” and “column” are used herein to refer to data containers, records, and fields, respectively.
[0345] A multi-node database management system consists of interconnected nodes that share access to the same database. Typically, nodes are interconnected via a network and share access to shared storage devices to varying degrees; for example, they share access to a set of disk drives and the blocks of data stored thereon. Nodes in a multi-node database system can be in the form of a group of computers (e.g., workstations, personal computers) interconnected via a network. Alternatively, nodes can be nodes in a grid, where the grid consists of nodes in the form of server blades interconnected with other server blades on a rack.
[0346] In a multi-node database system, each node hosts a database server. A server, such as a database server, is a combination of integrated software components and computing resources (such as memory, nodes, and processes on the nodes for executing the integrated software components on the processor), where the combination of software and computing resources is dedicated to performing specific functions on behalf of one or more clients.
[0347] Resources from multiple nodes in a multi-node database system can be allocated to run software for a specific database server. Each combination of software and resource allocation from nodes is a server referred to herein as a "server instance" or "instance". A database server may include multiple database instances, some or all of which run on separate computers (including separate server blades).
[0348] Query optimization and execution plan
[0349] Query optimization generates one or more distinct candidate execution plans for a query. The query optimizer evaluates these candidate execution plans to determine which one should be used to compute the query.
[0350] An execution plan can be represented by a graph of interconnected nodes (referred to as operators or row sources in this paper), each node corresponding to a step in the execution plan (referred to as an execution plan operation in this paper). The hierarchical structure of the graph (i.e., a directed tree) represents the execution order of the execution plan operations and how data flows between each execution plan operation. Execution plan operators produce a set of rows (which may be referred to as a table) as output, and execution plan operations include, for example, table scans, index scans, sort-merge joins, nested-loop joins, filters, and importantly, complete outer joins.
[0351] The query optimizer can optimize queries by transforming them. Generally, query transformation involves rewriting a query into another semantically equivalent query that should produce the same results and potentially be executed more efficiently—that is, a query for which a potentially more efficient and less costly execution plan can be generated. Examples of query transformations include view merging, subquery unnesting, predicate moving and pushdown, common subexpression elimination, outer-to-inner join transformation, materialized view rewriting, and star schema transformation.
[0352] Hardware Overview
[0353] Now for reference Figure 17 , Figure 17 This is a block diagram illustrating a basic computing device 1700 in which one or more exemplary embodiments of the present invention may be embodied. The computing device 1700 and its components, including their connections, relationships, and functions, are intended to be exemplary only and are not intended to limit the implementation of one or more exemplary embodiments. Other computing devices suitable for implementing one or more exemplary embodiments may have different components, including components with different connections, relationships, and functions.
[0354] The computing device 1700 may include a bus 1702 or other communication mechanisms for addressing the main memory 1706 and for transferring data between the various components of the device 1700.
[0355] The computing device 1700 may also include one or more hardware processors 1704 coupled to the bus 1702 for processing information. The hardware processor 1704 may be a general-purpose microprocessor, a system-on-a-chip (SoC), or other processor.
[0356] Main memory 1706, such as random access memory (RAM) or other dynamic storage devices, may also be coupled to bus 1702 for storing information and software instructions to be executed by processor(s) 1704. Main memory 1706 may also be used to store temporary variables or other intermediate information during the execution of software instructions to be executed by processor(s) 1704.
[0357] When software instructions are stored in storage media accessible to processor(s) 1704, they cause computing device 1700 to become a dedicated computing device customized to perform the operations specified in the software instructions. The terms “software,” “software instructions,” “computer program,” “computer-executable instructions,” and “processor-executable instructions” should be interpreted broadly to cover any machine-readable information (whether human-readable or not) used to instruct computing devices to perform specific operations, and include, but are not limited to, application software, desktop applications, scripts, binaries, operating systems, device drivers, bootloaders, shells, utilities, system software, JavaScript, web pages, web applications, plug-ins, embedded software, microcode, compilers, debuggers, interpreters, virtual machines, linkers, and text editors.
[0358] The computing device 1700 may also include a read-only memory (ROM) 1708 or other static storage device coupled to the bus 1702 for storing static information and software instructions for one or more processors 1704.
[0359] One or more mass storage devices 1710 may be coupled to bus 1702 for persistently storing information and software instructions on a fixed or removable medium such as magnetic, optical, solid-state, magneto-optical, flash memory, or any other available mass storage technology. The mass storage device may be shared on a network or may be a dedicated mass storage device. Typically, at least one of the mass storage devices 1710 (e.g., the device's main hard disk) stores the body of programs and data used to direct the operation of the computing device, including operating systems, user applications, drivers, and other supporting files, as well as various other data files.
[0360] The computing device 1700 may be coupled to a display 1712, such as a liquid crystal display (LCD) or other electronically visible display, via a bus 1702 for displaying information to a computer user. In some configurations, a touch-sensitive surface incorporating touch detection technology (e.g., resistive, capacitive, etc.) may be overlaid on the display 1712 to form a touch-sensitive display for transmitting touch gestures (e.g., fingers or styluses) input to one or more processors 1704.
[0361] Input device 1714, including alphanumeric keys and other keys, can be coupled to bus 1702 for transmitting information and command selection to processor 1704. In addition to or as a substitute for alphanumeric keys and other keys, input device 1714 may also include one or more physical buttons or switches, such as, for example, a power (on / off) button, a "home" button, a volume control button, etc.
[0362] Another type of user input device can be a cursor control 1716 (such as a mouse, trackball, or cursor arrow keys) for transmitting directional information and command selection to the processor 1704 and for controlling cursor movement on the display 1712. Such input devices typically have two degrees of freedom in two axes (a first axis (e.g., x) and a second axis (e.g., y)) that allow the device to specify a position in a plane.
[0363] Although in some configurations (such as Figure 17 In the illustrated configuration, one or more of the display 1712, input device 1714, and cursor control 1716 are external components (i.e., peripheral devices) of the computing device 1700, but some or all of the display 1712, input device 1714, and cursor control 1716 are integrated as part of the form factor of the computing device 1700 in other configurations.
[0364] The functions of the disclosed systems, methods, and modules can be executed by computing device 1700 in response to processor(s) 1704 executing one or more programs containing software instructions stored in main memory 1706. These software instructions can be read into main memory 1706 from another storage medium, such as storage(s) 1710. Execution of the software instructions stored in main memory 1706 causes processor(s) 1704 to perform the functions of one or more example embodiments.
[0365] While the functionality and operation of one or more example embodiments may be implemented entirely with software instructions, the hardwired or programmable circuitry system of computing device 1700 (e.g., ASIC, FPGA, etc.) may be used in other embodiments, in place of or in combination with software instructions, to perform functions according to the requirements of the particular implementation at hand.
[0366] As used herein, the term "storage medium" refers to any non-transitory medium that stores data and / or software instructions that enable a computing device to operate in a particular manner. Such storage media can include non-volatile media and / or volatile media. Non-volatile media include, for example, non-volatile random access memory (NVRAM), flash memory, optical discs, hard disks, or solid-state drives, such as storage device 1710. Volatile media include dynamic memory, such as main memory 1706. Common forms of storage media include, for example, floppy disks, flexible disks, hard disks, solid-state drives, magnetic tape, or any other magnetic data storage media, CD-ROMs, any other optical data storage media, any physical medium with a perforated pattern, RAM, PROMs and EPROMs, FLASH-EPROMs, NVRAM, flash memory, any other memory chips, or magnetic tape cassettes.
[0367] Storage media differ from transmission media but can be used in conjunction with them. Transmission media participate in transferring information between storage media. For example, transmission media include coaxial cables, copper wires, and optical fibers, including lines containing bus 1702. Transmission media can also take the form of sound waves or light waves, such as those generated during radio data communication and infrared data communication.
[0368] Carrying one or more sequences of software instructions to processor(s) 1704 for execution can involve various forms of media. For example, the software instructions may initially be carried on a disk or solid-state drive of a remote computer. The remote computer may load the software instructions into its dynamic memory and transmit them via a telephone line using a modem. A modem located locally on computing device 1700 may receive data over the telephone line and convert the data into an infrared signal using an infrared transmitter. An infrared detector may receive the data carried in the infrared signal, and appropriate circuitry may place the data on bus 1702. Bus 1702 carries the data to main memory 1706, from which processor(s) 1704 retrieves and executes the software instructions. The software instructions received by main memory 1706 may optionally be stored on storage device(s) 1710 before or after execution by processor(s) 1704.
[0369] The computing device 1700 may also include one or more communication interfaces 1718 coupled to a bus 1702. The communication interface 1718 provides bidirectional data communication coupled to a wired or wireless network link 1720 connected to a local network 1722 (e.g., Ethernet, wireless LAN, cellular telephone network, Bluetooth wireless network, etc.). The communication interface 1718 transmits and receives electrical, electromagnetic, or optical signals carrying digital data streams representing various types of information. For example, the communication interface 1718 may be a wired network interface card, a wireless network interface card with an integrated radio antenna, or a modem (e.g., ISDN, DSL, or cable modem).
[0370] One or more network links 1720 typically provide data communication to other data devices via one or more networks. For example, network link 1720 may provide a connection to host computer 1724 or to data equipment operated by Internet Service Provider (ISP) 1726 via local network 1722. ISP 1726, in turn, provides data communication services via a global packet data communication network now commonly referred to as the “Internet” 1728. One or more local networks 1722 and Internet 1728 use electrical, electromagnetic, or optical signals carrying digital data streams. Signals carrying digital data to and from computing device 1700 via various networks, as well as signals on one or more network links 1720 and via one or more communication interfaces 1718, are example forms of transmission media.
[0371] Computing device 1700 can send messages and receive data, including program code, via one or more networks, one or more network links 1720, and one or more communication interfaces 1718. In the Internet example, server 1730 can send request codes for applications via the Internet 1728, ISP 1726, one or more local networks 1722, and one or more communication interfaces 1718.
[0372] The received code may be executed by processor 1704 upon receipt and / or stored in storage device 170 or other non-volatile storage device for subsequent execution.
[0373] Software Overview
[0374] Figure 18 This is a block diagram of a basic software system 1800 that can be used to control the operation of a computing device 1700. The software system 1800 and its components, including their connections, relationships, and functions, are intended to be exemplary only and are not intended to limit the implementation of one or more example embodiments. Other software systems suitable for implementing one or more example embodiments may have different components, including components with different connections, relationships, and functions.
[0375] The software system 1800 is provided to guide the operation of the computing device 1700. The software system 1800, which may be stored in system memory (RAM) 1706 and on fixed storage devices (e.g., hard disk or flash memory) 1710, includes a kernel or operating system (OS) 1810.
[0376] OS 1810 manages the low-level aspects of computer operations, including managing process execution, memory allocation, file input and output (I / O), and device I / O. One or more applications, designated 1802A, 1802B, 1802C...1802N, can be "loaded" (e.g., transferred from fixed storage device 1710 to memory 1706) for execution by system 1800. Applications or other software intended for use on device 1800 can also be stored as downloadable computer-executable instruction sets, for example, for downloading and installing from Internet locations (e.g., web servers, app stores, or other online services).
[0377] Software system 1800 includes a graphical user interface (GUI) 1815 for receiving user commands and data graphically (e.g., "click" or "touch gestures"). These inputs can then be acted upon by system 1800 according to instructions from operating system 1810 and / or (one or more) applications 1802. GUI 1815 also displays the results of actions from OS 1810 and (one or more) applications 1802, on which the user can provide additional input or terminate the session (e.g., log off).
[0378] OS 1810 can execute directly on the raw hardware 1820 of device 1700 (e.g., one or more processors 1704). Alternatively, a hypervisor or virtual machine monitor (VMM) 1830 can be interposed between the raw hardware 1820 and OS 1810. In this configuration, VMM 1830 acts as a software "buffer" or virtualization layer between OS 1810 and raw hardware 1820 of device 1700.
[0379] VMM 1830 instantiates and runs one or more virtual machine instances (“guest machines”). Each guest machine includes a “guest” operating system (such as OS 1810) and one or more applications (such as Application 1802) designed to run on the guest operating system. VMM 1830 presents a virtual operating platform to the guest operating system and manages the execution of the guest operating system.
[0380] In some cases, VMM 1830 can allow a guest operating system to run as if it were running directly on the raw hardware 1820 of device 1700. In these cases, the same version of the guest operating system configured to run directly on the raw hardware 1820 can also run on VMM 1830 without modification or reconfiguration. In other words, in some situations, VMM 1830 can provide full hardware and CPU virtualization to the guest operating system.
[0381] In other cases, the guest operating system can be specially designed or configured to run on VMM 1830 for improved efficiency. In these cases, the guest operating system "knows" that it is running on the virtual machine monitor. In other words, in some situations, VMM 1830 can provide para-virtualization to the guest operating system.
[0382] The basic computer hardware and software described above are given to illustrate the basic underlying computer components that can be used to implement one or more example embodiments. However, the one or more example embodiments are not necessarily limited to any particular computing environment or computing device configuration. Rather, the one or more example embodiments can be implemented in any type of system architecture or processing environment that a person skilled in the art will understand from this disclosure as capable of supporting the features and functionality of the one or more example embodiments given herein.
[0383] Expansion and Replacement
[0384] While some of the figures described in the foregoing specification include flowcharts with steps shown in a certain order, these steps can be performed in any order, and are not limited to the order shown in those flowcharts. Furthermore, some steps may be optional, may be performed multiple times, and / or may be performed by different components. All steps, operations, and functions in the flowcharts described herein are intended to indicate operations performed using programming in various embodiments using a dedicated or general-purpose computer. In other words, each flowchart in this disclosure, in conjunction with the related text herein, is a guide, plan, or specification for programming a computer to perform all or part of the described functions. The level of skill in the art associated with this disclosure is known to be high, and therefore the flowcharts and related text in this disclosure are prepared to convey information with the sufficiency and detail typically expected in the art when those skilled in the art communicate among themselves regarding programs, algorithms, and their implementations.
[0385] In the foregoing description, numerous specific details have been described of one or more exemplary embodiments of the invention. However, these details may vary depending on the requirements of a particular implementation. Therefore, the exemplary embodiments(s) should be considered illustrative rather than restrictive.
Claims
1. A method comprising: Multiple mapping entries are stored by the sharding coordinator; Each of the multiple mapping entries maps a different set of shard key-value pairs to a shard in multiple shards of the sharded database; At least one of the plurality of fragments is mapped to a fragment key value set that includes non-contiguous fragment key values; Each of the plurality of shards includes one or more database servers; The first query is received from the client application at the sharding coordinator, wherein the first query includes a first sharding key value; In response to receiving the first query, execute: Identify a specific mapping entry among the plurality of mapping entries that corresponds to the set of fragment key values including the first fragment key value; The specific mapping entry is used to identify the target fragment among the plurality of fragments that corresponds to the fragment key value set including the first fragment key value; A first result for a first query is generated by using the target shard corresponding to the set of shard key values that includes the first shard key value; as well as Send a response to the first query based on the first result to the client application; Each mapping entry maps the shard key value set to a corresponding shard that is a shard among the plurality of shards that includes a block associated with the shard key value set; Generating a first result for a first query includes generating a first result for a first query by using a block at the target shard corresponding to the set of shard key values that includes the first shard key value; The first partition of the first table of the block storage table family and the second partition of the second table of the table family are used to partition the first partition and the second partition using the same partition key as the sharding key. The method further includes, In response to receiving a first query, the shard coordinator coordinates the processing of the first query for the target shard. as well as The sending of the response to the first query based on the first result to the client application is performed by the sharding coordinator.
2. The method of claim 1, wherein coordinating the processing of the first query for the target shard by the shard coordinator includes: Send the first query from the shard coordinator to the target shard; A first query for the target shard is executed by using a block at the target shard that corresponds to the set of shard key values that includes the first shard key value, in order to generate a first result; The first result is sent from the target shard to the shard coordinator; as well as The fragment coordinator aggregates the first and second results into an aggregated result. The second result is received from the second fragment by the fragment coordinator; as well as The response to the first query includes the aggregated results.
3. The method of claim 1, wherein coordinating the processing of the first query for the target shard by the shard coordinator includes: Send the first predicate of the first query from the sharding coordinator to the target shard; A first predicate is executed on the target shard by using the block at the target shard corresponding to the set of shard key values that includes the first shard key value, to generate a first result; The first result is sent from the target shard to the shard coordinator; as well as The sharding coordinator executes a first query on the first and second results to generate the final result; The second result is received from the second fragment by the fragment coordinator; as well as The response to the first query includes the final result.
4. The method of claim 1, wherein the first query is written in a query language.
5. The method of claim 4, wherein the query language is Structured Query Language (SQL).
6. The method of claim 1, wherein each of the plurality of shards does not share a processor, memory, or disk storage device with another of the plurality of shards.
7. The method of claim 1, further comprising: For each of the plurality of mapping entries, at least in part, different shard key-value sets are mapped to shards in a plurality of shards of the sharded database based on user-defined instructions for mapping by range or list.
8. The method of claim 1, further comprising: For each of the plurality of mapping entries, at least in part, a consistent hash function applied to the shard key is used to map different sets of shard key values to shards in a sharded database.
9. The method of claim 1, wherein the sharding coordinator is executed outside both the client application and the sharding database.
10. A computing device, comprising: One or more hardware processors; as well as One or more storage media storing instructions that, when executed by the one or more hardware processors, cause the method as described in any one of claims 1 to 9 to be performed.
11. A non-transitory computer-readable medium storing one or more instructions, which, when executed by one or more hardware processors, cause to perform the method as described in any one of claims 1 to 9.
12. An apparatus comprising one or more devices configured to perform the method as described in any one of claims 1 to 9.