The detailed features and advantages of the present invention will be described in detail below in the specific embodiments. The content is sufficient to enable any person skilled in the art to understand the technical content of the present invention and implement it accordingly, and according to the specification, claims and drawings disclosed in this specification Those skilled in the art can easily understand the related objectives and advantages of the present invention.
 In the following, first introduce some terms used in this application, and then refer to figure 1 with figure 2 Introduce some implementation methods of distributed external connections, and then refer to image 3 Introduce the execution mode of the distributed external connection according to the embodiment of the present invention, and finally refer to Figure 4 A method for externally connecting a first table of a database to a second table according to an embodiment of the present invention is introduced.
 The database mentioned in this article usually refers to a row-stored database, which has no dependence on the storage engine structure of the database, the network interconnection method, and the cost model of the optimizer. The database may include various common databases, such as but not limited to OceanBase, MySQL, Oracle, DB2, etc.
 The database join operation refers to the operation of associating related data in the database through certain conditions.
 Outer join operation is a form of join operation, which usually involves outer table and inner table. In the outer join operation, in addition to returning the matching rows of the two tables, it also returns the rows in the outer table that did not find a matching row in the inner table. Common outer join methods include left outer join and right outer join. In the left outer join, the left table is the outer table and the right table is the inner table; and in the right outer join, the right table is the outer table and the left table is the inner table. In the following, unless otherwise specified, "connecting the first database to the second database externally" generally refers to performing external connections with the first database as the external table and the second database as the internal table.
 Database includes centralized database and distributed database. In the connection operation of the distributed database, data redistribution is usually required. Data redistribution usually refers to the process of redistributing data at different physical nodes through a network connection according to certain rules. The general data redistribution methods include: broadcast, random, hash, etc.
 The optimizer is also commonly used in database operations. The optimizer usually refers to the module that generates the execution plan in the database. Generally speaking, the most advanced databases often use optimizers based on execution costs to generate the best execution plan.
 It should be pointed out that the introduction of the above terms is only to help understand the application, and is not intended to limit the scope of the present invention in any way.
 reference figure 1 , Which shows a schematic diagram of a distributed external connection execution mode. figure 1 Two tables T1 and T2 are shown in.
 For the convenience of description, suppose that Table T1 is shown in Table 1 below:
 age subsidy 22 1000 23 1500 24 2000
 Table 1: Table T1
 Assume that Table T2 is shown in Table 2 below:
 Table 2: Table T2
 Suppose the outer join operation is performed on it, for example: select*from T1 left outer join T2 on T1. Age.
 Such as figure 1 As shown, in this connection mode, the data of the outer table T1 and the inner table T2 are redistributed to multiple join operators according to the hash value of the join key. In this way, it is ensured that rows with the same value must appear in the input of the same join operator.
 For example, the operation can be performed in the following manner. First select a column in the table as the hash value, such as "age", and then for each row in the data table, through a predefined hash function, generate the hash value for different ages (the general result is a no Signed integer), and then take the modulus according to the predefined grouping number, and send the row of data to the corresponding partition. For example, for age 22, the generated hash value is 37, and the grouping number is 4 (the result is 1) , The row of data will be sent to the group numbered 1. Grouping is a logical concept, and the data of the same group will be sent to the same worker for execution.
 In the example of the above tables T1 and T2, suppose that age is divided into three groups: , [23,25], , so that all data with age 22 will be sent to the first worker for execution, and all ages The data of 27 will be sent to the second worker for execution, and all data with ages of 23 and 25 will be sent to the third worker for execution. The first, second and third workers can respectively perform corresponding connection operations.
 The above redistribution and connection manners can be performed in any manner known to those skilled in the art, and are not limited to the above manners.
 However, this method requires a large amount of data redistribution, the amount of data to be transmitted is very large, and there are many operations that need to be performed.
 In order to simplify the above operations, an alternative distributed external connection execution method such as figure 2 Shown. in figure 2 In the example, the outer table is broadcast to each node of the inner table. On each node, outer join operations are performed.
 However, this external connection execution method is not without problems. According to the semantics of the outer join, if a row in the outer table does not have a matching row in the inner table, the row needs to be returned and the column related to the inner table in the projection is set to NULL. Therefore, this external connection method may bring wrong results.
 Take the above table T2 as an example, its nodes are shown in the following table 3-6:
 age subsidy 22 Zhang San 22 Li Si
 Table 3: Table T2_1
 age subsidy 23 Wang Wu
 Table 4: Table T2_2
 age subsidy 25 Zhao Liu
 Table 5: Table T2_3
 age subsidy 27 Sun Qi 27 Money eight
 Table 6: Table T2_4
 Using the above outer join execution method, for the row (22,1000) in table T1, (22,1000,'Zhang San') and (22,1000,'李四') will be returned in the connection with table T2_1 ) Two rows, but no matching row can be found in the table T2_2/T2_3/T2_4. At this time, according to the outer join semantics, (22, 1000, NULL) will be returned. The outer join should not return NULL when there is a match in the row of the outer table. Therefore, the execution result of this method will result in extra rows containing NULL, resulting in incorrect results. In addition, this method also generates a large number of tables containing NULL, which brings a lot of redundant data.
 For example, for the line (22,1000), the final result returned by the distributed plan is
 (22,1000, ‘Zhang San’) (from T2_1)
 (22,1000,‘Li Si’) (from T2_1)
 (22,1000,NULL) (from T2_2)
 (22,1000,NULL) (from T2_3)
 (22,1000,NULL) (from T2_4)
 But according to outer join semantics, the correct result should be:
 (22,1000, ‘Zhang San’) (from T2_1)
 (22,1000,‘Li Si’) (from T2_1)
 Another alternative distributed external connection execution method such as image 3 Shown. image 3 Way with figure 2 Similar, the difference is that at each node, internal connections are performed instead of external connections. There are also problems with this way of execution. Take the above example as an example. For the row (24,2000) in table T1, since no matching row can be found in each node of table T2, the corresponding inner join will return an empty result, resulting in the final result The execution result of is missing the corresponding data for the row.
 Reference below Figure 4 , Which shows a schematic diagram of the execution mode of the distributed external connection according to the embodiment of the present invention.
 In one embodiment, the outer table T1 is broadcast to each node of the inner table T2. It can be avoided by broadcasting figure 1 The large amount of data in the example is redistributed, compared to figure 1 The execution method has advantages.
 versus figure 2 In the preferred embodiment of the present invention, the outer connection is connected to the corresponding node of the inner watch on one or more nodes, and the outer connection is connected to the inner watch on other nodes. Connect to the corresponding node of the internal table. By reducing the number of nodes that perform outer join operations, the number of rows containing NULL is greatly reduced, thereby reducing the amount of redundant data, so compared to figure 2 The execution method also has advantages. In addition, since outer joins are performed on at least one node, there will be no missing data, so compared to image 3 The execution method also has advantages.
 Preferably, the outer join operation is performed on only one node of the inner table, and the inner join operation is performed on other nodes. Alternatively, instead of performing external connection operations on only one node, external connection operations are performed on one or more nodes, and internal connection operations are performed on other nodes. For example, the outer join operation can be performed on the true subset of all nodes in the inner table. Or, you can perform outer join operations on all nodes in the inner table. The number of nodes on which outer join operations are performed will affect the amount of redundant data.
 In one example, the selection of nodes that perform outer connections may be random. In an alternative example, the node that performs the outer connection may be selected based on specific criteria. For example, a node with the least amount of data in the inner table (or multiple nodes with a smaller amount of data) can be selected as the node for performing the outer join. One or more nodes that perform external connections can also be selected based on other factors such as the network distribution of data.
 In an embodiment of the present invention, preferably, a group by operation based on the connection key of the outer connection is performed on each node of the inner table. The group by operation can be performed by an upper-level aggregation node. By grouping according to the key value of the outer connection, the present invention can ensure that the rows with NULL generated by the error are finally deleted in the related grouping, thereby ensuring the correctness of the result and reducing the amount of redundant data.
 The following takes the example in Table 1-6 above as an example for specific description. Suppose that the node T2_1 in the inner table T2 is selected to perform the outer join operation, and the inner join operation is performed on the other nodes in the inner table T2.
 For the row (22,1000) in the outer table T1, since the matching row can be found on the node T2_1, the outer join operation will return two rows: (22,1000,'Zhang San') and (22,1000,' Li Si'). When performing inner join operations on other nodes, no results will be returned because there is no matching row. Also, since there is no row including NULL, the processing of the aggregation node will not change the data.
 For the row (23,1500) in the outer table T1, performing an outer join operation on node T2_1 will return the row: (23,1500, NULL), while performing an inner join operation on node T2_2 will return (23,1500,' Wang Wu'), no results are returned on other nodes. When the aggregation node performs the group by operation, the row (23,1500, NULL) will be removed and the row (23,1500, ‘王五’) will be retained, thus producing the correct result.
 For the row (24,2000) in the outer table T1, executing the outer join operation on the node T2_1 will return the row: (24,2000,NULL), and no results will be returned on all other nodes. When the aggregation node performs the group by operation, the row (24, 2000, NULL) will be reserved, which will produce the correct result.
 It can be appreciated that although the external table T1 is described as a non-distributed table in the above example, the external table T1 may also be a distributed table including multiple nodes. At this time, every node of the outer table T1 will be broadcast to every node of the inner table T2.
 Reference below Figure 5 , Which shows a flowchart of a method 500 for connecting a first watch to a second watch according to an embodiment of the present invention.
 The method 500 may include: at block 502, broadcasting the first table (eg, outer table T1) to each node (eg, nodes T2_1, T2_2, T2_3, T2_4) of the second table (eg, inner table T2).
 The method 500 may further include: in block 504, performing an outer join operation on one or more nodes of the second table, and performing an inner join operation on other nodes of the second table. For example, the external table T1 is externally connected to the internal table T2_1, and the external table T1 is internally connected to each of T2_1, T2_2, T2_3, and T2_4.
 Preferably, performing an outer join operation on one or more nodes of the second table is to perform an outer join operation on a true subset of all nodes of the second table. More preferably, performing an outer join operation on one or more nodes of the second table is performing an outer join operation on only one node of the second table.
 Preferably, the first table is the first table, and broadcasting the first table to each node of the second table includes broadcasting each node of the first table to each node of the second table.
 The method 500 may further include: at optional block 506, performing a group by operation based on the connection key of the outer connection on each node of the second table. For example, perform a group by operation based on the connection key (for example, t1. age) on the externally connected nodes T2_1, T2_2, T2_3, and T2_4.
 Moreover, the present invention also discloses a system for storing data. The system includes a first table and a second table. The system may be configured to perform the method as described above.
 Preferably, the system may further include an aggregation node, and the aggregation node may perform the group by operation as described above.
 Moreover, this application also discloses a computer-readable storage medium including computer-executable instructions stored thereon, which, when executed by a processor, causes the processor to perform the various implementations described herein. Example method.
 Moreover, the present application also discloses a computer system including a device for executing the method of each embodiment described herein.
 It can be understood that the methods according to various embodiments of the present invention can be implemented by software, firmware or a combination thereof.
 It should be understood that the specific order or hierarchy of the steps in the disclosed method is an illustration of an exemplary process. Based on design preferences, it should be understood that the specific order or hierarchy of the steps in these methods can be rearranged. The accompanying method claims present the elements of the various steps in a sample order, and are not meant to be limited to the specific order or hierarchy presented, unless specifically stated herein.
 It should be understood that the description of an element in the singular form herein or the display of only one element in the drawings does not mean that the number of the element is limited to one. In addition, modules or elements described or shown as separate herein may be combined into a single module or element, and modules or elements described or shown herein as a single module or element may be split into multiple modules or elements.
 It should also be understood that the terms and expressions used herein are only for description, and the present invention should not be limited to these terms and expressions. The use of these terms and expressions does not mean to exclude any equivalent features of the illustration and description (or part of them), and it should be recognized that various modifications that may exist should also be included in the scope of the claims. Other modifications, changes and replacements may also exist. Accordingly, the claims should be regarded as covering all these equivalents.
 Similarly, it should be pointed out that although the present invention has been described with reference to the current specific embodiments, those of ordinary skill in the art should realize that the above embodiments are only used to illustrate the present invention, without departing from the present invention. Various equivalent changes or substitutions can be made in the spirit of the present invention. Therefore, as long as the changes and modifications of the foregoing embodiments are within the essential spirit of the present invention, they will fall within the scope of the claims of this application.