Table joining method and related products

By extracting key-value field data from the database for partitioning and matching filtering, the problem of resource consumption from table joins is solved, improving efficiency and saving storage and network resources.

CN116991885BActive Publication Date: 2026-06-23TENCENT TECH SHANGHAI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
TENCENT TECH SHANGHAI
Filing Date
2022-10-11
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

When joining multiple tables in a database, existing technologies consume a lot of computing and network transmission resources, leading to a decline in computer performance.

Method used

By extracting key-value field data from the table, partitioning it, and then matching and filtering it on the compute nodes to obtain the intersection data, the table to be joined is updated, thus avoiding the partitioning and matching of the entire data.

Benefits of technology

It improves table join efficiency, saves storage and computing resources on computing nodes, reduces data transmission volume, and lowers network bandwidth consumption.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN116991885B_ABST
    Figure CN116991885B_ABST
Patent Text Reader

Abstract

The application belongs to the technical field of computers, and particularly relates to a table connection method, a table connection device, a computer readable medium, an electronic device and a computer program product. The method comprises the following steps: extracting field data corresponding to a key value from two tables to be connected according to the key value as a table connection basis; performing partition processing on the extracted field data to obtain multiple groups of partition data corresponding to different computing nodes; performing matching screening on two groups of partition data respectively belonging to the two tables to be connected on each computing node to obtain intersection data of the two groups of partition data; and updating the tables to be connected according to the intersection data screened by each computing node. The embodiment of the application can reduce the consumption of computing resources of table connection.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This application belongs to the field of computer technology, and specifically relates to a table connection method, a table connection device, a computer-readable medium, an electronic device, and a computer program product. Background Technology

[0002] In database applications, it is often necessary to join and combine multiple different tables in a database to use data from different tables simultaneously. When joining and combining tables, all data rows and columns within the table are typically copied and transferred multiple times. As the table size increases, the join and combination operations between tables will consume a significant amount of computing and network transmission resources, leading to a decline in computer performance. Summary of the Invention

[0003] This application provides a table join method, table join device, computer-readable medium, electronic device, and computer program product, with the aim of reducing the computational resource consumption of table joins.

[0004] Other features and advantages of this application will become apparent from the following detailed description, or may be learned in part from practice of this application.

[0005] According to one aspect of the embodiments of this application, a table joining method is provided, the method comprising:

[0006] Based on the key value used as the basis for table connection, extract the field data corresponding to the key value from the two tables to be joined;

[0007] The extracted field data is partitioned to obtain multi-group partitioned data corresponding to multiple computing nodes;

[0008] On each of the computing nodes, the two groups of partitioned data belonging to the two tables to be connected are matched and filtered to obtain the intersection data of the two groups of partitioned data.

[0009] The table to be connected is updated based on the intersection data obtained from filtering by each computing node.

[0010] According to one aspect of the embodiments of this application, a table connection device is provided, the device comprising:

[0011] The extraction module is configured to extract field data corresponding to the key value from two tables to be joined, based on the key value used as the basis for table joining.

[0012] The partitioning module is configured to partition the extracted field data to obtain multi-group partitioned data corresponding to multiple computing nodes;

[0013] The filtering module is configured to perform matching and filtering on each of the computing nodes on the two groups of partitioned data belonging to the two tables to be connected, to obtain the intersection data of the two groups of partitioned data.

[0014] The update module is configured to update the table to be joined based on the intersection data obtained from filtering each computing node.

[0015] In some embodiments of this application, based on the above technical solutions, the updating module includes:

[0016] The data merging module is configured to merge the intersection data obtained by filtering from various computing nodes;

[0017] The data row update module is configured to update the data rows of the table to be joined based on the intersection data obtained by merging.

[0018] The data column update module is configured to add a data column to the tables to be joined to represent the index values ​​of the intersection data.

[0019] In some embodiments of this application, based on the above technical solutions, the data row update module includes:

[0020] The data row filtering module is configured to match and filter each data row of the table to be joined with the intersection data to obtain the data rows of the table to be joined that contain the intersection data;

[0021] The quantity acquisition module is configured to acquire the distribution quantity of the intersection data in the table to be joined, wherein the distribution quantity is the number of data rows containing the intersection data;

[0022] The data row copying module is configured to copy and add the data rows containing the intersection data to another table to be joined, based on the distribution quantity of the merged intersection data in one table to be joined.

[0023] In some embodiments of this application, based on the above technical solutions, the data column update module includes:

[0024] The first index module is configured to obtain a primary index value for distinguishing the different values ​​of the intersection data;

[0025] The second index module is configured to obtain a secondary index value for distinguishing the different data rows of the intersection data between the two tables to be joined;

[0026] The index value adding module is configured to add a data column containing the primary index value and the secondary index value to the table to be joined.

[0027] In some embodiments of this application, based on the above technical solutions, the second indexing module includes:

[0028] The offset determination module is configured to determine an offset associated with the intersection data based on the distribution quantity of the merged intersection data in the table to be joined, wherein the distribution quantity is the number of data rows containing the intersection data, and the offset is used to distinguish different data rows containing the intersection data in a single table;

[0029] The index value determination module is configured to determine a secondary index value based on the distribution quantity and the offset, for distinguishing the different data rows of the intersection data between the two tables to be joined.

[0030] In some embodiments of this application, based on the above technical solutions, the index value determination module includes:

[0031] The rule acquisition module is configured to acquire a first mapping rule and a second mapping rule that are mirror images of each other. The mirror image relationship is used to indicate that the two mapping rules use mapping parameters from different tables at the same position.

[0032] The first mapping module is configured to map the distribution quantity and the offset according to one of the first mapping rule and the second mapping rule to obtain the secondary index values ​​of each data row where the intersection data is located in the local table; the local table is one of the two tables to be joined.

[0033] The second mapping module is configured to perform mapping processing on the distribution quantity and the offset according to another of the first mapping rule and the second mapping rule to obtain the secondary index values ​​of each data row where the intersection data is located in the opposite table; the opposite table is the other table of the two tables to be joined.

[0034] In some embodiments of this application, based on the above technical solutions, the first mapping rule includes: multiplying the distribution quantity of the opposite table with the offset of the local table, and then summing it with the offset of the opposite table;

[0035] The second mapping rule includes: multiplying the distribution quantity of the local table with the offset of the opposite table, and then summing the product with the offset of the local table;

[0036] Wherein, the local table is one of the two tables to be connected, and the opposite table is the other of the two tables to be connected.

[0037] In some embodiments of this application, based on the above technical solutions, the extraction module includes:

[0038] The partition acquisition module is configured to acquire at least two table partitions that make up the two tables to be joined, wherein the at least two table partitions are distributed on different computing nodes;

[0039] The field extraction module is configured to extract field data corresponding to the key value from the at least two table partitions based on the key value used as the basis for table connection.

[0040] In some embodiments of this application, based on the above technical solutions, the partitioning module includes:

[0041] The hash operation module is configured to perform a hash operation on the extracted field data to obtain the hash value of the field data;

[0042] The field partitioning module is configured to partition the field data according to the hash value to obtain multi-group partitioned data corresponding to multiple computing nodes.

[0043] In some embodiments of this application, based on the above technical solutions, the field partitioning module includes:

[0044] The quantity acquisition module is configured to acquire the number of nodes used for processing field data in the calculation.

[0045] The hash value mapping module is configured to map the hash values ​​according to the number of nodes to obtain a set of hash values ​​that is the same as the number of nodes;

[0046] The set allocation module is configured to allocate the hash value set to each computing node, and determine the multi-group partitioned data corresponding to the multiple computing nodes based on the allocation results.

[0047] In some embodiments of this application, based on the above technical solutions, the screening module includes:

[0048] The quantity comparison module is configured to compare the number of data belonging to two groups of partitioned data in two tables to be joined on the computing node.

[0049] The hash operation module is configured to perform hash operations on a group of partitions with a small amount of data to obtain a hash table corresponding to that group of partitions.

[0050] The matching and filtering module is configured to match and filter a group of partition data with a large amount of data against the hash table to obtain the intersection data of the two groups of partition data.

[0051] According to one aspect of the embodiments of this application, a computer-readable medium is provided having a computer program stored thereon, which, when executed by a processor, implements the table join method as described in the above technical solutions.

[0052] According to one aspect of the embodiments of this application, an electronic device is provided, the electronic device comprising: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform a table join method as described above by executing the executable instructions.

[0053] According to one aspect of the embodiments of this application, a computer program product is provided, including a computer program that, when executed by a processor, implements the table join method as described in the above technical solutions.

[0054] In some embodiments of this application, the computer program product or computer program includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform the table join method as described in the above technical solutions.

[0055] In the technical solution provided in this application embodiment, by extracting field data corresponding to key values ​​from the table, the field data can be partitioned separately, thereby matching and filtering the partitioned data on the computing node, and finally updating the table to be joined based on the intersection data obtained from the filtering. This application embodiment does not require partitioning and matching the full data of the table, which can improve the table joining efficiency while greatly saving storage and computing resources on a single computing node, and can also reduce the amount of data transmission between multiple computing nodes, saving network bandwidth resources.

[0056] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application. Attached Figure Description

[0057] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application. It is obvious that the drawings described below are merely some embodiments of this application, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.

[0058] Figure 1 This diagram illustrates the result of connecting two tables in an application scenario.

[0059] Figure 2 This diagram illustrates the principle of table joins based on broadcast hashing, which is applicable to scenarios where a large table is joined with a small table.

[0060] Figure 3This diagram illustrates the principle of table joining based on repartitioning, which is suitable for application scenarios where both tables contain a large number of data rows.

[0061] Figure 4 This illustration shows a product form diagram of a table connection on a local database according to an embodiment of this application.

[0062] Figure 5 This illustration shows a product form diagram of a table join between databases stored in two locations, according to an embodiment of this application.

[0063] Figure 6 An exemplary system architecture block diagram illustrating the application of the technical solution of this application is shown schematically.

[0064] Figure 7 A flowchart illustrating the steps of a table connection method in one embodiment of this application is shown.

[0065] Figure 8 A flowchart illustrating the steps of performing table joins on tables in distributed storage according to one embodiment of this application is shown.

[0066] Figure 9 A flowchart illustrating the steps of performing table joins based on hash value partitioning in one embodiment of this application is shown.

[0067] Figure 10 A flowchart illustrating the steps of performing table joins based on hash table matching in one embodiment of this application is shown.

[0068] Figure 11 A flowchart illustrating the steps of performing table joins based on index updates in one embodiment of this application is shown.

[0069] Figure 12 A flowchart illustrating the steps of table join based on key-value field repartitioning and index update in one embodiment of this application is shown.

[0070] Figure 13 This illustration shows a schematic diagram of the process of performing an inner join on a table in an application scenario according to an embodiment of this application.

[0071] Figure 14 This illustration shows a schematic diagram of the result of table join based on Cartesian product in an application scenario according to an embodiment of this application.

[0072] Figure 15 This illustration shows the result of a table inner join based on Cartesian product calculation in an application scenario according to an embodiment of this application.

[0073] Figure 16 This illustration shows the result of adding index values ​​to a table in an embodiment of this application.

[0074] Figure 17 A schematic block diagram of the table connection device provided in an embodiment of this application is shown.

[0075] Figure 18 A schematic diagram of a computer system architecture suitable for implementing the embodiments of this application is shown. Detailed Implementation

[0076] Exemplary embodiments will now be described more fully with reference to the accompanying drawings. However, these exemplary embodiments can be implemented in many forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided to make this application more comprehensive and complete, and to fully convey the concept of the exemplary embodiments to those skilled in the art.

[0077] Furthermore, the described features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. Numerous specific details are provided in the following description to give a thorough understanding of embodiments of this application. However, those skilled in the art will recognize that the technical solutions of this application can be practiced without one or more of the specific details, or other methods, components, apparatuses, steps, etc., can be employed. In other instances, well-known methods, apparatuses, implementations, or operations are not shown or described in detail to avoid obscuring various aspects of this application.

[0078] The block diagrams shown in the accompanying drawings are merely functional entities and do not necessarily correspond to physically independent entities. That is, these functional entities can be implemented in software, in one or more hardware modules or integrated circuits, or in different network and / or processor devices and / or microcontroller devices.

[0079] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily need to be performed in the described order. For example, some operations / steps can be broken down, while others can be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.

[0080] In the specific implementation of this application, data related to tables is involved. When the various embodiments of this application are applied to specific products or technologies, user permission or consent is required, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0081] The relevant terms used in the technical solution of this application are explained as follows.

[0082] Table Join: In database computation, the SQL JOIN clause combines rows from two or more tables based on common fields. The most common type of JOIN is SQL INNER JOIN (simple JOIN). SQL INNER JOIN is used to return all rows from multiple tables that satisfy the JOIN condition.

[0083] Table dimensions: Table dimensions refer to the number of fields in each row of the table, which is also the number of columns in the table.

[0084] Join Key: During the join process, two tables are joined based on the same data in a certain column of their respective tables. The corresponding rows are then joined together. The value or data column used as the basis for the join is called the Join Key.

[0085] Partition: In a distributed environment, a table is divided into several parts according to rows, and the data in each part of the table is stored and computed on different nodes. The table data in each part is called a partition.

[0086] Wide table: A new table is generated by joining two tables, which contains fields from both tables.

[0087] In traditional SQL or generalized SQL (such as Spark SQL) applications, the "JOIN" statement is frequently used to combine two or more different tables in a database. The result of a "JOIN" statement can be saved as a table or used as a table. The meaning of the JOIN statement is to combine the attributes of two tables through their values. The ANSI standard SQL lists five JOIN types: inner join, full outer join, left outer join, right outer join, and cross join.

[0088] The following is based on Figures 1 to 3 The principle of table linking in the related technologies of this application is explained.

[0089] Figure 1 This diagram illustrates the result of connecting two tables in an application scenario.

[0090] like Figure 1 As shown, in this application scenario, there are two tables in the database: Employee Table 101 and Department Table 102. In these two tables, Department.DepartmentID is the primary key, and Employee.DepartmentID is the foreign key.

[0091] Taking an inner join as an example, suppose a user wants to retrieve the intersection of the two tables above, they can use the following SQL statement to achieve this:

[0092] SELECT*

[0093] FROM employee

[0094] INNER JOIN department

[0095] ON employee.DepartmentID=department.DepartmentID

[0096] After executing the above SQL statement, you can obtain the joined table 103.

[0097] In real-world applications, tables are typically very large in both length (number of rows) and width (number of columns, or dimensions). The number of rows can reach hundreds of millions or even billions, and the number of columns can reach thousands. In such scenarios, single-machine SQL is clearly insufficient to handle such massive amounts of data. Therefore, a distributed environment is needed to store the data and perform SQL calculations based on the table data.

[0098] In the related technologies of this application, a large table can be divided into several partitions according to rows, and different partitions can be stored on different computing nodes, thereby realizing distributed data storage and computation. There are three ways to perform distributed table joins: Broadcast Hash Join based on broadcast hashing, Shuffle Hash Join based on repartitioning, and Sort Merge Join.

[0099] Figure 2 This diagram illustrates the principle of table joins based on broadcast hashing, which is applicable to scenarios where a large table is joined with a small table.

[0100] like Figure 2As shown, table 201 has a large amount of data and is maintained by multiple compute nodes, with each compute node maintaining one table partition 202. Table 203 has a smaller amount of data and can be maintained on a single compute node. When joining table 201 and table 203, the join key of table 203 can first be used to generate a key-value hash table 204, and then the key-value hash table 204 can be broadcast to all compute nodes where table partitions 202 of table 201 reside. On each compute node, the table partition 202 is checked in the key-value hash table 204 to see if there is a matching key value. If a matching key value exists, it means that a table join is needed for the data rows containing the key value. Specifically, the table join involves updating the data rows and columns in each table. Updating the data rows ensures that table 201 and table 203 have corresponding data rows for the same key value, and updating the data columns ensures that table 201 and table 203 have corresponding data columns for the same key value. The final table join result is obtained based on the table partition join results of large table 201 and small table 203 on different computing nodes. Each computing node performs a table join on a portion of the partitioned data in the table. The table partition join results from all computing nodes are merged to obtain the complete table join result. The final table join result forms a logical wide table that includes both large table 201 and small table 203, encompassing all data rows and columns from both tables after the changes made to large table 201 and small table 203.

[0101] For example, table 201 and table 203 may contain the same key value X. This key value X is distributed in data row H1 in table 201 and in data row H2 in table 203. Through table join, data row H1 can be copied from table 201 to form a new data row H12 corresponding to table 203. This data row H12 is connected to data row H2 in table 203, forming the same data row in the logical wide table. Similarly, through table join, data row H2 can be copied from table 203 to form a new data row H21 corresponding to table 201. This data row H21 is connected to data row H1 in table 201, forming the same data row in the logical wide table. To represent the connection relationship between the data rows in the two tables, a data column L representing the data row index can be added to both table 201 and table 203. This data column L allows the data rows in the two tables to form an interrelated connection. For example, index value L1 can be used to connect data row H12 in large table 201 and data row H2 in small table 203, and index value L2 can be used to connect data row H1 in large table 201 and data row H21 in small table 203.

[0102] Figure 3This diagram illustrates the principle of table joining based on repartitioning, which is suitable for application scenarios where both tables contain a large number of data rows.

[0103] like Figure 3 As shown, the left table 301 is maintained on a single compute node, while the right table 302 is maintained on multiple compute nodes. Each compute node includes a table partition 303 of the right table 302. When performing a table join between the left table 301 and the right table 302, the two tables can first be repartitioned, using the same partitioning rules based on the key values ​​to divide the tables into the same number of partitions by row. Partitioning rules can include, for example, hash partitioning, range partitioning, list partitioning, combined partitioning, etc.

[0104] Hash partitioning converts all data into hash values ​​through hash operations. If you want to divide the data into N partitions, you divide the hash value by N and take the remainder. Data with the same remainder will be assigned to the same partition.

[0105] Range partitioning divides multiple numerical ranges using preset boundary values ​​as constraints. By comparing the data in a field with the values ​​of these boundary values, data within the same range are grouped into the same partition. For example, if the preset boundary values ​​divide the numerical ranges into 0-50, 50-100, and 100-150, comparing a field in the table with each boundary value determines the relationship between the field's data value and each range, thus creating three partitions corresponding to the three ranges.

[0106] List partitioning divides data based on specific values ​​in a field, grouping data with the same value into the same partition. For example, if a field has values ​​A, B, and C, then all data with the value A will be placed in the first partition, all data with the value B in the second partition, and all data with the value C in the third partition.

[0107] Composite partitioning refers to partitioning different fields using the different rules mentioned above, such as a combination of range partitioning and hash partitioning, a combination of range partitioning and list partitioning, a combination of list partitioning and hash partitioning, and so on.

[0108] After partitioning the left table 301, we obtain left table partition 304, each maintained by one of the two compute nodes. After repartitioning the right table 302, we obtain right table partition 305, each maintained by one of the two compute nodes. Then, on each compute node, we perform a join operation on the corresponding left table partition 304 and right table partition 305 to obtain the join result of the table partitions. Based on the join result of each table partition, we can obtain the join result of the left table 301 and the right table 302. For example, after partitioning the left table 301, we can obtain the first left table partition A1 maintained by compute node A and the second left table partition B1 maintained by compute node B; after repartitioning the right table 302, we can obtain the first right table partition A2 maintained by compute node A and the second right table partition B2 maintained by compute node B. On compute node A, perform a join operation on left table partition A1 and right table partition A2 to obtain the join result A0. On compute node B, perform a join operation on left table partition B1 and right table partition B2 to obtain the join result B0. Finally, merge join result A0 and join result B0 to obtain the join result of left table 301 and right table 302. For specific table join methods and results, please refer to the above. Figure 2 The description will not be repeated here.

[0109] Within a compute node's partition, Shuffle Hash Join extracts the key values ​​of a table's partition data into a hash table and finds the intersection based on the hash table query; Sort Merge Join first sorts the partition data of two tables separately, and then finds the intersection of the key values ​​in ascending order.

[0110] Based on the principles of the above related technologies, in distributed table join applications, for scenarios where one table has a small number of rows (key-value hash tables can be loaded into single-machine memory), Broadcast HashJoin can be used for efficient implementation. When both tables have a large number of rows, Shuffle Hash Join or MergeSort Join is needed. Both methods require repartitioning the tables. For tables within a partition, if one partition table is smaller, Hash Join can be used; otherwise, MergeSort Join should be used.

[0111] In the related technologies of this application, for tables with a large number of rows, it is necessary to repartition (because a complete key-value hash table cannot be stored in a single partition). If the number of columns in the table is also very large, the repartitioning process will also consume a lot of storage and communication resources, causing the connection process to occupy a lot of resources and the calculation process to be slow.

[0112] This application proposes an efficient join method for scenarios with a very large number of rows and columns in the table. During the repartitioning stage, only the JoinKey of each row is extracted, and the original partition number where the JoinKey resides is recorded. After matching the JoinKeys of the two tables on the new partition, the JoinKeys are remapped back to the original partitions, and the other fields and JoinKeys are concatenated on the original partitions of the two tables. The method proposed in this application avoids the full-field repartitioning process during high-dimensional table joins, greatly reducing the storage and communication overhead caused by repartitioning and improving the speed of join calculations. Simultaneously, after the join, rows of the wide table are identified by an additional primary key, eliminating the need to actually store each row together, thus saving the communication and storage overhead caused by wide table concatenation.

[0113] The following is based on Figures 4 to 6 The product forms implemented in different application scenarios according to the embodiments of this application are described.

[0114] Figure 4 This illustration shows a product form diagram of a table connection on a local database according to an embodiment of this application.

[0115] like Figure 4 As shown, the left table 402 and the right table 403 to be joined are loaded from the local database 401. Based on the table joining method provided in this embodiment, the left table 402 and the right table 403 are joined to generate a logical wide table 404. In the logical wide table 404, the joined rows can be identified by an additionally generated primary key, without actually storing the joined rows together. The logical wide table 404 is a table composed of all fields from the left table 402 and all fields from the right table 403. For example, the left table 402 includes three data columns, corresponding to fields A, B, and C, and the right table 403 includes three data columns, corresponding to fields B, C, and D. The logical wide table 404 formed by joining the two will include four data columns, corresponding to fields A, B, C, and D.

[0116] For example, table 402 on the left includes m data rows H(1) to H(m). After table joining, n data rows H(m+1) to H(m+n) corresponding to table 403 on the right can be copied from table 402. Based on this, a new primary key (equivalent to a foreign key in table 403) can be generated in table 402. For instance, the primary key values ​​of the n data rows H(m+1) to H(m+n) in the left table can be (m+1) to (m+n), corresponding to the n data rows in table 403 on the right. Simultaneously, a new primary key (equivalent to a foreign key in table 402) is also generated in table 403 on the right. Using the primary key as an index, a logical connection can be formed between table 402 and table 403 on the corresponding data rows. The resulting logical wide table 404 is a distributed storage table containing four fields A, B, C, and D and m+n data rows. The data corresponding to fields A, B, and C are distributed in the left table 402, and the data corresponding to fields B, C, and D are distributed in the right table 403.

[0117] Figure 5 This illustration shows a product form diagram of a table join between databases stored in two locations, according to an embodiment of this application.

[0118] like Figure 5 As shown, the two tables involved in the join are stored in the left table database 501 and the right table database 502, respectively. The left table 502 is loaded from the left table database 501, and the right table 504 is loaded from the right table database 503. After joining the left table 502 and the right table 504 based on the table join method provided in this application embodiment, a logical wide table 505 is generated.

[0119] The table join process typically involves collaborative computation between two different parties, with data exchange between them requiring communication over the public internet. In a logical wide table (505), a separately generated primary key identifies the joined rows. The wide table's fields are stored in the respective distributed environments of the two parties, eliminating the need to copy the table to either party. The specific generation method for a logical wide table (505) can be referenced above. Figure 4 The description will not be repeated here.

[0120] In addition to the two product forms mentioned above, this application embodiment can also perform connection calculations between participating parties based on privacy protection. Due to the need for privacy-preserving calculations during the connection phase, after both parties repartition using the JoinKey, they execute the PSI (Privacy Set Intersection) protocol on the new partition to ensure that only the JoinKey of the intersection portion can be obtained by both parties, and the non-intersection portion cannot be disclosed. After obtaining the public JoinKey, both parties perform field concatenation calculations locally.

[0121] Figure 6 An exemplary system architecture block diagram illustrating the application of the technical solution of this application is shown schematically.

[0122] like Figure 6 As shown, system architecture 600 may include terminal device 610 and server 620. Terminal device 610 may include various electronic devices such as smartphones, tablets, laptops, desktop computer smart speakers, smart wearable devices, smart in-vehicle devices, and smart payment terminals. Server 630 may be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The network may be a communication medium of various connection types capable of providing a communication link between terminal device 610 and server 620, such as a wired communication link or a wireless communication link.

[0123] Depending on the implementation requirements, the system architecture in this application embodiment can have any number of terminal devices, networks, and servers. For example, server 620 can be a server group composed of multiple server devices. In addition, the technical solutions provided in this application embodiment can be applied to terminal device 610, or to server 620, or can be implemented jointly by terminal device 610 and server 620. This application does not impose any special limitations on this.

[0124] Taking the table joining method jointly implemented by terminal device 610 and server 620 as an example, the terminal device 610 can first obtain the table X and table Y to be joined. After partitioning table X, the corresponding multi-group partitioned data X1, X2...Xn can be obtained. After partitioning table Y, the corresponding multi-group partitioned data Y1, Y2...Yn can be obtained.

[0125] Server 620 comprises n compute nodes. Partition data X1 belonging to table X and partition data Y1 belonging to table Y will be assigned to node 1, partition data X2 belonging to table X and partition data Y2 belonging to table Y will be assigned to node 2, and so on. Partition data Xn belonging to table X and partition data Yn belonging to table Y will be assigned to node n. Each compute node matches and filters the two groups of partition data belonging to table X and table Y respectively to obtain the corresponding intersection data. After updating the table based on the intersection data obtained by all compute nodes, a logical wide table X+Y can be obtained. The specific generation method of logical wide table X+Y can be referred to the above. Figure 4 The description will not be repeated here.

[0126] It should be noted that the above system architecture is merely an example and is not intended to limit the implementation of the technical solution of this application. Those skilled in the art can choose other system architecture implementation methods according to actual needs. The computing node that performs the connection processing of partitioned data can be any electronic device with data computing capabilities, such as any one or more combinations of terminal devices, servers, or server groups. The computing node can also be a data computing unit in any of the above electronic devices. For example, multiple independent data computing units can be divided into terminal devices or servers, where each data computing unit constitutes a computing node. In addition, the tables X and Y to be connected can be two tables maintained on the local database of the same electronic device (e.g., Figure 4 Alternatively, it could be two tables maintained on multiple databases on different electronic devices (e.g., ...). Figure 5 The specific processing steps, such as table partitioning, partition data matching and filtering, and logical wide table generation, can be executed on the terminal device or on the server. This application embodiment does not impose any special limitations on this.

[0127] The following is based on Figures 7 to 12 The table connection method in the specific embodiments of this application is described in detail.

[0128] Figure 7 This illustration shows a flowchart of the steps of a table join method according to one embodiment of this application. The table join method can be performed by… Figure 6 The method can be executed by the terminal device or the server, or it can be executed by both the terminal device and the server. This application uses the table join method executed by the terminal device as an example for illustration.

[0129] like Figure 7 As shown, the table connection method in this application embodiment includes the following steps S710 to S740.

[0130] S710: Extract field data corresponding to the key value from the two tables to be joined, based on the key value used as the basis for table joining.

[0131] This application embodiment is used to perform join processing on two or more tables. Taking the join of two tables as an example, the principle of joining tables is based on establishing a correspondence between different data rows in the same data in the two tables, and then using the correspondence to join the data columns to obtain a logical wide table composed of the fields of the two tables.

[0132] In some alternative implementations, the key used as the basis for joining tables can be a common data column selected from both tables, where the two tables have partially identical data. For example, the local table could be an employee table containing columns for employee name and department ID, while the local table could be a department table containing columns for department ID and department name. Since both tables share the same data column "Department ID," this column can be used as the basis for joining the tables. In other implementations, the key used as the basis for joining tables can be two or more data columns selected from the two tables.

[0133] In some alternative implementations, the field data extracted from the table based on the key value includes all data in the data column containing the key value. In other implementations, the field data extracted from the table may be only a portion of the data in the data column containing the key value.

[0134] S720: The extracted field data is partitioned to obtain multi-group partitioned data corresponding to multiple computing nodes.

[0135] Each table to be joined is partitioned to obtain the same number of partitioned data groups. For example, when joining tables A and B, partitioning the field data extracted from table A yields three partitioned data groups A1, A2, and A3, and partitioning the field data extracted from table B yields the same three partitioned data groups B1, B2, and B3. In the multi-partitioned data obtained after partitioning field data belonging to the same table, each partitioned data group can have the same or different numbers of data.

[0136] In some optional implementations, one or more partition positions can be selected from the field data according to their arrangement order in the table for data partitioning. The sequence of field data is then split based on the partition positions to obtain multiple sub-sequences, and each sub-sequence serves as a group of partitioned data. In other implementations, a random allocation method can be used to randomly assign partition identifiers to field data belonging to the same table, and field data with the same partition identifier are considered as the same group of partitioned data.

[0137] Compute nodes can be Figure 6 The terminal device or server shown can communicate with each other via network connection in a node cluster consisting of multiple computing nodes. For example, a computing node can communicate with another computing node by sending a point-to-point message, or it can communicate with multiple other computing nodes in the node cluster by broadcasting messages.

[0138] After partitioning the field data, multi-group partitioned data corresponding to multiple computing nodes can be obtained, and each computing node can maintain one set of partitioned data. In some optional implementations, the working status of each computing node can be collected in real time to determine whether each computing node can perform table join operations efficiently. For example, computing nodes in the node cluster that are in an idle state can be filtered out, and a set of partitioned data can be assigned to each of the filtered computing nodes.

[0139] In some alternative implementations, the field data can be evenly distributed according to the number of computing nodes, so that each computing node maintains the same or similar number of field data. Alternatively, the amount of partition data allocated to each computing node can be determined based on the computing power of each computing node.

[0140] S730: On each computing node, the two groups of partitioned data belonging to the two tables to be joined are matched and filtered to obtain the intersection data of the two groups of partitioned data.

[0141] On each computing node, the two groups of partitioned data belonging to two different tables are matched and filtered, and the intersection data with the same values ​​is selected. The computing node can perform consistency checks on the two groups of partitioned data to determine whether the data in each partition of the two tables is the same, and thus the data in the same partitions is taken as the intersection data.

[0142] For example, when joining the employee table and the department table, the department ID is used as the key value. After partitioning, the compute node can perform a matching check between a set of department IDs belonging to the employee table and another set of department IDs belonging to the department table to determine whether the department IDs in the two sets of data are the same. If they are the same, they are identified as the intersection data.

[0143] S740: Update the table to be joined based on the intersection data obtained from filtering each computing node.

[0144] The intersection data corresponds to a data column in each table, that is, the data column containing the key value. Based on this intersection data, the tables to be joined can be updated separately, so that the data rows and columns in the two tables can form a corresponding relationship that can be connected to each other.

[0145] One way to update the data rows of a table is to copy the data rows containing the intersection data from one of the tables to be joined. The copied data rows correspond to the data rows in the other table to be joined, which is equivalent to generating data rows in the current table that are associated with the other table.

[0146] One way to update the data columns of a table is to add a primary key field as a data index to each of the two tables to be joined. Based on the primary key field, the corresponding data rows in the two tables to be joined can be linked, that is, a data row in one table can be connected to another corresponding data row in the other table. These two data rows are represented as the same data row in the logical wide table.

[0147] For example, the local table to be joined includes m data rows H(1) to H(m). Based on the intersection data, n new data rows H(m+1) to H(m+n) corresponding to the adjacent table to be joined can be copied from the local table. Simultaneously, a new primary key (equivalent to a foreign key in the adjacent table) can be generated in the local table. For instance, the primary key values ​​of the n data rows H(m+1) to H(m+n) in the local table can be (m+1) to (m+n), corresponding to the n data rows in the adjacent table. Correspondingly, a new primary key (equivalent to a foreign key in the local table) is also generated in the adjacent table. Using the primary key as an index, a logical connection can be formed between the local and adjacent tables on the corresponding data rows.

[0148] In the technical solution provided in this application embodiment, by extracting field data corresponding to key values ​​from the table, the field data can be partitioned separately, thereby matching and filtering the partitioned data on the computing node, and finally updating the table to be joined based on the intersection data obtained from the filtering. This application embodiment does not require partitioning and matching the full data of the table, which can improve the table joining efficiency while greatly saving storage and computing resources on a single computing node, and can also reduce the amount of data transmission between multiple computing nodes, saving network bandwidth resources.

[0149] Figure 8 This illustration shows a flowchart of the steps for performing a table join on tables in distributed storage according to one embodiment of this application. The table join method can be performed by… Figure 6 The method can be executed by the terminal device or the server, or it can be executed by both the terminal device and the server. This application uses the table join method executed by the terminal device as an example for illustration.

[0150] like Figure 8 As shown, the method for performing table joins on tables in distributed storage in this embodiment of the application includes the following steps S810 to S850.

[0151] S810: Obtain at least two table partitions that make up the two tables to be joined, with the at least two table partitions distributed on different computing nodes.

[0152] The tables to be joined can include two or more table partitions, each maintained by a different compute node. For example, table A on this side includes partition tables A1 and A2 maintained by two compute nodes, while table B on the other side includes partition tables B1, B2, and B3 maintained by three compute nodes.

[0153] S820: Extract field data corresponding to the key value from at least two table partitions based on the key value used as the basis for table join.

[0154] The key used to join tables is a common data column selected from both tables, where the two tables have partially identical data. The key is sent to each compute node maintaining the partitioned data, where the corresponding field data can be extracted. The extracted field data from each table partition together form the field data of a single table. For example, field data A10 and A20 belonging to the local table can be extracted from partitioned tables A1 and A2, while field data B10, B20, and B30 belonging to the opposite table can be extracted from partitioned tables B1, B2, and B3.

[0155] S830: The extracted field data is partitioned to obtain multi-group partitioned data corresponding to multiple computing nodes.

[0156] On each computing node, the extracted field data is partitioned, resulting in multi-group partitioned data corresponding to different computing nodes. For example, when allocating partitioned data to a node cluster including three computing nodes, the field data A10 and A20 belonging to the local table and the field data B10, B20, and B30 belonging to the opposite table can be partitioned to obtain partitioned data A11, A21, B11, B21, and B31 corresponding to the first computing node, partitioned data A12, A22, B12, B22, and B32 corresponding to the second computing node, and partitioned data A13, A23, B13, B23, and B33 corresponding to the third computing node.

[0157] S840: On the computing node, match and filter the two groups of partitioned data belonging to the two tables to be joined, and obtain the intersection data of the two groups of partitioned data.

[0158] On each computing node, the two groups of partitioned data belonging to two different tables are matched and filtered, and the intersection data with the same values ​​is selected. The computing node can perform consistency checks on the two groups of partitioned data to determine whether the data in each partition of the two tables is the same, and thus the data in the same partitions is taken as the intersection data.

[0159] For example, the first computing node includes a set of partitioned data A11 and A21 belonging to its own table, and another set of partitioned data B11, B21, and B31 belonging to the opposite table. The first computing node can perform a consistency check on these two sets of partitioned data and select the identical data as the intersection data.

[0160] Similarly, the second computing node includes a set of partitioned data A12 and A22 belonging to its own table, and another set of partitioned data B12, B22, and B32 belonging to the opposite table; the third computing node includes a set of partitioned data A13 and A23 belonging to its own table, and another set of partitioned data B13, B23, and B33 belonging to the opposite table. The second and third computing nodes can perform consistency checks on the two sets of partitioned data belonging to their own and opposite tables, respectively, and select the identical data as the intersection data.

[0161] S850: Update the table to be joined based on the intersection data obtained from filtering each computing node.

[0162] The intersection data corresponds to a data column in each table, that is, the data column containing the key value. Based on this intersection data, the tables to be joined can be updated separately, so that the data rows and columns in the two tables can form a corresponding relationship that can be connected to each other.

[0163] In some alternative implementations, the intersection data obtained by each computing node can be sent to the computing node where each table partition is located, and each computing node can update the table partition it maintains.

[0164] In this embodiment, a secondary partitioning approach is used to join tables in distributed storage. This approach can fully utilize the cluster effect of computing nodes, perform secondary allocation of computing resources within the node cluster, achieve balanced load among multiple computing nodes, and improve the processing efficiency of table joins.

[0165] Figure 9 This illustration shows a flowchart of the steps for performing a table join based on hash value partitioning in one embodiment of this application. This table join method can be performed by… Figure 6 The method can be executed by the terminal device or the server, or it can be executed by both the terminal device and the server. This application uses the table join method executed by the terminal device as an example for illustration.

[0166] like Figure 9 As shown, the table connection method in this application embodiment includes the following steps S910 to S950.

[0167] S910: Extract field data corresponding to the key value from the table to be joined based on the key value used as the basis for table joining.

[0168] In this embodiment, the key value used as the basis for connecting the tables is a common data column selected from both tables, where the two tables have partially identical data. The field data extracted from the tables based on the key value includes all data in the data column containing the key value.

[0169] For example, this side table is an employee table containing two columns: employee name and department ID. The other side table is a department table containing two columns: department ID and department name. Both tables share the same "department ID" column, which can be used as the basis for joining the tables. Extracting the "department ID" column from both the employee and department tables yields two columns that correspond to the key values.

[0170] S920: Perform a hash operation on the extracted field data to obtain the hash value of the field data.

[0171] Hash, also known as "hashing," transforms any set of input information of arbitrary length into a fixed-length hash value through hash operations. This application embodiment can use any hash algorithm such as MD (Message Digest), SHA (Secure Hash Algorithm), or MAC (Message Authentication Code) to perform hash operations on the extracted field data to obtain the hash value of the field data. After the hash operation, each field data can be mapped to a hash value of fixed length.

[0172] S930: Partition the field data according to the hash value to obtain multi-group partitioned data corresponding to multiple computing nodes.

[0173] Each field data is associated with its corresponding hash value. In this embodiment of the application, the hash value of each field data can be partitioned. Based on the association between the hash value and the field data, the field data can be partitioned into multiple groups of partitioned data.

[0174] In some optional implementations, the number of computing nodes used to process field data can be obtained; the hash values ​​can be mapped according to the number of nodes to obtain a set of hash values ​​that is the same as the number of nodes; the set of hash values ​​can be allocated to each computing node, and the multi-group partitioned data corresponding to different computing nodes can be determined according to the allocation results.

[0175] In some alternative implementations, a consistent hashing algorithm can be used to determine the allocation relationship between each field of data and the computing node. Multiple fields of data that are allocated to the same computing node constitute a set of partitioned data.

[0176] Consistent hashing algorithms organize the hash values ​​of data fields into a ring, where each computing node corresponds to a specific hash value range on the ring. If a hash value falls within the hash value range corresponding to a computing node, that data field can be assigned to that node. For example, in a hash value range of 0 to 2... 32 On the ring marked -1, three computing nodes, Server1, Server2, and Server3, are distributed, dividing the ring into three hash value intervals. After performing a hash operation on a field of data to be partitioned to obtain a hash value, the hash value interval in which that hash value belongs can be determined. For example, if the field data falls within the hash value interval between Server1 and Server2, then the user request can be assigned to computing node Server2 in a clockwise direction.

[0177] S940: On the computing node, match and filter the two groups of partitioned data belonging to the two tables to be joined, and obtain the intersection data of the two groups of partitioned data.

[0178] On each computing node, the two groups of partitioned data belonging to two different tables are matched and filtered, and the intersection data with the same values ​​is selected. The computing node can perform consistency checks on the two groups of partitioned data to determine whether the data in each partition of the two tables is the same, and thus the data in the same partitions is taken as the intersection data.

[0179] In some alternative implementations, the computing node can perform a consistency check on the hash value sets corresponding to the two sets of partitioned data, determine whether the hash values ​​in the two hash value sets are the same, and take the partitioned data corresponding to the same hash value as the intersection data.

[0180] S950: Update the table to be joined based on the intersection data obtained from filtering each computing node.

[0181] The intersection data corresponds to a data column in each table, that is, the data column containing the key value. Based on this intersection data, the tables to be joined can be updated separately, so that the data rows and columns in the two tables can form a corresponding relationship that can be connected to each other.

[0182] In this embodiment, by performing a hash operation on the field data to obtain a hash value, and then partitioning the field data according to the hash value, the problem of uneven partitioning caused by excessive differences in the content of the field data can be avoided, thereby improving the uniformity of the quantity distribution among the data in each partition. Additionally, in some other embodiments, partitioning can also be performed based on the original values ​​of the field data; this application does not impose any specific limitations on this approach.

[0183] Figure 10 This illustration shows a flowchart of the steps for performing table joins based on hash table matching in one embodiment of this application. This table join method can be performed by… Figure 6 The method can be executed by the terminal device or the server, or it can be executed by both the terminal device and the server. This application uses the table join method executed by the terminal device as an example for illustration.

[0184] like Figure 10 As shown, the table connection method in this application embodiment includes the following steps S1010 to S1060.

[0185] S1010: Extract field data corresponding to the key value from the table to be joined based on the key value used as the basis for table joining.

[0186] In this embodiment, the key value used as the basis for connecting the tables is a common data column selected from both tables, where the two tables have partially identical data. The field data extracted from the tables based on the key value includes all data in the data column containing the key value.

[0187] S1020: Partition the extracted field data to obtain multi-group partitioned data corresponding to multiple computing nodes.

[0188] Each table to be joined is partitioned to obtain the same number of partitioned data groups. For example, when joining tables A and B, partitioning the field data extracted from table A yields three partitioned data groups A1, A2, and A3, and partitioning the field data extracted from table B yields the same three partitioned data groups B1, B2, and B3. In the multi-partitioned data obtained after partitioning field data belonging to the same table, each partitioned data group can have the same or different numbers of data.

[0189] In this embodiment of the application, when partitioning the extracted field data, the original values ​​of the field data can be partitioned directly, or the original values ​​of the field data can be converted into hash values ​​first, and after partitioning the hash values, the corresponding partition data can be obtained according to the association between the hash values ​​and the field data.

[0190] This application embodiment can use a random allocation algorithm or a consistent hashing algorithm to partition the field data. Specific partitioning schemes can be found in the embodiments above, and will not be repeated here.

[0191] S1030: Compare the number of data in the two groups of partitioned data belonging to the two tables to be joined on the compute node.

[0192] On each compute node, the two sets of partitioned data belonging to two different tables are matched and filtered, and the intersection data with the same values ​​is selected. The compute node can perform consistency checks on the two sets of partitioned data, determining whether the partition data in the two tables are the same, and thus identifying the identical partition data as the intersection data. The compute node can also perform consistency checks on the hash value sets corresponding to the two sets of partitioned data, determining whether the hash values ​​in the two hash value sets are the same, and identifying the partition data corresponding to the same hash values ​​as the intersection data.

[0193] S1040: Perform a hash operation on a group of partitions with a small amount of data to obtain a hash table corresponding to that group of partitions.

[0194] On the same computing node, two groups of partitioned data belonging to two tables to be joined can have the same or different amounts of data. This embodiment of the application, by comparing the amount of data in the two groups of partitioned data, can determine the group with relatively less data and the other group with relatively more data. Specifically, for the group with less data, a hash operation can be performed to obtain a hash table corresponding to that group of partitioned data.

[0195] S1050: Match and filter the set of partition data with the hash table to obtain the intersection data of the two sets of partition data.

[0196] On the same computing node, a set of partitioned data with a relatively large amount of data is subjected to consistency checks with each hash value in the hash table. If the hash values ​​obtained after hashing each data item in the partitioned data using the same hash function are the same as at least one hash value in the hash table, it can be determined that the data item is the intersection of the two sets of partitioned data.

[0197] 1060: Update the table to be joined based on the intersection data obtained from filtering each computing node.

[0198] The intersection data corresponds to a data column in each table, that is, the data column containing the key value. Based on this intersection data, the tables to be joined can be updated separately, so that the data rows and columns in the two tables can form a corresponding relationship that can be connected to each other.

[0199] In this embodiment of the application, by comparing the number of data in the two groups of partitioned data, the group of partitioned data with fewer data can be converted into a hash value, and the group of partitioned data with more data can be compared with the hash value for consistency judgment. This can avoid the problem of difficulty in data consistency judgment due to large differences in the content between data items, improve the efficiency of filtering intersection data, and save computing resources and network transmission resources of computing nodes.

[0200] Figure 11 This invention illustrates a flowchart of the steps for performing a table join based on index updates in one embodiment of the present application. This table join method can be performed by… Figure 6 The method can be executed by the terminal device or the server, or it can be executed by both the terminal device and the server. This application uses the table join method executed by the terminal device as an example for illustration.

[0201] like Figure 11 As shown, the table connection method in this application embodiment includes the following steps S1110 to S1160.

[0202] S1110: Extract field data corresponding to the key value from the table to be joined based on the key value used as the basis for table joining.

[0203] In this embodiment, the key value used as the basis for connecting the tables is a common data column selected from both tables, where the two tables have partially identical data. The field data extracted from the tables based on the key value includes all data in the data column containing the key value.

[0204] S1120: Perform partitioning on the extracted field data to obtain multi-group partitioned data corresponding to multiple computing nodes.

[0205] Each table to be joined is partitioned to obtain the same number of partitioned data groups. For example, when joining tables A and B, partitioning the field data extracted from table A yields three partitioned data groups A1, A2, and A3, and partitioning the field data extracted from table B yields the same three partitioned data groups B1, B2, and B3. In the multi-partitioned data obtained after partitioning field data belonging to the same table, each partitioned data group can have the same or different numbers of data.

[0206] In this embodiment of the application, when partitioning the extracted field data, the original values ​​of the field data can be partitioned directly, or the original values ​​of the field data can be converted into hash values ​​first, and after partitioning the hash values, the corresponding partition data can be obtained according to the association between the hash values ​​and the field data.

[0207] This application embodiment can use a random allocation algorithm or a consistent hashing algorithm to partition the field data. Specific partitioning schemes can be found in the embodiments above, and will not be repeated here.

[0208] S1130: On the computing node, match and filter the two groups of partitioned data belonging to the two tables to be joined, and obtain the intersection data of the two groups of partitioned data.

[0209] On each computing node, the two groups of partitioned data belonging to two different tables are matched and filtered, and the intersection data with the same values ​​is selected. The computing node can perform consistency checks on the two groups of partitioned data to determine whether the data in each partition of the two tables is the same, and thus the data in the same partitions is taken as the intersection data.

[0210] In some alternative implementations, the computing node can perform a consistency check on the hash value sets corresponding to the two sets of partitioned data, determine whether the hash values ​​in the two hash value sets are the same, and take the partitioned data corresponding to the same hash value as the intersection data.

[0211] The intersection data corresponds to a data column in each table, that is, the data column containing the key value. Based on this intersection data, the tables to be joined can be updated separately, so that the data rows and columns in the two tables can form a corresponding relationship that can be connected to each other.

[0212] S1140: Merge the intersection data obtained from filtering each computing node.

[0213] After merging the intersection data obtained from filtering each computing node, duplicate data items in the intersection data can be removed.

[0214] S1150: Update the data rows of the table to be joined based on the intersection data obtained from the merge.

[0215] After obtaining the intersection data, each data row in the table can be matched and filtered with the intersection data to obtain the data rows in the table that contain the intersection data; obtain the distribution number of the intersection data in the table, which is the number of data rows containing the intersection data; based on the distribution number of the merged intersection data in one table, copy and add the data rows containing the intersection data to another table.

[0216] For example, the local table and the opposite table form a pair of tables to be joined. The number of intersection data in the local table is N, and the number of intersection data in the opposite table is M. Based on this, each data row containing the intersection data can be copied (M-1) times in the local table, and each data row containing the intersection data can be copied (N-1) times in the opposite table, thereby realizing the connection correspondence between the local table and the opposite table on each data row.

[0217] S1160: Add a data column to the tables to be joined to represent the index values ​​of the intersection data.

[0218] In some alternative implementations, a primary index value is obtained to distinguish different values ​​of the intersection data; a secondary index value is obtained to distinguish different data rows containing the same intersection data between the tables to be joined; and a data column containing the primary and secondary index values ​​is added to the tables to be joined.

[0219] Primary index values ​​correspond to the intersection data; identical intersection data has the same primary index value. Secondary index values ​​correspond to data rows. If two data rows distributed in two tables have the same primary and secondary index values, it means that these two rows are actually the same data row formed by joining data columns in the logical wide table.

[0220] In this embodiment of the application, by adding an index value field to the table to be connected, the two tables can be connected through the index value field without the need for full table data communication and storage between the participants maintaining the tables. Therefore, it can greatly save computing resources, storage resources and network communication resources, and improve table connection efficiency.

[0221] Secondary index values ​​can be used to determine the correspondence between data rows in the local table and the counterpart table. Methods for determining secondary index values ​​based on the distribution quantity can include: determining the offset associated with the intersection data based on the distribution quantity of the merged intersection data in the table, where the distribution quantity is the number of data rows containing the intersection data, and the offset is used to distinguish different data rows containing the same intersection data in a single table; and determining secondary index values ​​based on the distribution quantity and the offset to distinguish different data rows containing the same intersection data between the tables to be joined.

[0222] For example, if the number of intersection data points distributed in this table is N, the offsets associated with the intersection data can be determined to be 0, 1, 3, ..., N. Similarly, if the number of intersection data points distributed in the opposite table is M, the offsets associated with the intersection data can be determined to be 0, 1, 3, ..., M.

[0223] In some optional implementations, embodiments of this application may obtain a first mapping rule and a second mapping rule that are mirror images of each other. The mirror image relationship is used to indicate that the two mapping rules use mapping parameters of different tables at the same position. The distribution quantity and offset are mapped according to one of the first mapping rule and the second mapping rule to obtain the secondary index values ​​of each data row where the intersection data is located in the local table. The distribution quantity and offset are mapped according to the other of the first mapping rule and the second mapping rule to obtain the secondary index values ​​of each data row where the intersection data is located in the opposite table.

[0224] For example, if a mapping rule requires multiple mapping parameters corresponding to different positions, and the first mapping rule uses the distribution quantity corresponding to the local table as the mapping parameter at one position, then the second mapping rule should use the distribution quantity corresponding to the opposite table at the same position. Similarly, if the first mapping rule uses the offset corresponding to the opposite table as the mapping parameter at another position, then the second mapping rule should use the offset corresponding to the local table at the same position.

[0225] In some embodiments of this application, the first mapping rule includes: multiplying the distribution quantity of the opposite table with the offset of the local table, and then summing the product with the offset of the opposite table; the second mapping rule includes: multiplying the distribution quantity of the local table with the offset of the opposite table, and then summing the product with the offset of the local table.

[0226] In this embodiment, two mapping rules that are mirror images of each other are used to perform function mapping on two tables. This allows the same secondary index value to be assigned to the data rows forming the connection in both tables, thus determining the connection relationship between the tables based on the index value. Furthermore, the data columns between the two tables can be joined based on the index value to form a logically wide table after the connection. This embodiment improves the efficiency of connecting and updating the tables, avoiding the problem of transmitting or storing the entire table or table partitions between different computing nodes during the connection process.

[0227] Figure 12 This illustration shows a flowchart of the steps for performing a table join based on key-value field repartitioning and index updating in one embodiment of this application. This table join method can be performed by… Figure 6 The method can be executed by the terminal device or the server, or it can be executed by both the terminal device and the server. This application uses the table join method executed by the terminal device as an example for illustration.

[0228] like Figure 12 As shown, the table connection method in this application embodiment includes the following steps S1210 to S12110.

[0229] S1210: Obtain at least two table partitions that make up the table to be joined, wherein the at least two table partitions are distributed on different compute nodes.

[0230] The tables to be joined can include two or more table partitions, each maintained by a different compute node. For example, table A on this side includes partition tables A1 and A2 maintained by two compute nodes, while table B on the other side includes partition tables B1, B2, and B3 maintained by three compute nodes.

[0231] S1220: Extract field data corresponding to the key value from at least two table partitions based on the key value used as the basis for table connection.

[0232] The key used to join tables is a common data column selected from both tables, where the two tables have partially identical data. The key is sent to each compute node maintaining the partitioned data, where the corresponding field data can be extracted. The extracted field data from each table partition together form the field data of a single table. For example, field data A10 and A20 belonging to the local table can be extracted from partitioned tables A1 and A2, while field data B10, B20, and B30 belonging to the opposite table can be extracted from partitioned tables B1, B2, and B3.

[0233] S1230: Perform a hash operation on the extracted field data to obtain the hash value of the field data.

[0234] In this application embodiment, any hash algorithm such as MD (Message Digest), SHA (Secure Hash Algorithm), or MAC (Message Authentication Code) can be used to perform hash operations on the extracted field data to obtain the hash value of the field data. After the hash operation, each field data can be mapped to a hash value with a fixed length.

[0235] S1240: Partition the field data according to the hash value to obtain multi-group partitioned data corresponding to multiple computing nodes.

[0236] Each field data is associated with its corresponding hash value. In this embodiment of the application, the hash value of each field data can be partitioned. Based on the association between the hash value and the field data, the field data can be partitioned into multiple groups of partitioned data.

[0237] In some optional implementations, the number of computing nodes used to process field data can be obtained; the hash values ​​can be mapped according to the number of nodes to obtain a set of hash values ​​that is the same as the number of nodes; the set of hash values ​​can be allocated to each computing node, and the multi-group partitioned data corresponding to different computing nodes can be determined according to the allocation results.

[0238] In some alternative implementations, a consistent hashing algorithm can be used to determine the allocation relationship between each field of data and the computing node. Multiple fields of data that are allocated to the same computing node constitute a set of partitioned data.

[0239] S1250: Compare the number of data in the two groups of partitioned data belonging to the two tables to be joined on the compute node.

[0240] On each compute node, the two sets of partitioned data belonging to two different tables are matched and filtered, and the intersection data with the same values ​​is selected. The compute node can perform consistency checks on the two sets of partitioned data, determining whether the partition data in the two tables are the same, and thus identifying the identical partition data as the intersection data. The compute node can also perform consistency checks on the hash value sets corresponding to the two sets of partitioned data, determining whether the hash values ​​in the two hash value sets are the same, and identifying the partition data corresponding to the same hash values ​​as the intersection data.

[0241] S1260: Perform a hash operation on a group of partitions with a small amount of data to obtain a hash table corresponding to that group of partitions.

[0242] On the same computing node, two groups of partitioned data belonging to two tables to be joined can have the same or different amounts of data. This embodiment of the application, by comparing the amount of data in the two groups of partitioned data, can determine the group with relatively less data and the other group with relatively more data. Specifically, for the group with less data, a hash operation can be performed to obtain a hash table corresponding to that group of partitioned data.

[0243] S1270: Match and filter the set of partition data with the hash table to obtain the intersection data of the two sets of partition data.

[0244] On the same computing node, a set of partitioned data with a relatively large amount of data is subjected to consistency checks with each hash value in the hash table. If the hash values ​​obtained after hashing each data item in the partitioned data using the same hash function are the same as at least one hash value in the hash table, it can be determined that the data item is the intersection of the two sets of partitioned data.

[0245] S1280: Merge the intersection data obtained from filtering each computing node.

[0246] After merging the intersection data obtained from filtering each computing node, duplicate data items in the intersection data can be removed, avoiding redundant processing of the intersection data.

[0247] S1290: Update the data rows of the tables to be joined based on the intersection data obtained from the merge.

[0248] After obtaining the intersection data, each data row in the table can be matched and filtered with the intersection data to obtain the data rows in the table that contain the intersection data; obtain the distribution number of the intersection data in the table, which is the number of data rows containing the intersection data; based on the distribution number of the merged intersection data in one table, copy and add the data rows containing the intersection data to another table.

[0249] For example, the local table and the opposite table form a pair of tables to be joined. The number of intersection data in the local table is N, and the number of intersection data in the opposite table is M. Based on this, each data row containing the intersection data can be copied (M-1) times in the local table, and each data row containing the intersection data can be copied (N-1) times in the opposite table, thereby realizing the connection correspondence between the local table and the opposite table on each data row.

[0250] S12100: Add a data column to the tables to be joined to represent the index values ​​of the intersection data.

[0251] In this embodiment, a primary index value can first be obtained to distinguish different values ​​of the intersection data; then, a secondary index value can be obtained to distinguish different data rows of the same intersection data between the tables to be joined; finally, a data column containing the primary index value and the secondary index value can be added to the tables to be joined.

[0252] Primary index values ​​correspond to the intersection data; identical intersection data has the same primary index value. Secondary index values ​​correspond to data rows. If two data rows distributed in two tables have the same primary and secondary index values, it means that these two rows are actually the same data row formed by joining data columns in the logical wide table.

[0253] In this embodiment, the offset associated with the intersection data can be determined based on the number of data rows containing the intersection data in the table. The number of data rows containing the intersection data is the number of rows containing the same intersection data in a single table. The offset is used to distinguish different data rows containing the same intersection data in a single table. The secondary index value is determined based on the number of data rows containing the same intersection data in different data rows between the tables to be joined.

[0254] This application embodiment can obtain a first mapping rule and a second mapping rule that are mirror images of each other. The mirror image relationship is used to indicate that the two mapping rules use mapping parameters of different tables at the same position. The distribution quantity and offset are mapped according to one of the first mapping rule and the second mapping rule to obtain the secondary index value of each data row where the intersection data is located in the local table. The distribution quantity and offset are mapped according to the other of the first mapping rule and the second mapping rule to obtain the secondary index value of each data row where the intersection data is located in the opposite table.

[0255] The first mapping rule in this application embodiment includes: multiplying the distribution quantity of the opposite table with the offset of the local table, and then summing it with the offset of the opposite table; the second mapping rule includes: multiplying the distribution quantity of the local table with the offset of the opposite table, and then summing it with the offset of the local table.

[0256] This application embodiment extracts field data corresponding to key values ​​from a table, allowing for separate partitioning of the field data. This partitioned data is then matched and filtered on the computing nodes, and the table to be joined is updated based on the intersection data obtained from the filtering. This embodiment eliminates the need for partitioning and matching the entire table data, significantly improving table join efficiency while greatly saving storage and computing resources on individual computing nodes. Furthermore, it reduces data transfer between multiple computing nodes, conserving network bandwidth. Using a secondary partitioning approach for table joins in distributed storage fully leverages the cluster effect of computing nodes, redistributing computing resources within the node cluster to achieve balanced load across multiple computing nodes and improve table join processing efficiency. By performing hash operations on the field data to obtain hash values ​​and then partitioning the field data based on these hash values, the problem of uneven partitioning due to large differences in field data content can be avoided, improving the uniformity of data distribution across different partitions. By utilizing two mirror-image mapping rules to perform function mapping on two tables, the same secondary index value can be assigned to the data rows forming the connection in both tables. This allows the connection relationship between the tables to be determined by the index value, and then the data columns of the two tables can be joined based on the index value to form a logically wide table after the connection. This embodiment of the application improves the efficiency of connecting and updating the tables, and avoids the problem of transferring or storing the entire table or table partitions between different computing nodes during the connection process.

[0257] The following is based on Figures 13 to 16 The table join process and table join results in the application scenarios of the embodiments of this application are explained.

[0258] Figure 13 This illustration shows a schematic diagram of the process of performing an inner join on a table in an application scenario according to an embodiment of this application.

[0259] like Figure 13 As shown, the process of performing an inner join on a table using the table join method provided in this application can include the following four stages.

[0260] S1310: Repartition.

[0261] Extract the Join Key field from both tables and calculate its hash value (the hash function can be, but is not limited to, SHA256, MD5, etc., or you can use the original field text directly without a hash function. The advantage of calculating the hash value is that it makes the data distribution more even after repartitioning without affecting the correctness of the calculation process). Repartition the two tables using the hash value (or the original text). After repartitioning, the two tables will have the same number of partitions, and the same Join Key will map to the same partition number.

[0262] S1320: Intra-partition join.

[0263] The two tables are matched on the Join Key field within the same partition, and rows that are not in the intersection are filtered out, leaving the remaining rows, while ensuring that the order of the Join Key in the intersection of the two tables is consistent.

[0264] S1330: Supplementary information.

[0265] Since step S1320 ensures that the order of the results of joining the two tables in the new partition is consistent, a new field called the index value is added to the result of the table join to ensure that the same Join Key on both sides has the same index value.

[0266] S1340: Concatenate fields in the original partition.

[0267] The results generated in step S1330 are mapped back to the original partitions, with the partition number field indicating the partition (e.g., if the partition number is 3, then the row goes to partition 3). The results from the original partitions are converted into a hash table. For each row in the original table, the hash table within the current partition is queried. If no corresponding Join Key is found, the row is removed; otherwise, it is retained. The index value is added to a new field as the primary key of the logical wide table, used to identify a row in the table. For the results generated on both sides, the same index value indicates the same row.

[0268] Figure 14 This illustration shows a schematic diagram of the result of table join based on Cartesian product in an application scenario according to an embodiment of this application.

[0269] In simple scenarios, the Join Key used for joining tables is unique. Therefore, we only need to find the intersection of the two tables and generate the joined table based on that intersection. In slightly more complex scenarios, one side of the table may have duplicate Join Keys. In this case, we only need to copy the corresponding number of rows from the non-duplicate side of the table, based on the number of times the Join Key is repeated on the other side. In more general scenarios, both tables contain duplicate Join Keys. If the same Join Key is repeated in both tables, then we need to use a Cartesian product method to generate the corresponding join result. For example... Figure 14 As shown, the left table 1401 contains 2 rows of data, and the right table 1402 contains 3 rows of data, and these rows have the same Join Key. The result of performing an inner join on the left table 1401 and the right table 1402 is equivalent to calculating the Cartesian product of the data in each table, resulting in a wide table 1403 containing 6 rows of data. Wide table 1403 includes all the data columns from both the left table 1401 and the right table 1402.

[0270] Figure 15 This illustration shows a schematic diagram of the result of a table inner join based on Cartesian product calculation in an application scenario according to an embodiment of this application, where elements related to... Figure 14 The same parts.

[0271] The process of performing an inner join on a table based on the Cartesian product calculation using the table join method provided in this application can also include the following four stages.

[0272] S1501: Re-partitioning stage.

[0273] For each row after repartitioning, the Join Key needs to be deduplicated, and the partition number and number of duplicates for each row before repartitioning need to be recorded. The partition number field needs to record all partitions, for example... Figure 15 As shown, the same Join Key has two instances in partition 3 and one instance in partition 1, so the partition number field is 3, 3, 1.

[0274] S1502: Intra-partition join.

[0275] At this stage, the intersection can be found using the deduplicated Join Key.

[0276] S1503: Supplementary Information Phase.

[0277] After joining within a partition, additional information needs to be added to the count field to record the number of duplicate JoinKeys in the other table. At this point, the count field should be a pair of numbers, recording the number of duplicate JoinKeys between the current table and the other table (e.g., ...). Figure 15 (As shown). Add an offset field to the results within the partition, initialized to 0. For each row of the Join result, generate a corresponding number of duplicate rows based on the number of duplicates recorded in the local table in the quantity field. Split the original partition number into each row and increment the offset field accordingly. Figure 10 Taking the third row as an example, the number of rows in the Join result is (3 locally, 2 on the other side). This row needs to be copied into 3 rows, with offset fields of 0, 1, and 2 for each row. It's important to note that the offsets for the same partition number should be consecutive.

[0278] S1504: Concatenate fields in the original partition.

[0279] In the hash table of the original partition, if multiple rows with the same Join Key exist, these rows are merged, and the smallest offset is taken. For each row of the original table, if a corresponding Join Key is found in the hash table, the row of the local table needs to be copied according to the number of rows in the opposite table of the quantity field, and an index value is generated according to the corresponding rules and added to the table field. After completion, the value in the offset column of the hash table needs to be incremented by 1.

[0280] To ensure the correct correspondence between the fields on both sides of the Cartesian product, the two tables need to adopt different index value generation rules. In one implementation, the two rules are as follows.

[0281] First mapping rule: New index value = index value + "_" + quantity on this side * offset on this side + i

[0282] Second mapping rule: New index value = index value + "_" + local offset + i * local quantity

[0283] Here, 'i' represents the iteration value during this copying process, which is equivalent to the identifier of the offset on the other side. For example, if the number of data in the opposite table is 3, and the offsets on the other side are marked as 0, 1, and 2 respectively, and after copying and adding the corresponding data rows in the current table, there are three copies, the values ​​of 'i' are 0, 1, and 2 respectively.

[0284] Figure 16 This diagram illustrates the result of adding index values ​​to a table in an embodiment of this application. For example... Figure 16 As shown, after the left table 1601 and the right table 1602 are joined, an index value data column can be added to the updated table. This data column includes primary index values ​​and secondary index values ​​connected by underscores.

[0285] For other types of joins besides Inner Join (Left Join, Full Join, etc.), the result of Inner Join can be calculated within the original partitions of each of the two tables.

[0286] Taking Left Join as an example, after obtaining the Inner Join result, you only need to add the rows from the local tables that are not part of the Inner Join result to the result table. In the logical wide table, the fields in the opposite table corresponding to the non-intersecting rows are NULL.

[0287] The technical solution provided in this application, when both joined tables have a large number of rows and columns, can minimize the communication volume and storage overhead of distributed computing nodes. This significantly improves computing speed and reduces the computing resources required for table joins.

[0288] The method proposed in this invention constructs a logically wide table after joining two tables, but the actual storage location remains on the distributed nodes of the two tables themselves. It is applicable to scenarios where the two tables come from the same database or different databases. It also satisfies the need for two participating parties to provide tables separately and complete the privacy-preserving table join based on privacy-preserving computation techniques.

[0289] It should be noted that although the steps of the method in this application are described in a specific order in the accompanying drawings, this does not require or imply that the steps must be performed in that specific order, or that all the steps shown must be performed to achieve the desired result. Additional or alternative steps may be omitted, multiple steps may be combined into one step, and / or one step may be broken down into multiple steps.

[0290] The following describes an apparatus embodiment of this application, which can be used to execute the table connection method in the above embodiments of this application. Figure 17 A schematic block diagram of the table connection device provided in an embodiment of this application is shown. Figure 17 As shown, the table connection device 1700 includes:

[0291] Extraction module 1710 is configured to extract field data corresponding to the key value from the table to be joined based on the key value used as the basis for table joining;

[0292] The partitioning module 1720 is configured to partition the extracted field data to obtain multi-group partitioned data corresponding to multiple computing nodes;

[0293] The filtering module 1730 is configured to perform matching filtering on the computing node on the two groups of partitioned data belonging to two tables to be joined, respectively, to obtain the intersection data of the two groups of partitioned data.

[0294] The update module 1740 is configured to update the table to be joined based on the intersection data obtained by filtering each computing node.

[0295] In some embodiments of this application, based on the above technical solutions, the update module 1740 includes:

[0296] The data merging module is configured to merge the intersection data obtained by filtering from various computing nodes;

[0297] The data row update module is configured to update the data rows of the table to be joined based on the intersection data obtained by merging.

[0298] The data column update module is configured to add a data column to the tables to be joined to represent the index values ​​of the intersection data.

[0299] In some embodiments of this application, based on the above technical solutions, the data row update module includes:

[0300] The data row filtering module is configured to match and filter each data row of the table with the intersection data to obtain the data rows in the table that contain the intersection data;

[0301] The quantity acquisition module is configured to acquire the distribution quantity of the intersection data in the table, wherein the distribution quantity is the number of data rows containing the intersection data;

[0302] Based on the distribution of the merged intersection data in one table, copy and add the data row containing the intersection data to another table.

[0303] In some embodiments of this application, based on the above technical solutions, the data column update module includes:

[0304] The first index module is configured to obtain a primary index value used to distinguish the different values ​​of the intersection data;

[0305] The second index module is configured to obtain secondary index values ​​used to distinguish different rows of data with the same intersection between the tables to be joined;

[0306] The index value adding module is configured to add a data column containing the primary index value and the secondary index value to the table to be joined.

[0307] In some embodiments of this application, based on the above technical solutions, the second indexing module includes:

[0308] The offset determination module is configured to determine an offset associated with the intersection data based on the number of distributions of the merged intersection data in the table, wherein the number of distributions is the number of data rows containing the intersection data, and the offset is used to distinguish different data rows containing the same intersection data in a single table;

[0309] The index value determination module is configured to determine a secondary index value based on the distribution quantity and the offset, for distinguishing different data rows containing the same intersection data between tables to be joined.

[0310] In some embodiments of this application, based on the above technical solutions, the index value determination module includes:

[0311] The rule acquisition module is configured to acquire a first mapping rule and a second mapping rule that are mirror images of each other. The mirror image relationship is used to indicate that the two mapping rules use mapping parameters from different tables at the same position.

[0312] The first mapping module is configured to map the distribution quantity and the offset according to one of the first mapping rule and the second mapping rule to obtain the secondary index values ​​of each data row where the intersection data is located in the local table; the local table is one of the two tables to be joined.

[0313] The second mapping module is configured to perform mapping processing on the distribution quantity and the offset according to another of the first mapping rule and the second mapping rule to obtain the secondary index values ​​of each data row where the intersection data is located in the opposite table; the opposite table is the other table of the two tables to be joined.

[0314] In some embodiments of this application, based on the above technical solutions, the first mapping rule includes: multiplying the distribution quantity of the opposite table with the offset of the local table, and then summing it with the offset of the opposite table;

[0315] The second mapping rule includes: multiplying the distribution quantity of the local table with the offset of the opposite table, and then summing the product with the offset of the local table.

[0316] In some embodiments of this application, based on the above technical solutions, the extraction module 1710 includes:

[0317] The partition acquisition module is configured to acquire at least two table partitions that make up the table to be joined, the at least two table partitions being distributed on different computing nodes;

[0318] The field extraction module is configured to extract field data corresponding to the key value from the at least two table partitions based on the key value used as the basis for table connection.

[0319] In some embodiments of this application, based on the above technical solutions, the partitioning module 1720 includes:

[0320] The hash operation module is configured to perform a hash operation on the extracted field data to obtain the hash value of the field data;

[0321] The field partitioning module is configured to partition the field data according to the hash value to obtain multi-group partitioned data corresponding to multiple computing nodes.

[0322] In some embodiments of this application, based on the above technical solutions, the field partitioning module includes:

[0323] The quantity acquisition module is configured to acquire the number of nodes used for processing field data in the calculation.

[0324] The hash value mapping module is configured to map the hash values ​​according to the number of nodes to obtain a set of hash values ​​that is the same as the number of nodes;

[0325] The set allocation module is configured to allocate the hash value set to each computing node, and determine the multi-group partitioned data corresponding to multiple computing nodes based on the allocation results.

[0326] In some embodiments of this application, based on the above technical solutions, the screening module 1730 includes:

[0327] The quantity comparison module is configured to compare the number of data belonging to two groups of partitioned data in two tables to be joined on the computing node.

[0328] The hash operation module is configured to perform hash operations on a group of partitions with a small amount of data to obtain a hash table corresponding to that group of partitions.

[0329] The matching and filtering module is configured to match and filter a group of partition data with a large amount of data against the hash table to obtain the intersection data of the two groups of partition data.

[0330] The specific details of the table connection device provided in the various embodiments of this application have been described in detail in the corresponding method embodiments, and will not be repeated here.

[0331] Figure 18 A schematic block diagram of a computer system architecture for implementing an electronic device according to embodiments of the present application is shown.

[0332] It should be noted that, Figure 18 The computer system 1800 of the electronic device shown is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.

[0333] like Figure 18 As shown, the computer system 1800 includes a central processing unit (CPU) 1801, which can perform various appropriate actions and processes based on programs stored in read-only memory (ROM) 1802 or programs loaded from storage section 1808 into random access memory (RAM). The RAM 1803 also stores various programs and data required for system operation. The CPU 1801, ROM 1802, and RAM 1803 are interconnected via a bus 1804. An input / output interface 1805 (I / O interface) is also connected to the bus 1804.

[0334] The following components are connected to the input / output interface 1805: an input section 1806 including a keyboard, mouse, etc.; an output section 1807 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and speakers, etc.; a storage section 1808 including a hard disk, etc.; and a communication section 1809 including a network interface card such as a local area network card, modem, etc. The communication section 1809 performs communication processing via a network such as the Internet. A drive 1810 is also connected to the input / output interface 1805 as needed. Removable media 1811, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., are installed on the drive 1810 as needed so that computer programs read from them can be installed into the storage section 1808 as needed.

[0335] Specifically, according to embodiments of this application, the processes described in the various method flowcharts can be implemented as computer software programs. For example, embodiments of this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via communication section 1809, and / or installed from removable medium 1811. When the computer program is executed by central processing unit 1801, it performs various functions defined in the system of this application.

[0336] It should be noted that the computer-readable medium shown in the embodiments of this application can be a computer-readable signal medium, a computer-readable storage medium, or any combination of the two. A computer-readable storage medium can be, for example,—but not limited to—an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), flash memory, optical fiber, portable compact disc read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this application, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. In this application, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such transmitted data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. The computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to wireless, wired, etc., or any suitable combination thereof.

[0337] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0338] It should be noted that although several modules or units for the device used to perform actions have been mentioned in the detailed description above, this division is not mandatory. In fact, according to the embodiments of this application, the features and functions of two or more modules or units described above can be embodied in one module or unit. Conversely, the features and functions of one module or unit described above can be further divided and embodied by multiple modules or units.

[0339] Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing device (such as a personal computer, server, touch terminal, or network device, etc.) to execute the method according to the embodiments of this application.

[0340] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein.

[0341] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this application is limited only by the appended claims.

Claims

1. A method of table joining, characterized by, The method comprises the following steps: extracting field data corresponding to a key value from two tables to be connected according to the key value as a table connection basis; partitioning the extracted field data to obtain multiple groups of partition data corresponding to multiple computing nodes; performing matching screening on two groups of partition data respectively belonging to the two tables to be connected on each computing node to obtain intersection data of the two groups of partition data; merging the intersection data screened on each computing node; updating data rows of the tables to be connected according to the merged intersection data; obtaining a primary index value for distinguishing different values of the intersection data; obtaining a secondary index value for distinguishing different data rows where the intersection data is located between the two tables to be connected; adding a data column containing the primary index value and the secondary index value to the tables to be connected.

2. The table joining method according to claim 1, characterized by, The method for updating data rows of the tables to be connected according to the merged intersection data comprises the following steps: performing matching screening on each data row of the tables to be connected and the intersection data to obtain data rows of the tables to be connected containing the intersection data; obtaining a distribution quantity of the intersection data in the tables to be connected, the distribution quantity being a quantity of data rows containing the intersection data; reproducing and adding data rows where the intersection data is located in one of the tables to be connected according to a distribution quantity of the intersection data in the other table to be connected.

3. The table joining method according to claim 1, characterized by, The method for obtaining a secondary index value for distinguishing different data rows where the intersection data is located comprises the following steps: determining an offset associated with the intersection data according to a distribution quantity of the intersection data in the tables to be connected, the distribution quantity being a quantity of data rows containing the intersection data, the offset being used for distinguishing different data rows where the intersection data is located in a single table; determining a secondary index value for distinguishing different data rows where the intersection data is located between the two tables to be connected according to the distribution quantity and the offset.

4. The table joining method according to claim 3, characterized by, The method for determining a secondary index value for distinguishing different data rows where the intersection data is located between the two tables to be connected according to the distribution quantity and the offset comprises the following steps: obtaining a first mapping rule and a second mapping rule in a mirror image relationship, the mirror image relationship being used for representing that the two mapping rules use mapping parameters of different tables at the same position; performing mapping processing on the distribution quantity and the offset according to one of the first mapping rule and the second mapping rule to obtain a secondary index value of each data row where the intersection data is located in a self-side table; the self-side table being one of the two tables to be connected; performing mapping processing on the distribution quantity and the offset according to the other of the first mapping rule and the second mapping rule to obtain a secondary index value of each data row where the intersection data is located in a counter-side table; the counter-side table being the other of the two tables to be connected.

5. The table joining method according to claim 4, characterized by, The first mapping rule comprises: multiplying a distribution quantity of the counter-side table with an offset of the self-side table, and then summing an offset of the counter-side table. The second mapping rule includes: multiplying the distribution quantity of the local table with the offset of the opposite table, and then summing the product with the offset of the local table; Wherein, the local table is one of the two tables to be connected, and the opposite table is the other of the two tables to be connected.

6. The form joining method according to any one of claims 1 to 5, characterized in that, Based on the key value used as the basis for table joining, extract the field data corresponding to the key value from the two tables to be joined, including: Obtain at least two table partitions that make up the two tables to be joined, wherein the at least two table partitions are distributed on different computing nodes; Based on the key value used as the basis for table connection, extract field data corresponding to the key value from the at least two table partitions.

7. The form joining method according to any one of claims 1 to 5, characterized by, The extracted field data is partitioned to obtain multi-group partitioned data corresponding to multiple computing nodes, including: Perform a hash operation on the extracted field data to obtain the hash value of the field data; The field data is partitioned based on the hash value to obtain multi-group partitioned data corresponding to multiple computing nodes.

8. The table joining method according to claim 7, wherein The field data is partitioned based on the hash value to obtain multi-group partitioned data corresponding to multiple computing nodes, including: Get the number of nodes used for computational nodes that process field data; The hash values ​​are mapped according to the number of nodes to obtain a set of hash values ​​with the same number of nodes; The hash value set is allocated to each of the computing nodes, and multi-group partitioned data corresponding to multiple computing nodes is determined based on the allocation results.

9. The form joining method according to any one of claims 1 to 5, characterized by, On each of the aforementioned computing nodes, the two groups of partitioned data belonging to the two tables to be joined are matched and filtered to obtain the intersection data of the two groups of partitioned data, including: Compare the number of data points belonging to the two groups of partitioned data in the two tables to be joined at the computing node; Perform a hash operation on a group of partitions with a small amount of data to obtain a hash table corresponding to that group of partitions; The group of partitions with a larger amount of data is matched and filtered with the hash table to obtain the intersection data of the two groups of partitions.

10. A table connecting device, characterized in that include: The extraction module is configured to extract field data corresponding to the key value from two tables to be joined, based on the key value used as the basis for table joining. The partitioning module is configured to partition the extracted field data to obtain multi-group partitioned data corresponding to multiple computing nodes; The filtering module is configured to perform matching and filtering on each of the computing nodes on the two groups of partitioned data belonging to the two tables to be connected, to obtain the intersection data of the two groups of partitioned data. The update module is configured to merge the intersection data obtained from filtering by each computing node; and update the data rows of the table to be joined based on the merged intersection data. Obtain a primary index value to distinguish the different values ​​of the intersection data; obtain a secondary index value to distinguish the different data rows of the intersection data between the two tables to be joined; Add a data column containing the primary index value and the secondary index value to the table to be joined.

11. A computer readable medium characterized by The computer-readable medium stores a computer program that, when executed by a processor, implements the table join method according to any one of claims 1 to 9.

12. An electronic device, comprising: include: processor; as well as Memory for storing the executable instructions of the processor; The processor is configured to cause the electronic device to perform the table join method according to any one of claims 1 to 9 by executing the executable instructions.

13. A computer program product comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the table join method according to any one of claims 1 to 9.