A method and system for multi-table join processing of SQL streaming data

By splitting real-time streaming data into attribute stream tables and constructing entity-attribute mapping index tables, and combining field semantics and distribution characteristics to generate multi-table join plans, the problem that predefined join rules in existing technologies are difficult to adapt to complex queries is solved, thus improving the efficiency and accuracy of streaming multi-table joins.

CN122240618APending Publication Date: 2026-06-19SHOUKE SOFTSTART (BEIJING) SOFTWARE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHOUKE SOFTSTART (BEIJING) SOFTWARE CO LTD
Filing Date
2026-03-20
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing streaming data processing technologies rely heavily on predefined table structures and fixed join keys, making it difficult to adapt to complex query requirements. This leads to invalid joins and an increase in the size of intermediate results, reducing the efficiency and accuracy of streaming multi-table join processing.

Method used

By receiving real-time streaming data, the data is split into multiple attribute stream tables according to the entity attribute dimension, and an entity-attribute mapping index table is constructed. Based on field information and historical data distribution, candidate relationships for multi-table associations are determined. Combined with user queries, a streaming multi-table association execution plan with semantics equivalent to SQL queries is generated to dynamically match association requirements.

Benefits of technology

It enables adaptive matching of multi-table join requirements under different query scenarios without the need for pre-fixed join keys, reducing invalid joins and the size of intermediate results, and improving the execution efficiency and accuracy of streaming SQL multi-table joins.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122240618A_ABST
    Figure CN122240618A_ABST
Patent Text Reader

Abstract

This invention provides a method and system for multi-table join processing of SQL streaming data, relating to the field of data processing technology. The method includes: receiving real-time streaming data; splitting the real-time streaming data into multiple attribute stream tables based on entity identifier fields and attribute dimensions; constructing an entity-attribute mapping index table based on each attribute stream table; determining a set of candidate multi-table join relationships between the attribute stream tables based on field information and historical data distribution of each attribute stream table; determining a set of join requirements based on user-submitted SQL queries; determining a set of participating attribute stream tables based on the set of join requirements and the set of candidate multi-table join relationships; generating a streaming multi-table join execution plan semantically equivalent to the SQL query based on the participating attribute stream table set and the corresponding join field pairs; and executing the streaming multi-table join execution plan based on the entity-attribute mapping index table and outputting the real-time processing result of the SQL query.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and in particular to a method and system for processing multi-table associations of SQL stream data. Background Technology

[0002] With the continuous development of big data and real-time computing technologies, SQL-based streaming data processing has been widely applied in business scenarios such as financial risk control, IoT monitoring, log analysis, and real-time decision-making. Streaming data is typically generated continuously and rapidly, characterized by large data volume, high update frequency, and strong time sequence. To meet the needs of complex business analysis, it is often necessary to perform multi-table joins on streaming data from different sources or dimensions to achieve cross-attribute and cross-dimensional information integration and real-time analysis.

[0003] In streaming data processing, achieving efficient and accurate multi-table joins not only enhances the expressive power of real-time queries, but also supports advanced SQL semantics such as complex condition filtering, aggregation statistics, and online analysis, thereby providing business systems with richer real-time insights.

[0004] Existing streaming data processing technologies largely rely on predefined table structures and fixed join keys to perform multi-table join operations, typically assuming that the relationships between tables are static and explicit. However, in practical applications, streaming data often exhibits characteristics such as dynamically changing attributes, unstable field value distribution, and missing entity attributes, making predefined join rules difficult to adapt to complex query requirements. Furthermore, the lack of comprehensive modeling of the relationship between SQL query semantics and the actual distribution of streaming data easily introduces invalid joins and expands the scale of intermediate results, thereby reducing the efficiency and accuracy of streaming multi-table join processing. Summary of the Invention

[0005] In view of the shortcomings of the prior art, the purpose of this invention is to provide a method for processing multi-table joins in SQL streaming data. This method addresses the problem that existing streaming data processing technologies often rely on predefined table structures and fixed join keys to perform multi-table join operations, making it difficult for predefined join rules to adapt to complex query requirements. Furthermore, the lack of comprehensive modeling of the relationship between SQL query semantics and the actual distribution of streaming data easily introduces invalid joins and expands the scale of intermediate results, thereby reducing the efficiency and accuracy of streaming multi-table join processing.

[0006] A first aspect of this invention provides a method for processing multi-table joins in SQL stream data, comprising: S1: Receive real-time streaming data, which includes an entity identifier field and multiple attribute fields; S2: Based on the entity identifier field, real-time streaming data is split into multiple attribute stream tables according to attribute dimensions; S3: Construct an entity-attribute mapping index table based on each attribute flow table; S4: Based on the field information and historical data distribution of each attribute flow table, determine the set of candidate multi-table association relationships between each attribute flow table; S5: Combine the SQL queries submitted by the user to determine the set of related requirements; S6: Determine the set of participating attribute flow tables by using the set of associated requirements as a constraint and combining it with the set of candidate relationships for multi-table association; S7: Generate a streaming multi-table join execution plan that is semantically equivalent to the SQL query, based on the participating attribute flow table set and the corresponding associated field pairs of the participating attribute flow table set; S8: Combines the entity-attribute mapping index table to execute a streaming multi-table join execution plan and outputs the real-time processing results of the SQL query.

[0007] A second aspect of this invention provides a multi-table join processing system for SQL stream data, comprising: a processor and a memory; The memory stores programs or instructions that can run on a processor, and when the programs or instructions are executed by the processor, they implement the steps of a multi-table join processing method for SQL stream data as described in the first aspect.

[0008] A third aspect of the present invention provides a readable storage medium on which a program or instructions are stored, and when the program or instructions are executed by a processor, the steps of the multi-table join processing method for SQL stream data as described in the first aspect are implemented.

[0009] The beneficial effects of the technical solutions provided by the embodiments of the present invention include at least the following: In this embodiment of the invention, real-time streaming data is split into multiple attribute stream tables according to entity attribute dimensions. A set of candidate relationships for multi-table association is constructed by combining field semantic similarity, value distribution similarity, and field uniqueness constraints. Based on parsing the semantics of SQL queries, the attribute stream tables participating in the association and their associated field pairs are dynamically determined, thereby generating a streaming multi-table association execution plan equivalent to the query semantics. This achieves efficient streaming association processing driven by entity identifiers. As a result, it can adaptively match the multi-table association requirements under different query scenarios without pre-fixing the association key, effectively reducing invalid associations and the scale of intermediate results, and improving the execution efficiency, result accuracy, and overall system resource utilization of streaming SQL multi-table association. Attached Figure Description

[0010] The accompanying drawings are for illustrative purposes only and are not intended to limit the invention. Throughout the drawings, the same reference numerals denote the same parts. Obviously, the drawings described below are merely some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings without any creative effort.

[0011] Figure 1 This is a flowchart illustrating a multi-table join processing method for SQL stream data provided in an embodiment of the present invention.

[0012] Figure 2 This is a schematic diagram of the structure of a multi-table association processing system for SQL stream data provided in an embodiment of the present invention. Detailed Implementation

[0013] To enable those skilled in the art to better understand the technical solutions in the embodiments of the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0014] The multi-table join processing method for SQL stream data provided by the present invention will be described in detail below with reference to the accompanying drawings, through specific embodiments and application scenarios.

[0015] Reference manual attached Figure 1 The diagram illustrates a flowchart of a multi-table join processing method for SQL stream data provided in an embodiment of the present invention.

[0016] This invention provides a method for processing multi-table joins in SQL stream data, which may include the following steps: S1: Receive real-time streaming data, which includes an entity identifier field and multiple attribute fields.

[0017] Real-time streaming data refers to data records that are continuously generated and received by the system in real time, typically characterized by high arrival frequency and time sequence. The entity identifier field is used to uniquely identify business entity instances, distinguishing different entity objects, such as user identifiers, device identifiers, or order identifiers. Attribute fields are data fields that describe the state or characteristics of an entity across different dimensions, used to characterize the entity's multidimensional attribute information. Their values ​​can change dynamically over time and may be missing or empty in different entity instances.

[0018] It should be noted that by uniformly receiving real-time streaming data containing entity identifier fields and multiple attribute fields, the system can organize and drive subsequent streaming data processing with entities as the core. This lays a consistent data foundation for attribute-level splitting, multi-table joins, and streaming execution. This approach helps to preserve the natural relationship between entities and attributes when the data arrives, avoiding the loss of entity information due to improper data preprocessing. At the same time, it enhances the system's adaptability to high-concurrency and dynamically changing streaming data, providing a stable and scalable input guarantee for subsequent multi-table join processing.

[0019] S2: Based on the entity identifier field, real-time streaming data is split into multiple attribute stream tables according to attribute dimensions.

[0020] Among them, the attribute flow table is a streaming data table structure built for a single attribute field. It is used to continuously store and update the entity identifier, timestamp, and attribute value corresponding to the attribute. Each attribute flow table only focuses on the changes of a certain attribute under different entities and different times.

[0021] Specifically, each attribute corresponds to an independent attribute flow table, which is used to store entity identifiers, timestamps, and attribute values.

[0022] In one possible implementation, S2 specifically includes: S201: Using the entity identifier field as an index, identify the set of attributes with non-null values ​​in each real-time stream data record.

[0023] The set of attributes with non-null values ​​refers to the set of attribute fields that actually have valid values ​​and are not null values ​​in a certain real-time streaming data record.

[0024] S202: For each attribute field in the attribute set, split the real-time stream data according to the attribute dimension to determine multiple attribute stream tables.

[0025] Specifically, for each real-time arriving stream data record, the record is parsed based on the entity identifier field, and the non-empty attribute fields identified are traversed. The data corresponding to each attribute field is split from the original real-time stream data according to the attribute dimension. For each attribute field with a non-empty value, a data unit containing the entity identifier and the attribute value is generated and written into the attribute stream table corresponding to that attribute field. This allows data of different attributes to be stored and updated in an independent streaming form, realizing the continuous splitting of real-time stream data in terms of attribute dimensions.

[0026] It should be noted that by using the entity identifier field as an index to identify the set of attributes with non-null values ​​in real-time streaming data records, and constructing attribute flow tables only for attribute fields within the attribute set, the system can avoid writing null values ​​or invalid attributes to streaming storage, thereby reducing data redundancy and storage overhead. This processing method improves the data validity and density of the attribute flow tables, which is beneficial for quickly locating the actual attribute information of entities during subsequent multi-table joins, reducing meaningless join operations, and improving the overall efficiency and accuracy of streaming multi-table join processing.

[0027] S3: Construct an entity-attribute mapping index table based on each attribute flow table.

[0028] The entity-attribute mapping index table is a data structure used to record the correspondence between entity instances and their attributes. The entity identifier is used as the index item to map the set of attribute fields that the entity currently or historically has non-null values, which is used to represent the association between the entity and multiple attribute flow tables.

[0029] Specifically, each attribute flow table is scanned and maintained, using the entity identifier field as a unified index key. The attribute field information for each entity appearing in each attribute flow table is statistically recorded. When a non-null attribute value is detected for an entity in a corresponding attribute flow table, a mapping relationship between the entity identifier and the corresponding attribute field is established in the entity-attribute mapping index table. By continuously updating the mapping index table, it dynamically reflects the set of non-null attributes possessed by the entity instance, thus providing efficient index support for subsequent entity-driven multi-table joins.

[0030] It should be noted that by constructing an entity-attribute mapping index table based on each attribute flow table, the system can quickly retrieve the corresponding attribute set for each entity, thereby avoiding blind scanning of unrelated attribute flow tables in subsequent multi-table join processing. This index structure helps to narrow down the range of attribute flow tables actually involved in the join, reduces the computational complexity of streaming multi-table joins, and provides an efficient positioning mechanism for entity-identity-driven join execution, thus improving the real-time performance and resource utilization efficiency of the overall streaming query processing.

[0031] S4: Based on the field information and historical data distribution of each attribute flow table, determine the set of candidate multi-table association relationships between each attribute flow table.

[0032] Here, field information refers to the field names, data types, and semantic descriptions contained in each attribute flow table. Historical data distribution refers to the distribution characteristics of field values ​​formed by attribute flow tables within a historical time range, including statistical information such as value range, frequency, and degree of repetition. The multi-table join candidate relationship set refers to the set of relationships that may be used for join operations between attribute flow tables, obtained based on field information and historical data distribution analysis. Each candidate relationship typically includes the attribute flow table pair involved in the join and the corresponding candidate join field pair.

[0033] In one possible implementation, S4 specifically includes: S401: Determine the field combinations between different attribute flow tables.

[0034] Among them, field combination refers to the combination relationship formed by selecting fields from the field sets of different attribute flow tables in pairs.

[0035] S402: Construct candidate pairs of related fields between various attribute flow tables based on field combinations: ; in, Represents attribute flow table With attribute flow table Candidate pairs of related fields between them Indicates the attribute flow table Selecting a specific field from the set of fields, Indicates from Selecting a specific field from the set of fields, Indicates the first i Each attribute stream table, Indicates the first j Each attribute stream table, Cols ( ) represents the set of fields in the attribute flow table.

[0036] It should be noted that this formula constructs a complete candidate space of related field pairs by systematically combining the field sets of different attribute flow tables. This allows for comprehensive coverage of potential associations between attribute flow tables without relying on manually specified association keys. This Cartesian combination-based approach avoids missing valid related field pairs due to insufficient prior assumptions, providing a unified and standardized candidate input foundation for subsequent connectivity assessments based on multi-dimensional indicators such as semantic similarity and value distribution similarity. This is beneficial for improving the completeness and robustness of multi-table association discovery in complex and heterogeneous streaming data scenarios.

[0037] Among them, the candidate pair of related fields refers to a pair of fields consisting of fields from two different attribute flow tables that may be used for related operations.

[0038] S403: Calculate the field-level connectability score for candidate pairs of related fields based on the similarity of field value distribution and field semantic similarity. ; in, Represents attribute flow table Middle field With attribute flow table Middle field Field-level connectability scores between them Representation field With fields Semantic similarity of fields between them Representation field With fields The similarity of the value distributions between them, where max represents the maximum value. u ( ) indicates a field uniqueness measure.

[0039] It should be noted that this formula integrates the similarity of field value distribution with the similarity of field semantics, and introduces a field uniqueness metric as a weighted constraint. This comprehensive evaluation of candidate pairs of related fields from the data, semantic, and structural levels avoids the limitations of relying solely on a single feature to determine field correlation. This multi-dimensional connectibility scoring mechanism can more accurately identify field pairs that have correlation value in both business meaning and actual data distribution, providing a reliable quantitative basis for subsequent multi-table relationship screening and execution plan generation, thus improving the accuracy and stability of multi-table join processing.

[0040] Among them, value distribution similarity refers to measuring the degree of overlap at the data level by comparing the intersection and union of the historical value instance sets of two fields. Field semantic similarity refers to calculating the similarity in business meaning by using the semantic embedding vectors of field names or descriptions. Field-level connectability score is a quantitative indicator obtained by comprehensively evaluating candidate field pairs at the semantic level, data distribution level, and uniqueness constraint level.

[0041] in, ,in, Representation field The collection of all actual value instances that have appeared in the corresponding attribute flow table. Representation field The set of all actual value instances that have appeared in the corresponding attribute flow table. The intersection symbol is used to represent the intersection. It represents the union of sets. ,in, Represents the cosine function. Indicates the field The semantic representation vector, For fields The semantic representation vector. ,in, Representation field The degree of uniqueness of the values ​​in its respective attribute flow table. Representation field The set consisting of all distinct values ​​in the set. Representation field The formula is a set of all value instances that have appeared in the corresponding attribute flow table. It is used to quantify the uniqueness of field values ​​by calculating the ratio of the number of different values ​​of a field to the total number of value instances of the field. This determines whether a field has primary key or near-primary key characteristics, so as to ensure that at least one side of the field has a stable one-to-many or one-to-one mapping relationship in the process of multi-table association.

[0042] Specifically, the first formula quantifies the degree of overlap between two sets of field value instances by calculating the intersection-union ratio (IUR) of those instances, thus assessing the likelihood that a field, when used as a association key, has a matching basis at the instance level. The second formula assesses whether fields express the same or similar business meanings at the pattern or semantic level by comparing the cosine similarity between their semantic embedding vectors.

[0043] S404: Based on the distribution characteristics of fields in the corresponding attribute flow table, constrain and correct the field-level connectability score: ; in, This represents the field-level connectability score after constraint correction.

[0044] It should be noted that this formula, by introducing a field uniqueness metric as a constraint factor on top of the initial field-level joinability score, performs a secondary correction on candidate pairs of related fields, thereby strengthening the ability of fields to act as stable join keys in the relationship. By assigning higher weights to fields with higher uniqueness, the interference of highly repetitive or noisy fields on the join score can be effectively suppressed, ensuring that at least one side of the fields has a relatively stable one-to-one or one-to-many mapping relationship in the multi-table join process, thus improving the reliability and practical feasibility of the multi-table join relationship.

[0045] S405: Based on the field-level connectability score after constraint correction, determine the set of candidate multi-table association relationships between various attribute flow tables.

[0046] Specifically, based on the constrained and corrected field-level connectability scores, all candidate pairs of related fields between different attribute flow tables are sorted and filtered. When the connectability score of a certain field pair meets a preset threshold or is at the optimal level within the same attribute flow table pair, that field pair, along with its corresponding attribute flow table pair, is included in the multi-table association candidate relationship set. In this way, only field relationships with high association credibility in terms of semantic consistency, value distribution matching, and uniqueness constraints are retained, thus forming a well-structured and quality-controlled multi-table association candidate relationship set, providing a reliable structural foundation for subsequent multi-table association processing.

[0047] The candidate relationships in the multi-table association candidate relationship set include: the attribute flow table pairs involved in the association, the corresponding field pairs, and the joinability evaluation value of the field pairs.

[0048] It should be noted that by systematically constructing candidate pairs of related fields at the field combination level, and combining the similarity of field value distribution, semantic similarity, and uniqueness constraints to perform multi-dimensional quantitative evaluation of these candidate pairs, the system can automatically identify field relationships with actual association significance without manual specification of the association key. This method effectively avoids the misjudgment problem caused by relying solely on field names or single statistical features, significantly improving the accuracy and reliability of multi-table association candidate relationships. It provides a high-quality, executable structural foundation for the subsequent generation of streaming multi-table association plans, thereby improving the efficiency and stability of overall streaming data association processing.

[0049] S5: Combine the SQL queries submitted by the user to determine the set of related requirements.

[0050] Here, the user-submitted SQL query refers to a structured query statement initiated by the user or upper-layer application for query analysis of streaming data. The set of related requirements refers to the set of requirement descriptions obtained after parsing the SQL query, which is used to characterize the attribute information that the query needs to cover at the semantic level, including the query target field, filter condition fields, and grouping and aggregation related fields, etc. These requirements together determine the scope of attributes and relationships that must be involved in multi-table join processing.

[0051] Optionally, S5 specifically involves: parsing the SQL query submitted by the user, extracting the target attribute fields, filter condition fields, and aggregation fields involved in the SQL query, and constructing a set of related requirements based on the parsing results to represent the attribute information that the SQL query needs to cover semantically.

[0052] It's important to note that by combining user-submitted SQL queries to determine the set of join requirements, the system can use query semantics as a driving constraint for subsequent multi-table join processing, avoiding blind joins based solely on data structure or historical statistics. This approach helps accurately identify the attribute fields that the current query truly depends on, thereby excluding attribute flow tables irrelevant to the query during subsequent attribute flow table selection and join plan generation. This reduces unnecessary join calculations and improves the targeting, execution efficiency, and semantic consistency of streaming SQL multi-table join processing.

[0053] In one possible implementation, the SQL query includes: the target field, the data source identifier, the filter conditions, the grouping information, the aggregation information, and the join semantics.

[0054] Specifically, the target field of the query is the SELECT clause, the data source identifier is the FROM clause, the filtering conditions are the WHERE clause, the grouping and aggregation information is the GROUP BY / HAVING clause, and the join semantics are implicit or explicit JOIN.

[0055] S6: Using the set of associated requirements as constraints, and combining it with the set of candidate relationships for multi-table associations, determine the set of tables participating in the attribute flow.

[0056] Among them, the participating attribute flow table set refers to the set of attribute flow tables that are selected to participate in the current multi-table association processing process, provided that the set of association requirements is met and the executable association relationship constraints are met.

[0057] In one possible implementation, S6 specifically includes: S601: Select multiple attribute flow tables related to the set of associated requirements, and combine the selected attribute flow tables to obtain a set of attribute flow tables.

[0058] S602: Initialize the attribute flow table set.

[0059] S603: Based on the coverage of the initialized attribute flow table set to the associated requirement set, calculate the marginal coverage gain of the candidate attribute flow table to the uncovered associated requirements: ; in, This indicates the attribute stream set after initialization. Based on this, a candidate attribute flow table is introduced. Then, the additional coverage capability brought about by the set of query related requirements, i.e., the marginal coverage gain, Indicates the first In the round of iteration, the set of attribute stream tables that will participate in the association processing has been determined. Indicates the first A candidate attribute flow table, Represents the first in the set of related requirements Each requirement item (a SELECT field, a WHERE condition field, and a GROUP BY field) Represents the candidate attribute flow table For the The coverage or matching degree of each requirement item. This indicates the attribute stream set after initialization. Next, the The extent to which each requirement has been covered. This represents the marginal cover constraint operator.

[0060] Among them, marginal coverage gain is a quantitative indicator used to measure the new coverage capability of candidate attribute flow tables for previously uncovered query needs after their introduction.

[0061] Specifically, the greater the marginal coverage gain, the more queries that are not yet fully covered can be answered by the table. If the marginal coverage gain is 0, it means that the table does not add any new value to the current query.

[0062] It should be noted that this formula calculates the marginal coverage capability of candidate attribute flow tables for query association requirements that are not yet fully covered, based on the currently selected attribute flow table set, thereby avoiding the repeated selection of attribute flow tables with overlapping functions and prioritizing the selection of attribute flow tables that can supplement query information.

[0063] This refers to candidate attribute flow tables that have not yet been selected into the set of selected attribute flow tables.

[0064] S604: Calculate the association connectability gain between candidate attribute flow tables and the initialized set of attribute flow tables: ; in, This indicates the attribute stream set after initialization. Based on this, a candidate attribute flow table is introduced. Subsequently, the new associative executable capability between the candidate attribute flow table and the initialized attribute flow table set at the structural level, namely the associative connectability gain, Indicates the first A candidate attribute flow table.

[0065] Specifically, the greater the association connectivity gain, the easier it is for the candidate attribute flow table to form an executable JOIN structure with the initialized attribute flow table set. If it is 0, it means that the candidate table cannot be effectively associated with the initialized attribute flow table set.

[0066] It should be noted that this formula measures the improvement in the overall connectivity of the association structure after the addition of a candidate attribute flow table by calculating the cumulative value of the optimal field-level association connectivity between the candidate attribute flow table and each of the selected attribute flow tables, based on the current set of selected attribute flow tables.

[0067] S605: Construct the comprehensive utility function based on the marginal coverage gain and the associated connectivity gain: ; in, Indicates the first In each round of iterations, the attribute flow tables that are ultimately selected into the initialized attribute flow table set after comprehensive evaluation are... Represents the attribute stream set, This represents the marginal coverage gain weight (the larger the marginal coverage gain weight, the more the system tends to select the attribute flow table that can cover more query needs; those skilled in the art can set the size of the marginal coverage gain weight according to actual needs, and this invention does not limit it). The association joinability gain weight coefficient represents the association joinability gain weight coefficient. (The larger the association joinability gain weight coefficient is, the more the system tends to select the attribute flow table that is easier to form a JOIN structure with the existing table. Those skilled in the art can set the size of the association joinability gain weight coefficient according to actual needs. This invention does not limit this setting.) Indicates marginal coverage gain. This indicates the gain of connectivity.

[0068] It should be noted that this formula, by comprehensively considering the ability of the attribute flow table to cover the query association requirements and the association and connectivity with the selected attribute flow table set among the candidate attribute flow tables that have not yet been selected, selects the attribute flow table with the highest comprehensive utility as the selection result of the current round, thereby gradually constructing a multi-table association processing scheme that satisfies both query semantic requirements and has an executable association structure.

[0069] S606: Select the candidate attribute flow table that maximizes the comprehensive utility function value as the attribute flow table to be added.

[0070] S607: Add the attribute flow table to be added to the initialized attribute flow table set to determine the set of participating attribute flow tables for association processing.

[0071] It should be noted that by introducing an iterative selection mechanism based on marginal coverage gain and associative connectability gain, the system can prioritize adding attribute flow tables with good associative and executable properties to the already selected attribute flow tables while ensuring that the semantic requirements of SQL queries are gradually and fully covered. This avoids repeatedly selecting attribute flow tables with overlapping functions or those that cannot form an effective JOIN structure. This method achieves a balance between semantic integrity and structural feasibility, gradually building a set of participating attribute flow tables with controlled size and structural connectivity, effectively reducing the multi-table join search space, and improving the execution efficiency and stability of streaming multi-table join processing.

[0072] S7: Generate a streaming multi-table join execution plan that is semantically equivalent to the SQL query, based on the participating attribute stream table set and the corresponding associated field pairs of the participating attribute stream table set.

[0073] In this context, a join pair refers to a combination of fields used to perform join operations between different attribute flow tables. A streaming multi-table join execution plan is a plan structure that describes the specific join order, join conditions, and execution logic of multiple attribute flow tables in a streaming processing environment. SQL query semantic equivalence means that the generated streaming multi-table join execution plan is consistent with the user-submitted SQL query at the query result level, correctly reflecting the semantic requirements such as filtering, projection, and aggregation defined in the SQL query.

[0074] In one possible implementation, S7 specifically includes: S701: From the set of candidate relationships for multi-table association, determine the association field pairs used for association processing between various attribute flow tables.

[0075] S702: Based on the participating attribute flow table set and associated field pairs, construct an attribute flow table association graph, where the nodes in the attribute flow table association graph are attribute flow tables, and the edges are the association relationships between various attribute flow tables.

[0076] Specifically, each attribute flow table in the participating attribute flow table set is used as a node in the graph structure, and corresponding edges are established between attribute flow tables based on the association field pairs. Each edge represents an executable association between two attribute flow tables, and the edge attributes can include the corresponding association field pairs and their connectability evaluation values. By abstracting attribute flow tables and their associations into an attribute flow table association graph, the system can characterize the association topology between multiple tables at the overall structural level, providing an intuitive and computable structural basis for subsequently determining the execution order of multi-table associations and generating streaming multi-table association execution plans.

[0077] Among them, the attribute flow table association graph is a graph structure constructed with attribute flow tables as nodes and the association relationships between attribute flow tables as edges, used to describe the overall structure of multi-table associations.

[0078] S703: Determine the execution order of multi-table associations between various attribute flow tables based on the attribute flow table association diagram.

[0079] The execution order of multi-table associations refers to the order in which each attribute stream table participates in the association operation during the streaming process.

[0080] S704: Based on the execution order of multi-table associations, generate the corresponding streaming multi-table association logic execution plan, wherein the streaming multi-table association logic execution plan includes the association order and association conditions of the attribute flow tables.

[0081] Among them, the streaming multi-table association logic execution plan is a logical layer execution scheme that describes the association order and association conditions of attribute flow tables.

[0082] S705: Perform semantic fusion on the execution plan of streaming multi-table join logic based on the filtering conditions and projection fields in the SQL query.

[0083] Specifically, by performing semantic parsing on the SQL query, the filtering conditions and projection field information are extracted, and the filtering conditions are pushed forward and integrated into the corresponding attribute flow table or association node in the streaming multi-table association logic execution plan. This allows the filtering operation to be performed as early as possible during the association execution process. At the same time, the output fields in the logic execution plan are pruned according to the projection fields, retaining only the data items required for the query results. This ensures that the streaming multi-table association logic execution plan maintains consistency with the SQL query semantics in terms of execution structure and output results, and effectively reduces the size of intermediate results and unnecessary data processing overhead.

[0084] It should be noted that semantic fusion is performed on the execution plan of the streaming multi-table join logic to ensure that the logical execution plan and the SQL query remain equivalent in terms of query results.

[0085] S706: Converts the semantically fused streaming multi-table join logic execution plan into a streaming multi-table join execution plan that is semantically equivalent to the SQL query.

[0086] It should be noted that by introducing an attribute flow table relationship graph and determining the execution order of multi-table joins accordingly, the system can plan the multi-table join process at the overall structural level, avoiding disordered or inefficient join execution. Simultaneously, by semantically fusing the logical execution plan with filtering conditions and projection fields in the SQL query, filtering and projection operations can be applied in advance during the join process, effectively reducing the size of intermediate results. This method can generate a clear and highly executable streaming multi-table join execution plan while ensuring semantic equivalence of queries, improving the efficiency, accuracy, and system stability of streaming multi-table join processing.

[0087] S8: Combines the entity-attribute mapping index table to execute a streaming multi-table join execution plan and outputs the real-time processing results of the SQL query.

[0088] Among them, the real-time processing result of SQL query refers to the query result data stream that is continuously output after performing correlation, filtering, projection and optional aggregation operations on the real-time arriving streaming data according to the execution plan.

[0089] In one possible implementation, S8 specifically includes: S801: Based on the entity-attribute mapping index table, determine the subset of attribute flow tables participating in the association processing: ; in, Indicates the first [number] participants in the query processing. One entity instance, Indicates the first One attribute field, Indicates the first The set of non-empty attributes corresponding to each entity instance Indicates a null value. Indicates the first The attribute field in the first The value can be obtained under each entity instance.

[0090] It should be noted that this formula utilizes an entity-attribute mapping index table to select only the attribute flow table corresponding to the non-null attributes actually possessed by the current entity instance for association processing, thereby precisely limiting the scope of participation in multi-table associations at the entity level. This dynamic filtering mechanism based on non-null attribute sets can effectively avoid the invalid participation of irrelevant attribute flow tables, significantly reduce the computational load and intermediate result size in the streaming multi-table association process, improve the efficiency and targeting of association execution, and enhance the system's adaptability in scenarios with sparse attributes and dynamically changing data.

[0091] Among them, the attribute flow table subset refers to the set of attribute flow tables that participate in the association processing, which are selected from all attribute flow tables based on the non-empty attributes that a specific entity instance actually possesses.

[0092] S802: Generate an entity identifier stream from the entity-attribute mapping index table, driven by the entity identifier field.

[0093] Among them, the entity identifier stream is a sequence of entity identifiers that is continuously generated in a streaming manner with the entity identifier field as the core, and is used as the main driving data stream for multi-table join processing.

[0094] Specifically, the entity identifier stream is used as the main driving stream for streaming multi-table association.

[0095] S803: Based on the streaming multi-table association execution plan, driven by the entity identifier stream, the attribute stream tables in the attribute stream table subset are sequentially associated with each entity identifier stream to generate the initial streaming multi-table association result stream.

[0096] The initial streaming multi-table join result stream refers to the intermediate result data stream of multi-table joins that is generated step by step during the streaming process according to the predetermined join execution plan.

[0097] S804: Based on the filtering conditions in the SQL query, perform streaming filtering on the initial streaming multi-table join result stream to obtain a set of candidate entities that meet the filtering conditions.

[0098] S805: When the SQL query contains multiple filtering conditions, perform set operations on the candidate entity set according to the logical relationship between the filtering conditions to determine the final target entity set that meets the filtering conditions.

[0099] Specifically, when an SQL query contains multiple filtering conditions, the system first identifies the logical relationships between the filtering conditions (such as logical AND, logical OR, and logical NOT), and then filters the initial streaming multi-table join result stream based on each filtering condition to obtain the corresponding candidate entity set. Subsequently, based on the logical relationships between the filtering conditions, set operations such as set intersection, union, and difference are performed on the multiple candidate entity sets to comprehensively determine the target entity set that simultaneously satisfies all logical constraints.

[0100] Among them, the candidate entity set and the target entity set refer to the set of entity instances that meet the SQL query conditions after preliminary filtering and logical combination of screening conditions.

[0101] S806: Extract the corresponding associated records from the initial streaming multi-table join result stream.

[0102] Specifically, the system uses the target entity set as the filtering criterion to match and locate the initial streaming multi-table association result stream, extracting association records whose entity identifiers belong to the target entity set. In this way, multi-table association results that are only related to the target entity are retained, forming a valid set of association records for subsequent result generation.

[0103] S807: Based on the target entity set, combined with the projection field and aggregation information in the SQL query, perform projection and aggregation operations on the associated records to generate an intermediate result stream.

[0104] Specifically, based on the projection fields specified in the SQL query, the fields in the associated records are pruned, retaining only the data items required for the query results. When the SQL query contains aggregation operations, the associated records are further subjected to corresponding aggregation calculations according to the aggregation fields and grouping conditions, generating a continuous stream of intermediate results, thus providing a direct data foundation for the final real-time query results.

[0105] The intermediate result stream refers to the result data stream formed after filtering, projection, and optional aggregation operations are completed, and is used to finally output the query results.

[0106] S808: Outputs the intermediate result stream as the real-time processing result of the SQL query, completing the execution of the streaming multi-table join execution plan.

[0107] It should be noted that by dynamically determining the subset of attribute stream tables that each entity instance actually participates in the association based on the entity-attribute mapping index table, and using the entity identifier stream as the main driver for executing streaming multi-table associations, the system can precisely control the scope of associations at the entity level, avoiding the participation of irrelevant attributes in the calculation. Simultaneously, by combining the filtering conditions, projection fields, and aggregation requirements in the SQL query to process the association results in stages, the size of intermediate results can be effectively reduced while ensuring the semantic accuracy of the results. This achieves a highly efficient, scalable, and real-time streaming multi-table association execution process.

[0108] In one possible implementation, the following is included after S8: S9: Update the set of candidate relationships for multi-table associations based on the latest data distribution characteristics obtained during the execution process.

[0109] Among them, the latest data distribution characteristics refer to the dynamic statistical information such as field value distribution, matching frequency and association success rate obtained in real time or periodically during the execution of streaming multi-table association.

[0110] In one possible implementation, S9 specifically refers to: During the execution of the streaming multi-table join execution plan, the matching frequency, join success rate, and field value distribution of each join field pair in the actual join operation are statistically analyzed, and the multi-table join candidate relationship set is updated based on the statistical results.

[0111] Specifically, based on the statistical results, the association connectivity assessment values ​​for the corresponding fields are adjusted. Then, based on the updated association connectivity assessment values, the set of candidate multi-table association relationships is updated (by adding, deleting, or adjusting weights). This ensures that the set of candidate multi-table association relationships reflects the latest data distribution characteristics, making it suitable for subsequent multi-table association processing in SQL queries.

[0112] S10: Based on the update results, repeat steps S5-S8 until the SQL query ends.

[0113] The end of an SQL query refers to the completion or termination of the current SQL query in the stream processing system.

[0114] It should be noted that by dynamically updating the candidate set of multi-table joins based on the latest data distribution characteristics during the execution of streaming multi-table joins, the system can promptly perceive changes in field value distribution and join validity, and adjust subsequent join strategies accordingly. This adaptive mechanism helps avoid a decline in join effectiveness due to data distribution drift, making subsequent multi-table join executions more closely aligned with actual data characteristics, thereby continuously improving the accuracy, stability, and long-term performance of streaming SQL multi-table join processing.

[0115] In this embodiment of the invention, real-time streaming data is split into multiple attribute stream tables according to entity attribute dimensions. A set of candidate relationships for multi-table association is constructed by combining field semantic similarity, value distribution similarity, and field uniqueness constraints. Based on parsing the semantics of SQL queries, the attribute stream tables participating in the association and their associated field pairs are dynamically determined, thereby generating a streaming multi-table association execution plan equivalent to the query semantics. This achieves efficient streaming association processing driven by entity identifiers. As a result, it can adaptively match the multi-table association requirements under different query scenarios without pre-fixing the association key, effectively reducing invalid associations and the scale of intermediate results, and improving the execution efficiency, result accuracy, and overall system resource utilization of streaming SQL multi-table association.

[0116] Reference manual attached Figure 2 The diagram shows a structural schematic of a multi-table association processing system for SQL stream data provided in an embodiment of the present invention.

[0117] This invention provides a multi-table association processing system 20 for SQL stream data, including: a processor 201 and a memory 202; The memory 202 stores programs or instructions that can run on the processor 201. When the program or instructions are executed by the processor 201, they implement the steps of the above-described method for multi-table join processing of SQL stream data and achieve the same technical effect. To avoid repetition, this invention will not elaborate further.

[0118] It should be understood that the processor 201 in this embodiment of the invention may be a central processing unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.

[0119] It should also be understood that the memory 202 in the embodiments of the present invention can be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of random access memory are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct memory bus RAM (DR RAM).

[0120] The above embodiments can be implemented, in whole or in part, by software, hardware (such as circuits), firmware, or any other combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of the present invention are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that includes one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive.

[0121] It should be understood that, in various embodiments of the present invention, the order of the above-mentioned process numbers does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

[0122] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this invention.

[0123] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the devices, apparatuses, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0124] In the several embodiments provided by this invention, it should be understood that the disclosed devices, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between devices or units may be electrical, mechanical, or other forms.

[0125] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0126] In addition, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0127] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, essentially, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0128] This invention provides a readable storage medium comprising: storing a program or instructions on the readable storage medium, wherein when the program or instructions are executed by a processor, the program or instructions implement the steps of the above-described method for multi-table association processing of SQL stream data, and can achieve the same technical effect. To avoid repetition, this invention will not elaborate further.

[0129] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the embodiments of the present invention, and are not intended to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the protection scope of the present invention.

Claims

1. A method for processing multi-table joins in SQL stream data, characterized in that, include: S1: Receive real-time streaming data, wherein the real-time streaming data includes an entity identifier field and multiple attribute fields; S2: Based on the entity identifier field, the real-time stream data is split into multiple attribute stream tables according to the attribute dimension; S3: Construct an entity-attribute mapping index table based on each of the attribute flow tables; S4: Based on the field information and historical data distribution of each attribute flow table, determine the set of candidate multi-table association relationships between each attribute flow table; S5: Combine the SQL queries submitted by the user to determine the set of related requirements; S6: Using the set of associated requirements as constraints, and combining the set of candidate relationships for multi-table associations, determine the set of participating attribute flow tables; S7: Generate a streaming multi-table association execution plan that is semantically equivalent to an SQL query based on the participating attribute flow table set and the corresponding associated field pairs of the participating attribute flow table set; S8: Combine the entity-attribute mapping index table, execute the streaming multi-table association execution plan, and output the real-time processing result of the SQL query.

2. The method for processing multi-table joins in SQL stream data according to claim 1, characterized in that, S2 specifically includes: S201: Using the entity identifier field as an index, identify the set of attributes with non-empty values ​​in each real-time stream data record; S202: For each attribute field in the attribute set, the real-time stream data is split according to the attribute dimension to determine multiple attribute stream tables.

3. The method for processing multi-table joins in SQL stream data according to claim 1, characterized in that, S4 specifically includes: S401: Determine the field combinations between different attribute flow tables; S402: Based on the field combination, construct candidate pairs of associated fields between the various attribute flow tables; S403: Based on the similarity of field value distribution and field semantic similarity, calculate the field-level connectability score of the candidate pairs of related fields; S404: Based on the distribution characteristics of the field in the corresponding attribute flow table, the field-level connectability score is constrained and corrected; S405: Based on the field-level connectability score after constraint correction, determine the set of candidate multi-table association relationships between the attribute flow tables.

4. The method for processing multi-table joins in SQL stream data according to claim 1, characterized in that, The SQL query includes: target field, data source identifier, filter conditions, projection field, grouping information, aggregation information, and join semantics.

5. The method for processing multi-table joins in SQL stream data according to claim 1, characterized in that, S6 specifically includes: S601: Select multiple attribute flow tables related to the associated requirement set, and combine the selected multiple attribute flow tables to obtain an attribute flow table set; S602: Initialize the attribute flow table set; S603: Based on the coverage of the associated requirement set by the initialized attribute flow table set, calculate the marginal coverage gain of the candidate attribute flow table for the uncovered associated requirements; S604: Calculate the association connectivity gain between the candidate attribute flow table and the initialized attribute flow table set; S605: Construct a comprehensive utility function based on the marginal coverage gain and the associated connectivity gain; S606: Select the candidate attribute flow table that maximizes the comprehensive utility function value as the attribute flow table to be added; S607: Add the attribute flow table to be added to the initialized attribute flow table set to determine the set of participating attribute flow tables for association processing.

6. The method for processing multi-table joins in SQL stream data according to claim 1, characterized in that, Specifically, S7 includes: S701: Determine the association field pairs used for association processing between each of the attribute flow tables from the multi-table association candidate relationship set; S702: Based on the participating attribute flow table set and the associated field pair, construct an attribute flow table association graph, wherein the nodes in the attribute flow table association graph are the attribute flow tables, and the edges are the association relationships between the attribute flow tables; S703: Determine the execution order of multi-table associations between the attribute flow tables based on the attribute flow table association diagram; S704: Based on the execution order of the multi-table association, generate a corresponding streaming multi-table association logic execution plan, wherein the streaming multi-table association logic execution plan includes the association order and association conditions of the attribute flow tables; S705: Perform semantic fusion on the execution plan of the streaming multi-table join logic based on the filtering conditions and projection fields in the SQL query; S706: Convert the semantically fused streaming multi-table join logic execution plan into a streaming multi-table join execution plan that is semantically equivalent to the SQL query.

7. The method for processing multi-table joins in SQL stream data according to claim 1, characterized in that, S8 specifically includes: S801: Based on the entity-attribute mapping index table, determine the subset of attribute flow tables that participate in the association processing; S802: Using the entity identifier field as the driving force, generate an entity identifier stream from the entity-attribute mapping index table; S803: Based on the streaming multi-table association execution plan, driven by the entity identifier stream, sequentially perform streaming association processing on each attribute stream table in the attribute stream table subset with the entity identifier stream to generate an initial streaming multi-table association result stream; S804: Based on the filtering conditions in the SQL query, perform a streaming filtering operation on the initial streaming multi-table association result stream to obtain a set of candidate entities that meet the filtering conditions; S805: When the SQL query contains multiple filtering conditions, perform set operations on the candidate entity set according to the logical relationship between the filtering conditions to determine the final target entity set that satisfies the filtering conditions; S806: Extract the corresponding associated records from the initial streaming multi-table association result stream; S807: Based on the target entity set, and combining the projection field and aggregation information in the SQL query, perform projection and aggregation operations on the associated records to generate an intermediate result stream; S808: Output the intermediate result stream as the real-time processing result of the SQL query to complete the execution of the streaming multi-table join execution plan.

8. The method for processing multi-table joins in SQL stream data according to claim 1, characterized in that, Following S8, the following is also included: S9: Update the set of candidate relationships for multi-table associations based on the latest data distribution characteristics obtained during the execution process; S10: Based on the update result, repeat steps S5-S8 until the SQL query ends.

9. The method for processing multi-table joins in SQL stream data according to claim 8, characterized in that, Specifically, S9 is: During the execution of the streaming multi-table join execution plan, the matching frequency, join success rate, and field value distribution of each join field pair in the actual join operation are statistically analyzed, and the multi-table join candidate relationship set is updated based on the statistical results.

10. A multi-table join processing system for SQL stream data, characterized in that, include: Processor and memory; The memory stores programs or instructions that can run on the processor, which, when executed by the processor, implement the steps of the multi-table join processing method for SQL stream data as described in any one of claims 1 to 9.