Method, device and computer storage medium for optimizing intra connections
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- RADAR NEW ENERGY AUTOMOBILE (ZHEJIANG) CO LTD
- Filing Date
- 2023-06-02
- Publication Date
- 2026-06-26
Smart Images

Figure CN116842035B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of interconnect technology, and in particular to an optimization method, apparatus, device, and computer storage medium for interconnect. Background Technology
[0002] With the rapid development of SQL (Structured Query Language), users have increasingly higher requirements for inner joins. They want to meet the basic requirements of a normal inner join while ensuring the accuracy of the driver table selection. This places higher demands on the selection of the inner join driver table.
[0003] Traditional methods for selecting the driving table in inner joins involve random selection or using predicate statistical histograms, essentially estimating the driving table statistically. This approach has a significant drawback: random selection or using predicate statistical histograms may fail to identify the optimal driving table. In other words, this method results in low accuracy in driving table selection because random selection or using predicate statistical histograms may not yield the best choice. Summary of the Invention
[0004] The main objective of this application is to provide an optimization method, apparatus, device, and storage medium for inner joins, aiming to address the technical problem of improving the accuracy of driving table selection during inner joins.
[0005] To achieve the above objectives, this application provides a method for optimizing inner joins, the steps of which include:
[0006] In response to the received current query statement, a binary classification result is determined based on the current query statement and a preset target training model; wherein, the target training model is obtained by training a semantic model based on historical query statements.
[0007] Determine the target driving table corresponding to the binary classification result, and optimize the inner join based on the target driving table.
[0008] Optionally, the step of training the semantic model based on historical query statements to obtain the target training model includes:
[0009] Obtain multiple input historical query statements, parse the historical semantic feature information in each historical query statement in turn, and determine the historical word order information corresponding to the semantic feature information;
[0010] Historical feature vectors are determined based on the historical word order information and the historical semantic feature information. The historical feature vectors corresponding to each historical query statement are then aggregated to obtain a training sample dataset. A semantic model is then trained based on the training sample dataset to obtain the target training model.
[0011] Optionally, the step of obtaining historical query statements from multiple inputs includes:
[0012] In response to the input demand information, an initial historical query statement is obtained, the demand quantity of the demand information is determined, and it is detected whether the number of the initial historical query statements is greater than the demand quantity.
[0013] If the number of initial historical query statements is greater than the required number, then the initial historical query statements are sampled based on the preset sampling rules and the required number to obtain historical query statements;
[0014] If the number of initial historical query statements is less than or equal to the required number, then the initial historical query statements will be used as historical query statements.
[0015] Optionally, the step of determining the historical feature vector based on the historical word order information and the historical semantic feature information includes:
[0016] All historical features in the historical semantic feature information are determined, and each historical feature is sequentially input into a preset pre-trained language model based on the historical word order information for vector extraction to obtain an initial historical feature vector;
[0017] Determine the length of the vector in the initial historical feature vector, and detect whether the vector length matches a preset length threshold;
[0018] If the vector length does not match the preset length threshold, the initial historical feature vector is padded to obtain a historical feature vector; wherein, the padded process includes padding with zeros.
[0019] Optionally, the step of training the model based on the training sample dataset to obtain the target training model includes:
[0020] The training conditions of the pre-trained language model are determined, and the model is trained based on the training conditions and the training sample dataset to obtain the target training model; wherein, the pre-trained language model includes the benchmark BERT model with access to the binary classification network layer.
[0021] Optionally, the step of determining the binary classification result based on the current query statement and the target training model includes:
[0022] Parse the current semantic feature information in the current query statement and determine the current word order information corresponding to the current semantic feature information;
[0023] All current features in the current semantic feature information are determined, and each current feature is sequentially input into the target training model based on the current word order information to extract vectors, thereby obtaining the current feature vector;
[0024] The binary classification result is determined based on the current feature vector and the target training model.
[0025] Optionally, the step of determining the binary classification result based on the current feature vector and the target training model includes:
[0026] The current feature vector is used as the input information of the binary classification layer of the target training model, and the output information of the target training model under the input information is determined.
[0027] Determine the integer value corresponding to the output information, and use the integer value as the binary classification result.
[0028] Furthermore, to achieve the above objectives, the present invention also provides an internal connectivity optimization device, the internal connectivity optimization device comprising:
[0029] An online inference module is used to respond to the received current query statement and determine the binary classification result based on the current query statement and a preset target training model; wherein, the target training model is obtained by training a semantic model based on historical query statements.
[0030] An optimization determination module is used to determine the target driving table corresponding to the binary classification result, so as to optimize the inner join based on the target driving table.
[0031] This application also provides an optimization device for in-connection, the optimization device for in-connection comprising: a memory, a processor, and a program of the optimization method for in-connection stored in the memory and executable on the processor, wherein when the program of the optimization method for in-connection is executed by the processor, it can implement the steps of the optimization method for in-connection as described above.
[0032] This application also provides a computer storage medium storing a program for implementing an optimization method of internal linking, the program for implementing the optimization method of internal linking being executed by a processor to implement the steps of the optimization method of internal linking as described above.
[0033] The technical solution of this application responds to the received current query statement and determines the binary classification result based on the current query statement and a preset target training model. Specifically, the target training model is obtained by training a semantic model based on historical query statements. The target driving table corresponding to the binary classification result is determined, and inner join optimization is performed based on the target driving table. By obtaining the target training model based on semantic training, and determining the binary classification result based on the target training model and the current query statement, inner join optimization is finally performed based on the target driving table determined by the binary classification result. This ensures that the selection of the entire driving table is based on the binary classification result determined by the target training model obtained through semantic training. This avoids the phenomenon that random selection or using predicate statistical histograms to select the driving table fails to select the optimal driving table. The inner join optimization method of this application can determine the binary classification result through the target training model and the current query statement, and finally determine the target driving table based on the binary classification result for inner join, thereby improving the accuracy of driving table selection during inner join. Attached Figure Description
[0034] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.
[0035] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0036] Figure 1 This is a schematic diagram of the optimized device structure for the internal connections of the hardware operating environment involved in the embodiments of the present invention;
[0037] Figure 2 This is a flowchart illustrating the first embodiment of the optimization method for internal linking in this application;
[0038] Figure 3 This is a schematic diagram of the optimization device module connected within this application;
[0039] Figure 4 This is a schematic diagram of the offline training process in the optimization of internal connections in this application;
[0040] Figure 5 This is a flowchart illustrating the online inference process in the optimization of internal connections in this application;
[0041] Figure 6 A schematic diagram of the Bert model;
[0042] Figure 7 This is a schematic diagram of the adjusted Bert model.
[0043] The realization of the objective, functional features and advantages of the present invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings. Detailed Implementation
[0044] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
[0045] Reference Figure 1 , Figure 1 This is a schematic diagram of the optimized device structure for the internal connections of the hardware operating environment involved in the embodiments of the present invention.
[0046] like Figure 1 As shown, the optimized device for this internal connection may include: a processor 0003, such as a central processing unit (CPU), a communication bus 0001, an acquisition interface 0002, a processing interface 0004, and a memory 0005. The communication bus 0001 is used to implement communication between these components. The acquisition interface 0002 may include an information acquisition device or acquisition unit, such as a computer; optionally, the acquisition interface 0002 may also include a standard wired interface or a wireless interface. The processing interface 0004 may optionally include a standard wired interface or a wireless interface. The memory 0005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk storage device. Optionally, the memory 0005 may also be a storage device independent of the aforementioned processor 0003.
[0047] Those skilled in the art will understand that Figure 1 The structure shown does not constitute a limitation on the optimized device for internal connections, and may include more or fewer components than shown, or combine certain components, or have different component arrangements.
[0048] like Figure 1 As shown, the memory 0005, which serves as a storage medium, may include an operating device, an acquisition interface module, an execution interface module, and an internally linked optimization program.
[0049] exist Figure 1In the internal connection optimization device shown, the communication bus 0001 is mainly used to realize the connection communication between components; the acquisition interface 0002 is mainly used to connect to the backend server and communicate data with the backend server; the processing interface 0004 is mainly used to connect to the deployment end (user end) and communicate data with the deployment end; the processor 0003 and the memory 0005 in the internal connection optimization device of the present invention can be set in the internal connection optimization device. The internal connection optimization device calls the internal connection optimization program stored in the memory 0005 through the processor 0003 and executes the internal connection optimization method provided in the embodiment of the present invention.
[0050] To ensure clarity and conciseness in the description of the following embodiments, a brief introduction to the implementation of an optimization method for inner joins is first given:
[0051] SQL is a widely used language in programming and data analysis, and inner join is a common join method. During the calculation of an inner join, the SQL engine needs to select a driving table from the two tables to be joined. Choosing the smaller table reduces computational load. However, inner joins often involve filtering conditions, and the size of the table with filtering conditions is often difficult to predict, only becoming known after execution. The engine struggles to obtain this information during the planning phase. Therefore, it is crucial to utilize more information to select the smaller table as the driving table during the planning phase. Common inner join driving table selection focuses on "selectivity" prediction, employing machine learning regression models. Features include table name, predicates, operators, and row count, used for selectivity prediction in mergejoin (a join method). Traditional SQL engine CBO (Cost-Based Optimization) methods primarily rely on statistics and sampling to derive histograms. RBO (Rule-Based Optimization) methods use predefined rules such as predicate pushdown. These methods of selecting the driving table do not utilize the semantic information already present in the database tables. Based on these issues, this application proposes using a language model to extract semantic and sequential information from the statements, performing binary classification on the two tables in an inner join. Features include table names, predicates, operators, and their comment information, without requiring row counts. It places greater emphasis on semantic understanding of the statements themselves, using the BERT language model. Furthermore, the driving table is determined based on semantics and sequential information within the statements, ensuring high accuracy in determining the driving table.
[0052] This application proposes an inner join optimization method. It responds to the received current query statement and determines a binary classification result based on the current query statement and a preset target training model. Specifically, the target training model is obtained by training a semantic model based on historical query statements. A target driving table corresponding to the binary classification result is determined, and inner join optimization is performed based on this target driving table. By obtaining the target training model based on semantic training, determining the binary classification result based on the target training model and the current query statement, and finally optimizing the inner join based on the target driving table determined by the binary classification result, the selection of the entire driving table is based on the binary classification result determined by the target training model obtained through semantic training. This avoids the phenomenon that random selection or using predicate statistical histograms to select the driving table fails to select the optimal driving table. The inner join optimization method of this application can determine the binary classification result through the target training model and the current query statement, and finally determine the target driving table based on the binary classification result for inner join, thereby improving the accuracy of driving table selection during inner join.
[0053] Based on the above hardware structure, an embodiment of the optimization method for internal connections of the present invention is proposed.
[0054] This invention provides an optimization method for inner connections, referring to... Figure 2 , Figure 2 This is a flowchart illustrating the first embodiment of an optimization method for inner connections according to the present invention.
[0055] In this embodiment, the optimization method for inner connections includes:
[0056] Step S10: In response to the received current query statement, determine the binary classification result based on the current query statement and the preset target training model; wherein, the target training model is obtained by training a semantic model based on historical query statements.
[0057] In this embodiment, the implementation process is divided into two parts: offline training and online inference. The required model is obtained through offline training, and then used to predict the driving table during online inference. During offline training, multiple historical query statements are acquired, and a training sample dataset is determined based on these statements. A semantic model is then trained on this dataset to obtain the target training model. The historical query statements refer to the program statements used in the inner join, and the number can be determined according to user needs, as the number of historical query statements directly affects the overall offline training duration. Users can select the appropriate number of historical query statements based on their actual situation or acceptable training time. The training sample dataset is the dataset determined based on multiple historical query statements, containing feature vectors corresponding to these statements. The target training model is the model trained using these vectors. After obtaining the required training model offline, when the current query is received, a binary classification result is determined based on the current query and the target training model. The current query statement refers to the program statements used in the inner join during online training, and the binary classification result refers to the two numerical values output by the model after processing the current query statement. This allows us to directly output binary classification results for the table that meets the requirements (the shorter of the two tables), thus ensuring the intelligence of the selection.
[0058] Step S20: Determine the target driving table corresponding to the binary classification result, and optimize the inner join based on the target driving table.
[0059] In this embodiment, a target determination table corresponding to the binary classification result is determined, and inner join optimization is implemented based on the target-driven table. The target-driven table refers to the shorter table selected during inner join based on the above method. The model is trained offline, and the model is derived from extracting SQL and annotations from the table. During online inference, the offline-trained model is used for prediction (the model input features are derived from extracting SQL and annotations from the table), ultimately obtaining the target-driven table. This ensures the accuracy of determining the target-driven table. Furthermore, accurately selecting a smaller table as the driving table ensures the accuracy of the entire inner join and the efficiency of the processing. The inner join optimization method is specifically for inner joins; other join types cannot be covered in this application. Existing attribute information in the database, such as "table name, table name comments, table field names, field name comments, and filter condition expressions," is used as feature input and fed into a pre-trained language model. Word vectors are extracted to express relevant semantics. A supervised dataset is constructed through historical logs or sampling. That is, the driving table is predicted using the offline-trained model during online inference. The pre-trained language model uses an adjusted BERT model (details are in the appendix). Figure 7The core of this application lies in utilizing the semantics existing in the database, eliminating the need for full sampling and demonstrating a certain degree of transfer and generalization capability. It solves the binary classification problem of choosing the left or right table as the driving table in the planning phase of inner join queries in database engines. This avoids the shortcomings of existing technologies that rely heavily on general machine learning models to determine the driving table through non-semantic engineering and statistical features, and whose optimization methods often fail to extract the semantics contained in the SQL and comments. In other words, this embodiment uses semantic extraction as an aid to driving table prediction, exhibiting novelty in feature selection. Besides engineering features, it uses the order of words in the SQL statement as a feature and enriches the semantic features with information such as comments. By using a BERT pre-trained language model and introducing semantic information, it specifically optimizes and adapts for inner joins. This ensures targeted optimization when implementing inner joins.
[0060] It is worth noting that this embodiment also provides a schematic diagram of the Bert model, see below. Figure 6 When input features are input, they are combined with positional encoding to obtain input feature vectors, i.e., each feature vector in each query statement is obtained. In the BERT model, semantic vectors are output through multi-attention stacking and a semantic output layer. Multi-attention stacking includes at least multi-head attention processing, addition and normalization, and Hull addition and normalization in a feedforward network layer. The semantic output layer includes at least a hyperbolic tangent activation function and a linear layer. The connection between the input features and positional encoding represents joint or additive processing. The entire model here follows the same processing flow as the commonly used BERT model and is not limited here. Based on the above BERT model, this application proposes an adjusted BERT model. In this embodiment, refer to... Figure 7 , Figure 7 This is a schematic diagram of the adjusted BERT model. The difference between the adjusted BERT model and the original BERT model is that the intermediate processing part of the BERT model is frozen to ensure that it is used as a training condition for semantic and word order training. After obtaining the output semantic vector, concatenation and zero-padding operations are performed. Then, a trainable classifier is used to perform binary classification based on sigmoid (the non-linear action function of neurons) and linear layers to obtain the output classification, thereby determining the selection result of the driving table. In other words, the driving table is obtained by binary classification of the concatenated and zero-padding semantic vector by a trainable classifier. This ensures that the selection of the driving table can be based on semantics and word order, providing a selection basis for the existing selection and improving the accuracy of inner connections by accurately selecting the driving table.
[0061] Furthermore, this embodiment also provides a flowchart illustrating offline training in the optimization of inner connections, referring to... Figure 5 In this embodiment, offline training mainly serves to determine the training model M2 for online inference (refer to...). Figure 7 The adjusted BERT model (as described in the text) obtains historical or sampled query statements, then converts the query statements into feature vectors. For each feature vector as a sample, each sample is processed by M1 (see [reference needed]). Figure 6 The process involves using the BERT model (as described in the example) and concatenating and padding with zeros to convert each feature in the query statement into a vector. This vector is then concatenated, and missing elements are padded with zeros. Other operations are also possible and are not limited here. When acquiring query data, all historical query data or a sample of historical query data can be selected based on actual needs. The conversion process checks whether all query statements have been completely converted into feature vectors. If not, the conversion continues; otherwise, a training sample dataset is constructed based on all converted feature vectors. This dataset is then used to train the model M2, which is saved for online inference, ensuring the accuracy of model training based on semantics and word order. (See reference...) Figure 4 , Figure 4 This is a flowchart illustrating the online inference process in inner join optimization. After obtaining the trained model M2, upon receiving the current query statement, the query statement is converted into features F by the SQL parser. It then determines whether model M2 has been loaded. If not, model M2 is loaded; otherwise, features F are input into model M2 to predict the category Y, and the category Y is output. In other words, prediction is performed based on the trained model to determine the category (the fitting target mentioned earlier), and the driving table selection can be determined based on the category. By training the model based on semantics and word order, the accuracy of the driving table selection can be guaranteed.
[0062] This embodiment responds to the received current query statement and determines the binary classification result based on the current query statement and a preset target training model. The target training model is obtained by training a semantic model based on historical query statements. A target driving table corresponding to the binary classification result is determined, and inner join optimization is performed based on this target driving table. By obtaining the target training model based on semantic training, determining the binary classification result based on the target training model and the current query statement, and finally optimizing the inner join based on the target driving table determined by the binary classification result, the selection of the entire driving table is based on the binary classification result determined by the target training model obtained through semantic training. This avoids the phenomenon that random selection or using predicate statistical histograms to select the driving table fails to select the optimal driving table. The inner join optimization method of this application can determine the binary classification result through the target training model and the current query statement, and finally determine the target driving table based on the binary classification result for inner join, thereby improving the accuracy of driving table selection during inner join.
[0063] Furthermore, based on the first embodiment of the inner join optimization method of the present invention, a second embodiment of the inner join optimization method of the present invention is proposed, the step of obtaining multiple input historical query statements includes:
[0064] Step B10: In response to the input demand information, obtain the initial historical query statement, determine the demand quantity of the demand information, and detect whether the number of the initial historical query statements is greater than the demand quantity;
[0065] Step B20: If the number of initial historical query statements is greater than the required number, then the initial historical query statements are sampled based on the preset sampling rules and the required number to obtain historical query statements.
[0066] Step B30: If the number of initial historical query statements is less than or equal to the required number, then the initial historical query statements are used as historical query statements.
[0067] In this embodiment, during the optimization process of the entire inner connection, the model is trained based on historical query statements during offline training. Therefore, when demand information is input, the initial historical query statements are obtained in response to the input demand information. By determining the number of demands corresponding to the demand information, it is checked whether the number of initial historical query statements is greater than the demand quantity. If it is greater, the initial historical query statements are sampled based on preset sampling rules and the demand quantity to obtain historical query statements; otherwise, the initial historical query statements are used as historical query statements. In other words, because the number of query statements is related to the training time and accuracy of the model, the number of query statements to be used can be determined according to the user's needs. If no input demand information is received, the default demand quantity corresponding to the default demand information is determined for subsequent operations. Here, the default demand information refers to the information on the default number of demand query data, the demand information refers to the information on the number of data to be queried by the actual user input, the initial historical query statement refers to the query statement that was not initially determined as a historical query statement, the demand quantity refers to the value required by the query statement contained in the demand information, and the sampling rule refers to the rule for determining the initial historical query statement as a historical query statement, such as selecting the initial historical query statement for a fixed time period or selecting the initial historical query statement of a fixed length. The entire offline training process is based on the triggering process during offline training. By selecting historical query data during offline training, the accuracy and efficiency of the entire offline model training can be guaranteed, and the model basis can also be provided for subsequent online inference.
[0068] Furthermore, based on the first and second embodiments of the optimization method for inner joins of the present invention, a third embodiment of the optimization method for inner joins of the present invention is proposed, the optimization method for inner joins including:
[0069] Furthermore, the step of determining the training sample dataset corresponding to the historical query statements includes:
[0070] Step a: Obtain multiple input historical query statements, and sequentially parse the historical semantic feature information in each historical query statement, and determine the historical word order information corresponding to the semantic feature information;
[0071] Step b: Determine the historical feature vector based on the historical word order information and the historical semantic feature information, summarize the historical feature vectors corresponding to each historical query statement to obtain the training sample dataset, and train the semantic model based on the training sample dataset to obtain the target training model.
[0072] In this embodiment, by acquiring multiple input historical query statements and training a semantic model, the main difference from common techniques in determining training sample data based on historical query statements is that historical semantic feature information is determined sequentially for multiple historical query statements, and the corresponding historical word order information is determined. Historical feature vectors are determined through historical word order information and historical semantic feature information. Finally, the historical feature vectors corresponding to multiple historical query statements are summarized to obtain the training sample dataset. Ultimately, the semantic model is trained based on the training sample dataset to obtain the target training model. That is, by determining the historical feature vector for each historical query statement, the model is trained using the summarized training sample dataset. Here, historical semantic feature information refers to semantic information such as table names, table name comments, table field names, field name comments, and filter condition expressions in the historical query statements. Historical word order information refers to the word order corresponding to the semantic information; for example, table attributes refer to the word order of attributes, and attribute tables refer to the word order of tables. The historical feature vector is a vector based on the features determined above. The following example illustrates the training sample dataset, assuming there are two tables of data to be selected as the driving table:
[0073] The information of one of the item_category tables is shown in Table 1 and Table 2 below. Item_category refers to the table name. Table 1 is the semantic information of the example table, and Table 2 refers to the data in the table. If the semantic information of Table 1 is not sampled for processing, it will be randomly selected or a predicate statistical histogram will be used.
[0074]
[0075] The data in Table 1 (item_category) is shown in Table 2:
[0076] id category 1 a 2 a 3 b 4 c 5 c
[0077] Table 2
[0078] The information for another item_size table is shown in Tables 3 and 4 below. Item_size refers to the table name, Table 3 is the semantic information of the example table, and Table 4 refers to the data in the table.
[0079]
[0080] The data in Table 3 (item_size) is shown in Table 4.
[0081] id size 1 10 2 20 3 10 4 20
[0082] Table 4
[0083] Suppose that the historical query statement contains a query statement Q1: SELECT A.id, A.category, B.size(SELECT id, categoryFORM item_categoryWHERE category=“a”)AINNERJOIN(SELECT id, sizeFORM item_sizeWHERE size>0)BONA.id=B.id; that is, the inner join method of the above two tables is the join method of id and category in item_category, and category=“a”, and id and size in item_size, and size>0. Based on the above Tables 1-4, the SQL syntax parser can obtain the table name, field name, related comments, and filter expressions contained in Q1 as features. The order of each feature is taken according to the natural order of each feature in the statement. The comments are inserted after the name, such as "table name comment" is inserted after "table name". The feature values F1-F10 (F1, F2...F10 are historical semantic feature information, F1-F10 are the corresponding historical word order information) are shown in Table 5 below:
[0084]
[0085]
[0086] Table 5
[0087] The values of features F1 to F10 are sequentially fed into the pre-trained language model M1 (a Bert model, or an adjusted Bert model) for vector extraction. The transformation logic is V = M1(F), resulting in feature vectors V1 to V10 as shown in Table 6.
[0088] feature Feature values Feature vector Values of the characteristic vector F1 item_category V1 [0.1,0.2] F2 Warehouse Item Category List V2 [0.2,0.3] F3 Item number, primary key index V3 [0.3,0.4] F4 Category name of the item V4 [0.5,0.6] F5 category="a" V5 [0.6,0.7] F6 item_size V6 [0.7,0.8] F7 Warehouse Item Dimensions V7 [0.8,0.9] F8 Item number, primary key index V8 [0.9,1.0] F9 Item size V9 [1.0,1.1] F10 size>0 V10 [1.1,1.2]
[0089] Table 6
[0090] Finally, the vectors V1 to V10 corresponding to the features in the historical query statements are concatenated end to end to form a feature vector V_CONCAT. The concatenation order uses the original order of the words extracted from the statement, i.e., the historical word order information. Combined with the position encoding in the pre-trained language model (BERT model, or an adjusted BERT model), we get V_CONCAT = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2]. This gives us the historical feature vector corresponding to a single historical query data. Based on the above method, we determine the historical feature vectors corresponding to multiple historical feature query data and summarize them as the training sample dataset. In this example, the vector corresponding to the feature in a single historical query statement is a vector with a length of 0.1-1.1 and containing only two elements. In reality, the vector corresponding to a single feature may contain more different elements. This is just an example and does not limit the vector or the number of elements in the vector.
[0091] In this embodiment, multiple input historical query statements are acquired, and the historical semantic feature information in each historical query statement is parsed sequentially. The historical word order information corresponding to the semantic feature information is determined, and historical feature vectors are determined based on the historical word order information and the historical semantic feature information. The historical feature vectors corresponding to each historical query statement are then aggregated to obtain a training sample dataset. The feature vectors of the historical query statements are determined based on the historical semantic feature information and the historical word order information as the training sample dataset. The semantic model is trained based on the training sample dataset to obtain the target training model. This ensures the accuracy of model training, and using historical semantic feature information and historical word order information as feature criteria ensures the accuracy of subsequent driving table selection, further guaranteeing the accuracy of inner joins.
[0092] Furthermore, the step of determining the historical feature vector based on the historical word order information and the historical semantic feature information includes:
[0093] Step c: Determine all historical features in the historical semantic feature information, and based on the historical word order information, input each historical feature into a preset pre-trained language model for vector extraction to obtain an initial historical feature vector;
[0094] Step d: Determine the vector length in the initial historical feature vector and detect whether the vector length matches a preset length threshold;
[0095] Step e: If the vector length does not match the preset length threshold, the initial historical feature vector is padded to obtain a historical feature vector; wherein, the padded process includes padding with zeros.
[0096] In this embodiment, when determining the historical feature vector, all historical features in the historical semantic feature information are identified, and then the historical features are sequentially input into a preset pre-trained language model for vector extraction and sequential concatenation to obtain the initial historical feature vector. The preset pre-trained language model includes an adjusted BERT model. The length of the vector in the initial historical feature vector is determined, and then it is checked whether the vector length matches a preset length threshold. If they do not match, the initial historical feature vector is padded with zeros to obtain the final historical feature vector. Matching here refers to cases where the length is less than or equal to the preset length threshold. If the length exceeds the preset length threshold, the feature vector will not be included in the training sample dataset. Historical features refer to various semantic features in the historical semantic feature information, such as F1-F10 in Table 6. The initial historical feature vector is the vector obtained by concatenating only the vectors corresponding to the historical features. The vector length refers to the number of elements within the initial historical feature vector. The preset length threshold refers to the number of elements within the user-defined feature vector. Padding with zeros ensures that the initial historical feature vector reaches the required length. Since the input length of conventional pre-trained models is generally quite long, for example, in this embodiment, a length of 1024 (context length 512 multiplied by embedding dimension 2) is used. Referring to the above-mentioned initial historical feature vector V_CONCAT, which has a length of 20, less than 1024 bits, 1004 zeros are added to the end of V_CONCAT to make the length reach 1024 bits. The resulting 1024-bit historical feature vector is as follows:
[0097] V=[0.1,0.2,0.2,0.3,0.3,0.4,0.4,0.5,0.5,0.6,0.6,0.7,0.7,0.8,0.8,0.9,0.9,1.0,1.0,1.1,1.1,1.2,0,...,0]
[0098] It is worth noting that the length of the historical query statement Q1 cannot be infinite, and the features extracted by Q1 cannot exceed 512 (since there are two tables, the length of a single table cannot exceed 512), in order to match the length of the model context. Other operations can be used to pad the length or set a shorter number of bits; the examples here are not limited to the scheme of this application.
[0099] In this embodiment, by determining all historical features in the historical semantic feature information, and based on the historical word order information, each historical feature is sequentially input into a preset pre-trained language model for vector extraction to obtain an initial historical feature vector. The length of the initial historical feature vector is determined, and it is detected whether the vector length matches a preset length threshold. If the vector length does not match the preset length threshold, the initial historical feature vector is padded to obtain a historical feature vector. The padding process includes padding with zeros. By padding each feature vector, the uniqueness of the feature vector can be guaranteed, ensuring that each feature vector is used as the training basis for the training sample dataset at a fixed length, thus ensuring the accuracy of subsequent training.
[0100] Furthermore, the step of training the model based on the training sample dataset to obtain the target training model includes:
[0101] Step f: Determine the training conditions of the pre-trained language model, and train the model based on the training conditions and the training sample dataset to obtain the target training model; wherein, the pre-trained language model includes the baseline BERT model with access to the binary classification network layer.
[0102] In this embodiment, the sample construction steps (generating corresponding feature vectors) are repeatedly executed on multiple historical query statements (or sampled historical query statements) to ultimately form a training sample dataset. The adjusted BERT model is then trained on this training sample dataset to obtain the target training model. The adjusted BERT model includes a BERT model with a binary classification network layer; that is, the adjusted BERT model is the original BERT model with a binary classification network layer added. Training conditions refer to the control conditions that govern the BERT model. For example, training conditions could involve freezing the parameters of the adjusted BERT model while not freezing the weights of the binary classification network layer, and setting them to be trainable. By inputting the training sample dataset into the model using these training conditions, the target training model is obtained. The target training model is the adjusted BERT model after training. For example, assuming Q1 is a historical query statement, and the historical records have already executed this query Q1, it is found that the left table is smaller and more suitable as the execution table. Using the merged V as the input feature, the fitted target Y is a binary classification of "0, 1", where 0 represents whether the left table is the driving table and 1 represents whether the right table is the driving table. Following the example above, a sample binary pair (V, left table is the driving table) will be constructed, as shown in the following formula.
[0103] V = [0.1, 0.2, 0.2, 0.3, 0.3, 0.4, 0.4, 0.5, 0.5, 0.6, 0.6, 0.7, 0.7, 0.8, 0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.1, 1.2, 0, ..., 0]. When the model generates the above feature vectors during subsequent processing, the left table will be directly selected as the target driving table.
[0104] In this embodiment, by determining the training conditions corresponding to the adjusted BERT model, the target training model is obtained by training the model based on the training conditions and the training sample dataset. The adjusted BERT model includes a BERT model connected to a binary classification network layer. By training the model based on the adjusted BERT model and the training sample dataset, the accuracy of determining the target training model can be guaranteed. Furthermore, the feature vector obtained based on historical word order information and historical semantic feature information can guarantee the accuracy in the specific environment of inner connections.
[0105] Furthermore, based on the first, second, and third embodiments of the optimization method for inner joins of the present invention, a fourth embodiment of the optimization method for inner joins of the present invention is proposed, the optimization method for inner joins including:
[0106] Furthermore, the step of determining the binary classification result based on the current query statement and the target training model includes:
[0107] Step g: parse the current semantic feature information in the current query statement and determine the current word order information corresponding to the current semantic feature information;
[0108] Step h: Determine all current features in the current semantic feature information, and input each current feature into the target training model sequentially based on the current word order information to extract vectors, thereby obtaining the current feature vector;
[0109] Step i: Determine the binary classification result based on the current feature vector and the target training model.
[0110] In this embodiment, after determining the target training model, the prediction of the driving table is performed in response to the current query statement. By parsing the current semantic feature information in the current query statement and determining the current word order information corresponding to the current semantic feature information, all current features in the current semantic feature information are finally determined. These current features are then sequentially input into the target training model based on the current word order information for vector extraction, resulting in the current feature vector. Since the target training model is a BERT model connected to a binary classification network layer, the binary classification result is determined based on the current feature vector and the target training model. Here, the current semantic feature information refers to the semantic information in the current query statement, such as the table name, table name comments, table field names, field name comments, and filter condition expressions. The current word order information refers to the word order corresponding to the semantic information. The current feature vector is the vector obtained by concatenating and padding zeros with the vectors corresponding to all current features of the current semantic feature information according to word order. The current feature refers to the semantically related information in the current semantic feature information, and the binary classification result refers to the classification result of selecting two tables to determine the driving table. By processing the current query statement in the target training model, a binary classification result corresponding to the current query statement is obtained. Then, the driving table can be determined based on semantics and word order, ensuring the accuracy of the driving table determination. For example, assuming Q1 is the SQL to be executed, the driving table for Q1 needs to be estimated. This is done by transforming the current query statement Q1 into V within the target training model M2 (M1 refers to the commonly used BERT model) and inputting it into the binary classification layer, resulting in Y = M2(V). Y is set to 1 if greater than 0.5 and 0 if less than or equal to 0.5. Based on the value of Y, the left or right table is selected as the driving table. Therefore, determining the driving table based on novel feature semantics and word order ensures the accuracy of the driving table determination.
[0111] In this embodiment, by parsing the current semantic feature information in the current query statement and determining the current word order information corresponding to the current semantic feature information, all current features in the current semantic feature information are determined. Based on the current word order information, each current feature is sequentially input into the target training model for vector extraction to obtain the current feature vector. The binary classification result is determined based on the current feature vector and the target training model. By determining the binary classification result through the current semantic feature information and the current word order information in the current query statement, the accuracy of selecting the driving table in the executed SQL can be guaranteed, and the accuracy of the inner join can be further guaranteed.
[0112] Furthermore, the step of determining the binary classification result based on the current feature vector and the target training model includes:
[0113] Step j: Use the current feature vector as the input information of the binary classification layer of the target training model, and determine the output information of the target training model under the input information;
[0114] Step k: Determine the integer value corresponding to the output information, and use the integer value as the binary classification result.
[0115] In this embodiment, since online inference uses the target training model obtained offline, the current feature vector is used as the input information for the binary classification layer of the target training model. The output information of the target training model under the input information is then determined. The input information refers to the current feature vector obtained earlier, and the output information refers to the information output by the model based on the input information. The integer value corresponding to the obtained output information is taken as the binary classification result. The integer value can be 0 or 1, or other integers, which can be customized according to user needs. The model can be defined so that the output integer is 0 when the left table is selected and 1 when the right table is selected. Using 0 or 1 as the binary classification result determines the driving table to be selected. Adding a binary classification network to the commonly used BERT model processes the feature vector to obtain the classification result of selecting one of the two tables. The driving table is then selected based on the classification result, ensuring that the driving table is selected based on the semantics and word order of the tables, thus guaranteeing the accuracy of the selection.
[0116] In this embodiment, by using the current feature vector as input information for the binary classification layer of the target training model, and determining the output information of the target training model under the input information, determining the integer value corresponding to the output information, and using the integer value as the binary classification result, the driving table is selected by determining the binary classification result, thereby ensuring the accuracy of the driving table selection in the entire inner connection scenario and further improving the accuracy of the inner connection.
[0117] The present invention also provides an optimization device for internal connections, with reference to Figure 3 The optimization device for the internal connection includes:
[0118] The online inference module A01 is used to respond to the received current query statement and determine the binary classification result based on the current query statement and the preset target training model; wherein, the target training model is obtained by training a semantic model based on historical query statements.
[0119] The optimization determination module A02 is used to determine the target driving table corresponding to the binary classification result, so as to optimize the inner join based on the target driving table.
[0120] Optionally, the offline training unit in the online inference module A01 is further used for:
[0121] Obtain multiple input historical query statements, parse the historical semantic feature information in each historical query statement in turn, and determine the historical word order information corresponding to the semantic feature information;
[0122] Historical feature vectors are determined based on the historical word order information and the historical semantic feature information. The historical feature vectors corresponding to each historical query statement are then aggregated to obtain a training sample dataset. A semantic model is then trained based on the training sample dataset to obtain the target training model.
[0123] Optionally, the offline training unit in the online inference module A01 is further used for:
[0124] In response to the input demand information, an initial historical query statement is obtained, the demand quantity of the demand information is determined, and it is detected whether the number of the initial historical query statements is greater than the demand quantity.
[0125] If the number of initial historical query statements is greater than the required number, then the initial historical query statements are sampled based on the preset sampling rules and the required number to obtain historical query statements;
[0126] If the number of initial historical query statements is less than or equal to the required number, then the initial historical query statements will be used as historical query statements.
[0127] Optionally, the offline training unit in the online inference module A01 is further used for:
[0128] All historical features in the historical semantic feature information are determined, and each historical feature is sequentially input into a preset pre-trained language model based on the historical word order information for vector extraction to obtain an initial historical feature vector;
[0129] Determine the length of the vector in the initial historical feature vector, and detect whether the vector length matches a preset length threshold;
[0130] If the vector length does not match the preset length threshold, the initial historical feature vector is padded to obtain a historical feature vector; wherein, the padded process includes padding with zeros.
[0131] Optionally, the offline training unit in the online inference module A01 is further used for:
[0132] The training conditions of the pre-trained language model are determined, and the model is trained based on the training conditions and the training sample dataset to obtain the target training model; wherein, the pre-trained language model includes the benchmark BERT model with access to the binary classification network layer.
[0133] Optionally, the online inference module A01 is further configured to:
[0134] Parse the current semantic feature information in the current query statement and determine the current word order information corresponding to the current semantic feature information;
[0135] All current features in the current semantic feature information are determined, and each current feature is sequentially input into the target training model based on the current word order information to extract vectors, thereby obtaining the current feature vector;
[0136] The binary classification result is determined based on the current feature vector and the target training model.
[0137] Optionally, the online inference module A01 is further configured to:
[0138] The current feature vector is used as the input information of the binary classification layer of the target training model, and the output information of the target training model under the input information is determined.
[0139] Determine the integer value corresponding to the output information, and use the integer value as the binary classification result.
[0140] The methods executed by the above-mentioned program modules can be referred to in the various embodiments of the optimization method of internal linkage of the present invention, and will not be repeated here.
[0141] The present invention also provides an optimized device for internal connections.
[0142] The device of the present invention includes: a memory, a processor, and an in-link optimization program stored in the memory and executable on the processor, wherein the in-link optimization program, when executed by the processor, implements the steps of the in-link optimization method as described above.
[0143] The present invention also provides a storage medium.
[0144] The present invention stores an optimization program for internal linking on a storage medium, which, when executed by a processor, implements the steps of the internal linking optimization method as described above.
[0145] The method implemented when the internal linking optimization program running on the processor is executed can be referred to in various embodiments of the internal linking optimization method of the present invention, and will not be repeated here.
[0146] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0147] The sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0148] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) as described above, and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.
[0149] The above are merely preferred embodiments of the present invention and do not limit the scope of the patent. Any equivalent structural or procedural transformations made based on the description and drawings of the present invention, or direct or indirect applications in other related technical fields, are similarly included within the scope of patent protection of the present invention.
Claims
1. A method for optimizing inner connections, characterized in that, The optimization method for inner connections includes: In response to the received current query statement, a binary classification result is determined based on the current query statement and a preset target training model; wherein, the target training model is obtained by training a semantic model based on historical query statements, and the step of obtaining the target training model by training a semantic model based on historical query statements includes: Multiple input historical query statements are obtained, and the historical semantic feature information in each historical query statement is parsed sequentially to determine the historical word order information corresponding to the semantic feature information. Historical feature vectors are determined based on the historical word order information and the historical semantic feature information, and the historical feature vectors corresponding to each historical query statement are aggregated to obtain a training sample dataset. Pre-set training conditions for a pre-trained language model are determined, and the model is trained based on the training conditions and the training sample dataset to obtain a target training model. The pre-trained language model includes a baseline BERT model with a binary classification network layer. The training conditions include freezing the parameters of the baseline BERT model, but not freezing the weights of the binary classification network layer, and setting them to be trainable. The feature order in the historical feature vectors uses the original order extracted from the historical query statements. The target-driven table corresponding to the binary classification result is determined to optimize the inner join based on the target-driven table. The step of determining the binary classification result based on the current query statement and the target training model includes: parsing the current semantic feature information in the current query statement and determining the current word order information corresponding to the current semantic feature information; determining all current features in the current semantic feature information; inputting each current feature sequentially into the target training model for vector extraction based on the current word order information to obtain a current feature vector; and determining the binary classification result based on the current feature vector and the target training model, wherein the binary classification result includes both semantic and word order results.
2. The optimization method for inner connections as described in claim 1, characterized in that, The step of obtaining historical query statements from multiple inputs includes: In response to the input demand information, an initial historical query statement is obtained, the demand quantity of the demand information is determined, and it is detected whether the number of the initial historical query statements is greater than the demand quantity. If the number of initial historical query statements is greater than the required number, then the initial historical query statements are sampled based on the preset sampling rules and the required number to obtain historical query statements; If the number of initial historical query statements is less than or equal to the required number, then the initial historical query statements will be used as historical query statements.
3. The optimization method for inner connections as described in claim 1, characterized in that, The step of determining the historical feature vector based on the historical word order information and the historical semantic feature information includes: All historical features in the historical semantic feature information are determined, and each historical feature is sequentially input into a preset pre-trained language model based on the historical word order information for vector extraction to obtain an initial historical feature vector; Determine the length of the vector in the initial historical feature vector, and detect whether the vector length matches a preset length threshold; If the vector length does not match the preset length threshold, the initial historical feature vector is padded to obtain a historical feature vector; wherein, the padded process includes padding with zeros.
4. The optimization method for inner connections as described in claim 1, characterized in that, The step of determining the binary classification result based on the current feature vector and the target training model includes: The current feature vector is used as the input information of the binary classification layer of the target training model, and the output information of the target training model under the input information is determined. Determine the integer value corresponding to the output information, and use the integer value as the binary classification result.
5. An optimization device for internal connections, characterized in that, The optimization device for the internal connection includes: An online inference module is used to respond to a received current query statement and determine a binary classification result based on the current query statement and a preset target training model; wherein, the target training model is obtained by training a semantic model based on historical query statements, and the step of obtaining the target training model by training a semantic model based on historical query statements includes: Multiple input historical query statements are obtained, and the historical semantic feature information in each historical query statement is parsed sequentially to determine the historical word order information corresponding to the semantic feature information. Historical feature vectors are determined based on the historical word order information and the historical semantic feature information, and the historical feature vectors corresponding to each historical query statement are aggregated to obtain a training sample dataset. Pre-set training conditions for a pre-trained language model are determined, and the model is trained based on the training conditions and the training sample dataset to obtain a target training model. The pre-trained language model includes a baseline BERT model with a binary classification network layer. The training conditions include freezing the parameters of the baseline BERT model, but not freezing the weights of the binary classification network layer, and setting them to be trainable. The feature order in the historical feature vectors uses the original order extracted from the historical query statements. An optimization determination module is used to determine the target driving table corresponding to the binary classification result, so as to optimize the inner join based on the target driving table. The step of determining the binary classification result based on the current query statement and the target training model includes: parsing the current semantic feature information in the current query statement and determining the current word order information corresponding to the current semantic feature information; determining all current features in the current semantic feature information; inputting each current feature sequentially into the target training model for vector extraction based on the current word order information to obtain a current feature vector; and determining the binary classification result based on the current feature vector and the target training model, wherein the binary classification result includes both semantic and word order results.
6. An optimized device for internal connections, characterized in that, The in-link optimization device includes: a memory, a processor, and an in-link optimization program stored in the memory and running on the processor, wherein the in-link optimization program, when executed by the processor, implements the steps of the in-link optimization method as described in any one of claims 1 to 4.
7. A computer storage medium, characterized in that, The computer storage medium stores a program for implementing an optimization method of inner join, which is executed by a processor to implement the steps of the optimization method of inner join as described in any one of claims 1 to 4.