Machine Learning-Based Script Syntax Parsing Method and System
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIAMEN ZHENGHANG SOFTWARE TECH CO LTD
- Filing Date
- 2026-03-06
- Publication Date
- 2026-06-30
AI Technical Summary
Existing script syntax parsing relies on localized, rigid rule-driven processing, so parsing is easily interrupted when a script is complex or contains errors. The generated target script structure is bloated and lengthy, and the execution chain contains many repeated calculations, resulting in low stability, a heavy subsequent maintenance workload, and a significant slowdown in execution performance.
A machine learning-based script syntax parsing method is adopted. By collecting historical data of script execution ports, a multi-task deep neural network model is trained to construct an initial script semantic graph. The machine learning model is used to perform encoding and editing operations to generate a complete abstract syntax tree. A structural equivalence subgraph index is constructed on the abstract syntax tree, and redundant subtrees are filtered and rewritten to form an optimized abstract syntax tree.
The method improves adaptability to diverse error patterns, reduces script execution costs, enhances the overall efficiency of the self-developed parsing and execution system, reduces the burden of rule maintenance, and improves stability and availability in complex script scenarios.
[Abstract figure: CN121807314B_ABST]
Description
Technical Field
[0001] This invention relates to the field of computer system technology with specific computational models, specifically to a script syntax parsing method and system based on machine learning.
Background Technology
[0002] Script syntax parsing refers to analyzing a piece of script code according to pre-agreed syntax rules, breaking down the code, which is originally just a string, into structured elements such as keywords, variables, operators, and statement blocks, and further organizing them into internal representations such as abstract syntax trees, so that the system can understand "what logic this script is expressing", providing a foundation for subsequent execution, optimization, verification, or translation.
[0003] Existing syntax parsing processes generally include: first, lexical analysis, which divides the script source code into meaningful lexical units (such as identifiers, keywords, constants, operators, etc.); then, syntax analysis is performed according to preset grammar rules (such as BNF / EBNF), checking whether the lexical unit strings conform to the syntax, and constructing a syntax tree or abstract syntax tree; based on this, some implementations will further perform simple semantic checks (such as whether variables are defined and whether types match), and finally, the parsing results are handed over to the interpreter or execution engine.
[0004] For example, Chinese invention patent CN109670601B discloses a method, apparatus, electronic device, and storage medium for generating machine learning features. The method includes: configuring feature processing information in a configuration file; parsing the feature processing information configured in the configuration file and generating a script file based on the parsed feature processing information; and executing the script file to obtain machine learning features.
[0005] For example, Chinese invention patent application CN119312882A discloses a method and related apparatus for constructing a TVM quantization model based on syntax layer transformation. The method includes: obtaining a PyTorch model file and a quantization parameter file of the model to be quantized; importing the PyTorch model file into the PyTorch framework to obtain the Python inference script for model inference; parsing the Python inference script line by line according to syntax rules, and combining it with the model information in the PyTorch model file, converting the parsed Python inference script into a TVM relay inference script according to syntax transformation rules; parsing the converted TVM relay inference script in the TVM framework and saving it as a TVM model file; and calculating and generating a quantized TVM model parameter file based on the quantization parameter file.
[0006] Most existing script syntax parsing processes are based on line-by-line matching and preset syntax templates, performing local rule parsing and direct conversion one line at a time. In actual scripts, however, the overall structure and local statements are intertwined: scripts include cross-line control flow and scope relationships, as well as incomplete writing, temporary modifications, and even minor syntax errors, and repeated subexpressions and redundant calculation fragments often appear. Because this processing is localized and rigidly rule-driven, existing technical solutions fall short in three respects: understanding the global semantic relationships of the script, handling abnormal scripts, and identifying redundancy in and structurally optimizing the parsing results. The resulting technical shortcomings are that parsing is easily interrupted by slightly complex scripts or scripts with minor errors, the generated target script structure is bloated and lengthy, and the execution chain contains many repeated calculations. This leads to low stability of the script parsing process, a heavy subsequent maintenance workload, and a significant slowdown in the actual execution performance of model inference or feature generation.
Summary of the Invention
[0007] To address the shortcomings of existing technologies, this invention provides a script syntax parsing method and system based on machine learning, which can effectively solve the problems mentioned in the background technology.
[0008] To achieve the above objectives, the present invention provides the following technical solution: The first aspect of the present invention provides a script syntax parsing method based on machine learning, comprising: collecting a historical script data set of the script execution port, extracting historical script parsing samples from it and inputting them into a machine learning model as training samples to train the machine learning model; performing lexical analysis on the script to be parsed, using a parser to perform preliminary syntax parsing on the lexical analysis results, and constructing an initial script semantic graph based on the preliminary syntax parsing results; encoding the initial script semantic graph using the machine learning model, obtaining predicted editing operations based on the encoding results, modifying the initial script semantic graph according to the predicted editing operations, and generating a complete abstract syntax tree; constructing a structural equivalence subgraph index on the complete abstract syntax tree to filter redundant subtrees, and using the machine learning model to generate a rewriting strategy for the redundant subtrees and applying it to the complete abstract syntax tree to form an optimized abstract syntax tree, thus completing the syntax parsing process of the script to be parsed.
[0009] The second aspect of this invention provides a machine learning-based script syntax parsing system, comprising: a machine learning model training module, used to collect a historical script data set of the script execution port, extract historical script parsing samples from it and input them into the machine learning model as training samples to train the machine learning model; a script semantic graph construction module, used to perform lexical analysis on the script to be parsed, perform preliminary syntax parsing on the lexical analysis results through a parser, and construct an initial script semantic graph based on the preliminary syntax parsing results; an abstract syntax tree modification module, used to encode the initial script semantic graph through the machine learning model, obtain predicted editing operations based on the encoding results, modify the initial script semantic graph according to the predicted editing operations, and generate a complete abstract syntax tree; and an abstract syntax tree optimization module, used to construct a structural equivalence subgraph index on the complete abstract syntax tree to filter redundant subtrees, and generate a rewriting strategy for the redundant subtrees through the machine learning model and apply it to the complete abstract syntax tree to form an optimized abstract syntax tree, thus completing the syntax parsing process of the script to be parsed.
[0010] Compared with the prior art, the embodiments of the present invention have at least the following advantages or beneficial effects:
[0011] (1) This invention provides a script syntax parsing method and system based on machine learning. First, it collects a historical script data set from the script execution port, extracts grammatically correct, grammatically incorrect, and optimized script parsing samples from it, and uses them to train a machine learning model. This allows the model to learn standard grammatical structures, error correction patterns, and structural rewriting patterns simultaneously in the same feature space, improving the consistency and generalization ability of online parsing and optimization. Subsequently, lexical analysis is performed on the script to be parsed, and preliminary syntax parsing is performed through a parser. The parsing results are used to construct an initial script semantic graph, which explicitly represents the parsed structures, grammatical error locations, and unparsed segments in the same graph structure, so that complete contextual information is retained even when the script has missing parts or non-standard writing. Based on this, the system encodes the initial script semantic graph using a machine learning model. Based on the encoding results, it outputs predicted editing operations for erroneous or incomplete structures, and modifies the semantic graph accordingly to generate a complete abstract syntax tree. This transforms syntax correction from hard-coded rules to a data-driven, automated process, improving adaptability to diverse error patterns and reducing the burden of rule maintenance. Finally, it constructs a structural equivalence subgraph index on the complete abstract syntax tree to filter redundant subtrees. A machine learning model is then used to generate and apply rewriting strategies to these redundant subtrees, resulting in an optimized abstract syntax tree. This allows redundancy detection and structural rewriting to be performed from a unified structural equivalence perspective and based on learnable decisions, reducing script execution costs and improving the overall efficiency of the self-developed parsing and execution system while ensuring semantic equivalence.
[0012] (2) This invention establishes in advance a fast retrieval relationship of “syntactic structure equivalence class → subexpression node set” in a data structure such as a hash table or mapping table, using the equivalence class identifier as the index key and the subexpression node set within the equivalence class as the index value. This allows the machine learning model to locate all candidate redundant subtrees with the same syntactic structure on the complete abstract syntax tree with near-constant time complexity. The operation that originally required multiple traversals of the AST to compare structures is reduced to a single lookup in the equivalence class index followed by local processing. This indexed structural equivalence management method not only significantly reduces the time overhead of redundancy detection in large scripts or complex rule sets, but also provides a natural candidate set filtering mechanism for the rewrite strategy module. This allows the optimization process to prioritize subexpression regions with “high repetition and clear structural patterns”, thereby improving optimization benefits and execution efficiency while ensuring rewrite security.
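The index described in (2) can be illustrated with a minimal Python sketch. It is not the patented implementation: the nested-tuple AST encoding, the function names, and the use of a canonical string as the equivalence class identifier are all assumptions made for illustration. Here two subtrees are treated as equivalent only when their operators and leaves match exactly, which may be narrower than the equivalence notion the invention intends.

```python
from collections import defaultdict

# Hypothetical AST encoding: a tuple (operator, child, child, ...) or a leaf string.

def build_equivalence_index(ast):
    """Map each structural key to the paths of all subtrees that share it.

    Keys are computed bottom-up in a single traversal, so the AST is
    walked once rather than compared pairwise.
    """
    index = defaultdict(list)

    def visit(node, path):
        if not isinstance(node, tuple):
            return str(node)  # leaves are not indexed, only used in keys
        op, *children = node
        child_keys = [visit(child, path + (i,)) for i, child in enumerate(children)]
        key = "(" + " ".join([op, *child_keys]) + ")"
        index[key].append(path)  # path = child positions from the root
        return key

    visit(ast, ())
    return index


def redundant_candidates(index):
    """Keep only equivalence classes that occur more than once."""
    return {key: paths for key, paths in index.items() if len(paths) > 1}


# price * qty appears twice, so it is a candidate redundant subtree.
EXAMPLE_AST = (
    "and",
    (">", ("*", "price", "qty"), "100"),
    ("<", ("*", "price", "qty"), "500"),
)
```

Calling `redundant_candidates(build_equivalence_index(EXAMPLE_AST))` returns the single class for `price * qty` with both of its locations. A production index would more likely store a fixed-size hash of the key, since string keys grow with subtree size.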
[0013] (3) In this invention, the same set of semantic encoding results can be used by the syntax correction decoder to locate and repair erroneous nodes, and by the rewrite strategy module to evaluate the rewriting value of subtrees. It can also participate in the cost estimation and equivalence class partitioning process, so that multiple variables such as node-level context vectors, script-level global vectors, and structural equivalence class identifiers can be reused across modules in the "error correction-redundancy detection-structural rewriting" link. This shared representation design avoids building and maintaining multiple independent feature extraction logics for different subtasks, reduces the complexity of feature engineering and system integration, and ensures that each submodule makes decisions in the same semantic space, which helps to improve the consistency and stability of the overall parsing process in complex script scenarios.
[0014] (4) Compared with the prior art, the embodiments of the present invention no longer rely solely on the one-time syntax checking and fixed rule base of traditional parsers. Instead, they achieve an end-to-end syntax parsing and optimization closed loop from sample construction, error correction to structure rewriting through a combination of "historical script data-driven machine learning model, initial script semantic graph, and structural equivalence subgraph index". On the one hand, by collecting a large number of historical script samples from the real running environment from the script execution port, the model can automatically learn the actual usage patterns of the enterprise's internal SQL dialect and business rule scripts, which is closer to the actual business. On the other hand, by executing data-driven redundancy detection and rewriting strategies on the complete abstract syntax tree, the present invention can not only reduce redundant calculations and reduce execution costs while ensuring semantic equivalence, but also provide stronger adaptive correction capabilities when facing syntax errors, incomplete input, or mixed language scripts. Overall, it improves the availability, scalability, and long-term maintenance efficiency of the company's self-developed syntax parser in complex production environments.
Attached Figure Description
[0015] The present invention will be further described with reference to the accompanying drawings, but the embodiments in the drawings do not constitute any limitation on the present invention. For those skilled in the art, other drawings can be obtained based on the following drawings without creative effort.
[0016] Figure 1 is a schematic diagram of the method steps of the present invention.
[0017] Figure 2 is a schematic diagram of the system module connections of the present invention.
[0018] Figure 3 is a script semantic representation diagram of the present invention.
[0019] Figure 4 is a flowchart of the hybrid coding structure of the present invention.
Detailed Implementation
[0020] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
[0021] To make the technical problems, technical solutions and advantages of the present invention clearer, a detailed description will be given below in conjunction with the accompanying drawings and specific embodiments.
[0022] Example 1: As shown in Figure 1, the first aspect of the present invention provides a script syntax parsing method based on machine learning, comprising: collecting a historical script data set of the script execution port, extracting historical script parsing samples from it and inputting them into a machine learning model as training samples, and training the machine learning model.
[0023] The historical script data set of the script execution port includes the script source code dataset of the script execution port and the execution auxiliary information set of the script execution port.
[0024] The script source code dataset is used to collect the source code of various scripts that are running or have previously run on this execution port, including business rule scripts, SQL scripts, and predefined domain-specific scripts. The execution auxiliary information set is used to collect execution-side auxiliary data such as compilation or interpretation logs, execution plans, execution result statistics, and error stack information corresponding to the above script source code, providing contextual support for the construction of training samples and the evaluation of parsing performance of the subsequent script syntax parsing model.
[0025] The historical script parsing samples include the first type of parsing training samples, the second type of parsing training samples, and the third type of parsing training samples. The machine learning model is trained using the historical script parsing samples.
[0026] In this embodiment of the invention, the machine learning model is specifically a multi-task deep neural network model based on a combination of graph neural networks (GNN) and Transformers. Its training process includes: first, automatically dividing and constructing three categories of historical script parsing samples from the script source code dataset of the script execution port, combined with execution auxiliary information such as execution logs, error stacks, and execution plans.
[0027] First, for grammatically correct scripts that can be fully parsed by existing parsers, a corresponding standard abstract syntax tree is generated. The "script → standard AST (Abstract Syntax Tree)" is used as the first type of parsing training sample to constrain the semantic encoding units in the model, which are based on GNN and Transformer, to learn the standard grammatical structure.
[0028] Secondly, for historical scripts that contained grammatical errors or incomplete writing and were later manually corrected or successfully executed, pairwise samples of "error script → corrected script / correct AST" are constructed, converted into the corresponding initial script semantic graph, and input to the encoding unit. The editing operation sequence corresponding to the target AST after grammatical correction is used as the supervision signal to train the grammatical correction decoder submodule in the model, forming the second type of parsing training samples.
[0029] Third, for redundant or inefficient scripts that show significant performance differences in the execution auxiliary information, together with their manually or system-optimized versions, sample pairs of "script / AST before optimization → script / AST after optimization" are constructed. The script before optimization is converted into an initial script semantic graph, the contextual semantic representation of each candidate subtree is obtained through the semantic encoding unit, and the optimized target structure or target rewrite editing sequence is used as the label to train the rewrite policy component in the model, forming the third type of parsing training samples.
[0030] Then, the three types of samples are uniformly converted into an initial script semantic graph or syntax tree representation that conforms to the process of this invention. The node-level context semantic representation vector and the script-level global semantic representation vector are obtained through the GNN+Transformer semantic encoding unit and respectively fed into the multi-task output head: on the one hand, it is used to predict the standard AST structure and the sequence of syntax correction and editing operations, and on the other hand, it is used to predict the rewriting strategy and rewriting result of the candidate redundant subtree. On this basis, a joint loss function is constructed, which includes indicators such as parsing accuracy, editing operation sequence matching degree, and the execution cost after rewriting being better than the original structure. The parameters of the entire GNN-Transformer encoding unit, syntax correction decoder, and rewriting strategy component are uniformly backpropagated and iteratively optimized until convergence on the validation set, thereby obtaining a multi-task deep learning model that can simultaneously support semantic encoding, syntax correction, and structural rewriting optimization after the initial script semantic graph is constructed.
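One way to read the joint loss described above is as a weighted sum of three task losses. The form below is only illustrative: the source does not give a formula, so the weights $\lambda$, the symbol names, the cost function $c(\cdot)$, and the hinge-style rewrite term with margin $m$ are all assumptions.

```latex
\mathcal{L}_{\text{joint}}
  = \lambda_{\text{parse}}\,\mathcal{L}_{\text{parse}}
  + \lambda_{\text{edit}}\,\mathcal{L}_{\text{edit}}
  + \lambda_{\text{rw}}\,\mathcal{L}_{\text{rw}},
\qquad
\mathcal{L}_{\text{rw}}
  = \max\bigl(0,\; c(T_{\text{rewritten}}) - c(T_{\text{original}}) + m\bigr)
```

Under this reading, $\mathcal{L}_{\text{parse}}$ penalizes deviation from the standard AST, $\mathcal{L}_{\text{edit}}$ penalizes mismatch with the reference editing operation sequence, and $\mathcal{L}_{\text{rw}}$ is zero only when the rewritten structure $T_{\text{rewritten}}$ is estimated to cost at least $m$ less than the original $T_{\text{original}}$.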
[0031] Lexical analysis is performed on the script to be parsed. The parser performs preliminary syntactic analysis on the lexical analysis results and constructs an initial script semantic graph based on the preliminary syntactic analysis results.
[0032] Lexical analysis is performed on the script to be parsed to obtain the lexical units of the script. The lexical units are arranged in the order of their appearance in the source code of the script to be parsed to form the lexical analysis result, which is the sequence of lexical units of the script to be parsed.
[0033] Lexical analysis refers to the process of scanning script source code character by character, dividing the continuous stream of characters into individual, meaningful lexical units, and labeling each unit with type and position information. Specifically, lexical analysis identifies keywords, identifiers, numeric constants, string constants, operators, delimiters, etc., while filtering out meaningless content such as whitespace and comments, thus transforming "plain text code" into an ordered sequence of lexical units, which serves as the foundation for subsequent syntax parsing.
[0034] When performing lexical analysis on the script to be parsed, the script source code is first scanned in character order. Based on preset lexical rules, the continuous character stream is segmented and identified into multiple lexical units with independent meanings, including keywords, identifiers, constants, operators, delimiters, etc. At the same time, whitespace characters and comment information are filtered out to obtain a set of lexical units corresponding to the script to be parsed. Subsequently, based on the order of appearance of each lexical unit in the original script source code, the above lexical units are arranged in sequence to construct a lexical unit sequence of the script to be parsed, which is used as the ordered input for subsequent syntax parsing and semantic modeling.
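The lexical analysis step above can be sketched in Python as follows. This is a minimal illustration, not the lexer of the invention: the token categories, the keyword list, and the regular expressions are assumptions chosen to cover a small SQL-like subset.

```python
import re
from typing import NamedTuple


class Token(NamedTuple):
    kind: str   # KEYWORD, IDENT, NUMBER, STRING, OPERATOR, DELIMITER, UNKNOWN
    text: str   # original source text of the lexical unit
    pos: int    # character offset in the script source


# Hypothetical keyword set for a small SQL-like subset.
KEYWORDS = {"SELECT", "FROM", "WHERE", "AND", "OR"}

# Order matters: comments are tried before the "-" operator.
TOKEN_SPEC = [
    ("COMMENT", r"--[^\n]*"),
    ("WHITESPACE", r"\s+"),
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("STRING", r"'[^']*'"),
    ("IDENT", r"[A-Za-z_][A-Za-z_0-9]*"),
    ("OPERATOR", r"<=|>=|<>|!=|=|<|>|\+|-|\*|/"),
    ("DELIMITER", r"[(),;]"),
    ("UNKNOWN", r"."),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Scan the source left to right and return tokens in order of appearance."""
    tokens = []
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind in ("WHITESPACE", "COMMENT"):
            continue  # filtered out, as described in the text
        if kind == "IDENT" and text.upper() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append(Token(kind, text, match.start()))
    return tokens
```

For Example Script 1 used later in this description, `tokenize("SELECT id FROM user_table WHERE id=100")` yields eight tokens in source order, each tagged with its type and position.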
[0035] Based on the pre-defined simplified grammar constraints, an LL (Left-to-right, Leftmost derivation) / LR (Left-to-right, Rightmost derivation in reverse) parser is used to perform preliminary grammatical parsing on the lexical unit sequence of the script to be parsed, and the preliminary grammatical parsing results are obtained. The preliminary grammatical parsing results include the successfully parsed abstract syntax tree fragments, the locations of syntax errors, and the lexical unit fragments that have not yet been parsed.
[0036] Based on the successfully parsed abstract syntax tree fragments, syntax error locations, and unparsed lexical unit fragments, the machine learning model constructs the corresponding initial script semantic graph at the granularity of the script to be parsed. The initial script semantic graph represents the connection relationship between the successfully parsed structure, error context, and unparsed fragments simultaneously in the same script semantic graph.
[0037] Syntax parsing refers to the process of examining the sequence of lexical units obtained from lexical analysis, according to predefined grammatical rules, to check whether the arrangement of these lexical units satisfies grammatical structure constraints, and assembling them into a hierarchical abstract syntax tree or other syntax tree representation. During syntax parsing, higher-level grammatical components such as expressions, statements, statement blocks, function calls, conditions, and loops are identified, establishing parent-child, nesting, and sequential relationships between lexical units. This yields an internal representation that reflects the script's logical structure, providing a structural basis for subsequent semantic analysis, execution plan generation, or script optimization.
[0038] Based on pre-defined simplified grammar constraints, the LL / LR parser reduces and derives the lexical unit sequence of the script to be parsed sequentially from front to back. Lexical unit sequences that successfully match the simplified grammar production rules are reduced to corresponding non-terminal symbol nodes, thus gradually constructing several successfully parsed abstract syntax tree fragments. During parsing, when encountering lexical unit arrangements that violate grammar production rule constraints or missing necessary symbols, such as parentheses for enclosing expressions or statement blocks, semicolons for ending statements, commas for separating parameters, operators and keywords for forming comparisons or logical conditions, the current stall position and related lexical units are marked as syntax error positions, and further reduction of that local branch is stopped. Remaining lexical units that have not yet participated in any effective reduction due to syntax conflicts, incomplete statements, or limited grammar coverage are retained as unparsed lexical unit fragments. Through the above process, a preliminary grammar parsing result is formed, which includes successfully parsed abstract syntax tree fragments, grammatical error locations, and unparsed lexical unit fragments. This provides a structural basis and error location information for subsequent semantic encoding and grammar correction.
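The three-part parsing result described above can be illustrated with a small hand-written recursive-descent (LL-style) routine in Python. It is a sketch under strong assumptions: the grammar covers only `SELECT <ident> FROM <ident> [WHERE <ident> <op> <number>] ;`, tokens are plain `(kind, text)` pairs, and the `ParseResult` structure and function name are hypothetical. A real LL / LR parser would be table-driven or generated from the simplified grammar.

```python
from dataclasses import dataclass, field


@dataclass
class ParseResult:
    fragments: list = field(default_factory=list)  # successfully parsed AST fragments
    errors: list = field(default_factory=list)     # (token index, message) pairs
    unparsed: list = field(default_factory=list)   # tokens not consumed by any rule


def parse_select(tokens):
    """Parse what matches the simplified grammar and record the rest."""
    result = ParseResult()
    pos = 0

    def at(kind, text=None):
        if pos >= len(tokens):
            return False
        k, t = tokens[pos]
        return k == kind and (text is None or t.upper() == text)

    # SELECT <ident> FROM <ident>
    head = []
    for kind, text in [("KEYWORD", "SELECT"), ("IDENT", None),
                       ("KEYWORD", "FROM"), ("IDENT", None)]:
        if not at(kind, text):
            result.errors.append((pos, f"expected {text or kind}"))
            result.unparsed = list(tokens[pos:])
            return result
        head.append(tokens[pos][1])
        pos += 1
    result.fragments.append(("select", head[1], head[3]))

    # Optional WHERE <ident> <operator> <number>
    if at("KEYWORD", "WHERE"):
        start = pos
        pos += 1
        cond = []
        for kind in ("IDENT", "OPERATOR", "NUMBER"):
            if not at(kind):
                # Stop reducing this branch; keep the whole clause as unparsed.
                result.errors.append((pos, f"expected {kind}"))
                result.unparsed = list(tokens[start:])
                return result
            cond.append(tokens[pos][1])
            pos += 1
        result.fragments.append(("where", *cond))

    # Terminating semicolon
    if at("DELIMITER", ";"):
        pos += 1
    else:
        result.errors.append((pos, "missing semicolon"))
    result.unparsed = list(tokens[pos:])
    return result
```

For a script whose only defect is the missing terminator, both clauses are returned as fragments and a single error is recorded at the end of the token sequence. If the WHERE clause itself is incomplete, the SELECT fragment is still kept and the clause tokens are returned as unparsed.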
[0039] The aforementioned initial script semantic graph refers to a graph-like intermediate representation used to represent the semantic structure of a script. It explicitly expresses various semantic components and their structural relationships within the script using graph nodes and edges. Nodes typically correspond to syntactic elements such as expression nodes, statement nodes, control structure nodes, operator nodes, and identifier nodes, while edges depict parent-child hierarchical relationships, sibling order relationships, control flow relationships, and data dependencies. Compared to a simple abstract syntax tree, the semantic graph can supplement the tree-like hierarchical structure with additional sequence edges, cross-statement dependency edges, or error association edges, allowing the script's structural features at the syntactic level to be presented in a more complete and computable graph structure, facilitating subsequent graph neural network encoding, structural optimization, and rewriting analysis.
[0040] After obtaining the successfully parsed abstract syntax tree fragments, syntax error locations, and unparsed lexical unit fragments, the machine learning model constructs a corresponding semantic graph at the granularity of the script to be parsed. First, each syntax node (such as expression nodes, condition nodes, statement nodes, etc.) in the abstract syntax tree fragment is extracted as a syntax node in the semantic graph, and each lexical unit in the unparsed lexical unit fragment is extracted as a lexical node in the semantic graph. A dedicated error node is generated for each syntax error location. Subsequently, based on the parent-child relationship and sibling order relationship in the abstract syntax tree, hierarchical edges and sequential edges are established between syntax nodes in the semantic graph. Sequence edges are established between lexical nodes based on the order of lexical units in the source code. Error association edges are established between error nodes and their respective statement blocks or adjacent syntax nodes. Thus, the connection relationship between "parsed structure", "error context" and "unparsed fragments" is simultaneously depicted in the same semantic graph, providing a unified structured representation basis for subsequent semantic encoding and syntax correction.
[0041] In one specific embodiment:
[0042] Taking Example Script 1, SELECT id FROM user_table WHERE id=100 (missing a semicolon at the end), as an example, the initial script semantic graph of Example Script 1 is constructed as shown in Figure 3, the script semantic representation diagram of this invention. A "statement node (SELECT statement)" is set in the semantic graph to represent the top-level syntactic unit of the entire query statement, and is connected through hierarchical edges to the lexical nodes representing keywords such as SELECT and FROM and to the expression node representing the WHERE condition clause; below the "expression node (WHERE condition)", lexical nodes representing WHERE, id, =, and 100 are further attached to describe the syntactic composition inside the condition expression; at the same time, an "error node (missing semicolon)" is introduced below the statement node and connected to the corresponding statement node through an error association edge to explicitly mark the syntactic defect at the end of the script statement.
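The graph for Example Script 1 can be written down as a small Python data structure. This is a sketch only: the `SemanticGraph` class, the node and edge kind names, and the choice to attach `id` and `user_table` directly under the statement node are assumptions, since the source describes the graph in prose and in Figure 3 rather than as code.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticGraph:
    nodes: list = field(default_factory=list)  # (node_id, node_kind, label)
    edges: list = field(default_factory=list)  # (source_id, target_id, edge_kind)

    def add_node(self, kind, label):
        node_id = len(self.nodes)
        self.nodes.append((node_id, kind, label))
        return node_id

    def add_edge(self, source, target, kind):
        self.edges.append((source, target, kind))


def build_example_graph():
    """Build the initial script semantic graph for Example Script 1."""
    g = SemanticGraph()
    stmt = g.add_node("syntax", "SELECT statement")

    def attach_tokens(parent, texts):
        # Hierarchy edge from the parent, sequence edges in source order.
        prev = None
        for text in texts:
            node = g.add_node("lexical", text)
            g.add_edge(parent, node, "hierarchy")
            if prev is not None:
                g.add_edge(prev, node, "sequence")
            prev = node
        return prev

    last = attach_tokens(stmt, ["SELECT", "id", "FROM", "user_table"])

    where = g.add_node("syntax", "WHERE condition")
    g.add_edge(stmt, where, "hierarchy")
    g.add_edge(last, where, "sequence")  # sibling order under the statement node
    attach_tokens(where, ["WHERE", "id", "=", "100"])

    # Dedicated error node linked to the statement it belongs to.
    error = g.add_node("error", "missing semicolon")
    g.add_edge(error, stmt, "error_association")
    return g
```

Keeping the error node in the same graph as the parsed structure is what lets a later encoder see the defect together with its context, rather than as a separate diagnostic message.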
[0043] It's important to explain that the relationship between semantic graphs and scripts can be understood as follows: the script is the source, and the semantic graph is a structured projection. The script itself exists as a stream of characters or a sequence of lexical units; the interpreter directly sees the text code. The semantic graph, however, is an internal representation of the script's implicit syntactic hierarchy, statement boundaries, control structures, and dependencies, extracted and organized into a graph structure after lexical and syntactic analysis. Each script corresponds to a semantic graph, and the nodes and edges in the semantic graph can be mapped back to specific code segments in the script through their positional information. During the parsing and optimization process, modifications, merging, or rewriting of the semantic graph ultimately reflect adjustments to the original script structure or execution semantics. Therefore, the semantic graph both originates from the script and serves as the carrier for subsequent analysis and transformation of the script.
[0044] The initial script semantic graph is encoded using a machine learning model. Based on the encoding results, predicted editing operations are obtained. The initial script semantic graph is then modified according to the predicted editing operations to generate a complete abstract syntax tree.
[0045] Each node in the initial script semantic graph is mapped to an initial semantic representation vector. Through a set hybrid encoding structure, message passing and encoding are performed on the initial semantic representation vectors of each node in the entire initial script semantic graph.
[0046] In the initial script semantic graph, mapping each node to an initial semantic representation vector means assigning a low-dimensional vector of real numbers to each syntax node, lexical node, and error node. This vector compactly encodes the node's syntax type, lexical content, and role information within the script. The "semantic representation vector" can be understood as a numerical identity label that can be computed and compared in subsequent machine learning modules, replacing the symbolic form that would otherwise be difficult to use directly in model computation.
[0047] The aforementioned allocation of a low-dimensional vector composed of real numbers essentially involves calculating / retrieving a small array of numbers for each node, serving as its "identity card" within the model. For example, embedding tables can be pre-established for different node types (syntax nodes, lexical nodes, and error nodes) and different lexical contents (keywords SELECT, FROM, WHERE, operators =, >, identifier lexical units, etc.). When a vector needs to be assigned to a node, a vector is first retrieved from the type embedding table based on the node type, and then a vector is retrieved from the content embedding table based on the lexical contents carried by the node. These two vectors are then concatenated or summed and mapped to a fixed dimension, such as 64 or 128 dimensions, through a linear transformation to obtain the initial semantic representation vector of the node.
[0048] For example, for the syntax node (SELECT statement) in Example Script 1, we can take the vector v_type=[0.4, -0.1, 0.2] representing the statement node from the type embedding table, and the vector v_content=[0.3, 0.5, -0.2] representing the SELECT statement pattern from the content embedding table. After concatenating the two, we can map them through a linear layer to a low-dimensional vector of length 4 [0.25, -0.10, 0.03, 0.72]. This vector is the initial semantic representation of the statement node in the initial script semantic graph.
[0049] Similarly, the lexical node "WHERE" will receive another set of vectors with different values but the same dimensions, and the error node (missing a semicolon) will also receive a special error type vector, so that the subsequent graph neural network can perform calculations and updates on different types of nodes in a unified numerical space.
[0050] The mapping process typically includes: First, finding or generating the corresponding embedding vector based on the node type (such as syntax nodes, lexical nodes, and error nodes) and the text content carried by the node (such as syntax nodes carrying the main syntax symbol or syntax structure represented by the node, such as clause keywords, function names, operator categories, etc.; lexical nodes carrying the original lexical unit text identified during the lexical analysis stage, such as the specific keyword "SELECT", identifier name, or constant literal; error nodes carrying predefined error type identifiers or error codes, such as "missing semicolon" or "mismatched brackets"), and then combining the node's position information in the script, the statement block number it belongs to, and other additional features, to obtain the initial semantic representation vector of the node through vector concatenation and linear transformation, thereby unifying the discrete set of nodes in the semantic graph into a set of numerical representations that can be encoded and updated by subsequent graph neural networks or other deep models.
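As a non-limiting illustration, the lookup, concatenation, and linear mapping described above can be sketched as follows. The table entries, the dimensions (3 and 4, matching the small worked example), the randomly initialized weights, and the names `type_table`, `content_table`, and `initial_vector` are all assumptions made for this sketch; in a trained model these values would be learned, and position or statement-block features would be appended before the linear map.

```python
import random

random.seed(0)  # fixed seed so the illustrative values are reproducible

EMBED_DIM = 3   # size of each lookup vector (hypothetical)
OUT_DIM = 4     # size of the initial semantic representation vector (hypothetical)


def random_vector(n):
    """Stand-in for learned parameters: n values drawn from [-1, 1]."""
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


# Embedding tables keyed by node type and by lexical content (illustrative entries only).
type_table = {t: random_vector(EMBED_DIM) for t in ("syntax", "lexical", "error")}
content_table = {
    c: random_vector(EMBED_DIM)
    for c in ("SELECT_STMT", "SELECT", "FROM", "WHERE", "=", ">", "MISSING_SEMICOLON")
}

# Linear layer: OUT_DIM rows of 2 * EMBED_DIM weights each, plus a bias vector.
W = [random_vector(2 * EMBED_DIM) for _ in range(OUT_DIM)]
b = random_vector(OUT_DIM)


def initial_vector(node_type, content):
    """Concatenate the type and content embeddings, then apply the linear map."""
    x = type_table[node_type] + content_table[content]  # list concatenation
    return [sum(w * xi for w, xi in zip(row, x)) + bias for row, bias in zip(W, b)]


stmt_vec = initial_vector("syntax", "SELECT_STMT")
where_vec = initial_vector("lexical", "WHERE")
error_vec = initial_vector("error", "MISSING_SEMICOLON")
```

All three node kinds end up in the same 4-dimensional space, which is what allows a single downstream encoder to process them uniformly.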
[0051] In this embodiment, the hybrid encoding structure is based on a combination of graph neural networks (GNN) and Transformers. The message passing and encoding process described above is shown in Figure 4, which is a flowchart of the hybrid encoding structure of this invention. First, based on the pre-constructed hierarchical edges, sequential edges, dependency edges, and error association edges in the semantic graph, the initial semantic representation vector of each node (node embedding layer and edge embedding layer) is used as the input to the graph neural network. Multiple rounds of message aggregation and state updates (GNN message passing layer) are performed between parent-child nodes, sibling nodes, and adjacent nodes in the control flow, outputting a set of first-round node representation vectors that incorporate local graph structure information. After several rounds of graph-level message passing, the node vector sequence, arranged in the order of appearance in the script, is input into the Transformer encoding block. The Transformer encoding block includes a self-attention layer, multiple linear transformation layers for generating query / key / value (Q / K / V) vectors, a feedforward neural network layer, and a concatenation layer. It models long-distance dependencies and global semantic patterns between nodes at different locations using a multi-head self-attention mechanism. The node vectors are then updated and reorganized by the feedforward network and the concatenation layer, outputting a final set of node-level context semantic representation vectors, as well as a global semantic representation vector representing the entire script, obtained either by pooling the node-level context semantic representation vector set or by selecting the root node vector.
Thus, the GNN submodule explicitly outputs intermediate node representations that integrate local structural information, while the Transformer submodule outputs node-level representations and script-level global representations that simultaneously incorporate local graph structure and global script context information.
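The data flow of the hybrid encoding can be illustrated with a deliberately simplified sketch. It is not the trained model: the message passing here is plain neighbor averaging, the self-attention is single-head with identity Q / K / V projections and no feedforward layer, and the global vector is obtained by mean pooling. The node vectors, the edge list, and all function names are assumptions chosen for illustration.

```python
import math


def message_passing(vectors, edges, rounds=2):
    """Each round, replace a node's vector with the mean of itself and its neighbors."""
    neighbors = {i: set() for i in range(len(vectors))}
    for a, b in edges:  # edges are treated as undirected in this sketch
        neighbors[a].add(b)
        neighbors[b].add(a)
    for _ in range(rounds):
        updated = []
        for i, v in enumerate(vectors):
            group = [v] + [vectors[j] for j in sorted(neighbors[i])]
            updated.append([sum(col) / len(group) for col in zip(*group)])
        vectors = updated
    return vectors


def self_attention(vectors):
    """Scaled dot-product self-attention with identity Q / K / V projections."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in vectors]
        m = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[c] for w, v in zip(weights, vectors)) for c in range(d)])
    return out


def mean_pool(vectors):
    """Average the node vectors into one script-level vector."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


# Four nodes in script order: statement, WHERE clause, condition, error node.
node_vectors = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
graph_edges = [(0, 1), (1, 2), (0, 3)]  # hierarchical and error-association edges

local = message_passing(node_vectors, graph_edges)  # local graph structure
contextual = self_attention(local)                  # global script context
global_vector = mean_pool(contextual)               # script-level representation
```

The intermediate `local` list corresponds to the GNN output, while `contextual` and `global_vector` correspond to the node-level and script-level outputs of the Transformer stage.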
[0052] It should be explained that the hybrid encoding structure combining graph neural networks and Transformers plays a role in the above process in two ways. On the one hand, it uses graph neural networks to aggregate explicit topological information such as parent-child hierarchical relationships, sibling order relationships, and error association relationships in the semantic graph, making the node representation closer to the real syntactic skeleton and control structure of the script. On the other hand, it uses Transformers to perform global self-attention modeling on the node vector sequence unfolded in script order, enabling the model to perceive long-distance dependencies and semantic patterns across lines, statements, and even statement blocks.
[0053] Sequence-based encoding methods that focus solely on linear order often fail to fully utilize the extracted graph structure constraints. Conversely, relying solely on graph neural networks may not adequately capture long-distance positional relationships and overall context when processing scripts with numerous nodes and long sequence spans. By combining these two approaches, both "graph structure" and "sequence dependencies" are modeled jointly within the same encoding process. This allows subsequent syntax correction and rewriting strategy models to operate on a more complete and refined structural semantic representation. This integrated hybrid encoding structure provides a joint structural-semantic modeling capability in complex script syntax parsing scenarios.
[0054] The encoding results include the set of context semantic representation vectors for each node and the global semantic representation vector of the script to be parsed.
[0055] The machine learning model's syntax correction decoder receives the encoding result as the initial decoding state or condition vector. The decoder generates a series of structured editing instructions step by step through autoregressive or step-by-step decoding. When the decoder determines that the current editing sequence is sufficient to correct the script from an erroneous state to a grammatically correct state, it outputs a termination marker, resulting in a set of predicted editing operation sequences, which are represented as predicted editing operations.
[0056] The syntax correction decoder uses a decoding strategy that introduces grammatical constraints during the decoding process.
[0057] A syntax correction decoder can be understood as a component in a machine learning model specifically responsible for "syntax correction." Its input consists of a set of node-level contextual semantic representation vectors output by a hybrid encoding structure and a script-level global semantic representation vector. Internally, through attention mechanisms, gating units, or other non-linear transformation structures, it performs one or more steps of decoding and reasoning on these encoded results, thereby learning the mapping pattern of "error script - correct script / correct grammatical structure" in a high-dimensional vector space. In other words, the syntax correction decoder does not directly face the original characters or lexical units, but rather, based on an intermediate representation that integrates structural and contextual semantic information, it internally performs the functions of "identifying error locations, selecting correction modes, and planning correction content," making it a key unit for realizing the syntax correction capability of machine learning models.
[0058] During the output of predicted editing operations, the syntax correction decoder first receives the context semantic representation vectors of each node in the semantic graph corresponding to the current script, as well as the global semantic representation vector representing the entire script, and uses these as the initial state or condition vector for decoding. Subsequently, the decoder gradually generates several structured editing operation instructions through autoregressive or step-by-step decoding. Each editing operation includes at least the editing type, target node identifier, and editing content parameters, which are used to specify "near which node to modify", "what editing type to use", and "what syntax unit or subtree to insert, delete, or replace". When the decoder internally determines that the currently accumulated editing operation sequence is sufficient to correct the script from an erroneous state to a syntactically correct state, it outputs a preset termination marker to indicate the end of the editing sequence. This results in the final output of this decoding consisting of an ordered sequence of predicted editing operation instructions and its termination marker, which drives subsequent structural updates to the semantic graph or abstract syntax tree.
[0059] Editing operations refer to atomic-level structural modification instructions applied to the semantic graph or abstract syntax tree representing the script. These instructions are used to adjust local structures by inserting, deleting, replacing, or reconnecting elements while maintaining overall semantic consistency as much as possible.
[0060] For example, in Example Script 1: SELECT id FROM user_table WHERE id=100, if a semicolon is missing at the end, the syntax correction decoder might output the following editing operation: "Insert a new lexical node after the syntax node indicating the end of the statement, with a semicolon (;) as its lexical content, and attach this node under the current statement node." This operation is structurally equivalent to adding a semicolon child node to the statement in the syntax tree. Similarly, when the conditional expression is mistakenly written as WHERE=id 100, the editing operation could include "deleting the = node at the current position" and "inserting a new comparison operator node between id and 100," etc. Through a series of such fine-grained structural edits, the script structure, which originally had missing or incorrect ordering, is adjusted to a grammatically correct form.
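A minimal sketch of how such editing operations might be represented and applied is given below. For brevity the statement is modeled as a flat list of child nodes rather than a full tree, and the `EditOp` fields (`kind`, `target`, `content`) and the operation names are hypothetical stand-ins for the editing type, target node identifier, and editing content parameters.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EditOp:
    kind: str                       # "insert_after", "delete", or "replace"
    target: int                     # index of the target child node
    content: Optional[str] = None   # lexical content for insert / replace


def apply_edits(children: List[str], ops: List[EditOp]) -> List[str]:
    """Apply edit operations in order to a copy of a node's child list.

    Each target index refers to the list as it stands when that operation runs.
    """
    result = list(children)
    for op in ops:
        if op.kind == "insert_after":
            result.insert(op.target + 1, op.content)
        elif op.kind == "delete":
            del result[op.target]
        elif op.kind == "replace":
            result[op.target] = op.content
        else:
            raise ValueError(f"unknown edit type: {op.kind}")
    return result


# Missing semicolon: insert a ";" node after the last child of the statement.
statement = ["SELECT", "id", "FROM", "user_table", "WHERE", "id", "=", "100"]
fixed = apply_edits(statement, [EditOp("insert_after", len(statement) - 1, ";")])

# Misordered condition: delete the misplaced "=" and insert one between id and 100.
broken = ["WHERE", "=", "id", "100"]
repaired = apply_edits(broken, [EditOp("delete", 1), EditOp("insert_after", 1, "=")])
```

Because the operations are applied to a copy, the erroneous input remains available for comparison or rollback.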
[0061] The aforementioned decoding strategy with grammatical constraints refers to explicitly introducing predefined grammatical production rules, syntax rules, and node type constraints into the decoding process while the syntax correction decoder generates editing operations or correction plans. This ensures that, at each step, the decoder searches and outputs only from the set of legal candidates that conform to the current grammatical context when selecting the editing type, the target node, and the inserted / replaced content. For example, if the grammar requires that the next symbol at a certain statement position be an operator or a right parenthesis, the decoder will not generate identifiers or keywords that do not conform to the grammar at that position. In this way, the model's free generation behavior is constrained within the feasible space of "grammatically correct" results, which reduces the occurrence of correction results that contain obvious grammatical errors and improves the legality and stability of the overall structure after the editing operation sequence is applied to the abstract syntax tree.
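One common way to realize such a constraint is to mask out illegal candidates before choosing the best-scoring one. The sketch below assumes a hypothetical table `ALLOWED_NEXT` of token categories permitted after a given category, and made-up model scores; it illustrates only the masking idea, not the grammar of any particular scripting language.

```python
# Hypothetical grammar table: token categories allowed after a given category.
ALLOWED_NEXT = {
    "identifier": {"operator", "right_paren", "semicolon"},
    "operator": {"identifier", "literal", "left_paren"},
    "literal": {"operator", "right_paren", "semicolon"},
}


def constrained_choice(previous_category, scores):
    """Return the highest-scoring candidate whose category the grammar permits."""
    legal = ALLOWED_NEXT[previous_category]
    candidates = {cat: s for cat, s in scores.items() if cat in legal}
    if not candidates:
        raise ValueError("no grammatically legal candidate")
    return max(candidates, key=candidates.get)


# The unconstrained model would prefer another identifier, which is illegal here.
model_scores = {"identifier": 0.9, "keyword": 0.7, "operator": 0.6, "right_paren": 0.2}
choice = constrained_choice("identifier", model_scores)
print(choice)  # operator
```

The two highest raw scores are discarded because they are not legal after an identifier in this table, so the decoder falls back to the best legal option.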
[0062] A structural equivalence subgraph index is constructed on the complete abstract syntax tree to filter redundant subtrees. A rewriting strategy for the redundant subtrees is generated through a machine learning model and applied to the complete abstract syntax tree to form an optimized abstract syntax tree, thus completing the syntax parsing process of the script to be parsed.
[0063] The hash operation is performed on each sub-expression of the complete abstract syntax tree to obtain the hash value of each sub-expression. When two sub-expressions have the same hash value, it means that their syntax structure is consistent and they can be regarded as sub-expressions with the same syntax structure.
[0064] Based on the same syntactic structure, subexpressions with the same hash value are grouped into the same group, and a common equivalence class identifier is assigned to this group, thereby marking all subexpressions with the same syntactic structure as the same equivalence class.
[0065] Hash operations are performed on each sub-expression's corresponding syntax tree subtree. A structural signature is constructed based on its node type, operators, child node order, and other syntactic structural elements. A pre-defined hash function is then applied to this signature to obtain a hash value that compactly represents the structural characteristics of the sub-expression. Assuming the hash function is designed to provide a consistent mapping for the same syntactic structure, when two sub-expressions have the same hash value, it means their syntactic structures are consistent in terms of node type combinations, operator configurations, and child node topology. These can be considered sub-expressions with identical syntactic structures. Based on this, sub-expressions with the same hash value can be grouped together and assigned a common equivalence class identifier. This marks all sub-expressions with identical syntactic structures as belonging to the same equivalence class, facilitating their subsequent unified management and optimization as a reusable or rewriteable structural unit.
[0066] Based on example script 2:
[0067] The query `SELECT id, amount FROM order_table WHERE status='PAID' AND status='PAID' AND amount>100` has a subexpression `status='PAID'` and a larger subexpression `status='PAID' AND amount>100`.
[0068] Structured hashing can be done like this:
[0069] First, construct a structural signature for each subexpression (focusing only on the structure, not the source code format). For `status='PAID'`: parse it into a small tree in which the root node is `=`, the left child node is the field `status`, and the right child node is the constant `'PAID'`; then construct the signature string Sig1, for example:
[0070] [signature string Sig1 omitted in the source]
[0071] For `status='PAID' AND amount>100`:
[0072] The root node is AND, the left subtree is `status='PAID'`, and the right subtree is `amount>100`; the signature of the right subtree is constructed similarly.
[0073] Then construct the signature Sig2 of the expression as a whole:
[0074] [signature string Sig2 omitted in the source]
[0075] Perform a hash operation on the structure signature:
[0076] Treat Sig1 and Sig2 as ordinary strings and input them into a hash function (such as hash()): H1=Hash(Sig1), H2=Hash(Sig2).
[0077] If two subtrees with the identical `status='PAID'` structure appear in the script, their signatures will both be Sig1. Therefore, the hash values obtained after the hash operation are the same, for example, both are H1=0xA37F…, so these two subexpressions can be classified into the same equivalence class.
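The signature-then-hash procedure can be sketched as follows. The signature format used here (a parenthesized prefix string) is an assumption made for illustration, since the exact signature strings are not given in the text; SHA-256 from the standard library is likewise only one possible choice of hash function, and the tuple-based tree encoding is hypothetical.

```python
import hashlib


def signature(node):
    """Build a structural signature from node type, operator, and child order."""
    if len(node) == 2:  # leaf: (kind, value)
        kind, value = node
        return f"{kind}:{value}"
    op, left, right = node  # inner node: (operator, left subtree, right subtree)
    return f"({op} {signature(left)} {signature(right)})"


def structural_hash(node):
    """Hash the signature string so that equal structures get equal hash values."""
    return hashlib.sha256(signature(node).encode("utf-8")).hexdigest()


# Two separately built but structurally identical subtrees for status='PAID'.
status_paid_1 = ("=", ("field", "status"), ("const", "'PAID'"))
status_paid_2 = ("=", ("field", "status"), ("const", "'PAID'"))
amount_gt = (">", ("field", "amount"), ("const", "100"))
larger = ("AND", status_paid_2, amount_gt)

print(signature(status_paid_1))  # (= field:status const:'PAID')
print(structural_hash(status_paid_1) == structural_hash(status_paid_2))  # True
```

Because the signature of a parent embeds the signatures of its children in order, the hash reflects the whole subtree rather than only its root. Equal hashes indicate equal structure only under the stated assumption that the hash function does not collide on distinct signatures.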
[0078] After completing the equivalence class partitioning of the subexpressions, for each equivalence class, the equivalence class identifier is obtained as the index key, and the system searches the preset data structure to see if an entry corresponding to that index key already exists.
[0079] If it does not exist, a write operation is performed, creating a new record in the data structure and writing the list of subexpression node identifiers within the equivalence class into the index value field of the record.
[0080] If it exists, perform an update operation, appending the identifier of the newly added subexpression node in the equivalence class to the node set corresponding to the index key.
[0081] By sequentially performing write or update operations on all equivalence classes, a key-value mapping set from equivalence class identifiers to subexpression node sets is finally formed in the preset data structure, thus completing the construction of the structural equivalence subgraph index.
[0082] The aforementioned preset data structure can be a hash table, a mapping table, or a key-value storage structure, used to store the mapping relationship between equivalence class identifiers and subexpression node sets.
[0083] This index allows the system to quickly retrieve all sets of sub-expression nodes with the same or similar syntax based on equivalence class identifiers, providing efficient location and management methods for subsequent candidate redundant subtree selection, common sub-expression extraction, and structure rewriting optimization.
[0084] Taking example script 2 with WHERE status='PAID' AND status = 'PAID' AND amount>100 as an example:
[0085] Suppose that after calculating the structure hash of each conditional subexpression, we get: the hash value of the first status='PAID' is H11, the hash value of the second status='PAID' is also H11, and the hash value of amount>100 is H22. In the abstract syntax tree, the root nodes of these three subexpressions are numbered Node_1, Node_2, and Node_3 respectively.
[0086] When constructing the structural equivalence subgraph index, the equivalence class H11 is processed first. It is found that there is no record with key H11 in the data structure. Therefore, a new entry Key=H11→Value={Node_1, Node_2} is created, indicating that the equivalence class H11 contains two subexpressions with the same structure, status='PAID'. Then, the equivalence class H22 is processed. Similarly, a new entry Key=H22→Value={Node_3} is created, indicating that the equivalence class H22 currently contains only one subexpression, amount>100.
[0087] Through this key-value writing process, a set of mappings of "equivalence class identifier → subexpression node set" is finally obtained in the hash table / mapping table. This is used to quickly locate all duplicate status='PAID' subtrees directly through H11 and locate the amount>100 subtree through H22, thus providing an efficient retrieval foundation for redundancy detection and rewrite optimization.
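The write-or-update logic can be sketched with an ordinary dictionary standing in for the preset data structure. This sketch processes one node at a time rather than one equivalence class at a time, which yields the same final mapping; the function name `build_index` and the input format are assumptions.

```python
def build_index(hashed_nodes):
    """Map each equivalence class identifier to the list of node ids sharing it."""
    index = {}
    for node_id, class_id in hashed_nodes:
        if class_id not in index:
            index[class_id] = [node_id]       # write: create a new record
        else:
            index[class_id].append(node_id)   # update: append to the existing record
    return index


index = build_index([("Node_1", "H11"), ("Node_2", "H11"), ("Node_3", "H22")])
print(index)  # {'H11': ['Node_1', 'Node_2'], 'H22': ['Node_3']}
```

Looking up `index["H11"]` then returns both duplicate `status='PAID'` subtrees in a single step.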
[0088] Candidate redundant equivalence classes are selected from the structural equivalence subgraph index. Each subtree node in each candidate redundant equivalence class is considered a candidate redundant subtree.
[0089] When selecting candidate redundant equivalence classes from the structural equivalence subgraph index, the machine learning model first performs statistical analysis on the set of subexpression nodes corresponding to each equivalence class in the index. Equivalence classes that contain only a single subexpression node or whose subexpression size is lower than a preset size threshold are eliminated. Only equivalence classes that appear more frequently than a frequency threshold and whose structural indicators such as the number of subexpression nodes and syntactic depth are within a set range are retained. These equivalence classes that meet the condition of "high repetition and structural size with optimization value" are marked as candidate redundant equivalence classes. On this basis, the root node of the subtree in each candidate redundant equivalence class is added to the candidate set according to its position reference in the abstract syntax tree, and each subtree node in this set is regarded as a candidate redundant subtree for subsequent rewriting strategy evaluation and structural optimization decision-making.
[0090] For each candidate redundant subtree, predict whether it is worth rewriting. If it is worth rewriting, generate a rewriting strategy for the redundant subtree using a machine learning model and apply it to the complete abstract syntax tree. If it is not worth rewriting, maintain the original structure of the redundant subtree in the abstract syntax tree.
[0091] In this embodiment, the prediction of whether a candidate redundant subtree is worth rewriting is specifically achieved by inputting the structural features of the candidate redundant subtree and its corresponding contextual semantic representation into the rewriting strategy component of the machine learning model. Within this component, a prediction score representing the rewriting benefit of the subtree is calculated through forward inference. This score is then compared with a preset first prediction threshold. When the prediction score is not lower than the first prediction threshold, the candidate redundant subtree is determined to have rewriting value.
[0092] If a subtree is deemed worthy of rewriting, a machine learning model is invoked to generate one or more candidate rewriting schemes, such as different methods for extracting common subexpressions, merging logical conditions, or shifting computation forward. A heuristic cost estimate is calculated for each candidate rewriting scheme. This cost estimate can be obtained by weighting and summing indicators such as the expected number of executions, data scan volume, and computational complexity based on a preset cost function. Simultaneously, a reference cost value is calculated for the original unrewritten subtree. The cost estimates of each candidate rewriting scheme are compared with the reference cost value. Candidate schemes with cost estimates no higher than the reference cost value minus a preset cost reduction threshold are selected as acceptable schemes. Among the acceptable schemes, the one with the smallest cost estimate is chosen as the target rewriting scheme. This target rewriting scheme is then applied to the complete abstract syntax tree, replacing or reconstructing the corresponding subtree structure. This achieves local structural optimization and a substantial reduction in overall execution cost while maintaining the script's logical semantics.
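The cost comparison step can be sketched as follows. The indicator names, weights, metric values, and the reduction threshold are all hypothetical; only the selection rule (accept schemes whose cost is at most the reference cost minus the threshold, then take the cheapest) follows the description above.

```python
def cost(metrics, weights):
    """Weighted sum of cost indicators."""
    return sum(weights[name] * value for name, value in metrics.items())


def select_scheme(original, candidates, weights, min_reduction):
    """Return the name of the cheapest acceptable scheme, or None to keep the original."""
    reference = cost(original, weights)
    acceptable = {}
    for name, metrics in candidates.items():
        estimate = cost(metrics, weights)
        if estimate <= reference - min_reduction:
            acceptable[name] = estimate
    if not acceptable:
        return None
    return min(acceptable, key=acceptable.get)


weights = {"executions": 1.0, "scan_volume": 0.5, "complexity": 2.0}
original = {"executions": 10, "scan_volume": 100, "complexity": 3}  # cost 66.0
candidates = {
    # cost 62.0: cheaper than the original, but not by the required margin of 5.0
    "extract_common_subexpression": {"executions": 6, "scan_volume": 100, "complexity": 3},
    # cost 49.0: acceptable
    "merge_conditions": {"executions": 5, "scan_volume": 80, "complexity": 2},
}
target = select_scheme(original, candidates, weights, min_reduction=5.0)
print(target)  # merge_conditions
```

Returning `None` when no scheme clears the threshold corresponds to keeping the original subtree unchanged.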
[0093] In this embodiment of the invention, for the complete abstract syntax tree after syntax correction and structural rewriting optimization, a unified downstream generation step can be further performed: On the one hand, based on the optimized abstract syntax tree, an internally common intermediate representation (IR) is generated on demand, abstracting the script's syntax structure and execution semantics into an instruction sequence or operator graph structure independent of the specific scripting language, which is used as input for subsequent rule engines, execution engines, or code generation modules, so that the optimization results generated in the parsing stage can be reused among different execution carriers; On the other hand, when it is necessary to deploy to a specific operating environment or business system, the internal representation can be translated into the specific implementation form of the target scripting language on demand based on the same optimized abstract syntax tree, such as generating corresponding SQL statements, TVM Relay scripts, or internal rule scripts, thereby realizing a complete technical closed loop from "script syntax parsing - error correction and structural optimization - unified intermediate representation - multi-target script generation", so that the syntax parsing and optimization capabilities of this invention can be smoothly connected to various execution platforms and business scenarios.
[0094] It should be explained that, in a preferred embodiment, program syntax parsing and execution can be incorporated into the integrated process of historical script sample construction and semantic encoding, correction, and rewriting of the present invention. Specifically, when collecting historical data at the script execution port, database syntax scripts and program scripts (such as internal business scripts, Python subsets, rule engine scripts, etc.) can be uniformly included in the same script source code dataset. For program scripts, standard abstract syntax trees, pairs of erroneous and corrected scripts, and pairs of scripts before and after optimization are generated through the corresponding syntax analyzer. Script type identifiers and syntax family identifiers are added to the samples as additional features input to the multi-task deep neural network model based on GNN and Transformer.
[0095] During online parsing, the appropriate lexical analyzer and grammar rules are first selected based on the type of the script to be parsed. A program script is then parsed into abstract syntax tree fragments and mapped, in the same way as a database script, to an initial script semantic graph. In the subsequent semantic graph hybrid encoding, syntax correction decoding, and structural rewriting strategy generation processes, database scripts and program scripts are no longer distinguished; they are processed uniformly by the same set of semantic encoding units and editing operation prediction mechanisms. Only in the final stage is the optimized abstract syntax tree handed over to the database execution engine or the program execution engine for interpretation, or for compilation and execution, depending on the script type. This achieves deep integration of database syntax script parsing and generation with program syntax parsing and execution at the sample construction layer, semantic graph representation layer, and model inference layer.
[0096] As a specific example, Example 1:
[0097] A company has a self-developed script system whose execution port parses and executes SQL and internal business rule scripts.
[0098] S101 Training Sample Preparation:
[0099] Three types of samples were compiled from the historical scripts of the execution port:
[0100] Syntactically correct script → Standard AST (training semantic encoding);
[0101] Error / incomplete script → Manually corrected script / AST (training syntax correction);
[0102] Redundant or inefficient script → script / AST pair before and after optimization (training rewrite strategy).
[0103] S102 Lexical analysis, preliminary grammatical parsing, semantic graph construction:
[0104] Lexical analysis is performed on Example Script 2: SELECT id, amount FROM order_table WHERE status='PAID' AND status='PAID' AND amount>100 to obtain the lexical unit sequence: SELECT, id, ",", amount, FROM, order_table, WHERE, status, =, 'PAID', AND, status, =, 'PAID', AND, amount, >, 100. Under the simplified SQL grammar, the LL / LR parser generates the AST fragments of the statement and the condition expressions, and marks the end of the statement with the syntax error "missing semicolon". Then, the statement nodes, condition nodes, lexical nodes, and error nodes, together with their relationships, are assembled into an initial script semantic graph.
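The lexical analysis step of S102 can be illustrated with a small regular-expression tokenizer. The token pattern covers only what Example Script 2 needs (quoted strings, identifiers and keywords, integers, and a few single-character symbols) and is an assumption for this sketch, as is the simple end-of-statement check that records the "missing semicolon" error.

```python
import re

# Quoted string | identifier or keyword | integer | single-character symbol.
TOKEN_PATTERN = re.compile(r"\s*('[^']*'|[A-Za-z_][A-Za-z_0-9]*|\d+|[=<>,;()*])")


def tokenize(script):
    """Split a script into lexical units, skipping whitespace between them."""
    tokens, pos = [], 0
    script = script.rstrip()
    while pos < len(script):
        match = TOKEN_PATTERN.match(script, pos)
        if match is None:
            raise ValueError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


script = ("SELECT id, amount FROM order_table "
          "WHERE status='PAID' AND status='PAID' AND amount>100")
tokens = tokenize(script)

# Record an error marker instead of aborting when the terminator is absent.
errors = [] if tokens and tokens[-1] == ";" else ["missing semicolon"]
print(len(tokens), errors)  # 18 ['missing semicolon']
```

Recording the error as data, rather than raising an exception, is what lets the later stages attach an error node to the semantic graph and repair it.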
[0105] S103 Semantic Graph Hybrid Encoding:
[0106] Each node in the semantic graph is mapped to an initial vector. First, a graph neural network is used to perform multiple rounds of message passing on parent-child, sibling, and error association edges. Then, the node vector sequence expanded in the order of the script is fed into the Transformer to obtain the contextual semantic representation of each node and the global semantic representation of the entire SQL statement.
[0107] S104 Syntax Correction Decoding:
[0108] Based on the above encoding results, the syntax correction decoder infers the error node "missing semicolon" at the end and its context, and outputs the edit operation "insert semicolon node at the end of the current SELECT statement". This operation is applied to the AST to obtain a syntactically complete and executable SQL syntax tree.
[0109] S105 Redundancy Detection and Structure Rewriting:
[0110] On the complete AST, the subexpression (including two instances of status='PAID') is hashed / encoded and an equivalence class index is built. The candidate redundant subtrees that appear repeatedly are screened out, and their structural features and semantic representations are input into the rewriting strategy component to determine whether they are worth rewriting. If the rewriting value reaches the threshold, candidate rewriting schemes such as "removing duplicate conditions" are generated. The cost of each scheme is calculated based on the expected number of executions, the amount of data scanned, etc. The scheme with a significantly lower cost than the original structure is selected and applied to the AST. Finally, the WHERE clause is rewritten as WHERE status='PAID' AND amount>100.
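The "removing duplicate conditions" rewrite in this example can be sketched as follows. The sketch relies on the fact that `A AND A` is logically equivalent to `A`, compares subtrees by direct structural equality of tuples rather than through the hash index, and uses a hypothetical tuple encoding of the condition tree.

```python
def flatten_and(node):
    """Collect the operands of a chain of AND nodes, left to right."""
    if isinstance(node, tuple) and node[0] == "AND":
        return flatten_and(node[1]) + flatten_and(node[2])
    return [node]


def remove_duplicate_conditions(node):
    """Drop repeated operands of an AND chain, keeping the first occurrence."""
    seen, unique = set(), []
    for operand in flatten_and(node):
        if operand not in seen:
            seen.add(operand)
            unique.append(operand)
    result = unique[0]
    for operand in unique[1:]:  # rebuild a left-nested AND chain
        result = ("AND", result, operand)
    return result


paid = ("=", "status", "'PAID'")
where = ("AND", ("AND", paid, paid), (">", "amount", "100"))
optimized = remove_duplicate_conditions(where)
print(optimized)  # ('AND', ('=', 'status', "'PAID'"), ('>', 'amount', '100'))
```

The result corresponds to the rewritten clause WHERE status='PAID' AND amount>100.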
[0111] Example 2:
[0112] Without changing the logic of other steps in Example 1, the semantic graph hybrid encoding in step S103 can be replaced by a hybrid encoding method based on the syntax tree structure: considering the characteristic that the script itself is mainly a tree structure, firstly, based on the simplified grammar, a syntax tree skeleton as complete as possible is constructed for the example script, and the SELECT statement, WHERE condition, and various fields and constants are organized into hierarchical syntax nodes; for the error positions marked in the preliminary parsing stage (such as missing semicolons) and possible unparsed segments, they are attached to the corresponding parent node as child nodes with special type marks, so that the whole tree can simultaneously carry "normal syntax components" and "abnormal / to-be-corrected components" in structure.
[0113] Building upon this foundation, tree-structure-specific networks such as Tree-LSTM or Tree-Transformer are introduced to recursively aggregate and encode the node vectors of each subtree of the syntax tree from the bottom up. This ensures that each intermediate and leaf node obtains a node-level semantic representation vector that integrates information from its child nodes and the local context structure. Simultaneously, the aggregated representation of the root node is output as a global semantic vector representing the entire script, which is used for subsequent syntax correction decoding and structural rewriting decisions. In this way, the original implementation method of "hybrid encoding of semantic graph, GNN combined with Transformer" is replaced by "annotated syntax tree and tree structure network".
[0114] Example 3:
[0115] Without changing the logic of other steps in Example 1 or Example 2, the method for judging structural rewriting can also adopt an alternative implementation based on a simple heuristic rule. That is, each candidate redundant subtree is quickly screened from only two perspectives: "whether the size is worth changing" and "whether the frequency of occurrence is sufficient". For each candidate subtree, first count the number of nodes it contains in the abstract syntax tree (as the subtree size) and the number of times the subtree appears in the current script. Three threshold parameters are preset: minimum node number N_min, maximum node number N_max, and minimum number of occurrences in the script F_min. When the number of nodes of a candidate subtree is less than N_min, the structure is considered too simple and the potential optimization benefit limited, so rewriting is skipped. When the number of nodes is greater than N_max, the structure is considered too complex and the rewriting risk and analysis cost too high, so it is not included in the current round of rewriting. When the number of occurrences of the subtree in the script is less than F_min, optimizing the subtree is considered to have limited impact on the overall execution cost, so it is excluded first. By employing this exclusion logic of "not rewriting if too small, not rewriting if too complex, and not rewriting if too rare," the candidate redundant subtree set is rapidly pruned without relying on a complex strategy model. Only subtrees of moderate size that appear repeatedly in the script are retained as the target objects for subsequent structural rewriting, thereby obtaining a more focused rewriting candidate set with lower overhead.
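This three-threshold rule translates directly into a short predicate. The threshold values and the candidate statistics below are hypothetical and chosen only to exercise each branch.

```python
def should_rewrite(node_count, occurrences, n_min, n_max, f_min):
    """Return True only for subtrees of moderate size that repeat often enough."""
    if node_count < n_min:    # too small: limited optimization benefit
        return False
    if node_count > n_max:    # too complex: high rewriting risk and analysis cost
        return False
    if occurrences < f_min:   # too rare: little impact on overall execution cost
        return False
    return True


# Candidate subtree -> (node count, occurrences in the script); illustrative values.
candidates = {
    "status='PAID'": (3, 2),
    "amount>100": (3, 1),
    "id": (1, 4),
}
kept = [name for name, (size, freq) in candidates.items()
        if should_rewrite(size, freq, n_min=3, n_max=50, f_min=2)]
print(kept)  # ["status='PAID'"]
```

Only the duplicated condition survives: `amount>100` appears once, and the bare field `id` is below the minimum size.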
[0116] As shown in Figure 2, the second aspect of the present invention provides a script syntax parsing system based on machine learning, including: a machine learning model training module, a script semantic graph construction module, an abstract syntax tree modification module, and an abstract syntax tree optimization module.
[0117] The machine learning model training module is connected to the script semantic graph construction module, the script semantic graph construction module is connected to the abstract syntax tree modification module, and the abstract syntax tree modification module is connected to the abstract syntax tree optimization module.
[0118] The machine learning model training module is used to collect historical script data sets from the script execution port, extract historical script parsing samples from them, and input them into the machine learning model as training samples to train the machine learning model.
[0119] The script semantic graph construction module is used to perform lexical analysis on the script to be parsed. The parser performs preliminary syntactic analysis on the lexical analysis results and constructs an initial script semantic graph based on the preliminary syntactic analysis results.
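The lexical analysis step of this module can be illustrated with a small regular-expression tokenizer that emits lexical units in their order of appearance in the source. This is only a sketch under stated assumptions: the patent does not specify the script language, so the token classes in `TOKEN_SPEC`, the `Token` type, and the choice to emit unrecognized characters as `UNKNOWN` tokens (so that erroneous scripts do not abort tokenization) are all hypothetical.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str     # token class name (hypothetical classes, see TOKEN_SPEC)
    text: str     # the matched source text
    offset: int   # character offset in the source


# Hypothetical token classes for an unspecified script language.
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_]\w*"),
    ("OP", r"[+\-*/=<>!]=?|[(){};,]"),
    ("SKIP", r"\s+"),
    ("UNKNOWN", r"."),   # anything else is kept instead of raising an error
]
MASTER = re.compile("|".join(f"(?P<{kind}>{pattern})" for kind, pattern in TOKEN_SPEC))


def tokenize(source: str) -> List[Token]:
    """Return lexical units in the order they appear in the source."""
    tokens = []
    for match in MASTER.finditer(source):
        kind = match.lastgroup
        if kind == "SKIP":          # whitespace carries no lexical unit
            continue
        tokens.append(Token(kind, match.group(), match.start()))
    return tokens


if __name__ == "__main__":
    for token in tokenize("x = 3 + y1;"):
        print(token)
```

Keeping unknown characters as tokens is one possible way to preserve material for later error-tolerant stages; a stricter lexer could instead report them as errors immediately.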
[0120] The abstract syntax tree modification module is used to encode the initial script semantic graph using the machine learning model, obtain predicted editing operations based on the encoding results, modify the initial script semantic graph according to the predicted editing operations, and generate a complete abstract syntax tree.
[0121] The abstract syntax tree (AST) optimization module is used to construct a structural equivalence subgraph index on the complete AST, filter redundant subtrees through that index, and generate a rewriting strategy for the redundant subtrees through the machine learning model. This strategy is then applied to the complete AST to form an optimized AST, thus completing the syntax parsing process of the script to be parsed.
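The index construction performed by this module (hash each sub-expression, group equal hashes into an equivalence class, and store a mapping from class identifier to node set) can be sketched as below. This is an illustrative assumption-laden sketch, not the claimed implementation: the `Node` type, the use of SHA-256 over the node kind, leaf value, and child hashes, the use of the hash itself as the equivalence class identifier, and the `min_occurrences` parameter are all choices made here for illustration.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """A minimal AST node (hypothetical type for this sketch)."""
    kind: str                                   # e.g. "Add", "Mul", "Name"
    value: str = ""                             # leaf payload, e.g. an identifier
    children: List["Node"] = field(default_factory=list)


def build_index(root: Node) -> Dict[str, List[Node]]:
    """Map each structural hash (used as the class identifier) to its nodes."""
    index: Dict[str, List[Node]] = {}

    def visit(node: Node) -> str:
        # Hash children first so a node's hash covers its whole subtree.
        child_hashes = [visit(child) for child in node.children]
        payload = repr((node.kind, node.value, child_hashes))
        digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        # Write a new entry if the key is absent, otherwise append (update).
        index.setdefault(digest, []).append(node)
        return digest

    visit(root)
    return index


def candidate_classes(index: Dict[str, List[Node]], min_occurrences: int = 2) -> Dict[str, List[Node]]:
    """Select equivalence classes whose subtree occurs at least min_occurrences times."""
    return {key: nodes for key, nodes in index.items() if len(nodes) >= min_occurrences}


if __name__ == "__main__":
    def a_plus_b() -> Node:
        return Node("Add", children=[Node("Name", "a"), Node("Name", "b")])

    # (a + b) * (a + b): the sub-expression a + b appears twice.
    tree = Node("Mul", children=[a_plus_b(), a_plus_b()])
    for key, nodes in candidate_classes(build_index(tree)).items():
        print(key[:8], nodes[0].kind, len(nodes))
```

Equal hashes are treated here as structural equality, as in the text; an implementation concerned about hash collisions could additionally compare the subtrees within a class before rewriting them.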
[0122] The above description is merely an example and illustration of the structure of the present invention. Those skilled in the art may modify or supplement the specific embodiments described, or replace them with similar approaches. Provided that such changes do not deviate from the structure of the invention or exceed the scope defined by the present invention, they shall all fall within the protection scope of the present invention.
Claims
1. A machine learning-based script syntax parsing method, characterized in that it comprises: collecting historical script data sets from the script execution port, extracting historical script parsing samples from them, and inputting the samples into the machine learning model as training samples to train the machine learning model; performing lexical analysis on the script to be parsed, performing preliminary syntactic analysis on the lexical analysis results with a parser, and constructing an initial script semantic graph based on the preliminary syntactic analysis results; encoding the initial script semantic graph using the machine learning model, obtaining predicted editing operations based on the encoding result, and modifying the initial script semantic graph according to the predicted editing operations to generate a complete abstract syntax tree; constructing a structural equivalence subgraph index on the complete abstract syntax tree, filtering redundant subtrees through the structural equivalence subgraph index, generating a rewriting strategy for the redundant subtrees using the machine learning model, and applying the rewriting strategy to the complete abstract syntax tree to form an optimized abstract syntax tree, thereby completing the syntax parsing process of the script to be parsed; wherein a hash operation is performed on each sub-expression of the complete abstract syntax tree to obtain the hash value of each sub-expression; sub-expressions with the same hash value are marked as the same equivalence class; by performing write or update operations on all equivalence classes in sequence, a key-value mapping set from equivalence class identifiers to sub-expression node sets is formed in a preset data structure, which completes the construction of the structural equivalence subgraph index; the filtering of redundant subtrees through the index is achieved by selecting candidate redundant equivalence classes from the structural equivalence subgraph index, each subtree node in each candidate redundant equivalence class being regarded as a candidate redundant subtree; generating the rewriting strategy for the redundant subtrees and applying it to the complete abstract syntax tree comprises predicting whether each candidate redundant subtree is worth rewriting: if it is to be rewritten, the rewriting strategy for the redundant subtree is generated through the machine learning model and applied to the complete abstract syntax tree; if it is not to be rewritten, the original structure of the redundant subtree in the abstract syntax tree is maintained; to predict whether a candidate redundant subtree is worth rewriting, the structural features of the candidate redundant subtree and its corresponding contextual semantic representation are input into the rewriting strategy component of the machine learning model, within which a predicted score representing the rewriting benefit of the subtree is calculated through forward inference; the predicted score is compared with a preset first prediction threshold, and if the predicted score is not lower than the first prediction threshold, the candidate redundant subtree is determined to have rewriting value.
2. The script syntax parsing method based on machine learning according to claim 1, characterized in that: the historical script data set of the script execution port includes the script source code data set of the script execution port and the execution auxiliary information set of the script execution port; the historical script parsing samples include a first type of parsing training samples, a second type of parsing training samples, and a third type of parsing training samples; and training the machine learning model with the historical script parsing samples comprises: for grammatically correct scripts that can be fully parsed by existing parsers, generating the corresponding standard abstract syntax trees (ASTs) and using the pairs that map each script to its standard AST as the first type of parsing training samples; for scripts that contain grammatical errors or are incompletely written, and that have been manually corrected or successfully executed in the past, constructing sample pairs that map each erroneous script to its corrected script or correct AST, converting the erroneous scripts into the corresponding initial script semantic graphs as input to the encoding unit, and using the editing operation sequence that yields the target AST after grammatical correction as a supervision signal to train the syntax correction decoder submodule of the model, thereby forming the second type of parsing training samples; for redundant or inefficient scripts that exhibit performance differences in the execution auxiliary information, together with their manually optimized or system-optimized versions, constructing sample pairs that map each pre-optimization script or AST to its post-optimization script or AST, converting the pre-optimization scripts into initial script semantic graphs, obtaining the contextual semantic representations of candidate subtrees through the semantic encoding unit, and using the optimized target structure or target rewrite editing sequence as labels to train the rewriting strategy component of the model, thereby forming the third type of parsing training samples.
3. The script syntax parsing method based on machine learning according to claim 1, characterized in that: the lexical analysis result is obtained by performing lexical analysis on the script to be parsed to obtain its lexical units, and arranging the lexical units in the order of their appearance in the source code of the script to be parsed.
4. The machine learning-based script syntax parsing method according to claim 3, characterized in that: the preliminary syntax parsing is based on preset simplified grammar constraints, and an LL parser or an LR parser performs the preliminary syntax parsing on the lexical analysis results to obtain preliminary syntax parsing results; the preliminary syntax parsing results include successfully parsed abstract syntax tree fragments, syntax error locations, and unparsed lexical unit fragments; the initial script semantic graph is constructed as follows: based on the successfully parsed abstract syntax tree fragments, the syntax error locations, and the unparsed lexical unit fragments, the machine learning model constructs the corresponding initial script semantic graph at the script level; and the initial script semantic graph represents, within a single script semantic graph, the successfully parsed structure, the error context, and the unparsed fragments together with the connection relationships among them.
5. The script syntax parsing method based on machine learning according to claim 1, characterized in that: the machine learning model encodes the initial script semantic graph by mapping each node in the initial script semantic graph to an initial semantic representation vector and then, through a preset hybrid encoding structure, performing message passing and encoding on the initial semantic representation vectors of the nodes across the entire initial script semantic graph.
6. The script syntax parsing method based on machine learning according to claim 5, characterized in that: The encoding result includes the set of context semantic representation vectors for each node and the global semantic representation vector of the script to be parsed.
7. The script syntax parsing method based on machine learning according to claim 6, characterized in that: the predicted editing operations are obtained from the encoding result as follows: the syntax correction decoder of the machine learning model receives the encoding result as its initial decoding state or condition vector; the decoder generates a series of structured editing instructions step by step through autoregressive or stepwise decoding; when the decoder determines that the current editing sequence is sufficient to correct the script from an erroneous state to a grammatically correct state, it outputs a termination flag, yielding a predicted editing operation sequence that constitutes the predicted editing operations; and the syntax correction decoder applies a decoding strategy that incorporates grammatical constraints during the decoding process.
8. The script syntax parsing method based on machine learning according to claim 1, characterized in that: the structural equivalence subgraph index is constructed on the complete abstract syntax tree as follows: a hash operation is performed on each sub-expression of the complete abstract syntax tree to obtain the hash value of each sub-expression; when two sub-expressions have the same hash value, their syntactic structures are consistent and they are regarded as sub-expressions with the same syntactic structure; sub-expressions with the same hash value are grouped together and the group is assigned a common equivalence class identifier, thereby marking all sub-expressions with the same syntactic structure as the same equivalence class; after the equivalence class partitioning of the sub-expressions is complete, for each equivalence class, the equivalence class identifier is taken as the index key and the preset data structure is searched for an entry corresponding to that index key; if no such entry exists, a write operation is performed: a new record is created in the data structure, and the list of sub-expression node identifiers in the equivalence class is written to the index value field of the record; if such an entry exists, an update operation is performed: the identifiers of the newly added sub-expression nodes in the equivalence class are appended to the node set corresponding to the index key; by sequentially performing write or update operations on all equivalence classes, a key-value mapping set from equivalence class identifiers to sub-expression node sets is formed in the preset data structure, thus completing the construction of the structural equivalence subgraph index.
9. A machine learning-based script syntax parsing system, employing the machine learning-based script syntax parsing method as described in any one of claims 1-8, characterized in that it comprises: a machine learning model training module, used to collect historical script data sets from the script execution port, extract historical script parsing samples from them, and input the samples into the machine learning model as training samples to train the machine learning model; a script semantic graph construction module, used to perform lexical analysis on the script to be parsed, perform preliminary syntactic analysis on the lexical analysis results with a parser, and construct an initial script semantic graph based on the preliminary syntactic analysis results; an abstract syntax tree modification module, used to encode the initial script semantic graph through the machine learning model, obtain predicted editing operations based on the encoding result, modify the initial script semantic graph according to the predicted editing operations, and generate a complete abstract syntax tree; and an abstract syntax tree optimization module, used to construct a structural equivalence subgraph index on the complete abstract syntax tree, filter redundant subtrees through the index, and apply the rewriting strategy generated by the machine learning model to the complete abstract syntax tree to form an optimized abstract syntax tree, thus completing the syntax parsing process of the script to be parsed.