A database table topic classification method and system fusing field standardization results

By employing hierarchical and progressive field standardization and multi-model fusion decision-making, the problems of non-standard naming and semantic ambiguity in database table subject classification are resolved, achieving higher accuracy and stability, and supporting automated data asset governance in complex data environments.

CN122240732A · Pending · Publication Date: 2026-06-19 · CHONGQING UNIV OF POSTS & TELECOMM

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHONGQING UNIV OF POSTS & TELECOMM
Filing Date
2026-03-25
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies lack a hierarchical and progressive field semantic standardization process in database table subject classification, resulting in poor robustness to dirty data and a single decision-making mechanism that makes it difficult to handle issues such as non-standard naming and semantic ambiguity, thus affecting classification accuracy and stability.

Method used

A hierarchical, progressive field standardization module is adopted, which combines supervised machine learning models and generative large language models to perform deep semantic understanding and standardization of field information. Through multi-model fusion and intelligent decision-making, the final table topic classification results are generated.

Benefits of technology

It improves the accuracy and reliability of database table subject classification, effectively handles naming irregularities and semantic ambiguity in complex data environments, and enhances the automation level of data asset governance and intelligent data catalog construction.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to a database table topic classification method and system that integrates field standardization results, belonging to the field of data governance technology. The system includes: a data extraction and parsing module, which parses the input source to extract structured information such as table names, table comments, field names, and field comments; a hierarchical progressive field standardization module, which performs semantic standardization on fields sequentially through dictionary matching, semantic retrieval, and generative reasoning; a topic classification feature construction module, which integrates standardized field and table information to construct classification feature representations; a multi-model fusion topic classification module, which executes supervised classification and generative reasoning in parallel, outputting the supervised classification results together with the classification and potential usage information generated by a large language model; an intelligent decision-making and fusion module, which performs fusion adjudication on the multi-model outputs according to preset rules to determine the final table topic; and a result processing and archiving module, which persists the final classification results as JSON and CSV files and generates performance statistics reports.

Description

Technical Field

[0001] This invention relates to the field of data governance technology, and in particular to a database table topic classification method and system that integrates field standardization results.

Background Technology

[0002] In the fields of data governance and data analysis, automatically and accurately identifying the business themes of database tables is fundamental for data asset inventory, data model understanding, and data service construction. Currently, automated table theme classification mainly relies on rule-based keyword matching or classification methods based on traditional machine learning models. Rule-based methods depend on pre-set dictionaries, resulting in poor flexibility and high maintenance costs. Machine learning-based methods extract statistical features from metadata such as table names and field names for training and prediction, but their performance is limited by feature quality and the standardization of training data. Therefore, developing a robust classification method that can adapt to complex real-world data environments is of significant value.

[0003] Current mainstream methods typically extract features and perform model inference directly from the raw table structure information, failing to effectively address common issues in source data such as non-standard naming and semantic ambiguity. In real-world databases in fields like science and technology innovation, table and field names often contain numerous abbreviations, mixed Chinese and English names, and legacy naming conventions, with business annotations frequently lacking. This noise in the raw data directly leads to low-quality extracted features, making it difficult for the model to capture true business semantics, thus severely impacting the accuracy and generalization ability of classification methods in cross-system and cross-domain scenarios. Therefore, conducting deep semantic understanding and standardization of the raw field information before classification is a crucial prerequisite for improving the effectiveness of subsequent topic classification.

[0004] When researching automatic classification methods for database table topics, the inventors discovered two main shortcomings in existing technologies: First, existing methods lack a hierarchical, progressive field semantic standardization process, failing to systematically address the complete standardization problem from exact matching and semantic retrieval to generative completion, resulting in poor robustness to "dirty data." Second, the decision-making mechanisms of existing solutions are typically singular, relying on only one classification model. When faced with "difficult cases" with ambiguous features or at the classification boundary, they are prone to producing low-confidence or erroneous results. They lack a mechanism to integrate the advantages of multiple models and make intelligent decisions, leading to unstable and unreliable final output results.

Summary of the Invention

[0005] To address the above problems, this invention proposes a database table topic classification method and system that integrates field standardization results, comprising:

[0006] The data extraction and parsing module is used to perform syntactic and structural analysis on the input content in order to extract structured metadata, including table names, table comments, field names, and field comments.

[0007] The hierarchical progressive field standardization module is used to call the hierarchical progressive workflow to standardize the field information in the structured information to generate standardized field names and definitions.

[0008] The topic classification feature construction module is used to integrate the standardized field information with the macro-structured information of the table to construct classification features that can comprehensively characterize the semantic connotation of the data table;

[0009] The multi-model fusion topic classification module is used to combine supervised machine learning models and generative large language models to analyze the classification features and generate table topic classification results.

[0010] The intelligent decision-making and fusion module is used to dynamically adjudicate and fuse the outputs of multiple models according to preset rules and model confidence, and determine the final table topic classification result.

[0011] The results processing and archiving module is used to persist the final results of table topic classification and generate performance statistics reports.

[0012] In the database table topic classification method and system of this invention, the data extraction and parsing module parses SQL text to extract structured metadata. Specifically, it includes: receiving and analyzing diverse initial data; performing syntax compliance checks on the SQL text; traversing its syntax structure using an ANTLR4-based parser and syntax tree listener; extracting the basic metadata of the target data table, including the table name $T_{name}$, the table comment $T_{comment}$, and the set of all fields in the table $F = \{f_1, f_2, \dots, f_n\}$, where $n$ is the total number of fields and $f_i$ denotes the $i$-th field, each field containing a field name $f_i^{name}$ and a field comment $f_i^{comment}$; and outputting the structured data object $D = \{T_{name}, T_{comment}, F\}$, where $f_i = (f_i^{name}, f_i^{comment})$.

[0013] The present invention discloses a database table topic classification method and system that integrates field standardization results. The hierarchical progressive field standardization module is used to perform hierarchical standardization processing on field information, specifically including:

[0014] (1) Dictionary matching and repair unit, used to match and repair the field set $F$ based on a preset dictionary. Fields whose names exactly match an entry in the standard field library $S$ are filtered out through exact matching to form the exact match set $F_{exact}$. The remaining unmatched field set $F_{unmatched} = F \setminus F_{exact}$ then undergoes word segmentation repair: each field name is split into a word sequence according to preset delimiters, and the forward maximum matching algorithm searches the standard word library $W$ and reassembles the matched words in order to generate a standardized field name and definition. If every word is matched, the field is added to the standardized result set $R$; if no complete match is found, the field is added to the unprocessed set $F_{pending}$. The exact match set can be represented by the following formula:

$$F_{exact} = \{\, f_i \in F \mid \exists\, s \in S,\ f_i^{name} = s \,\}$$

[0015] In the formula, $S$ represents the standard field library; $=$ indicates that a field in the standard field library is completely equivalent to a field in the field set in terms of string representation; $\in$ means "belongs to" a set; $\exists$ means "there exists"; and $\mid$ means "meeting the following conditions";

[0016] (2) Text vectorization unit, used for deep semantic understanding and standardization of fields that cannot be matched by the dictionary. This unit uses a pre-trained BERT model to encode field names and comments, and generates a high-dimensional vector representing their semantics through average aggregation (mean pooling). All entries in the standard field library $S$ are processed in the same way to generate the standard semantic vector set $V_S = \{v_1, v_2, \dots, v_m\}$, where $m$ is the total number of entries in the standard field library.

[0017] (3) Semantic retrieval and similarity calculation unit: calculates the similarity between the query field vector and each standard vector using the cosine similarity formula, sorts the candidates by similarity score, recalls the candidate standardized fields with the most similar semantics, and outputs them to the standardized result set $R$. The cosine similarity is calculated using the following formula:

$$\mathrm{sim}(v_q, v_j) = \frac{v_q \cdot v_j}{\lVert v_q \rVert \, \lVert v_j \rVert}$$

[0018] In the formula, $v_q$ is the high-dimensional vector corresponding to the field $q$ to be queried; $v_j$ is the standard semantic vector of the $j$-th field in the standard field library; $\cdot$ is the vector dot product; and $\lVert v_q \rVert$ and $\lVert v_j \rVert$ are the Euclidean norms of $v_q$ and $v_j$ respectively;

[0019] (4) Generative reasoning semantic completion unit, used to generate standardized results by inference for difficult fields that neither the dictionary matching and repair unit nor the semantic retrieval unit could standardize. For each difficult field $f_i$, structured context information containing its field name $f_i^{name}$, its field comment $f_i^{comment}$, and its table name $T_{name}$ is constructed and formatted into input prompt text acceptable to the generative language model. The pre-trained generative language model Qwen-15B then performs inference conditioned on this prompt and outputs the generated standardized field name and the corresponding Chinese definition. The final standardized result of the difficult field is added to the standardized result set $R$.

[0020] The present invention discloses a database table topic classification method and system that integrates field standardization results. The topic classification feature construction module is used to construct classification features, specifically including:

[0021] (1) Structured information aggregation unit, used to aggregate the macro information of the data table with all standardized field results $R$ output by the hierarchical progressive field standardization module to form a structured information set $I$. The structured information set can be represented by the following definition:

[0022] $I = \{T_{name}\} \cup \{T_{comment}\} \cup R$, where the symbol $\cup$ represents the set union operator.

[0023] (2) A comprehensive descriptive text generation unit, used to convert the collected structured information set $I$ into a comprehensive descriptive text $X$ through a structured splicing function $g$; the text is defined as $X = g(I)$;

[0024] (3) A multi-dimensional feature vector mapping unit, used to map the comprehensive descriptive text into high-dimensional feature vectors using multiple pre-trained language models, specifically including:

[0025] A TF-IDF vectorizer is used to extract word importance feature vectors, and a pre-trained word vector model is used to extract semantic feature vectors. These two types of feature vectors are then concatenated to form the final multi-dimensional feature vector $x$, which can be expressed by the following formula:

$$x = \phi_{tfidf}(X) \oplus \phi_{w2v}(X)$$

[0026] In the formula, $\phi_{tfidf}$ is the model function of the TF-IDF vectorizer, $\phi_{w2v}$ is the pre-trained word vector model function, $X$ is the comprehensive descriptive text, and $\oplus$ is the vector concatenation operator.

[0027] The present invention discloses a database table topic classification method and system that integrates field standardization results. The multi-model fusion topic classification module is used to perform multi-model analysis on classification features, specifically including:

[0028] (1) Supervised classification unit, used to classify the high-dimensional feature vector $x$ using a pre-trained random forest classifier. The analysis generates preliminary classification candidates with confidence scores, which can be represented by the following process:

$$\{(c_k, p_k)\}_{k=1}^{K} = f_{RF}(x)$$

[0029] In the formula, $f_{RF}$ is the prediction function of the random forest classifier; $c_k$ is the $k$-th category; $p_k$ is the confidence score for that category; and $K$ is the number of categories output by the classifier;

[0030] (2) Generative reasoning unit, used to input the comprehensive descriptive text $X$ into the generative large language model Qwen-15B, which performs deep semantic understanding and reasoning to generate a result comprising the topic classification, a confidence level, and potential uses. The result can be represented by the following process:

$$(c_{LLM}, p_{LLM}, u_{LLM}) = f_{LLM}(X)$$

[0031] In the formula, $f_{LLM}$ is the inference function of the large language model; $c_{LLM}$ is the table topic classification result inferred by the model; $p_{LLM}$ is the confidence level of that classification result; and $u_{LLM}$ describes the potential uses of the data table.

[0032] In the database table topic classification method and system of this invention, the intelligent decision-making and fusion module determines the final classification result according to the following decision logic: when the confidence level output by the generative reasoning unit is lower than a preset threshold and the supervised classification unit has produced a classification result, the output of the supervised classification unit is adopted as the final classification result, with the potential use description attached; otherwise, the output of the generative reasoning unit is adopted as the final classification result. In both cases, the potential use information is taken from the output of the generative reasoning unit.
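The decision logic in [0032] can be sketched as a small function. This is only an illustration under stated assumptions: the names `LLMResult`, `SupervisedResult`, and `fuse`, the `decision_path` field, and the threshold value 0.7 are hypothetical and do not come from the patent text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LLMResult:
    topic: str
    confidence: float
    potential_use: str


@dataclass
class SupervisedResult:
    topic: str
    confidence: float


def fuse(llm: LLMResult, supervised: Optional[SupervisedResult], threshold: float) -> dict:
    """Apply the rule from [0032]: fall back to the supervised result only when
    the generative confidence is below the threshold AND a supervised result exists."""
    if llm.confidence < threshold and supervised is not None:
        topic, path = supervised.topic, "supervised"
    else:
        topic, path = llm.topic, "generative"
    # Potential use information always comes from the generative unit.
    return {"classification": topic, "decision_path": path,
            "potential_use": llm.potential_use}


# Hypothetical threshold and inputs, for illustration only.
THRESHOLD = 0.7
confident = fuse(LLMResult("Institutional Platform", 0.9, "records managers"),
                 SupervisedResult("Personnel", 0.6), THRESHOLD)
uncertain = fuse(LLMResult("Institutional Platform", 0.4, "records managers"),
                 SupervisedResult("Personnel", 0.6), THRESHOLD)
no_fallback = fuse(LLMResult("Institutional Platform", 0.4, "records managers"),
                   None, THRESHOLD)
```

Keeping the chosen branch in the returned dictionary is one way to record the "decision path" that the archiving module is described as persisting.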

[0033] The present invention discloses a database table topic classification method and system that integrates field standardization results. The result processing and archiving module performs formatted storage and performance analysis on the final classification results, specifically including:

[0034] (1) Save the final topic classification results, decision paths and key features used in each table as a structured JSON file;

[0035] (2) Extract the core information from the JSON file and convert it into a CSV format file;

[0036] (3) Generate a performance statistics report, which includes the accuracy of the topic classification and the processing performance data of each unit in the hierarchical progressive field standardization module.
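The archiving steps listed above can be sketched with the Python standard library. This is a minimal sketch, assuming the result layout of the sample JSON given later in the description (`table_name`, `classification`, `potential_use`); the helper names `to_json` and `to_csv` are hypothetical, and the sketch returns strings instead of writing files so that it has no side effects.

```python
import csv
import io
import json

CSV_COLUMNS = ["table_name", "classification", "potential_use"]


def to_json(results: dict) -> str:
    """Serialize the full result object, keeping non-ASCII text readable."""
    return json.dumps(results, ensure_ascii=False, indent=2)


def to_csv(results: dict) -> str:
    """Extract the core per-table information into CSV rows.
    Non-dict entries (such as a top-level flag) are skipped."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=CSV_COLUMNS, lineterminator="\n")
    writer.writeheader()
    for entry in results.values():
        if isinstance(entry, dict):
            writer.writerow({column: entry.get(column, "") for column in CSV_COLUMNS})
    return buffer.getvalue()


results = {
    "func_flag": 2,
    "auth_hi_log": {
        "table_name": "auth_hi_log",
        "classification": "Institutional Platform",
        "potential_use": "Records information about the organization's "
                         "management personnel and related resources.",
    },
}
json_text = to_json(results)
csv_text = to_csv(results)
```

Writing `json_text` and `csv_text` to files would complete steps (1) and (2); the performance report in step (3) depends on timing data that the text does not specify, so it is left out here.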

[0037] The database table topic classification method and system proposed in this invention integrates field standardization results. It systematically addresses non-standard field naming and semantic ambiguity through a hierarchical and progressive field standardization process that moves from dictionary matching through semantic retrieval to generative reasoning. At the same time, by integrating the analytical capabilities of supervised machine learning models and generative large language models and making confidence-based decisions, it improves the robustness and reliability of classification decisions. This provides technical support for improving the accuracy and reliability of automatic classification of database table topics in complex data environments, and further for efficient, automated data asset governance and intelligent data catalog construction.

Attached Figure Description

[0038] Figure 1 is a flowchart of the algorithm of the present invention;

[0039] Figure 2 is a system framework diagram of the present invention;

[0040] Figure 3 is a screenshot of the front-end page result of the present invention.

Detailed Implementation

[0041] To make the objectives, technical solutions, and advantages of this invention clearer, the following detailed description, in conjunction with the accompanying drawings, provides a method and system for classifying database table topics by integrating field standardization results. It should be understood that the specific implementation methods described herein are merely illustrative of the invention and are not intended to limit the invention. Any changes, modifications, additions, alterations, or substitutions made by those skilled in the art within the scope of this invention should be covered by the claims of this invention.

[0042] Figure 1 is a flowchart illustrating the database table topic classification method based on the fusion of field standardization results according to the present invention. As can be seen from Figure 1, the method includes: receiving and parsing the input DDL text or file, and extracting structured information such as the table name, field names, and field comments; standardizing the extracted raw fields by first attempting to match them against a preset dictionary and directly taking the standardized result if a complete match is found, otherwise calculating the semantic similarity between the field and the standard field library through BERT semantic retrieval and adopting the semantically closest result if the similarity is higher than a set threshold, and, if it is lower than the threshold, calling a large language model to perform inference and generation based on input prompts; constructing table topic features from all standardized field results and the table information; performing supervised classification and generative large language model classification in parallel, and separating out the potential usage information provided by the LLM; and aggregating all classification results and information and outputting the final JSON file through intelligent decision-making.

[0043] Figure 2 is a framework diagram of the database table topic classification system that integrates field standardization results according to the present invention. As can be seen from Figure 2, the system includes an input parsing and preprocessing module, a hierarchical progressive field standardization module, a topic classification feature construction module, a multi-model fusion topic classification module, an intelligent decision-making and fusion module, and a result output and archiving module. The input parsing and preprocessing module receives and parses user-input DDL text or files, performs syntax validation, and extracts structured information such as table names, field names, and field comments. The hierarchical progressive field standardization module takes the information extracted by the input parsing and preprocessing module and sequentially performs dictionary matching, semantic retrieval, and generative inference on each field to generate standardized field semantic descriptions. The topic classification feature construction module integrates the standardized results output by the hierarchical progressive field standardization module with table-level information to construct a comprehensive feature representation for topic classification. The multi-model fusion topic classification module includes a supervised classification unit and a generative inference unit, which analyze the feature representation in parallel and output, respectively, a classification result based on machine learning and a classification result with potential usage information based on a large language model. The intelligent decision-making and fusion module fuses and adjudicates these two results according to preset rules to determine the final table topic classification. The result output and archiving module formats the final classification results, decision paths, and related information and outputs them as structured archive files.

[0044] Furthermore, the following example illustrates this further:

[0045] Suppose the system receives an SQL data definition language script related to database management and authorization as input. This script contains several fields that need to be standardized:

[0046] CREATE TABLE `auth_hi_log`(

[0047] `hi_log_id` bigint NOT NULL AUTO_INCREMENT COMMENT 'primary key id',

[0048] `org_id` bigint NULL COMMENT 'Organization ID',

[0049] `org_name` varchar(200) NULL COMMENT 'Name of Organization / Institution',

[0050] `person_id` varchar(20) NULL COMMENT 'Personal ID',

[0051] `manage_name` varchar(200) NULL COMMENT 'Name of the unit''s management personnel',

[0052] `manage_mp` varchar(40) NULL COMMENT 'Mobile phone number of unit management personnel',

[0053] `manage_id_no` varchar(100) NULL COMMENT 'Unit management personnel ID number'

[0054] ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci COMMENT='History Table';

[0055] First, the input is preprocessed by the input parsing and preprocessing module to extract the variables assumed above. The specific implementation steps of the proposed database field standardization method and system are as follows:

[0056] The complete implementation steps of the system, from receiving the input to producing the final output, are as follows:

[0057] S1: The data extraction and parsing module extracts the structured data object $D$, which includes the table name, table comment, field names, and field comments.
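Step S1 can be illustrated with a much simpler stand-in than the ANTLR4 parser and listener named in the description. The sketch below is an assumption-laden simplification: it uses regular expressions that only cover the backtick-quoted MySQL style of the example, it skips columns without a COMMENT clause, and it would mis-handle types containing commas such as `decimal(10,2)`. The function `parse_ddl` and the output dictionary layout are hypothetical.

```python
import re

DDL = """
CREATE TABLE `auth_hi_log`(
  `hi_log_id` bigint NOT NULL AUTO_INCREMENT COMMENT 'primary key id',
  `org_id` bigint NULL COMMENT 'Organization ID',
  `manage_mp` varchar(40) NULL COMMENT 'Mobile phone number of unit management personnel'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COMMENT='History Table';
"""

# Hypothetical patterns; a real grammar-based parser handles far more syntax.
TABLE_RE = re.compile(r"CREATE\s+TABLE\s+`(\w+)`", re.IGNORECASE)
FIELD_RE = re.compile(r"^\s*`(\w+)`\s+[^,\n]*?COMMENT\s+'((?:[^']|'')*)'",
                      re.IGNORECASE | re.MULTILINE)
TABLE_COMMENT_RE = re.compile(r"\)\s*ENGINE[^;]*?COMMENT\s*=\s*'((?:[^']|'')*)'",
                              re.IGNORECASE)


def parse_ddl(ddl: str) -> dict:
    """Extract table name, table comment, and (field name, field comment) pairs."""
    table = TABLE_RE.search(ddl)
    if table is None:
        raise ValueError("no CREATE TABLE statement found")
    comment = TABLE_COMMENT_RE.search(ddl)
    fields = [
        # SQL escapes a single quote inside a string literal by doubling it.
        {"name": name, "comment": text.replace("''", "'")}
        for name, text in FIELD_RE.findall(ddl)
    ]
    return {
        "table_name": table.group(1),
        "table_comment": comment.group(1) if comment else "",
        "fields": fields,
    }


metadata = parse_ddl(DDL)
```

The returned dictionary plays the role of the structured data object: one table name, one table comment, and a list of fields that later steps standardize one by one.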

[0058] S2: The hierarchical progressive field standardization module standardizes fields to obtain standard field names and definitions;

[0059] First, dictionary-based precise matching and repair are performed to conduct preliminary matching and repair of the fields. The dictionary is the foundation for field identification and standardization; the system determines the target name and semantics for standardization by comparing the input field name with entries in the dictionary. The dictionary matching and repair module loads a preset standard field library. The standard library contains standard field names such as user_id and their definitions;

[0060] The field person_id has an identical entry in the standard field library $S$, so it is matched directly and the result is added to the exact match set $F_{exact}$, expressed as follows:

$$F_{exact} = \{\, f_i \in F \mid \exists\, s \in S,\ f_i^{name} = s \,\}$$

[0061] In the formula, $S$ represents the standard field library; $=$ indicates that a field in the standard field library is completely equivalent to a field in the field set $F$ in terms of string representation; $\in$ means "belongs to" a set; $\exists$ means "there exists"; and $\mid$ means "meeting the following conditions".

[0062] Secondly, word segmentation and matching are performed. Fields such as manage_id_no fail exact matching and are added to the unmatched set $F_{unmatched}$, which is processed by the built-in word segmentation repair unit. The fields in $F_{unmatched}$ are split according to preset delimiters and matched sequentially against the standard word library $W$, whose vocabulary includes id, name, org, etc.

[0063] The field `hi_log_id` is split into [hi, log, id], where "hi" is identified as a common abbreviation of "history". Combined with the table comment "history table", the entire field is repaired and standardized to `history_log_id` and added to the standardized result set $R$. The fields org_id and org_name are split into [org, id] and [org, name]; "org" is recognized as a standard abbreviation of "organization", so they are standardized to organization_id and organization_name respectively and added to $R$.

[0064] The field `manage_mp` is split into `[manage, mp]` and reassembled into `mp_manage`, but this candidate fails to match, so the field is added to the unprocessed set $F_{pending}$. The fields manage_id_no and manage_name also cannot be standardized by this unit and are added to $F_{pending}$. Processing then proceeds to the next step.
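The dictionary stage walked through above can be sketched as exact lookup followed by delimiter splitting and forward maximum matching. The contents of `STANDARD_FIELDS` and `STANDARD_WORDS` are hypothetical stand-ins for the standard field library and word library, and the sketch simplifies the described behavior in two ways that are assumptions: "hi" is resolved through a lexicon entry alone (the text also consults the table comment), and a field is sent to the pending set as soon as one piece has no lexicon entry.

```python
import re

# Hypothetical libraries: standard field names, and abbreviation -> standard word.
STANDARD_FIELDS = {"person_id": "Personal ID"}
STANDARD_WORDS = {"hi": "history", "log": "log", "id": "id",
                  "org": "organization", "name": "name"}


def forward_max_match(text, lexicon):
    """Greedy left-to-right segmentation: at each position take the longest
    lexicon entry that matches. Returns None if some position matches nothing."""
    max_len = max(len(word) for word in lexicon)
    pieces, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                pieces.append(text[i:j])
                i = j
                break
        else:  # no candidate length matched at position i
            return None
    return pieces


def standardize(field_name):
    """Return (standard_name, stage); stage is 'exact', 'repaired', or 'pending'."""
    if field_name in STANDARD_FIELDS:
        return field_name, "exact"
    words = []
    for token in re.split(r"[_\-\s]+", field_name.lower()):
        pieces = forward_max_match(token, STANDARD_WORDS)
        if pieces is None:
            return None, "pending"
        words.extend(STANDARD_WORDS[piece] for piece in pieces)
    return "_".join(words), "repaired"


results = {name: standardize(name)
           for name in ["person_id", "hi_log_id", "org_id", "org_name", "manage_mp"]}
```

With these toy libraries, `manage_mp` ends up pending because neither "manage" nor "mp" can be segmented into lexicon entries, which mirrors the hand-off to the semantic retrieval stage.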

[0065] Third, semantic retrieval matching is performed, using a pre-trained BERT model to handle fields that cannot be resolved by dictionary matching. Each unmatched field name is concatenated with its comment and transformed into a semantic vector. For example, for the field manage_id_no, the text "manage_id_no company manager's ID number" is constructed and the corresponding token embedding vectors $h_1, \dots, h_L$ are generated; these vectors are then aggregated by averaging into a single high-dimensional vector. The standard library entries are pre-vectorized in the same way. The aggregation process is as follows:

$$v = \frac{1}{L} \sum_{k=1}^{L} h_k, \qquad h_k \in \mathbb{R}^{d}$$

[0066] In the formula, $L$ is the length of the word sequence and $d$ is the vector dimension. All entries of the standard field library $S$ undergo the same vectorization process to generate the standard semantic vector set $V_S = \{v_1, v_2, \dots, v_m\}$, where $m$ is the total number of entries in the standard field library and each $v_j$ represents the semantics of one standard field entry.

[0067] The cosine similarity between the semantic vector $v_q$ of the field manage_id_no to be processed and every vector in the standard field library vector set $V_S$ is calculated. The candidate standardized field with the highest similarity exceeding the threshold, with field name "unit_manage_person_identification_number", is selected together with its definition and added to the standardized result set $R$. The cosine similarity is calculated using the following formula:

$$\mathrm{sim}(v_q, v_j) = \frac{v_q \cdot v_j}{\lVert v_q \rVert \, \lVert v_j \rVert}$$

[0068] In the formula, $v_q$ is the high-dimensional vector corresponding to the field $q$ to be queried; $v_j$ is the standard semantic vector of the $j$-th field in the standard field library; $\cdot$ is the vector dot product; and $\lVert v_q \rVert$ and $\lVert v_j \rVert$ are the Euclidean norms of $v_q$ and $v_j$ respectively.

[0069] The vector of the `manage_name` field has a similarity exceeding the threshold with the vector of `manager_name` in the standard library, so that entry is directly recalled and the field is standardized accordingly. The similarity between the vector of the `manage_mp` field and every vector in the standard library is below the threshold, so the field is passed to the next unit for further processing.
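The retrieval stage reduces to mean pooling, cosine similarity, and a threshold test. The sketch below uses tiny hand-written 3-dimensional vectors in place of BERT outputs; the vectors, the threshold 0.8, and the function names are all hypothetical and chosen only so the two outcomes described above (recall versus hand-off) both appear.

```python
import math


def mean_pool(token_vectors):
    """Average token-level vectors into one vector (the average aggregation step)."""
    dim = len(token_vectors[0])
    return [sum(vec[k] for vec in token_vectors) / len(token_vectors)
            for k in range(dim)]


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, standard_vecs, threshold):
    """Return (best standard field, score), or (None, score) when below threshold."""
    best_name, best_score = None, -1.0
    for name, vec in standard_vecs.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return None, best_score


# Toy stand-ins for pre-computed standard library vectors (not real embeddings).
STANDARD_VECS = {"manager_name": [0.9, 0.1, 0.0],
                 "organization_id": [0.0, 0.2, 0.9]}
THRESHOLD = 0.8  # hypothetical value

manage_name_vec = mean_pool([[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]])
manage_mp_vec = mean_pool([[0.5, 0.5, 0.5]])

hit = retrieve(manage_name_vec, STANDARD_VECS, THRESHOLD)
miss = retrieve(manage_mp_vec, STANDARD_VECS, THRESHOLD)
```

In practice the standard vectors would be computed once and cached, since only the query vector changes per field.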

[0070] Finally, context-based generative reasoning handles the fields that the semantic retrieval unit could not resolve. The field name `manage_mp`, the comment "unit manager's mobile phone number", the table name `auth_hi_log`, and the table comment "history table" are combined into contextual information. This information is formatted as an input prompt and fed into the pre-trained generative language model `Qwen-15B` for inference and generation. The resulting field name `unit_manage_person_mobile_phone_number` and definition "unit manager's mobile phone number" are added to the standardized result set $R$. At this point, all fields have been processed.
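How the context might be turned into a prompt and how a reply might be read back can be sketched without calling any model. Everything here is an assumption for illustration: the prompt wording, the JSON reply format, and the helper names `build_prompt` and `parse_reply` are hypothetical, and the model reply is a hand-written stub rather than real output.

```python
import json

# Hypothetical prompt template; the patent text does not give the actual wording.
PROMPT_TEMPLATE = (
    "You standardize database field names.\n"
    "Context:\n{context}\n"
    'Reply with JSON only: {{"standard_name": "...", "definition": "..."}}'
)


def build_prompt(field_name, field_comment, table_name, table_comment):
    """Format the structured context of one difficult field into prompt text."""
    context = {
        "field_name": field_name,
        "field_comment": field_comment,
        "table_name": table_name,
        "table_comment": table_comment,
    }
    return PROMPT_TEMPLATE.format(
        context=json.dumps(context, ensure_ascii=False, indent=2))


def parse_reply(reply: str) -> dict:
    """Validate that the reply is JSON carrying both expected keys."""
    data = json.loads(reply)
    missing = {"standard_name", "definition"} - set(data)
    if missing:
        raise ValueError(f"reply is missing keys: {sorted(missing)}")
    return {"standard_name": data["standard_name"], "definition": data["definition"]}


prompt = build_prompt("manage_mp", "unit manager's mobile phone number",
                      "auth_hi_log", "history table")

# Stubbed reply standing in for the model output described in the text.
stub_reply = json.dumps({
    "standard_name": "unit_manage_person_mobile_phone_number",
    "definition": "unit manager's mobile phone number",
})
standardized = parse_reply(stub_reply)
```

Requesting a fixed JSON shape is one common way to make generated output machine-checkable before it is added to the result set.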

[0071] S3: Topic classification feature construction module. The system collects table-level information and standardization results to construct features for topic classification.

[0072] The system aggregates the table name $T_{name}$, the table comment $T_{comment}$, and the complete set of standardized fields $R$ output by the hierarchical progressive field standardization module to form a structured information set $I$. The set can be represented as follows:

[0073] $I = \{T_{name}\} \cup \{T_{comment}\} \cup R$, where the symbol $\cup$ represents the set union operator.

[0074] A structured splicing function $g$ converts the collected structured information set $I$ into a comprehensive descriptive text $X = g(I)$. A TF-IDF vectorizer then extracts word importance feature vectors from $X$, while a pre-trained word vector model extracts semantic feature vectors. Finally, these two vectors are concatenated to obtain the multi-dimensional feature vector $x$. The process can be represented by the following:

$$x = \phi_{tfidf}(X) \oplus \phi_{w2v}(X)$$

[0075] In the formula, $\phi_{tfidf}$ is the model function of the TF-IDF vectorizer, $\phi_{w2v}$ is the pre-trained word vector model function, and $\oplus$ is the vector concatenation operator.
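The feature construction in S3 can be sketched in plain Python: splice the structured information into text, compute a TF-IDF vector over a fixed vocabulary, average pre-trained word vectors, and concatenate. The vocabulary, IDF weights, and 2-dimensional word vectors below are toy assumptions, and the function names are hypothetical; a real system would fit the TF-IDF weights on a training corpus and load an actual word vector model.

```python
import re
from collections import Counter


def splice(table_name, table_comment, fields):
    """Structured splicing: flatten table info and standardized fields into text."""
    parts = [f"table {table_name}", f"comment {table_comment}"]
    parts += [f"{name} {definition}" for name, definition in fields]
    return " ".join(parts).lower()


def tokenize(text):
    # Split on anything that is not a letter or digit, so underscores separate words.
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vector(tokens, vocabulary, idf):
    """Term frequency times a precomputed IDF weight, one entry per vocabulary word."""
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[word] / total * idf[word] for word in vocabulary]


def embed(tokens, word_vectors, dim):
    """Mean of word vectors; tokens missing from the table are skipped."""
    hits = [word_vectors[t] for t in tokens if t in word_vectors]
    if not hits:
        return [0.0] * dim
    return [sum(vec[k] for vec in hits) / len(hits) for k in range(dim)]


def build_features(text, vocabulary, idf, word_vectors, dim):
    tokens = tokenize(text)
    # List concatenation plays the role of the vector concatenation operator.
    return tfidf_vector(tokens, vocabulary, idf) + embed(tokens, word_vectors, dim)


# Toy assumptions standing in for fitted or pre-trained components.
VOCABULARY = ["organization", "history", "log"]
IDF = {"organization": 1.5, "history": 2.0, "log": 1.0}
WORD_VECTORS = {"organization": [1.0, 0.0], "history": [0.0, 1.0]}

text = splice("auth_hi_log", "History Table",
              [("organization_id", "Organization ID"),
               ("history_log_id", "primary key id")])
features = build_features(text, VOCABULARY, IDF, WORD_VECTORS, dim=2)
```

The resulting vector has one block per feature type, so a downstream classifier sees both word-importance and semantic signals side by side.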

[0076] S4: Multi-model fusion topic classification module, used for multi-model analysis of classification features.

[0077] The feature vector $x$ obtained in the previous step is input into a pre-trained random forest classifier $f_{RF}$, which generates preliminary classification candidates with confidence scores. The result can be represented by the following process:

$$\{(c_k, p_k)\}_{k=1}^{K} = f_{RF}(x)$$

[0078] In the formula, $f_{RF}$ is the prediction function of the random forest classifier; $c_k$ is the $k$-th category; $p_k$ is the confidence score for that category; and $K$ is the number of categories output by the classifier.

[0079] The generated comprehensive descriptive text $X$ is input directly into the generative large language model Qwen-15B, and carefully designed prompts guide the model's reasoning. The model outputs its judgment of the topic category, a confidence level, and the potential use of the table. The result can be represented by the following process:

$$(c_{LLM}, p_{LLM}, u_{LLM}) = f_{LLM}(X)$$

[0080] In the formula, $f_{LLM}$ is the inference function of the large language model; $c_{LLM}$ is the table topic classification result inferred by the model; $p_{LLM}$ is the confidence level of that classification result; and $u_{LLM}$ describes the potential uses of the data table.

[0081] S5: The intelligent decision-making and fusion module adjudicates between the supervised result $\{(c_k, p_k)\}_{k=1}^{K}$ and the generative result $(c_{LLM}, p_{LLM}, u_{LLM})$ according to preset rules. When the confidence level $p_{LLM}$ is higher than the threshold, the generative result $c_{LLM}$ is adopted with priority. Otherwise, when the supervised classifier has results available, its result with the highest confidence level is used. In this example, $p_{LLM}$ is higher than the threshold, so $c_{LLM}$ and its potential use information $u_{LLM}$ are adopted.

[0082] S6: The result processing and archiving module formats all classification results, sends them to the front-end page for user viewing, outputs them as JSON files, and finally generates performance reports for each module. See Figure 3 for the front-end page results. The detailed JSON file content is as follows.

[0083] {

[0084] "func_flag": 2,

[0085] "auth_hi_log": {

[0086] "table_name": "auth_hi_log",

[0087] "classification": "Institutional Platform",

[0088] "potential_use": "Records information about the organization's management personnel and related resources."

[0089] }

[0090] }
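The archiving step can be sketched with the record shown above. The sketch serializes the result as JSON and flattens the per-table entries into CSV rows; the function names and the choice of CSV columns are assumptions, and the meaning of `func_flag` is not explained in the text, so it is simply carried through in the JSON and skipped in the CSV. Text is written to in-memory strings rather than files to keep the example self-contained.

```python
import csv
import io
import json

result = {
    "func_flag": 2,
    "auth_hi_log": {
        "table_name": "auth_hi_log",
        "classification": "Institutional Platform",
        "potential_use": "Records information about the organization's "
                         "management personnel and related resources.",
    },
}


def to_json(payload: dict) -> str:
    """Serialize the full result; ensure_ascii=False keeps any Chinese text readable."""
    return json.dumps(payload, ensure_ascii=False, indent=2)


def to_csv(payload: dict) -> str:
    """Flatten per-table entries into one CSV row each."""
    columns = ["table_name", "classification", "potential_use"]
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns,
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    for value in payload.values():
        if isinstance(value, dict):  # skip scalar entries such as func_flag
            writer.writerow(value)
    return buffer.getvalue()


json_text = to_json(result)
csv_text = to_csv(result)
```

Writing `json_text` and `csv_text` to disk with an explicit UTF-8 encoding would complete the persistence step.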

Claims

1. A method and system for classifying database table topics by integrating field standardization results, characterized in that it comprises: a data extraction and parsing module, used to perform syntactic and structural analysis on the input content in order to extract structured metadata, including table names, table comments, field names and field comments; a hierarchical progressive field standardization module, used to call a hierarchical progressive workflow to standardize the field information in the structured metadata and generate standardized field names and definitions; a topic classification feature construction module, used to integrate the standardized field information with the macro-structured information of the table to construct classification features that comprehensively characterize the semantic connotation of the data table; a multi-model fusion topic classification module, used to combine a supervised machine learning model and a generative large language model to analyze the classification features and generate table topic classification results; an intelligent decision-making and fusion module, used to dynamically adjudicate and fuse the outputs of the multiple models according to preset rules and model confidence, and determine the final table topic classification result; and a result processing and archiving module, used to persist the final table topic classification results and generate performance statistics reports.

2. The database table topic classification method and system based on the fusion of field standardization results as described in claim 1, characterized in that the data extraction and parsing module is used to parse SQL text to extract structured metadata, specifically including: receiving and analyzing diverse initial data; performing syntax compliance checks on the SQL text; traversing its syntax structure using an ANTLR4-based parser and syntax tree listener; extracting the basic metadata of the target data table, including the table name, the table comment, and the set of all fields in the table, wherein each field in the set, indexed from the first field to the total number of fields, comprises a field name and a field comment; and outputting a structured data object composed of this metadata.

3. The database table topic classification method and system based on the fusion of field standardization results as described in claim 1, characterized in that the hierarchical progressive field standardization module is used to perform hierarchical standardization processing on field information, specifically including:

(1) a dictionary matching and repair unit, used to match and repair the field set based on a preset dictionary. This unit first performs exact matching, selecting the fields whose names exactly match entries in the standard field library and generating an exact match set. For the set of unmatched fields, it then performs word segmentation repair: each field name is split into a word sequence according to preset delimiters, and the forward maximum matching algorithm is used to search the standard word library and reassemble the matched words, generating a standardized field name and definition. If every word is matched, the field is assigned to the repaired set; if a complete match is not found, the field is added to the unprocessed set. The exact match set consists of every field in the field set for which there exists a field in the standard field library that is identical to it in string representation. In this definition, the symbol ∈ denotes "belongs to" a set, ∃ denotes "there exists", and the separator in the set definition means "such that";

(2) a text vectorization unit, used for deep semantic understanding and standardization of fields that cannot be matched by the dictionary. This unit uses a pre-trained BERT model to encode field names and comments, and generates high-dimensional vectors that represent their semantics through average aggregation. All entries of the standard field library are processed in the same way to generate a set of standard semantic vectors, one for each entry in the standard field library;

(3) a semantic retrieval and similarity calculation unit, which calculates the cosine similarity between the query field vector and each standard semantic vector, sorts the standard fields by similarity score, recalls the candidate standardized fields with the most similar semantics, and outputs them to the result set. The cosine similarity between the high-dimensional vector of the field to be queried and the standard semantic vector of the j-th field in the standard field library is the dot product of the two vectors divided by the product of their Euclidean norms;

(4) a generative reasoning semantic completion unit, used to reason about and generate standardized results for difficult fields that have not been standardized by the dictionary matching and repair unit or the semantic retrieval unit. For each difficult field, structured context information containing its field name, its field comment and its table name is constructed and formatted into input prompt text acceptable to the generative language model. The pre-trained generative language model Qwen-15B then performs inference conditioned on the input prompt text and outputs the generated standardized field name and the corresponding Chinese definition. The final standardized result of the difficult field is added to the result set.
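The tiered workflow of this claim can be sketched as follows. This is an illustrative reading, not the patented implementation: the library contents, function names, the similarity threshold 0.85 and the two-dimensional vectors are all hypothetical. Forward maximum matching is applied here to the characters of each delimiter-separated token, which is one possible interpretation of the claim, and the hand-written vectors stand in for mean-pooled BERT embeddings. Fields that no tier resolves are flagged for the generative step rather than sent to a model.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical standard field library: standard name -> definition.
STANDARD_FIELDS: Dict[str, str] = {
    "user_id": "Unique identifier of a user",
    "create_time": "Time at which the record was created",
}
# Hypothetical standard word library: raw fragment -> standard word.
STANDARD_WORDS: Dict[str, str] = {
    "usr": "user", "id": "id", "crt": "create", "tm": "time", "time": "time",
}


def forward_max_match(token: str, words: Dict[str, str]) -> Optional[List[str]]:
    """Segment token left to right, always taking the longest library word."""
    max_len = max(map(len, words), default=0)
    out: List[str] = []
    i = 0
    while i < len(token):
        for size in range(min(max_len, len(token) - i), 0, -1):
            piece = token[i:i + size]
            if piece in words:
                out.append(words[piece])
                i += size
                break
        else:
            return None  # a fragment is missing from the word library
    return out


def repair(name: str, delimiters: str = "_-") -> Optional[str]:
    """Split on delimiters, segment each token, and rejoin standard words."""
    tokens = [name.lower()]
    for d in delimiters:
        tokens = [part for t in tokens for part in t.split(d) if part]
    words: List[str] = []
    for t in tokens:
        matched = forward_max_match(t, STANDARD_WORDS)
        if matched is None:
            return None
        words.extend(matched)
    return "_".join(words) if words else None


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def standardize(name: str, vector: Sequence[float],
                standard_vectors: Dict[str, Sequence[float]],
                min_similarity: float = 0.85) -> Tuple[Optional[str], str]:
    if name in STANDARD_FIELDS:                 # tier 1a: exact match
        return name, "exact"
    repaired = repair(name)                     # tier 1b: segmentation repair
    if repaired is not None:
        return repaired, "repaired"
    scores = {std: cosine(vector, vec) for std, vec in standard_vectors.items()}
    if scores:                                  # tier 2: semantic retrieval
        best = max(scores, key=scores.get)
        if scores[best] >= min_similarity:
            return best, "retrieved"
    return None, "needs_llm"                    # tier 3: generative reasoning


STD_VECS = {"user_id": [1.0, 0.0], "create_time": [0.0, 1.0]}
results = [standardize(n, v, STD_VECS) for n, v in [
    ("user_id", [1.0, 0.0]), ("crt_tm", [0.0, 1.0]),
    ("uid", [0.9, 0.1]), ("xyz", [0.7, 0.7]),
]]
```

Ordering the tiers from cheapest to most expensive means the language model is only consulted for the fields that dictionary lookup and vector retrieval cannot resolve.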

4. The database table topic classification method and system based on the fusion of field standardization results as described in claim 1, characterized in that the topic classification feature construction module is used to construct classification features, specifically including:

(1) a structured information aggregation unit, used to aggregate the macro information of the data table and all standardized field results output by the hierarchical progressive field standardization module to form a structured information set, the structured information set being defined as the union of the table's macro information and the standardized field results, where the symbol ∪ denotes the set union operator;

(2) a comprehensive descriptive text generation unit, used to convert the collected structured information set into a comprehensive descriptive text through a structured splicing function, the text being defined as the output of the splicing function applied to the structured information set;

(3) a multi-dimensional feature vector mapping unit, used to map the comprehensive descriptive text into high-dimensional feature vectors using multiple pre-trained language models, specifically including: using a TF-IDF vectorizer to extract a word-importance feature vector, using a pre-trained word vector model to extract a semantic feature vector, and concatenating these two feature vectors to form the final multi-dimensional feature vector. That is, the final feature vector is the concatenation of the output of the TF-IDF vectorizer's model function and the output of the pre-trained word vector model function, each applied to the comprehensive descriptive text.

5. The database table topic classification method and system based on the fusion of field standardization results as described in claim 1, characterized in that the multi-model fusion topic classification module is used to perform multi-model analysis on the classification features, specifically including:

(1) a supervised classification unit, used to analyze the high-dimensional feature vector with a pre-trained random forest classifier and generate preliminary classification candidates with confidence scores. The prediction function of the random forest classifier maps the feature vector to a set of (category, confidence score) pairs, each consisting of the k-th category and the confidence score for that category, where k runs from 1 to the number of categories output by the classifier;

(2) a generative reasoning unit, used to input the comprehensive descriptive text into the generative large language model Qwen-15B and use it to perform deep semantic understanding and reasoning, generating results that include the topic classification, a confidence level and potential uses. The inference function of the large language model maps the comprehensive descriptive text to the table topic classification result inferred by the model, the confidence level of that classification result, and the potential uses of the data table.

6. The database table topic classification method and system based on the fusion of field standardization results as described in claim 1, characterized in that the intelligent decision-making and fusion module is used to optimize and determine the final classification result, specifically including: setting decision logic such that, when the confidence level output by the generative reasoning unit is lower than a preset threshold and the supervised classification unit has a classification result, the output of the supervised classification unit is adopted as the final classification result and a potential-use description is added; otherwise, the output of the generative reasoning unit is adopted as the final classification result; wherein the potential-use information is always taken from the output of the generative reasoning unit.

7. The database table topic classification method and system based on the fusion of field standardization results as described in claim 1, characterized in that the result processing and archiving module performs formatted storage and performance analysis on the final classification results, specifically including: (1) saving the final topic classification result, the decision path and the key features used for each table as a structured JSON file; (2) extracting the core information from the JSON file and converting it into a CSV format file; and (3) generating a performance statistics report, which includes the accuracy of the topic classification and the processing performance data of each unit in the hierarchical progressive field standardization module.