Data standard determination methods, apparatus, equipment, media and computer program products
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INDUSTRIAL AND COMMERCIAL BANK OF CHINA
- Filing Date
- 2022-11-28
- Publication Date
- 2026-06-30
Figure CN116304851B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of big data technology, and in particular to a method, apparatus, computer equipment, storage medium and computer program product for determining data standards.
Background Technology
[0002] With the development of information technology, when table structures are added or modified during project initiation or system development, it is necessary to determine whether each piece of data (data item or field) has a corresponding data standard. For data that needs to be associated with a data standard, the most suitable target data standard must first be determined from the existing data standards.
[0003] In traditional techniques, professional technicians typically select the target data standard for each piece of data manually. However, because thousands of data standards must be matched against a massive number of fields, this requires a significant amount of manual work, and the efficiency of determining target data standards is low.
Summary of the Invention
[0004] Therefore, it is necessary to provide a data standard determination method, apparatus, computer equipment, computer-readable storage medium, and computer program product to address the aforementioned technical problems.
[0005] In a first aspect, this application provides a method for determining data standards. The method includes:
[0006] Acquire the data to be processed; the data to be processed is data for which the data standard is yet to be determined.
[0007] Identify the feature data corresponding to the data to be processed;
[0008] The feature data is input into a pre-trained data standard recognition model. The data standard recognition model determines the candidate data standards for the data to be processed and the matching degree between the data to be processed and the candidate data standards from the pre-stored data standards.
[0009] Based on the matching degree, the target data standard for the data to be processed is determined from the candidate data standards.
[0010] In one embodiment, the target data standard for the data to be processed is determined from the candidate data standards based on the matching degree, including:
[0011] Receive a determination instruction for the candidate data standards; the determination instruction is triggered based on the matching degree.
[0012] Based on the determination instruction, the target data standard for the data to be processed is determined from the candidate data standards.
[0013] In one embodiment, the target data standard for the data to be processed is determined from the candidate data standards based on the matching degree, including:
[0014] From the candidate data standards, select the candidate data standards whose matching degree meets the preset matching degree conditions, and use them as the candidate data standards to be selected;
[0015] Based on the matching degree of the candidate data standards to be selected, the target data standard for the data to be processed is determined from the candidate data standards to be selected.
[0016] In one embodiment, determining the target data standard for the data to be processed from the candidate data standards includes:
[0017] If at least one matching degree meets the preset matching degree condition, the target data standard for the data to be processed is determined from the candidate data standards.
[0018] The method also includes:
[0019] If none of the matching degrees meets the preset matching degree condition, it is determined that the data to be processed does not have a corresponding target data standard.
[0020] In one embodiment, the pre-trained data standard recognition model is trained in the following manner:
[0021] Acquire sample data and the real data standard of the sample data;
[0022] The sample data and the real data standard of the sample data are divided to obtain the training sample set and the validation sample set;
[0023] The training sample set is used to train the data standard recognition model to be trained, and the trained data standard recognition model is obtained;
[0024] The trained data standard recognition model is validated using the validation sample set to obtain a validation result;
[0025] If the validation result is qualified, the trained data standard recognition model is determined as the pre-trained data standard recognition model.
[0026] In one embodiment, acquiring the data to be processed includes:
[0027] Obtain the data to be preprocessed;
[0028] Data cleaning is performed on the data to be preprocessed to obtain cleaned data to be preprocessed.
[0029] Data transformation is performed on the cleaned data to be preprocessed to obtain the data to be processed.
[0030] In a second aspect, this application also provides a data standard determination apparatus. The apparatus includes:
[0031] The data acquisition module is used to acquire data to be processed; the data to be processed is data for which the data standard is yet to be determined.
[0032] The data determination module is used to determine the feature data corresponding to the data to be processed;
[0033] The matching degree determination module is used to input the feature data into a pre-trained data standard recognition model, and determine the candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from the pre-stored data standard through the data standard recognition model;
[0034] The standard determination module is used to determine the target data standard of the data to be processed from the candidate data standards based on the matching degree.
[0035] In a third aspect, this application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to perform the following steps:
[0036] Acquire the data to be processed; the data to be processed is the data whose data standard is yet to be determined; determine the feature data corresponding to the data to be processed; input the feature data into a pre-trained data standard recognition model, and use the data standard recognition model to determine the candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from the pre-stored data standards; based on the matching degree, determine the target data standard of the data to be processed from the candidate data standards.
[0037] In a fourth aspect, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, performs the following steps:
[0038] Acquire the data to be processed; the data to be processed is the data whose data standard is yet to be determined; determine the feature data corresponding to the data to be processed; input the feature data into a pre-trained data standard recognition model, and use the data standard recognition model to determine the candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from the pre-stored data standards; based on the matching degree, determine the target data standard of the data to be processed from the candidate data standards.
[0039] In a fifth aspect, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, performs the following steps:
[0040] Acquire the data to be processed; the data to be processed is the data whose data standard is yet to be determined; determine the feature data corresponding to the data to be processed; input the feature data into a pre-trained data standard recognition model, and use the data standard recognition model to determine the candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from the pre-stored data standards; based on the matching degree, determine the target data standard of the data to be processed from the candidate data standards.
[0041] The aforementioned data standard determination method, apparatus, computer equipment, storage medium, and computer program product acquire data to be processed (data whose data standard is yet to be determined), determine the feature data corresponding to the data to be processed, input the feature data into a pre-trained data standard recognition model, and use the data standard recognition model to determine, from the pre-stored data standards, candidate data standards for the data to be processed and the matching degree between the data to be processed and each candidate data standard. Based on the matching degree, the target data standard for the data to be processed is determined from the candidate data standards. Because the candidate data standards and their matching degrees are determined by the model from a massive pool of pre-stored data standards rather than selected manually, this scheme improves the efficiency and accuracy of determining the target data standard for the data.
Attached Figure Description
[0042] Figure 1 is a flowchart illustrating a data standard determination method in one embodiment;
[0043] Figure 2 is a flowchart illustrating the data standard determination method in another embodiment;
[0044] Figure 3 is a schematic diagram of the internal devices of a terminal in one embodiment;
[0045] Figure 4 is a schematic diagram of the work units included in an intelligent standardization device in one embodiment;
[0046] Figure 5 is a structural block diagram of a data standard determination device in one embodiment;
[0047] Figure 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0048] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0049] In one embodiment, as shown in Figure 1, a method for determining data standards is provided. This embodiment illustrates the application of this method to a terminal, and includes the following steps:
[0050] Step S101: Obtain the data to be processed.
[0051] In this step, the data to be processed is the data whose data standard needs to be determined, such as data items (fields).
[0052] Specifically, the terminal acquires the data to be processed.
[0053] Step S102: Determine the feature data corresponding to the data to be processed.
[0054] In this step, the feature data can be the table name, Chinese field name, English field name, field data type, field length, field precision, and Chinese name of the data standard of the data to be processed.
[0055] Specifically, the terminal determines the feature data corresponding to the data to be processed based on the data to be processed.
[0056] Step S103: Input the feature data into the pre-trained data standard recognition model, and use the data standard recognition model to determine the candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from the pre-stored data standards.
[0057] In this step, the data standard recognition model can be a BERT (Bidirectional Encoder Representations from Transformers) language model; the data standards can be the data standards that each data item (field) needs to be associated with when a table structure is added or modified during project initiation or system development; the matching degree can be expressed as a percentage.
[0058] Specifically, the terminal inputs feature data into a pre-trained data standard recognition model. The data standard recognition model determines candidate data standards for the data to be processed from the pre-stored data standards, as well as the matching degree between the candidate data standards and the data to be processed, so that the terminal can obtain the candidate data standards and the matching degree corresponding to the candidate data standards through the data standard recognition model.
[0059] Step S104: Based on the matching degree, determine the target data standard for the data to be processed from the candidate data standards.
[0060] In this step, the target data standard can be the data standard that best matches the data to be processed.
[0061] Specifically, the terminal determines the target data standard for the data to be processed from the candidate data standards based on the matching degree corresponding to the candidate data standards.
[0062] In the aforementioned data standard determination method, the data to be processed is acquired, which is the data whose standard needs to be determined. Feature data corresponding to the data to be processed is determined, and this feature data is input into a pre-trained data standard recognition model. The data standard recognition model then determines candidate data standards for the data to be processed and the matching degree between the data to be processed and the candidate data standards from a pre-stored pool of data standards. Based on the matching degree, the target data standard for the data to be processed is determined from the candidate data standards. This scheme acquires the data to be processed, determines the corresponding feature data based on the data to be processed, inputs the feature data into the data standard recognition model, and uses the data standard recognition model to determine candidate data standards for the data to be processed and the matching degree corresponding to each candidate data standard from a pre-stored pool of massive data standards. Based on the matching degree, the target data standard for the data to be processed is determined from the candidate data standards, thereby improving the efficiency and accuracy of determining the target data standard.
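The flow of steps S101 to S104 can be sketched as follows. This is only an illustrative outline, not the implementation described in this application: the names `Candidate`, `extract_features`, `toy_model` and `determine_standard` are hypothetical, the dictionary keys are assumed, and a plain string-similarity score from Python's standard library stands in for the trained data standard recognition model so that the sketch runs without external dependencies.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Candidate:
    standard_name: str      # name of a pre-stored data standard
    matching_degree: float  # 0.0-1.0; can be displayed as a percentage


def extract_features(field: dict) -> str:
    """S102: join descriptive attributes of a field into one text string."""
    keys = ("table_name", "cn_name", "en_name", "data_type")  # assumed keys
    return " ".join(str(field.get(k, "")) for k in keys).lower()


def toy_model(features: str, standards: list[str], top_k: int = 3) -> list[Candidate]:
    """S103 stand-in: score every pre-stored standard against the features.

    A real system would call the trained recognition model here; string
    similarity is used only as a placeholder.
    """
    scored = [
        Candidate(s, SequenceMatcher(None, features, s.lower()).ratio())
        for s in standards
    ]
    scored.sort(key=lambda c: c.matching_degree, reverse=True)
    return scored[:top_k]


def determine_standard(field: dict, standards: list[str]) -> Candidate:
    """S101-S104: return the candidate with the highest matching degree."""
    candidates = toy_model(extract_features(field), standards)
    return candidates[0]
```

Calling `determine_standard` with a field record and the list of pre-stored standard names returns the best-scoring candidate together with its matching degree; the embodiments that follow refine how the final choice is made.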
[0063] In one embodiment, the step S104 above, which determines the target data standard of the data to be processed from the candidate data standards based on the matching degree, specifically includes: receiving a determination instruction for the candidate data standards; and determining the target data standard of the data to be processed from the candidate data standards based on the determination instruction.
[0064] In this embodiment, the determination instruction is an instruction triggered based on the matching degree, such as an instruction triggered by the user to select a candidate data standard based on the matching degree.
[0065] Specifically, after the terminal determines the candidate data standards for the data to be processed and the matching degree of each candidate data standard, it displays the candidate data standards and the corresponding matching degrees for the user's reference. Based on the displayed candidate data standards and matching degrees, the user can trigger a determination instruction that selects one of the candidate data standards. The terminal receives and responds to the determination instruction and, according to it, determines the target data standard for the data to be processed from the candidate data standards.
[0066] The technical solution of this embodiment determines the target data standard of the data to be processed from the candidate data standards according to the determination instruction, thereby improving the accuracy of the determined target data standard.
[0067] In one embodiment, step S104, which determines the target data standard of the data to be processed from the candidate data standards based on the matching degree, specifically includes: determining candidate data standards whose matching degree meets the preset matching degree condition from the candidate data standards, and using them as candidate data standards to be selected; and determining the target data standard of the data to be processed from the candidate data standards to be selected based on the matching degree of the candidate data standards to be selected.
[0068] In this embodiment, the preset matching degree condition can be a pre-set condition that the matching degree is greater than the matching degree threshold, such as a matching degree greater than 80%.
[0069] Specifically, after the terminal determines the candidate data standards for the data to be processed and the matching degree corresponding to each candidate data standard, it selects the candidate data standards whose matching degree meets the preset matching degree conditions from the candidate data standards and uses them as candidate data standards to be selected. The candidate data standards to be selected can be displayed. Based on the matching degree of the candidate data standards to be selected, the candidate data standard with the highest matching degree can be selected from the candidate data standards to be selected and used as the target data standard for the data to be processed.
[0070] The technical solution of this embodiment first determines the candidate data standards to be selected, and then determines the target data standard of the data to be processed from the candidate data standards, thereby improving the accuracy of the target data standard of the data.
[0071] In one embodiment, step S104 of determining the target data standard of the data to be processed from the candidate data standards specifically includes: determining the target data standard of the data to be processed from the candidate data standards when at least one matching degree meets the preset matching degree condition. The method further includes: determining that the data to be processed does not have a corresponding target data standard when none of the matching degrees meets the preset matching degree condition.
[0072] Specifically, after the terminal determines the candidate data standards for the data to be processed and the matching degree corresponding to each candidate data standard, it determines whether the matching degree corresponding to each candidate data standard meets the preset matching degree condition. If at least one matching degree meets the preset matching degree condition, the target data standard for the data to be processed is determined from the candidate data standards. If none of the matching degrees meet the preset matching degree condition, it is determined that the data to be processed does not have a corresponding target data standard.
[0073] In this embodiment, not all data to be processed has a corresponding target data standard. To identify such data, the terminal determines that the data to be processed does not have a corresponding target data standard when none of the matching degrees meets the preset matching degree condition. This helps to improve the accuracy and efficiency of determining whether the data has a corresponding target data standard.
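The threshold-based selection described in the two preceding embodiments can be sketched as follows. The function name `select_target` and the representation of candidates as a name-to-score dictionary are assumptions made for illustration; the 0.80 threshold mirrors the "greater than 80%" example given above and would be configurable in practice.

```python
from typing import Optional

MATCH_THRESHOLD = 0.80  # the 80% example threshold from the description


def select_target(candidates: dict[str, float],
                  threshold: float = MATCH_THRESHOLD) -> Optional[str]:
    """Return the best candidate above the threshold, or None if there is none."""
    # Keep only candidates whose matching degree meets the preset condition.
    shortlisted = {name: deg for name, deg in candidates.items() if deg > threshold}
    if not shortlisted:
        # No matching degree meets the condition: no target data standard.
        return None
    # Among the shortlisted candidates, pick the highest matching degree.
    return max(shortlisted, key=shortlisted.get)
```

Returning `None` rather than the best low-scoring candidate keeps fields that have no suitable data standard from being associated with an unrelated one.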
[0074] In one embodiment, the pre-trained data standard recognition model is trained in the following manner, specifically including: acquiring sample data and the real data standard of the sample data; dividing the sample data and the real data standard of the sample data to obtain a training sample set and a validation sample set; using the training sample set to train the data standard recognition model to be trained to obtain the trained data standard recognition model; using the validation sample set to validate the trained data standard recognition model to obtain a validation result; and if the validation result is qualified, determining the trained data standard recognition model as the pre-trained data standard recognition model.
[0075] In this embodiment, the sample data can be sample data to be processed that is used for training; the real data standard of the sample data can be the real target data standard corresponding to the sample data; the training sample set can be a portion of the sample data and the real data standard of that portion of the sample data; the validation sample set can be another portion of the sample data and the real data standard of that portion of the sample data; the validation result can be the accuracy, precision and / or recall of the trained data standard recognition model, and can be expressed as a percentage; the validation result is qualified when it is greater than a preset qualified threshold.
[0076] Specifically, the terminal acquires sample data and the real data standard of the sample data. The sample data and the real data standard of the sample data are randomly divided in a ratio of 7:3 to obtain a training sample set and a validation sample set. The training sample set is used to train the data standard recognition model to be trained, and the trained data standard recognition model is obtained. The validation sample set is used to validate the trained data standard recognition model and obtain the validation result. If the validation result is qualified, the trained data standard recognition model is determined as the pre-trained data standard recognition model. If the validation result is unqualified, the training sample set is used to train the data standard recognition model again, and this is repeated until the validation result is qualified.
[0077] The technical solution of this embodiment trains the data standard recognition model using a training sample set and validates the trained data standard recognition model using a validation sample set. This helps to obtain a more accurate pre-trained data standard recognition model, which in turn helps to improve the accuracy of the target data standard determined in the subsequent process.
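The 7:3 division and the qualification check can be sketched as follows. The function names are hypothetical, samples are assumed to be `(features, real_standard)` pairs, accuracy is used as the validation result, and the 0.9 qualified threshold is an arbitrary placeholder because this application does not state a value.

```python
import random


def split_samples(samples, train_ratio=0.7, seed=0):
    """Randomly divide labelled samples into training and validation sets (7:3)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def validate(predict, validation_set):
    """Accuracy of `predict` on (features, real_standard) pairs."""
    correct = sum(1 for features, truth in validation_set if predict(features) == truth)
    return correct / len(validation_set)


def is_qualified(validation_result, qualified_threshold=0.9):
    """Validation passes when the result exceeds the preset threshold (assumed 0.9)."""
    return validation_result > qualified_threshold
```

A training loop built on these helpers would retrain the model and call `validate` again whenever `is_qualified` returns `False`.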
[0078] In one embodiment, step S101 of obtaining the data to be processed specifically includes: obtaining the data to be preprocessed; performing data cleaning on the data to be preprocessed to obtain cleaned data; and performing data transformation on the cleaned data to obtain the data to be processed.
[0079] In this embodiment, the data to be preprocessed is the raw data that has not yet undergone preprocessing. Data cleaning refers to resolving data inconsistency problems by supplementing missing values, smoothing noisy data, and deleting outliers. For example, data cleaning should follow these rules: the data standard mapping relationships must be up to date, so that outdated or non-existent standards are avoided; the text content must not contain special symbols, such as punctuation marks; data to be processed that is missing important information must be deleted or supplemented, for example by deleting fields whose missing data type cannot be supplemented and by supplementing fields with missing Chinese names based on their English names; and further cleaning rules should be added according to business meaning and requirements, such as not using spare fields as data to be processed. Data transformation refers to converting data from one form to another to make it more suitable for data mining and modeling. This step takes a cleaned dataset as input and outputs a transformed dataset. Data transformation methods include, but are not limited to, the following three categories:
- Normalization: scaling attribute data proportionally so that it falls into a specific small interval; methods include min-max normalization, z-score normalization, decimal scaling normalization, etc.;
- Discretization: replacing continuous / numerical data with interval labels or concept labels; methods include binning discretization, clustering discretization, etc., where binning discretization is further divided into equal-interval binning, equal-frequency binning, optimal binning, etc.;
- Attribute construction: constructing new attributes from given attributes and adding them to the dataset.
[0080] Specifically, the terminal acquires the data to be preprocessed, performs data cleaning on the data to be preprocessed, obtains the cleaned data to be preprocessed, and performs data transformation on the cleaned data to obtain the data to be processed.
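The cleaning rules listed above can be sketched for a single field record as follows. This is a minimal illustration under several assumptions: the record keys (`en_name`, `cn_name`, `data_type`), the `glossary` lookup used to supplement a Chinese name from an English name, and the convention that spare fields have names starting with "spare" are all hypothetical and are not specified in this application.

```python
import re
from typing import Optional

# Anything that is not a word character or whitespace counts as a special symbol here.
_SPECIAL = re.compile(r"[^\w\s]")


def clean_field(field: dict, glossary: dict[str, str]) -> Optional[dict]:
    """Apply the cleaning rules to one field record; return None to drop it."""
    # Rule: a missing data type cannot be supplemented, so the field is deleted.
    if not field.get("data_type"):
        return None
    # Rule (business-specific, assumed naming): spare fields are not processed.
    if (field.get("en_name") or "").lower().startswith("spare"):
        return None
    cleaned = dict(field)
    # Rule: text content must not contain special symbols such as punctuation.
    for key in ("cn_name", "en_name"):
        cleaned[key] = _SPECIAL.sub("", cleaned.get(key) or "").strip()
    # Rule: supplement a missing Chinese name based on the English name,
    # falling back to the English name itself when the glossary has no entry.
    if not cleaned["cn_name"]:
        cleaned["cn_name"] = glossary.get(cleaned["en_name"], cleaned["en_name"])
    return cleaned
```

Records for which `clean_field` returns `None` are dropped; the remaining records are passed on to data transformation.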
[0081] The technical solution of this embodiment, by performing data cleaning and data transformation processing, helps to obtain data to be processed that meets the format requirements, thereby improving the accuracy of the target data standard determined in the subsequent process.
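Two of the normalization methods and equal-interval binning mentioned in this embodiment can be sketched as follows. The function names are illustrative, and applying them to numeric attributes such as field length is an assumption; this application does not state which attributes are transformed or which method is used.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Scale values linearly into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def z_score_normalize(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def equal_width_bins(values, n_bins):
    """Replace each value with the index of its equal-interval bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for constant input
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]
```

For example, `min_max_normalize([10, 20, 30])` maps the three values to 0.0, 0.5 and 1.0, and `equal_width_bins([0, 5, 10], 2)` assigns them to bins 0, 1 and 1.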
[0082] The following example illustrates the data standard determination method provided in this application. This example demonstrates the application of this method to a terminal, and the main steps include:
[0083] Step 1: The terminal acquires sample data and the real data standard of the sample data.
[0084] Step 2: The terminal divides the sample data and the real data standard of the sample data to obtain the training sample set and the validation sample set.
[0085] Step 3: The terminal uses the training sample set to train the data standard recognition model to be trained and obtains the trained data standard recognition model.
[0086] Step 4: The terminal uses the validation sample set to validate the trained data standard recognition model and obtains the validation result.
[0087] Step 5: If the validation result is qualified, the terminal determines the trained data standard recognition model as the pre-trained data standard recognition model.
[0088] Step 6: The terminal acquires the data to be preprocessed.
[0089] Step 7: The terminal performs data cleaning on the data to be preprocessed to obtain the cleaned data.
[0090] Step 8: The terminal performs data transformation on the cleaned data to obtain the data to be processed.
[0091] Step 9: The terminal determines the feature data corresponding to the data to be processed.
[0092] Step 10: The terminal inputs the feature data into the pre-trained data standard recognition model. The data standard recognition model then determines, from the pre-stored data standards, the candidate data standards for the data to be processed and the matching degree between the data to be processed and each candidate data standard.
[0093] Step 11: The terminal receives a determination instruction for the candidate data standards and, based on the determination instruction, determines the target data standard for the data to be processed from the candidate data standards. Alternatively, the terminal selects, from the candidate data standards, those whose matching degree meets the preset matching degree condition as the candidate data standards to be selected, and determines the target data standard for the data to be processed from them based on their matching degrees. In either case, if at least one matching degree meets the preset matching degree condition, the terminal determines the target data standard for the data to be processed from the candidate data standards; if none of the matching degrees meets the preset matching degree condition, the terminal determines that the data to be processed does not have a corresponding target data standard.
[0094] Here, the data to be processed is data whose data standard is yet to be determined, and the determination instruction is an instruction triggered based on the matching degree.
[0095] The technical solution of this embodiment obtains the data to be processed, determines the feature data corresponding to the data to be processed based on the data to be processed, inputs the feature data into the data standard recognition model, and determines the candidate data standard of the data to be processed and the matching degree of each candidate data standard from the pre-stored massive data standards through the data standard recognition model. Based on the matching degree, the target data standard of the data to be processed is determined from the candidate data standards, thereby improving the efficiency and accuracy of determining the target data standard of the data.
[0096] The following application example illustrates the data standard determination method provided in this application. This application example demonstrates the application of this method to a terminal. As shown in Figure 2, the main steps include:
[0097] The first step is to determine the key features of the intelligent data standard implementation model. The terminal identifies "table name", "Chinese field name", "English field name", "field data type", "field length" and "field precision" as six categories of influencing factors. These can be adjusted and supplemented as needed. The main function of each influencing factor is to select the most suitable data standard from more than 3,000 data standards for implementation. Therefore, the Chinese name of the data standard is chosen as the output feature of the model.
[0098] The second step involves processing the existing data of the data standard implementation in the terminal to form a sample set of characteristic indicators: The terminal uses the metadata management system to record the existing data of the data standard implementation, and combines relevant information such as "table name", "Chinese field name", and "English field name" to process the values of the corresponding characteristic indicators for each existing data standard implementation record. The processed indicator record set is then used as a sample set for subsequent steps.
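One way to turn an existing standard implementation record into a labelled sample is sketched below. The record keys, the function names, and the use of " [SEP] " as a separator between the six influencing factors are assumptions made for illustration; this application does not specify how the feature values are encoded as model input.

```python
# Assumed keys for the six influencing factors named in step one.
FEATURE_KEYS = ("table_name", "cn_name", "en_name", "data_type", "length", "precision")


def build_model_input(record: dict) -> str:
    """Join the six influencing factors into one short text for a classifier."""
    return " [SEP] ".join(str(record.get(k, "")) for k in FEATURE_KEYS)


def build_sample(record: dict) -> tuple[str, str]:
    """Pair the input text with the label: the Chinese name of the data standard."""
    return build_model_input(record), record["standard_cn_name"]


def build_sample_set(records: list[dict]) -> list[tuple[str, str]]:
    """Convert all existing implementation records into (text, label) samples."""
    return [build_sample(r) for r in records]
```

Each resulting pair consists of a short text describing the field and the Chinese name of the data standard it was associated with, which matches the short-text multi-classification framing used later in this example.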
[0099] The third step is for the terminal to process and divide the samples (training set and validation set): The terminal randomly selects records from the indicator sample set formed in step two at a ratio of 7:3 to serve as the training sample set and the validation sample set, respectively, for use in the subsequent construction of the intelligent model.
[0100] The fourth step involves determining the model algorithm on the terminal: All intelligent data standardization models built on the terminal analyze and predict the selection of data standards for subsequent standardization based on historical standardization records and related information. Essentially, this is a supervised classification problem. Since it involves text content, it is a short-text multi-classification problem. Therefore, the modeling approach is to establish an automated standard mapping model based on deep learning natural language processing technology. Through a suitable amount of manual mapping (ensuring the accuracy of business operations), the mapped table fields and standards are used as the training sample set to train the automated classification mapping model. Based on past practical experience and theory, the BERT algorithm is considered suitable for this scenario. The BERT (Bidirectional Encoder Representations from Transformers) model is a two-stage NLP (Natural Language Processing) model. First, a language model is trained on a large external corpus (pre-training). Then, the trained language model is used for transfer learning to complete specific downstream NLP tasks (fine-tuning). In this scenario, the mapping of data standards is such a downstream task.
[0101] The fifth step involves model training, validation, and optimization on the terminal: Using the training and validation sample sets generated in step three, the terminal trains, validates, and optimizes the model based on the BERT algorithm determined in step four. After multiple iterations, a data standard intelligent implementation model that meets the requirements of actual use is obtained. Optimization mainly consists of adjusting parameters in the model network, such as the learning rate and the penalty weights, to improve the learning effect of the network; the model is trained iteratively until it gradually achieves the expected effect. The model is tested on the validation sample set, and the accuracy, precision, and recall of its outputs are checked against their required levels to decide whether further optimization is needed.
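The specification does not state how the penalty weights are computed. As one possible illustration, the sketch below assumes weights inversely proportional to the number of samples of each data standard (a common "balanced" heuristic); the function name `penalty_weights` and the example standard names are hypothetical.

```python
from collections import Counter


def penalty_weights(labels):
    """Return one weight per data standard, larger for rarer standards.

    Assumed rule: weight = total_samples / (number_of_standards * count).
    This formula is an illustration, not the rule used in the application.
    """
    counts = Counter(labels)
    total = len(labels)
    n_standards = len(counts)
    return {label: total / (n_standards * count) for label, count in counts.items()}


# Hypothetical long-tail distribution of samples over three standards.
labels = ["customer_name"] * 6 + ["account_number"] * 3 + ["branch_code"] * 1
weights = penalty_weights(labels)
```

Under this assumed rule, a standard with few samples receives a larger weight, so misclassifying it costs more during training.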
[0102] Step 6: The terminal deploys the model and provides services: The terminal deploys and releases the intelligent standardization model of the data standard completed in step 5, and provides real-time calling services based on the online interface.
[0103] Step 7: The terminal performs data standardization operations based on the model's recommendations. The terminal obtains relevant information such as table name, Chinese field name, English field name, field data type, field length, and field precision entered during table structure maintenance. During data standardization, the terminal uses the model to calculate the recommended data standards (by Chinese name) from this information. Finally, after receiving the user's instruction selecting a data standard, the terminal adds the selection result to the stock (existing) sample set for continuous optimization and improvement of the intelligent data standardization model.
[0104] As shown in Figure 3, the terminal may include a data standard management system, a data standard knowledge base management system, an intelligent standardization model building system, and an intelligent standardization device. Data standard management relies on a data standard management system, such as that of a bank, which allows users to retrieve, query, and manage data standards. This system defines the list of data standards that need to be mapped, and thus the range of standards that can be mapped when the final model is applied. The data standard knowledge base management system is a metadata and software asset management platform. This platform provides centralized registration and management of the data resources and software assets used in business system development, and supports all business systems in adding or modifying table structures. The historical associations between data standards and fields recorded on the platform are used as training samples for model building, with the accuracy of the historical data ensured. The intelligent standardization model building system builds a multi-class machine learning model for natural language processing based on the BERT algorithm. After the intelligent standardization model is trained and validated, it is acquired and loaded by the intelligent standardization device. As shown in Figure 4, the intelligent standardization device includes operational units; specifically, it may include a modeling initiation unit, a data processing unit, a model training unit, and a model evaluation unit. The modeling initiation unit covers two parts: indicator selection and dataset construction. Indicator selection (feature selection): The mapping of data standards depends on the relevant information of the data items (fields). Based on expert experience, the indicators that can be used for modeling are (as shown in Table 1 below):
[0105] Table 1
[0106]
[0107] Table name, field Chinese name, field English name, field data type, field length, field precision, and data standard Chinese name.
Dataset construction: The modeling dataset includes a training set and a test set. Each record in the dataset contains at least seven variables: the data standard Chinese name as the target variable (the "Y variable"), and the table name, field Chinese name, field English name, field data type, field length, and field precision as independent variables (the "X variables"). The data structure is shown in Table 1. The following requirements should be considered when constructing the dataset:
Dataset sample diversity: The dataset must contain all data standards that need to be mapped. Samples should cover multiple topics and business domains to improve the generalization performance of the model. Some fields have no corresponding data standard, so the dataset also needs to include some samples without a corresponding standard. The target variable for such samples can be assigned a specific number indicating that there is no corresponding standard; for such fields, the subsequent intelligent standardization will recommend that no standardization is needed.
Dataset sample size and class balance: In theory, the ideal condition for training the model is an appropriate number of samples for each standard, of similar magnitude across standards. In practice, however, the sample size per standard is likely to follow a long-tail distribution, leading to class imbalance (some standards may correspond to only a few fields in reality).
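The seven-variable record structure described above, including the special target value for fields without a standard, might be represented as in the following sketch. The class name `FieldSample`, the attribute names, the sentinel value `NO_STANDARD`, and the example values are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical sentinel for "no corresponding data standard"; the text
# only says that a specific number can be assigned for such samples.
NO_STANDARD = -1


@dataclass
class FieldSample:
    table_name: str
    field_cn_name: str
    field_en_name: str
    data_type: str
    length: int
    precision: int
    standard_cn_name: Optional[str] = None  # Y variable; None if unmapped

    def features(self) -> Tuple:
        """Return the six X variables."""
        return (self.table_name, self.field_cn_name, self.field_en_name,
                self.data_type, self.length, self.precision)

    def target(self):
        """Return the Y variable, or the sentinel when no standard applies."""
        return self.standard_cn_name if self.standard_cn_name else NO_STANDARD


# Illustrative records: one mapped field and one field without a standard.
mapped = FieldSample("T_CUSTOMER", "客户名称", "CUST_NAME", "VARCHAR", 60, 0, "客户名称")
unmapped = FieldSample("T_CUSTOMER", "备用字段", "RESERVED_1", "VARCHAR", 20, 0)
```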
Therefore, if class imbalance exists, the following two rules should be followed when preparing the dataset samples. For samples with standard mappings, automatically count the sample size of each standard, set weight rule parameters according to the sample size distribution to obtain the penalty weight of each standard, and automatically adjust the sample distribution according to the penalty weights to ensure class balance. For samples without standard mappings, the sample size can be prepared according to the overall real-world ratio: for example, if the ratio of fields with standards to fields without standards is approximately 3:7, then the recommended ratio of samples with and without standards in the dataset is also 3:7.
The data processing unit cleans and transforms the dataset to improve the quality of the modeling dataset and ensure the effectiveness of the intelligent standardization model. The model training unit is used to train the BERT language model. Based on the trained language model, specific downstream NLP tasks can be learned and trained, and the learning effect of the network can be optimized by adjusting the learning rate and penalty weight parameters in the model network.
The model evaluation unit is used to evaluate the model. For model performance, the ROC (Receiver Operating Characteristic) curve and the following metrics can be examined: Accuracy = (number of correct judgments) / (total number of samples in the test set); Precision = (number of correct judgments among table fields with corresponding standards) / (total number of table fields judged to have corresponding data standards); Recall = (number of correct judgments among table fields with corresponding standards) / (total number of table fields that actually have corresponding data standards).
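The three metrics above can be computed as in the following sketch. It assumes that the precision denominator counts the fields for which the model predicted a standard, and that `None` marks "no corresponding standard"; the function name `evaluate` and the sample labels are hypothetical.

```python
def evaluate(y_true, y_pred, no_standard=None):
    """Compute accuracy, precision, and recall for standard mapping.

    A "hit" is a field for which a standard was predicted and the
    prediction equals the true standard.
    """
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    hits = sum(t == p for t, p in pairs if p != no_standard)
    predicted_with_standard = sum(p != no_standard for _, p in pairs)
    actual_with_standard = sum(t != no_standard for t, _ in pairs)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = hits / predicted_with_standard if predicted_with_standard else 0.0
    recall = hits / actual_with_standard if actual_with_standard else 0.0
    return accuracy, precision, recall


# Hypothetical validation labels: None means "no corresponding standard".
y_true = ["A", "B", None, "A", None]
y_pred = ["A", "C", None, "A", "B"]
accuracy, precision, recall = evaluate(y_true, y_pred)
```

In this example three of five judgments are correct, two of the four predicted standards are right, and two of the three fields that actually have a standard are found.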
A well-performing model finds as many of the fields that can be mapped to data standards as possible (high recall) and maps them to standards as accurately as possible (high precision). It is typically difficult to maximize all of these metrics simultaneously; thresholds and model parameters can be adjusted according to the metrics prioritized by the business stakeholders to optimize the model results.
Intelligent standardization implementation: The device connects the intelligent standardization model to the data standard management system. The model's input is the information related to the data items to be standardized (see Table 1). The model's output is a set of possible standards: if the score (matching degree) of the Top 1 (highest-scoring) standard is below the threshold, a null value is output, indicating that there is no matching standard; otherwise, the Top 3 (three highest-scoring) standards whose matching degree is above the threshold are output, and the Top 1 standard is typically chosen as the final matching standard. (Note: The threshold can be understood as a confidence / matching degree threshold. For example, if the threshold is set to 80%, and a field has a 95% confidence level for standard A and a 75% confidence level for standard B, the model recommends associating the field with standard A. If the confidence level of every candidate standard is below the threshold, the field is considered not to be associated with any data standard.)
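The threshold-based selection described above can be sketched as follows. The function name `recommend_standard`, the dictionary input format, and the treatment of a score exactly equal to the threshold as passing are assumptions; the 80% threshold and the scores for standards A and B come from the note.

```python
from typing import Dict, List, Optional, Tuple


def recommend_standard(
    scores: Dict[str, float], threshold: float = 0.8, top_k: int = 3
) -> Tuple[Optional[str], List[Tuple[str, float]]]:
    """Return (target standard, candidates at or above the threshold).

    The target is None when even the highest score is below the threshold,
    meaning the field is not associated with any data standard.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    candidates = [(name, score) for name, score in ranked[:top_k] if score >= threshold]
    target = candidates[0][0] if candidates else None
    return target, candidates


# Example from the note: 95% for standard A, 75% for standard B.
target, candidates = recommend_standard({"standard A": 0.95, "standard B": 0.75})

# Every score below the threshold: no standard is recommended.
no_target, no_candidates = recommend_standard({"standard A": 0.60, "standard B": 0.50})
```

Returning the filtered candidates alongside the Top 1 choice leaves room for a user to confirm or override the recommendation, as in the embodiments where a determination instruction is received.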
[0108] The technical solution in this application example ensures the accuracy of data standard implementation results while improving the efficiency of data standard implementation. By establishing an intelligent mapping model, standards are implemented in systems, tables, and fields, reducing the waste of human resources and improving the efficiency and accuracy of determining the target data standards.
[0109] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0110] Based on the same inventive concept, this application also provides a data standard determination apparatus for implementing the data standard determination method described above. The solution provided by this apparatus is similar to the implementation scheme described in the above method; therefore, the specific limitations in one or more data standard determination apparatus embodiments provided below can be found in the limitations of the data standard determination method described above, and will not be repeated here.
[0111] In one embodiment, as shown in Figure 5, a data standard determination device is provided; the device 500 may include:
[0112] Data acquisition module 501 is used to acquire data to be processed; the data to be processed is data for which the data standard is yet to be determined.
[0113] Data determination module 502 is used to determine feature data corresponding to the data to be processed;
[0114] The matching degree determination module 503 is used to input the feature data into a pre-trained data standard recognition model, and determine the candidate data standard of the data to be processed and the matching degree between the data to be processed and the candidate data standard from the pre-stored data standard through the data standard recognition model;
[0115] The standard determination module 504 is used to determine the target data standard of the data to be processed from the candidate data standards based on the matching degree.
[0116] In one embodiment, the standard determination module 504 is further configured to receive a determination instruction for the candidate data standard; the determination instruction is an instruction triggered based on the matching degree; and, according to the determination instruction, determine the target data standard of the data to be processed from the candidate data standards.
[0117] In one embodiment, the standard determination module 504 is further configured to determine, from the candidate data standards, a candidate data standard whose matching degree meets a preset matching degree condition, as a candidate data standard to be selected; and determine, from the candidate data standards to be selected, a target data standard for the data to be processed based on the matching degree of the candidate data standard to be selected.
[0118] In one embodiment, the standard determination module 504 is further configured to determine the target data standard of the data to be processed from the candidate data standards when at least one of the matching degrees meets the preset matching degree condition; the device 500 further includes: a condition not met module, configured to determine that the data to be processed does not have a corresponding target data standard when none of the matching degrees meet the preset matching degree condition.
[0119] In one embodiment, the device 500 further includes: a model training module, configured to acquire sample data and the real data standard of the sample data; divide the sample data and the real data standard of the sample data to obtain a training sample set and a validation sample set; train the data standard recognition model to be trained using the training sample set to obtain a trained data standard recognition model; validate the trained data standard recognition model using the validation sample set to obtain a validation result; and if the validation result is qualified, determine the trained data standard recognition model as the pre-trained data standard recognition model.
[0120] In one embodiment, the data acquisition module 501 is further configured to acquire data to be preprocessed; perform data cleaning processing on the data to be preprocessed to obtain data to be preprocessed after data cleaning; and perform data transformation processing on the data to be preprocessed after data cleaning to obtain the data to be processed.
[0121] The modules in the aforementioned data standard determination device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.
[0122] It should be noted that the method and apparatus for determining data standards provided in this application can be used in the financial field involving the determination of data standards, and can also be used in the processing of data standards in any field other than the financial field. The application field of the method and apparatus for determining data standards provided in this application is not limited.
[0123] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 6. The computer device includes a processor, memory, input / output interfaces, a communication interface, a display unit, and an input device. The processor, memory, and input / output interfaces are connected via a system bus, and the communication interface, display unit, and input device are also connected to the system bus via the input / output interfaces. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for running the operating system and computer programs stored in the non-volatile storage media. The input / output interfaces are used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When the computer program is executed by the processor, it implements a data standard determination method. The display unit is used to form a visible image and can be a display screen, a projection device, or a virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device of the computer device can be a touch layer covering the display screen; a button, trackball, or touchpad set on the casing of the computer device; or an external keyboard, touchpad, or mouse.
[0124] Those skilled in the art will understand that the structure shown in Figure 6 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, combine certain components, or have a different arrangement of components.
[0125] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
[0126] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.
[0127] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0128] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data shall comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0129] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.
[0130] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0131] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A method for determining data standards, characterized in that, The method includes: Acquire the data to be processed; the data to be processed is data for which the data standard is yet to be determined. Determine the feature data corresponding to the data to be processed; wherein, the feature data includes at least one of the following: the table name, Chinese field name, English field name, field data type, field length, and field precision of the data to be processed; The feature data is input into a pre-trained data standard recognition model, wherein the data standard recognition model is a BERT language model; the data standard recognition model determines candidate data standards for the data to be processed and the matching degree between the data to be processed and the candidate data standards from the pre-stored data standards; Based on the matching degree, the target data standard for the data to be processed is determined from the candidate data standards.
2. The method according to claim 1, characterized in that, The step of determining the target data standard for the data to be processed from the candidate data standards based on the matching degree includes: Receive a determination instruction for the candidate data standards; the determination instruction is an instruction triggered based on the matching degree; According to the determination instruction, the target data standard for the data to be processed is determined from the candidate data standards.
3. The method according to claim 1, characterized in that, The step of determining the target data standard for the data to be processed from the candidate data standards based on the matching degree includes: From the candidate data standards, candidate data standards that meet the preset matching degree conditions are determined as candidate data standards to be selected; Based on the matching degree of the candidate data standards to be selected, the target data standard for the data to be processed is determined from the candidate data standards to be selected.
4. The method according to claim 1, characterized in that, The step of determining the target data standard for the data to be processed from the candidate data standards includes: If at least one of the matching degrees meets the preset matching degree condition, the target data standard of the data to be processed is determined from the candidate data standards; The method further includes: If none of the matching degrees meet the preset matching degree conditions, it is determined that the data to be processed does not have a corresponding target data standard.
5. The method according to claim 1, characterized in that, The pre-trained data standard recognition model is trained in the following manner: Obtain sample data and the true data standard for the sample data; The sample data and the real data standard of the sample data are divided to obtain a training sample set and a validation sample set; The training sample set is used to train the data standard recognition model to be trained, and the trained data standard recognition model is obtained. The trained data standard recognition model is validated using the validation sample set to obtain validation results; If the verification result is satisfactory, the trained data standard recognition model is determined as the pre-trained data standard recognition model.
6. The method according to claim 1, characterized in that, The acquisition of data to be processed includes: Obtain the data to be preprocessed; The data to be preprocessed is subjected to data cleaning to obtain the cleaned data to be preprocessed; The cleaned data to be preprocessed is subjected to data transformation to obtain the data to be processed.
7. A data standard determination device, characterized in that, The device includes: The data acquisition module is used to acquire data to be processed; the data to be processed is data for which the data standard is yet to be determined. A data determination module is used to determine the feature data corresponding to the data to be processed; wherein, the feature data includes at least one of the following: the table name, Chinese field name, English field name, field data type, field length, and field precision of the data to be processed; The matching degree determination module is used to input the feature data into a pre-trained data standard recognition model, wherein the data standard recognition model is a BERT language model; and to determine, through the data standard recognition model, candidate data standards for the data to be processed and the matching degree between the data to be processed and the candidate data standards from the pre-stored data standards. The standard determination module is used to determine the target data standard of the data to be processed from the candidate data standards based on the matching degree.
8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, When the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.
10. A computer program product, comprising a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.