Method and system for recommending data constraint conditions in data standard
By automatically matching the frequency distributions of target data and comparison data in the data standard using a similarity evaluation strategy, the method removes the inefficiency caused by relying on manually set metadata attributes in existing technologies and achieves efficient recommendation of data constraints.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING ESENSOFT CO LTD
- Filing Date
- 2022-08-16
- Publication Date
- 2026-06-26
Figure CN115344755B_ABST
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a method and system for recommending data constraints in data standards.
Background Technology
[0002] As more and more enterprises embark on digital transformation, digital enterprises are introducing data governance concepts into their daily information management, using data to empower business and improve management efficiency.
[0003] A crucial task in data governance is generating a corresponding data standard for each data indicator, and then using these standards to constrain and standardize the indicators. As the number of enterprise information systems increases and the volume of data continues to grow, the burden of developing data standards for data indicators becomes increasingly heavy for enterprises during the data governance process.
[0004] In the process of developing the existing technology, the inventors discovered that:
[0005] The existing process for generating data standards for data indicators is as follows: establish a unified data standard library and create data standards; generate metadata for data indicators; compare metadata attributes with data standard attributes, select data standards with higher matching degrees as candidate data standards for the corresponding data indicators, and obtain the data standards for the data indicators after manual confirmation.
[0006] A crucial step in traditional methods is generating metadata attributes for data indicators, which currently relies primarily on setting these attributes manually. Without metadata for the data indicators, existing methods do not function correctly. When dealing with a massive number of data indicators, this way of generating metadata attributes is time-consuming and labor-intensive, which limits the efficiency of generating data standards.
[0007] Therefore, a new data constraint recommendation scheme is needed to address the technical problem of low efficiency in data constraint recommendation.
Summary of the Invention
[0008] This application provides a new data constraint recommendation scheme to solve the technical problem of low efficiency in data constraint recommendation.
[0009] Specifically, a data constraint recommendation method in a data standard includes the following steps:
[0010] Retrieve target data from the first database and determine the attribute types and frequency distribution of the target data.
[0011] Obtain comparison data from a second database, which is different from the first database and stores the mapping relationship between the comparison data and data constraints, and determine the attribute types and frequency distribution of the comparison data.
[0012] When the attribute type of the target data and the attribute type of the comparison data are both the first attribute type, the frequency distribution of the target data and the frequency distribution of the comparison data are evaluated for similarity according to the first similarity evaluation strategy, and a similarity evaluation result is generated.
[0013] When the attribute type of the target data and the attribute type of the comparison data are both the second attribute type, the frequency distribution of the target data and the frequency distribution of the comparison data are evaluated for similarity according to the second similarity evaluation strategy, which is different from the first similarity evaluation strategy, and a similarity evaluation result is generated.
[0014] The comparison data whose similarity evaluation results are greater than a first preset threshold are identified as similar comparison data.
[0015] Based on the mapping relationship between comparison data and data constraints, the data constraints of the similar comparison data are determined and used as recommended data constraints.
[0016] Furthermore, when the attribute type of the target data and the attribute type of the comparison data are both of the first attribute type, the frequency distribution of the target data and the frequency distribution of the comparison data are evaluated for similarity according to the first similarity evaluation strategy, and a similarity evaluation result is generated, specifically including:
[0017] When the attribute types of the target data and the comparison data are both categorical data types, the Pearson chi-square test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the generated test result is used as the similarity evaluation result.
[0018] Furthermore, when the attribute type of the target data and the attribute type of the comparison data are both the second attribute type, a second similarity evaluation strategy, different from the first similarity evaluation strategy, is used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data, generating a similarity evaluation result, specifically including:
[0019] When the attribute types of the target data and the comparison data are both non-integer numerical data types, the Kolmogorov-Smirnov test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the generated test result is used as the similarity evaluation result.
[0020] Furthermore, when the attribute type of the target data and the attribute type of the comparison data are both the second attribute type, a second similarity evaluation strategy, different from the first similarity evaluation strategy, is used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data, generating a similarity evaluation result, specifically including:
[0021] When the attribute type of the target data and the attribute type of the comparison data are both integer data types, the Kolmogorov-Smirnov test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the first test result is generated.
[0022] The Pearson chi-square test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and a second test result is generated.
[0023] The higher of the first and second test results is taken as the similarity evaluation result.
[0024] Furthermore, the method is applied to recommending data standards;
[0025] The target data refers to data indicators.
[0026] This application also provides a data constraint recommendation system in a data standard.
[0027] Specifically, a data constraint recommendation system in a data standard includes:
[0028] The acquisition module is used to acquire target data from a first database, determine the attribute type and frequency distribution of the target data; it is also used to acquire comparison data from a second database, which is different from the first database and stores the mapping relationship between comparison data and data constraints, and determine the attribute type and frequency distribution of the comparison data.
[0029] The evaluation module is used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to a first similarity evaluation strategy when the attribute type of the target data and the attribute type of the comparison data are both of the first attribute type, and generate a similarity evaluation result; it is also used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to a second similarity evaluation strategy different from the first similarity evaluation strategy when the attribute type of the target data and the attribute type of the comparison data are both of the second attribute type, and generate a similarity evaluation result.
[0030] The recommendation module is used to identify comparison data whose similarity evaluation results are greater than a first preset threshold as similar comparison data; it is also used to determine the data constraints of the similar comparison data based on the mapping relationship between the comparison data and the data constraints, and to use them as the recommended data constraints.
[0031] Furthermore, the evaluation module is used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data when the attribute type of the target data and the attribute type of the comparison data are both of the first attribute type, according to the first similarity evaluation strategy, and generate a similarity evaluation result. Specifically, it is used for:
[0032] When the attribute types of the target data and the comparison data are both categorical data types, the Pearson chi-square test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the generated test result is used as the similarity evaluation result.
[0033] Furthermore, the evaluation module is used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data when the attribute type of the target data and the attribute type of the comparison data are both the second attribute type, according to a second similarity evaluation strategy different from the first similarity evaluation strategy, and to generate a similarity evaluation result. Specifically, it is used for:
[0034] When the attribute types of the target data and the comparison data are both non-integer numerical data types, the Kolmogorov-Smirnov test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the generated test result is used as the similarity evaluation result.
[0035] Furthermore, the evaluation module is used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data when the attribute type of the target data and the attribute type of the comparison data are both the second attribute type, according to a second similarity evaluation strategy different from the first similarity evaluation strategy, and to generate a similarity evaluation result. Specifically, it is used for:
[0036] When the attribute type of the target data and the attribute type of the comparison data are both integer data types, the Kolmogorov-Smirnov test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the first test result is generated.
[0037] The Pearson chi-square test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and a second test result is generated.
[0038] The higher of the first and second test results is taken as the similarity evaluation result.
[0039] Furthermore, the system is applied to recommending data standards;
[0040] The target data refers to data indicators.
[0041] The technical solution provided in this application has at least the following beneficial effects:
[0042] By comparing the attribute types of the target data with those of the comparison data, and by comparing the frequency distribution of the target data with that of the comparison data, comparison data similar to the target data is found, thereby determining the recommended data constraints for the target data. The technical solution provided in this application therefore no longer depends on the data's metadata to match data constraints, which improves the automation level and efficiency of data constraint recommendation.
Attached Figure Description
[0043] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
[0044] Figure 1 is a flowchart of a data constraint recommendation method in a data standard provided in an embodiment of this application.
[0045] Figure 2 is a schematic structural diagram of a data constraint recommendation system in a data standard provided in an embodiment of this application.
[0046] 100 Data constraint recommendation system in a data standard
[0047] 11 Acquisition module
[0048] 12 Evaluation module
[0049] 13 Recommendation module
Detailed Implementation
[0050] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0051] It should be emphasized again that a crucial step in existing technologies for generating data standards for data indicators is generating metadata attributes for those indicators, which currently relies heavily on setting these attributes manually. If data indicators lack metadata, existing methods do not function correctly. Those skilled in the art will understand that metadata, that is, data describing the data, must be set manually. Therefore, when dealing with a large number of data indicators, this way of generating metadata attributes is time-consuming and labor-intensive, which limits the efficiency of data standard generation.
[0052] Please refer to Figure 1. To address the technical problem of low efficiency in recommending data constraints, this application provides a method for recommending data constraints in data standards, comprising the following steps:
[0053] Retrieve target data from the first database and determine the attribute types and frequency distribution of the target data.
[0054] Obtain comparison data from a second database, which is different from the first database and stores the mapping relationship between the comparison data and data constraints, and determine the attribute types and frequency distribution of the comparison data.
[0055] When the attribute type of the target data and the attribute type of the comparison data are both the first attribute type, the frequency distribution of the target data and the frequency distribution of the comparison data are evaluated for similarity according to the first similarity evaluation strategy, and a similarity evaluation result is generated.
[0056] When the attribute type of the target data and the attribute type of the comparison data are both the second attribute type, the frequency distribution of the target data and the frequency distribution of the comparison data are evaluated for similarity according to the second similarity evaluation strategy, which is different from the first similarity evaluation strategy, and a similarity evaluation result is generated.
[0057] The comparison data whose similarity evaluation results are greater than a first preset threshold are identified as similar comparison data.
[0058] Based on the mapping relationship between comparison data and data constraints, the data constraints of the similar comparison data are determined and used as recommended data constraints.
[0059] As is readily apparent, the technical solution provided in this application no longer relies on the data's metadata to match data constraints. Instead, it compares the attribute types of the target data with those of the comparison data, and compares the frequency distribution of the target data with that of the comparison data, to find comparison data similar to the target data, thereby determining the recommended data constraints for the target data. This improves the automation level and efficiency of data constraint recommendation.
[0060] The specific implementation process of this application is described in detail below:
[0061] S110: Obtain the target data from the first database and determine the attribute type and frequency distribution of the target data.
[0062] It is understood that the first database is used to store data to be constrained, and this data can be considered as target data. In specific application scenarios, the first database acts as a data indicator library, and the data to be constrained represents data indicators that require the generation of data standards. Data indicators can be understood as numerical values that characterize the company's operational attributes such as size, degree, proportion, and structure.
[0063] Furthermore, the data indicator library consists of data indicators that require the generation of data standards. Each data indicator is represented by a single field in the data indicator library. Therefore, the data indicator library can be viewed as a collection of data indicators. Fields from other databases can be automatically extracted to form data indicator collections, which in turn constitute the data indicator library.
[0064] Furthermore, when extracting fields from other databases to build a data indicator set, the data types of the fields and the sampled data records are also extracted. Here, the data types of the fields correspond to the attribute types of the data to be constrained, and the sampled data records correspond to the frequency distribution of the data to be constrained. Therefore, the first database stores the data to be constrained, the attribute types of the data to be constrained, and the frequency distribution of the data to be constrained.
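As a rough illustration of the extraction described above, the following Python sketch builds one data indicator record from the sampled records of a field. The names `FieldProfile`, `infer_attribute_type`, and `build_profile`, the type labels, and the type-inference rules are all hypothetical; this application does not specify how attribute types are detected.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, List


@dataclass
class FieldProfile:
    """One data indicator: a field name plus what was extracted with it."""
    name: str
    attribute_type: str   # "integer", "non_integer", or "categorical" (assumed labels)
    frequency: Counter    # sampled value -> number of occurrences


def infer_attribute_type(samples: List[Any]) -> str:
    """Assumed rules: all ints -> integer, other numbers -> non_integer, else categorical."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    numeric = [v for v in samples
               if isinstance(v, (int, float)) and not isinstance(v, bool)]
    if samples and len(numeric) == len(samples):
        if all(isinstance(v, int) for v in samples):
            return "integer"
        return "non_integer"
    return "categorical"


def build_profile(name: str, samples: List[Any]) -> FieldProfile:
    """Summarize sampled records as an attribute type and a frequency distribution."""
    return FieldProfile(name, infer_attribute_type(samples), Counter(samples))


profile = build_profile("order_status", ["paid", "paid", "refunded"])
print(profile.attribute_type, dict(profile.frequency))
```

Storing the counts rather than the raw records keeps the record small while preserving everything the later similarity tests need.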
[0065] In the application scenarios provided in this application, field data types have various forms of representation, such as categorical data types, enumerable types, non-enumerable types, non-integer numeric data types, and integer numeric data types.
[0066] Based on the data to be constrained, the attribute types of the data to be constrained, and the frequency distribution of the data to be constrained stored in the first database, the target data can be obtained from the first database, and the attribute types and frequency distribution of the target data can be determined.
[0067] It should also be noted that the data indicators in the aforementioned data indicator library are fundamentally different from typical metadata. These data indicators do not contain metadata attributes with business meaning and can be viewed as simplified copies of the original data fields. As the enterprise information system grows, the data indicator library will also grow. For example, if a new field is added to a database within the enterprise system, a corresponding data indicator will be added to the data indicator library.
[0068] S120: Obtain comparison data from a second database that is different from the first database and stores the mapping relationship between comparison data and data constraints, and determine the attribute type and frequency distribution of the comparison data.
[0069] It can be understood that the second database is used to store constrained data and the data constraints on that data. In specific application scenarios, the second database acts as a data standard library, and the data constraints act as data standards. Data standards can be understood as consistent conventions for the expression, format, and definition of data, including unified definitions of data business attributes, technical attributes, and management attributes, used to meet internal analysis and management needs or external regulatory requirements. Data standards describe the standardization requirements of indicator data through at least one of the following: basic attributes, business attributes, technical attributes, and management attributes. For example, the basic attributes define the standard name and clarify the indicator classification; the business attributes clarify the business meaning, business scope, and indicator dimensions of the indicator; and the technical attributes clarify the data range, data acquisition method, indicator conditions, indicator data type, length, and precision of the indicator. Therefore, data standards are essentially data constraints.
[0070] Furthermore, the data standards library not only stores data standards, but also indicator data for which data standards have been defined, as well as the relationships between these indicators and their corresponding data standards. The indicator data for which data standards have been defined can be considered constrained data. Similarly, the data standards for constrained data can be considered the data constraints on the constrained data.
[0071] In addition, the data standard library also stores the attribute types of constrained data and the frequency distribution of constrained data.
[0072] In the application scenario provided in this application, constrained data can be regarded as comparison data. Therefore, because the second database stores the constrained data, the attribute types of the constrained data, the frequency distribution of the constrained data, the data constraints of the constrained data, and the relationship between the constrained data and the corresponding data constraints, comparison data can be obtained from the second database and its attribute types and frequency distribution can be determined.
[0073] It should also be emphasized that the comparison data described in this application is compared with the target data, with the aim of finding comparison data similar to the target data, so that the data constraints of that comparison data can be used as the recommended data constraints of the target data.
[0074] This application finds similar comparison data by comparing the attribute types of the target data with those of the comparison data, and by comparing the frequency distribution of the target data with that of the comparison data, thereby determining the recommended data constraints for the target data.
[0075] This is equivalent to transforming the matching degree calculation scheme of target data and numerous data constraints into a similarity calculation scheme of target data and comparison data. Furthermore, it transforms the similarity calculation scheme of target data and comparison data into a frequency distribution similarity calculation scheme for target data and comparison data of the same attribute type.
[0076] The reason for transforming the above technical solution is that the inventors considered that data with the same business meaning may have different data constraints or be represented by different data fields. Therefore, even if the business meaning is the same, the similarity between the data to be constrained and the constrained data may be low. However, data with the same business meaning, even if stored in different databases or data tables, or represented by different data fields, will have statistically similar sampled data records. Therefore, this application finds comparison data similar to the target data by calculating the frequency distribution similarity of target data and comparison data of the same attribute type, thereby determining the recommended data constraints for the target data in a more accurate and reasonable way.
[0077] If the target data and the comparison data have different attribute types, comparing their frequency distributions is not meaningful. Therefore, the technical solution provided in this application does not evaluate the similarity of the frequency distributions of target data and comparison data with different attribute types.
[0078] S130: When the attribute type of the target data and the attribute type of the comparison data are both the first attribute type, the frequency distribution of the target data and the frequency distribution of the comparison data are evaluated for similarity according to the first similarity evaluation strategy, and a similarity evaluation result is generated.
[0079] S140: When the attribute type of the target data and the attribute type of the comparison data are both the second attribute type, the frequency distribution of the target data and the frequency distribution of the comparison data are evaluated for similarity according to the second similarity evaluation strategy, which is different from the first similarity evaluation strategy, and a similarity evaluation result is generated.
[0080] It should be noted that the target data or comparison data may have any of several attribute types. For the sake of brevity, this description refers only to a first attribute type and a second attribute type of the target data or comparison data. This should not be construed as a limitation on the scope of protection of this application.
[0081] In the specific embodiments provided in this application, the data constraint recommendation method in the data standard is applied to recommend the data standard, and the target data is a data indicator. The attribute type of the target data includes at least one of the following: categorical data type, enumerable type, non-enumerable type, non-integer numeric data type, and integer numeric data type.
[0082] Taking a categorical data type as the first attribute type as an example, when the attribute types of the target data and the comparison data are both categorical data types, the Pearson chi-square test algorithm is used to perform an isodistribution test on the frequency distributions of the target data and the comparison data. The resulting test result is used as the similarity assessment result. The use of the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distributions of the target data and the comparison data constitutes the first similarity assessment strategy.
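The first similarity evaluation strategy can be sketched in plain Python as a Pearson chi-square test of homogeneity on a 2 x k table of category counts. This is a minimal sketch, not the implementation of this application: the function names are hypothetical, and treating the p-value as the "test result" (so that a larger value means more similar) is an assumption, since the text does not say which quantity is compared with the threshold.

```python
import math
from collections import Counter
from typing import Hashable, Iterable, Tuple


def chi2_sf(stat: float, df: int) -> float:
    """Upper-tail probability of the chi-square distribution for integer df >= 1."""
    if stat <= 0:
        return 1.0
    half = stat / 2.0
    if df % 2 == 0:
        # Closed form for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term = math.exp(-half)
        total = term
        for k in range(1, df // 2):
            term *= half / k
            total += term
    else:
        # Closed form for odd df, built on the complementary error function.
        total = math.erfc(math.sqrt(half))
        term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
        for k in range(1, (df - 1) // 2 + 1):
            total += term
            term *= half / (k + 0.5)
    return min(1.0, total)


def chi2_homogeneity(target: Iterable[Hashable],
                     comparison: Iterable[Hashable]) -> Tuple[float, float]:
    """Return (statistic, p_value) for the two samples' category counts."""
    t, c = Counter(target), Counter(comparison)
    n_t, n_c = sum(t.values()), sum(c.values())
    if n_t == 0 or n_c == 0:
        raise ValueError("both samples must be non-empty")
    categories = set(t) | set(c)
    df = len(categories) - 1
    if df == 0:
        return 0.0, 1.0  # both samples consist of the same single category
    stat = 0.0
    for cat in categories:
        column_total = t[cat] + c[cat]
        for observed, row_total in ((t[cat], n_t), (c[cat], n_c)):
            expected = row_total * column_total / (n_t + n_c)
            stat += (observed - expected) ** 2 / expected
    return stat, chi2_sf(stat, df)


stat, p_value = chi2_homogeneity(["M", "F", "F", "M"], ["F", "M", "M", "F"])
print(stat, p_value)
```

With small expected counts the chi-square approximation is unreliable, so a production system would likely merge rare categories first; that step is outside what the text describes.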
[0083] The second attribute type is usually different from the first attribute type. Therefore, based on the first attribute type being a categorical data type, any of the following can be the second attribute type: enumerable type, non-enumerable type, non-integer numeric data type, or integer numeric data type.
[0084] Taking a non-integer numeric data type as an example, when the attribute type of the target data and the attribute type of the comparison data are both non-integer numeric data types, the Kolmogorov-Smirnov test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data. The generated test result is used as the similarity assessment result. The use of the Kolmogorov-Smirnov test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data constitutes the second similarity assessment strategy, which differs from the first similarity assessment strategy.
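For non-integer numeric data, a two-sample Kolmogorov-Smirnov test can be sketched as follows. The function names are hypothetical, the p-value uses the standard asymptotic Kolmogorov series with a common small-sample correction rather than an exact computation, and using the p-value as the similarity evaluation result is an assumption not stated in the text.

```python
import math
from bisect import bisect_right
from typing import Sequence, Tuple


def kolmogorov_sf(lam: float) -> float:
    """Asymptotic tail probability Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) exp(-2 k^2 lam^2)."""
    if lam < 0.2:
        return 1.0  # Q(lam) differs from 1 by less than 1e-12 in this range
    total = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        total += term
        if abs(term) < 1e-15:
            break
    return min(1.0, max(0.0, total))


def ks_two_sample(target: Sequence[float],
                  comparison: Sequence[float]) -> Tuple[float, float]:
    """Return (D, approximate p_value) for two numeric samples."""
    a, b = sorted(target), sorted(comparison)
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both samples must be non-empty")
    # The largest gap between the two empirical CDFs occurs at an observed value.
    d = max(abs(bisect_right(a, v) / n - bisect_right(b, v) / m)
            for v in set(a) | set(b))
    effective_n = math.sqrt(n * m / (n + m))
    lam = (effective_n + 0.12 + 0.11 / effective_n) * d
    return d, kolmogorov_sf(lam)


d, p_value = ks_two_sample([0.12, 0.35, 0.41, 0.77], [0.10, 0.30, 0.52, 0.80])
print(d, p_value)
```

The statistic D is the maximum vertical distance between the two empirical distribution functions, so identical samples give D = 0 and fully separated samples give D = 1.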
[0085] Taking an integer numeric data type as the second attribute type as an example, when the attribute type of the target data and the attribute type of the comparison data are both integer numeric data types, the Kolmogorov-Smirnov test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the first test result is generated.
[0086] The Pearson chi-square test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and a second test result is generated.
[0087] The higher of the first and second test results is taken as the similarity evaluation result. In this case, the second similarity evaluation strategy, which differs from the first similarity evaluation strategy, consists of three parts: generating the first test result by using the Kolmogorov-Smirnov test algorithm to perform an isodistribution test on the frequency distributions of the target data and the comparison data; generating the second test result by using the Pearson chi-square test algorithm to perform the same test; and taking the higher of the two test results.
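The selection of a strategy by attribute type, including the "higher of two results" rule for integer data, might be organized as in the sketch below. The two tests are passed in as callables so the block stays self-contained; the type labels, the function name `evaluate_similarity`, and the convention that each callable returns a single score where larger means more similar are assumptions.

```python
from typing import Callable, Optional, Sequence

# Hypothetical signature: two samples in, one similarity score out.
DistributionTest = Callable[[Sequence, Sequence], float]


def evaluate_similarity(target_type: str,
                        comparison_type: str,
                        target: Sequence,
                        comparison: Sequence,
                        chi2_test: DistributionTest,
                        ks_test: DistributionTest) -> Optional[float]:
    """Pick the evaluation strategy from the shared attribute type."""
    if target_type != comparison_type:
        return None  # different attribute types are not compared
    if target_type == "categorical":
        return chi2_test(target, comparison)
    if target_type == "non_integer":
        return ks_test(target, comparison)
    if target_type == "integer":
        first = ks_test(target, comparison)     # first test result
        second = chi2_test(target, comparison)  # second test result
        return max(first, second)
    return None  # attribute type not covered by this sketch


# Stand-in tests with fixed scores, used only to show the control flow.
score = evaluate_similarity("integer", "integer", [1, 2, 2], [2, 2, 3],
                            chi2_test=lambda a, b: 0.2,
                            ks_test=lambda a, b: 0.7)
print(score)
```

Returning `None` for mismatched types keeps "not compared" distinct from a genuine score of zero.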
[0088] Of course, in actual implementation, the attribute types of the target data may be more complex. In this case, the data constraint recommendation method in the data standard provided in this application still finds similar comparison data by comparing the attribute types of the target data with those of the comparison data, and by comparing the frequency distribution of the target data with that of the comparison data, thereby determining the recommended data constraints for the target data.
[0089] For example, when the attribute types of the target data and the comparison data are both categorical data types and both are enumerable types, the Pearson chi-square test algorithm is used to perform an isodistribution test on the frequency distributions of the target data and the comparison data, and the generated test result is used as the similarity assessment result. The categorical data type and enumerable type can be considered as the first attribute type in this application, and the use of the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distributions of the target data and the comparison data can be considered as the first similarity assessment strategy.
[0090] When the attribute types of the target data and the comparison data are both categorical data types and both are non-enumerable, special characters such as spaces or punctuation marks are first removed from the target data and comparison data to generate new target data and comparison data. Then, the Pearson chi-square test algorithm is used to perform an isodistribution test on the frequency distribution of the new target data and the frequency distribution of the new comparison data, and the generated test result is used as the similarity evaluation result. The categorical data type and non-enumerable type can be regarded as the second attribute type different from the first attribute type described in this application. The process of removing special characters such as spaces or punctuation marks from the target data and comparison data to generate new target data and comparison data, and then using the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the new target data and the frequency distribution of the new comparison data, can be regarded as the second similarity evaluation strategy.
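The preprocessing step for non-enumerable categorical data can be sketched as below. The helper names are hypothetical, and keeping only alphanumeric characters is one possible reading of "special characters such as spaces or punctuation marks"; it also drops other symbols, which the text neither requires nor excludes.

```python
from collections import Counter
from typing import Iterable


def strip_special(value: str) -> str:
    """Keep only letters and digits, dropping spaces, punctuation, and other symbols."""
    return "".join(ch for ch in value if ch.isalnum())


def normalized_frequency(values: Iterable[str]) -> Counter:
    """Frequency distribution of the values after special characters are removed."""
    return Counter(strip_special(v) for v in values)


target = normalized_frequency(["AB-01", "ab 02", "AB-01"])
comparison = normalized_frequency(["AB01", "ab02", "AB.01"])
print(target == comparison)
```

After normalization, values that differ only in separators fall into the same category, so the chi-square test compares the underlying codes rather than their formatting.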
[0091] It is understood that the enumerable data type does not contain special characters such as spaces or punctuation marks, while the non-enumerable data type does contain special characters such as spaces or punctuation marks.
[0092] This application also provides a method for determining whether data of a categorical data type is an enumerable type, including the following steps:
[0093] For data with a categorical attribute type, suppose there are m sample values in the frequency distribution of the data and n distinct sample values. Define the sample value repetition degree of the data as R = m / n, and define R1(n) as the second judgment threshold, which is a function of n.
[0094] Based on the R and n values of the enumerable data type, a prediction model for the function R1(n) is established. In the prediction model of the function R1(n), R1 is the dependent variable and n is the independent variable.
[0095] The training data for the model consists of the number of distinct sample values n and the sample value repetition degree R of the sampled data.
[0096] Define the third judgment threshold R2 = R1(n) - 3*MSE, where MSE is the mean square error of model R1(n).
[0097] When R is greater than the third judgment threshold R2, the data indicator is determined to be an enumerable type;
[0098] When R is less than or equal to the third judgment threshold R2, the data indicator is determined to be a non-enumerable type.
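The steps above can be sketched as follows. The application does not specify the form of the prediction model R1(n), so the least-squares straight line used here is purely an assumption, and the names fit_r1 and is_enumerable are hypothetical.

```python
def fit_r1(training):
    """Fit R1(n) = a + b*n by least squares and return (R1, MSE).

    training: list of (n, R) pairs taken from data known to be enumerable.
    A straight line is an assumption; the source does not name the model.
    """
    count = len(training)
    mean_n = sum(n for n, _ in training) / count
    mean_r = sum(r for _, r in training) / count
    sxx = sum((n - mean_n) ** 2 for n, _ in training)
    sxy = sum((n - mean_n) * (r - mean_r) for n, r in training)
    b = sxy / sxx
    a = mean_r - b * mean_n
    mse = sum((r - (a + b * n)) ** 2 for n, r in training) / count
    return (lambda n: a + b * n), mse


def is_enumerable(frequency: dict, r1, mse: float) -> bool:
    """Apply the rule R > R2, where R2 = R1(n) - 3 * MSE."""
    m = sum(frequency.values())  # number of sample values
    n = len(frequency)           # number of distinct sample values
    r = m / n                    # sample value repetition degree
    r2 = r1(n) - 3 * mse         # third judgment threshold
    return r > r2
```

A field with few distinct values that each repeat many times yields a large R and is classified as enumerable, while a field whose values rarely repeat yields an R close to 1 and is classified as non-enumerable.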
[0099] S150: Determine the comparison data whose similarity evaluation result is greater than the first preset threshold, and use them as similar comparison data.
[0100] S160: Based on the mapping relationship between the comparison data and the data constraints, determine the data constraints of the similar comparison data as the recommended data constraints.
[0101] It is understood that the first preset threshold is used to select, from the comparison data for which similarity evaluation results have been generated, the comparison data that can be deemed similar to the target data. Comparison data deemed similar to the target data is defined as similar comparison data, and there may be several such items.
[0102] Since the similar comparison data is similar to the target data, the constraints of the similar comparison data are naturally also suitable for the target data. Based on the mapping relationship between the comparison data and the data constraints, the data constraints of the similar comparison data are determined as recommended data constraints for the target data.
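Steps S150 and S160 can be illustrated with the following minimal sketch. The identifiers, the example threshold, and the ordering of results by similarity are assumptions introduced for illustration; the function name recommend_constraints is hypothetical.

```python
def recommend_constraints(similarities: dict, constraint_map: dict,
                          threshold: float) -> list:
    """Return the constraints of comparison data whose similarity exceeds the threshold.

    similarities:   comparison data id -> similarity evaluation result
    constraint_map: comparison data id -> data constraint (the stored mapping)
    """
    # S150: keep only comparison data above the first preset threshold.
    similar = [cid for cid, score in similarities.items() if score > threshold]
    # Assumption: list the best match first.
    similar.sort(key=lambda cid: similarities[cid], reverse=True)
    # S160: look up the constraints through the mapping, without duplicates.
    recommended = []
    for cid in similar:
        constraint = constraint_map[cid]
        if constraint not in recommended:
            recommended.append(constraint)
    return recommended
```

Because several similar comparison data items may map to the same data constraint, the sketch returns each recommended constraint only once.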
[0103] In summary, the data constraint recommendation method in the data standard provided in this application no longer relies on the data's metadata to match data constraints. Instead, it finds similar comparison data by comparing the attribute type of the target data with that of the comparison data, and by comparing the frequency distribution of the target data with that of the comparison data. This determines the recommended data constraints for the target data and improves the automation level and efficiency of data constraint recommendation.
[0104] Please refer to Figure 2. To support the data constraint recommendation method in the data standard provided in this application, this application also provides a data constraint recommendation system 100 in the data standard, comprising:
[0105] The acquisition module 11 is used to acquire target data from the first database, determine the attribute type and frequency distribution of the target data; it is also used to acquire comparison data from a second database that is different from the first database and stores the mapping relationship between comparison data and data constraints, and determine the attribute type and frequency distribution of the comparison data.
[0106] Evaluation module 12 is used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to a first similarity evaluation strategy when the attribute type of the target data and the attribute type of the comparison data are both of the first attribute type, and generate a similarity evaluation result; it is also used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to a second similarity evaluation strategy different from the first similarity evaluation strategy when the attribute type of the target data and the attribute type of the comparison data are both of the second attribute type, and generate a similarity evaluation result.
[0107] The recommendation module 13 is used to determine the comparison data whose similarity evaluation result is greater than the first preset threshold as similar comparison data; it is also used to determine the data constraints of the similar comparison data according to the mapping relationship between the comparison data and the data constraints, and to use them as the recommended data constraints.
[0108] Specifically, the acquisition module 11 acquires target data from the first database and determines the attribute type and frequency distribution of the target data.
[0109] It is understood that the first database is used to store data to be constrained, and this data can be considered as target data. In specific application scenarios, the first database acts as a data indicator library, and the data to be constrained represents data indicators that require the generation of data standards. Data indicators can be understood as numerical values that characterize the company's operational attributes such as size, degree, proportion, and structure.
[0110] Furthermore, the data indicator library consists of data indicators that require the generation of data standards, with each data indicator represented by a single field within the library. Therefore, the data indicator library can be viewed as a collection of data indicators. The acquisition module 11 can automatically extract fields from other databases to form a data indicator collection, which in turn constitutes the data indicator library.
[0111] Furthermore, when the acquisition module 11 extracts fields from other databases to build a data indicator set, it also extracts the field data types and sampled data records. Here, the field data types correspond to the attribute types of the data to be constrained, and the sampled data records correspond to the frequency distribution of the data to be constrained. Therefore, the first database stores the data to be constrained, the attribute types of the data to be constrained, and the frequency distribution of the data to be constrained.
[0112] In the application scenarios provided in this application, field data types have various forms of representation, such as categorical data types, enumerable types, non-enumerable types, non-integer numeric data types, and integer numeric data types.
[0113] Based on the data to be constrained, the attribute types of the data to be constrained, and the frequency distribution of the data to be constrained stored in the first database, the acquisition module 11 can acquire the target data from the first database and determine the attribute types and frequency distribution of the target data.
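How an attribute type and a frequency distribution might be derived from sampled records of one field is sketched below. The type labels and the inference rule based on Python value types are assumptions, since the application only states that field data types and sampled records are extracted. The name profile_field is hypothetical.

```python
from collections import Counter


def profile_field(samples):
    """Return (attribute_type, frequency_distribution) for one field's samples."""
    values = [v for v in samples if v is not None]
    if not values:
        raise ValueError("no non-null samples to profile")

    def is_number(v):
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    if all(isinstance(v, int) and not isinstance(v, bool) for v in values):
        attribute_type = "integer_numeric"
    elif all(is_number(v) for v in values):
        attribute_type = "non_integer_numeric"
    else:
        attribute_type = "categorical"
    return attribute_type, Counter(values)
```

In a real deployment the declared column type in the source database would likely be a more reliable source for the attribute type than inspecting the sampled values.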
[0114] It should also be noted that the data indicators in the aforementioned data indicator library are fundamentally different from typical metadata. These data indicators do not contain metadata attributes with business meaning and can be viewed as simplified copies of the original data fields. As the enterprise information system grows, the data indicator library will also grow. For example, if a new field is added to a database within the enterprise system, a corresponding data indicator will be added to the data indicator library.
[0115] In addition, the acquisition module 11 also acquires comparison data from a second database, which is different from the first database and stores the mapping relationship between comparison data and data constraints, and determines the attribute type and frequency distribution of the comparison data.
[0116] It is understood that the second database is used to store constrained data and the data constraints on that data. In specific application scenarios, the second database acts as a data standard library, and the data constraints act as data standards. Data standards can be understood as consistent conventions for the expression, format, and definition of data, including unified definitions of data business attributes, technical attributes, and management attributes, used to meet internal analysis and management needs or external regulatory requirements. Data standards describe the standardization requirements of indicator data through at least one of the following: basic attributes, business attributes, technical attributes, and management attributes. For example, the basic attributes define the standard name and the indicator classification; the business attributes specify the business meaning, business scope, and indicator dimensions of the indicator; and the technical attributes specify the data range, data acquisition method, indicator conditions, indicator data type, length, and precision of the indicator. Therefore, data standards are essentially data constraints.
[0117] Furthermore, the data standards library not only stores data standards, but also indicator data for which data standards have been defined, as well as the relationships between these indicators and their corresponding data standards. The indicator data for which data standards have been defined can be considered constrained data. Similarly, the data standards for constrained data can be considered the data constraints on the constrained data.
[0118] In addition, the data standard library also stores the attribute types of constrained data and the frequency distribution of constrained data.
[0119] In the application scenario provided in this application, the acquisition module 11 can use constrained data as comparison data. Therefore, based on the second database storing constrained data, the attribute types of constrained data, the frequency distribution of constrained data, the data constraints of constrained data, and the relationship between constrained data and corresponding data constraints, the acquisition module 11 can acquire comparison data from the second database and determine the attribute types and frequency distribution of the comparison data.
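One possible in-memory shape for the contents of the second database is sketched below. All class and field names are hypothetical and chosen only to mirror the items listed above: the data standards, the constrained data with its attribute type and frequency distribution, and the mapping between the two.

```python
from dataclasses import dataclass, field


@dataclass
class DataStandard:
    standard_id: str
    basic: dict = field(default_factory=dict)       # e.g. standard name, classification
    business: dict = field(default_factory=dict)    # e.g. business meaning, scope
    technical: dict = field(default_factory=dict)   # e.g. data type, length, precision
    management: dict = field(default_factory=dict)


@dataclass
class ConstrainedData:
    name: str
    attribute_type: str
    frequency: dict
    standard_id: str  # mapping from the constrained data to its data standard


@dataclass
class StandardLibrary:
    standards: dict = field(default_factory=dict)    # standard_id -> DataStandard
    constrained: list = field(default_factory=list)  # ConstrainedData records

    def comparison_data(self, attribute_type: str) -> list:
        """Constrained data usable as comparison data for one attribute type."""
        return [c for c in self.constrained if c.attribute_type == attribute_type]

    def constraint_of(self, item: ConstrainedData) -> DataStandard:
        """Follow the stored mapping from constrained data to its data standard."""
        return self.standards[item.standard_id]
```

Storing the standard identifier on each constrained data record keeps the mapping relationship explicit, so that a similar comparison data item leads directly to the data constraint to recommend.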
[0120] It should also be emphasized that the comparison data described in this application is compared with the target data in order to find comparison data similar to the target data, so that the data constraints of that comparison data can be used as the recommended data constraints of the target data.
[0121] The evaluation module 12 finds similar comparison data by comparing the attribute type of the target data with that of the comparison data, and by comparing the frequency distribution of the target data with that of the comparison data, thereby determining the recommended data constraints for the target data.
[0122] This is equivalent to transforming the matching degree calculation scheme of target data and numerous data constraints into a similarity calculation scheme of target data and comparison data. Furthermore, it transforms the similarity calculation scheme of target data and comparison data into a frequency distribution similarity calculation scheme for target data and comparison data of the same attribute type.
[0123] The reason for transforming the above technical solution is that the inventors considered that data with the same business meaning may have different data constraints or be represented by different data fields. Therefore, even if the business meaning is the same, the similarity between the data to be constrained and the constrained data may be low. However, data with the same business meaning, even if they are stored in different databases, data tables, or represented by different data fields, will have statistically similar sampled data records. Therefore, this application finds similar comparative data to the target data by calculating the frequency distribution similarity of target data and comparative data of the same attribute type, thereby determining the recommended data constraints for the target data in a more accurate and reasonable way.
[0124] If the target data and the comparison data have different attribute types, their frequency distributions cannot be meaningfully compared. Therefore, the technical solution provided in this application does not further evaluate the similarity of the frequency distributions of target data and comparison data with different attribute types.
[0125] Specifically, when the attribute type of the target data and the attribute type of the comparison data are both the first attribute type, the evaluation module 12 evaluates the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to the first similarity evaluation strategy, and generates a similarity evaluation result.
[0126] When the attribute type of the target data and the attribute type of the comparison data are both the second attribute type, the evaluation module 12 evaluates the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to the second similarity evaluation strategy, which is different from the first similarity evaluation strategy, and generates a similarity evaluation result.
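The selection of a similarity evaluation strategy by attribute type can be sketched as a lookup table. The function name evaluate_similarity, the tuple layout, and the use of None for pairs that are not compared are assumptions made for illustration.

```python
def evaluate_similarity(target, comparison, strategies):
    """Pick the strategy for the shared attribute type, or skip the pair.

    target, comparison: (attribute_type, frequency_distribution) tuples
    strategies:         attribute_type -> callable(freq_a, freq_b) -> float
    """
    target_type, target_freq = target
    comparison_type, comparison_freq = comparison
    if target_type != comparison_type:
        return None  # different attribute types are not compared
    strategy = strategies.get(target_type)
    if strategy is None:
        return None  # no strategy registered for this attribute type
    return strategy(target_freq, comparison_freq)
```

Registering one callable per attribute type keeps the evaluation module open to further attribute types beyond the first and second ones used in this description.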
[0127] It should be noted that the target data or comparison data may have any of multiple attribute types. For brevity, this description refers only to a first attribute type and a second attribute type of the target data or comparison data. This should not be construed as limiting the scope of protection of this application.
[0128] In the specific embodiments provided in this application, the data constraint recommendation system 100 in the data standard is applied to recommend the data standard, and the target data is a data indicator. The attribute type of the target data includes at least one of the following: categorical data type, enumerable type, non-enumerable type, non-integer numeric data type, and integer numeric data type.
[0129] Taking the first attribute type as a categorical data type as an example, when the attribute type of the target data and the attribute type of the comparison data are both categorical data types, the evaluation module 12 uses the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the generated test result is used as the similarity evaluation result. The use of the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data is the first similarity evaluation strategy.
[0130] The second attribute type is usually different from the first attribute type. Therefore, based on the first attribute type being a categorical data type, any of the following can be the second attribute type: enumerable type, non-enumerable type, non-integer numeric data type, or integer numeric data type.
[0131] Taking a non-integer numeric data type as an example, when the attribute type of the target data and the attribute type of the comparison data are both non-integer numeric data types, the evaluation module 12 uses the Kolmogorov-Smirnov test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the generated test result is used as the similarity evaluation result. The use of the Kolmogorov-Smirnov test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data is a second similarity evaluation strategy, different from the first similarity evaluation strategy.
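A two-sample Kolmogorov-Smirnov test on frequency distributions can be sketched as follows. The p-value uses the asymptotic Kolmogorov distribution, which is an approximation that is less accurate for small samples, and using the p-value as the "test result" is an assumption. The names ks_statistic and ks_similarity are hypothetical.

```python
import math


def ks_statistic(freq_a: dict, freq_b: dict) -> float:
    """Largest gap between the two empirical CDFs built from frequency tables."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    cum_a = cum_b = 0
    d = 0.0
    for x in sorted(set(freq_a) | set(freq_b)):
        cum_a += freq_a.get(x, 0)
        cum_b += freq_b.get(x, 0)
        d = max(d, abs(cum_a / total_a - cum_b / total_b))
    return d


def ks_similarity(freq_a: dict, freq_b: dict) -> float:
    """Asymptotic two-sample KS p-value (tail of the Kolmogorov distribution)."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    effective_n = total_a * total_b / (total_a + total_b)
    lam = math.sqrt(effective_n) * ks_statistic(freq_a, freq_b)
    if lam < 0.2:
        return 1.0  # the tail is 1 to within rounding; the series converges slowly here
    tail = 0.0
    for k in range(1, 101):
        term = 2.0 * (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        tail += term
        if abs(term) < 1e-12:
            break
    return min(1.0, max(0.0, tail))
```

Unlike the chi-square test, this test uses the ordering of the values, which is why it suits numeric attribute types.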
[0132] Taking the second attribute type as an integer numeric data type as an example, when the attribute type of the target data and the attribute type of the comparison data are both integer numeric data types, the evaluation module 12 uses the Kolmogorov-Smirnov test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and generates the first test result;
[0133] The Pearson chi-square test algorithm is used to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and a second test result is generated.
[0134] The higher of the first test result and the second test result is taken as the similarity evaluation result. Together, generating the first test result with the Kolmogorov-Smirnov test algorithm, generating the second test result with the Pearson chi-square test algorithm, and taking the higher of the two results constitute a second similarity evaluation strategy different from the first similarity evaluation strategy.
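The combination rule for integer numeric data can be sketched in a few lines. The two tests are passed in as callables so that the sketch stays independent of any particular implementation; the name integer_similarity and its parameters are hypothetical.

```python
def integer_similarity(freq_a, freq_b, ks_test, chi_square_test) -> float:
    """Higher of the two test results, per the integer numeric strategy.

    ks_test and chi_square_test are caller-supplied callables
    (freq_a, freq_b) -> float; they are placeholders in this sketch.
    """
    first_result = ks_test(freq_a, freq_b)            # Kolmogorov-Smirnov
    second_result = chi_square_test(freq_a, freq_b)   # Pearson chi-square
    return max(first_result, second_result)
```

One plausible reading of this rule is that an integer field may hold either a quantity, where the Kolmogorov-Smirnov test is the better fit, or a code value, where the chi-square test is the better fit, so the more favorable result is kept.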
[0135] In practice, the attribute type of the target data may be more complex. Even in this case, the evaluation module 12 still finds similar comparison data by comparing the attribute type of the target data with that of the comparison data, and by comparing the frequency distribution of the target data with that of the comparison data, thereby determining the recommended data constraints for the target data.
[0136] For example, when the attribute types of the target data and the comparison data are both categorical data types and both are enumerable types, the evaluation module 12 uses the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and the generated test result is used as the similarity evaluation result. The categorical data type and enumerable type can be regarded as the first attribute type of this application, and the use of the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data can be regarded as the first similarity evaluation strategy.
[0137] When the attribute types of the target data and the comparison data are both categorical data types and both are non-enumerable, the evaluation module 12 first removes special characters such as spaces or punctuation marks from the target data and the comparison data, generating new target data and comparison data. Then, it uses the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the new target data and the frequency distribution of the new comparison data, and the generated test result is used as the similarity evaluation result. The categorical data type and non-enumerable type can be regarded as the second attribute type different from the first attribute type described in this application. The process of removing special characters such as spaces or punctuation marks from the target data and the comparison data to generate new target data and comparison data, and then using the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the new target data and the frequency distribution of the new comparison data, can be regarded as the second similarity evaluation strategy.
[0138] It is understood that the enumerable data type does not contain special characters such as spaces or punctuation marks, while the non-enumerable data type does contain special characters such as spaces or punctuation marks.
[0139] The evaluation module 12 determines whether data of the categorical data type is an enumerable type as follows:
[0140] For data with a categorical attribute type, suppose there are m sample values in the frequency distribution of the data and n distinct sample values. Define the sample value repetition degree of the data as R = m / n, and define R1(n) as the second judgment threshold, which is a function of n.
[0141] Based on the R and n values of the enumerable data type, a prediction model for the function R1(n) is established. In the prediction model of the function R1(n), R1 is the dependent variable and n is the independent variable.
[0142] The training data for the model consists of the number of distinct sample values n and the sample value repetition degree R of the sampled data.
[0143] Define the third judgment threshold R2 = R1(n) - 3*MSE, where MSE is the mean square error of model R1(n).
[0144] When R is greater than the third judgment threshold R2, the data indicator is determined to be an enumerable type;
[0145] When R is less than or equal to the third judgment threshold R2, the data indicator is determined to be a non-enumerable type.
[0146] The recommendation module 13 determines the comparison data whose similarity evaluation results are greater than the first preset threshold as similar comparison data.
[0147] The recommendation module 13 determines the data constraints of the similar comparison data based on the mapping relationship between comparison data and data constraints, and uses these constraints as the recommended data constraints.
[0148] It is understood that the first preset threshold is used to select, from the comparison data for which similarity evaluation results have been generated, the comparison data that can be deemed similar to the target data. The recommendation module 13 defines the comparison data deemed similar to the target data as similar comparison data, and there may be several such items.
[0149] Since the similar comparison data is similar to the target data, the constraints of the similar comparison data are naturally also suitable for the target data. Recommendation module 13 determines the data constraints of the similar comparison data based on the mapping relationship between the comparison data and the data constraints, and uses these constraints as the recommended data constraints for the target data.
[0150] In summary, the data constraint recommendation system 100 in the data standard provided in this application no longer relies on the data's metadata to match data constraints. Instead, it finds similar comparison data by comparing the attribute type of the target data with that of the comparison data, and by comparing the frequency distribution of the target data with that of the comparison data. This determines the recommended data constraints for the target data and improves the automation level and efficiency of data constraint recommendation.
[0151] It should be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0152] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0153] The above description is merely an embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principle of this application should be included within the scope of the claims of this application.
Claims
1. A method for recommending data constraints in a data standard, characterized in that it includes the following steps:
acquiring target data from a first database, and determining the attribute type and frequency distribution of the target data;
acquiring comparison data from a second database, which is different from the first database and stores the mapping relationship between comparison data and data constraints, and determining the attribute type and frequency distribution of the comparison data;
when the attribute type of the target data and the attribute type of the comparison data are both a first attribute type, evaluating the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to a first similarity evaluation strategy, and generating a similarity evaluation result;
when the attribute type of the target data and the attribute type of the comparison data are both a second attribute type, evaluating the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to a second similarity evaluation strategy different from the first similarity evaluation strategy, and generating a similarity evaluation result;
determining the comparison data whose similarity evaluation result is greater than a first preset threshold as similar comparison data;
determining, based on the mapping relationship between comparison data and data constraints, the data constraints of the similar comparison data as recommended data constraints for the target data;
wherein the evaluation according to the first similarity evaluation strategy specifically includes: when the attribute types of the target data and the comparison data are both categorical data types, using the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and using the generated test result as the similarity evaluation result;
and the evaluation according to the second similarity evaluation strategy specifically includes: when the attribute types of the target data and the comparison data are both non-integer numeric data types, using the Kolmogorov-Smirnov test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and using the generated test result as the similarity evaluation result; or
when the attribute types of the target data and the comparison data are both integer numeric data types, using the Kolmogorov-Smirnov test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data to generate a first test result; using the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data to generate a second test result; and determining the higher of the first test result and the second test result as the similarity evaluation result.
2. The method as described in claim 1, characterized in that the method is applied to recommending data standards, and the target data is a data indicator.
3. A data constraint recommendation system in a data standard, characterized in that it includes:
an acquisition module, used to acquire target data from a first database and determine the attribute type and frequency distribution of the target data, and also used to acquire comparison data from a second database, which is different from the first database and stores the mapping relationship between comparison data and data constraints, and determine the attribute type and frequency distribution of the comparison data;
an evaluation module, used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to a first similarity evaluation strategy when the attribute type of the target data and the attribute type of the comparison data are both a first attribute type, and generate a similarity evaluation result, and also used to evaluate the similarity between the frequency distribution of the target data and the frequency distribution of the comparison data according to a second similarity evaluation strategy different from the first similarity evaluation strategy when the attribute type of the target data and the attribute type of the comparison data are both a second attribute type, and generate a similarity evaluation result;
a recommendation module, used to determine the comparison data whose similarity evaluation result is greater than a first preset threshold as similar comparison data, and also used to determine the data constraints of the similar comparison data based on the mapping relationship between comparison data and data constraints and use them as the recommended data constraints;
wherein, for the first similarity evaluation strategy, the evaluation module is specifically used for: when the attribute types of the target data and the comparison data are both categorical data types, using the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and using the generated test result as the similarity evaluation result;
and, for the second similarity evaluation strategy, the evaluation module is specifically used for: when the attribute types of the target data and the comparison data are both non-integer numeric data types, using the Kolmogorov-Smirnov test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data, and using the generated test result as the similarity evaluation result; or
when the attribute types of the target data and the comparison data are both integer numeric data types, using the Kolmogorov-Smirnov test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data to generate a first test result; using the Pearson chi-square test algorithm to perform an isodistribution test on the frequency distribution of the target data and the frequency distribution of the comparison data to generate a second test result; and taking the higher of the first test result and the second test result as the similarity evaluation result.
4. The system as described in claim 3, characterized in that the system is applied to recommending data standards, and the target data is a data indicator.