Data governance method and apparatus, computer device and storage medium
By parsing business data tables, mining relationships, and partitioning and merging processes, the problem of insufficient accuracy in existing data governance methods has been solved, enabling accurate and standardized governance of complex business data. In particular, it has significantly improved data usability in the field of medical data.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 腾讯医疗健康(深圳)有限公司
- Filing Date
- 2022-12-07
- Publication Date
- 2026-06-30
AI Technical Summary
Existing data governance methods struggle to guarantee accuracy: they may treat values that only look abnormal as true anomalies, which leads to inaccurate business data.
By parsing the table creation statements of the business data tables to be governed, data table information is obtained. Based on the data table information, database table relationships are mined, and after partitioning, business themes are merged. Data standardization governance is then performed on the business data tables under each business theme to ensure the accuracy of data governance.
It enables accurate identification and standardized processing of relationships in complex business data tables, improving the accuracy and effectiveness of data governance, and significantly enhancing data availability, especially in medical data governance.
[Figure: CN115858513B_ABST]
Description
Technical Field
[0001] This application relates to the field of computer technology, and in particular to a data governance method, apparatus, computer equipment, storage medium, and computer program product.
Background Technology
[0002] With the development of computer technology, big data technology has emerged. Big data, also known as massive data, refers to data volumes so large that mainstream software tools cannot capture, manage, process, and organize it into information to aid business decision-making within a reasonable timeframe. In the field of big data, a core business database can aggregate thousands of data tables, covering multiple different business modules. The relationships between these tables are intricate, and the daily increment of a single core table can reach millions. Therefore, the challenge lies in how to capture the data relationships between these tables and perform data governance and standardization.
[0003] Current data governance methods generally involve identifying anomalous data. This means detecting anomalous data items and comparing them with their corresponding normal data items to correct the anomalous data and achieve the goal of data governance. However, the accuracy of this method is difficult to guarantee. Sometimes, depending on business needs, some seemingly abnormal values may not be truly anomalous, and removing them would lead to inaccurate business data.
Summary of the Invention
[0004] Therefore, it is necessary to provide a data governance method, apparatus, computer equipment, computer-readable storage medium, and computer program product that can improve the accuracy of data governance in response to the above-mentioned technical problems.
[0005] In a first aspect, this application provides a data governance method. The method includes:
[0006] performing database statement parsing on the table creation statement of the business data table to be governed to obtain data table information;
[0007] taking the business data table to be governed as an entity, and performing database table relationship mining on the business data table to be governed based on the data table information to obtain the data table relationships;
[0008] based on the partitioned business data tables obtained by partitioning the business data table to be governed, and on the data table relationships, performing business theme merging on each partitioned business data table to obtain the business data tables under each business theme; and
[0009] performing data standardization governance on the business data tables under each business theme to obtain data governance results.
[0010] In a second aspect, this application also provides a data governance apparatus. The apparatus includes:
[0011] a statement parsing module, used to perform database statement parsing on the table creation statement of the business data table to be governed to obtain data table information;
[0012] a relationship mining module, used to take the business data table to be governed as an entity and perform database table relationship mining on the business data table to be governed based on the data table information to obtain the data table relationships;
[0013] a business theme merging module, used to perform business theme merging on each partitioned business data table, based on the partitioned business data tables obtained by partitioning the business data table to be governed and on the data table relationships, to obtain the business data tables under each business theme; and
[0014] a data governance module, used to perform data standardization governance on the business data tables under each business theme to obtain data governance results.
[0015] In a third aspect, this application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to perform the following steps:
[0016] performing database statement parsing on the table creation statement of the business data table to be governed to obtain data table information;
[0017] taking the business data table to be governed as an entity, and performing database table relationship mining on the business data table to be governed based on the data table information to obtain the data table relationships;
[0018] based on the partitioned business data tables obtained by partitioning the business data table to be governed, and on the data table relationships, performing business theme merging on each partitioned business data table to obtain the business data tables under each business theme; and
[0019] performing data standardization governance on the business data tables under each business theme to obtain data governance results.
[0020] In a fourth aspect, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, performs the following steps:
[0021] performing database statement parsing on the table creation statement of the business data table to be governed to obtain data table information;
[0022] taking the business data table to be governed as an entity, and performing database table relationship mining on the business data table to be governed based on the data table information to obtain the data table relationships;
[0023] based on the partitioned business data tables obtained by partitioning the business data table to be governed, and on the data table relationships, performing business theme merging on each partitioned business data table to obtain the business data tables under each business theme; and
[0024] performing data standardization governance on the business data tables under each business theme to obtain data governance results.
[0025] In a fifth aspect, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, performs the following steps:
[0026] performing database statement parsing on the table creation statement of the business data table to be governed to obtain data table information;
[0027] taking the business data table to be governed as an entity, and performing database table relationship mining on the business data table to be governed based on the data table information to obtain the data table relationships;
[0028] based on the partitioned business data tables obtained by partitioning the business data table to be governed, and on the data table relationships, performing business theme merging on each partitioned business data table to obtain the business data tables under each business theme; and
[0029] performing data standardization governance on the business data tables under each business theme to obtain data governance results.
[0030] The aforementioned data governance method, apparatus, computer device, storage medium, and computer program product first perform database statement parsing on the table creation statements of the business data table to be governed, thereby obtaining the data table information. Then, with the business data table to be governed treated as an entity, database table relationship mining is performed based on the data table information to obtain the data table relationships, so that the relationships between the data tables can be identified from the resulting entity-relationship diagram. Next, based on the partitioned business data tables obtained by partitioning the business data table to be governed, and on the data table relationships, business theme merging is performed on each partitioned business data table to obtain the business data tables under each business theme; the data table relationships provide data support for business theme merging. Finally, data normalization governance is performed on the business data tables under each business theme to obtain the data governance results, thereby achieving metadata governance and normalization across different business modules. Through automated statement parsing, data table relationship mining, and data table relationship analysis, this application quickly parses the core data assets and related business data tables of the original business during data governance and standardization. The parsed data relationships then provide data support for business theme merging and business theme standardization, which completes the data governance operation and ensures its accuracy.
Attached Figure Description
[0031] Figure 1 is a diagram illustrating the application environment of a data governance method in one embodiment;
[0032] Figure 2 is a flowchart illustrating a data governance method in one embodiment;
[0033] Figure 3 is a schematic diagram illustrating data relationship redundancy caused by tables sharing the same primary key in one embodiment;
[0034] Figure 4 is a schematic diagram of the cluster mining results in one embodiment;
[0035] Figure 5 is a schematic diagram illustrating the minimum data connection relationships in one embodiment;
[0036] Figure 6 is a schematic diagram illustrating the merging of business themes based on data relationship mining results in one embodiment;
[0037] Figure 7 is a schematic diagram illustrating the merging of business tables under a business theme in one embodiment;
[0038] Figure 8 is a schematic flowchart of theme normalization in one embodiment;
[0039] Figure 9 is a schematic diagram of the overall architecture and system structure for data governance and standardization in one embodiment;
[0040] Figure 10 is a flowchart illustrating the data relationship mining process in one embodiment;
[0041] Figure 11 is a structural block diagram of a data governance device in one embodiment;
[0042] Figure 12 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0043] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0044] This application relates to the field of cloud technology, which refers to a hosting technology that unifies hardware, software, and network resources within a wide area network (WAN) or local area network (LAN) to achieve data computation, storage, processing, and sharing. Cloud technology is a general term encompassing network technology, information technology, integration technology, management platform technology, and application technology applied to cloud computing business models. It can form resource pools, providing flexible and convenient on-demand access. Cloud computing technology will become a crucial support. Backend services of technical network systems require substantial computing and storage resources, such as video websites, image websites, and many portal websites. With the rapid development and application of the internet industry, every item may have its own identification mark in the future, requiring transmission to backend systems for logical processing. Data at different levels will be processed separately, and various industry data will require robust system support, which can only be achieved through cloud computing.
[0045] This application specifically relates to Big Data technology within cloud computing. Big Data refers to data sets that cannot be captured, managed, and processed within a certain timeframe using conventional software tools. It represents massive, rapidly growing, and diverse information assets that require new processing models to achieve stronger decision-making, insightful discovery, and process optimization capabilities. With the advent of the cloud era, Big Data has attracted increasing attention. Big Data requires specialized technologies to effectively process large amounts of data within a tolerable timeframe. Technologies suitable for Big Data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the Internet, and scalable storage systems.
[0046] The following terms are used in this application:
[0047] Primary key to primary key association (primary-primary-edge, PPE): also known as the primary key relationship; the relationship between data tables that have the same primary key, which are connected through that primary key.
[0048] Primary-foreign-edge (PFE): also known as primary-foreign-key relationship. If the primary key of table A is the foreign key of table B, then the connection relationship between tables A and B in the Entity Relationship Diagram (ER Diagram) is a primary-foreign-key relationship.
[0049] Community algorithms: also known as community discovery algorithms, can discover useful information from graph networks. For example, they can be used to find the largest clique in a graph network, and can also be used to calculate the centrality of connecting edges to obtain the importance of the edges.
[0050] The data governance method provided in the embodiments of this application can be applied in the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a network. A data storage system stores the data that server 104 needs to process. The data storage system is integrated on server 104 and is specifically a data warehouse. When a user on terminal 102 needs to perform data governance on specified business data, the governance process can be carried out through server 104. First, terminal 102 specifies the business data to be processed. Server 104 retrieves the corresponding business data table to be governed, synchronizes it to the data warehouse on server 104, and performs database statement parsing on the table creation statements of the business data table to be governed to obtain data table information. The business data table to be governed is treated as an entity, and database table relationship mining is performed on it based on the data table information to obtain the data table relationships. Based on the partitioned business data tables obtained by partitioning the business data table to be governed, and on the data table relationships, business theme merging is performed on each partitioned business data table to obtain the business data tables under each business theme. Data normalization governance is then performed on the business data tables under each business theme to obtain the data governance result. Server 104 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. 
The terminal 102 can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, etc., but is not limited to these. The terminal and server can be directly or indirectly connected via wired or wireless communication, and this application does not impose any restrictions.
[0051] In one embodiment, as shown in Figure 2, a data governance method is provided. Taking its application to server 104 in Figure 1 as an example, the method includes the following steps:
[0052] Step 201: Perform database statement parsing on the table creation statement of the business data table to be governed to obtain the data table information.
[0053] The business data table to be governed refers to an external business data table, which is a temporary grid virtual table used to record detailed business-related data. The purpose of this application is to perform data governance on the data in the business data table to be governed, so as to improve the availability of this data. In one embodiment, the business data table to be governed is specifically a medical-related business data table. Due to the complex relationships between the various tables in medical business data and the huge daily increase in the core single table, how to extract the data relationships between the various tables from the complex medical business modules, and to jointly govern and standardize the data of the related tables to provide a good data foundation for data application, is an urgent problem to be solved. This application can be used to perform overall governance of this medical data, while improving the accuracy of data governance and ensuring the availability of medical data. The table creation statement refers to the database statement used to create the business data table to be governed, which specifically includes the structure of the data table, field names, field explanations, primary key information, etc. The table creation statements are originally used to create business data tables on the business database side, and are therefore stored there. When it is necessary to manage these business data tables in the business database, these tables need to be synchronized from the business database to the data warehouse of server 104 as the business data tables to be managed. At this time, server 104 will simultaneously retrieve these table creation statements from the business database for analysis of the relationships between the business tables. Database statement parsing refers to the process of obtaining relevant information about the business data tables to be managed by parsing the table creation statements. 
The obtained table information specifically includes key information such as table name, field information, primary and foreign keys. The method of database statement parsing can be determined according to the complexity of the table creation statement. For more complex table creation statements, lexical analysis and syntax analysis can be used for parsing, while for simpler table creation statements, regular expressions can be used directly for parsing.
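As an illustration of the regular-expression route mentioned above, the following is a minimal sketch rather than the parser used in this application. It assumes a MySQL-like dialect with one column or constraint definition per line; the table `visit_record`, its columns, and the helper `parse_create_table` are hypothetical names introduced only for this example.

```python
import re

# Hypothetical table creation statement (MySQL-like); all names are invented.
DDL = """
CREATE TABLE visit_record (
    visit_id BIGINT NOT NULL COMMENT 'visit identifier',
    patient_id BIGINT NOT NULL COMMENT 'patient identifier',
    visit_date DATE COMMENT 'date of visit',
    PRIMARY KEY (visit_id),
    FOREIGN KEY (patient_id) REFERENCES patient (patient_id)
);
"""

TABLE_RE = re.compile(r"CREATE\s+TABLE\s+`?(\w+)`?\s*\((.*)\)", re.I | re.S)
PK_RE = re.compile(r"PRIMARY\s+KEY\s*\(([^)]*)\)", re.I)
FK_RE = re.compile(
    r"FOREIGN\s+KEY\s*\(([^)]*)\)\s*REFERENCES\s+`?(\w+)`?\s*\(([^)]*)\)", re.I
)
COL_RE = re.compile(r"`?(\w+)`?\s+(\w+)")


def _names(csv: str) -> list:
    """Split a comma-separated column list and strip spaces and backticks."""
    return [c.strip(" `") for c in csv.split(",")]


def parse_create_table(ddl: str) -> dict:
    """Extract table name, columns, primary key, and foreign keys.

    Assumes one definition per line; a real parser for complex statements
    would need lexical and syntax analysis instead.
    """
    match = TABLE_RE.search(ddl)
    if match is None:
        raise ValueError("not a CREATE TABLE statement")
    info = {"table": match.group(1), "columns": [],
            "primary_key": [], "foreign_keys": []}
    for raw in match.group(2).splitlines():
        line = raw.strip().rstrip(",")
        if not line:
            continue
        pk = PK_RE.match(line)
        fk = FK_RE.match(line)
        if pk:
            info["primary_key"] = _names(pk.group(1))
        elif fk:
            info["foreign_keys"].append({
                "columns": _names(fk.group(1)),
                "ref_table": fk.group(2),
                "ref_columns": _names(fk.group(3)),
            })
        else:
            col = COL_RE.match(line)
            if col:
                info["columns"].append(
                    {"name": col.group(1), "type": col.group(2)})
    return info


info = parse_create_table(DDL)
print(info["table"], info["primary_key"], info["foreign_keys"])
```

The sketch deliberately ignores comments that contain line breaks, composite constraints spread over several lines, and dialect-specific options; those are the cases where lexical and syntax analysis would be the safer choice.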
[0054] Specifically, during the data governance process, users can submit corresponding data governance requests to server 104 through terminal 102. After receiving the data governance request, server 104 determines the business data tables to be governed and then synchronizes these business data tables to the data warehouse and performs partitioning processing. The solution of this application mainly analyzes the relationships in the business data tables to be governed and then realizes data governance based on the relationships between the business data tables. Therefore, when performing data governance, it is first necessary to use automated database statement parsing to parse the table creation statements to obtain the corresponding data table information. Then, based on the data table information, data table relationship analysis can be performed for the data governance process. In one embodiment, the solution of this application is specifically applicable to data governance related to medical data. In this case, the business data tables to be governed can be retrieved from the medical data and information system, and the corresponding table creation statements can be obtained. Then, regular expressions are used to parse the table creation statements to obtain the data table information.
[0055] Step 203: Treat the business data table to be governed as an entity, and perform database table relationship mining based on the data table information to obtain the data table relationship.
[0056] In this context, "entity" refers to objectively existing and distinguishable things. Specifically, this application's solution uses each business data table to be governed as an entity for correlation analysis, thereby determining the relationships between them and achieving effective data governance. The analysis of these relationships can be implemented using the Entity Relationship Diagram (ER Diagram), which provides a method for representing entity types, attributes, and relationships and is an effective way to describe conceptual models of the real world. Entity types are represented by rectangles, with the entity name written inside. Entity attributes are represented by ellipses or rounded rectangles, connected to their corresponding entity types by solid lines. Relationships between entity types are represented by diamonds, with the relationship name written inside and connected to the relevant entity types by solid lines, with the relationship type (1:1, 1:n, or m:n) labeled next to the solid lines. In this application's solution, the business data table to be governed is treated as an entity in the ER diagram, the data content of the business data table is treated as the data in the ER diagram, and the relationships between data tables, such as primary key relationships and primary-foreign key relationships, are treated as the connections between entities. The database table relationship mining process mainly utilizes primary key to primary key association information and primary key to foreign key association information.
[0057] Specifically, in the data governance process, after obtaining the data table information, the relationships between the data tables can be analyzed based on this information. This analysis can be implemented using an ER diagram. Specifically, the table names of each business data table to be governed, as well as the primary key and foreign key relationships between the tables, can be extracted from the data table information. The table names of the business data tables to be governed are then used to construct the entities of the ER diagram, and the attributes of each entity are obtained from the corresponding data table information. Furthermore, the relationships between the entities are determined from the primary key and foreign key relationships in the data table information, thus constructing the ER diagram. The relationships between the data tables are then read from the ER diagram.
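The construction described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the mining procedure of this application: the `TABLES` dictionary, its field layout, and the function `build_er_edges` are hypothetical, and every foreign key is assumed to reference the primary key of the table it points to.

```python
from itertools import combinations

# Hypothetical data table information, shaped like the output of a DDL parser.
# "foreign_keys" maps a column to the table whose primary key it references.
TABLES = {
    "patient": {"primary_key": ("patient_id",), "foreign_keys": {}},
    "patient_profile": {"primary_key": ("patient_id",), "foreign_keys": {}},
    "visit_record": {
        "primary_key": ("visit_id",),
        "foreign_keys": {"patient_id": "patient"},
    },
}


def build_er_edges(tables: dict) -> set:
    """Return ER edges as (table_a, table_b, kind) with kind in {"PFE", "PPE"}."""
    edges = set()
    # Primary-foreign-edge: the primary key of A is a foreign key of B.
    for name, info in tables.items():
        for ref_table in info["foreign_keys"].values():
            if ref_table in tables:
                edges.add((ref_table, name, "PFE"))
    # Primary-primary-edge: two tables share the same primary key.
    for a, b in combinations(sorted(tables), 2):
        if tables[a]["primary_key"] == tables[b]["primary_key"]:
            edges.add((a, b, "PPE"))
    return edges


edges = build_er_edges(TABLES)
for edge in sorted(edges):
    print(edge)
```

Each table name becomes an entity and each tuple becomes a connection between two entities, labeled with the kind of key relationship that produced it.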
[0058] Step 205: Based on the partitioned business data tables and data table relationships obtained from the partitioning of the business data tables to be governed, perform business theme merging processing on each partitioned business data table to obtain the business data tables under each business theme.
[0059] Partitioning, or data partitioning, refers to dividing logically unified data into smaller, independently manageable physical units, also known as data segmentation or data sharding. In this application's solution, the business data table to be governed is partitioned into various partition business data tables using the partitioning field carried in the table. Business theme merging refers to merging different partition business data tables according to the dimension of business theme for processing. Specifically, a business theme refers to different types of data divided based on the business logic corresponding to the business data.
[0060] Specifically, the solution in this application also includes a process of synchronizing and partitioning the business data tables to be governed. After determining the business data tables to be governed, server 104 synchronizes them from the business system to the data warehouse, by full or incremental synchronization, so that data governance can begin. Synchronization mainly takes place at the source (stg) layer of the data warehouse. Then, from the source layer to the first data operation (ods1) layer, the synchronized business data tables are partitioned according to the partitioning field to obtain the partitioned business data tables. Next, based on the partitioned business data tables and the data table relationships, the partitioned business data tables are merged by business theme to obtain the business data tables under each business theme. Business themes can be divided according to the actual business logic, but because manual division is labor-intensive and prone to data table errors, this application uses the results of data relationship mining to divide the themes and merge the data tables under each business theme.
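One simple way to picture theme division from mined relationships is to group tables that are connected to each other. The sketch below uses plain connected components (union-find) as a stand-in; this is an assumption made for illustration and is not stated to be the grouping algorithm of this application, which refers to community algorithms and cluster mining. The table names and the function `group_into_themes` are hypothetical.

```python
def group_into_themes(tables: list, edges: list) -> list:
    """Group tables into candidate themes: one theme per connected component."""
    parent = {t: t for t in tables}

    def find(t: str) -> str:
        # Follow parent links to the root, halving the path as we go.
        while parent[t] != t:
            parent[t] = parent[parent[t]]
            t = parent[t]
        return t

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    themes = {}
    for t in tables:
        themes.setdefault(find(t), []).append(t)
    # Sort for a deterministic result.
    return sorted(sorted(group) for group in themes.values())


# Hypothetical partitioned tables and mined relationships between them.
tables = ["patient", "visit_record", "prescription", "drug", "drug_stock"]
edges = [
    ("patient", "visit_record"),
    ("visit_record", "prescription"),
    ("drug", "drug_stock"),
]
themes = group_into_themes(tables, edges)
print(themes)
```

Connected components will put any two linked tables in the same theme, so a single stray relationship can fuse two themes; weighting or pruning edges before grouping is one reason a community algorithm may be preferred in practice.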
[0061] Step 207: Perform data standardization and governance on the business data tables under each business theme to obtain the data governance results.
[0062] Data normalization governance refers to processing business data tables through data normalization to obtain standardized business data tables. Data normalization is needed because data in practical applications is relatively complex. It is a data transformation method from data mining that converts or unifies data into a form suitable for mining.
[0063] Specifically, after the themes are merged, the table structure of each business data table changes, and the tables incorporate a large amount of content from various business themes. While this meets the needs of business applications, it may not meet the database table normalization requirements. Therefore, data normalization governance is used to ensure that the business data tables after the business themes are merged meet the database table normalization requirements. In a specific embodiment, the process of data normalization governance for business data tables includes three data governance processes: primary key normalization governance, data field normalization governance, and inter-table relationship normalization governance. After data normalization processing, the corresponding business data tables are obtained. These business data tables are structured and stored in a data warehouse for subsequent big data analysis. In one embodiment, the solution of this application is specifically applicable to data governance related to medical data. In this case, data normalization governance can be performed on the business data tables under each business theme. After obtaining the data governance results, data mining and data analysis can be performed based on the data governance results according to the needs of medical data analysis, to ensure the effective execution of the medical data analysis process and improve the accuracy of medical data analysis.
[0064] The aforementioned data governance method first parses the table creation statements of the business data table to be governed to obtain information about the various data tables contained within it. Then, treating the business data table to be governed as an entity, it performs database-table relationship mining based on the data table information to obtain the table relationships. Next, based on the partitioned business data tables obtained from partitioning the business data table to be governed and their relationships, it merges the business data tables of each partition according to business themes, resulting in business data tables under each business theme. These business data relationships provide data support for the merging of business themes. Finally, it performs data normalization governance on the business data tables under each business theme to obtain the data governance results, thereby achieving metadata governance and normalization between different business modules. This application, through automated statement parsing, data table relationship mining, and data table relationship analysis, rapidly parses the core data assets and related business data tables in the original business during the data governance and standardization process. Then, based on the parsed data relationships, it provides data support for business theme merging and business theme normalization, completing the data governance operation and ensuring the accuracy of data governance.
[0065] In one embodiment, the method further includes: performing full synchronization processing on the historical data in the business data table to be governed to obtain a historical business data table; partitioning the historical business data table according to the partitioning field of the historical business data table, and performing partitioning field transformation processing and business data deduplication processing on the partitioned historical business data table to obtain a partitioned business data table.
[0066] Full synchronization is one type of data synchronization. Data is extracted from the business database and transferred to the data warehouse, and discrepancies between the two sides can arise in this process. Full synchronization copies all data from the business database to the data warehouse, which is the simplest way to keep the two sides consistent. In contrast, incremental synchronization copies only newly added and changed data from the business database to the data warehouse. In a specific implementation, the data synchronization cycle can be daily. Historical data in the business data tables to be governed specifically refers to all business data tables to be governed. Historical business data tables are those stored in the data warehouse after full synchronization, and the data stored in historical business data tables is consistent with the data stored in the business data tables to be governed. Partition fields refer to the data table fields used for partitioning the business data tables to be governed. After data synchronization, the historical business data tables can be partitioned based on the original partition fields. Partition field transformation and business data deduplication are used to transform the partition fields of the partitioned business tables into fields with business meaning and attributes. They specifically include three processes: partition splitting, partition extraction, and partition deduplication and merging.
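The difference between the two synchronization modes can be sketched with in-memory rows. This is a minimal illustration under assumptions: the row layout, the `updated_at` change marker, and the functions `full_sync` and `incremental_sync` are hypothetical, and a real pipeline would read from the business database rather than from a Python list.

```python
from datetime import date


def full_sync(source_rows: list) -> dict:
    """Copy every source row into a fresh warehouse table keyed by id."""
    return {row["id"]: dict(row) for row in source_rows}


def incremental_sync(warehouse: dict, source_rows: list, last_sync: date) -> int:
    """Upsert only rows added or changed after the previous sync.

    Returns the number of rows applied. Deletions in the source are not
    detected by this sketch.
    """
    changed = [r for r in source_rows if r["updated_at"] > last_sync]
    for row in changed:
        warehouse[row["id"]] = dict(row)
    return len(changed)


# Day 1: full synchronization of the historical data.
source = [
    {"id": 1, "value": "a", "updated_at": date(2022, 12, 1)},
    {"id": 2, "value": "b", "updated_at": date(2022, 12, 1)},
]
warehouse = full_sync(source)

# Day 2: one row changes and one row is added, then an incremental sync runs.
source[1] = {"id": 2, "value": "b2", "updated_at": date(2022, 12, 2)}
source.append({"id": 3, "value": "c", "updated_at": date(2022, 12, 2)})
applied = incremental_sync(warehouse, source, last_sync=date(2022, 12, 1))
print(applied, sorted(warehouse))
```

The incremental path depends on a reliable change marker in the source tables; where none exists, a periodic full synchronization remains the simpler way to keep both sides consistent.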
[0067] Specifically, during data governance, the business data tables to be governed in the business database can be synchronized and saved to the data warehouse first. The data warehouse then facilitates data governance. Specifically, the data warehouse can obtain the business data tables to be governed stored in the business system database through a full synchronization. Historical data within these tables is then fully synchronized. This full synchronization process involves creating an empty table in the data warehouse based on the table creation statements of the business data tables to be governed, and then synchronizing the historical data from the business data tables to the empty table, thus obtaining the historical business data table. After the full synchronization, the data warehouse generates historical business data tables corresponding to the business data tables to be governed. These historical business data tables can then be partitioned using their original partition fields. The partitioned historical business data tables then undergo partition field transformation and business data deduplication. This involves three processes: partition splitting, partition extraction, and partition deduplication and merging, to process the partitioned historical business data tables, resulting in partitioned business data tables. At this point, the partition fields of the partitioned business data tables have been transformed into fields with business meaning and attributes. In one embodiment, data synchronization and data governance are performed on a daily basis. Each day, a full synchronization of historical data in the business data table to be governed is performed first. Following this, subsequent periodic incremental synchronization is used to synchronize real-time updated business data, ensuring that data governance covers all business data. 
In this embodiment, through data synchronization and data partitioning, a more accurate partitioned business data table can be obtained, ensuring the rationality and effectiveness of data governance.
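The full synchronization and partitioning steps described above can be sketched as follows. This is a minimal illustration rather than the implementation of this application: it assumes SQLite (Python's standard `sqlite3` module) stands in for both the business database and the data warehouse, and the `visit` table, its columns, and the helper names `full_sync` and `partition_by_field` are hypothetical.

```python
import sqlite3
from collections import defaultdict

# Hypothetical table creation statement of a business data table to be governed.
CREATE_SQL = "CREATE TABLE visit (id INTEGER PRIMARY KEY, dept TEXT, visit_date TEXT)"


def full_sync(source, warehouse, table, create_sql):
    """Create an empty table from the DDL, then copy all historical rows into it."""
    warehouse.execute(create_sql)
    rows = source.execute(f"SELECT * FROM {table}").fetchall()
    if rows:
        placeholders = ", ".join("?" for _ in rows[0])
        warehouse.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
    warehouse.commit()
    return len(rows)


def partition_by_field(conn, table, field):
    """Group the synchronized rows by the value of the partition field."""
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    index = columns.index(field)
    partitions = defaultdict(list)
    for row in cursor:
        partitions[row[index]].append(row)
    return dict(partitions)


# Business database with three historical rows.
source = sqlite3.connect(":memory:")
source.execute(CREATE_SQL)
source.executemany(
    "INSERT INTO visit VALUES (?, ?, ?)",
    [
        (1, "cardiology", "2022-12-01"),
        (2, "cardiology", "2022-12-02"),
        (3, "oncology", "2022-12-01"),
    ],
)
source.commit()

# Data warehouse: full synchronization followed by partitioning on visit_date.
warehouse = sqlite3.connect(":memory:")
synced = full_sync(source, warehouse, "visit", CREATE_SQL)
partitions = partition_by_field(warehouse, "visit", "visit_date")
```

Partition field transformation and deduplication would then operate on each entry of `partitions`; they are omitted here because the text does not specify their details.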
[0068] In one embodiment, step 201 includes: performing lexical analysis on the table creation statement of the business data table to be governed to obtain lexical tags; performing syntax analysis on the lexical tags to obtain an abstract syntax tree; and obtaining data table information based on the abstract syntax tree.
[0069] Database statement parsing is specifically used to extract table names, database names, and values of related fields from database statements. Database statement parsing consists of two parts: lexical analysis and syntax analysis. Lexical analysis transforms the input into lexical tags, which include keywords and non-keywords. Syntax analysis, on the other hand, is the process of generating an abstract syntax tree (AST). The AST transforms the table creation statements of the business data tables to be managed into structured data for analysis, thereby accurately extracting various types of data table information.
[0070] Specifically, the solution in this application can parse database table creation statements using a combination of lexical and syntactic analysis. The parsing process first involves lexical analysis of the table creation statement to be processed, obtaining corresponding lexical tags. Then, syntactic analysis is used to transform the table creation statement into an abstract syntax tree based on the obtained lexical tags. Finally, based on the abstract syntax tree, the table structure, field names, field explanations, primary key information, and other table information can be easily analyzed. In another embodiment, where the table creation statement is relatively simple, database statement parsing can be directly achieved using regular expressions. Regular expressions, also known as rule expressions, are text patterns that include ordinary characters (e.g., letters from a to z) and special characters (called "metacharacters"), and are a concept in computer science. Regular expressions use a single string to describe and match a series of strings that match a certain syntax rule, and are typically used to retrieve and replace text that conforms to a certain pattern (rule). For simple table creation statements, regular expressions can be used for direct parsing. In this embodiment, by constructing an abstract syntax tree, the table creation statements of the business data table to be governed can be effectively parsed, ensuring the accuracy of data table information extraction.
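As an illustration of the two-stage parsing described above, the following sketch tokenizes a table creation statement and then builds a small abstract syntax tree from the tokens. It is a simplified example under stated assumptions: it handles only a narrow subset of MySQL-style `CREATE TABLE` syntax, and the names `tokenize`, `parse_create_table`, and the `patient_visit` table are hypothetical, not taken from this application.

```python
import re

TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<string>'[^']*')
      | (?P<ident>`[^`]+`|[A-Za-z_][A-Za-z0-9_]*)
      | (?P<number>\d+)
      | (?P<punct>[(),;])
    )""", re.VERBOSE)
KEYWORDS = {"CREATE", "TABLE", "PRIMARY", "KEY", "NOT", "NULL", "COMMENT"}


def tokenize(sql):
    """Lexical analysis: turn the statement into (kind, text) lexical tags."""
    sql, pos, tokens = sql.strip(), 0, []
    while pos < len(sql):
        match = TOKEN_RE.match(sql, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        kind, text = match.lastgroup, match.group(match.lastgroup)
        if kind == "ident" and text.startswith("`"):
            text = text.strip("`")  # quoted identifiers are never keywords
        elif kind == "ident" and text.upper() in KEYWORDS:
            kind, text = "keyword", text.upper()
        tokens.append((kind, text))
        pos = match.end()
    return tokens


def parse_create_table(tokens):
    """Syntax analysis: build a small AST (a dict) from the lexical tags."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else (None, None)

    def expect(kind, text=None):
        nonlocal pos
        tok_kind, tok_text = peek()
        if tok_kind != kind or (text is not None and tok_text != text):
            raise ValueError(f"expected {text or kind}, got {tok_text!r}")
        pos += 1
        return tok_text

    def accept(kind, text):
        nonlocal pos
        if peek() == (kind, text):
            pos += 1
            return True
        return False

    expect("keyword", "CREATE")
    expect("keyword", "TABLE")
    ast = {"table": expect("ident"), "columns": [], "primary_key": []}
    expect("punct", "(")
    while True:
        if accept("keyword", "PRIMARY"):  # table-level PRIMARY KEY (a, b)
            expect("keyword", "KEY")
            expect("punct", "(")
            ast["primary_key"].append(expect("ident"))
            while accept("punct", ","):
                ast["primary_key"].append(expect("ident"))
            expect("punct", ")")
        else:  # column definition: name type[(n)] [options]
            column = {"name": expect("ident"), "type": expect("ident"), "comment": None}
            if accept("punct", "("):
                expect("number")
                expect("punct", ")")
            while peek()[0] == "keyword":
                word = expect("keyword")
                if word == "COMMENT":
                    column["comment"] = expect("string").strip("'")
                elif word == "PRIMARY":
                    expect("keyword", "KEY")
                    ast["primary_key"].append(column["name"])
            ast["columns"].append(column)
        if not accept("punct", ","):
            expect("punct", ")")
            return ast


DDL = """
CREATE TABLE patient_visit (
    visit_id BIGINT NOT NULL COMMENT 'visit identifier',
    patient_id BIGINT NOT NULL COMMENT 'patient identifier',
    dept_name VARCHAR(64) COMMENT 'department',
    PRIMARY KEY (visit_id)
);
"""
ast = parse_create_table(tokenize(DDL))
```

The resulting `ast` exposes the table name, field names, field comments, and primary key as structured data. Foreign keys, indexes, and other clauses would need additional grammar rules.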
[0071] In one embodiment, the database table information is processed by the entity relationship graph method to obtain the data table relationship, which includes: extracting primary key-to-primary key relationship information and primary key-to-foreign key relationship information from the data table information; constructing a first entity relationship graph based on the primary key-to-primary key relationship information and primary key-to-foreign key relationship information; performing edge filtering on the first entity relationship graph through a community discovery algorithm to obtain a second entity relationship graph; and obtaining the data table relationship based on the second entity relationship graph.
[0072] In this context, a primary key-to-primary key relationship is the connection between tables that share the same primary key. A primary key-to-foreign key relationship is the connection between a table and another table in which its primary key appears as a foreign key. For example, if the primary key of table A is a foreign key of table B, then the connection between tables A and B in the entity relationship graph (ER graph) is a primary key-to-foreign key relationship. Community discovery algorithms can extract useful information from graph networks. For instance, they can be used to find the maximal cliques in a graph network and to calculate the centrality of connecting edges to determine their importance. The first entity relationship graph is the initially filtered entity relationship graph, while the second entity relationship graph is the minimally connected graph obtained by further filtering the first entity relationship graph.
[0073] Specifically, the entity-relationship graph method of this application is used for database table relationship mining. During relationship mining, firstly, primary key-to-primary key relationship information and primary key-to-foreign key relationship information are extracted from the data table information. Then, based on these two types of information, the relationships between various partition business data tables are determined. Next, based on these relationships, connections are established between entities (partition business data tables), constructing a first entity-relationship graph containing all relationships. Then, a community discovery algorithm is used to filter the edges of the first entity-relationship graph, removing less important connection edges to obtain a second entity-relationship graph. The second entity-relationship graph provides reliable data table relationships. In this embodiment, by constructing the entity-relationship graph corresponding to the business data tables using the two relationship attributes between business data tables—primary key-to-primary key relationship information and primary key-to-foreign key relationship information—and then using a community discovery algorithm for edge filtering, the relationship mining of data tables can be effectively performed, ensuring the accuracy of the data table relationship mining.
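The first step of the relationship mining above, extracting the two kinds of association edges from the data table information, might look like the following sketch. The `tables` dictionary layout, the table names, and the function `extract_edges` are assumptions made for illustration; this application does not prescribe a particular data structure.

```python
from itertools import combinations

# Hypothetical data table information, as produced by parsing the DDL.
tables = {
    "A": {"primary_key": "patient_id", "foreign_keys": []},
    "B": {"primary_key": "patient_id", "foreign_keys": []},
    "F": {"primary_key": "visit_id", "foreign_keys": ["patient_id"]},
}


def extract_edges(tables):
    """Return (PPE edges, PFE edges) derived from primary and foreign keys."""
    names = sorted(tables)
    # Primary key-to-primary key: two tables share the same primary key.
    ppe_edges = [
        (u, v)
        for u, v in combinations(names, 2)
        if tables[u]["primary_key"] == tables[v]["primary_key"]
    ]
    # Primary key-to-foreign key: the primary key of u is a foreign key of v.
    pfe_edges = [
        (u, v)
        for u in names
        for v in names
        if u != v and tables[u]["primary_key"] in tables[v]["foreign_keys"]
    ]
    return ppe_edges, pfe_edges


ppe_edges, pfe_edges = extract_edges(tables)
```

Here tables A and B share a primary key, so table F ends up linked to both of them. This is the kind of redundancy that the subsequent edge filtering is meant to remove.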
[0074] In one embodiment, constructing a first entity relationship graph based on primary key-to-primary key association information and primary key-to-foreign key association information includes: performing cluster mining on primary key-to-primary key association information to obtain primary key-to-primary key association information clusters; determining the node degree information of nodes in the primary key-to-foreign key association information; filtering edges in the primary key-to-primary key association information clusters based on the node degree information to obtain first filtered edges; filtering edges corresponding to the primary key-to-foreign key association information based on the node degree information to obtain second filtered edges; and constructing the first entity relationship graph based on the first filtered edges and the second filtered edges.
[0075] Cluster mining refers to performing clique mining on graph data to obtain primary key-to-primary key association information clusters (PPE clusters, also called cliques below). A node is an entity in the entity relationship graph, that is, a partitioned business data table. Node degree is a basic concept in graph structures: it is the number of edges incident to a node, also known as the degree of association.
[0076] Specifically, in actual business database tables, multiple business data tables often share the same primary key. These tables are related to one another, but retaining all of the relationship information leads to problems such as the scenario shown in Figure 3: the tables with the same primary key (A to E) are all interconnected, with dashed lines representing primary key-to-primary key relationships (PPE). At the same time, each of these tables is also joined to all other tables (F to H) with which it has primary key-to-foreign key relationships (PFE), represented by solid lines. The result is a chaotic table relationship structure. Figure 3 also shows that tables with the same primary key form a primary key association network once they are linked together. To identify the primary key association networks among all database tables, this application first applies the clique mining method of the community discovery algorithm to split all primary key associations into cliques. For example, as shown in Figure 4, after the PPE connection information is input, the maximal clique mining algorithm outputs the node information within each clique after clique splitting. Nodes within the same clique have the same primary key. The name of each clique and its node members are stored in dictionary form: {clique1: [A, B, C, D, E], clique2: [P, Q, R], clique3: [W, X, Y, Z]} represents three cliques. The node degrees are then calculated from all primary key-to-foreign key associations. Each primary key-to-foreign key association edge is represented as a triple (u, v, e), where u and v are the two endpoints of edge e. The degree D of each node is calculated as follows:
[0077] for (u, v, e) in all_PFE_edges:
[0078]     D(u) = D(u) + 1;
[0079]     D(v) = D(v) + 1;
[0080] `all_PFE_edges` represents all connection edges in the primary key and foreign key association information. This application only includes PFE connection information when calculating node degree, avoiding interference from redundant PPE information. Then, based on the node degree information, all edges in the primary key and primary key association information cluster can be filtered, removing some connection edges representing association relationships; the remaining edges are the first filtered edges. Similarly, based on the node degree information, all edges in the primary key and foreign key association information can be filtered, removing some connection edges representing association relationships; the remaining edges are the second filtered edges. The entity relationship graph containing only the association relationships represented by the first and second filtered edges is then used as the required first entity relationship graph. In this embodiment, cluster mining is performed using primary key association information to determine primary key and primary key association information clusters, and then edge filtering is performed twice using node degree. This effectively extracts important data table association relationships to construct the first entity relationship graph, ensuring the effectiveness of the entity relationship graph.
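The clique splitting and node degree calculation can be sketched as below. One simplification is assumed and should be noted: because tables sharing a primary key are all pairwise connected, each connected component of the PPE graph is already a clique, so the sketch uses a connected-component search in place of a general maximal clique mining algorithm. The function names and the sample edges are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def mine_cliques(ppe_edges):
    """Group nodes into cliques; assumes each PPE component is fully connected."""
    adjacency = defaultdict(set)
    for u, v in ppe_edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    cliques, seen = {}, set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], []
        while stack:
            node = stack.pop()
            members.append(node)
            for neighbour in adjacency[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        cliques[f"clique{len(cliques) + 1}"] = sorted(members)
    return cliques


def node_degrees(pfe_edges):
    """Count degrees from PFE edges only, so PPE edges do not interfere."""
    degree = defaultdict(int)
    for u, v in pfe_edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


# Sample input: two groups of tables sharing a primary key, plus PFE edges.
ppe_edges = list(combinations("ABCDE", 2)) + list(combinations("PQR", 2))
pfe_edges = [("A", "F"), ("A", "G"), ("A", "H"), ("B", "F")]

cliques = mine_cliques(ppe_edges)
degree = node_degrees(pfe_edges)
```

With this input, `cliques` takes the dictionary form described in the text, and node A has the highest degree among the members of the first clique.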
[0081] In one embodiment, filtering edges in the primary key-to-primary key association information cluster based on the node degree information to obtain the first filtered edges includes: determining, based on the node degree information, the key node with the largest node degree in the primary key-to-primary key association information cluster; and retaining the edges in the cluster that contain the key node to obtain the first filtered edges. Filtering the edges corresponding to the primary key-to-foreign key association information based on the node degree information to obtain the second filtered edges includes: retaining the edges in the primary key-to-foreign key association information that contain the key node to obtain the second filtered edges.
[0082] The key node is the node with the highest degree in the primary key-to-primary key association information cluster, and it represents the core table of a business theme. The filtered edges are the edges retained after filtering; edges representing other association information are removed.
[0083] Specifically, in the scheme of this application, a complete relationship graph can first be constructed from the primary key-to-primary key association information and the primary key-to-foreign key association information. Both kinds of association information are then filtered using the node degree information. First, the node degree information is used to determine the key node with the highest node degree in each primary key-to-primary key association information cluster; this node also represents the core table of a business theme. Then the primary key-to-primary key association information is filtered. Only the edges containing the key node are retained, that is, the primary key-to-primary key associations of the core table are kept, while the edges representing primary key-to-primary key associations between other nodes are removed. The retained edges are the first filtered edges. Similarly, for the primary key-to-foreign key association information, only the edges containing the key node are retained, that is, the primary key-to-foreign key associations of the core table are kept, while the primary key-to-foreign key associations of the other nodes in the cluster are removed. The retained edges are the second filtered edges. The graph containing only the first and second filtered edges is the first entity relationship graph. In one embodiment, the construction of the first entity relationship graph is shown before and after the first arrow in Figure 5. Nodes A, B, C, D, and E are connected to one another via primary key-to-primary key (PPE) relationships and are also connected to nodes F, G, and H via primary key-to-foreign key (PFE) relationships. After the node degrees are calculated, node A is identified as the key node, and the redundant associations can then be deleted. 
In the PPE cluster, every node other than key node A is connected to key node A, and the connection is marked as PPE. The other nodes within the PPE cluster are no longer connected to each other. Furthermore, only key node A establishes PFE connections with external nodes; the other nodes within the cluster no longer do. The result is shown in the upper right corner of Figure 5. By identifying key nodes, the solution in this application filters edges in a way that clarifies the relationships between nodes, reduces redundant edges, and ensures the effectiveness of the first entity relationship graph.
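The key-node filtering can be illustrated with the sketch below. It encodes one reading of the text, stated here as an assumption: PPE edges are kept only if they contain the key node, and PFE edges are kept if they contain the key node or touch no member of the cluster at all (so that edges between external nodes such as F and G survive, as in the Figure 5 example). The function name and sample edges are hypothetical.

```python
from itertools import combinations


def filter_by_key_node(cluster, ppe_edges, pfe_edges, degree):
    """Return (key node, first filtered edges, second filtered edges)."""
    members = set(cluster)
    # Key node: the cluster member with the largest PFE degree (ties: by name).
    key = max(sorted(members), key=lambda node: degree.get(node, 0))
    first_filtered = [
        (u, v) for u, v in ppe_edges
        if {u, v} <= members and key in (u, v)
    ]
    second_filtered = [
        (u, v) for u, v in pfe_edges
        if key in (u, v) or not ({u, v} & members)
    ]
    return key, first_filtered, second_filtered


cluster = ["A", "B", "C", "D", "E"]
ppe_edges = list(combinations(cluster, 2))
pfe_edges = [("A", "F"), ("A", "G"), ("A", "H"), ("B", "F"), ("C", "G"),
             ("F", "G"), ("H", "G")]

# Node degrees counted from PFE edges only.
degree = {}
for u, v in pfe_edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1

key, first_filtered, second_filtered = filter_by_key_node(
    cluster, ppe_edges, pfe_edges, degree
)
```

After filtering, the cluster forms a star around node A, and only node A keeps PFE connections to the external nodes F, G, and H.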
[0084] In one embodiment, the process of filtering edges in the first entity relationship graph using a community discovery algorithm to obtain a second entity relationship graph includes: calculating the centrality of the edges of the primary key and foreign key association information in the first entity relationship graph using a community discovery algorithm to obtain the centrality calculation result; and filtering edges in the first entity relationship graph based on the centrality calculation result to obtain the second entity relationship graph.
[0085] Centrality quantifies the importance of a node or an edge in a graph. In this application, edge centrality is mainly used to determine the importance of an edge.
[0086] Specifically, after the primary key-to-primary key association information and the primary key-to-foreign key association information have been filtered using the node degree information, a community discovery algorithm can be used to calculate the centrality of the edges representing primary key-to-foreign key association information (PFE information) in the first entity relationship graph. The centrality of an edge depends on the node information in the entity relationship graph, specifically on the nodes at the two ends of the edge. For each pair of nodes, it takes the ratio of the number of shortest paths between the two nodes that pass through this edge to the total number of shortest paths between the two nodes. The relationship is shown in the following formula:
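[0087] $$c(e) = \sum_{s,t \in V} \frac{\delta(s,t \mid e)}{\delta(s,t)}$$ where c(e) denotes the centrality of edge e and V denotes the set of nodes.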
[0088] In the above formula, s and t range over all nodes in the first entity relationship graph, δ(s,t) is the number of shortest paths from node s to node t, and δ(s,t|e) is the number of those shortest paths that pass through edge e. After the centrality of all edges has been calculated using the above formula, the first entity relationship graph is further filtered based on the centrality calculation results to obtain the second entity relationship graph. In this embodiment, the importance of edges is calculated using centrality to further filter the entity relationship graph, remove redundant edges, and ensure the effectiveness of the second entity relationship graph.
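A direct way to evaluate this edge centrality on a small graph is sketched below. It is a brute-force illustration, not necessarily the algorithm used in this application: it assumes an unweighted, undirected graph, counts each unordered node pair once, and applies no normalization. The function names and the sample graph are hypothetical.

```python
from collections import deque
from itertools import combinations


def bfs_counts(adjacency, source):
    """Return (distance, number of shortest paths) from source to each node."""
    dist, sigma = {source: 0}, {source: 1}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                sigma[neighbour] = 0
                queue.append(neighbour)
            if dist[neighbour] == dist[node] + 1:
                sigma[neighbour] += sigma[node]
    return dist, sigma


def edge_centrality(nodes, edges):
    """Sum, over node pairs, the share of shortest paths that use each edge."""
    adjacency = {node: set() for node in nodes}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    info = {node: bfs_counts(adjacency, node) for node in nodes}
    centrality = {edge: 0.0 for edge in edges}
    for s, t in combinations(nodes, 2):
        dist_s, sigma_s = info[s]
        dist_t, sigma_t = info[t]
        if t not in dist_s:
            continue  # s and t are not connected
        for u, v in edges:
            through = 0
            # The edge may be traversed in either direction on an s-t path.
            for a, b in ((u, v), (v, u)):
                if a in dist_s and b in dist_t and dist_s[a] + 1 + dist_t[b] == dist_s[t]:
                    through += sigma_s[a] * sigma_t[b]
            centrality[(u, v)] += through / sigma_s[t]
    return centrality


nodes = ["A", "F", "G", "H"]
edges = [("A", "F"), ("A", "G"), ("A", "H"), ("F", "G"), ("H", "G")]
centrality = edge_centrality(nodes, edges)
```

For larger graphs, a dedicated implementation such as Brandes' algorithm would be more efficient than this pairwise enumeration.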
[0089] In one embodiment, the process of filtering edges in the first entity relationship graph based on the centrality calculation results to obtain the second entity relationship graph includes: removing the connecting edges in the first entity relationship graph; and re-adding connecting edges in the first entity relationship graph in descending order of centrality to obtain the second entity relationship graph, wherein at least one of the two endpoints of each re-added connecting edge is an isolated node, an isolated node being a node without connection information.
[0090] Specifically, the solution in this application can filter the edges of the first entity relationship graph through centrality sorting. The edge filtering targets the connecting edges of the primary key-to-foreign key association information (PFE). The PFE connecting edges in the first entity relationship graph are cleared first and then re-added in descending order of centrality. For example, the centrality of all PFE connecting edges in the first entity relationship graph can be calculated first and stored as a centrality list in descending order, where each element in the list is represented as a triple (u, v, e). Then the PFE connecting edges in the current first entity relationship graph are deleted. Next, each edge e is taken from the centrality list in turn. If at least one of the two endpoints of edge e is an isolated node (that is, a node without any connection information), the edge is added back to the first entity relationship graph; otherwise, it is not. This continues until all edges in the centrality list have been evaluated. The resulting graph is the second entity relationship graph, which represents the minimal connectivity. In one embodiment, the construction of the second entity relationship graph is shown in the two entity relationship graphs in the lower middle of Figure 5. First, the centrality is calculated on the first entity relationship graph in the upper right corner to determine the centrality of the connecting edges for the primary key-to-foreign key associations of key node A, as well as for the primary key-to-foreign key associations between the external nodes. 
The centrality of AF is 50, AG is 45, AH is 43, FG is 22, and HG is 15. The resulting centrality list is {50, 45, 43, 22, 15}. The connecting edges are then taken in turn. First, the connecting edge AF is taken. Since F is an isolated node, node A is connected to node F. Similarly, node A is connected to node G and to node H. When the connecting edge FG is taken, both node F and node G are already connected to node A and are no longer isolated, so FG is not connected. Similarly, HG is not connected. Finally, the second entity relationship graph in the lower left corner is obtained. By using the centrality of connecting edges to filter edges, the solution in this application clarifies the associations between nodes, reduces redundant edges, and ensures the effectiveness of the second entity relationship graph.
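The re-adding procedure and the worked example above can be reproduced with a short sketch. The function name `rebuild_by_centrality` and the tuple layout are hypothetical; the centrality values are those given in the example, and the retained PPE edges are assumed to be the star around key node A.

```python
def rebuild_by_centrality(retained_edges, scored_pfe_edges):
    """Re-add PFE edges in descending centrality if an endpoint is isolated."""
    # Nodes that already have connection information after PFE edges are cleared.
    connected = set()
    for u, v in retained_edges:
        connected.update((u, v))
    kept = []
    for u, v, score in sorted(scored_pfe_edges, key=lambda edge: edge[2], reverse=True):
        if u not in connected or v not in connected:
            kept.append((u, v))
            connected.update((u, v))
    return kept


# PPE edges retained around key node A in the first entity relationship graph.
ppe_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("A", "E")]
# PFE edges with the centrality values from the example.
scored_pfe_edges = [
    ("F", "G", 22),
    ("A", "F", 50),
    ("H", "G", 15),
    ("A", "G", 45),
    ("A", "H", 43),
]

second_graph_pfe = rebuild_by_centrality(ppe_edges, scored_pfe_edges)
```

Edges AF, AG, and AH are re-added because F, G, and H are isolated when they are reached. FG and HG are skipped because both endpoints are already connected by then.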
[0091] In one embodiment, the process of merging business themes in partitioned business data tables based on data table relationships to determine the business themes of the partitioned business data tables and obtain business data tables under each business theme includes: identifying the core data tables of each business theme in the partitioned business data tables; determining the associated data tables of each business theme in the partitioned business data tables based on the data table relationships of the core data tables; and merging the data information of the associated data tables in the core data tables to obtain business data tables under each business theme.
[0092] The core data table is the central table of a business theme, and it is also the business data table that serves as a key node in the entity relationship graph. By identifying the relationships between the core data table and other data tables, the data tables among the partitioned business data tables that are related to the current business theme can be determined.
[0093] Specifically, when classifying business data tables, the core data table of each business theme in the partitioned business data tables can be identified through the constructed entity relationship graph. Each core data table corresponds to one business theme, while the other data tables that have primary key-to-primary key or primary key-to-foreign key relationships with the core data table supplement it with information about branch businesses. The theme merging process therefore takes the core table as the center and integrates the information from the related tables to obtain the business data table under each business theme. In one specific embodiment, the theme division and data table merging performed through data relationship mining in this application are shown schematically in Figure 6: the tables within the boxes are the core tables of business themes, while each dashed box represents a business theme centered on a core table and incorporating information from related tables. The business theme merging process also includes the merging of theme fields, and the field processing methods include field information replacement, field information expansion, and field information deletion. The field information merging process is shown in Figure 7. Taking the project filing theme in a certain business database as an example, the core table is the project filing summary information table, and the theme is integrated around this core table. Among its associated tables, the personnel basic information table is a dimension table. Although the "Personnel Type" field is also in the dimension table, it is retained because it can serve as a dimension field for statistical analysis. Because this field may contain missing or inaccurate information, its values are replaced based on the personnel basic information table. 
Project filing, transfer filing, and off-site filing information all belong to branch businesses within the filing business and record detailed data for three different filing businesses. Therefore, the serial number is used as the associated field to supplement information in the core table, which is a field expansion operation. In this embodiment, the core data table of the business theme is determined first, and the branch business information within the business theme is then determined from the data table relationships of the core data table. This effectively guarantees the accuracy of business theme division and thereby improves the effect of data governance.
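The field replacement and field expansion operations in the filing example can be sketched as follows. All field names (`serial_no`, `person_id`, `person_type`, `filing_kind`, `target_city`) and the function `merge_theme` are hypothetical stand-ins chosen for illustration; the actual table schemas are not given in the text.

```python
def merge_theme(core_rows, dimension_rows, branch_rows):
    """Merge related-table information into the core table of one theme."""
    dimension = {row["person_id"]: row for row in dimension_rows}
    branch = {row["serial_no"]: row for row in branch_rows}
    merged = []
    for core in core_rows:
        row = dict(core)
        # Field information replacement: trust the dimension table's value.
        person = dimension.get(core["person_id"])
        if person is not None:
            row["person_type"] = person["person_type"]
        # Field information expansion: add branch fields joined by serial number.
        detail = branch.get(core["serial_no"])
        if detail is not None:
            for field, value in detail.items():
                row.setdefault(field, value)
        merged.append(row)
    return merged


core_rows = [{"serial_no": "S1", "person_id": 7, "person_type": None}]
dimension_rows = [{"person_id": 7, "person_type": "employee"}]
branch_rows = [{"serial_no": "S1", "filing_kind": "transfer", "target_city": "Shenzhen"}]

merged = merge_theme(core_rows, dimension_rows, branch_rows)
```

Using `setdefault` for expansion keeps existing core fields untouched, so only fields that are new to the core table are added from the branch table.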
[0094] In one embodiment, obtaining data governance results based on business data tables under each business theme includes: determining the business primary key in the business data table; performing primary key normalization governance on the business data table based on the business primary key to obtain a primary key normalized data table; and performing data field normalization governance and inter-table relationship normalization governance on the primary key normalized data table to obtain the data governance result.
[0095] Regarding primary key normalization, since source tables in business databases typically use non-business primary keys as unique identifiers, subject tables need to determine business primary keys based on actual business usage, and then use these business primary keys for data deduplication and normalization. As for non-standardized data fields, since business data contains null values, values outside the specified range, and other non-standardized data, normalization is required. Non-standardized inter-table joins occur because data applications sometimes require joining tables from different subjects, but non-standardized source data can cause some joins to fail, thus requiring normalization.
[0096] Specifically, after the core data tables have been determined, data normalization governance can be performed based on these tables to improve data availability and ensure the effectiveness of data governance. First, primary key normalization governance is performed on the business data tables based on the business primary key, resulting in a primary key normalized data table. Primary key normalization deduplicates the business data, which prevents subsequent steps from processing duplicate data repeatedly and ensures the efficiency of data normalization governance. Then, data field normalization governance and inter-table relationship normalization governance are performed to obtain the data governance result. In one embodiment, as shown in Figure 8, after the business primary key has been determined, only the latest data is retained based on the update time (update_time), thereby deduplicating subject table A by primary key. In the actual governance process, once the data quality control standards have been determined, each field of the subject table is automatically matched with the corresponding quality control rules for quality verification, and non-compliant dirty data is filtered out and handed to the business side for governance. In this embodiment, data normalization governance effectively governs the business data tables under each business subject, thereby producing accurate data governance results.
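Primary key deduplication by update time, followed by rule-based quality verification, might be sketched as follows. The row layout, the `age` rule, and the function names are assumptions for illustration only, and `update_time` is assumed to be an ISO-style string so that string comparison orders rows chronologically.

```python
def deduplicate_latest(rows, business_key):
    """Keep, for each business primary key, the row with the latest update_time."""
    latest = {}
    for row in rows:
        key = tuple(row[field] for field in business_key)
        if key not in latest or row["update_time"] > latest[key]["update_time"]:
            latest[key] = row
    return list(latest.values())


def split_by_quality(rows, rules):
    """Separate rows that pass every field rule from non-compliant dirty rows."""
    clean, dirty = [], []
    for row in rows:
        passed = all(check(row.get(field)) for field, check in rules.items())
        (clean if passed else dirty).append(row)
    return clean, dirty


rows = [
    {"patient_id": 1, "age": 40, "update_time": "2022-12-01 08:00:00"},
    {"patient_id": 1, "age": 41, "update_time": "2022-12-03 09:30:00"},
    {"patient_id": 2, "age": -5, "update_time": "2022-12-02 10:00:00"},
]
# Hypothetical quality control rule: age must be present and within 0..150.
rules = {"age": lambda value: value is not None and 0 <= value <= 150}

deduplicated = deduplicate_latest(rows, ["patient_id"])
clean, dirty = split_by_quality(deduplicated, rules)
```

The dirty rows are collected rather than discarded, which matches the description that non-compliant data is handed to the business side for governance.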
[0097] In one embodiment, the method further includes: determining the entity relationship diagram corresponding to the data table association; providing the entity relationship diagram and obtaining the association adjustment information corresponding to the entity relationship diagram; adjusting the entity relationship diagram based on the association adjustment information to obtain the target entity relationship diagram; and constructing a data warehouse model based on the target entity relationship diagram, wherein the data warehouse model is used to structurally store the data governance results.
[0098] The data warehouse model is the model obtained through data warehouse modeling; it supports functions such as data storage, data aggregation, and data analysis.
[0099] Specifically, the data governance approach of this application can be applied to the field of data warehouse modeling. First, the entity relationship diagram (ER diagram) corresponding to the data table relationships is determined. The ER diagram is then presented to the modeling staff to obtain relationship adjustment information. Based on this adjustment information, the ER diagram is adjusted to obtain the target ER diagram. Finally, a data warehouse model is constructed based on the target ER diagram. During data warehouse construction, data insight and data table relationship analysis are typically performed to deepen the understanding of the business. The fastest way to gain data insight is to draw and display the ER diagram of the database tables involved in the data warehouse. In one embodiment, the user inputs the table creation statements for the relevant database tables, and this application automatically parses the statements and optimizes the entity connection relationships to obtain an optimized ER diagram. The optimized ER diagram is then displayed on a web page using a graph database (Neo4j). Based on this ER diagram, the user can make simple adjustments to quickly obtain the final ER diagram, which can then be entered into a database or data asset platform for storage and for the corresponding data warehouse modeling. In another embodiment, the solution of this application can also be applied to a data standard platform. On the data standard platform, users need to sort out and determine the relationships between all tables and fields in the database, including the association relationships between table fields, field value ranges, and so on. During this process, the same field often appears in different database tables, and fields with the same name in different database tables may actually store different data or have different association constraints (such as strong consistency or weak consistency). 
With this application, before field-level sorting is performed, the entity relationship diagram can be parsed quickly to locate the different association constraints and field-table relationships, and the different types of associations can be managed in the system. Linked changes can then be made during the sorting process, which ensures the internal consistency of the standard results and avoids repetitive work. When the data standard has been completed and is converted into the corresponding data quality control rules, the entity relationship diagram can also be converted quickly into association quality control rules and other rule content. In one embodiment, the business data table to be governed includes an original medical data table, and the method further includes: performing data governance on the original medical data table based on the data warehouse model to obtain the data governance result corresponding to the original medical data table; and structurally saving the data governance result corresponding to the original medical data table to the data warehouse model. In this embodiment, data warehouse modeling effectively constructs the corresponding data warehouse model, which enables subsequent work such as saving and analyzing the data governance results related to business data, thus ensuring the effectiveness of data processing.
[0100] This application also provides an application scenario in which the above-described data governance method is applied. Specifically, the data governance method is applied in this scenario as follows:
[0101] When users need to perform big-data analysis and the business systems behind the data are complex and diverse, they can directly use the data governance method of this application to extract the data relationships between tables in the business systems and to jointly govern and standardize the data in related tables, providing a solid data foundation for data applications. The overall architecture and system structure for governing and standardizing the business data within the business data source is shown in Figure 9. In the figure, the steps within the shaded boxes are the detailed execution processes of the steps within the octagonal boxes. Manual tasks and automatic tasks are counterparts: the former are triggered manually by the user, and the latter are executed automatically by the system. Step tasks are detailed descriptions of a specific step node and are therefore all executed within the shaded boxes. First, in the data source layer of the system, the business data tables within the business data source are updated in real time. The server implementing the data governance of this application synchronizes the business data tables from the business data source to the server, online or offline, through initial synchronization (full synchronization) and configured scheduled synchronization (including normal synchronization and supplementary synchronization). As shown in the figure, the synchronized data table is named the X_history table. During data synchronization, an SQL parser is also used to parse the table creation statements in the business data source to obtain data table information, including table names, primary keys, foreign keys, and field information.
Then, database table relationship mining is used to explore the relationships between the data tables synchronized from the business data source, and to determine the main table and related tables under each theme. In the source layer (STG) of the data warehouse, the synchronized data tables first undergo rough partitioning. At this stage, partitioning is mainly based on the partition fields in the data packets, and null partitions are filtered out. The roughly partitioned business data tables then undergo partition field transformation and business data deduplication in the first data operations layer (ODS1) to obtain partitioned business data tables. As shown in the figure, this process includes partition splitting, partition extraction, and partition deduplication. The partition transformation and deduplication processes differ according to when the business data tables were synchronized to the data warehouse. The initially synchronized business data table is split into two parts for partition transformation and deduplication. Similarly, the routine execution corresponding to normal synchronization and the post-deployment execution corresponding to supplementary synchronization each have their own partition transformation and deduplication processes. At this point, the business data tables of each partition are still relatively scattered. From the first data operations layer to the second data operations layer (ODS2), the tables are merged along the theme dimension (determined by database table relationship mining). Depending on the event being processed, theme merging is likewise divided into three processes, namely initial execution, routine execution, and post-deployment execution, which correspond to initial synchronization, normal synchronization, and supplementary synchronization, respectively. The theme merging process specifically includes merging the business tables.
The merging of business tables depends on the theme to which each business table belongs. The database tables and field ranges contained in each theme depend on the theme-level database table relationships mined at the source layer. This process is mainly achieved by looking up the business partitions corresponding to each table. Further processing from the second data operations layer to the data detail layer (DWD) requires theme normalization of the theme tables. This yields business data tables that are partitioned by theme, have a standardized theme structure, and have undergone data cleaning. The process involves data cleaning steps such as fact normalization and primary key normalization. Finally, through standardization of the theme table structure, business data suitable for data analysis is obtained, completing the entire business governance process. The process of mining the relationships between theme database tables is shown in Figure 10 and includes the following steps. Step 1001: parse the table creation statements of the business data tables to be governed to obtain data table information. When mining the relationships between theme database tables, SQL statement parsing can be used to analyze the table creation statements directly and identify key data table information, including table names, field information, and primary and foreign keys. Step 1003: extract primary key-to-primary key association information from the data table information and mine primary key-to-primary key association information clusters. That is, the primary key-to-primary key association information is identified from the primary and foreign key information in the data table information, and a community mining algorithm is used to mine the primary key-to-primary key association information clusters.
Step 1005: extract primary key-to-foreign key association information and determine the node degree information of the nodes in it. That is, the primary key-to-foreign key association information is likewise identified from the primary and foreign key information in the data table information, and the node degree of each node in the graph structure is calculated from it. Step 1007: filter the edges in the primary key-to-primary key association information clusters based on the node degree information to obtain first filtered edges, and filter the edges corresponding to the primary key-to-foreign key association information based on the node degree information to obtain second filtered edges. Once the node degree information is available, it can be used to filter the edges that represent association relationships between data tables, removing redundant information from the association information. Step 1009: construct a preliminary entity relationship graph based on the first and second filtered edges, that is, based on the data table association relationships that remain after the redundant information has been removed. Step 1011: determine the centrality of each edge in the preliminary entity relationship graph and sort the edges by centrality. This step is mainly used to further filter the data table associations in the preliminary entity relationship graph. Step 1013: perform edge filtering on the preliminary entity relationship graph based on the edge sorting results to construct a minimum connection graph, that is, a graph built from the further filtered association relationships.
The minimum connection graph retains the strong association relationships between data tables, so subsequent data analysis, namely the theme/table/field matching analysis of step 1015, can be performed based on it. Specifically, parsing the table creation statements to obtain data table information can include: performing lexical analysis on the table creation statements of the business data tables to be governed to obtain lexical tags; performing syntax analysis on the lexical tags to obtain an abstract syntax tree; and obtaining the data table information based on the abstract syntax tree. Constructing the preliminary entity relationship graph based on the primary key-to-primary key association information and the primary key-to-foreign key association information includes: performing cluster mining on the primary key-to-primary key association information to obtain primary key-to-primary key association information clusters; determining the node degree information of the nodes in the primary key-to-foreign key association information; filtering the edges in the primary key-to-primary key association information clusters based on the node degree information to obtain first filtered edges; filtering the edges corresponding to the primary key-to-foreign key association information based on the node degree information to obtain second filtered edges; and constructing the preliminary entity relationship graph based on the first and second filtered edges. Constructing the minimum connection graph includes: calculating, using a community discovery algorithm, the centrality of the edges of the primary key-to-foreign key association information in the preliminary entity relationship graph; and filtering the edges of the preliminary entity relationship graph based on the centrality results to obtain the minimum connection graph.
Then, the core data tables for each business theme in the partitioned business data tables can be further identified; based on the data table relationships in the core data tables, the associated data tables for each business theme in the partitioned business data tables can be determined; the data information from the associated data tables can be integrated into the core data tables to obtain the business data tables under each business theme. The business primary key in the business data tables is then determined; primary key normalization is performed on the business data tables based on the business primary key to obtain a primary key normalized data table; data field normalization and inter-table relationship normalization are then performed on the primary key normalized data table to obtain the data governance results. After the data governance results are structured and saved to the data warehouse, subsequent data analysis and other processes can be performed.
[0102] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0103] Based on the same inventive concept, this application also provides a data governance apparatus for implementing the data governance method described above. The solution provided by this apparatus is similar to the implementation scheme described in the above method; therefore, the specific limitations in one or more data governance apparatus embodiments provided below can be found in the limitations of the data governance method described above, and will not be repeated here.
[0104] In one embodiment, as shown in Figure 11, a data governance device is provided, comprising:
[0105] The statement parsing module 1102 is used to perform database statement parsing on the table creation statements of the business data tables to be governed to obtain data table information.
[0106] The association mining module 1104 is used to treat each business data table to be governed as an entity and to perform database table relationship mining based on the data table information to obtain the data table association relationships.
[0107] The business theme merging module 1106 is used to perform business theme merging on the partitioned business data tables, which are obtained by partitioning the business data tables to be governed, based on the data table association relationships, so as to obtain the business data tables under each business theme.
[0108] The data governance module 1108 is used to perform data standardization and governance on business data tables under various business themes to obtain data governance results.
[0109] In one embodiment, the system further includes a data synchronization module, configured to: perform full synchronization of historical data in the business data table to be managed to obtain a historical business data table; partition the historical business data table according to the partition field of the historical business data table; and perform partition field transformation and business data deduplication on the partitioned historical business data table to obtain a partitioned business data table.
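As a non-limiting illustration, the partitioning, partition field transformation, and deduplication described above can be sketched as follows. The field names (`visit_id`, `updated_at`, `version`), the day-level partition key, and the "latest version wins" deduplication rule are assumptions made for this example and are not specified in this application.

```python
from collections import defaultdict
from datetime import datetime


def rough_partition(rows, partition_field):
    """Group rows by the raw partition field and filter out null partitions."""
    partitions = defaultdict(list)
    for row in rows:
        value = row.get(partition_field)
        if value is None or value == "":
            continue  # null partitions are dropped at this stage
        partitions[value].append(row)
    return dict(partitions)


def transform_partition_key(raw_value):
    """Hypothetical transformation: reduce a timestamp string to a day-level key."""
    return datetime.strptime(raw_value, "%Y-%m-%d %H:%M:%S").strftime("%Y%m%d")


def transform_and_deduplicate(partitions, key_field, version_field):
    """Re-key the partitions and keep only the newest version of each record."""
    result = defaultdict(dict)
    for raw_value, rows in partitions.items():
        day = transform_partition_key(raw_value)
        for row in rows:
            kept = result[day].get(row[key_field])
            if kept is None or row[version_field] > kept[version_field]:
                result[day][row[key_field]] = row
    return {day: list(by_key.values()) for day, by_key in result.items()}


history_rows = [
    {"visit_id": 1, "updated_at": "2022-12-01 08:00:00", "version": 1},
    {"visit_id": 1, "updated_at": "2022-12-01 09:30:00", "version": 2},
    {"visit_id": 2, "updated_at": "2022-12-02 10:00:00", "version": 1},
    {"visit_id": 3, "updated_at": None, "version": 1},
]
partitioned = transform_and_deduplicate(
    rough_partition(history_rows, "updated_at"), "visit_id", "version"
)
```

In this example the record with a null partition field is discarded, and the two versions of visit 1 collapse into a single row in the partition for 1 December 2022.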
[0110] In one embodiment, the statement parsing module 1102 is specifically used to: perform lexical parsing on the table creation statement of the business data table to be governed to obtain lexical tags; perform syntax parsing on the lexical tags to obtain an abstract syntax tree; and obtain data table information based on the abstract syntax tree.
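As a non-limiting illustration, the lexical parsing, syntax parsing, and extraction of data table information can be sketched for a small subset of table creation statements. The grammar below (plain identifiers, table-level PRIMARY KEY and FOREIGN KEY clauses) and all function and class names are assumptions of this sketch. A production SQL parser would cover far more syntax, such as quoted identifiers and column-level key constraints.

```python
import re

KEYWORDS = {"CREATE", "TABLE", "PRIMARY", "FOREIGN", "KEY", "REFERENCES", "NOT", "NULL"}
TOKEN_RE = re.compile(r"\s*(?:(?P<word>[A-Za-z_]\w*)|(?P<num>\d+)|(?P<punct>[(),;]))")


def tokenize(sql):
    """Lexical analysis: turn the statement into (kind, text) tokens."""
    sql, pos, tokens = sql.strip(), 0, []
    while pos < len(sql):
        match = TOKEN_RE.match(sql, pos)
        if match is None:
            raise ValueError(f"unexpected character {sql[pos]!r} at offset {pos}")
        kind = match.lastgroup
        text = match.group(kind)
        if kind == "word" and text.upper() in KEYWORDS:
            kind, text = "keyword", text.upper()
        tokens.append((kind, text))
        pos = match.end()
    return tokens


class Parser:
    """Syntax analysis: build a small abstract syntax tree from the tokens."""

    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i][1] if self.i < len(self.tokens) else None

    def take(self, expected=None):
        text = self.peek()
        if text is None or (expected is not None and text != expected):
            raise ValueError(f"expected {expected!r}, got {text!r}")
        self.i += 1
        return text

    def name_list(self):
        self.take("(")
        names = [self.take()]
        while self.peek() == ",":
            self.take(",")
            names.append(self.take())
        self.take(")")
        return names

    def item(self):
        if self.peek() == "PRIMARY":
            self.take("PRIMARY"); self.take("KEY")
            return {"type": "primary_key", "columns": self.name_list()}
        if self.peek() == "FOREIGN":
            self.take("FOREIGN"); self.take("KEY")
            columns = self.name_list()
            self.take("REFERENCES")
            ref_table = self.take()
            return {"type": "foreign_key", "columns": columns,
                    "ref_table": ref_table, "ref_columns": self.name_list()}
        column = {"type": "column", "name": self.take(), "data_type": self.take()}
        if self.peek() == "(":  # type arguments such as VARCHAR(32)
            self.take("(")
            while self.take() != ")":
                pass
        while self.peek() not in (",", ")", None):  # skip NOT NULL and similar
            self.take()
        return column

    def parse(self):
        self.take("CREATE"); self.take("TABLE")
        node = {"type": "create_table", "name": self.take(), "items": []}
        self.take("(")
        node["items"].append(self.item())
        while self.peek() == ",":
            self.take(",")
            node["items"].append(self.item())
        self.take(")")
        return node


def table_info(ast):
    """Walk the syntax tree and collect the data table information."""
    info = {"table": ast["name"], "fields": [], "primary_key": [], "foreign_keys": []}
    for item in ast["items"]:
        if item["type"] == "column":
            info["fields"].append((item["name"], item["data_type"]))
        elif item["type"] == "primary_key":
            info["primary_key"] = item["columns"]
        else:
            info["foreign_keys"].append(
                (item["columns"], item["ref_table"], item["ref_columns"]))
    return info


DDL = """
CREATE TABLE visit (
    visit_id BIGINT NOT NULL,
    patient_id BIGINT NOT NULL,
    dept VARCHAR(32),
    PRIMARY KEY (visit_id),
    FOREIGN KEY (patient_id) REFERENCES patient (patient_id)
);
"""
info = table_info(Parser(tokenize(DDL)).parse())
```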
[0111] In one embodiment, the association mining module 1104 is specifically used to: extract primary key-to-primary key association information and primary key-to-foreign key association information from the data table information; construct a first entity relationship graph based on the primary key-to-primary key association information and the primary key-to-foreign key association information; perform edge filtering on the first entity relationship graph through a community discovery algorithm to obtain a second entity relationship graph; and obtain the data table association relationships based on the second entity relationship graph.
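As a non-limiting illustration, extracting the two kinds of association information and assembling them into a graph can be sketched as follows. This application does not define exactly when two tables are primary key-to-primary key associated; the sketch assumes that tables declaring identical primary key columns are so associated. The table names and the dictionary layout are hypothetical.

```python
from itertools import combinations


def extract_associations(tables):
    """Return (pk_pk_edges, pk_fk_edges) derived from parsed data table information."""
    pk_pk, pk_fk = [], []
    # Assumption: tables declaring identical primary key columns are PK-PK associated.
    for left, right in combinations(sorted(tables), 2):
        key = tables[left]["primary_key"]
        if key and key == tables[right]["primary_key"]:
            pk_pk.append((left, right))
    # A foreign key referencing another table's primary key is a PK-FK association.
    for name in sorted(tables):
        for _columns, ref_table, ref_columns in tables[name]["foreign_keys"]:
            if ref_table in tables and ref_columns == tables[ref_table]["primary_key"]:
                pk_fk.append((ref_table, name))
    return pk_pk, pk_fk


def build_graph(pk_pk, pk_fk):
    """Tables are nodes; each association becomes an undirected, typed edge."""
    graph = {}
    for kind, edges in (("pk_pk", pk_pk), ("pk_fk", pk_fk)):
        for a, b in edges:
            graph.setdefault(a, {})[b] = kind
            graph.setdefault(b, {})[a] = kind
    return graph


tables = {
    "patient": {"primary_key": ["patient_id"], "foreign_keys": []},
    "patient_ext": {"primary_key": ["patient_id"], "foreign_keys": []},
    "visit": {"primary_key": ["visit_id"],
              "foreign_keys": [(["patient_id"], "patient", ["patient_id"])]},
    "prescription": {"primary_key": ["rx_id"],
                     "foreign_keys": [(["visit_id"], "visit", ["visit_id"])]},
}
pk_pk, pk_fk = extract_associations(tables)
graph = build_graph(pk_pk, pk_fk)
```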
[0112] In one embodiment, the association mining module 1104 is specifically used for: performing cluster mining on primary key and primary key association information to obtain primary key and primary key association information clusters; determining the node degree information of nodes in the primary key and foreign key association information; filtering the edges in the primary key and primary key association information clusters based on the node degree information to obtain first filtered edges; filtering the edges corresponding to the primary key and foreign key association information based on the node degree information to obtain second filtered edges; and constructing a first entity relationship graph based on the first filtered edges and the second filtered edges.
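As a non-limiting illustration, the cluster mining and node degree computation can be sketched as follows. The cluster mining algorithm is not named in this embodiment; connected components are used here as a simple stand-in, which is an assumption of the sketch, and the table names are hypothetical.

```python
from collections import Counter, defaultdict


def mine_clusters(pk_pk_edges):
    """Group tables into clusters; here a cluster is a connected component."""
    adjacency = defaultdict(set)
    for a, b in pk_pk_edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, clusters = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:  # depth-first traversal of one component
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(adjacency[node] - members)
        seen |= members
        clusters.append(sorted(members))
    return clusters


def node_degrees(pk_fk_edges):
    """Count how many PK-FK associations touch each table."""
    degree = Counter()
    for a, b in pk_fk_edges:
        degree[a] += 1
        degree[b] += 1
    return degree


pk_pk_edges = [("patient", "patient_ext"), ("patient_ext", "patient_contact"),
               ("visit", "visit_ext")]
pk_fk_edges = [("patient", "visit"), ("visit", "prescription"), ("visit", "lab_result")]
clusters = mine_clusters(pk_pk_edges)
degree = node_degrees(pk_fk_edges)
```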
[0113] In one embodiment, the association mining module 1104 is specifically used to: determine, based on the node degree information, the key node with the largest node degree in the primary key-to-primary key association information cluster; take the edges in the primary key-to-primary key association information cluster that contain the key node as retained edges and filter the edges in the cluster accordingly to obtain the first filtered edges; and take the edges in the primary key-to-foreign key association information that contain the key node as retained edges and filter the edges in the primary key-to-foreign key association information accordingly to obtain the second filtered edges.
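As a non-limiting illustration, the key-node filtering can be sketched as follows. The sketch applies the rule literally: only edges that contain a key node are retained. How primary key-to-foreign key edges between tables outside every cluster should be treated is not stated in this embodiment, so dropping them here is an assumption, as are the alphabetical tie-break and the table names.

```python
def key_node(cluster, degree):
    """Pick the cluster member with the largest node degree (ties: alphabetical)."""
    return max(sorted(cluster), key=lambda node: degree.get(node, 0))


def filter_edges(clusters, pk_pk_edges, pk_fk_edges, degree):
    """Retain only the edges that contain the key node of some cluster."""
    keys = {key_node(cluster, degree) for cluster in clusters}
    first = [edge for edge in pk_pk_edges if edge[0] in keys or edge[1] in keys]
    second = [edge for edge in pk_fk_edges if edge[0] in keys or edge[1] in keys]
    return keys, first, second


clusters = [["patient", "patient_contact", "patient_ext"]]
pk_pk_edges = [("patient", "patient_ext"), ("patient_ext", "patient_contact")]
pk_fk_edges = [("patient", "visit"), ("patient", "bill"), ("patient_ext", "visit")]
degree = {"patient": 2, "visit": 2, "bill": 1, "patient_ext": 1}

keys, first_filtered, second_filtered = filter_edges(
    clusters, pk_pk_edges, pk_fk_edges, degree
)
```

Here `patient` has the largest degree in its cluster, so the redundant `patient_ext`-to-`visit` association is removed while the associations that pass through `patient` are kept.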
[0114] In one embodiment, the association mining module 1104 is specifically used to: calculate the centrality of the edges of the primary key and foreign key association information in the first entity relationship graph using a community discovery algorithm to obtain the centrality calculation result; and filter the edges of the first entity relationship graph based on the centrality calculation result to obtain the second entity relationship graph.
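As a non-limiting illustration, the edge centrality calculation can be sketched as follows. This embodiment does not name the centrality measure. The sketch assumes edge betweenness, the measure used by the Girvan-Newman community detection algorithm, computed with a Brandes-style accumulation on an unweighted, undirected graph. The table names are hypothetical.

```python
from collections import defaultdict, deque


def edge_betweenness(edges):
    """Count, per edge, the shortest paths between table pairs that cross it."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    score = {frozenset(edge): 0.0 for edge in edges}
    for source in list(adjacency):
        # Breadth-first search records distances and shortest-path counts.
        dist = {source: 0}
        paths = defaultdict(float)
        paths[source] = 1.0
        preds = defaultdict(list)
        order, queue = [], deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    paths[w] += paths[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, sharing path credit with predecessors.
        credit = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                share = paths[v] / paths[w] * (1.0 + credit[w])
                score[frozenset((v, w))] += share
                credit[v] += share
    # Each unordered pair was counted from both ends, so halve the totals.
    return {edge: value / 2.0 for edge, value in score.items()}


pk_fk_edges = [("patient", "visit"), ("visit", "prescription"),
               ("prescription", "drug")]
centrality = edge_betweenness(pk_fk_edges)
```

On this chain of four tables, the middle edge carries the shortest paths of four table pairs and each outer edge carries three, so the middle edge ranks first.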
[0115] In one embodiment, the association mining module 1104 is specifically used to: clear the connection edges of primary key and foreign key association information in the first entity relationship graph; and re-add the connection edges of primary key and foreign key association information in the first entity relationship graph in descending order of centrality to obtain the second entity relationship graph. The nodes at both ends of the re-added connection edges are isolated nodes, and the isolated nodes are nodes without connection information.
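As a non-limiting illustration, clearing the edges and re-adding them in descending order of centrality can be sketched as follows. The translated wording leaves open whether both end nodes of a re-added edge must be isolated or whether one isolated end node suffices, so the sketch exposes both readings through the hypothetical flag `require_both_isolated` and defaults to the at-least-one reading, which is an assumption. The centrality values and table names are made up for the example.

```python
def rebuild_minimum_connection_graph(pk_fk_edges, centrality, retained_edges=(),
                                     require_both_isolated=False):
    """Re-add cleared PK-FK edges, most central first, only where a node is isolated."""
    # Nodes that still carry connection information after the PK-FK edges are cleared.
    connected = {node for edge in retained_edges for node in edge}
    ranked = sorted(pk_fk_edges,
                    key=lambda edge: (-centrality[frozenset(edge)], sorted(edge)))
    kept = []
    for a, b in ranked:
        isolated = [a not in connected, b not in connected]
        accept = all(isolated) if require_both_isolated else any(isolated)
        if accept:
            kept.append((a, b))
            connected.update((a, b))
    return kept


pk_fk_edges = [("patient", "visit"), ("visit", "prescription"),
               ("patient", "prescription"), ("visit", "lab_result")]
centrality = {
    frozenset(("patient", "visit")): 3.0,
    frozenset(("visit", "prescription")): 2.5,
    frozenset(("patient", "prescription")): 1.0,
    frozenset(("visit", "lab_result")): 3.0,
}
kept = rebuild_minimum_connection_graph(pk_fk_edges, centrality)
strict = rebuild_minimum_connection_graph(pk_fk_edges, centrality,
                                          require_both_isolated=True)
```

Under the default reading the low-centrality `patient`-to-`prescription` edge is dropped because both of its tables are already connected, leaving a graph with no redundant path. Under the stricter reading only edges between two untouched tables are re-added.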
[0116] In one embodiment, the business theme merging module 1106 is specifically used to: identify the core data tables of each business theme in the partitioned business data table; determine the associated data tables of each business theme in the partitioned business data table based on the data table association relationship of the core data tables; and merge the data information of the associated data tables in the core data tables to obtain the business data tables under each business theme.
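As a non-limiting illustration, merging the data information of associated data tables into a core data table can be sketched as follows. The join fields, the prefixed column naming, and the rule that the first matching row wins are assumptions of this sketch.

```python
def merge_theme(core_rows, associated, join_fields):
    """Fold the fields of each associated table into the rows of the core table.

    associated maps a table name to its rows; join_fields maps a table name to
    the field shared with the core table.
    """
    merged = [dict(row) for row in core_rows]
    for table, rows in associated.items():
        join_field = join_fields[table]
        index = {}
        for row in rows:
            index.setdefault(row[join_field], row)  # first match wins in this sketch
        for row in merged:
            match = index.get(row.get(join_field))
            if match is None:
                continue
            for field, value in match.items():
                if field != join_field:
                    row[f"{table}.{field}"] = value
    return merged


visit_rows = [
    {"visit_id": 1, "patient_id": 10},
    {"visit_id": 2, "patient_id": 11},
]
associated = {"patient": [{"patient_id": 10, "name": "A"}]}
theme_table = merge_theme(visit_rows, associated, {"patient": "patient_id"})
```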
[0117] In one embodiment, the data governance module 1108 is specifically used to: determine the business primary key in the business data table; perform primary key normalization governance on the business data table based on the business primary key to obtain a primary key normalized data table; and perform data field normalization governance and inter-table relationship normalization governance on the primary key normalized data table to obtain the data governance result.
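As a non-limiting illustration, primary key normalization and data field normalization can be sketched as follows. This embodiment does not spell out the normalization rules, so the rules used here (trimming and upper-casing the key, discarding rows without a key, keeping the first row per key, and mapping field values to standard codes) are all assumptions. Inter-table relationship normalization is not covered by the sketch.

```python
def normalize_primary_key(rows, key_field):
    """Bring the business primary key into one format and drop unusable rows."""
    seen, result = set(), []
    for row in rows:
        raw = row.get(key_field)
        if raw is None or str(raw).strip() == "":
            continue  # a row without a business primary key cannot be governed
        key = str(raw).strip().upper()
        if key in seen:
            continue  # keep the first row for each normalized key
        seen.add(key)
        result.append({**row, key_field: key})
    return result


def normalize_fields(rows, value_maps):
    """Map raw field values to standard values; unknown values pass through."""
    cleaned = []
    for row in rows:
        clean = dict(row)
        for field, mapping in value_maps.items():
            if field in clean:
                clean[field] = mapping.get(clean[field], clean[field])
        cleaned.append(clean)
    return cleaned


rows = [
    {"patient_id": " p001 ", "gender": "M"},
    {"patient_id": "P001", "gender": "male"},
    {"patient_id": None, "gender": "F"},
    {"patient_id": "p002", "gender": "F"},
]
governed = normalize_fields(
    normalize_primary_key(rows, "patient_id"),
    {"gender": {"M": "male", "F": "female"}},
)
```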
[0118] In one embodiment, the apparatus further includes a data warehouse modeling module, used to: determine the entity relationship diagram corresponding to the data table association relationships; feed back the entity relationship diagram and obtain the association adjustment information corresponding to the entity relationship diagram; adjust the entity relationship diagram based on the association adjustment information to obtain a target entity relationship diagram; and construct a data warehouse model based on the target entity relationship diagram, the data warehouse model being used to store the data governance results in structured form.
[0119] In one embodiment, the system further includes a data storage module, configured to: perform data governance on the original medical data table based on the data warehouse model to obtain the data governance result corresponding to the original medical data table; and structure and save the data governance result corresponding to the original medical data table to the data warehouse model.
[0120] Each module in the aforementioned data governance device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the operations corresponding to each module.
[0121] In one embodiment, a computer device is provided, which may be a server, and its internal structure may be as shown in Figure 12. The computer device includes a processor, memory, an input/output (I/O) interface, and a communication interface. The processor, memory, and I/O interface are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interface. The processor provides computational and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides the environment for running the operating system and computer programs stored in the non-volatile storage media. The database stores data related to data governance. The I/O interface is used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When the computer program is executed by the processor, it implements a data governance method.
[0122] Those skilled in the art will understand that the structure shown in Figure 12 is merely a block diagram of the part of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0123] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.
[0124] In one embodiment, a computer-readable storage medium is provided storing a computer program that, when executed by a processor, implements the steps in the above method embodiments.
[0125] In one embodiment, a computer program product or computer program is provided, the computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium, and executes the computer instructions, causing the computer device to perform the steps in the above method embodiments.
[0126] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data shall comply with the relevant laws, regulations and standards of the relevant countries and regions.
[0127] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.
[0128] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0129] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A data governance method, characterized in that the method includes: performing database statement parsing on the table creation statements of the business data tables to be governed to obtain data table information; extracting primary key-to-primary key association information and primary key-to-foreign key association information from the data table information; constructing a first entity-relationship graph based on the primary key-to-primary key association information and the primary key-to-foreign key association information; calculating the centrality of the edges of the primary key-to-foreign key association information in the first entity-relationship graph using a community discovery algorithm to obtain a centrality calculation result; removing the connecting edges of the primary key-to-foreign key association information in the first entity-relationship graph; re-adding the connecting edges of the primary key-to-foreign key association information in the first entity-relationship graph in descending order of centrality to obtain a second entity-relationship graph, wherein the two ends of the re-added connecting edges have isolated nodes, the isolated nodes being nodes without connection information, and the second entity-relationship graph is a minimum connection graph obtained by further filtering the first entity-relationship graph; obtaining the data table association relationships based on the second entity-relationship graph; performing business theme merging on the partitioned business data tables, which are obtained by partitioning the business data tables to be governed, based on the data table association relationships, to obtain the business data tables under each business theme; and performing data standardization governance on the business data tables under each business theme to obtain data governance results.
2. The method according to claim 1, characterized in that the method further includes: performing full synchronization of the historical data in the business data table to be governed to obtain a historical business data table; partitioning the historical business data table according to the partition field of the historical business data table; and performing partition field transformation and business data deduplication on the partitioned historical business data table to obtain the partitioned business data table.
3. The method according to claim 1, characterized in that performing database statement parsing on the table creation statements of the business data tables to be governed to obtain data table information includes: performing lexical analysis on the table creation statements of the business data tables to be governed to obtain lexical tags; performing syntax analysis on the lexical tags to obtain an abstract syntax tree; and obtaining the data table information based on the abstract syntax tree.
4. The method according to claim 1, characterized in that constructing the first entity relationship graph based on the primary key-to-primary key association information and the primary key-to-foreign key association information includes: performing cluster mining on the primary key-to-primary key association information to obtain primary key-to-primary key association information clusters; determining the node degree information of the nodes in the primary key-to-foreign key association information; filtering the edges in the primary key-to-primary key association information clusters based on the node degree information to obtain first filtered edges; filtering the edges corresponding to the primary key-to-foreign key association information based on the node degree information to obtain second filtered edges; and constructing the first entity relationship graph based on the first filtered edges and the second filtered edges.
5. The method according to claim 4, characterized in that filtering the edges in the primary key-to-primary key association information clusters based on the node degree information to obtain the first filtered edges includes: determining, based on the node degree information, the key node with the largest node degree in the primary key-to-primary key association information cluster; and taking the edges in the primary key-to-primary key association information cluster that contain the key node as retained edges and filtering the edges in the cluster accordingly to obtain the first filtered edges; and filtering the edges corresponding to the primary key-to-foreign key association information based on the node degree information to obtain the second filtered edges includes: taking the edges in the primary key-to-foreign key association information that contain the key node as retained edges and filtering the edges in the primary key-to-foreign key association information accordingly to obtain the second filtered edges.
6. The method according to claim 1, characterized in that, The step of merging the business themes of the partitioned business data tables based on the data table relationships, determining the business themes of the partitioned business data tables, and obtaining business data tables under each business theme includes: Identify the core data tables for each business theme in the partitioned business data tables; Based on the data table relationships of the core data table, determine the associated data tables for each business theme in the partitioned business data table; By integrating the data information from the related data tables into the core data table, business data tables under each business theme are obtained.
7. The method according to claim 6, characterized in that, The data governance results obtained based on the business data tables under each business theme include: Determine the business primary key in the business data table; Based on the business primary key, the business data table is normalized to obtain a primary key normalized data table; The data field normalization and inter-table relationship normalization are performed on the primary key normalized data table to obtain the data governance results.
8. The method according to any one of claims 1 to 7, characterized in that, The method further includes: Determine the entity relationship diagram corresponding to the data table associations; Feedback the entity relationship diagram and obtain the corresponding association adjustment information; The entity relationship diagram is adjusted based on the aforementioned association adjustment information to obtain the target entity relationship diagram; Based on the target entity relationship diagram, a data warehouse model is constructed, which is used to structurally store the data governance results.
9. The method according to claim 8, characterized in that the business data table to be governed includes an original medical data table, and the method further includes: performing data governance on the original medical data table based on the data warehouse model to obtain the data governance result corresponding to the original medical data table; and saving the data governance result corresponding to the original medical data table to the data warehouse model in structured form.
10. A data governance device, characterized in that the device comprises: a statement parsing module, used to parse the table creation statements of the business data tables to be governed to obtain data table information; an association mining module, used to: extract primary key-to-primary key association information and primary key-to-foreign key association information from the data table information; construct a first entity-relationship graph based on the primary key-to-primary key association information and the primary key-to-foreign key association information; calculate, using a community discovery algorithm, the centrality of the edges corresponding to the primary key-to-foreign key association information in the first entity-relationship graph to obtain a centrality calculation result; remove the connecting edges corresponding to the primary key-to-foreign key association information from the first entity-relationship graph; re-add those connecting edges to the first entity-relationship graph in descending order of centrality to obtain a second entity-relationship graph, wherein the two ends of each re-added connecting edge include an isolated node, an isolated node being a node without connection information, and the second entity-relationship graph is a minimum connection graph obtained by further filtering the first entity-relationship graph; and obtain the data table associations based on the second entity-relationship graph; a business theme merging module, used to perform business theme merging on the partitioned business data tables, which are obtained by partitioning the business data tables to be governed, based on the data table associations, so as to obtain business data tables under each business theme; and
a data governance module, used to perform data standardization governance on the business data tables under each business theme to obtain data governance results.
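The edge removal and re-adding performed by the association mining module can be sketched as follows. This is only one reading of the claim: it assumes that the edge centrality scores have already been computed (for example as edge betweenness, which is commonly used in community discovery, although the claim does not name a specific measure), and that an edge is re-added only when at least one of its two end nodes currently has no connection. The function name and sample tables are hypothetical.

```python
from collections import defaultdict


def rebuild_min_connection_graph(pk_pk_edges, pk_fk_edges, centrality):
    """Keep PK-PK edges, then re-add PK-FK edges by descending centrality.

    A PK-FK edge is re-added only if one of its ends is still isolated
    (degree 0), which is the interpretation assumed in this sketch.
    """
    degree = defaultdict(int)
    kept = []
    for a, b in pk_pk_edges:  # PK-PK edges are never removed
        kept.append((a, b))
        degree[a] += 1
        degree[b] += 1
    for a, b in sorted(pk_fk_edges, key=lambda e: centrality[e], reverse=True):
        if degree[a] == 0 or degree[b] == 0:
            kept.append((a, b))
            degree[a] += 1
            degree[b] += 1
    return kept


# Hypothetical tables and precomputed (assumed) centrality scores.
pk_pk = [("patient", "patient_ext")]
pk_fk = [("patient", "visit"), ("visit", "order"),
         ("patient", "order"), ("order", "order_item")]
scores = {("patient", "visit"): 0.9, ("visit", "order"): 0.7,
          ("patient", "order"): 0.4, ("order", "order_item"): 0.6}
second_graph = rebuild_min_connection_graph(pk_pk, pk_fk, scores)
```

In this example the low-centrality edge between `patient` and `order` is not re-added, because both tables are already connected when it is considered, so the resulting graph links every table with fewer edges than the first graph.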
11. The apparatus according to claim 10, characterized in that the apparatus further comprises a data synchronization module, used for: performing full synchronization of the historical data in the business data tables to be governed to obtain historical business data tables; partitioning the historical business data tables according to their partition fields; and performing partition field transformation and business data deduplication on the partitioned historical business data tables to obtain the partitioned business data tables.
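A minimal sketch of the partitioning, partition field transformation, and deduplication described in claim 11 is given below. It assumes that the partition field is a timestamp string, that the transformation reduces it to a `YYYYMMDD` day value stored in a hypothetical `dt` column, and that a later record with the same business key replaces an earlier one within a partition. None of these specifics come from the claim.

```python
from collections import defaultdict
from datetime import datetime


def partition_and_dedupe(rows, partition_field, key_field):
    """Group rows by day and keep the last row seen for each business key."""
    partitions = defaultdict(dict)
    for row in rows:
        # Partition field transformation: timestamp string -> YYYYMMDD.
        stamp = datetime.strptime(row[partition_field], "%Y-%m-%d %H:%M:%S")
        day = stamp.strftime("%Y%m%d")
        # Deduplication: a later row with the same key overwrites the earlier one.
        partitions[day][row[key_field]] = {**row, "dt": day}
    return {day: list(by_key.values()) for day, by_key in partitions.items()}


# Hypothetical fully synchronized historical rows.
history = [
    {"visit_id": 1, "created_at": "2022-12-01 08:30:00", "dept": "ent"},
    {"visit_id": 1, "created_at": "2022-12-01 09:00:00", "dept": "ENT"},
    {"visit_id": 2, "created_at": "2022-12-02 10:00:00", "dept": "cardiology"},
]
parts = partition_and_dedupe(history, "created_at", "visit_id")
```

The two records for `visit_id` 1 fall into the same day partition and collapse into one row, while the record for `visit_id` 2 lands in a separate partition.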
12. The apparatus according to claim 10, characterized in that the statement parsing module is specifically used for: performing lexical parsing on the table creation statements of the business data tables to be governed to obtain lexical tokens; performing syntax parsing on the lexical tokens to obtain an abstract syntax tree; and obtaining the data table information based on the abstract syntax tree.
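The three stages in claim 12 (lexical parsing, syntax parsing, extraction of data table information) can be illustrated with a deliberately small sketch. It handles only a simplified `CREATE TABLE` statement with table-level `PRIMARY KEY` and `FOREIGN KEY` clauses, uses a plain dictionary as a stand-in for the abstract syntax tree, and all function names are hypothetical. A production system would more likely rely on a full SQL grammar.

```python
import re


def tokenize(ddl):
    """Lexical step: split the statement into identifiers, words and punctuation."""
    return re.findall(r"`[^`]+`|\w+|[(),]", ddl)


def parse_create_table(tokens):
    """Syntax step: build a small tree with one token list per table item."""
    if [t.upper() for t in tokens[:2]] != ["CREATE", "TABLE"] or tokens[3] != "(":
        raise ValueError("unsupported statement")
    items, current, depth = [], [], 0
    for tok in tokens[3:]:
        if tok == "(":
            depth += 1
            if depth == 1:
                continue  # opening parenthesis of the table body
        elif tok == ")":
            depth -= 1
            if depth == 0:
                items.append(current)  # closing parenthesis of the table body
                break
        elif tok == "," and depth == 1:
            items.append(current)  # item separator at the top level
            current = []
            continue
        current.append(tok)
    return {"type": "create_table", "name": tokens[2].strip("`"), "items": items}


def table_info(tree):
    """Extraction step: columns, primary key and foreign keys from the tree."""
    info = {"table": tree["name"], "columns": [], "primary_key": [], "foreign_keys": []}
    for item in tree["items"]:
        upper = [t.upper() for t in item]
        if upper[0] == "PRIMARY":
            info["primary_key"] = [t.strip("`") for t in item[3:-1] if t != ","]
        elif upper[0] == "FOREIGN":
            ref = upper.index("REFERENCES")
            info["foreign_keys"].append((item[3], item[ref + 1], item[ref + 3]))
        else:
            info["columns"].append((item[0].strip("`"), upper[1]))
    return info


ddl = """CREATE TABLE visit (
  visit_id BIGINT NOT NULL,
  patient_id BIGINT,
  dept VARCHAR(32),
  PRIMARY KEY (visit_id),
  FOREIGN KEY (patient_id) REFERENCES patient (id)
);"""
info = table_info(parse_create_table(tokenize(ddl)))
```

The resulting `info` dictionary carries the primary key and foreign key references that later association mining could consume.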
13. The apparatus according to claim 10, characterized in that the association mining module is specifically used for: performing cluster mining on the primary key-to-primary key association information to obtain primary key-to-primary key association information clusters; determining the node degree information of the nodes in the primary key-to-foreign key association information; filtering, based on the node degree information, the edges in the primary key-to-primary key association information clusters to obtain first filtered edges; filtering, based on the node degree information, the edges corresponding to the primary key-to-foreign key association information to obtain second filtered edges; and constructing the first entity-relationship graph based on the first filtered edges and the second filtered edges.
14. The apparatus according to claim 13, characterized in that the association mining module is specifically used for: determining, based on the node degree information, the key node with the largest node degree in the primary key-to-primary key association information cluster; retaining the edges in the primary key-to-primary key association information cluster that contain the key node to obtain the first filtered edges; and retaining the edges in the primary key-to-foreign key association information that contain the key node to obtain the second filtered edges.
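The degree-based filtering of claims 13 and 14 can be sketched as below, under several assumptions that the claims leave open: cluster mining is approximated by connected components of the primary key-to-primary key edges, ties in node degree are broken alphabetically, and the retention rule is applied literally, so that only edges containing a key node survive. The helper names and sample tables are hypothetical.

```python
from collections import Counter, defaultdict


def clusters(edges):
    """Connected components, used here as a stand-in for cluster mining."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, found = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node not in group:
                group.add(node)
                stack.extend(adjacency[node] - group)
        seen |= group
        found.append(group)
    return found


def filter_edges(pk_pk_edges, pk_fk_edges):
    """Pick the highest-degree node per cluster and keep only its edges."""
    # Node degree is counted over the PK-FK association edges.
    degree = Counter(node for edge in pk_fk_edges for node in edge)
    key_nodes = {
        max(sorted(group), key=lambda n: degree[n]) for group in clusters(pk_pk_edges)
    }
    first = [e for e in pk_pk_edges if key_nodes & set(e)]
    second = [e for e in pk_fk_edges if key_nodes & set(e)]
    return key_nodes, first, second


# Hypothetical tables: three patient tables share the same primary key.
pk_pk = [("patient", "patient_ext"), ("patient_ext", "patient_bak"),
         ("patient", "patient_bak")]
pk_fk = [("patient", "visit"), ("patient", "order"),
         ("patient_bak", "visit"), ("patient_ext", "insurance")]
key_nodes, first_edges, second_edges = filter_edges(pk_pk, pk_fk)
```

In this example `patient` has the largest degree in its cluster, so only edges that touch `patient` are retained for building the first entity-relationship graph.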
15. The apparatus according to claim 10, characterized in that the business theme merging module is specifically used for: identifying the core data table of each business theme among the partitioned business data tables; determining, based on the data table associations of the core data table, the associated data tables of each business theme among the partitioned business data tables; and merging the data information of the associated data tables into the core data table to obtain the business data tables under each business theme.
16. The apparatus according to claim 15, characterized in that the business theme merging module is specifically used for: determining the business primary key in the business data table; performing primary key normalization on the business data table based on the business primary key to obtain a primary-key-normalized data table; and performing data field normalization and inter-table relationship normalization on the primary-key-normalized data table to obtain the data governance results.
17. The apparatus according to any one of claims 10 to 16, characterized in that the apparatus further comprises a data warehouse modeling module, used to: determine the entity relationship diagram corresponding to the data table associations; present the entity relationship diagram for feedback and obtain the association adjustment information corresponding to the entity relationship diagram; adjust the entity relationship diagram based on the association adjustment information to obtain a target entity relationship diagram; and construct, based on the target entity relationship diagram, a data warehouse model that is used to store the data governance results in structured form.
18. The apparatus according to claim 17, characterized in that the apparatus further comprises a data storage module, used for: performing data governance on the original medical data table based on the data warehouse model to obtain the data governance result corresponding to the original medical data table; and saving the data governance result corresponding to the original medical data table to the data warehouse model in structured form.
19. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 9.
20. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 9.
21. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 9.