Big data quality control method, system, supercomputer and storage medium
By constructing a knowledge graph in the biomedical field and training a multimodal end-to-end model, the problem of strong domain-specificity in existing data quality control methods is solved, enabling cross-domain data quality control that is applicable to the quality evaluation of various data types.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 青岛国实科技集团有限公司
- Filing Date
- 2023-04-27
- Publication Date
- 2026-06-26
Figure CN116580774B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of big data technology, and in particular to a big data quality control method, system, supercomputer, and computer-readable storage medium.
Background Technology
[0002] Data quality control is a crucial step in ensuring the overall quality of big data. However, data quality control methods are highly domain-specific, lacking universality and versatility. Existing data quality control methods based on deep learning technology are often applicable to specific fields, such as deep learning-based methods for sea surface temperature observation data quality control, medical radiological imaging quality control, and environmental monitoring data quality control.
[0003] The above methods are mainly used for quality control of serialized data and image data. Moreover, their quality evaluation dimensions depend on data relevance or rationality within the specific field. They are therefore only applicable to specific fields and cannot be extended to other fields (such as the biomedical field), and so fail to meet the growing demand for big data quality control.
[0004] Currently, no effective solution has been proposed for the limited generality of data quality control methods in the related art.
Summary of the Invention
[0005] This application provides a big data quality control method, system, supercomputer, and computer-readable storage medium to achieve big data quality control based on graph data and improve the domain scalability of the quality control method.
[0006] In a first aspect, embodiments of this application provide a big data quality control method, including:
[0007] The source data acquisition steps involve obtaining source data from multiple preset data sources. Specifically, the data sources include 30 biomedical databases such as DrugBank, KEGG (KeggDrug), RCSB PDB, PubMed, Uniprot, Pubchem, Chemspider, Wikipedia, and Patent. The source data includes eight categories: diseases, genes, cell lines, proteins, targets, compounds, drugs, and pathways. The source data covers rule bases, algorithm bases, model bases, literature bases, patent bases, ontology bases, etc. Among them, the protein PRO, cell line CLO, disease DOID, compound CHEBI, pathway PW, and gene GO data are sourced from The OBO Foundry (Biomedical Ontology Resource Service and Application).
[0008] The knowledge graph construction steps involve knowledge extraction, data alignment, data storage, incremental evolution, and visualization based on the source data. The knowledge graph includes entity data, relation data, triplet data, and subgraph data. Based on the source data, the corresponding entities in the knowledge graph include eight categories: disease, gene, cell line, protein, target, compound, drug, and pathway.
[0009] The quality control model construction steps include configuring quality control parameters and constructing a quality evaluation dataset based on the quality control parameters and knowledge graph; constructing a multimodal end-to-end big data quality control model; training the quality control model using the quality evaluation dataset; and using the trained quality control model to calculate and output data confidence values based on input entity data, subgraph data, and / or triple data. The data in the quality evaluation dataset is used as parameters in supervised learning tasks. The quality control parameters include: data comprehensiveness parameters, data timeliness parameters, data authenticity parameters, data relevance parameters, and data accuracy parameters.
[0010] In some embodiments, the quality assessment dataset is in the form of serialized data. The quality assessment dataset includes entity data, relation data, triple data, subgraph data, and data timeliness parameters and data authenticity parameters in the quality control parameters. The triple includes a head entity, a relation, and a tail entity, which can be represented as (h, r, t), where h is the head entity, t is the tail entity, and r is the relation between the head entity and the tail entity. The proportions of the training set, validation set, and test set in the dataset are 6:2:2, respectively.
[0011] In some embodiments, the knowledge graph construction step further includes:
[0012] The knowledge extraction step involves extracting multiple types of ontology from the source data and parsing the ontology. Entity attributes are extracted from the ontology, and entity relationships are extracted based on predefined inter-ontology relationships and intra-ontology entity relationships. Entity attributes include, but are not limited to: entity identifier (ID), entity function description (IAO_0000115), entity namespace, entity label, entity external link (hasDbXref), and entity synonym. The entity function description (IAO_0000115) describes the physiological processes or functions in which the entity participates in the ontology. Inter-ontology relationships include, but are not limited to: database links, paper links, webpage links, and literature links. Intra-ontology entity relationships include, but are not limited to: synonym associations, label associations, parent-child class relationships, and namespace associations.
[0013] The data alignment step involves aligning entities based on their entity IDs or entity identifier IDs. Specifically, data from The OBO Foundry can be aligned based on entity identifier IDs, while attributes from other data sources can be aligned directly based on entity IDs.
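The knowledge extraction step described above can be sketched as follows. This is a minimal illustration, assuming the ontology is available in the OBO flat-file format (the text does not state which serialization is parsed). The helper name `parse_term_stanza`, the dictionary keys, and the `EX:` identifiers are hypothetical. The sketch relies on the usual OBO-to-OWL correspondence in which the `def` tag carries the definition annotated as IAO_0000115 and the `xref` tag corresponds to hasDbXref.

```python
def parse_term_stanza(stanza: str) -> dict:
    """Collect entity attributes from one OBO-format [Term] stanza."""
    entity = {"id": None, "label": None, "namespace": None,
              "description": None, "xrefs": [], "synonyms": [], "parents": []}
    for raw in stanza.splitlines():
        line = raw.strip()
        # Skip blank lines, the [Term] header, and lines without a "tag: value" pair
        if not line or line.startswith("[") or ": " not in line:
            continue
        tag, value = line.split(": ", 1)
        if tag == "id":
            entity["id"] = value
        elif tag == "name":
            entity["label"] = value
        elif tag == "namespace":
            entity["namespace"] = value
        elif tag == "def":
            # Keep only the quoted definition text (function description)
            entity["description"] = value.split('"')[1] if '"' in value else value
        elif tag == "xref":
            entity["xrefs"].append(value)
        elif tag == "synonym":
            entity["synonyms"].append(value.split('"')[1] if '"' in value else value)
        elif tag == "is_a":
            # Parent class; drop the trailing "! readable name" comment
            entity["parents"].append(value.split(" ! ")[0])
    return entity


# Hypothetical stanza; identifiers and text are placeholders, not real ontology terms
EXAMPLE = '''[Term]
id: EX:0000001
name: sample process
namespace: sample_namespace
def: "A placeholder function description." [EX:ref]
synonym: "sample synonym" EXACT []
xref: EXDB:12345
is_a: EX:0000002 ! sample parent
'''
term = parse_term_stanza(EXAMPLE)
```

The `is_a` entries give the parent-child class relationships, while the synonym, label, and namespace fields are the raw material for the weaker associations listed above.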
[0014] In some embodiments, the quality control model construction step further includes:
[0015] The data acquisition steps involve acquiring subgraph data, triplet data, and entity data from the input knowledge graph, and calculating the corresponding quality control parameters; specifically, calculating the data timeliness parameter and data authenticity parameter among the quality control parameters.
[0016] The pre-trained model loading step involves loading a pre-trained graph model for training on the subgraph data, and loading a pre-trained text model for training on the triplet data and entity data, in order to improve the performance of the quality control model.
[0017] In the data embedding step, features are extracted and embedded from the subgraph data, triplet data, and entity data respectively in conjunction with the quality control parameters. The resulting subgraph vector, triplet vector, and entity vector are then fused to obtain a feature vector. The specific models corresponding to the three types of data are as follows: the Graph2vec model is used for the subgraph data, the transformer model is used for the triplet data, and the Node2vec model is used for the entity data.
[0018] The model training step utilizes a multi-head attention mechanism to train the network on the feature vectors. The trained network includes convolutional layers, pooling layers, fully connected layers, and a classification network (softmax). The classification network (softmax) outputs data confidence values. This step captures effective information from the feature vectors through the multi-head attention mechanism. The network activation function is ReLU, the learning rate is configured to 0.01, the dropout is configured to 0.5, and the batch size for one training iteration is configured to 256. The data confidence values output by the classification network (softmax) are distributed between [0,1]. The closer the confidence value is to 1, the higher the data quality; conversely, the lower the confidence value is, the lower the data quality.
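The training hyperparameters stated above can be collected in one place, as in the following sketch. The class name `TrainingConfig` and the helper `iter_batches` are hypothetical; only the values (ReLU, learning rate 0.01, dropout 0.5, batch size 256) come from the text.

```python
from dataclasses import dataclass
from typing import Iterator, List, Sequence


@dataclass(frozen=True)
class TrainingConfig:
    """Hyperparameter values taken from the description; field names are illustrative."""
    activation: str = "relu"
    learning_rate: float = 0.01
    dropout: float = 0.5
    batch_size: int = 256


def iter_batches(samples: Sequence, batch_size: int) -> Iterator[List]:
    """Yield consecutive batches; the last batch may be smaller."""
    for start in range(0, len(samples), batch_size):
        yield list(samples[start:start + batch_size])


config = TrainingConfig()
batches = list(iter_batches(range(600), config.batch_size))
# 600 samples are split into batches of 256, 256, and 88
```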
[0019] In some embodiments, the source data acquisition step further includes:
[0020] The data content update step involves updating the data from the data source. The update methods include full update, incremental update through program parsing, and / or incremental update through incremental crawling.
[0021] In a second aspect, embodiments of this application provide a big data quality control system, including:
[0022] The source data acquisition module is used to acquire source data from multiple preset data sources;
[0023] The knowledge graph construction module is used to construct a knowledge graph based on the source data through knowledge extraction, data alignment, data storage, incremental evolution, and visualization. The knowledge graph includes entity data, relation data, triple data, and subgraph data.
[0024] A quality control model construction module is used to configure quality control parameters and construct a quality evaluation dataset based on the quality control parameters and a knowledge graph. This module then constructs a multimodal end-to-end big data quality control model, trains the model using the quality evaluation dataset, and calculates and outputs data confidence values based on input entity data, subgraph data, and / or triplet data. The data in the quality evaluation dataset is used as parameters in a supervised learning task. The quality control parameters include: data comprehensiveness parameters, data timeliness parameters, data authenticity parameters, data relevance parameters, and data accuracy parameters.
[0025] In some embodiments, the knowledge graph construction module further includes:
[0026] The knowledge extraction module is used to extract multiple types of ontology from the source data and parse the ontology, extract entity attributes from the ontology, and extract entity relationships based on predefined inter-ontology relationships and inter-entity relationships within the ontology. The entity attributes include, but are not limited to: entity identifier (ID), entity function description (IAO_0000115), entity namespace, entity label, entity external link (hasDbXref), and entity synonym. The entity function description (IAO_0000115) describes the physiological process or function in which the entity participates in the ontology. The inter-ontology relationships include, but are not limited to: database links, paper links, webpage links, and literature links. The inter-entity relationships within the ontology include, but are not limited to: synonym associations, label associations, parent-child class relationships, and namespace associations.
[0027] The data alignment module is used to perform entity alignment based on the entity ID or entity identifier ID of the entity. Specifically, data from The OBO Foundry can be aligned based on the entity identifier ID, while attributes from other data sources can be aligned directly based on the entity ID.
[0028] In some embodiments, the quality control model building module further includes:
[0029] The data acquisition module is used to acquire subgraph data, triple data, and entity data of the input knowledge graph, and calculate the corresponding quality control parameters; specifically, it calculates the data timeliness parameter and data authenticity parameter in the quality control parameters.
[0030] The pre-trained model loading module is used to load a pre-trained graph model for training on the subgraph data, and to load a pre-trained text model for training on the triplet data and entity data, in order to improve the performance of the quality control model.
[0031] The data embedding module is used to extract and embed features from the subgraph data, triplet data, and entity data respectively in combination with the quality control parameters. The resulting subgraph vector, triplet vector, and entity vector are then fused to obtain a feature vector. The specific models corresponding to the three types of data are as follows: the Graph2vec model is used for the subgraph data, the transformer model is used for the triplet data, and the Node2vec model is used for the entity data.
[0032] The model training module is used to train the network on the feature vector using a multi-head attention mechanism. The trained network includes convolutional layers, pooling layers, fully connected layers, and a classification network softmax. The classification network softmax outputs data confidence values. This step captures effective information in the feature vector through the multi-head attention mechanism. The network activation function is ReLU, the learning rate is configured to 0.01, the random dropout is configured to 0.5, and the batch size of data samples captured in one training session is configured to 256. The data confidence values output by the classification network softmax are distributed between [0,1]. The closer to 1, the higher the data quality; conversely, the lower the confidence value, the lower the data quality.
[0033] In some embodiments, the source data acquisition module further includes:
[0034] The data content update module is used to update the data of the data source. The update methods include full update, incremental update through program parsing, and / or incremental update through incremental crawling.
[0035] In a third aspect, embodiments of this application provide a supercomputer including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the big data quality control method as described in the first aspect above.
[0036] In a fourth aspect, embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the big data quality control method as described in the first aspect above.
[0037] Compared to related technologies, the big data quality control method, system, supercomputer, and computer-readable storage medium provided in this application evaluate quality control parameters based on entities, relationships, and other elements of knowledge graphs to achieve graph-based big data quality control. This method is not strongly related to any particular domain and can be extended to other domains, such as social network data, transportation network data, and financial transaction network data.
[0038] Details of one or more embodiments of this application are set forth in the following drawings and description to make other features, objects and advantages of this application more readily apparent.
Description of the Drawings
[0039] The accompanying drawings, which are included to provide a further understanding of this application and form part of this application, illustrate exemplary embodiments and are used to explain this application, but do not constitute an undue limitation of this application. In the drawings:
[0040] Figure 1 is a flowchart of a big data quality control method according to an embodiment of this application;
[0041] Figure 2 is a step-by-step flowchart of a big data quality control method according to an embodiment of this application;
[0042] Figure 3 is another step-by-step flowchart of the big data quality control method according to an embodiment of this application;
[0043] Figure 4 is a schematic diagram of a quality control model according to an embodiment of this application;
[0044] Figure 5 is a schematic diagram of the principle architecture of a big data quality control method according to a preferred embodiment of this application;
[0045] Figure 6 is a schematic diagram of the knowledge extraction structure according to a preferred embodiment of this application;
[0046] Figure 7 is a schematic diagram of the attribute extraction result map according to a preferred embodiment of this application;
[0047] Figure 8 is an example diagram of the relationships between entities according to a preferred embodiment of this application;
[0048] Figure 9 is an example diagram illustrating the relationships between entities within the ontology according to a preferred embodiment of this application;
[0049] Figure 10 is a schematic diagram illustrating the data update status of each data source according to a preferred embodiment of this application;
[0050] Figure 11 is a structural block diagram of the big data quality control system according to an embodiment of this application;
[0051] Figure 12 is a schematic diagram of the visual interface of the big data quality control system according to an embodiment of this application;
[0052] Figure 13 is a schematic diagram illustrating the deployment principle of the big data quality control method according to an embodiment of this application.
[0053] In the figures:
[0054] 1. Source data acquisition module; 2. Knowledge graph construction module; 3. Quality control model construction module;
[0055] 101. Data Content Update Module; 201. Knowledge Extraction Module; 202. Data Alignment Module;
[0056] 301. Data Acquisition Module; 302. Pre-trained Model Loading Module; 303. Data Embedding Module;
[0057] 304. Model Training Module.
Detailed Implementation
[0058] To make the objectives, technical solutions, and advantages of this application clearer, the application is described and illustrated below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application. All other embodiments obtained by those skilled in the art based on the embodiments provided in this application without inventive effort are within the scope of protection of this application.
[0059] Obviously, the accompanying drawings described below are merely some examples or embodiments of this application. Those skilled in the art can apply this application to other similar scenarios based on these drawings without any inventive effort. Furthermore, it is understood that although the efforts made in this development process may be complex and lengthy, for those skilled in the art related to the content disclosed in this application, any changes to design, manufacturing, or production based on the technical content disclosed in this application are merely conventional technical means and should not be construed as insufficient disclosure of the content of this application.
[0060] In this application, the reference to "embodiment" means that a specific feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment that is mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described in this application may be combined with other embodiments without conflict.
[0061] Unless otherwise defined, the technical or scientific terms used in this application shall have the ordinary meaning understood by one of ordinary skill in the art to which this application pertains. The terms “a,” “an,” “the,” and similar words used in this application do not indicate quantity limitation and may indicate singular or plural. The terms “comprising,” “including,” “having,” and any variations thereof used in this application are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that includes a series of steps or modules (units) is not limited to the listed steps or units, but may also include steps or units not listed, or may include other steps or units inherent to these processes, methods, products, or devices. The terms “connected,” “linked,” “coupled,” and similar words used in this application are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. “Multiple” used in this application refers to two or more. “And / or” describes the relationship between related objects, indicating that three relationships may exist; for example, “A and / or B” can represent: A alone, A and B simultaneously, and B alone. The character “ / ” generally indicates that the preceding and following objects are in an “or” relationship. The terms “first,” “second,” and “third” used in this application are merely to distinguish similar objects and do not represent a specific ordering of the objects.
[0062] An ontology is a description of the form in which an entity exists. It is often represented as a set of conceptual definitions and hierarchical relationships between concepts. The ontology framework takes the form of a tree structure and is commonly used to define the schema of a knowledge graph. For example, people, companies, and cars can all be referred to as ontologies in specific scenarios.
[0063] With the development of big data technology, graph data has become an important data model in the field of big data technology, and knowledge graphs have become an important big data carrier. In order to solve the problem that data quality control methods in other fields cannot be transferred to the field of biomedical technology, this application proposes a quality control method based on deep learning that is highly scalable and widely applicable to graph data.
[0064] This embodiment provides a big data quality control method. Figures 1-3 are flowcharts of a big data quality control method according to an embodiment of this application. As shown in Figures 1-3, the process includes the following steps:
[0065] Source data acquisition step S1: Obtain source data from multiple preset data sources;
[0066] The knowledge graph construction step S2 involves constructing a knowledge graph based on the source data through knowledge extraction, data alignment, data storage, incremental evolution, and visualization. The knowledge graph includes entity data, relation data, triplet data, and subgraph data.
[0067] Step S3 of the quality control model construction involves configuring quality control parameters and constructing a quality evaluation dataset based on the quality control parameters and the knowledge graph. A multimodal end-to-end big data quality control model is then constructed. The quality control model is trained using the quality evaluation dataset. The trained quality control model is used to calculate and output data confidence values based on the input entity data, subgraph data, and / or triple data. The data from the quality evaluation dataset is used as parameters in the supervised learning task.
[0068] The data in the graph encompasses various data types, including text, video, images, and graph data, and its quality varies considerably. To measure the data quality of biomedical big data, this embodiment determines and configures quality control parameters based on eight data indicators: data accuracy, data timeliness, data immediacy, data authenticity, data precision, data completeness, data comprehensiveness, and data relevance. These quality control parameters include: data comprehensiveness parameters, data timeliness parameters, data authenticity parameters, data relevance parameters, and data accuracy parameters. The specific calculation models for these parameters are shown below:
[0069] The data comprehensiveness parameter $Q_{overall}$ can be calculated based on the following computational model:
[0070]
[0071] Where S1 represents the number of data types; if the specified data contains only text data, S1 takes the value 1. i denotes the i-th data type, with an initial value of 1. $n_i$ represents the number of data items of the i-th data type; if i denotes the text data type, $n_i$ is the number of text data items. N represents the total number of data items across all data types; if there are two data types in total, text and audio, with a total text data volume of $n_1$ and a total audio data volume of $n_2$, then $N = n_1 + n_2$. Based on this, the richness of the data can be quantitatively measured using the data comprehensiveness parameter.
[0072] The data timeliness parameter $Q_{time}$ can be calculated based on the following computational model:
[0073]
[0074] Where S2 represents the number of data updates, with a default value of 1; for example, S2 = 2 means that the data has been updated twice since publication (data publication is counted as the first update). k denotes the k-th data update, with a default value of 1; $t_k$ represents the time of the k-th data update, in months; P represents the time when the data was published, in months; and T represents the current time, in months. Based on this, the data timeliness parameter quantifies the publication time, update time, and update frequency of the data, and thereby measures its timeliness.
[0075] The data authenticity parameter $Q_{truth}$ can be calculated based on the following computational model:
[0076]
[0077] Where Cite represents the number of references to the data, with a default value of 0; Total represents the number of associated data items, with a default value of 1; and format represents the number of data types. By way of example and not limitation, if the data includes both text and voice data types, format takes the value 2; data types can also include video, images, etc. Based on this, the authenticity of the data can be measured through parameters such as associated data, referenced data, and data types.
[0078] The data relevance parameter $Q_{rele}$ can be calculated based on the following computational model:
[0079]
[0080] Where Max is the total number of data nodes in the selected region; m denotes the m-th data node; $Degree_m$ represents the degree of the m-th data node; and Total represents the sum of the degrees of all data nodes within the selected region. Based on this, the degree of association between specified data and other data can be measured using parameters such as the degree, total degree, and number of nodes. Degree includes out-degree and in-degree: in a directed graph, each arrow points from one vertex to another; the number of arrows pointing to a vertex is its in-degree, and the number of arrows pointing out of that vertex is its out-degree.
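The inputs to the relevance model (per-node degree, the degree sum Total, and the node count Max) can be obtained from a directed edge list, as in the sketch below. The function name `degree_stats` and the node names are hypothetical, and the sketch stops at these inputs without attempting the relevance formula itself.

```python
from collections import Counter


def degree_stats(edges):
    """Return in-degree, out-degree, total degree per node, and the degree sum."""
    in_deg, out_deg = Counter(), Counter()
    for head, tail in edges:
        out_deg[head] += 1   # arrow leaves the head vertex
        in_deg[tail] += 1    # arrow enters the tail vertex
    nodes = set(in_deg) | set(out_deg)
    # Degree of a node is its in-degree plus its out-degree
    degree = {n: in_deg[n] + out_deg[n] for n in nodes}
    return in_deg, out_deg, degree, sum(degree.values())


# Hypothetical region of a knowledge graph, given as (head, tail) pairs
edges = [("drugA", "targetB"), ("drugA", "pathwayC"), ("targetB", "pathwayC")]
in_deg, out_deg, degree, total = degree_stats(edges)
max_nodes = len(degree)  # number of data nodes in the selected region
```

Because every edge contributes one out-degree and one in-degree, the degree sum is always twice the number of edges, which is a convenient sanity check.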
[0081] The data accuracy parameter $Q_{acc}$ can be calculated based on the following computational model:
[0082]
[0083] Where $Q_j$ represents the j-th data quality control parameter, with $Q_1 = Q_{overall}$, $Q_2 = Q_{time}$, $Q_3 = Q_{truth}$, and $Q_4 = Q_{rele}$; the mean used in the model is the sample mean of $Q_1$ to $Q_4$; and num is the number of parameters, with a value of 4. Based on this, a quantitative measurement of the accuracy parameter is achieved using the above parameters.
[0084] In some embodiments, the quality assessment dataset is in the form of serialized data. The quality assessment dataset includes entity data, relation data, triple data, subgraph data, and quality control parameters. The triple includes a head entity, a relation, and a tail entity, which can be represented as (h, r, t), where h is the head entity, t is the tail entity, and r is the relation between the head entity and the tail entity. The ratio of the training set, validation set, and test set in the dataset is 6:2:2, respectively.
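A minimal sketch of the (h, r, t) representation and the 6:2:2 split follows. The names `Triple` and `split_622` are hypothetical, and the shuffling with a fixed seed is an assumption, since the text does not say how samples are assigned to the three sets.

```python
import random
from typing import NamedTuple


class Triple(NamedTuple):
    h: str  # head entity
    r: str  # relation between head and tail
    t: str  # tail entity


def split_622(items, seed=0):
    """Shuffle and split into training, validation, and test sets at 6:2:2."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the boundaries
    n_train = len(shuffled) * 6 // 10
    n_val = len(shuffled) * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


# Hypothetical triples for illustration only
triples = [Triple(f"drug{i}", "targets", f"protein{i}") for i in range(10)]
train, val, test = split_622(triples)
```

Any remainder from the integer division falls into the test set, so no sample is dropped.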
[0085] In some embodiments, the knowledge graph construction step S2 further includes:
[0086] In the knowledge extraction step S201, multiple types of ontology are extracted from the source data and parsed. Entity attributes are extracted from the ontology, and entity relationships are extracted based on predefined inter-ontology relationships and intra-ontology entity relationships. The entity attributes include, but are not limited to: entity identifier (ID), entity function description (IAO_0000115), entity namespace (NameSpace), entity label (Label), entity external link (hasDbXref), and entity synonym (Synonym). The entity function description (IAO_0000115) describes the physiological process or function in which the entity participates in the ontology. The inter-ontology relationships include, but are not limited to: database links, paper links, webpage links, and literature links. The intra-ontology entity relationships include, but are not limited to: synonym association, label association, parent-child class relationship, and namespace association. Synonym association is a weak mutual association between entities within the ontology and is used to define aliases or synonyms for an entity based on the entity synonym. Label association is a weak mutual association between entities within the ontology and is used to specify the functional category to which an entity belongs based on the entity label. The parent-child class relationship includes the parent class SuperClass and the child class SubClass and is a strong mutual association between entities within the ontology. Namespace association is a weak mutual association between entities within the ontology and is used to specify the functional category to which an entity belongs based on the entity namespace.
[0087] Data alignment step S202: Entity alignment is performed based on the entity ID or entity identifier ID of the entity.
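ID-based entity alignment can be sketched as a merge of records that share an identifier. The function name `align_by_id`, the record fields, and the `EX:` identifiers are hypothetical. The conflict policy (the first value seen for an attribute is kept) is an assumption, as the text does not describe how conflicting attributes are resolved.

```python
from collections import defaultdict


def align_by_id(records):
    """Merge records that share the same entity ID into one entity."""
    merged = defaultdict(dict)
    for record in records:
        entity_id = record["id"]
        for key, value in record.items():
            # Assumed policy: keep the first value seen for each attribute
            merged[entity_id].setdefault(key, value)
    return dict(merged)


# Hypothetical records from two sources describing overlapping entities
records = [
    {"id": "EX:0001", "label": "example compound", "source": "ontology"},
    {"id": "EX:0001", "formula": "C2H6O", "source": "database"},
    {"id": "EX:0002", "label": "example protein", "source": "ontology"},
]
aligned = align_by_id(records)
```

After alignment, `EX:0001` carries both the label from the first source and the formula from the second.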
[0088] In some embodiments, as shown in Figure 4, step S3 of constructing the quality control model further includes:
[0089] In the data acquisition step S301, the subgraph data, triplet data, and entity data of the input knowledge graph are acquired, and the corresponding quality control parameters are calculated; specifically, the data timeliness parameter and the data authenticity parameter. Although this embodiment uses only these two parameters, the set calculated in this step can be flexibly configured according to which quality control parameters can be derived from the subgraph data, triplet data, and entity data: the data comprehensiveness parameter can also be calculated for the subgraph data, and the relevance parameter and data accuracy parameter can also be calculated for the entity data and triplet data. For example, the data comprehensiveness, data timeliness, and data authenticity parameters may be calculated for the subgraph data, and all quality control parameters may be calculated for the triplet data. This embodiment calculates only the data timeliness and data authenticity parameters for computational convenience.
[0090] In the pre-trained model loading step S302, a pre-trained graph model is loaded for training on the subgraph data, and a BERT-based pre-trained text model is loaded for training on the triplet data and entity data, in order to improve the performance of the quality control model.
[0091] In data embedding step S303, the subgraph data, triplet data, and entity data are subjected to feature extraction using GraphEncoder, TextEncoder, and NodeEncoder, respectively, and embedding using GraphEmbedding, TextEmbedding, and NodeEmbedding, in conjunction with the quality control parameters. The resulting subgraph vector, triplet vector, and entity vector are then fused to obtain the feature vector Fusion Features. The specific models corresponding to the three types of data are as follows: the subgraph data uses the Graph2vec model, the triplet data uses the transformer model, and the entity data uses the Node2vec model.
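The fusion of the three embeddings can be illustrated with simple concatenation. Concatenation is an assumption here, since the text does not name the fusion operator, and the vectors below are hand-written stand-ins rather than real Graph2vec, transformer, or Node2vec outputs.

```python
from typing import List, Sequence


def fuse_features(subgraph_vec: Sequence[float],
                  triple_vec: Sequence[float],
                  entity_vec: Sequence[float]) -> List[float]:
    """Fuse three embeddings by concatenation (an assumed fusion operator)."""
    return [*subgraph_vec, *triple_vec, *entity_vec]


subgraph_vec = [0.1, 0.2]      # stand-in for a subgraph embedding
triple_vec = [0.3, 0.4, 0.5]   # stand-in for a triplet embedding
entity_vec = [0.6]             # stand-in for an entity embedding
fusion_features = fuse_features(subgraph_vec, triple_vec, entity_vec)
```

With concatenation, the fused vector's length is the sum of the three input lengths, so the downstream network's input size must be chosen accordingly.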
[0092] In model training step S304, the feature vector is trained using a multi-head attention mechanism. The trained network includes convolutional layers, pooling layers, fully connected layers, and a classification network softmax. The classification network softmax outputs a data confidence value. The data confidence value output by the classification network softmax is distributed between [0,1]. The closer the confidence value is to 1, the higher the data quality; conversely, the lower the confidence value is, the lower the data quality.
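How a softmax output can serve as a confidence value in [0,1] is shown in the sketch below. It assumes a two-class head (low quality, high quality) and reads the probability of the high-quality class as the confidence; the description does not state the number of classes, so this is an illustrative assumption, and the logits are placeholder values.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the maximum before exponentiating."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits: Sequence[float], high_quality_index: int = 1) -> float:
    """Return the probability assigned to the (assumed) high-quality class."""
    return softmax(logits)[high_quality_index]


conf_good = confidence([-1.0, 3.0])  # logits favour the high-quality class
conf_bad = confidence([2.0, -2.0])   # logits favour the low-quality class
conf_even = confidence([0.0, 0.0])   # no preference
```

Because softmax outputs are non-negative and sum to 1, the returned value always lies in [0,1], which matches the range stated for the data confidence value.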
[0093] Based on the above steps, this application provides a method for big data quality control based on knowledge graph data, which solves the problem that existing data quality control methods are highly specific to certain fields and cannot be transferred to the biomedical field. Since the quality control model is calculated based on data units of graph data such as entities, subgraphs, and triples, and the calculation models of the several types of quality control parameters do not depend strongly on the field to which the data belongs, the quality control method of this application can be widely extended to various fields and has strong versatility.
[0094] In some embodiments, as shown in Figure 1, the source data acquisition step S1 further includes:
[0095] The data content update step S101 involves updating the data from the data source. The update methods include full update, incremental update through program parsing, and / or incremental update through incremental crawling.
[0096] The embodiments of this application are described and illustrated below through a preferred embodiment. As shown in Figures 1-5, in this embodiment the big data quality control method of this application is applied to the biomedical field to realize big data quality control, and is constructed as including a data layer, a knowledge layer, and an application layer.
[0097] The data sources obtained in step S1 include 30 biomedical databases such as DrugBank, KEGG (KeggDrug), RCSB PDB, PubMed, Uniprot, Pubchem, Chemspider, Wikipedia, and patents. The source data includes eight types of ontology: diseases, genes, cell lines, proteins, targets, compounds, drugs, and pathways. The source data covers rule bases, algorithm bases, model bases, literature bases, patent bases, and ontology bases. Among them, the protein PRO, cell line CLO, disease DOID, compound CHEBI, pathway PW, and gene GO data come from The OBO Foundry (Biomedical Ontology Resource Service and Application).
[0098] Based on the above source data, the knowledge graph constructed in step S2 includes eight categories of entities: diseases, genes, cell lines, proteins, targets, compounds, drugs, and pathways. The number of entities reaches 528,243, and the number of data records formed by entities and the relationships among them (both relationships between ontologies and relationships between entities within an ontology) reaches 353,612, as shown in Table 1 below. The data sources are rich, the number of entities is large, and the inter-ontology and intra-ontology relationships are complex.
[0099] Table 1
[0100]
[0101] Taking the compounds in Table 1 as an example, the specific data sources are shown in Table 2 below. The compound data sources in this embodiment are abundant, with 31,225 compound records from MC3D, 9,091 from PubChem, and 99,413 from CHEBI. Moreover, each data source in the knowledge graph is directly related to other ontologies (e.g., patents, drugs, etc.). For detailed associations, please refer to the "External Associations" field in Table 2.
[0102] Table 2
[0103]
[0104]
[0105] Taking targets, drugs, literature, patents, and Wikipedia entries as examples, the specific data statistics are shown in Table 3 below.
[0106] Table 3
[0107]
[0108] Based on the biomedical source data shown above, step S2 of knowledge graph construction is performed, as shown in Figure 6. First, the knowledge extraction step S201 is performed, which involves knowledge point identification and acquisition, and attribute-value acquisition, for eight types of ontology: diseases, genes, cell lines, proteins, targets, compounds, drugs, and pathways. This is divided into three parts: entity extraction, relation extraction, and entity attribute extraction. In this embodiment, Jena, a Java API that supports semantic web applications, is used to parse the ontologies in OBO Foundry and extract entity attributes. Examples of the obtained entity attributes are shown in Table 4 below and are not elaborated further here. By way of example and not limitation, the extracted inter-ontology relationships include target-document, target-drug, compound-drug, compound-target, and compound-pathway relationships. The extracted intra-ontology relationships include: parent class (SuperClass), child class (SubClass), entity label (Label), entity namespace (NameSpace), and entity synonym (Synonym). The biomedical knowledge graph uses label association, namespace association, and synonym association to assist in constructing intra-ontology entity relationships.
[0109] Table 4
[0110]
[0111]
[0112] As shown in Figure 7, this embodiment uses DTO_02100034 as an example to illustrate the attribute extraction results. The name of the entity DTO_02100034 is Beta-2 adrenergic receptor, which is a β-2 adrenergic receptor encoded in the human genome. Its ID in the PRO ontology is WCB, its ID in the UniProtKB data source is P07550, and its ID in the Uniprot data source is P07550. Its parent class is DTO_02300094, and the corresponding gene is ADRB2. Synonyms for the gene ADRB2 include ADRB2R and B2AR. Diseases associated with the Beta-2 adrenergic receptor include DOID_114, DOID_3083, DOID_10763, DOID_9970, DOID_3393, and DOID_2841. Compounds that interact with the Beta-2 adrenergic receptor include CHEBI_29105, CHEBI_33569, and CHEBI_28918.
[0113] Examples of relationships between entities are shown in Figure 8. The compound ontology is directly associated with the literature, patent, and drug ontologies, and indirectly associated with other ontologies through these three. The relationships between ontologies are mainly extracted from the external link data in the entity's hasDbXref field. The direct associations between compounds and other ontologies are shown in Table 5.
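The extraction result for DTO_02100034 can be pictured as a record that is then flattened into triples. The sketch below is illustrative only: the class `TargetEntity`, its field names, and the relation names (`subClassOf`, `encodedBy`, and so on) are hypothetical and are not the schema used in this embodiment; the identifiers are those listed above.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Triple = Tuple[str, str, str]


@dataclass
class TargetEntity:
    entity_id: str
    label: str
    parent_class: str
    gene: str
    gene_synonyms: List[str] = field(default_factory=list)
    diseases: List[str] = field(default_factory=list)
    compounds: List[str] = field(default_factory=list)

    def to_triples(self) -> List[Triple]:
        """Flatten the record into (head, relation, tail) triples.

        Relation names are hypothetical placeholders.
        """
        triples: List[Triple] = [
            (self.entity_id, "subClassOf", self.parent_class),
            (self.entity_id, "encodedBy", self.gene),
        ]
        triples += [(self.gene, "hasSynonym", s) for s in self.gene_synonyms]
        triples += [(self.entity_id, "associatedWithDisease", d) for d in self.diseases]
        triples += [(self.entity_id, "interactsWithCompound", c) for c in self.compounds]
        return triples


adrb2 = TargetEntity(
    entity_id="DTO_02100034",
    label="Beta-2 adrenergic receptor",
    parent_class="DTO_02300094",
    gene="ADRB2",
    gene_synonyms=["ADRB2R", "B2AR"],
    diseases=["DOID_114", "DOID_3083", "DOID_10763", "DOID_9970", "DOID_3393", "DOID_2841"],
    compounds=["CHEBI_29105", "CHEBI_33569", "CHEBI_28918"],
)
triples = adrb2.to_triples()
```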
[0114] Table 5
[0115]
Direct relationship | Related nodes
Compounds - Drugs | Inchi, Drug_Central, KEGG, Smiles
Compounds - Literature | PMID, CAS-related compounds and related literature
Compound - Patent | hasDbXref - Document link, webpage link
[0116] In the table, compounds are directly associated with drugs through unique molecular structure identifiers such as Inchi and Smiles and through database identifiers such as KEGG and Drug_Central; they are directly associated with literature through literature identifiers such as PMID and CAS; and they are directly associated with patents through document links and webpage links. Further, taking the Beta-2 adrenergic receptor as an example, it has a targeting effect on norepinephrine, and the ontological relationship between the two is shown in Figure 9.
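Association through shared identifiers can be sketched as a simple join. The example assumes that each compound and drug record is a dictionary carrying whichever identifier fields it has; the function name, the record layout, and every value in the sample records are invented placeholders, and only the identifier field names come from Table 5.

```python
from typing import Dict, List, Tuple

# Identifier fields named in Table 5 for compound-drug association.
LINK_FIELDS = ("Inchi", "Smiles", "KEGG", "Drug_Central")


def link_compounds_to_drugs(
    compounds: List[Dict[str, str]],
    drugs: List[Dict[str, str]],
) -> List[Tuple[str, str, str]]:
    """Return (compound_id, matched_field, drug_id) for each pair sharing an identifier."""
    links: List[Tuple[str, str, str]] = []
    for compound in compounds:
        for drug in drugs:
            for field_name in LINK_FIELDS:
                value = compound.get(field_name)
                if value is not None and value == drug.get(field_name):
                    links.append((compound["id"], field_name, drug["id"]))
                    break  # one matching identifier is enough for this pair
    return links


# Placeholder records; IDs and identifier strings are invented for illustration.
compounds = [
    {"id": "compound-A", "Inchi": "InChI=placeholder-1"},
    {"id": "compound-B", "KEGG": "kegg-placeholder-2"},
]
drugs = [
    {"id": "drug-X", "Inchi": "InChI=placeholder-1"},
    {"id": "drug-Y", "KEGG": "kegg-placeholder-9"},
]
links = link_compounds_to_drugs(compounds, drugs)
```

The nested loops are quadratic and are kept only for readability; at the scale described here an index from identifier value to drug ID would be the more practical choice.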
[0117] This embodiment divides the biomedical knowledge graph data sources into The OBO Foundry data source and other data sources. The OBO Foundry data source includes CHEBI, DOID, PR, GO, PW, and CLO. Other data sources include Uniprot, DrugBank, CHEBI, KeggDrug, PubMed, Patent, Wikipedia, RCSB PDB, and PubChem. Therefore, when performing the data alignment step S202, data from The OBO Foundry can be aligned based on entity identifier IDs, while attributes from other data sources can be aligned directly based on entity IDs. Specifically, the entity identifier IDs for CHEBI, DOID, PR, GO, PW, and CLO are configured as CHEBI_ID, DOID_ID, PR_ID, GO_ID, PW_ID, and CLO_ID, respectively.
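A minimal sketch of ID-based alignment follows. It assumes that records are dictionaries, that OBO Foundry records carry one of the configured identifier fields, and that records from other sources carry a generic `entity_id` field; the `entity_id` name, the function names, and the sample labels are hypothetical.

```python
from typing import Dict, Iterable, Optional

# Identifier fields configured for The OBO Foundry ontologies (from the description).
OBO_ID_FIELDS = ("CHEBI_ID", "DOID_ID", "PR_ID", "GO_ID", "PW_ID", "CLO_ID")


def alignment_key(record: Dict[str, str]) -> Optional[str]:
    """Return the ID used to align a record.

    OBO Foundry records are matched on their ontology-specific identifier field;
    all other records fall back to the (hypothetical) "entity_id" field.
    """
    for field_name in OBO_ID_FIELDS:
        if field_name in record:
            return record[field_name]
    return record.get("entity_id")


def align(records: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, str]]:
    """Merge the attributes of all records that share an alignment key."""
    merged: Dict[str, Dict[str, str]] = {}
    for record in records:
        key = alignment_key(record)
        if key is None:
            continue  # nothing to align on
        merged.setdefault(key, {}).update(record)
    return merged


# Sample records; labels and sources are placeholders.
records = [
    {"CHEBI_ID": "CHEBI_29105", "Label": "placeholder label"},
    {"entity_id": "CHEBI_29105", "source": "PubChem"},
    {"DOID_ID": "DOID_114", "Label": "placeholder disease"},
]
aligned = align(records)
```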
[0118] Then proceed to step S3, which involves building the quality control model.
[0119] The data in the biomedical knowledge graph can be refined into four basic units: entities, relations, triples, and subgraphs. This application first assesses the computability of the dataset by calculating five quality control parameters for these four basic units. Calculations show that the data relevance and accuracy parameters for subgraph data units, as well as the data comprehensiveness, relevance, and accuracy parameters for relation data units, cannot be calculated. Therefore, in constructing the dataset, this embodiment uses the data timeliness and data authenticity parameters from the quality control parameters as parameters, and uses entity data and triple data as basic data units. In data acquisition step S301, 881,855 data entries were obtained from the biomedical big data, including 353,612 triple data entries, 528,243 entity data entries, and 16,523 subgraph data entries. The five quality control parameters for the above data were calculated, and the original data was compiled into a dataset according to the dataset structure.
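The computability assessment can be expressed as a small lookup. The sketch encodes only the parameter/unit combinations that the paragraph above reports as not calculable and treats every other combination as calculable, which is an assumption for any combination the text does not mention explicitly. The names used are hypothetical.

```python
from typing import Dict, List, Set

PARAMETERS = ("comprehensiveness", "timeliness", "authenticity", "relevance", "accuracy")
UNITS = ("entity", "relation", "triple", "subgraph")

# Combinations reported as not calculable in this embodiment.
NOT_COMPUTABLE: Dict[str, Set[str]] = {
    "subgraph": {"relevance", "accuracy"},
    "relation": {"comprehensiveness", "relevance", "accuracy"},
}


def computable(unit: str) -> List[str]:
    """Parameters that can be calculated for a given basic data unit."""
    blocked = NOT_COMPUTABLE.get(unit, set())
    return [p for p in PARAMETERS if p not in blocked]


# Parameters that can be calculated for every one of the four basic units.
common = [p for p in PARAMETERS if all(p in computable(u) for u in UNITS)]
```

Under these assumptions the parameters common to all four units are data timeliness and data authenticity, the two parameters this embodiment selects.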
[0120] The final calculation result is obtained through the pre-trained model loading step S302, the data embedding step S303, and the model training step S304. In the model training step S304, effective information in the feature vector is captured through a multi-head attention mechanism. The activation function of the network is ReLU, the learning rate is configured to be 0.01, the random dropout is configured to be 0.5, and the batch size of data samples captured in one training session is configured to be 256.
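The training settings above can be collected in one configuration object. The sketch below is framework-agnostic; the class and function names are hypothetical. It also works out how many mini-batches one pass over the data requires, assuming the training set consists of the triple and entity entries counted in step S301.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingConfig:
    activation: str = "relu"
    learning_rate: float = 0.01
    dropout: float = 0.5
    batch_size: int = 256


def batches_per_epoch(num_samples: int, batch_size: int) -> int:
    """Number of mini-batches needed to see every sample once (last batch may be partial)."""
    return math.ceil(num_samples / batch_size)


config = TrainingConfig()
num_samples = 353_612 + 528_243  # triple entries + entity entries from step S301
steps = batches_per_epoch(num_samples, config.batch_size)
```

With 881,855 samples and a batch size of 256, one epoch takes 3,445 mini-batches, the last of which holds 191 samples.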
[0121] Based on the above steps, users can input entity data or triplet data from the biomedical field into the quality control model. The model directly outputs the confidence value of the entity data or triplet data, which can be used directly as a reference value for quality control. This solves the problem that quality control methods from fields such as marine observation, medical imaging, and environmental monitoring cannot be transferred directly to the biomedical field. This embodiment is not limited to the biomedical field and can also be extended to other fields such as social network data, transportation network data, and financial transaction data. The performance of the model improves as the scale of the dataset increases.
[0122] It should be noted that relational databases are not suitable for constructing compound knowledge graphs. Taking compounds as an example, in a relational database compounds are associated with other entities, and composite indexes must be designed to ensure data uniqueness; this design performs poorly in actual storage, querying, and application. Furthermore, relational databases are not suitable for subsequent knowledge discovery and entity attribute prediction research based on compound knowledge graphs. Therefore, this embodiment uses Neo4j for data storage. Neo4j is a high-performance NoSQL graph database that stores structured data as a graph (network) rather than in tables. It is an embedded, disk-based Java persistence engine with full transaction capabilities.
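To illustrate how a graph store can keep nodes and relationships unique without composite indexes, the sketch below builds a parameterized Cypher statement using the `MERGE` clause, which matches an existing pattern or creates it if absent. The node labels, the `id` property, the relationship type, and the drug ID are hypothetical and do not describe the schema of this embodiment; the code only builds the query text and does not connect to a database.

```python
import re

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def merge_relationship_query(src_label: str, rel_type: str, dst_label: str) -> str:
    """Build a parameterized Cypher statement that upserts two nodes and one relationship.

    Labels and relationship types cannot be passed as Cypher parameters, so they are
    validated and interpolated; node IDs stay as the $src_id / $dst_id parameters.
    """
    for name in (src_label, rel_type, dst_label):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    return (
        f"MERGE (a:{src_label} {{id: $src_id}}) "
        f"MERGE (b:{dst_label} {{id: $dst_id}}) "
        f"MERGE (a)-[:{rel_type}]->(b)"
    )


query = merge_relationship_query("Compound", "ASSOCIATED_WITH", "Drug")
params = {"src_id": "CHEBI_29105", "dst_id": "hypothetical-drug-id"}
```

The resulting `query` and `params` would be handed to whichever Neo4j client the deployment uses; running the statement twice leaves a single pair of nodes and a single relationship.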
[0123] As shown in Figure 1, considering that this preferred embodiment is applied to the biomedical field, and given the amount of information and knowledge in this field, full update or incremental update can be flexibly selected in the data content update step S101 based on factors such as data volume, update frequency, and update method. For example, Figure 10 illustrates the data volume, update frequency, and update methods of several major biomedical data sources. Because the Uniprot and DrugBank data sources are updated annually, they are configured for full updates. The OBO Foundry data source, which requires periodic parsing, is configured for incremental updates through program parsing. The PubChem, ChemSpider, RCSB PDB, and KEGG data sources are configured for updates through incremental crawling. To improve efficiency, full updates are performed annually with the latest version of the data source, replacing the older version stored in the Neo4j database. Incremental updates through program parsing are performed semi-annually, batch by batch, to update the data from the OBO Foundry data source. Data sources that use incremental crawling can be updated periodically.
[0124] Additionally, it should be noted that the steps shown in the above process or in the flowcharts of the accompanying figures can be executed in a computer system, for example as a set of computer-executable instructions. Although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order than that shown here.
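The update configuration described in this paragraph can be written down as a lookup table. The sketch below is only one way to represent it; the enum, dictionary, and function names are hypothetical. The interval for incremental crawling is left unset because the description says only that such sources are updated periodically.

```python
from enum import Enum
from typing import Dict, List, Optional


class UpdateMethod(Enum):
    FULL = "full update"
    INCREMENTAL_PARSING = "incremental update through program parsing"
    INCREMENTAL_CRAWLING = "incremental update through incremental crawling"


# Data source -> update method, as configured in this embodiment.
UPDATE_PLAN: Dict[str, UpdateMethod] = {
    "Uniprot": UpdateMethod.FULL,
    "DrugBank": UpdateMethod.FULL,
    "The OBO Foundry": UpdateMethod.INCREMENTAL_PARSING,
    "PubChem": UpdateMethod.INCREMENTAL_CRAWLING,
    "ChemSpider": UpdateMethod.INCREMENTAL_CRAWLING,
    "RCSB PDB": UpdateMethod.INCREMENTAL_CRAWLING,
    "KEGG": UpdateMethod.INCREMENTAL_CRAWLING,
}

# Update interval in months; None means the description gives no fixed interval.
INTERVAL_MONTHS: Dict[UpdateMethod, Optional[int]] = {
    UpdateMethod.FULL: 12,
    UpdateMethod.INCREMENTAL_PARSING: 6,
    UpdateMethod.INCREMENTAL_CRAWLING: None,
}


def sources_for(method: UpdateMethod) -> List[str]:
    """Alphabetical list of data sources configured for the given update method."""
    return sorted(name for name, m in UPDATE_PLAN.items() if m is method)


full_update_sources = sources_for(UpdateMethod.FULL)
```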
[0125] This application also provides a big data quality control system for implementing the above embodiments and preferred embodiments; details already described will not be repeated. As used below, the terms "module," "unit," "subunit," etc., can refer to a combination of software and / or hardware that performs a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.
[0126] Figure 11 is a structural block diagram of the big data quality control system according to an embodiment of this application. As shown in Figure 11, the system includes:
[0127] Source data acquisition module 1 is used to acquire source data from multiple preset data sources;
[0128] The knowledge graph construction module 2 is used to construct a knowledge graph based on the source data through knowledge extraction, data alignment, data storage, incremental evolution, and visualization. The knowledge graph includes entity data, relation data, triple data, and subgraph data. The knowledge graph construction module 2 further includes a knowledge extraction module 201 and a data alignment module 202. The knowledge extraction module 201 is used to extract multiple types of ontology from the source data and parse the ontologies, extract entity attributes from the ontologies, and extract entity relationships based on predefined inter-ontology relationships and relationships between entities within an ontology. The entity attributes include, but are not limited to, entity identifier (ID), entity function description (IAO_0000115), entity namespace (NameSpace), entity label (Label), entity external link (hasDbXref), and entity synonym (Synonym). The entity function description (IAO_0000115) describes the physiological process or function in which the entity participates in the ontology. The inter-ontology relationships include, but are not limited to, database links, paper links, webpage links, and literature links. The relationships between entities within an ontology include, but are not limited to, synonym associations, label associations, parent-child class relationships, and namespace associations. The data alignment module 202 is used to perform entity alignment based on the entity ID or entity identifier ID of the entity. Specifically, data from The OBO Foundry can be aligned based on the entity identifier ID, while attributes from other data sources can be aligned directly based on the entity ID.
[0129] The quality control model construction module 3 is used to configure quality control parameters and construct a quality evaluation dataset based on the quality control parameters and knowledge graph, build a multimodal end-to-end big data quality control model, train the quality control model using the quality evaluation dataset, and calculate and output data confidence values based on input entity data, subgraph data, and / or triplet data. The data in the quality evaluation dataset is used as parameters in supervised learning tasks. The quality control parameters include: data comprehensiveness parameters, data timeliness parameters, data authenticity parameters, data correlation parameters, and data accuracy parameters. The quality control model construction module 3 further includes:
[0130] The data acquisition module 301 is used to acquire the subgraph data, triple data and entity data of the input knowledge graph, and calculate the corresponding quality control parameters; specifically, it calculates the data timeliness parameter and data authenticity parameter in the quality control parameters.
[0131] The pre-trained model loading module 302 is used to load a pre-trained graph model to train on the subgraph data, and to load a BERT-based pre-trained text model to train on the triplet data and entity data, in order to improve the performance of the quality control model.
[0132] The data embedding module 303 is used to perform feature extraction (GraphEncoder, TextEncoder, NodeEncoder) and embedding (GraphEmbedding, TextEmbedding, NodeEmbedding) on the subgraph data, triplet data, and entity data respectively, in conjunction with the quality control parameters. The resulting subgraph vector, triplet vector, and entity vector are then fused to obtain the feature vector Fusion Features. The specific models corresponding to the three types of data are as follows: the Graph2vec model is used for the subgraph data, the transformer model is used for the triplet data, and the Node2vec model is used for the entity data.
[0133] Model training module 304 is used to train the network on the feature vector using a multi-head attention mechanism. The trained network includes convolutional layers, pooling layers, fully connected layers, and a classification network softmax. The classification network softmax outputs a data confidence value. This step captures effective information in the feature vector through the multi-head attention mechanism. The network activation function is ReLU, the learning rate is configured to 0.01, the random dropout is configured to 0.5, and the batch size of data samples captured in one training session is configured to 256. The data confidence value output by the classification network softmax is distributed between [0,1]. The closer to 1, the higher the data quality; conversely, the lower the confidence value, the lower the data quality.
[0134] As shown in Figure 11, the system includes all the modules described above. Furthermore, the source data acquisition module further includes:
[0135] The data content update module 101 is used to update the data of the data source. The update methods include full update, incremental update through program parsing, and / or incremental update through incremental crawling.
[0136] The system in this application embodiment can be built based on CiteSpace, displaying graph information for some of the compounds, targets, and literature through a visual interface, as shown in Figure 12.
[0137] It should be noted that the above modules can be functional modules or program modules, and can be implemented through software or hardware. For modules implemented through hardware, the above modules can reside in the same processor; or the above modules can be located in different processors in any combination.
[0138] In addition, the big data quality control method described in conjunction with Figure 1 can be implemented using a supercomputer, whose deployment is shown in Figure 13, in order to process big data in parallel, accelerate computation, and improve efficiency. A supercomputer may include a processor and a memory storing computer program instructions. The memory can be used to store or cache various data files that need to be processed and / or used for communication, as well as computer program instructions to be executed by the processor. The processor reads and executes the computer program instructions stored in the memory to implement any of the big data quality control methods in the above embodiments. The supercomputer can execute the big data quality control method in the embodiments of this application based on the acquired source data, thereby implementing the big data quality control method described in conjunction with Figure 1.
[0139] Furthermore, in conjunction with the big data quality control methods in the above embodiments, this application embodiment can provide a computer-readable storage medium for implementation. This computer-readable storage medium stores computer program instructions; when these computer program instructions are executed by a processor, they implement any of the big data quality control methods in the above embodiments.
[0140] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0141] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.
Claims
1. A big data quality control method, characterized in that it comprises:
a source data acquisition step: obtaining source data from multiple preset data sources;
a knowledge graph construction step: constructing a knowledge graph based on the source data through knowledge extraction and data alignment, the knowledge graph including entity data, relation data, triple data, and subgraph data;
a quality control model construction step: configuring quality control parameters and constructing a quality evaluation dataset based on the quality control parameters and the knowledge graph; constructing a multimodal end-to-end big data quality control model; training the quality control model using the quality evaluation dataset; and using the trained quality control model to calculate and output data confidence values based on input entity data, subgraph data, and / or triple data; wherein the quality control parameters include: data comprehensiveness parameters, data timeliness parameters, data authenticity parameters, data correlation parameters, and data accuracy parameters;
wherein the quality control model construction step further comprises:
a data acquisition step: acquiring the subgraph data, triplet data, and entity data of the input knowledge graph, and calculating the corresponding quality control parameters;
a pre-trained model loading step: loading a graph pre-trained model to train on the subgraph data, and loading a text pre-trained model to train on the triplet data and entity data;
a data embedding step: performing feature extraction and embedding on the subgraph data, triplet data, and entity data respectively, in combination with the quality control parameters, and fusing the resulting subgraph vector, triplet vector, and entity vector to obtain a feature vector;
a model training step: training a network on the feature vector using a multi-head attention mechanism, wherein the trained network includes convolutional layers, pooling layers, fully connected layers, and a softmax classification network, which outputs data confidence values.
2. The big data quality control method according to claim 1, characterized in that the knowledge graph construction step further comprises: a knowledge extraction step: extracting multiple types of ontology from the source data and parsing the ontologies, extracting entity attributes from the ontologies, and extracting entity relationships based on predefined relationships between ontologies and relationships between entities within an ontology; and a data alignment step: aligning entities based on their entity IDs or entity identifier IDs.
3. The big data quality control method according to claim 1, characterized in that the source data acquisition step further comprises: a data content update step: updating the data from the data source, wherein the update methods include full update, incremental update through program parsing, and / or incremental update through incremental crawling.
4. A big data quality control system, characterized in that it comprises:
a source data acquisition module, used to acquire source data from multiple preset data sources;
a knowledge graph construction module, used to construct a knowledge graph based on the source data through knowledge extraction and data alignment, the knowledge graph including entity data, relation data, triple data, and subgraph data;
a quality control model construction module, used to configure quality control parameters and construct a quality evaluation dataset based on the quality control parameters and the knowledge graph, construct a multimodal end-to-end big data quality control model, and train the quality control model using the quality evaluation dataset, the trained quality control model being used to calculate and output data confidence values based on input entity data, subgraph data, and / or triple data; wherein the quality control parameters include: data comprehensiveness parameters, data timeliness parameters, data authenticity parameters, data correlation parameters, and data accuracy parameters;
wherein the quality control model construction module further comprises:
a data acquisition module, used to acquire the subgraph data, triple data, and entity data of the input knowledge graph, and calculate the corresponding quality control parameters;
a pre-trained model loading module, used to load a graph pre-trained model to train on the subgraph data, and to load a text pre-trained model to train on the triplet data and entity data;
a data embedding module, used to perform feature extraction and embedding on the subgraph data, triplet data, and entity data respectively, in combination with the quality control parameters, and to fuse the resulting subgraph vector, triplet vector, and entity vector to obtain a feature vector;
a model training module, used to train a network on the feature vector using a multi-head attention mechanism, wherein the trained network includes convolutional layers, pooling layers, fully connected layers, and a softmax classification network, which outputs data confidence values.
5. The big data quality control system according to claim 4, characterized in that the knowledge graph construction module further comprises: a knowledge extraction module, used to extract multiple types of ontology from the source data and parse the ontologies, extract entity attributes from the ontologies, and extract entity relationships based on predefined relationships between ontologies and relationships between entities within an ontology; and a data alignment module, used to perform entity alignment based on the entity ID or entity identifier ID of the entity.
6. The big data quality control system according to claim 5, characterized in that the source data acquisition module further comprises: a data content update module, used to update the data from the data source, wherein the update methods include full update, incremental update through program parsing, and / or incremental update through incremental crawling.
7. A supercomputer, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the big data quality control method as described in any one of claims 1 to 3.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the big data quality control method as described in any one of claims 1 to 3.