Method and system for autonomous modeling of domain knowledge graphs
By constructing an ontology model and an improved BERT model for knowledge mining, the accuracy and efficiency issues of autonomous knowledge graph modeling in the field of intelligent industrial manufacturing have been solved, enabling efficient knowledge graph construction and the reuse of enterprise knowledge.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SUN YAT SEN UNIV
- Filing Date
- 2024-03-27
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies lack effective methods for autonomous modeling of knowledge graphs in the field of intelligent manufacturing, making it impossible to accurately mine and reuse knowledge. Furthermore, existing methods are insufficient in terms of accuracy and efficiency.
A text corpus is constructed by acquiring unstructured text from the target domain, an ontology model is built and used to label the training data, and a BERT model is fine-tuned on the labeled data. An adaptive weighted dilated gated convolution unit and an average pooling layer are added to the model, and a domain knowledge graph is constructed in combination with a dual-pointer decoding method.
It enables the efficient and accurate construction of knowledge graphs in the field of intelligent manufacturing, improves the accuracy and efficiency of knowledge mining, and supports the reuse of enterprise knowledge and the realization of intelligent manufacturing.
Figure CN118194987B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a method and system for autonomous modeling of domain knowledge graphs.
Background Technology
[0002] Industrial intelligent manufacturing is one of the most important foundational industries in China's manufacturing sector and a crucial pillar of a manufacturing powerhouse. The integration of next-generation information technologies such as artificial intelligence and deep learning with traditional industrial manufacturing has given rise to intelligent manufacturing within this field. The field is characterized by numerous processes, long workflows, inter-process coupling, and a multitude of monitoring points, creating a dynamic, complex, and cross-domain enterprise production and operation environment. Years of production have accumulated a vast amount of multi-source, heterogeneous data that contains a wealth of reusable domain knowledge. With the development of intelligent manufacturing, knowledge-driven methods are playing an increasingly important role in industrial intelligent manufacturing enterprises, and the efficient and accurate reuse of existing mechanistic and experiential knowledge has become a significant requirement for these companies.
[0003] To date, with the rapid development of deep learning and natural language processing, the construction of knowledge graphs in the field of automation has been extensively studied. However, the autonomous modeling methods applicable to knowledge graphs in the field of intelligent manufacturing have limitations in both the vertical depth of theory and the horizontal breadth of application. Furthermore, there is no universal and reliable method for knowledge mining of data features and text semantic features in the field of intelligent manufacturing.
[0004] Summary of the Application
[0005] The main objective of this application is to propose a method and system for autonomous modeling of domain knowledge graphs, in order to solve at least one problem in the prior art. This application is capable of accurately performing autonomous modeling of domain knowledge graphs.
[0006] To achieve the above objectives, one aspect of this application proposes an autonomous modeling method for domain knowledge graphs, the method comprising:
[0007] Acquiring unstructured text from the target domain to construct a text corpus, and obtaining a training dataset from the text corpus.
[0008] An ontology model for the target domain is constructed based on a text corpus. The training dataset is then labeled based on the ontology model to obtain the model training dataset.
[0009] The pre-built BERT model is fine-tuned and trained based on the model training dataset to obtain a triplet knowledge mining model for the target domain.
[0010] Text sentences from a text corpus are loaded into a triplet knowledge mining model for knowledge mining, and a domain knowledge graph is constructed based on the results of the knowledge mining.
[0011] In some embodiments, acquiring unstructured text from the target domain to construct a text corpus and obtaining a training dataset from the text corpus includes:
[0012] Collecting unstructured industrial text within the target domain, where the industrial text includes process texts, raw material and production information, and equipment ledgers from the target enterprise's vertical domain;
[0013] Preprocessing the unstructured text into a collection of single sentences, which are then organized to obtain the text corpus;
[0014] Obtaining a portion of the text from the text corpus as the training dataset.
[0015] In some embodiments, the model training dataset includes a first training set labeled with existence relations and a second training set labeled with existence relations and entity location sequences; an ontology model for the target domain is constructed based on a text corpus, and the training dataset is labeled based on the ontology model to obtain the model training dataset, including:
[0016] Based on the target domain, define a conceptual schema layer;
[0017] Extract semantic features and abstract concepts from text corpora;
[0018] Based on semantic features and abstract concepts, the subject and object and their corresponding relationship are determined, and then the ontology model of the target domain is constructed by combining the schema layer.
[0019] Based on the definition results of the ontology model, the existence relation of a single sentence is labeled on the training text in the training dataset to obtain the first training set; then, the entity location sequence is labeled using the BIO annotation method to obtain the second training set.
[0020] In some embodiments, the model training dataset includes a first training set labeled with existence relations and a second training set labeled with existence relations and entity location sequences; the pre-built BERT model is fine-tuned based on the model training dataset to obtain a triplet knowledge mining model for the target domain, including:
[0021] A BERT model is pre-constructed based on semantic relation features extracted from the text corpus;
[0022] The BERT model is fine-tuned a first time based on the first training set to obtain a text relation classification model for the target domain;
[0023] The BERT model is fine-tuned a second time based on the second training set to obtain a sequence labeling model for the target domain;
[0024] The text relation classification model and the sequence labeling model are merged to construct a pipeline-based relation-entity triplet knowledge mining model for the target domain.
[0025] In some embodiments, a BERT model is pre-constructed based on semantic relation features extracted from a text corpus, including:
[0026] A bidirectional attention mechanism combined with residual connections is adopted, and an adaptive weighted dilated gated convolution unit and an average pooling layer are designed to pre-construct the BERT model.
[0027] In some embodiments, text sentences from a text corpus are loaded into a triplet knowledge mining model for knowledge mining, and a domain knowledge graph is constructed based on the results of the knowledge mining, including:
[0028] Text sentences from the text corpus are loaded into the triplet knowledge mining model for knowledge mining, and the triplet information contained in the unstructured text is output through a dual-pointer decoding method.
[0029] Based on triple information, entity alignment is performed through the hierarchical structure and semantic relations of the ontology model to obtain a set of entity relation triples in the target domain, thus completing the construction of the domain knowledge graph.
[0030] In some embodiments, the method further includes:
[0031] The domain knowledge graph is stored in a Neo4j graph database, and a retrieval and query mechanism is built based on Cypher statements.
[0032] To achieve the above objectives, another aspect of this application proposes an autonomous domain knowledge graph modeling system, the system comprising:
[0033] The first module is used to acquire unstructured text from the target domain to construct a text corpus, and to obtain a training dataset from the text corpus.
[0034] The second module is used to construct an ontology model for the target domain based on the text corpus, and to annotate the training dataset based on the ontology model to obtain the model training dataset.
[0035] The third module is used to fine-tune the pre-built BERT model based on the model training dataset to obtain a triplet knowledge mining model for the target domain.
[0036] The fourth module is used to load text statements from the text corpus into the triplet knowledge mining model for knowledge mining, and to construct a domain knowledge graph based on the results of knowledge mining.
[0037] In some embodiments, the system further includes:
[0038] The fifth module is used to store the domain knowledge graph in a Neo4j graph database and to build a retrieval and query mechanism based on Cypher statements.
[0039] To achieve the above objectives, another aspect of the embodiments of this application proposes an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the above-described method.
[0040] To achieve the above objectives, another aspect of the embodiments of this application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method.
[0041] The embodiments of this application include at least the following beneficial effects: This application provides a method and system for autonomous modeling of domain knowledge graphs. This scheme constructs a text corpus by acquiring unstructured text from the target domain and obtains a training dataset from the text corpus; it constructs an ontology model of the target domain based on the text corpus, and annotates the training dataset based on the ontology model to obtain a model training dataset; it fine-tunes a pre-built BERT model based on the model training dataset to obtain a triplet knowledge mining model for the target domain; it loads text statements from the text corpus into the triplet knowledge mining model for knowledge mining, and constructs a domain knowledge graph based on the results of knowledge mining. The embodiments of this application abstract multi-source heterogeneous data and text from the domain into a structured knowledge graph for knowledge modeling, providing reliable assistance for enterprises' knowledge-enabled intelligent manufacturing while improving the accuracy and efficiency of domain knowledge graph modeling.
Description of the Drawings
[0042] Figure 1 is a flowchart of the autonomous modeling method for domain knowledge graphs provided in the embodiments of this application;
[0043] Figure 2 is a schematic diagram of the overall process of the autonomous modeling method for domain knowledge graphs provided in the embodiments of this application;
[0044] Figure 3 is a schematic diagram of the architecture principle of autonomous modeling of domain knowledge graphs provided in the embodiments of this application;
[0045] Figure 4 is a schematic diagram illustrating an example of the semantic entity type design of the ontology model provided in the embodiments of this application;
[0046] Figure 5 is a schematic diagram illustrating an example of the semantic relation type design of the ontology model provided in the embodiments of this application;
[0047] Figure 6 is a schematic diagram of the structure of the improved BERT model for domain knowledge graphs provided in the embodiments of this application;
[0048] Figure 7 is a schematic diagram of the "joint training-step prediction" knowledge graph autonomous modeling mechanism provided in the embodiments of this application;
[0049] Figure 8 is a schematic diagram of the structure of the domain knowledge graph autonomous modeling system provided in the embodiments of this application;
[0050] Figure 9 is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application.
Detailed Implementation
[0051] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit it. In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those of this application; they are merely examples of apparatuses and methods consistent with some aspects of the embodiments of this application as detailed in the appended claims.
[0052] It is understood that the terms "first," "second," etc., used in this application may describe various concepts, but unless otherwise stated, these concepts are not limited by these terms. The terms are only used to distinguish one concept from another. For example, without departing from the scope of the embodiments of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "at the time of," "when," or "in response to a determination."
[0053] As used in this application, "at least one" includes one, two, or more; "multiple" includes two or more; "each" refers to every one of the corresponding plurality; and "any" refers to any one of the plurality.
[0054] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.
[0055] The domain knowledge graph autonomous modeling method provided in this application relates to the field of data processing technology. This method can be applied to terminals, servers, or software running on either a terminal or a server. In some embodiments, the terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, or in-vehicle terminal, but is not limited to these. The server can be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The server can also be a node server in a blockchain network. The software can be an application implementing the domain knowledge graph autonomous modeling method, but is not limited to the above forms.
[0056] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0057] Figure 1 is an optional flowchart of the autonomous modeling method for domain knowledge graphs provided in the embodiments of this application. The method in Figure 1 may include, but is not limited to, steps S100 to S400.
[0058] S100. Obtain unstructured text from the target domain to construct a text corpus, and obtain the training dataset from the text corpus;
[0059] It should be noted that in some embodiments, step S100 may include: collecting unstructured text of industrial texts in the target domain; industrial texts include process texts, raw material and production information and equipment ledgers of the target enterprise's vertical domain; preprocessing the unstructured texts into a collection of single sentences, and then organizing them to obtain a text corpus; and obtaining a portion of the texts from the text corpus as a training dataset.
[0060] For example, in some specific implementations, the data collection and preprocessing steps are performed first: industrial texts from the vertical field of intelligent manufacturing enterprises are collected to form a text corpus, and a portion of the texts are selected as training datasets.
[0061] Specifically, preprocessing includes manually judging the accuracy and completeness of the industrial text statements, checking each statement for logical consistency and screening out typos, and unifying the storage format of all statements into CSV.
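The automated part of this preprocessing can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the sentence-boundary rule, the `sentence` column name, and the helper names `split_sentences` and `build_corpus_csv` are hypothetical, and the manual accuracy and typo checks are not modeled.

```python
import csv
import io
import re

# Assumed boundary rule: split after Chinese end punctuation or a Western
# "!", "?" or ";", and after a Western full stop only when whitespace
# follows, so that decimals such as "3.5" stay intact.
_SENTENCE_END = re.compile(r"(?<=[。！？；!?;])\s*|(?<=\.)\s+")


def split_sentences(text: str) -> list[str]:
    """Split raw text into trimmed, non-empty single sentences."""
    return [part.strip() for part in _SENTENCE_END.split(text) if part.strip()]


def build_corpus_csv(documents: list[str]) -> str:
    """Return CSV text holding one de-duplicated sentence per row."""
    seen: set[str] = set()
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["sentence"])  # hypothetical column name
    for document in documents:
        for sentence in split_sentences(document):
            if sentence not in seen:
                seen.add(sentence)
                writer.writerow([sentence])
    return buffer.getvalue()
```

Writing to an in-memory buffer keeps the sketch free of file-system assumptions; a real corpus would be written to a CSV file instead.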
[0062] S200. Construct an ontology model for the target domain based on the text corpus, and annotate the training dataset based on the ontology model to obtain the model training dataset.
[0063] It should be noted that the model training dataset includes a first training set labeled with existence relations and a second training set labeled with existence relations and entity location sequences. In some embodiments, step S200 may include: defining a conceptual schema layer based on the target domain; extracting semantic features and abstract concepts from a text corpus; determining the subject and object and their corresponding relationships based on the semantic features and abstract concepts, and then constructing an ontology model for the target domain in conjunction with the schema layer; labeling the training text in the training dataset with single-sentence existence relations based on the definition results of the ontology model to obtain the first training set; and then labeling entity location sequences using the BIO annotation method to obtain the second training set.
[0064] For example, in some specific implementations, the ontology modeling and training data annotation steps are as follows: analyze the semantic features and abstract concepts of text in the text corpus, construct an ontology model in the field of intelligent industrial manufacturing, define a conceptualized domain schema layer, and, guided by the constructed domain ontology model, perform relation existence annotation and BIO entity sequence annotation on the training dataset to form the model training dataset.
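The two annotation passes can be illustrated with a small sketch. The tag names `B-SUB`, `I-SUB`, `B-OBJ`, `I-OBJ`, and `O`, the record layout, and the example sentence are assumptions chosen for illustration; the source only states that relation existence is labeled first and entity location sequences are then labeled with the BIO method. Overlapping subject and object spans are not handled in this simplified form.

```python
def bio_tags(tokens, subject_span, object_span):
    """Tag tokens with BIO labels; spans are (start, end) with end exclusive."""
    tags = ["O"] * len(tokens)
    for (start, end), role in ((subject_span, "SUB"), (object_span, "OBJ")):
        if not 0 <= start < end <= len(tokens):
            raise ValueError(f"invalid {role} span: {(start, end)}")
        tags[start] = f"B-{role}"  # first token of the entity
        for index in range(start + 1, end):
            tags[index] = f"I-{role}"  # remaining tokens of the entity
    return tags


def make_examples(tokens, relation, subject_span, object_span):
    """Build one record for each of the two training sets."""
    first = {"tokens": tokens, "relation": relation}  # relation existence only
    second = {**first, "tags": bio_tags(tokens, subject_span, object_span)}
    return first, second


# Hypothetical sentence and relation, not taken from the source corpus.
tokens = ["the", "roasting", "furnace", "feeds", "the", "acid", "plant"]
first_record, second_record = make_examples(tokens, "feeds", (1, 3), (5, 7))
```

Here `first_record` corresponds to the first training set (relation existence) and `second_record` to the second (relation plus entity location sequence).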
[0065] Specifically, the ontology modeling process in the field of intelligent industrial manufacturing adopts a top-down approach and uses the Protégé modeling tool to define the domain ontology model in the OWL language, conceptually and formally defining the schema layer of the intelligent industrial manufacturing process.
[0066] S300. Based on the model training dataset, the pre-constructed BERT model is fine-tuned and trained to obtain a triplet knowledge mining model for the target domain.
[0067] It should be noted that the model training dataset includes a first training set labeled with existence relations and a second training set labeled with existence relations and entity location sequences; in some embodiments, step S300 may include: pre-constructing a BERT model based on semantic relation features extracted from a text corpus; performing a first fine-tuning training on the BERT model based on the first training set to obtain a text relation classification model for the target domain; performing a second fine-tuning training on the BERT model based on the second training set to obtain a sequence labeling model for the target domain; merging the text relation classification model and the sequence labeling model to build a pipeline-based triplet knowledge mining model for the target domain of relation-entity relationships.
[0068] In some embodiments, pre-constructing the BERT model based on semantic relation features extracted from the text corpus includes: using a bidirectional attention mechanism combined with residual connections, and designing adaptive weighted dilated gated convolution units and an average pooling layer to pre-construct the BERT model.
[0069] For example, in some specific implementations, the training steps of the knowledge mining model in the field of intelligent manufacturing are as follows: analyze the semantic relationship features of industrial texts in the field of intelligent manufacturing, and improve the BERT model structure based on the semantic relationship features to make it more suitable for deep knowledge mining of domain texts.
[0070] Using text labeled with existence relationships as the training set, the improved BERT model was fine-tuned and trained to obtain a text relationship classification model in the field of intelligent industrial manufacturing.
[0071] Using entity location sequence text labeled with existence relations and BIO annotations as the training set, the improved BERT model was fine-tuned and trained to obtain a sequence labeling model in the field of intelligent manufacturing.
[0072] The text relation classification model and the sequence labeling model for the field of intelligent manufacturing are combined to build a pipeline-type relation-entity triplet knowledge mining model (R-EOTEM) oriented to intelligent manufacturing.
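[0073] Specifically, the base model is the BERT model, whose core is the multi-head attention mechanism. Its mathematical expression is as follows:
[0074]
$$Q_i = XW_i^{Q}, \quad K_i = XW_i^{K}, \quad V_i = XW_i^{V}$$
$$\mathrm{head}_i = \mathrm{softmax}\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i$$
$$\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^{O}$$
[0075] In the formula, $Q_i$, $K_i$, and $V_i$ are the results of linear transformations of the sequence input $X$ and represent the query, key, and value, respectively; $d_k$ is the dimension of an attention head; $h$ is the number of attention heads; $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the weight parameters of the $i$-th attention head; and $W^{O}$ is the output projection applied to the concatenated heads.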
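For readers who prefer code, the standard multi-head attention computation can be sketched in plain Python. This follows the textbook scaled dot-product formulation rather than any implementation detail disclosed in the application; matrices are lists of rows, and the function names are illustrative.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_attention(x, heads, w_o):
    """heads is a list of (W_q, W_k, W_v) projections, one tuple per head."""
    head_outputs = [
        attention(matmul(x, w_q), matmul(x, w_k), matmul(x, w_v))[0]
        for w_q, w_k, w_v in heads
    ]
    # Concatenate the per-head outputs position by position.
    concatenated = [
        [value for out in head_outputs for value in out[pos]]
        for pos in range(len(x))
    ]
    return matmul(concatenated, w_o)
```

Each row of the attention weights sums to one, so every output position is a weighted average of the value vectors.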
[0076] The BERT model employs a bidirectional attention mechanism. After the attention mechanism, the representation at each position passes through a feedforward neural network, and residual connections are used on the output of each sub-layer to prevent the gradient vanishing problem. Its mathematical expression is as follows:
[0077]
[0078] In the formula, the terms are the result of self-attention over the past (preceding) context, the result of self-attention over the future (following) context, and the variable weighting coefficients that combine the two.
[0079] This application analyzes the semantic relationship features of industrial texts in the field of intelligent manufacturing and improves the basic BERT model structure by adding adaptive weighted dilated gated convolution units and an average pooling layer on top of the basic BERT model, in order to mine long entity information and overlapping relationships in industrial texts. Its mathematical expression is as follows:
[0080]
[0081]
[0082]
[0083] In the formula, the terms denote, in order: the element at a given index of the output sequence; the element of the input sequence at the corresponding dilated position; the dilation rate; and the learnable adaptive convolution weights. These weights are learned by a fully connected layer, defined by its weight matrix and bias vector, and a Sigmoid activation function keeps the weights within a suitable range. The final output is obtained by passing the convolution result through the average pooling layer.
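One way to read this description is sketched below. It is an interpretation under explicit assumptions, since the formulas themselves are not reproduced here: the sequence is one-dimensional, out-of-range positions are zero-padded, the kernel is centered, the fully connected layer maps each dilated window to one weight per kernel tap, and the gating path is not modeled. All function and parameter names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def adaptive_dilated_conv(x, rate, fc_weights, fc_bias):
    """Dilated 1-D convolution whose tap weights are predicted per position.

    fc_weights is a K x K matrix and fc_bias a length-K vector (assumed
    shapes); K is the kernel size and `rate` the dilation rate.
    """
    kernel = len(fc_bias)
    half = kernel // 2
    output = []
    for i in range(len(x)):
        positions = [i + rate * (k - half) for k in range(kernel)]
        window = [x[p] if 0 <= p < len(x) else 0.0 for p in positions]
        # Fully connected layer followed by Sigmoid keeps weights in (0, 1).
        taps = [
            sigmoid(sum(w * v for w, v in zip(fc_weights[k], window)) + fc_bias[k])
            for k in range(kernel)
        ]
        output.append(sum(t * v for t, v in zip(taps, window)))
    return output


def average_pool(y, size):
    """Non-overlapping average pooling; the last window may be shorter."""
    return [sum(y[i:i + size]) / len(y[i:i + size]) for i in range(0, len(y), size)]
```

A larger dilation rate widens the receptive field without adding taps, which is the usual motivation for dilated convolutions when entities span many tokens.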
[0084] The text relation classification model and the sequence labeling model for the field of industrial intelligent manufacturing are trained with the following loss function:
[0085]
[0086] In the formula, a boundary indicator function is used, and the loss function is designed as a combination of this boundary indicator function and the original cross-entropy function.
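Because the exact formula is not reproduced here, the following sketch shows only one plausible way to combine a boundary indicator with cross-entropy: tokens flagged as entity boundaries have their cross-entropy term up-weighted. The multiplicative form, the `weight` parameter, and the function name are assumptions and should not be read as the loss defined in the application.

```python
import math


def boundary_weighted_cross_entropy(probs, gold, boundary, weight=1.0):
    """Mean over tokens of (1 + weight * I_t) * CE_t.

    probs[t][c] is the predicted probability of class c at token t,
    gold[t] is the true class index, and boundary[t] is 1 when token t
    lies on an entity boundary, else 0 (the indicator I_t).
    """
    if not gold or not (len(probs) == len(gold) == len(boundary)):
        raise ValueError("inputs must be non-empty and equally long")
    total = 0.0
    for token_probs, label, is_boundary in zip(probs, gold, boundary):
        cross_entropy = -math.log(max(token_probs[label], 1e-12))
        total += (1.0 + weight * is_boundary) * cross_entropy
    return total / len(gold)
```

With every indicator set to zero the function reduces to ordinary mean cross-entropy, so the boundary term acts purely as an added penalty on boundary mistakes.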
[0087] S400. Load the text statements in the text corpus into the triplet knowledge mining model for knowledge mining, and construct a domain knowledge graph based on the results of knowledge mining.
[0088] It should be noted that in some embodiments, step S400 may include: loading text statements from the text corpus into a triplet knowledge mining model for knowledge mining, outputting triplet information contained in the unstructured text through a two-pointer decoding method; based on the triplet information, performing entity alignment through the hierarchical structure and semantic relations of the ontology model to obtain a set of entity relation triplets in the target domain, thereby completing the construction of the domain knowledge graph.
[0089] For example, in some specific embodiments, the steps for constructing a knowledge graph in the field of intelligent industrial manufacturing are as follows: loading the text statements in the obtained text corpus of the vertical domain of intelligent industrial manufacturing enterprises into the trained triplet knowledge mining model for prediction calculation.
[0090] The correct triplet (SPO) information contained in the industrial intelligent manufacturing text is output through a joint dual-pointer decoding method;
[0091] Based on ontology model information, entity alignment is performed through the hierarchical structure and semantic relationships of the ontology to obtain a set of entity relationship triples in the field of intelligent manufacturing, thus completing the construction of the knowledge graph.
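A minimal sketch of the alignment step, assuming the ontology is available as three lookup tables: an alias table mapping surface forms to canonical entity names, a table of entity classes, and a table giving the subject and object class each relation expects. These tables, the example entities, and the function name are hypothetical, and the class hierarchy is flattened for brevity.

```python
def align_triples(triples, aliases, entity_types, signatures):
    """Normalize entity names and keep triples that fit the ontology.

    aliases:      surface form -> canonical entity name
    entity_types: canonical entity name -> ontology class
    signatures:   relation -> (subject class, object class)
    """
    aligned = set()
    for subject, relation, obj in triples:
        subject = aliases.get(subject, subject)
        obj = aliases.get(obj, obj)
        actual = (entity_types.get(subject), entity_types.get(obj))
        # Unknown relations map to None and are therefore dropped.
        if signatures.get(relation) == actual:
            aligned.add((subject, relation, obj))
    return aligned


# Hypothetical mined triples and ontology tables.
raw = [
    ("roaster", "feeds", "acid plant"),
    ("roasting furnace", "feeds", "acid plant"),
    ("acid plant", "feeds", "roasting furnace"),
]
graph = align_triples(
    raw,
    aliases={"roaster": "roasting furnace"},
    entity_types={"roasting furnace": "Equipment", "acid plant": "Workshop"},
    signatures={"feeds": ("Equipment", "Workshop")},
)
```

The first two triples collapse into one after alias normalization, and the third is rejected because its subject and object classes do not match the relation's signature.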
[0092] Specifically, the pipeline-based relation-entity triplet knowledge mining model (R-EOTEM) for the industrial intelligent manufacturing domain uses a "parallel training-step prediction" modeling mechanism. During training, the labeled training data is used to train the text relation classification model and the sequence labeling model synchronously and in parallel. During prediction, unlabeled industrial text is first fed, pipeline-style, into the text relation classification model to predict the relations that exist in each statement. Each predicted relation is then fused with its statement and input into the sequence labeling model to predict the triplet (SPO) knowledge in the statement.
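The step-by-step prediction flow can be sketched with two callables standing in for the fine-tuned models. The toy classifier and tagger below are keyword-based placeholders invented for illustration; only the control flow (classify relations first, then tag entities per predicted relation) reflects the description.

```python
from typing import Callable

# Stand-in signatures for the two fine-tuned models (assumed interfaces).
RelationClassifier = Callable[[str], list[str]]
SequenceTagger = Callable[[str, str], list[tuple[str, str]]]


def mine_triples(sentence: str, classify: RelationClassifier, tag: SequenceTagger):
    """Predict relations, then extract (subject, object) pairs per relation."""
    triples = []
    for relation in classify(sentence):
        # The predicted relation is passed along with the sentence,
        # mirroring the fusion of relation and statement.
        for subject, obj in tag(sentence, relation):
            triples.append((subject, relation, obj))
    return triples


def toy_classifier(sentence: str) -> list[str]:
    """Placeholder: detects a single hard-coded relation keyword."""
    return ["feeds"] if " feeds " in sentence else []


def toy_tagger(sentence: str, relation: str) -> list[tuple[str, str]]:
    """Placeholder: splits the sentence around the relation keyword."""
    left, _, right = sentence.partition(f" {relation} ")
    return [(left.strip(), right.strip(" ."))] if right else []


result = mine_triples("roasting furnace feeds acid plant.", toy_classifier, toy_tagger)
```

Keeping the two stages behind plain callables means either model can be replaced without touching the pipeline logic.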
[0093] In the prediction decoding process, a joint dual-pointer decoding output is designed to predict five output probabilities for the triplet (SPO) information corresponding to each word segment. The output result labeling is based on the BIO annotation method.
[0094]
$$P = \left[\,P_{\text{B-SUB}},\ P_{\text{I-SUB}},\ P_{\text{B-OBJ}},\ P_{\text{I-OBJ}},\ P_{\text{O}}\,\right]$$
[0095] where $P_{\text{B-SUB}}$ is the probability that the word segment begins the subject entity of a triplet; $P_{\text{I-SUB}}$ is the probability that the word segment lies inside the subject entity; $P_{\text{B-OBJ}}$ is the probability that the word segment begins the object entity of a triplet; $P_{\text{I-OBJ}}$ is the probability that the word segment lies inside the object entity; and $P_{\text{O}}$ is the probability that the word segment is not part of any entity.
[0096] The formula for calculating the prediction probability of triplet (SPO) information uses the Sigmoid output, and the output probabilities are not mutually exclusive. Its mathematical expression is as follows:
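[0097]
$$P_i = \sigma\left(W_i x + b_i\right)$$
[0098] In the formula, $\sigma$ is the Sigmoid function, $x$ is the input feature vector, $W_i$ is the weight matrix associated with $P_i$, and $b_i$ is the bias vector associated with $P_i$.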
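The Sigmoid output and a simplified span decoder can be sketched as follows. The label names, the 0.5 threshold, and the rule that a begin label opens a span which following inside labels extend are assumptions; the application describes the decoder only as a joint dual-pointer method with BIO-style output labels. Subject and object spans are decoded independently, which is possible because the Sigmoid outputs are not mutually exclusive.

```python
import math

LABELS = ("B-SUB", "I-SUB", "B-OBJ", "I-OBJ", "O")


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def label_probabilities(x, weights, biases):
    """Independent Sigmoid probability for each label: sigmoid(w . x + b)."""
    return {
        label: sigmoid(sum(w * v for w, v in zip(weights[label], x)) + biases[label])
        for label in LABELS
    }


def decode_spans(tokens, token_probs, threshold=0.5):
    """Collect subject and object spans from per-token label probabilities."""
    spans = {"SUB": [], "OBJ": []}
    for role in spans:
        i = 0
        while i < len(tokens):
            if token_probs[i][f"B-{role}"] >= threshold:
                j = i + 1
                # Extend while tokens look like "inside" and not a new begin.
                while (
                    j < len(tokens)
                    and token_probs[j][f"I-{role}"] >= threshold
                    and token_probs[j][f"B-{role}"] < threshold
                ):
                    j += 1
                spans[role].append(" ".join(tokens[i:j]))
                i = j
            else:
                i += 1
    return spans
```

Pairing each decoded subject with each decoded object under the relation predicted for the sentence then yields candidate SPO triples.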
[0099] In some embodiments, the method may further include: storing the domain knowledge graph in a Neo4j graph database and constructing a retrieval and query mechanism based on Cypher statements.
[0100] For example, in some specific embodiments, the steps for storing and applying knowledge graphs in the field of intelligent industrial manufacturing are as follows: the constructed structured knowledge graph of intelligent industrial manufacturing is stored in a Neo4j graph database, and a retrieval and query mechanism is created based on Cypher statements to realize in-depth retrieval and reuse of knowledge in the field of intelligent industrial manufacturing.
[0101] Specifically, the knowledge graph in the field of intelligent manufacturing is stored in the Neo4j graph database, and Cypher query statements are used to perform deep retrieval and knowledge reuse on the knowledge graph.
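As an illustration of the storage and query step, the sketch below builds parameterized Cypher statements from mined triples without connecting to a database. The `Entity` label, the `name` property, and the helper names are assumptions, not part of the application. Because Cypher does not allow relationship types to be passed as parameters, the relation name is embedded as a backtick-quoted identifier.

```python
def quote_identifier(name: str) -> str:
    """Backtick-quote a Cypher identifier, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def merge_statement(subject: str, relation: str, obj: str):
    """Return (query, parameters) that upserts one SPO triple."""
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[:{quote_identifier(relation)}]->(o)"
    )
    return query, {"subject": subject, "object": obj}


# Retrieval query: all outgoing relations of one named entity.
NEIGHBOR_QUERY = (
    "MATCH (s:Entity {name: $name})-[r]->(o:Entity) "
    "RETURN type(r) AS relation, o.name AS object"
)

query, params = merge_statement("roasting furnace", "feeds", "acid plant")
```

Using `MERGE` rather than `CREATE` keeps repeated loads from duplicating nodes or relationships. The resulting query string and parameter dictionary could then be passed to a Neo4j client session for execution.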
[0102] To explain in detail the principles of the technical solution of this application, the overall process of this application will be described below with reference to some specific embodiments. It is easy to understand that the following is an explanation of the technical principles of this application and should not be regarded as a limitation of this application.
[0103] To make the objectives, technical solutions, and advantages of the present invention clearer, the method architecture and technical solutions of the embodiments are described clearly and completely below with reference to Figure 2, the flowchart of the autonomous modeling method for knowledge graphs in the field of intelligent industrial manufacturing. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. In general, the components of the process and architecture described and shown in Figures 2 to 3 can be arranged and designed in different configurations, and the technical solution steps and execution results described in Figures 4 to 7 can be adjusted for different embodiments.
[0104] First, it should be clarified that the autonomous modeling method for knowledge graphs in the field of industrial intelligent manufacturing uses the vast amounts of data accumulated by industrial intelligent manufacturing enterprises over many years as its data source. It abstracts multi-source heterogeneous data into structured knowledge, performs knowledge modeling, and stores the result in a knowledge base, effectively addressing the accurate and efficient reuse of existing multi-source heterogeneous knowledge. Currently, most domain knowledge graph modeling methods rely on manual construction by experts. While this ensures the quality of the knowledge graph, the growing scale and volume of knowledge graphs lead to significant consumption of human and material resources. Furthermore, deep learning-based knowledge graph modeling methods suffer from insufficient accuracy in the non-ferrous metals smelting field. Therefore, researching an autonomous modeling method for knowledge graphs in the non-ferrous metals field can solve the problem of accurate modeling of domain knowledge graphs while improving modeling efficiency, and thus has significant theoretical research value.
[0105] As shown in Figures 2 to 3, this embodiment presents an autonomous modeling method for knowledge graphs in the field of intelligent industrial manufacturing, including the following steps:
[0106] S1. Data Acquisition and Preprocessing Steps: Collect multi-source heterogeneous data covering domain knowledge and perform preprocessing operations, specifically including the following steps:
[0107] S101: This embodiment collects multi-source heterogeneous industrial texts covering the vertical field of industrial intelligent manufacturing enterprises, including process specifications, raw material information and production information, equipment ledgers, product information, test results, etc. These texts contain a large amount of unstructured knowledge.
[0108] S102: In this embodiment, the data preprocessing operation includes manual judgment of the accuracy and completeness of the sentences to avoid erroneous descriptions and information omissions in the text data, and screening for typos to ensure data quality. At the same time, all files in different storage formats such as txt, excel, PDF, XML, etc. are converted to csv format for storage to establish a unified text corpus.
[0109] S103: In this embodiment, the interpolation method is used to select a set of 1500 single sentences from the text corpus as the training text for the model.
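Steps S101 to S103 can be sketched as follows. This is a minimal illustration under stated assumptions, not the embodiment's implementation: the function names `split_sentences`, `build_corpus`, `corpus_to_csv`, and `sample_evenly` are hypothetical, the de-duplication rule only stands in for the manual quality checks of S102, and reading the "interpolation method" of S103 as evenly spaced selection is an assumption.

```python
import csv
import io
import re

def split_sentences(text):
    """Split raw text into single sentences on Chinese or Western end punctuation."""
    parts = re.findall(r"[^。！？.!?]+[。！？.!?]?", text)
    return [p.strip() for p in parts if p.strip()]

def build_corpus(documents):
    """Flatten documents into a de-duplicated list of single sentences."""
    seen, corpus = set(), []
    for doc in documents:
        for sent in split_sentences(doc):
            if sent not in seen:
                seen.add(sent)
                corpus.append(sent)
    return corpus

def corpus_to_csv(corpus):
    """Serialize the corpus as CSV text with one sentence per row."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["id", "sentence"])
    for i, sent in enumerate(corpus):
        writer.writerow([i, sent])
    return buf.getvalue()

def sample_evenly(corpus, n):
    """Pick n sentences at evenly spaced positions (assumed reading of S103)."""
    if n <= 0:
        return []
    if n >= len(corpus):
        return list(corpus)
    step = len(corpus) / n
    return [corpus[int(i * step)] for i in range(n)]

docs = [
    "Zinc is leached. Zinc is purified. Zinc is leached.",
    "Slag is removed! Ore is ground.",
]
corpus = build_corpus(docs)
train = sample_evenly(corpus, 2)
```

In the embodiment the corresponding call would select 1500 sentences; the value 2 here only keeps the example small.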
[0110] S2. Ontology Modeling and Training Data Labeling Steps: Construct an ontology model for the field of intelligent industrial manufacturing and label the training data, specifically including the following steps:
[0111] S201: Analyze the semantic features and abstract concepts of industrial texts in the text corpus, construct the schema layer of the industrial intelligent manufacturing field in a "top-down" manner, and define the relation P and its corresponding subject S and object O.
[0112] Figure 4 shows the semantic entity type design of the ontology model for the field of intelligent industrial manufacturing constructed in this embodiment, and Figure 5 shows the semantic relation type design of that ontology model;
[0113] S202: In this embodiment, based on the ontology model definition, each training sentence selected in step S103 is labeled with the relation that exists in that single sentence. In the labeling result, the sentence and its relation label are separated by a tab character ('\t'), for example: [CLS] Zinc sulfide concentrate reacts with sulfuric acid in the added waste electrolyte under a certain oxygen pressure, ... \t has_oxidation_reaction [SEP];
[0114] S203: In this embodiment, the entity location sequence in the training text is labeled using the BIO annotation method to form the model training dataset. Example of annotation results:
[0115] [CLS] Zinc sulfide concentrate reacts with sulfuric acid from the added waste electrolyte under a certain oxygen pressure… [SEP]
[0116] O B-SUB I-SUB I-SUB I-SUB I-SUB O O O O B-OBJ I-OBJ I-OBJ I-OBJ … … O O.
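The BIO labeling of S203 can be illustrated with a short sketch. The helper `bio_tags` and its span convention (start inclusive, end exclusive) are hypothetical, and word-level English tokens are used only for readability; the embodiment tokenizes Chinese text into single characters.

```python
def bio_tags(num_tokens, subject_span, object_span):
    """Return BIO labels for a token sequence; spans are (start, end), end exclusive."""
    tags = ["O"] * num_tokens
    for (start, end), name in ((subject_span, "SUB"), (object_span, "OBJ")):
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"invalid {name} span: {(start, end)}")
        tags[start] = f"B-{name}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{name}"          # remaining tokens of the entity
    return tags

tokens = ["[CLS]", "zinc", "sulfide", "concentrate", "reacts",
          "with", "sulfuric", "acid", "[SEP]"]
tags = bio_tags(len(tokens), subject_span=(1, 4), object_span=(6, 8))
```

Every token outside the two spans, including the [CLS] and [SEP] markers, keeps the label O, which matches the pattern of the annotation example above.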
[0117] S3. Training steps for the triplet knowledge mining model: Based on the labeled training dataset, load the improved BERT model for fine-tuning and training to obtain the relation-entity-oriented triplet knowledge mining model, which includes the following steps:
[0118] S301: In this embodiment, the improved BERT model architecture is shown in Figure 6. A one-dimensional adaptive dilated gate convolutional gating unit is added to the original BERT model. At the same time, the original knowledge mining model architecture is changed, and a multi-layer dual-channel model is built in parallel for knowledge mining in the field of industrial intelligent manufacturing.
[0119] S302 & S303: Load the labeled training dataset from step S2 into the improved BERT model. First, the labeled data is segmented into word segments and then vectorized to form word embeddings, segment embeddings, position embeddings, token label vectors, and predicate label vectors. After feature fusion, the data is loaded into the BERT model and a one-dimensional adaptive dilated gate convolutional gating unit for model training. The dual-channel training simultaneously yields a text relationship classification model and a sequence labeling model for the field of intelligent manufacturing.
[0120] S304: In this embodiment, the two trained models are merged to construct a pipeline-type, relation-entity-oriented triplet knowledge mining model, which is used for knowledge mining of unlabeled text in the text corpus.
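The merged pipeline of S304 can be sketched as a relation classifier followed by an entity labeler that runs once per detected relation. This is only an illustration: `mine_triplets`, `toy_classifier`, and `toy_labeler` are hypothetical names, and the two toy functions are string-matching stand-ins for the trained text relationship classification model and sequence labeling model.

```python
def mine_triplets(sentence, classify_relations, label_entities):
    """Detect which relations exist, then label subject and object per relation."""
    triplets = []
    for relation in classify_relations(sentence):
        subject, obj = label_entities(sentence, relation)
        if subject and obj:
            triplets.append((subject, relation, obj))
    return triplets

def toy_classifier(sentence):
    """Stand-in for the relation classification model (assumption)."""
    return ["has_oxidation_reaction"] if " reacts with " in sentence else []

def toy_labeler(sentence, relation):
    """Stand-in for the sequence labeling model (assumption)."""
    left, _, right = sentence.partition(" reacts with ")
    return left.strip(), right.strip(" .")

result = mine_triplets(
    "Zinc sulfide concentrate reacts with sulfuric acid.",
    toy_classifier,
    toy_labeler,
)
```

Because the labeler is conditioned on one relation at a time, a sentence that carries several relations yields one labeling pass per relation, which is how a relation-first pipeline can separate overlapping triplets.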
[0121] S4. Steps for constructing a knowledge graph in the field of intelligent industrial manufacturing: Using the trained knowledge mining model, extract triplet knowledge information from unlabeled text to construct a knowledge graph for the field of intelligent industrial manufacturing. This embodiment employs the "parallel training, stepwise prediction" modeling mechanism shown in Figure 7, which includes the following steps:
[0122] S401: In this embodiment, unlabeled text data is loaded into the trained triplet knowledge mining model. First, the text is tokenized; Chinese text is split into single-character tokens. The tokens are loaded into the BERT model for vectorization, and the vectorized results are then passed to the trained improved BERT model for feature extraction and prediction. The calculation process is as follows:
[0123] S4011: Attention Calculation
[0124]
[0125] In the formula, $Q$, $K$, and $V$ are the query, key, and value, respectively, obtained by linear transformation of the vectorized token sequence input in this embodiment; $d_k$ is the dimension of each attention head; $h$ is the number of attention heads; and $W_i$ denotes the weight parameters of the $i$-th attention head;
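The formula of S4011 is not reproduced in this text. As a hedged reconstruction, assuming the embodiment uses the standard scaled dot-product multi-head attention of the original BERT model, the quantities described above (query, key, value, head dimension, number of heads, per-head weights) would fit the following form. The symbol names, the per-head projections $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and the output projection $W^{O}$ are assumptions taken from the standard formulation, not from the source.

```latex
\begin{aligned}
\mathrm{Attention}(Q,K,V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \\
\mathrm{head}_i &= \mathrm{Attention}\!\left(QW_i^{Q},\,KW_i^{K},\,VW_i^{V}\right),\quad i=1,\dots,h \\
\mathrm{MultiHead}(Q,K,V) &= \mathrm{Concat}(\mathrm{head}_1,\dots,\mathrm{head}_h)\,W^{O}
\end{aligned}
```

Dividing by $\sqrt{d_k}$ keeps the dot products from growing with the head dimension, which would otherwise push the softmax into regions with very small gradients.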
[0126] S4012: Bidirectional Attention Calculation and Residual Connection:
[0127]
[0128] In the formula, the two attention terms are the self-attention result of the encoding process and the self-attention result of the decoding process of the BERT model in this embodiment, and the coefficients applied to them are variable weighting coefficients;
[0129] S4013: Adaptive Weighted Dilated Gate Convolution and Average Pooling:
[0130]
[0131]
[0132]
[0133] In the formulas, the output term is one element of the output sequence, and the input term is the element of the input sequence at the position centered on that output element; the dilation rate sets the spacing between the sampled input elements. The convolution weights are learnable and adaptive, and they are learned by a fully connected layer with its own weight matrix and bias vector. A Sigmoid activation function keeps the weights within a suitable range. The final output is obtained by passing the convolution result through the average pooling layer.
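The three formulas of S4013 are not reproduced in this text, so the following is only a sketch consistent with the description above, not the embodiment's exact computation. It assumes a scalar input sequence, zero padding at the edges, adaptive weights computed as the Sigmoid of a fully connected layer applied to the dilated window, and non-overlapping average pooling. All function and parameter names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dilated_window(x, i, kernel_size, dilation):
    """Elements of x centered on i and spaced by `dilation`; zero-padded at the edges."""
    half = kernel_size // 2
    idx = [i + dilation * (k - half) for k in range(kernel_size)]
    return [x[j] if 0 <= j < len(x) else 0.0 for j in idx]

def adaptive_dilated_conv(x, fc_weight, fc_bias, dilation):
    """y_i = sum_k w_k * window_k, with w = sigmoid(fc_weight @ window + fc_bias)."""
    kernel_size = len(fc_bias)
    y = []
    for i in range(len(x)):
        window = dilated_window(x, i, kernel_size, dilation)
        # Adaptive weights: one fully connected output per kernel position.
        weights = [
            sigmoid(sum(w_rk * v for w_rk, v in zip(row, window)) + b)
            for row, b in zip(fc_weight, fc_bias)
        ]
        y.append(sum(w * v for w, v in zip(weights, window)))
    return y

def average_pool(y, pool_size):
    """Non-overlapping average pooling; a trailing partial chunk is averaged as-is."""
    chunks = [y[i:i + pool_size] for i in range(0, len(y), pool_size)]
    return [sum(c) / len(c) for c in chunks]

x = [1.0, 2.0, 3.0, 4.0]
fc_weight = [[0.0] * 3 for _ in range(3)]   # untrained layer: every weight is sigmoid(0) = 0.5
fc_bias = [0.0, 0.0, 0.0]
y = adaptive_dilated_conv(x, fc_weight, fc_bias, dilation=2)
pooled = average_pool(y, pool_size=2)
```

With a dilation rate of 2 and a kernel of 3, each output position sees inputs two steps to either side, so the receptive field widens without adding parameters, which is the property that helps with long entity words.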
[0134] S402: In this embodiment, by designing a joint dual-pointer decoding output, five output probabilities of the triplet (SPO) information corresponding to each word segment are predicted, and the output result labeling also adopts the BIO annotation method:
[0135]
[0136] where the five outputs for each token are: the probability that the token begins the subject entity of a triplet; the probability that the token lies inside the subject entity; the probability that the token begins the object entity of a triplet; the probability that the token lies inside the object entity; and the probability that the token is not part of any entity word. For each token, the label with the highest probability is selected as the entity result for that token, and the triplet is finally output.
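The selection rule of S402 (take the highest of the five probabilities per token, then read off the subject and object) can be sketched as follows. The label order follows the description above; the function names, the example probabilities, and the choice to return only the first span of each kind are assumptions made for illustration.

```python
LABELS = ["B-SUB", "I-SUB", "B-OBJ", "I-OBJ", "O"]

def decode_labels(probabilities):
    """Pick the highest-probability label for each token."""
    return [LABELS[max(range(len(LABELS)), key=row.__getitem__)]
            for row in probabilities]

def extract_span(tokens, labels, kind):
    """Return the first B-kind token plus its following I-kind tokens, or None."""
    for start, label in enumerate(labels):
        if label == f"B-{kind}":
            end = start + 1
            while end < len(labels) and labels[end] == f"I-{kind}":
                end += 1
            # Joined with spaces for English; single-character Chinese tokens would use "".
            return " ".join(tokens[start:end])
    return None

def decode_triplet(tokens, probabilities, relation):
    """Combine the predicted relation with the decoded subject and object."""
    labels = decode_labels(probabilities)
    subject = extract_span(tokens, labels, "SUB")
    obj = extract_span(tokens, labels, "OBJ")
    return (subject, relation, obj) if subject and obj else None

tokens = ["zinc", "sulfide", "reacts", "with", "acid"]
probabilities = [
    [0.90, 0.02, 0.02, 0.02, 0.04],
    [0.10, 0.80, 0.03, 0.03, 0.04],
    [0.05, 0.05, 0.05, 0.05, 0.80],
    [0.05, 0.05, 0.05, 0.05, 0.80],
    [0.05, 0.05, 0.70, 0.10, 0.10],
]
triplet = decode_triplet(tokens, probabilities, "has_oxidation_reaction")
```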
[0137] S403: In this embodiment, the obtained triplet information consists of Subject-Predicate-Object, and its higher-level concepts all correspond to the semantic entity types and semantic relation types defined in Figures 4 to 5. By comparing the concepts, attributes, and relationships in the entity ontology representation with the defined semantic entity types and semantic relation types, the similarity score between two entities is calculated to complete entity alignment and ultimately obtain a complete knowledge graph in the field of intelligent industrial manufacturing.
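The source does not specify how the similarity score of S403 is computed. The sketch below is one possible reading, assuming that two entities can only match when they share an ontology type and that name similarity is measured with the standard-library `difflib.SequenceMatcher`. The type names `Material` and `Equipment`, the threshold of 0.85, and the function names are all hypothetical.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Score two entities: 0 if their ontology types differ, else name similarity in [0, 1]."""
    if a["type"] != b["type"]:
        return 0.0
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

def align_entities(entities, threshold=0.85):
    """Greedily map each entity to the first canonical entity it matches."""
    canonical, mapping = [], {}
    for ent in entities:
        match = next((c for c in canonical if similarity(ent, c) >= threshold), None)
        if match is None:
            canonical.append(ent)   # no close match: this entity becomes canonical
            match = ent
        mapping[ent["name"]] = match["name"]
    return mapping

entities = [
    {"name": "sulfuric acid", "type": "Material"},
    {"name": "Sulfuric Acid", "type": "Material"},
    {"name": "leaching tank", "type": "Equipment"},
]
mapping = align_entities(entities)
```

Gating on the ontology type first reflects the role the ontology model plays in alignment: surface forms are compared only between entities that the schema layer already treats as the same kind of thing.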
[0138] S5. Knowledge Graph Storage and Application Steps in the Industrial Intelligent Manufacturing Field: The constructed knowledge graph needs to be stored and deeply retrieved, specifically including the following steps:
[0139] S501: In this embodiment, the constructed domain knowledge graph is loaded into the neo4j graph database for storage;
[0140] S502: In this embodiment, the Cypher language can be used in the graph database to realize functions such as adding, deleting, modifying and querying knowledge graphs in the field of intelligent manufacturing, and can also realize deep retrieval and query.
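Steps S501 and S502 can be sketched by generating the Cypher statements that would store and retrieve a triplet. This is an illustration under assumptions: the node label `Entity`, the property `name`, the helper `triplet_to_cypher`, and the three-hop limit in `DEEP_QUERY` are hypothetical, and sending the statements to a Neo4j server is not shown. Cypher does not allow a relationship type to be passed as a query parameter, so the predicate is validated and then written into the statement text, while the entity names are passed as parameters.

```python
import re

def triplet_to_cypher(subject, predicate, obj):
    """Build a parameterized Cypher MERGE statement for one triplet."""
    # Relationship types cannot be parameters, so restrict them to safe identifiers.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", predicate):
        raise ValueError(f"unsafe relationship type: {predicate!r}")
    query = (
        "MERGE (s:Entity {name: $subject}) "
        "MERGE (o:Entity {name: $object}) "
        f"MERGE (s)-[:{predicate}]->(o)"
    )
    return query, {"subject": subject, "object": obj}

# Multi-hop retrieval: every path of one to three relationships starting at an entity.
DEEP_QUERY = "MATCH p = (s:Entity {name: $name})-[*1..3]->(o) RETURN p"

query, params = triplet_to_cypher(
    "zinc sulfide concentrate", "has_oxidation_reaction", "sulfuric acid"
)
```

Using MERGE rather than CREATE means that loading the same triplet twice does not duplicate nodes or relationships, which suits a graph that is rebuilt or extended as more text is mined.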
[0141] In summary, compared with the prior art, this application has at least the following beneficial effects:
[0142] (1) This invention provides an autonomous modeling method for knowledge graphs in the field of intelligent manufacturing. First, a text corpus is constructed by collecting data information. Then, an ontology model is built and training data is labeled to train a knowledge mining model in the field of intelligent manufacturing. Next, the trained model is used to mine triplet knowledge. Finally, the autonomous construction of the knowledge graph in the field of intelligent manufacturing is completed. The ontology modeling and knowledge mining model proposed in this invention can directly predict structured triplet information from unlabeled text. It overcomes the defects of error accumulation and amplification in the traditional calculation process of "first identify entities, then identify relationships". It realizes knowledge mining of "input unlabeled original text - output triplet", reduces intermediate calculation process and improves the efficiency of knowledge graph modeling.
[0143] (2) This invention addresses the characteristics of text in the field of intelligent manufacturing: long entity words, a high density of entity words per sentence, and overlapping relationships. It makes targeted improvements to the basic BERT model by designing a one-dimensional adaptive weighted dilated gate convolutional gating unit and an average pooling layer, which effectively solves the problem of mining long entities, and it adopts a prediction mechanism of "relationship existence recognition → entity sequence recognition" to accurately extract overlapping relationships among high-density entity words, effectively improving the accuracy of knowledge mining on domain-specific text.
[0144] (3) This invention abstracts multi-source heterogeneous data and text in the field of industrial intelligent manufacturing into knowledge, forms a structured knowledge graph, and uses the neo4j graph database for storage. It designs a deep retrieval mechanism using the Cypher language, providing method support for the effective reuse of enterprise knowledge and helping industrial intelligent manufacturing enterprises realize knowledge empowerment for intelligent manufacturing.
[0145] Referring to Figure 8, this application also provides a domain knowledge graph autonomous modeling system 900, which can implement the above-mentioned domain knowledge graph autonomous modeling method. The system includes:
[0146] The first module 910 is used to acquire unstructured text in the target domain to construct a text corpus and to obtain a training dataset from the text corpus.
[0147] The second module 920 is used to construct an ontology model for the target domain based on the text corpus, and to annotate the training dataset based on the ontology model to obtain the model training dataset.
[0148] The third module 930 is used to fine-tune the pre-built BERT model based on the model training dataset to obtain a triplet knowledge mining model for the target domain.
[0149] The fourth module, 940, is used to load text statements from the text corpus into the triplet knowledge mining model for knowledge mining, and to construct a domain knowledge graph based on the results of knowledge mining.
[0150] In some embodiments, the system may further include:
[0151] The fifth module is used to store the domain knowledge graph into the neo4j graph database and build a retrieval and query mechanism based on Cypher statements.
[0152] It is understood that the content of the above method embodiments is applicable to this system embodiment. The specific functions implemented in this system embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.
[0153] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the aforementioned autonomous modeling method for domain knowledge graphs. This electronic device can be any smart terminal, including tablet computers, in-vehicle computers, etc.
[0154] It is understood that the content of the above method embodiments is applicable to this device embodiment. The specific functions implemented by this device embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.
[0155] Referring to Figure 9, which illustrates the hardware structure of an electronic device 1000 according to another embodiment, the electronic device includes:
[0156] The processor 1001 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.
[0157] The memory 1002 can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1002 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1002 and is called by the processor 1001 to execute the domain knowledge graph autonomous modeling method of the embodiments of this application.
[0158] Input / output interface 1003 is used to implement information input and output;
[0159] The communication interface 1004 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, network cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
[0160] Bus 1005 transmits information between various components of the device (e.g., processor 1001, memory 1002, input / output interface 1003, and communication interface 1004);
[0161] The processor 1001, memory 1002, input / output interface 1003 and communication interface 1004 are connected to each other within the device via bus 1005.
[0162] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described autonomous modeling method for domain knowledge graphs.
[0163] It is understood that the content of the above method embodiments is applicable to this storage medium embodiment. The specific functions implemented in this storage medium embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.
[0164] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0165] The domain knowledge graph autonomous modeling method, system, electronic device, and storage medium provided in this application embodiment construct a text corpus by acquiring unstructured text from the target domain and obtaining a training dataset from the text corpus; construct an ontology model of the target domain based on the text corpus, and annotate the training dataset based on the ontology model to obtain a model training dataset; fine-tune and train a pre-built BERT model based on the model training dataset to obtain a triplet knowledge mining model for the target domain; load text statements from the text corpus into the triplet knowledge mining model for knowledge mining, and construct a domain knowledge graph based on the knowledge mining results. This application embodiment abstracts multi-source heterogeneous data and text from the domain into a structured knowledge graph for knowledge modeling, providing reliable support for enterprises' knowledge-enabled intelligent manufacturing while improving the accuracy and efficiency of domain knowledge graph modeling.
[0166] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
[0167] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, and may include more or fewer steps than shown, or combine certain steps, or different steps.
[0168] The system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0169] Those skilled in the art will understand that all or some of the steps, apparatuses, or functional modules / units in the methods disclosed above can be implemented as software, firmware, hardware, or suitable combinations thereof.
[0170] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, apparatus, product, or device that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or devices.
[0171] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And / or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0172] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the system embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another device, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces, or indirect coupling or communication connection between devices or units, and may be electrical, mechanical, or other forms.
[0173] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0174] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0175] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0176] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.
Claims
1. A method for autonomous modeling of domain knowledge graphs, characterized in that the method includes:
Obtain unstructured text from the target domain to construct a text corpus, and obtain a training dataset from the text corpus;
An ontology model for the target domain is constructed based on the text corpus, and the training dataset is labeled based on the ontology model to obtain the model training dataset; the model training dataset includes a first training set labeled with existence relations and a second training set labeled with existence relations and entity location sequences;
The step of constructing an ontology model for the target domain based on the text corpus, and labeling the training dataset based on the ontology model to obtain the model training dataset, includes:
Based on the target domain, a conceptual schema layer is defined;
Semantic features and abstract concepts are extracted from the text corpus;
Based on the semantic features and the abstract concepts, the subject and object and their corresponding relationships are determined, and then the ontology model of the target domain is constructed by combining the schema layer;
Based on the definition results of the ontology model, the existence relation of a single sentence is labeled on the training text in the training dataset to obtain the first training set; then, the entity location sequence is labeled using the BIO annotation method to obtain the second training set;
The pre-constructed BERT model is fine-tuned and trained based on the model training dataset to obtain the triplet knowledge mining model for the target domain; the model training dataset includes a first training set labeled with existence relations and a second training set labeled with existence relations and entity location sequences;
The fine-tuning of the pre-constructed BERT model based on the model training dataset to obtain the triplet knowledge mining model for the target domain includes:
The BERT model is pre-constructed based on the semantic relationship features extracted from the text corpus;
The BERT model is fine-tuned based on the first training set to obtain a text relationship classification model for the target domain;
The BERT model is fine-tuned based on the second training set to obtain the sequence labeling model for the target domain;
The text relationship classification model and the sequence labeling model are merged to build a pipeline-type, relation-entity-oriented triplet knowledge mining model for the target domain;
The step of pre-constructing the BERT model based on the semantic relationship features extracted from the text corpus includes:
The BERT model is pre-constructed by employing a bidirectional attention mechanism combined with residual connections, and by designing an adaptive weighted dilated gate convolutional gating unit and an average pooling layer; in the BERT model, a bidirectional attention mechanism is used; after the attention mechanism, the representation at each position passes through a feedforward neural network, and residual connections are used on the output of each sub-layer;
The text statements in the text corpus are loaded into the triplet knowledge mining model for knowledge mining, and a domain knowledge graph is constructed based on the results of the knowledge mining.
2. The method according to claim 1, characterized in that the process of acquiring unstructured text from the target domain to construct a text corpus, and obtaining a training dataset from the text corpus, includes: collecting unstructured text of industrial texts from the target domain, the industrial texts including process texts, raw material and production information, and equipment ledgers of the target enterprise's vertical domain; preprocessing the unstructured text into a set of single sentences, which are then organized to obtain the text corpus; and obtaining a portion of the text from the text corpus as the training dataset.
3. The method according to any one of claims 1 to 2, characterized in that the process of loading text statements from the text corpus into the triplet knowledge mining model for knowledge mining, and constructing a domain knowledge graph based on the results of the knowledge mining, includes: loading the text statements in the text corpus into the triplet knowledge mining model for knowledge mining, and outputting the triplet information contained in the unstructured text through the two-pointer decoding method; and, based on the triplet information, performing entity alignment through the hierarchical structure and semantic relationships of the ontology model to obtain the entity relationship triplet set of the target domain, thus completing the construction of the domain knowledge graph.
4. The method according to any one of claims 1 to 2, characterized in that the method further includes: storing the domain knowledge graph in the neo4j graph database, and building a retrieval and query mechanism based on Cypher statements.
5. A domain knowledge graph autonomous modeling system, characterized in that the system, applied to the domain knowledge graph autonomous modeling method of claim 1, comprises: a first module used to acquire unstructured text from the target domain to construct a text corpus, and to acquire a training dataset from the text corpus; a second module used to construct an ontology model of the target domain based on the text corpus, and to annotate the training dataset based on the ontology model to obtain the model training dataset; a third module used to fine-tune the pre-constructed BERT model based on the model training dataset to obtain the triplet knowledge mining model for the target domain; and a fourth module used to load text statements from the text corpus into the triplet knowledge mining model for knowledge mining, and to construct a domain knowledge graph based on the results of the knowledge mining.
6. An electronic device, characterized in that the electronic device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the method according to any one of claims 1 to 4.
7. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the method of any one of claims 1 to 4.