Visual token semantic structuring training method based on industry terminology atlas

By constructing a semantic graph of hydropower terminology and jointly optimizing a visual discrete encoder, the generated visual tokens embody terminology hierarchy and association in the embedding space, solving the problem of no semantic constraints in the visual token embedding space and realizing high-precision intelligent tasks for hydropower engineering.

CN122244893A · Pending · Publication Date: 2026-06-19 · POWERCHINA BEIJING ENG CORP

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: POWERCHINA BEIJING ENG CORP
Filing Date: 2026-04-24
Publication Date: 2026-06-19

Abstract

This invention, titled "A Visual Token Semantic Structure Training Method Based on Industry Terminology Graph," belongs to the field of intelligent processing technology for hydropower engineering drawings. The technical problem it addresses is the core deficiency of existing visual tokenization technologies in hydropower engineering drawing processing: the lack of semantic structure in terminology, the neglect of semantic hierarchy in term ID alignment, and the failure to model functional relationships between terms. The key technical solution involves constructing a hydropower terminology semantic graph based on the "Design Code for Pumped Storage Power Stations," generating semantic embedding vectors for each term; introducing a triplet contrastive loss based on this graph into the training of the discrete variational autoencoder, jointly optimizing the encoder and visual codebook parameters with the conventional reconstruction loss; and mapping the hydropower engineering drawings, after feature extraction via a visual transformer, into a discrete visual sequence with semantic structure during inference.

Description

Technical Field

[0001] This invention belongs to the field of intelligent processing technology for hydropower engineering drawings, and specifically involves a visual token semantic structuring training method based on industry terminology graphs.

Background Technology

[0002] In the field of hydropower construction, multimodal intelligent processing of engineering drawings (including but not limited to equipment layout diagrams, control panel diagrams, and equipment nameplate annotation diagrams) is the core foundation for achieving high-precision intelligent tasks such as equipment correlation analysis and fault diagnosis in hydropower projects. Existing visual tokenization technology suffers from a core deficiency when processing hydropower engineering drawings: it lacks the semantic structure of terminology. Specifically: (1) Semantic hierarchy is ignored during term ID alignment. Conventional technical solutions map a hydropower term (such as “governor”) recognized by OCR (optical character recognition) to a single visual token ID. They model neither the inclusion relationships between terms (such as the subclass-parent class relationship between “reversible pump-turbine” and “pump-turbine”) nor the functional association relationships (such as the control relationship between “governor” and “guide vane”), and so treat terms as isolated labels rather than nodes in a semantic network. (2) The visual token embedding space lacks structured semantic constraints. Current mainstream visual tokenization algorithms (such as BLIP) optimize only the L2 reconstruction loss during training and introduce no hydropower-specific semantic constraints. As a result, the generated visual tokens lack structured semantic expression in the embedding space and cannot support high-precision intelligent tasks in hydropower engineering.

[0003] In view of this, the present invention is hereby proposed.

Summary of the Invention

[0004] To address the aforementioned technical problems in existing technologies, this invention provides a visual token semantic structuring training method based on industry terminology graphs, which solves the problem that existing visual tokens cannot reflect the hierarchy and relationships of hydropower terminology and cannot support high-precision multimodal intelligent tasks in hydropower engineering.

[0005] To achieve the above objectives, the technical solution of the present invention is as follows. A visual token semantic structuring training method based on industry terminology graphs includes: constructing a hydropower terminology semantic graph based on standard texts in the hydropower field; using a pre-trained language model to generate a semantic embedding vector for each term in the terminology semantic graph; during training of the visual discrete encoder, constructing a triplet contrastive loss function based on the terminology semantic graph; jointly optimizing the triplet contrastive loss function with the conventional reconstruction loss function to update the parameters of the visual discrete encoder and the visual codebook; and, after feature extraction, mapping the input hydropower engineering drawing image into a discrete visual sequence with the trained visual discrete encoder.

[0006] Furthermore, constructing the hydropower terminology semantic graph based on standard texts in the hydropower field includes: extracting high-frequency terms from the standard texts; constructing semantic relationships between terms based on the definitions and contextual logic of the standard texts; and generating a term semantic graph with a directed graph structure, containing term nodes and semantic relation edges. The mathematical expression of the term semantic graph is: G = (V, E)

[0007] where V is the set of term nodes and E is the set of semantic relation edges.

[0008] Furthermore, the high-frequency terms include: equipment-related terms, phenomenon-related terms, and parameter-related terms.

[0009] Furthermore, the semantic relationships between the terms include inclusion relationships and functional association relationships.

[0010] Furthermore, generating the semantic embedding vector for each term in the terminology semantic graph using a pre-trained language model specifically involves: the pre-trained language model encodes each term together with its context in the hydropower standard texts and, after fine-tuning, yields a semantic embedding vector of dimension d: e ∈ R^d.

[0011] Furthermore, constructing the triplet contrastive loss function based on the terminology semantic graph specifically includes: performing OCR recognition on image blocks in hydropower engineering drawings to obtain the recognized effective terms; retrieving from the semantic graph the terms that have a positive semantic association with an effective term as positive samples, and selecting terms that have no association with it as negative samples; and constructing the triplet contrastive loss function from the semantic embedding vector of the effective term, the semantic embedding vectors of the positive samples, the semantic embedding vectors of the negative samples, and the visual token embedding obtained by encoding the current image block with the visual discrete encoder.

[0012] Furthermore, the mathematical expression for the triplet contrastive loss function is: [formula not reproduced]

[0013] where the symbols denote, in order: the preset margin hyperparameter; the semantic embedding vector of the effective term; the semantic embedding vector of the negative sample; and the visual token embedding corresponding to the current image block. The mathematical expression for the conventional reconstruction loss function is: [formula not reproduced]

[0014] where the symbols denote, in order: an image block of the hydropower engineering drawing; and the features extracted from that image block by the visual transformer.

[0015] Furthermore, the visual discrete encoder is a discrete variational autoencoder.

[0016] Furthermore, the image feature extraction of the hydropower engineering drawing image is achieved through a visual transformer.

[0017] Furthermore, the relevant standard text in the hydropower field is the "Design Code for Pumped Storage Power Stations" NB/T 10072.

[0018] The beneficial effects of this invention are as follows: (1) Based on authoritative standards in the field of hydropower, a terminology semantic graph containing term hierarchy and association is constructed and used as a semantic supervision signal for visual token training. This solves the core problems in the existing technology of missing semantic structure and unmodeled semantic association between terms during visual tokenization, so that the visual tokens carry the semantic hierarchy information of hydropower terms. (2) The triplet contrastive loss function constructed from the terminology semantic graph places structural constraints on the visual token embedding space, pulling the visual tokens of semantically related terms closer together and pushing the visual tokens of irrelevant terms farther apart. This solves the problem that the visual token embedding space has no professional semantic constraints in the existing technology, and makes the embedding space reflect the semantic associations of hydropower terms. (3) A customized training method is designed for hydropower engineering drawing scenarios. Through joint optimization with the terminology graph and the contrastive loss, the generated visual tokens implicitly encode professional semantic information from the hydropower field, which provides structured semantic support for downstream high-precision multimodal tasks such as equipment association analysis and fault diagnosis Q&A in hydropower engineering.

Attached Figure Description

[0019] Figure 1 is a flowchart of the visual token semantic structuring training method provided in this embodiment of the invention.

Detailed Implementation

The technical solution of the present invention will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art without creative effort are within the protection scope of the present invention.

[0020] It should be noted that, unless otherwise specifically stated, the relative arrangement and numerical expressions of the components and steps described in these embodiments should not be construed as limiting the scope of the invention.

[0021] The following description of exemplary embodiments is merely illustrative and is not intended to limit the invention or its application or use in any way. Techniques, methods, and apparatus known to those skilled in the art may not be discussed in detail herein, but where applicable, such techniques, methods, and apparatus should be considered part of this specification.

[0022] Example

See Figure 1, which is a flowchart of the visual token semantic structuring training method based on industry terminology graphs proposed in this invention. Specific steps may include:

S1. Constructing a hydropower terminology semantic graph: the graph is constructed based on standard texts in the hydropower field, specifically including:

S11. Extract high-frequency terms from standard texts in the hydropower field. The high-frequency terms include equipment terms, phenomenon terms, and parameter terms. Equipment terms include "reversible pump-turbine", "governor", and "ball valve"; phenomenon terms include "tailrace surge" and "water hammer effect"; parameter terms include "rated head" and "maximum flow rate".

[0023] S12. Based on the definitions and contextual logic of the standard text, construct semantic relationships between terms. These semantic relationships include inclusion relationships and functional association relationships. An inclusion relationship points from a subclass to its parent class, such as "reversible pump-turbine" → "pump-turbine". A functional association relationship reflects the functional chain between terms, such as "governor" → "control" → "guide vane". Terms without direct association, such as "governor" and "excitation system", are excluded and labeled as not directly related, to reduce the negative impact of hard samples on model performance.

[0024] S13. Generate a term semantic graph with a directed graph structure, containing term nodes and semantic relation edges, and store it using an adjacency list as the graph's data structure. The mathematical expression of the term semantic graph is: G = (V, E)

[0025] where V is the set of term nodes and E is the set of semantic relation edges.
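The adjacency-list storage described in S13 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the class name `TermGraph`, its methods, and the relation labels `is_a` and `controls` are assumptions chosen for readability. Only the example terms come from the description above.

```python
from collections import defaultdict


class TermGraph:
    """Directed term graph stored as an adjacency list (illustrative only)."""

    def __init__(self):
        # adjacency[source] -> list of (relation, target) pairs
        self.adjacency = defaultdict(list)
        self.nodes = set()

    def add_node(self, term):
        """Register a term that may have no edges at all."""
        self.nodes.add(term)

    def add_edge(self, source, relation, target):
        """Add a directed, labeled edge source -> target."""
        self.nodes.update((source, target))
        self.adjacency[source].append((relation, target))

    def successors(self, term):
        """Terms reachable from `term` over one outgoing edge."""
        return [target for _, target in self.adjacency.get(term, [])]


graph = TermGraph()
# Inclusion relationship: subclass points to parent class.
graph.add_edge("reversible pump-turbine", "is_a", "pump-turbine")
# Functional association relationship.
graph.add_edge("governor", "controls", "guide vane")
# A term with no direct association to the others.
graph.add_node("excitation system")
```

Keeping the relation label on each edge lets later steps distinguish inclusion edges from functional edges if the two need different treatment.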

[0026] S2. Generate semantic embedding vectors for terms: use a pre-trained language model to generate a semantic embedding vector for each term in the semantic graph. Specifically, the pre-trained language model encodes each term together with its context in the hydropower standard text and, after fine-tuning, yields a semantic embedding vector of dimension d: e ∈ R^d. The standard text for the hydropower field is the "Design Code for Pumped Storage Power Stations" NB/T 10072.

[0027] S3. Training the visual discrete encoder: during training of the visual discrete encoder, a triplet contrastive loss function is constructed based on the terminology semantic graph, specifically including:

S31. Cut the hydropower engineering drawing into multiple image blocks. Input each image block into both the visual transformer and the OCR model, obtaining the image features and the text recognition result for that block. If the OCR recognizes an effective term, proceed to the next step.

S32. Retrieve from the semantic graph the terms that have a positive semantic association with the effective term as positive samples, and randomly select several terms that are not associated with it as negative samples. For example, when the effective term is "governor", a positive sample is "guide vane" and a negative sample is "excitation system".
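The sample selection in S32 can be sketched as below. This is a hedged illustration under stated assumptions: the toy `ADJACENCY` table, the function names, and the `num_negatives` parameter are hypothetical, and treating an edge in either direction as a "positive semantic association" is an assumption, since the description does not say whether direction matters for sampling.

```python
import random

# Hypothetical toy adjacency list: term -> list of (relation, related term).
ADJACENCY = {
    "reversible pump-turbine": [("is_a", "pump-turbine")],
    "governor": [("controls", "guide vane")],
    "pump-turbine": [],
    "guide vane": [],
    "excitation system": [],
}


def related_terms(term):
    """Terms linked to `term` by an edge in either direction (assumption)."""
    related = {target for _, target in ADJACENCY.get(term, [])}
    for source, edges in ADJACENCY.items():
        if any(target == term for _, target in edges):
            related.add(source)
    return related


def sample_triplet_terms(term, num_negatives=1, rng=None):
    """Return (positives, negatives) for an OCR-recognized effective term."""
    rng = rng or random.Random()
    positives = sorted(related_terms(term))
    # Candidates for negatives: every term that is neither the anchor
    # itself nor one of its positives.
    unrelated = sorted(set(ADJACENCY) - set(positives) - {term})
    negatives = rng.sample(unrelated, min(num_negatives, len(unrelated)))
    return positives, negatives


positives, negatives = sample_triplet_terms("governor", rng=random.Random(0))
```

For "governor", the positives list contains only "guide vane", and the negative is drawn at random from the remaining unrelated terms in this toy graph.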

[0028] S33. Construct a triplet contrastive loss function based on the semantic embedding vectors of effective terms, the semantic embedding vectors of positive samples, the semantic embedding vectors of negative samples, and the visual token embeddings corresponding to the current image patch after being encoded by the visual discrete encoder.

[0029] The mathematical expression for the triplet contrastive loss function is: [formula not reproduced]

[0030] where the symbols denote, in order: the preset margin hyperparameter; the semantic embedding vector of the effective term; the semantic embedding vector of the negative sample; and the visual token embedding for the current image patch. This loss function drives the visual token embedding, in the vector space, toward its corresponding term and semantically related terms and away from irrelevant terms.
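Because the formula itself is not reproduced in this text, the following is only a sketch of a standard hinge-style triplet loss that matches the verbal description: the visual token embedding acts as the anchor, the embedding of the effective term (or of a related term) as the positive, and the embedding of an unrelated term as the negative. The use of squared Euclidean distance, the default margin value, and all function names are assumptions, not the patent's stated form.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_contrastive_loss(token, positive, negative, margin=0.5):
    """Hinge-style triplet loss: max(0, d(z, e+) - d(z, e-) + margin).

    token    -- visual token embedding of the current image patch (anchor)
    positive -- semantic embedding of the effective or a related term
    negative -- semantic embedding of an unrelated term
    margin   -- preset margin hyperparameter (value here is arbitrary)
    """
    gap = squared_distance(token, positive) - squared_distance(token, negative)
    return max(0.0, gap + margin)


# The token already sits on the positive and far from the negative: no loss.
satisfied = triplet_contrastive_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
# The token sits on the negative instead: the loss is positive.
violated = triplet_contrastive_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

The loss is zero once the token is closer to the positive than to the negative by at least the margin, so well-separated triplets stop contributing gradient.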

[0031] S4. Generate a structured visual token sequence: jointly optimize the triplet contrastive loss function and the conventional reconstruction loss function to update the parameters of the visual discrete encoder and the visual codebook. The mathematical expression for the conventional reconstruction loss function is: [formula not reproduced]

[0032] where the symbols denote, in order: an image block of the hydropower engineering drawing; and the features extracted from that image block by the visual transformer. After sufficient training, each entry in the visual codebook not only retains the visual features of the original image, but also implicitly encodes the semantic position of its corresponding term in the hydropower knowledge system.
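The joint optimization in S4 can be illustrated with a small sketch. The description does not state how the two losses are combined, so the weighted sum and the `contrast_weight` parameter below are assumptions. The mean-squared-error form of the reconstruction loss is likewise a common choice for an L2 reconstruction loss rather than the patent's exact expression.

```python
def reconstruction_loss(original, reconstructed):
    """Mean squared error between a patch and its reconstruction."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((a - b) ** 2 for a, b in zip(original, reconstructed))
    return total / len(original)


def joint_loss(recon_loss, triplet_loss, contrast_weight=1.0):
    """Weighted sum of the two objectives (the weighting is an assumption)."""
    return recon_loss + contrast_weight * triplet_loss


# Flattened toy patch and its reconstruction.
recon = reconstruction_loss([0.0, 1.0], [0.0, 0.0])
# Combine with a triplet loss value of 2.0 for the same patch.
total = joint_loss(recon, 2.0, contrast_weight=0.25)
```

In a real training loop both terms would be differentiable tensors and a single optimizer step on the combined value would update the encoder and the codebook together.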

[0033] S5. Obtaining discrete visual sequences: after feature extraction, the input hydropower engineering drawing images are mapped into discrete visual sequences by the trained visual discrete encoder.

[0034] Specifically, the input hydropower engineering drawing image is cut into multiple image blocks. The same image block is input into the visual transformer and the OCR model respectively. The same image block yields the corresponding image features and text recognition results. The image features are mapped into discrete visual sequences by the trained discrete variational autoencoder (visual discrete encoder).
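The mapping from patch features to a discrete visual sequence can be sketched as a codebook lookup. The description does not say how the trained encoder selects a codebook entry, so the nearest-neighbour rule below (as used in VQ-style quantizers) is an assumption, and the function names and toy values are hypothetical.

```python
def nearest_code(feature, codebook):
    """Index of the codebook entry closest to `feature` (squared L2)."""
    def distance(entry):
        return sum((a - b) ** 2 for a, b in zip(feature, entry))

    return min(range(len(codebook)), key=lambda i: distance(codebook[i]))


def tokenize(patch_features, codebook):
    """Map each patch feature vector to a discrete visual token ID."""
    return [nearest_code(feature, codebook) for feature in patch_features]


# Toy codebook with three entries and three patch features.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
features = [[0.1, 0.1], [0.9, 1.2], [0.2, 1.9]]
token_ids = tokenize(features, codebook)
```

Each drawing then becomes a sequence of integer token IDs, one per image block, in the order the blocks were cut.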

[0035] Because semantic structure constraints are introduced during training, the generated visual tokens exhibit the hierarchical and relational characteristics of the terms in the embedding space. For example, the Euclidean distance in the embedding space between the visual token corresponding to "reversible pump-turbine" and the visual token of "pump-turbine" is significantly smaller than its distance from unrelated terms such as "excitation system", thus providing structured semantic support for downstream multimodal tasks (such as equipment association analysis and fault diagnosis question answering).

[0036] The above specific embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to examples, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the scope of the technical solutions of the present invention, and all such modifications or substitutions should be covered within the scope of the claims of the present invention.

Claims

1. A visual token semantic structuring training method based on industry terminology graphs, characterized in that it includes: constructing a hydropower terminology semantic graph based on standard texts in the hydropower field; using a pre-trained language model to generate a semantic embedding vector for each term in the terminology semantic graph; during training of the visual discrete encoder, constructing a triplet contrastive loss function based on the terminology semantic graph; jointly optimizing the triplet contrastive loss function with the conventional reconstruction loss function to update the parameters of the visual discrete encoder and the visual codebook; and, after feature extraction, mapping the input hydropower engineering drawing image into a discrete visual sequence with the trained visual discrete encoder.

2. The visual token semantic structuring training method based on industry terminology graphs as described in claim 1, characterized in that constructing the hydropower terminology semantic graph based on standard texts in the hydropower field includes: extracting high-frequency terms from the standard texts; constructing semantic relationships between terms based on the definitions and contextual logic of the standard texts; and generating a term semantic graph with a directed graph structure, containing term nodes and semantic relation edges. The mathematical expression of the term semantic graph is: G = (V, E), where V is the set of term nodes and E is the set of semantic relation edges.

3. The visual token semantic structuring training method based on industry terminology graphs as described in claim 2, characterized in that the high-frequency terms include: equipment-related terms, phenomenon-related terms, and parameter-related terms.

4. The visual token semantic structuring training method based on industry terminology graphs as described in claim 2, characterized in that the semantic relationships between the terms include inclusion relationships and functional association relationships.

5. The visual token semantic structuring training method based on industry terminology graphs as described in claim 1, characterized in that generating the semantic embedding vector for each term in the terminology semantic graph using a pre-trained language model is as follows: the pre-trained language model encodes each term together with its context in the hydropower standard texts and, after fine-tuning, yields a semantic embedding vector of dimension d: e ∈ R^d.

6. The visual token semantic structuring training method based on industry terminology graphs as described in claim 1, characterized in that constructing the triplet contrastive loss function based on the terminology semantic graph specifically includes: performing OCR recognition on image blocks in hydropower engineering drawings to obtain the recognized effective terms; retrieving from the semantic graph the terms that have a positive semantic association with an effective term as positive samples, and selecting terms that have no association with it as negative samples; and constructing the triplet contrastive loss function from the semantic embedding vector of the effective term, the semantic embedding vectors of the positive samples, the semantic embedding vectors of the negative samples, and the visual token embedding obtained by encoding the current image block with the visual discrete encoder.

7. The visual token semantic structuring training method based on industry terminology graphs as described in claim 6, characterized in that the mathematical expression for the triplet contrastive loss function is: [formula not reproduced], where the symbols denote, in order: the preset margin hyperparameter; the semantic embedding vector of the effective term; the semantic embedding vector of the negative sample; and the visual token embedding corresponding to the current image block; and the mathematical expression for the conventional reconstruction loss function is: [formula not reproduced], where the symbols denote, in order: an image block of the hydropower engineering drawing; and the features extracted from that image block by the visual transformer.

8. The visual token semantic structuring training method based on industry terminology graphs as described in claim 1, characterized in that the visual discrete encoder is a discrete variational autoencoder.

9. The visual token semantic structuring training method based on industry terminology graphs as described in claim 1, characterized in that image feature extraction of the hydropower engineering drawings is achieved through a visual transformer.

10. The visual token semantic structuring training method based on industry terminology graphs as described in claim 1, characterized in that the relevant hydropower standard text is the "Design Code for Pumped Storage Power Stations" NB/T 10072.