Method and apparatus for gene function characterization
By acquiring multi-source gene attribute data, applying a generative large language model for targeted knowledge completion, and then performing iterative contrastive learning training, this approach addresses the problem that existing methods cannot fully utilize multi-source, fine-grained, and structured biomedical text knowledge. It generates accurate gene function embedding representations that support gene association prediction and disease gene discovery.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- GUANGZHOU NAT LAB
- Filing Date
- 2026-02-09
- Publication Date
- 2026-06-12
AI Technical Summary
Existing gene representation learning methods cannot fully utilize multi-source, fine-grained, structured biomedical text knowledge, resulting in insufficient and inaccurate gene function representation.
After acquiring multi-source gene attribute data, we use a generative large language model for targeted knowledge completion, construct a gene knowledge graph, and convert it into a structured text sequence using attribute-aware templates containing special markers for attribute categories. Iterative contrastive learning training, which combines data augmentation strategies, progressive negative sampling, and a composite loss function, then trains a long context encoder and an attribute-aware attention pooling layer to generate accurate gene function embedding representations.
It achieves comprehensive integration and quality enhancement of fine-grained gene attribute information scattered across multiple authoritative databases. The generated gene function embedding representation can accurately capture fine functional semantics, supporting subsequent gene association prediction and disease gene discovery tasks.
Figure CN122201460A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of data processing technology, and in particular to a method and apparatus for gene function characterization.
Background Technology
[0002] High-quality, information-rich gene function representations are crucial for downstream bioinformatics tasks such as gene association prediction and disease gene discovery. With the development of natural language processing technology, researchers have begun to treat genes as semantic entities and are attempting to use deep learning frameworks to learn low-dimensional, dense gene embedding representations from biological data in order to capture their functional semantics.
[0003] Currently, mainstream gene representation learning methods mainly include sequence- or ontology-based methods, foundation models based on single-cell transcriptomics, and text summarization methods based on large language models. Sequence- or ontology-based methods mainly rely on structured or numerical data, such as co-expression matrices, gene ontology terms, or sequence k-mers. They fail to systematically utilize the rich descriptive text annotations in biomedical knowledge bases and can only capture coarse-grained statistical associations between genes or hierarchical relationships between predefined terms.
[0004] While foundation models based on single-cell transcriptomics can learn expression-level patterns, their training data either do not contain, or only implicitly contain, the manually reviewed biological knowledge held in authoritative databases. These models lack explicit connections to structured, fine-grained functional knowledge. Text summarization methods based on large language models primarily rely on concise, high-level summary descriptions. These summaries are highly generalized representations of gene function and inevitably omit a large amount of fine-grained knowledge scattered across multiple specialized databases. Furthermore, general large language models typically treat gene descriptions as ordinary natural language text and fail to explicitly model the inherent structure of gene descriptions.
[0005] In summary, existing gene representation learning methods cannot fully utilize multi-source, fine-grained, structured biomedical text knowledge to construct comprehensive and accurate gene function representations.
Summary of the Invention
[0006] This invention provides a gene function characterization method and apparatus to address the problem that existing gene representation learning methods cannot fully utilize multi-source, fine-grained, structured biomedical text knowledge to construct comprehensive and accurate gene function representations.
[0007] This invention provides a gene function characterization method, comprising: acquiring multi-source gene attribute data covering molecular function, signaling pathways, and tissue expression dimensions; using a generative large language model to perform targeted knowledge completion on the missing attributes in the multi-source gene attribute data to obtain a gene knowledge graph; converting the gene knowledge graph into a structured text sequence using an attribute-aware template that includes special markers for attribute categories; and, based on the structured text sequence, performing iterative contrastive learning training on the gene function embedding model until the preset convergence condition is met, to obtain the trained gene function embedding model. Each contrastive learning training round includes: constructing positive sample pairs using data augmentation strategies, selecting negative samples using progressive sampling strategies, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function. The gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer, used to map the structured text sequence of the target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge.
[0008] According to a gene function characterization method provided by the present invention, the step of using a generative large language model to perform targeted knowledge completion on missing attributes in the multi-source gene attribute data includes: constructing a constraint prompt instruction, wherein the constraint prompt instruction is used to configure the generative large language model to preferentially cite factual data from authoritative sources and to set the generation temperature parameter to a preset low threshold that favors deterministic output; obtaining the candidate completion content output by the generative large language model based on the constraint prompt instruction; if the candidate completion content is detected to contain cognitive uncertainty words, discarding the candidate completion content, wherein the cognitive uncertainty words include at least terms that express speculation or terms that express that something is not yet clear; and, if no cognitive uncertainty words are detected, retaining the candidate completion content as the completion result.
[0009] According to a gene function characterization method provided by the present invention, the step of converting the gene knowledge graph into a structured text sequence using an attribute-aware template containing special markers for attribute categories includes: defining special markers for the various attribute types, filling the attribute values of each gene record in the gene knowledge graph into the corresponding special markers to form a semi-structured natural language sequence, and using the special markers as structural prior information in subsequent encoding processing. The special markers include at least a pathway marker for encapsulating pathway information, a function marker for encapsulating the function description, and a tissue marker for encapsulating tissue expression information.
[0010] According to a gene function characterization method provided by the present invention, the step of constructing positive sample pairs using data augmentation strategies includes performing at least one of the following augmentation operations on the structured text sequence to generate an augmented text sequence: contextual synonym fusion, which replaces key terms in the structured text sequence with biomedical synonyms; functional entity masking, which randomly masks biological phrase fragments in the structured text sequence; structured field reordering, which shuffles the order of non-core attribute blocks while maintaining the priority of core fields; functional abstraction, which randomly simplifies the functional description portion of the structured text sequence; and pathway representation variation, which applies controlled reordering or selective pruning to the pathway list in the structured text sequence. The positive sample pair is formed by combining the original structured text sequence with an augmented text sequence, or by combining two different augmented text sequences generated for the same gene.
[0011] According to a gene function characterization method provided by the present invention, the step of selecting negative samples using a progressive sampling strategy is dynamically adjusted according to the current training round, and includes the following stages executed in the order of the training rounds: a basic feature distribution establishment stage, in which, during the early rounds of training, random sampling is performed within the current training batch to obtain the negative samples; a hard negative sample mining stage, in which, during the middle rounds of training, the cosine similarity between the anchor gene and the other genes in the current training batch is calculated, and non-homologous genes with a cosine similarity higher than a preset threshold are selected as negative samples; and a memory bank sampling stage, in which, during the later rounds of training, a memory bank storing historical gene embedding representations is maintained, and negative samples are sampled from the memory bank.
[0012] According to a gene function characterization method provided by the present invention, the step of selecting negative samples using a progressive sampling strategy further includes: After identifying candidate negative samples, external biological databases are searched to obtain information on the functional association between the candidate negative samples and anchor genes. Determine whether the candidate negative sample and the anchor gene meet the functional association conditions; wherein, the functional association conditions include: the candidate negative sample and the anchor gene belong to the same KEGG pathway, or the candidate negative sample and the anchor gene have an interaction score higher than a preset score in a protein interaction database; If the functional association condition is met, the candidate negative sample is excluded from the negative sample sampling pool.
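The false-negative exclusion described in this paragraph can be illustrated with a minimal Python sketch. It assumes the pathway memberships and interaction scores have already been retrieved from the external databases into in-memory dictionaries; the function name `filter_false_negatives`, the dictionary layouts, and the `min_score` value of 0.7 are hypothetical and are not specified by the source.

```python
def filter_false_negatives(anchor, candidates, pathways, interaction_score, min_score=0.7):
    """Drop candidate negatives that are functionally associated with the anchor.

    pathways: gene -> set of pathway identifiers (assumed pre-fetched, e.g. from KEGG).
    interaction_score: frozenset({gene_a, gene_b}) -> score (assumed pre-fetched
    from a protein interaction database). min_score is a hypothetical preset score.
    """
    kept = []
    anchor_pathways = pathways.get(anchor, set())
    for gene in candidates:
        shares_pathway = bool(anchor_pathways & pathways.get(gene, set()))
        score = interaction_score.get(frozenset((anchor, gene)), 0.0)
        if shares_pathway or score > min_score:
            continue  # functionally associated: exclude from the negative pool
        kept.append(gene)
    return kept


pathways = {"HK1": {"glycolysis"}, "PFKM": {"glycolysis"}, "TP53": {"p53 signaling"}, "MDM2": set()}
scores = {frozenset(("HK1", "MDM2")): 0.9}
negatives = filter_false_negatives("HK1", ["PFKM", "TP53", "MDM2"], pathways, scores)
```

In this toy example only `TP53` remains in the pool: `PFKM` shares a pathway with the anchor and `MDM2` exceeds the interaction score.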
[0013] According to a gene function characterization method provided by the present invention, the long context encoder adopts a Transformer architecture based on sliding window attention; During the process of updating the model parameters, the bottom preset number of Transformer layers of the long context encoder are frozen, and only the top preset number of Transformer layers and the embedding head are fine-tuned.
[0014] According to a gene function characterization method provided by the present invention, the attribute-aware attention pooling layer performs the following processing: the embedding vector of the global classification token is used as the query vector, and the hidden state sequence output by the long context encoder is used as the key vectors and value vectors; the special markers in the structured text sequence are used as semantic anchors to calculate attention weights; and the value vectors are weighted and aggregated based on the attention weights to generate the gene representation vector. The attribute-aware attention pooling layer is configured with multiple attention heads to focus on the semantic regions corresponding to different special markers.
[0015] According to a gene function characterization method provided by the present invention, the composite loss function includes: a noise-contrastive estimation loss, used to bring the positive sample pairs closer together in the representation space and push the negative samples further apart in the representation space; a covariance loss, used to minimize the off-diagonal elements of the covariance matrix of the current training batch; and a variance loss, constructed based on the hinge loss function, used to constrain the average variance of each dimension from exceeding a preset target threshold. The value of the composite loss function is obtained by weighted summation of the noise-contrastive estimation loss, the covariance loss, and the variance loss.
[0016] The present invention also provides a gene function characterization device, comprising: an acquisition module, used to acquire multi-source gene attribute data covering molecular function, signaling pathways, and tissue expression dimensions; a completion module, used to perform targeted knowledge completion on the missing attributes in the multi-source gene attribute data using a generative large language model, thereby obtaining a gene knowledge graph; a conversion module, used to convert the gene knowledge graph into a structured text sequence using an attribute-aware template containing special markers for attribute categories; and a training module, used to perform iterative contrastive learning training on the gene function embedding model based on the structured text sequence until the preset convergence condition is met, so as to obtain the trained gene function embedding model. Each contrastive learning training round includes: constructing positive sample pairs using data augmentation strategies, selecting negative samples using progressive sampling strategies, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function. The gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer, which are used to map the structured text sequence of the target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge.
[0017] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the gene function characterization method as described above.
[0018] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the gene function characterization method as described above.
[0019] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the gene function characterization method as described above.
[0020] The gene function characterization method and apparatus provided by this invention achieve comprehensive integration and quality enhancement of fine-grained gene attribute information scattered across multiple authoritative databases by acquiring multi-source gene attribute data and using a generative large language model for targeted knowledge completion. By using attribute-aware templates containing special markers for attribute categories to convert gene knowledge graphs into structured text sequences, the inherent structured attributes in gene descriptions are explicitly preserved. Through data augmentation strategies, progressive negative sampling strategies, and a composite loss function in iterative contrastive learning training, combined with the synergistic effect of the long context encoder and the attribute-aware attention pooling layer in the gene function embedding model, the model can dynamically aggregate semantic information from different attribute fields. This effectively solves the technical problem that existing methods cannot fully utilize multi-source, fine-grained, and structured biomedical text knowledge, enabling the generated gene function embedding representation to accurately capture fine functional semantics.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0022] Figure 1 is a flowchart of the gene function characterization method provided by the present invention;
Figure 2 is a schematic diagram of the system structure provided by the present invention;
Figure 3 is a UMAP visualization of the gene embeddings in an embodiment of the present invention;
Figure 4 is a visualization of the glycolysis pathway genes in an embodiment of the present invention;
Figure 5 is a box plot showing the similarity distribution between pathway genes and random genes in an embodiment of the present invention;
Figure 6 is a schematic diagram of the gene function characterization device provided by the present invention;
Figure 7 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0023] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0024] Figure 1 is a flowchart of the gene function characterization method provided by the present invention. As shown in Figure 1, the method includes the following steps. Step 110: Obtain multi-source gene attribute data covering molecular function, signaling pathways, and tissue expression dimensions; In this application, multi-source gene attribute data refers to a set of attribute information obtained from multiple biomedical databases of different sources, used to describe the biological characteristics of genes.
[0025] Multi-source gene attribute data covers at least three core dimensions: molecular function dimension describes the biochemical activities performed by gene products at the molecular level; signaling pathway dimension describes the biological signal transduction pathways in which genes participate; and tissue expression dimension describes the expression patterns of genes in different tissues or organs.
[0026] The process of obtaining the multi-source gene attribute data includes: Determine the set of data sources to be integrated, which may include authoritative biomedical databases; Extract gene-related attribute fields from various data sources; perform data cleaning operations on the extracted attribute data, including case normalization, deduplication of redundant data, filtering of non-target species data, and outlier handling. The cleaned attribute data is linked and integrated according to the gene identifier to form a multi-source gene attribute data record at the gene level.
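The cleaning and linking steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the embodiment: the record layout (`gene_id`, `attribute`, `values`, `species`) and the function names `clean_record` and `integrate` are hypothetical, and outlier handling is omitted.

```python
from collections import defaultdict


def clean_record(record, target_species="human"):
    """Return a cleaned copy of one raw record, or None if it is filtered out."""
    if record.get("species", target_species).lower() != target_species:
        return None  # filter non-target species data
    gene_id = record["gene_id"].strip().upper()  # case normalization of the identifier
    values = {v.strip() for v in record.get("values", []) if v and v.strip()}
    return {"gene_id": gene_id, "attribute": record["attribute"], "values": values}


def integrate(sources):
    """Link cleaned records from all sources by gene identifier."""
    genes = defaultdict(lambda: defaultdict(set))
    for records in sources:
        for raw in records:
            rec = clean_record(raw)
            if rec is None:
                continue
            # Set union removes values duplicated across sources.
            genes[rec["gene_id"]][rec["attribute"]] |= rec["values"]
    return {g: {a: sorted(v) for a, v in attrs.items()} for g, attrs in genes.items()}


source_a = [{"gene_id": "tp53", "attribute": "pathway", "values": ["p53 signaling"], "species": "human"}]
source_b = [
    {"gene_id": "TP53", "attribute": "pathway", "values": ["p53 signaling", "Apoptosis"], "species": "human"},
    {"gene_id": "Trp53", "attribute": "pathway", "values": ["p53 signaling"], "species": "mouse"},
]
merged = integrate([source_a, source_b])
```

Here the two human records for the same gene collapse into one gene-level record, and the mouse record is dropped by the species filter.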
[0027] The multi-source gene attribute data covers nine core attribute categories: gene symbol, gene description, molecular function annotation, biological process annotation, cellular component localization, list of signaling pathways involved, tissue expression profile information, disease association information, and prognostic information.
[0028] Nine core attribute categories characterize the biological characteristics of genes from different perspectives. Among them, molecular function annotation describes the molecular-level biochemical activities performed by gene products, biological process annotation describes the macroscopic biological processes in which genes participate, and cellular component localization describes the spatial distribution of gene products within cells.
[0029] In a specific implementation scenario, the server extracts gene attribute data from the Harmonizome database, OmniPath database, Ensembl database, NCBI Gene database, and Human Protein Atlas database.
[0030] The data cleaning process specifically included: performing case normalization to eliminate duplicate records caused by inconsistencies in capitalization; performing non-human pathway filtering to remove pathway annotation information from other species; performing missing value marking to identify the integrity status of each attribute field; and performing format unification to ensure that similar attributes from different data sources use a consistent representation format. After data cleaning and integration, a multi-source gene attribute dataset covering approximately 19,773 human genes that had been manually reviewed was obtained.
[0031] The attribute data covers specific attribute types including, but not limited to: gene symbols, gene descriptions, molecular function annotations, lists of signaling pathways involved, tissue expression profiles, protein localization information, and disease association information.
[0032] Step 120: Using a generative large language model, targeted knowledge completion is performed on the missing attributes in the multi-source gene attribute data to obtain a gene knowledge graph; In this application, the generative large language model refers to a deep learning model that has been pre-trained on a large-scale corpus and possesses natural language understanding and generation capabilities. The targeted knowledge completion refers to the process of using the knowledge generation capabilities of the generative large language model to fill in missing values in specific attribute fields within the multi-source gene attribute data.
[0033] A gene knowledge graph is a collection of gene attribute data that has been significantly improved in terms of attribute completeness after knowledge completion processing.
[0034] The process of performing the targeted knowledge completion is as follows: traversing each gene record in the multi-source gene attribute data and detecting the missing status of each attribute field; for gene records with missing attributes, constructing a query request containing known attribute information; inputting the query request into the generative large language model to obtain the completed content; performing quality assessment and filtering on the completed content; and filling the corresponding missing attribute fields with the completed content that has passed the quality assessment to form the gene knowledge graph.
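The traversal, query construction, filtering, and filling steps above might be organized as in the following sketch. The language model is represented by a caller-supplied function `generate`, and the quality check by `passes_quality`; both names, the query wording, and the record layout are assumptions made for illustration, and no particular model API is implied.

```python
def complete_missing(records, required_fields, generate, passes_quality):
    """Fill missing attribute fields using a caller-supplied text generator.

    generate(query) -> str stands in for the generative large language model;
    passes_quality(text) -> bool stands in for quality assessment and filtering.
    """
    completed = []
    for record in records:
        filled = dict(record)  # leave the input record untouched
        for field in required_fields:
            if filled.get(field):
                continue  # attribute already present
            known = {k: v for k, v in filled.items() if v}
            query = f"Known gene attributes: {known}. Provide the missing field: {field}."
            candidate = generate(query)
            # Only content that passes the quality check is written back.
            filled[field] = candidate if passes_quality(candidate) else None
        completed.append(filled)
    return completed


def stub_llm(query):
    # Hypothetical stand-in returning a fixed answer.
    return "Glycolysis / Gluconeogenesis"


def simple_quality(text):
    return bool(text) and "unclear" not in text.lower()


records = [{"symbol": "HK1", "molecular_function": "hexokinase activity", "pathways": None}]
graph = complete_missing(records, ["molecular_function", "pathways"], stub_llm, simple_quality)
```

Fields that fail the quality check stay empty rather than being filled with low-confidence text.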
[0035] Step 130: Using an attribute-aware template containing special markers for attribute categories, the gene knowledge graph is converted into a structured text sequence; In this application, attribute-aware templates refer to predefined text formatting rules used to convert genetic attribute data into natural language sequences.
[0036] Special markers for attribute categories refer to special characters or words used in the attribute-aware template to identify and distinguish different attribute types.
[0037] Structured text sequence refers to natural language text that retains attribute structure information and is generated according to the attribute-aware template.
[0038] The process of converting the gene knowledge graph into the structured text sequence includes: defining corresponding special markers for each attribute type; filling each attribute value into the position range defined by the corresponding special marker in order of priority of the attribute type; and splicing the filled attribute fragments into the complete structured text sequence.
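As a minimal sketch of this conversion, the snippet below fills attribute values between paired markers in a fixed priority order and joins the fragments. The marker strings (`[GENE]`, `[FUNC]`, `[PATH]`, `[TISSUE]`), the use of closing markers, and the field names are hypothetical; the source does not give the literal marker text.

```python
# Priority order of attribute types and their (hypothetical) special markers.
FIELD_ORDER = [
    ("symbol", "[GENE]"),
    ("function", "[FUNC]"),
    ("pathways", "[PATH]"),
    ("tissue", "[TISSUE]"),
]


def render(record):
    """Convert one gene record into a structured text sequence."""
    parts = []
    for field, marker in FIELD_ORDER:
        value = record.get(field)
        if not value:
            continue  # skip attributes that are still missing
        if isinstance(value, (list, tuple)):
            value = "; ".join(value)
        closing = marker.replace("[", "[/")
        parts.append(f"{marker} {value} {closing}")
    return " ".join(parts)


record = {
    "symbol": "HK1",
    "function": "hexokinase activity",
    "pathways": ["Glycolysis", "Fructose metabolism"],
}
text = render(record)
```

For this record, `text` is `[GENE] HK1 [/GENE] [FUNC] hexokinase activity [/FUNC] [PATH] Glycolysis; Fructose metabolism [/PATH]`.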
[0039] Special markers serve as structural prior information in subsequent encoding processes, assisting the model in identifying and distinguishing different types of attribute content.
[0040] Step 140: Based on the structured text sequence, perform iterative contrastive learning training on the gene function embedding model until the preset convergence condition is met, and obtain the trained gene function embedding model. Each contrastive learning training round includes: constructing positive sample pairs using data augmentation strategies, selecting negative samples using progressive sampling strategies, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function; the gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer, used to map the structured text sequence of the target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge.
[0041] In this application, the gene function embedding model refers to a deep learning model used to map structured text sequences of genes into dense vector representations.
[0042] In this application, iterative contrastive learning training refers to the process of optimizing the parameters of the gene function embedding model by using a contrastive learning paradigm through multiple training rounds.
[0043] The preset convergence conditions can specifically refer to the following: the training loss drops below a preset threshold, the validation set performance metrics no longer improve, the preset patience value is reached, or the training rounds reach the preset maximum number of rounds.
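One possible reading of these conditions, combined so that training stops as soon as any one of them holds, is sketched below. The function name `should_stop`, the zero-based epoch convention, and the way patience is counted are assumptions for illustration.

```python
def should_stop(epoch, train_loss, val_scores, *, loss_threshold, patience, max_epochs):
    """Return True when any preset convergence condition is met.

    epoch is zero-based; val_scores holds one validation score per finished
    epoch, where higher is better.
    """
    if train_loss < loss_threshold:
        return True  # training loss dropped below the preset threshold
    if epoch + 1 >= max_epochs:
        return True  # preset maximum number of rounds reached
    if len(val_scores) > patience:
        best_before = max(val_scores[:-patience])
        # No improvement over the earlier best during the last `patience` epochs.
        if max(val_scores[-patience:]) <= best_before:
            return True
    return False


settings = dict(loss_threshold=0.01, patience=3, max_epochs=24)
early = should_stop(0, 0.5, [0.60], **settings)
stalled = should_stop(5, 0.5, [0.60, 0.70, 0.69, 0.68, 0.70, 0.66], **settings)
```

In the example, `early` is `False` because no condition holds yet, while `stalled` is `True` because the last three validation scores never exceed the earlier best of 0.70.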
[0044] Figure 2 is a schematic diagram of the system structure provided by the present invention. As shown in Figure 2, the gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer.
[0045] The long context encoder is used to perform context-aware encoding processing on the input structured text sequence to generate a hidden state sequence.
[0046] The attribute-aware attention pooling layer is used to aggregate the hidden state sequence based on the attention mechanism to generate a gene representation vector of uniform dimension.
[0047] In one specific implementation, the long context encoder uses the Longformer-base-4096 model as the backbone network.
[0048] The Longformer-base-4096 model supports processing input sequences of up to 4096 tokens, which can meet the needs of encoding long text gene descriptions.
[0049] To adapt to the specific needs of gene function characterization tasks, the server expands the vocabulary of the Longformer-base-4096 model by adding 12 domain-specific attribute marker entries, which include the special markers used to identify the boundaries of different attribute types.
[0050] The gene functional embedding model further includes a projection head located after the attribute-aware attention pooling layer. This projection head maps the gene representation vector output by the attribute-aware attention pooling layer to the contrastive learning space. The projection head employs a non-linear projection structure, comprising at least one fully connected layer and a non-linear activation function. It transforms the normalized gene representation vector into an embedding vector suitable for calculating the contrastive loss through a non-linear transformation.
[0051] The gene function embedding model maps the structured text sequence of the target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge. The gene function embedding representation can be used for subsequent downstream tasks such as gene association prediction and functional classification.
[0052] Each contrastive learning training round includes the following processing steps: constructing positive sample pairs using data augmentation strategies, selecting negative samples using progressive sampling strategies, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function.
[0053] The data augmentation strategy refers to the technique of performing semantics-preserving transformations on the original structured text sequence to generate an augmented text sequence that is semantically consistent with the original sequence but differs in surface form. The positive sample pair refers to a pairing of samples consisting of the original sequence and an augmented sequence of the same gene, or of two different augmented sequences of the same gene, used to pull different representations of the same gene closer together in contrastive learning.
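Two of the augmentation operations named in the summary, structured field reordering and functional entity masking, can be sketched as follows. The sketch operates on a list of `(marker name, text)` blocks rather than on raw text, masks individual words instead of biological phrase fragments, and uses a hypothetical `[MASK]` string, core-field set, and masking rate; all of these are simplifying assumptions.

```python
import random

CORE_FIELDS = ("GENE", "FUNC")  # hypothetical set of core fields kept in front


def reorder_fields(blocks, rng):
    """Shuffle non-core attribute blocks while keeping core fields first."""
    core = [b for b in blocks if b[0] in CORE_FIELDS]
    rest = [b for b in blocks if b[0] not in CORE_FIELDS]
    rng.shuffle(rest)
    return core + rest


def mask_words(blocks, rng, rate=0.15, mask="[MASK]"):
    """Randomly replace words with a mask token, leaving the gene symbol intact."""
    out = []
    for name, text in blocks:
        if name == "GENE":
            out.append((name, text))
            continue
        words = [mask if rng.random() < rate else w for w in text.split()]
        out.append((name, " ".join(words)))
    return out


def positive_pair(blocks, seed):
    """Build two augmented views of the same gene as one positive pair."""
    rng = random.Random(seed)
    view_a = reorder_fields(blocks, rng)
    view_b = mask_words(reorder_fields(blocks, rng), rng)
    return view_a, view_b


blocks = [
    ("GENE", "HK1"),
    ("FUNC", "hexokinase activity"),
    ("PATH", "Glycolysis"),
    ("TISSUE", "brain"),
    ("DISEASE", "anemia"),
]
view_a, view_b = positive_pair(blocks, seed=7)
```

Seeding the generator per call keeps the augmentation reproducible, which helps when debugging a training run.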
[0054] The progressive sampling strategy refers to a sampling mechanism that dynamically adjusts the selection method of negative samples as the training process progresses. Negative samples are gene samples that are different from the anchor gene, used to push the representations of different genes further apart during contrastive learning.
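The three stages listed in the summary (in-batch random sampling, hard negative mining by cosine similarity, and memory bank sampling) could be switched by epoch as in this sketch. The stage boundaries (`early_end=8`, `mid_end=16`), the similarity threshold, the number of negatives `k`, and the dictionary-based batch layout are hypothetical, and the non-homology check is omitted.

```python
import math
import random


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def select_negatives(anchor, batch, epoch, memory_bank, rng, *,
                     k=2, early_end=8, mid_end=16, threshold=0.5):
    """batch and memory_bank map gene id -> embedding; anchor is a gene id in batch."""
    others = [g for g in batch if g != anchor]
    if epoch < early_end:
        # Early rounds: random sampling within the current batch.
        return rng.sample(others, min(k, len(others)))
    if epoch < mid_end:
        # Middle rounds: most similar in-batch genes above the threshold.
        scored = sorted(((cosine(batch[anchor], batch[g]), g) for g in others), reverse=True)
        return [g for s, g in scored if s > threshold][:k]
    # Later rounds: sample from stored historical embeddings.
    pool = [g for g in memory_bank if g != anchor]
    return rng.sample(pool, min(k, len(pool)))


batch = {"A": [1.0, 0.0], "B": [0.9, 0.1], "C": [0.0, 1.0], "D": [-1.0, 0.0]}
bank = {"A": [1.0, 0.0], "X": [0.5, 0.5], "Y": [0.0, 1.0], "Z": [1.0, 1.0]}
rng = random.Random(0)
early_negs = select_negatives("A", batch, 0, bank, rng)
hard_negs = select_negatives("A", batch, 10, bank, rng)
late_negs = select_negatives("A", batch, 20, bank, rng)
```

In the middle stage only `B` is returned for anchor `A`, because it is the only in-batch gene whose cosine similarity to `A` exceeds the threshold.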
[0055] The training batch refers to the set of samples input into the model in a single parameter update. After the training batch is input into the gene function embedding model, the long context encoder first encodes the structured text sequence of each sample and outputs the hidden state sequence; then the attribute-aware attention pooling layer aggregates the hidden state sequence and outputs the gene representation vector of each sample.
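The aggregation step can be illustrated with a single-head, pure-Python sketch in which the first hidden state plays the role of the global classification token and acts as the query. How the special markers enter the attention computation is not detailed in the text; adding a fixed bias to the scores at marker positions is an assumption made here, as are the function name `attention_pool` and the omission of learned projection matrices and multiple heads.

```python
import math


def attention_pool(hidden, marker_positions, marker_bias=1.0):
    """Pool a hidden state sequence into one vector with softmax attention.

    hidden: list of equal-length vectors; hidden[0] is treated as the global
    classification token and used as the query. Positions listed in
    marker_positions receive an additive score bias (an assumed mechanism).
    """
    query = hidden[0]
    dim = len(query)
    scale = math.sqrt(dim)
    scores = []
    for i, h in enumerate(hidden):
        score = sum(q * k for q, k in zip(query, h)) / scale
        if i in marker_positions:
            score += marker_bias
        scores.append(score)
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return pooled, weights


hidden = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
pooled, weights = attention_pool(hidden, marker_positions={2})
```

Positions 1 and 2 hold identical hidden states, so the higher weight at position 2 comes only from the marker bias.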
[0056] The composite loss function refers to a training objective function formed by a weighted combination of multiple loss terms, used to constrain the geometric properties of the representation space from several optimization objectives at once. The loss value of the current batch is calculated based on the composite loss function, and the gradient of the model parameters is calculated using the backpropagation algorithm. The parameters of the gene function embedding model are then updated using the optimizer.
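A pure-Python sketch of such a weighted combination is given below, using an InfoNCE-style contrastive term with in-batch negatives plus variance and covariance regularizers. Several points are assumptions rather than details from the source: the temperature and the weights `w_nce`, `w_var`, `w_cov` are hypothetical, and the variance term follows the common formulation in which the hinge penalizes dimensions whose standard deviation falls below a target, since the text does not pin down the exact form of the constraint.

```python
import math


def _unit(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Contrastive term: positives[i] matches anchors[i]; other rows act as negatives."""
    a = [_unit(v) for v in anchors]
    p = [_unit(v) for v in positives]
    loss = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        top = max(logits)
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(a)


def variance_covariance(z, target_std=1.0, eps=1e-4):
    """Return (hinge variance term, off-diagonal covariance term) for a batch z."""
    n, d = len(z), len(z[0])
    mean = [sum(row[j] for row in z) / n for j in range(d)]
    c = [[row[j] - mean[j] for j in range(d)] for row in z]
    cov = [[sum(r[i] * r[j] for r in c) / (n - 1) for j in range(d)] for i in range(d)]
    var_term = sum(max(0.0, target_std - math.sqrt(cov[j][j] + eps)) for j in range(d)) / d
    cov_term = sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d
    return var_term, cov_term


def composite_loss(anchors, positives, w_nce=1.0, w_var=1.0, w_cov=0.04):
    """Weighted sum of the three terms (weights are hypothetical)."""
    var_term, cov_term = variance_covariance(anchors + positives)
    return w_nce * info_nce(anchors, positives) + w_var * var_term + w_cov * cov_term


anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(anchors, [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce(anchors, [[0.0, 1.0], [1.0, 0.0]])
total = composite_loss(anchors, [[0.9, 0.1], [0.1, 0.9]])
```

The contrastive term is close to zero when each anchor is aligned with its own positive (`aligned`) and large when the positives are swapped (`swapped`).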
[0057] In a preferred embodiment, the process of updating the model parameters employs an exponential moving average mechanism. The exponential moving average mechanism refers to a technique that uses an exponentially weighted average of historical parameter values to smooth the current parameter update. Specifically, the server maintains a set of shadow parameters. After each parameter update, the current model parameters and shadow parameters are weighted and fused according to a preset momentum coefficient to obtain the updated shadow parameters. This exponential moving average mechanism can combat random noise introduced during negative sampling, achieving robust convergence of model training.
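The shadow parameter update described here amounts to one line per parameter. The sketch below uses plain floats in a dictionary in place of model tensors; the function name `ema_update` and the momentum value are hypothetical.

```python
def ema_update(shadow, params, momentum=0.999):
    """Fuse current parameters into the shadow parameters.

    shadow and params map parameter names to floats (a stand-in for tensors).
    A higher momentum keeps more of the history and smooths more strongly.
    """
    return {
        name: momentum * shadow[name] + (1.0 - momentum) * params[name]
        for name in params
    }


shadow = {"w": 1.0, "b": 0.0}
for step_params in ({"w": 0.0, "b": 1.0}, {"w": 0.0, "b": 1.0}):
    shadow = ema_update(shadow, step_params, momentum=0.9)
```

After two updates with momentum 0.9, `shadow["w"]` has decayed from 1.0 to about 0.81, while the live parameter jumped straight to 0.0.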
[0058] In a specific training configuration, iterative contrastive learning training is performed for 24 training epochs, and the optimal validation set performance is obtained at the model checkpoint saved in the 16th training epoch. The server selects the model parameters corresponding to this optimal checkpoint as the final parameters of the trained gene functional embedding model.
[0059] Figure 3 is a UMAP visualization of the gene embeddings in an embodiment of the present invention. As shown in Figure 3, the trained gene function embedding model can generate gene representations with good semantic discriminability. Genes belonging to the same functional category form clearly distinguishable clusters in the embedding space, while genes of different functional categories maintain significant spatial separation.
[0060] Figure 4 is a visualization of the glycolysis pathway genes in an embodiment of the present invention. Figure 5 is a box plot showing the similarity distribution between pathway genes and random genes in an embodiment of the present invention. As shown in Figure 4 and Figure 5, in the analysis experiment targeting the glycolysis pathway in the KEGG database, the embedding representations generated by the gene function embedding model cause the 67 genes within the pathway to form a highly similar cluster, with a discrimination of approximately 0.3 relative to 60 random background genes. This discrimination is significantly higher than that of existing baseline models, indicating that the gene function embedding model can accurately capture fine functional semantic information.
[0061] The embedding representation generated by the gene function embedding model enables genes within the same biological pathway to form highly similar, tightly clustered groups, maintaining significant distinguishability from random background genes.
[0062] The gene function characterization method provided by the embodiments of the present invention can make full use of multi-source, fine-grained, and structured biomedical knowledge to generate gene function embedding representations that are rich in information and highly discriminative, providing high-quality gene feature support for downstream bioinformatics tasks such as gene association prediction and disease gene discovery.
[0063] Optionally, the step of using a generative large language model to perform targeted knowledge completion on missing attributes in the multi-source gene attribute data includes: constructing a constraint prompt instruction, wherein the constraint prompt instruction is used to configure the generative large language model to preferentially cite factual data from authoritative sources and to set the generation temperature parameter to a preset low determinism threshold; obtaining the candidate completion content output by the generative large language model based on the constraint prompt instruction; if the candidate completion content is detected to contain cognitive uncertainty words, discarding the candidate completion content, wherein the cognitive uncertainty words include at least terms that express speculation or terms that indicate that a meaning is not yet clear; and if no cognitive uncertainty words are detected, retaining the candidate completion content as the completion result.
[0064] In this application, the constraint prompt instruction refers to guiding text used to configure the generation behavior of the generative large language model. The purpose of constructing the constraint prompt instruction is to improve the factual accuracy and certainty of the generated content and to reduce the generation of speculative or unreliable content.
[0065] The constraint prompts are used to configure the generative large language model to prioritize the use of factual data from authoritative sources.
[0066] Authoritative sources include, but are not limited to, peer-reviewed academic literature, officially maintained biomedical databases, and recognized knowledge resources in the field.
[0067] By explicitly requiring the model, in the constraint prompt instruction, to answer on the basis of existing accepted facts, the probability of the model generating fabricated content can be reduced.
[0068] The constraint prompt also sets the generation temperature parameter to a preset low determinism threshold. The generation temperature parameter is a hyperparameter that controls the degree of randomness in the output of the generative large language model. A lower temperature value makes the model more inclined to select the output word with the highest probability, thereby improving the determinism and consistency of the output. The preset low determinism threshold can be set to a value in the range of 0.1 to 0.3.
[0069] For attribute fields with missing values in the multi-source gene attribute data, the server constructs query content containing gene identification information and missing attribute types. The query content is combined with the constraint prompt instruction and then input into the generative large language model to obtain the candidate completion content returned by the generative large language model.
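As a hedged illustration of how the query content and the constraint prompt instruction might be combined, the sketch below assembles a request as a plain dictionary. The function name, the dictionary keys, and the prompt wording are hypothetical and do not correspond to any particular model API; only the 0.1 to 0.3 temperature range comes from the text above.

```python
def build_completion_request(gene_symbol, missing_attribute, temperature=0.2):
    """Combine query content with a constraint prompt (hypothetical layout)."""
    if not 0.1 <= temperature <= 0.3:
        raise ValueError("temperature outside the 0.1 to 0.3 range described above")
    # Constraint prompt: wording is illustrative, not taken from the source.
    constraint_prompt = (
        "Answer only from established facts in peer-reviewed literature and "
        "officially maintained biomedical databases. Do not speculate."
    )
    # Query content: gene identification information plus the missing attribute type.
    query = (
        f"Gene: {gene_symbol}\n"
        f"Missing attribute: {missing_attribute}\n"
        "Provide the attribute value."
    )
    return {"system": constraint_prompt, "user": query, "temperature": temperature}


request = build_completion_request("TP53", "tissue expression")
```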
[0070] In this application, cognitive uncertainty vocabulary refers to linguistic markers in text that express the speaker's lack of certainty about the content of a statement.
[0071] Cognitive uncertainty words include at least terms that express speculation and terms that indicate that a meaning is not yet clearly defined. Terms that express speculation include, but are not limited to, words such as "possible", "speculate", "infer", and "seem".
[0072] Terms that indicate that a meaning is not yet clearly defined include, but are not limited to, expressions such as "not yet clearly defined", "function unknown", and "requiring further research".
[0073] The server performs a lexical matching test on the candidate completion content to determine whether it contains any word from a pre-built blacklist of words indicating cognitive uncertainty. If a match is detected, the candidate completion content is deemed unreliable and discarded from the completion result.
[0074] For candidate completions for which no cognitive uncertainty words are detected, the server determines them as valid completion results and fills the completion results into the missing attribute field of the corresponding gene record.
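The lexical matching test and the fill step can be sketched as follows. The blacklist below contains the example terms from the description plus a few inflected forms added as an assumption; the function names and the record layout are hypothetical.

```python
import re

# Example terms from the description, plus inflected forms (an assumption).
UNCERTAINTY_TERMS = (
    "possible", "possibly", "speculate", "infer", "seem", "seems",
    "not yet clearly defined", "function unknown", "requiring further research",
)
_PATTERN = re.compile(
    "|".join(r"\b" + re.escape(term) + r"\b" for term in UNCERTAINTY_TERMS),
    re.IGNORECASE,
)


def filter_completion(candidate):
    """Return the candidate text if no blacklisted term matches, else None."""
    return None if _PATTERN.search(candidate) else candidate


def fill_missing(record, field, candidate):
    """Fill an empty attribute field only when the candidate passes the filter."""
    kept = filter_completion(candidate)
    if kept is not None and not record.get(field):
        return {**record, field: kept}
    return record
```

Whole-word matching is used so that a blacklisted term does not fire inside an unrelated longer word; a production blacklist would likely need broader coverage of inflections.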
[0075] In a more stringent implementation, the targeted knowledge completion process also includes a manual auditing step.
[0076] The manual auditing step includes: randomly sampling from the completed results to obtain sampled completed content; having domain experts manually review the sampled completed content to determine its factual accuracy; cross-referencing the sampled completed content with external biomedical literature for verification; and removing completed content that fails the review from the gene knowledge graph or marking it as pending review. This manual auditing step serves as the final line of quality control, further ensuring the factual reliability of the completed content included in the gene knowledge graph.
[0077] By combining the above-mentioned constraint prompts and uncertainty word filtering mechanisms, the quality of the output content of the generative large language model can be effectively controlled, ensuring that the supplementary content included in the gene knowledge graph has high factual reliability and avoiding the introduction of speculative or inaccurate information that could negatively impact subsequent model training.
[0078] Optionally, the step of converting the gene knowledge graph into a structured text sequence using an attribute-aware template containing attribute category-specific markers includes: defining special markers for the various attribute types, filling the attribute values of each gene record in the gene knowledge graph into the corresponding special markers to form a semi-structured natural language sequence, and using the special markers as structural prior information in subsequent encoding processing; wherein the special markers include at least a pathway marker for encapsulating pathway information, a function marker for encapsulating function descriptions, and a tissue marker for encapsulating tissue expression information.
[0079] In this application, special markers refer to predefined tokens used in the attribute-aware template to identify the start and end positions of a specific attribute type.
[0080] The server defines corresponding special markers for each attribute type involved in the gene knowledge graph.
[0081] The special markers include at least pathway markers for encapsulating pathway information, function markers for encapsulating function descriptions, and tissue markers for encapsulating tissue expression information.
[0082] Pathway markers identify the start and end boundaries of information on the signaling pathways in which a gene participates. Function markers identify the start and end boundaries of information describing the molecular function of a gene. Tissue markers identify the start and end boundaries of information on the expression of a gene in various tissues.
[0083] In one specific implementation, each special marker takes the form of an uppercase English word enclosed in square brackets.
[0084] Specifically, the pathway marker is defined as [PATHWAY], the functional marker is defined as [FUNCTION], and the tissue marker is defined as [TISSUE]. In addition to the above three markers, the special markers also include: gene markers [GENE] for encapsulating gene symbols, descriptive markers [DESC] for encapsulating gene descriptions, summary markers [SUMMARY] for encapsulating functional summaries, molecular functional markers [FUNC] for encapsulating molecular functions, process markers [PROCESS] for encapsulating biological processes, component markers [COMPONENT] for encapsulating cellular components, disease markers [DISEASE] for encapsulating disease associations, location markers [LOCATION] for encapsulating protein localization, and prognostic markers [PROGNOSIS] for encapsulating prognostic information.
[0085] The server iterates through each gene record in the gene knowledge graph and, according to the predefined attribute priority order, fills the values of each attribute field into the position range defined by the corresponding special marker.
[0086] After the filling is completed, the attribute fragments are spliced together in sequence to form the semi-structured natural language sequence.
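The filling and splicing steps above can be sketched as follows. The field names, the priority order, and the convention that each marker opens a segment that runs until the next marker are assumptions for illustration; the marker strings themselves are taken from the description.

```python
# (field name, marker) pairs in an assumed priority order; field names are hypothetical.
ATTRIBUTE_ORDER = [
    ("symbol", "[GENE]"),
    ("description", "[DESC]"),
    ("summary", "[SUMMARY]"),
    ("molecular_function", "[FUNC]"),
    ("pathways", "[PATHWAY]"),
    ("tissue_expression", "[TISSUE]"),
    ("diseases", "[DISEASE]"),
]


def render_gene_record(record):
    """Fill attribute values into their markers and splice the segments in order."""
    segments = []
    for field, marker in ATTRIBUTE_ORDER:
        value = record.get(field)
        if not value:
            continue  # skip attributes that are absent from this record
        if isinstance(value, (list, tuple)):
            value = "; ".join(value)
        segments.append(f"{marker} {value}")
    return " ".join(segments)


text = render_gene_record({"symbol": "HK1", "pathways": ["Glycolysis", "Pathway B"]})
```

Here `text` is a single string in which every attribute segment starts with its marker, so a downstream encoder can locate attribute boundaries by marker position.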
[0087] Semi-structured natural language sequences maintain the independence and integrity of each attribute field while presenting themselves in the form of natural language, making them easier for subsequent language models to process.
[0088] In the subsequent encoding process, the special markers are used as structural prior information. Structural prior information refers to the pre-defined knowledge provided to the model regarding the inherent structural features of the input data.
[0089] When encoding the structured text sequence, the long context encoder can identify the position of the special markers, thereby perceiving the distribution of different attribute content in the sequence.
[0090] The attribute-aware attention pooling layer can use the special markers as semantic anchors to achieve differentiated attention to different attribute regions.
[0091] By introducing attribute category-specific markers, key structural information can be preserved when tabular data is converted into natural language sequences. This enables the subsequent encoding model to explicitly perceive the boundaries and semantic attribution of different attribute types, thereby improving its ability to model heterogeneous attribute information.
[0092] Optionally, the step of constructing positive sample pairs using data augmentation strategies includes: performing at least one of the following enhancement operations on the structured text sequence to generate an enhanced text sequence: contextual synonym fusion, in which key terms in the structured text sequence are replaced with biomedical synonyms; functional entity masking, in which biological phrase fragments in the structured text sequence are randomly masked; structured field reordering, in which the order of non-core attribute blocks is shuffled while the priority of core fields is maintained; functional abstraction, in which the functional description portion of the structured text sequence is randomly simplified; and pathway representation variation, in which the pathway list in the structured text sequence undergoes controlled reordering or selective pruning; wherein the positive sample pair is formed by combining the original structured text sequence with the enhanced text sequence, or by combining two different enhanced text sequences generated for the same gene.
[0093] In this application, contextual synonym fusion refers to an enhancement operation that replaces key terms in the structured text sequence with biomedical synonyms.
[0094] Biomedical synonyms refer to pairs of terms that have the same or similar meanings in the biomedical field. The server pre-builds a biomedical synonym dictionary, and during the enhancement operation, it identifies key biomedical terms in the structured text sequence and replaces them with their corresponding synonyms with a preset probability.
[0095] Functional entity masking refers to an enhancement operation that randomly masks biological phrase fragments in the structured text sequence.
[0096] Biological phrase fragments include, but are not limited to, proper noun phrases such as gene names, protein names, and pathway names.
[0097] The server selects biological phrase fragments from the structured text sequence with a preset probability and replaces them with special mask markers, thereby encouraging the model to learn the ability to infer the masked content based on context during training.
[0098] Structured field reordering refers to an enhancement operation that shuffles the order of non-core attribute blocks while maintaining the priority of core fields.
[0099] Core fields include attributes such as gene identifiers and gene names that must remain in a fixed position. Non-core attribute blocks include various annotative descriptive fields. Randomly rearranging these non-core attribute blocks improves the model's robustness to changes in attribute order.
[0100] Functional abstraction refers to an enhancement operation that randomly simplifies the functional description portion of the structured text sequence.
[0101] Random simplification includes processes such as removing descriptive words and reducing redundant descriptions. Through functional abstraction operations, it is possible to generate enhanced sequences with different information densities but consistent core semantics.
[0102] Pathway representation variation refers to an enhancement operation that involves controlled reordering or selective pruning of the pathway list in the structured text sequence.
[0103] Controlled reordering refers to randomly shuffling the order of pathway items in the pathway list. Selective pruning refers to deleting some pathway items from the pathway list with a preset probability. This enhancement operation can simulate the noise and incompleteness of pathway annotations in a real database.
[0104] After the enhancement operation is completed, the server combines the original structured text sequence with the enhanced text sequence, or combines two different enhanced text sequences generated for the same gene, to form the positive sample pair. The two samples in the positive sample pair originate from the same gene, have consistent semantic connotations but differ in expression.
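Two of the five enhancement operations, structured field reordering and pathway representation variation, can be sketched as below to show how a positive sample pair might be produced. The data layout (a list of `(marker, value)` blocks), the choice of core markers, and the drop probability are assumptions; synonym fusion, entity masking, and functional abstraction are omitted.

```python
import random

CORE_MARKERS = ("[GENE]", "[DESC]")  # assumed core fields kept in fixed position


def reorder_fields(blocks, rng):
    """Keep core blocks first and in order; shuffle the non-core blocks."""
    core = [b for b in blocks if b[0] in CORE_MARKERS]
    rest = [b for b in blocks if b[0] not in CORE_MARKERS]
    rng.shuffle(rest)
    return core + rest


def vary_pathways(pathways, rng, drop_prob=0.2):
    """Drop each pathway with probability drop_prob, then shuffle the remainder."""
    kept = [p for p in pathways if rng.random() >= drop_prob]
    if not kept and pathways:
        kept = [rng.choice(pathways)]  # never return an empty pathway list
    rng.shuffle(kept)
    return kept


def make_positive_pair(blocks, rng):
    """Return two independently enhanced views of the same gene record."""
    def augment():
        view = []
        for marker, value in reorder_fields(blocks, rng):
            if marker == "[PATHWAY]":
                value = vary_pathways(value, rng)
            view.append((marker, value))
        return view
    return augment(), augment()
```

Both views come from the same gene, so they share core semantics while differing in field order and pathway coverage.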
[0105] In this application, a variety of data augmentation strategies are employed to provide rich positive sample pairs for contrastive learning, enabling the model to learn robust representations that are invariant to changes in surface form.
[0106] Optionally, the step of selecting negative samples using a progressive sampling strategy is dynamically adjusted according to the current training round and includes the following stages, executed in the order of training rounds: a basic feature distribution establishment stage, in which, in the early rounds of training, random sampling is performed within the current training batch to obtain the negative samples; a difficult negative sample mining stage, in which, in the middle rounds of training, the cosine similarity between the anchor gene and other genes in the current training batch is calculated, and non-homologous genes with a cosine similarity higher than a preset threshold are selected as negative samples; and a memory bank sampling stage, in which, in the later rounds of training, a memory bank storing historical gene embedding representations is maintained, and negative samples are sampled from the memory bank.
[0107] In this application, during the early training rounds, the server performs random sampling within the current training batch to obtain the negative samples. The early training rounds refer to the initial several rounds after the training process begins.
[0108] At this stage, the model has not yet learned effective feature representations, and the structure of the representation space is still unstable. Using intra-batch random sampling enables the model to establish a basic feature distribution and learn the initial ability to distinguish different genes.
[0109] During the intermediate rounds of training, the server calculates the cosine similarity between the anchor gene and other genes in the current training batch, and selects non-homologous genes with a cosine similarity higher than a preset threshold as the negative samples.
[0110] The anchor gene refers to the reference gene sample used to calculate the loss. Cosine similarity is a metric that measures how closely the directions of two vectors align. A non-homologous gene refers to a gene that does not belong to the same gene family as the anchor gene.
[0111] At this stage, the model has a preliminary representation ability. Selecting samples that are similar to the anchor gene representation but actually belong to different genes as negative samples can help the model learn more refined distinction boundaries and sharpen its decision-making ability.
[0112] In later training rounds, the server maintains a memory bank storing historical gene embedding representations and samples the negative samples from this memory bank. The memory bank refers to a cache structure used to store gene embedding representations generated during previous training. The capacity of the memory bank can be set to a preset value.
[0113] Introducing a memory-based sampling mechanism at this stage decouples the number of negative samples from the size of the current training batch, significantly expanding the diversity of available negative samples and further improving the uniformity of the representation space.
[0114] By employing a progressive sampling strategy, the difficulty of selecting negative samples can be dynamically adjusted according to the learning status of the model at different training stages, achieving a learning effect from easy to difficult, and improving the training efficiency and final representation quality of contrastive learning.
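The three stages can be sketched as a single selection function that switches on the training round. The stage boundaries (`early_end`, `middle_end`), the similarity threshold, and the `(gene_id, family, vector)` tuple layout are assumptions for illustration; the source only states that the stages follow the order of training rounds.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_negatives(epoch, anchor, batch, memory_bank, k, rng,
                     early_end=5, middle_end=15, threshold=0.5):
    """Pick up to k negative gene ids; items are (gene_id, family, vector)."""
    others = [g for g in batch if g[0] != anchor[0]]
    if epoch < early_end:
        # Stage 1: random sampling within the current batch.
        pool = others
    elif epoch < middle_end:
        # Stage 2: similar-looking genes from a different gene family.
        pool = [g for g in others
                if g[1] != anchor[1] and cosine(anchor[2], g[2]) > threshold]
    else:
        # Stage 3: sample from the memory bank of historical embeddings.
        pool = [g for g in memory_bank if g[0] != anchor[0]]
    return [g[0] for g in rng.sample(pool, min(k, len(pool)))]
```

Because the memory bank is independent of the batch, the third stage can return more negatives than the batch contains.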
[0115] Optionally, the step of selecting negative samples using a progressive sampling strategy further includes: after identifying candidate negative samples, searching external biological databases to obtain information on the functional association between the candidate negative samples and the anchor gene; determining whether the candidate negative sample and the anchor gene meet the functional association conditions, wherein the functional association conditions include: the candidate negative sample and the anchor gene belong to the same KEGG pathway, or the candidate negative sample and the anchor gene have an interaction score higher than a preset score in a protein interaction database; and if the functional association conditions are met, excluding the candidate negative sample from the negative sample sampling pool.
[0116] In this application, candidate negative samples refer to negative samples that have been preliminarily selected according to the sampling rules at the current stage and have not yet undergone functional association verification. The external biological database refers to publicly available data resources that store gene functional association information. Functional association information refers to information describing whether a functional relationship exists between two genes.
[0117] For each pair of candidate negative samples and anchor genes, the server initiates a query request to the external biological database to obtain the functional association information between the two.
[0118] In this application, the functional association conditions include: the candidate negative sample and the anchor gene belong to the same KEGG pathway, or the candidate negative sample and the anchor gene have an interaction score higher than a preset score in a protein interaction database.
[0119] The KEGG pathway refers to a biological pathway as defined in the Kyoto Encyclopedia of Genes and Genomes database. If two genes participate in the same KEGG pathway, it indicates a functional relationship between them.
[0120] Protein-protein interaction databases include databases such as STRING. Interaction scores are numerical indicators that quantify the likelihood of interaction between two proteins. If the interaction score between proteins encoded by two genes is higher than a preset score, it indicates that there is a functional association between them at the protein level.
[0121] If the candidate negative sample and the anchor gene are determined to satisfy any of the above functional association conditions, the server will remove the candidate negative sample from the current negative sample sampling pool and will not use it as the final negative sample in the loss calculation.
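The exclusion rule can be sketched with in-memory lookups standing in for the external database queries. The dictionary layouts, the function name, and the score cutoff of 0.7 are assumptions for illustration; the source only specifies a preset score.

```python
def remove_false_negatives(anchor, candidates, pathways, ppi_scores, min_score=0.7):
    """Drop candidates that are functionally associated with the anchor gene.

    pathways:   dict gene -> set of pathway ids (stand-in for a KEGG lookup)
    ppi_scores: dict frozenset({gene_a, gene_b}) -> interaction score
                (stand-in for a protein interaction database lookup)
    """
    kept = []
    for gene in candidates:
        shares_pathway = bool(pathways.get(anchor, set()) & pathways.get(gene, set()))
        score = ppi_scores.get(frozenset((anchor, gene)), 0.0)
        if shares_pathway or score > min_score:
            continue  # functionally associated: exclude from the sampling pool
        kept.append(gene)
    return kept
```

Keying the scores by `frozenset` makes the lookup independent of the order in which the two genes are given.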
[0122] In this application, the pseudo-negative sample elimination mechanism can avoid misclassifying functionally related genes as negative samples, ensuring that negative samples in contrastive learning truly represent functionally unrelated genes, thereby improving the biological rationality of the learned representation space.
[0123] Optionally, the long context encoder adopts a Transformer architecture based on sliding window attention; During the process of updating the model parameters, the bottom preset number of Transformer layers of the long context encoder are frozen, and only the top preset number of Transformer layers and the embedding head are fine-tuned.
[0124] In this application, sliding window attention is an efficient attention computation mechanism in which each token interacts only with neighboring tokens within its local window, rather than interacting globally with all tokens in the sequence. This design reduces the computational complexity of attention from quadratic to linear in the sequence length, enabling the model to efficiently handle input sequences of thousands of tokens.
[0125] The Transformer architecture consists of multiple stacked Transformer layers, each containing a self-attention sublayer and a feedforward neural network sublayer. Employing a sliding window attention-based Transformer architecture as the long context encoder allows for maintaining the ability to model the context of long texts while keeping computational resource consumption within acceptable limits.
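The difference between local and global attention cost can be made concrete by counting the query-key pairs that are actually scored. This is a minimal sketch; the function names and the symmetric window convention (half the window on each side) are assumptions.

```python
def sliding_window_pairs(seq_len, window):
    """List the (query, key) index pairs allowed under a symmetric local window."""
    half = window // 2
    return [
        (i, j)
        for i in range(seq_len)
        for j in range(max(0, i - half), min(seq_len, i + half + 1))
    ]


def attention_pair_counts(seq_len, window):
    """Return (pairs under sliding window attention, pairs under full attention)."""
    return len(sliding_window_pairs(seq_len, window)), seq_len * seq_len


local_pairs, full_pairs = attention_pair_counts(4096, 512)
```

For a fixed window, the local count grows in proportion to the sequence length, whereas the full count grows with its square.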
[0126] During the process of updating the model parameters, the server freezes the bottom 11 Transformer layers of the long context encoder and only performs parameter fine-tuning on the top 1 Transformer layer and the embedding head.
[0127] The freezing refers to keeping the parameters of a specified layer unchanged during training. The bottom 11 Transformer layers refer to the 1st to 11th Transformer layers closest to the input layer in the Longformer-base-4096 model. The top 1 Transformer layer refers to the 12th Transformer layer closest to the output layer. The embedding head includes the attribute-aware attention pooling layer and the learnable parameters in the projection head.
[0128] With the above parameter freezing strategy, the number of trainable parameters accounts for only about 8% to 10% of the total number of parameters in the gene functional embedding model. This parameter-efficient training method can retain the general language knowledge learned by the pre-trained model on large-scale corpora, avoid overfitting when training on relatively small-scale gene datasets, and significantly reduce the computational resources and time costs required for training.
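The freezing rule can be sketched as a predicate over parameter names. The naming pattern `encoder.layer.<index>.` with zero-based indices and the prefixes `pooling.` and `projection.` for the embedding head are assumptions about how the parameters are named, not details given in the source.

```python
import re

_LAYER = re.compile(r"encoder\.layer\.(\d+)\.")


def is_trainable(param_name, num_layers=12, num_trainable_top=1):
    """Decide whether a parameter is fine-tuned (True) or frozen (False).

    Layer indices are assumed zero-based, so index 11 is the 12th (top) layer.
    """
    match = _LAYER.search(param_name)
    if match:
        return int(match.group(1)) >= num_layers - num_trainable_top
    # Outside the Transformer layers, only the embedding head is trained
    # (assumed prefixes); input embeddings and everything else stay frozen.
    return param_name.startswith(("pooling.", "projection."))
```

In a PyTorch setting, such a predicate could be applied by setting `param.requires_grad = is_trainable(name)` for each pair yielded by `model.named_parameters()`.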
[0129] Optionally, the attribute-aware attention pooling layer performs the following processing: using the embedding vector of the global classification label as the query vector, using the hidden state sequence output by the long context encoder as the key vector and value vector, and using the special markers in the structured text sequence as semantic anchors to calculate attention weights; and performing weighted aggregation of the value vector based on the attention weights to generate the gene representation vector; wherein the attribute-aware attention pooling layer is configured with multiple attention heads to focus on the semantic regions corresponding to different special markers.
[0130] In this application, the global classification label refers to a special label added at the beginning of the input sequence, and its corresponding embedding vector is used to carry the global semantic information of the entire sequence. The query vector, key vector, and value vector are the three core elements in the attention mechanism computation.
[0131] The attribute-aware attention pooling layer extracts the embedding vector corresponding to the global classification label from the output of the long context encoder and uses it as the query vector.
[0132] Meanwhile, the complete hidden state sequence output by the long context encoder is used as the key vector and value vector, respectively.
[0133] In this application, semantic anchors refer to specific location markers that serve as reference points during the attention weight calculation process.
[0134] The attribute category-specific markers in the structured text sequence identify the starting position of each attribute region in the sequence, providing explicit semantic structural information for the attention mechanism.
[0135] The attribute-aware attention pooling layer calculates the attention score based on the dot product of the query vector and the key vector, and normalizes the attention score into attention weights through the softmax function.
[0136] During this process, the hidden state of the location of the special marker and its neighboring area can receive differentiated attention from the attention mechanism.
[0137] The attribute-aware attention pooling layer uses the calculated attention weights to perform a weighted summation on the value vector sequence, aggregating the variable-length hidden state sequence into the fixed-dimensional gene representation vector.
[0138] The gene representation vector integrates the semantic information of each attribute region in the sequence, where the contribution of each region is determined by the attention weight.
[0139] The attribute-aware attention pooling layer is configured with multiple attention heads to focus on the semantic regions corresponding to different special markers.
[0140] Multiple attention heads can learn attention patterns in parallel from different representation subspaces. Different attention heads may focus on different attribute semantic regions, such as pathway information regions, functional description regions, and tissue expression regions.
[0141] In this application, through the design of a multi-head attention mechanism, the attribute-aware attention pooling layer can aggregate heterogeneous attribute information dynamically and in a content-aware manner to generate a unified, information-rich gene representation vector.
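The pooling computation (query from the global classification label, keys and values from the hidden states, softmax weights, weighted sum, multiple heads) can be sketched in plain Python. This simplified version splits the hidden dimension into heads and omits the learned projection matrices and any marker-specific weighting, which are assumptions about what a full implementation would add.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden, num_heads):
    """Pool token vectors into one vector; hidden[0] is the classification label.

    hidden: list of token vectors whose length is divisible by num_heads.
    """
    head_dim = len(hidden[0]) // num_heads
    pooled = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        query = hidden[0][lo:hi]
        keys = [tok[lo:hi] for tok in hidden]  # keys double as values here
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(head_dim)
            for key in keys
        ]
        weights = softmax(scores)
        pooled.extend(
            sum(w * key[d] for w, key in zip(weights, keys))
            for d in range(head_dim)
        )
    return pooled
```

The output has the same dimension as a single hidden state regardless of sequence length, which is how a variable-length sequence becomes a fixed-dimensional gene representation vector.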
[0142] Optionally, the composite loss function includes: a noise contrast estimation loss, used to bring the positive sample pairs closer together in the representation space and push the negative samples further apart in the representation space; a covariance loss, used to minimize the off-diagonal elements of the covariance matrix of the current training batch; and a variance loss, constructed based on the hinge loss function and used to keep the average variance of each dimension from falling below a preset target threshold; wherein the value of the composite loss function is obtained by weighted summation of the noise contrast estimation loss, the covariance loss, and the variance loss.
[0143] In this application, the noise contrast estimation loss is used to narrow the distance between the positive sample pairs in the representation space and widen the distance between the negative samples in the representation space. The noise contrast estimation loss is calculated as follows: for each anchor sample, its similarity with the corresponding positive sample is used as the numerator, the sum of its similarities with all negative samples forms part of the denominator, and the negative logarithm of the resulting ratio is taken.
[0144] By minimizing the noise contrast estimation loss, the model can learn parameter configurations that bring positive sample pairs closer together and negative sample pairs further apart, thereby improving the alignment and uniformity of the representation space.
[0145] The covariance loss is used to minimize the off-diagonal elements of the covariance matrix of the current training batch. The covariance matrix is a matrix describing the correlation between the dimensions of the embedding vectors. The off-diagonal elements reflect the covariance values between different dimensions.
[0146] By minimizing the covariance loss, feature decorrelation can be achieved, reducing redundancy between different dimensions of the embedding vector, and enabling each dimension to capture relatively independent semantic information.
[0147] In this application, the variance loss is constructed based on the hinge loss function and is used to keep the average variance of each dimension from falling below a preset target threshold. The hinge loss function is a form of loss function that penalizes values below the threshold and does not penalize values that reach or exceed the threshold.
[0148] By introducing the variance loss, dimensionality collapse of the representation space can be prevented, i.e., the embedding vectors of all samples are prevented from converging to the same value in some dimensions. The variance loss acts as an anti-collapse constraint, ensuring that the embedding vectors maintain sufficient variability across all dimensions.
[0149] The value of the composite loss function is obtained by weighted summation of the noise contrast estimation loss, the covariance loss, and the variance loss.
[0150] Specifically, the composite loss function can be expressed as: L = L_nce + λ1 * L_cov + λ2 * L_var; Where L_nce represents the noise contrast estimation loss, L_cov represents the covariance loss, L_var represents the variance loss, and λ1 and λ2 are the weighting coefficients of the covariance loss and variance loss, respectively.
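A minimal numerical sketch of the three terms and their weighted sum is given below. The exact forms are assumptions: the contrastive term uses exponentiated cosine similarity with a temperature, and the covariance and variance terms follow a common variance-covariance regularization pattern (hinge on the per-dimension standard deviation). The source specifies only the roles of the terms and the weighting L = L_nce + λ1 * L_cov + λ2 * L_var.

```python
import math


def _cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def info_nce(anchor, positive, negatives, temperature=0.1):
    """L_nce: negative log of the positive term over positive plus negative terms."""
    pos = math.exp(_cos(anchor, positive) / temperature)
    neg = sum(math.exp(_cos(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def covariance_matrix(batch):
    """Sample covariance across embedding dimensions for a batch of vectors."""
    n, d = len(batch), len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in batch]
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def covariance_loss(cov):
    """L_cov: mean squared off-diagonal covariance (feature decorrelation)."""
    d = len(cov)
    return sum(cov[i][j] ** 2 for i in range(d) for j in range(d) if i != j) / d


def variance_loss(cov, target=1.0, eps=1e-4):
    """L_var: hinge penalty when a dimension's spread falls below the target."""
    d = len(cov)
    return sum(max(0.0, target - math.sqrt(cov[i][i] + eps)) for i in range(d)) / d


def composite_loss(anchor, positive, negatives, batch, lam_cov=1.0, lam_var=1.0):
    """L = L_nce + lam_cov * L_cov + lam_var * L_var (weights are illustrative)."""
    cov = covariance_matrix(batch)
    return (info_nce(anchor, positive, negatives)
            + lam_cov * covariance_loss(cov)
            + lam_var * variance_loss(cov))
```

The hinge in `variance_loss` is zero once every dimension reaches the target spread, so the term only acts when the representation space starts to collapse.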
[0151] As shown in Table 1 below, the gene functional embedding model trained based on the composite loss function has achieved excellent performance on multiple downstream tasks.
[0152] In the gene-gene interaction (GGI) prediction task, the gene functional embedding model achieved an accuracy of 0.75, an F1 score of 0.76, and a ROC-AUC value of 0.82, comprehensively outperforming baseline models such as Gene2Vec, GenePT, and Gemma. In the protein-protein interaction (PPI) prediction task based on the HuRI dataset, the gene functional embedding model achieved an accuracy of 0.60, an F1 score of 0.57, and a ROC-AUC value of 0.65, comparable to the best-performing baseline model, and exhibiting more balanced performance in both F1 score and ROC-AUC value.
[0153] Table 1
[0154] As shown in Table 2, the gene function embedding model achieved an ROC-AUC value of 0.91 on the gene essentiality prediction task and an ROC-AUC value of 0.92 on the dosage sensitivity prediction task. Its performance is comparable or close to that of the best-performing general language model embedding method, which verifies its ability to effectively capture regulatory information from structured functional attributes.
[0155] Table 2
[0156] As shown in Table 3, the ablation experiment results show that replacing bottom-layer parameter freezing with full-parameter fine-tuning caused the accuracy on the GGI task to decrease from 0.748 to 0.731, and that replacing progressive negative sampling with static random sampling caused the ROC-AUC value to decrease from 0.821 to 0.810. These results verify the necessity and effectiveness of the design of each loss term in the composite loss function and of the selected training strategies.
[0157] Table 3
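As a non-limiting illustration of the progressive negative sampling that the ablation compares against static random sampling, the three stages (random in-batch sampling, hard negative mining by cosine similarity, and memory bank sampling) can be sketched as follows. The stage boundaries at one third and two thirds of training, the threshold of 0.5, and all names are hypothetical; the additional check that a hard negative is non-homologous to the anchor is omitted here.

```python
import math
import random

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def select_negatives(anchor, batch, epoch, total_epochs, memory_bank,
                     k=2, sim_threshold=0.5, rng=random):
    """Pick up to k negative gene ids for `anchor`.

    batch / memory_bank: dict mapping gene id -> embedding (list of floats).
    The thirds-based schedule is an assumption made for illustration.
    """
    candidates = [g for g in batch if g != anchor]
    if 3 * epoch < total_epochs:
        # Early rounds: uniform random sampling within the current batch.
        return rng.sample(candidates, min(k, len(candidates)))
    if 3 * epoch < 2 * total_epochs:
        # Middle rounds: keep genes whose similarity to the anchor exceeds
        # the threshold, most similar (hardest) first.
        scored = [(cosine(batch[anchor], batch[g]), g) for g in candidates]
        hard = sorted((t for t in scored if t[0] > sim_threshold),
                      key=lambda t: t[0], reverse=True)
        return [g for _, g in hard[:k]]
    # Late rounds: sample from stored historical embeddings.
    pool = [g for g in memory_bank if g != anchor]
    return rng.sample(pool, min(k, len(pool)))
```

Passing a seeded `random.Random` instance as `rng` makes the early and late stages reproducible, which is convenient when comparing sampling strategies.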
[0158] The gene function characterization device provided by the present invention is described below. The gene function characterization device described below and the gene function characterization method described above may be read with reference to each other.
[0159] Figure 6 is a schematic diagram of the gene function characterization device provided by the present invention. As shown in Figure 6, the device includes: an acquisition module 710, used to acquire multi-source gene attribute data covering the molecular function, signaling pathway, and tissue expression dimensions; a completion module 720, used to perform targeted knowledge completion on the missing attributes in the multi-source gene attribute data using a generative large language model to obtain a gene knowledge graph; a conversion module 730, used to convert the gene knowledge graph into a structured text sequence using an attribute-aware template containing special markers for attribute categories; and a training module 740, used to perform iterative contrastive learning training on a gene function embedding model based on the structured text sequence until a preset convergence condition is met, so as to obtain the trained gene function embedding model. Each round of contrastive learning training includes: constructing positive sample pairs using data augmentation strategies, selecting negative samples using a progressive sampling strategy, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function. The gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer, which are used to map the structured text sequence of a target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge.
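As a non-limiting illustration of one check the completion module may apply, candidate completion content that contains terms expressing speculation or lack of clarity can be discarded before it enters the gene knowledge graph. The term list, the function name, and the whole-word matching rule below are hypothetical choices made for this sketch.

```python
import re

# Hypothetical list of hedging terms; a real deployment would curate its own.
UNCERTAINTY_TERMS = (
    "may", "might", "possibly", "presumably", "putative",
    "speculated", "unclear", "unknown", "not yet clear",
)

# Whole-word, case-insensitive match on any listed term.
_UNCERTAIN = re.compile(
    r"\b(?:" + "|".join(re.escape(t) for t in UNCERTAINTY_TERMS) + r")\b",
    re.IGNORECASE,
)

def filter_completion(candidate):
    """Return the candidate text if no hedging term is found, else None."""
    return None if _UNCERTAIN.search(candidate) else candidate
```

For example, `filter_completion("GENE1 may be involved in pathway A.")` returns `None`, while a plain factual statement is returned unchanged. Whole-word matching avoids rejecting text merely because a listed term appears inside a longer word.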
[0160] In this application, by acquiring multi-source gene attribute data and using a generative large language model for targeted knowledge completion, fine-grained gene attribute information scattered across multiple authoritative databases is comprehensively integrated and its quality is enhanced. By using an attribute-aware template containing special markers for attribute categories to convert the gene knowledge graph into a structured text sequence, the structured attributes inherent in gene descriptions are explicitly preserved. Through iterative contrastive learning training that uses data augmentation strategies, a progressive negative sampling strategy, and a composite loss function, combined with the synergy between the long context encoder and the attribute-aware attention pooling layer in the gene function embedding model, the model can dynamically aggregate semantic information from different attribute fields. This effectively solves the technical problem that existing methods cannot fully utilize multi-source, fine-grained, and structured biomedical text knowledge, and enables the generated gene function embedding representation to accurately capture fine functional semantics.
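As a non-limiting illustration of how an attribute-aware template may preserve attribute structure, one gene record can be rendered into a single marked-up text sequence. The marker strings `[GENE]`, `[FUNC]`, `[PATH]`, and `[TISSUE]`, the record field names, and the gene and pathway names are all hypothetical; this description does not fix their literal form.

```python
# Hypothetical mapping from attribute category to its special marker.
MARKERS = {
    "function": "[FUNC]",
    "pathway": "[PATH]",
    "tissue": "[TISSUE]",
}

def render_gene(record):
    """Render one gene record (a dict) as a structured text sequence.

    Fields that are missing or empty are skipped; list values are joined
    with "; " so that each attribute stays inside its own marked block.
    """
    parts = ["[GENE] " + record["symbol"]]
    for field, marker in MARKERS.items():
        value = record.get(field)
        if not value:
            continue
        if isinstance(value, (list, tuple)):
            value = "; ".join(value)
        parts.append(marker + " " + value)
    return " ".join(parts)

example = render_gene({
    "symbol": "GENE1",
    "function": "kinase activity",
    "pathway": ["pathway A", "pathway B"],
    "tissue": "liver",
})
# example == "[GENE] GENE1 [FUNC] kinase activity [PATH] pathway A; pathway B [TISSUE] liver"
```

Because every attribute block begins with its own marker, a downstream encoder or pooling layer can locate the function, pathway, and tissue segments by position instead of inferring them from free text.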
[0161] Figure 7 is a schematic diagram of the structure of the electronic device provided by the present invention. As shown in Figure 7, the electronic device may include a processor 810, a communications interface 820, a memory 830, and a communication bus 840, wherein the processor 810, the communications interface 820, and the memory 830 communicate with each other via the communication bus 840. The processor 810 can call logical instructions in the memory 830 to execute a gene function characterization method, which includes: acquiring multi-source gene attribute data covering the molecular function, signaling pathway, and tissue expression dimensions; performing targeted knowledge completion on the missing attributes in the multi-source gene attribute data using a generative large language model to obtain a gene knowledge graph; converting the gene knowledge graph into a structured text sequence using an attribute-aware template containing special markers for attribute categories; and performing iterative contrastive learning training on a gene function embedding model based on the structured text sequence until a preset convergence condition is met, to obtain the trained gene function embedding model. Each round of contrastive learning training includes: constructing positive sample pairs using data augmentation strategies, selecting negative samples using a progressive sampling strategy, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function. The gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer, used to map the structured text sequence of a target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge.
[0162] Furthermore, the logical instructions in the aforementioned memory 830 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes over the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0163] In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the gene function characterization method provided above, the method including: acquiring multi-source gene attribute data covering the molecular function, signaling pathway, and tissue expression dimensions; performing targeted knowledge completion on the missing attributes in the multi-source gene attribute data using a generative large language model to obtain a gene knowledge graph; converting the gene knowledge graph into a structured text sequence using an attribute-aware template containing special markers for attribute categories; and performing iterative contrastive learning training on a gene function embedding model based on the structured text sequence until a preset convergence condition is met, to obtain the trained gene function embedding model. Each round of contrastive learning training includes: constructing positive sample pairs using data augmentation strategies, selecting negative samples using a progressive sampling strategy, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function. The gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer, used to map the structured text sequence of a target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge.
[0164] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the gene function characterization method provided above, the method including: acquiring multi-source gene attribute data covering the molecular function, signaling pathway, and tissue expression dimensions; performing targeted knowledge completion on the missing attributes in the multi-source gene attribute data using a generative large language model to obtain a gene knowledge graph; converting the gene knowledge graph into a structured text sequence using an attribute-aware template containing special markers for attribute categories; and performing iterative contrastive learning training on a gene function embedding model based on the structured text sequence until a preset convergence condition is met, to obtain the trained gene function embedding model. Each round of contrastive learning training includes: constructing positive sample pairs using data augmentation strategies, selecting negative samples using a progressive sampling strategy, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function. The gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer, used to map the structured text sequence of a target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge.
[0165] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0166] From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by software together with a necessary general-purpose hardware platform, or alternatively by hardware. Based on this understanding, the above technical solutions in essence, or the part that contributes over the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disk, and includes several instructions to cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the various embodiments or in some parts of the embodiments.
[0167] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A method for characterizing gene function, characterized in that it comprises: acquiring multi-source gene attribute data covering the molecular function, signaling pathway, and tissue expression dimensions; performing targeted knowledge completion on the missing attributes in the multi-source gene attribute data using a generative large language model to obtain a gene knowledge graph; converting the gene knowledge graph into a structured text sequence using an attribute-aware template containing special markers for attribute categories; and performing iterative contrastive learning training on a gene function embedding model based on the structured text sequence until a preset convergence condition is met, to obtain the trained gene function embedding model; wherein each round of contrastive learning training includes: constructing positive sample pairs using data augmentation strategies, selecting negative samples using a progressive sampling strategy, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function; and the gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer, used to map the structured text sequence of a target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge.
2. The gene function characterization method according to claim 1, characterized in that performing targeted knowledge completion on the missing attributes in the multi-source gene attribute data using a generative large language model includes: constructing a constraint prompt instruction, wherein the constraint prompt instruction is used to configure the generative large language model to preferentially cite factual data from authoritative sources and to set the generation temperature parameter to a preset low threshold for deterministic output; obtaining the candidate completion content output by the generative large language model based on the constraint prompt instruction; if the candidate completion content is detected to contain epistemic uncertainty words, discarding the candidate completion content, wherein the epistemic uncertainty words include at least terms that express speculation or terms that indicate that something is not yet clear; and if no epistemic uncertainty words are detected, retaining the candidate completion content as the completion result.
3. The gene function characterization method according to claim 1, characterized in that converting the gene knowledge graph into a structured text sequence using an attribute-aware template containing special markers for attribute categories includes: defining special markers for the respective attribute categories, filling the attribute values of each gene record in the gene knowledge graph into the corresponding special markers to form a semi-structured natural language sequence, and using the special markers as structural prior information in subsequent encoding; wherein the special markers include at least: a pathway marker for encapsulating pathway information, a function marker for encapsulating the function description, and a tissue marker for encapsulating tissue expression information.
4. The gene function characterization method according to claim 1, characterized in that constructing positive sample pairs using data augmentation strategies includes: performing at least one of the following augmentation operations on the structured text sequence to generate an augmented text sequence: contextual synonym fusion: replacing key terms in the structured text sequence with biomedical synonyms; functional entity masking: randomly masking biological phrase fragments in the structured text sequence; structured field reordering: shuffling the order of non-core attribute blocks while maintaining the priority of core fields; functional abstraction: randomly simplifying the function description portion of the structured text sequence; pathway representation variation: performing controlled reordering or selective pruning of the pathway list in the structured text sequence; wherein the positive sample pair is formed by combining the original structured text sequence with the augmented text sequence, or by combining two different augmented text sequences generated for the same gene.
5. The gene function characterization method according to claim 1, characterized in that selecting negative samples using a progressive sampling strategy is dynamically adjusted according to the current stage of training, and includes the following stages executed in the order of the training rounds: a basic feature distribution establishment stage: in the early rounds of training, random sampling is performed within the current training batch to obtain the negative samples; a hard negative sample mining stage: in the middle rounds of training, the cosine similarity between the anchor gene and the other genes in the current training batch is calculated, and non-homologous genes with a cosine similarity higher than a preset threshold are selected as the negative samples; and a memory bank sampling stage: in the later rounds of training, a memory bank storing historical gene embedding representations is maintained, and the negative samples are sampled from the memory bank.
6. The gene function characterization method according to claim 5, characterized in that selecting negative samples using a progressive sampling strategy further includes: after candidate negative samples are identified, searching external biological databases to obtain information on the functional association between each candidate negative sample and the anchor gene; determining whether the candidate negative sample and the anchor gene meet a functional association condition, wherein the functional association condition includes: the candidate negative sample and the anchor gene belong to the same KEGG pathway, or the candidate negative sample and the anchor gene have an interaction score higher than a preset score in a protein interaction database; and if the functional association condition is met, excluding the candidate negative sample from the negative sampling pool.
7. The gene function characterization method according to claim 1, characterized in that the long context encoder adopts a Transformer architecture based on sliding window attention; and, during the updating of the model parameters, a preset number of bottom Transformer layers of the long context encoder are frozen, and only a preset number of top Transformer layers and the embedding head are fine-tuned.
8. The gene function characterization method according to claim 1, characterized in that the attribute-aware attention pooling layer performs the following processing: using the embedding vector of the global classification token as the query vector and the hidden state sequence output by the long context encoder as the key vectors and value vectors, and calculating attention weights with the special markers in the structured text sequence serving as semantic anchors; and performing weighted aggregation of the value vectors based on the attention weights to generate the gene representation vector; wherein the attribute-aware attention pooling layer is configured with multiple attention heads to focus on the semantic regions corresponding to different special markers.
9. The gene function characterization method according to claim 1, characterized in that the composite loss function includes: a noise-contrastive estimation loss, used to bring the positive sample pairs closer together in the representation space and push the negative samples further apart in the representation space; a covariance loss, used to minimize the off-diagonal elements of the covariance matrix of the current training batch; and a variance loss, constructed based on the hinge loss function, used to constrain the average variance of each dimension from exceeding a preset target threshold; wherein the value of the composite loss function is obtained by weighted summation of the noise-contrastive estimation loss, the covariance loss, and the variance loss.
10. A gene function characterization device, characterized in that it comprises: an acquisition module, used to acquire multi-source gene attribute data covering the molecular function, signaling pathway, and tissue expression dimensions; a completion module, used to perform targeted knowledge completion on the missing attributes in the multi-source gene attribute data using a generative large language model to obtain a gene knowledge graph; a conversion module, used to convert the gene knowledge graph into a structured text sequence using an attribute-aware template containing special markers for attribute categories; and a training module, used to perform iterative contrastive learning training on a gene function embedding model based on the structured text sequence until a preset convergence condition is met, so as to obtain the trained gene function embedding model; wherein each round of contrastive learning training includes: constructing positive sample pairs using data augmentation strategies, selecting negative samples using a progressive sampling strategy, inputting the training batch of the current round into the gene function embedding model to obtain gene representation vectors, and updating model parameters based on a composite loss function; and the gene function embedding model includes a long context encoder and an attribute-aware attention pooling layer, which are used to map the structured text sequence of a target gene to be analyzed into a gene function embedding representation that integrates multi-source knowledge.
11. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the gene function characterization method according to any one of claims 1 to 9.
12. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the gene function characterization method according to any one of claims 1 to 9.