Subgrouping method, apparatus and computer program product for a population of near relatives

By integrating the co-variation index values of the nuclear genome and the mitochondrial genome, differences in environmental adaptation among closely related populations are identified. This addresses the low accuracy of distinguishing closely related subgroups and achieves high sensitivity and high specificity when the subgroups have highly similar genetic backgrounds.

CN122245408A · Pending · Publication Date: 2026-06-19 · ANNOROAD GENE TECHNOLOGY (BEIJING) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: ANNOROAD GENE TECHNOLOGY (BEIJING) CO LTD
Filing Date: 2026-05-21
Publication Date: 2026-06-19

Application Information

IPC: G16B20/20; G16B30/10; G16B40/00

AI Tagging

Application Domain

Biostatistics Proteomics

AI Technical Summary

Technical Problem

Existing technologies have low accuracy in identifying closely related subpopulations and cannot effectively distinguish the functional differences between synonymous mutations, missense mutations, and frameshift/terminator mutations. This results in key adaptive signals being diluted by noise and makes it impossible to focus on the evolutionary hotspots of subpopulations in closely related populations with highly similar genetic backgrounds.

Method used

By acquiring gene sequencing data from closely related populations, detecting variation information using pre-set variation detection tools, dividing functional gene modules based on biomedical databases, and integrating the synergistic variation index values of nuclear and mitochondrial genomes, the environmental adaptive evolutionary differences of closely related populations can be identified, achieving high sensitivity and high specificity in subgroup discrimination.

Benefits of technology

It achieves high sensitivity and high specificity in distinguishing closely related subgroups with highly similar genetic backgrounds, improves the accuracy of genetic characteristic analysis of closely related populations, and accurately captures the systematic imprint of environmental adaptive evolution.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a method, apparatus, and computer program product for subgrouping closely related populations. Relating to the field of population genetics, the method includes: acquiring gene sequencing data for each sample in a closely related population sample set; performing variation detection on the gene sequencing data to obtain variation information; dividing the genes in the gene sequencing data into functional gene modules; identifying the nuclear genome and mitochondrial genome within the functional gene modules; determining the variation status of each nuclear gene and each mitochondrial gene based on the variation information; determining co-variation index values based on the variation status of each nuclear gene and each mitochondrial gene; combining the N variation index values of the N functional gene modules of the sample into a variation index value set; and dividing the closely related population sample set into multiple subgroups based on the variation index value set of each sample. This application solves the problem of low accuracy in distinguishing closely related subgroups in related technologies.

Description

Technical Field

[0001] This application relates to the field of population genetics, and more specifically, to a method, apparatus, and computer program product for subgrouping closely related populations.

Background Technology

[0002] In population genetics and the analysis of genetic characteristics of regional populations, the relevant techniques mainly rely on two technical approaches. The first is genome-wide association analysis based on SNPs (Single Nucleotide Polymorphisms), which identifies population differentiation signals by statistically analyzing differences in allele frequencies at a large number of nuclear gene loci. The second is maternal lineage tracing based on mitochondrial DNA (Deoxyribonucleic Acid) haplogroups, which is used to analyze the migration and evolutionary history of a population's maternal ancestors. These methods are effective at distinguishing broad ancestry groups or distantly related populations, but they have limitations when dealing with closely related subgroups that have highly similar genetic backgrounds, geographical proximity, and frequent historical gene exchange.

[0003] Because the overall genetic differentiation among closely related subpopulations is extremely small, functional variations with adaptive value, such as missense mutations and frameshift mutations that affect protein structure or enzyme activity, are swamped by a massive number of synonymous mutations, non-coding region variations, and neutral polymorphisms. As a result, frequency-based statistical methods that rely on single sites or a single genome struggle to obtain statistically significant results, and the signal-to-noise ratio is extremely low. Variation load analysis in related techniques typically assigns equal weight to all detected variations regardless of their impact on protein function, and fails to distinguish the substantial differences in functional cost between synonymous mutations, missense mutations, and frameshift / stop-codon mutations. This one-size-fits-all approach dilutes crucial adaptive signals with a large amount of background noise that has no functional impact, making it impossible to focus on the evolutionary hotspots truly involved in metabolic adaptation. Subpopulation classification methods for closely related populations in related techniques therefore suffer from low discriminative power: functional signals are easily submerged by noise, leading to low discrimination accuracy.

[0004] There is currently no effective solution to the problem of low accuracy in distinguishing closely related subgroups in related technologies.

Summary of the Invention

[0005] The main objective of this application is to provide a method, apparatus, and computer program product for subgrouping closely related populations, in order to solve the problem of low accuracy in the identification of closely related subgroups in related technologies.

[0006] To achieve the above objectives, according to one aspect of this application, a method for subgrouping closely related populations is provided. The method includes: acquiring gene sequencing data for each sample in a closely related population sample set, and performing variation detection on the gene sequencing data using a preset variation detection tool to obtain variation information for each sample; for the gene sequencing data of each sample, dividing the genes in the gene sequencing data into functional gene modules according to a preset biomedical database to obtain N functional gene modules, wherein each functional gene module contains at least one gene from the gene sequencing data, and N is a positive integer; for each functional gene module, determining the nuclear genome and mitochondrial genome within the functional gene module, and determining the variation status of each nuclear gene in the nuclear genome and the variation status of each mitochondrial gene in the mitochondrial genome based on the variation information; determining a co-variation index value based on the variation status of each nuclear gene and the variation status of each mitochondrial gene to obtain a variation index value for each functional gene module, wherein the co-variation index value is used to identify environmental adaptive evolutionary differences between different subgroups of closely related populations; combining the N variation index values of the N functional gene modules of the sample into a variation index value set, and dividing the closely related population sample set into multiple subgroups based on the variation index value set of each sample.

[0007] Optionally, the genes in the gene sequencing data are divided into functional gene modules according to a preset biomedical database to obtain N functional gene modules, including: identifying nuclear genes and mitochondrial genes in the gene sequencing data through a first biomedical database, wherein the first biomedical database is used to store the composition and functional classification of the mitochondrial proteome; classifying each nuclear gene and each mitochondrial gene in the gene sequencing data through a second biomedical database to obtain classification results, wherein the classification results include the functional gene module to which each nuclear gene belongs and the functional gene module to which each mitochondrial gene belongs, wherein the second biomedical database is used to store N functional gene modules, as well as the nuclear genes and mitochondrial genes contained in each functional gene module; determining the N functional gene modules in the classification results, wherein the N functional gene modules include at least one of the following: respiratory chain complex I, respiratory chain complex III, respiratory chain complex IV, respiratory chain complex V, and assembly factors.

[0008] Optionally, determining the co-variation index value based on the variation status of each nuclear gene and each mitochondrial gene includes: determining the gene category of each nuclear gene in the functional gene module; calculating the nuclear genome load index value based on the gene category and variation status of each nuclear gene; calculating the mitochondrial genome load index value based on the variation status of each mitochondrial gene; and calculating the product of the nuclear genome load index value and the mitochondrial genome load index value to obtain the co-variation index value of the functional gene module.

[0009] Optionally, calculating the nuclear genome load index value based on the gene category and variation status of each nuclear gene includes: determining the basic weight corresponding to the gene category of each nuclear gene, and determining the first variation effect weight of the variation type of the nuclear gene; determining the first state index value from the variation status of the nuclear gene, calculating the product of the first state index value, the basic weight, and the first variation effect weight to obtain the nuclear gene load index value; and calculating the cumulative value of the load index values of all nuclear genes in the functional gene module to obtain the nuclear genome load index value.

[0010] Optionally, calculating the mitochondrial genome load index value based on the variation state of each mitochondrial gene includes: determining the preset weight of the mitochondrial gene and determining the second variation effect weight of the variation type of the mitochondrial gene; determining the second state index value from the variation state of the mitochondrial gene, calculating the product of the second state index value, the preset weight, and the second variation effect weight to obtain the mitochondrial gene load index value; and calculating the cumulative value of the load index values of all mitochondrial genes in the functional gene module to obtain the mitochondrial genome load index value.
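
The load and index calculations described in [0008] to [0010] can be written compactly as follows. The symbols are introduced here only for illustration and do not appear in the application: for a functional gene module m, Nuc(m) and Mt(m) are its nuclear and mitochondrial genes, s_g is the state index value of gene g, b_g is the basic weight of a nuclear gene's category, p_g is the preset weight of a mitochondrial gene, and e_g is the variation effect weight of the gene's variation type.

```latex
L^{\mathrm{nuc}}_{m} = \sum_{g \in \mathrm{Nuc}(m)} s_g \, b_g \, e_g,
\qquad
L^{\mathrm{mt}}_{m} = \sum_{g \in \mathrm{Mt}(m)} s_g \, p_g \, e_g,
\qquad
C_{m} = L^{\mathrm{nuc}}_{m} \times L^{\mathrm{mt}}_{m}
```

Under this reading, C_m is zero whenever either genome carries no weighted load in the module, so the index rises only when both genomes contribute.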

[0011] Optionally, the gene sequencing data is subjected to mutation detection using a preset mutation detection tool to obtain mutation information for each sample, including: performing single nucleotide variant detection on the gene sequencing data to obtain a first detection result; performing insertion / deletion variant detection on the gene sequencing data to obtain a second detection result; and labeling the first and second detection results using the preset mutation detection tool to obtain mutation information for each sample, wherein the mutation information includes at least the gene name, mutation site, and mutation type.

[0012] Optionally, dividing the closely related population sample set into multiple subgroups based on the variation index value set of each sample includes: inputting the variation index value set of each sample in the closely related population sample set into a preset clustering model for processing to obtain multiple clusters; and assigning the samples in each cluster to one subgroup to obtain multiple subgroups.

[0013] To achieve the above objectives, according to another aspect of this application, a subgrouping device for closely related populations is provided. The device includes: an acquisition unit for acquiring gene sequencing data of each sample in a closely related population sample set, and performing mutation detection on the gene sequencing data using a preset mutation detection tool to obtain mutation information for each sample; a first partitioning unit for partitioning the genes in the gene sequencing data of each sample into functional gene modules according to a preset biomedical database to obtain N functional gene modules, wherein each functional gene module contains at least one gene from the gene sequencing data, and N is a positive integer; a first determination unit for determining the nuclear genome and mitochondrial genome in each functional gene module, and determining the mutation status of each nuclear gene in the nuclear genome and the mutation status of each mitochondrial gene in the mitochondrial genome based on the mutation information; a second determination unit for determining a co-variation index value based on the mutation status of each nuclear gene and the mutation status of each mitochondrial gene to obtain a variation index value for each functional gene module, wherein the co-variation index value is used to identify the environmental adaptive evolutionary differences between different subgroups of closely related populations; and a second partitioning unit for combining the N variation index values of the N functional gene modules of the sample into a variation index value set, and dividing the closely related population sample set into multiple subgroups based on the variation index value set of each sample.

[0014] To achieve the above objectives, according to another aspect of this application, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the subgrouping method for closely related populations described in various embodiments of this application.

[0015] This application employs the following steps: acquiring gene sequencing data for each sample in a closely related population sample set, and performing mutation detection on the gene sequencing data using a preset mutation detection tool to obtain mutation information for each sample; for the gene sequencing data of each sample, dividing the genes in the gene sequencing data into functional gene modules according to a preset biomedical database to obtain N functional gene modules, wherein each functional gene module contains at least one gene from the gene sequencing data, and N is a positive integer; for each functional gene module, determining the nuclear genome and mitochondrial genome within the functional gene module, and determining the mutation status of each nuclear gene in the nuclear genome and each mitochondrial gene in the mitochondrial genome based on the mutation information; determining co-variation index values based on the mutation status of each nuclear gene and each mitochondrial gene to obtain the variation index value for each functional gene module, wherein the co-variation index value is used to identify the environmental adaptive evolutionary differences between different subgroups of closely related populations; combining the N variation index values of the N functional gene modules of the sample into a variation index value set, and dividing the closely related population sample set into multiple subgroups based on the variation index value set of each sample, thus solving the problem of low discrimination accuracy of closely related subgroups in related technologies. By integrating variation information from the nuclear and mitochondrial genomes and calculating the co-variation index values between them, co-variation pressure is quantified at the functional gene module level. This allows the systematic imprints left by environmental adaptive evolution to be captured precisely from the co-variation index values, achieves high sensitivity and specificity in distinguishing closely related subgroups with highly similar genetic backgrounds, and thereby improves the accuracy of genetic characteristic analysis of closely related populations.

Attached Figure Description

[0016] The accompanying drawings, which form part of this application, are used to provide a further understanding of this application. The illustrative embodiments and descriptions of this application are used to explain this application and do not constitute an undue limitation of this application. In the drawings:

[0017] Figure 1 is a flowchart of a method for subgrouping closely related populations according to an embodiment of this application;

[0018] Figure 2 is a schematic diagram of the co-variation index values of respiratory chain complex I, including core subunits, auxiliary subunits, and assembly factors, provided according to an embodiment of this application;

[0019] Figure 3 is a schematic diagram of the co-variation index values of respiratory chain complex I containing core subunits, provided according to an embodiment of this application;

[0020] Figure 4 is a schematic diagram of the co-variation index values of respiratory chain complex III, including core subunits, auxiliary subunits, and assembly factors, provided according to an embodiment of this application;

[0021] Figure 5 is a schematic diagram of the co-variation index values of respiratory chain complex III containing core subunits, provided according to an embodiment of this application;

[0022] Figure 6 is a schematic diagram of the co-variation index values of respiratory chain complex IV, including core subunits, auxiliary subunits, and assembly factors, provided according to an embodiment of this application;

[0023] Figure 7 is a schematic diagram of the co-variation index values of respiratory chain complex IV containing core subunits, provided according to an embodiment of this application;

[0024] Figure 8 is a schematic diagram of the co-variation index values of respiratory chain complex V, including core subunits, auxiliary subunits, and assembly factors, provided according to an embodiment of this application;

[0025] Figure 9 is a schematic diagram of the co-variation index values of respiratory chain complex V containing core subunits, provided according to an embodiment of this application;

[0026] Figure 10 is a schematic diagram of the co-variation index values of the total assembly factors, provided according to an embodiment of this application;

[0027] Figure 11 is a volcano plot of driver genes, provided according to an embodiment of this application;

[0028] Figure 12 is a schematic diagram of a subgrouping device for closely related populations, provided according to an embodiment of this application;

[0029] Figure 13 is a schematic diagram of an electronic device, provided according to an embodiment of this application.

Detailed Implementation

[0030] It should be noted that, unless otherwise specified, the embodiments and features described in this application can be combined with each other. This application will now be described in detail with reference to the accompanying drawings and embodiments.

[0031] To enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, and not all embodiments. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort should fall within the scope of protection of the present application.

[0032] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate for the embodiments of this application described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0033] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for display, data used for analysis, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties.

[0034] It should be noted that the information collected is information and data authorized by the user or fully authorized by all parties, and the collection, storage, use, processing, transmission, provision, disclosure and application of the relevant data all comply with the relevant laws, regulations and standards of the relevant regions, necessary confidentiality measures have been taken, and they do not violate public order and good morals. Corresponding operation portals are provided for users to choose to authorize or refuse.

[0035] The present invention will now be described in conjunction with preferred implementation steps. Figure 1 is a flowchart of a method for subgrouping closely related populations according to an embodiment of this application. As shown in Figure 1, the method includes the following steps:

[0036] Step S101: Obtain gene sequencing data for each sample in the closely related population sample set, and perform mutation detection on the gene sequencing data using a preset mutation detection tool to obtain mutation information for each sample.

[0037] In step S101, the gene sequencing data may include WES (Whole Exome Sequencing) data and mtDNA (mitochondrial DNA) whole-genome data. Collecting gene sequencing data from individuals in closely related subgroups builds a high-coverage foundation of genetic information. The preset variant detection tools may be tools such as VEP (Variant Effect Predictor) and GATK (Genome Analysis Toolkit). By aligning reads against the reference genome, correcting sequencing errors, and filtering false positives with the preset variant detection tools, the true SNPs and small InDel (Insertion / Deletion) variants in each sample are identified. After functional annotation by tools such as VEP, the variant information records the type of each variant with a potential functional impact, labeled as missense, synonymous, frameshift, stop gain / loss, and so on. Genetic variants that may disturb protein structure and function are thereby selected, providing a reliable, focused, low-noise data foundation for subsequent analysis.
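
The selection of potentially function-altering variants described in step S101 might look like the following minimal sketch. It assumes the annotated calls have already been exported as simple records; the `Variant` class, the consequence label strings, and the placeholder site values are hypothetical and are not taken from the application or from any tool's output format.

```python
from dataclasses import dataclass

# Consequence labels treated as potentially function-altering.
# The exact label strings are assumptions made for this sketch.
FUNCTIONAL_TYPES = {"missense", "frameshift", "stop_gained", "stop_lost"}


@dataclass(frozen=True)
class Variant:
    sample: str
    gene: str   # gene name
    site: str   # variant site (placeholder values below)
    vtype: str  # annotated variant type


def keep_functional(variants):
    """Return only the variants whose annotated type may alter protein function."""
    return [v for v in variants if v.vtype in FUNCTIONAL_TYPES]


# Toy records for illustration only; the sites are not real coordinates.
calls = [
    Variant("s1", "NDUFS1", "site_A", "missense"),
    Variant("s1", "NDUFS2", "site_B", "synonymous"),
    Variant("s1", "ND1", "site_C", "frameshift"),
]
functional = keep_functional(calls)
print([v.gene for v in functional])
```

Each record keeps the gene name, site, and type, which are the three fields the application lists as the minimum content of the mutation information.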

[0038] Step S102: For the gene sequencing data of each sample, the genes in the gene sequencing data are divided into functional gene modules according to a preset biomedical database to obtain N functional gene modules, wherein each functional gene module contains at least one gene in the gene sequencing data, and N is a positive integer.

[0039] In step S102, the biomedical database may include MitoCarta 3.0 (a mitochondrial proteome database) and KEGG (Kyoto Encyclopedia of Genes and Genomes). Using this database, all genes detected in the gene sequencing data are grouped according to their actual biological roles in cellular energy metabolism, forming functional gene modules with clear physiological significance. These include functional gene modules such as Complex I (respiratory chain complex I), Complex III (respiratory chain complex III), Complex IV (respiratory chain complex IV), Complex V (respiratory chain complex V), and assembly factors. Complex II (respiratory chain complex II) is excluded because it is encoded entirely by nDNA (nuclear DNA). By dividing the genes in the gene sequencing data into functional gene modules, the variation information originally scattered throughout the genome is merged into functionally related network units, thereby capturing the population evolution signals of multi-gene synergistic perturbations and preventing weak single-gene effects from being masked by noise.
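
As a rough illustration of the module division in step S102, the sketch below assigns detected genes to modules using a small hand-written lookup table. The table `GENE_TABLE` and the function `build_modules` are hypothetical; a real table would be derived from resources such as MitoCarta 3.0 and KEGG, and the mitochondrial gene names follow the short forms used in this application.

```python
from collections import defaultdict

# Illustrative subset of a gene -> (module, genome of origin) table.
GENE_TABLE = {
    "NDUFS1": ("Complex I", "nuclear"),
    "ND1": ("Complex I", "mitochondrial"),
    "UQCRC1": ("Complex III", "nuclear"),
    "CYTB": ("Complex III", "mitochondrial"),
    "COX1": ("Complex IV", "mitochondrial"),
    "SDHA": ("Complex II", "nuclear"),
}

# Complex II is left out because it is encoded entirely by nDNA.
EXCLUDED_MODULES = {"Complex II"}


def build_modules(detected_genes):
    """Group detected genes by functional module and by genome of origin."""
    modules = defaultdict(lambda: {"nuclear": [], "mitochondrial": []})
    for gene in detected_genes:
        entry = GENE_TABLE.get(gene)
        if entry is None or entry[0] in EXCLUDED_MODULES:
            continue  # gene is outside the modules of interest
        module, genome = entry
        modules[module][genome].append(gene)
    return dict(modules)


modules = build_modules(["NDUFS1", "ND1", "CYTB", "SDHA", "BRCA1"])
print(modules)
```

Genes that are absent from the table, or that belong only to an excluded module, are simply skipped, so each resulting module contains at least one detected gene.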

[0040] Step S103: For each functional gene module, determine the nuclear genome and mitochondrial genome in the functional gene module, and determine the variation status of each nuclear gene in the nuclear genome and the variation status of each mitochondrial gene in the mitochondrial genome based on variation information.

[0041] In step S103, within the functional gene module framework, genes from the nuclear genome and genes from the mitochondrial genome are clearly distinguished and extracted separately. The variation status indicates whether a variation exists in each gene and, if so, the type of variation; it is used to quantify the functional damage load borne by each gene.
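
One possible way to derive the per-gene variation status in step S103 is sketched below. The application does not say how several variants in the same gene are combined, so keeping the most severe type is an assumption of this sketch, as are the `SEVERITY` ranking and the function name `variation_status`.

```python
# Assumed severity ranking used to pick one variant type per gene.
SEVERITY = {"synonymous": 0, "missense": 1, "frameshift": 2, "stop_gained": 2}


def variation_status(genes, variants):
    """Map each gene to (state, variant type).

    genes:    gene names in one functional gene module
    variants: (gene, variant type) pairs detected in one sample
    state is 1 if the gene carries a variant and 0 otherwise.
    """
    status = {g: (0, None) for g in genes}
    for gene, vtype in variants:
        if gene not in status:
            continue  # variant lies outside this module
        _, current = status[gene]
        if current is None or SEVERITY.get(vtype, 0) > SEVERITY.get(current, 0):
            status[gene] = (1, vtype)
    return status


status = variation_status(
    ["NDUFS1", "NDUFV1", "ND1"],
    [("NDUFS1", "synonymous"), ("NDUFS1", "missense"), ("ND1", "frameshift")],
)
print(status)
```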

[0042] Step S104: Based on the variation status of each nuclear gene and the variation status of each mitochondrial gene, determine the co-variation index value to obtain the variation index value of each functional gene module. The co-variation index value is used to identify the environmental adaptive evolutionary differences between different subgroups of closely related populations.

[0043] In step S104, the co-variation index value increases significantly only when function-impairing variants accumulate simultaneously in the nuclear genes and the mitochondrial genes of the same functional gene module. It therefore reflects the cumulative effect of co-adaptive evolution of the nuclear genome and mitochondrial genome under environmental stress (such as extreme cold or high altitude), and can identify functional compensatory shifts caused by co-variation of the two genomes at the individual level. The co-variation index value is determined by the variation status of each nuclear gene and each mitochondrial gene in the functional gene module. It integrates the originally weak and scattered co-variation signals of multiple genes into a single comparable value, which quantifies differences in metabolic function between populations that would otherwise be submerged by noise. This enables highly sensitive identification of cryptic adaptive differentiation between closely related subgroups.
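
The behaviour described for step S104 follows from the product form given in the summary ([0008] to [0010]). The sketch below is one way to compute it; all numeric weights, the category names, and the function names are placeholders chosen for illustration, since the application does not give concrete values in this passage.

```python
# Placeholder weights; not values from the application.
EFFECT_WEIGHT = {"synonymous": 0.0, "missense": 1.0, "frameshift": 2.0}
BASE_WEIGHT = {"core": 1.0, "accessory": 0.5, "assembly": 0.5}


def nuclear_load(genes):
    """genes: (category, state, variant type) for each nuclear gene in a module."""
    return sum(
        state * BASE_WEIGHT[category] * EFFECT_WEIGHT.get(vtype, 0.0)
        for category, state, vtype in genes
    )


def mito_load(genes, preset_weight=1.0):
    """genes: (state, variant type) for each mitochondrial gene in a module."""
    return sum(
        state * preset_weight * EFFECT_WEIGHT.get(vtype, 0.0)
        for state, vtype in genes
    )


def covariation_index(nuclear_genes, mito_genes):
    """Product of the nuclear and mitochondrial load index values."""
    return nuclear_load(nuclear_genes) * mito_load(mito_genes)


nuc = [("core", 1, "missense"), ("accessory", 1, "frameshift"), ("core", 0, None)]
both = covariation_index(nuc, [(1, "missense")])
nuclear_only = covariation_index(nuc, [(0, None)])
print(both, nuclear_only)
```

Because the two loads are multiplied, a sample with variants in only one genome scores zero for the module, which matches the statement that the index rises only when both genomes accumulate variants.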

[0044] Step S105: Combine the N variation index values of the N functional gene modules of the sample into a variation index value set, and divide the closely related population sample set into multiple subgroups based on the variation index value set of each sample.

[0045] In step S105, the co-variation index values of each sample across the N functional gene modules are assembled into a variation index value set, which systematically characterizes the overall metabolic adaptation status of the sample. Unsupervised analysis methods (such as hierarchical clustering or principal component analysis) can then be used to divide the closely related population into subgroups. Alternatively, the variation index values of each functional gene module can be compared one by one to determine whether there are differences in genetic characteristics between samples, and the subgroups can then be divided accordingly. For example, if the difference in the variation index value of any functional gene module between sample a and sample b is greater than a preset threshold, this indicates that there are differences in genetic characteristics between sample a and sample b, and they belong to different subgroups.
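
The threshold comparison in step S105 can be sketched as follows. The pairwise rule alone does not define groups for more than two samples, so this sketch assumes that samples are merged whenever every module-wise difference is within the threshold and takes the resulting connected components as subgroups. The function name `subgroup` and the sample values are illustrative only.

```python
def subgroup(index_sets, threshold):
    """Group samples whose variation index value sets are close in every module.

    index_sets: sample name -> list of N module index values
    Returns a sorted list of subgroups, each a sorted list of sample names.
    """
    names = sorted(index_sets)
    parent = {n: n for n in names}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            max_diff = max(
                abs(x - y) for x, y in zip(index_sets[a], index_sets[b])
            )
            if max_diff <= threshold:  # no module differs by more than the threshold
                parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Toy index value sets for four samples and three modules.
samples = {
    "a": [0.0, 2.0, 1.0],
    "b": [0.5, 2.0, 1.0],
    "c": [6.0, 0.0, 1.0],
    "d": [6.0, 0.5, 1.5],
}
result = subgroup(samples, threshold=1.0)
print(result)
```

This is equivalent to single-linkage grouping under the maximum per-module difference; a preset clustering model, as mentioned in the application, could be substituted for it.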

[0046] The method for subgrouping closely related populations provided in this application involves acquiring gene sequencing data from each sample in a closely related population sample set, and performing mutation detection on the gene sequencing data using a preset mutation detection tool to obtain mutation information for each sample. For the gene sequencing data of each sample, the genes in the gene sequencing data are divided into functional gene modules according to a preset biomedical database, resulting in N functional gene modules. Each functional gene module contains at least one gene from the gene sequencing data, and N is a positive integer. For each functional gene module, the nuclear genome and mitochondrial genome within the functional gene module are determined. Based on the mutation information, the mutation status of each nuclear gene in the nuclear genome and the mutation status of each mitochondrial gene in the mitochondrial genome are determined. Based on the mutation status of each nuclear gene and the mutation status of each mitochondrial gene, a co-variation index value is determined to obtain the variation index value for each functional gene module. The co-variation index value is used to identify the environmental adaptive evolutionary differences between different subgroups of closely related populations. The N variation index values of the N functional gene modules of the sample are combined into a variation index value set. Based on the variation index value set of each sample, the closely related population sample set is divided into multiple subgroups, solving the problem of low accuracy in distinguishing closely related subgroups in related technologies. By integrating variation information from the nuclear and mitochondrial genomes and calculating the co-variation index values between them, co-variation pressure is quantified at the functional gene module level. This allows the systematic imprints left by environmental adaptive evolution to be captured precisely from the co-variation index values, achieves high sensitivity and specificity in distinguishing closely related subgroups with highly similar genetic backgrounds, and thereby improves the accuracy of genetic characteristic analysis of closely related populations.

[0047] The classification of functional gene modules is achieved through a biomedical database. Optionally, in the subgroup classification method for closely related populations provided in this application embodiment, the gene sequencing data is classified into functional gene modules according to a preset biomedical database to obtain N functional gene modules, including: determining nuclear genes and mitochondrial genes in the gene sequencing data through a first biomedical database, wherein the first biomedical database is used to store the composition and functional classification of the mitochondrial proteome; classifying each nuclear gene and each mitochondrial gene in the gene sequencing data through a second biomedical database to obtain classification results, wherein the classification results include the functional gene module to which each nuclear gene belongs and the functional gene module to which each mitochondrial gene belongs, wherein the second biomedical database is used to store N functional gene modules and the nuclear genes and mitochondrial genes contained in each functional gene module; determining the N functional gene modules in the classification results, wherein the N functional gene modules include at least one of the following: respiratory chain complex I, respiratory chain complex III, respiratory chain complex IV, respiratory chain complex V, and assembly factors.

[0048] In some embodiments, the first biomedical database may be MitoCarta 3.0, which labels the origin (nuclear-coding or mitochondrial-coding), subcellular location, complex affiliation, and functional category (such as complex I core subunit or assembly factor) of each mitochondrial protein; the second biomedical database may be the KEGG pathway database, which, based on the established metabolic network map, precisely classifies each gene with a confirmed origin into the functional gene module it participates in.
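The two-step lookup described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: `GENE_ORIGIN` and `GENE_MODULE` are hypothetical hand-written excerpts standing in for MitoCarta 3.0 and the KEGG pathway database, and the module labels and the function name `partition_genes` are introduced here for illustration only.

```python
# Hypothetical excerpts standing in for the two databases; real MitoCarta 3.0
# and KEGG exports have their own formats and are not reproduced here.
GENE_ORIGIN = {  # first database: which genome encodes the gene
    "ND1": "mtDNA", "COX1": "mtDNA", "ATP6": "mtDNA",
    "NDUFS1": "nDNA", "UQCRC1": "nDNA", "ATP5F1A": "nDNA",
}
GENE_MODULE = {  # second database: functional gene module of each gene
    "ND1": "complex_I", "NDUFS1": "complex_I",
    "UQCRC1": "complex_III",
    "COX1": "complex_IV",
    "ATP6": "complex_V", "ATP5F1A": "complex_V",
}


def partition_genes(genes):
    """Group genes into modules, split into nuclear and mitochondrial sides."""
    modules = {}
    for gene in genes:
        origin = GENE_ORIGIN.get(gene)
        module = GENE_MODULE.get(gene)
        if origin is None or module is None:
            continue  # gene is missing from one of the lookups, so it is skipped
        side = "mitochondrial" if origin == "mtDNA" else "nuclear"
        entry = modules.setdefault(module, {"nuclear": [], "mitochondrial": []})
        entry[side].append(gene)
    return modules


if __name__ == "__main__":
    # "GENE_X" is a placeholder for a gene outside the respiratory chain modules.
    print(partition_genes(["ND1", "NDUFS1", "COX1", "GENE_X"]))
```

Genes that are absent from either lookup are dropped, which mirrors the idea that only genes with a confirmed origin and module membership enter the later index calculation.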

[0049] For example, respiratory chain complex I, the largest and most complex enzyme in the oxidative phosphorylation system, consists of seven mtDNA-encoded NADH dehydrogenase subunits (ND), specifically ND1, ND2, ND3, ND4, ND4L, ND5, and ND6; seven nDNA-encoded core subunits from the NADH dehydrogenase [ubiquinone] iron-sulfur protein (NDUFS) and NADH dehydrogenase [ubiquinone] flavoprotein (NDUFV) families, specifically NDUFS1, NDUFS2, NDUFS3, NDUFS7, NDUFS8, NDUFV1, and NDUFV2; and approximately 31 nDNA-encoded accessory subunits. Because this complex is the main electron entry point of the respiratory chain, its covariance index value directly reflects the genetic robustness of the organism's response to upstream energy substrates (such as NADH).

[0050] Respiratory chain complex III comprises one mtDNA-encoded cytochrome b subunit; three nDNA-encoded Ubiquinol-cytochrome c reductase complex core protein subunits (UQCRC), specifically UQCRC1, UQCRC2, and UQCRFS1; and eleven nDNA-encoded accessory subunits. As a core relay station for electron transport, it connects upstream and downstream components via the Q cycle. The perfect fit between Cytb (of mitochondrial origin) and its surrounding nuclear-encoded subunits is crucial for preventing electron leakage and controlling reactive oxygen species (ROS) production; the interaction pressures at this site reflect the population's tolerance to oxidative stress.

[0051] Respiratory chain complex IV (cytochrome c oxidase) is composed of three mtDNA-encoded cytochrome c oxidase subunits (COX), specifically COX1, COX2, and COX3, and approximately eight nuclear-encoded accessory subunits. The co-variation load of COX1, COX2, and COX3 represents oxygen utilization efficiency. As the final step in oxygen consumption, this complex directly determines the body's oxygen utilization efficiency and heat production capacity.

[0052] Respiratory chain complex V comprises two mtDNA-encoded ATP synthase membrane subunits (ATP6 and ATP8); five nDNA-encoded ATP synthase F1 subunits: α (ATP5F1A), β (ATP5F1B), γ (ATP5F1C), δ (ATP5F1D), and ε (ATP5F1E); and approximately ten nDNA-encoded accessory subunits. It converts the transmembrane proton gradient established by the preceding complexes into the chemical energy of ATP and plays a crucial role in regulating the evolution of basal metabolic rate.

[0053] Assembly factors: this module is used to calculate the interaction between the mitochondrial genome and nuclear-encoded proteins (such as assembly chaperone proteins) that are not part of the final structure of the complexes but are responsible for building them. It represents the assembly and regulatory stability of the system.

[0054] By dividing genes into functional gene modules, this embodiment converts the original variation information from an unordered gene list into a set of co-functional modules spanning the nuclear genome and the mitochondrial genome. This ensures the biological rationality of the analysis framework and makes the subsequent calculation of co-variation index values revolve entirely around the core adaptive pathway of respiratory chain energy metabolism. It avoids the low signal-to-noise ratio that traditional methods suffer from because of their excessively broad analysis scope and blurred functional background, and achieves genetic discrimination of closely related subgroups with high specificity and high sensitivity.

[0055] After determining the variation status of each nuclear gene and each mitochondrial gene, a covariance index value is determined based on the variation status. Optionally, in the subgrouping method for closely related populations provided in this application embodiment, determining the covariance index value based on the variation status of each nuclear gene and each mitochondrial gene includes: determining the gene category of each nuclear gene in the functional gene module; calculating the nuclear genome load index value based on the gene category and variation status of each nuclear gene; calculating the mitochondrial genome load index value based on the variation status of each mitochondrial gene; and calculating the product of the nuclear genome load index value and the mitochondrial genome load index value to obtain the covariance index value of the functional gene module.

[0056] In some embodiments, weights are assigned according to the gene category and mutation status of different genes, and the co-variation index value is calculated based on the assigned weights. For example, the basic weights of genes are allocated as follows. Core functional subunits: core subunits encoded by mtDNA, as labeled by MitoCarta 3.0, are assigned a high basic weight (e.g., 5.0), and core subunits encoded by nDNA are assigned a relatively high basic weight (e.g., 2.0). Auxiliary / regulatory subunits: auxiliary subunits involved in assembly or structural stability are assigned a second-highest basic weight (e.g., 1.5). Assembly factors: a mutation in an assembly factor often affects the yield of the entire complex, although the factor does not directly participate in electron transport, so it is also assigned a second-highest weight (e.g., 1.5). The functional effect weights of mutations are allocated according to the predicted degree of impairment to protein function. Synonymous mutations do not change the amino acid sequence and are treated as genetic noise, with a weight of 0. Missense mutations change a single amino acid and have a weight of 1.5. High-risk mutations, which include frameshift mutations and terminator loss, have a weight of 2. The formula for calculating the co-variation index value is as follows:

[0057] $\mathrm{NMII}_{\mathrm{module}} = B_{\mathrm{nuc}} \times B_{\mathrm{mt}}$;

[0058] where $\mathrm{NMII}_{\mathrm{module}}$ is the covariance index value of the functional gene module, $B_{\mathrm{nuc}}$ is the nuclear genome load index value, and $B_{\mathrm{mt}}$ is the mitochondrial genome load index value. A product is used instead of a sum so that slight shifts in functional cooperation between the nuclear and mitochondrial genomes are amplified through nonlinear cumulative effects.
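A small numerical sketch may help show why a product behaves differently from a sum. The load values below are hypothetical and chosen only for arithmetic convenience; the helper names `nmii` and `relative_change` are introduced here and do not come from the application.

```python
def nmii(b_nuc, b_mt):
    """Co-variation index of one module: product of the two load index values."""
    return b_nuc * b_mt


def relative_change(before, after):
    """Fractional change from `before` to `after`."""
    return (after - before) / before


# Hypothetical loads: both genomes shift slightly, from 2.0 to 2.5.
sum_before, sum_after = 2.0 + 2.0, 2.5 + 2.5
prod_before, prod_after = nmii(2.0, 2.0), nmii(2.5, 2.5)

sum_gain = relative_change(sum_before, sum_after)     # 0.25
prod_gain = relative_change(prod_before, prod_after)  # 0.5625

# A load carried by only one genome gives a product of zero,
# even though the sum of the two loads is still 4.0.
one_sided = nmii(4.0, 0.0)  # 0.0
```

In this toy case the same joint shift raises the sum by 25% but the product by about 56%, and a module whose load sits entirely on one genome contributes nothing, so the index responds to coordinated rather than isolated variation.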

[0059] This embodiment constructs a quantitative model of co-variation index values, using gene functional importance as weight, the degree of mutation damage as scale, and a product mechanism as amplifier. This transforms the subtle metabolic differences in closely related populations that were originally drowned out by synonymous mutation noise and failed in isolated analysis into quantitative genetic markers with high specificity and high biological explanatory power, thereby achieving accurate identification of geographically proximate and genetically similar populations.

[0060] The nuclear genome load index value is calculated based on the gene category and variation status of the nuclear genes. Optionally, in the subgrouping method for closely related populations provided in this application embodiment, calculating the nuclear genome load index value based on the gene category and variation status of each nuclear gene includes: determining the basic weight corresponding to the gene category of each nuclear gene, and determining the first variation effect weight of the variation type of the nuclear gene; determining the first state index value from the variation status of the nuclear gene, and calculating the product of the first state index value, the basic weight, and the first variation effect weight to obtain the nuclear gene load index value; and calculating the cumulative value of the load index values of all nuclear genes in the functional gene module to obtain the nuclear genome load index value.

[0061] In some embodiments, the nuclear genome load index value is calculated using the following formula:

[0062] $B_{\mathrm{nuc}} = \sum_{i=1}^{n} S^{\mathrm{nuc}}_{i} \times W^{\mathrm{nuc}}_{i} \times E^{\mathrm{nuc}}_{i}$;

[0063] where $n$ is the total number of nuclear genes included in the calculation within the functional gene module, and $i$ denotes the i-th nuclear gene; the formula accumulates the load of all relevant nuclear genes within the module by summing over $i$ from 1 to $n$. $S^{\mathrm{nuc}}_{i}$ is the first state index value of the i-th nuclear gene in the sample, which can be a score for the number of variant sites or for a specific variant type. $W^{\mathrm{nuc}}_{i}$ is the basic weight of the i-th nuclear gene. $E^{\mathrm{nuc}}_{i}$ is the functional effect weight of the variation contained in the i-th nuclear gene, also known as the first variation effect weight.
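The nuclear load calculation can be sketched as a weighted sum. The weight values below are the example values given in the description; the category labels, the tuple layout, and the function name `nuclear_load` are assumptions made for this sketch.

```python
# Example weights taken from the description; other values are possible.
BASE_WEIGHT = {"core": 2.0, "accessory": 1.5, "assembly": 1.5}
EFFECT_WEIGHT = {"synonymous": 0.0, "missense": 1.5, "high_impact": 2.0}


def nuclear_load(genes):
    """Sum state index * basic weight * effect weight over nuclear genes.

    `genes` is an iterable of (state_index, category, variant_type) tuples,
    a hypothetical record layout chosen for this sketch.
    """
    return sum(
        state * BASE_WEIGHT[category] * EFFECT_WEIGHT[variant_type]
        for state, category, variant_type in genes
    )


example = [
    (1, "core", "missense"),         # 1 * 2.0 * 1.5 = 3.0
    (2, "accessory", "synonymous"),  # 2 * 1.5 * 0.0 = 0.0
    (1, "assembly", "high_impact"),  # 1 * 1.5 * 2.0 = 3.0
]
b_nuc = nuclear_load(example)  # 6.0
```

The accessory gene carries two variant sites yet adds nothing to the load, because synonymous variants have an effect weight of 0.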

[0064] This embodiment calculates nuclear genome load index values, changing the assessment of nuclear genome variation from traditional quantitative statistics to functional stress modeling. This improves the ability to capture weak but coordinated adaptive evolutionary signals in closely related populations, providing a data foundation with high signal-to-noise ratio and high biological interpretability for subsequent calculation of coordinated variation index values.

[0065] The mitochondrial genome load index value is calculated based on the variation state of the mitochondrial genes. Optionally, in the subgrouping method for closely related populations provided in this application embodiment, calculating the mitochondrial genome load index value based on the variation state of each mitochondrial gene includes: determining a preset weight for mitochondrial genes and determining a second variation effect weight for the variation type of mitochondrial genes; determining a second state index value from the variation states of mitochondrial genes, and calculating the product of the second state index value, the preset weight, and the second variation effect weight to obtain the mitochondrial gene load index value; and calculating the cumulative value of the load index values of all mitochondrial genes in the functional gene module to obtain the mitochondrial genome load index value.

[0066] In some embodiments, the formula for calculating the mitochondrial genome load index is as follows:

[0067] $B_{\mathrm{mt}} = \sum_{j=1}^{m} S^{\mathrm{mt}}_{j} \times W^{\mathrm{mt}}_{j} \times E^{\mathrm{mt}}_{j}$;

[0068] where $m$ is the total number of mitochondrial genes included in the calculation within the functional gene module, and $j$ denotes the j-th mtDNA gene; the formula accumulates the load of all relevant mitochondrial genes within the module by summing over $j$ from 1 to $m$. $S^{\mathrm{mt}}_{j}$ is the second state index value of the j-th mtDNA gene in the sample, which can be a score for the number of variant sites or for a specific variant type. $W^{\mathrm{mt}}_{j}$ is the preset weight of the j-th mitochondrial gene. $E^{\mathrm{mt}}_{j}$ is the functional effect weight of the variation contained in the j-th mitochondrial gene, also known as the second variation effect weight.
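One way to derive the state index from raw variant calls is to count variant sites per gene. The sketch below assumes this counting interpretation and, as a further simplification, treats each (gene, variant type) pair as one term of the sum; the description allows other scoring schemes. The preset weight 5.0 is the example value given for mtDNA-encoded core subunits, and `mitochondrial_load` is a name introduced here.

```python
from collections import Counter

MT_PRESET_WEIGHT = 5.0  # example weight for mtDNA-encoded core subunits
EFFECT_WEIGHT = {"synonymous": 0.0, "missense": 1.5, "high_impact": 2.0}


def mitochondrial_load(variants):
    """variants: iterable of (gene, variant_type) pairs, one per variant site.

    Assumption of this sketch: the state index is the number of variant
    sites of a given type in a gene, so each (gene, type) pair is one term.
    """
    site_counts = Counter(variants)
    return sum(
        count * MT_PRESET_WEIGHT * EFFECT_WEIGHT[variant_type]
        for (gene, variant_type), count in site_counts.items()
    )


calls = [
    ("ND1", "missense"), ("ND1", "missense"),  # 2 * 5.0 * 1.5 = 15.0
    ("ND5", "synonymous"),                     # 1 * 5.0 * 0.0 = 0.0
    ("ND6", "high_impact"),                    # 1 * 5.0 * 2.0 = 10.0
]
b_mt = mitochondrial_load(calls)  # 25.0
```

Because the nuclear and mitochondrial loads share the same three-factor form, the two values are directly comparable before they are multiplied into the module index.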

[0069] This embodiment upgrades mitochondrial variation assessment from traditional maternal lineage tracing to respiratory chain functional stress mapping by calculating mitochondrial genome load index values. It overcomes the inherent defects of mtDNA analysis, such as the proliferation of synonymous mutations, signal being submerged by noise, and inability to quantify functional shifts. It provides a data foundation with complete symmetry, comparability, and high signal-to-noise ratio with the nuclear genome for the subsequent calculation of covariance index values.

[0070] Variation detection can include single nucleotide variant detection and insertion / deletion variant detection. Optionally, in the subgrouping method for closely related populations provided in this application embodiment, variation detection is performed on gene sequencing data using a preset variation detection tool to obtain variation information for each sample, including: performing single nucleotide variant detection on gene sequencing data to obtain a first detection result; performing insertion / deletion variant detection on gene sequencing data to obtain a second detection result; and labeling the first and second detection results using a preset variation detection tool to obtain variation information for each sample. The variation information includes at least the gene name, mutation site, and mutation type.

[0071] In some embodiments, single nucleotide variants are variants in which a single base in the genome is substituted. The first detection result is the output obtained after aligning the gene sequencing data, recalibrating it, and calling variants using a preset bioinformatics tool, and may include the location, reference base, variant base, and sample genotype. Insertion / deletion variants are insertions or deletions of one to several dozen bases in the genome, which may lead to reading frame shifts or changes in protein length. The second detection result is the result identified by a preset variant detection tool after local realignment of the gene sequencing data, and includes the variant location, reference sequence, insertion / deletion sequence, and genotype. The preset variant detection tool may be a functional annotation tool such as VEP, which performs biological annotation of variants based on a reference genome and database.
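The labeling step can be sketched as merging the two detection results and mapping each annotated consequence onto the three effect classes used by the weighting scheme. The consequence strings are standard Sequence Ontology terms of the kind reported by annotation tools such as VEP; the input record layout (`gene`, `chrom`, `pos`, `consequence`), the class names, and the function `label_variants` are hypothetical choices for this sketch, not an actual tool interface.

```python
from dataclasses import dataclass

# Sequence Ontology consequence terms mapped to the three effect classes.
CONSEQUENCE_CLASS = {
    "synonymous_variant": "synonymous",
    "missense_variant": "missense",
    "frameshift_variant": "high_impact",
    "stop_lost": "high_impact",
}


@dataclass(frozen=True)
class VariantInfo:
    gene: str          # gene name
    site: str          # mutation site, written as "chrom:pos"
    variant_type: str  # effect class used later for weighting


def label_variants(snv_calls, indel_calls):
    """Merge SNV and indel calls and keep those with a recognised consequence."""
    labeled = []
    for call in list(snv_calls) + list(indel_calls):
        variant_type = CONSEQUENCE_CLASS.get(call["consequence"])
        if variant_type is None:
            continue  # consequence outside the three classes is ignored here
        site = f'{call["chrom"]}:{call["pos"]}'
        labeled.append(VariantInfo(call["gene"], site, variant_type))
    return labeled


if __name__ == "__main__":
    # Hypothetical calls with made-up positions, for illustration only.
    snvs = [{"gene": "ND1", "chrom": "MT", "pos": 100, "consequence": "missense_variant"}]
    indels = [{"gene": "NDUFS1", "chrom": "2", "pos": 200, "consequence": "frameshift_variant"}]
    for info in label_variants(snvs, indels):
        print(info)
```

Ignoring unrecognised consequences is a simplification of this sketch; a fuller implementation would decide explicitly how to treat every annotation category.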

[0072] This embodiment achieves the extraction of variation information by detecting variations in gene sequencing data, and systematically captures and accurately labels all types of functional variations, providing a data foundation for subsequent functional gene module division and calculation of synergistic variation index values.

[0073] After obtaining the set of variation index values for each sample, subgroups can be divided by clustering. Optionally, in the subgroup division method for closely related populations provided in this application embodiment, dividing the sample set of closely related populations into multiple subgroups based on the set of variation index values for each sample includes: inputting the set of variation index values for each sample in the sample set of closely related populations into a preset clustering model, processing to obtain multiple clusters; and dividing multiple samples in each cluster into a subgroup to obtain multiple subgroups.

[0074] In some embodiments, the closely related population sample set is a collection of population samples with highly similar genetic backgrounds and geographical proximity. The preset clustering model can be an unsupervised machine learning algorithm (such as K-means clustering or hierarchical clustering), whose input is a high-dimensional numerical vector and whose output is a cluster structure automatically grouped by similarity, used to discover genetic subpopulations without preset labels. The sets of variation index values of the samples within each cluster are close to each other in a high-dimensional functional space.
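As a minimal sketch of the clustering step, the following pure-Python K-means groups samples by their variation index value sets. It assumes K-means with a deterministic farthest-point initialisation, which is a choice made here for reproducibility; the description leaves the clustering model open. The sample identifiers and index vectors are hypothetical.

```python
import math


def kmeans(vectors, k, max_iterations=100):
    """Plain Lloyd-style K-means with a deterministic farthest-point start."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        # Next seed: the sample farthest from its nearest existing centroid.
        seed = max(vectors, key=lambda v: min(math.dist(v, c) for c in centroids))
        centroids.append(list(seed))

    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def split_into_subgroups(sample_ids, labels):
    """Collect the sample IDs of each cluster into one subgroup."""
    subgroups = {}
    for sample_id, label in zip(sample_ids, labels):
        subgroups.setdefault(label, []).append(sample_id)
    return subgroups


# Hypothetical index value sets, one value per functional gene module (N = 5).
sample_ids = ["s1", "s2", "s3", "s4", "s5", "s6"]
index_sets = [
    [6.0, 1.0, 0.0, 2.0, 3.0], [5.5, 1.2, 0.0, 2.1, 2.8], [6.2, 0.9, 0.0, 1.8, 3.1],
    [1.0, 1.0, 0.0, 2.0, 0.5], [1.2, 1.1, 0.0, 2.2, 0.4], [0.8, 0.9, 0.0, 1.9, 0.6],
]
subgroups = split_into_subgroups(sample_ids, kmeans(index_sets, k=2))
```

In this toy input only the first and last modules differ between the two groups, so the clustering separates samples on exactly those module-level indices.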

[0075] For example, an in-depth comparison of the NMII (covariance index values) of samples from group X and group E revealed the modular selection pattern of the oxidative phosphorylation system in population evolution. Figure 2 is a schematic diagram of the covariance index values of respiratory chain complex I, including core subunits, accessory subunits, and assembly factors, according to embodiments of this application. As shown in Figure 2, the Index Value is the NMII value, and the Region covers four regions: H, E, T, and X. ANOVA p=0.191 means that the analysis-of-variance p-value across the four regions is 0.191, which is greater than the significance threshold of 0.05, indicating no statistically significant difference. The t-test p=0.023 between regions E and X means that the t-test p-value is 0.023, which is less than the significance threshold of 0.05, indicating a statistically significant difference. Figure 3 is a schematic diagram of the covariance index values of the core subunits of respiratory chain complex I, according to an embodiment of this application. As shown in Figure 3, ANOVA p=0.234 means that the analysis-of-variance p-value across the four regions is 0.234, which is greater than the significance threshold of 0.05, so there is no statistically significant difference. The t-test p=0.037 between regions E and X means that the t-test p-value is 0.037, which is less than the significance threshold of 0.05, so there is a statistically significant difference.

[0076] Figure 4 is a schematic diagram of the covariance index values of respiratory chain complex III, including the core subunits, accessory subunits, and assembly factors, provided in the embodiments of this application. As shown in Figure 4, ANOVA p=0.500 means that the analysis-of-variance p-value across the four regions is 0.500, which is greater than the significance threshold of 0.05, so there is no statistically significant difference. Figure 5 is a schematic diagram of the covariance index values of the core subunits of respiratory chain complex III, according to an embodiment of this application. As shown in Figure 5, ANOVA p=0.843 is greater than the significance threshold of 0.05, so there is no statistically significant difference. Figure 6 is a schematic diagram of the covariance index values of respiratory chain complex IV, including the core subunits, accessory subunits, and assembly factors, according to the embodiments of this application. As shown in Figure 6, ANOVA p=0.395 is greater than the significance threshold of 0.05, so there is no statistically significant difference. Figure 7 is a schematic diagram of the covariance index values of the core subunits of respiratory chain complex IV, according to an embodiment of this application. As shown in Figure 7, since there is no nuclear-encoded core subunit in respiratory chain complex IV, ANOVA p=nan indicates that no data are available and the statistic cannot be computed. Figure 8 is a schematic diagram of the covariance index values of respiratory chain complex V, including core subunits, accessory subunits, and assembly factors, according to the embodiments of this application. As shown in Figure 8, ANOVA p=0.325 is greater than the significance threshold of 0.05, so there is no statistically significant difference. Figure 9 is a schematic diagram of the covariance index values of the core subunits of respiratory chain complex V, according to an embodiment of this application. As shown in Figure 9, ANOVA p=0.823 is greater than the significance threshold of 0.05, so there is no statistically significant difference. Figure 10 is a schematic diagram of the covariance index values of the overall assembly factor module, according to the embodiments of this application. As shown in Figure 10, ANOVA p=0.195 is greater than the significance threshold of 0.05, so there is no statistically significant difference across the four regions. The t-test p=0.043 between regions E and X is less than the significance threshold of 0.05, so there is a statistically significant difference.
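The two kinds of test reported for each figure rest on the following statistics, sketched here in pure Python. This is an illustration only: it computes the one-way ANOVA F statistic and a pooled-variance (Student) two-sample t statistic, and the pooled form is an assumption because the description does not say which t-test variant was used. Converting either statistic to a p-value requires the F or t distribution and is left out; the region values are hypothetical.

```python
import math
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic; returns nan when it is undefined."""
    values = [x for g in groups for x in g]
    grand = mean(values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    if df_between <= 0 or df_within <= 0 or ss_within == 0:
        return math.nan  # e.g. every index value identical
    return (ss_between / df_between) / (ss_within / df_within)


def pooled_t(a, b):
    """Student two-sample t statistic with pooled variance."""
    df = len(a) + len(b) - 2
    if df <= 0:
        return math.nan
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    pooled_var = (ss_a + ss_b) / df
    if pooled_var == 0:
        return math.nan
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


# Hypothetical NMII values per region, for illustration only.
regions = {
    "H": [1.0, 2.0, 3.0],
    "E": [1.0, 2.0, 3.0],
    "T": [2.0, 3.0, 4.0],
    "X": [3.0, 4.0, 5.0],
}
f_all = anova_f(list(regions.values()))
t_e_x = pooled_t(regions["E"], regions["X"])
```

The omnibus F statistic pools all four regions, while the t statistic looks only at E and X, which is why the two tests can lead to different conclusions on the same figure.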

[0077] As shown in Figure 2 and Figure 3, respiratory chain complex I is a core stress point for genetic differentiation, with the genetic differences between the two groups highly enriched at the entry point of the respiratory chain. The t-test on the covariance index values of respiratory chain complex I between regions E and X gave a p-value of 0.023, and the t-test on its core module (Complex I Core) gave a p-value of 0.037; both differences are statistically significant. This demonstrates that group X underwent significant nuclear-mitochondrial genome co-evolution at the source of energy metabolism, forming a specific genetic homeostasis. As shown in Figure 10, the assembly process exhibits systematic optimization. In addition to the specific complexes, the t-test on the assembly factor module, which assesses the overall stability of the system, also reached significance (p=0.043). This means that the evolutionary advantage of group X is reflected not only in individual subunits but also in the systematic adaptation between the assembly factors encoded by nuclear genes and the mitochondrial genome, thereby ensuring the efficiency of complex assembly under high-intensity metabolic environments.

[0078] In stark contrast, the genetic conservation of the core functional modules was not significantly different between the two groups. Core modules of respiratory chain complex IV and other complexes (such as respiratory chain complex III and V) showed no significant differences (P>0.05). This locally significant but globally conserved distribution strongly demonstrates that covariance index values can accurately eliminate background noise and pinpoint truly environmentally driven genetic signals. Through specific optimization of respiratory chain complex I (core + overall) and the overall assembly process, Group X showed more robust nuclear-mitochondrial genome synergy than Group E. This provides key molecular markers for mitochondrial function assessment and metabolic health screening in this population.

[0079] At the single-gene level, by comparing the variation load of group X and group E, a series of highly significant driver genes were screened out. Table 1 is a statistical table of the driver gene identification results of group X and group E.

[0080] Table 1

[0081]

[0082] The last five genes are all located in the membrane components of respiratory chain complex I. Figure 11 is a driver gene volcano plot provided according to the embodiments of this application. Validation analysis was performed using real whole-exome and mitochondrial genome data from the geographic subgroups (X and E) of a certain region. The NMII computational model was implemented in Python scripts, and the MitoCarta 3.0 database was used for precise gene function filtering. A dynamic weighting strategy with synonymous mutation weights set to zero was employed in the experiment, which improved the ability to capture functional evolutionary signals. See Table 1 and Figure 11. Comparison with traditional population differentiation methods: when the traditional population differentiation index FST is used to evaluate populations X and E globally, the mean value is only -0.0093. According to population genetic evaluation criteria, this value means that, at the whole-genome level, the genetic backgrounds of the two populations are highly homogeneous, and traditional frequency-based statistics cannot distinguish between these two regional populations.
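A negative FST such as the one reported above is possible because common FST estimators subtract a sampling-variance correction. The sketch below uses Hudson's per-site estimator purely as an illustration; the description does not state which FST estimator was applied, so this choice, the function name `hudson_fst`, and the allele frequencies are assumptions.

```python
import math


def hudson_fst(p1, n1, p2, n2):
    """Hudson's per-site FST estimator.

    p1, p2: sample allele frequencies in the two populations.
    n1, n2: number of sampled allele copies in each population.
    """
    numerator = (
        (p1 - p2) ** 2
        - p1 * (1 - p1) / (n1 - 1)
        - p2 * (1 - p2) / (n2 - 1)
    )
    denominator = p1 * (1 - p2) + p2 * (1 - p1)
    if denominator == 0:
        return math.nan  # site is fixed for the same allele in both samples
    return numerator / denominator


# Hypothetical frequencies: identical in both populations.
same = hudson_fst(0.5, 10, 0.5, 10)     # negative: only the correction remains
# Hypothetical frequencies: fixed for different alleles.
fixed = hudson_fst(1.0, 10, 0.0, 10)    # 1.0
```

When the two populations have nearly identical allele frequencies, the squared frequency difference is smaller than the sampling correction, and the estimate falls slightly below zero, which is consistent with the interpretation of a highly homogeneous genetic background.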

[0083] This embodiment proposes a synergistic variation index that focuses on the specific functional module of respiratory chain complex I and introduces a variation function weighting algorithm. The calculation results show that the NMII index of respiratory chain complex I differs significantly between populations X and E (p=0.023). Regarding p=0.023: because the genetic backgrounds of X and E are very similar, common synonymous mutations can generate substantial genetic drift noise. Through the weighted filtering in this embodiment, the weight of synonymous mutations is reduced to 0, which significantly improves the signal-to-noise ratio of the core module and moves the p-value from non-significant to significant (p<0.05). The variation load of a single gene is insufficient to distinguish between populations with similar genetic backgrounds (such as X and E). The synergistic variation index accumulates the multi-gene synergistic effects within the respiratory chain complex I pathway, revealing deep-seated metabolic evolutionary differences between populations.

[0084] The algorithm accurately identified the core driver genes contributing most to population differences. NDUFB4 showed the most significant difference, with a statistically significant variation frequency difference of 19.74% between the two groups (p=0.0070). Considering the environmental conditions of X and E, X is an extremely cold grassland with long, windy winters, requiring a high level of heat production, while E is a high-altitude desert at higher elevation with drier conditions, requiring higher oxygen utilization efficiency. The differences between the two groups are likely due to mitochondrial functional fine-tuning in response to harsher cold conditions. Experiments demonstrated that although some individual genes (such as NDUFS3, p=0.098) did not reach independent significance, the cumulative effect of the NMII product successfully amplified the functional deviation of minor variations, achieving accurate population discrimination.
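A single-gene comparison of this kind can be sketched as a carrier-frequency difference plus a test on the 2x2 table of carriers and non-carriers. The description does not name the test behind the reported p-values, so the two-sided Fisher exact test below is an assumption, and the counts are hypothetical and unrelated to Table 1.

```python
from math import comb


def frequency_difference(carriers_a, total_a, carriers_b, total_b):
    """Difference in carrier frequency between two groups."""
    return carriers_a / total_a - carriers_b / total_b


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the table [[a, b], [c, d]].

    Rows are the two groups; columns are carriers and non-carriers.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x carriers in the first group.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= observed * (1 + 1e-9)
    )


# Hypothetical counts: 12 of 40 carriers in one group, 4 of 40 in the other.
diff = frequency_difference(12, 40, 4, 40)
p_value = fisher_exact_two_sided(12, 28, 4, 36)
```

A gene that misses significance in such a test on its own can still contribute to the module-level index, because its weighted load enters the sum that is then multiplied with the load of the other genome.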

[0085] Through in-depth analysis of the NMII index, this embodiment reveals that the nuclear genome-mitochondrial genome interaction pressure among populations is not randomly distributed throughout the respiratory chain. Experimental data confirm that the core catalytic modules of the respiratory chain remain highly conserved across different geographical populations, while environmental adaptive pressure acts specifically on assembly auxiliary modules. This finding makes it possible to pinpoint key nodes driving population differentiation (such as COX15 and NDUFS1) and to filter out irrelevant background variation, thereby improving detection sensitivity and specificity. It can distinguish geographically proximate, genetically similar subgroups, solving the problem that traditional analytical methods fail to capture weak signals. The results show that the differences are concentrated in the P-module (proton pump module), the energy metabolism source, and the assembly molecules of respiratory chain complex I, which closely matches the heat production requirements of the extremely cold environment (X) and the oxygen utilization requirements of the high-altitude environment (E), indicating that the features discovered by the algorithm have real biological and physical significance.

[0086] This embodiment uses the set of variation index values as clustering input and a preset clustering model to divide the sample set of closely related populations into subgroups. This achieves unsupervised subgroup division based on functional metabolic pressure rather than neutral drift signals in closely related populations with highly similar genetic backgrounds, thereby improving the accuracy of genetic characteristic analysis of closely related populations.

[0087] It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the flowchart, in some cases the steps shown or described may be executed in a different order than that shown here.

[0088] This application also provides a subgrouping device for closely related populations. It should be noted that this subgrouping device can be used to execute the subgrouping method for closely related populations provided in this application. The subgrouping device for closely related populations provided in this application is described below.

[0089] Figure 12 is a schematic diagram of a subgrouping device for closely related populations provided according to an embodiment of this application. As shown in Figure 12, the device includes:

[0090] The acquisition unit 1201 is used to acquire the gene sequencing data of each sample in the sample set of closely related populations, and to perform mutation detection on the gene sequencing data through a preset mutation detection tool to obtain the mutation information of each sample.

[0091] The first partitioning unit 1202 is used to partition the genes in the gene sequencing data of each sample into functional gene modules according to a preset biomedical database, so as to obtain N functional gene modules, wherein each functional gene module contains at least one gene in the gene sequencing data, and N is a positive integer.

[0092] The first determining unit 1203 is used to determine the nuclear genome and mitochondrial genome in each functional gene module, and to determine the variation status of each nuclear gene in the nuclear genome and the variation status of each mitochondrial gene in the mitochondrial genome based on variation information.

[0093] The second determining unit 1204 is used to determine the co-variation index value based on the variation state of each nuclear gene and the variation state of each mitochondrial gene, and obtain the variation index value of each functional gene module. The co-variation index value is used to identify the environmental adaptive evolution differences between different subgroups of closely related populations.

[0094] The second partitioning unit 1205 is used to combine the N variation index values of the N functional gene modules of the sample into a variation index value set, and to divide the sample set of closely related populations into multiple subgroups based on the variation index value set of each sample.
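The division of labour among the five units can be sketched as one function that chains them. This is only an architectural illustration: the function `subgroup_pipeline` and its four callable parameters are hypothetical placeholders for the units, and the mapping of callables to unit numbers in the comments is an interpretation of the list above.

```python
def subgroup_pipeline(sequencing_data, detect_variants, partition_modules,
                      module_index, cluster):
    """Chain the units: detection, partitioning, index calculation, clustering.

    sequencing_data: {sample_id: raw data}; all four callables are supplied
    by the caller and stand in for the units described above.
    """
    index_sets = {}
    for sample_id, data in sequencing_data.items():
        variants = detect_variants(data)    # acquisition unit 1201
        modules = partition_modules(data)   # first partitioning unit 1202
        # Determining units 1203 and 1204: one index value per module.
        index_sets[sample_id] = [module_index(m, variants) for m in modules]

    # Second partitioning unit 1205: cluster the index value sets.
    sample_ids = list(index_sets)
    labels = cluster([index_sets[s] for s in sample_ids])
    subgroups = {}
    for sample_id, label in zip(sample_ids, labels):
        subgroups.setdefault(label, []).append(sample_id)
    return subgroups


if __name__ == "__main__":
    # Stub callables and made-up data, only to show how the pieces connect.
    data = {"s1": [5.0, 1.0], "s2": [5.2, 1.1], "s3": [0.1, 1.0]}
    print(subgroup_pipeline(
        data,
        detect_variants=lambda d: d,
        partition_modules=lambda d: [0, 1],
        module_index=lambda module, variants: variants[module],
        cluster=lambda vectors: [0 if v[0] > 2 else 1 for v in vectors],
    ))
```

Passing the steps in as callables keeps the sketch independent of any particular variant caller, database, or clustering model.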

[0095] In the subgrouping device for closely related populations provided in this application embodiment, the acquisition unit 1201 acquires gene sequencing data of each sample in a closely related population sample set and performs mutation detection on the gene sequencing data using a preset mutation detection tool to obtain mutation information for each sample; the first partitioning unit 1202 divides the genes in the gene sequencing data of each sample into functional gene modules according to a preset biomedical database to obtain N functional gene modules, wherein each functional gene module contains at least one gene from the gene sequencing data, and N is a positive integer; the first determining unit 1203 determines the nuclear genome and mitochondrial genome in each functional gene module, and determines the mutation status of each nuclear gene in the nuclear genome and the mutation status of each mitochondrial gene in the mitochondrial genome based on the mutation information; the second determining unit 1204 determines a covariance index value based on the variation status of each nuclear gene and each mitochondrial gene to obtain the variation index value for each functional gene module, wherein the covariance index value is used to identify environmental adaptive evolutionary differences between different subgroups of closely related populations; and the second partitioning unit 1205 combines the N variation index values of the N functional gene modules of the sample into a variation index value set and, based on the variation index value set of each sample, divides the closely related population sample set into multiple subgroups. This solves the problem of low discrimination accuracy of closely related subgroups in related technologies. By integrating the variation information of the nuclear genome and mitochondrial genome and calculating the covariance index value between them, covariant evolutionary pressure is quantified at the functional gene module level, so that the systematic imprint left by environmental adaptive evolution can be accurately captured based on the covariance index value. This achieves high sensitivity and high specificity in the discrimination of closely related population subgroups with highly similar genetic backgrounds, thereby improving the accuracy of genetic characteristic analysis of closely related populations.

[0096] Optionally, in the subgrouping device for closely related populations provided in this application embodiment, the first subgrouping unit 1202 includes: a first determining module, used to determine nuclear genes and mitochondrial genes in gene sequencing data through a first biomedical database, wherein the first biomedical database is used to store the composition and functional classification of the mitochondrial proteome; a classifying module, used to classify each nuclear gene and each mitochondrial gene in the gene sequencing data through a second biomedical database to obtain a classification result, wherein the classification result includes the functional gene module to which each nuclear gene belongs and the functional gene module to which each mitochondrial gene belongs, the second biomedical database is used to store N functional gene modules, and the nuclear genes and mitochondrial genes contained in each functional gene module; and a second determining module, used to determine the N functional gene modules in the classification result, wherein the N functional gene modules include at least one of the following: respiratory chain complex I, respiratory chain complex III, respiratory chain complex IV, respiratory chain complex V, and assembly factors.
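The two-database lookup described in this paragraph can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the dictionary `MODULE_DB` is a hypothetical stand-in for the second biomedical database, the six gene entries are only examples, and recognising mitochondrial genes by the HGNC-style `MT-` prefix is an assumption that replaces the first biomedical database.

```python
from collections import defaultdict

# Hypothetical stand-in for the "second biomedical database":
# gene symbol -> functional gene module. Entries are examples only.
MODULE_DB = {
    "NDUFS1": "complex_I",
    "MT-ND1": "complex_I",
    "UQCRC1": "complex_III",
    "MT-CYB": "complex_III",
    "COX4I1": "complex_IV",
    "MT-CO1": "complex_IV",
}


def is_mitochondrial(gene: str) -> bool:
    # Assumption: mtDNA-encoded genes carry the "MT-" prefix.
    # The source text uses a first biomedical database for this step instead.
    return gene.startswith("MT-")


def classify_genes(genes):
    """Group genes by functional module, split into nuclear and mitochondrial."""
    modules = defaultdict(lambda: {"nuclear": [], "mitochondrial": []})
    for gene in genes:
        module = MODULE_DB.get(gene)
        if module is None:
            continue  # gene does not belong to any stored module
        key = "mitochondrial" if is_mitochondrial(gene) else "nuclear"
        modules[module][key].append(gene)
    return dict(modules)


result = classify_genes(["NDUFS1", "MT-ND1", "BRCA1", "MT-CYB"])
# N is the number of modules found for this gene list
print(len(result), sorted(result))
```

Genes that are absent from the module table are simply skipped here; a real pipeline might instead log them or assign them to a catch-all module.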

[0097] Optionally, in the subgrouping device for closely related populations provided in this application embodiment, the second determining unit 1204 includes: a third determining module, used to determine the gene category of each nuclear gene in the functional gene module, and calculate the nuclear genome load index value based on the gene category and variation status of each nuclear gene; a first calculating module, used to calculate the mitochondrial genome load index value based on the variation status of each mitochondrial gene; and a second calculating module, used to calculate the product of the nuclear genome load index value and the mitochondrial genome load index value to obtain the co-variation index value of the functional gene module.

[0098] Optionally, in the subgrouping device for closely related populations provided in this application embodiment, the third determining module includes: a first determining submodule, used to determine the basic weight corresponding to the gene category of each nuclear gene, and to determine the first variation effect weight of the variation type of the nuclear gene; a second determining submodule, used to determine the first state index value from the variation state of the nuclear gene, calculate the product of the first state index value, the basic weight and the first variation effect weight to obtain the nuclear gene load index value; and a first calculation submodule, used to calculate the cumulative value of the load index values of all nuclear genes in the functional gene module to obtain the nuclear genome load index value.

[0099] Optionally, in the subgrouping device for closely related populations provided in this application embodiment, the first calculation module includes: a third determining submodule, used to determine the preset weight of mitochondrial genes and determine the second variation effect weight of the variation type of mitochondrial genes; a fourth determining submodule, used to determine the second state index value from the variation state of mitochondrial genes, calculate the product of the second state index value, the preset weight, and the second variation effect weight to obtain the mitochondrial gene load index value; and a second calculation submodule, used to calculate the cumulative value of the load index values of all mitochondrial genes in the functional gene module to obtain the mitochondrial genome load index value.
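The calculations in paragraphs [0097] to [0099] reduce to two weighted sums and one product. The sketch below illustrates that arithmetic under stated assumptions: the class `GeneVariant`, the placeholder gene names, the 0/1 encoding of the state index value, and all numeric weights are hypothetical, since the text does not fix concrete values.

```python
from dataclasses import dataclass


@dataclass
class GeneVariant:
    gene: str             # placeholder gene name
    state: int            # state index value; 1 = variant present, 0 = absent (assumed encoding)
    weight: float         # basic weight (nuclear gene) or preset weight (mitochondrial gene)
    effect_weight: float  # variation effect weight of the variation type


def genome_load(variants):
    """Cumulative value of per-gene load: state * weight * effect weight."""
    return sum(v.state * v.weight * v.effect_weight for v in variants)


def covariation_index(nuclear, mitochondrial):
    """Product of the nuclear and mitochondrial genome load index values."""
    return genome_load(nuclear) * genome_load(mitochondrial)


# Illustrative values only for one functional gene module.
nuclear = [
    GeneVariant("NUC_A", 1, 1.0, 0.5),
    GeneVariant("NUC_B", 1, 2.0, 1.0),
    GeneVariant("NUC_C", 0, 1.0, 1.0),  # no variant, contributes 0
]
mitochondrial = [
    GeneVariant("MT_A", 1, 1.0, 1.0),
    GeneVariant("MT_B", 1, 1.0, 0.5),
]

print(genome_load(nuclear))                       # 2.5
print(genome_load(mitochondrial))                 # 1.5
print(covariation_index(nuclear, mitochondrial))  # 3.75
```

Because the index is a product, a module scores zero whenever either genome carries no weighted variant, so only modules with variation on both sides contribute to the subgrouping signal.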

[0100] Optionally, in the subgrouping device for closely related populations provided in this application embodiment, the acquisition unit 1201 includes: a first detection module for performing single nucleotide variant detection on the gene sequencing data to obtain a first detection result; a second detection module for performing insertion/deletion variant detection on the gene sequencing data to obtain a second detection result; and a labeling module for labeling the first and second detection results using the preset mutation detection tool to obtain mutation information for each sample, wherein the mutation information includes at least gene name, mutation site, and mutation type.
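A rough sketch of how the two detection results might be merged and labeled is given below. Everything in it is an assumption made for illustration: the `Call` record, the toy interval table `GENE_INTERVALS` with placeholder gene names and coordinates, and the use of reference and alternate allele lengths to assign a structural type. It does not reproduce any particular detection or annotation tool, and it reports only the structural class of a variant, not the functional consequence an annotation tool would add.

```python
from typing import NamedTuple, Optional


class Call(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str


# Hypothetical gene intervals (chromosome, start, end, gene name); toy values only.
GENE_INTERVALS = [
    ("chr1", 100, 200, "GENE_A"),
    ("chrM", 10, 50, "GENE_B"),
]


def gene_at(chrom: str, pos: int) -> Optional[str]:
    """Return the name of the gene whose interval contains the position, if any."""
    for g_chrom, start, end, name in GENE_INTERVALS:
        if chrom == g_chrom and start <= pos <= end:
            return name
    return None


def variant_type(ref: str, alt: str) -> str:
    """Structural class derived from allele lengths."""
    if len(ref) == len(alt):
        return "SNV" if len(ref) == 1 else "MNV"
    return "insertion" if len(alt) > len(ref) else "deletion"


def label_calls(snv_calls, indel_calls):
    """Merge both detection results and attach gene name, site, and type."""
    records = []
    for call in sorted(set(snv_calls) | set(indel_calls)):
        records.append({
            "gene": gene_at(call.chrom, call.pos),
            "site": f"{call.chrom}:{call.pos}",
            "type": variant_type(call.ref, call.alt),
        })
    return records


for record in label_calls([Call("chr1", 150, "A", "G")],
                          [Call("chrM", 20, "AT", "A")]):
    print(record)
```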

[0101] Optionally, in the subgrouping device for closely related populations provided in this application embodiment, the second subgrouping unit 1205 includes: an input module, used to input the variation index value set of each sample in the closely related population sample set into a preset clustering model for processing to obtain multiple clusters; and a division module, used to assign the samples in each cluster to one subgroup, thereby obtaining multiple subgroups.
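The text does not name the preset clustering model. As one possible illustration, the sketch below uses a small pure-Python k-means with deterministic farthest-first seeding; the choice of k-means, the function names, the sample identifiers, and the two-module index vectors are all assumptions.

```python
import math


def farthest_first(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    pick the point farthest from every centroid chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids


def kmeans_labels(points, k, max_iter=100):
    """Assign each point to one of k clusters; returns one label per point."""
    centroids = farthest_first(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def subgroup(index_sets, k):
    """Map each cluster label to the sample IDs assigned to that cluster."""
    ids = list(index_sets)
    labels = kmeans_labels([index_sets[i] for i in ids], k)
    groups = {}
    for sample_id, lab in zip(ids, labels):
        groups.setdefault(lab, []).append(sample_id)
    return groups


# Each vector is one sample's variation index value set (N = 2 modules here).
samples = {
    "s1": [0.0, 0.1],
    "s2": [0.1, 0.0],
    "s3": [5.0, 5.1],
    "s4": [5.2, 5.0],
}
print(subgroup(samples, 2))
```

In this toy input the index vectors fall into two well-separated groups, so the samples in each cluster form one subgroup. With real data the number of clusters and the distance measure would need to be chosen and validated for the population under study.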

[0102] The subgrouping device for closely related populations includes a processor and a memory. The aforementioned acquisition unit 1201, first subgrouping unit 1202, first determining unit 1203, second determining unit 1204, and second subgrouping unit 1205 are all stored in the memory as program units. The processor executes these program units stored in the memory to perform the corresponding functions.

[0103] The processor contains a core, which retrieves the corresponding program units from the memory. One or more cores may be provided, and the accuracy of genetic characteristic analysis of closely related populations is improved by adjusting the core parameters.

[0104] The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.

[0105] This invention provides a computer-readable storage medium storing a program that, when executed by a processor, implements a method for subgrouping closely related populations.

[0106] This invention provides a processor for running a program, wherein the program, when run, executes the method for subgrouping closely related populations.

[0107] Figure 13 is a schematic diagram of an electronic device provided according to an embodiment of this application. As shown in Figure 13, the electronic device 1301 includes a processor, a memory, and a program stored in the memory and executable on the processor. When the processor executes the program, it implements the method for subgrouping closely related populations. The device described here may be a server, a PC, a tablet (PAD), a mobile phone, or the like.

[0108] This application also provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps of the method for subgrouping closely related populations.

[0109] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0110] This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of this application. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions, executed by the processor of the computer or other programmable data processing device, produce a means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0111] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0112] These computer program instructions may also be loaded onto a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0113] In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

[0114] Memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

[0115] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0116] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.

[0117] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0118] The above are merely embodiments of this application and are not intended to limit the scope of this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of the claims of this application.

Claims

1. A method for subgrouping closely related populations, characterized in that the method comprises:
acquiring gene sequencing data of each sample in a closely related population sample set, and performing mutation detection on the gene sequencing data using a preset mutation detection tool to obtain mutation information of each sample;
for the gene sequencing data of each sample, dividing the genes in the gene sequencing data into functional gene modules according to a preset biomedical database to obtain N functional gene modules, wherein each functional gene module contains at least one gene in the gene sequencing data, and N is a positive integer;
for each functional gene module, determining the nuclear genome and the mitochondrial genome in the functional gene module, and determining the variation status of each nuclear gene in the nuclear genome and the variation status of each mitochondrial gene in the mitochondrial genome based on the mutation information;
determining a co-variation index value based on the variation status of each nuclear gene and the variation status of each mitochondrial gene to obtain the variation index value of each functional gene module, wherein the co-variation index value is used to identify differences in environmental adaptive evolution between different subgroups of closely related populations; and
combining the N variation index values of the N functional gene modules of the sample into a variation index value set, and dividing the closely related population sample set into multiple subgroups based on the variation index value set of each sample.

2. The method according to claim 1, characterized in that dividing the genes in the gene sequencing data into functional gene modules according to a preset biomedical database to obtain N functional gene modules comprises:
determining the nuclear genes and mitochondrial genes in the gene sequencing data through a first biomedical database, wherein the first biomedical database is used to store the composition and functional classification of the mitochondrial proteome;
classifying each nuclear gene and each mitochondrial gene in the gene sequencing data through a second biomedical database to obtain a classification result, wherein the classification result includes the functional gene module to which each nuclear gene belongs and the functional gene module to which each mitochondrial gene belongs, and the second biomedical database is used to store the N functional gene modules and the nuclear genes and mitochondrial genes contained in each functional gene module; and
determining the N functional gene modules in the classification result, wherein the N functional gene modules include at least one of the following: respiratory chain complex I, respiratory chain complex III, respiratory chain complex IV, respiratory chain complex V, and assembly factors.

3. The method according to claim 1, characterized in that determining the co-variation index value based on the variation status of each nuclear gene and the variation status of each mitochondrial gene comprises:
determining the gene category of each nuclear gene in the functional gene module, and calculating a nuclear genome load index value based on the gene category and the variation status of each nuclear gene;
calculating a mitochondrial genome load index value based on the variation status of each mitochondrial gene; and
calculating the product of the nuclear genome load index value and the mitochondrial genome load index value to obtain the co-variation index value of the functional gene module.

4. The method according to claim 3, characterized in that calculating the nuclear genome load index value based on the gene category and the variation status of each nuclear gene comprises:
determining the basic weight corresponding to the gene category of each nuclear gene, and determining the first variation effect weight of the variation type of the nuclear gene;
determining a first state index value from the variation status of the nuclear gene, and calculating the product of the first state index value, the basic weight, and the first variation effect weight to obtain the load index value of the nuclear gene; and
calculating the cumulative value of the load index values of all nuclear genes in the functional gene module to obtain the nuclear genome load index value.

5. The method according to claim 3, characterized in that calculating the mitochondrial genome load index value based on the variation status of each mitochondrial gene comprises:
determining the preset weight of the mitochondrial gene, and determining the second variation effect weight of the variation type of the mitochondrial gene;
determining a second state index value from the variation status of the mitochondrial gene, and calculating the product of the second state index value, the preset weight, and the second variation effect weight to obtain the load index value of the mitochondrial gene; and
calculating the cumulative value of the load index values of all mitochondrial genes in the functional gene module to obtain the mitochondrial genome load index value.

6. The method according to claim 1, characterized in that performing mutation detection on the gene sequencing data using a preset mutation detection tool to obtain mutation information of each sample comprises:
performing single nucleotide variant detection on the gene sequencing data to obtain a first detection result;
performing insertion/deletion variant detection on the gene sequencing data to obtain a second detection result; and
labeling the first detection result and the second detection result using the preset mutation detection tool to obtain the mutation information of each sample, wherein the mutation information includes at least gene name, mutation site, and mutation type.

7. The method according to claim 1, characterized in that dividing the closely related population sample set into multiple subgroups based on the variation index value set of each sample comprises:
inputting the variation index value set of each sample in the closely related population sample set into a preset clustering model to obtain multiple clusters; and
assigning the samples in each cluster to one subgroup to obtain the multiple subgroups.

8. A subgrouping device for closely related populations, characterized in that the device comprises:
an acquisition unit, used to acquire gene sequencing data of each sample in a closely related population sample set, and to perform mutation detection on the gene sequencing data using a preset mutation detection tool to obtain mutation information of each sample;
a first partitioning unit, used to partition the genes in the gene sequencing data of each sample into functional gene modules according to a preset biomedical database to obtain N functional gene modules, wherein each functional gene module contains at least one gene in the gene sequencing data, and N is a positive integer;
a first determining unit, used to determine, for each functional gene module, the nuclear genome and the mitochondrial genome in the functional gene module, and to determine the variation status of each nuclear gene in the nuclear genome and the variation status of each mitochondrial gene in the mitochondrial genome based on the mutation information;
a second determining unit, used to determine a co-variation index value based on the variation status of each nuclear gene and the variation status of each mitochondrial gene to obtain the variation index value of each functional gene module, wherein the co-variation index value is used to identify differences in environmental adaptive evolution between different subgroups of closely related populations; and
a second partitioning unit, used to combine the N variation index values of the N functional gene modules of the sample into a variation index value set, and to partition the closely related population sample set into multiple subgroups based on the variation index value set of each sample.

9. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method for subgrouping closely related populations according to any one of claims 1 to 7.

10. An electronic device, characterized in that it comprises one or more processors and a memory, the memory being used to store one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for subgrouping closely related populations according to any one of claims 1 to 7.