Cell type prediction method and system based on multivariate feature set fusion

By using a multi-feature set fusion method and deep learning models to process single-cell Hi-C data, the accuracy and applicability issues of cell type identification for complex datasets in existing technologies are solved, and efficient cell type prediction is achieved.

CN117633630B · Active · Publication Date: 2026-06-30 · SHANDONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
SHANDONG UNIV
Filing Date
2023-12-13
Publication Date
2026-06-30

Abstract

This invention belongs to the field of single-cell Hi-C data processing and provides a cell type prediction method and system based on multi-feature set fusion. The cell type prediction method based on multi-feature set fusion includes acquiring single-cell Hi-C data and preprocessing it to obtain a chromosome contact sparse matrix; then spatially smoothing the chromosome contact sparse matrix to obtain a chromosome contact enhancement matrix; extracting small-domain contact probability feature sets from the chromosome contact sparse matrix; and extracting smoothed small-domain contact probability feature sets and smoothed bin contact probability feature sets from the chromosome contact enhancement matrix; using a pre-trained fusion classification model to extract corresponding features from the small-domain contact probability feature sets, smoothed small-domain contact probability feature sets, and smoothed bin contact probability feature sets respectively; fusing the extracted corresponding features; and finally predicting the cell type based on the fused features.

Description

Technical Field

[0001] This invention belongs to the field of single-cell Hi-C data processing, and particularly relates to a cell type prediction method and system based on multivariate feature set fusion.

Background Technology

[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.

[0003] Single-cell Hi-C technology, an extension of Hi-C technology, has emerged and revolutionized the study of 3D genome structure. This innovative approach allows for the classification of large, heterogeneous cell populations based on single-cell Hi-C data. However, sequencing single-cell Hi-C data from cell populations presents significant challenges due to the presence of multiple cell types. Inaccurate identification of individual cells in experiments can mask the specificity within heterogeneous cell populations, hindering detailed studies of the genomic structure of each unique cell. Furthermore, accurate cell classification not only opens new avenues for exploring 3D genome structure but also facilitates simplified analysis of genome structure within homogeneous cell types. Therefore, accurate cell type identification has become a central research focus. However, traditional biological methods for cell type detection are prohibitively expensive experimentally.

[0004] Therefore, computational methods for identifying cell types based on single-cell Hi-C data are essential. However, existing methods relying on single-cell Hi-C data, such as scHiCStackL, still have considerable room for improvement in accurately predicting unknown cell types. While scHiCStackL addresses the challenge of predicting unknown cell types, its applicability is limited to simple datasets (such as those used by Ramani et al. and Flyamer et al.), lacking the ability to scale to more complex datasets. Consequently, the effectiveness of high-precision classification algorithms in identifying unknown cells is constrained.

[0005] Patent document CN113160886A, entitled "Cell Type Prediction System Based on Single-Cell Hi-C Data," provides a cell type prediction system based on single-cell Hi-C data, comprising: a data preprocessing module configured to: for single-cell Hi-C data, divide a chromosome into several non-overlapping bins according to a pre-set resolution, and then match the information to form a contact matrix; and a neural network module configured to: process the contact matrix processed by the data preprocessing module and output four cell types predicted by the model. This document only initially solves the problem of location-based cell type identification, but its applicability is limited to simple datasets and lacks extension to complex datasets. Therefore, the high-precision classification algorithm is limited in identifying unknown cell types.

Summary of the Invention

[0006] To address the technical problems mentioned above, this invention provides a cell type prediction method and system based on multi-feature set fusion, which improves the accuracy of cell type classification by fusing multiple features from single-cell Hi-C data.

[0007] To achieve the above objectives, the present invention adopts the following technical solution:

[0008] The first aspect of the present invention provides a cell type prediction method based on multivariate feature set fusion.

[0009] A cell type prediction method based on multivariate feature set fusion, comprising:

[0010] Acquiring single-cell Hi-C data and preprocessing it to obtain a chromosome contact sparse matrix, and then spatially smoothing the chromosome contact sparse matrix to obtain a chromosome contact enhancement matrix.

[0011] Extracting small-domain contact probability feature sets from intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact sparse matrix; extracting smoothed small-domain contact probability feature sets from intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact enhancement matrix; and extracting smoothed bin contact probability feature sets characterizing the contact frequency distribution from the chromosome contact enhancement matrix.

[0012] The pre-trained fusion classification model is used to extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively. The extracted corresponding features are then fused, and the cell type is predicted based on the fused features.

[0013] As one implementation, each value in the small-domain contact probability feature set is the total number of contacts within the preset region divided by the number of contacts in the chromosome contact sparse matrix.

[0014] As one implementation, each value in the smoothed small-domain contact probability feature set is the total number of contacts within the preset region divided by the number of contacts in the chromosome contact enhancement matrix.

[0015] As one implementation, each value in the smoothed bin contact probability feature set is the sum of the number of contacts between a chromosome segment and its neighboring chromosome segments divided by the total number of contacts in the entire chromosome contact matrix.

[0016] As one implementation method, the preprocessing process for single-cell Hi-C data is as follows:

[0017] Each chromosome is segmented based on a preset resolution, resulting in multiple chromosome segments, with each chromosome segment forming a bin;

[0018] The relevant information in the single-cell Hi-C data is then matched with each bin to obtain the chromosome contact sparse matrix.

[0019] In one implementation, the fusion classification model includes a convolution module and a fusion module; the convolution module is used to extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively; the fusion module is used to fuse the extracted corresponding features and then predict the cell type based on the fused features.

[0020] A second aspect of the present invention provides a cell type prediction system based on multivariate feature set fusion.

[0021] A cell type prediction system based on multivariate feature set fusion, comprising:

[0022] The data processing module is used to acquire single-cell Hi-C data and preprocess it to obtain a chromosome contact sparse matrix. Then, the chromosome contact sparse matrix is spatially smoothed to obtain a chromosome contact enhancement matrix.

[0023] The feature set extraction module is used to extract small-domain contact probability feature sets from intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact sparse matrix, extract smoothed small-domain contact probability feature sets from intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact enhancement matrix, and extract smoothed bin contact probability feature sets characterizing the contact frequency distribution from the chromosome contact enhancement matrix.

[0024] The cell type prediction module uses a pre-trained fusion classification model to extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively. The extracted corresponding features are then fused, and the cell type is predicted based on the fused features.

[0025] In one implementation, the fusion classification model includes a convolution module and a fusion module; the convolution module is used to extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively; the fusion module is used to fuse the extracted corresponding features and then predict the cell type based on the fused features.

[0026] A third aspect of the present invention provides a computer-readable storage medium.

[0027] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps in the cell type prediction method based on multivariate feature set fusion as described above.

[0028] A fourth aspect of the present invention provides an electronic device.

[0029] An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps in the cell type prediction method based on multivariate feature set fusion as described above.

[0030] Compared with the prior art, the beneficial effects of the present invention are:

[0031] (1) This invention predicts cell type by fusing the corresponding features in the small domain contact probability feature set, the smoothed small domain contact probability feature set and the smoothed bin contact probability feature set. It improves the accuracy of cell type prediction from multiple perspectives, including chromatin interaction within the domain centered on the target chromosome segment and the different contact frequency distributions within the same chromosome segment.

[0032] (2) In order to solve the problem that the difference between the number of contacts detected in the experiment and the actual number of contacts affects the accuracy of cell type prediction, the present invention uses spatial smoothing technology to process the sparse matrix of chromosome contacts. By integrating information from spatially adjacent chromosome segments, the information of the target chromosome segments is refined, reducing the error between the number of contacts, thereby improving the accuracy of cell classification.

[0033] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.

Description of the Drawings

[0034] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.

[0035] Figure 1 is a flowchart of the cell type prediction method based on multi-feature set fusion according to an embodiment of the present invention;

[0036] Figure 2 is a performance comparison chart of the present invention on various feature sets on balanced datasets;

[0037] Figure 3 is a comparison chart of the present invention with other methods on balanced datasets;

[0038] Figure 4 is a performance comparison chart of the present invention on different feature sets on imbalanced datasets;

[0039] Figure 5 is a comparison chart of the present invention with other methods on imbalanced datasets.

Detailed Implementation

[0040] The present invention will be further described below with reference to the accompanying drawings and embodiments.

[0041] It should be noted that the following detailed description is illustrative and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.

[0042] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.

[0043] Example 1

[0044] This embodiment provides a cell type prediction method based on multivariate feature set fusion, which specifically includes the following steps:

[0045] Step 1: Acquire single-cell Hi-C data and preprocess it to obtain a chromosome contact sparse matrix. Then, perform spatial smoothing on the chromosome contact sparse matrix to obtain a chromosome contact enhancement matrix.

[0046] In the specific implementation process, the preprocessing of single-cell Hi-C data is as follows:

[0047] Each chromosome is segmented based on a preset resolution, resulting in multiple chromosome segments, with each chromosome segment forming a bin;

[0048] The relevant information in the single-cell Hi-C data is then matched with each bin to obtain the chromosome contact sparse matrix.

[0049] The raw data file describes the interactions between chromatin fragments in each cell, along with other specific information. In this embodiment, a resolution of R = 1 Mb is selected, and each chromosome is cut into multiple fragments, each of which forms a bin. The calculation formula is as follows:

[0050]

[0051] Here, n is the number of segments into which each chromosome is divided. The chromosome contact information is therefore represented by the chromosome contact matrix $A_{n \times n}$, indexed by the positions of the chromosome segments, where $A_{ij}$ is the number of contacts between chromosome segment i and chromosome segment j.
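
The binning and matching steps above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes each contact record is a pair of base-pair positions on the same chromosome, and it assumes n is the chromosome length divided by R, rounded up (the formula in [0050] is not reproduced in this text). The names `build_contact_matrix`, `chrom_length`, and `contacts` are hypothetical.

```python
import math

def build_contact_matrix(chrom_length, contacts, resolution=1_000_000):
    """Bin one chromosome at a fixed resolution and count contacts per bin pair.

    contacts: iterable of (pos1, pos2) base-pair positions on the same chromosome.
    Returns a symmetric n x n list of lists A, where A[i][j] is the number of
    contacts between bin i and bin j.
    """
    # Assumption: n is the chromosome length divided by the resolution, rounded up.
    n = math.ceil(chrom_length / resolution)
    A = [[0] * n for _ in range(n)]
    for pos1, pos2 in contacts:
        i = pos1 // resolution
        j = pos2 // resolution
        A[i][j] += 1
        if i != j:
            A[j][i] += 1  # keep the matrix symmetric
    return A

# Toy example: a 3.5 Mb chromosome with three contact records.
A = build_contact_matrix(
    chrom_length=3_500_000,
    contacts=[(100_000, 2_400_000), (1_200_000, 1_900_000), (3_100_000, 200_000)],
)
```

With R = 1 Mb the toy chromosome yields four bins, and most entries of A stay zero, which is why the matrix is described as sparse.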

[0052] In raw single-cell Hi-C data, the experimentally detected number of contacts differs from the actual number of contacts because of inherent limitations of single-cell Hi-C technology. This discrepancy reduces the accuracy with which chromosome structure can be characterized across cell types. To address this issue, a spatial smoothing method is employed to reduce the error in contact numbers and thereby improve the accuracy of cell classification. Spatial smoothing refines the information of the target chromosome segment by integrating information from spatially adjacent chromosome segments. A chromosome segment is spatially adjacent to the target chromosome segment if it satisfies at least one of the following conditions: it is linearly adjacent to the target chromosome segment, or it participates in an interaction with the target chromosome segment. All interaction information between the target chromosome segment and its adjacent segments is contained in the rows of matrix $A_{n \times n}$, from which the neighbor bin contact matrix is generated, where i is the index of the target chromosome segment and b is the number of segments adjacent to the target chromosome segment. The contact information is spatially smoothed as follows:

[0053]

[0054] Here, S is the chromosome contact matrix after spatial smoothing; the subscripts i and j of S denote chromosome segment i and chromosome segment j and correspond to the rows and columns of the matrix; the neighbor term denotes the elements of the neighbor bin contact matrix; b is the total number of segments adjacent to the target chromosome segment; $A_{ij}$ is the number of contacts between chromosome segment i and chromosome segment j; s indexes the s-th adjacent segment of the target chromosome segment; and n is the number of segments into which each chromosome is divided.

[0055] Because the chromosome contact matrix is symmetric, the smoothing operation is applied only to its diagonal and upper triangular elements. This guarantees that the column index is always greater than or equal to the row index throughout the smoothing process.
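
The neighbor rule and the upper-triangle traversal described above can be illustrated with a short sketch. The smoothing formula in [0053] is not reproduced in this text, so the averaging used here (the target entry plus the corresponding entries of the b neighboring rows, divided by b + 1) is an assumption made only for illustration. The function names are hypothetical.

```python
def neighbor_bins(A, i):
    """Bins spatially adjacent to bin i: linear neighbours or bins in contact with i."""
    n = len(A)
    linear = {k for k in (i - 1, i + 1) if 0 <= k < n}
    interacting = {j for j in range(n) if j != i and A[i][j] > 0}
    return sorted(linear | interacting)

def spatial_smooth(A):
    """Smooth the diagonal and upper triangle of A, then mirror to keep symmetry.

    Assumption: each smoothed entry is the mean of the target row entry and the
    corresponding entries of the b neighbouring rows.
    """
    n = len(A)
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        nbrs = neighbor_bins(A, i)
        b = len(nbrs)
        for j in range(i, n):  # column index is always >= row index
            total = A[i][j] + sum(A[s][j] for s in nbrs)
            S[i][j] = total / (b + 1)
            S[j][i] = S[i][j]  # mirror into the lower triangle
    return S

A = [
    [2, 1, 0, 0],
    [1, 0, 0, 3],
    [0, 0, 1, 0],
    [0, 3, 0, 0],
]
S = spatial_smooth(A)
```

In this toy matrix, bin 1 has three neighbors: bins 0 and 2 are linearly adjacent, and bins 0 and 3 interact with it.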

[0056] Step 2: Extract small-domain contact probability feature sets from the intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact sparse matrix; extract smoothed small-domain contact probability feature sets from the intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact enhancement matrix; and extract smoothed bin contact probability feature sets representing the contact frequency distribution from the chromosome contact enhancement matrix.

[0057] ① Small-domain contact probability feature set (SICP):

[0058] Based on the sparse matrix, a small-domain contact probability feature set is extracted from the chromatin interactions within the domain centered on the target chromosome segment. Conceptually, a "small domain" is a triangular region centered on the target chromosome segment and composed of its adjacent first-order linear units.

[0059] The SICP value of each small domain is the total number of contacts within the domain divided by the number of contacts in the chromosome contact sparse matrix. The contact probability within a small domain is therefore calculated using the following formula:

[0060]

[0061] Here, $B_{ij}$ denotes the elements within the small domain currently being calculated, the index c ranges over [1, n], and $\mathrm{Total}(A_{n \times n})$ denotes the number of contacts in the chromosome contact sparse matrix. Note that for SICP feature calculation, the sparse matrix A must be padded with zero elements to construct matrix B. This step ensures that the small domain can be fully formed whether c equals 1 or n. Finally, the contact probabilities of the small domains along the entire chromatin are aggregated to form the SICP feature set of the current cell.
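
The small-domain calculation described above can be sketched as follows. Because the formula in [0060] is not reproduced in this text, two details are assumptions: the small domain of bin c is taken as the upper triangle of the 3 x 3 block of the zero-padded matrix centered on c, and the total counts each contact once (diagonal plus upper triangle). The function name is hypothetical.

```python
def small_domain_contact_probabilities(M):
    """Contact probability of the small domain around every bin of a symmetric matrix M."""
    n = len(M)
    # Zero-pad by one bin on every side so the domain exists for the first and last bins.
    B = [[0] * (n + 2) for _ in range(n + 2)]
    for i in range(n):
        for j in range(n):
            B[i + 1][j + 1] = M[i][j]
    # Assumption: the total counts each contact once (diagonal plus upper triangle).
    total = sum(M[i][j] for i in range(n) for j in range(i, n))
    if total == 0:
        return [0.0] * n
    features = []
    for c in range(1, n + 1):  # c indexes bins in the padded matrix
        domain = sum(
            B[i][j]
            for i in range(c - 1, c + 2)
            for j in range(i, c + 2)  # upper triangle of the 3 x 3 block
        )
        features.append(domain / total)
    return features

A = [
    [2, 1, 0],
    [1, 0, 3],
    [0, 3, 4],
]
sicp = small_domain_contact_probabilities(A)
```

Applying the same function to the smoothed matrix instead of the sparse matrix would give the smoothed variant of this feature set under the same assumptions.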

[0062] ② Smooth Small-Domain Contact Probability Feature Set (SSICP):

[0063] The small-domain contact probability feature set is extracted from the sparse matrix, but the sparse matrix itself contains limited feature information. For this reason, a smoothed small-domain contact probability feature set (SSICP) based on the enhancement matrix is introduced. Each value in the smoothed small-domain contact probability feature set is the total number of contacts within a preset region divided by the number of contacts in the chromosome contact enhancement matrix.

[0064] SSICP is derived from the smoothed matrix, and the SICP and SSICP features are combined to improve the accuracy of the information. The formula for calculating SSICP is as follows:

[0065]

[0066] Here, $D_{ij}$ denotes the elements within the small domain currently being calculated, and the index c ranges over [1, n]. Note that to compute the SSICP feature set, the smoothed matrix S must be padded with zero elements to generate matrix D. As with SICP, the contact probabilities of the smoothed small domains along the entire chromatin are combined to form the SSICP feature set.

[0067] ③ Smooth bin contact probability feature set (SBCP):

[0068] Different cell types exhibit different contact frequency distributions within the same chromosome segment. Considering the potential discrepancy between the observed and actual contact numbers in the sparse matrix, a smoothed bin contact probability (SBCP) feature set is introduced. This feature set is based on the enhancement matrix, which adds information between chromosome segments (bins). The smoothed chromosome contact matrix is denoted $S_{n \times n}$, where n is the number of bins. Each value in the SBCP feature set is the sum of the contact numbers between a chromosome segment (bin) and its neighboring chromosome segments divided by the total number of contacts in the entire chromosome contact matrix.

[0069] The process for determining SBCP is described below:

[0070]

[0071] Here, $\mathrm{Total}(S_{n \times n})$ denotes the total number of contacts in the smoothed chromosome contact matrix, and the index i ranges over [0, n-1]. Finally, the SBCP feature set of each cell is constructed by concatenating all of its smoothed bin contact probabilities.
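
A minimal sketch of the bin-level calculation described above follows. The formula in [0070] is not reproduced in this text, so two details are assumptions: the contacts between a bin and its neighboring bins are taken as the sum of that bin's row in the symmetric smoothed matrix, and the total is the sum of all matrix entries. The function name is hypothetical.

```python
def smoothed_bin_contact_probabilities(S):
    """One contact probability per bin of a smoothed symmetric contact matrix S."""
    n = len(S)
    # Assumption: the total is the sum of all entries of the symmetric matrix.
    total = sum(sum(row) for row in S)
    if total == 0:
        return [0.0] * n
    # i runs over [0, n-1]; each value is the row sum of bin i divided by the total.
    return [sum(S[i]) / total for i in range(n)]

S = [
    [1.5, 0.5, 0.0],
    [0.5, 1.0, 1.0],
    [0.0, 1.0, 2.5],
]
sbcp = smoothed_bin_contact_probabilities(S)
```

Under these assumptions the values of one chromosome sum to 1, so the feature set describes how contacts are distributed over bins rather than how many contacts were observed.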

[0072] Step 3: Using a pre-trained fusion classification model, extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively. Then, fuse the extracted corresponding features and predict the cell type based on the fused features.

[0073] To predict cell types more accurately, a comprehensive classification model based on deep learning was constructed, as detailed in Figure 1. The model includes two key convolutional blocks for extracting complex and important features. A fusion module integrates feature information from different perspectives, while the Bi-LSTM module reveals the underlying information within the feature set.

[0074] Each convolutional block comprises the following components: a one-dimensional convolutional layer (Conv1d), a batch normalization layer (BatchNorm), a max-pooling layer (Max-Pool), and a dropout layer. The Conv1d layer uses one-dimensional convolutional kernels to extract key information from the input feature set. The BatchNorm layer accelerates network convergence and mitigates problems such as vanishing or exploding gradients. The Max-Pool layer reduces the dimensionality of the features while retaining important information, which helps prevent overfitting and improves the model's accuracy and generalization ability. The ReLU activation function is used to add non-linearity, speeding up training while producing more complex representations. The Dropout layer drops neurons during forward propagation with a predefined probability, helping to improve the model's robustness. Finally, a flattening operation merges the features from all channels into a single vector, so that each feature set corresponds to one vector that serves as input to the fusion module.
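
The data flow through one convolutional block can be sketched in plain Python as below. This is a simplified forward pass for illustration only: BatchNorm and Dropout are omitted, the order Conv1d, ReLU, Max-Pool is an assumption, and the kernel weights are hypothetical values rather than trained parameters.

```python
def conv1d(x, kernel):
    """Valid 1-D cross-correlation of a sequence with one kernel."""
    k = len(kernel)
    return [sum(x[i + t] * kernel[t] for t in range(k)) for i in range(len(x) - k + 1)]

def relu(x):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in x]

def max_pool(x, size=2):
    """Non-overlapping max pooling; a trailing partial window is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]

def conv_block(feature_set, kernels):
    """Conv1d -> ReLU -> Max-Pool per output channel, then flatten all channels."""
    flat = []
    for kernel in kernels:
        flat.extend(max_pool(relu(conv1d(feature_set, kernel))))
    return flat

sicp = [0.3, 1.0, 0.7, 0.2, 0.4]      # toy feature set for one cell
kernels = [[1.0, -1.0], [0.5, 0.5]]   # hypothetical weights, one kernel per channel
vector = conv_block(sicp, kernels)
```

Each feature set would pass through its own block in this way, producing one flat vector per feature set for the fusion module.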

[0075] In the fusion module, the three feature set vectors generated by the convolutional blocks are merged into a unified vector, which helps to extract a comprehensive and representative feature set from multiple angles and dimensions. The fused vector serves as the input to the Bi-LSTM module. In the classification module, the FC layer extracts feature information from different dimensions, while the LayerNorm layer preserves the relationships between feature sets to enhance the network's generalization ability. The Bi-LSTM captures the latent information between feature sets. Then, a linear layer and the "log_softmax" function are used to map the scores of each class into a one-dimensional vector. The final classification result is the index corresponding to the highest score. To prevent overfitting during training, this invention implements an early stopping mechanism and employs a 5-fold cross-validation method. If the loss on the validation dataset does not decrease for 10 consecutive rounds, the training process will be terminated.
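
The early stopping rule described above can be sketched as follows. The list of validation losses stands in for a real training loop, and the function name and the `patience` parameter name are hypothetical; only the threshold of 10 consecutive rounds without a decrease comes from the description.

```python
def train_with_early_stopping(validation_losses, patience=10):
    """Return the index of the round at which training stops.

    validation_losses: per-round validation loss values (a stand-in for a real
    training loop). Training stops once the loss has not decreased for
    `patience` consecutive rounds.
    """
    best = float("inf")
    rounds_without_improvement = 0
    for round_index, loss in enumerate(validation_losses):
        if loss < best:
            best = loss
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                return round_index
    return len(validation_losses) - 1  # ran to the end without triggering

# The loss improves for three rounds and then plateaus.
losses = [0.9, 0.7, 0.6] + [0.65] * 15
stop_round = train_with_early_stopping(losses)
```

In a 5-fold cross-validation setting, a check of this kind would run once per fold on that fold's validation split.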

[0076] This embodiment uses operations such as convolution and pooling to perform multi-angle analysis and extract features, and then predicts cell types based on these features. Compared with previous methods, this embodiment achieves excellent classification results and outperforms other methods.
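
The fusion and final prediction steps can be illustrated with a small sketch. It assumes that merging the three vectors means concatenation, and it replaces the FC, LayerNorm, Bi-LSTM, and linear layers with a hypothetical list of class scores; only the log-softmax mapping and the choice of the highest-scoring index follow the description.

```python
import math

def fuse(*vectors):
    """Merge the per-feature-set vectors into one vector (assumed: concatenation)."""
    fused = []
    for v in vectors:
        fused.extend(v)
    return fused

def log_softmax(scores):
    """Numerically stable log-softmax over a list of class scores."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def predict(scores):
    """Index of the class with the highest log-probability."""
    log_probs = log_softmax(scores)
    return max(range(len(log_probs)), key=log_probs.__getitem__)

# Toy vectors standing in for the outputs of the three convolutional branches.
fused = fuse([0.3, 0.5], [0.85, 0.45], [0.25, 0.31])
class_scores = [1.2, 3.4, 0.7, 2.9]  # hypothetical output of the final linear layer
cell_type_index = predict(class_scores)
```

Because log-softmax is monotonic, the predicted index is the same as the index of the largest raw score; the log-probabilities matter mainly for computing the training loss.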

[0077] To evaluate the cell type prediction performance of multi-feature set fusion, this embodiment first validates the model on balanced datasets (the Flyamer, Ramani, and Collombet datasets). The fusion classification model is compared with six other models: three single-feature-set models (SICP, SBCP, SSICP) and three dual-feature-set fusion models (SBCP_SSICP, SBCP_SICP, SICP_SSICP). As shown in Figure 2, the three single-feature-set models perform well on the three balanced datasets, confirming the effectiveness of the feature extraction. The three dual-feature-set fusion models also perform well, highlighting the value of feature fusion in enhancing the model's classification ability. On all three balanced datasets, the fusion classification model with three feature sets achieves the best and most consistent performance across ACC, BACC, F1, and Precision. In conclusion, the fusion classification model effectively integrates features from multiple perspectives, thereby enhancing its cell type classification ability.

[0078] To demonstrate the superiority of the method in predicting cell types on balanced datasets, it was compared with the state-of-the-art scHiCStackL method. In addition, this embodiment uses three traditional machine learning methods (Support Vector Machine, Logistic Regression, and Random Forest) to further evaluate the effectiveness of the deep learning model constructed in this invention. As shown in Figure 3, the model in this embodiment exhibits the highest and most stable performance across the four metrics ACC, ARI, F1, and Precision.

[0079] To address the challenge of imbalanced datasets, the performance of cell type prediction based on multi-feature set fusion was evaluated on imbalanced datasets (the 4DN, Lee, and Nagano datasets). As shown in Figure 4, the fusion classification model in this embodiment exhibits the best and most consistent performance on the imbalanced datasets across four metrics: ACC, BACC, F1, and Precision. Based on these results, the fusion classification model demonstrates strong generalization and robustness by effectively fusing feature sets from various perspectives, ultimately providing excellent cell type prediction performance.
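
The difference between two of the reported metrics on imbalanced data can be illustrated with a short sketch, assuming ACC denotes plain accuracy and BACC denotes balanced accuracy (the mean of per-class recall); the description does not define the abbreviations. The labels below are toy values, not results from the datasets named above.

```python
def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted type matches the true type."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so small classes weigh as much as large ones."""
    recalls = []
    for cls in sorted(set(y_true)):
        idx = [k for k, t in enumerate(y_true) if t == cls]
        recalls.append(sum(y_pred[k] == cls for k in idx) / len(idx))
    return sum(recalls) / len(recalls)

# Toy imbalanced example: four cells of type "A", one of type "B",
# and a classifier that always predicts "A".
y_true = ["A", "A", "A", "A", "B"]
y_pred = ["A", "A", "A", "A", "A"]
acc = accuracy(y_true, y_pred)
bacc = balanced_accuracy(y_true, y_pred)
```

Here the plain accuracy is 0.8 while the balanced accuracy is 0.5, which shows why a class-balanced metric is informative when cell types are unevenly represented.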

[0080] Because previous validation of scHiCStackL was limited to datasets with simple and easily distinguishable cell types, it did not provide a comprehensive evaluation of the framework's generality and robustness. Therefore, the effectiveness of this embodiment in cell type prediction was quantified using three imbalanced datasets and directly compared with three traditional machine learning methods and the existing scHiCStackL. As shown in Figure 5, this embodiment achieves the best performance on each of these datasets.

[0081] Genomic analysis was performed on both balanced and imbalanced datasets to demonstrate the accuracy and applicability of the method in this embodiment at the genomic analysis level. Cell type-specific patterns were annotated in detail and then associated with specific genes according to the feature relevance ranking. A survey of existing studies showed that the patterns annotated by this embodiment correlate consistently and strongly with the corresponding cell types across multiple datasets, confirming the accuracy and relevance of the method.

[0082] Example 2

[0083] This embodiment provides a cell type prediction system based on multivariate feature set fusion, which specifically includes the following modules:

[0084] The data processing module is used to acquire single-cell Hi-C data and preprocess it to obtain a chromosome contact sparse matrix. Then, the chromosome contact sparse matrix is spatially smoothed to obtain a chromosome contact enhancement matrix.

[0085] The feature set extraction module is used to extract small-domain contact probability feature sets from intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact sparse matrix, extract smoothed small-domain contact probability feature sets from intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact enhancement matrix, and extract smoothed bin contact probability feature sets characterizing the contact frequency distribution from the chromosome contact enhancement matrix.

[0086] The cell type prediction module uses a pre-trained fusion classification model to extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively. The extracted corresponding features are then fused, and the cell type is predicted based on the fused features.

[0087] The fusion classification model includes a convolution module and a fusion module. The convolution module is used to extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively. The fusion module is used to fuse the extracted corresponding features and then predict the cell type based on the fused features.

[0088] It should be noted that each module in this embodiment corresponds one-to-one with each step in Embodiment 1, and their specific implementation processes are the same.

[0089] Example 3

[0090] This embodiment provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps in the cell type prediction method based on multivariate feature set fusion as described above.

[0091] Example 4

[0092] This embodiment provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the program, it implements the steps in the cell type prediction method based on multivariate feature set fusion as described above.

[0093] This invention is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, as well as combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0094] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A cell type prediction method based on multivariate feature set fusion, characterized in that it comprises: acquiring single-cell Hi-C data and preprocessing it to obtain a chromosome contact sparse matrix, and then spatially smoothing the chromosome contact sparse matrix to obtain a chromosome contact enhancement matrix; extracting a small-domain contact probability feature set from the intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact sparse matrix; extracting a smoothed small-domain contact probability feature set from the intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact enhancement matrix; extracting a smoothed bin contact probability feature set, which characterizes the contact frequency distribution, from the chromosome contact enhancement matrix; and using a pre-trained fusion classification model to extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively, fusing the extracted features, and predicting the cell type based on the fused features.

2. The cell type prediction method based on multivariate feature set fusion as described in claim 1, characterized in that each value in the small-domain contact probability feature set is the total number of contacts within the preset region divided by the number of contacts in the chromosome contact sparse matrix.

3. The cell type prediction method based on multivariate feature set fusion as described in claim 1, characterized in that each value in the smoothed small-domain contact probability feature set is the total number of contacts within the preset region divided by the number of contacts in the chromosome contact enhancement matrix.

4. The cell type prediction method based on multivariate feature set fusion as described in claim 1, characterized in that each value in the smoothed bin contact probability feature set is the sum of the contact counts between a chromosome segment and its neighboring chromosome segments divided by the total number of contacts in the entire chromosome contact matrix.

5. The cell type prediction method based on multivariate feature set fusion as described in claim 1, characterized in that the preprocessing of the single-cell Hi-C data comprises: segmenting each chromosome at a preset resolution to obtain multiple chromosome segments, each chromosome segment forming a bin; and matching the relevant information in the single-cell Hi-C data to each bin to obtain the chromosome contact sparse matrix.

6. The cell type prediction method based on multivariate feature set fusion as described in claim 1, characterized in that the fusion classification model includes a convolution module and a fusion module; the convolution module is used to extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively; and the fusion module is used to fuse the extracted features and then predict the cell type based on the fused features.

7. A cell type prediction system based on multi-feature set fusion, characterized in that it comprises: a data processing module, used to acquire single-cell Hi-C data and preprocess it to obtain a chromosome contact sparse matrix, and then to spatially smooth the chromosome contact sparse matrix to obtain a chromosome contact enhancement matrix; a feature set extraction module, used to extract a small-domain contact probability feature set from the intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact sparse matrix, to extract a smoothed small-domain contact probability feature set from the intra-domain chromatin interactions centered on the target chromosome segment in the chromosome contact enhancement matrix, and to extract a smoothed bin contact probability feature set, which characterizes the contact frequency distribution, from the chromosome contact enhancement matrix; and a cell type prediction module, used to extract, with a pre-trained fusion classification model, corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively, to fuse the extracted features, and to predict the cell type based on the fused features.

8. The cell type prediction system based on multi-feature set fusion as described in claim 7, characterized in that the fusion classification model includes a convolution module and a fusion module; the convolution module is used to extract corresponding features from the small-domain contact probability feature set, the smoothed small-domain contact probability feature set, and the smoothed bin contact probability feature set, respectively; and the fusion module is used to fuse the extracted features and then predict the cell type based on the fused features.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the steps in the cell type prediction method based on multivariate feature set fusion as described in any one of claims 1-6.

10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps in the cell type prediction method based on multivariate feature set fusion as described in any one of claims 1-6.