Data processing method, apparatus, device, storage medium, and computer program product based on T cell receptor sequence classification

By screening, cleaning, missing data completion, and category labeling of T-cell receptor sequence data, multidimensional features are extracted and integrated to construct an optimized classification model, which solves the problem of insufficient accuracy in T-cell receptor sequence classification and achieves higher recognition accuracy and stability.

CN122245448A · Pending · Publication Date: 2026-06-19 · SHENZHEN HAPLOX BIOTECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
SHENZHEN HAPLOX BIOTECH
Filing Date
2026-04-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, T cell receptor sequence classification lacks standardized preprocessing and multi-class feature co-characterization, resulting in poor accuracy of classification results.

Method used

The target sequence dataset is obtained through filtering, cleaning, missing data completion, and category labeling. Numerical, region identification, and sequence content representation information is extracted and integrated into a feature matrix, and feature reduction is performed. A classification model is then constructed and optimized, and sequence files to be classified are processed in the same way as the training data before being input into the model, which keeps the processing consistent.

Benefits of technology

It improves the accuracy and stability of T cell receptor sequence classification and identification, reduces interference from abnormal data, enhances the sufficiency of feature characterization, and reduces the impact of redundant features.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122245448A_ABST

Abstract

This application relates to the field of bioinformatics analysis technology, and in particular to a data processing method, apparatus, device, storage medium, and computer program product based on T cell receptor sequence classification. The method involves acquiring sequence data to be processed, and then filtering, cleaning, completing missing data, and labeling categories to obtain a target sequence dataset. Numerical representation information, region identification representation information, and sequence content representation information are extracted from the target sequence dataset and integrated to obtain a target feature matrix. Feature reduction processing is performed based on the target feature matrix and category labeling results to determine the core feature set and generate dimensionality-reduced feature data. A classification model is constructed and optimized based on the dimensionality-reduced feature data and category labeling results to obtain a target classification model. The sequence files to be classified are processed according to the processing method corresponding to the sequence data to be processed, resulting in feature data to be classified, which is then input into the target classification model to obtain the target classification result.

Description

Technical Field

[0001] This application relates to the field of bioinformatics analysis technology, and in particular to a data processing method, apparatus, device, storage medium and computer program product based on T cell receptor sequence classification.

Background Technology

[0002] T-cell receptor sequences can reflect differences in the immune status of samples; therefore, classification and identification based on T-cell receptor sequences has become an important research direction in bioinformatics and tumor immunology analysis. Current technologies for classifying T-cell receptor sequences typically model directly on raw sequence data or a few statistical features, and lack standardized preprocessing of sequence data and collaborative characterization of multiple features. However, T-cell receptor sequence data usually contains numerical information, region identification information, and sequence content information. Data from different sources may also contain invalid records, missing content, and inconsistent formats. Current technologies struggle to uniformly screen, clean, complete missing data, and label categories for this type of data, and also find it difficult to perform feature integration, feature reduction, and consistent classification processing, resulting in poor accuracy of the final classification results. Therefore, improving the accuracy of T-cell receptor sequence-based classification and identification has become an urgent technical problem to be solved.

Summary of the Invention

[0003] The main objective of this application is to provide a data processing method, apparatus, device, storage medium, and computer program product based on T-cell receptor sequence classification, aiming to solve the technical problem of how to improve the accuracy of T-cell receptor sequence classification and identification.

[0004] To achieve the above objectives, this application provides a data processing method based on T cell receptor sequence classification, the method comprising the following steps:
obtaining the sequence data to be processed, and performing filtering, cleaning, missing data completion, and category labeling on the sequence data to be processed to obtain the target sequence dataset;
extracting, based on the target sequence dataset, numerical representation information, region identification representation information, and sequence content representation information, and integrating the three to obtain the target feature matrix;
performing feature reduction processing based on the target feature matrix and the category labeling results to determine the core feature set, and generating dimensionality-reduced feature data based on the core feature set;
constructing and optimizing a classification model based on the dimensionality-reduced feature data and the category labeling results to obtain the target classification model; and
processing the sequence file to be classified according to the processing method corresponding to the sequence data to be processed to obtain the feature data to be classified, and inputting the feature data to be classified into the target classification model to obtain the target classification result.

[0005] In one embodiment, the step of acquiring the sequence data to be processed and performing filtering, cleaning, missing data completion, and category labeling on the sequence data to obtain the target sequence dataset includes:
obtaining raw sequence files from multiple sources, and extracting sequence fields, clone characterization fields, region fields, and chain fields from the raw sequence files to obtain initial sequence data;
filtering the initial sequence data based on preset validity conditions, and applying outlier removal and missing field completion to the filtered sequence data to obtain normalized sequence data; and
determining the category identifier based on the data source corresponding to the normalized sequence data, and adding the category identifier to the normalized sequence data to obtain the target sequence dataset.

[0006] In one embodiment, the step of extracting numerical representation information, region identification representation information, and sequence content representation information based on the target sequence dataset, and integrating them to obtain a target feature matrix includes:
extracting, based on the target sequence dataset, numerical features corresponding to sequence clone representation and sequence length representation to obtain the numerical representation information;
extracting, based on the target sequence dataset, region fields corresponding to the sequence region segments, and encoding and converting the region fields to obtain the region identification representation information;
extracting, based on the target sequence dataset, amino acid composition features and sequence vector features corresponding to the sequence fields to obtain the sequence content representation information; and
integrating the numerical representation information, the region identification representation information, and the sequence content representation information to obtain the target feature matrix.

[0007] In one embodiment, the step of performing feature reduction processing based on the target feature matrix and category labeling results to determine the core feature set, and generating dimensionality-reduced feature data based on the core feature set, includes:
constructing a feature selection model based on the target feature matrix and the category labeling results, and analyzing, based on the feature selection model, the correlation between each feature in the target feature matrix and the category labeling results to obtain the feature selection results;
determining, based on the feature selection results, target features that meet preset conditions, and defining the target features as the core feature set; and
reducing and mapping the target feature matrix based on the core feature set to obtain the dimensionality-reduced feature data.

[0008] In one embodiment, the step of constructing and optimizing a classification model based on the dimensionality-reduced feature data and the category labeling results to obtain a target classification model includes:
constructing a support vector machine classification model based on the dimensionality-reduced feature data and the category labeling results to obtain an initial classification model;
optimizing the kernel function parameters and penalty parameters of the initial classification model based on a preset parameter search strategy to obtain the target parameter configuration; and
training the initial classification model based on the target parameter configuration and the dimensionality-reduced feature data to obtain the target classification model.

[0009] In one embodiment, the step of processing the sequence file to be classified according to the processing method corresponding to the sequence data to be processed to obtain the feature data to be classified, and inputting the feature data to be classified into the target classification model to obtain the target classification result includes:
obtaining the sequence file to be classified, and preprocessing it according to the filtering, cleaning, missing data completion, and category feature construction methods corresponding to the sequence data to be processed, to obtain the sequence data to be analyzed;
extracting, based on the sequence data to be analyzed, numerical representation information, region identification representation information, and sequence content representation information, and processing them according to the integration method corresponding to the target feature matrix to obtain the feature data to be classified; and
inputting the feature data to be classified into the target classification model for classification determination to obtain the target classification result.
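The consistency requirement in this embodiment can be illustrated with a minimal Python sketch. It assumes that everything learned during training (here, a region vocabulary and the indices of the core features) is stored in one object and reused unchanged at classification time. The class name `FittedTransform`, the reduced field set, and the example values are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FittedTransform:
    """Holds what was learned in training so inference reuses it unchanged."""
    region_vocab: List[str]   # Vregion values seen in the training data
    core_indices: List[int]   # columns kept by feature reduction

    def transform(self, record: dict) -> List[float]:
        # Numerical features first, then a one-hot encoding of Vregion.
        row = [float(record["cloneCount"]), float(record["Length"])]
        # An unseen region value encodes as all zeros; no new column is added.
        row.extend(1.0 if record["Vregion"] == v else 0.0 for v in self.region_vocab)
        # Keep only the core feature columns selected during training.
        return [row[i] for i in self.core_indices]


# Hypothetical fitted state and records, for illustration only.
fitted = FittedTransform(region_vocab=["TRBV4-1", "TRBV7-9"], core_indices=[0, 3])
train_row = fitted.transform({"cloneCount": 12, "Length": 13, "Vregion": "TRBV7-9"})
new_row = fitted.transform({"cloneCount": 4, "Length": 10, "Vregion": "TRBV20-1"})
```

Because both rows pass through the same object, the feature data to be classified always has the same columns, in the same order, as the data the model was trained on.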

[0010] Furthermore, to achieve the above objectives, this application also proposes a data processing apparatus based on T cell receptor sequence classification, the apparatus comprising:
a data processing module, used to acquire the sequence data to be processed, and to perform filtering, cleaning, missing data completion, and category labeling on the sequence data to be processed to obtain the target sequence dataset;
an information integration module, used to extract numerical representation information, region identification representation information, and sequence content representation information based on the target sequence dataset, and to integrate them to obtain the target feature matrix;
a feature reduction module, used to perform feature reduction processing based on the target feature matrix and category labeling results, determine the core feature set, and generate dimensionality-reduced feature data based on the core feature set;
a model optimization module, used to construct and optimize the classification model based on the dimensionality-reduced feature data and the category labeling results to obtain the target classification model; and
a target module, used to process the sequence file to be classified according to the processing method corresponding to the sequence data to be processed to obtain the feature data to be classified, and to input the feature data to be classified into the target classification model to obtain the target classification result.

[0011] Furthermore, to achieve the above objectives, this application also proposes a data processing device based on T-cell receptor sequence classification. The device includes: a memory, a processor, and a data processing program based on T-cell receptor sequence classification stored in the memory and executable on the processor. The data processing program based on T-cell receptor sequence classification is configured to implement the steps of the data processing method based on T-cell receptor sequence classification as described in any of the above embodiments.

[0012] In addition, to achieve the above objectives, this application also proposes a storage medium storing a data processing program based on T-cell receptor sequence classification, wherein when the data processing program based on T-cell receptor sequence classification is executed by a processor, it implements the steps of the data processing method based on T-cell receptor sequence classification as described above.

[0013] In addition, to achieve the above objectives, this application also proposes a computer program product, which includes a computer program that, when executed by a processor, implements the steps of the data processing method based on T-cell receptor sequence classification as described above.

[0014] This application obtains the sequence data to be processed and performs filtering, cleaning, missing data completion, and category labeling to obtain the target sequence dataset. Based on the target sequence dataset, numerical representation information, regional identification representation information, and sequence content representation information are extracted and integrated to obtain the target feature matrix. Based on the target feature matrix and category labeling results, feature reduction processing is performed to determine the core feature set, and dimensionality-reduced feature data is generated based on the core feature set. Based on the dimensionality-reduced feature data and category labeling results, a classification model is constructed and optimized to obtain the target classification model. The sequence files to be classified are processed according to the processing method corresponding to the sequence data to be processed to obtain the feature data to be classified, and the feature data to be classified is input into the target classification model to obtain the target classification result. 
This application reduces the interference of abnormal data on classification by screening, cleaning, missing data completion, and category labeling of sequence data; improves the sufficiency of characterizing T cell receptor sequence differences by extracting and integrating numerical characterization information, regional identification characterization information, and sequence content characterization information; determines the core feature set and generates dimensionality-reduced feature data by feature reduction, thereby reducing the impact of redundant features on the classification model; and improves the accuracy of T cell receptor sequence-based classification by processing the sequence files to be classified using the same processing methods as in the training stage before inputting them into the target classification model, making the classification criteria more consistent.

Attached Figure Description

[0015] Figure 1 is a flowchart illustrating the first embodiment of the data processing method based on T cell receptor sequence classification of this application;
Figure 2 is a schematic diagram of a sub-process in the second embodiment of the data processing method based on T cell receptor sequence classification of this application;
Figure 3 is a schematic diagram of a sub-process in the third embodiment of the data processing method based on T cell receptor sequence classification of this application;
Figure 4 is a schematic diagram of the module structure of the data processing apparatus based on T cell receptor sequence classification according to an embodiment of this application;
Figure 5 is a schematic diagram of the hardware operating environment involved in the data processing method based on T cell receptor sequence classification in the embodiments of this application.

[0016] The realization of the purpose, functional features and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0017] It should be understood that the specific embodiments described herein are for illustrative purposes only and are not intended to limit the scope of this application.

[0018] To better understand the technical solution of this application, a detailed description will be provided below in conjunction with the accompanying drawings and specific implementation methods.

[0019] It is important to note that T-cell receptor sequences can reflect differences in the immune status of samples. Therefore, classification and identification based on T-cell receptor sequences has become an important research direction in bioinformatics and tumor immunology analysis. Current technologies for classifying T-cell receptor sequences typically rely directly on raw sequence data or partial statistical features for modeling, lacking standardized preprocessing of sequence data and collaborative characterization of multiple features. However, T-cell receptor sequence data usually contains numerical information, region identification information, and sequence content information. Data from different sources may also contain invalid records, missing content, and inconsistent formats. Existing technologies struggle to uniformly screen, clean, complete missing data, and label categories for this type of data. Furthermore, they are unable to perform feature integration, feature reduction, and consistent classification processing on this basis, resulting in poor accuracy of the final classification results. Therefore, improving the accuracy of T-cell receptor sequence-based classification and identification has become an urgent technical problem to be solved.

[0020] The main solution of this application is as follows: First, acquire the sequence data to be processed, and perform filtering, cleaning, missing data completion, and category labeling on the sequence data to obtain the target sequence dataset. Second, based on the target sequence dataset, extract numerical representation information, region identification representation information, and sequence content representation information, and integrate these information to obtain the target feature matrix. Third, based on the target feature matrix and category labeling results, perform feature reduction processing to determine the core feature set, and generate dimensionality-reduced feature data based on the core feature set. Fourth, based on the dimensionality-reduced feature data and category labeling results, construct and optimize the classification model to obtain the target classification model. Fifth, process the sequence files to be classified according to the processing method corresponding to the sequence data to be processed, obtain the feature data to be classified, and input the feature data to be classified into the target classification model to obtain the target classification result.

[0021] This application reduces the interference of abnormal data on classification by screening, cleaning, missing data completion, and category labeling of sequence data; improves the sufficiency of characterizing T cell receptor sequence differences by extracting and integrating numerical characterization information, regional identification characterization information, and sequence content characterization information; determines the core feature set and generates dimensionality-reduced feature data by feature reduction, thereby reducing the impact of redundant features on the classification model; and improves the accuracy of T cell receptor sequence-based classification by processing the sequence files to be classified using the same processing methods as in the training stage before inputting them into the target classification model, making the classification criteria more consistent.

[0022] It should be noted that the executing entity of the method in this embodiment can be a computing service device with data processing, network communication, and program execution functions, or it can be the aforementioned data processing device based on T-cell receptor sequence classification with the same or similar functions. This embodiment and the following embodiments will be described using a data processing device based on T-cell receptor sequence classification as an example.

[0023] Based on this, a first embodiment of the data processing method based on T cell receptor sequence classification of this application is proposed. Please refer to Figure 1, which is a flowchart illustrating the first embodiment of the data processing method based on T cell receptor sequence classification of this application.

[0024] In this embodiment, the data processing method based on T cell receptor sequence classification includes the following steps:

S1: Obtain the sequence data to be processed, and perform filtering, cleaning, missing data completion, and category labeling on the sequence data to be processed to obtain the target sequence dataset.

It should be noted that the sequence data to be processed refers to the raw T-cell receptor-related data used for subsequent classification analysis. This data originates from the raw sequence files corresponding to different categories of samples, and these files typically include fields such as AASeq, cloneCount, cloneFraction, Vregion, Dregion, Jregion, Length, and Chain. Filtering refers to selecting valid data from the raw sequence data that meets the analysis requirements based on preset criteria. Cleaning refers to processing the filtered data, including removing invalid records, removing abnormal content, and formatting. Missing data completion refers to filling in missing fields in the data according to preset methods. Category labeling refers to assigning classification labels to the corresponding data based on the sample category to which the sequence data belongs. The target sequence dataset refers to the standardized dataset formed after filtering, cleaning, missing data completion, and category labeling, which can be used for subsequent feature extraction and classification modeling.

The fields are defined as follows:
  • AASeq: the amino acid sequence field corresponding to the T cell receptor, used to characterize the composition of the sequence itself.
  • cloneCount: the clone count field, used to characterize the number of clones of the corresponding sequence in the sample.
  • cloneFraction: the clone fraction field, used to reflect the relative proportion of the corresponding sequence in the sample.
  • Vregion, Dregion, Jregion: the gene region fields corresponding to the T cell receptor sequence, used to characterize the region composition of the sequence.
  • Length: the sequence length field, used to characterize the length of the corresponding sequence.
  • Chain: the receptor chain type field, used to distinguish different types of T cell receptor chains and as one of the criteria for screening valid sequences.

[0025] Specifically, the first step is to acquire the sequence data to be processed. This sequence data can come from multiple raw sequence files in breast cancer sample datasets and healthy sample datasets. After reading the raw files, the core fields relevant to subsequent analysis are extracted to form the initial sequence data. Subsequently, the initial sequence data is filtered, prioritizing the retention of valid sequences that meet the target chain type criteria, while excluding data records with empty sequence content, abnormal clone counts, or no analytical significance. This narrows down the scope of subsequent processing from the source, ensuring that the data entering subsequent analysis first meets the basic validity requirements.

[0026] Furthermore, after the screening is completed, the retained sequence data undergoes cleaning and missing data completion processing. The cleaning process may include removing invalid samples, formatting fields, and standardizing data representation across different source files. For missing fields, they are filled in using a preset completion method to improve the information completeness of each sequence record. Next, category labeling is performed based on the data source or sample category, assigning corresponding labels to samples of different categories. The labeled sequence data are then merged to obtain the target sequence dataset that can be used for subsequent feature extraction.

[0027] By first screening the sequence data to be processed, invalid sequences that do not meet the analysis requirements can be excluded. Cleaning reduces interference from outlier records, null records, and inconsistent formats. Missing data completion improves the completeness of sequence information, preventing insufficient representation during feature extraction. Finally, category labeling allows sequence data from different sources and categories to form a data foundation directly usable for training and analysis under a unified standard. Therefore, this step provides more standardized, complete, and category-indicating data input for subsequent feature construction, feature reduction, and classification model training, thereby improving the stability and accuracy of subsequent T-cell receptor sequence classification.
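Step S1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the field names come from the description, while the target chain type `TRB`, the `unknown` placeholder for missing region fields, recomputing `Length` from the sequence, filling a missing `cloneFraction` from the retained counts, and all example records are assumptions made for the sketch.

```python
import math
from typing import Optional

TARGET_CHAIN = "TRB"                      # assumed target chain type
REGION_FIELDS = ("Vregion", "Dregion", "Jregion")
MISSING_REGION = "unknown"                # assumed placeholder for missing regions


def to_float(value) -> Optional[float]:
    """Parse a number; return None when it is missing or not finite."""
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    return number if math.isfinite(number) else None


def clean_record(rec: dict) -> Optional[dict]:
    """Filter and normalize one record; return None if it is excluded."""
    seq = (rec.get("AASeq") or "").strip().upper()
    chain = (rec.get("Chain") or "").strip().upper()
    count = to_float(rec.get("cloneCount"))
    # Filtering: wrong chain type, empty sequence, or abnormal clone count.
    if chain != TARGET_CHAIN or not seq or count is None or count <= 0:
        return None
    out = {"AASeq": seq, "Chain": chain, "cloneCount": count,
           "cloneFraction": to_float(rec.get("cloneFraction")),
           "Length": len(seq)}
    # Missing-field completion for the region fields.
    for field in REGION_FIELDS:
        out[field] = (rec.get(field) or "").strip() or MISSING_REGION
    return out


def build_dataset(sources: dict) -> list:
    """sources maps a category label to the raw records read for that category."""
    dataset = []
    for label, records in sources.items():
        kept = [c for c in map(clean_record, records) if c is not None]
        total = sum(c["cloneCount"] for c in kept)
        for c in kept:
            if c["cloneFraction"] is None:
                c["cloneFraction"] = c["cloneCount"] / total
            c["label"] = label            # category labeling by data source
        dataset.extend(kept)
    return dataset


# Illustrative records only; the sequences and values are made up.
raw = {
    "breast_cancer": [
        {"AASeq": "CASSLGQGYEQYF", "cloneCount": "12", "Vregion": "TRBV7-9",
         "Jregion": "TRBJ2-7", "Chain": "TRB"},
        {"AASeq": "", "cloneCount": "3", "Chain": "TRB"},
        {"AASeq": "CAVRDSNYQLIW", "cloneCount": "5", "Chain": "TRA"},
    ],
    "healthy": [
        {"AASeq": "cassqetqyf", "cloneCount": "4", "cloneFraction": "0.5",
         "Vregion": "TRBV4-1", "Chain": "TRB"},
        {"AASeq": "CASSPGTEAFF", "cloneCount": "-1", "Chain": "TRB"},
    ],
}
dataset = build_dataset(raw)
```

In this example, three of the five records are excluded (empty sequence, non-target chain, negative clone count), and the two that remain carry a completed set of fields plus a `label`.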

[0028] S2: Based on the target sequence dataset, extract numerical representation information, region identification representation information, and sequence content representation information, and integrate the numerical representation information, region identification representation information, and sequence content representation information to obtain the target feature matrix.

S3: Based on the target feature matrix and category labeling results, perform feature reduction processing to determine the core feature set, and generate dimensionality-reduced feature data based on the core feature set.

It should be noted that:
  • Numerical representation information refers to the feature information extracted from sequence data that reflects the numerical attributes of the sequence, such as information related to the number of clones, clone ratio, sequence length, and quantities derived from them.
  • Region identification representation information refers to the identifier-type feature information transformed from the region fields corresponding to the sequence, used to represent the differences in region composition among different sequences.
  • Sequence content representation information refers to the feature information extracted based on the composition of the sequence itself, used to represent differences at the content level of the sequence.
  • The target feature matrix refers to the feature set formed after unifying and integrating the numerical representation information, region identification representation information, and sequence content representation information.
  • Category labeling results refer to the classification labels assigned to each sequence record during preprocessing according to the sample source or sample category, used to indicate the category to which different sequences belong.
  • Feature reduction processing refers to the process of filtering and compressing the original features based on the target feature matrix and category labeling results to reduce redundant or low-correlation features.
  • The core feature set refers to the feature set that is retained after feature reduction processing and is highly correlated with the classification task.
  • Dimensionality-reduced feature data refers to the low-dimensional feature data mapped or extracted from the target feature matrix based on the core feature set.

[0029] Specifically, feature construction is first performed on the target sequence dataset obtained earlier. Numerical features related to the number of clones, clone ratio, sequence length, and quantities derived from them are extracted from each sequence sample to form the numerical representation information. Next, the corresponding region information is extracted from the region fields and converted by encoding into the region identification representation information. At the same time, sequence content representation information reflecting the amino acid composition and sequence distribution characteristics is extracted from the sequence field itself. Finally, the three types of representation information are concatenated according to the sample correspondence to form the target feature matrix for subsequent classification modeling.
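The feature construction described above can be sketched in Python. The sketch assumes one-hot encoding for the region fields and per-residue frequencies for amino acid composition; the application does not name a specific encoding, and the sequence vector features it mentions are left out here. The `log1p` term stands in for a derived numerical feature and is likewise an assumption, as are the example records.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"      # the 20 standard amino acids
REGION_FIELDS = ("Vregion", "Dregion", "Jregion")


def numeric_features(rec):
    """Clone count, clone fraction, length, and one assumed derived feature."""
    return [rec["cloneCount"], rec["cloneFraction"], float(rec["Length"]),
            math.log1p(rec["cloneCount"])]


def fit_region_vocab(records):
    """Collect the sorted set of values seen for each region field."""
    return {f: sorted({r[f] for r in records}) for f in REGION_FIELDS}


def region_features(rec, vocab):
    """One-hot encode each region field against the fitted vocabulary."""
    row = []
    for f in REGION_FIELDS:
        row.extend(1.0 if rec[f] == v else 0.0 for v in vocab[f])
    return row


def composition_features(seq):
    """Fraction of each standard amino acid in the sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def build_matrix(records, vocab):
    """Concatenate the three feature groups per record, in a fixed order."""
    return [numeric_features(r) + region_features(r, vocab)
            + composition_features(r["AASeq"]) for r in records]


# Illustrative records only; the values are made up.
records = [
    {"AASeq": "CASSLGQGYEQYF", "cloneCount": 12.0, "cloneFraction": 0.75,
     "Length": 13, "Vregion": "TRBV7-9", "Dregion": "unknown", "Jregion": "TRBJ2-7"},
    {"AASeq": "CASSQETQYF", "cloneCount": 4.0, "cloneFraction": 0.25,
     "Length": 10, "Vregion": "TRBV4-1", "Dregion": "unknown", "Jregion": "TRBJ2-5"},
]
vocab = fit_region_vocab(records)
matrix = build_matrix(records, vocab)
```

Each row here has 4 numerical columns, 5 region columns (2 + 1 + 2 vocabulary entries), and 20 composition columns.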

[0030] Furthermore, after obtaining the target feature matrix, feature reduction is performed in conjunction with the category labeling results. Specifically, a feature selection model can be used to analyze the correlation between each feature and the category label, identifying features more representative of the classification task from the target feature matrix and determining them as the core feature set. Then, based on this core feature set, the original target feature matrix is pruned, mapped, or reorganized to remove features with high redundancy or low contribution, ultimately generating dimensionality-reduced feature data with more streamlined dimensions for subsequent classification model construction and training.
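The application does not specify the feature selection model. As a simple stand-in, the following Python sketch scores each column by the absolute Pearson correlation with a binary category label and keeps the columns above a threshold. The threshold value of 0.3 and the toy matrix are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def select_core_features(matrix, labels, threshold=0.3):
    """Indices of columns whose |correlation| with the label meets the threshold."""
    columns = list(zip(*matrix))
    scores = [abs(pearson(col, labels)) for col in columns]
    return [i for i, s in enumerate(scores) if s >= threshold]


def reduce_matrix(matrix, core):
    """Keep only the core feature columns, preserving row order."""
    return [[row[i] for i in core] for row in matrix]


# Toy data: column 0 tracks the label, column 1 is constant,
# column 2 has the same distribution in both classes.
matrix = [[1, 3, 1], [2, 3, 2], [1, 3, 3], [5, 3, 1], [6, 3, 2], [5, 3, 3]]
labels = [0, 0, 0, 1, 1, 1]
core = select_core_features(matrix, labels)
reduced = reduce_matrix(matrix, core)
```

A univariate score like this one cannot detect redundancy between two retained features; a model-based selector would be needed for that.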

[0031] This step extracts numerical representation information, region identification representation information, and sequence content representation information separately, and integrates these three types of information into a unified target feature matrix. This allows the same sequence sample to be represented from multiple dimensions, thereby enhancing the ability to characterize the differences in T cell receptor sequences. Furthermore, feature reduction processing is performed based on the category labeling results to select core features that are more relevant to the classification task and generate dimensionality-reduced feature data. This reduces the interference of redundant and irrelevant features on subsequent classification modeling, thus improving the accuracy and stability of classification and identification based on T cell receptor sequences.

[0032] S4: Based on the dimensionality-reduced feature data and the category labeling results, construct and optimize the classification model to obtain the target classification model;
S5: Process the sequence file to be classified according to the processing method corresponding to the sequence data to be processed to obtain the feature data to be classified, and input the feature data to be classified into the target classification model to obtain the target classification result.
It should be noted that the classification model refers to a data discrimination model built from training sample data to distinguish different categories of sequence samples; the target classification model refers to the model formed after construction and parameter optimization, used to determine the category of the sequence to be classified; the sequence file to be classified refers to the test sequence file that needs to be input into the target classification model for category identification; the feature data to be classified refers to the feature data obtained by processing the sequence file to be classified with the same processing method as in the training stage; and the target classification result refers to the category identification result output by the target classification model after it evaluates the feature data to be classified.

[0033] Specifically, a classification model is first constructed and optimized based on the dimensionality-reduced feature data and the category labeling results obtained previously. The dimensionality-reduced feature data is matched with the corresponding category labeling results to form a sample set for model training, and a support vector machine classification model is built on this sample set to obtain an initial classification model. A preset parameter search strategy is then used to adjust key parameters of the model, such as the kernel function parameters and the penalty parameters, to determine the target parameter configuration suited to the current input features. Finally, the initial classification model is trained with the target parameter configuration, yielding a target classification model that can be used for sequence classification.

[0034] Furthermore, after the target classification model is constructed, the processing flow used in the training phase is applied to the sequence file to be classified. Specifically, the sequence file to be classified is first read and preprocessed according to the aforementioned processing method for sequence data, including filtering, cleaning, and missing data completion. The corresponding numerical features, region identification features, and sequence content features are then extracted according to the aforementioned feature construction method, and feature data to be classified, of the same form as the feature data used in the training phase, is generated based on the core feature set. Finally, the feature data to be classified is input into the target classification model, which classifies it and outputs the corresponding target classification result. When necessary, classification confidence information corresponding to the target classification result can be output at the same time.
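The consistency requirement in this paragraph can be illustrated with a minimal Python sketch. It assumes that the training phase stored two artifacts: the region vocabulary used for encoding and the column indices of the core feature set. The field names (`cdr3_aa`, `clone_count`, `clone_fraction`, `v_gene`), the `artifacts` dictionary, and the one-hot and composition encodings are hypothetical choices made for illustration; the application does not specify them.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def featurize(rec, v_vocab):
    """Build the full feature row exactly as it would be built at training time."""
    seq = rec["cdr3_aa"]
    numeric = [float(rec["clone_count"]), float(rec["clone_fraction"]), float(len(seq))]
    # One-hot region encoding against the vocabulary fitted during training;
    # a gene unseen in training maps to an all-zero block.
    region = [1.0 if rec["v_gene"] == v else 0.0 for v in v_vocab]
    content = [seq.count(a) / len(seq) for a in AMINO_ACIDS]
    return numeric + region + content

def to_classifier_input(rec, artifacts):
    """Apply the stored training-time artifacts to one record to be classified."""
    full = featurize(rec, artifacts["v_vocab"])
    # Keep only the columns that belong to the core feature set.
    return [full[i] for i in artifacts["core_idx"]]
```

Because the vocabulary and the core indices are read from the stored artifacts rather than recomputed from the new file, the columns seen by the model at prediction time line up with those seen during training.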

[0035] This step first constructs and optimizes a classification model based on the dimensionality-reduced input features and category labeling results, enabling the model to establish more targeted classification judgment relationships around the core features that have been selected and retained. Then, the sequence file to be classified is converted into feature data to be classified in the same way as in the training stage and input into the target classification model. This ensures that the training data and the prediction data are consistent in terms of processing methods and feature expression, reducing the judgment bias caused by inconsistent processing methods. Therefore, it is beneficial to improve the accuracy and stability of classification and recognition based on T cell receptor sequences.

[0036] This embodiment obtains the sequence data to be processed and performs filtering, cleaning, missing data completion, and category labeling on the sequence data to obtain the target sequence dataset. Based on the target sequence dataset, numerical representation information, region identification representation information, and sequence content representation information are extracted and integrated to obtain the target feature matrix. Based on the target feature matrix and the category labeling results, feature reduction processing is performed to determine the core feature set, and dimensionality-reduced feature data is generated based on the core feature set. Based on the dimensionality-reduced feature data and the category labeling results, a classification model is constructed and optimized to obtain the target classification model. The sequence file to be classified is processed according to the processing method corresponding to the sequence data to be processed to obtain the feature data to be classified, and the feature data to be classified is input into the target classification model to obtain the target classification result. 
This embodiment reduces the interference of abnormal data on classification by screening, cleaning, missing data completion, and category labeling of sequence data; it improves the sufficiency of representing T cell receptor sequence differences by extracting and integrating numerical representation information, regional identification representation information, and sequence content representation information; it reduces the impact of redundant features on the classification model by determining the core feature set and generating dimensionality-reduced feature data through feature reduction; and it improves the accuracy of T cell receptor sequence-based classification by processing the sequence files to be classified using the same processing methods as in the training phase before inputting them into the target classification model, making the classification criteria more consistent.

[0037] Based on the first embodiment described above, a second embodiment of the data processing method based on T cell receptor sequence classification of this application is proposed. Please refer to Figure 2, which is a schematic diagram of a sub-process in the second embodiment of the data processing method based on T cell receptor sequence classification of this application.

[0038] As shown in Figure 2, in this embodiment, step S1 includes:
S11: Obtain raw sequence files from multiple sources, and extract sequence fields, clone characterization fields, region fields, and chain type fields from the raw sequence files to obtain initial sequence data;
S12: Filter the initial sequence data based on preset validity conditions, and perform outlier removal and missing field completion on the filtered sequence data to obtain normalized sequence data;
S13: Determine the category identifier based on the data source corresponding to the normalized sequence data, and add the category identifier to the normalized sequence data to obtain the target sequence dataset.

[0039] It should be noted that raw sequence files refer to data files from different sample categories that carry the original information related to T cell receptor sequences; sequence fields refer to data fields that characterize the amino acid sequence content of T cell receptors; clone characterization fields refer to data fields that characterize clone information such as the number and proportion of clones corresponding to a sequence; region fields refer to data fields that characterize the composition of different gene regions of the T cell receptor; chain type fields refer to data fields that characterize the chain type of the T cell receptor; initial sequence data refers to the data set formed by extracting the above fields from raw sequence files from multiple sources; preset validity conditions refer to the screening conditions used to determine whether the initial sequence data meets the requirements of subsequent analysis; normalized sequence data refers to the unified and standardized data set formed after screening, outlier removal, and missing field completion; category identifiers refer to the category labels determined according to the data source corresponding to the normalized sequence data; and the target sequence dataset refers to the data set formed by adding the category identifiers to the normalized sequence data, which can be used for subsequent feature extraction and classification modeling.

[0040] Specifically, raw sequence files from multiple sources are first acquired. These raw sequence files can originate from breast cancer sample datasets and healthy sample datasets, and are stored in file formats such as CSV. After the raw sequence files are read, the basic fields relevant to subsequent classification analysis are extracted from each file, including sequence fields characterizing amino acid sequence content, clone characterization fields reflecting the number and proportion of clones, region fields reflecting the composition of different gene regions, and chain type fields distinguishing receptor chain types, thereby forming the initial sequence data. After this processing, the raw content of the different source files is extracted into a common field structure, which facilitates consistent processing in later steps.

[0041] Furthermore, after the initial sequence data is generated, it is filtered based on the preset validity conditions. For example, valid sequences that meet the target chain type condition can be retained first, while data records with empty sequence content, abnormal clone counts, or no significance for subsequent analysis are removed. The filtered sequence data then undergoes outlier removal and missing field completion, so that data from different sources is consistent in field completeness and content standardization, yielding the normalized sequence data. After this processing, category identifiers are determined according to the data source corresponding to the normalized sequence data and added to each normalized sequence record, ultimately producing a target sequence dataset that can be used for subsequent feature extraction and classification modeling.
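Steps S11 to S13 can be sketched in Python as follows. This is only an illustration under stated assumptions: the column names (`cdr3_aa`, `clone_count`, `clone_fraction`, `v_gene`, `j_gene`, `chain`), the target chain value `"TRB"`, the placeholder `"unknown"` for missing region fields, and the rule for completing a missing clone fraction are all hypothetical, since the application does not fix a file layout or specific validity conditions.

```python
import csv
import io

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard residues

def load_records(csv_text, label, target_chain="TRB"):
    """Filter, clean, complete, and label the rows of one source file (CSV text)."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        seq = (row.get("cdr3_aa") or "").strip().upper()
        # Validity screening: target chain only, non-empty sequence of standard residues.
        if row.get("chain") != target_chain or not seq or set(seq) - VALID_AA:
            continue
        try:
            count = int(row.get("clone_count") or 0)
        except ValueError:
            continue  # unparsable clone count is treated as abnormal
        if count <= 0:
            continue  # non-positive clone count is treated as abnormal
        rows.append({
            "cdr3_aa": seq,
            "clone_count": count,
            "clone_fraction": row.get("clone_fraction") or None,
            # Missing-field completion for the region fields.
            "v_gene": row.get("v_gene") or "unknown",
            "j_gene": row.get("j_gene") or "unknown",
            # Category identifier derived from the data source.
            "label": label,
        })
    total = sum(r["clone_count"] for r in rows)
    for r in rows:
        # Complete a missing clone fraction from the counts of the retained rows.
        r["clone_fraction"] = (float(r["clone_fraction"]) if r["clone_fraction"]
                               else r["clone_count"] / total)
    return rows
```

A caller would invoke `load_records` once per source, for example with `label=1` for a disease cohort file and `label=0` for a healthy cohort file, and concatenate the returned lists to obtain the labeled dataset.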

[0042] This step first extracts standardized fields from raw sequence files from multiple sources to form initial sequence data, thus laying the foundation for unified processing of data from different sources. Next, valid sequences are screened using preset validity criteria, and outlier removal and missing field completion are performed on the screened sequence data. This reduces the interference caused by invalid data, outliers, and missing fields in subsequent analysis. Finally, category identifiers are determined based on the data source corresponding to the standardized sequence data to form a target sequence dataset, providing a unified, complete, and category-indicating data foundation for subsequent feature extraction and classification modeling. Therefore, this step helps improve the standardization and consistency of subsequent sequence characterization and classification processing, and provides reliable data support for improving the accuracy of T-cell receptor sequence-based classification and identification.

[0043] Based on the first embodiment described above, in this embodiment, step S2 includes:
S21: Based on the target sequence dataset, extract the numerical features corresponding to the sequence clone characterization and the sequence length characterization to obtain the numerical representation information;
S22: Based on the target sequence dataset, extract the region fields corresponding to the sequence region segments, and perform encoding conversion on the region fields to obtain the region identification representation information;
S23: Based on the target sequence dataset, extract the amino acid composition features and sequence vector features corresponding to the sequence fields to obtain the sequence content representation information, and integrate the numerical representation information, the region identification representation information, and the sequence content representation information to obtain the target feature matrix.

[0044] It should be noted that sequence clonal characterization refers to the characterization content used to reflect the clonal distribution status of the corresponding T cell receptor sequence in a sample; sequence length characterization refers to the characterization content used to reflect the length attribute of the corresponding sequence; numerical features refer to the numerical features extracted based on sequence clonal characterization and sequence length characterization; regional fragments refer to different gene regions corresponding to the T cell receptor sequence; encoding conversion processing refers to the process of converting the region field from its original category form into an identifier form that can participate in calculations; amino acid composition features refer to the feature information extracted based on the amino acid composition in the sequence field; sequence vector features refer to the feature information obtained after mapping the sequence field into vector form; and the target feature matrix refers to the unified feature set formed by integrating numerical characterization information, regional identifier characterization information, and sequence content characterization information.

[0045] Specifically, multidimensional feature extraction is first performed on the target sequence dataset. Features related to the number of clones, the clone ratio, and the sequence length are extracted from each sequence sample and aggregated to obtain the numerical representation information. At the same time, the region segment information corresponding to different gene regions is extracted from the region fields of the target sequence dataset, and the extracted region fields undergo encoding conversion so that the original region category information becomes region identification representation information usable for subsequent modeling and analysis. In addition, amino acid composition features and sequence vector features are extracted from the sequence fields themselves, so that differences in composition and expression within the sequence content can be represented in a structured form.
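The three feature families described here can be sketched in Python as below. The concrete choices are assumptions made for illustration and are not stated in the application: a log-transformed clone count as a derived numerical feature, one-hot encoding as the encoding conversion, residue frequency over the 20 standard amino acids as the composition feature, and a zero-padded integer index encoding as the sequence vector feature. The record field names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def numeric_features(rec):
    """Clone count, clone fraction, sequence length, and a derived log count."""
    return [rec["clone_count"], rec["clone_fraction"],
            len(rec["cdr3_aa"]), math.log1p(rec["clone_count"])]

def fit_region_vocab(records, field):
    """Collect the distinct region values seen in the training records."""
    return sorted({r[field] for r in records})

def one_hot(value, vocab):
    """Encoding conversion: category value to a 0/1 indicator vector."""
    return [1.0 if value == v else 0.0 for v in vocab]

def aa_composition(seq):
    """Relative frequency of each standard residue in the sequence."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]

def sequence_vector(seq, max_len):
    """Integer index per position (1..20), truncated or zero-padded to max_len."""
    codes = [AMINO_ACIDS.index(a) + 1 for a in seq[:max_len]]
    return codes + [0] * (max_len - len(codes))
```

The vocabulary returned by `fit_region_vocab` would need to be kept alongside the model so that files to be classified are encoded with the same columns.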

[0046] Furthermore, after the three types of features have been extracted, the numerical representation information, region identification representation information, and sequence content representation information are integrated. During integration, features of different sources and types are concatenated and merged according to the sample correspondence, so that each sequence sample has one complete, combined feature representation, thereby constructing the target feature matrix. On this basis, feature reduction is performed on the target feature matrix in combination with the category labeling results. By analyzing how strongly each feature relates to the distinction between categories, the features that are more representative of the classification task are selected as the core feature set. The original target feature matrix is then pruned and mapped based on the core feature set to remove features with strong redundancy or low contribution, and dimensionality-reduced feature data that can be used for subsequent classification model training is generated.
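A minimal sketch of the sample-wise integration follows, assuming each feature family is already available as a list with one row per sample in the same sample order. The function name `build_feature_matrix` is hypothetical.

```python
def build_feature_matrix(numeric_rows, region_rows, content_rows):
    """Concatenate the three feature groups row by row into one matrix."""
    if not (len(numeric_rows) == len(region_rows) == len(content_rows)):
        raise ValueError("feature groups must cover the same samples")
    matrix = [list(n) + list(r) + list(c)
              for n, r, c in zip(numeric_rows, region_rows, content_rows)]
    # Every sample must end up with the same number of columns.
    if len({len(row) for row in matrix}) > 1:
        raise ValueError("rows have inconsistent widths")
    return matrix
```

The explicit length and width checks guard against a silent misalignment between samples and features, which would otherwise corrupt the correspondence that this paragraph relies on.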

[0047] This step extracts the numerical features corresponding to the sequence clone characterization and the sequence length characterization, the region identification features corresponding to the sequence region segments, and the amino acid composition and sequence vector features corresponding to the sequence fields. This allows the same sequence sample to be characterized at multiple levels, covering its numerical, regional, and sequence content attributes. Integrating these three types of features into a target feature matrix gives a more comprehensive expression of the differences between T cell receptor sequences. Furthermore, feature reduction is performed based on the category labeling results, selecting more representative core features and generating dimensionality-reduced feature data, which reduces the interference of redundant and irrelevant features in subsequent classification modeling. This step therefore improves both how fully the input features represent sample differences and how well the features used by the subsequent classification model are targeted, thereby enhancing the accuracy and stability of classification based on T cell receptor sequences.

[0049] Based on the second embodiment described above, a third embodiment of the data processing method based on T cell receptor sequence classification of this application is proposed. Please refer to Figure 3, which is a schematic diagram of a sub-process in the third embodiment of the data processing method based on T cell receptor sequence classification of this application.

[0050] In this embodiment, step S3 includes:
S31: Based on the target feature matrix and the category labeling results, construct a feature selection model, and use the feature selection model to analyze the correlation between each feature in the target feature matrix and the category labeling results to obtain the feature selection results;
S32: Based on the feature selection results, determine the target features that meet the preset conditions, and define the target features as the core feature set;
S33: Based on the core feature set, perform reduction mapping on the target feature matrix to obtain the dimensionality-reduced feature data.

[0051] It should be noted that the feature selection model refers to a model used to analyze the correlation between each feature in the target feature matrix and the category labeling results, and to perform feature selection accordingly; the feature selection result refers to the selection output result obtained after analyzing each feature through the feature selection model, which is used to characterize the correlation between each feature and the classification task; the target feature refers to the feature that meets the preset conditions determined according to the feature selection result; the reduction mapping process refers to the process of selecting, pruning or mapping the target feature matrix based on the core feature set to generate feature data with more concise dimensions.

[0052] Specifically, a feature selection model is first constructed based on the target feature matrix and the category labeling results obtained previously. This feature selection model can be based on the idea of sparsity constraints. It takes the category labeling results as the discrimination criterion and the target feature matrix as input, and analyzes the correlation between each feature in the target feature matrix and the category labeling results. Through this analysis, features with strong discriminative power between samples of different categories can be identified, forming the corresponding feature selection results. These feature selection results reflect how much each feature is worth retaining for the classification task and thus provide the basis for the subsequent determination of core features.

[0053] Furthermore, after the feature selection results are obtained, the target features that meet the preset conditions are determined from them and identified as the core feature set. The original target feature matrix is then reduced and mapped based on the core feature set. That is, the feature content corresponding to the core feature set is retained from the original target feature matrix, while features with high redundancy or low contribution are removed, forming dimensionality-reduced feature data with fewer dimensions and a more concentrated feature expression. Through this processing, the original high-dimensional combined features are compressed into an input form better suited to subsequent classification model training and prediction.
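One possible reading of a feature selection model "based on the idea of sparsity constraints" is an L1-penalized linear model, whose penalty drives the coefficients of uninformative features to zero. The sketch below uses scikit-learn's `Lasso` wrapped in `SelectFromModel`; this specific estimator, the `alpha` value, and the near-zero coefficient threshold standing in for the "preset conditions" are assumptions, not details given in the application.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def select_core_features(X, y, alpha=0.05):
    """Return the column indices of the core feature set and the reduced matrix."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Standardize so that the L1 penalty treats all features on the same scale.
    X_scaled = StandardScaler().fit_transform(X)
    # Features whose Lasso coefficient stays (almost) exactly zero are dropped.
    selector = SelectFromModel(Lasso(alpha=alpha), threshold=1e-5)
    selector.fit(X_scaled, y)
    core_idx = np.flatnonzero(selector.get_support())
    # Reduction mapping: keep only the core columns of the original matrix.
    return core_idx, X[:, core_idx]
```

Larger values of `alpha` yield a smaller core feature set. The returned `core_idx` is what would have to be stored and reapplied to the files to be classified.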

[0054] This step first constructs a feature selection model based on the target feature matrix and category labeling results, and analyzes the correlation between each feature and the category labeling results, thereby identifying features with greater discriminative power for the classification task. Then, based on the feature selection results, target features that meet preset conditions are determined, forming a core feature set that excludes redundant features with weak relevance to the classification task. Next, the target feature matrix is reduced and mapped based on the core feature set to obtain dimensionality-reduced feature data, making the input features used by the subsequent classification model more focused and targeted. Therefore, this step helps reduce the interference of high-dimensional redundant features in classification modeling, improves the match between the input features and the classification task, and thus supports improved accuracy and stability of classification based on T cell receptor sequences.

[0055] Based on the second embodiment described above, in this embodiment, step S4 includes:
S41: Based on the dimensionality-reduced feature data and the category labeling results, construct a support vector machine classification model to obtain an initial classification model;
S42: Based on a preset parameter search strategy, optimize the kernel function parameters and penalty parameters of the initial classification model to obtain the target parameter configuration;
S43: Based on the target parameter configuration and the dimensionality-reduced feature data, train the initial classification model to obtain the target classification model.

[0056] It should be noted that a Support Vector Machine (SVM) classification model is a machine learning model that classifies data based on the relationship between sample features and class labels; an initial classification model is a classification model initially constructed based on dimensionality-reduced feature data and class labeling results; a preset parameter search strategy is a processing strategy that searches and filters the parameter combinations of the classification model according to a pre-defined parameter traversal or comparison method; kernel function parameters are parameters used to control how the SVM classification model maps input features and constructs classification boundaries; penalty parameters are parameters used to adjust the trade-off between classification margin and sample error in the SVM classification model; target parameter configuration refers to the parameter combination determined after parameter optimization that is suitable for the current dimensionality-reduced feature data; and the target classification model is the model obtained after training the initial classification model based on the target parameter configuration and used for subsequent classification decisions.
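For reference, the roles of the penalty parameter and the kernel function parameter can be read off the standard soft-margin support vector machine formulation. The radial basis function kernel shown here is an assumption for illustration; the application does not name a specific kernel. Here $C$ is the penalty parameter, $\gamma$ is the kernel parameter, $\xi_i$ are the slack variables, and $y_i \in \{-1, +1\}$ are the category labels.

```latex
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i=1}^{n}\xi_i
\quad\text{s.t.}\quad
y_i\bigl(w^{\top}\phi(x_i)+b\bigr)\ \ge\ 1-\xi_i,\qquad \xi_i\ \ge\ 0,
\\[4pt]
K(x,x') \;=\; \phi(x)^{\top}\phi(x') \;=\; \exp\!\bigl(-\gamma\,\lVert x-x'\rVert^{2}\bigr)
```

A larger $C$ penalizes margin violations more heavily, giving a narrower margin that fits the training samples more closely, while a larger $\gamma$ makes the kernel more local and the classification boundary more flexible.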

[0057] Specifically, a support vector machine (SVM) classification model is first constructed based on the previously obtained dimensionality-reduced feature data and category labeling results, yielding an initial classification model. The dimensionality-reduced feature data of each sample is matched with its category labeling result to form the sample set required for model training, and the SVM classification model is then built on this sample set. Since the dimensionality-reduced feature data retains the core features strongly correlated with the classification task, the initial classification model can establish a preliminary discriminative relationship between samples of different categories from these input features, laying the foundation for subsequent parameter optimization and model training.

[0058] Furthermore, after the initial classification model is obtained, its kernel function parameters and penalty parameters are optimized based on a preset parameter search strategy to obtain the target parameter configuration. Specifically, model performance under different parameter configurations can be compared within a preset range of parameter combinations, and the parameter combination best suited to the current dimensionality-reduced feature data is selected as the target parameter configuration. The initial classification model is then trained based on the target parameter configuration and the dimensionality-reduced feature data, so that the model learns the relationship between sample features and categories under the optimized parameters, ultimately yielding a target classification model that can be used to classify subsequent sequences.
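Steps S41 to S43 might be realized as in the following sketch, which uses an exhaustive grid search with cross-validation as one example of a "preset parameter search strategy". The RBF kernel, the candidate values for `C` and `gamma`, the five-fold stratified split, and accuracy as the comparison metric are all assumptions; the application does not specify them.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def train_target_model(X_reduced, y):
    """Search C and gamma, then refit the SVM on all samples with the best pair."""
    X_reduced = np.asarray(X_reduced, dtype=float)
    y = np.asarray(y)
    # S41: initial model (scaling is part of the pipeline so it is refit per fold).
    pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    # S42: preset range of parameter combinations to compare.
    grid = {"svc__C": [0.1, 1, 10, 100], "svc__gamma": ["scale", 0.01, 0.1, 1]}
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    search = GridSearchCV(pipeline, grid, cv=folds, scoring="accuracy")
    # S43: with the default refit=True, the best configuration is retrained on all data.
    search.fit(X_reduced, y)
    return search.best_estimator_, search.best_params_
```

Placing the scaler inside the pipeline keeps the cross-validated comparison free of information from the held-out fold, so the selected configuration reflects performance on unseen samples.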

[0059] This step first constructs a support vector machine (SVM) classification model based on dimensionality-reduced feature data and category labeling results, enabling the model to establish category discrimination relationships around the selected and retained core features. Next, a preset parameter search strategy is used to optimize the kernel function parameters and penalty parameters, making the model parameters more closely match the current input features and thus improving the model's ability to characterize the boundaries of different category samples. Finally, the initial classification model is trained based on the target parameter configuration to form the target classification model, allowing the model to learn classification rules under optimized parameter conditions. Therefore, this step helps enhance the classification model's ability to identify T-cell receptor sequence differences, thereby providing support for improving the accuracy and stability of classification based on T-cell receptor sequences.

[0060] Based on the second embodiment described above, in this embodiment, step S5 includes:
S51: Obtain the sequence file to be classified, and preprocess it according to the filtering, cleaning, missing data completion, and feature construction methods corresponding to the sequence data to be processed, to obtain the sequence data to be analyzed;
S52: Based on the sequence data to be analyzed, extract numerical representation information, region identification representation information, and sequence content representation information, and process them according to the integration method corresponding to the target feature matrix to obtain the feature data to be classified;
S53: Input the feature data to be classified into the target classification model for classification determination, and obtain the target classification result.

[0061] It should be noted that the sequence file to be classified refers to the test sequence file that needs to be input into the target classification model for category identification; the sequence data to be analyzed refers to the data set formed after the sequence file to be classified has been processed according to the requirements of screening, cleaning, missing data completion, and feature construction in the training phase; the feature data to be classified refers to the feature data extracted from the sequence data to be analyzed and integrated according to the corresponding method of the target feature matrix; classification determination refers to the process of inputting the feature data to be classified into the target classification model and having the target classification model output the category identification result; the target classification result refers to the category output result obtained by the target classification model after completing the determination of the feature data to be classified.

[0062] Specifically, the sequence files to be classified are first obtained and preprocessed according to the processing methods applied to the sequence data to be processed in the training phase. The sequence fields, clone characterization fields, region fields, and chain type fields relevant to subsequent analysis can be read from the sequence files to be classified and filtered according to the validity conditions used in the training phase, retaining the valid sequences that meet the analysis requirements. Then, following the same processing standards as in the training phase, outlier removal, missing field completion, and field formatting are performed on the filtered sequence data, yielding sequence data to be analyzed that is consistent with the processing standards of the training data.

[0063] Furthermore, after obtaining the sequence data to be analyzed, numerical representation information, region identification representation information, and sequence content representation information are extracted from the sequence data. These multiple features are then uniformly processed according to the integration method corresponding to the target feature matrix to form feature data to be classified. Subsequently, the feature data to be classified is input into the trained target classification model, which performs classification determination and outputs the corresponding target classification result. If necessary, classification confidence information corresponding to the target classification result can also be output simultaneously for subsequent result analysis and application.
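The classification determination with optional confidence output could look like the sketch below. It treats the absolute value of the SVM decision function as an uncalibrated confidence score for a binary classifier; this is one simple convention assumed here for illustration, and the application does not say how confidence is computed. The toy model only stands in for the trained target classification model.

```python
import numpy as np
from sklearn.svm import SVC

def classify_with_confidence(model, X_new):
    """Return predicted labels and an uncalibrated margin-based confidence."""
    X_new = np.asarray(X_new, dtype=float)
    labels = model.predict(X_new)
    # For a binary SVC, decision_function gives one signed score per sample;
    # samples far from the decision boundary get a larger absolute score.
    confidence = np.abs(model.decision_function(X_new))
    return labels, confidence

# Toy stand-in for the target classification model (two separable classes).
toy_model = SVC(kernel="linear").fit([[0, 0], [0, 1], [4, 4], [4, 5]], [0, 0, 1, 1])
labels, confidence = classify_with_confidence(
    toy_model, [[0, 0.5], [4, 4.5], [2.1, 2.5]]
)
```

In this toy case the third sample lies close to the decision boundary and therefore receives a much lower score than the first two. If probability-like values are needed, the scores would have to be calibrated separately.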

[0064] This step first processes the sequence files to be classified using the same screening, cleaning, missing data completion, and feature construction methods as the training phase, ensuring consistency between the data to be classified and the training data in terms of data foundation and feature representation. Then, it extracts the feature data to be classified based on the sequence data to be analyzed and inputs it into the target classification model for classification determination. This allows the target classification model to complete category identification based on a unified set of input features. Therefore, this step reduces classification bias caused by different data sources, processing methods, or inconsistent feature representations, improving the adaptability and consistency of the target classification model in the actual prediction stage, thereby enhancing the accuracy and stability of classification based on T-cell receptor sequences.
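The consistency requirement described above can be illustrated with a minimal Python sketch. It assumes that everything learned during training (here, a region-to-code mapping and the indices of the retained features) is stored in one object and reused unchanged at prediction time. The class name `FittedPreprocessor`, its fields, and the policy of mapping unseen regions to -1 are hypothetical choices made for illustration; they are not specified in this application.

```python
from dataclasses import dataclass, field


@dataclass
class FittedPreprocessor:
    """Holds the state learned in training that prediction must reuse (illustrative)."""
    region_codes: dict = field(default_factory=dict)      # region name -> integer code
    selected_columns: list = field(default_factory=list)  # column indices kept after feature reduction

    def fit_regions(self, regions):
        # Sort before numbering so the mapping does not depend on input order.
        self.region_codes = {name: i for i, name in enumerate(sorted(set(regions)))}
        return self

    def encode_region(self, region):
        # Assumed policy: a region never seen in training maps to -1
        # rather than receiving a new code that the model has never seen.
        return self.region_codes.get(region, -1)

    def reduce(self, row):
        # Apply the training-time feature selection to a new feature row.
        return [row[i] for i in self.selected_columns]
```

Reusing one such object for both phases is one way to keep the data to be classified and the training data on the same feature representation.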

[0065] This embodiment obtains the sequence data to be processed and performs filtering, cleaning, missing data completion, and category labeling on the sequence data to obtain the target sequence dataset. Based on the target sequence dataset, numerical representation information, region identification representation information, and sequence content representation information are extracted and integrated to obtain the target feature matrix. Based on the target feature matrix and the category labeling results, feature reduction processing is performed to determine the core feature set, and dimensionality-reduced feature data is generated based on the core feature set. Based on the dimensionality-reduced feature data and the category labeling results, a classification model is constructed and optimized to obtain the target classification model. The sequence file to be classified is processed according to the processing method corresponding to the sequence data to be processed to obtain the feature data to be classified, and the feature data to be classified is input into the target classification model to obtain the target classification result. 
This embodiment reduces the interference of abnormal data on classification by screening, cleaning, missing data completion, and category labeling of sequence data; it improves the sufficiency of representing T cell receptor sequence differences by extracting and integrating numerical representation information, regional identification representation information, and sequence content representation information; it reduces the impact of redundant features on the classification model by determining the core feature set and generating dimensionality-reduced feature data through feature reduction; and it improves the accuracy of T cell receptor sequence-based classification by processing the sequence files to be classified using the same processing methods as in the training phase before inputting them into the target classification model, making the classification criteria more consistent.

[0066] In one embodiment, the data processing method based on T cell receptor sequence classification includes the following five steps:
Step 1: Data Preprocessing
Data input: a breast cancer TCR dataset (PRJNA330606, containing 39 CSV files, SRR4084209 to SRR4102112) and a healthy-individual TCR dataset (PRJNA395098, containing 42 CSV files, SRR5851375 to SRR6372958). A portion of the files is selected; the core fields are AASeq, cloneCount, cloneFraction, Vregion, Dregion, Jregion, Length, and Chain.
Data cleaning: the tidyverse package is used to read the files and keep valid sequences with Chain = "TRB"; invalid samples with an empty AASeq and those with cloneCount ≤ 0 are removed; Dregion values recorded as "Unknown" are treated as missing and filled using the KNN imputation method. After these operations, 2,419,877 breast cancer records and 4,972,795 healthy records are obtained.
Labeling: breast cancer samples are given the classification label "1" and healthy samples the classification label "0". The samples are then merged to obtain a standardized dataset.
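The cleaning and labeling rules of Step 1 can be sketched as follows. This is an illustrative Python sketch rather than the tidyverse-based processing named in the embodiment. The function names, the choice of Length and cloneFraction as the distance features for KNN imputation, and the majority vote over neighbours are assumptions; the application does not state how the neighbours are found.

```python
from collections import Counter

MISSING = (None, "", "Unknown")


def clean_records(records, label):
    """Keep valid TRB rows and attach the class label (1 = breast cancer, 0 = healthy)."""
    cleaned = []
    for rec in records:
        if rec.get("Chain") != "TRB":
            continue  # keep beta-chain sequences only
        if not rec.get("AASeq"):
            continue  # drop rows with an empty amino acid sequence
        if float(rec.get("cloneCount") or 0) <= 0:
            continue  # drop rows with a non-positive clone count
        row = dict(rec)
        row["label"] = label
        cleaned.append(row)
    return cleaned


def impute_dregion_knn(rows, k=3, keys=("Length", "cloneFraction")):
    """Fill missing Dregion values by majority vote of the k nearest known rows.

    Assumption: nearness is squared Euclidean distance over `keys`.
    This brute-force search is quadratic and only suitable for small examples.
    """
    known = [r for r in rows if r.get("Dregion") not in MISSING]
    if not known:
        return rows

    def dist(a, b):
        return sum((float(a[key]) - float(b[key])) ** 2 for key in keys)

    for r in rows:
        if r.get("Dregion") in MISSING:
            nearest = sorted(known, key=lambda other: dist(r, other))[:k]
            votes = Counter(n["Dregion"] for n in nearest)
            r["Dregion"] = votes.most_common(1)[0][0]
    return rows
```

A merged dataset would then be the concatenation of `clean_records(breast_rows, 1)` and `clean_records(healthy_rows, 0)`, where `breast_rows` and `healthy_rows` are hypothetical lists of rows read from the CSV files.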

[0067] Step 2: Feature Extraction
Numerical feature extraction: calculate the clone amplification coefficient, cloneCount / (cloneFraction + 1e-10), and retain four basic numerical features: Length, cloneCount, cloneFraction, and the clone amplification coefficient.
Categorical feature encoding: convert Vregion, Dregion, and Jregion into numerical features (V_encoded, D_encoded, J_encoded) using label encoding (LabelEncoder).
Sequence feature quantification: extract the proportion of hydrophobic amino acids in AASeq, and convert the amino acid sequence into a fixed-dimensional (200-dimensional) numerical vector using the BLOSUM62 matrix.
Feature merging: combine the above three types of features to form an N×M feature matrix, where N is the number of samples and M is the total number of features.
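The three feature groups of Step 2 can be sketched in Python as below. Several details are assumptions, because the application does not give them: the set of residues counted as hydrophobic, and the way a 200-dimensional vector is derived (here, the 20-value substitution-matrix rows of the first 10 residues, zero-padded). `TOY_MATRIX` is a hypothetical identity placeholder and does not contain BLOSUM62 scores; a real BLOSUM62 table would be loaded in its place.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
HYDROPHOBIC = set("AVILMFWP")  # assumption: one common definition of hydrophobic residues

# Placeholder substitution matrix (identity rows), NOT real BLOSUM62 values.
TOY_MATRIX = {a: [1 if a == b else 0 for b in AMINO_ACIDS] for a in AMINO_ACIDS}


def clone_amplification(clone_count, clone_fraction):
    # The small constant guards against division by zero, as in the description.
    return clone_count / (clone_fraction + 1e-10)


def hydrophobic_fraction(seq):
    return sum(aa in HYDROPHOBIC for aa in seq) / len(seq) if seq else 0.0


def fit_label_encoder(values):
    # Sorted unique values mapped to 0..n-1, the usual label-encoding convention.
    return {v: i for i, v in enumerate(sorted(set(values)))}


def sequence_vector(seq, matrix, max_len=10):
    """Concatenate matrix rows for the first max_len residues, zero-padded to max_len * 20."""
    width = len(AMINO_ACIDS)
    vec = []
    for aa in seq[:max_len]:
        row = matrix.get(aa)
        vec.extend(row if row is not None else [0] * width)  # unknown residue -> zeros
    vec.extend([0] * (max_len * width - len(vec)))
    return vec


def build_feature_row(rec, v_codes, d_codes, j_codes, matrix=TOY_MATRIX):
    """Return one row of the feature matrix: 4 numeric + 3 region + 1 + 200 sequence values."""
    count = float(rec["cloneCount"])
    frac = float(rec["cloneFraction"])
    numeric = [float(rec["Length"]), count, frac, clone_amplification(count, frac)]
    regions = [v_codes[rec["Vregion"]], d_codes[rec["Dregion"]], j_codes[rec["Jregion"]]]
    content = [hydrophobic_fraction(rec["AASeq"])] + sequence_vector(rec["AASeq"], matrix)
    return numeric + regions + content
```

Under these assumptions M is 208; stacking one such row per sample gives the N×M feature matrix.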

[0068] Step 3: LASSO Feature Filtering
Model construction: a LASSO regression model is built with the glmnet package, with the classification label as the dependent variable and the feature matrix as the independent variables, and the number of cross-validation folds set to 10.
Feature selection: select the optimal penalty coefficient λ (lambda.min) and take the features whose coefficients in the model are non-zero; retain the top 80% of these core features ranked by feature importance score, remove the redundant features, and obtain the dimensionality-reduced feature matrix.
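The selection rule of Step 3 can be illustrated on coefficients that have already been fitted. The Python sketch below does not fit a LASSO model; it only applies the "non-zero coefficients, then top 80%" rule. Taking the absolute coefficient as the feature importance score and rounding the 80% cut upward are assumptions, as the application does not define either.

```python
import math


def select_core_features(coefs, keep_ratio=0.8):
    """Return the names of features with non-zero coefficients, keeping the top share.

    coefs: mapping of feature name -> fitted coefficient at the chosen penalty.
    Assumption: importance is |coefficient|; the cut is rounded up.
    """
    nonzero = [(name, abs(c)) for name, c in coefs.items() if c != 0.0]
    nonzero.sort(key=lambda item: item[1], reverse=True)
    # The tiny offset avoids rounding 4.000000001 up to 5 because of float error.
    n_keep = math.ceil(len(nonzero) * keep_ratio - 1e-9) if nonzero else 0
    return [name for name, _ in nonzero[:n_keep]]


def reduce_matrix(matrix, feature_names, core):
    """Keep only the columns named in `core`, in the order given by `core`."""
    idx = [feature_names.index(name) for name in core]
    return [[row[i] for i in idx] for row in matrix]
```

The list returned by `select_core_features` plays the role of the core feature set, and it would need to be stored so that the same columns can be selected when new sequence files are classified.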

[0069] Step 4: Training the SVM Classification Model
Data partitioning: the dimensionality-reduced feature matrix is divided into a training set and a test set in a 7:3 ratio, and a random seed (seed = 42) is set to ensure repeatability.
Hyperparameter optimization: using the grid search function of the caret package, optimize the hyperparameters of the SVM model (the kernel function is RBF, the gamma range is 10^-3 to 10^1, and the range of the penalty coefficient C is 10^-2 to 10^2).
Model training: use the e1071 package to build the SVM classification model, complete the training through 5-fold cross-validation, and save the model.
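The partitioning and search procedure of Step 4 can be sketched in Python as follows. The sketch does not train an SVM: `score_fn` is a hypothetical caller-supplied function that would fit a model with the given parameters on the `fit` indices and return a validation score on the `val` indices. Sampling the stated ranges at powers of ten is an assumption, and a Python seed of 42 will not reproduce the same partition as the same seed in R.

```python
import random


def split_7_3(n_samples, seed=42):
    """Shuffle sample indices with a fixed seed and split them 7:3."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    cut = n_samples * 7 // 10  # integer arithmetic avoids float rounding
    return idx[:cut], idx[cut:]


def svm_grid(gamma_exp=(-3, 1), c_exp=(-2, 2)):
    """RBF-kernel grid: gamma in 10^-3..10^1 and C in 10^-2..10^2 (powers of ten assumed)."""
    gammas = [10.0 ** e for e in range(gamma_exp[0], gamma_exp[1] + 1)]
    costs = [10.0 ** e for e in range(c_exp[0], c_exp[1] + 1)]
    return [{"kernel": "rbf", "gamma": g, "C": c} for g in gammas for c in costs]


def kfold_indices(train_idx, k=5):
    """Yield (fit, validation) index lists for k-fold cross-validation."""
    folds = [train_idx[i::k] for i in range(k)]
    for i in range(k):
        fit = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield fit, folds[i]


def grid_search(train_idx, score_fn, k=5):
    """Return (mean score, params) of the best grid point under k-fold cross-validation."""
    best = None
    for params in svm_grid():
        scores = [score_fn(params, fit, val) for fit, val in kfold_indices(train_idx, k)]
        mean = sum(scores) / len(scores)
        if best is None or mean > best[0]:
            best = (mean, params)
    return best
```

Only the training indices are passed to `grid_search`, so the 30% test set stays untouched until the final evaluation.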

[0070] Step 5: Model Validation and Prediction
Performance validation: calculate the area under the ROC curve (AUC), accuracy, precision, recall, and F1 score on the test set using the pROC package to complete the model performance evaluation.
Sample prediction: after the TCR sequence file to be predicted (for example, in CSV or TXT format) has been processed according to Steps 1 and 2 above, input it into the trained SVM model and output the classification result (breast cancer / healthy) and a confidence score.
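The evaluation metrics named in Step 5 can be written out directly. The Python sketch below is illustrative and is not the pROC-based evaluation of the embodiment. It treats label 1 (breast cancer) as the positive class and computes AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, counting ties as one half; the pairwise loop is quadratic and intended for small examples only.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with label 1 as the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


def roc_auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Here `scores` stands for the confidence score output alongside each predicted class.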

[0071] This application also provides a data processing device based on T cell receptor sequence classification. Referring to Figure 4, Figure 4 is a schematic diagram of the module structure of a data processing device based on T-cell receptor sequence classification according to an embodiment of this application. The data processing device based on T-cell receptor sequence classification includes: the data processing module 401, used to acquire the sequence data to be processed, and to perform filtering, cleaning, missing data completion, and category labeling on the sequence data to be processed to obtain the target sequence dataset; the information integration module 402, used to extract numerical representation information, region identification representation information, and sequence content representation information based on the target sequence dataset, and to integrate the numerical representation information, the region identification representation information, and the sequence content representation information to obtain a target feature matrix; the feature reduction module 403, used to perform feature reduction processing based on the target feature matrix and the category labeling results, determine the core feature set, and generate dimensionality-reduced feature data based on the core feature set; the model optimization module 404, used to construct and optimize the classification model based on the dimensionality-reduced feature data and the category labeling results to obtain the target classification model; and the target module 405, used to process the sequence file to be classified according to the processing method corresponding to the sequence data to be processed, to obtain the feature data to be classified, and to input the feature data to be classified into the target classification model to obtain the target classification result.
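One way to picture how the five modules fit together is the Python sketch below. It is only an illustration of the data flow between modules 401 to 405; the class name `TcrClassificationDevice`, the callable signatures, and the representation of the trained model as a function from a feature row to a class are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class TcrClassificationDevice:
    data_processing: Callable          # module 401: raw records -> (dataset, labels)
    information_integration: Callable  # module 402: dataset -> feature matrix
    feature_reduction: Callable        # module 403: (matrix, labels) -> (core indices, reduced matrix)
    model_optimization: Callable       # module 404: (reduced matrix, labels) -> trained model

    def train(self, raw_records):
        dataset, labels = self.data_processing(raw_records)
        matrix = self.information_integration(dataset)
        self._core, reduced = self.feature_reduction(matrix, labels)
        self._model = self.model_optimization(reduced, labels)
        return self._model

    def classify(self, raw_records):
        # Role of target module 405: repeat the training-time processing,
        # reuse the stored core feature set, then apply the trained model.
        dataset, _ = self.data_processing(raw_records)
        matrix = self.information_integration(dataset)
        reduced = [[row[i] for i in self._core] for row in matrix]
        return [self._model(row) for row in reduced]
```

Because `classify` calls the same module functions as `train`, the files to be classified pass through the same processing as the training data.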

[0072] The data processing device based on T-cell receptor sequence classification provided in this application, employing the data processing method based on T-cell receptor sequence classification described in the above embodiments, can solve the technical problem of how to improve the accuracy of T-cell receptor sequence classification and identification. Compared with the prior art, the beneficial effects of the data processing device based on T-cell receptor sequence classification provided in this application are the same as those of the data processing method based on T-cell receptor sequence classification provided in the above embodiments, and other technical features in the data processing device based on T-cell receptor sequence classification are the same as those disclosed in the methods of the above embodiments, and will not be repeated here.

[0073] This application provides a data processing device based on T-cell receptor sequence classification. The data processing device based on T-cell receptor sequence classification includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the data processing method based on T-cell receptor sequence classification in the above embodiments.

[0074] Referring now to Figure 5, Figure 5 is a schematic diagram of the hardware operating environment of the data processing method based on T-cell receptor sequence classification in the embodiments of this application. It shows the structure of a data processing device based on T-cell receptor sequence classification suitable for implementing the embodiments of this application. The data processing device based on T-cell receptor sequence classification shown in Figure 5 is merely an example and should not impose any limitations on the functionality and scope of use of the embodiments of this application.

[0075] As shown in Figure 5, the data processing device based on T-cell receptor sequence classification may include a processing unit 1001 (e.g., a central processing unit, a graphics processing unit, etc.), which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 1002 or a program loaded from storage device 1003 into random access memory (RAM) 1004. The RAM 1004 also stores various programs and data required for the operation of the data processing device based on T-cell receptor sequence classification. The processing unit 1001, ROM 1002, and RAM 1004 are interconnected via a bus 1005. An input/output (I/O) interface 1006 is also connected to the bus 1005. Typically, the following devices can be connected to the I/O interface 1006: input devices 1007 including, for example, touchscreens, touchpads, keyboards, mice, image sensors, microphones, accelerometers, gyroscopes, etc.; output devices 1008 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 1003 including, for example, magnetic tapes, hard disks, etc.; and communication devices 1009. The communication device 1009 allows the data processing device based on T-cell receptor sequence classification to communicate with other devices over a wireless or wired connection to exchange data. Although the figure shows a data processing device based on T-cell receptor sequence classification with various components, it should be understood that it is not required to implement or possess all the components shown. More or fewer components may alternatively be implemented.

[0076] In particular, according to the embodiments disclosed in this application, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, the embodiments disclosed in this application include a computer program product comprising a computer program carried on a computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. When the computer program is executed by the processing unit 1001, it performs the functions defined in the methods of the embodiments disclosed in this application.

[0077] The data processing device based on T-cell receptor sequence classification provided in this application, employing the data processing method based on T-cell receptor sequence classification in the above embodiments, can solve the technical problem of how to improve the accuracy of T-cell receptor sequence classification and identification. Compared with the prior art, the beneficial effects of the data processing device based on T-cell receptor sequence classification provided in this application are the same as those of the data processing method based on T-cell receptor sequence classification provided in the above embodiments, and other technical features in this data processing device based on T-cell receptor sequence classification are the same as those disclosed in the previous embodiment method, and will not be repeated here.

[0078] It should be understood that the various parts disclosed in this application can be implemented using hardware, software, firmware, or a combination thereof. In the description of the above embodiments, specific features, structures, materials, or characteristics can be combined in any suitable manner in one or more embodiments or examples.

[0079] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

[0080] This application provides a computer-readable storage medium having computer-readable program instructions (i.e., a computer program) stored thereon, the computer-readable program instructions being used to execute the data processing method based on T-cell receptor sequence classification in the above embodiments.

[0081] The aforementioned computer-readable storage medium carries one or more programs. When these programs are executed by a data processing device for T-cell receptor sequence classification, the data processing device performs the following actions: acquires sequence data to be processed, and performs filtering, cleaning, missing data completion, and category labeling on the sequence data to obtain a target sequence dataset; based on the target sequence dataset, extracts numerical representation information, region identification representation information, and sequence content representation information, and integrates this information to obtain a target feature matrix; based on the target feature matrix and category labeling results, performs feature reduction processing to determine a core feature set, and generates dimensionality-reduced feature data based on the core feature set; based on the dimensionality-reduced feature data and category labeling results, constructs and optimizes a classification model to obtain a target classification model; processes the sequence files to be classified according to the processing method corresponding to the sequence data to be processed, obtains feature data to be classified, and inputs the feature data to be classified into the target classification model to obtain the target classification result. Computer program code for performing the operations of this application can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server.
In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network—including a Local Area Network (LAN) or a Wide Area Network (WAN)—or can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0082] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0083] The modules described in the embodiments of this application can be implemented in software or hardware. The name of a module does not necessarily limit the functionality of the module itself.

[0084] The readable storage medium provided in this application is a computer-readable storage medium that stores computer-readable program instructions (i.e., a computer program) for executing the above-described data processing method based on T-cell receptor sequence classification, thereby solving the technical problem of how to improve the accuracy of T-cell receptor sequence-based classification and identification. Compared with the prior art, the beneficial effects of the computer-readable storage medium provided in this application are the same as those of the data processing method based on T-cell receptor sequence classification provided in the above embodiments, and will not be repeated here.

[0085] This application provides a computer program product, including a computer program that, when executed by a processor, implements the steps of the data processing method based on T-cell receptor sequence classification as described above.

[0086] The computer program product provided in this application can solve the technical problem of how to improve the accuracy of T-cell receptor sequence-based classification and identification. Compared with the prior art, the beneficial effects of the computer program product provided in the embodiments of this application are the same as the beneficial effects of the data processing method based on T-cell receptor sequence classification provided in the above embodiments, and will not be repeated here.

[0087] All user-related data involved in this application (such as sequence data to be processed) were obtained with the user's permission or consent; that is, when this application is applied to a specific product or technology, user permission is required to obtain and process the relevant data, and the processing of the relevant data must comply with the relevant laws, regulations and regulatory standards of the relevant countries and regions.

[0088] The above are merely preferred embodiments of this application and do not limit the scope of protection of this application. Any equivalent structural or procedural transformations made based on the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the scope of this application.

Claims

1. A data processing method based on T cell receptor sequence classification, characterized in that the method includes: obtaining the sequence data to be processed, and performing filtering, cleaning, missing data completion, and category labeling on the sequence data to be processed to obtain the target sequence dataset; extracting, based on the target sequence dataset, numerical representation information, region identification representation information, and sequence content representation information, and integrating the numerical representation information, the region identification representation information, and the sequence content representation information to obtain the target feature matrix; performing feature reduction processing based on the target feature matrix and the category labeling results to determine the core feature set, and generating dimensionality-reduced feature data based on the core feature set; constructing and optimizing a classification model based on the dimensionality-reduced feature data and the category labeling results to obtain the target classification model; and processing the sequence file to be classified according to the processing method corresponding to the sequence data to be processed to obtain the feature data to be classified, and inputting the feature data to be classified into the target classification model to obtain the target classification result.

2. The method as described in claim 1, characterized in that the step of obtaining the sequence data to be processed, and performing filtering, cleaning, missing data completion, and category labeling on the sequence data to be processed to obtain the target sequence dataset includes: obtaining raw sequence files from multiple sources, and extracting sequence fields, clone characterization fields, region fields, and chain fields from the raw sequence files to obtain initial sequence data; filtering the initial sequence data based on preset validity conditions, and performing outlier removal and missing field completion on the filtered sequence data to obtain normalized sequence data; and determining the category identifier based on the data source corresponding to the normalized sequence data, and adding the category identifier to the normalized sequence data to obtain the target sequence dataset.

3. The method as described in claim 1, characterized in that the step of extracting numerical representation information, region identification representation information, and sequence content representation information based on the target sequence dataset, and integrating the numerical representation information, the region identification representation information, and the sequence content representation information to obtain the target feature matrix includes: extracting, based on the target sequence dataset, numerical features corresponding to sequence clone representation and sequence length representation to obtain the numerical representation information; extracting, based on the target sequence dataset, region fields corresponding to the sequence region segments, and encoding and converting the region fields to obtain the region identification representation information; extracting, based on the target sequence dataset, amino acid composition features and sequence vector features corresponding to the sequence fields to obtain the sequence content representation information; and integrating the numerical representation information, the region identification representation information, and the sequence content representation information to obtain the target feature matrix.

4. The method as described in claim 1, characterized in that the step of performing feature reduction processing based on the target feature matrix and the category labeling results, determining the core feature set, and generating dimensionality-reduced feature data based on the core feature set includes: constructing a feature selection model based on the target feature matrix and the category labeling results, and analyzing, based on the feature selection model, the correlation between each feature in the target feature matrix and the category labeling results to obtain the feature selection results; determining, based on the feature selection results, target features that meet preset conditions, and defining the target features as the core feature set; and reducing and mapping the target feature matrix based on the core feature set to obtain the dimensionality-reduced feature data.

5. The method as described in claim 1, characterized in that the step of constructing and optimizing a classification model based on the dimensionality-reduced feature data and the category labeling results to obtain the target classification model includes: constructing a support vector machine classification model based on the dimensionality-reduced feature data and the category labeling results to obtain an initial classification model; optimizing the kernel function parameters and penalty parameters of the initial classification model based on a preset parameter search strategy to obtain the target parameter configuration; and training the initial classification model based on the target parameter configuration and the dimensionality-reduced feature data to obtain the target classification model.

6. The method as described in claim 1, characterized in that the step of processing the sequence file to be classified according to the processing method corresponding to the sequence data to be processed to obtain the feature data to be classified, and inputting the feature data to be classified into the target classification model to obtain the target classification result includes: obtaining the sequence file to be classified, and preprocessing the sequence file to be classified according to the filtering, cleaning, missing data completion, and category feature construction methods corresponding to the sequence data to be processed, to obtain the sequence data to be analyzed; extracting, based on the sequence data to be analyzed, numerical representation information, region identification representation information, and sequence content representation information, and processing the numerical representation information, the region identification representation information, and the sequence content representation information according to the integration method corresponding to the target feature matrix to obtain the feature data to be classified; and inputting the feature data to be classified into the target classification model for classification determination to obtain the target classification result.

7. A data processing device based on T cell receptor sequence classification, characterized in that the device includes: a data processing module, used to acquire the sequence data to be processed, and to perform filtering, cleaning, missing data completion, and category labeling on the sequence data to be processed to obtain the target sequence dataset; an information integration module, used to extract numerical representation information, region identification representation information, and sequence content representation information based on the target sequence dataset, and to integrate the numerical representation information, the region identification representation information, and the sequence content representation information to obtain the target feature matrix; a feature reduction module, used to perform feature reduction processing based on the target feature matrix and the category labeling results, determine the core feature set, and generate dimensionality-reduced feature data based on the core feature set; a model optimization module, used to construct and optimize the classification model based on the dimensionality-reduced feature data and the category labeling results to obtain the target classification model; and a target module, used to process the sequence file to be classified according to the processing method corresponding to the sequence data to be processed, to obtain the feature data to be classified, and to input the feature data to be classified into the target classification model to obtain the target classification result.

8. A data processing device based on T cell receptor sequence classification, characterized in that the device includes: a memory, a processor, and a data processing program based on T-cell receptor sequence classification stored in the memory and executable on the processor, the data processing program based on T-cell receptor sequence classification being configured to implement the steps of the data processing method based on T-cell receptor sequence classification as described in any one of claims 1 to 6.

9. A storage medium, characterized in that the storage medium stores a data processing program based on T-cell receptor sequence classification, which, when executed by a processor, implements the steps of the data processing method based on T-cell receptor sequence classification as described in any one of claims 1 to 6.

10. A computer program product, characterized in that the computer program product includes a computer program that, when executed by a processor, implements the steps of the data processing method based on T-cell receptor sequence classification as described in any one of claims 1 to 6.