Statistical and deep learning based feature derivation method, apparatus, device, and medium
By using a feature derivation method based on statistics and deep learning, the high cost and model inconsistency caused by reliance on expert knowledge in existing technologies are solved, achieving efficient and accurate feature generation and risk management.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CITIC AIBANK CORPORATION LIMITED
- Filing Date
- 2023-10-09
- Publication Date
- 2026-06-12
AI Technical Summary
Existing feature derivation methods rely on expert knowledge, resulting in high labor costs and long development cycles. Furthermore, model performance is inconsistent across different bank risk control business scenarios, making it difficult to guarantee accuracy and stability.
A feature derivation method based on statistics and deep learning is adopted. The method acquires bank risk control data, classifies it, generates embedded features with a deep learning model, and fuses these with statistical features to generate derived features.
It improves the adaptability and accuracy of feature derivation, reduces labor costs, and enhances the reliability of both the model and the resulting decisions.
[Figure: abstract drawing, CN117313032B]
Description
Technical Field
[0001] This invention relates to the field of machine learning technology, and more specifically, to a feature derivation method, apparatus, device, and medium based on statistics and deep learning.
Background Technology
[0002] Risk management plays a crucial role in the financial sector. Effective risk control is the cornerstone of the long-term sound operation of banking institutions, directly impacting not only the security of the financial institution itself but also the financial security of its clients and market stability. Feature derivation is a key tool in risk management. Existing feature derivation methods primarily rely on manually developed indicators based on business expertise, an approach with significant problems. First, existing methods are overly dependent on expert knowledge, requiring continuous adjustment and optimization of indicators by specialists in the relevant field. This leads to high labor costs and time commitments, especially when facing different risk control scenarios within a bank, which necessitate constant redesign and adjustment of features. Second, due to the diversity and constant evolution of business operations, the performance of existing feature derivation methods varies greatly across different scenarios, making it difficult to guarantee accuracy and stability in diverse environments. This inconsistency severely restricts the reliability of risk assessment and decision-making.
[0003] Given these shortcomings of the existing technologies, there is an urgent need for a feature derivation method, apparatus, device, and medium based on statistics and deep learning.
Summary of the Invention
[0004] The purpose of this invention is to provide a feature derivation method, apparatus, device, and medium based on statistics and deep learning to address the aforementioned problems. To achieve the above objective, the technical solution adopted by this invention is as follows:
[0005] In a first aspect, this application provides a feature derivation method based on statistics and deep learning, including:
[0006] Obtaining first information, which includes bank risk control data under at least two business scenarios;
[0007] Classifying the first information to obtain second information and third information, wherein the second information is sequence data and the third information is non-sequence data;
[0008] Generating fourth information based on the second information and a preset deep learning mathematical model, wherein the fourth information is an embedded feature that includes the original sequence and supplementary information;
[0009] Performing operator operations on each type of data in the third information to obtain fifth information, wherein the fifth information is the statistical feature corresponding to the third information;
[0010] Constructing a deep learning feature model based on the fourth information;
[0011] Constructing a statistical feature model based on the fifth information;
[0012] Fusing the deep learning feature model and the statistical feature model to obtain a combined model, and generating derived features based on the combined model.
[0013] In a second aspect, this application also provides a feature derivation apparatus based on statistics and deep learning, including:
[0014] The acquisition module is used to acquire first information, which includes bank risk control data under at least two business scenarios.
[0015] The classification module is used to classify the first information to obtain second information and third information, wherein the second information is sequence data and the third information is non-sequence data;
[0016] The generation module is used to generate fourth information based on the second information and a preset deep learning mathematical model. The fourth information is an embedded feature that includes the original sequence and supplementary information.
[0017] The statistics module is used to perform operator operations on each type of data in the third information to obtain the fifth information, wherein the fifth information is the statistical feature corresponding to the third information;
[0018] The first construction module is used to construct a deep learning feature model based on the fourth information;
[0019] The second construction module is used to construct a statistical feature model based on the fifth information;
[0020] The fusion module is used to perform fusion processing on the deep learning feature model and the statistical feature model to obtain a combined model, and to generate derived features based on the combined model.
[0021] In a third aspect, this application also provides a feature derivation device based on statistics and deep learning, comprising:
[0022] Memory, used to store computer programs;
[0023] A processor for implementing the steps of the statistical and deep learning-based feature derivation method when executing the computer program.
[0024] In a fourth aspect, this application also provides a medium on which a computer program is stored; the computer program, when executed by a processor, implements the steps of the above-described feature derivation method based on statistics and deep learning.
[0025] The beneficial effects of this invention are as follows:
[0026] This invention employs an automated feature derivation method, exhibiting high adaptability and versatility, eliminating the need for customized feature processing for each application scenario. By fusing deep learning and statistical features, this invention can more accurately capture information and patterns in data, thereby improving model accuracy and helping to reduce risk and enhance decision-making reliability.
[0027] Other features and advantages of the invention will be set forth in the following description, and will be apparent in part from the description, or may be learned by practicing embodiments of the invention. The objects and other advantages of the invention may be realized and obtained by means of the structures particularly pointed out in the written description, claims, and drawings.
Attached Figure Description
[0028] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0029] Figure 1 is a schematic flowchart of the feature derivation method based on statistics and deep learning described in an embodiment of the present invention;
[0030] Figure 2 is a schematic structural diagram of the feature derivation apparatus based on statistics and deep learning described in an embodiment of the present invention;
[0031] Figure 3 is a schematic structural diagram of the feature derivation device based on statistics and deep learning described in an embodiment of the present invention.
[0032] The reference numerals in the figures are as follows: 1. Acquisition module; 2. Classification module; 21. First processing unit; 22. First extraction unit; 23. First detection unit; 24. Second extraction unit; 3. Generation module; 31. Second processing unit; 32. First embedding unit; 33. First conversion unit; 4. Statistics module; 41. First calculation unit; 42. First encoding unit; 43. Third extraction unit; 44. First merging unit; 5. First construction module; 51. Third processing unit; 52. First construction unit; 53. Fourth processing unit; 6. Second construction module; 7. Fusion module; 71. Second conversion unit; 72. Third conversion unit; 73. First fusion unit; 74. First generation unit; 800. Feature derivation device based on statistics and deep learning; 801. Processor; 802. Memory; 803. Multimedia component; 804. I/O interface; 805. Communication component.
Detailed Implementation
[0033] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.
[0034] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this invention, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0035] Example 1:
[0036] This embodiment provides a feature derivation method based on statistics and deep learning.
[0037] As shown in Figure 1, the method includes steps S100, S200, S300, S400, S500, S600 and S700.
[0038] Step S100: Obtain first information, which includes bank risk control data under at least two business scenarios.
[0039] Understandably, this step involves collecting data from different financial business scenarios to facilitate subsequent feature derivation and risk management. These different business scenarios in the financial sector may include multiple areas such as investment, lending, insurance, and payments, each with its unique data characteristics and needs.
[0040] Step S200: Classify the first information to obtain the second information and the third information. The second information is sequence data and the third information is non-sequence data.
[0041] Understandably, this step, by classifying the raw data into sequential and non-sequential data, is crucial for subsequent processing and feature derivation, as sequential and non-sequential data typically require different processing methods and feature extraction techniques. User transaction records are usually sequential data because they include a series of time-series transaction information. Basic user information includes non-sequential data such as the user's name, address, and age. It should be noted that step S200 includes steps S210, S220, S230, and S240.
[0042] Step S210: Preprocess the data according to the first information to obtain preprocessed data.
[0043] Understandably, preprocessing can include data cleaning steps to remove outliers, missing values, and erroneous data from the raw data. This helps ensure data quality and accuracy, improving the reliability of subsequent analysis.
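The cleaning described in step S210 could be sketched as follows. This is a minimal illustration, not the method claimed here: the record layout, the `preprocess` function name, and the choice of the 1.5 × IQR rule for outliers are all assumptions made for the example, since the text does not specify how outliers or missing values are detected.

```python
from statistics import quantiles

def preprocess(records, numeric_field):
    """Drop records with missing values, then drop IQR outliers on one numeric field."""
    # Remove records in which any field is missing (None).
    complete = [r for r in records if all(v is not None for v in r.values())]
    if len(complete) < 4:
        return complete  # too few points to estimate quartiles meaningfully
    values = [r[numeric_field] for r in complete]
    # Quartiles of the observed values; "inclusive" treats the data as the full population.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [r for r in complete if low <= r[numeric_field] <= high]

# Hypothetical raw records: one missing amount and one extreme amount.
raw = [
    {"user": "u1", "amount": 120.0},
    {"user": "u2", "amount": 95.0},
    {"user": "u3", "amount": None},
    {"user": "u4", "amount": 110.0},
    {"user": "u5", "amount": 105.0},
    {"user": "u6", "amount": 50000.0},
]
clean = preprocess(raw, "amount")
```

In a real risk control setting an extreme amount may itself be a risk signal, so whether to drop, cap, or flag such values is a design choice the text leaves open.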
[0044] Step S220: Perform feature selection processing based on the preprocessed data to obtain preliminary identification results.
[0045] Understandably, feature selection helps filter out irrelevant or redundant features, thereby reducing the complexity of subsequent analysis and improving computational efficiency. The initial identification result is a subset of data after feature selection, containing the most informative features.
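One simple form of the filtering in step S220 is to drop fields that never vary, since a constant field cannot help distinguish records. The sketch below assumes records are dictionaries with hashable values; the function name and the constant-field criterion are illustrative assumptions, and the text does not state which selection criterion is used.

```python
def select_informative_fields(records):
    """Keep only fields whose values vary across records."""
    if not records:
        return []
    fields = records[0].keys()
    # A field with a single distinct value carries no information.
    kept = [f for f in fields if len({r[f] for r in records}) > 1]
    return [{f: r[f] for f in kept} for r in records]

# Hypothetical records: "currency" is constant and is filtered out.
rows = [
    {"amount": 20.0, "currency": "CNY"},
    {"amount": 35.5, "currency": "CNY"},
]
selected = select_informative_fields(rows)
```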
[0046] Step S230: Based on the preliminary identification results, perform timestamp detection to obtain the sequence data identification results.
[0047] Understandably, this step, through timestamp detection, allows the system to identify sequence information in the data. This sequence information may contain temporal correlations or the order of events related to the business scenario. The results of sequence data identification can be further used for training deep learning models and feature extraction. This helps to better utilize the temporal correlations in sequence data, improving the accuracy of risk assessment or prediction.
[0048] Step S240: Extract the second information and the third information from the preprocessed data based on the sequence data identification results.
[0049] Understandably, this step processes the preprocessed data based on the sequence data identification results, separating data containing sequence information from non-sequence data. This helps in better understanding and processing different types of data.
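Steps S230 and S240 could be sketched as below. The heuristic is an assumption for illustration only: a table is treated as sequence data when it has a field that parses as a timestamp in every row and at least one entity appears in more than one row. The table layout, the `user_id` key, and both function names are hypothetical.

```python
from datetime import datetime

def looks_like_timestamp(value):
    """Return True if value is a string that parses as an ISO 8601 date or datetime."""
    if not isinstance(value, str):
        return False
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False

def split_by_timestamp(tables, key="user_id"):
    """Split named tables into (sequence, non_sequence) groups."""
    sequence, non_sequence = {}, {}
    for name, rows in tables.items():
        fields = rows[0].keys() if rows else []
        # Timestamp detection: some field holds a timestamp in every row.
        has_time_field = any(
            all(looks_like_timestamp(row.get(f)) for row in rows) for f in fields
        )
        # Repeated entities indicate several time-ordered events per entity.
        keys = [row.get(key) for row in rows]
        repeated_entity = len(set(keys)) < len(keys)
        if has_time_field and repeated_entity:
            sequence[name] = rows
        else:
            non_sequence[name] = rows
    return sequence, non_sequence

tables = {
    "transactions": [
        {"user_id": "u1", "time": "2023-01-05T10:00:00", "amount": 20.0},
        {"user_id": "u1", "time": "2023-01-06T09:30:00", "amount": 35.5},
        {"user_id": "u2", "time": "2023-01-06T12:00:00", "amount": 12.0},
    ],
    "profiles": [
        {"user_id": "u1", "age": 34, "opened": "2020-03-01"},
        {"user_id": "u2", "age": 51, "opened": "2019-11-15"},
    ],
}
second_info, third_info = split_by_timestamp(tables)
```

The repeated-entity check keeps a profile table with a single date field (such as an account opening date) in the non-sequence group, which matches the later use of time type data within the third information.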
[0050] Step S300: Generate fourth information based on the second information and the preset deep learning mathematical model. The fourth information is an embedded feature that includes the original sequence and supplementary information.
[0051] Understandably, transforming the original sequence data and supplementary information into embedded features, which have a higher level of expression, can be better used for modeling and training deep learning models. It should be noted that step S300 includes steps S310, S320, and S330.
[0052] Step S310: Perform sequence processing and information supplementation based on the second information to obtain enhanced sequence data.
[0053] Understandably, this step improves the quality and features of the original sequence data through sequence processing and information supplementation, making it more suitable for training deep learning models. This helps improve the model's performance and its ability to understand sequence data.
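As one possible reading of step S310, sequence processing can mean ordering events by time, and information supplementation can mean attaching context that the raw sequence lacks. The sketch below adds the time gap since the previous event; the event layout, the `gap_seconds` field, and the function name are assumptions for illustration and are not specified in the text.

```python
from datetime import datetime

def enhance_sequence(events):
    """Sort events by time and append the gap in seconds since the previous event."""
    ordered = sorted(events, key=lambda e: datetime.fromisoformat(e["time"]))
    enhanced, previous = [], None
    for event in ordered:
        current = datetime.fromisoformat(event["time"])
        # The first event has no predecessor, so its gap is set to zero.
        gap = 0.0 if previous is None else (current - previous).total_seconds()
        enhanced.append({**event, "gap_seconds": gap})
        previous = current
    return enhanced

# Hypothetical transaction events supplied out of order.
events = [
    {"time": "2023-01-06T09:30:00", "amount": 35.5},
    {"time": "2023-01-05T10:00:00", "amount": 20.0},
]
enhanced = enhance_sequence(events)
```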
[0054] Step S320: Embedded data is obtained by embedding the enhanced sequence data and the original data.
[0055] Understandably, this step involves sequence embedding of the augmented sequence data, transforming each transaction sequence into a fixed-dimensional vector representation. Converting various data types into embedded data makes them more suitable for training deep learning models, contributing to improved model performance and data understanding.
[0056] Step S330: Convert the embedded data into an embedded vector, and train the embedded vector based on a preset deep learning mathematical model to obtain the fourth information. The fourth information is the vector output of the model before the classification layer.
[0057] Understandably, embedded data is typically a vector representation obtained by processing text, sequences, or other data types through an embedding layer. This embedded data is high-dimensional, and the model needs to transform it into embedding vectors suitable for deep learning models. This transformation process usually involves dimensionality reduction or compression of the high-dimensional embedded data to reduce computational complexity and improve model training efficiency.
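The idea in steps S320 and S330 of turning a variable-length sequence into a fixed-dimensional vector can be illustrated with an embedding lookup followed by mean pooling. This is only a stand-in under stated assumptions: the embedding table is randomly initialised rather than learned, mean pooling replaces whatever preset deep learning model is actually used, and the token vocabulary and function names are hypothetical.

```python
import random

def make_embedding_table(vocabulary, dim, seed=0):
    """Create a randomly initialised embedding table (a stand-in for learned weights)."""
    rng = random.Random(seed)
    return {token: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for token in vocabulary}

def embed_sequence(tokens, table, dim):
    """Mean-pool token embeddings so a sequence of any length maps to a dim-sized vector."""
    vectors = [table[t] for t in tokens if t in table]
    if not vectors:
        return [0.0] * dim  # empty or fully unknown sequence
    return [sum(column) / len(vectors) for column in zip(*vectors)]

table = make_embedding_table(["transfer", "withdrawal", "payment"], dim=4)
short = embed_sequence(["payment"], table, dim=4)
long = embed_sequence(["transfer", "payment", "payment", "withdrawal"], table, dim=4)
```

Both `short` and `long` have four components regardless of sequence length, which is the property the step relies on. In a trained network the corresponding vector would instead be read from the layer immediately before the classification layer.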
[0058] Step S400: Perform operator operations on each type of data in the third information to obtain the fifth information, which is the statistical feature corresponding to the third information.
[0059] Understandably, these statistical features include mean, variance, maximum, minimum, frequency, mode, etc., and different features can be selected according to different needs and data types. These statistical features can help identify patterns and regularities in the data. It should be noted that step S400 includes steps S410, S420, S430, and S440.
[0060] Step S410: Perform numerical calculations based on the numerical type data in the third information to obtain numerical statistical features.
[0061] Understandably, this step generates numerical statistical features through numerical computation. These features provide a comprehensive description of the numerical data, aiding in a deeper understanding of its properties. These statistical features can be used for various data science tasks, such as data exploration, anomaly detection, and model training, improving the accuracy and reliability of data analysis. Furthermore, the generation of these statistical features is automated, unrestricted by data type or business scenario, helping to reduce the complexity of feature engineering.
[0062] Step S420: Encode the category type data in the third information to obtain category statistical features.
[0063] Understandably, this step transforms categorical data into numerical features through encoding, which can then be used for further data analysis and modeling. Categorical statistical features allow the model to better understand patterns and correlations in the categorical data, thereby improving model performance.
[0064] Step S430: Extract time features based on the time type data in the third information to obtain time statistical features.
[0065] Understandably, this step, through the extraction of temporal features, enables the model to better understand the characteristics and patterns of time-based data. This helps improve the model's sensitivity to changes in events over time, thereby enhancing the model's performance and predictive ability.
[0066] Step S440: Perform feature merging processing on the numerical statistical features, category statistical features and time statistical features to obtain the fifth information.
[0067] Understandably, this step effectively combines different types of statistical features, providing a more comprehensive feature set for subsequent feature engineering and modeling. This helps the model better understand the diversity of the data, capture the correlations between different types of data, and improve the model's performance and predictive accuracy.
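Steps S410 to S440 can be sketched as four small functions, one per step. The specific operators chosen here (mean, population variance, maximum, minimum, one-hot encoding, month and weekday extraction) are taken from or consistent with the examples named in the text, but the field names, the feature naming scheme, and the use of one-hot encoding as the category encoding are assumptions for illustration.

```python
from datetime import date
from statistics import mean, pvariance

def numeric_features(name, values):
    """Step S410 sketch: summary statistics for one numeric field."""
    return {
        f"{name}_mean": mean(values),
        f"{name}_var": pvariance(values),
        f"{name}_max": max(values),
        f"{name}_min": min(values),
    }

def category_features(name, value, categories):
    """Step S420 sketch: one-hot encoding over a known category list."""
    return {f"{name}_{c}": 1 if value == c else 0 for c in categories}

def time_features(name, iso_date):
    """Step S430 sketch: calendar features from an ISO date string."""
    d = date.fromisoformat(iso_date)
    return {f"{name}_month": d.month, f"{name}_weekday": d.weekday()}  # Monday is 0

def merge_features(*groups):
    """Step S440 sketch: merge feature dictionaries into one flat record."""
    merged = {}
    for group in groups:
        merged.update(group)
    return merged

# Hypothetical non-sequence data for one customer.
fifth_info = merge_features(
    numeric_features("balance", [100.0, 200.0, 300.0]),
    category_features("account_type", "savings", ["savings", "checking"]),
    time_features("opened", "2023-10-09"),
)
```

Prefixing each feature with its source field name keeps the merged record free of key collisions when many fields are processed by the same operators.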
[0068] Step S500: Construct a deep learning feature model based on the fourth information.
[0069] Understandably, the deep learning feature model is able to learn feature representations of the data from the fourth information, capture complex relationships in the data, and use them for subsequent data analysis and prediction. It should be noted that step S500 includes steps S510, S520, and S530.
[0070] Step S510: Perform data cleaning and feature selection processing based on the fourth information to obtain input data.
[0071] Understandably, this step, through data cleaning and feature selection, can remove unnecessary noise and redundant information, thereby improving the model's generalization ability and training efficiency.
[0072] Step S520: Build a preliminary model based on the preset deep learning model architecture.
[0073] Understandably, the goal of this step is to create an initial deep learning model whose structure, number of layers, activation functions, and other parameters are defined according to a pre-defined model architecture.
[0074] Step S530: Train the preliminary model based on the input data to obtain a deep learning feature model.
[0075] Understandably, the goal of this step is to use the input data to train the initially constructed deep learning model, gradually adjusting the weights and parameters to better fit the training data and improve performance.
[0076] Step S600: Construct a statistical feature model based on the fifth information.
[0077] Understandably, this model is able to extract and combine statistical features from different types of data without requiring complex manual feature engineering. This makes the feature generation process more efficient and intelligent.
[0078] Step S700: The deep learning feature model and the statistical feature model are fused to obtain a combined model, and derived features are generated based on the combined model.
[0079] Understandably, this step involves fusing the outputs of the deep learning feature model and the statistical feature model. This can be achieved in various ways, such as concatenating their outputs, applying a fusion algorithm, or inputting them into another model to obtain a more comprehensive feature representation. The goal of this step is to combine the strengths of the deep learning model and the statistical model to achieve a more powerful feature representation capability. Once the deep learning feature model and the statistical feature model are fused, we can use this fused model to generate derived features. These features can reflect multiple aspects of the data, including the sequence information, statistical properties, and distribution characteristics of the original data. The generation of derived features can be a combination process that comprehensively considers different types of data features. It should be noted that step S700 includes steps S710, S720, S730, and S740.
[0080] Step S710: Perform transformation processing based on the deep learning feature model to obtain the first probability distribution feature.
[0081] Understandably, this step utilizes a pre-trained deep learning feature model to transform the input data. This model includes multiple layers and weight parameters, capturing the complex relationships within the data by learning its intrinsic feature representations. After transformation, we obtain the first probability distribution feature. This feature reflects the distribution of the data under the deep learning model and can be a probability density function, probability mass function, or other forms of probability distribution. This feature can provide statistical information about the data, including mean, variance, quantiles, etc.
[0082] Step S720: Perform transformation processing based on the statistical feature model to obtain the second probability distribution feature.
[0083] Understandably, the second probability distribution feature obtained after this transformation reflects the distribution of the data under the statistical feature model, typically expressed as a probability distribution function, density function, or other forms of probability distribution. This feature can provide information about the statistical properties of the data, such as mean, standard deviation, skewness, and kurtosis.
[0084] Step S730: Perform fusion processing based on the first probability distribution features and the second probability distribution features to obtain fused probability distribution features, and perform model construction processing based on the fused probability distribution features to obtain a combined model.
[0085] Understandably, in this step, the first and second probability distribution features are merged to generate a fused probability distribution feature. This fusion can employ various methods, such as weighted averaging, convolution operations, and linear combinations, depending on the nature of the problem and the data. The fusion process yields a new feature, the fused probability distribution feature. This feature combines information from both the first and second probability distribution features, possessing more comprehensive characteristics and patterns about the data. Next, the fused probability distribution feature is used as input or features to construct a combined model. This combined model is a machine learning model, such as a neural network, decision tree, or support vector machine. The combined model can be constructed using a training dataset so that the model can learn patterns and relationships within the data. By fusing probability distribution features and constructing a combined model, the effectiveness of data modeling can be improved, better reflecting the complex characteristics and relationships of the data. This helps in more accurate data analysis and prediction across various tasks.
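Of the fusion options named for step S730, weighted averaging is the simplest to show. The sketch below assumes both models output discrete probability distributions over the same classes; the function name, the default weight of 0.5, and the example values are hypothetical, and the text does not say how the weight would be chosen.

```python
def fuse_distributions(p, q, weight=0.5):
    """Weighted average of two discrete probability distributions over the same classes."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same classes")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    fused = [weight * a + (1.0 - weight) * b for a, b in zip(p, q)]
    total = sum(fused)
    return [x / total for x in fused]  # renormalise to guard against rounding drift

first = [0.9, 0.1]   # hypothetical output of the deep learning feature model
second = [0.6, 0.4]  # hypothetical output of the statistical feature model
fused = fuse_distributions(first, second, weight=0.5)
```

With equal weights the fused distribution is approximately [0.75, 0.25]. A convex combination of two valid distributions is itself a valid distribution, so the result can be passed to the combined model as a feature without further correction.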
[0086] Step S740: Generate derived features based on the combined model.
[0087] Understandably, traditional feature derivation methods often rely on domain expertise and manual feature engineering, resulting in low efficiency. This step, by combining deep learning and automatic feature derivation, successfully addresses the issues of low efficiency and susceptibility to changing business scenarios. It can efficiently generate a large number of features while maintaining model performance, ensuring efficient and reliable data processing and modeling capabilities even in the face of constantly evolving business scenarios.
[0088] Example 2:
[0089] As shown in Figure 2, this embodiment provides a feature derivation apparatus based on statistics and deep learning. The apparatus includes:
[0090] The acquisition module 1 is used to acquire first information, which includes bank risk control data under at least two business scenarios.
[0091] Classification module 2 is used to classify the first information to obtain the second and third information. The second information is sequence data and the third information is non-sequence data.
[0092] The generation module 3 is used to generate fourth information based on the second information and a preset deep learning mathematical model. The fourth information is an embedded feature that includes the original sequence and supplementary information.
[0093] The statistics module 4 is used to perform operator operations on each type of data in the third information to obtain the fifth information, which is the statistical feature corresponding to the third information.
[0094] The first construction module 5 is used to construct a deep learning feature model based on the fourth information.
[0095] The second construction module 6 is used to construct a statistical feature model based on the fifth information.
[0096] Fusion module 7 is used to fuse deep learning feature models and statistical feature models to obtain a combined model, and generate derived features based on the combined model.
[0097] In one specific embodiment of this disclosure, the classification module 2 includes:
[0098] The first processing unit 21 is used to perform preprocessing based on the first information to obtain preprocessed data.
[0099] The first extraction unit 22 is used to perform feature selection processing based on the preprocessed data to obtain preliminary identification results.
[0100] The first detection unit 23 is used to perform timestamp detection based on the preliminary identification results to obtain the sequence data identification results.
[0101] The second extraction unit 24 is used to extract second and third information from the preprocessed data based on the sequence data identification results.
[0102] In one specific embodiment of this disclosure, the generation module 3 includes:
[0103] The second processing unit 31 is used to perform sequence processing and information supplementation based on the second information to obtain enhanced sequence data.
[0104] The first embedding unit 32 is used to perform embedding processing on the enhanced sequence data and the original data to obtain embedded data.
[0105] The first conversion unit 33 is used to convert the embedded data into an embedded vector, and to train the embedded vector based on a preset deep learning mathematical model to obtain the fourth information, which is the vector output of the model before the classification layer.
[0106] In one specific embodiment of this disclosure, the statistics module 4 includes:
[0107] The first calculation unit 41 is used to perform numerical calculations based on the numerical type data in the third information to obtain numerical statistical features.
[0108] The first encoding unit 42 is used to encode the category type data in the third information to obtain category statistical features.
[0109] The third extraction unit 43 is used to perform time feature extraction processing on the time type data in the third information to obtain time statistical features.
[0110] The first merging unit 44 is used to perform feature merging processing on numerical statistical features, category statistical features and time statistical features to obtain the fifth information.
[0111] In one specific embodiment of this disclosure, the first construction module 5 includes:
[0112] The third processing unit 51 is used to perform data cleaning and feature selection processing based on the fourth information to obtain input data.
[0113] The first construction unit 52 is used to build a preliminary model based on a preset deep learning model architecture.
[0114] The fourth processing unit 53 is used to perform model training processing on the preliminary model based on the input data to obtain a deep learning feature model.
[0115] In one specific embodiment of this disclosure, the fusion module 7 includes:
[0116] The second conversion unit 71 is used to perform conversion processing based on the deep learning feature model to obtain the first probability distribution feature.
[0117] The third conversion unit 72 is used to perform conversion processing based on the statistical feature model to obtain the second probability distribution feature.
[0118] The first fusion unit 73 is used to perform fusion processing based on the first probability distribution features and the second probability distribution features to obtain fused probability distribution features, and to perform model construction processing based on the fused probability distribution features to obtain a combined model.
[0119] The first generation unit 74 is used to generate derived features based on the combined model.
[0120] Example 3:
[0121] Corresponding to the above method embodiments, this embodiment also provides a feature derivation device based on statistics and deep learning. The feature derivation device based on statistics and deep learning described below can be referred to in correspondence with the feature derivation method based on statistics and deep learning described above.
[0122] Figure 3 is a block diagram illustrating a feature derivation device 800 based on statistics and deep learning, according to an exemplary embodiment. As shown in Figure 3, the feature derivation device 800 based on statistics and deep learning may include a processor 801 and a memory 802. The feature derivation device 800 may also include one or more of a multimedia component 803, an I/O interface 804, and a communication component 805.
[0123] The processor 801 controls the overall operation of the feature derivation device 800 based on statistics and deep learning to complete all or part of the steps in the aforementioned feature derivation method based on statistics and deep learning. The memory 802 stores various types of data to support the operation of the feature derivation device 800. This data may include, for example, instructions for any application or method operating on the feature derivation device 800, as well as application-related data such as contact data, sent and received messages, images, audio, video, etc. The memory 802 can be implemented using any type of volatile or non-volatile storage device or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The multimedia component 803 may include a screen and an audio component. The screen may be, for example, a touchscreen, and the audio component is used to output and/or input audio signals. For example, the audio component may include a microphone for receiving external audio signals. The received audio signals may be further stored in the memory 802 or transmitted via the communication component 805. The audio component also includes at least one speaker for outputting audio signals. The I/O interface 804 provides an interface between the processor 801 and other interface modules, such as keyboards, mice, and buttons. These buttons can be virtual or physical. The communication component 805 is used for wired or wireless communication between the feature derivation device 800 and other devices. Wireless communication includes Wi-Fi, Bluetooth, Near Field Communication (NFC), 2G, 3G, or 4G, or a combination thereof. Therefore, the corresponding communication component 805 may include a Wi-Fi module, a Bluetooth module, and an NFC module.
[0124] In an exemplary embodiment, the feature derivation device 800 based on statistics and deep learning may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, in order to perform the aforementioned feature derivation method based on statistics and deep learning.
[0125] In another exemplary embodiment, a computer-readable storage medium including program instructions is also provided. When executed by a processor, the program instructions implement the steps of the above-described feature derivation method based on statistics and deep learning. For example, the computer-readable storage medium may be the memory 802 including the program instructions, which may be executed by the processor 801 of the feature derivation device 800 to complete the above-described feature derivation method based on statistics and deep learning.
[0126] Example 4:
[0127] Corresponding to the above method embodiments, this embodiment also provides a readable storage medium. The readable storage medium described below and the feature derivation method based on statistics and deep learning described above may be cross-referenced with each other.
[0128] A readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the feature derivation method based on statistics and deep learning described in the above method embodiments.
[0129] Specifically, the readable storage medium can be a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, or any other readable storage medium capable of storing program code.
[0130] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
[0131] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A feature derivation method based on statistics and deep learning, characterized in that it comprises:
obtaining first information, the first information including bank risk control data under at least two business scenarios;
classifying the first information to obtain second information and third information, the second information being sequence data and the third information being non-sequence data;
generating fourth information based on the second information and a preset deep learning mathematical model, the fourth information being an embedded feature that includes the original sequence and supplementary information, wherein the original sequence data and the supplementary information are transformed into embedded features by:
performing sequence processing and information supplementation based on the second information to obtain enhanced sequence data;
performing embedding processing on the enhanced sequence data and the original data to obtain embedded data, wherein the enhanced sequence data is processed by sequence embedding, which transforms each transaction sequence into a fixed-dimensional vector representation;
converting the embedded data into an embedded vector, and performing training with the embedded vector based on the preset deep learning mathematical model to obtain the fourth information, the fourth information being the vector output of the model before the classification layer;
performing operator operations on each type of data in the third information to obtain fifth information, the fifth information being the statistical features corresponding to the third information;
constructing a deep learning feature model based on the fourth information;
constructing a statistical feature model based on the fifth information;
performing transformation processing on input data according to the deep learning feature model to obtain a first probability distribution feature, wherein the deep learning feature model includes multiple levels and weight parameters and captures the complex relationships in the data by learning the intrinsic feature representation of the data;
performing transformation processing based on the statistical feature model to obtain a second probability distribution feature, wherein the second probability distribution feature reflects the distribution of the data under the statistical feature model;
fusing the first probability distribution feature and the second probability distribution feature to obtain a fused probability distribution feature, and performing model building based on the fused probability distribution feature to obtain a combined model;
generating derived features based on the combined model.
2. The feature derivation method based on statistics and deep learning according to claim 1, characterized in that classifying the first information to obtain the second information and the third information comprises:
performing preprocessing based on the first information to obtain preprocessed data;
performing feature selection processing on the preprocessed data to obtain preliminary identification results;
performing timestamp detection based on the preliminary identification results to obtain sequence data identification results;
extracting the second information and the third information from the preprocessed data based on the sequence data identification results.
3. The feature derivation method based on statistics and deep learning according to claim 1, characterized in that performing operator operations on each type of data in the third information to obtain the fifth information comprises:
performing numerical calculations based on the numerical type data in the third information to obtain numerical statistical features;
encoding the category type data in the third information to obtain category statistical features;
performing time feature extraction on the time type data in the third information to obtain time statistical features;
merging the numerical statistical features, the category statistical features, and the time statistical features to obtain the fifth information.
4. The feature derivation method based on statistics and deep learning according to claim 1, characterized in that constructing the deep learning feature model based on the fourth information comprises:
performing data cleaning and feature selection based on the fourth information to obtain input data;
constructing a preliminary model based on a pre-defined deep learning model architecture;
training the preliminary model based on the input data to obtain the deep learning feature model.
5. A feature derivation device based on statistics and deep learning, characterized in that it comprises:
an acquisition module, used to acquire first information, the first information including bank risk control data under at least two business scenarios;
a classification module, used to classify the first information to obtain second information and third information, wherein the second information is sequence data and the third information is non-sequence data;
a generation module, used to generate fourth information based on the second information and a preset deep learning mathematical model, wherein the fourth information is an embedded feature that includes the original sequence and supplementary information, and the original sequence data and the supplementary information are transformed into embedded features; the generation module includes:
a second processing unit, used to perform sequence processing and information supplementation based on the second information to obtain enhanced sequence data;
a first embedding unit, used to perform embedding processing on the enhanced sequence data and the original data to obtain embedded data;
a first conversion unit, used to convert the embedded data into an embedded vector and to perform training with the embedded vector based on the preset deep learning mathematical model to obtain the fourth information, wherein the fourth information is the vector output of the model before the classification layer;
a statistics module, used to perform operator operations on each type of data in the third information to obtain fifth information, wherein the fifth information is the statistical features corresponding to the third information;
a first construction module, used to construct a deep learning feature model based on the fourth information;
a second construction module, used to construct a statistical feature model based on the fifth information;
a fusion module, which includes:
a second conversion unit, used to perform conversion processing based on the deep learning feature model to obtain a first probability distribution feature, wherein the deep learning feature model includes multiple levels and weight parameters and captures the complex relationships in the data by learning the intrinsic feature representation of the data;
a third conversion unit, used to perform conversion processing based on the statistical feature model to obtain a second probability distribution feature, wherein the second probability distribution feature reflects the distribution of the data under the statistical feature model;
a first fusion unit, used to fuse the first probability distribution feature and the second probability distribution feature to obtain a fused probability distribution feature, and to perform model building based on the fused probability distribution feature to obtain a combined model;
a first generation unit, used to generate derived features based on the combined model.
6. The feature derivation device based on statistics and deep learning according to claim 5, characterized in that the classification module includes:
a first processing unit, used to perform preprocessing based on the first information to obtain preprocessed data;
a first extraction unit, used to perform feature selection processing on the preprocessed data to obtain preliminary identification results;
a first detection unit, used to perform timestamp detection based on the preliminary identification results to obtain sequence data identification results;
a second extraction unit, used to extract the second information and the third information from the preprocessed data based on the sequence data identification results.
7. The feature derivation device based on statistics and deep learning according to claim 5, characterized in that the statistics module includes:
a first calculation unit, used to perform numerical calculations based on the numerical type data in the third information to obtain numerical statistical features;
a first encoding unit, used to encode the category type data in the third information to obtain category statistical features;
a third extraction unit, used to perform time feature extraction on the time type data in the third information to obtain time statistical features;
a first merging unit, used to merge the numerical statistical features, the category statistical features, and the time statistical features to obtain the fifth information.
8. The feature derivation device based on statistics and deep learning according to claim 5, characterized in that the first construction module includes:
a third processing unit, used to perform data cleaning and feature selection processing based on the fourth information to obtain input data;
a first building unit, used to build a preliminary model based on a pre-defined deep learning model architecture;
a fourth processing unit, used to perform model training processing on the preliminary model based on the input data to obtain the deep learning feature model.
9. A feature derivation device based on statistics and deep learning, characterized in that it comprises:
a memory, used to store a computer program;
a processor, configured to implement the steps of the feature derivation method based on statistics and deep learning according to any one of claims 1 to 4 when executing the computer program.
10. A medium, characterized in that the medium stores a computer program which, when executed by a processor, implements the steps of the feature derivation method based on statistics and deep learning according to any one of claims 1 to 4.
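
Illustrative sketch (not part of the claims): the statistical-feature step of claim 3, in which numerical, category, and time type data are each processed and the results merged into the fifth information, might look roughly like the following Python. The claims do not name specific operators, so the summary statistics, the frequency encoding, the ISO-8601 timestamp format, and all function and field names here are assumptions made only for illustration.

```python
from datetime import datetime
from statistics import mean, pstdev


def numeric_features(values):
    """Numerical calculation: summary statistics for one numerical field."""
    return {
        "mean": mean(values),
        "std": pstdev(values),  # population standard deviation
        "min": min(values),
        "max": max(values),
    }


def category_features(values):
    """Category encoding: relative frequency per category (assumed encoding)."""
    return {f"freq_{c}": values.count(c) / len(values) for c in sorted(set(values))}


def time_features(timestamps):
    """Time feature extraction from ISO-8601 strings (assumed input format)."""
    parsed = [datetime.fromisoformat(t) for t in timestamps]
    return {
        "mean_hour": mean(p.hour for p in parsed),
        "weekend_share": sum(p.weekday() >= 5 for p in parsed) / len(parsed),
    }


# Hypothetical mapping from a declared field type to its operator.
OPERATORS = {
    "numeric": numeric_features,
    "category": category_features,
    "time": time_features,
}


def statistical_features(fields):
    """Merge per-field features into one flat dictionary.

    `fields` maps a field name to a (type, values) pair.
    """
    merged = {}
    for name, (kind, values) in fields.items():
        for key, value in OPERATORS[kind](values).items():
            merged[f"{name}_{key}"] = value
    return merged


# Made-up non-sequence data for a single customer.
example = {
    "amount": ("numeric", [100.0, 200.0, 300.0]),
    "channel": ("category", ["app", "web", "app"]),
    "opened_at": ("time", ["2023-10-07T10:00:00", "2023-10-09T14:00:00"]),
}
features = statistical_features(example)
print(features)
```

Prefixing each feature with its field name keeps the merged dictionary free of key collisions when several fields share the same operator.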
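
Illustrative sketch (not part of the claims): claim 1 fuses a first probability distribution feature from the deep learning feature model with a second one from the statistical feature model, but does not state the fusion rule. The Python below assumes a simple weighted average followed by renormalization; the function name `fuse_probabilities` and the parameter `weight_deep` are hypothetical, and other fusion rules would fit the claim wording equally well.

```python
def fuse_probabilities(p_deep, p_stat, weight_deep=0.5):
    """Fuse two class-probability vectors by a weighted average.

    p_deep: probabilities from the deep learning feature model.
    p_stat: probabilities from the statistical feature model.
    weight_deep: share given to the deep learning output (assumed parameter).
    """
    if len(p_deep) != len(p_stat):
        raise ValueError("both distributions must cover the same classes")
    if not 0.0 <= weight_deep <= 1.0:
        raise ValueError("weight_deep must lie in [0, 1]")
    fused = [
        weight_deep * a + (1.0 - weight_deep) * b
        for a, b in zip(p_deep, p_stat)
    ]
    total = sum(fused)
    if total <= 0.0:
        raise ValueError("fused distribution has no probability mass")
    # Renormalize so the result is again a probability distribution.
    return [x / total for x in fused]


# Made-up outputs of the two models for one sample (low risk, high risk).
fused = fuse_probabilities([0.8, 0.2], [0.6, 0.4], weight_deep=0.5)
print(fused)
```

In such a setup the fused vector would then serve as input to the model-building step that yields the combined model; the weight could be tuned on validation data for each business scenario.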