A data processing method and apparatus

By constructing a linear relationship between the prediction results of user data and the marginal contribution values of M-dimensional features, overfitting features can be accurately located and optimized, thus solving the overfitting problem of machine learning models in the financial field and improving prediction accuracy.

CN115659599B · Active · Publication Date: 2026-06-19 · WEBANK (CHINA)

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: WEBANK (CHINA)
Filing Date: 2022-09-27
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies in the financial field are prone to overfitting during the modeling process of machine learning models, resulting in poor prediction performance on the test set. In particular, when user data features do not conform to a normal distribution, existing methods have difficulty accurately locating overfitting features, thus affecting prediction accuracy.

Method used

By constructing a linear relationship between the prediction results of user data and the marginal contribution values of M-dimensional features, the fitting status of each dimension of features is obtained. Data is stored and retrieved through memory pools during processing, nonlinear effects are eliminated, overfitted features are accurately located, and optimization processing is performed.

Benefits of technology

It improves the prediction accuracy of machine learning models on user data, eliminates nonlinear effects, accurately locates and optimizes overfitting features, and enhances the predictive performance of the model.


Abstract

This application provides a data processing method and apparatus applicable to an optimization system for a user mining mechanism. The method includes: acquiring a user mining mechanism and a user dataset, wherein the user dataset contains N user data points and M-dimensional features; acquiring any user data point, constructing a linear relationship between the prediction result of the user data and the marginal contribution values of the M-dimensional features, whereby the marginal contribution values characterize the degree of influence of the corresponding dimensional feature on the prediction result in the user data, and the prediction result is obtained by the user mining mechanism from the user data; obtaining N*M marginal contribution values based on the linear relationship of the N user data points; acquiring the N marginal contribution values of any dimensional feature and determining the fitting condition of the dimensional feature; and optimizing the user mining mechanism based on the fitting condition of each dimensional feature. This method is used to accurately obtain the fitting condition of each dimensional feature in the user's business data and to precisely optimize the model.

Description

Technical Field

[0001] This application relates to the field of data processing technology, and in particular to a data processing method and apparatus.

Background Technology

[0002] In recent years, with the development of computer technology, more and more technologies are being applied in the financial sector, and the traditional financial industry is gradually transforming into Fintech. However, due to the security and real-time requirements of the financial industry, higher demands are being placed on technology. For example, machine learning technology is used to detect abnormal transactions in a timely manner, and to develop product upgrade plans. Therefore, machine learning technology plays a crucial role in the transformation of Fintech.

[0003] Generally, machine learning techniques can be used to model based on large amounts of data. Training an initial model with a large dataset yields an application model that can be used in production to predict a user's insurance probability (the corresponding business data for prediction could include the user's age, gender, the legal representative's age and gender, business registration information, loan history, number of loans, loan amount, deposit status, company financial situation, historical telemarketing records, etc.). However, during the modeling process, issues with the original business dataset may lead to a phenomenon where the model's predictions perform well on the training set but plummet on the test set; this phenomenon is called model overfitting. Existing techniques use linear regression models to fit the business dataset, using the optimal fitting coefficient as the fitting constant to obtain the optimal fitting line. By changing the model fitting coefficient and obtaining the squared residuals of the prediction results, overfitting features can be identified. While this method can address some overfitting issues, when the features in the business dataset do not conform to a normal distribution, this linear regression model's method of identifying overfitting features is inaccurate, and the final prediction model still cannot accurately predict the user's insurance probability.

[0004] Therefore, there is an urgent need for a data processing method and device to accurately obtain the fitting of features in various dimensions of user business data, to accurately optimize the model, and to improve the accuracy of prediction results.

Summary of the Invention

[0005] This application provides a data processing method and apparatus for accurately acquiring the fitting of features of various dimensions in a user's business data, performing precise optimization of the model, and improving the accuracy of prediction results.

[0006] In a first aspect, embodiments of this application provide a data processing method applicable to an optimization system for a user mining mechanism. The method includes: storing the acquired user mining mechanism and user dataset into a first memory pool, wherein the user dataset contains N user data points, each user data point containing M-dimensional features, where N and M are integers and both ≥ 1; retrieving any user data point from the first memory pool through a processing process; constructing a linear relationship between the prediction result of the user data point and the marginal contribution value of the M-dimensional features in the user data, wherein the marginal contribution value characterizes the degree of influence of the corresponding dimensional feature on the prediction result in the user data point. The prediction result is obtained by the user mining mechanism from the user data. The processing process is any one of a set number of processing processes in memory. Based on the linear relationship of the N user data, the marginal contribution value of any dimension feature in any user data is obtained, and the obtained N*M marginal contribution values are stored in the second memory pool. The N marginal contribution values of any dimension feature are obtained from the second memory pool through the processing process to determine the fitting condition of the dimension feature. The fitting condition is used to characterize whether the corresponding dimension feature is overfitted. Based on the fitting condition of each dimension feature, the user mining mechanism is optimized.

[0007] In the above method, a linear relationship is constructed between the prediction result of the user data and the marginal contribution values of the M-dimensional features in the user data. This expresses the contribution of each dimensional feature to the prediction result in linear rather than non-linear form. Based on the linear relationships constructed for the N user data, the marginal contribution value of each dimensional feature of each user data is obtained, giving N*M marginal contribution values. The fitting condition of each dimensional feature is then obtained from these N*M marginal contribution values, so that it is calculated after the non-linear effects of that dimensional feature have been eliminated. Existing technologies determine the fitting condition of a dimensional feature through a linear regression model regardless of whether the dimensional feature and the prediction result actually have a linear relationship, and identify overfitted dimensional features on that basis. By comparison, the fitting condition obtained in this application after eliminating the non-linear effects is more accurate, and so is the identification of overfitted dimensional features based on it. In addition, by setting up a first memory pool and a second memory pool, this application enables the processing process in memory to retrieve the user mining mechanism and the user dataset from the first memory pool, and the marginal contribution values from the second memory pool, without ambiguity, thus ensuring the accuracy of data processing in the processing process.

[0008] Optionally, the user dataset is a test user dataset or a training user dataset. Before storing the acquired user mining mechanism and user dataset into the first memory pool, the method further includes: determining that the difference between the first prediction result and the second prediction result of the user mining mechanism meets the difference condition, wherein the first prediction result and the second prediction result are obtained based on the test user dataset and the training user dataset, respectively.

[0009] In the above method, when the difference between the first and second prediction results of the user mining mechanism meets the difference condition, the optimization system determines to optimize the user mining mechanism accordingly based on data processing. This ensures the accuracy of the user mining mechanism.
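The difference condition can be sketched as a simple gap check. This is only an illustration: the text does not say how the two prediction results are scored or what the condition is, so the use of accuracy scores, the function name `meets_difference_condition`, and the threshold `max_gap` are all assumptions.

```python
def meets_difference_condition(test_score: float,
                               train_score: float,
                               max_gap: float = 0.1) -> bool:
    """Return True when the training score exceeds the test score by more
    than max_gap, which this sketch treats as the trigger for optimization.

    test_score  -- score of the first prediction result (test user dataset)
    train_score -- score of the second prediction result (training user dataset)
    max_gap     -- hypothetical threshold; the source does not specify one
    """
    return (train_score - test_score) > max_gap


# A large train/test gap suggests overfitting and triggers the optimization.
needs_optimization = meets_difference_condition(test_score=0.70, train_score=0.98)
# A small gap does not.
looks_fine = meets_difference_condition(test_score=0.95, train_score=0.97)
```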

[0010] Optionally, constructing a linear relationship between the prediction results of the user data and the marginal contribution values of the M-dimensional features in the user data includes: constructing the linear relationship based on the prediction results of the user data, the mean of the prediction results of the N user data, and the marginal contribution values of the M-dimensional features in the user data.

[0011] In the above method, a linear relationship between user data can be constructed based on the prediction results of user data, the mean of the prediction results of N user data, and the marginal contribution value of the M-dimensional features in the user data. This ensures that the linear relationship takes into account the prediction results of all user data in the user dataset and the marginal contribution value of the M-dimensional features of that user data to the prediction results, thus improving the accuracy of the linear relationship.
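One plausible way to write this linear relationship is the additive form below. The symbols are assumptions introduced for illustration and do not appear in the text: f(x_i) is the prediction result for the i-th user data, \bar{f} is the mean of the prediction results of the N user data, and \phi_{i,j} is the marginal contribution value of the j-th dimensional feature in the i-th user data.

```latex
f(x_i) = \bar{f} + \sum_{j=1}^{M} \phi_{i,j},
\qquad
\bar{f} = \frac{1}{N} \sum_{k=1}^{N} f(x_k)
```

Under this reading, each prediction is the dataset-wide mean prediction plus a sum of M per-feature terms, so the N user data together yield N*M marginal contribution values.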

[0012] Optionally, obtaining the marginal contribution value of any dimension feature in any user data includes: for any dimension feature, inputting incomplete user data into the prediction model to obtain an incomplete prediction result, wherein the incomplete user data contains M-1 other dimension features besides the dimension feature; and obtaining the marginal contribution value of the dimension feature based on the prediction result and the incomplete prediction result.

[0013] In the above method, based on the linear relationship of each user's data, the marginal contribution value of any dimension feature of any user's data can be obtained.
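The leave-one-feature-out step can be sketched as follows. This is a minimal illustration, assuming the marginal contribution is the plain difference between the full and incomplete prediction results and that the prediction model accepts a feature dictionary with one key missing; `toy_predict` and its weights are hypothetical stand-ins for the user mining mechanism.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def marginal_contribution(predict: Callable[[Features], float],
                          user: Features,
                          feature: str) -> float:
    """Prediction on the full user data minus the prediction on the
    incomplete user data (the same data without `feature`)."""
    full_prediction = predict(user)
    incomplete = {name: value for name, value in user.items() if name != feature}
    incomplete_prediction = predict(incomplete)
    return full_prediction - incomplete_prediction


def toy_predict(user: Features) -> float:
    """Hypothetical model: a weighted sum over whichever features are present."""
    weights = {"age": 0.5, "loan_count": 2.0, "deposit": -1.0}
    return sum(weights[name] * value for name, value in user.items())


user = {"age": 30.0, "loan_count": 3.0, "deposit": 4.0}
contributions = {name: marginal_contribution(toy_predict, user, name)
                 for name in user}
```

A positive value means the feature pushes the prediction up, a negative value pushes it down, and zero means no contribution. A real mechanism would need its own way to handle the missing feature, for example a baseline value.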

[0014] Optionally, before determining the fitting condition of the dimensional feature, the method further includes: obtaining the distribution results of N marginal contribution values of the dimensional feature through the processing process, storing the distribution results of the dimensional feature in a third memory pool; and correcting the marginal contribution values of the dimensional feature according to the distribution results of the dimensional feature to obtain the corrected marginal contribution values.

[0015] In the above method, for any given feature dimension, the distribution of N marginal contribution values for that feature dimension is determined. Based on the distribution of that feature dimension, the marginal contribution values are corrected to obtain the corrected marginal contribution values. This eliminates the inconsistency in the distribution of marginal contribution values for the same feature dimension across different datasets, which can lead to differences in fit. For ease of understanding, in one example, the distribution of a feature dimension in the training user dataset of the user mining mechanism differs from the distribution of that feature dimension in the test user dataset. The user mining mechanism trained on the training user dataset may have a significant difference between its second prediction result for the training user dataset and its first prediction result for the test user dataset. This application can correct the marginal contribution values based on the distribution of the feature dimension's marginal contribution values to eliminate the differences in prediction results caused by this distribution difference, improve the accuracy of the feature dimension's fit, and further enhance the optimization effect of the user mining mechanism. Furthermore, storing the distribution results in a third memory pool ensures that the processing process accurately retrieves the distribution results from the third memory pool.

[0016] Optionally, the marginal contribution value of the dimensional feature is corrected based on the distribution result of the dimensional feature, including: if the distribution result of the dimensional feature follows a normal distribution, obtaining the mean and standard deviation of the N marginal contribution values of the dimensional feature; for any user data, correcting the marginal contribution value of the dimensional feature in the user data based on the marginal contribution value of the dimensional feature in the user data, the mean and the standard deviation, to obtain the corrected marginal contribution value.

[0017] In the above method, if the distribution of the dimensional features follows a normal distribution, the marginal contribution values can be corrected based on the mean and standard deviation of the N marginal contribution values of the dimensional features. Thus, by standardizing, the distribution shift caused by the spatiotemporal acquisition bias of the dimensional feature data is eliminated, improving the accuracy of the dimensional feature fitting and further enhancing the accuracy of the optimized user mining mechanism.
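The normal-distribution case amounts to standardization. The sketch below assumes the correction is the usual z-score, (value - mean) / standard deviation, and uses the population standard deviation; neither detail is spelled out in the text, and the function name `standardize` is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def standardize(contributions: List[float]) -> List[float]:
    """Correct the N marginal contribution values of one dimensional
    feature by subtracting their mean and dividing by their standard deviation."""
    mu = mean(contributions)
    sigma = pstdev(contributions)  # population standard deviation (assumption)
    if sigma == 0:
        # All values identical: nothing to scale, so return zeros.
        return [0.0 for _ in contributions]
    return [(value - mu) / sigma for value in contributions]


# Marginal contribution values of one feature across five user data.
age_contributions = [1.0, 2.0, 3.0, 4.0, 5.0]
corrected = standardize(age_contributions)
```

After correction the values have mean 0 and standard deviation 1, so the same feature can be compared across datasets whose raw contributions are shifted or scaled differently.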

[0018] Optionally, the marginal contribution value of the dimensional feature is corrected based on the distribution result of the dimensional feature, including: if the marginal contribution value of the dimensional feature does not follow a normal distribution, obtaining the mean and standard deviation of the N marginal contribution values of the dimensional feature; for any user data, performing logarithmic processing on the marginal contribution value of the dimensional feature in the user data to obtain the logarithmically processed marginal contribution value; correcting the marginal contribution value of the dimensional feature in the user data based on the logarithmically processed marginal contribution value of the dimensional feature, the mean, and the standard deviation to obtain the corrected marginal contribution value.

[0019] In the above method, if the marginal contribution values of a dimensional feature do not follow a normal distribution, the mean and standard deviation of the N marginal contribution values of that dimensional feature are obtained, and the marginal contribution value of that dimensional feature in any user data is logarithmically processed to obtain the logarithmically processed marginal contribution value. After logarithmic processing (optionally followed by taking the square root), the values are more tightly clustered and approximately follow a normal distribution. This improves the accuracy of the subsequent fitting results for that dimensional feature, which in turn improves the optimization of the user mining mechanism based on the fitting results of each dimensional feature and thus the accuracy of the user mining mechanism.
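The non-normal case can be sketched as a logarithmic transform followed by standardization. Several details here are assumptions, since the text does not fix them: a sign-preserving `log1p` is used so that zero and negative marginal contribution values stay defined, and the mean and standard deviation are computed on the transformed values. The names `signed_log` and `log_correct` are hypothetical.

```python
import math
from statistics import mean, pstdev
from typing import List


def signed_log(value: float) -> float:
    """Sign-preserving logarithm: log(1 + |value|) with the sign of value."""
    return math.copysign(math.log1p(abs(value)), value)


def log_correct(contributions: List[float]) -> List[float]:
    """Log-transform skewed marginal contribution values, then standardize
    the transformed values with their own mean and standard deviation."""
    logged = [signed_log(value) for value in contributions]
    mu = mean(logged)
    sigma = pstdev(logged)
    if sigma == 0:
        return [0.0 for _ in logged]
    return [(value - mu) / sigma for value in logged]


# Heavily right-skewed values spanning four orders of magnitude.
skewed = [0.01, 0.1, 1.0, 10.0, 100.0]
corrected_skewed = log_correct(skewed)
```

The transform compresses large values far more than small ones, which pulls a long tail in toward the bulk of the data while preserving the ordering of the values.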

[0020] Optionally, the user dataset is divided into multiple sub-user datasets according to user data type, and the marginal contribution values of the corresponding M-dimensional features are stored in the second memory pool according to the sub-user datasets. The process retrieves N marginal contribution values of any dimensional feature from the second memory pool to determine the fitting condition of the dimensional feature, including: retrieving the marginal contribution value of the dimensional feature corresponding to any sub-user dataset from the second memory pool; determining the fitting condition of the sub-user dataset to the dimensional feature based on the marginal contribution value of the dimensional feature corresponding to the sub-user dataset; and determining the fitting condition of the dimensional feature based on the fitting weights corresponding to each sub-user dataset and the fitting condition of each sub-user dataset to the dimensional feature.

[0021] In the above method, the user dataset is divided into multiple sub-user datasets according to user data type. For each sub-user dataset, the fitting performance of its dimensional features is determined. Furthermore, based on the fitting weights corresponding to each sub-user dataset and the fitting performance of each sub-user dataset for that dimensional feature, the fitting performance of the dimensional feature is determined. In this way, by considering the fitting weights of various types of user data in the user dataset, the inaccuracies caused by uniform calculations for different types of user data are eliminated, resulting in more accurate fitting performance for each dimensional feature.

[0022] Optionally, determining the fit of the sub-user dataset to the dimensional feature based on the marginal contribution value of the dimensional feature corresponding to the sub-user dataset includes: for any user data in the sub-user dataset, performing fitting processing on the marginal contribution values of the remaining M-1 dimensional features of the user data (excluding the dimensional feature) to obtain the marginal contribution estimate of the dimensional feature of the user data; performing fitting processing on the prediction results of the remaining M-1 dimensional features of the user data to obtain the prediction result estimate of the dimensional feature of the user data; and determining the fit of the dimensional feature of the sub-user dataset based on the marginal contribution estimate of the dimensional feature of each user data in the sub-user dataset and the prediction result estimate of the dimensional feature.

[0023] In the above method, the marginal contribution estimate of the dimensional feature of the user data is obtained first. When any dimensional feature of the user data in the sub-user dataset is processed, each of the remaining M-1 dimensional features has as many feature values as there are user data in the sub-user dataset. During the fitting process, the impact of each of the M-1 dimensional features on the marginal contribution value in the sub-user dataset can therefore be obtained based on the marginal contribution value of this dimensional feature. Based on that impact and on the marginal contribution values of the M-1 dimensional features, the marginal contribution value of this dimensional feature can be estimated, yielding the marginal contribution estimate. The prediction result estimate of the dimensional feature is obtained in the same way: during the fitting process, the impact of each of the M-1 dimensional features on the prediction result in the sub-user dataset is obtained based on the prediction result of this dimensional feature, and, based on that impact and on the marginal contribution values of the M-1 dimensional features, the prediction result of this dimensional feature is estimated, yielding the prediction result estimate. In this way, the marginal contribution estimate and the prediction result estimate of this dimensional feature are obtained for each user data in the sub-user dataset. Based on the difference between the marginal contribution estimate and the marginal contribution value, and the difference between the prediction result and the prediction result estimate, the fit of this dimensional feature is determined. This ensures that the obtained fit fully considers the actual user data in the sub-user dataset and also incorporates the marginal contribution value and prediction result corresponding to the user mining mechanism, making the fit of the dimensional feature more accurate.

[0024] Optionally, for any sub-user dataset, the ratio of the number of user data in the sub-user dataset to the number of user data in the user dataset is obtained, and the ratio is used as the fitting weight of the sub-user dataset.

[0025] In the above method, the ratio of the number of user data points in the sub-user dataset to the number of user data points in the user dataset is used as the fitting weight for the dimensional feature. Thus, when obtaining the fitting result of a dimensional feature, the fitting result is obtained based on each sub-user dataset. In one example, each fitting result is multiplied by its corresponding weight, and the fitting results obtained after multiplication, considering the fitting weights of each user data type, are summed to obtain the final fitting result for that dimensional feature. This approach considers the weights of various types of user data in the business dataset, making the fitting results for each dimensional feature more accurate.
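The weighting described above can be sketched in a few lines. The sub-user dataset names, sizes, and per-subset fit values are made up for illustration, and the function names are hypothetical.

```python
from typing import Dict


def fitting_weights(sizes: Dict[str, int]) -> Dict[str, float]:
    """Weight of each sub-user dataset: its number of user data divided by
    the number of user data in the whole user dataset."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def combined_fit(sizes: Dict[str, int], fits: Dict[str, float]) -> float:
    """Multiply each sub-user dataset's fit by its weight and sum the products."""
    weights = fitting_weights(sizes)
    return sum(weights[name] * fits[name] for name in sizes)


# Hypothetical split by user data type, with one fit value per subset
# for a single dimensional feature.
sizes = {"individual": 60, "enterprise": 40}
fits = {"individual": 0.5, "enterprise": 1.0}
overall_fit = combined_fit(sizes, fits)  # 0.6 * 0.5 + 0.4 * 1.0
```

Because the weights are proportions of the whole dataset, they sum to 1, and the combined fit is a weighted average that larger sub-user datasets influence more.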

[0026] Optionally, determining the fit of the dimensional features in the sub-user dataset based on the marginal contribution estimate and the prediction result estimate of the dimensional features for each user data in the sub-user dataset includes: determining the distribution result of the dimensional features from the third memory pool; if the distribution result of the dimensional features is a normal distribution, then for any user data in the sub-user dataset, determining a first difference between the marginal contribution estimate and the marginal contribution value of the dimensional features of the user data, and a second difference between the prediction result estimate and the prediction result of the dimensional features of the user data; and determining the fit of the dimensional features in the sub-user dataset based on the first correlation between the first difference and the second difference corresponding to each user data in the sub-user dataset.

[0027] If the distribution of the dimensional features is non-normal, then for any user data in the sub-user dataset, a first order value of the first difference and a second order value of the second difference are determined; the first order value is obtained by sorting the first difference of the user data in ascending order, and the second order value is obtained by sorting the second difference of the user data in ascending order; based on the second correlation between the first order value and the second order value of each user data in the sub-user dataset, the fitting condition of the dimensional features in the sub-user dataset is determined.

[0028] In the above method, the first difference is the difference between the marginal contribution estimate of the dimensional feature and its actual marginal contribution value, and the second difference is the difference between the prediction result estimate and the actual prediction result; each reflects the fit to some degree. The fit can therefore be obtained from the correlation coefficient of these two differences. If the marginal contribution values of the dimensional feature do not follow a normal distribution, the fit is instead obtained from the correlation coefficient between the first order value (the rank of the first difference) and the second order value (the rank of the second difference). This ensures that the obtained fit fully considers the actual user data in the sub-user dataset (reflected by the marginal contribution estimate and the prediction result estimate), and also incorporates the marginal contribution values and prediction results corresponding to the user mining mechanism, making the fit of the dimensional feature more accurate.
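The two branches can be sketched as a Pearson correlation on the raw differences and a rank-based correlation on their ascending order values. This is an illustration under assumptions: the text says only "correlation", so the choice of Pearson and of Spearman-style ranking is an interpretation, ties are not handled, and the function names are hypothetical.

```python
import math
from typing import List


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def order_values(values: List[float]) -> List[float]:
    """Rank of each value after sorting in ascending order (1 = smallest).
    Ties are not averaged in this sketch."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    for position, index in enumerate(order, start=1):
        ranks[index] = float(position)
    return ranks


def fit_coefficient(first_diffs: List[float],
                    second_diffs: List[float],
                    is_normal: bool) -> float:
    """Correlate the differences directly when the distribution is normal,
    otherwise correlate their order values."""
    if is_normal:
        return pearson(first_diffs, second_diffs)
    return pearson(order_values(first_diffs), order_values(second_diffs))


# Monotonic but non-linear relationship between the two differences.
first = [1.0, 2.0, 3.0, 4.0]
second = [1.0, 4.0, 9.0, 16.0]
linear_fit = fit_coefficient(first, second, is_normal=True)
rank_fit = fit_coefficient(first, second, is_normal=False)
```

In this example the rank-based coefficient is 1 because the ordering agrees exactly, while the Pearson coefficient is slightly below 1 because the relationship is not linear. That is the usual reason to prefer ranks when normality cannot be assumed.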

[0029] In a second aspect, embodiments of this application provide a data processing apparatus applicable to an optimization system for a user mining mechanism, the apparatus comprising:

[0030] The acquisition module is used to store the acquired user mining mechanism and user dataset into a first memory pool. The user dataset contains N user data, and each user data contains M-dimensional features. N and M are integers and both are ≥1.

[0031] The processing module is used to obtain any user data from the first memory pool through the processing process, construct a linear relationship between the prediction result of the user data and the marginal contribution value of the M-dimensional features in the user data, the marginal contribution value is used to characterize the degree of influence of the corresponding dimension feature on the prediction result in the user data, the prediction result is obtained by the user mining mechanism from the user data, and the processing process is any one of a set number of processing processes in memory.

[0032] The processing module is further configured to: obtain the marginal contribution value of any dimension feature in any user data according to the linear relationship of the N user data, store the obtained N*M marginal contribution values into the second memory pool; obtain the N marginal contribution values of any dimension feature from the second memory pool through the processing process, determine the fitting condition of the dimension feature, and the fitting condition is used to characterize whether the corresponding dimension feature is overfitted.

[0033] The processing module is also used to optimize the user mining mechanism based on the fitting of the features of each dimension.

[0034] In a third aspect, embodiments of this application also provide a computing device, including: a memory for storing a program; and a processor for calling the program stored in the memory and executing, according to the obtained program, the method described in the various possible designs of the first aspect.

[0035] In a fourth aspect, embodiments of this application also provide a computer-readable non-volatile storage medium including a computer-readable program that, when read and executed by a computer, causes the computer to perform the method described in the various possible designs of the first aspect.

[0036] These or other implementations of this application will become clearer and easier to understand in the following description of the embodiments.

Attached Figure Description

[0037] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0038] Figure 1 is a schematic diagram of the architecture of an optimization system for a user mining mechanism provided in an embodiment of this application;

[0039] Figure 2 is a schematic diagram of the architecture of an optimization system for a user mining mechanism provided in an embodiment of this application;

[0040] Figure 3 is a schematic diagram of the architecture of an optimization system for a user mining mechanism provided in an embodiment of this application;

[0041] Figure 4 is a schematic diagram of the architecture of an optimization system for a user mining mechanism provided in an embodiment of this application;

[0042] Figure 5 is a flowchart illustrating a data processing method provided in an embodiment of this application;

[0043] Figure 6 is a flowchart illustrating a data processing method provided in an embodiment of this application;

[0044] Figure 7 is a schematic diagram of an apparatus corresponding to a data processing method provided in an embodiment of this application;

[0045] Figure 8 is a schematic diagram of a data processing device provided in an embodiment of this application.

Detailed Implementation

[0046] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0047] To facilitate understanding of the technical solutions provided in the embodiments of this application, some key terms used in the embodiments of this application will be explained below:

[0048] User mining mechanisms include machine learning models, such as tree-based ensemble models (e.g., random forest models; XGBoost, an optimized distributed gradient boosting library that implements machine learning algorithms within the Gradient Boosting framework and provides parallel tree boosting; LightGBM, a histogram-based gradient boosting model; and CatBoost, a gradient boosting model built on decision trees), SVM (support vector machine), and neural network models. These mechanisms determine the prediction result for input user data and, based on the prediction result, whether the user is a candidate for mining. The specific types of machine learning models included in the user mining mechanism can be determined based on the user data to be mined.

[0049] Dimensional features: These can be used to characterize the data contained in user data. Adding a dimensional feature will add that dimensional feature and its corresponding feature value to any user data corresponding to the user mining mechanism; removing a dimensional feature will remove that dimensional feature and its corresponding feature value from any user data corresponding to the user mining mechanism.

[0050] Marginal contribution value: The magnitude of the impact of adding a feature dimension on the prediction result. If positive, it means that the feature dimension has a positive effect on the prediction result; if negative, it means that the feature dimension has a negative effect on the prediction result; and if 0, it means that the feature dimension has no contribution to the prediction result.

[0051] Fit coefficient: Characterizes the degree of fit between a dimensional feature and the prediction result. The higher the fit coefficient, the higher the correlation between the dimensional feature and the prediction result.

[0052] Normal distribution: Also known as the Gaussian distribution. The normal curve is bell-shaped: high in the middle, low at both ends, and symmetric about its mean.

[0053] In the process of modeling the user mining mechanism, after training on the training user dataset, the mechanism obtains accurate prediction results (basically equal to the actual results) on the training user data. However, when the user mining mechanism is tested on the test user data, the predicted results differ significantly from the actual results of the test user data. That is, overfitting occurs during the modeling process. One of the reasons for this overfitting is the improper handling of the dimensional features of the user data.

[0054] Currently, to eliminate overfitting, a linear regression model is used to fit the user dataset to obtain the optimal fitting line. The optimal fitting coefficient of the linear regression model is used as the fitting constant. By changing the fitting coefficient and obtaining the squared residuals as the corresponding change magnitude, the overfitting coefficient is determined. When the overfitting coefficient exceeds a certain threshold, the user dataset is considered overfitted, and the dimensional feature corresponding to this coefficient is identified as the overfitted dimensional feature. This method can estimate the fitting coefficient of linear relationships (the dimensional feature and the prediction result are linear) in the user data using a linear regression model. However, when the dimensional feature and the prediction result have a non-linear relationship, this method's estimation of the fitting coefficient is inaccurate, failing to accurately locate the overfitted dimensional feature and thus hindering accurate optimization of the user mining mechanism.

[0055] Based on this, embodiments of this application provide a data processing method applicable to optimization systems for user mining mechanisms. In this method, a linear relationship is constructed between the prediction results of user data and the marginal contribution values of each dimension feature in the user data. Based on the linear relationship between the prediction results of user data and the marginal contribution values of each dimension feature in the user data, the marginal contribution value of any dimension feature in any user data is obtained. The fitting condition of that dimension feature is obtained based on its marginal contribution value. Furthermore, overfitting dimension features are determined based on the fitting conditions of each dimension feature, and the user mining mechanism is optimized for these overfitting dimension features. Therefore, even if the prediction results of the user data corresponding to the user mining mechanism have a non-linear relationship with the dimension features, overfitting dimension features can be accurately identified, and optimization can be precisely completed.

[0056] After introducing the design concept of the embodiments of this application, the main technologies involved in the embodiments of this application will be introduced below.

[0057] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.

[0058] Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning / deep learning.

[0059] Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and instructional learning.

[0060] The data processing method for optimizing user mining mechanisms provided in this application mainly involves machine learning / deep learning technologies under the field of artificial intelligence. Specifically, the method provided in this application uses machine learning to obtain an optimized user mining mechanism, which can then be used to mine users. Details will be explained in subsequent embodiments.

[0061] The following is a brief introduction to the application scenarios to which the technical solutions of the embodiments of this application are applicable. It should be noted that the application scenarios described below are only for illustrating the embodiments of this application and are not intended to limit the scope. In specific implementation, the technical solutions provided by the embodiments of this application can be flexibly applied according to actual needs.

[0062] The solution provided in this application can be applied to scenarios involving user acquisition. For example, it is applicable to scenarios such as insurance policyholder acquisition, fund recommendation, and wealth management recommendation. To facilitate the introduction of this application, the following example uses the insurance policyholder acquisition scenario.

[0063] As shown in Figure 1, this application provides a system architecture for an optimization system for user mining mechanisms, including: a linear relationship construction module, a fitting analysis module, and a model optimization module.

[0064] Linear Relationship Construction Module: This module obtains the prediction result of any user data in the user dataset using the user mining mechanism. The user dataset contains N user data points, each containing M-dimensional features, where N and M are integers and both ≥ 1. For any user data point, a linear relationship is constructed between the prediction result of that user data and the marginal contribution values of the M-dimensional features in that user data. Furthermore, based on the linear relationship of the N user data points in the user dataset, the marginal contribution value (N*M marginal contribution values) of any dimension feature in any user data point is obtained.

[0065] Fitting Analysis Module: Based on the N marginal contribution values of any dimension feature, determine the fitting status of that dimension feature, and further determine the dimension features that are overfitted based on the fitting status of each dimension feature.

[0066] Model optimization module: Optimizes the overfitted dimensional features in the user mining mechanism, eliminates the overfitting phenomenon of these overfitted dimensional features, and obtains the optimized user mining mechanism.

[0067] Based on the optimization system in Figure 1, this application embodiment provides another system architecture for an optimization system for user mining mechanisms. As shown in Figure 2, it includes: a linear relationship construction module, a distribution and correction module, a fitting analysis module, and a model optimization module.

[0068] Distribution and Correction Module: This module obtains the distribution results of the N marginal contribution values of the dimensional features in the linear relationship construction module. Based on the distribution results of the dimensional features, it corrects the marginal contribution values of the dimensional features to obtain the corrected marginal contribution values.

[0069] Based on the optimization system in Figure 1, this application embodiment provides another system architecture for an optimization system for user mining mechanisms. As shown in Figure 3, it includes: a linear relationship construction module, a classification analysis module, a fitting analysis module, and a model optimization module.

[0070] Classification Analysis Module: Classifies the user data in the user dataset to obtain sub-user datasets. The subsequent fitting analysis module then works on these sub-user datasets. For any sub-user dataset, it determines the fitting condition of any dimensional feature in that sub-user dataset based on the marginal contribution values of that feature, until the fitting condition of every dimensional feature in every sub-user dataset is obtained. For any dimensional feature, it then determines the overall fitting condition of that feature based on the fitting weight of each sub-user dataset and the feature's fitting condition in each sub-user dataset, until the fitting condition of every dimensional feature is obtained. Finally, based on the fitting condition of each dimensional feature, it identifies the overfitted dimensional features.
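The weighted combination performed by the fitting analysis module can be sketched as follows. This is a minimal illustration only: it assumes that the fitting condition of a feature in each sub-user dataset is a single number and that the overall value is a weight-normalized sum. The function name `combine_sub_dataset_fits` and the normalization are assumptions and are not taken from this application.

```python
def combine_sub_dataset_fits(sub_fits, weights):
    """Combine the per-sub-dataset fit values of one dimensional feature.

    sub_fits: fit value of the feature in each sub-user dataset.
    weights:  fitting weight of each sub-user dataset, in the same order.
    Assumption: the overall fit is the weight-normalized sum.
    """
    if not sub_fits or len(sub_fits) != len(weights):
        raise ValueError("need exactly one weight per sub-user dataset")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(f * w for f, w in zip(sub_fits, weights)) / total
```

For example, fit values 0.2 and 0.8 with weights 1 and 3 combine to (0.2*1 + 0.8*3) / 4 = 0.65.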

[0071] Based on the optimization system in Figure 2, this application embodiment provides another system architecture for an optimization system for user mining mechanisms. As shown in Figure 4, it includes: a linear relationship construction module, a distribution and correction module, a classification analysis module, a fitting analysis module, and a model optimization module.

[0072] Distribution and Correction Module: This module obtains the distribution results of the N marginal contribution values of the dimensional features in the linear relationship construction module. Based on the distribution results of the dimensional features, it corrects the marginal contribution values of the dimensional features to obtain the corrected marginal contribution values.

[0073] Classification Analysis Module: Classifies the user data in the user dataset to obtain sub-user datasets. The subsequent fitting analysis module then works on these sub-user datasets. For any sub-user dataset, it determines the fitting condition of any dimensional feature in that sub-user dataset based on the corrected marginal contribution values of that feature, until the fitting condition of every dimensional feature in every sub-user dataset is obtained. For any dimensional feature, it then determines the overall fitting condition of that feature based on the fitting weight of each sub-user dataset and the feature's fitting condition in each sub-user dataset, until the fitting condition of every dimensional feature is obtained. Finally, based on the fitting condition of each dimensional feature, it identifies the overfitted dimensional features.

[0074] The system architectures of the above-mentioned optimization systems are merely possible embodiments of this application and do not limit the specific implementation of the solutions in this application. It is also clear that any modifications to the optimization systems within the scope of the ideas in this application should be protected within the scope of this application.

[0075] Based on the above system architectures, this application provides a data processing method flow applicable to optimization systems for user mining mechanisms. As shown in Figure 5, the method flow includes:

[0076] Step 501: Store the acquired user mining mechanism and user dataset into the first memory pool. The user dataset contains N user data, and each user data contains M-dimensional features. N and M are integers and are both ≥1.

[0077] Taking financial services as an example, for the scenario of identifying policyholders, user data can include the characteristics of the legal representative at the time of insurance application (such as age and gender), company financial situation, business registration information, corporate finance-related factors (whether there is a loan record, loan balance, number of loans, whether there are deposits, deposit balance, etc.), and historical telemarketing records (average telemarketing duration, number of historical telemarketings, etc.) as dimensional features and feature values of the user data. For example, the feature value for the legal representative's age is 45 years old. It should be noted that this example does not limit the specific settings of user data in this application, nor does it limit the implementation scenario of this application. For example, in the fund recommendation scenario, user data can include user age, the types of funds the user has historically purchased, and the number of funds the user has purchased.

[0078] Step 502: A processing process obtains any user data from the first memory pool and constructs a linear relationship between the prediction result of the user data and the marginal contribution values of the M-dimensional features in the user data. The marginal contribution value characterizes the degree of influence of the corresponding dimensional feature on the prediction result in the user data; the prediction result is obtained by the user mining mechanism from the user data; and the processing process is any one of a set number of processing processes in memory.

[0079] In the above examples, in the scenario of identifying insured users, the prediction result can represent the probability that the user data is insured or whether the user data is classified as an insured user or a non-insured user. In the scenario of fund recommendation, the prediction result can represent the probability that the user data will purchase a fund, the type of fund purchased, the amount of fund purchased, etc. There are no restrictions on the specific representation of the prediction result here.

[0080] Step 503: Based on the linear relationship of the N user data, obtain the marginal contribution value of any dimension feature in any user data, and store the obtained N*M marginal contribution values into the second memory pool.

[0081] In the aforementioned scenario of mining insured users, taking the mechanism for mining users' willingness to purchase insurance as an example, historical insured users are obtained as positive samples, and historically non-insured users are obtained as negative samples. Corporate legal characteristics, corporate financial characteristics, and business registration information are used as dimensional features (there can be a total of 100 dimensional features). All user data is divided into a training user dataset (approximately 80,000 records) and a test user dataset (approximately 20,000 records) in an 8:2 ratio (this ratio is just an example; it could also be 6:4, 5:5, etc., and can be set as needed). The 80,000 training user records are used to build the machine learning model in the user mining mechanism. After the model training is complete, feeding the 80,000 training user records into the model yields 80,000 corresponding insurance probability scores y′ (prediction results). Based on the prediction results and the marginal contribution values of the M-dimensional features in the user data, the marginal contribution value of each dimensional feature of each user data can be obtained, which is to say, an 80,000*100 marginal contribution matrix. Similarly, by sending 20,000 test user data into the model, we obtain 20,000 insurance probability scores y′, resulting in a marginal contribution matrix of size 20,000*100.

[0082] Step 504: Obtain N marginal contribution values of any dimension feature from the second memory pool through the processing process, and determine the fitting status of the dimension feature. The fitting status is used to characterize whether the corresponding dimension feature is overfitted.

[0083] Step 505: Optimize the user mining mechanism based on the fitting results of the features in each dimension.

[0084] In the above method, a linear relationship is constructed between the prediction result of user data and the marginal contribution values of the M-dimensional features in the user data, which converts the effect of each dimensional feature on the prediction result from a non-linear form into a linear one. Based on the linear relationships of the N user data, the marginal contribution value of each dimensional feature of each user data is obtained, giving N*M marginal contribution values. The fitting condition of each dimensional feature is then obtained from these N*M marginal contribution values, so the fitting condition is calculated after the non-linear effects of that dimensional feature have been eliminated. Existing technologies determine the fitting condition of a dimensional feature through a linear regression model regardless of whether the dimensional feature and the prediction result are linearly related, and identify overfitted dimensional features on that basis. By comparison, the fitting condition obtained in this application after eliminating the non-linear effects is more accurate, and so is the identification of overfitted dimensional features based on it. Furthermore, by setting up a first memory pool and a second memory pool, this application lets a processing process in memory obtain the user mining mechanism and the user data of the user dataset from the first memory pool and obtain the marginal contribution values from the second memory pool without confusing the two, thus ensuring the accuracy of data processing in the processing process.
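The flow of steps 501 to 505 can be sketched as follows. This is only an illustrative outline under stated assumptions: the memory pools are modeled as plain dictionaries, the "set number of processing processes" is stood in for by a thread pool, and `contribution_fn` and `fit_fn` are hypothetical callables supplied by the caller, since this section does not fix how either is computed.

```python
from concurrent.futures import ThreadPoolExecutor


def run_flow(mechanism, dataset, contribution_fn, fit_fn, workers=2):
    # Step 501: the first pool holds the mechanism and the user dataset.
    first_pool = {"mechanism": mechanism, "dataset": dataset}

    def contributions_for(row):
        # Steps 502-503: M marginal contribution values for one user data.
        return contribution_fn(first_pool["mechanism"], row)

    # Assumption: worker threads stand in for the processing processes.
    with ThreadPoolExecutor(max_workers=workers) as executor:
        matrix = list(executor.map(contributions_for, first_pool["dataset"]))

    # Step 503: the second pool holds the N*M marginal contribution values.
    second_pool = {"contributions": matrix}
    if not matrix:
        return second_pool, []

    # Step 504: read the N values of each dimensional feature (one column)
    # from the second pool and determine its fitting condition.
    fits = []
    for j in range(len(matrix[0])):
        column = [row[j] for row in second_pool["contributions"]]
        fits.append(fit_fn(column))

    # Step 505 would optimize the mechanism based on `fits`.
    return second_pool, fits
```

Keeping the inputs and the derived marginal contribution values in separate pools mirrors the separation described in the text: workers read only from the first pool in steps 502 and 503, and only from the second pool in step 504.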

[0085] Based on the above method flow, this application embodiment provides another data processing method, wherein the user dataset is a test user dataset or a training user dataset. Before step 501, which involves storing the acquired user mining mechanism and user dataset into the first memory pool, the method further includes: determining that the difference between the first prediction result and the second prediction result of the user mining mechanism meets the difference condition, wherein the first prediction result and the second prediction result are obtained based on the test user dataset and the training user dataset, respectively.

[0086] In one example, for the scenario of mining insured users, user data of historical insured users and historical uninsured users can be randomly selected as the training user dataset. In addition, user data of historical insured users and historical uninsured users from a period relatively close to the current prediction time point can be randomly selected as the test user dataset. If the user mining mechanism predicts well on the user data in the training user dataset but poorly on the test user dataset (e.g., it cannot accurately predict the insurance probability of the user data in the test user dataset), and the difference between the first prediction result and the second prediction result of the user mining mechanism meets the difference condition, then the user mining mechanism needs to be optimized.
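One possible form of the difference condition is sketched below. The application does not specify the metric or the threshold, so both are assumptions here: prediction quality is measured as classification accuracy at a 0.5 cutoff, and the condition is met when training accuracy exceeds test accuracy by more than a hypothetical `max_gap`.

```python
def accuracy(scores, labels, cutoff=0.5):
    """Share of user data whose thresholded score matches the actual label."""
    hits = sum((s >= cutoff) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


def meets_difference_condition(train_scores, train_labels,
                               test_scores, test_labels, max_gap=0.1):
    """True when the mechanism is much better on training than on test data.

    Assumption: the "difference condition" is an accuracy gap above max_gap.
    """
    gap = accuracy(train_scores, train_labels) - accuracy(test_scores, test_labels)
    return gap > max_gap
```

When the function returns True, the mechanism would be passed on to step 501 for overfitting analysis; otherwise no optimization is triggered.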

[0087] Based on the above method and process, this application provides another data processing method. Step 502, constructing a linear relationship between the prediction result of the user data and the marginal contribution value of the M-dimensional features in the user data, includes: constructing the linear relationship based on the prediction result of the user data, the mean of the prediction results of the N user data, and the marginal contribution value of the M-dimensional features in the user data.

[0088] In one example, the training user data x_train used to train the user mining mechanism, the corresponding label values y_train (actual results), and the prediction results y′_train of the training user data on the model (obtained by sending the training user data to the trained user mining mechanism M) can be obtained. To obtain the marginal contribution value of each single dimensional feature to y′_train, the following equation is constructed; that is, the output y′_train of the user mining mechanism is used to find the relationship (marginal contribution value) between each dimensional feature in the training user data and y′_train.

[0089] Assume that training user data 0 has 10 dimensional features and that its prediction result is y′0. Then:

[0090] $$y'_0 = \mathrm{base} + \sum_{k=1}^{10} f_{0k} \qquad (1)$$

[0091] In formula (1), y′ represents the prediction result of the user mining mechanism, base is the mean of the prediction results y′ of the n training user data, and n represents the number of user data in the training user dataset. f_ik is the marginal contribution value of the k-th dimensional feature of the i-th training user data (here i = 0). Its value is the contribution of this dimensional feature to y′0 (the degree of influence on y′0). If it is positive, this dimensional feature has a positive effect on the prediction result; if it is negative, this dimensional feature has a negative effect on the prediction result.

[0092] For ease of understanding, formula (1) can be regarded as decomposing the user mining mechanism into an approximately linear model. The prediction result y′ of the model is taken as the target of formula (1), and the base term is also obtained from y′ as the mean of the prediction results; it represents the baseline level of the prediction results in the training user dataset. The marginal contribution value of each dimensional feature to the prediction result can then be obtained one by one according to formula (1).
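The additive structure of formula (1) can be checked on a toy case. The sketch below assumes a purely linear scorer with hypothetical `weights` and `bias`; for such a scorer, the values w_k * (x_ik - mean of feature k) are one set of contributions that satisfies "prediction = base + sum of contributions" exactly, with base equal to the mean prediction. This closed form is an illustration only; the application obtains the contributions by removing features, not by this shortcut.

```python
def linear_contributions(rows, weights, bias):
    """Return predictions, base, and per-feature contributions for a
    hypothetical linear scorer y' = bias + sum_k weights[k] * x[k]."""
    n = len(rows)
    m = len(weights)
    preds = [bias + sum(w * x for w, x in zip(weights, row)) for row in rows]
    base = sum(preds) / n  # mean of the n prediction results
    means = [sum(row[k] for row in rows) / n for k in range(m)]
    contribs = [
        [weights[k] * (row[k] - means[k]) for k in range(m)]
        for row in rows
    ]
    return preds, base, contribs


# Each prediction is recovered as base plus the sum of its contributions.
preds, base, contribs = linear_contributions(
    rows=[[1.0, 4.0], [3.0, 0.0], [5.0, 2.0]], weights=[0.5, -1.0], bias=2.0
)
reconstructed = [base + sum(row) for row in contribs]
```

A positive entry in `contribs` pushes that user data's prediction above the base level and a negative entry pushes it below, matching the sign interpretation given for formula (1).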

[0093] Based on the above method and process, this application provides another data processing method. Step 503, obtaining the marginal contribution value of any dimension feature in any user data, includes: for any dimension feature, inputting incomplete user data into the prediction model to obtain an incomplete prediction result, wherein the incomplete user data contains M-1 other dimension features besides the dimension feature; and obtaining the marginal contribution value of the dimension feature based on the prediction result and the incomplete prediction result.

[0094] In one example, the marginal contribution value of any dimensional feature in any user data can be obtained based on the linear relationships of the N user data. Specifically, the marginal contribution values are calculated by discarding the original dimensional features one at a time and fitting the target y′ with the remaining dimensional features. Take the user mining mechanism for insurance purchase intention mentioned above as an example. The insurance purchase intention model of the user mining mechanism has already been built, and the marginal contribution values of the 100 dimensional features involved in building it need to be calculated. The 80,000 training user data mentioned above are sent to the user mining mechanism to obtain 80,000 corresponding insurance purchase intention scores y′. One dimensional feature (e.g., the asset dimension feature) is randomly removed, and a sub-user mining mechanism m1 is built from the remaining 99 dimensional features. The 80,000 training user data with the asset dimension feature removed are then input into this sub-user mining mechanism, giving a corresponding score y′_m1 for each training user data with 99 dimensional features. In other words, without the asset dimension feature, the insurance purchase intention score changes from y′ to y′_m1, so the difference y′ - y′_m1 is taken as the marginal contribution value of this dimensional feature (the asset dimension feature), that is, the marginal contribution value mentioned above. In this way, the 100 dimensional features are removed one by one, and the marginal contribution of each single dimensional feature in each training user data to the prediction result of that training user data is calculated. The 80,000 training user data thus yield 80,000*100 marginal contribution values, that is, an 80,000*100 marginal contribution matrix.
Similarly, a 20,000*100 marginal contribution matrix is obtained in the same way on the test user dataset. Any model, linear or non-linear, can be decomposed in this way, because a non-linear model can also be linearized using formula (1). This solves the problem that estimation by linear fitting in the existing technology performs poorly when applied to non-linear models. Formula (1) gives the marginal contribution values of different dimensional features of different user data to the prediction result. Since the same dimensional feature influences the prediction result differently for different user data in a user dataset, it is necessary to estimate the relationship between the dimensional feature and the prediction result across the user dataset, and to use this relationship as the fitting condition between the dimensional feature and the prediction result.
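The remove-one-feature procedure can be sketched as follows. The `train` callable is a hypothetical interface (it takes the rows, the labels, and the indices of the features to keep, and returns a scoring function); the toy `train_sum_model` is a stand-in chosen only so the result is easy to verify, not a model proposed by this application.

```python
def marginal_contribution_matrix(rows, labels, train):
    """Build the N*M matrix of y' - y'_mj differences.

    rows:  N lists of M feature values.
    train: hypothetical callable train(rows, labels, kept) -> predict(row),
           where `kept` lists the feature indices the model may use.
    """
    m = len(rows[0])
    all_features = list(range(m))
    full = train(rows, labels, all_features)
    full_preds = [full(row) for row in rows]          # y' for every user data
    matrix = [[0.0] * m for _ in rows]
    for j in all_features:
        kept = [k for k in all_features if k != j]
        sub = train(rows, labels, kept)               # sub-mechanism without j
        for i, row in enumerate(rows):
            matrix[i][j] = full_preds[i] - sub(row)   # y' - y'_mj
    return matrix


def train_sum_model(rows, labels, kept):
    # Toy stand-in: ignores the labels and scores a row as the sum of the
    # kept features, so removing feature j lowers the score by row[j].
    return lambda row: sum(row[k] for k in kept)
```

With M features this rebuilds the sub-mechanism M times, which matches the one-by-one removal described in the text; for 100 features and 80,000 user data the result has 80,000 rows and 100 columns.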

[0095] Based on the above method flow, this application embodiment provides another data processing method, which further includes, before determining the fitting condition of the dimensional feature in step 504: obtaining the distribution results of N marginal contribution values of the dimensional feature through the processing process, storing the distribution results of the dimensional feature in a third memory pool; and correcting the marginal contribution values of the dimensional feature according to the distribution results of the dimensional feature to obtain the corrected marginal contribution values.

[0096] In practical implementation, for one or more dimensional features, the distribution of feature values varies significantly across the user datasets from different periods and / or spaces that correspond to the user mining mechanism. In one example, in the scenario of mining insured users, because users purchase insurance at very different times and places, the feature value distribution of one or more dimensional features obtained in the initial stage of building the user mining mechanism may differ greatly from the feature value distribution in later validation. For instance, in the initial stage of building the user mining mechanism, all historical insured users are taken as positive samples, and historical non-insured users who were contacted by phone but did not purchase insurance are taken as negative samples. In the early stages of business development, the number of negative samples is 10-20 times that of positive samples. As the business expands, however, the number of users intending to purchase insurance grows rapidly, and the ratio may approach 10 times. This change in the ratio of positive to negative samples causes the distribution of the marginal contribution values of the same dimensional feature to the prediction result to differ between the training and validation stages of the user mining mechanism. To eliminate the error in the marginal contribution values of the dimensional features caused by this change, this application determines the distribution of any dimensional feature in the user dataset and applies a correction that depends on the distribution result, so that the corrected marginal contribution values are more accurate, which helps improve the accuracy of the fitting condition of the dimensional feature.

[0097] In one example, the distribution analysis method can be either based on a normal distribution or a non-normal distribution. For instance, in a scenario involving the mining of insured users, if any dimension of the user dataset has N feature values corresponding to N user data points, the distribution can be determined by plotting the occurrence time of the user data on the horizontal axis and the feature values of the dimension (e.g., assets) on the vertical axis, or by plotting the market share of the user data on the horizontal axis and the feature values of the dimension (e.g., assets) on the vertical axis. The meaning of the horizontal and vertical axes in this distribution determination can be set according to specific needs and is not restricted here.

[0098] Determine the kurtosis and skewness of the distribution of the N marginal contribution values of the dimensional feature, and determine whether the distribution approximately follows a normal distribution based on whether the kurtosis Z-score and the skewness Z-score both lie within the interval (-1.96, +1.96). Taking the asset dimension feature as an example: let k1 represent the skewness of the distribution of the asset dimension feature, k2 the kurtosis of that distribution, std_k1 the standard error of the skewness, and std_k2 the standard error of the kurtosis.

[0099] The formula for the kurtosis of a dimensional feature is as follows:

[0100]

[0101] The formula for the skewness of a dimensional feature is as follows:

[0102]

[0103]

[0104] $$std_{k2} = \frac{4\,(N^2 - 1)\,std_{k1}}{(N-3)(N+5)} \qquad (5)$$

[0105]

[0106]

[0107] Here, u_asset is the mean of the marginal contribution values of the asset dimension feature in the user dataset, f_asset is the marginal contribution value of the asset dimension feature, and N is the number of user data in the user dataset.

[0108] According to formulas (6) and (7), the skewness Z-score and the kurtosis Z-score can be calculated respectively. If both values are between -1.96 and +1.96, the asset dimension feature is considered to follow a normal distribution; otherwise, it does not. It should be noted that using the normal distribution to analyze the distribution of the dimensional feature is only one embodiment and does not limit the specific implementation of this application. For example, a gamma distribution, a beta distribution, etc. can be applied adaptively.
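A normality check of this kind can be sketched as follows. The skewness, kurtosis, and standard-error expressions used here are the common textbook sample formulas and are an assumption; they may differ in detail from formulas (2) to (7) of this application. The function names are hypothetical.

```python
import math


def normality_z_scores(values):
    """Return (skewness Z-score, kurtosis Z-score) for a list of values."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        raise ValueError("values are constant")
    g1 = m3 / m2 ** 1.5          # population skewness
    g2 = m4 / m2 ** 2 - 3.0      # population excess kurtosis
    # Sample-adjusted skewness and excess kurtosis (textbook forms).
    skew = g1 * math.sqrt(n * (n - 1)) / (n - 2)
    kurt = (n - 1) / ((n - 2) * (n - 3)) * ((n + 1) * g2 + 6.0)
    # Standard errors of skewness and kurtosis (textbook forms).
    se_skew = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    se_kurt = 2.0 * se_skew * math.sqrt((n * n - 1) / ((n - 3) * (n + 5)))
    return skew / se_skew, kurt / se_kurt


def is_approximately_normal(values, limit=1.96):
    """True when both Z-scores lie strictly inside (-limit, +limit)."""
    z_skew, z_kurt = normality_z_scores(values)
    return abs(z_skew) < limit and abs(z_kurt) < limit
```

The limit of 1.96 corresponds to the interval (-1.96, +1.96) used in the text; a feature whose N marginal contribution values fail the check would go to the non-normal correction branch.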

[0109] Based on the above method and process, this application provides another data processing method, which corrects the marginal contribution value of the dimensional feature according to the distribution result of the dimensional feature, including: if the distribution result of the dimensional feature follows a normal distribution, obtaining the mean and standard deviation of the N marginal contribution values of the dimensional feature; for any user data, correcting the marginal contribution value of the dimensional feature in the user data according to the marginal contribution value of the dimensional feature in the user data, the mean and the standard deviation, to obtain the corrected marginal contribution value.

[0110] In other words, the marginal contribution value of a dimensional feature is corrected based on its distribution. When the dimensional feature follows a normal distribution, the marginal contribution value can be corrected using the following formula:

[0111] $$f'_{ij} = \frac{f_{ij} - u}{Std} \qquad (8)$$

[0112] Where i represents the i-th user data in the user dataset, j represents the j-th dimensional feature in the i-th user data, f′_ij is the corrected marginal contribution value of the j-th dimensional feature in the i-th user data, u is the mean of the N marginal contribution values of this dimensional feature, and Std is the standard deviation of the N marginal contribution values of this dimensional feature.
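The correction for a normally distributed dimensional feature amounts to standardizing its N marginal contribution values with the mean u and standard deviation Std. A minimal sketch follows; it assumes the population standard deviation (the application does not say whether the population or sample form is meant), and the function name is hypothetical.

```python
import statistics


def standardize_contributions(values):
    """Correct the N marginal contribution values of one dimensional feature
    by subtracting their mean and dividing by their standard deviation.

    Assumption: Std is the population standard deviation.
    """
    u = sum(values) / len(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("all marginal contribution values are identical")
    return [(f - u) / std for f in values]
```

After the correction, the values of each feature are centered at 0 with unit spread, so features whose raw contributions sit on different scales become comparable.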

[0113] Based on the above method and process, this application provides another data processing method, which corrects the marginal contribution value of the dimensional feature according to the distribution result of the dimensional feature, including: if the marginal contribution value of the dimensional feature does not follow a normal distribution, obtaining the mean and standard deviation of the N marginal contribution values of the dimensional feature; for any user data, performing logarithmic processing on the marginal contribution value of the dimensional feature in the user data to obtain the logarithmically processed marginal contribution value; correcting the marginal contribution value of the dimensional feature in the user data according to the logarithmically processed marginal contribution value of the dimensional feature, the mean and the standard deviation, to obtain the corrected marginal contribution value.

[0114] In one example, the marginal contribution value of a dimensional feature does not follow a normal distribution. The logarithm of the marginal contribution value is taken, followed by the square root, to obtain the logarithmically processed marginal contribution value. The correction method of formula (8) is then applied to obtain the corrected marginal contribution value. For dimensional features that are not approximately normal, taking the logarithm and then the square root brings the marginal contribution values closer to a normal distribution. Because the processed values approximately follow a normal distribution, the standardization of formula (8) can be reused to subtract the mean and divide by the standard deviation, eliminating the influence on the fitting condition of a distribution that is skewed to the left or right of the expected value.
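The non-normal case can be sketched in the same way. Two points are assumptions of this sketch rather than statements of the application: every marginal contribution value is taken to be at least 1, so that the logarithm is non-negative and the square root is real, and the mean and standard deviation are computed over the transformed values. The function name `correct_non_normal` is hypothetical.

```python
import math
import statistics


def correct_non_normal(contributions):
    """Apply sqrt(log(f)) to each value, then standardize the transformed values."""
    # Assumption: every value is >= 1, so log(f) >= 0 and the square root is real.
    transformed = [math.sqrt(math.log(f)) for f in contributions]
    u = statistics.fmean(transformed)
    std = statistics.pstdev(transformed)
    return [(t - u) / std for t in transformed]
```

Values below 1 (including negative marginal contributions) would need a shift or another transform before this step; the text does not say how such values are handled.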

[0115] Based on any of the above methods, this application provides another data processing method, wherein the user dataset is divided into multiple sub-user datasets according to the user data type, and the marginal contribution values of the corresponding M-dimensional features are stored in the second memory pool according to the sub-user datasets; the process obtains N marginal contribution values of any dimension feature from the second memory pool through the processing process to determine the fitting condition of the dimension feature, including: obtaining the marginal contribution value of the dimension feature corresponding to any sub-user dataset from the second memory pool through the processing process; determining the fitting condition of the sub-user dataset in the dimension feature according to the marginal contribution value of the dimension feature corresponding to the sub-user dataset; and determining the fitting condition of the dimension feature according to the fitting weights corresponding to each sub-user dataset and the fitting condition of each sub-user dataset in the dimension feature.

[0116] In practical business scenarios, the fitting condition of the same dimensional feature generally differs significantly across types of user data. For example, in the scenario of identifying insured users, users with more assets and a larger market share show a significantly different fitting condition for the same dimensional features (assets and market share) than users with fewer assets and a smaller market share. Furthermore, the amount of user data of each data type within the user dataset is often uneven, and this uneven distribution can make the fitting condition obtained for a dimensional feature inaccurate. To eliminate this inaccuracy, the user dataset can be divided into sub-user datasets by user data type. The classification can be performed by a classification machine learning model, such as logistic regression, Naive Bayes, decision trees, support vector machines, random forests, or gradient boosting trees, selected as needed; no specific restriction is imposed here. For each sub-user dataset, a fitting weight can then be determined, for example from the proportion of the user data in that sub-user dataset within the total user data of the user dataset. Alternatively, the fitting weight can be determined by the ratio of the average value of a key dimension feature in the sub-user dataset to the average value of that key dimension feature in the user dataset. The specific method for obtaining the fitting weights is not limited here and can be determined as needed. Then, based on the fitting weight of each sub-user dataset and the fitting condition of each sub-user dataset for the dimension feature, the fitting condition of the user dataset for that dimension feature is determined. This solves the problem of inaccurate fitting caused by the uneven distribution of user data across the user dataset.

[0117] For ease of understanding, embodiments of this application provide a method for obtaining fitting weights, comprising: for any sub-user dataset, obtaining the ratio of the number of user data in the sub-user dataset to the number of user data in the user dataset, and using the ratio as the fitting weight of the sub-user dataset.

[0118] In other words, for the sub-user dataset corresponding to any user data type, if the number of user data points in the user dataset is N and the number of user data points in the sub-user dataset is a, then the fitting weight of that sub-user dataset is $w = a / N$. If the user dataset is divided into class a, class b, and class c sub-user datasets, and the fitting condition is represented by a fit coefficient, then the fit coefficient of any dimension feature for the user dataset is the weighted sum of the fit coefficients of the sub-user datasets: $P = w_a P_a + w_b P_b + w_c P_c$, where $P_a$, $P_b$, and $P_c$ are the fit coefficients of the class a, class b, and class c sub-user datasets, respectively.

[0119] Taking user classification in a corporate finance scenario as an example, if the goal is to estimate a user's earnings over the past six months, users can be categorized into four groups: high-value, second-high-value, medium-value, and low-value. In practice, user data is characterized by the corporate entity's characteristics, financial characteristics, and recent loan behavior. Earnings generated by users over the past six months are then categorized into four groups based on a certain amount: high-value, second-high-value, medium-value, and low-value.

[0120] The existing 1 million users are labeled according to their revenue over the past six months using the labeling principles described above, and then split 8:2 into 800,000 training users and 200,000 test users. The ratio of the four categories in both sets is 1:3:6:10. Specifically, the 800,000 training users consist of 40,000 high-value users (the sub-user dataset of the high-value user data type), 120,000 second-high-value users (the sub-user dataset of the second-high-value user data type), 240,000 medium-value users (the sub-user dataset of the medium-value user data type), and 400,000 low-value users (the sub-user dataset of the low-value user data type). Based on the above methods, the fitting coefficient for the asset dimension feature can be obtained for each sub-user dataset (the asset dimension feature is used here as an example). Accordingly, the fitting weight of each sub-user dataset is the ratio of its user data volume to the user data volume of the training user dataset, where $N_{\text{high-value}}$ is the user data volume of the sub-user dataset of the high-value user data type, $N_{\text{second-high-value}}$ is the user data volume of the sub-user dataset of the second-high-value user data type, $N_{\text{medium-value}}$ is the user data volume of the sub-user dataset of the medium-value user data type, and $N_{\text{low-value}}$ is the user data volume of the sub-user dataset of the low-value user data type. The fitting coefficient for the asset dimension feature of the user dataset is then the weighted sum of the fitting coefficients of the four sub-user datasets.
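The weighting in this example can be worked through with a short sketch. The sub-dataset sizes are the ones given for the 800,000 training users; the per-sub-dataset fitting coefficients are hypothetical values chosen only to show the arithmetic, and the function names are likewise hypothetical.

```python
def fitting_weights(counts):
    """Fitting weight of each sub-user dataset: its size divided by the total size."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def weighted_fit_coefficient(weights, coefficients):
    """Weighted sum of the per-sub-dataset fitting coefficients."""
    return sum(weights[name] * coefficients[name] for name in weights)


# Sub-dataset sizes of the training user dataset from the example (ratio 1:3:6:10).
train_counts = {
    "high": 40_000,
    "second_high": 120_000,
    "medium": 240_000,
    "low": 400_000,
}
weights = fitting_weights(train_counts)

# Hypothetical fitting coefficients of the asset dimension feature, for illustration only.
coefficients = {"high": 0.8, "second_high": 0.6, "medium": 0.4, "low": 0.2}
overall = weighted_fit_coefficient(weights, coefficients)
```

With these sizes the weights are 0.05, 0.15, 0.30, and 0.50, and the hypothetical coefficients combine to an overall fitting coefficient of 0.35.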

[0121] Similarly, the fitting coefficient for the asset dimension feature of the test user dataset can be obtained. In one example, the fitting coefficients of the training user dataset and the test user dataset for the asset dimension feature are compared. If their difference reaches a threshold, the asset dimension feature is determined to be an overfitted dimension feature, and optimization such as deletion or parameter adjustment can be performed on the asset dimension feature in the user mining mechanism.

[0122] Based on the above method flow, this application embodiment provides a data processing method, which determines the fitting condition of the sub-user dataset to the dimensional feature according to the marginal contribution value of the dimensional feature corresponding to the sub-user dataset, including:

[0123] For any user data in the sub-user dataset, the marginal contribution values of the remaining M-1 dimensional features of the user data (excluding the stated dimensional feature) are fitted to obtain the marginal contribution estimate of the stated dimensional feature of the user data; the prediction results of the remaining M-1 dimensional features of the user data are fitted to obtain the prediction result estimate of the stated dimensional feature of the user data; the fitting condition of the stated dimensional feature of the sub-user dataset is determined based on the marginal contribution estimate of the stated dimensional feature of each user data in the sub-user dataset and the prediction result estimate of the stated dimensional feature.

[0124] In one example, the marginal contribution estimate of the j-th dimension feature in the i-th user data is fitted from the marginal contribution values of the remaining M-1 dimensional features. If the sub-user dataset contains 80,000 user data points, then 80,000 such equations can be obtained, and the coefficients $k_1, k_2, \ldots, k_{j-1}, k_{j+1}, \ldots$ are obtained by fitting, which yields the marginal contribution estimate. Similarly, for the prediction result estimate of the j-th dimension feature in the i-th user data, the coefficients $\alpha_1, \alpha_2, \ldots, \alpha_{j-1}, \alpha_{j+1}, \ldots$ can be obtained by fitting, which yields the prediction result estimate. Therefore, the marginal contribution estimate of the j-th dimension feature in the i-th user data is obtained by fitting over the marginal contribution values of the 80,000 user data points, whereas the marginal contribution value $f_{ij}$ of the j-th dimension feature in the i-th user data is derived from the linear relationship of the user data. Likewise, the prediction result estimate of the j-th dimension feature in the i-th user data is obtained by fitting over the prediction results of the 80,000 user data points, whereas the prediction result $y_{ij}$ of the j-th dimension feature in the i-th user data is obtained from the user data by the user mining mechanism. The fitting condition of the dimensional feature of the sub-user dataset can therefore be determined based on the marginal contribution estimate and the prediction result estimate of the dimensional feature of each user data in the sub-user dataset.
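The fitting step can be illustrated as follows. The exact fitting equations are not reproduced in the text, so this sketch assumes an ordinary least-squares fit without an intercept, in which the value of the j-th dimension feature is expressed as a linear combination of the remaining M-1 values with coefficients k. The helper names and the small data matrix are hypothetical.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_estimates(others, target):
    """Least-squares fit of target on the columns of others; returns (k, estimates)."""
    p = len(others[0])
    # Normal equations: (X^T X) k = X^T t.
    xtx = [[sum(row[r] * row[c] for row in others) for c in range(p)] for r in range(p)]
    xty = [sum(row[r] * t for row, t in zip(others, target)) for r in range(p)]
    k = solve_linear_system(xtx, xty)
    estimates = [sum(ki * xi for ki, xi in zip(k, row)) for row in others]
    return k, estimates


# Hypothetical marginal contribution values: 4 user data points, M = 3 features.
contributions = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0], [1.0, 1.0, 5.0], [2.0, 1.0, 7.0]]
j = 2  # index of the dimension feature being estimated
others = [[v for idx, v in enumerate(row) if idx != j] for row in contributions]
target = [row[j] for row in contributions]
k, estimates = fit_estimates(others, target)
```

The sketch fails with a division by zero when the remaining features are linearly dependent; a production implementation would use a numerically robust solver instead of the normal equations.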

[0125] Based on the above methods and processes, this application provides a data processing method that determines the fitting condition of the dimensional features of the sub-user dataset according to the marginal contribution estimate of the dimensional features of each user data in the sub-user dataset and the prediction result estimate of the dimensional features, including:

[0126] The distribution result of the dimensional feature is obtained from the third memory pool; if the distribution result of the dimensional feature is a normal distribution, then for any user data in the sub-user dataset, the first difference between the marginal contribution estimate of the dimensional feature of the user data and the marginal contribution value of the dimensional feature of the user data, and the second difference between the prediction result estimate of the dimensional feature of the user data and the prediction result of the dimensional feature of the user data, are determined; based on the first correlation between the first differences and the second differences corresponding to the user data in the sub-user dataset, the fitting condition of the dimensional feature in the sub-user dataset is determined;

[0127] If the distribution result of the dimensional feature is non-normal, then for any user data in the sub-user dataset, determine the first order value of the first difference of the user data and the second order value of the second difference; the first order value is obtained by sorting the first difference of the user data in ascending order, and the second order value is obtained by sorting the second difference of the user data in ascending order.

[0128] The fitting condition of the dimensional features in the sub-user dataset is determined based on the second correlation between the first order value and the second order value corresponding to each user data in the sub-user dataset.

[0129] Based on the above example, if the distribution of the j-th dimension feature in the sub-user dataset is a normal distribution, then the first difference $\Delta f_{ij}$ and the second difference $\Delta y_{ij}$ of the i-th user data are determined. The fitting condition of the j-th dimension feature in this sub-user dataset (in this example, the fit coefficient) can be obtained by calculating the Pearson partial correlation coefficient (the first correlation). The fit coefficient $P_{XY}$ of the j-th dimension feature in the sub-user dataset is calculated as follows:

[0130]

[0131] Where $N_a$ is the amount of user data in the sub-user dataset of user data type a.

[0132] It should be noted that the first difference characterizes the difference between the marginal contribution value of a dimension feature of the user data under the user mining mechanism and the marginal contribution estimate fitted for that dimension feature of the user data from the user dataset. To some extent, this difference reflects the fitting influence of that dimension feature of the user data. Similarly, the second difference characterizes the difference between the prediction result of the user data for a dimension feature under the user mining mechanism and the prediction result estimate fitted for that dimension feature of the user data from the user dataset. This difference also reflects, to some extent, the fitting influence of that dimension feature of the user data.
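As an illustration of the normal-distribution case, the sketch below forms the first and second differences and correlates them. It uses the standard Pearson correlation coefficient, which is an assumption because the formula of this application is not reproduced in the text; the sign convention (value minus estimate) is also an assumption and does not change the result as long as both differences use the same convention. The function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Standard Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def fit_coefficient_normal(f, f_hat, y, y_hat):
    """Correlate first differences (f - f_hat) with second differences (y - y_hat)."""
    first_diff = [value - estimate for value, estimate in zip(f, f_hat)]
    second_diff = [value - estimate for value, estimate in zip(y, y_hat)]
    return pearson(first_diff, second_diff)
```

Here `f` and `y` hold the marginal contribution values and prediction results of one dimension feature across a sub-user dataset, and `f_hat` and `y_hat` hold the corresponding fitted estimates.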

[0133] If the distribution of the j-th dimension feature in the i-th user data in the sub-user dataset does not follow a normal distribution, then the Spearman partial correlation coefficient (the second correlation) can be used to calculate the fit coefficient $r_{xy}$ of the j-th dimension feature in the sub-user dataset. The calculation formula is as follows:

[0134]

[0135] Where $d_x$ is the first order value and $d_y$ is the second order value. The first order value is the position of the first difference of the i-th user data after the first differences of all user data in the sub-user dataset are sorted in ascending order. The second order value is the position of the second difference of the i-th user data after the second differences of all user data in the sub-user dataset are sorted in ascending order.

[0136] In one example, the sub-user dataset contains data from 7 users. The first and second differences for each user's data are shown in Table 1:

[0137]

| Δf_ij | Δy_ij |
| --- | --- |
| 0.052 | 0.28 |
| 0.051 | 0.32 |
| 0.37 | 0.06 |
| 0.68 | 0.03 |
| 0.01 | 0.99 |
| 0.27 | 0.12 |
| 0.82 | 0.2 |

[0138] Table 1

[0139] Sorting the above Δf_ij and Δy_ij values in ascending order yields the first order values and the second order values, as shown in Table 2:

[0140]

[0141] Table 2
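The ordering of the Table 1 values can be reproduced with a short sketch. The order values follow directly from sorting the Table 1 data in ascending order; the correlation uses the standard Spearman rank formula $1 - 6\sum (d_x - d_y)^2 / (n(n^2 - 1))$, which is an assumption because the formula of this application is not reproduced in the text. The function names are hypothetical.

```python
def ascending_order_values(values):
    """Position of each value after an ascending sort (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    positions = [0] * len(values)
    for position, index in enumerate(order, start=1):
        positions[index] = position
    return positions


def spearman(first_diffs, second_diffs):
    """Standard Spearman rank correlation for tie-free data."""
    n = len(first_diffs)
    d_x = ascending_order_values(first_diffs)
    d_y = ascending_order_values(second_diffs)
    squared = sum((a - b) ** 2 for a, b in zip(d_x, d_y))
    return 1 - 6 * squared / (n * (n ** 2 - 1))


# First and second differences of the 7 user data points in Table 1.
delta_f = [0.052, 0.051, 0.37, 0.68, 0.01, 0.27, 0.82]
delta_y = [0.28, 0.32, 0.06, 0.03, 0.99, 0.12, 0.2]

first_order_values = ascending_order_values(delta_f)
second_order_values = ascending_order_values(delta_y)
r_xy = spearman(delta_f, delta_y)
```

For the Table 1 data the first order values are 3, 2, 5, 6, 1, 4, 7 and the second order values are 5, 6, 2, 1, 7, 3, 4, which gives a coefficient of about -0.786 under the assumed formula.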

[0142] Based on the above methods and processes, this application provides a data processing method which, as shown in Figure 6, includes:

[0143] Step 601: Obtain the user mining mechanism and user dataset.

[0144] Here, the user dataset contains N user data points, and each user data point contains M-dimensional features, where N and M are integers and both are ≥1.

[0145] Step 602: Obtain the prediction results of each user in the user dataset by the user mining mechanism.

[0146] Step 603: Based on the linear relationship between the prediction result of any user data and the marginal contribution value of the M-dimensional features in the user data, obtain the linear relationship of N user data.

[0147] Step 604: Based on the linear relationship of the N user data, obtain N*M marginal contribution values.

[0148] Step 605: For this user dataset, classify the user data in the user dataset and obtain the sub-user datasets corresponding to each user data type.

[0149] Step 606: For any sub-user dataset, obtain the distribution results of each dimension of the features of the sub-user dataset.

[0150] Step 607: Based on the distribution results of each dimension feature in the sub-user dataset, correct the marginal contribution value of each user data corresponding to the dimension feature, and obtain the corrected marginal contribution value. In this way, the corrected marginal contribution value of any dimension feature of any user data in any sub-user dataset is obtained, that is, the N*M corrected marginal contribution values are obtained.

[0151] Step 608: For any sub-user dataset, obtain the ratio of the number of user data points in the sub-user dataset to the total number of user data points in the user dataset, and use this ratio as the fitting weight for the sub-user dataset.

[0152] Step 609: For any user data in any sub-user dataset, perform fitting processing on the corrected marginal contribution values of the remaining M-1 dimensional features of the user data other than the dimensional feature, to obtain the marginal contribution estimate of the dimensional feature of the user data, and perform fitting processing on the prediction results of the remaining M-1 dimensional features of the user data to obtain the prediction result estimate of the dimensional feature of the user data.

[0153] Step 610: Determine the fitting coefficient of the feature of that dimension in the sub-user dataset based on the marginal contribution estimate and the prediction result estimate of the feature of that dimension for each user data in the sub-user dataset.

[0154] Step 611: Based on the above processes 602 to 610, obtain the fitting coefficients of each dimension feature in the training user dataset and the fitting coefficients of each dimension feature in the test user dataset. For the same dimension feature, determine the difference between the fitting coefficient of that dimension feature in the training user dataset and the fitting coefficient of that dimension feature in the test user dataset.

[0155] Step 612: Determine whether the difference is greater than the set threshold. If yes, proceed to step 613. If no, return to step 611 for another dimension feature, until all M dimension features have been evaluated.

[0156] Step 613: Determine that this dimension feature is overfitted. In this way, the overfitted dimension features among the M-dimensional features are obtained, and the user mining mechanism can be optimized based on them.
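Steps 611 to 613 can be summarized in a short sketch. It assumes that the difference is the absolute difference between the training and test fitting coefficients; the feature names, coefficient values, and threshold are hypothetical and serve only as an illustration.

```python
def find_overfitted_features(train_coeffs, test_coeffs, threshold):
    """Return the dimension features whose train/test coefficient gap exceeds the threshold."""
    overfitted = []
    for feature, train_value in train_coeffs.items():
        # Assumption: the gap is measured as an absolute difference.
        difference = abs(train_value - test_coeffs[feature])
        if difference > threshold:
            overfitted.append(feature)
    return overfitted


# Hypothetical fitting coefficients for two dimension features.
train_coeffs = {"asset": 0.9, "market_share": 0.5}
test_coeffs = {"asset": 0.4, "market_share": 0.45}
overfitted = find_overfitted_features(train_coeffs, test_coeffs, threshold=0.2)
```

In this illustration only the asset dimension feature is flagged, so it would be the candidate for deletion or parameter adjustment in the user mining mechanism.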

[0157] It should be noted that the execution steps of the above method are not unique. For example, before step 601, it is also possible to determine whether the difference between the first and second prediction results of the user mining mechanism meets the difference condition. If it does, then steps 601-613 are executed. Here, the first and second prediction results are obtained based on the test user dataset and the training user dataset, respectively.

[0158] In addition, embodiments of this application also provide an apparatus corresponding to the data processing method. As shown in Figure 7, the user mining mechanism and the user dataset can be stored in a first memory pool of computer memory, and the computer memory contains multiple processing processes (n processing processes). The n processing processes retrieve the user mining mechanism and the user dataset from the first memory pool, obtain the marginal contribution values, and place the N*M marginal contribution values into a second memory pool. Next, the n processing processes retrieve the N*M marginal contribution values from the second memory pool, determine the distribution result of any dimension feature in any sub-user dataset based on these values, and place the distribution results into a third memory pool. The n processing processes then retrieve the distribution result of any dimension feature in any sub-user dataset from the third memory pool, correct the marginal contribution values of that dimension feature in the user dataset according to the distribution result to obtain the corrected marginal contribution values, and place the N*M corrected marginal contribution values into a fourth memory pool. The n processing processes retrieve the N*M corrected marginal contribution values from the fourth memory pool, obtain the marginal contribution estimates and prediction result estimates, and place the N*M marginal contribution estimates and N prediction result estimates into a fifth memory pool. Finally, the n processing processes retrieve the N*M marginal contribution estimates and N prediction result estimates from the fifth memory pool, obtain the fitting coefficients of the M-dimensional features, and place them into a sixth memory pool. In this way, when the processing processes perform parallel data processing, each can retrieve exactly the data it needs from the corresponding memory pool, which ensures the accuracy of the optimization of the user mining mechanism.

[0159] Based on the same concept, embodiments of this application provide a data processing apparatus suitable for an optimization system for a user mining mechanism. As shown in Figure 8, the apparatus includes:

[0160] The acquisition module 801 is used to store the acquired user mining mechanism and user dataset into a first memory pool. The user dataset contains N user data, and each user data contains M-dimensional features. N and M are integers and are both ≥1.

[0161] The processing module 802 is used to obtain any user data from the first memory pool through the processing process, construct a linear relationship between the prediction result of the user data and the marginal contribution value of the M-dimensional features in the user data, the marginal contribution value is used to characterize the degree of influence of the corresponding dimension feature on the prediction result in the user data, the prediction result is obtained by the user mining mechanism from the user data, and the processing process is any one of a set number of processing processes in memory.

[0162] The processing module 802 is further configured to: obtain the marginal contribution value of any dimension feature in any user data according to the linear relationship of the N user data, store the obtained N*M marginal contribution values into a second memory pool; obtain the N marginal contribution values of any dimension feature from the second memory pool through the processing process, determine the fitting condition of the dimension feature, and the fitting condition is used to characterize whether the corresponding dimension feature is overfitted.

[0163] The processing module 802 is further configured to optimize the user mining mechanism based on the fitting of the features of each dimension.

[0164] Optionally, the processing module 802 is further configured to determine that the difference between the first prediction result and the second prediction result of the user mining mechanism meets the difference condition, wherein the first prediction result and the second prediction result are obtained based on the test user dataset and the training user dataset, respectively.

[0165] Optionally, the processing module 802 is specifically used to construct the linear relationship based on the prediction results of the user data, the mean of the prediction results of the N user data, and the marginal contribution value of the M-dimensional features in the user data.

[0166] Optionally, the processing module 802 is specifically used to: input incomplete user data into the prediction model to obtain an incomplete prediction result for any dimension feature, wherein the incomplete user data contains M-1 other dimension features besides the dimension feature; and obtain the marginal contribution value of the dimension feature based on the prediction result and the incomplete prediction result.
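The way a marginal contribution value is obtained from a complete and an incomplete prediction can be sketched as follows. The sketch makes two assumptions that are not stated in the text: the dimension feature is "removed" by replacing it with a baseline value, since a model with a fixed input size cannot simply drop an input, and the marginal contribution is the complete prediction minus the incomplete prediction. `toy_model` is a hypothetical stand-in for the user mining mechanism.

```python
def marginal_contribution(model, user_data, feature_index, baseline=0.0):
    """Prediction on the full data minus the prediction with one feature removed."""
    full_prediction = model(user_data)
    # Assumption: removing a feature is emulated by substituting a baseline value.
    incomplete = list(user_data)
    incomplete[feature_index] = baseline
    incomplete_prediction = model(incomplete)
    return full_prediction - incomplete_prediction


def toy_model(features):
    """Hypothetical stand-in for the user mining mechanism (a fixed linear model)."""
    return 0.5 * features[0] + 2.0 * features[1]


contribution = marginal_contribution(toy_model, [4.0, 3.0], feature_index=1)
```

For the toy model the complete prediction is 8.0 and the incomplete prediction is 2.0, so the marginal contribution of the second feature is 6.0.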

[0167] Optionally, the processing module 802 is further configured to: obtain the distribution results of the N marginal contribution values of the dimension feature through the processing process; store the distribution results of the dimension feature into a third memory pool; and correct the marginal contribution values of the dimension feature according to the distribution results of the dimension feature to obtain the corrected marginal contribution values.

[0168] Optionally, the processing module 802 is specifically used to: if the distribution result of the dimensional feature follows a normal distribution, obtain the mean and standard deviation of the N marginal contribution values of the dimensional feature; for any user data, correct the marginal contribution value of the dimensional feature in the user data according to the marginal contribution value of the dimensional feature in the user data, the mean and the standard deviation, to obtain the corrected marginal contribution value.

[0169] Optionally, the processing module 802 is specifically used to: if the marginal contribution value of the dimensional feature does not follow a normal distribution, obtain the mean and standard deviation of the N marginal contribution values of the dimensional feature; for any user data, perform logarithmic processing on the marginal contribution value of the dimensional feature in the user data to obtain the logarithmically processed marginal contribution value; and correct the marginal contribution value of the dimensional feature in the user data according to the logarithmically processed marginal contribution value of the dimensional feature, the mean, and the standard deviation to obtain the corrected marginal contribution value.

[0170] Optionally, the user dataset is divided into multiple sub-user datasets according to the user data type, and the marginal contribution values of the corresponding M-dimensional features are stored in the second memory pool according to the sub-user datasets; the processing module 802 is specifically used to: obtain the marginal contribution value of the dimensional feature corresponding to any sub-user dataset from the second memory pool through the processing process; determine the fitting condition of the sub-user dataset to the dimensional feature according to the marginal contribution value of the dimensional feature corresponding to the sub-user dataset; and determine the fitting condition of the dimensional feature according to the fitting weights corresponding to each sub-user dataset and the fitting condition of each sub-user dataset to the dimensional feature.

[0171] The processing module 802 is specifically used to: for any user data in the sub-user dataset, perform fitting processing on the marginal contribution values of the remaining M-1 dimensional features of the user data (excluding the dimensional feature) to obtain the marginal contribution estimate of the dimensional feature of the user data; perform fitting processing on the prediction results of the remaining M-1 dimensional features of the user data to obtain the prediction result estimate of the dimensional feature of the user data; and determine the fitting status of the dimensional feature of the sub-user dataset based on the marginal contribution estimate of the dimensional feature of each user data in the sub-user dataset and the prediction result estimate of the dimensional feature.

[0172] The processing module 802 is specifically used to, for any sub-user dataset, obtain the ratio of the number of user data in the sub-user dataset to the number of user data in the user dataset, and use the ratio as the fitting weight of the sub-user dataset.

[0173] The processing module 802 is specifically used to: determine the distribution result of the dimensional feature from the third memory pool; if the distribution result of the dimensional feature is a normal distribution, then for any user data in the sub-user dataset, determine the first difference between the marginal contribution estimate of the dimensional feature of the user data and the marginal contribution value of the dimensional feature of the user data, and the second difference between the prediction result estimate of the dimensional feature of the user data and the prediction result of the dimensional feature of the user data; and determine the fitting condition of the dimensional feature in the sub-user dataset based on the first correlation between the first differences and the second differences corresponding to the user data in the sub-user dataset. If the distribution result of the dimensional feature is non-normal, then for any user data in the sub-user dataset, determine the first order value of the first difference and the second order value of the second difference of the user data, where the first order value is obtained by sorting the first differences of the user data in ascending order and the second order value is obtained by sorting the second differences of the user data in ascending order; and determine the fitting condition of the dimensional feature in the sub-user dataset based on the second correlation between the first order values and the second order values of the user data in the sub-user dataset.

[0174] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0175] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to this application. It should be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, generate instructions for implementing the flowchart illustrations. Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.

[0176] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which are implemented in a process Figure 1 One or more processes and / or boxes Figure 1 The function specified in one or more boxes.

[0177] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

[0178] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.

Claims

1. A data processing method, characterized in that the method is applicable to an optimization system for a user mining mechanism and comprises:
storing an acquired user mining mechanism and user dataset in a first memory pool, wherein the user dataset contains N user data, each user data contains M-dimensional features, and N and M are integers ≥ 1;
retrieving, by a processing process, any user data from the first memory pool, and constructing a linear relationship between the prediction result of the user data and the marginal contribution values of the M-dimensional features in the user data, wherein the marginal contribution value characterizes the degree of influence of the corresponding dimensional feature on the prediction result in the user data, the prediction result is obtained by the user mining mechanism from the user data, and the processing process is any one of a set number of processing processes in memory;
obtaining, based on the linear relationships of the N user data, the marginal contribution value of any dimensional feature in any user data, and storing the obtained N*M marginal contribution values in a second memory pool, wherein the user dataset is divided into multiple sub-user datasets according to user data type, and the marginal contribution values of the corresponding M-dimensional features are stored in the second memory pool by sub-user dataset;
retrieving, by the processing process, the N marginal contribution values of any dimensional feature from the second memory pool, and determining the fitting condition of the dimensional feature, which comprises:
retrieving, by the processing process, the marginal contribution values of the dimensional feature corresponding to any sub-user dataset from the second memory pool;
for any user data in the sub-user dataset, fitting the marginal contribution values of the remaining M-1 dimensional features of the user data (excluding the dimensional feature) to obtain a marginal contribution estimate of the dimensional feature of the user data, and fitting the prediction results of the remaining M-1 dimensional features of the user data to obtain a prediction result estimate of the dimensional feature of the user data;
determining the distribution of the dimensional feature from a third memory pool;
if the distribution of the dimensional feature is normal, determining, for any user data in the sub-user dataset, a first difference between the marginal contribution estimate of the dimensional feature of the user data and the marginal contribution value of the dimensional feature of the user data, and a second difference between the prediction result estimate of the dimensional feature of the user data and the prediction result of the dimensional feature of the user data, and determining the fitting condition of the dimensional feature in the sub-user dataset based on a first correlation between the first differences and the second differences corresponding to the user data in the sub-user dataset;
if the distribution of the dimensional feature is non-normal, determining, for any user data in the sub-user dataset, a first order value of the first difference of the user data and a second order value of the second difference, wherein the first order value is obtained by sorting the first differences of the user data in ascending order and the second order value is obtained by sorting the second differences of the user data in ascending order, and determining the fitting condition of the dimensional feature in the sub-user dataset based on a second correlation between the first order values and the second order values of the user data in the sub-user dataset;
wherein the fitting condition characterizes whether the corresponding dimensional feature is overfitted; and
optimizing the user mining mechanism based on the fitting condition of each dimensional feature.
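
By way of non-limiting illustration, the two correlation branches of claim 1 can be sketched as follows. The sketch assumes that the first correlation is a Pearson coefficient and that the second correlation is a Pearson coefficient computed on ascending ranks (a Spearman-style rank correlation). The claim names neither coefficient, so both are assumptions, as are the function names `pearson`, `ascending_ranks`, and `difference_correlation`. How the resulting correlation maps to an overfitting decision is not specified here and is left out.

```python
import math
from statistics import fmean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = fmean(xs), fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return cov / math.sqrt(sxx * syy)


def ascending_ranks(values):
    """Order values from an ascending sort (1-based); ties share the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = average_rank
        pos = end + 1
    return ranks


def difference_correlation(first_diffs, second_diffs, is_normal):
    """Correlate first and second differences within one sub-user dataset.

    Assumption: Pearson on raw differences when the feature is normally
    distributed, Pearson on order values (ranks) otherwise.
    """
    if is_normal:
        return pearson(first_diffs, second_diffs)
    return pearson(ascending_ranks(first_diffs), ascending_ranks(second_diffs))
```

For differences that are monotonically but not linearly related, the rank-based branch reports a perfect correlation while the raw-value branch does not, which is the practical reason for switching on the distribution.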

2. The method of claim 1, wherein the user dataset is a test user dataset or a training user dataset, and before storing the acquired user mining mechanism and user dataset in the first memory pool, the method further comprises: determining that the difference between a first prediction result and a second prediction result of the user mining mechanism meets a difference condition, wherein the first prediction result and the second prediction result are obtained based on the test user dataset and the training user dataset, respectively.
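
By way of non-limiting illustration, the difference condition of claim 2 can be sketched as a gap check between a training-set score and a test-set score. The claim does not define the prediction results or the condition. This sketch assumes each prediction result is summarized as a single score (for example an accuracy) and that the condition is "training score exceeds test score by more than a threshold". The function name and the threshold are hypothetical.

```python
def meets_difference_condition(test_score, train_score, threshold):
    """Return True when the train/test gap suggests the analysis should run.

    Assumption: higher scores are better and the condition is a simple
    threshold on (train_score - test_score).
    """
    return (train_score - test_score) > threshold
```

With a hypothetical threshold of 0.1, a mechanism scoring 0.95 on training data and 0.70 on test data would meet the condition, whereas scores of 0.92 and 0.90 would not.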

3. The method of claim 1, wherein constructing the linear relationship between the prediction result of the user data and the marginal contribution values of the M-dimensional features in the user data comprises: constructing the linear relationship based on the prediction result of the user data, the mean of the prediction results of the N user data, and the marginal contribution values of the M-dimensional features in the user data.
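
By way of non-limiting illustration, one linear relationship built from the three quantities named in claim 3 is the additive form "prediction = mean prediction + sum of the M marginal contribution values". The claim does not state the exact form, so this additive form is an assumption, and the function name is hypothetical. The sketch returns the residual of that relationship, which should be close to zero when the contributions are consistent with it.

```python
def linear_relationship_residual(prediction, mean_prediction, contributions):
    """Residual of the assumed additive relationship for one user data.

    Assumed form: prediction = mean_prediction + sum(contributions),
    where contributions holds the M marginal contribution values.
    """
    return prediction - (mean_prediction + sum(contributions))
```

For example, a prediction of 0.8 with a mean prediction of 0.5 is consistent with contributions of 0.2, 0.15, and -0.05, since these sum to 0.3.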

4. The method of claim 1, wherein obtaining the marginal contribution value of any dimensional feature in any user data comprises: for any dimensional feature, inputting incomplete user data into the prediction model to obtain an incomplete prediction result, wherein the incomplete user data contains the M-1 dimensional features other than the dimensional feature; and obtaining the marginal contribution value of the dimensional feature based on the prediction result and the incomplete prediction result.
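
By way of non-limiting illustration, the step of claim 4 can be sketched as follows. Two points are assumptions rather than statements of the claim: the feature is "removed" by replacing it with a baseline value (many models cannot accept a shorter input), and the marginal contribution is taken as the prediction result minus the incomplete prediction result. The function name and the `baseline` parameter are hypothetical.

```python
def marginal_contribution(model, row, feature_index, baseline):
    """Contribution of one feature as full prediction minus incomplete prediction.

    model: any callable mapping a list of M feature values to a number.
    baseline: stand-in value used to blank out the feature (an assumption).
    """
    full_prediction = model(row)
    incomplete_row = list(row)  # copy so the caller's data is not modified
    incomplete_row[feature_index] = baseline
    incomplete_prediction = model(incomplete_row)
    return full_prediction - incomplete_prediction
```

For a toy model `2 * x0 + 3 * x1` evaluated on `[1.0, 2.0]` with a baseline of `0.0`, the contribution of the second feature is `8.0 - 2.0 = 6.0`.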

5. The method of claim 2, wherein before determining the fitting condition of the dimensional feature, the method further comprises: obtaining, by the processing process, the distribution result of the N marginal contribution values of the dimensional feature, and storing the distribution result of the dimensional feature in the third memory pool; and correcting the marginal contribution values of the dimensional feature based on the distribution result of the dimensional feature to obtain corrected marginal contribution values.

6. The method of claim 5, wherein correcting the marginal contribution values of the dimensional feature based on the distribution result of the dimensional feature comprises: if the distribution of the dimensional feature is normal, obtaining the mean and standard deviation of the N marginal contribution values of the dimensional feature; and for any user data, correcting the marginal contribution value of the dimensional feature in the user data according to that marginal contribution value, the mean, and the standard deviation to obtain the corrected marginal contribution value.
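
By way of non-limiting illustration, a correction that uses exactly the three quantities named in claim 6 is standardization, that is, (value - mean) / standard deviation. The claim does not state the formula, so the z-score form is an assumption, as are the function name and the use of the population standard deviation.

```python
from statistics import fmean, pstdev


def correct_normal(contributions):
    """Standardize the N marginal contribution values of one feature.

    Assumption: the correction is a z-score using the population standard
    deviation. A constant feature is mapped to all zeros.
    """
    mean = fmean(contributions)
    std = pstdev(contributions)
    if std == 0:
        return [0.0 for _ in contributions]
    return [(value - mean) / std for value in contributions]
```

After this correction the values of every feature share a mean of 0 and a standard deviation of 1, which makes contributions of different features comparable.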

7. The method of claim 5, wherein correcting the marginal contribution values of the dimensional feature based on the distribution result of the dimensional feature comprises: if the marginal contribution values of the dimensional feature do not follow a normal distribution, obtaining the mean and standard deviation of the N marginal contribution values of the dimensional feature; for any user data, performing logarithmic processing on the marginal contribution value of the dimensional feature in the user data to obtain a logarithmically processed marginal contribution value; and correcting the marginal contribution value of the dimensional feature in the user data based on the logarithmically processed marginal contribution value, the mean, and the standard deviation to obtain the corrected marginal contribution value.
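
By way of non-limiting illustration, the logarithmic correction of claim 7 can be sketched as follows. Several choices are assumptions, not statements of the claim: because marginal contribution values may be zero or negative, the sketch uses a sign-preserving `log1p` transform; it then standardizes using the mean and standard deviation of the log-processed values (the claim does not say whether these statistics are taken before or after the logarithm). The function names are hypothetical.

```python
import math
from statistics import fmean, pstdev


def signed_log(value):
    """Sign-preserving log transform: sign(v) * log(1 + |v|) (an assumption)."""
    return math.copysign(math.log1p(abs(value)), value)


def correct_non_normal(contributions):
    """Log-process, then standardize, the N contribution values of one feature.

    Assumption: mean and standard deviation are computed on the
    log-processed values. A constant feature is mapped to all zeros.
    """
    logged = [signed_log(value) for value in contributions]
    mean = fmean(logged)
    std = pstdev(logged)
    if std == 0:
        return [0.0 for _ in logged]
    return [(value - mean) / std for value in logged]
```

The transform is monotonic, so the ordering of the contribution values is preserved while long tails are compressed.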

8. The method of claim 1, wherein, for any sub-user dataset, the ratio of the number of user data in the sub-user dataset to the number of user data in the user dataset is obtained, and the ratio is used as the fitting weight of the sub-user dataset.
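
By way of non-limiting illustration, the fitting weight of claim 8 is the share of user data that falls into each sub-user dataset. The sketch below assumes each user data carries a type label; the function name and the labels in the example are hypothetical.

```python
from collections import Counter


def fitting_weights(user_types):
    """Map each user data type to (count in sub-user dataset) / (total count)."""
    counts = Counter(user_types)
    total = len(user_types)
    return {user_type: count / total for user_type, count in counts.items()}
```

For four user data labelled `"retail"`, `"retail"`, `"sme"`, `"retail"`, the weights are 0.75 and 0.25, and the weights of all sub-user datasets sum to 1.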

9. A data processing apparatus, characterized in that the apparatus is applicable to an optimization system for a user mining mechanism and comprises:
an acquisition module, configured to store an acquired user mining mechanism and user dataset in a first memory pool, wherein the user dataset contains N user data, each user data contains M-dimensional features, and N and M are integers ≥ 1;
a processing module, configured to retrieve, by a processing process, any user data from the first memory pool, and construct a linear relationship between the prediction result of the user data and the marginal contribution values of the M-dimensional features in the user data, wherein the marginal contribution value characterizes the degree of influence of the corresponding dimensional feature on the prediction result in the user data, the prediction result is obtained by the user mining mechanism from the user data, and the processing process is any one of a set number of processing processes in memory;
the processing module being further configured to obtain, based on the linear relationships of the N user data, the marginal contribution value of any dimensional feature in any user data, and store the obtained N*M marginal contribution values in a second memory pool, wherein the user dataset is divided into multiple sub-user datasets according to user data type, and the marginal contribution values of the corresponding M-dimensional features are stored in the second memory pool by sub-user dataset;
the processing module being further configured to retrieve, by the processing process, the N marginal contribution values of any dimensional feature from the second memory pool, and determine the fitting condition of the dimensional feature, which comprises:
retrieving, by the processing process, the marginal contribution values of the dimensional feature corresponding to any sub-user dataset from the second memory pool;
for any user data in the sub-user dataset, fitting the marginal contribution values of the remaining M-1 dimensional features of the user data (excluding the dimensional feature) to obtain a marginal contribution estimate of the dimensional feature of the user data, and fitting the prediction results of the remaining M-1 dimensional features of the user data to obtain a prediction result estimate of the dimensional feature of the user data;
determining the distribution of the dimensional feature from a third memory pool;
if the distribution of the dimensional feature is normal, determining, for any user data in the sub-user dataset, a first difference between the marginal contribution estimate of the dimensional feature of the user data and the marginal contribution value of the dimensional feature of the user data, and a second difference between the prediction result estimate of the dimensional feature of the user data and the prediction result of the dimensional feature of the user data, and determining the fitting condition of the dimensional feature in the sub-user dataset based on a first correlation between the first differences and the second differences corresponding to the user data in the sub-user dataset;
if the distribution of the dimensional feature is non-normal, determining, for any user data in the sub-user dataset, a first order value of the first difference of the user data and a second order value of the second difference, wherein the first order value is obtained by sorting the first differences of the user data in ascending order and the second order value is obtained by sorting the second differences of the user data in ascending order, and determining the fitting condition of the dimensional feature in the sub-user dataset based on a second correlation between the first order values and the second order values of the user data in the sub-user dataset;
wherein the fitting condition characterizes whether the corresponding dimensional feature is overfitted; and
the processing module being further configured to optimize the user mining mechanism based on the fitting condition of each dimensional feature.

10. A computing device, comprising: a memory, configured to store program instructions; and a processor, configured to invoke the program instructions stored in the memory and execute the method according to any one of claims 1 to 8.

11. A computer-readable non-transitory storage medium, characterized in that it comprises computer-readable instructions which, when read and executed by a computer, cause the computer to perform the method of any one of claims 1 to 8.