Data mining method and device, computer equipment and storage medium

A data mining and computer technology applied in the field of artificial intelligence. It solves problems such as low efficiency and long time consumption, and achieves the effects of improving accuracy, increasing feature diversity, and improving the efficiency of evaluation and analysis.

Pending Publication Date: 2020-12-04
CHINA PING AN LIFE INSURANCE CO LTD

AI-Extracted Technical Summary

Problems solved by technology

[0004] The purpose of the embodiments of the present application is to propose a data mining method, device, computer equipment, and storage medium.

Method used

The data mining method provided by the application obtains the evaluation parameter values of the new features of the updated data based on multiple sample partitions, without needing to input the new features of the samples together with the original model-input features into the target model for lengthy training to confirm the influence of the new features on the target model, so new features that improve the effect of the target model can be selected quickly. On the one hand, this improves the efficiency of evaluating and analyzing new features; on the other hand, it also improves the diversity of the target model's model-input features, which is conducive to improving the accuracy of the target model's output. When mining new features for the agent retention prediction model in the insurance scenario, using the method of the present application to evaluate a mined new feature can shorten the evaluation from 2 hours to 1 minute, greatly improving the efficiency of evaluating new feature information.

Abstract

The invention belongs to the field of artificial intelligence and relates to a data mining method, which comprises the steps of: extracting new features from the update data of a sample set; inputting the original model-input features into a target model to obtain a first model output, dividing the sample set according to the first model output to obtain a plurality of sample partitions, and calculating a first evaluation parameter value of each sample partition; inputting the first model output and the new features into a preset intermediate model to obtain a second model output, and calculating a second evaluation parameter value of each sample partition; and calculating evaluation parameter values of the new features according to the first evaluation parameter values and the second evaluation parameter values, and determining whether to use the new features as model-input features of the target model according to the evaluation parameter values. The invention further provides a data mining device, computer equipment, and a storage medium. The invention also relates to blockchain technology: the feature values of new features determined to be model-input features can be stored in a blockchain. With this method, new features that improve the effect of the target model can be quickly selected, and data mining efficiency is higher.


Examples

  • Experimental program(1)

Example Embodiment

[0028] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the technical field of this application. The terms used in the specification of the application are for the purpose of describing specific embodiments only and are not intended to limit the application. The terms "comprising" and "having" and any variations thereof in the description, claims, and drawings of this application are intended to cover non-exclusive inclusion. The terms "first", "second", and the like in the description, claims, and drawings of the present application are used to distinguish different objects rather than to describe a specific order.
[0029] Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
[0030] In order to make those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings.
[0031] As shown in Figure 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is a medium used to provide communication links between the terminal devices 101, 102, and 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables.
[0032] A user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients, and social platform software.
[0033] The terminal devices 101, 102, and 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, and desktop computers.
[0034] The server 105 may be a server that provides various services, such as a background server that provides support for the pages displayed on the terminal devices 101, 102, and 103.
[0035] It should be noted that the data mining method provided by the embodiments of the present application is generally executed by a server, and accordingly the data mining apparatus is generally disposed in the server.
[0036] It should be understood that the numbers of terminal devices, networks, and servers in Figure 1 are merely illustrative; there can be any number of terminal devices, networks, and servers according to implementation needs.
[0037] Continuing to refer to Figure 2, which shows a flowchart of one embodiment of the data mining method according to the present application, the data mining method includes the following steps:
[0038] S201: receive a data mining instruction, read the updated data of each sample in the sample set from the database according to the data mining instruction, and extract new features from the updated data;
[0039] S202: input the original model-input features of each sample in the sample set into the target model to obtain a first model output for each sample, divide the samples in the sample set according to the first model output to obtain a plurality of sample partitions, and calculate a first evaluation parameter value for each sample partition based on the first model output;
[0040] S203: input the first model output and the new feature of each sample into a preset intermediate model to obtain a second model output for each sample, and calculate a second evaluation parameter value for each sample partition based on the second model output;
[0041] S204: calculate the evaluation parameter value of the new feature according to the first evaluation parameter value and the second evaluation parameter value of each sample partition, so as to determine, according to the evaluation parameter value, whether to use the new feature as a model-input feature of the target model.
[0042] The above steps are described below.
[0043] For step S201, the data mining instruction may be initiated by a user of the client through a given interface, initiated automatically in the background at a preset time interval, or initiated automatically when updated data is detected in the samples of the database. In actual application scenarios, new business scenarios may arise or new stable data sources may be accessed; at this point data mining is required, and processing the resulting data forms new features. Specifically, new data mainly comes from the addition of new business processes or new online data acquisition channels. As an example of a new business process, after an APP (Application) is revised, new tracking points may be added (such as a tracking point for the "Insurance" click module in an insurance APP), and the corresponding data grows as the APP is revised. New data acquisition channels may include data sources generated by cross-service business, or sources originally missed for some reason (confidentiality, etc.); for example, when the current APP accesses the data of another APP, new data is generated accordingly. These new data can be stored in a designated database to be read during data mining.
[0044] Extracting new features refers to obtaining structured data at sample granularity by processing the new data. For example, if the new data is a log-type click log, text recognition is performed on the click log, feature fields are extracted from the text recognition results, and new statistical features are generated, such as the number of clicks on the "Insurance" module in the insurance APP within the most recent month, or whether the "Insurance" module was clicked within the past 7 days. The text recognition can use existing related technologies, which will not be expanded here.
[0045] In this embodiment, in the process of extracting new features, new data may be cleaned in advance, and the cleaning of new data includes invalid data screening, sensitive data removal, and the like.
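As a concrete illustration of the extraction and cleaning described above, the following is a minimal sketch in Python. It assumes a hypothetical click-log schema (user_id, module, ts columns); the patent does not specify one.

```python
import pandas as pd

def extract_new_features(click_log: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """A minimal sketch of turning a raw click log into sample-level
    statistical features. Column names (user_id, module, ts) are
    hypothetical -- the patent does not specify a schema."""
    # Basic cleaning: drop rows with missing keys (invalid data screening).
    log = click_log.dropna(subset=["user_id", "module", "ts"])

    recent_month = log[log["ts"] >= as_of - pd.Timedelta(days=30)]
    recent_week = log[log["ts"] >= as_of - pd.Timedelta(days=7)]

    insurance = recent_month[recent_month["module"] == "insurance"]
    # e.g. number of clicks on the "Insurance" module in the last month
    clicks_1m = insurance.groupby("user_id").size().rename("insurance_clicks_1m")
    # e.g. whether the "Insurance" module was clicked in the last 7 days
    clicked_7d = (
        recent_week[recent_week["module"] == "insurance"]
        .groupby("user_id").size().gt(0).rename("insurance_clicked_7d")
    )
    return pd.concat([clicks_1m, clicked_7d], axis=1).fillna(0)
```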
[0046] In some embodiments, the step of extracting new features from the updated data includes: extracting a plurality of new features from the updated data to form a feature set. When multiple new features are extracted, pre-screening can be performed on the extracted new features.
[0047] Specifically, before the step of inputting the original model-input features of each sample in the sample set into the target model, the method further includes: sequentially judging whether each new feature in the feature set already belongs to the model-input feature set of the target model; if so, removing it from the feature set, otherwise retaining it, so as to obtain a feature subset that does not belong to the model-input features of the target model. The step of obtaining the evaluation parameter value (that is, steps S203 and S204) is then performed in turn for each new feature in the feature subset, so as to obtain one or more model-input features of the target model, thereby improving the diversity of the model-input features of the target model.
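A minimal sketch of this pre-screening, assuming the candidate features and the target model's model-input feature set are available as plain name collections (the names below are hypothetical):

```python
def prescreen(candidate_features: list[str], model_input_features: set[str]) -> list[str]:
    """A minimal sketch of the pre-screening described above: drop any
    candidate that already belongs to the target model's model-input
    feature set, keeping only genuinely new features."""
    return [f for f in candidate_features if f not in model_input_features]

# Hypothetical usage:
subset = prescreen(["insurance_clicks_1m", "age"], {"age", "tenure_months"})
# -> ["insurance_clicks_1m"]
```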
[0048] For step S202, in this embodiment, each sample in the sample set has a plurality of original model-input features, and these original model-input features can form the model-input feature set of the target model.
[0049] In some embodiments, the target model may be a user retention prediction model. Correspondingly, the first model output is a prediction score obtained by inputting the original model-input feature set of each sample into the user retention prediction model, and the first evaluation parameter value is the AUC (Area Under Curve) value obtained from the prediction scores. Here, the AUC is obtained from the ROC (Receiver Operating Characteristic) curve built from the prediction score of each sample; specifically, it is the area enclosed between the ROC curve and the coordinate axes.
[0050] Further, dividing the samples in the sample set according to the first model output to obtain a plurality of sample partitions means: sorting the samples by the prediction scores output by the user retention prediction model, and dividing the samples in the sample set into multiple sample partitions according to the sorting result. At this point, the AUC value of each partition can be calculated. Specifically, after the sorting result is obtained, a ROC curve can be generated by combining the prediction score of each sample with its target variable (the prediction score of each sample is the predicted value of the target variable), and the ROC curve is partitioned, each partition corresponding to a part of the samples in the sample set, that is, multiple sample partitions are obtained. The corresponding AUC value can then be calculated from the prediction scores and target variables of the samples in each sample partition, that is, the first evaluation parameter value of each sample partition is obtained.
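The following sketch illustrates this partition-then-score idea using scikit-learn's roc_auc_score. The contiguous equal-size split is an assumption; the patent only requires that the partitions follow the score ordering.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def partition_auc(scores: np.ndarray, labels: np.ndarray, n_partitions: int = 3):
    """A minimal sketch of step S202: sort samples by prediction score,
    split them into contiguous partitions, and compute the AUC of each
    partition from its scores and target variables."""
    order = np.argsort(-scores)                  # highest score first
    parts = np.array_split(order, n_partitions)  # contiguous partitions
    aucs = []
    for idx in parts:
        # AUC is undefined if a partition contains a single class.
        if len(np.unique(labels[idx])) < 2:
            aucs.append(float("nan"))
        else:
            aucs.append(roc_auc_score(labels[idx], scores[idx]))
    return parts, aucs
```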
[0051] In this embodiment, the user retention prediction model may specifically adopt a multi-model fusion scheme of LightGBM + Xgboost + pruning strategy + DNN, and the output can be presented as a vector of sample prediction scores. Specifically, the original model-input feature set is input into the LightGBM/Xgboost model for training to generate a combined feature set (leaf nodes); pruning rules (such as the distribution ratio of monthly samples over the leaf nodes and the stability of the positive-sample proportion) are used to prune the trees composed of the leaf nodes, removing combined features that are unstable across time; the combined features obtained after pruning are then concatenated and input into a DNN model to obtain the output presented as a vector of sample prediction scores. The LightGBM/Xgboost model training, the pruning rules, and the DNN model can be implemented using existing technical solutions, which are not expanded here.
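For orientation only, here is a heavily simplified sketch of the LightGBM-leaves-into-DNN part of such a fusion scheme; the Xgboost branch and the cross-time pruning rules are omitted, and all hyperparameters are illustrative assumptions rather than values from the patent.

```python
import lightgbm as lgb
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import OneHotEncoder

def fused_retention_model(X_train, y_train, X_score):
    """A minimal sketch of the LightGBM -> combined features -> DNN idea.
    The tree model's leaf indices serve as combined (crossed) features,
    which a small neural network then scores."""
    booster = lgb.train(
        {"objective": "binary", "num_leaves": 31, "verbose": -1},
        lgb.Dataset(X_train, label=y_train),
        num_boost_round=50,
    )
    # Leaf indices act as the combined feature set (one column per tree).
    leaves_train = booster.predict(X_train, pred_leaf=True)
    leaves_score = booster.predict(X_score, pred_leaf=True)

    enc = OneHotEncoder(handle_unknown="ignore")
    dnn = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=200)
    dnn.fit(enc.fit_transform(leaves_train), y_train)
    # Vector of sample prediction scores (the "first model output").
    return dnn.predict_proba(enc.transform(leaves_score))[:, 1]
```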
[0052] In some embodiments, when the samples in the sample set are divided according to the first model output, the data mining method includes: determining the number of sample partitions according to the sample capacity of the sample set and the classification of the samples in the sample set.
[0053] Specifically, when the samples are divided into several sample partitions according to the score-sorting results, the basis of the division includes the sample capacity and business application requirements, where the business application requirements include classifying the samples according to business application scenarios. For example, in the user retention prediction model scenario, the samples in the sample set are the designated population to be predicted. Suppose the number of samples in the sample set is about 100,000; to ensure that each sample partition has a sufficient sample size, the number of partitions should not exceed 10. In practical applications, the samples are divided into multiple levels (such as 7 levels); to enable new features to achieve cross-level sorting optimization, it is preferable to divide into a certain number of sample partitions, such as 3 to 5. The final number of sample partitions can be determined by considering both the sample capacity and the sample classification; in the previous example, the samples are finally divided into 3 to 10 sample partitions.
[0054] For step S203, in this embodiment, on the premise that the target model is the user retention prediction model of step S202 and the first evaluation parameter value is the AUC value, the intermediate model is a scoring model generated from two dimensions: the new feature of each sample and the first model output of the target model. Correspondingly, the second model output is the new score obtained by inputting each sample, with the new feature added, into the scoring model, and the second evaluation parameter value is the AUC value of each sample partition after the new feature is added. That is, the new feature and the first score of each sample (the score output by the target model) are input into the scoring model to obtain a new score for each sample, and the new AUC value of each sample partition is obtained based on the new scores.
[0055] In some embodiments, the scoring model adopts a LightGBM tree model with a preset number of trees. Specifically, the scoring model may use a LightGBM tree model with about 10 trees; with this number of trees, accurate scoring results can be obtained without occupying too many system processing resources. In other embodiments, the scoring model may instead be a logistic regression model or another model determined according to actual requirements.
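A minimal sketch of such an intermediate scoring model, assuming binary retention labels; for brevity it scores the same samples it was trained on, whereas in practice a held-out split would be the safer choice:

```python
import lightgbm as lgb
import numpy as np

def second_model_output(first_scores, new_feature, labels):
    """A minimal sketch of the intermediate model in step S203: a small
    LightGBM model (about 10 trees, as suggested above) trained on just
    two inputs -- the target model's score and the candidate feature."""
    X = np.column_stack([first_scores, new_feature])
    booster = lgb.train(
        {"objective": "binary", "verbose": -1},
        lgb.Dataset(X, label=labels),
        num_boost_round=10,  # ~10 trees
    )
    return booster.predict(X)  # new scores (the "second model output")
```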
[0056] In other embodiments, when the target model is another classification model, the first evaluation parameter value and the second evaluation parameter value in steps S202 and S203 may also be the values of other evaluation metrics of classification models, such as accuracy, coverage, cross entropy, logloss, etc., which are not expanded here.
[0057] For step S204, in this embodiment, determining whether to use the new feature as a model-input feature of the target model according to the evaluation parameter value is specifically: judging, according to the evaluation parameter value, whether the new feature has a positive influence on the target model, and when it is determined that it does, using the new feature as a model-input feature of the target model. Here, a positive influence means that the new feature can improve the effect of the target model, which is ultimately reflected in improved prediction accuracy of the model.
[0058] The sample partitions in steps S202 and S203 are the same. In some embodiments, calculating the evaluation parameter value of the new feature according to the first evaluation parameter value and the second evaluation parameter value of each sample partition includes: obtaining a sub-evaluation parameter value of the new feature from the first evaluation parameter value and the second evaluation parameter value of the same sample partition, obtaining multiple sub-evaluation parameter values of the new feature from the multiple sample partitions, and obtaining the evaluation parameter value from the multiple sub-evaluation parameter values. By calculating multiple sub-evaluation parameter values, the overall sample can be evaluated and the impact of the new feature on each sample partition can also be accurately evaluated.
[0059] Specifically, the evaluation parameter value of the new feature can be obtained by a weighted sum of the multiple sub-evaluation parameter values. The weighting coefficient of each sample partition is determined according to its importance to the business: different weights can be set according to how much accurate sorting in each sample partition contributes to the business. For example, in the retention prediction model scenario, more attention is paid to the accuracy of the head and tail groups (resources are tilted toward the head group, and the tail group is eliminated): the head group is considered excellent and can obtain more resource support (such as commissions), while the tail group is considered poor and elimination measures are applied; the middle groups show no major differences in treatment. If the accuracy of the head and tail partitions is low, better resources will be given to poor performers, or people who should be retained will be eliminated, and the desired retention level will not be reached. Therefore, in the retention prediction model scenario, the weighting coefficients of the sample partitions of the head and tail populations are set higher than those of the other sample partitions.
[0060] In this embodiment, on the premise that the first evaluation parameter value and the second evaluation parameter value are AUC values, the sub-evaluation parameter values of the new feature are calculated from the AUC values obtained before and after for each sample partition. A possible calculation process is illustrated here taking three sample partitions as an example: sample partition A, sample partition B, and sample partition C. The sub-evaluation parameter values of the new feature are obtained by the following formulas:
[0061] P_A = AUC_1A − AUC_0A;
[0062] P_B = AUC_1B − AUC_0B;
[0063] P_C = AUC_1C − AUC_0C;
[0064] where P_A, P_B, and P_C are the sub-evaluation parameter values of the new feature on the three sample partitions, AUC_0A, AUC_0B, and AUC_0C are the AUC values obtained the first time for the three sample partitions, and AUC_1A, AUC_1B, and AUC_1C are the AUC values obtained the second time, after the new feature is added. The evaluation parameter value P of the new feature can then be obtained by the following formula:
[0065] P = a·P_A + b·P_B + c·P_C;
[0066] where a, b, and c are the weighting coefficients of the sample partitions. If the evaluation parameter value P is greater than 0, the new feature brings a positive gain when input into the model, that is, it is determined that the new feature has a positive influence on the effect of the target model, and it can be directly input into the target model as a model-input feature.
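The computation above is a simple weighted difference of AUC values; a minimal sketch follows. The weights and AUC numbers in the usage example are illustrative assumptions only.

```python
import numpy as np

def new_feature_gain(auc_before, auc_after, weights):
    """A minimal sketch of step S204: per-partition sub-evaluation values
    P_i = AUC_1i - AUC_0i, combined into P by a business-chosen weighted
    sum. The patent suggests weighting head and tail partitions more
    heavily; the weights below are illustrative."""
    sub_values = np.asarray(auc_after) - np.asarray(auc_before)  # P_A, P_B, P_C
    P = float(np.dot(weights, sub_values))
    return sub_values, P

# Hypothetical usage with three partitions (head, middle, tail):
subs, P = new_feature_gain([0.71, 0.68, 0.70], [0.74, 0.67, 0.73],
                           weights=[0.4, 0.2, 0.4])
accept_as_model_input = P > 0  # positive gain -> use as model-input feature
```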
[0067] In some embodiments, the data mining method further includes:
[0068] if it is determined according to the evaluation parameter value that the new feature has no positive influence on the target model, determining, according to the sub-evaluation parameter values of the new feature, whether the new feature has a positive influence on any of the sample partitions; if it does, the new feature is used as a selected feature for post-processing of the target model results.
[0069] In this embodiment, using the new feature as a selected feature for post-processing of the target model results means: after the target model outputs its results, when the model output is post-processed, a new feature that has a positive influence on some sample partitions can be used as a post-processing selected feature and combined with preset rules to fine-tune the model output. In this way, the accuracy of the target model's predictions can be improved.
[0070] Taking the user retention prediction model as an example, when the first evaluation parameter value and the second evaluation parameter value are AUC values, P_A, P_B, P_C, and P are obtained from the above formulas. If P is less than 0 but one of P_A, P_B, or P_C is greater than 0 and verifies as stable across time, the new feature has a negative benefit for the overall sample ranking, that is, it is determined that the new feature does not have a positive influence on the effect of the user retention prediction model, but its benefit in a certain sample partition is consistently positive. Although such a new feature cannot be used as a model-input feature of the user retention prediction model, it can be used as a post-processing selected feature of the model, that is, the scores output by the user retention prediction model are fine-tuned through rules. For example, if the AUC improvement of a new feature over all partitions is negative but its AUC improvement for the tail population partition is obvious, it can be used in a post-processing rule. Post-processing here means that after the model outputs the scores, the score ordering is adjusted by rules. In the user retention prediction scenario, if the samples with new feature X equal to 1 are retained significantly better than samples with X equal to 0, all samples with X equal to 1 within the partition are placed before the samples with X equal to 0; if the AUC of the partition under this ordering is higher than the AUC under the original ordering, then the rule "within this partition, place all samples with new feature X equal to 1 before the samples with X equal to 0" is an effective post-processing rule.
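A minimal sketch of validating one such reordering rule within a single partition; the rule form ("lift samples with binary feature X = 1 above X = 0") follows the example above, while the implementation details are assumptions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def rule_is_effective(scores, labels, x):
    """A minimal sketch of validating the post-processing rule above:
    within a partition, move samples with new feature X == 1 ahead of
    samples with X == 0, and keep the rule only if the partition AUC
    improves. `x` is the binary new-feature value of each sample."""
    base_auc = roc_auc_score(labels, scores)
    # Lift X == 1 samples above every X == 0 sample while preserving the
    # original score order within each group.
    adjusted = scores + np.where(x == 1, scores.max() + 1.0, 0.0)
    return roc_auc_score(labels, adjusted) > base_auc
```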
[0071] The data mining method provided by the present application obtains the evaluation parameter values of the new features of the updated data based on multiple sample partitions, without needing to input the new features of the samples together with the original model-input features into the target model for lengthy training to confirm the influence of the new features on the target model, so new features that improve the effect of the target model can be selected quickly. On the one hand, this improves the efficiency of evaluating and analyzing new features; on the other hand, it also improves the diversity of the target model's model-input features, which is conducive to improving the accuracy of the target model's output. When mining new features for the agent retention prediction model in the insurance scenario, using the method of the present application to evaluate a mined new feature can shorten the evaluation from 2 hours to 1 minute, greatly improving the efficiency of evaluating new feature information.
[0072] It should be emphasized that, in order to further ensure the privacy and security of information, after a new feature is determined to be a model-input feature, its feature values can also be stored in a node of a blockchain.
[0073] The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a series of data blocks associated with one another using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer, and an application service layer.
[0074] The present application may be used in numerous general-purpose or special-purpose computer system environments or configurations, for example: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
[0075] Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through computer-readable instructions, which can be stored in a computer-readable storage medium; when the program is executed, it may include the processes of the foregoing method embodiments. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a read-only memory (ROM), or a random access memory (RAM).
[0076] It should be understood that although the steps in the flowchart of the accompanying drawings are shown sequentially in the order indicated by the arrows, these steps are not necessarily executed in that order. Unless explicitly stated herein, the execution of these steps is not strictly limited in order, and they may be performed in other orders. Moreover, at least a part of the steps in the flowchart may include multiple sub-steps or stages, which are not necessarily executed at the same moment but may be executed at different times, and their execution order is not necessarily sequential; they may be performed in turn or alternately with other steps or with at least a part of the sub-steps or stages of other steps.
[0077] With further reference to Figure 3, as an implementation of the method shown in Figure 2 above, the present application provides an embodiment of a data mining apparatus. This apparatus embodiment corresponds to the method embodiment shown in Figure 2, and the apparatus can be applied to various electronic devices.
[0078] As shown in Figure 3, the data mining apparatus in this embodiment includes: a feature extraction module 301, a first calculation module 302, a second calculation module 303, and a feature determination module 304.
[0079] The feature extraction module 301 is used to receive a data mining instruction, read the updated data of each sample in the sample set from the database according to the instruction, and extract new features from the updated data. The first calculation module 302 is used to input the original model-input features of each sample in the sample set into the target model to obtain the first model output of each sample, divide the samples in the sample set according to the first model output to obtain a plurality of sample partitions, and calculate the first evaluation parameter value of each sample partition based on the first model output. The second calculation module 303 is used to input the first model output and the new feature of each sample into a preset intermediate model, obtain the second model output of each sample, and calculate the second evaluation parameter value of each sample partition based on the second model output. The feature determination module 304 is configured to calculate the evaluation parameter value of the new feature according to the first evaluation parameter value and the second evaluation parameter value, so as to determine whether to use the new feature as a model-input feature of the target model according to the evaluation parameter value.
[0080] In this embodiment, the new features mined by the feature extraction module 301 after receiving the data mining instruction are derived from new business scenarios or new stable data sources, where extracting new features refers to obtaining structured data at sample granularity by processing the new data. For details about the data mining instruction, the generation of new data, and the extraction of new features, reference may be made to the foregoing method embodiments, which will not be expanded here.
[0081] In this embodiment, in the process of extracting new features, the feature extraction module 301 may clean new data in advance, and the cleaning of new data includes filtering invalid data, clearing sensitive data, and the like.
[0082] In some embodiments, when extracting new features from the updated data, the feature extraction module 301 is specifically configured to extract multiple new features from the updated data to form a feature set. When multiple new features are extracted, the feature extraction module 301 can pre-screen them. Specifically, before the first calculation module 302 inputs the original model-input features of each sample in the sample set into the target model, the feature extraction module 301 is also used to sequentially determine whether each new feature in the feature set already belongs to the model-input feature set of the target model; if so, it is removed from the feature set, otherwise it is retained, so as to obtain a feature subset that does not belong to the model-input features of the target model. The first calculation module 302, the second calculation module 303, and the feature determination module 304 then perform the related operations in turn for each new feature in the feature subset to obtain one or more model-input features of the target model, thereby improving the diversity of the model-input features of the target model.
[0083] In this embodiment, each sample in the sample set has multiple original model-input features, and these original model-input features can form the model-input feature set of the target model.
[0084] In some embodiments, the target model used by the first calculation module 302 may be a user retention prediction model. Correspondingly, the first model output is the prediction score obtained by inputting the original model-input feature set of each sample into the user retention prediction model, and the first evaluation parameter value is the AUC value obtained from the prediction scores. Further, when dividing the samples in the sample set according to the first model output to obtain a plurality of sample partitions, the first calculation module 302 is specifically configured to sort the samples by the prediction scores output by the user retention prediction model and divide the samples in the sample set according to the sorting result to obtain multiple sample partitions, at which point the AUC value of each partition can be calculated. For details, reference may be made to the relevant content of the above method embodiments, which will not be expanded here.
[0085] In this embodiment, the user retention prediction model may specifically adopt a multi-model fusion scheme of LightGBM+Xgboost+pruning strategy+DNN. The output result may be presented in the form of a sample prediction score vector. For details, please refer to the relevant content of the above method embodiment, which will not be expanded here.
[0086] In some embodiments, when dividing the samples in the sample set according to the first model output, the first calculation module 302 is also used to determine the number of sample partitions according to the sample capacity of the sample set and the classification of the samples in the sample set. For details, reference may be made to the relevant content of the above method embodiments, which will not be expanded here.
[0087] In this embodiment, on the premise that the target model is the user retention prediction model and the first evaluation parameter value is the AUC value, the intermediate model used by the second calculation module 303 is a scoring model generated from two dimensions: the new feature of each sample and the first model output of the target model. Correspondingly, the second model output is the new score obtained by inputting each sample, with the new feature added, into the scoring model, and the second evaluation parameter value is the AUC value of each sample partition after the new feature is added; that is, the new feature and the first score of each sample (the score output by the target model) are input into the scoring model to obtain a new score for each sample, and the new AUC value of each sample partition is obtained based on the new scores. In some embodiments, the scoring model adopts a LightGBM tree model with a preset number of trees; specifically, it may use a LightGBM tree model with about 10 trees, with which accurate scoring results can be obtained without occupying too many system processing resources. In other embodiments, the scoring model may instead be a logistic regression model or another model determined according to actual requirements.
[0088] In other embodiments, when the target model is another classification model, the first evaluation parameter value and the second evaluation parameter value obtained by the first calculation module 302 and the second calculation module 303 may also be the values of other evaluation metrics of classification models, such as accuracy, coverage, cross entropy, logloss, etc., which will not be expanded here.
[0089] In this embodiment, when determining whether to use the new feature as a model-input feature of the target model according to the evaluation parameter value, the feature determination module 304 is specifically configured to judge, according to the evaluation parameter value, whether the new feature has a positive influence on the target model, and when it is determined that it does, to use the new feature as a model-input feature of the target model. Here, a positive influence means that the new feature can improve the effect of the target model, ultimately reflected in improved prediction accuracy of the model.
[0090] In this embodiment, the sample partitions used by the first calculation module 302 and the second calculation module 303 are the same. In some embodiments, when calculating the evaluation parameter value of the new feature according to the first evaluation parameter value and the second evaluation parameter value of each sample partition, the feature determination module 304 is specifically used to obtain a sub-evaluation parameter value of the new feature from the first evaluation parameter value and the second evaluation parameter value of the same sample partition, obtain multiple sub-evaluation parameter values of the new feature from the multiple sample partitions, and obtain the evaluation parameter value from the multiple sub-evaluation parameter values. By calculating multiple sub-evaluation parameter values, the overall sample can be evaluated and the impact of the new feature on each sample partition can also be accurately evaluated.
[0091] Specifically, the evaluation parameter value of the new feature can be obtained by a weighted sum of the multiple sub-evaluation parameter values. The weighting coefficient of each sample partition is determined according to its importance to the business: different weights can be set according to how much accurate sorting in each sample partition contributes to the business. For example, in the retention prediction model scenario, more attention is paid to the accuracy of the head and tail groups (resources are tilted toward the head group, and the tail group is eliminated): the head group is considered excellent and can obtain more resource support (such as commissions), while the tail group is considered poor and elimination measures are applied; the middle groups show no major differences in treatment. If the accuracy of the head and tail partitions is low, better resources will be given to poor performers, or people who should be retained will be eliminated, and the desired retention level will not be reached. Therefore, in the retention prediction model scenario, the weighting coefficients of the sample partitions of the head and tail populations are set higher than those of the other sample partitions.
[0092] In this embodiment, on the premise that the first evaluation parameter value and the second evaluation parameter value are AUC values, the sub-evaluation parameter values of the new feature are calculated from the AUC values obtained before and after for each sample partition. For the calculation process, reference may be made to the relevant content of the above method embodiments, which will not be expanded here.
[0093] In some embodiments, the feature determination module 304 is further configured to, when it is determined according to the evaluation parameter value that the new feature has no positive influence on the target model, determine, according to the sub-evaluation parameter values of the new feature, whether the new feature has a positive influence on any of the sample partitions, and if so, to use the new feature as a selected feature for post-processing of the target model results. In this embodiment, using the new feature as a selected feature for post-processing of the target model results means: after the target model outputs its results, when the model output is post-processed, a new feature that has a positive influence on some sample partitions can be used as a post-processing selected feature and combined with preset rules to fine-tune the model output; in this way, the accuracy of the target model's predictions can be improved. For specific examples, reference may be made to the above method embodiments, which will not be repeated here.
[0094] The data mining device provided by the present application obtains the evaluation parameter values of the new features of the updated data based on multiple sample partitions, without needing to input the new features of the samples together with the original model-input features into the target model for lengthy training to confirm the influence of the new features on the target model, so new features that improve the effect of the target model can be selected quickly. On the one hand, this improves the efficiency of evaluating and analyzing new features; on the other hand, it also improves the diversity of the target model's model-input features, which is conducive to improving the accuracy of the target model's output.
[0095] To solve the above technical problems, the embodiments of the present application also provide computer equipment. For details, please refer to Figure 4, which is a block diagram of the basic structure of the computer equipment in this embodiment. The computer device 4 includes a memory 41, a processor 42, and a network interface 43 that communicate with each other through a system bus. The memory 41 stores computer-readable instructions, and when executing them the processor 42 implements the steps of the data mining method described in the above method embodiments, with beneficial effects corresponding to the above data mining method, which are not expanded here.
[0096] It should be pointed out that the figure only shows the computer device 4 with the memory 41, the processor 42, and the network interface 43, but it should be understood that not all of the shown components are required, and more or fewer components may be implemented instead. Those skilled in the art can understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), programmable gate arrays (Field-Programmable Gate Array, FPGA), digital signal processors (Digital Signal Processor, DSP), embedded devices, etc.
[0097] The computer equipment may be a desktop computer, a notebook computer, a palmtop computer, a cloud server and other computing equipment. The computer device can perform human-computer interaction with the user through a keyboard, a mouse, a remote control, a touch pad or a voice control device.
[0098] In this embodiment, the memory 41 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical disks, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card). Of course, the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as the computer-readable instructions corresponding to the above data mining method. In addition, the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
[0099] The processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments. This processor 42 is typically used to control the overall operation of the computer device 4 . In this embodiment, the processor 42 is configured to execute computer-readable instructions stored in the memory 41 or process data, for example, execute computer-readable instructions corresponding to the data mining method.
[0100] The network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
[0101] The present application also provides another embodiment, that is, to provide a computer-readable storage medium, where the computer-readable storage medium stores computer-readable instructions, and the computer-readable instructions can be executed by at least one processor to The at least one processor is caused to execute the steps of the above-mentioned data mining method, and has beneficial effects corresponding to the above-mentioned data mining method, which will not be expanded here.
[0102] From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus a necessary general hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, or CD-ROM), including several instructions to make a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of this application.
[0103] Obviously, the above-described embodiments are only a part of the embodiments of the present application, rather than all of them. The accompanying drawings show preferred embodiments of the present application but do not limit its patent scope. This application may be embodied in many different forms; these embodiments are provided so that the disclosure of this application is thorough and complete. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments or make equivalent replacements for some of their technical features. Any equivalent structure made using the contents of the description and drawings of the present application, and directly or indirectly applied in other related technical fields, likewise falls within the scope of patent protection of the present application.