Customer churn prediction method based on LightGBM and logistic regression

By combining LightGBM and logistic regression to perform differentiated modeling for customer subsets, the problem of insufficient prediction by a single model in heterogeneous customer groups is solved, and high-precision and reliable customer upgrade and replacement prediction is achieved.

CN122243545A · Pending · Publication Date: 2026-06-19 · DONGFENG NISSAN DATA SERVICE CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
DONGFENG NISSAN DATA SERVICE CO LTD
Filing Date
2026-03-03
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, single machine learning models struggle to balance nonlinear feature interactions and probabilistic calibration accuracy, resulting in insufficient predictive capabilities for heterogeneous customer groups. In particular, the prediction accuracy and recall rates for niche customer groups decline, limiting overall generalization performance.

Method used

A combined prediction method of LightGBM and logistic regression is adopted. Customer feature data is divided into three mutually exclusive customer subsets based on private domain browsing behavior identifiers and external tag data. A model is built independently for each subset. The leaf node numbers output by the LightGBM model are used for one-hot encoding to generate high-order nonlinear feature vectors, which are then combined with logistic regression for prediction.

Benefits of technology

It improves the prediction accuracy and recall capability for niche customer groups, enhances the predictive system's adaptability and technical robustness to heterogeneous customer groups, and ensures the reliability and stability of prediction results.



Abstract

This application relates to the field of customer behavior prediction technology and discloses a method for predicting customer upgrade / replacement purchases based on LightGBM and logistic regression. The method includes: acquiring multi-source customer data and labeling positive and negative samples; performing data preprocessing and feature derivation to generate customer feature data; dividing the customer feature data into three mutually exclusive customer subsets based on whether the customer has private domain browsing behavior and whether they use external tag data; for each subset, training a combined prediction model consisting of a LightGBM model and a logistic regression model, wherein the leaf node numbers output by the LightGBM model are one-hot encoded and used as input features of the logistic regression model; during the prediction phase, the customer to be predicted is assigned to the corresponding subset based on their attributes, and the corresponding combined model is called to output the upgrade / replacement purchase probability value. This application improves the overall accuracy and robustness of predicting existing customer intentions through customer segmentation and customized modeling.

Description

Technical Field

[0001] This application relates to the field of customer behavior prediction technology, specifically a customer purchase and upgrade prediction method based on LightGBM and logistic regression.

Background Technology

[0002] In customer relationship management (CRM) practices, automakers generally rely on machine learning models to predict existing customers' intentions to upgrade or replace their vehicles in order to optimize the allocation of marketing resources. Current mainstream solutions employ a single predictive model (such as logistic regression, XGBoost, or LightGBM) to uniformly model all customer data. However, customer groups exhibit significant behavioral heterogeneity: some customers actively browse and click on brand-owned private domain platforms (such as official apps and mini-programs), while others can only be characterized by external tag data. Furthermore, the correlation between behavioral patterns and upgrade / replacement decisions differs fundamentally between groups. When a single model is forced to fit the entire heterogeneous dataset, its parameter optimization process requires trade-offs between data subsets with different distributions. This significantly weakens the model's predictive ability for niche customer groups (such as customers without private domain behavior but rich in external tag features), limiting overall generalization performance. At the same time, a single algorithm architecture struggles to balance the deep mining of nonlinear feature interactions with the probabilistic calibration accuracy of prediction results, further restricting the model's technical reliability in complex business scenarios.

Summary of the Invention

[0003] Therefore, it is necessary to provide a customer purchase and replacement prediction method based on LightGBM and logistic regression that can adapt to the inherent heterogeneity of customer groups, improve the prediction accuracy of each subgroup, and has good technical robustness to address the above-mentioned technical problems.

[0004] This application provides a customer purchase / upgrade prediction method based on LightGBM and logistic regression, including the following steps: Acquire multi-source data of existing customers, and label the multi-source data as positive or negative samples based on whether the existing customers have engaged in upgrade or replacement purchase behavior within a preset observation period; Perform missing value imputation, outlier removal, and feature derivation on the labeled multi-source data to generate customer feature data; Based on customer private domain browsing behavior identifiers and external tag data usage identifiers, the customer feature data is divided into three mutually exclusive customer subsets; For each customer subset, a LightGBM model is trained, and the leaf node number of the sample output by the LightGBM model is one-hot encoded to generate a new feature vector. The new feature vector is used as input to train a logistic regression model to obtain the combined prediction model corresponding to the customer subset. Obtain the original feature data of the customer to be predicted, perform the same data preprocessing and feature derivation processing as in the training phase to generate the feature vector to be predicted, determine the customer subset to which the customer belongs based on the private domain browsing behavior identifier and external tag data of the customer to be predicted, input the feature vector to be predicted into the combined prediction model corresponding to the customer subset, and output the probability value of adding or replacing purchases in the interval of 0 to 1 by the logistic regression model.

[0005] In one embodiment, the multi-source data includes customer personal information, maintenance data, insurance purchase and claim data, APP and mini-program behavior data, e-commerce consumption data, traffic restriction and purchase restriction data, complaint data, community posting data, time series characteristics, lead generation data, store visit data, test drive data, and third-party external tag data.

[0006] In one embodiment, the feature derivation processing includes feature processing and feature derivation, and the derived features include time window features, behavior aggregation features, statistical features, and discrete features.

[0007] In one embodiment, the three mutually exclusive customer subsets include: a first customer subset containing customer samples whose private domain browsing behavior is identified as true and whose external tag data usage is identified as true; a second customer subset containing customer samples whose private domain browsing behavior is identified as true and whose external tag data usage is identified as false; and a third customer subset containing customer samples whose private domain browsing behavior is identified as false.

[0008] In one embodiment, after dividing the three mutually exclusive customer subsets, feature filtering is performed for each customer subset based on feature IV value and feature correlation.

[0009] In one embodiment, the private domain browsing behavior identifier is generated based on the customer's private domain browsing click behavior in the brand's official application or mini-program.

[0010] In one embodiment, for each of the customer subsets, the corresponding LightGBM model is trained using cross-validation.

[0011] In one embodiment, for each customer subset, the leaf node number of the sample output by the corresponding LightGBM model is one-hot encoded and used as the input feature of the logistic regression model corresponding to the customer subset.

[0012] In one embodiment, the method further includes periodically updating the parameters of the combined prediction model.

[0013] In one embodiment, the method further includes using the upgrade / replacement probability value for customer marketing activities.

[0014] The aforementioned customer upgrade / replacement purchase prediction method based on LightGBM and logistic regression solves the technical problem of insufficient model generalization ability caused by heterogeneous customer data by dividing customer feature data into three mutually exclusive customer subsets based on private domain browsing behavior identifiers and external tag data, and independently constructing a combined prediction model of LightGBM and logistic regression for each subset. The three-subset grouping strategy ensures high homogeneity in the distribution of customer behavior features within each subset, reducing distribution noise in the training data and providing a higher-quality input foundation for model training. The subset-specific combined prediction model can adapt closely to the behavior-to-decision mapping patterns unique to its group, avoiding the accuracy loss caused by compromise optimization of a single model's parameters, and improving the prediction accuracy and recall for niche customer groups. The leaf node numbers output by the LightGBM model are one-hot encoded to generate high-order nonlinear feature vectors, which serve as input to the logistic regression model. This exploits the implicit interaction relationships between features and uses the linear structure of logistic regression to achieve accurate calibration and numerical stability of the probability output, so that the final upgrade / replacement probability value remains highly discriminative while being numerically reliable. The overall solution, through the collaborative design of data grouping and model customization, enhances the predictive system's adaptability and technical robustness to heterogeneous customer groups without increasing online inference complexity, providing a feasible technical implementation path for high-precision customer behavior prediction.

Attached Figure Description

[0015] Figure 1 is a flowchart of the customer purchase / replacement prediction method based on LightGBM and logistic regression provided in an embodiment of this application.

Detailed Implementation

[0016] To facilitate understanding of the technical solutions provided in the embodiments of this application, the background technology involved in the embodiments of this application will be described below.

[0017] In the field of customer relationship management, especially in the automotive industry's existing customer operations, using machine learning models to predict customers' intentions to upgrade or replace their vehicles has become a core technical aspect of optimizing resource allocation. Currently, the industry generally adopts a single model architecture (such as training a single LightGBM or logistic regression model with full data) for end-to-end prediction.

[0018] However, customer behavior data inherently exhibits structural heterogeneity: some customers frequently browse and click on brand private domain platforms (official apps, mini-programs), and their purchase decisions are strongly correlated with private domain interaction characteristics; some customers lack traces of private domain behavior and can only be characterized through external tag data (such as third-party credit reports, regional consumption indices); still others possess both private domain behavior and external tag characteristics. When a single model is forced to fit such mixed data with significantly different distributions, the optimization process requires parameter trade-offs between the feature distributions of different subgroups, resulting in blurred discrimination boundaries for niche groups (such as customers without private domain behavior but with significant external tag characteristics), a simultaneous decline in prediction accuracy and recall, and limited overall generalization ability.

[0019] More importantly, a single algorithm architecture struggles to meet the dual requirements of technical implementation: while tree models (such as LightGBM) can effectively uncover nonlinear interactions of high-dimensional features, their output probability values often suffer from calibration bias (such as overestimating low-probability events), making it difficult to meet the accuracy requirements for marketing threshold determination; and while linear models such as logistic regression possess good probability calibration characteristics, they struggle to fully capture the implicit correlations between complex behavioral features. Although existing technologies have attempted coarse-grained grouping based on single dimensions such as customer activity, the grouping logic is not deeply coupled with the characteristics of the data source (private domain behavior identifiers, external tag usage status), and the homogeneous model structure remains after grouping, failing to customize differentiated modeling strategies for the data distribution characteristics of each subgroup. This results in the persistent technical bottleneck of "incomplete grouping and inaccurate modeling."

[0020] Therefore, the core technical problem that urgently needs to be solved in this field is: how to construct a grouping mechanism that matches the inherent structure of customer behavior data, and design customized prediction models for each subgroup with the dual advantages of nonlinear feature mining capability and probability output calibration accuracy, so as to improve the overall prediction reliability and technical robustness of heterogeneous customer groups while maintaining the system's inference efficiency.

[0021] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application. Furthermore, it should be specifically noted that all customer data collection and processing involved in this embodiment strictly comply with relevant laws and regulations regarding personal information protection and data security, and are conducted only after obtaining explicit authorization from the customer and completing data anonymization processing (such as hash encryption and generalization processing), ensuring that the entire data lifecycle complies with privacy protection standards.

[0022] This embodiment provides a customer purchase / upgrade prediction method based on LightGBM and logistic regression. As shown in Figure 1, the method includes the following steps in sequence:

S1: Obtain multi-source data of existing customers, and label the multi-source data as positive or negative samples based on whether existing customers have engaged in upgrade or replacement purchase behavior within the preset observation period.

[0023] In this step, multi-source data is linked and integrated using unique customer identifiers (such as Vehicle Identification Numbers (VINs) or encrypted customer IDs). Data collection and use are conducted under customer authorization. The preset observation period can be set to a business cycle of the most recent 12 months or longer. If a customer has a record of purchasing or repurchasing the target vehicle series during this period, it is marked as a positive sample; otherwise, it is marked as a negative sample. To alleviate the class imbalance problem, negative samples can be randomly undersampled to ensure that the ratio of positive to negative samples is between 1:1 and 1:3.
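The labeling and undersampling described in this step can be sketched roughly as follows. This is a minimal illustration in plain Python, not the applicant's implementation: the function names, the customer IDs, and the observation-period dates are all hypothetical, and the 1:3 cap is simply the upper end of the ratio range mentioned above.

```python
import random
from datetime import date

def label_customers(customers, purchases, period_start, period_end):
    """Return {customer_id: 1 or 0}; 1 means an upgrade / replacement purchase
    was recorded inside the observation period."""
    positives = {
        cid for cid, purchase_date in purchases
        if period_start <= purchase_date <= period_end
    }
    return {cid: int(cid in positives) for cid in customers}

def undersample_negatives(labels, max_neg_per_pos=3, seed=0):
    """Randomly drop negatives so that negatives <= max_neg_per_pos * positives."""
    pos = sorted(cid for cid, y in labels.items() if y == 1)
    neg = sorted(cid for cid, y in labels.items() if y == 0)
    keep = min(len(neg), max_neg_per_pos * len(pos))
    kept_neg = random.Random(seed).sample(neg, keep)
    return {cid: labels[cid] for cid in pos + kept_neg}

# Hypothetical data: 20 customers, three purchase records (one outside the period).
customers = [f"c{i}" for i in range(20)]
purchases = [
    ("c0", date(2025, 6, 1)),
    ("c1", date(2025, 9, 15)),
    ("c2", date(2023, 1, 1)),
]
labels = label_customers(customers, purchases, date(2025, 1, 1), date(2025, 12, 31))
balanced = undersample_negatives(labels, max_neg_per_pos=3)
```

Fixing the random seed keeps the sampled negative set reproducible between training runs, which is a design choice of this sketch rather than something the text requires.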

[0024] S2: Perform missing value imputation, outlier removal, and feature derivation processing on the labeled multi-source data to generate customer feature data.

[0025] In this step, for continuous features (such as "number of app logins in the last 30 days"), missing values are filled with 0; for discrete features (such as "marital status"), missing values are filled with the "unknown" category. Outlier removal is performed by calculating the mean $\mu$ and standard deviation $\sigma$ of each continuous feature; data points whose values are greater than an upper threshold or less than a lower threshold derived from $\mu$ and $\sigma$ are considered outliers and removed. Feature derivation processing refers to creating new features based on the original business data, such as generating time window features (e.g., "number of visits in the last six months"), behavioral aggregation features (e.g., "total number of maintenance visits in the past year"), statistical features (e.g., "average historical repair costs"), and features obtained by one-hot encoding discrete features (e.g., "gender").
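A minimal sketch of the imputation and outlier-removal rules in this step, using only the Python standard library. The column names and sample rows are hypothetical, and the multiplier `k = 3` applied to the standard deviation is an assumption made for illustration; the text does not fix a value here.

```python
from statistics import mean, pstdev

def impute(rows, continuous, discrete):
    """Fill missing continuous values with 0 and missing discrete values with 'unknown'."""
    out = []
    for row in rows:
        filled = dict(row)
        for col in continuous:
            if filled.get(col) is None:
                filled[col] = 0
        for col in discrete:
            if filled.get(col) is None:
                filled[col] = "unknown"
        out.append(filled)
    return out

def remove_outliers(rows, col, k=3.0):
    """Drop rows whose value lies outside [mu - k*sigma, mu + k*sigma].
    k = 3 is an assumed multiplier, not a value taken from the text."""
    values = [r[col] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    lo, hi = mu - k * sigma, mu + k * sigma
    return [r for r in rows if lo <= r[col] <= hi]

# Hypothetical rows: 18 ordinary customers, one with missing fields, one extreme value.
rows = [{"app_logins_30d": 5, "marital_status": "married"} for _ in range(18)]
rows.append({"app_logins_30d": None, "marital_status": None})
rows.append({"app_logins_30d": 500, "marital_status": "unmarried"})

filled = impute(rows, continuous=["app_logins_30d"], discrete=["marital_status"])
clean = remove_outliers(filled, "app_logins_30d", k=3.0)
```

The same `impute` and `remove_outliers` settings would need to be reused at prediction time so that training and inference see identically processed features.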

[0026] S3: Based on customer private domain browsing behavior identifiers and external tag data usage identifiers, customer feature data is divided into three mutually exclusive customer subsets.

[0027] In this step, the private domain browsing behavior identifier is generated based on whether the customer has engaged in browsing, clicking, or other interactive behaviors on the brand's official application or mini-program within a preset date range (e.g., the past year). If yes, it's "true"; otherwise, it's "false." The external tag data usage identifier is used during the training phase to indicate whether the customer sample was used to train a model that uses external tag data. The segmentation logic is as follows: First, customers are divided into two categories based on the private domain browsing behavior identifier. For customers with private domain behavior, they are further subdivided into two subsets based on whether an external tag interface is planned to be called for them and used for model training. This ultimately results in three subsets: Subset A (with private domain behavior, using external tags), Subset B (with private domain behavior, not using external tags), and Subset C (no private domain behavior).
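The segmentation logic above reduces to a two-level decision on two Boolean identifiers. A minimal sketch, with a hypothetical function name and the subset letters A, B, and C used as in the paragraph above:

```python
def assign_subset(has_private_domain_behavior: bool, uses_external_tags: bool) -> str:
    """Map the two Boolean identifiers to one of three mutually exclusive subsets."""
    if not has_private_domain_behavior:
        # No private domain behavior: Subset C, regardless of the external tag flag.
        return "C"
    # Private domain behavior present: split on external tag usage.
    return "A" if uses_external_tags else "B"

# Every combination of the two identifiers lands in exactly one subset.
assignments = {
    (p, e): assign_subset(p, e)
    for p in (True, False)
    for e in (True, False)
}
```

Because the function is total over both flags and returns a single label, the three subsets cover all customers without overlap.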

[0028] S4: For each customer subset, train the LightGBM model, perform one-hot encoding on the leaf node number of the sample output by the LightGBM model to generate a new feature vector, use the new feature vector as input to train the logistic regression model, and obtain the combined prediction model corresponding to the customer subset.

[0029] In this step, for each subset, its data is first divided chronologically into a training set (e.g., the first 70% of the data), a validation set (e.g., the middle 15% of the data), and a test set (e.g., the last 15% of the data). When training the LightGBM model, hyperparameter tuning methods such as Bayesian optimization are used to search for the optimal parameter combination within a predefined parameter space (e.g., learning rate, maximum tree depth, minimum number of data points in leaf nodes), and this combination is used to train the model on the training set. After training, all samples from the subset (including the training, validation, and test sets) are input into the trained LightGBM model, and the numbers of all decision tree leaf nodes that each sample ultimately lands on are obtained. Assume the LightGBM model has $N$ trees (for example, 100) and the $i$-th tree has $L_i$ leaf nodes; then each sample corresponds to $N$ leaf node numbers, one per tree. Each number is one-hot encoded to generate a high-dimensional sparse feature vector whose dimension is the total number of leaf nodes across all trees, $\sum_{i=1}^{N} L_i$. In the vector, the positions of the leaf nodes that the sample lands on are 1, and all other positions are 0. Using this sparse feature vector as the input feature, and the original sample label (whether an upgrade or replacement purchase occurred) as the target, a logistic regression model is trained to obtain a "LightGBM feature converter + logistic regression classifier" combined prediction model specific to this customer subset.
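The leaf-number encoding can be sketched as follows, assuming the per-tree leaf indices of a sample are already available (in LightGBM's Python API they can be requested with `predict(..., pred_leaf=True)`). The function names and the small three-tree example are illustrative assumptions, not values from the application.

```python
def leaf_offsets(leaves_per_tree):
    """Return the start position of each tree's block and the total vector length."""
    offsets, total = [], 0
    for n_leaves in leaves_per_tree:
        offsets.append(total)
        total += n_leaves
    return offsets, total

def encode_leaves(leaf_ids, leaves_per_tree):
    """One-hot encode the leaf each tree assigns to a sample.

    leaf_ids[t] is the 0-based index of the leaf the sample lands on in tree t.
    The result has one block per tree and exactly one 1 inside each block.
    """
    offsets, dim = leaf_offsets(leaves_per_tree)
    vec = [0] * dim
    for t, leaf in enumerate(leaf_ids):
        if not 0 <= leaf < leaves_per_tree[t]:
            raise ValueError(f"leaf index {leaf} out of range for tree {t}")
        vec[offsets[t] + leaf] = 1
    return vec

# Hypothetical model with three trees holding 3, 2, and 4 leaves respectively.
leaves_per_tree = [3, 2, 4]
vec = encode_leaves([2, 0, 1], leaves_per_tree)
```

The encoded vector has length 3 + 2 + 4 = 9 and contains exactly three ones, one per tree; vectors of this form are what the logistic regression model is trained on.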

[0030] S5: Obtain the original feature data of the customer to be predicted, perform the same data preprocessing and feature derivation processing as in the training phase to generate the feature vector to be predicted, determine the customer subset to which the customer belongs based on the private domain browsing behavior identifier and external tag data of the customer to be predicted, input the feature vector to be predicted into the combined prediction model corresponding to the customer subset, and output the probability value of adding or replacing purchases in the interval of 0 to 1 by the logistic regression model.

[0031] In this step, whether external tag data is used in the prediction phase depends on whether the customer meets the conditions for calling the external tag interface (e.g., the customer is a high-intent customer initially screened by a preceding model). Specifically, to obtain the upgrade / replacement probability value, the LightGBM model of the matched subset first converts the feature vector to be predicted into a leaf-node-encoded feature vector. This vector is then input into the corresponding logistic regression model, which outputs a value between 0 and 1, serving as the predicted upgrade / replacement probability value for the customer.
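A rough sketch of the final scoring stage, assuming the customer has already been routed to a subset and the LightGBM stage has produced the positions of the one-hot vector that are set to 1. The parameter table `LR_PARAMS` and its weights are entirely hypothetical; in practice each subset's logistic regression would be fitted on its own leaf-encoded training data.

```python
import math

# Hypothetical per-subset logistic regression parameters (illustrative values only).
LR_PARAMS = {
    "A": {"bias": -0.5, "weights": [0.4, -0.2, 1.1, 0.0, -0.7]},
    "B": {"bias": -1.0, "weights": [0.3, 0.3, -0.4, 0.8, 0.1]},
    "C": {"bias": -1.5, "weights": [0.2, -0.1, 0.6, 0.5, -0.3]},
}

def predict_probability(subset, active_positions):
    """Apply the subset's logistic regression to a sparse one-hot input.

    Because the input is one-hot, the linear term is just the bias plus the
    weights at the active positions; the sigmoid maps it into (0, 1).
    """
    params = LR_PARAMS[subset]
    z = params["bias"] + sum(params["weights"][i] for i in active_positions)
    return 1.0 / (1.0 + math.exp(-z))

# A Subset A customer whose leaf encoding activates positions 0 and 2.
p = predict_probability("A", [0, 2])
```

Since only the active positions contribute, scoring cost grows with the number of trees rather than with the full dimension of the sparse vector.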

[0032] Based on the above, this method solves the technical problem of insufficient model generalization ability caused by heterogeneous customer data by dividing customer feature data into three mutually exclusive customer subsets based on private domain browsing behavior identifiers and external tag data, and independently constructing a combined prediction model of LightGBM and logistic regression for each subset. The three-subset grouping strategy ensures high homogeneity in the distribution of customer behavior features within each subset, reducing distribution noise in the training data and providing a higher-quality input foundation for model training. The subset-specific combined prediction model can adapt closely to the behavior-to-decision mapping patterns unique to its group, avoiding the accuracy loss caused by compromise optimization of a single model's parameters, and improving the prediction accuracy and recall for niche customer groups. The leaf node numbers output by the LightGBM model are one-hot encoded to generate high-order nonlinear feature vectors, which serve as input to the logistic regression model. This exploits the implicit interaction relationships between features and uses the linear structure of logistic regression to achieve accurate calibration and numerical stability of the probability output, so that the final upgrade / replacement probability value remains highly discriminative while being numerically reliable. The overall solution, through the collaborative design of data grouping and model customization, enhances the predictive system's adaptability and technical robustness to heterogeneous customer groups without increasing online inference complexity, providing a feasible technical implementation path for high-precision customer behavior prediction.

[0033] When building a customer prediction model, the richness and quality of the feature system directly determine the upper limit of the model's performance. Traditional methods mainly rely on limited internal business data (such as CRM and after-sales maintenance), with single feature dimensions, making it difficult to comprehensively and three-dimensionally depict customers' vehicle usage status, spending power, lifestyle, and potential needs. This results in the model learning biased patterns and limited predictive capabilities. Therefore, in one embodiment, multi-source data includes customer personal information, maintenance data, insurance purchase and claim data, APP and mini-program behavior data, e-commerce consumption data, traffic restriction and purchase restriction data, complaint data, community posting data, time-series features, lead generation data, in-store data, test drive data, and third-party external tag data.

[0034] Specifically:
Customer personal information: including but not limited to age, gender, occupation, region, and other information that has been anonymized;
Maintenance and repair data: records of all maintenance items, costs, parts replacements, and maintenance mileage intervals for the customer's vehicle;
Insurance purchase and claims data: the types of commercial insurance purchased by the customer, the sum insured, the premium, and the historical claims records;
APP and mini-program behavioral data: customer login frequency, page browsing time, function clicks (such as booking a test drive or viewing models), content sharing, and other interaction logs in the brand's official APP or mini-program;
E-commerce consumption data: transaction records of customers purchasing products, accessories, service packages, etc. in the brand's online store;
Traffic restriction and purchase restriction data: macro-level information such as car purchase restriction policies, license plate acquisition methods, and traffic restriction regulations in the customer's city;
Complaint data: the content, handling, and satisfaction outcome of complaints submitted by customers through customer service hotlines and online channels;
Community posting data: posts, comments, and interactions published by customers in the brand's online community (subject to anonymization and topic analysis);
Time-series characteristics: trends or periodic characteristics that change over time, derived from the above behavioral data, such as "the rate of change in maintenance frequency over the past three months compared to the same period last year";
Lead generation data: information left by the customer online or offline, such as the preferred car model, purchase budget, and expected delivery time;
Store visit data: records of the number of times customers visit dealerships and the purpose of their visits (maintenance, vehicle inspection, participation in events, etc.);
Test drive data: the customer's historical test drive models, times, durations, and subsequent feedback;
Third-party external tag data: customer tags, such as spending power index, interests and preferences, and family composition predictions, obtained from third-party data service providers under legal, compliant, and authorized conditions. This data is securely matched and associated using encrypted customer identifiers.

[0035] Based on the above, by integrating diverse data sources covering customer attributes, vehicle status, interaction behavior, consumption history, external environment, and third-party profiles, a comprehensive and multi-perspective customer characteristic system was constructed. This multi-source data fusion strategy enriches the information content of the features, enabling the model to learn complex correlation patterns that cannot be captured from a single data source. For example, "high-frequency APP interaction combined with a high consumption capacity index" may be strongly correlated with the purchase of luxury vehicles, thus laying a solid data foundation for subsequent refined segmentation and accurate modeling, and improving the model's depth of characterization and predictive accuracy of potential customer intentions.

[0036] Raw business data fields often cannot be directly used in machine learning models, or they do not express enough information. Directly using raw fields (such as "last maintenance date") for modeling makes it difficult to effectively capture deep patterns related to upgrade / replacement decisions, such as behavioral trends and combined effects between different behaviors, leading to bottlenecks in model performance improvement. Therefore, in one embodiment, feature derivation processing includes feature processing and feature derivation, which include time window features, behavioral aggregation features, statistical features, and discrete features.

[0037] Specifically, first, time window features: for timestamped behavioral data, the frequency or intensity of occurrences within a specific time window is counted. Examples include "number of days logged into the app in the last 30 days," "number of times visited the store in the past six months," and "number of complaints this year." The calculation formula is as follows. For customer $u$ and event type $e$, the feature value in window $W$ (such as the most recent $d$ days) is $x_{u,e,W} = \sum_{t \in T_{u,e}} \mathbb{1}(t \in W)$, where $T_{u,e}$ is the set of time points at which customer $u$ triggered event $e$, and $\mathbb{1}(\cdot)$ is the indicator function.
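The time window count can be sketched as below: each event timestamp contributes 1 if it falls inside the window and 0 otherwise. The function name, the half-open window convention (start exclusive, reference date inclusive), and the sample dates are assumptions made for this illustration.

```python
from datetime import date, timedelta

def window_count(event_dates, as_of, days):
    """Count events falling in the window (as_of - days, as_of]."""
    start = as_of - timedelta(days=days)
    # The generator plays the role of the indicator function: 1 per in-window event.
    return sum(1 for d in event_dates if start < d <= as_of)

# Hypothetical app login dates for one customer.
logins = [date(2025, 11, 30), date(2026, 1, 5), date(2026, 2, 20), date(2026, 3, 1)]
as_of = date(2026, 3, 3)

logins_30d = window_count(logins, as_of, 30)
logins_180d = window_count(logins, as_of, 180)
```

Computing the same event list over several window lengths, as done here for 30 and 180 days, yields a small family of features that together indicate whether activity is recent or spread out.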

[0038] Secondly, behavioral aggregation features: summarizing behaviors of the same type but different sub-items. For example, summing up the number of times a customer browses different car model pages within the app to get "Total Car Model Browsing Count within the App"; or summing up the amounts of all repair work orders to get "Historical Cumulative Repair Amount".

[0039] Thirdly, statistical features: performing statistical analysis on historical data in a certain dimension. For example, calculating the "mean and standard deviation of the mileage between each customer's historical maintenance visits" to measure the regularity of their maintenance habits; or calculating the "maximum and minimum monthly premium expenditures over the past 12 months".
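As an illustration of the maintenance-interval example, the sketch below derives mileage intervals from consecutive odometer readings and summarizes them. The function name and readings are hypothetical, and the use of the population standard deviation is an assumption; the text does not say which variant is intended.

```python
from statistics import mean, pstdev

def interval_stats(odometer_readings):
    """Summarize the mileage driven between consecutive maintenance visits."""
    intervals = [b - a for a, b in zip(odometer_readings, odometer_readings[1:])]
    if not intervals:
        return None  # fewer than two visits: no interval can be computed
    return {
        "mean": mean(intervals),
        "std": pstdev(intervals),  # population standard deviation (assumed)
        "max": max(intervals),
        "min": min(intervals),
    }

# Hypothetical odometer readings (km) at four maintenance visits.
stats = interval_stats([5000, 10000, 15500, 20000])
```

A small standard deviation relative to the mean suggests regular maintenance habits, which is the interpretation the paragraph above gives to this feature.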

[0040] In addition, discrete features: The original categorical fields are encoded and transformed for model processing. The most common method is one-hot encoding. For example, for the "marital status" field, its values may be {married, unmarried, divorced, unknown}. One-hot encoding transforms it into four binary features: "married", "unmarried", "divorced", and "unknown". Samples are marked as 1 for the feature corresponding to the field value, and 0 for the others.
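The marital status example can be written out as a short sketch. The feature naming scheme and the decision to map missing or unseen values to "unknown" are assumptions of this illustration.

```python
CATEGORIES = ["married", "unmarried", "divorced", "unknown"]

def one_hot(value, categories=CATEGORIES):
    """Return one binary feature per category; exactly one of them is 1."""
    if value not in categories:
        value = "unknown"  # assumed handling for missing or unseen values
    return {f"marital_status={c}": int(c == value) for c in categories}

encoded = one_hot("divorced")
encoded_missing = one_hot(None)
```

Keeping the category list fixed between training and prediction ensures that the binary features always appear in the same order and number.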

[0041] Based on the above, feature derivation processing, through in-depth processing of the original data, creates a series of derived features with higher information density, stronger business interpretability, and more direct correlation with the prediction target. Time window features can dynamically reflect the recent activity of customers; behavioral aggregation features reveal the total scale of behavior; statistical features characterize the stability and distribution characteristics of behavioral patterns; and the encoding of discrete features enables the model to understand and utilize category information. This series of operations expands the feature space, providing richer and more effective learning materials for models such as LightGBM, enabling them to construct more complex decision rules, thereby improving the model's discriminative ability.

[0042] If all customers are trained together, the characteristics of active customers with private domain behavior data will dominate, making the model better at predicting such customers while performing poorly on inactive customers who lack private domain behavior or rely on external tags. This biased modeling approach cannot achieve fair and accurate predictions for customer groups with different data availability. Therefore, in one embodiment, the three mutually exclusive customer subsets include: a first customer subset containing customer samples whose private domain browsing behavior identifier is true and whose external tag data usage identifier is true; a second customer subset containing customer samples whose private domain browsing behavior identifier is true and whose external tag data usage identifier is false; and a third customer subset containing customer samples whose private domain browsing behavior identifier is false.

[0043] The specific segmentation process is as follows. First, each customer is assigned a Boolean private domain browsing behavior identifier based on whether the customer has browsing or clicking log records in the brand's official APP or mini-program within a preset historical period (such as the most recent 365 days): True if a record exists, False otherwise. Second, a Boolean external label data usage identifier is set in the model training strategy to indicate whether third-party external label data is introduced and used when training the model for the subset. Customers whose private domain browsing behavior identifier is True can be further divided into two groups, either randomly or according to certain rules (such as preliminary model scoring), based on a trade-off between business cost and accuracy: for one group the external label data usage identifier is set to True, to train a model that can use rich internal and external information (the first customer subset); for the other group it is set to False, to train a model that relies solely on internal data (the second customer subset). For customers whose private domain browsing behavior identifier is False, who lack private domain behavior data, the external label data usage identifier is usually treated as False in the final model training, or these customers are considered separately; together they form the third customer subset. These three subsets cover all customers and do not overlap.
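The partitioning rule can be expressed as a small routing function. The sketch below assumes the two identifiers are already available as Booleans; the names `Subset` and `assign_subset` are hypothetical and used only for illustration.

```python
from enum import Enum


class Subset(Enum):
    FIRST = 1   # private domain behavior present, external label data used
    SECOND = 2  # private domain behavior present, internal data only
    THIRD = 3   # no private domain behavior


def assign_subset(has_private_browsing: bool, uses_external_labels: bool) -> Subset:
    """Route a customer to exactly one of the three mutually exclusive subsets."""
    if not has_private_browsing:
        # The external label identifier is not consulted for these customers.
        return Subset.THIRD
    return Subset.FIRST if uses_external_labels else Subset.SECOND
```

Because every combination of the two identifiers maps to exactly one subset, the same function can be reused at prediction time to pick the combined model for a customer.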

[0044] Based on the above, this segmentation method uses core differences at the customer data level (the presence or absence of private domain behavior, and the availability of external tags) as the basis for grouping, ensuring a high degree of similarity in the feature space and data structure of customers within the same subset. The first subset has the most comprehensive customer features, allowing the model to learn the most complex cross-relationships between internal and external features; the second subset model focuses on mining predictive patterns in internal behavioral data; and the third subset model needs to search for predictive clues from other static and transactional data in the absence of traces of private domain behavior. This differentiated and targeted grouping modeling strategy enables the model tailored to each subset to learn the unique patterns of that group more focused and in-depth, thereby achieving higher predictive accuracy for each customer type than a single global model, and improving the predictive ability for customer groups lacking private domain behavior.

[0045] After feature derivation, the number of features can be enormous, including many features that lack discriminative power (low information value) for the prediction target, as well as features that are highly linearly correlated with each other. These redundant and noisy features not only increase the computational cost of model training and the risk of overfitting, but may also dilute the contribution of important features, reducing the model's generalization performance and stability. Therefore, in one embodiment, after the customers are divided into three mutually exclusive subsets, feature selection is performed for each customer subset based on feature IV values and feature correlations.

[0046] Specifically, the following is executed independently for each customer subset: (1) Feature selection based on IV value: ① Binning: each continuous feature is discretized into a preset number of bins using the equal-frequency binning method, which ensures that each bin contains approximately the same number of samples.

[0047] ② Calculation of WOE and IV: For the $i$-th bin of the $j$-th feature, calculate its Weight of Evidence (WOE), and from the WOE values calculate the Information Value (IV) of the feature. Let $P_i$ be the number of positive samples (customers with an upgrade or replacement purchase) in the $i$-th bin, and let $N_i$ be the number of negative samples in the $i$-th bin. Let $P_T$ be the total number of positive samples in this customer subset and $N_T$ the total number of negative samples. The WOE of the $i$-th bin is $\mathrm{WOE}_i = \ln\left(\frac{P_i / P_T}{N_i / N_T}\right)$, and the IV of the feature is $\mathrm{IV} = \sum_{i} \left(\frac{P_i}{P_T} - \frac{N_i}{N_T}\right) \mathrm{WOE}_i$, where the sum runs over all bins of the feature.

[0048] ③ Screening: Remove features whose IV value is below a preset lower threshold, since their predictive ability is too weak. At the same time, features whose IV value is too high (above a preset upper threshold) are usually also excluded to prevent overfitting. In practical applications, the lower and upper thresholds can be adjusted based on the specific data and business experience; this application does not limit their specific values.
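The binning, WOE/IV calculation, and screening steps can be sketched in plain Python as follows. Several details here are assumptions rather than part of this application: rank-based equal-frequency binning (ties may be split across bins), the additive smoothing term `eps` that avoids a logarithm of zero when a bin has no positive or no negative samples, and the threshold values passed to `select_by_iv`.

```python
import math


def equal_frequency_bins(values, n_bins):
    """Assign each value a bin index so that bins hold roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def information_value(bins, labels, n_bins, eps=0.5):
    """Sum (pos_rate - neg_rate) * WOE over bins; labels are 1 (positive) or 0."""
    pos_total = sum(labels)
    neg_total = len(labels) - pos_total
    iv = 0.0
    for b in range(n_bins):
        pos = sum(1 for bi, y in zip(bins, labels) if bi == b and y == 1)
        neg = sum(1 for bi, y in zip(bins, labels) if bi == b and y == 0)
        # Assumption: smoothing keeps both rates strictly positive.
        pos_rate = (pos + eps) / (pos_total + eps * n_bins)
        neg_rate = (neg + eps) / (neg_total + eps * n_bins)
        woe = math.log(pos_rate / neg_rate)
        iv += (pos_rate - neg_rate) * woe
    return iv


def select_by_iv(iv_by_feature, low, high):
    """Keep features whose IV lies within [low, high] (hypothetical thresholds)."""
    return [name for name, iv in iv_by_feature.items() if low <= iv <= high]
```

Each term of the IV sum is non-negative, so a feature that separates positive and negative samples well across bins always receives a larger IV than one whose bins have similar class proportions.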

[0049] (2) Feature screening based on feature correlation: ① Calculate the correlation coefficient matrix: Calculate the Pearson correlation coefficient between each pair of the remaining continuous features. Its value range is [-1, 1].

[0050] ② Filter highly correlated feature pairs: Set a correlation threshold $\theta$. Iterate through all feature pairs; if the absolute value of the correlation coefficient between features $a$ and $b$ satisfies $|r_{ab}| > \theta$, then features $a$ and $b$ are considered highly correlated.

[0051] ③ Eliminate redundant features: For each group of highly correlated features, calculate the sum (or average) of the absolute values of the correlation coefficients between each feature and all other features. Eliminate the feature with the highest average correlation to the other features in that group, and retain the remaining features. Repeat this process until no pair of features has an absolute correlation coefficient exceeding the threshold.
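The correlation-based screening can be sketched as an iterative elimination loop. This is a simplified reading of the steps above: at each iteration, among the features that appear in at least one over-threshold pair, the one with the highest mean absolute correlation to all other remaining features is removed. The function names and the threshold value used in any call are illustrative assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def drop_correlated(features, threshold):
    """features maps name -> list of values; returns the names that are kept."""
    kept = list(features)
    while True:
        corr = {
            (a, b): abs(pearson(features[a], features[b]))
            for i, a in enumerate(kept)
            for b in kept[i + 1:]
        }
        over = [pair for pair, r in corr.items() if r > threshold]
        if not over:
            return kept
        involved = sorted({name for pair in over for name in pair})

        def mean_abs_corr(name):
            related = [r for pair, r in corr.items() if name in pair]
            return sum(related) / len(related)

        # Remove the most redundant feature, then re-check the remaining pairs.
        kept.remove(max(involved, key=mean_abs_corr))
```

Recomputing the pairwise matrix after every removal is quadratic in the number of features, which is acceptable for a sketch; a production version would more likely compute the matrix once with a numerical library.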

[0052] Based on the above, IV value screening filters out useless features that are almost irrelevant to the target variable from a predictive ability perspective, while retaining features with high information content. Correlation screening aims to eliminate multicollinearity among features, preventing the model from becoming unstable or difficult to interpret due to feature redundancy. Performing these two screening steps independently within each client subset ensures that the final feature set used to train the combined model for that subset is the most refined, effective, and highly independent. This helps improve model training efficiency, prevent overfitting, and enhance the model's generalization ability and robustness on unknown data.

[0053] Private domain platforms are crucial digital touchpoints for automakers to directly interact with customers, but they encompass various behavioral types (such as login, browsing, clicking, and sharing). A specific, measurable behavior potentially relevant to purchase intent needs to be identified as a key criterion for customer segmentation to avoid vague segmentation criteria or the inclusion of too many noisy behaviors that could reduce the effectiveness of segmentation. Therefore, in one embodiment, private domain browsing behavior identifiers are generated based on customers' private domain browsing and clicking behavior within the brand's official application or mini-program.

[0054] Specifically, the system extracts customer behavior event streams within a preset time frame (e.g., the past year) from the backend log database of the brand's official app and mini-program. It focuses on "browsing" and "clicking" events strongly correlated with vehicle information, service exploration, and activity participation, such as browsing vehicle details pages, browsing configurator pages, clicking the "Schedule Test Drive" button, clicking promotional activity links, and viewing after-sales service details pages. For each customer, if there is at least one browsing or clicking behavior record defined above within the time frame, their private domain browsing behavior flag is set to "true"; otherwise, it is flagged as "false." This judgment logic can be implemented by using Structured Query Language (SQL) to filter and count the behavior log table. In practical applications, the specific event types included to define "browsing and clicking behavior" can be adjusted according to business understanding; this embodiment does not limit this.
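The SQL-based judgment logic can be sketched with Python's built-in `sqlite3` module. The table name `behavior_log`, its columns, the event type names, and the use of ISO-formatted date strings are all hypothetical; the actual log schema is not described in this application.

```python
import sqlite3

# Hypothetical event types counted as "browsing and clicking behavior".
QUALIFYING_EVENTS = ("browse_vehicle_detail", "click_test_drive")


def private_domain_flags(conn, customer_ids, window_start, window_end):
    """Return {customer_id: bool}, True if any qualifying event is in the window."""
    placeholders = ",".join("?" for _ in QUALIFYING_EVENTS)
    rows = conn.execute(
        f"""
        SELECT DISTINCT customer_id
        FROM behavior_log
        WHERE event_type IN ({placeholders})
          AND event_date BETWEEN ? AND ?
        """,
        (*QUALIFYING_EVENTS, window_start, window_end),
    ).fetchall()
    active = {row[0] for row in rows}
    # Customers with no matching record are explicitly flagged False.
    return {cid: cid in active for cid in customer_ids}
```

Passing the full customer list separately from the log query ensures that customers with no log records at all still receive a "false" flag instead of being silently dropped.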

[0055] Based on the above, browsing and clicking behavior is used as the core criterion for private domain activity because this behavior directly reflects customers' proactive attention and willingness to explore brand products, services, or content, and has a strong logical correlation with their intention to upgrade or replace products. Using this clear, objective, and data-driven standard for customer segmentation ensures that the subgroup of customers exhibiting private domain behavior truly consists of customers who maintain a certain level of interest in the brand. This makes the differentiated modeling based on this segmentation more business-meaningful and effective, providing a high-quality foundation for subsequent training of high-precision prediction models.

[0056] When training a LightGBM model, the choice of hyperparameters (such as the number of trees, their depth, and the learning rate) has a decisive impact on model performance. Using only a single training-validation split for hyperparameter tuning may result in unstable optimal parameters due to the randomness of the split, failing to accurately reflect the model's generalization ability across the overall data distribution and leading to biased final model performance evaluation. Therefore, in one embodiment, cross-validation is used to train the corresponding LightGBM model for each customer subset.

[0057] Specifically, for a given customer subset, before the partitioning into training, validation, and test sets is finalized, K-fold cross-validation is used for hyperparameter search and model training. Taking 5-fold cross-validation as an example: first, all data in the subset is randomly shuffled and divided into 5 equal parts for 5 rounds of training. In each round, one part is used as the validation set and the remaining 4 parts as the training set. In each round, the same candidate hyperparameter combinations are used to train the LightGBM model on the training set, and its performance is evaluated on the validation set (e.g., using AUC or F1 score). Finally, for each hyperparameter combination, the average of its 5 validation scores is calculated. The hyperparameter combination with the highest average validation score is selected as the optimal parameters for the LightGBM model of that subset. Then, using these optimal parameters, the final LightGBM model is retrained on the entire subset data (or on the time-divided training set). In practical applications, stratified K-fold cross-validation can also be used to ensure a consistent ratio of positive and negative samples in each fold, or other fold numbers can be used; this embodiment does not limit this.
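The selection loop described above can be sketched independently of any modeling library. In the sketch below, `train_and_score` is a hypothetical callable that trains a model with the given parameters on the training indices and returns its validation score (for example AUC); the fold construction is plain (non-stratified) K-fold.

```python
import random
from statistics import mean


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, valid_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def select_hyperparameters(candidates, n_samples, train_and_score, k=5, seed=0):
    """Return the candidate with the highest mean validation score and that score."""
    # The same splits are reused for every candidate so scores are comparable.
    splits = list(k_fold_indices(n_samples, k, seed))
    best_params, best_score = None, float("-inf")
    for params in candidates:
        avg = mean(train_and_score(params, tr, va) for tr, va in splits)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score
```

After selection, the final model would be retrained with `best_params` on the full subset, as the paragraph above describes.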

[0058] Based on the above, cross-validation fully utilizes limited sample data, evaluating model performance through multiple different data partitions, effectively reducing evaluation errors caused by the randomness of a single data partition. This makes the hyperparameter selection process more robust, and the resulting optimal parameter combination is more likely to represent the model's true performance on unknown data. The LightGBM model trained through cross-validation, as the first stage of the ensemble prediction model, exhibits stronger stability and generalization ability, thus laying a solid foundation for subsequently generating high-quality leaf node features and training a reliable logistic regression model.

[0059] The LightGBM model can learn complex nonlinear patterns, but its raw output (class probabilities) sometimes suffers from calibration issues, and its decision-making process is a black box. Logistic regression models offer good probabilistic calibration and interpretability, but struggle to automatically construct effective nonlinear feature interactions. The key technical challenge lies in seamlessly combining the advantages of both models to form a unified predictive framework that works collaboratively. Therefore, in one embodiment, for each customer subset, the leaf node number of the sample output from its corresponding LightGBM model is one-hot encoded and used as the input feature of the logistic regression model for that customer subset.

[0060] Specifically, for the LightGBM model trained for a customer subset, assume that the model is composed of $T$ decision trees. All sample data of this subset (including the training, validation, and test sets) are input into this LightGBM model for prediction. For each input sample, each decision tree in LightGBM is traversed from the root node to a unique leaf node based on the sample's feature values. In this way, the sample obtains $T$ leaf node numbers, denoted by the vector $(l_1, l_2, \ldots, l_T)$, where $l_t$ indicates the leaf node number of the sample in the $t$-th tree. Next, these leaf node numbers are one-hot encoded. First, the total number of leaf nodes $L_t$ in each tree is determined. Then, a global feature mapping table is constructed for the entire LightGBM model, mapping each leaf node of each tree to a unique global index position. Finally, the leaf node vector of each sample is converted into a high-dimensional sparse binary vector of length $\sum_{t=1}^{T} L_t$. In this sparse vector, the global index position corresponding to the leaf node where the sample falls in each tree has a value of 1, and all other positions are 0. These are high-order features "extracted" from the LightGBM model, representing complex combinations of the original sample features. Finally, a logistic regression model is trained with these sparse vectors as the input features and the samples' original upgrade / replacement labels as the target variable. The weight coefficients of this logistic regression model reflect the contribution of different decision tree paths (i.e., higher-order feature crosses) to the final prediction result.
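The global index mapping and the one-hot conversion can be sketched without any modeling library. The sketch assumes that per-tree leaf numbers are 0-based and already available for each sample (in the LightGBM Python package they can be obtained by calling `predict` with `pred_leaf=True`); the function names are hypothetical.

```python
def build_leaf_offsets(leaves_per_tree):
    """Return the starting global index of each tree and the total leaf count."""
    offsets, total = [], 0
    for n_leaves in leaves_per_tree:
        offsets.append(total)
        total += n_leaves
    return offsets, total


def encode_leaves(leaf_ids, offsets, total):
    """Convert one sample's per-tree leaf numbers into a binary vector."""
    vector = [0] * total
    for tree_idx, leaf_id in enumerate(leaf_ids):
        # Exactly one position per tree is set to 1.
        vector[offsets[tree_idx] + leaf_id] = 1
    return vector


# Three trees with 3, 2, and 4 leaves give a vector of length 9.
offsets, total = build_leaf_offsets([3, 2, 4])
sample_vector = encode_leaves([2, 0, 3], offsets, total)
```

The resulting vectors always contain as many ones as there are trees, so for large models a sparse matrix representation is the natural input format for the logistic regression stage.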

[0061] Based on the above, the decision-making process of LightGBM is symbolized as leaf node paths, and then transformed into specific feature representations through one-hot encoding. These leaf node features are essentially Boolean indicators of the most discriminative feature combination rules automatically learned and selected by LightGBM. The logistic regression model performs linear discrimination based on this, which is actually a weighted vote on these higher-order rules. This cascaded structure achieves a clear division of functions: LightGBM is responsible for mining complex nonlinear relationships and combinations from the original features; logistic regression is responsible for interpreting these constructed higher-order features and outputting probabilities. The combination of the two maintains strong discriminative power while improving the reliability and interpretability of the model's final output probability.

[0062] Customer behavior patterns, market conditions, and vehicle lifecycles all change over time, a phenomenon known as concept drift. If a deployed predictive model remains unchanged for an extended period, the historical patterns it learns may gradually become out of touch with current realities, leading to a decline in predictive performance and an inability to consistently provide accurate decision support for the business. Therefore, in one embodiment, the method further includes periodically updating the parameters of the combined predictive model.

[0063] Specifically, the update strategy can employ a rolling time window approach. For example, the model update cycle can be set to once per quarter or once every six months. When an update point is reached, the following operations are performed: ① Data preparation: Collect data covering a fixed time span counted back from the current point in time (e.g., the most recent three years) as the full dataset for this model update. Following the methods described above, perform sample labeling, data association, preprocessing, feature engineering, and customer subset partitioning.

[0064] ② Feature alignment: Ensure that the features generated from new data are consistent with those used in the old model in terms of name, meaning, and processing method. Newly emerging features or disappeared old features need to be evaluated and processed to maintain the consistency of the feature space.

[0065] ③ Model retraining: For each customer subset, using the new data of that subset, the aforementioned methods are used to retrain the LightGBM model (including hyperparameter tuning), generate leaf node features, and train the logistic regression model to obtain a new combined prediction model.

[0066] ④ Evaluation and Deployment: Evaluate the performance of the new model on the new test set (e.g., AUC, KS score, F1 score, etc.) and compare it with the recent performance of the old model. If the performance of the new model meets the preset upgrade criteria (e.g., AUC increases by more than 0.01 or at least does not decrease), replace the old model in the online service with the new model; otherwise, keep the old model running and analyze the reasons.

[0067] ⑤ Monitoring and Triggering: In addition to regular updates, online performance monitoring metrics can be established (such as conversion rates for high-scoring customers). If a monitoring metric falls below a threshold for an extended period, a temporary model update can be triggered.
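The deployment decision in step ④ and the monitoring trigger in step ⑤ can be sketched as two small predicates. The parameter names, the default minimum gain of zero, and the interpretation of "an extended period" as a number of consecutive monitoring periods are assumptions made for illustration.

```python
def should_promote(new_auc, old_auc, min_gain=0.0):
    """Replace the online model only if the new model's AUC gain meets min_gain."""
    return new_auc - old_auc >= min_gain


def needs_unscheduled_update(metric_history, threshold, min_consecutive):
    """True if the metric stayed below threshold for the last min_consecutive periods."""
    recent = metric_history[-min_consecutive:]
    return len(recent) == min_consecutive and all(m < threshold for m in recent)
```

With `min_gain=0.0` the rule corresponds to "at least does not decrease"; a stricter setting such as `min_gain=0.01` corresponds to the other example criterion mentioned in step ④.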

[0068] Based on the above, the regular update mechanism enables the forecasting system to continuously learn and evolve. By periodically injecting the latest business data to retrain the model, the system can dynamically capture and adapt to changes in customer behavior preferences and market trends, and promptly correct forecast biases caused by concept drift. This ensures that the forecasting model can maintain its accuracy and timeliness in the long term, providing stable and reliable technical support for continuous and refined customer operations, and extending the model's lifecycle and value.

[0069] The predicted probabilities output by machine learning models are not the ultimate goal; they must be effectively integrated into business processes and translated into concrete business actions to generate real value. Seamlessly connecting model predictions with downstream marketing execution is a crucial step in enabling technology to empower business. Therefore, in one embodiment, the method further includes using the upgrade / replacement probability value in customer marketing activities.

[0070] Specifically, the application process is as follows: The model calculates the upgrade / replacement probability value (between 0 and 1) for each existing customer, and the customers are then sorted from the highest to the lowest probability value. Marketing teams can set one or more probability thresholds based on resource budgets and reach capabilities, for example a high-intent threshold and a medium-intent threshold.

[0071] Customers whose probability value is at or above the high-intent threshold can be marked as "high-priority potential customers", for whom the most precise marketing actions are initiated, such as: having a dedicated sales consultant make a one-on-one phone call to invite them, recommending specific models based on their characteristics and offering test drive benefits; or sending personalized car purchase coupons via APP push.

[0072] Customers whose probability value lies between the medium-intent threshold and the high-intent threshold can be marked as "medium-priority potential customers" and targeted with low-cost automated or semi-automated marketing methods, such as: including them in the company's WeChat group for refined content cultivation; or sending them car model information and event invitations based on their potential needs via SMS or EDM.

[0073] For customers whose probability value is below the medium-intent threshold, proactive purchase-intent marketing can be temporarily withheld, with the focus placed instead on maintaining the brand relationship or on low-frequency engagement.
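The ranking and tiering logic can be sketched as follows. The threshold values 0.7 and 0.3, the tier label "low", and the treatment of a probability exactly equal to a threshold as belonging to the higher tier are all hypothetical choices; this application leaves the thresholds to the marketing team.

```python
def assign_priority(probability, high_threshold, medium_threshold):
    """Map a predicted probability to a marketing priority tier."""
    if probability >= high_threshold:
        return "high"
    if probability >= medium_threshold:
        return "medium"
    return "low"


def rank_customers(scores, high_threshold=0.7, medium_threshold=0.3):
    """Sort customers by descending probability and attach a tier to each."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [
        (cid, p, assign_priority(p, high_threshold, medium_threshold))
        for cid, p in ranked
    ]


ranked = rank_customers({"c1": 0.15, "c2": 0.85, "c3": 0.45})
```

The tier attached to each customer can then drive the choice of marketing channel, while the ordering within a tier determines who is contacted first when capacity is limited.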

[0074] Furthermore, predicted scores can be combined with customer profile characteristics to generate attribution analyses that explain "why this customer scored high," providing sales consultants with communication scripts for reference. The effectiveness of marketing campaigns (such as whether test drives, lead generation, and sales were generated) will form a feedback loop, serving as a source of positive and negative samples for subsequent model training, thereby achieving a continuous optimization cycle of data-driven marketing.

[0075] Based on the above, the specific application scenarios and methods of the model output were clarified, directly connecting the technological achievements to the business value loop. Through customer segmentation based on predictive scoring, the optimal allocation of marketing resources (human resources, expenses, and attention) was achieved, prioritizing limited resources to customer groups with the highest probability of successful conversion, thereby improving the return on investment and overall efficiency of marketing activities. Simultaneously, marketing feedback data feeds back into the model, forming a virtuous cycle of data-model-action-new data, driving the entire customer operation system towards continuous intelligent and refined development.

[0076] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and / or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

[0077] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0078] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the invention patent. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this patent application should be determined by the appended claims.

Claims

1. A customer purchase / upgrade prediction method based on LightGBM and logistic regression, characterized in that the method includes the following steps: acquiring multi-source data of existing customers, and labeling the multi-source data as positive or negative samples based on whether the existing customers have engaged in upgrade or replacement purchase behavior within a preset observation period; performing missing value imputation, outlier removal, and feature derivation on the labeled multi-source data to generate customer feature data; dividing the customer feature data into three mutually exclusive customer subsets based on customer private domain browsing behavior identifiers and external tag data usage identifiers; for each customer subset, training a LightGBM model, one-hot encoding the leaf node number of the sample output by the LightGBM model to generate a new feature vector, and training a logistic regression model with the new feature vector as input to obtain the combined prediction model corresponding to the customer subset; and obtaining the original feature data of the customer to be predicted, performing the same data preprocessing and feature derivation processing as in the training phase to generate the feature vector to be predicted, determining the customer subset to which the customer belongs based on the private domain browsing behavior identifier and external tag data of the customer to be predicted, inputting the feature vector to be predicted into the combined prediction model corresponding to the customer subset, and outputting, by the logistic regression model, the upgrade / replacement probability value in the interval of 0 to 1.

2. The customer purchase / upgrade prediction method based on LightGBM and logistic regression according to claim 1, characterized in that, The multi-source data includes customer personal information, maintenance data, insurance purchase and claim data, APP and mini-program behavior data, online shopping mall consumption data, traffic restriction and purchase restriction data, complaint data, community posting data, time series characteristics, lead generation data, in-store data, test drive data, and third-party external tag data.

3. The customer purchase / upgrade prediction method based on LightGBM and logistic regression according to claim 1, characterized in that, The feature derivation process includes feature processing and feature derivation, which includes time window features, behavior aggregation features, statistical features, and discrete features.

4. The customer purchase / upgrade prediction method based on LightGBM and logistic regression according to claim 1, characterized in that, The three mutually exclusive customer subsets include: a first customer subset containing customer samples whose private domain browsing behavior is identified as true and whose external tag data usage is identified as true; a second customer subset containing customer samples whose private domain browsing behavior is identified as true and whose external tag data usage is identified as false; and a third customer subset containing customer samples whose private domain browsing behavior is identified as false.

5. The customer purchase / upgrade prediction method based on LightGBM and logistic regression according to claim 4, characterized in that, After the customers are divided into three mutually exclusive subsets, feature filtering is performed for each customer subset based on feature IV value and feature correlation.

6. The customer purchase / upgrade prediction method based on LightGBM and logistic regression according to claim 4, characterized in that, The private domain browsing behavior identifier is generated based on the customer's private domain browsing click behavior in the brand's official application or mini-program.

7. The customer purchase / upgrade prediction method based on LightGBM and logistic regression according to claim 1, characterized in that, For each of the aforementioned customer subsets, a corresponding LightGBM model is trained using cross-validation.

8. The customer purchase / upgrade prediction method based on LightGBM and logistic regression according to claim 1, characterized in that, For each customer subset, the leaf node number of the sample output by the corresponding LightGBM model is one-hot encoded and used as the input feature of the logistic regression model corresponding to the customer subset.

9. The customer purchase / upgrade prediction method based on LightGBM and logistic regression according to any one of claims 1 to 8, characterized in that the method further includes: periodically updating the parameters of the combined prediction model.

10. The customer purchase / upgrade prediction method based on LightGBM and logistic regression according to any one of claims 1 to 8, characterized in that the method further includes: using the upgrade / replacement probability value in customer marketing activities.