A customer overdue probability prediction method based on a decision tree model
By constructing an observation timeline, dividing it into period segments, extracting local behavioral features, generating risk labels, and jointly encoding them with a decision tree model, the method addresses the difficulty existing technologies have in capturing the dynamic characteristics of customer fund flows, enabling efficient customer risk prediction and accurate overdue probability prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG NINGYIN CONSUMER FINANCE CO LTD
- Filing Date
- 2026-02-06
- Publication Date
- 2026-06-19
AI Technical Summary
Existing customer credit risk assessment technologies struggle to capture the dynamic characteristics of customer cash flows, particularly failing to identify patterns in cash behavior across different stages and time windows. This results in low accuracy in risk prediction and identification, making it difficult to effectively manage risks proactively, especially during periods of economic volatility.
By using a decision tree model, an observation timeline is constructed, periodic segments are divided, local behavioral features are extracted, risk labels are generated, and joint coding is performed through hierarchical decision trees to establish an overdue probability lookup table, thereby enabling dynamic monitoring and risk prediction of customer cash flow.
It achieves a panoramic view of customer financial behavior, improves the granularity and interpretability of risk assessment, can identify potential risk evolution trajectories, and enhances the accuracy and efficiency of risk prediction.
Figure CN122243624A_ABST
Description
Technical Field
[0001] This invention relates to the fields of financial technology and credit risk management, and specifically to a method for predicting customer delinquency probability based on a decision tree model.
Background Technology
[0002] With the rapid development of the consumer finance industry, the number of customers and transaction volume have continued to expand, making credit risks increasingly prominent for financial institutions. Especially in unsecured and unguaranteed consumer credit products, customers' willingness and ability to repay have become core risk factors. Existing customer credit risk assessment technologies mainly rely on static features and traditional scoring card methods, using rule sets or regression and machine learning models to classify customers by risk and allocate credit limits. However, in the current market environment, the diversification of customer fund flow behavior and the complexity of transaction structures make it difficult for static features and single historical behaviors to fully reflect a customer's future delinquency risk.
[0003] Existing technologies typically predict risk by batch statistical analysis of historical data, constructing static feature libraries, applying standard scoring cards, or using classification models based on traditional machine learning. While these technologies have some practical effectiveness in customer segmentation and risk warning, they have significant limitations. First, traditional methods struggle to capture the dynamic characteristics of customers' daily cash flows, particularly failing to distinguish the patterns of inflows and outflows at different stages and time windows. For example, some customers may briefly increase their account balances before the repayment date, but their actual cash flow capacity is weak, leading to a higher likelihood of subsequent delinquency. Such short-term behavioral changes are difficult to capture effectively by static features or fixed models. In practice, existing solutions often ignore the temporal characteristics of customer cash inflows and outflows, such as continuity, periodicity, and flow structure, and lack a segmented identification mechanism for key risk points such as active periods, abnormal cash fluctuations, and critical repayment behaviors. Due to these shortcomings, financial institutions often face problems of delayed decision-making and low identification accuracy in risk prediction, credit limit control, and customer segmentation. This is especially true in market environments with economic downturns or frequent fluctuations in customer income, making it even more difficult to effectively manage delinquency risks in advance.
[0004] Therefore, this application proposes a customer delinquency probability prediction method based on a decision tree model. The daily net capital change sequence derived from customer transaction logs within a preset observation window is divided into period segments, and local behavioral features are extracted within each period segment. Thresholds are selected on the principle of maximum information gain to generate a risk label for each period segment. The risk labels are then concatenated in chronological order to form a risk path, from which risk trend indicators are calculated. Finally, the risk path and trend indicators are jointly encoded using a hierarchical decision tree to obtain highly discriminative leaf node numbers, and a delinquency probability lookup table keyed by leaf node number is established to output the prediction results.
Summary of the Invention
[0005] This invention provides a method for predicting customer delinquency probability based on a decision tree model, which helps to solve the problems mentioned in the background art.
[0006] This invention provides the following technical solution: a method for predicting customer delinquency probability based on a decision tree model, comprising:
[0007] Obtain customers' historical transaction logs within the observation window, establish an observation timeline, summarize incoming and outgoing transactions by observation day to generate a daily net cash change sequence for each customer, and set overdue flags based on whether an overdue event has occurred.
[0008] Set a period window, divide the time axis into continuous period segments according to the window, and extract a local sequence of capital flow from the customer's daily net capital change sequence for each period segment. Period segments in which all values are zero are marked as silent period segments, and period segments containing at least one non-zero value are marked as active period segments.
[0009] In each active period, observation days with non-zero net capital changes are selected and recorded as the non-zero day set. The average daily net capital change, volatility, net inflow ratio and maximum single-day outflow amount are calculated on the non-zero day set and combined into a local behavioral feature vector.
[0010] In each period, a candidate set of net inflow ratio threshold and a candidate set of maximum single-day outflow amount threshold are constructed based on local behavioral feature vectors. The active customer set is divided according to the candidate threshold combination and the information gain is calculated. The threshold combination with the largest information gain is selected, and the customer risk label is generated according to the final threshold combination.
[0011] At the customer level, risk labels for different time periods are linked together in chronological order to form a time-sequential risk path.
[0012] Based on the time-series risk path, the number of risk jumps, the length of tail-end continuous risk, and the full-cycle risk ratio are calculated to form a risk path trend indicator vector.
[0013] Construct a hierarchical decision tree, setting risk judgment nodes in the intra-cycle layer and trend judgment nodes in the cross-cycle layer, and jointly encode the risk path and trend indicators to obtain the leaf node number.
[0014] Establish a lookup table relationship between leaf nodes and overdue probabilities. For new customers, generate leaf node numbers, retrieve overdue probabilities, and output the results.
[0015] Optionally, the step of obtaining customers' historical transaction logs within the observation window, establishing an observation timeline, summarizing inflows and outflows by observation day to generate a daily net cash change sequence for each customer, and setting overdue flags based on whether an overdue event has occurred specifically includes:
[0016] Obtain the set of customers participating in the modeling, assign a unique customer number to each customer in the set, and form a list of customers participating in the modeling;
[0017] For each customer, the customer's historical transaction logs are read from the business system in chronological order within a preset observation window. Each transaction log contains at least the natural date of the transaction and the transaction amount. A set of the customer's historical transaction logs is formed when a transaction record exists, and the set of the customer's historical transaction logs is set to an empty set when no transaction record exists.
[0018] Based on the modeling requirements, the start and end dates of the observation window are given. A unified observation timeline is established by arranging the natural days consecutively between the start and end dates. Each natural day on the timeline is marked as an observation day, forming an observation day sequence.
[0019] For each customer, each observation day is traversed one by one on a unified observation timeline, and all transaction logs of the customer on the current observation day are retrieved; when there are multiple transactions, the transaction amounts are summarized according to the rule of recording income as positive and expenses as negative, so as to obtain the customer's daily net change in funds on the current observation day; when there are no transaction logs on the current observation day, the customer's daily net change in funds on the current observation day is set to zero, thereby forming a sequence of daily net changes in funds for customers covering all observation days;
[0020] For each customer, check whether an overdue event has occurred within a preset observation window; if an overdue event exists, set the customer's overdue flag to overdue; if no overdue event occurs, set the overdue flag to not overdue, and establish a correspondence between the overdue flag and the customer's daily net cash change sequence.
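The daily aggregation described in paragraphs [0018] and [0019] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `build_daily_net_sequence` and the representation of a transaction log as a `(date, signed_amount)` pair are assumptions introduced here.

```python
from datetime import date

def build_daily_net_sequence(transactions, start, end):
    """Sum signed amounts per observation day (income positive, expenses negative).

    `transactions` is an iterable of (date, signed_amount) pairs; this record
    layout is a hypothetical one chosen for the example.
    Days without transactions keep a net change of zero.
    """
    n_days = (end - start).days + 1          # observation window is inclusive
    net = [0.0] * n_days
    for day, amount in transactions:
        if start <= day <= end:              # ignore logs outside the window
            net[(day - start).days] += amount
    return net

logs = [
    (date(2025, 1, 1), 500.0),   # income
    (date(2025, 1, 1), -200.0),  # expense on the same day
    (date(2025, 1, 3), -50.0),
]
seq = build_daily_net_sequence(logs, date(2025, 1, 1), date(2025, 1, 4))
print(seq)  # [300.0, 0.0, -50.0, 0.0]
```

A customer with no transaction records simply yields an all-zero sequence, which matches the empty-set case in paragraph [0017].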
[0021] Optionally, the step of setting a period window, dividing the time axis into continuous period segments according to the window, extracting local cash flow sequences from the customer's daily net cash change sequence for each period segment, marking all-zero period segments as silent period segments, and marking period segments containing non-zero values as active period segments specifically includes:
[0022] Set the periodic window length, which is in units of the number of observation days;
[0023] On a unified observation timeline, starting from the first observation day of the observation window, the observation days are divided sequentially from front to back according to the length of the periodic window. The observation days corresponding to each complete periodic window length are divided into a periodic segment. For the remaining observation days that are less than the length of a complete periodic window, the remaining observation days are separately assigned to the last periodic segment, resulting in several consecutive periodic segments.
[0024] For each customer, within each period, extract the daily net cash flow corresponding to all observation days covered by the current period from the customer's daily net cash flow sequence to form a local sequence of the customer's cash flow within the current period.
[0025] For each customer, examine all observation days in the local sequence of fund flows within each period: when the net daily fund change is zero for all observation days, mark the current period as a silent period; when the net daily fund change is not zero for at least one observation day, mark the current period as an active period, and record the silent or active marks to form a silent mark sequence.
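A small sketch of the segmentation and silent/active marking in paragraphs [0022] to [0025] follows. It assumes that the leftover observation days at the end of the window form their own shorter final segment, which is one reading of paragraph [0023]; the function names are hypothetical.

```python
def split_into_segments(daily_net, window):
    """Cut the daily sequence into consecutive period segments of `window` days.

    Assumption: leftover days shorter than a full window become a final,
    shorter segment of their own.
    """
    if window <= 0:
        raise ValueError("window length must be a positive number of days")
    return [daily_net[i:i + window] for i in range(0, len(daily_net), window)]

def mark_active(segments):
    """Return True for active segments (any non-zero day), False for silent ones."""
    return [any(value != 0 for value in segment) for segment in segments]

segments = split_into_segments([300.0, 0.0, 0.0, 0.0, -50.0], window=2)
active_flags = mark_active(segments)
print(segments)      # [[300.0, 0.0], [0.0, 0.0], [-50.0]]
print(active_flags)  # [True, False, True]
```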
[0026] Optionally, in each active period, observation days with non-zero net fund changes are selected and recorded as a non-zero day set. The average daily net fund change, volatility, net inflow ratio, and maximum single-day outflow amount are calculated on the non-zero day set and combined into a local behavioral feature vector, specifically including:
[0027] For each customer, within each active period, observation days with non-zero net cash changes are selected from the customer's local cash flow sequence to form a non-zero day set, and the number of observation days in the non-zero day set is obtained; when the period is a silent period, the non-zero day set is regarded as an empty set.
[0028] When the non-zero day set is not empty, the arithmetic mean of the daily net change in funds for each observation day in the non-zero day set is calculated to obtain the average daily net change in funds for the customer in the current period; when the non-zero day set is empty, the average daily net change in funds for the current period is set to zero.
[0029] When the non-zero day set is not empty, the average daily net change of funds for the non-zero day set is subtracted from the daily net change of funds for each observation day in the non-zero day set to obtain the deviation value for each observation day. The arithmetic mean of the squares of each deviation value is then calculated to obtain the volatility of the customer's daily net change of funds in the current period. When the non-zero day set is empty, the volatility of the daily net change of funds in the current period is set to zero.
[0030] When the non-zero day set is not empty, the portion of the daily net capital change that is greater than zero for all observed days in the non-zero day set is summed to obtain the total net inflow for the current period; the absolute value of the daily net capital change for all observed days in the non-zero day set is summed to obtain the total capital flow for the current period; when the total capital flow is greater than zero, the ratio of the total net inflow to the total capital flow is used as the net inflow ratio for the current period; when the total capital flow is equal to zero, the net inflow ratio is set to zero.
[0031] When the non-zero day set is not empty, observation days with daily net cash changes less than zero are selected from the non-zero day set, and the largest absolute value among these daily net cash changes is taken as the customer's maximum single-day outflow amount in the current period; when there are no observation days with daily net cash changes less than zero, the maximum single-day outflow amount in the current period is set to zero.
[0032] For each period of each customer, when the period is an active period, the average daily net change in funds, the volatility of daily net change in funds, the net inflow ratio, and the maximum single-day outflow amount of the current period are combined into a four-dimensional local behavioral feature vector in a preset order; when the period is a silent period, all four components are set to zero and combined into a local behavioral feature vector.
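The four features defined in paragraphs [0027] to [0032] can be computed as in the sketch below. The function name `local_features` and the tuple ordering (mean, volatility, net inflow ratio, maximum single-day outflow) are assumptions for illustration; volatility follows the text as the mean of squared deviations over the non-zero days.

```python
def local_features(segment):
    """Return (mean, volatility, net_inflow_ratio, max_single_day_outflow)."""
    non_zero = [v for v in segment if v != 0]       # the non-zero day set
    if not non_zero:                                # silent period: all zeros
        return (0.0, 0.0, 0.0, 0.0)
    mean = sum(non_zero) / len(non_zero)
    # Mean of squared deviations, as described in paragraph [0029].
    volatility = sum((v - mean) ** 2 for v in non_zero) / len(non_zero)
    total_inflow = sum(v for v in non_zero if v > 0)
    total_flow = sum(abs(v) for v in non_zero)
    inflow_ratio = total_inflow / total_flow if total_flow > 0 else 0.0
    # Largest absolute value among negative days; zero if there are none.
    max_outflow = max((-v for v in non_zero if v < 0), default=0.0)
    return (mean, volatility, inflow_ratio, max_outflow)

features = local_features([300.0, 0.0, -100.0, -200.0])
print(features[2], features[3])  # 0.5 200.0
```

In this example the non-zero days are 300, -100 and -200, so inflows make up 300 of the 600 total flow and the largest single-day outflow is 200.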
[0033] Optionally, in each period, a candidate set of net inflow ratio threshold and a candidate set of maximum single-day outflow amount threshold are constructed based on local behavioral feature vectors. The active customer set is divided according to the candidate threshold combinations, and information gain is calculated. The threshold combination with the largest information gain is selected, and a customer risk label is generated based on the final threshold combination. Specifically, this includes:
[0034] Within each period segment, customers marked as active period segments are selected from all customers to form the training sample customer set for the current period segment. The training sample customer set consists of customers from active period segments. If there are no active period segment customers in the current period segment, the training sample customer set is set to an empty set.
[0035] When the training sample customer set is not empty, the net inflow ratio is extracted from the local behavioral feature vector of each customer in the training sample customer set. After removing duplicate values, a candidate set of net inflow ratio thresholds for the current period is constructed. At the same time, the maximum single-day outflow amount of each customer is extracted. After removing duplicate values, a candidate set of maximum single-day outflow amount thresholds for the current period is constructed.
[0036] When the training sample customer set is not empty, for each candidate threshold combination drawn from the candidate set of net inflow ratio thresholds and the candidate set of maximum single-day outflow amount thresholds, perform the following operations: for each customer in the training sample customer set, compare the net inflow ratio with the candidate net inflow ratio threshold, and the maximum single-day outflow amount with the candidate maximum single-day outflow amount threshold; when the net inflow ratio is lower than the candidate net inflow ratio threshold and the maximum single-day outflow amount is higher than the candidate maximum single-day outflow amount threshold, classify the current customer into the risk-side subset; when the net inflow ratio is not lower than the candidate net inflow ratio threshold or the maximum single-day outflow amount is not higher than the candidate maximum single-day outflow amount threshold, classify the current customer into the non-risk-side subset.
[0037] For the risk-side subset, non-risk-side subset, and training sample customer set obtained based on candidate threshold combinations, the number of customers marked as "overdue" and the total number of customers in each subset are counted. When the total number of customers in the subset is greater than zero, the ratio of the number of overdue customers to the total number of customers is taken as the overdue ratio of the corresponding subset. When the total number of customers in the subset is equal to zero, the overdue ratio is set to zero. According to the overdue ratio of each subset, the entropy values of the risk-side subset, non-risk-side subset, and training sample customer set are obtained according to the information entropy measurement rules. The corresponding weights are set according to the proportion of the number of customers in the risk-side subset and non-risk-side subset to the total number of customers in the training sample customer set. The information gain of the current candidate threshold combination is calculated based on the weighted difference between the overall entropy value and the subset entropy value.
[0038] Within each period, the combination of net inflow ratio threshold and maximum single-day outflow amount threshold with the largest information gain is selected from all candidate threshold combinations as the final threshold combination for the current period. When multiple threshold combinations have the same information gain, the combination with the smaller values of the net inflow ratio threshold and the maximum single-day outflow amount threshold is preferred. When the training sample customer set is empty, the net inflow ratio threshold and the maximum single-day outflow amount threshold for the current period are set to zero, and the information gain is set to zero.
[0039] After obtaining the final threshold combination for each period, a threshold comparison is performed on the local behavioral feature vectors of all customers in the current period: when a customer's net inflow ratio is lower than the net inflow ratio threshold of the current period and the maximum single-day outflow amount is higher than the maximum single-day outflow amount threshold of the current period, the customer's risk label in the current period is set to risk status; when the net inflow ratio is not lower than the net inflow ratio threshold of the current period or the maximum single-day outflow amount is not higher than the maximum single-day outflow amount threshold of the current period, the customer's risk label in the current period is set to non-risk status, forming a risk label record for each customer in each period, which serves as the risk judgment result for the period.
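The threshold search in paragraphs [0034] to [0039] amounts to an exhaustive scan of candidate pairs scored by information gain. The sketch below is one possible reading, not the applicant's code: it uses binary Shannon entropy (base 2) as the "information entropy measurement rule", and it resolves ties by preferring the smaller net inflow ratio threshold first and the smaller outflow threshold second, which is an assumption because the text does not say how the two values are ordered. All names are hypothetical.

```python
import math
from itertools import product

def entropy(p):
    """Binary Shannon entropy of an overdue ratio p (zero when p is 0 or 1)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def overdue_ratio(flags):
    return sum(flags) / len(flags) if flags else 0.0

def is_risk(ratio, outflow, r_thr, o_thr):
    """Risk side: low net inflow ratio AND high maximum single-day outflow."""
    return ratio < r_thr and outflow > o_thr

def select_thresholds(samples):
    """samples: (net_inflow_ratio, max_outflow, overdue_flag) per active customer."""
    if not samples:                       # empty training set, per [0038]
        return (0.0, 0.0, 0.0)
    ratios = sorted({s[0] for s in samples})      # de-duplicated candidates
    outflows = sorted({s[1] for s in samples})
    base = entropy(overdue_ratio([s[2] for s in samples]))
    best = None
    for r_thr, o_thr in product(ratios, outflows):
        risk = [s[2] for s in samples if is_risk(s[0], s[1], r_thr, o_thr)]
        safe = [s[2] for s in samples if not is_risk(s[0], s[1], r_thr, o_thr)]
        weighted = (len(risk) * entropy(overdue_ratio(risk))
                    + len(safe) * entropy(overdue_ratio(safe))) / len(samples)
        gain = base - weighted
        # Strict improvement only, so the earliest (smallest) pair wins ties.
        if best is None or gain > best[2] + 1e-12:
            best = (r_thr, o_thr, gain)
    return best

samples = [(0.2, 900.0, 1), (0.3, 800.0, 1), (0.7, 100.0, 0), (0.8, 50.0, 0)]
r_thr, o_thr, gain = select_thresholds(samples)
labels = [1 if is_risk(r, o, r_thr, o_thr) else 0 for r, o, _ in samples]
print(r_thr, o_thr, labels)  # 0.7 50.0 [1, 1, 0, 0]
```

The scan is quadratic in the number of distinct feature values per period, so a production system would likely need to prune or bin the candidate sets; that refinement is outside what the text describes.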
[0040] Optionally, the step of linking the risk labels of the period segments in chronological order at the customer level to form a time-sequential risk path specifically includes:
[0041] For each customer, risk labels are obtained sequentially within each period segment according to the time sequence of the period segments on the unified observation timeline.
[0042] The risk labels for each period are connected in chronological order according to a unified observation timeline to form a discrete risk label sequence from the first period to the last period, which serves as the client's time-series risk path.
[0043] Optionally, the step of calculating the number of risk jumps, the length of tail-end continuous risk, and the full-cycle risk ratio based on the time-series risk path to form the risk path trend indicator vector specifically includes:
[0044] For each customer, in the time-series risk path, starting from the second period segment, the risk label of the current period segment is compared with the risk label of the previous period segment; when the previous period segment is in a non-risk state and the current period segment is in a risk state, the risk jump count is incremented by one; after traversing the entire risk path, the number of risk jumps is obtained.
[0045] For each customer, risk labels are checked sequentially from the last period of the time-sequential risk path. When consecutive risk states are encountered, the length of the consecutive risk at the tail is accumulated. The accumulation stops when the first non-risk state is encountered or the starting position of the risk path is reached, thus obtaining the length of the consecutive risk at the tail.
[0046] For each customer, the number of periods in which risk states occur is counted in the time-series risk path, and the ratio of the number of periods in which risk states occur to the total number of periods covered by the risk path is taken as the full-cycle risk ratio.
[0047] For each customer, the number of risk jumps, the length of tail-end continuous risk, and the full-cycle risk ratio are combined into a risk path trend indicator vector in a preset order. This risk path trend indicator vector is then used as the input feature vector that characterizes the cross-cycle risk evolution trend in the subsequent hierarchical decision tree structure.
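The three trend indicators in paragraphs [0044] to [0046] can be computed directly from a risk path, as sketched here. The encoding of labels as 1 (risk state) and 0 (non-risk state) and the function name `trend_indicators` are assumptions made for the example.

```python
def trend_indicators(path):
    """Return (risk_jumps, tail_length, full_cycle_risk_ratio) for a 0/1 risk path."""
    # A jump is a transition from non-risk (0) to risk (1) between adjacent periods.
    jumps = sum(1 for prev, cur in zip(path, path[1:]) if prev == 0 and cur == 1)
    # Count consecutive risk states backwards from the last period.
    tail = 0
    for label in reversed(path):
        if label != 1:
            break
        tail += 1
    ratio = sum(path) / len(path) if path else 0.0
    return jumps, tail, ratio

indicators = trend_indicators([0, 1, 0, 1, 1])
print(indicators)  # (2, 2, 0.6)
```

Here the path enters the risk state twice, ends with two consecutive risk periods, and spends three of its five periods in the risk state.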
[0048] Optionally, the construction of a hierarchical decision tree, which includes setting risk assessment nodes in intra-cycle layers and trend assessment nodes in cross-cycle layers, and jointly encoding risk paths and trend indicators to obtain leaf node numbers, specifically includes:
[0049] In the periodic decision layer of the hierarchical decision tree, a periodic risk judgment node is set for each period segment. The local behavioral feature vector of the customer in the current period segment is input into the corresponding node. The local behavioral feature vector is compared with the threshold of the net inflow ratio of the period segment and the threshold of the maximum single-day outflow amount, and the risk label of the customer in the current period segment is output.
[0050] For each customer, the path code integer is constructed starting from the first period segment by taking the risk label value in the risk path in chronological order: the path code is set to zero at the beginning, each period segment is processed in turn, the existing path code is multiplied by two when processing the current period segment, and then the risk label value of the current period segment is added to the result. After processing the last period segment, the customer's path code integer is obtained.
[0051] For each customer, obtain the number of risk jumps, the length of tail-end continuous risk, the full-cycle risk ratio, and the total number of period segments, where the total number of period segments is the number of period segments covered by the time-series risk path. Then perform the following operations:
[0052] S701. Round down the product of the full-cycle risk ratio and the total number of cycle segments to obtain the discrete risk ratio level.
[0053] S702. Multiply the number of risk jumps by the square of the sum of the total number of period segments and one to obtain the first part of the trend code.
[0054] S703. Multiply the length of the continuous risk at the tail end by one plus the total number of period segments to obtain the second part of the trend code.
[0055] S704. Sum the first part of the trend code, the second part of the trend code, and the risk ratio level to obtain the customer's trend code integer.
[0056] For each customer, calculate the total number of binary path codes covering all risk label values for all period segments based on the total number of period segments. Multiply the trend code integer by the total number of binary path codes to obtain the trend offset. Then add the trend offset to the path code integer to obtain the leaf node number of the customer in the hierarchical decision tree, so that the leaf node numbers corresponding to different combinations of trend codes and path codes are not repeated.
[0057] During the decision tree training process, the training sample customers are attached to the corresponding leaf nodes using leaf node numbers; the intra-cycle risk assessment nodes of the intra-cycle decision layer and the trend assessment nodes of the cross-cycle decision layer together constitute a hierarchical decision tree structure.
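The joint encoding in paragraphs [0050] to [0056] can be sketched as below. Two points are assumptions rather than statements from the text: step S702 is read as multiplying by (T + 1) squared, where T is the total number of period segments, because that makes the trend code a positional number in base T + 1; and labels are encoded as 1 for risk and 0 for non-risk. The function names are hypothetical.

```python
import math
from fractions import Fraction

def path_code(path):
    """Binary path code: start at zero, double, then add each label in time order."""
    code = 0
    for label in path:
        code = code * 2 + label
    return code

def leaf_number(path):
    """Combine trend code and path code into a single leaf node number."""
    total = len(path)
    if total == 0:
        raise ValueError("risk path must cover at least one period segment")
    jumps = sum(1 for prev, cur in zip(path, path[1:]) if prev == 0 and cur == 1)
    tail = 0
    for label in reversed(path):
        if label != 1:
            break
        tail += 1
    ratio = Fraction(sum(path), total)        # exact ratio avoids float rounding
    level = math.floor(ratio * total)         # S701: discrete risk ratio level
    part1 = jumps * (total + 1) ** 2          # S702 (assumed reading, see lead-in)
    part2 = tail * (total + 1)                # S703
    trend = part1 + part2 + level             # S704
    n_path_codes = 2 ** total                 # number of possible binary path codes
    return trend * n_path_codes + path_code(path)

leaf = leaf_number([0, 1, 0, 1, 1])
print(path_code([0, 1, 0, 1, 1]), leaf)  # 11 2795
```

For this five-period path the trend code is 2 x 36 + 2 x 6 + 3 = 87, the trend offset is 87 x 32 = 2784, and adding the path code 11 gives leaf node number 2795. Because the path code is always smaller than 2 to the power T, distinct (trend code, path code) pairs cannot collide.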
[0058] Optionally, the step of establishing a leaf node-overdue probability lookup table relationship, generating a leaf node number for a new customer, retrieving the overdue probability, and then outputting it specifically includes:
[0059] During the training phase, for each possible leaf node number, collect all customer samples whose leaf node numbers are equal to the current number from the training dataset to form the customer set corresponding to the leaf node;
[0060] For each leaf node customer set, count the total number of customers and the number of customers marked as overdue within the set; when the total number of customers is greater than zero, use the ratio of the number of overdue customers to the total number of customers as the overdue probability of the corresponding leaf node; when the total number of customers is equal to zero, set the overdue probability to zero, and form a lookup table relationship between leaf node number and overdue probability.
[0061] During the forecasting phase, the following steps are performed for the clients to be forecasted: data acquisition, period segmentation and generation of local sequences of capital flows, extraction of local behavioral feature vectors, selection of threshold combinations and generation of risk labels for period segments, concatenation of risk labels in chronological order, calculation of risk path trend indicators, and hierarchical decision tree coding to generate leaf node numbers. The overdue probability that matches the leaf node number is then searched in the lookup table relationship between the leaf node number and the overdue probability. The found overdue probability is used as the overdue probability forecast result for the clients to be forecasted.
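The lookup table in paragraphs [0059] to [0061] is an empirical overdue rate per leaf node number. The following sketch assumes overdue flags are encoded as 1 (overdue) and 0 (not overdue); the names `build_lookup` and `predict` are hypothetical.

```python
from collections import defaultdict

def build_lookup(leaf_numbers, overdue_flags):
    """Map each observed leaf node number to its share of overdue customers."""
    totals = defaultdict(int)
    overdue = defaultdict(int)
    for leaf, flag in zip(leaf_numbers, overdue_flags):
        totals[leaf] += 1
        overdue[leaf] += flag
    return {leaf: overdue[leaf] / totals[leaf] for leaf in totals}

def predict(table, leaf):
    """Leaf numbers with no training customers get probability zero, per [0060]."""
    return table.get(leaf, 0.0)

table = build_lookup([2795, 2795, 11, 11, 11, 11], [1, 0, 0, 0, 1, 0])
print(predict(table, 2795), predict(table, 11), predict(table, 42))  # 0.5 0.25 0.0
```

Storing only the observed leaf numbers in a dictionary keeps the table small even though the space of possible leaf numbers grows exponentially with the number of period segments.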
[0062] The present invention has the following beneficial effects:
[0063] 1. Unlike traditional static features based on aging or billing cycles, this solution directly constructs a continuous observation timeline covering the entire window, ensuring that days without transactions are explicitly marked as unchanged, thus achieving a panoramic capture of fund behavior. It fully preserves daily fund inflow and outflow fluctuations, avoiding the loss of sample information; and it maps overdue events one-to-one with daily sequences, forming directly correlated monitoring signals.
[0064] 2. The solution sets a fixed-length period window, dividing the continuous observation days into several period segments. For each customer, daily changes are captured within each period segment to generate a local sequence of fund flows. Silent and active period segments are distinguished based on whether the data is all zero or contains non-zero days. Decomposing the long-term continuous sequence into fixed period windows captures the cyclical fluctuations in customer fund behavior. The division between silent and active periods introduces a behavior sparsity determination based on zero-value statistics, which can both filter out invalid period segments and highlight periods of active behavior.
[0065] 3. Within each active period, the scheme selects non-zero observation days and records them as the non-zero day set. From these, it calculates four indicators (average daily net change, volatility, net inflow ratio, and maximum single-day outflow amount) to form a local behavioral feature vector. This decomposes capital flow behavior into central tendency, volatility characteristics, degree of capital tightness (net inflow ratio), and potential risk exposure (maximum single-day outflow amount), forming a multi-dimensional representation. Compared to existing practices that use only the average balance or the maximum overdraft limit, this scheme characterizes local capital dynamics from multiple perspectives, making risk assessment more granular and interpretable; it also enhances the ability to capture complex patterns of capital inflows and outflows.
[0066] 4. The solution does not use fixed empirical thresholds. Instead, it constructs candidate sets for net inflow ratio thresholds and maximum outflow thresholds in the training samples for each period. The optimal threshold pair is selected using information gain measurement to classify customers into risky and non-risky categories, and then labels them accordingly. This dynamic threshold selection using information gain adapts the judgment criteria to the current sample distribution while balancing sample purity and splitting effect; it avoids the drawbacks of manual parameter tuning and the susceptibility of empirical thresholds to failure.
[0067] 5. This solution concatenates risk labels from each period in chronological order at the customer level to obtain discrete time-series risk paths. Based on this, it calculates three trend indicators: the number of risk jumps, the length of tail-end continuous risk, and the full-cycle risk ratio, forming a risk path trend vector. Treating the multi-period label sequence as a risk evolution path captures the dynamic evolution trajectory of risk. The number of risk jumps and the length of tail-end continuous risk reflect, respectively, sharp increases in risk and the persistence of risk at the end of the observation window, which helps identify potential overlapping risks.
[0068] 6. The scheme constructs a two-layer decision tree structure: the intra-cycle layer uses local decision nodes to process single-cycle risk labels, and the cross-cycle layer uses trend decision nodes to process risk path trend vectors. Finally, the two are jointly encoded to generate a unique leaf node number. Layering intra-cycle and cross-cycle decisions allows the model to focus on both short-term risk assessment and long-term trends. The joint encoding merges binary path encoding and trend encoding, ensuring that leaf node numbers are non-conflicting and have a reversible mapping. This effectively achieves a unified structured representation of risk path and trend information, facilitating subsequent table lookups and model expansion.
[0069] 7. During the training phase, the solution counts the total number of customers and the number of overdue customers corresponding to each leaf node number, forming a lookup table relationship. During the prediction phase, it only needs to generate the leaf node number and look it up in the table to output the overdue probability. The lookup table reduces complex model inference to a number mapping, eliminating the need to recalculate features and decisions and thus improving online prediction efficiency. Compared with traditional methods that require online traversal of decision trees or deep model inference, the lookup table method decouples offline training from online table lookup, reducing system complexity and supporting real-time performance.
Attached Figure Description
[0070] Figure 1 is a schematic flowchart of the method of the present invention.
Detailed Implementation
[0071] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0072] Embodiment: referring to Figure 1, a method for predicting customer delinquency probability based on a decision tree model comprises:
[0073] Obtain each customer's historical transaction logs within the observation window, establish an observation timeline, aggregate incoming and outgoing transactions by observation day to generate the customer's daily net capital change sequence, and set an overdue flag according to whether an overdue event has occurred.
[0074] Set a period window, divide the time axis into consecutive period segments according to the window, and extract the local capital flow sequence of each period segment from the customer's daily net capital change sequence. A period segment in which every value is zero is marked as a silent period segment, and a period segment in which a non-zero value appears is marked as an active period segment.
[0075] In each active period, observation days with non-zero net capital changes are selected and recorded as the non-zero day set. The average daily net capital change, volatility, net inflow ratio and maximum single-day outflow amount are calculated on the non-zero day set and combined into a local behavioral feature vector.
[0076] In each period, a candidate set of net inflow ratio threshold and a candidate set of maximum single-day outflow amount threshold are constructed based on local behavioral feature vectors. The active customer set is divided according to the candidate threshold combination and the information gain is calculated. The threshold combination with the largest information gain is selected, and the customer risk label is generated according to the final threshold combination.
[0077] At the customer level, risk labels for different time periods are linked together in chronological order to form a time-sequential risk path.
[0078] Based on the time-sequential risk path, the number of risk jumps, the tail continuous risk length, and the full-cycle risk ratio are calculated to form a risk path trend indicator vector.
[0079] Construct a hierarchical decision tree, setting risk judgment nodes in the intra-cycle layer and trend judgment nodes in the cross-cycle layer, and jointly encode the risk path and trend indicators to obtain the leaf node number.
[0080] Establish a lookup table relationship between leaf nodes and overdue probabilities. For new customers, generate leaf node numbers, retrieve overdue probabilities, and output the results.
[0081] First, by aggregating daily inflows and outflows into a daily net capital change sequence, the method avoids the loss of day-to-day fluctuation information caused by coarse-grained processing based on monthly statements or aging windows, and presents a complete picture of daily capital activity. Second, by setting a fixed period window and dividing the time axis into silent and active period segments, it no longer relies on prior experience or manual division of time intervals; it distinguishes sparse from active trading periods according to actual customer behavior, filters out uninformative silent periods, and focuses on the periods that carry risk signals. Third, within each active period segment it extracts four local behavioral features: average net change, volatility, net inflow ratio, and maximum outflow amount. These cover the scale of funds, the intensity of fluctuation, and large expenditures, so that several indicators jointly characterize micro-level capital flow and improve the sensitivity and accuracy of risk signals. Fourth, it selects thresholds for each period segment dynamically by information gain, so that the risk criteria are learned from the data and avoid the weakness of the fixed thresholds used in existing technologies, which adapt poorly to customer heterogeneity and to changes between stages. Fifth, it concatenates the risk labels of each period segment into a time series and defines three trend indicators (the number of risk jumps, the tail continuous risk length, and the full-cycle risk ratio), which deepens the characterization of the risk evolution path. Finally, it encodes short-term risk and cross-period trends through a hierarchical decision tree, generates leaf node numbers, and outputs overdue probabilities by table lookup, which simplifies online prediction.
[0082] Obtaining customers' historical transaction logs within the observation window, establishing an observation timeline, aggregating incoming and outgoing transactions by observation day to generate each customer's daily net capital change sequence, and setting overdue flags according to whether an overdue event has occurred specifically includes:
[0083] Obtain the set of customers participating in the modeling, assign a unique customer number to each customer in the set, and form the list of modeling customers. For each customer, read the customer's historical transaction logs from the business system in chronological order within a preset observation window, each transaction log containing at least the calendar date of the transaction and the transaction amount; if transaction records exist, they form the customer's historical transaction log set, and if none exist, the customer's historical transaction log set is the empty set. According to the modeling requirements, specify the start and end dates of the observation window, establish a unified observation timeline by arranging the calendar days between the start and end dates consecutively, and mark each calendar day on the timeline as an observation day, forming an observation day sequence. For each customer, traverse the observation days on the unified observation timeline one by one and retrieve all of the customer's transaction logs on the current observation day. When transactions exist, sum the transaction amounts, recording income as positive and expenses as negative, to obtain the customer's daily net capital change on the current observation day; when there are no transaction logs on the current observation day, set the daily net capital change to zero. This yields a daily net capital change sequence covering all observation days. For each customer, check whether an overdue event has occurred within the preset observation window: if so, set the customer's overdue mark to overdue; if not, set it to non-overdue. A correspondence is then established between the overdue mark and the customer's daily net capital change sequence.
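The daily aggregation described above can be sketched as follows. This is a minimal illustration only: the function name `daily_net_changes` and the representation of a transaction log as `(date, signed_amount)` pairs are assumptions made for the example and do not come from the specification.

```python
from collections import defaultdict
from datetime import date, timedelta

def daily_net_changes(transactions, start, end):
    """Return one net amount per calendar day in [start, end].

    transactions: iterable of (day, amount), income positive and expenses negative.
    Days without any transaction get 0.0.
    """
    totals = defaultdict(float)
    for day, amount in transactions:
        if start <= day <= end:          # ignore records outside the observation window
            totals[day] += amount
    n_days = (end - start).days + 1
    return [totals.get(start + timedelta(days=j), 0.0) for j in range(n_days)]

# Hypothetical log: two transactions on day 1, none on day 2, one on day 3, none on day 4.
log = [(date(2025, 1, 1), 100.0), (date(2025, 1, 1), -30.0), (date(2025, 1, 3), -50.0)]
series = daily_net_changes(log, date(2025, 1, 1), date(2025, 1, 4))
print(series)  # [70.0, 0.0, -50.0, 0.0]
```

A customer with an empty log simply yields an all-zero sequence, which matches the rule that days without transactions are set to zero.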
[0084] Further specific implementation steps include:
[0085] Construct the set of customers participating in the modeling as $U = \{1, 2, \dots, N\}$, where $i \in U$ is the customer number and $N$ is the total number of customers. Construct the set of historical transaction logs of each customer during the observation period as $L_i = \{(d_{i,k}, a_{i,k}) \mid k = 1, \dots, K_i\}$, where $L_i$ is the set of historical transaction logs of customer $i$; $d_{i,k}$ is the calendar day on which the $k$-th transaction of customer $i$ occurred; $k$ is the transaction number; $a_{i,k}$ is the amount of the $k$-th transaction of customer $i$; and $K_i$ is the total number of transactions of customer $i$. Establish a unified observation timeline $T = \{t_1, t_2, \dots, t_D\}$ with $t_1 = t_{\mathrm{start}}$ and $t_D = t_{\mathrm{end}}$, where $t_j$ is the $j$-th observation day on the unified observation timeline; $j$ is the observation day index; $D$ is the total number of observation days; $t_{\mathrm{start}}$ is the calendar day on which the observation window begins; and $t_{\mathrm{end}}$ is the calendar day on which the observation window ends. Calculate the daily net capital change $x_{i,j}$ of customer $i$ on observation day $t_j$ as $x_{i,j} = \sum_{k=1}^{K_i} a_{i,k}\,\delta(d_{i,k}, t_j)$, where $\delta(p, q)$ is the consistency matching function, which outputs 1 when $p = q$ and 0 when $p \neq q$, and $p$ and $q$ are its arguments. Set the overdue mark of customer $i$ as $y_i$: when customer $i$ has an overdue event within the preset observation window, $y_i = 1$; when customer $i$ has no overdue event within the preset observation window, $y_i = 0$.
[0086] Setting a period window, dividing the time axis into consecutive period segments according to the window, extracting the local capital flow sequence of each period segment from the customer's daily net capital change sequence, and marking all-zero segments as silent period segments and segments containing non-zero values as active period segments specifically includes:
[0087] A periodic window length is set, measured in observation days. On the unified observation timeline, starting from the first observation day of the observation window, the observation days are divided sequentially according to the periodic window length, and each run of consecutive observation days of one full window length forms a periodic segment. If fewer observation days than a full window length remain at the end, the remaining observation days form the last periodic segment on their own, resulting in several consecutive periodic segments. For each customer and each periodic segment, the daily net capital changes of all observation days covered by the segment are extracted to form the customer's local capital flow sequence within that segment. All observation days in the local sequence are then examined: when the daily net capital change is zero on every observation day, the periodic segment is marked as a silent periodic segment; when at least one observation day has a non-zero daily net capital change, it is marked as an active periodic segment. The silent or active markers are recorded to form a silent marker sequence.
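As an illustration of the segmentation and marking just described, the following sketch splits a daily series into fixed-length segments and flags each one. The helper names `split_into_segments` and `activity_flags` are hypothetical, and the series is assumed to already be the daily net capital change sequence of a single customer.

```python
def split_into_segments(series, window):
    """Cut the daily series into consecutive segments of `window` days.

    A shorter remainder at the end becomes the last segment on its own.
    """
    if window < 1:
        raise ValueError("window must be a positive integer")
    return [series[k:k + window] for k in range(0, len(series), window)]

def activity_flags(segments):
    """Return 1 for an active segment (some non-zero day) and 0 for a silent one."""
    return [1 if any(x != 0 for x in seg) else 0 for seg in segments]

series = [70.0, 0.0, -50.0, 0.0, 0.0, 0.0, 20.0]
segments = split_into_segments(series, 3)
print(segments)                  # [[70.0, 0.0, -50.0], [0.0, 0.0, 0.0], [20.0]]
print(activity_flags(segments))  # [1, 0, 1]
```

With seven observation days and a window of three, the single leftover day forms the third segment, so the number of segments is the ceiling of 7 / 3.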
[0088] Further specific implementation steps include:
[0089] Set the fixed periodic window length to $W$, satisfying $W \in \mathbb{Z}^{+}$, where $\mathbb{Z}^{+}$ is the set of positive integers. The whole timeline is divided into $M$ consecutive periodic segments, with $M = \lceil D / W \rceil$, where $D$ is the total number of observation days and $M$ is the number of periodic segments obtained after dividing the observation window by length $W$. The $m$-th periodic segment is denoted $P_m$, specifically:
[0090] $P_m = \{ t_j \mid (m-1)W + 1 \le j \le \min(mW, D) \}$, where $m$ is the periodic segment number, an integer with $1 \le m \le M$, and $t_j$ is the $j$-th observation day. Construct the capital flow sequence of each customer in the $m$-th periodic segment as $X_{i,m} = \{ x_{i,j} \mid t_j \in P_m \}$, where $X_{i,m}$ is the set of daily net changes of customer $i$ within segment $P_m$ and $x_{i,j}$ is the daily net capital change of customer $i$ on observation day $t_j$. If $x_{i,j} = 0$ for all $t_j$ within this segment, the segment is set as a silent period; otherwise it is set as an active period. Construct the silent flag as $s_{i,m} = 0$ if $x_{i,j} = 0$ for all $t_j \in P_m$, and $s_{i,m} = 1$ otherwise, where $s_{i,m}$ indicates whether customer $i$ exhibits non-zero behavior in segment $P_m$: a value of 0 indicates silence, and a value of 1 indicates activity.
[0091] In each active period, observation days with non-zero net capital changes are selected and recorded as a non-zero day set. The average daily net capital change, volatility, net inflow ratio, and maximum single-day outflow amount are calculated on the non-zero day set and combined to form a local behavioral feature vector, specifically including:
[0092] For each customer, within each active period segment, the observation days with non-zero daily net capital changes are selected from the customer's local capital flow sequence to form the non-zero day set, and the number of observation days in the non-zero day set is obtained; when the period segment is a silent period segment, the non-zero day set is treated as the empty set. When the non-zero day set is not empty, the arithmetic mean of the daily net capital changes of the observation days in the non-zero day set is calculated to obtain the customer's average daily net capital change within the current period segment; when the non-zero day set is empty, the average daily net capital change is set to zero. When the non-zero day set is not empty, the average daily net capital change is subtracted from the daily net capital change of each observation day in the non-zero day set to obtain the deviation of each observation day, and the arithmetic mean of the squared deviations is calculated to obtain the customer's daily net capital change volatility in the current period segment; when the non-zero day set is empty, the volatility is set to zero. When the non-zero day set is not empty, the daily net capital changes greater than zero over all observation days in the non-zero day set are summed to obtain the total net inflow of the current period segment, and the absolute values of the daily net capital changes of all observation days in the non-zero day set are summed to obtain the total capital flow scale of the current period segment; when the total capital flow scale is greater than zero, the ratio of the total net inflow to the total capital flow scale is the net inflow ratio of the current period segment, and when the total capital flow scale is zero, the net inflow ratio is set to zero. When the non-zero day set is not empty, the observation days with daily net capital changes less than zero are selected from the non-zero day set, and the largest absolute value among these daily net capital changes is taken as the customer's maximum single-day outflow amount in the current period segment; when there are no observation days with daily net capital changes less than zero, the maximum single-day outflow amount is set to zero. For each period segment of each customer, when the segment is an active period segment, the average daily net capital change, the daily net capital change volatility, the net inflow ratio, and the maximum single-day outflow amount are combined in a preset order into a four-dimensional local behavioral feature vector; when the segment is a silent period segment, all four components are set to zero and combined into the local behavioral feature vector.
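The four features can be sketched as below. The function name `local_features` is hypothetical. The volatility here follows the paragraph above literally (the arithmetic mean of the squared deviations); since step S302 further below labels this quantity a standard deviation, taking the square root of that value is an equally plausible reading and is not shown.

```python
def local_features(segment):
    """Return (mean, volatility, net_inflow_ratio, max_outflow) for one period segment.

    Only non-zero days are used; a silent segment yields all zeros.
    """
    nonzero = [x for x in segment if x != 0]
    if not nonzero:
        return (0.0, 0.0, 0.0, 0.0)
    n = len(nonzero)
    mean = sum(nonzero) / n
    # Mean of squared deviations, as worded in the paragraph above.
    volatility = sum((x - mean) ** 2 for x in nonzero) / n
    inflow = sum(x for x in nonzero if x > 0)
    scale = sum(abs(x) for x in nonzero)
    ratio = inflow / scale if scale > 0 else 0.0
    # Largest absolute value among negative days; 0.0 when there is no outflow day.
    max_outflow = max((-x for x in nonzero if x < 0), default=0.0)
    return (mean, volatility, ratio, max_outflow)

features = local_features([30.0, 0.0, -10.0, -10.0, -50.0])
print(features)  # (-10.0, 800.0, 0.3, 50.0)
```

In the example the non-zero days sum to -40 over four days, the total inflow is 30 against a flow scale of 100, and the largest single-day outflow is 50.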
[0093] Further specific implementation steps include:
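[0094] Construct the set of days with non-zero transactions for customer $i$ in period segment $P_m$ as follows:
[0095] $Z_{i,m} = \{ t_j \in P_m \mid x_{i,j} \neq 0 \}$, $n_{i,m} = |Z_{i,m}|$, where $Z_{i,m}$ is the set of observation days in period segment $P_m$ on which customer $i$ has a non-zero daily net change $x_{i,j}$; $n_{i,m}$ is the number of elements of the set $Z_{i,m}$; and $|\cdot|$ is the cardinality function, which returns the number of elements of the set given as its argument. Four features are extracted, specifically:
[0096] S301. Calculate the average transaction amount: $\mu_{i,m} = \frac{1}{n_{i,m}} \sum_{t_j \in Z_{i,m}} x_{i,j}$, where $\mu_{i,m}$ is the average daily net change of customer $i$ in period segment $P_m$; S302. Calculate the standard deviation of amounts: $\sigma_{i,m} = \sqrt{\frac{1}{n_{i,m}} \sum_{t_j \in Z_{i,m}} (x_{i,j} - \mu_{i,m})^2}$, where $\sigma_{i,m}$ is the volatility of customer $i$ within period segment $P_m$;
[0097] S303. Calculate the net inflow ratio: $r_{i,m} = \frac{\sum_{t_j \in Z_{i,m}} \max(x_{i,j}, 0)}{\sum_{t_j \in Z_{i,m}} |x_{i,j}|}$, where $r_{i,m}$ is the net inflow ratio of customer $i$ in period segment $P_m$; S304. Calculate the maximum outflow amount:
[0098] $o_{i,m} = \max_{t_j \in Z_{i,m},\, x_{i,j} < 0} |x_{i,j}|$, with $o_{i,m} = 0$ when no observation day in $Z_{i,m}$ has $x_{i,j} < 0$, where $o_{i,m}$ is the maximum outflow amount of customer $i$ in period segment $P_m$. Construct the feature vector: $\mathbf{f}_{i,m} = (\mu_{i,m}, \sigma_{i,m}, r_{i,m}, o_{i,m})$, where $\mathbf{f}_{i,m}$ is the local behavioral feature vector of customer $i$ in period segment $P_m$; when $n_{i,m} = 0$, let $\mathbf{f}_{i,m} = (0, 0, 0, 0)$.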
[0099] In each period, a candidate set of net inflow ratio thresholds and a candidate set of maximum single-day outflow amount thresholds are constructed based on local behavioral feature vectors. Active customer sets are then segmented according to the candidate threshold combinations, and information gain is calculated. The threshold combination with the highest information gain is selected, and customer risk labels are generated based on the final threshold combination. Specifically, this includes:
[0100] Within each period segment, the customers whose segment is marked as active are selected from all customers to form the training sample customer set of the current period segment; if there are no such customers, the training sample customer set is empty. When the training sample customer set is not empty, the net inflow ratio is extracted from the local behavioral feature vector of each customer in the set and, after duplicate values are removed, the values form the candidate set of net inflow ratio thresholds for the current period segment; likewise, the maximum single-day outflow amount of each customer is extracted and, after duplicate values are removed, the values form the candidate set of maximum single-day outflow amount thresholds. For each candidate threshold combination formed from the two candidate sets, the following operations are performed. For each customer in the training sample customer set, the net inflow ratio is compared with the candidate net inflow ratio threshold and the maximum single-day outflow amount is compared with the candidate maximum single-day outflow amount threshold: when the net inflow ratio is lower than its candidate threshold and the maximum single-day outflow amount is higher than its candidate threshold, the customer is assigned to the risk-side subset; when the net inflow ratio is not lower than its candidate threshold or the maximum single-day outflow amount is not higher than its candidate threshold, the customer is assigned to the non-risk-side subset. For the risk-side subset, the non-risk-side subset, and the whole training sample customer set, the number of customers whose overdue mark is overdue and the total number of customers are counted; when the total number of customers in a set is greater than zero, the ratio of the number of overdue customers to the total is the overdue ratio of that set, and when the total is zero, the overdue ratio is set to zero. From the overdue ratio of each set, the entropy values of the risk-side subset, the non-risk-side subset, and the whole training sample customer set are obtained according to the information entropy measure. Weights are set according to the proportions of the training sample customers that fall in the risk-side subset and in the non-risk-side subset, and the information gain of the current candidate threshold combination is calculated as the whole-set entropy minus the weighted subset entropies. Within each period segment, the combination of net inflow ratio threshold and maximum single-day outflow amount threshold with the largest information gain among all candidate combinations is taken as the final threshold combination of the current period segment; when several combinations have the same information gain, the combination with the smaller net inflow ratio threshold and maximum single-day outflow amount threshold is preferred. When the training sample customer set is empty, the net inflow ratio threshold and the maximum single-day outflow amount threshold of the current period segment are set to zero, and the information gain is set to zero. After the final threshold combination of each period segment is obtained, the local behavioral feature vectors of all customers in the current period segment are compared with it: when a customer's net inflow ratio is lower than the net inflow ratio threshold of the current period segment and the maximum single-day outflow amount is higher than the maximum single-day outflow amount threshold of the current period segment, the customer's risk label for the current period segment is set to the risk state; when the net inflow ratio is not lower than the threshold or the maximum single-day outflow amount is not higher than the threshold, the customer's risk label is set to the non-risk state. This forms a risk label record for each customer in each period segment, which serves as the risk assessment result for that period segment.
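The threshold search for one period segment can be sketched as an exhaustive scan over candidate pairs. This is a minimal illustration under stated assumptions: the names `best_thresholds` and `is_risky` are hypothetical, entropy uses base-2 logarithms (the specification does not state the base), and ties are broken lexicographically (smallest net inflow ratio threshold first, then smallest outflow threshold), which is one possible reading of "the smaller values are selected first".

```python
from math import log2

def entropy(labels):
    """Binary entropy of a list of 0/1 overdue flags; 0.0 for empty or pure sets."""
    p = sum(labels) / len(labels) if labels else 0.0
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def is_risky(ratio, outflow, theta_r, theta_o):
    """Risk side: net inflow ratio below its threshold AND outflow above its threshold."""
    return ratio < theta_r and outflow > theta_o

def best_thresholds(samples):
    """samples: (net_inflow_ratio, max_outflow, overdue_flag) of the active customers
    in one period segment. Returns (theta_r, theta_o, information_gain)."""
    if not samples:
        return (0.0, 0.0, 0.0)
    base = entropy([y for _, _, y in samples])
    best = None
    # Ascending iteration plus a strict '>' keeps the smallest pair on ties (assumed rule).
    for theta_r in sorted({r for r, _, _ in samples}):
        for theta_o in sorted({o for _, o, _ in samples}):
            risk = [y for r, o, y in samples if is_risky(r, o, theta_r, theta_o)]
            safe = [y for r, o, y in samples if not is_risky(r, o, theta_r, theta_o)]
            gain = (base
                    - len(risk) / len(samples) * entropy(risk)
                    - len(safe) / len(samples) * entropy(safe))
            if best is None or gain > best[2]:
                best = (theta_r, theta_o, gain)
    return best

samples = [(0.1, 500.0, 1), (0.2, 400.0, 1), (0.8, 50.0, 0), (0.9, 300.0, 0)]
theta_r, theta_o, gain = best_thresholds(samples)
labels = [int(is_risky(r, o, theta_r, theta_o)) for r, o, _ in samples]
print(theta_r, theta_o, gain)  # 0.8 50.0 1.0
print(labels)                  # [1, 1, 0, 0]
```

The scan costs the product of the two candidate set sizes times the number of active customers per segment, which is acceptable offline but may call for pruning on large segments.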
[0101] Further specific implementation steps include:
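[0102] For each period segment, a local decision function is constructed, specifically as follows:
[0103] $g_{i,m}(\theta^{r}, \theta^{o}) = 1$ if $r_{i,m} < \theta^{r}$ and $o_{i,m} > \theta^{o}$, and $g_{i,m}(\theta^{r}, \theta^{o}) = 0$ otherwise, where $g_{i,m}$ is the local risk assessment function of customer $i$ in period segment $P_m$; $r_{i,m}$ and $o_{i,m}$ are the net inflow ratio and the maximum outflow amount of customer $i$ in that segment; $\theta^{r}$ is the net inflow ratio threshold; and $\theta^{o}$ is the maximum outflow amount threshold. The training sample set is constructed as $A_m = \{ i \in U \mid s_{i,m} = 1 \}$, where $A_m$ is the set of effective training sample customers of period segment $P_m$ and $s_{i,m} = 1$ marks an active segment. The threshold candidate sets are constructed as follows:
[0104] $\Theta^{r}_m = \{ r_{i,m} \mid i \in A_m \}$, $\Theta^{o}_m = \{ o_{i,m} \mid i \in A_m \}$, where $\Theta^{r}_m$ is the candidate set of net inflow ratio thresholds and $\Theta^{o}_m$ is the candidate set of maximum outflow amount thresholds. For any candidate threshold pair $(\theta^{r}, \theta^{o}) \in \Theta^{r}_m \times \Theta^{o}_m$, the sample is divided into two subsets according to the decision rule, as follows: $A_m^{+} = \{ i \in A_m \mid g_{i,m}(\theta^{r}, \theta^{o}) = 1 \}$,
[0105] $A_m^{-} = \{ i \in A_m \mid g_{i,m}(\theta^{r}, \theta^{o}) = 0 \}$, where $(\theta^{r}, \theta^{o})$ is a candidate threshold pair; $\theta^{r}$ is the candidate net inflow ratio threshold; $\theta^{o}$ is the candidate maximum outflow amount threshold; $\Theta^{r}_m \times \Theta^{o}_m$ is the Cartesian product, representing the combination space of all candidate threshold pairs; $A_m^{+}$ is the risk-side subset; and $A_m^{-}$ is the non-risk-side subset. Construct the overdue ratio of an arbitrary sample set $S$ as follows: $p(S) = \frac{1}{|S|} \sum_{i \in S} y_i$ when $|S| > 0$, and $p(S) = 0$ when $|S| = 0$, where $S$ is a subset of customers, $y_i$ is the overdue mark of customer $i$, and $p(\cdot)$ is the overdue ratio function. The information entropy function $H(S)$ is constructed, specifically: S401, when $p(S) \in \{0, 1\}$: $H(S) = 0$;
[0106] S402, when $0 < p(S) < 1$: $H(S) = -p(S) \log_2 p(S) - (1 - p(S)) \log_2 (1 - p(S))$. The weight of the risk-side subset is constructed as $w^{+} = |A_m^{+}| / |A_m|$, and the weight of the non-risk-side subset is constructed as $w^{-} = |A_m^{-}| / |A_m|$. Based on this, an information gain function $IG_m(\theta^{r}, \theta^{o})$ is constructed, specifically:
[0107] S403, when $|A_m| = 0$: $IG_m(\theta^{r}, \theta^{o}) = 0$; S404, when $|A_m| > 0$: $IG_m(\theta^{r}, \theta^{o}) = H(A_m) - w^{+} H(A_m^{+}) - w^{-} H(A_m^{-})$, where $H(A_m)$ is the total entropy before partitioning. Select the optimal threshold pair: $(\theta^{r}_m, \theta^{o}_m) = (\hat{\theta}^{r}_m, \hat{\theta}^{o}_m)$, where $(\theta^{r}_m, \theta^{o}_m)$ is the final threshold pair of period segment $P_m$; $\theta^{r}_m$ is the net inflow ratio threshold of period segment $P_m$; $\theta^{o}_m$ is the maximum outflow amount threshold of period segment $P_m$; $\hat{\theta}^{r}_m$ is the net inflow ratio threshold obtained by deterministic optimization; and $\hat{\theta}^{o}_m$ is the maximum outflow amount threshold obtained by deterministic optimization. When $A_m \neq \emptyset$, calculate the maximum information gain value: $IG_m^{*} = \max_{(\theta^{r}, \theta^{o}) \in \Theta^{r}_m \times \Theta^{o}_m} IG_m(\theta^{r}, \theta^{o})$, where $IG_m^{*}$ is the maximum information gain value of period segment $P_m$; and construct the set of maximum value points:
[0108] $\Omega_m = \{ (\theta^{r}, \theta^{o}) \in \Theta^{r}_m \times \Theta^{o}_m \mid IG_m(\theta^{r}, \theta^{o}) = IG_m^{*} \}$, where $\Omega_m$ is the set of threshold pairs of period segment $P_m$ with maximum information gain, from which the deterministic optimal thresholds are obtained: $\hat{\theta}^{r}_m = \min \{ \theta^{r} \mid (\theta^{r}, \theta^{o}) \in \Omega_m \}$, $\hat{\theta}^{o}_m = \min \{ \theta^{o} \mid (\hat{\theta}^{r}_m, \theta^{o}) \in \Omega_m \}$; when $A_m = \emptyset$, let $\theta^{r}_m = 0$, $\theta^{o}_m = 0$, $IG_m^{*} = 0$. For each customer and period segment, record the risk label: $z_{i,m} = g_{i,m}(\theta^{r}_m, \theta^{o}_m)$, where $z_{i,m}$ is the risk label of customer $i$ in period segment $P_m$.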
[0109] The process of sequentially linking risk tags at the customer level to form a chronological risk path specifically includes:
[0110] For each customer, risk labels are obtained sequentially within each period segment according to the time order of the period segments on the unified observation time axis. The period segment risk labels are then connected according to the time order of the unified observation time axis to form a discrete risk label sequence from the first period segment to the last period segment, which serves as the customer's time-order risk path.
[0111] Further specific implementation steps include:
[0112] The determination results of all period segments are concatenated into a time series:
[0113] $R_i = (z_{i,1}, z_{i,2}, \dots, z_{i,M})$, where $R_i$ is the full-cycle risk path sequence of customer $i$, $z_{i,m}$ is the risk label of customer $i$ in the $m$-th period segment, and $M$ is the total number of period segments.
[0114] Calculating the number of risk jumps, the tail continuous risk length, and the full-cycle risk ratio from the time-sequential risk path to form a risk path trend indicator vector specifically includes:
[0115] For each customer, in the time-series risk path, starting from the second period segment, the risk label of the current period segment is compared with the risk label of the previous period segment. When the previous period segment is in a non-risk state and the current period segment is in a risk state, the risk jump count is incremented by one. After traversing the entire risk path, the number of risk jumps is obtained. For each customer, starting from the last period segment of the time-series risk path, the risk labels are checked sequentially backward. When consecutive risk states are encountered, the length of the tail continuous risk is accumulated. Accumulation stops when the first non-risk state is encountered or the starting position of the risk path is reached, thus obtaining the tail continuous risk length. For each customer, the number of period segments in which risk states occur in the time-series risk path is counted, and the ratio of the number of period segments in which risk states occur to the total number of period segments covered by the risk path is taken as the full-cycle risk ratio. For each customer, the number of risk jumps, the tail continuous risk length, and the full-cycle risk ratio are combined in a preset order to form a risk path trend indicator vector, and the risk path trend indicator vector is used as the input feature vector representing the cross-cycle risk evolution trend in the subsequent hierarchical decision tree structure.
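The three trend indicators described above can be computed in a single pass over the label sequence. The following is a minimal sketch; the function name `trend_indicators` is hypothetical and the risk path is assumed to be a list of 0/1 labels ordered from the first to the last period segment.

```python
def trend_indicators(path):
    """Return (risk_jumps, tail_length, full_cycle_ratio) for a 0/1 risk path."""
    # A jump is a transition from non-risk (0) to risk (1) between adjacent segments.
    jumps = sum(1 for prev, cur in zip(path, path[1:]) if prev == 0 and cur == 1)
    # Count consecutive risk labels from the end until the first non-risk label.
    tail = 0
    for label in reversed(path):
        if label != 1:
            break
        tail += 1
    ratio = sum(path) / len(path) if path else 0.0
    return jumps, tail, ratio

print(trend_indicators([0, 1, 0, 1, 1]))  # (2, 2, 0.6)
```

For the path 0, 1, 0, 1, 1 there are two 0-to-1 transitions, the last two segments are both risky, and three of the five segments are risky.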
[0116] Further specific implementation steps include:
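[0117] Construct the indicator function as $\mathbb{1}[\cdot]$, which equals 1 when its condition holds and 0 otherwise, and calculate the following three trend indicators, specifically: S601, calculate the number of risk jumps: $J_i = \sum_{m=2}^{M} \mathbb{1}[z_{i,m-1} = 0 \wedge z_{i,m} = 1]$, where $J_i$ is the number of risk jumps of customer $i$ and $z_{i,m}$ is the risk label of customer $i$ in the $m$-th of $M$ period segments; S602, calculate the tail continuous risk length: $C_i = \max \{ q \in \{0, 1, \dots, M\} \mid \prod_{l=0}^{q-1} z_{i,M-l} = 1 \}$, with the empty product for $q = 0$ taken as 1, where $C_i$ is the tail continuous risk length of customer $i$; $q$ is a count variable, representing candidate values for the number of consecutive risk segments at the tail; $\prod$ is the multiplication operator; and $l$ is the offset counted forward from the tail; S603, calculate the full-cycle risk ratio: $\rho_i = \frac{1}{M} \sum_{m=1}^{M} z_{i,m}$, where $\rho_i$ is the full-cycle risk ratio of customer $i$. Construct the risk trend feature vector: $\mathbf{v}_i = (J_i, C_i, \rho_i)$, where $\mathbf{v}_i$ is the trend feature vector of customer $i$.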
[0118] The construction of a hierarchical decision tree, which includes setting risk assessment nodes in intra-cycle layers and trend assessment nodes in cross-cycle layers, and jointly encoding risk paths and trend indicators to obtain leaf node numbers, specifically includes:
[0119] In the intra-period decision layer of the hierarchical decision tree, an intra-period risk judgment node is set for each period segment. The customer's local behavioral feature vector in the current period segment is input into the corresponding node and compared with the net inflow ratio threshold and the maximum single-day outflow amount threshold of that period segment, and the customer's risk label in the current period segment is output. For each customer, a path code integer is constructed from the risk label values in the time-sequential risk path, starting from the first period segment: the path code is initially set to zero and the period segments are processed in turn; when a period segment is processed, the existing path code is multiplied by two and the risk label value of the current period segment is added to the result. After the last period segment has been processed, the customer's path code integer is obtained. For each customer, the number of risk jumps, the tail continuous risk length, the full-cycle risk ratio, and the total number of period segments are obtained, the total number of period segments being the number of period segments covered by the time-sequential risk path, and the following operations are performed: S701, round down the product of the full-cycle risk ratio and the total number of period segments; S702, obtain the discrete risk ratio level from the result; S703, multiply the number of risk jumps by the square of the total number of period segments plus one to obtain the first part of the trend code; S704, sum the first part and the second part of the trend code with the risk ratio level to obtain the customer's trend code integer, the second part of the trend code being derived from the tail continuous risk length. For each customer, the total number of binary path codes covering all period segment risk label values is calculated from the total number of period segments, the trend code integer is multiplied by the total number of binary path codes to obtain the trend offset, and the trend offset is added to the path code integer to obtain the customer's leaf node number in the hierarchical decision tree, ensuring that the leaf node numbers corresponding to different combinations of trend code and path code are unique. During decision tree training, the leaf node numbers are used to attach the training sample customers to the corresponding leaf nodes. The intra-period risk judgment nodes of the intra-period decision layer and the trend judgment nodes of the cross-period decision layer together constitute the hierarchical decision tree structure.
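A sketch of the joint encoding follows, together with a decoder that illustrates why the mapping is reversible. Two points are assumptions rather than statements of the specification: "the square of the total number of period segments plus one" is read as (M + 1) squared, and the second part of the trend code is taken to be the tail continuous risk length multiplied by (M + 1), which makes the trend code a mixed-radix number in base M + 1. The function names are hypothetical. Because the full-cycle risk ratio equals the count of risk segments divided by M, the rounded-down product of the ratio and M equals that count, so the sketch uses the integer count directly to avoid floating-point rounding.

```python
def path_code(path):
    """Binary path code: start at 0, then code = code * 2 + label for each segment."""
    code = 0
    for label in path:
        code = code * 2 + label
    return code

def leaf_number(path):
    """Joint code of trend indicators and risk path for one customer (0/1 labels)."""
    m = len(path)
    base = m + 1
    jumps = sum(1 for a, b in zip(path, path[1:]) if a == 0 and b == 1)
    tail = 0
    for label in reversed(path):
        if label != 1:
            break
        tail += 1
    level = sum(path)  # equals floor(ratio * m) since ratio = sum(path) / m
    # ASSUMPTION: second part is tail * (m + 1); first part is jumps * (m + 1) ** 2.
    trend_code = jumps * base ** 2 + tail * base + level
    return trend_code * 2 ** m + path_code(path)

def decode(leaf, m):
    """Invert leaf_number for a known number of segments m."""
    trend_code, code = divmod(leaf, 2 ** m)
    jumps, rest = divmod(trend_code, (m + 1) ** 2)
    tail, level = divmod(rest, m + 1)
    path = [(code >> (m - 1 - k)) & 1 for k in range(m)]
    return path, jumps, tail, level

leaf = leaf_number([0, 1, 0, 1, 1])
print(leaf)             # 2795
print(decode(leaf, 5))  # ([0, 1, 0, 1, 1], 2, 2, 3)
```

For the example path, the trend code is 2 * 36 + 2 * 6 + 3 = 87 and the path code is 11, so the leaf number is 87 * 32 + 11 = 2795. Under the assumed base, the tail length and the risk level never exceed M, so each digit stays below M + 1 and distinct combinations cannot collide.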
[0120] Further specific implementation steps include:
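[0121] The local decision unit of each period segment $P_m$ is set as a node $n_m$, whose input is $\mathbf{f}_{i,m}$ and whose output is $z_{i,m}$; the node mapping relationship is $n_m : \mathbf{f}_{i,m} \mapsto z_{i,m}$, where $n_m$ is the local decision node of the $m$-th period segment, $\mathbf{f}_{i,m}$ is the local behavioral feature vector of customer $i$ in that segment, $z_{i,m}$ is the corresponding risk label, and $\mapsto$ is the mapping symbol, representing the mapping from input to output. The period output sequence $R_i = (z_{i,1}, \dots, z_{i,M})$ is treated as a deterministic path encoding, and the path encoding function is constructed as follows: $\pi(R_i) = \sum_{m=1}^{M} z_{i,m} \, 2^{M-m}$, where $\pi(\cdot)$ is the path encoding function, which converts the binary sequence $R_i$ into a unique integer code. The cross-cycle trend determination node is set as the root node $n_0$, whose input is the trend feature vector $\mathbf{v}_i = (J_i, C_i, \rho_i)$ and whose output is the leaf node number $\ell_i$, where $J_i$, $C_i$, and $\rho_i$ are the number of risk jumps, the tail continuous risk length, and the full-cycle risk ratio of customer $i$. Construct the trend encoding function $\tau(\mathbf{v}_i)$, specifically:
[0122] $\tau(\mathbf{v}_i) = J_i (M+1)^2 + C_i (M+1) + \lfloor \rho_i M \rfloor$, and construct the leaf node number as $\ell_i = \tau(\mathbf{v}_i) \cdot 2^{M} + \pi(R_i)$.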
[0123] The establishment of the leaf node-overdue probability lookup table relationship, which generates a leaf node number for new customers and retrieves the overdue probability before outputting it, specifically includes:
[0124] During the training phase, for each possible leaf node number, all customer samples with leaf node numbers equal to the current number are collected from the training dataset to form the customer set for the corresponding leaf node. For each leaf node customer set, the total number of customers and the number of customers marked as overdue are counted within the set. When the total number of customers is greater than zero, the ratio of the number of overdue customers to the total number of customers is used as the overdue probability of the corresponding leaf node; when the total number of customers is equal to zero, the overdue probability is set to zero, and a lookup table relationship between leaf node number and overdue probability is formed. During the prediction phase, data acquisition, cycle division and local sequence generation of fund flow, local behavioral feature vector extraction, threshold combination selection and cycle segment risk label generation, risk label concatenation in chronological order, risk path trend indicator calculation, and hierarchical decision tree coding are performed for the customers to be predicted, generating leaf node numbers. The overdue probability matching the leaf node number is found in the lookup table relationship between leaf node number and overdue probability, and the found overdue probability is used as the prediction result of the overdue probability of the customers to be predicted.
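The training-time table and the prediction-time lookup can be sketched with two counters. The names `build_lookup` and `predict` are hypothetical; leaf node numbers are assumed to have been generated already for every training customer. Leaf numbers that never occur in training are not stored, and the lookup falls back to zero, in line with the rule that an empty leaf has an overdue probability of zero.

```python
from collections import Counter

def build_lookup(leaf_numbers, overdue_flags):
    """Map each observed leaf number to overdue_customers / total_customers."""
    totals = Counter(leaf_numbers)
    overdue = Counter(leaf for leaf, y in zip(leaf_numbers, overdue_flags) if y == 1)
    return {leaf: overdue[leaf] / n for leaf, n in totals.items()}

def predict(table, leaf, default=0.0):
    """Look up the overdue probability of a leaf; unseen leaves return `default`."""
    return table.get(leaf, default)

# Hypothetical training data: 4 customers in leaf 2795 (3 overdue), 3 in leaf 0 (1 overdue).
table = build_lookup([2795, 2795, 0, 0, 0, 2795, 2795], [1, 0, 0, 0, 1, 1, 1])
print(predict(table, 2795))  # 0.75
print(predict(table, 999))   # 0.0
```

Storing only observed leaves keeps the table small even though the space of possible leaf numbers grows exponentially with the number of period segments.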
[0125] Further specific implementation steps include:
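[0126] For each leaf node number $\ell$, the set of training-set customers falling into that leaf node is constructed as:
[0127] $S_\ell = \{\, i \in D_{\text{train}} : \ell_i = \ell \,\}$, where $\ell$ is a value of the leaf node number, $\ell_i$ is the leaf node number of customer $i$, and $S_\ell$ is the set of training-set customers falling into leaf node $\ell$. The number of customers $n_\ell$ and the number of overdue customers $m_\ell$ are counted, then: $n_\ell = |S_\ell|$, $m_\ell = \sum_{i \in S_\ell} y_i$, and $p_\ell = m_\ell / n_\ell$ when $n_\ell > 0$, $p_\ell = 0$ when $n_\ell = 0$, where $y_i \in \{0, 1\}$ is the overdue flag of customer $i$, $n_\ell$ is the number of customers in set $S_\ell$, $m_\ell$ is the number of overdue customers in set $S_\ell$, and $p_\ell$ is the leaf node delinquency probability. A new customer $u$, after traversing the entire path, reaches the leaf node numbered $\ell_u$. Its predicted probability of delinquency is:
[0128] $\hat{p}_u = p_{\ell_u}$, where $\hat{p}_u$ is the predicted probability of delinquency for new customer $u$; this is a lookup-table-based probability output.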
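As a minimal sketch of the lookup table, the Python below counts customers and overdue customers per leaf node number and stores their ratio. The function names and the use of a dictionary are assumptions for illustration; leaf node numbers with no training customers are simply absent from the dictionary and resolve to zero at lookup time, matching the zero-customer rule.

```python
from collections import defaultdict


def build_lookup(leaf_numbers, overdue_flags):
    """Map each observed leaf node number to its overdue ratio."""
    totals = defaultdict(int)
    overdue = defaultdict(int)
    for leaf, flag in zip(leaf_numbers, overdue_flags):
        totals[leaf] += 1
        overdue[leaf] += flag  # flag is 1 for overdue, 0 otherwise
    return {leaf: overdue[leaf] / totals[leaf] for leaf in totals}


def predict(lookup, leaf):
    """Return the stored probability, or zero for a leaf with no training customers."""
    return lookup.get(leaf, 0.0)


table = build_lookup([5, 5, 7, 5], [1, 0, 0, 1])
```

In this toy data, leaf 5 holds three customers of whom two are overdue, and leaf 7 holds one customer who is not overdue.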
[0129] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.
[0130] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the technical principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A method for predicting customer delinquency probability based on a decision tree model, characterized in that the method includes: Obtain the customers' historical transaction logs within the observation window, establish an observation timeline, summarize incoming and outgoing transactions by observation day to generate a daily net fund change sequence for each customer, and set an overdue flag based on whether an overdue event has occurred. Set a period window, divide the timeline into consecutive period segments according to the window, and extract a local fund flow sequence from the customer's daily net fund change sequence for each period segment. Mark period segments whose values are all zero as silent period segments, and mark period segments in which a non-zero value appears as active period segments. In each active period segment, select the observation days with a non-zero net fund change and record them as the non-zero day set. Calculate the average daily net fund change, volatility, net inflow ratio, and maximum single-day outflow amount on the non-zero day set and combine them into a local behavioral feature vector. In each period segment, construct a candidate set of net inflow ratio thresholds and a candidate set of maximum single-day outflow amount thresholds based on the local behavioral feature vectors. Divide the active customer set according to each candidate threshold combination and calculate the information gain. Select the threshold combination with the largest information gain, and generate customer risk labels according to the final threshold combination. At the customer level, link the risk labels of the period segments in chronological order to form a time-series risk path. Based on the time-series risk path, calculate the number of risk jumps, the tail-end continuous risk length, and the full-cycle risk ratio to form a risk path trend indicator vector. 
Construct a hierarchical decision tree, setting risk judgment nodes in the intra-cycle layer and trend judgment nodes in the cross-cycle layer, and jointly encode the risk path and trend indicators to obtain the leaf node number. Establish a lookup table relationship between leaf nodes and overdue probabilities. For new customers, generate leaf node numbers, retrieve overdue probabilities, and output the results.
2. The customer delinquency probability prediction method based on a decision tree model according to claim 1, characterized in that, The process of acquiring historical transaction logs from customers and within the observation window, establishing an observation timeline, summarizing incoming and outgoing transactions by observation day to generate a daily net cash change sequence for customers, and setting overdue flags based on whether overdue transactions have occurred specifically includes: Obtain the set of customers participating in the modeling, assign a unique customer number to each customer in the set, and form a list of customers participating in the modeling; For each customer, the customer's historical transaction logs are read from the business system in chronological order within a preset observation window. Each transaction log contains at least the natural date of the transaction and the transaction amount. A set of the customer's historical transaction logs is formed when a transaction record exists, and the set of the customer's historical transaction logs is set to an empty set when no transaction record exists. Based on the modeling requirements, the start and end dates of the observation window are given. A unified observation timeline is established by arranging the natural days consecutively between the start and end dates. Each natural day on the timeline is marked as an observation day, forming an observation day sequence. 
For each customer, each observation day is traversed one by one on a unified observation timeline, and all transaction logs of the customer on the current observation day are retrieved; when there are multiple transactions, the transaction amounts are summarized according to the rule of recording income as positive and expenditure as negative, so as to obtain the customer's daily net change in funds on the current observation day; when there are no transaction logs on the current observation day, the customer's daily net change in funds on the current observation day is set to zero, thereby forming a sequence of daily net changes in funds for customers covering all observation days; For each customer, check whether an overdue event has occurred within a preset observation window; if an overdue event exists, set the customer's overdue flag to overdue; if no overdue event occurs, set the overdue flag to not overdue, and establish a correspondence between the overdue flag and the customer's daily net cash change sequence.
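The daily aggregation in claim 2 can be sketched in Python as follows. The representation of a transaction as a `(date, signed_amount)` pair, with income positive and expenditure negative, and the function name are assumptions for illustration.

```python
from datetime import date


def daily_net_sequence(transactions, start, end):
    """Sum signed amounts per observation day; days without transactions stay at zero."""
    n_days = (end - start).days + 1
    net = [0.0] * n_days
    for day, amount in transactions:
        if start <= day <= end:  # ignore records outside the observation window
            net[(day - start).days] += amount
    return net


logs = [
    (date(2025, 1, 1), 100.0),   # income
    (date(2025, 1, 1), -30.0),   # expenditure on the same day
    (date(2025, 1, 3), -50.0),
]
sequence = daily_net_sequence(logs, date(2025, 1, 1), date(2025, 1, 4))
```

A customer with an empty transaction log yields an all-zero sequence of the same length, so every customer shares the same observation timeline.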
3. The customer delinquency probability prediction method based on a decision tree model according to claim 2, characterized in that, The step of setting a period window, dividing the timeline into consecutive period segments according to the window, extracting a local fund flow sequence from the customer's daily net fund change sequence for each period segment, and marking all-zero period segments as silent period segments and period segments containing a non-zero value as active period segments specifically includes: Set the period window length, which is in units of the number of observation days; On the unified observation timeline, starting from the first observation day of the observation window, the observation days are divided sequentially from front to back according to the period window length. The observation days corresponding to each complete period window length form one period segment. Any remaining observation days that number fewer than a complete period window length form a separate final period segment, resulting in several consecutive period segments. For each customer, within each period segment, extract from the customer's daily net fund change sequence the daily net fund changes for all observation days covered by the current period segment, forming the customer's local fund flow sequence within that period segment. For each customer, examine all observation days in the local fund flow sequence of each period segment: when the daily net fund change is zero on every observation day, mark the current period segment as a silent period segment; when the daily net fund change is non-zero on at least one observation day, mark the current period segment as an active period segment, and record the silent or active marks to form a silent mark sequence.
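A minimal Python sketch of the segmentation and silent/active marking in claim 3 is given here. The function names and the 0/1 encoding of the marks (1 for active, 0 for silent) are assumptions for illustration.

```python
def split_segments(net, window):
    """Cut the daily sequence into consecutive windows; a short remainder forms the last segment."""
    return [net[i:i + window] for i in range(0, len(net), window)]


def activity_marks(segments):
    """Return 1 for a segment with any non-zero day (active), 0 for an all-zero one (silent)."""
    return [int(any(value != 0 for value in segment)) for segment in segments]


segments = split_segments([70, 0, -50, 0, 0, 0, 5], window=3)
marks = activity_marks(segments)
```

With seven observation days and a window of three, the last segment holds the single remaining day.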
4. The customer delinquency probability prediction method based on a decision tree model according to claim 3, characterized in that, In each active period, observation days with non-zero net changes in funds are selected and recorded as a non-zero day set, and the average daily net change in funds, volatility, net inflow ratio, and maximum single-day outflow amount are calculated on the non-zero day set and combined to form a local behavioral feature vector, specifically including: For each customer, within each active period, observation days with non-zero net changes in funds are selected from the customer's local fund flow sequence to form a non-zero day set, and the number of observation days in the non-zero day set is obtained; when the period is a silent period, the non-zero day set is regarded as an empty set. When the non-zero day set is not empty, the arithmetic mean of the daily net change in funds over the observation days in the non-zero day set is calculated to obtain the average daily net change in funds for the customer in the current period; when the non-zero day set is empty, the average daily net change in funds for the current period is set to zero. When the non-zero day set is not empty, the average daily net change in funds of the non-zero day set is subtracted from the daily net change in funds of each observation day in the set to obtain the deviation value for each observation day. The arithmetic mean of the squared deviation values is then calculated to obtain the volatility of the customer's daily net change in funds in the current period. When the non-zero day set is empty, the volatility of the daily net change in funds in the current period is set to zero. 
When the non-zero day set is not empty, the daily net changes in funds that are greater than zero are summed over all observation days in the non-zero day set to obtain the total net inflow for the current period; the absolute values of the daily net changes in funds are summed over all observation days in the non-zero day set to obtain the total fund flow for the current period; when the total fund flow is greater than zero, the ratio of the total net inflow to the total fund flow is used as the net inflow ratio for the current period; when the total fund flow is equal to zero, the net inflow ratio is set to zero. When the non-zero day set is not empty, observation days with a daily net change in funds less than zero are selected from the non-zero day set, and the largest absolute value among those daily net changes is taken as the customer's maximum single-day outflow amount in the current period; when there are no observation days with a daily net change in funds less than zero, the maximum single-day outflow amount in the current period is set to zero. For each period of each customer, when the period is an active period, the average daily net change in funds, the volatility of the daily net change in funds, the net inflow ratio, and the maximum single-day outflow amount of the current period are combined into a four-dimensional local behavioral feature vector in a preset order; when the period is a silent period, all four components are set to zero and combined into a local behavioral feature vector.
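The four features of claim 4 can be sketched in Python as below. The function name and the tuple order (average, volatility, net inflow ratio, maximum single-day outflow) are assumptions for illustration; the volatility is the mean of squared deviations, as the claim describes.

```python
def local_features(segment):
    """Compute (average, volatility, net inflow ratio, max single-day outflow) on non-zero days."""
    nonzero = [value for value in segment if value != 0]
    if not nonzero:  # silent period: all four components are zero
        return (0.0, 0.0, 0.0, 0.0)
    mean = sum(nonzero) / len(nonzero)
    volatility = sum((value - mean) ** 2 for value in nonzero) / len(nonzero)
    inflow = sum(value for value in nonzero if value > 0)
    total_flow = sum(abs(value) for value in nonzero)  # positive, since the set is non-empty
    ratio = inflow / total_flow
    max_outflow = max((-value for value in nonzero if value < 0), default=0.0)
    return (mean, volatility, ratio, max_outflow)


features = local_features([70, 0, -50])
```

For the segment `[70, 0, -50]` the non-zero day set is `{70, -50}`, so the average is 10, each deviation is 60 in magnitude, the inflow share is 70 out of 120, and the largest outflow is 50.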
5. The customer delinquency probability prediction method based on a decision tree model according to claim 4, characterized in that, In each period, a candidate set of net inflow ratio thresholds and a candidate set of maximum single-day outflow amount thresholds are constructed based on local behavioral feature vectors. The active customer set is then divided according to each candidate threshold combination, and information gain is calculated. The threshold combination with the highest information gain is selected, and customer risk labels are generated based on the final threshold combination. Specifically, this includes: Within each period segment, the customers whose current period segment is marked as active are selected from all customers to form the training sample customer set for the current period segment. If no customer is active in the current period segment, the training sample customer set is set to an empty set. When the training sample customer set is not empty, the net inflow ratio is extracted from the local behavioral feature vector of each customer in the training sample customer set. After removing duplicate values, a candidate set of net inflow ratio thresholds for the current period is constructed. At the same time, the maximum single-day outflow amount of each customer is extracted. After removing duplicate values, a candidate set of maximum single-day outflow amount thresholds for the current period is constructed. 
When the training sample customer set is not empty, the following operations are performed for each candidate threshold combination formed from the candidate set of net inflow ratio thresholds and the candidate set of maximum single-day outflow amount thresholds: For each customer in the training sample customer set, compare the net inflow ratio with the candidate net inflow ratio threshold, and the maximum single-day outflow amount with the candidate maximum single-day outflow amount threshold; when the net inflow ratio is lower than the candidate net inflow ratio threshold and the maximum single-day outflow amount is higher than the candidate maximum single-day outflow amount threshold, classify the current customer into the risk-side subset; when the net inflow ratio is not lower than the candidate net inflow ratio threshold or the maximum single-day outflow amount is not higher than the candidate maximum single-day outflow amount threshold, classify the current customer into the non-risk-side subset. For the risk-side subset, the non-risk-side subset, and the training sample customer set obtained under the candidate threshold combination, the number of customers flagged as overdue and the total number of customers in each set are counted. When the total number of customers in a set is greater than zero, the ratio of the number of overdue customers to the total number of customers is taken as the overdue ratio of that set. When the total number of customers in a set is equal to zero, the overdue ratio is set to zero. From the overdue ratio of each set, the entropy values of the risk-side subset, the non-risk-side subset, and the training sample customer set are obtained according to the information entropy measurement rule. The weights of the risk-side subset and the non-risk-side subset are set to their respective shares of the total number of customers in the training sample customer set. 
The information gain of the current candidate threshold combination is calculated based on the weighted difference between the overall entropy value and the subset entropy value. Within each period, the net inflow ratio threshold with the largest information gain and the maximum single-day outflow amount threshold are selected from all candidate threshold combinations as the final threshold combination for the current period. When there are multiple threshold combinations with the same information gain, the combination with the smaller values of the net inflow ratio threshold and the maximum single-day outflow amount threshold is selected first. When the training sample customer set is empty, the net inflow ratio threshold and the maximum single-day outflow amount threshold for the current period are set to zero, and the information gain is set to zero. After obtaining the final threshold combination for each period, a threshold comparison is performed on the local behavioral feature vectors of all customers in the current period: when a customer's net inflow ratio is lower than the net inflow ratio threshold of the current period and the maximum single-day outflow amount is higher than the maximum single-day outflow amount threshold of the current period, the customer's risk label in the current period is set to risk status; when the net inflow ratio is not lower than the net inflow ratio threshold of the current period or the maximum single-day outflow amount is not higher than the maximum single-day outflow amount threshold of the current period, the customer's risk label in the current period is set to non-risk status, forming a risk label record for each customer in each period, which serves as the risk judgment result for the period.
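The threshold search in claim 5 can be sketched in Python as follows. Two points are assumptions made for illustration: the "information entropy measurement rule" is taken to be binary Shannon entropy in bits, and ties are broken by scanning candidates in ascending order (net inflow ratio threshold first, then outflow threshold) and keeping the first maximum. Function names and the `(ratio, outflow, overdue_flag)` sample layout are hypothetical.

```python
from itertools import product
from math import log2


def entropy(overdue, total):
    """Binary Shannon entropy of an overdue ratio; zero for an empty set."""
    if total == 0:
        return 0.0
    p = overdue / total
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)


def best_thresholds(samples):
    """samples: (net_inflow_ratio, max_outflow, overdue_flag) per active customer."""
    if not samples:  # no active customers: thresholds and gain are zero
        return (0.0, 0.0, 0.0)
    n = len(samples)
    n_overdue = sum(flag for _, _, flag in samples)
    base = entropy(n_overdue, n)
    ratios = sorted({ratio for ratio, _, _ in samples})
    outflows = sorted({outflow for _, outflow, _ in samples})
    best = None
    for t_ratio, t_out in product(ratios, outflows):
        risk = [flag for ratio, outflow, flag in samples
                if ratio < t_ratio and outflow > t_out]
        n_risk, risk_overdue = len(risk), sum(risk)
        n_safe = n - n_risk
        weighted = ((n_risk / n) * entropy(risk_overdue, n_risk)
                    + (n_safe / n) * entropy(n_overdue - risk_overdue, n_safe))
        gain = base - weighted
        if best is None or gain > best[2]:  # strict '>' keeps the earlier, smaller pair on ties
            best = (t_ratio, t_out, gain)
    return best


def risk_label(ratio, outflow, t_ratio, t_out):
    """1 (risk) when the ratio is below and the outflow above the thresholds, else 0."""
    return int(ratio < t_ratio and outflow > t_out)
```

The search is exhaustive over the product of both candidate sets, so its cost grows with the square of the number of distinct feature values per period segment.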
6. The customer delinquency probability prediction method based on a decision tree model according to claim 5, characterized in that, The process of linking risk labels for different period segments in chronological order at the customer level to form a time-series risk path specifically includes: For each customer, the risk label of each period segment is obtained in the chronological order of the period segments on the unified observation timeline. The risk labels are then connected in that order to form a discrete risk label sequence from the first period segment to the last, which serves as the customer's time-series risk path.
7. The customer delinquency probability prediction method based on a decision tree model according to claim 6, characterized in that, The process of calculating the number of risk jumps, the tail-end continuous risk length, and the full-cycle risk ratio based on the time-series risk path to form a risk path trend indicator vector specifically includes: For each customer, in the time-series risk path, starting from the second period segment, the risk label of the current period segment is compared with the risk label of the previous period segment; when the previous period segment is in a non-risk state and the current period segment is in a risk state, the risk jump count is incremented by one; after traversing the entire risk path, the number of risk jumps is obtained. For each customer, risk labels are checked one by one backward from the last period segment of the time-series risk path. While consecutive risk states are encountered, the tail-end continuous risk length is incremented. The accumulation stops when the first non-risk state is encountered or the starting position of the risk path is reached, thus obtaining the tail-end continuous risk length. For each customer, the number of period segments in a risk state is counted in the time-series risk path, and the ratio of that number to the total number of period segments covered by the risk path is taken as the full-cycle risk ratio. For each customer, the number of risk jumps, the tail-end continuous risk length, and the full-cycle risk ratio are combined into a risk path trend indicator vector in a preset order. This risk path trend indicator vector is then used as the input feature vector that characterizes the cross-cycle risk evolution trend in the subsequent hierarchical decision tree structure.
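The three trend indicators of claim 7 can be sketched in Python as below, assuming the risk path is a list of 0/1 labels (1 for risk state). The function name and the tuple order are assumptions for illustration.

```python
def trend_indicators(risk_path):
    """Return (risk jumps, tail-end continuous risk length, full-cycle risk ratio)."""
    # A jump is a transition from non-risk (0) to risk (1) between adjacent segments.
    jumps = sum(1 for prev, cur in zip(risk_path, risk_path[1:])
                if prev == 0 and cur == 1)
    tail = 0
    for label in reversed(risk_path):  # walk backward from the last segment
        if label != 1:
            break
        tail += 1
    ratio = sum(risk_path) / len(risk_path) if risk_path else 0.0
    return (jumps, tail, ratio)


indicators = trend_indicators([0, 1, 1, 0, 1, 1])
```

The example path enters the risk state twice, ends with two consecutive risk segments, and spends four of its six segments in the risk state.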
8. The customer delinquency probability prediction method based on a decision tree model according to claim 7, characterized in that, The construction of a hierarchical decision tree, which includes setting risk judgment nodes in the intra-cycle layer and trend judgment nodes in the cross-cycle layer, and jointly encoding the risk path and trend indicators to obtain leaf node numbers, specifically includes: In the intra-cycle decision layer of the hierarchical decision tree, an intra-cycle risk judgment node is set for each period segment. The local behavioral feature vector of the customer in the current period segment is input into the corresponding node. The local behavioral feature vector is compared with the net inflow ratio threshold and the maximum single-day outflow amount threshold of the period segment, and the risk label of the customer in the current period segment is output. For each customer, the path code integer is constructed by taking the risk label values in the risk path in chronological order, starting from the first period segment: the path code is initialized to zero; each period segment is processed in turn; when processing the current period segment, the existing path code is multiplied by two, and the risk label value of the current period segment is then added to the result. After processing the last period segment, the customer's path code integer is obtained. For each customer, obtain the number of risk jumps, the tail-end continuous risk length, the full-cycle risk ratio, and the total number of period segments. The total number of period segments is the number of period segments covered by the time-series risk path. Perform the following operations: S701. Round down the product of the full-cycle risk ratio and the total number of period segments to obtain the discrete risk ratio level. S702. 
Multiply the number of risk jumps by the square of the total number of cycle segments plus one to obtain the first part of the trend code. S703. Multiply the length of the continuous risk at the tail end by one plus the total number of period segments to obtain the second part of the trend code. S704. Sum the first part of the trend code, the second part of the trend code, and the risk ratio level to obtain the customer's trend code integer. For each customer, calculate the total number of binary path codes covering all risk label values for all period segments based on the total number of period segments. Multiply the trend code integer by the total number of binary path codes to obtain the trend offset. Then add the trend offset to the path code integer to obtain the leaf node number of the customer in the hierarchical decision tree, so that the leaf node numbers corresponding to different combinations of trend codes and path codes are not repeated. During the decision tree training process, the training sample customers are attached to the corresponding leaf nodes using leaf node numbers; the intra-cycle risk assessment nodes of the intra-cycle decision layer and the trend assessment nodes of the cross-cycle decision layer together constitute a hierarchical decision tree structure.
9. The customer delinquency probability prediction method based on a decision tree model according to claim 8, characterized in that, The establishment of the leaf node-overdue probability lookup table relationship, which generates a leaf node number for new customers and retrieves the overdue probability before outputting it, specifically includes: During the training phase, for each possible leaf node number, collect all customer samples whose leaf node numbers are equal to the current number from the training dataset to form the customer set corresponding to the leaf node; For each leaf node customer set, count the total number of customers and the number of customers marked as overdue within the set; when the total number of customers is greater than zero, use the ratio of the number of overdue customers to the total number of customers as the overdue probability of the corresponding leaf node; when the total number of customers is equal to zero, set the overdue probability to zero, and form a lookup table relationship between leaf node number and overdue probability. During the forecasting phase, the following steps are performed for the clients to be forecasted: data acquisition, period segmentation and generation of local sequences of capital flows, extraction of local behavioral feature vectors, selection of threshold combinations and generation of risk labels for period segments, concatenation of risk labels in chronological order, calculation of risk path trend indicators, and hierarchical decision tree coding to generate leaf node numbers. The overdue probability that matches the leaf node number is then searched in the lookup table relationship between the leaf node number and the overdue probability. The found overdue probability is used as the overdue probability forecast result for the clients to be forecasted.