User grouping method and related device

By acquiring multi-source datasets to generate user tags and feature sets, using clustering algorithms for grouping, and then performing verification and adjustments, this approach solves the problems of insufficient multi-source data fusion and the disconnect between algorithms and business logic in existing technologies. It achieves accurate and stable user grouping, adapting to changes in user behavior and business needs.

CN122241285A · Pending · Publication Date: 2026-06-19 · BEIJING CHINA POWER INFORMATION TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING CHINA POWER INFORMATION TECH
Filing Date
2026-01-30
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies for electricity user grouping suffer from insufficient multi-source data fusion, one-sided and singular grouping criteria, disconnect between algorithms and business logic, lack of adaptive dynamic optimization mechanisms, and difficulty in meeting the real-time requirements of fee control services.

Method used

The system acquires multi-source datasets, generates user labels and feature sets, performs clustering processing using clustering algorithms, obtains the final clustering results through verification and adjustment, dynamically determines the optimal K value, optimizes the clustering results by combining intra-cluster sum of squares curves and silhouette coefficients, and periodically updates the clustering results.

Benefits of technology

It achieves precise user segmentation, improves the accuracy and practicality of segmentation, adapts to changes in user behavior and business needs, provides stability and scalability, and supports the refined operation of expense control business.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides a user segmentation method and related equipment. The method includes: acquiring a multi-source dataset; generating user labels and feature sets based on the multi-source dataset; performing segmentation processing based on the user labels and feature sets using a clustering algorithm to obtain initial segmentation results; verifying the initial segmentation results and adjusting the initial segmentation results based on the verification results to obtain final segmentation results. This application's embodiments generate user labels and feature sets from multi-source datasets, combine them with clustering algorithms to achieve accurate segmentation, and ensure the rationality and practicality of the segmentation results through verification and adjustment. Feature selection and dimensionality reduction improve model efficiency, and the optimal K value is dynamically determined to optimize the clustering results. A periodic update mechanism enables dynamic adjustment of segmentation and rapid segmentation of new users, and anomaly identification and rule optimization enhance stability and adaptability.

Description

Technical Field

[0001] This application relates to the field of user segmentation technology, and in particular to a user segmentation method and related equipment.

Background Technology

[0002] With the advancement of power market reform and smart grid construction, the scale of prepaid users is gradually expanding, making traditional payment collection and segmentation models insufficient to meet differentiated business needs. Existing technologies are shifting towards data-driven user segmentation methods, integrating multi-dimensional user data (such as electricity load curves, payment timeliness, and outstanding amounts) to achieve precise operations. These segmentation technologies typically utilize clustering algorithms, such as K-means, to cluster users, outputting user cluster labels and typical characteristic curves to support targeted marketing and abnormal electricity consumption detection. However, traditional K-means algorithms suffer from problems such as random initial centroids, arbitrary K-value settings, and weak business correlation. Meanwhile, other algorithms, such as Fuzzy C-means clustering (FCM), while achieving soft segmentation, are sensitive to initial parameters and cannot fully address the complexity of the segmentation problem.

[0003] Existing technologies for electricity user segmentation suffer from several shortcomings. First, multi-source data is insufficiently integrated, so segmentation rests on limited, one-dimensional criteria. Multi-dimensional data such as user profiles, business tags, payment behavior, and request records are not combined, making it difficult to depict users' overall attributes comprehensively. Second, the algorithm is disconnected from business logic: segmentation results cannot be directly converted into executable business rules, which reduces implementation efficiency and leaves strategies prone to failure through human error. Finally, existing technologies lack adaptive dynamic optimization mechanisms. Parameter settings are rigid and updates lag, so the segmentation fails to respond promptly to changes in user behavior. Furthermore, the segmentation results lack closed-loop optimization, leaving stability and timeliness insufficient and making it difficult to meet the real-time requirements of fee control operations.

Summary of the Invention

[0004] In view of this, the purpose of this application is to propose a user segmentation method and related equipment.

[0005] To achieve the above objectives, this application provides a user segmentation method, comprising: obtaining a multi-source dataset; generating user tags and a feature set based on the multi-source dataset; performing clustering based on the user tags and the feature set using a clustering algorithm to obtain initial clustering results; and validating the initial clustering results and adjusting them based on the validation results to obtain final clustering results.

[0006] In one possible implementation, generating user tags based on the multi-source dataset includes: using an association rule algorithm to analyze the associations among electricity consumption behavior, electricity consumption habits, and payment records in the multi-source dataset to generate the user tags, where the user tags include basic attribute tags, electricity consumption feature tags, and payment behavior tags.

[0007] In one possible implementation, the feature set includes electricity consumption fluctuation pattern features, payment timeliness features, and payment strategy response features; the electricity consumption fluctuation pattern features are used to describe the user's electricity consumption behavior patterns; the payment timeliness features are used to describe the user's payment behavior and response capability; and the payment strategy response features are used to describe the user's response to payment reminder strategies.

[0008] In one possible implementation, the method further includes: analyzing the feature set using the Pearson correlation coefficient and removing features whose correlation coefficient is less than a first threshold to obtain a first set; filtering the first set using analysis of variance and removing features whose variance is less than a second threshold to obtain a second set; and reducing the dimensionality of the second set using principal component analysis, retaining features whose cumulative variance contribution rate is greater than a third threshold, to obtain a third set, which is then used for clustering.
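The last two stages of this screening can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the sample features, and the threshold values are hypothetical, the principal component variances are assumed to have been computed elsewhere, and the Pearson stage is left out because the text does not say what each feature is correlated against.

```python
from statistics import pvariance

def drop_low_variance(features, min_variance):
    """Keep only feature columns whose population variance reaches min_variance
    (features with variance below the second threshold are removed)."""
    return {name: values for name, values in features.items()
            if pvariance(values) >= min_variance}

def components_to_keep(explained_variances, min_cumulative_ratio):
    """Number of leading principal components needed for the cumulative
    variance contribution rate to exceed min_cumulative_ratio."""
    total = sum(explained_variances)
    cumulative = 0.0
    ordered = sorted(explained_variances, reverse=True)
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total > min_cumulative_ratio:
            return count
    return len(ordered)

# Hypothetical normalized features for four users.
features = {
    "mean_usage": [0.2, 0.8, 0.5, 0.9],
    "constant_flag": [1.0, 1.0, 1.0, 1.0],  # carries no information
}
kept = drop_low_variance(features, min_variance=0.01)

# Hypothetical per-component variances from a PCA run.
n_components = components_to_keep([5.0, 3.0, 1.5, 0.5], min_cumulative_ratio=0.9)
```

Here `constant_flag` is dropped because its variance is zero, and three components are kept because the first two explain only 80% of the variance.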

[0009] In one possible implementation, the step of using a clustering algorithm to perform clustering based on the user tags and the feature set to obtain initial clustering results includes: determining an initial search range for the value of K; for each K value, randomly initializing the centroids and recording the corresponding intra-cluster sum of squares and silhouette coefficient; plotting an intra-cluster sum of squares curve from the values recorded for each K and determining the inflection point of that curve; obtaining the K value corresponding to the inflection point and using the silhouette coefficient to help confirm the optimal K value; and clustering based on the optimal K value to obtain the initial clustering results.
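A minimal Python sketch of this K search is given below. It is an illustration, not the claimed implementation: the inflection point is approximated by the largest second difference of the intra-cluster sum of squares curve, the silhouette coefficient is then compared across the inflection point and its neighbors, and the function names, the number of restarts, and the sample points are all assumptions. The search range is assumed to be a run of at least three consecutive integers.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, max_iter=100):
    """Lloyd's K-means with random initial centroids; returns (labels, WCSS)."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous centroid
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    wcss = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, wcss

def silhouette(points, labels):
    """Mean silhouette coefficient; points in singleton clusters score 0."""
    scores = []
    for i, p in enumerate(points):
        dists = {}
        for j, q in enumerate(points):
            if i != j:
                dists.setdefault(labels[j], []).append(sq_dist(p, q) ** 0.5)
        own = dists.pop(labels[i], [])
        if not own or not dists:
            scores.append(0.0)
            continue
        a = sum(own) / len(own)                             # mean distance within own cluster
        b = min(sum(d) / len(d) for d in dists.values())    # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def choose_k(points, k_values, restarts=30, seed=0):
    """Pick K at the bend of the WCSS curve, confirmed by the silhouette coefficient."""
    rng = random.Random(seed)
    wcss, sil = {}, {}
    for k in k_values:
        runs = (kmeans(points, k, rng) for _ in range(restarts))
        labels, best = min(runs, key=lambda run: run[1])
        wcss[k], sil[k] = best, silhouette(points, labels)
    inner = list(k_values)[1:-1]
    elbow = max(inner, key=lambda k: wcss[k - 1] - 2 * wcss[k] + wcss[k + 1])
    candidates = [k for k in (elbow - 1, elbow, elbow + 1) if k in sil]
    return max(candidates, key=lambda k: sil[k]), wcss, sil

# Three well-separated groups of hypothetical two-feature users.
points = [(x + dx, y + dy) for x, y in [(0, 0), (10, 10), (20, 0)]
          for dx in (0, 1) for dy in (0, 1)]
best_k, wcss, sil = choose_k(points, range(2, 7))
```

Repeating the random initialization several times per K and keeping the lowest intra-cluster sum of squares is one common way to reduce the sensitivity to initial centroids that the background section describes.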

[0010] In one possible implementation, the step of verifying the initial clustering result and adjusting it based on the verification result to obtain the final clustering result includes: verifying whether the initial clustering result meets a first condition and, if it does not, adjusting the weight of the user tags corresponding to the first condition, where the first condition is that the cost control whitelist users and group users in the initial clustering result are concentrated in the same cluster and the proportion of important users in that cluster is greater than or equal to a fourth threshold; verifying whether the initial clustering result meets a second condition and, if it does not, optimizing the K value and re-clustering, where the second condition is that the proportion of users carrying the corresponding user tag in each cluster is greater than a fifth threshold and the proportion of that user tag in the other clusters is less than a sixth threshold; and verifying whether the initial clustering result meets a third condition and, if it does not, splitting or merging clusters in the initial clustering result, adjusting the cluster centroids, and re-clustering, where the third condition is that the ratio of the number of users in each cluster to the total number of users is greater than a seventh threshold.
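Two of these checks can be sketched in Python as follows. The sketch reads the first condition as "the share of important users that land in their most common cluster" and the third condition as a minimum cluster share; both readings, the function names, the sample assignment, and the threshold values are assumptions made for illustration.

```python
from collections import Counter

def important_user_concentration(cluster_of, important_users):
    """Share of important users that fall in their most common cluster."""
    counts = Counter(cluster_of[user] for user in important_users)
    return max(counts.values()) / len(important_users)

def undersized_clusters(cluster_of, min_share):
    """Clusters whose share of all users is not greater than min_share."""
    sizes = Counter(cluster_of.values())
    total = len(cluster_of)
    return sorted(c for c, n in sizes.items() if n / total <= min_share)

# Hypothetical assignment of ten users to three clusters.
cluster_of = {"u1": 0, "u2": 0, "u3": 0, "u4": 1, "u5": 1,
              "u6": 1, "u7": 1, "u8": 1, "u9": 1, "u10": 2}
important = ["u1", "u2", "u3", "u4"]

concentration = important_user_concentration(cluster_of, important)
first_condition_met = concentration >= 0.8                # hypothetical fourth threshold
too_small = undersized_clusters(cluster_of, min_share=0.15)  # hypothetical seventh threshold
```

In this example the first condition fails (only 75% of important users share a cluster), which would trigger a tag weight adjustment, and cluster 2 is flagged as a candidate for merging.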

[0011] In one possible implementation, the method further includes: periodically updating the final clustering results; assigning a corresponding priority to each cluster in the final clustering results; and, in response to a new user assignment event, matching the new user against the clusters in the final clustering results sequentially from highest to lowest priority.
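Priority-ordered matching of a new user can be sketched as below. The rule structure, the cluster names, and the matching predicates are hypothetical; the text only specifies that clusters are tried from highest to lowest priority.

```python
def assign_cluster(user, rules):
    """Return the first cluster whose rule the user satisfies,
    trying rules from highest to lowest priority; None if nothing matches."""
    for rule in sorted(rules, key=lambda r: r["priority"], reverse=True):
        if rule["matches"](user):
            return rule["cluster"]
    return None

# Hypothetical cluster rules derived from a final clustering result.
rules = [
    {"cluster": "general", "priority": 1, "matches": lambda u: u["avg_delays"] < 2},
    {"cluster": "whitelist", "priority": 3, "matches": lambda u: u["whitelist"]},
    {"cluster": "high_delay", "priority": 2, "matches": lambda u: u["avg_delays"] >= 2},
]

late_payer = assign_cluster({"whitelist": False, "avg_delays": 2.5}, rules)
protected = assign_cluster({"whitelist": True, "avg_delays": 2.5}, rules)
```

Because the whitelist rule has the highest priority, a whitelisted user is assigned there even when a lower-priority rule would also match.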

[0012] In one possible implementation, the method further includes: performing anomaly identification on users who fail to match any group during the grouping process and on users who match multiple group rules simultaneously; and, in response to the number of identified anomalous users exceeding an eighth threshold, updating and adjusting the final clustering results.
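The anomaly check can be sketched as follows. The rule predicates, user fields, and threshold value are hypothetical and serve only to show the two anomaly types and the trigger for an update.

```python
def find_anomalous_users(users, rules):
    """Split anomalies into users matching no rule and users matching several."""
    unmatched, multi_matched = [], []
    for user in users:
        hits = [name for name, predicate in rules.items() if predicate(user)]
        if not hits:
            unmatched.append(user["id"])
        elif len(hits) > 1:
            multi_matched.append(user["id"])
    return unmatched, multi_matched

def needs_regrouping(unmatched, multi_matched, max_anomalies):
    """True when the anomaly count exceeds the threshold."""
    return len(unmatched) + len(multi_matched) > max_anomalies

# Hypothetical group rules and users.
rules = {
    "stable": lambda u: u["fluctuation"] < 0.10,
    "volatile": lambda u: u["fluctuation"] > 0.30,
    "late": lambda u: u["avg_delays"] >= 2,
}
users = [
    {"id": "u1", "fluctuation": 0.05, "avg_delays": 0},  # one match
    {"id": "u2", "fluctuation": 0.20, "avg_delays": 0},  # no match
    {"id": "u3", "fluctuation": 0.40, "avg_delays": 3},  # two matches
]
unmatched, multi_matched = find_anomalous_users(users, rules)
update_required = needs_regrouping(unmatched, multi_matched, max_anomalies=1)
```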

[0013] Based on the same inventive concept, embodiments of this application also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the user segmentation method as described in any of the above claims.

[0014] Based on the same inventive concept, embodiments of this application also provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute any of the user grouping methods described above.

[0015] As described above, the user segmentation method and related equipment provided in this application acquire multi-source datasets; generate user labels and feature sets based on the multi-source datasets; perform segmentation processing based on the user labels and feature sets using a clustering algorithm to obtain initial segmentation results; perform verification processing on the initial segmentation results; and adjust the initial segmentation results based on the verification results to obtain final segmentation results. This application's embodiments acquire multi-source datasets and generate user labels and feature sets, combine them with clustering algorithms to segment users, and finally obtain accurate segmentation results that meet business needs through verification and adjustment, thus achieving refined user management. Basic attribute labels, electricity consumption feature labels, and payment behavior labels are generated through association rule algorithms to comprehensively characterize user features. Combined with electricity consumption fluctuation patterns, payment timeliness features, and payment strategy response features, fine-grained feature descriptions are provided, improving the accuracy of segmentation. By eliminating redundant features, filtering irrelevant features, and performing principal component analysis for dimensionality reduction, the computational efficiency and accuracy of the segmentation model are further improved. The optimal K value for the clustering algorithm is dynamically determined and optimized using the intra-cluster sum of squares curve and silhouette coefficient to ensure reasonable and stable clustering results. Multiple checks and adjustments to the initial clustering results ensure the concentration of important users and the rationality of clusters, preventing the emergence of abnormal clusters and improving the practicality of the clustering results. 
A periodic update mechanism and priority rules enable dynamic adjustment of clustering results and rapid clustering of new users, adapting to changes in user behavior and business needs. Simultaneously, unmatched and abnormal users are identified and optimized, improving the stability and scalability of the clustering system and providing effective support for the refined operation of expense control.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in this application or related technologies, the drawings used in the description of the embodiments or related technologies will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a schematic diagram of the user grouping method according to an embodiment of this application; Figure 2 is a schematic diagram of the structure of the electronic device according to an embodiment of this application.

Detailed Implementation

[0018] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with specific embodiments and the accompanying drawings.

[0019] It should be noted that, unless otherwise defined, the technical or scientific terms used in the embodiments of this application should have the ordinary meaning understood by one of ordinary skill in the art to which this application pertains. The terms "first," "second," and similar terms used in the embodiments of this application do not indicate any order, quantity, or importance, but are merely used to distinguish different components. Terms such as "comprising" or "including" mean that the element or object preceding the word encompasses the elements or objects listed after the word and their equivalents, without excluding other elements or objects. Terms such as "connected" or "linked" are not limited to physical or mechanical connections, but can include electrical connections, whether direct or indirect. Terms such as "upper," "lower," "left," and "right" are only used to indicate relative positional relationships; when the absolute position of the described object changes, the relative positional relationship may also change accordingly.

[0020] It is understood that before using the technical solutions of the various embodiments in this disclosure, users will be informed of the type, scope of use, and usage scenarios of the personal information involved in an appropriate manner, and user authorization will be obtained.

[0021] For example, upon receiving a user's active request, a prompt message is sent to the user to explicitly inform them that the requested operation will require the acquisition and use of the user's personal information. This allows the user to independently choose, based on the prompt message, whether to provide personal information to the software or hardware such as electronic devices, applications, servers, or storage media performing the operations of this disclosed technical solution.

[0022] As an optional but not limited implementation, in response to a user's active request, sending a prompt message to the user can be done via a pop-up window, where the prompt message can be presented in text format. Furthermore, the pop-up window can also include a selection control allowing the user to choose "agree" or "disagree" to provide personal information to the electronic device.

[0023] It is understood that the above notification and user authorization process are merely illustrative and do not constitute a limitation on the implementation of this disclosure. Other methods that comply with relevant laws and regulations may also be applied to the implementation of this disclosure.

[0024] As described in the background section, with the advancement of power market reform and smart grid construction, the scale of prepaid users is expanding, making traditional clustering models insufficient to meet the demands. Existing technologies employ data-driven clustering methods, integrating data such as electricity load and payment timeliness, and utilizing the K-means algorithm to achieve clustering, outputting user cluster labels and feature curves to support precision marketing. However, the K-means algorithm suffers from problems such as random initial centroids and arbitrary setting of K values. Furthermore, existing technologies lack sufficient multi-source data fusion, resulting in a single clustering basis that fails to comprehensively characterize user attributes. The algorithm is also disconnected from business operations, making it difficult to convert clustering results into executable rules. Additionally, the lack of a dynamic optimization mechanism leads to delayed updates, hindering the real-time needs of prepaid services.

[0025] In summary, this application proposes a user segmentation method that involves acquiring a multi-source dataset; generating user tags and feature sets based on the multi-source dataset; using a clustering algorithm to perform segmentation processing based on the user tags and feature sets to obtain initial segmentation results; validating the initial segmentation results and adjusting them based on the validation results to obtain final segmentation results. This application achieves refined user management by acquiring a multi-source dataset and generating user tags and feature sets, combining this with a clustering algorithm to segment users, and finally obtaining accurate segmentation results that meet business needs through validation and adjustment. Basic attribute tags, electricity consumption feature tags, and payment behavior tags are generated using an association rule algorithm to comprehensively characterize user features. Combined with electricity consumption fluctuation patterns, payment timeliness, and payment strategy response characteristics, fine-grained feature descriptions are provided, improving the accuracy of segmentation. Furthermore, by eliminating redundant features, filtering irrelevant features, and using principal component analysis for dimensionality reduction, the computational efficiency and accuracy of the segmentation model are further improved. The optimal K value for the clustering algorithm is dynamically determined and optimized using the intra-cluster sum of squares curve and silhouette coefficient to ensure reasonable and stable clustering results. Multiple checks and adjustments to the initial clustering results ensure the concentration of important users and the rationality of clusters, preventing the emergence of abnormal clusters and improving the practicality of the clustering results. 
A periodic update mechanism and priority rules enable dynamic adjustment of clustering results and rapid clustering of new users, adapting to changes in user behavior and business needs. Simultaneously, unmatched and abnormal users are identified and optimized, improving the stability and scalability of the clustering system and providing effective support for the refined operation of expense control.

[0026] The technical solutions of the embodiments of this application will be described in detail below through specific examples.

[0027] Referring to Figure 1, the user segmentation method in this application includes the following steps: Step S101: obtain a multi-source dataset; Step S102: generate user tags and a feature set based on the multi-source dataset; Step S103: using a clustering algorithm, perform clustering based on the user tags and the feature set to obtain initial clustering results; Step S104: verify the initial clustering results and adjust them based on the verification results to obtain the final clustering results.

[0028] Regarding step S101, in this embodiment, acquiring a multi-source dataset is the foundation for achieving accurate user segmentation in this application. By constructing a multi-system integration interface, the data dispersion problem caused by data silos in traditional expense control systems is solved, achieving comprehensive collection and fusion of multi-dimensional data. Specifically, this step extracts data from the Marketing 2.0 system, the Electricity Consumption 2.0 system, the 95598 work order system, and business systems such as expense control strategy tables and payment ledgers, covering various core data such as user basic profiles, electricity consumption behavior, and payment records, forming a complete multi-source dataset.

[0029] The basic profile data extracted from the Marketing 2.0 system includes user account number, name, contact information, importance level (e.g., Special Grade, Level 1, Level 2, or Temporary important user), expense control whitelist flag, and primary account flag. This data helps identify expense control whitelist users, group users, and other important users, providing a basis for formulating differentiated strategies in subsequent user segmentation. The electricity consumption behavior data provided by the Electricity Consumption 2.0 system includes users' daily electricity consumption, monthly electricity consumption, electricity consumption time distribution (peak, flat, and off-peak consumption percentages), electrical equipment type, and zero-electricity records for the past 12 months. This data is used to analyze users' electricity consumption patterns, such as peak consumption periods, fluctuation ranges, and consumption stability, providing more accurate behavioral descriptions for user segmentation.

[0030] The request record data extracted from the 95598 work order system includes user-submitted overdue power restoration work orders and execution records of power outage and restoration related strategies. This data is used to analyze the frequency and behavioral patterns of user requests in the collection strategy, helping to identify users with high demands or user behavior characteristics with abnormal strategy execution. The payment record data extracted from the cost control strategy table and payment ledger covers the payment amount, payment time, payment method (such as online, offline, automatic deduction), and payment delay days for the past 12 months. This data is used to analyze users' payment habits and timeliness characteristics, and further combined with collection response time, to characterize users' responsiveness to collection strategies.

[0031] In acquiring multi-source data, a unified data interface and a structured storage method are designed to ensure the integrity and consistency of data collection. For example, user profile information is obtained through the user data interface of the Marketing 2.0 system, electricity consumption behavior data through the electricity consumption data interface of the Electricity Consumption 2.0 system, and request records through the work order interface of the 95598 work order system. The collected data is integrated in a unified structured form, standardizing the storage of data from different systems to facilitate subsequent cleaning, processing, and clustering calculations. To keep the data current, collection is refreshed every 30 days so that the latest changes in user behavior are captured.

[0032] Through the above steps, the obtained multi-source dataset comprehensively covers users' basic attributes, electricity consumption behavior characteristics, payment records, and related behavioral patterns, providing high-quality data support for clustering. Overall, step S101 addresses the shortcomings of traditional clustering methods, such as limited data sources and incomplete information, and ensures the real-time nature and accuracy of the data through a dynamic update mechanism, laying a solid foundation for subsequent clustering processing in this application.

[0033] Furthermore, in some embodiments, data preprocessing is required on the acquired multi-source datasets. Specifically, to ensure the integrity, consistency, and applicability of the multi-source datasets, data preprocessing is necessary. By designing a three-level processing flow of "cleaning-standardization-normalization," problems such as missing data, formatting issues, and magnitude differences are addressed, thereby improving data quality and providing high-quality data input for subsequent clustering processing. Data preprocessing is a crucial step in achieving the accuracy and efficiency of the clustering model, and its specific process is as follows.

[0034] First, the multi-source dataset is cleaned to remove null values, outliers, and duplicate data, ensuring data integrity and reliability. For example, for user electricity consumption data, records whose consumption value falls outside the normal range (e.g., deviating from the mean by more than 3 standard deviations) are treated as outliers and removed. Users with no electricity consumption records (e.g., users with consecutive zero consumption) are marked separately rather than removed. These cleaning operations reduce the interference of noisy data with the clustering results. Furthermore, for users with multiple duplicate records, key fields (such as user ID and timestamp) are compared so that only the latest valid record is retained, avoiding the impact of redundant data on subsequent analysis.
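The outlier and duplicate handling described above can be sketched in Python as follows. The function names, record fields, and sample readings are hypothetical, and timestamps are assumed to be ISO-formatted strings so that string comparison orders them chronologically.

```python
from statistics import mean, pstdev

def remove_outliers(values, n_sigma=3.0):
    """Drop values that deviate from the mean by more than n_sigma standard deviations."""
    mu, sigma = mean(values), pstdev(values)
    return [v for v in values if abs(v - mu) <= n_sigma * sigma]

def keep_latest(records):
    """Keep one record per user ID: the one with the latest timestamp."""
    latest = {}
    for record in records:
        current = latest.get(record["user_id"])
        if current is None or record["timestamp"] > current["timestamp"]:
            latest[record["user_id"]] = record
    return list(latest.values())

# Hypothetical daily readings (kWh) with one implausible spike.
readings = [9.0, 10.0, 11.0] * 6 + [500.0]
cleaned = remove_outliers(readings)

records = [
    {"user_id": "A", "timestamp": "2025-01-05", "kwh": 10.0},
    {"user_id": "A", "timestamp": "2025-02-05", "kwh": 12.0},
    {"user_id": "B", "timestamp": "2025-01-20", "kwh": 8.0},
]
deduplicated = keep_latest(records)
```

Note that a 3-standard-deviation rule only flags anything when the sample is reasonably large, since a single extreme value also inflates the standard deviation it is measured against.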

[0035] After data cleaning, the data format needs to be standardized to address the issue of inconsistent formats across multiple data sources. For date-based data, all records are uniformly converted to the "YYYY-MM-DD" format for easier time series analysis. For monetary data, it is uniformly retained to two decimal places to ensure consistent numerical precision. For categorized data (such as user importance levels), it is converted to a numerical encoding format; for example, "Special Grade" corresponds to the code value 4, "Level 1" to 3, "Level 2" to 2, and "Temporary" to 1. This standardization process not only improves data consistency but also facilitates subsequent algorithmic processing and calculations.
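A minimal sketch of these three standardization rules follows. The level-to-code mapping is taken from the paragraph above; the list of accepted input date formats and the half-up rounding mode are assumptions.

```python
from datetime import datetime
from decimal import Decimal, ROUND_HALF_UP

# Importance level encoding as given in the text.
LEVEL_CODES = {"Special Grade": 4, "Level 1": 3, "Level 2": 2, "Temporary": 1}

# Input formats assumed to occur in the source systems.
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y%m%d")

def standardize_date(text):
    """Convert a date string in any accepted format to YYYY-MM-DD."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")

def standardize_amount(value):
    """Round a monetary value to two decimal places."""
    return Decimal(str(value)).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

record = {"paid_on": "2025/03/07", "amount": 12.345, "level": "Level 1"}
standardized = {
    "paid_on": standardize_date(record["paid_on"]),
    "amount": standardize_amount(record["amount"]),
    "level": LEVEL_CODES[record["level"]],
}
```

`Decimal` is used for amounts so that the two-decimal rounding is exact rather than subject to binary floating-point error.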

[0036] After cleaning and standardization, the data needs to be normalized to address the differences between features of different magnitudes and ensure fairness among features. For continuous features (such as electricity consumption, payment delay days, etc.), the Min-Max normalization method is used to map them to the [0,1] interval. Through normalization, different feature values are adjusted to the same magnitude range, avoiding excessive weighting of certain features due to their large magnitude, thereby improving the accuracy and computational efficiency of the clustering algorithm.
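Min-Max normalization maps each value v to (v - min) / (max - min). A short Python sketch, in which the handling of a constant column (mapped to 0) is an assumption not stated in the text:

```python
def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

monthly_kwh = [100.0, 300.0, 500.0]
delay_days = [0.0, 2.0, 8.0]

normalized_kwh = min_max_normalize(monthly_kwh)
normalized_delay = min_max_normalize(delay_days)
```

After scaling, both features span [0, 1], so a distance-based clustering algorithm no longer weights consumption more heavily simply because its raw values are larger.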

[0037] Data preprocessing, through a three-stage process of cleaning, standardization, and normalization, not only eliminates anomalies and redundant information from multi-source data but also ensures consistency in data format and magnitude, laying a solid foundation for subsequent clustering. This process ensures high quality and high consistency of input data, solving the problem of unstable clustering results due to poor data quality in traditional clustering methods, and providing reliable data support for the technical solution of this application.

[0038] Furthermore, in step S102, user labels and feature sets are generated based on the multi-source dataset.

[0039] In some embodiments, generating user tags based on the multi-source dataset includes: The association rule algorithm is used to analyze the association relationship between electricity consumption behavior, electricity consumption habits and payment records in the multi-source dataset to generate the user tags; the user tags include basic attribute tags, electricity consumption feature tags and payment behavior tags.

[0040] In some embodiments, the feature set includes electricity consumption fluctuation pattern features, payment timeliness features, and payment strategy response features; the electricity consumption fluctuation pattern features are used to describe the user's electricity consumption behavior patterns; the payment timeliness features are used to describe the user's payment behavior and response capability; and the payment strategy response features are used to describe the user's response to payment reminder strategies.

[0041] In this embodiment, generating user tags and feature sets based on the multi-source dataset is a crucial step in the user segmentation method. Its purpose is to comprehensively characterize the user's overall attributes and behavioral features through in-depth analysis and processing of multi-dimensional data, providing high-quality input for subsequent segmentation models. In some embodiments, the process of generating user tags includes using association rule algorithms to analyze the correlation between users' electricity consumption behavior, electricity consumption habits, and payment records in the multi-source dataset, extracting core information that reflects user characteristics, and generating three types of tags: basic attribute tags, electricity consumption feature tags, and payment behavior tags. In other embodiments, based on the generated user tags, a feature set is further extracted, including electricity consumption fluctuation patterns, payment timeliness features, and payment strategy response features. These features are used to accurately describe the user's behavioral patterns and responsiveness to payment collection strategies.

[0042] When generating user tags, analyzing multi-source data using association rule algorithms is crucial. By analyzing the potential correlations between electricity consumption behavior, habits, and payment records, tags that effectively characterize user traits can be extracted. For example, by analyzing electricity consumption data and fluctuation range over 12 consecutive months, electricity consumption feature tags can be generated, such as high-stability electricity consumption tags (e.g., fluctuation range less than 10%), high-fluctuation electricity consumption tags (e.g., fluctuation range greater than 30%), heating season electricity consumption tags (e.g., winter electricity consumption significantly higher than non-heating season), and zero-electricity user tags (e.g., users who have not used electricity for an extended period). These electricity consumption feature tags can be used to distinguish users with different electricity consumption patterns. For example, high-fluctuation electricity consumption may correspond to rental users or scenarios with frequent short-term use, while high-stability electricity consumption typically corresponds to ordinary household users with relatively stable daily electricity needs.
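The threshold examples in this paragraph can be sketched as simple tagging rules. This is not the association rule algorithm itself; it only illustrates how the example thresholds might translate into tags. The fluctuation range is assumed to be (max - min) / mean over 12 months, and the heating months and the 1.5 ratio for "significantly higher" are hypothetical.

```python
def consumption_tags(monthly_kwh, heating_months=(11, 12, 1, 2, 3), heating_ratio=1.5):
    """Derive consumption tags from a {month number: kWh} mapping covering 12 months."""
    values = list(monthly_kwh.values())
    if sum(values) == 0:
        return ["zero_consumption"]
    tags = []
    average = sum(values) / len(values)
    fluctuation = (max(values) - min(values)) / average  # assumed definition
    if fluctuation < 0.10:
        tags.append("high_stability")
    elif fluctuation > 0.30:
        tags.append("high_fluctuation")
    heating = [v for m, v in monthly_kwh.items() if m in heating_months]
    other = [v for m, v in monthly_kwh.items() if m not in heating_months]
    if heating and other and sum(other) > 0:
        ratio = (sum(heating) / len(heating)) / (sum(other) / len(other))
        if ratio >= heating_ratio:
            tags.append("heating_season")
    return tags

stable_user = {m: 100.0 for m in range(1, 13)}
stable_user[6] = 104.0
heating_user = {m: (300.0 if m in (11, 12, 1, 2, 3) else 100.0) for m in range(1, 13)}
idle_user = {m: 0.0 for m in range(1, 13)}

stable_tags = consumption_tags(stable_user)
heating_tags = consumption_tags(heating_user)
idle_tags = consumption_tags(idle_user)
```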

[0043] By analyzing users' basic profile data, fundamental attribute tags can be generated. For example, tags such as "Special Grade," "Level One," "Level Two," and "Temporary Important User" can be extracted from the user importance level field to differentiate user importance; tags for "Fee Control Whitelist Users" can be extracted from the fee control whitelist flag field to identify specially protected users unaffected by payment reminders and power outages; and tags for "Group Users" can be extracted by combining the main account flag field to distinguish between large electricity-consuming groups and ordinary users. Furthermore, by combining users' business background data, tags for rental housing users and ventilator users can be generated. The former identifies short-term rental users with significant electricity consumption fluctuations, while the latter identifies medical equipment users requiring special protection. These fundamental attribute tags provide a basis for subsequent differentiated strategies during user segmentation.

[0044] By analyzing user payment records, payment behavior tags can be generated. For example, by analyzing a user's payment records over the past 12 months and their payment timeliness, tags can be generated for on-time payers (e.g., users with 0 days of delay), low-frequency delay users (e.g., a monthly average of 1 or fewer delays), and high-frequency delay users (e.g., a monthly average of 2 or more delays). Furthermore, by analyzing user complaint records and the implementation of collection strategies, tags can be generated for complaint users (e.g., users who have submitted one or more complaints in the past 3 months) and strategy failure users (e.g., users for whom a power outage or restoration strategy failed in the past month). These payment behavior tags not only reflect users' payment habits but also reveal their response patterns to collection strategies.
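
A minimal sketch of these payment behavior tags follows. It assumes the input is the number of delayed payments in each of the past 12 months; the function name, parameter names, and tag strings are hypothetical. The example thresholds leave averages strictly between 1 and 2 uncovered, so the sketch leaves such users without a delay tag rather than guessing.

```python
def payment_tags(delays_per_month, complaints_last_3_months=0,
                 strategy_failed_last_month=False):
    """Derive payment behavior tags from monthly delay counts."""
    tags = []
    avg = sum(delays_per_month) / len(delays_per_month)
    if avg == 0:
        tags.append("on_time")
    elif avg <= 1:
        tags.append("low_frequency_delay")
    elif avg >= 2:
        tags.append("high_frequency_delay")
    # Averages strictly between 1 and 2 fall outside the example thresholds.
    if complaints_last_3_months >= 1:
        tags.append("complaint")
    if strategy_failed_last_month:
        tags.append("strategy_failure")
    return tags
```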

[0045] Based on the generated user tags, a feature set is further extracted to comprehensively describe the user's behavioral patterns and responsiveness. This feature set includes electricity consumption fluctuation patterns, payment timeliness characteristics, and payment strategy responsiveness characteristics. Among these, the electricity consumption fluctuation patterns characterize the user's electricity consumption behavior patterns, including the average electricity consumption over the past 12 months, electricity consumption variance, peak-to-valley ratio (e.g., the ratio of peak-hour electricity consumption to valley-hour electricity consumption), and seasonal fluctuation coefficient (e.g., the ratio of heating season electricity consumption to non-heating season electricity consumption). These features can quantitatively characterize the user's electricity consumption patterns. For example, the peak-to-valley ratio reflects the user's electricity load characteristics, and the seasonal fluctuation coefficient can distinguish between coal-to-electricity users whose electricity consumption increases significantly during the heating season and ordinary users with relatively stable electricity consumption patterns.
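
The four fluctuation features can be computed as in the sketch below. The choice of heating months, the use of population variance, and the use of average monthly consumption on both sides of the seasonal ratio (so that seasons of different length are comparable) are assumptions made for illustration, not details given in the text.

```python
from statistics import mean, pvariance

HEATING_MONTHS = (11, 12, 1, 2, 3)  # assumption: heating season by calendar month

def consumption_features(monthly_kwh, peak_kwh, valley_kwh,
                         heating_months=HEATING_MONTHS):
    """Quantify a user's consumption pattern from 12 monthly readings (index 0 = January)."""
    heating = [monthly_kwh[m - 1] for m in heating_months]
    other = [v for i, v in enumerate(monthly_kwh, start=1)
             if i not in heating_months]
    other_mean = mean(other)
    return {
        "mean_kwh": mean(monthly_kwh),
        "variance_kwh": pvariance(monthly_kwh),
        # Ratio of peak-hour to valley-hour consumption; None if undefined.
        "peak_valley_ratio": peak_kwh / valley_kwh if valley_kwh else None,
        # Ratio of average heating-season month to average non-heating month.
        "seasonal_coefficient": mean(heating) / other_mean if other_mean else None,
    }
```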

[0046] Payment timeliness characteristics are used to describe a user's payment behavior and responsiveness, including the average number of days of payment delay over the past 12 months, the maximum number of days of payment delay, the percentage of on-time payments, and the number of times a payment is overdue. For example, the average number of days of payment delay can reflect whether a user habitually delays payments, while the percentage of on-time payments reflects the stability of a user's payment habits. Combining these characteristics allows for an objective assessment of a user's payment behavior patterns, providing important reference for subsequent payment collection strategies.
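
As a brief illustration, the four timeliness features can be derived from a list holding the number of days each bill was paid late (0 meaning on time). The input layout and all names are assumptions for the sketch.

```python
def timeliness_features(delay_days):
    """delay_days: one entry per bill over the past 12 months, in days late."""
    n = len(delay_days)
    overdue = sum(1 for d in delay_days if d > 0)
    return {
        "avg_delay_days": sum(delay_days) / n,
        "max_delay_days": max(delay_days),
        "on_time_ratio": (n - overdue) / n,
        "overdue_count": overdue,
    }
```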

[0047] Payment strategy response characteristics are used to describe users' responses to payment reminder strategies, including the response rate (e.g., the ratio of responses to received messages) and average response time (e.g., the time interval from receiving the message to payment) over the past three months. These characteristics can quantify users' sensitivity and compliance with payment reminder strategies. For example, users with high response rates and short average response times may not require multiple reminders, while users with low response rates may require more frequent reminders or even power outage strategies.
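
The two response features can be sketched as follows, assuming each reminder sent in the past three months is recorded as the number of hours until payment, or None when the user did not respond. The record layout and names are hypothetical.

```python
def response_features(hours_to_payment):
    """hours_to_payment: one entry per reminder; None means no response."""
    responded = [h for h in hours_to_payment if h is not None]
    sent = len(hours_to_payment)
    return {
        "response_rate": len(responded) / sent if sent else 0.0,
        # Average response time is undefined when nobody responded.
        "avg_response_hours": sum(responded) / len(responded) if responded else None,
    }
```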

[0048] Through the steps described above, user labels and feature sets generated from multi-source datasets comprehensively characterize users' basic attributes, electricity consumption patterns, payment habits, and policy response capabilities, providing high-quality input for subsequent clustering processing. This process not only solves the problems of single user characterization and weak feature correlation in traditional clustering methods, but also achieves accurate quantification of user behavior through deep binding of labels and features, laying a solid foundation for the efficient operation of the clustering model.

[0049] Furthermore, after obtaining user tags and feature sets, it is necessary to process redundant and irrelevant features in the constructed feature sets and reduce the complexity of the feature sets to facilitate better subsequent clustering processing.

[0050] In some embodiments, the method further includes: analyzing the feature set using the Pearson correlation coefficient and, for each pair of features whose absolute correlation coefficient is greater than a first threshold, removing one feature of the pair to obtain a first set; filtering the first set using analysis of variance, removing features with variance values less than a second threshold to obtain a second set; performing dimensionality reduction processing on the second set using principal component analysis, retaining principal components whose cumulative variance contribution rate is greater than a third threshold to obtain a third set; and performing clustering processing based on the third set.

[0051] In this embodiment, after obtaining user tags and feature sets, in order to further improve the efficiency and accuracy of the clustering model, the constructed feature sets need to be optimized, including removing redundant and irrelevant features, and reducing the complexity of the feature sets. Through these processing steps, the feature dimensions can be effectively reduced, and the discriminative power of the features can be improved, thereby providing a more concise and efficient input for subsequent clustering processing.

[0052] In the process of optimizing the feature set, Pearson correlation coefficient analysis is one of the important steps in eliminating redundant features. The Pearson correlation coefficient measures the linear correlation between two features, with a value ranging from -1 to 1. By calculating the correlation coefficient between each pair of features, it can be determined whether they contain redundant information. Specifically, when the absolute value of the Pearson correlation coefficient is close to 1, it indicates that the two features are highly correlated or approximately identical, while when the absolute value is close to 0, it indicates that the two features are weakly correlated or even unrelated. In an embodiment of this invention, a first threshold (e.g., 0.8) is set. When the absolute value of the correlation coefficient between two features is greater than this threshold, it is considered that there is a high correlation between the two features, and one of them needs to be eliminated to avoid interference from redundant information.

[0053] In practice, for each pair of features, the Pearson correlation coefficient is calculated. For example, suppose there are two features, "average delay days" and "on-time payment percentage." By calculating their correlation coefficient, a high negative correlation (close to -1) can be found between them. A negative correlation means that when the "average delay days" is large, the "on-time payment percentage" is usually small, and vice versa. In other words, these two features convey the same information to some extent, only in opposite directions. Therefore, to avoid redundancy and duplication between features, one feature needs to be retained while the other is removed.

[0054] When removing highly correlated features, the choice of which feature to retain is not random, but rather based on a comprehensive consideration of its business interpretability, data distribution stability, and contribution to the model. For example, between the features "average delay days" and "on-time payment percentage," "average delay days" is prioritized. This is because "average delay days" is a specific, continuous numerical feature that clearly quantifies users' payment delay behavior and has a relatively stable distribution; while "on-time payment percentage" is a proportional feature. Although it also reflects users' payment habits, its calculation depends on the number of payments, and it may fluctuate due to the strong dispersion of data distribution. Furthermore, from a business perspective, "average delay days" is more intuitive and easier to understand. Business personnel can directly use this feature to determine whether users frequently delay payments and the severity of the delays. "On-time payment percentage," as a relative indicator, is more abstract to interpret, hence the priority of retaining "average delay days." This choice ensures that the retained features are not only useful for model calculations but also provide business personnel with intuitive interpretation.
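
The pairwise filtering and the business-driven choice of which feature to keep can be sketched as below. The priority list stands in for the business judgment described above and is an assumption, as are the function names and the sample data; the 0.8 threshold is the example value from the text.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature has no defined correlation
    return cov / (sx * sy)

def drop_redundant(features, priority, threshold=0.8):
    """Keep a feature only if it is not highly correlated with one already kept.

    features: dict mapping feature name to a list of values.
    priority: feature names, most preferred (most interpretable) first.
    """
    kept = []
    for name in priority:
        if all(abs(pearson(features[name], features[k])) <= threshold
               for k in kept):
            kept.append(name)
    return kept
```

Because features are visited in priority order, "average delay days" survives and "on-time payment percentage" is dropped whenever the two are strongly (negatively) correlated, matching the preference stated above.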

[0055] In processing large-scale feature sets, calculating Pearson correlation coefficients pairwise and removing features whose absolute correlation coefficients exceed a first threshold not only effectively reduces redundant information but also significantly lowers the feature dimensionality, ensuring computational efficiency for subsequent clustering model operations. Furthermore, this process ensures that the ultimately retained features are relatively independent, avoiding interference from redundant features causing information duplication in the clustering results. For example, if two highly correlated features are retained simultaneously, the clustering model may assign excessive weight to one type of information, thus reducing the diversity and accuracy of the clustering results. By removing redundant features through Pearson correlation coefficient analysis, the model input data becomes more concise and effective, laying a solid foundation for further optimizing the feature set and improving clustering performance.

[0056] Therefore, using Pearson correlation coefficient analysis is an indispensable and crucial step in feature optimization. By setting a first threshold and removing highly correlated features, we can not only reduce feature redundancy and improve the independence of the feature set, but also select and retain more interpretable and applicable features based on business logic, providing higher-quality input for subsequent clustering processing.

[0057] After removing redundant features, to further filter out features highly relevant to the grouping objective, it is necessary to use analysis of variance (ANOVA) to evaluate the importance of the features. ANOVA is a statistical method that measures the discriminative power of a feature for grouping by comparing the ratio of between-group variance to within-group variance. In the embodiments of this invention, the main purpose of ANOVA is to filter out features that can significantly reflect differences in user behavior, thereby eliminating irrelevant features with little impact on the grouping objective.

[0058] Specifically, the principle of analysis of variance (ANOVA) is to calculate the ratio of the between-group variance to the within-group variance for each feature across different classes of samples. The between-group variance reflects the degree of difference between samples from different classes, while the within-group variance reflects the degree of variability within samples from the same class. If the between-group variance of a feature is significantly greater than the within-group variance, it indicates that the feature has a significant discriminatory power between different classes and has a large impact on the grouping results. Conversely, if the between-group variance is small or close to the within-group variance, it indicates that the feature cannot effectively distinguish between samples from different classes and has a low contribution to the grouping objective.
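
The ratio described here is the one-way ANOVA F statistic. The sketch below computes it for a single feature. Note that ANOVA needs class labels, and the text does not say where they come from before clustering; the sketch simply assumes the values have already been split into provisional groups (for example, by an existing user tag).

```python
def f_statistic(groups):
    """One-way ANOVA F ratio for one feature.

    groups: list of lists, each holding the feature values of one class.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: how far class means sit from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: spread of values around their own class mean.
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return float("inf") if ms_within == 0 else ms_between / ms_within
```

A large F value means the class means are far apart relative to the spread inside each class; an F value near zero means the feature looks the same in every class.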

[0059] To quantify these differences, this invention sets a second threshold (e.g., 0.01) to determine whether the feature variance is sufficiently significant. Specifically, the variance value of each feature is calculated and compared to the second threshold. If the variance value of a feature is less than the threshold, it is considered that the feature has no significant impact on the grouping objective and is therefore removed. For example, non-behavioral fields such as user names and contact information typically have values that are fixed or have very small variations throughout the dataset. Therefore, the variance values of these features are often far below the threshold, indicating that they have a weak ability to describe the differences in user behavior and will be considered irrelevant features and deleted.
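
The threshold comparison in this paragraph amounts to a variance filter, sketched below. Since variance depends on a feature's scale, the sketch assumes the features were first scaled to a common range such as [0, 1]; that preprocessing step and the names used are assumptions.

```python
from statistics import pvariance

def drop_low_variance(features, threshold=0.01):
    """Keep only features whose variance is at least the second threshold.

    features: dict mapping feature name to a list of scaled values.
    """
    return {name: values for name, values in features.items()
            if pvariance(values) >= threshold}
```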

[0060] Furthermore, in practical applications, ANOVA can also capture some implicit irrelevant features. For example, if the value ranges of a feature completely overlap between different categories (i.e., there is no significant difference), then even if its variance value is not below the second threshold, it will still be identified and removed by the ANOVA process because it cannot provide additional discriminative information. This process can effectively filter out features that do not contribute to the grouping objective or contribute very little, thereby further improving the quality of the feature set.

[0061] By using analysis of variance (ANOVA), features that are irrelevant or contribute little to the clustering objective are eliminated. The retained features more accurately reflect the differences between user behavior patterns and categories. This not only helps improve the accuracy of the clustering results but also further reduces the feature dimensionality, creating more optimized conditions for subsequent dimensionality reduction and clustering algorithm operation. The application of ANOVA combines statistical methods with business requirements, ensuring that the final retained features are strongly correlated with the clustering objective, making it an indispensable and crucial step in the feature optimization process.

[0062] After removing redundant and irrelevant features, the complexity of the feature set has been significantly reduced. However, to further simplify the feature dimensions and improve the efficiency of clustering calculations, dimensionality reduction processing is still needed for the selected feature set. Principal Component Analysis (PCA) is an important dimensionality reduction method in the process of optimizing the feature set. It aims to map data from a high-dimensional feature space to a low-dimensional space through linear transformation, while retaining as much information as possible from the original data. PCA not only reduces the dimensionality of features and lowers computational complexity but also eliminates correlations between original features, making the final feature set more concise and efficient. In the embodiments of this invention, PCA is used to perform dimensionality reduction processing on the selected feature set, generating a new set of orthogonal features (i.e., principal components), and ensuring that the cumulative variance contribution rate is greater than a set third threshold (e.g., 85%), thereby retaining most of the important information.

[0063] The core of principal component analysis (PCA) is to perform eigenvalue decomposition on the covariance matrix of the original feature data to find a new set of orthogonal coordinate axes (principal components). These axes are ordered according to the magnitude of the data variance they capture. Specifically, each principal component is a linear combination (weighted sum) of the original features. The first principal component captures the direction of the largest variance in the data, each subsequent component captures the largest remaining variance, and the components are orthogonal to each other (i.e., uncorrelated). This orthogonality effectively eliminates linear correlations between the original features, thus avoiding feature redundancy.

[0064] In practice, the data in the feature set first needs to be standardized. For example, normalization or Z-score standardization can be used to adjust the value range of all features to the same order of magnitude (e.g., mean 0, standard deviation 1). The purpose of standardization is to eliminate dimensional differences between different features, so that all features contribute to the principal component analysis on a comparable scale. Next, based on the standardized feature data, the covariance matrix is calculated. The covariance matrix is a symmetric matrix where each element represents the covariance between two features, reflecting their linear relationship.

[0065] Subsequently, eigenvalue decomposition is performed on the covariance matrix to obtain a set of eigenvalues and corresponding eigenvectors. Eigenvalues represent the amount of variance information contained in each principal component, while eigenvectors represent the orientation of the principal components in the original feature space. The eigenvectors are sorted in descending order of eigenvalues; the larger the eigenvalue, the more important the corresponding principal component, and the more variance information it can capture. In embodiments of this invention, the principal components with the largest eigenvalues are selected to ensure that the cumulative variance contribution rate of these principal components is greater than a third threshold (e.g., 85%). The cumulative variance contribution rate represents the proportion of variance in the original data that the retained principal components can explain. For example, if the cumulative variance contribution rate of the first 5 principal components is 90%, it means that these 5 principal components can explain 90% of the information in the original data.
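
Given the eigenvalues of the covariance matrix, the number of principal components to retain can be found as in the sketch below. It covers only the selection step (the eigenvalue decomposition itself is assumed to come from a numerical library); the function name and the sample eigenvalues are hypothetical.

```python
def components_to_keep(eigenvalues, threshold=0.85):
    """Return (count, ratio): the smallest number of leading principal
    components whose cumulative variance contribution rate exceeds threshold."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    cumulative = 0.0
    for count, value in enumerate(ordered, start=1):
        cumulative += value
        if cumulative / total > threshold:
            return count, cumulative / total
    return len(ordered), 1.0
```

For eigenvalues such as 4.2, 2.0, 1.3, 0.9, 0.6, 0.5, 0.3, 0.2, the first four components explain 84% of the variance, so a fifth is needed to pass the 85% threshold, giving a cumulative rate of about 90%.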

[0066] After principal component selection, the original high-dimensional feature data is projected onto these principal components to generate new low-dimensional feature data. The new principal component data has several significant characteristics: First, all principal components are orthogonal, eliminating linear correlations between the original features and ensuring that the new features are uncorrelated; second, the low-dimensional feature data retains most of the important information from the original data, reducing information loss; third, the principal components are linear combinations of the original features, so they can still be traced back to the business meaning of the original data. For example, the original 15+ dimensional features may be reduced to 5-8 dimensions after principal component analysis. Although the number of features is reduced, the generated principal components can comprehensively reflect information such as electricity consumption fluctuation patterns, payment timeliness, and payment strategy response.

[0067] Another advantage of Principal Component Analysis (PCA) is its ability to improve the efficiency and stability of subsequent clustering algorithms. In high-dimensional feature spaces, the data distribution can become sparse, making traditional clustering algorithms (such as K-means) susceptible to noise and irrelevant features when calculating distances. PCA, by reducing dimensionality, significantly lowers the feature space's dimension, resulting in a more compact data distribution. This allows clustering algorithms to calculate distances between samples more efficiently and generate more stable clustering results. Furthermore, because PCA prioritizes information based on feature importance, clustering algorithms can utilize the most important principal components, thereby enhancing the accuracy of grouping.

[0068] Principal component analysis (PCA) not only reduces the dimensionality of the feature set but also preserves the most important information in the data and eliminates correlations and redundancies in the original feature set. This process effectively solves the problems of high computational complexity and model instability caused by high-dimensional features in traditional clustering methods, while ensuring the quality of the data input required for clustering, providing more efficient feature support for the clustering method of this invention. Finally, the optimized low-dimensional feature set can significantly improve the efficiency and accuracy of clustering, ensuring the interpretability and applicability of the clustering results in practical business applications.

[0069] Furthermore, in step S103, a clustering algorithm is used to perform clustering based on the user tags and the feature set to obtain initial clustering results.

[0070] In some embodiments, the step of using a clustering algorithm to perform clustering processing based on the user tags and the feature set to obtain an initial clustering result includes: determining an initial search interval for the K value; randomly initializing the centroid for each K value and recording the intra-cluster sum of squares and silhouette coefficient corresponding to each centroid; drawing an intra-cluster sum of squares curve based on the intra-cluster sum of squares corresponding to each K value and determining the inflection point of the intra-cluster sum of squares curve; obtaining the K value corresponding to the inflection point and using the silhouette coefficient to help confirm the optimal K value; and performing clustering based on the optimal K value to obtain the initial clustering result.

[0071] In this embodiment, step S103, which involves using a clustering algorithm to perform clustering based on the user tags and the feature set to obtain initial clustering results, is one of the core processes for user clustering in this invention. In this process, by combining and analyzing the user tags and feature set, a clustering algorithm is used to effectively classify users according to their behavioral patterns, providing a foundation for subsequent clustering optimization and business rule verification. In some embodiments, to improve the scientific nature of clustering and the stability of the clustering results, it is first necessary to determine an appropriate number of clusters (i.e., the K value), and then execute the clustering algorithm based on the optimal K value to generate initial clustering results.

[0072] In clustering, determining the K value is a crucial step in the clustering algorithm. The K value represents the number of clusters, and its magnitude directly affects the clustering effect and business adaptability. If the K value is too small, the clustering results may be too coarse, failing to accurately characterize the behavioral features of different user groups; if the K value is too large, the clusters may be too scattered, making it difficult to translate into practical business strategies. Therefore, in the embodiments of this invention, a method for dynamically determining the K value is adopted. Through a scientific algorithm evaluation process, the instability caused by traditionally setting the K value based on experience is avoided.

[0073] Specifically, the first step is to determine the initial search range for the K value. Based on the business scenario and expense control requirements, and combined with the analysis of user behavior data, a reasonable range for the K value is initially defined. For example, in an expense control scenario, users are typically divided into a small number of groups to distinguish between core protected users, ordinary users, and high-risk users; therefore, the search range for the K value can be set to 5 to 9. Next, for each candidate K value, multiple rounds of K-means clustering experiments are performed. In each experiment, K cluster centroids are randomly initialized, and the distance between each sample point and the centroid is calculated based on the set user labels and feature sets. The sample point is then assigned to the cluster corresponding to the nearest centroid. After one iteration, the centroid position of each cluster is calculated based on the newly assigned clusters, and the above process is repeated until the position of the centroid no longer changes significantly or the preset number of iterations is reached.
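
The iteration described above is the standard K-means loop. The following is a minimal pure-Python sketch, assuming Euclidean distance and feature vectors given as tuples; the function names, the tolerance, and the choice to keep the lowest-SSE run among several random initializations are illustrative assumptions.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points]

def kmeans(points, k, max_iter=100, tol=1e-6, seed=0):
    """Return (labels, centroids, sse) for one randomly initialized run."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members)
                                      for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        shift = max(squared_distance(a, b)
                    for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:  # centroids no longer move significantly
            break
    labels = assign(points, centroids)
    sse = sum(squared_distance(p, centroids[lab])
              for p, lab in zip(points, labels))
    return labels, centroids, sse

def best_of(points, k, rounds=10):
    """Run several random initializations and keep the run with the lowest SSE."""
    return min((kmeans(points, k, seed=s) for s in range(rounds)),
               key=lambda result: result[2])
```

Calling `best_of(points, k)` for each K in the search range yields the SSE values needed to draw the SSE-K curve.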

[0074] After each K-means clustering experiment, the intra-cluster sum of squares (SSE) and the silhouette coefficient (SC) corresponding to each K value need to be recorded. The SSE is an important indicator of clustering effectiveness, reflecting the tightness of the samples within each cluster. For a given K, a smaller SSE value indicates higher similarity among samples within a cluster and better clustering results. The silhouette coefficient assesses the overall quality of clustering, with a value ranging from -1 to 1. It compares how similar each sample is to its own cluster with how different it is from the other clusters, thereby measuring both the tightness and the separation of clusters; a value closer to 1 indicates better clustering results.
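
For reference, the two indicators have the following standard definitions. The notation is introduced here for illustration and Euclidean distance is assumed: $C_j$ is the $j$-th cluster with centroid $\mu_j$, $N$ is the number of users, $a(i)$ is the mean distance from sample $i$ to the other samples in its own cluster, and $b(i)$ is the smallest mean distance from sample $i$ to the samples of any other cluster.

```latex
\mathrm{SSE}(K) = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}

s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
\qquad
\mathrm{SC} = \frac{1}{N} \sum_{i=1}^{N} s(i)
```

Since $s(i)$ approaches 1 when $a(i) \ll b(i)$ and approaches -1 when $a(i) \gg b(i)$, the average SC lies between -1 and 1 as stated above.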

[0075] Based on the intra-cluster sum of squares recorded for each K value, an SSE-K curve is plotted, and the trend of the curve is observed. Generally, as the K value increases, the SSE value gradually decreases, but the rate of decrease may begin to slow significantly at a certain K value, forming a clear inflection point. This K value is usually considered a better choice because increasing K beyond it still reduces the SSE, but with sharply diminishing marginal benefit. In the embodiments of this invention, by observing the inflection point of the SSE-K curve, a candidate optimal K value can be preliminarily determined. For example, if the SSE decreases by 32% when K increases from 6 to 7 but by only 8% when K increases from 7 to 8, K=7 can be considered a potentially optimal K value.

[0076] After determining the candidate K values, further verification and confirmation using silhouette coefficients are needed. Specifically, the average silhouette coefficient corresponding to each candidate K value is calculated, and their values are compared. The K value with the highest silhouette coefficient is selected as the final optimal K value. For example, within the search interval from K=5 to K=9, the silhouette coefficients for each K value are: K=5 (0.61), K=6 (0.67), K=7 (0.72), K=8 (0.65), and K=9 (0.58). It can be observed that when K=7, the silhouette coefficient reaches the highest value (0.72), indicating that the similarity of samples within the cluster is the highest and the difference between clusters is the most significant. Therefore, K=7 is ultimately selected as the optimal K value.

[0077] However, in some cases, the inflection point of the SSE curve may not match the trend of the silhouette coefficient. For example, suppose in a certain experiment, the SSE curve reaches an inflection point when K=7, but the corresponding silhouette coefficient does not reach a high value, while the silhouette coefficient is higher when K=6 or K=8. In this case, by observing the decrease in SSE from K=6 to K=9, it can be found that K=8 is the next inflection point with the largest decrease. To verify whether K=8 is the optimal K value, the silhouette coefficient corresponding to this K value needs to be recalculated and compared with other candidate K values. If the silhouette coefficient of K=8 is significantly higher than other K values, then K=8 is finally selected as the optimal K value; otherwise, continue to look for the next inflection point with the largest decrease, repeat the above verification process, until a K value that can simultaneously satisfy the requirements of an optimal SSE inflection point and silhouette coefficient is found.

[0078] Taking a specific example, suppose there is a set of user data that needs to be clustered, with an initial search range of K=5 to 9. After performing K-means clustering experiments, the calculated SSE values are: K=5 (4000), K=6 (2700), K=7 (1800), K=8 (1500), and K=9 (1400), with corresponding silhouette coefficients of: K=5 (0.61), K=6 (0.67), K=7 (0.65), K=8 (0.72), and K=9 (0.68). By observing the SSE-K curve, it can be found that the curve shows a significant inflection point at K=7, but the silhouette coefficient at K=7 (0.65) is not the highest value, while the silhouette coefficient at K=8 (0.72) is higher than that of every other K value. Therefore, in this case, K=8 is selected as the optimal K value, and K-means clustering is performed based on K=8 to obtain the initial clustering results.
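
The selection logic of this example can be sketched as follows. The text does not give a formula for locating the inflection point, so the sketch assumes one: rank each interior K by how much the relative SSE decrease slows down immediately after it, then accept the first candidate whose silhouette coefficient is the highest in the search range. The function names, the tolerance parameter, and the fallback to the best silhouette coefficient are assumptions.

```python
def elbow_candidates(sse):
    """Interior K values ordered by how sharply the SSE decrease slows after them.

    sse: dict mapping K to the intra-cluster sum of squares.
    """
    ks = sorted(sse)
    # Relative SSE decrease obtained by moving from the previous K to this K.
    drop = {k: (sse[prev] - sse[k]) / sse[prev] for prev, k in zip(ks, ks[1:])}
    # Slowdown at K: decrease reaching K minus decrease reaching K + 1.
    slowdown = {k: drop[k] - drop[nxt] for k, nxt in zip(ks[1:], ks[2:])}
    return sorted(slowdown, key=slowdown.get, reverse=True)

def select_k(sse, silhouette, tolerance=0.0):
    """Pick the strongest elbow candidate whose silhouette coefficient is best."""
    best_sc = max(silhouette.values())
    for k in elbow_candidates(sse):
        if silhouette[k] >= best_sc - tolerance:
            return k
    # No elbow candidate has the best silhouette: fall back to the best SC.
    return max(silhouette, key=silhouette.get)

sse = {5: 4000, 6: 2700, 7: 1800, 8: 1500, 9: 1400}
silhouette = {5: 0.61, 6: 0.67, 7: 0.65, 8: 0.72, 9: 0.68}
optimal_k = select_k(sse, silhouette)
```

With the figures from this example, the relative decreases are 32.5% (5 to 6), 33.3% (6 to 7), 16.7% (7 to 8), and 6.7% (8 to 9), so K=7 is the strongest inflection candidate and K=8 the next; K=7 is rejected on its silhouette coefficient and `optimal_k` is 8.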

[0079] After determining the optimal value of K, a K-means clustering algorithm is executed based on this value to obtain initial clustering results. In this process, each user is assigned to the cluster most similar to their behavioral pattern based on their corresponding label and feature vector. The initial clustering results identify the cluster number to which each user belongs, providing a foundation for subsequent clustering result verification and business rule verification. Simultaneously, the initial clustering results also reflect the overall distribution characteristics of the user group, such as the distribution of user numbers in different clusters and their core characteristics, providing important basis for analyzing and adjusting the clustering strategy.

[0080] The clustering method based on user tags and feature sets described above not only achieves initial user segmentation but also ensures the stability and scientific validity of the segmentation results by dynamically optimizing the K-value. This avoids the instability caused by blind parameter setting in traditional methods. The initial segmentation results lay a solid foundation for subsequent segmentation optimization and business verification, ensuring that the segmentation results accurately reflect differences in user behavior and providing important technical support for the refined management of expense control.

[0081] Furthermore, in step S104, the initial clustering result is verified, and the initial clustering result is adjusted based on the verification result to obtain the final clustering result.

[0082] In some embodiments, the step of verifying the initial clustering result and adjusting the initial clustering result based on the verification result to obtain the final clustering result includes: verifying whether the initial clustering result meets a first condition; in response to not meeting the first condition, adjusting the weight of the user tags corresponding to the first condition in the initial clustering result; the first condition is that users on the expense control whitelist and group accounts in the initial clustering result are concentrated in the same cluster, and the proportion of important users in the same cluster is greater than or equal to a fourth threshold; verifying whether the initial clustering result meets a second condition; in response to not meeting the second condition, optimizing the K value and re-clustering; the second condition is that the proportion of users corresponding to the user tags in each cluster is greater than a fifth threshold, and the proportion of the user tags in other clusters is less than a sixth threshold; verifying whether the initial clustering result meets a third condition; in response to not meeting the third condition, splitting or merging the clusters in the initial clustering result, adjusting the centroids of the clusters, and re-clustering; the third condition is that the ratio of the number of users in each cluster in the initial clustering result to the total number of users is greater than a seventh threshold.
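
The second and third conditions can be checked as in the sketch below. It reflects one possible reading of the second condition: each cluster has a dominant tag whose share inside that cluster exceeds the fifth threshold while its share inside every other cluster stays below the sixth threshold. For simplicity each user carries a single tag. The threshold values (0.7, 0.2, 0.05) and all names are hypothetical, since no example values are given for these thresholds here.

```python
from collections import Counter

def tag_share(labels, tags, cluster, tag):
    """Share of users in `cluster` that carry `tag`."""
    members = [t for lab, t in zip(labels, tags) if lab == cluster]
    return members.count(tag) / len(members)

def meets_second_condition(labels, tags, fifth=0.7, sixth=0.2):
    clusters = sorted(set(labels))
    for c in clusters:
        in_cluster = Counter(t for lab, t in zip(labels, tags) if lab == c)
        dominant = in_cluster.most_common(1)[0][0]
        if tag_share(labels, tags, c, dominant) <= fifth:
            return False  # the cluster has no sufficiently dominant tag
        if any(tag_share(labels, tags, other, dominant) >= sixth
               for other in clusters if other != c):
            return False  # the dominant tag leaks into another cluster
    return True

def undersized_clusters(labels, seventh=0.05):
    """Clusters that violate the third condition (share of users too small)."""
    counts = Counter(labels)
    return sorted(c for c, n in counts.items() if n / len(labels) <= seventh)
```

A failed second condition would trigger re-optimizing K, while any cluster returned by `undersized_clusters` would be a candidate for merging.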

[0083] In this embodiment, step S104 verifies the initial clustering results and adjusts them based on the verification results to obtain the final clustering results. This is a crucial step in ensuring that the clustering results meet business needs and actual operational scenarios. In the initial stage, the clustering algorithm classifies users primarily on the basis of user tags and feature data. However, because clustering algorithms are inherently data-driven, their output may not fit actual business logic and requirements. Therefore, to ensure that the clustering results are both algorithmically accurate and practical and feasible for the business, multi-dimensional verification and adjustment of the initial clustering results are necessary. Specifically, in some embodiments, the verification and adjustment process checks the first, second, and third conditions in turn and, when necessary, dynamically adjusts user tag weights, the K value, and the clusters themselves to optimize the clustering results.

[0084] During the verification process, the initial clustering results are first checked against the first condition. The first condition requires that users on the expense control whitelist and group accounts be concentrated in the same cluster, and that the proportion of important users (such as top-tier and first-tier users) in that cluster be greater than or equal to the fourth threshold (e.g., 90%). The core purpose of this condition is to ensure that key user categories achieve a high degree of cluster concentration, so that subsequent operational strategies can serve or protect these users consistently. For example, users on the expense control whitelist and group accounts are typically important customer groups for power companies and require special protection; therefore, these users should not be dispersed into different clusters. If the initial clustering results show that users on the expense control whitelist or group accounts are scattered across multiple clusters, or if the proportion of important users in the cluster is lower than the fourth threshold, the tag weights for these users need to be adjusted. Specifically, the weight of tags such as "expense control whitelist flag" or "importance level" can be increased so that these features carry more weight in the clustering process, strengthening the tendency of these users to fall into the same cluster. After the adjustment, the clustering algorithm is re-executed to generate new clustering results.
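
As an illustration only, the first-condition check and the tag weight adjustment can be sketched as follows. The function names, the dictionary layout, the boost factor of 1.5, and the reading of "proportion of important users" as the share of all important users that land in the cluster holding the whitelist users and group accounts are assumptions made for this sketch; the application does not fix any of them.

```python
from collections import Counter


def check_first_condition(assignments, key_users, important_users,
                          fourth_threshold=0.90):
    """Return (ok, home_cluster) for the first verification condition.

    assignments     -- dict: user id -> cluster id (initial clustering result)
    key_users       -- whitelist users and group accounts (assumed input)
    important_users -- users flagged as important (assumed input)
    """
    key_counts = Counter(assignments[u] for u in key_users)
    if not key_counts:
        return True, None  # no key users, nothing to verify
    # Cluster that holds the most key users.
    home_cluster = key_counts.most_common(1)[0][0]
    # "Concentrated in the same cluster": all key users sit in one cluster.
    concentrated = len(key_counts) == 1
    in_home = sum(1 for u in important_users if assignments[u] == home_cluster)
    share = in_home / len(important_users) if important_users else 1.0
    return concentrated and share >= fourth_threshold, home_cluster


def boost_tag_weights(weights, tags, factor=1.5):
    """Return a copy of the tag weights with the listed tags scaled up."""
    return {t: (w * factor if t in tags else w) for t, w in weights.items()}
```

When the check fails, the boosted weights would be fed back into the clustering step and the algorithm re-run, as the paragraph above describes.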

[0085] After verifying the first condition, it is necessary to further verify whether the initial clustering results meet the second condition. The second condition requires that the proportion of users with the corresponding user tag in each cluster be greater than the fifth threshold (e.g., 85%), and that the proportion of that tag in other clusters be less than the sixth threshold (e.g., 30%). The purpose of this condition is to ensure that each cluster has a clear business meaning; that is, each cluster should be dominated by a certain core user tag, while avoiding the widespread distribution of users with the same tag across multiple clusters, which could lead to ambiguity in the clustering results. For example, if the proportion of users with the main tag in a cluster is too low, it may indicate that the business characteristics of that cluster are not distinct enough; while if users with a certain tag appear in multiple clusters simultaneously, it may make it difficult to translate the clustering results into a clear business strategy. If the initial clustering results do not meet the second condition, the K value needs to be optimized, and clustering needs to be re-executed. Specifically, the number of clusters can be adjusted by increasing or decreasing the K value, thereby making the core characteristics of each cluster more prominent and avoiding overlap in business meaning between clusters. The optimized K value can better reflect the actual user distribution and make the clustering results more aligned with business needs.
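
A minimal sketch of the second-condition check is given below, assuming each user carries a single main tag and that the "corresponding user tag" of a cluster is its most frequent main tag. Both assumptions, as well as the function name and the input layout, are illustrative and not taken from the application.

```python
from collections import Counter, defaultdict


def check_second_condition(assignments, main_tag,
                           fifth_threshold=0.85, sixth_threshold=0.30):
    """Check that every cluster is dominated by one tag that is rare elsewhere.

    assignments -- dict: user id -> cluster id
    main_tag    -- dict: user id -> main user tag (assumed: one tag per user)
    """
    tags_by_cluster = defaultdict(list)
    for user, cluster in assignments.items():
        tags_by_cluster[cluster].append(main_tag[user])

    # Share of each tag inside each cluster.
    shares = {
        cluster: {t: n / len(tags) for t, n in Counter(tags).items()}
        for cluster, tags in tags_by_cluster.items()
    }

    for cluster, tag_share in shares.items():
        dominant, share = max(tag_share.items(), key=lambda kv: kv[1])
        if share <= fifth_threshold:
            return False  # cluster has no clear business meaning
        for other, other_share in shares.items():
            if other != cluster and other_share.get(dominant, 0.0) >= sixth_threshold:
                return False  # the same tag is spread over several clusters
    return True
```

A `False` result corresponds to the case in which the K value is increased or decreased and clustering is run again.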

[0086] After verifying the first two conditions, it is also necessary to verify whether the initial clustering results meet the third condition. The third condition requires that the ratio of the number of users in each cluster in the initial clustering results to the total number of users must be greater than the seventh threshold (e.g., 5%). The core purpose of this condition is to avoid clusters that are too small (i.e., "small clusters") or too large (i.e., "large clusters"), thereby ensuring that the clustering results can reflect the diversity of user behavior while maintaining a certain level of business significance. For example, clusters that are too small may be due to some feature weights being set too high or data anomalies. Such clusters may lack actual business value due to the small number of users. On the other hand, clusters that are too large may be due to the clustering algorithm failing to fully recognize the fine-grained characteristics of the data, resulting in users of different categories being assigned to the same cluster, thereby reducing the granularity of the clustering. If the initial clustering results do not meet the third condition, the clusters need to be split or merged, and the centroid positions of the clusters need to be adjusted before re-performing the clustering. For example, for an overly large cluster, it can be split into two or more sub-clusters; while for an overly small cluster, it can be merged with neighboring clusters, or more sample points can be attracted into the cluster by adjusting the centroid position, so that the proportion of users in each cluster is maintained within a reasonable range.
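
The size check and one of the possible repairs (merging an undersized cluster into its nearest neighbor) can be sketched as follows. The function names and the use of Euclidean distance between centroids are assumptions; splitting oversized clusters and adjusting centroids before re-clustering are not shown.

```python
import math
from collections import Counter


def undersized_clusters(assignments, seventh_threshold=0.05):
    """Return the ids of clusters whose user share is not above the threshold."""
    total = len(assignments)
    sizes = Counter(assignments.values())
    return sorted(c for c, n in sizes.items() if n / total <= seventh_threshold)


def merge_into_nearest(assignments, centroids, small):
    """Reassign every user of cluster `small` to the cluster with the nearest centroid.

    centroids -- dict: cluster id -> tuple of feature coordinates (assumed input)
    """
    others = [c for c in centroids if c != small]
    target = min(others, key=lambda c: math.dist(centroids[c], centroids[small]))
    return {u: (target if c == small else c) for u, c in assignments.items()}
```

In the flow described above, re-clustering would follow the merge so that the centroids are recomputed for the new membership.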

[0087] The three-layer verification mechanism in this application is complementary and mutually reinforcing. The first layer of verification focuses on the concentration of core user categories, ensuring a reasonable distribution of key users and providing a basic guarantee for the clustering results. The second layer of verification further refines the process, ensuring that the business characteristics of each cluster are unique and clear, avoiding business overlap between clusters. The third layer of verification verifies the balance and coverage of the clustering structure from a global perspective, ensuring that the clustering results not only reflect the diversity of user behavior but also adapt to overall business needs. Through layer-by-layer verification, the clustering results are optimized at both the local and global levels. The final generated clustering results are both data-driven and scientific, meeting the refined management and actual operational needs of expense control, providing comprehensive support for the technical solution of this invention.

[0088] Through the aforementioned verification and adjustment process, the final clustering results not only reflect the diversity of user behavior but also align with business logic and operational needs. For example, the adjusted clustering results ensure that users on the expense control whitelist and group accounts are concentrated in the same cluster, facilitating the implementation of specific protection strategies. Simultaneously, the core characteristics of each cluster are more distinct, avoiding overlap between clusters and making the clustering results easier to translate into specific business rules. Furthermore, the balanced proportion of users in each cluster ensures sufficient coverage and business significance. The final clustering results possess both data-driven scientific rigor and adaptability to actual business scenarios, providing strong support for the refined management and operation of expense control services.

[0089] Furthermore, in this embodiment, after the above process, seven stable clusters are finally formed. The core characteristics, business attributes, and operation strategies of each cluster are shown in Table 1 below:

Table 1 Cluster Description

[0090] Furthermore, based on the clustering analysis results, clear segmentation rules are formulated for each group, considering factors such as user type, importance level, and collection response time, to form a group segmentation rule library. At the same time, the importance, risk level, and collection difficulty of each group are considered together to determine its priority. By extracting the core characteristics of the seven groups, the abstract clustering results are transformed into standardized business rules, forming a group rule library that can directly guide operations. In addition, based on a three-dimensional evaluation system of "importance, risk level, and collection difficulty," a seven-level priority ranking is established, clarifying how collection resources are allocated differently across groups. This addresses the pain points of traditional clustering and grouping: abstract results, difficulty of implementation, and the lack of a basis for execution. Because the rule library can be updated dynamically, it adapts flexibly to business changes, and the grouping results can be translated directly into executable collection actions, significantly reducing the operational costs for business personnel and improving the operational efficiency and strategy accuracy of expense control.

[0091] Based on the clustering results, the core features of each group were extracted, and clear and executable group rules were formulated, as detailed in the table below:

[0092] Furthermore, based on the importance of the groups (e.g., Level 1 groups are whitelisted / group accounts, with the highest importance), risk level (e.g., Level 6 groups have high frequency of non-response, with the highest risk of overdue payments), and collection difficulty (e.g., Level 4 rental users are difficult to collect payments from), a seven-level priority system is established (Level 1 > Level 2 > ... > Level 7), clearly defining the order of allocation of collection resources.

[0093] Furthermore, in some embodiments, the method further includes: periodically updating the final clustering result; assigning corresponding priorities to the clusters in the final clustering result; and, in response to a new user assignment event, matching clusters in the final clustering result in descending order of priority.

[0094] In this embodiment, to address changes in user behavior and dynamic adjustments to business needs, the method further includes periodically updating the final clustering results and combining cluster priority rules to achieve rapid allocation of new users. This dynamic update mechanism ensures the real-time nature and business adaptability of the clustering results, while optimizing the efficiency of refined management of expense control services. The update process involves both readjusting existing user clustering results and determining the cluster affiliation and priority matching for new users, thereby achieving continuous optimization and immediate adaptation of the clustering results.

[0095] The core of periodic updates lies in the full recalculation and dynamic adjustment of the final clustering results. Specifically, the system triggers a clustering update task at a set interval (e.g., every 30 days) and automatically synchronizes the latest user data, including user electricity consumption behavior data, payment records, request records, and service tags. Newly collected data, after preprocessing, is used to update user tags and feature sets, thereby reflecting changes in user behavior within the recent period. For example, some users' electricity consumption fluctuations may change significantly due to seasonal factors, or their payment timeliness may change due to recent frequent payment delays; these changes will be reflected in the updated feature set. After the new feature set is generated, the system re-clusters all users using the established clustering algorithm and verification rules, generating updated final clustering results. The updated clustering results better reflect the latest user status and ensure that the clustering results continuously adapt to the dynamically changing needs of the expense control scenario.

[0096] After generating the updated clustering results, each cluster is assigned a corresponding priority. The cluster priority is determined based on a comprehensive evaluation of multiple factors, including user importance, risk level, and difficulty of collection. For example, clusters containing users on the expense control whitelist and group accounts are typically given the highest priority to ensure these core protected users receive priority access to resources and service support. Conversely, high-risk user clusters with long payment delays and repeated failures to respond to collection strategies are given lower priority, and stricter collection and power outage strategies are applied to these users. Setting priority rules not only helps power companies allocate collection resources more efficiently but also enables rapid response to user needs in specific scenarios, such as prioritizing power stability for important users during high-load periods.

[0097] When a new user joins, a new user assignment event is triggered, and the new user is quickly assigned based on the updated final clustering results and priority rules. Specifically, the new user's electricity usage behavior data and basic profile data are first preprocessed to generate initial user tags and feature sets. The new user is then matched against the clusters in the final clustering results in descending order of cluster priority. The matching process relies on the cluster rule base: the system checks whether the new user's features conform to the division rules of each cluster in turn. If the features conform to the rules of a higher-priority cluster, the new user is assigned directly to that cluster and subsequent rule matching is terminated to avoid duplicate assignment. For example, if a new user's expense control whitelist flag is "yes", the user is assigned directly to the highest-priority cluster without matching any other cluster rules. Through this priority matching mechanism, the clustering and assignment of new users can be completed efficiently, ensuring the stability and timeliness of the clustering results.
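
The priority-ordered matching described above can be sketched as a first-match lookup over a rule list. The rule names, feature keys, and thresholds below are hypothetical placeholders; the actual rule base of the embodiment is given in its rule table, which is not reproduced here.

```python
def assign_new_user(features, rules):
    """Return the first cluster, in priority order, whose rule the user satisfies.

    rules -- list of (priority, cluster_name, predicate); a lower number means
             a higher priority. Returns None when no rule matches.
    """
    for _, cluster, predicate in sorted(rules, key=lambda r: r[0]):
        if predicate(features):
            return cluster  # stop at the first match to avoid duplicate assignment
    return None


# Hypothetical rule base, for illustration only.
RULES = [
    (1, "whitelist_or_group",
     lambda f: bool(f.get("whitelist_flag") or f.get("group_account"))),
    (2, "high_frequency_late", lambda f: f.get("late_payments_12m", 0) >= 6),
    (3, "low_frequency_late", lambda f: f.get("late_payments_12m", 0) >= 1),
]

cluster = assign_new_user({"whitelist_flag": True, "late_payments_12m": 8}, RULES)
```

Because matching stops at the first hit, a whitelisted user is placed in the highest-priority cluster even when lower-priority rules would also match.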

[0098] Furthermore, for users already grouped, if their behavioral characteristics change significantly during periodic updates (e.g., an increase in payment delays or a significant increase in electricity consumption volatility), the system will reassess their group affiliation based on the latest feature set and adjust it according to priority rules. For example, a user who originally belonged to the "reminders only (low frequency)" cluster might be reassigned to the "reminders immediately (high frequency)" cluster if their payment delays have increased significantly in recent months and their response rate to reminder SMS messages has dropped sharply, so that the assignment matches their new behavioral patterns and risk level. By dynamically adjusting the user grouping results, the system can effectively address changes in user behavior and ensure that the grouping results always reflect the latest user behavioral characteristics.

[0099] By combining a periodic update mechanism with priority rules, this invention not only maintains the real-time nature and accuracy of clustering results but also enables rapid allocation of new users and dynamic adjustment of existing users, thereby ensuring that clustering continuously adapts to the operational needs of expense control. Ultimately, the dynamic adjustment of clustering results balances the personalized characteristics of user behavior with the universality of business logic, providing continuous and reliable data support and strategic basis for the refined management of expense control.

[0100] Furthermore, in some embodiments, the method further includes: identifying anomalies in users who fail to match any group during the grouping process and users who match multiple group rules simultaneously; and updating and adjusting the final grouping result in response to the number of identified anomaly users exceeding an eighth threshold.

[0101] In this embodiment, to further improve the accuracy and business adaptability of the clustering results, the method also includes identifying and processing abnormal users, and updating and adjusting the final clustering results as necessary. Abnormal users mainly include two categories: users who fail to match to any group, and users who match multiple group rules simultaneously. Due to the special characteristics of these two types of users or the complexity of their data, they may not be accurately assigned using existing group rules. If left unaddressed, this could affect the overall quality and usability of the clustering results. Therefore, by establishing an identification and response mechanism for abnormal users, the integrity and effectiveness of the clustering results can be ensured.

[0102] Users who fail to be matched to any group are those who do not meet the conditions of any group rule during the grouping process. This may be because the user's behavioral characteristics are so unusual that they cannot be classified into any existing cluster, or because the rule base has insufficient coverage and does not encompass the characteristics of this type of user. For example, some users may exhibit extreme power consumption patterns or prolonged periods of zero power consumption, and these characteristics are not fully described by existing rules, which prevents these users from being matched to any group. The system automatically marks these unmatched users as abnormal users and records their key characteristic information.

[0103] Users who match multiple group rules simultaneously are those who satisfy the conditions of two or more cluster rules, which leads to a conflict in their affiliation. For example, a user may exhibit a low-frequency delayed payment pattern overall but show high-frequency delayed payment characteristics in certain months, potentially fitting the rules of both the "constant reminders (low-frequency)" and "constant reminders (high-frequency)" clusters. Such affiliation conflicts may lead to ambiguity and uncertainty in the clustering results, thereby affecting the execution of subsequent business strategies. The system likewise marks these users as abnormal users and records the cluster rules they simultaneously match.

[0104] After identifying anomalous users, the system counts them. If the number of anomalous users exceeds a preset eighth threshold (e.g., 3% of the total number of users), an update and adjustment of the final grouping results is triggered. Specifically, the update and adjustment process includes in-depth analysis of the feature data of anomalous users to identify the characteristic patterns of unmatched users or the reasons for grouping conflicts. For unmatched users, the system can analyze whether their characteristics show obvious regularities and expand or optimize the group rule base accordingly to cover them. For example, for users with consistently zero electricity consumption, a new "zero electricity consumption" tag rule can be added, placing them in a separate cluster. For users who match multiple group rules simultaneously, the condition thresholds in the rule base can be optimized, or the rule with the higher priority can be identified, thereby eliminating rule conflicts. For example, for users whose payment delays fall between high and low frequencies, the threshold for high-frequency delays can be made more stringent, thus placing such users in the low-frequency delay cluster.
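
A minimal sketch of the anomaly identification and the eighth-threshold trigger follows. Unlike priority matching, every rule is evaluated here so that multiple matches become visible. The rule names, the feature key `late_payments_12m`, and the deliberately overlapping thresholds are hypothetical.

```python
def find_anomalies(users, rules):
    """Split anomalous users into unmatched users and users with rule conflicts.

    users -- dict: user id -> feature dict
    rules -- list of (cluster_name, predicate)
    Returns (unmatched_ids, conflicts) where conflicts maps a user id to the
    list of cluster rules that the user satisfies at the same time.
    """
    unmatched, conflicts = [], {}
    for uid, features in users.items():
        hits = [name for name, predicate in rules if predicate(features)]
        if not hits:
            unmatched.append(uid)
        elif len(hits) > 1:
            conflicts[uid] = hits
    return unmatched, conflicts


def needs_rule_update(n_anomalous, n_total, eighth_threshold=0.03):
    """True when the share of anomalous users exceeds the eighth threshold."""
    return n_anomalous / n_total > eighth_threshold


# Hypothetical, deliberately overlapping rules (3 or 4 late payments hit both).
RULES = [
    ("low_frequency_late", lambda f: 1 <= f["late_payments_12m"] <= 4),
    ("high_frequency_late", lambda f: f["late_payments_12m"] >= 3),
]
```

Tightening the high-frequency rule (for instance to five or more late payments) would remove the overlap, which mirrors the threshold adjustment described in the paragraph above.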

[0105] Furthermore, after updating the group rule base, all users need to be re-grouped to generate the updated final grouping results. The updated grouping results not only cover previously unmatched users but also clearly identify the clusters of users with conflicting affiliations, while ensuring that the grouping results for other users are unaffected. Through this process, the completeness and consistency of the grouping results are significantly improved, ensuring that all users are correctly assigned to a specific cluster and eliminating rule conflicts and ambiguous affiliations.

[0106] By identifying and processing abnormal users, the clustering method of this invention effectively solves the problem of incomplete clustering results caused by insufficient rule coverage or conflicts in traditional clustering methods. Even under complex and diverse user behaviors, the clustering results can still maintain high accuracy and adaptability, while ensuring that the clustering rule base can be continuously optimized and expanded as user behavior changes, adapting to dynamically changing business needs. The final generated clustering results not only have data-driven scientific validity but also directly support the refined management of expense control business, providing a reliable basis for policy execution.

[0107] As can be seen from the above embodiments, the user segmentation method described in this application, by acquiring multi-source datasets and generating user tags and feature sets based on these datasets, and combining them with clustering algorithms to segment users, can achieve accurate user segmentation. Finally, through verification and adjustment, a segmentation result that meets business requirements is obtained, ensuring the accuracy and applicability of the segmentation result. Therefore, this application has designed a rigorous process from the data source to the generation of segmentation results, integrating multiple data processing and analysis technologies to comprehensively improve the scientific nature and business adaptability of the segmentation.

[0108] First, this application utilizes association rule algorithms to analyze users' electricity consumption behavior, habits, and payment records, generating basic attribute tags, electricity consumption feature tags, and payment behavior tags. This comprehensively characterizes users' behavioral features, providing high-quality data support for subsequent user segmentation. These tags effectively reflect users' core attributes, electricity consumption patterns, and payment methods, providing accurate input for the segmentation model.

[0109] Based on this, this application constructs characteristics of electricity consumption fluctuation patterns, payment timeliness, and payment strategy response, which are used to clearly describe users' electricity consumption behavior patterns, payment timeliness, and response to payment reminder strategies, respectively. These fine-grained feature descriptions not only improve the accuracy of clustering but also make the clustering results more interpretable and provide business guidance value.

[0110] To further optimize the feature set, this application utilizes Pearson correlation coefficient to remove redundant features, analyzes variance to filter irrelevant features, and combines principal component analysis to reduce the dimensionality of the feature set. This effectively reduces the feature dimensionality, improves the computational efficiency and accuracy of the clustering model, and retains the core information reflecting user behavior. This process ensures the conciseness and efficiency of the feature set, laying a solid foundation for the execution of the clustering algorithm.
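
The redundancy-removal step alone can be sketched with a plain Pearson correlation filter, as below. The cut-off of 0.9, the greedy keep-first strategy, and the feature names in the usage note are assumptions of this sketch; the variance-based filtering and principal component analysis mentioned above are not shown.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # a constant feature has no defined correlation
    return cov / (norm_x * norm_y)


def drop_redundant(features, threshold=0.9):
    """Greedily keep features that are not highly correlated with any kept one.

    features -- dict: feature name -> list of values (one value per user)
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

With two nearly proportional consumption features (say, hypothetical `monthly_kwh` and `daily_kwh` columns), only the first one encountered would be retained.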

[0111] In implementing the clustering algorithm, this application dynamically determines the optimal K value by combining the inflection point of the within-cluster sum of squares curve with the silhouette coefficient, thereby ensuring the rationality and stability of the final clustering results. Through this systematic parameter optimization, user groups can be accurately divided, avoiding the instability of clustering results caused by the traditional practice of setting the K value empirically, and making the clustering results more closely match the actual behavioral characteristics of users.
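
One possible way of combining the two criteria is sketched below: locate the inflection point as the K with the largest discrete second difference of the within-cluster sum of squares, then pick the K with the best silhouette coefficient in a small window around it. The combination rule, the window size, and the function names are assumptions; the sketch also assumes that the within-cluster sums of squares and silhouette coefficients have already been computed for consecutive integer K values.

```python
def elbow_k(wcss):
    """Return the K with the largest second difference of the WCSS curve.

    wcss -- dict: K -> within-cluster sum of squares, for consecutive K values
    """
    ks = sorted(wcss)
    if len(ks) < 3:
        raise ValueError("need at least three K values to locate an inflection point")
    best_k, best_bend = None, float("-inf")
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = wcss[prev] - 2 * wcss[k] + wcss[nxt]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k


def choose_k(wcss, silhouette, window=1):
    """Pick the K near the inflection point with the highest silhouette coefficient.

    silhouette -- dict: K -> mean silhouette coefficient
    """
    k0 = elbow_k(wcss)
    candidates = [k for k in silhouette if abs(k - k0) <= window]
    return max(candidates, key=lambda k: silhouette[k])
```

Restricting the silhouette comparison to the neighborhood of the inflection point keeps the choice close to where additional clusters stop reducing the within-cluster sum of squares noticeably.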

[0112] After obtaining the initial clustering results, this application performs multiple verifications on the clustering results and adjusts them in conjunction with business logic to ensure the concentration of important users such as those on the expense control whitelist and group users. Simultaneously, by optimizing the K-value and adjusting abnormal clusters, the occurrence of abnormal clusters is avoided, further enhancing the practicality and business guidance value of the clustering results.

[0113] Furthermore, this application utilizes a periodic update mechanism based on the priority rules of the final user grouping results to dynamically adjust the grouping results and quickly group new users. In the context of constantly changing user behavior and business needs, this dynamic update mechanism ensures the real-time nature and effectiveness of the grouping results, keeping the grouping system consistently accurate and efficient.

[0114] Finally, this application also identifies anomalies in users who are not matched to any group during the grouping process and users matched by multiple rules, and optimizes the grouping rules based on the dynamic changes in the number of anomaly users. This process can effectively handle anomalies in grouping, ensure the stability and scalability of the grouping system, and ensure that the rule base can continuously adapt to changes in user behavior.

[0115] In summary, this application significantly improves the scientific rigor, applicability, and stability of clustering through a comprehensive solution that integrates multi-source data, constructs precise features, optimizes efficient clustering, performs multiple checks and adjustments, and dynamically updates and handles anomalies. This provides strong technical support for the refined management and efficient operation of expense control.

[0116] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the user segmentation method described in any of the above embodiments.

[0117] Figure 2 illustrates a more specific hardware structure of an electronic device provided in this embodiment. The device may include a processor 1010, a memory 1020, an input / output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input / output interface 1030, and communication interface 1040 are interconnected within the device via the bus 1050.

[0118] The processor 1010 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this specification.

[0119] The memory 1020 can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, dynamic storage device, etc. The memory 1020 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented by software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.

[0120] The input / output interface 1030 is used to connect input / output modules to realize information input and output. Input / output modules can be configured as components within the device (not shown in the figure) or externally connected to the device to provide corresponding functions. Input devices may include keyboards, mice, touchscreens, microphones, various sensors, etc., while output devices may include displays, speakers, vibrators, indicator lights, etc.

[0121] The communication interface 1040 is used to connect a communication module (not shown in the figure) to enable communication between this device and other devices. The communication module can communicate via wired means (such as USB or an Ethernet cable) or wireless means (such as a mobile network, Wi-Fi, or Bluetooth).

[0122] Bus 1050 includes a pathway for transmitting information between various components of the device, such as processor 1010, memory 1020, input / output interface 1030, and communication interface 1040.

[0123] It should be noted that although the above-described device only shows the processor 1010, memory 1020, input / output interface 1030, communication interface 1040, and bus 1050, in specific implementations, the device may also include other components necessary for normal operation. Furthermore, those skilled in the art will understand that the above-described device may only include the components necessary for implementing the embodiments of this specification, and not necessarily all the components shown in the figures.

[0124] The electronic devices described above are used to implement the corresponding user segmentation methods in any of the foregoing embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0125] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this application also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to execute the user grouping method as described in any of the above embodiments.

[0126] The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

[0127] The computer instructions stored in the storage medium of the above embodiments are used to cause the computer to execute the user grouping method as described in any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.

[0128] Those skilled in the art should understand that the discussion of any of the above embodiments is merely exemplary and is not intended to imply that the scope of this application (including the claims) is limited to these examples; within the framework of this application, the technical features of the above embodiments or different embodiments can also be combined, the steps can be implemented in any order, and there are many other variations of different aspects of the embodiments of this application as described above, which are not provided in the details for the sake of brevity.

[0129] Additionally, to simplify the description and discussion, and to avoid obscuring the embodiments of this application, the well-known power / ground connections to integrated circuit (IC) chips and other components may or may not be shown in the provided drawings. Furthermore, the apparatus may be shown in block diagram form to avoid obscuring the embodiments of this application, and this also takes into account the fact that the details of the implementation of these block diagram apparatuses are highly dependent on the platform on which the embodiments of this application will be implemented (i.e., these details should be fully understood by those skilled in the art). While specific details (e.g., circuits) have been set forth to describe exemplary embodiments of this application, it will be apparent to those skilled in the art that the embodiments of this application can be implemented without these specific details or with variations thereof. Therefore, these descriptions should be considered illustrative rather than restrictive.

[0130] Although this application has been described in conjunction with specific embodiments thereof, many substitutions, modifications, and variations of these embodiments will be apparent to those skilled in the art from the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may be used with the embodiments discussed.

[0131] The embodiments of this application are intended to cover all such substitutions, modifications, and variations that fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the embodiments of this application should be included within the protection scope of this application.

Claims

1. A user segmentation method, characterized in that it comprises: acquiring a multi-source dataset; generating user tags and a feature set based on the multi-source dataset; using a clustering algorithm to perform clustering based on the user tags and the feature set to obtain initial clustering results; and verifying the initial clustering results and adjusting the initial clustering results based on the verification results to obtain final clustering results.

2. The method of claim 1, wherein generating the user tags based on the multi-source dataset includes: using an association rule algorithm to analyze the associations among electricity consumption behavior, electricity consumption habits, and payment records in the multi-source dataset to generate the user tags; the user tags include basic attribute tags, electricity consumption feature tags, and payment behavior tags.

3. The method of claim 1, wherein the feature set includes electricity consumption fluctuation pattern features, payment timeliness features, and payment strategy response features; the electricity consumption fluctuation pattern features are used to describe the user's electricity consumption behavior patterns; the payment timeliness features are used to describe the user's payment behavior and response capability; and the payment strategy response features are used to describe the user's response to payment reminder strategies.

4. The method of claim 1, wherein the method further includes: analyzing the feature set using the Pearson correlation coefficient, and removing features with correlation coefficient values less than a first threshold to obtain a first set; filtering the first set using analysis of variance, and removing features with variance values less than a second threshold to obtain a second set; and using principal component analysis to reduce the dimensionality of the second set, retaining features whose cumulative variance contribution rate is greater than a third threshold to obtain a third set, the third set then being used for clustering.
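For illustration only, and not as part of the claims, the three filtering stages of claim 4 can be sketched in plain Python. The claim does not say what each feature is correlated against, so the `reference` series below is an assumption, as are all function names. The principal component step is reduced to choosing how many components to retain from a list of already computed component variances (eigenvalues); the decomposition itself is not shown.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0  # constant series: treat as 0

def filter_by_correlation(features, reference, first_threshold):
    # Assumption: correlation is measured against a hypothetical reference
    # series; the claim does not specify the second variable.
    return {name: col for name, col in features.items()
            if abs(pearson(col, reference)) >= first_threshold}

def filter_by_variance(features, second_threshold):
    # Drop features whose population variance is below the second threshold.
    return {name: col for name, col in features.items()
            if pvariance(col) >= second_threshold}

def components_to_keep(component_variances, third_threshold):
    """Smallest number of leading components whose cumulative variance
    contribution rate exceeds the third threshold."""
    total = sum(component_variances)
    cumulative = 0.0
    ordered = sorted(component_variances, reverse=True)
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance / total
        if cumulative > third_threshold:
            return count
    return len(ordered)
```

The thresholds are left as parameters because the claims do not give numeric values for them.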

5. The method according to claim 1, characterized in that using a clustering algorithm to perform clustering based on the user tags and the feature set to obtain initial clustering results includes: determining an initial search range for the K value; for each K value, randomly initializing the centroids, and recording the intra-cluster sum of squares and the silhouette coefficient corresponding to each centroid; plotting an intra-cluster sum of squares curve based on the intra-cluster sum of squares corresponding to each K value, and determining the inflection point of the curve; obtaining the K value corresponding to the inflection point, and using the silhouette coefficient to help confirm the optimal K value; and performing clustering based on the optimal K value to obtain the initial clustering results.
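As a non-authoritative sketch of the K selection in claim 5, the following plain-Python code runs K-means with random centroid initialization for each K in a search range, records the intra-cluster sum of squares and the mean silhouette coefficient, locates the inflection point as the K with the largest second difference of the curve, and then picks the best silhouette among that K and its neighbours. The second-difference rule, the neighbour comparison, the restart count, and all names are assumptions; the claim does not specify how the inflection point is found or how the silhouette coefficient confirms it.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iters=50):
    centroids = rng.sample(points, k)  # random initialization of centroids
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: dist2(p, centroids[c])) for p in points]
        updated = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old centroid if a cluster lost all its members.
            updated.append(tuple(sum(v) / len(members) for v in zip(*members))
                           if members else centroids[c])
        if updated == centroids:
            break
        centroids = updated
    wcss = sum(dist2(p, centroids[l]) for p, l in zip(points, labels))
    return labels, wcss

def silhouette(points, labels):
    """Mean silhouette coefficient; singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return 0.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(dist2(p, q) ** 0.5 for q in own) / (len(own) - 1)
        b = min(sum(dist2(p, q) ** 0.5 for q in other) / len(other)
                for key, other in clusters.items() if key != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def choose_k(points, k_min=2, k_max=6, restarts=30, seed=0):
    rng = random.Random(seed)
    wcss, sil = {}, {}
    for k in range(k_min, k_max + 1):
        # Several random restarts; keep the run with the lowest sum of squares.
        runs = [kmeans(points, k, rng) for _ in range(restarts)]
        labels, wcss[k] = min(runs, key=lambda run: run[1])
        sil[k] = silhouette(points, labels)
    # Inflection point: largest second difference of the sum-of-squares curve.
    elbow = max(range(k_min + 1, k_max),
                key=lambda k: wcss[k - 1] - 2 * wcss[k] + wcss[k + 1])
    # Confirm with the silhouette coefficient among the elbow and its neighbours.
    candidates = [k for k in (elbow - 1, elbow, elbow + 1) if k in sil]
    return max(candidates, key=lambda k: sil[k]), wcss, sil
```

Points are expected as a list of numeric tuples, for example standardized feature vectors per user. The search range must contain at least three K values so that a second difference exists.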

6. The method according to claim 5, characterized in that verifying the initial clustering results and adjusting the initial clustering results based on the verification results to obtain the final clustering results includes: verifying whether the initial clustering results meet a first condition, and if the first condition is not met, adjusting the weight of the user tags in the initial clustering results corresponding to the first condition, the first condition being that the cost control whitelist users and group users in the initial clustering results are concentrated in the same cluster, and the proportion of important users in the same cluster is greater than or equal to a fourth threshold; verifying whether the initial clustering results meet a second condition, and if the second condition is not met, optimizing the K value and re-clustering, the second condition being that the proportion of users with the corresponding user tag in each cluster is greater than a fifth threshold, and the proportion of that user tag in other clusters is less than a sixth threshold; and verifying whether the initial clustering results meet a third condition, and if the third condition is not met, splitting or merging the clusters in the initial clustering results, adjusting the centroids of the clusters, and then re-clustering, the third condition being that the ratio of the number of users in each cluster in the initial clustering results to the total number of users is greater than a seventh threshold.
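The three checks of claim 6 can be illustrated with the following sketch, which is not part of the claims. It assumes, for simplicity, that each user carries exactly one tag, that each cluster has one corresponding tag, and that "important users" means the cost control whitelist users and group users taken together. The function names and the action strings are hypothetical; the adjustments themselves (re-weighting, re-clustering, splitting or merging) are only named, not implemented.

```python
from collections import Counter

def important_users_concentrated(cluster_of, important_users, fourth_threshold):
    """First condition: share of important users in their most common cluster."""
    counts = Counter(cluster_of[u] for u in important_users)
    if not counts:
        return True  # no important users, nothing to verify
    top_count = counts.most_common(1)[0][1]
    return top_count / len(important_users) >= fourth_threshold

def tags_separate_clusters(cluster_of, tag_of, cluster_tag,
                           fifth_threshold, sixth_threshold):
    """Second condition: a cluster's tag dominates inside it and is rare outside."""
    sizes = Counter(cluster_of.values())
    for cluster, tag in cluster_tag.items():
        if sizes[cluster] == 0:
            return False
        inside = sum(1 for u, c in cluster_of.items()
                     if c == cluster and tag_of[u] == tag)
        outside_users = [u for u, c in cluster_of.items() if c != cluster]
        outside = sum(1 for u in outside_users if tag_of[u] == tag)
        if inside / sizes[cluster] <= fifth_threshold:
            return False
        if outside_users and outside / len(outside_users) >= sixth_threshold:
            return False
    return True

def clusters_large_enough(cluster_of, seventh_threshold):
    """Third condition: every cluster holds more than a minimum share of users."""
    sizes = Counter(cluster_of.values())
    total = len(cluster_of)
    return all(n / total > seventh_threshold for n in sizes.values())

def validate(cluster_of, tag_of, cluster_tag, important_users, t4, t5, t6, t7):
    """Return the adjustments called for by whichever conditions fail."""
    actions = []
    if not important_users_concentrated(cluster_of, important_users, t4):
        actions.append("adjust_tag_weights")
    if not tags_separate_clusters(cluster_of, tag_of, cluster_tag, t5, t6):
        actions.append("optimize_k_and_recluster")
    if not clusters_large_enough(cluster_of, t7):
        actions.append("split_or_merge_and_recluster")
    return actions
```

An empty list means the initial clustering results pass all three checks and can be taken as the final clustering results.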

7. The method according to claim 1, characterized in that the method further includes: periodically updating the final clustering results; assigning corresponding priorities to the clusters in the final clustering results; and, in response to a new user assignment event, matching the clusters in the final clustering results sequentially from highest to lowest priority.
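A minimal sketch of the priority-ordered matching in claim 7 follows, for illustration only. It assumes that each cluster exposes a membership rule as a predicate over the new user's features and that a larger number means a higher priority; the `Cluster` type, its fields, and `assign_new_user` are hypothetical names that the claim does not define.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Cluster:
    name: str
    priority: int                  # assumption: larger value = higher priority
    rule: Callable[[dict], bool]   # assumption: membership rule over user features

def assign_new_user(user: dict, clusters: list) -> Optional[str]:
    """Try clusters from highest to lowest priority; return the first match."""
    for cluster in sorted(clusters, key=lambda c: c.priority, reverse=True):
        if cluster.rule(user):
            return cluster.name
    return None  # no cluster matched the new user
```

Returning `None` for an unmatched user leaves room for a separate handling step rather than forcing the user into a default cluster.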

8. The method according to claim 7, characterized in that the method further includes: performing anomaly identification on users who fail to be matched to any group during the grouping process and on users who match multiple group rules simultaneously; and, in response to the number of identified anomalous users exceeding an eighth threshold, updating and adjusting the final clustering results.
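The anomaly identification of claim 8 can be sketched as follows, again only as an illustration. Group rules are assumed to be predicates over a user's features, and the eighth threshold is treated as a count of anomalous users; the function names and the feature keys used with them are hypothetical.

```python
def find_anomalies(users, rules):
    """Split anomalous users into those matching no rule and those matching several.

    users: mapping of user id to a feature dict
    rules: mapping of group name to a predicate over the feature dict
    """
    unmatched, multi_matched = [], []
    for user_id, features in users.items():
        hits = [name for name, rule in rules.items() if rule(features)]
        if not hits:
            unmatched.append(user_id)
        elif len(hits) > 1:
            multi_matched.append(user_id)
    return unmatched, multi_matched

def needs_update(unmatched, multi_matched, eighth_threshold):
    """True when the number of anomalous users exceeds the eighth threshold."""
    return len(unmatched) + len(multi_matched) > eighth_threshold
```

When `needs_update` returns true, the final clustering results would be updated and adjusted, which is outside the scope of this sketch.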

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the method as described in any one of claims 1 to 8.

10. A non-transitory computer-readable storage medium storing computer instructions, characterized in that the computer instructions are used to cause a computer to perform the method according to any one of claims 1 to 8.