A sample optimization method and device in a Spark framework

By combining the optimized SMOTE algorithm and clustering algorithm, and employing the determination of the optimal number of clusters and adaptive random code generation, the sample sampling problem in big data processing is solved, achieving efficient sample expansion and improved model generalization ability.

CN115248991B · Active · Publication Date: 2026-06-16 · CHINA MOBILE GROUP JIANGSU +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHINA MOBILE GROUP JIANGSU
Filing Date
2021-04-26
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing technologies for big data processing suffer from problems such as overfitting, information loss, and noisy samples due to duplicate data sampling, especially when the distribution of positive and negative samples is complex, resulting in high computational cost and poor model generalization ability.

Method used

By combining the optimized SMOTE algorithm and clustering algorithm, the sample generation process is optimized through determining the optimal number of clusters and generating adaptive random codes, reducing noisy samples and achieving efficient multi-sample oversampling. Spark SQL statements are used to integrate the sample data.

Benefits of technology

It effectively handles noise issues during sample generation, improves the universality of sample expansion and model generalization ability, and reduces computational load and cluster computing pressure.



Abstract

The application provides a sample optimization method and device in a Spark framework, comprising: obtaining modeling data samples in a preset scene; and based on an optimized SMOTE algorithm and a clustering algorithm, optimizing the modeling data samples to obtain a sample optimization result. The application effectively processes noise problems in the sample generation process in combination with the optimized SMOTE algorithm and the clustering algorithm, so that the sample expansion is more universal and suitable for implementation under a big data framework.

Description

Technical Field

[0001] This invention relates to the field of big data processing technology, and in particular to a sample optimization method and apparatus in the Spark framework.

Background Technology

[0002] In daily big data processing, samples often need to be processed in various ways. The size of the sample directly determines the accuracy of model training and the generalization ability of the model.

[0003] Generally, the distribution of samples needs to accurately reflect the probability distribution of the actual data, and the amount of data for samples with different labels should be roughly equal to avoid sample skew. However, in practical applications, the amounts of positive and negative sample data often differ significantly. For example, in precision marketing scenarios, the positive samples (users with specific needs) are generally far fewer than the negative samples (the general public). Existing methods for handling this imbalance between positive and negative samples include:

[0004] 1. Oversampling: Randomly resample the minority-class data points in the training samples so that the amounts of positive and negative sample data become similar;

[0005] 2. Undersampling: Remove some data from categories with a large amount of data;

[0006] 3. SMOTE algorithm: It uses a nearest-neighbor approach (related in spirit to clustering methods such as K-Means), interpolating between a target sample and its neighboring points to introduce new sample points.
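The interpolation idea behind the third approach can be sketched as follows. This is a minimal illustration of classic SMOTE-style interpolation, not the optimized method of this invention; the function name `smote_point`, the neighbor count `k`, and the toy data are assumptions made for the example.

```python
import math
import random

def smote_point(sample, minority, k=2, rng=random):
    """Create one synthetic point between `sample` and one of its
    k nearest minority-class neighbors (classic SMOTE-style step)."""
    # Rank the other minority points by Euclidean distance to `sample`.
    others = [p for p in minority if p is not sample]
    neighbors = sorted(others, key=lambda p: math.dist(sample, p))[:k]
    neighbor = rng.choice(neighbors)
    gap = rng.random()  # interpolation weight in [0, 1)
    # The new point lies on the segment from `sample` towards `neighbor`.
    return tuple(s + gap * (n - s) for s, n in zip(sample, neighbor))

# Hypothetical minority-class points; (5, 5) is too far away to be a neighbor.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
new_point = smote_point(minority[0], minority, k=2, rng=random.Random(0))
```

Note that the nearest-neighbor search requires distance calculations against the other minority samples, which is the cost the background section attributes to distance-based interpolation at big data scale.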

[0007] It can be observed that the above three solutions have the following drawbacks:

[0008] The first type of algorithm balances the sample data by repeating data from the minority category. The problem with this type of method is that duplicated data can lead to severe overfitting, especially when the data is divided into training, test, and validation sets, because the same records can appear in all three. In practical applications, this significantly reduces the model's generalization ability.

[0009] The second type of algorithm adopts the strategy of removing sample data points. Removing data points means losing data, which will lead to serious information loss problems, especially when the distribution of positive and negative sample data is complex and the training index has a high dimension.

[0010] The third type, the existing SMOTE algorithm, obtains new sample points by interpolating between nearby sample points of the same category. This type of algorithm achieves the "generation" of new sample points to a certain extent and performs better in practice than the first two schemes. However, when the distribution of positive and negative samples is more complex, the interpolation strategy may cause the generated sample data set to be inconsistent with the actual distribution of positive and negative data, resulting in noisy samples and making subsequent classification training more difficult. For this reason, clustering-based oversampling schemes are also used in practice. In addition, the interpolation strategy relies on distance calculation, which leads to a large amount of computation in new scenarios such as big data.

[0011] Therefore, a new method for optimizing the processing of big data samples is needed to solve the above problems.

Summary of the Invention

[0012] This invention provides a sample optimization method and apparatus in the Spark framework to solve various defects in sample sampling in the processing of big data frameworks in the prior art.

[0013] In a first aspect, the present invention provides a sample optimization method in the Spark framework, comprising:

[0014] Obtain modeling data samples from a preset scenario;

[0015] Based on the optimized SMOTE algorithm and clustering algorithm, the modeling data samples are optimized to obtain the sample optimization results.

[0016] In one embodiment, the optimization of the modeling data samples based on the optimized SMOTE algorithm and clustering algorithm to obtain sample optimization results specifically includes:

[0017] An optimal cluster number determination algorithm is used to remove noisy samples from the modeling data samples to obtain the sample clustering results;

[0018] Based on the adaptive random code generation algorithm, the sample clustering results are subjected to multi-sample mixed averaging to obtain the sample optimization results.

[0019] In one embodiment, the step of using an optimal cluster number determination algorithm to remove noisy samples from the modeling data samples to obtain sample clustering results specifically includes:

[0020] Determine the preset K value in the cluster, and the range of the preset K value;

[0021] Obtain the average distance from each data point to the centroid within each cluster, and the entropy value corresponding to the data with sample labels within each cluster;

[0022] Based on the average distance, the minimum value of the average distance, the entropy value, and the minimum value of the entropy value, an error function is obtained;

[0023] Obtain the relationship curve between the error function and the preset K value, and extract the inflection point of the relationship curve as a hyperparameter;

[0024] Based on the values of the relationship curve under the hyperparameter, sample clustering is performed to obtain the sample clustering results.

[0025] In one embodiment, the step of performing multi-sample mixed averaging on the sample clustering results based on the adaptive random code generation algorithm to obtain the sample optimization result specifically includes:

[0026] Determine the random code for each sample in the clustering results;

[0027] New samples are generated by averaging samples with the same random code within the same cluster.

[0028] The new sample is processed based on a preset salting iterative processing algorithm to obtain the sample optimization result.

[0029] In one embodiment, determining the random code for each sample in the sample clustering result specifically includes:

[0030] A unique ID is generated for each sample, and the unique ID is salted using a preset number of iterations to obtain a salted ID.

[0031] The salted ID is encoded using the MD5 hash algorithm, and the remainder obtained by dividing the first preset number of bits of the result by the preset length is used as the random code.

[0032] In one embodiment, processing the new sample based on a preset salting iterative processing algorithm to obtain the sample optimization result specifically includes:

[0033] The ID of the new sample is generated from the maximum ID value in the cluster where the new sample is located plus the preset number of iterations, with salt added. A new random code is then obtained based on the new sample ID.

[0034] When the number of newly added samples is less than a preset ratio of the total number of samples, the preset number of bits used for the random code is automatically incremented.

[0035] In one embodiment, the optimization of the modeling data samples based on the optimized SMOTE algorithm and clustering algorithm to obtain sample optimization results further includes:

[0036] The sample data is integrated using Spark SQL statements.

[0037] In a second aspect, the present invention also provides a sample optimization device in the Spark framework, comprising:

[0038] The acquisition module is used to acquire modeling data samples from a preset scene;

[0039] The optimization module is used to optimize the modeling data samples based on the optimized SMOTE algorithm and clustering algorithm to obtain the sample optimization results.

[0040] In a third aspect, the present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the sample optimization method in the Spark framework according to any of the above.

[0041] In a fourth aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the sample optimization method in the Spark framework according to any of the above.

[0042] The sample optimization method and apparatus in the Spark framework provided by this invention address the shortcomings of sample sampling processing in the process of big data sample processing. By combining the optimized SMOTE algorithm and clustering algorithm, it effectively handles the noise problem in the sample generation process, making the sample expansion more universal and suitable for implementation in the big data framework.

Attached Figure Description

[0043] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0044] Figure 1 is a flowchart of the sample optimization method in the Spark framework provided by this invention;

[0045] Figure 2 is a schematic diagram of the combination of the SMOTE algorithm and the clustering algorithm provided by this invention;

[0046] Figure 3 is a graph of the relationship between K and Dis in the experimental data provided by this invention;

[0047] Figure 4 is a schematic diagram of the sample size changing with iteration provided by this invention;

[0048] Figure 5 is a schematic diagram of the sample optimization device in the Spark framework provided by this invention;

[0049] Figure 6 is a schematic diagram of the structure of the electronic device provided by this invention.

Detailed Implementation

[0050] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0051] To address the shortcomings of existing technologies, this invention proposes a sample optimization method within the Spark framework. This method combines the SMOTE algorithm with a clustering algorithm, determining the optimal number of clusters to minimize noisy samples. Simultaneously, it optimizes the original new sample point generation strategy by adaptively generating random codes to achieve efficient multi-sample oversampling. The implementation process is simple and can be translated into SQL language for sample data integration.

[0052] Figure 1 is a flowchart of the sample optimization method in the Spark framework provided by this invention. As shown in Figure 1, the method includes:

[0053] 101. Obtain modeling data samples from the preset scene;

[0054] This invention primarily targets data samples in big data processing scenarios, such as...

[0055] 102. Based on the optimized SMOTE algorithm and clustering algorithm, the modeling data samples are optimized to obtain the sample optimization results.

[0056] The existing SMOTE algorithm is optimized by combining it with clustering methods. By determining the optimal number of clusters, using an adaptive random code generation strategy, and employing a multi-sample oversampling strategy, the goal of balancing samples is achieved within a limited number of iterations, reducing the workload of manual debugging and the computational and bandwidth pressure on the cluster.

[0057] This invention combines the optimized SMOTE algorithm and clustering algorithm to effectively handle the noise problem in the sample generation process, making the sample expansion more universal and suitable for implementation in a big data framework.

[0058] Based on the above embodiments, step 102 of the method specifically includes:

[0059] An optimal cluster number determination algorithm is used to remove noisy samples from the modeling data samples to obtain the sample clustering results;

[0060] Based on the adaptive random code generation algorithm, the sample clustering results are subjected to multi-sample mixed averaging to obtain the sample optimization results.

[0061] Specifically, this invention employs a clustering method that automatically determines the optimal number of clusters. By utilizing the inflection point of the clustering cost curve, noisy samples are removed from the modeling data samples to obtain the sample clustering results. Through the optimized SMOTE algorithm, a scheme of generating random codes and multi-sample weighted averaging is adopted. Based on an adaptive random code generation strategy, the random code length is automatically expanded when the iteration efficiency decreases by defining a threshold, thereby achieving rapid sample expansion. Different salting processes are then performed to ensure that the generated sample data has sufficient randomness, while the entire iteration process does not fall into an infinite loop.

[0062] The present invention combines clustering algorithms with the SMOTE algorithm, optimizes the original new sample point generation strategy by determining the optimal number of clusters, and achieves efficient multi-sample oversampling by adaptively generating random codes.

[0063] Based on any of the above embodiments, the step of using the optimal cluster number determination algorithm to remove noise samples from the modeling data samples and obtain sample clustering results specifically includes:

[0064] Determine the preset K value in the cluster, and the range of the preset K value;

[0065] Obtain the average distance from each data point to the centroid within each cluster, and the entropy value corresponding to the data with sample labels within each cluster;

[0066] Based on the average distance, the minimum value of the average distance, the entropy value, and the minimum value of the entropy value, an error function is obtained;

[0067] Obtain the relationship curve between the error function and the preset K value, and extract the inflection point of the relationship curve as a hyperparameter;

[0068] Based on the values of the relationship curve under the hyperparameter, sample clustering is performed to obtain the sample clustering results.

[0069] Specifically, to avoid the noisy samples caused by SMOTE, the sample clustering result should be used as the input to SMOTE. As shown in Figure 2(a), without clustering, newly generated samples may fall within the distribution area of another category (after multiple iterations, the small circle area around the spiral sample at the center of the small circle will be covered by spiral samples). By using a clustering strategy and applying the SMOTE algorithm within the same cluster, new samples are generated only within that cluster, which avoids noisy samples, as shown in Figure 2(b). The key point of this process is how to choose a suitable K value. If K is too large, each cluster may contain only a few sample points, and oversampling within the cluster is then not much different from duplicating data, as shown in Figure 2(c). However, if the number of clusters is too small, noisy data points may still be generated, as shown in Figure 2(d) (a sample generated within the large cluster B will overlap with the pentagonal sample within it).

[0070] This invention employs a joint objective function that combines the dispersion of data points around their centroids with the discrete entropy of the samples within each cluster to select the model hyperparameter. The specific process is as follows:

[0071] Let k vary from k_min to k_max. For each value of k, compute the following:

[0072] 1) The average distance Dis_a_k from each data point to the centroid of its cluster;

[0073] 2) The entropy value Dis_b_k of all data with sample labels within each cluster;

[0074] 3) The error function for this value of k, defined as: Dis_k = 0.5 * Dis_a_k / Dis_a_k_min + 0.5 * Dis_b_k / Dis_b_k_min, where Dis_a_k_min and Dis_b_k_min are the minimum values of the average distance and of the entropy value, respectively.

[0075] The above process yields a curve relating the error function value Dis_k to k. The inflection point of this curve is chosen as a suitable value for k. Figure 3 shows the relationship curve in a certain experiment. Based on the position of the inflection point in the figure, k=25 is selected as the optimal hyperparameter.
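The selection procedure above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the per-k values of Dis_a_k and Dis_b_k are made-up numbers rather than experimental data, the minima are assumed to be non-zero, and the inflection point is located with a simple largest-second-difference rule, which the source does not specify. The names `label_entropy`, `error_curve`, and `elbow` are hypothetical.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Shannon entropy (natural log) of the sample labels in one cluster."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def error_curve(dis_a, dis_b):
    """Dis_k = 0.5 * Dis_a_k / Dis_a_k_min + 0.5 * Dis_b_k / Dis_b_k_min."""
    a_min, b_min = min(dis_a.values()), min(dis_b.values())
    return {k: 0.5 * dis_a[k] / a_min + 0.5 * dis_b[k] / b_min for k in dis_a}

def elbow(curve):
    """Assumed inflection rule: the k whose second difference is largest."""
    ks = sorted(curve)
    best_k, best_bend = None, -math.inf
    for prev, k, nxt in zip(ks, ks[1:], ks[2:]):
        bend = curve[prev] - 2 * curve[k] + curve[nxt]
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Hypothetical measurements for k = 2..6 (not taken from the patent).
dis_a = {2: 8.0, 3: 6.0, 4: 2.4, 5: 2.2, 6: 2.0}
dis_b = {2: 0.8, 3: 0.6, 4: 0.24, 5: 0.22, 6: 0.2}
curve = error_curve(dis_a, dis_b)
best_k = elbow(curve)
```

With these illustrative numbers the curve drops steeply up to k=4 and flattens afterwards, so the rule picks k=4. In practice the values of Dis_a_k and Dis_b_k would come from running the clustering for each candidate k.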

[0076] This invention employs a clustering method that automatically determines the optimal number of clusters. By utilizing the inflection point of the clustering cost curve, it ensures the effectiveness of clustering, namely, a reasonable number of clusters and the number of samples within each cluster, while effectively reducing the workload of repeated manual testing.

[0077] Based on any of the above embodiments, the step of performing multi-sample mixed averaging processing on the sample clustering results based on the adaptive random code generation algorithm to obtain the sample optimization result specifically includes:

[0078] Determine the random code for each sample in the clustering results;

[0079] New samples are generated by averaging samples with the same random code within the same cluster.

[0080] The new sample is processed based on a preset salting iterative processing algorithm to obtain the sample optimization result.

[0081] Specifically, determining the random code for each sample in the sample clustering result includes:

[0082] A unique ID is generated for each sample, and the unique ID is salted using a preset number of iterations to obtain a salted ID.

[0083] The salted ID is encoded using the MD5 hash algorithm, and the remainder obtained by dividing the first preset number of bits of the result by the preset length is used as the random code.

[0084] The step of processing the new sample based on a preset salting iterative processing algorithm to obtain the sample optimization result specifically includes:

[0085] The ID of the new sample is generated from the maximum ID value in the cluster where the new sample is located plus the preset number of iterations, with salt added. A new random code is then obtained based on the new sample ID.

[0086] When the number of newly added samples is less than a preset ratio of the total number of samples, the preset number of bits used for the random code is automatically incremented.

[0087] Specifically, operator big data is massive, and the sample data used for modeling is on the order of millions of records. Conventional big data processing based on the MapReduce framework loses considerable computational efficiency in scenarios that involve data shuffling, such as sorting. Implementing SMOTE directly in Spark would increase both programming complexity and computational load. Therefore, this invention proposes an adaptive random code generation method that generates new samples by multi-sample mixed averaging, as detailed below:

[0088] First, a random code is generated. Each sample is given a unique ID, for example a mobile phone number in carrier data or an auto-incrementing ID in general data. The ID is salted with the iteration number, and the salted ID is hashed with MD5. To ensure sufficient randomness, the first n bits of the hash are taken, and the remainder of each bit divided by m is used as the random code for that record. In the initial iteration, n=1. Taking the remainder of each bit divided by m avoids a large amount of duplicate data after the random code length is extended.

[0089] Here m < 16. If m is too large, the number of samples sharing the same random code will be too small, and the effect will be close to that of ordinary oversampling; if m is too small, extending the random code length will not significantly improve iteration efficiency. In practical applications, m = 4 is generally chosen.
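The random code generation described above can be sketched as follows. This is a minimal illustration, assuming that the "first n bits" are the first n hexadecimal digits of the MD5 digest and that the salt is the iteration number appended to the ID as a string; the function name `random_code` and the example phone number are hypothetical.

```python
import hashlib

def random_code(sample_id, iteration, n=1, m=4):
    """Random code for one record: the first n hex digits of
    MD5(ID + iteration), each reduced modulo m."""
    salted = f"{sample_id}{iteration}".encode("utf-8")  # salt = iteration number
    digest = hashlib.md5(salted).hexdigest()            # 32 hex digits
    # One value in 0..m-1 per digit, so the code space has m**n values.
    return tuple(int(ch, 16) % m for ch in digest[:n])

# Hypothetical record ID (for example a phone number) in iteration 0.
code = random_code("13800000000", iteration=0, n=2, m=4)
```

Because the code depends only on the record's own ID and the iteration number, each record can be labeled independently, without comparing it to any other record. Changing the iteration number changes the salt, so the same record generally lands in a different group in the next iteration.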

[0090] Second, multi-sample oversampling is performed: samples with the same random code within the same cluster are averaged to generate a new sample. A new sample is generated even if only one sample has that random code, which means that the generated data may partly overlap with the original data. When integrating with the original sample set, it is possible to choose whether to retain the duplicate data.
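The averaging step can be sketched as a group-by over (cluster, random code). This is a minimal illustration that assumes the random code of each sample has already been computed; the function name `average_by_code` and the toy records are hypothetical.

```python
from collections import defaultdict

def average_by_code(samples):
    """samples: iterable of (cluster, code, features) tuples.
    Returns one mean feature vector per (cluster, code) group."""
    groups = defaultdict(list)
    for cluster, code, features in samples:
        groups[(cluster, code)].append(features)
    # Column-wise mean of the feature vectors in each group.
    return {
        key: tuple(sum(col) / len(rows) for col in zip(*rows))
        for key, rows in groups.items()
    }

samples = [
    ("A", (0,), (0.0, 0.0)),
    ("A", (0,), (2.0, 4.0)),
    ("A", (1,), (4.0, 0.0)),    # the only sample with this code in cluster A
    ("B", (0,), (10.0, 10.0)),  # the only sample in cluster B
]
new_samples = average_by_code(samples)
```

The two singleton groups reproduce their original feature vectors exactly, which is the overlap with the original data that the paragraph above describes.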

[0091] Third, the ID of each new sample is generated as the maximum ID value in the cluster plus the number of iterations, with salt added. Using a different salt in each iteration ensures that the iteration process does not fall into an infinite loop and improves the randomness of the generated samples. The extreme case is that, when n is too large, every sample has a different random code, and the generated new samples are all repetitions of existing data. Using different generation schemes can avoid this problem to a certain extent.

[0092] If the sample random codes used are too short, all samples within the same cluster may have the same random code. In this case, the number of new samples added each time is equal to the number of clusters after clustering. To ensure iteration efficiency, an adaptive random code definition scheme is proposed: when the number of new samples is less than 10% of the total number of samples, n increments automatically.
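The adaptive rule can be sketched as part of a complete expansion loop. This is a minimal illustration under stated assumptions: IDs are integers, new IDs are simply assigned as a running maximum plus one (a simplification of the max-ID-plus-iteration-plus-salt strategy), duplicates are retained, and a `max_iter` guard is added for safety. The names `random_code` and `expand` are hypothetical.

```python
import hashlib
from collections import defaultdict

def random_code(sample_id, iteration, n, m):
    digest = hashlib.md5(f"{sample_id}{iteration}".encode("utf-8")).hexdigest()
    return tuple(int(ch, 16) % m for ch in digest[:n])

def expand(samples, target, m=4, ratio=0.1, max_iter=50):
    """samples: list of (sample_id, cluster, features) tuples.
    Returns the expanded list and the final random code length n."""
    samples = list(samples)
    n, iteration = 1, 0
    while len(samples) < target and iteration < max_iter:
        # Group by (cluster, random code); the iteration number is the salt.
        groups = defaultdict(list)
        for sid, cluster, feats in samples:
            groups[(cluster, random_code(sid, iteration, n, m))].append(feats)
        next_id = max(sid for sid, _, _ in samples) + 1
        new = []
        for (cluster, _code), rows in sorted(groups.items()):
            mean = tuple(sum(col) / len(rows) for col in zip(*rows))
            new.append((next_id, cluster, mean))  # simplified ID assignment
            next_id += 1
        # Adaptive rule: lengthen the code when an iteration yields too little.
        if len(new) < ratio * len(samples):
            n += 1
        samples.extend(new)
        iteration += 1
    return samples, n

# Hypothetical seed data: 20 records in two clusters.
seed = [(i, "A" if i < 10 else "B", (float(i), float(i % 3))) for i in range(20)]
expanded, final_n = expand(seed, target=60)
```

Each iteration adds at most one sample per (cluster, code) group, so with a short code the yield per iteration is capped by the number of clusters times m to the power n. Incrementing n raises that cap once the sample set has grown large relative to it.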

[0093] The adaptive random code generation algorithm used in this invention avoids the process of searching for the closest points of the same class around the sample. Therefore, it can greatly reduce the process of full data shuffling and exchange, and effectively reduce the bandwidth pressure caused by data exchange between different machines in a real cluster.

[0094] Based on any of the above embodiments, the step of optimizing the modeling data samples based on the optimized SMOTE algorithm and clustering algorithm to obtain sample optimization results further includes:

[0095] The sample data is integrated using Spark SQL statements.

[0096] It is understood that the algorithm proposed in this invention can be expressed directly in database SQL, so sample data can be integrated in the form of Spark SQL. In practical applications, depending on whether duplicate data is to be retained, either UNION (duplicates removed) or UNION ALL (duplicates retained) can be selected for data integration. The specific algorithm is as follows:
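The difference between the two integration options can be seen in a small example. This sketch uses Python's built-in SQLite driver purely as a stand-in so that it runs without a cluster; the UNION and UNION ALL semantics shown are standard SQL, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE original  (cluster TEXT, x REAL, y REAL);
    CREATE TABLE generated (cluster TEXT, x REAL, y REAL);
    INSERT INTO original  VALUES ('A', 1.0, 2.0), ('A', 3.0, 4.0);
    -- The second generated row repeats an original record.
    INSERT INTO generated VALUES ('A', 2.0, 3.0), ('A', 1.0, 2.0);
""")

# UNION ALL keeps every row, including the repeated record.
keep_dupes = conn.execute(
    "SELECT * FROM original UNION ALL SELECT * FROM generated"
).fetchall()

# UNION removes rows that are exact duplicates.
drop_dupes = conn.execute(
    "SELECT * FROM original UNION SELECT * FROM generated"
).fetchall()
conn.close()
```

Here `keep_dupes` contains four rows and `drop_dupes` contains three, because the repeated record is collapsed by UNION.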

[0097] Algorithm: An Optimized SMOTE Algorithm and Its Implementation in Spark

[0098] Input: Sample data D with cluster labels, number of target samples

[0099] n = 1, m = 4, I = 0

[0100] While the iteration stopping condition is not met (sample size is less than the target sample size) {

[0101] Generate a unique ID for each sample

[0102] Generate salted random key C = substr(MD5(concat(ID, I)), 1, n) % 4

[0103] New samples are obtained by averaging the samples with the same random code within the cluster, and then combined to form a new sample data set:

[0104]

[0105] Output: Data after sample augmentation

[0106] This invention transforms the entire core implementation process into SQL language, enabling flexible handling of whether or not to retain duplicate data.

[0107] Based on any of the above embodiments, the above scheme is applied to a precision marketing scenario with an actual sample data volume of 300:9000. The initial model parameters are n=1 and m=4, and a strategy allowing duplicate data is adopted. The automatic cluster number determination scheme divides the original samples into 6 clusters. After optimized SMOTE processing, the change in sample size with the number of iterations is shown in Figure 4: within 11 iterations, the sample data expands from 300 to 8909. The random code length n increases after the fifth iteration, and the sample expansion rate rises significantly afterwards, demonstrating the effectiveness of the sample expansion scheme and the adaptive random code generation method in this proposal. When these samples are input into a machine learning method, the model's generalization ability on the validation set (completely independent of the test and training sets) improves by approximately 10% compared with traditional upsampling schemes.

[0108] The sample optimization device in the Spark framework provided by this invention is described below. The device described below and the sample optimization method described above may be read in correspondence with each other.

[0109] Figure 5 is a schematic diagram of the sample optimization device in the Spark framework provided by the present invention. As shown in Figure 5, the device includes an acquisition module 51 and an optimization module 52, wherein:

[0110] The acquisition module 51 is used to acquire modeling data samples in a preset scenario; the optimization module 52 is used to optimize the modeling data samples based on the optimized SMOTE algorithm and clustering algorithm to obtain sample optimization results.

[0111] This invention combines the optimized SMOTE algorithm and clustering algorithm to effectively handle the noise problem in the sample generation process, making the sample expansion more universal and suitable for implementation in a big data framework.

[0112] Based on any of the above embodiments, the optimization module 52 includes a clustering submodule 521 and an optimization submodule 522, wherein:

[0113] The clustering submodule 521 is used to remove noisy samples in the modeling data samples by using the optimal cluster number determination algorithm to obtain the sample clustering result; the optimization submodule 522 is used to perform multi-sample mixed averaging processing on the sample clustering result based on the adaptive random code generation algorithm to obtain the sample optimization result.

[0114] Based on any of the above embodiments, the clustering submodule 521 is specifically used for:

[0115] Determine the preset K value in the cluster, and the range of the preset K value;

[0116] Obtain the average distance from each data point to the centroid within each cluster, and the entropy value corresponding to the data with sample labels within each cluster;

[0117] Based on the average distance, the minimum value of the average distance, the entropy value, and the minimum value of the entropy value, an error function is obtained;

[0118] Obtain the relationship curve between the error function and the preset K value, and extract the inflection point of the relationship curve as a hyperparameter;

[0119] Based on the values of the relationship curve under the hyperparameter, sample clustering is performed to obtain the sample clustering results.

[0120] Based on any of the above embodiments, the optimization submodule 522 is specifically used for:

[0121] Determine the random code for each sample in the clustering results;

[0122] New samples are generated by averaging samples with the same random code within the same cluster.

[0123] The new sample is processed based on a preset salting iterative processing algorithm to obtain the sample optimization result.

[0124] Based on any of the above embodiments, determining the random code of each sample in the sample clustering result specifically includes:

[0125] A unique ID is generated for each sample, and the unique ID is salted using a preset number of iterations to obtain a salted ID.

[0126] The salted ID is encoded using the MD5 hash algorithm, and the remainder obtained by dividing the first preset number of bits of the result by the preset length is used as the random code.

[0127] Based on any of the above embodiments, the step of processing the new sample using a preset salting iterative processing algorithm to obtain the sample optimization result specifically includes:

[0128] The ID of the new sample is generated from the maximum ID value in the cluster where the new sample is located plus the preset number of iterations, with salt added. A new random code is then obtained based on the new sample ID.

[0129] When the number of newly added samples is less than a preset ratio of the total number of samples, the preset number of bits used for the random code is automatically incremented.

[0130] Based on any of the above embodiments, the optimization module 52 further includes an integration submodule 523, which is used to integrate sample data using Spark SQL statements.

[0131] Figure 6 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 6, the electronic device may include a processor 610, a communications interface 620, a memory 630, and a communication bus 640. The processor 610, communications interface 620, and memory 630 communicate with each other via the communication bus 640. The processor 610 can call logical instructions in the memory 630 to execute a sample optimization method in the Spark framework. This method includes: acquiring modeling data samples from a preset scenario; optimizing the modeling data samples based on the optimized SMOTE algorithm and clustering algorithm to obtain the sample optimization result.

[0132] Furthermore, the logical instructions in the aforementioned memory 630 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0133] In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium, and the computer program includes program instructions. When the program instructions are executed by a computer, the computer is able to execute the sample optimization method in the Spark framework provided by the above method embodiments, the method including: obtaining modeling data samples in a preset scenario; and optimizing the modeling data samples based on the optimized SMOTE algorithm and clustering algorithm to obtain sample optimization results.

[0134] In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon. When executed by a processor, the computer program implements the sample optimization method in the Spark framework provided above. The method includes: acquiring modeling data samples in a preset scenario; and optimizing the modeling data samples based on the optimized SMOTE algorithm and clustering algorithm to obtain sample optimization results.

[0135] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0136] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0137] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A sample optimization method in a Spark framework, characterized in that the method comprises the following steps: obtaining modeling data samples in a preset scene; and optimizing the modeling data samples based on an optimized SMOTE algorithm and a clustering algorithm to obtain sample optimization results; wherein the optimization of the modeling data samples based on the optimized SMOTE algorithm and the clustering algorithm to obtain the sample optimization results specifically comprises: removing noise samples in the modeling data samples by using an optimal clustering cluster number determination algorithm to obtain sample clustering results; and performing multi-sample mixed average processing on the sample clustering results based on an adaptive random code generation algorithm to obtain the sample optimization results; and wherein the multi-sample mixed average processing on the sample clustering results based on the adaptive random code generation algorithm to obtain the sample optimization results specifically comprises: determining random codes of each sample in the sample clustering results; performing average sampling on samples with the same random code in the same cluster to generate new samples; and processing the new samples based on a preset salt iteration processing algorithm to obtain the sample optimization results.

2. The sample optimization method in the Spark framework according to claim 1, wherein the removal of the noise samples in the modeling data samples by using the optimal clustering cluster number determination algorithm to obtain the sample clustering results specifically comprises: determining a preset K value for clustering and an interval range of the preset K value; obtaining an average distance from each data point in each cluster to the cluster centroid, and an entropy value corresponding to the labeled data in each cluster; obtaining an error function based on the average distance, a minimum value of the average distance, the entropy value, and a minimum value of the entropy value; obtaining a relationship curve between the error function and the preset K value, and extracting an inflection point of the relationship curve as a hyperparameter; and performing sample clustering based on a value of the relationship curve at the hyperparameter to obtain the sample clustering results.

3. The method of claim 1, wherein the determination of the random codes of each sample in the sample clustering results specifically comprises: generating a unique ID for each sample, and salting the unique ID by using a preset iteration number to obtain a salted ID; and encoding the salted ID based on an MD5 encryption algorithm, and taking the remainder obtained by dividing the data of the first preset number of bits by a preset length as the random code.

4. The method of claim 1, wherein the processing of the new samples based on the preset salt iteration processing algorithm to obtain the sample optimization results specifically comprises: salting a maximum ID in the cluster where the new sample is located by using a preset iteration number to generate a new sample ID, and obtaining a new random code based on the new sample ID; and when the number of added samples is less than a preset ratio of the total number of samples, automatically increasing the number of added samples.

5. The method of claim 1, wherein, the optimization of the modeling data samples based on the optimized SMOTE algorithm and the clustering algorithm to obtain the sample optimization results further comprises: integrating sample data by using a Spark SQL statement.

6. A sample optimization apparatus in a Spark framework, characterized in that the apparatus comprises: an obtaining module, configured to obtain modeling data samples in a preset scene; and an optimization module, configured to optimize the modeling data samples based on an optimized SMOTE algorithm and a clustering algorithm to obtain sample optimization results; wherein the optimization of the modeling data samples based on the optimized SMOTE algorithm and the clustering algorithm to obtain the sample optimization results specifically comprises: removing noise samples in the modeling data samples by using an optimal clustering cluster number determination algorithm to obtain sample clustering results; and performing multi-sample mixed average processing on the sample clustering results based on an adaptive random code generation algorithm to obtain the sample optimization results; and wherein the multi-sample mixed average processing on the sample clustering results based on the adaptive random code generation algorithm to obtain the sample optimization results specifically comprises: determining random codes of each sample in the sample clustering results; performing average sampling on samples with the same random code in the same cluster to generate new samples; and processing the new samples based on a preset salt iteration processing algorithm to obtain the sample optimization results.

7. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor executes the computer program to implement the steps of the sample optimization method in the Spark framework according to any one of claims 1 to 5.

8. A non-transitory computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the sample optimization method in the Spark framework according to any one of claims 1 to 5.