Clothing sales forecasting system based on big data analysis
The system uses Apache Spark's distributed computing together with data mining (clustering and the Apriori algorithm) to construct a hybrid prediction model, addressing the difficulty of processing large-scale apparel sales data and forecasting sales from it.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: FUZHOU RONGZHIQUAN SOFTWARE TECHNOLOGY CO LTD
- Filing Date: 2025-03-13
- Publication Date: 2026-06-26
AI Technical Summary
Existing technologies struggle to process large-scale data from apparel production and sales effectively: it is difficult to use Apache Spark for distributed processing, to perform cluster analysis and correlation mining, and to build hybrid prediction models for apparel sales forecasting.
Clothing datasets are processed with distributed computing on Apache Spark, data mining is performed using clustering and the Apriori algorithm, and a hybrid prediction model is constructed to predict clothing sales. This model combines time-series, association, and visual prediction models through dynamic weighted fusion.
The system enables efficient processing of large-scale apparel data, uncovers potential correlations and influencing factors, and improves the accuracy and real-time performance of apparel sales forecasts.
Figure CN120235643B_ABST
Description
Technical Field
[0001] This invention belongs to the field of data processing technology, specifically a clothing sales forecasting system based on big data analysis.
Background Technology
[0002] In today's apparel industry, big data analytics plays a crucial role. Apache Spark, a fast and versatile computing engine, utilizes an in-memory computing model to store intermediate results in memory, significantly improving processing speed. By leveraging Apache Spark, large-scale apparel datasets and image data can be processed efficiently. Through the application of clustering algorithms, correlation analysis methods, and the Apriori algorithm, the potential information within multi-source apparel data is deeply mined, and the influence of factors such as production, environment, sales season, and clothing style is analyzed. Ultimately, based on the data mining results, a hybrid predictive model is constructed to achieve accurate predictions of apparel sales, providing strong data support for apparel companies' production and sales decisions.
[0003] The existing technology has the following problems: it is difficult to use Apache Spark for distributed processing of data collected and integrated from the clothing production and sales process; it is difficult to use clustering algorithms on Spark for cluster analysis; it is difficult to mine the correlation between clothing image data and sales data; it is difficult to use the Apriori algorithm to mine the correlation; and it is difficult to build a hybrid prediction model for clothing sales forecasting and real-time monitoring.
Summary of the Invention
[0004] To address the problems existing in the prior art, this invention proposes a clothing sales forecasting system based on big data analysis.
[0005] Therefore, a first aspect of the present invention provides a clothing sales forecasting system based on big data analysis, comprising the following modules:
[0006] Data collection module: Collects product data and production data during the garment production process, and collects sales data, customer data, environmental data, and garment image data during the garment sales process;
[0007] Data processing module: Integrates collected product data, production data, sales data, customer data, and environmental data into a clothing dataset, and performs data cleaning and standardization on the integrated clothing dataset; filters and denoises the collected clothing image data and performs image enhancement.
[0008] Distributed computing processing module: By using Apache Spark, clothing datasets and clothing image data are imported into the Spark cluster for distributed computing processing;
[0009] Data Mining Module: Perform cluster analysis on the clothing dataset using clustering algorithms on Spark; conduct correlation analysis on clothing image data and sales data; use the Apriori algorithm to mine associations based on the preprocessed clothing dataset, and analyze and calculate the impact of production, environment, sales season, and clothing style.
[0010] Predictive Analysis Module: Constructs a hybrid forecasting model for predictive analysis; updates the hybrid forecasting model parameters in real time by calculating the relative deviation between actual and predicted sales data.
[0011] Preferably, product data and production data are collected during the garment production process, and sales data, customer data, environmental data, and garment image data are collected during the garment sales process, including the following steps:
[0012] Collect product data and production data during the garment production process. Product data includes: cost, fabric composition, style design and production process; production data includes: production time, production quantity, production equipment uptime, failure rate and production efficiency.
[0013] Collect sales data, customer data, environmental data, and clothing image data during the clothing sales process. Sales data includes: sales time, sales quantity, sales amount, and profit; customer data includes: number of customers, time spent in different clothing areas, number of times they tried on clothes, clothing style selection, return rate, and customer satisfaction; environmental data includes: temperature, humidity, wind speed, and light intensity.
[0014] Preferably, by using Apache Spark, the clothing dataset and clothing image data are imported into a Spark cluster for distributed computing processing, including the following steps:
[0015] Set up an Apache Spark cluster environment and convert the preprocessed product data, production data, sales data, customer data, and environmental data into Spark-supported DataFrame or Dataset formats; integrate TensorFlow on Spark for distributed processing of clothing image data;
[0016] By leveraging Spark's distributed storage and computing capabilities, clothing datasets and clothing image data are distributed and stored across different nodes in the cluster.
[0017] The genetic algorithm scheduler is integrated into Spark's task scheduling framework to dynamically allocate tasks to different Spark nodes. By monitoring the execution of task scheduling, performance data is collected, including task execution time, task waiting time, data transfer time, and resource utilization. Based on the performance data, an optimized task scheduling and allocation strategy is obtained through the iterative process of the genetic algorithm.
[0018] Using Spark's distributed computing capabilities, various data in the clothing dataset are processed in parallel; statistical indicators of various data in the clothing dataset are calculated, including: maximum value, minimum value, mean, median, and standard deviation.
[0019] Based on the preprocessed clothing image data, an image processing library is used to convert the clothing image data into tensor representations. Spark's distributed computing framework is used to perform parallel feature extraction on the clothing image data. A convolutional neural network model is built using a deep learning framework to automatically extract edge features, texture features, clothing color features, and style features from the clothing image data. The convolutional neural network model is trained using historical clothing image data, and the trained model is then used to automatically extract features from the real-time collected and preprocessed clothing image data, outputting the feature extraction results. The extracted feature results are then concatenated to generate a feature vector.
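The genetic algorithm scheduling step in [0017] can be illustrated with a small standalone sketch. It is not wired into Spark's scheduler and is not the patented scheduler: it assumes that only estimated task execution times are available and that the goal is to minimize the makespan (the load of the busiest node). The names `makespan` and `ga_schedule` and all parameter values are hypothetical.

```python
import random

def makespan(assignment, durations, n_nodes):
    """Load of the busiest node when task i runs on node assignment[i]."""
    loads = [0.0] * n_nodes
    for task, node in enumerate(assignment):
        loads[node] += durations[task]
    return max(loads)

def ga_schedule(durations, n_nodes, pop_size=30, generations=60,
                mutation_rate=0.1, seed=0):
    """Toy genetic algorithm: a chromosome maps each task to a node index.

    Assumes at least two tasks (needed for one-point crossover).
    """
    rng = random.Random(seed)
    n = len(durations)
    pop = [[rng.randrange(n_nodes) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the better half of the population (elitism).
        pop.sort(key=lambda a: makespan(a, durations, n_nodes))
        survivors = pop[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Mutation: occasionally move a task to a random node.
            child = [rng.randrange(n_nodes) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        pop = survivors + children
    return min(pop, key=lambda a: makespan(a, durations, n_nodes))

durations = [4.0, 3.0, 3.0, 2.0, 2.0, 2.0]     # hypothetical task times
best = ga_schedule(durations, n_nodes=2)
print(best, makespan(best, durations, 2))
```

A fuller fitness function could also weigh the waiting time, data transfer time, and resource utilization that the paragraph lists as monitored performance data.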
[0020] Preferably, cluster analysis of the clothing dataset is performed using a clustering algorithm on Spark, including the following steps:
[0021] On Spark, K-Means is used as the clustering algorithm; the value of K is initially set using the elbow method; K initial centroids are randomly selected; the Euclidean distance from each data point to all centroids is calculated, and each data point is assigned to the cluster represented by the nearest centroid; the average value of all data points in each cluster is recalculated as the new centroid; the assignment and update steps are repeated until the centroids no longer change or the maximum number of iterations is reached.
[0022] Based on the above steps, cluster analysis is performed on the clothing dataset to obtain clusters for product data, production data, sales data, customer data, and environmental data. The Spark MLlib library is used to calculate the silhouette coefficient of each data point, the average silhouette coefficient of the entire clothing dataset, and the average silhouette coefficient of the data points in each cluster.
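The K-Means steps above can be sketched in plain Python to make the assignment and update loop concrete. This is a single-machine illustration, not the Spark MLlib implementation; the function names `kmeans` and `sse` and the sample points are hypothetical.

```python
import math
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's K-Means with Euclidean distance on a list of equal-length tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # K randomly selected initial centroids
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid becomes the mean of its cluster members.
        new_centroids = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:          # centroids no longer change
            break
        centroids = new_centroids
    return centroids, labels

def sse(points, centroids, labels):
    """Sum of squared distances to the assigned centroid (used for the elbow plot)."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
for k in (1, 2, 3):
    centroids, labels = kmeans(points, k)
    print(k, round(sse(points, centroids, labels), 3))
```

Printing the SSE for several values of K gives the curve whose bend is inspected when choosing K with the elbow method.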
[0023] Preferably, a correlation analysis is performed on the clothing image data and sales data, including the following steps:
[0024] Based on the convolutional neural network model on Spark, features are automatically extracted and feature vectors are generated from the collected and preprocessed clothing image data. Correlation analysis is then performed using the Spark MLlib library, and the correlation coefficient between the feature vectors and sales data is calculated using the following formula:
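[0025] $$G = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}$$
[0026] where G represents the correlation coefficient, $X_i$ is the i-th feature value in the feature vector, $Y_i$ is the i-th sales data value, n is the number of paired values, and $\bar{X}$ and $\bar{Y}$ are the mean of all feature values in the feature vector and the mean of all sales data, respectively.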
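As an illustration, the correlation coefficient G can be computed as follows, assuming it is the standard Pearson coefficient implied by the definitions of the feature values, the sales values, and their means. The function name `pearson` and the sample numbers are hypothetical, and this plain-Python sketch does not use Spark MLlib.

```python
import math

def pearson(xs, ys):
    """Pearson correlation between paired feature values xs and sales values ys.

    Raises ZeroDivisionError if either series is constant (zero variance).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

feature_values = [0.2, 0.4, 0.6, 0.8]     # hypothetical image feature values
sales_values = [20.0, 40.0, 60.0, 80.0]   # hypothetical paired sales values
print(pearson(feature_values, sales_values))
```

A value near 1 or -1 indicates a strong linear relationship between the image feature and sales; a value near 0 indicates little linear relationship.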
[0027] Preferably, the Apriori algorithm is used to mine associations based on the preprocessed clothing dataset, and the impact of production, environment, sales season, and clothing style is analyzed and calculated, including the following steps:
[0028] The minimum support threshold is set to 0.1 and the minimum confidence threshold is set to 0.7. On the clothing dataset, the Apriori algorithm in association rule mining is used to generate frequent itemsets. The frequent itemsets are obtained by filtering with support greater than or equal to the minimum support threshold.
[0029] Feature filtering is performed on frequent itemsets generated by the Apriori algorithm to select itemsets that contain production data features, sales season features, environmental data features, clothing style features, and sales data features.
[0030] Association rules are generated from the frequent itemsets, and support, confidence, and lift are calculated for each generated rule. Rules whose confidence is greater than or equal to the minimum confidence threshold are retained.
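The frequent-itemset and rule-generation steps can be sketched in plain Python using the thresholds stated above (minimum support 0.1, minimum confidence 0.7). This is a minimal single-machine Apriori for illustration only; the function names, the `key=value` item encoding, and the sample transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support=0.1):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    tsets = [frozenset(t) for t in transactions]
    n = len(tsets)

    def support(itemset):
        return sum(itemset <= t for t in tsets) / n

    level = {frozenset([item]) for t in tsets for item in t}
    frequent, k = {}, 1
    while level:
        kept = {s: support(s) for s in level}
        kept = {s: v for s, v in kept.items() if v >= min_support}
        frequent.update(kept)
        k += 1
        # Join frequent (k-1)-itemsets, then prune candidates that have
        # an infrequent (k-1)-subset (the Apriori property).
        level = {a | b for a in kept for b in kept if len(a | b) == k}
        level = {c for c in level
                 if all(frozenset(s) in kept for s in combinations(c, k - 1))}
    return frequent

def association_rules(frequent, min_confidence=0.7):
    """Return (antecedent, consequent, support, confidence, lift) tuples."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / frequent[consequent]
                    rules.append((antecedent, consequent, sup, confidence, lift))
    return rules

transactions = [   # hypothetical discretized records
    {"season=summer", "style=skirt", "sales=high"},
    {"season=summer", "style=skirt", "sales=high"},
    {"season=summer", "style=shirt", "sales=high"},
    {"season=winter", "style=coat", "sales=high"},
    {"season=winter", "style=skirt", "sales=low"},
]
for rule in association_rules(apriori(transactions)):
    print(rule)
```

Numeric fields such as temperature or production quantity would need to be discretized into items like these before mining; how the source does this is not stated.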
[0031] The production influence is obtained by taking every association rule from a production data feature to a sales data feature, multiplying the support of the rule by its lift, and then summing the products.
[0032] The environmental influence is obtained by taking every association rule from an environmental data feature to a sales data feature, multiplying the support of the rule by its lift, and then summing the products.
[0033] The sales season influence is obtained by taking every association rule from a sales season feature to a sales data feature, multiplying the support of the rule by its lift, and then summing the products.
[0034] The clothing style influence is obtained by taking every association rule from a clothing style feature to a sales data feature, multiplying the support of the rule by its lift, and then summing the products.
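The four influence values share one computation, which can be sketched as a sum of support times lift over the matching rules, taking the multiplier to be the association-rule lift computed in [0030]. The rule tuples, the `group:feature` naming, and the function name `influence` are hypothetical.

```python
def influence(rules, factor_prefix, sales_prefix="sales:"):
    """Sum support * lift over rules from one factor group to sales features.

    Each rule is (antecedent_items, consequent_items, support, lift).
    """
    total = 0.0
    for antecedent, consequent, support, lift in rules:
        from_factor = all(item.startswith(factor_prefix) for item in antecedent)
        to_sales = all(item.startswith(sales_prefix) for item in consequent)
        if from_factor and to_sales:
            total += support * lift
    return total

rules = [   # hypothetical mined rules
    ({"production:high_efficiency"}, {"sales:high_quantity"}, 0.30, 1.5),
    ({"production:low_failure_rate"}, {"sales:high_profit"}, 0.20, 1.2),
    ({"season:summer"}, {"sales:high_quantity"}, 0.25, 1.4),
]
print(influence(rules, "production:"))
print(influence(rules, "season:"))
```

Calling the same function with the environment and style prefixes yields the remaining two influence values.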
[0035] Preferably, a hybrid prediction model is constructed for predictive analysis, including the following steps:
[0036] After collecting and preprocessing historical clothing datasets and historical clothing image data, we use distributed computing and data mining analysis to obtain statistical indicators, cluster analysis results, correlation analysis results, and association rule results for various data in the historical clothing dataset.
[0037] Based on the various analytical results obtained above, a training dataset is constructed. The training dataset includes: statistical indicators of various data in the historical clothing dataset; clusters of product data, production data, sales data, customer data, and environmental data, as well as the silhouette coefficient of each data point, the average silhouette coefficient of the entire clothing dataset, and the average silhouette coefficient of data points in each cluster; the correlation coefficient between historical clothing image feature vectors and sales data; and the influence of production, environment, sales season, and clothing style.
[0038] Long Short-Term Memory (LSTM) networks are used as the time-series prediction model for analysis and prediction. The training dataset is arranged chronologically and divided into multiple time windows, each containing sales data for a specific time period. Within each time window, the sales quantity, sales revenue, and profit are labeled. The labeled training dataset is input into the time-series prediction model for training. Real-time collected and preprocessed clothing datasets and clothing image data are input into the time-series prediction model, which outputs the predicted sales data values: sales quantity, sales revenue, and profit.
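The time-window preparation for the time-series model can be sketched as follows. The sketch assumes sliding windows over chronologically sorted records with the next period's value as the label, which is one common way to frame LSTM training data; the source does not specify the window length or labeling scheme. The LSTM itself is omitted, and the names `make_windows`, `window`, and `horizon` are hypothetical.

```python
def make_windows(series, window, horizon=1):
    """Split a chronological series into (input_window, label) pairs.

    The label is the value `horizon` steps after the end of each window.
    """
    samples = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        label = series[start + window + horizon - 1]
        samples.append((inputs, label))
    return samples

# Hypothetical daily records: (date, sales quantity).
records = [("2025-01-03", 14), ("2025-01-01", 10), ("2025-01-02", 12),
           ("2025-01-04", 13), ("2025-01-05", 17)]
records.sort(key=lambda r: r[0])              # arrange chronologically
quantities = [qty for _, qty in records]
for inputs, label in make_windows(quantities, window=3):
    print(inputs, "->", label)
```

Sales revenue and profit can be windowed the same way, or all three can be stacked into one multivariate input per time step.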
[0039] The training dataset is input into the association prediction model for training; the real-time collected and preprocessed clothing dataset and clothing image data are processed by distributed computing and data mining analysis to obtain the real-time training dataset, and the real-time training dataset is input into the association prediction model. The output prediction values are: purchase intention, purchase frequency, repurchase rate, sales quantity, sales amount and profit.
[0040] The clothing categories are labeled as: shirts, skirts, pants, and coats; the labeled training dataset is input into the visual prediction model for training; the real-time collected and preprocessed clothing dataset and clothing image data are input into the visual prediction model, and the predicted categories of the clothing image data are output, with the predicted values being the probability of shirts, skirts, pants, and coats.
[0041] For the predicted values output by the trained time-series prediction model, association prediction model, and visual prediction model, the corresponding actual values are collected in real time.
[0042] A hybrid prediction model is obtained by integrating the time-series prediction model, the association prediction model, and the visual prediction model through a dynamic weighted fusion method. The weight of each prediction model is calculated from its mean squared error (MSE). The MSE is calculated as follows:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$$
where N represents the total number of historical clothing datasets and images collected, $x_i$ is the actual value, and $y_i$ is the predicted value;
[0043] The weights of the time-series prediction model, the association prediction model, and the visual prediction model are each calculated by a separate formula [formulas not reproduced];
[0044] These formulas yield the weight $w_t$ of the time-series prediction model, the weight $w_g$ of the association prediction model, and the weight $w_s$ of the visual prediction model, where M1, M2 and M3 are the mean squared errors of the time-series prediction model, the association prediction model, and the visual prediction model, respectively;
[0045] The predicted values of the time-series prediction model, the association prediction model, and the visual prediction model are each multiplied by the corresponding model weight and then summed to obtain the predicted value of the hybrid prediction model.
[0046] The real-time collected and preprocessed clothing dataset and clothing image data are input into the trained time-series prediction model, association prediction model, and visual prediction model, respectively. The predicted values of the three trained models are then input into the hybrid prediction model to obtain its prediction results, including sales volume, sales revenue, and profit.
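The dynamic weighted fusion can be sketched as below. The MSE follows its standard definition. The weight formulas are not legible in this text, so the sketch assumes inverse-MSE weights normalized to sum to 1, a common choice that gives more weight to the model with lower error; this is an assumption, not the formula from the source. All function names and numbers are hypothetical.

```python
def mse(actual, predicted):
    """Mean squared error between paired actual and predicted values."""
    return sum((x - y) ** 2 for x, y in zip(actual, predicted)) / len(actual)

def inverse_mse_weights(errors):
    """ASSUMED weighting: w_k proportional to 1 / M_k, normalized to sum to 1."""
    inverse = [1.0 / e for e in errors]       # requires every MSE to be > 0
    total = sum(inverse)
    return [v / total for v in inverse]

def fuse(predictions, weights):
    """Weighted sum of the individual model predictions."""
    return sum(p * w for p, w in zip(predictions, weights))

# Hypothetical MSEs M1, M2, M3 of the three models on recent data.
m1, m2, m3 = 1.0, 2.0, 4.0
w_t, w_g, w_s = inverse_mse_weights([m1, m2, m3])

# Hypothetical sales-quantity predictions from the three models.
print(fuse([100.0, 110.0, 90.0], [w_t, w_g, w_s]))
```

Recomputing the MSEs over a recent window as new actual values arrive is what would make the weights dynamic.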
[0047] Preferably, the parameters of the hybrid forecasting model are updated in real time by analyzing and calculating the relative deviation between actual and predicted sales data, including the following steps:
[0048] Establish a real-time monitoring mechanism, calculate the difference between actual sales data and predicted sales data, and then divide it by the predicted sales data to obtain the relative deviation value;
[0049] When the relative deviation exceeds the standard deviation of historical sales data, an online learning algorithm (a variant of stochastic gradient descent) is used to update the parameters of the hybrid prediction model in real time based on newly collected product data, production data, sales data, customer data, environmental data, and clothing image data.
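The monitoring rule can be sketched as follows. The sketch follows the text literally in comparing the relative deviation with the standard deviation of historical sales data; taking the absolute value of the deviation and using the population standard deviation are assumptions, and the function names are hypothetical. The parameter update itself is omitted.

```python
import statistics

def relative_deviation(actual, predicted):
    """(actual - predicted) / predicted, as described in [0048]."""
    return (actual - predicted) / predicted

def needs_update(actual, predicted, historical_sales):
    """True when the deviation exceeds the spread of historical sales data.

    ASSUMPTIONS: absolute deviation; population standard deviation.
    """
    threshold = statistics.pstdev(historical_sales)
    return abs(relative_deviation(actual, predicted)) > threshold

# Hypothetical values for one monitored sales figure.
history = [0.95, 1.00, 1.05, 1.10, 0.90]      # e.g. normalized historical sales
print(relative_deviation(120.0, 100.0))
print(needs_update(120.0, 100.0, history))
```

When `needs_update` returns true, the newly collected data would be passed to the online learning step that adjusts the hybrid model's parameters.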
[0050] Compared with the prior art, the beneficial effects of the present invention are:
[0051] This invention comprehensively collects data from the garment production and sales process and uses Apache Spark for distributed computing. This enables efficient processing of large-scale garment datasets and image data, shortening data processing time and improving data processing efficiency. It suits the garment industry's large data volumes and demanding processing requirements, and supports rapid, timely data mining and predictive analysis.
[0052] This invention uses a data mining module to conduct in-depth analysis of clothing datasets and image data using clustering algorithms and the Apriori algorithm. It uncovers potential relationships and influencing factors in clothing data, including: production impact, environmental impact, sales season impact, and clothing style impact, as well as the correlation between clothing data and image data.
[0053] This invention integrates a time-series prediction model, an association prediction model, and a visual prediction model using a dynamic weighted fusion method to obtain a hybrid prediction model. By using the hybrid prediction model to predict clothing sales, and by updating the model parameters in real time, it is possible to predict clothing sales more accurately.
Attached Figure Description
[0054] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0055] Figure 1 is a system module diagram of the present invention.
Detailed Implementation
[0056] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0057] Referring to Figure 1, the first aspect of this invention provides a clothing sales forecasting system based on big data analysis, comprising the following modules:
[0058] Data collection module: Collects product data and production data during the garment production process, and collects sales data, customer data, environmental data, and garment image data during the garment sales process;
[0059] Data processing module: Integrates collected product data, production data, sales data, customer data, and environmental data into a clothing dataset, and performs data cleaning and standardization on the integrated clothing dataset; filters and denoises the collected clothing image data and performs image enhancement.
[0060] Distributed computing processing module: By using Apache Spark, clothing datasets and clothing image data are imported into the Spark cluster for distributed computing processing;
[0061] Data Mining Module: Perform cluster analysis on the clothing dataset using clustering algorithms on Spark; conduct correlation analysis on clothing image data and sales data; use the Apriori algorithm to mine associations based on the preprocessed clothing dataset, and analyze and calculate the impact of production, environment, sales season, and clothing style.
[0062] Predictive Analysis Module: Constructs a hybrid forecasting model for predictive analysis; updates the hybrid forecasting model parameters in real time by calculating the relative deviation between actual and predicted sales data.
[0063] Specifically, product and production data are tracked and recorded in real time during garment production, and sales, customer, environmental, and garment image data are also recorded during the garment sales process. The collected product, production, sales, customer, and environmental data are integrated according to a unified format and standard to form a garment dataset. Data cleaning and standardization of the garment dataset include removing duplicate, erroneous, and incomplete data records, and standardizing the time format and numerical range of the data. Garment image data is filtered and denoised to remove noise and interference from the images. Image clarity and contrast are enhanced to facilitate subsequent analysis. The garment dataset and garment image data are imported into an Apache Spark cluster, leveraging Spark's distributed computing capabilities for rapid processing and analysis of large-scale data. Clustering algorithms on Spark are used to perform cluster analysis on the garment dataset to identify different categories of garments. Correlation analysis is performed on the garment image data and sales data to identify the relationship between image features and sales performance. Using the Apriori algorithm, association rules are mined from the clothing dataset to analyze the impact of production, environment, sales season, and clothing style on sales. The influence degrees of production, environment, sales season, and clothing style are calculated to provide a basis for decision-making. Based on the data mining analysis results, a hybrid prediction model is constructed to predict clothing sales data. By analyzing and calculating the relative deviation between actual and predicted sales data, the parameters of the hybrid prediction model are dynamically adjusted.
[0064] In one embodiment of the present invention, collecting product data and production data during the garment production process, and collecting sales data, customer data, environmental data, and garment image data during the garment sales process, includes the following steps:
[0065] Collect product data and production data during the garment production process. Product data includes: cost, fabric composition, style design and production process; production data includes: production time, production quantity, production equipment uptime, failure rate and production efficiency.
[0066] Collect sales data, customer data, environmental data, and clothing image data during the clothing sales process. Sales data includes: sales time, sales quantity, sales amount, and profit; customer data includes: number of customers, time spent in different clothing areas, number of times they tried on clothes, clothing style selection, return rate, and customer satisfaction; environmental data includes: temperature, humidity, wind speed, and light intensity.
[0067] Specifically, cost accounting tools are used to record the raw material costs, labor costs, transportation costs, and other related expenses for each garment. Based on information provided by suppliers or laboratory test results, the proportions of fabric components used in the garments, such as cotton, polyester, and wool, are recorded. Style design data, provided by the design team, covers garment pattern, color, and other design elements, and can be recorded through a digital design platform or in the form of drawings. Production process data is recorded during garment processing, including cutting, sewing, and ironing, together with the technical standards and process requirements used. RFID (radio frequency identification) technology is used to record the specific start and end times of garment production, and the number of garments produced is counted in real time. Sensors installed on production equipment monitor the equipment's operating status, runtime, and malfunctions, from which the uptime and failure rate are calculated. Combining production time and quantity, the production volume per unit time is calculated to evaluate production efficiency.
A POS system records sales time, sales quantity, and sales revenue. Combining sales revenue and cost data, the profit for each garment is calculated, and the quantity of each garment in inventory is updated and recorded in real time. Surveillance cameras or smart sensors installed in the sales area are used to count the number of customers and to record their dwell time in different clothing areas and the number of times they try on clothes. Customer preferences for clothing styles are collected through sales records and customer feedback. Customer return requests are recorded, and customer satisfaction information is collected through questionnaires or online reviews. Temperature and humidity sensors are installed in the sales area to monitor and record environmental temperature and humidity in real time; wind speed is recorded using anemometers or weather station data; and light sensors are installed to monitor the lighting levels in the sales area to ensure good display effects. Photographers or the design team capture and collect clothing image data, including details such as the style, color, and material of the clothing.
[0068] In one embodiment of the present invention, by using Apache Spark, the clothing dataset and clothing image data are imported into a Spark cluster for distributed computing processing, including the following steps:
[0069] Set up an Apache Spark cluster environment and convert the preprocessed product data, production data, sales data, customer data, and environmental data into Spark-supported DataFrame or Dataset formats; integrate TensorFlow on Spark for distributed processing of clothing image data;
[0070] By leveraging Spark's distributed storage and computing capabilities, clothing datasets and clothing image data are distributed and stored across different nodes in the cluster.
[0071] The genetic algorithm scheduler is integrated into Spark's task scheduling framework to dynamically allocate tasks to different Spark nodes. By monitoring the execution of task scheduling, performance data is collected, including task execution time, task waiting time, data transfer time, and resource utilization. Based on the performance data, an optimized task scheduling and allocation strategy is obtained through the iterative process of the genetic algorithm.
[0072] Using Spark's distributed computing capabilities, various data in the clothing dataset are processed in parallel; statistical indicators of various data in the clothing dataset are calculated, including: maximum value, minimum value, mean, median, and standard deviation.
[0073] Based on the preprocessed clothing image data, an image processing library is used to convert the clothing image data into tensor representations. Spark's distributed computing framework is used to perform parallel feature extraction on the clothing image data. A convolutional neural network model is built using a deep learning framework to automatically extract edge features, texture features, clothing color features, and style features from the clothing image data. The convolutional neural network model is trained using historical clothing image data, and the trained model is then used to automatically extract features from the real-time collected and preprocessed clothing image data, outputting the feature extraction results. The extracted feature results are then concatenated to generate a feature vector.
[0074] Specifically, download and configure Spark from the official Apache Spark website. In `spark-defaults.conf`, set the Spark master node address, serialization method, and memory configuration. In `spark-env.sh`, set environment variables such as `JAVA_HOME` and `HADOOP_HOME`. Add worker node information to the `workers` file. Use the `scp` command to distribute the Spark installation directory to other nodes in the cluster, and add Spark environment variables on all nodes. Preprocess product data, production data, sales data, customer data, and environmental data, including data cleaning, format conversion, and missing value handling. Convert the preprocessed data into Spark-supported DataFrame or Dataset formats for subsequent distributed computing. Install TensorFlow on each node of the Spark cluster and configure the TensorFlow on Spark environment to run on the Spark cluster. Utilize the distributed computing capabilities of TensorFlow on Spark to process clothing image data, such as image enhancement and image scaling. Distribute the clothing dataset and clothing image data across different nodes in the Spark cluster to achieve distributed data storage. Leveraging Spark's distributed computing capabilities, parallel processing of data stored in the cluster is performed to improve data processing efficiency. A genetic algorithm scheduler is integrated into Spark's task scheduling framework. The genetic algorithm scheduler dynamically allocates tasks to different Spark nodes to optimize task execution efficiency. The execution of task scheduling is monitored, and performance data is collected, including task execution time, task waiting time, data transfer time, and resource utilization. Based on the performance data, an optimized task scheduling allocation strategy is obtained through an iterative process using the genetic algorithm. 
Spark's distributed computing capabilities are also used to process the various data in the clothing dataset in parallel. Statistical indicators are calculated for each type of data in the clothing dataset, including the maximum, minimum, mean, median, and standard deviation. Based on the preprocessed clothing image data, the TensorFlow image processing library is used to convert the clothing image data into tensor representations. Parallel feature extraction is then performed on the clothing image data using Spark's distributed computing framework. A convolutional neural network model is built using a deep learning framework to automatically extract edge features, texture features, clothing color features, and style features from the clothing image data. The convolutional neural network model is trained using historical clothing image data. The trained model then extracts features from the real-time collected and preprocessed clothing images and outputs the feature extraction results, which are concatenated to generate a feature vector. Specifically, the extracted feature parameters include: color features, which use color histograms to describe the pixel proportion and distribution of different colors in the image and the visual weight of each color in the overall image, together with the color saturation and brightness of the clothing; texture features, which identify texture types such as cotton, silk, and denim, the coarseness of each texture type (its scale and graininess), and the direction and distribution of each texture in the clothing image, including the horizontal, vertical, or diagonal direction of stripes and the size and arrangement of checks; shape features, which cover the area of the outer contour of the garment and the areas of internal shapes such as patterns, pockets, collars, cuffs, and other components; and edge features, which capture edge sharpness and edge smoothness.
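The genetic algorithm scheduling idea described above can be illustrated with a small stand-alone sketch. It is not wired into Spark's scheduler, and it assumes a single performance measure (an estimated execution time per task) and a single objective (minimizing the finish time of the busiest node). The function names, task costs, and parameters such as `pop_size` and `mutation_rate` are hypothetical and chosen only for illustration.

```python
import random


def makespan(assignment, task_costs, n_nodes):
    """Finish time of the busiest node for a task -> node assignment."""
    loads = [0.0] * n_nodes
    for task, node in enumerate(assignment):
        loads[node] += task_costs[task]
    return max(loads)


def ga_schedule(task_costs, n_nodes, pop_size=30, generations=60,
                mutation_rate=0.1, seed=0):
    """Evolve a task -> node assignment with a low makespan (needs >= 2 tasks)."""
    rng = random.Random(seed)
    n_tasks = len(task_costs)

    def fitness(assignment):
        return makespan(assignment, task_costs, n_nodes)

    # Each individual is a list: position = task index, value = node index.
    population = [[rng.randrange(n_nodes) for _ in range(n_tasks)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)
        survivors = population[: pop_size // 2]      # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_tasks)          # one-point crossover
            child = p1[:cut] + p2[cut:]
            for i in range(n_tasks):                 # random mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(n_nodes)
            children.append(child)
        population = survivors + children
    return min(population, key=fitness)


# Hypothetical estimated execution times (seconds) for eight tasks.
task_costs = [4.0, 2.0, 7.0, 3.0, 5.0, 1.0, 6.0, 2.0]
best = ga_schedule(task_costs, n_nodes=3)
print(best, makespan(best, task_costs, 3))
```

In a fuller version, the fitness function would presumably also account for the waiting time, data transfer time, and resource utilization collected during monitoring.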
[0075] In one embodiment of the present invention, cluster analysis of a clothing dataset is performed using a clustering algorithm on Spark, including the following steps:
[0076] On Spark, K-Means is used as the clustering algorithm; the value of K is initially set using the elbow rule; K initial centroids are randomly selected; the distance from each data point to all centroids is calculated using the Euclidean distance method, and each data point is assigned to the cluster represented by the nearest centroid; the average value of all data points in each cluster is recalculated as the new centroid; the above steps are repeated until the centroids no longer change or the maximum number of iterations is reached.
[0077] Based on the above steps, cluster analysis is performed on the clothing dataset to obtain clusters for product data, production data, sales data, customer data, and environmental data. The Spark MLlib library is used to calculate the silhouette coefficient of each data point, the average silhouette coefficient of the entire clothing dataset, and the average silhouette coefficient of the data points in each cluster.
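As a minimal sketch of the silhouette calculation, the following plain-Python function computes the coefficient of each point as s = (b - a) / max(a, b), where a is the mean distance to the other points in the same cluster and b is the smallest mean distance to the points of any other cluster. It assumes Euclidean distance and at least two clusters; the sample points and labels are invented for illustration. On a Spark cluster, `ClusteringEvaluator` from `pyspark.ml.evaluation` reports an average silhouette score rather than per-point values.

```python
import math


def silhouette_scores(points, labels):
    """Per-point silhouette coefficients (Euclidean distance, >= 2 clusters)."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:                      # singleton cluster: score set to 0
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


# Hypothetical two-dimensional data points and cluster labels.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels = [0, 0, 1, 1]
scores = silhouette_scores(points, labels)

overall_mean = sum(scores) / len(scores)
per_cluster_mean = {
    lab: sum(s for s, l in zip(scores, labels) if l == lab) / labels.count(lab)
    for lab in set(labels)
}
print(scores, overall_mean, per_cluster_mean)
```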
[0078] Specifically, a properly configured Spark cluster, including a Master node and multiple Worker nodes, is required. The clothing dataset is uploaded to HDFS or another distributed storage system so that Spark can access and process it efficiently, and a dependency on MLlib, Spark's machine learning library, is added to the Spark project. First, a suitable K value is determined by plotting the sum of squared errors (SSE) for different K values and observing the elbow position. The elbow is the point at which the SSE stops dropping sharply and begins to flatten out; the K value at this point is taken as the number of clusters. K data points are randomly selected from the clothing dataset as initial centroids. For each data point in the clothing dataset, the Euclidean distance to all K centroids is calculated, and the point is assigned to the cluster represented by its nearest centroid. For each cluster, the mean of all data points within it is calculated and used as the new centroid. These steps are repeated until the centroids no longer change significantly or the preset maximum number of iterations is reached. Based on the clustering results, the clothing dataset is divided into clusters for product data, production data, sales data, customer data, and environmental data. The quality of the clustering results is evaluated using silhouette coefficients from the Spark MLlib library. The silhouette coefficient measures how similar a data point is to the other points in its own cluster compared with the points in other clusters. The silhouette coefficient of each data point, the average silhouette coefficient of the entire clothing dataset, and the average silhouette coefficient of the data points in each cluster are calculated.
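The K-Means loop and the elbow search can be sketched in plain Python as follows. This is an illustrative single-machine version, not the distributed job; on a cluster the same role would typically be played by `KMeans` from `pyspark.ml.clustering`. The sample points, the fixed random seed, and the range of K values are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels, SSE)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # K random initial centroids
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Recompute each centroid as the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(centroids[c])   # keep an empty cluster's centroid
        if new_centroids == centroids:               # centroids stopped changing
            break
        centroids = new_centroids
    sse = sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))
    return centroids, labels, sse


# Hypothetical data: three visually separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

# SSE for each candidate K; the elbow is read off where the curve flattens.
sse_by_k = {k: kmeans(points, k)[2] for k in range(1, 6)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

Because the initial centroids are random, a production run would usually repeat each K with several seeds and keep the lowest SSE before looking for the elbow.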
[0079] In one embodiment of the present invention, correlation analysis is performed on clothing image data and sales data, including the following steps:
[0080] Based on the convolutional neural network model on Spark, features are automatically extracted and feature vectors are generated from the collected and preprocessed clothing image data. Correlation analysis is then performed using the Spark MLlib library, and the correlation coefficient between the feature vectors and sales data is calculated using the following formula:
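[0081] $$G = \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i} (X_i - \bar{X})^2}\,\sqrt{\sum_{i} (Y_i - \bar{Y})^2}}$$
[0082] Where $G$ represents the correlation coefficient, $X_i$ is the i-th feature value in the feature vector, $Y_i$ is the i-th sales data value, and $\bar{X}$ and $\bar{Y}$ represent the mean of all feature values in the feature vector and the mean of all sales data, respectively.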
[0083] Specifically, a convolutional neural network (CNN) model on Spark is used to extract features from the preprocessed clothing image data. The CNN model automatically learns and extracts features from the image through convolutional layers, pooling layers, and other structures, including color, edges, texture, and shape. After feature extraction, the image data is converted into feature vectors for subsequent correlation analysis. The correlation analysis tools provided by the Spark MLlib library are used to perform correlation analysis between the feature vectors and sales data. By substituting the extracted clothing image features and sales data values into the correlation coefficient formula between the feature vectors and sales data, the correlation coefficient is obtained. The correlation coefficient measures the degree of linear correlation between the feature vectors and sales data. The closer the correlation coefficient is to 1 or -1, the stronger the correlation; the closer the correlation coefficient is to 0, the weaker the correlation.
[0084] Based on the calculated correlation coefficients, we can determine which clothing image features have a significant impact on sales data.
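A minimal sketch of the correlation step, assuming the coefficient is the standard Pearson correlation between one extracted image feature and the matching sales values. The feature values and sales figures below are invented for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical example: color saturation of five garments vs. units sold.
saturation = [0.2, 0.4, 0.5, 0.7, 0.9]
units_sold = [110, 150, 160, 210, 240]
print(pearson(saturation, units_sold))
```

A value near 1 or -1 would mark the feature as strongly related to sales, while a value near 0 would suggest it can be given less weight.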
[0085] In one embodiment of the present invention, the Apriori algorithm is used to mine associations based on the preprocessed clothing dataset, and the production impact, environmental impact, sales season impact, and clothing style impact are analyzed and calculated, including the following steps:
[0086] The minimum support threshold is set to 0.1 and the minimum confidence threshold is set to 0.7. On the clothing dataset, the Apriori algorithm in association rule mining is used to generate frequent itemsets. The frequent itemsets are obtained by filtering with support greater than or equal to the minimum support threshold.
[0087] Feature filtering is performed on frequent itemsets generated by the Apriori algorithm to select itemsets that contain production data features, sales season features, environmental data features, clothing style features, and sales data features.
[0088] Association rules are generated from the frequent itemsets, and support, confidence, and lift are calculated for each generated association rule. Association rules with confidence greater than or equal to the set minimum confidence threshold are selected.
[0089] The production impact is obtained by multiplying, for every association rule between production data features and sales data features, the support of the rule by the lift of the production data features on the sales data features, and then summing the products.
[0090] The environmental impact is obtained by multiplying, for every association rule between environmental data features and sales data features, the support of the rule by the lift of the environmental data features on the sales data features, and then summing the products.
[0091] The sales season impact is obtained by multiplying, for every association rule between sales season features and sales data features, the support of the rule by the lift of the sales season features on the sales data features, and then summing the products.
[0092] The clothing style impact is obtained by multiplying, for every association rule between clothing style features and sales data features, the support of the rule by the lift of the clothing style features on the sales data features, and then summing the products.
[0093] Specifically, the minimum support threshold is set to 0.1. This threshold is used to filter frequent itemsets; only itemsets with a support greater than or equal to this threshold are retained. The minimum confidence threshold is set to 0.7. This threshold is used to filter association rules; only rules with a confidence level greater than or equal to this threshold are considered valid rules. The minimum support and minimum confidence thresholds can be adjusted through experiments or practical considerations. An initial value can be set, and the algorithm can be run to observe the quantity and quality of the resulting frequent itemsets and association rules. If there are too many frequent itemsets, the support threshold can be appropriately increased; if there are too few association rules, the confidence threshold can be appropriately decreased. The Apriori algorithm is applied to the preprocessed clothing dataset to generate frequent itemsets. Frequent itemsets are generated iteratively, starting with a frequent 1-itemset and progressively generating higher-order frequent itemsets. From the generated frequent itemsets, itemsets containing production data features, sales season features, environmental data features, clothing style features, and sales data features are selected. The feature parameters for production data features, environmental data features, clothing style features, and sales data features are the collected production data, environmental data, clothing style data, and sales data, respectively. The sales season feature divides clothing sales into spring, summer, autumn, and winter based on the time points of the four seasons. Association rules are generated from the frequent itemset, and the support, confidence, and lift of each association rule are calculated. Association rules with a confidence level greater than or equal to a set minimum confidence threshold are selected. 
For all association rules containing both production data features and sales data features, the support and lift of each rule are multiplied, and these products are summed to obtain the production impact score, which reflects the degree of influence of production factors on sales data. Similarly, for all association rules containing both environmental data features and sales data features, the support and lift of each rule are multiplied, and these products are summed to obtain the environmental impact score, which reflects the degree of influence of environmental factors on sales data. For all association rules containing both sales season features and sales data features, the support and lift of each rule are multiplied, and these products are summed to obtain the sales season impact score, which reflects the degree of influence of the sales season on sales data. Likewise, for all association rules containing both clothing style features and sales data features, the support and lift of each rule are multiplied, and these products are summed to obtain the clothing style impact score, which reflects the degree of influence of clothing style on sales data. Based on the calculated production impact score, environmental impact score, sales season impact score, and clothing style impact score, the degree of influence of the different factors on sales data is interpreted.
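The association mining and impact scoring can be sketched as follows. This is a simplified illustration: frequent itemsets are generated level by level in the Apriori manner, but only rules with one factor item as antecedent and one sales item as consequent are scored. The item naming scheme (`season:`, `style:`, `sales:` prefixes) and the eight transactions are hypothetical.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search; returns {itemset: support}."""
    n = len(transactions)
    support = {}
    current = {frozenset([item]) for t in transactions for item in t}
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        support.update(level)
        # Join frequent k-itemsets that differ by one item into (k+1)-candidates.
        current = {a | b for a, b in combinations(list(level), 2)
                   if len(a | b) == len(a) + 1}
    return support


def impact_score(support, factor_prefix, sales_prefix, min_confidence):
    """Sum of support * lift over rules {factor item} -> {sales item}."""
    score = 0.0
    for itemset, supp in support.items():
        if len(itemset) != 2:
            continue
        factor = [i for i in itemset if i.startswith(factor_prefix)]
        sales = [i for i in itemset if i.startswith(sales_prefix)]
        if len(factor) != 1 or len(sales) != 1:
            continue
        confidence = supp / support[frozenset(factor)]
        lift = confidence / support[frozenset(sales)]
        if confidence >= min_confidence:
            score += supp * lift
    return score


# Hypothetical transactions: one set of discretized features per sales record.
transactions = [frozenset(t) for t in [
    {"season:summer", "style:dress", "sales:high"},
    {"season:summer", "style:dress", "sales:high"},
    {"season:summer", "style:dress", "sales:high"},
    {"season:summer", "style:coat", "sales:low"},
    {"season:winter", "style:coat", "sales:high"},
    {"season:winter", "style:coat", "sales:high"},
    {"season:winter", "style:dress", "sales:low"},
    {"season:winter", "style:coat", "sales:low"},
]]

support = frequent_itemsets(transactions, min_support=0.1)
season_impact = impact_score(support, "season:", "sales:", min_confidence=0.7)
style_impact = impact_score(support, "style:", "sales:", min_confidence=0.7)
print(season_impact, style_impact)
```

Spark MLlib itself ships FP-Growth (`pyspark.ml.fpm.FPGrowth`) rather than Apriori, so a distributed version would either use that or implement the level-wise search on top of Spark.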
[0094] In one embodiment of the present invention, a hybrid prediction model is constructed for predictive analysis, including the following steps:
[0095] After collecting and preprocessing historical clothing datasets and historical clothing image data, we use distributed computing and data mining analysis to obtain statistical indicators, cluster analysis results, correlation analysis results, and association rule results for various data in the historical clothing dataset.
[0096] Based on the various analytical results obtained above, a training dataset is constructed. The training dataset includes: statistical indicators of various data in the historical clothing dataset; clusters of product data, production data, sales data, customer data, and environmental data, as well as the silhouette coefficient of each data point, the average silhouette coefficient of the entire clothing dataset, and the average silhouette coefficient of data points in each cluster; the correlation coefficient between historical clothing image feature vectors and sales data; and the influence of production, environment, sales season, and clothing style.
[0097] Long Short-Term Memory (LSTM) networks are used as the time-series prediction model for analysis and prediction. The training dataset is arranged chronologically and divided into multiple time windows, each containing sales data for a specific time period. Within each time window, the sales quantity, sales revenue, and profit are labeled. The labeled training dataset is input into the time-series prediction model for training. Real-time collected and preprocessed clothing datasets and clothing image data are input into the time-series prediction model, which outputs the predicted sales data values: sales quantity, sales revenue, and profit.
[0098] The training dataset is input into the association prediction model for training; the real-time collected and preprocessed clothing dataset and clothing image data are processed by distributed computing and data mining analysis to obtain the real-time training dataset, and the real-time training dataset is input into the association prediction model. The output prediction values are: purchase intention, purchase frequency, repurchase rate, sales quantity, sales amount and profit.
[0099] The clothing categories are labeled as: shirts, skirts, pants, and coats; the labeled training dataset is input into the visual prediction model for training; the real-time collected and preprocessed clothing dataset and clothing image data are input into the visual prediction model, and the predicted categories of the clothing image data are output, with the predicted values being the probability of shirts, skirts, pants, and coats.
[0100] For the predicted values output by the trained time-series prediction model, association prediction model, and visual prediction model, the corresponding actual values are collected in real time.
[0101] A hybrid prediction model is obtained by integrating the time-series prediction model, the association prediction model, and the visual prediction model through a dynamic weighted fusion method. The weight of each prediction model is calculated from its mean squared error (MSE). The MSE is calculated as follows: $$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$$ where $N$ represents the total number of historical clothing datasets and images collected, $x_i$ is the actual value, and $y_i$ is the predicted value;
[0102] The weights of the time-series prediction model, the association prediction model, and the visual prediction model are each calculated by a weight formula expressed in terms of the mean squared errors of the three models (the formulas themselves are not reproduced in this text);
[0103] This yields the weight $w_t$ of the time-series prediction model, the weight $w_g$ of the association prediction model, and the weight $w_s$ of the visual prediction model, where $M_1$, $M_2$, and $M_3$ are the mean squared errors of the time-series prediction model, the association prediction model, and the visual prediction model, respectively;
[0104] The predicted values of the time-series prediction model, the association prediction model, and the visual prediction model are each multiplied by the corresponding model weight, and the products are summed to obtain the predicted value of the hybrid prediction model.
[0105] The real-time collected and preprocessed clothing dataset and clothing image data are input into the trained time-series prediction model, association prediction model, and visual prediction model, respectively. The predicted values of the three trained models are then input into the hybrid prediction model to obtain its prediction results, including sales quantity, sales revenue, and profit.
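The dynamic weighted fusion can be sketched as below. The weight formulas are not available in this text, so the sketch assumes one common choice: each weight is proportional to the inverse of the model's MSE and the three weights are normalized to sum to 1. The actual formulas of the invention may differ. All numeric values are invented, and each model is assumed to supply a sales estimate on the same scale.

```python
def mse(actual, predicted):
    """Mean squared error between actual and predicted values."""
    return sum((x - y) ** 2 for x, y in zip(actual, predicted)) / len(actual)


def inverse_mse_weights(errors):
    """ASSUMPTION: weight_k is proportional to 1 / MSE_k, normalized to sum to 1.

    Every MSE must be positive; a zero MSE would need special handling.
    """
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuse(predictions, weights):
    """Weighted sum of the individual model predictions."""
    return sum(w * p for w, p in zip(weights, predictions))


# Hypothetical historical actual values and per-model predictions.
actual = [100.0, 120.0, 90.0]
history = {
    "time_series": [98.0, 123.0, 91.0],
    "association": [105.0, 115.0, 95.0],
    "visual": [110.0, 130.0, 80.0],
}
errors = [mse(actual, predicted) for predicted in history.values()]   # M1, M2, M3
weights = inverse_mse_weights(errors)                                  # w_t, w_g, w_s

# Hypothetical new predictions from the three trained models.
latest = [102.0, 108.0, 115.0]
fused = fuse(latest, weights)
print(weights, fused)
```

Recomputing the weights whenever new actual values arrive is what would make the fusion dynamic: a model whose recent error grows loses influence on the fused prediction.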
[0106] Specifically, information is collected from historical clothing datasets and historical clothing image data, including but not limited to product data, production data, sales data, customer data, and environmental data. Preprocessing steps such as data cleaning, noise reduction, and normalization are performed to ensure data quality. Statistical indicators of the historical clothing dataset are calculated, such as the mean, standard deviation, maximum, and minimum. Clustering algorithms are used to cluster product data, production data, sales data, customer data, and environmental data, and the silhouette coefficient of each data point and the average silhouette coefficient of the entire dataset are calculated. The correlation coefficient between the feature vectors of historical clothing images and sales data is calculated. Association rules are mined from the historical data to obtain the influence of production, environment, sales season, and clothing style. A Long Short-Term Memory (LSTM) network is used as the time-series prediction model. The training dataset is arranged chronologically and divided into multiple time windows, with sales quantity, sales revenue, and profit labeled within each time window. The labeled training dataset is input into the LSTM network for training. Distributed computing and data mining techniques are used to process real-time data to obtain a real-time training dataset, which is then input into the association prediction model for training. The clothing categories are labeled, including but not limited to shirts, skirts, pants, and coats. The labeled training dataset is input into the convolutional neural network of the visual prediction model for training. The mean squared error of each model is calculated from the actual values in the historical dataset and the model's predicted values. The weights of the time-series prediction model, the association prediction model, and the visual prediction model are calculated from these mean squared errors. The predicted value of each model is multiplied by its corresponding weight, and the products are summed to obtain the predicted value of the hybrid prediction model. The real-time collected and preprocessed clothing dataset and clothing image data are input into each model, and each model outputs its predicted values. These predicted values are input into the hybrid prediction model to obtain the final prediction results, including but not limited to sales quantity, sales revenue, and profit.
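The time-window preparation for the LSTM can be sketched as follows. It assumes overlapping sliding windows in which each window of past sales values is labeled with the value that follows it; the text does not state whether windows overlap or how far ahead the label lies, so `window` and `horizon` are hypothetical parameters and the daily sales figures are invented.

```python
def make_windows(series, window, horizon=1):
    """Split a chronological series into (inputs, target) training pairs.

    inputs: `window` consecutive values; target: the value `horizon` steps
    after the end of the window.
    """
    samples = []
    for start in range(len(series) - window - horizon + 1):
        inputs = series[start:start + window]
        target = series[start + window + horizon - 1]
        samples.append((inputs, target))
    return samples


# Hypothetical daily sales quantities in chronological order.
daily_sales = [10, 12, 15, 11, 14, 18]
for inputs, target in make_windows(daily_sales, window=3):
    print(inputs, "->", target)
```

The same function can be applied separately to sales quantity, sales revenue, and profit, or to tuples of all three, before the samples are passed to the LSTM.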
[0107] In one embodiment of the present invention, the parameters of the hybrid forecasting model are updated in real time by analyzing and calculating the relative deviation between actual and predicted sales data, including the following steps:
[0108] Establish a real-time monitoring mechanism, calculate the difference between actual sales data and predicted sales data, and then divide it by the predicted sales data to obtain the relative deviation value;
[0109] When the relative deviation exceeds the standard deviation of historical sales data, a variant of stochastic gradient descent in the online learning algorithm is used to update the parameters of the hybrid prediction model in real time based on newly collected product data, production data, sales data, customer data, environmental data, and clothing image data.
[0110] Specifically, the relative deviation value is obtained by calculating the difference between actual sales data and predicted sales data, and then dividing it by the predicted sales data. This relative deviation value reflects the degree of difference between the predicted and actual sales data. A threshold is set, which can be determined based on the standard deviation of historical sales data. Standard deviation is an important indicator of data volatility; when the relative deviation value exceeds this threshold, it means that the hybrid prediction model needs to be updated. By monitoring the relative deviation value in real time, the model update mechanism is triggered when it exceeds the threshold. In this embodiment, a variant of stochastic gradient descent in online learning algorithms is selected to update the parameters of the hybrid prediction model in real time. Stochastic gradient descent is an optimization algorithm suitable for large-scale datasets and online learning scenarios. When the relative deviation value exceeds the threshold, new product data, production data, sales data, customer data, environmental data, and clothing image data are collected. This data is integrated into the model to provide rich information to support model updates. Using a variant of the stochastic gradient descent algorithm, the parameters of the hybrid prediction model are updated in real time based on the newly collected data, and the prediction error is reduced by continuously adjusting the model parameters.
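A minimal sketch of the monitoring and update loop follows. It assumes that the magnitude of the relative deviation is compared with the threshold, that the updated parameters are the fusion weights of a linear combiner, and that plain stochastic gradient descent on squared error stands in for the unspecified SGD variant. All numbers (sales in thousands of units, the learning rate) are hypothetical.

```python
import statistics


def relative_deviation(actual, predicted):
    """(actual - predicted) / predicted, as described in the text."""
    return (actual - predicted) / predicted


def needs_update(actual, predicted, threshold):
    """ASSUMPTION: the magnitude of the deviation is compared with the threshold."""
    return abs(relative_deviation(actual, predicted)) > threshold


def predict(weights, inputs):
    """Linear combiner: weighted sum of the individual model outputs."""
    return sum(w * x for w, x in zip(weights, inputs))


def sgd_step(weights, inputs, actual, lr=0.1):
    """One stochastic gradient descent step on the squared error."""
    grad_scale = 2.0 * (predict(weights, inputs) - actual)
    return [w - lr * grad_scale * x for w, x in zip(weights, inputs)]


# Threshold taken from the standard deviation of historical sales data.
historical_sales = [1.0, 1.1, 0.9, 1.2, 1.0]
threshold = statistics.pstdev(historical_sales)

weights = [0.8, 0.15, 0.05]      # current fusion weights
inputs = [0.98, 1.05, 1.10]      # latest outputs of the three models
actual = 1.20                    # newly observed sales value

before = predict(weights, inputs)
if needs_update(actual, before, threshold):
    weights = sgd_step(weights, inputs, actual)
after = predict(weights, inputs)
print(before, after, weights)
```

Updating only when the deviation crosses the threshold keeps the model stable during normal fluctuations and limits retraining cost to periods when predictions drift.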
[0111] The above embodiments are only used to illustrate the technical methods of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical methods of the present invention without departing from the spirit and scope of the technical methods of the present invention.
Claims
1. A clothing sales forecasting system based on big data analysis, characterized in that, Includes the following modules: Data collection module: Collects product data and production data during the garment production process, and collects sales data, customer data, environmental data, and garment image data during the garment sales process; Data processing module: Integrates collected product data, production data, sales data, customer data, and environmental data into a clothing dataset, and performs data cleaning and standardization on the integrated clothing dataset; The collected clothing image data is filtered, denoised, and enhanced. Distributed computing processing module: By using Apache Spark, clothing datasets and clothing image data are imported into the Spark cluster for distributed computing processing; Data Mining Module: Perform cluster analysis on the clothing dataset using clustering algorithms on Spark; conduct correlation analysis on clothing image data and sales data; use the Apriori algorithm to mine associations based on the preprocessed clothing dataset, and analyze and calculate the impact of production, environment, sales season, and clothing style. Predictive Analysis Module: Constructs a hybrid forecasting model for predictive analysis; updates the hybrid forecasting model parameters in real time by calculating the relative deviation between actual and predicted sales data; Collect product and production data during the garment production process, and collect sales, customer, environmental, and garment image data during the garment sales process, including the following steps: Collect product data and production data during the garment production process. Product data includes: cost, fabric composition, style design and production process; production data includes: production time, production quantity, production equipment uptime, failure rate and production efficiency. 
Collect sales data, customer data, environmental data, and clothing image data during the clothing sales process. Sales data includes: sales time, sales quantity, sales amount, and profit; customer data includes: number of customers, time spent in different clothing areas, number of times they tried on clothes, clothing style selection, return rate, and customer satisfaction; environmental data includes: temperature, humidity, wind speed, and light intensity. Constructing a hybrid prediction model for predictive analysis includes the following steps: After historical clothing datasets and historical clothing image data are collected and preprocessed, distributed computing and data mining analysis are used to obtain statistical indicators, cluster analysis results, correlation analysis results, and association rule results for various data in the historical clothing dataset. Based on the various analytical results obtained above, a training dataset is constructed. The training dataset includes: statistical indicators of various data in the historical clothing dataset; clusters of product data, production data, sales data, customer data, and environmental data, as well as the silhouette coefficient of each data point, the average silhouette coefficient of the entire clothing dataset, and the average silhouette coefficient of data points in each cluster; the correlation coefficient between historical clothing image feature vectors and sales data; and the influence of production, environment, sales season, and clothing style. Long Short-Term Memory (LSTM) networks are used as the time-series prediction model for analysis and prediction. The training dataset is arranged chronologically and divided into multiple time windows, each containing sales data for a specific time period. Within each time window, the sales quantity, sales revenue, and profit are labeled. The labeled training dataset is input into the time-series prediction model for training.
Real-time collected and preprocessed clothing datasets and clothing image data are input into the time-series prediction model, which outputs the predicted sales data values: sales quantity, sales revenue, and profit. The training dataset is input into the association prediction model for training; the real-time collected and preprocessed clothing dataset and clothing image data are processed by distributed computing and data mining analysis to obtain the real-time training dataset, and the real-time training dataset is input into the association prediction model. The output prediction values are: purchase intention, purchase frequency, repurchase rate, sales quantity, sales amount and profit. The clothing categories are labeled as: shirts, skirts, pants, and coats; the labeled training dataset is input into the visual prediction model for training; the real-time collected and preprocessed clothing dataset and clothing image data are input into the visual prediction model, and the predicted categories of the clothing image data are output, with the predicted values being the probability of shirts, skirts, pants, and coats. For the predicted values output by the trained time-series prediction model, association prediction model, and visual prediction model, the corresponding actual values are collected in real time. A hybrid prediction model is obtained by integrating the time-series prediction model, the association prediction model, and the visual prediction model through a dynamic weighted fusion method. The weight of each prediction model is calculated from its mean squared error (MSE). The MSE is calculated as follows: $$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$$ where $N$ represents the total number of historical clothing datasets and images collected, $x_i$ is the actual value, and $y_i$ is the predicted value; The weights of the time-series prediction model, the association prediction model, and the visual prediction model are each calculated by a weight formula expressed in terms of the mean squared errors of the three models (the formulas themselves are not reproduced in this text); This yields the weight $w_t$ of the time-series prediction model, the weight $w_g$ of the association prediction model, and the weight $w_s$ of the visual prediction model, where $M_1$, $M_2$, and $M_3$ are the mean squared errors of the time-series prediction model, the association prediction model, and the visual prediction model, respectively. The predicted values of the time-series prediction model, the association prediction model, and the visual prediction model are each multiplied by the corresponding model weight, and the products are summed to obtain the predicted value of the hybrid prediction model. The real-time collected and preprocessed clothing dataset and clothing image data are input into the trained time-series prediction model, association prediction model, and visual prediction model, respectively. The predicted values of the three trained models are then input into the hybrid prediction model to obtain its prediction results, including sales quantity, sales revenue, and profit.
2. The apparel sales forecasting system based on big data analysis according to claim 1, characterized in that, By using Apache Spark, the clothing dataset and clothing image data are imported into a Spark cluster for distributed computing processing, including the following steps: Set up an Apache Spark cluster environment to convert preprocessed product data, production data, sales data, customer data, and environmental data into Spark-supported DataFrame or Dataset formats; integrate TensorFlow on Spark for distributed processing of clothing image data; By leveraging Spark's distributed storage and computing capabilities, clothing datasets and clothing image data are distributed and stored across different nodes in the cluster. The genetic algorithm scheduler is integrated into Spark's task scheduling framework to dynamically allocate tasks to different Spark nodes. By monitoring the execution of task scheduling, performance data is collected, including task execution time, task waiting time, data transfer time, and resource utilization. Based on the performance data, an optimized task scheduling and allocation strategy is obtained through the iterative process of the genetic algorithm. Using Spark's distributed computing capabilities, various data in the clothing dataset are processed in parallel; statistical indicators of various data in the clothing dataset are calculated, including: maximum value, minimum value, mean, median, and standard deviation. Based on the preprocessed clothing image data, an image processing library is used to convert the clothing image data into tensor representations. Spark's distributed computing framework is used to perform parallel feature extraction on the clothing image data. A convolutional neural network model is built using a deep learning framework to automatically extract edge features, texture features, clothing color features, and style features from the clothing image data. 
The convolutional neural network model is trained using historical clothing image data, and the trained model is then used to automatically extract features from the real-time acquired and preprocessed clothing image data, outputting the feature extraction results. The extracted feature results are then concatenated to generate a feature vector.
3. The apparel sales forecasting system based on big data analysis according to claim 2, characterized in that, Cluster analysis of the clothing dataset was performed using clustering algorithms on Spark, including the following steps: On Spark, K-Means is used as the clustering algorithm; the value of K is initially set using the elbow rule; K initial centroids are randomly selected; the distance from each data point to all centroids is calculated using the Euclidean distance method, and each data point is assigned to the cluster represented by the nearest centroid; the average value of all data points in each cluster is recalculated as the new centroid; the above steps are repeated until the centroids no longer change or the maximum number of iterations is reached. Based on the above steps, cluster analysis is performed on the clothing dataset to obtain clusters for product data, production data, sales data, customer data, and environmental data. The Spark MLlib library is used to calculate the silhouette coefficient of each data point, the average silhouette coefficient of the entire clothing dataset, and the average silhouette coefficient of the data points in each cluster.
4. The apparel sales forecasting system based on big data analysis according to claim 3, characterized in that, Correlation analysis of clothing image data and sales data includes the following steps: Based on the convolutional neural network model on Spark, features are automatically extracted and feature vectors are generated from the collected and preprocessed clothing image data. Correlation analysis is then performed using the Spark MLlib library, and the correlation coefficient between the feature vectors and sales data is calculated using the following formula: $$G = \frac{\sum_{i} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i} (X_i - \bar{X})^2}\,\sqrt{\sum_{i} (Y_i - \bar{Y})^2}}$$ Where $G$ represents the correlation coefficient, $X_i$ is the i-th feature value in the feature vector, $Y_i$ is the i-th sales data value, and $\bar{X}$ and $\bar{Y}$ represent the mean of all feature values in the feature vector and the mean of all sales data, respectively.
5. The apparel sales forecasting system based on big data analysis according to claim 1, characterized in that the Apriori algorithm is used to mine associations based on the preprocessed clothing dataset, and the influence of production, environment, sales season, and clothing style is analyzed and calculated, including the following steps: The minimum support threshold is set to 0.1 and the minimum confidence threshold is set to 0.7. On the clothing dataset, the Apriori algorithm in association rule mining is used to generate frequent itemsets, which are obtained by keeping itemsets whose support is greater than or equal to the minimum support threshold. Feature filtering is performed on the frequent itemsets generated by the Apriori algorithm to select itemsets that contain production data features, sales season features, environmental data features, clothing style features, and sales data features. Association rules are generated from the frequent itemsets, and support, confidence, and lift are calculated for each generated association rule. Association rules with confidence greater than or equal to the minimum confidence threshold are selected. The production influence is obtained by multiplying the support of each pairing of production data features and sales data features by the lift of the production data features on the sales data features, and then summing the products. The environmental influence is obtained by multiplying the support of each pairing of environmental data features and sales data features by the lift of the environmental data features on the sales data features, and then summing the products. The sales season influence is obtained by multiplying the support of each pairing of sales season features and sales data features by the lift of the sales season features on the sales data features, and then summing the products. The clothing style influence is obtained by multiplying the support of each pairing of clothing style features and sales data features by the lift of the clothing style features on the sales data features, and then summing the products.
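The influence calculation in claim 5 (support multiplied by lift, summed over the selected rules) can be sketched as follows. This is a hedged illustration only: it does not implement Apriori itemset generation and instead takes candidate rules as given. It also interprets "support of the features and sales data features" as the support of the rule's combined itemset, which is an assumption. The function names, item labels, and transactions are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent,
    or None when either side never occurs."""
    s_a = support(transactions, antecedent)
    s_c = support(transactions, consequent)
    if s_a == 0 or s_c == 0:
        return None
    s_both = support(transactions, set(antecedent) | set(consequent))
    confidence = s_both / s_a
    lift = confidence / s_c
    return s_both, confidence, lift


def influence(transactions, rules, min_support=0.1, min_confidence=0.7):
    """Sum of support * lift over rules that pass both thresholds.
    The default thresholds are the 0.1 and 0.7 values stated in the claim."""
    total = 0.0
    for antecedent, consequent in rules:
        metrics = rule_metrics(transactions, antecedent, consequent)
        if metrics is None:
            continue
        rule_support, confidence, lift = metrics
        if rule_support >= min_support and confidence >= min_confidence:
            total += rule_support * lift
    return total


# Hypothetical transactions: one sales-season feature and one sales feature each.
transactions = [
    frozenset({"season=summer", "sales=high"}),
    frozenset({"season=summer", "sales=high"}),
    frozenset({"season=summer", "sales=high"}),
    frozenset({"season=winter", "sales=low"}),
]
# Candidate rules from sales-season features to sales features (assumed given).
season_rules = [
    ({"season=summer"}, {"sales=high"}),
    ({"season=winter"}, {"sales=high"}),
]
season_influence = influence(transactions, season_rules)
```

The same `influence` function would be called once per feature category (production, environment, sales season, clothing style), each time with the rules whose antecedents come from that category.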
6. The apparel sales forecasting system based on big data analysis according to claim 1, characterized in that the parameters of the hybrid prediction model are updated in real time by calculating the relative deviation between actual and predicted sales data, including the following steps: A real-time monitoring mechanism is established in which the difference between the actual sales data and the predicted sales data is calculated and then divided by the predicted sales data to obtain the relative deviation; when the relative deviation exceeds the standard deviation of historical sales data, a variant of stochastic gradient descent from the online learning algorithms is used to update the parameters of the hybrid prediction model in real time based on newly collected product data, production data, sales data, customer data, environmental data, and clothing image data.
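The monitoring trigger in claim 6 can be sketched as a small check. Several details here are assumptions not stated in the claim: the absolute value of the relative deviation is compared, the sample standard deviation is used, and the historical sales values are on a scale comparable to a relative deviation (for example, normalized). The function names and numbers are hypothetical, and the stochastic gradient descent update itself is not shown.

```python
import statistics


def relative_deviation(actual, predicted):
    """(actual - predicted) / predicted, as described in the claim."""
    if predicted == 0:
        raise ValueError("predicted sales must be non-zero")
    return (actual - predicted) / predicted


def needs_update(actual, predicted, historical_sales):
    """True when the relative deviation exceeds the standard deviation of
    historical sales data, which is the update condition in the claim."""
    # Assumption: sample standard deviation and absolute deviation are used.
    threshold = statistics.stdev(historical_sales)
    return abs(relative_deviation(actual, predicted)) > threshold


# Hypothetical normalized sales history with a standard deviation of about 0.1.
history = [0.9, 1.0, 1.1]
large_miss = needs_update(actual=130.0, predicted=100.0, historical_sales=history)
small_miss = needs_update(actual=105.0, predicted=100.0, historical_sales=history)
```

In this sketch a 30% miss triggers a parameter update and a 5% miss does not; an online learning step would run only when `needs_update` returns `True`.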