[0043] Refer to the attached drawings.
[0044] The download forecast method of the present invention includes the following steps:
[0045] 1) Obtain the historical data of the app to be predicted from the back-end data, including the app's downloads over the known m days;
[0046] 2) Source data processing. Process the data from step 1) to generate a discrete time series x of length L that represents the download curve of each app. The download curves of all apps then form a discrete time series training data set.
[0047] Step 2) specifically includes the following steps:
[0048] (1) Set the download threshold thr; thr is a manually specified parameter with default thr=0.1. Once thr is fixed, $L_1$ and $L_2$ for the entire training data set can be calculated. In general, thr is adjusted so that $L_1 + L_2$ is not less than 2/3 of the original sequence length.
[0049] (2) For each discrete time series x in the discrete time series data set, calculate the corresponding $L_1(x)$ and $L_2(x)$, where $L_1(x)$ is the number of days, counting leftward from day $L_p$, until the download volume first drops to $thr \cdot v_p$; correspondingly, $L_2(x)$ is the number of days, counting rightward from day $L_p$, until the download volume first drops to $thr \cdot v_p$. $L_p$ is the designated day in the sequence (the day on which the peak occurs), and $v_p$ is the peak download volume.
[0050] (3) Calculate the averages $L_1$ and $L_2$ of $L_1(x)$ and $L_2(x)$ over the whole training data set. Then, for each discrete sequence, extract the download data for the $L_1$ days before the peak and the $L_2$ days after it. When fewer than $L_1$ days are available on the left, fill the gap with data from the right; correspondingly, when too few days are available on the right, fill the gap with data from the left. This ensures that every sequence has length L ($L = L_1 + L_2$). At this point the source data has been processed into discrete time series of length L.
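The preprocessing in steps (1) to (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the peak day is counted as part of the right-hand side so that the window has exactly $L_1 + L_2$ days, "filling" a short side is interpreted as shifting the window toward the other side, and averages are rounded to whole days.

```python
import numpy as np

def side_lengths(x, thr=0.1):
    """Return (L1(x), L2(x)): days counted leftward and rightward from the
    peak until downloads first drop to thr * peak value or below."""
    x = np.asarray(x, dtype=float)
    p = int(np.argmax(x))                 # peak day L_p
    cutoff = thr * x[p]                   # thr * v_p
    left = x[:p][::-1]                    # days before the peak, nearest first
    right = x[p + 1:]                     # days after the peak
    hit_l = np.nonzero(left <= cutoff)[0]
    hit_r = np.nonzero(right <= cutoff)[0]
    # Assumption: if the threshold is never reached, use all available days.
    l1 = int(hit_l[0]) + 1 if hit_l.size else len(left)
    l2 = int(hit_r[0]) + 1 if hit_r.size else len(right)
    return l1, l2

def crop_around_peak(x, l1, l2):
    """Cut a window of l1 + l2 days starting l1 days before the peak.
    If one side is too short, the window is shifted toward the other side."""
    x = np.asarray(x, dtype=float)
    length = l1 + l2
    if len(x) < length:
        raise ValueError("sequence shorter than L1 + L2")
    p = int(np.argmax(x))
    start = max(0, min(p - l1, len(x) - length))
    return x[start:start + length]

def build_dataset(curves, thr=0.1):
    """Average L1(x), L2(x) over all curves, then crop every curve to L days."""
    lens = np.array([side_lengths(c, thr) for c in curves])
    l1, l2 = (int(v) for v in np.rint(lens.mean(axis=0)))
    data = np.stack([crop_around_peak(c, l1, l2) for c in curves])
    return data, l1, l2
```

For example, `build_dataset([[1, 2, 10, 5, 1, 0, 0, 0], [0, 0, 0, 1, 9, 2, 0, 0]])` returns a 2 x 4 array, since both curves give $L_1(x) = L_2(x) = 2$ at the default threshold.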
[0051] 3) Pattern clustering, clustering the discrete time series data set generated in step 2) to obtain k download patterns, which specifically includes the following steps:
[0052] (1) Set the number k of pattern clusters in the training data set; k is a manually designated parameter, and the default k=6, and its specific value is adjusted according to the clustering effect.
[0053] (2) Randomly designate k curves as the centers of k clusters from the training data set, and calculate the curve distance d(x,c) from each non-central discrete time series x to k centers;
[0054] d(x,c) represents the distance between x and a given cluster center, where c denotes that cluster center. According to d(x,c), each discrete sequence is assigned to the class of its nearest cluster center.
[0055] (3) Update the k cluster centers; the objective of each cluster center update is to minimize F;
[0056] Under a given class division, the goal of cluster update is to minimize the sum of squared distances from each discrete time series in the class to the class center.
[0057] $$F = \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i, \mu_k)^2 \tag{1}$$
[0058] where $\mu_k$ and $C_k$ are, respectively, the center of the k-th class and the set of curves belonging to the k-th class.
[0059] From formula (1), the updated value of the center of each class k can be derived:
[0060] $$\mu_k^* = \arg\min_{\mu} \sum_{x_i \in C_k} d(x_i, \mu)^2 \tag{2}$$
[0061] where $\mu_k^*$ is the value of the center of the k-th class after the update.
[0062] $$\alpha = \frac{x_i^T \mu}{\|x_i\|^2}$$
[0063] $\alpha$ is the scaling factor applied to the ordinate of the other discrete sequence.
[0064] $$\mu_k^* = \arg\min_{\mu} \frac{1}{\|\mu\|^2} \sum_{x_i \in C_k} \left\| \left( \frac{x_i x_i^T}{\|x_i\|^2} - I \right) \mu \right\|^2 \tag{3}$$
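[0065] Formula (4) can then be derived from formula (3). Writing $P_i = x_i x_i^T / \|x_i\|^2$, the matrix $P_i$ is symmetric and satisfies $P_i^2 = P_i$, so $(P_i - I)^T (P_i - I) = I - P_i$ and each squared norm in formula (3) equals $\mu^T (I - P_i) \mu$:
[0066] $$\mu_k^* = \arg\min_{\mu} \frac{1}{\|\mu\|^2} \, \mu^T \sum_{x_i \in C_k} \left( I - \frac{x_i x_i^T}{\|x_i\|^2} \right) \mu \tag{4}$$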
[0067] Let $$M = \sum_{x_i \in C_k} \left( I - \frac{x_i x_i^T}{\|x_i\|^2} \right),$$ which gives the final formula for computing $\mu_k^*$:
[0068] $$\mu_k^* = \arg\min_{\mu} \frac{\mu^T M \mu}{\|\mu\|^2}$$
[0069] Therefore, $\mu_k^*$ is the eigenvector corresponding to the smallest eigenvalue of the matrix M.
[0070] Here, T denotes the matrix transpose, and $C_k$ denotes the set of curves belonging to the k-th class under the current partition. $x_i$ denotes the i-th discrete sequence in $C_k$, treated as a column vector of length L. The value of M depends only on the $x_i$ and has no specific physical meaning. I is the identity matrix with the same dimensions as $x_i x_i^T$, and $\mu$ denotes a candidate class center.
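The center update derived above can be sketched numerically as follows. This is an illustrative sketch only: the function name `update_center` is hypothetical, every curve is assumed to be non-zero, and because an eigenvector is only defined up to sign, the orientation with a non-negative sum is chosen here on the assumption that download counts are non-negative.

```python
import numpy as np

def update_center(members):
    """members: array of shape (n, L) holding the curves x_i in class C_k.
    Returns the unit eigenvector of M with the smallest eigenvalue."""
    X = np.asarray(members, dtype=float)
    n, L = X.shape
    norms_sq = np.sum(X * X, axis=1)                  # ||x_i||^2 for each curve
    # M = sum_i (I - x_i x_i^T / ||x_i||^2)
    M = n * np.eye(L) - (X / norms_sq[:, None]).T @ X
    # M is symmetric, so eigh applies; it returns eigenvalues in ascending order.
    eigvals, eigvecs = np.linalg.eigh(M)
    mu = eigvecs[:, 0]
    # Sign convention (assumption): prefer the orientation with non-negative sum.
    return mu if mu.sum() >= 0 else -mu
```

When all curves in a class are scaled copies of one shape, M has a zero eigenvalue in that direction, so the returned center is that shape normalized to unit length.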
[0071] The algorithm used for pattern clustering is similar to k-means, and each iteration is divided into two steps. Unlike the Euclidean distance used by k-means, the curve distance calculation method in the definition is used here.
[0072] Updating the cluster centers by computing an eigenvector makes the algorithm easier to implement and also effectively reduces the complexity of the solution.
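The two-step iteration described above (assign each curve to its nearest center, then update each center) can be sketched as follows. This is a sketch under assumptions rather than the patented implementation: the distance is taken in the form implied by formula (3), $d(x, c) = \min_\alpha \|\alpha x - c\| / \|c\|$; the names `curve_distances`, `cluster_curves`, `n_iter`, `seed` and `init_idx` are hypothetical; an empty class keeps its previous center; and all curves are assumed to be non-zero.

```python
import numpy as np

def curve_distances(X, centers):
    """Distance from every curve (rows of X) to every center (rows of centers)."""
    X = np.asarray(X, dtype=float)
    C = np.asarray(centers, dtype=float)
    xn = np.sum(X * X, axis=1)            # ||x||^2, shape (n,)
    cn = np.sum(C * C, axis=1)            # ||c||^2, shape (k,)
    dots = X @ C.T                        # x . c, shape (n, k)
    # With alpha = x.c / ||x||^2:  ||alpha x - c||^2 = ||c||^2 - (x.c)^2 / ||x||^2
    sq = cn[None, :] - dots ** 2 / xn[:, None]
    return np.sqrt(np.clip(sq, 0.0, None) / cn[None, :])

def cluster_curves(X, k=6, n_iter=50, seed=0, init_idx=None):
    """Alternate assignment and eigenvector center updates until labels settle."""
    X = np.asarray(X, dtype=float)
    if init_idx is None:
        # Randomly designate k curves as the initial centers.
        init_idx = np.random.default_rng(seed).choice(len(X), size=k, replace=False)
    centers = X[np.asarray(init_idx)].copy()
    labels = None
    for _ in range(n_iter):
        new_labels = np.argmin(curve_distances(X, centers), axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                         # partition unchanged: converged
        labels = new_labels
        for j in range(len(centers)):
            members = X[labels == j]
            if len(members) == 0:
                continue                  # assumption: keep the old center
            ns = np.sum(members * members, axis=1)
            M = len(members) * np.eye(X.shape[1]) - (members / ns[:, None]).T @ members
            mu = np.linalg.eigh(M)[1][:, 0]          # smallest-eigenvalue eigenvector
            centers[j] = mu if mu.sum() >= 0 else -mu
    return centers, labels
```

Like k-means, this procedure can settle in a local minimum that depends on the initial centers, which is one reason to inspect the clustering result when tuning k.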
[0073] 4) Download forecast. Given the download curve of an app over m days, match it against the k download patterns, calculate the total downloads over the next (L-m) days, and obtain the forecast result. This specifically includes the following steps:
[0074] (1) Given an app whose download curve over the first m days is a discrete time series test of length m, calculate the cosine similarity between test and the discrete sequence formed by the first m days of each class center (each cluster center is itself a discrete sequence of length L), and select the most similar class center c;
[0075] (2) The total download volume pred over the following L-m days is predicted as follows:
[0076] $$pred = \frac{\sum_{j=1}^{m} test_j}{\sum_{j=1}^{m} c_j} \cdot \sum_{j=m+1}^{L} c_j,$$
[0077] where c denotes the selected most similar class center and $c_j$ denotes the j-th item of the discrete sequence c.
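The matching and forecasting steps can be sketched as follows. This is a minimal illustration: the function name `predict_remaining` is hypothetical, and the observed curve and the first m days of every center are assumed to be non-zero so that the cosine similarity and the scaling ratio are defined.

```python
import numpy as np

def predict_remaining(test, centers):
    """test: downloads for the first m days, shape (m,).
    centers: class centers, shape (k, L) with L > m.
    Returns (pred, index of the selected center)."""
    test = np.asarray(test, dtype=float)
    C = np.asarray(centers, dtype=float)
    m = len(test)
    heads = C[:, :m]                                   # first m days of each center
    cos = heads @ test / (np.linalg.norm(heads, axis=1) * np.linalg.norm(test))
    best = int(np.argmax(cos))                         # most similar class center
    c = C[best]
    # Scale the center's remaining L - m days by the observed-to-center ratio.
    pred = test.sum() / c[:m].sum() * c[m:].sum()
    return float(pred), best
```

For instance, with centers `[[1, 2, 3, 2, 1, 1], [5, 3, 1, 1, 0, 0]]` and `test = [10, 20, 30]`, the first center matches exactly in shape, the scaling ratio is 60/6 = 10, and the forecast for the remaining three days is 10 * 4 = 40.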