[0086] Example 2
[0087] More specifically, for a common user information extraction section of a single model, use self-focus mechanism instead of the original poolization technology, the principle of self-focus mechanism image 3 As shown, Query and Key are hidden states of decoders and encoders, and Value is an embedding vector to extract information. Query and Key have generated corresponding weight A after a cost of calculating similarity, normalization, mask, and SoftMax. The obtained weight A is multiplied by the information vector value, which gives weights for each input vector according to the similarity. It will make it unforeseen or special. (Specifically, using max-pooling will remain in two fields, that is, the specialty but loss of generality. Use Average-Pool, simply retain the mean of both, that is, the general but lost specialty.)
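The contrast described above can be illustrated with a small numerical sketch (the vectors and dimensions below are made-up examples for illustration, not values from the embodiment):

```python
import numpy as np

# Two hypothetical common-user vectors from the two domains (made-up values).
a = np.array([0.9, 0.1, 0.4])
b = np.array([0.2, 0.8, 0.5])
stacked = np.stack([a, b])                  # shape (2, 3)

max_pooled = stacked.max(axis=0)            # keeps only the larger entries: particularity, loses generality
avg_pooled = stacked.mean(axis=0)           # keeps only the mean: generality, loses particularity

# Self-attention style fusion: weight each vector by similarity instead of discarding values.
scores = stacked @ stacked.T / np.sqrt(stacked.shape[1])               # pairwise similarity
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)   # SoftMax over each row
fused = weights @ stacked                   # every input contributes according to its similarity

print(max_pooled, avg_pooled, fused, sep="\n")
```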
[0088] More specifically, the DCTOR model based on the self-attention mechanism (hereinafter referred to as the SA-DCTOR model) is shown in Figure 4. After the pooling layer is replaced by the self-attention mechanism, each input vector is given a weight, and information can be extracted according to the similarity of the common-user vectors of the two domains while retaining both the particularity and the generality of the two domains. In addition, as shown in Figure 5, the self-attention mechanism is a parallel end-to-end algorithm: the input sequence vector A is weighted against the sequence vector B in parallel, and the parallelism of the algorithm gives the model good application performance on large-scale data.
[0089] In the specific implementation process, the Amazon dataset is used as the experimental data of the present invention. It is widely used for training and testing cross-domain recommendation algorithms, and contains information such as ratings, votes, and product metadata. The Amazon rating data is divided into 44 categories, and the present invention selects four representative commodity categories (movie domain M, book domain B, music domain C, and electronic product domain E).
[0090] First, preprocessing is performed; the specific process includes the following steps (a brief code sketch of these steps is given after the list):
[0091] Users and products with too few ratings are removed, alleviating the severe sparsity of the data;
[0092] Only the data of users common to the music domain C, book domain B, movie domain M, and electronic product domain E are retained;
[0093] Within each domain, users with a larger number of ratings are retained;
[0094] Finally, the dataset is divided into numerical information and categorical information, where the numerical information mainly includes rating information, and the categorical information mainly includes item descriptions, brands, and links.
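A minimal sketch of these preprocessing steps, assuming the raw ratings are loaded into a pandas DataFrame with hypothetical columns user, item, rating, and domain (the threshold of 5 ratings is an illustrative assumption, not a value from the embodiment):

```python
import pandas as pd

def preprocess(ratings: pd.DataFrame, min_ratings: int = 5) -> pd.DataFrame:
    """ratings: columns [user, item, rating, domain]; domain in {"M", "B", "C", "E"}."""
    # 1. Drop users and items with too few ratings to ease data sparsity.
    user_counts = ratings.groupby("user")["rating"].transform("size")
    item_counts = ratings.groupby("item")["rating"].transform("size")
    ratings = ratings[(user_counts >= min_ratings) & (item_counts >= min_ratings)]

    # 2. Keep only users common to all four domains (M, B, C, E).
    domains_per_user = ratings.groupby("user")["domain"].nunique()
    common_users = domains_per_user[domains_per_user == 4].index
    ratings = ratings[ratings["user"].isin(common_users)]

    # 3. Within each domain, keep the more active users.
    counts = ratings.groupby(["domain", "user"])["rating"].transform("size")
    return ratings[counts >= min_ratings]
```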
[0095] Among them, the music domain C is used as the target domain, and the book domain B, movie domain M, and electronic product domain E are each used in turn as the auxiliary domain, with the corresponding data input into the SA-DCTOR model. The present invention takes the movie domain M and the music domain C as an example; the training processes of the other dual-domain single models are basically the same as in this embodiment. The process is:
[0096] The required data is first vectorized by embedding in the embedding layer. The details are as follows:
[0097] (1) The rating matrices R_m and R_c of the movie domain M and the music domain C are decomposed using the LFM method into the latent factor matrices of the corresponding domains, i.e., the user matrix U_m and item matrix V_m of the movie domain M, and the user matrix U_c and item matrix V_c of the music domain C. Next, the latent factor generation process in the present invention will be described. Taking the latent factor generation process of the movie domain M as an example, the latent factors of the music domain C can be obtained in the same way:
[0098] First, the M×N rating matrix R_m is decomposed using the singular value decomposition (SVD) method into the product of an M×K-dimensional user matrix U and a K×N-dimensional item matrix, as shown in Figure 6, where M is the number of users, N is the number of items, and K is the dimension of the latent factor vectors, that is, the dimension of the implicit features.
[0099] More specifically, the above-mentioned singular value decomposition method is implemented as follows:
[0100] The M×N rating matrix R_m is decomposed using the singular value decomposition method into the product of an M×K-dimensional user matrix U and a K×N-dimensional item matrix, specifically:
[0101] Solve for the M eigenvalues of the rating matrix R_m, which are arranged into a diagonal matrix λ;
[0102] Let U be an orthogonal basis matrix of the M-dimensional space, and solve R_m·V = λ·U for the orthogonal matrix V, obtaining the decomposition R_m = U·λ·V^T; letting U_m = U·λ and V_m = V^T, the resulting latent factor decomposition is R_m = U_m·V_m.
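A minimal numpy sketch of this decomposition step, assuming a small toy rating matrix and an illustrative latent dimension K (the truncation to K factors is an assumption made for the sketch):

```python
import numpy as np

R_m = np.random.rand(6, 4)                # toy M x N rating matrix (M users, N items)
K = 2                                     # dimension of the latent factor vectors

U, s, Vt = np.linalg.svd(R_m, full_matrices=False)
U_m = U[:, :K] @ np.diag(s[:K])           # user latent factor matrix, M x K  (U * lambda)
V_m = Vt[:K, :]                           # item latent factor matrix, K x N  (V^T)

approx = U_m @ V_m                        # truncated product approximating R_m
```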
[0103] In order to make the product of the two latent factors closer to the actual rating matrix, the mean square error is used as the loss function, whose specific formula is:
[0104] loss = Σ e_m² = Σ (R_m − U_m·V_m)²
[0105] where R_m represents the actual rating matrix, U_m and V_m are the latent factors of users and items, respectively, and e_m = R_m − U_m·V_m denotes the prediction error.
[0106] Next, partial derivatives of the loss function with respect to the parameters are taken, and the gradient update formulas for the parameters are:
[0107] U_m = U_m − η(−e_m·V_m) = U_m + η·e_m·V_m
[0108] V_m = V_m − η(−e_m·U_m) = V_m + η·e_m·U_m
[0109] where η is the learning rate, which controls the step size of each update; the gradient descent method is then used to solve the parameters, the gradient descent method being essentially an iterative calculation method;
[0110] First, the values of U_m and V_m are randomly initialized;
[0111] Then, at each step, the values of the parameters U_m and V_m are updated according to the gradient update formulas, and the values of U_m and V_m are substituted into the loss function to calculate the loss;
[0112] The above calculation steps are repeated; when the value of the loss function reaches its minimum or no longer changes, the parameter values U_m and V_m are the optimal parameter values, and the solution of the latent factors is complete.
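The iterative solving procedure described above can be sketched as follows; the learning rate, latent dimension, and fixed iteration count are illustrative assumptions, and the embodiment's stopping criterion (the loss no longer changing) is simplified here to a fixed number of iterations:

```python
import numpy as np

def lfm(R, K=10, eta=0.01, n_iter=200):
    """Solve latent factors U_m (M x K) and V_m (K x N) of rating matrix R by gradient descent."""
    M, N = R.shape
    U_m = np.random.rand(M, K) * 0.1        # random initialization of U_m
    V_m = np.random.rand(K, N) * 0.1        # random initialization of V_m
    for _ in range(n_iter):
        E = R - U_m @ V_m                   # error e_m = R_m - U_m * V_m
        U_m, V_m = (U_m + eta * E @ V_m.T,  # U_m = U_m + eta * e_m * V_m
                    V_m + eta * U_m.T @ E)  # V_m = V_m + eta * e_m * U_m
        loss = (E ** 2).sum()               # mean-square-error loss
    return U_m, V_m, loss
```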
[0113] (2) The Word2Vec method is used to map the text data of the movie domain M and the music domain C into user vectors and item vectors: the user vector UC_m and item vector VC_m of the domain M, and the user vector UC_c and item vector VC_c of the target domain C. Specifically, the present invention uses the CBOW model of the Word2Vec method to train the word vectors.
[0114] The core idea of the above CBOW model is to remove a word from a sentence and use the remaining context words to predict the removed word. The model structure is shown in Figure 7.
[0115] More specifically, the method of generating word vectors from the text information in the present invention is described below.
[0116] First, the commonly used stop-word list is applied and the vocabulary is constructed, where the vocabulary size is V and the window size is C = 10;
[0117] Second, the input layer of the model consists of one-hot encoded 1×V-dimensional vectors {x_1, …, x_C}, the hidden layer is an N-dimensional vector, and the output layer is also a one-hot encoded word vector y.
[0118] More specifically, the input vectors are connected to the hidden layer through a V×N-dimensional weight matrix W; the hidden layer is connected to the output layer through an N×V weight matrix W′ to obtain a word vector, which is the one-hot vector corresponding to the removed word. The specific mathematical expressions are as follows:
[0119] h = (1/C)·(x_1 + x_2 + … + x_C)·W
[0120] H = [h_1, …, h_n]
[0121] Y = H^T·W′
[0122] where x_i is the V-dimensional one-hot encoding vector of each word, W is the V×N weight matrix, H is the N-dimensional hidden vector, W′ is an N×V-dimensional matrix, and Y is a V-dimensional one-hot encoding vector.
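A minimal forward-pass sketch of these expressions with toy dimensions (V = 8 words, N = 4 hidden units, C = 3 context words, all weights randomly initialized); this only illustrates the formulas and is not the training procedure:

```python
import numpy as np

V, N, C = 8, 4, 3                          # vocabulary size, hidden size, window size
W = np.random.rand(V, N)                   # V x N input weight matrix
W_prime = np.random.rand(N, V)             # N x V output weight matrix W'

context_ids = [1, 4, 6]                    # indices of the C context words
X = np.eye(V)[context_ids]                 # C one-hot vectors, each of dimension 1 x V

h = X.mean(axis=0) @ W                     # hidden layer: h = (1/C)(x_1 + ... + x_C) W
Y = h @ W_prime                            # output scores, Y = H^T W'
predicted = int(np.argmax(Y))              # index of the predicted (removed) center word
```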
[0123] As can be seen from the above steps, vector representations of all text information in the dataset can be obtained through the continuous bag-of-words model. Since one-hot encoding vectors are sparse, they are not conducive to expressing the relationships between word vectors, and the matrix composed of one-hot encoding vectors also suffers from the problem of dimension explosion. Therefore, this method performs embedding vectorization on the one-hot encoding vector Y_i of each word. The specific steps are as follows:
[0124] Word vectors Y_i are constructed with the user as the unit, and the V×M matrix Y is obtained by vector concatenation;
[0125] A V×N embedding weight matrix W_e is trained through a fully connected network, and the matrix Y is multiplied by the weight matrix to obtain an M×N-dimensional text vector matrix R_v, expressed mathematically as:
[0126] R_v = Y^T·W_e
[0127] In this way, together with the latent factor solving method based on matrix decomposition described above, the user vector UC_m of the movie domain M is obtained;
[0128] Steps S221 to S224 are repeated to obtain the item vector VC_m, as well as the user vector UC_c and item vector VC_c of the music domain C.
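A minimal sketch of the embedding vectorization step R_v = Y^T·W_e with toy dimensions; the fully connected training of W_e is omitted and W_e is simply initialized at random here:

```python
import numpy as np

V, M, N = 8, 5, 4                 # vocabulary size, number of users, embedding dimension
Y = np.random.randint(0, 2, size=(V, M)).astype(float)  # V x M matrix built from per-user word vectors
W_e = np.random.rand(V, N)        # V x N embedding weight matrix (trained via a fully connected network in practice)

R_v = Y.T @ W_e                   # M x N text vector matrix, R_v = Y^T W_e
print(R_v.shape)                  # (5, 4)
```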
[0129] More specifically, in the shared layer, it is necessary to fuse the user matrices and item matrices of the two different domains and perform feature extraction. The details are as follows:
[0130] Existing models use pooling techniques to extract features, but whether max-pooling or average-pooling is used to extract the vectors, only part of the values in the pooling window is kept while the other values are discarded, and the information contained in the discarded values is lost. Specifically, as shown in Figure 8, max-pooling retains only the maximum value in the pooling window and discards the other values, so that the extracted features keep particularity but lose generality. As shown in Figure 9, average-pooling retains only the average of all values in the pooling window, so that the extracted features lose particularity. Therefore, the features extracted by existing models using pooling techniques have poor generalization performance.
[0131] To address the above problem, the present invention proposes using the self-attention mechanism instead of pooling techniques for feature fusion. Compared with pooling techniques, the self-attention mechanism weights the vectors according to similarity and is not limited by text length, so it can fully fuse the feature information of each vector. As shown in Figure 3, the specific structure of the self-attention mechanism includes:
[0132] First, the input includes three matrices: Q (Query), K (Key), and V (Value); the matrix Q and the matrix K are multiplied (MatMul);
[0133] Then, the result is divided by a scaling factor (Scale), and the scaled result is normalized into a probability distribution using the SoftMax operation;
[0134] Finally, the obtained probability distribution is multiplied by the matrix V (MatMul) to obtain the weighted output.
[0135] Each such operation produces one head; after h operations, h heads are obtained, and the obtained heads are concatenated.
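These steps correspond to standard scaled dot-product attention with multiple heads; a compact numpy sketch with made-up dimensions (6 positions, model dimension 8, h = 2 heads) is shown below:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                                          # MatMul of Q and K, then Scale
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # SoftMax into a probability distribution
    return weights @ V                                                     # MatMul with V gives one head

# Toy inputs: 6 positions, model dimension 8, h = 2 heads of size 4 each.
rng = np.random.default_rng(0)
Q, K, V = (rng.random((6, 8)) for _ in range(3))
h, d_head = 2, 4
heads = [scaled_dot_product_attention(Q[:, i*d_head:(i+1)*d_head],
                                      K[:, i*d_head:(i+1)*d_head],
                                      V[:, i*d_head:(i+1)*d_head])
         for i in range(h)]
output = np.concatenate(heads, axis=-1)    # splice (concatenate) the h heads
```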
[0136] In the specific implementation process, the user matrix U_m serves as the Query, the user matrix U_c as the Key, and the rating matrix R_s as the Value; first the similarity between the Query and the Key is calculated, and a scaled dot-product operation is performed to obtain the similarity matrix A_u;
[0137] A SoftMax normalization operation is performed on the similarity matrix to obtain the weight matrix W_u; finally, the weight matrix W_u is multiplied by the Value to obtain a head output, denoted H_u. The specific expression is:
[0138] H_u = Attention(U_m, U_c, R_s)
[0139] Attention(U_m, U_c, R_s) = SoftMax(U_m·U_c^T / √d)·R_s
[0140] where d is the dimension of the matrices U_m and U_c, and R_s is the spliced rating matrix of the domain M and the target domain C;
[0141] Similarly, the text vector matrix H_y is obtained by applying the above calculation to the user vector UC_m and the user vector UC_c;
[0142] Finally, H_u and H_y are spliced to obtain the common-user information matrix containing the rating information and text information of the domain M and the target domain C, specifically:
[0143] D_mc = Concat(H_u, H_y)
[0144] where D_mc is the common-user information matrix containing the rating information and text information of the domain M and the target domain C; the above steps are repeated to obtain the common-user information matrix D_bc of the book domain B and the music domain C, and the common-user information matrix D_ec of the electronic product domain E and the music domain C.
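A simplified sketch of this shared-layer fusion for one domain pair is given below; the matrix shapes are made up, and the choice of Value for the text-side attention is an assumption made for illustration, since the embodiment only states that UC_m and UC_c are processed by the same calculation:

```python
import numpy as np

def attention(Q, K, V):
    """SoftMax(Q K^T / sqrt(d)) V, as in the expression for H_u above."""
    w = np.exp(Q @ K.T / np.sqrt(Q.shape[-1]))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

n_users, d = 6, 8
rng = np.random.default_rng(1)
U_m, U_c, R_s = (rng.random((n_users, d)) for _ in range(3))     # rating-side Query, Key, Value
UC_m, UC_c = rng.random((n_users, d)), rng.random((n_users, d))  # text-side user vectors

H_u = attention(U_m, U_c, R_s)                                       # H_u = Attention(U_m, U_c, R_s)
H_y = attention(UC_m, UC_c, np.concatenate([UC_m, UC_c], axis=-1))   # text-side output (Value choice assumed)
D_mc = np.concatenate([H_u, H_y], axis=-1)                           # D_mc = Concat(H_u, H_y)
```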
[0145] More specifically, D_mc, D_bc, and D_ec are input into a self-attention layer for weighted fusion, and the fused vectors form the multi-domain fusion matrix U containing the rating information and the text information.