[0025] The present invention will be further explained below in conjunction with the drawings:
[0026] As shown in Figure 1, the specific steps of the emotion classification method combining deep and shallow features according to the present invention are as follows:
[0027] Step 1: Collect sentiment text corpora from the Internet and manually label their categories; for example, positive-emotion texts are labeled 1 and negative-emotion texts are labeled 2. Remove the spaces at the beginning and end of each text and represent each piece of data as a single sentence, which is convenient for subsequent processing. The corpus is then divided into a training set and a test set: the training set is used to train the emotion classification model, and the test set is used to evaluate the classification performance of the model.
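The following is a minimal sketch of this step in Python; the file names (positive.txt, negative.txt), the UTF-8 encoding, the one-text-per-line format and the 8:2 split ratio are illustrative assumptions rather than requirements of the method:

import random

def load_corpus(pos_path="positive.txt", neg_path="negative.txt", train_ratio=0.8):
    samples = []
    for path, label in [(pos_path, 1), (neg_path, 2)]:   # 1 = positive, 2 = negative
        with open(path, encoding="utf-8") as f:
            for line in f:
                text = line.strip()                       # remove leading/trailing spaces
                if text:
                    samples.append((text, label))
    random.shuffle(samples)
    split = int(len(samples) * train_ratio)
    return samples[:split], samples[split:]               # training set, test set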
[0028] Step 2: First, collect sentiment dictionaries from the Internet. A sentiment dictionary is a basic resource for text sentiment analysis and is essentially a collection of sentiment words. Broadly speaking, it refers to phrases or sentences that carry an emotional orientation; in a narrow sense, it refers to a collection of emotionally inclined words. A sentiment dictionary generally contains two parts: a dictionary of positive emotion words and a dictionary of negative emotion words.
[0029] Then Chinese word segmentation is performed on the corpus from Step 1. The segmentation method used here is a Chinese word segmentation algorithm that combines a dictionary-based reverse maximum matching algorithm with a statistical word segmentation strategy. The word segmentation dictionary is constructed hierarchically and consists of two parts: a core dictionary and a temporary dictionary. Entries are counted from an authoritative corpus, and the core dictionary is stored using a two-level hash structure; the sentiment dictionary is selected as the corpus loaded into the temporary dictionary. After the word segmentation dictionary is initially formed, the segmentation system enters an autonomous learning stage. When an emotional text is segmented, if a newly counted word already exists in the temporary dictionary, its word frequency is increased by one; otherwise, the new word is added to the temporary dictionary. After the frequency is accumulated, it is checked against a preset threshold; if the threshold is met, the word is moved into the core dictionary and its entry is cleared from the temporary dictionary. The number of emotional texts learned is counted and recorded, and if it exceeds a predetermined value, the temporary dictionary is cleared. The entries in the updated core dictionary serve as the basis for segmentation, and the reverse maximum matching algorithm is used to segment the emotional text, as in the sketch below.
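A simplified sketch of the reverse maximum matching segmentation and the temporary-dictionary learning described above; the maximum word length and the promotion threshold are illustrative assumptions:

def rmm_segment(sentence, core_dict, max_len=6):
    """Reverse (backward) maximum matching segmentation against the core dictionary."""
    words, end = [], len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = sentence[end - size:end]
            if size == 1 or piece in core_dict:      # single characters are always accepted
                words.insert(0, piece)
                end -= size
                break
    return words

def learn_new_word(word, core_dict, temp_dict, freq_threshold=5):
    """Accumulate the word frequency in the temporary dictionary and promote the
    word into the core dictionary once the frequency threshold is reached."""
    if word in core_dict:
        return
    temp_dict[word] = temp_dict.get(word, 0) + 1
    if temp_dict[word] >= freq_threshold:
        core_dict.add(word)
        del temp_dict[word]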
[0030] After word segmentation, each text is a corpus of words separated by spaces. Then collect a stop-word list, manually delete from it any words that are useful to the experiment, and remove the stop words from the segmented corpus according to this list. Removing stop words saves storage space and improves efficiency.
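A minimal sketch of the stop-word removal, assuming the stop-word list is a plain-text file with one word per line (the words useful to the experiment having been deleted from it beforehand):

def remove_stopwords(tokens, stopword_path="stopwords.txt"):
    with open(stopword_path, encoding="utf-8") as f:
        stopwords = {line.strip() for line in f if line.strip()}
    return [w for w in tokens if w not in stopwords]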
[0031] Step 3: Use regular expressions to extract the nouns, adverbs, adjectives and prepositions (identified by their part-of-speech tags) from the corpus obtained in Step 2 to form a new corpus. If the full text is used, it easily causes the curse of dimensionality when the text is represented as a feature vector; extracting some of the important words from the text can still represent the text well while mitigating the dimensionality problem.
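A sketch of this step, assuming each token carries a part-of-speech tag in "word/tag" form; the tag abbreviations used in the regular expression (n = noun, d = adverb, a = adjective, p = preposition) are an assumed convention, not part of the method itself:

import re

POS_PATTERN = re.compile(r"^(\S+)/(n\w*|d|a\w*|p)$")   # keep nouns, adverbs, adjectives, prepositions

def extract_key_words(tagged_tokens):
    kept = []
    for token in tagged_tokens:
        m = POS_PATTERN.match(token)
        if m:
            kept.append(m.group(1))                     # keep the word, drop the tag
    return kept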
[0032] Step 4: Use Doc2vec to train a word vector model on the corpus from Step 2 and obtain the deep feature vectors of the emotional texts. Doc2vec is a shallow model used to obtain the deep features of words and texts; it takes into account both the semantic relationships between words and the order of the words, so it can represent the characteristics of words and texts well. Doc2vec uses two important models, PV-DBOW and PV-DM, and for these models it also provides two training algorithms, Hierarchical Softmax and Negative Sampling. The present method uses the PV-DM model based on the Hierarchical Softmax algorithm. The input of the PV-DM model is a variable-length paragraph (ParagraphId) and all the words (Words) in the paragraph; here the ParagraphId represents an emotional text. The output is the word predicted from the ParagraphId and the Words.
[0033] The training process of the PV-DM model is as follows:
[0034] Each ParagraphId and each word is mapped to a unique paragraph vector (ParagraphVector) and a unique word vector (WordVector), with all ParagraphVectors placed column-wise into a matrix D and all WordVectors into a matrix W. The ParagraphVector and the WordVectors are accumulated or concatenated as the input of the Softmax in the output layer. The output-layer Softmax uses the entries as leaf nodes, with the number of times each entry appears in the text corpus as its weight, to construct a Huffman tree. The objective function is established as:
[0035] $\frac{1}{T}\sum_{t=k}^{T-k}\log p\left(w_t \mid w_{t-k},\ldots,w_{t+k}\right)$    (1)
[0036] where $T$ represents the number of word vectors, and $w_t$, $w_{t-k}$, etc. represent the individual word vectors.
[0037] $p\left(w_t \mid w_{t-k},\ldots,w_{t+k}\right) = \frac{e^{y_{w_t}}}{\sum_i e^{y_i}}$    (2)
[0038] where each $y_i$ is the unnormalized log-probability of word $i$, and $y_i$ is calculated as:
[0039] $y = b + U h\left(w_{t-k},\ldots,w_{t+k};\, W, D\right)$    (3)
[0040] where $U$ and $b$ are the parameters of the Softmax, and $h$ is formed by accumulating or concatenating the ParagraphVector and the WordVectors extracted from the matrices D and W.
[0041] During training, the ParagraphId remains unchanged and all words in a text share the same ParagraphVector, which is equivalent to using the semantics of the entire text each time the probability of a word is predicted. The objective function is optimized to obtain the optimal vector representation of the words. The stochastic gradient ascent method is used to optimize the objective function above; during the iteration, the update formula for the vector $\theta^{u}$ of word $u$ is:
[0042] $\theta^{u} := \theta^{u} + \eta\left[L^{x}(u) - \sigma\!\left(w(\tilde{x})^{T}\theta^{u}\right)\right]w(\tilde{x})$    (4)
[0043] The update formula for the vector $w(\tilde{x})$ of the word $\tilde{x}$ is:

$w(\tilde{x}) := w(\tilde{x}) + \eta\left[L^{x}(u) - \sigma\!\left(w(\tilde{x})^{T}\theta^{u}\right)\right]\theta^{u}$    (5)
[0044] where $\theta^{u} \in R^{n}$ represents the auxiliary vector corresponding to word $u$, $L^{x}(u)$ represents the label of word $u$, $w(\tilde{x})$ represents the vector corresponding to word $\tilde{x}$, $\sigma$ is the logistic regression (sigmoid) function, and $\eta$ represents the learning rate. During the iteration, both the vector $\theta^{u}$ of word $u$ and the vector $w(\tilde{x})$ of word $\tilde{x}$ are updated on the basis of their previous values, so the vectors' ability to express words becomes stronger; as the updates continue, the quality of the vector representation also improves.
[0045] In the prediction phase, a new ParagraphId is assigned to the text to be predicted, while the word vectors and the parameters of the output-layer Softmax obtained in the training phase are kept unchanged, and the stochastic gradient ascent method is used to train on the text to be predicted. After convergence, the ParagraphVector of the text is obtained; this is the deep feature vector of the text. These deep feature vectors are then processed into a data format that can be used by the SVM.
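A sketch of Step 4 using the gensim library (the choice of library, the vector size, window and number of epochs are assumptions for illustration); dm=1 selects the PV-DM model and hs=1 selects Hierarchical Softmax, matching the description above:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

def train_doc2vec(segmented_texts, vector_size=100, epochs=20):
    # each text is a list of words; its index serves as the ParagraphId
    documents = [TaggedDocument(words=words, tags=[str(i)])
                 for i, words in enumerate(segmented_texts)]
    model = Doc2Vec(documents, dm=1, hs=1, negative=0,
                    vector_size=vector_size, window=5, min_count=2, epochs=epochs)
    return model

# model.dv["0"]              -> ParagraphVector (deep feature vector) of the first training text
# model.infer_vector(words)  -> vector inferred for an unseen text with frozen word vectors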
[0046] Step 5: Apply TF-IDF to the corpus obtained in Step 3 to obtain the shallow feature vectors of the emotional texts.
[0047] In a given sentiment text, the term frequency (TF) is the frequency with which a given word occurs in the text. This number is a normalization of the term count, which prevents it from being biased towards longer texts (the same word may have a higher count in a long text than in a short one, regardless of whether the word is important). For a word $t_i$ in a particular document, its importance can be expressed as:
[0048] $tf_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}$    (6)
[0049] where $n_{i,j}$ is the number of occurrences of the word in the text $d_j$, and the denominator $\sum_{k} n_{k,j}$ is the total number of occurrences of all words in $d_j$.
[0050] The inverse document frequency (IDF) is a measure of the general importance of a word. The IDF of a particular word is obtained by dividing the total number of texts by the number of texts containing the word, and then taking the logarithm of the quotient:
[0051] $idf_i = \log\frac{|D|}{\left|\{\,j : t_i \in d_j\,\}\right|}$    (7)
[0052] where $|D|$ represents the total number of texts in the sentiment corpus and $\left|\{\,j : t_i \in d_j\,\}\right|$ is the number of texts containing the word $t_i$. If the word does not appear in the corpus this denominator would be zero, so in practice $1 + \left|\{\,j : t_i \in d_j\,\}\right|$ is used. The TF-IDF value of a word is finally obtained as:
[0053] $tfidf_{i,j} = tf_{i,j} \times idf_i$    (8)
[0054] Calculate this value for all the words in an emotional text and place the obtained TF-IDF values into a new text to obtain the shallow feature vector of that text; then compute the shallow feature vectors of all texts in the same way.
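A small sketch that computes the shallow feature vectors directly from formulas (6) to (8); an off-the-shelf implementation such as scikit-learn's TfidfVectorizer could equally be used. The input is a list of texts, each a list of words obtained in Step 3:

import math
from collections import Counter

def tfidf_vectors(texts):
    vocab = sorted({w for words in texts for w in words})
    index = {w: i for i, w in enumerate(vocab)}
    df = Counter(w for words in texts for w in set(words))   # |{ j : t_i in d_j }|
    D = len(texts)
    vectors = []
    for words in texts:
        counts = Counter(words)
        total = sum(counts.values())                # denominator of formula (6)
        vec = [0.0] * len(vocab)
        for w, n in counts.items():
            tf = n / total                          # formula (6)
            idf = math.log(D / (1 + df[w]))         # formula (7), with the +1 smoothing
            vec[index[w]] = tf * idf                # formula (8)
        vectors.append(vec)
    return vectors, vocab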
[0055] Step 6: Put the deep feature vectors of all the texts obtained in Step 4 into one file, with each line representing one text vector, and likewise put the shallow feature vectors of all the texts obtained in Step 5 into another file, again with each line representing one text vector. Since the deep features obtained in Step 4 and the shallow features obtained in Step 5 are equally important for sentiment classification, the weight ratio of the two kinds of features is set to 1:1, and the corresponding lines of the two files are directly concatenated end to end to obtain the new emotional-text feature vectors.
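A minimal sketch of the 1:1 end-to-end concatenation, assuming the deep and shallow feature matrices store one text per row in the same order:

import numpy as np

def combine_features(deep_vectors, shallow_vectors):
    deep = np.asarray(deep_vectors)
    shallow = np.asarray(shallow_vectors)
    assert deep.shape[0] == shallow.shape[0]        # one row per emotional text
    return np.hstack([deep, shallow])               # equal weight, direct concatenation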
[0056] Step 7: Input the text feature vectors of the training set from Step 6 into the SVM to train an emotion classification model.
[0057] A nonlinear function $\varphi(x)$ is introduced to map the input space $R^{n}$ to an m-dimensional feature space, and a separating hyperplane is then constructed in the high-dimensional space. The hyperplane can be defined as:
[0058] $\sum_{j=1}^{m} w_j^{*}\,\varphi(x) + b^{*} = 0$    (9)
[0059] where $w_j^{*}$ is the weight connecting the feature space to the output space, and $b^{*}$ is the bias value.
[0060] To obtain the optimal hyperplane, the weight vector and the bias should be minimized subject to the constraints $y_i(w \cdot x_i + b) \geq 1 - \xi_i,\ i = 1, 2, \ldots, m$, where $\xi_i$ is a positive slack variable that increases the fault tolerance of the classifier. According to the principle of structural risk minimization, the objective function to be minimized is:
[0061] $J(w,\xi) = \frac{1}{2}\|w\|^{2} + C\sum_{j=1}^{N}\xi_j$    (10)
[0062] where $C$ is the penalty parameter. According to Lagrangian theory, Lagrange multipliers $\alpha_i$ and the kernel function $K(x_i, x) = \varphi(x_i)\varphi(x)$ are introduced, and the problem can be transformed into solving for the minimum of the following objective function:
[0063] $W(\alpha) = \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i\alpha_j y_i y_j K(x_i, x_j) - \sum_{i=1}^{N}\alpha_i$    (11)
[0064] subject to the constraints $\sum_{i=1}^{N}\alpha_i y_i = 0$ and $0 \leq \alpha_i \leq C,\ i = 1, 2, \ldots, N$.
[0065] The optimal hyperplane can be expressed as:
[0066] $\sum_{i=1}^{N}\alpha_i^{*} y_i K(x_i, x) + b^{*} = 0$    (12)
[0067] The classification decision function can be expressed as:
[0068] $f(x) = \mathrm{sign}\!\left(\sum_{i=1}^{N}\alpha_i^{*} y_i K(x_i, x) + b^{*}\right)$    (13)
[0069] After the training is completed, save the emotion classification model.
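A sketch of Step 7 using scikit-learn's SVC (an assumed implementation; any SVM solver of problem (11) could be used). The RBF kernel and the value of the penalty parameter C from formula (10) are illustrative:

import pickle
from sklearn.svm import SVC

def train_sentiment_svm(train_features, train_labels, model_path="svm_model.pkl"):
    clf = SVC(kernel="rbf", C=1.0)
    clf.fit(train_features, train_labels)           # labels: 1 = positive, 2 = negative
    with open(model_path, "wb") as f:
        pickle.dump(clf, f)                         # save the emotion classification model
    return clf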
[0070] Step 8: Input the text feature vectors of the test set from Step 6 into the SVM and classify the emotion categories using the model trained in Step 7. If the actually output label of a text equals 1, the text is determined to express positive emotion; if the actually output label is not equal to 1 (that is, the label equals 2), the text is determined to express negative emotion. The number of texts whose actual output label differs from the expected output label is counted, and the accuracy of the emotion classification is calculated.
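A sketch of the evaluation in Step 8, counting the texts whose predicted label differs from the expected label and computing the accuracy:

def evaluate(clf, test_features, test_labels):
    predicted = clf.predict(test_features)
    errors = sum(1 for p, y in zip(predicted, test_labels) if p != y)
    accuracy = 1 - errors / len(test_labels)
    return accuracy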
[0071] The above embodiments should be understood as merely illustrating the present invention and not limiting its scope of protection. After reading the content of the present invention, those skilled in the art may make various changes or modifications to the present invention, and such equivalent changes and modifications likewise fall within the scope defined by the claims of the present invention.