[0054] The present invention provides a method and system for filtering spam based on deep learning. To make the objectives, technical solutions and effects of the present invention clearer, the present invention is further described in detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are only used to explain the present invention, not to limit it.
[0055] The present invention provides a method for filtering spam based on deep learning. Through the self-learning ability of the deep belief network, combined with the advantages of big data, the large number of samples available on the network is used for learning to improve classification ability. On the one hand, this improves the identification of spam; on the other hand, the deep belief network is a semi-supervised learning model that can be trained with a large-scale unlabeled sample set, and compared with traditional supervised learning models it saves the time and manpower required to label a large number of samples.
[0056] As shown in FIG. 1, the method for filtering spam based on deep learning includes:
[0057] S100: Process the email samples to generate a first vector space model, and construct a deep belief network;
[0058] In the embodiment of the present invention, the mail samples are preferably a training mail set, which refers to a set composed of a large number of mails of known categories and may also be referred to as a training set for short. By training on the mail samples, the characteristics of each mail category can be summarized.
[0059] The concept of deep learning originates from the research of artificial neural networks, and a multilayer perceptron with multiple hidden layers is a deep learning structure. Deep learning forms a more abstract high-level representation attribute category or feature by combining low-level features to discover distributed feature representations of data.
[0060] The vector space model (VSM) simplifies the processing of text content into vector operations in a vector space and expresses semantic similarity as spatial similarity, which is intuitive and easy to understand. When documents are represented as vectors in a document space, the similarity between documents can be measured by calculating the similarity between their vectors.
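As an illustration of measuring document similarity in a vector space (a minimal sketch, not part of the claimed method; the vectors and function name are hypothetical), the cosine similarity between two document vectors can be computed as:

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two document vectors in the same space."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = math.sqrt(sum(a * a for a in d1))
    norm2 = math.sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

# Two toy 3-dimensional document vectors; parallel vectors give similarity 1.0
sim = cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0])
```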
[0061] In the field of information filtering and retrieval, vector space models are often used to represent text in order to facilitate calculation. Such a model first selects representative feature items from the text.
[0062] A deep belief network (DBN) is a dual model that can be used as either a generative model or a discriminative model by training the weights of its neurons, so that the entire neural network generates training data according to the maximum probability. It can be used to identify features, classify data, and even generate data.
[0063] A DBN is composed of multiple layers of neurons, which are divided into visible neurons (referred to as visible units) and hidden neurons (referred to as hidden units, also called feature detectors); the visible units receive input, and the hidden units extract features. The connection between the top two layers is undirected and forms an associative memory, while the connections between the remaining lower layers are directed. The bottom layer represents the data vector, with each neuron representing one dimension of the data vector.
[0064] In the embodiment of the present invention, a deep belief network composed of feedforward neural networks with a deep architecture is preferably used as the network model for training the mail classifier, as it can approximate complex functions with fewer parameters.
[0065] S200: Process the test mail to generate a second vector space model;
[0066] Processing the test mail into the form of a vector space model means representing a text or mail as an n-dimensional vector. Since natural text cannot be directly processed by the constructed classification algorithm, the text must first undergo certain processing to convert it into a form the classifier can recognize. Assume the values of the n feature items of a document are w1, w2, ..., wn. Since they come from the same mail to be filtered, they are considered as a whole, and these feature items form a feature vector d; that is, each text is regarded as a vector in an n-dimensional space, represented as d(w1, w2, ..., wn), where wi is the weight of the i-th feature item and n is the number of feature items. A feature item can be a character, word, phrase or some concept, preferably a word, in order to achieve higher classification accuracy. In this way, text representation is reduced to first segmenting the text into words and then using those words as the dimensions of the vector that represents the text.
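The representation d(w1, w2, ..., wn) described above can be sketched as follows (assuming term-frequency weights for illustration; the vocabulary of feature items shown is hypothetical):

```python
def to_vector(tokens, vocabulary):
    """Represent a tokenized document as d(w1, ..., wn), here with
    term-frequency weights: wi = count of feature item i in the document."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return [counts.get(term, 0) for term in vocabulary]

# Hypothetical feature items (dimensions of the vector space)
vocab = ["free", "offer", "meeting", "report"]
vec = to_vector(["free", "free", "offer", "report"], vocab)
# vec == [2, 1, 0, 1]
```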
[0067] In the embodiment of the present invention, a document refers to a mail or a fragment in the mail, such as a paragraph, a sentence group, or a sentence.
[0068] Weight is a relative concept defined with respect to a particular indicator. The weight of an indicator refers to the relative importance of that indicator in the overall evaluation. Weighting distinguishes among a number of evaluation indicators, and the weights corresponding to a group of evaluation indicators together form a weight system.
[0069] S300: Use the constructed deep belief network to detect the second vector space model;
[0070] Using the constructed deep belief network to detect the second vector space model refers to using the trained deep belief network to process the mail to be filtered, classify it, and check whether it is spam or normal mail. That is, this step can also be expressed as: the constructed deep belief network is used to classify the mail to be filtered, which is expressed as the second vector space model, where the categories include spam and normal mail.
[0071] S400: Output the detection result.
[0072] Outputting the detection result refers to outputting whether the filtered mail is spam, or which category in the training mail set it belongs to, so that the mail recipient or the system knows the mail category. Other processing procedures can be added afterwards, for example adding the category or the source address of the email to a blacklist, graylist or whitelist after confirmation by the email recipient.
[0073] The method for filtering spam based on deep learning provided by the present invention constructs a deep belief network and uses it to detect test mails, which improves the accuracy and stability of identifying spam and saves the time and labor required to label a large number of samples.
[0074] Further, in the method for filtering spam based on deep learning, the S100 specifically includes:
[0075] S110: Training mail sample;
[0076] S120: Preprocess the trained mail sample, determine the characteristics of the spam and construct a feature set;
[0077] There are two types of vector space models: Boolean and numerical. In the numerical vector space model, the weight of a feature item is calculated using term frequency (TF, the number of times the feature word appears in the text), TF-IDF (term frequency-inverse document frequency) or similar methods; the latter combines TF and DF (document frequency).
[0078] When text is represented by the vector space model, the dimension of the vector space is determined by the number of words in the text set, so the dimension is quite large. However, much of the information in the text is highly redundant, so dimensionality reduction and feature extraction are required. The specific steps are: preprocess the text, removing stop words and words that appear too frequently in the text; adopt a specific feature selection method to select feature items from the words; and, optionally, add other features as needed to improve the classification effect.
[0079] The Boolean vector space model is a simple text representation model. A feature item has only two possible states in the text, 0 or 1: 0 means the feature item does not appear in the text, and 1 means the text contains the feature item. The Boolean vector space model thus expresses the text as a sequence of 0s and 1s. The advantages of this model are its relatively simple design and high classification efficiency.
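A minimal sketch of the Boolean vector space model described above (the vocabulary shown is a hypothetical feature set):

```python
def boolean_vector(tokens, vocabulary):
    """Boolean VSM: dimension i is 1 if feature item i occurs in the
    text at least once, and 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

# Hypothetical feature items; repeated occurrences still map to 1
vocab = ["cheap", "viagra", "schedule"]
v = boolean_vector(["cheap", "cheap", "schedule"], vocab)
# v == [1, 0, 1]
```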
[0080] S130: Generate a first vector space model according to the constructed feature set;
[0081] The process of generating the first vector space model is a process of vectorizing all the features in the feature set and storing them according to the vector space mode.
[0082] S140: Construct a deep belief network according to the generated first vector space model.
[0083] Further, in the method for filtering spam based on deep learning, the S120 specifically includes:
[0084] S121: Perform word segmentation on the email sample after training;
[0085] Chinese word segmentation methods can be divided into three categories: word segmentation methods based on dictionary string matching, word segmentation methods based on understanding, and word segmentation methods based on statistics.
[0086] The dictionary-based string matching word segmentation method, also called the mechanical word segmentation method, matches the Chinese character string to be analyzed against entries in a sufficiently large machine dictionary according to a certain strategy; if a string is found in the dictionary, the match is successful. According to the scanning direction, string matching word segmentation methods can be divided into forward matching and reverse matching; according to whether longer or shorter matches take priority, they can be divided into maximum matching and minimum matching. Two commonly used word segmentation methods are as follows:
[0087] (1) The forward maximum matching method. The purpose of forward maximum matching is to separate out the longest compound words. Its basic idea is: assuming that the longest entry in the word segmentation dictionary contains n Chinese characters, the first n characters of the current string of the document being processed are used as the matching field to search the dictionary. If such a word exists in the dictionary, the match succeeds and the matching field is segmented out as a word. If no such word is found in the dictionary, the match fails; the last character of the matching field is removed and the remaining string is matched again, and so on, until either a match succeeds and a word is split out, or the length of the remaining string is zero. This completes one round of matching; the next n-character string is then taken for matching, until the whole document has been scanned.
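The forward maximum matching procedure above can be sketched as follows (a simplified illustration; the toy dictionary is hypothetical, and a single character that is not in the dictionary is emitted as-is):

```python
def forward_max_match(text, dictionary, max_len):
    """Forward maximum matching: at each position, try the longest
    candidate first and shorten it until a dictionary entry matches
    (falling back to a single character)."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

# Toy dictionary with max entry length 3 (hypothetical entries)
segs = forward_max_match("abcde", {"abc", "de", "ab"}, 3)
# segs == ["abc", "de"]  -- the longer "abc" wins over "ab"
```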
[0088] (2) Reverse maximum matching method. The basic principle of the reverse maximum matching method is the same as the forward maximum matching method. The difference is that the direction of word segmentation is opposite to the forward maximum matching method, and the word segmentation dictionary used is also different. In the actual processing, the documents are first processed in reverse order to generate reverse documents. Then, according to the reverse order dictionary, the forward maximum matching method can be used to process the reverse order document.
[0089] The understanding-based word segmentation method achieves word recognition by having the computer simulate a human's understanding of the sentence. The basic idea is to perform syntactic and semantic analysis while segmenting words, and to use syntactic and semantic information to resolve ambiguity. It usually includes three parts: a word segmentation subsystem, a syntax-semantics subsystem, and a general control part. Under the coordination of the general control part, the word segmentation subsystem can obtain syntactic and semantic information about words, sentences, etc. to judge segmentation ambiguity; that is, it simulates the process by which humans understand sentences.
[0090] The statistics-based word segmentation method: from a formal point of view, a word is a stable combination of characters, so in context, the number of times adjacent characters co-occur reflects how likely they are to form a word. The frequency or probability of characters co-occurring is counted: the higher the frequency, the more likely they form a word. Therefore, using word frequency statistics to assist word segmentation produces a certain effect. This method only needs to count the frequency of character groups in the corpus and does not need a segmentation dictionary, so it is also called the dictionary-free word segmentation method or the statistical word extraction method.
[0091] S122: Construct a dictionary based on all the segmented entries;
[0092] While constructing the dictionary, the global factors of all entries can also be calculated and the calculated values placed in the dictionary, so that they can be called directly in subsequent processing.
[0093] S123: Count the word frequencies of the remaining entries after removing the stop words in the constructed dictionary.
[0094] Before or after processing natural language data (or text), certain characters or words are automatically filtered out; these are called stop words. In the present invention, they are preferably words that appear frequently in the text but contribute little to classifying it.
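Step S123 — counting the term frequencies of the entries that remain after stop words are removed — can be sketched as follows (the stop-word list shown is a hypothetical example):

```python
def term_frequencies(tokens, stop_words):
    """Count term frequencies of the tokens remaining after
    stop-word removal."""
    freqs = {}
    for t in tokens:
        if t in stop_words:
            continue  # stop words are filtered out, not counted
        freqs[t] = freqs.get(t, 0) + 1
    return freqs

stops = {"the", "a", "of"}  # hypothetical stop-word list
tf = term_frequencies(["the", "offer", "of", "offer", "day"], stops)
# tf == {"offer": 2, "day": 1}
```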
[0095] From S121 to S123, and from the above-mentioned specific steps of dimensionality reduction and feature extraction ("preprocess the text, removing stop words and words that appear too frequently in the text; adopt a specific feature selection method to select feature items; optionally add other features as needed"), it can be seen that the order of steps S122 and S123 can be exchanged.
[0096] Further, in the method for filtering spam based on deep learning, the S130 specifically includes:
[0097] S131: Vectorize all the features in the constructed feature set, and store them according to the mode of the vector space;
[0098] Vectorizing all the features in the constructed feature set means converting each feature into a feature vector.
[0099] S132: Normalize the generated feature vector.
[0100] Normalization is a way of simplifying calculation: a dimensional expression is transformed into a dimensionless expression, becoming a scalar.
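A minimal sketch of normalizing a feature vector, here using L2 (unit-length) normalization as one concrete choice (the source does not specify which norm is used, so this is an assumption for illustration):

```python
import math

def l2_normalize(vec):
    """Scale a feature vector to unit length (L2 norm); a zero vector
    is returned unchanged to avoid division by zero."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        return list(vec)
    return [x / norm for x in vec]

v = l2_normalize([3.0, 4.0])
# v is approximately [0.6, 0.8], a unit-length vector
```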
[0101] After S132, the method may also include a step of assigning different weights to the obtained feature vectors. The original feature weight is chosen as the TF-IDF of the word in the preprocessed text, which can directly use the global factor stored in the dictionary; the calculation is shown in formula (1):
[0102] TF-IDF = (TF / Ni) * lg(N / DF)    (1);
[0103] where Ni is the total number of words in the email; TF is the term frequency of the given word in the document; IDF is the inverse document frequency, a measure of the importance of the word; N is the total number of documents; and DF is the number of documents containing the word.
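Formula (1) can be computed directly, taking lg as the base-10 logarithm (the function and argument names are illustrative, not from the source):

```python
import math

def tf_idf(tf, n_words_in_mail, n_docs, df):
    """Formula (1): TF-IDF = (TF / Ni) * lg(N / DF), with lg = log10.
    tf: occurrences of the word in this mail; n_words_in_mail: Ni;
    n_docs: N, total documents; df: documents containing the word."""
    return (tf / n_words_in_mail) * math.log10(n_docs / df)

# A word occurring 5 times in a 100-word mail, present in 10 of 1000 docs:
w = tf_idf(tf=5, n_words_in_mail=100, n_docs=1000, df=10)
# (5/100) * lg(100) = 0.05 * 2 = 0.1
```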
[0104] Further, in the method for filtering spam based on deep learning, the S140 includes:
[0105] S141: Fully train the Nth RBM to obtain the weights of that RBM;
[0106] A restricted Boltzmann machine (RBM) is a randomly generated neural network that can learn a probability distribution from an input data set; it is the building block of the DBN, and each RBM can also be used alone as a clusterer. An RBM is divided into a visible layer and a hidden layer. The visible layer is composed of visible units and is used to input training data; the hidden layer is composed of hidden units and serves as a feature detector. The visible units within the visible layer are independent of each other and are only connected to the hidden units in the hidden layer; similarly, the hidden units within the hidden layer are independent of each other and are only connected to the visible units in the visible layer.
[0107] An RBM is defined mainly by an energy function, as shown in formula (2):
[0108] E(v, h | θ) = -b^T v - c^T h - h^T W v    (2);

where v is the visible-layer vector, h is the hidden-layer vector, b and c are the biases of the visible and hidden layers, W is the weight matrix between the two layers, and θ = {W, b, c}.
[0109] From formula (2), it can be concluded that the hidden-layer vector and the visible-layer vector of the RBM satisfy the probability distributions shown in formulas (3) and (4):
[0110] P(v_i = 1 | h) = σ(b_i + Σ_j w_ji h_j)    (3);
[0111] P(h_j = 1 | v) = σ(c_j + Σ_i w_ji v_i)    (4);
[0112] The parameter update formulas obtained using the log-likelihood function are formulas (5), (6) and (7):
[0113] ΔW_ji = η(<v_i h_j>_data - <v_i h_j>_recon)    (5);
[0114] Δb_i = η(<v_i>_data - <v_i>_recon)    (6);
[0115] Δc_j = η(<h_j>_data - <h_j>_recon)    (7);

where η is the learning rate, <·>_data denotes the expectation under the training data, and <·>_recon denotes the expectation under the reconstruction (confabulation) produced by the model.
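Updates (5)-(7) are commonly realized with one step of contrastive divergence (CD-1). The following NumPy sketch (an illustrative assumption about the training procedure; all names are hypothetical) samples the hidden units from the data per formula (4), reconstructs the visible layer per formula (3), and applies the updates:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.1):
    """One CD-1 step following formulas (3)-(7): W has shape
    (n_hidden, n_visible), b is the visible bias, c the hidden bias."""
    ph0 = sigmoid(c + v0 @ W.T)          # P(h_j = 1 | v), formula (4)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)  # sample hidden states
    pv1 = sigmoid(b + h0 @ W)            # P(v_i = 1 | h), formula (3)
    ph1 = sigmoid(c + pv1 @ W.T)         # hidden probabilities on the recon
    W += lr * (np.outer(h0, v0) - np.outer(ph1, pv1))  # formula (5)
    b += lr * (v0 - pv1)                                # formula (6)
    c += lr * (ph0 - ph1)                               # formula (7)
    return W, b, c

# Toy RBM with 4 visible and 2 hidden units on one binary visible vector
W = rng.normal(0.0, 0.01, size=(2, 4))
b = np.zeros(4)
c = np.zeros(2)
W, b, c = cd1_update(np.array([1.0, 0.0, 1.0, 0.0]), W, b, c)
```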
[0116] In the DBN training process, a greedy method can be used to train the RBMs layer by layer. That is, step S140 specifically is: first fully train the first RBM; fix the weights and biases of the first RBM and use the states of its hidden neurons as the input vector of the second RBM; after fully training the second RBM, stack it on top of the first RBM; and repeat the above steps until all RBMs have been trained.
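The greedy layer-by-layer procedure described above can be sketched as follows (the trainer shown is a placeholder standing in for a fully trained RBM; all names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def greedy_stack(data, layer_sizes, train_rbm):
    """Greedy layer-wise pre-training: fully train one RBM, freeze its
    parameters, and feed its hidden-unit activations to the next RBM."""
    inputs = data
    weights = []
    for n_hidden in layer_sizes:
        W, b, c = train_rbm(inputs, n_hidden)  # fully train this RBM
        weights.append((W, b, c))              # fix weights and biases
        inputs = sigmoid(c + inputs @ W.T)     # hidden states -> next input
    return weights

# Placeholder trainer (assumption: any trainer with this signature works,
# e.g. a CD-1 loop); here it just returns zero-initialized parameters.
def dummy_train(X, n_hidden):
    n_visible = X.shape[1]
    return np.zeros((n_hidden, n_visible)), np.zeros(n_visible), np.zeros(n_hidden)

stack = greedy_stack(np.ones((5, 8)), [6, 4], dummy_train)
# Two stacked RBMs: 8 -> 6 visible-to-hidden, then 6 -> 4
```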
[0117] S142: Fix the weights and biases of the Nth RBM, and use the states of its hidden neurons as the input vector of the next RBM;
[0118] S143: Train the next RBM, until all RBMs have been trained.
[0119] After this step, the method can also include a step of tuning the entire network using the error back-propagation process of traditional neural networks. This step can eliminate the errors accumulated by training the RBMs layer by layer with the greedy method.
[0120] Mail filtering is a two-class problem. When neural networks are used for such problems, the top-layer neurons generally correspond to the categories. Therefore, to achieve spam filtering, the output layer of the final BP network can be set to contain two neurons, and the number of neurons in the input layer equals the size of the vocabulary obtained after preprocessing. In the embodiment of the present invention, since RBMs generally operate on binary input data, the RBM preferably uses binary vectors.
[0121] The specific training process of the DBN is to first obtain the weights of the generative model through unsupervised greedy layer-by-layer pre-training. In this phase, a vector v is produced at the visible layer and its values are passed to the hidden layer; in turn, the visible-layer input is randomly selected in an attempt to reconstruct the original input signal; finally, these new visible-unit activations are passed forward to reconstruct the hidden-unit activations. During training, the visible vector is first mapped to the hidden units; the visible units are then reconstructed from the hidden units; these new visible units are mapped to the hidden units again, yielding new hidden units. This significantly reduces training time, because only a single step is needed to approach maximum-likelihood learning. Each layer added to the network improves the log-probability of the training data.
[0122] After pre-training, the DBN can adjust its discriminative performance using labeled data and the BP algorithm. A label set is attached to the top layer, and a classification surface of the network is obtained through bottom-up, learned recognition weights. This performance is better than that of a network trained by the BP algorithm alone.
[0123] Specifically, the first layer is trained with uncalibrated data, and the parameters of the first layer are learned during training. This layer can be regarded as the hidden layer of a three-layer neural network that minimizes the difference between output and input. Because of the limited model capacity and the sparsity constraint, the obtained model learns the structure of the data itself, thereby obtaining features that are more expressive than the raw input. After the (n-1)-th layer has been learned, the output of the (n-1)-th layer is taken as the input of the n-th layer, the n-th layer is trained, and the parameters of each layer are obtained in turn.
[0124] The parameters of the entire multi-layer model are then further adjusted based on the per-layer parameters obtained in the first step; this step is a supervised training process. The first step resembles the random initialization of a neural network, except that in deep learning the initial values are not random but are obtained by learning the structure of the input data, so they are closer to the global optimum and better results can be achieved. After the trained deep belief network is obtained, the vector space generated from a test sample can be used as input to obtain the mail category.
[0125] As shown in FIG. 2, the spam filtering system based on deep learning includes:
[0126] The training module 100 is used to process email samples to generate a first vector space model and construct a deep belief network, as described in detail above;
[0127] The test module 200 is configured to process the test mail to generate the second vector space model, which is specifically described above;
[0128] The detection module 300 is used to detect the second vector space model by using the constructed deep belief network, which is specifically described above;
[0129] The output module 400 is used to output the detection result, which is specifically described above.
[0130] Further, in the spam filtering system based on deep learning, the training module 100 specifically includes:
[0131] The training sub-module is used to train email samples, as described above;
[0132] The preprocessing sub-module is used to preprocess the e-mail samples after training, determine the characteristics of spam e-mails and construct a feature set, as described above;
[0133] The model construction sub-module is used to generate the first vector space model according to the constructed feature set, which is specifically described above;
[0134] The DBN construction sub-module is used to construct a deep confidence network according to the generated first vector space model, as described in detail above.
[0135] Further, in the spam filtering system based on deep learning, the preprocessing sub-module specifically includes:
[0136] The word segmentation unit is used to segment the email samples after training, as described above;
[0137] The calculation unit is used to calculate the global factors corresponding to all the segmented entries, as described above;
[0138] The dictionary construction unit is used to construct the dictionary according to all the segmented entries and the calculated global factors, as described above;
[0139] The term frequency statistics unit is used to count the term frequencies of the remaining terms after the stop words are removed from the constructed dictionary, as described in detail above.
[0140] Further, in the spam filtering system based on deep learning, the model construction sub-module specifically includes:
[0141] The feature processing unit is used to vectorize all the features in the constructed feature set and store them according to the mode of the vector space, as described in detail above;
[0142] The normalization processing unit is used to normalize the generated feature vector, which is specifically described above.
[0143] Further, in the spam filtering system based on deep learning, the DBN construction sub-module specifically includes:
[0144] The training unit is used to fully train the Nth RBM to obtain the weights of that RBM, as described above; the RBM processing unit is used to fix the weights and biases of the Nth RBM and use the states of its hidden neurons as the input vector of the next RBM, as described above.
[0145] It should be understood that the application of the present invention is not limited to the above examples. Those of ordinary skill in the art can make improvements or transformations based on the above description, such as changing the processing order of the vector space model feature items; all such improvements and transformations shall fall within the protection scope of the appended claims of the present invention.