Network information risk identification method and system
A network information technology, applied in the Internet field, that addresses the inability of existing approaches to identify the latest risk information and their low update efficiency, thereby improving network information risk identification and expanding its coverage
Pending Publication Date: 2019-10-22
INDUSTRIAL AND COMMERCIAL BANK OF CHINA
AI-Extracted Technical Summary
Problems solved by technology
[0005] An embodiment of the present invention provides a network information risk identification method. It addresses the technical problem that existing methods rely on an expert-maintained feature library or a manually written phrase rule library, whose low update efficiency prevents them from identifying the latest risk information. The method includes: acquiring network information data, wherein the network information data includes structured data and unstructured data; normalizing the network information data and storing the normalization result in a corpus feature library, wherein the normalization result includes the normalization result corresponding to the structured data and the term vector sequence corresponding to the unstructured data; inputting the term vector sequence corresponding to the unstructured data into a pre-trained document vector sequence generation model, and outputting the document vector sequence corresponding to the unstructured data; and inputting the normalization result corresponding to the structured data and the document vector sequence corresponding to the unstructured data into a pre-trained risk prediction model, and outputting the risk prediction result of the network information data, wherein the risk prediction result is also used to update the corpus feature library.
[0006] An embodiment of the present invention also provides a network information risk identification system, which addresses the same technical problem: existing methods rely on an expert-maintained feature library or a manually written phrase rule library, whose low update efficiency prevents them from identifying the latest risk information. The system includes: a data collection and processing unit for collecting network information data, wherein the network information data includes: structured data and unstructured ...
Method used
To display the risk prediction results output by the risk model training unit 104, the embodiment of the present invention may also include a prediction result display unit 107 connected to the risk model training unit 104. It uses visualization techniques to present the risk prediction results intuitively and pushes them to the relevant business personnel. It can also present the results as multi-dimensional charts according to the requirements of the actual scenario and push them according to business needs, enabling timely early warning of risk events.
In summary, in the embodiment of the present invention, after the network information data is obtained, it is normalized, and the normalization result of the structured data and the term vector sequence of the unstructured data are stored in the corpus feature library; then, based on the pre-trained document vector sequence generation model, th...
Abstract
The invention discloses a network information risk identification method and a network information risk identification system. The method comprises: obtaining network information data, which comprises structured data and unstructured data; normalizing the network information data and storing the normalization results in a corpus feature library, the normalization results comprising the normalization results corresponding to the structured data and the term vector sequences corresponding to the unstructured data; inputting the term vector sequence corresponding to the unstructured data into a pre-trained document vector sequence generation model, and outputting the document vector sequence corresponding to the unstructured data; and inputting the normalization result corresponding to the structured data and the document vector sequence corresponding to the unstructured data into a pre-trained risk prediction model, and outputting a risk prediction result of the network information data, the risk prediction result being further used to update the corpus feature library. The invention thereby identifies enterprise risk quickly and accurately.
Examples
Example Embodiment
[0018] In order to make the purposes, technical solutions and advantages of the embodiments of the present invention more clearly understood, the embodiments of the present invention will be further described in detail below with reference to the accompanying drawings. Here, the exemplary embodiments of the present invention and their descriptions are used to explain the present invention, but not to limit the present invention.
[0019] An embodiment of the present invention provides a network information risk identification system. Figure 1 is a schematic diagram of this system. As shown in Figure 1, the system includes: a data acquisition and processing unit 101, a data normalization processing unit 102, a document vector sequence generation unit 103 and a risk model training unit 104.
[0020] The data acquisition and processing unit 101 is used to collect network information data, wherein the network information data includes structured data and unstructured data. The data normalization processing unit 102 is connected to the data acquisition and processing unit 101 and is used to normalize the network information data and store the normalization result in the corpus feature library 12, wherein the normalization result includes the normalization result corresponding to the structured data and the term vector sequence corresponding to the unstructured data. The document vector sequence generation unit 103 is connected to the corpus feature library 12 and is used to process the term vector sequence corresponding to the unstructured data, based on the pre-trained document vector sequence generation model, to generate the document vector sequence corresponding to the unstructured data. The risk model training unit 104 is connected to the document vector sequence generation unit 103 and to the corpus feature library 12; it receives the document vector sequence output by the document vector sequence generation unit 103, looks up the normalization result corresponding to the structured data in the corpus feature library 12, and, based on the pre-trained risk prediction model, processes the normalization result and the document vector sequence to obtain the risk prediction result of the network information data. The risk prediction result is also used to update the corpus feature library 12.
[0021] As shown in Figure 1, the data acquisition and processing unit 101 is responsible for cleaning existing documents, extracting the terms with high TF-IDF values, and using these terms as keywords to crawl network information data from major portal websites. The crawled data includes unstructured data (the text of each piece of information) and structured data (attributes related to the information). All data, including the existing document information and the network information data crawled from the web, is stored in the basic document library 11. Cleaning refers to processing the existing documents by word segmentation, stop-word removal and similar steps. The information-related attributes include whether the information appears on a homepage, whether it comes from a portal website or a forum, its forwarding volume on each website, and how long the news has been developing (its "fermentation" duration). The basic document library 11 stores the lightly cleaned data, including the existing document information and the crawled network information data. The existing document information refers to the enterprise-related information accumulated by the business department. The network information data refers to the information text (unstructured data) and information-related attributes (structured data) crawled from major portal websites.
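The keyword extraction step described above can be illustrated with a minimal sketch. It assumes whitespace-separated text, a toy stop-word list and a standard TF-IDF weighting (term frequency times the log of inverse document frequency); the patent does not specify the tokenizer, the stop-word list or the exact weighting, and the names `clean` and `top_tfidf_terms` are hypothetical. Chinese text would first need a word segmentation tool.

```python
import math
from collections import Counter

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}


def clean(text):
    """Segment text on whitespace, lowercase it, and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def top_tfidf_terms(docs, k=3):
    """Return the k highest-TF-IDF terms of each document."""
    tokenized = [clean(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
        # Sort by descending score; break ties alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        results.append(ranked[:k])
    return results


# Toy "existing documents"; the top terms would serve as crawl keywords.
docs = [
    "the bank reported a loan default and a default warning",
    "the bank opened a new branch in the city",
    "the city council approved a new budget",
]
keywords = top_tfidf_terms(docs, k=2)
```

Terms that occur in most documents (such as "bank" here) receive a low weight, so the selected keywords tend to be the terms that distinguish one document from the others.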
[0022] The data normalization processing unit 102 is connected to the data acquisition and processing unit 101 and is used to normalize the network information data and store the normalization result in the corpus feature library 12, wherein the normalization result includes the normalization result corresponding to the structured data and the term vector sequence corresponding to the unstructured data. In an embodiment, this connection is indirect: the data normalization processing unit 102 is connected to the basic document library 11, and the basic document library 11 is connected to the data acquisition and processing unit 101. The normalization performed by the data normalization processing unit 102 may include: processing the unstructured data (the text of each piece of information) stored in the basic document library 11 with a tool such as Word2Vec to obtain the high TF-IDF term vector sequence corresponding to the unstructured data, and storing it in the corpus feature library 12; and normalizing the structured data (the information-related attributes) and storing the corresponding normalization result in the corpus feature library 12. The corpus feature library 12 thus stores the high TF-IDF term vector sequences and the normalization results corresponding to the structured data. A high TF-IDF term vector sequence is composed of the vectors of the high TF-IDF terms in a piece of information text and represents the meaning of that text.
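The normalization of the structured attributes can be sketched as follows. The patent does not state which normalization is used; this example assumes min-max scaling for the numeric attributes and a 0/1 encoding for the two categorical ones. The record fields and the function names are hypothetical.

```python
def min_max_normalize(values):
    """Scale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values equal: no spread to scale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical information-related attributes of three crawled items.
records = [
    {"on_homepage": True, "source": "portal", "forwards": 1200, "hours": 48},
    {"on_homepage": False, "source": "forum", "forwards": 200, "hours": 8},
    {"on_homepage": False, "source": "portal", "forwards": 700, "hours": 28},
]


def normalize_records(records):
    """Turn raw attributes into normalized features x, y, z, m."""
    forwards = min_max_normalize([r["forwards"] for r in records])
    hours = min_max_normalize([r["hours"] for r in records])
    return [
        {
            "x": 1.0 if r["on_homepage"] else 0.0,       # importance
            "y": 1.0 if r["source"] == "portal" else 0.0,  # source (assumed encoding)
            "z": f,                                        # forwarding volume
            "m": h,                                        # fermentation duration
        }
        for r, f, h in zip(records, forwards, hours)
    ]


normalized = normalize_records(records)
```

In this sketch the normalized records would be what is stored in the corpus feature library alongside the term vector sequences.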
[0023] The document vector sequence generation unit 103 is connected to the corpus feature library 12 and is used to process the term vector sequence corresponding to the unstructured data, based on the pre-trained document vector sequence generation model, to generate the document vector sequence corresponding to the unstructured data. In an embodiment, the document vector sequence generation model of the document vector sequence generation unit 103 needs to be trained before this processing. The training method may include: obtaining first training sample data, which includes multiple groups of term vector sequences and positive/negative term labels; and obtaining the document vector sequence generation model from the first training sample data through machine learning training. The term vector sequences and positive/negative term labels are obtained from the connected corpus feature library 12: the term vector sequences are the result of the data normalization processing unit 102 normalizing the network information data, and the positive/negative term labels are the result of the information labeling unit 106 labeling the terms.
After the document vector sequence generation model is obtained through machine learning training on the first training sample data, the method may further include: obtaining first verification sample data, which includes multiple groups of term vector sequences and positive/negative term labels; inputting the term vector sequences in the first verification sample data into the document vector sequence generation model and outputting the corresponding positive/negative term labels; comparing the positive/negative term labels in the first verification sample data with those output by the document vector sequence generation model; and verifying the document vector sequence generation model according to the comparison result. In a specific implementation, the machine learning training may be the training of a neural network model. In an embodiment, the document vector sequence generation unit 103 obtains the document vector sequence 13 through neural network model training: the input layer of the neural network model is the high TF-IDF term vector sequence from the corpus feature library 12, the output layer is the positive/negative labels of the term vector sequence marked by the information labeling unit 106, and the hidden layer vector sequence of the neural network model is extracted as the document vector sequence 13 and passed to the risk model training unit 104 as input data for enterprise information risk prediction.
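The idea of training a network on term vectors and labels and then reading out the hidden layer as the document vector can be sketched as follows. This is only an illustration under stated assumptions, not the patent's model: each term vector sequence is averaged into a fixed-length input, the network has a single tanh hidden layer of size 4 with a sigmoid output, and it is trained by plain stochastic gradient descent on toy data. The class `TinyNet` and the helper `mean_vector` are hypothetical.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def mean_vector(term_vectors):
    """Average a sequence of term vectors into one fixed-length input."""
    n = len(term_vectors)
    return [sum(col) / n for col in zip(*term_vectors)]


class TinyNet:
    """One-hidden-layer network: input -> tanh hidden -> sigmoid output."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def hidden(self, x):
        """Hidden-layer activations, used here as the document vector."""
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.w1, self.b1)]

    def predict(self, x):
        """Probability that the input carries a positive label."""
        h = self.hidden(x)
        return sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)

    def train_step(self, x, label, lr=0.1):
        """One stochastic gradient step on the cross-entropy loss."""
        h = self.hidden(x)
        p = sigmoid(sum(w * hi for w, hi in zip(self.w2, h)) + self.b2)
        d_out = p - label  # loss gradient at the output pre-activation
        for j, hj in enumerate(h):
            # Back-propagate through tanh before w2[j] is modified.
            d_hid = d_out * self.w2[j] * (1.0 - hj * hj)
            self.w2[j] -= lr * d_out * hj
            self.b1[j] -= lr * d_hid
            for i, xi in enumerate(x):
                self.w1[j][i] -= lr * d_hid * xi
        self.b2 -= lr * d_out


# Toy samples: (term vector sequence, label) with 1 = positive, 0 = negative.
samples = [
    ([[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]], 1),
    ([[0.1, 0.9, 0.2], [0.0, 0.8, 0.1]], 0),
    ([[0.7, 0.0, 0.1], [0.9, 0.1, 0.2]], 1),
    ([[0.2, 0.7, 0.0], [0.1, 0.9, 0.3]], 0),
]

net = TinyNet(n_in=3, n_hidden=4)
for _ in range(200):
    for seq, label in samples:
        net.train_step(mean_vector(seq), label)

# The hidden layer, not the output, is kept as the document representation.
doc_vector = net.hidden(mean_vector(samples[0][0]))
probability = net.predict(mean_vector(samples[0][0]))
```

The output layer is needed only to drive training; what is passed downstream in this sketch is `doc_vector`, which would play the role of the document vector sequence 13.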
[0024] The risk model training unit 104 is connected to the document vector sequence generation unit 103 and to the corpus feature library 12. It receives the document vector sequence output by the document vector sequence generation unit 103, looks up the normalization result corresponding to the structured data in the corpus feature library 12, and, based on the pre-trained risk prediction model, processes the normalization result corresponding to the structured data and the document vector sequence corresponding to the unstructured data to obtain the risk prediction result of the network information data, wherein the risk prediction result is also used to update the corpus feature library.
[0025] Before the normalization result corresponding to the structured data and the document vector sequence corresponding to the unstructured data are input into the pre-trained risk prediction model to output the risk prediction result of the network information data, the risk prediction model of the risk model training unit 104 needs to be trained. The training method may include: obtaining second training sample data, which includes multiple groups of document vector sequences and document risk classification labels; and obtaining the risk prediction model from the second training sample data through machine learning training. The document vector sequences are passed in by the document vector sequence generation unit 103, and the document risk classification labels are the result of the information labeling unit 106 labeling the document vector sequences. After the risk prediction model is obtained through machine learning training on the second training sample data, the method may further include: obtaining second verification sample data, which includes multiple groups of term vector sequences and positive/negative term labels; inputting the term vector sequences in the second verification sample data into the risk prediction model and outputting the corresponding positive/negative term labels; comparing the positive/negative term labels in the second verification sample data with those output by the risk prediction model; and verifying the risk prediction model according to the comparison result. In a specific implementation, the machine learning training may be the training of a neural network model.
In an embodiment, the risk model training unit 104 receives the document vector sequence 13 corresponding to the unstructured data and extracts the normalization result (normalized structured data) from the corpus feature library 12, splices (concatenates) the document vector sequence 13 with the structured data, uses the spliced sequence as the input layer of the neural network and the document information labels as the output layer, and trains the neural network model to predict the information risk classification. An empirical formula is then established over the risk prediction result of the network information data and the normalization results corresponding to the structured data obtained from the corpus feature library 12; it is used to fit all the data and thereby predict the enterprise risk classification. The empirical formula is:
[0026] I=A*xyzmn+B
[0027] Here, x is the normalized importance (whether the information appears on the homepage); y is the normalized source (portal website or forum); z is the normalized website forwarding volume; m is the normalized news fermentation duration; n is the information risk classification; and A and B are undetermined coefficients, determined by the specific business scenario.
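A short sketch of how the empirical formula might be evaluated follows. It assumes that "xyzmn" denotes the product of the five quantities, which the text does not state explicitly, and it uses arbitrary example values for A and B, which the patent leaves to the business scenario. The function name `risk_index` is hypothetical.

```python
def risk_index(x, y, z, m, n, a=1.0, b=0.0):
    """Empirical index I = A * (x * y * z * m * n) + B.

    Assumption: "xyzmn" in the formula is read as a product.
    """
    return a * (x * y * z * m * n) + b


# Example: homepage item from a portal, mid-range forwarding volume and
# duration, risk class encoded as 2.0; A = 10 and B = 1 are placeholders.
score = risk_index(x=1.0, y=1.0, z=0.5, m=0.5, n=2.0, a=10.0, b=1.0)
```

Under this product reading, any factor equal to zero (for example an item that does not appear on a homepage, if that is encoded as 0) reduces the index to B, so the encoding of each normalized attribute would matter in practice.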
[0028] To update the corpus feature library in real time, the network information risk identification system provided by the embodiment of the present invention may also include: a corpus feature library update unit 105, connected to the risk model training unit 104, which updates the corpus feature library according to the term vector sequence corresponding to the unstructured data in the network information data when the prediction probability of the network information data is greater than or equal to a threshold; and an information labeling unit 106, connected to the corpus feature library update unit 105, which labels the term vector sequence corresponding to the unstructured data in the network information data when the prediction probability is lower than the threshold, the corpus feature library then being updated according to the labeling result.
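The threshold rule above can be sketched as a simple routing function. The threshold value 0.8, the record layout and the function name `route_by_probability` are assumptions for illustration; the patent leaves the threshold to be set empirically.

```python
def route_by_probability(items, threshold=0.8):
    """Split predictions into auto-accepted items and items to re-label.

    Items at or above the threshold go straight into the corpus feature
    library; the rest are handed to manual labeling first.
    """
    accepted, to_relabel = [], []
    for item in items:
        if item["probability"] >= threshold:
            accepted.append(item)
        else:
            to_relabel.append(item)
    return accepted, to_relabel


# Hypothetical prediction results from the risk prediction model.
predictions = [
    {"id": "doc-1", "probability": 0.93},
    {"id": "doc-2", "probability": 0.80},
    {"id": "doc-3", "probability": 0.41},
]
accepted, to_relabel = route_by_probability(predictions, threshold=0.8)
```

Only low-confidence predictions reach a human labeler in this arrangement, which is how the library can keep growing without every item being reviewed manually.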
[0029] To implement the machine learning training functions described above, Figure 3 shows the structure of a neural network model of the network information risk identification system provided in an embodiment of the present invention; the model includes an input layer, a hidden layer (or intermediate layer) and an output layer. To implement the machine learning training functions of the document vector sequence generation unit 103 and the risk model training unit 104, Figure 2 shows a schematic diagram of the neural network model. As shown in Figure 2, the neural network of the network information risk identification system may include a model input unit 201, a model training unit 202, a model prediction unit 203, a model verification unit 204 and a model correction unit 205:
[0030] The model input unit 201 uses the vector sequence as the input layer of the model and the corresponding label as the output layer of the model. The vector sequence is the term vector sequence in the document vector sequence generation unit 103 and the document vector sequence in the risk model training unit 104; the label is the positive/negative term label in the document vector sequence generation unit 103 and the risk classification label in the risk model training unit 104;
[0031] The model training unit 202: in the document vector sequence generation unit 103, it solves backward for the hidden layer matrix sequence of the model from the input layer and output layer data, reduces the error below a set value, and saves the trained neural network model and the hidden layer matrix sequence; in the risk model training unit 104, it solves for the risk classification label data of the output layer from the input layer and hidden layer data, reduces the error below the set value, and saves the trained neural network model and the risk classification label data of the output layer;
[0032] The model prediction unit 203 inputs the vector sequences of the verification sample data into the trained neural network model to obtain predicted label data; in the document vector sequence generation unit 103 the predicted label data is the positive/negative term labels, and in the risk model training unit 104 it is the risk classification labels;
[0033] The model verification unit 204 compares the predicted label data with the label data marked by the information labeling unit 106, and obtains the relationship between the predicted label classification probability and the classification accuracy;
[0034] The model correction unit 205 is used to update the neural network model. In the corpus feature library update unit 105, corpus items that meet the threshold condition (greater than or equal to the threshold) are incorporated directly into the corpus feature library 12, and those that do not are incorporated into the corpus feature library 12 after being re-labeled. When the data volume of the corpus feature library 12 has grown by a set percentage, the neural network model is retrained; in one example, the set percentage may be 10%.
[0035] In the above neural network model, the document vector sequence generation unit 103 solves backward for the hidden layer from the input layer and the output layer; its output is the hidden layer of the neural network model, and the hidden layer matrix sequence is used to represent the document. The risk model training unit 104 solves forward for the output layer from the input layer and the hidden layer; its output is the output layer of the neural network model, and the result of the output layer is used directly for risk classification.
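The retraining condition handled by the model correction unit 205 (retrain once the corpus feature library has grown by a set percentage, 10% in the example given) can be sketched as a small check. How the library size is measured and when the check runs are assumptions; the function name `needs_retraining` is hypothetical.

```python
def needs_retraining(size_at_last_training, current_size, growth=0.10):
    """Return True once the library has grown by at least `growth`.

    Sizes are item counts; `growth` is a fraction (0.10 means 10%).
    """
    if size_at_last_training <= 0:
        # No previous training run: train as soon as any data exists.
        return current_size > 0
    increase = (current_size - size_at_last_training) / size_at_last_training
    return increase >= growth


# 1,000 items at the last training run, 1,200 items now: 20% growth.
should_retrain = needs_retraining(1000, 1200)
```

A caller would record the library size after each training run and pass it back in as `size_at_last_training` on the next check.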
[0036] As shown in Figure 1, the corpus feature library update unit 105 is responsible for setting the grading threshold. It incorporates the term vector sequences corresponding to the unstructured data in the network information data whose prediction probability is greater than or equal to the threshold directly into the corpus feature library 12, and passes the data below the threshold to the information labeling unit 106, where it is re-labeled and then incorporated into the corpus feature library 12 and the document vector sequence 13. The grading threshold needs to be tuned empirically in a specific implementation: if the risk prediction model performs well, the threshold can be lowered; if it performs poorly, the threshold can be raised. The grading threshold is set by the technician according to the actual effect.
[0037] As shown in Figure 1, the information labeling unit 106 is connected to the corpus feature library update unit 105. It is responsible for labeling the unstructured data (high TF-IDF term vector sequences and information document information) in the corpus feature library 12, integrating the information, and storing it back in the corpus feature library 12. The high TF-IDF term vector sequences and the positive/negative term labels are passed to the document vector sequence generation unit 103 to generate the document vector sequence 13; the normalization results corresponding to the structured data, the document vector sequence 13 generated by the document vector sequence generation unit 103, and the risk classification labels are passed to the risk model training unit 104 for risk prediction. The risk classification label is a risk-level label for a high TF-IDF term vector sequence (information document information), such as no risk, low risk or high risk. The positive/negative term label is a polarity label for a high TF-IDF term extracted from the document, such as positive, negative or neutral. Information integration means organizing the information into the formats "high TF-IDF term vector sequence (information document information) - document label" and "term information - term label" and storing it again in the corpus feature library 12.
[0038] To display the risk prediction results output by the risk model training unit 104, the embodiment of the present invention may also include a prediction result display unit 107, which is connected to the risk model training unit 104, presents the risk prediction results output by the risk model training unit 104 using visualization techniques, and pushes them to the relevant business personnel. It can also present the results as multi-dimensional charts according to the requirements of the actual scenario and push them according to business needs, enabling timely early warning of risk events.
[0039] In another embodiment of the present invention, as shown in Figure 1, the data acquisition and processing unit 101 is connected to the basic document library 11 and stores the acquired network information data in it. The basic document library 11 is connected to the data normalization processing unit 102 and passes the lightly cleaned data to it. The data normalization processing unit 102 is connected to the corpus feature library 12. The corpus feature library 12 is connected to the document vector sequence generation unit 103 and the risk model training unit 104: the term vector sequence obtained by vectorizing the unstructured data is passed to the document vector sequence generation unit 103, and the normalized structured data is passed to the risk model training unit 104. The document vector sequence generation unit 103 is connected to the risk model training unit 104; it processes the term vector sequence into the document vector sequence 13 and passes it to the risk model training unit 104. The risk model training unit 104 is connected to the prediction result display unit 107 and passes the risk classification data to it. The risk model training unit 104 is also connected to the corpus feature library update unit 105 and passes the risk classification data to it. The corpus feature library update unit 105 is connected to the corpus feature library 12, the information labeling unit 106 and the document vector sequence 13; data greater than or equal to the threshold is passed to the corpus feature library 12, and risk classification data below the threshold is passed to the information labeling unit 106. The information labeling unit 106 re-labels this data (term vectors and document vectors), after which the labeled term vectors are passed to the corpus feature library 12 and the labeled document vectors are passed to the document vector sequence 13.
[0041] The embodiments of the present invention also provide a network information risk identification method, as described in the following embodiments. Since the principle by which the method solves the problem is similar to that of the network information risk identification system described above, the implementation of the method may refer to the implementation of the system, and repeated details are omitted.
[0042] Figure 4 is a schematic diagram of a network information risk identification method provided in an embodiment of the present invention. As shown in Figure 4, the method may include the following steps:
[0043] S401, obtain network information data, wherein the network information data includes: structured data and unstructured data;
[0044] S402, normalize the network information data, and store the normalization result in the corpus feature database, where the normalization result includes the normalization result corresponding to the structured data and the term vector corresponding to the unstructured data sequence;
[0045] S403: Input the term vector sequence corresponding to the unstructured data into the document vector sequence generation model obtained by pre-training, and output the document vector sequence corresponding to the unstructured data.
[0046] As an optional implementation, before the term vector sequence corresponding to the unstructured data is input into the pre-trained document vector sequence generation model to output the document vector sequence corresponding to the unstructured data, the network information risk identification method provided by the embodiment of the present invention may further include the following steps: obtaining first training sample data, wherein the first training sample data includes multiple groups of term vector sequences and positive/negative term labels; and obtaining the document vector sequence generation model from the first training sample data through machine learning training.
[0047] Further, after the document vector sequence generation model is obtained through machine learning training on the first training sample data, the network information risk identification method provided by the embodiment of the present invention may further include the following steps: obtaining first verification sample data, wherein the first verification sample data includes multiple groups of term vector sequences and positive/negative term labels; inputting the term vector sequences in the first verification sample data into the document vector sequence generation model and outputting the corresponding positive/negative term labels; comparing the positive/negative term labels in the first verification sample data with those output by the document vector sequence generation model; and verifying the document vector sequence generation model according to the comparison result.
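The comparison step in the verification procedure can be sketched as a simple accuracy computation. The patent does not say which metric the comparison yields; accuracy is assumed here, and the label strings and the function name `verification_accuracy` are hypothetical.

```python
def verification_accuracy(predicted, labeled):
    """Fraction of verification samples whose predicted label matches
    the label assigned during manual labeling."""
    if len(predicted) != len(labeled):
        raise ValueError("predicted and labeled must have the same length")
    if not labeled:
        raise ValueError("verification set is empty")
    matches = sum(p == t for p, t in zip(predicted, labeled))
    return matches / len(labeled)


# Hypothetical labels output by the model versus the manual labels.
predicted = ["positive", "negative", "positive", "negative"]
labeled = ["positive", "negative", "negative", "negative"]
accuracy = verification_accuracy(predicted, labeled)
```

A model could then be accepted or sent back for further training depending on whether the accuracy reaches a level chosen for the business scenario.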
[0048] S404, input the normalization result corresponding to the structured data and the document vector sequence corresponding to the unstructured data into the risk prediction model obtained by pre-training, and output the risk prediction result of the network information data, wherein the risk prediction result is also used for Update the corpus feature library.
[0049] As an optional implementation, before the normalization result corresponding to the structured data and the document vector sequence corresponding to the unstructured data are input into the pre-trained risk prediction model to output the risk prediction result of the network information data, the network information risk identification method provided by the embodiment of the present invention may further include the following steps: obtaining second training sample data, wherein the second training sample data includes multiple groups of document vector sequences and document risk classification labels; and obtaining the risk prediction model through machine learning training according to the second training sample data.
[0050] Further, after the risk prediction model is obtained through machine learning training according to the second training sample data, the network information risk identification method provided by the embodiment of the present invention may further include the following steps: obtaining second verification sample data, wherein the second verification sample data includes multiple groups of entry vector sequences and entry positive and negative labels; inputting the entry vector sequences in the second verification sample data into the risk prediction model, and outputting the entry positive and negative labels corresponding to those entry vector sequences; comparing the entry positive and negative labels in the second verification sample data with the entry positive and negative labels output by the risk prediction model; and verifying the risk prediction model according to the comparison result.
[0051] In order to update the corpus feature database in real time, the network information risk identification method provided by the embodiment of the present invention may further include the following steps: obtaining the predicted probability of the network information data; if the predicted probability is greater than or equal to a threshold, updating the corpus feature database according to the entry vector sequence corresponding to the unstructured data in the network information data; if the predicted probability is less than the threshold, labeling the entry vector sequence corresponding to the unstructured data in the network information data, and updating the corpus feature database according to the labeling result.
[0052] The embodiment of the present invention also provides a specific implementation process of the above-mentioned network information risk identification method, including:
[0053] Step 1: Perform word segmentation and stop-word removal on existing documents, select the entries with high TF-IDF values, and crawl network information data from the Internet. The network information data includes structured data (information-related attributes) and unstructured data (information text);
[0054] The TF-IDF value of the aforementioned entry is calculated according to the following formula:
[0055] $\mathrm{TFIDF}_{i,j} = tf_{i,j} \times idf_i$;
[0056] where $\mathrm{TFIDF}_{i,j}$ is the TF-IDF value of entry $t_i$ in document $d_j$, used to evaluate the importance of an entry to one document in a document set or corpus; $tf_{i,j}$ (term frequency) is the number of occurrences of entry $t_i$ in document $d_j$; and $idf_i$ (inverse document frequency) is larger the fewer documents contain the entry, in which case the entry distinguishes well between categories.
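As a minimal sketch of the calculation, the snippet below computes TF-IDF values for already segmented documents and picks the highest-scoring entries. The text does not give a closed form for the inverse document frequency, so the common definition idf = log(N / df) is assumed here; the function names and the toy documents are illustrative only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {entry: TF-IDF value} dict per tokenized document.

    tf is the raw count of the entry in the document. idf = log(N / df)
    is an assumed, standard definition and is not taken from the source.
    """
    n_docs = len(docs)
    # df[t] = number of documents that contain entry t at least once
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


def top_entries(score, k):
    """Return the k entries with the highest TF-IDF value in one document."""
    return sorted(score, key=score.get, reverse=True)[:k]


# Toy corpus of pre-segmented documents (stop words already removed).
docs = [
    ["fraud", "bank", "fraud", "police"],
    ["bank", "loan", "rate"],
    ["bank", "strike", "union"],
]
scores = tf_idf(docs)
keywords = top_entries(scores[0], 2)
```

In this toy corpus "bank" appears in every document, so its value is zero, while "fraud" appears twice in a single document and ranks first.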
[0057] Step 2: Use computing tools such as Word2Vec to calculate the high TF-IDF term vector sequence of the unstructured data of the information document, and normalize the structured data.
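The normalization of structured data is not specified further. One common choice, assumed here purely for illustration, is min-max scaling of each numeric attribute to the range [0, 1]; the field names `forward_count` and `hours_since_post` are hypothetical stand-ins for attributes such as forwarding amount and fermentation time.

```python
def min_max_normalize(records, fields):
    """Scale each numeric field to [0, 1] across all records (assumed scheme)."""
    lo = {f: min(r[f] for r in records) for f in fields}
    hi = {f: max(r[f] for r in records) for f in fields}
    normalized = []
    for r in records:
        row = {}
        for f in fields:
            span = hi[f] - lo[f]
            # A constant column carries no information; map it to 0.0.
            row[f] = (r[f] - lo[f]) / span if span else 0.0
        normalized.append(row)
    return normalized


records = [
    {"forward_count": 10, "hours_since_post": 2},
    {"forward_count": 510, "hours_since_post": 26},
    {"forward_count": 260, "hours_since_post": 14},
]
normalized = min_max_normalize(records, ["forward_count", "hours_since_post"])
```

Scaling all attributes to one range keeps a large-valued attribute, such as a forwarding count, from dominating the others when they are fed to the risk prediction model together.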
[0058] Step 3: Use the information labeling unit 3 to label the high TF-IDF value entries extracted from the document with positive and negative labels, such as positive, negative, unbiased, etc., input them into the pre-trained document vector sequence generation model, and extract the hidden layer vector sequence as a sequence of document vectors.
[0059] Figure 5 shows the word frequency index table of a network information risk identification method according to the embodiment of the present invention. Entries with high TF-IDF values are extracted from the "Operational Risk Information Morning News"; the top four entries are "Qianzhuang", "Fraud", "Strike" and "Central Bank", each with a TF-IDF value of 6 or more. The high TF-IDF value entries are then used as keywords to crawl key information from major portal websites. As shown in Figure 6, with the high TF-IDF value entry "Qianzhuang", documents such as "The "Moon Island" Internet Private Lottery Case involving an Amount of 3.3 Billion was Finally Detected" and "Jiangsu police cracked a large transnational online gambling case" are crawled from the portal websites.
[0060] The text is subjected to word segmentation and filtering, including separating the text into each Chinese word and removing words in the text that have no effect on the meaning of the text.
[0061] Calculating the document vector by extracting the hidden layer vector sequence of the neural network model includes:
[0062] Through the neural network model, the word vector is calculated for the high TF-IDF value entry in the document to obtain the vector of the entry. Specifically, the feature extraction is performed on each word vector according to the following formula to obtain the feature extraction result:
[0063] $s_t = \tanh(U_1 x_t + W_1 s_{t-1})$;
[0064] $o_t = \tanh(U_2 s_t + W_2 o_{t-1})$;
[0065] where $s_{t-1}$ represents the preliminary features of the previous-position document vector $x_{t-1}$; $s_t$ represents the preliminary features of the current-position document vector $x_t$; $o_{t-1}$ represents the comprehensive features of the previous-position document vector $x_{t-1}$; $o_t$ represents the comprehensive features of the current-position document vector $x_t$; and $U_1$, $W_1$, $U_2$, $W_2$ are the weight matrices of the formulas.
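The two recurrences can be sketched in plain Python as follows, taking U_2 from the list of weight matrices as the input weight of the second formula. The zero initial states, the equal size of the two state vectors, and the toy weights are assumptions made for the illustration; the text does not state them.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def extract_features(xs, U1, W1, U2, W2):
    """Run both recurrences over the sequence xs and return every o_t."""
    hidden = len(U1)
    s = [0.0] * hidden  # s_{-1}: zero initial state (assumption)
    o = [0.0] * hidden  # o_{-1}: zero initial state (assumption)
    outputs = []
    for x in xs:
        # s_t = tanh(U1 x_t + W1 s_{t-1})
        s = [math.tanh(u + w) for u, w in zip(matvec(U1, x), matvec(W1, s))]
        # o_t = tanh(U2 s_t + W2 o_{t-1})
        o = [math.tanh(u + w) for u, w in zip(matvec(U2, s), matvec(W2, o))]
        outputs.append(o)
    return outputs


# Toy weights: 2-dimensional inputs and 2-dimensional states.
U1 = [[0.5, -0.2], [0.1, 0.3]]
W1 = [[0.0, 0.1], [0.2, 0.0]]
U2 = [[0.4, 0.0], [-0.3, 0.6]]
W2 = [[0.1, 0.0], [0.0, 0.1]]
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
features = extract_features(xs, U1, W1, U2, W2)
```

Each o_t depends on the current vector through s_t and on all earlier positions through s_{t-1} and o_{t-1}, which is why it is described as the comprehensive feature of that position.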
[0066] Take the hidden layer matrix sequence of the neural network as the document vector sequence, and the vector format is as follows:
[0067] $a = [x_0, x_1, \ldots, x_{t-1}, x_t, \ldots, x_{T-1}]$;
[0068] where $a$ represents the document vector sequence of length $T$, $0 \le t \le T-1$; $x_t$ represents the current-position document vector; and $x_{t-1}$ represents the previous-position document vector.
[0069] The neural network model is shown in Figure 2. The entry vector sequence is used as the input layer of the model, the positive and negative labels of the entries marked by the information labeling unit 106 are used as the output layer of the model, and the middle hidden layer is extracted as the document vector sequence, in the following vector format:
[0070] $w = [x_0, x_1, \ldots, x_n]$;
[0071] where $w$ is the document vector sequence, and the document vector length $n$ is 50.
[0072] Step 4: Use the information labeling unit 106 to grade and label the risk of the document information, and establish the corpus feature database 12 based on the document vector sequence and the normalized information-related attributes (information importance, source, forwarding amount, fermentation time, etc.).
[0073] Step 5: Obtain information document risk classification through the pre-trained neural network model, establish an empirical formula between the information document risk classification result and other structured data in the corpus feature database, and predict enterprise risk classification.
[0074] Using the following formula and the feature extraction results of all document vectors in the document vector sequence, calculate the probability that the document vector sequence belongs to each risk level, and determine the classification result of the document vector sequence from these probabilities:
[0075]
[0076] where $\sigma(O)_j$ represents the probability that the document vector sequence belongs to the current risk level; $O$ represents the features of the document vector sequence; $K$ is the number of risk level classifications; and $j$ is the current risk level.
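The symbols defined above match the standard softmax function, in which the probability of level j is exp(O_j) divided by the sum of exp(O_k) over all K levels. Assuming that this is the intended formula, a minimal sketch of the probability calculation and the resulting classification is given below; the level names and the feature values are hypothetical.

```python
import math


def softmax(o):
    """Map K feature scores to K probabilities that sum to 1."""
    m = max(o)  # subtracting the maximum avoids overflow; the result is unchanged
    exps = [math.exp(v - m) for v in o]
    total = sum(exps)
    return [e / total for e in exps]


def classify(o, levels):
    """Return the most probable risk level and its predicted probability."""
    probs = softmax(o)
    j = max(range(len(probs)), key=probs.__getitem__)
    return levels[j], probs[j]


levels = ["safe", "low risk", "high risk"]  # hypothetical level names
label, prob = classify([2.0, 1.0, 0.1], levels)
```

The probability returned alongside the label is the value that a grading threshold would later be compared against.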
[0077] In an embodiment, it also includes using the entry vector sequence and the classification result data as sample data to train the neural network model as follows:
[0078] The document vector sequence is used as the input layer of the model, the risk classification label (stored in the corpus feature database 11) is used as the output layer of the model, and a part of the document vector sequence is selected as the verification data to verify the accuracy of the model;
[0079] The hidden layer matrix sequence of the model is solved backward from the input layer and output layer data until the error value falls below the set value, and the trained neural network model and hidden layer matrix sequence are saved;
[0080] Input the document vector sequence as the verification sample data into the trained neural network model to obtain the post-training risk level data;
[0081] Compare the post-training risk level data with the data marked by the information labeling unit, and obtain the error relationship between the post-training risk classification probability and the classification accuracy:
[0082] $L(Y, P(Y|X)) = -\log P(Y|X)$;
[0083] $P(Y|X) = \dfrac{1}{1 + e^{-YY'}}$;
[0084] where $Y$ is the information labeling result data; $Y'$ is the classification data of the training results; $X$ is the verification sample data; $P(Y|X)$ is the probability that the sample $X$ is correctly classified after training; and $L$ is the error value between the post-training classification result and the information labeling result.
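A small numeric sketch of the two formulas follows. It assumes, as is usual for this form of the logistic loss, that the label Y takes the values +1 or -1 and that Y' is a real-valued model output; neither encoding is stated in the text.

```python
import math


def prob_correct(y, y_pred):
    """P(Y|X) = 1 / (1 + exp(-Y * Y')), with Y in {+1, -1} (assumed encoding)."""
    return 1.0 / (1.0 + math.exp(-y * y_pred))


def log_loss(y, y_pred):
    """L(Y, P(Y|X)) = -log P(Y|X)."""
    return -math.log(prob_correct(y, y_pred))


confident_right = log_loss(+1, 3.0)   # output agrees strongly with the label
undecided = log_loss(+1, 0.0)         # output carries no preference
confident_wrong = log_loss(+1, -3.0)  # output disagrees strongly with the label
```

The loss shrinks toward zero as the output agrees more strongly with the label and grows as it disagrees, so reducing L pushes the trained classification toward the labeled result.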
[0085] Step 6: Set a grading threshold. When risk prediction is performed on new document data and the predicted probability is greater than or equal to the threshold, the data is added to the training set to retrain the model; when the predicted probability is less than the threshold, the data is sent to the information labeling unit 106, relabeled, and then added to the training set to retrain the model.
[0086] Figure 6 shows a schematic diagram of the risk identification result of a network information risk identification method provided in the embodiment of the present invention. The threshold is set to x (experts adjust this parameter by observing experimental results; it is generally set to 80%), and the trained neural network model is used to identify the document "The "Moon Island" Internet Private Lottery Case involving an Amount of 3.3 Billion was Finally Detected". The model identified the document as safe with a predicted probability greater than 80%, so it can be output, displayed, and included in the corpus feature database. The document "Jiangsu police cracked a large transnational online gambling case" was predicted to be low risk, but the predicted probability was lower than 80%, so it was sent to the information labeling unit 3 and relabeled.
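The threshold rule of Step 6 can be sketched as a routing function that separates predictions accepted as they are from predictions sent back for manual relabeling. The class and function names are illustrative, and the two probabilities below are invented for the example; the text only states that one was above 80% and the other below.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    title: str
    risk_level: str
    probability: float


def route(predictions, threshold=0.8):
    """Split predictions into (accepted, to_relabel) by predicted probability."""
    accepted, to_relabel = [], []
    for p in predictions:
        # At or above the threshold: keep the model's label.
        # Below the threshold: send the document to manual labeling.
        (accepted if p.probability >= threshold else to_relabel).append(p)
    return accepted, to_relabel


predictions = [
    Prediction("Moon Island Internet private lottery case", "safe", 0.91),
    Prediction("Jiangsu transnational online gambling case", "low risk", 0.62),
]
accepted, to_relabel = route(predictions)
```

In both branches the document eventually joins the training set; the threshold only decides whether a person reviews the label first.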
[0087] Step 7: Display the risk prediction results. The results can be presented in multiple dimensions according to the needs of the actual scenario, and a push function can be implemented according to business needs to provide timely early warning of risk events.
[0088] The embodiment of the present invention also provides a computer device, which is used to solve the technical problem that existing network information risk identification methods, being based on a feature database maintained by experts or on a manual speech rule database, cannot identify the latest risk information because of low update efficiency. The device includes a memory, a processor, and a computer program stored in the memory and runnable on the processor, and the processor implements the above-mentioned network information risk identification method when executing the computer program.
[0089] The embodiment of the present invention also provides a computer-readable storage medium, which is used to solve the technical problem that existing network information risk identification methods, being based on a feature library maintained by experts or on an artificial speech rule library, cannot identify the latest risk information because of low update efficiency. The computer-readable storage medium stores a computer program for executing the above-mentioned network information risk identification method.
[0090] To sum up, after obtaining the network information data, the embodiments of the present invention normalize the network information data and store the normalization result of the structured data, together with the entry vector sequence corresponding to the unstructured data, in the corpus feature database. Based on the pre-trained document vector sequence generation model, the document vector sequence corresponding to the unstructured data is generated from the entry vector sequence corresponding to the unstructured data. Finally, based on the pre-trained risk prediction model, the enterprise risk of the network information data is predicted from the normalization result corresponding to the structured data and the document vector sequence corresponding to the unstructured data, yielding the risk prediction result of the network information data. The embodiments of the present invention can thus achieve the technical effect of quickly and accurately identifying enterprise risks from massive network information data. Because the present invention updates the corpus feature database according to the risk prediction result obtained by identifying the network information data, it can respond to changes in network information in a timely manner, continuously expand the coverage of the model's risk prediction, and improve the ability to identify network information risks.
[0091] As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
[0092] The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0093] These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instruction means implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0094] These computer program instructions can also be loaded onto a computer or other programmable data processing device to cause a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0095] The specific embodiments described above further describe the purpose, technical solutions and beneficial effects of the present invention in detail. It should be understood that the above descriptions are only specific embodiments of the present invention, and are not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.