Auditing methods and devices for shared data
By combining trained neural network models with audit rules, the problem of low efficiency in shared data auditing was solved, achieving efficient and accurate audit processing, reducing manual intervention, and improving audit efficiency and accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- China Mobile Information Technology Co., Ltd. (中移信息技术有限公司)
- Filing Date
- 2022-04-26
- Publication Date
- 2026-06-30
AI Technical Summary
Existing shared data auditing technologies are labor-intensive and have low auditing efficiency.
A pre-trained neural network model is used in conjunction with audit rules to audit shared data. By establishing a hybrid neural network model and updating network weights using keyword sample sets, audit efficiency is improved.
It improves the processing accuracy and efficiency of shared data auditing, reduces the impact of subjective factors on auditors, avoids duplication of work, and presents audit results intuitively with the issuance of warnings.
Figure CN117009800B_ABST
Description
Technical Field
[0001] This application relates to the field of big data technology, specifically to a method and apparatus for auditing shared data.
Background Technology
[0002] Data sharing can break down information protection barriers between departments and regions, make fuller use of existing data resources, and reduce repetitive work such as data collection and acquisition. Therefore, auditing shared data is also very important.
[0003] Existing shared data auditing technologies typically involve establishing interaction channels between the auditing system and various external business systems to collect and persist raw data. Based on file processing technology, pre-defined data is categorized, processed, and stored. Finally, corresponding audit rule scripts are designed based on the generated data and the data before migration. Auditors then use these audit rule scripts to verify, analyze, and process the various data, generating audit results which are then displayed in the form of reports or documents.
[0004] Auditors are responsible for verifying and analyzing various data, which is labor-intensive and inefficient.
Summary of the Invention
[0005] This application provides a method and apparatus for auditing shared data, which solves the technical problem of low efficiency in auditing shared data.
[0006] In a first aspect, embodiments of this application provide a method for auditing shared data, including:
[0007] Based on the trained neural network model, obtain the audit model;
[0008] Based on the aforementioned audit model, the shared data is audited.
[0009] The trained neural network model is obtained by training based on the following steps:
[0010] A first hybrid neural network model is established based on shared data samples;
[0011] The semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain the second hybrid neural network model;
[0012] The network weights of the second hybrid neural network model are updated based on the keyword sample set to obtain the trained neural network model.
[0013] In one embodiment, establishing a first hybrid neural network model based on shared data samples includes:
[0014] Data identification is performed on the shared data sample to obtain the data type corresponding to the shared data sample;
[0015] The collinearity correlation between the data types is analyzed using a scatter plot matrix to obtain feature vectors.
[0016] Based on the feature vectors, a neural network is established;
[0017] Based on the aforementioned neural network and recurrent neural network, a first hybrid neural network model is established.
[0018] In one embodiment, the semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain a second hybrid neural network model, including:
[0019] Based on the manual annotation results and the annotation results corresponding to the first hybrid neural network model, the overlap rate of the annotation results is obtained;
[0020] When the overlap rate of the annotation results reaches the overlap rate threshold and the number of training iterations of the first hybrid neural network model reaches the preset training threshold, the second hybrid neural network model is obtained.
[0021] In one embodiment, updating the network weights of the second hybrid neural network model based on the keyword sample set to obtain the trained neural network model includes:
[0022] Obtain a keyword sample from the keyword sample set;
[0023] Based on the keyword samples, the second hybrid neural network model is trained, and the network weights of the second hybrid neural network model are updated;
[0024] If the error between the output value of the second hybrid neural network model after updating the network weights and the sample value corresponding to the keyword sample is less than a preset error, it is determined whether each keyword sample in the keyword sample set has completed the training of the second hybrid neural network model.
[0025] If all keyword samples in the keyword sample set have completed the training of the second hybrid neural network model, the trained neural network model is obtained.
[0026] In one embodiment, prior to establishing the first hybrid neural network model based on shared data samples, the following steps are included:
[0027] The Z-score standardization method is used to standardize and store the sharing requirement information of the shared data samples.
[0028] In one embodiment, obtaining the audit model based on the trained neural network model includes:
[0029] The optimized preset audit rules are inserted into the trained neural network model to obtain the audit model.
[0030] In one embodiment, the keyword sample set is obtained by extracting keywords from the shared data samples using the TextRank algorithm.
[0031] In a second aspect, embodiments of this application provide a shared data auditing device, comprising:
[0032] The model acquisition module is used to acquire the audit model based on the trained neural network model;
[0033] An audit module is used to perform audit processing on shared data based on the audit model.
[0034] The trained neural network model is obtained by training based on the following steps:
[0035] A first hybrid neural network model is established based on shared data samples;
[0036] The semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain the second hybrid neural network model;
[0037] The network weights of the second hybrid neural network model are updated based on the keyword sample set to obtain the trained neural network model.
[0038] In a third aspect, embodiments of this application provide an electronic device, including a processor and a memory storing a computer program, wherein the processor executes the program to implement the shared data auditing method described in the first aspect.
[0039] In a fourth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the shared data auditing method described in the first aspect.
[0040] The shared data auditing method and apparatus provided in this application improve the auditing efficiency and processing accuracy of shared data by combining a neural network model with auditing technology.
Attached Figure Description
[0041] To more clearly illustrate the technical solutions in this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0042] Figure 1 is a flowchart of the shared data auditing method provided in an embodiment of this application;
[0043] Figure 2 is the first flowchart of the process of obtaining a trained neural network model provided in the embodiments of this application;
[0044] Figure 3 is a schematic diagram of the data identification process provided in the embodiments of this application;
[0045] Figure 4 is a schematic diagram of the error backpropagation process provided in an embodiment of this application;
[0046] Figure 5 is the second flowchart of the process of obtaining a trained neural network model provided in the embodiments of this application;
[0047] Figure 6 is the third flowchart of the process of obtaining a trained neural network model provided in the embodiments of this application;
[0048] Figure 7 is a schematic diagram of the audit process provided in the embodiments of this application;
[0049] Figure 8 is a schematic diagram of the content recognition process provided in an embodiment of this application;
[0050] Figure 9 is a schematic diagram of the structure of the shared data auditing device provided in the embodiments of this application;
[0051] Figure 10 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.
Detailed Implementation
[0052] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0053] Figure 1 is a flowchart of the shared data auditing method provided in an embodiment of this application. As shown in Figure 1, an embodiment of this application provides a method for auditing shared data, which may include:
[0054] Step 101: Obtain the audit model based on the trained neural network model;
[0055] The trained neural network model is obtained by training based on the following steps:
[0056] A first hybrid neural network model is established based on shared data samples;
[0057] The semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain the second hybrid neural network model;
[0058] The network weights of the second hybrid neural network model are updated based on the keyword sample set to obtain the trained neural network model.
[0059] Specifically, the shared data sample is the shared data extracted in a certain proportion based on the summary analysis results after the shared data has been summarized and analyzed.
[0060] A first hybrid neural network model is established by combining a neural network built based on shared data samples with a recurrent neural network.
[0061] Annotators label the shared data samples based on their experience and expertise, yielding the manual annotation results.
[0062] For example, the annotation can mark the sentiment state of the shared data. Sentiment states can be divided into positive, neutral, and negative, with positive sentiment labeled as +1, neutral sentiment as 0, and negative sentiment as -1.
[0063] The first hybrid neural network model also uses the same annotation on the shared data samples. The first hybrid neural network model is trained based on the manual annotation results, and the trained first hybrid neural network model is the second hybrid neural network model.
[0064] The keyword sample set is obtained by extracting keywords from shared data samples. After obtaining the keyword sample set, the second hybrid neural network model is trained using the keyword sample set to update the network weights of the model, thereby obtaining a trained neural network model.
[0065] Finally, based on the trained neural network model and the optimized preset audit rules, an audit model is obtained.
[0066] In one embodiment, prior to establishing the first hybrid neural network model based on shared data samples, the following steps are included:
[0067] The Z-score standardization method is used to standardize and store the sharing requirement information of the shared data samples.
[0068] Specifically, the data sharing requirements information in the shared data content is summarized and analyzed, and Z-score is used for standardized storage.
[0069] The information required for sharing mainly includes, but is not limited to: operator, operation date, shared data field name, source IP, source IP system / unit, destination IP, destination IP system / unit, and sharing requirement / reason.
[0070] The specific method for standardized storage is as follows: the Z-score standardization method first calculates the mean of the original values and then the standard deviation SD. The mean is subtracted from each original value to obtain the difference R, and R is then divided by SD, so that the data is standardized and stored according to uniform requirements for field content and format.
[0071] The Z-score standardization formula is as follows:
[0072] $X' = \dfrac{x - \bar{x}}{SD} = \dfrac{R}{SD}$
[0073] In the formula, X′ represents the standardized value; x represents the original value, that is, the value before standardization; $\bar{x}$ represents the mean of the original values; SD represents the standard deviation of the original values; and R represents the difference between the original value and the mean.
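As an illustration only, the standardization step can be sketched in a few lines of Python. The function name `z_score` and the sample values are hypothetical, and the use of the population standard deviation is an assumption, since the description does not say which variant is intended.

```python
from statistics import mean, pstdev

def z_score(values):
    """Standardize each value as (x - mean) / SD."""
    mu = mean(values)
    sd = pstdev(values)  # population SD; an assumption, not stated in the source
    if sd == 0:
        # All values are identical, so every deviation R is zero.
        return [0.0 for _ in values]
    return [(x - mu) / sd for x in values]

# Hypothetical numeric field values taken from sharing requirement records.
scores = z_score([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
```

After standardization the values have mean 0 and unit standard deviation, which is what lets fields of different scales be stored in a uniform format.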
[0074] In one embodiment, establishing a first hybrid neural network model based on shared data samples includes:
[0075] Data identification is performed on the shared data sample to obtain the data type corresponding to the shared data sample;
[0076] The collinearity correlation between the data types is analyzed using a scatter plot matrix to obtain feature vectors.
[0077] Based on the feature vectors, a neural network is established;
[0078] Based on the aforementioned neural network and recurrent neural network, a first hybrid neural network model is established.
[0079] Specifically, Figure 2 is the first flowchart of the process of obtaining a trained neural network model provided in an embodiment of this application. As shown in Figure 2, establishing the first hybrid neural network model includes the following steps:
[0080] Step 201: Data identification is performed by combining shared data samples and related characteristics.
[0081] Specifically, the shared data content and related characteristics include, but are not limited to, data field names, data types, lengths, and corresponding storage rules.
[0082] Figure 3 is a schematic diagram of the data identification process provided in the embodiments of this application. As shown in Figure 3, data identification includes the following steps:
[0083] Step 301: Establish a classification and grading system based on the shared data.
[0084] Step 302: Review the sensitive data domains and determine their classification level.
[0085] Step 303: Define data categories from a business perspective.
[0086] Step 304: Classify and collect the data through the data management system.
[0087] Specifically, based on the data sharing content, multiple classification and grading systems are built. Then, sensitive data domains in the data sharing content are sorted out. Sensitive data domains refer to data sets with a high degree of density. The sensitive data domains are assigned a level based on their own attributes. After sorting out the sensitive data domains, data categories are formulated from a business perspective, and the data is classified and collected through the data management system.
[0088] The classification and grading system includes classification principles and classification methods. The classification principle is to classify the shared data according to its source, content, and purpose. The grading principle is to grade the data according to its value, content sensitivity, impact, and distribution scope.
[0089] The specific classification method is as follows: sensitive data is automatically detected based on the identified data assets. Feature detection is used to determine which data assets contain sensitive data. The sensitive data assets are then classified and labeled, and the owners of the sensitive data are identified. Based on the classified data assets, the business department grades sensitivity, dividing the classified data assets into sensitivity levels such as public, internal, and sensitive.
[0090] Data identification can accurately distinguish data of similar type and content, including but not limited to mobile phone numbers, ID card numbers, and similar data.
[0091] Step 202: Analyze the collinearity between data using the scatter plot matrix and extract feature vectors.
[0092] Specifically, the collinear correlation between multiple data types is analyzed using scatter plot matrices. The obtained data is then linearly transformed into a new coordinate system such that the largest variance of the projected data lies along the first coordinate axis, the second largest variance along the second coordinate axis, and so on. This reduces the dimensionality of the data, and the feature vectors are extracted from the reduced representation.
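The coordinate transformation described here matches what is usually called principal component analysis, although the description does not name it. The following is a minimal sketch under that assumption, finding only the first axis (the direction of largest variance) by power iteration on the covariance matrix. The function names and the sample data are hypothetical.

```python
import math

def covariance_matrix(rows):
    """Return the covariance matrix of the columns and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centered) / n for b in range(d)]
           for a in range(d)]
    return cov, centered

def first_principal_axis(cov, iterations=200):
    """Approximate the dominant eigenvector of cov by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Two nearly collinear columns (the second is roughly twice the first).
rows = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]
cov, centered = covariance_matrix(rows)
axis = first_principal_axis(cov)
# Projecting onto the first axis reduces each 2-D row to one feature value.
projected = [sum(c[j] * axis[j] for j in range(2)) for c in centered]
```

Because the two columns are almost collinear, nearly all of the variance falls on the first axis, so a single projected value per row keeps most of the information.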
[0093] Step 203: Generate a neural network based on the extracted feature vectors.
[0094] Specifically, a neural network is generated based on the extracted feature vectors, and the neural network satisfies the following formula:
[0095] x(n+1)=W1u(n+1)+W2x(n)+W3y(n)
[0096] In the formula, u and y are the input and output of the neural network, respectively, and x is the state of the neural network; W1, W2 and W3 are the transformation matrices from the current input, the current state of the neural network, and the current output, respectively, to the next state of the neural network; n represents the current time; and n+1 represents the next time.
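A single application of the state update x(n+1) = W1u(n+1) + W2x(n) + W3y(n) can be sketched as follows. The helper names and the small matrices are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def next_state(w1, w2, w3, u_next, x_now, y_now):
    """Compute x(n+1) = W1 u(n+1) + W2 x(n) + W3 y(n)."""
    terms = (matvec(w1, u_next), matvec(w2, x_now), matvec(w3, y_now))
    return [sum(parts) for parts in zip(*terms)]

# Hypothetical sizes: 2 state units, 1 input, 1 output.
W1 = [[1.0], [0.0]]
W2 = [[0.5, 0.0], [0.0, 0.5]]
W3 = [[0.0], [1.0]]
x1 = next_state(W1, W2, W3, u_next=[2.0], x_now=[4.0, 6.0], y_now=[3.0])
```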
[0097] Step 204: Combine recurrent neural networks to establish the first hybrid neural network model.
[0098] Specifically, a learning algorithm based on convolutional neural networks is employed to extract feature vectors, which are then combined and fed into subsequent mechanisms. A linear transformation is performed on the input vectors to achieve a convolutional neural network encoder for the knowledge graph. This is then combined with a recurrent neural network using the backpropagation (BP) algorithm, recursively connecting keywords as input data in a chain-like manner along the sequence's evolution direction.
[0099] Figure 4 is a schematic diagram of the error backpropagation process provided in an embodiment of this application. In Figure 4, t represents the time step; x is a vector representing the input layer value; s is a vector representing the hidden layer value; U is the weight matrix from the input layer to the hidden layer; o is a vector representing the output layer value; V is the weight matrix from the hidden layer to the output layer; and W is the weight matrix applied to the previous hidden layer value when it is fed back as input at the current time step.
[0100] As can be seen from Figure 4, the hidden layer value s depends not only on the current input x but also on the previous hidden layer value. The output layer o and the hidden layer s satisfy the following formulas:
[0101] $o_t = g(V s_t)$
[0102] $s_t = f(U x_t + W s_{t-1})$
[0103] In the formulas, $o_t$ represents the output layer value at time t, g represents the output layer activation function, V represents the weight matrix from the hidden layer to the output layer, $s_t$ represents the hidden layer value at time t, f represents the hidden layer activation function, U represents the weight matrix from the input layer to the hidden layer, $x_t$ represents the input layer value at time t, W represents the weight matrix applied to the previous hidden layer value, and $s_{t-1}$ represents the hidden layer value at time t-1.
[0104] By repeatedly substituting the hidden layer formula into the output layer formula, we obtain:
[0105] $o_t = g(V s_t) = g(V f(U x_t + W f(U x_{t-1} + W s_{t-2})))$
[0106] $= g(V f(U x_t + W f(U x_{t-1} + W f(U x_{t-2} + W s_{t-3}))))$
[0107] $= g(V f(U x_t + W f(U x_{t-1} + W f(U x_{t-2} + W f(U x_{t-3} + \dots)))))$
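The recurrence behind this expansion can be sketched as a forward pass that carries the hidden value from one time step to the next. This is an illustration, not the application's implementation: tanh for f and the identity for g are assumptions, and the weights are hypothetical.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def rnn_forward(U, W, V, inputs, s0, f=math.tanh, g=lambda z: z):
    """Return the outputs o_1..o_T of a simple recurrent network.

    Each step computes s_t = f(U x_t + W s_{t-1}) and o_t = g(V s_t).
    """
    s, outputs = s0, []
    for x in inputs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [f(z) for z in pre]          # hidden value reused at the next step
        outputs.append([g(z) for z in matvec(V, s)])
    return outputs

U = [[0.5], [-0.3]]            # input (1) -> hidden (2)
W = [[0.1, 0.2], [0.0, 0.4]]   # hidden -> hidden
V = [[1.0, -1.0]]              # hidden (2) -> output (1)
outputs = rnn_forward(U, W, V, inputs=[[1.0], [2.0]], s0=[0.0, 0.0])
```

Each pass through the loop corresponds to one level of nesting in the expanded expression, which is why the output at time t depends on every earlier input.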
[0108] The calculation consists of three steps: 1. Calculate the output value of each neuron forward; 2. Calculate the error term value of each neuron backward, which is the partial derivative of the error function with respect to the weighted input of the neuron; 3. Calculate the gradient of each weight, and finally update the weights using the stochastic gradient descent algorithm. This generates a new sample of the text recognition output probability.
[0109] Considering that the simulated system is almost impossible to iterate step by step in one pattern and will be affected by many external factors, a hybrid neural network model is established for training.
[0110] The expression for the hybrid neural network model is shown below:
[0111] $y(i,t) = \alpha + X(i,t)^{T}\beta + \varepsilon(i,t)$
[0112] $i = 1, 2, \dots, N;\quad t = 1, 2, \dots, T$
[0113] In the formula, y(i,t) is the regressand (a scalar), α is the intercept, X(i,t) is the k×1 column vector of regressors (containing k regressors), β is the k×1 column vector of regression coefficients, and ε(i,t) is the error term (a scalar).
[0114] In one embodiment, the semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain a second hybrid neural network model, including:
[0115] Based on the manual annotation results and the annotation results corresponding to the first hybrid neural network model, the overlap rate of the annotation results is obtained;
[0116] When the overlap rate of the annotation results reaches the overlap rate threshold and the number of training iterations of the first hybrid neural network model reaches the preset training threshold, the second hybrid neural network model is obtained.
[0117] Specifically, Figure 5 is the second flowchart of the process of obtaining a trained neural network model provided in the embodiments of this application. As shown in Figure 5, the first hybrid neural network model annotates the shared data content, and the annotation results corresponding to the first hybrid neural network model are obtained.
[0118] The obtained manual annotation results are compared with the annotation results corresponding to the first hybrid neural network model to obtain the overlap rate between the two annotation results.
[0119] If the overlap rate of the annotation results does not exceed the overlap rate threshold, samples are reselected from the shared data samples for another training iteration. If the overlap rate exceeds the threshold, the number of training iterations is checked against the preset training threshold. If the number of training iterations does not exceed the preset training threshold, samples are again reselected from the shared data samples for another training iteration. If it does exceed the preset training threshold, training stops, and the second hybrid neural network model is obtained.
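The stopping logic can be sketched as below. The description does not define how the overlap rate is computed, so treating it as the fraction of samples on which the model label equals the manual label is an assumption, and the function names and threshold values are hypothetical.

```python
def overlap_rate(manual_labels, model_labels):
    """Fraction of samples where the model label equals the manual label."""
    if len(manual_labels) != len(model_labels) or not manual_labels:
        raise ValueError("label lists must be non-empty and equal in length")
    matches = sum(m == p for m, p in zip(manual_labels, model_labels))
    return matches / len(manual_labels)

def should_stop(rate, rounds, rate_threshold, round_threshold):
    """Stop only when both the overlap rate and the round count exceed their thresholds."""
    return rate > rate_threshold and rounds > round_threshold

# Labels use the +1 / 0 / -1 sentiment convention from the description.
rate = overlap_rate([1, 0, -1, 1], [1, 0, 1, 1])
```

In a training loop, `should_stop` would be evaluated after each round; while it returns False, new samples are drawn and training continues.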
[0120] In one embodiment, the keyword sample set is obtained by extracting keywords from the shared data samples using the TextRank algorithm.
[0121] Specifically, keyword extraction is performed on the shared data samples using the TextRank algorithm. A directed weighted graph G = (V, E) is constructed from the shared data samples, where V is the set of nodes and E = {E1, E2, ..., En} is the set of edges in the graph.
[0122] Let $W_{ji}$ be the weight of the edge between any two nodes $V_i$ and $V_j$ in the graph. For a given node $V_i$, $In(V_i)$ is the set of nodes pointing to $V_i$, and $Out(V_i)$ is the set of nodes that $V_i$ points to. The score weight of node $V_i$ is defined as follows:
[0123] $WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \dfrac{W_{ji}}{\sum_{V_k \in Out(V_j)} W_{jk}} \, WS(V_j)$
[0124] In the formula, $WS(V_i)$ represents the score weight of node $V_i$; d represents the damping coefficient, that is, the probability of pointing from a given node to any other node in the graph, with a value between 0 and 1; $V_i$ and $V_j$ represent the i-th and j-th nodes; $W_{ji}$ represents the weight of the edge from node $V_j$ to node $V_i$; $In(V_i)$ represents the set of nodes pointing to $V_i$; $Out(V_j)$ represents the set of nodes that $V_j$ points to; $W_{jk}$ represents the weight of the edge between node $V_j$ and node $V_k$; and $WS(V_j)$ represents the score weight of node $V_j$.
[0125] When calculating the weight of each node, an arbitrary initial value is assigned to each node in the graph, and then the calculation is performed recursively until the weight results converge. The extracted keywords are then organized into a keyword sample set.
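A minimal sketch of this iteration, following the standard TextRank score update, is shown below. The damping value 0.85, the convergence tolerance, the initial score of 1.0, and the tiny word graph are all assumptions made for illustration; edge weights are assumed to be positive.

```python
def textrank(edges, d=0.85, tol=1e-6, max_iter=100):
    """Score the nodes of a directed weighted graph.

    edges maps (source, target) to a positive weight.
    """
    nodes = {n for pair in edges for n in pair}
    # Total outgoing weight of each node, used to normalize its votes.
    out_sum = {n: 0.0 for n in nodes}
    for (source, _), weight in edges.items():
        out_sum[source] += weight
    scores = {n: 1.0 for n in nodes}  # arbitrary initial value
    for _ in range(max_iter):
        new_scores = {}
        for i in nodes:
            incoming = sum(weight / out_sum[j] * scores[j]
                           for (j, k), weight in edges.items() if k == i)
            new_scores[i] = (1 - d) + d * incoming
        delta = max(abs(new_scores[n] - scores[n]) for n in nodes)
        scores = new_scores
        if delta < tol:  # scores have converged
            break
    return scores

# Hypothetical co-occurrence graph: "audit" is linked to both other words.
edges = {("data", "audit"): 1.0, ("audit", "data"): 1.0,
         ("audit", "rule"): 1.0, ("rule", "audit"): 1.0}
scores = textrank(edges)
keywords = sorted(scores, key=scores.get, reverse=True)
```

The highest-scoring nodes would then be taken as keywords and collected into the keyword sample set.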
[0126] In one embodiment, updating the network weights of the second hybrid neural network model based on the keyword sample set to obtain the trained neural network model includes:
[0127] Obtain a keyword sample from the keyword sample set;
[0128] Based on the keyword samples, the second hybrid neural network model is trained, and the network weights of the second hybrid neural network model are updated;
[0129] If the error between the output value of the second hybrid neural network model after updating the network weights and the sample value corresponding to the keyword sample is less than a preset error, it is determined whether each keyword sample in the keyword sample set has completed the training of the second hybrid neural network model.
[0130] If all keyword samples in the keyword sample set have completed the training of the second hybrid neural network model, the trained neural network model is obtained.
[0131] Specifically, Figure 6 is the third flowchart of the process of obtaining a trained neural network model provided in an embodiment of this application. As shown in Figure 6, a keyword sample is obtained from the keyword sample set and taken as the first keyword sample. The second hybrid neural network model is trained on this first keyword sample using the error correction method. After each backpropagation pass of error correction, the network weights of the second hybrid neural network model are updated once, yielding the third hybrid neural network model.
[0132] The error between the output value of the third hybrid neural network model and the sample value corresponding to the first keyword sample is compared with a preset error. If the error is greater than the preset error, the third hybrid neural network model is trained again using the first keyword sample. If the error is less than the preset error, it is determined whether all keyword samples have completed the training of the hybrid neural network model. If not, another keyword sample is obtained from the keyword sample set and taken as the second keyword sample. The third hybrid neural network model is trained using the second keyword sample, and the network weights of the third hybrid neural network model are updated. This process is repeated until all keyword samples in the keyword sample set have completed the training, at which point the trained neural network model is obtained.
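The control flow of this per-sample training can be sketched with a deliberately tiny stand-in model. Here the "model" is a single weight with output = weight × input, trained by a simple error-correction update; the real hybrid network, the learning rate, the preset error, and the sample values are not given in the description and are assumptions.

```python
def train_on_keywords(samples, weight=0.0, lr=0.1, preset_error=1e-3,
                      max_updates=10_000):
    """Train on each (input, target) sample until its error is below preset_error.

    The one-weight model is a stand-in used only to show the loop structure.
    """
    for x, target in samples:
        for _ in range(max_updates):
            # One error-correction step updates the weight once.
            weight += lr * (target - weight * x) * x
            # Compare the error after the update with the preset error.
            if abs(target - weight * x) < preset_error:
                break  # this sample is done; move on to the next sample
    return weight

# Hypothetical keyword samples encoded as (input value, sample value) pairs.
w = train_on_keywords([(1.0, 2.0), (2.0, 4.0)])
```

The outer loop ends once every sample in the set has been used, which corresponds to the point at which the trained model is obtained.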
[0133] In one embodiment, obtaining the audit model based on the trained neural network model includes:
[0134] The optimized preset audit rules are inserted into the trained neural network model to obtain the audit model.
[0135] Specifically, pre-set audit rules are established for the shared data content. These pre-set rules are then reviewed by both human and machine audits. Once approved, the audit rules are stored in the audit database. The pre-set audit rules can be configured according to the data sharing business logic. Simultaneously, an indicator library is created, comprising analytical models, and thresholds for each indicator are set according to actual needs.
[0136] The preset audit rules are shared with multiple terminal devices for display, and the comments from those terminal devices on the preset audit rules are collected. The preset audit rules are then optimized based on the comments. After optimization, the optimized preset audit rules are stored in the audit database, replacing the previous version.
[0137] Comments can be positive or negative, and include, but are not limited to, the number of likes, comments, and dislikes.
[0138] The optimized preset audit rules are inserted into the trained neural network model to establish a complete audit semantic system, thereby obtaining the audit model.
[0139] Step 102: Based on the audit model, perform audit processing on the shared data.
[0140] Specifically, Figure 7 is a schematic diagram of the audit process provided in the embodiments of this application. As shown in Figure 7, while the audit model runs, the identification results for the shared data content are compared with the required content. The comparison covers, but is not limited to, data size, content, operation time, and flow. Indicator results are obtained from the preset data retrieval and calculation relationships, and data content that does not meet the requirements is output and assigned to a threshold level. When the audit analysis is complete, the abnormal shared data is output and stored. The causes of the abnormal shared data are analyzed to identify defects in the shared content data, and the abnormal shared data is repaired based on those defects. The optimized preset audit rules are also further optimized, the audit model is updated based on the optimization results, and the audit data of the shared data is stored.
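The comparison and threshold-level assignment might look like the following sketch. Everything specific in it is hypothetical: the field names, the use of a mismatch ratio as the indicator, and the two level names are illustrative choices, not details taken from the description.

```python
def audit_record(record, requirement, levels):
    """Compare a shared-data record with its requirement and grade the result.

    levels is a list of (minimum_ratio, label) pairs; the label with the
    highest minimum that the mismatch ratio reaches is assigned.
    """
    mismatched = [field for field, expected in requirement.items()
                  if record.get(field) != expected]
    ratio = len(mismatched) / len(requirement)  # indicator result
    label = "ok"
    for minimum, name in sorted(levels, reverse=True):
        if ratio >= minimum:
            label = name
            break
    return mismatched, ratio, label

requirement = {"size": 1024, "content_type": "csv",
               "flow": "A->B", "operation_date": "2022-04-26"}
record = {"size": 2048, "content_type": "csv",
          "flow": "A->B", "operation_date": "2022-04-26"}
fields, ratio, label = audit_record(
    record, requirement, levels=[(0.25, "warning"), (0.5, "critical")])
```

A record graded above "ok" would be the kind of abnormal shared data that is output, stored, and passed on for cause analysis.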
[0141] A front-end page is established to ensure the consistency and compliance of the displayed data sharing content. The audit data of the data sharing content is displayed through the client, and when the audit results indicate problems, early warning information is issued based on the preset alarm information.
[0142] Figure 8 is a schematic diagram of the content recognition process provided in an embodiment of this application. As shown in Figure 8, content recognition includes the following steps:
[0143] Step 801: Perform text recognition on the data sharing content using the text recognition model in the audit model.
[0144] Step 802: After recognition is complete, post-process the text content to filter out characters that do not meet the requirements;
[0145] Specifically, characters that do not meet the requirements include, but are not limited to: rare characters, ambiguous characters, incomplete characters, and obscure characters.
[0146] Step 803: Use the word2vec Continuous Bag-of-Words (CBOW) model to predict missing words.
[0147] The prediction method is as follows: First, a CBOW model is trained using open Chinese corpora and related corpora of the current shared data content type. This model predicts missing characters in the current shared data content type. The model input is the filtered text content, and the output is the predicted missing text content. Finally, the predicted missing text content is integrated with the existing text content to restore the text content, ensuring that the content recognition accuracy is no less than 90%.
[0148] The objective function of the CBOW model is shown below:
[0149] $$p(\mathrm{Context}(w) \mid w) = \prod_{u \in \mathrm{Context}(w)} p(u \mid w)$$
[0150] In the formula, Context(w) represents the missing text content, w represents the given text content, p(Context(w)|w) represents the probability of generating the missing text content Context(w) based on the given text content w, u represents the sub-content in the missing text content, and p(u|w) represents the probability of generating the sub-content u in the missing text content based on the given text content w.
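To make the prediction step more concrete, the following is a small, self-contained Python sketch of a CBOW-style model in its usual formulation, where the tokens surrounding a gap are averaged and used to predict the missing token through a full softmax. It is a toy under stated assumptions, not the model of this application: the function names `train_cbow` and `predict_missing`, the embedding size, the window, the learning rate, and the pre-tokenized input are all hypothetical, and a production system would more likely rely on an existing word2vec library trained on the corpora described above.

```python
import math
import random


def _softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _hidden(W_in, ctx):
    # Average the input embeddings of the context tokens.
    dim = len(W_in[0])
    return [sum(W_in[c][d] for c in ctx) / len(ctx) for d in range(dim)]


def train_cbow(sentences, dim=8, window=1, epochs=50, lr=0.1, seed=0):
    """Train a tiny CBOW model; returns (vocab, W_in, W_out)."""
    rng = random.Random(seed)
    vocab = sorted({w for s in sentences for w in s})
    idx = {w: i for i, w in enumerate(vocab)}
    W_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    W_out = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]
    for _ in range(epochs):
        for s in sentences:
            for pos, target in enumerate(s):
                lo, hi = max(0, pos - window), min(len(s), pos + window + 1)
                ctx = [idx[s[j]] for j in range(lo, hi) if j != pos]
                if not ctx:
                    continue
                h = _hidden(W_in, ctx)
                probs = _softmax(
                    [sum(r[d] * h[d] for d in range(dim)) for r in W_out])
                # Gradient of the cross-entropy loss w.r.t. the scores.
                err = [p - (1.0 if v == idx[target] else 0.0)
                       for v, p in enumerate(probs)]
                grad_h = [sum(err[v] * W_out[v][d] for v in range(len(vocab)))
                          for d in range(dim)]
                for v, row in enumerate(W_out):
                    for d in range(dim):
                        row[d] -= lr * err[v] * h[d]
                for c in ctx:
                    for d in range(dim):
                        W_in[c][d] -= lr * grad_h[d] / len(ctx)
    return vocab, W_in, W_out


def predict_missing(context, model):
    """Return the most probable missing token and the full distribution."""
    vocab, W_in, W_out = model
    idx = {w: i for i, w in enumerate(vocab)}
    ctx = [idx[w] for w in context if w in idx]
    if not ctx:
        raise ValueError("no known context tokens")
    h = _hidden(W_in, ctx)
    dim = len(h)
    probs = _softmax([sum(r[d] * h[d] for d in range(dim)) for r in W_out])
    best = max(range(len(vocab)), key=probs.__getitem__)
    return vocab[best], probs
```

A call such as `predict_missing(["shared", "audit"], model)` returns a candidate for the gap between the two tokens, which would then be merged back into the filtered text.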
[0151] The shared data auditing method provided in this application combines a neural network model with auditing technology. Using the neural network model to audit the shared data improves processing accuracy and makes the result less susceptible to auditors' subjective judgment. It also spares employees repetitive, labor-intensive work, which improves the efficiency of shared data auditing. Finally, the audit results are presented intuitively to auditors as a web page, and warnings are issued for abnormal audit results, thereby forming a comprehensive data sharing auditing technology.
[0152] The shared data auditing apparatus provided in the embodiments of this application is described below. The apparatus described below and the shared data auditing method described above may be read with reference to each other.
[0153] Figure 9 is a schematic diagram of the structure of the shared data auditing device provided in the embodiments of this application. As shown in Figure 9, this application provides an auditing device for shared data, including a model acquisition module 901 and an audit module 902, wherein:
[0154] The model acquisition module 901 is used to acquire the audit model based on the trained neural network model;
[0155] The audit module 902 is used to perform audit processing on shared data based on the audit model;
[0156] The trained neural network model is obtained by training based on the following steps:
[0157] A first hybrid neural network model is established based on shared data samples;
[0158] The semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain the second hybrid neural network model;
[0159] The network weights of the second hybrid neural network model are updated based on the keyword sample set to obtain the trained neural network model.
[0160] In one embodiment, the model acquisition module 901 is further configured to perform data identification on the shared data sample and obtain the data type corresponding to the shared data sample;
[0161] The collinearity correlation between the data types is analyzed using a scatter plot matrix to obtain feature vectors.
[0162] Based on the feature vectors, a neural network is established;
[0163] Based on the aforementioned neural network and recurrent neural network, a first hybrid neural network model is established.
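A scatter plot matrix is a visual tool, so the sketch below works with the pairwise Pearson correlations that such a matrix displays. This substitution, the numeric encoding of the data types as columns, the 0.95 threshold, and the rule of dropping a column that is collinear with one already kept are all assumptions made for illustration; the application does not state how the feature vectors are derived from the analysis.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(columns, threshold=0.95):
    """Keep a column unless it is collinear with an already kept one.

    columns: dict mapping a (hypothetical) data type name to its values.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept
```

For example, with columns `size`, `size_kb` (a rescaled copy of `size`), and `hour`, the sketch keeps `size` and `hour` and drops the redundant `size_kb`.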
[0164] In one embodiment, the model acquisition module 901 is further configured to obtain the overlap rate of the annotation results based on the manual annotation results and the annotation results corresponding to the first hybrid neural network model;
[0165] When the overlap rate of the annotation results reaches the overlap rate threshold and the number of training iterations of the model reaches the preset training threshold, the second hybrid neural network model is obtained.
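The stopping condition in this embodiment can be sketched in Python as follows. The definition of the overlap rate as the fraction of positions where the manual and model annotations agree, and the default threshold values, are assumptions; the application does not give a formula or concrete thresholds.

```python
def overlap_rate(manual, predicted):
    """Fraction of positions where manual and model annotations agree."""
    if len(manual) != len(predicted):
        raise ValueError("annotation sequences must have the same length")
    if not manual:
        return 0.0
    return sum(m == p for m, p in zip(manual, predicted)) / len(manual)


def annotation_training_done(manual, predicted, iterations,
                             rate_threshold=0.9, iteration_threshold=100):
    """Both conditions must hold: overlap rate and training iteration count."""
    return (overlap_rate(manual, predicted) >= rate_threshold
            and iterations >= iteration_threshold)
```

Note that both conditions are required, so a model that agrees well with the manual annotations after only a few iterations would continue training.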
[0166] In one embodiment, the model acquisition module 901 is further configured to acquire a keyword sample from the keyword sample set;
[0167] Based on the keyword samples, the second hybrid neural network model is trained, and the network weights of the second hybrid neural network model are updated;
[0168] If the error between the output value of the second hybrid neural network model after updating the network weights and the sample value corresponding to the keyword sample is less than a preset error, it is determined whether each keyword sample in the keyword sample set has completed the training of the second hybrid neural network model.
[0169] If all keyword samples in the keyword sample set have completed the training of the second hybrid neural network model, the trained neural network model is obtained.
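The control flow of this weight update can be sketched as below. A single linear layer stands in for the second hybrid neural network model, and the learning rate, preset error, and step limit are hypothetical values; only the loop structure (take a keyword sample, update the weights, compare the post-update error with the preset error, then move on until every sample has been used) follows the description above.

```python
def predict(weights, features):
    """Stand-in model: a single linear layer without bias."""
    return sum(w * x for w, x in zip(weights, features))


def train_on_keyword_samples(samples, weights, lr=0.1,
                             preset_error=0.05, max_steps=1000):
    """Update weights sample by sample until each error is below preset_error.

    samples: list of (features, target) pairs, one per keyword sample.
    """
    for features, target in samples:
        for _ in range(max_steps):
            # Gradient step on the squared error for this keyword sample.
            error = predict(weights, features) - target
            for i, x in enumerate(features):
                weights[i] -= lr * error * x
            # Check the error of the model after the weight update.
            if abs(predict(weights, features) - target) < preset_error:
                break
        else:
            raise RuntimeError("keyword sample did not reach the preset error")
    # All keyword samples have been used: training is complete.
    return weights
```

This sketch does not revisit earlier samples, so with correlated samples a later update could push an earlier sample's error back above the preset value; a fuller version would add a final pass over the whole set.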
[0170] In one embodiment, the apparatus further includes a standardization module, which is used to standardize and store the shared demand information of the shared data samples using the Z-score standardization method.
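Z-score standardization rescales each value by subtracting the mean and dividing by the standard deviation. A minimal Python sketch follows; the use of the population standard deviation and the handling of a constant column are assumptions, and the storage step is omitted.

```python
import statistics


def z_score(values):
    """Standardize values to zero mean and unit (population) deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # Constant column: every value equals the mean.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]
```

For the input `[2, 4, 4, 4, 5, 5, 7, 9]`, whose mean is 5 and population standard deviation is 2, the first value maps to -1.5 and the last to 2.0.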
[0171] In one embodiment, the model acquisition module 901 is further configured to insert the optimized preset audit rules into the trained neural network model to acquire the audit model.
[0172] In one embodiment, the keyword sample set is obtained by extracting keywords from the shared data samples using the TextRank algorithm.
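TextRank ranks words by running a PageRank-style iteration over a word co-occurrence graph. The sketch below is a simplified, unweighted variant written for illustration; it assumes the shared data samples have already been tokenized, and the window size, damping factor, iteration count, and function name are hypothetical choices rather than values taken from this application.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=2, damping=0.85, iterations=50, top_k=3):
    """Return the top_k tokens ranked by an unweighted TextRank score."""
    # Link each token to the tokens that follow it within the window.
    neighbors = defaultdict(set)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != word:
                neighbors[word].add(tokens[j])
                neighbors[tokens[j]].add(word)
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        # Each word receives an equal share of every neighbor's score.
        scores = {
            word: (1 - damping) + damping * sum(
                scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]
```

The highest-ranked tokens would form the keyword sample set used to update the network weights.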
[0173] Specifically, the shared data auditing device provided in this application embodiment can implement all the method steps implemented in the above method embodiment and can achieve the same technical effect. Here, the parts that are the same as those in the method embodiment and the beneficial effects will not be described in detail.
[0174] Figure 10 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application. As shown in Figure 10, the electronic device may include a processor 1010, a communication interface 1020, a memory 1030, and a communication bus 1040, wherein the processor 1010, the communication interface 1020, and the memory 1030 communicate with each other via the communication bus 1040. The processor 1010 can call a computer program in the memory 1030 to execute the steps of the shared data auditing method, for example including:
[0175] Based on the trained neural network model, obtain the audit model;
[0176] Based on the aforementioned audit model, the shared data is audited.
[0177] The trained neural network model is obtained by training based on the following steps:
[0178] A first hybrid neural network model is established based on shared data samples;
[0179] The semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain the second hybrid neural network model;
[0180] The network weights of the second hybrid neural network model are updated based on the keyword sample set to obtain the trained neural network model.
[0181] Furthermore, the logical instructions in the aforementioned memory 1030 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0182] In one embodiment, establishing a first hybrid neural network model based on shared data samples includes:
[0183] Data identification is performed on the shared data sample to obtain the data type corresponding to the shared data sample;
[0184] The collinearity correlation between the data types is analyzed using a scatter plot matrix to obtain feature vectors.
[0185] Based on the feature vectors, a neural network is established;
[0186] Based on the aforementioned neural network and recurrent neural network, a first hybrid neural network model is established.
[0187] In one embodiment, the semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain a second hybrid neural network model, including:
[0188] Based on the manual annotation results and the annotation results corresponding to the first hybrid neural network model, the overlap rate of the annotation results is obtained;
[0189] When the overlap rate of the annotation results reaches the overlap rate threshold and the number of training iterations of the model reaches the preset training threshold, the second hybrid neural network model is obtained.
[0190] In one embodiment, updating the network weights of the second hybrid neural network model based on the keyword sample set to obtain the trained neural network model includes:
[0191] Obtain a keyword sample from the keyword sample set;
[0192] Based on the keyword samples, the second hybrid neural network model is trained, and the network weights of the second hybrid neural network model are updated;
[0193] If the error between the output value of the second hybrid neural network model after updating the network weights and the sample value corresponding to the keyword sample is less than a preset error, it is determined whether each keyword sample in the keyword sample set has completed the training of the second hybrid neural network model.
[0194] If all keyword samples in the keyword sample set have completed the training of the second hybrid neural network model, the trained neural network model is obtained.
[0195] In one embodiment, prior to establishing the first hybrid neural network model based on shared data samples, the following steps are included:
[0196] The Z-score standardization method is used to standardize and store the sharing requirement information of the shared data samples.
[0197] In one embodiment, obtaining the audit model based on the trained neural network model includes:
[0198] The optimized preset audit rules are inserted into the trained neural network model to obtain the audit model.
[0199] In one embodiment, the keyword sample set is obtained by extracting keywords from the shared data samples using the TextRank algorithm.
[0200] In another aspect, embodiments of this application also provide a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can perform the steps of the shared data auditing method provided in the above embodiments, for example including:
[0201] Based on the trained neural network model, obtain the audit model;
[0202] Based on the aforementioned audit model, the shared data is audited.
[0203] The trained neural network model is obtained by training based on the following steps:
[0204] A first hybrid neural network model is established based on shared data samples;
[0205] The semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain the second hybrid neural network model;
[0206] The network weights of the second hybrid neural network model are updated based on the keyword sample set to obtain the trained neural network model.
[0207] In another aspect, embodiments of this application also provide a processor-readable storage medium storing a computer program for causing a processor to execute the steps of the shared data auditing method provided in the above embodiments, for example including:
[0208] Based on the trained neural network model, obtain the audit model;
[0209] Based on the aforementioned audit model, the shared data is audited.
[0210] The trained neural network model is obtained by training based on the following steps:
[0211] A first hybrid neural network model is established based on shared data samples;
[0212] The semantic annotations in the first hybrid neural network model are trained based on the manually annotated results to obtain the second hybrid neural network model;
[0213] The network weights of the second hybrid neural network model are updated based on the keyword sample set to obtain the trained neural network model.
[0214] The processor-readable storage medium can be any available medium or data storage device that the processor can access, including but not limited to magnetic memory (e.g., floppy disk, hard disk, magnetic tape, magneto-optical disk (MO)), optical memory (e.g., CD, DVD, BD, HVD), and semiconductor memory (e.g., ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), solid-state drive (SSD)).
[0215] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0216] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0217] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A method for auditing shared data, characterized in that it comprises: obtaining an audit model based on a trained neural network model; auditing shared data based on the audit model and outputting abnormal shared data; performing cause analysis on the abnormal shared data to identify defects in the shared content data, and updating the audit model based on the defects; wherein the trained neural network model is obtained by training based on the following steps: establishing a first hybrid neural network model by combining a neural network built based on shared data samples with a recurrent neural network; training the semantic annotations in the first hybrid neural network model based on manual annotation results to obtain a second hybrid neural network model; and updating the network weights of the second hybrid neural network model based on a keyword sample set to obtain the trained neural network model.
2. The method for auditing shared data according to claim 1, characterized in that establishing the first hybrid neural network model by combining the neural network built based on shared data samples with the recurrent neural network comprises: performing data identification on the shared data samples to obtain the data types corresponding to the shared data samples; analyzing the collinearity correlation between the data types using a scatter plot matrix to obtain feature vectors; establishing a neural network based on the feature vectors; and establishing the first hybrid neural network model based on the neural network and the recurrent neural network.
3. The method for auditing shared data according to claim 1, characterized in that training the semantic annotations in the first hybrid neural network model based on the manual annotation results to obtain the second hybrid neural network model comprises: obtaining the overlap rate of the annotation results based on the manual annotation results and the annotation results corresponding to the first hybrid neural network model; and obtaining the second hybrid neural network model when the overlap rate of the annotation results reaches the overlap rate threshold and the number of training iterations of the first hybrid neural network model reaches the preset training threshold.
4. The method for auditing shared data according to claim 1, characterized in that updating the network weights of the second hybrid neural network model based on the keyword sample set to obtain the trained neural network model comprises: obtaining a keyword sample from the keyword sample set; training the second hybrid neural network model based on the keyword sample and updating the network weights of the second hybrid neural network model; if the error between the output value of the second hybrid neural network model after the network weights are updated and the sample value corresponding to the keyword sample is less than a preset error, determining whether every keyword sample in the keyword sample set has completed the training of the second hybrid neural network model; and if all keyword samples in the keyword sample set have completed the training of the second hybrid neural network model, obtaining the trained neural network model.
5. The method for auditing shared data according to claim 1, characterized in that, before establishing the first hybrid neural network model by combining the neural network built based on shared data samples with the recurrent neural network, the method further comprises: standardizing and storing the sharing requirement information of the shared data samples using the Z-score standardization method.
6. The method for auditing shared data according to claim 1, characterized in that obtaining the audit model based on the trained neural network model comprises: inserting the optimized preset audit rules into the trained neural network model to obtain the audit model.
7. The method for auditing shared data according to claim 1, characterized in that the keyword sample set is obtained by extracting keywords from the shared data samples using the TextRank algorithm.
8. A shared data auditing device, characterized in that it comprises: a model acquisition module, used to acquire an audit model based on a trained neural network model; an audit module, used to perform audit processing on shared data based on the audit model and output abnormal shared data; and an update module, used to perform cause analysis on the abnormal shared data, obtain defects in the shared content data, and update the audit model based on the defects in the shared content data; wherein the trained neural network model is obtained by training based on the following steps: establishing a first hybrid neural network model by combining a neural network built based on shared data samples with a recurrent neural network; training the semantic annotations in the first hybrid neural network model based on manual annotation results to obtain a second hybrid neural network model; and updating the network weights of the second hybrid neural network model based on a keyword sample set to obtain the trained neural network model.
9. An electronic device comprising a processor and a memory storing a computer program, characterized in that, when the processor executes the computer program, it implements the method for auditing shared data according to any one of claims 1 to 7.
10. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the method for auditing shared data according to any one of claims 1 to 7.