A cloud platform-based software running background data security management system
By introducing request parsing, feature extraction, baseline acquisition, risk assessment, and blocking decision-making modules into the online education platform, the problems of behavioral discrepancies and environmental interference under the matching state of user ability and course difficulty are solved, and efficient identification and adaptive detection of abnormal access are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HUNAN DIGITAL TECHNOLOGY CO LTD
- Filing Date
- 2026-05-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing online education data security solutions fail to effectively identify behavioral differences when users' abilities match the difficulty of courses, cannot identify continuous theft using temporal continuity and environmental interference, and have static and fixed detection thresholds, leading to misjudgments or missed detections of abnormal access.
The system employs a request parsing module, a feature extraction module, a baseline acquisition module, a risk assessment module, and a blocking decision module. Through voice activity detection, voiceprint feature extraction, context feature splicing, temporal continuity verification, and environmental noise compensation, it dynamically generates blocking thresholds to achieve layered detection and adaptive adjustment of user behavior.
It improves the accuracy of identifying spoofed learning behavior, accurately identifies persistent theft behavior, reduces false positives and false negatives, and adapts to changes in the stability of user behavior and the system's busy/idle periods.
Figure CN122241663A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of security management technology and relates to a software operation background data security management system based on a cloud platform.
Background Technology
[0002] In online English learning platforms, students upload classroom dialogues in real time for oral practice via voice interaction APIs. The voice data packets contain biometric information such as user voiceprint characteristics and pronunciation habits, as well as private content, making data security protection extremely important.
[0003] Currently, data security solutions in the online education field mainly focus on data transmission encryption and storage encryption. Anomaly detection at the API call level often employs call frequency limits, simple identity token verification, or anomaly detection methods based on fixed statistical thresholds. For example, Chinese Patent Publication No. CN112801834B describes a security management system and method applied to a smart education platform. This system manages security at different levels of the smart education platform through platform application security management modules, data security management modules, and big data component security management modules. In terms of speech feature processing, existing technologies utilize variational autoencoders for speaker identification or concatenate speech feature vectors with user information feature vectors for risk identification.
[0004] However, the existing technologies mentioned above have the following drawbacks: 1. The existing technologies do not establish a hierarchical behavioral baseline based on the matching status between course difficulty and user ability. All users are subject to a uniform detection strategy, which does not consider the differences in behavioral characteristics of the same user when facing courses of different difficulty levels, resulting in poor accuracy in detecting abnormal access disguised as learning behaviors that do not match the user's ability.
[0005] 2. Existing technologies cannot identify persistent theft that exploits temporal continuity and environmental interference. Attackers slowly crawl voice samples at extremely low frequencies, hijack legitimate sessions and simulate normal interaction rhythms, or use environmental fluctuations such as network latency and background noise to cover up abnormal behavior. Existing solutions lack temporal continuity verification and environmental noise compensation mechanisms, making it difficult to distinguish between real user fluctuations and attack behavior.
[0006] 3. Existing technologies use static, fixed detection thresholds that cannot adapt to individual behavioral fluctuations. In existing solutions, the blocking thresholds are not dynamically adjusted according to the stability of the user's behavior or the system's busy/idle periods. As a result, users with stable behavior are prone to being misjudged over minor deviations, while anomalies from users with large behavioral fluctuations may be missed because the thresholds are too broad.
Summary of the Invention
[0007] In view of this, in order to solve the problems mentioned in the background technology above, a software runtime background data security management system based on a cloud platform is proposed.
[0008] The objective of this invention can be achieved through the following technical solution: This invention provides a software runtime background data security management system based on a cloud platform, including: a request parsing module, a feature extraction module, a baseline acquisition module, a risk assessment module, and a blocking decision module.
[0009] The request parsing module is connected to the feature extraction module, the feature extraction module is connected to the baseline acquisition module, the baseline acquisition module is connected to the risk assessment module, and the risk assessment module is connected to the blocking decision module.
[0010] The request parsing module obtains the call request for the voice interaction API, parses the call request to generate request metadata and raw voice data stream.
[0011] The feature extraction module uses speech activity detection to remove the silence segments from the original speech data stream, performs embedded feature extraction on the speech data stream after removing the silence segments, and combines it with request metadata to extract contextual features, concatenating them to obtain the current fused feature vector.
[0012] The baseline acquisition module acquires the user identifier associated with the call request, determines the historical fusion feature vector set and user behavior baseline corresponding to the group according to difficulty and ability matching, and dynamically generates a confidence blocking threshold based on the user behavior baseline.
[0013] The risk assessment module performs temporal continuity verification and environmental noise compensation on the current fused feature vector, compares it with the historical fused feature vector set to calculate reconstruction error, and generates an anomaly risk score.
[0014] The blocking decision module blocks the call request and sends a voice biometric verification request when the abnormal risk score is greater than the confidence blocking threshold.
[0015] Compared with the prior art, the beneficial effects of the present invention are as follows: (1) The present invention divides the historical normal access records into three groups according to the matching status of the user's ability level baseline value and the course difficulty label: difficulty lower than ability, difficulty matching ability, and difficulty higher than ability. Within each group, the per-dimension median of the fused feature vectors is calculated as the baseline center value, the interquartile range is calculated as the tolerance interval width, and the covariance matrix is also calculated. By establishing a hierarchical behavioral baseline based on the matching status of course difficulty and user ability, the detection strategy can adapt to the behavioral differences of the same user under courses of different difficulty, directly improving the accuracy of identifying abnormal access disguised as learning behavior that does not match the user's ability.
[0016] (2) This invention performs a dimension-by-dimension temporal continuity check on the current fused feature vector and corrects the jump dimensions that exceed the preset multiple of the tolerance interval width back to a reasonable range. It also attenuates the timestamp-related dimensions based on network jitter parameters and adjusts the gain of the energy-related dimensions based on device noise parameters. As a result, it can separate the fluctuations of real user behavior from temporal continuity and environmental interference, and accurately identify continuous theft behavior that slowly crawls at extremely low frequencies or hijacks sessions and then simulates a normal rhythm.
[0017] (3) This invention extracts the tolerance interval width from the user behavior baseline to calculate the weights of each feature dimension, and calculates the weighted Mahalanobis distance using the covariance matrix as the Mahalanobis distance benchmark. Then, the mean of the tolerance interval width is multiplied by the time period base value corresponding to the current system time period to obtain the attenuation factor, and the product of the two is used as the dynamic confidence blocking threshold. This threshold adapts to the stability of the user's behavior and the busy / idle time of the system. The threshold is tightened for users with stable behavior to reduce false positives, and the threshold is relaxed for users with fluctuating behavior to avoid missed detections.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a schematic diagram of the module connections of the present invention;
[0020] Figure 2 is a logic diagram of the temporal continuity verification in the risk assessment module of the present invention;
[0021] Figure 3 is a logic diagram of the reconstruction error calculation in the risk assessment module of the present invention.
Detailed Implementation
[0022] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0023] As shown in Figure 1, the present invention provides a software runtime background data security management system based on a cloud platform, including: a request parsing module, a feature extraction module, a baseline acquisition module, a risk assessment module, and a blocking decision module.
[0024] The request parsing module is connected to the feature extraction module, the feature extraction module is connected to the baseline acquisition module, the baseline acquisition module is connected to the risk assessment module, and the risk assessment module is connected to the blocking decision module.
[0025] The request parsing module obtains the call request for the voice interaction API, parses the call request to generate request metadata and raw voice data stream.
[0026] Specifically, the goal of the request parsing module is to receive and initially process voice interaction requests from user clients, converting them from network transmission formats into standardized data structures required for subsequent security analysis. The processing flow begins at the cloud platform gateway layer, a network service component deployed in front of cloud applications, serving as a unified entry point for all external requests. This layer is responsible for traffic routing, load balancing, and the initial execution of security policies.
[0027] In this embodiment, the process of parsing and executing user call requests for the online English learning platform is as follows: the Hypertext Transfer Protocol request sent to the voice service interface is intercepted at the cloud platform gateway layer, and the request header and request body are extracted.
[0028] Extract the API key, request timestamp, and user identifier from the request header and combine them into request metadata.
[0029] The speech data payload is parsed from the request body, decompressed and decoded, and a pulse code modulation audio data stream is generated as the original speech data stream.
[0030] It should be noted that the request header contains metadata about the call request itself, the extracted API key is a unique string used to verify the client's identity and authorize its access permissions, the request timestamp is a numerical value that records the time the request was initiated, usually in Unix timestamp format, used to prevent replay attacks and perform timeliness verification, and the user identifier is a string that uniquely identifies the user currently operating.
[0031] The request body carries the actual data sent by the client. The voice data payload is compressed and encoded binary audio data. To save network bandwidth during transmission, the voice data payload typically uses methods such as... Since lossless compression algorithms are used for compression, decompression is required to restore the compressed data to its original size. The header byte sequence of the decompressed binary data block is then examined to identify the audio encoding format identifier. If MP3 is identified, frame synchronization, side information parsing, Huffman decoding, and subband synthesis filtering are performed in sequence to output a PCM sampling sequence. If AAC is identified, ADTS frame parsing, noiseless decoding, inverse quantization, spectrum reconstruction, and filter bank synthesis are performed in sequence to output a PCM sampling sequence. If PCM is identified, the binary data block is output directly as the PCM sampling sequence. The decoding methods for the different audio encoding format identifiers are existing technologies; this invention only outlines the flow and does not detail the specific execution process.
[0032] The output PCM sampling sequence is organized into a continuous pulse code modulation audio data stream according to sampling rate, bit depth, and number of channels.
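The parsing flow of paragraphs [0027] to [0032] can be sketched as follows. This is a minimal illustration, not the patented implementation: the header names `X-Api-Key`, `X-Request-Timestamp`, and `X-User-Id` are hypothetical, zlib is only a stand-in because the text does not name the compression method, and the payload is assumed to already be 16-bit mono PCM so that the MP3/AAC decoding branches are omitted.

```python
import zlib
from array import array
from dataclasses import dataclass


@dataclass
class RequestMetadata:
    api_key: str      # verifies the client's identity and permissions
    timestamp: int    # Unix timestamp of the request
    user_id: str      # uniquely identifies the operating user


def parse_call_request(headers: dict, body: bytes):
    """Split an intercepted HTTP request into metadata and a PCM stream.

    Header names are hypothetical; zlib is an assumed stand-in for the
    unspecified lossless compression of the voice payload.
    """
    metadata = RequestMetadata(
        api_key=headers["X-Api-Key"],
        timestamp=int(headers["X-Request-Timestamp"]),
        user_id=headers["X-User-Id"],
    )
    raw = zlib.decompress(body)
    # Assumption: the decompressed block is raw 16-bit PCM in native
    # byte order, so it can be used directly as the sample sequence.
    samples = array("h")
    samples.frombytes(raw)
    return metadata, samples
```

A gateway handler would call `parse_call_request` once per intercepted request and pass the two results on to feature extraction.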
[0033] The feature extraction module uses speech activity detection to remove the silence segments of the original speech data stream, performs embedded feature extraction on the speech data stream after removing the silence segments, and combines it with request metadata to extract contextual features, and concatenates them to obtain the current fused feature vector.
[0034] Specifically, after obtaining the request metadata and the raw speech data stream, the feature extraction module aims to transform the two heterogeneous data into a unified numerical representation that can be processed by machine learning models.
[0035] In this embodiment, the current fused feature vector is obtained by detecting silent segments in the original speech data stream using a speech activity detection algorithm, and obtaining effective speech segments after removing the silent segments.
[0036] Specifically, the raw audio data stream is divided into frames according to a preset frame length (20 milliseconds) and frame shift (10 milliseconds), and each frame is called a short time frame.
[0037] For each short-time frame, the sum of squared amplitudes of the sample points within it is calculated as the short-time energy value.
[0038] The maximum short-time energy value in the initial period of the original voice data stream (e.g., the first 200 milliseconds) is taken as the upper bound of the silence energy.
[0039] If the energy of the current frame is greater than the upper bound of the silence energy, it is judged as a speech frame; otherwise, it is judged as a silence frame. The silence segment consisting of three or more consecutive silence frames is removed, and the speech frames are retained and spliced together to form a valid speech segment.
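The energy-based speech activity detection of paragraphs [0036] to [0039] can be sketched as below. The function name and parameters are illustrative. One design choice is an assumption of this sketch: because frames overlap (20 ms length, 10 ms shift), each retained frame contributes only its first 10 ms of samples so that spliced audio is not duplicated.

```python
def remove_silence(samples, sample_rate, frame_ms=20, shift_ms=10,
                   init_ms=200, min_run=3):
    """Return the samples of the effective speech segment."""
    frame_len = sample_rate * frame_ms // 1000
    shift = sample_rate * shift_ms // 1000
    starts = range(0, len(samples) - frame_len + 1, shift)
    # Short-time energy: sum of squared amplitudes within each frame.
    energies = [sum(x * x for x in samples[s:s + frame_len]) for s in starts]
    if not energies:
        return []

    # Upper bound of silence energy: maximum energy among the frames that
    # lie entirely inside the initial period (first 200 ms by default).
    init_len = sample_rate * init_ms // 1000
    n_init = max(1, (init_len - frame_len) // shift + 1)
    bound = max(energies[:n_init])

    is_silence = [e <= bound for e in energies]
    keep = [True] * len(energies)
    i = 0
    while i < len(is_silence):
        if not is_silence[i]:
            i += 1
            continue
        j = i
        while j < len(is_silence) and is_silence[j]:
            j += 1
        if j - i >= min_run:          # drop runs of 3 or more silence frames
            for k in range(i, j):
                keep[k] = False
        i = j

    speech = []
    for s, kept in zip(starts, keep):
        if kept:
            speech.extend(samples[s:s + shift])   # hop-sized slice, see above
    return speech
```

Note that if a recording starts with speech rather than silence, the bound is set too high; the sketch follows the rule as described and does not attempt to correct for that case.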
[0040] Speaker voiceprint features and semantic features of speech content are extracted from effective speech segments. The extracted voiceprint feature vector and semantic feature vector are weighted and summed according to preset weights to generate speech fusion features.
[0041] Extract user identifier, request timestamp, and API key from request metadata. Based on user identifier, count the total number of API calls and the total amount of uplink voice data within the sliding time window. Vectorize and concatenate the request timestamp, API key, total number of API calls, and total amount of uplink voice data to generate contextual features.
[0042] The speech fusion features are concatenated with the context features to form the current fusion feature vector.
[0043] It should be noted that the extraction of speaker voiceprint features and semantic features of speech content for effective speech segments are based on existing technologies. Specifically, speaker voiceprint feature extraction can be performed using the x-vector system based on deep neural networks or the traditional i-vector method, while semantic features of speech content can be extracted using acoustic models based on end-to-end speech recognition, such as DeepSpeech and Wav2Vec2. The weights assigned to the voiceprint feature vector and semantic feature vector are pre-calibrated based on the discriminative experiments of real attack samples and normal samples, and can be exemplarily set to 0.6 and 0.4, respectively.
[0044] The span of the sliding time window is set to the average conversation duration of users on the online English learning platform. The corresponding time period before the timestamp of the current call request is extracted according to this span as the sliding time window.
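A minimal sketch of the fusion described in paragraphs [0040] to [0044] follows. Several details are assumptions made for illustration only: the voiceprint and semantic vectors are taken to have equal length so that they can be weighted and summed, the timestamp is vectorized as a fraction of the day, and the API key is vectorized with a CRC32 hash. The embedding extractors themselves (x-vector, Wav2Vec2, and so on) are outside the sketch.

```python
import zlib


def speech_fusion(voiceprint, semantic, w_voice=0.6, w_sem=0.4):
    """Weighted sum of voiceprint and semantic vectors (example weights)."""
    if len(voiceprint) != len(semantic):
        raise ValueError("vectors must share one dimension for a weighted sum")
    return [w_voice * v + w_sem * s for v, s in zip(voiceprint, semantic)]


def context_features(timestamp, api_key, history, window_s):
    """Build contextual features for one call request.

    history: list of (timestamp, uplink_bytes) for the user's earlier calls.
    window_s: sliding-window span in seconds (average conversation duration).
    """
    in_window = [(t, n) for t, n in history
                 if timestamp - window_s <= t < timestamp]
    call_count = len(in_window)
    total_bytes = sum(n for _, n in in_window)
    day_fraction = (timestamp % 86400) / 86400        # assumed encoding
    key_code = zlib.crc32(api_key.encode()) / 2**32   # assumed encoding
    return [day_fraction, key_code, float(call_count), float(total_bytes)]


def fused_feature_vector(voiceprint, semantic, context):
    """Concatenate speech fusion features with contextual features."""
    return speech_fusion(voiceprint, semantic) + list(context)
```

For example, `fused_feature_vector([1.0, 0.0], [0.0, 1.0], ctx)` yields a vector whose first two entries are 0.6 and 0.4, followed by the four contextual entries.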
[0045] The baseline acquisition module acquires the user identifier associated with the call request, determines the historical fusion feature vector set and user behavior baseline corresponding to the group according to difficulty and ability matching, and dynamically generates a confidence blocking threshold based on the user behavior baseline.
[0046] In this embodiment, the step of determining the historical fusion feature vector set and user behavior baseline corresponding to the group based on difficulty and ability matching includes: Step 1, obtaining normal call records: obtaining API call records associated with the user identifier that are in the time sequence before the current call request and have been marked as normal access. Each record contains a fusion feature vector, a course difficulty label, and the user's real-time ability score.
[0047] The criteria for marking API call records as normal access are: the API call request was previously allowed by the blocking decision module and subsequent voice biometric verification was passed, or the request has been manually reviewed and confirmed as normal access by the platform administrator.
[0048] The course difficulty level is preset to three levels: beginner, intermediate, and advanced.
[0049] The user's real-time ability score is calculated and stored in real time by the platform's built-in weighted scoring mechanism, based on indicators such as the user's historical course completion rate, test accuracy, and pronunciation accuracy. The present invention retrieves this score directly by user identifier; it is a continuous value from 0 to 100.
[0050] The platform's built-in weighted scoring mechanism employs a well-known multi-indicator linear weighted comprehensive scoring method, specifically including: First, selecting user's historical course completion rate, test accuracy, and pronunciation accuracy as evaluation indicators; second, using min-max normalization to map the original values of each indicator to the [0, 1] interval; third, assigning preset weights to each indicator (the sum of the weights is 1); finally, calculating the sum of the products of the normalized values of each indicator and their corresponding weights, and multiplying by 100 to obtain a capability score of 0-100. The preset weights can be configured according to the platform's operational strategy; for example, the weight for course completion is 0.4, the weight for test accuracy is 0.4, and the weight for pronunciation accuracy is 0.2.
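The linear weighted scoring described above can be illustrated with a short sketch. The dictionary keys and the per-indicator bounds are hypothetical; the weights 0.4, 0.4, and 0.2 are the example values given in the text.

```python
def min_max(value, lo, hi):
    """Map a raw indicator value to [0, 1]; a degenerate range maps to 0."""
    if hi == lo:
        return 0.0
    return (value - lo) / (hi - lo)


def ability_score(indicators, bounds, weights):
    """Weighted sum of normalized indicators, scaled to 0-100."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    total = sum(weights[name] * min_max(indicators[name], *bounds[name])
                for name in weights)
    return 100.0 * total


# Example weights from the text; indicator names are illustrative.
WEIGHTS = {"completion": 0.4, "test_accuracy": 0.4, "pronunciation": 0.2}
BOUNDS = {name: (0.0, 1.0) for name in WEIGHTS}
```

With a completion rate of 0.8, a test accuracy of 0.9, and a pronunciation accuracy of 0.5, the score is 100 × (0.4 × 0.8 + 0.4 × 0.9 + 0.2 × 0.5) = 78.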
[0051] Step 2, Ability Level Mapping: Calculate the median of the user's ability scores within a preset historical period as the ability level baseline value, and compare it with the platform's preset global ability quantiles to map it to the beginner, intermediate, or advanced ability level. Specifically, the 33.3% and 66.7% ability quantiles of the user group to which the current user belongs (such as the same age group and the same language proficiency level) are used as the first threshold and the second threshold, respectively. An ability level baseline value less than the first threshold is mapped to the beginner ability level; a value between the first and second thresholds (inclusive) is mapped to the intermediate ability level; and a value greater than the second threshold is mapped to the advanced ability level.
[0052] Special note: "Same age group" specifically refers to all users whose registered age differs by no more than 3 years, and "same language proficiency level" specifically refers to users whose scores differ by no more than 10 after the platform's initial ability assessment.
[0053] Step 3: Group by Ability-Difficulty Matching Status: Compare the course difficulty label in each API call record with the mapping level of the user's ability level baseline value: If the level corresponding to the course difficulty label is lower than the user's ability mapping level, it is marked as the group with difficulty lower than ability; if the two levels are the same, it is marked as the group with difficulty matching ability; if the level corresponding to the course difficulty label is higher than the user's ability mapping level, it is marked as the group with difficulty higher than ability; collect the historical fusion feature vectors corresponding to each of the three groups simultaneously.
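Steps 2 and 3 can be sketched together as follows. The level and group names are illustrative labels, and the use of `statistics.quantiles` with its default interpolation is an assumption, since the text does not say how the 33.3% and 66.7% quantiles are estimated.

```python
import statistics

LEVELS = {"beginner": 0, "intermediate": 1, "advanced": 2}


def ability_level(user_scores, cohort_scores):
    """Map the median of a user's scores to a level via cohort tertiles."""
    baseline = statistics.median(user_scores)
    # n=3 yields two cut points: the 33.3% and 66.7% quantiles.
    first, second = statistics.quantiles(cohort_scores, n=3)
    if baseline < first:
        return "beginner"
    if baseline <= second:          # between the thresholds, inclusive
        return "intermediate"
    return "advanced"


def match_group(course_level, user_level):
    """Compare course difficulty with the user's mapped ability level."""
    course, user = LEVELS[course_level], LEVELS[user_level]
    if course < user:
        return "difficulty_below_ability"
    if course == user:
        return "difficulty_matches_ability"
    return "difficulty_above_ability"
```

Each normal call record is then filed under the group returned by `match_group`, together with its fusion feature vector.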
[0054] Step 4: Calculate robust statistics within each group: For the set of historical fusion feature vectors within each group, calculate the following robust statistics for each feature dimension: take the median of all values as the baseline center value of the group; take the interquartile range of all values as the tolerance interval width of the group.
[0055] Simultaneously, the covariance matrix between each feature dimension within each group is calculated.
[0056] The group identifier, baseline center value vector, tolerance interval width vector, and covariance matrix are collectively encapsulated into a user behavior baseline.
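The robust statistics of Step 4 might be computed as in the sketch below. The dictionary layout of the baseline is hypothetical, the quartile estimator is the default of `statistics.quantiles` (an assumption), and the sample covariance with an n − 1 denominator is likewise an assumed convention. At least two vectors are needed per group.

```python
import statistics


def build_baseline(group_id, vectors):
    """Encapsulate center, tolerance width, and covariance for one group."""
    n = len(vectors)
    if n < 2:
        raise ValueError("need at least two historical vectors")
    dims = list(zip(*vectors))            # one tuple of values per dimension

    center = [statistics.median(d) for d in dims]

    width = []
    for d in dims:
        q1, _, q3 = statistics.quantiles(d, n=4)
        width.append(q3 - q1)             # interquartile range

    means = [sum(d) / n for d in dims]
    cov = [[sum((a - ma) * (b - mb) for a, b in zip(da, db)) / (n - 1)
            for db, mb in zip(dims, means)]
           for da, ma in zip(dims, means)]

    return {"group": group_id, "center": center, "width": width, "cov": cov}
```

The same function can produce the global baseline when it is given all normal records of the user group instead of one difficulty-matching group.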
[0057] Step 5: Match the current call request and output the set: Obtain the course difficulty tag and user ability level benchmark value corresponding to the current call request, and determine the group to which the current call request belongs according to the comparison rules in Step 3; if there is at least one historical fusion feature vector in the group, extract all historical fusion feature vectors in the group into a historical fusion feature vector set, and output the user behavior baseline of the group at the same time; otherwise, downgrade to using the global baseline composed of all normal records.
[0058] It should be noted that the process of obtaining the global baseline is as follows: retrieve API call records marked as normal access within a preset historical period from the platform database of the user group to which the current user belongs, extract the fusion feature vector from each record, and form a global normal fusion feature vector set.
[0059] For the global normal fusion feature vector set, the median of each feature dimension is calculated as the global baseline center value, the interquartile range of each feature dimension is calculated as the global tolerance interval width, and the covariance matrix between each feature dimension is calculated as the global covariance matrix.
[0060] The global baseline center value vector, the global tolerance interval width vector, and the global covariance matrix are encapsulated together as the global baseline.
[0061] In addition, if the global baseline is not built, the abnormal risk score and threshold comparison step is skipped, all call requests are allowed by default and voice biometric verification is forcibly triggered until a preset minimum number of normal access records are collected.
[0062] In this embodiment, the confidence blocking threshold is dynamically generated based on the user behavior baseline, including: extracting the baseline center value vector, tolerance interval width vector, and covariance matrix from the user behavior baseline of the group to which the current call request belongs; wherein, each element in the baseline center value vector corresponds to a feature dimension in the current fused feature vector, and each element in the tolerance interval width vector also corresponds to the same feature dimension, used to characterize the allowable range of normal fluctuation of the feature dimension; the feature dimensions of the current fused feature vector include: the voiceprint component dimension and semantic component dimension under the voice fusion feature, and the timestamp component dimension, device fingerprint encoding component dimension, API call frequency component dimension, and total voice data component dimension under the context feature.
[0063] The weights of each feature dimension are calculated based on the tolerance interval width. Specifically, the inverse of the tolerance interval width of each feature dimension is calculated as the original weight; the original weights of all feature dimensions are summed to obtain the total weight; and the original weight of each dimension is divided by the total weight to obtain the normalized dimension weight.
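The inverse-width weighting can be written in a few lines. The small floor `eps` is an assumption added here to avoid division by zero when a dimension has zero interquartile range; the text does not address that case.

```python
def dimension_weights(widths, eps=1e-9):
    """Normalize reciprocal tolerance widths so that they sum to 1."""
    raw = [1.0 / max(w, eps) for w in widths]   # original weights
    total = sum(raw)                            # total weight
    return [r / total for r in raw]             # normalized dimension weights
```

For widths of 1, 2, and 4 the weights are 4/7, 2/7, and 1/7, so the most stable dimension (smallest width) has the most influence on the distance.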
[0064] The dimension weights are applied to the linear weighted calculation of the difference between the current fused feature vector and the baseline center value vector, and the weighted Mahalanobis distance is calculated in combination with the covariance matrix as the Mahalanobis distance benchmark value; specifically, the following steps are performed: the difference between the current fused feature vector and the baseline center value vector is calculated to obtain the deviation vector.
[0065] The normalized dimension weights of the feature dimensions are arranged in the same order as the feature dimensions of the deviation vector to form the dimension weight vector.
[0066] The deviation vector is multiplied element-wise by the dimension weight vector to obtain the weighted deviation vector.
[0067] The Mahalanobis distance baseline value is calculated based on the covariance matrix and the weighted deviation vector. The calculation formula is as follows: $$D_M = \sqrt{\mathbf{d}_w^{\top} \Sigma^{-1} \mathbf{d}_w}$$
[0068] In the formula, $D_M$ represents the calculated Mahalanobis distance baseline value, $\Sigma$ represents the covariance matrix, $\Sigma^{-1}$ denotes the inverse of the covariance matrix, and $\mathbf{d}_w$ represents the weighted deviation vector.
[0069] $\mathbf{d}_w^{\top} \Sigma^{-1} \mathbf{d}_w$ is a quadratic form of the weighted deviation vector and the inverse of the covariance matrix. Essentially, it performs a whitening transformation on the weighted deviation vector using the inverse of the covariance matrix, eliminating correlations and scale differences between dimensions to obtain a dimensionless scalar squared distance. The larger this value, the greater the deviation of the current sample from the baseline center in a statistical distributional sense.
[0070] The square root reduces the above quadratic form to an intuitive distance, analogous to a distance in Euclidean space.
[0071] This formula measures the statistical distance between the current fused feature vector and the user behavior baseline. The larger the calculated Mahalanobis distance baseline value, the more the current behavior deviates from the user's normal pattern.
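The weighted Mahalanobis calculation (deviation vector, element-wise weighting, then the square root of the quadratic form with the inverse covariance matrix) can be sketched without external libraries. The helper `solve` is a plain Gaussian elimination written for this illustration; in practice a numerical library would normally replace it. Solving the linear system avoids forming the inverse matrix explicitly.

```python
import math


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n + 1):
                a[r][c] -= factor * a[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(a[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (a[r][n] - tail) / a[r][r]
    return x


def weighted_mahalanobis(current, center, weights, cov):
    """Distance of the weighted deviation vector under covariance `cov`."""
    deviation = [c - m for c, m in zip(current, center)]
    weighted = [w * d for w, d in zip(weights, deviation)]
    whitened = solve(cov, weighted)          # equals inverse(cov) @ weighted
    quadratic = sum(d * v for d, v in zip(weighted, whitened))
    return math.sqrt(quadratic)
```

With `cov = [[4, 0], [0, 1]]`, a current vector `[4, 2]`, a zero center, and equal weights of 0.5, the weighted deviation is `[2, 1]` and the quadratic form is 2²/4 + 1²/1 = 2, giving a distance of √2.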
[0072] The mean of each element in the tolerance interval width vector is taken as the average tolerance width. The current system time period identifier is obtained. The system time period identifier includes three categories: peak time period, off-peak time period, and nighttime time period. Each system time period corresponds to a preset time period base value. The reciprocal of the average tolerance width is normalized and multiplied by the time period base value to obtain the attenuation factor. The attenuation factor is inversely proportional to the average tolerance width. The value range of the attenuation factor is limited to between 0.5 and 1.5. When the average tolerance width is large, the attenuation factor tends to be 0.5. When the average tolerance width is small, the attenuation factor tends to be 1.5.
[0073] The product of the Mahalanobis distance baseline value and the attenuation factor is used as the confidence blocking threshold. After each successful voice biometric verification, the fused feature vector of the current call request, its corresponding course difficulty label, and the user's real-time ability score are added to the corresponding historical group, triggering the recalculation of the baseline center value vector, tolerance interval width vector, and covariance matrix of the historical group, thereby achieving incremental updates of the user behavior baseline.
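A sketch of the attenuation factor and the resulting threshold is given below. The text does not specify how the reciprocal of the average tolerance width is normalized, nor the time period base values, so both are assumptions here: the reciprocal is scaled by a hypothetical reference width `ref_width`, and the entries of `PERIOD_BASE` are placeholder numbers.

```python
# Hypothetical base values for the three system time periods.
PERIOD_BASE = {"peak": 1.1, "off_peak": 1.0, "night": 0.9}


def attenuation_factor(widths, period, ref_width=1.0):
    """Clamp (normalized reciprocal of mean width) x (period base value)."""
    mean_width = sum(widths) / len(widths)
    normalized = ref_width / max(mean_width, 1e-9)   # assumed normalization
    factor = normalized * PERIOD_BASE[period]
    return min(1.5, max(0.5, factor))                # limit to [0.5, 1.5]


def blocking_threshold(distance_baseline, widths, period, ref_width=1.0):
    """Confidence blocking threshold = distance baseline x attenuation."""
    return distance_baseline * attenuation_factor(widths, period, ref_width)
```

Under these assumptions a wide average tolerance (for example 4) drives the factor to its floor of 0.5, while a narrow one (for example 0.1) drives it to the ceiling of 1.5, matching the limits stated in paragraph [0072].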
[0074] The risk assessment module performs temporal continuity verification and environmental noise compensation on the current fused feature vector, compares it with the historical fused feature vector set to calculate reconstruction error, and generates an anomaly risk score.
[0075] As shown in Figure 2, in this embodiment, the temporal continuity verification of the current fused feature vector includes: obtaining the fused feature vector of the previous normal call request corresponding to the user identifier, and performing a dimension-by-dimension difference calculation between it and the current fused feature vector to obtain a difference vector; if the user identifier does not have any previous normal call request, the temporal continuity verification is skipped, and the current fused feature vector is directly output as the result after the temporal continuity verification without any dimension replacement or correction.
[0076] The feature dimensions whose absolute difference value in the difference vector is greater than a preset multiple of the width of their corresponding tolerance interval are designated as jump dimensions. The preset multiple is 1.5, derived from the classic box-plot rule in statistics for identifying outliers: values lying more than 1.5 times the interquartile range beyond the quartiles are considered outliers. This rule is used to determine whether the current value of a certain dimension has undergone an unreasonable jump relative to the previous normal value. In other embodiments of the invention, the preset multiple can also be configured between 1.2 and 2.0 according to the system's sensitivity requirements.
[0077] For each component marked as a jump dimension, its value is replaced with the algebraic sum of the corresponding dimension's value in the previous normal fusion feature vector and the tolerance interval width. The direction of the algebraic sum is consistent with the sign of the difference, which means: if the current value is larger than the previous normal value, it is corrected to the previous normal value plus one tolerance interval width; if it is smaller than the previous normal value, it is corrected to the previous normal value minus one tolerance interval width.
[0078] The components not marked as jump dimensions are kept unchanged, and the replaced and unchanged components are recombined to obtain the current fused feature vector after temporal continuity verification.
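The jump-dimension correction described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `verify_temporal_continuity` and its parameter names are hypothetical, the difference is taken as current value minus previous normal value, and the default multiple of 1.5 is the example value given in the description.

```python
from typing import List, Optional, Sequence


def verify_temporal_continuity(
    current: Sequence[float],
    previous_normal: Optional[Sequence[float]],
    tolerance_width: Sequence[float],
    multiple: float = 1.5,
) -> List[float]:
    """Replace jump dimensions with previous value +/- one tolerance width."""
    if previous_normal is None:
        # No previous normal call request: skip verification, return unchanged.
        return list(current)
    corrected = []
    for cur, prev, width in zip(current, previous_normal, tolerance_width):
        diff = cur - prev
        if abs(diff) > multiple * width:
            # Jump dimension: step one tolerance width in the direction of the difference.
            corrected.append(prev + width if diff > 0 else prev - width)
        else:
            # Not a jump dimension: keep the current value.
            corrected.append(cur)
    return corrected


# Dimensions 0 and 2 jump (|9| > 3 and |-5| > 3); dimension 1 does not (0.5 <= 1.5).
result = verify_temporal_continuity([10.0, 2.0, -5.0], [1.0, 1.5, 0.0], [2.0, 1.0, 2.0])
```

In this example the first component is pulled back to 1.0 + 2.0 and the third to 0.0 - 2.0, while the second component passes through unchanged.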
[0079] In this embodiment, environmental noise compensation is performed on the current fused feature vector, including: reading the latency value and packet loss rate of the current call request from the gateway transport layer, normalizing them, and summing them to obtain the network jitter parameter.
[0080] Extract the energy of each short frame within the first silent segment from the raw speech data stream, and take the median of all frame energies as the background noise energy estimate; calculate the average frame energy of the effective speech segments, and use the ratio of the background noise energy to the average frame energy of the effective speech segments as the device noise parameter.
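The two compensation parameters can be computed along the following lines. This is a sketch under stated assumptions: the description does not specify how latency and packet loss are normalized, so the ceiling `max_latency_ms` and the clipping to [0, 1] are hypothetical choices, and all function names are illustrative.

```python
from statistics import mean, median
from typing import Sequence


def network_jitter_parameter(
    latency_ms: float, packet_loss_rate: float, max_latency_ms: float = 1000.0
) -> float:
    """Sum of the normalized latency and the normalized packet loss rate."""
    # Assumption: latency is scaled by a hypothetical ceiling and clipped to [0, 1].
    norm_latency = min(max(latency_ms / max_latency_ms, 0.0), 1.0)
    # Assumption: the packet loss rate is already a fraction in [0, 1].
    norm_loss = min(max(packet_loss_rate, 0.0), 1.0)
    return norm_latency + norm_loss


def device_noise_parameter(
    silent_frame_energies: Sequence[float], speech_frame_energies: Sequence[float]
) -> float:
    """Ratio of background noise energy to average effective-speech frame energy."""
    # Median frame energy of the first silent segment estimates the background noise.
    background = median(silent_frame_energies)
    speech_average = mean(speech_frame_energies)
    if speech_average <= 0:
        raise ValueError("effective speech segments must have positive average energy")
    return background / speech_average


jitter = network_jitter_parameter(latency_ms=200.0, packet_loss_rate=0.05)
noise = device_noise_parameter([0.1, 0.2, 0.9], [4.0, 6.0])
```

Using the median for the background estimate keeps a single loud frame in the silent segment (0.9 in the example) from inflating the noise parameter.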
[0081] The network jitter parameter is used to attenuate and correct the timestamp-related dimensions in the context features. The timestamp-related dimensions include the time period encoding of the request timestamp and the request interval deviation value. The request interval deviation value specifically refers to the difference between the current request timestamp and the previous normal request timestamp, minus the median of the user's historical normal request intervals.
[0082] The decay-corrected value is calculated for each timestamp-related dimension using the following formula: $T' = T \cdot e^{-\alpha J}$.
[0083] In the formula, $T$ represents the raw value of the timestamp-related dimension; $T'$ represents the value of the timestamp-related dimension after decay correction; $\alpha$ represents the network jitter attenuation coefficient, used to control the attenuation strength of the network jitter parameter on the timestamp-related dimensions, for which an example value of 2 can be used; $J$ represents the network jitter parameter (dimensionless).
[0084] $e^{-\alpha J}$ is an exponential decay factor, whose value decreases as the network jitter parameter increases, used to simulate the exponential decline in the credibility of timestamp-related dimensions due to network jitter.
[0085] The entire formula means that the original value of the timestamp-related dimension is multiplied by a decay factor to obtain the corrected value. The more severe the network jitter, the closer the corrected value is to 0, so as to reduce the impact of this dimension in subsequent evaluation and help avoid misleading anomaly detection due to timestamp distortion caused by network latency or packet loss.
[0086] Gain adjustment is performed on the energy-related dimension of speech fusion features based on device noise parameters. The energy-related dimension includes short-time energy mean and amplitude variance. The short-time energy mean refers to the average value of the energy of each frame in the effective speech segment, and the amplitude variance refers to the variance of the amplitude of the sampling points in the effective speech segment.
[0087] The gain-adjusted value for each energy-related dimension is calculated using the following formula: $E' = E \cdot (1 + \gamma D)$.
[0088] In the formula, $E$ represents the raw value of the energy-related dimension; $E'$ represents the value of the energy-related dimension after gain adjustment; $D$ indicates the device noise parameter; $\gamma$ represents the noise gain coefficient, used to control the gain strength of the device noise parameter on the energy-related dimensions, for which an example value of 1.5 can be used.
[0089] $(1 + \gamma D)$ is the linear gain factor, whose value increases linearly with the device noise parameter. It is used to compensate for the attenuation effect of background noise on the speech energy-related dimensions.
[0090] The overall formula means that the original value of the energy-related dimension is multiplied by a linear gain factor to obtain the compensated value. The stronger the environmental noise, the greater the compensation factor, which restores the speech energy that was originally submerged by noise to a level close to that of a noise-free environment. This helps to avoid misjudging abnormalities due to weak features caused by noise in the terminal's data acquisition.
[0091] After attenuation correction for timestamp-related dimensions and gain adjustment for energy-related dimensions, the current fused feature vector after environmental noise compensation is obtained.
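As an illustration of the two corrections, the sketch below assumes that the exponential decay factor takes the form exp(-alpha * jitter) and that the linear gain factor takes the form 1 + gamma * noise, which matches the verbal description of both factors. The function name, the index-list arguments, and the symbols `alpha` and `gamma` are assumptions made for this example; the default coefficients 2 and 1.5 are the example values from the description.

```python
import math
from typing import Iterable, List, Sequence


def compensate_environment(
    vector: Sequence[float],
    timestamp_dims: Iterable[int],
    energy_dims: Iterable[int],
    jitter: float,
    noise: float,
    alpha: float = 2.0,
    gamma: float = 1.5,
) -> List[float]:
    """Attenuate timestamp-related dimensions and boost energy-related dimensions."""
    decay = math.exp(-alpha * jitter)  # shrinks toward 0 as network jitter grows
    gain = 1.0 + gamma * noise         # grows linearly with the device noise parameter
    compensated = list(vector)
    for i in timestamp_dims:
        compensated[i] *= decay
    for i in energy_dims:
        compensated[i] *= gain
    return compensated


# Dimension 0 is timestamp-related, dimension 1 is energy-related, dimension 2 is untouched.
compensated = compensate_environment(
    [10.0, 4.0, 7.0], timestamp_dims=[0], energy_dims=[1], jitter=0.5, noise=0.2
)
```

With zero jitter and zero noise both factors equal 1, so the vector passes through unchanged.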
[0092] As shown in Figure 3, in this embodiment, calculating the reconstruction error against the historical fusion feature vector set includes: using the historical fusion feature vector set as normal samples to train a variational autoencoder model. During training, the encoder learns to map the input feature vector to the mean and log-variance of the latent variable distribution, the decoder learns to reconstruct the original feature vector from the latent variables, and the optimization objective is to minimize the reconstruction error.
[0093] It should be noted that the training process is executed only when the number of samples in the historical fusion feature vector set is greater than or equal to the preset minimum number of training samples. Otherwise, the reconstruction error calculation is skipped, the blocking decision module treats the current call request as low risk, and a global weighted scoring strategy is adopted: the anomaly risk score is set to the average reconstruction error of the global normal samples, and this score participates in the subsequent threshold comparison.
[0094] After training is complete, the current fused feature vector is input into the encoder, which outputs the mean and log-variance of the corresponding latent variable distribution.
[0095] Latent variables are obtained by sampling from a Gaussian distribution defined by the mean and log-variance of the latent variable distribution.
[0096] The latent variables are input into the decoder to obtain the reconstructed fused feature vector.
[0097] The residual vector between the current fused feature vector and the reconstructed fused feature vector is calculated. The residual vectors corresponding to all samples in the historical fused feature vector set are obtained in the same way, and the mean vector and covariance matrix of this residual vector set are calculated. The Mahalanobis distance of the current residual vector is then calculated according to the formula given above for the Mahalanobis distance benchmark value, and is used as the anomaly risk score.
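The residual-based score can be sketched in plain Python as below. This is an unweighted Mahalanobis distance shown for illustration only; the small `ridge` term added to the covariance diagonal is an assumption introduced to keep the matrix invertible and is not part of the description, and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def mean_and_covariance(
    rows: Sequence[Sequence[float]], ridge: float = 1e-6
) -> Tuple[List[float], List[List[float]]]:
    """Sample mean and covariance of the historical residual vectors (needs >= 2 rows)."""
    n, d = len(rows), len(rows[0])
    mu = [sum(r[k] for r in rows) / n for k in range(d)]
    cov = [
        [sum((r[a] - mu[a]) * (r[b] - mu[b]) for r in rows) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    for k in range(d):
        cov[k][k] += ridge  # assumption: tiny regularizer against singular covariance
    return mu, cov


def solve(matrix: Sequence[Sequence[float]], rhs: Sequence[float]) -> List[float]:
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(matrix[i]) + [rhs[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for row in range(col + 1, n):
            factor = aug[row][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[row][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def anomaly_risk_score(
    residual: Sequence[float], historical_residuals: Sequence[Sequence[float]]
) -> float:
    """Mahalanobis distance of the current residual from the historical residuals."""
    mu, cov = mean_and_covariance(historical_residuals)
    centered = [r - m for r, m in zip(residual, mu)]
    weighted = solve(cov, centered)  # equals cov^-1 @ centered
    return math.sqrt(max(sum(c * w for c, w in zip(centered, weighted)), 0.0))


history = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
near_score = anomaly_risk_score([0.0, 0.0], history)
far_score = anomaly_risk_score([2.0, 0.0], history)
```

A residual that sits at the center of the historical residuals scores close to zero, and the score grows as the residual moves away relative to the historical spread.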
[0098] It should be noted that the variational autoencoder model consists of an encoder network, a decoder network, and a sampling layer connected in series between them. The encoder is responsible for compressing high-dimensional input data into a low-dimensional probability distribution representation containing target information. The decoder attempts to reconstruct the original input data from the probability distribution representation. The sampling layer performs the following operations: receiving the mean vector $\mu$ and the log-variance vector $\log\sigma^2$ output by the encoder; calculating the standard deviation vector $\sigma = \exp\left(\tfrac{1}{2}\log\sigma^2\right)$; randomly sampling from the standard normal distribution $\mathcal{N}(0, I)$ a noise vector $\epsilon$ of the same dimension as $\mu$; and generating the latent variable through the reparameterization technique as $z = \mu + \sigma \odot \epsilon$, where $\odot$ indicates element-wise multiplication.
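The sampling layer corresponds to the standard reparameterization step of a variational autoencoder: the standard deviation is exp(0.5 * log-variance), and the latent variable is the mean plus the standard deviation multiplied element-wise by standard normal noise. A minimal sketch follows; the function name `sample_latent` and the optional `rng` argument are hypothetical and exist only to make the example reproducible.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_latent(
    mu: Sequence[float],
    log_var: Sequence[float],
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Draw z = mu + sigma * eps with eps sampled from a standard normal distribution."""
    rng = rng if rng is not None else random.Random()
    sigma = [math.exp(0.5 * lv) for lv in log_var]  # standard deviation from log-variance
    eps = [rng.gauss(0.0, 1.0) for _ in mu]         # noise with the same dimension as mu
    return [m + s * e for m, s, e in zip(mu, sigma, eps)]


# With a very negative log-variance the sample collapses onto the mean.
z = sample_latent([1.0, -2.0], [-1000.0, -1000.0], random.Random(0))
```

Because the randomness is isolated in the noise vector, the mean and log-variance remain differentiable inputs, which is what allows the encoder to be trained by backpropagation.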
[0099] Training the variational autoencoder aims to enable the encoder and decoder to work together to reconstruct the input with low error, using only normal samples. The training process is performed in the following steps: A1. Construct a training sample set by using the set of historical fused feature vectors as normal samples.
[0100] A2. Define the loss function: the total loss function consists of two parts, the reconstruction loss and the KL divergence: $L = \frac{1}{N}\sum_{i=1}^{N}\lVert x_i - \hat{x}_i\rVert^2 - \beta\sum_{j=1}^{d}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$.
[0101] In the formula, $x_i$ indicates the $i$-th historical fusion feature vector; $\hat{x}_i$ is the vector reconstructed by the decoder for the $i$-th historical fusion feature vector; $i$ indicates the index of the historical fusion feature vector; $N$ represents the total number of vectors in the historical fusion feature vector set; $\mu_j$ and $\sigma_j^2$ represent the mean and variance of the $j$-th latent variable; $j$ indicates the index of the latent variable; $d$ is the total number of latent variables; $\beta$ is the preset balance coefficient, for which a value of 0.5 can be used in this embodiment. Note the distinction: $\log\sigma_j^2$ indicates the log-variance of the $j$-th latent variable, and $\mu_j^2$ indicates the square of the mean of the $j$-th latent variable.
[0102] The first term is the reconstruction loss term, which uses mean squared error to measure the difference between the input vector and the decoder-reconstructed vector.
[0103] The second term is the KL divergence term, which constrains the distribution output by the encoder so that it approximates a standard normal distribution.
[0104] Since the total loss is the minimization objective, the negative sign before the KL divergence term means that the sum $\sum_{j}\left(1 + \log\sigma_j^2 - \mu_j^2 - \sigma_j^2\right)$ is maximized during optimization. This sum is non-positive and proportional to the negative of the KL divergence between the encoder output distribution and the standard normal distribution, so maximizing it minimizes that divergence and causes the encoder output distribution to approach the standard normal distribution.
[0105] The overall formula means: minimizing reconstruction error while maintaining a regular distribution of latent variables. After training, the model only has low reconstruction error for normal behavior; when abnormal behavior is input, the reconstruction error increases because the distribution deviates from the training data, thus serving as the basis for anomaly risk scoring.
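A numerical sketch of such a two-part loss is given below. It assumes that the reconstruction term is the squared error between each input and its reconstruction averaged over the samples, and that the KL term is written as minus the balance coefficient times the sum of (1 + log-variance - mean squared - variance) over the latent variables; with a balance coefficient of 0.5 this second term equals the KL divergence from a standard normal distribution. The function name `vae_loss` and the use of a single latent mean and log-variance vector are simplifications made for this example.

```python
import math
from typing import Sequence


def vae_loss(
    inputs: Sequence[Sequence[float]],
    reconstructions: Sequence[Sequence[float]],
    mu: Sequence[float],
    log_var: Sequence[float],
    beta: float = 0.5,
) -> float:
    """Reconstruction loss plus a KL regularizer toward the standard normal distribution."""
    n = len(inputs)
    # Reconstruction term: squared error per sample, averaged over the samples.
    reconstruction = sum(
        sum((x - y) ** 2 for x, y in zip(xi, yi))
        for xi, yi in zip(inputs, reconstructions)
    ) / n
    # This sum is <= 0 and reaches 0 only when mu = 0 and the variance is 1.
    kl_sum = sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    # The minus sign turns minimizing the loss into maximizing kl_sum.
    return reconstruction - beta * kl_sum


perfect = vae_loss([[1.0, 2.0]], [[1.0, 2.0]], mu=[0.0], log_var=[0.0])
shifted = vae_loss([[1.0, 2.0]], [[1.0, 2.0]], mu=[1.0], log_var=[0.0])
poor_reconstruction = vae_loss([[1.0, 2.0]], [[1.0, 0.0]], mu=[0.0], log_var=[0.0])
```

The loss is zero only when the reconstruction is exact and the latent distribution is standard normal; either a shifted latent mean or a poor reconstruction raises it.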
[0106] A3. Perform training iterations: Initialize all network weights for the encoder and decoder, set the batch size, number of iterations, and learning rate for the Adam optimizer; for example, set the batch size to 64, the number of iterations to 100, and the learning rate for the Adam optimizer to 0.001.
[0107] For each training batch, perform the following sub-steps: A31. Feed the batch's input vectors into the encoder to obtain the mean vector and the log-variance vector.
[0108] A32. Calculate the latent variables through the sampling layer.
[0109] A33. The latent variables are fed into the decoder to obtain the reconstructed vector.
[0110] A34. Calculate the average loss of the current batch based on the loss function, and then backpropagate to update the trainable parameters of the encoder and decoder.
[0111] Repeat steps A31-A34 above until the loss converges or the preset maximum number of iterations is reached. The loss can be considered converged when, for example, the decrease in average loss over multiple consecutive iterations is less than a preset threshold.
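The stopping rule can be sketched independently of any particular neural network library. In the example below, `run_iteration` is a hypothetical callable standing in for one pass of steps A31 to A34 that returns the average loss; `patience` and `min_delta` are assumed names for the number of consecutive iterations and the convergence threshold, whose values the description leaves open. The maximum of 100 iterations is the example value from step A3.

```python
from typing import Callable, List, Sequence


def has_converged(losses: Sequence[float], patience: int = 5, min_delta: float = 1e-4) -> bool:
    """True when the loss decreased by less than min_delta in each of the last `patience` steps."""
    if len(losses) <= patience:
        return False
    recent = losses[-(patience + 1):]
    return all(prev - cur < min_delta for prev, cur in zip(recent, recent[1:]))


def train(
    run_iteration: Callable[[], float],
    max_iterations: int = 100,
    patience: int = 5,
    min_delta: float = 1e-4,
) -> List[float]:
    """Repeat training iterations until convergence or the iteration cap is reached."""
    losses: List[float] = []
    for _ in range(max_iterations):
        losses.append(run_iteration())
        if has_converged(losses, patience, min_delta):
            break
    return losses


# Simulated loss curve that flattens out, standing in for a real training run.
simulated = iter([1.0 / (k + 1) ** 3 for k in range(100)])
history = train(lambda: next(simulated))
```

Requiring several consecutive small decreases, rather than one, avoids stopping on a single flat iteration in an otherwise improving run.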
[0112] After training is complete, the network structure and weights of the encoder and decoder are saved as a variational autoencoder model specific to the current user.
[0113] A4. Validate the model: Randomly hold out 10% of the samples in the training sample set as a validation set that does not participate in training. Input the validation set into the current user's dedicated variational autoencoder model and calculate the average reconstruction error. If the reconstruction error distribution on the validation set is basically consistent with that on the training set, where basic consistency means that the difference between the average errors is less than 5%, the model training is considered effective; otherwise, the hyperparameters are adjusted and the model is retrained.
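The hold-out split and the consistency check can be sketched as follows. The description does not say what the 5% is measured against, so this example assumes a relative difference with respect to the training-set average; the function names and the fixed `seed` are hypothetical.

```python
import random
from statistics import mean
from typing import List, Sequence, Tuple


def split_holdout(
    samples: Sequence, fraction: float = 0.1, seed: int = 0
) -> Tuple[List, List]:
    """Randomly hold out a fraction of the samples as a validation set."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_val = max(1, round(len(samples) * fraction))
    val_idx = set(indices[:n_val])
    train_set = [s for i, s in enumerate(samples) if i not in val_idx]
    val_set = [s for i, s in enumerate(samples) if i in val_idx]
    return train_set, val_set


def training_is_effective(
    train_errors: Sequence[float],
    val_errors: Sequence[float],
    max_relative_gap: float = 0.05,
) -> bool:
    """Compare average reconstruction errors on the training and validation sets."""
    train_avg, val_avg = mean(train_errors), mean(val_errors)
    if train_avg == 0:
        return val_avg == 0
    # Assumption: the 5% criterion is relative to the training-set average error.
    return abs(val_avg - train_avg) / train_avg < max_relative_gap


train_set, val_set = split_holdout(list(range(50)))
consistent = training_is_effective([1.0, 1.0], [1.04])
inconsistent = training_is_effective([1.0, 1.0], [1.10])
```

A validation error well above the training error suggests the model has memorized its training samples, which is the case the retraining branch is meant to catch.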
[0114] The blocking decision module blocks the call request and sends a voice biometric verification request when the anomaly risk score exceeds the confidence blocking threshold. The specific execution process is as follows: the current call request is rejected through the API gateway, and a challenge response requiring voice verification is returned to the client that issued the current call request.
[0115] The system receives the verification voice sample uploaded by the client based on the challenge response, extracts the voiceprint verification feature vector of the verification voice sample, compares it with the registered voiceprint feature vector corresponding to the user identifier, and if the similarity is greater than or equal to the preset verification threshold, the original fusion feature vector corresponding to the current call request is marked as a normal access sample and added to the user's historical fusion feature vector set. The tolerance interval width and confidence blocking threshold are recalculated to complete the incremental update of the user behavior baseline.
[0116] Otherwise, the current call request is considered an illegal attack, the blocking event is recorded and the blocking status is maintained. If illegal attacks are identified multiple times in a row, the API access permissions corresponding to the user identifier are temporarily locked, and a high-risk alert is pushed to the platform administrator interface.
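The decision flow after a challenge response can be sketched as a small state update. Several details here are assumptions made for illustration: the description does not name the similarity measure (cosine similarity is used below), the verification threshold of 0.8 and the limit of three consecutive attacks are hypothetical values, and `UserState`, `should_block`, and `handle_verification` are illustrative names. Recalculating the baseline and pushing the administrator alert are left out.

```python
import math
from dataclasses import dataclass, field
from typing import List, Sequence


@dataclass
class UserState:
    history: List[List[float]] = field(default_factory=list)  # normal fused feature vectors
    consecutive_attacks: int = 0
    locked: bool = False


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_block(risk_score: float, blocking_threshold: float) -> bool:
    """Block only when the anomaly risk score exceeds the confidence blocking threshold."""
    return risk_score > blocking_threshold


def handle_verification(
    state: UserState,
    original_vector: Sequence[float],
    sample_voiceprint: Sequence[float],
    registered_voiceprint: Sequence[float],
    verification_threshold: float = 0.8,
    max_consecutive_attacks: int = 3,
) -> str:
    """Update the user state from the voiceprint comparison and return the outcome."""
    if cosine_similarity(sample_voiceprint, registered_voiceprint) >= verification_threshold:
        # Verified: the original (uncorrected) vector joins the normal history.
        state.history.append(list(original_vector))
        state.consecutive_attacks = 0
        return "verified"
    state.consecutive_attacks += 1
    if state.consecutive_attacks >= max_consecutive_attacks:
        state.locked = True  # temporary lock of API access for this user identifier
        return "locked"
    return "blocked"


state = UserState()
first = handle_verification(state, [1.0, 2.0], [1.0, 0.0], [1.0, 0.0])
attempts = [handle_verification(state, [9.0, 9.0], [0.0, 1.0], [1.0, 0.0]) for _ in range(3)]
```

Resetting the counter on a successful verification means that only uninterrupted runs of failed challenges lead to a lock.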
[0117] It should be noted that the above-mentioned original fusion feature vector refers to the fusion feature vector that has not undergone temporal continuity verification and environmental noise compensation.
[0118] The above content is merely an example and illustration of the concept of the present invention. Those skilled in the art can make various modifications or additions to the specific embodiments described, or use similar methods to replace them, as long as they do not deviate from the concept of the invention or exceed the scope defined by the present invention, and all such modifications and additions should fall within the protection scope of the present invention.
Claims
1. A software runtime background data security management system based on a cloud platform, characterized in that it comprises: a request parsing module, which obtains a call request to the voice interaction API and parses the call request to generate request metadata and a raw voice data stream; a feature extraction module, which uses voice activity detection to remove the silent segments from the raw voice data stream, performs embedded feature extraction on the voice data stream after the silent segments are removed, extracts contextual features in combination with the request metadata, and concatenates them to obtain the current fused feature vector; a baseline acquisition module, which acquires the user identifier associated with the call request, determines the historical fusion feature vector set and user behavior baseline of the corresponding group according to difficulty and ability matching, and dynamically generates a confidence blocking threshold based on the user behavior baseline; a risk assessment module, which performs temporal continuity verification and environmental noise compensation on the current fused feature vector, compares it with the historical fused feature vector set to calculate the reconstruction error, and generates an anomaly risk score; and a blocking decision module, which blocks the call request and sends a voice biometric verification request when the anomaly risk score is greater than the confidence blocking threshold.
2. The software runtime background data security management system based on a cloud platform according to claim 1, characterized in that parsing the call request to generate request metadata and a raw voice data stream includes: intercepting, at the cloud platform gateway layer, the Hypertext Transfer Protocol (HTTP) requests sent to the voice service interface, and extracting the request headers and request bodies; parsing the API key, request timestamp, and user identifier from the request header and combining them to form the request metadata; and parsing the voice data payload from the request body, decompressing and decoding it, and generating a pulse code modulation audio data stream as the raw voice data stream.
3. The software runtime background data security management system based on a cloud platform according to claim 2, characterized in that the current fused feature vector is obtained in the following way: a voice activity detection algorithm is used to detect silent segments in the raw voice data stream, and the effective speech segments are obtained after the silent segments are removed; speaker voiceprint features and semantic features of the speech content are extracted from the effective speech segments, and the extracted voiceprint feature vector and semantic feature vector are weighted and summed according to preset weights to generate the speech fusion features; the user identifier, request timestamp, and API key are extracted from the request metadata; based on the user identifier, the total number of API calls and the total amount of uplink voice data within a sliding time window are counted; the request timestamp, API key, total number of API calls, and total amount of uplink voice data are vectorized and concatenated to generate the context features; and the speech fusion features are concatenated with the context features to form the current fused feature vector.
4. The software runtime background data security management system based on a cloud platform according to claim 1, characterized in that determining the historical fusion feature vector set and user behavior baseline of the corresponding group according to difficulty and ability matching includes: retrieving the API call records associated with the user identifier that precede the current call request in time and have been marked as normal access, each record containing a fused feature vector, a course difficulty label, and the user's real-time ability score; calculating the median ability score of the user within a preset historical period as the ability level benchmark value, and mapping it to a preset difficulty level; comparing the course difficulty label in each record with the mapped level of the ability level benchmark value, and dividing the records into three groups according to the matching status: difficulty below ability, difficulty matching ability, and difficulty above ability; for each group, calculating the median of each feature dimension as the baseline center value, calculating the interquartile range as the tolerance interval width, and calculating the covariance matrix between the feature dimensions, to form the user behavior baseline; and obtaining the course difficulty label and ability level benchmark value of the current call request, determining the group to which the current call request belongs, and, if the group has historical records, extracting all historical fusion feature vectors in the group as the historical fusion feature vector set, otherwise falling back to the global baseline.
5. The software runtime background data security management system based on a cloud platform according to claim 4, characterized in that dynamically generating the confidence blocking threshold based on the user behavior baseline includes: extracting the baseline center value vector, tolerance interval width vector, and covariance matrix from the user behavior baseline of the group to which the current call request belongs; calculating the weight of each feature dimension based on the tolerance interval width, and applying the weights in a linear weighting of the difference between the current fused feature vector and the baseline center value vector; calculating the weighted Mahalanobis distance in combination with the covariance matrix as the Mahalanobis distance benchmark value; taking the mean of the elements of the tolerance interval width vector as the average tolerance width, and determining the attenuation factor in combination with the time period base value corresponding to the current system time period; and using the product of the Mahalanobis distance benchmark value and the attenuation factor as the confidence blocking threshold.
6. The software runtime background data security management system based on a cloud platform according to claim 4, characterized in that performing temporal continuity verification on the current fused feature vector includes: obtaining the fused feature vector of the previous normal call request corresponding to the user identifier, and computing the dimension-by-dimension difference between it and the current fused feature vector to obtain the difference vector; marking as jump dimensions the feature dimensions in the difference vector whose absolute difference is greater than a preset multiple of the width of their corresponding tolerance interval; for each component marked as a jump dimension, replacing its value with the algebraic sum of the value of the corresponding dimension in the previous normal fused feature vector and the tolerance interval width, the direction of the algebraic sum being consistent with the sign of the difference; and keeping the components not marked as jump dimensions unchanged, and recombining the replaced and unchanged components to obtain the current fused feature vector after temporal continuity verification.
7. The software runtime background data security management system based on a cloud platform according to claim 6, characterized in that, before the fused feature vector of the previous normal call request corresponding to the user identifier is obtained, the following judgment is performed: if the user identifier has no previous normal call request, the temporal continuity verification is skipped, and the current fused feature vector is output directly as the result of the temporal continuity verification, without any dimension replacement or correction.
8. The software runtime background data security management system based on a cloud platform according to claim 3, characterized in that performing environmental noise compensation on the current fused feature vector includes: obtaining the network jitter parameter and device noise parameter of the current call request; performing attenuation correction on the timestamp-related dimensions in the context features based on the network jitter parameter; and performing gain adjustment on the energy-related dimensions in the speech fusion features based on the device noise parameter, to obtain the compensated current fused feature vector.
9. The software runtime background data security management system based on a cloud platform according to claim 1, characterized in that comparing with the historical fusion feature vector set to calculate the reconstruction error includes: training a variational autoencoder model using the historical fusion feature vector set as normal samples, wherein the training process enables the encoder to learn to map the input feature vectors to the mean and log-variance of the latent variable distribution and enables the decoder to learn to reconstruct the original feature vectors from the latent variables, with the optimization objective being to minimize the reconstruction error; after training is complete, inputting the current fused feature vector into the encoder, which outputs the mean and log-variance of the corresponding latent variable distribution; obtaining latent variables by sampling from a Gaussian distribution defined by the mean and log-variance of the latent variable distribution; inputting the latent variables into the decoder to obtain the reconstructed fused feature vector; and calculating the residual vector between the current fused feature vector and the reconstructed fused feature vector, and calculating the Mahalanobis distance of the residual vector as the anomaly risk score.
10. The software runtime background data security management system based on a cloud platform according to claim 1, characterized in that blocking the call request and sending a voice biometric verification request when the anomaly risk score is greater than the confidence blocking threshold includes: rejecting the current call request through the API gateway and returning a challenge response requiring voice verification to the client that issued the current call request; receiving the verification voice sample uploaded by the client based on the challenge response, extracting the voiceprint verification feature vector of the verification voice sample, and comparing its similarity with the registered voiceprint feature vector corresponding to the user identifier; if the similarity is greater than or equal to the preset verification threshold, marking the original fusion feature vector corresponding to the current call request as a normal access sample, adding it to the user's historical fusion feature vector set, and recalculating the tolerance interval width and the confidence blocking threshold to complete the incremental update of the user behavior baseline; otherwise, treating the current call request as an illegal attack, recording the blocking event, and maintaining the blocking status; and, if illegal attacks are identified multiple times in a row, temporarily locking the API access permissions corresponding to the user identifier and pushing a high-risk alert to the platform administrator interface.