File sending risk early warning method based on multi-modal semantic analysis and behavior modeling
This file outbound risk warning method, based on multimodal semantic analysis and behavioral modeling, combines multi-dimensional assessments of file content, sender permissions, receiver relationships, and session scenarios. It addresses the single risk judgment dimension and high false alarm rate of existing technologies, providing detailed risk analysis with a lower false alarm rate.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGYANG TECH CO LTD
- Filing Date
- 2026-05-18
- Publication Date
- 2026-06-12
AI Technical Summary
Existing methods for outbound file risk warning assess risk along a single dimension, have a high false alarm rate, lack interpretability, and cannot effectively combine the semantics of file content with behavioral context for comprehensive evaluation.
By using multimodal semantic analysis and behavioral modeling, file transfer records and metadata are obtained, and multi-dimensional risk features are extracted, including content semantic sensitivity, role permissions, receiver relationships, chat scenarios and time context assessments. When dimensional conflicts are detected, conflict resolution is performed, and an interpretable risk analysis report is generated.
It achieves integrated analysis from six dimensions: file content, sender permissions, receiver identity, session scenario, behavioral habits, and outgoing time, which reduces the false alarm rate, improves security operation efficiency, and provides detailed risk cause analysis.
Figure CN122196467A_ABST
Description
Technical Field
[0001] This invention relates to the field of data security protection technology, and in particular to a method for early warning of file outgoing risks based on multimodal semantic analysis and behavioral modeling.
Background Technology
[0002] In the context of digital transformation, enterprises are increasingly reliant on instant messaging tools, email systems, and collaborative work platforms. Employees can use instant messaging tools to send files to external contacts, including business contracts, financial statements, design drawings, source code, customer lists, and other documents that contain the company's core trade secrets and intellectual property.
[0003] In existing technologies, file outgoing risk warnings are mainly divided into three categories: threshold-based warning schemes, which use static rules to judge surface features such as file quantity and size, but rely solely on statistical thresholds and cannot identify the semantic sensitivity of file content; statistical anomaly detection-based warning schemes, which identify deviations by establishing user behavior baselines, but the judgment of behavioral anomalies is separate from the semantics of the content, and false alarms are likely to occur when employees send out files in bulk for normal project needs; and keyword matching-based content detection schemes, which scan file names or content through a preset sensitive word library, but cannot understand the deep semantics of the file, nor can they make comprehensive judgments by combining the relationship between the sender and receiver, the sending time, and other contextual factors.
[0004] The above-mentioned solutions have the following problems in practical applications: First, the risk assessment dimension is singular, focusing only on surface features and ignoring the fusion analysis of file content semantics and behavioral context; Second, the false positive and false negative rates remain high. When the risk performance of multiple dimensions is inconsistent, for example, if the file content is sensitive but the sender and receiver are both internal personnel and the sending time is normal, the existing solution often directly judges it as high risk, and cannot eliminate the abnormal signals of a single dimension through cross-validation between dimensions, resulting in a large number of false positives; Third, the warning information lacks interpretability and cannot provide administrators with specific dimensional cause analysis that triggers risk assessment.
[0005] Therefore, there is an urgent need for a method that can perform integrated evaluation from multiple dimensions, including file content semantics and sender behavior, resolve abnormal signals to reduce false alarms when there are conflicts between dimensions, and output interpretable risk analysis results for early warning of file outgoing risks.
Summary of the Invention
[0006] The purpose of this invention is to provide a method, system, and terminal device for early warning of document outbound risks based on multimodal semantic analysis and behavioral modeling, so as to solve the problems of high false alarm rate and lack of interpretability of early warning in the prior art.
[0007] In a first aspect, the present invention provides a method for early warning of document outbound risks based on multimodal semantic analysis and behavioral modeling, comprising:
Obtaining file transfer records and their metadata through the data interface of the enterprise instant messaging system, the metadata including sender information, receiver information, chat type, and file download address;
Obtaining the file entity based on the file download address, and performing multimodal content analysis on the file entity to determine the content semantic sensitivity score of the file entity;
Extracting multidimensional risk features based on the metadata to determine risk scores for multiple behavioral dimensions, the multiple behavioral dimensions including a sender role and permission dimension, a receiver relationship dimension, a chat scenario dimension, a sender behavior pattern dimension, and a time context dimension, wherein the sender role and permission dimension is used to determine the role permission matching score, the receiver relationship dimension and the chat scenario dimension are used together to determine the receiver relationship risk score, the sender behavior pattern dimension is used to determine the behavior pattern anomaly score, and the time context dimension is used to determine the time context risk score;
Comparing the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the behavior pattern anomaly score, and the time context risk score with the preset alarm thresholds of the corresponding dimensions, and, when the scores of at least two dimensions are detected to be lower than their preset alarm thresholds, determining that a dimension conflict exists and reducing the weighted sum of the five scores according to preset conflict resolution rules to obtain the comprehensive risk score;
When the comprehensive risk score exceeds the preset reporting threshold, generating a risk analysis report and pushing an early warning message, the risk analysis report including the dimension identifiers of those of the five dimensions whose scores exceed their preset alarm thresholds.
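The threshold comparison, conflict check, and report content described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the dimension identifiers, the dictionary layout, and the function names are all hypothetical, and strict inequalities are assumed wherever the text says "lower than" or "exceeds".

```python
from typing import Dict, List, Optional

# Hypothetical identifiers for the five scored dimensions named in the text.
DIMENSIONS = [
    "content_sensitivity",
    "role_permission",
    "receiver_relationship",
    "behavior_pattern",
    "time_context",
]


def exceeded_dimensions(scores: Dict[str, float], thresholds: Dict[str, float]) -> List[str]:
    """Identifiers of dimensions whose score exceeds its preset alarm threshold."""
    return [d for d in DIMENSIONS if scores[d] > thresholds[d]]


def has_dimension_conflict(scores: Dict[str, float], thresholds: Dict[str, float]) -> bool:
    """A conflict exists when at least two dimensions score below their thresholds."""
    below = [d for d in DIMENSIONS if scores[d] < thresholds[d]]
    return len(below) >= 2


def build_report(scores: Dict[str, float], thresholds: Dict[str, float],
                 comprehensive_score: float, reporting_threshold: float) -> Optional[dict]:
    """Return a report only when the comprehensive score exceeds the reporting threshold."""
    if comprehensive_score <= reporting_threshold:
        return None
    return {
        "comprehensive_score": comprehensive_score,
        "triggering_dimensions": exceeded_dimensions(scores, thresholds),
    }
```

The report carries only the identifiers of the dimensions that crossed their thresholds, which is what lets an administrator see why a transfer was flagged.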
[0008] Furthermore, the step of reducing the weighted sum of the scores across the five dimensions according to preset conflict resolution rules includes: Obtain the number of dimensions whose scores exceed the preset alarm threshold among the five dimensions; The target attenuation coefficient is determined based on the correspondence between the number of dimensions and the preset attenuation coefficient. The weighted summation score is multiplied by the target attenuation coefficient to obtain the comprehensive risk score.
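As a rough sketch of this attenuation step, the code below counts the dimensions above threshold, looks up a coefficient, and scales the weighted sum. The weight values and the coefficient table are assumptions chosen only for illustration; the text states that such a correspondence is preset, and elsewhere that the weights decrease in the order content, role permission, receiver relationship, behavior pattern, time context, but it gives no numbers.

```python
from typing import Dict

# Hypothetical weights, decreasing in the order the document gives.
WEIGHTS = {
    "content_sensitivity": 0.35,
    "role_permission": 0.25,
    "receiver_relationship": 0.20,
    "behavior_pattern": 0.12,
    "time_context": 0.08,
}

# Hypothetical correspondence: number of dimensions above threshold -> attenuation coefficient.
# With a conflict (at least two dimensions below threshold), at most three can be above.
ATTENUATION = {0: 0.5, 1: 0.6, 2: 0.8, 3: 0.9}


def comprehensive_risk_score(scores: Dict[str, float], thresholds: Dict[str, float]) -> float:
    """Weighted sum of the five scores, attenuated when a dimension conflict exists."""
    weighted = sum(WEIGHTS[d] * scores[d] for d in WEIGHTS)
    below = sum(1 for d in WEIGHTS if scores[d] < thresholds[d])
    if below < 2:
        return weighted  # no dimension conflict, so no reduction
    exceeded = sum(1 for d in WEIGHTS if scores[d] > thresholds[d])
    return weighted * ATTENUATION[exceeded]
```

For example, with a content score of 95, the other four scores at 20, 30, 10, and 15, and every threshold at 60, the weighted sum is 46.65; one dimension is above threshold, so the assumed coefficient 0.6 gives 27.99.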
[0009] Furthermore, before comparing the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the abnormal behavior pattern score, and the time context risk score with the preset alarm thresholds for the corresponding dimensions, the method further includes: Based on the risk characteristics of the receiver relationship dimension and the risk characteristics of the chat scenario dimension, it is determined whether the current file external distribution behavior is in a high-risk receiving scenario; wherein, the high-risk receiving scenario includes the risk characteristics of the receiver relationship dimension reflecting that the receiver is an external competitor, or the receiver is an unknown external organization and is interacting for the first time, or the risk characteristics of the chat scenario dimension reflecting that the chat type is an external group chat containing external members; If it is determined that the reception scenario is high-risk, the content semantic sensitivity score is increased according to the preset correction rules, and the preset alarm threshold corresponding to the content semantic sensitivity score is decreased simultaneously.
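The high-risk receiving scenario check and the accompanying correction might look like the sketch below. The field names, the correction amounts (`boost`, `threshold_cut`), and the clamping to a 0 to 100 scale are assumptions; the text says only that the score is raised and the threshold lowered according to preset correction rules.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ReceivingContext:
    """Hypothetical container for receiver relationship and chat scenario features."""
    receiver_is_competitor: bool
    receiver_is_unknown_external: bool
    first_interaction: bool
    chat_type: str  # "one_on_one", "internal_group", or "external_group"


def is_high_risk_scenario(ctx: ReceivingContext) -> bool:
    """True for any of the three high-risk receiving conditions listed in the text."""
    return (
        ctx.receiver_is_competitor
        or (ctx.receiver_is_unknown_external and ctx.first_interaction)
        or ctx.chat_type == "external_group"
    )


def apply_correction(content_score: float, content_threshold: float, ctx: ReceivingContext,
                     boost: float = 10.0, threshold_cut: float = 10.0) -> Tuple[float, float]:
    """Raise the content score and lower its alarm threshold in a high-risk scenario."""
    if not is_high_risk_scenario(ctx):
        return content_score, content_threshold
    return min(100.0, content_score + boost), max(0.0, content_threshold - threshold_cut)
```

Because the score rises while the threshold falls, a moderately sensitive file sent to a high-risk receiver is more likely to be counted as exceeding the content dimension's alarm threshold.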
[0010] Further, the step of performing multimodal content analysis on the file entity to determine the content semantic sensitivity score of the file entity includes: Determine the file type corresponding to the file entity, and call the corresponding multimodal analysis model to perform content semantic extraction and classification based on the file type; If the file type is an image file, an image recognition model and an optical character recognition model are called to extract the scene and text of the file entity, and the extracted text content is classified into text topics. If the file type is a document file, call the document parsing library to extract the structured text content, and classify the extracted text content by text topic; If the file type is a video file, extract the video keyframes of the file entity and call the speech recognition model to transcribe the audio into text, and perform content analysis on the extracted keyframe images and the transcribed text; If the file type is a compressed file, the file entity is decompressed, and the corresponding content analysis is recursively performed on each decompressed file; Based on the content semantic extraction and classification results, determine the content semantic classification tags of the file entities; Based on the preset first mapping relationship between sensitivity level and score, the content semantic sensitivity score is determined according to the content semantic classification label.
[0011] Furthermore, the step of extracting multidimensional risk features based on the metadata to determine risk scores across multiple behavioral dimensions includes: Based on the metadata, the sender role permission features corresponding to the sender information, the receiver relationship features corresponding to the receiver information, and the chat scenario features corresponding to the chat type are extracted. Based on the sender information and timestamp statistical analysis, sender behavior pattern features and time context features are obtained. Based on a preset second mapping relationship between sender status, permission matching, and score, the role permission matching score is determined according to the sender role permission characteristics. Based on a preset third mapping relationship between receiver type, interaction relationship and score, the receiver relationship risk score is determined according to the receiver relationship characteristics and the chat scenario characteristics. Based on the preset fourth mapping relationship between the degree of behavioral deviation and the score, the abnormal score of the behavioral pattern is determined according to the characteristics of the sender's behavioral pattern. Based on the preset fifth mapping relationship between time attributes and scores, the time context risk score is determined according to the time context features.
[0012] Furthermore, based on the metadata, the sender's role and permission features, receiver's relationship features, and chat scenario features are extracted. Based on the sender information and the timestamp, statistical analysis is performed to obtain sender behavior pattern features and time context features, including: Query the enterprise organizational structure database to obtain the sender's department, job level, job permissions, and employment status, which will serve as the sender's role and permission characteristics. The external contact database is queried and the number of message exchanges and file transfers between senders and receivers within a preset historical period is counted to obtain receiver type, organization type, and historical interaction frequency, which are used as receiver relationship characteristics. Based on the chat type, it is determined whether it is a one-on-one chat, an internal group chat, or an external group chat. If it is a group chat, the number of group members and the proportion of external members are also counted as features of the chat scenario. The sender is statistically analyzed for the total number of outgoing files, the average daily number of outgoing files, the distribution of outgoing time, the distribution of outgoing file types, and the distribution of outgoing recipients within the preset historical time period. The degree of deviation of the current outgoing behavior from the historical baseline is calculated and used as the sender's behavioral pattern characteristics. Based on the comparison between the current timestamp and the pre-configured working hours, holidays, and departure times, it is determined whether the outgoing time belongs to working hours, non-working hours, or a sensitive time period, which is used as the time context feature.
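The text does not say how the degree of deviation from the historical baseline is computed. One common choice, used here purely as an assumption, is a z-score of the current daily outgoing-file count against the sender's historical daily counts. The function name and the handling of a zero-variance baseline are likewise hypothetical.

```python
from statistics import mean, pstdev
from typing import Sequence


def deviation_degree(daily_counts: Sequence[int], current_count: int) -> float:
    """Z-score of today's outgoing-file count against the historical daily counts.

    Assumption: the 'degree of deviation' is measured in population standard deviations.
    """
    baseline = mean(daily_counts)
    spread = pstdev(daily_counts)
    if spread == 0:
        # A perfectly flat history: any change at all is an unbounded deviation.
        return 0.0 if current_count == baseline else float("inf")
    return (current_count - baseline) / spread
```

For a history of `[2, 4, 4, 4, 5, 5, 7, 9]` (mean 5, population standard deviation 2), a day with 11 outgoing files yields a deviation of 3.0. The same pattern could be applied to the other baseline statistics the paragraph lists, such as file type or recipient distributions, with a suitable distance measure.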
[0013] Furthermore, after generating the risk analysis report and pushing the early warning message, the following steps are also included: Based on the comparison results between the comprehensive risk score and the preset first threshold and second threshold, the outbound file transfer is classified into a first risk level, a second risk level, or a third risk level, wherein the first threshold is greater than the second threshold; For an outbound file transfer at the first risk level, a highest-priority alarm is triggered and an automatic interception process is executed. For an outbound file transfer at the second risk level, a medium-priority alarm is triggered and the file is marked as pending review. For an outbound file transfer at the third risk level, no alarm is triggered; the transfer is only recorded in the behavior log. The automatic interception process includes notifying the security administrator and the sender's department head, freezing the sender's account, and prohibiting the file entity from flowing to downstream systems.
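The three-level grading can be illustrated with the short sketch below. The threshold values 85 and 60, the strict comparisons, and the action labels are assumptions; the text requires only that the first threshold be greater than the second.

```python
from typing import Dict, List

# Hypothetical action labels summarizing the handling described for each level.
ACTIONS: Dict[int, List[str]] = {
    1: ["highest_priority_alarm", "automatic_interception"],
    2: ["medium_priority_alarm", "mark_pending_review"],
    3: ["record_in_behavior_log"],
}


def risk_level(comprehensive_score: float,
               first_threshold: float = 85.0,
               second_threshold: float = 60.0) -> int:
    """Map a comprehensive risk score to risk level 1, 2, or 3."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must be greater than second threshold")
    if comprehensive_score > first_threshold:
        return 1
    if comprehensive_score > second_threshold:
        return 2
    return 3
```

Calling `ACTIONS[risk_level(score)]` then gives the handling steps for a transfer.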
[0014] Furthermore, in the weighted summation operation, the weights of the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the abnormal behavior pattern score, and the time context risk score decrease sequentially.
[0015] In a second aspect, the present invention provides a document outbound risk early warning system based on multimodal semantic analysis and behavioral modeling, comprising:
A data acquisition module, used to acquire file transfer records and their metadata through the data interface of the enterprise instant messaging system, the metadata including sender information, receiver information, chat type, and file download address;
A multimodal content analysis module, used to obtain the file entity based on the file download address, perform multimodal content analysis on the file entity, and determine the content semantic sensitivity score of the file entity;
A behavioral feature extraction module, used to extract multidimensional risk features based on the metadata and determine risk scores for multiple behavioral dimensions, the multiple behavioral dimensions including a sender role and permission dimension, a receiver relationship dimension, a chat scenario dimension, a sender behavior pattern dimension, and a time context dimension, wherein the sender role and permission dimension is used to determine the role permission matching score, the receiver relationship dimension and the chat scenario dimension are used together to determine the receiver relationship risk score, the sender behavior pattern dimension is used to determine the behavior pattern anomaly score, and the time context dimension is used to determine the time context risk score;
A dynamic risk scoring engine, used to compare the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the behavior pattern anomaly score, and the time context risk score with the preset alarm thresholds of the corresponding dimensions, and, when the scores of at least two dimensions are detected to be lower than their preset alarm thresholds, to determine that a dimension conflict exists and reduce the weighted sum of the five scores according to preset conflict resolution rules to obtain the comprehensive risk score;
An intelligent early warning and tracing module, used to generate a risk analysis report and push an early warning message when the comprehensive risk score exceeds the preset reporting threshold, the risk analysis report including the dimension identifiers of those of the five dimensions whose scores exceed their preset alarm thresholds.
[0016] In a third aspect, the present invention provides a terminal device, including a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein when the processor executes the computer program, it implements the file outgoing risk warning method based on multimodal semantic analysis and behavior modeling as described in any of the above.
[0017] Compared with the prior art, the beneficial effects of the present invention are as follows: First, by treating the receiver relationship dimension and chat scenario dimension as independent risk assessment dimensions, and generating corresponding risk scores together with the sender role and permission dimension, sender behavior pattern dimension, and time context dimension, and conducting joint assessment with the content semantic sensitivity score, we achieve integrated analysis from six dimensions: file content, sender permissions, receiver identity, conversation scenario, behavioral habits, and outgoing time. This solves the problem of single risk judgment dimensions in existing technologies and reduces false negatives caused by isolated assessments.
[0018] Next, by determining that a dimensional conflict exists when at least two dimensions' scores are below their preset alarm thresholds, and reducing the weighted sum score according to preset conflict resolution rules, the system achieves automatic identification and resolution of inconsistencies in multi-dimensional risk signals. When the file content is sensitive but the sender has normal permissions, the recipient is an internal employee, and the outgoing time is during working hours, the high score in the content dimension conflicts with the low scores in other dimensions. The conflict resolution rules can significantly reduce the overall risk score, effectively avoiding false alarms caused by anomalies in a single dimension, and solving the technical problems of lacking cross-validation between dimensions and high false alarm rates in existing technologies.
[0019] Finally, because the risk analysis report includes the identifiers of those of the five dimensions whose scores exceed their preset alarm thresholds, security administrators can directly identify the specific dimensions that triggered the risk assessment. This represents a technical improvement from outputting a single score to outputting interpretable multidimensional analysis results. It solves the problems of existing technologies, in which early warning information lacks risk cause analysis and security administrators must spend a lot of time on manual retrospective investigation, and thus improves the efficiency of security operations.
Attached Figure Description
[0020] Figure 1 is a flowchart illustrating a method for early warning of document outbound risks based on multimodal semantic analysis and behavioral modeling, provided in an embodiment of the present invention.
[0021] Figure 2 is a schematic diagram of the structure of a file outbound risk warning system based on multimodal semantic analysis and behavioral modeling, provided in an embodiment of the present invention.
Detailed Implementation
[0022] The technical solutions of this invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are merely some, not all, of the embodiments of this invention. The components of this invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the invention provided in the drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without inventive effort are within the scope of protection of this invention.
[0023] It should be noted that similar reference numerals and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this invention, terms such as "first," "second," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.
[0024] It should be noted that the method of this embodiment can be executed by a single device, such as a computer or server. The method of this embodiment can also be applied to a distributed scenario, where multiple devices cooperate to complete the task. In such a distributed scenario, one of these devices may execute only one or more steps of the method of this embodiment, and the multiple devices will interact with each other to complete the method described.
[0025] It should be noted that the above description describes some embodiments of the present invention. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims may be performed in a different order than that shown in the above embodiments and still achieve the desired results. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
[0026] Please refer to Figure 1, which is a flowchart illustrating a method for early warning of document outbound risks based on multimodal semantic analysis and behavioral modeling, provided by an embodiment of the present invention.
[0027] An embodiment of the present invention provides a method for early warning of document outbound risks based on multimodal semantic analysis and behavioral modeling, comprising the following steps: S10. Obtain file transfer records and metadata of file transfer records through the data interface of the enterprise instant messaging system. The metadata includes sender information, receiver information, chat type and file download address.
[0028] In this embodiment, file transfer records and their metadata are obtained in real time or near real time (e.g., with a delay of no more than 5 minutes) from an enterprise instant messaging (IM) system, such as WeChat Work, DingTalk, or Lark, through the IM system's session archiving API or another log interface of the IM system. Sender information includes user ID, name, department, and job title; receiver information includes user ID, name, and type (internal employee or external contact); chat types include one-on-one chat, internal group chat, and external group chat; the metadata also includes timestamp, file name, file size, file type, and file download URL (Uniform Resource Locator).
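One way to hold the metadata fields listed above is a small set of typed records, sketched below. The class names, field names, enum values, and the example URL are hypothetical; only the set of fields comes from the text. A single receiver is assumed for simplicity, although a group chat would have several.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class ChatType(Enum):
    ONE_ON_ONE = "one_on_one"
    INTERNAL_GROUP = "internal_group"
    EXTERNAL_GROUP = "external_group"


class ReceiverType(Enum):
    INTERNAL_EMPLOYEE = "internal_employee"
    EXTERNAL_CONTACT = "external_contact"


@dataclass(frozen=True)
class Sender:
    user_id: str
    name: str
    department: str
    job_title: str


@dataclass(frozen=True)
class Receiver:
    user_id: str
    name: str
    receiver_type: ReceiverType


@dataclass(frozen=True)
class FileTransferRecord:
    sender: Sender
    receiver: Receiver
    chat_type: ChatType
    timestamp: datetime
    file_name: str
    file_size: int  # bytes
    file_type: str
    download_url: str


# Example record with made-up values.
record = FileTransferRecord(
    sender=Sender("u001", "Li Wei", "Finance", "Accountant"),
    receiver=Receiver("x042", "External Partner", ReceiverType.EXTERNAL_CONTACT),
    chat_type=ChatType.ONE_ON_ONE,
    timestamp=datetime(2026, 5, 18, 22, 30),
    file_name="q1_statement.xlsx",
    file_size=482_113,
    file_type="xlsx",
    download_url="https://example.com/files/q1_statement.xlsx",
)
```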
[0029] S20. Obtain the file entity based on the file download address, perform multimodal content analysis on the file entity, and determine the content semantic sensitivity score of the file entity.
[0030] In this embodiment, the file entity is automatically downloaded to the local storage or object storage service of the analysis server based on the file download address, and different multimodal analysis models are called according to the file type to extract and classify the content semantics of the file entity in order to determine the content semantic sensitivity score of the file entity.
[0031] In one specific embodiment, step S20 includes the following steps:
S21. Determine the file type corresponding to the file entity, and call the corresponding multimodal analysis model to perform content semantic extraction and classification based on the file type;
If the file type is an image file, an image recognition model and an optical character recognition model are called to extract the scene and text of the file entity, and the extracted text content is classified into text topics;
If the file type is a document file, call the document parsing library to extract the structured text content, and classify the extracted text content by text topic;
If the file type is a video file, extract the video keyframes of the file entity and call the speech recognition model to transcribe the audio into text, and perform content analysis on the extracted keyframe images and the transcribed text;
If the file type is a compressed file, the file entity is decompressed, and the corresponding content analysis is recursively performed on each decompressed file;
S22. Based on the content semantic extraction and classification results, determine the content semantic classification label of the file entity;
S23. Based on the preset first mapping relationship between sensitivity level and score, determine the content semantic sensitivity score according to the content semantic classification label.
[0032] First, determine the file type corresponding to the file entity. File type determination can be achieved by parsing the file extension or detecting the magic number in the file header. For example, files with extensions like jpg, png, bmp, and gif are identified as image files; files with extensions like doc, docx, xls, xlsx, ppt, pptx, and pdf are identified as document files; files with extensions like mp4, avi, and mov are identified as video files; files with extensions like zip, rar, and 7z are identified as compressed files; and files with extensions like py, java, js, and cpp are identified as source code files.
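A minimal sketch of this file type determination follows. The extension table mirrors the examples in the paragraph above, while the function name, the category strings, and the choice to check the header before the extension are assumptions. The ZIP signature is deliberately left out of the magic-number table, because docx, xlsx, and pptx files are themselves ZIP containers and would be misread as compressed files.

```python
import os

# Extension -> category, following the examples given in the text.
EXTENSION_CATEGORIES = {
    **dict.fromkeys(["jpg", "png", "bmp", "gif"], "image"),
    **dict.fromkeys(["doc", "docx", "xls", "xlsx", "ppt", "pptx", "pdf"], "document"),
    **dict.fromkeys(["mp4", "avi", "mov"], "video"),
    **dict.fromkeys(["zip", "rar", "7z"], "compressed"),
    **dict.fromkeys(["py", "java", "js", "cpp"], "source_code"),
}

# A few well-known file signatures (magic numbers) found at the start of a file.
MAGIC_NUMBERS = [
    (b"\x89PNG\r\n\x1a\n", "image"),       # PNG
    (b"\xff\xd8\xff", "image"),            # JPEG
    (b"GIF8", "image"),                    # GIF87a / GIF89a
    (b"%PDF-", "document"),                # PDF
    (b"7z\xbc\xaf\x27\x1c", "compressed"),  # 7z
    (b"Rar!\x1a\x07", "compressed"),       # RAR
]


def detect_category(file_name: str, header: bytes = b"") -> str:
    """Classify a file by its header signature if recognized, else by its extension."""
    for magic, category in MAGIC_NUMBERS:
        if header.startswith(magic):
            return category
    extension = os.path.splitext(file_name)[1].lower().lstrip(".")
    return EXTENSION_CATEGORIES.get(extension, "unknown")
```

Checking the header first guards against a file whose extension has been changed, for instance a PDF renamed to `notes.txt`.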
[0033] Based on the determined file type, the corresponding multimodal analysis model is invoked for content semantic extraction and classification. The specific analysis method is as follows: For image files, image recognition and optical character recognition (OCR) models are used for scene and text extraction. The image recognition model can employ a ResNet or EfficientNet-based scene classification model to identify the scene type in the image, including office scenes, financial statement screenshots, product design drawings, and ID photos. The OCR model can use PaddleOCR or Tesseract to extract the text content from the image. The extracted text content is then classified into different topics to obtain scene recognition results and text topic classification results.
[0034] For document files, a document parsing library is used to extract structured text content and tabular data. Document parsing libraries such as Apache POI, python-docx, or pdfplumber can be used. An NLP model is then used to classify the text content into topics. The NLP model can be a BERT-based or RoBERTa-based text classification model to determine if the document content belongs to categories such as financial statements, contracts, technical solutions, product quotations, customer lists, design drawings, source code, or work summaries, thus obtaining the topic classification results.
[0035] For video files, keyframes are extracted, image content analysis is performed on the keyframes, and a speech recognition model is used to transcribe the audio into text. Content analysis is then performed on the extracted keyframe images and the transcribed text to obtain the image content analysis results of the keyframes and the text topic classification results of the transcribed text.
[0036] For a compressed file, the file entity is decompressed, and the corresponding content analysis is recursively performed on each decompressed file to count the number of files and the distribution of file types within the compressed file, thereby obtaining the analysis results for all files within the compressed file.
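The recursive handling of compressed files can be sketched with Python's standard `zipfile` module, as below. Only ZIP archives are covered, and the function names and the `max_depth` guard against deeply nested archives are assumptions not found in the text. The sketch counts files by extension; a full implementation would pass each extracted file to the matching content analysis instead.

```python
import io
import os
import zipfile
from collections import Counter
from typing import Dict


def make_zip(files: Dict[str, bytes]) -> bytes:
    """Helper for the example: build an in-memory ZIP archive from name -> content."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            archive.writestr(name, data)
    return buffer.getvalue()


def analyze_zip(data: bytes, max_depth: int = 3) -> Counter:
    """Count files by extension, descending into nested ZIP archives up to max_depth."""
    counts: Counter = Counter()
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue
            extension = os.path.splitext(info.filename)[1].lower().lstrip(".")
            if extension == "zip" and max_depth > 0:
                # Recurse into the nested archive and merge its counts.
                counts += analyze_zip(archive.read(info), max_depth - 1)
            else:
                counts[extension or "(none)"] += 1
    return counts
```

For example, an outer archive holding `b.pdf`, `c.txt`, and an inner archive with `a.txt` yields two `txt` files and one `pdf` file.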
[0037] For source code files, static code analysis tools are used to identify the code language and key function names, and to detect whether sensitive configuration information such as database passwords and API keys are contained, obtaining the identification and detection results.
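Detection of sensitive configuration information could start from simple pattern matching, as in the sketch below. The two regular expressions are illustrative assumptions that only catch quoted literal assignments; they are not the static analysis tools the text refers to, and a production scanner would need a much broader rule set.

```python
import re
from typing import List

# Hypothetical patterns: a key name, then ':' or '=', then a quoted literal value.
SENSITIVE_PATTERNS = {
    "password": re.compile(r"(?i)(?:password|passwd|pwd)\s*[:=]\s*['\"][^'\"]+['\"]"),
    "api_key": re.compile(r"(?i)api[_-]?key\s*[:=]\s*['\"][^'\"]+['\"]"),
}


def find_sensitive_config(source: str) -> List[str]:
    """Return the sorted names of the sensitive patterns found in the source text."""
    return sorted(name for name, pattern in SENSITIVE_PATTERNS.items()
                  if pattern.search(source))
```

A file containing `DB_PASSWORD = "hunter2"` would be reported under `password`, while a value read from the environment at run time would not match either pattern.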
[0038] After completing the above semantic extraction and classification, the semantic classification labels of the file entities are determined based on the results of the semantic extraction and classification. The semantic classification labels are used to identify the semantic category to which the file content belongs, such as financial statements, contracts and agreements, customer data, core modules of source code, product quotations, etc.
[0039] Specifically, the modal analysis models in step S21 produce outputs in inconsistent formats: the image recognition model outputs scene categories, the OCR model outputs raw text, and the speech recognition model outputs transcribed text. These outputs cannot be used directly for the subsequent score mapping. Therefore, after multimodal content analysis is performed on the file entity, the extracted multimodal content is input into the text topic classification model, which converts the inconsistent outputs into standardized semantic category labels. This maps the raw multimodal information to standardized semantic categories and provides a unified input for subsequently determining the content semantic sensitivity score based on the first mapping relationship.
[0040] In one example, the text topic classification model uses a pre-trained language model based on BERT or RoBERTa, adding a fully connected classification layer and a Softmax output layer at its top. Each neuron in the Softmax output layer corresponds to a predefined semantic category, and the Softmax function normalizes the output vector into a probability distribution. The text topic classification model outputs the confidence score of the text content belonging to each predefined semantic category, expressed as a probability value, with the sum of the confidence scores for all categories being 1. The principle is that the Softmax function performs an exponential operation on each output component and then divides it by the sum of the exponential operations of all components. This normalization ensures that the output value falls between 0 and 1 and the sum is 1. Each component represents the probability that the model predicts the input belongs to that category.
[0041] The semantic category with the highest confidence score output by the text topic classification model is determined as the content semantic classification label. For example, if the confidence score vector output by the text topic classification model shows that the financial statement category has a confidence score of 0.96, the contract agreement category has a confidence score of 0.02, the technical solution category has a confidence score of 0.01, and the total confidence score of other categories is 0.01, then the financial statement category has the highest confidence score, and the content semantic classification label is determined to be the financial statement category.
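The Softmax normalization and the selection of the highest-confidence category can be written out in a few lines. This sketch works on raw output values (logits) supplied by the caller; the category names, the example logits, and the function names are hypothetical, and subtracting the maximum before exponentiating is a standard numerical-stability step that does not change the result.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Exponentiate each component and divide by the sum of all exponentials."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def top_label(categories: Sequence[str], logits: Sequence[float]) -> Tuple[str, float]:
    """Return the category with the highest confidence, together with that confidence."""
    probabilities = softmax(logits)
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return categories[best], probabilities[best]


categories = ["financial_statements", "contract_agreement", "technical_solution"]
label, confidence = top_label(categories, [4.0, 0.5, 0.1])
```

Here `label` is `"financial_statements"`, since it has the largest raw output and Softmax preserves the ordering of its inputs.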
[0042] Finally, based on the preset first mapping relationship between sensitivity level and score, the content semantic sensitivity score is determined according to the content semantic classification label.
[0043] In this embodiment, the first mapping relationship defines two layers of mapping: from semantic category to sensitivity level, and from sensitivity level to score range. The first layer of mapping is from semantic category to sensitivity level, specifically: financial statements, official contract versions, customer data, core module source code, patent documents, and bidding product quotations are mapped to a high sensitivity level; meeting minutes, draft project plans, product promotional materials, and training materials are mapped to a medium sensitivity level; daily work reports, general templates, and publicly available information are mapped to a low sensitivity or no sensitivity level. The second layer of mapping is from sensitivity level to score range, specifically: a high sensitivity level corresponds to a first score range, such as a score range of 90 to 100 points; a medium sensitivity level corresponds to a second score range, where the lower limit of the first score range is greater than the upper limit of the second score range, such as a score range of 60 to 89 points; a low sensitivity or no sensitivity level corresponds to a third score range, where the lower limit of the second score range is greater than the upper limit of the third score range, such as a score range of 0 to 59 points.
[0044] In practice, the extracted text content is input into the text topic classification model, the confidence scores of each semantic category output by the model are obtained, the semantic category with the highest confidence score is determined as the content semantic classification label, and the confidence score value corresponding to the content semantic classification label is obtained.
[0045] The content semantic classification label is looked up in the first mapping relationship to determine its sensitivity level. For example, if the label is "financial statements," the first mapping relationship shows that "financial statements" belongs to the high sensitivity level.
[0046] Then, based on the determined sensitivity level, the mapping between the sensitivity level and the score range in the first mapping relationship is queried to determine the corresponding score range. For example, a high sensitivity level corresponds to the first score range of 90 to 100 points.
[0047] Finally, within the defined score range, the specific content semantic sensitivity score is determined based on the confidence value corresponding to the content semantic classification label. The higher the confidence value, the closer the score is to the upper limit of the score range; the lower the confidence value, the closer the score is to the lower limit of the score range.
[0048] In one specific implementation, a baseline score is preset within the scoring range corresponding to each sensitivity level. For the high sensitivity level, the baseline score for financial statements is set at 95 points, the baseline score for the formal version of the contract is set at 94 points, the baseline score for customer data is set at 96 points, the baseline score for core module source code is set at 97 points, the baseline score for patent documents is set at 96 points, and the baseline score for the bidding version product quotation is set at 95 points. Based on the baseline score, adjustments are made according to the confidence level. When the confidence level is not lower than the first confidence threshold (e.g., 0.95), a preset adjustment score (e.g., 2 to 3 points) is added to the baseline score. When the confidence level is lower than the first confidence threshold but not lower than the second confidence threshold (e.g., 0.85), the baseline score remains unchanged. When the confidence level is lower than the second confidence threshold, the preset adjustment score is subtracted from the baseline score, but the score is not lower than the lower limit of the corresponding scoring range.
[0049] For example, when the text topic classification model determines the content semantic classification label as "financial statement" and the confidence level corresponding to this label is 0.96, since the confidence level of 0.96 exceeds the first confidence level threshold of 0.95, the score is a baseline score of 95 plus an adjusted score of 2, for a total of 97 points. When the confidence level is 0.88, the score is 95 points. When the confidence level is 0.82, the score is 95 points minus the adjusted score of 3, for a total of 92 points, which still falls within the score range of 90 to 100 points.
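The two-layer mapping and the confidence adjustment in paragraphs [0043] to [0049] can be sketched as follows. The baseline scores, thresholds, and score range are the example values from the text. The dictionary and function names, the separate `bonus` and `penalty` parameters (chosen so that the sketch reproduces the worked example of plus 2 and minus 3), and the clamp to the upper limit of the range are assumptions of this sketch. Only the high sensitivity level is covered, since the text gives baseline scores only for that level.

```python
# Layer 1: semantic category -> sensitivity level (high-level categories only).
CATEGORY_LEVEL = {
    "financial statements": "high",
    "official contract versions": "high",
    "customer data": "high",
    "core module source code": "high",
    "patent documents": "high",
    "bidding product quotations": "high",
}
# Layer 2: sensitivity level -> (lower limit, upper limit) of the score range.
LEVEL_RANGE = {"high": (90, 100), "medium": (60, 89), "low": (0, 59)}
# Baseline scores given in the text for the high sensitivity level.
BASELINE = {
    "financial statements": 95,
    "official contract versions": 94,
    "customer data": 96,
    "core module source code": 97,
    "patent documents": 96,
    "bidding product quotations": 95,
}

def content_sensitivity_score(label, confidence, bonus=2, penalty=3,
                              first_threshold=0.95, second_threshold=0.85):
    """Adjust the baseline score by confidence and keep it inside the score range."""
    lower, upper = LEVEL_RANGE[CATEGORY_LEVEL[label]]
    score = BASELINE[label]
    if confidence >= first_threshold:
        score += bonus       # confidence not lower than the first threshold
    elif confidence < second_threshold:
        score -= penalty     # confidence lower than the second threshold
    # The lower bound comes from the text; the upper bound is an assumption.
    return max(lower, min(upper, score))
```

With the values from paragraph [0049], `content_sensitivity_score("financial statements", 0.96)` gives 97, a confidence of 0.88 gives 95, and a confidence of 0.82 gives 92.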
[0050] By combining the semantic understanding capability of the text topic classification model with the domain knowledge rules of the first mapping relationship, the above technical solution not only makes full use of the semantic recognition accuracy of the deep learning model, but also realizes the quantitative reflection of the reliability of the classification results through the dynamic adjustment of the confidence level within the score range, providing a more accurate input for subsequent comprehensive risk scoring.
[0051] In addition, for encrypted compressed files whose content cannot be parsed, since the actual content of the internal files cannot be obtained, the content semantic sensitivity score is directly set to the preset encrypted file risk score. This preset score can be set to 80 points to reflect the possible intention to evade detection in the encryption behavior itself.
[0052] S30. Based on the metadata, perform multi-dimensional risk feature extraction to determine risk scores for multiple behavioral dimensions; wherein, the multiple behavioral dimensions include sender role and permission dimension, receiver relationship dimension, chat scenario dimension, sender behavior pattern dimension, and time context dimension, the sender role and permission dimension is used to determine the role and permission matching score, the receiver relationship dimension and the chat scenario dimension are used to jointly determine the receiver relationship risk score, the sender behavior pattern dimension is used to determine the behavior pattern abnormality score, and the time context dimension is used to determine the time context risk score.
[0053] In this embodiment, based on the acquired metadata, risk features related to the file outgoing behavior are extracted across multiple dimensions, and the risk score for each dimension is determined from these features. Five dimensions are included: the sender role and permission dimension, which measures the risk level of the sender's own identity and permissions; the receiver relationship dimension, which measures the risk level of the receiver's identity and its relationship with the sender; the chat scenario dimension, which measures the risk level of the conversation environment in which the file is sent; the sender behavior pattern dimension, which measures how far the current outgoing behavior deviates from the historical behavior baseline; and the time context dimension, which measures the risk level of the outgoing time. The sender role and permission dimension is used to determine the role and permission matching score, the sender behavior pattern dimension is used to determine the behavior pattern abnormality score, the receiver relationship dimension and the chat scenario dimension are used together to determine the receiver relationship risk score, and the time context dimension is used to determine the time context risk score. The score for each dimension is determined through a preset mapping relationship; a higher score indicates a greater risk for that dimension.
[0054] In one specific embodiment, step S30 includes the following steps: S31. Based on the metadata, extract the sender role permission features corresponding to the sender information, the receiver relationship features corresponding to the receiver information, and the chat scene features corresponding to the chat type, and obtain the sender behavior pattern features and time context features based on the sender information and timestamp statistical analysis.
[0055] This embodiment extracts corresponding identity features, relationship features, and conversation features based on sender information, receiver information, and chat type from the metadata. Simultaneously, it performs statistical analysis on the sender's historical outbound records to extract behavioral habit features and temporal features. These extracted features are used as risk features across five behavioral dimensions for subsequent calculation of risk scores for each dimension.
[0056] In one specific embodiment, step S31 includes the following steps:
S311. Query the enterprise organizational structure database to obtain the sender's department, job level, job permissions, and employment status, which are used as the sender's role and permission features.
S312. Query the external contact database and count the number of message exchanges and file transfers between the sender and the receiver within a preset historical period to obtain the receiver type, the type of organization to which the receiver belongs, and the frequency of historical interactions, which are used as the receiver relationship features.
S313. Determine from the chat type whether the conversation is a one-on-one chat, an internal group chat, or an external group chat. If it is a group chat, also count the number of group members and the proportion of external members. These are used as the chat scenario features.
S314. Statistically analyze the sender's total number of outgoing files, average daily number of outgoing files, outgoing time distribution, outgoing file type distribution, and outgoing recipient distribution within the preset historical period, and calculate the degree of deviation of the current outgoing behavior from the historical baseline, which is used as the sender's behavior pattern features.
S315. Compare the current timestamp with the pre-configured working hours, holidays, and departure dates to determine whether the outgoing time belongs to working hours, non-working hours, or a sensitive time period, which is used as the time context feature.
[0057] In this embodiment, risk features are extracted based on metadata from five behavioral dimensions: sender role and permission features, receiver relationship features, chat scenario features, sender behavior pattern features, and time context features.
[0058] The sender's department, job level, job authority, and employment status are retrieved from the enterprise organizational structure database to form the risk characteristics of the sender's role and authority dimension. Employment status includes both current and former employees. Job authority is used to determine whether the sender has the business authorization to send specific types of documents; for example, sending financial statements is within the authority of a finance professional, but outside the authority of a research and development professional.
[0059] The system retrieves the recipient's type and organization type from an external contact database and calculates the number of message exchanges and file transfers between the sender and recipient within a preset historical period to determine the frequency of historical interactions. This can be achieved using SQL aggregation queries or a log analysis platform. The preset historical period can be set to the past 30 days. Recipient types are categorized as internal employees and external contacts, and organization types are categorized as customers, suppliers, partners, competitors, and unknown. The recipient type, organization type, and historical interaction frequency collectively constitute the risk characteristics of the recipient relationship dimension.
[0060] The chat type field in the metadata determines whether the current conversation is a one-on-one chat, an internal group chat, or an external group chat. If it is a group chat, the total number of group members and the percentage of external members are further analyzed. The chat type, the number of group members, and the percentage of external members together constitute the risk characteristics of the chat scenario dimension.
[0061] The system statistically analyzes the total number of files sent, the average daily number of files sent, the distribution of sending times, the distribution of file types sent, and the distribution of recipients sent within a preset historical period. These statistical values are used as the historical baseline. The deviation of the current sending behavior from the historical baseline is calculated as a risk characteristic of the sender's behavioral pattern dimension. The deviation is calculated by comparing the current daily number of files sent with the historical average daily number of files sent plus a specified multiple of the standard deviation, and by determining whether the current file types and recipients sent appear in the historical distribution.
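The deviation calculation in paragraph [0061] can be sketched as follows. This is a minimal sketch under stated assumptions: the function names, the use of the population standard deviation, the handling of a zero standard deviation, and the default multiple `k=3` are illustrative choices, not values fixed by this paragraph.

```python
from statistics import mean, pstdev

def daily_count_deviation(history, today):
    """Number of standard deviations by which today's count exceeds the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # Assumption: with no historical variation, any count above the mean is an extreme deviation.
        return 0.0 if today <= mu else float("inf")
    return (today - mu) / sigma

def behavior_features(history, today, file_type, recipient,
                      seen_types, seen_recipients, k=3):
    """Collect the behavior pattern features for the current outgoing event."""
    z = daily_count_deviation(history, today)
    return {
        "z_score": z,
        "exceeds_threshold": z > k,  # today > mean + k * standard deviation
        "new_file_type": file_type not in seen_types,
        "new_recipient": recipient not in seen_recipients,
    }

# Hypothetical history of daily outgoing file counts (mean 5, standard deviation 2).
features = behavior_features([2, 4, 4, 4, 5, 5, 7, 9], 12, "xlsx", "a@example.com",
                             seen_types={"docx"}, seen_recipients={"a@example.com"})
```

In this example the current count of 12 lies 3.5 standard deviations above the mean, so `exceeds_threshold` is true, and the file type `xlsx` is flagged as new.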
[0062] The system reads the current timestamp and compares it with a pre-configured working time range, a list of holidays, and the sender's departure date to determine whether the current outgoing time falls within working hours, non-working hours, or a sensitive period, using this as a risk characteristic within the time context dimension. For example, working hours can be configured as weekdays from 9:00 AM to 6:00 PM; non-working hours include weekday evenings, weekends, and public holidays; sensitive periods include the period within a specified number of days before the sender's departure, the period within a specified number of days before the deadline for bidding on major projects, or the financial settlement period.
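The time classification in paragraph [0062] can be sketched as follows. The function name, the parameter names, and the seven-day pre-departure window are assumptions for illustration. Only the pre-departure sensitive period is modeled; the bidding deadline and financial settlement periods would be checked in the same way against their own configured dates.

```python
from datetime import datetime, date, time

def time_category(ts, holidays, departure_date=None, pre_departure_days=7,
                  work_start=time(9, 0), work_end=time(18, 0)):
    """Classify a timestamp as 'sensitive', 'working', or 'non-working'."""
    if departure_date is not None:
        days_left = (departure_date - ts.date()).days
        if 0 <= days_left <= pre_departure_days:
            return "sensitive"  # within the configured window before departure
    is_weekday = ts.weekday() < 5  # Monday is 0, Sunday is 6
    if is_weekday and ts.date() not in holidays and work_start <= ts.time() < work_end:
        return "working"
    return "non-working"  # weekday evenings, weekends, and public holidays

# Hypothetical usage: a weekday morning with no holidays configured.
category = time_category(datetime(2026, 5, 18, 10, 0), holidays=set())
```

The sensitive-period check is placed first so that a file sent during working hours shortly before the sender's departure is still classified as sensitive.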
[0063] S32. Based on the preset second mapping relationship between sender status, permission matching and score, determine the role permission matching score according to the sender role permission characteristics.
[0064] In this embodiment, after the sender's role and permission features are extracted, the sender's employment status, job permission list, and department information are obtained. The second mapping relationship defines a four-level mapping rule from sender status and permission matching to score: Level 1: When the sender has left the company, regardless of the type of file sent, the role permission matching score is mapped to the highest-risk range, from 95 to 100 points, with a default value of 100 points. Under this rule, any file sent by a former employee is considered high-risk.
[0065] Level 2: When the sender is employed but the type of file being sent is outside their job's permission list, and the file contains highly sensitive content, the role permission matching score is mapped to the second highest risk range, with a score range of 85 to 94 points and a default value of 90 points. Examples include an administrative staff member sending financial statements or a sales staff member sending core module source code. If the sender also exhibits other risk characteristics (such as recent unusual outgoing behavior), the score is increased by 2 to 4 points from the default value.
[0066] Level 3: When the sender is employed and the type of file being sent falls within their job's permission list, and the file contains highly sensitive content, the role-permission matching score is mapped to a medium-risk range, with a score between 40 and 60 points, and a default value of 50 points. This rule applies to situations where the sender sends highly sensitive files within their permissions, the permissions themselves are not deviated from, but the file is highly sensitive, resulting in a medium-level score. If the sender has sent similar highly sensitive files multiple times in the past without any violation, the score will be reduced by 5 to 8 points from the default value; if this is the first time sending a highly sensitive file, the score will be increased by 5 to 8 points.
[0067] Level 4: When the outgoing file contains low-sensitivity or no sensitive content, regardless of the sender's employment status, the role-based permission matching score is mapped to the low-risk range, with a score range of 0 to 39 points and a default value of 20 points. Fine-tuning is performed based on the specific file type and sending frequency; the lower limit is used for frequently sent file types, while the upper limit is used for occasionally sent file types.
[0068] The matching process for the above four-level rules is as follows: First, determine whether the employment status indicates that the sender has left the company. If so, apply the first-level rule directly. If not, determine whether the outgoing document contains highly sensitive content. If it is highly sensitive, further determine whether the outgoing document type falls within the sender's job permission list. If it is not within the permission list, apply the second-level rule; if it is within the permission list, apply the third-level rule. If the outgoing document contains low-sensitivity or no sensitive content, apply the fourth-level rule.
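The matching order above can be sketched as a short decision function. The function name, the status string `"departed"`, and the sensitivity strings are hypothetical. The sketch returns only the default value of each level and omits the fine adjustments. The text does not state which level applies to medium-sensitivity content, so treating every non-high sensitivity as Level 4 is an assumption of this sketch.

```python
def role_permission_score(employment_status, file_sensitivity, file_type, permitted_types):
    """Return the default role permission matching score for the matched level."""
    if employment_status == "departed":
        return 100  # Level 1: the sender has left the company
    if file_sensitivity == "high":
        if file_type not in permitted_types:
            return 90  # Level 2: highly sensitive and outside the permission list
        return 50      # Level 3: highly sensitive but within the permission list
    return 20          # Level 4: low-sensitivity or no sensitive content

# Hypothetical usage: a research and development employee sends financial statements.
score = role_permission_score("employed", "high", "financial statements",
                              permitted_types={"core module source code"})
```

Checking the employment status first mirrors the text: a former employee is scored at Level 1 regardless of what is sent.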
[0069] S33. Based on the preset third mapping relationship between receiver type, interaction relationship and score, determine the receiver relationship risk score according to the receiver relationship characteristics and the chat scenario characteristics.
[0070] In this embodiment, after extracting receiver relationship features and chat scenario features, the receiver type, organization type, historical interaction frequency, chat type, number of group members, and proportion of external members are obtained. The third mapping relationship defines a five-level mapping rule: Level 1: When the recipient is an external competitor, the recipient relationship risk score is mapped to the highest risk range, with a score range of 90 to 100 points and a default value of 95 points. If the competitor has no historical business dealings with the sender, the upper limit of 100 points is used; if there have been compliant business dealings in the past, the default value of 95 points is used.
[0071] Level 2: When the recipient is an unknown external organization and this is the first interaction, the recipient relationship risk score is mapped to the second highest risk range, with a score range of 80 to 89 points and a default value of 85 points. If the chat scenario features indicate that the chat type is an external group chat and the proportion of external members in the group exceeds 50%, the score is increased by 3 to 4 points from the default value; if it is only a one-on-one chat scenario, the default value is maintained.
[0072] Level 3: When the recipient is an external client or partner and highly sensitive documents are sent, the recipient relationship risk score is mapped to the medium risk range, with a score range of 60 to 79 points and a default value of 70 points. If the sender's historical interaction frequency with the recipient exceeds the preset interaction threshold, such as more than 10 file transfers in the past 30 days, the score will be reduced by 5 to 8 points; if the historical interaction frequency is lower than the preset interaction threshold, the score will be increased by 5 to 8 points.
[0073] Level 4: When the recipient is an external customer or partner with a high frequency of historical interactions, the recipient relationship risk score is mapped to a low-to-medium risk range, with a score range of 30 to 59 points and a default value of 45 points. If the content of the sent file is of low sensitivity or has no sensitive content, the score is further reduced by 10 to 15 points; if the content of the sent file is of medium sensitivity or high sensitivity, the default value is maintained.
[0074] Level 5: When the recipient is an internal employee and the chat type is a purely internal group chat, the recipient relationship risk score is mapped to the low-risk range, with a score range of 0 to 29 points and a default value of 15 points. In the case of a one-on-one chat, the score is set to 10 points; in the case of an internal group chat where all members are from the same department, the score is set to 5 points.
[0075] In the third mapping relationship, the adjustment effect of chat scenario features is reflected in two aspects: First, when the chat type is an external group chat (including external members), a preset adjustment score of 20 points is added to the scores at each level mentioned above, with the maximum score not exceeding 100 points after the addition. Second, in the second-level rules, the higher the proportion of external members in the external group chat, the greater the score increase.
[0076] The matching process for the above five-level rules is as follows: First, determine if the recipient is an external competitor. If so, apply the first-level rule directly. If not, determine if the recipient is an unknown external organization and this is the first interaction. If so, apply the second-level rule. If not, further determine the type of organization the recipient belongs to. If the recipient is an external client or partner, determine whether to apply the third or fourth-level rule based on the semantic sensitivity level of the sent document and the frequency of historical interactions between the sender and the recipient. If the recipient is an internal employee and the chat type is a purely internal group chat, apply the fifth-level rule. Each match determines whether to apply external group chat bonus points based on the chat type.
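The five-level matching and the external group chat adjustment can be sketched as follows. All names and string values are hypothetical, and only the default value of each level is returned. The text leaves the choice between Level 3 and Level 4 to the sensitivity level and the interaction frequency without fixing a precedence; this sketch assumes that a highly sensitive file selects Level 3 and that Level 4 otherwise requires the interaction threshold to be exceeded. Combinations that the text does not cover raise an error instead of being guessed.

```python
def recipient_relationship_score(recipient_type, org_type, first_interaction,
                                 transfers_30d, file_sensitivity, chat_type,
                                 interaction_threshold=10, external_group_bonus=20):
    """Return the default recipient relationship risk score for the matched level."""
    if recipient_type == "external" and org_type == "competitor":
        score = 95  # Level 1
    elif recipient_type == "external" and org_type == "unknown" and first_interaction:
        score = 85  # Level 2
    elif recipient_type == "external" and org_type in ("customer", "partner"):
        if file_sensitivity == "high":
            score = 70  # Level 3 (precedence over Level 4 is an assumption)
        elif transfers_30d > interaction_threshold:
            score = 45  # Level 4
        else:
            raise ValueError("combination not covered by the described rules")
    elif recipient_type == "internal":
        score = 15  # Level 5
    else:
        raise ValueError("combination not covered by the described rules")
    if chat_type == "external_group":
        # Preset adjustment score for external group chats, capped at 100 points.
        score = min(100, score + external_group_bonus)
    return score
```

For example, a first-time unknown external recipient in a one-on-one chat yields 85 points, while a competitor in an external group chat is capped at 100 points.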
[0077] S34. Based on the preset fourth mapping relationship between the degree of behavioral deviation and the score, determine the abnormal score of the behavioral pattern according to the characteristics of the sender's behavioral pattern.
[0078] In this embodiment, after extracting the sender's behavior pattern features, the deviation between the sender's current outgoing behavior and the historical baseline is obtained. The historical baseline includes the total number of outgoing files, the average daily number of outgoing files, the outgoing time distribution, the outgoing file type distribution, and the outgoing recipient distribution within a preset historical period. The deviation is quantified by comparing the current behavior with the historical baseline. The fourth mapping relationship defines a five-level mapping rule: Level 1: When the number of files sent out in a single day exceeds the historical daily average plus three standard deviations, the behavior pattern abnormality score is mapped to the highest risk range, with a score range of 90 to 100 points and a default value of 95 points. If the number reaches the historical daily average plus five standard deviations or more, the upper limit of 100 points is used; if it only just exceeds the average plus three standard deviations, the lower limit of 90 points is used.
[0079] Level 2: When the number of files sent out in a single hour exceeds the historical maximum number of files sent out in a single hour, the behavior pattern abnormality score is mapped to the second highest risk range, with a score range of 85 to 89 points and a default value of 87 points. If the type of file sent is also appearing for the first time, the score is increased by 2 points.
[0080] Level 3: When the outgoing file type appears for the first time, the behavior pattern abnormality score is mapped to the third highest risk range, with a score range of 75 to 84 points and a default value of 80 points. If the file type appearing for the first time is a highly sensitive content type, the upper limit of 84 points is used; if it is a low-sensitivity or non-sensitive content type, the lower limit of 75 points is used.
[0081] Level 4: When the outbound recipient is a new appearance and an external contact, the abnormal behavior pattern score is mapped to the medium risk range, with a score range of 70 to 74 points and a default value of 72 points. If the external contact belongs to an unknown organization, the upper limit of 74 points is used; if it is a customer or partner, the lower limit of 70 points is used.
[0082] Level 5: When all indicators of current outward behavior fluctuate within the normal range, the abnormal behavior pattern score is mapped to the low-risk range, with a score range of 0 to 50 points and a default value of 25 points. Fine-tuning is performed based on minor fluctuations in the degree of deviation: 25 points for fluctuations between 0 and 1 standard deviation, 40 points for fluctuations between 1 and 2 standard deviations, and 50 points for fluctuations exceeding 2 standard deviations but without triggering other level rules.
[0083] The matching and judgment process for the above five-level rules is as follows: First, determine whether the number of outgoing files sent in a single day exceeds the historical average plus three standard deviations. If so, apply the first-level rule. If not, determine whether the number of outgoing files sent in a single hour exceeds the historical maximum value. If so, apply the second-level rule. If not, determine whether the type of outgoing file is appearing for the first time. If so, apply the third-level rule. If not, determine whether the recipient of the outgoing file is appearing for the first time and is an external contact. If so, apply the fourth-level rule. If none of the above conditions are met, apply the fifth-level rule, and determine the specific score based on the deviation fluctuation range.
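The cascade above can be sketched as follows. The function and parameter names are hypothetical; `z_daily` is assumed to be the number of standard deviations by which the current daily count exceeds the historical daily average. Levels 1 to 4 return their default values only, and Level 5 uses the three point values given for the deviation bands.

```python
def behavior_pattern_score(z_daily, hourly_count, hourly_max,
                           new_file_type, new_external_recipient):
    """Return the behavior pattern abnormality score for the first matching level."""
    if z_daily > 3:
        return 95  # Level 1: daily count above the average plus three standard deviations
    if hourly_count > hourly_max:
        return 87  # Level 2: hourly count above the historical hourly maximum
    if new_file_type:
        return 80  # Level 3: file type appears for the first time
    if new_external_recipient:
        return 72  # Level 4: recipient is new and is an external contact
    # Level 5: all indicators within the normal range, graded by deviation band.
    if z_daily <= 1:
        return 25
    if z_daily <= 2:
        return 40
    return 50

# Hypothetical usage: normal volume, but a file type never sent before.
score = behavior_pattern_score(0.5, 2, 6, new_file_type=True, new_external_recipient=False)
```

Because the rules are checked in order, an event that matches several conditions receives the score of the highest-risk level it matches.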
[0084] S35. Based on the preset fifth mapping relationship between time attributes and scores, determine the time context risk score according to the time context features.
[0085] In this embodiment, after extracting the time context features, the time category to which the current outgoing time belongs is obtained. The time category includes working hours, non-working hours, and sensitive time periods. Sensitive time periods are further subdivided into pre-departure sensitive periods (within a specified number of days before the sender's departure), business sensitive periods (within a specified number of days before the deadline for bidding on major projects), and financial sensitive periods (within the financial settlement period). The fifth mapping relationship defines four levels of mapping rules: Level 1: When the outbound message is sent within a specified number of days before the employee's departure, the time context risk score is mapped to the highest risk range, with a score range of 95 to 100 points and a default value of 97 points. If the departure date is within three days, the upper limit of 100 points is used; if the departure date is between four and seven days, the lower limit of 95 points is used. This rule applies to situations where the sender has submitted a resignation application and is within the handover period.
[0086] Level 2: When sending highly sensitive files outside of working hours, the time context risk score is mapped to the second highest risk range, with a score range of 80 to 94 points and a default value of 87 points. Non-working hours include weekday evenings (e.g., 6:00 PM to 9:00 AM the next day), weekends, and public holidays. If the sending time is late at night (e.g., 12:00 AM to 6:00 AM), the score is increased by 5 to 7 points from the default value; if it is only during regular weekday evening hours, the default value is maintained.
[0087] Level 3: When the outgoing document is sent during a sensitive business period and contains highly sensitive files, the time context risk score is mapped to the third highest risk range, with a score range of 70 to 79 points, and a default value of 75 points. Sensitive business periods include the specified number of days before the bid deadline for major projects and the financial settlement period. If the semantic classification tag of the outgoing document's content is highly relevant to the business type of the sensitive business period, such as sending a bid quotation for a competitive product before the bid deadline, the upper limit of 79 points is used; if it is only time-sensitive but the document content is not highly relevant to the business type, the lower limit of 70 points is used.
[0088] Level 4: When the outgoing data is sent during working hours, the time context risk score is mapped to the low-risk range, with a score range of 0 to 50 points and a default value of 20 points. If the sent file contains highly sensitive content, the score is increased by 15 to 20 points from the default value; if the sent file contains moderately sensitive content, the score is increased by 5 to 10 points; if the sent file contains low-sensitivity or no sensitive content, the default value is maintained or further reduced to the lower limit.
[0089] It should be noted that the first-level and second-level rules of the fifth mapping relationship treat file sensitivity differently. The first-level rule applies regardless of the sensitivity of the file, since the pre-departure period is already an extreme risk scenario. The second-level rule applies only when both conditions are met: the file is sent outside working hours and it is highly sensitive. If a file sent outside working hours is of low sensitivity or no sensitivity, it falls into the low-risk range of the fourth-level rule.
[0090] The matching process for the above four-level rules is as follows: First, determine if the current time is within the specified number of days before leaving the company. If so, apply the first-level rule directly. If not, determine if the current time is outside of working hours. If outside of working hours, further determine if the sensitivity of the sent file is high. If high, apply the second-level rule; if low or no sensitivity, apply the fourth-level rule. If the current time is not outside of working hours, determine if the current time is within a sensitive business period. If within a sensitive business period and a highly sensitive file is sent, apply the third-level rule; otherwise, apply the fourth-level rule. If none of the above conditions are met, i.e., the current time is during normal working hours, apply the fourth-level rule.
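The matching order above can be sketched as follows. The function and parameter names are hypothetical, and only the default value of each level is returned. The text does not say how medium-sensitivity files sent outside working hours are handled, so treating every non-high sensitivity as Level 4 is an assumption of this sketch.

```python
def time_context_score(pre_departure, working_hours, business_sensitive, file_sensitivity):
    """Return the default time context risk score for the matched level."""
    if pre_departure:
        return 97  # Level 1: within the specified number of days before departure
    if not working_hours:
        # Level 2 applies only to highly sensitive files; otherwise Level 4.
        return 87 if file_sensitivity == "high" else 20
    if business_sensitive and file_sensitivity == "high":
        return 75  # Level 3: sensitive business period and highly sensitive file
    return 20      # Level 4: normal working hours

# Hypothetical usage: a highly sensitive file sent on a weekend.
score = time_context_score(pre_departure=False, working_hours=False,
                           business_sensitive=False, file_sensitivity="high")
```

The pre-departure check comes first, so it overrides both the working-hours check and the file sensitivity.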
[0091] By combining the results of multidimensional risk feature extraction with domain knowledge rules of each preset mapping relationship through the above technical solution, the risk score of each behavioral dimension not only reflects the type of risk feature, but also reflects the subtle differences in risk level through dynamic adjustment within the score range, providing reliable input data for subsequent comprehensive risk scoring and dimension conflict detection.
[0092] S40. The content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the behavior pattern abnormal score, and the time context risk score are compared with the preset alarm thresholds of the corresponding dimensions. When it is detected that the scores of at least two dimensions are lower than their preset alarm thresholds, it is determined that there is a dimension conflict. The weighted sum of the above scores is reduced according to the preset conflict resolution rules to obtain the comprehensive risk score.
[0093] In this embodiment, a dimension conflict refers to a situation where the risk signals from multiple dimensions contradict each other. For example, the content semantic sensitivity score exceeds its alarm threshold, indicating that the file content is highly sensitive, while the role permission matching score is below its alarm threshold, indicating that the sender is a normal, currently employed employee who sent the file within their authorized scope; the receiver relationship risk score is below its alarm threshold, indicating that the receiver is an internal employee with a history of frequent interactions; and the time context risk score is below its alarm threshold, indicating that the file was sent during working hours. In this case, the high score in the content dimension alone cannot support a high-risk determination. The sender role and permission dimension, the receiver relationship dimension, and the time context dimension are all within the normal range and conflict with the high-risk signal in the content dimension, indicating that sending this file is most likely a normal business operation rather than a malicious leak.
[0094] If only one dimension's score is detected to be below its preset alarm threshold, then no dimension conflict is determined. If at least two dimensions' scores are detected to be below their preset alarm thresholds, then a dimension conflict is determined. Once a dimension conflict is determined, the weighted sum of the scores for the five dimensions is reduced according to preset conflict resolution rules.
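The conflict determination described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the dictionary keys and the function name `has_dimension_conflict` are hypothetical, the thresholds are the default values given in paragraph [0102], and "below" is taken to mean strictly less than the threshold, since paragraph [0103] counts a score equal to the threshold as exceeding it.

```python
# Default alarm thresholds from paragraph [0102]; the key names are hypothetical.
ALARM_THRESHOLDS = {
    "content": 70,
    "role_permission": 50,
    "receiver": 60,
    "behavior": 60,
    "time": 60,
}


def has_dimension_conflict(scores, thresholds=ALARM_THRESHOLDS):
    """Return True when at least two dimension scores are below their thresholds."""
    below = [name for name, score in scores.items() if score < thresholds[name]]
    return len(below) >= 2


# Four dimensions below their thresholds: a conflict is determined.
mostly_normal = {"content": 95, "role_permission": 40, "receiver": 15,
                 "behavior": 30, "time": 10}
# Only the time dimension is below its threshold: no conflict.
mostly_risky = {"content": 95, "role_permission": 90, "receiver": 85,
                "behavior": 78, "time": 10}

print(has_dimension_conflict(mostly_normal))  # True
print(has_dimension_conflict(mostly_risky))   # False
```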
[0095] In a specific embodiment, in the weighted summation operation, the weights of the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the abnormal behavior pattern score, and the time context risk score decrease in that order.
[0096] In this embodiment, the weights are set in descending order because the causal correlation between risk signals of different dimensions and file leakage events varies. The content semantic sensitivity score directly reflects the sensitive attributes of the sent file itself and is the core basis for determining leakage risk, therefore it has the highest weight. The role permission matching score reflects whether the sender has permission to send this type of file; permission deviation is one of the necessary conditions for leakage behavior, so its weight is second highest. The receiver relationship risk score reflects the risk level of the file receiver's identity and interaction relationship; the receiver's identity is an external indicator of leakage behavior, so its weight is next highest. The abnormal behavior pattern score reflects the deviation of current outbound behavior from historical habits; abnormal behavior may be normal business fluctuations and is used as a reference factor, so its weight is low. The time context risk score reflects the degree of abnormality at the time of outbound transmission; time abnormality is only used as an auxiliary judgment factor, so its weight is the lowest.
[0097] Under the default configuration, the weights for the five dimensions are as follows: content semantic sensitivity score has a weight of 35%, role permission matching score has a weight of 25%, receiver relationship risk score has a weight of 20%, abnormal behavior pattern score has a weight of 15%, and time context risk score has a weight of 5%. The sum of the weights for the above five items is 100%.
[0098] The formula for calculating the weighted sum score is:
Weighted sum score = Content semantic sensitivity score × 35% + Role permission matching score × 25% + Receiver relationship risk score × 20% + Abnormal behavior pattern score × 15% + Time context risk score × 5%
In a specific example, the scores for each dimension of a certain document outreach behavior are as follows: content semantic sensitivity score 95, role permission matching score 40, receiver relationship risk score 90, abnormal behavior pattern score 78, and time context risk score 85. Substituting into the formula, the weighted sum score = 95 × 0.35 + 40 × 0.25 + 90 × 0.20 + 78 × 0.15 + 85 × 0.05 = 33.25 + 10.0 + 18.0 + 11.7 + 4.25 = 77.2 points.
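As a minimal sketch, the weighted summation under the default weights can be reproduced as follows. The dictionary keys and the function name `weighted_sum` are hypothetical; the weights are stored as integer percentages only to keep the arithmetic exact.

```python
# Default weights from paragraph [0097], as integer percentages (sum = 100).
WEIGHTS_PERCENT = {
    "content": 35,
    "role_permission": 25,
    "receiver": 20,
    "behavior": 15,
    "time": 5,
}


def weighted_sum(scores, weights_percent=WEIGHTS_PERCENT):
    """Weighted sum of the five dimension scores on a 0-100 scale."""
    return sum(scores[name] * w for name, w in weights_percent.items()) / 100


example = {"content": 95, "role_permission": 40, "receiver": 90,
           "behavior": 78, "time": 85}
print(weighted_sum(example))  # 77.2
```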
[0099] It is understandable that the above default weight configuration can be dynamically adjusted according to the actual business scenarios of enterprises. For example, for enterprises whose core business is R&D, the weight of the content semantic sensitivity score can be increased to 40%, and the weight of other dimensions can be reduced accordingly; for sales-driven enterprises, the weight of the receiver relationship risk score can be increased to 25%; for enterprises with high staff turnover, the weight of the role permission matching score can be increased to 30%. When adjusting, the sum of the five weights must be kept at 100%, and the decreasing order of the five dimensions' weights must remain unchanged.
[0100] Dynamic adjustment of weights can be achieved through a preset configuration parameter table. The configuration parameter table supports multiple preset schemes, including a default scheme, a scheme for R&D-oriented enterprises, a scheme for sales-oriented enterprises, and a scheme for high-turnover enterprises. Security administrators can select the appropriate scheme based on the actual situation of the enterprise, or customize the weight values for each dimension. With the adjusted weights, the weighted sum score is recalculated using the weighted summation formula, and the adjustment takes effect in real time.
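One possible way to enforce the two adjustment constraints (weights sum to 100%, descending order preserved) when a scheme is selected or customized is sketched below. Only the default scheme and the 40% content weight of the R&D scheme come from the description; the remaining R&D values, the key names, and the function `validate_scheme` are hypothetical, and "decreasing order" is assumed here to mean strictly decreasing.

```python
DIMENSIONS = ("content", "role_permission", "receiver", "behavior", "time")

SCHEMES = {
    # Default scheme from paragraph [0097].
    "default": {"content": 35, "role_permission": 25, "receiver": 20,
                "behavior": 15, "time": 5},
    # Hypothetical R&D scheme: only the 40% content weight is from the text.
    "rnd": {"content": 40, "role_permission": 24, "receiver": 19,
            "behavior": 13, "time": 4},
}


def validate_scheme(weights_percent):
    """Raise ValueError unless weights sum to 100 and strictly decrease."""
    values = [weights_percent[d] for d in DIMENSIONS]
    if sum(values) != 100:
        raise ValueError(f"weights must sum to 100, got {sum(values)}")
    if any(a <= b for a, b in zip(values, values[1:])):
        raise ValueError("weights must decrease from content to time")
    return weights_percent


for name, scheme in SCHEMES.items():
    validate_scheme(scheme)
    print(name, "ok")
```

A customized scheme would pass through the same check before taking effect.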
[0101] In one specific embodiment, step S40, which reduces the weighted sum of the scores of the above five dimensions according to a preset conflict resolution rule, includes the following steps:
S41. Obtain the number of dimensions whose scores exceed the preset alarm threshold among the five dimensions;
S42. Determine the target attenuation coefficient based on the correspondence between the number of dimensions and the preset attenuation coefficient;
S43. Multiply the weighted summation score by the target attenuation coefficient to obtain the comprehensive risk score.
[0102] In this embodiment, the dimensional conflict resolution mechanism is used to handle situations where risk signals from multiple dimensions are inconsistent, avoiding false alarms caused by anomalies in a single dimension leading to an inflated weighted summation score. The specific execution steps of the conflict resolution rules are as follows: In step S41, after the weighted summation of the scores across the five dimensions, the content semantic sensitivity score, role-permission matching score, receiver relationship risk score, abnormal behavior pattern score, and temporal context risk score are compared with the preset alarm thresholds for the corresponding dimensions. Each dimension has a preset independent alarm threshold, specifically configured as follows: the preset alarm threshold for the content semantic sensitivity score is 70 points, the preset alarm threshold for the role-permission matching score is 50 points, the preset alarm threshold for the receiver relationship risk score is 60 points, the preset alarm threshold for the abnormal behavior pattern score is 60 points, and the preset alarm threshold for the temporal context risk score is 60 points.
[0103] Obtain the number of dimensions whose scores exceed preset alarm thresholds across the five dimensions. Iterate through the scores of the five dimensions and compare them numerically with the preset alarm threshold for each dimension. If the score of a dimension is greater than or equal to its preset alarm threshold, add that dimension to the set of dimensions exceeding the threshold. Count the number of dimensions in this set, denoted as N.
[0104] For example, the scores for each dimension of a certain file outgoing behavior are as follows: content semantic sensitivity score 95 (exceeding its alarm threshold of 70), role permission matching score 40 (below its alarm threshold of 50), receiver relationship risk score 15 (below its alarm threshold of 60), abnormal behavior pattern score 30 (below its alarm threshold of 60), and time context risk score 10 (below its alarm threshold of 60). In this case, only the content semantic sensitivity score exceeds its alarm threshold, and N equals 1.
[0105] In another example, the scores for each dimension are as follows: content semantic sensitivity score 95 (exceeding the alarm threshold of 70), role permission matching score 90 (exceeding the alarm threshold of 50), receiver relationship risk score 85 (exceeding the alarm threshold of 60), abnormal behavior pattern score 78 (exceeding the alarm threshold of 60), and time context risk score 10 (below the alarm threshold of 60). In this case, four dimensions exceed their alarm thresholds, and N equals 4.
[0106] In step S42, the target attenuation coefficient is determined based on the correspondence between the number of dimensions N and the preset attenuation coefficient. This correspondence is preset as follows: when N equals 1, meaning only one dimension's score exceeds its alarm threshold, the target attenuation coefficient is 0.5; when N equals 2, meaning the scores of two dimensions exceed their alarm thresholds, the target attenuation coefficient is 0.75; when N is greater than or equal to 3, meaning the scores of three or more dimensions exceed their alarm thresholds, the target attenuation coefficient is 1.0, meaning no attenuation is performed.
[0107] The technical principle behind the above correspondence is as follows: When multiple dimensions simultaneously emit high-risk signals, they corroborate each other, thus maintaining a high overall risk score, resulting in minimal or no attenuation. When only one dimension emits a high-risk signal while the others are within the normal range, the high score of that single dimension may be an occasional anomaly rather than a genuine risk. In this case, the weighted sum score should be significantly reduced to obtain the overall risk score, thereby mitigating the anomalous signal from the single dimension. The attenuation coefficient increases with the number of dimensions exceeding the threshold, reflecting the cumulative confidence effect of multi-dimensional cross-validation.
[0108] In step S43, the weighted summation score is multiplied by the target attenuation coefficient to obtain the comprehensive risk score.
[0109] In the first example above, the weighted sum score is 95 × 0.35 + 40 × 0.25 + 15 × 0.20 + 30 × 0.15 + 10 × 0.05 = 33.25 + 10.0 + 3.0 + 4.5 + 0.5 = 51.25 points. N equals 1, the target attenuation coefficient is 0.5, and the comprehensive risk score is 51.25 × 0.5 = 25.625 points.
[0110] In the second example above, the weighted sum score is 95 × 0.35 + 90 × 0.25 + 85 × 0.20 + 78 × 0.15 + 10 × 0.05 = 33.25 + 22.5 + 17.0 + 11.7 + 0.5 = 84.95 points. N equals 4, the target attenuation coefficient is 1.0, so no attenuation is applied and the comprehensive risk score is 84.95 points.
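Steps S41 to S43 can be sketched together as follows, assuming the default weights and alarm thresholds of this embodiment. All function and key names are hypothetical. The description does not state a coefficient for N = 0; this sketch assumes the same 0.5 coefficient as for N = 1, which is an assumption rather than part of the method.

```python
WEIGHTS_PERCENT = {"content": 35, "role_permission": 25, "receiver": 20,
                   "behavior": 15, "time": 5}
ALARM_THRESHOLDS = {"content": 70, "role_permission": 50, "receiver": 60,
                    "behavior": 60, "time": 60}


def count_exceeding(scores, thresholds=ALARM_THRESHOLDS):
    """S41: number N of dimensions whose score is >= its alarm threshold."""
    return sum(1 for name in scores if scores[name] >= thresholds[name])


def attenuation_coefficient(n):
    """S42: map N to the preset attenuation coefficient."""
    if n >= 3:
        return 1.0
    if n == 2:
        return 0.75
    # N == 1 per the description; N == 0 is an assumption of this sketch.
    return 0.5


def comprehensive_score(scores):
    """S43: weighted sum multiplied by the target attenuation coefficient."""
    weighted = sum(scores[d] * w for d, w in WEIGHTS_PERCENT.items()) / 100
    return weighted * attenuation_coefficient(count_exceeding(scores))


first = {"content": 95, "role_permission": 40, "receiver": 15,
         "behavior": 30, "time": 10}
second = {"content": 95, "role_permission": 90, "receiver": 85,
          "behavior": 78, "time": 10}
for scores in (first, second):
    print(count_exceeding(scores), comprehensive_score(scores))
```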
[0111] Through the aforementioned conflict resolution rules, this invention can automatically adjust the weighted summation score when risk signals across dimensions are inconsistent, resulting in a comprehensive risk score and effectively reducing false alarms caused by anomalies in a single dimension. Simultaneously, it maintains a high score when risk signals from multiple dimensions corroborate each other, ensuring accurate capture of truly high-risk behaviors.
[0112] In one specific embodiment, before comparing the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the abnormal behavior pattern score, and the time context risk score with the preset alarm thresholds for the corresponding dimensions in step S40, the following scenario sensitivity processing may also be performed:
S60. Based on the risk characteristics of the receiver relationship dimension and the risk characteristics of the chat scenario dimension, determine whether the current file outgoing behavior is in a high-risk receiving scenario; wherein a high-risk receiving scenario is one in which the risk characteristics of the receiver relationship dimension indicate that the receiver is an external competitor, or that the receiver is an unknown external organization interacting with the sender for the first time, or in which the risk characteristics of the chat scenario dimension indicate that the chat type is an external group chat containing external members;
S70. If it is determined that the behavior is in a high-risk receiving scenario, the content semantic sensitivity score is increased according to the preset correction rule, and the preset alarm threshold corresponding to the content semantic sensitivity score is lowered at the same time.
[0113] In this embodiment, the scenario sensitivity processing step chooses to adjust the content semantic sensitivity score and alarm threshold instead of adjusting the receiver relationship risk score for the following reasons: First, in high-risk reception scenarios, the actual risk of file leakage is composed of both the file content sensitivity and the high-risk attributes of the receiver. Adjusting the content semantic sensitivity score can quantify the composite effect of risk transmission from the receiver to the content, while adjusting the receiver score cannot reflect this coupling relationship. Second, the receiver relationship risk score itself has already assigned a high score to the high-risk reception scenario through the third mapping relationship. Adjusting it further would result in the same risk feature being double-penalized, reducing the system's ability to distinguish risk levels. Third, in high-risk reception scenarios, the receiver dimension score usually exceeds the alarm threshold. If the content semantic dimension score is still below the threshold, the inconsistency between the two risk signals may trigger dimension conflict resolution, leading to a decrease in the overall score. Improving the content semantic sensitivity score and lowering its alarm threshold makes it easier for the content semantic dimension to form risk signals with the receiver dimension, avoiding missed detections due to dimension conflict resolution. In addition, it has higher detection sensitivity for file outbound behavior in high-risk reception scenarios. Even if the semantic classification of the file content does not belong to the highest sensitivity level, it can be captured by the system more sensitively, avoiding missed detections due to low content dimension scores.
[0114] Specifically, in step S60, the determination of a high-risk reception scenario is based on three conditions. The order of the three conditions can be arranged arbitrarily. If any one of the conditions is met, it is determined to be in a high-risk reception scenario, and there is no need to continue to determine the other conditions.
[0115] The first condition, the risk characteristics of the recipient relationship dimension, reflects that the recipient is an external competitor. The determination of an external competitor is based on whether the recipient's organization belongs to a pre-defined competitor list. This competitor list is pre-maintained by the security administrator and includes known competitor names and their associated domain names.
[0116] The second condition is that the risk characteristics of the receiver relationship dimension reflect that the receiver is an unknown external organization and this interaction is the first interaction. The determination of an unknown external organization is based on the receiver's organization type being in the unknown category. The determination of the first interaction is based on the fact that the number of message exchanges and file transfers between the sender and receiver are both zero within a preset historical period.
[0117] The third condition, based on the risk characteristics of the chat scenario, reflects that the chat type is an external group chat containing external members. The criteria for determining an external group chat is the presence of at least one external member, meaning the percentage of external members is greater than zero. When the percentage of external members in the group chat exceeds a preset threshold (e.g., 50%), the risk level is higher, but the determination condition itself only requires the presence of external members to be triggered.
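The three conditions of step S60 can be sketched as a short-circuiting check. The `OutgoingContext` fields, the competitor domain list, and the organization-type labels below are hypothetical placeholders for whatever the receiver relationship and chat scenario features actually provide.

```python
from dataclasses import dataclass

# Hypothetical competitor list maintained by the security administrator.
COMPETITOR_DOMAINS = frozenset({"rival.example"})


@dataclass
class OutgoingContext:
    receiver_domain: str          # domain of the receiver's organization
    receiver_org_type: str        # e.g. "internal", "partner", "unknown"
    prior_messages: int           # message exchanges in the history period
    prior_file_transfers: int     # file transfers in the history period
    is_group_chat: bool
    external_member_ratio: float  # 0.0 to 1.0


def is_high_risk_receiving_scenario(ctx, competitors=COMPETITOR_DOMAINS):
    """Return True as soon as any one of the three conditions holds."""
    # Condition 1: receiver belongs to a listed external competitor.
    if ctx.receiver_domain in competitors:
        return True
    # Condition 2: unknown external organization and first interaction.
    if (ctx.receiver_org_type == "unknown"
            and ctx.prior_messages == 0
            and ctx.prior_file_transfers == 0):
        return True
    # Condition 3: group chat containing at least one external member.
    return ctx.is_group_chat and ctx.external_member_ratio > 0


ctx = OutgoingContext("partner.example", "unknown", 0, 0, False, 0.0)
print(is_high_risk_receiving_scenario(ctx))  # True (condition 2)
```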
[0118] In step S70, improving the content semantic sensitivity score and lowering its preset alarm threshold are two parallel adjustment operations, both based on preset correction rules.
[0119] To raise the content semantic sensitivity score, a multiplicative correction rule is used: the current value of the content semantic sensitivity score is multiplied by a preset score correction coefficient. The default value of the score correction coefficient is 1.2, which means that the content semantic sensitivity score is increased by 20% in high-risk reception scenarios. The adjusted score cannot exceed 100 points. For example, if the original content semantic sensitivity score is 70 points, multiplying it by 1.2 yields 84 points; if the original content semantic sensitivity score is 90 points, multiplying it by 1.2 yields 108 points, which is capped at the maximum value of 100 points.
[0120] To lower the preset alarm threshold corresponding to the content semantic sensitivity score, a subtractive correction rule is used: the preset threshold correction amount is subtracted from the current value of the alarm threshold. The default value of the threshold correction amount is 20 points. The adjusted alarm threshold will not be lower than a preset minimum threshold, such as 10 points. For example, if the original alarm threshold is 70 points, subtracting 20 points results in 50 points; if the original alarm threshold is 30 points, subtracting 20 points results in 10 points, and the threshold will not be lowered further.
[0121] The technical principle behind the aforementioned correction rules lies in the fact that, in high-risk reception scenarios, even if the semantic analysis result of the file content does not reach the highest sensitivity level, its potential risk may still be amplified due to the high risk at the receiving end. Therefore, a dual sensitivity mechanism is formed by increasing the content semantic sensitivity score and simultaneously lowering its alarm threshold. On the one hand, the increased score increases the contribution of this dimension in the weighted summation; on the other hand, the lowered alarm threshold makes this dimension more likely to be judged as exceeding the threshold, thereby reducing the probability of this dimension being judged as below the threshold in subsequent dimension conflict detection, and avoiding excessive attenuation of the overall risk score due to dimension conflict resolution triggered by the content dimension alone falling below the threshold.
[0122] In a specific application example, the original content semantic sensitivity score of a certain external file distribution behavior was 68 points, and the preset alarm threshold for the original content semantic sensitivity score was 70 points. The recipient was an external competitor, and the chat type was an external group chat, meeting the first and third conditions of a high-risk reception scenario, thus being judged as being in a high-risk reception scenario. After performing scenario sensitivity processing, the content semantic sensitivity score increased to 68 × 1.2 equals 81.6 points, rounded to 82 points; the preset alarm threshold decreased to 70 minus 20 equals 50 points. Before the adjustment, the score of 68 points in this dimension was lower than the alarm threshold of 70 points, and would be judged as below the threshold in dimension conflict detection; after the adjustment, the score of 82 points in this dimension was higher than the alarm threshold of 50 points, and would be judged as exceeding the threshold in dimension conflict detection. This change made the content semantic sensitivity dimension go from not triggering alarms to triggering alarms, effectively avoiding missed detections caused by low content dimension scores in high-risk reception scenarios.
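The two correction rules of step S70 can be sketched as one helper. The function and parameter names are hypothetical; the default values (coefficient 1.2, correction amount 20, cap 100, floor 10) are those given in this embodiment, and rounding to the nearest integer follows the worked example rather than an explicit rule in the description.

```python
def apply_scenario_sensitivity(score, threshold, coefficient=1.2,
                               threshold_cut=20, max_score=100,
                               min_threshold=10):
    """Return (adjusted content score, adjusted content alarm threshold)."""
    # Multiplicative correction, rounded and capped at the maximum score.
    adjusted_score = min(max_score, round(score * coefficient))
    # Subtractive correction, floored at the minimum threshold.
    adjusted_threshold = max(min_threshold, threshold - threshold_cut)
    return adjusted_score, adjusted_threshold


print(apply_scenario_sensitivity(68, 70))  # (82, 50)
print(apply_scenario_sensitivity(90, 30))  # (100, 10)
```

In the first call the dimension moves from below its threshold (68 < 70) to above it (82 ≥ 50), matching the application example.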
[0123] It should be noted that the adjustments to the scene sensitivity processing only affect the content semantic sensitivity score and its alarm threshold, and do not affect the scores and alarm thresholds of the other four dimensions. The adjusted content semantic sensitivity score and alarm thresholds are only used for subsequent dimension conflict detection and comprehensive risk score calculation for this document's external release behavior, and do not have a lasting impact on the scores of other document external release behaviors.
[0124] S50. When the comprehensive risk score exceeds the preset reporting threshold, a risk analysis report is generated and an early warning message is pushed. The risk analysis report includes the dimension identifiers of those dimensions, among the five, whose scores exceed their preset alarm thresholds.
[0125] In this embodiment, the preset reporting threshold can be set to 70 points. This threshold can be dynamically adjusted according to the actual business scenario of the enterprise. When the comprehensive risk score after dimensional conflict resolution exceeds 70 points, an early warning process is triggered; if the comprehensive risk score does not exceed 70 points, no early warning is triggered, and only the external document issuance behavior is recorded in the behavior log.
[0126] Once the comprehensive risk score exceeds the preset reporting threshold, the system automatically generates a risk analysis report. The core content of the risk analysis report consists of the dimension identifiers of those dimensions, among the five, whose scores exceed their preset alarm thresholds. The dimension identifiers are determined by comparing the scores of each of the five dimensions with their corresponding preset alarm thresholds one by one. For dimensions whose scores exceed their preset alarm thresholds, their dimension identifiers are added to the dimension identifier list in the risk analysis report. Dimension identifiers can be represented by preset dimension names or dimension codes. For example, the identifier for the content semantic sensitivity dimension is "content sensitive," the identifier for the role permission matching dimension is "permission abnormal," the identifier for the receiver relationship risk dimension is "receiver risk," the identifier for the abnormal behavior pattern dimension is "behavior abnormal," and the identifier for the time context risk dimension is "time abnormal."
[0127] In a specific example, the scores and corresponding preset alarm thresholds for each dimension of a certain file outgoing behavior are as follows: content semantic sensitivity score 95 points, alarm threshold 70 points; role permission matching score 40 points, alarm threshold 50 points; receiver relationship risk score 90 points, alarm threshold 60 points; abnormal behavior pattern score 78 points, alarm threshold 60 points; time context risk score 85 points, alarm threshold 60 points. Upon comparison, the content semantic sensitivity score, the receiver relationship risk score, the abnormal behavior pattern score, and the time context risk score each exceed their alarm thresholds, while the role permission matching score does not. Therefore, the risk analysis report includes four dimension identifiers: "content sensitive," "receiver risk," "behavior abnormal," and "time abnormal."
[0128] In another specific example, after scenario sensitivity processing and dimension conflict resolution, the scores and corresponding preset alarm thresholds for each dimension of a certain file outgoing behavior are as follows: content semantic sensitivity score adjusted to 82 points, alarm threshold adjusted to 50 points; role permission matching score 40 points, alarm threshold 50 points; receiver relationship risk score 90 points, alarm threshold 60 points; abnormal behavior pattern score 30 points, alarm threshold 60 points; time context risk score 10 points, alarm threshold 60 points. Upon comparison, the content semantic sensitivity score exceeds the adjusted alarm threshold of 50 points and the receiver relationship risk score exceeds its alarm threshold of 60 points, while the scores of the other three dimensions do not exceed their respective alarm thresholds. Therefore, the risk analysis report includes two dimension identifiers: "content sensitive" and "receiver risk."
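Building the dimension identifier list can be sketched as follows. The identifier strings are the example names given in this embodiment; the dictionary keys and the function name `triggered_identifiers` are hypothetical, and a score equal to its threshold is treated as exceeding it, as in step S41.

```python
IDENTIFIERS = {
    "content": "content sensitive",
    "role_permission": "permission abnormal",
    "receiver": "receiver risk",
    "behavior": "behavior abnormal",
    "time": "time abnormal",
}


def triggered_identifiers(scores, thresholds):
    """Identifiers of dimensions whose score is >= its alarm threshold."""
    return [label for dim, label in IDENTIFIERS.items()
            if scores[dim] >= thresholds[dim]]


default_thresholds = {"content": 70, "role_permission": 50, "receiver": 60,
                      "behavior": 60, "time": 60}
scores_a = {"content": 95, "role_permission": 40, "receiver": 90,
            "behavior": 78, "time": 85}
print(triggered_identifiers(scores_a, default_thresholds))

# Second example: content score and threshold after scenario sensitivity
# processing (82 points against an adjusted threshold of 50).
adjusted_thresholds = dict(default_thresholds, content=50)
scores_b = {"content": 82, "role_permission": 40, "receiver": 90,
            "behavior": 30, "time": 10}
print(triggered_identifiers(scores_b, adjusted_thresholds))
```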
[0129] In addition to dimension identifiers, the risk analysis report also includes the following: a risk overview, summarizing the core reasons for the high-risk determination in one sentence, such as detecting alerts triggered by the content semantic sensitivity dimension and the recipient relationship risk dimension, indicating that the sender sent highly sensitive files to external competitors; a comprehensive risk score and the corresponding risk level; detailed scores for each dimension, including the specific score values for the five dimensions and the corresponding alert thresholds; sender information, including name, department, and employment status; recipient information, including name and organization type; file information, including file name, file type, and file size; and sending time information.
[0130] Warning messages are pushed through the following channels: displayed as highlighted red items on the front-end interface of the security management platform; sent to the security administrator's email address via the email server; and sent to the communication account of the security management person in charge via the enterprise instant messaging system's messaging interface. The warning message includes a warning title, comprehensive risk score, key dimension identifiers, sender and recipient information, and a link to view the risk analysis report. After clicking the link, the security administrator can view the complete risk analysis report on the security management platform and take actions such as marking false alarms, confirming risks, or freezing accounts.
[0131] In one specific embodiment, the risk analysis report further includes key risk factor annotations and processing suggestions. The key risk factor annotations are generated based on the feature dimension with the highest score among the five dimensions, and the processing suggestions are obtained by matching the feature dimension corresponding to the key risk factor annotations from a preset processing suggestion library.
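A minimal sketch of this lookup is shown below. The suggestion texts are invented placeholders, not entries from the actual preset library, and the names `SUGGESTION_LIBRARY` and `annotate_key_risk` are hypothetical. When two dimensions tie for the highest score, this sketch simply takes the first one in dictionary order, which the description does not specify.

```python
# Hypothetical processing suggestion library keyed by feature dimension.
SUGGESTION_LIBRARY = {
    "content": "Confirm the file classification with the data owner.",
    "role_permission": "Verify the sender's authorization for this file type.",
    "receiver": "Verify the receiver's identity and business relationship.",
    "behavior": "Compare the activity with the sender's recent workload.",
    "time": "Ask the sender to confirm the reason for off-hours sending.",
}


def annotate_key_risk(scores, library=SUGGESTION_LIBRARY):
    """Return (highest-scoring dimension, matched processing suggestion)."""
    key_dimension = max(scores, key=scores.get)
    return key_dimension, library[key_dimension]


scores = {"content": 95, "role_permission": 40, "receiver": 90,
          "behavior": 78, "time": 85}
print(annotate_key_risk(scores))
```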
[0132] In one specific embodiment, after generating the risk analysis report and pushing the early warning message, the method further includes: S80. Based on the comparison result between the comprehensive risk score and the preset first threshold and second threshold, the file outgoing behavior is classified into a first risk level, a second risk level, or a third risk level, wherein the first threshold is greater than the second threshold. For file outgoing behavior at the first risk level, a highest-priority alarm is triggered and an automatic interception process is executed. For file outgoing behavior at the second risk level, a medium-priority alarm is triggered and the behavior is marked as pending review. For file outgoing behavior at the third risk level, no alarm is triggered; the behavior is only recorded in the behavior log. The automatic interception process includes notifying the security administrator and the sender's department head, freezing the sender's account, and prohibiting the file entity from flowing to downstream systems.
[0133] In this embodiment, after generating a risk analysis report and pushing out an early warning message, a graded handling step is also performed. Based on the comprehensive risk score, the document outreach behavior is divided into different risk levels, and differentiated handling strategies are adopted.
[0134] First, a first threshold and a second threshold are preset, with the first threshold being greater than the second threshold. The first threshold is set to 85 points, and the second threshold is set to 70 points. The first threshold is used to distinguish between the first and second risk levels, and the second threshold is used to distinguish between the second and third risk levels.
[0135] The comparison between the comprehensive risk score and the two thresholds, and the rules for classifying the risk levels, are as follows: When the comprehensive risk score is greater than or equal to the first threshold of 85 points, the file outgoing behavior is classified as the first risk level. This level indicates that the behavior is highly likely to be a malicious leak or a serious violation, requiring immediate action at the highest level.
[0136] When the comprehensive risk score is greater than or equal to the second threshold of 70 points and less than the first threshold of 85 points, the file outgoing behavior is classified as the second risk level. This level indicates that there are certain risk concerns, requiring close attention and manual review.
[0137] When the comprehensive risk score is less than the second threshold of 70 points, the file outgoing behavior is classified as the third risk level. This level indicates normal business behavior and requires no manual intervention.
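The classification rule can be sketched as a simple threshold comparison. The function name `risk_level` is hypothetical; the default thresholds of 85 and 70 points are the values set in this embodiment.

```python
def risk_level(comprehensive_score, first_threshold=85, second_threshold=70):
    """Return 1, 2, or 3 for the first, second, or third risk level."""
    if first_threshold <= second_threshold:
        raise ValueError("first threshold must exceed second threshold")
    if comprehensive_score >= first_threshold:
        return 1
    if comprehensive_score >= second_threshold:
        return 2
    return 3


for score in (92, 77.7, 69.9):
    print(score, risk_level(score))
```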
[0138] Differentiated response strategies should be implemented for different risk levels.
[0139] For file outgoing behavior at the first risk level, a highest-priority alarm is triggered, and an automatic interception process is executed. The highest-priority alarm is displayed in red on the security management platform front end and pushed in real time via email and enterprise instant messaging to the security administrator and the sender's department head. The automatic interception process includes three specific operations: notifying the security administrator and the sender's department head, with the notification including the comprehensive risk score, key risk factor annotations, sender information, recipient information, file information, and sending time; freezing the sender's account, that is, suspending the sender's permission to send files outbound via enterprise instant messaging (the freezing operation is implemented by calling the enterprise instant messaging user management interface); and preventing the file entity from flowing to downstream systems, i.e., blocking the distribution and download of this outgoing file in the file transmission chain, ensuring that the file entity cannot be obtained by the recipient. At the same time, a manual review by the security administrator is mandatory, and the frozen status remains in effect until the review is completed.
[0140] For document outreach activities at the second risk level, a medium-priority alarm is triggered, and the document is marked as pending review. Medium-priority alarms are displayed in yellow on the security management platform front end and are periodically aggregated and pushed to the security administrator at regular intervals (e.g., hourly). Document outreach activities marked as pending review are recorded in the security audit log and included in the sender's risk behavior profile. Simultaneously, a self-check confirmation notification is sent to the sender, indicating that their document outreach activity has been deemed risky and requesting confirmation regarding whether it is a legitimate business requirement. The sender can perform a self-check by clicking the confirmation button in the notification, and the confirmation result is fed back to the security management platform.
[0141] For document outreach activities at the third risk level, no alerts are triggered, no risk analysis reports are generated, and only relevant information about the outreach activity is recorded in the behavior log. The behavior log includes sender information, recipient information, document information, timestamp, scores for each dimension, and a comprehensive risk score, used for behavior baseline modeling and post-event audit tracing. The document flows normally without any obstruction or interception.
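The differentiated handling can be summarized as a lookup from risk level to an ordered action plan. The action labels below are hypothetical shorthand for the operations described in the preceding paragraphs; a real system would bind each label to a call on the corresponding platform interface.

```python
# Hypothetical action labels summarizing the handling of each risk level.
HANDLING_PLAN = {
    1: (
        "raise_highest_priority_alarm",   # red item, pushed in real time
        "notify_admin_and_department_head",
        "freeze_sender_account",
        "block_file_downstream",
        "require_manual_review",
    ),
    2: (
        "raise_medium_priority_alarm",    # yellow item, aggregated push
        "mark_pending_review",
        "record_security_audit_log",
        "send_self_check_notification",
    ),
    3: (
        "record_behavior_log",            # no alarm, file flows normally
    ),
}


def plan_actions(level):
    """Return the ordered action labels for a risk level (1, 2, or 3)."""
    if level not in HANDLING_PLAN:
        raise ValueError(f"unknown risk level: {level}")
    return HANDLING_PLAN[level]


print(plan_actions(2))
```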
[0142] In a specific application example, the overall risk score for a certain file outreach activity was 77.7. Compared to the preset first and second thresholds, 77.7 is higher than the second threshold of 70 but lower than the first threshold of 85, thus classifying it as a second-risk level. A medium-priority alarm is triggered, pushed to the security management platform with a yellow indicator, and marked as pending review. A self-check confirmation notification is sent to the sender, and this action is recorded in the security audit log.
[0143] In another specific application example, the comprehensive risk score for a certain file outgoing behavior is 92 points. Compared with the preset first and second thresholds, 92 points is higher than the first threshold of 85 points, so the behavior is classified as the first risk level. This triggers the highest-priority red alarm, notifying the security administrator and the sender's department head. The automatic interception process is executed: the sender's account is frozen by calling the enterprise instant messaging user management interface to disable the sender's outbound file sending permission, and the file is prevented from flowing to downstream systems by blocking its distribution in the file transmission chain. Manual review is mandatory, and the frozen status remains in effect until the review is completed.
[0144] Through the aforementioned tiered handling mechanism, this invention matches file outgoing behavior at each risk level with a corresponding handling strategy: high-risk behavior is promptly blocked and handled, medium-risk behavior receives focused attention and manual review, and low-risk behavior does not interfere with normal business processes, thus achieving a balance between risk control and business efficiency.
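The tiered handling described above can be sketched as a simple threshold comparison. This is a minimal illustration, not the patented implementation: the threshold values 85 and 70 come from the application examples, while the names `RiskLevel`, `HANDLING`, and `classify` are hypothetical, and treating a score exactly equal to a threshold as the lower level is an assumption, since the text only describes scores "higher than" a threshold.

```python
from enum import Enum

# Threshold values taken from the application examples in the text.
FIRST_THRESHOLD = 85.0
SECOND_THRESHOLD = 70.0


class RiskLevel(Enum):
    FIRST = 1
    SECOND = 2
    THIRD = 3


# Handling summaries paraphrased from the description; the strings are illustrative.
HANDLING = {
    RiskLevel.FIRST: "highest-priority red alarm, automatic interception, mandatory manual review",
    RiskLevel.SECOND: "medium-priority yellow alarm, mark as pending review, sender self-check",
    RiskLevel.THIRD: "no alarm, record in behavior log only",
}


def classify(score: float,
             first: float = FIRST_THRESHOLD,
             second: float = SECOND_THRESHOLD) -> RiskLevel:
    """Map a comprehensive risk score to a risk level.

    Assumption: a score exactly equal to a threshold falls into the lower level.
    """
    if first <= second:
        raise ValueError("the first threshold must be greater than the second")
    if score > first:
        return RiskLevel.FIRST
    if score > second:
        return RiskLevel.SECOND
    return RiskLevel.THIRD


if __name__ == "__main__":
    for s in (77.7, 92.0, 40.0):
        level = classify(s)
        print(s, level.name, "->", HANDLING[level])
```

With these thresholds, the two application examples (77.7 and 92) fall into the second and first risk levels respectively, matching the text.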
[0145] For example, a risk analysis report may include the following: Risk Overview: It was detected that Zhang San, an employee of the Finance Department, sent three PDF financial statements containing the keyword "balance sheet" to Li Si, a contact of an external competitor, outside of working hours (2:30 a.m. on Saturday).
[0146] Overall risk score: 87.4 points, risk level: Level 1 risk.
[0147] Five-dimensional score details: (1) Content semantic sensitivity score: 97 points (the document content is classified as "financial statements - balance sheet", sensitivity level: high); (2) Role permission matching score: 55 points (the sender is a Finance Department employee who is entitled to send financial statements, but the recipient is an external competitor and this is the first time a highly sensitive document has been sent to an external party, so the degree of permission deviation is moderate); (3) Recipient relationship risk score: 100 points (the recipient is a member of an external competitor organization and has no history of business dealings); (4) Abnormal behavior pattern score: 100 points (15 files were sent out in a single day, which exceeds the baseline of 9, i.e., the historical average of 3 plus 3 times the standard deviation, and is 5 times the historical average); (5) Time context risk score: 94 points (the outgoing time is 2:30 a.m. on Saturday, a late-night period outside working hours).
[0148] Key risk factors highlighted: recipients are external competitors, abnormally high daily outbound volume, outbound sending late at night outside of working hours, and highly sensitive financial data.
[0149] Document content preview: Provides a screenshot of the first page of the PDF and a summary of key paragraphs.
[0150] Sender behavior profile: A total of 12 files were sent out in the past 30 days, with an average of 0.4 files sent out per day. The most common recipients of these files were colleagues within the finance department, and the most common time for sending them was from 10:00 to 16:00 on weekdays.
[0151] Recommendations: With a comprehensive risk score of 87.4, this falls into the highest risk category. It is recommended to immediately freeze the sender's account and initiate a security investigation. Notify the security administrator and the sender's department head, prohibit the physical transfer of the file to downstream systems, and mandate manual review.
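Item (4) of the example report compares the number of files sent in one day with a baseline of the historical average plus three standard deviations. The check can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and the standard deviation of 2 is inferred from the reported figures (average 3, baseline 9) rather than stated in the text.

```python
from typing import Optional


def anomaly_baseline(mean: float, std: float, k: float = 3.0) -> float:
    """Baseline above which a daily outgoing count is treated as anomalous."""
    return mean + k * std


def daily_count_deviation(count: int, mean: float, std: float, k: float = 3.0) -> dict:
    """Compare today's outgoing count with the sender's historical baseline."""
    baseline = anomaly_baseline(mean, std, k)
    times_average: Optional[float] = count / mean if mean > 0 else None
    return {
        "baseline": baseline,
        "is_anomalous": count > baseline,
        "times_average": times_average,
    }


if __name__ == "__main__":
    # Figures from the example report; std=2.0 is an inferred assumption.
    print(daily_count_deviation(count=15, mean=3.0, std=2.0))
```

How the degree of deviation is mapped to the 0 to 100 score is governed by the preset mapping relationship and is not reproduced here.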
[0152] In one embodiment, the present invention further includes an adaptive optimization step: the results of manual review and handling of alarm messages are regularly obtained from the security management platform, where each review result either confirms the risk or marks a false alarm. Based on these results, the scoring rules for the five dimensions, the preset alarm thresholds, and the conflict resolution rules are adjusted to improve the accuracy of the scoring calculation.
[0153] Specific optimization strategies include: if a certain type of file is frequently marked as a false alarm, then lower the content semantic sensitivity score threshold for that type of file; if a certain type of behavior is frequently identified as a real risk but the score is low, then increase the weight of the corresponding feature; and dynamically adjust the behavioral baseline model and permission rules for different departments and positions based on the actual business scenarios of the enterprise.
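The first optimization strategy can be sketched as a feedback loop over manual review results. The sketch below is only one possible reading: it interprets the adjustment as lowering the sensitivity score assigned to a file type that is frequently marked as a false alarm, and the parameters `rate_limit`, `min_reviews`, `step`, and `floor`, as well as all function names and file type labels, are hypothetical.

```python
from collections import defaultdict


def false_alarm_rates(reviews):
    """Return {file_type: (false_alarm_rate, review_count)}.

    `reviews` is an iterable of (file_type, is_false_alarm) pairs taken
    from manual review results.
    """
    tally = defaultdict(lambda: [0, 0])  # file_type -> [false alarms, total reviews]
    for file_type, is_false_alarm in reviews:
        tally[file_type][1] += 1
        if is_false_alarm:
            tally[file_type][0] += 1
    return {t: (fa / total, total) for t, (fa, total) in tally.items()}


def adjust_sensitivity_scores(scores, reviews, rate_limit=0.5,
                              min_reviews=5, step=5.0, floor=0.0):
    """Lower the score of file types that are frequently marked as false alarms.

    All tuning parameters are illustrative assumptions, not values from the text.
    """
    adjusted = dict(scores)
    for file_type, (rate, total) in false_alarm_rates(reviews).items():
        if file_type in adjusted and total >= min_reviews and rate > rate_limit:
            adjusted[file_type] = max(floor, adjusted[file_type] - step)
    return adjusted


if __name__ == "__main__":
    reviews = [("meeting_notes", True)] * 6 + [("balance_sheet", False)] * 6
    scores = {"meeting_notes": 60.0, "balance_sheet": 97.0}
    print(adjust_sensitivity_scores(scores, reviews))
```

Requiring a minimum number of reviews before adjusting is a design choice in this sketch that avoids reacting to a handful of review results.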
[0154] Compared with the prior art, the beneficial effects of the present invention are as follows. First, the receiver relationship dimension and the chat scenario dimension are treated as independent risk assessment dimensions; their risk scores are generated together with those of the sender role and permission dimension, the sender behavior pattern dimension, and the time context dimension, and are assessed jointly with the content semantic sensitivity score. This achieves integrated analysis from six dimensions: file content, sender permissions, receiver identity, conversation scenario, behavioral habits, and outgoing time. It solves the problem of single-dimension risk judgment in existing technologies and reduces false negatives caused by isolated assessments.
[0155] Next, by determining that a dimensional conflict exists when at least two dimensions' scores are below their preset alarm thresholds, and reducing the weighted sum score according to preset conflict resolution rules, the system achieves automatic identification and resolution of inconsistencies in multi-dimensional risk signals. When the file content is sensitive but the sender has normal permissions, the recipient is an internal employee, and the outgoing time is during working hours, the high score in the content dimension conflicts with the low scores in other dimensions. The conflict resolution rules can significantly reduce the overall risk score, effectively avoiding false alarms caused by anomalies in a single dimension, and solving the technical problems of lacking cross-validation between dimensions and high false alarm rates in existing technologies.
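The conflict resolution described above can be sketched as a weighted sum followed by an attenuation step. This is a minimal sketch under stated assumptions: the weights, per-dimension alarm thresholds, and attenuation coefficients below are hypothetical placeholders (the text only requires that they be preset), and the dimension identifiers are illustrative names.

```python
DIMENSIONS = ("content", "role_permission", "receiver_relationship",
              "behavior_pattern", "time_context")

# Hypothetical weights, chosen only so that they decrease in sequence.
WEIGHTS = {"content": 0.30, "role_permission": 0.25, "receiver_relationship": 0.20,
           "behavior_pattern": 0.15, "time_context": 0.10}

# Hypothetical per-dimension alarm thresholds.
THRESHOLDS = {d: 60.0 for d in DIMENSIONS}

# Hypothetical correspondence: number of dimensions exceeding their
# threshold -> attenuation coefficient.
ATTENUATION = {0: 0.5, 1: 0.6, 2: 0.75, 3: 0.9}


def weighted_sum(scores):
    return sum(WEIGHTS[d] * scores[d] for d in DIMENSIONS)


def comprehensive_score(scores):
    """Weighted sum, reduced when at least two dimensions are below threshold."""
    below = sum(1 for d in DIMENSIONS if scores[d] < THRESHOLDS[d])
    exceeding = sum(1 for d in DIMENSIONS if scores[d] > THRESHOLDS[d])
    total = weighted_sum(scores)
    if below >= 2:  # dimension conflict detected
        return total * ATTENUATION[exceeding]
    return total


if __name__ == "__main__":
    # Sensitive content, but normal permissions, internal recipient, working hours.
    scores = {"content": 97, "role_permission": 20, "receiver_relationship": 10,
              "behavior_pattern": 15, "time_context": 5}
    print(round(weighted_sum(scores), 2), round(comprehensive_score(scores), 2))
```

In the sample input only the content dimension exceeds its threshold, so the weighted sum is multiplied by the coefficient for one exceeding dimension and the comprehensive score drops accordingly.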
[0156] Finally, by including in the risk analysis report the identifiers of those of the five dimensions whose scores exceed their preset alarm thresholds, security administrators can directly identify the specific dimensions that led to the risk assessment. This represents a technical improvement from outputting a single score to outputting interpretable multidimensional analysis results. It solves the problems of existing technologies in which early warning information lacks risk cause analysis and security administrators must spend considerable time on manual retrospective investigation, thus improving the efficiency of security operations.
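The selection of dimension identifiers for the report can be sketched as a filter over the five scores. The scores below are those of the example report; the uniform alarm threshold of 60, the identifier names, and the report layout are assumptions made for illustration.

```python
def flagged_dimensions(scores, thresholds):
    """Identifiers of dimensions whose score exceeds its preset alarm threshold."""
    return [dim for dim, score in scores.items() if score > thresholds[dim]]


def build_report(scores, thresholds, comprehensive_score):
    """Assemble a minimal, illustrative risk analysis report structure."""
    return {
        "comprehensive_score": comprehensive_score,
        "dimension_scores": dict(scores),
        "flagged_dimensions": flagged_dimensions(scores, thresholds),
    }


if __name__ == "__main__":
    # Scores from the example report; the thresholds are hypothetical.
    scores = {"content": 97, "role_permission": 55, "receiver_relationship": 100,
              "behavior_pattern": 100, "time_context": 94}
    thresholds = {dim: 60 for dim in scores}
    print(build_report(scores, thresholds, 87.4)["flagged_dimensions"])
```

Under these assumed thresholds, the four flagged dimensions correspond to the key risk factors listed in the example report, while the role permission dimension is not flagged.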
[0157] Please refer to Figure 2. Based on the same inventive concept, and corresponding to the method of any of the above embodiments, this invention also discloses a file outgoing risk early warning system based on multimodal semantic analysis and behavioral modeling, comprising: a data acquisition module, used to acquire file transfer records and metadata of the file transfer records through the data interface of the enterprise instant messaging system, wherein the metadata includes sender information, receiver information, chat type, and file download address; a multimodal content analysis module, used to obtain the file entity based on the file download address, perform multimodal content analysis on the file entity, and determine the content semantic sensitivity score of the file entity; a behavioral feature extraction module, used to extract multi-dimensional risk features based on the metadata and determine risk scores for multiple behavioral dimensions, wherein the multiple behavioral dimensions include the sender role and permission dimension, the receiver relationship dimension, the chat scenario dimension, the sender behavior pattern dimension, and the time context dimension; the sender role and permission dimension is used to determine the role permission matching score, the receiver relationship dimension and the chat scenario dimension are used together to determine the receiver relationship risk score, the sender behavior pattern dimension is used to determine the behavior pattern anomaly score, and the time context dimension is used to determine the time context risk score; a dynamic risk scoring engine, used to compare the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the behavior pattern anomaly score, and the time context risk score with the preset alarm thresholds of the corresponding dimensions, and, when it detects that the scores of at least two dimensions are lower than the corresponding preset alarm thresholds, to determine that a dimension conflict exists and reduce the weighted sum score of the five dimensions according to the preset conflict resolution rules to obtain the comprehensive risk score; and an intelligent early warning and tracing module, used to generate a risk analysis report and push an early warning message when the comprehensive risk score exceeds the preset reporting threshold, wherein the risk analysis report includes the dimension identifiers of those of the five dimensions whose scores exceed their preset alarm thresholds.
[0158] The system described in the above embodiments is used to implement the corresponding document outbound risk warning method based on multimodal semantic analysis and behavior modeling in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0159] Based on the same inventive concept, corresponding to the methods of any of the above embodiments, this invention also discloses a terminal device, including: a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements the file outgoing risk warning method based on multimodal semantic analysis and behavior modeling as described in any of the above embodiments.
[0160] Specifically, the device includes a processor, memory, input / output interfaces, a communication interface, and a bus. The processor, memory, input / output interfaces, and communication interface are interconnected within the device via the bus.
[0161] The processor can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this specification.
[0162] The memory can be implemented in the form of ROM (Read Only Memory), RAM (Random Access Memory), static storage device, or dynamic storage device. The memory can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory and called and executed by the processor.
[0163] Input / output interfaces are used to connect input / output modules to enable information input and output.
[0164] The communication interface is used to connect the communication module to enable communication and interaction between this device and other devices. The communication module can communicate via wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
[0165] A bus is a pathway that transmits information between various components of a device, such as processors, memory, input / output interfaces, and communication interfaces.
[0166] The terminal devices described above are used to implement the corresponding methods in any of the foregoing embodiments and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0167] Based on the same inventive concept, corresponding to any of the above embodiments, this invention also discloses a non-transitory computer-readable storage medium that stores computer instructions for causing a computer to execute the above-described file outgoing risk warning method based on multimodal semantic analysis and behavioral modeling.
[0168] The computer-readable medium of this embodiment includes permanent and non-permanent, removable and non-removable media, and information storage can be implemented by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory, static random access memory, dynamic random access memory, other types of random access memory, read-only memory, electrically erasable programmable read-only memory, flash memory or other memory technologies, compact disc read-only memory, digital versatile disc or other optical storage, magnetic tape, disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
[0169] The computer instructions stored in the storage medium of the above embodiments are used to cause the computer to execute the file outgoing risk warning method based on multimodal semantic analysis and behavior modeling as described in any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which will not be repeated here.
[0170] The above description is merely a preferred embodiment of the present invention and the technical principles employed. The present invention is not limited to the specific embodiments described herein, and various obvious changes, readjustments, and substitutions that can be made by those skilled in the art will not depart from the scope of protection of the present invention. Therefore, although the present invention has been described in detail through the above embodiments, the present invention is not limited to the above embodiments, and may include more other equivalent embodiments without departing from the concept of the present invention, the scope of which is determined by the scope of the claims.
Claims
1. A method for early warning of file outgoing risks based on multimodal semantic analysis and behavioral modeling, characterized in that it comprises:
obtaining file transfer records and metadata of the file transfer records through the data interface of the enterprise instant messaging system, wherein the metadata includes sender information, receiver information, chat type, and file download address;
obtaining the file entity based on the file download address, and performing multimodal content analysis on the file entity to determine the content semantic sensitivity score of the file entity;
extracting multi-dimensional risk features based on the metadata to determine risk scores for multiple behavioral dimensions, wherein the multiple behavioral dimensions include a sender role and permission dimension, a receiver relationship dimension, a chat scenario dimension, a sender behavior pattern dimension, and a time context dimension; the sender role and permission dimension is used to determine the role permission matching score; the receiver relationship dimension and the chat scenario dimension are used together to determine the receiver relationship risk score; the sender behavior pattern dimension is used to determine the behavior pattern anomaly score; and the time context dimension is used to determine the time context risk score;
comparing the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the behavior pattern anomaly score, and the time context risk score with the preset alarm thresholds of the corresponding dimensions; when it is detected that the scores of at least two dimensions are lower than their preset alarm thresholds, determining that a dimension conflict exists, and reducing the weighted sum score of the five scores according to the preset conflict resolution rules to obtain the comprehensive risk score;
when the comprehensive risk score exceeds the preset reporting threshold, generating a risk analysis report and pushing an early warning message, wherein the risk analysis report includes the dimension identifiers of those of the five dimensions whose scores exceed their preset alarm thresholds.
2. The method according to claim 1, characterized in that reducing the weighted sum score of the five dimensions according to the preset conflict resolution rules includes: obtaining the number of dimensions, among the five dimensions, whose scores exceed their preset alarm thresholds; determining the target attenuation coefficient based on a preset correspondence between the number of dimensions and the attenuation coefficient; and multiplying the weighted sum score by the target attenuation coefficient to obtain the comprehensive risk score.
3. The method according to claim 1, characterized in that, Before comparing the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the abnormal behavior pattern score, and the time context risk score with the preset alarm thresholds for the corresponding dimensions, the method further includes: Based on the risk characteristics of the receiver relationship dimension and the risk characteristics of the chat scenario dimension, it is determined whether the current file external distribution behavior is in a high-risk receiving scenario; wherein, the high-risk receiving scenario includes the risk characteristics of the receiver relationship dimension reflecting that the receiver is an external competitor, or the receiver is an unknown external organization and is interacting for the first time, or the risk characteristics of the chat scenario dimension reflecting that the chat type is an external group chat containing external members; If it is determined that the reception scenario is high-risk, the content semantic sensitivity score is increased according to the preset correction rules, and the preset alarm threshold corresponding to the content semantic sensitivity score is decreased simultaneously.
4. The method according to claim 1, characterized in that, The step of performing multimodal content analysis on the file entity to determine the content semantic sensitivity score of the file entity includes: Determine the file type corresponding to the file entity, and call the corresponding multimodal analysis model to perform content semantic extraction and classification based on the file type; If the file type is an image file, an image recognition model and an optical character recognition model are called to extract the scene and text of the file entity, and the extracted text content is classified into text topics. If the file type is a document file, call the document parsing library to extract the structured text content, and classify the extracted text content by text topic; If the file type is a video file, extract the video keyframes of the file entity and call the speech recognition model to transcribe the audio into text, and perform content analysis on the extracted keyframe images and the transcribed text; If the file type is a compressed file, the file entity is decompressed, and the corresponding content analysis is recursively performed on each decompressed file; Based on the content semantic extraction and classification results, determine the content semantic classification tags of the file entities; Based on the preset first mapping relationship between sensitivity level and score, the content semantic sensitivity score is determined according to the content semantic classification label.
5. The method according to claim 1, characterized in that, The step of extracting multidimensional risk features based on the metadata to determine risk scores across multiple behavioral dimensions includes: Based on the metadata, the sender role permission features corresponding to the sender information, the receiver relationship features corresponding to the receiver information, and the chat scenario features corresponding to the chat type are extracted. Based on the sender information and timestamp statistical analysis, sender behavior pattern features and time context features are obtained. Based on a preset second mapping relationship between sender status, permission matching, and score, the role permission matching score is determined according to the sender role permission characteristics. Based on a preset third mapping relationship between receiver type, interaction relationship and score, the receiver relationship risk score is determined according to the receiver relationship characteristics and the chat scenario characteristics. Based on the preset fourth mapping relationship between the degree of behavioral deviation and the score, the abnormal score of the behavioral pattern is determined according to the characteristics of the sender's behavioral pattern. Based on the preset fifth mapping relationship between time attributes and scores, the time context risk score is determined according to the time context features.
6. The method according to claim 5, characterized in that, Based on the metadata, the sender's role and permission features, receiver's relationship features, and chat scenario features are extracted. Furthermore, based on the sender information and the timestamp, statistical analysis is performed to obtain sender behavior pattern features and time context features, including: Query the enterprise organizational structure database to obtain the sender's department, job level, job permissions, and employment status, which will serve as the sender's role and permission characteristics. The external contact database is queried and the number of message exchanges and file transfers between senders and receivers within a preset historical period is counted to obtain receiver type, organization type, and historical interaction frequency, which are used as receiver relationship characteristics. Based on the chat type, it is determined whether it is a one-on-one chat, an internal group chat, or an external group chat. If it is a group chat, the number of group members and the proportion of external members are also counted as features of the chat scenario. The sender is statistically analyzed for the total number of outgoing files, the average daily number of outgoing files, the distribution of outgoing time, the distribution of outgoing file types, and the distribution of outgoing recipients within the preset historical time period. The degree of deviation of the current outgoing behavior from the historical baseline is calculated and used as the sender's behavioral pattern characteristics. Based on the comparison between the current timestamp and the pre-configured working hours, holidays, and departure times, it is determined whether the outgoing time belongs to working hours, non-working hours, or a sensitive time period, which is used as the time context feature.
7. The method according to claim 1, characterized in that, in the weighted summation operation, the weights of the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the behavior pattern anomaly score, and the time context risk score decrease sequentially.
8. The method according to claim 1, characterized in that, after generating the risk analysis report and pushing the early warning message, the method further includes: based on the results of comparing the comprehensive risk score with the preset first threshold and second threshold, classifying the file outgoing behavior into a first risk level, a second risk level, or a third risk level, wherein the first threshold is greater than the second threshold; for file outgoing behavior at the first risk level, triggering a highest-priority alarm and executing an automatic interception process; for file outgoing behavior at the second risk level, triggering a medium-priority alarm and marking the file as pending review; and for file outgoing behavior at the third risk level, triggering no alarm and only recording the behavior in the behavior log.
9. A file outgoing risk early warning system based on multimodal semantic analysis and behavioral modeling, characterized in that it comprises:
a data acquisition module, used to acquire file transfer records and metadata of the file transfer records through the data interface of the enterprise instant messaging system, wherein the metadata includes sender information, receiver information, chat type, and file download address;
a multimodal content analysis module, used to obtain the file entity based on the file download address, perform multimodal content analysis on the file entity, and determine the content semantic sensitivity score of the file entity;
a behavioral feature extraction module, used to extract multi-dimensional risk features based on the metadata and determine risk scores for multiple behavioral dimensions, wherein the multiple behavioral dimensions include a sender role and permission dimension, a receiver relationship dimension, a chat scenario dimension, a sender behavior pattern dimension, and a time context dimension; the sender role and permission dimension is used to determine the role permission matching score; the receiver relationship dimension and the chat scenario dimension are used together to determine the receiver relationship risk score; the sender behavior pattern dimension is used to determine the behavior pattern anomaly score; and the time context dimension is used to determine the time context risk score;
a dynamic risk scoring engine, used to compare the content semantic sensitivity score, the role permission matching score, the receiver relationship risk score, the behavior pattern anomaly score, and the time context risk score with the preset alarm thresholds of the corresponding dimensions, and, when it detects that the scores of at least two dimensions are lower than the corresponding preset alarm thresholds, to determine that a dimension conflict exists and reduce the weighted sum score of the five dimensions according to the preset conflict resolution rules to obtain the comprehensive risk score;
an intelligent early warning and tracing module, used to generate a risk analysis report and push an early warning message when the comprehensive risk score exceeds the preset reporting threshold, wherein the risk analysis report includes the dimension identifiers of those of the five dimensions whose scores exceed their preset alarm thresholds.
10. A terminal device, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein the processor, when executing the computer program, implements the file outbound risk warning method based on multimodal semantic analysis and behavioral modeling as described in any one of claims 1 to 8.