A document similarity detection method based on multi-modal content fusion

By employing a multimodal content fusion-based document similarity detection method, which utilizes dynamic fusion weight adjustment of hash fingerprints and semantic feature vectors, the challenges of resource exhaustion and attack identification in high-concurrency scenarios are solved, achieving high-throughput and high-availability document similarity detection.

CN121882018B · Active · Publication Date: 2026-06-16 · CHENGDU YOUA NETWORK TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHENGDU YOUA NETWORK TECH CO LTD
Filing Date: 2026-03-19
Publication Date: 2026-06-16

Abstract

The present application relates to the field of data processing and information security technology, specifically to a document similarity detection method based on multi-modal content fusion, comprising: a multi-modal feature extraction step: parsing the target document into text, image and video data, extracting a deterministic hash fingerprint and constructing a semantic feature vector; a difference entropy value calculation step: calculating the hash similarity, calculating the semantic similarity when the hash similarity is lower than the threshold, and mapping the difference between the two to a modal decision confidence entropy; a game strategy solving step: collecting system computing resource load data, combining the confidence entropy to solve the computing cost perception factor using a dynamic game strategy model; a dynamic fusion and survival step: assigning asymmetric fusion weights based on the perception factor to generate a detection result; when the load is over the limit and the confidence entropy indicates high risk, triggering a degradation survival mechanism; the present application can automatically balance the recall rate and throughput in a resource-limited scenario, preventing single-point attacks from causing system avalanches.

Description

Technical Field

[0001] This invention relates to the fields of data processing and information security technology, specifically to a document similarity detection method based on multimodal content fusion.

Background Technology

[0002] In the application scenarios of internet content security governance, compliant platforms rely on accurate content identification mechanisms to ensure ecological health and copyright protection, while regulatory systems usually need to combine multimodal data streams such as text, images and videos to perceive and intercept illegal or infringing documents in real time.

[0003] For document similarity detection, existing solutions generally adopt a full-scale deep semantic analysis architecture: a pre-trained deep neural network model encodes the features of every modality in all uploaded data, and the distance between high-dimensional semantic vectors is used directly as the criterion for identifying variant content. Although this approach achieves high recognition accuracy in low-traffic or non-adversarial environments, its over-reliance on computationally expensive deep inference and its lack of dynamic awareness of system resource load make it highly susceptible to server resource exhaustion and sharply increased response latency when it encounters high-concurrency traffic bursts or generative adversarial cleaning attacks. Furthermore, existing technologies struggle to quantify the logical conflicts between surface features and deep semantics in documents, making it difficult to balance computational cost against detection strength when facing sophisticated plagiarism attacks. This hinders the platform's ability to maintain high throughput and high availability under extreme load. Therefore, establishing a dynamic game mechanism with computational cost awareness, one that can identify adversarial attack risks while adaptively adjusting the fusion weights and computational strategies of multimodal features based on system load, and thereby balance recall and service availability in resource-constrained scenarios, has become a pressing technical problem.

Summary of the Invention

[0004] To address the aforementioned technical problems, this invention provides a document similarity detection method based on multimodal content fusion. Specifically, the technical solution of this invention includes:

[0005] The target document to be detected is obtained, and the target document is parsed into multiple modal data such as text, image and video. The deterministic hash fingerprint of each modal data is extracted and mapped to a high-dimensional semantic space to construct the semantic feature vector of each modal data.

[0006] Access the pre-defined compliance feature database and calculate the hash similarity between the deterministic hash fingerprint and the samples in the database;

[0007] When the hash similarity is lower than the preset filtering threshold, deep analysis is initiated to calculate the semantic similarity between the semantic feature vector and the samples in the database, and to calculate the numerical difference between the hash similarity and the semantic similarity. This numerical difference is then mapped to the modal decision confidence entropy, which characterizes the degree of conflict between modalities. The current system computing resource load data is collected, including the frequency of processing requests and the single-task processing latency.

[0008] By combining the confidence entropy of modal decision-making and utilizing a dynamic game strategy model that includes the mapping relationship between resource load and computational cost, the computational cost perception factor for the current target document is calculated.

[0009] Based on the computational cost-aware factor, asymmetric fusion weights are assigned to the semantic similarity and hash similarity of each modality data, and weighted fusion calculation is performed to generate the final similarity detection result of the target document.

[0010] When the system's computational resource load exceeds the preset circuit breaker threshold and the confidence entropy indicates a high risk, the degradation survival mechanism is triggered by adjusting the asymmetric fusion weights, and the final similarity detection result and corresponding handling instructions are output.

[0011] Preferably, in step one, the unstructured data stream uploaded by the user is received, and a multimodal parser is used to perform track splitting to separate the text stream, image frame sequence, and video keyframe sequence. A secure hash algorithm is used to generate a text fingerprint for the separated text stream, a perceptual hash algorithm is used to generate an image fingerprint for the image frame sequence, and a sequence fingerprint is generated for the video keyframe sequence, collectively referred to as a deterministic hash fingerprint. A pre-trained deep neural network model is used to perform feature encoding on the text stream, image frame sequence, and video keyframe sequence respectively to generate text semantic vectors, image semantic vectors, and video semantic vectors.

[0012] Preferably, in step two, the deterministic hash fingerprint is compared with the blacklist fingerprint in the compliance feature database using Hamming distance, and the Hamming distance is converted into a hash similarity of a normalized interval. If the hash similarity indicates no direct match, the cosine similarity between the text semantic vector, image semantic vector, and video semantic vector and the corresponding vector in the compliance feature database is calculated to obtain the semantic similarity. The absolute value of the difference between the semantic similarity and the hash similarity is calculated and normalized to obtain the normalized difference. The normalized difference is used as an uncertainty measure to generate the confidence entropy of modal decision-making. Specifically, when the semantic similarity value is greater than the hash similarity value and the difference exceeds a preset range, the confidence entropy shows a high value, indicating the risk of adversarial cleaning attacks.

[0013] Preferably, in step three, the system's queries per second and average response time are monitored in real time and used to calculate the resource utilization rate; the resource utilization rate and the modal decision confidence entropy are input into the dynamic game strategy model; if the resource utilization rate is in the low-load range, the computational cost perception factor is set to a recall-oriented base value, so that the weights of high-computational-cost modalities remain normal; if the resource utilization rate is in the high-load range, the computational cost perception factor is dynamically adjusted according to the value of the confidence entropy: the higher the confidence entropy, the greater the penalty that the factor imposes on modalities with high computing power consumption.

[0014] Preferably, in step four, a computational cost coefficient is defined for each modality, wherein the computational cost coefficient for the video modality is higher than that for the text modality; the weight parameters in the multimodal fusion formula are modified by combining the computational cost perception factor and the computational cost coefficient to generate asymmetric fusion weights; in normal mode, the asymmetric fusion weights focus on modalities with high semantic richness; in degraded survival mode, the hash similarity weights of modalities with low computational cost coefficients are forcibly increased, while the semantic similarity weights of modalities with high computational cost coefficients are decreased.

[0015] Preferably, step four further includes: monitoring system computing resource load data; when the resource occupancy rate exceeds the preset circuit breaker threshold and the confidence entropy of the modal decision exceeds the preset attack judgment threshold, the system is judged to enter the failure boundary state; in response to the failure boundary state, the dynamic circuit breaker mechanism is activated, the semantic computing weight of the video modality is directly reset to zero, and the final similarity detection result is generated only based on the hash similarity between the text modality and the image modality.

[0016] Preferably, the dynamic game strategy model sets a first threshold and a second threshold, with the first threshold being less than the second threshold. The model executes the following logic: when the resource utilization rate is less than or equal to the first threshold, the first strategy is output to maintain full-scale deep semantic computation for all modalities; when the resource utilization rate is greater than the first threshold but less than the second threshold, the second strategy is output to linearly reduce the weight of the video modality based on the computational cost awareness factor; when the resource utilization rate is greater than or equal to the second threshold and the confidence entropy of the modality decision is greater than the preset attack judgment threshold, the third strategy is output to execute a dynamic circuit breaker mechanism, sacrificing the accuracy of a single detection to restore the system throughput.
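The three-strategy logic of paragraph [0016] can be sketched as a small selection function. This is only an illustrative sketch: the names `Strategy` and `select_strategy` are hypothetical, the default thresholds 0.75 and 0.98 borrow the example load ranges given later in Example 4, and the attack judgment threshold of 0.8 is an assumed placeholder. The text does not say what happens when the load reaches the second threshold but the confidence entropy stays at or below the attack threshold; the sketch assumes the second strategy applies in that case.

```python
from enum import Enum


class Strategy(Enum):
    FULL_SEMANTIC = 1   # first strategy: full-scale deep semantic computation
    REDUCE_VIDEO = 2    # second strategy: linearly reduce the video weight
    CIRCUIT_BREAK = 3   # third strategy: dynamic circuit breaker


def select_strategy(utilization, entropy,
                    first_threshold=0.75, second_threshold=0.98,
                    attack_threshold=0.8):
    """Pick a strategy from resource utilization and confidence entropy.

    All default threshold values are assumptions made for illustration.
    """
    if not first_threshold < second_threshold:
        raise ValueError("first_threshold must be less than second_threshold")
    if utilization <= first_threshold:
        return Strategy.FULL_SEMANTIC
    if utilization >= second_threshold and entropy > attack_threshold:
        return Strategy.CIRCUIT_BREAK
    # Middle band, plus the case the text leaves open (assumed here):
    # high load without a confidence entropy above the attack threshold.
    return Strategy.REDUCE_VIDEO
```

For example, `select_strategy(0.99, 0.95)` returns `Strategy.CIRCUIT_BREAK`, while `select_strategy(0.85, 0.95)` stays on `Strategy.REDUCE_VIDEO` because the load is still below the second threshold.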

[0017] Preferably, step four further includes: comparing the generated final similarity detection result with a preset compliance judgment threshold; if the final similarity detection result is higher than the compliance judgment threshold, generating an interception instruction and marking the target document as non-compliant; if the final similarity detection result is lower than the compliance judgment threshold, generating a release instruction, but the target document that is released under the third strategy is marked as pending secondary verification and enters the asynchronous review queue after the system load decreases.
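The handling logic of paragraph [0017] might look like the following sketch. The `Disposition` record and the `dispose` function are hypothetical names. The text only covers results higher or lower than the compliance judgment threshold, so treating an exactly equal result as a release is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Disposition:
    instruction: str                      # "intercept" or "release"
    non_compliant: bool                   # document marked as non-compliant
    pending_secondary_verification: bool  # queued for asynchronous review


def dispose(score, compliance_threshold, under_third_strategy):
    """Map a final similarity result to a handling instruction.

    under_third_strategy: True when the result was produced while the
    circuit-breaker (third) strategy was active.
    """
    if score > compliance_threshold:
        return Disposition("intercept", True, False)
    # Released documents are flagged for later review only when they were
    # scored in degraded mode (equality handled as release: an assumption).
    return Disposition("release", False, under_third_strategy)
```

A document released under the third strategy carries `pending_secondary_verification=True`, which is the flag an asynchronous review worker would consume once the load drops.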

[0018] Preferably, the computational cost-aware factor is a dynamically adjusted variable whose value is determined by a calculation function configured such that the value of the variable increases with the current query rate per second of the system and decreases with the increase of the remaining available memory of the system. When calculating the asymmetric fusion weight, the computational cost-aware factor is used as the power parameter of the exponential operation to nonlinearly scale the distance calculation results in the high-dimensional semantic space, thereby accelerating the calculation process when resources are limited.

[0019] Preferably, the confidence entropy of modal decision is used to quantify the uncertainty of the system regarding the current detection result; when the system is under high load and the confidence entropy exceeds the preset uncertainty threshold, making it impossible to make a clear judgment within the preset delay requirement, the system executes a conservative strategy, prioritizes service availability, and terminates the comparison process by increasing the decision weight of the deterministic hash fingerprint.

[0020] Compared with the prior art, the present invention has the following beneficial effects:

[0021] 1. This invention introduces a dynamic game strategy model based on computational cost awareness, which effectively solves the contradiction between the rigid computing power requirements of deep semantic computing and the limited system resources in high-concurrency scenarios. Unlike traditional solutions that rely entirely on expensive deep neural network inference, this solution calculates the computational cost awareness factor by collecting system load data in real time and combining it with modal decision confidence entropy, and then dynamically adjusts the fusion weight of multimodal data. This mechanism can automatically find the optimal solution between pursuing recall and ensuring throughput in extreme scenarios where system resources are extremely limited, preventing system avalanche caused by the exhaustion of computing power at a single point.

[0022] 2. This invention constructs a confidence entropy measurement mechanism based on hash and semantic differences, achieving for the first time quantitative perception and accurate identification of generative adversarial cleaning attacks. By calculating the numerical difference in similarity between deterministic hash fingerprints and high-dimensional semantic feature vectors, and mapping this difference to an entropy value representing the degree of logical paradox, the system can capture plagiarized documents that have modified surface features but retain core content. This method not only solves the problem that traditional technologies have difficulty quantifying the conflict between the surface and deep logic of documents, but also provides an accurate risk quantification basis for subsequent resource scheduling, effectively identifying variant infringing content that differs in form but is similar in substance.

[0023] 3. This invention implements a multi-level degradation survival mechanism and an asymmetric weight fusion strategy, which significantly improves the service availability and resilience of the compliance platform when it suffers large-scale attacks. Based on the computational cost coefficient of the modality, when the system detects that the resource utilization rate exceeds the circuit breaker threshold and there is a high risk, it automatically triggers a graded response. By linearly reducing or even forcibly truncating the computational weight of high-energy-consuming modalities, it turns to rely on low-cost text or hash features. This drastic survival strategy, which sacrifices the accuracy of a single detection in exchange for the recovery of the overall system throughput, ensures that the core detection service does not become paralyzed under overload conditions.

[0024] 4. This invention establishes an asynchronous closed-loop management system that allows for initial release followed by review, ensuring a smooth user upload experience while providing a controllable safety net against compliance risks. For target documents released under degradation policies or high-load circuit breaker conditions, the system marks them as requiring secondary verification and pushes them into the asynchronous review queue, utilizing idle computing power during off-peak periods for full-scale deep semantic recalculation. This mechanism mitigates the risk of missed checks that may occur under degradation modes, avoiding blocking normal user requests during peak attack periods and maintaining a long-term content security level for the platform through post-incident accountability mechanisms.

Attached Figure Description

[0025] The present invention will be further explained below with reference to the accompanying drawings and embodiments:

[0026] Figure 1 is a flowchart of the method of the present invention.

Detailed Implementation

[0027] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to specific embodiments.

[0028] Example 1:

[0029] Referring to Figure 1, a document similarity detection method based on multimodal content fusion comprises the following specific steps:

[0030] Step 1: Obtain the target document to be detected, parse the target document into multiple modal data such as text, image and video, extract the deterministic hash fingerprint of each modal data, map it to a high-dimensional semantic space, and construct the semantic feature vector of each modal data;

[0031] Step 2: Access the preset compliance feature database and calculate the hash similarity between the deterministic hash fingerprint and the samples in the database. When the hash similarity is lower than the preset filtering threshold, start deep analysis, calculate the semantic similarity between the semantic feature vector and the samples in the database, and calculate the numerical difference between the hash similarity and the semantic similarity. Map this numerical difference to the modal decision confidence entropy, which represents the degree of conflict between modalities.

[0032] Step 3: Collect current system computing resource load data, including processing request frequency and single task processing latency; combine the confidence entropy of modal decision-making, and use a dynamic game strategy model that includes the mapping relationship between resource load and computing cost to calculate the computing cost perception factor for the current target document.

[0033] Step 4: Based on the computational cost-aware factor, assign asymmetric fusion weights to the semantic similarity and hash similarity of each modality data, perform weighted fusion calculation, and generate the final similarity detection result of the target document; when the system's computational resource load exceeds the preset circuit breaker threshold and the confidence entropy indicates high risk, trigger the degradation survival mechanism by adjusting the asymmetric fusion weights, and output the final similarity detection result and the corresponding disposal instructions.

[0034] This embodiment details the overall execution logic of a document similarity detection method based on multimodal content fusion. This method aims to resolve the contradiction between the rigid computing power requirements of deep semantic computing and the limited resources of the system in high-concurrency scenarios. The system obtains the target document to be detected through a multimodal parsing module, parses the document in parallel into three independent data streams (text, image, and video), and performs two-layer feature extraction for each data stream: extracting a deterministic hash fingerprint for fast anchoring, and using a deep neural network to map to a high-dimensional semantic space to construct a semantic feature vector representing the main idea of the content.

[0035] The system accesses a pre-defined compliance feature database and prioritizes calculating the similarity of deterministic hash fingerprints. If the hash similarity is below a pre-defined filtering threshold (e.g., 0.95), the system determines that no direct copying has occurred and initiates a deep analysis process to calculate the similarity of semantic feature vectors. Based on this, the system calculates the numerical difference between hash similarity and semantic similarity and maps this difference to modal decision confidence entropy. This entropy quantifies the degree of logical paradox between document surface features and deep semantics. The system collects real-time computing resource load data through resource monitoring probes, treating system load as the defender and the potential risks represented by confidence entropy as the attacker, and uses a dynamic game strategy model to calculate the computing cost perception factor. Based on this factor, the system assigns asymmetric fusion weights to each modality, generating the final similarity detection result. If the system resource load exceeds the circuit breaker threshold and the confidence entropy indicates high risk, the system triggers a degradation survival mechanism, forcibly suppressing the weights of high-energy-consuming modalities and outputting a handling instruction containing either downgraded release or delayed blocking.

[0036] This embodiment achieves, for the first time, quantitative awareness of generative adversarial cleaning attacks by constructing an entropy metric for hash-semantic differences. Combined with the computational cost perception factor, it can automatically balance recall against throughput in extreme scenarios where system resources are severely limited, such as 98% load. This prevents a system avalanche caused by single-point attacks and preserves the platform's service availability while it is under attack.

[0037] Example 2:

[0038] Step one includes:

[0039] S11. Receive the unstructured data stream uploaded by the user, and use a multimodal parser to perform track splitting processing to separate the text stream, image frame sequence, and video keyframe sequence;

[0040] S12. A secure hash algorithm is used to generate text fingerprints for the separated text stream, a perceptual hash algorithm is used to generate image fingerprints for the image frame sequence, and a sequence fingerprint is generated for the video keyframe sequence. These are collectively referred to as deterministic hash fingerprints.

[0041] S13. Using a pre-trained deep neural network model, feature encoding is performed on the text stream, image frame sequence, and video keyframe sequence to generate text semantic vectors, image semantic vectors, and video semantic vectors, respectively.

[0042] This embodiment is a further specification of the multimodal feature extraction steps in Embodiment 1, aiming to achieve refined parsing of unstructured data; the system receives unstructured data streams uploaded by users, and uses a multimodal parser to perform file header parsing and streaming separation technology to accurately separate the mixed data stream into a text stream containing plain text content and metadata, an image frame sequence extracted from document or video screenshots, and a video keyframe sequence extracted using the inter-frame difference method;

[0043] For each separated data stream, the system employs a hierarchical hashing strategy to generate deterministic hash fingerprints. For the text stream, a secure hashing algorithm is used, such as... To ensure local sensitivity, a perceptual hashing algorithm is used for the image frame sequence: low-frequency features are extracted using the DCT transform to ensure robustness to scaling and rotation. The fingerprints of the video keyframes are combined in timestamp order to form a sequence fingerprint. The system then uses pre-trained deep neural network models, including BERT for text, [other models] for images, and a Transformer for video sequences, to encode features of the three types of data, generating text semantic vectors, image semantic vectors, and video semantic vectors located in a shared high-dimensional semantic space. To obtain this shared space, the system cascades fully connected mapping layers after the output layer of each pre-trained model, uses a triplet loss function for joint fine-tuning so that original features of different dimensions are mapped into a metric space of the same dimension, such as 512 dimensions, and performs L2 normalization.

[0044] This embodiment, through track-based processing and dual feature extraction (hash plus semantics), retains the microsecond-level fast matching capability of traditional hash algorithms while introducing the semantic understanding capability of deep learning, providing a complete and multi-dimensional data foundation for subsequent identification of variant infringing documents that differ in form but are similar in substance.
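The deterministic fingerprint layer of S12 can be illustrated with the standard library alone. The sketch below makes several assumptions that are not in the text: SHA-256 stands in for the unnamed secure hash algorithm, and a simple average hash over a small grayscale grid is swapped in for the DCT-based perceptual hash, purely to keep the example self-contained. The function names are hypothetical, and real frames would first be decoded and downscaled by an imaging library.

```python
import hashlib


def text_fingerprint(text):
    """SHA-256 hex digest of the UTF-8 text (assumed choice of secure hash)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def average_hash(pixels):
    """Average hash of a small grayscale grid (rows of 0-255 integers).

    Each bit is 1 when the pixel is brighter than the grid mean. This is a
    simpler stand-in for the DCT-based perceptual hash named in the text.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def sequence_fingerprint(keyframes):
    """Tuple of per-frame hashes ordered by timestamp.

    keyframes: iterable of (timestamp, pixels) pairs.
    """
    ordered = sorted(keyframes, key=lambda frame: frame[0])
    return tuple(average_hash(pixels) for _, pixels in ordered)
```

Because the average hash compares each pixel with the grid mean, `average_hash([[0, 255], [255, 0]])` and `average_hash([[10, 200], [200, 10]])` yield the same fingerprint. That tolerance to small visual changes is what a perceptual hash is meant to provide, in contrast to the exact-match behavior of the text digest.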

[0045] Example 3:

[0046] Step two includes:

[0047] S21. Compare the deterministic hash fingerprint with the blacklist fingerprints in the compliance feature database using Hamming distance, and convert the Hamming distance into a hash similarity within a normalized interval.

[0048] S22. If the hash similarity indicates no direct match, calculate the cosine similarity between the text semantic vector, image semantic vector, and video semantic vector and the corresponding vector in the compliance feature database to obtain the semantic similarity.

[0049] S23. Calculate the absolute value of the difference between semantic similarity and hash similarity, and then normalize it to obtain the normalized difference.

[0050] S24. Use the normalized difference as a measure of the degree of conflict between modal decisions to generate the modal decision confidence entropy, configured so that when the semantic similarity is greater than the hash similarity and the difference exceeds the preset range, the confidence entropy takes a high value, indicating a risk of an adversarial cleaning attack based on logical paradox.
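The two similarity measures in S21 and S22 can be sketched as follows. The text does not state how the Hamming distance is normalized, so the conversion `1 - distance / bits` is an assumption, as are the function names and the choice to represent fingerprints as integers.

```python
import math


def hamming_similarity(fingerprint_a, fingerprint_b, bits):
    """Similarity in [0, 1] from the Hamming distance of two bit fingerprints.

    fingerprint_a, fingerprint_b: non-negative integers holding the bits.
    bits: fingerprint length; dividing by it is the assumed normalization.
    """
    distance = bin(fingerprint_a ^ fingerprint_b).count("1")
    return 1.0 - distance / bits


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

If the semantic vectors are already L2-normalized, the cosine similarity reduces to a plain dot product, which is cheaper when comparing against a large database.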

[0051] This embodiment further specifies the confidence entropy calculation logic in Embodiment 1, aiming to quantify the adversarial risk of documents. The system compares the deterministic hash fingerprint with the blacklist fingerprint in the compliance feature database using Hamming distance, and converts the result into a hash similarity of a normalized interval. In response to a hash similarity lower than the direct matching threshold, such as 0.95, the system further calculates the cosine similarity of each modality's semantic vector to obtain semantic similarity. The system calculates the absolute value of the difference between the semantic similarity and the hash similarity, and generates the modality decision confidence entropy based on this difference. The calculation formula is as follows:

[0052] $H = \dfrac{1}{1 + e^{-k\,(|\Delta S| - \theta)}}$, with $|\Delta S| = |S_{sem} - S_{hash}|$ (the formula itself is missing from this text; this expression and its symbol names are reconstructed from the variable definitions that follow)

[0053] where $H$ is the modal decision confidence entropy, obtained from the system calculation; it physically represents the degree of logical paradox between surface features and deep semantics in the current document, and is dimensionless;

[0054] $e$ is the natural constant, also known as Euler's number, with a value of approximately 2.71828; in the formula it serves as the base of the natural exponential function and is used to construct the nonlinear activation function;

[0055] $|\Delta S|$ is the absolute value of the similarity difference, obtained by subtracting the hash similarity from the semantic similarity; its physical meaning is the inconsistency between modalities, and it is dimensionless;

[0056] $k$ is the entropy gain coefficient, a preset constant such as 10; its physical meaning is to adjust the steepness of the function, and it is dimensionless;

[0057] $\theta$ is the difference-sensitive threshold, derived from historical statistical data of adversarial examples, for example 0.3; its physical meaning is the baseline for triggering a high-risk judgment, and it is dimensionless. When the semantic similarity is significantly greater than the hash similarity and the difference exceeds $\theta$, the formula outputs a value close to 1, indicating a risk of adversarial cleaning attacks;

[0058] This embodiment mathematically transforms the semantic-hash difference into confidence entropy, which can accurately identify plagiarized documents that have modified hash features but retained core content. The higher the confidence entropy, the more suspicious the current document is, thus providing a quantitative basis for subsequent resource scheduling.
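As a rough illustration of this mapping, the sketch below assumes a logistic (sigmoid) form, which matches the variable descriptions: an exponential-based nonlinear activation, a gain coefficient such as 10 that controls steepness, and a difference-sensitive threshold such as 0.3 above which the output approaches 1. The exact expression is an assumption, as is the function name.

```python
import math


def confidence_entropy(semantic_similarity, hash_similarity,
                       gain=10.0, threshold=0.3):
    """Map the semantic/hash similarity gap to a value in (0, 1).

    Assumed logistic form: 1 / (1 + exp(-gain * (|gap| - threshold))).
    gain and threshold default to the example values 10 and 0.3.
    """
    gap = abs(semantic_similarity - hash_similarity)
    return 1.0 / (1.0 + math.exp(-gain * (gap - threshold)))


# A document whose hash barely matches but whose semantics match strongly.
suspicious = confidence_entropy(semantic_similarity=0.9, hash_similarity=0.3)
# A document on which both signals agree.
consistent = confidence_entropy(semantic_similarity=0.5, hash_similarity=0.5)
```

Under these assumptions a gap of 0.6 gives a value of about 0.95, while a gap of 0 gives about 0.05, so the output separates conflicting and consistent documents sharply around the threshold.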

[0059] Example 4:

[0060] Step three includes:

[0061] S31. Monitor the system's queries per second (QPS) and average response time (RT) in real time, and use them to calculate the resource utilization rate.

[0062] S32. Input the resource utilization rate and the confidence entropy of modal decision into the dynamic game strategy model;

[0063] S33. If the resource utilization rate is in the low load range, set the computing cost perception factor to a base value guided by recall rate, so that the modal weight of high computing cost remains normal.

[0064] S34. If the resource utilization rate is in the high load range, the computation cost perception factor is dynamically adjusted according to the value of the confidence entropy; the higher the confidence entropy, the greater the penalty of the computation cost perception factor on the high computing power consumption mode.

[0065] This embodiment further specifies the dynamic game strategy of step three in Embodiment 1, aiming to establish a dynamic balance between computing power and risk. The system monitors queries per second (QPS) and average response time (RT) in real time through monitoring components and calculates the resource utilization rate. Specifically, the current QPS and RT are normalized, with QPS serving as an indicator of CPU computing pressure and RT as an indicator of memory residency pressure, and a weighted sum is then taken according to a preset weighting ratio, such as 0.6 for QPS and 0.4 for RT, to keep load metrics consistent across different implementations. The system then feeds this resource utilization rate and the modal decision confidence entropy together into the dynamic game strategy model. When the resource utilization rate is low, such as less than 75%, the model determines that the system has spare capacity and sets the computational cost perception factor to a base value, such as 1.0, so that high-computational-cost modalities keep their normal weights and recall is maximized.
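The weighted load metric described here can be sketched as below. The text does not say how QPS and RT are normalized, so dividing by a capacity figure and a latency limit and clamping to [0, 1] is an assumption. The parameter names `qps_capacity` and `rt_limit_ms` are hypothetical, and only the 0.6 and 0.4 weights come from the text.

```python
def resource_utilization(qps, rt_ms, qps_capacity, rt_limit_ms,
                         qps_weight=0.6, rt_weight=0.4):
    """Weighted sum of normalized QPS and response time, in [0, 1].

    qps_capacity and rt_limit_ms are assumed normalization references,
    for instance a benchmarked peak QPS and a latency budget.
    """
    if qps_capacity <= 0 or rt_limit_ms <= 0:
        raise ValueError("normalization references must be positive")
    qps_norm = min(max(qps / qps_capacity, 0.0), 1.0)
    rt_norm = min(max(rt_ms / rt_limit_ms, 0.0), 1.0)
    return qps_weight * qps_norm + rt_weight * rt_norm
```

With a capacity of 1000 QPS and a 400 ms limit, a reading of 500 QPS and 200 ms yields a utilization of 0.5, which falls in the low-load range of the example thresholds.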

[0066] Furthermore, when the resource utilization rate lies in the transitional range between low and high load, such as [75%, 98%], the model executes a linear prediction interpolation buffering strategy to avoid the numerical jump that simple switching would cause. Specifically, the system pre-calculates the theoretical perception factor value at the starting point of the high-load range, i.e., at 98%, and then calculates the factor within the transition interval using the following interpolation formula:

[0067] $\lambda(U) = 1.0 + (\lambda_{98\%} - 1.0) \cdot \dfrac{U - 0.75}{0.98 - 0.75}$, where $U$ is the resource utilization rate and $\lambda_{98\%}$ is the pre-calculated factor value at 98% (the formula itself is missing from this text; this expression is the linear interpolation implied by the surrounding description, with symbol names chosen here)

[0068] This formula ensures that as the resource utilization rate increases from 75% to 98%, the computational cost perception factor transitions smoothly from the base value of 1.0 to the value given by the high-load formula, eliminating the numerical jumps and oscillations caused by formula switching and ensuring the mathematical continuity of the logic branches over the entire range of [0%, 100%]. When the resource utilization rate is in the high-load range, such as greater than 98%, the model dynamically adjusts the computational cost perception factor based on the confidence entropy. To meet the requirement of rapid response, this embodiment first uses a simplified linear approximation model for a preliminary calculation, and the adjustment follows this principle:

[0069]

[0070] where the computational cost perception factor is derived from the model calculation; its physical meaning is the dynamic pricing coefficient of computational consumption for each modality, and it is dimensionless;

[0071] the penalty gain coefficient is a preset constant, such as 5.0; its physical meaning is the gain that adjusts the intensity of the penalty, and it is dimensionless;

[0072] the modal decision confidence entropy is derived from the preceding calculation stage; it physically represents how suspicious a document is, and it is dimensionless;

[0073] the resource utilization rate is derived from real-time monitoring; it physically represents the current load pressure of the system and is expressed as a percentage. When it is substituted into the above formula for numerical calculation, the percentage should be converted to decimal form. For example, a resource utilization rate of 85% is entered as 0.85 to keep the calculation units consistent.
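As a non-limiting sketch, the three load ranges (base value, linear interpolation, high-load adjustment) could be combined into one continuous function. The thresholds 75% and 98%, the base value 1.0 and the gain 5.0 are the example values from this embodiment. The body of `high_load_factor` is an assumption made only for illustration (base value plus a penalty linear in entropy and utilization), because the high-load formula itself is not reproduced in this text.

```python
LOW, HIGH = 0.75, 0.98  # transition interval bounds (example values from the text)
BASE = 1.0              # base value of the factor under low load
K = 5.0                 # penalty gain coefficient (example value from the text)


def high_load_factor(entropy: float, utilization: float, k: float = K) -> float:
    """ASSUMED linear form; the original high-load formula is not available."""
    return BASE + k * entropy * utilization


def cost_factor(entropy: float, utilization: float) -> float:
    """Piecewise cost perception factor; utilization is a decimal (85% -> 0.85)."""
    if utilization < LOW:
        return BASE
    if utilization <= HIGH:
        # Interpolate from BASE at 75% to the precomputed high-load value at 98%.
        target = high_load_factor(entropy, HIGH)
        t = (utilization - LOW) / (HIGH - LOW)
        return BASE + t * (target - BASE)
    return high_load_factor(entropy, utilization)
```

Whatever the actual high-load formula is, interpolating toward its value at the 98% boundary is what removes the jump at both ends of the transition interval.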

[0074] This embodiment establishes a dynamic game mechanism between computing power and risk. When the system is under high load and faces a suspected attack, it automatically raises the computational cost perception factor, thereby suppressing the resource consumption of high-compute modalities. In essence, it avoids spending expensive and futile computation on suspicious documents when resources are scarce.

[0075] Example 5:

[0076] Step four includes:

[0077] S41. Define the computational cost coefficient for each modality, where the computational cost coefficient for the video modality is higher than that for the text modality;

[0078] S42. By combining the computational cost perception factor and the computational cost coefficient, the weight parameters in the multimodal fusion formula are corrected to generate asymmetric fusion weights.

[0079] S43. In normal mode, asymmetric fusion weights focus on modalities with high semantic richness;

[0080] S44. In the degraded survival mode, the hash similarity weight of the modality with a low computational cost coefficient is forcibly increased, while the semantic similarity weight of the modality with a high computational cost coefficient is decreased.

[0081] This embodiment further specifies the weight allocation logic of Embodiment 4 and aims to achieve cost-based dynamic weight reconstruction. The system defines a computational cost coefficient for each modality. Based on actual hardware test data, the coefficient for the video modality is set significantly higher than that for the text modality, for example 50 for video and 1 for text. Combining these coefficients with the computational cost perception factor, the system corrects the multimodal fusion formula and adjusts the final fusion weight of each modality. The calculation is as follows:

[0082]

[0083] where the final fusion weight is derived from the corrected calculation; its physical meaning is the influence of the modality on the final decision, and it is dimensionless;

[0084] the base modality weight is a preset value, such as 0.4 for video; it physically represents the importance of the modality under ideal conditions, and it is dimensionless;

[0085] the computational cost perception factor is the output of the game model; its physical meaning is the degree of resource scarcity in the system, and it is dimensionless;

[0086] the computational cost coefficient is defined from actual hardware measurements; its physical meaning is the sensitivity of the modality to computing power consumption, and it is dimensionless.

[0087] In normal mode, because the attenuation term is close to 1, the weights stay at their baseline values and the system prioritizes the semantically rich video modality. In response to entering the degraded survival mode, because the computational cost coefficient of the video modality is very large, its weight decays exponentially and rapidly, while the weight of the text modality decays more slowly, which forcibly increases the relative influence of the low-cost modality.
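As a non-limiting sketch, an exponential cost penalty with the behavior described above could look like the following. The video base weight 0.4 and the cost coefficients 50 (video) and 1 (text) are example values from this embodiment; the image entries and the exact penalty form `exp(-(factor - base) * cost)` are assumptions chosen so that a factor at its base value of 1.0 leaves the weights unchanged. The original weight formula is not reproduced in this text.

```python
import math

# Video base weight 0.4 is from the text; the other base weights are assumed.
BASE_WEIGHTS = {"video": 0.4, "image": 0.3, "text": 0.3}
# Cost coefficients 50 (video) and 1 (text) are from the text; image is assumed.
COST_COEFF = {"video": 50.0, "image": 5.0, "text": 1.0}


def fusion_weights(cost_factor: float, base_factor: float = 1.0) -> dict:
    """ASSUMED penalty form: each weight shrinks with exp(-(factor - base) * cost)."""
    excess = max(cost_factor - base_factor, 0.0)
    return {
        modality: weight * math.exp(-excess * COST_COEFF[modality])
        for modality, weight in BASE_WEIGHTS.items()
    }
```

With a factor of 1.2, for instance, the video weight is multiplied by `exp(-10)` while the text weight is multiplied by `exp(-0.2)`, so the text modality dominates the fused decision.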

[0088] Specifically, regarding the calculation of the final similarity detection result of the target document generated in step four: because the fusion weights attenuate sharply as the load increases, a plain weighted sum would approach zero. To address this, the system explicitly adopts a normalized weighted summation formula:

[0089]

[0090] where the final similarity detection result is derived from the normalized calculation; its physical meaning is the comprehensive similarity score after multimodal fusion, it takes values in [0,1], and it is dimensionless. The final fusion weight of each modality is the result of the fusion weight formula and is dimensionless. The similarity score of each modality is the semantic similarity calculated in step S22 or the hash similarity obtained in step S12; it represents the alignment score for a single modality dimension and is dimensionless. This normalization step removes the effect of the weights not summing to 1 and ensures that the final score always falls within the [0,1] interval, which keeps it comparable with the subsequent compliance judgment threshold and avoids wholesale missed detections caused by low scores under high load.
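As a non-limiting sketch, the normalized weighted summation described above divides the weighted sum of per-modality scores by the sum of the weights. The function name `fuse` and the dictionary layout are hypothetical; the zero-total guard is an added safety check, not part of the described method.

```python
def fuse(weights: dict, scores: dict) -> float:
    """Normalized weighted sum: sum(w_i * s_i) / sum(w_i).

    `weights` holds the final fusion weight per modality and `scores` the
    per-modality similarity in [0, 1]; the result is again in [0, 1].
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one modality must have a positive weight")
    return sum(weights[m] * scores[m] for m in weights) / total
```

For example, if heavy attenuation leaves weights of 0.00001 for video and 0.2 for text and both modalities score 0.7, the fused score is still 0.7 rather than a value near zero.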

[0091] This embodiment achieves asymmetric weight adjustment by introducing an exponential computational cost penalty and a normalized fusion operator. When resources are scarce, it automatically suppresses expensive video semantic judgment and relies instead on inexpensive text and hash judgment. Although this sacrifices part of the detection rate for video plagiarism, it effectively maintains the overall throughput of the system.

[0092] Example 6:

[0093] Step four also includes:

[0094] S45. Monitor the system's computing resource load data;

[0095] S46. When the resource utilization rate exceeds the preset circuit breaker threshold of 98% and the confidence entropy of the modal decision exceeds the preset attack judgment threshold, the system is judged to enter the failure boundary state.

[0096] S47. In response to the failure boundary state, the dynamic circuit breaker mechanism is activated, and the semantic calculation weight of the video modality is directly reset to zero. The final similarity detection result is generated only based on the hash similarity between the text modality and the image modality.

[0097] This embodiment further specifies the failure boundary handling of Embodiment 5 and aims to build the system's last line of defense. The system continuously monitors computing resource load data. In response to the resource utilization rate exceeding a preset circuit breaker threshold, such as 98%, and the average modal decision confidence entropy of the current batch of requests exceeding a preset attack judgment threshold, such as 0.8, the system determines that it has entered a failure boundary state. It then activates the dynamic circuit breaker mechanism, performs a hard truncation, and resets the semantic computation weight of the video modality directly to zero. The final similarity detection result is generated solely from the hash similarities of the text modality and the image modality, expressed by the formula:

[0098]

[0099] where the final similarity detection result is derived from the circuit breaker calculation; its physical meaning is the document similarity after degradation, and it is dimensionless;

[0100] the text hash similarity is derived from step S12 and is dimensionless;

[0101] the image hash similarity is derived from step S12 and is dimensionless;

[0102] the normalized weights are derived from the system's default configuration after the reset and are dimensionless. They are determined as follows: select a historical set of violation documents containing no fewer than 10,000 samples, and measure the precision (true positives / predicted positives) with only text hashing enabled and with only image hashing enabled. The measured text precision is approximately 1.5 times the image precision. Following the inverse-variance weighting principle in statistics, the weights should be inversely proportional to the error rate and directly proportional to the accuracy, so the two weights are obtained by normalizing this ratio: the text weight and the image weight stand in the ratio 1.5 : 1 and sum to 1, which gives 0.6 and 0.4 respectively. This configuration ensures that, when video semantics are unavailable, rapid screening prioritizes text features.
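As a non-limiting sketch, the compound trigger and the degraded score could be computed as follows. The thresholds 98% and 0.8 and the 1.5 : 1 precision ratio are the example values from this embodiment; the function names are hypothetical, and the combination of the two hash similarities is assumed to be a weighted sum with weights that sum to 1.

```python
BREAKER_UTILIZATION = 0.98  # circuit breaker threshold (example from the text)
ATTACK_ENTROPY = 0.8        # attack judgment threshold (example from the text)
PRECISION_RATIO = 1.5       # measured text precision / image precision


def breaker_weights(ratio: float = PRECISION_RATIO) -> tuple:
    """Normalize a text:image precision ratio into two weights that sum to 1."""
    text_w = ratio / (ratio + 1.0)
    return text_w, 1.0 - text_w


def in_failure_boundary(utilization: float, batch_entropies: list) -> bool:
    """Both conditions must hold: overload AND high average entropy in the batch."""
    if not batch_entropies:
        return False
    mean_entropy = sum(batch_entropies) / len(batch_entropies)
    return utilization > BREAKER_UTILIZATION and mean_entropy > ATTACK_ENTROPY


def degraded_similarity(text_hash_sim: float, image_hash_sim: float) -> float:
    """Score after the video semantic weight has been reset to zero."""
    text_w, image_w = breaker_weights()
    return text_w * text_hash_sim + image_w * image_hash_sim
```

With the 1.5 : 1 ratio, `breaker_weights()` yields approximately 0.6 for text and 0.4 for image.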

[0103] This embodiment provides a strategy of sacrificing a part to save the whole. When the system is judged to be about to crash due to a GAL attack, all video semantic computation is cut off directly to ensure survival in extreme situations. Even under a sophisticated plagiarism attack, the core text and image detection services remain available, preventing the complete paralysis of the entire compliance platform.

[0104] Example 7:

[0105] In the dynamic game strategy model, a first threshold and a second threshold are set, and the first threshold is less than the second threshold. The model executes the following logic:

[0106] When the resource utilization rate is less than or equal to the first threshold, the first strategy is output to maintain full-scale deep semantic computation for all modalities; when the resource utilization rate is greater than the first threshold but less than the second threshold, the second strategy is output to linearly reduce the weight of the video modalities based on the computation cost awareness factor.

[0107] When the resource utilization rate is greater than or equal to the second threshold and the confidence entropy of the modal decision is greater than the preset attack judgment threshold, the third strategy is output to execute the dynamic circuit breaker mechanism, which reduces the accuracy of a single detection in exchange for the recovery of system throughput.

[0108] This embodiment further specifies the response logic of the dynamic game strategy model of Embodiment 6 and aims to achieve a tiered response. The model sets two key load thresholds: the first threshold, for example 75%, and the second threshold, for example 98%, which is consistent with the circuit breaker threshold. In response to the resource utilization rate being less than or equal to the first threshold, the model outputs the first strategy, maintaining semantic vector extraction and comparison across all modalities so that detection runs at full accuracy. In response to the resource utilization rate being greater than the first threshold and less than the second threshold, the model outputs the second strategy, which linearly reduces the weight of the video modality based on the computational cost perception factor.

[0109] It should be specifically noted here that, in order to implement the principle of adjusting according to the computational cost perception factor in the embodiments, the system adopts the following linear mapping formula in the second strategy stage of this embodiment:

[0110]

[0111] where the real-time weight of the video modality is derived from the linear calculation; its physical meaning is the influence of the video modality under the current load, and it is dimensionless;

[0112] the base weight of the video modality is a system preset, such as 0.4; it physically represents the original weight without any load penalty, and it is dimensionless;

[0113] the computational cost perception factor is the result solved in step three; its physical meaning is a composite pressure index of the current system that covers both load and risk, and it is dimensionless;

[0114] the dynamic attenuation coefficient is not a fixed constant but is configured in terms of the theoretical maximum cost factor, that is, the value of the cost perception factor obtained by substituting into the formula the state in which the resource utilization rate reaches the second threshold (the circuit breaker threshold) and the confidence entropy reaches its maximum value (1.0). Its physical meaning is a dimensionless normalization operator that ensures the weight is reduced exactly to zero only when the system load actually reaches the circuit breaker threshold. This formula corrects the original pure load mapping logic, ensuring that the weight adjustment strictly follows the core variable, the computational cost perception factor, and achieves accurate weight reduction for high-risk or high-load documents.

[0115] At this stage, video semantics are still computed, but their influence decreases linearly as the load increases. In response to the resource utilization rate being greater than or equal to the second threshold, the model outputs the third strategy and executes the dynamic circuit breaker mechanism. In this step, the system strictly follows the compound trigger logic: while confirming that the resource utilization rate has reached the second threshold, it also checks whether the modal decision confidence entropy exceeds the attack judgment threshold. The circuit breaker is activated only when both conditions are met, at which point video semantic computation stops and only hash comparison is retained. If the confidence entropy does not reach the threshold, the system maintains degraded but non-circuit-breaker processing, such as the limiting state of the second strategy.
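As a non-limiting sketch, the three-tier response could be expressed as a strategy selector plus a linear weight mapping. The thresholds 75%, 98% and 0.8 and the base video weight 0.4 are example values from this embodiment. The mapping in `video_weight`, which scales the base weight by `1 - factor / max_factor`, is an assumed reading of the linear mapping formula, which is not reproduced in this text; all names are hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    FULL = 1           # first strategy: full deep semantic computation
    LINEAR_DECAY = 2   # second strategy: video weight reduced linearly
    CIRCUIT_BREAK = 3  # third strategy: video semantics cut off


T1, T2 = 0.75, 0.98    # first and second thresholds (examples from the text)
ATTACK_ENTROPY = 0.8   # attack judgment threshold (example from the text)


def select_strategy(utilization: float, entropy: float) -> Strategy:
    if utilization <= T1:
        return Strategy.FULL
    if utilization >= T2 and entropy > ATTACK_ENTROPY:
        return Strategy.CIRCUIT_BREAK
    # Covers the transition range and the overloaded-but-low-entropy case.
    return Strategy.LINEAR_DECAY


def video_weight(base_weight: float, factor: float, max_factor: float) -> float:
    """ASSUMED mapping: the weight reaches zero when factor equals max_factor."""
    return base_weight * max(0.0, 1.0 - factor / max_factor)
```

Note that an overloaded system with low confidence entropy stays on the second strategy, which matches the compound trigger logic: overload alone does not open the circuit breaker.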

[0116] This embodiment achieves a smooth transition from full-accuracy detection to lossy service through a three-tier strategy. It explicitly sacrifices the accuracy of a single detection in exchange for rapid recovery of system throughput, avoiding the hard landing of traditional systems that simply reject requests when overloaded, and providing a more resilient quality of service.

[0117] Example 8:

[0118] Step four also includes:

[0119] S48. Compare the generated final similarity detection result with the preset compliance judgment threshold;

[0120] S49. If the final similarity detection result is higher than the compliance judgment threshold, generate an interception instruction and mark the target document as non-compliant;

[0121] S50. If the final similarity detection result is lower than the compliance judgment threshold, a release instruction is generated. However, the target document that is released under the third strategy is marked as pending secondary verification and enters the asynchronous review queue after the system load decreases.

[0122] This embodiment further specifies the detection result processing logic in Embodiment 7, aiming to compensate for security vulnerabilities in the downgrade mode. The system compares the final similarity detection result with a preset compliance judgment threshold. In response to the result being higher than the threshold, the system generates an interception instruction and marks the target document as non-compliant. In response to the result being lower than the threshold, the system generates a release instruction. During this process, if the release decision is made under the third strategy, i.e., dynamic circuit breaker, the system will specially mark the target document as pending secondary verification and push the document ID into a low-priority asynchronous review queue. When the system load drops below the first threshold, the background performs a full deep semantic calculation on the documents in the queue again. If a missed judgment is found, post-event accountability will be pursued.
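As a non-limiting sketch, steps S48 to S50 and the deferred review could be organized as follows. The names `Verdict`, `decide`, `drain_review_queue` and the `recheck` callback are hypothetical, and a score exactly equal to the threshold is treated as a release here because the text does not specify that case. The first threshold 0.75 is the example value used elsewhere in the description.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    doc_id: str
    action: str                  # "intercept" or "release"
    pending_review: bool = False


review_queue: deque = deque()    # low-priority asynchronous review queue


def decide(doc_id: str, score: float, threshold: float, circuit_broken: bool) -> Verdict:
    """S48-S50: compare the fused score with the compliance judgment threshold."""
    if score > threshold:
        return Verdict(doc_id, "intercept")
    if circuit_broken:
        # Released under the third strategy: queue for secondary verification.
        review_queue.append(doc_id)
    return Verdict(doc_id, "release", pending_review=circuit_broken)


def drain_review_queue(utilization: float, recheck: Callable[[str], bool],
                       first_threshold: float = 0.75) -> List[str]:
    """Re-run full semantic checks once load is low; return documents found non-compliant."""
    if utilization >= first_threshold:
        return []
    missed = []
    while review_queue:
        doc_id = review_queue.popleft()
        if recheck(doc_id):
            missed.append(doc_id)
    return missed
```

Here `recheck` stands in for the full deep semantic calculation and returns `True` when a previously released document turns out to be non-compliant.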

[0123] This embodiment constructs a closed-loop mechanism of releasing first and reviewing afterwards, ensuring that user uploads are not blocked during peak attack periods and thereby protecting the user experience. At the same time, it uses idle computing power during off-peak periods to compensate for the reduced detection accuracy of the degraded mode, thus keeping compliance risk under control.

[0124] Example 9:

[0125] The computational cost perception factor is a dynamically adjusted variable whose value is determined by a calculation function configured as follows:

[0126] The value of the variable increases as the system's current queries per second increase, and decreases as the system's remaining available memory increases;

[0127] When calculating the asymmetric fusion weights, the computational cost perception factor is used as the power parameter of an exponential operation to nonlinearly scale the distance calculation results in the high-dimensional semantic space, thereby accelerating the calculation process when resources are limited.

[0128] This embodiment further refines the mathematical model of the computational cost perception factor of Embodiment 1 and aims to define precisely the nonlinear relationship between load and cost. The system collects the current queries per second and the remaining available memory in real time and determines the cost perception factor through a calculation function. The formula is as follows:

[0129]

[0130] where the computational cost perception factor is derived from the function calculation; its physical meaning is the degree of resource scarcity, and it is dimensionless. It should be noted that the formula used in this embodiment is a high-precision nonlinear alternative to the general linear formula of Embodiment 4. Unlike the linear superposition logic of Embodiment 4, this formula adopts a multiplicative exponential model, which addresses the nonlinear impact of a single resource bottleneck on system stability in ultra-large-scale concurrency scenarios. The two are parallel technical solutions, applicable to general scenarios and to extreme high-concurrency scenarios respectively.

[0131] The basic adjustment constant is a preset value, such as 1.0, and is dimensionless.

[0132] The real-time QPS and the real-time remaining available memory are collected by the system and are measured in queries per second and GB, respectively.

[0133] The reference baseline values are derived from system calibration. The QPS baseline is the saturation throughput measured in offline load testing, for example ... queries per second; the memory baseline is the minimum safe memory threshold required to keep basic services running, for example ... GB. This calibration method ties the factor calculation to the physical limits of the system; the units are the same as above.

[0134] The sensitivity exponents are set empirically, such as 1.5 and 2.0. Their symbols are chosen to avoid a notation conflict with the normalized weights defined in Embodiment 6. Their physical meaning is the response rate of the factor to changes in traffic and memory, and they are dimensionless.

[0135] The adversarial weighting coefficient is a preset value, such as 2.0; it physically represents how strongly the confidence entropy amplifies the computational cost, reflects the attack-defense game logic, and is dimensionless.

[0136] The modal decision confidence entropy is derived from step S24; its physical meaning is the adversarial risk level of a document, and it is dimensionless.

[0137] In particular, this mathematical model may exhibit a singularity under low-load scenarios, i.e. when the unclamped factor tends toward zero. To handle this, the embodiment explicitly introduces the base-value boundary clamping logic of Embodiment 4: in engineering implementation, the factor finally used is clamped so that it does not fall below the base value. This clamping ensures that, when the system has sufficient resources, the exponent term stays constant at 1 and the distance scaling degenerates into the original Euclidean distance. It thereby avoids the case in which the exponent tends to infinity and compresses all distance differences to zero, which would render the metric physically meaningless.

[0138] When calculating distance in the high-dimensional semantic space, the original Euclidean distance between the semantic feature vector of the target document and the semantic feature vector of the database sample is defined first. The system then uses the clamped cost perception factor as a power parameter to scale this original distance nonlinearly and obtain the corrected distance. The calculation formula is: ;
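As a non-limiting sketch, the multiplicative model, the clamping logic and the distance scaling could be written as follows. The base constant 1.0, the sensitivity exponents 1.5 and 2.0 and the adversarial weighting coefficient 2.0 are example values from this embodiment. The specific functional forms are assumptions consistent with the stated monotonic behavior (rising with QPS, falling with remaining memory, amplified by confidence entropy, exponent equal to 1 at the base value); the original formulas are not reproduced in this text, and the baselines `qps_ref` and `mem_ref_gb` must be supplied by the caller because the text elides their values.

```python
def raw_cost_factor(qps: float, free_mem_gb: float, entropy: float,
                    qps_ref: float, mem_ref_gb: float,
                    lam0: float = 1.0, a: float = 1.5, b: float = 2.0,
                    gamma: float = 2.0) -> float:
    """ASSUMED form: lam0 * (Q/Q_ref)^a * (M_ref/M)^b * (1 + gamma * H)."""
    if free_mem_gb <= 0 or qps_ref <= 0:
        raise ValueError("free memory and QPS baseline must be positive")
    load_term = (qps / qps_ref) ** a
    memory_term = (mem_ref_gb / free_mem_gb) ** b
    return lam0 * load_term * memory_term * (1.0 + gamma * entropy)


def clamped_cost_factor(qps: float, free_mem_gb: float, entropy: float,
                        qps_ref: float, mem_ref_gb: float) -> float:
    """Never let the factor fall below the base value 1.0 (low-load clamp)."""
    return max(1.0, raw_cost_factor(qps, free_mem_gb, entropy, qps_ref, mem_ref_gb))


def scaled_distance(distance: float, factor: float) -> float:
    """ASSUMED scaling: exponent 1/factor, so factor 1.0 returns the raw distance."""
    return distance ** (1.0 / factor)
```

With the clamp in place the exponent `1 / factor` never exceeds 1, so an idle system compares documents on the unmodified Euclidean distance.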

[0139] It should be noted that the asymmetric fusion weight described in the embodiments is specifically defined in this embodiment as a nonlinear metric weight. Unlike the linear coefficient that acts directly on the similarity value in Embodiment 5, this embodiment changes the exponent of the distance calculation, i.e. the curvature of the space, to produce a feature recognition control effect equivalent to directly adjusting the weighting coefficients. Mathematically, this nonlinear scaling belongs to the category of generalized weighted fusion: by reducing the distance resolution of high-cost modalities in the feature space, it gains computational speed, and it thereby keeps the definition of asymmetric fusion weights consistent across different mathematical spaces.

[0140] This embodiment uses a design in which the factor is positively correlated with QPS and negatively correlated with remaining available memory, and it introduces the confidence entropy. This ensures that the system is not only sensitive to memory exhaustion but can also detect the risk of adversarial attacks, preventing it from degenerating into ordinary flow control under simple high-traffic attacks.

[0141] Example 10:

[0142] Modal decision confidence entropy is used to quantify the degree of logical paradox between surface features and deep semantics in a system;

[0143] When the system is under high load and the confidence entropy exceeds the preset uncertainty threshold, making it impossible to make a clear judgment within the preset delay requirement, the system implements a conservative strategy, prioritizing service availability and terminating the comparison process by increasing the decision weight of the deterministic hash fingerprint.

[0144] This embodiment further specifies the application logic of the modal decision confidence entropy of Embodiment 1 and aims to handle scenarios of extreme uncertainty. The modal decision confidence entropy essentially quantifies how uncertain the system is about the current detection result. In response to the system being under high load and the confidence entropy exceeding a preset uncertainty threshold, such as 0.7, the system determines that it cannot make a clear judgment within the preset delay, such as 200 ms. The system then executes a conservative strategy: it immediately stops the ongoing deep neural network inference, raises the decision weight of the deterministic hash fingerprint to 100%, and makes its judgment based solely on the hash result. If the hash does not match, the document is released by default.
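As a non-limiting sketch, the conservative strategy could be implemented as a guard that is evaluated while deep inference is still running. The uncertainty threshold 0.7 and the 200 ms delay are the example values from this embodiment; the high-load bound 0.98, the function name and the convention of returning `None` when the guard does not fire are assumptions.

```python
from typing import Optional


def conservative_decision(hash_match: bool, entropy: float, utilization: float,
                          elapsed_ms: float,
                          uncertainty_threshold: float = 0.7,  # example from the text
                          high_load: float = 0.98,             # assumed bound
                          deadline_ms: float = 200.0           # example from the text
                          ) -> Optional[str]:
    """Return a hash-only verdict when the guard fires, otherwise None.

    None means deep semantic inference should simply continue.
    """
    overloaded = utilization > high_load
    too_uncertain = entropy > uncertainty_threshold
    out_of_time = elapsed_ms >= deadline_ms
    if overloaded and too_uncertain and out_of_time:
        # Hash fingerprint weight is raised to 100%: decide on the hash alone.
        return "intercept" if hash_match else "release"
    return None
```

A caller would abort the neural network inference for that document as soon as a non-`None` verdict is returned.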

[0145] This embodiment embodies the engineering principle of prioritizing service availability. When a computation becomes highly uncertain and bogged down, it forces the system to abandon the computation and reach a decision quickly, preventing the entire processing queue from being dragged down by the processing timeout of a single complex sample and thus ensuring the overall stability of the system.

[0146] It should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and are not intended to limit it. Although the present invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims

1. A document similarity detection method based on multimodal content fusion, characterized in that the specific steps include:
Step 1: Obtain the target document to be detected, parse the target document into multiple modal data such as text, image and video, extract the deterministic hash fingerprint of each modal data, map it to a high-dimensional semantic space, and construct the semantic feature vector of each modal data;
Step 2: Access the preset compliance feature database and calculate the hash similarity between the deterministic hash fingerprint and the samples in the database; when the hash similarity is lower than the preset filtering threshold, start deep analysis, calculate the semantic similarity between the semantic feature vector and the samples in the database, calculate the numerical difference between the hash similarity and the semantic similarity, and map this numerical difference to the modal decision confidence entropy, which represents the degree of conflict between modalities;
Step 3: Collect current system computing resource load data, including processing request frequency and single-task processing latency; combine the modal decision confidence entropy, and use a dynamic game strategy model that includes the mapping relationship between resource load and computing cost to calculate the computational cost perception factor for the current target document;
Step 4: Based on the computational cost perception factor, assign asymmetric fusion weights to the semantic similarity and hash similarity of each modal data, perform weighted fusion calculation, and generate the final similarity detection result of the target document; when the system's computing resource load exceeds the preset circuit breaker threshold and the confidence entropy indicates high risk, trigger the degradation survival mechanism by adjusting the asymmetric fusion weights, and output the final similarity detection result and the corresponding handling instructions.
Step two includes: S21. Compare the deterministic hash fingerprint with the blacklist fingerprints in the compliance feature database using Hamming distance, and convert the Hamming distance into a hash similarity on a normalized interval; S22. If the hash similarity indicates no direct match, calculate the cosine similarity between the text semantic vector, image semantic vector, and video semantic vector and the corresponding vectors in the compliance feature database to obtain the semantic similarity; S23. Calculate the absolute value of the difference between the semantic similarity and the hash similarity, and normalize it to obtain the normalized difference; S24. Use the normalized difference as an uncertainty measure to generate the modal decision confidence entropy, configured such that when the semantic similarity value is greater than the hash similarity value and the difference exceeds a preset range, the confidence entropy takes a high value, indicating a risk of adversarial cleaning attack.
Step three includes: S31. Monitor the system's queries per second (QPS) and average response time (RT) in real time to calculate the resource utilization rate; S32. Input the resource utilization rate and the modal decision confidence entropy into the dynamic game strategy model; S33. If the resource utilization rate is in the low-load range, set the computational cost perception factor to a recall-oriented base value, so that the weights of high-computational-cost modalities remain normal; S34. If the resource utilization rate is in the high-load range, dynamically adjust the computational cost perception factor according to the value of the confidence entropy, such that the higher the confidence entropy, the greater the penalty imposed by the computational cost perception factor on modalities with high computing power consumption. A simplified linear approximation model is used for the preliminary calculation, adjusted according to the following principle, wherein: the computational cost perception factor is derived from the model calculation, its physical meaning is the dynamic pricing coefficient of computational consumption for each modality, and it is dimensionless; the penalty gain coefficient is a preset constant, its physical meaning is the gain that adjusts the intensity of the penalty, and it is dimensionless; the modal decision confidence entropy is derived from the preceding calculation stage, physically represents how suspicious a document is, and is dimensionless; the resource utilization rate is derived from real-time monitoring, physically represents the current load pressure of the system, and is expressed as a percentage, which should be converted to decimal form when substituted into the above formula to ensure consistency of the calculation units.

2. The document similarity detection method based on multimodal content fusion according to claim 1, characterized in that: Step one includes: S11. Receive the unstructured data stream uploaded by the user, and use a multimodal parser to perform track splitting processing to separate the text stream, image frame sequence, and video keyframe sequence; S12. A secure hash algorithm is used to generate text fingerprints for the separated text stream, a perceptual hash algorithm is used to generate image fingerprints for the image frame sequence, and a sequence fingerprint is generated for the video keyframe sequence. These are collectively referred to as deterministic hash fingerprints. S13. Using a pre-trained deep neural network model, feature encoding is performed on the text stream, image frame sequence, and video keyframe sequence to generate text semantic vectors, image semantic vectors, and video semantic vectors, respectively.

3. The document similarity detection method based on multimodal content fusion according to claim 2, characterized in that: Step four includes: S41. Define the computational cost coefficient for each modality, where the computational cost coefficient for the video modality is higher than that for the text modality; S42. By combining the computational cost perception factor and the computational cost coefficient, the weight parameters in the multimodal fusion formula are corrected to generate asymmetric fusion weights; S43. In normal mode, the asymmetric fusion weights focus on modalities with high semantic richness; S44. In the degraded survival mode, the hash similarity weight of the modality with a low computational cost coefficient is forcibly increased, while the semantic similarity weight of the modality with a high computational cost coefficient is decreased.

4. The document similarity detection method based on multimodal content fusion according to claim 3, characterized in that: Step four also includes: S45. Monitor the system's computing resource load data; S46. When the resource utilization rate exceeds the preset circuit breaker threshold of 98% and the modal decision confidence entropy exceeds the preset attack judgment threshold, the system is judged to have entered the failure boundary state; S47. In response to the failure boundary state, the dynamic circuit breaker mechanism is activated, and the semantic computation weight of the video modality is directly reset to zero. The final similarity detection result is generated only from the hash similarities of the text modality and the image modality.

5. The document similarity detection method based on multimodal content fusion according to claim 4, characterized in that: The dynamic game strategy model sets a first threshold and a second threshold, where the first threshold is less than the second threshold. The model then executes the following logic: When the resource utilization rate is less than or equal to the first threshold, the first strategy is output to maintain full-scale deep semantic computation for all modalities. When the resource utilization rate is greater than the first threshold and less than the second threshold, the second strategy is output, which linearly reduces the weight of the video modality based on the computational cost perception factor. When the resource utilization rate is greater than or equal to the second threshold and the confidence entropy of the modal decision is greater than the preset attack judgment threshold, the third strategy is output, and the dynamic circuit breaker mechanism is executed to restore the system throughput at the cost of reducing the accuracy of a single detection.

6. The document similarity detection method based on multimodal content fusion according to claim 5, characterized in that step four also includes: S48. comparing the generated final similarity detection result with the preset compliance judgment threshold; S49. if the final similarity detection result is higher than the compliance judgment threshold, generating an interception instruction and marking the target document as non-compliant; S50. if the final similarity detection result is lower than the compliance judgment threshold, generating a release instruction, wherein a target document released under the third strategy is marked as pending secondary verification and enters the asynchronous review queue after the system load decreases.
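Steps S48 to S50 amount to a threshold comparison plus a deferred-review flag. The sketch below is illustrative only; `Decision`, `decide` and the use of a `deque` as the asynchronous review queue are hypothetical choices. The claim does not cover a result exactly equal to the threshold, so this sketch releases in that case as an assumption, and it does not model the wait for the system load to decrease.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Decision:
    action: str           # "intercept" or "release"
    non_compliant: bool   # set when the document is intercepted (S49)
    pending_review: bool  # set when released under the third strategy (S50)


def decide(doc_id, score, threshold, third_strategy, review_queue):
    """Compare the final similarity score with the compliance threshold."""
    if score > threshold:
        return Decision("intercept", non_compliant=True, pending_review=False)
    # Equality is not covered by the claim; this sketch treats it as a release.
    if third_strategy:
        # Deferred: to be re-checked once load has dropped.
        review_queue.append(doc_id)
    return Decision("release", non_compliant=False, pending_review=third_strategy)


queue = deque()
blocked = decide("doc-1", score=0.92, threshold=0.85, third_strategy=False, review_queue=queue)
deferred = decide("doc-2", score=0.40, threshold=0.85, third_strategy=True, review_queue=queue)
```

Here `doc-1` is intercepted and marked non-compliant, while `doc-2` is released but queued for secondary verification because it was scored under the third strategy.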

7. The document similarity detection method based on multimodal content fusion according to claim 1, characterized in that the computational cost perception factor is a dynamically adjustable variable whose value is determined by a calculation function, the function being configured such that the value increases as the system's current queries per second increase and decreases as the system's remaining available memory increases; when calculating the asymmetric fusion weights, the computational cost perception factor is used as the power parameter of an exponential operation that nonlinearly scales the distance calculation results in the high-dimensional semantic space, thereby accelerating the calculation process when resources are limited.
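One function that satisfies the two monotonicity conditions of claim 7 is shown below. It is a sketch under assumptions, not the calculation function of the application: the ratio form, the reference values `qps_ref` and `mem_ref_mb`, and the reading of "power parameter" as `distance ** gamma` on a distance normalized to [0, 1] are all hypothetical.

```python
def cost_perception_factor(qps, free_mem_mb, qps_ref=1000.0, mem_ref_mb=1024.0):
    """Grow with queries per second, shrink with remaining memory.

    The ratio form and both reference values are assumptions of this sketch.
    """
    return (1.0 + qps / qps_ref) / (1.0 + free_mem_mb / mem_ref_mb)


def scale_distance(distance, gamma):
    """Nonlinearly scale a normalized semantic distance using gamma as the exponent."""
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must be normalized to [0, 1]")
    return distance ** gamma


gamma = cost_perception_factor(qps=3000.0, free_mem_mb=1024.0)  # (1 + 3) / (1 + 1) = 2.0
scaled = scale_distance(0.5, gamma)                             # 0.5 ** 2.0 = 0.25
```

Under this reading, a larger factor compresses small distances toward zero, so borderline semantic differences carry less weight when the system is under load.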

8. The document similarity detection method based on multimodal content fusion according to claim 1, characterized in that the confidence entropy of the modal decision is used to quantify the system's uncertainty about the current detection result; when the system is under high load and the confidence entropy exceeds the preset uncertainty threshold, so that a clear judgment cannot be made within the preset latency requirement, the system adopts a conservative strategy that prioritizes service availability and terminates the comparison process by increasing the decision weight of the deterministic hash fingerprint.