An AI contract tamper-proofing method based on OCR+large language model

This AI-based contract anti-tampering method combines OCR with a large language model. It generates text fingerprints using multi-scale adaptive enhancement and a semantic bridging fusion model, which addresses the high false alarm rate caused by the sensitivity of contract tampering detection to OCR recognition errors and achieves highly accurate contract tampering detection.

CN121686471B · Active · Publication Date: 2026-06-12 · BEIJING NANCAL RUIYUAN DIGITAL TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING NANCAL RUIYUAN DIGITAL TECH CO LTD
Filing Date
2026-02-09
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies for contract tampering detection are sensitive to OCR recognition errors, resulting in a high false alarm rate and an inability to effectively distinguish between genuine tampering and recognition noise.

Method used

An AI-based contract anti-tampering method based on OCR and a large language model is adopted. Scanned images are processed through multi-scale adaptive enhancement algorithms and super-resolution reconstruction technology. Text fingerprints are generated by combining a semantic bridging fusion model and a hydrodynamic diffusion equation. The key elements of the contract are structured and semantically normalized to generate a concentration gradient matrix of a steady-state distribution field as a text fingerprint. Cosine similarity is calculated to mark the location of discrepancies.

Benefits of technology

It significantly reduces the false alarm rate caused by OCR recognition errors, improves the accuracy and reliability of contract tampering detection, and effectively identifies genuine contract tampering.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides an AI contract tamper-proofing method based on OCR and a large language model, and belongs to the technical field of large language models. The method improves image quality by performing multi-scale adaptive enhancement and super-resolution reconstruction on the contract scan image. A semantic bridging fusion model deeply fuses the OCR recognition result and the enhanced image within a sparse coding framework to correct low-confidence characters from context. The corrected text is semantically normalized and a clause semantic vector sequence is established. The steady-state distribution field of semantic concentration is calculated based on a fluid dynamics diffusion equation, and a concentration gradient matrix is extracted as the text fingerprint. Fingerprint similarity comparison then provides contract tampering detection that is robust to OCR recognition errors, solving the technical problem of the high false positive rate caused by the sensitivity of contract tampering detection to OCR recognition errors.

Description

Technical Field

[0001] This invention belongs to the field of large language model technology. Specifically, it relates to an AI contract anti-tampering method based on OCR + large language model.

Background Technology

[0002] Contract tampering detection is a crucial aspect of enterprise risk management. Traditional technologies employ digital fingerprinting based on hash functions to verify document integrity. This method calculates a fixed-length hash value from the contract text as a tamper-proof identifier; any text alteration changes the hash value, triggering an alarm. In practical applications, scanned contract images need to be converted into editable text using OCR technology before hash calculation. However, quality issues such as uneven lighting, blurring, and seal obstruction in the scanned images lead to character-level errors in OCR recognition, resulting in inconsistent recognition results for contracts with the same content under different scanning conditions. Traditional technologies treat OCR recognition and text fingerprint generation as independent modules, processing them sequentially. Recognition errors cannot be corrected by subsequent steps, and misrecognition of a single character can completely alter the hash value, making it impossible for the system to distinguish between genuine tampering and recognition noise. Some solutions attempt to alleviate this problem by improving OCR recognition rates, but accuracy remains insufficient for low-quality scanned images, and the semantic impact of recognition errors cannot be quantified. In other words, existing technologies suffer from a high false alarm rate due to the sensitivity of contract tampering detection to OCR recognition errors.

Summary of the Invention

[0003] In view of this, the present invention provides an AI contract anti-tampering method based on OCR + large language model, which can solve the technical problem in the prior art that the contract tampering detection is sensitive to OCR recognition errors, resulting in a high false alarm rate.

[0004] This invention is implemented as follows. It provides an AI contract anti-tampering method based on OCR and a large language model. The method acquires a scanned image or PDF document of the contract to be detected and performs color space conversion and resolution normalization to form a standardized contract image. A multi-scale adaptive enhancement algorithm performs local illumination compensation and edge sharpening on the standardized contract image, and super-resolution reconstruction technology restores character details in blurred areas to form an enhanced contract image. Optical character recognition is performed on the enhanced contract image to extract text content and character position information, and the recognition confidence of each character is recorded to form initial text with confidence annotations. The initial text with confidence annotations and the enhanced contract image are then input into the semantic bridging fusion model, which outputs corrected text content and structured data of key contract elements. Semantic normalization is performed on the key elements in the structured data, replacing synonyms with standard terms, and a semantic vector sequence of clauses is established. Keywords in the clause semantic vector sequence are used as diffusion sources. Based on the hydrodynamic diffusion equation, the steady-state distribution field of semantic concentration in the contract text structure is calculated, and the concentration gradient matrix of the steady-state distribution field is extracted as a text fingerprint. The generated text fingerprint is then compared with the pre-stored standard contract text fingerprint using cosine similarity. When the similarity is below a preset threshold, the locations of the differing clauses are marked and a tampering warning message is generated.

[0005] The multi-scale adaptive enhancement algorithm includes the following steps: decomposing the standardized contract image into sub-image layers of different scales, independently calculating the local contrast distribution for each sub-image layer, dynamically compressing the uneven illumination region according to Retinex theory, and fusing the enhancement results of each scale through the Laplacian pyramid.

[0006] The super-resolution reconstruction technology identifies regions in the standardized contract image where the degree of blur exceeds a threshold, extracts the texture features and gradient information of those regions, reconstructs high-resolution image patches through sub-pixel displacement estimation and frequency domain interpolation, and uses non-local means filtering to suppress noise introduced during the reconstruction process.

[0007] The initial text with confidence annotations is generated as follows: the optical character recognition engine outputs a probability distribution vector for each character to be recognized, the character with the highest probability is selected as the recognition result, and that probability value is used as the confidence annotation. A character whose confidence is lower than 0.85 is marked as a low-confidence character.

[0008] The semantic bridging fusion model includes an image encoding branch, a text encoding branch, and a cross-modal fusion layer. The image encoding branch uses a convolutional neural network to extract visual features from the enhanced contract image. The text encoding branch uses a bidirectional long short-term memory network to encode the initial text with confidence labels and outputs a sequence of hidden state vectors as the text features. The cross-modal fusion layer aligns the visual features with the text features through an attention mechanism.

[0009] The text features refer to the sequence of hidden state vectors output by the text encoding branch. Each hidden state vector contains the semantic information, contextual dependencies, and syntactic structure features of the corresponding character. The bidirectional long short-term memory network encodes the text sequence in two directions: forward and backward. Forward encoding captures the contextual information to the left of the character, and backward encoding captures the contextual information to the right of the character. The hidden state vectors in the two directions are concatenated to form the final text feature vector. The text feature vector is aligned and fused with visual features in the cross-modal fusion layer to achieve a joint representation of image information and text semantics.

[0010] The semantic bridging fusion model combines dictionary learning based on sparse coding with a deep unfolded network. It unfolds the iterative shrinkage-thresholding algorithm into a multi-layer feedforward network structure, with each layer corresponding to one iteration, and achieves sparse representation of text and image features by learning the dictionary matrix and the shrinkage threshold parameters.

[0011] The steps for establishing the training dataset for the semantic bridging fusion model include collecting 10,000 contract scan images of different qualities, manually annotating the correct content and position of each character, applying augmentations such as blurring, noise, and uneven lighting to the contract scan images to form multiple samples, and using an open-source optical character recognition engine to generate text containing recognition errors as input samples.

[0012] The semantic bridging fusion model training steps include initializing the dictionary matrix as a random orthogonal matrix, setting the number of unfolded network layers to 8, training using an end-to-end supervised learning method, using character recognition cross-entropy loss and sparsity constraint regularization term as loss functions, and using an adaptive moment estimation optimization algorithm to update network parameters.

[0013] The weight coefficients of the attention mechanism in the cross-modal fusion layer are determined based on the confidence of characters in the initial text with confidence labels, the image clarity score of the corresponding region in the enhanced contract image, and the semantic importance of key elements in the structured data. The calculation method is to normalize the three parameters to the interval between 0 and 1 and then perform a weighted sum.

[0014] The semantic normalization process includes establishing a thesaurus in the contract domain, which contains the correspondence between Party A and the client, and between liquidated damages and compensation. The process involves matching the terms in the structured data with the thesaurus and replacing the matched synonyms with predefined standard terms.

[0015] The steps for establishing the semantic vector sequence of the clauses are as follows: the structured data is divided according to the boundaries of the contract clauses, each clause is treated as an independent text segment, and a pre-trained sentence embedding model is used to map each clause into a 768-dimensional semantic vector.

[0016] The solution process of the fluid dynamics diffusion equation involves modeling the semantic vector sequence of the contract text as a one-dimensional discrete space, with each clause occupying a spatial node. The semantic concentration of keywords diffuses from the source node to adjacent nodes, and the fluid dynamics diffusion equation is discretized using the finite difference method.

[0017] The concentration gradient matrix of the steady-state distribution field is generated by performing a difference operation on the concentration values of adjacent clause nodes in the steady-state distribution field, calculating the concentration change rate to form a gradient vector, and arranging the gradient vectors of all clauses in order to form the concentration gradient matrix.

[0018] The dimension of the text fingerprint is the product of the number of clauses and the number of keywords. Principal component analysis is performed on the concentration gradient matrix to reduce the dimension to 256, retaining the principal components with a cumulative variance contribution rate of 95%. The feature vector after dimension reduction is used as the final text fingerprint.

[0019] The preset threshold is determined based on the fingerprint similarity distribution statistics of historical contract samples. Multiple scanned versions of 100 unaltered contracts are collected, and the fingerprint similarity between different versions of the same contract is calculated. The mean of the fingerprint similarity distribution minus twice the standard deviation is taken as the preset threshold.

[0020] The method for marking the location of the difference clause is to calculate the semantic concentration difference between the text fingerprint of the contract to be tested and the text fingerprint of the standard contract at each clause node. Clause nodes with an absolute difference value exceeding 0.15 are marked as suspected tampering locations, and the original text content corresponding to the suspected tampering locations is extracted for detailed comparison.

[0021] This invention proposes a contract tampering detection method based on a semantic bridging fusion model. By unifying OCR recognition and semantic understanding within a sparse coding framework, it achieves deep fusion of image features and text semantics. During the recognition stage, semantic constraints are introduced to correct low-confidence characters based on context. The semantic bridging fusion model employs dictionary learning to automatically discover typical character-image alignment patterns. The deep unfolded network preserves the interpretability of iterative optimization while supporting end-to-end learning. The local coding characteristics of sparse representation allow damaged areas to be reconstructed selectively, preventing recognition errors from propagating globally. The hydrodynamic diffusion equation elevates the text fingerprint from a hash mapping to a semantic concentration field simulation. Semantic associations between clauses are constrained by the diffusion process, making the fingerprint robust to differences in wording that leave the meaning unchanged while remaining sensitive to semantic breaks caused by actual tampering. In summary, by combining semantic-level recognition error correction with physical model-driven fingerprint generation, this invention solves the technical problem mentioned in the background art, namely the high false positive rate caused by OCR recognition errors in contract tampering detection.

Attached Figure Description

[0022] Figure 1 is a flowchart of the method of the present invention.

[0023] Figure 2 is a diagram illustrating the convergence process of the semantic concentration diffusion iteration.

Detailed Implementation

[0024] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below.

[0025] Figure 1 is a flowchart of the AI contract anti-tampering method based on OCR and a large language model provided by this invention. As shown in Figure 1, the method includes the following steps:

[0026] S1. Obtain a scanned image or PDF document of the contract to be inspected, and perform color space conversion and resolution normalization on the scanned image to form a standardized contract image.

[0027] S2. A multi-scale adaptive enhancement algorithm is used to perform local illumination compensation and edge sharpening on the standardized contract image. The character details in the blurred areas are restored by super-resolution reconstruction technology to form an enhanced contract image.

[0028] S3. Perform optical character recognition on the enhanced contract image to extract the text content and character position information, and record the recognition confidence of each character to form an initial text with confidence labels.

[0029] S4. Input the initial text with confidence labels and the enhanced contract image into the semantic bridging fusion model. The semantic bridging fusion model outputs the corrected text content and structured data of key contract elements.

[0030] S5. Perform semantic normalization on the key elements in the structured data, replace synonyms with standard terms, establish a semantic vector sequence of clauses, and calculate the concentration distribution parameters of each clause in the semantic space.

[0031] S6. Using the keywords in the semantic vector sequence of the clauses as diffusion sources, calculate the steady-state distribution field of semantic concentration in the contract text structure based on the fluid dynamics diffusion equation, and use the concentration gradient matrix of the steady-state distribution field as the text fingerprint.

[0032] S7. Perform cosine similarity calculation between the generated text fingerprint and the pre-stored standard contract text fingerprint. When the similarity is lower than a preset threshold, mark the location of the difference clause and generate a tampering warning message.

[0033] The processing steps of the multi-scale adaptive enhancement algorithm include: decomposing the standardized contract image into sub-image layers of different scales, independently calculating the local contrast distribution for each sub-image layer, dynamically compressing the uneven illumination region according to Retinex theory, fusing the enhancement results of each scale through the Laplacian pyramid, and preserving the high-frequency detail information of the seal and text edges.

[0034] The super-resolution reconstruction technology is implemented as follows: identifying regions in the standardized contract image where the degree of blur exceeds a threshold, extracting the texture features and gradient information of those regions, reconstructing high-resolution image patches through sub-pixel displacement estimation and frequency domain interpolation, and using non-local means filtering to suppress noise introduced during the reconstruction process.

[0035] The initial text with confidence level annotation is generated as follows: the optical character recognition engine outputs a probability distribution vector for each recognized character, selects the character with the highest probability as the recognition result, uses the probability value as the confidence level annotation, and marks the character as a low-confidence character when the confidence level is lower than 0.85. The coordinate range of each low-confidence character in the enhanced contract image is recorded.
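A minimal sketch of this annotation step is given below. The dictionary-based representation of the per-character probability distribution and the names `annotate_characters` and `sample` are assumptions made for illustration; only the highest-probability selection and the 0.85 threshold come from the text.

```python
LOW_CONFIDENCE_THRESHOLD = 0.85  # threshold stated in the text


def annotate_characters(char_distributions):
    """Pick the most probable candidate per character and flag low confidence.

    char_distributions: list of dicts mapping candidate character -> probability
    (a hypothetical stand-in for the OCR engine's per-character output).
    """
    annotated = []
    for position, distribution in enumerate(char_distributions):
        best_char = max(distribution, key=distribution.get)
        confidence = distribution[best_char]
        annotated.append({
            "position": position,
            "char": best_char,
            "confidence": confidence,
            "low_confidence": confidence < LOW_CONFIDENCE_THRESHOLD,
        })
    return annotated


# Hypothetical distributions for two characters of an amount field.
sample = [{"5": 0.97, "S": 0.03}, {"0": 0.55, "O": 0.45}]
annotated = annotate_characters(sample)
low_confidence_positions = [a["position"] for a in annotated if a["low_confidence"]]
```

In this example only the second character falls below the threshold, so it is the one that would be passed on for context-based correction.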

[0036] The semantic bridging fusion model is structured as follows: it includes an image encoding branch, a text encoding branch, and a cross-modal fusion layer. The image encoding branch uses a convolutional neural network to extract visual features from the enhanced contract image. The text encoding branch uses a bidirectional long short-term memory network to encode the initial text with confidence labels. The cross-modal fusion layer aligns visual features with text features through an attention mechanism and performs weighted fusion of visual features in low-confidence character regions, and the output layer generates the corrected text content and structured data.
The semantic bridging fusion model combines sparse-coding-based dictionary learning with a deep unfolded network. Specifically, the iterative shrinkage-thresholding algorithm is unfolded into a multi-layer feedforward network structure, with each layer corresponding to one iteration. Sparse representations of text and image features are achieved by learning a dictionary matrix and shrinkage threshold parameters. Each atom of the dictionary matrix corresponds to a typical character-image alignment pattern. During forward propagation, the sparse coding coefficients are updated layer by layer, and finally the sparse coding coefficients are mapped to the corrected text content.
The steps for establishing the training dataset for the semantic bridging fusion model include: collecting 10,000 contract scan images of different qualities, manually annotating the correct content and position of each character, applying augmentations such as blurring, noise, and uneven lighting to the contract scan images to form multiple samples, using an open-source optical character recognition engine to generate text containing recognition errors as input samples, and constructing a triplet dataset of image, misrecognized text, and correct text.
The steps for training the semantic bridging fusion model include: initializing the dictionary matrix as a random orthogonal matrix, setting the number of unfolded network layers to 8, training using an end-to-end supervised learning approach, using a loss function that includes character recognition cross-entropy loss and a sparsity constraint regularization term, using an adaptive moment estimation optimization algorithm to update network parameters, dynamically adjusting the shrinkage threshold parameter during training to balance reconstruction accuracy and sparsity, and stopping training when the character accuracy on the validation set has not improved for 5 consecutive rounds.

[0037] The text features refer to the sequence of hidden state vectors output by the text encoding branch. Each hidden state vector contains the semantic information, contextual dependencies, and syntactic structure features of the corresponding character. The bidirectional long short-term memory network encodes the text sequence in two directions: forward and backward. Forward encoding captures the contextual information to the left of the character, and backward encoding captures the contextual information to the right of the character. The hidden state vectors in the two directions are concatenated to form the final text feature vector. The text feature vector is aligned and fused with visual features in the cross-modal fusion layer to achieve a joint representation of image information and text semantics.

[0038] The combination of sparse coding-based dictionary learning and deep unfolded networks brings the following technical benefits to the semantic bridging fusion model: Traditional optical character recognition systems separate image recognition from text understanding, resulting in recognition errors that cannot be corrected by subsequent semantic analysis. The semantic bridging fusion model, however, establishes a direct mapping between image features and text semantics through sparse coding, automatically discovers typical patterns in the data through dictionary learning, and retains the interpretability of the optimization algorithm while introducing end-to-end learning capabilities, allowing the model to incorporate semantic constraints during the recognition stage. When contract images are obscured by seals or have paper wrinkles, traditional methods produce continuous character misrecognition. The semantic bridging fusion model, however, utilizes the local coding characteristics of sparse representation to selectively reconstruct only damaged areas, restoring complete semantics through the combination of dictionary atoms, while undamaged areas retain their original features, avoiding global error propagation. The iterative structure of the deep unfolded network simulates the repeated confirmation process of human reading. The first few layers quickly identify high-confidence characters, and subsequent layers refine the judgment based on contextual semantics for difficult characters. The sparsity constraint regularization term forces the model to activate only a small number of the most relevant dictionary atoms, reducing computational complexity while improving generalization ability. The fusion mechanism fundamentally eliminates the representation gap between optical character recognition and large language models, transforming discrete character-level outputs into a continuous semantic embedding space. 
This provides high-quality input for subsequent contract element extraction and text fingerprint generation, significantly reducing the false alarm rate caused by recognition errors and improving the overall reliability of the contract anti-tampering system.

[0039] The weight coefficients of the attention mechanism in the cross-modal fusion layer are determined based on the confidence of characters in the initial text with confidence labels, the image sharpness score of the corresponding region in the enhanced contract image, and the semantic importance of key elements in the structured data. The calculation method is to normalize the three parameters to the interval between 0 and 1 and then perform a weighted summation with weights of 0.4, 0.3, and 0.3, respectively. The weighted summation result is converted into an attention distribution through the Softmax function.
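The weighting described above can be sketched as follows. The text does not say how the three parameters are normalized to the interval from 0 to 1, so min-max normalization is an assumption here, as are the function names and the sample scores; the weights 0.4, 0.3, and 0.3 and the Softmax step come from the text.

```python
import math

# Weights for confidence, sharpness, and semantic importance (from the text).
WEIGHTS = (0.4, 0.3, 0.3)


def min_max(values):
    """Scale values to [0, 1]; min-max scaling is an assumed choice."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_distribution(confidences, sharpness, importance):
    """Weighted sum of the three normalized parameters, then softmax."""
    features = [min_max(confidences), min_max(sharpness), min_max(importance)]
    scores = [
        sum(w * feature[i] for w, feature in zip(WEIGHTS, features))
        for i in range(len(confidences))
    ]
    return softmax(scores)


# Hypothetical scores for three character regions.
attention = attention_distribution(
    confidences=[0.90, 0.50, 0.99],
    sharpness=[10.0, 40.0, 30.0],
    importance=[1.0, 3.0, 2.0],
)
```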

[0040] The semantic normalization process includes: establishing a thesaurus in the contract domain, which contains the correspondence between Party A and the client, and between liquidated damages and compensation; performing graph matching on the terms in the structured data; replacing the matched synonyms with predefined standard terms; and keeping the original text of unmatched terms.
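A minimal sketch of the replacement step, assuming a simple synonym-to-standard-term dictionary. The text names the pairs (Party A and the client, liquidated damages and compensation) but does not say which side is the standard term, so the direction chosen below is an assumption, and `THESAURUS` and `normalize_terms` are hypothetical names. Plain whole-word matching is used here in place of the matching procedure of the method.

```python
import re

# Hypothetical thesaurus: synonym -> standard term. Which side of each pair
# is treated as "standard" is an assumption made for illustration.
THESAURUS = {
    "the client": "Party A",
    "compensation": "liquidated damages",
}


def normalize_terms(text, thesaurus=THESAURUS):
    """Replace matched synonyms with standard terms; unmatched text is kept."""
    # Longest synonyms first so multi-word entries win over shorter overlaps.
    for synonym in sorted(thesaurus, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(synonym) + r"\b", re.IGNORECASE)
        text = pattern.sub(thesaurus[synonym], text)
    return text


normalized = normalize_terms("The client shall pay compensation within 30 days.")
```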

[0041] The steps for establishing the semantic vector sequence of the clauses are as follows: the structured data is divided according to the boundaries of the contract clauses, each clause is treated as an independent text segment, and a pre-trained sentence embedding model is used to map each clause into a 768-dimensional semantic vector. The semantic vector retains the core semantic information of the clauses and eliminates expression differences.

[0042] The concentration distribution parameters include an initial concentration value and a diffusion coefficient. The initial concentration value is determined based on the product of the keyword frequency and the inverse document frequency in the clause. The diffusion coefficient is determined based on the semantic importance score of the keyword, which is obtained by analyzing the frequency and position of the keyword in the structured data.
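The initial concentration value can be sketched as a term frequency multiplied by an inverse document frequency, treating each clause as a document. The exact frequency and smoothing definitions are not given in the text, so the variant below (relative term frequency and a smoothed logarithmic inverse document frequency) is an assumption, as are the function name and the sample clauses.

```python
import math


def initial_concentration(keyword, clause_tokens, all_clauses):
    """Keyword frequency in the clause times inverse document frequency."""
    if not clause_tokens:
        return 0.0
    term_frequency = clause_tokens.count(keyword) / len(clause_tokens)
    containing = sum(1 for clause in all_clauses if keyword in clause)
    # Smoothed IDF (assumed variant) so the value stays finite and positive.
    inverse_doc_frequency = math.log(len(all_clauses) / (1 + containing)) + 1.0
    return term_frequency * inverse_doc_frequency


# Hypothetical tokenized clauses.
clauses = [
    ["payment", "due", "payment"],
    ["delivery", "date"],
    ["payment", "penalty"],
]
c0 = initial_concentration("payment", clauses[0], clauses)
c1 = initial_concentration("payment", clauses[1], clauses)
```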

[0043] The solution process of the fluid dynamics diffusion equation is as follows: the semantic vector sequence of the contract text is modeled as a one-dimensional discrete space, each clause occupies a spatial node, the semantic concentration of the keyword diffuses from the source node to the adjacent node, the diffusion rate is controlled by the diffusion coefficient, the fluid dynamics diffusion equation is discretized using the finite difference method, and iterative calculation is performed until the concentration field reaches a steady state. The numerical distribution of the steady-state concentration field is the semantic concentration of each clause.

[0044] The concentration gradient matrix of the steady-state distribution field is generated by performing a difference operation on the concentration values of adjacent clause nodes in the steady-state distribution field, calculating the concentration change rate to form a gradient vector, and arranging the gradient vectors of all clauses in order to form a concentration gradient matrix. The concentration gradient matrix reflects the semantic flow characteristics and structural boundary constraints of the contract text.
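The difference operation can be sketched as below. The layout `field[k][i]` (keyword k, clause i) and the treatment of the last clause, which reuses the backward difference so that every clause gets one gradient vector, are assumptions made for illustration.

```python
def concentration_gradient_matrix(field):
    """Build one gradient vector per clause from adjacent-node differences.

    field[k][i] is the steady-state concentration of keyword k at clause i.
    Returns a matrix with one row per clause and one column per keyword.
    """
    n_clauses = len(field[0])
    if n_clauses < 2:
        raise ValueError("at least two clause nodes are required")
    matrix = []
    for i in range(n_clauses):
        # Forward difference; the last clause reuses the backward difference.
        j = i if i < n_clauses - 1 else i - 1
        matrix.append([row[j + 1] - row[j] for row in field])
    return matrix


# Hypothetical steady-state field: 2 keywords over 4 clauses.
field = [
    [0.0, 0.5, 1.0, 0.5],
    [1.0, 0.75, 0.5, 0.25],
]
gradient = concentration_gradient_matrix(field)
```

With this layout the matrix has as many entries as the product of the number of clauses and the number of keywords, which matches the fingerprint dimension stated in the text.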

[0045] The dimension of the text fingerprint is the product of the number of clauses and the number of keywords. Principal component analysis is performed on the concentration gradient matrix to reduce the dimension to 256, and the principal components with a cumulative variance contribution rate of 95% are retained. The dimensionality-reduced feature vector is used as the final text fingerprint.

[0046] The preset threshold is determined based on the fingerprint similarity distribution statistics of historical contract samples. Multiple scanned versions of 100 unaltered contracts are collected, and the fingerprint similarity between different versions of the same contract is calculated. The mean of the fingerprint similarity distribution minus twice the standard deviation is taken as the preset threshold, and the preset threshold ranges from 0.82 to 0.88.
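The comparison and threshold rule can be sketched as follows. The similarity values are invented for illustration, and the use of the sample standard deviation is an assumption, since the text does not say which estimator is used.

```python
import math
import statistics


def cosine_similarity(a, b):
    """Cosine of the angle between two fingerprint vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def preset_threshold(similarities):
    """Mean minus twice the (sample) standard deviation."""
    return statistics.mean(similarities) - 2 * statistics.stdev(similarities)


# Hypothetical similarities between rescans of the same unaltered contracts.
historical_similarities = [0.90, 0.94, 0.98, 0.92, 0.96]
threshold = preset_threshold(historical_similarities)

# A fingerprint is flagged when its similarity to the stored one is too low.
stored = [1.0, 0.0, 1.0]
candidate = [1.0, 0.0, -1.0]
is_flagged = cosine_similarity(stored, candidate) < threshold
```

For these invented values the threshold comes out at about 0.877, inside the 0.82 to 0.88 range given in the text.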

[0047] The method for marking the location of the difference clause is as follows: calculate the semantic concentration difference between the text fingerprint of the contract to be tested and the text fingerprint of the standard contract at each clause node. Clause nodes with an absolute difference value exceeding 0.15 are marked as suspected tampering locations. The original text content corresponding to the suspected tampering location is extracted and compared in detail.
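A minimal sketch of the marking rule, assuming the per-clause semantic concentrations of both contracts are available as aligned lists (one value per clause node). The function name and sample values are hypothetical; the 0.15 limit comes from the text.

```python
DIFFERENCE_LIMIT = 0.15  # limit stated in the text


def suspected_clauses(test_concentrations, standard_concentrations):
    """Return indices of clause nodes whose absolute difference exceeds the limit."""
    return [
        index
        for index, (tested, standard) in enumerate(
            zip(test_concentrations, standard_concentrations)
        )
        if abs(tested - standard) > DIFFERENCE_LIMIT
    ]


# Hypothetical per-clause concentrations.
standard = [0.20, 0.50, 0.90, 0.40]
tested = [0.21, 0.50, 0.60, 0.38]
flagged = suspected_clauses(tested, standard)
```

Small deviations such as 0.01 or 0.02, of the kind that residual recognition noise might produce, stay below the limit, while the third clause is flagged for detailed comparison.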

[0048] The method for generating the tampering warning information is as follows: for the marked difference clause positions, extract the clause text of the contract to be detected and the standard contract at the difference clause positions, use a pre-trained language model to calculate the semantic similarity and keyword differences of the two texts, and generate a text description containing the change type and change magnitude. The change type includes amount change, date change, change of responsible party, and clause deletion.

[0049] The sparse coding is a signal representation method that represents the input signal as a linear combination of a small number of basis vectors in a predefined dictionary. The sparse coefficient vector is obtained by solving an optimization problem with sparsity constraints. The sparsity constraint regularization term is the L1 norm of the sparse coefficient vector, which forces most coefficients to be zero and retains only the key components.

[0050] The iterative shrinkage-thresholding algorithm is a classic algorithm for solving sparse optimization problems. It gradually approaches the optimal solution by alternating gradient descent and soft threshold shrinkage operations. Each iteration includes three steps: calculating the gradient at the current solution, updating the solution along the negative gradient direction, and applying soft threshold shrinkage to the updated solution. Soft threshold shrinkage sets the components whose absolute value is less than the threshold to zero and shrinks the remaining components toward zero by the threshold.
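The three steps of one iteration can be sketched in plain Python for the objective 0.5·||Dx − y||² + λ·||x||₁. The dictionary `D`, signal `y`, and parameter values below are toy values chosen for illustration; the step size must not exceed the reciprocal of the largest eigenvalue of DᵀD, which equals 1 for the identity dictionary used here. Eight iterations are run to mirror the eight unfolded layers mentioned in the text, but unlike the unfolded network nothing here is learned.

```python
def soft_threshold(value, threshold):
    """Zero out small values and shrink the rest toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def ista(D, y, lam, step, n_iter):
    """Iterative shrinkage-thresholding for 0.5*||D x - y||^2 + lam*||x||_1."""
    n_rows, n_atoms = len(D), len(D[0])
    x = [0.0] * n_atoms
    for _ in range(n_iter):
        # Step 1: gradient of the quadratic term, D^T (D x - y).
        residual = [
            sum(D[r][j] * x[j] for j in range(n_atoms)) - y[r]
            for r in range(n_rows)
        ]
        gradient = [
            sum(D[r][j] * residual[r] for r in range(n_rows))
            for j in range(n_atoms)
        ]
        # Steps 2 and 3: gradient step, then soft threshold shrinkage.
        x = [
            soft_threshold(x[j] - step * gradient[j], step * lam)
            for j in range(n_atoms)
        ]
    return x


# Toy problem: identity dictionary, one strong and one weak component.
D = [[1.0, 0.0], [0.0, 1.0]]
y = [1.0, 0.05]
coefficients = ista(D, y, lam=0.1, step=1.0, n_iter=8)
```

The weak component is driven exactly to zero and the strong one is shrunk by the threshold, which is the sparsity behavior the paragraph describes.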

[0051] The deep unfolded network maps each iteration of the iterative optimization algorithm to a layer of the neural network. Fixed parameters in the algorithm, such as step size and threshold, become learnable parameters of the network. The optimal parameter values are automatically learned from the data through the backpropagation algorithm. The number of network layers corresponds to the number of iterations, and the forward propagation process is equivalent to the iterative solution process of the algorithm.

[0052] The dictionary matrix is a set of basis vectors in sparse coding. Each column of the matrix is called an atom, representing a basic pattern. The input signal is represented by the product of the dictionary matrix and the sparse coefficient vector. Dictionary learning is the process of automatically learning the optimal dictionary matrix from the training data, so as to minimize the sum of the reconstruction error and sparsity cost of all training samples.

[0053] The fluid dynamics diffusion equation is based on Fick's second law and describes the variation of semantic concentration with time and space. In the one-dimensional discretized term space, the rate of change of concentration with respect to time is equal to the product of the diffusion coefficient and the second-order partial derivative of concentration with respect to space. Under steady-state conditions, the time derivative is zero, and the equation simplifies to a second-order ordinary differential equation. The spatial distribution function of concentration is obtained by solving the boundary conditions.
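Written out, with c(x, t) denoting the semantic concentration at position x and time t and D the diffusion coefficient (symbols chosen here for illustration), the relationship described above is:

```latex
\frac{\partial c(x,t)}{\partial t} = D\,\frac{\partial^{2} c(x,t)}{\partial x^{2}},
\qquad
\text{steady state: } \frac{\partial c}{\partial t} = 0
\;\Longrightarrow\;
D\,\frac{d^{2} c}{d x^{2}} = 0
```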

[0054] The finite difference method discretizes the continuous spatial domain into a finite number of nodes, uses the concentration values at the nodes to approximate the continuous field, and approximates the second spatial derivative from the concentration values of three adjacent nodes to construct a system of linear equations for iterative solution. Each iteration updates the concentration values of all nodes until the maximum concentration change between two adjacent iterations is less than the convergence threshold of 0.001.

[0055] Principal component analysis is a linear dimensionality reduction technique that calculates the eigenvalues and eigenvectors of the data covariance matrix, selects several eigenvectors with the largest eigenvalues to form a projection matrix, projects high-dimensional data into a low-dimensional subspace, preserves the main variance information of the data, and reduces storage and computation costs.

[0056] The Retinex theory is a color constancy calculation model that decomposes an image into a reflection component and an illumination component. The reflection component represents the inherent properties of an object's surface that are unaffected by illumination, while the illumination component represents the spatial distribution of ambient light. Uneven illumination is compensated by estimating and removing the illumination component.

[0057] The Laplacian pyramid is a multi-scale image representation method. It constructs a Gaussian pyramid by downsampling the original image multiple times. The Laplacian pyramid is obtained by subtracting Gaussian images from adjacent levels. Each level of the Laplacian pyramid contains edge and detail information at different scales. During image fusion, weighted combinations are performed independently at each level.

[0058] The nonlocal mean filtering is an image denoising algorithm that searches for all pixels in the image that have similar neighborhoods to the target pixel, calculates weights based on neighborhood similarity, and then performs a weighted average of the pixel values to obtain the denoising result. The algorithm effectively suppresses noise while preserving texture details by utilizing the self-similarity in the image.

[0059] The Softmax function maps any real vector to a probability distribution. The function calculates an exponential function for each component of the input vector, and then normalizes it by dividing by the sum of the exponential values of all components. All components of the output vector are positive and their sum is 1. This function is often used for probability output in multi-class classification problems.

[0060] The cosine similarity is a measure of the cosine of the angle between two vectors. It is calculated by dividing the inner product of the two vectors by the product of their magnitudes. The value ranges from -1 to 1. The closer the value is to 1, the more similar the vectors are in direction. The measure is not sensitive to the magnitude of the vectors and only focuses on the direction information.
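The calculation can be illustrated with a short, self-contained Python function. The name `cosine_similarity` and the choice to reject zero vectors are assumptions made for this sketch.

```python
import math


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their magnitudes."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # The angle is undefined for a zero vector (handling is an assumption).
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

Scaling either vector leaves the result unchanged, which is the magnitude insensitivity noted above.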

[0061] This invention elevates the generation of text fingerprints from simple hash mapping to a physical process simulation through a fluid dynamics diffusion equation. The diffusion of semantic concentration is constrained by the text structure boundaries, meaning that adjusting the order of clauses or inserting / deleting content significantly alters the steady-state distribution field, enabling sensitive detection of covert tampering. The semantic bridging fusion model eliminates the representational gap between optical character recognition and large language models, establishing a direct mapping between visual features and semantic representations through sparse coding, avoiding the accumulation of errors in subsequent processing. The multi-scale adaptive enhancement algorithm addresses the uneven quality of contract scan images, suppressing noise while preserving text edges, providing high-quality input for optical character recognition. The semantic normalization process resolves the fingerprint sensitivity issue caused by synonym rewriting, unifying terminology through domain knowledge graphs to reduce false positive alarm rates. The overall solution ensures detection accuracy while supporting pure intranet deployment, meeting enterprise data security requirements.

[0062] This invention also provides an AI contract anti-tampering system based on OCR and a large language model. The system is implemented by a computer equipped with a readable storage medium that stores program instructions. When the program instructions are run on the computer, they execute the above-mentioned AI contract anti-tampering method based on OCR and a large language model.

[0063] The specific implementation methods of the above steps are described in detail below.

[0064] The specific implementation of step S1 is as follows: First, the scanned image file or PDF document of the contract to be detected is read, and the input file format is determined. If it is a PDF format, a PDF parsing library is called to convert each page into a bitmap image; if it is a scanned image, the pixel matrix is read directly. Next, the read image is converted to a grayscale color space using a weighted average method with weight coefficients of 0.299, 0.587, and 0.114. This conversion eliminates interference from color information, facilitating subsequent text recognition. Then, the grayscale image is normalized. The resolution parameter of the input image is detected. When the resolution is below 300 dpi, a bicubic interpolation algorithm is used to enlarge the image to 300 dpi; when the resolution is above 300 dpi, a region averaging method is used to reduce the image to 300 dpi. This normalization process unifies the image quality standard to avoid resolution differences affecting recognition accuracy. Finally, the processed image data is stored as a standardized contract image. The pixel depth of the standardized contract image is 8 bits, and its size is automatically adjusted according to the original page ratio to maintain the aspect ratio. The purpose of these steps is to convert contract documents from different sources and in different formats into a unified image representation, thereby establishing a standardized input for subsequent image enhancement and text recognition.
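The grayscale weighting and the 300 dpi decision rule in step S1 can be sketched as below. This is an illustrative outline only: the function names are hypothetical, the weights are assumed to apply to the R, G, and B channels in that order, and the resampling itself is not implemented (the function only reports which method and scale factor would be used).

```python
TARGET_DPI = 300


def to_gray(r, g, b):
    """Weighted average of R, G, B with the weights 0.299, 0.587, 0.114."""
    value = round(0.299 * r + 0.587 * g + 0.114 * b)
    return max(0, min(255, value))  # keep the result within 8-bit range


def plan_resample(dpi, target=TARGET_DPI):
    """Return (method, scale) needed to bring an image to the target dpi."""
    if dpi < target:
        return ("bicubic", target / dpi)   # enlarge low-resolution scans
    if dpi > target:
        return ("area", target / dpi)      # shrink by region averaging
    return ("none", 1.0)
```

Because a single scale factor is applied to both axes, the aspect ratio of the original page is preserved.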

[0065] The specific implementation of step S2 is as follows: First, the standardized contract image is decomposed into multiple scales. A Gaussian filter is used to downsample the image. The filter kernel size is 5×5, and the standard deviation parameter is set to 1.0. This downsampling is performed three times to obtain four sub-image layers at different scales. The resolutions of these sub-image layers are 100%, 50%, 25%, and 12.5% of the original image, respectively. Next, the local contrast distribution is calculated independently for each sub-image layer using a sliding window method. The window size is 15×15 pixels, and the standard deviation of the pixels within the window is calculated as the contrast index. Regions with a standard deviation less than 10 are identified as low-contrast regions. Then, illumination compensation is performed on the low-contrast regions based on Retinex color constancy theory. The image is decomposed into reflection and illumination components. The illumination component is estimated using the center/surround Retinex algorithm. The surround function is a Gaussian function, and the scale parameter is set to 80 pixels. The logarithm of the illumination component is subtracted from the logarithm of the original image to obtain the illumination-compensated reflection image. Simultaneously, blurriness detection is performed on the standardized contract image, and the Laplacian variance of the image is calculated. Regions with a variance less than 100 are identified as blurry areas. Gradient information and texture features are extracted from these blurry areas, and subpixel displacement estimation is used to analyze the displacement relationship between adjacent pixels. Frequency-domain interpolation is used for high-frequency compensation to reconstruct high-resolution image patches. The reconstruction process uses a Wiener filter to balance signal recovery and noise suppression. 
Then, the multi-scale processing results are fused using a Laplacian pyramid fusion algorithm. The Laplacian image of each scale layer is calculated, and weighted fusion is performed independently at each layer. The weights are determined by the local variance; the larger the variance, the higher the weight. After fusion, a complete image is reconstructed. Finally, nonlocal mean filtering is applied to the fused image for noise suppression. The search radius is set to 21 pixels, the similarity window size is 7×7 pixels, and the filter strength parameter is set to 10. This filtering preserves image texture while removing noise, outputting an enhanced contract image. The purpose of these steps is to improve the overall clarity of contract images with uneven scanning quality, enhance the readability of low-quality areas, and provide high-quality input images for optical character recognition.
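The low-contrast test used in step S2 (local standard deviation below 10 within a sliding window) can be illustrated with a small pure-Python sketch. The function names are hypothetical, the image is assumed to be a list of rows of gray values, and windows are simply clipped at the image border, which is an assumption the source does not address.

```python
import math


def local_std(image, row, col, half):
    """Standard deviation of the pixels in a square window clipped to the image."""
    rows, cols = len(image), len(image[0])
    values = [
        image[r][c]
        for r in range(max(0, row - half), min(rows, row + half + 1))
        for c in range(max(0, col - half), min(cols, col + half + 1))
    ]
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))


def low_contrast_mask(image, window=15, threshold=10.0):
    """True where the local standard deviation falls below the threshold."""
    half = window // 2
    return [
        [local_std(image, r, c, half) < threshold for c in range(len(image[0]))]
        for r in range(len(image))
    ]
```

A production version would normally compute the windowed statistics with an image library for speed; the loop form here is only meant to make the criterion explicit.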

[0066] The specific implementation of step S3 is as follows: First, the enhanced contract image is input into an optical character recognition engine. The optical character recognition engine adopts a deep convolutional neural network architecture, which includes a feature extraction layer, a sequence modeling layer, and a decoding layer. The feature extraction layer performs convolution operations on the image to extract the stroke features and structural features of the characters. The convolution kernel size is 3×3, and the feature maps have 64, 128, and 256 channels. The sequence modeling layer adopts a bidirectional long short-term memory network to encode the feature map into sequence features along the horizontal direction, capturing the contextual relationships between characters. The hidden layer dimension is 512. The decoding layer uses the connectionist temporal classification (CTC) algorithm to output the character probability distribution at each time step. The probability distribution covers all possible characters and the blank symbol. Then, the character with the highest probability is selected from the probability distribution as the recognition result. The bounding box coordinates of the character in the image are recorded, including the horizontal and vertical coordinates of the upper left and lower right corners. At the same time, the probability value of the character is recorded as a confidence label. Then, all recognized characters are traversed, and it is determined whether the confidence level is lower than 0.85. Characters with a confidence level lower than 0.85 are marked as low-confidence characters, and the coordinate range and confidence level value of the low-confidence characters are recorded to construct a low-confidence character list. 
Finally, the recognized text content, character position information, confidence level labels, and low-confidence character list are combined to form an initial text with confidence level annotations. The initial text with confidence level annotations is stored in JSON format and includes text fields, position fields, confidence level fields, and low-confidence label fields. The purpose of these steps is to extract text information from the image and annotate the recognition reliability, providing basic data and confidence level guidance for subsequent semantic error correction.
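A minimal sketch of how the confidence-annotated initial text might be assembled is shown below. The source only states that the record is JSON with text, position, confidence, and low-confidence fields; the exact key names (`text`, `positions`, `confidences`, `low_confidence`) and the input layout are assumptions made for illustration.

```python
import json

LOW_CONFIDENCE_LIMIT = 0.85  # characters below this value are flagged


def annotate(characters):
    """Build the annotated record from per-character OCR results.

    `characters` is a list of dicts with hypothetical keys:
    'char', 'box' ([x1, y1, x2, y2]) and 'conf' (probability of the character).
    """
    low = [
        {"index": i, "box": list(c["box"]), "confidence": c["conf"]}
        for i, c in enumerate(characters)
        if c["conf"] < LOW_CONFIDENCE_LIMIT
    ]
    return {
        "text": "".join(c["char"] for c in characters),
        "positions": [list(c["box"]) for c in characters],
        "confidences": [c["conf"] for c in characters],
        "low_confidence": low,
    }


def to_json(record):
    """Serialize the record; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(record, ensure_ascii=False)
```

A character whose confidence equals 0.85 exactly is not flagged, matching the "lower than 0.85" wording of the step.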

[0067] The specific implementation of step S4 is as follows: First, the initial text with confidence labels and the enhanced contract image are simultaneously input into the semantic bridging fusion model. The image encoding branch of the semantic bridging fusion model processes the enhanced contract image, using a ResNet convolutional neural network to extract visual features. The network depth is 50 layers, and the output feature map has a dimension of 2048×H×W, where H and W are the height and width of the feature map. The text encoding branch encodes the initial text with confidence labels, using a bidirectional long short-term memory network to map the text sequence into a hidden state sequence. Each character corresponds to a hidden state vector with a vector dimension of 512. The cross-modal fusion layer aligns visual and textual features through an attention mechanism. It calculates a similarity matrix between the text latent state and image features, using dot product calculations for similarity. Attention weight coefficients are calculated based on the confidence of characters in the initial text with confidence annotations, the image sharpness score of the corresponding region in the enhanced contract image, and the semantic importance of key elements in the structured data. These three parameters are normalized to the 0-1 range and then weighted and summed with weights of 0.4, 0.3, and 0.3 respectively. The weighted summation result is converted into an attention distribution using a Softmax normalization function. This attention distribution guides the weighted aggregation of visual features. For low-confidence character regions, visual features at the corresponding positions are extracted for focused analysis, and joint inference is performed by combining the character's contextual text information and visual texture information. The semantic bridging fusion model is based on sparse coding principles. 
It unfolds the iterative shrinkage-thresholding algorithm into an 8-layer feedforward network. Each layer performs a gradient update and a soft-threshold shrinkage operation. A sparse linear representation of text and image features is achieved through a learned dictionary matrix. Each column vector of the dictionary matrix corresponds to a character-image alignment pattern. During the network's forward propagation, the sparse coding coefficients are updated layer by layer. A sparsity constraint regularization term causes most coefficients to approach zero, retaining only the most relevant pattern activations. The output layer maps the sparse coding coefficients to the corrected text content through a fully connected layer. Simultaneously, it extracts key elements from the contract, including the names of the contracting parties, contract amount, signing date, performance period, and breach of contract clauses. These key elements are organized into structured data and stored as key-value pairs, where the key is the element type and the value is the specific content. The purpose of these steps is to fuse visual and textual information to correct optical character recognition errors, extract the structured key elements of the contract, and eliminate the representational gap between image recognition and semantic understanding.
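The attention weight calculation in step S4 (three normalized parameters, weights 0.4, 0.3, and 0.3, followed by Softmax) can be sketched as follows. The source does not say how the parameters are normalized to the 0-1 range; min-max scaling is assumed here, and all function names are hypothetical.

```python
import math


def min_max(values):
    """Scale values to the 0-1 range (min-max scaling is an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # no spread: treat all entries alike
    return [(v - lo) / (hi - lo) for v in values]


def softmax(scores):
    """Numerically stable Softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(confidence, sharpness, importance):
    """Combine the three per-character parameters with weights 0.4/0.3/0.3."""
    c, s, k = min_max(confidence), min_max(sharpness), min_max(importance)
    scores = [0.4 * a + 0.3 * b + 0.3 * d for a, b, d in zip(c, s, k)]
    return softmax(scores)
```

The returned list is a probability distribution over characters, which can then weight the aggregation of the corresponding visual features.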

[0068] The specific implementation of step S5 is as follows: First, a thesaurus of terms in the contract domain is established. The thesaurus adopts a directed graph structure, where nodes represent terms and edges represent synonym relationships. The graph includes correspondences such as Party A and the client, Party B and the agent, liquidated damages and compensation, termination and rescission, etc. Each pair of synonyms points to a unified standard term node. Next, all terms in the structured data are traversed, and matching nodes are searched in the thesaurus. A strategy combining exact matching and fuzzy matching is adopted. Exact matching requires that the terms be completely identical, while fuzzy matching allows an edit distance of no more than 2 characters. For successfully matched synonyms, the standard term they point to is extracted, and the synonyms in the structured data are replaced with the standard term. Unmatched terms remain unchanged. Then, the structured data is segmented according to the boundaries of contract clauses. The clause boundaries are identified by periods, semicolons, and clause numbers. Each clause is treated as an independent text segment, and a clause list is obtained by counting all clauses. Each clause is encoded using a pre-trained sentence embedding model. This model employs the Sentence-BERT architecture, trained on a Siamese network structure, and maps variable-length text to fixed-dimensional semantic vectors (768 dimensions). Semantic similarity is measured using cosine distance; clauses with similar semantics are closer together in the vector space. The semantic vectors of all clauses are arranged in clause order to form a clause semantic vector sequence, preserving the semantic content and sequential relationship between clauses. Next, the concentration distribution parameter for each clause is calculated. First, each clause is segmented, extracting nouns, verbs, and key entities as keywords. 
The frequency of each keyword in the clause is counted as the term frequency (TF). The distribution of keywords across all clauses is calculated to obtain the inverse document frequency (IDF). Multiplying the TF by the IDF yields an initial concentration value, reflecting the local importance and global uniqueness of the keywords. Then, the frequency and position of keywords in the structured data are analyzed. Words located in key elements such as the contracting party, amount, and date are assigned higher semantic importance scores. These semantic importance scores are normalized to the range of 0.1 to 1.0 and used as diffusion coefficients to control the propagation rate of keyword semantic concentration. The purpose of these steps is to eliminate the differences in synonym expression, establish a unified semantic representation, and provide normalized semantic vectors and concentration parameters for text fingerprint generation.
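The concentration parameters described at the end of step S5 can be sketched as below. The source does not give the exact inverse-document-frequency formula or the normalization method, so the variant `log(N / df) + 1` and min-max scaling into [0.1, 1.0] are assumptions, as are the function names.

```python
import math
from collections import Counter


def initial_concentrations(clauses):
    """Per-clause keyword concentration = term frequency x inverse document frequency.

    `clauses` is a list of keyword lists, one list per clause.
    """
    n = len(clauses)
    doc_freq = Counter(kw for keywords in clauses for kw in set(keywords))
    result = []
    for keywords in clauses:
        tf = Counter(keywords)  # raw count of each keyword in this clause
        result.append(
            {kw: count * (math.log(n / doc_freq[kw]) + 1.0) for kw, count in tf.items()}
        )
    return result


def diffusion_coefficients(importance_scores, low=0.1, high=1.0):
    """Map semantic importance scores linearly into the range [low, high]."""
    lo, hi = min(importance_scores), max(importance_scores)
    if hi == lo:
        return [high] * len(importance_scores)  # assumption for the degenerate case
    return [low + (high - low) * (s - lo) / (hi - lo) for s in importance_scores]
```

A keyword that occurs often in one clause but rarely elsewhere receives a high initial concentration, while a keyword present in every clause is weighted by its raw count only.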

[0069] The specific implementation of step S6 is as follows: First, the semantic vector sequence of clauses is modeled as a one-dimensional discrete space. If the number of clauses is N, the space contains N nodes, numbered 1 to N according to the clause order, with each node corresponding to one clause. Next, keywords from the clauses are extracted as diffusion sources. Each keyword releases semantic concentration at its corresponding clause node. The initial distribution of the semantic concentration is determined by the initial concentration value; the concentration of the node containing the keyword is set to the initial concentration value, while the initial concentration of other nodes is zero. Then, a fluid dynamics diffusion equation is established to describe the spatiotemporal evolution of concentration. This diffusion equation is based on Fick's second law and states that the rate of change of concentration with respect to time is equal to the product of the diffusion coefficient and the second derivative of concentration with respect to space. This equation characterizes the diffusion process of concentration from a high-concentration region to a low-concentration region. The diffusion equation is solved using the finite difference method in the one-dimensional discrete space. The spatial second derivative is approximated as a difference combination of the concentration values of three adjacent nodes. Specifically, the second derivative at the i-th node is equal to the concentration of the (i+1)-th node minus twice the concentration of the i-th node plus the concentration of the (i-1)-th node, divided by the square of the spatial step size. The time derivative is approximated using forward difference. 
Boundary conditions are set: the first and Nth nodes use zero-flux boundaries, meaning the concentration gradient at the boundary is zero, reflecting that the beginning and end of the text do not exchange semantic information with the outside world. Initially, the concentrations of each node are set, and time iterations are performed. In each iteration, the concentration values of all nodes are updated according to the difference equation, with a time step of 0.01 and a spatial step of 1.0. Iteration continues until the concentration field reaches a steady state. The criterion for judging steady state is that the maximum change in concentration between two consecutive iterations is less than the convergence threshold of 0.001. In steady state, the concentration field no longer changes with time, the time derivative is zero, and the diffusion equation simplifies to a second-order ordinary differential equation. The numerical distribution of the steady-state concentration field is the final semantic concentration of each clause. Next, a concentration gradient matrix is generated by performing a difference operation on the steady-state distribution field. The concentration difference between adjacent clause nodes is calculated and divided by the spatial step size to obtain the concentration gradient. The concentration gradient between the i-th clause and the (i+1)-th clause is the difference in concentration between the two nodes. The concentration gradients of all adjacent clauses are arranged in order to form a concentration gradient vector. The above diffusion process is repeated for all keywords, generating a concentration gradient vector for each keyword. The concentration gradient vectors of all keywords are stacked to form a concentration gradient matrix. The number of rows in the concentration gradient matrix is the number of clauses minus 1, and the number of columns is the number of keywords. 
Then, the dimensionality of the concentration gradient matrix is reduced. The covariance matrix of the concentration gradient matrix is calculated, and the eigenvalues and eigenvectors of the covariance matrix are solved. The eigenvalues are sorted from largest to smallest, and the leading eigenvectors whose cumulative variance contribution rate reaches 95% are selected. The number of eigenvectors is generally 256. A projection matrix is constructed to project the concentration gradient matrix onto the low-dimensional subspace spanned by the eigenvectors, resulting in a 256-dimensional feature vector as the text fingerprint. The purpose of the steps described above is to model semantic propagation through a physical diffusion process, generating fingerprint features that are sensitive to text structure. These text fingerprints are highly sensitive to changes in semantic content but insensitive to format differences.
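The finite-difference iteration for a single keyword can be sketched as follows, using the stated time step of 0.01, spatial step of 1.0, and convergence threshold of 0.001. The function names are hypothetical, and the zero-flux boundary is implemented here by mirroring the boundary node into a ghost node, which is one common choice and an assumption of this sketch.

```python
def diffuse_to_steady_state(initial, diffusion_coef, dt=0.01, dx=1.0,
                            tol=0.001, max_iter=100_000):
    """Explicit finite-difference diffusion on a 1-D chain of clause nodes."""
    c = [float(v) for v in initial]
    n = len(c)
    r = diffusion_coef * dt / (dx * dx)
    for _ in range(max_iter):
        updated = []
        for i in range(n):
            # Zero-flux boundaries: the ghost node equals the boundary node.
            left = c[i - 1] if i > 0 else c[i]
            right = c[i + 1] if i < n - 1 else c[i]
            # Forward difference in time, three-node stencil in space.
            updated.append(c[i] + r * (right - 2.0 * c[i] + left))
        change = max(abs(a - b) for a, b in zip(updated, c))
        c = updated
        if change < tol:  # stopping rule: maximum change below the threshold
            break
    return c


def concentration_gradient(field, dx=1.0):
    """Difference between adjacent nodes divided by the spatial step."""
    return [(field[i + 1] - field[i]) / dx for i in range(len(field) - 1)]
```

In this sketch the stopping rule halts the iteration once the per-step change drops below 0.001, so the returned profile still carries a non-zero gradient around the source clause; with a constant coefficient and zero-flux boundaries, iterating indefinitely would flatten the field. Repeating the call for every keyword and stacking the resulting gradient vectors gives the (N-1)-row gradient matrix described above.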

[0070] The specific implementation of step S7 is as follows: First, the pre-stored standard contract text fingerprint is read from the database. The standard contract text fingerprint is generated and stored using the same method when the contract is signed, and its dimension is 256. Next, the cosine similarity between the text fingerprint of the contract to be detected and the standard contract text fingerprint is calculated. The calculation method is the inner product of the two vectors divided by the product of the magnitudes of the two vectors. The cosine similarity value ranges from -1 to +1. The closer the value is to +1, the more consistent the directions of the two text fingerprints are and the more similar their semantic content. Then, the calculated cosine similarity is compared with a preset threshold. The preset threshold is determined based on the statistics of historical contract samples. 100 untampered contracts are collected, and each contract is scanned multiple times. Each scan generates a text fingerprint, and the fingerprint similarity between different scanned versions of the same contract is calculated. The mean and standard deviation of the similarity are calculated. The preset threshold is set as the mean minus twice the standard deviation, and the value range is generally 0.82 to 0.88. In this embodiment, the preset threshold is 0.85. When the cosine similarity is below the preset threshold, the contract content is determined to have been tampered with, and the difference localization process begins. When the cosine similarity is higher than or equal to the preset threshold, the contract is determined not to have been tampered with, and verification pass information is output. For contracts determined to have been tampered with, the differing clauses are then located: the semantic concentration difference between the contract to be tested and the standard contract at each clause node is calculated. 
The semantic concentration is derived from a steady-state distribution field, and clause nodes with an absolute difference value exceeding 0.15 are marked as suspected tampering locations. The threshold of 0.15 is determined by analyzing the concentration fluctuation range caused by differences in normal format. The original clause text corresponding to the suspected tampering locations is extracted, and the clause content at the same location is obtained from the contract to be tested and the standard contract for detailed comparison. A pre-trained language model is used to calculate the semantic similarity between the two clause texts. The pre-trained language model adopts the BERT architecture, takes two texts as input, and outputs a similarity score between 0 and 1. At the same time, keywords in the two texts are extracted for difference analysis to identify added words, deleted words, and replaced words. The type of change is determined based on the type of discrepancies. If the discrepancy is a number in a context related to amount, it is determined to be a change in amount; if the discrepancy is a date, it is determined to be a change in date; if the discrepancy is a person's or organization's name, it is determined to be a change in the responsible party; if the entire clause is missing from one party's text, it is determined to be a deletion of the clause. The number and semantic weight of the discrepancies are counted, and a change severity score is calculated, ranging from 0 to 100, with higher scores indicating greater changes. The location of the discrepancy clause, the comparison of clause content, the change type, and the change severity are combined to generate a tampering warning message. The tampering warning message adopts a structured format and includes the contract number, detection timestamp, similarity value, list of suspected tampering locations, original text comparison for each location, and change description fields. 
The purpose of these steps is to quickly determine whether the contract has been tampered with through fingerprint comparison, accurately locate the tampering location and the changed content, and provide decision-makers with detailed warning information.
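The decision logic of step S7 can be outlined as below: threshold calibration as mean minus twice the standard deviation, the pass or tamper decision, and clause localization with the 0.15 limit. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which estimator is used.

```python
import statistics


def calibrate_threshold(similarities):
    """Mean minus twice the standard deviation of same-contract similarities."""
    return statistics.mean(similarities) - 2.0 * statistics.pstdev(similarities)


def verify(similarity, threshold=0.85):
    """Below the threshold means tampered; equal or above means passed."""
    return "tampered" if similarity < threshold else "passed"


def locate_differences(test_field, standard_field, limit=0.15):
    """Indices of clause nodes whose concentration differs by more than `limit`."""
    return [
        i
        for i, (a, b) in enumerate(zip(test_field, standard_field))
        if abs(a - b) > limit
    ]
```

For example, similarities of 0.90, 0.92, and 0.94 between rescans of the same contract give a calibrated threshold of about 0.887 under these assumptions.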

[0071] The key technical ideas of this invention include: First, a semantic bridging fusion model is used to achieve deep integration of optical character recognition and semantic understanding. A direct mapping between image features and text semantics is established through sparse coding, and the iterative optimization algorithm is expanded into a learnable neural network layer, integrating semantic constraints into the character recognition process. Traditional methods separate image recognition and text understanding, and recognition errors are misinterpreted as true semantics in subsequent semantic analysis and cannot be corrected. The fusion model utilizes the local coding characteristics of sparse representation to perform joint reasoning by combining contextual semantics and visual information during the recognition stage. It selectively reconstructs local damage caused by seal obstruction or paper wrinkles, preserving the original features of undamaged areas, avoiding global error propagation, fundamentally eliminating the representation gap between the recognition module and the understanding module, and significantly reducing errors in contract element extraction and subsequent fingerprint generation bias caused by character misrecognition. Second, a fluid dynamics diffusion equation is introduced to model the text fingerprint generation process, treating keywords as semantic concentration diffusion sources. The steady-state concentration field distribution is obtained by solving partial differential equations constrained by the text structure boundaries. Traditional hash fingerprinting methods are extremely sensitive to literal changes in text. Synonyms or format adjustments can produce completely different fingerprint values, leading to a high false alarm rate. In contrast, fingerprint generation methods based on diffusion equations utilize the continuity of physical processes and spatial constraints. 
The propagation rate of semantic concentration is determined by the importance of keywords, and the diffusion process is affected by the order of clauses and structural boundaries. This means that changes in clause content can significantly alter the gradient distribution of the steady-state concentration field, while pure format differences or synonym substitutions only cause minor local concentration adjustments without affecting the overall gradient pattern. This achieves high sensitivity to semantic tampering and robustness to format changes. Thirdly, multi-scale adaptive enhancement algorithms and super-resolution reconstruction techniques are used to improve the recognizability of low-quality images. To address the issues of uneven illumination and blurring during the scanning process, contrast enhancement and detail restoration are performed independently at multiple scales, and texture reconstruction is performed using the self-similarity of images. Traditional enhancement methods use global parameters to uniformly process the entire image, which is poorly adaptable to local quality differences and easily introduces noise amplification or over-smoothing. In contrast, the multi-scale method dynamically adjusts the enhancement intensity according to the local contrast distribution, keeping high-quality areas unchanged and focusing on enhancing low-quality areas. Super-resolution reconstruction compensates for high-frequency components in the frequency domain for blurred areas, and combines non-local mean filtering to suppress noise. This improves clarity while maintaining sharp text edges, providing high-quality input for subsequent character recognition and reducing the decrease in recognition confidence caused by uneven image quality.

[0072] It should be noted that the synergistic effect of the three key technical approaches is manifested in the following ways: After the multi-scale adaptive enhancement algorithm improves image quality, the semantic bridging fusion model performs recognition and understanding based on clearer visual input. The model's attention mechanism can more accurately align textual features with visual features. Sparse coding dictionary learning extracts more representative character-image alignment patterns from high-quality samples, reducing the model's sensitivity to noise and improving the accuracy of key element extraction. The high-quality structured data output by the semantic bridging fusion model provides a reliable foundation for subsequent semantic normalization and fingerprint generation, reducing synonym matching failures and keyword extraction omissions caused by recognition errors. The hydrodynamic diffusion equation is calculated based on accurate semantic vector sequences and concentration parameters. The improved input data quality makes the steady-state concentration field distribution more realistically reflect the semantic structure of the contract, and enhances the discriminative power of the concentration gradient matrix. The three technical approaches form a complete chain from image preprocessing, feature extraction, semantic understanding to fingerprint generation. Improvements in each link provide better input for downstream tasks, and the accumulation of errors in multiple links is effectively suppressed. The robustness and accuracy of the overall system are synergistically improved, enabling sensitive detection of contract tampering and accurate differentiation of normal differences.

[0073] It should be noted that this invention also solves the following technical problem: in existing technologies, uneven image quality in scanned contracts lowers the OCR recognition rate. Traditional image preprocessing methods apply global contrast enhancement and sharpening filtering, which over-enhance uniformly illuminated areas and introduce noise, while under-enhancing unevenly illuminated areas and leaving characters blurred. This invention decomposes the image into sub-image layers of different scales using a multi-scale adaptive enhancement algorithm. For each scale layer, the local contrast distribution is calculated independently, and dynamic range compression is performed based on Retinex theory. The enhancement results from each scale are fused using a Laplacian pyramid to balance detail preservation against noise suppression. Super-resolution reconstruction technology is used to restore character details in blurred areas, providing high-quality input for OCR recognition.

[0074] Furthermore, this invention addresses the technical problem of text fingerprints being overly sensitive to synonym rewriting in existing technologies. Traditional hash fingerprinting methods generate completely different hash values for any character change in the text. Common synonym substitutions in contracts, such as changing "Party A" to "Client" or "penalty for breach of contract" to "compensation," trigger tampering alarms, leading to numerous false positives. This invention performs semantic normalization on structured data by establishing a synonym graph in the contract domain, uniformly replacing synonyms with standard terms to eliminate expression differences. Simultaneously, the text fingerprint generated based on the hydrodynamic diffusion equation represents the core meaning of the clauses in the semantic space. Semantically equivalent expressions produce similar concentration field distributions during the diffusion process, making the fingerprint tolerant of legitimate synonym rewriting while remaining sensitive to substantive content changes.

[0075] Specifically, the principle of this invention is as follows: The core of this invention lies in moving the correction of OCR recognition errors forward to the recognition stage, achieving collaborative reasoning between visual features and textual semantics through a semantic bridging fusion model. Traditional methods separate image recognition from text understanding, passing recognition errors as deterministic inputs to subsequent modules. This invention, however, inputs the confidence-labeled initial recognition result and the enhanced image into the fusion model. It utilizes sparse coding to establish a direct mapping relationship between local image features and character semantics. The dictionary learning process automatically learns typical character-image alignment patterns as basis vectors. The deep unfolding network iteratively refines the sparse representation coefficients through multiple layers, performing weighted fusion correction on low-confidence character regions in conjunction with the visual features of the surrounding context. The hydrodynamic diffusion equation models the contract text as a semantic concentration field. Keywords act as diffusion sources, propagating semantic information to adjacent clauses. The diffusion process is constrained by both clause boundaries and the strength of semantic association. The concentration gradient matrix of the steady-state distribution field encodes the semantic flow characteristics of the text. This physical model makes fingerprints tolerant of synonym rewriting and recognition noise because semantically equivalent expressions produce similar concentration field distributions during diffusion, while semantic breaks introduced by actual tampering significantly alter the diffusion path, leading to structural differences in the concentration field. The multi-scale adaptive enhancement algorithm improves image quality at the source, reducing the probability of recognition errors and forming a dual guarantee together with the semantic error correction mechanism.

[0076] The following provides a specific embodiment 1 of the present invention. The specific implementation of step S1 in this embodiment 1 is the same as that described above, and the specific implementation of other steps is described in detail below.

[0077] The specific implementation of step S2 involves performing multi-scale adaptive enhancement processing on the standardized contract image. The multi-scale adaptive enhancement algorithm is based on Retinex theory and decomposes the standardized contract image $I(x,y)$ into a reflection component $R(x,y)$ and an illumination component $L(x,y)$. The decomposition formula is expressed as follows:

[0078] $$I(x,y) = R(x,y)\cdot L(x,y)$$;

[0079] In the formula, $I(x,y)$ is the pixel value of the standardized contract image at coordinates $(x,y)$, in gray levels; $x$ is the horizontal coordinate, in pixels; $y$ is the vertical coordinate, in pixels; $R(x,y)$ is the reflection component, which represents an inherent property of the object's surface and is dimensionless; $L(x,y)$ is the illumination component, which represents the spatial distribution of ambient light, in gray levels. The standardized contract image is decomposed into $K$ sub-image layers of different scales, where the number of scale layers $K$ is empirically 4. The $k$-th layer image $I_k(x,y)$, in gray levels, is obtained by Gaussian filtering and downsampling; the empirical value of the Gaussian filter scale parameter is [value missing] pixels; $k$ is the scale layer index, with a value of [value missing]. The local contrast distribution is calculated independently for each sub-image layer. The formula is expressed as follows:

[0080] [formula missing in the source];

[0081] In the formula, the quantities are as follows: the local contrast of the $k$-th sub-image at coordinates $(x,y)$, dimensionless; the grayscale unit value, equal to 1 gray level; the local neighborhood mean, in gray levels, obtained by averaging within the neighborhood window, whose default size is [value missing] pixels; the local neighborhood standard deviation, in gray levels, obtained by calculating the standard deviation within the neighborhood window; and a dimensionless constant that prevents division by zero, defaulting to 0.01. Dynamic range compression is then applied to unevenly illuminated regions based on Retinex theory, yielding an enhanced sub-image. The formula is expressed as follows:

[0082] [formula missing in the source];

[0083] In the formula, the quantities are as follows: the $k$-th enhanced sub-image, in gray levels; the estimated illumination component, in gray levels, obtained by large-scale Gaussian filtering with the filter scale parameter set to 15 pixels by default; the reference light intensity, in gray levels, defaulting to 128; and the dynamic range compression factor, dimensionless, with an empirical value of 0.6. The enhancement results from the various scales are fused using a Laplacian pyramid. The calculation formula for the $k$-th layer of the Laplacian pyramid is expressed as follows:

[0084] $$P_k = E_k - U(E_{k+1})$$;

[0085] In the formula, $P_k$ is the $k$-th layer of the Laplacian pyramid, in gray levels; $E_k$ is the $k$-th enhanced sub-image, in gray levels; $U(\cdot)$ is the upsampling operation, which interpolates a low-resolution image to a higher resolution; $E_{k+1}$ is the $(k+1)$-th enhanced sub-image, in gray levels. The final enhanced image is obtained by weighted reconstruction of the layers of the Laplacian pyramid. The formula is expressed as follows:

[0086] [formula missing in the source];

[0087] In the formula, the quantities are as follows: the enhanced contract image, in gray levels; and the weight coefficient of the $k$-th layer, dimensionless, which is determined adaptively from the local contrast of each layer by the formula [formula missing in the source], in which the layer-average contrast of the $k$-th layer is dimensionless. Super-resolution reconstruction technology identifies regions whose blur exceeds a threshold; blur is evaluated using the gradient variance index. The formula is expressed as follows:

[0088] [formula missing in the source];

[0089] In the formula, the quantities are as follows: the gradient variance index at coordinates $(x,y)$, dimensionless; the neighborhood window centered at $(x,y)$, whose default size is [size missing] pixels; the horizontal and vertical gradient operators; the pixel value at coordinates $(x,y)$, in gray levels, where $x$ is the horizontal coordinate and $y$ is the vertical coordinate, both in pixels; and a dimensionless constant that prevents division by zero, defaulting to 0.01. Areas whose index falls below a threshold of 0.15 are marked as blurred regions. High-resolution image patches are reconstructed from the blurred regions using sub-pixel displacement estimation and frequency-domain interpolation. After reconstruction, non-local means filtering is used to suppress noise.

[0090] The specific implementation of step S3 involves performing optical character recognition on the enhanced contract image. The optical character recognition engine outputs a probability distribution vector $\mathbf{p}_i = (p_{i,1}, p_{i,2}, \ldots, p_{i,M})$ for each recognized character, where $i$ is the character index; $p_{i,j}$ is the probability that the $i$-th character is recognized as the $j$-th candidate character, dimensionless; $j$ is the candidate character index, with a value of $1, 2, \ldots, M$; and $M$ is the size of the candidate character set. The character with the highest probability is selected as the recognition result $\hat{c}_i$ of the $i$-th character. The corresponding confidence level is expressed as follows:

[0091] $$s_i = \max_{1 \le j \le M} p_{i,j}$$;

[0092] In the formula, $s_i$ is the recognition confidence of the $i$-th character, dimensionless, ranging from 0 to 1. Characters with a confidence below 0.85 are marked as low-confidence characters, and the coordinate range $(x_i^{\min}, x_i^{\max}, y_i^{\min}, y_i^{\max})$ of each low-confidence character in the enhanced contract image is recorded, where $x_i^{\min}$ and $x_i^{\max}$ are the minimum and maximum horizontal coordinates of the $i$-th character region, in pixels, and $y_i^{\min}$ and $y_i^{\max}$ are the minimum and maximum vertical coordinates of the $i$-th character region, in pixels.
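The selection rule of step S3 can be illustrated with a minimal Python sketch. It assumes that the recognition engine exposes one probability vector per character position; the function name `recognize` and the three-symbol candidate set are hypothetical and serve only as an illustration. The 0.85 threshold is the one stated above.

```python
LOW_CONFIDENCE_THRESHOLD = 0.85  # threshold stated in the description


def recognize(prob_vectors, candidates):
    """Pick the most probable candidate for each character position.

    prob_vectors: one probability distribution per character position.
    candidates:   the candidate character set, in the same order.
    Returns a list of (character, confidence, is_low_confidence) tuples.
    """
    results = []
    for probs in prob_vectors:
        # Index of the highest probability; that probability is the confidence.
        best = max(range(len(probs)), key=lambda j: probs[j])
        confidence = probs[best]
        results.append(
            (candidates[best], confidence, confidence < LOW_CONFIDENCE_THRESHOLD)
        )
    return results


# Hypothetical input: two character positions over a three-symbol candidate set.
example = recognize([[0.02, 0.97, 0.01], [0.55, 0.40, 0.05]], ["0", "8", "B"])
```

In this illustration the second position would be flagged as low-confidence and passed, together with its bounding box, to the fusion model of step S4.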

[0093] The specific implementation of step S4 involves inputting the initial text with confidence labels and the enhanced contract image into the semantic bridging fusion model. The attention weight coefficients of the cross-modal fusion layer are expressed as follows:

[0094] [formula missing in the source];

[0095] In the formula, the quantities are as follows: the attention weight coefficient of the $i$-th character, dimensionless; the minimum and maximum confidence over all characters, dimensionless; the image sharpness score of the region corresponding to the $i$-th character, dimensionless, obtained by calculating the edge intensity of that region with the Sobel operator; the minimum and maximum sharpness scores over all regions, dimensionless; the semantic importance of the $i$-th character, dimensionless, determined from the character's position and context in the contract text; the minimum and maximum semantic importance over all characters, dimensionless; and three dimensionless constants, each defaulting to 0.001, that prevent division by zero during confidence normalization, sharpness normalization, and semantic importance normalization, respectively. The softmax function maps a real-valued vector to a probability distribution, $\operatorname{softmax}(z)_i = \exp(z_i) / \sum_{j=1}^{N} \exp(z_j)$, where $z_i$ is the $i$-th element of the input vector, dimensionless; $j$ is the summation index; and $N$ is the total number of characters.
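Two building blocks named in this paragraph can be sketched in Python: min-max normalization with a small constant guarding against division by zero, and the softmax mapping. The exact way the three normalized scores are combined is not recoverable from the text, so it is deliberately left out. The form `(v - min) / (max - min + eps)` is an assumption inferred from the listed minimum, maximum and 0.001 constants, and both function names are hypothetical.

```python
import math


def min_max(values, eps=0.001):
    """Min-max normalization; eps prevents division by zero (assumed form)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo + eps) for v in values]


def softmax(scores):
    """Map a real-valued vector to a probability distribution."""
    shift = max(scores)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical per-character confidences turned into normalized scores,
# then into weights that sum to 1.
weights = softmax(min_max([0.97, 0.55, 0.91]))
```

Subtracting the maximum before exponentiation leaves the result unchanged mathematically and is a common safeguard for long character sequences.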

[0096] The specific implementation of step S5 involves semantic normalization of key elements in the structured data. Semantic normalization is achieved by establishing a thesaurus of synonyms in the contract domain, performing graph matching on terms in the structured data, and replacing the matched synonyms with predefined standard terms. The clause semantic vector sequence is established with a pre-trained sentence embedding model, which maps each clause to a 768-dimensional semantic vector; the clause index takes the values [value missing], bounded by the total number of clauses. The concentration distribution parameters comprise an initial concentration value and a diffusion coefficient for each keyword; the keyword index takes the values [value missing]. The formula for the initial concentration value is expressed as follows:

[0097] [formula missing in the source];

[0098] In the formula, the quantities are as follows: the initial concentration value of the $k$-th keyword, dimensionless; the frequency of the $k$-th keyword in a clause, dimensionless, that is, the number of times the keyword appears in a single clause; the inverse document frequency of the $k$-th keyword, dimensionless, calculated by the formula [formula missing in the source], in which the number of clauses containing the $k$-th keyword is dimensionless; and a summation index. The formula for the diffusion coefficient is expressed as follows:

[0099] [formula missing in the source];

[0100] In the formula, the quantities are as follows: the diffusion coefficient of the $k$-th keyword, dimensionless; and the semantic importance score of the $k$-th keyword, dimensionless, obtained by analyzing the frequency and position of the keyword in the structured data.

[0101] The specific implementation of step S6 calculates the steady-state distribution field of semantic concentration in the contract text structure using the hydrodynamic diffusion equation. The equation, based on Fick's second law, is expressed as follows in the one-dimensional discretized clause space:

[0102] $$\frac{\partial C_{k,i}}{\partial t} = D_k \frac{\partial^2 C_{k,i}}{\partial x^2}$$;

[0103] In the formula, $C_{k,i}$ is the semantic concentration of the $k$-th keyword at the $i$-th clause node, dimensionless; $D_k$ is the diffusion coefficient of the $k$-th keyword; $t$ is the diffusion time; and $x$ is the position along the clause sequence. The diffusion equation is discretized using the finite difference method, and the difference approximation of the second-order spatial derivative is as follows:

[0104] $$\frac{\partial^2 C_{k,i}}{\partial x^2} \approx \frac{C_{k,i+1} - 2C_{k,i} + C_{k,i-1}}{(\Delta x)^2}$$;

[0105] In the formula, $\Delta x$ is the node spacing, dimensionless, defaulting to 1; $C_{k,i+1}$ is the semantic concentration of the $k$-th keyword at the $(i+1)$-th clause node, dimensionless; $C_{k,i-1}$ is the semantic concentration of the $k$-th keyword at the $(i-1)$-th clause node, dimensionless. Iterative calculations are performed until the concentration field reaches a steady state; the iterative formula is as follows:

[0106] $$C_{k,i}^{n+1} = C_{k,i}^{n} + \frac{D_k\,\Delta t}{(\Delta x)^2}\left(C_{k,i+1}^{n} - 2C_{k,i}^{n} + C_{k,i-1}^{n}\right)$$;

[0107] In the formula, $C_{k,i}^{n+1}$ is the concentration value of the $k$-th keyword at the $i$-th clause node after the $(n+1)$-th iteration, dimensionless; $n$ is the iteration index; $C_{k,i}^{n}$, $C_{k,i+1}^{n}$ and $C_{k,i-1}^{n}$ are the concentration values of the $k$-th keyword after the $n$-th iteration at the $i$-th, $(i+1)$-th and $(i-1)$-th clause nodes, respectively, dimensionless; $\Delta t$ is the time step, dimensionless, defaulting to 0.1; the iteration termination condition is [condition missing in the source]. The concentration gradient matrix $G$ of the steady-state distribution field is generated by performing a difference operation on the concentration values of adjacent clause nodes. The gradient calculation formula is expressed as follows:

[0108] $$G_{k,i} = C_{k,i+1}^{*} - C_{k,i}^{*}$$;

[0109] In the formula, $G_{k,i}$ is the concentration gradient of the $k$-th keyword at the $i$-th clause node, dimensionless; the range of values is [range missing in the source]; $C_{k,i+1}^{*}$ is the steady-state concentration value of the $k$-th keyword at the $(i+1)$-th clause node, dimensionless; $C_{k,i}^{*}$ is the steady-state concentration value of the $k$-th keyword at the $i$-th clause node, dimensionless. The concentration gradient matrix $G$ reflects the semantic flow characteristics and structural boundary constraints of the contract text. The initial dimension of the text fingerprint is [expression missing in the source]. Principal component analysis is performed on the concentration gradient matrix to reduce its dimensionality to 256, principal components with a cumulative variance contribution rate of 95% are retained, and the dimensionality-reduced feature vector is used as the final text fingerprint.
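The explicit finite-difference iteration and the adjacent-node difference of step S6 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the stopping rule (largest change between two iterations below `tol`) stands in for the termination condition that is missing from the text, the end nodes use the zero-flux boundary mentioned in the application example, and the five-clause, two-keyword input is hypothetical. The principal component analysis step is not covered.

```python
def diffuse_to_steady_state(initial, d_coeff, dt=0.1, dx=1.0, tol=1e-3, max_iter=10_000):
    """Explicit finite-difference diffusion of one keyword over the clause nodes."""
    r = d_coeff * dt / dx ** 2
    if r > 0.5:
        # Known stability limit of the explicit scheme in one dimension.
        raise ValueError("explicit scheme is unstable when D*dt/dx^2 > 0.5")
    field = list(initial)
    for iteration in range(1, max_iter + 1):
        # Mirror the end nodes so that no concentration leaves the clause
        # sequence (zero-flux boundary).
        padded = [field[0]] + field + [field[-1]]
        updated = [
            field[i] + r * (padded[i + 2] - 2.0 * padded[i + 1] + padded[i])
            for i in range(len(field))
        ]
        change = max(abs(new - old) for new, old in zip(updated, field))
        field = updated
        if change < tol:  # assumed stopping rule
            break
    return field, iteration


def gradient_matrix(steady_fields):
    """Differences of adjacent clause nodes: rows = node pairs, columns = keywords."""
    n_nodes = len(steady_fields[0])
    return [[f[i + 1] - f[i] for f in steady_fields] for i in range(n_nodes - 1)]


# Hypothetical contract with five clause nodes and two keywords.
sources = [[0.0, 0.0, 1.0, 0.0, 0.0], [0.8, 0.0, 0.0, 0.0, 0.0]]
coeffs = [0.5, 1.0]
steady = [diffuse_to_steady_state(c0, d)[0] for c0, d in zip(sources, coeffs)]
matrix = gradient_matrix(steady)
```

With a zero-flux boundary the total concentration of each keyword is conserved, so the point at which iteration stops determines how much gradient structure remains in the matrix; the tolerance is therefore a parameter that would need tuning in practice.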

[0110] The specific implementation of step S7 involves calculating the cosine similarity between the generated text fingerprint and the pre-stored standard contract text fingerprint. The cosine similarity is expressed as follows:

[0111] $$S = \frac{\mathbf{F}_t \cdot \mathbf{F}_s}{\lVert \mathbf{F}_t \rVert \, \lVert \mathbf{F}_s \rVert}$$;

[0112] In the formula, $S$ is the cosine similarity, dimensionless, ranging from -1 to 1; $\mathbf{F}_t$ is the text fingerprint of the contract to be tested, dimensionless; $\mathbf{F}_s$ is the text fingerprint of the standard contract, dimensionless; $\mathbf{F}_t \cdot \mathbf{F}_s$ is the inner product of the two text fingerprints, dimensionless, calculated as $\mathbf{F}_t \cdot \mathbf{F}_s = \sum_{d} f_{t,d}\, f_{s,d}$, where $f_{t,d}$ is the $d$-th element of the fingerprint of the contract to be tested and $f_{s,d}$ is the $d$-th element of the fingerprint of the standard contract, both dimensionless; $\lVert \mathbf{F}_t \rVert$ and $\lVert \mathbf{F}_s \rVert$ are the Euclidean norms of the two text fingerprints, dimensionless, calculated as $\lVert \mathbf{F} \rVert = \sqrt{\sum_{d} f_d^2}$, where $f_d$ is the $d$-th element of the text fingerprint, dimensionless. When $S$ is below the preset threshold, which takes a value from 0.82 to 0.88, the semantic concentration difference between the contract under test and the standard contract at each clause node is calculated. The formula is expressed as follows:

[0113] $$\Delta C_{k,i} = \left| C_{k,i}^{\mathrm{test}} - C_{k,i}^{\mathrm{std}} \right|$$;

[0114] In the formula, $\Delta C_{k,i}$ is the semantic concentration difference of the $k$-th keyword at the $i$-th clause node, dimensionless; $C_{k,i}^{\mathrm{test}}$ is the semantic concentration of the $k$-th keyword at the $i$-th clause node of the contract to be tested, dimensionless; $C_{k,i}^{\mathrm{std}}$ is the semantic concentration of the $k$-th keyword at the $i$-th clause node of the standard contract, dimensionless. If $\Delta C_{k,i}$ exceeds 0.15, the clause node is marked as a suspected tampering location. The original text content corresponding to the suspected tampering location is extracted for detailed comparison, and a tampering warning message is generated.
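The two-stage comparison of step S7 can be illustrated with the following Python sketch: a cosine similarity check on the fingerprints, followed by per-node localization only when the similarity falls below the threshold. The function names are hypothetical. The default of 0.85 is an illustrative value inside the stated 0.82 to 0.88 range, and the sketch assumes that both contracts have the same number of clause nodes and keywords.

```python
import math


def cosine_similarity(a, b):
    """Inner product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def suspected_nodes(test_fields, standard_fields, node_threshold=0.15):
    """Clause nodes where any keyword's concentration differs by more than the threshold.

    Each argument holds one list of per-node concentrations per keyword.
    """
    flagged = set()
    for test_row, std_row in zip(test_fields, standard_fields):
        for node, (t, s) in enumerate(zip(test_row, std_row)):
            if abs(t - s) > node_threshold:
                flagged.add(node)
    return sorted(flagged)


def compare(test_fp, standard_fp, test_fields, standard_fields, similarity_threshold=0.85):
    """Return (similarity, suspected clause nodes); the list is empty above the threshold."""
    similarity = cosine_similarity(test_fp, standard_fp)
    if similarity >= similarity_threshold:
        return similarity, []
    return similarity, suspected_nodes(test_fields, standard_fields)
```

Running localization only after the fingerprint check fails keeps the per-node comparison off the common path, since most contracts are expected to pass the similarity test.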

[0115] To better understand and implement this invention, a specific application scenario is provided below as Example 2: A digital contract management project requires tamper-proofing checks on 1500 construction subcontracting contracts. Due to historical reasons, the original contracts are paper documents, which have been converted to PDF format and archived using various office scanners with resolutions ranging from 200dpi to 600dpi. Some contracts have quality issues such as skewed scanning angles, uneven lighting, and overlapping text on stamps. The technical team needs to compare the current scanned version of each contract with the standard version archived at the time of signing to detect any tampering, paying particular attention to changes in monetary clauses, schedule clauses, and breach of contract clauses.

[0116] The technical team first preprocessed the 1500 contracts to be tested and their corresponding standard contracts, performing the standardization operation in step S1. The PDF documents were converted page by page into bitmap images, and a weighted average method was used to convert the RGB color space to grayscale, with weighting coefficients of 0.299, 0.587, and 0.114 for the red, green, and blue channels, respectively. Inspection revealed that 327 contracts had a scanning resolution lower than 300 dpi; these images were enlarged to 300 dpi using bicubic interpolation. A further 189 contracts had a scanning resolution higher than 300 dpi; these images were reduced to 300 dpi using a region averaging method. After processing, all standardized contract images had a unified resolution of 300 dpi, a pixel depth of 8 bits, and an average page size of 2480 × 3508 pixels, corresponding to an A4 paper size.
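The arithmetic behind this standardization step can be checked with a short Python sketch. The weighting coefficients and the 300 dpi target come from the paragraph above; the function names are hypothetical, and the A4 dimensions of 210 mm × 297 mm are the standard paper size rather than a value given in the text. Interpolation itself is not shown.

```python
MM_PER_INCH = 25.4


def to_gray(r, g, b):
    """Weighted-average grayscale value of one 8-bit RGB pixel."""
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def scale_factor(source_dpi, target_dpi=300):
    """Resampling factor needed to bring a scan to the target resolution."""
    return target_dpi / source_dpi


def page_pixels(width_mm, height_mm, dpi=300):
    """Pixel dimensions of a page of the given physical size."""
    return round(width_mm / MM_PER_INCH * dpi), round(height_mm / MM_PER_INCH * dpi)


a4_at_300_dpi = page_pixels(210, 297)  # A4 is 210 mm x 297 mm
```

A 200 dpi scan would be enlarged by a factor of 1.5 and a 600 dpi scan reduced by a factor of 0.5, and an A4 page at 300 dpi works out to the 2480 × 3508 pixels reported above.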

[0117] Next, step S2, image enhancement processing, is performed. Each standardized contract image is decomposed into multiple scales: a Gaussian filter with a standard deviation of 1.0 and a kernel size of 5×5 is applied for three consecutive downsampling iterations, resulting in four sub-image layers with resolutions of 100%, 50%, 25%, and 12.5% of the original image. Local contrast is calculated using a 15×15 pixel sliding window. Statistical analysis reveals that low-contrast regions account for an average of 18.7% of each contract page image, mainly distributed around the page edges and the seal. The Retinex center-surround algorithm is applied to these low-contrast regions, with the Gaussian scale parameter of the surround function set to 80 pixels. After illumination compensation, the regional contrast increases to an average of 2.3 times the original value. Simultaneously, an average of 6.2% of each page image is detected as blurred regions with a Laplacian variance below 100. Super-resolution reconstruction is performed on these blurred regions using a Wiener filter for frequency-domain interpolation, which improves edge sharpness in the reconstructed regions. The results of multi-scale processing are fused using the Laplacian pyramid, and finally a non-local means filter with a search radius of 21 pixels, a similarity window of 7×7 pixels, and a filter intensity of 10 is applied to suppress noise and output the enhanced contract image.

[0118] Then, step S3, optical character recognition, is performed. The enhanced contract image is input into a recognition engine based on a deep convolutional neural network architecture. The engine's feature extraction layer contains convolutional layers with 64, 128, and 256 channels, and the sequence modeling layer uses a bidirectional long short-term memory network with a hidden layer dimension of 512. During the recognition process, the confidence level of each character is recorded. The recognition results of 1500 contracts are analyzed. On average, each contract contains 8500 characters, with low-confidence characters (confidence levels below 0.85) accounting for 2.8%. These low-confidence characters mainly appear in areas obscured by seals, page edges, and near handwritten signatures. The character distribution across different confidence levels exhibits a right skewness, with characters having a confidence level above 0.95 accounting for 89.3%, and characters with a confidence level between 0.85 and 0.95 accounting for 7.9%. Among the low-confidence characters, 62% are numbers and punctuation marks, and 38% are Chinese characters.

[0119] In step S4, the initial text with confidence labels and the enhanced contract image are input into the semantic bridging fusion model. The image encoding branch of the model uses a 50-layer ResNet to extract 2048-dimensional visual features, while the bidirectional long short-term memory network of the text encoding branch outputs a 512-dimensional hidden state vector. The cross-modal fusion layer calculates attention weights based on character confidence, image sharpness score, and semantic importance, with weight coefficients of 0.4, 0.3, and 0.3 for the three parameters, respectively. The model implements sparse encoding based on an 8-layer deep unfolded network, with a dictionary matrix containing 1024 atoms, each corresponding to a character-image alignment mode. After processing, the model outputs corrected text. Comparison shows that an average of 3.1 low-confidence characters in each contract are successfully corrected, achieving a correction rate of 78.6%. Simultaneously, key contract elements are extracted, including the names of Party A and Party B, contract amount, signing date, performance period, and penalty ratio, which are organized into structured data. Table 1 shows the statistics of key elements extracted from 1500 contracts.

[0120] Table 1. Statistical Table of Key Contract Elements

[0121]

[0122] Step S5 performs semantic normalization, establishing a thesaurus containing 237 synonym pairs, covering common correspondences such as Party A and client, Party B and contractor, and liquidated damages and compensation. Traversing the structured data, 683 terms were successfully matched and replaced with standard terms, while 817 terms were not matched and remained in their original form. The structured data was segmented by clause boundaries, with an average of 32.6 clauses per contract. The Sentence-BERT model was used to encode each clause into a 768-dimensional semantic vector, forming a sequence of clause semantic vectors. Concentration distribution parameters were calculated, with an average of 5.4 keywords extracted per clause. The initial keyword concentration values ranged from 0.02 to 0.89, and the diffusion coefficient ranged from 0.1 to 1.0.

[0123] Step S6 establishes the hydrodynamic diffusion equation to solve the semantic concentration field, modeling the clause sequence as a one-dimensional discrete space, with an average of 32.6 spatial nodes per contract. Keywords act as diffusion sources, releasing semantic concentration at their respective nodes. The diffusion equation is discretized using the finite difference method, with a time step of 0.01 and a spatial step of 1.0, and zero flux as the boundary condition. As shown in Figure 2, the semantic concentration of a given contract is initially concentrated on the nodes containing keywords. As the iterations proceed, the concentration diffuses to adjacent nodes. After 186 iterations, the concentration field reaches a steady state, and the maximum concentration change between two adjacent iterations decreases to 0.0008. A concentration gradient matrix is generated by performing a difference operation on the steady-state distribution field. The number of rows in this matrix is equal to the number of clauses minus one, and the number of columns is equal to the number of keywords. The average dimension of the concentration gradient matrix for each contract is 31.6 × 5.4. Principal component analysis is performed on the concentration gradient matrix, selecting the top 256 principal components with a cumulative variance contribution rate of 95%. After dimensionality reduction, a 256-dimensional text fingerprint is obtained.

[0124] Step S7 performs fingerprint comparison, calculating the cosine similarity of the text fingerprints of the contract to be detected and the standard contract, with a preset threshold set to 0.85. The comparison results show that 1437 contracts have a similarity higher than 0.85, indicating they are not tampered with, while 63 contracts have a similarity lower than 0.85, indicating they are suspected of being tampered with. Difference localization is performed on these 63 suspected tampered contracts, calculating the semantic concentration difference between clause nodes. Nodes with an absolute difference exceeding 0.15 are marked as differing clause positions, resulting in 189 differing clause positions, averaging 3 differing positions per suspected tampered contract. The clause text at the differing positions is extracted for detailed comparison, and the BERT model is used to calculate semantic similarity and identify keyword differences. Analysis reveals that among the 189 differing positions, there are 67 changes in amount, 42 changes in date, 38 changes in responsible party, 24 deletions of clauses, and 18 other changes. Table 2 shows the statistical distribution of the difference types.

[0125] Table 2 Statistical Table of Contract Difference Types

[0126]

[0127] The technical team manually reviewed the 63 suspected tampered contracts, confirming that 59 of them had indeed been tampered with, while 4 were misjudged due to extremely poor scanning quality. The precision of the method on the flagged contracts was therefore 93.7% (59 of 63). Among the 59 confirmed tampered contracts, the altered amounts ranged from 8% to 340% of the original value; date alterations mostly involved delays of 30 to 180 days; alterations of the responsible party mainly involved adding or deleting joint liability parties; and clause deletions primarily involved quality assurance clauses and warranty period clauses. The generated tampering warning information included the contract number, detection time, similarity score, difference location, original text comparison, and change description. This warning information was pushed to administrators via the internal network interface, providing a basis for subsequent legal processing.

[0128] This invention represents a significant improvement over traditional manual word-by-word comparison or simple hash verification. Traditional manual comparison relies on the experience and attention of reviewers, making it inefficient and prone to overlooking concealed tampering when dealing with large volumes of contract documents. This invention, however, automates the batch processing of 1500 contracts, reducing the manual review workload to only 63 suspected tampering cases. Traditional hash verification methods are sensitive to any byte changes in a file; differences in scanning equipment, image compression parameters, or minor variations in PDF metadata can lead to completely different hash values, generating numerous false positives. This invention employs semantic-level fingerprint comparison. Based on the hydrodynamic diffusion equation, the text fingerprint is insensitive to format differences and synonymous rewriting, responding only to changes in semantic content, reducing false positives while improving the detection of concealed tampering. Traditional optical character recognition systems separate image recognition from semantic understanding; recognition errors can cascade and propagate, leading to a decrease in overall accuracy. This invention's semantic bridging fusion model establishes a direct mapping between visual features and text semantics through sparse coding, incorporating semantic constraints at the recognition stage. It utilizes contextual information to correct low-confidence characters, improving the accuracy of key element extraction. The proposed technical solution ensures accurate detection while achieving efficient automated processing, providing reliable technical support for the digitalization of contract management.

[0129] It should be noted that the variables involved in this invention are explained in detail in Tables 3 and 4.

[0130] Table 3. Variable Explanation Table (Part 1)

[0131]

[0132] Table 4. Variable Explanation Table (Part Two)

[0133]

[0134] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. An AI-based contract anti-tampering method based on OCR and a large language model, characterized in that it comprises the following steps:
S1. Obtain a scanned image or PDF document of the contract to be inspected, and perform color space conversion and resolution normalization on the scanned image to form a standardized contract image;
S2. Use a multi-scale adaptive enhancement algorithm to perform local illumination compensation and edge sharpening on the standardized contract image, and restore the character details in blurred areas by super-resolution reconstruction technology to form an enhanced contract image;
S3. Perform optical character recognition on the enhanced contract image to extract the text content and character position information, and record the recognition confidence of each character to form an initial text with confidence labels;
S4. Input the initial text with confidence labels and the enhanced contract image into the semantic bridging fusion model, which outputs the corrected text content and structured data of key contract elements;
S5. Perform semantic normalization on the key elements in the structured data, replace synonyms with standard terms, establish a semantic vector sequence of clauses, and calculate the concentration distribution parameters of each clause in the semantic space;
S6. Using the keywords in the semantic vector sequence of the clauses as diffusion sources, calculate the steady-state distribution field of semantic concentration in the contract text structure based on the fluid dynamics diffusion equation, and use the concentration gradient matrix of the steady-state distribution field as the text fingerprint;
S7. Perform cosine similarity calculation between the generated text fingerprint and a pre-stored standard contract text fingerprint, and when the similarity is lower than a preset threshold, mark the location of the differing clause and generate tampering warning information;
wherein the dimension of the text fingerprint is the product of the number of clauses and the number of keywords; principal component analysis is performed on the concentration gradient matrix to reduce the dimension to 256, retaining the principal components with a cumulative variance contribution rate of 95%; and the dimensionality-reduced feature vector is used as the final text fingerprint;
wherein the semantic bridging fusion model includes an image encoding branch, a text encoding branch, and a cross-modal fusion layer; the image encoding branch uses a convolutional neural network to extract visual features of the enhanced contract image; the text encoding branch uses a bidirectional long short-term memory network to encode the initial text with confidence labels; and the cross-modal fusion layer aligns the visual features with the text features through an attention mechanism.
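As an informal illustration of steps S6 and S7, the following Python sketch computes a steady-state concentration field and compares two fingerprints. It is a minimal sketch under stated assumptions, not the claimed implementation: the contract structure is modelled as a simple chain of clauses, the diffusion equation is taken to be a diffusion-with-decay balance solved by Jacobi iteration, and the function names, the `diffusion` and `decay` constants, and the example source matrices are all hypothetical. The principal component analysis step is omitted.

```python
import math

def steady_state(sources, diffusion=1.0, decay=0.5, iterations=500):
    """Jacobi iteration for D * laplacian(c) - decay * c + s = 0 on a clause chain.

    Assumption: clauses form a 1-D chain with no-flux ends.
    """
    n = len(sources)
    c = [0.0] * n
    for _ in range(iterations):
        nxt = []
        for i in range(n):
            neighbours = [c[j] for j in (i - 1, i + 1) if 0 <= j < n]
            nxt.append((diffusion * sum(neighbours) + sources[i])
                       / (diffusion * len(neighbours) + decay))
        c = nxt
    return c

def gradient(field):
    """Central differences inside the chain, one-sided differences at the ends."""
    n = len(field)
    if n < 2:
        return [0.0] * n
    g = [field[1] - field[0]]
    g += [(field[i + 1] - field[i - 1]) / 2 for i in range(1, n - 1)]
    g.append(field[-1] - field[-2])
    return g

def fingerprint(source_matrix):
    """source_matrix[k][i] is the source strength of keyword k in clause i.

    Returns the flattened gradient matrix (clauses x keywords entries).
    """
    fp = []
    for keyword_sources in source_matrix:
        fp.extend(gradient(steady_state(keyword_sources)))
    return fp

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical example: 2 keywords over 5 clauses; keyword 1 moves from clause 3 to clause 1.
standard = [[1, 0, 0, 0, 0], [0, 0, 0, 1, 0]]
tampered = [[1, 0, 0, 0, 0], [0, 1, 0, 0, 0]]
similarity = cosine_similarity(fingerprint(standard), fingerprint(tampered))
THRESHOLD = 0.95  # assumed value; the claim only says "preset threshold"
print("tampering suspected" if similarity < THRESHOLD else "no tampering detected")
```

Because the field is obtained by smoothing the keyword sources over neighbouring clauses, a small perturbation of one source changes the fingerprint gradually, whereas moving a keyword to a different clause changes the gradient pattern markedly.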

2. The method according to claim 1, characterized in that the processing steps of the multi-scale adaptive enhancement algorithm include: decomposing the standardized contract image into sub-image layers of different scales; independently calculating the local contrast distribution of each sub-image layer; performing dynamic range compression on regions of uneven illumination according to Retinex theory; and fusing the enhancement results of each scale through a Laplacian pyramid.
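The Retinex part of this claim can be illustrated with a small multi-scale Retinex sketch in Python. This is a simplified sketch, not the claimed algorithm: the surround function is a box mean rather than a Gaussian, the scales are fused by plain averaging instead of a Laplacian pyramid, and the function names and `radii` values are assumptions made for illustration.

```python
import math

def box_blur(img, radius):
    """Mean filter with a (2*radius+1) window, clipped at the image borders."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            total = sum(img[j][i] for j in ys for i in xs)
            out[y][x] = total / (len(ys) * len(xs))
    return out

def multi_scale_retinex(img, radii=(1, 2, 4), floor=1e-6):
    """Average of log(pixel) - log(local surround) over several scales.

    Pixel values are clamped to `floor` so the logarithm is defined.
    """
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for radius in radii:
        surround = box_blur(img, radius)
        for y in range(h):
            for x in range(w):
                pixel = max(img[y][x], floor)
                local = max(surround[y][x], floor)
                out[y][x] += (math.log(pixel) - math.log(local)) / len(radii)
    return out

# Hypothetical 3x4 grayscale patch: two dark "ink" pixels on a bright background.
patch = [[50, 50, 50, 50],
         [50, 10, 10, 50],
         [50, 50, 50, 50]]
enhanced = multi_scale_retinex(patch)
```

Since each output value is a log ratio between a pixel and its neighbourhood, multiplying the whole image by a constant illumination factor leaves the result unchanged, which is the property that makes Retinex-style processing useful for unevenly lit scans.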

3. The method according to claim 2, characterized in that the super-resolution reconstruction technology identifies regions in the standardized contract image whose degree of blur exceeds a threshold, extracts the texture features and gradient information of these regions, reconstructs high-resolution image patches through sub-pixel displacement estimation and frequency-domain interpolation, and applies non-local means filtering to suppress noise introduced during the reconstruction process.
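The first step of this claim, finding regions whose blur exceeds a threshold, can be sketched as follows. The claim does not state how blur is measured; the variance of the Laplacian response used here is a common sharpness measure and is an assumption, as are the function names, the tile size, and the threshold value. The reconstruction and denoising steps are not shown.

```python
def laplacian_variance(patch):
    """Variance of the 4-neighbour Laplacian over the patch interior.

    Low values indicate few sharp edges, i.e. a blurred or empty region.
    """
    h, w = len(patch), len(patch[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            responses.append(patch[y - 1][x] + patch[y + 1][x]
                             + patch[y][x - 1] + patch[y][x + 1]
                             - 4 * patch[y][x])
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

def blurred_tiles(img, tile, threshold):
    """Return (row, col) of the top-left corner of each tile judged blurred."""
    h, w = len(img), len(img[0])
    found = []
    for ty in range(0, h - tile + 1, tile):
        for tx in range(0, w - tile + 1, tile):
            patch = [row[tx:tx + tile] for row in img[ty:ty + tile]]
            if laplacian_variance(patch) < threshold:
                found.append((ty, tx))
    return found

# Hypothetical 4x8 image: a sharp checkerboard tile on the left, a flat tile on the right.
image = [[(255 if (x + y) % 2 == 0 else 0) if x < 4 else 128 for x in range(8)]
         for y in range(4)]
candidates = blurred_tiles(image, tile=4, threshold=100.0)  # threshold is an assumed value
```

Only the tiles returned by `blurred_tiles` would be passed on to the more expensive super-resolution stage.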

4. The method according to claim 3, characterized in that the initial text with confidence labels is generated as follows: the optical character recognition engine outputs a probability distribution vector for each recognized character; the character with the highest probability is selected as the recognition result, and its probability value is used as the confidence label; and a character whose confidence is lower than 0.85 is marked as a low-confidence character.
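The labelling rule in this claim can be sketched in a few lines of Python. The data layout (one dictionary of candidate characters and probabilities per position) and the function name are assumptions for illustration; only the 0.85 threshold comes from the claim.

```python
LOW_CONFIDENCE_THRESHOLD = 0.85  # value stated in the claim

def label_characters(distributions, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Pick the most probable candidate per position and flag low-confidence ones.

    `distributions` is a list of {candidate_char: probability} dicts (assumed layout).
    """
    labelled = []
    for dist in distributions:
        char, prob = max(dist.items(), key=lambda item: item[1])
        labelled.append({
            "char": char,
            "confidence": prob,
            "low_confidence": prob < threshold,
        })
    return labelled

# Hypothetical OCR output for two character positions.
ocr_output = [
    {"1": 0.97, "l": 0.03},
    {"O": 0.55, "0": 0.45},
]
for item in label_characters(ocr_output):
    print(item)
```

In this example the second position is flagged because "O" and "0" are nearly equally likely, which is the kind of character a later correction stage would need to re-examine.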

5. The method according to claim 4, characterized in that the semantic bridging fusion model combines sparse-coding-based dictionary learning with a deep unfolding network: the iterative shrinkage-thresholding algorithm is unfolded into a multi-layer feedforward network structure in which each layer corresponds to one iteration, and a sparse representation of the text and image features is obtained by learning the dictionary matrix and the shrinkage threshold parameters.
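The unfolding idea in this claim can be illustrated with a plain iterative shrinkage-thresholding loop in which each pass plays the role of one network layer. This is a sketch under assumptions: the dictionary and thresholds are fixed by hand here, whereas the claim learns them; the function names, the `step` parameter, and the toy dictionary are hypothetical.

```python
import math

def soft_threshold(values, theta):
    """Shrink each entry towards zero by theta, zeroing entries smaller than theta."""
    return [math.copysign(max(abs(v) - theta, 0.0), v) for v in values]

def mat_vec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def unfolded_ista(x, dictionary, thresholds, step=1.0):
    """Run one shrinkage-thresholding iteration per entry of `thresholds`.

    Layer update: z <- soft(z + step * D^T (x - D z), theta_layer).
    In a learned (unfolded) network, D and each theta_layer would be trainable.
    """
    d_t = transpose(dictionary)
    z = [0.0] * len(dictionary[0])
    for theta in thresholds:
        residual = [xi - ri for xi, ri in zip(x, mat_vec(dictionary, z))]
        update = mat_vec(d_t, residual)
        z = soft_threshold([zi + step * ui for zi, ui in zip(z, update)], theta)
    return z

# Toy example: identity dictionary, 8 layers with the same assumed threshold.
identity = [[1.0, 0.0], [0.0, 1.0]]
code = unfolded_ista([1.0, 0.05], identity, thresholds=[0.1] * 8)
print(code)
```

With an orthonormal dictionary the small coefficient is suppressed to zero while the large one is only slightly shrunk, which is the sparsifying behaviour the unfolded layers are meant to reproduce.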

6. The method according to claim 5, characterized in that the steps for establishing the training dataset for the semantic bridging fusion model include: collecting 10,000 contract scan images of varying quality; manually annotating the correct content and position of each character; applying augmentation processing such as blurring, noise, and uneven illumination to the contract scan images to form multiple samples; and using an open-source optical character recognition engine to generate text containing recognition errors as input samples.

7. The method according to claim 6, characterized in that the training steps for the semantic bridging fusion model include: initializing the dictionary matrix as a random orthogonal matrix; setting the number of unfolded network layers to 8; training with an end-to-end supervised learning approach; using the character recognition cross-entropy loss plus a sparsity-constraint regularization term as the loss function; and updating the network parameters with the adaptive moment estimation (Adam) optimization algorithm.
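The loss function named in this claim can be written out as a short Python sketch. The claim does not give the form of the sparsity term or its weight; an L1 penalty on the sparse codes and the `sparsity_weight` value below are assumptions, and the function names are hypothetical.

```python
import math

def cross_entropy(prob_rows, target_indices, floor=1e-12):
    """Mean negative log-probability assigned to the correct character class."""
    total = sum(math.log(max(probs[target], floor))
                for probs, target in zip(prob_rows, target_indices))
    return -total / len(target_indices)

def l1_sparsity(codes):
    """Mean L1 norm of the sparse codes (assumed form of the sparsity constraint)."""
    return sum(abs(v) for code in codes for v in code) / len(codes)

def total_loss(prob_rows, target_indices, codes, sparsity_weight=0.01):
    """Cross-entropy plus weighted sparsity regularization."""
    return (cross_entropy(prob_rows, target_indices)
            + sparsity_weight * l1_sparsity(codes))

# Hypothetical batch of one character with two candidate classes.
loss = total_loss([[0.5, 0.5]], [0], [[0.5, -0.5]])
print(loss)
```

Raising `sparsity_weight` pushes the learned codes towards fewer non-zero entries at the cost of recognition accuracy, so in practice it would be tuned on validation data.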

8. The method according to claim 7, characterized in that the weight coefficients of the attention mechanism in the cross-modal fusion layer are determined from three parameters: the confidence of characters in the initial text with confidence labels, the image sharpness score of the corresponding region in the enhanced contract image, and the semantic importance of key elements in the structured data; the three parameters are each normalized to the interval from 0 to 1 and then combined by a weighted sum.
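The normalize-then-sum calculation in this claim can be sketched as follows. Min-max scaling is assumed as the normalization, and the mixing weights `(0.4, 0.4, 0.2)` are placeholder values, since the claim does not specify either; the function names are hypothetical.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant sequence maps to all zeros."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0] * len(values)
    return [(v - low) / (high - low) for v in values]

def attention_weights(confidence, sharpness, importance, mix=(0.4, 0.4, 0.2)):
    """Weighted sum of the three normalized parameters, one weight per character."""
    norm_c = min_max(confidence)
    norm_s = min_max(sharpness)
    norm_i = min_max(importance)
    return [mix[0] * c + mix[1] * s + mix[2] * i
            for c, s, i in zip(norm_c, norm_s, norm_i)]

# Hypothetical values for two characters.
weights = attention_weights(
    confidence=[0.9, 0.5],   # OCR confidence labels
    sharpness=[10.0, 30.0],  # image sharpness scores of the matching regions
    importance=[1.0, 3.0],   # semantic importance of the key element
)
print(weights)
```

Here the second character receives the larger weight: although its OCR confidence is lower, it sits in a sharper region and belongs to a more important contract element.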

9. The method according to claim 8, characterized in that the semantic normalization process includes: establishing a contract-domain thesaurus containing the correspondence between "Party A" and "the client" and between "liquidated damages" and "compensation"; matching the terminology in the structured data against the thesaurus; and replacing the matched synonyms with predefined standard terms.
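The thesaurus lookup in this claim amounts to a dictionary replacement over the structured fields, sketched below. Which term in each pair is the standard one is not stated in the claim; treating "Party A" and "liquidated damages" as the standard terms is an assumption, as are the field names and the function name.

```python
# Assumed direction of the mapping: synonym -> standard term.
CONTRACT_THESAURUS = {
    "the client": "Party A",
    "client": "Party A",
    "compensation": "liquidated damages",
}

def normalize_terms(structured_data, thesaurus=CONTRACT_THESAURUS):
    """Replace field values that match a thesaurus synonym with the standard term.

    Matching is case-insensitive and ignores surrounding whitespace;
    values without a thesaurus entry are returned unchanged.
    """
    normalized = {}
    for field, value in structured_data.items():
        key = value.strip().lower()
        normalized[field] = thesaurus.get(key, value)
    return normalized

# Hypothetical structured data extracted from a contract.
extracted = {"payer": "The Client", "penalty_type": "compensation", "amount": "5000"}
print(normalize_terms(extracted))
```

Normalizing before fingerprinting means that a harmless wording difference such as "the client" versus "Party A" does not register as a change to the contract.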