A feature-learning-based method and system for DNA encoding storage and anti-counterfeiting verification of signature images, a computer device, and a readable storage medium

By combining feature learning and DNA molecular hybridization, a method was developed that integrates high-density storage and anti-counterfeiting verification of signature images, solving the problems of limited retrieval function and insufficient anti-counterfeiting verification in existing technologies, and improving verification accuracy and efficiency.

CN122244880A · Pending · Publication Date: 2026-06-19 · XI AN JIAOTONG UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XI AN JIAOTONG UNIV
Filing Date
2026-03-20
Publication Date
2026-06-19


Abstract

This invention discloses a method, system, computer equipment, and readable storage medium for signature image DNA encoding storage and anti-counterfeiting verification based on feature learning, belonging to the field of electronic information technology. The method includes: extracting high-dimensional discriminative feature vectors from the signature image to be verified using a feature extractor; inputting the vectors into a trained sequence encoder to generate a DNA sequence; synthesizing a reverse complementary sequence to prepare a labeled probe; hybridizing the probe with the registered DNA sequence of the target signer; measuring the hybridization yield; if the yield exceeds a preset threshold, the signature is determined to be genuine; otherwise, it is a forged signature. This invention is the first to integrate handwritten signature anti-counterfeiting with DNA storage. Through three core components—a feature extractor, a sequence encoder, and a hybridization predictor—and multi-task joint training, genuine signature pairs are encoded as high-hybridization-yield sequences, while genuine and forged signature pairs are encoded as low-hybridization-yield sequences, achieving physical-level anti-counterfeiting verification based on molecular hybridization. The verification process of this invention is inherently parallel, energy-efficient, and highly accurate, meeting the needs of practical anti-counterfeiting applications.

Description

Technical Field

[0001] This invention belongs to the field of electronic information technology, and specifically relates to a method, system, computer equipment, and readable storage medium for signature image DNA encoding storage and anti-counterfeiting verification based on feature learning.

Background Technology

[0002] With the explosive growth of global digital data, traditional media such as magnetic storage, optical storage, and solid-state storage face increasingly severe challenges in terms of storage density, durability, and energy consumption. Deoxyribonucleic acid (DNA), as a natural information storage carrier, offers extremely high storage density (theoretically reaching hundreds of petabytes per gram), extremely long preservation time (thousands of years under appropriate conditions), and extremely low maintenance energy consumption, and has been widely recognized as a strong candidate for next-generation data storage media. In recent years, DNA digital data storage technology has developed rapidly, and researchers have successfully encoded files of various formats into DNA sequences and then synthesized, stored, and retrieved them.

[0003] Handwritten signatures, as an important form of biometric information, are widely used for identity authentication and anti-counterfeiting verification in scenarios such as financial transactions, commercial contracts, and legal documents. Handwritten signature verification refers to determining whether a given signature image is the genuine signature of a specific signatory. In the traditional field of digital image processing, deep learning-based Siamese networks have been widely used for signature verification tasks. For example, the SigNet model proposed by Dey et al. uses two convolutional neural networks (CNNs) with shared weights to extract deep feature vectors of the signature pairs to be verified. By calculating the Euclidean distance between the feature vectors and combining it with contrastive loss training, they successfully achieved writer-independent offline signature verification (Dey S, Dutta A, Toledo JI, et al. SigNet: Convolutional siamese network for writer-independent offline signature verification. arXiv preprint arXiv:1707.02131,2017.).

[0004] In recent years, research has emerged that combines deep learning with DNA storage for image recognition. Chinese patent application CN121582595A discloses a method for identifying invasive species based on deep learning and DNA storage; the method generates the DNA sequence of a species image using a DNA encoder and identifies the species using a hybridization yield predictor. Chinese patent CN115223178B discloses a signature verification method based on meta-learning, which improves verification accuracy by constructing a signature authenticity discrimination model. Despite the progress made in DNA data storage, image DNA encoding, and signature verification, the following technical shortcomings remain:

1) Existing DNA storage systems have limited retrieval functions. Current DNA data storage systems primarily rely on exact key-value retrieval and cannot perform intelligent retrieval and analysis based on the semantics of the data content. While Bee et al. implemented image similarity search, their work only addressed visual similarity in general natural images (OpenImages dataset), without addressing semantic understanding and task-driven retrieval in specific application scenarios.

2) Research on the application of DNA storage in information security and anti-counterfeiting is scarce. Current research on DNA data storage mainly focuses on improving storage encoding efficiency, error correction capabilities, and read/write speeds, with few reports on combining DNA storage technology with anti-counterfeiting verification. In the important field of handwritten signature authentication in particular, there is still no technical solution that integrates signature storage and anti-counterfeiting verification using DNA sequence encoding.

3) Existing DNA sequence encoding methods lack adaptation and optimization for specific tasks. Bee et al. used VGG16 as a general feature extractor, and Su et al. used LeNet-5; neither optimized the feature extraction architecture for the specific anti-counterfeiting task of signature verification. Handwritten signature images are characterized by rich stroke details, subtle individual differences, and small visual differences between genuine and forged signatures, making it difficult for general feature extractors to capture the key features that distinguish genuine from forged signatures.

4) Existing methods do not fully utilize the pairing label information available in signature verification. Bee et al. defined similarity based on a Euclidean distance threshold on feature vectors, and Su et al. combined Euclidean distance and class labels, but neither used the natural pairing labels of the signature verification task (i.e., genuine signature pairs from the same signer, and genuine/forged signature pairs) to guide the optimization of DNA sequence encoding. As a result, the encoding quality does not serve the anti-counterfeiting verification goal well.

5) Feature-learning-based methods have made significant progress in the field of signature anti-counterfeiting verification. By training deep convolutional networks on large-scale signature datasets, high-dimensional feature representations of signature images are learned automatically, replacing the tedious manual feature design of traditional methods.

[0005] In summary, how to deeply integrate DNA data storage technology with handwritten signature anti-counterfeiting verification tasks, and construct a method that can simultaneously achieve high-density storage of signature images and physical-level anti-counterfeiting verification based on molecular hybridization, has become a technical problem that urgently needs to be solved in this field.

Summary of the Invention

[0006] To address the limitations of existing DNA storage technologies, such as limited retrieval functionality, lack of deep integration with information security and anti-counterfeiting fields, and the failure of existing signature verification methods to utilize DNA molecular hybridization combined with feature learning to achieve integrated physical-level DNA storage of signature images with anti-counterfeiting verification, this invention aims to provide a signature image DNA encoding storage and anti-counterfeiting verification method, system, computer equipment, and readable storage medium based on feature learning.

[0007] To achieve the above objectives, the present invention employs the following technical solution. In a first aspect, the present invention provides a signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning, comprising the following steps: obtaining the signature image and extracting its feature vector using a pre-trained feature extractor; inputting the feature vector into a trained sequence encoder that maps it to a DNA sequence, and storing the DNA sequence in a DNA storage medium as the feature encoding sequence of the signature image; and, during the verification phase, converting the signature image to be tested into a DNA sequence, subjecting it to molecular hybridization with the stored target sequence, and determining the authenticity of the signature based on the hybridization rate.

[0008] The feature extractor is a pre-trained convolutional neural network optimized for handwritten signature anti-counterfeiting tasks. It is used to extract high-dimensional discriminative feature vectors from signature images to enhance the ability to distinguish subtle differences between genuine and counterfeit signatures.

[0009] Preferably, the feature extractor is the SigNet-F model.

[0010] The sequence encoder employs a Siamese (weight-sharing) network structure, maps feature vectors to a base probability matrix, and generates a fixed-length DNA sequence by selecting the maximum-probability base at each position.

[0011] Preferably, the sequence encoder includes a multilayer perceptron and a Softmax (normalized exponential function) normalization layer.

[0012] This structure ensures that genuine-genuine and genuine-forged pairs are encoded with the same rules. The differentiable discretization operation allows the discrete DNA sequence generation process to participate in gradient backpropagation. The fully connected layer structure is simple and efficient, with layer dimensions 2048 → 1024 → 320, where the 320-dimensional output is arranged as an 80×4 base probability matrix.
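The final decoding step described above, from a per-position base probability matrix to a fixed-length sequence, can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation of the invention: the column order `ACGT`, the function names, and the use of raw logits as input are hypothetical, and the trainable fully connected layers are omitted.

```python
import math

BASES = "ACGT"  # assumed column order; the source does not specify one


def softmax(logits):
    """Numerically stable softmax over the four logits of one position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode_sequence(logit_matrix):
    """Map an L x 4 logit matrix to a DNA string by per-position argmax."""
    seq = []
    for row in logit_matrix:
        probs = softmax(row)
        seq.append(BASES[probs.index(max(probs))])
    return "".join(seq)


# Toy 3-position example; the text describes 80 positions.
toy_logits = [
    [2.0, 0.1, 0.1, 0.1],
    [0.0, 0.0, 3.0, 0.0],
    [0.1, 0.2, 0.3, 4.0],
]
print(decode_sequence(toy_logits))  # prints "AGT"
```

A hard argmax like this has no useful gradient, which is why the text relies on a differentiable discretization during training; that mechanism is not reproduced here.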

[0013] The sequence encoder is trained using a multi-task joint loss function. The training process employs a balanced pairing strategy to construct training samples, ensuring a balance between the number of true-true pairs and true-false pairs for each signer.
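One way to realize the balanced pairing strategy for a single signer is sketched below. The source only states that genuine-genuine and genuine-forged pairs are balanced per signer; downsampling the larger group to the size of the smaller one, the function name `balanced_pairs`, and the sample identifiers are assumptions made for illustration.

```python
import itertools
import random


def balanced_pairs(genuine, forged, rng):
    """Build equally many label-1 and label-0 pairs for ONE signer.

    Label 1: two genuine signatures of the signer.
    Label 0: one genuine signature paired with one forgery.
    """
    positives = list(itertools.combinations(genuine, 2))
    negatives = list(itertools.product(genuine, forged))
    n = min(len(positives), len(negatives))  # assumed balancing rule
    pairs = rng.sample(positives, n) + rng.sample(negatives, n)
    labels = [1] * n + [0] * n
    return pairs, labels


# Hypothetical identifiers for one signer's samples.
pairs, labels = balanced_pairs(["g1", "g2", "g3"], ["f1", "f2"], random.Random(0))
print(labels.count(1), labels.count(0))  # prints "3 3"
```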

[0014] The multi-task joint loss function includes: task prediction loss, based on the error between the hybridization rate output by the hybridization predictor and the paired labels; semantic consistency loss, used to maintain the relative similarity between the original signature feature pairs and the DNA sequence pairs; and user identity classification loss, used to enhance the identity discrimination capability of the encoding results.

[0015] Task prediction loss ensures high hybridization yield for true signature pairs and low hybridization yield for true and false signature pairs. Semantic consistency loss reduces the damage to the original discriminative structure caused by task optimization. User identity classification loss enhances the ability of the encoding results to retain identity discrimination information. The average accuracy reaches 95.64%.
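As a reading aid, the three loss terms named above could be combined as a weighted sum. The source does not state how the terms are combined, so both the additive form and the weights λ1, λ2, λ3 are assumptions introduced here.

```latex
\mathcal{L}_{\mathrm{total}}
  = \lambda_{1}\,\mathcal{L}_{\mathrm{task}}
  + \lambda_{2}\,\mathcal{L}_{\mathrm{sem}}
  + \lambda_{3}\,\mathcal{L}_{\mathrm{id}}
```

Here the three terms stand for the task prediction loss, the semantic consistency loss, and the user identity classification loss, respectively.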

[0016] The hybridization predictor is a differentiable deep learning model used to estimate the hybridization rate between two DNA sequences and keeps its parameters frozen during the sequence encoder training phase, serving as a task-driven differentiable supervisory signal source.

[0017] The hybridization predictor is pre-trained on a large-scale DNA sequence pair dataset, which covers 27 sequence lengths from 20nt to 150nt. Each length generates 100,000 sequence pairs, totaling 2.7 million sequence pairs. The sequences meet the constraints of 40%-60% GC content and maximum homopolymer length not exceeding 3.

[0018] Preferably, the DNA sequence pair dataset is generated using a binning uniform sampling strategy based on normalized Hamming distance, which divides the normalized Hamming distance range [0,1] into 10 intervals, and generates an equal number of sequence pairs in each interval.

[0019] The steps for determining the authenticity of a signature through the molecular hybridization reaction specifically include: synthesizing the reverse complementary sequence of the DNA sequence to be tested and preparing a labeled probe; annealing and hybridizing the probe with the feature encoding sequence of the target signer in the storage pool; obtaining the hybridization rate by measuring the hybridization reaction signal; and, if the hybridization rate exceeds a preset threshold, determining that the signature is genuine, and otherwise that it is forged.
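The computational parts of these steps, deriving the probe sequence and applying the threshold rule, can be sketched as follows. The wet-lab steps (probe labeling, annealing, signal measurement) are outside the scope of the sketch, and the threshold value used in the example is a placeholder, since the source does not give one.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (the probe sequence)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def is_genuine(hybridization_yield, threshold):
    """Decision rule from the text: genuine only if the yield exceeds the threshold."""
    return hybridization_yield > threshold


probe = reverse_complement("AACGT")
print(probe)  # prints "ACGTT"

# 0.5 is a placeholder threshold for illustration only.
print(is_genuine(0.92, threshold=0.5), is_genuine(0.08, threshold=0.5))
```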

[0020] In a second aspect, the present invention provides a system for implementing the signature image DNA encoding storage and anti-counterfeiting verification method, comprising: a feature extraction module, used to extract discriminative feature vectors from the signature image; a sequence encoding module, used to map feature vectors to DNA sequences; a storage module, used to store DNA sequences in a DNA medium; and a verification module, used to synthesize the reverse complementary sequence of the DNA sequence to be tested and prepare a labeled probe, hybridize the probe with the feature encoding sequence of the target signer in the storage pool, measure the hybridization signal, and determine the authenticity of the signature to be tested according to a preset threshold.

[0021] Preferably, the verification module includes: a probe preparation unit, used to synthesize and label the reverse complementary sequence of the DNA sequence to be tested; a hybridization reaction unit, used to perform annealing hybridization between the probe and the target sequence; and a signal detection unit, used to determine the hybridization rate and output the judgment result.

[0022] In a third aspect, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning.

[0023] Compared with the prior art, the present invention has the following beneficial effects: This invention provides a signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning. By constructing three core components—a feature extractor, a sequence encoder, and a hybridization predictor—and employing a multi-task joint loss training strategy, it achieves end-to-end task-driven encoding from handwritten signature images to DNA sequences. This ensures that genuine signature pairs from the same signatory are encoded as similar DNA sequences with high hybridization yields, while genuine and counterfeit signature pairs are encoded as different DNA sequences with low hybridization yields. This invention introduces anti-counterfeiting verification of handwritten signature images into the DNA storage scenario for the first time. By encoding the discriminative features of the signature image into DNA sequences, the DNA medium not only possesses information storage capabilities but can also perform anti-counterfeiting verification through molecular hybridization, forming a complete technical chain of "feature learning-DNA encoding-molecular hybridization verification." Compared to traditional signature verification methods based on digital comparison, this invention achieves physical-level integration of storage medium and verification function. The verification process based on DNA molecular hybridization is a large-scale parallel physicochemical process, eliminating the need for individual digital comparisons. The probe sequence hybridizes simultaneously with all target sequences in the database, and the verification time does not significantly increase with the size of the registered signature database. The DNA sequence encoded by this method can effectively distinguish between genuine and forged signatures, meeting the application requirements of practical anti-counterfeiting verification scenarios.

[0024] Furthermore, a pre-trained Sigver-SigNet-F (λ=0.95) model was used as the front-end feature extractor, outputting a 2048-dimensional signature feature vector. This feature extractor is optimized for signature stroke structure, writing habits, and authenticity differences, exhibiting better task adaptability and discriminative ability in signature verification compared to general image feature extraction networks, providing a more effective input representation for subsequent sequence encoding. When training the hybridization predictor, a dataset—currently the largest dataset in research on predicting molecular hybridization—was constructed, containing 2.7 million sequence pairs ranging from 20nt to 150nt in length. This provides flexible and variable selection of hybridization sequence lengths, significantly aiding the subsequent encoder in encoding sequences of different lengths.

[0025] This invention employs a task-driven sequence encoder training mechanism, so that the DNA encoding results directly serve the anti-counterfeiting objective. During the sequence encoder training phase, a frozen pre-trained hybridization predictor provides the task supervision signal. Combined with semantic consistency constraints and auxiliary user identity classification constraints, multi-task joint optimization ensures that the DNA sequence output by the encoder satisfies the hybridization prediction task requirements while preserving as much identity-discriminating information as possible from the signature features. A Siamese weight-sharing encoding structure and a balanced pairing training strategy are used to improve training stability and model generalization. The Siamese weight-sharing structure ensures that both branches of a paired sample use the same encoding rules. In addition, a balanced paired dataset of genuine-genuine and genuine-forged samples is constructed from sample-level structured feature data, keeping positive and negative samples balanced within each user. This reduces the impact of class imbalance and improves training convergence stability, evaluation comparability, and cross-user generalization. The anti-counterfeiting verification process is inherently parallel and energy-efficient: verification based on DNA molecular hybridization is a large-scale parallel physicochemical process that does not require one-to-one comparison calculations. The probe sequence hybridizes simultaneously with all target sequences in the database, so the verification time does not increase significantly as the registered signature database grows.

[0026] The system provided by this invention transforms the aforementioned methods into a practically deployable technical solution through modular design. The system includes a feature extraction module, a sequence encoding module, a probe preparation module, a hybridization reaction module, and a result determination module, forming a complete verification closed loop from signature image input to authenticity result output, facilitating system development, deployment, and maintenance. Each module can be independently optimized or replaced according to actual needs, exhibiting good scalability and scenario adaptability.

Attached Figure Description

[0027] Figure 1 is a schematic diagram of the overall process of the present invention;
Figure 2 is a schematic diagram of the overall architecture of the present invention, showing the three core components (feature extractor, sequence encoder, and hybridization predictor) and their data flow;
Figure 3 is a schematic diagram illustrating the dataset generation process for the hybridization predictor;
Figure 4 shows the Hamming distance distribution of a portion of the training dataset for the hybridization predictor, where Figure 4(a) is the distribution for sequences of length 30 nt, Figure 4(b) for 40 nt, Figure 4(c) for 60 nt, and Figure 4(d) for 80 nt;
Figure 5 is a schematic diagram illustrating the one-hot encoding representation of DNA sequences and the working principle of convolutional neural networks, showing the mapping process from a 2048-dimensional feature vector to a 4×80 Softmax encoding matrix and then to an 80 nt DNA sequence;
Figure 6 is a visualization of the hybridization predictor's performance on sequences of selected lengths, where Figure 6(a) is the result for 30 nt sequences, Figure 6(b) for 40 nt, Figure 6(c) for 60 nt, and Figure 6(d) for 80 nt;
Figure 7 shows the distribution in feature space of the feature vectors extracted from the CEDAR dataset by the pre-trained Sigver-SigNet-F feature extractor (the feature extractor shown in Figure 2).

[0028] Figure 8 is a graph showing how the test set performance metrics (genuine-genuine recognition rate, genuine-forged interception rate, and average accuracy) change over the encoder training epochs;
Figure 9 is a schematic diagram of the web page interface for collecting handwritten signatures;
Figure 10 shows the results of calculating the similarity between the encoded sequences of the self-built signature samples.

Detailed Implementation

[0029] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0030] The present invention will now be described in further detail with reference to the accompanying drawings: This invention provides a method for DNA encoding storage and anti-counterfeiting verification of handwritten signature images based on deep feature learning. It combines the anti-counterfeiting verification task of handwritten signatures with DNA data storage technology. By training three core components—a feature extractor, a sequence encoder, and a hybridization predictor—it achieves the encoding mapping from handwritten signature images to DNA sequences. This allows genuine signatures from the same signer to be encoded as similar DNA sequences with high hybridization yields, while genuine signatures paired with forged signatures are encoded as different DNA sequences with low hybridization yields. Thus, it simultaneously achieves DNA storage of signatures and anti-counterfeiting verification based on molecular hybridization.

[0031] The technical solution of this invention does not simply encode and store signature images using DNA. Instead, it constructs a closed-loop technical route consisting of "feature learning - sequence encoding - hybridization prediction - task-driven optimization - molecular verification". Its core idea is to first learn discriminative features in the digital domain that can distinguish between genuine and counterfeit signatures. Then, a trainable encoder maps these features to DNA sequence representations. A differentiable hybridization predictor is used as a surrogate model to introduce molecular hybridization behavior into the encoder training objective, enabling the encoding results not only to be stored but also to perform anti-counterfeiting verification at the subsequent molecular level.

[0032] Specifically, this invention divides the training process into three interconnected but distinct stages: the first stage constructs large-scale DNA sequence pair data and uses NUPACK thermodynamic simulation results to train a differentiable hybridization predictor to approximate the actual hybridization yield; the second stage selects a feature extractor suitable for offline signature verification tasks to obtain high-dimensional signature features with strong ability to distinguish between genuine and counterfeit signatures; the third stage fixes the feature extractor and hybridization predictor obtained in the first two stages and trains a sequence encoder so that the DNA sequence output by the encoder satisfies the task objective of "high hybridization yield for genuine signature pairs and low hybridization yield for genuine and counterfeit signature pairs" under the evaluation of the predictor.

[0033] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. Those skilled in the art will understand that the following embodiments are merely illustrative of the invention and do not constitute any limitation on the scope of protection of the invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

[0034] I. Experimental Environment Configuration

The experimental environment in this embodiment is configured for the subsequent data processing, model training, and evaluation tasks. It is built on a Linux server, with Python as the primary programming language. Anaconda, an open-source Python distribution, provides the runtime environment together with conda for managing environments and dependencies. Data processing libraries such as numpy and pandas are used for computational analysis. The detailed environment settings for building the deep learning models are shown in Table 1.

[0035] Table 1: Experimental Environment Configuration

[0036] II. Overall Architecture of the Invention

As shown in Figure 1 and Figure 2, the overall architecture of this invention includes three core components: a feature extractor, a sequence encoder (Encoder), and a hybridization predictor, each shown in Figure 2. The overall process is as follows: a pair of handwritten signature images is input, and the feature extractor extracts a high-dimensional feature vector from each image; the sequence encoder maps each feature vector to a Softmax encoding matrix representing a DNA sequence; the hybridization predictor estimates the hybridization yield between the two encoded sequences, which serves as a quantitative indicator of sequence similarity. The training objective of the system is to ensure that the DNA sequences corresponding to genuine signature pairs from the same signer (labeled 1) have a high hybridization yield, while the DNA sequences corresponding to a genuine signature paired with a forged signature (labeled 0) have a low hybridization yield.

[0037] As shown in Figure 1, this invention divides the training process into three interconnected but distinct stages:
Stage 1: construct large-scale DNA sequence pair data and train a differentiable hybridization predictor using NUPACK thermodynamic simulation results;
Stage 2: select a feature extractor suitable for offline signature verification tasks to obtain high-dimensional signature features with strong ability to distinguish between genuine and forged signatures;
Stage 3: fix the feature extractor and hybridization predictor obtained in the first two stages, and train the sequence encoder so that its output DNA sequences meet the task objective of "high hybridization yield for genuine signature pairs and low hybridization yield for genuine and forged signature pairs".

[0038] III. Construction and Training of the Hybridization Predictor

3.1 Training Dataset Generation

The hybridization predictor predicts the hybridization yield between two DNA sequences. During training it serves as a differentiable surrogate for the NUPACK software, so that gradients can be backpropagated through the hybridization estimate. The hybridization yield is defined as the ratio of the concentration of double-stranded DNA (dsDNA) to the initial concentration of single-stranded DNA (ssDNA) under thermodynamic equilibrium conditions, and ranges from 0 to 1. A hybridization yield closer to 1 indicates a higher degree of hybridization between the two sequences and a greater degree of sequence similarity.
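The definition of hybridization yield given above can be written compactly as follows. The symbol Y and the subscripts are notation introduced here for illustration and do not appear in the source.

```latex
Y = \frac{[\mathrm{dsDNA}]_{\mathrm{eq}}}{[\mathrm{ssDNA}]_{0}},
\qquad 0 \le Y \le 1
```

Here the numerator is the duplex concentration at thermodynamic equilibrium and the denominator is the initial single-strand concentration.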

[0039] As shown in Figure 3, to train a differentiable hybridization predictor, it is necessary to construct a large-scale DNA sequence pair dataset covering different sequence lengths and different sequence similarities, and use NUPACK software to calculate the hybridization yield for each sequence pair as a training label.

[0040] (1) Sequence length and data scale design

The dataset covers 27 sequence lengths from 20 nt to 150 nt, increasing in 5 nt increments (i.e., 20, 25, 30, 35, ..., 145, 150 nt). 100,000 sequence pairs are generated for each sequence length, resulting in a total of 2,700,000 DNA sequence pairs. This multi-length design enables the hybridization predictor to learn the thermodynamics of hybridization at different sequence lengths, providing generalization ability across different coding lengths.
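The dataset size stated above follows directly from the length grid. A short check, using variable names chosen here for illustration:

```python
# Lengths 20, 25, ..., 150 in steps of 5.
lengths = list(range(20, 151, 5))
pairs_per_length = 100_000

total_pairs = len(lengths) * pairs_per_length
print(len(lengths), total_pairs)  # prints "27 2700000"
```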

[0041] (2) Sequence constraints

To ensure the biological validity and synthesizability of the generated sequences, all sequences must meet the following two constraints. GC content constraint: the percentage of G and C bases in each sequence must be between 40% and 60%. GC content directly affects the thermal stability of DNA double strands. This range ensures that the sequence has a suitable melting temperature and avoids the non-specific hybridization or excessively low hybridization efficiency caused by extreme GC content.

[0042] Homopolymer length constraint: Homopolymer structures with four or more consecutive identical bases are not allowed in the sequence (i.e., the maximum homopolymer length is 3). Long homopolymers lead to an increased error rate in DNA synthesis and affect the predictability of hybridization behavior.
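The two constraints can be checked with a few lines of code. This is a minimal sketch: the function names are hypothetical, the 40% and 60% bounds are treated as inclusive (the source does not say), and sequences are assumed to be non-empty strings over A, C, G, T.

```python
def gc_content(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def max_homopolymer(seq):
    """Length of the longest run of identical consecutive bases."""
    longest = run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        longest = max(longest, run)
    return longest


def is_valid(seq, gc_min=0.40, gc_max=0.60, max_run=3):
    """True if the sequence satisfies both constraints from the text."""
    return gc_min <= gc_content(seq) <= gc_max and max_homopolymer(seq) <= max_run


print(is_valid("ACGTACGTAC"))  # GC 50%, longest run 1
print(is_valid("AAAACGCGCG"))  # rejected: run of four A
print(is_valid("ATATATATAT"))  # rejected: GC 0%
```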

[0043] (3) Uniform sampling strategy based on normalized Hamming distance binning

To ensure a uniform distribution of training data across sequence similarity and prevent model bias in specific similarity intervals, a binning uniform sampling strategy based on normalized Hamming distance is employed. The normalized Hamming distance (Hamming distance divided by sequence length) takes values in [0, 1], and this range is divided into 10 equal intervals: [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]. For each sequence length, 10,000 sequence pairs are generated for each interval, totaling 100,000 pairs for each length. This strategy ensures that the model has sufficient training samples in both high similarity (low Hamming distance) and low similarity (high Hamming distance) intervals, so that it can accurately learn the complete mapping between hybridization yield and sequence difference.
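The binning rule can be sketched as below. To avoid floating-point edge cases at bin boundaries, the bin index is computed from the integer mismatch count rather than from the normalized distance; that choice, and the function names, are assumptions of this sketch rather than details from the source.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def bin_index(mismatches, length, n_bins=10):
    """Index of the bin containing mismatches / length.

    Bins are [0, 0.1), [0.1, 0.2), ..., [0.9, 1.0]; the last bin is
    closed on the right, so a distance of exactly 1.0 maps to bin 9.
    """
    return min(mismatches * n_bins // length, n_bins - 1)


pairs_per_length = 100_000
pairs_per_bin = pairs_per_length // 10  # 10,000 pairs per interval

d = hamming("ACGTACGTAC", "ACGTACGTTT")  # differs at the last two positions
print(d, bin_index(d, 10), pairs_per_bin)  # prints "2 2 10000"
```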

[0044] Figure 4 shows the Hamming distance distribution of part of the training dataset for the hybridization predictor, where Figure 4(a) is the distribution for sequences of length 30 nt, Figure 4(b) for sequences of length 40 nt, Figure 4(c) for sequences of length 60 nt, and Figure 4(d) for sequences of length 80 nt. As can be seen from the figure, sequences of each length are evenly distributed across the Hamming distance intervals, ensuring that the model has sufficient training samples in both high-similarity (low Hamming distance) and low-similarity (high Hamming distance) intervals.

[0045] (4) Sequence pair generation process For each pair of sequences within each interval, the generation steps are as follows: Step 1: Randomly generate the target sequence. Generate a DNA sequence of a specified length by randomly sampling from the four bases {A, T, C, G} with equal probability. Check whether it simultaneously satisfies the GC content constraint and the homopolymer length constraint. If not, regenerate until a valid target sequence is obtained.

[0046] Step 2: Generate a query sequence based on the target sequence through site-directed mutagenesis. Randomly sample a target distance rate within the normalized Hamming distance range corresponding to the current interval, multiply it by the sequence length, and round it to obtain the number of base sites to be mutated. Randomly select a corresponding number of non-repeating sites from the target sequence, and replace the bases at each site with one of the other three bases to generate a candidate query sequence.

[0047] Step 3: Perform constraint checks on the candidate query sequences to see if they meet the GC content and homopolymer length constraints. If they do, record them as valid samples; otherwise, repeat Step 2. Each target sequence can be tried a maximum of 100 times. If all attempts fail, discard the target sequence and start over from Step 1.

[0048] Step 4: Calculate and record the actual Hamming distance of the sequence pair.
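Steps 1 to 4 above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the generator actually used: the function names, the use of Python's `random` module, and the seeding are hypothetical, while the 40% to 60% GC range, the maximum homopolymer length of 3, and the limit of 100 attempts per target come from the text.

```python
import random

BASES = "ACGT"

def gc_ok(seq, lo=0.40, hi=0.60):
    """GC fraction must lie within [lo, hi]."""
    gc = sum(base in "GC" for base in seq) / len(seq)
    return lo <= gc <= hi

def homopolymer_ok(seq, max_run=3):
    """Reject any run of more than max_run identical bases."""
    run = 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return False
    return True

def is_valid(seq):
    return gc_ok(seq) and homopolymer_ok(seq)

def random_target(length, rng):
    """Step 1: resample uniformly until both constraints hold."""
    while True:
        seq = "".join(rng.choice(BASES) for _ in range(length))
        if is_valid(seq):
            return seq

def mutate(target, n_sites, rng):
    """Step 2: substitute n_sites distinct positions with a different base."""
    seq = list(target)
    for pos in rng.sample(range(len(seq)), n_sites):
        seq[pos] = rng.choice([b for b in BASES if b != seq[pos]])
    return "".join(seq)

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def generate_pair(length, bin_lo, bin_hi, rng, max_tries=100):
    """Steps 1-4 for one pair whose distance rate is sampled from the given bin."""
    while True:
        target = random_target(length, rng)
        for _ in range(max_tries):
            rate = rng.uniform(bin_lo, bin_hi)       # target distance rate
            n_sites = round(rate * length)           # number of sites to mutate
            query = mutate(target, n_sites, rng)
            if is_valid(query):                      # Step 3: constraint check
                return target, query, hamming(target, query)  # Step 4
        # all attempts failed: discard this target and start over from Step 1
```

Because every selected site is replaced by a different base, the recorded Hamming distance equals the number of mutated sites. Rounding can place a pair just outside its nominal interval, which is one reason the actual distance is recorded in Step 4.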

[0049] (5) NUPACK simulation calculation of hybridization yield For all generated sequence pairs, the hybridization yield was calculated using the NUPACK thermodynamic analysis software and used as the training label. Following the experimental retrieval protocol of Bee et al., the sequences were preprocessed for simulation: the first sequence of the pair was used as the target strand, with a reverse primer sequence appended at its 3' end; the second sequence was used as the query strand, with a portion of the reverse primer's bases appended at its 3' end as a toehold sequence, followed by taking the reverse complementary strand, to simulate the hybridization behavior of primer-bearing oligonucleotides in actual retrieval experiments. The NUPACK simulation conditions were set as follows: temperature 21°C, initial strand concentration 1 nM, maximum complex size 2, nucleic acid material type DNA, and default DNA thermodynamic parameters. Hybridization yield was defined as the ratio of the query-target duplex complex concentration to the initial single-strand concentration (1 nM) at thermodynamic equilibrium, and ranges from 0 to 1.

[0050] (6) Resuming from breakpoints and ensuring data integrity Given the massive dataset size (NUPACK computation for 2.7 million sequence pairs is time-consuming), a breakpoint-resume mechanism was employed during the generation process: separate checkpoint files were maintained for the sequence pair generation and hybridization yield calculation stages, and these files were automatically flushed to disk after processing a certain number of samples. If the program was interrupted, it could resume from the breakpoint upon restarting, avoiding redundant calculations. A completion marker was written after all calculations were completed for each sequence length, and completed lengths were automatically skipped during batch execution. Finally, the results for all lengths were compiled into a single dataset file containing the sequence length, target sequence, query sequence, normalized Hamming distance, and simulated hybridization yield.
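The breakpoint-resume mechanism described above can be illustrated as follows. This is a sketch under assumptions: the JSON checkpoint format, the `.done` marker suffix, and the names `run_with_checkpoint` and `flush_every` are hypothetical and are not taken from the original implementation; `compute` stands in for the per-pair work (for example, a yield calculation).

```python
import json
import os

def _atomic_dump(obj, path):
    """Write to a temporary file, then rename, so a crash never leaves a torn file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump(obj, fh)
        fh.flush()
        os.fsync(fh.fileno())
    os.replace(tmp, path)

def run_with_checkpoint(items, compute, ckpt_path, flush_every=1000):
    """Process items in order, persisting results so a restart can resume."""
    done_marker = ckpt_path + ".done"
    if os.path.exists(done_marker):            # completed earlier: skip entirely
        with open(ckpt_path) as fh:
            return json.load(fh)
    results = []
    if os.path.exists(ckpt_path):              # resume from the last flush
        with open(ckpt_path) as fh:
            results = json.load(fh)
    for idx in range(len(results), len(items)):
        results.append(compute(items[idx]))
        if len(results) % flush_every == 0:
            _atomic_dump(results, ckpt_path)
    _atomic_dump(results, ckpt_path)
    open(done_marker, "w").close()             # completion marker for this length
    return results
```

At most `flush_every - 1` results are recomputed after an interruption, so the flush interval trades disk writes against repeated work.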

[0051] 3.2 Hybridization Predictor Network Structure Design As shown in Figure 2 and Figure 5, the input to the hybridization predictor is a stacked one-hot encoded representation of a pair of DNA sequences (target sequence and query sequence).

[0052] (1) Input representation Each DNA sequence is first converted into a one-hot encoding matrix: each base position in the sequence is represented by a 4-dimensional vector, corresponding to the four bases A, T, C, and G, where the component corresponding to the actual base at that position is 1, and the rest are 0. Thus, each DNA sequence of length L is encoded as an L×4 one-hot matrix. Subsequently, the one-hot matrices of the target sequence and the query sequence are stacked along the pairing dimension and transposed to form a 4×L×2 three-dimensional tensor as the network input, where 4 corresponds to the base channel, L corresponds to the sequence length, and 2 corresponds to the target-query pairing.

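[0053] One-hot encoding: For a DNA sequence $s = (s_1, s_2, \ldots, s_L)$ of length $L$, where $s_i$ denotes the base at the $i$-th position of the sequence, the one-hot encoded vector for each base position is defined as:

$$e(s_i) = \big[\mathbb{1}(s_i = \mathrm{A}),\ \mathbb{1}(s_i = \mathrm{T}),\ \mathbb{1}(s_i = \mathrm{C}),\ \mathbb{1}(s_i = \mathrm{G})\big]^{\top}$$

[0054] where $e(s_i)$ denotes the one-hot encoded vector of base $s_i$, and $\mathbb{1}(\cdot)$ is the indicator function, which takes the value 1 when the condition in parentheses is true and 0 otherwise. Each base is therefore represented as a 4-dimensional binary column vector whose four dimensions correspond to A, T, C and G, respectively.

[0055] The one-hot encoding matrix $E(s)$ of the entire sequence is:

$$E(s) = [e(s_1), e(s_2), \ldots, e(s_L)] \in \{0,1\}^{4 \times L}$$

[0056] Input tensor construction: given the target sequence $s^{t}$ and the query sequence $s^{q}$, the network input tensor $X$ is constructed as follows:

$$X = \mathrm{Stack}\big(E(s^{t}),\ E(s^{q})\big) \in \mathbb{R}^{4 \times L \times 2}$$

[0057] where $\mathrm{Stack}(\cdot)$ denotes the stacking operation, $E(s^{t})$ corresponds to the target sequence, $E(s^{q})$ corresponds to the query sequence, and $\mathbb{R}$ denotes the set of real numbers.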

[0058] It should be noted that the input to the hybridization predictor uses the forward representations of the target and query sequences (i.e., both in the 5'→3' direction), rather than converting the query sequence into its reverse complement before input. This design is based on the following consideration: the training data of the hybridization predictor carries two kinds of label information at once, namely the normalized Hamming distance of each sequence pair and the NUPACK simulated hybridization yield. The Hamming distance measures the base differences between the two sequences at corresponding positions. Forward sequence input preserves the positional correspondence between the target and query sequences, allowing the local interaction layer to directly capture patterns of base matching and mismatch at corresponding positions, and thereby to learn the mapping between sequence similarity and hybridization thermodynamic behavior.
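The input representation described above can be illustrated with a short, dependency-free Python sketch. The function names `one_hot` and `build_input` are hypothetical, and nested lists stand in for tensors; the channel order A, T, C, G and the 4×L×2 layout follow the text.

```python
BASE_ORDER = "ATCG"  # channel order A, T, C, G, as stated in the text

def one_hot(seq):
    """Return a 4 x L matrix (list of 4 rows), one row per base channel."""
    return [[1.0 if base == ch else 0.0 for base in seq] for ch in BASE_ORDER]

def build_input(target, query):
    """Stack both forward-strand one-hot matrices into a 4 x L x 2 nested list."""
    if len(target) != len(query):
        raise ValueError("target and query must have equal length")
    e_t, e_q = one_hot(target), one_hot(query)
    return [[[e_t[c][i], e_q[c][i]] for i in range(len(target))]
            for c in range(4)]

x = build_input("ATCG", "AACG")
# x[c][i] holds the (target, query) indicator pair for channel c at position i;
# here the two sequences differ only at position 1 (T in the target, A in the query).
```

Both sequences are passed in the 5'→3' direction without reverse-complementing the query, so index `i` refers to the same position in both strands.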

[0059] (2) Local interaction layer The first layer of the network is the Local Interactions Layer, used to capture local complementary matching patterns between two sequences. This layer employs a sliding window mechanism, with the window size parameter w set to 1, corresponding to a sliding window length of 2w + 1 = 3 base positions. For each window position in the sequence, the base encoding segments of the target sequence and the query sequence within that window are extracted. Then, the outer product matrix between the two segments is calculated on each base channel, resulting in a 3×3 local matching matrix. The outer product matrices of the four base channels are flattened and concatenated, generating a 4×9 = 36-dimensional feature vector for each window position. After sliding the window along the sequence direction, a local interaction feature map with a dimension of (L-2w)×36 is output. This design enables the network to capture local complementary matching information between two sequences, including misaligned crossovers, rather than simply detecting position-by-position exact pairings.

[0060] Sliding window extraction: Let $c \in \{1, 2, 3, 4\}$ be the base channel index, whose values correspond to the four bases A, T, C, and G, respectively.

$$u_{i,c} = \big(E^{t}_{c,\,i-w}, \ldots, E^{t}_{c,\,i+w}\big)^{\top}, \qquad v_{i,c} = \big(E^{q}_{c,\,i-w}, \ldots, E^{q}_{c,\,i+w}\big)^{\top}$$

[0061] Here $s^{t}$ and $s^{q}$ denote the target sequence and the query sequence, respectively, and $E^{t}$ and $E^{q}$ denote their one-hot encoding matrices. $u_{i,c}$ denotes the local window vector of the target sequence on the $c$-th base channel, extracted with position $i$ as the center, and $v_{i,c}$ denotes the local window vector of the query sequence on the corresponding channel. In the formula, $u_{i,c}$ and $v_{i,c}$ are local segments extracted from the one-hot encoding matrices of the target sequence and the query sequence, respectively. Both belong to $\mathbb{R}^{2w+1}$, that is, each is a vector of length $2w+1$.

[0062] Local outer product calculation: for each base channel $c$ and window position $i$, the outer product matrix $M_{i,c}$ is calculated:

$$M_{i,c} = u_{i,c}\, v_{i,c}^{\top} \in \mathbb{R}^{(2w+1) \times (2w+1)}$$

[0063] where $M_{i,c}$ denotes the local outer product matrix at position $i$ on the $c$-th base channel, which describes the interaction between the two sequences within that local window. Feature concatenation: the outer product matrices of the four base channels are flattened and concatenated to obtain the local interaction feature vector $h_i$ at position $i$:

$$h_i = \mathrm{Concat}\big(\mathrm{Flatten}(M_{i,1}),\ \mathrm{Flatten}(M_{i,2}),\ \mathrm{Flatten}(M_{i,3}),\ \mathrm{Flatten}(M_{i,4})\big) \in \mathbb{R}^{4(2w+1)^2}$$

[0064] $h_i$ denotes the local interaction feature vector at position $i$, obtained by flattening and concatenating the outer product matrices of the four base channels.

[0065] When $w = 1$, $h_i \in \mathbb{R}^{36}$. The complete output of the local interaction layer is:

$$H = [h_{w+1}, h_{w+2}, \ldots, h_{L-w}]^{\top} \in \mathbb{R}^{(L-2w) \times 36}$$

[0066] The overall output feature matrix $H$ of the local interaction layer is composed of the local interaction feature vectors of all valid window positions, in sequence order. $\mathrm{Flatten}(\cdot)$ denotes the flattening operation, which converts a matrix into a one-dimensional vector. $\mathrm{Concat}(\cdot)$ denotes the concatenation operation, which joins multiple vectors sequentially into a longer vector.
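The local interaction layer described above (a sliding window of 2w + 1 positions, a per-channel outer product, then flattening and concatenation) can be sketched in plain Python. The function names are hypothetical and nested lists replace tensors; this is an illustration of the computation, not the trained layer itself.

```python
BASE_ORDER = "ATCG"  # channel order A, T, C, G

def one_hot(seq):
    """4 x L one-hot matrix, one row per base channel."""
    return [[1.0 if base == ch else 0.0 for base in seq] for ch in BASE_ORDER]

def local_interactions(e_t, e_q, w=1):
    """Return (L - 2w) feature rows, each of length 4 * (2w + 1)**2."""
    length = len(e_t[0])
    features = []
    for i in range(w, length - w):              # valid window centres (0-based)
        row = []
        for c in range(4):                      # base channels
            u = e_t[c][i - w:i + w + 1]         # target window on channel c
            v = e_q[c][i - w:i + w + 1]         # query window on channel c
            # outer product of u and v, flattened row by row
            row.extend(a * b for a in u for b in v)
        features.append(row)
    return features

feats = local_interactions(one_hot("ACGTAC"), one_hot("ACGTAC"))
# 6 - 2 = 4 window positions, 4 channels x 3 x 3 = 36 features per position
```

Entry (a, b) of a channel's outer product is 1 only when the target has that base at window offset a and the query has it at offset b, which is how off-diagonal entries capture shifted matches in addition to position-by-position ones.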

[0067] (3) Pooling and Convolutional Layers The local interaction feature map is processed sequentially by the following layers: first, a one-dimensional average pooling layer with a kernel size of 3, which reduces the spatial dimensionality and smooths the local interaction features; then, a one-dimensional convolutional layer with a kernel size of 3 and 36 output channels (…), followed by a hyperbolic tangent (tanh) activation function, which further extracts higher-order hybridization pattern features of the sequence pair; finally, a global average pooling layer, which compresses feature maps of different lengths into a fixed-length 36-dimensional feature vector, enabling the network to accept sequence inputs of different lengths.

[0068] Forward propagation process: The local interaction feature map, denoted $H$, is processed sequentially through the following operations:

$$P = \mathrm{AvgPool1D}(H)$$

[0069] $$C = \tanh\big(\mathrm{Conv1D}(P)\big)$$

[0070] $$z = \mathrm{GAP}(C) \in \mathbb{R}^{36}$$

[0071] where $P$ denotes the output of the local interaction feature map after one-dimensional average pooling, which smooths the original local interaction features and reduces their spatial dimensionality; $C$ denotes the high-order feature representation obtained after one-dimensional convolution and hyperbolic tangent activation; and $z$ denotes the fixed-length feature vector formed after global average pooling. In the formulas, $\mathrm{AvgPool1D}(\cdot)$, $\mathrm{Conv1D}(\cdot)$ and $\mathrm{GAP}(\cdot)$ denote the one-dimensional average pooling, one-dimensional convolution, and global average pooling operations, respectively, and $\tanh(\cdot)$ denotes the hyperbolic tangent activation function.

[0072] (4) Output layer The 36-dimensional feature vector $z$ obtained after global average pooling is input into a fully connected layer, which outputs a scalar value. This scalar value is then mapped to the range of 0 to 1 by a sigmoid activation function and serves as the predicted hybridization yield. Hybridization yield prediction:

$$\hat{y} = \sigma\big(w^{\top} z + b\big)$$

[0073] where $w \in \mathbb{R}^{36}$ is the weight vector of the fully connected layer, $b$ is a bias term, and $\hat{y} \in (0,1)$ is the predicted hybridization yield.

[0074] The Sigmoid function is defined as:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

[0075] where $\sigma(\cdot)$ denotes the Sigmoid activation function, which maps the output of the fully connected layer into the interval $(0, 1)$ to give the final probabilistic prediction; $x$ is the input variable of the function, and $e$ is the natural constant.

[0076] 3.3 Training Methods for Hybridization Predictors (1) Training data preparation The 2.7 million DNA sequence pairs generated using the aforementioned method and their corresponding NUPACK simulated hybridization yields were used as training data. For each sequence length dataset (100,000 pairs per length), the sequence pairs were converted into one-hot encoded stacked tensor representations as model inputs, with the NUPACK simulated hybridization yields used as training labels.

[0077] (2) Division of training set and validation set To ensure that the model's generalization ability across different sequence similarity intervals is adequately evaluated, a grouping sampling strategy based on normalized Hamming distance is used to divide the training and validation sets. Specifically, data of each sequence length are sorted according to normalized Hamming distance and divided into 10 equal groups. Within each group, 20% of the samples are randomly selected as the validation set, and the remaining 80% are used as the training set. This strategy ensures that the validation set has uniform coverage across all sequence similarity intervals, avoiding the lack of validation samples in certain similarity intervals due to random partitioning.
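The grouped split described above can be sketched as follows. The record layout (a list of dictionaries with a `norm_hamming` key), the function name, and the fixed random seed are assumptions made for illustration; the 10 groups and the 20% validation fraction come from the text.

```python
import random

def stratified_split(records, n_groups=10, val_fraction=0.2, seed=0):
    """Sort by normalized Hamming distance, cut into equal groups, sample per group."""
    rng = random.Random(seed)
    ordered = sorted(records, key=lambda r: r["norm_hamming"])
    size = len(ordered) // n_groups
    train, val = [], []
    for g in range(n_groups):
        lo = g * size
        # the last group absorbs any remainder when the count is not divisible
        hi = (g + 1) * size if g < n_groups - 1 else len(ordered)
        group = ordered[lo:hi]
        rng.shuffle(group)
        n_val = int(round(val_fraction * len(group)))
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val
```

With 100,000 pairs per sequence length this yields 80,000 training and 20,000 validation pairs, with 2,000 validation pairs drawn from each of the 10 groups.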

[0078] (3) Loss function and optimizer The binary cross-entropy loss function is used as the training loss. The hybridization yield ranges from 0 to 1, and the network output is activated by a sigmoid function, which is consistent with a probabilistic prediction task. Compared with the mean squared error loss (MSE Loss), the binary cross-entropy loss provides a stronger gradient signal when the target is close to 0 or 1: when the true hybridization yield is close to 1 and the predicted value is low, or when the true hybridization yield is close to 0 and the predicted value is high, the gradient generated by the cross-entropy loss is significantly larger than that of the mean squared error loss, and can more effectively drive the model to correct prediction biases in these boundary regions. This characteristic is particularly important for the hybridization predictor, because in the anti-counterfeiting verification task, accurately distinguishing the two extreme regions of high hybridization yield (genuine-genuine signature pairs) and low hybridization yield (genuine-forged signature pairs) directly determines verification performance. Experiments also verified that the binary cross-entropy loss is superior to the mean squared error loss in prediction accuracy. Binary cross-entropy loss:

$$\mathcal{L}_{\mathrm{BCE}} = -\frac{1}{N} \sum_{i=1}^{N} \big[\, y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \,\big]$$

[0079] where $\mathcal{L}_{\mathrm{BCE}}$ denotes the value of the binary cross-entropy loss function; $N$ denotes the number of samples in a batch; $y_i$ denotes the true label of the $i$-th sample; $\hat{y}_i$ denotes the predicted output for the $i$-th sample; $\sum$ denotes summation of the losses of all samples within a batch; and $\log$ denotes the logarithm operation.
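The gradient argument above can be checked numerically for a single sample. The sketch below assumes a sigmoid output with pre-activation `z` (a symbol introduced here for illustration); under that assumption the derivative of the binary cross-entropy with respect to `z` is the prediction minus the label, while the derivative of the squared error carries an additional factor of 2p(1 - p) that vanishes as the prediction p saturates.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(z, y):
    """Binary cross-entropy for one sample with sigmoid output."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def mse(z, y):
    """Squared error for one sample with sigmoid output."""
    return (sigmoid(z) - y) ** 2

def grad_bce(z, y):
    """d(BCE)/dz = p - y."""
    return sigmoid(z) - y

def grad_mse(z, y):
    """d(MSE)/dz = 2 (p - y) p (1 - p)."""
    p = sigmoid(z)
    return 2.0 * (p - y) * p * (1.0 - p)

# True yield near 1 but a confidently low prediction (p is roughly 0.018).
z, y = -4.0, 1.0
ratio = abs(grad_bce(z, y)) / abs(grad_mse(z, y))
# ratio equals 1 / (2 p (1 - p)), so it grows as the wrong prediction saturates
```

In this example the cross-entropy gradient is more than an order of magnitude larger than the squared-error gradient, which matches the behavior described for the boundary regions.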

[0080] The optimizer used is RMSprop, and the learning rate is set to 0.001.

[0081] (4) Training strategies Hybridization predictor models were trained independently on the datasets of the 27 different sequence lengths, with one predictor per length. This length-specific training strategy allows each predictor to focus on learning the hybridization thermodynamics at a specific sequence length, avoiding interference from feature differences between sequences of different lengths. In the subsequent encoder training phase, a predictor of the appropriate length can be selected according to the sequence design requirements of the actual biological experiment. Each length was trained for 20 epochs with a batch size of 64. After each epoch, the loss was evaluated on the validation set, and the training and validation loss curves were recorded to monitor convergence. After training, the model's predictive accuracy was evaluated on the validation set using three metrics: the mean absolute error (MAE), the coefficient of determination (R²), and the Pearson correlation coefficient.

[0082] (5) Training results Figure 6 shows performance visualization results of the hybridization predictor for some of the sequence lengths, where Figure 6(a) is the result for 30 nt sequences, Figure 6(b) for 40 nt sequences, Figure 6(c) for 60 nt sequences, and Figure 6(d) for 80 nt sequences. As can be seen from the figure, the hybridization predictor performs best on 80 nt sequences. Therefore, the encoder of this invention preferably uses 80 nt as the encoding length.

[0083] After training is complete, all parameters of the hybridization predictor are fixed and used as a non-trainable differentiable surrogate model in the subsequent encoder training phase, replacing the non-differentiable NUPACK software in end-to-end gradient backpropagation.

[0084] IV. Selection of Feature Extractors 4.1 Feature Extractor Selection As shown in Figure 2, the feature extractor module in this invention is used to extract a discriminative high-dimensional representation from a handwritten signature image. For the offline handwritten signature anti-counterfeiting scenario addressed by this invention, the feature extractor must meet the following requirements: task adaptability, authenticity discrimination capability, generalization ability, and interface compatibility.

[0085] The feature extractor module in this invention is used to extract a discriminative high-dimensional representation from the handwritten signature image, and this representation serves as input to the subsequent sequence encoder to achieve a task-driven mapping from the signature image to the DNA sequence. The choice of feature extractor directly affects the separability of the subsequent DNA encoding results for the two types of pairings: "genuine signature-genuine signature" and "genuine signature-forged signature". For the offline handwritten signature anti-counterfeiting scenario targeted by this invention, the feature extractor should preferably meet the following requirements: 1) Task adaptability: It should effectively represent fine-grained differences in offline signature images, such as stroke shape, line quality, stroke connection, and local forgery traces, rather than simply providing general visual features for natural image classification; 2) Authenticity discrimination capability: It should retain good discrimination capability under skilled forgery conditions, in particular showing stable performance on the equal error rate (EER) metric; 3) Generalization capability: The feature space learned on one dataset should be transferable to other signature datasets or other signers, reducing the need to retrain the feature extractor for each scenario; 4) Interface compatibility: The output feature dimension and representation should be easy to couple with the subsequent sequence encoder module of this invention, enabling stable calls in the end-to-end training process. The above requirements correspond to the shortcomings of existing methods pointed out in the technical background of this invention: general image feature extractors (such as VGG-type and LeNet-type) are not optimized for signature anti-counterfeiting tasks and have difficulty fully capturing the key differences between genuine and forged signatures.

[0086] This embodiment selects the SigNet-F pre-trained model, which is based on the Sigver model and performs well in signature verification tasks, as the feature extractor (as shown in Figure 2). Its weights are pre-trained on GPDS-960, the largest publicly available offline signature verification dataset, with the parameter set to λ = 0.95. This model achieves good performance on several publicly available offline signature verification datasets (such as CEDAR and MCYT), with a lower equal error rate (EER) than many existing methods. It also outperforms general image feature extraction networks (such as VGG19 and CaffeNet) on signature verification tasks, indicating better task adaptability to signature stroke structure and to the differences between genuine and forged signatures. Using this pre-trained model reduces the cost of training a feature extractor from scratch, improves the stability of the subsequent sequence encoder training stage, and enhances the generalization ability of the overall scheme.

[0087] Figure 7 shows the distribution in feature space of the feature vectors extracted from the CEDAR dataset by the pre-trained SigNet-F feature extractor (as shown in Figure 2). As can be seen from the figure, the feature clusters of different users are relatively clear and distinct, indicating that the feature extractor has good discriminative ability.

[0088] 4.2 Feature Extraction and Preprocessing After determining the feature extractor module, the signature features output by the feature extractor are used as the input to the subsequent sequence encoder module. To facilitate the construction of subsequent training samples and task-driven training, the feature data needs to be preprocessed and organized in a unified manner.

[0089] In this embodiment, the CEDAR dataset is used, and a genuine signature feature matrix and a forged signature feature matrix are obtained for each of its 55 signers. Each signer has 24 genuine signatures and 24 forged signatures, corresponding to a genuine signature feature matrix of shape (24, 2048) and a forged signature feature matrix of shape (24, 2048). Across all 55 signers, this forms a feature dataset organized by "signer - genuine / forged category". This organization retains information in both the signer dimension and the genuine / forged category dimension, facilitating subsequent batch parsing and sample-level expansion.

[0090] Based on this, further data processing is performed. The aforementioned feature data is uniformly organized, and the feature matrix grouped by signer and authenticity category is converted into sample-level structured data that can be directly used for subsequent encoder training and testing. The processing includes: parsing the signer number and authenticity label, expanding the feature matrix row by row, adding sample sequence number and label information to each sample, and summarizing the 2048-dimensional feature vector into a unified structured data table after expanding it column by column.

[0091] Specifically, the processing method can perform the following operations: Tag parsing and identifier generation: Based on the organization of feature data, determine the signer ID (user_id) and authenticity label (is_forged) for each sample; where real signatures and forged signatures correspond to different label values; Sample-level expansion: Expand the real signature feature matrix and the forged signature feature matrix of each signer by row, with each row corresponding to a 2048-dimensional feature vector of a signature sample.

[0092] Structured record construction: For each sample record, save the signer ID (user_id), sample index (sample_index), and authenticity label (is_forged), and expand the 2048-dimensional feature vector into the columns feat_0 to feat_2047; Unified summary output: The sample records of all signers are summarized into a unified structured data table for subsequent training sample construction, statistical analysis and model calling.

[0093] Through the above processing, the original feature matrix grouped by signer and authenticity category is transformed into a unified data representation. On the one hand, this representation retains traceable information such as signer identity, authenticity category, and sample sequence number; on the other hand, the sample features can be directly used as input to the subsequent sequence encoder module, thereby providing a standardized data foundation for the subsequent DNA sequence encoding and hybridization prediction task-driven training of this invention.
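The sample-level expansion described above can be sketched as a small Python function. The input layout (a dictionary keyed by signer ID with `"genuine"` and `"forged"` matrices), the label values 0 for genuine and 1 for forged, and the 1-based `sample_index` are assumptions made for this illustration; the field names `user_id`, `sample_index`, `is_forged` and `feat_0` onward come from the text.

```python
def to_records(features_by_user):
    """Flatten {user_id: {"genuine": matrix, "forged": matrix}} into row records."""
    records = []
    for user_id, groups in sorted(features_by_user.items()):
        # assumed label convention: 0 = genuine, 1 = forged
        for is_forged, key in ((0, "genuine"), (1, "forged")):
            for sample_index, vector in enumerate(groups[key], start=1):
                record = {
                    "user_id": user_id,
                    "sample_index": sample_index,
                    "is_forged": is_forged,
                }
                # expand the feature vector column by column: feat_0, feat_1, ...
                record.update({f"feat_{j}": v for j, v in enumerate(vector)})
                records.append(record)
    return records
```

For CEDAR as described (55 signers, 24 genuine and 24 forged signatures each, 2048 features), this produces 55 × 48 = 2640 records of 2051 fields each.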

[0094] V. Structure and Training of Sequence Encoders 5.1 Sequence Encoder Network Structure Design After preprocessing and structuring the signature image features, this embodiment further designs a sequence encoder module to map the input signature feature vector into a differentiable DNA sequence probability representation. Specifically, it is responsible for mapping the 2048-dimensional feature vector output by the feature extractor into a DNA sequence of 80 nucleotides. The sequence encoder adopts a lightweight mapping structure based on fully connected layers. For each input signature feature vector (2048 dimensions), the encoder sequentially performs feature compression, sequence space mapping, reshaping, and base probability normalization, outputting a fixed-length DNA sequence probability matrix. The network structure design of the sequence encoder is as follows: Input layer: 2048-dimensional signature feature vector; First fully connected layer: maps 2048 dimensions to a hidden dimension of 1024; activation function: ReLU; The second fully connected layer maps 1024 dimensions to 320 dimensions; Reshaping operation: Reshapes a 320-dimensional vector into an 80×4 matrix; Softmax normalization: Normalize the sequence at each site in 4 dimensions to obtain the probability distribution of A / C / G / T at that site.

[0095] The encoder outputs a probability matrix of shape (80, 4) for a single sample, where: the sequence length is 80nt; each row represents the probability distribution of the four bases at the corresponding sequence site.
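The forward pass of the sequence encoder (fully connected layer, ReLU, second fully connected layer, reshape, per-site softmax) can be sketched without any framework. The weight initialization, the function names, and the use of nested lists are assumptions for illustration only; in the described embodiment the dimensions would be 2048, 1024 and 320 with a sequence length of 80.

```python
import math
import random

def init_linear(n_in, n_out, rng):
    """Hypothetical uniform initialization; returns (weight rows, bias)."""
    scale = 1.0 / math.sqrt(n_in)
    weight = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def linear(x, weight, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(row):
    m = max(row)                                # subtract the max for stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def encode(feature, params, seq_len=80):
    """Map one feature vector to a seq_len x 4 matrix of base probabilities."""
    (w1, b1), (w2, b2) = params
    hidden = relu(linear(feature, w1, b1))      # feature compression
    logits = linear(hidden, w2, b2)             # sequence space mapping
    if len(logits) != 4 * seq_len:
        raise ValueError("second layer must output 4 * seq_len values")
    rows = [logits[4 * i:4 * i + 4] for i in range(seq_len)]   # reshape
    return [softmax(r) for r in rows]           # per-site normalization
```

For a pair of inputs, calling `encode` twice with the same `params` gives the shared-parameter behavior in which both signatures are encoded by identical rules.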

[0096] Considering that the training phase of this invention uses signature feature pairs as the basic input units (e.g., genuine-genuine pairs and genuine-forged pairs), this invention employs a Siamese-like shared encoding structure. A pair of input features (f1, f2) is fed into the same sequence encoder with shared parameters, yielding two DNA sequence probability outputs (p1, p2). This design ensures that the two branches use consistent encoding rules, which helps establish the pairing constraints stably.

[0097] To enhance the ability of the encoding results to preserve the signer's identity information, this invention further includes a user identity auxiliary classification branch. This branch takes the DNA probability sequence output by the encoder as input, flattens it, and performs a fully connected mapping to output the user category prediction result. During training, this branch, along with the user labels, constitutes auxiliary supervision. Preferably, in the CEDAR dataset scenario, the number of user categories is set to 55.

[0098] 5.2 Training Data Generation and Paired Sample Construction To fit the Siamese (twin) sequence encoder training framework of this invention, the sample-level structured signature feature data needs to be further converted into a dataset of "paired inputs". This invention uses a data processing script to perform per-user partitioning, training / testing partitioning, and positive / negative sample pair construction on the standardized feature data table, generating a balanced paired dataset for sequence encoder training and validation.

[0099] The script first reads the standardized feature data table and automatically identifies feature columns starting with "feat_" as dimensions of the signature feature vector. This data table contains at least the following fields: signer ID (user_id), sample index (sample_index), authenticity label (is_forged), and 2048 feature columns from feat_0 to feat_2047.

[0100] Within each signer, the script uses a fixed partitioning of real and forged signatures based on `sample_index`: the training set samples are `sample_index<20` (samples 1-19), and the test set samples are `sample_index>=20` (samples 20-24). Simultaneously, it distinguishes between real and forged signature features based on `is_forged`, thus creating a subset of real / forged features for each user during both the training and testing phases.

[0101] For each signer, balanced training samples are constructed using two types of pairings during the training phase: "true-true" and "true-false".

[0102] True-True Pairing: Constructed by pairwise combinations of the 19 true signature features in the user's training set. For a single user, $\binom{19}{2} = 171$ true-true pairs are obtained.

[0103] True-False Pairing: Construct a candidate set from the real signature features and fake signature features of the user's training set, and truncate it according to the principle of the same number of true-true pairs to achieve a balance of positive and negative samples within each user.

[0104] For each signer, the pairing logic in the testing phase is the same as for the training set, but is based on the test samples (samples 20-24). True-true pairs are obtained by pairwise combinations of the five true signature features in the test set, giving $\binom{5}{2} = 10$ true-true pairs; true-false pairs are likewise truncated to the same number as the true-true pairs to balance the positive and negative samples in the test set.

[0105] For each constructed paired sample, the script generates a unified structured record, including at least: label (pairing label, 1 for true-true, 0 for true-false), user_label (signer ID), feat1_0 to feat1_2047 (2048-dimensional features of the first signature sample), and feat2_0 to feat2_2047 (2048-dimensional features of the second signature sample). Finally, training and testing paired data files are generated for use by the sequence encoder training script.

[0106] Based on the above pairing construction strategy, the sizes of the sequence encoder training and testing datasets in this embodiment are as follows. For each signer, during the training phase, the 19 genuine signature samples are used to construct 171 true-true pairs, and 171 true-false pairs are constructed according to the equal quantity principle, so the training set of a single user contains 342 paired samples. Given that the CEDAR dataset contains 55 signers, the total number of training paired samples is 55 × 342 = 18810. Similarly, during the testing phase, the 5 genuine signature samples are used to construct 10 true-true pairs, and 10 true-false pairs are constructed, so the test set of a single user contains 20 paired samples. For 55 signers, the total number of test paired samples is 55 × 20 = 1100.
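The pairing scheme and the resulting dataset sizes can be reproduced with a short sketch. The function name and the seeded shuffle applied before truncation are assumptions; the text states only that the true-false candidates are truncated to the number of true-true pairs, not how the candidates are ordered.

```python
import random
from itertools import combinations, product

def build_pairs(genuine, forged, seed=0):
    """Return (first, second, label) triples: label 1 = true-true, 0 = true-false."""
    positives = [(a, b, 1) for a, b in combinations(genuine, 2)]
    candidates = [(g, f, 0) for g, f in product(genuine, forged)]
    # assumed: shuffle so truncation does not favour the first few genuine samples
    random.Random(seed).shuffle(candidates)
    negatives = candidates[:len(positives)]
    return positives + negatives

# One signer: sample_index 1-19 for training, 20-24 for testing.
train = build_pairs(list(range(1, 20)), [f"f{i}" for i in range(1, 20)])
test = build_pairs(list(range(20, 25)), [f"f{i}" for i in range(20, 25)])
per_user = (len(train), len(test))                  # (342, 20)
totals = (55 * len(train), 55 * len(test))          # (18810, 1100)
```

The candidate pool is large enough in both phases (19 × 19 = 361 and 5 × 5 = 25 true-false candidates), so truncation to 171 and 10 pairs is always possible.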

[0107] The above construction method has the following advantages: The training and test sets for each signer maintain a balance in the number of positive and negative samples, which helps to avoid training bias caused by skewed sample distribution for some users. The paired labels directly correspond to the two relationships of "real-real" and "real-forged" in the anti-counterfeiting target of this invention, which can provide clear supervision for subsequent encoder training. The fixed partitioning method based on sample_index ensures the determinism of training / testing partitioning, which facilitates repeated experiments, parameter comparison and subsequent system evaluation. This data construction method does not depend on a specific network structure. When the feature dimension, encoding length or auxiliary task changes, only the consistency of the sample-level structured field definition needs to be maintained to reuse the same pairing generation process.

[0108] 5.3 Multi-task joint training strategy

To ensure that the DNA sequence output by the encoder simultaneously meets the requirements of task prediction, feature semantic preservation, and identity discrimination, this embodiment employs a multi-task joint training strategy. The total loss consists of the following three parts:

(1) Task prediction loss (Yield Loss). The DNA probability sequence output by the encoder is converted into an approximately discrete base representation by a differentiable Softmax discretization function, organized into paired input form, and fed into the pre-trained hybridization predictor. A regression loss is calculated between the predictor output and the pairing labels, with the mean squared error (MSE) chosen as the task prediction loss. In this embodiment, the parameters of the hybridization predictor are frozen during the sequence encoder training phase, and the predictor is used only as a task-driven supervision module.

[0109] (2) Semantic consistency loss (Reconstruction Loss). To maintain the relative semantic relationship between sample pairs before and after encoding, a semantic consistency constraint is introduced. This constraint constructs a consistency loss (preferably the mean squared error of the cosine similarity difference) by comparing the similarity of the original signature feature pair in the input space with the similarity of the corresponding DNA sequences in the predictor's internal feature space. This design reduces the disruption of the original discriminative structure during task optimization.

[0110] (3) User identity classification loss. The two DNA probability sequences are each input into the user identity auxiliary classification branch. A classification loss (preferably cross-entropy loss) is calculated for each using the corresponding user labels. The losses of the two branches are then combined to enhance the identity preservation capability of the encoding results.

[0111] In this invention, the sequence encoder training phase employs a parameter update strategy of "fixed front-end features, fixed back-end predictor, and trainable intermediate encoder." Specifically, the feature extractor provides stable signature feature representations, and the hybridization predictor provides differentiable task supervision signals; both keep their parameters frozen during encoder training. Only the sequence encoder and its user identity auxiliary classification branch participate in gradient updates. This strategy avoids predictor parameter drift caused by the unstable encoder output distribution in the early stages of training, thereby improving overall training stability.

[0112] In multi-task joint optimization, the numerical range and convergence speed of different loss terms may vary. To avoid a single loss term dominating in the early stages of training and inhibiting the learning of other objectives, this invention employs a weighted loss method to balance the task prediction loss, semantic consistency loss, and user identity classification loss.

[0113] Along the gradient propagation path, the gradients of the task prediction loss and semantic consistency loss are backpropagated to the sequence encoder output through the frozen hybridization predictor, but the predictor weights are not updated; the gradient of the user identity classification loss is directly propagated to the encoder output through the auxiliary classification branch. In this way, the encoder parameters are simultaneously subject to the joint constraints of the "hybridization behavior objective" and the "identity discrimination objective".

[0114] The total loss function and parameter optimization of the model are as follows. This invention jointly optimizes a weighted multi-task loss, with the total loss $\mathcal{L}_{total}$ calculated using the following formula:

$$\mathcal{L}_{total} = \lambda_{yield}\,\mathcal{L}_{yield} + \lambda_{recon}\,\mathcal{L}_{recon} + \lambda_{cls}\,\mathcal{L}_{cls}$$

[0115] where $\mathcal{L}_{total}$ represents the total joint loss function for encoder training. $\mathcal{L}_{yield}$ is the yield loss: the mean squared error (MSE) between the predicted yield and the target label, which constrains the generated DNA sequence to meet the preset hybridization yield requirements. $\mathcal{L}_{recon}$ is the semantic consistency (reconstruction) loss: it aligns the cosine similarity of the input feature vectors with that of the generated sequences in the predictor's feature space, so that the generated DNA sequence retains the semantic information of the original data. $\mathcal{L}_{cls}$ is the identity discrimination loss: a cross-entropy term that forces the model to embed user-specific features in the generated sequence in order to achieve identity traceability.

[0116] $\lambda_{yield}$, $\lambda_{recon}$, and $\lambda_{cls}$ are the weighting coefficients for the corresponding loss terms, used to balance the importance of prediction accuracy, semantic preservation, and identity recognition in multi-task learning. The values of the weighting coefficients are tuned empirically to balance the differences in scale between the loss terms and to assign the semantic reconstruction loss a higher weighting ratio. This aims to ensure that, while meeting the yield and identity constraints, the model prioritizes faithful reproduction of the original feature information in the generated sequence. The final values are $\lambda_{yield}$ = ..., $\lambda_{recon}$ = ..., $\lambda_{cls}$ = ... .
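A minimal per-pair sketch of the weighted three-term loss is given below, assuming scalar inputs already produced by the frozen predictor and the auxiliary classifier. All function names are hypothetical, the default weights of 1.0 are placeholders rather than the values used in this embodiment, and summing the two classification branches is an assumption (the description only says they are combined).

```python
import math


def cross_entropy(logits, true_class):
    """Cross-entropy of one sample from raw logits (log-sum-exp form)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[true_class]


def total_loss(yield_pred, pair_label, feat_sim, seq_sim,
               logits_a, logits_b, user_id,
               w_yield=1.0, w_recon=1.0, w_cls=1.0):
    """Weighted sum of yield, semantic consistency, and identity losses."""
    # Yield loss: squared error between predicted yield and pairing label
    l_yield = (yield_pred - pair_label) ** 2
    # Semantic consistency: squared difference of the two cosine similarities
    l_recon = (feat_sim - seq_sim) ** 2
    # Identity loss: cross-entropy of both branches, summed (assumption)
    l_cls = cross_entropy(logits_a, user_id) + cross_entropy(logits_b, user_id)
    return w_yield * l_yield + w_recon * l_recon + w_cls * l_cls
```

In an actual training script these terms would be averaged over a batch and computed with an automatic differentiation framework so that gradients reach the encoder.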

[0117] 5.4 Parameter Update Strategy

In this invention, the sequence encoder training phase employs a parameter update strategy of "fixed front-end features, fixed back-end predictor, and trainable intermediate encoder." Specifically, the feature extractor provides stable signature feature representations, and the hybridization predictor provides differentiable task supervision signals; both keep their parameters frozen during encoder training. Only the sequence encoder and its user identity auxiliary classification branch participate in gradient updates.

[0119] During parameter updates, the Adam (Adaptive Moment Estimation) optimizer is used to jointly update the sequence encoder and the user identity auxiliary classification branch, with a learning rate of 0.001.
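The update rule can be illustrated with a single Adam step over scalar parameters in which frozen modules are skipped. This is a simplified sketch, not the training script: the parameter names are hypothetical, real parameters are tensors, and the values `beta1=0.9`, `beta2=0.999`, and `eps=1e-8` are the commonly used Adam defaults, assumed here because the description only states the learning rate of 0.001.

```python
import math


def adam_step(params, grads, state, trainable, lr=0.001,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one Adam update in place; names not in `trainable` stay frozen."""
    state["t"] = state.get("t", 0) + 1
    t = state["t"]
    for name in params:
        if name not in trainable:
            continue  # frozen module (feature extractor, hybridization predictor)
        g = grads[name]
        # Exponential moving averages of the gradient and squared gradient
        m = beta1 * state.get(("m", name), 0.0) + (1 - beta1) * g
        v = beta2 * state.get(("v", name), 0.0) + (1 - beta2) * g * g
        state[("m", name)], state[("v", name)] = m, v
        # Bias-corrected estimates
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        params[name] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params
```

For example, calling `adam_step` with `trainable={"encoder.w", "id_branch.w"}` changes only those two entries and leaves any `predictor.*` entry untouched.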

[0120] 5.5 Training Process and Results

The training process of the sequence encoder includes: initializing the sequence encoder with its twin (Siamese) weight-sharing encoding structure, the user identity auxiliary classification branch, and the pre-trained hybridization predictor; loading the training / validation paired data; performing forward computation, loss calculation, backpropagation, and parameter update in each round; and evaluating task loss and auxiliary classification performance on the validation set.

[0121] Figure 8 shows how the true-true recognition rate (Recall), the true-false interception rate (TNR), and the average accuracy (AvgAcc) of the encoded sequences on the test set change with the number of encoder training epochs when an 80 nt hybridization predictor is selected. The highest average accuracy is achieved at epoch 30, with a true-true recognition rate of 91.64%, a true-false interception rate of 99.64%, and an average accuracy of 95.64%. This indicates that the sequences produced by the encoder of this invention can meet the requirements for anti-counterfeiting verification.
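The three reported metrics are related as sketched below. The sketch assumes that AvgAcc is the mean of Recall and TNR, which agrees with the reported figures. The counts 504 and 548 are hypothetical values chosen to be consistent with the reported percentages on a balanced test set of 550 true-true and 550 true-false pairs; they are not taken from the source.

```python
def verification_metrics(tp, fn, tn, fp):
    """Return (Recall, TNR, AvgAcc) as percentages."""
    recall = 100.0 * tp / (tp + fn)  # share of true-true pairs accepted
    tnr = 100.0 * tn / (tn + fp)     # share of true-false pairs rejected
    return recall, tnr, (recall + tnr) / 2.0


# Hypothetical confusion counts for 550 + 550 test pairs
recall, tnr, avg_acc = verification_metrics(tp=504, fn=46, tn=548, fp=2)
print(f"{recall:.2f} {tnr:.2f} {avg_acc:.2f}")  # 91.64 99.64 95.64
```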

[0122] After the sequence encoder is trained, the parameters of the feature extractor and the sequence encoder are fixed so that the corresponding DNA coding sequence can be generated for any signature image to be processed. Specifically, the input signature image is first converted into a 2048-dimensional signature feature vector by the feature extractor, and the trained encoder then outputs a base probability matrix of shape 80×4. Subsequently, the maximum probability selection rule converts the four-dimensional probability distribution at each site into a single base (equivalently, a one-hot code), thereby obtaining a deterministic DNA sequence of length 80 nt.
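The maximum probability selection rule can be sketched as a per-site argmax. The column order A, C, G, T is an assumption of this sketch, since the description does not state which column corresponds to which base; the function name is hypothetical.

```python
BASES = "ACGT"  # assumed column order of the base probability matrix


def decode_max_probability(prob_matrix):
    """Pick the most probable base at each site of an L x 4 matrix."""
    sequence = []
    for row in prob_matrix:
        if len(row) != len(BASES):
            raise ValueError("each site needs exactly four base probabilities")
        best = max(range(len(BASES)), key=lambda i: row[i])
        sequence.append(BASES[best])
    return "".join(sequence)
```

Applied to an 80×4 matrix, the function returns an 80 nt string; ties are resolved in favor of the first base in `BASES`.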

[0123] VI. DNA Encoding Storage Method for Handwritten Signatures

After the above three stages of training, any input handwritten signature image can generate a corresponding 80 nt DNA sequence using a feature extractor and encoder with fixed parameters. This DNA sequence is the feature encoding sequence of the signature image and can be used as a key to authenticate the signature in a DNA storage system.

[0124] In a DNA storage system, the DNA sequence corresponding to each signer's registered signature can be used as a target sequence in the database. In addition to the feature coding region, each DNA oligonucleotide may also contain: (1) a unique barcode region for identifying the signer's identity ID; (2) a primer region for PCR amplification and subsequent processing; (3) an error correction coding region for correcting and reconstructing DNA storage codes after certain errors occur. The DNA oligonucleotides of all signers can be stored together in the same DNA pool.

[0125] Once the DNA sequence is designed, it can be synthesized into actual DNA oligonucleotide molecules using DNA synthesis techniques (such as phosphoramidite chemical synthesis), thus enabling the storage of handwritten signature information in DNA.

[0126] VII. Anti-counterfeiting verification methods based on molecular hybridization

When it is necessary to verify whether a signature to be checked is the genuine signature of a certain signatory, the following steps are performed:

Step 1: Input the image of the signature to be tested into the trained feature extractor and encoder to generate its corresponding 80 nt DNA feature coding sequence.

[0127] Step 2: Synthesize the reverse complement of the coding sequence and add a biotin label to its 5' end to prepare a biotinylated probe oligonucleotide.
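The in-silico part of this step, deriving the probe sequence before synthesis, can be sketched as follows. The function names and the `5'-Biotin-...-3'` string notation are illustrative assumptions; an actual synthesis order would use the notation required by the synthesis provider.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    seq = seq.upper()
    if set(seq) - set("ACGT"):
        raise ValueError("sequence may only contain A, C, G, T")
    return seq.translate(_COMPLEMENT)[::-1]


def biotinylated_probe(coding_seq):
    """Describe a 5'-biotin probe as a plain string (hypothetical notation)."""
    return "5'-Biotin-" + reverse_complement(coding_seq) + "-3'"
```

For example, the coding sequence `ATGC` gives the probe sequence `GCAT`, which pairs with `ATGC` along its whole length.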

[0128] Step 3: Perform an annealing hybridization reaction between the probe oligonucleotide and the registered signature DNA sequence of the signatory in the DNA storage database. If the signature to be tested is the signatory's genuine signature, the probe sequence and the target sequence should have a high degree of complementarity, forming a stable double-stranded hybridization structure, resulting in a high hybridization yield; if the signature to be tested is a forged signature, the complementarity between the probe sequence and the target sequence will be low, and the hybridization yield will be low.

[0129] Step 4: Determine the hybridization result using molecular detection methods. Optional detection methods include, but are not limited to: (a) using streptavidin-coated magnetic beads for magnetic bead separation, followed by high-throughput sequencing to read the captured sequences and their abundance; sequences with high hybridization yields will be preferentially captured; (b) using nucleic acid fluorescent dyes such as SYBR Green I to detect the fluorescence intensity of the hybridization products; higher fluorescence intensity indicates a higher degree of hybridization and a greater likelihood that the signature to be tested is a genuine signature; (c) using electrochemical sensors and other methods to detect the hybridization signal.

[0130] Step 5: Set the hybridization yield threshold. If the detected hybridization yield or fluorescence intensity exceeds the preset threshold, the signature to be tested is determined to be genuine; otherwise, it is determined to be a forged signature. This threshold can be adjusted according to the security requirements of specific application scenarios.
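The decision rule in Step 5 amounts to a single comparison, sketched below. The function name and the default threshold of 0.5 are placeholders for illustration; the description leaves the threshold to the security requirements of the application.

```python
def verify_signature(measured_signal, threshold=0.5):
    """Classify a signature from a measured hybridization yield or intensity."""
    # "Exceeds" is read as strictly greater than the preset threshold
    return "genuine" if measured_signal > threshold else "forged"
```

Raising the threshold rejects more forgeries at the cost of rejecting more genuine signatures, which is the trade-off between the true-false interception rate and the true-true recognition rate.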

[0131] VIII. Handwritten signature image capture webpage design

To support the collection and subsequent verification of self-built signature samples in addition to the CEDAR public dataset, this invention further provides a handwritten signature collection webpage, which is used to collect images of the subject's real and forged signatures and to generate standardized image data that can be used for subsequent feature extraction, encoding training, simulation experiments, and wet experiment verification.

[0132] 8.1 Webpage Function Design

This web system provides functions such as signature type selection (real signature / forged signature), signer ID management, canvas size settings, save size settings (including 2:1 automatic constraint), background color and handwriting color settings, and stroke thickness adjustment. It can complete the collection and organization of signature samples from multiple users and categories under a unified collection standard.

[0133] The system front end adopts a canvas drawing interaction method, supports mouse and touch screen input, and provides operations such as undo, redo, and clear canvas, making it convenient for subjects and experimenters to complete signature entry under different terminal conditions.

[0134] 8.2 Data Quality Control

For data quality control, the webpage displays a visual guide box that constrains the signature writing area during data collection, and it automatically crops and scales the central area of the canvas when saving. The output signature images therefore have consistent size and proportion, which reduces morphological deviations caused by differences in writing position, canvas size, and data collection device.

[0135] The system provides real-time preview and saved signature preview functions, which can instantly check the current signature status and historical saved results; and assists in the collection process management through statistical information such as "current signer count / 24" and "next number", supports number increment and missing number filling according to signer and signature type, which is conducive to forming a standardized data structure similar to CEDAR (e.g., a fixed number of real signatures and forged signatures for each signer).

[0136] 8.3 Data Export and Management

The system supports local storage persistence, management of saved sample lists, single deletion, full reset, and batch download (ZIP). It also automatically exports a metadata table (metadata.csv) containing information such as filename, signature type, signer ID, acquisition time, image size, cropping area, and handwriting parameters, thus providing a foundation for subsequent data cleaning, sample traceability, feature extraction, and experimental reproduction.

[0137] 8.4 Application Effect Verification

Based on this web-based system, this invention can collect self-built signature image data outside of public datasets for two types of work: 1) constructing supplementary datasets in the digital domain to verify the applicability of the proposed feature extraction, sequence encoding, and hybridization prediction-driven training process to data under different collection conditions; 2) in molecular-level experiments, generating corresponding DNA coding sequences based on self-built signature samples and conducting wet experiments for verification, thereby forming a complete experimental chain of "self-collected signature images - feature extraction - DNA encoding - molecular hybridization verification".

[0138] Figure 10 shows six signature images, similar to those of one user in CEDAR, collected with the handwritten signature webpage designed and implemented in this invention. The six images were encoded using the encoder trained for 30 epochs to obtain six DNA sequences. Pairing these sequences gives 15 combinations (6 × 5 / 2 = 15), and the calculated average similarity is 90.17%, demonstrating the generalizability of this method.
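The pairwise comparison can be sketched as follows. The description does not state how similarity between two sequences is computed; the per-position identity used here is an assumption of this sketch, and the function names are hypothetical.

```python
from itertools import combinations


def identity(a, b):
    """Percentage of positions carrying the same base (assumed measure)."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def average_pairwise_similarity(seqs):
    """Return (number of unordered pairs, mean similarity over those pairs)."""
    pairs = list(combinations(seqs, 2))
    return len(pairs), sum(identity(a, b) for a, b in pairs) / len(pairs)
```

For six sequences the function evaluates 15 unordered pairs, matching the number of combinations stated above.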

[0139] In summary, this invention provides a signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning. By constructing three core components (a feature extractor, a sequence encoder, and a hybridization predictor) and employing a multi-task joint training strategy, it achieves end-to-end task-driven encoding from handwritten signature images to DNA sequences. This enables the DNA medium to simultaneously possess information storage functionality and molecular hybridization-based anti-counterfeiting verification capabilities. Experimental results show that this method achieves a 91.64% true-true recognition rate and a 99.64% true-false interception rate on the CEDAR dataset, with an average accuracy of 95.64%, verifying the effectiveness of this method.

[0140] The above content is only for illustrating the technical concept of the present invention and should not be construed as limiting the scope of protection of the present invention. Any modifications made to the technical solution based on the technical concept proposed in this invention shall fall within the scope of protection of the claims of this invention.

Claims

1. A signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning, characterized in that it includes the following steps: obtaining a signature image and extracting its feature vector using a pre-trained feature extractor; inputting the feature vector into a trained sequence encoder and mapping it to a DNA sequence; storing the DNA sequence in a DNA storage medium as the feature encoding sequence of the signature image; and, during the verification phase, converting the signature image to be tested into a DNA sequence, subjecting it to molecular hybridization with the stored target sequence, and determining the authenticity of the signature based on the hybridization rate.

2. The signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning according to claim 1, characterized in that the feature extractor is a pre-trained convolutional neural network optimized for handwritten signature anti-counterfeiting tasks.

3. The signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning according to claim 1, characterized in that the sequence encoder employs a twin-shared network structure, maps feature vectors to a base probability matrix, and generates a fixed-length DNA sequence by selecting the maximum probability.

4. The signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning according to claim 1, characterized in that the sequence encoder is trained using a multi-task joint loss function, and the training process employs a balanced pairing strategy to construct training samples, ensuring a balance between the number of true-true pairs and true-false pairs for each signer.

5. The signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning according to claim 4, characterized in that the multi-task joint loss function includes: task prediction loss, based on the error between the hybridization rate output by the hybridization predictor and the paired labels; semantic consistency loss, used to maintain the relative similarity between the original signature feature pairs and the DNA sequence pairs; and user identity classification loss, used to enhance the identity discrimination capability of the encoding results.

6. The signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning according to claim 5, characterized in that the hybridization predictor is a differentiable deep learning model used to estimate the hybridization rate between two DNA sequences, and it keeps its parameters frozen during the sequence encoder training phase, serving as a task-driven differentiable supervisory signal source.

7. The signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning according to claim 6, characterized in that the hybridization predictor is pre-trained on a large-scale DNA sequence pair dataset, which covers 27 sequence lengths from 20 nt to 150 nt, and the sequences meet the constraints of a GC content of 40%-60% and a maximum homopolymer length of no more than 3.

8. The signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning according to claim 1, characterized in that the steps for determining the authenticity of a signature through molecular hybridization reaction specifically include: synthesizing the reverse complementary sequence of the DNA sequence to be tested and preparing a labeled probe; annealing and hybridizing the probe with the feature encoding sequence of the target signer in the storage pool; obtaining the hybridization rate by measuring the hybridization reaction signal; and determining that the signature is genuine if the hybridization rate exceeds a preset threshold, and that it is forged otherwise.

9. A system for implementing the signature image DNA encoding storage and anti-counterfeiting verification method according to any one of claims 1-8, characterized in that it includes: a feature extraction module, used to extract discriminative feature vectors from the signature image; a sequence encoding module, used to map feature vectors to DNA sequences; a storage module, used to store DNA sequences in DNA media; and a verification module, used to synthesize the reverse complementary sequence of the DNA sequence to be tested and prepare a labeled probe, hybridize the probe with the feature encoding sequence of the target signer in the storage pool, measure the hybridization signal, and determine the authenticity of the signature to be tested according to a preset threshold.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the signature image DNA encoding storage and anti-counterfeiting verification method based on feature learning as described in any one of claims 1 to 8.
