Blockchain-based AI training data copyright protection method and system

By using blockchain technology, deep learning, and zero-knowledge proofs, digital fingerprints of artworks are generated and compliant audits are conducted. This solves the problems of transparent traceability and automated authorization of AI training datasets, and realizes a trustworthy process from artwork registration to model release, ensuring the rights of creators and regulatory compliance.

CN122263151A · Pending · Publication Date: 2026-06-23 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING INST OF TECH
Filing Date
2026-03-12
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing AI training datasets lack publicly available and detailed lists of data sources, making it difficult for creators to confirm whether their works have been used for training and to prove that the AI model used their works. Traditional copyright markings become invalid after data preprocessing, and the lack of a unified and automated authorization channel prevents creators from finely controlling the scope of their works' use.

Method used

By using a blockchain-based approach, digital fingerprints of artworks are extracted using multimodal deep semantic networks. Zero-knowledge proof technology is combined with compliance audits for privacy protection, and smart contracts are used to achieve automated authorization management, thus building a trusted collaborative network between creators and AI model trainers.

Benefits of technology

It achieves transparent traceability and automated authorization of AI training data while protecting commercial privacy, solves the problems of traceability and regulatory compliance of creators' use of their works, builds a complete evidence chain from work registration to model release, and improves the efficiency and transparency of authorization management.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a blockchain-based method and system for protecting the copyright of AI training data. The method includes: extracting semantic feature vectors from artworks to generate digital fingerprints, and writing them to the blockchain together with usage permission policies; having the AI trainer extract feature vectors from its training data, compare them for similarity with the on-chain digital fingerprints, generate a compliance proof based on zero-knowledge proofs, and submit it to the blockchain; having a smart contract query the permissions of the matched works according to the matching results, determine whether to authorize, and generate a digital license certificate or a rejection log; and, after training, aggregating the digital license certificates into a compliance report that is bound to the AI model and stored on the blockchain. Using zero-knowledge proof technology, a verifiable compliance proof can thus be submitted to the chain without disclosing the content of the original training dataset. Through smart contracts, license agreements are converted into tamper-proof code logic, ensuring that every authorization record is open, transparent, traceable, and automatically executed.

Description

Technical Field

[0001] This invention relates to the field of intelligent blockchain storage technology, and in particular to a blockchain-based method and system for protecting the copyright of AI training data.

Background Technology

[0002] Currently, AI model trainers (such as large technology companies and AI labs) aggregate massive amounts of publicly available internet data onto their private, centralized servers through web scraping or bulk downloading to create datasets, followed by data preprocessing and model training. In this model, the data usage process remains a closed black box to outsiders (including data creators and regulatory bodies). AI trainers consider their datasets core trade secrets and are unwilling to disclose specific data content, while regulators and copyright holders demand transparent audits. Existing AI training sets typically lack publicly available, detailed lists of data sources. Creators cannot know whether their works have been included in the training set, and because the internal parameters of AI models are highly abstract, even a creator who discovers that the model has generated works in a similar style finds it difficult to technically prove that the model actually used their specific work, leaving them without a basis for legal action. Furthermore, current technologies largely rely on digital watermarks or traditional cryptographic hashes (such as MD5 and SHA) for copyright marking. However, during the data preprocessing stage of AI training, the raw data often undergoes cropping, scaling, compression, noise reduction, or format conversion. These operations can destroy watermarks or change hash values, so that the artwork can no longer be recognized within the dataset.

[0003] The aforementioned issues result in existing AI model training datasets typically lacking annotations regarding data sources. Creators cannot confirm whether their work was used for training, nor can they prove that an AI model actually used their work. Furthermore, artworks lack unique and identifiable technical identifiers. After images, music, videos, and other artworks are processed, transformed, or disassembled, current technologies struggle to effectively identify them. Similarly, the lack of a unified and automated authorization channel prevents creators from ensuring that the AI training provider has actually removed their work if they do not wish it to be used for training. This prevents creators from exercising fine-grained control over the scope of their work's use (e.g., research-only, commercial prohibition).

Summary of the Invention

[0004] The purpose of this invention is to overcome the shortcomings of existing technologies and provide a blockchain-based method and system for protecting the copyright of AI training data. This system, based on blockchain, deep learning feature extraction, and zero-knowledge proof technology, aims to achieve traceability and authorization management of artworks within AI training data. The system constructs a trusted collaborative network between creators (data owners) and AI model trainers (data users). It extracts digital fingerprints of artworks through a multimodal deep semantic network, utilizes a vector database for rapid similarity retrieval of massive amounts of data, and combines zero-knowledge proof technology to complete compliance audits while protecting the privacy of the training set. Finally, it achieves automated copyright authorization through smart contracts.

[0005] The objective of this invention is achieved through the following technical solution:

[0006] Firstly, this application discloses a blockchain-based method for protecting the copyright of AI training data, comprising: S1, extracting semantic feature vectors from artworks using a feature extraction model to generate digital fingerprints, and writing the digital fingerprints along with the usage permission policies set by the creators into the blockchain; S2, extracting feature vectors from training data locally by the AI trainer, comparing them with the digital fingerprints on the blockchain, generating compliance proofs based on zero-knowledge proofs, and submitting the matching results and proofs to the blockchain; S3, querying the permissions of the corresponding works by the smart contract on the blockchain according to the matching results, determining whether to authorize, and generating digital license certificates or rejection logs; S4, after the AI training is completed, summarizing the digital license certificates obtained in this training to generate a training data compliance report, and binding it with the AI model and storing it on the blockchain.

[0007] Furthermore, the generation of digital fingerprints includes: extracting feature vectors of artworks in semantic space using a pre-trained multimodal deep neural network; performing a hash operation on the feature vectors to generate a fingerprint hash value as a unique identifier for the artwork; and packaging the fingerprint hash value, creator identity information, timestamp, and usage permission policy into the blockchain.

[0008] Furthermore, the generation of compliance proof based on zero-knowledge proof includes: the AI trainer locally constructs an index for the feature vectors of the training data using a vector database; compares the local index for similarity with a digital fingerprint database synchronized from the blockchain, and marks data items whose similarity exceeds a preset threshold; and, based on a pre-deployed zero-knowledge proof circuit, generates an encrypted proof of the authenticity of the comparison process and result, which does not contain the original training data information.

[0009] Secondly, this application discloses a blockchain-based AI training data copyright protection system for implementing the aforementioned blockchain-based AI training data copyright protection method. The system includes: an artwork fingerprint generation and registration terminal for extracting semantic feature vectors of artworks and generating digital fingerprints, as well as setting usage permission policies and uploading them to the blockchain; a privacy-preserving AI dataset auditing engine for locally extracting features and indexing training data, comparing similarity with the on-chain digital fingerprint database, and generating zero-knowledge proofs; a blockchain network and smart contract authorization system for storing immutable ownership and policy records, and automatically executing authorization decisions and records based on smart contracts; and an AI model compliance traceability interface for generating a training data compliance report after training is completed and binding it to the AI model.

[0010] Furthermore, the artwork fingerprint generation and registration terminal includes: a pre-trained deep neural network, used to extract feature vectors of artworks in semantic space and generate digital fingerprints with semantic invariance.

[0011] Furthermore, the blockchain network and smart contract authorization system adopt a hybrid storage architecture that combines on-chain indexing and off-chain storage, including: the original work file and detailed permission policy file are encrypted and stored in a distributed storage network, and their content addressing hash is stored only on the blockchain; the blockchain maintains a work registration mapping table and an authorization record mapping table, which respectively record the ownership information associated with the digital fingerprint hash and each authorization status change.

[0012] Furthermore, the smart contract authorization system is also used to: automatically generate a non-transferable digital license certificate when a work is matched and the permission policy is verified to be compatible. This certificate is bound to the AI training party's account and contains the authorization scope and timestamp. When the permission policy is not compatible, it automatically records unauthorized access logs and sends a compliance warning to the AI training party.

[0013] Furthermore, the AI model compliance traceability interface is also used to: summarize all digital license certificates obtained during the training process to generate a training data composition table, and generate a compliance fingerprint through the Merkle tree algorithm; write the compliance fingerprint into the metadata of the AI model file, and establish a mapping relationship between the model file hash and the compliance fingerprint in the blockchain.

[0014] Furthermore, the AI model compliance traceability interface also includes a public verification interface, which allows third parties to query the corresponding training data composition table and the authorization status of specific works through the model hash or compliance fingerprint, and to verify them based on the Merkle proof mechanism.
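
The compliance fingerprint and its public verification described in [0013] and [0014] can be illustrated with a minimal sketch. It assumes SHA-256 as the hash, certificates serialized as byte strings, and the common convention of duplicating the last node on odd-sized levels; none of these choices is specified in the application, and the function names are hypothetical.

```python
import hashlib


def _h(data: bytes) -> bytes:
    """SHA-256 digest (assumed hash function; the application does not name one)."""
    return hashlib.sha256(data).digest()


def merkle_levels(leaves):
    """Return all tree levels, leaf hashes first and the root level last."""
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [_h(leaf) for leaf in leaves]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]  # assumed convention: duplicate the last node
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def merkle_root(leaves):
    """Root hash, playing the role of the 'compliance fingerprint'."""
    return merkle_levels(leaves)[-1][0]


def merkle_proof(leaves, index):
    """Sibling hashes needed to recompute the root from leaf `index`."""
    proof = []
    for level in merkle_levels(leaves)[:-1]:
        sibling = index ^ 1
        if sibling >= len(level):
            sibling = index  # odd-sized level: the node is paired with itself
        # The flag is True when the sibling sits to the right of the current node.
        proof.append((level[sibling], index % 2 == 0))
        index //= 2
    return proof


def verify_proof(leaf, proof, root):
    """Recompute the path from a leaf to the root and compare."""
    node = _h(leaf)
    for sibling, sibling_on_right in proof:
        node = _h(node + sibling) if sibling_on_right else _h(sibling + node)
    return node == root


# Hypothetical serialized license certificates.
certificates = [b"cert:art-1234", b"cert:art-5678", b"cert:art-9012"]
compliance_fingerprint = merkle_root(certificates)
proof_for_second = merkle_proof(certificates, 1)
print(verify_proof(certificates[1], proof_for_second, compliance_fingerprint))
```

A third party holding only the root (for example, read from the model metadata) and one certificate with its proof can check membership without downloading the full composition table. A production design would typically also separate leaf and interior hashes by a prefix, which this sketch omits.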

[0015] The beneficial effects of this invention are:

[0016] This invention achieves a balance between protecting business privacy and ensuring compliance auditing: Utilizing zero-knowledge proof (ZKP) technology, it allows AI trainers to submit verifiable compliance proofs on the blockchain without disclosing the original training dataset (i.e., protecting their core business secrets). Compared to traditional black-box models or mandatory auditing schemes that require public data, this invention guarantees corporate privacy.

[0017] A decentralized and automated transparent authorization mechanism has been constructed: by building an automated permission state machine through smart contracts, this invention transforms the licensing agreement into immutable code logic. Compared with the traditional model that relies on manual review or centralized platform management, this eliminates the risks of opaque operations and single points of trust, ensuring that every authorization record is open, transparent, traceable, and automatically executed.

[0018] This invention achieves reliable traceability throughout the entire lifecycle of AI models: It establishes a complete chain of evidence from work registration to training authorization and model release. By embedding the Merkle root generated from compliance records into the AI model's metadata, regulators and users can verify the model's training process at any time, effectively solving the problem of difficulty in obtaining evidence in copyright disputes involving AI-generated content.

[0019] It promotes the openness and sharing of a high-quality data ecosystem: By providing creators with credible technical rights confirmation and transparent authorization guarantees through the system, this invention effectively eliminates creators' concerns about their works being stolen by AI black boxes.

Attached Figure Description

[0020] Figure 1 is a schematic diagram of the overall modular design of a blockchain-based AI training data copyright protection system according to some embodiments of this application;

[0021] Figure 2 is a modular schematic diagram of an artwork fingerprint generation and registration terminal according to some embodiments of this application;

[0022] Figure 3 is a modular schematic diagram of a privacy-preserving AI dataset auditing engine according to some embodiments of this application;

[0023] Figure 4 is a modular schematic diagram of a blockchain network and smart contract authorization system according to some embodiments of this application;

[0024] Figure 5 is a modular schematic diagram of an AI model compliance traceability interface according to some embodiments of this application.

Detailed Implementation

[0025] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0026] Before the embodiments are described, some technical terms are explained.

[0027] Blockchain technology: Blockchain is a decentralized distributed ledger technology that uses cryptographic algorithms to link data blocks chronologically and employs a consensus mechanism to ensure data consistency and immutability. In this invention, blockchain technology is used to build a trusted evidence storage platform that records the digital fingerprint of artworks, ownership information, licensing policies, and compliance audit logs of the AI training provider, ensuring that both the creator's rights statement and the AI provider's usage are traceable and cannot be unilaterally altered.

[0028] Digital signature technology: A digital signature is a technology used to verify the authenticity and integrity of digital information. In this invention, the creator uses a digital signature to sign the fingerprint hash and the permission policy before they are written to the blockchain, so that the ownership record can be attributed to the creator. Its core principle is that the sender signs the data, or a digest of the data, with their private key, thereby generating a unique signature. The recipient can verify the signature with the sender's public key, ensuring that the data has not been tampered with and confirming the sender's identity.

[0029] Deep Learning and Semantic Feature Extraction: Deep learning is a machine learning method based on artificial neural networks, capable of automatically learning high-level abstract features from unstructured data. Semantic feature extraction refers to using pre-trained deep neural networks (such as CLIP and ResNet) to transform works such as images and audio into high-dimensional vectors. In this invention, this technique is used to generate digital fingerprints of works.

[0030] Vector databases and indexing techniques: Vector databases are systems specifically designed for storing, managing, and querying high-dimensional vector data. To quickly find similar vectors in massive datasets, approximate nearest neighbor (ANN) algorithms and efficient indexing structures (such as HNSW and IVF) are typically employed.

[0031] Zero-knowledge proofs: Zero-knowledge proofs are cryptographic protocols that allow the prover to prove a statement is true without revealing the specific information to the verifier. Common implementation protocols include zk-SNARKs.

[0032] Smart contracts: A smart contract is a computer program that runs on a blockchain and automatically executes its contract terms when preset conditions are met. Its code logic is public and transparent, and the execution result is irreversible.

[0033] Merkle Tree: A Merkle tree is a tree-like data structure based on hash values. Each leaf node holds the cryptographic hash of a data block, while each non-leaf node holds the hash of the combined hashes of its child nodes. The core advantage of this structure is its ability to efficiently and securely verify the integrity of large-scale datasets: any small change in the underlying data changes the root hash value.

[0034] According to an embodiment of this application, a blockchain-based method for protecting the copyright of AI training data constructs a complete and trustworthy process from work ownership confirmation, privacy auditing, automatic authorization to model traceability, aiming to achieve traceability, verifiability, and manageability of artworks in AI training scenarios. Specifically, it includes:

[0035] In the S1 stage of rights confirmation and policy setting, semantic feature vectors are extracted from the artwork using a feature extraction model to generate a digital fingerprint. This digital fingerprint, along with the usage permission policy set by the creator, is then written into the blockchain. Specifically, the creator uploads their work, and a deep neural network extracts the feature vectors of the artwork in the semantic space. Based on this, a highly robust digital fingerprint is generated, which can be called the artwork identifier. Simultaneously, the creator sets the permission policy for the work, such as whether it can be used for AI training, and if so, whether it allows all uses or only non-commercial uses. This permission information, along with the artwork's digital fingerprint, is packaged and written into the blockchain, forming an immutable rights statement. In this way, by generating a digital fingerprint based on deep learning semantic features, the artwork can still be accurately identified by the system through feature similarity comparison, even after undergoing the preprocessing operations common in AI training.

[0036] In detail, generating a digital fingerprint includes: extracting feature vectors of the artwork in the semantic space using a pre-trained multimodal deep neural network; performing a hash operation on the feature vectors to generate a fingerprint hash value as a unique identifier for the artwork; and packaging the fingerprint hash value, creator identity information, timestamp, and usage permission policy into the blockchain.
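
The packaging step in [0036] can be sketched as follows. This is an illustrative sketch only: the feature vector is taken as given (the pre-trained network is not modeled), SHA-256 and the rounding precision are assumptions, and the field names `art_id`, `creator`, `timestamp`, and `policy` are hypothetical rather than taken from the application.

```python
import hashlib
import json
import struct
import time


def fingerprint_hash(vector, decimals=4):
    """Hash a feature vector after rounding, so tiny float noise does not change the digest.

    Adding 0.0 turns a rounded -0.0 into 0.0 so both serialize identically.
    """
    rounded = [round(float(x), decimals) + 0.0 for x in vector]
    payload = struct.pack(f">{len(rounded)}d", *rounded)  # big-endian doubles
    return hashlib.sha256(payload).hexdigest()


def build_registration_record(vector, creator_id, policy, timestamp=None):
    """Assemble the record that would be written to the blockchain."""
    return {
        "art_id": fingerprint_hash(vector),
        "creator": creator_id,
        "timestamp": int(time.time()) if timestamp is None else timestamp,
        "policy": policy,
    }


# Hypothetical inputs: a short vector stands in for a real high-dimensional embedding.
record = build_registration_record(
    [0.12, -0.53, 0.80],
    creator_id="creator-public-key-placeholder",
    policy={"ai_training": "non_commercial_only"},
    timestamp=1700000000,
)
print(json.dumps(record, indent=2))
```

Note that the hash serves only as an exact identifier for the registered fingerprint; recognizing a preprocessed copy still relies on comparing the feature vectors themselves, since any cryptographic hash changes completely when its input changes beyond the rounding tolerance.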

[0037] Building on this, further issues arise regarding the licensing management of personal works within AI training data. Existing digital copyright licensing processes are complex, reliant on manual review, and costly to implement. Furthermore, the AI ​​model training process lacks traceability. Currently, there is no mechanism for recording the chain of connections between the model training process and the data used, making post-event tracing and dispute identification difficult; there is also a lack of a unified, trustworthy collaborative environment among multiple parties. Creators, AI companies, and regulators lack a shared foundation of trust.

[0038] Therefore, the method further performs the following steps:

[0039] In the S2 privacy audit and matching phase, the AI trainer extracts feature vectors from the training data locally, compares them with on-chain digital fingerprints for similarity, and generates compliance proofs based on zero-knowledge proofs. The matching results and proofs are then submitted to the blockchain. Specifically, the AI trainer activates the audit engine, calculates the feature vectors of the training data, and compares them with the on-chain digital fingerprint database. Using zero-knowledge proof technology, the AI trainer submits encrypted proofs such as "I used artwork #1234" to the blockchain without disclosing specific training data.

[0040] In some embodiments, S2 specifically includes:

[0041] S21 Index Construction: The AI trainer extracts the feature vectors of the training data locally and constructs a vector index.

[0042] S22 Local Scan: The system automatically compares the local index with the synchronized on-chain fingerprint database and marks all data items whose similarity exceeds the threshold (e.g., 0.85).

[0043] S23 Proof Generation: Based on the above comparison results, a compliance scan proof is generated using a pre-deployed zero-knowledge proof circuit. This proof does not contain any original training data information.

[0044] S24 Results on the Blockchain: The AI trainer submits only the list of matched work identifiers and the generated cryptographic proof to a smart contract on the blockchain. The smart contract verifies the validity of the proof; once verification succeeds, the privacy audit is confirmed to be complete.
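
The local scan in S21 and S22 can be illustrated with a minimal sketch. It assumes cosine similarity as the similarity measure and that the synchronized fingerprint database exposes one reference vector per work identifier; a brute-force loop stands in for the vector index, and the zero-knowledge proof of S23 is not modeled. All names and sample values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def scan_training_set(training_vectors, registered, threshold=0.85):
    """Mark training items whose similarity to a registered work exceeds the threshold.

    training_vectors: {item_id: vector} held locally by the trainer.
    registered: {art_id: vector} synchronized fingerprint database (assumed layout).
    """
    matches = []
    for item_id, vec in training_vectors.items():
        for art_id, ref in registered.items():
            score = cosine_similarity(vec, ref)
            if score > threshold:
                matches.append((item_id, art_id, round(score, 4)))
    return matches


registered = {"art-1234": [1.0, 0.0, 0.0], "art-5678": [0.0, 1.0, 0.0]}
training = {"img-001": [0.9, 0.1, 0.0], "img-002": [0.3, 0.3, 0.9]}

matches = scan_training_set(training, registered)
# Only the matched work identifiers would be submitted on-chain (S24).
matched_ids = sorted({art_id for _, art_id, _ in matches})
print(matched_ids)
```

The quadratic loop is only workable for small examples; at the scale the application describes, an approximate nearest neighbor index would replace it while keeping the same threshold test.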

[0045] Thus, by using zero-knowledge proof technology, AI trainers can submit verifiable "compliant mathematical proofs" to on-chain smart contracts without disclosing the contents of their core training dataset (trade secrets), resolving the conflict between commercial privacy and regulatory compliance.

[0046] In S3, the smart license decision stage, the smart contract on the blockchain queries the permissions of the corresponding work according to the matching results, determines whether to grant the license, and generates a digital license certificate or a rejection log.

[0047] In detail, after receiving the matching request and proof submitted in Phase S2, the smart contract automatically retrieves the rights statement associated with the corresponding work identifier from the blockchain ledger, especially the permissions set by the creator. Subsequently, the system compares the creator's preset permissions with the intended use by the AI training provider. If they match (for example, the work permission allows training and the AI provider requests to use it for training), the smart contract automatically generates a digital license certificate containing a timestamp, the AI training provider's identity, and the work identifier, and records this certificate in the authorization ledger on the blockchain. If they do not match (for example, the work permission prohibits training), the smart contract records an unauthorized attempt log on the blockchain, does not issue a license certificate, and sends a notification to the AI training provider, requesting that it remove the data from the training set or obtain authorization.

[0048] This transforms the creator's legal permission requests into smart contract code, creating an automated permission state machine. Without human intervention, the system can automatically determine whether to approve or reject the AI's usage request based on preset policies, and record state changes in real time.
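
The decision logic of S3 can be sketched off-chain in Python as a stand-in for the smart contract. The three policy values mirror the options named in the description (allow all training, non-commercial only, prohibited); the class and field names, the purpose strings, and the choice to reject works with no registered policy are assumptions made for illustration.

```python
import time
from dataclasses import dataclass

ALLOW_ALL = "allow_all"
NON_COMMERCIAL_ONLY = "non_commercial_only"
PROHIBITED = "prohibited"


@dataclass(frozen=True)
class Decision:
    authorized: bool
    art_id: str
    trainer: str
    purpose: str
    timestamp: int


def is_compatible(policy, purpose):
    """Compare the creator's policy with the trainer's declared purpose."""
    if policy == ALLOW_ALL:
        return True
    if policy == NON_COMMERCIAL_ONLY:
        return purpose == "non_commercial_training"
    return False  # prohibited or unknown policy: fail closed (assumption)


class AuthorizationLedger:
    """In-memory stand-in for the on-chain policy table and authorization records."""

    def __init__(self, policies):
        self.policies = dict(policies)
        self.certificates = []  # issued license certificates
        self.rejections = []    # unauthorized attempt log

    def decide(self, art_id, trainer, purpose, now=None):
        ts = int(time.time()) if now is None else now
        policy = self.policies.get(art_id, PROHIBITED)
        ok = is_compatible(policy, purpose)
        decision = Decision(ok, art_id, trainer, purpose, ts)
        (self.certificates if ok else self.rejections).append(decision)
        return decision


ledger = AuthorizationLedger({
    "art-1": ALLOW_ALL,
    "art-2": PROHIBITED,
    "art-3": NON_COMMERCIAL_ONLY,
})
granted = ledger.decide("art-1", "trainer-A", "commercial_training", now=1700000000)
denied = ledger.decide("art-2", "trainer-A", "commercial_training", now=1700000001)
limited = ledger.decide("art-3", "trainer-A", "commercial_training", now=1700000002)
print(granted.authorized, denied.authorized, limited.authorized)
```

Every call appends exactly one record, so the ledger keeps a complete history of both grants and rejected attempts, which is the property the description relies on for later audits.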

[0049] S4. Training Compliance Archiving Phase: After AI training is completed, the digital license certificates obtained during this training are compiled to generate a training data compliance report, which is then bound to the AI model and stored on the blockchain. Specifically, after AI training is completed, the system packages all valid digital license certificates obtained during this training process to generate a compliance report for the model's training data. Any third party can query this report to verify whether the AI model has passed the complete data source compliance process.
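
One way to picture the binding in S4 is a report keyed by the model file's hash. The sketch below assumes certificates are dictionaries with `art_id` and `authorized` fields and that SHA-256 over a canonical JSON encoding serves as the report digest; these structures are hypothetical and are not prescribed by the application.

```python
import hashlib
import json


def build_compliance_report(model_bytes, certificates):
    """Aggregate valid certificates and bind them to the hash of the model file."""
    valid = [c for c in certificates if c.get("authorized")]
    report = {
        "model_hash": hashlib.sha256(model_bytes).hexdigest(),
        "certificate_count": len(valid),
        "works": sorted(c["art_id"] for c in valid),
    }
    # Canonical encoding so the same report always yields the same digest.
    canonical = json.dumps(report, sort_keys=True, separators=(",", ":"))
    report_digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return report, report_digest


# Hypothetical certificates collected during one training run.
certificates = [
    {"art_id": "art-5678", "authorized": True},
    {"art_id": "art-1234", "authorized": True},
    {"art_id": "art-9999", "authorized": False},
]
report, report_digest = build_compliance_report(b"model-weights-placeholder", certificates)
print(json.dumps(report, indent=2))
print(report_digest)
```

Because the report contains the model hash, a third party who recomputes the hash of a released model file can look up the matching report and see which works were licensed for it.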

[0050] It can be understood that the above method enables end-to-end reliable traceability. Leveraging the decentralized and immutable characteristics of blockchain, the entire process, from the original work to the training data and the AI training records, is recorded on the chain, constructing a complete chain of evidence from the original work to the generated model. This solves the problems of opaque data sources and difficulties in post-event evidence collection in existing technologies. Furthermore, by establishing a highly robust technical identifier based on deep learning semantic features as a digital fingerprint, even if the work is distorted by AI preprocessing operations such as cropping, scaling, and compression, the system can still accurately identify the original work through vector similarity comparison. This effectively solves the problem of traditional file hash-based identification technologies failing after minor modifications to the file.

[0051] Furthermore, by introducing zero-knowledge proof technology, AI training providers can prove to the outside world that their datasets have passed compliance scans without disclosing the original training data content. This protects the trade secrets of AI companies while satisfying the copyright holder's right to know and the auditing requirements of regulators. Moreover, combined with smart contract technology, automated and precise authorization is achieved. This invention transforms legal licensing agreements into automatically executable code logic. Creators can flexibly set policies such as allowing or prohibiting specific uses, and the system automatically executes permission determinations and records the status without manual intervention, greatly improving the efficiency and transparency of authorization management.

[0052] The above method is implemented through a blockchain-based AI training data copyright protection system according to an embodiment of this application, which can be understood with reference to Figures 1-5. Referring first to Figure 1, the system includes an artwork fingerprint generation and registration terminal, a privacy-preserving AI dataset auditing engine, a blockchain network and smart contract authorization system, and an AI model compliance traceability interface.

[0053] The artwork fingerprint generation and registration terminal, shown in Figure 2, is used to extract the semantic feature vector of an artwork and generate a digital fingerprint, as well as to set usage permission policies and upload them to the blockchain.

[0054] Specifically, this platform allows creators to upload original artworks, including but not limited to images, audio, and text. It generates a unique and attack-resistant technical identifier for artwork registration and ownership verification. Its built-in pre-trained deep neural network model extracts feature vectors from the artwork's semantic space and generates a digital fingerprint based on these vectors. This module packages the hash value of the digital fingerprint, the creator's public key, and timestamps onto the blockchain. Simultaneously, creators must set a licensing policy for AI training, such as allowing all training, allowing only non-commercial training, or completely prohibiting training; this policy is stored on the blockchain along with the digital fingerprint.

[0055] The privacy-preserving AI dataset auditing engine, shown in Figure 3, is used to extract features and build indexes on training data locally, to perform similarity comparisons with an on-chain digital fingerprint database, and to generate zero-knowledge proofs. Specifically, before an AI model is trained, the engine scans the training dataset to identify whether it contains registered copyrighted works. By using a vector database to build a high-dimensional feature index (such as HNSW), it supports millisecond-level retrieval over hundreds of millions of data points. Furthermore, by introducing a zero-knowledge proof module, it allows AI trainers to generate encrypted proofs locally that "the dataset does not contain infringing data" or "the dataset contains authorized data," without uploading the original training data to the public network, thereby protecting the trade secrets of AI companies.

[0056] The blockchain network and smart contract authorization system, shown in Figure 4, is used to store immutable ownership and policy records, and to automatically execute authorization decisions and records based on smart contracts.

[0057] The aforementioned blockchain network serves as an immutable public ledger, maintaining records of work ownership, authorization status, and transaction logs. Based on a consortium blockchain or a high-performance public blockchain, a smart contract for evidence storage is deployed. The stored content consists only of "feature fingerprint hashes" and "zero-knowledge proof verification results," avoiding the storage of large files to ensure operational efficiency. The smart contract authorization system automatically executes authorization logic based on the matching results of the audit engine. Based on preset logic code (i.e., the smart contract), when the audit engine detects a match, the contract automatically reads the AI training license policy for the work. If the policy matches (e.g., the work is set to "allowed," and the AI trainer requests "training"), the contract automatically generates a non-fungible digital license credential and records it on the blockchain. If the policy conflicts (e.g., the work is set to "prohibited"), the contract records a rejection log and issues a compliance warning to the AI training provider.

[0058] The AI model compliance traceability interface, shown in Figure 5, is used to generate a training data compliance report after training is complete and to bind it to the AI model. Specifically, it establishes a compliance association between the model and the training data and provides a public query interface. When the model is released, the system aggregates all digital license certificates obtained during that training session, generates a training data compliance report, and uploads it to the blockchain. External regulators or users can verify whether the model contains unauthorized or illegal data by querying the model ID.

[0059] To better illustrate this, the execution process of the blockchain-based AI training data copyright protection system of this application embodiment will be specifically described.

[0060] The artwork fingerprint generation and registration terminal serves as the system's entry point and operates primarily on the creator's local device. Its core task is to transform unstructured artworks (such as images and audio) into a unique, computer-recognizable, and interference-resistant digital identity (ArtID), and then bind it to the authorization policy set by the creator to complete on-chain registration. This terminal executes S1, performing semantic fingerprint extraction based on deep learning.

[0061] In traditional file hashing (such as MD5), a single-bit change to the file completely alters the hash value. By contrast, a deep neural network used as a feature extractor analyzes the content of an artwork and maps it into a high-dimensional feature vector. This yields fingerprints with semantic invariance: even after the artwork has undergone preprocessing such as cropping, scaling, compression, or noise addition during AI training, as long as the human eye can still recognize it as the same artwork, the generated feature vectors remain very close in the feature space. This solves the problem that traditional techniques cannot recognize distorted images.
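The contrast between a file hash and a semantic fingerprint can be illustrated with a minimal sketch. The `toy_features` function below is a hypothetical stand-in for the deep feature extractor (it simply averages blocks of pixel values); the pixel data, noise range, and block size are illustrative assumptions, not values from this application.

```python
import hashlib
import math
import random


def toy_features(pixels, block=4):
    """Hypothetical stand-in for a deep feature extractor: mean of each pixel block."""
    return [sum(pixels[i:i + block]) / block for i in range(0, len(pixels), block)]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


rng = random.Random(0)
original = [rng.randint(0, 255) for _ in range(64)]
# Simulate mild preprocessing noise: each pixel shifts by at most 2 levels.
noisy = [min(255, max(0, p + rng.randint(-2, 2))) for p in original]

md5_original = hashlib.md5(bytes(original)).hexdigest()
md5_noisy = hashlib.md5(bytes(noisy)).hexdigest()
similarity = cosine_similarity(toy_features(original), toy_features(noisy))

print("MD5 digests equal:", md5_original == md5_noisy)
print("Feature cosine similarity:", round(similarity, 4))
```

The file hashes differ as soon as any pixel changes, while the feature vectors of the two versions stay almost parallel, which is the property the fingerprint relies on.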

[0062] The registration terminal defines a standardized data structure (such as JSON or XML) that translates the creator's intentions into logic that smart contracts can execute. Creators select simple options on the interface (e.g., "Allow AI training?", "Allow commercial use?"), and the system automatically generates the corresponding logical parameters. This policy document serves as the basis for subsequent automatic decisions by the smart contracts.
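As a minimal sketch of such a policy document, the following assumes a JSON structure with hypothetical field names (`art_id`, `allow_ai_training`, `allow_commercial_use`); the application does not fix a schema. The digest is computed over a canonical serialization so that the same policy always yields the same hash.

```python
import hashlib
import json


def build_policy(art_id, allow_ai_training, allow_commercial_use):
    """Translate the creator's interface choices into a machine-readable policy."""
    return {
        "art_id": art_id,
        "allow_ai_training": bool(allow_ai_training),
        "allow_commercial_use": bool(allow_commercial_use),
    }


def policy_digest(policy):
    """Hash a canonical serialization so that equal policies give equal digests."""
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


policy = build_policy("art-0001", allow_ai_training=True, allow_commercial_use=False)
print(json.dumps(policy, indent=2))
print(policy_digest(policy))
```

Sorting keys and fixing separators before hashing avoids digest mismatches caused only by formatting differences.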

[0063] Furthermore, a mechanism of off-chain computation and on-chain evidence storage is established. To protect the privacy of creators' original files and reduce blockchain storage costs, the system writes only hash values to the chain. Specifically, the feature vector of the work is extracted and computed on the user's local device, and the original file itself never leaves that device. After the creator digitally signs the feature vector hash value and the permission policy file with their private key, the system writes the fingerprint hash (ArtID), the permission policy digest, and the digital signature to the blockchain, forming an immutable ownership record.
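The registration record described above might be assembled as in the following sketch. Several points are assumptions made for illustration: the vector is packed as 32-bit floats before hashing, the signed message is the ArtID joined with the policy digest, and `demo_sign` uses an HMAC purely as a placeholder, since the Python standard library has no public-key signature scheme. A real deployment would pass in a signing function backed by the creator's private key.

```python
import hashlib
import hmac
import struct
import time


def art_id_from_vector(vector):
    """Hash the packed feature vector to obtain the fingerprint hash (ArtID)."""
    packed = struct.pack(f"<{len(vector)}f", *vector)
    return hashlib.sha256(packed).hexdigest()


def build_registration(vector, policy_digest, sign, timestamp=None):
    """Build the on-chain record; the vector and original file stay off-chain."""
    art_id = art_id_from_vector(vector)
    message = f"{art_id}:{policy_digest}".encode("utf-8")
    return {
        "art_id": art_id,
        "policy_digest": policy_digest,
        "signature": sign(message),
        "timestamp": int(time.time()) if timestamp is None else timestamp,
    }


def demo_sign(message, key=b"demo-private-key"):
    # Placeholder only: HMAC is a symmetric MAC, not a public-key signature.
    return hmac.new(key, message, hashlib.sha256).hexdigest()


record = build_registration([0.12, -0.5, 0.33], "ab" * 32, demo_sign, timestamp=0)
print(record)
```

Only the hash, the digest, the signature, and a timestamp appear in the record, so neither the artwork nor its raw feature vector is exposed on-chain.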

[0064] The privacy-preserving AI dataset auditing engine is deployed on the server side at the AI model training site. It can perform compliance scanning on large-scale training data without disclosing the original content of the AI training dataset, quickly identify which artworks are registered on the blockchain, and generate audit reports that can be verified on the blockchain. It is used to execute S2 and S3.

[0065] AI training sets typically contain hundreds of millions or even billions of data samples, so traditional one-to-one comparison is extremely inefficient and cannot meet engineering requirements. This embodiment therefore introduces vector database technology: the system uses the HNSW (Hierarchical Navigable Small World graph) algorithm to build a fast index over the high-dimensional feature space. This allows the system to compute, within milliseconds, the cosine similarity between the feature vectors of the training data and the fingerprint database of works registered on the blockchain. Even with data volumes reaching hundreds of millions, real-time nearest-neighbor search can be achieved to quickly locate suspected infringing data. In addition, a zero-knowledge proof mechanism is employed for privacy protection, because training data is generally a core business secret and AI companies typically refuse to upload it to public networks or provide it to third parties for comparison.
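The comparison step can be sketched as follows. This is an exhaustive cosine-similarity search, used here only as a simple stand-in for the HNSW index described above: it returns the same kind of result (nearest registered work per training item) but does not scale to hundreds of millions of items. The threshold value and the sample vectors are illustrative assumptions.

```python
import math


def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def find_matches(training_vectors, registered, threshold=0.95):
    """Return (training_index, art_id, similarity) for items at or above the threshold."""
    index = {art_id: normalize(vec) for art_id, vec in registered.items()}
    matches = []
    for i, vec in enumerate(training_vectors):
        query = normalize(vec)
        best_id, best_sim = None, -1.0
        for art_id, ref in index.items():
            sim = sum(a * b for a, b in zip(query, ref))
            if sim > best_sim:
                best_id, best_sim = art_id, sim
        if best_id is not None and best_sim >= threshold:
            matches.append((i, best_id, best_sim))
    return matches


registered = {"art-A": [1.0, 0.0, 0.0], "art-B": [0.0, 1.0, 0.0]}
training = [[0.98, 0.05, 0.0], [0.3, 0.3, 0.9]]
matches = find_matches(training, registered)
print(matches)
```

In a production setting the inner loop would be replaced by an approximate nearest-neighbor query against the vector database; the thresholding logic stays the same.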

[0066] Therefore, this module adopts the zk-SNARK (zero-knowledge succinct non-interactive argument of knowledge) protocol. The AI trainer runs the comparison algorithm on its local server, taking the private training dataset fingerprints and the public on-chain fingerprint database as inputs. The algorithm outputs a matching result and generates a cryptographic proof. The verifier only needs to receive and verify this proof. This mechanism mathematically guarantees that the AI trainer actually performed the comparison and that the comparison result is correct. Throughout the entire process, the original training data never leaves the AI trainer's local server, so the data is usable without being visible.

[0067] The blockchain network and smart contract authorization system are responsible for maintaining the authenticity of data in a decentralized environment and automatically executing permission management according to preset rules.

[0068] Specifically, the system uses a blockchain network as its core trusted evidence storage infrastructure, responsible for maintaining the unique ownership status and authorization history of works across the entire network. In terms of data storage architecture, it adopts a hybrid model combining on-chain indexing and off-chain storage to address the limited storage capacity of the blockchain and improve system throughput.

[0069] Specifically, the blockchain ledger primarily maintains two core mapping tables: one is the artwork registration mapping table, which uses the artwork's digital fingerprint hash (ArtID) as the key to link and store the creator's public key, timestamp, and hash digest of the permission policy file; the other is the authorization record mapping table, used to record status changes after each compliance scan in chronological order. For the original artwork files, high-resolution thumbnails, and detailed JSON-formatted policy files, the system encrypts them and stores them in distributed storage networks such as IPFS (InterPlanetary File System), writing only their content address hashes to the blockchain. This not only ensures the immutability of core ownership data but also achieves physical isolation and privacy protection for sensitive data.
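The two mapping tables can be sketched as plain in-memory dictionaries. This is a simplified model, not contract code: the class and field names are hypothetical, and `content_address` stands for the content address hash of the encrypted off-chain files.

```python
from dataclasses import dataclass, field


@dataclass
class ArtworkRecord:
    creator_public_key: str
    timestamp: int
    policy_digest: str
    content_address: str  # address hash of the encrypted off-chain files


@dataclass
class Ledger:
    registrations: dict = field(default_factory=dict)   # ArtID -> ArtworkRecord
    authorizations: dict = field(default_factory=dict)  # ArtID -> status changes

    def register(self, art_id, record):
        """Add a work to the registration table; an ArtID can be registered once."""
        if art_id in self.registrations:
            raise ValueError("ArtID already registered")
        self.registrations[art_id] = record
        self.authorizations[art_id] = []

    def append_status(self, art_id, timestamp, status):
        """Append a status change after a compliance scan, in arrival order."""
        if art_id not in self.registrations:
            raise KeyError(art_id)
        self.authorizations[art_id].append((timestamp, status))


ledger = Ledger()
ledger.register("art-0001", ArtworkRecord("pubkey-1", 1700000000, "d" * 64, "addr-1"))
ledger.append_status("art-0001", 1700000100, "license granted")
ledger.append_status("art-0001", 1700000200, "request rejected")
print(ledger.authorizations["art-0001"])
```

Keeping the authorization history append-only mirrors the chronological record described above.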

[0070] The smart contract authorization logic center is the core of automated management. This module does not rely on any manual intervention and adjudicates the usage requests of AI training providers directly, based on preset code logic. When the privacy-preserving auditing engine submits a request containing the fingerprint of the target work and the intended use, the smart contract first retrieves the corresponding work registration information on the chain through an index and reads the permission control policy bits associated with the work. The logic processor inside the contract then compares the AI training provider's request parameters (such as "model training" or "commercial use") with the range permitted by the creator. If the request conforms to the policy settings, a license generation instruction is triggered. If the request conflicts with the policy (e.g., the creator has set a "prohibit training" flag), the contract automatically rejects the request and triggers a logging function that permanently writes this unauthorized access attempt to the on-chain audit log, for use in subsequent compliance warnings or legal evidence collection.
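The adjudication logic can be sketched with permission bits as below. This is a Python simulation of the decision rule, not actual smart contract code; the `Use` flag names, the class name, and the record layout are illustrative assumptions.

```python
from enum import IntFlag


class Use(IntFlag):
    MODEL_TRAINING = 1
    COMMERCIAL_USE = 2


class AuthorizationContract:
    def __init__(self, policies):
        self.policies = policies  # ArtID -> permitted Use bits
        self.audit_log = []       # rejected requests, kept permanently

    def request(self, trainer, art_id, requested):
        """Grant only if every requested bit is within the creator's permitted bits."""
        permitted = self.policies[art_id]
        if (requested & permitted) == requested:
            return {"decision": "granted", "trainer": trainer,
                    "art_id": art_id, "scope": int(requested)}
        self.audit_log.append((trainer, art_id, int(requested)))
        return {"decision": "rejected", "trainer": trainer, "art_id": art_id}


contract = AuthorizationContract({
    "art-open": Use.MODEL_TRAINING,
    "art-closed": Use(0),  # creator set "prohibit training"
})
granted = contract.request("trainer-1", "art-open", Use.MODEL_TRAINING)
rejected = contract.request("trainer-1", "art-closed", Use.MODEL_TRAINING)
print(granted["decision"], rejected["decision"], contract.audit_log)
```

A request that asks for more than the policy allows (for example, training plus commercial use on a training-only work) is rejected as a whole under this rule.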

[0071] Finally, as the output of the smart contract's logical determination, this embodiment introduces a non-transferable digital license credential mechanism to mark compliance. When the smart contract determines that the AI training provider's request is legitimate, it automatically mints a unique digital credential. The credential's metadata includes the identity ID of the authorized party, the fingerprint hash of the authorized work, the timestamp at which the authorization takes effect, and the specific scope of the authorization (e.g., "for research use only"). The credential is set to be non-tradable and non-transferable, and can only be bound to the account address of the authorized AI training provider. This mechanism ensures the uniqueness and non-repudiation of the authorization relationship, making the credential the sole electronic evidence for subsequently proving the compliance of the AI model's training data sources.
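A minimal sketch of the non-transferable credential follows. It models the credential as an immutable record and the registry as a component whose transfer operation always fails; the class names and field names are hypothetical, and an on-chain implementation would enforce the same rule inside the contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LicenseCredential:
    credential_id: int
    holder: str          # account address of the authorized AI training provider
    art_id: str          # fingerprint hash of the authorized work
    effective_from: int  # timestamp at which the authorization takes effect
    scope: str           # e.g. "for research use only"


class CredentialRegistry:
    def __init__(self):
        self._credentials = {}
        self._next_id = 1

    def mint(self, holder, art_id, effective_from, scope):
        """Create a unique credential bound to the holder's account."""
        cred = LicenseCredential(self._next_id, holder, art_id, effective_from, scope)
        self._credentials[cred.credential_id] = cred
        self._next_id += 1
        return cred

    def transfer(self, credential_id, new_holder):
        """Transfers are never allowed; the binding to the holder is permanent."""
        raise PermissionError("license credentials are non-transferable")

    def holder_of(self, credential_id):
        return self._credentials[credential_id].holder


registry = CredentialRegistry()
cred = registry.mint("0xTrainer", "art-0001", 1700000000, "for research use only")
print(cred)
```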

[0072] Furthermore, the AI model compliance traceability interface addresses the black-box problem in existing technologies, where the sources of a model's training data cannot be traced after its release. By mapping the authorization status recorded on the blockchain to the model's metadata, this module provides regulatory agencies, distribution platforms, and end users with an open, transparent, and tamper-proof compliance verification channel, ensuring the integrity of the evidence chain from data ownership confirmation to model application. This module executes S4.

[0073] In detail, when the AI model completes training and is ready for release, this module automatically triggers the archiving process. The system first iterates through all valid digital licenses obtained in the smart contract for this training task, extracting the unique identifier and fingerprint hash of the original work for each license. Then, the system aggregates these discrete license data to generate a standardized Training Bill of Materials (T-BOM). To ensure the compactness of the data structure and the efficiency of verification, the system does not directly store the massive detailed list. Instead, it uses a Merkle tree algorithm to hash all licenses in the T-BOM, generating a unique Merkle root hash value from the bottom up. This root hash value constitutes the compliance fingerprint of the AI model in terms of copyright. Any tampering with the licensing status of the underlying training data will cause a drastic change in this root hash value, thus ensuring the authenticity of the record.
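The bottom-up construction of the compliance fingerprint can be sketched as follows. The application does not specify the tree variant, so several details here are assumptions: SHA-256 as the hash, one-byte prefixes to separate leaf and internal hashes, duplication of the last node on odd-sized levels, and a `license_id:art_id` string as the serialized form of each license.

```python
import hashlib


def sha256(data):
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Hash the leaves pairwise, level by level, until one root remains."""
    if not leaves:
        raise ValueError("T-BOM must contain at least one license")
    level = [sha256(b"\x00" + leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # assumption: duplicate the last node
        level = [sha256(b"\x01" + level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]


t_bom = [b"lic-1:art-A", b"lic-2:art-B", b"lic-3:art-C"]
compliance_fingerprint = merkle_root(t_bom).hex()
tampered = merkle_root([b"lic-1:art-A", b"lic-2:art-X", b"lic-3:art-C"]).hex()
print(compliance_fingerprint)
print(compliance_fingerprint == tampered)
```

Changing any single license entry changes the root, which is why the root alone can stand in for the full list on-chain.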

[0074] Furthermore, to achieve a permanent binding between the model file and the compliance record, the system writes the generated compliance fingerprint into the standard metadata field of the AI model file. Simultaneously, the system initiates a record in the blockchain evidence storage network, establishing an inseparable mapping relationship between the AI model's own hash (such as the SHA-256 value of the model weight file) and the aforementioned compliance fingerprint.
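The binding step might look like the following sketch, which hashes only the weight bytes, adds the fingerprint to a copy of the metadata, and records the mapping in a dictionary standing in for the on-chain store. The function name, the metadata key, and the dictionary ledger are illustrative assumptions.

```python
import hashlib


def bind_model(weights, metadata, compliance_fingerprint, ledger):
    """Tag the model metadata and record model hash -> compliance fingerprint."""
    model_hash = hashlib.sha256(weights).hexdigest()
    tagged = dict(metadata, compliance_fingerprint=compliance_fingerprint)
    ledger[model_hash] = compliance_fingerprint
    return model_hash, tagged


ledger = {}
weights = b"\x00\x01\x02\x03"  # placeholder for the model weight file contents
model_hash, tagged = bind_model(weights, {"name": "demo-model"}, "f" * 64, ledger)
print(model_hash, tagged, ledger[model_hash])
```

Hashing the weights separately from the metadata means that writing the fingerprint into the metadata does not change the model hash it is bound to.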

[0075] Based on the above architecture, during the model distribution and usage phase, any third party can utilize the API provided by this module for compliance verification. Verifiers only need to provide the hash value of the model file or the compliance fingerprint from the metadata, and the system can retrieve the corresponding T-BOM record in the blockchain ledger. Using the Merkle Proof mechanism, verifiers can not only confirm whether the model has undergone system compliance certification, but also further precisely query whether a specific artist's work is included in the model's training dataset and its licensing status.
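A membership check of this kind can be sketched as below. The tree construction is an assumption (SHA-256, one-byte prefixes for leaf and internal hashes, last node duplicated on odd-sized levels), and the license strings are illustrative. The verifier needs only the leaf, the sibling hashes along the path, and the root.

```python
import hashlib


def h(data):
    return hashlib.sha256(data).digest()


def build_levels(leaves):
    """Return every level of the tree, from hashed leaves up to the root."""
    levels = [[h(b"\x00" + leaf) for leaf in leaves]]
    while len(levels[-1]) > 1:
        cur = list(levels[-1])
        if len(cur) % 2 == 1:
            cur.append(cur[-1])  # assumption: duplicate the last node
        levels.append([h(b"\x01" + cur[i] + cur[i + 1])
                       for i in range(0, len(cur), 2)])
    return levels


def make_proof(levels, index):
    """Collect (sibling_hash, sibling_is_left) pairs from leaf to root."""
    proof = []
    for level in levels[:-1]:
        padded = level + [level[-1]] if len(level) % 2 == 1 else level
        sibling = index ^ 1
        proof.append((padded[sibling], sibling < index))
        index //= 2
    return proof


def verify(leaf, proof, root):
    """Recompute the path to the root using only the leaf and its proof."""
    node = h(b"\x00" + leaf)
    for sibling, sibling_is_left in proof:
        node = h(b"\x01" + sibling + node) if sibling_is_left else h(b"\x01" + node + sibling)
    return node == root


t_bom = [b"lic-1:art-A", b"lic-2:art-B", b"lic-3:art-C"]
levels = build_levels(t_bom)
root = levels[-1][0]
proof = make_proof(levels, 2)
print(verify(b"lic-3:art-C", proof, root))
print(verify(b"lic-9:art-Z", proof, root))
```

The proof length grows with the logarithm of the T-BOM size, so a verifier can check one artist's work without downloading the full list.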

[0076] In summary, to handle the massive volume of AI training data, this embodiment constructs a high-dimensional feature index using a vector database, enabling millisecond-level similarity retrieval at a scale of hundreds of millions of items and significantly improving audit efficiency. Furthermore, a complete chain of evidence is established from work registration to training authorization to model release. In particular, by generating a Training Bill of Materials (T-BOM) and binding it to the model metadata, traceability of the AI model training phase is achieved.

[0077] It is understood that the system in this application generates unique digital fingerprints for multimodal artworks such as paintings and music, and stores the artwork fingerprints, creator identity information, and timestamps on the blockchain to establish an immutable basis for artwork registration. On this basis, the system provides AI training providers with a dataset scanning and fingerprint comparison module that automatically matches the contents of a training dataset against registered works. Combined with smart contracts, it automatically decides whether a work is authorized for use and within what scope, thus forming a complete traceability loop that proves both the existence of the work and its use in training. This invention solves the problems, in existing AI model training processes, of opaque training data sources, creators' inability to know whether their works have been used for training, and the difficulty of accurate evidence collection and access control. It realizes traceable, verifiable, and settlement-ready end-to-end management of personal artworks in AI training scenarios, and has good scalability and application value.

[0078] The above description is merely a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein and should not be construed as excluding other embodiments. It can be used in various other combinations, modifications, and environments, and can be altered within the scope of the concept described herein through the above teachings or related technologies or knowledge. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention should be within the protection scope of the appended claims.

Claims

1. A blockchain-based method for protecting the copyright of AI training data, characterized in that it comprises:
S1. Extracting semantic feature vectors from artworks using a feature extraction model, generating digital fingerprints, and writing the digital fingerprints, together with the usage permission policies set by the creators, into the blockchain;
S2. The AI trainer extracting the feature vectors of the training data locally, comparing them with the digital fingerprints on the chain, generating a compliance proof based on zero-knowledge proof, and submitting the matching result and the proof to the blockchain;
S3. The smart contract on the blockchain querying the permissions of the corresponding work based on the matching result, determining whether to authorize, and generating a digital license credential or a rejection log;
S4. After the AI training is completed, aggregating the digital license credentials obtained in this training to generate a training data compliance report, which is bound to the AI model and stored on the blockchain.

2. The method for protecting the copyright of AI training data based on blockchain according to claim 1, characterized in that the generation of digital fingerprints includes: extracting the artwork's feature vector in the semantic space using a pre-trained multimodal deep neural network; hashing the feature vector to generate a fingerprint hash value that serves as the artwork's unique identifier; and packaging the fingerprint hash value, the creator's identity information, the timestamp, and the usage permission policy and writing them into the blockchain.

3. The method for protecting the copyright of AI training data based on blockchain according to claim 1, characterized in that the generation of compliance proofs based on zero-knowledge proofs includes: the AI trainer locally using a vector database to build an index for the feature vectors of the training data; comparing the local index for similarity with a digital fingerprint database synchronized from the blockchain, and marking data items whose similarity exceeds a preset threshold; and, based on a pre-deployed zero-knowledge proof circuit, generating a cryptographic proof that proves the authenticity of the comparison process and results and does not contain information from the original training data.

4. A blockchain-based AI training data copyright protection system, used to implement the blockchain-based AI training data copyright protection method according to any one of claims 1-3, characterized in that it comprises:
an artwork fingerprint generation and registration terminal, used to extract the semantic feature vectors of artworks and generate digital fingerprints, and to set usage permission policies and upload them to the blockchain;
a privacy-preserving AI dataset auditing engine, used to extract features from and build indexes on training data locally, to compare similarity with the on-chain digital fingerprint database, and to generate zero-knowledge proofs;
a blockchain network and smart contract authorization system, used to store immutable ownership and policy records, and to automatically execute and record authorization decisions based on smart contracts;
an AI model compliance traceability interface, used to generate a training data compliance report after training is completed and bind it to the AI model.

5. The blockchain-based AI training data copyright protection system according to claim 4, characterized in that the artwork fingerprint generation and registration terminal includes: a pre-trained deep neural network, used to extract feature vectors of artworks in the semantic space and generate digital fingerprints with semantic invariance.

6. The blockchain-based AI training data copyright protection system according to claim 4, characterized in that the blockchain network and smart contract authorization system adopts a hybrid storage architecture that combines on-chain indexing and off-chain storage, in which: the original work files and detailed permission policy files are encrypted and stored in a distributed storage network, with only their content address hashes stored on the blockchain; and the blockchain maintains a work registration mapping table and an authorization record mapping table, which respectively record the ownership information associated with the digital fingerprint hash and each authorization status change.

7. The blockchain-based AI training data copyright protection system according to claim 4, characterized in that the smart contract authorization system is also used for: when a work is matched and the request is verified to be consistent with the permission policy, automatically generating a non-transferable digital license credential, which is bound to the AI training party's account and contains the scope of authorization and a timestamp; and when the request does not conform to the permission policy, automatically recording an unauthorized access log and sending a compliance warning to the AI training party.

8. The blockchain-based AI training data copyright protection system according to claim 4, characterized in that the AI model compliance traceability interface is also used for: aggregating all digital license credentials obtained during the training process to generate a training data composition table, and generating a compliance fingerprint using the Merkle tree algorithm; and writing the compliance fingerprint into the metadata of the AI model file and establishing, in the blockchain, a mapping relationship between the model file hash and the compliance fingerprint.

9. The blockchain-based AI training data copyright protection system according to claim 4, characterized in that the AI model compliance traceability interface also includes a public verification interface, which allows third parties to query the corresponding training data composition table and the authorization status of specific works by model hash or compliance fingerprint, and to verify the result based on the Merkle proof mechanism.