A method for traceable management of AI model training data
By generating original sample identifiers and derived sample identifiers, recording each processing step, establishing consumption vouchers and impact indexes, and locking the associated model version, the problem of tracing the scope of impact of problematic samples in existing technologies is solved, and precise model version handling is achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHONGQING HIKE NETWORK TECH CO LTD
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-30
AI Technical Summary
Existing technologies make it difficult to accurately trace the impact range of problematic samples after training data has undergone multiple processing stages. This leads to model version handling often remaining at the level of extensive batch rollback or full retraining, lacking a detailed traceability mechanism down to specific samples.
The method receives raw samples and generates raw sample identifiers; records a derived sample identifier for each cleaning, segmentation, enhancement, relabeling, and sampling step; generates consumption vouchers; establishes an impact index when the model version is frozen; locks the associated model versions based on problem samples; and performs failure marking and targeted retraining.
It enables continuous traceability from the original sample to the model version, accurately defines the scope of impact of problematic samples, avoids batch rollback and large-scale manual investigation, and improves the pertinence and controllability of model version handling.
Figure CN122309513A_ABST
Description
Technical Field
[0001] This invention relates to the field of artificial intelligence training data traceability management technology, specifically to a method for traceable management of AI model training data.
Background Technology
[0002] With the continuous application of artificial intelligence models in scenarios such as quality inspection and recognition, speech analysis, and text discrimination, the scale of training data is constantly expanding, and the data sources are gradually changing from a single collection source to a form where multiple sources are accessed, multiple rounds of cleaning and processing are carried out, and multiple iterations of training coexist. In existing training management methods, raw data, some processed results, training task records, and model files are usually stored separately. Some systems can also realize dataset version management, training task logging, and model version registration to support training process review, result backtracking, and model deployment management. These methods can meet the basic needs of general training organization and version management, but their recording granularity is mostly limited to the dataset level, task level, or artifact level, and more often reflects the correspondence of "a certain batch of data participated in a certain training" or "a certain model comes from a certain task".
[0003] In practice, training data is often not directly input into the training process in its raw state. Instead, it undergoes multiple processing steps, including cleaning, splitting, augmentation, relabeling, and sampling. The final batch of training data typically consists of a processed set of derived samples. While existing technologies can preserve some training logs, sample source information, and model version information, it is often difficult to maintain a continuous correspondence between the original samples, derived samples, training consumption records, and model versions after multiple rounds of derivation processing. Especially when issues such as label correction, sample removal, content distortion, or source conflicts occur, existing methods often only allow for review at the dataset or task level, making it difficult to accurately identify which training batches and model versions were actually affected by specific problematic samples, and also to further differentiate the extent of the impact.
[0004] Therefore, after training data undergoes multiple processing stages and enters multiple rounds of training, existing technologies lack a traceability mechanism that starts from a specific problem sample and continuously connects the original samples, derived samples, training consumption records, and model versions. This makes it difficult to accurately define the actual impact scope of problem samples, so subsequent model version handling often resorts to crude approaches such as batch investigation, full-round retraining, or large-scale rollback. The problem does not stem from existing systems being unable to record training activities, but from the fact that existing record relationships have not been refined down to the actual training consumption chain of specific samples, and therefore cannot meet the needs for precise traceability of problem samples and targeted handling of associated model versions.
Summary of the Invention
[0005] To address the shortcomings of existing technologies, this invention provides a method for traceable management of AI model training data, thereby resolving the problems mentioned in the background section.
[0006] To achieve the above objectives, the present invention provides the following technical solution: a method for traceable management of AI model training data, comprising:
S1. Receive the original sample, extract the content fingerprint, source identifier, labeled version, and collection batch, and generate and register the original sample identifier;
S2. For each cleaning, segmentation, enhancement, relabeling, and sampling process of the original sample, record the parent sample identifier, processing type, processing parameters, and execution time, and generate the derived sample identifier;
S3. When assembling the training batch, record the derived sample identifier, batch identifier, round identifier, label version, and sampling rule for the derived samples entering the training batch, and generate a consumption voucher;
S4. When the model version is frozen, collect the consumed derived sample identifiers based on the consumption vouchers and establish an impact index from the model version to the consumption vouchers;
S5. Receive the problem sample trigger request, trace the derived sample identifiers based on the original sample identifier, lock the associated model version based on the impact index, and generate the impact adjudication result;
S6. Based on the impact adjudication result, perform failure marking, targeted retraining, and version update on the associated model version, and write the new consumption voucher and new impact index registration record corresponding to the version update.
[0007] Furthermore, S1 includes: when receiving the original sample, a content fingerprint is generated based on the content payload of the original sample; the original sample identifier is generated based on the content fingerprint, source identifier, labeled version, collection batch, and rule version number; and an original sample registration record is generated based on the original sample identifier and written into the registration database.
[0008] Furthermore, S2 includes: For each cleaning, segmentation, enhancement, relabeling, and sampling process of the original sample, record the parent sample identifier, processing type, processing parameters, and execution time; and generate a processing parameter summary based on the processing parameters. A derived sample identifier is generated based on the parent sample identifier, processing type, processing parameter summary, time segment corresponding to the execution time, and rule version number.
[0009] Furthermore, the processing parameter summary is reorganized according to the main parameters participating in the derivation determination, based on a fixed field order, unified encoding rules, and unified null value placeholder rules. The derived registration record corresponding to the derived sample identifier is written into the derived database, and a derived evidence record is generated by appending it to the parent sample evidence chain index.
[0010] Furthermore, S3 includes: during training batch assembly, the derived sample identifier, batch identifier, round identifier, label version, and sampling rules are recorded for the derived samples written into the formal batch set. Consumption vouchers are generated based on the derived sample identifiers, batch identifiers, round identifiers, label versions, and sampling rules; consumption registration records are generated based on the consumption vouchers and written into the consumption database.
[0011] Furthermore, S4 includes: when the model version is frozen, an impact index registration record is generated based on the model version identifier, the consumption voucher in the formal consumption record, the consumed derived sample identifier, the batch identifier, and the round identifier. The impact index registration record uses the model version identifier as the primary index key and the consumption voucher as the secondary index key, and is written to the impact index database.
[0012] Furthermore, S5 includes: After receiving a problem sample trigger request, starting from the original sample identifier, search for valid derived records whose parent sample identifier is equal to the original sample identifier to obtain a set of derived sample identifiers. Based on the consumption vouchers in the formal consumption records corresponding to the derived sample identifier set, retrieve the impact index registration records of the formal index status, lock the associated model version, and generate the impact adjudication results.
[0013] Furthermore, S6 includes: When marking the associated model version as invalid based on the impact adjudication result, the version status of the associated model version is rewritten to the corresponding status among invalid pending replacement, invalid retention, and invalid withdrawal. A supplementary training task is generated based on the sample set of the problem chain to be eliminated and the alternative sample set.
[0014] Furthermore, a new formal batch set is formed based on the supplementary training tasks; new consumption vouchers are generated based on the new formal batch set; and, based on the new model version identifier and the new consumption vouchers generated by the version update, a new impact index registration record is generated and written to the impact index database.
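The linkage that S2 through S5 rely on can be pictured with a minimal Python sketch. All names and the in-memory dictionaries below are hypothetical stand-ins for the derived database, consumption database, and impact index database; the level-by-level traversal is one assumed reading of how derived identifiers are traced from an original sample identifier, not the claimed implementation.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the three databases.
derived_parent = {   # derived sample identifier -> parent sample identifier
    "D1": "O1", "D2": "D1", "D3": "O2",
}
vouchers = {         # consumption voucher -> consumed derived sample identifier
    "V1": "D2", "V2": "D3",
}
impact_index = {     # model version identifier -> consumption vouchers
    "M1": ["V1"], "M2": ["V1", "V2"], "M3": ["V2"],
}


def trace_derived(original_id):
    """Collect every derived identifier reachable from original_id."""
    children = defaultdict(list)
    for child, parent in derived_parent.items():
        children[parent].append(child)
    found, frontier = set(), [original_id]
    while frontier:
        current = frontier.pop()
        for child in children[current]:
            if child not in found:
                found.add(child)
                frontier.append(child)
    return found


def affected_versions(original_id):
    """Return model versions whose vouchers consumed a traced derived sample."""
    derived = trace_derived(original_id)
    hit = {v for v, d in vouchers.items() if d in derived}
    return sorted(m for m, vs in impact_index.items() if hit.intersection(vs))
```

In this toy data, a problem report on original sample O1 reaches D1 and D2, then voucher V1, so only the model versions indexed against V1 are locked.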
[0015] Compared with the prior art, the present invention has the following beneficial effects: 1. By continuously associating the original sample identifier, derived sample identifier, consumption voucher, model version identifier, and impact index, and upon receiving a trigger request from a problem sample, the system traces the derived sample identifier level by level starting from the original sample identifier, and then locks the associated model version based on the consumption voucher and impact index. This achieves the effect of tracing the problem sample from the original collection stage, sample processing stage, training consumption stage all the way to the model version stage. This solves the problem in existing technologies that can only stay at the dataset level, task level, or artifact level of association, making it difficult to accurately define the actual impact range of specific problem samples on training batches and model versions. This avoids the crude handling of whole batch rollback, whole round retraining, or large-scale manual investigation.
[0016] 2. By performing failure marking, targeted retraining, and version updates on the associated model versions based on the impact adjudication results, and by writing new consumption vouchers and new impact indexes for the new formal batch set formed after retraining, the model versions affected by the problem samples can be handled and updated in a targeted manner without disrupting the existing version traceability chain and historical evidence chain. This transforms model version handling from simple invalidation and deactivation into a traceable, replaceable, and updatable continuous governance process, thereby improving the pertinence, continuity, and controllability of model version handling.
Attached Figure Description
[0017] Figure 1 is a flowchart illustrating a method for traceable management of AI model training data according to the present invention.
Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] Example: Figure 1 is a flowchart illustrating a method for traceable management of AI model training data according to the present invention. The method includes: S1. Receive the original sample, extract its content fingerprint, source identifier, labeled version, and collection batch, and generate and register the original sample identifier. The specific implementation is as follows: When receiving raw samples, the registration node deployed at the training data access boundary extracts the content fingerprint, source identifier, labeled version, and collection batch during the first reception period, that is, when the sample first enters the management domain of this method and is first received and registered. Based on these, the raw sample identifier is generated and registered. Here, the raw sample is the first registered content unit that is submitted by the upstream source in a single submission and has not undergone cleaning, segmentation, enhancement, relabeling, or sampling processing in this method. In the image scenario, it can be a camera-captured image; in the voice scenario, it can be a continuous recording; in the text scenario, it can be a business record; and in the industrial quality inspection production line scenario, it can be a single defect image, or the corresponding single quality inspection record, formed by the production line camera when the workpiece passes through the inspection station.
[0020] The content fingerprint is a fixed-length identifier representing the original sample content. Within the current reception period after the sample is fully received, the registration node calls the parsing template of the locked version according to the source type. It extracts the byte sequence of the pixel carrier area for images, the byte sequence of the audio payload area for speech, and the unified encoded text payload for text. After removing the transmission message header, link check segment, and outer container field, it is reassembled into a continuous byte sequence in a fixed order. When missing segments are encountered, they are filled with fixed placeholder segments. Then, a fixed-length fingerprint value is generated according to the fingerprint rules of the locked version. The fingerprint rule version includes at least the parsing template version, the content truncation rule version, the byte reassembly order version, the null value substitution rule version, and the fingerprint length version. The registration node reads and locks the effective version of the current window from the rule repository at the starting point of the scrolling observation window. The content stored in the rule repository includes at least the rule version of each step, parameter template, threshold parameter, state transition constraint, and version effective time. When each node enters the current observation window, it locks the effective version of the current window in read-only mode.
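The reassembly and fingerprinting described above can be sketched in Python as follows. This is only an illustration under stated assumptions: SHA-256 stands in for the unspecified fingerprint rule, the 16-byte zero block stands in for the fixed placeholder segment, and the function name and default length are hypothetical.

```python
import hashlib

# Assumed placeholder segment used when a payload segment is missing.
PLACEHOLDER = b"\x00" * 16


def content_fingerprint(segments, length=32):
    """Hash payload segments in their fixed order into a fixed-length value.

    segments: list of bytes objects, with None marking a missing segment.
    length: number of hex characters to keep (at most 64 for SHA-256).
    """
    digest = hashlib.sha256()
    for segment in segments:
        # Missing segments are filled with the fixed placeholder segment.
        digest.update(PLACEHOLDER if segment is None else segment)
    return digest.hexdigest()[:length]
```

Because headers and container fields are stripped before hashing, the same payload resent through a different transport would map to the same fingerprint in this sketch.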
[0021] The source identifier is a source code that represents which field equipment, business system, or manually labeled terminal the original sample comes from. In the production line scenario, it is generated by concatenating the tenant number, production line number, workstation number, equipment number, and terminal number in a fixed order. Fixed separators are used between fields, and missing fields are filled with preset placeholder values. The character encoding is unified, letters are uniformly converted to uppercase, and leading zeros are retained. The source identifier does not change when the equipment display name changes. When the equipment is replaced, a new source identifier is generated and written into the inheritance mapping.
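A minimal sketch of the production line concatenation rule is shown below. The separator `-`, the placeholder `NA`, and the function name are assumptions chosen for illustration; the text only requires that they be fixed.

```python
def build_source_identifier(tenant, line, station, equipment, terminal,
                            separator="-", placeholder="NA"):
    """Join the five source fields in a fixed order.

    Fields are passed as strings so that leading zeros are retained;
    letters are converted to uppercase and missing fields are replaced
    by the preset placeholder value.
    """
    fields = [tenant, line, station, equipment, terminal]
    parts = [field.strip().upper() if field else placeholder for field in fields]
    return separator.join(parts)
```

For example, `build_source_identifier("t01", "l003", "ws07", None, "0012")` yields `T01-L003-WS07-NA-0012`.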
[0022] The label version is the label rule version applicable to the currently attached label, which is read by the registration node from the label file, label service feedback record, or manual label submission record at the time of receipt. The collection batch is the batch number defined according to the continuous receiving interval on site. In the main implementation, it is defined according to the start and end interval of the work order. When the work order number is missing, it falls back to the start and end interval of the shift. When the shift information is also missing, it falls back to the continuous operation interval of the equipment. When a work order switch and a shift switch occur simultaneously, the work order switch time is used as the dividing point.
[0023] To ensure completeness, the registration node performs preliminary processing before extraction. Timestamps from different devices are uniformly converted to the field master clock, with milliseconds as the time unit. A deviation of ±5 milliseconds is permitted; if the deviation exceeds this limit, the registration node's completion time is used as the business time and the time synchronization deviation is recorded. Blank fields in the metadata are padded according to fixed rules. Padded fields are limited to non-primary-key subfields in the source identifier, the collection batch prompt value, and the microsecond digit of the reception time. Content fingerprints and label version primary values must not be padded. The order of padding sources is fixed as follows: first the value attached to the current message, then the effective value from the most recent successfully registered and non-withdrawn record within the last 24 hours under the same source identifier, and finally the system default value. When either of the latter two is used for padding, the padding source mark is written. For fragment damage during content reading, fragments of fixed length are checked in order. The fixed fragment length is measured in bytes in image, voice, and text scenarios, and the main implementation uses 4096 bytes. The damage ratio is calculated as the number of failed fragments as a percentage of the total number of fragments. The first threshold and the second threshold are both configuration values of the locked version in the rule repository, and the main implementation uses 1% and 5% respectively. When the damage ratio does not exceed the first threshold, the damaged fragments are replaced with fixed placeholder fragments and registration continues. When it exceeds the first threshold but does not exceed the second threshold, the sample is transferred to delayed re-retrieval. When it exceeds the second threshold, the sample is rejected directly.
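The two-threshold damage decision can be sketched as a small Python function. The function and return-value names are hypothetical; the defaults follow the main implementation values given above (4096-byte fragments, 1% and 5% thresholds).

```python
import math


def damage_decision(total_bytes, failed_fragments, fragment_size=4096,
                    first_threshold=0.01, second_threshold=0.05):
    """Classify a sample by the share of fixed-length fragments that failed."""
    total_fragments = max(1, math.ceil(total_bytes / fragment_size))
    ratio = failed_fragments / total_fragments
    if ratio <= first_threshold:
        return "replace_and_register"      # fill with placeholder fragments
    if ratio <= second_threshold:
        return "delayed_re_retrieval"
    return "reject"
```

With a payload of 1000 fragments, up to 10 failed fragments still register, 11 to 50 go to delayed re-retrieval, and 51 or more are rejected.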
[0024] The registration node completes the acceptance judgment within a scrolling observation window. The observation window length can be set from 1 second to 30 seconds, with a default value of 5 seconds. Within this observation window, if the content is complete, the source identifier is resolvable, the labeled version exists, and the collection batch can be assigned, the generation of the original sample identifier is triggered. The observation window in each step refers to the time interval within which the node completes rule version locking, condition adjudication, and formal result generation within a fixed duration. Unless otherwise explicitly defined in the corresponding step, the same effective version and the same adjudication caliber are used within the observation window. If the above fields are missing and trigger a stop rule, the adjudication is carried out in descending order of severity: continuous failure of content integrity verification, continuous missing source identifier, continuous missing labeled version, and continuous inability to assign collection batch. If the more severe rule is established, the less severe rule will not be executed. Continuous failure of content integrity verification triggers a 30-second freeze of the source link, while other missing values only trigger rejection of the current sample.
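The severity-ordered stop rules can be illustrated with the sketch below. The rule keys and action labels are hypothetical names; only the ordering and the split between the 30-second freeze and plain rejection come from the text.

```python
# Stop rules in descending order of severity, paired with the action each triggers.
STOP_RULES = [
    ("content_integrity_failed", "freeze_source_link_30s"),
    ("source_identifier_missing", "reject_sample"),
    ("label_version_missing", "reject_sample"),
    ("collection_batch_unassignable", "reject_sample"),
]


def adjudicate(conditions):
    """Return the most severe rule that holds and its action.

    conditions: dict mapping rule names to booleans. Once a more severe
    rule is established, less severe rules are not evaluated.
    """
    for rule, action in STOP_RULES:
        if conditions.get(rule, False):
            return rule, action
    return None, "generate_original_sample_identifier"
```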
[0025] To suppress duplicate registrations, the registration node generates an idempotent key, which consists of a source identifier, a device-side sample sequence number, a receiving time segment, and a content fingerprint. The receiving time segment takes the second and millisecond values of the receiving time. When the device-side sample sequence number is missing, it is replaced by a locally incrementing receiving sequence number under the source identifier and marked as a proxy sequence number. The idempotent key in each step is used to identify whether the same business action belongs to the same formal registration event under the condition of repeated triggering. Unless otherwise explicitly defined in the corresponding step, the idempotent determination is only effective within the scope of the current step and does not replace the previous identifier across steps. Within a 10-second deduplication window, the determination is made in the order of comparing the source identifier, then the device-side sample sequence number, then the receiving time segment, and finally the content fingerprint. If the first three items are consistent and the content fingerprint is consistent, it is determined to be a duplicate, and only the initial registration record is retained, while the duplicate triggering record is added to the evidence chain. If the first three items are consistent but the content fingerprint is inconsistent, it is determined to be a conflict and transferred to the manual review queue.
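The duplicate and conflict determination can be sketched as follows. The class and function names are hypothetical, and the caller is assumed to pass only keys that fall inside the 10-second deduplication window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IdempotentKey:
    source_id: str
    sequence_no: int      # device-side or proxy sequence number
    time_segment: str     # second and millisecond part of the receiving time
    fingerprint: str


def classify(new_key, keys_in_window):
    """Compare the first three fields, then the content fingerprint."""
    for old in keys_in_window:
        same_first_three = (
            new_key.source_id == old.source_id
            and new_key.sequence_no == old.sequence_no
            and new_key.time_segment == old.time_segment
        )
        if same_first_three:
            if new_key.fingerprint == old.fingerprint:
                return "duplicate"   # keep the initial record, log the trigger
            return "conflict"        # route to the manual review queue
    return "new"
```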
[0026] The original sample identifier is generated by the registration node according to the identifier generation rules of the locked version. The identifier generation rules have fixed fields: content fingerprint, source identifier, label version, collection batch and rule version number. The fixed order is content fingerprint first and rule version number second. When an identifier conflict occurs, a conflict regeneration sequence number is added while keeping the aforementioned fields unchanged and the sample is re-registered.
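One way to picture the identifier generation is the sketch below. It is an illustration only: the `OS-` prefix, the `|` separator, the SHA-256 digest, and the field order (taken here as the order in which the fields are listed) are all assumptions, since the actual generation rule lives in the locked rule version.

```python
import hashlib


def original_sample_identifier(fingerprint, source_id, label_version,
                               batch, rule_version, regen_seq=0):
    """Derive an identifier from the five fixed fields.

    regen_seq is the conflict regeneration sequence number; it is only
    appended when a collision forces re-registration, leaving the five
    fixed fields unchanged.
    """
    fields = [fingerprint, source_id, label_version, batch, rule_version]
    if regen_seq:
        fields.append(str(regen_seq))
    joined = "|".join(fields)
    return "OS-" + hashlib.sha256(joined.encode("utf-8")).hexdigest()[:24]
```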
[0027] During generation, the rule version number, execution node number, execution time, upstream message digest, previous evidence record index, and current record index are synchronously written. The current record index is formed by sequentially incrementing the number within the partition where the registration node is located, forming an evidence chain that is sequentially appended and cannot be written back. Each record in the evidence chain is appended in the order of event occurrence and must contain at least the previous index and the current index. New records are only appended to the event records formed in the current step and do not overwrite existing records formed in previous steps. If the evidence chain appending fails, the registration status is set to "to be supplemented" and the official original sample identifier is not released to the public.
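The append-only behavior of the evidence chain can be sketched with a small class. The class name, the use of 0 as the "no previous record" index, and the in-memory list are assumptions for illustration.

```python
class EvidenceChain:
    """Append-only record list; each record links to the previous index."""

    def __init__(self):
        self._records = []

    def append(self, payload):
        """Append a record in event order; existing records are never rewritten."""
        previous_index = self._records[-1]["current_index"] if self._records else 0
        record = {
            "previous_index": previous_index,
            "current_index": previous_index + 1,
            "payload": payload,
        }
        self._records.append(record)
        return dict(record)  # return a copy so callers cannot edit the chain

    def is_consistent(self):
        """Check that indexes increase by one and link back correctly."""
        expected_previous = 0
        for record in self._records:
            if record["previous_index"] != expected_previous:
                return False
            if record["current_index"] != expected_previous + 1:
                return False
            expected_previous = record["current_index"]
        return True

    def __len__(self):
        return len(self._records)
```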
[0028] The purpose of this is to bind subsequent derived samples, consumption vouchers, and impact indexes to a unique and verifiable original sample starting point, so as to avoid multiple identities of the same on-site sample due to repeated uploading, label re-extraction, and device resending. Its applicable boundary is the training data access scenario of continuous sample delivery. It does not require the upstream device to adopt a dedicated protocol, but it requires the upstream to at least provide readable sample content, source information, and receiving time.
[0029] After registration, an original sample registration record is generated. The record includes at least the original sample identifier, content fingerprint, source identifier, label version, collection batch, business time, execution time, rule version number, evidence chain index, and registration status. The records are stored in the registration database and sequentially written into the partition file and the index database for retrieval by the next step based on the original sample identifier. Upstream submits samples through message queues or access interfaces, and downstream reads registration records through query interfaces. The registration request must include at least the sample content address or binary direct transmission content, source identifier, upstream submission time, label payload, and collection batch prompt value. Upon success, the original sample identifier and registration status code are returned; upon failure, an error code and handling suggestions are returned. Among them, 1001 and 1007 allow automatic retry using the original idempotent key; 1002, 1003, and 1004 require completion and resubmission; 1005 does not allow retry; 1006 is transferred to manual review; 1008 triggers a rule version consistency check; and if 1008 occurs three times consecutively, the system switches to the previous effective version and writes it into the version rollback evidence chain.
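The handling attached to each error code can be summarized in a short dispatch sketch. The action labels and the `consecutive_1008` counter (assumed to include the current occurrence) are hypothetical; the meanings of the codes themselves are not restated here.

```python
RETRY_WITH_SAME_KEY = {1001, 1007}
COMPLETE_AND_RESUBMIT = {1002, 1003, 1004}


def handle_error(code, consecutive_1008=0):
    """Map a registration error code to its handling path."""
    if code in RETRY_WITH_SAME_KEY:
        return "retry_with_original_idempotent_key"
    if code in COMPLETE_AND_RESUBMIT:
        return "complete_and_resubmit"
    if code == 1005:
        return "no_retry"
    if code == 1006:
        return "manual_review"
    if code == 1008:
        # Three consecutive occurrences switch to the previous effective version.
        if consecutive_1008 >= 3:
            return "switch_to_previous_rule_version"
        return "rule_version_consistency_check"
    raise ValueError(f"unknown error code: {code}")
```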
[0030] Regarding time and resource constraints, the maximum delay for a single sample from arrival at the registration node to completion of registration can be set to 200 milliseconds to 2 seconds, the concurrent receiving capacity can be set to 100 to 10,000 samples per second, and the default number of retries can be set to 2. If registration is not completed after exceeding the delay limit, the sample will be transferred to the delay queue. The order of supplementary registration will be first in ascending order of the original receiving time, and then in ascending order of the receiving sequence number within the source identifier group. The original receiving time will be retained as the business time, and the supplementary registration time will be recorded as the execution time.
[0031] If the access link is interrupted, samples that have been verified but not yet entered into the database are temporarily stored in the local buffer. When the buffer reaches its capacity limit, the latest complete batch is retained first according to the collection batch. A complete batch is a batch in which the work order end mark has been reached and the sample reception count in the batch has reached the count declared in the batch header. Manual review can only be performed by the review terminal with review authority. The review conclusion is limited to the confirmation of duplicates, the splitting of conflicting items, the correction of the source, and the rejection of registration. The original registration record cannot be overwritten after review. Only an additional ruling record can be generated and the original evidence chain index can be used.
[0032] Regarding security and compliance boundaries, the registration node only accepts original samples submitted by authorized sources. For fields that can directly identify natural persons, such as names, ID numbers, contact numbers, address details, and account fields, masking is performed according to the desensitization rules of the locked version after the content fingerprint is generated and before the record is officially stored in the database. Unmasked content is only stored in a restricted isolation area, and desensitized copies are stored in the registration database.
[0033] On-site performance shall be tested according to the following criteria: with a continuous sample size of no less than 10,000 samples covering scenarios of continuous operation, source retransmission, short-term link interruption, and rule switching, the original sample identifier duplication rate shall be calculated as the number of different original samples assigned the same original sample identifier divided by the total number of registered samples. The omission registration rate shall be calculated as the number of samples that meet the registration conditions but have not formed a formal registration record divided by the total number of registered samples. The duplication rate shall not exceed 0.001%, the omission registration rate shall not exceed 0.01%, and the proportion of samples meeting the delay limit shall not be less than 99.5%.
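The three acceptance limits can be checked with integer arithmetic, as in the sketch below. The function name is hypothetical, and using the total number of registered samples as the denominator for the delay proportion is an assumption, since the text does not state that denominator.

```python
def meets_acceptance(total_registered, colliding, omitted, within_delay):
    """Check the duplication, omission, and delay criteria.

    Integer comparisons avoid floating-point rounding:
    0.001% = 1/100000, 0.01% = 1/10000, 99.5% = 995/1000.
    """
    if total_registered < 10_000:
        raise ValueError("test requires at least 10,000 samples")
    duplication_ok = colliding * 100_000 <= total_registered
    omission_ok = omitted * 10_000 <= total_registered
    delay_ok = within_delay * 1000 >= total_registered * 995
    return duplication_ok and omission_ok and delay_ok
```

At the minimum size of 10,000 samples, these limits allow no identifier collisions, at most 1 omitted registration, and at most 50 samples exceeding the delay limit.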
[0034] Preferably, in the steel plate surface defect detection production line, the production line camera resolution can be set to 4096×3000, the size of a single image is about 18 megabytes, the receiving rhythm can be set to 12 frames per second, the scrolling observation window is 5 seconds, the time alignment tolerance is ±2 milliseconds, the deduplication window is 10 seconds, and the upper limit of single-item registration delay is 350 milliseconds. In an 8-hour shift that continuously receives 36,000 images, the generated original sample identification is collision-free, 112 duplicate resends are correctly merged, and 3 conflicting items are transferred to the review queue, with a registration success rate of 99.97%.
[0035] Alternatively, in voice quality inspection scenarios, the original sample can be set as a single recording file, and the source identifier can be changed to consist of the recording terminal number, the business seat number, and the session number, while the remaining registration logic, version locking, evidence chain traceability, idempotent order control, interface return constraints, and downstream calling methods remain unchanged.
[0036] S2. For each cleaning, segmentation, enhancement, relabeling, and sampling process of the original sample, record the parent sample identifier, processing type, processing parameters, and execution time, and generate derived sample identifiers. The specific implementation is as follows: For each cleaning, segmentation, enhancement, relabeling, and sampling process of the original samples, the derived registration node, deployed downstream of the training data access boundary and connected to the registration library and the derived library, reads the original sample registration record or the already registered derived record. During the current derived period after the parent sample completes formal registration, it records the parent sample identifier, processing type, processing parameters, and execution time, and generates the derived sample identifier accordingly. The current derived period starts at the time when the derived registration node receives the derived task and ends at the time when this derived registration is completed.
[0037] The parent sample identifier is the identifier of the next higher level sample that the current derivation action is directly based on. The original sample identifier is taken during the first derivation, and the previously registered derivation sample identifier is taken during subsequent derivations. When the parent sample is the original sample, it is read from the registration database, and when the parent sample is the derivation sample, it is read from the derivation database.
[0038] The processing types are limited to five categories: cleaning, segmentation, enhancement, relabeling, and sampling. Actions aimed at restoring sample usability, removing damaged samples, or correcting invalid payloads are categorized as cleaning. Actions aimed at expanding sample representation without repairing damage are categorized as enhancement. When the same parent sample involves both types of actions, it is split into two consecutive derivations: cleaning followed by enhancement. Segmentation only changes the boundary range of a single parent sample and does not determine whether it enters subsequent links. Sampling only determines whether candidate samples are retained and does not change the boundary range of a single sample. For scenarios where some sub-samples are retained after segmentation, segmentation registration is completed first, followed by sampling registration of the segmentation results. Any action involving changes to label values, label boundaries, label attributes, or label review conclusions is categorized as relabeling.
[0039] Processing parameters are submitted as a structured key-value sequence, with key names taken from the parameter template for the current processing type. The processing parameter summary is formed by recombining the main parameters that participate in the derivation decision according to a fixed field order, a unified unit convention, a unified character encoding, and unified decimal truncation rules. Unless otherwise explicitly defined, every summary in this document is formed in the same way: the fields that participate in the main decision of the current step are recombined according to a fixed order, unified encoding, unified unit convention, and unified null-value placeholder rules. Auxiliary parameters that do not participate in the derivation decision are excluded from the summary. Null-value fields participate in the recombination with fixed placeholder values. Image position parameters are uniformly converted to integer pixel coordinates, time parameters to integer millisecond values, text position parameters to integer character-sequence values, and Boolean parameters to 0 or 1.
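The normalization rules above can be illustrated with a minimal sketch. The placeholder character, the three-digit truncation depth, the `|` separator, and the key names are assumptions chosen for illustration; the text only requires that they be fixed and uniform.

```python
from decimal import Decimal, ROUND_DOWN

NULL_PLACEHOLDER = "~"  # hypothetical fixed placeholder for null-value fields


def normalize_value(value, decimals=3):
    """Convert one parameter value to its canonical string form."""
    if value is None:
        return NULL_PLACEHOLDER
    if isinstance(value, bool):  # checked before int: bool is a subclass of int
        return "1" if value else "0"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, float):
        # Truncate (never round) to a fixed number of decimal places.
        quantum = Decimal(1).scaleb(-decimals)
        return str(Decimal(str(value)).quantize(quantum, rounding=ROUND_DOWN))
    return str(value)


def parameter_summary(params, main_fields):
    """Join the main decision fields in template order; other keys are ignored."""
    return "|".join(f"{name}={normalize_value(params.get(name))}" for name in main_fields)


# Hypothetical segmentation template: only these keys enter the summary.
SEGMENTATION_FIELDS = ["start", "end", "segment_no", "overlap"]
summary = parameter_summary(
    {"start": 0, "end": 512, "segment_no": None, "overlap": 32, "operator_note": "x"},
    SEGMENTATION_FIELDS,
)
```

Because auxiliary keys such as `operator_note` never enter the string, two requests that differ only in auxiliary parameters produce the same summary.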
[0040] The main parameters for cleaning are limited to the damage replacement method, the blank removal threshold, and the retained region range. The main parameters for segmentation are limited to the start position, end position, consecutive segment number, and boundary overlap; the segmentation boundary uses a left-closed, right-open interval, and the boundary overlap extends toward the start side of the next segment and is truncated at the parent sample boundary when it would cross it. The main parameters for enhancement are limited to the enhancement type, action order, and amplitude value, and only one enhancement combination with a fixed action order is allowed in a single derived registration. The main parameters for relabeling are limited to the old label value, new label value, and revision reason code. The main parameters for sampling are limited to the extraction sequence number, the retention marker, and the removal reason code. When the sampling object is a single parent sample, only a derived record carrying the retention marker is generated, and the content payload remains unchanged. When the sampling object is a candidate sample set, the set explicitly given by the upstream task order is used in preference. If the upstream does not provide a set, the derived registration node queries the registered samples that share the same source identifier, the same processing type, and the same observation window, and builds the set according to the task order's constraints. A sampling decision record is first generated for each candidate sample, and a derived sample identifier is then registered for each retained sample. For samples that are not retained, only the decision record is kept.
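One possible reading of the segmentation boundary rule is sketched below. It assumes fixed-length consecutive segments and interprets the overlap as each segment's end being extended into the start of the following segment; the function name and the fixed segment length are illustrative and not taken from the text.

```python
def segment_bounds(parent_len, seg_len, overlap):
    """Return (segment_no, start, end) triples using left-closed, right-open bounds.

    Each segment is extended by `overlap` units into the start of the next
    segment and is clipped at the parent sample boundary.
    """
    if seg_len <= 0 or overlap < 0:
        raise ValueError("seg_len must be positive and overlap non-negative")
    bounds = []
    start, number = 0, 1
    while start < parent_len:
        core_end = min(start + seg_len, parent_len)    # end without overlap
        end = min(core_end + overlap, parent_len)      # clip at parent boundary
        bounds.append((number, start, end))
        start = core_end                               # next segment starts at the core end
        number += 1
    return bounds


bounds = segment_bounds(parent_len=100, seg_len=40, overlap=8)
```

With these inputs the last segment ends exactly at the parent boundary, since the overlap would otherwise cross it.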
[0041] The execution time is the moment when the derived registration node completes its derivation action and generates a registrable result. The time unit is milliseconds, and the clock caliber follows the master clock from the previous stage. Before the derived registration node executes, it retrieves the source identifier, collection batch, annotation version, business time, and evidence chain index corresponding to the parent sample from the registration or derived database. It also retrieves the version of the derivation rule, parameter template version, and identifier generation rule version currently locked and effective from the rule repository. The content payload of the sample to be derived is then subjected to denoising, boundary alignment, and null value padding. Denoising in image scenarios involves removing pure black edges, pure white edges, and corrupted blocks; in audio scenarios, it involves removing silent segments before and after the recording and corrupted frames; and in text scenarios, it involves removing control characters and duplicate record headers. Boundary alignment uses pixel coordinates in image scenarios, millisecond duration coordinates in audio scenarios, and character sequence coordinates in text scenarios. Null value padding is only allowed in non-primary decision fields of the processing parameters, and the padding order is fixed as follows: the value attached to the current task order, the most recently successfully executed value of the same processing type under the same source identifier within the last 24 hours, and the default value from the rule repository.
[0042] The order of derivation request adjudication is fixed as follows: existence of parent sample, status of parent sample, legality of processing type, integrity of processing parameters, and consistency of rule version; the source identifier must be consistent with the parent sample, and the collection batch inherits the collection batch to which the parent sample belongs by default. When sampling across batches, it is allowed to write a new collection batch, but the batch migration mark should be marked in the processing parameters.
[0043] The derived registration node decides whether to execute the current derivation within a rolling observation window. The observation window length can be set from 1 second to 60 seconds, with a default value of 10 seconds. Multiple parallel derivation branches may be generated for the same parent sample within the same observation window. Each branch is uniquely identified by "parent sample identifier + processing type + processing parameter summary". Consecutive derivations within the same branch are registered in strict serial order.
[0044] To suppress duplicate registrations, the derived registration node generates an idempotent key consisting of the parent sample identifier, processing type, processing parameter summary, and execution time segment. Within a 10-second deduplication window, if the first three items are consistent and the content results are consistent, the request is treated as a duplicate derivation; if the first three items are consistent but the content results are inconsistent, it is treated as a conflicting derivation. To decide whether the content results are consistent, the derivation result fingerprints are compared first. The derivation result fingerprint follows the rule version of the content fingerprint and is calculated over the payload that the derivation formally registers: cleaning, segmentation, and enhancement use the content payload; relabeling uses the combination of content payload and label payload; sampling uses the combination of parent sample identifier and retention marker. If the fingerprints are inconsistent, the corresponding main fields are then compared.
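The duplicate/conflict distinction can be sketched as follows. SHA-256 as the fingerprint function, the `Derivation` record layout, and the return labels are assumptions; the text does not name a hash algorithm, and the fallback comparison of main fields is omitted here.

```python
import hashlib
from dataclasses import dataclass

DEDUP_WINDOW_MS = 10_000  # 10-second deduplication window from the text


@dataclass(frozen=True)
class Derivation:
    parent_id: str
    processing_type: str
    param_summary: str
    executed_at_ms: int
    payload: bytes  # the payload that would be formally registered


def fingerprint(payload: bytes) -> str:
    """Hypothetical result fingerprint; the actual rule version is not specified."""
    return hashlib.sha256(payload).hexdigest()


def classify(new: Derivation, earlier: Derivation) -> str:
    """Return 'duplicate', 'conflict', or 'new' relative to an earlier registration."""
    same_branch = (new.parent_id, new.processing_type, new.param_summary) == (
        earlier.parent_id, earlier.processing_type, earlier.param_summary)
    in_window = abs(new.executed_at_ms - earlier.executed_at_ms) <= DEDUP_WINDOW_MS
    if not (same_branch and in_window):
        return "new"
    if fingerprint(new.payload) == fingerprint(earlier.payload):
        return "duplicate"
    return "conflict"
```

A request classified as `conflict` would correspond to error 2006 and go to manual review, while `duplicate` corresponds to 2005.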
[0045] The derived sample identifier is generated by the derived registration node according to the identifier generation rules of the locked version. The fixed participating fields are the parent sample identifier, processing type, processing parameter summary, execution time segment, and rule version number, arranged in a fixed order with the parent sample identifier first and the rule version number last. When an identifier conflict occurs, a conflict regeneration sequence number, counted from 1 within the same combination of parent sample identifier and processing type, is appended and the identifier is registered again.
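A minimal sketch of identifier generation with conflict regeneration follows. The hash function, the `D-` prefix, the field separator, the position of the regeneration sequence number, and the in-memory `registry` dictionary are all illustrative assumptions; only the set of participating fields and the "parent identifier first" rule come from the text.

```python
import hashlib


def derived_sample_id(parent_id, processing_type, param_summary,
                      time_segment, rule_version, regen_seq=0):
    """Build an identifier from the fixed fields, parent identifier first."""
    fields = [parent_id, processing_type, param_summary, time_segment]
    if regen_seq:                      # appended only after a conflict
        fields.append(f"r{regen_seq}")
    fields.append(rule_version)
    digest = hashlib.sha256("\x1f".join(fields).encode("utf-8")).hexdigest()
    return f"D-{digest[:24]}"


def register(registry, result_fingerprint, **fields):
    """Register a result, regenerating the identifier while it is taken by a different result."""
    seq = 0
    while True:
        candidate = derived_sample_id(regen_seq=seq, **fields)
        existing = registry.get(candidate)
        if existing is None or existing == result_fingerprint:
            registry[candidate] = result_fingerprint
            return candidate
        seq += 1  # regeneration sequence number counts up from 1
```

Registering the same result twice returns the same identifier, so retries stay idempotent.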
[0046] After derivation is completed, a derivation registration record is formed. The record content includes at least the derived sample identifier, parent sample identifier, processing type, processing parameters, execution time, rule version number, business time inherited value, evidence chain index, and derivation status. The records stored in the derivation library are written sequentially to the partition file and the retrieval entries in the index library. The derivation registration record adds new derived evidence records based on the parent sample evidence chain index. The parent sample evidence chain index is retained as the upstream index. The current derivation record forms a new current index and does not overwrite the existing index of the parent sample. The evidence chain records in each step are added in the order of event occurrence. The newly added record only adds to the current index and does not overwrite the existing index records formed in the previous steps.
[0047] Upstream tasks are submitted via message queues or the derivation interface, and downstream tasks retrieve derived registration records via query interfaces. Each derivation request must include at least the parent sample identifier, processing type, processing parameters, task order number, and upstream submission time. The task order number is unique under the same source identifier. The upstream submission time is recorded only and does not participate in idempotency determination. A successful submission returns the derived sample identifier and a derived status code; a failure returns an error code and suggested handling. 2001 indicates that the parent sample identifier does not exist, 2002 an illegal processing type, 2003 missing processing parameters, 2004 a frozen parent sample status, 2005 a duplicate derivation task, 2006 a conflicting derivation, 2007 that the derived library storage is unreachable, and 2008 a rule version mismatch. 2001, 2003, and 2007 allow automatic retry using the original idempotent key. 2002 and 2004 require correction and resubmission. 2005 does not allow retry. 2006 is transferred to manual review. When 2008 appears for the first time, the locally cached rule version is compared with the current version in the rule repository, and then the version declared in the task order is compared with the current version in the rule repository. Before review, the rule repository version is read once more. If the retry still fails, 2008 is returned. When 2008 appears 3 times consecutively, the node switches to the previous effective version and writes a version rollback record to the evidence chain.
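The handling policy for the error codes above can be summarized as a small dispatch function. The action names are hypothetical labels introduced for illustration; the code-to-action mapping follows the paragraph.

```python
AUTO_RETRY = {2001, 2003, 2007}        # retry with the original idempotent key
FIX_AND_RESUBMIT = {2002, 2004}        # caller must correct the request
ROLLBACK_AFTER = 3                     # consecutive 2008 occurrences


def next_action(code, consecutive_2008=1):
    """Map a derivation error code to the handling described in the text."""
    if code in AUTO_RETRY:
        return "retry_with_original_idempotent_key"
    if code in FIX_AND_RESUBMIT:
        return "correct_and_resubmit"
    if code == 2005:
        return "no_retry"
    if code == 2006:
        return "manual_review"
    if code == 2008:
        if consecutive_2008 >= ROLLBACK_AFTER:
            return "rollback_to_previous_rule_version"
        return "reread_rule_version_and_retry"
    raise ValueError(f"unknown error code: {code}")
```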
[0048] Regarding time and resource constraints, the maximum delay from receiving a single derived task to completing registration can be set to 500 milliseconds to 5 seconds, the concurrent derivation capacity can be set to 50 to 5000 tasks per second, and the default number of retries can be set to 2. If the derivation registration is not completed after exceeding the delay limit, it will be transferred to the delay queue. The order of supplementary registration is first in ascending order of the original business time, then in ascending order of the task order number within the parent sample identifier group, and then in ascending order of the sub-task sequence number within the task order.
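The supplementary registration order for tasks in the delay queue amounts to a multi-key sort. The sketch below assumes each queued task is a dictionary with the hypothetical keys shown, and it treats the "parent sample identifier group" as a sort key between business time and task order number, which is one interpretation of the text.

```python
def supplementary_order(tasks):
    """Order delayed tasks: business time, then parent group, task order, subtask."""
    return sorted(
        tasks,
        key=lambda t: (t["business_time_ms"], t["parent_id"],
                       t["task_order_no"], t["subtask_seq"]),
    )


queue = [
    {"business_time_ms": 2000, "parent_id": "S-1", "task_order_no": 7, "subtask_seq": 1},
    {"business_time_ms": 1000, "parent_id": "S-2", "task_order_no": 9, "subtask_seq": 2},
    {"business_time_ms": 1000, "parent_id": "S-2", "task_order_no": 9, "subtask_seq": 1},
    {"business_time_ms": 1000, "parent_id": "S-2", "task_order_no": 8, "subtask_seq": 5},
]
ordered = supplementary_order(queue)
```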
[0049] If the derivation link is interrupted and the buffer reaches its capacity limit, the most recent complete task group is retained first. A complete task group is one in which the task order end marker has been reached and all subtasks in the task order have formed registration results or adjudication records. Manual review may be performed only by a review terminal with review authority. The review conclusion is limited to duplicate derivation confirmation, conflicting derivation splitting, parent sample correction, and registration rejection. The original derivation registration record must not be overwritten after review; only an additional adjudication record that references the upstream index may be generated. The additional adjudication record shall include at least the review subject, review time, review conclusion, index of the adjudicated record, summary of the basis for the conclusion, and execution status.
[0050] Regarding security and compliance boundaries, the derived registration node only accepts derived requests triggered by authorized task orders. Relabeling actions involving personal information are only allowed to be performed on desensitized copies. When new label payloads are written to the derived library, only the desensitized results are retained. The original text of old labels and the original text visible to manual review are only stored in the restricted isolation area and the original text must not be written back to the derived library.
[0051] On-site performance shall be inspected according to the following criteria. The test set shall contain no fewer than 10,000 consecutive derived tasks covering the five processing types (cleaning, segmentation, enhancement, relabeling, and sampling) and four types of links (parallel derivation, conflicting derivation, duplicate derivation, and manual review), and the sample size for each processing type shall be no less than 10% of the total sample size. The derived sample identifier duplication rate is the ratio of the number of items in which different derivation results were assigned the same derived sample identifier to the total number of derivation registrations. The omission rate is the ratio of the number of items that met the derivation conditions but did not form a formal derivation registration record to the total number of items that should have been registered. The duplication rate shall not exceed 0.001%, the omission rate shall not exceed 0.01%, and the proportion of items meeting the upper limit of the delay shall not be less than 99.5%.
[0052] Preferably, in the steel plate surface defect detection production line, a continuous derivation chain of cleaning, segmentation, and enhancement is performed on the registered original samples. The image segmentation boundary overlap can be set to 32 pixels, the enhancement amplitude range can be set to brightness offset ±8 gray levels and rotation angle ±3 degrees, the scrolling observation window is 10 seconds, and the upper limit of single-piece derivation registration delay is 800 milliseconds. In an 8-hour shift that continuously processes 36,000 images to form 92,400 derivation tasks, there are no collisions in the derivation sample identification, 87 duplicate derivation pieces are correctly merged, and 2 conflicting derivation pieces are transferred to the review queue. The derivation registration success rate reaches 99.96%.
[0053] Alternatively, in a voice quality inspection scenario, the segmentation can be set to segment by silent interval, the enhancement can be set to volume range adjustment, and the re-annotation can be set to session intent revision, while the remaining parent sample identifier inheritance, processing parameter locking, evidence chain traceability, idempotent order control, interface return constraints, and downstream call methods remain unchanged.
[0054] S3. During training batch assembly, the derived sample identifier, batch identifier, round identifier, label version, and sampling rule are recorded for each derived sample entering the training batch, and a consumption voucher is generated. The specific implementation is as follows: During training batch assembly, the batch assembly node, deployed upstream of the training execution node and connected to the derived library, rule repository, and consumption library, records the derived sample identifier, batch identifier, round identifier, label version, and sampling rule for the derived samples entering the training batch, and generates consumption vouchers accordingly. Entering the training batch means that the derived sample has passed the assembly judgment of the batch assembly node and has been written into the formal batch set of the current batch, and that the current batch has met the batch sealing conditions. Samples that enter only the candidate set and not the formal batch set cannot generate consumption vouchers.
[0055] The training batch assembly period starts at the moment when the batch assembly node receives the training task order for this round and ends at the moment when the training batch is sealed. Upstream, the derived registration record is retrieved from the derived database by the derived sample identifier through the query interface, and downstream, the training execution node reads the sealed sample set in the order of consumption vouchers.
[0056] The derived sample identifier uses the formally registered derived sample identifier from the previous step. It is read item by item by the batch assembly node within the current batch assembly cycle and compared with the derived status, source identifier, collection batch, and evidence chain index item by item. Only when the derived status is formally valid and not expired can it enter the candidate set. When referencing the original sample identifier, derived sample identifier, consumption voucher, model version identifier, adjudication result identifier, and disposal result identifier generated in the previous step in subsequent steps, the corresponding identifiers that have been formally registered in the previous step are directly used, and their existing meanings are not changed due to the generation of fields with the same name in subsequent steps. The identifiers added in each step are only used to identify the newly formed formal registration results in this step.
[0057] The batch identifier is formed by recombining the task order number, training execution node number, the current batch sealing start time segment, rule version number, and batch sequence number in a fixed order. When the same task order forms multiple batches under the same training execution node, the batch sequence number starts from 1 and increments under this combination.
[0058] The round identifier is the round number of the current model in the continuous training chain. When the batch assembly node is assembled, it reads the training plan record, the previous formal round record, and the current task sheet declaration value for consistency verification, with the training plan record as the primary criterion. If the previous formal round record is missing but the training plan record is consistent with the task sheet declaration value, assembly is allowed to continue and a missing note is written. When the task sheet has a retraining mark, the round identifier can reuse the original round. When there is no retraining mark, the round identifier must be incremented.
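The round consistency check can be sketched as below. Representing rounds as integers, reading "incremented" as "strictly greater than the previous formal round", and the wording of the notes are assumptions made for illustration.

```python
def verify_round(plan_round, declared_round, previous_round, retraining):
    """Return (allowed, notes); the training plan record is the primary criterion."""
    notes = []
    if declared_round != plan_round:
        return False, ["declared round disagrees with training plan record"]
    if previous_round is None:
        # Plan and declaration agree, so assembly may continue with a note.
        notes.append("previous formal round record missing")
        return True, notes
    if retraining:
        allowed = plan_round >= previous_round   # the original round may be reused
    else:
        allowed = plan_round > previous_round    # the round must be incremented
    if not allowed:
        notes.append("round identifier not incremented")
    return allowed, notes
```

A rejected request here would correspond to error 3002 (round identifier inconsistent).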
[0059] The label version is the label rule version under which the derived sample can currently be used for training. During assembly, the batch assembly node reads it from the derived registration record and compares it item by item with the label version range required by the current training task order. The label version range is a closed interval. In the comparison, the major version number is compared first, followed by the minor version number. A version that falls within the closed interval declared in the task order is judged to be a match.
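A closed-interval comparison that checks the major number before the minor number can be written with tuple ordering. The `major.minor` string format is an assumption; the text does not define how label versions are serialized.

```python
def parse_version(text):
    """Split a hypothetical 'major.minor' label version into an integer pair."""
    major, minor = text.split(".")
    return int(major), int(minor)


def label_version_matches(version, low, high):
    """True if low <= version <= high, comparing major first and then minor."""
    return parse_version(low) <= parse_version(version) <= parse_version(high)
```

Comparing integer pairs rather than raw strings matters for versions such as `2.10`, which sorts before `2.3` as text but after it numerically.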
[0060] The sampling rules are given by the sampling rule version locked in the rule repository and include at least the selection order, the number of retained items, the category balance restriction, the source dispersion restriction, and the duplication suppression restriction. The sampling rule summary is formed by recombining the main rule fields involved in the current assembly decision according to a fixed field order, a unified unit convention, and a unified character encoding.
[0061] Before executing the batch assembly node, the candidate set is first read from the derived library according to the task order's constraints. The batch rule version, sampling rule version, consumption voucher generation rule version, and label version constraint template locked in the current observation window are read from the rule warehouse. The derived samples in the candidate set are aligned in ascending order by business time, source identifier, and derived sample identifier. Blank appendix fields are filled in according to fixed completion rules. Completion is only allowed in non-primary decision fields. The completion order is fixed as the value attached to the current task order, the range of the same label version under the same source identifier, the most recent successful assembly value of the same sampling rule version in the last 24 hours, and the default value in the rule warehouse. Duplicates in the candidate set are deduplicated by combining derived sample identifier, label version, and round identifier. Only when the three are consistent are they judged as candidate duplicates. Invalid samples, conflicting samples awaiting adjudication, and samples with mismatched label versions are directly removed and the reason for removal is recorded.
[0062] The batch assembly node decides whether to include candidate derived samples in the current training batch within a rolling observation window. The observation window length can be set from 1 second to 60 seconds, with a default value of 10 seconds. The decision order is fixed as derived sample status, round identifier consistency, label version matching, sampling rule consistency, and batch capacity boundary. If any item fails, the subsequent items are not checked.
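The fixed decision order with early exit can be sketched as an ordered list of named checks. The check names and the predicate interface are hypothetical; only the order and the short-circuit behavior come from the text.

```python
CHECK_ORDER = [
    "derived_status",
    "round_consistency",
    "label_version",
    "sampling_rule",
    "batch_capacity",
]


def adjudicate(sample, predicates):
    """Run checks in the fixed order; stop at the first failure.

    `predicates` maps each check name to a function sample -> bool.
    Returns (accepted, name_of_failed_check_or_None).
    """
    for name in CHECK_ORDER:
        if not predicates[name](sample):
            return False, name
    return True, None
```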
[0063] When the sampling rules are executed, the candidate set is first grouped according to the source dispersion limit, and then grouped according to the category value under the tag version within each source group. The number of items retained in each group is allocated according to the batch capacity limit declared in the task order and the group quota ratio in the rule warehouse. If the sample of a source group is insufficient, it is not allowed to exceed the source dispersion limit to make up the difference. If the sample of a category group is insufficient, it is allowed to reclaim the unused quota according to the degradation strategy in the rule warehouse and allocate it to other category groups under the same source. Items are selected one by one in the selection order in the rule warehouse. When the maximum number of items retained in the group is reached, the selection of the current group is stopped. When the batch capacity limit is reached, the assembly of the current batch is stopped.
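The quota allocation inside a single source group might look like the following sketch. Integer category weights, floor division for the initial quotas, and a first-come redistribution of unused quota are assumptions standing in for the group quota ratio and degradation strategy, which the text leaves to the rule warehouse.

```python
def allocate_within_source(source_quota, weights, available):
    """Allocate retained-item counts per category inside one source group.

    source_quota: items this source may contribute (never exceeded)
    weights:      category -> integer quota weight, in selection order
    available:    category -> number of candidate samples on hand
    """
    total_weight = sum(weights.values())
    quotas = {c: source_quota * w // total_weight for c, w in weights.items()}
    taken = {c: min(quotas[c], available.get(c, 0)) for c in weights}
    spare = source_quota - sum(taken.values())       # unused quota to reclaim
    for c in weights:                                # hypothetical degradation order
        if spare <= 0:
            break
        extra = min(spare, available.get(c, 0) - taken[c])
        taken[c] += extra
        spare -= extra
    return taken


result = allocate_within_source(
    10, {"scratch": 5, "pit": 3, "ok": 2}, {"scratch": 2, "pit": 10, "ok": 10})
```

The total never exceeds `source_quota`, so a shortfall in one source is not compensated by drawing more from another source, in line with the source dispersion limit.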
[0064] The batch capacity boundary is the maximum assembly quantity of a single batch specified in the current training task order. It is measured primarily by the number of samples; in voice scenarios it may instead be measured by total duration. If the task order does not declare the measurement basis, the number of samples is used by default.
[0065] To suppress duplicate assembly, the batch assembly node generates an assembly idempotent key consisting of the derived sample identifier, batch identifier, round identifier, label version, and sampling rule summary. Within a 10-second deduplication window, if the first four items are consistent and the sampling rule summary is also consistent, the request is treated as a duplicate assembly: only the first assembly record is retained, and a duplicate trigger record is appended to the evidence chain. If the first four items are consistent but the sampling rule summary is inconsistent, the request is treated as a conflicting assembly, and the current batch identifier is frozen. Samples already selected remain in a candidate frozen state and do not generate formal consumption vouchers. During the freezing period, the current batch cannot be retrieved or invoked by downstream processes, training execution nodes cannot prefetch it, and no new candidate samples may enter it.
[0066] Consumption vouchers are generated by the batch assembly node according to the generation rules of the locked version. The fixed participating fields are the derived sample identifier, batch identifier, round identifier, label version, sampling rule summary, and consumption generation time segment, arranged in a fixed order with the derived sample identifier first and the consumption generation time segment last. The consumption generation time segment is the second and millisecond values of the moment at which the current derived sample is written into the formal batch set. The same derived sample may enter different batches in the same round, but an independent consumption voucher must be generated for each batch. The same derived sample may not generate more than one formal consumption voucher for the same combination of round and batch.
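The uniqueness constraint on consumption vouchers can be sketched with a small in-memory registry. The class name, the `C-` prefix, the separator, and SHA-256 are illustrative assumptions; the participating fields and the one-voucher-per-sample-round-batch rule follow the paragraph.

```python
import hashlib


class VoucherRegistry:
    """Issues at most one formal voucher per (derived sample, round, batch)."""

    def __init__(self):
        self._issued = {}

    def issue(self, derived_id, batch_id, round_id, label_version,
              rule_summary, time_segment):
        key = (derived_id, round_id, batch_id)
        if key in self._issued:
            raise ValueError("voucher already issued for this sample, round, and batch")
        # Derived sample identifier first, generation time segment last.
        raw = "\x1f".join([derived_id, batch_id, round_id, label_version,
                           rule_summary, time_segment])
        voucher = "C-" + hashlib.sha256(raw.encode("utf-8")).hexdigest()[:24]
        self._issued[key] = voucher
        return voucher
```

The same derived sample can still receive separate vouchers in two different batches of one round, because the batch identifier is part of the uniqueness key.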
[0067] After the consumption voucher is generated, the batch assembly node generates a consumption registration record. The record includes at least the consumption voucher, derived sample identifier, batch identifier, round identifier, label version, sampling rule, business time inherited value, execution time, rule version number, and consumption status. The consumption status is limited to four types: formal consumption, removed without consumption, chain to be supplemented, and conflict awaiting adjudication. The records are stored in the consumption library and written sequentially to the partition file and to the retrieval entries in the index library. The consumption registration record adds a new consumption evidence record on top of the derived sample evidence chain index. The parent index is retained as the upstream index, and the current consumption index does not overwrite the existing index of the derived sample. The consumption evidence record includes at least the execution node number, execution time, rule version number, previous index, current index, conclusion summary, and consumption status.
[0068] Upstream, training task orders and candidate set constraints are submitted via message queue or batch assembly interface. Downstream, consumption registration records are read by batch identifier via query interface. Batch assembly requests must include at least the task order number, training execution node number, round identifier declaration value, label version range, batch capacity limit, batch capacity measurement caliber, sampling rule version prompt value, candidate set constraints, task order end marker, and retraining marker. Upon success, the batch identifier, consumption certificate set summary, and assembly status code are returned. Upon failure, an error code and handling suggestions are returned. Among them, 3001 indicates that the derived sample does not exist, 3002 indicates that the round identifier is inconsistent, 3003 indicates that the label version does not match, 3004 indicates that the sampling rule is illegal, 3005 indicates duplicate assembly, 3006 indicates conflicting assembly, 3007 indicates that the consumption library storage is unreachable, and 3008 indicates that the rule version does not match. 3001 and 3007 allow automatic retry using the original assembly idempotent key, but the candidate set constraints, round identifier, label version range, and sampling rule version prompt value must not have changed.
[0069] When a 3006 error occurs, after manual review and confirmation, the batch can continue to be sealed under the original batch identifier, or the original batch identifier can be invalidated and a new batch can be rebuilt. Manual review can only be performed by audit terminals with review authority. The review conclusion is limited to confirmation of duplicate assembly, splitting of conflicting assembly, round correction, and rejection of batch sealing. The review cannot overwrite the original consumption registration record, but can only generate an additional ruling record and use the upstream index. The additional ruling record must include at least the review subject, review time, review conclusion, index of the ruled record, summary of the conclusion basis, and execution status.
[0070] When 3008 occurs for the first time, the local cached rule version is compared with the current version of the rule repository, and then the task order declaration version is compared with the current version of the rule repository. Before review, the rule repository version is read again once. If the retry still fails, 3008 is returned. If it occurs 3 times in a row, the previous effective version is switched and the version rollback evidence chain is written.
[0071] Regarding time and resource constraints, the maximum delay from a single derived sample entering the batch assembly node to forming a formal consumption voucher can be set to 200 milliseconds to 3 seconds, the maximum batch assembly sealing time can be set to 5 seconds to 120 seconds, the concurrent assembly capacity can be set to 100 to 10,000 pieces per second, and the default number of retries can be set to 2. If the consumption registration is not completed after exceeding the delay limit, it will be transferred to the delay queue. The order of supplementary registration is first in ascending order by round identifier, then in ascending order by business time within the batch identifier group, and then in ascending order by derived sample identifier.
[0072] If the batch assembly link is interrupted, records that have completed the assembly judgment but have not yet entered the consumption library are temporarily stored in the local buffer. When the buffer reaches its capacity limit, the most recent complete batch in the current round is retained first. A complete batch is one in which either the capacity limit has been reached, or the task order end marker has been reached and the current candidate set has no more formally valid samples to select, and in which every derived sample has formed either a consumption voucher or a removal decision record.
[0073] Regarding security and compliance boundaries, batch assembly nodes only receive assembly requests triggered by authorized training task orders. After desensitization, the label version field, category field, and boundary field required for assembly judgment are still retained. The original label text is only stored in a restricted isolation area, not written to the consumer library, and not diffused to the training execution node.
[0074] On-site performance is tested according to the following criteria. The test set contains no fewer than 10,000 consecutive derived samples covering normal assembly, duplicate assembly, conflicting assembly, rule switching, and short-term link interruption scenarios. The consumption voucher duplication rate is the ratio of the number of items in which different consumption facts were assigned the same consumption voucher to the total number of consumption registrations. The omission rate is the ratio of the number of items that met the assembly conditions but did not form a formal consumption registration record to the total number of items that should have been registered. The proportion meeting the delay limit is the ratio of the number of items in the test set that completed formal consumption registration within the delay limit to the total number of items that should have been registered. The duplication rate shall not exceed 0.001%, the omission rate shall not exceed 0.01%, and the proportion meeting the delay limit shall not be less than 99.5%.
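The three acceptance thresholds can be checked with exact integer arithmetic, which avoids floating-point rounding at the boundary values. The function and argument names are hypothetical; the thresholds are those stated above.

```python
def passes_acceptance(total_registrations, colliding, should_register,
                      omitted, on_time):
    """Check the duplication, omission, and delay criteria using integer math."""
    dup_ok = colliding * 100_000 <= total_registrations     # rate <= 0.001 %
    omit_ok = omitted * 10_000 <= should_register           # rate <= 0.01 %
    delay_ok = on_time * 1_000 >= should_register * 995     # share >= 99.5 %
    return dup_ok and omit_ok and delay_ok
```

For a test set of 100,000 items, this allows at most 1 voucher collision, at most 10 omissions, and requires at least 99,500 registrations within the delay limit.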
[0075] Preferably, in the steel plate surface defect detection production line, the training execution node is configured with 8 training servers, the maximum capacity of a single batch can be set to 512 pieces, the scrolling observation window is set to 10 seconds, and the maximum delay for generating a single consumption voucher is set to 600 milliseconds. In an 8-hour shift of continuously assembling 92,400 derived samples to form 180 training batches, there are no collisions in consumption vouchers, 73 duplicate assemblies are correctly merged, and 2 conflicting assemblies are frozen with batch identification and transferred to review, with a consumption registration success rate of 99.97%.
[0076] Alternatively, in voice quality inspection scenarios, the maximum batch capacity can be determined based on the total recording duration, and the sampling rule can be set to speaker-dispersed retention, while the remaining derived sample identifier inheritance, batch identifier generation, round consistency verification, consumption voucher generation, evidence chain traceability, idempotent order control, interface return constraints, and downstream call methods remain unchanged.
[0077] S4. When the model version is frozen, collect the consumed derived sample identifiers based on the consumption vouchers and establish an impact index from the model version to the consumption vouchers. The specific implementation is as follows: When a model version is frozen, a version freeze node, deployed behind the training execution node and connected to the consumption database, model registration database, rule repository, and impact index database, collects the identifiers of the consumed derived samples based on the consumption vouchers and establishes an impact index from the model version to the consumption vouchers. The model version freeze period starts when the training execution node completes the parameter solidification of the current round and sends a freeze completion signal to the version freeze node, and ends when the model version registration is completed and the impact index is written to the database. Parameter solidification means that the training execution node has written the current round's model file to the local read-only version area and generated a model file summary, and that the model file summary is consistent with the output summary declared in the training task sheet for this round. The freeze completion signal must contain at least the model file summary, round identifier, and training execution node number.
[0078] The model version is a version identifier representing the result of a formal training session. When the model is frozen, it is formed by reading the training task order number, training execution node number, round identifier, freeze completion time segment, rule version number, and model version internal sequence number in a fixed order from the version freeze node. When a conflict occurs in the same training task order, the same training execution node, or the same round, and a new model version is rebuilt after manual review, the model version internal sequence number starts from 1 and increments. Repeated freezes and retrying cannot occupy the new sequence number, and the old sequence number after being invalidated cannot be reused.
[0079] The consumption vouchers are the same as those officially registered in the previous stage. During the current freeze period, each voucher is read by the version freeze node and compared with the consumption status, batch identifier, round identifier, derived sample identifier, tag version, and evidence chain index. Only when the consumption status is officially consumed and has not been revoked can it enter the collection set.
[0080] A consumed derived sample identifier is one for which a formal consumption voucher has been formed and which has actually entered the formal batch set of the current round; the formal batch set is determined by the criteria of the preceding "entering the training batch" step. When the same derived sample identifier corresponds to multiple consumption vouchers under the same model version, this is permitted only as formal consumption facts of that derived sample entering multiple different batches in the same round; repeated consumption facts within the same batch are not permitted.
[0081] The impact index consists of a set of index detail records grouped by model version identifier. Each index detail record corresponds to one consumption voucher. Multiple index detail records can correspond to the same model version. At the time of freezing, the version freezing node generates the index record one by one with the model version identifier as the primary index key, the consumption voucher as the secondary index key, and the consumed derived sample identifier, round identifier, batch identifier, tag version, and consumption status as related fields.
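A minimal in-memory sketch of such an index follows, assuming Python dictionaries in place of the impact index database. The class names, method names, and field names are hypothetical; only the key structure (model version identifier as primary key, consumption voucher as secondary key, remaining items as related fields) is taken from the paragraph above.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexDetail:
    model_version_id: str     # primary index key
    voucher: str              # secondary index key
    derived_sample_id: str    # related fields follow
    round_id: str
    batch_id: str
    tag_version: str
    consumption_status: str


class ImpactIndex:
    def __init__(self):
        # model_version_id -> voucher -> IndexDetail
        self._by_version = defaultdict(dict)

    def add(self, rec: IndexDetail) -> None:
        bucket = self._by_version[rec.model_version_id]
        if rec.voucher in bucket:
            # One detail record per voucher within a model version.
            raise ValueError("duplicate index detail for this voucher")
        bucket[rec.voucher] = rec

    def details(self, model_version_id: str) -> list:
        return list(self._by_version.get(model_version_id, {}).values())

    def versions_for_samples(self, sample_ids) -> list:
        """Reverse lookup: model versions that consumed any given sample."""
        wanted = set(sample_ids)
        return sorted(
            version
            for version, bucket in self._by_version.items()
            if any(d.derived_sample_id in wanted for d in bucket.values())
        )
```

The reverse lookup scans every bucket; a real store would presumably keep a secondary index on the derived sample identifier instead.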
[0082] Before executing the version freeze node, the formal consumption registration records are first read from the consumption database by round identifier and batch identifier. The training task sheet, training execution node number, freeze signal time, previous model version record, and version evidence chain index corresponding to the current freeze task are read from the model registration database. Then, the generation rule version of the model version currently locked in the observation window, the generation rule version of the influence index, and the consumption status constraint template are read from the rule repository. The consumption registration records are aligned in ascending order by round identifier, then by batch identifier, and finally by consumption voucher. If the round identifier, batch identifier, and consumption voucher are all the same, they are arranged in ascending order by execution time. If the execution times are still the same, they are arranged in ascending order by the current index. Blank appendix fields are filled according to fixed completion rules. Blank appendix fields are limited to freeze task appendix, explanatory source fields, and non-primary decision statistics fields. Model version identifier, consumption voucher, round identifier, batch identifier, rule version number, and consumption status must not be filled. The completion order is fixed as: the value attached to the current freeze task, the most recent successfully frozen value in the same round under the same training execution node, and the default value from the rule repository.
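The alignment order and the fixed completion order described above can be sketched as follows. This is an assumption-laden illustration: records are modelled as plain dictionaries, the key names are hypothetical, and a blank value is assumed to be `None` or an empty string.

```python
# Fields that must never be filled in, per the paragraph above.
PROTECTED_FIELDS = {
    "model_version_id", "voucher", "round_id",
    "batch_id", "rule_version", "consumption_status",
}


def align_registrations(records):
    """Sort by round, batch, voucher, then execution time, then current index."""
    return sorted(
        records,
        key=lambda r: (
            r["round_id"], r["batch_id"], r["voucher"],
            r["execution_time"], r["current_index"],
        ),
    )


def fill_blank(field, task_value=None, last_frozen_value=None, default_value=None):
    """Return the first non-blank candidate in the fixed completion order."""
    if field in PROTECTED_FIELDS:
        raise ValueError(f"{field} must not be filled")
    # Order: current freeze task, most recent successful freeze, repository default.
    for candidate in (task_value, last_frozen_value, default_value):
        if candidate not in (None, ""):
            return candidate
    return None
```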
[0083] To suppress duplicate freezes, the version freeze node generates a freeze idempotent key. The freeze idempotent key consists of the training task order number, training execution node number, round identifier, summary of the formal consumption record set, and rule version number. The summary of the formal consumption record set is formed by recombining the consumption vouchers, batch identifiers, round identifiers, derived sample identifiers, and label versions of all formal consumption records under the current freeze task in a fixed field order. Before recombining, the records are sorted in ascending order by round identifier, then by batch identifier, and finally by consumption voucher. Unconsumed records and conflict pending resolution records are excluded from the summary of the formal consumption record set. Within a 10-second deduplication window, if the first four items are consistent and the rule version number is consistent, it is determined to be a duplicate freeze. Only the first freeze record is retained, and a duplicate trigger record is added to the evidence chain. If the first four items are consistent but the rule version number is inconsistent, it is determined to be a conflict freeze, and the current model version identifier is frozen. During the freeze, only the review terminal and the version freeze node are allowed to query internally. It is not allowed to add new impact index records, publish to the model registration library, or provide external retrieval to the impact index library.
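One way to picture the freeze idempotent key and the duplicate/conflict decision is the sketch below. The use of SHA-256 for the record set summary, the `|` field separator, the status value `"formal"`, and all names are assumptions for illustration; the source specifies only the field order, the sort order, the exclusions, and the 10-second window.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class FreezeKey:
    task_order_no: str
    node_no: str
    round_id: str
    record_summary: str
    rule_version: str


def summarize_formal_records(records) -> str:
    """Digest of formal consumption records in a fixed field and sort order."""
    # Unconsumed and conflict-pending records are excluded from the summary.
    formal = [r for r in records if r["status"] == "formal"]
    formal.sort(key=lambda r: (r["round_id"], r["batch_id"], r["voucher"]))
    digest = hashlib.sha256()  # hash choice is an assumption
    for r in formal:
        fields = (r["voucher"], r["batch_id"], r["round_id"],
                  r["derived_sample_id"], r["label_version"])
        digest.update("|".join(fields).encode("utf-8") + b"\n")
    return digest.hexdigest()


def classify_freeze(first: FreezeKey, first_time: float,
                    new: FreezeKey, new_time: float,
                    window_s: float = 10.0) -> str:
    """Return 'duplicate', 'conflict', or 'new' for a second freeze request."""
    if new_time - first_time > window_s:
        return "new"
    first_four = lambda k: (k.task_order_no, k.node_no, k.round_id, k.record_summary)
    if first_four(first) != first_four(new):
        return "new"
    return "duplicate" if first.rule_version == new.rule_version else "conflict"
```

Because the records are sorted before hashing, the summary does not depend on the order in which records were read from the consumption database.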
[0084] The version freeze node determines whether to generate a model version and impact index within a scrolling observation window. The observation window length can be set from 1 second to 60 seconds, with a default value of 10 seconds. The decision order is fixed as follows: validity of freeze completion signal, consistency of round identifier, validity of consumption status, consistency of rule version, and uniqueness of model version. If the previous item is not valid, the subsequent item will not be checked.
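The fixed decision order with early exit can be expressed as a short-circuit loop. In this sketch the check names and the callable-per-check interface are hypothetical; only the order of the five checks and the rule that later items are skipped after a failure come from the paragraph above.

```python
# Fixed decision order for the version freeze node.
CHECK_ORDER = [
    "freeze_signal_valid",
    "round_id_consistent",
    "consumption_status_valid",
    "rule_version_consistent",
    "model_version_unique",
]


def run_checks(checks):
    """checks maps each name in CHECK_ORDER to a zero-argument callable.

    Returns (True, None) if all pass, otherwise (False, first_failed_name).
    Checks after the first failure are never evaluated.
    """
    for name in CHECK_ORDER:
        if not checks[name]():
            return False, name
    return True, None
```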
[0085] When the freeze completion signal is valid, first check whether the round identifier returned by the training execution node is consistent with the round identifier of the formal consumption record in the consumer library. Then check whether the batch set declared in the freeze task is covered by all consumption registration records. Coverage adopts the set equality caliber. When comparing, first remove duplicates, then sort in ascending order by batch identifier. If the batch set declared is consistent with the batch set in the formal consumption record in terms of both the number of elements and the element values, it is determined that the coverage is successful. If there are uncovered batches in the consumption record, stop the current freeze and write the missing batch adjudication record.
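The set-equality coverage test can be sketched in a few lines. The function name and return shape are hypothetical; the sketch also reports batches that appear in the consumption records but were not declared, which the source does not name separately but which equally break set equality.

```python
def check_batch_coverage(declared, registered):
    """Compare declared and registered batch identifiers by set equality.

    Returns (covered, missing, undeclared), where `missing` are declared
    batches without formal consumption records and `undeclared` are
    registered batches absent from the declaration.
    """
    declared_set = sorted(set(declared))      # remove duplicates, then sort
    registered_set = sorted(set(registered))
    covered = declared_set == registered_set  # same count and same values
    missing = sorted(set(declared_set) - set(registered_set))
    undeclared = sorted(set(registered_set) - set(declared_set))
    return covered, missing, undeclared
```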
[0086] After the model version identifier is formed, the version freeze node generates a version registration record and an impact index registration record. The version registration record includes at least the model version identifier, training task order number, training execution node number, round identifier, freeze completion time, rule version number, and version status. The version status is limited to four types: officially released, frozen pending adjudication, pending chain replenishment, and revoked. The officially released status makes the version visible to downstream searches and subsequent impact adjudication. Versions in the frozen pending adjudication and pending chain replenishment statuses are not released to downstream users. The revoked status retains the data for search, but the version cannot be used as a valid model version in subsequent adjudication. The impact index registration record includes at least the model version identifier, consumption voucher, consumed derived sample identifier, batch identifier, round identifier, tag version, consumption status, execution time, rule version number, and index status. The index status is limited to four types: formal index, chain to be supplemented, conflict to be adjudicated, and revoked. The records are stored in the model registration database and the impact index database, and are written sequentially to the partition file and to the retrieval entries of the index database. The impact index registration record is appended to the consumption evidence chain index to generate a new impact evidence record. The parent index is retained as the upstream index, and the current impact index record forms a new current index; it does not overwrite the existing consumption index. The impact evidence record includes at least the model version identifier, consumption voucher, execution node number, execution time, rule version number, previous index, current index, conclusion summary, consumption status, and index status.
[0087] Upstream submits freeze tasks via a message queue or the model freeze interface, while downstream reads the impact index registration records by model version identifier via a query interface. The freeze request must include at least the training task order number, training execution node number, round identifier declaration value, batch set declaration value, freeze completion signal, rule version prompt value, model version release marker, model file summary, and retraining / reconstruction marker. On success, the interface returns the model version identifier, impact index set summary, and freeze status code; on failure, it returns an error code and handling suggestions. Specifically, 4001 indicates that no formal consumption record was found within the current freeze task declaration scope, or that no formal consumption record was found in any batch within the declared batch set; 4002 indicates inconsistent round identifiers; 4003 indicates an incomplete batch set; 4004 indicates an illegal consumption status; 4005 indicates a duplicate freeze; 4006 indicates a conflicting freeze; 4007 indicates that the impact index database storage is unreachable; and 4008 indicates a rule version mismatch. For 4001 and 4007, automatic retry using the original freeze idempotent key is allowed, provided that the round identifier declaration value, batch set declaration value, and rule version prompt value have not changed. When 4006 occurs, after manual review and confirmation, publication can continue under the original model version identifier, or the original model version identifier can be invalidated and a new model version rebuilt.
When invalidating the original model version identifier and rebuilding a new model version, the aggregation should be re-executed based on the original frozen task declaration value and the original formal consumption record set, and the original impact index draft should not be directly reused; manual review is only allowed to be performed by the review terminal with review authority, and the review conclusion is limited to duplicate freeze confirmation, conflict freeze splitting, round correction, and rejection of publication. After review, the original version registration record should not be overwritten, only an additional adjudication record can be generated and the upstream index can be used. The additional adjudication record should at least include the review subject, review time, review conclusion, index of the adjudicated record, summary of the conclusion basis, and execution status.
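The freeze return codes of paragraph [0087] and the automatic retry condition lend themselves to an enumeration. The sketch below is illustrative: the member names and the tuple used to carry the declared values are hypothetical, and 4008 is read as the rule version mismatch code.

```python
from enum import IntEnum


class FreezeError(IntEnum):
    NO_FORMAL_CONSUMPTION_RECORD = 4001
    ROUND_ID_INCONSISTENT = 4002
    BATCH_SET_INCOMPLETE = 4003
    ILLEGAL_CONSUMPTION_STATUS = 4004
    DUPLICATE_FREEZE = 4005
    CONFLICT_FREEZE = 4006
    INDEX_STORE_UNREACHABLE = 4007
    RULE_VERSION_MISMATCH = 4008


# Only these codes permit automatic retry with the original idempotent key.
AUTO_RETRY_CODES = {
    FreezeError.NO_FORMAL_CONSUMPTION_RECORD,
    FreezeError.INDEX_STORE_UNREACHABLE,
}


def may_auto_retry(code, original_decl, current_decl) -> bool:
    """Declarations are (round_id_decl, batch_set_decl, rule_version_hint).

    Retry is allowed only if the code permits it and nothing declared changed.
    """
    return FreezeError(code) in AUTO_RETRY_CODES and original_decl == current_decl
```

Code 4006 is deliberately absent from the retry set, since it is resolved only through manual review at a terminal with review authority.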
[0088] When 4008 occurs for the first time, the local cached rule version is compared with the current version of the rule repository. Then, the frozen task declaration version is compared with the current version of the rule repository. Before review, the rule repository version is read again once. If the retry still fails, 4008 is returned. If it occurs 3 times in a row, the previous effective version is switched and the version rollback evidence chain is written.
[0089] Regarding time and resource constraints, the maximum delay from a single consumption voucher entering the version freeze node to forming a formal impact index record can be set to 200 milliseconds to 3 seconds, the maximum duration of a single model version freeze can be set to 5 seconds to 180 seconds, the concurrent freeze capacity can be set to 50 to 5000 items per second, and the default number of retries can be set to 2. If the impact index registration is not completed after exceeding the delay limit, it will be transferred to the delay queue. The order of supplementary registration is first in ascending order by round identifier, then in ascending order by batch identifier within the model version identifier group, and then in ascending order by consumption voucher.
[0090] If the version freeze link is interrupted, records that have been collected but not yet included in the affected index database are temporarily stored in the local buffer. When the buffer reaches its capacity limit, the most recent complete model version in the current round is retained first. A complete model version is one in which the freeze completion signal has arrived, the corresponding batch set has been completely covered by the formal consumption records, and all consumption vouchers in the collection set have formed an affected index registration record or a removal decision record.
[0091] Regarding security and compliance boundaries, the version freeze node only accepts version freeze requests triggered by authorized freeze tasks. The impact index registration record only retains the model version identifier, consumption certificate, consumed derived sample identifier, batch identifier, round identifier, tag version, and consumption status required for reverse lookup. The original tag text and original content payload are not written to the impact index library. The model registration library only saves the registration fields at the model version level and does not save the original tag text and original content payload. The impact index query interface only returns the index fields and the de-identified tag version fields. Subsequent impact adjudication steps only read the index fields in the impact index library and do not directly read the original content payload.
[0092] The on-site performance is tested according to the following criteria: With a test sample size of no less than 10,000 consecutive consumption vouchers covering normal freezing, duplicate freezing, conflicting freezing, rule switching, and short-term link interruption scenarios, the impact index duplication rate is calculated as the number of vouchers with different model versions assigned the same impact index primary key relative to the total number of impact index registrations. The omission registration rate is calculated as the number of vouchers that meet the freezing conditions but have not formed a formal impact index registration record relative to the total number of registrations. The proportion that meets the delay limit requirement is calculated as the number of vouchers that complete the formal impact index registration within the delay limit relative to the total number of registrations. The impact index duplication rate should not exceed 0.001%, the omission registration rate should not exceed 0.01%, and the proportion that meets the delay limit requirement should not be less than 99.5%.
[0093] Preferably, in the steel plate surface defect detection production line, the training execution node is configured with 8 training servers, the model version freeze is triggered once after each formal round is completed, the scrolling observation window is 10 seconds, the upper limit of the single-item impact index registration delay is 500 milliseconds, and within an 8-hour shift of continuously processing 180 training batches to form one formal model version, the version freeze node completes the aggregation of 92,400 formal consumption records, the generated model version identifier has no collision, the impact index has no duplicate records, one duplicate frozen item is correctly merged, one conflicting frozen item has its model version identifier frozen and transferred to review, and the impact index registration success rate reaches 99.98%.
[0094] Alternatively, in the voice quality inspection scenario, the model version freeze trigger condition can be set to be triggered after the cumulative recording duration reaches the task order declaration threshold, and the batch set declaration value can be replaced with the duration interval declaration value. The remaining consumption voucher collection, model version identifier generation, impact index registration, evidence chain traceability, idempotent order control, interface return constraints and downstream call methods remain unchanged.
[0095] S5. Receive the problem sample trigger request, trace the derived sample identifier based on the original sample identifier, lock the associated model version based on the impact index, and generate the impact adjudication result. The specific implementation is as follows: When a problem sample trigger request is received, the impact adjudication node, deployed behind the impact index database and connected to the registration database, derived database, consumer database, model registration database, impact index database, adjudication database, and rule repository, traces the derived sample identifier based on the original sample identifier and locks the associated model version based on the impact index during the current adjudication period when the problem sample enters the problem handling chain, generating an impact adjudication result. The legitimate sources of problem sample trigger requests are limited to sample problem alarms, manual review conclusions, compliance offline notifications, and tag correction notifications. Repeated reports of the same original sample under the same problem type are merged into a single trigger request, while different problem types form independent trigger requests. A problem sample trigger request must include at least the original sample identifier, problem type, trigger source, trigger time, handling priority, rule version prompt value, problem basis summary, and trigger number. The trigger time fragment is the second and millisecond value of the moment when the problem trigger interface completes receiving and generating the formal problem sample trigger request.
[0096] Before the execution of the impact adjudication node, the system first reads the original sample registration record from the registration database according to the original sample identifier, then recursively reads the derived registration record from the derived database according to the parent sample identifier and the evidence chain index, then retrieves the consumption record from the consumption database that is in the formal consumption state and was formally registered before the current adjudication time according to the derived sample identifier, then retrieves the impact index registration record from the impact index database that is in the formal index state according to the consumption certificate, then retrieves the model version record from the model registration database that is in the formal release state according to the model version identifier, and then reads the traceability rule version, impact adjudication rule version, and result generation rule version that are currently locked and effective from the rule repository. The original sample registration record, derived registration record, consumption registration record, and impact index registration record are aligned in ascending order of business time, execution time, and current index. Blank appendix fields are filled according to a fixed fill rule. Filling is only allowed to occur in non-primary decision fields. The fill order is fixed as follows: the value attached to the current problem sample trigger request, the value of the most recent successful adjudication of the same original sample identifier in the last 24 hours, and the default value of the rule repository.
[0097] When tracing derived sample identifiers, the impact adjudication node takes the original sample identifier as the root node. First, it searches for officially valid derived records whose parent sample identifier equals the original sample identifier. Then, using each newly acquired derived sample identifier, it continues to search layer by layer for officially valid derived records whose parent sample identifier equals that derived sample identifier. Finding no new officially valid derived record is the formal termination condition. When records in the chain to be supplemented, revoked, conflict pending adjudication, or duplicate registration state are found, the recursion does not continue downward from them, and a termination note is generated for each. When there are multiple parallel derived branches for the same parent sample, all of them are retained and aggregated in ascending order of derived sample identifier. If the same derived sample identifier is hit through multiple paths, the hits are deduplicated by derived sample identifier, only one formal aggregation result is retained, and a multi-path hit note is added to the adjudication evidence record.
[0098] The completeness of derivation tracing is determined when all formal and valid parallel derivation branches that can be retrieved with the original sample identifier as the root node have been recursively traversed, and each branch terminates when no new formal and valid derivation record is retrieved; if any branch has a chain to be supplemented, a conflict to be resolved, or the library is unreachable, the derivation tracing is determined to be incomplete.
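The layer-by-layer tracing and the completeness rule can be sketched as a breadth-first traversal. This is a simplified illustration: derived records are plain dictionaries, the status strings and function name are hypothetical, and the "library unreachable" case is not modelled.

```python
from collections import deque

# States at which recursion stops and a termination note is produced.
TERMINAL_STATES = {"chain_pending", "revoked", "conflict_pending", "duplicate"}
# States that additionally make the trace incomplete.
INCOMPLETE_STATES = {"chain_pending", "conflict_pending"}


def trace_derived(root_id, derived_records):
    """Return (traced_ids, termination_notes, multi_path_ids, complete)."""
    children = {}
    for rec in derived_records:
        children.setdefault(rec["parent_id"], []).append(rec)

    found, notes, multi_path = set(), [], set()
    queue = deque([root_id])
    while queue:
        parent = queue.popleft()
        for rec in children.get(parent, []):
            if rec["status"] in TERMINAL_STATES:
                notes.append((rec["derived_id"], rec["status"]))
                continue  # do not descend below a non-valid record
            if rec["status"] != "formal":
                continue
            if rec["derived_id"] in found:
                multi_path.add(rec["derived_id"])  # same sample via another path
                continue
            found.add(rec["derived_id"])
            queue.append(rec["derived_id"])

    complete = not any(status in INCOMPLETE_STATES for _, status in notes)
    return sorted(found), notes, sorted(multi_path), complete
```

The `found` set provides both the deduplication across paths and protection against revisiting a sample, so each formal derived sample is expanded once.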
[0099] When locking the associated model version based on the impact index, the impact adjudication node first retrieves the formal consumption record from the consumption database according to the formal valid derived sample identifier set obtained by tracing, and then retrieves the impact index registration record of the formal index status from the impact index database according to the consumption voucher. When the same consumption voucher hits multiple formal index status records, all of them are retained and grouped according to the model version identifier. Multiple hit records under the same model version are summarized uniformly before generating the impact adjudication result. As long as there is at least one formal impact index registration record of a certain model version whose consumed derived sample identifier falls into the current tracing set, it is determined to be an associated model version. However, associated model versions are limited to model versions in the model registration database that are in the formal release status and whose corresponding impact index registration record is in the formal index status.
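A possible shape for this lookup chain, from traced derived samples to consumption vouchers to impact index records to released model versions, is sketched below. The dictionaries stand in for the consumption database, impact index database, and model registration database; all key names and status strings are hypothetical.

```python
def lock_associated_versions(traced_ids, consumption_records,
                             index_records, version_status):
    """Group formal index hits by model version, keeping released versions only.

    version_status maps model_version_id -> status string.
    Returns {model_version_id: [index records that hit the traced set]}.
    """
    traced = set(traced_ids)
    # Step 1: formal consumption vouchers of the traced derived samples.
    vouchers = {
        c["voucher"] for c in consumption_records
        if c["derived_sample_id"] in traced and c["status"] == "formal"
    }
    # Step 2: formal index records for those vouchers, grouped by model version.
    grouped = {}
    for rec in index_records:
        if rec["voucher"] not in vouchers or rec["index_status"] != "formal":
            continue
        if version_status.get(rec["model_version_id"]) != "released":
            continue  # only formally released versions can be associated
        grouped.setdefault(rec["model_version_id"], []).append(rec)
    return grouped
```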
[0100] The impact adjudication node determines whether to generate an impact adjudication result within a scrolling observation window. The observation window length can be set from 1 second to 60 seconds, with a default value of 10 seconds. The adjudication order is fixed as follows: validity of the problem sample triggering request, existence of the original sample identifier, completeness of the derivation traceability, searchability of the impact index, and validity of the model version release status. If the preceding item is not valid, the following item will not be checked.
[0101] To suppress duplicate rulings, the impact adjudication node generates a ruling idempotent key. The ruling idempotent key consists of the original sample identifier, issue type, trigger time fragment, summary of the traced derived sample set, and rule version number. The summary of the traced derived sample set consists only of officially valid derived sample identifiers and does not include the original sample identifier, branch notes, revoked branches, or branches to be supplemented. Before they are combined into the summary, the derived sample identifiers are sorted in ascending order. If no officially valid derived sample identifiers are found, no set summary is generated and 5002 is returned directly. Within a 10-second deduplication window, if the first four items are consistent and the rule version number is consistent, the request is determined to be a duplicate ruling. Only the first ruling record is retained, and a duplicate trigger record is added to the evidence chain. If the first four items are consistent but the rule version number is inconsistent, the request is determined to be a conflict ruling, and the current ruling result identifier is frozen. During the freezing period, the current ruling result cannot be provided downstream for retrieval or invocation, and version processing nodes are not allowed to invoke it.
[0102] The impact ruling result is the formal result that characterizes the scope and level of impact of the problem sample on the associated model version. At the time of the ruling, the impact ruling node generates the result one by one, with the original sample identifier as the primary key and the associated model version identifier as the secondary key, combined with the problem type, the number of associated consumption vouchers, the number of associated batches, and the number of associated rounds. The same combination of the same original sample identifier and the same associated model version identifier generates only one formal ruling result in the same ruling. Multiple hit branches are first aggregated and then the ruling result is generated.
[0103] The number of associated consumption vouchers is obtained by counting the number of official consumption vouchers that are hit by the current traceable derived sample set under the same model version, in units of pieces; the number of associated batches is obtained by counting the number of different batch identifiers that are hit under the same model version, in units of batches; the number of associated rounds is obtained by counting the number of different round identifiers that are hit, in units of rounds.
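The three counts amount to distinct-value counts over the index records that hit one model version. A minimal sketch, with hypothetical key names:

```python
def association_counts(hit_records):
    """Count distinct vouchers, batches, and rounds among one version's hits."""
    return {
        "vouchers": len({r["voucher"] for r in hit_records}),   # pieces
        "batches": len({r["batch_id"] for r in hit_records}),   # batches
        "rounds": len({r["round_id"] for r in hit_records}),    # rounds
    }
```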
[0104] The impact ruling result includes at least the ruling result identifier, original sample identifier, issue type, associated model version identifier, number of associated consumption vouchers, number of associated batches, number of associated rounds, impact level, execution time, and rule version number. The ruling result identifier is formed by recombining the original sample identifier, associated model version identifier, issue type, ruling generation time segment, and rule version number in a fixed order. When the same original sample identifier and the same associated model version identifier conflict and are reconstructed, an additional ruling sequence number is added starting from 1 under that combination. Except for cases explicitly permitted by each step, each state is only allowed to change according to the migration order specified in the corresponding step, and it is not allowed to directly rewrite an existing formal record by skipping intermediate states. When the state changes, the corresponding evidence chain record should be written simultaneously, and the states before and after the change should be retained.
[0105] Impact levels are limited to three categories: low impact, medium impact, and high impact. Low impact corresponds to a number of associated consumption vouchers below the first threshold in the rule repository and a number of associated rounds of 1. High impact corresponds to a number of associated consumption vouchers not less than the second threshold in the rule repository or a number of associated rounds greater than 1. Medium impact corresponds to not falling into either the low or high impact category. The first threshold must be less than the second threshold, and both must be configured in the rule repository according to the issue type and locked at the start of the current observation window according to the current issue type. The number of associated batches is only used as an appendix for subsequent version processing and does not participate in the impact level classification.
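The three-way classification can be written directly from these rules. In this sketch the function name and the threshold values used in the usage note are hypothetical; in the described method the thresholds come from the rule repository per issue type.

```python
def impact_level(voucher_count: int, round_count: int,
                 first_threshold: int, second_threshold: int) -> str:
    """Classify impact as 'low', 'medium', or 'high'.

    The number of associated batches is intentionally not an input,
    since it does not participate in the classification.
    """
    if not first_threshold < second_threshold:
        raise ValueError("first threshold must be less than second threshold")
    if voucher_count >= second_threshold or round_count > 1:
        return "high"
    if voucher_count < first_threshold and round_count == 1:
        return "low"
    return "medium"
```

For example, with illustrative thresholds of 5 and 20, a sample hit by 3 vouchers in one round would be low impact, 10 vouchers in one round medium, and 3 vouchers across two rounds high.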
[0106] After the impact ruling is formed, the impact ruling node generates a ruling registration record and a ruling evidence record. A complete ruling group is a group in which all officially valid derived samples corresponding to the same original sample identifier have been traced, the corresponding official consumption vouchers, official impact indexes, and officially released model versions have all been retrieved and their impact levels calculated, and a ruling registration record or a ruling record has been formed. The ruling registration records are stored in the ruling database and sequentially written to the retrieval entries in the partition file and the index database. The ruling evidence record is appended to the impact index evidence chain index as a new ruling evidence record. The parent index is retained as the upstream index, and the current ruling index does not overwrite the existing impact index. The ruling evidence record includes at least the ruling result identifier, the original sample identifier, the associated model version identifier, the issue type, the impact level, the execution time, the rule version number, the previous index, the current index, and the conclusion summary.
[0107] When the issue triggering interface succeeds, it returns a ruling result identifier, a summary of the associated model version set, and a ruling status code. When it fails, it returns an error code and handling suggestions. Specifically: 5001 indicates that the original sample does not exist; 5002 indicates that no valid derived sample identifier was found, or that any parallel branch has a pending chain, a conflict pending ruling, or a recursion interruption caused by an unreachable library; 5003 indicates that none of the formal consumption vouchers corresponding to the traceability set has an impact index registration record (on a partial hit, 5003 is not returned; instead, an index-missing note is generated and the ruling continues); 5004 indicates an illegal model version status; 5005 indicates a duplicate ruling; 5006 indicates a conflict ruling; 5007 indicates that the ruling library storage is unreachable; and 5008 indicates a rule version mismatch. For 5001 and 5007, automatic retry using the original ruling idempotent key is allowed, provided that the original sample identifier, issue type, trigger time fragment, and rule version hint value remain unchanged. When 5006 occurs, after manual review and confirmation, the original ruling result can continue to be published under the original ruling result identifier, or the original ruling result identifier can be invalidated and a new ruling result rebuilt. Manual review may be performed only by audit terminals with review authority. The review conclusion is limited to confirmation of duplicate rulings, splitting of conflicting rulings, correction of issue types, and rejection of publication. After review, the original ruling registration record must not be overwritten; only an additional ruling record that references the upstream index can be generated. The additional ruling record must include at least the review subject, review time, review conclusion, index of the ruled record, summary of the conclusion basis, execution status, and whether to continue publishing.
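As an illustrative, non-limiting sketch, the error codes and the automatic retry condition described above could be expressed as follows. The names `RulingError`, `TriggerRequest`, and `may_auto_retry` are hypothetical and are not part of the described interface; only the code values and the retry condition come from the text.

```python
from dataclasses import dataclass
from enum import IntEnum


class RulingError(IntEnum):
    """Error codes returned by the issue triggering interface on failure."""
    ORIGINAL_SAMPLE_NOT_FOUND = 5001
    NO_VALID_DERIVED_SAMPLE = 5002       # also pending chain, conflict, unreachable library
    NO_IMPACT_INDEX_RECORD = 5003        # not returned on a partial hit
    ILLEGAL_MODEL_VERSION_STATUS = 5004
    DUPLICATE_RULING = 5005
    CONFLICT_RULING = 5006
    RULING_STORE_UNREACHABLE = 5007
    RULE_VERSION_MISMATCH = 5008


@dataclass(frozen=True)
class TriggerRequest:
    # Hypothetical container for the four fields that must stay unchanged on retry.
    original_sample_id: str
    issue_type: str
    trigger_time_fragment: str
    rule_version_hint: str


AUTO_RETRY_CODES = frozenset({
    RulingError.ORIGINAL_SAMPLE_NOT_FOUND,
    RulingError.RULING_STORE_UNREACHABLE,
})


def may_auto_retry(code: RulingError, first: TriggerRequest, retry: TriggerRequest) -> bool:
    """Allow automatic retry only for 5001/5007 and only if the key fields are unchanged."""
    return code in AUTO_RETRY_CODES and first == retry
```

In this sketch, a retry whose rule version hint value differs from the first request is rejected even for 5001 or 5007, because the idempotent key would no longer match.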
[0108] When 5008 occurs for the first time, the locally cached rule version is compared with the current version in the rule repository, and then the version declared in the problem sample trigger request is compared with the current version in the rule repository. Before the retry, the rule repository version is read once more. If the retry still fails, 5008 is returned. If 5008 occurs 3 times in a row, the node switches to the previous effective version and writes a record to the version rollback evidence chain.
[0109] Regarding time and resource constraints, the upper limit of the delay from a single original sample entering the impact adjudication node to forming a formal impact adjudication result can be set to 500 milliseconds to 5 seconds, the concurrent adjudication capacity can be set to 50 to 5000 samples per second, and the default number of retries can be set to 2. If the adjudication registration is not completed after exceeding the delay limit, it will be transferred to the delay queue. The order of supplementary registration is first in ascending order of problem type, then in ascending order of original sample identifier, and then in ascending order of associated model version identifier.
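The supplementary registration order for the delay queue amounts to a three-level ascending sort. A minimal sketch, assuming the problem type is carried as a numeric code (the text does not state how problem types are ordered) and using the hypothetical names `PendingRuling` and `supplementary_order`:

```python
from typing import Iterable, List, NamedTuple


class PendingRuling(NamedTuple):
    issue_type_code: int       # assumed numeric code for the problem type
    original_sample_id: str
    model_version_id: str


def supplementary_order(pending: Iterable[PendingRuling]) -> List[PendingRuling]:
    """Sort by problem type, then original sample identifier, then model version identifier."""
    return sorted(
        pending,
        key=lambda p: (p.issue_type_code, p.original_sample_id, p.model_version_id),
    )
```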
[0110] If the ruling link is interrupted, records that have been traced but not entered into the ruling database are temporarily stored in the local buffer. When the buffer reaches its capacity limit, the most recent complete ruling group with the highest current processing priority and the most associated model versions will be retained first.
[0111] Regarding security and compliance boundaries, the impact adjudication node only receives adjudication requests triggered by authorized issue handling services or manual review terminals. The adjudication registration record and query interface only retain the original sample identifier, issue type, associated model version identifier, number of associated consumption vouchers, number of associated batches, number of associated rounds, and impact level required for participation in the handling. The original tag text, original content payload, and model file content are not written to the adjudication database. Subsequent version handling nodes only read the adjudication fields in the adjudication database and do not directly read the original content payload. If the manual review terminal needs to view the original tag text, it can only view it in a restricted isolation area in read-only mode and should generate an independent audit record.
[0112] On-site performance is tested according to the following criteria: With at least 10,000 consecutive problem samples triggering requests, covering five types of issues—labeling errors, content distortion, duplicate registration, compliant deletion, and source conflicts—as well as scenarios involving normal adjudication, duplicate adjudication, conflicting adjudication, rule switching, and short-term link interruption, the sample size for each issue type must be at least 10% of the total test sample size. The adjudication result duplication rate is calculated as the percentage of cases where different combinations of original sample identifiers and associated model version identifiers are assigned the same primary key to the adjudication result out of the total number of registered cases. The omission registration rate is calculated as the percentage of cases that meet the adjudication conditions but have not formed a formal adjudication registration record out of the total number of cases to be registered. The proportion meeting the delay limit requirement is calculated as the percentage of cases in the test sample size that complete formal adjudication registration within the delay limit out of the total number of cases to be registered. The adjudication result duplication rate should not exceed 0.001%, the omission registration rate should not exceed 0.01%, and the proportion meeting the delay limit requirement should not be less than 99.5%.
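By way of example only, the three acceptance metrics defined above can be computed from raw counts as in the following sketch. The class `RulingRunCounts` and the function `evaluate` are hypothetical names; the thresholds are the ones stated in the text.

```python
from dataclasses import dataclass
from typing import Dict, Union


@dataclass(frozen=True)
class RulingRunCounts:
    registered: int        # total number of registered cases
    key_collisions: int    # different (sample, model version) pairs sharing one primary key
    should_register: int   # total number of cases to be registered
    missed: int            # cases meeting the conditions with no formal registration record
    on_time: int           # cases registered within the delay limit


def evaluate(c: RulingRunCounts) -> Dict[str, Union[float, bool]]:
    """Return the three percentages and whether all stated limits are met."""
    duplication_pct = 100.0 * c.key_collisions / c.registered
    omission_pct = 100.0 * c.missed / c.should_register
    on_time_pct = 100.0 * c.on_time / c.should_register
    passed = duplication_pct <= 0.001 and omission_pct <= 0.01 and on_time_pct >= 99.5
    return {
        "duplication_pct": duplication_pct,
        "omission_pct": omission_pct,
        "on_time_pct": on_time_pct,
        "passed": passed,
    }
```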
[0113] Preferably, in the steel plate surface defect detection production line, when executing the adjudication for a label correction-type problem sample trigger request, the scrolling observation window is set to 10 seconds, and the upper limit of the single-item adjudication registration delay is set to 800 milliseconds. In an 8-hour shift that continuously receives 12,000 problem sample trigger requests, the adjudication node completes the association retrieval of 92,400 derived samples, 92,400 consumption records, and 1 formal model version. The adjudication results have no primary key collisions, 9 duplicate adjudication cases are correctly merged, and 1 conflicting adjudication case has its adjudication result mark frozen and transferred to review. The adjudication registration success rate reaches 99.98%.
[0114] Alternatively, in the voice quality inspection scenario, the content distortion in the problem type can be replaced with the missing recording frame, the number of associated batches can be replaced with the number of associated duration intervals, and the rest of the original sample identifier tracing, derived sample identifier aggregation, impact index locking, adjudication result generation, evidence chain traceability, idempotent order control, interface return constraints and downstream call methods can remain unchanged.
[0115] S6. Based on the impact determination results, perform failure marking, targeted retraining, and version update on the associated model version, and write the new consumption voucher and new impact index registration record corresponding to the version update. The specific implementation is as follows: The version processing node, deployed behind the adjudication library and connected to the model registration library, impact index library, consumption library, derived library, training execution node, rule repository, batch assembly node, and version freeze node, reads the adjudication result identifier, original sample identifier, associated model version identifier, number of associated consumption vouchers, number of associated batches, number of associated rounds, and impact level during the current processing period after the impact adjudication result enters the formal release state. It first performs an invalidation mark on the associated model version, and then generates a targeted retraining task or a full retraining task based on the impact level and problem type. This drives the training execution node to complete the version update, and writes new consumption vouchers to the new formal batch set formed by the new retraining task and writes a new impact index to the new model version.
[0116] Failure marking refers to rewriting the version status of the associated model version in the model registry to failure pending replacement, failure retained, or failure withdrawn. Low impact corresponds to failure pending replacement, medium impact corresponds to failure retained, and high impact corresponds to failure withdrawn. In the failure pending replacement state, the original associated model version retains historical retrieval capabilities but cannot be distributed as the default valid model. In the failure retained state, it can continue to be queried within a limited scope but cannot be newly put into production use. In the failure withdrawn state, it is not visible to downstream training and use, and only audit retrieval capabilities are retained.
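The correspondence between impact level and failure state is a fixed lookup. A minimal sketch, with hypothetical names `FailureState`, `mark_failure`, and `audit_only`:

```python
from enum import Enum


class FailureState(Enum):
    PENDING_REPLACEMENT = "failure pending replacement"
    RETAINED = "failure retained"
    WITHDRAWN = "failure withdrawn"


STATE_BY_IMPACT = {
    "low": FailureState.PENDING_REPLACEMENT,
    "medium": FailureState.RETAINED,
    "high": FailureState.WITHDRAWN,
}


def mark_failure(impact_level: str) -> FailureState:
    """Return the version status to write into the model registry for an impact level."""
    return STATE_BY_IMPACT[impact_level]


def audit_only(state: FailureState) -> bool:
    """Only the withdrawn state is hidden from downstream use and kept for audit retrieval."""
    return state is FailureState.WITHDRAWN
```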
[0117] Targeted retraining refers to retraining only for the version of the associated model that is currently matched by the ruling result, without changing the overall goal and main configuration of the original training task. This is done by using the formal consumer sample set corresponding to the original associated model version as the base, removing problem chain samples and adding replacement samples. When the coverage ratio of problem chain samples exceeds the full retraining threshold in the rule warehouse, targeted retraining is converted to full retraining.
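A non-limiting sketch of this selection logic follows. It assumes the coverage ratio is the share of the base set hit by problem chain samples, which the text implies but does not define precisely; the function name `build_retraining_set` and the return convention are hypothetical.

```python
from typing import Optional, Set, Tuple


def build_retraining_set(
    base: Set[str],
    problem_chain: Set[str],
    replacements: Set[str],
    full_retrain_threshold: float,
) -> Tuple[str, Optional[Set[str]]]:
    """Return ("targeted", sample set) or ("full", None).

    base: formal consumed sample identifiers of the original associated model version.
    full_retrain_threshold: assumed to be a ratio in [0, 1] read from the rule repository.
    """
    if not base:
        raise ValueError("base sample set must not be empty")
    hit = base & problem_chain
    coverage = len(hit) / len(base)
    if coverage > full_retrain_threshold:
        # The text does not specify how the full retraining set is assembled.
        return "full", None
    return "targeted", (base - hit) | replacements
```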
[0118] A version update is considered complete when, after targeted or full retraining, the new model version has been officially registered by the version freeze node and is in the officially released state, and the new consumption vouchers and new impact indexes have been officially registered. If training has been completed but the official version registration has not, the version update is not considered complete.
[0119] Replacement samples (also referred to as alternative samples) are derived samples that are not hit by the current ruling, are in a formally valid state, and meet the current retraining task's label version restrictions, source dispersion restrictions, category balance restrictions, and duplication suppression restrictions. By default, the target retraining capacity equals the number of problem chain samples to be removed. The rule repository can be configured to allow an upward or downward deviation range for the capacity; if no range is explicitly configured, the capacity must not deviate from the number to be removed. If there are insufficient replacement samples, 6004 is returned. Replacement samples must not cross label versions and must not exceed the source dispersion restrictions and category balance restrictions.
[0120] Before the version handling node executes, the system first reads the impact ruling results in the officially released state from the ruling library; reads the model version records in the officially released, failure pending replacement, or failure retained state from the model registration library; reads the impact index registration records in all officially indexed states from the impact index library according to the associated model version identifier; reads the official consumption registration records from the consumption library according to the consumption voucher; reads the officially valid derivation registration records from the derivation library according to the consumed derived sample identifier; and reads, from the rule repository, the version handling rule version, supplementary training task generation rule version, alternative sample selection rule version, and version update rule version that are locked and effective in the current observation window. The system then aligns the ruling results, model version records, impact index registration records, consumption registration records, and derivation registration records in ascending order of associated model version identifier, round identifier, batch identifier, and consumption voucher. Blank note fields are filled, in order of preference, with the attached value of the current ruling result, the most recent successful handling value for the same associated model version identifier in the last 24 hours, and the default value of the rule repository.
[0121] The version handling node decides whether to perform the handling within the scrolling observation window. The observation window length can be set from 1 second to 60 seconds, with a default value of 10 seconds. The decision order is fixed as follows: the validity of the decision result status, the handleability of the associated model version status, the identifiability of the problem chain samples, the availability of supplementary training samples, and the consistency of the version update rules. If the previous item is not valid, the subsequent item will not be checked.
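The fixed decision order with early termination can be sketched as below. The check names in `CHECK_ORDER` are hypothetical labels for the five items listed in the text, and each check is supplied by the caller as a zero-argument function.

```python
from typing import Callable, Dict, Optional

CHECK_ORDER = (
    "ruling_status_valid",
    "model_version_handleable",
    "problem_chain_identifiable",
    "retraining_samples_available",
    "update_rules_consistent",
)


def first_failed_check(checks: Dict[str, Callable[[], bool]]) -> Optional[str]:
    """Run the checks in the fixed order; stop at the first one that does not hold.

    Returns the name of the failed check, or None if all five hold.
    Later checks are not evaluated once an earlier one fails.
    """
    for name in CHECK_ORDER:
        if not checks[name]():
            return name
    return None
```

Passing callables rather than precomputed booleans keeps later, potentially expensive lookups from running when an earlier item already fails.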
[0122] Multiple associated model versions under the same ruling result identifier are allowed to generate disposal tasks in parallel, but each associated model version must be executed serially in a fixed order of failure marking, targeted retraining or full retraining, and version update.
[0123] Problem chain sample identification means taking the set of formally valid derived sample identifiers obtained by tracing back along the original sample identifier of the current ruling result, and intersecting it with the consumed derived sample identifiers in all formal impact index registration records under the associated model version, to form the problem chain sample set to be removed. Only formally valid and unrevoked consumed derived sample identifiers are retained in the intersection result; if the intersection result is empty, 6003 is returned. Samples that are awaiting chain supplementation or conflict ruling are recorded only in an incomplete problem chain sample identification appendix and are not included in the problem chain sample set to be removed.
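A minimal sketch of the intersection step, assuming each traced derived sample carries a status string. The status labels `"valid"`, `"pending_chain"`, and `"conflict_pending"` and the function name `identify_problem_chain` are hypothetical; the 6003 code is from the text.

```python
from typing import Dict, List, Optional, Set, Tuple

PENDING_STATUSES = {"pending_chain", "conflict_pending"}


def identify_problem_chain(
    traced_status: Dict[str, str],
    consumed: Set[str],
) -> Tuple[Optional[int], List[str], List[str]]:
    """Intersect traced derived samples with those consumed by the model version.

    traced_status: derived sample identifier -> status, from tracing the original sample.
    consumed: consumed derived sample identifiers in the formal impact index records.
    Returns (error code or None, samples to remove, appendix of unresolved samples).
    """
    valid = {sid for sid, status in traced_status.items() if status == "valid"}
    pending = {sid for sid, status in traced_status.items() if status in PENDING_STATUSES}
    to_remove = sorted(valid & consumed)
    appendix = sorted(pending & consumed)   # recorded, but never removed
    code = None if to_remove else 6003
    return code, to_remove, appendix
```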
[0124] To suppress duplicate processing, the version processing node generates a processing idempotent key. The processing idempotent key consists of the adjudication result identifier, the associated model version identifier, the impact level, the summary of the sample set of the problem chain to be removed, and the rule version number. The summary of the sample set of the problem chain to be removed consists only of the identifiers of the officially valid and unrevoked consumed derived samples that are hit under the current adjudication result. Before reorganization, the samples are deduplicated by the consumed derived sample identifiers and sorted in ascending order. When the set is empty, no summary is generated and 6003 is returned directly. Within the 10-second deduplication window, if the first four items are the same and the rule version number is the same, it is judged as duplicate processing. Only the first processing record is retained and a duplicate trigger record is added to the evidence chain. If the first four items are the same but the rule version number is different, it is judged as conflict processing and the current processing result identifier is frozen. During the freezing period, the current processing result cannot be provided to downstream for retrieval or invocation, and the training execution node is not allowed to receive new supplementary training tasks triggered by the processing result.
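The duplicate and conflict rules above can be sketched as follows. The use of SHA-256 for the sample set summary, the in-memory `seen` dictionary, and the names `sample_set_summary` and `classify` are assumptions made for illustration; the text does not name a digest algorithm or a storage form.

```python
import hashlib
from typing import Dict, Iterable, Optional, Tuple

FirstFour = Tuple[str, str, str, str]


def sample_set_summary(sample_ids: Iterable[str]) -> Optional[str]:
    """Deduplicate, sort ascending, then digest; None means the set is empty."""
    unique = sorted(set(sample_ids))
    if not unique:
        return None
    return hashlib.sha256("\n".join(unique).encode("utf-8")).hexdigest()


def classify(
    seen: Dict[FirstFour, Tuple[str, float]],
    ruling_id: str,
    model_version_id: str,
    impact_level: str,
    sample_ids: Iterable[str],
    rule_version: str,
    now: float,
    window_seconds: float = 10.0,
) -> str:
    """Return "6003", "new", "duplicate", or "conflict" for a handling request."""
    summary = sample_set_summary(sample_ids)
    if summary is None:
        return "6003"                      # empty set: no summary, error returned directly
    first_four = (ruling_id, model_version_id, impact_level, summary)
    previous = seen.get(first_four)
    if previous is not None and now - previous[1] <= window_seconds:
        # Same first four items inside the window: compare rule version numbers.
        return "duplicate" if previous[0] == rule_version else "conflict"
    seen[first_four] = (rule_version, now)
    return "new"
```

Because the identifiers are deduplicated and sorted before digesting, two requests listing the same samples in a different order produce the same summary and are treated as the same handling.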
[0125] The targeted retraining task is generated by the version handling node according to the generation rules of the locked version. The fixed fields are the adjudication result identifier, the associated model version identifier, the summary of the sample set of the problem chain to be removed, the summary of the alternative sample set, the retraining round identifier, and the rule version number. The fixed order is adjudication result identifier first, followed by rule version number. The retraining round identifier is generated based on the round identifier corresponding to the original associated model version. If full retraining is not triggered, the original round identifier is taken plus the retraining sequence number. If full retraining is triggered, a new formal round identifier is regenerated. The alternative sample set summary is generated separately for each associated model version. Before reorganization, it is deduplicated by the derived sample identifier and sorted in ascending order. The same alternative sample cannot be assigned to multiple associated model versions in the same handling cycle.
[0126] After the version handling node completes failure marking, it sends the retraining task to the training execution node. After the training execution node completes the retraining, it returns a new training completion signal and a new training task summary. The version handling node then calls the batch assembly node and the version freeze node to generate a new consumption voucher for the new formal batch set formed by the retraining task and a new impact index for the new model version registration record. The new consumption voucher is generated only for the new formal batch set formed by the retraining task and is registered by the batch assembly node during the retraining task assembly phase. The new impact index is established only for the new model version and is registered by the version freeze node when the new model version is officially frozen. A version inheritance record is added to the model registration database between the new model version and the original associated model version. The inheritance relationship between the new model version and the original associated model version is established only through the version inheritance records in the model registration database and does not change the existing identifier or historical registration records of the original associated model version. The existing consumption vouchers corresponding to the original associated model version retain their historical formal consumption status, and the existing impact index retains its original status with a failure association note added. Neither may be directly deleted.
[0127] The version handling result includes at least the handling result identifier, the adjudication result identifier, the associated model version identifier, the handling status, the retraining task identifier, the new model version identifier, the execution time, and the rule version number. The handling status is limited to six types: failure pending replacement, failure retained, failure withdrawn, retraining in progress, retraining completed and awaiting update, and updated. After the retraining task is issued to the training execution node and officially received, the handling status changes to retraining in progress. When the training execution node returns a training completion signal and the new model version has not yet been officially frozen, it changes to retraining completed and awaiting update. After the new model version is officially released and the new consumption voucher and the new impact index have been officially registered, it changes to updated.
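The three status changes described above form a simple forward-only progression. In the following sketch the three failure-marked states are collapsed into a single hypothetical starting label `"failure_marked"`, and the signal names are assumptions; the conditions for each transition follow the text.

```python
from typing import Dict

UPDATE_CONDITIONS = ("version_released", "voucher_registered", "index_registered")


def next_status(status: str, signals: Dict[str, bool]) -> str:
    """Advance the handling status by at most one step; otherwise return it unchanged."""
    if status == "failure_marked" and signals.get("task_received"):
        return "retraining_in_progress"
    if status == "retraining_in_progress" and signals.get("training_completed"):
        return "retraining_completed_awaiting_update"
    if status == "retraining_completed_awaiting_update" and all(
        signals.get(name) for name in UPDATE_CONDITIONS
    ):
        # Release alone is not enough: the voucher and the index must also be registered.
        return "updated"
    return status
```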
[0128] The disposal result identifier is formed by recombining the adjudication result identifier, the associated model version identifier, the disposal generation time segment, and the rule version number in a fixed order. When the same adjudication result identifier and the same associated model version identifier conflict and are rebuilt, the disposal sequence number is added starting from 1 under that combination. Repeated disposal retry cannot occupy the new disposal sequence number, and the old disposal sequence number after being invalidated cannot be reused.
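One possible reading of the sequence number rules is sketched below. The `|` separator, the class name `DisposalIdAllocator`, and the choice to append the sequence number as a suffix are assumptions; the text fixes only the field order and the rules that retries do not consume a number and invalidated numbers are never reused.

```python
from typing import Dict, Tuple


class DisposalIdAllocator:
    """Builds disposal result identifiers and hands out rebuild sequence numbers."""

    def __init__(self) -> None:
        # Highest sequence number ever issued per (ruling id, model version id).
        self._last_seq: Dict[Tuple[str, str], int] = {}

    @staticmethod
    def initial(ruling_id: str, model_version_id: str, time_segment: str, rule_version: str) -> str:
        """Identifier for a first attempt or a plain retry; consumes no sequence number."""
        return "|".join((ruling_id, model_version_id, time_segment, rule_version))

    def rebuild(self, ruling_id: str, model_version_id: str, time_segment: str, rule_version: str) -> str:
        """Identifier for a rebuild after a conflict; numbers start at 1 and only increase."""
        key = (ruling_id, model_version_id)
        seq = self._last_seq.get(key, 0) + 1
        self._last_seq[key] = seq
        base = self.initial(ruling_id, model_version_id, time_segment, rule_version)
        return f"{base}|{seq}"
```

Because the counter only increases, a sequence number whose disposal result was later invalidated can never be handed out again.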
[0129] After the version handling result is formed, the version handling node generates handling registration records and handling evidence records. A complete handling group refers to all related model versions under the same ruling result identifier that have completed the corresponding handling status registration, and model versions that need retraining have completed retraining task registration, version update registration, new consumption voucher registration, and new impact index registration, while model versions that do not need retraining have completed invalidation withdrawal or invalidation retention registration. The handling registration record includes at least the handling result identifier, ruling result identifier, related model version identifier, handling status, number of samples in the problem chain to be removed, number of replacement samples, new model version identifier, execution time, and rule version number. The handling evidence record is generated by adding new handling evidence records based on the ruling evidence chain index. The parent index is retained as the upstream index, and the current handling index does not cover the existing ruling index. The handling evidence record includes at least the handling result identifier, ruling result identifier, related model version identifier, handling status, execution time, rule version number, previous index, current index, and conclusion summary.
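The append-only behaviour of the evidence chain, in which each new record keeps the previous index as its upstream index and never overwrites an earlier record, can be sketched as follows. `EvidenceRecord` and `EvidenceChain` are hypothetical names, and the record is reduced to a few fields for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class EvidenceRecord:
    current_index: int
    previous_index: Optional[int]   # upstream index; None for the first record
    kind: str                       # e.g. "impact_index", "ruling", "handling"
    conclusion_summary: str


class EvidenceChain:
    """Append-only list of evidence records; existing records are never modified."""

    def __init__(self) -> None:
        self._records: List[EvidenceRecord] = []

    def append(self, kind: str, conclusion_summary: str) -> EvidenceRecord:
        previous = self._records[-1].current_index if self._records else None
        record = EvidenceRecord(len(self._records), previous, kind, conclusion_summary)
        self._records.append(record)
        return record

    def records(self) -> Tuple[EvidenceRecord, ...]:
        return tuple(self._records)
```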
[0130] Each version disposal request must include at least the adjudication result identifier, the associated model version identifier, the impact level, the disposal priority, the disposal strategy flag, and the rule version hint value. Upon success, it returns the disposal result identifier, the retraining task identifier, the new model version identifier, and the disposal status code. Upon failure, it returns an error code and disposal suggestions. Specifically: 6001 indicates that the adjudication result does not exist; 6002 indicates that the associated model version status is invalid, withdrawn, revoked, pending chain replacement, or frozen pending adjudication; 6003 indicates that the intersection result is empty or contains only samples in the pending chain replacement, conflict pending adjudication, or revoked status; 6004 indicates that the number of replacement samples is still lower than the target retraining capacity even when the label version consistency, source dispersion restriction, category balance restriction, and duplication suppression restriction are all satisfied; 6005 indicates duplicate disposal; 6006 indicates conflict disposal; 6007 indicates that the disposal library storage is unreachable; and 6008 indicates a rule version mismatch. For 6001 and 6007, automatic retry using the original idempotent key is allowed, provided that the adjudication result identifier, associated model version identifier, impact level, and rule version hint value have not changed. When 6006 occurs, after manual review and confirmation, the disposal result can continue to be published under the original disposal result identifier, or the original disposal result identifier can be invalidated and a new disposal result rebuilt. Manual review may be performed only by a review terminal with review authority, and the review conclusion is limited to duplicate disposal confirmation, conflict disposal splitting, impact level correction, and rejection of publication. After review, the original disposal registration record must not be overwritten; only an additional disposal record that references the upstream index can be generated. The additional disposal record should include at least the review subject, review time, review conclusion, index of the adjudicated record, summary of the conclusion basis, execution status, whether-to-continue-publishing flag, disposal result identifier, associated model version identifier, and disposal status.
[0131] When 6008 occurs for the first time, the locally cached rule version is compared with the current version in the rule repository, and then the version declared in the version handling request is compared with the current version in the rule repository. Before the retry, the rule repository version is read once more. If the retry still fails, 6008 is returned. If 6008 occurs 3 times in a row, the node switches to the previous effective version and writes a record to the version rollback evidence chain.
[0132] Regarding time and resource constraints, the maximum delay for a single disposal task from entering the version disposal node to forming a formal disposal registration record can be set to 500 milliseconds to 10 seconds, the concurrent disposal capacity can be set to 20 to 2000 items per second, and the default number of retries can be set to 2. If the disposal registration is not completed after exceeding the delay limit, it will be transferred to the delay queue. The order of supplementary registration is first in ascending order of impact level, then in ascending order of associated model version identifier, and then in ascending order of adjudication result identifier.
[0133] Regarding security and compliance boundaries, the version handling node only receives handling tasks triggered by authorized version handling requests. The handling registration record and query interface only retain the adjudication result identifier, associated model version identifier, impact level, number of samples in the problem chain to be removed, number of replacement samples, new model version identifier, and handling status required for participation in the handling. The original tag text, original content payload, and model file content are not written to the handling library. Original associated model versions in the failure pending replacement or failure retained state can retain retrieval capabilities, but cannot participate in subsequent training assembly and version freezing as the default valid version. Original associated model versions in the failure withdrawn state are not visible to downstream training assembly, version freezing, and version handling.
[0134] The on-site performance shall be tested according to the following criteria: With a continuous sample size of no less than 10,000 formal impact rulings covering three impact levels (low impact, medium impact, and high impact) and scenarios including failure pending replacement, failure retention, failure withdrawal, targeted retraining, and full retraining, the sample size for each impact level shall be no less than 10% of the total sample size. The repetition rate of the handling results shall be calculated as the number of cases in which different ruling result identifiers and associated model version identifiers are assigned the same primary key to the handling result, out of the total number of handling registrations. The omission rate shall be calculated as the number of cases that meet the handling conditions but have not formed a formal handling registration record, out of the total number of cases that should be registered. The proportion that meets the delay limit requirement shall be calculated as the number of cases in the sample size that complete formal handling registration within the delay limit, out of the total number of cases that should be registered. The repetition rate of the handling results shall not exceed 0.001%, the omission rate shall not exceed 0.01%, and the proportion that meets the delay limit requirement shall not be less than 99.5%.
[0135] Preferably, in the steel plate surface defect detection production line, when performing version processing on a formal impact ruling result of the medium impact level, the rolling observation window is set to 10 seconds and the upper limit of the single-item processing registration delay is set to 1500 milliseconds. Within an 8-hour shift that continuously receives 12,000 version processing requests, the version processing node marks one associated model version as failure pending replacement, removes 36 problem chain samples from 92,400 formally valid derived samples and adds 36 replacement samples, drives the training execution node to generate a new model version, and writes new consumption vouchers and new impact indexes for the new model version. The processing results have no primary key collisions, 11 duplicate processing items are correctly merged, and 1 conflict processing item is frozen and transferred to review. The processing registration success rate reaches 99.98%.
[0136] Alternatively, in voice quality inspection scenarios, the number of alternative samples can be controlled according to the total recording duration interval. The handling strategy markers corresponding to high impact levels can be directly set to full retraining, while the entry points for other failure markers, targeted retraining, version updates, new consumption voucher writing, and new impact index writing remain unchanged. The above nodes can be deployed by independent server processes, different service modules within the same server, edge gateways and central services combined, or implemented by program units with equivalent functions. The above databases can be implemented by a combination of relational databases, key-value databases, object storage and indexing services. Under the premise of not changing the object identifiers, state constraints, evidence chain addition rules, version locking rules, and upstream and downstream call relationships in each step, they are equivalent implementation forms.
[0137] All calculations involved in the embodiments are dimensionless numerical calculations, and the preset parameters and thresholds in the calculations are set by those skilled in the art according to the actual situation.
[0138] It should be noted that this invention can be deployed on the device itself to realize embedded applications, or it can run on a PC or other terminal with a user interface, thereby meeting various hardware environments and usage requirements.
[0139] The above embodiments can be implemented, in whole or in part, by software, hardware, firmware, or any other combination thereof. When implemented using software, the above embodiments can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wireless or wired transmission; wired transmission methods include optical fiber, twisted pair, coaxial cable, etc.; wireless transmission includes infrared, microwave, etc. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center containing one or more sets of available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. A semiconductor medium can be a solid-state drive.
[0140] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and modules described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.
[0141] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or modules may be electrical, mechanical, or other forms.
[0142] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules; they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0143] In addition, the functional modules in the various embodiments of this application can be integrated into one processing module, or each module can exist physically separately, or two or more modules can be integrated into one module.
[0144] If the aforementioned functions are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0145] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.
[0146] In conclusion, the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for traceable management of AI model training data, characterized in that it comprises:
S1. Receiving an original sample, extracting its content fingerprint, source identifier, label version, and collection batch, and generating and registering an original sample identifier;
S2. For each cleaning, segmentation, enhancement, relabeling, and sampling operation performed on the original sample, recording the parent sample identifier, processing type, processing parameters, and execution time, and generating a derived sample identifier;
S3. When assembling a training batch, recording the derived sample identifier, batch identifier, round identifier, label version, and sampling rule for each derived sample entering the training batch, and generating a consumption voucher;
S4. When a model version is frozen, collecting the consumed derived sample identifiers based on the consumption vouchers, and establishing an impact index from the model version to the consumption vouchers;
S5. Receiving a problem sample trigger request, tracing the derived sample identifiers from the original sample identifier, locking the associated model version based on the impact index, and generating an impact adjudication result;
S6. Based on the impact adjudication result, performing failure marking, targeted retraining, and version update on the associated model version, and writing the new consumption voucher and the new impact index registration record corresponding to the version update.
2. The method for traceable management of AI model training data according to claim 1, characterized in that S1 comprises:
when the original sample is received, generating a content fingerprint based on the content payload of the original sample;
generating the original sample identifier based on the content fingerprint, source identifier, label version, collection batch, and rule version number;
generating an original sample registration record based on the original sample identifier and writing it into the registration database.
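The identifier generation recited in claim 2 can be illustrated with a minimal Python sketch. It is not the claimed implementation. The use of SHA-256, the field separator, the `orig-` prefix, the 32-character truncation, and the in-memory `registry` dictionary standing in for the registration database are all assumptions made for illustration, as are the function names.

```python
import hashlib

# Hypothetical in-memory stand-in for the registration database.
registry = {}


def content_fingerprint(payload: bytes) -> str:
    """Fingerprint of the content payload (SHA-256 is an assumed choice)."""
    return hashlib.sha256(payload).hexdigest()


def original_sample_id(fingerprint, source_id, label_version,
                       collection_batch, rule_version):
    """Derive a deterministic identifier from the five fields named in claim 2."""
    # The unit separator keeps field boundaries unambiguous (assumed encoding rule).
    material = "\x1f".join(
        [fingerprint, source_id, label_version, collection_batch, rule_version]
    )
    return "orig-" + hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]


def register_original(payload, source_id, label_version,
                      collection_batch, rule_version="r1"):
    """Generate the identifier and write a registration record once per identifier."""
    fingerprint = content_fingerprint(payload)
    sample_id = original_sample_id(
        fingerprint, source_id, label_version, collection_batch, rule_version
    )
    registry.setdefault(sample_id, {
        "content_fingerprint": fingerprint,
        "source_id": source_id,
        "label_version": label_version,
        "collection_batch": collection_batch,
        "rule_version": rule_version,
    })
    return sample_id
```

Because the identifier is a pure function of its inputs, registering the same payload with the same metadata twice yields the same identifier and a single record, while a change in any field (for example the label version) yields a different identifier.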
3. The method for traceable management of AI model training data according to claim 1, characterized in that S2 comprises:
for each cleaning, segmentation, enhancement, relabeling, and sampling operation performed on the original sample, recording the parent sample identifier, processing type, processing parameters, and execution time, and generating a processing parameter summary based on the processing parameters;
generating the derived sample identifier based on the parent sample identifier, processing type, processing parameter summary, time segment corresponding to the execution time, and rule version number.
4. The method for traceable management of AI model training data according to claim 3, characterized in that:
the processing parameter summary is formed by reorganizing the main parameters involved in the derivation decision according to a fixed field order, unified encoding rules, and unified null value placeholder rules;
the derived registration record corresponding to the derived sample identifier is written into the derived database, and a derived evidence record is generated and appended to the evidence chain index of the parent sample.
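The parameter summary and derived identifier of claims 3 and 4 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the field list `FIELD_ORDER`, the placeholder string, JSON as the unified encoding, the one-hour UTC time segment, SHA-256, and the in-memory `derived_db` and `evidence_chain` structures are all hypothetical choices.

```python
import hashlib
import json
from datetime import datetime, timezone

# Assumed fixed field order and null placeholder; real rule sets would differ.
FIELD_ORDER = ("crop", "noise_level", "label_map")
NULL_PLACEHOLDER = "<null>"

derived_db = {}      # hypothetical stand-in for the derived database
evidence_chain = {}  # parent identifier -> list of derived evidence records


def parameter_summary(params: dict) -> str:
    """Canonicalize parameters (fixed order, one encoding, one null placeholder)."""
    parts = []
    for name in FIELD_ORDER:
        value = params.get(name)
        encoded = NULL_PLACEHOLDER if value is None else json.dumps(value, sort_keys=True)
        parts.append(f"{name}={encoded}")
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def time_segment(executed_at: datetime) -> str:
    """Bucket an aware timestamp into a one-hour UTC segment (assumed granularity)."""
    return executed_at.astimezone(timezone.utc).strftime("%Y%m%d%H")


def derived_sample_id(parent_id, processing_type, summary, executed_at, rule_version):
    material = "\x1f".join(
        [parent_id, processing_type, summary, time_segment(executed_at), rule_version]
    )
    return "drv-" + hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]


def register_derived(parent_id, processing_type, params, executed_at, rule_version="r1"):
    """Write the derived record and append an evidence record to the parent's chain."""
    summary = parameter_summary(params)
    derived_id = derived_sample_id(
        parent_id, processing_type, summary, executed_at, rule_version
    )
    if derived_id not in derived_db:
        derived_db[derived_id] = {
            "parent_id": parent_id,
            "processing_type": processing_type,
            "parameter_summary": summary,
            "executed_at": executed_at.isoformat(),
        }
        evidence_chain.setdefault(parent_id, []).append(
            {"derived_id": derived_id, "processing_type": processing_type}
        )
    return derived_id
```

Canonicalizing before hashing means that a missing parameter and an explicit null produce the same summary, and that the order in which parameters are supplied does not affect the identifier.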
5. The method for traceable management of AI model training data according to claim 1, characterized in that S3 comprises:
during training batch assembly, recording the derived sample identifier, batch identifier, round identifier, label version, and sampling rule for each derived sample written into the formal batch set;
generating a consumption voucher based on the derived sample identifier, batch identifier, round identifier, label version, and sampling rule, and generating a consumption registration record based on the consumption voucher and writing it into the consumption database.
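A minimal sketch of the voucher generation in claim 5 is given below. It assumes that a voucher is a hash over the five recorded fields and that the consumption database is an in-memory dictionary. The names `ConsumptionRecord`, `make_voucher`, `assemble_batch`, and `consumption_db` are hypothetical.

```python
import hashlib
from dataclasses import dataclass

# Hypothetical in-memory stand-in for the consumption database.
consumption_db = {}


@dataclass(frozen=True)
class ConsumptionRecord:
    voucher: str
    derived_id: str
    batch_id: str
    round_id: str
    label_version: str
    sampling_rule: str


def make_voucher(derived_id, batch_id, round_id, label_version, sampling_rule):
    """Bind one derived sample to one batch and round (hash scheme is assumed)."""
    material = "\x1f".join(
        [derived_id, batch_id, round_id, label_version, sampling_rule]
    )
    return "cv-" + hashlib.sha256(material.encode("utf-8")).hexdigest()[:32]


def assemble_batch(derived_ids, batch_id, round_id, label_version, sampling_rule):
    """Issue one voucher per derived sample written into the formal batch set."""
    vouchers = []
    for derived_id in derived_ids:
        voucher = make_voucher(
            derived_id, batch_id, round_id, label_version, sampling_rule
        )
        consumption_db[voucher] = ConsumptionRecord(
            voucher, derived_id, batch_id, round_id, label_version, sampling_rule
        )
        vouchers.append(voucher)
    return vouchers
```

Issuing the voucher per sample rather than per batch is what later allows a single problematic sample to be matched to the exact batches and rounds that consumed it.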
6. The method for traceable management of AI model training data according to claim 1, characterized in that S4 comprises:
when the model version is frozen, generating an impact index registration record based on the model version identifier, the consumption voucher in the formal consumption record, the consumed derived sample identifier, the batch identifier, and the round identifier;
the impact index registration record uses the model version identifier as the primary index key and the consumption voucher as the secondary index key, and is written into the impact index database.
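The two-level keying described in claim 6 can be sketched with a nested dictionary, where the outer key is the model version identifier and the inner key is the consumption voucher. The structure `impact_index`, the `"formal"` status value, and both function names are assumptions for illustration only.

```python
# Hypothetical impact index: model version -> consumption voucher -> record.
impact_index = {}


def freeze_model_version(model_version, consumption_records):
    """Write one impact index record per voucher consumed by the frozen version."""
    entries = impact_index.setdefault(model_version, {})
    for record in consumption_records:
        entries[record["voucher"]] = {
            "derived_id": record["derived_id"],
            "batch_id": record["batch_id"],
            "round_id": record["round_id"],
            "status": "formal",  # assumed name for the formal index status
        }
    return len(entries)


def versions_for_voucher(voucher):
    """Reverse lookup: every model version whose index contains the voucher."""
    return sorted(
        version for version, entries in impact_index.items() if voucher in entries
    )
```

The reverse lookup shown here scans all versions. A production index would more plausibly maintain a second mapping from voucher to model version, which the claim does not specify.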
7. The method for traceable management of AI model training data according to claim 1, characterized in that S5 comprises:
after receiving the problem sample trigger request, starting from the original sample identifier, searching for valid derived records whose parent sample identifier equals the original sample identifier, to obtain a set of derived sample identifiers;
based on the consumption vouchers in the formal consumption records corresponding to the set of derived sample identifiers, retrieving the impact index registration records whose index status is formal, locking the associated model version, and generating the impact adjudication result.
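The tracing step of claim 7 can be sketched as a traversal followed by two lookups. The claim recites matching derived records whose parent identifier equals the original sample identifier. This sketch additionally follows multi-step derivation chains breadth-first, which is an assumption about how chained processing would be handled. The record layouts and function names are hypothetical.

```python
from collections import deque


def trace_derived(original_id, derived_records):
    """Collect valid derived identifiers reachable from the original sample."""
    children = {}
    for record in derived_records:
        if record["valid"]:
            children.setdefault(record["parent_id"], []).append(record["derived_id"])
    found, seen, queue = [], set(), deque([original_id])
    while queue:
        current = queue.popleft()
        for child in children.get(current, []):
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


def adjudicate(original_id, derived_records, consumption_records, impact_index):
    """Map a problem sample to the model versions that consumed its derivatives."""
    derived = set(trace_derived(original_id, derived_records))
    vouchers = {
        c["voucher"] for c in consumption_records if c["derived_id"] in derived
    }
    locked = {}
    for model_version, entries in impact_index.items():
        hits = sorted(
            v for v, entry in entries.items()
            if v in vouchers and entry["status"] == "formal"
        )
        if hits:
            locked[model_version] = hits
    return {
        "original_id": original_id,
        "derived_ids": sorted(derived),
        "locked_versions": locked,
    }
```

The returned dictionary plays the role of the impact adjudication result: it names only the model versions with a formal index entry for an affected voucher, so unrelated versions are left untouched.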
8. The method for traceable management of AI model training data according to claim 1, characterized in that S6 comprises:
when marking the associated model version as invalid based on the impact adjudication result, rewriting the version status of the associated model version to the corresponding one of invalid pending replacement, invalid retention, and invalid withdrawal;
generating a supplementary training task based on the set of problem-chain samples to be excluded and the replacement sample set.
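The status rewrite and supplementary task of claim 8 can be sketched as follows. The three invalid states come from the claim. The `ACTIVE` state, the enum and function names, and the shape of the task dictionary are assumptions for illustration.

```python
from enum import Enum


class VersionStatus(Enum):
    ACTIVE = "active"  # assumed name for the normal state
    INVALID_PENDING_REPLACEMENT = "invalid_pending_replacement"
    INVALID_RETENTION = "invalid_retention"
    INVALID_WITHDRAWAL = "invalid_withdrawal"


def mark_invalid(versions, model_version, status):
    """Rewrite a known model version's status to one of the three invalid states."""
    if status is VersionStatus.ACTIVE:
        raise ValueError("failure marking requires an invalid status")
    if model_version not in versions:
        raise KeyError(model_version)
    versions[model_version] = status


def supplementary_task(model_version, training_set, problem_ids, replacement_ids):
    """Build a retraining task: drop problem-chain samples, add replacements."""
    kept = [s for s in training_set if s not in problem_ids]
    added = [r for r in replacement_ids if r not in kept]
    return {
        "base_version": model_version,
        "samples": kept + added,
        "excluded": sorted(problem_ids),
    }
```

Keeping the excluded identifiers in the task record makes the retraining itself traceable: the new consumption vouchers can be checked against the list to confirm that no problem-chain sample re-entered the batch.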
9. The method for traceable management of AI model training data according to claim 8, characterized in that:
a new formally sealed batch set is formed based on the supplementary training task;
new consumption vouchers are generated based on the new formally sealed batch set;
based on the new model version identifier generated by the version update and the new consumption vouchers, a new impact index registration record is generated and written into the impact index database.