Data marking method, electronic device and storage medium
By analyzing scene elements and spatial logic constraints using a large language model, dynamically orchestrating the recognition model pipeline, and generating judgment codes, this solves the problem of efficiently screening complex semantic training samples in existing technologies. It achieves an efficient and automated data labeling method to generate high-quality training samples.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SICHUAN XUANJIE INTELLIGENT TECHNOLOGY CO LTD
- Filing Date
- 2026-04-24
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This invention relates to the field of data labeling technology, and in particular to a data labeling method, electronic device, and storage medium.
Background Technology
[0002] Currently, the training of deep learning models heavily relies on large-scale, high-quality labeled data. When training new functional models and lacking readily available labeled samples, common practices include manual labeling or automatic labeling using existing single recognition models. However, manual labeling is costly and inefficient; while a single recognition model can only output the coordinates or masks of its preset category, failing to handle complex scenarios such as "people smoking" or "people playing around," which require simultaneous recognition of multiple objects and understanding of their spatial relationships or action semantics.
[0003] In existing technologies, even when multiple recognition models are combined for cascaded screening, the results are limited to simple existence checks (such as "there is a person and smoke in the image"), making it difficult to further determine complex logic involving the relative position, orientation, or interaction between objects, such as "is a person smoking?". This results in a large amount of noisy data in the generated samples that does not conform to the target semantics. Furthermore, hard-coded decision logic lacks universality and scalability for different complex scenarios, making it difficult to adapt to diverse new functional requirements. Therefore, how to automatically and efficiently filter and label training samples that meet complex semantic constraints from massive amounts of raw data has become an urgent technical problem to be solved.
Summary of the Invention
[0004] To address the aforementioned technical problems, the technical solution adopted by this invention is as follows:
[0005] According to a first aspect of this application, a data labeling method is provided, the method comprising the following steps:
[0006] S100, Obtain the functional description text of the new model;
[0007] S200: Input the functional description text into the large language model, which will parse and generate at least one scene element and spatial logical constraint. The scene element is used to represent the category of the object to be identified, and the spatial logical constraint is used to represent the positional relationship or action semantics between the objects to be identified.
[0008] S300, Match the corresponding recognition model in the preset model pool according to the scene elements; the model pool stores a number of recognition models and the function identifier and output format identifier of each recognition model;
[0009] S400, determine the processing order of each recognition model according to the spatial logic constraints, and generate pipeline configuration;
[0010] S500, call the corresponding recognition models sequentially to process the raw data, following the processing order in the pipeline configuration; the current-level recognition model performs inference on the input data and outputs a recognition result containing coordinates or a mask; data whose recognition result contains the scene elements is passed as an intermediate result to the next-level recognition model, while data that does not contain the scene elements is discarded;
[0011] S600, when the pipeline configuration includes spatial logic constraints, input the spatial logic constraints and the coordinates or mask data output by the recognition models at each level into the large language model, which generates the judgment code;
[0012] S700, execute the judgment code in a sandbox environment, perform secondary filtering on the intermediate results, and retain the data that meets the spatial logic constraints;
[0013] S800, generate labeled files in the preset training sample format from the coordinates or mask data output by the recognition models at each level, and use the data that meets the spatial logic constraints as training samples for the new model.
[0014] According to another aspect of this application, a non-transitory computer-readable storage medium is also provided, wherein at least one instruction or at least one program is stored in the storage medium, and the at least one instruction or at least one program is loaded and executed by a processor to implement the above-described data marking method.
[0015] According to another aspect of this application, an electronic device is also provided, including a processor and the aforementioned non-transitory computer-readable storage medium.
[0016] The present invention has at least the following beneficial effects:
[0017] The data labeling method of this invention performs semantic parsing of functional description text using a large language model, automatically generating scene elements and spatial logical constraints. Based on a model pool, it dynamically orchestrates a cascaded recognition pipeline, achieving efficient coarse screening of massive amounts of raw data. Furthermore, the large language model automatically generates decision code containing spatial relationship calculation functions based on spatial logical constraints, and executes it securely in a sandbox. This transforms complex behaviors described in natural language (such as "a person smoking") into quantifiable coordinate relationship judgments, accurately filtering out data that meets complex semantic constraints without requiring manual writing of decision logic or tedious post-processing. Finally, it automatically generates labeled files, forming a high-quality training sample set. This method not only significantly reduces the cost of manual intervention and improves data labeling efficiency, but also possesses high versatility and scalability, flexibly adapting to the needs of different new functional models for complex scene samples, effectively solving the cold start problem of training samples for new models.
Attached Figure Description
[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0019] Figure 1 is a flowchart of a data labeling method provided in an embodiment of the present invention.
Detailed Implementation
[0020] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0021] It should be noted that, based on this disclosure, those skilled in the art will understand that one aspect described herein can be implemented independently of any other aspect, and two or more of these aspects can be combined in various ways. For example, any number of aspects set forth herein can be used to implement the device and / or practice the method. Furthermore, the device can be implemented and / or the method can be practiced using other structures and / or functionalities besides one or more of the aspects set forth herein.
[0022] A data labeling method is described below with reference to the flowchart shown in Figure 1.
[0023] The data labeling method may include the following steps:
[0024] S100, retrieve the functional description text of the new model.
[0025] In this embodiment, the system receives natural language text input by the user through a graphical user interface or API interface. This text describes a complex scene that the new model to be trained can recognize. For example, the user inputs: "I need to train a model that can recognize 'a person is smoking'." The system stores this text in memory as a string, appending a timestamp and task ID for subsequent steps.
[0026] This step uses natural language as the input interface, which lowers the barrier to entry for users. Users do not need to have programming or annotation expertise to define complex recognition requirements, providing a semantic source for subsequent automated processing.
[0027] S200: Input the functional description text into the large language model, which then parses and generates at least one scene element and spatial logical constraint. The scene element is used to represent the category of the object to be identified, and the spatial logical constraint is used to represent the positional relationship or action semantics between the objects to be identified.
[0028] In this embodiment, the system constructs a structured prompt containing the following content:
[0029] User-input text describing the function, such as "a person is smoking".
[0030] Task instructions: "Please extract the scene elements (i.e., the necessary object categories) and spatial logic constraints (i.e., the positional relationships or action semantics between objects) from the above description. The output format is JSON."
[0031] Optionally, few-shot examples can be added, such as: Input: 'person hits dog', Output: {"required_entities":["person","dog"],"logical_constraints":["person's hand is in contact with dog","dog shows avoidance posture"]}.
[0032] The system sends the prompt words to the API of a large language model (such as GPT-4, DeepSeek, etc.) and receives the returned JSON text. It parses the JSON to obtain a set of scene elements, such as ["person", "cigarette", "mouth"], and a set of spatial logical constraints, such as ["cigarette is close to mouth", "the angle of cigarette points to mouth"]. If the returned format does not meet expectations, the system can be configured with a retry mechanism or use the default parsing rules.
[0033] By leveraging the semantic understanding capabilities of large language models, unstructured natural language is automatically decomposed into machine-recognizable structured elements and constraints, avoiding the tedious process of manually writing rules or configuration files. It can adapt to descriptions of various complex scenarios and has strong generalization ability.
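The prompt construction, JSON parsing, and retry behavior described for S200 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the JSON keys `required_entities` and `logical_constraints` are taken from the few-shot example above, while `build_prompt`, `parse_llm_reply`, `extract_with_retry`, and the `call_llm` callable (standing in for whichever large language model API is used) are hypothetical names introduced here.

```python
import json

# Task instruction adapted from the embodiment; the key names follow its few-shot example.
PROMPT_TEMPLATE = (
    "Description: {description}\n"
    "Please extract the scene elements (i.e., the necessary object categories) "
    "and spatial logic constraints (i.e., the positional relationships or action "
    "semantics between objects) from the above description. The output format is "
    "JSON with the keys required_entities and logical_constraints."
)


def build_prompt(description: str) -> str:
    """Assemble the structured prompt for one functional description."""
    return PROMPT_TEMPLATE.format(description=description)


def parse_llm_reply(reply: str):
    """Return (scene_elements, constraints), or None if the reply is malformed."""
    try:
        data = json.loads(reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    elements = data.get("required_entities")
    constraints = data.get("logical_constraints", [])
    if not isinstance(elements, list) or not elements:
        return None
    if not isinstance(constraints, list):
        return None
    return [str(e) for e in elements], [str(c) for c in constraints]


def extract_with_retry(description: str, call_llm, max_retries: int = 3):
    """Query the model (hypothetical callable) until a well-formed reply arrives."""
    prompt = build_prompt(description)
    for _ in range(max_retries):
        parsed = parse_llm_reply(call_llm(prompt))
        if parsed is not None:
            return parsed
    raise ValueError("no well-formed JSON reply after retries")
```

Passing the model call in as a function keeps the sketch independent of any particular vendor API; a fallback to default parsing rules, which the embodiment also permits, could replace the final `raise`.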
[0034] S300, matching the corresponding recognition model in the preset model pool according to the scene elements; the model pool stores several recognition models and the function identifier and output format identifier of each recognition model.
[0035] Furthermore, step S300 includes the following steps:
[0036] S310, convert the scene elements into a semantic vector $V_e$.
[0037] In this embodiment, the system obtains the set of scene elements generated in step S200, such as ["person", "cigarette", "mouth"]. These text labels are input into a pre-trained semantic embedding model (such as Sentence-BERT, OpenAI's text-embedding-ada-002, or a Chinese pre-trained model such as text2vec-large-chinese) to obtain a fixed-dimensional semantic vector.
[0038] Each scene element can be encoded independently and then averaged, or all elements can be concatenated into a sentence (e.g., "person cigarette mouth") and encoded as a whole. For simplicity, it is common practice to concatenate all elements with commas to form a string before encoding, resulting in a vector $V_e \in \mathbb{R}^d$, where $d$ is the embedding dimension (e.g., 768).
[0039] In this step, discrete text labels are converted into numerical representations in a continuous vector space, providing a unified mathematical basis for subsequent similarity calculations. This enables the functional identifiers in the model pool to be quantitatively compared with any scene elements, avoiding the rigid limitations of string matching.
[0040] S320, convert the function identifiers of each recognition model in the model pool into function vectors, resulting in a function vector list $A = (A_1, A_2, \ldots, A_i, \ldots, A_n)$, $i = 1, 2, \ldots, n$, where $A_i$ is the function vector corresponding to the $i$-th recognition model and $n$ is the number of recognition models.
[0041] In this embodiment, when each recognition model in the model pool is registered, its "functional identifier" field (such as capabilities or labels) is predefined as a list of categories that the model can recognize, for example: ["person", "face"] or ["car", "truck", "bus"]. The system uses the exact same semantic embedding model as S310 to convert the functional identifier list of each model into a corresponding functional vector. The transformation method is the same as S310 (e.g., encoding list elements by concatenating them with commas). The transformation results can be precomputed offline and stored in the model pool's metadata, avoiding redundant online computations. The final result is the vector list $A = (A_1, A_2, \ldots, A_n)$, where $n$ is the total number of recognition models in the model pool.
[0042] This step vectorizes model functions in advance or online, enabling all models in the model pool to be retrieved and compared in the same semantic space, laying the foundation for fast matching. This approach does not rely on manually labeled hierarchical relationships, but instead utilizes the semantic similarity of pre-trained models for automatic generalization (e.g., "person" and "pedestrian" are relatively close in the vector space).
[0043] S330, obtain the similarity between $V_e$ and each function vector in $A$, yielding a similarity list $\eta = (\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_n)$, where $\eta_i$ is the similarity between $V_e$ and $A_i$.
[0044] For each $i$ from 1 to $n$, calculate the cosine similarity between the semantic vector $V_e$ and the function vector $A_i$:
[0045] $\eta_i = \dfrac{V_e \cdot A_i}{\lVert V_e \rVert \, \lVert A_i \rVert}$;
[0046] The range of values for cosine similarity is $[-1, 1]$. The closer the value is to 1, the closer the semantics of the two vectors are. The system can efficiently compute all $\eta_i$ using batch matrix operations (such as NumPy or PyTorch) to obtain the similarity list $\eta$.
[0047] In this step, the semantic matching degree between scene elements and the capabilities of each model is quantified by cosine similarity, which allows the system to capture synonyms or hyponyms (e.g., "character" and "person" have high similarity), thereby expanding the range of matchable models and improving the robustness of matching.
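As an illustration of the batch computation mentioned for S330, the following NumPy sketch computes every cosine similarity in one matrix-vector product. It assumes the embeddings have already been produced by some embedding model; the function name `cosine_similarities` and the small `eps` guard against zero-length vectors are additions made for this sketch.

```python
import numpy as np


def cosine_similarities(v_e: np.ndarray, A: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Cosine similarity between v_e (shape (d,)) and each row of A (shape (n, d))."""
    dots = A @ v_e                                       # all dot products at once
    norms = np.linalg.norm(A, axis=1) * np.linalg.norm(v_e)
    return dots / np.maximum(norms, eps)                 # eps avoids division by zero
```

With toy two-dimensional vectors, `cosine_similarities(np.array([1.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]))` evaluates to similarities of 1, 0, and -1, which spans the stated value range.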
[0048] S340, the recognition model corresponding to MAX(η)>τ is determined as the candidate model; MAX() is the preset maximum value function, and τ is the preset similarity threshold.
[0049] In this embodiment, a similarity threshold $\tau$ is preset (for example, 0.7, which can be adjusted based on experiments). The system finds the maximum value $\mathrm{MAX}(\eta)$ in the similarity list $\eta$. If $\mathrm{MAX}(\eta) > \tau$, the corresponding recognition model (if multiple models have the same maximum value, all of them are selected) is marked as a candidate model.
[0050] In this embodiment, the most similar model is matched for each scene element. That is, for the element "person", the most similar model (such as "person_detector") is found; for the element "cigarette", the most similar model (such as "cigarette_detector") is found.
[0051] In this step, threshold control prevents completely irrelevant models from being mismatched to scene elements, ensuring the reliability of the matching. Simultaneously, each element is matched independently, allowing the system to select different specialized models for different objects, achieving the optimal combination.
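The threshold rule of S340, applied per scene element, might look like the sketch below. It assumes the per-element similarity lists are already available; `select_candidates` and `match_elements` are hypothetical helper names, and the default `tau=0.7` is the example value from the embodiment.

```python
import numpy as np


def select_candidates(eta, model_names, tau: float = 0.7):
    """Return the model(s) attaining MAX(eta) if that maximum exceeds tau, else []."""
    eta = np.asarray(eta, dtype=float)
    if eta.size == 0:
        return []
    best = eta.max()
    if best <= tau:
        return []
    # Ties on the maximum are all kept, as the embodiment specifies.
    return [name for name, s in zip(model_names, eta) if np.isclose(s, best)]


def match_elements(similarity_by_element, model_names, tau: float = 0.7):
    """Match every scene element independently against the model pool."""
    return {
        element: select_candidates(eta, model_names, tau)
        for element, eta in similarity_by_element.items()
    }
```

An element that maps to an empty list is exactly the "incompletely covered" case that the next step handles.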
[0052] S350, if the candidate model does not fully cover the scene elements, a supplementary matching mechanism is triggered; the supplementary matching mechanism includes inputting the uncovered scene elements as prompt words into the large language model, and the large language model recommending models with alternative recognition capabilities from the model pool, or generating zero-shot recognition instructions based on the CLIP model.
[0053] In this embodiment, after matching all scene elements, it is checked whether each element has at least one corresponding candidate model. If an element (e.g., "mouth") does not find any model with a similarity exceeding a threshold, it is determined to be "incompletely covered". At this point, a supplementary matching mechanism is triggered, which includes two optional paths:
[0054] The large language model recommends alternative models: Using uncovered scene elements (such as "mouth") and a list of functional identifiers for all models in the model pool as context, it constructs a prompt: "We need to identify 'mouth,' but there are no models in the model pool that directly detect mouths. Please recommend a model from the following list that best indirectly provides mouth location information, and explain how to extract mouth coordinates from its output. Model list:..." The large language model is invoked, returning the recommendation result, such as "face_landmark_detector," and instructing the use of mouth keypoints from facial landmarks as an alternative.
[0055] Generate CLIP-based zero-shot recognition instructions: If no model in the model pool can indirectly provide the feature, the system constructs a zero-shot instruction. For example, for the feature "mouth", the system generates a CLIP-based judgment function: given an image or region of interest, the CLIP model is used to calculate the similarity between the text "a mouth" and the image region; if the similarity exceeds a threshold, a mouth is considered to exist. The instruction format is: clip_score = clip.compute_similarity(image_roi, "a mouth").
[0056] The system incorporates the above supplementary results into the pipeline configuration as virtual recognition model nodes.
[0057] The supplementary matching mechanism allows the system to continue operating even when the model pool is insufficient, rather than failing outright. Utilizing the reasoning capabilities of large language models to recommend indirect alternatives, or leveraging CLIP's zero-shot capability for dynamic identification, greatly enhances the system's robustness and versatility, ensuring coverage of elements in any scene.
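The coverage check and the two fallback paths of S350 can be outlined as below. This is only a planning sketch under stated assumptions: `recommend` is a hypothetical callable wrapping the large language model and returning a model name or `None`, the node dictionaries are an invented representation of the "virtual recognition model nodes", and no CLIP inference is performed here.

```python
def find_uncovered(matches):
    """Scene elements for which no candidate model passed the threshold."""
    return [element for element, models in matches.items() if not models]


def build_fallback_prompt(element, pool_capabilities):
    """Prompt asking the language model for an indirect substitute (wording from the embodiment)."""
    listing = "\n".join(
        f"- {name}: {', '.join(labels)}" for name, labels in pool_capabilities.items()
    )
    return (
        f"We need to identify '{element}', but there are no models in the model pool "
        "that directly detect it. Please recommend a model from the following list "
        "that best indirectly provides its location information, and explain how to "
        f"extract its coordinates from the output. Model list:\n{listing}"
    )


def plan_supplementary(matches, pool_capabilities, recommend):
    """Create one virtual node per uncovered element."""
    nodes = []
    for element in find_uncovered(matches):
        model = recommend(build_fallback_prompt(element, pool_capabilities))
        if model in pool_capabilities:
            # Path 1: an existing model can provide the element indirectly.
            nodes.append({"element": element, "kind": "substitute", "model": model})
        else:
            # Path 2: fall back to a CLIP zero-shot check with a text prompt.
            nodes.append({"element": element, "kind": "clip_zero_shot", "text": f"a {element}"})
    return nodes
```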
[0058] The above steps achieve soft matching of model capabilities by mapping scene elements and model function identifiers to the same semantic vector space and calculating cosine similarity for matching. This avoids rigid keyword matching and can identify semantic associations such as synonyms and hyponyms. Simultaneously, threshold filtering ensures matching quality and provides a supplementary mechanism for uncovered elements, based on large language model inference or CLIP zero-shot identification. This approach enables the system to automatically and flexibly reuse heterogeneous models in the existing model pool. Even if the model pool lacks certain specialized models, it can obtain alternative capabilities through semantic approximation or zero-shot methods, thus ensuring the smooth execution of subsequent pipelines and significantly improving the system's adaptability and robustness.
[0059] S400, determine the processing order of each recognition model according to the spatial logic constraints, and generate pipeline configuration.
[0060] Furthermore, step S400 includes the following steps:
[0061] S410, construct a directed acyclic graph with the identification model as nodes and the dependencies implied by spatial logical constraints as edges.
[0062] Iterate through all candidate recognition models output in step S300 (including CLIP zero-sample models generated by the supplementary matching mechanism) and create an independent node for each model; assign node attributes: node ID, model function, output data type (coordinates / mask / key points), and inference speed (ms / frame, pre-stored in the model pool metadata).
[0063] Example: Based on scene elements person, cigarette, and mouth, generate 3 nodes: Node1: Human body detection model; Node2: Cigarette detection model; Node3: Mouth key point model.
[0064] Read the spatial logical constraint text generated in step S200, and extract the dependencies between models through rule matching / semantic parsing of large language models:
[0065] Dependency-free: The outputs of the two models have no spatial computational relationship and can be executed in parallel or in any order;
[0066] One-way dependency: The output of one model must participate in the spatial computation of another model.
[0067] Initialize an empty directed acyclic graph data structure (using an adjacency list for storage, compatible with programming languages such as Python, Java, and C++); add all initialized nodes to the graph, with no directed edges initially.
[0068] This step abstracts discrete recognition models into graph nodes, transforms abstract spatial logical constraints into graph-structured dependencies, and uses standardized adjacency lists to store graph data. This decouples the recognition models from the execution logic, enabling a clear and structured expression of complex dependencies between multiple models. It provides a stable and universal data structure foundation for subsequent pipeline sequence generation, is compatible with any number and type of recognition model combinations, and possesses strong scalability and versatility.
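One possible shape for the adjacency-list graph of S410 is sketched here. The node attributes mirror those listed in the embodiment (node ID, model function, output data type, inference speed); the class names `ModelNode` and `PipelineGraph`, and the `edge_ops` dictionary used to record the spatial operation type on each edge, are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ModelNode:
    node_id: str
    function: str        # e.g. "human body detection"
    output_type: str     # "coordinates", "mask", or "keypoints"
    latency_ms: float    # inference speed in ms / frame


@dataclass
class PipelineGraph:
    nodes: dict = field(default_factory=dict)      # node_id -> ModelNode
    adjacency: dict = field(default_factory=dict)  # node_id -> list of successor ids
    edge_ops: dict = field(default_factory=dict)   # (src, dst) -> spatial operation type

    def add_node(self, node: ModelNode) -> None:
        """Register a node; it starts with no outgoing edges."""
        self.nodes[node.node_id] = node
        self.adjacency.setdefault(node.node_id, [])

    def add_edge(self, src: str, dst: str, operation: str) -> None:
        """Record that dst depends on the output of src."""
        if src not in self.nodes or dst not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        if dst not in self.adjacency[src]:
            self.adjacency[src].append(dst)
        self.edge_ops[(src, dst)] = operation
```

After the three example nodes are added, every adjacency list is empty, which matches the initial state described above; edges are added only once dependencies have been extracted.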
[0069] S420, for any two recognition models $M_a$ and $M_b$, if the spatial logic constraint requires that the output of $M_a$ undergo a spatial operation before it can be used in the determination of $M_b$, a directed edge from $M_a$ to $M_b$ is generated; the processing order is determined by topological sorting of the directed acyclic graph; when there are multiple feasible orderings, a weighted sort is performed according to the reciprocal of the processing speed of each recognition model.
[0070] The specific implementation of this step is as follows:
[0071] Step 1: Generate directed edges
[0072] Traverse all pairwise combinations of recognition models ($M_a$, $M_b$).
[0073] Matching spatial logic constraints: if the constraint requires first computing the coordinates / mask output by $M_a$ and then inputting the result into $M_b$ to perform inference / determination, create a directed edge in the graph from $M_a$ to $M_b$.
[0074] Edge attribute marker: Records the spatial operation type (distance calculation / containment relationship / angle calculation) corresponding to the edge.
[0075] Example: Spatial logic constraint: Cigarette close to mouth → mouth coordinates (Node3) and cigarette coordinates (Node2) must be obtained before the distance can be calculated; the two nodes have no prior dependencies, so no directed edge is generated; if the constraint is to first detect the human body, and then detect the cigarette within the human body area → generate edge Node1→Node2.
[0076] Step 2: Topological sorting determines the basic execution order.
[0077] The sorting is performed using Kahn's algorithm (the most commonly used and easily reproducible topological sorting algorithm in industry), with the following steps:
[0078] Calculate the in-degree (the number of edges pointing to that node) of all nodes in the graph.
[0079] Add all nodes with an in-degree of 0 to the queue;
[0080] Loop through the queue nodes, add them to the sorted list, traverse their adjacent nodes, and decrement the in-degree of each adjacent node by 1; if the in-degree of an adjacent node becomes 0, add it to the queue.
[0081] After the traversal is complete, the execution order that satisfies all dependencies is obtained.
[0082] The pseudocode is as follows:
[0083] in_degree = in-degree of every node
[0084] queue = all nodes i where in_degree[i] == 0
[0085] result = empty list
[0086] while queue is not empty:
[0087]     u = remove the node at the head of the queue
[0088]     result.append(u)
[0089]     for v in adjacency_list[u]:
[0090]         in_degree[v] -= 1
[0091]         if in_degree[v] == 0:
[0092]             queue.append(v)
[0093] return result
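A runnable Python rendering of the pseudocode above is given below as a sketch. It assumes the graph is a dictionary mapping each node to its list of successors; the function name `topological_order` and the final cycle check (which the pseudocode does not include) are additions.

```python
from collections import deque


def topological_order(adjacency):
    """Kahn's algorithm over {node: [successor, ...]}."""
    in_degree = {u: 0 for u in adjacency}
    for successors in adjacency.values():
        for v in successors:
            in_degree[v] = in_degree.get(v, 0) + 1

    queue = deque(u for u, d in in_degree.items() if d == 0)
    result = []
    while queue:
        u = queue.popleft()              # take the node at the head of the queue
        result.append(u)
        for v in adjacency.get(u, []):
            in_degree[v] -= 1
            if in_degree[v] == 0:
                queue.append(v)

    if len(result) != len(in_degree):    # leftover nodes mean the graph has a cycle
        raise ValueError("graph contains a cycle; no valid order exists")
    return result
```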
[0094] Step 3: Weighted sorting optimization under multiple feasible orders
[0095] When topological sorting generates multiple valid execution orders, weighted sorting rules are enabled.
[0096] Weight definition: weight w = 1 / model inference time (ms / frame). The faster the model, that is, the shorter its inference time, the larger the weight.
[0097] Sorting rule: for nodes at the same level (with an in-degree of 0), sort them by weight from largest to smallest, so that the faster model is executed first.
[0098] Example: Node2 (cigarette model, 10 ms) has weight 0.1; Node3 (mouth model, 20 ms) has weight 0.05;
[0099] When there are no dependencies at the same level, the execution order is: Node2 → Node3.
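The tie-breaking in Step 3 can be folded into Kahn's algorithm by replacing the queue with a priority queue. The sketch below pursues the stated goal of executing the faster model first by always popping the ready node with the smallest latency, which is the same as preferring the largest reciprocal 1 / latency; the function name `speed_weighted_order` and the `latency_ms` mapping are assumptions for illustration.

```python
import heapq


def speed_weighted_order(adjacency, latency_ms):
    """Topological order in which, among ready nodes, the lowest latency runs first."""
    in_degree = {u: 0 for u in adjacency}
    for successors in adjacency.values():
        for v in successors:
            in_degree[v] += 1

    # Heap entries are (latency, node id); the id breaks exact latency ties.
    ready = [(latency_ms[u], u) for u, d in in_degree.items() if d == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        _, u = heapq.heappop(ready)
        order.append(u)
        for v in adjacency[u]:
            in_degree[v] -= 1
            if in_degree[v] == 0:
                heapq.heappush(ready, (latency_ms[v], v))

    if len(order) != len(adjacency):
        raise ValueError("graph contains a cycle; no valid order exists")
    return order
```

Dependencies always take precedence: a fast model still waits until every model it depends on has run.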
[0100] This step precisely defines the data dependencies between models using directed edges, employs the mature and stable Kahn algorithm to perform topological sorting, and strictly follows spatial logic constraints to generate a valid pipeline execution order, fundamentally avoiding logical errors in model execution. Simultaneously, a weighted sorting mechanism based on the reciprocal of processing speed is introduced to maximize the overall inference efficiency of the pipeline while satisfying functional constraints, reducing data labeling time. This implementation method has low algorithmic complexity and strong compatibility, and can be reproduced by those skilled in the art without creative effort, balancing the accuracy of execution logic with the efficiency of system operation.
[0101] The steps described above accurately depict the data dependencies between the recognition models by constructing a directed acyclic graph (DAG) and automatically generate the correct execution order using topological sorting, avoiding logical errors that may occur in manual pipeline orchestration. Building on this, when multiple feasible orders exist, a weighted sorting mechanism based on the reciprocal of processing speed is introduced, prioritizing the execution of faster models. This leverages their rapid filtering capabilities to reduce the amount of input data for subsequent slower models, thereby significantly improving overall data processing efficiency while ensuring dependency correctness. This approach enables the system to flexibly adapt to dependency structures in various complex scenarios and automatically optimize execution strategies, providing efficient and reliable pipeline orchestration capabilities for subsequent large-scale data labeling.
[0102] S500, according to the processing order configured in the pipeline, calls the corresponding recognition models in sequence to process the raw data; wherein, the current level recognition model performs inference on the input data, outputs the recognition result containing coordinates or mask, and passes the data containing the scene elements in the recognition result as intermediate results to the next level recognition model, while the data that does not contain the scene elements is discarded.
[0103] The system reads the raw dataset (e.g., an image folder raw_images / ). Execution follows the S400 pipeline sequence:
[0104] The first-level model (such as person_detector) takes the original image as input and outputs a list of detection boxes [(x1,y1,x2,y2,'person'),...]. If the output is empty (no one is in the image), the image is discarded and does not proceed to the next level. If at least one person is present, the image ID, all detection boxes, and optional masks are packaged into an intermediate data structure D={I, B,M} and passed to the second level.
[0105] The second-level model (such as `cigarette_detector`) can take the entire image as input, but for efficiency, the system calculates the Region of Interest (ROI) based on the detection boxes B from the previous level: each detection box is expanded by 10% and the union of the expanded boxes is taken; cigarette detection is only performed within this region. The output is a list of cigarette boxes. If the list is empty, the image is discarded; otherwise, D is updated with the cigarette box information.
[0106] The same applies to the third level and beyond; all data that passes through the entire cascaded model and contains the required scene elements at each level constitutes the intermediate result set.
[0107] Through cascading filtering, each stage removes data that does not contain necessary elements, significantly reducing the processing load of subsequent models (for example, filtering for people before filtering for cigarettes avoids wasting time detecting cigarettes in images without people). Simultaneously, an ROI strategy is employed to further reduce computational overhead, achieving efficient coarse data screening.
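The early-exit behavior of the cascade can be sketched as follows. The detectors are stand-in callables, not real models: each hypothetical `detect(data, prior_boxes)` returns a list of `(x1, y1, x2, y2, label)` tuples, and `run_cascade` is a name introduced for this illustration. The ROI restriction is omitted to keep the example short.

```python
def run_cascade(items, stages):
    """Filter items through ordered (element, detect) stages.

    items:  iterable of (item_id, data)
    stages: list of (element_name, detect) in pipeline order
    Returns the intermediate result set as a list of {"I", "B", "M"} dicts.
    """
    kept = []
    for item_id, data in items:
        boxes = []
        for _element, detect in stages:
            found = detect(data, boxes)
            if not found:
                break                    # element missing: discard, skip later stages
            boxes.extend(found)
        else:
            # Reached only when no stage broke out, i.e. every element was found.
            kept.append({"I": item_id, "B": boxes, "M": []})
    return kept
```

Because the loop breaks at the first missing element, later and typically slower stages never see data that an earlier stage has already ruled out.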
[0108] Furthermore, step S500, which involves passing the data containing the scene elements in the recognition result as an intermediate result to the next-level recognition model, includes the following steps:
[0109] S510, obtain the coordinate dataset $B = \{(x_1, y_1), (x_2, y_2), l\}$ output by the current-level recognition model, where $(x_1, y_1)$ are the coordinates of the upper left corner of the detection box, $(x_2, y_2)$ are the coordinates of the lower right corner of the detection box, and $l$ is the category label of the object in the detection box.
[0110] After the current-level recognition model (e.g., a human detection model) performs inference on the input data, it outputs a set of detection results. The system then standardizes these results into a unified coordinate dataset $B = \{(x_1, y_1), (x_2, y_2), l\}$, where:
[0111] $(x_1, y_1)$: The coordinates of the top-left corner of the detection box, in pixels, typically satisfying $0 \le x_1 < W$ and $0 \le y_1 < H$, where $W$ and $H$ are the width and height of the image, respectively.
[0112] $(x_2, y_2)$: The coordinates of the bottom-right corner of the detection box, also in pixels.
[0113] $l$: The category label of the object within the detection box, which can be a string or integer identifier, such as "person" or 1.
[0114] In the code implementation, $B$ is usually represented as a list or NumPy array, with each element being a quintuple (x1, y1, x2, y2, label). If the model outputs normalized coordinates (between 0 and 1), the system will restore them to absolute pixel coordinates.
[0115] A unified coordinate format enables the outputs of different models (such as object detection, facial landmarks, and instance segmentation) to be processed uniformly by subsequent modules, avoiding compatibility issues caused by format differences and laying the foundation for the standardization of intermediate data structures.
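The standardization in S510, including the restoration of normalized coordinates to pixels, might be sketched as below. The function name `to_pixel_boxes`, the `normalized` flag, and the clamping and rounding of coordinates to the image bounds are assumptions of this sketch rather than details given in the embodiment.

```python
def to_pixel_boxes(raw_boxes, width, height, normalized=False):
    """Convert detections to (x1, y1, x2, y2, label) quintuples in absolute pixels."""
    def clamp(value, upper):
        return min(max(value, 0), upper)

    boxes = []
    for x1, y1, x2, y2, label in raw_boxes:
        if normalized:
            # Scale [0, 1] coordinates back to the image size.
            x1, x2 = x1 * width, x2 * width
            y1, y2 = y1 * height, y2 * height
        boxes.append((
            int(round(clamp(x1, width))),
            int(round(clamp(y1, height))),
            int(round(clamp(x2, width))),
            int(round(clamp(y2, height))),
            label,
        ))
    return boxes
```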
[0116] S520, based on B, construct the intermediate data structure D={I, B, M}; where I is the original data identifier and M is the mask data set.
[0117] Construct the structured intermediate data D, which contains three core attributes:
[0118] I: A unique identifier for the original data, used to associate the original image / video frame. Data traceability can be achieved using a hash value, file path, or frame number.
[0119] B: The standardized coordinate dataset generated by S510 carries target location and category information.
[0120] M: Mask data set, storing pixel-level segmentation mask matrix, and is assigned an empty list [] when there is no segmentation output.
[0121] I, B, and M are serialized in key-value pair format (JSON / Protocol Buffers) to form a standardized data carrier that can be transmitted across modules. If the current level model does not detect any scene elements (B is an empty set), the data is directly marked as invalid, the pipeline is terminated, and the data is discarded.
[0122] This step constructs a lightweight, traceable, and standardized intermediate data structure, realizing the integrated encapsulation of raw data identifiers, target locations, and pixel-level masks, ensuring the integrity of data transmission between each stage of the pipeline; at the same time, it achieves rapid removal of invalid data through empty set judgment, reducing the transmission and computation of invalid data in the pipeline, and reducing system resource waste from the source.
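A minimal sketch of the intermediate data structure D = {I, B, M} follows, assuming JSON serialization. The class name IntermediateData and the helper names are hypothetical; returning None stands in for "mark as invalid and discard" when B is empty.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class IntermediateData:
    I: str                                        # original data identifier
    B: List[list]                                 # [x1, y1, x2, y2, label] entries
    M: List[list] = field(default_factory=list)   # masks; [] when absent


def build_intermediate(identifier: str, boxes, masks=None) -> Optional[IntermediateData]:
    """Return D, or None when B is empty so the caller can drop the data."""
    if not boxes:
        return None
    return IntermediateData(I=identifier, B=[list(b) for b in boxes], M=masks or [])


def to_json(d: IntermediateData) -> str:
    """Serialize D as key-value pairs for transmission between modules."""
    return json.dumps(asdict(d))
```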
[0123] S530, D is used as the input of the next-level recognition model, and the next-level recognition model only performs inference within the region of interest indicated by B; the region of interest is obtained by taking the union of each coordinate frame after expanding it by a preset ratio.
[0124] When the system passes D to the next-level recognition model (such as a smoke detection model), the entire original image is not directly input; instead, the region of interest (ROI) is calculated from the detection boxes in B, and inference is performed only within the ROI. The steps for calculating the ROI are as follows:
[0125] Expanding the coordinate boxes: for each detection box $b_i = (x_{1,i}, y_{1,i}, x_{2,i}, y_{2,i})$ in B, calculate its width $w_i = x_{2,i} - x_{1,i}$ and height $h_i = y_{2,i} - y_{1,i}$. With the preset expansion ratio $\alpha$ (typically 0.1 to 0.2, which means expanding by 10% to 20%), generate the expanded box $b'_i = (x'_{1,i}, y'_{1,i}, x'_{2,i}, y'_{2,i})$:
[0126] $b'_i = \left(\max(0,\, x_{1,i} - \alpha w_i),\ \max(0,\, y_{1,i} - \alpha h_i),\ \min(W,\, x_{2,i} + \alpha w_i),\ \min(H,\, y_{2,i} + \alpha h_i)\right)$;
[0127] where $W$ and $H$ are the image width and height, respectively. The max and min operations ensure that the expanded box does not extend beyond the image boundaries.
[0128] Union: merge all expanded boxes $b'_i$ into a single overall region of interest. Since the expanded boxes may overlap, the ROI is taken as the smallest rectangle enclosing their union, which yields one continuous rectangular region covering the entire target area:
[0129] $\mathrm{ROI} = \left(\min_i x'_{1,i},\ \min_i y'_{1,i},\ \max_i x'_{2,i},\ \max_i y'_{2,i}\right)$;
[0130] That is, take the minimum of the top-left coordinates of all expanded boxes as the top-left corner of the ROI, and the maximum of the bottom-right coordinates as the bottom-right corner of the ROI.
[0131] Cropping and inference: the sub-image corresponding to the ROI is cropped from the original image (or the full image is used as input, with inference focused only on that region) and passed to the next-level recognition model for inference. The coordinates output by the next-level model are typically in the sub-image coordinate system, so the system maps them back to the original image coordinate system before accumulating them into the intermediate data.
[0132] For example: suppose the current-level model detects a person in an image, with a detection box of (100, 150, 250, 400), and the image size is 640×480. Take the expansion ratio $\alpha = 0.1$; the box has width 150 and height 250, so the margins are 15 and 25, and the expanded box is (100-15, 150-25, 250+15, 400+25) = (85, 125, 265, 425). If this is the only box, the ROI is (85, 125, 265, 425). The next-level smoke detection model searches for smoke only in this region rather than in the entire image, which significantly reduces the computational cost.
[0133] By limiting the inference scope to the region of interest (ROI), the system avoids redundant computation across the entire image. This is especially beneficial when the target in the scene is small, as the ROI significantly reduces computational overhead. The expanded bounding box design compensates for situations where the detection box may not fully encompass the target (e.g., smoke might be located in the hand area, slightly exceeding the person bounding box), preventing missed detections. The union operation ensures that all relevant regions are covered when multiple targets are present, while avoiding redundant processing of multiple independent ROIs, achieving a balance between accuracy and efficiency.
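The ROI computation of S530 can be sketched as follows. The function names are hypothetical; to_original shows the mapping of sub-image coordinates back to the original image, assuming the sub-image was cropped at the ROI's top-left corner.

```python
def expand_box(box, ratio, width, height):
    """Expand a box by `ratio` of its width/height, clamped to the image."""
    x1, y1, x2, y2 = box
    dw, dh = ratio * (x2 - x1), ratio * (y2 - y1)
    return (max(0, x1 - dw), max(0, y1 - dh),
            min(width, x2 + dw), min(height, y2 + dh))


def region_of_interest(boxes, ratio, width, height):
    """Smallest rectangle enclosing all expanded boxes."""
    expanded = [expand_box(b, ratio, width, height) for b in boxes]
    return (min(b[0] for b in expanded), min(b[1] for b in expanded),
            max(b[2] for b in expanded), max(b[3] for b in expanded))


def to_original(box, roi):
    """Map a box from ROI (sub-image) coordinates to original-image coordinates."""
    ox, oy = roi[0], roi[1]
    return (box[0] + ox, box[1] + oy, box[2] + ox, box[3] + oy)
```

For the person box (100, 150, 250, 400) in a 640×480 image with ratio 0.1, region_of_interest yields approximately (85, 125, 265, 425), up to floating-point rounding.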
[0134] S600: When the pipeline configuration includes spatial logic constraints, the spatial logic constraints and the coordinates or mask data output by the recognition models at each level are input into the large language model, which then generates the decision code.
[0135] Furthermore, the step S600, which involves generating the decision code from the large language model, includes the following steps:
[0136] S610, Construct structured prompt words; the structured prompt words include: textual descriptions of the spatial logical constraints, coordinate field names and format definitions output by each level of the recognition model, and at least one set of Few-shot examples; the Few-shot examples demonstrate how to convert constraints described in natural language into Python code containing spatial relationship calculation functions;
[0137] The system dynamically constructs a prompt for invoking the large language model. This prompt uses a structured format to ensure that the large language model can accurately understand the task and generate the required decision code. The structured prompt consists of the following three core components:
[0138] 1. Textual description of spatial logical constraints:
[0139] The system directly references the original spatial logic constraint text generated in step S200, such as ["cigarette is close to mouth", "hand holds cigarette"]. These constraints are embedded into prompts in natural language form, serving as the target logic to be implemented by the code.
[0140] 2. Definition of coordinate field names and formats output by each level of the recognition model:
[0141] The system summarizes all recognition models matched in step S300, specifying the field names and coordinate formats of each model's output in the intermediate data structure D. For example:
[0142] -person_bbox:[x1,y1,x2,y2] # Human body detection box
[0143] -cigarette_bbox:[x1,y1,x2,y2] # Cigarette detection box
[0144] -mouth_landmark:[x,y] # Key points of the mouth
[0145] -hand_bbox:[x1,y1,x2,y2] # Hand detection box
[0146] The system will explicitly state that these fields will be passed as keyword arguments to Python functions, for example, def judge(person_bbox, cigarette_bbox, mouth_landmark, hand_bbox):.
[0147] 3. At least one set of Few-shot examples:
[0148] The system selects 1-2 sets of examples from a pre-set example library that are most relevant to the current spatial logical constraints. Each example includes: a natural language description of the constraint, the names of the available coordinate fields, and the corresponding Python decision code;
[0149] The example library is pre-built, covering common spatial relationship patterns (such as proximity, containment, overlap, orientation, etc.). For example, for the "proximity" constraint, the system might select the following example:
[0150] Constraint description: Determine whether the cigarette is close to the mouth.
[0151] Available fields: cigarette_bbox, mouth_point
[0152] Code:
[0153] def judge(cigarette_bbox, mouth_point):
[0154]     if not cigarette_bbox or not mouth_point:
[0155]         return False
[0156]     cig_cx = (cigarette_bbox[0] + cigarette_bbox[2]) / 2
[0157]     cig_cy = (cigarette_bbox[1] + cigarette_bbox[3]) / 2
[0158]     distance = ((cig_cx - mouth_point[0])**2 + (cig_cy - mouth_point[1])**2)**0.5
[0159]     return distance < 30
[0160] The system concatenates the above three parts into a complete prompt string and adds explicit instructions: "Based on the above constraints and field definitions, please output a Python function judge(**kwargs) that returns a boolean value. The function must use the provided field names and may include calculations such as IoU, distance, and containment relationships."
[0161] Structured prompts break down the task into three parts: constraint description, data definition, and example demonstration. This clarifies the logical goals that the large language model needs to achieve and provides templates for output format and references for programming style. Few-shot examples are particularly crucial, as they guide the model to generate correct and executable code through imitation learning, avoiding formatting errors or logical deviations caused by the model's free interpretation, and significantly improving the usability and consistency of the generated code.
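The assembly of the three prompt components in S610 might look like the sketch below. The function name build_prompt, the section headings, and the dictionary keys of each example are illustrative assumptions; only the closing instruction text is taken from the description.

```python
INSTRUCTION = (
    "Based on the above constraints and field definitions, please output a "
    "Python function judge(**kwargs) that returns a boolean value. The function "
    "must use the provided field names and may include calculations such as "
    "IoU, distance, and containment relationships."
)


def build_prompt(constraints, fields, examples):
    """Concatenate constraints, field definitions, and few-shot examples."""
    lines = ["Spatial logic constraints:"]
    lines += [f"- {c}" for c in constraints]
    lines.append("Available fields (passed as keyword arguments):")
    lines += [f"- {name}: {fmt}" for name, fmt in fields.items()]
    for i, ex in enumerate(examples, 1):
        lines.append(f"Example {i}:")
        lines.append(f"Constraint description: {ex['constraint']}")
        lines.append(f"Available fields: {', '.join(ex['fields'])}")
        lines.append("Code:")
        lines.append(ex["code"])
    lines.append(INSTRUCTION)
    return "\n".join(lines)
```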
[0162] S620, using a large language model to output a decision code based on the structured prompt words; the decision code includes at least one of the following spatial relationship calculation functions:
[0163] Intersection over Union (IoU) calculation: $\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}$; this is the intersection over union of detection boxes A and B;
[0164] Euclidean distance between the center points: $d(A, B) = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$; where $x_A$ and $y_A$ are the x and y coordinates of the center point of detection box A, and $x_B$ and $y_B$ are the x and y coordinates of the center point of detection box B;
[0165] Inclusion relationship determination: detection box A contains detection box B when $x_1^A \le x_1^B$, $y_1^A \le y_1^B$, $x_2^A \ge x_2^B$, and $y_2^A \ge y_2^B$; where $(x_1^A, y_1^A)$ and $(x_2^A, y_2^A)$ are the coordinates of the top-left and bottom-right corners of detection box A, and $(x_1^B, y_1^B)$ and $(x_2^B, y_2^B)$ are the coordinates of the top-left and bottom-right corners of detection box B.
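The three spatial relationship functions can be written compactly as below. This is a minimal sketch assuming boxes are given as (x1, y1, x2, y2) in pixels; the function names are illustrative and not fixed by the description.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def center_distance(a, b):
    """Euclidean distance between the center points of two boxes."""
    ax, ay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    bx, by = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    return math.hypot(ax - bx, ay - by)


def contains(a, b):
    """True when box a fully contains box b."""
    return a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]
```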
[0166] The system sends the structured prompts generated by S610 to the API interface of a large language model (such as GPT-4, DeepSeek-V3, etc.), setting appropriate parameters (such as temperature=0.2 to reduce randomness, max_tokens=1024 to ensure code integrity). The large language model returns Python code in text form.
[0167] The returned code must contain at least one function for calculating spatial relationships. After generating the code, the system will perform a simple syntax check (such as using `compile()` to verify if it is valid Python code) to ensure that the code is executable. A typical example of generated code is as follows:
[0168] def judge(person_bbox, cigarette_bbox, mouth_landmark, hand_bbox):
[0169]     # Check that the necessary inputs exist
[0170]     if not cigarette_bbox or not mouth_landmark:
[0171]         return False
[0172]     # Calculate the center point of the cigarette
[0173]     cig_cx = (cigarette_bbox[0] + cigarette_bbox[2]) / 2
[0174]     cig_cy = (cigarette_bbox[1] + cigarette_bbox[3]) / 2
[0175]     # Calculate the Euclidean distance from the cigarette center to the mouth
[0176]     distance = ((cig_cx - mouth_landmark[0])**2 + (cig_cy - mouth_landmark[1])**2)**0.5
[0177]     # Check whether the hand box contains the cigarette center
[0178]     hand_contains_cig = False
[0179]     if hand_bbox:
[0180]         hand_contains_cig = (hand_bbox[0] <= cig_cx <= hand_bbox[2] and
[0181]                              hand_bbox[1] <= cig_cy <= hand_bbox[3])
[0182]     # Overall judgment: the cigarette is close to the mouth and held in the hand
[0183]     return distance < 30 and hand_contains_cig
[0184] The code includes a center-point Euclidean distance calculation (the distance formula) and an inclusion relationship determination (whether the hand box contains the center point of the cigarette). If the constraints require it, the model can also generate an IoU calculation function.
[0185] The system stores the generated code string in memory, ready to be executed in the subsequent step S700.
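The syntax check mentioned above can be sketched with Python's built-in compile(). The helper name is_valid_python is hypothetical; the check only confirms that the text parses as Python and says nothing about whether the logic is correct or safe.

```python
def is_valid_python(code: str) -> bool:
    """Return True when `code` compiles as a Python module."""
    try:
        compile(code, "<judge_code>", "exec")
        return True
    except (SyntaxError, ValueError):
        # ValueError covers source text containing null bytes on some versions.
        return False
```

Code that fails this check would be rejected before it reaches the sandbox, and the system would fall back to its default logic.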
[0186] By automatically generating decision-making code through a large language model, the system eliminates the need for manually writing complex decision-making logic for each new scenario, achieving automatic conversion from natural language constraints to executable code. The generated code fully utilizes fundamental spatial geometric functions such as IoU, distance, and containment relationships, accurately quantifying the spatial positional relationships and action semantics between objects. This transforms previously ambiguous natural language descriptions (such as "near" or "hold") into computable mathematical conditions, providing a reliable basis for subsequent secondary screening. Furthermore, the code generation process exhibits excellent scalability; any new spatial logic constraint can be generated simply by modifying the prompt words, without altering the system code.
[0187] The above steps guide the large language model to automatically generate Python decision code containing spatial relationship calculation functions such as intersection-union ratio, Euclidean distance of centroids, and inclusion relationship determination by dynamically constructing structured prompts that include spatial logical constraints, coordinate field definitions, and Few-shot examples. This mechanism transforms complex behaviors described in natural language (such as "a person smokes") into quantifiable and executable geometric calculation logic, completely replacing the tedious process of manually writing decision rules. The introduction of Few-shot examples significantly improves the accuracy and format consistency of code generation, allowing the generated code to be executed directly in a sandbox environment. This solution not only achieves automated determination of arbitrarily complex spatial semantics but also possesses high versatility and scalability—simply changing the constraint descriptions in the prompts generates decision code applicable to different new scenarios, providing accurate and reliable logical support for the subsequent selection of high-quality training samples.
[0188] S700, execute the judgment code in the sandbox environment to perform secondary filtering on the intermediate results and retain the data that meets the spatial logic constraints.
[0189] Furthermore, the execution of the determination code in the sandbox environment described in step S700 includes the following steps:
[0190] S710 creates an isolated Python runtime environment and restricts the whitelist of modules that can be imported; the whitelist only includes mathematical operation modules and data structure operation modules.
[0191] Before executing the decision-making code generated by the LLM, the system constructs a controlled Python execution environment to prevent malicious or erroneous code from affecting the main system. The specific implementation method is as follows:
[0192] Use a restricted exec() environment: expose only the necessary built-in functions and modules to the code by customizing the globals and locals dictionaries. For example (assuming the math module has already been imported and local_dict is an empty dictionary):
[0193] allowed_modules = {
[0194]     'math': math,
[0195]     'builtins': {
[0196]         'True': True, 'False': False, 'None': None,
[0197]         'abs': abs, 'max': max, 'min': min, 'round': round,
[0198]         'len': len, 'range': range, 'list': list, 'dict': dict,
[0199]         # Dangerous built-in functions such as open, eval, exec, and __import__ are deliberately omitted
[0200]     }
[0201] }
[0202] exec(judge_code, {'__builtins__': allowed_modules['builtins'], **allowed_modules}, local_dict)
[0203] Module whitelist: Only modules for mathematical operations (such as math) and basic data structure operations (such as namedtuple in collections) are allowed to be imported. All other modules (such as os, sys, subprocess, requests, socket) are prohibited. This is achieved by not providing __import__ in globals or by using a custom import hook.
[0204] Alternatives include running standalone scripts using subprocess and filtering system calls using seccomp (Linux) or AppArmor; or using PyPy sandbox mode. However, the lightest and most cross-platform approach is to restrict the exec environment.
[0205] By using a whitelist mechanism, the modules and functions that the code can call are strictly limited, which blocks direct file reading and writing, network communication, and system command execution, protecting the security and stability of the main system. Because a restricted exec() namespace alone can be bypassed through object introspection, it is combined with the resource limits of S720 and the static analysis of S730, so that the handling of malicious LLM-generated code does not rely on a single mechanism.
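A minimal sketch of the restricted execution environment with an import whitelist is shown below. The names ALLOWED_MODULES, SAFE_BUILTIN_NAMES, and load_judge are hypothetical, and the set of exposed built-ins is an illustrative choice. The sketch uses a custom __import__ hook, one of the two options named above; it is not a complete security boundary on its own.

```python
import builtins
import collections
import math

ALLOWED_MODULES = {"math": math, "collections": collections}
SAFE_BUILTIN_NAMES = ("abs", "max", "min", "round", "len", "range",
                      "list", "dict", "tuple", "bool", "float", "int", "sum")


def _restricted_import(name, globals=None, locals=None, fromlist=(), level=0):
    """Import hook that only resolves whitelisted modules."""
    if name not in ALLOWED_MODULES:
        raise ImportError(f"import of '{name}' is not allowed")
    return ALLOWED_MODULES[name]


def load_judge(judge_code: str):
    """Execute the generated code in a restricted namespace and return judge()."""
    safe_builtins = {n: getattr(builtins, n) for n in SAFE_BUILTIN_NAMES}
    safe_builtins["__import__"] = _restricted_import
    scope = {"__builtins__": safe_builtins, "math": math}
    exec(judge_code, scope)
    return scope["judge"]
```

Generated code that runs `import os` fails with ImportError when loaded, and a call to open() inside judge raises NameError because that built-in is not exposed.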
[0206] S720, set an execution timeout threshold T_max and a memory limit M_max.
[0207] To prevent resource exhaustion caused by infinite loops, excessive recursion, or memory leaks in the decision-making code, the system sets resource limits:
[0208] Timeout limit: implemented using signal.alarm (Unix), threading.Timer, or the func_timeout library. If code execution exceeds T_max, a timeout exception is raised; the system catches it, treats the execution as failed, and reverts to the default logic.
[0209] Memory limit: the process memory limit can be set using the resource module (Unix):
[0210] import resource
[0211] resource.setrlimit(resource.RLIMIT_AS, (M_max, M_max))
[0212] M_max, expressed in bytes, is typically set to 512 MB or 1 GB. For cross-platform implementations, the code can instead run in a separate child process, subject to restrictions imposed by the operating system.
[0213] Timeout and memory limits prevent abnormal code from consuming too many resources, causing system crashes or denial of service, thus ensuring the overall stability and predictability of data labeling tasks, which is especially important in large-scale batch processing.
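One way to combine both limits is to run the decision code in a child process, as sketched below. This is an illustrative variant, not the only implementation described above: the timeout comes from subprocess.run, the memory limit from resource.setrlimit inside the child (Unix only), and the names run_with_limits, t_max, m_max, and default are hypothetical. The module whitelist is omitted here for brevity.

```python
import json
import subprocess
import sys

# Script run in the child: apply the memory limit, then call judge(**fields).
_RUNNER = r'''
import json, sys
try:
    import resource  # Unix only
    limit = int(sys.argv[1])
    resource.setrlimit(resource.RLIMIT_AS, (limit, limit))
except (ImportError, ValueError, OSError):
    pass  # no RLIMIT_AS support here: rely on the timeout only
payload = json.loads(sys.stdin.read())
scope = {}
exec(payload["code"], scope)
print(json.dumps(bool(scope["judge"](**payload["fields"]))))
'''


def run_with_limits(code, fields, t_max=2.0, m_max=1024 ** 3, default=False):
    """Run judge() in a child process; return `default` on any failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", _RUNNER, str(m_max)],
            input=json.dumps({"code": code, "fields": fields}),
            capture_output=True, text=True, timeout=t_max,
        )
    except subprocess.TimeoutExpired:
        return default  # exceeded T_max: revert to the default logic
    if proc.returncode != 0:
        return default  # crashed, e.g. MemoryError under the limit
    try:
        return json.loads(proc.stdout)
    except ValueError:
        return default
```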
[0214] S730, perform static analysis on the abstract syntax tree of the decision code before execution. If a file operation, network request, or dynamic code generation node is detected, execution is refused and the system reverts to the preset default decision logic. After execution, the decision result r is mapped to the intermediate data, and the data for which r = 1 is retained; the intermediate data is the data structure that has been processed by the preceding recognition models during pipeline execution but has not yet undergone the final spatial logic determination.
[0215] Static analysis (AST scan):
[0216] After S620 generates the decision code and before it is executed in the environment created in S710, the system uses Python's ast module to parse the code string into an abstract syntax tree (AST) and then traverses all nodes to check for the following dangerous patterns:
[0217] File operations: calls to functions such as open, read, write, os.remove, or shutil functions, or attribute access such as os.path.
[0218] Network requests: calls to functions such as requests.get, urllib.request.urlopen, and socket-related methods.
[0219] Dynamic code generation: calls to eval, exec, compile, or __import__, including code passed to exec as a string literal.
[0220] If any dangerous node is detected, the system logs the information, refuses to execute the code, and invokes the preset default judgment logic. The default judgment logic can be "determine true as long as all scene elements exist", or set simple conditions according to business rules.
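The AST scan and the fallback can be sketched as follows. The forbidden-name lists are illustrative and not exhaustive, the rejection of double-underscore attribute access is an added assumption, and default_judge implements the "true as long as all scene elements exist" option mentioned above.

```python
import ast

FORBIDDEN_NAMES = {"open", "eval", "exec", "compile", "__import__"}
FORBIDDEN_MODULES = {"os", "sys", "shutil", "subprocess", "socket",
                     "requests", "urllib"}


def is_safe(code: str) -> bool:
    """Return False when the AST contains a dangerous node or fails to parse."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] in FORBIDDEN_MODULES for a in node.names):
                return False
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in FORBIDDEN_MODULES:
                return False
        elif isinstance(node, ast.Name):
            if node.id in FORBIDDEN_NAMES or node.id in FORBIDDEN_MODULES:
                return False
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            return False  # assumption: dunder access is treated as suspicious
    return True


def default_judge(**fields) -> bool:
    """Fallback: true as long as every scene element is present."""
    return all(bool(v) for v in fields.values())
```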
[0221] Execution and result mapping: if the static analysis passes and execution neither times out nor exceeds the memory limit, the system executes the decision function in the sandbox, passing in the coordinate / mask dictionary corresponding to the current intermediate data, and obtains the return value r (a boolean value, mapped to an integer 0 or 1). For each intermediate data item (i.e., a data unit that has been processed by the preceding recognition models but has not yet undergone final spatial logic determination), with the format D = {I, B, M}, the system records its judgment result r. After execution, the system iterates through all intermediate data, retains only the items with r = 1, and discards the items with r = 0. The retained data is then passed to step S800 for annotation generation.
[0222] Intermediate data refers to the data structure that has been processed by the preceding recognition models during pipeline execution but has not yet undergone final spatial logic determination. It contains the original data identifier I, the coordinate dataset B, and the mask dataset M, and is the cumulative result of the outputs from each level of the model. In S730, this intermediate data serves as the input to the decision code, with each data item corresponding to one r value.
[0223] Static analysis, acting as a second line of defense, intercepts dangerous code before execution in the sandbox, thus avoiding potential risks during actual runtime. By mapping the decision result r to the intermediate data, the system can accurately select the data that meets complex spatial logic constraints, thus ensuring high-quality training samples. At the same time, the rollback mechanism ensures that even if the code generated by the LLM cannot be executed safely, the pipeline is not interrupted and still produces basic training samples, enhancing the system's robustness.
[0224] The above steps, through multiple security mechanisms such as constructing a restricted Python execution environment, setting timeout and memory limits, and performing AST static analysis, provide security isolation for the untrusted judgment code generated by the LLM, guarding the main system against dangerous behaviors such as file operations, network requests, and dynamic code execution. The timeout and memory limits also keep system resources under control, avoiding resource exhaustion caused by abnormal code. After execution, the judgment result r is mapped to the intermediate data, achieving a precise secondary screening of complex spatial logic that retains only high-quality samples meeting the constraints. This scheme, while ensuring system security and stability, achieves automated judgment and screening of complex semantics, providing a reliable data source for the generation of subsequent training samples.
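The result mapping can be sketched as a simple filter. The item layout (a dictionary with an "I" identifier and a "fields" dictionary of coordinates) and the function name secondary_filter are assumptions for illustration; falling back to the default logic on any exception is also an assumption that generalizes the timeout case described above.

```python
def secondary_filter(items, judge, default_judge):
    """Keep only intermediate data items whose decision result r equals 1."""
    retained = []
    for item in items:
        try:
            r = int(bool(judge(**item["fields"])))
        except Exception:
            # Assumption: any runtime failure reverts to the default logic.
            r = int(bool(default_judge(**item["fields"])))
        if r == 1:
            retained.append(item)
    return retained
```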
[0225] S800 generates labeled files from the coordinates or mask data output by the recognition models at each level, according to the preset training sample format, and uses the data that meets the spatial logic constraints as training samples for the new model.
[0226] Furthermore, the step S800 of generating the annotation file according to the preset training sample format includes the following steps:
[0227] S810, when the new model is an object detection model, a composite bounding box is generated based on the coordinate data output by the recognition models at each level; the coordinates $B_{new}$ of the composite bounding box are determined by taking the minimum bounding rectangle of the coordinate boxes output by all participating recognition models.
[0228] When a user specifies the new model as an object detection model (such as YOLO, Faster R-CNN, etc.), the system needs to generate one or more object detection boxes for each selected image. The coordinates $B_{new}$ of the composite bounding box are determined by taking the minimum bounding rectangle of the coordinate boxes output by all participating recognition models.
[0229] The calculation steps are as follows:
[0230] 1. Collect all detection boxes involved in the judgment: from the intermediate data D, the system extracts the bounding boxes output by all recognition models involved in the spatial logic determination. For example, in the "person smoking" scenario, the outputs involved might include person_bbox and cigarette_bbox (mouth_landmark is a point and is not included in the rectangle calculation). The system collects the coordinates of all these bounding boxes.
[0231] 2. Calculate the minimum bounding rectangle: let the set of detection boxes participating in the decision be $\{b_1, \dots, b_K\}$, with each box $b_k = (x_{1,k}, y_{1,k}, x_{2,k}, y_{2,k})$. Then the coordinates of the composite annotation box $B_{new} = (x_1^{new}, y_1^{new}, x_2^{new}, y_2^{new})$ are:
[0232] $B_{new} = \left(\min_k x_{1,k},\ \min_k y_{1,k},\ \max_k x_{2,k},\ \max_k y_{2,k}\right)$;
[0233] That is, take the minimum of the top-left coordinates of all boxes as the top-left corner of the new box, and the maximum of the bottom-right coordinates of all boxes as the bottom-right corner of the new box.
[0234] 3. Boundary trimming: ensure that $B_{new}$ does not exceed the image boundary, i.e., $x_1^{new} \ge 0$, $y_1^{new} \ge 0$, $x_2^{new} \le W$, $y_2^{new} \le H$, where $W$ and $H$ are the image width and height.
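[0235] For example: suppose an image contains the following bounding boxes: human body: (100, 150, 250, 400); cigarette: (180, 320, 210, 360). Then the composite bounding box is:
[0236] $x_1^{new} = \min(100, 180) = 100$;
[0237] $y_1^{new} = \min(150, 320) = 150$;
[0238] $x_2^{new} = \max(250, 210) = 250$;
[0239] $y_2^{new} = \max(400, 360) = 400$;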
[0240] The final composite frame is (100, 150, 250, 400), which is the human body frame itself (because the cigarette frame is completely contained within it).
[0241] By taking the minimum bounding rectangle, the composite bounding box completely covers all target regions involved in the judgment, enabling the new model to learn the contextual information of all relevant objects in the scene simultaneously during training, rather than focusing on only a single object. This approach preserves the integrity of the overall scene while avoiding the redundancy of labeling each individual object separately, making it particularly suitable for complex behavior recognition tasks that require the simultaneous localization of multiple interactive objects.
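The composite box calculation, including boundary trimming, can be sketched in a few lines. The function name composite_box is hypothetical; boxes are assumed to be (x1, y1, x2, y2) in pixels.

```python
def composite_box(boxes, width, height):
    """Minimum bounding rectangle of all boxes, trimmed to the image."""
    x1 = max(0, min(b[0] for b in boxes))
    y1 = max(0, min(b[1] for b in boxes))
    x2 = min(width, max(b[2] for b in boxes))
    y2 = min(height, max(b[3] for b in boxes))
    return (x1, y1, x2, y2)
```

With the human body box (100, 150, 250, 400) and the cigarette box (180, 320, 210, 360) in a 640×480 image, the result is (100, 150, 250, 400).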
[0242] S820, assign a new category label to the composite label box.
[0243] The system generates a new category label based on the user-input text describing the new model's functionality. This label is typically a simplification and standardization of the functionality description. For example:
[0244] User inputs "a person is smoking" → generates the category label "smoking";
[0245] User inputs "Two people are fighting" → generates category label "fighting";
[0246] User inputs "car rear-end collision" → generates category label "rear_end";
[0247] The system associates the composite annotation box $B_{new}$ with the new category label to form a label item. In terms of output format:
[0248] YOLO format: each label item is class_id x_center y_center width height, where class_id is the integer ID corresponding to the new category and the four box values are normalized to the range 0 to 1 by the image width and height.
[0249] COCO JSON format: Add a record to the annotations array, containing fields such as image_id, category_id (new category ID), and bbox ([x1, y1, width, height]).
[0250] If there are multiple composite boxes in the same image (such as multiple instances of "people smoking"), the system will generate an independent label for each instance and assign the same new category label to each.
[0251] The automatic generation of new category labels ensures that the annotation results are directly aligned with the functional objectives of the new model, eliminating the need for manual renaming or remapping later. This end-to-end automatic annotation method ensures that the semantic meaning of the labels on the training samples is completely consistent with the expected output of the new model, providing accurate supervision signals for subsequent model training.
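Converting a composite box into the two output formats might look like the sketch below. The function names and the six-decimal formatting are illustrative choices; the area and iscrowd fields are standard COCO annotation fields added here beyond those listed above.

```python
def to_yolo_line(box, class_id, width, height):
    """Format one label as 'class_id x_center y_center width height' (normalized)."""
    x1, y1, x2, y2 = box
    xc = (x1 + x2) / 2 / width
    yc = (y1 + y2) / 2 / height
    w = (x2 - x1) / width
    h = (y2 - y1) / height
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


def to_coco_annotation(box, image_id, category_id, annotation_id):
    """Build one COCO annotation record with bbox as [x1, y1, width, height]."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return {
        "id": annotation_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x1, y1, w, h],
        "area": w * h,
        "iscrowd": 0,
    }
```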
[0252] S830, when the new model is an image classification model, directly save the original data that meets the spatial logical constraints, and associate the corresponding directory path with the category identifier corresponding to the spatial logical constraints to form a classification folder structure.
[0253] When a user specifies a new model as an image classification model (such as ResNet, ViT, etc.), the system does not need to generate bounding boxes; instead, it directly organizes the selected images into folders according to their categories. Specific steps:
[0254] 1. Determine Category Identifier: Generate a category name (category identifier) based on the user's function description text. For example, if the function description is "a person is smoking," the category identifier is "smoking." If the system processes multiple categories simultaneously (such as "person smoking" and "person hitting dog"), a separate identifier is generated for each category.
[0255] 2. Create category directories: under the specified output root directory, create subdirectories named after the category identifier, for example, ./training_data/smoking/.
[0256] 3. Copy or move images: for each intermediate data item retained by S700 (i.e., each image that meets the spatial logic constraints), the system uses its original data identifier I to locate the corresponding image file (e.g., by file path) and copies (or moves) it to the corresponding category directory. The image filename is kept unchanged or renamed to a unique identifier.
[0257] The classification folder structure is the standard input format for training image classification models. It can be directly read by mainstream data loaders such as PyTorch's ImageFolder and TensorFlow's image_dataset_from_directory, without the need for additional annotation file parsing. This method greatly simplifies the data preparation process for training new models, enabling direct conversion from raw data to a classification training set.
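The folder organization in S830 can be sketched with the standard library. The function name export_classification is hypothetical, and the sketch assumes that each original data identifier I is a file path; it copies rather than moves and keeps the original filename.

```python
import shutil
from pathlib import Path


def export_classification(image_paths, category, output_root):
    """Copy the selected images into <output_root>/<category>/."""
    target = Path(output_root) / category
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in map(Path, image_paths):
        dst = target / src.name
        shutil.copy2(src, dst)  # copy the file together with its metadata
        copied.append(dst)
    return copied
```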
[0258] The above steps automatically generate a suitable training sample format based on the type of the new model: for object detection models, composite bounding boxes are generated by calculating the minimum bounding rectangle of all participating coordinate boxes, and automatically generated new category labels are assigned, ensuring that the annotation results fully cover interactive objects in complex scenes; for image classification models, the filtered images are directly organized into folder structures according to categories, seamlessly connecting to mainstream training frameworks. This mechanism achieves automated format conversion from complex semantic filtering to final training samples, completely eliminating the need for manual annotation and data organization, making the training sample preparation process for the new model fully automated and standardized, and significantly improving the overall efficiency from data labeling to model training.
[0259] Furthermore, after step S800, the method further includes the following steps:
[0260] S900, perform quality verification on the generated annotation file; the verification includes calculating the diversity index of the annotation samples: $\mathrm{Div} = 1 - \frac{1}{N(N-1)} \sum_{p \ne q} \cos(f_p, f_q)$; 1≤p≤N, 1≤q≤N; where $f_p$ is the feature vector of the p-th sample in the pre-trained visual model; $\cos(\cdot, \cdot)$ is the cosine similarity; N is the total number of currently generated labeled samples.
[0261] After generating the annotation file in step S800, the system performs a quality assessment on the constructed training sample set, with diversity being the core verification dimension. The specific implementation is as follows:
[0262] Feature extraction: for each labeled sample (image), the system uses a pre-trained visual model (such as ResNet-50 pre-trained on ImageNet, ViT, or a self-supervised model such as DINOv2) to extract feature vectors. This model acts as the feature extractor, and its parameters remain fixed during quality verification. After the p-th sample undergoes forward propagation, the output of the second-to-last layer (global pooling layer) is taken as its feature vector $f_p \in \mathbb{R}^d$, where $d$ is, for example, 2048.
[0263] Calculate the cosine similarity matrix: for all $N$ samples, calculate the cosine similarity between any two distinct samples $p$ and $q$:
[0264] $\cos(f_p, f_q) = \frac{f_p \cdot f_q}{\|f_p\| \, \|f_q\|}$;
[0265] The range of cosine similarity is $[-1, 1]$. The larger the value, the more semantically similar the two samples are.
[0266] Diversity index: the system averages the cosine similarity over all distinct sample pairs ($p \ne q$) and subtracts this average from 1 to obtain the diversity index Div. The denominator $N(N-1)$ is the number of all ordered sample pairs, excluding the self-similarity case $p = q$.
[0267] When the samples in a sample set are highly similar, the average similarity is close to 1 and Div is close to 0. When the sample set is highly diverse, the average similarity approaches 0 and Div is close to 1.
[0268] By calculating the diversity index, the system can quantitatively assess the richness of the training sample set, avoiding overfitting or insufficient generalization ability of the new model due to high sample repetition. This index is based on the semantic features of the pre-trained visual model and can effectively capture the differences in visual content of the samples, rather than relying solely on filenames or simple statistical features, providing an objective and quantifiable basis for subsequent active learning supplementation.
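The diversity index can be computed directly from the feature vectors, as in the sketch below. It uses plain Python lists for self-containment; in practice the features would come from the pre-trained visual model, and returning 0.0 for fewer than two samples is an assumption, since Div is undefined in that case.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def diversity_index(features):
    """Div = 1 - mean cosine similarity over all ordered pairs with p != q."""
    n = len(features)
    if n < 2:
        return 0.0  # assumption: treat a single sample as having no diversity
    total = sum(cosine(features[p], features[q])
                for p in range(n) for q in range(n) if p != q)
    return 1.0 - total / (n * (n - 1))
```

Identical feature vectors give Div = 0, and mutually orthogonal ones give Div = 1.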
[0269] S910, when Div is less than a preset threshold, the sample with the lowest density in the feature space is added to the input of the annotation pipeline, and steps S500 to S800 are re-executed to expand sample diversity.
[0270] When the calculated diversity index Div is below the preset threshold (e.g., 0.6, adjustable according to the application scenario), the system determines that the diversity of the current sample set is insufficient and triggers an active learning supplementation mechanism. The specific steps are as follows:
[0271] 1. Calculate the feature space density:
[0272] For each candidate sample $x_c$ in the candidate data pool (the raw data that has not yet been labeled), the system extracts its feature vector $f_c$ and then calculates the density of the sample in the labeled-sample feature space. A common method is to calculate the average distance (or the reciprocal of the similarity) between $f_c$ and the feature vectors $f_p$ of all $N$ labeled samples:
[0273] $\bar{d}(x_c) = \dfrac{1}{N} \sum_{p=1}^{N} \lVert f_c - f_p \rVert$;
[0274] Alternatively, a local density can be used: calculate the average distance from $f_c$ to its $k$ nearest labeled neighbors. The lower the density (i.e., the larger the average distance to the labeled samples), the more new information the sample carries, indicating that it lies in a "sparse" region of the feature space.
[0275] 2. Select the sample with the lowest density:
[0276] The system sorts all candidate samples by density value from smallest to largest and selects the $K$ samples with the lowest density as supplementary samples.
[0277] 3. Add to the pipeline input:
[0278] The system adds these supplementary samples to the original data queue as new input and re-executes steps S500 to S800:
[0279] S500: The cascaded identification model performs a coarse screening of supplementary samples.
[0280] S600: LLM generates decision codes.
[0281] S700: The sandbox performs a secondary screening.
[0282] S800: Generate a new annotation file.
[0283] The newly generated labeled samples are merged with the original samples to form an updated training sample set.
[0284] 4. Iterative verification:
[0285] The system can repeatedly execute S900 to S910 until Div is no longer below the preset threshold or the preset maximum number of iterations is reached, ensuring that the final sample set has sufficient diversity.
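The density calculation and lowest-density selection in steps 1 and 2 above might look as follows. This is a sketch under stated assumptions: Euclidean distance is used as the distance measure, density is treated as inversely related to the average distance, and the names `mean_distance_to_labeled`, `knn_mean_distance`, and `select_lowest_density` are hypothetical helpers introduced only for illustration.

```python
import numpy as np


def _pairwise_distances(candidates: np.ndarray, labeled: np.ndarray) -> np.ndarray:
    """Euclidean distances, shape (num_candidates, num_labeled)."""
    diff = candidates[:, None, :] - labeled[None, :, :]
    return np.linalg.norm(diff, axis=2)


def mean_distance_to_labeled(candidates: np.ndarray, labeled: np.ndarray) -> np.ndarray:
    """Average distance from each candidate to all labeled samples."""
    return _pairwise_distances(candidates, labeled).mean(axis=1)


def knn_mean_distance(candidates: np.ndarray, labeled: np.ndarray, k: int) -> np.ndarray:
    """Local variant: average distance to the k nearest labeled neighbors."""
    dist = _pairwise_distances(candidates, labeled)
    k = min(k, labeled.shape[0])
    nearest = np.sort(dist, axis=1)[:, :k]
    return nearest.mean(axis=1)


def select_lowest_density(candidates, labeled, num_select, k=None):
    """Return indices of the candidates lying in the sparsest regions.

    Lowest density corresponds to the largest average distance,
    so the candidates are ranked by distance in descending order.
    """
    if k is None:
        avg = mean_distance_to_labeled(candidates, labeled)
    else:
        avg = knn_mean_distance(candidates, labeled, k)
    order = np.argsort(-avg, kind="stable")
    return order[:num_select]
```

Ranking by average distance avoids computing an explicit reciprocal, which would be undefined for a candidate that coincides with every labeled sample.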
[0286] When diversity is insufficient, the system employs an active learning mechanism to find candidate samples in the feature space that differ most significantly from the already labeled samples for supplementary labeling. These samples are often located at decision boundaries or in sparsely distributed regions, providing the model with new information. This mechanism effectively avoids the sample homogenization problem caused by pipeline selection biases (e.g., only selecting samples from specific angles or lighting conditions). Iterative supplementation ensures the diversity and coverage of the training sample set, thereby significantly improving the generalization ability and robustness of the new model. The entire supplementation process requires no manual intervention, achieving automatic optimization of the training sample set.
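The iterative supplementation loop (S900 to S910) can be illustrated with the following sketch. It assumes that samples are represented by their feature vectors, that `run_pipeline` is a hypothetical stand-in for steps S500 to S800 returning a boolean mask of the candidates that passed screening, and that the default values for `threshold`, `max_iters`, and `batch_size` are illustrative only (0.6 is the example threshold given in the text).

```python
import numpy as np
from typing import Callable, Tuple


def diversity_index(features: np.ndarray) -> float:
    """Div = 1 - mean cosine similarity over ordered pairs p != q."""
    n = features.shape[0]
    unit = features / np.clip(np.linalg.norm(features, axis=1, keepdims=True), 1e-12, None)
    sim = unit @ unit.T
    return float(1.0 - (sim.sum() - np.trace(sim)) / (n * (n - 1)))


def supplement_until_diverse(
    labeled: np.ndarray,
    pool: np.ndarray,
    run_pipeline: Callable[[np.ndarray], np.ndarray],
    threshold: float = 0.6,
    max_iters: int = 5,
    batch_size: int = 2,
) -> Tuple[np.ndarray, float]:
    """Repeat S900 and S910 until Div reaches the threshold or iterations run out."""
    labeled = np.asarray(labeled, dtype=float)
    pool = np.asarray(pool, dtype=float)
    for _ in range(max_iters):
        # S900: quality verification via the diversity index.
        if diversity_index(labeled) >= threshold or len(pool) == 0:
            break
        # S910: pick the candidates farthest on average from the labeled set.
        dist = np.linalg.norm(pool[:, None, :] - labeled[None, :, :], axis=2).mean(axis=1)
        picked = np.argsort(-dist)[:batch_size]
        # Hypothetical stand-in for S500 to S800: mask of candidates that pass screening.
        accepted = np.asarray(run_pipeline(pool[picked]), dtype=bool)
        labeled = np.vstack([labeled, pool[picked][accepted]])
        # Remove every picked candidate from the pool, whether accepted or not.
        pool = np.delete(pool, picked, axis=0)
    return labeled, diversity_index(labeled)
```

Candidates rejected by the pipeline are still removed from the pool in this sketch, so the same sparse but unsuitable samples are not selected again on the next iteration.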
[0287] Furthermore, although the steps of the method in this disclosure are described in a specific order in the accompanying drawings, this does not require or imply that the steps must be performed in that specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and / or a step may be broken down into multiple steps.
[0288] Embodiments of the present invention also provide a non-transitory computer-readable storage medium that can be disposed in an electronic device to store at least one instruction or at least one program related to implementing a method in the method embodiments, wherein the at least one instruction or the at least one program is loaded and executed by the processor to implement the method provided in the above embodiments.
[0289] The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0290] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A readable signal medium may also be any readable medium other than a readable storage medium, capable of sending, propagating, or transmitting programs for use by or in conjunction with an instruction execution system, apparatus, or device.
[0291] The program code contained on the readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical fiber, RF, etc., or any suitable combination thereof.
[0292] Program code for performing the operations of this application can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as C or similar languages. The program code can execute entirely on the user's computing device, partially on the user's device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0293] Embodiments of the present invention also provide an electronic device, including a processor and the aforementioned non-transitory computer-readable storage medium.
[0294] The electronic device is merely an example and should not impose any limitations on the functionality and scope of use of the embodiments in this application.
[0295] Electronic devices are manifested in the form of general-purpose computing devices. Components of an electronic device may include, but are not limited to: at least one processor, at least one memory, and a bus connecting different system components (including memory and processor).
[0296] The memory stores program code that can be executed by the processor, causing the processor to perform the steps in the various embodiments described in this specification.
[0297] The memory may include readable media in the form of volatile memory, such as random access memory (RAM) and / or cache memory, and may further include read-only memory (ROM).
[0298] The memory may also include programs / utilities having a set (at least one) of program modules, including but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of these examples may include an implementation of a network environment.
[0299] A bus can represent one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus that uses any of the various bus structures.
[0300] Electronic devices can also communicate with one or more external devices (e.g., keyboards, pointing devices, Bluetooth devices, etc.), one or more devices that enable user interaction with the electronic device, and / or any device that enables the electronic device to communicate with one or more other computing devices (e.g., routers, modems, etc.). This communication can be achieved through input / output (I / O) interfaces. Furthermore, electronic devices can communicate with one or more networks (e.g., local area networks (LANs), wide area networks (WANs), and / or public networks, such as the Internet) via network adapters. The network adapter communicates with other modules of the electronic device via a bus. It should be understood that other hardware and / or software modules can be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
[0301] From the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein can be implemented by software or by combining software with necessary hardware. Therefore, the technical solutions according to the embodiments of this disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, external hard drive, etc.) or on a network, including several instructions to cause a computing device (such as a personal computer, server, terminal device, or network device, etc.) to execute the methods according to the embodiments of this disclosure.
[0302] Embodiments of the present invention also provide a computer program product including program code, which, when the program product is run on an electronic device, causes the electronic device to perform the steps of the methods described above in various exemplary embodiments of the present invention.
[0303] While specific embodiments of the invention have been described in detail by way of examples, those skilled in the art should understand that the examples are for illustrative purposes only and are not intended to limit the scope of the invention. Those skilled in the art should also understand that various modifications can be made to the embodiments without departing from the scope and spirit of the invention.
Claims
1. A data labeling method, characterized in that the method includes the following steps: S100, obtain the functional description text of the new model; S200, input the functional description text into the large language model, which parses it and generates at least one scene element and spatial logic constraints; the scene element represents the category of an object to be identified, and the spatial logic constraints represent the positional relationship or action semantics between the objects to be identified; S300, match the corresponding recognition models in the preset model pool according to the scene elements; the model pool stores a number of recognition models together with the function identifier and output format identifier of each recognition model; S400, determine the processing order of the recognition models according to the spatial logic constraints, and generate a pipeline configuration; S500, call the corresponding recognition models sequentially to process the raw data according to the processing order in the pipeline configuration; wherein the current-level recognition model performs inference on the input data, outputs a recognition result containing coordinates or a mask, and passes the data whose recognition result contains the scene elements to the next-level recognition model as intermediate results, while the data that does not contain the scene elements is discarded; S600, when the pipeline configuration includes spatial logic constraints, input the spatial logic constraints and the coordinate or mask data output by the recognition models at each level into the large language model, which generates the decision code; S700, execute the decision code in a sandbox environment, perform secondary screening on the intermediate results, and retain the data that meets the spatial logic constraints; S800, generate annotation files from the coordinate or mask data output by the recognition models at each level according to the preset training sample format, and use the data that meets the spatial logic constraints as training samples for the new model.
2. The data labeling method according to claim 1, characterized in that step S300 includes the following steps: S310, convert the scene elements into a semantic vector $V_e$; S320, convert the function identifiers of the recognition models in the model pool into function vectors to obtain a function vector list $A = (A_1, A_2, \ldots, A_i, \ldots, A_n)$, $i = 1, 2, \ldots, n$; $A_i$ is the function vector corresponding to the $i$-th recognition model, and $n$ is the number of recognition models; S330, obtain the similarity between $V_e$ and each function vector in $A$ to obtain a similarity list $\eta = (\eta_1, \eta_2, \ldots, \eta_i, \ldots, \eta_n)$; $\eta_i$ is the similarity between $V_e$ and $A_i$; S340, when MAX($\eta$) > $\tau$, determine the recognition model corresponding to MAX($\eta$) as the candidate model; MAX() is the preset maximum value function, and $\tau$ is the preset similarity threshold; S350, if the candidate models do not fully cover the scene elements, trigger a supplementary matching mechanism; the supplementary matching mechanism includes inputting the uncovered scene elements as prompt words into the large language model, and the large language model recommending models with alternative recognition capabilities from the model pool, or generating zero-shot recognition instructions based on the CLIP model.
3. The data labeling method according to claim 1, characterized in that step S400 includes the following steps: S410, construct a directed acyclic graph with the recognition models as nodes and the dependencies implied by the spatial logic constraints as edges; S420, for any two recognition models $M_a$ and $M_b$, if the spatial logic constraints require that the output of $M_a$ undergo a spatial operation before $M_b$ can make its determination, generate a directed edge from $M_a$ to $M_b$; the processing order is determined by topological sorting of the directed acyclic graph; when there are multiple feasible orderings, a weighted sort is performed according to the reciprocal of the processing speed of each recognition model.
4. The data labeling method according to claim 1, characterized in that, in step S500, passing the data containing the scene elements in the recognition result to the next-level recognition model as an intermediate result includes the following steps: S510, obtain the coordinate data set B = {(x1, y1), (x2, y2), l} output by the current-level recognition model; where (x1, y1) are the coordinates of the upper left corner of the detection box, (x2, y2) are the coordinates of the lower right corner of the detection box, and l is the category label of the object in the detection box; S520, based on B, construct the intermediate data structure D = {I, B, M}; where I is the original data identifier and M is the mask data set; S530, use D as the input of the next-level recognition model, which performs inference only within the region of interest indicated by B; the region of interest is obtained by expanding each coordinate box by a preset ratio and taking the union of the expanded boxes.
5. The data labeling method according to claim 1, characterized in that, in step S600, generating the decision code by the large language model includes the following steps: S610, construct structured prompt words; the structured prompt words include: a textual description of the spatial logic constraints, the coordinate field names and format definitions output by the recognition models at each level, and at least one few-shot example; the few-shot example demonstrates how to convert a constraint described in natural language into Python code containing spatial relationship calculation functions; S620, use the large language model to output the decision code based on the structured prompt words; the decision code includes at least one of the following spatial relationship calculation functions: Intersection over Union (IoU) calculation: $\mathrm{IoU}(A, B) = \dfrac{|A \cap B|}{|A \cup B|}$, the IoU of detection boxes $A$ and $B$; Euclidean distance between the center points: $d(A, B) = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2}$, where $(x_A, y_A)$ are the x and y coordinates of the center point of detection box $A$ and $(x_B, y_B)$ are the x and y coordinates of the center point of detection box $B$; inclusion relationship determination: $(x_1^A \le x_1^B) \wedge (y_1^A \le y_1^B) \wedge (x_2^A \ge x_2^B) \wedge (y_2^A \ge y_2^B)$, where $(x_1^A, y_1^A)$ are the coordinates of the top left corner of detection box $A$, $(x_2^A, y_2^A)$ are the coordinates of the bottom right corner of detection box $A$, $(x_1^B, y_1^B)$ are the coordinates of the top left corner of detection box $B$, and $(x_2^B, y_2^B)$ are the coordinates of the bottom right corner of detection box $B$.
6. The data labeling method according to claim 1, characterized in that, in step S700, executing the decision code in the sandbox environment includes the following steps: S710, create an isolated Python runtime environment and restrict the whitelist of modules that can be imported; the whitelist includes only mathematical operation modules and data structure operation modules; S720, set an execution timeout threshold and a memory limit; S730, perform static analysis on the abstract syntax tree of the decision code before execution; if a file operation, network request, or dynamic code generation node is detected, refuse execution and revert to the preset default decision logic; after execution, map the decision result to the intermediate data and retain the data corresponding to a positive decision result; the intermediate data is the data structure that has been processed by the preceding recognition models during pipeline execution but has not yet undergone the final spatial logic determination.
7. The data labeling method according to claim 1, characterized in that, in step S800, generating an annotation file according to a preset training sample format includes the following steps: S810, when the new model is an object detection model, generate a composite bounding box based on the coordinate data output by the recognition models at each level; the coordinates $B_{new}$ of the composite bounding box are determined by taking the minimum bounding rectangle of the bounding boxes output by all participating recognition models; S820, assign a new category label to the composite bounding box; S830, when the new model is an image classification model, directly save the original data that meets the spatial logic constraints, and associate the corresponding directory path with the category identifier corresponding to the spatial logic constraints to form a classification folder structure.
8. The data labeling method according to claim 1, characterized in that, following step S800, the method further includes the following steps: S900, perform quality verification on the generated annotation file; the verification includes calculating the diversity index of the labeled samples: $\mathrm{Div} = 1 - \dfrac{1}{N(N-1)} \sum_{p \neq q} \cos(f_p, f_q)$; 1≤p≤N, 1≤q≤N; where $f_p$ is the feature vector of the $p$-th sample in the pre-trained visual model, $\cos(f_p, f_q)$ is the cosine similarity between $f_p$ and $f_q$, and N is the total number of currently generated labeled samples; S910, when Div is less than a preset threshold, the sample with the lowest density in the feature space is added to the input of the annotation pipeline, and steps S500 to S800 are re-executed to expand sample diversity.
9. A non-transitory computer-readable storage medium, wherein the storage medium stores at least one instruction or at least one program segment, characterized in that the at least one instruction or the at least one program segment is loaded and executed by a processor to implement the data labeling method as described in any one of claims 1-8.
10. An electronic device, characterized in that it includes a processor and the non-transitory computer-readable storage medium as described in claim 9.