Object detection model training data synthesis method and terminal
By introducing semantic region-control type mapping rules and visual saliency heatmap correction into the training data of the object detection model, training data that conforms to UI design specifications is generated, which solves the problem of insufficient model generalization ability in the existing technology and improves the model's detection accuracy and ability to adapt to complex interfaces.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- FUJIAN TQ DIGITAL
- Filing Date
- 2026-01-28
- Publication Date
- 2026-06-19
AI Technical Summary
Existing object detection models tend to "memorize" random associations in training data, leading to a decline in recognition performance when the interface layout changes in the real world. Conventional data augmentation techniques cannot effectively improve the model's structural generalization ability, and the noisy data generated by existing image synthesis schemes misleads the model's learning.
By acquiring the control to be laid out and the target background image, candidate placement positions are generated according to the predefined semantic region and control type mapping rules. The control is then placed in a reasonable position to synthesize training data. The semantic region and control type mapping rules are introduced to constrain the layout logic. The position is corrected by combining the visual saliency heatmap and layout template to prevent overlap and simulate the effect of ambient light.
Generating training data that conforms to UI design specifications improves the authenticity and logical rationality of the data, enhances the model's detection accuracy and generalization ability, and enables it to better learn the layout rules of controls in real-world scenarios.
Figure CN122244580A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision and object detection technology, and in particular to a method and terminal for synthesizing training data for an object detection model.
Background Technology
[0002] The performance of deep learning-based object detection models is highly dependent on the scale and quality of the training data. Currently, even with high-precision labeled data obtained through automated means, models tend to "memorize" accidental associations in the data (such as controls always appearing in specific locations) rather than learning their essential features. This "context overfitting" causes the model's recognition performance to significantly degrade when faced with unconventional changes in interface layouts in the real world.
[0003] Conventional data augmentation techniques (such as geometric transformations and color jittering) only perturb images at the pixel level and cannot change the layout context of controls, so they do little to improve the structural generalization ability of the model. Some existing image synthesis schemes combine elements at random, which increases the amount of data but generates many samples that violate common sense in UI design (such as placing a "login button" in the middle of a news article). Such "noisy data" can mislead the model into learning incorrect patterns and thus impair its generalization performance.
Summary of the Invention
[0004] The technical problem to be solved by this invention is to provide a method for synthesizing training data for an object detection model, so as to generate training data with layout that conforms to UI design specifications and natural visual integration.
[0005] A method for synthesizing training data for an object detection model, the method comprising:
[0006] Acquiring the control to be laid out and the target background image; generating, based on predefined mapping rules between semantic regions and control types, candidate placement positions of the control to be laid out on the target background image; and placing the control to be laid out in the candidate placement position to synthesize training data.
[0007] To solve the above technical problem, another technical solution adopted by the present invention is as follows: a terminal includes a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor executes the computer program to perform the following steps: acquiring the control to be laid out and the target background image; generating, based on predefined mapping rules between semantic regions and control types, candidate placement positions of the control to be laid out on the target background image; and placing the control to be laid out in the candidate placement position to synthesize training data.
[0008] The beneficial effects of this invention are as follows: This invention first acquires the control to be laid out and the target background image. Then, based on predefined mapping rules between semantic regions and control types, candidate placement positions of the control on the target background image are generated. Finally, the control is placed into the candidate placement positions to synthesize training data. Generating candidate placement positions through the mapping rules between semantic regions and control types aims to constrain the layout logic from its source. Compared to related technologies where random control positions or grid-based determination may lead to unreasonable layouts, this solution attempts to ensure that controls are placed within reasonable areas that conform to their functional semantics, aiming to generate training data that better conforms to UI design specifications. This helps improve the realism and logical rationality of the synthesized data, enabling the object detection model trained on this data to better learn the control layout patterns in real-world scenarios, thereby supporting improvements in the model's detection accuracy and generalization ability.
Attached Figure Description
[0009] Figure 1 is a flowchart of the steps of a method for synthesizing training data for an object detection model, provided in an embodiment of the present invention; Figure 2 is a schematic structural diagram of a terminal for synthesizing training data for an object detection model, provided in an embodiment of the present invention; Figure 3 is a schematic diagram of the system structure of a terminal for synthesizing training data for an object detection model, provided in an embodiment of the present invention; Figure 4 is a flowchart of further steps of the method for synthesizing training data for an object detection model, provided in an embodiment of the present invention; Figure 5 shows the candidate placement position data structure used in the method for synthesizing training data for an object detection model, provided in an embodiment of the present invention.
[0010] Label explanation: 1. a terminal for synthesizing training data for an object detection model; 2. a processor; 3. a memory.
Detailed Implementation
[0011] To explain in detail the technical content, objectives, and effects of the present invention, the following description is provided in conjunction with the embodiments and accompanying drawings.
[0012] The following describes the relevant technical terms involved in this invention: Semantic Regions: Logical blocks on a page divided according to function or content, such as headers, navigation bars, main content areas, sidebars, and footers.
[0013] Visual saliency heatmap: A map that uses color changes to show how strongly the user's gaze is focused on different areas of the interface. The warmer the color (such as red), the more attention the area usually attracts.
[0014] PASCAL VOC: A foundational benchmark dataset and related challenges in the field of computer vision object detection. It not only provides labeled data, but more importantly, defines a set of standard tasks, evaluation metrics (such as mAP), and dataset formats.
[0015] COCO: A large-scale, high-quality image dataset focused on scene understanding, supporting various visual tasks such as object detection, instance segmentation, and keypoint detection.
[0016] YOLO: A groundbreaking real-time object detection algorithm. Its core idea is to treat object detection as a single regression problem, directly predicting bounding boxes and class probabilities on the image, achieving a good balance between speed and accuracy.
[0017] The following describes in detail a method for synthesizing training data for an object detection model according to the present invention, with reference to Figure 1. The method includes:
Step 110: Obtain the control to be laid out and the target background image. Specifically, a basic resource library module stores and manages two types of basic resources: the control to be laid out, which is a segmented UI control image with an alpha channel together with its category label; and the target background image, which is drawn from a background image library collected or simulated from various application interfaces to ensure background diversity.
[0018] Step 120: Based on the predefined mapping rules between semantic regions and control types, generate candidate placement positions of the control to be laid out on the target background image. Specifically, for each pair of a control to be laid out and a target background image, output one or more candidate placement positions with reasonableness confidence scores.
[0019] For example, the candidate placement positions are stored as a list of candidate position objects generated for this control. Each candidate position object is a composite structure containing the following information:
- Location ID: A unique integer identifier used for precise tracking within the system.
- Bounding box: A rectangle that defines the specific geometry of the control. It consists of the top-left corner X-coordinate, the top-left corner Y-coordinate, the width, and the height. The two coordinates locate the top-left corner of the rectangle in the pixel coordinate system of the target background image; the width and height are given in pixels.
- Layout reasonableness confidence: The reasonableness confidence score of the candidate position, usually in the range [0, 1]. The higher the score, the more reasonable the layout.
- Semantic region type: A string label indicating the functional category of the image region where the candidate position is situated (e.g., "Bottom Navigation Area", "Content Area", "Title Area"). This field is a direct result of applying the predefined mapping rules between semantic regions and control types.
- Alignment information: An alignment template and guides. The alignment template is the name of the specific design system or component template being followed (e.g., "material_bottom_app_bar"). Guides are virtual baselines for aligning control bounding boxes (e.g., "top", "left", "center"), used to ensure that interface elements conform to the precise arrangement requirements of the template.
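The candidate position structure described above can be sketched as a set of Python dataclasses. This is only an illustration: the class and field names (`BoundingBox`, `AlignmentInfo`, `CandidatePosition`, `location_id`, and so on) are hypothetical, and rejecting scores outside [0, 1] is an assumption, since the description only says the score is usually in that range.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    x: int       # top-left X, in background-image pixel coordinates
    y: int       # top-left Y, in background-image pixel coordinates
    width: int   # pixels
    height: int  # pixels


@dataclass
class AlignmentInfo:
    template: str                                    # e.g. "material_bottom_app_bar"
    guides: List[str] = field(default_factory=list)  # e.g. ["top", "left", "center"]


@dataclass
class CandidatePosition:
    location_id: int        # unique integer identifier
    bbox: BoundingBox
    confidence: float       # layout reasonableness confidence
    semantic_region: str    # e.g. "Bottom Navigation Area"
    alignment: AlignmentInfo

    def __post_init__(self) -> None:
        # Assumption: treat scores outside [0, 1] as an error.
        if not 0.0 <= self.confidence <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")


# The candidate placement positions for one control form a list of such objects.
candidates = [
    CandidatePosition(
        location_id=1,
        bbox=BoundingBox(x=40, y=1800, width=200, height=96),
        confidence=0.92,
        semantic_region="Bottom Navigation Area",
        alignment=AlignmentInfo("material_bottom_app_bar", ["center"]),
    )
]
```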
[0020] Step 130: Place the control to be laid out into the candidate placement position to synthesize training data; The training data mentioned above consists of images. After obtaining the training data, the training data and the annotation files are combined to form a training dataset. The annotation files are automatically generated files conforming to standard formats (such as PASCAL VOC, COCO, YOLO) after each image is successfully synthesized.
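Because the placement position of every control is known at synthesis time, annotation records can be derived directly from the bounding box. The sketch below converts a pixel-space box into a YOLO label line (normalized center coordinates) and a COCO-style annotation entry (pixel `[x, y, width, height]`). The function names and argument order are hypothetical, and only the bounding-box fields of each format are shown.

```python
def to_yolo_line(class_id, x, y, w, h, img_w, img_h):
    """Return one YOLO label line: class, then center/size normalized to [0, 1]."""
    xc = (x + w / 2) / img_w
    yc = (y + h / 2) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w / img_w:.6f} {h / img_h:.6f}"


def to_coco_annotation(ann_id, image_id, category_id, x, y, w, h):
    """Return a COCO-style annotation dict; bbox is [x, y, width, height] in pixels."""
    return {
        "id": ann_id,
        "image_id": image_id,
        "category_id": category_id,
        "bbox": [x, y, w, h],
        "area": w * h,
        "iscrowd": 0,
    }


# A 50x100 px control placed at (100, 200) on a 1000x2000 px background.
yolo_line = to_yolo_line(0, 100, 200, 50, 100, 1000, 2000)
coco_ann = to_coco_annotation(1, 1, 0, 100, 200, 50, 100)
```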
[0021] As described above, this embodiment first acquires the control to be laid out and the target background image. Then, based on predefined mapping rules between semantic regions and control types, candidate placement positions of the control to be laid out on the target background image are generated. Finally, the control to be laid out is placed in the candidate placement positions to synthesize training data. In related technologies, the placement positions of controls on the background image are usually random or determined based on a grid, which may lead to synthesized images that do not conform to the layout logic of real application scenarios, such as placing buttons in image areas. By introducing mapping rules between semantic regions and control types, this embodiment places controls within reasonable areas that conform to their functional semantics and can generate training data with layouts that conform to UI design specifications. This improves the realism and logical rationality of the synthesized training data, enabling the object detection model trained on this data to better learn the control layout rules in real-world scenarios, thereby improving the model's detection accuracy and generalization ability.
[0022] In one embodiment of this application, step 120, generating candidate placement positions of the control to be laid out on the target background image, includes:
Step 210: Calculate the reasonableness confidence score corresponding to each candidate placement position;
Step 220: Determine the reasonableness confidence score based on the matching degree between the control type of the control to be laid out and the target semantic region in the target background image, the matching degree being obtained according to the mapping rules. The UI-specific mapping rules associate control types with semantic regions of the background. For example, the mapping rules can be defined as: "Settings" type icons tend to be placed in the "top status bar" or the "bottom right floating button area"; "Buy" type buttons tend to be placed in the typical operation area "above the bottom navigation bar". This strategy ensures logical consistency between control types and page functions.
[0023] As described above, this embodiment evaluates candidate positions by calculating a reasonableness confidence score based on mapping rules. By introducing a matching degree calculation, this embodiment provides a quantified measure of reasonableness for each candidate position. This allows for prioritizing or filtering positions with higher confidence scores in subsequent control placement steps. This reduces the occurrence of unreasonable layouts and ensures the quality of output data from the source.
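One simple way to represent such mapping rules is a lookup table from control type to the semantic regions it suits, with a matching degree per region. The sketch below assumes this table form; the identifiers and the numeric degrees (1.0, 0.8) are hypothetical values chosen for illustration and do not come from the description.

```python
# Hypothetical rule table: control type -> {semantic region: matching degree}.
MAPPING_RULES = {
    "settings_icon": {
        "top_status_bar": 1.0,
        "bottom_right_floating_button_area": 0.8,
    },
    "buy_button": {
        "above_bottom_navigation_bar": 1.0,
    },
}


def matching_degree(control_type, region, rules=MAPPING_RULES, default=0.0):
    """Look up how well a control type fits a semantic region.

    Pairs that no rule mentions fall back to `default` (assumed 0.0 here).
    """
    return rules.get(control_type, {}).get(region, default)


def allowed_regions(control_type, rules=MAPPING_RULES):
    """List the regions a control type may be placed in, best match first."""
    regions = rules.get(control_type, {})
    return sorted(regions, key=regions.get, reverse=True)
```

With a table like this, candidate generation can restrict itself to `allowed_regions(control_type)` and reuse `matching_degree` as an input to the confidence score.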
[0024] In one embodiment of this application, step 210, calculating the reasonableness confidence score corresponding to each candidate placement position, includes: calculating the reasonableness confidence score as a weighted average of multiple factors, including: the matching degree between the control type of the control to be laid out and the semantic region in the target background image, the saliency value of the region where the candidate placement position is located in the visual saliency heatmap of the target background image, and the degree to which the candidate placement position meets the alignment requirements of the preset layout template.
As described above, this embodiment calculates a reasonableness score by considering multiple factors, including matching degree, visual saliency, and layout alignment. Related technologies typically rely on a single rule (such as type matching) for position selection, neglecting factors such as visual guidance and interface aesthetics. This method accounts for the distribution of user attention by introducing a visual saliency heatmap and ensures the regularity of interface elements by introducing layout template alignment. This multi-factor weighted evaluation makes the generated candidate positions not only semantically reasonable but also visually more natural and consistent with design specifications. This improves the realism and quality of the synthetic training data, enabling models trained on this data to better adapt to real and complex application interfaces.
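The weighted average of the three factors can be sketched as follows. The description does not give concrete weights, so the default weights (0.5, 0.3, 0.2) are purely illustrative, and the function name and the requirement that each factor lies in [0, 1] are assumptions.

```python
def reasonableness_score(match, saliency, alignment, weights=(0.5, 0.3, 0.2)):
    """Weighted average of matching degree, saliency value and alignment degree.

    Each factor is assumed to be normalized to [0, 1]; dividing by the weight
    sum keeps the result in [0, 1] even if the weights do not sum to one.
    """
    factors = (match, saliency, alignment)
    if any(not 0.0 <= f <= 1.0 for f in factors):
        raise ValueError("each factor must lie in [0, 1]")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w * f for w, f in zip(weights, factors)) / total


# Perfect semantic match, medium saliency, no template alignment.
score = reasonableness_score(1.0, 0.5, 0.0)
```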
[0025] In one embodiment of this application, the method further includes: Step 410: Perform anti-overlap detection on the candidate placement positions of multiple controls to be laid out on the same target background image. Specifically, the anti-overlap detection uses the intersection-over-union (IoU) of bounding boxes to ensure that the layout of multiple controls on the same target background image is reasonable.
[0026] Step 420: If there are overlapping candidate placement positions, filter out candidate placement positions with scores lower than a preset threshold based on the reasonableness confidence score, until all candidate placement positions do not overlap. As described above, this embodiment handles layout conflicts through anti-overlap detection and a filtering mechanism based on reasonableness confidence scores. In related technologies, the placement of multiple controls may overlap, causing the synthesized image to not conform to the actual interface specifications and affecting data quality. This embodiment, upon detecting overlap, prioritizes retaining candidate placement positions with higher reasonableness confidence scores and automatically removes candidate placement positions with scores below a preset threshold. This ensures that the final synthesized interface layout avoids element overlap while guaranteeing the quality of the output data.
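Steps 410 and 420 can be sketched with a bounding-box intersection-over-union test followed by score-based filtering. The description does not say what happens when two overlapping candidates both score above the threshold, so the second pass below (keep the higher-scoring one) is an assumption, as are the function names and the `(bbox, score)` tuple layout.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def resolve_overlaps(candidates, threshold=0.5):
    """Filter a list of (bbox, score) candidates until none overlap."""
    def overlaps(a, b):
        return iou(a[0], b[0]) > 0.0

    # Pass 1: drop candidates that overlap another one and score below threshold.
    survivors = []
    for i, cand in enumerate(candidates):
        in_conflict = any(
            overlaps(cand, other) for j, other in enumerate(candidates) if j != i
        )
        if not (in_conflict and cand[1] < threshold):
            survivors.append(cand)

    # Pass 2 (assumption): settle remaining conflicts in favor of higher scores.
    kept = []
    for cand in sorted(survivors, key=lambda c: c[1], reverse=True):
        if not any(overlaps(cand, k) for k in kept):
            kept.append(cand)
    return kept
```

Candidates that score below the threshold but overlap nothing are left in place, since the filtering in step 420 is only triggered by an overlap.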
[0027] In one embodiment of this application, it further includes: Step 510: Based on the visual saliency heatmap of the target background image, candidate placement positions are generated for the controls to be laid out. Specifically, an image saliency detection algorithm is used to analyze the target background image and generate a visual saliency heatmap representing "visual focus areas". The probability of the controls to be laid out being placed is positively correlated with the saliency value of that area, simulating the principle in real interfaces where important elements are placed near the visual focus.
[0028] As described above, this embodiment uses a visual saliency heatmap to guide the generation of candidate positions. Related technologies typically do not consider the visual attention distribution of the background image when generating candidate positions, which may result in controls being placed in areas that are not easily noticed by the user. By utilizing a saliency heatmap, candidate positions are preferentially generated in areas of high visual saliency within the image. This ensures that the synthesized interface controls are more likely to be located in areas where the user's natural focus is, thus simulating the interface design used in real-world applications to improve user experience. The resulting training data more closely resembles the interaction logic of real-world scenarios, helping to train an object detection model with stronger discriminative ability regarding the reasonable positions of controls in the visual context.
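The positive correlation between placement probability and saliency can be illustrated by sampling heatmap cells with probability proportional to their saliency value. This sketch assumes the heatmap has already been produced by some saliency detection algorithm and is given as a 2D list of non-negative numbers; the function name and the cell-level granularity are hypothetical.

```python
import random


def sample_positions(heatmap, k, seed=None):
    """Draw k (row, col) cells with probability proportional to saliency."""
    rng = random.Random(seed)
    cells = [(r, c) for r, row in enumerate(heatmap) for c in range(len(row))]
    weights = [value for row in heatmap for value in row]
    if sum(weights) <= 0:
        raise ValueError("heatmap must contain some positive saliency")
    # random.choices samples with replacement according to the given weights.
    return rng.choices(cells, weights=weights, k=k)


# Toy 2x2 heatmap in which only the bottom-right cell is salient.
heatmap = [
    [0.0, 0.0],
    [0.0, 1.0],
]
picks = sample_positions(heatmap, k=5, seed=0)
```

In a real pipeline each sampled cell would then be mapped back to pixel coordinates and checked against the semantic-region rules before it becomes a candidate placement position.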
[0029] In one embodiment of this application, the method further includes: Step 610: Determine the reasonableness confidence score based on the saliency value, in the visual saliency heatmap of the target background image, of the region corresponding to the candidate placement position.
As described above, this embodiment directly uses the visual saliency value as a basis for calculating the reasonableness confidence score. This makes the synthetic data more realistic in simulating real user visual behavior, thereby helping to train an object detection model that locates key interactive elements in complex visual scenes more accurately and in a way that conforms to human cognitive habits.
[0030] In one embodiment of this application, step 120, generating candidate placement positions of the control to be laid out on the target background image according to the predefined mapping rules between semantic regions and control types, includes:
Step 710: Based on the predefined mapping rules between semantic regions and control types, generate the current placement position of the control to be laid out on the target background image;
Step 720: Based on the preset layout template, snap the current placement position of the control to be laid out to the virtual reference lines of the layout template to generate candidate placement positions. For example, define common UI templates such as "list item", "grid layout", and "toolbar", and snap the position of the control to the virtual reference lines of the template.
[0031] As described above, this embodiment generates candidate positions by combining semantic mapping and layout template snapping. After generating the initial positions through semantic matching, the virtual reference lines of the layout template are further used for position correction, aligning the controls with preset grids or alignment lines. This ensures that the generated candidate positions meet the visual specifications of the interface design, guaranteeing the alignment and sense of order in the composite interface.
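Snapping to virtual reference lines can be sketched as moving the top-left corner of the bounding box to the nearest guide on each axis. The guide coordinates, the optional `tolerance` parameter, and the choice to snap the top-left corner (rather than the center or another edge) are assumptions made for this illustration.

```python
def snap(value, guides, tolerance=None):
    """Return the guide nearest to value; keep value if it is farther than tolerance."""
    nearest = min(guides, key=lambda g: abs(g - value))
    if tolerance is not None and abs(nearest - value) > tolerance:
        return value
    return nearest


def snap_bbox(bbox, x_guides, y_guides, tolerance=None):
    """Snap the top-left corner of an (x, y, width, height) box to template guides."""
    x, y, w, h = bbox
    return (snap(x, x_guides, tolerance), snap(y, y_guides, tolerance), w, h)


# Hypothetical "list item" template: vertical guides every 40 px, rows every 96 px.
x_guides = [0, 40, 80]
y_guides = [0, 96, 192]
snapped = snap_bbox((37, 103, 80, 40), x_guides, y_guides)
```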
[0032] In one embodiment of this application, step 130, placing the control to be laid out into candidate placement positions to synthesize training data, includes: Step 810: After placing the control to be laid out into the candidate placement position, perform scale transformation and appearance fusion processing on the control to be laid out so that the control to be laid out blends with the target background image; wherein, scale transformation refers to randomly scaling the control within a reasonable range.
[0033] As described above, this embodiment enhances the realism of the synthesized data through scale transformation and appearance fusion processing. This embodiment simulates the deformation of control images through scale transformation and simulates ambient light and projection effects through appearance fusion. This increases the diversity of training data, resulting in higher quality training data that more closely resembles real-world images, thus improving the generalization ability and robustness of the object detection model in complex and variable real-world scenes.
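The placement step itself can be illustrated in pure Python: pick a random scale factor, resize the control, and alpha-composite it onto the background using the control's alpha channel. Images are represented here as nested lists of pixel tuples to keep the sketch dependency-free; the function names, the 0.8 to 1.2 scale range, and the nearest-neighbor resize are assumptions, and a real pipeline would more likely use an imaging library.

```python
import random


def random_scale(size, lo=0.8, hi=1.2, rng=random):
    """Scale a (width, height) pair by one random factor drawn from [lo, hi]."""
    s = rng.uniform(lo, hi)
    w, h = size
    return max(1, round(w * s)), max(1, round(h * s))


def resize_nearest(img, new_w, new_h):
    """Nearest-neighbor resize of an image stored as rows of pixels."""
    old_h, old_w = len(img), len(img[0])
    return [
        [img[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]


def paste_rgba(background, control, x, y):
    """Alpha-composite an RGBA control onto a copy of an RGB background at (x, y)."""
    out = [list(row) for row in background]
    for r, row in enumerate(control):
        for c, (cr, cg, cb, ca) in enumerate(row):
            by, bx = y + r, x + c
            if 0 <= by < len(out) and 0 <= bx < len(out[0]):
                a = ca / 255
                br, bg, bb = out[by][bx]
                # Standard "over" blend: alpha * foreground + (1 - alpha) * background.
                out[by][bx] = (
                    round(a * cr + (1 - a) * br),
                    round(a * cg + (1 - a) * bg),
                    round(a * cb + (1 - a) * bb),
                )
    return out
```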
[0034] In one embodiment of this application, the appearance fusion processing of the control to be laid out includes: Step 910: Based on the main color tone and brightness statistics of the target placement area in the target background image, adjust at least one of the hue, saturation, and brightness of the control to be laid out; specifically, fine-tune the hue, saturation, and brightness of the image of the control to be laid out to simulate the effect of ambient light.
[0035] Step 920: Generate a projection (drop shadow) for the control to be laid out according to the preset light source direction; specifically, generate a soft dynamic shadow for the control according to the preset global light source direction to enhance the three-dimensionality and realism of the composite.
[0036] As described above, this embodiment effectively eliminates sharp visual boundaries and inconsistent lighting caused by simple pasting in the synthesized image by simulating ambient light illumination and generating stereoscopic projection. Compared with the obvious synthesis artifacts caused by directly pasting controls to the background in the prior art, this solution significantly improves the visual realism of the synthesized image. This helps prevent the object detection model from misjudging synthesis artifacts as learned features, thereby ensuring that the model learns the general visual features of the control itself. This improves the data quality used for model training from a data perspective, thus supporting the technical objective of improving the model's generalization ability.
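Steps 910 and 920 can be sketched with the standard-library `colorsys` module: measure the mean brightness of the target placement area, pull the control's brightness toward it, and derive a shadow offset from a light direction. Adjusting only brightness (one of the three properties the step allows), the `strength` blend factor, reading the "projection" as a drop shadow, and the angle convention are all assumptions of this sketch.

```python
import colorsys
import math


def mean_brightness(region):
    """Mean HSV value (0..1) of a list of (r, g, b) pixels with 0..255 channels."""
    return sum(
        colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[2] for r, g, b in region
    ) / len(region)


def match_brightness(pixel, target_v, strength=0.5):
    """Move a pixel's HSV value toward target_v; strength=0 leaves it unchanged."""
    r, g, b = pixel
    h, s, v = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
    v = min(1.0, max(0.0, v + strength * (target_v - v)))
    return tuple(round(ch * 255) for ch in colorsys.hsv_to_rgb(h, s, v))


def shadow_offset(light_angle_deg, distance):
    """Pixel offset (dx, dy) of the shadow relative to the control.

    Assumed convention: the angle is the direction the light travels, measured
    from the +x axis toward +y (downward in image coordinates).
    """
    a = math.radians(light_angle_deg)
    return round(distance * math.cos(a)), round(distance * math.sin(a))


# A dark placement area pulls a white control fully to black at strength 1.0.
target = mean_brightness([(0, 0, 0), (0, 0, 0)])
darkened = match_brightness((255, 255, 255), target, strength=1.0)
```

The shadow itself would then be rendered by pasting a blurred, semi-transparent copy of the control's alpha mask at the computed offset before pasting the control.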
[0037] Referring to Figure 3 and Figure 4, the above method for synthesizing training data for an object detection model can be applied to specific scenarios. For example, consider a scenario in which a robust UI element detector is being developed for a cross-platform application, and a model trained with only limited real data from the terminals has insufficient generalization ability. A company needs to ensure that its core function buttons (such as "one-click payment") can be accurately recognized on terminals with various resolutions and layouts, such as mobile phones, tablets, and in-vehicle screens. This includes steps A-1 to A-4:
Step A-1: Collect a clean image of the "One-Click Payment" button and gather screenshots of the interface from various terminals and application states as backgrounds. This corresponds to step 110 above.
[0038] Step A-2: The system combines the button with a vehicle screen interface background.
[0039] The semantically aware layout strategy engine determines, based on the "semantic region constraint" rule, that the payment button should be located in the easily accessible "lower-middle region," and calculates a reasonable position within this region. This corresponds to step 120 above.
[0040] The appearance fusion module adjusts the brightness and contrast of buttons based on the dark mode commonly found in in-vehicle interfaces, and generates an adapted projection.
[0041] Step A-3: After the synthetic scheme passes the rationality filtering, the system generates a large amount of synthetic data simulating different terminals and layouts. This corresponds to step 130 above.
[0042] Step A-4: Train the detection model on a mixture of synthetic data and real data. This helps improve the accuracy and stability with which the resulting model recognizes the "one-click payment" button when faced with unfamiliar terminal layouts.
[0043] Referring to Figure 3 and Figure 4, the above method for synthesizing training data for an object detection model can also be applied to the following scenario. A game releases new recharge events every week to increase player activity, and a new event may mean that new art assets are used to display the recharge event advertisements. For instance, the close button on the ad pop-up may be drawn in different visual styles. This includes steps B-1 to B-5:
Step B-1: After unpacking the game APK package (similar to decompression), locate the resource images inside and sort them into folders (categories such as close button, operation control, background, prop, important images, and general images; create folders according to specific needs and place the resource images unpacked from the APK into the corresponding folders). This corresponds to step 110 above.
[0044] Step B-2: Combine all images from the "Close Button" folder with those from the "Background Image" folder. During this process, the background image is stretched to various mobile phone screen resolutions. The close button image is scaled algorithmically to multiple sizes and placed at random positions on the background image without occlusion. This corresponds to step 120 above.
[0045] Step B-3: Since the close button image is placed on the game background by code, the control's information, such as its type (including mapping information), is known and can be used to generate training data for training the model (the training data includes the combined images and the mapping files). Training on this data yields the model. This corresponds to step 130 above.
[0046] Step B-4: Combine the model with automated code and apply it in game automated testing to attempt to identify the close button in the pop-up window. The aim is to enable the model to learn the common visual features of the close button, thereby improving its potential to recognize unseen close buttons with similar styles.
[0047] Step B-5: Once the close button is identified, its location can be determined, and automated code can be used to click it.
[0048] Referring to Figure 2, a terminal 1 for synthesizing training data for an object detection model includes a memory 3, a processor 2, and a computer program stored in the memory 3 and runnable on the processor 2. When the processor 2 executes the computer program, it implements the steps of the above method for synthesizing training data for an object detection model.
[0049] In summary, this invention provides a method and terminal for synthesizing training data for an object detection model. First, it acquires the control to be laid out and the target background image. Then, based on predefined mapping rules between semantic regions and control types, it generates candidate placement positions for the control on the target background image. Finally, it places the control in the candidate placement positions to synthesize training data. Generating candidate placement positions through the mapping rules between semantic regions and control types aims to constrain the layout logic from its source. Compared to related technologies where random control positions or grid-based determination may lead to unreasonable layouts, this solution attempts to ensure that controls are placed within reasonable areas that conform to their functional semantics, aiming to generate training data that better conforms to UI design specifications. This helps improve the realism and logical rationality of the synthesized data, enabling the object detection model trained based on this data to better learn the control layout rules in real-world scenarios, thereby supporting the improvement of the model's detection accuracy and generalization ability. The core contribution of this invention lies in proposing a new paradigm of "controllable synthesis under rule constraints." By establishing a complete technical chain from semantic understanding and visual guidance to quality filtering, it transforms the generation of synthesized data from an uncontrollable random process into a predictable and interpretable deterministic process. This provides a new technical approach for efficiently building robust AI models using synthetic data.
[0050] The above description is merely an embodiment of the present invention and does not limit the patent scope of the present invention. Any equivalent modifications made based on the content of the present invention specification and drawings, or direct or indirect applications in related technical fields, are similarly included within the patent protection scope of the present invention.
Claims
1. A method for synthesizing training data for an object detection model, characterized in that the method includes: acquiring the control to be laid out and the target background image; generating, based on predefined mapping rules between semantic regions and control types, candidate placement positions of the control to be laid out on the target background image; and placing the control to be laid out in the candidate placement position to synthesize training data.
2. The method for synthesizing training data for an object detection model according to claim 1, characterized in that generating candidate placement positions of the control to be laid out on the target background image includes: calculating the reasonableness confidence score corresponding to each of the candidate placement positions; wherein the reasonableness confidence score is determined based on the matching degree between the control type of the control to be laid out and the target semantic region in the target background image, and the matching degree is obtained according to the mapping rules.
3. The method for synthesizing training data for an object detection model according to claim 2, characterized in that calculating the reasonableness confidence score corresponding to each of the candidate placement positions includes: calculating the reasonableness confidence score as a weighted average of multiple factors, including the matching degree between the control type of the control to be laid out and the semantic region in the target background image, the saliency value of the region where the candidate placement position is located in the visual saliency heatmap of the target background image, and the degree to which the candidate placement position meets the alignment requirements of the preset layout template.
4. The method for synthesizing training data for an object detection model according to claim 2, characterized in that the method further includes: performing anti-overlap detection on the candidate placement positions of multiple controls to be laid out on the same target background image; and, if there are overlapping candidate placement positions, filtering out, based on the reasonableness confidence score, candidate placement positions with scores lower than a preset threshold until no candidate placement positions overlap.
5. The method for synthesizing training data for an object detection model according to claim 1, characterized in that the method further includes: generating the candidate placement positions for the control to be laid out based on the visual saliency heatmap of the target background image.
6. The method for synthesizing training data for an object detection model according to claim 5, characterized in that the method further includes: determining a reasonableness confidence score based on the saliency value, in the visual saliency heatmap of the target background image, of the region corresponding to the candidate placement position.
7. The method for synthesizing training data for an object detection model according to claim 1, characterized in that generating candidate placement positions of the control to be laid out on the target background image based on the predefined mapping rules between semantic regions and control types includes: generating, based on the predefined mapping rules between semantic regions and control types, the current placement position of the control to be laid out on the target background image; and snapping, according to the preset layout template, the current placement position of the control to be laid out to the virtual reference line of the layout template to generate the candidate placement position.
8. The method for synthesizing training data for an object detection model according to claim 1, characterized in that placing the control to be laid out in the candidate placement position to synthesize training data includes: after placing the control to be laid out in the candidate placement position, subjecting the control to be laid out to scale transformation and appearance fusion processing so that the control to be laid out blends with the target background image.
9. The method for synthesizing training data for an object detection model according to claim 8, characterized in that the appearance fusion processing of the control to be laid out includes: adjusting at least one of the hue, saturation, and brightness of the control to be laid out based on the main color tone and brightness statistics of the target placement area in the target background image; and generating a projection for the control to be laid out according to the preset light source direction.
10. A terminal for synthesizing training data for an object detection model, comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the method for synthesizing training data for an object detection model according to any one of claims 1 to 9.