Method for constructing ethnic costume target detection model and application thereof
By improving the YOLOv13 model, adding the P2 small target detection module, introducing the ASFF module, and using the SIoU loss function, the problem of low detection accuracy of ethnic costumes was solved, and high-precision ethnic costume target detection was achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- YUNNAN NORMAL UNIV
- Filing Date
- 2026-05-13
- Publication Date
- 2026-06-19
Abstract
Description
Technical Field
[0001] This application relates to the field of target detection technology, and in particular to a method for constructing a target detection model for ethnic costumes and its application.
Background Technology
[0002] With the continuous development of computer vision and artificial intelligence technologies, object detection technology has been widely applied in the field of image recognition. Ethnic costumes, as an important carrier of minority cultures, embody rich historical memories and ethnic characteristics, and are of great significance for the inheritance and digital preservation of ethnic culture. Applying object detection to ethnic costume detection scenarios can help intangible cultural heritage workers quickly locate and record endangered costume types, establish structured digital archives, and provide permanent digital records for traditional costumes facing extinction.
[0003] However, due to the complex shooting environment, diverse types of clothing, and intricate details, the automatic recognition and detection of ethnic costume images still faces significant challenges. In practical applications, ethnic costume images often contain various complex background elements, such as figures, environment, and other decorations, which can easily cause confusion between the target and the background. At the same time, different ethnic costumes differ significantly in texture, color, and structure, and the small size of some parts of the clothing can easily lead to problems such as occlusion, overlap, and scale variations, resulting in a high rate of missed detections and false detections during the detection process.
[0004] In view of this, this application proposes a method for constructing a target detection model for ethnic costumes, aiming to improve the accuracy of ethnic costume detection.
Summary of the Invention
[0005] The main purpose of this application is to provide a method for constructing a target detection model for ethnic costumes, aiming to solve the problem of low detection accuracy for ethnic costumes.
[0006] To achieve the above objectives, this application provides a method for constructing a target detection model for ethnic costumes, which includes the following steps:
(1) Dataset construction phase:
S10, collect ethnic costume image data, and after preprocessing the ethnic costume image data, label the target ethnic costume type corresponding to each image, and divide the training set from the labeled image data;
(2) Improved YOLOv13 model construction phase:
S20, four small target detection modules of different scales, P2, P3, P4 and P5, are set at the detection head of the YOLOv13 network model, where the P2 small target detection module extracts high-resolution information from the shallow features of the backbone network and inputs it into the detection head after channel number alignment;
S30, a feature fusion module is set between the output end of the YOLOv13 network model and the detection head to perform scale alignment, adaptive weight calculation and feature weighted fusion of the P3, P4 and P5 small target detection modules, thereby completing the construction of the improved YOLOv13 model;
(3) Model training phase:
S40, the improved YOLOv13 network model is trained on the training set. During training, the SIoU loss function is used as the loss function of the detection head, and the ethnic costume target detection model is obtained after training is completed.
[0007] Optionally, the ethnic costume image data shall satisfy the following constraints during acquisition: it includes full-body images of the clothing; close-up images of patterns, structures, and materials; and close-up images of key parts, including headdresses, embroidery, and belts.
[0008] Optionally, the ethnic costume image data includes the costumes of 15 ethnic groups, including Bai, Naxi, Hani, Wa, Bulang, Dai, Li, Lahu, Jingpo, Pumi, Achang, Nu, Jino, De'ang, and Dulong.
[0009] Optionally, the SIoU loss function during training includes:
S41, calculate the angle loss between the center point of the predicted box and the center point of the ground truth box, and train the model with the goal of minimizing the angle loss;
S42, when the angle loss is minimized, calculate the distance loss weight based on the angle loss, calculate the center point distance between the predicted box and the ground truth box based on the distance loss weight, and train the model with the goal of minimizing the center point distance;
S43, when the center point distance is minimized, calculate the aspect ratio difference between the predicted box and the ground truth box, and train the model with the goal of minimizing the aspect ratio difference;
S44, when the aspect ratio difference is minimized, calculate the intersection over union (IoU) between the predicted box and the ground truth box, and train the model with an IoU of 1 as the target.
[0010] Optionally, after step S40, the method further includes: When multiple prediction boxes with the same score appear, non-maximum suppression is used to select the target prediction box as the unique bounding box from the multiple prediction boxes.
[0011] In addition, to achieve the above objectives, this application also provides a target detection model for ethnic costumes, obtained by the method for constructing a target detection model for ethnic costumes as described in any of the above.
[0012] In addition, to achieve the above objectives, this application also provides a target detection model for ethnic costumes as described above, and its application in the detection of targets for ethnic costumes.
[0013] This application has at least the following beneficial effects:
1. Adding the P2 small target detection module to the detection head structure effectively enhances the model's ability to extract high-resolution detail features, significantly improves the detection accuracy of small-area parts in ethnic costumes, and reduces the false negative rate of small targets;
2. Introducing the ASFF module in the feature fusion stage improves the model's ability to perceive multi-scale ethnic costume targets in complex backgrounds;
3. Using the SIoU loss function instead of the traditional CIoU loss function speeds up model convergence and improves bounding box localization accuracy.
Attached Figure Description
[0014] Figure 1 is a flowchart illustrating the method for constructing a target detection model for ethnic costumes according to an embodiment of this application;
Figure 2 is a schematic diagram of the architecture of the small target detection module involved in an embodiment of this application;
Figure 3 is a schematic diagram of the architecture of the multi-scale feature fusion module involved in an embodiment of this application;
Figure 4 is an architecture diagram of the ethnic costume target detection model involved in the embodiments of this application.
The realization of the purpose, functional features and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0015] To better understand the above technical solutions, exemplary embodiments of this disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of this disclosure to those skilled in the art.
[0016] First Embodiment: Referring to Figure 1, this embodiment provides a method for constructing a target detection model for ethnic costumes, which includes the following steps:
(1) Dataset construction phase:
S10, collect ethnic costume image data, and after preprocessing the ethnic costume image data, label the target ethnic costume type corresponding to each image, and divide the training set from the labeled image data;
In this embodiment, images of ethnic costumes are collected using cameras and publicly available channels. The collected images cover clothing styles from multiple ethnic groups and in multiple scenes, ensuring the diversity and representativeness of the data.
[0017] In some alternative implementations, the ethnic costume image data includes the costumes of 15 ethnic groups, including the Bai, Naxi, Hani, Wa, Bulang, Dai, Li, Lahu, Jingpo, Pumi, Achang, Nu, Jino, De'ang, and Dulong.
[0018] Furthermore, and optionally, comprehensive multi-angle and multi-scale data collection of ethnic costumes will be conducted under both natural and controlled lighting conditions. The images will encompass full-body images of the costumes as well as close-up details showcasing patterns, structure, and materials, including key elements such as headdresses, embroidery, and belts. By simultaneously acquiring overall form and detailed pattern information, the data will be ensured to comprehensively reflect the structural characteristics and detailed attributes of the ethnic costumes. The resulting ethnic costume image data will meet the following constraints during data collection: Full-body image of the clothing; Including close-up images of patterns, structures, and materials; Close-up images of key parts, including headdresses, embroidery, and belts.
[0019] In some optional implementations, preprocessing includes quality screening of the acquired ethnic costume images, removing duplicate images and images that are obviously blurry, abnormally exposed, or have a resolution below a preset threshold. This step ensures the clarity and effective information content of the training data, improves the quality of features that the model can learn, and thus enhances the convergence stability and detection accuracy of the model during subsequent training.
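The screening described above can be sketched as follows. This is a minimal illustration, not the application's implementation: the grayscale image is represented as a plain 2D list, blur is approximated by the variance of a 4-neighbour Laplacian, exact duplicates are found by hashing, and every threshold (`min_side`, `blur_thresh`, `dark`, `bright`) is a hypothetical value, since the text only refers to "a preset threshold".

```python
import hashlib


def laplacian_variance(gray):
    """Variance of a 4-neighbour Laplacian; low values suggest a blurry image."""
    h, w = len(gray), len(gray[0])
    responses = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x] + gray[y][x - 1]
                   + gray[y][x + 1] - 4 * gray[y][x])
            responses.append(lap)
    if not responses:  # image too small to have interior pixels
        return 0.0
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)


def passes_screening(gray, min_side=3, blur_thresh=10.0, dark=20, bright=235):
    """Resolution, exposure and sharpness checks (all thresholds are assumed)."""
    h, w = len(gray), len(gray[0])
    if min(h, w) < min_side:               # resolution below preset threshold
        return False
    mean_intensity = sum(map(sum, gray)) / (h * w)
    if not dark <= mean_intensity <= bright:  # abnormal exposure
        return False
    return laplacian_variance(gray) >= blur_thresh  # obviously blurry


def screen(images):
    """Return names of images that are not exact duplicates and pass screening.

    images: dict mapping a name to a 2D list of grayscale values.
    """
    seen, kept = set(), []
    for name, gray in images.items():
        digest = hashlib.sha256(repr(gray).encode()).hexdigest()
        if digest in seen:                 # exact duplicate of an earlier image
            continue
        seen.add(digest)
        if passes_screening(gray):
            kept.append(name)
    return kept
```

In practice the same checks would run on decoded image arrays, and near-duplicate detection would need a perceptual hash rather than an exact one.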
[0020] In some alternative implementations, rectangular boxes are used to annotate the image, and the boxes in the image after annotation are the actual bounding boxes.
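As an illustration of how a rectangular annotation can be turned into a training label, the sketch below converts two corner points into a normalized center/width/height tuple. The normalized format is an assumption made for this example (it is the layout commonly used for YOLO-style label files); the text does not state which label format is used, and `rectangle_to_yolo` is a hypothetical helper name.

```python
def rectangle_to_yolo(x1, y1, x2, y2, img_w, img_h):
    """Convert a corner-point rectangle to (cx, cy, w, h), normalized to [0, 1]."""
    left, right = sorted((x1, x2))   # corners may be given in any order
    top, bottom = sorted((y1, y2))
    cx = (left + right) / 2 / img_w
    cy = (top + bottom) / 2 / img_h
    return cx, cy, (right - left) / img_w, (bottom - top) / img_h
```

For example, a box from (100, 50) to (300, 250) in a 400×500 image is centered at half the width and three tenths of the height.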
[0021] (2) Improved YOLOv13 model construction phase: In this embodiment, it should be noted that S20 and S30 describe only the modified parts of the improved YOLOv13 model, not the construction of the complete model. Other parts, such as the architecture of the backbone network, follow the conventional YOLOv13 model and are not described again in this embodiment.
[0022] S20, Four small target detection modules of different scales, P2, P3, P4 and P5, are set at the detection head of the YOLOv13 network model. Among them, the P2 small target detection module is used to extract high-resolution information from the shallow features of the backbone network, and input it into the detection head after channel number alignment. In S20, a new P2 small target detection module is added to the detection head. The P2 small target detection module first extracts high-resolution information from the shallow features of the backbone, and then directly connects it to the detection head after 1×1 convolution to align the number of channels.
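The channel alignment step can be illustrated with a plain 1×1 convolution. The sketch below is only a toy version over nested lists, assuming a `[channels][height][width]` layout; the weights are placeholders, whereas in the model they would be learned, and `conv1x1` is a hypothetical helper name.

```python
def conv1x1(feature, weight, bias=None):
    """Apply a 1x1 convolution: mix channels at each pixel, keep H and W.

    feature: [C_in][H][W] nested lists; weight: [C_out][C_in]; bias: [C_out].
    """
    c_in, h, w = len(feature), len(feature[0]), len(feature[0][0])
    out = []
    for o, row in enumerate(weight):
        if len(row) != c_in:
            raise ValueError("weight row length must equal input channels")
        b = bias[o] if bias else 0.0
        out.append([[b + sum(row[c] * feature[c][y][x] for c in range(c_in))
                     for x in range(w)] for y in range(h)])
    return out
```

Because the kernel is 1×1, the spatial resolution of the shallow feature map is untouched and only the channel count changes, which is what lets the high-resolution P2 branch feed a head that expects a fixed number of channels.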
[0023] In some alternative implementations, referring to Figure 2, three small target detection modules of different scales, P3, P4 and P5, are set up in the neck network, forming, together with P2, a four-scale detection structure of P2, P3, P4 and P5. This enhances the model's ability to extract detailed features from high-resolution ethnic costume images.
[0024] It should be noted that in images of ethnic costumes, details such as patterns, headdresses, embroidery, and belts are often small in size and significantly affected by shooting angle, occlusion, and changes in ambient lighting. As a result, the original YOLOv13 has limited ability to detect small objects, leading to missed detections and unstable recognition. The introduction of the P2 small object detection module significantly improves the model's ability to capture shallow detail features, allowing the network to make fuller use of high-resolution features for fine detection of small pattern areas in ethnic costumes. This reduces the missed detection rate of small objects and improves overall detection accuracy and robustness.
In some alternative implementations, a folder named "datasets" is created, and two subfolders named "image" and "labels" are created within this folder to store the 1690 images of ethnic costumes and their corresponding labels.
S30, a multi-scale feature fusion module is set between the output end of the YOLOv13 network model and the detection head to perform scale alignment, adaptive weight calculation and feature weighted fusion of the P3, P4 and P5 small target detection modules, thereby completing the construction of the improved YOLOv13 model;
Referring to the schematic diagram of the multi-scale feature fusion module shown in Figure 3, S30 introduces a multi-scale feature fusion module, ASFF, between the output end of the FullPAD Tunnel and the input end of the detection head. The ASFF module receives feature maps from three scales, P3, P4 and P5, and achieves dynamic integration of feature information through scale alignment, adaptive weight calculation and pixel-by-pixel weighted fusion.
[0025] It should be noted that in images of ethnic costumes, due to the diverse types of clothing, complex textures, and significant background variations (such as environmental decorations and human interference), the traditional YOLO's fixed feature fusion method struggles to adapt to changes in the importance of features at different scales. This can easily lead to feature redundancy or semantic loss, thus affecting the detection model's ability to recognize multi-scale clothing targets. The ASFF module adaptively adjusts the fusion ratio of features at different scales, enabling the network to dynamically allocate feature weights based on image complexity. This makes it more suitable for targets with large scale differences and numerous textural details in ethnic costumes, effectively improving multi-scale detection performance against complex backgrounds.
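The three operations attributed to the ASFF module (scale alignment, adaptive weight calculation, pixel-by-pixel weighted fusion) can be sketched on single-channel maps as below. Several things here are assumptions for illustration: fusion is done at the P3 resolution, P4 and P5 are taken to be 2× and 4× coarser and are aligned by nearest-neighbour upsampling, and the per-pixel `logits` are passed in directly, whereas in a real network they would be produced by learned layers.

```python
import math


def upsample_nearest(fmap, factor):
    """Scale alignment: repeat each value `factor` times along both axes."""
    return [[fmap[y // factor][x // factor]
             for x in range(len(fmap[0]) * factor)]
            for y in range(len(fmap) * factor)]


def asff_fuse(p3, p4, p5, logits):
    """Fuse three single-channel maps at the resolution of p3.

    logits: three maps shaped like p3, one per level (assumed given).
    """
    levels = [p3, upsample_nearest(p4, 2), upsample_nearest(p5, 4)]
    h, w = len(p3), len(p3[0])
    fused = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Adaptive weights: softmax over the three levels at this pixel.
            scores = [lg[y][x] for lg in logits]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            total = sum(exps)
            # Weighted fusion: the weights are positive and sum to 1.
            fused[y][x] = sum(e / total * lvl[y][x]
                              for e, lvl in zip(exps, levels))
    return fused
```

With equal logits every level contributes one third at each pixel; a large logit for one level makes the fused value follow that level almost exclusively, which is the sense in which the fusion ratio adapts per location.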
[0026] (3) Model training phase:
S40, the improved YOLOv13 network model is trained on the training set. During training, the SIoU loss function is used as the loss function of the detection head, and the ethnic costume target detection model is obtained after training is completed.
[0027] In this embodiment, the original CIoU loss function in the detection head of the traditional YOLOv13 network model is replaced with the SIoU loss function. The SIoU loss function, based on the calculation of the overlap between the predicted box and the ground truth box, introduces angle constraints and direction matching mechanisms, and comprehensively considers positional deviation, aspect ratio difference and direction consistency, thereby improving the accuracy of bounding box regression.
[0028] Specifically, the SIoU loss function during training includes:
S41, calculate the angle loss between the center point of the predicted box and the center point of the ground truth box, and train the model with the goal of minimizing the angle loss;
S42, when the angle loss is minimized, calculate the distance loss weight based on the angle loss, calculate the center point distance between the predicted box and the ground truth box based on the distance loss weight, and train the model with the goal of minimizing the center point distance;
S43, when the center point distance is minimized, calculate the aspect ratio difference between the predicted box and the ground truth box, and train the model with the goal of minimizing the aspect ratio difference;
S44, when the aspect ratio difference is minimized, calculate the intersection over union (IoU) between the predicted box and the ground truth box, and train the model with an IoU of 1 as the target.
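The four terms in S41 to S44 can be sketched as a single function. This follows one commonly used formulation of the SIoU loss, in which the angle, distance, shape and IoU terms are evaluated together in one expression; the text describes them as successive goals, and whether training actually proceeds in stages is not settled here. The shape exponent `theta=4.0` and the stabilizer `eps` are assumed values, and boxes are assumed to be `(x1, y1, x2, y2)` tuples.

```python
import math


def siou_loss(pred, gt, theta=4.0, eps=1e-9):
    """SIoU loss between two (x1, y1, x2, y2) boxes; 0 for identical boxes."""
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    pw, ph = px2 - px1, py2 - py1
    gw, gh = gx2 - gx1, gy2 - gy1

    # S44: intersection over union.
    iw = max(0.0, min(px2, gx2) - max(px1, gx1))
    ih = max(0.0, min(py2, gy2) - max(py1, gy1))
    inter = iw * ih
    iou = inter / (pw * ph + gw * gh - inter + eps)

    # Center offsets and the smallest box enclosing both boxes.
    dx = (gx1 + gx2) / 2 - (px1 + px2) / 2
    dy = (gy1 + gy2) / 2 - (py1 + py2) / 2
    cw = max(px2, gx2) - min(px1, gx1)
    ch = max(py2, gy2) - min(py1, gy1)

    # S41: angle cost, largest when the centers lie on a 45-degree diagonal.
    sigma = math.hypot(dx, dy)
    sin_alpha = min(abs(dy) / sigma, 1.0) if sigma > 0 else 0.0
    angle = 1 - 2 * math.sin(math.asin(sin_alpha) - math.pi / 4) ** 2

    # S42: distance cost, weighted by gamma derived from the angle cost.
    gamma = 2 - angle
    rho_x = (dx / (cw + eps)) ** 2
    rho_y = (dy / (ch + eps)) ** 2
    distance = (1 - math.exp(-gamma * rho_x)) + (1 - math.exp(-gamma * rho_y))

    # S43: shape cost from relative width and height differences.
    omega_w = abs(pw - gw) / (max(pw, gw) + eps)
    omega_h = abs(ph - gh) / (max(ph, gh) + eps)
    shape = ((1 - math.exp(-omega_w)) ** theta
             + (1 - math.exp(-omega_h)) ** theta)

    return 1 - iou + (distance + shape) / 2
```

The loss is close to zero for a perfect match and exceeds 1 for boxes that do not overlap, since the IoU term alone already contributes 1 in that case.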
[0029] It should be noted that the shape of targets in ethnic costume images is greatly affected by factors such as pose changes, partial occlusion, and natural deformation of the clothing. The original CIoU loss function is insufficient in handling target shape changes and localization accuracy, easily leading to bounding box jitter or localization offset, thus affecting detection results. After introducing the SIoU loss function, the model can converge faster during training and learn the shape and orientation information of ethnic costume targets more accurately, significantly improving the stability of bounding box localization and detection accuracy, especially in complex poses and partially occluded scenes.
[0030] In the technical solution provided in this embodiment, a P2 small target detection module is added to the detection head structure to enhance the model's ability to extract high-resolution detail features and significantly improve the detection accuracy of small-area parts in ethnic costumes; a multi-scale feature fusion module is introduced in the feature fusion stage to improve the model's ability to perceive multi-scale ethnic costume targets in complex backgrounds; and the SIoU loss function is used instead of the traditional CIoU loss function in the loss function to accelerate the model's convergence speed and improve the bounding box localization accuracy.
[0031] Second Embodiment Based on the first embodiment, in this embodiment, since the network model may predict multiple overlapping bounding boxes, non-maximum suppression (NMS) is required. NMS removes bounding boxes that have a high degree of overlap with the bounding box with the highest score, thereby ensuring that each ethnic costume is accurately labeled by only one bounding box.
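A minimal sketch of greedy non-maximum suppression as described above: boxes are visited in descending score order, and any remaining box that overlaps the current best box by more than a threshold is discarded. The threshold `iou_thresh=0.5` is an assumed value, and the sketch ignores class labels, which a multi-class detector would normally handle per class.

```python
def box_iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_thresh=0.5):
    """Return indices of the boxes kept, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)          # highest-scoring box still in play
        keep.append(best)
        # Drop every box that overlaps the kept box too strongly.
        order = [i for i in order
                 if box_iou(boxes[best], boxes[i]) <= iou_thresh]
    return keep
```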
[0032] Furthermore, as an implementation scheme, the embodiments of this application also relate to a target detection model for ethnic costumes, obtained by the method for constructing a target detection model for ethnic costumes as described above.
[0033] In this embodiment, Figure 4 illustrates the architecture of the ethnic costume object detection model. During detection, a pre-processed image of ethnic costume is input into the improved YOLOv13 network model, which extracts features from the image. In the backbone layer, convolution and pooling operations are performed layer by layer to extract key features from the image. The extracted multi-scale features are then fused and enhanced by the Neck layer. The Neck layer incorporates an ASFF module, which performs scale alignment and dynamic weighted fusion of features from different scales, and outputs the learned features to the head layer.
[0034] The head layer maps the learned features to the prediction task. The network model predicts the bounding boxes of ethnic costumes based on the extracted features. These bounding boxes mark the location and size of ethnic costumes that may exist in the image.
[0035] If multiple predicted boxes appear, non-maximum suppression (NMS) is performed. NMS will delete those predicted boxes that have a high degree of overlap with the highest-scoring bounding box, thus ensuring that each ethnic costume is accurately marked by only one bounding box. Finally, for each predicted bounding box, the network model classifies the target and provides a confidence score. The detection results, processed through these steps, are then output, including the category of each detected target, the location and size of the bounding box, and the confidence score.
[0036] Furthermore, as an implementation scheme, the embodiments of this application also involve an ethnic costume target detection model as described above, and its application in ethnic costume target detection.
[0037] Third Embodiment Furthermore, in order to verify the effect of the ethnic costume target detection model PAS-YOLOv13 involved in the embodiments of this application in the ethnic costume target detection scenario, comparative experiments and ablation experiments were conducted in this embodiment. The datasets used in the experiments were all self-built ethnic costume datasets.
[0038] In this embodiment, 1690 clear, high-quality images of ethnic costumes were retained after screening. The LabelMe annotation tool was used to annotate the 1690 images with rectangular boxes. The annotated ethnic costume dataset was divided into a training set, a validation set, and a test set in a ratio of 7:2:1, with 1183 images used as the training set, 169 images used as the validation set, and 338 images used as the test set.
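A seeded 7:2:1 split can be sketched as below. The function name, the seed and the use of integer division are assumptions for illustration. The function returns the three parts in ratio order, which for 1690 items gives sizes 1183, 338 and 169, and leaves it to the caller to decide which of the two smaller parts serves as the validation set and which as the test set.

```python
import random


def split_dataset(items, ratios=(7, 2, 1), seed=0):
    """Shuffle reproducibly and cut into three parts following `ratios`."""
    items = list(items)
    random.Random(seed).shuffle(items)   # local RNG, global state untouched
    total = sum(ratios)
    n_first = len(items) * ratios[0] // total
    n_second = len(items) * ratios[1] // total
    first = items[:n_first]
    second = items[n_first:n_first + n_second]
    third = items[n_first + n_second:]   # remainder absorbs rounding
    return first, second, third
```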
[0039] The experimental environment for this application is shown in Table 1 below: Table 1. Experimental Environment
[0040] In this embodiment, the comparative experimental results are shown in Table 2 below: Table 2 Comparison of experimental results
[0041] As can be seen, the improved algorithm in this application has improved in terms of precision, recall, mAP@50 and mAP@95 compared with YOLOv5, YOLOv10, YOLOv11 and YOLOv12, and has better performance in terms of parameter number, model size and computational cost.
[0042] Furthermore, in this embodiment, the ablation experiment results are shown in Table 3 below: Table 3. Module ablation experimental results
[0043] The ablation experiment uses YOLOv13n as the baseline and evaluates three improvements: the P2 small target detection layer, the ASFF adaptive spatial feature fusion module, and the SIoU loss function. Adding only the ASFF module at the end of the neck network improved mAP@50 from 71.3% to 75.7%, an increase of 4.4 percentage points, and mAP@95 from 58% to 60.1%, an increase of 2.1 percentage points. Changing only the loss function of the detection head from CIoU to SIoU improved mAP@50 from 71.3% to 76.9%, an increase of 5.6 percentage points, and mAP@95 from 58% to 62%, an increase of 4 percentage points. With only the P2 shallow detection branch added, mAP@50 reached 76.1%, an increase of 4.8 percentage points compared to the baseline, and mAP@95 reached 63.8%, an increase of 5.8 percentage points compared to the baseline. With both the P2 shallow detection branch and the SIoU loss function added, the mAP@50 was 78.6%, an increase of 7.3 percentage points compared to the baseline, and the mAP@95 was 62%, an increase of 4 percentage points compared to the baseline. When an ASFF module was added to the end of the neck network and the loss function of the detection head was changed from CIoU to SIoU, the mAP@50 reached 79.6%, an increase of 8.3 percentage points compared to the baseline, and the mAP@95 reached 63.4%, an increase of 5.4 percentage points compared to the baseline. With the addition of the P2 shallow detection branch and the ASFF module, the mAP@50 was 79.1%, an increase of 7.8 percentage points compared to the baseline, and the mAP@95 was 64.4%, an increase of 6.4 percentage points compared to the baseline. Further changing the loss function of the detection head from CIoU to SIoU improved the mAP@50 to 80.4%, a total increase of 9.1 percentage points compared to the baseline, and the mAP@95 to 64.6%, a total increase of 6.6 percentage points compared to the baseline. These results verify the effectiveness of each improved module in this application and the feasibility of the overall improvement scheme.
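The gains quoted in this paragraph can be recomputed from the stated mAP values. The short sketch below only tabulates the numbers given in the text (it does not reproduce Table 3, whose other columns are not available here); the configuration names used as dictionary keys are labels chosen for this example.

```python
# (mAP@50, mAP@95) in percent, as stated in the paragraph above.
BASELINE = (71.3, 58.0)
RESULTS = {
    "ASFF": (75.7, 60.1),
    "SIoU": (76.9, 62.0),
    "P2": (76.1, 63.8),
    "P2+SIoU": (78.6, 62.0),
    "ASFF+SIoU": (79.6, 63.4),
    "P2+ASFF": (79.1, 64.4),
    "P2+ASFF+SIoU": (80.4, 64.6),
}


def gains(results, baseline):
    """Percentage-point gain of each configuration over the baseline."""
    return {name: (round(m50 - baseline[0], 1), round(m95 - baseline[1], 1))
            for name, (m50, m95) in results.items()}


if __name__ == "__main__":
    for name, (g50, g95) in gains(RESULTS, BASELINE).items():
        print(f"{name:14s} mAP@50 +{g50:.1f}  mAP@95 +{g95:.1f}")
```

Recomputing the differences this way gives the same percentage-point gains as those quoted in the text for all seven configurations.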
[0044] Although preferred embodiments of this application have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments as well as all changes and modifications falling within the scope of this application.
[0045] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.
Claims
1. A method for constructing a target detection model for ethnic costumes, characterized in that the method for constructing the ethnic costume target detection model includes the following steps: (1) Dataset construction phase: S10, collect ethnic costume image data, and after preprocessing the ethnic costume image data, label the target ethnic costume type corresponding to each image, and divide the training set from the labeled image data; (2) Improved YOLOv13 model construction phase: S20, four small target detection modules of different scales, P2, P3, P4 and P5, are set at the detection head of the YOLOv13 network model, where the P2 small target detection module extracts high-resolution information from the shallow features of the backbone network and inputs it into the detection head after channel number alignment; S30, a multi-scale feature fusion module is set between the output end of the YOLOv13 network model and the detection head to perform scale alignment, adaptive weight calculation and feature weighted fusion of the P3, P4 and P5 small target detection modules, thereby completing the construction of the improved YOLOv13 model; (3) Model training phase: S40, the improved YOLOv13 network model is trained on the training set, the SIoU loss function is used as the loss function of the detection head during training, and the ethnic costume target detection model is obtained after training is completed.
2. The method for constructing the ethnic costume target detection model as described in claim 1, characterized in that the ethnic costume image data must meet the following constraints during acquisition: it includes full-body images of the clothing; close-up images of patterns, structures, and materials; and close-up images of key parts, including headdresses, embroidery, and belts.
3. The method for constructing the ethnic costume target detection model as described in claim 1, characterized in that, The ethnic costume image data includes the costumes of 15 ethnic groups, including Bai, Naxi, Hani, Wa, Bulang, Dai, Li, Lahu, Jingpo, Pumi, Achang, Nu, Jino, De'ang, and Dulong.
4. The method for constructing the ethnic costume target detection model as described in claim 1, characterized in that the SIoU loss function during training includes: S41, calculate the angle loss between the center point of the predicted box and the center point of the ground truth box, and train the model with the goal of minimizing the angle loss; S42, when the angle loss is minimized, calculate the distance loss weight based on the angle loss, calculate the center point distance between the predicted box and the ground truth box based on the distance loss weight, and train the model with the goal of minimizing the center point distance; S43, when the center point distance is minimized, calculate the aspect ratio difference between the predicted box and the ground truth box, and train the model with the goal of minimizing the aspect ratio difference; S44, when the aspect ratio difference is minimized, calculate the intersection over union (IoU) between the predicted box and the ground truth box, and train the model with an IoU of 1 as the target.
5. The method for constructing the ethnic costume target detection model as described in claim 1 or 4, characterized in that, After step S40, the method further includes: When multiple prediction boxes with the same score appear, non-maximum suppression is used to select the target prediction box as the unique bounding box from the multiple prediction boxes.
6. A method for constructing a target detection model for ethnic costumes as described in any one of claims 1 to 4, wherein the resulting target detection model for ethnic costumes is...
7. The application of the ethnic costume target detection model as described in claim 6 in the detection of ethnic costume targets.