Semantic acquisition method for traffic lights, electronic device, and storage medium

Traffic light video clips are collected on autonomous vehicles, and single-light semantics are recognized by combining 3D position tracking with 2D detection boxes. This addresses the low efficiency and large errors of manual annotation and enables fast, accurate acquisition of traffic light semantics.

CN122244822A · Pending · Publication Date: 2026-06-19 · 安徽蔚来智驾科技有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
安徽蔚来智驾科技有限公司
Filing Date
2024-12-18
Publication Date
2026-06-19


Abstract

This application relates to the field of autonomous driving technology, specifically providing a method, electronic device, and storage medium for semantic acquisition of traffic lights, aiming to solve the problem of quickly and accurately acquiring the semantic information of traffic lights in images. The method provided in this application includes tracking individual lights in each frame of a video clip of traffic lights to obtain the tracking ID of each light; detecting 2D detection boxes of individual lights in the images, performing semantic recognition on the image regions where the 2D detection boxes are located to obtain the initial semantic information of each light; associating the tracking ID and initial semantic information of the same light in the image; treating each light indicated by a tracking ID as a target light, acquiring the initial semantic information of the target light in each frame of the images based on the tracking ID of the target light, and smoothing the initial semantic information to obtain the final semantic information. Based on the above method, the semantic information of individual lights in each frame of the video clip can be quickly, accurately, and automatically identified.

Description

Technical Field

[0001] This application relates to the field of autonomous driving technology, and specifically to a semantic acquisition method for traffic lights, an electronic device, and a storage medium.

Background Art

[0002] Traffic lights at intersections primarily use the color of individual lights to indicate the passage status for each traffic direction (such as straight ahead, left turn, or right turn). For example, a green light indicates that passage is permitted, while a red light indicates that passage is prohibited. When controlling an autonomous vehicle at an intersection, it is crucial to accurately obtain the passage status indicated by the traffic light for each direction and then make driving decisions based on that status. For instance, if the vehicle's planned trajectory goes straight through the intersection but passage in that direction is prohibited, the vehicle must stop at the stop line before the intersection and wait until passage is permitted before proceeding straight through.

[0003] To obtain the traffic light status for each direction accurately and ensure vehicle safety, images of traffic lights can be captured by cameras on the vehicle. A perception model then processes these images to identify the traffic lights and determine the status (e.g., color) of the light for each direction, from which the passage status is determined. For example, if the indicator for a direction is green, passage is permitted; if it is red, passage is prohibited. Perception models are typically neural networks that require extensive training data to identify traffic light status accurately from images. The training data consists of traffic light images and their annotations, including the status of the light indicating each specified direction. Currently, semantic information such as color and countdown numbers is obtained from traffic lights by manual annotation. However, manual annotation is inefficient and error-prone, which reduces the accuracy of the annotations. Training data obtained through manual annotation therefore not only slows down training of the perception model but may also degrade its performance, making it difficult to accurately perceive the status of traffic lights that indicate traffic directions in an image.

[0004] Accordingly, a new technical solution is needed in this field to solve the above problems.

Summary of the Invention

[0005] In order to overcome the above-mentioned deficiencies, this application is proposed to solve or at least partially solve the following technical problem: how to quickly and accurately obtain the semantic information of traffic lights in an image.

[0006] In a first aspect, a semantic acquisition method for traffic lights is provided, comprising:

[0007] Acquire video clips of traffic lights captured by cameras on vehicles;

[0008] Obtain the 3D position of a single light in each frame of the video segment, track the single light in each frame of the video segment based on the 3D position, and obtain the tracking ID of the single light in each frame of the video segment.

[0009] Detect the 2D detection box of a single light in the image, perform single-light semantic recognition on the image region where the 2D detection box is located, and obtain the initial semantic information of the single light;

[0010] The tracking ID and initial semantic information of the same single light in the image are associated;

[0011] Take the single light indicated by each tracking ID as a target single light, and obtain the initial semantic information of the target single light in each frame image according to its tracking ID;

[0012] Smooth the initial semantic information of the target single light in each frame image to obtain the final semantic information of the target single light in each frame image.
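As an illustration only, the Python sketch below shows one way the associated per-frame observations might be regrouped into a timeline per tracking ID, which is the input the smoothing step operates on. The `LightObservation` record and `group_by_track` helper are hypothetical names; only the color is carried for brevity, and frames in which a light was not observed are marked `None` (an assumption, since the text does not say how gaps are represented).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class LightObservation:
    """Hypothetical record: one single light after ID/semantics association."""
    frame: int      # index of the time frame within the video segment
    track_id: int   # tracking ID obtained from 3D-position tracking
    color: str      # initial color recognized inside the 2D detection box


def group_by_track(observations: List[LightObservation],
                   num_frames: int) -> Dict[int, List[Optional[str]]]:
    """Return, per tracking ID, the initial color in every frame (None if unseen)."""
    timelines: Dict[int, List[Optional[str]]] = defaultdict(lambda: [None] * num_frames)
    for obs in observations:
        timelines[obs.track_id][obs.frame] = obs.color
    return dict(timelines)


obs = [LightObservation(0, 7, "red"), LightObservation(1, 7, "red"),
       LightObservation(1, 9, "green")]
print(group_by_track(obs, 3))  # {7: ['red', 'red', None], 9: [None, 'green', None]}
```

Each resulting timeline belongs to one target single light and can be smoothed independently of the others.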

[0013] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the association between the tracking ID and initial semantic information of the same single light in the image includes:

[0014] Project the 3D position of a single light in the image onto the image coordinate system to obtain its 2D position, and match the 2D position with the 2D detection boxes of single lights in the image;

[0015] If a match is successful, the tracking ID of the single light corresponding to the 3D position and the initial semantic information of the single light corresponding to the 2D detection box are associated.
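The text does not fix a projection model or a matching criterion, so the following is only a sketch under stated assumptions: the 3D positions are already expressed in camera coordinates, the camera follows an undistorted pinhole model with intrinsics `fx`, `fy`, `cx`, `cy`, and a projected point matches a 2D box when it falls inside it (nearest box center wins, each box used at most once). The function names `project_point` and `associate` are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def project_point(p_cam: Tuple[float, float, float], fx: float, fy: float,
                  cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Pinhole projection of a 3D point in camera coordinates (Z pointing forward)."""
    x, y, z = p_cam
    if z <= 0:  # behind the camera: no valid projection
        return None
    return fx * x / z + cx, fy * y / z + cy


def associate(tracks: Dict[int, Tuple[float, float, float]], boxes: List[Box],
              fx: float, fy: float, cx: float, cy: float) -> Dict[int, int]:
    """Map tracking ID -> index of the 2D box containing its projected 3D position."""
    matches: Dict[int, int] = {}
    used: Set[int] = set()
    for track_id in sorted(tracks):
        uv = project_point(tracks[track_id], fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = uv
        best, best_d2 = None, float("inf")
        for i, (x0, y0, x1, y1) in enumerate(boxes):
            if i in used or not (x0 <= u <= x1 and y0 <= v <= y1):
                continue
            # Among containing boxes, prefer the one whose center is closest.
            d2 = (u - (x0 + x1) / 2) ** 2 + (v - (y0 + y1) / 2) ** 2
            if d2 < best_d2:
                best, best_d2 = i, d2
        if best is not None:  # successful match: link tracking ID and box semantics
            matches[track_id] = best
            used.add(best)
    return matches
```

A successful match lets the tracking ID inherit the initial semantic information recognized in the matched box; an IoU- or distance-threshold rule could be substituted for the containment test.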

[0016] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the semantic information includes the color of a single light, and the method includes smoothing the initial color of the target single light in each frame image by means of the following:

[0017] Obtain, based on the initial color, a first candidate time frame at which the color changes abruptly, wherein the initial color of the target single light in the first candidate time frame differs from that in the next time frame;

[0018] Obtain the abrupt color corresponding to the first candidate time frame, wherein the abrupt color is the initial color of the target single light in the time frame immediately following the first candidate time frame;

[0019] Obtain a first time window of a first preset duration following the first candidate time frame;

[0020] Count the occurrences of the abrupt color within the first time window and the number of time frames in the first time window; if the number of time frames minus the number of occurrences is less than a first difference threshold, take the first candidate time frame as a first trusted time frame;

[0021] Smooth the initial color based on the first trusted time frame.
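The candidate and trusted-frame test of [0017] to [0020] can be sketched as follows. Assumptions: the first preset duration has already been converted into a window length in frames (`window`), the window starts at the frame right after the candidate and is simply truncated at the end of the clip, and `trusted_change_frames` is a hypothetical name.

```python
from typing import Hashable, List, Sequence


def trusted_change_frames(values: Sequence[Hashable], window: int,
                          diff_threshold: int) -> List[int]:
    """Return candidate frames whose abrupt change persists through the following window."""
    trusted: List[int] = []
    for t in range(len(values) - 1):
        if values[t] == values[t + 1]:
            continue  # not a candidate: no abrupt change right after frame t
        abrupt = values[t + 1]                  # the abrupt color
        win = values[t + 1 : t + 1 + window]    # frames of the window after frame t
        occurrences = sum(1 for v in win if v == abrupt)
        # Few frames disagree with the abrupt color -> the change is genuine.
        if len(win) - occurrences < diff_threshold:
            trusted.append(t)
    return trusted


colors = ["red"] * 5 + ["green"] + ["red"] * 2 + ["green"] * 6
print(trusted_change_frames(colors, window=4, diff_threshold=2))  # [7]
```

The single spurious "green" at frame 5 produces candidates at frames 4 and 5, but neither survives the window test; only the lasting red-to-green change after frame 7 is kept.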

[0022] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the step of smoothing the initial color based on the first trusted time frame includes:

[0023] Obtain the stable color corresponding to the first trusted time frame, wherein the stable color is the initial color of the target single light in the first time frame after the first trusted time frame;

[0024] The color conversion information of the first trusted time frame is obtained based on the stable color, and the color conversion information is used to indicate that the final color of the target single light after the first trusted time frame is the stable color.

[0025] Determine the final color of the target single light in each frame image based on the color conversion information.

[0026] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the step of determining the final color of the target single light in each frame image according to the color conversion information includes: when there are multiple first trusted time frames, determining the final color of the target single light in each frame image according to the color conversion information of each first trusted time frame in the order from first to last.
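A minimal sketch of [0023] to [0026], assuming the first trusted time frames are given in ascending order: after each trusted frame `t`, the final color becomes the stable color `initial[t + 1]` until the next trusted frame takes over. The text does not say what the final color is before the first trusted frame; here that prefix is assumed to take its most frequent initial color. `apply_color_conversions` is a hypothetical name.

```python
from collections import Counter
from typing import List, Sequence


def apply_color_conversions(initial: Sequence[str], trusted: Sequence[int]) -> List[str]:
    """Final color per frame from the color conversion info of sorted trusted frames."""
    n = len(initial)
    if n == 0:
        return []
    if not trusted:
        # Assumption: with no trusted change, the whole clip keeps its most frequent color.
        return [Counter(initial).most_common(1)[0][0]] * n
    first = trusted[0]
    # Assumption: frames up to the first trusted frame take that prefix's most frequent color.
    final = [Counter(initial[: first + 1]).most_common(1)[0][0]] * (first + 1)
    bounds = list(trusted) + [n - 1]
    # Trusted frames are applied from first to last: after frame t the final color is
    # the stable color initial[t + 1] until the next trusted frame.
    for t, t_next in zip(bounds, bounds[1:]):
        stable = initial[t + 1]
        final.extend([stable] * (t_next - t))
    return final


colors = ["red"] * 5 + ["green"] + ["red"] * 2 + ["green"] * 6
print(apply_color_conversions(colors, [7]))
```

With trusted frame 7, the isolated misread at frame 5 is overwritten and the output is eight "red" frames followed by six "green" frames.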

[0027] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the method further includes obtaining the flashing state of the target single light based on the final color of the target single light in each frame image in the following manner:

[0028] Based on the tracking ID of the target single light, obtain the final color of the target single light in each frame image, and perform flash detection on the final color to obtain the flashing period;

[0029] Each time frame within the video segment is taken as a target time frame, and the flashing state of the target single light in the target time frame is obtained according to the flashing period.

[0030] If the target time frame falls within the flashing period, the flashing state is flashing; if the target time frame does not fall within the flashing period, the flashing state is not flashing.

[0031] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the step of performing flash detection on the final color to obtain the flashing period includes:

[0032] Match the final color of the target single light against a preset flashing template to obtain a color matching period, and use the color matching period as the flashing period;

[0033] The flashing template includes M template colors arranged in sequence, each template color being a color that a single light shows while flashing, with M < N, where N is the total number of time frames in the video segment; the color matching period consists of M consecutive time frames, and the final colors of the target single light in the 1st to Mth of these time frames are the same as the 1st to Mth template colors of the flashing template.
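The template matching of [0032] and [0033] and the per-frame flashing state of [0029] and [0030] can be sketched as a sliding exact-match search. The template contents (for example alternating "green" and "off") and the helper names `find_flash_periods` and `flash_states` are assumptions for illustration; the text only requires M template colors with M < N.

```python
from typing import List, Sequence, Tuple


def find_flash_periods(final_colors: Sequence[str],
                       template: Sequence[str]) -> List[Tuple[int, int]]:
    """Inclusive (start, end) frame ranges whose colors equal the template exactly."""
    m, n = len(template), len(final_colors)
    if not 0 < m < n:
        raise ValueError("template length M must satisfy 0 < M < N")
    periods: List[Tuple[int, int]] = []
    for start in range(n - m + 1):
        if list(final_colors[start : start + m]) == list(template):
            periods.append((start, start + m - 1))
    return periods


def flash_states(n_frames: int, periods: Sequence[Tuple[int, int]]) -> List[bool]:
    """Flashing state per frame: True inside any flashing period, else False."""
    state = [False] * n_frames
    for start, end in periods:
        for t in range(start, end + 1):
            state[t] = True
    return state


colors = ["green", "green", "green", "off", "green", "off", "yellow", "yellow"]
periods = find_flash_periods(colors, ["green", "off", "green", "off"])
print(periods)                            # [(2, 5)]
print(flash_states(len(colors), periods))
```

Overlapping matches are all kept, so a long flashing run is covered by the union of its matched windows.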

[0034] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, when the target single light is a countdown light, the semantic information further includes countdown numbers, a countdown light being a single light whose lamp head displays numerals. The method includes smoothing the initial numbers of the target single light in each frame image as follows:

[0035] Based on the tracking ID of the target single light, obtain the final color of the target single light in each frame image;

[0036] Divide the video segment into multiple color-independent time periods based on the final color, wherein the final color of the target single light is the same throughout each color-independent time period and differs between any two adjacent color-independent time periods;

[0037] Use the Z-score method to detect outliers among the initial numbers of the target single light within each color-independent time period, obtaining the abnormal numbers of that period;

[0038] Use Lowess smoothing to smooth the abnormal numbers based on the normal numbers within the color-independent time period, where the normal numbers are the initial numbers in that period other than the abnormal numbers;

[0039] Take the countdown numbers of the target single light in all color-independent time periods, after this smoothing, as the optimized numbers of the target single light in each frame of the video segment;
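A sketch of [0036] to [0039] in plain Python, under several assumptions: outliers are values whose absolute Z-score (population standard deviation) exceeds `z_thresh`; the Lowess step is a single-pass local linear fit with tricube weights over a fraction `frac` of the normal points (no robustness iterations); and smoothed values are rounded to integers because countdown numbers are whole. The thresholds and all function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def color_periods(final_colors: Sequence[str]) -> List[Tuple[int, int]]:
    """Maximal runs of one final color, as half-open [start, end) frame ranges."""
    periods: List[Tuple[int, int]] = []
    start = 0
    for t in range(1, len(final_colors) + 1):
        if t == len(final_colors) or final_colors[t] != final_colors[start]:
            periods.append((start, t))
            start = t
    return periods


def zscore_outliers(values: Sequence[float], z_thresh: float) -> List[int]:
    """Indices whose |value - mean| / std exceeds z_thresh (population std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / std > z_thresh]


def lowess_at(x0: float, xs: Sequence[float], ys: Sequence[float], frac: float) -> float:
    """Single-pass LOWESS estimate at x0: tricube-weighted local linear regression."""
    q = min(len(xs), max(2, math.ceil(frac * len(xs))))
    h = sorted(abs(x - x0) for x in xs)[q - 1]  # bandwidth: distance to q-th nearest point
    w = [(1 - (abs(x - x0) / h) ** 3) ** 3 if abs(x - x0) < h else 0.0 for x in xs]
    sw = sum(w)
    if sw == 0:  # degenerate neighborhood: fall back to the nearest normal point
        return ys[min(range(len(xs)), key=lambda i: abs(xs[i] - x0))]
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return my + slope * (x0 - mx)


def smooth_countdown(final_colors: Sequence[str], numbers: Sequence[int],
                     z_thresh: float = 2.0, frac: float = 0.6) -> List[int]:
    """Replace Z-score outliers inside each color-independent period by LOWESS estimates."""
    optimized = list(numbers)
    for start, end in color_periods(final_colors):
        seg = list(numbers[start:end])
        outliers = set(zscore_outliers(seg, z_thresh)) if len(seg) >= 3 else set()
        normal = [i for i in range(len(seg)) if i not in outliers]
        if not outliers or len(normal) < 2:
            continue
        xs = [float(start + i) for i in normal]
        ys = [float(seg[i]) for i in normal]
        for i in outliers:
            optimized[start + i] = round(lowess_at(float(start + i), xs, ys, frac))
    return optimized


digits = [10, 9, 8, 7, 6, 55, 4, 3, 2, 1]  # frame 5 misread as 55
print(smooth_countdown(["red"] * 10, digits))
```

Working per color-independent period keeps the reset of the countdown at a color change from being mistaken for an outlier; statsmodels' `lowess` could replace the hand-written estimator where that dependency is acceptable.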

[0040] Obtain the final numbers of the target single light in each frame image based on the optimized numbers.

[0041] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the step of obtaining the final number of the target single light in each frame image based on the optimized number includes:

[0042] Obtain, based on the optimized numbers, a second candidate time frame at which the number changes abruptly, wherein the optimized number of the target single light in the second candidate time frame differs from that in the next time frame;

[0043] Obtain the abrupt number corresponding to the second candidate time frame, where the abrupt number is the optimized number of the target single light in the time frame immediately following the second candidate time frame;

[0044] Obtain a second time window of a second preset duration following the second candidate time frame;

[0045] Count the occurrences of the abrupt number within the second time window and the number of time frames in the second time window; if the number of time frames minus the number of occurrences is less than a second difference threshold, take the second candidate time frame as a second trusted time frame;

[0046] Smooth the optimized numbers again based on the second trusted time frame to obtain the final numbers of the target single light.

[0047] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the step of smoothing the optimized numbers again based on the second trusted time frame includes:

[0048] Add the optimized number of the target single light in each second trusted time frame to the time value of that second trusted time frame to obtain a sum value for each second trusted time frame;

[0049] Compute the average of all the sum values, and subtract the time value of each time frame from this average to obtain the ideal number of the target single light in that time frame, where each time frame is a time frame within the video segment;

[0050] Take each time frame within the video segment as a target time frame, and obtain the deviation between the optimized number and the ideal number of the target single light in the target time frame;

[0051] If the absolute value of the deviation is less than a preset deviation threshold, the final number is obtained based on the optimized number; if the absolute value of the deviation is greater than or equal to the preset deviation threshold, the final number is obtained based on the ideal number.
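The second smoothing pass of [0048] to [0051] rests on the observation that a countdown number plus the current time should stay roughly constant, so the average sum estimates when the countdown ends. The sketch below assumes per-frame time values in seconds, takes the second trusted time frames as input (they would come from the same window test as in [0042] to [0045], applied to the optimized numbers), and rounds the ideal number when it replaces an optimized one. `refine_countdown` is a hypothetical name.

```python
from typing import List, Sequence


def refine_countdown(optimized: Sequence[int], times: Sequence[float],
                     trusted: Sequence[int], dev_threshold: float) -> List[int]:
    """Final countdown numbers from optimized numbers and second trusted frames."""
    if not trusted:
        return list(optimized)  # nothing to anchor the ideal countdown on
    # Sum value at each trusted frame: optimized number + time value.
    sums = [optimized[t] + times[t] for t in trusted]
    # The average sum approximates the countdown's end time; ideal = average - time.
    end_time = sum(sums) / len(sums)
    ideal = [end_time - t_s for t_s in times]
    final: List[int] = []
    for opt, ide in zip(optimized, ideal):
        # Keep the optimized number unless it deviates too far from the ideal one.
        final.append(opt if abs(opt - ide) < dev_threshold else round(ide))
    return final


optimized = [10, 9, 8, 7, 6, 5, 4, 8, 2, 1]   # frame 7 stuck at a wrong value
times = [float(i) for i in range(10)]         # one frame per second, for simplicity
print(refine_countdown(optimized, times, trusted=[0, 1, 2, 3], dev_threshold=1.5))
```

Here the sums at the trusted frames are all 10, so the ideal number at frame 7 is 3 and replaces the stuck value 8.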

[0052] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the camera includes a first camera and a second camera, the first camera has a higher priority than the second camera, the video segment includes a first video segment captured by the first camera and a second video segment captured by the second camera, and the method further includes:

[0053] Obtain a first semantic result of the first video segment, wherein the first semantic result includes the first semantic information of the target single light in the first video segment;

[0054] Obtain the second semantic result of the second video segment, the second semantic result including the second semantic information of the target single light in the second video segment;

[0055] Each time frame within the video segment is taken as a target time frame, and the first semantic information and the second semantic information of the target single light in the target time frame are fused.

[0056] The first video segment and the second video segment cover the same time frames.

[0057] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, the fusion of the first semantic information and the second semantic information of the target single light in the target time frame includes:

[0058] When the first semantic result does not contain first semantic information of the target single light in the target time frame but the second semantic result contains second semantic information of the target single light in the target time frame, add the second semantic information to the first semantic result;

[0059] When the first semantic result contains the first semantic information and the second semantic result contains the second semantic information, and the two differ, overwrite the second semantic information with the first semantic information;

[0060] When the first semantic result contains the first semantic information and the second semantic result does not contain the second semantic information, the first semantic information is added to the second semantic result.
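The three fusion rules of [0058] to [0060] amount to a priority merge in which the first camera wins on conflicts and each result fills the other's gaps. The sketch below assumes each semantic result is a dictionary keyed by (time-frame index, tracking ID) and that the same target single light carries the same ID in both results; `fuse_results` is a hypothetical name.

```python
from typing import Dict, Tuple

Key = Tuple[int, int]  # assumption: (time-frame index, tracking ID of the target single light)


def fuse_results(first: Dict[Key, str],
                 second: Dict[Key, str]) -> Tuple[Dict[Key, str], Dict[Key, str]]:
    """Return fused copies of the first (higher-priority) and second semantic results."""
    fused_first, fused_second = dict(first), dict(second)
    for key in set(first) | set(second):
        if key not in first:
            fused_first[key] = second[key]    # [0058]: fill the gap in the first result
        elif key not in second:
            fused_second[key] = first[key]    # [0060]: fill the gap in the second result
        elif first[key] != second[key]:
            fused_second[key] = first[key]    # [0059]: the first camera takes priority
    return fused_first, fused_second


first = {(0, 1): "red", (1, 1): "red"}
second = {(1, 1): "green", (2, 1): "green"}
print(fuse_results(first, second))
```

After fusion both results agree on every frame covered by either camera.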

[0061] In one technical solution of the above-mentioned semantic acquisition method for traffic lights, there are multiple second cameras, each second camera acquiring a second video segment, and the step of adding the second semantic information to the first semantic result includes:

[0062] If the second semantic results of multiple second video segments all contain second semantic information, vote over all of this second semantic information and add the most frequently occurring value to the first semantic result.
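With several second cameras, the gap-filling rule of [0058] becomes a majority vote. In this sketch the results are again assumed to be dictionaries keyed by (time-frame index, tracking ID), and ties, which the text leaves open, go to the value seen first in camera order because `Counter.most_common` orders equal counts by first occurrence. `fill_from_second_cameras` is a hypothetical name.

```python
from collections import Counter
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # assumption: (time-frame index, tracking ID)


def fill_from_second_cameras(first: Dict[Key, str],
                             seconds: List[Dict[Key, str]]) -> Dict[Key, str]:
    """Fill gaps in the first result with the majority value across second results."""
    fused = dict(first)
    for key in set().union(*seconds):
        if key in fused:
            continue  # the first camera already has semantics for this frame and light
        votes = Counter(result[key] for result in seconds if key in result)
        fused[key] = votes.most_common(1)[0][0]
    return fused


first = {(0, 1): "red"}
seconds = [{(0, 1): "green", (1, 1): "green"}, {(1, 1): "red"}, {(1, 1): "green"}]
print(fill_from_second_cameras(first, seconds))
```

Frame 0 keeps the first camera's "red"; frame 1, missing from the first result, receives "green" by a two-to-one vote.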

[0063] In a second aspect, an electronic device is provided, including at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores a computer program that, when executed by the at least one processor, implements the method described in any of the technical solutions provided in the first aspect.

[0064] In a third aspect, a computer-readable storage medium is provided, wherein a plurality of program codes are stored therein, the program codes being adapted to be loaded and run by a processor to perform the method described in any of the technical solutions provided in the first aspect above.

[0065] Solution 1. A method for semantic acquisition of traffic lights, characterized in that the method includes:

[0066] Acquire video clips of traffic lights captured by cameras on vehicles;

[0067] Obtain the 3D position of a single light in each frame of the video segment, track the single light in each frame of the video segment based on the 3D position, and obtain the tracking ID of the single light in each frame of the video segment.

[0068] Detect the 2D detection box of a single light in the image, perform single-light semantic recognition on the image region where the 2D detection box is located, and obtain the initial semantic information of the single light;

[0069] The tracking ID and initial semantic information of the same single light in the image are associated;

[0070] Take the single light indicated by each tracking ID as a target single light, and obtain the initial semantic information of the target single light in each frame image according to its tracking ID;

[0071] Smooth the initial semantic information of the target single light in each frame image to obtain the final semantic information of the target single light in each frame image.

[0072] Solution 2. The method according to Solution 1, characterized in that, associating the tracking ID and initial semantic information of the same single light in the image includes:

[0073] Project the 3D position of a single light in the image onto the image coordinate system to obtain its 2D position;

[0074] The 2D position is matched with the 2D detection box of a single light in the image;

[0075] If a match is successful, the tracking ID of the single light corresponding to the 3D position and the initial semantic information of the single light corresponding to the 2D detection box are associated.

[0076] Solution 3. The method according to Solution 1, characterized in that the semantic information includes the color of a single light, and the method includes smoothing the initial color of the target single light in each frame image by means of:

[0077] Obtain, based on the initial color, a first candidate time frame at which the color changes abruptly, wherein the initial color of the target single light in the first candidate time frame differs from that in the next time frame;

[0078] Obtain the abrupt color corresponding to the first candidate time frame, wherein the abrupt color is the initial color of the target single light in the time frame immediately following the first candidate time frame;

[0079] Obtain a first time window of a first preset duration following the first candidate time frame;

[0080] Count the occurrences of the abrupt color within the first time window and the number of time frames in the first time window; if the number of time frames minus the number of occurrences is less than a first difference threshold, take the first candidate time frame as a first trusted time frame;

[0081] Smooth the initial color based on the first trusted time frame.

[0082] Solution 4. The method according to Solution 3, characterized in that smoothing the initial color based on the first trusted time frame includes:

[0083] Obtain the stable color corresponding to the first trusted time frame, wherein the stable color is the initial color of the target single light in the first time frame after the first trusted time frame;

[0084] The color conversion information of the first trusted time frame is obtained based on the stable color, and the color conversion information is used to indicate that the final color of the target single light after the first trusted time frame is the stable color.

[0085] Determine the final color of the target single light in each frame image based on the color conversion information.

[0086] Solution 5. The method according to Solution 4, characterized in that determining the final color of the target single light in each frame image based on the color conversion information includes:

[0087] When there are multiple first trusted time frames, determine the final color of the target single light in each frame image sequentially according to the color conversion information of each first trusted time frame, in order from first to last.

[0088] Solution 6. The method according to any one of Solutions 3 to 5, characterized in that the method further includes obtaining the flashing state of the target single light based on the final color of the target single light in each frame image by means of the following:

[0089] Based on the tracking ID of the target single light, obtain the final color of the target single light in each frame image, and perform flash detection on the final color to obtain the flashing period;

[0090] Each time frame within the video segment is taken as a target time frame, and the flashing state of the target single light in the target time frame is obtained according to the flashing period.

[0091] If the target time frame falls within the flashing period, the flashing state is flashing; if the target time frame does not fall within the flashing period, the flashing state is not flashing.

[0092] Solution 7. The method according to Solution 6, characterized in that performing flash detection on the final color to obtain the flashing period includes:

[0093] Match the final color of the target single light against a preset flashing template to obtain a color matching period, and use the color matching period as the flashing period;

[0094] The flashing template includes M template colors arranged in sequence, each template color being a color that a single light shows while flashing, with M < N, where N is the total number of time frames in the video segment; the color matching period consists of M consecutive time frames, and the final colors of the target single light in the 1st to Mth of these time frames are the same as the 1st to Mth template colors of the flashing template.

[0095] Solution 8. The method according to any one of Solutions 3 to 5, characterized in that, when the target single light is a countdown light, the semantic information further includes countdown numbers, a countdown light being a single light whose lamp head displays numerals, and the method includes smoothing the initial numbers of the target single light in each frame image by:

[0096] Based on the tracking ID of the target single light, obtain the final color of the target single light in each frame image;

[0097] Divide the video segment into multiple color-independent time periods based on the final color, wherein the final color of the target single light is the same throughout each color-independent time period and differs between any two adjacent color-independent time periods;

[0098] Use the Z-score method to detect outliers among the initial numbers of the target single light within each color-independent time period, obtaining the abnormal numbers of that period;

[0099] Use Lowess smoothing to smooth the abnormal numbers based on the normal numbers within the color-independent time period, where the normal numbers are the initial numbers in that period other than the abnormal numbers;

[0100] Take the countdown numbers of the target single light in all color-independent time periods, after this smoothing, as the optimized numbers of the target single light in each frame of the video segment;

[0101] Obtain the final numbers of the target single light in each frame image based on the optimized numbers.

[0102] Solution 9. The method according to Solution 8, characterized in that obtaining the final numbers of the target single light in each frame image based on the optimized numbers includes:

[0103] Obtain, based on the optimized numbers, a second candidate time frame at which the number changes abruptly, wherein the optimized number of the target single light in the second candidate time frame differs from that in the next time frame;

[0104] Obtain the abrupt number corresponding to the second candidate time frame, where the abrupt number is the optimized number of the target single light in the time frame immediately following the second candidate time frame;

[0105] Obtain a second time window of a second preset duration following the second candidate time frame;

[0106] Count the occurrences of the abrupt number within the second time window and the number of time frames in the second time window; if the number of time frames minus the number of occurrences is less than a second difference threshold, take the second candidate time frame as a second trusted time frame;

[0107] Smooth the optimized numbers again based on the second trusted time frame to obtain the final numbers of the target single light.

[0108] Solution 10. The method according to Solution 9, characterized in that smoothing the optimized numbers again based on the second trusted time frame includes:

[0109] Add the optimized number of the target single light in each second trusted time frame to the time value of that second trusted time frame to obtain a sum value for each second trusted time frame;

[0110] Compute the average of all the sum values, and subtract the time value of each time frame from this average to obtain the ideal number of the target single light in that time frame, where each time frame is a time frame within the video segment;

[0111] Take each time frame within the video segment as a target time frame, and obtain the deviation between the optimized number and the ideal number of the target single light in the target time frame;

[0112] If the absolute value of the deviation is less than a preset deviation threshold, the final number is obtained based on the optimized number; if the absolute value of the deviation is greater than or equal to the preset deviation threshold, the final number is obtained based on the ideal number.

[0113] Solution 11. The method according to Solution 1, characterized in that the camera includes a first camera and a second camera, the first camera has a higher priority than the second camera, the video segment includes a first video segment captured by the first camera and a second video segment captured by the second camera, and the method further includes:

[0114] Obtain a first semantic result of the first video segment, wherein the first semantic result includes the first semantic information of the target single light in the first video segment;

[0115] Obtain the second semantic result of the second video segment, the second semantic result including the second semantic information of the target single light in the second video segment;

[0116] Each time frame within the video segment is taken as a target time frame, and the first semantic information and the second semantic information of the target single light in the target time frame are fused.

[0117] The first video segment and the second video segment cover the same time frames.

[0118] Solution 12. The method according to Solution 11, characterized in that fusing the first semantic information and the second semantic information of the target single light in the target time frame includes:

[0119] When the first semantic result does not contain first semantic information of the target single light in the target time frame but the second semantic result contains second semantic information of the target single light in the target time frame, add the second semantic information to the first semantic result;

[0120] When the first semantic result contains the first semantic information and the second semantic result contains the second semantic information, and the two differ, overwrite the second semantic information with the first semantic information;

[0121] When the first semantic result contains the first semantic information and the second semantic result does not contain the second semantic information, the first semantic information is added to the second semantic result.

[0122] Solution 13. The method according to Solution 12, characterized in that there are multiple second cameras, each second camera acquires a second video segment, and the step of adding the second semantic information to the first semantic result includes:

[0123] If the second semantic results of multiple second video segments all contain second semantic information, vote over all of this second semantic information and add the most frequently occurring value to the first semantic result.

[0124] Solution 14. An electronic device, characterized in that it comprises:

[0125] At least one processor;

[0126] And, a memory communicatively connected to the at least one processor;

[0127] The memory stores a computer program that, when executed by the at least one processor, implements the semantic acquisition method for traffic lights according to any one of Solutions 1 to 13.

[0128] Solution 15. A computer-readable storage medium storing a plurality of program codes, characterized in that the program codes are adapted to be loaded and run by a processor to perform the semantic acquisition method for traffic lights according to any one of Solutions 1 to 13.

[0129] The above-described technical solutions of this application have at least one of the following beneficial effects:

[0130] In one technical solution of the traffic light semantic acquisition method provided in this application, video clips of traffic lights captured by a camera on a vehicle can be acquired. The 3D positions of individual lights in each frame of the video clip can be obtained. Based on the 3D positions, individual lights in each frame are tracked to obtain the tracking ID of each light in each frame. 2D detection boxes of individual lights in the images are detected. Semantic recognition of individual lights is performed on the image regions where the 2D detection boxes are located to obtain the initial semantic information of each light. The tracking IDs and initial semantic information of the same light in the image are associated. Each light indicated by a tracking ID is treated as a target light. Based on the tracking ID of the target light, the initial semantic information of the target light in each frame is obtained. The initial semantic information of the target light in each frame is smoothed to obtain the final semantic information of the target light in each frame.

[0131] Based on the above implementation, the semantic information of individual lights in each frame of a video clip can be quickly, accurately, and automatically identified without requiring annotation personnel to perform semantic analysis on each frame individually, greatly improving the efficiency and accuracy of acquiring individual light semantics. Furthermore, after obtaining the semantic information of individual lights in each frame, the images can be annotated based on this information to form image annotation data, eliminating the need for manual annotation and significantly improving the efficiency and accuracy of annotation.

Description of the Attached Figures

[0132] The disclosure of this application will become more readily understood with reference to the accompanying drawings. It will be readily understood by those skilled in the art that these drawings are for illustrative purposes only and are not intended to limit the scope of protection of this application. Wherein:

[0133] Figure 1 is a schematic flowchart of the main steps of a traffic signal light semantic acquisition method according to an embodiment of this application;

[0134] Figure 2 is a schematic flowchart of the main steps for smoothing the initial color of a single lamp according to an embodiment of this application;

[0135] Figure 3 is a flowchart of the main steps for obtaining the flashing state of a target single light in each frame of a video segment according to an embodiment of this application;

[0136] Figure 4 is a flowchart of the main steps for smoothing the initial countdown numbers of a single lamp according to an embodiment of this application;

[0137] Figure 5 is a schematic flowchart of the main steps for fusing semantic results obtained from multiple cameras according to an embodiment of this application;

[0138] Figure 6 is a first schematic image captured by a first camera according to an embodiment of this application;

[0139] Figure 7 is a first schematic image captured by a second camera according to an embodiment of this application;

[0140] Figure 8 is a second schematic image captured by the first camera according to an embodiment of this application;

[0141] Figure 9 is a second schematic image captured by the second camera according to an embodiment of this application;

[0142] Figure 10 is a schematic diagram of the overall flow of a traffic signal light semantic acquisition method according to an embodiment of this application;

[0143] Figure 11 is a flowchart of the process of obtaining the 3D position, tracking ID, 2D detection box, and semantic information of a single lamp according to an embodiment of this application;

[0144] Figure 12 is a schematic flowchart of the temporal processing of single-lamp information according to an embodiment of this application;

[0145] Figure 13 is a schematic diagram of the main structure of a smart device according to an embodiment of this application.

[0146] Figure label:

[0147] 11: Memory; 12: Processor.

Detailed Implementation

[0148] Some embodiments of this application are described below with reference to the accompanying drawings. Those skilled in the art should understand that these embodiments are merely illustrative of the technical principles of this application and are not intended to limit the scope of protection of this application.

[0149] In the description of this application, "processor" can include hardware, software, or a combination of both. A processor can be a central processing unit, microprocessor, graphics processor, digital signal processor, or any other suitable processor, and has data and/or signal processing capabilities. A processor can be implemented in software, in hardware, or a combination of both. Computer-readable storage media include any suitable medium capable of storing program code, such as magnetic disks, hard disks, optical disks, flash memory, read-only memory, and random access memory. The term "A and/or B" covers all possible combinations of A and B, namely only A, only B, or both A and B.

[0150] The relevant user personal information that may be involved in the various embodiments of this application is processed in strict accordance with the requirements of laws and regulations, following the principles of legality, legitimacy, and necessity, based on the reasonable purpose of the business scenario, and includes personal information that users actively provide or that is generated as a result of using the product/service, as well as personal information obtained with user authorization.

[0151] The personal information processed in this application will vary with the specific product/service scenario and will depend on the specific scenario in which the user uses the product/service. This may involve the user's account information, device information, driving information, vehicle information, or other related information. This application will treat the user's personal information and its processing with the utmost diligence.

[0152] This application attaches great importance to the security of users' personal information and has taken reasonable and feasible security protection measures that comply with industry standards to protect users' information and prevent unauthorized access, disclosure, use, modification, damage or loss of personal information.

[0153] The following describes an embodiment of the semantic acquisition method for traffic lights provided in this application.

[0154] Referring to Figure 1, which is a schematic flowchart of the main steps of a traffic light semantic acquisition method according to an embodiment of this application, the semantic acquisition method for traffic lights in this embodiment mainly includes the following steps S101 to S106.

[0155] Step S101: Acquire video clips of traffic lights captured by the camera on the vehicle.

[0156] A camera captures images of the vehicle's driving environment, and consecutive frames form a video. A video clip of a traffic light is a video segment whose images contain the traffic light.

[0157] Step S102: Obtain the 3D position of a single lamp in each frame of the video clip, track the single lamp in each frame based on the 3D position, and obtain the tracking ID of the single lamp in each frame.

[0158] The 3D position can be the three-dimensional position of a single lamp in the world coordinate system. In some implementations, the two-dimensional position (2D position) of a single lamp in the image coordinate system can be detected, and then the 2D position can be transformed to the world coordinate system according to the transformation relationship between the image coordinate system and the world coordinate system to obtain the 3D position of the single lamp. In some implementations, the image has annotation information of the single lamp position, and the 3D position of the single lamp can be directly obtained from the annotation information. This annotation information can be pre-annotated on the video clip by manual annotation, and this annotation information can be obtained synchronously when the image is obtained from the video clip. In this implementation, a conventional tracking method can be used to track the single lamp based on its 3D position.

[0159] Step S103: Detect the 2D detection box of a single lamp in the image, perform single-lamp semantic recognition on the image region where the 2D detection box is located, and obtain the initial semantic information of the single lamp.

[0160] Specifically, a semantic recognition model can be used to detect 2D detection boxes for a single light in an image, and semantic recognition of the image region containing the 2D detection box can be performed. The 2D detection box is a two-dimensional detection box for a single light in the image coordinate system. In this embodiment, a conventional semantic recognition model can be used to detect the 2D detection box and perform semantic recognition of the single light; this embodiment does not impose specific limitations on this.

[0161] Semantic information can include the color of a single lamp, the shape of its lamp head, and so on.

[0162] The colors include the illuminated color when a single light is on and the extinguished color when a single light is off. The illuminated color is used for traffic guidance; if a single light displays the illuminated color in multiple consecutive frames of images, then it can be determined that the single light is on. For example, the illuminated color can include green, red, and yellow, where green indicates permission to proceed, red indicates prohibition, and yellow indicates a warning. The extinguished color can be black. In some implementations, the semantic information may also include a countdown timer. The countdown timer can be understood as a countdown to the remaining time of the currently illuminated color, and the changes in the countdown timer reflect the countdown process of the single light.

[0163] According to the semantics of the lamp head shape, single lamps can be divided into turn signals and countdown lights. The lamp head shape of a turn signal indicates the traffic direction: it can be a disc, a straight arrow, a left-turn arrow, a right-turn arrow, a U-turn arrow, and so on, and different shapes indicate different traffic directions. For example, a disc indicates traffic directions including going straight, turning left, and making a U-turn, while a straight arrow indicates going straight.

[0164] The lamp head of a countdown light displays a number indicating the remaining time of the current illuminated color. When the remaining time reaches 0, the turn signal switches to a different color. For example, if the current illuminated color is green and the countdown shows 15, green will last another 15 seconds, after which the turn signal turns red.

[0165] Step S104: Associate the tracking ID and initial semantic information of the same single lamp in the image. Based on this, the semantic information of the single lamp can be queried according to its tracking ID.

[0166] In some implementations, the tracking ID and semantic information of the same single light in the image can be associated through the following steps 11 and 12.

[0167] Step 11: Project the 3D position of a single lamp in the image into the image coordinate system to obtain its 2D position. Specifically, the coordinate system in which the 3D position is expressed is determined, and the 3D position is transformed into the image coordinate system according to the transformation between that coordinate system and the image coordinate system.

[0168] Step 12: Match the 2D position against the 2D detection boxes of single lamps in the image. If the match succeeds, associate the tracking ID of the single lamp corresponding to the 3D position with the semantic information of the single lamp corresponding to the 2D detection box; if the match fails, no association is made. Specifically, the positional deviation between the 2D position and the 2D detection box can be computed. If the deviation is less than or equal to a preset deviation threshold, the two positions coincide or nearly coincide and the match succeeds; otherwise, the match fails. To set the deviation threshold, the maximum error between the 2D position and the 2D detection box of the same single lamp can be measured through testing, and the threshold can be set based on this maximum error.
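Steps 11 and 12 can be sketched with a pinhole camera model. This sketch assumes the 3D position is expressed in a world frame related to the camera by a rotation `R` and translation `t`, ignores lens distortion, and takes the positional deviation to be the pixel distance between the projected point and the center of a 2D detection box; the application does not fix any of these choices, and `project_point` and `match_box` are hypothetical names.

```python
import math
from typing import List, Optional, Sequence, Tuple

Point3 = Tuple[float, float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def project_point(p_world: Point3, R: Sequence[Sequence[float]], t: Point3,
                  fx: float, fy: float, cx: float, cy: float) -> Optional[Tuple[float, float]]:
    """Step 11 (sketch): world -> camera via p_c = R p_w + t, then pinhole projection."""
    xc, yc, zc = (sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3))
    if zc <= 0:  # point lies behind the camera and cannot appear in the image
        return None
    return fx * xc / zc + cx, fy * yc / zc + cy


def match_box(uv: Tuple[float, float], boxes: List[Box], max_dev: float) -> Optional[int]:
    """Step 12 (sketch): index of the nearest box center within max_dev pixels, else None."""
    best, best_dev = None, max_dev
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        dev = math.hypot(uv[0] - (x0 + x1) / 2, uv[1] - (y0 + y1) / 2)
        if dev <= best_dev:  # keep the closest box that passes the deviation threshold
            best, best_dev = i, dev
    return best


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
uv = project_point((1.0, -0.5, 50.0), identity, (0.0, 0.0, 0.0), 1000, 1000, 960, 540)
print(uv)  # (980.0, 530.0)
print(match_box(uv, [(1200, 300, 1220, 320), (970, 520, 990, 540)], max_dev=10))  # 1
```

Choosing the nearest box that passes the threshold, rather than the first one, avoids ambiguous associations when two single lamps sit close together in the image.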

[0169] Based on the methods described in steps 11 and 12 above, the 3D position and 2D detection box of a single light can be used to accurately associate the tracking ID and semantic information of the same single light, which helps to quickly obtain the semantic information of each single light in subsequent steps.

[0170] Step S105: Treat the single light indicated by each tracking ID as a target single light, and obtain the initial semantic information of the target single light in each frame image based on its tracking ID.
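Once step S104 has attached a tracking ID to each frame's initial semantic information, step S105 amounts to grouping the records by tracking ID. A minimal sketch, assuming each record is a hypothetical `(time_frame, tracking_id, initial_semantic)` tuple; `group_by_track` is an illustrative name.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_track(records: List[Tuple[int, int, str]]) -> Dict[int, List[Tuple[int, str]]]:
    """Collect each target light's (time_frame, initial_semantic) pairs, ordered by frame.

    `records` holds hypothetical (time_frame, tracking_id, initial_semantic) tuples
    produced by the association of step S104.
    """
    per_track: Dict[int, List[Tuple[int, str]]] = defaultdict(list)
    for frame, track_id, semantic in records:
        per_track[track_id].append((frame, semantic))
    for observations in per_track.values():
        observations.sort()  # chronological order for the smoothing of step S106
    return dict(per_track)


print(group_by_track([(2, 7, "green"), (1, 7, "green"), (1, 9, "red")]))
# {7: [(1, 'green'), (2, 'green')], 9: [(1, 'red')]}
```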

[0171] Step S106: Smooth the initial semantic information of the target single lamp in each frame image to obtain the final semantic information of the target single lamp in each frame image.

[0172] Specifically, the result of the smoothing process is used as the final semantic information of a single light in the image. Due to factors such as the accuracy of the semantic recognition model and image quality, single light color detection errors may occur, resulting in repeated jumps in semantic information. Smoothing can effectively reduce these repeated jumps and maintain the stability of the semantic information.

[0173] Based on the method described in steps S101 to S106 above, the semantic information of a single light in each frame of a video clip can be quickly, accurately, and automatically identified without requiring annotation personnel to perform semantic analysis on each frame, which greatly improves the efficiency and accuracy of obtaining the semantic information of a single light.

[0174] The following description continues with an embodiment of the semantic acquisition method for traffic lights provided in this application, specifically explaining step S106 above.

[0175] In one embodiment of step S106, the semantic information includes the color of a single light, and the initial color of the target single lamp in each frame image can be smoothed through steps S201 to S205 shown in Figure 2 to obtain the final color of the target single lamp in each frame image.

[0176] Step S201: Based on the initial color, obtain the first candidate time frames at which the color changes; at a first candidate time frame, the initial color of the target single lamp differs from its initial color in the immediately following time frame.

[0177] First, it should be noted that, in the embodiments of this application, the color of a single lamp in a time frame refers to the color of the single lamp in the image of that time frame. The first candidate time frame is described below with reference to Table 1, which shows the color of the target single lamp in 16 consecutive frames of images.

Table 1

| Time frame | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Color | green | green | green | green | yellow | green | yellow | yellow | yellow | yellow | yellow | red | red | red | red | red |

[0178] As shown in Table 1, time frame 4 is green, and the first time frame after it, time frame 5, is yellow. Since green and yellow differ, time frame 4 is a first candidate time frame. Similarly, time frames 5, 6, and 11 are also first candidate time frames.

[0179] Step S202: Obtain the mutation color corresponding to the first candidate time frame. The mutation color is the initial color of the target single lamp in the first time frame after the first candidate time frame.

[0180] Step S203: Obtain the first time window of the first preset duration after the first candidate time frame.

[0181] The color of the target single light alternates, and each color lasts for a certain period, although the duration may vary. When setting the first preset duration, the shortest duration of the mutation color can be obtained, and the first preset duration can be set equal to this shortest duration. For example, the colors of a single light include lit colors (green, yellow, red) and an off color (black). When a single light displays green, there is a constant-on phase and a flashing phase: during the constant-on phase green is displayed continuously, and during the flashing phase the light alternates between green and black. If the constant-on phase lasts 25 seconds and the flashing phase lasts 5 seconds, displaying green, black, green, black, and green in sequence from second 1 to second 5, then green has durations of both 25 seconds and 1 second, and its shortest duration is 1 second. Likewise, because a normally working single light displays black while flashing, the shortest duration of black is also 1 second.

[0182] Step S204: Obtain the number of occurrences of the mutation color within the first time window and the number of time frames in the first time window, and determine the first reliable time frame based on the difference between these two numbers. Specifically, if the number of time frames minus the number of occurrences is less than a first difference threshold, the mutation color dominates the window, indicating that the target single light has undergone a normal color conversion, so the first candidate time frame is taken as a first reliable time frame; otherwise, the first candidate time frame is not taken as a first reliable time frame.

[0183] To set the first difference threshold, the durations of abnormal color jumps while the target single lamp displays each color can be measured, the longest of these durations taken, the number of time frames within that longest duration counted, and the first difference threshold set based on this number.

[0184] It should be noted that the vehicle's camera captures images at a much higher frequency than the single light changes color. For example, a single light may alternate between green, yellow, and red lasting 30 seconds, 3 seconds, and 30 seconds respectively, while the camera captures 25 frames per second.

[0185] The following explanation uses time frames 4 and 5 in Table 1 as examples.

[0186] Time frame 4: The mutation color of time frame 4 is yellow. Assuming that the first time window has 5 time frames based on the shortest duration of yellow, the time frames falling into the first time window after time frame 4 are time frames 5, 6, 7, 8, and 9. The colors of time frames 5, 6, 7, 8, and 9 are yellow, green, yellow, yellow, and yellow, respectively. The number of occurrences of the mutation color (yellow) is 4. The difference between the number of time frames (5) and the number of occurrences (4) in the first time window is 1. Assuming that the first difference threshold is 2, since the difference (1) is less than the first difference threshold (2), time frame 4 is the first reliable time frame.

[0187] Time frame 5: The mutation color of time frame 5 is green. Assuming that the first time window also has 5 time frames, the time frames that fall into the first time window after time frame 5 are time frames 6, 7, 8, 9, and 10, and the number of occurrences of the mutation color (green) is 1. The difference between the number of time frames (5) and the number of occurrences (1) in the first time window is 4. Assuming that the first difference threshold is 2, since the difference (4) is greater than the first difference threshold (2), time frame 5 is not the first reliable time frame.
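Steps S201 to S204 can be sketched as follows and checked against Table 1. This is an illustration only, assuming a first time window of 5 frames and a first difference threshold of 2 as in the examples above, 1-based frame numbers, and that a window running past the end of the clip is simply truncated; `reliable_change_frames` is a hypothetical name.

```python
from typing import List, Sequence


def reliable_change_frames(labels: Sequence[str], window: int, diff_threshold: int) -> List[int]:
    """Steps S201-S204 (sketch): return 1-based first reliable time frames."""
    reliable = []
    for i in range(len(labels) - 1):  # i is the 0-based index of a possible candidate frame
        if labels[i] == labels[i + 1]:
            continue  # S201: no change into the next frame, so not a candidate
        mutation = labels[i + 1]  # S202: initial color of the first frame after the candidate
        win = labels[i + 1:i + 1 + window]  # S203: first time window after the candidate
        occurrences = sum(1 for c in win if c == mutation)
        if len(win) - occurrences < diff_threshold:  # S204
            reliable.append(i + 1)
    return reliable


table1 = ["green"] * 4 + ["yellow", "green"] + ["yellow"] * 5 + ["red"] * 5
print(reliable_change_frames(table1, window=5, diff_threshold=2))  # [4, 6, 11]
```

With these parameters, time frame 6 also qualifies as a first reliable time frame, because all five frames after it are yellow; time frame 5 is rejected exactly as in the worked example.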

[0188] Step S205: Smooth the initial color based on the first reliable time frame. The first reliable time frame indicates that the single lamp color has undergone normal color conversion after this time frame. Therefore, the initial color of the target single lamp in the first time frame after the first reliable time frame can be obtained, and the final color of the target single lamp after the first reliable time frame can be obtained based on this initial color.

[0189] In some implementations, the initial color can be smoothed according to the first reliable time frame through the following steps S2051 to S2053.

[0190] Step S2051: Obtain the stable color corresponding to the first reliable time frame. The stable color is the initial color of the target single lamp in the first time frame after the first reliable time frame.

Step S2052: Based on the stable color, obtain the color conversion information of the first reliable time frame. The color conversion information indicates that the final color of the target single lamp after the first reliable time frame is the stable color.

Step S2053: Determine the final color of the target single lamp in each frame image based on the color conversion information.

Based on the method described in steps S2051 to S2053 above, the final color of the target single lamp in each frame image can be determined quickly and accurately using the color conversion information.

[0191] In some embodiments of step S2053 above, when there are multiple first reliable time frames, the final color of the target single lamp in each frame image can be determined by applying the color conversion information of each first reliable time frame in chronological order, from first to last.

[0192] The following explanation will still use Table 1 as an example.

[0193] Assume that time frame 4 and time frame 11 in Table 1 are the first reliable time frames, the stable color corresponding to time frame 4 is yellow, and the stable color corresponding to time frame 11 is red.

[0194] Following the order of the first reliable time frames, the color conversion information of time frame 4 sets the target single light to yellow from time frames 5 to 16; the color conversion information of time frame 11 then sets it to red from time frames 12 to 16. The resulting final colors of each time frame in Table 1 are shown in Table 2 below.

Table 2

| Time frame | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Color | green | green | green | green | yellow | yellow | yellow | yellow | yellow | yellow | yellow | red | red | red | red | red |
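Steps S2051 to S2053 can be sketched as follows. This assumes 1-based frame numbers and that frames before the first reliable time frame keep their initial color, which the application does not state explicitly; `apply_conversions` is a hypothetical name. Applied to Table 1 with first reliable time frames 4 and 11, it reproduces Table 2.

```python
from typing import List, Sequence


def apply_conversions(initial: Sequence[str], reliable: Sequence[int]) -> List[str]:
    """Steps S2051-S2053 (sketch): apply each reliable frame's stable color, first to last.

    `reliable` holds 1-based first reliable time frames. Frames before the first reliable
    time frame keep their initial color (an assumption).
    """
    final = list(initial)
    for frame in sorted(reliable):
        stable = initial[frame]  # S2051: initial color of the next frame (0-based index == frame)
        for j in range(frame, len(final)):  # S2052/S2053: every later frame takes the stable color
            final[j] = stable
    return final


table1 = ["green"] * 4 + ["yellow", "green"] + ["yellow"] * 5 + ["red"] * 5
table2 = ["green"] * 4 + ["yellow"] * 7 + ["red"] * 5
print(apply_conversions(table1, [4, 11]) == table2)  # True
```

Adding time frame 6 to the reliable list gives the same result, since its stable color is also yellow and a later reliable frame overrides it from frame 12 onward.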

[0195] A comparison of Tables 1 and 2 confirms that the abnormal color change of the single light in time frame 6 has been eliminated, and the color of time frame 6 has changed from green to yellow.

[0196] Based on the method described in steps S201 to S205 above, abnormal color changes of the target single light can be effectively removed, thereby ensuring the color stability of the target single light in each time frame.

[0197] The following description continues with an embodiment of the semantic acquisition method for traffic lights provided in this application, focusing on step S106.

[0198] In some embodiments of step S106 above, the semantic information may include the flashing state, and the flashing state of the target single light in each frame of the video clip can be obtained through steps S206 to S207 shown in Figure 3.

[0199] Step S206: Based on the tracking ID of the target single lamp, obtain the final color of the target single lamp in each frame image, and perform flashing detection on the final colors to obtain the flashing period. Flashing means that the color of a single lamp repeatedly alternates between a lit color and black within a certain period of time, so the flashing period of a single lamp can be determined from its color; for example, the color may switch between the lit color and black every 0.5 seconds.

[0200] In some implementations, the final color of the target single light can be matched with a preset flashing template to obtain a color matching time period, which is then used as the flashing time period. The flashing template includes M template colors arranged sequentially, where each template color represents the color of the single light when it flashes, M < N, and N is the total number of time frames in the video segment. The color matching time period includes M time frames, and the final color of the target single light in the first to the Mth time frames is the same as the first to the Mth template colors in the flashing template, respectively. For example, the flashing template is "BBGGGBB" or "BBGGGBBB", where B represents black and G represents green. This implementation allows for quick and accurate determination of whether a single light is flashing using the flashing template, thereby obtaining the flashing state of the single light.

[0201] Step S207: Take each time frame in the video clip as the target time frame, and obtain the flashing state of the target single light in the target time frame according to the flashing period. If the target time frame falls within the flashing period, the flashing state is flashing; if the target time frame does not fall within the flashing period, the flashing state is not flashing. As long as the single light is within the flashing period, regardless of whether the single light is lit or black at this time, the flashing state of the single light is flashing.
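Steps S206 and S207 can be sketched as a sliding comparison against the flashing template. The sketch assumes colors are encoded as single letters (B for black, G for green, as in the template example above) and that every window matching the template belongs to the flashing period, with overlapping or repeated matches merged; `flashing_states` is a hypothetical name.

```python
from typing import List, Sequence


def flashing_states(final_colors: Sequence[str], template: Sequence[str]) -> List[bool]:
    """Steps S206-S207 (sketch): True for frames inside any window that matches the template."""
    n, m = len(final_colors), len(template)
    flashing = [False] * n
    for start in range(n - m + 1):
        # A color matching time period: M consecutive final colors equal the template
        if list(final_colors[start:start + m]) == list(template):
            for k in range(start, start + m):
                flashing[k] = True  # lit or black, the frame lies in the flashing period
    return flashing


colors = "GGG" + "BBGGGBB" + "GG"  # hypothetical final colors, one letter per frame
print(flashing_states(colors, "BBGGGBB"))  # frames 4 to 10 (1-based) are flashing
```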

[0202] Based on the methods described in steps S206 to S207 above, the flashing state of a single light can be accurately obtained according to the color of the single light.

[0203] The following description continues with an embodiment of the semantic acquisition method for traffic lights provided in this application, focusing on step S106.

[0204] In some embodiments of step S106 above, when the target single lamp is a countdown light, that is, a single lamp whose lamp head shows a number, the semantic information also includes the countdown number. In this embodiment, the initial number (i.e., the initial countdown number) of the target single lamp in each frame image can be smoothed through steps S301 to S306 shown in Figure 4 to obtain the final number (i.e., the final countdown number) of the target single lamp in each frame image.

[0205] Step S301: Based on the tracking ID of the target single light, obtain the final color and the initial number of the target single light in each frame image. Since the tracking ID and semantic information of the same single light have already been associated in step S104, the semantic information of the target single light can be queried by its tracking ID, and the countdown number read from this semantic information is the initial number.

[0206] Step S302: Divide the video clip into multiple color-independent time periods based on the final color. Within a color-independent time period, the final color of the target single light is the same in every time frame, and two adjacent color-independent time periods have different final colors.

[0207] Referring again to Table 2, based on the final color, time frames 1 to 16 can be divided into three color-independent time periods. The first color-independent time period is time frames 1-4, the second color-independent time period is time frames 5-11, and the third color-independent time period is time frames 12-16.
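Step S302 amounts to run-length grouping of the final colors. A minimal sketch with a hypothetical `color_segments` helper returning 1-based inclusive frame ranges, which reproduces the three periods listed above for Table 2:

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def color_segments(final_colors: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Step S302 (sketch): maximal runs of one final color as (color, first, last), 1-based."""
    segments: List[Tuple[str, int, int]] = []
    frame = 1
    for color, run in groupby(final_colors):
        length = sum(1 for _ in run)
        segments.append((color, frame, frame + length - 1))
        frame += length
    return segments


table2 = ["green"] * 4 + ["yellow"] * 7 + ["red"] * 5
print(color_segments(table2))  # [('green', 1, 4), ('yellow', 5, 11), ('red', 12, 16)]
```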

[0208] Step S303: Use the Z-score method to detect outliers among the initial numbers of the target single lamp in each color-independent time period, obtaining the abnormal numbers in that period.

[0209] When a single lamp changes from one illuminated color to another, the countdown number also changes abruptly. For example, while a single lamp is green, the countdown starts at 30 and decreases to 1; once it reaches 1, the lamp turns red and the countdown jumps from 1 back to 30. This jump is caused by a normal color transition (referred to below as a normal abrupt change). Dividing the video clip into color-independent time periods and detecting outliers in each period separately prevents the countdown numbers around such normal abrupt changes from being mistakenly flagged as anomalies.

[0210] Step S304: Use the Lowess smoothing method to smooth the abnormal numbers based on the normal numbers in the color-independent time period, where the normal numbers are the initial numbers in that period other than the abnormal numbers.

[0211] The Z-score method is a conventional outlier detection method, and the Lowess smoothing method is a conventional smoothing method. For the sake of brevity, this implementation will not provide a detailed explanation of the principles of either method.
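Steps S303 and S304 can be sketched for a single color-independent time period as follows. This is a simplified illustration, not the application's implementation: the Z-score threshold of 2.0, the neighbourhood fraction `frac`, the use of the population standard deviation, and rounding the smoothed value to an integer are all assumptions, and `lowess_at` is a single-pass local linear fit with tricube weights that omits the robustness iterations of full LOWESS. The toy countdown below decreases by one per frame purely for readability.

```python
import math
from typing import List, Sequence


def zscore_outliers(values: Sequence[float], threshold: float) -> List[bool]:
    """Step S303 (sketch): flag values whose |z-score| exceeds the threshold."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)  # population std
    if std == 0:
        return [False] * n
    return [abs(v - mean) / std > threshold for v in values]


def lowess_at(xs: Sequence[float], ys: Sequence[float], x0: float, frac: float) -> float:
    """Single-pass local linear fit at x0 with tricube weights (no robustness iterations)."""
    n = len(xs)
    k = min(n, max(3, math.ceil(frac * n)))  # neighbourhood size
    h = sorted(abs(x - x0) for x in xs)[k - 1]  # bandwidth: distance to k-th nearest point
    w = [(1 - (abs(x - x0) / h) ** 3) ** 3 if abs(x - x0) < h else 0.0 for x in xs]
    sw = sum(w)
    if sw == 0:  # degenerate neighbourhood: fall back to the plain mean
        return sum(ys) / n
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0:  # only one distinct x carries weight: use the weighted mean
        return my
    slope = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys)) / sxx
    return my + slope * (x0 - mx)


def smooth_segment(frames: Sequence[int], numbers: Sequence[int],
                   z_threshold: float = 2.0, frac: float = 0.6) -> List[int]:
    """Steps S303-S304 (sketch) on one color-independent time period."""
    flags = zscore_outliers(numbers, z_threshold)
    normal_x = [f for f, bad in zip(frames, flags) if not bad]
    normal_y = [v for v, bad in zip(numbers, flags) if not bad]
    smoothed = list(numbers)
    for i, bad in enumerate(flags):
        if bad and normal_x:
            # Re-estimate the abnormal number from the normal numbers, rounded to a digit
            smoothed[i] = round(lowess_at(normal_x, normal_y, frames[i], frac))
    return smoothed


print(smooth_segment(list(range(1, 10)), [10, 9, 8, 7, 60, 5, 4, 3, 2]))
# [10, 9, 8, 7, 6, 5, 4, 3, 2]
```

For a production pipeline, a library LOWESS with robustness iterations would usually be preferable; the hand-written version only keeps this sketch dependency-free.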

[0212] Step S305: Take the countdown numbers of the target single light in all color-independent time periods, after the smoothing of step S304, as the optimized numbers of the target single light in each frame of the video clip.

[0213] Step S306: Based on the optimized numbers, obtain the final numbers of the target single lamp in each frame image.

[0214] Based on the method described in steps S301 to S306 above, abnormal countdown numbers of the target single lamp can be effectively removed, thereby ensuring the accuracy of the countdown of the target single lamp.

[0215] Step S306 in the above embodiments will be described below.

[0216] In some embodiments of step S306 above, the final number of the target single lamp can be obtained based on the optimized number through the following steps S3061 to S3065.

[0217] Step S3061: Based on the optimized numbers, obtain the second candidate time frame for the numerical mutation; the optimized numbers of the target single lamp are different in the second candidate time frame and the first time frame thereafter.

[0218] First, it should be noted that, in the embodiments of this application, the number of a single lamp in a time frame refers to the number shown by the single lamp in the image of that time frame. The second candidate time frame is described below with reference to Table 3, which shows the optimized numbers of the target single lamp in 15 consecutive frames of images.

Table 3

| Time frame | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number | 7 | 7 | 7 | 7 | 6 | 7 | 6 | 6 | 6 | 5 | 5 | 5 | 5 | 5 | 4 |

[0219] In this context, time frame 4 has the number "7", and the first time frame after time frame 4 is time frame 5, which has the number "6". Since "7" and "6" are different, time frame 4 is the second candidate time frame. Similarly, time frames 5, 6, 9, and 14 are also second candidate time frames.

[0220] Step S3062: Obtain the mutation number corresponding to the second candidate time frame. The mutation number is the optimized number of the target single lamp in the first time frame after the second candidate time frame.

[0221] Step S3063: Obtain a second time window of a second preset duration after the second candidate time frame.

[0222] The numbers on the target single lamp change sequentially at fixed time intervals, so each number is displayed for the same duration. When setting the second preset duration, the display duration of any one number can be measured, and the second preset duration can be set based on that duration.

[0223] Step S3064: Obtain the number of occurrences of the mutation number within the second time window, and obtain the number of time frames within the second time window. If the difference between the number of time frames and the number of occurrences is less than a second difference threshold, the mutation number occupies most of the window, which indicates that the target single lamp has undergone a normal number transition; the second candidate time frame is then used as a second reliable time frame. Otherwise, the second candidate time frame is not used as a second reliable time frame.

[0224] When setting the second difference threshold, the durations of abnormal number changes while the target single lamp displays each number can be measured. The longest of these durations is taken, the number of time frames it spans is obtained, and the second difference threshold is set based on that number of time frames.

[0225] The following explanation uses time frames 4 and 5 in Table 3 as examples.

[0226] Time frame 4: The mutation number of time frame 4 is "6". Assuming there are 5 time frames in the second time window, the time frames that fall into the second time window after time frame 4 are time frames 5, 6, 7, 8, and 9. The numbers of time frames 5, 6, 7, 8, and 9 are "6, 7, 6, 6, 6". The number of occurrences of the mutation number (6) is 4. The difference between the number of time frames (5) and the number of occurrences (4) in the second time window is 1. Assuming the second difference threshold is 2, since the difference (1) is less than the second difference threshold (2), time frame 4 is the second reliable time frame.

[0227] Time frame 5: The mutation number of time frame 5 is "7". The time frames that fall into the second time window after time frame 5 are time frames 6, 7, 8, 9, and 10. The numbers of time frames 6, 7, 8, 9, and 10 are "7, 6, 6, 6, 5" respectively. The number of occurrences of the mutation number (7) is 1. The difference between the number of time frames (5) and the number of occurrences (1) in the second time window is 4. Since the difference (4) is greater than the second difference threshold (2), time frame 5 is not the second reliable time frame.
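The window test of steps S3062 to S3064 can be sketched in Python as follows. As in the worked examples, the window length is expressed directly as a frame count (5) and the second difference threshold is 2. The name `is_second_reliable` is hypothetical, and clipping the window at the end of the clip is an assumption, since the text does not say how to treat a candidate near the last frame.

```python
TABLE_3 = [7, 7, 7, 7, 6, 7, 6, 6, 6, 5, 5, 5, 5, 5, 4]


def is_second_reliable(numbers, frame, window_len=5, diff_threshold=2):
    """S3062-S3064 (sketch) for one 1-based candidate frame.

    Assumption: the second time window holds the `window_len` frames that follow
    the candidate, clipped at the end of the clip.
    """
    # The 0-based index `frame` is the 1-based frame right after the candidate.
    mutation = numbers[frame]                       # S3062: the mutation number
    window = numbers[frame:frame + window_len]      # S3063: the second time window
    occurrences = window.count(mutation)            # S3064: count, then compare
    return len(window) - occurrences < diff_threshold


print([f for f in (4, 5, 6, 9, 14) if is_second_reliable(TABLE_3, f)])  # [4, 9, 14]
```

For Table 3 this keeps frames 4 and 9. Frame 6 fails because its difference (5 frames minus 3 occurrences = 2) is not less than the threshold, and frame 14 passes only because its window is clipped to the single remaining frame.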

[0228] Step S3065: Smooth the optimized numbers again based on the second reliable time frames to obtain the final numbers of the target single lamp. A second reliable time frame indicates that the countdown number undergoes a normal transition immediately after that time frame. Therefore, based on the second reliable time frames and the optimized numbers of the target single lamp in those frames, the final numbers of the target single lamp between every two consecutive second reliable time frames can be determined.

[0229] Based on the method described in steps S3061 to S3065 above, the time frame (second reliable time frame) when the countdown numbers are normally converted and the optimized number of the target single lamp in that time frame can be obtained, thereby accurately obtaining the final number of the target single lamp in all time frames.

[0230] The following is a further explanation of step S3065.

[0231] In some embodiments of step S3065, the optimized numbers can be smoothed again based on the second reliable time frame through the following steps 21 to 23.

[0232] Step 21: Add the optimized number of the target single lamp in each second reliable time frame to the time value of that second reliable time frame to obtain the sum of the target single lamp in each second reliable time frame.

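[0233] Step 22: Obtain the average of all the sums, and subtract the time value of each time frame from this average to obtain the ideal number of the target single lamp in that time frame. Because the countdown number decreases as time advances, the sum of the number and the time value remains approximately constant, so this difference estimates the number expected in each frame. Here, each time frame refers to each time frame within the video clip.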

[0234] In some implementations, the sum of each second reliable time frame is taken in turn as a target sum, and the deviation between the target sum and the above average is calculated. If the deviation is large, the sum of that second reliable time frame is removed and the average is recalculated. This further improves the accuracy of the ideal numbers obtained from the average.

[0235] Step 23: Take each time frame in the video clip as the target time frame and obtain the deviation between the optimized number and the ideal number of the target single lamp in the target time frame.

[0236] If the absolute value of the deviation is less than a preset deviation threshold, the final number is obtained based on the optimized number; specifically, the optimized number can be used as the final number. If the absolute value of the deviation is greater than or equal to the preset deviation threshold, the final number is obtained based on the ideal number; specifically, the integer part of the ideal number can be used as the final number. During the countdown, the countdown numbers decrease sequentially and the difference between any two adjacent countdown numbers is constant, so the deviation threshold can be set based on this difference; for example, the deviation threshold can be equal to this difference.
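Steps 21 to 23 can be sketched in Python as below. The sketch assumes that time values are in seconds and that the countdown drops by one per second, so that number plus time is approximately constant and the ideal number of a frame is the average sum minus that frame's time value. Unknown optimized numbers are represented as `None`. The optional `outlier_tol` parameter is a hypothetical tolerance for the refinement in [0234], whose exact criterion the text does not give. The function name and the sample data are invented for illustration and are not taken from Table 4.

```python
from statistics import mean


def final_numbers(times, optimized, reliable_idx, dev_threshold=1.0, outlier_tol=None):
    """Steps 21-23 (sketch).

    times: time value of every frame in the clip, assumed to be in seconds.
    optimized: optimized number of every frame, or None where unknown.
    reliable_idx: 0-based indices of the second reliable time frames.
    outlier_tol: hypothetical tolerance for dropping sums far from the average.
    """
    # Step 21: number + time at each reliable frame (roughly constant for a 1 s countdown).
    sums = [optimized[i] + times[i] for i in reliable_idx]
    avg = mean(sums)
    if outlier_tol is not None:  # optional refinement described in [0234]
        kept = [s for s in sums if abs(s - avg) <= outlier_tol]
        if kept:
            avg = mean(kept)
    result = []
    for t, opt in zip(times, optimized):
        ideal = avg - t  # Step 22 (assumed reading): expected number at time t
        if opt is not None and abs(opt - ideal) < dev_threshold:
            result.append(opt)  # Step 23: optimized number is consistent, keep it
        else:
            result.append(int(ideal))  # otherwise use the integer part of the ideal number
    return result


# The frame at 9.9 s (number 10) is the reliable frame just before the change to 9.
print(final_numbers([9.9, 10.0, 10.5, 10.6, 10.7, 10.8], [10, 9, 9, 9, None, 7], [0]))
# -> [10, 9, 9, 9, 9, 9]
```

Here the abnormal optimized number 7 at 10.8 s and the unknown value at 10.7 s are both replaced by the integer part of the ideal number, which mirrors the behaviour described for Table 4.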

[0237] The final numbers are explained below with reference to Table 4, in which the deviation threshold is 1.

Table 4

[0238] At 10.5 s and 10.6 s, the absolute value of the deviation between the optimized number and the ideal number was less than 1, so the final number was the optimized number "9". At 10.8 s, the absolute value of the deviation was greater than 1, so the integer part of the ideal number was used as the final number, which was still "9". In addition, where the optimized number was unknown, the final number was obtained from the ideal number; the final number at 10.7 s was therefore also "9".

[0239] Based on the methods described in steps 21 to 23 above, the ideal number corresponding to each time frame can be accurately obtained by utilizing the correspondence between the countdown numbers and time frames. This allows the ideal number to be used to smooth the optimized number, further improving the accuracy of the final number.

[0240] The following continues to describe embodiments of the semantic acquisition method for traffic lights provided in this application. In some embodiments of this application, the cameras on the vehicle include a first camera and a second camera, and the video clip includes a first video clip captured by the first camera and a second video clip captured by the second camera. The first and second cameras are mounted at different locations on the vehicle and have different fields of view. The first camera can capture at least the driving environment in front of the vehicle, and it has a higher priority than the second camera. In this embodiment, the semantic results obtained from the multiple cameras can be fused through steps S401 to S403 shown in Figure 5.

[0241] Step S401: Obtain the first semantic result of the first video segment. The first semantic result includes the first semantic information of the target single lamp in the first video segment.

[0242] Step S402: Obtain the second semantic result of the second video segment. The second semantic result includes the second semantic information of the target single lamp in the second video segment.

[0243] Step S403: Take each time frame in the video segment as the target time frame, and fuse the first semantic information and the second semantic information of the target single light in the target time frame, wherein the first video segment and the second video segment contain the same time frames.

[0244] The first and second cameras differ in field of view and resolution. Therefore, in some scenarios, the first or second camera may fail to capture the traffic light, or may fail to capture it clearly, so that no single light is detected when single-light detection is performed on the first or second video segment. By fusing the semantic results of the two cameras, the semantic information of a single light can be taken from the semantic result of the second video segment when the single light is not detected in the first video segment, and likewise from the semantic result of the first video segment when it is not detected in the second video segment.

[0245] Refer to Figures 6 and 7, which are images of the same traffic light captured by the first and second cameras, respectively. The first camera is a telephoto camera, so it can still capture a clear image of the traffic light when the vehicle is far away. The second camera, being a short-focal-length camera, cannot capture a clear image of the traffic light at that distance, so single-light detection on its image may fail to detect individual lights.

[0246] Refer to Figures 8 and 9, which are images captured simultaneously by the first and second cameras, respectively. The first camera has a narrower field of view than the second camera, so the image captured by the first camera does not include the traffic light, while the image captured by the second camera does. Consequently, single-light detection on the image captured by the first camera detects no single light.

[0247] In some embodiments of step S403 above, the first and second semantic information can be fused in the following ways.

[0248] Specifically, when the first semantic result does not contain the first semantic information of the target single lamp in the target time frame, and the second semantic result contains the second semantic information of the target single lamp in the target time frame, the second semantic information is added to the first semantic result.

[0249] If the first semantic result contains the first semantic information of the target single lamp in the target time frame and the second semantic result does not contain the corresponding second semantic information, the first semantic information is added to the second semantic result.

[0250] When the first semantic result contains first semantic information and the second semantic result contains second semantic information, if the first semantic information and the second semantic information are different, the second semantic information is modified to the first semantic information. For example, when the semantic information is the color of a single light, if the first semantic information of the target single light in the target time frame is red in the first semantic result, and the second semantic information of the target single light in the target time frame is green in the second semantic result, then the second semantic information is modified from green to red.

[0251] In some implementations, there are multiple second cameras, each capturing a second video segment. If the second semantic results of multiple second video segments all include second semantic information (i.e., multiple second cameras captured the same single light and each obtained its semantic information), all the second semantic information is put to a vote, and the value that appears most frequently is added to the first semantic result. For example, when the semantic information is the color of a single light and there are three second video segments, if the second semantic information of the target single light in the target time frame obtained from these three segments is red, red, and green respectively, red appears most frequently and is therefore added to the first semantic result. The first semantic result then contains the semantic information of the target single light in the target time frame (i.e., red).
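The fusion rules of [0248] to [0251] for one target single light in one time frame can be sketched in Python as below. Semantic values are plain strings, and `None` marks a light that was not detected. The text does not say how ties in the vote are broken; here `collections.Counter.most_common` keeps the first-seen value among equal counts, which is an assumption. The function name is hypothetical.

```python
from collections import Counter


def fuse_frame(first, seconds):
    """S403 (sketch) for one target single light in one time frame.

    first: value from the first (higher-priority) camera, or None if not detected.
    seconds: values from one or more second cameras, None where not detected.
    Returns the fused (first, seconds) pair.
    """
    if first is not None:
        # [0249]/[0250]: fill missing second values and overwrite conflicting ones.
        return first, [first] * len(seconds)
    detected = [s for s in seconds if s is not None]
    if detected:
        # [0248]/[0251]: copy the (majority-voted) second value into the first result.
        first = Counter(detected).most_common(1)[0][0]
    return first, list(seconds)


print(fuse_frame(None, ["red", "red", "green"]))  # ('red', ['red', 'red', 'green'])
print(fuse_frame("red", ["green", None]))         # ('red', ['red', 'red'])
```

The first camera's value always wins when it exists, reflecting its higher priority; the vote is only consulted when the first camera did not detect the light.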

[0252] The semantic acquisition method for traffic lights provided in this application is further described below with reference to Figures 10 to 12. Figure 10 exemplarily illustrates the overall flow of the method. As shown in Figure 10, the semantic information of a single light can be obtained through the following steps S501 and S502.

[0253] Step S501: Obtain the 3D position, tracking ID, 2D detection box, and semantic information of each single light. As shown in Figure 11, in step S5011, the 3D position of each single light in each frame image is obtained, and the single lights in each frame image are tracked according to their 3D positions to obtain the tracking ID of each single light in each frame image. In step S5012, the 2D detection boxes of single lights in the image are detected, and single-light semantic recognition is performed on the image region within each 2D detection box to obtain the semantic information of the single light. In step S5013, the tracking ID and semantic information of the same single light in the image are associated to obtain the semantic information of the traffic light over all time frames.
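Claim 2 describes the association of step S5013: the 3D position of a single light is projected into the image coordinate system and the resulting 2D position is matched with a 2D detection box. The Python sketch below assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy`, 3D positions already expressed in the camera frame (z pointing forward), and a match rule of "projected point inside the box, nearest box center wins". None of these details are given in the text, all names are hypothetical, and the sketch does not enforce one-to-one matching between tracks and boxes.

```python
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float
    semantics: str  # initial semantic information from the 2D recognizer, e.g. a color


def project(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a 3D point in camera coordinates; None if behind the camera."""
    x, y, z = point_cam
    if z <= 0:
        return None
    return fx * x / z + cx, fy * y / z + cy


def associate(tracks, boxes, fx, fy, cx, cy):
    """Map tracking ID -> initial semantics when the projected point falls inside a box."""
    result = {}
    for track_id, point in tracks.items():
        uv = project(point, fx, fy, cx, cy)
        if uv is None:
            continue
        u, v = uv
        inside = [b for b in boxes if b.x1 <= u <= b.x2 and b.y1 <= v <= b.y2]
        if inside:
            # Assumed tie-break: the box whose center is closest to the projected point.
            best = min(inside, key=lambda b: (u - (b.x1 + b.x2) / 2) ** 2
                       + (v - (b.y1 + b.y2) / 2) ** 2)
            result[track_id] = best.semantics
    return result


boxes = [Box(970, 520, 990, 540, "red"), Box(0, 0, 10, 10, "green")]
print(associate({7: (1.0, -0.5, 50.0)}, boxes, 1000, 1000, 960, 540))  # {7: 'red'}
```

In the usage line the light 50 m ahead projects to pixel (980, 530), which lies inside the first box, so track 7 receives the semantics "red".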

[0254] Step S502: Perform temporal processing of single-light information. As shown in Figure 12, in step S5021, the color of a single light is smoothed using the methods of steps S201 to S205 of the aforementioned method embodiment to obtain the final color. In step S5022, flash detection is performed on the single light using the methods of steps S206 to S208. In step S5023, the countdown number of the single light is smoothed using the methods of steps S301 to S306 to obtain the final number. In step S5024, multi-camera information is fused using the methods of steps S401 to S403; the multi-camera information includes the first semantic result obtained with the first camera and the second semantic result obtained with the second camera in the aforementioned embodiment.
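For step S5022, claims 6 and 7 describe flash detection as matching the sequence of final colors against a preset flashing template of M colors, and treating every frame inside a matched period as flashing. A minimal Python sketch follows. The `"off"` color label, the example template, and the function name are hypothetical.

```python
def flashing_states(final_colors, template):
    """Claims 6-7 (sketch): mark frames covered by any window that matches the template.

    final_colors: final color of the target single light in each of the N frames.
    template: the M template colors of a flashing light (M < N).
    Returns one boolean per frame, True where the frame lies in a flashing period.
    """
    n, m = len(final_colors), len(template)
    flashing = [False] * n
    for start in range(n - m + 1):
        # A color-matching period: M consecutive frames equal to the template, in order.
        if list(final_colors[start:start + m]) == list(template):
            for i in range(start, start + m):
                flashing[i] = True
    return flashing


colors = ["green", "green", "off", "green", "off", "red"]
print(flashing_states(colors, ["green", "off", "green", "off"]))
# [False, True, True, True, True, False]
```

Overlapping matches simply extend the flashing period, since each matched window marks its frames independently.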

[0255] It should be noted that although the steps in the above embodiments are described in a specific order, those skilled in the art will understand that in order to achieve the effect of this application, different steps do not necessarily have to be executed in such an order. They can be executed simultaneously (in parallel) or in other orders. These adjusted solutions are equivalent to the technical solutions described in this application and therefore will also fall within the protection scope of this application.

[0256] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium and, when executed by a processor, can implement the steps of the method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable storage medium can include any entity or device capable of carrying the computer program code, such as a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory, a random access memory, an electrical carrier signal, a telecommunication signal, or a software distribution medium.

[0257] Another aspect of this application provides a computer-readable storage medium.

[0258] In one embodiment of a computer-readable storage medium according to this application, the computer-readable storage medium may be configured to store a program that performs the semantic acquisition method of traffic lights according to the above-described method embodiments. This program may be loaded and run by a processor to implement the semantic acquisition method of traffic lights. For ease of explanation, only the parts related to the embodiments of this application are shown; for specific technical details not disclosed, please refer to the method section of the embodiments of this application. The computer-readable storage medium may be a storage device comprising various electronic devices. Optionally, in the embodiments of this application, the computer-readable storage medium is a non-transitory computer-readable storage medium.

[0259] Another aspect of this application provides an electronic device. In an embodiment of an electronic device according to this application, the electronic device may include at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores a computer program which, when executed by the at least one processor, implements the semantic acquisition method of traffic lights described in any of the above embodiments. Figure 13 exemplarily illustrates memory 11 and processor 12 communicatively connected via a bus. The electronic device described in this application may be, but is not limited to, a tablet computer, desktop computer, laptop computer, ultra-mobile personal computer (UMPC), netbook, or the like; the embodiments of this application do not limit it to any particular type.

[0260] The technical solution of this application has been described above with reference to the embodiments shown in the accompanying drawings. However, those skilled in the art will readily understand that the scope of protection of this application is not limited to these specific embodiments. Without departing from the principles of this application, those skilled in the art can make equivalent changes or substitutions to the relevant technical features, and the technical solutions after such changes or substitutions will all fall within the scope of protection of this application.

Claims

1. A method for semantic acquisition of traffic lights, characterized in that the method includes: acquiring video clips of traffic lights captured by cameras on vehicles; obtaining the 3D position of each single light in each frame image of the video clip, and tracking the single lights in each frame image based on the 3D positions to obtain the tracking ID of each single light in each frame image; detecting 2D detection boxes of single lights in the image, and performing single-light semantic recognition on the image region within each 2D detection box to obtain the initial semantic information of the single light; associating the tracking ID and the initial semantic information of the same single light in the image; taking the single light indicated by each tracking ID as a target single light, and obtaining the initial semantic information of the target single light in each frame image according to the tracking ID of the target single light; and smoothing the initial semantic information of the target single light in each frame image to obtain the final semantic information of the target single light in each frame image.

2. The method according to claim 1, characterized in that associating the tracking ID and the initial semantic information of the same single light in the image includes: projecting the 3D position of a single light in the image onto the image coordinate system to obtain its 2D position; matching the 2D position with the 2D detection boxes of single lights in the image; and, if the match succeeds, associating the tracking ID of the single light corresponding to the 3D position with the initial semantic information of the single light corresponding to the matched 2D detection box.

3. The method according to claim 1, characterized in that the semantic information includes the color of a single light, and the method includes smoothing the initial color of the target single light in each frame image by: obtaining, based on the initial color, a first candidate time frame at which the color changes abruptly, wherein the initial color of the target single light in the first candidate time frame differs from that in the immediately following time frame; obtaining the abrupt color corresponding to the first candidate time frame, wherein the abrupt color is the initial color of the target single light in the first time frame after the first candidate time frame; obtaining a first time window of a first preset duration following the first candidate time frame; obtaining the number of occurrences of the abrupt color within the first time window and the number of time frames within the first time window, and, if the difference between the number of time frames and the number of occurrences is less than a first difference threshold, taking the first candidate time frame as a first reliable time frame; and smoothing the initial color based on the first reliable time frame.

4. The method according to claim 3, characterized in that smoothing the initial color based on the first reliable time frame includes: obtaining the stable color corresponding to the first reliable time frame, wherein the stable color is the initial color of the target single light in the first time frame after the first reliable time frame; obtaining color conversion information of the first reliable time frame based on the stable color, the color conversion information indicating that the final color of the target single light after the first reliable time frame is the stable color; and determining the final color of the target single light in each frame image based on the color conversion information.

5. The method according to claim 4, characterized in that determining the final color of the target single light in each frame image based on the color conversion information includes: when there are multiple first reliable time frames, determining the final color of the target single light in each frame image sequentially according to the color conversion information of each first reliable time frame, in chronological order.

6. The method according to any one of claims 3 to 5, characterized in that the method further includes obtaining the flashing state of the target single light based on its final color in each frame image by: obtaining the final color of the target single light in each frame image based on the tracking ID of the target single light, and performing flash detection on the final colors to obtain a flashing period; and taking each time frame within the video clip as a target time frame and obtaining the flashing state of the target single light in the target time frame according to the flashing period, wherein the flashing state is flashing if the target time frame falls within the flashing period, and not flashing if it does not.

7. The method according to claim 6, characterized in that performing flash detection on the final colors to obtain the flashing period includes: matching the final colors of the target single light against a preset flashing template to obtain a color-matching period, and taking the color-matching period as the flashing period; wherein the flashing template includes M template colors arranged in sequence, each template color being a color displayed when a single light flashes, M < N, and N being the total number of time frames in the video clip; and the color-matching period includes M time frames, in which the final colors of the target single light in the 1st to Mth time frames are identical to the 1st to Mth template colors of the flashing template, respectively.

8. The method according to any one of claims 3 to 5, characterized in that, when the target single light is a countdown light, the semantic information further includes countdown numbers, a countdown light being a single light whose lamp head is shaped like a number, and the method includes smoothing the initial numbers of the target single light in each frame image by: obtaining the final color of the target single light in each frame image based on the tracking ID of the target single light; dividing the video clip into multiple color-independent time periods based on the final color, wherein the final color of the target single light is the same throughout any one color-independent time period and differs between two adjacent color-independent time periods; detecting outliers among the initial numbers of the target single light within each color-independent time period using the Z-score method to obtain the abnormal numbers within that period; smoothing the abnormal numbers with the Lowess smoothing method based on the normal numbers within the color-independent time period, the normal numbers being the initial numbers in that period other than the abnormal numbers; taking the countdown numbers of the target single light in all color-independent time periods as the optimized numbers of the target single light in each frame of the video clip; and obtaining the final numbers of the target single light in each frame image based on the optimized numbers.

9. The method according to claim 8, characterized in that obtaining the final number of the target single light in each frame image based on the optimized numbers includes: obtaining, based on the optimized numbers, a second candidate time frame at which the number changes abruptly, wherein the optimized number of the target single light in the second candidate time frame differs from that in the immediately following time frame; obtaining the mutation number corresponding to the second candidate time frame, the mutation number being the optimized number of the target single light in the first time frame after the second candidate time frame; obtaining a second time window of a second preset duration following the second candidate time frame; obtaining the number of occurrences of the mutation number within the second time window and the number of time frames within the second time window, and, if the difference between the number of time frames and the number of occurrences is less than a second difference threshold, taking the second candidate time frame as a second reliable time frame; and smoothing the optimized numbers again based on the second reliable time frame to obtain the final number of the target single light.

10. The method according to claim 9, characterized in that smoothing the optimized numbers again based on the second reliable time frames includes: adding the optimized number of the target single light in each second reliable time frame to the time value of that second reliable time frame to obtain the sum of the target single light in each second reliable time frame; obtaining the average of all the sums, and subtracting the time value of each time frame from the average to obtain the ideal number of the target single light in that time frame, each time frame being a time frame within the video clip; taking each time frame within the video clip as a target time frame, and obtaining the deviation between the optimized number and the ideal number of the target single light in the target time frame; and, if the absolute value of the deviation is less than a preset deviation threshold, obtaining the final number based on the optimized number, or, if the absolute value of the deviation is greater than or equal to the preset deviation threshold, obtaining the final number based on the ideal number.