A city road detection method based on a multi-modal large model

By employing a multi-stage detection and adaptive iterative optimization method for prompt words, the problem of high misjudgment rate in road defect detection using multimodal large models is solved, thereby improving the accuracy and reliability of detection and meeting the accuracy and stability requirements of road defect detection.

CN122244665A · Pending · Publication Date: 2026-06-19 · ZHEJIANG SUPCON INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHEJIANG SUPCON INFORMATION TECH CO LTD
Filing Date
2025-12-31
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing multimodal large models for road defect detection have coarse prompts, rely on experience for optimization, and have a high misjudgment rate, making it difficult to meet the requirements for accuracy and reliability.

Method used

The design employs a multi-stage prompt word scheme, which integrates image quality detection, road event detection, and refined event determination stages. By combining the analysis of false positives and false negatives, the prompt words are adaptively and iteratively optimized to gradually improve the detection accuracy.

Benefits of technology

Through multi-stage detection and adaptive iterative optimization, the accuracy and reliability of detection are significantly improved, false alarms are reduced, model performance is optimized, and the accuracy and stability requirements of road defect detection are met.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the field of image or video recognition technology, and discloses a method for urban road detection based on a multimodal large model. The method includes: designing a multi-stage prompt word scheme, comprising a sequential image quality detection stage, a road event detection stage, and a refined event judgment stage. The output of each stage serves as the input for the next stage. Each stage uses targeted prompt words to guide the multimodal large model to output a judgment result. After detecting a road event, the monitoring point is repeatedly detected at preset time points. When the event is continuously detected in the time series, it is determined to be a credible event. Based on misjudged and missed samples in historical detection results, the judgment rules in the prompt words are automatically analyzed and optimized through the large model, and the prompt words are adaptively iteratively updated until the detection accuracy meets a threshold. This method solves the problems of coarse prompt words and high model misjudgment rate, achieving the goal of adaptive optimization of prompt words and improved detection accuracy.

Description

Technical Field

[0001] This invention relates to the field of image or video recognition or understanding, and more particularly to a method for urban road detection based on a multimodal large model.

Background Technology

[0002] Road defect detection is a crucial aspect of road maintenance and traffic safety management. Traditional methods primarily rely on manual inspections or image acquisition by vehicle-mounted equipment followed by visual analysis by professionals. This approach is inefficient, costly, and fails to meet the routine inspection needs of large-scale road networks. In recent years, automated detection technologies based on computer vision, especially the application of deep learning instance segmentation models (such as Mask R-CNN), have improved the efficiency of defect identification to some extent. However, these methods are still affected by complex environmental factors such as changes in lighting and occlusion, making it difficult to accurately extract both minute cracks and large potholes. More importantly, their functionality is largely limited to defect identification and location, lacking the ability to scientifically quantify the severity of defects. The assessment process is highly dependent on human experience, failing to provide standardized and reproducible decision-making criteria. Furthermore, although the emergence of multimodal large models has brought new possibilities for understanding complex scenarios, general models have significant shortcomings in professional fields such as road maintenance: their output results are prone to deviation from engineering specifications and professional expectations, and the existing prompt word engineering methods are crude and optimization relies on experience, resulting in high misjudgment rates and insufficient stability in practical applications, making it difficult to meet the core requirements of road maintenance for accuracy, reliability and automated assessment.

[0003] For example, Chinese patent CN120808112A discloses a method for intelligent identification and evaluation of road defects based on a multimodal large model and instance segmentation algorithm. It provides the following technical solutions: collecting road surface defect image data to construct an original dataset; training a road surface defect identification model based on instance segmentation (Mask-RCNN, etc.) deep learning methods; constructing a multimodal prompting engineering framework based on parameter state transfer to generate a domain-adaptive question-answering dataset; collecting historical defect data, road defect-related professional knowledge, and evaluation standards to construct a professional knowledge compilation framework and generate a multimodal large model defect evaluation dataset; optimizing the training results based on a collaborative optimization method using a multimodal large model fusion specification compliance function to form a domain-specific large model; and constructing an end-to-end intelligent road defect identification and evaluation system based on the fine-tuned domain-specific large model. This method solves the problems of traditional road surface detection methods relying on human experience and being inefficient, while simultaneously utilizing a large model to construct a professional road condition evaluation system. However, the aforementioned intelligent road defect identification and assessment method based on a multimodal large model and instance segmentation algorithm uses a fixed prompt word library or human input for setting prompt words. The accuracy of the large model's output results is highly dependent on the prompt word input. This fixed or human-influenced prompt word input will cause the large model's output to fluctuate greatly and be inaccurate. 
At the same time, the cited patent cannot enable the large model to deeply internalize professional knowledge and perform quantitative hazard assessment and repair decisions. It also lacks standardized compliance functions and end-to-end collaborative optimization capabilities. The assessment remains superficial and the optimization relies on a large amount of manual annotation.

Summary of the Invention

[0004] This invention addresses the problems of coarse prompt words, reliance on experience for optimization, and high model misjudgment rate in existing technologies. It proposes a city road detection method based on a multimodal large model, achieving the goals of adaptive optimization of prompt words, improved detection accuracy, and reduced manual parameter tuning.

[0005] Furthermore, the present invention aims to achieve more accurate and stable performance of a general-purpose large model in road defect detection tasks by designing a process that automatically iterates and optimizes prompt words based on the analysis of misjudged and missed judgment samples without modifying model parameters.

[0006] To achieve the above objectives, the present invention adopts the following technical solution. A method for urban road detection based on a multimodal large model includes: designing a multi-stage prompt word scheme comprising, in sequence, an image quality detection stage, a road event detection stage, and a refined event judgment stage, where the output of each stage serves as the input for the next stage and each stage uses targeted prompt words to guide the multimodal large model to output a judgment result; after a road event is detected, repeatedly detecting the monitoring point at preset time points, and determining the event to be credible when it is continuously detected in the time series; and, based on the misjudged and missed samples in the historical detection results, automatically analyzing and optimizing the judgment rules in the prompt words through a large model, and adaptively and iteratively updating the prompt words until the detection accuracy meets the preset threshold.

[0007] By refining the detection process through a multi-stage prompt word scheme, the accuracy and reliability of detection are improved. Repeated detection confirms the credibility of events, reducing false alarms. Adaptive iterative updating of prompt words optimizes model performance and improves detection accuracy.

[0008] Preferably, the multi-stage prompt word scheme includes the following. The first stage is the image quality detection stage: the prompt words guide the large model to judge whether the image has problems such as abnormal lighting, blurring, or occlusion that make the road unrecognizable, and, based on the preset positive criteria and strict exclusion rules, the model outputs a judgment of whether the image quality is reliable. The second stage is the event detection stage: the prompt words guide the large model to determine whether a specified road event exists in the image and to locate and describe any detected event, outputting a structured result containing the existence of the event, its location coordinates, and a text description. The third stage is the refined event judgment stage: the prompt words guide the large model to review the specific image region where the event detected in the second stage is located and, based on more granular positive criteria and exclusion rules, to make the final judgment within that region.

[0009] Image quality inspection ensures reliable input, event detection initially identifies events, and refined judgment and verification confirm the results, thus forming a complete detection process that improves the level and accuracy of detection.
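The chaining of the three stages can be sketched in Python as follows. This is only an illustration under stated assumptions: the `model` callable, the prompt keys, and the result fields `label`, `bbox`, and `explain` are hypothetical names, and stage 1 is assumed to return true when the image has a quality problem, as in the example prompt given later in the description.

```python
import json
from typing import Callable, Optional

# Hypothetical model interface: (prompt, image bytes, optional region) -> JSON string.
ModelFn = Callable[[str, bytes, Optional[list]], str]

def run_three_stage(model: ModelFn, image: bytes, prompts: dict) -> dict:
    """Chain the stages so that each stage's output gates or feeds the next."""
    # Stage 1: image quality. Assumption: label == true means a quality problem.
    quality = json.loads(model(prompts["quality"], image, None))
    if quality["label"]:
        return {"label": False, "reason": "unreliable_image"}

    # Stage 2: event detection with location and description.
    event = json.loads(model(prompts["detect"], image, None))
    if not event["label"]:
        return {"label": False, "reason": "no_event"}

    # Stage 3: refined judgment restricted to the region found in stage 2.
    final = json.loads(model(prompts["refine"], image, event["bbox"]))
    return {
        "label": bool(final["label"]),
        "bbox": event["bbox"],
        "explain": event.get("explain", ""),
    }
```

Returning early after stage 1 or stage 2 means the stricter stage-3 prompt is only spent on images that are usable and already contain a candidate event.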

[0010] Preferably, the automatic analysis and optimization of the judgment rules in the prompt words through the large model includes: collecting structured sample data containing image IDs, model judgment labels, manual annotation labels, and region description text; filtering out misjudged samples that were judged as positive by the model but manually labeled as negative, and missed samples that were judged as negative by the model but manually labeled as positive; inputting the description text of the missed samples into the large model, summarizing common visual features and generating new supplementary suggestions for positive criteria; inputting the description text of the misjudged samples into the large model, summarizing common visual features and generating new supplementary suggestions for strict exclusion rules.

[0011] By using large models to automatically analyze misjudged and missed samples, the system learns from the data and optimizes the rules, reducing manual intervention and improving optimization efficiency and accuracy, making the prompts more adaptable to real-world scenarios.
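As a minimal sketch of the sample filtering step, the structured samples can be split into misjudged samples (false positives) and missed samples (false negatives) by comparing the two labels. The `Sample` field names are assumptions chosen to mirror the fields listed in paragraph [0010].

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    image_id: str
    model_label: bool    # judgment output by the model
    human_label: bool    # manual annotation
    description: str     # region description text

def split_errors(samples: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Return (misjudged, missed) samples."""
    # Misjudged: model says positive, annotator says negative.
    misjudged = [s for s in samples if s.model_label and not s.human_label]
    # Missed: model says negative, annotator says positive.
    missed = [s for s in samples if not s.model_label and s.human_label]
    return misjudged, missed
```

The descriptions of the missed samples would then feed the request for new positive criteria, and those of the misjudged samples the request for new strict exclusion rules.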

[0012] Preferably, the automatic analysis and optimization of prompt words through a large model further includes: after obtaining new positive criteria and strict exclusion rules, supplementing the original stage prompt words with the generated new rules to form a new version of prompt words to be reviewed; inputting the new version of prompt words into the large model to review whether there are rule duplication or logical conflicts; after the review is correct, outputting the integrated final version of prompt words and calculating its accuracy on the new sample set; if the accuracy is still lower than the preset threshold, then initiating the iterative optimization process.

[0013] By reviewing new rules through a large model, we ensure the logical consistency and lack of conflict of rules, avoiding redundancy or contradictions. At the same time, we continuously improve the accuracy through iterative optimization, thus achieving self-improvement of the system.
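The review-then-evaluate decision can be sketched as below. The `review` and `evaluate` callables are hypothetical stand-ins for the large-model review and for the accuracy calculation on the new sample set; the status strings are also illustrative.

```python
from typing import Callable, Optional, Tuple

def review_and_evaluate(
    candidate_prompt: str,
    review: Callable[[str], bool],      # True if no duplicate or conflicting rules
    evaluate: Callable[[str], float],   # accuracy on the new sample set, in [0, 1]
    threshold: float,
) -> Tuple[str, Optional[float]]:
    """Return a status and, if the review passed, the measured accuracy."""
    if not review(candidate_prompt):
        # The merged prompt still contains duplication or logical conflict.
        return "rejected", None
    acc = evaluate(candidate_prompt)
    # Accuracy below the preset threshold triggers iterative optimization.
    return ("final" if acc >= threshold else "iterate"), acc
```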

[0014] Preferably, the positive criteria preset in the prompt words of the first stage are used to determine whether there are quality problems in the image that cause the road area to be unrecognizable, and strict exclusion rules are used to exclude image anomalies that do not affect the recognition of the road area.

[0015] Preferably, the prompts in the second stage are for the detection of specific road events. The positive criteria include several judgment conditions that must be met simultaneously and that correspond to the key features of the event. The strict exclusion rules include negative conditions that distinguish the event from easily confused or interfering scenarios.

[0016] By using positive criteria that meet multiple conditions simultaneously and employing targeted exclusion rules, the specificity of event detection is improved, false positives and false negatives are reduced, and the accuracy of detection is enhanced.

[0017] Preferably, the iterative update includes: taking the current prompt word as the baseline prompt word, and combining it with a predefined search space description containing adjustable parameter types, value ranges, and step sizes, as well as the accuracy results, correct pattern summaries, and incorrect pattern summaries from the previous evaluation to form input information; submitting the input information to a large model that acts as a prompt word optimizer, instructing it to make only minimal, controllable changes to the baseline prompt word in no more than a certain number of places while keeping the paragraph headings and output format unchanged; and outputting several candidate prompt words containing specific parameter changes, the full text of the prompt word, and the reasons for the changes according to the instructions.

[0018] By using structured inputs and constrained modifications, the optimization process becomes controllable and efficient. The large model acts as an optimizer, generating multiple candidate prompts and providing diverse optimization directions, making it easier to select the best solution.
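A sketch of how the optimizer input might be assembled and how returned candidates might be checked against the constraints. The payload keys, the candidate fields `prompt` and `changes`, and the convention that paragraph headings are bracketed lines such as `[Positive Criteria]` are all assumptions for illustration.

```python
import json
import re

def build_optimizer_input(baseline, search_space, prev_accuracy,
                          correct_summary, error_summary, max_changes):
    """Bundle everything the prompt-optimizer model is given into one JSON string."""
    payload = {
        "baseline_prompt": baseline,
        "search_space": search_space,
        "previous_accuracy": prev_accuracy,
        "correct_pattern_summary": correct_summary,
        "incorrect_pattern_summary": error_summary,
        "constraints": {
            "max_changes": max_changes,
            "keep_headings": True,
            "keep_output_format": True,
        },
    }
    return json.dumps(payload, ensure_ascii=False, indent=2)

def headings(prompt):
    # Assumed convention: a heading is a bracketed tag at the start of a line.
    return re.findall(r"^\[[^\]\n]+\]", prompt, flags=re.MULTILINE)

def is_valid_candidate(baseline, candidate, max_changes):
    """Accept a candidate only if headings are unchanged and edits are few."""
    same_headings = headings(candidate["prompt"]) == headings(baseline)
    return same_headings and len(candidate["changes"]) <= max_changes
```

Checking candidates in code rather than trusting the optimizer's compliance keeps each round's changes small enough to attribute an accuracy change to a specific edit.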

[0019] Preferably, the iterative update further includes: terminating the update when, after three consecutive rounds of optimization iterations, the detection accuracy of the new prompt words has not improved compared to the previous round, or when the total number of optimization iterations has reached a preset upper limit.
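The termination rule can be expressed as a small predicate over the accuracy history. In this sketch, `history[0]` is assumed to be the baseline accuracy and each later entry the accuracy after one optimization round; "not improved" is read as less than or equal to the previous round.

```python
from typing import Sequence

def should_stop(history: Sequence[float], max_iters: int, patience: int = 3) -> bool:
    """Stop after `patience` consecutive non-improving rounds or at the iteration cap."""
    rounds = len(history) - 1          # number of optimization rounds completed
    if rounds >= max_iters:
        return True
    if rounds < patience:
        return False
    # Each of the last `patience` rounds failed to beat the round before it.
    last = range(len(history) - patience, len(history))
    return all(history[i] <= history[i - 1] for i in last)
```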

[0020] Preferably, the step of repeatedly detecting the monitoring point at preset time points includes: when a target event is detected at a camera device at a first time point, acquiring the image from the device again at a preset second time point and confirming the detection; for road construction events, the second time point is several minutes after the first detection; for road damage events, two confirmation detections are performed within one hour after the first detection; events confirmed through repeated detection are recorded in the database; of these, only events that occur at least twice consecutively within a day are retained, and events that occur only intermittently, three times or fewer, are filtered out as unreliable events.

[0021] By detecting repetitions and determining continuity in time series data, instantaneous or sporadic false alarms are filtered out, improving the stability and reliability of event detection and ensuring the reliability of the data entering the database.
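A minimal sketch of the confirmation schedule follows. The source only says "several minutes" for construction and "two confirmations within one hour" for damage, so the concrete offsets (5, 20, and 40 minutes) and the event-type keys are hypothetical placeholders.

```python
from datetime import datetime, timedelta
from typing import List

# Hypothetical offsets; the source does not fix exact values.
CONFIRM_OFFSETS = {
    "road_construction": [timedelta(minutes=5)],
    "road_damage": [timedelta(minutes=20), timedelta(minutes=40)],
}

def confirmation_times(event_type: str, first_seen: datetime) -> List[datetime]:
    """Time points at which the same camera should be re-checked."""
    return [first_seen + offset for offset in CONFIRM_OFFSETS[event_type]]

def is_credible(confirmations: List[bool]) -> bool:
    """An event is credible only if every scheduled re-check detected it again."""
    return len(confirmations) > 0 and all(confirmations)
```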

[0022] Preferably, the search space includes several adjustable parameters, including a comprehensive confidence threshold, a minimum crack width threshold, a length compensation rule switch, and an exclusion rule switch for specific false alarm types.
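One way to represent such a search space is as a list of typed parameters, each with a range and step or an on/off switch. The parameter names, ranges, and steps below are illustrative assumptions, not values from the source.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class NumericParam:
    name: str
    low: float
    high: float
    step: float

    def values(self) -> List[float]:
        # Enumerate low, low + step, ..., high; rounding avoids float drift.
        n = int(round((self.high - self.low) / self.step))
        return [round(self.low + i * self.step, 6) for i in range(n + 1)]

@dataclass(frozen=True)
class SwitchParam:
    name: str

    def values(self) -> List[bool]:
        return [False, True]

# Hypothetical search space mirroring the four parameter kinds named above.
SEARCH_SPACE = [
    NumericParam("confidence_threshold", 0.80, 0.95, 0.05),
    NumericParam("min_crack_width_px", 2, 10, 2),
    SwitchParam("length_compensation_rule"),
    SwitchParam("exclude_shadow_false_alarms"),
]
```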

[0023] Compared with the prior art, the beneficial effects of the present invention are as follows.

[0024] 1. This invention significantly improves detection accuracy through a multi-stage prompting mechanism. The first stage rigorously screens image quality to effectively eliminate interference factors such as blurring and occlusion, laying a reliable foundation for subsequent analysis. The second stage performs refined identification and localization of specific defect types, combining positive criteria and exclusion rules to ensure sufficient basis for judgment. The third stage conducts final verification within the red-framed area, minimizing misjudgments through multiple conditional constraints, and outputs structured results for easy system processing.

[0025] 2. This invention automatically analyzes misjudged and missed samples by comparing model and manually labeled results, and generates targeted rule optimization suggestions, constructing a complete iterative closed loop. It can continuously optimize prompt words without changing model parameters, making the detection standard increasingly closer to the needs of real-world scenarios and gradually improving comprehensive performance indicators.

[0026] 3. This invention designs a time-series-based repeated confirmation logic, which effectively distinguishes between temporary interference and real defects, improves the credibility of events, and performs continuity filtering of data at the front-end display level to ensure that the output results have practical reference value.

Attached Figure Description

[0027] Figure 1 is an overall flowchart of an urban road detection method based on a multimodal large model according to the present invention.

[0028] Figure 2 is a flowchart illustrating the three stages of an embodiment of the urban road detection method based on a multimodal large model according to the present invention.

Detailed Implementation

[0029] As shown in Figures 1 and 2, a method for urban road detection based on a multimodal large model includes: designing a multi-stage prompt word scheme comprising, in sequence, an image quality detection stage, a road event detection stage, and a refined event judgment stage, where the output of each stage serves as the input for the next stage and each stage uses targeted prompt words to guide the multimodal large model to output a judgment result; after a road event is detected, repeatedly detecting the monitoring point at preset time points, and determining the event to be credible when it is continuously detected in the time series; and, based on the misjudged and missed samples in the historical detection results, automatically analyzing and optimizing the judgment rules in the prompt words through a large model, and adaptively and iteratively updating the prompt words until the detection accuracy meets the preset threshold.

[0030] In one embodiment, shown in Figure 1 and Figure 2, the present invention constructs an intelligent detection process based on a multimodal large model, achieving reliable identification and judgment of urban road defects through a phased, multi-round image analysis and rule-adaptive mechanism.

[0031] The method mainly includes three key steps: multi-stage progressive image analysis and event determination, time-series-based continuous event verification, and adaptive iterative optimization of prompt words.

[0032] In practical implementation, the first step is to design and execute a multi-stage prompting scheme. This step generally includes three sequentially advancing stages. The first stage is the image quality detection stage, where prompts guide the model to determine whether the image has quality issues such as overexposure, underexposure, blurring, or occlusion, ensuring that subsequent analysis is based on reliable image regions. The second stage is the road event detection stage, where the system guides the large model to detect the presence of specified road events in the image, such as road construction or missing road markings, and locates and describes the identified events. The third stage is the refined event judgment stage, where more stringent judgment rules are applied to the areas where the events detected in the second stage are located for verification, and finally, a conclusion is output regarding whether defects are confirmed.

[0033] Each stage's prompts include clearly defined "positive criteria" and "strict exclusion rules" to ensure that the judgment conditions are clear and actionable. For example, in road construction inspection, the positive criteria require that multiple conditions be met simultaneously, including construction equipment, work surface, construction personnel, warning facilities, and road occupancy impact; while the strict exclusion rules filter out non-construction scenes, image quality interference, and insufficient scale.

[0034] After a single detection, the system enters the time-series verification phase. Once a target event is identified at a certain time point, the system will repeat image acquisition and identification at the same monitoring point at preset subsequent time points. For example, for construction events, a second confirmation is usually performed after a few minutes; for road damage, two additional identifications are performed within an hour. Only events that are continuously detected in the time series are judged as reliable events and recorded in the database. Before displaying the results, the system also performs continuity filtering on the events in the database, retaining only event records that appear at least twice consecutively within a day; events that appear intermittently a few times are considered unreliable and filtered out.
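The continuity filter applied before display can be sketched as follows. The data layout is an assumption: each key identifies one event at one monitoring point on one day, and the value is the ordered list of detection results for that day's checks.

```python
from itertools import groupby
from typing import Dict, List

def longest_run(flags: List[bool]) -> int:
    """Length of the longest run of consecutive positive detections."""
    return max((sum(1 for _ in group) for hit, group in groupby(flags) if hit), default=0)

def continuity_filter(daily_checks: Dict[str, List[bool]],
                      min_consecutive: int = 2) -> Dict[str, List[bool]]:
    """Keep only events detected at least `min_consecutive` times in a row within the day."""
    return {
        key: flags
        for key, flags in daily_checks.items()
        if longest_run(flags) >= min_consecutive
    }
```

With this layout, an event seen three times but never on two consecutive checks is dropped, while an event seen only twice back to back is kept.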

[0035] To continuously improve detection accuracy, this invention also incorporates an adaptive optimization mechanism for prompt words. This mechanism first filters samples whose model judgments differ from manual annotations based on historical detection results, including misjudged and missed samples. Next, the descriptive text of these samples is input into the large model, which automatically analyzes their common visual features and generates targeted supplementary suggestions for "positive criteria" and "strict exclusion rules," respectively.

[0036] The system then integrates these newly generated rules into the existing prompts, forming new prompts. These new prompts are then reviewed by a large model to ensure there are no duplicate rules or logical conflicts before their accuracy is calculated on a test set. If the accuracy does not reach a preset threshold, an iterative optimization process is initiated. In this process, the system submits the current prompts, the search space for adjustable parameters, and the performance feedback from the previous round to the large model, which acts as a "prompt optimizer." This model generates several candidate prompts with minimal modifications. After testing and evaluation, if the performance improvement is insufficient or the maximum number of iterations is reached, the optimization process terminates, and the version of the prompt with the best performance is finalized.
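The overall optimization loop in this paragraph can be sketched as below. The `propose` and `evaluate` callables are hypothetical stand-ins for the prompt-optimizer model and the test-set evaluation, and the patience of three rounds is borrowed from the termination rule stated earlier in the description.

```python
from typing import Callable, List, Tuple

def optimize_prompt(
    baseline: str,
    propose: Callable[[str, float], List[str]],  # optimizer model: returns candidate prompts
    evaluate: Callable[[str], float],            # accuracy on the test set
    threshold: float,
    max_iters: int,
    patience: int = 3,
) -> Tuple[str, float]:
    """Iterate until the threshold is met, progress stalls, or the cap is reached."""
    best_prompt, best_acc = baseline, evaluate(baseline)
    prev_acc, stale = best_acc, 0
    for _ in range(max_iters):
        if best_acc >= threshold or stale >= patience:
            break
        candidates = propose(best_prompt, prev_acc)
        if not candidates:
            break
        # Score every candidate and take the best of this round.
        round_acc, round_prompt = max(((evaluate(c), c) for c in candidates),
                                      key=lambda pair: pair[0])
        stale = stale + 1 if round_acc <= prev_acc else 0
        prev_acc = round_acc
        if round_acc > best_acc:
            best_prompt, best_acc = round_prompt, round_acc
    # The best-performing version seen so far is the one that is finalized.
    return best_prompt, best_acc
```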

[0037] In one embodiment, for event detection of road defects (including road damage, faded or missing road markings, fallen rocks, and road construction), the system requires information on the existence of the event, its location in the image, and a description of the event. In this embodiment, a one-stage, two-stage, or three-stage query scheme can be used (the specific scheme can be automatically configured through database tables).

[0038] The one-stage scheme mainly performs image-quality detection. For example, when the three-stage detection reaches its final stage, it checks whether the red-boxed area of the image shows blurring, occlusion, rain, or snow. If any of these is found, the event in the red-boxed area is unreliable; otherwise, it is reliable.

[0039] The following is an example of the prompt words for the first stage in this embodiment: You are a rigorous image quality control and analysis expert, detecting image quality issues such as overexposure, underexposure, jitter, blur, black screen, and distorted image quality that could render road sections unrecognizable.

[0040] [Positive Criterion] (Any hit is automatically considered a positive result (true)): Overexposure judgment: The highlight areas of the image lose a lot of detail, appearing as a completely white area, making the road parts difficult to identify; Overly dark judgment: The dark areas of the image lose a lot of details, appearing as a completely black area or with a lot of noise, making it difficult to identify the road part; Shaking / Blurriness Judgment: The main outline of the image shows obvious ghosting or blurring, the details are blurred and unclear, and the road part is difficult to identify; Other unidentifiable criteria: The core content of the image cannot be identified due to occlusion, foreign objects, extreme angles, etc., making it difficult to identify the road section.

[0041] [Strict Exclusion Rule] (Any hit results in a negative result, false): Non-road image quality issues excluded: the main road portion is clear, and only non-road portions such as the background and vehicles are blurred or badly exposed; Minor road image quality issues excluded: the road has minor image quality issues, but its key features and core details are still recognizable; Road condition issues excluded: these are not image quality problems, but road damage, water accumulation, shadows, etc.; Motion blur excluded: the main subject is clear, and only the background or non-critical moving objects have motion blur that conforms to the laws of motion; Stylized noise excluded: deliberately added film grain or stylized noise generated under non-low-light conditions; Soft focus effect excluded: slight focus inaccuracies caused by specific soft focus lenses or post-processing, rather than device shake or focus failure; Glare and reflection excluded: glare, halo, or lens reflection caused by direct illumination from a light source, rather than an overall image quality issue.

[0042] Output Rules: The output is {"label": true} only when any one of the positive criteria is met, no strict exclusion rules are triggered, and the overall confidence level is greater than or equal to 0.9; otherwise, the output is {"label": false}. In this embodiment, the output must strictly adhere to the JSON format, and other formats are strictly prohibited.

[0043] The following is an example of the two-stage prompt words in this embodiment: First question: You are a professional road construction analysis expert, specializing in detecting construction sites in images that clearly occupy the road and affect traffic on the motor vehicle lanes.

[0044] [Analysis Area] The area of the motor vehicle lane in the image.

[0045] [Positive Criteria] (must be met simultaneously): 1. Construction equipment / materials: There are stationary construction equipment or materials piled up; 2. Work surface: There is obvious excavation, repair or other work surface; 3. Construction workers: Personnel wearing reflective clothing are visible; 4. Warning facilities: No fewer than 5 traffic cones, barriers, or warning signs shall surround the construction area; 5. Impact of road occupation: The construction area occupies ≥1 / 3 of the lane width and has significantly affected the normal passage of vehicles.

[0046] [Strict Exclusion Rules] (A negative result is determined by triggering any of the following criteria): 6. Non-construction scenarios, specifically including: Ordinary vehicles and non-operational engineering vehicles that are temporarily parked; Non-excavation operations such as sanitation work and greening maintenance; The operation was mobile and no stable enclosure area was formed.

[0047] 7. Image interference, specifically including: The image has large areas of occlusion (vehicles, leaves, awnings, etc.), is severely blurred, overexposed or underexposed, making it impossible to distinguish key details; Optical interferences such as shadows, reflections, and water reflections were misjudged as construction areas.

[0048] 8. Insufficient scale, specifically including: The number of warning facilities is less than 5, or they do not form an effective enclosure; The area occupied is less than 1 / 3 of the lane, or has no substantial impact on traffic flow.

[0049] Output Rules: Only when all five positive criteria are fully met, no strict exclusion rules are triggered, and the analysis confidence level is greater than 0.85, will the output be: {"label": true}; otherwise, the output will be: {"label": false}. In this embodiment, the output must strictly adhere to the JSON format, and explanatory text is strictly prohibited.
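The decision logic that this prompt asks the model to follow can be written out explicitly, which is useful when auditing model outputs against the rules. The field names below are assumptions; the thresholds (at least 5 warning facilities, at least 1/3 of lane width, confidence strictly greater than 0.85) are taken from the prompt text above. The qualitative part of criterion 5, whether traffic is significantly affected, is left to the model and is not represented here.

```python
from dataclasses import dataclass

@dataclass
class ConstructionObservation:
    has_equipment: bool          # criterion 1
    has_work_surface: bool       # criterion 2
    has_workers: bool            # criterion 3
    warning_facility_count: int  # criterion 4
    lane_occupancy: float        # criterion 5, fraction of lane width occupied
    exclusion_triggered: bool    # any of exclusion rules 6 to 8
    confidence: float

def construction_label(o: ConstructionObservation) -> dict:
    """All five positive criteria, no exclusion hit, and confidence above 0.85."""
    positives = [
        o.has_equipment,
        o.has_work_surface,
        o.has_workers,
        o.warning_facility_count >= 5,
        o.lane_occupancy >= 1 / 3,
    ]
    label = all(positives) and not o.exclusion_triggered and o.confidence > 0.85
    return {"label": label}
```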

[0050] Second question: You are now a reporting member of the road maintenance department. The first phase of analysis has confirmed the presence of road construction issues as shown in the image below.

[0051] Your task: 1. Mark: Precisely locate a representative point of the construction scene within the road in the image. The format is "[x,y]", where x and y are the pixel coordinates of the point.

[0052] 2. Explanation: Describe the problem in one paragraph, with a minimum of 30 words and a maximum of 200 words. Output format: Output only a single, strict JSON object: { "label": true, "bbox": [x,y], "explain": "Description text, between 30 and 200 words" }
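Because the prompt demands a single strict JSON object, the reply can be validated before it is stored. The following sketch is an assumption-laden illustration: it counts words by whitespace, which may not match the unit intended in the original-language prompt, and it takes the image width and height as extra inputs to check that the point lies inside the image.

```python
import json

def validate_report(raw: str, width: int, height: int) -> dict:
    """Parse the second-question reply and enforce the stated output format."""
    obj = json.loads(raw)  # raises ValueError (JSONDecodeError) on non-JSON text
    if not isinstance(obj, dict) or set(obj) != {"label", "bbox", "explain"}:
        raise ValueError("expected exactly the keys label, bbox, explain")
    if obj["label"] is not True:
        raise ValueError("label must be true at this stage")
    bbox = obj["bbox"]
    if not (isinstance(bbox, list) and len(bbox) == 2):
        raise ValueError("bbox must be [x, y]")
    x, y = bbox
    for value in (x, y):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("coordinates must be numbers")
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("point lies outside the image")
    # Assumption: length is measured in whitespace-separated words.
    n_words = len(str(obj["explain"]).split())
    if not 30 <= n_words <= 200:
        raise ValueError("explain must be 30 to 200 words")
    return obj
```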

[0053] The following are examples of three-stage prompts in this embodiment: First question: You are the "City Road Incident Analysis Control Manager". Please indicate whether a candidate event exists in the image. Outputting the reasoning process is strictly prohibited; only output JSON, adhering to the given schema. [User] Candidate Event: ["Traffic Marking Faded / Incomplete"]. Assign a coarse confidence level of 0-1 to the candidate event (without location details). Please consider each step and provide your reasoning process.

[0054] When the confidence level is greater than 0.6, output: {"label": true}, otherwise output: {"label": false}.

[0055] If the output is true, then ask a second question: You are now a reporting member of the road maintenance department. The first phase of analysis has confirmed that the road signs and markings in the image below are faded or damaged.

[0056] Your task: 1. Location (Mark): Accurately identify the points in the image where the road signs and markings, such as zebra crossings, lane dividers, and road directional arrows, are most severely faded or missing, and which best represent the overall problem. Avoid selecting relatively intact areas, edges, or areas near distracting objects. The format is [x,y], where x and y are the pixel coordinates of the point.

[0057] 2. Explanation: Describe the problem in one paragraph, with a minimum of 30 words and a maximum of 200 words.

Output Format

[0058] Then the third question is asked: You are a rigorous urban management image analysis expert, tasked with detecting only the white / yellow road markings within the red-framed area of an image and determining whether the markings have become seriously indistinguishable.

[0059] [Positive Criteria] (All must be met): The area where the outline of the road marking disappears is greater than 15% of the area of ​​the red frame, and the area where the road marking is difficult to distinguish from the road surface color is greater than 15% of the area of ​​the red frame.

[0060] [Strict Exclusion Rule] (Any hit is immediately judged as negative): Image quality exclusions: overexposure > 25%, average brightness < 20, severe blurring / ghosting causing texture loss > 50%; Water exclusions (if water stains or standing water are present on the ground, the result is automatically negative): water stains on the ground / mirror reflection / slippery surfaces / puddles with smooth edges / light spots / camera glare; Regularly intermittent marking exclusions: dashed lines / long solid lines; Erased marking exclusions: erased old markings; Residual old marking exclusions: spray numbers / residual old markings / semi-transparent shadow lines; Artifact / chromatic aberration exclusions: tree shadows / solid color blocks / gradient chromatic aberration / camera compression noise / light spots / lens flare; Debris and projected object exclusions: gray cement marks / fallen leaves / paper scraps / gravel / mud spots / dry mud / wet mud / vehicle or traffic cone shadows; Other exclusions: local wear and tear, local darkening / lightening.

[0061] [Output Rules]: Output {"label": true} only if all [positive criteria] are met, no [strict exclusion rules] are triggered, and the overall confidence level is greater than 0.85; otherwise, output {"label": false}. In this embodiment, the output strictly follows the JSON format, and explanatory text is strictly prohibited.
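The three-question flow described above can be sketched as a short cascade. This is a minimal illustration rather than the implementation of this embodiment: `ask_model` is a hypothetical callable standing in for a call to the multimodal large model that returns the parsed JSON object, the `prompts` keys are placeholder names, and passing the stage-two point to stage three as a `region` argument is an assumption about how the red frame is derived.

```python
from typing import Any, Callable, Dict, Optional

# Hypothetical signature: ask_model(prompt, image, region) -> parsed JSON dict.
AskModel = Callable[[str, Any, Optional[list]], Dict[str, Any]]


def detect_event(image: Any, ask_model: AskModel, prompts: Dict[str, str]) -> Dict[str, Any]:
    """Run the three questions in order and stop at the first negative answer."""
    # Stage 1: coarse check, the prompt itself applies the 0.6 confidence cut-off.
    first = ask_model(prompts["coarse"], image, None)
    if not first.get("label", False):
        return {"label": False, "stage": 1}

    # Stage 2: locate the most representative point and describe the problem.
    second = ask_model(prompts["locate"], image, None)
    point = second.get("bbox")
    if not second.get("label", False) or point is None:
        return {"label": False, "stage": 2}

    # Stage 3: strict review restricted to the region around the located point.
    third = ask_model(prompts["refine"], image, point)
    return {
        "label": bool(third.get("label", False)),
        "stage": 3,
        "bbox": point,
        "explain": second.get("explain", ""),
    }
```

Because later stages run only when the earlier stage answers true, the stricter and more expensive third question is asked only for images that already passed the coarse check.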

[0062] Business logic processing: After an event is confirmed through multi-stage questioning, it is re-checked over time. For example, if the camera with ID 12 captures a construction event at 8:00, a new image is acquired from that camera at 8:05 to confirm whether the construction event actually occurred, because a passing vehicle can sometimes be mistakenly identified as construction. Similarly, if a road damage event is detected at 8:00, it is confirmed twice more, at 8:30 and 9:00, because tree shadows and vehicle shadows can be misidentified as damage.

[0063] After the records of confirmed events are entered into the database, records that are not continuous in time need to be filtered out at the front-end display level. The source cameras output images once per hour, and the normal detection frequency is also once per hour. If the road marking or damage events corresponding to a certain device appear only intermittently 3 times in a day, then that event type is unreliable for that device and is not displayed on the front end. If the road marking or damage events corresponding to a certain device appear in 2 consecutive detections within a day, then that event type is reliable for that device and can be displayed on the front end.
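The continuity filter can be sketched as follows. This is an illustrative reading of the rule, not the embodiment's code: it assumes detections are recorded as integer hour slots (0 to 23) for one device and one event type within one day, and the function name `is_reliable` is hypothetical.

```python
from typing import Iterable


def is_reliable(detection_hours: Iterable[int], min_run: int = 2) -> bool:
    """Return True if the hours contain at least `min_run` consecutive hourly slots."""
    run = 0
    previous = None
    for hour in sorted(set(detection_hours)):
        # Extend the current run only when this slot directly follows the last one.
        if previous is not None and hour == previous + 1:
            run += 1
        else:
            run = 1
        if run >= min_run:
            return True
        previous = hour
    return False


# Detections at 8:00 and 9:00 are consecutive, so the event is shown.
print(is_reliable([8, 9]))
# Three scattered detections are intermittent, so the event is hidden.
print(is_reliable([8, 11, 15]))
```

Under this reading, the count of detections alone does not decide display; only adjacency in the hourly series does.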

[0064] Automatic optimization of the prompts that determine the presence or absence of events: Without modifying the parameters of the large vision model (Qwen2.5-VL), the system automatically optimizes the strict exclusion rules and positive criteria in the prompt words based on a structured analysis of false positive (FP) and false negative (FN) samples. If the precision of the generated prompt words does not meet the requirement, the system continues to iterate, forming a self-evolving closed loop for the prompt.

[0065] For example, consider the road damage detection task (third stage). 1. The initial prompt word is: "You are a city road inspection expert, analyzing only ground damage events within the red-framed area of the image. Positive criteria: potholes or cracks ≥2cm wide. Strict exclusion rules: overexposure, shadows, water accumulation, road markings, patches, tree shadows, etc."

[0066] 2. Manually categorized sample data and their corresponding descriptions within the red boxes: The description in the red box is automatically generated by Qwen2.5-VL and must be 50 characters or less. 400 road surveillance images were collected, of which: 100 are genuine damage images, 100 are repair marks, water stains, or tree shadows (easily misjudged as damage), and 200 are normal ground images.

[0067] 3. Calculate the event detection precision: Based on the initial prompt, the road damage judgments produced by Qwen2.5-VL are compared with the manually categorized images, as shown below:
image_id,model_label,human_label,note
001.jpg,true,false,light gray rectangular patch with regular borders
002.jpg,true,true,obvious pit
003.jpg,false,true,fine cracks with low grayscale but continuous
004.jpg,true,false,dark area with tree shadow lines

[0068] Fields:
image_id: image filename or unique ID.
model_label: the judgment of the large vision model under the current prompt (true / false).
human_label: the human-labeled ground truth (true / false).
note (optional): a one-sentence description of the red-framed area (written manually or generated by VQA, ≤50 words).

[0069] The "precision rate" is calculated as the number of correctly detected events divided by the total number of detections = the number of events where both model_label and human_label are true / the number of events where model_label is true.

[0070] If the precision is less than 85%, proceed with the following initial iteration steps: compare the model output with the manually labeled results and identify the misjudged (false positive) and missed (false negative) samples.

[0071] FP (False Positive) filtering: model_label==true && human_label==false FN (Missed Detection) Filtering: model_label==false && human_label==true.
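The precision calculation and the FP / FN filtering can be sketched with the four sample rows listed in step 3. This is a minimal sketch under stated assumptions: the records are taken to be comma-separated text with the header row shown there, notes are assumed to contain no commas, and the helper names (`load_records`, `precision`, `split_errors`) are hypothetical.

```python
import csv
import io

SAMPLE = """image_id,model_label,human_label,note
001.jpg, true, false, light gray rectangular patch with regular borders
002.jpg, true, true, obvious pit
003.jpg, false, true, fine cracks with low grayscale but continuous
004.jpg, true, false, dark area with tree shadow lines
"""


def load_records(csv_text):
    """Parse the evaluation records and convert the label columns to booleans."""
    reader = csv.DictReader(io.StringIO(csv_text), skipinitialspace=True)
    return [
        {
            "image_id": row["image_id"].strip(),
            "model_label": row["model_label"].strip().lower() == "true",
            "human_label": row["human_label"].strip().lower() == "true",
            "note": (row.get("note") or "").strip(),
        }
        for row in reader
    ]


def precision(rows):
    """Both labels true, divided by all rows with model_label true (None if no positives)."""
    predicted = [r for r in rows if r["model_label"]]
    if not predicted:
        return None
    return sum(r["human_label"] for r in predicted) / len(predicted)


def split_errors(rows):
    """Return (false positives, false negatives) using the two filters above."""
    fp = [r for r in rows if r["model_label"] and not r["human_label"]]
    fn = [r for r in rows if not r["model_label"] and r["human_label"]]
    return fp, fn


rows = load_records(SAMPLE)
fp, fn = split_errors(rows)
print(precision(rows), [r["image_id"] for r in fp], [r["image_id"] for r in fn])
```

On these four rows, one of the three positive predictions is correct, so the precision is 1/3, well below the 85% threshold that triggers the optimization steps.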

[0072] 4. Then, input the "note" field of the missed image samples into Qwen2.5-VL to generate new mandatory detection conditions. The input prompt word is: You are an expert in urban road damage detection. The following samples were identified as genuine damage through manual annotation, but the model failed to recognize them. Please summarize their common visual characteristics (note field data: fine but continuous cracks with low grayscale, network cracks, longitudinal cracks). Also provide new supplementary suggestions for the "mandatory detection conditions".

[0073] Output JSON:
{
  "Common visual patterns": [
    "Linear cracks less than 2cm wide but clearly extending",
    "Area where edge damage and asphalt splicing lines are mixed",
    "Light-colored cracks with low grayscale contrast"
  ],
  "Possible reasons for missed detection": [
    "The crack is too narrow and has not reached the original width threshold",
    "Insufficient brightness weakens texture",
    "The edges of the repair marks show actual damage"
  ],
  "Recommended positive rules": [
    "A crack length ≥ 30cm, even if the width is < 2cm, is considered damage",
    "A rough texture or breakage near the seam is considered a sign of damage",
    "Low-contrast gray cracks that are continuous in shape are also considered damage"
  ]
}

[0074] 5. Then, input the "note" field of the false positive image samples into Qwen2.5-VL to generate new strict exclusion rules. The input prompt word is: You are an expert in urban road damage detection. The following samples were manually labeled as undamaged, but the model identified them as damaged. Please summarize their common visual features (note field data: regular light gray rectangular patch boundaries, tree shadow-like dark areas). Also provide new supplementary suggestions for the "strict exclusion rules".

[0075] Output JSON:
{
  "common_visual_patterns": ["Light-colored stripes", "Smoothly edged reflective areas", "Linear shadows", "Light-colored rectangular patch blocks"],
  "likely_causes": ["Reflective light caused bright bands to be misjudged", "Linear dark areas of tree shadows were mistaken for cracks"],
  "recommended_negative_rules": [
    "Exclude light gray / light-colored bands",
    "Exclude reflective wet spots with smooth edges",
    "Exclude linear dark areas caused by tree shadows",
    "Exclude light-colored rectangular repair areas"
  ]
}

[0076] Then, the mandatory detection conditions and strict exclusion rules returned in steps 4 and 5 are added to the third-stage prompt words. Qwen2.5-VL then reviews the new third-stage prompt words for duplicated or conflicting rules and outputs the integrated prompt words: You are a rigorous urban management image analysis expert, tasked with detecting only ground road damage events within the red-boxed areas of the images.

[0077] [Positive Criteria] (All must be met): Potholes or rough cracks ≥ 2cm wide; A rough texture or breakage near the splicing line is considered a sign of damage. Low-contrast gray cracks that are continuous in shape are also considered as damage.

[0078] [Strict Exclusion Rule] (Any hit is immediately judged as negative): Exclude light gray / light-colored bands; Exclude reflective wet spots with smooth edges; Exclude linear dark areas caused by tree shadows; Exclude the light-colored rectangular repair area; Other objects excluded: tire tracks / manhole covers and the area around them.

[0079] Output Rules: The output is {"label": true} only when all positive criteria are met, no strict exclusion rules are triggered, and the overall confidence level is greater than or equal to 0.85; otherwise, the output is {"label": false}. In this embodiment, the output strictly follows the JSON format, and explanatory text is strictly prohibited.
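Before the merged prompt is handed to the model for review, the rule lists can be combined with a simple text-level de-duplication. The sketch below is an assumption for illustration only: this embodiment delegates the duplicate and conflict review to Qwen2.5-VL, and the helpers `merge_rules` and `render_prompt` are hypothetical names that only approximate the paragraph layout of the integrated prompt above.

```python
def merge_rules(existing, suggested):
    """Append suggested rules, skipping ones that differ only in case, spacing, or final punctuation."""
    seen = set()
    merged = []
    for rule in list(existing) + list(suggested):
        key = " ".join(rule.lower().split()).rstrip(".;")
        if key and key not in seen:
            seen.add(key)
            merged.append(rule.strip())
    return merged


def render_prompt(role, positive, exclusions, conf_min=0.85):
    """Assemble the role line, the two rule sections, and the output rule into one prompt."""
    return "\n".join([
        role,
        "[Positive Criteria] (All must be met): " + "; ".join(positive),
        "[Strict Exclusion Rule] (Any hit is immediately judged as negative): "
        + "; ".join(exclusions),
        f'[Output Rules]: Output {{"label": true}} only when all positive criteria are met, '
        f"no strict exclusion rules are triggered, and the overall confidence level is "
        f'>= {conf_min}; otherwise output {{"label": false}}.',
    ])


exclusions = merge_rules(
    ["Exclude light gray / light-colored bands"],
    ["exclude light gray / light-colored bands.",
     "Exclude linear dark areas caused by tree shadows"],
)
print(render_prompt("You are a rigorous urban management image analysis expert.",
                    ["Potholes or rough cracks >= 2cm wide"], exclusions))
```

String matching of this kind catches only literal repeats; semantically overlapping or contradictory rules still need the model-based review described in the text.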

[0080] 6. Based on the optimized prompts, the road damage judgments produced by Qwen2.5-VL are compared with the manually categorized images to calculate the precision. If the precision is < 85%, the iterative prompt update step begins. The following is a complete example of "LLM as a prompt word optimizer (automatic candidate generation)". It includes: the baseline prompt words, the adjustable parameter definitions, the feedback from the previous evaluation round, the System / User prompt words for the LLM, and the expected LLM output format (JSON).

[0081] 1) Baseline Prompt File: prompt_baseline.txt You are a rigorous urban management image analysis expert, tasked with detecting only ground road damage events within the red-boxed areas of the images.

[0082] [Positive Criteria] (All must be met): Potholes or rough cracks with a width of ≥2mm; The crack edges are rough and the paint is not marked.

[0083] [Strict Exclusion Rule] (Any hit is immediately judged as negative): Image quality exclusions: overexposure > 25%, average brightness < 20, severe blur / ghosting causing texture loss > 50%; Rainy weather and water accumulation: large areas of mirror-like reflection, slippery surfaces, and smooth-edged water accumulation; Exclusions for repair / patch objects: rectangular / strip patches with slightly darker or lighter colors and regular boundaries; splicing lines between new and old asphalt, cutting lines, and crack sealant repair tape; Excluded from road markings / paintings: white and yellow arrows, lane lines, guide lines, zebra crossings, colored anti-skid strips or speed bumps; Artifacts / chromatic aberration object elimination: solid color blocks, compressed noise, flare, lens flare, tree shadows; Exclude clutter and projected objects: fallen leaves, scraps of paper, gravel, mud, vehicle or traffic cone shadows; Other objects excluded: tire tracks, manhole covers and their surroundings.

[0084] [Confidence Level Rule]: A positive result can only be output if the overall confidence level is ≥0.86.

[0085] Output Rules: If the positive criteria are met, no exclusion rules are triggered, and the confidence level is met, output: {"label": true}; otherwise, output: {"label": false}. In this embodiment, the output strictly follows the JSON format, and explanatory text is strictly prohibited.

[0086] 2) Adjustable search space definition (controlled modification by the optimizer) File: search_space.json (for LLM reference)
{
  "CONF_MIN": {"type":"float","min":0.80,"max":0.92,"step":0.01},
  "W_MIN_MM": {"type":"enum","choices":[1.5, 2.0, 2.5, 3.0]},
  "L_MIN_CM_RULE": {"type":"toggle","text":"Crack length ≥ {L_MIN_CM}cm, even if width < {W_SOFT_MM}mm, is considered damage"},
  "L_MIN_CM": {"type":"enum","choices":[20, 30, 40]},
  "W_SOFT_MM": {"type":"enum","choices":[1.0, 1.5, 2.0]},
  "EXC_GLARE": {"type":"toggle","text":"Excludes reflective wet spots with smooth edges"},
  "EXC_LIGHT_STRIPE": {"type":"toggle","text":"Exclude light gray or light-colored stripes"},
  "EXC_SHADOW_LINE": {"type":"toggle","text":"Exclude linear dark areas caused by tree shadows"},
  "INTENSIFIER": {"type":"enum","choices":["obvious","significant","clear"]}
}
See the table below for details: Table 1

[0087] 3) Feedback from the previous round of evaluation (provided to LLM) Example: Previous round of indicators: Precision = 0.902; Correct pattern summary: Correctly identify: long cracks, network cracks, and pits; Error mode summary: False alarms are concentrated in the following areas: reflective wet spots, light gray stripes, and thin lines of tree shadows. Missed reports include: long (≥30cm) but thin (<2mm) cracks; and low-contrast continuous cracks.

[0088] Optimization objective (fixed): The primary objective is that Precision must not fall below 0.80; make only minimal modifications (≤3) and keep the paragraph headings and output format unchanged; prefer moderately increasing the confidence threshold first.

[0089] The precision can be calculated by the following formula: "Precision" is equal to the number of correctly detected events / the total number of detections = the number of cases where model_label is true and human_label is also true / the number of cases where model_label is true.

[0090] If the precision < 85%, then perform prompt optimization iteration; Iteration termination conditions: When the precision has not improved compared to the previous round for three consecutive rounds, or the iteration count reaches the upper limit of 5 times.

[0091] The correct mode summary and the error mode summary are generated automatically by the LLM, while the optimization objective is fixed text, specifically as follows: 1) Obtain the result fields model_label, human_label, and note from the previous round, select the notes of the samples where both model_label and human_label are true, and input them into the LLM to generate the correct mode summary; 2) Select the notes of the samples where model_label is false and human_label is true, and input them into the LLM to generate the error mode summary; 3) The optimization objective is fixed text, in which 0.80 can be treated as a fixed value.

[0092] 4) Use the LLM to generate candidate prompts (System and User roles) (1) System role (text content fixed) You are the "prompt optimizer". Please make the smallest controllable changes to the given prompt to improve the precision without changing the paragraph headings and output format, while ensuring that the recall rate is not lower than the specified lower limit.

[0093] You can only make the following modifications: ① Adjust the numerical thresholds: CONF_MIN (0.80 - 0.92, step size 0.01 each time); W_MIN_MM (1.5 / 2.0 / 2.5 / 3.0); ② Optionally enable a positive rule that "if the length ≥ L_MIN_CM but the width < W_SOFT_MM, it is also considered damaged"; ③ Add or remove the following exclusion entries: EXC_GLARE / EXC_LIGHT_STRIPE / EXC_SHADOW_LINE; ④ Replace the intensity word with one of (obvious / significant / clear); ⑤ No more than 3 changes should be made to each candidate.

[0094] Output strict JSON, with the following structure: { "candidates": [ { "id": "C1", "changes": { "CONF_MIN": 0.88, "W_MIN_MM": 2.0, "L_MIN_CM_RULE": true, "L_MIN_CM": 30, "W_SOFT_MM": 1.5, "EXC_GLARE": true, "EXC_LIGHT_STRIPE": true, "EXC_SHADOW_LINE": true, "INTENSIFIER": "clear" }, "prompt_full": "(The complete and revised prompt message)", "rationale": "(No more than 40 words, explaining why this change was made)" } ] } Returning any text other than the JSON provided above is prohibited.

[0095] (2) User role (dynamic text content) (The baseline prompt, search space, and previous feedback are input together to the LLM as the user-role prompt) [Baseline prompt] <Paste the full text of prompt_baseline.txt> [Search space] <Paste the contents of search_space.json> [Previous round of feedback] <Paste the feedback text from section 3> Please generate no more than 4 candidates within the constraints.

[0096] Each candidate requires only ≤3 minimal changes and provides a full prompt_full.
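Since the optimizer is only allowed controlled modifications, each candidate's "changes" object can be checked against the search space before its prompt_full is evaluated. The following is a hedged sketch, not part of the described method: `validate_changes` is a hypothetical helper, the `text` fields of the toggles are omitted for brevity, and counting the entries of "changes" as the number of modifications is an assumption.

```python
SEARCH_SPACE = {
    "CONF_MIN": {"type": "float", "min": 0.80, "max": 0.92, "step": 0.01},
    "W_MIN_MM": {"type": "enum", "choices": [1.5, 2.0, 2.5, 3.0]},
    "L_MIN_CM_RULE": {"type": "toggle"},
    "L_MIN_CM": {"type": "enum", "choices": [20, 30, 40]},
    "W_SOFT_MM": {"type": "enum", "choices": [1.0, 1.5, 2.0]},
    "EXC_GLARE": {"type": "toggle"},
    "EXC_LIGHT_STRIPE": {"type": "toggle"},
    "EXC_SHADOW_LINE": {"type": "toggle"},
    "INTENSIFIER": {"type": "enum", "choices": ["obvious", "significant", "clear"]},
}


def validate_changes(changes, space=SEARCH_SPACE, max_changes=3):
    """Return a list of problems; an empty list means the candidate stays within bounds."""
    errors = []
    if len(changes) > max_changes:
        errors.append(f"too many changes: {len(changes)} > {max_changes}")
    for name, value in changes.items():
        spec = space.get(name)
        if spec is None:
            errors.append(f"{name}: not in search space")
        elif spec["type"] == "float":
            if isinstance(value, bool) or not isinstance(value, (int, float)):
                errors.append(f"{name}: expected a number")
                continue
            # The value must lie in [min, max] and sit on the step grid.
            steps = (value - spec["min"]) / spec["step"]
            in_range = spec["min"] - 1e-9 <= value <= spec["max"] + 1e-9
            if not in_range or abs(steps - round(steps)) > 1e-6:
                errors.append(f"{name}: {value} outside range or off the step grid")
        elif spec["type"] == "enum":
            if isinstance(value, bool) or value not in spec["choices"]:
                errors.append(f"{name}: {value!r} not an allowed choice")
        elif spec["type"] == "toggle":
            if not isinstance(value, bool):
                errors.append(f"{name}: expected true or false")
    return errors


print(validate_changes({"CONF_MIN": 0.88, "EXC_GLARE": True, "EXC_LIGHT_STRIPE": True}))
```

Rejecting out-of-bounds candidates before evaluation keeps the iteration confined to the declared search space even if the LLM ignores part of its instructions.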

[0097] (3) The final output prompt_full example is as follows: { "candidates": [ { "id": "C1", "changes": { "CONF_MIN": 0.88, "EXC_GLARE": true, "EXC_LIGHT_STRIPE": true }, "prompt_full": "You are a rigorous urban management image analysis expert, only detecting ground road damage events within the red-boxed area of the image.\n\n

Positive Criteria

Strict Exclusion Rules

[0098] 5) Test each of the newly generated candidate prompt_full and calculate its accuracy. If the iteration termination condition is not met, generate a new batch of candidate prompt_full based on the previous round of evaluation feedback each time, until the termination condition is met.
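The closed loop of candidate generation, evaluation, and termination can be sketched as below. This is one possible reading under stated assumptions: `propose` (the LLM optimizer call) and `evaluate` (precision on the labeled sample set) are hypothetical callables, the best prompt found so far is carried forward as the next baseline, and a round's score is taken to be the best precision among its candidates.

```python
def optimize_prompt(baseline, propose, evaluate, target=0.85, max_rounds=5, patience=3):
    """Iterate until the target precision is met, `patience` rounds pass
    without improvement over the previous round, or `max_rounds` is reached."""
    best_prompt, best_score = baseline, evaluate(baseline)
    previous_score = best_score
    stale_rounds = 0
    rounds_run = 0
    while best_score < target and rounds_run < max_rounds and stale_rounds < patience:
        rounds_run += 1
        # Score every candidate prompt proposed for this round.
        scored = [(evaluate(c), c) for c in propose(best_prompt, previous_score)]
        if not scored:
            stale_rounds += 1
            continue
        round_score, round_prompt = max(scored, key=lambda pair: pair[0])
        # Count consecutive rounds that fail to beat the previous round.
        stale_rounds = stale_rounds + 1 if round_score <= previous_score else 0
        previous_score = round_score
        if round_score > best_score:
            best_prompt, best_score = round_prompt, round_score
    return best_prompt, best_score, rounds_run


# Toy usage with fixed scores instead of real model calls.
scores = {"p0": 0.70, "p1": 0.80, "p2": 0.90}
result = optimize_prompt("p0", lambda p, s: ["p1"] if p == "p0" else ["p2"], scores.get)
print(result)
```

With an upper limit of 5 rounds and a patience of 3, the loop always ends after at most five batches of candidates, whichever condition is reached first.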

[0099] All data collection and extraction in this invention are carried out under compliant and legal conditions.

Claims

1. A method for urban road detection based on a multimodal large model, characterized in that it comprises: designing a multi-stage prompt word scheme, which includes a sequential image quality detection stage, road event detection stage, and refined event judgment stage, wherein the output of each stage serves as the input for the next stage, and each stage uses targeted prompt words to guide the multimodal large model to output a judgment result; after a road event is detected, repeatedly detecting the monitoring point at preset time points, and determining the event to be a credible event when it is continuously detected in the time series; and, based on the misjudged and missed samples in the historical detection results, automatically analyzing and optimizing the judgment rules in the prompt words through a large model, and adaptively and iteratively updating the prompt words until the detection accuracy meets the preset threshold.

2. The urban road detection method based on a multimodal large model according to claim 1, characterized in that the multi-stage prompt word scheme includes: a first stage, which is the image quality detection stage, in which the prompt words guide the large model to judge whether the image has problems such as abnormal lighting, image blurring, or the road being unrecognizable due to occlusion interference, and, based on the preset positive criteria and strict exclusion rules, the model outputs a judgment result on whether the image quality is reliable; a second stage, which is the event detection stage, in which the prompt words guide the large model to determine whether a specified road event exists in the image and to locate and describe the detected event, outputting a structured result containing the existence of the event, its location coordinates, and a text description; and a third stage, which is the refined event judgment stage, in which the prompt words guide the large model to review the specific image region where the event detected in the second stage is located, and the final judgment is made within the specified image region based on more granular positive criteria and exclusion rules.

3. The urban road detection method based on a multimodal large model according to claim 2, characterized in that the automatic analysis and optimization of the judgment rules in the prompt words through a large model includes: collecting structured sample data containing image IDs, model-judged labels, human-annotated labels, and region description text; filtering out the misjudged samples that were judged positive by the model but negative by human labeling, as well as the missed samples that were judged negative by the model but positive by human labeling; inputting the description text of the missed samples into the large model to summarize common visual features and generate new supplementary suggestions for the positive criteria; and inputting the description text of the misjudged samples into the large model to summarize common visual features and generate new supplementary suggestions for the strict exclusion rules.

4. The urban road detection method based on a multimodal large model according to claim 3, characterized in that, The automatic analysis and optimization of prompt words through a large model also includes: After obtaining new positive criteria and strict exclusion rules, the generated new rules are added to the original stage prompts to form a new version of prompts to be reviewed. The new version of prompts is then input into the large model to review whether there are any rule duplications or logical conflicts. After the review is completed without errors, the final integrated version of prompts is output, and its accuracy on the new sample set is calculated. If the accuracy is still lower than the preset threshold, the iterative optimization process is started.

5. A method for urban road detection based on a multimodal large model according to claim 2 or 4, characterized in that, The first stage uses a preset positive criterion in the prompt words to determine whether there are quality problems in the image that make the road area unrecognizable, and strictly excludes image anomalies that do not affect the recognition of the road area.

6. The urban road detection method based on a multimodal large model according to claim 5, characterized in that, The second-stage prompts are for the detection of specific road events. The positive criteria include several judgment conditions that must be met simultaneously and that correspond to the key features of the event. The strict exclusion rules include negative conditions that distinguish the event from easily confused or interfering scenarios.

7. The urban road detection method based on a multimodal large model according to claim 6, characterized in that, The iterative update includes: The current prompt word is used as the baseline prompt word. Together with the predefined search space description, which includes adjustable parameter types, value ranges, and step sizes, as well as the accuracy results and correct and incorrect pattern summaries from the previous evaluation, the input information is submitted to the large model, which acts as the prompt word optimizer. The large model is instructed to make only a few minimal and controllable changes to the baseline prompt word while keeping the paragraph headings and output format unchanged. Based on the instructions, the large model outputs several candidate prompt words that include specific parameter changes, the full text of the prompt word, and the reasons for the changes.

8. The urban road detection method based on a multimodal large model according to claim 7, characterized in that, The iterative update also includes: terminating the update when, after three consecutive rounds of optimization iterations, the detection accuracy of the new prompt words has not improved compared to the previous round, or when the total number of optimization iterations has reached the preset upper limit.

9. A method for urban road detection based on a multimodal large model according to claim 7 or 8, characterized in that, The step of repeatedly detecting the monitoring point at preset time points includes: When a target event is detected at a camera device at the first time point, the image of the device is acquired again at a preset second time point and the identification is confirmed. For road construction events, the second time point is several minutes after the first identification. For road damage events, two confirmation identifications are performed within one hour after the first identification. Events that have been repeatedly verified will be recorded in the database. Only events that occur at least twice in a single day will be recorded in the database. Events that occur intermittently three times or less will be filtered out and considered unreliable events.

10. The urban road detection method based on a multimodal large model according to claim 7, characterized in that, The search space includes several adjustable parameters, including a comprehensive confidence threshold, a minimum crack width threshold, a length compensation rule switch, and an exclusion rule switch for specific false alarm types.