Method for change detection, sample generation method, model training method, device, medium and product
By generating and verifying semantically consistent changed image data, the problem of data scarcity and high cost in existing technologies is solved, enabling the multimodal change detection model to achieve efficient and accurate change detection in more scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHE JIANG SHEN XIANG ZHI NENG KE JI YOU XIAN GONG SI
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
Existing change detection methods rely on real change detection data, resulting in data scarcity and high costs, and an inability to cover more scenarios. Furthermore, traditional methods have shortcomings in object category identification, efficiency, accuracy, and cost, making it difficult to achieve real-time and accurate change detection.
By acquiring raw image data, semantically consistent change image data is generated using diffusion models and multimodal large models, and its rationality is verified. Training image data is then generated to train the multimodal change detection model. Combined with text prompts and image features, synthetic data is generated and verified, reducing data acquisition costs and improving the model's generalization ability.
It significantly reduces the cost of data acquisition, and the generated training data can more effectively cover more scenarios. The trained model can more accurately and effectively detect changes, achieving real-time perception and precise recognition.
Figure CN122244600A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of change detection technology in image analysis, and in particular to a change detection method, sample generation method, model training method, device, medium and product.
Background Technology
[0002] With the rapid iteration of artificial intelligence technology, the demand for object perception in IoT scenarios is increasing day by day. Among existing technologies, multimodal object change detection methods are widely used in smart cities, industrial inspection, ecological protection and retail management and other fields.
[0003] However, existing change detection methods rely on real change detection data, which is relatively scarce. Multimodal change detection models trained only on real change detection data cannot cover more scenarios, so they are unable to perform change detection in some scenarios.
[0004] Therefore, how to perform change detection more effectively is a technical problem that needs to be solved in this field.
Summary of the Invention
[0005] The main objective of this application is to provide a method for change detection, a sample generation method, a model training method, an apparatus, a medium, and a product, so as to perform change detection more effectively.
[0006] In a first aspect, embodiments of this application provide a data processing method for change detection, comprising: acquiring original image data, the original image data including a first target object; generating change image data based on the first target object, the change image data including a second target object; performing a reasonableness check on the background scene of the second target object and the original image data, and generating training image data based on the reasonableness check result, the training image data being used to train a change detection model.
[0007] In a second aspect, embodiments of this application provide a method for training a change detection model, comprising: acquiring original image data; generating training image data according to the change detection data processing method described in the first aspect; and training a multimodal change detection model based on the original image data and the training image data.
[0008] In a third aspect, embodiments of this application provide a method for change detection, comprising: acquiring image data to be detected; and performing change detection on a target in the image data to be detected according to a multimodal change detection model trained by the change detection model training method described in the second aspect.
[0009] In a fourth aspect, embodiments of this application provide an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to cause the electronic device to perform the methods described in the first, second, or third aspects above.
[0010] In a fifth aspect, embodiments of this application provide a computer-readable storage medium storing computer-executable instructions, which, when executed by a processor, implement the methods described in the first, second, or third aspects above.
[0011] In a sixth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the methods described in the first, second, or third aspects above.
[0012] The change detection method, sample generation method, model training method, device, medium, and product provided in this application, after acquiring the original image data, generate change image data including a second target object based on the first target object included in the original image data, and perform a reasonableness verification on the generated second target object and the background scene of the original image data. Based on the reasonableness verification result, training image data is generated for subsequent training of the change detection model, thereby significantly reducing the data acquisition cost, enabling the data for subsequent model training to more effectively cover more scenes, and enabling the trained model to perform change detection more accurately and effectively.
Attached Figure Description
[0013] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application. It is obvious that the drawings described below are some embodiments of the invention, and that those skilled in the art can obtain other drawings based on these drawings without any inventive effort.
[0014] Figure 1 is a schematic diagram of an embodiment of the change detection method provided in this application;
[0015] Figure 2 is a schematic diagram of the image data provided in this application;
[0016] Figure 3 is a flowchart illustrating another embodiment of the data processing method for change detection provided in this application;
[0017] Figure 4 is a flowchart illustrating yet another embodiment of the data processing method for change detection provided in this application;
[0018] Figure 5 is a flowchart illustrating an embodiment of the change detection model training method provided in this application;
[0019] Figure 6 is a schematic diagram of the model structure of the multimodal change detection model provided in this application;
[0020] Figure 7 is a schematic diagram of the processing logic of the image information feature extractor in the multimodal change detection model provided in this application;
[0021] Figure 8 is a schematic diagram of the structure of an embodiment of the electronic device provided in this application.
[0022] The accompanying drawings illustrate specific embodiments of this application, which will be described in more detail below. These drawings and descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concept of this application to those skilled in the art through reference to particular embodiments.
Detailed Implementation
[0023] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.
[0024] In this application, the term "and/or" describes the relationship between related objects and covers three cases. For example, A and/or B can mean: A exists alone, A and B exist at the same time, or B exists alone.
[0025] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points are provided for users to choose to authorize or refuse.
[0026] To clearly describe the technical solutions of the embodiments of this application, the terms involved in this application are first defined as follows:
[0027] Multimodal technology refers to a technical system that integrates two or more information representation formats (such as visual and textual). Its core value lies in overcoming the limitations of single-modal information. Through collaborative processing of cross-modal data, feature fusion, and semantic association, it achieves more comprehensive and accurate information understanding and task execution, and has stronger environmental adaptability and scenario robustness compared to single-modal technology.
[0028] Change detection encompasses the process of capturing changes in the form, attributes, and state of various physical objects in the natural world, artificial environments, and across time or space using technological means. The core objective is to achieve accurate identification, location, quantification, and causal tracing of changes, providing data support for decision-making. The change detection algorithm provided in this application can serve as a generalized solution for scenarios requiring specified change outputs, such as changes in shelf inventory, monitoring of valuables, and instances of people leaving behind items.
[0029] Modal fusion: One of the core components of multimodal technology, it refers to integrating modal data (such as visual features of images and semantic features of text) from different sources and in different forms through specific algorithms to eliminate heterogeneity between modalities, generate fusion features with unified semantics, and improve the performance of subsequent detection tasks.
[0030] With the rapid iteration of artificial intelligence technology, the demand for object perception in IoT scenarios is increasing daily. At the same time, breakthroughs in deep learning algorithms in areas such as image segmentation, time-series data analysis, and feature extraction have solved the complex change recognition problems that traditional manual monitoring and simple threshold judgment cannot handle. This provides core technical support for the accurate detection of changes in everything. From changes in land space to monitoring of living scenarios, from the ecological evolution of the natural environment to the real-time dynamics of urban operations, the technology has the foundation to achieve full-scenario, high-precision change perception.
[0031] This application relates to a multimodal method for detecting changes in all things, which has wide applications in smart cities, industrial inspection, ecological protection, and retail management. In smart city scenarios, such as monitoring illegal construction, road construction, and flood disaster early warning, traditional manual inspections suffer from limited coverage and delayed response. This method, however, can achieve dynamic risk early warning and proactive prevention by capturing environmental changes in real time (such as the addition of illegal buildings, abnormal road construction progress, or the expansion of flood-inundated areas).
[0032] For example, in the field of industrial inspection, such as power facility inspection and pipeline equipment monitoring, traditional methods that rely on manual visual inspection or simple image comparison are difficult to detect missing parts or hidden faults. This method can significantly improve the efficiency and safety of equipment operation and maintenance by detecting semantic-level changes (such as identifying "missing transformer insulators" or "pipeline crack propagation").
[0033] For example, in ecological protection scenarios such as forest fire monitoring and wetland ecological evolution tracking, traditional remote sensing image analysis relies on periodic data comparisons, which is difficult to cope with the needs of high-frequency changes. However, this method combines multimodal features (such as image and text prompts) to accurately identify changes in vegetation cover or illegal logging events.
[0034] For example, in the retail industry, such as detecting changes in shelf merchandise and monitoring the loss prevention of valuable items, traditional systems can only detect differences in the images and cannot distinguish the categories of items (such as "beverages have been replaced with alcohol" or "products have been removed"). This method, through text-guided semantic understanding capabilities, can output structured change information (such as "the location of a specific product has changed" or "abnormal items have been left behind"), thereby achieving more efficient intelligent management.
[0035] One existing change detection method is the traditional dual-temporal image difference method, which extracts the changed region by directly calculating the pixel-level differences (such as absolute difference or ratio) between the two temporal images. This type of method is extremely sensitive to environmental interference such as lighting and weather, and can only identify differences at the pixel level, unable to distinguish semantic changes (such as "a car becomes a truck").
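The limitation described above can be illustrated with a minimal sketch of absolute-difference change detection. The function name `pixel_change_mask`, the threshold of 30, and the toy 4x4 images are assumptions chosen for illustration; they do not come from this application.

```python
import numpy as np

def pixel_change_mask(img_t1, img_t2, threshold=30):
    """Binary change mask from the per-pixel absolute difference of two grayscale images."""
    # Cast to a signed type first so the subtraction cannot wrap around in uint8.
    a = np.asarray(img_t1, dtype=np.int32)
    b = np.asarray(img_t2, dtype=np.int32)
    if a.shape != b.shape:
        raise ValueError("images must have the same shape")
    return np.abs(a - b) > threshold

# Toy scene: a uniform 4x4 image at the first time point.
before = np.full((4, 4), 100, dtype=np.uint8)

# Case 1: a real change inside a 2x2 region.
after = before.copy()
after[1:3, 1:3] = 200

# Case 2: no object changes, only a global lighting shift of +40.
brighter = before + 40

real_mask = pixel_change_mask(before, after)
lighting_mask = pixel_change_mask(before, brighter)

print(int(real_mask.sum()), int(lighting_mask.sum()))
```

In this toy case the lighting shift alone flags every pixel as changed, while the mask carries no information about what the changed object is.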
[0036] Another change detection method provided in the prior art is an end-to-end change detection model based on deep learning, which specifically uses architectures such as Siamese networks, U-Net, or two-stream Transformers to directly regress change masks. However, such models heavily rely on large-scale real-world labeled data (precisely aligned bi-temporal images and pixel-level change labels), which are extremely costly to acquire, and the models have weak generalization ability, usually only applicable to specific scenarios (such as remote sensing or medical imaging), making it difficult to achieve cross-domain general change detection.
[0037] Another existing change detection method is a zero-shot inference method based on large models. Specifically, it uses multimodal large models such as CLIP and LLaVA to extract image features and determines whether a "change" has occurred through similarity calculation. Such methods can only output binary judgments (changed / unchanged) or rough descriptions, and cannot locate specific changed targets, identify categories, or regress bounding boxes. Moreover, the inference cost is high, making it difficult to deploy in real-time detection scenarios.
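A minimal sketch of the similarity-based decision described above follows. It assumes the image features have already been extracted by some multimodal model; the helper names, the two-dimensional toy features, and the threshold of 0.9 are hypothetical and are not taken from CLIP, LLaVA, or this application.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors."""
    u = np.asarray(u, dtype=np.float64)
    v = np.asarray(v, dtype=np.float64)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        raise ValueError("feature vectors must be non-zero")
    return float(u @ v / denom)

def has_changed(feat_before, feat_after, threshold=0.9):
    """Binary decision: 'changed' when the features are less similar than the threshold."""
    return cosine_similarity(feat_before, feat_after) < threshold

# Toy features standing in for embeddings of two images of the same scene.
same_scene = has_changed([1.0, 0.0], [1.0, 0.0])
different_scene = has_changed([1.0, 0.0], [0.0, 1.0])
print(same_scene, different_scene)
```

The output is a single boolean per image pair, which is why such an approach cannot by itself localize the changed target, name its category, or regress a bounding box.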
[0038] It can be seen that existing change detection methods have at least the following problems:
[0039] Insufficient object recognition capability: Traditional change detection only detects changes without distinguishing object categories, resulting in excessively high alarm frequency and limiting application scenarios.
[0040] Inefficient: Relying on manual inspections, offline verification, or periodic data comparisons makes real-time monitoring impossible, and there is a serious time lag for large-scale, highly dynamic change scenarios.
[0041] Insufficient accuracy: Traditional methods based on threshold judgment and simple image comparison are easily affected by environmental interference (such as lighting, weather, and noise) and struggle to distinguish valid changes from invalid interference, resulting in a high false detection rate and a high false negative rate.
[0042] High cost: Large-scale manual monitoring requires a significant investment of manpower, material resources, and financial resources, especially in remote areas, dangerous scenarios, or where high-frequency monitoring is required, resulting in significant cost pressures.
[0043] Passive decision-making: Due to the lack of real-time and accurate change data, management decisions rely heavily on experience-based judgment or post-event retrospective analysis, making it difficult to achieve forward-looking early warning and proactive prevention and control.
[0044] How to more effectively detect changes is a technical problem that needs to be solved in this field. Based on this, the main objective of the embodiments of this application is to provide a method for change detection, a sample generation method, a model training method, an apparatus, a medium, and a product, so as to realize real-time perception, accurate identification, intelligent early warning, and full-process traceability of various changes in the physical world, providing core technical support for scientific decision-making, efficient management, and risk prevention and control in various industries, and promoting the transformation from "passive response" to "proactive prevention and control" and from "experience-based decision-making" to "data-driven".
[0045] The following detailed description of some embodiments of this application is provided in conjunction with the accompanying drawings. Where there is no conflict between the embodiments, the following embodiments and features can be combined with each other. Furthermore, the timing of the steps in the following method embodiments is merely an example and not a strict limitation.
[0046] Figure 1 is a schematic diagram of an embodiment of the change detection method provided in this application. The method shown in Figure 1 can be applied to any electronic device with relevant data processing capabilities, such as a computer, server, or workstation. Specifically, the method shown in Figure 1 includes:
[0047] S101: Acquire raw image data, wherein the raw image data includes the first target object;
[0048] Specifically, in this embodiment of the application, the acquired, unprocessed image data is referred to as raw image data. The raw image data may include a large amount of image data from publicly available detection datasets. The collected public datasets do not need to be tailored to the change detection task; they only need to be detection datasets, for example Pascal VOC, COCO, or Objects365.
[0049] In one embodiment, the original image data may cover three categories: "surveillance scenes," "drone scenes," and "natural scenes." The acquired images can also be used as the unprocessed images and incorporated into the overall data label. In a specific implementation, the original image data also includes the location information and category information of the first target object.
[0050] For example, Figure 2 is a schematic diagram of the image data provided in this application. The original image data shown in Figure 2 uses a natural ocean scene as an example, in which the first target object is a sea turtle in the ocean.
[0051] S102: Generate changed image data based on the first target object in the original image data obtained in S101, wherein the changed image data includes the second target object.
[0052] Referring to Figure 2, the background scene of the changed image data remains consistent with that of the original image data, both being ocean scenes. However, the changed image data includes a second target object, which is a fish in the ocean. That is, the changed image data can be obtained by replacing the first target object in the original image data with the second target object.
[0053] In one embodiment, the step of generating changed image data based on the first target object in S102 specifically includes:
[0054] S1021: Obtain the text prompt corresponding to the data processing. The text prompt is a semantic description used to guide generation and includes the type of change and the characteristics of the target object. For example, the text prompt may be "replace with fish" or "disappear: turtle".
[0055] S1022: Determine the position and category information of the first target object in the original image data.
[0056] S1023: Based on text prompts, location information, and category information, replace the first target object with a second target object in the background scene to generate changed image data.
[0057] In one embodiment, in S1023, text prompts, location information, and category information can be input into the diffusion model to generate changed image data. Specifically, the diffusion model generates image content that conforms to the text prompts. For example, replacing "turtle" with "fish" in the original image data generates new changed image data.
[0058] In one embodiment, the diffusion model can be a generative model that generates images by evolving a latent space vector field, such as Stable Diffusion.
[0059] As can be seen, the process of generating changed image data provided in this embodiment selects an original image from a publicly available detection dataset (such as a "turtle" image from the COCO dataset), extracts the target object region based on its bounding box annotation information, inputs it into a diffusion model, and combines it with text prompts (such as "replace with fish") to generate a new object image. The diffusion model gradually denoises in the latent space, adjusts the generated result according to the text prompts, and finally outputs a new object image consistent with the text semantics. This further enhances the diversity and coverage of the synthetic data by leveraging the universality of the publicly available detection dataset. The publicly available detection dataset contains a large number of common and long-tail object categories (such as "turtle" and "truck"), and combines them with text prompts to generate diverse change samples (such as "replace" and "disappear"), ensuring that the synthetic dataset covers complex change types in open-world scenes.
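The generation flow of S1021 to S1023 can be sketched as follows. This is only an illustration under stated assumptions: `build_prompt`, `generate_changed_image`, and `stub_generator` are hypothetical names, the bounding box is assumed to be (x1, y1, x2, y2) in pixel coordinates, and the stub stands in for a real text-guided diffusion model, which is not implemented here.

```python
import numpy as np

def build_prompt(change_type, src_category, dst_category=None):
    """Hypothetical prompt builder modeled on "replace with fish" and "disappear: turtle"."""
    if change_type == "replace":
        if dst_category is None:
            raise ValueError("replace needs a destination category")
        return f"replace with {dst_category}"
    if change_type == "disappear":
        return f"disappear: {src_category}"
    raise ValueError(f"unsupported change type: {change_type}")

def generate_changed_image(image, bbox, prompt, generate_region):
    """Regenerate only the bounding-box region; the background scene is left untouched."""
    x1, y1, x2, y2 = bbox
    region = image[y1:y2, x1:x2]
    new_region = np.asarray(generate_region(region, prompt))
    if new_region.shape != region.shape:
        raise ValueError("generated region must match the bounding box size")
    changed = image.copy()  # never modify the original image data in place
    changed[y1:y2, x1:x2] = new_region
    return changed

def stub_generator(region, prompt):
    # Stand-in for the diffusion model (assumption): returns a flat white patch.
    return np.full_like(region, 255)

original = np.zeros((8, 8, 3), dtype=np.uint8)  # toy "ocean" background
turtle_bbox = (2, 2, 6, 6)                      # assumed annotation of the first target object
prompt = build_prompt("replace", "turtle", "fish")
changed = generate_changed_image(original, turtle_bbox, prompt, stub_generator)
print(prompt, changed.shape)
```

Restricting the edit to the annotated region is one simple way to keep the background scene consistent between the original and changed images.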
[0060] S103: Perform a reasonableness check on the background scene of the second target object and the original image data, and generate training image data based on the reasonableness check results. The training image data is used to train the change detection model.
[0061] In one embodiment, the step of performing a reasonableness check on the background scene of the second target object and the original image data in S103, and generating training image data based on the reasonableness check result, specifically includes:
[0062] S1031: Input the changed image data and text prompts into the multimodal large model, so that the multimodal large model can verify the semantic consistency between the second target object and the background scene.
[0063] Specifically, the multimodal large model combines image and text information to verify the rationality of the generated changed image data, i.e., its semantic consistency: whether the newly generated second target object logically matches the background scene of the original image data. For example, it can determine whether the new second target object fits the background scene, such as "whether fish are suitable to appear in the corresponding ocean scene in the original image data".
[0064] In one embodiment, the multimodal large model is specifically a pre-trained model with cross-modal understanding capabilities, such as DeepseekVL, QwenVL, etc.
[0065] S1032: If the verification result is reasonable, use the changed image data as training image data.
[0066] S1033: If the verification result is unreasonable, then regenerate the changed image data based on the first target object until the verification result is reasonable or the maximum number of regenerations is reached.
[0067] When the generated image data and text prompts (such as "replace with fish") are input into the multimodal large model, the model analyzes the semantic relationship between the new object and the background scene (such as "is the fish suitable to appear in the original background"). If the verification result is unreasonable (such as "fish appear in the desert"), the diffusion model parameters (such as text prompt weights and random seeds) are adjusted, and the image data is regenerated until it passes the verification. This further filters invalid samples (such as "fish are generated in the desert") through the semantic verification mechanism of the multimodal large model, ensuring the scene rationality of the synthesized data. This mechanism significantly reduces semantic errors in generated samples by analyzing the logical relationship between the new object and the background across modalities, improving dataset quality and model training efficiency.
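The verify-and-regenerate logic of S1031 to S1033 can be sketched as a bounded retry loop. The names `generate_with_verification`, `generate`, and `verify` are hypothetical; the two callables stand in for the diffusion model and the multimodal large model, and changing only the random seed between attempts is an assumption (the text also mentions adjusting text prompt weights).

```python
import random

def generate_with_verification(generate, verify, max_regenerations=3, seed=0):
    """Return (sample, regenerations_used), or (None, max_regenerations) if nothing passes."""
    rng = random.Random(seed)
    # One initial generation plus at most `max_regenerations` retries.
    for attempt in range(max_regenerations + 1):
        candidate = generate(rng.randrange(2**32))  # a fresh seed for every attempt
        if verify(candidate):
            return candidate, attempt
    return None, max_regenerations

def make_stub_generator(outputs):
    # Stand-in for the diffusion model (assumption): replays a fixed list of results.
    it = iter(outputs)
    return lambda seed: next(it)

def scene_is_reasonable(sample):
    # Stand-in for the multimodal large model (assumption): fish belong in the ocean.
    return sample.endswith("ocean")

attempts = ["fish in desert", "fish in desert", "fish in ocean"]
accepted, retries = generate_with_verification(
    make_stub_generator(attempts), scene_is_reasonable, max_regenerations=3
)
rejected, _ = generate_with_verification(
    make_stub_generator(attempts), scene_is_reasonable, max_regenerations=1
)
print(accepted, retries, rejected)
```

Returning `None` when the retry budget is exhausted lets the caller simply drop that sample instead of adding an unreasonable image to the training set.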
[0068] As can be seen, through the steps S101-S103 described above, after acquiring the original image data, it is possible to generate changed image data including a second target object based on the first target object included in the original image data. The reasonableness of the generated second target object and the background scene of the original image data is then verified. Based on the reasonableness verification result, training image data is generated for subsequent training of the change detection model. This solves the problems of scarce real change detection data and high manual annotation costs through the text-guided diffusion model and the verification mechanism of the multimodal large model. Specifically, the text prompts provide a clear generation direction for the diffusion model (e.g., "replace with fish"), ensuring that the generated image is semantically consistent with the target. The multimodal verification module filters unreasonable generated samples (e.g., "fish generated in the desert") by combining image content and text description, improving data quality. Therefore, the change detection data processing method provided in this embodiment can significantly reduce data acquisition costs, and the data used for subsequent model training can more effectively cover more scenes, while avoiding manual annotation errors, providing highly generalizable and high-precision training data support for change detection of everything in open-world scenarios.
[0069] In one embodiment, labels corresponding to the original image data and the training image data can also be generated based on the verification results obtained in S103.
[0070] The label specifically includes at least one of the following: change type, used to describe the category of object change, such as replacement, disappearance or appearance; category, including the category of the first target object and the category of the second target object; bounding box coordinates, used to indicate the position information of the target object in the image, specifically including the bounding box coordinates of the first target object in the original image data and the bounding box coordinates of the second target object in the training image data.
[0071] For example, the label specifically includes the type of change, category, and bounding box annotation information. For instance, one label might be: Replacement: Turtle-Fish, with bounding box coordinates (x1, y1, x2, y2). Thus, the label is generated based on a combination of the change type (e.g., "Replacement"), object category (e.g., "Turtle-Fish"), and bounding box coordinates (e.g., (x1, y1, x2, y2) of the original and generated images), ensuring complete and aligned label information. This label is directly used for subsequent model training, providing accurate supervision signals.
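One possible in-memory representation of such a label is sketched below. The class name `ChangeLabel`, its field names, and the example coordinates are assumptions for illustration; only the three kinds of content (change type, categories, bounding boxes) and the "Replacement: Turtle-Fish" form come from the text above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

BBox = Tuple[int, int, int, int]  # (x1, y1, x2, y2)

@dataclass(frozen=True)
class ChangeLabel:
    change_type: str                # e.g. "Replacement", "Disappearance", "Appearance"
    source_category: Optional[str]  # category of the first target object, if any
    target_category: Optional[str]  # category of the second target object, if any
    source_bbox: Optional[BBox]     # box in the original image data
    target_bbox: Optional[BBox]     # box in the training image data

    def to_text(self) -> str:
        """Render the label in the 'Replacement: Turtle-Fish' style."""
        cats = "-".join(c for c in (self.source_category, self.target_category) if c)
        return f"{self.change_type}: {cats}"

label = ChangeLabel(
    change_type="Replacement",
    source_category="Turtle",
    target_category="Fish",
    source_bbox=(40, 60, 180, 200),  # illustrative coordinates only
    target_bbox=(40, 60, 180, 200),
)
print(label.to_text(), label.target_bbox)
```

Optional fields let the same structure describe disappearance (no second target object) and appearance (no first target object).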
[0072] As can be seen, this embodiment, based on generating training image data from the original image data, further enhances the supervised accuracy of model training by diversifying the content of the labels (such as change type, category, and bounding box). The complete description of the labels enables the model to learn both semantic and spatial information about changes simultaneously, significantly enhancing the detection network's adaptability to complex changing scenes.
[0073] Figure 3 is a flowchart illustrating another embodiment of the data processing method for change detection provided in this application and shows a specific implementation of that method. The data processing method shown in Figure 3 can include three main parts: preparing raw image data, automated sample generation, and outputting data labels. In Figure 3, these three parts are marked with dashed boxes.
[0074] Specifically, the raw image data preparation part shown in Figure 3 corresponds to the step of acquiring raw image data. The raw image data includes a large number of publicly available detection datasets. These datasets do not need to be tailored to a specific change detection task; they only need to be detection datasets (e.g., Pascal VOC, COCO, Objects365, etc.) that contain location information and category information. The example shown in Figure 3 covers three categories of image data: "surveillance scenes," "drone scenes," and "natural scenes." Furthermore, the acquired images can be used as the original images and incorporated into the overall data label.
[0075] The automated sample generation stage shown in Figure 3 corresponds to the steps of generating changed image data based on the first target object, performing the reasonableness check, and generating training image data. Its main idea is to use the location information and category information of objects in the collected datasets to guide the modification of the region corresponding to the first target object in the raw image data, thereby replacing the first target object with the second target object.
[0076] Referring to the original image data in the example shown in Figure 2, the first target object and its corresponding bounding box are marked with dashed lines. The object selection module in Figure 3 extracts the turtle region from the original image based on the bounding box annotation. The extracted turtle region is input into the object transformation module and the quality control module, which update iteratively to replace the original object with a new target object (such as a fish). The generated object is placed in the target region given by the object's location information in the original image data, forming the changed image data. Finally, the image matching module fuses the newly generated second target object with the scene in the original image data to achieve positional consistency, completing the image restoration. The resulting changed image, adapted to the original scene, is referred to as the training image data.
[0077] Specifically, the object transformation module takes a text prompt and an image as input. It accepts the image input together with the text prompt "Text Prompt 1" and sequentially performs preprocessing and semantic feature extraction, diffusion vector field inference, and image decoding and postprocessing, to quickly output a target image that meets the semantic requirements and suits the scene. The text prompt can include the category of the first target object and the category of the changed second target object.
[0078] In one embodiment, the task corresponding to the object change provided in this application can be denoted as: Different tasks have different text prompts, and the final recorded change matching tags are also different. Table 1 below lists the text prompts and their corresponding change matching tags for different tasks.
[0079] Table 1
[0080]
[0081] In Table 1, the blanks inside the brackets [] are filled in from the output of the quality control module in Figure 3. The category of the second target object is initially selected at random from the set of all object categories in the collected dataset, where N is the total number of categories in the collected data. At this point the input of the object transformation module is complete: it consists of the original image data before the change and the text prompt P.
[0082] In one embodiment, during the data generation stage, text prompt one and text prompt two can be modified to generate samples that contain only lighting and shadow changes. The form of the change matching tags can also be modified, for example by rewording the "appear" tag (with its "Feature 1: []" field) as a natural-language phrase such as "objects newly appearing in the image", to improve the interactivity of the model.
[0083] More specifically, the processing flow of the object transformation module can include the following three parts:
[0084] Input preprocessing and semantic feature extraction: This stage provides the object transformation module with standardized input and precise semantic constraints. Its core objective is to convert the raw image data and the text instruction into a feature format that the model can process, ensuring that the input data distribution is consistent with the training phase and providing reliable semantic guidance for the subsequent vector field inference. The output of this stage is the original guiding feature obtained after input preprocessing and semantic feature extraction.
[0085] Diffusion-based vector field inference: The main idea of this stage is to use a neural network to fit the path along which a sample moves from noise to the target image. The neural network can be an existing mainstream text-to-image generation model (such as Stable Diffusion or one of its variants), because such a model has been pre-trained on large-scale image-text pairs and responds effectively to semantic instructions. This application therefore does not need to redesign the generation backbone and instead focuses on overall process coordination and the quality control mechanism.
[0086] Image decoding and post-processing: This stage uses the image decoder D_image to decode the denoised latent code into a high-resolution image. The input to the decoder is the latent feature vector obtained by applying t steps of diffusion vector field inference to the features extracted in the input preprocessing and semantic feature extraction stage. After reconstruction by the image decoder, this vector becomes the changed image described above.
[0087] As can be seen, the object transformation module described above is essentially a controlled vector field evolution carried out in latent space under the guidance of language. Each inference step predicts the vector field and moves the latent from random noise toward a target point that conforms to the text instruction while preserving the image context; the decoder finally maps this point to a convincing pixel-level change. Such changes are well suited to generating data for change detection tasks, because the saved change matching tags can serve as the text labels for change detection and the images before and after the change can serve as the image training material.
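As a non-limiting illustration of the "noise to target point" evolution described above, the following toy sketch integrates a vector field with fixed Euler steps. The `velocity` function is a hypothetical stand-in for the learned network (here it simply points straight at a known target latent), and the names `infer`, `steps`, and the list-of-floats latent are assumptions made for the example, not part of this application.

```python
def velocity(x, t, target):
    """Hypothetical stand-in for the learned vector field.

    It points from the current latent x toward the target so that a
    straight-line path started at t = 0 arrives at t = 1.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def infer(noise, target, steps=10):
    """Move a latent from noise toward the target with Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays below 1, so velocity() never divides by zero
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


latent = infer([0.3, -1.2, 0.8], [1.0, 2.0, -0.5])
```

In an actual system the velocity would be predicted by the text-conditioned generation model and the final latent would be passed to the image decoder; the sketch only shows the stepping scheme.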
[0088] In a specific implementation, the quality control module shown in Figure 3 can be a multimodal large model. This application does not limit the specific multimodal large model used; examples include DeepseekVL and QianwenVL. The quality control module uses the cross-modal understanding and generation capabilities of the large model to verify, according to the task description, the content of the generated change detection samples in place of human verification, and to generate text content that meets the requirements of this application. The module focuses on the essential features of the images, generating content that not only matches the image content but also adds details that are often overlooked during manual object annotation. This improves the generalization performance of the subsequent multimodal change detection model and ultimately allows massive amounts of data to be generated without manual annotation. The image input of this module is built from the images before and after the change, and the text input is text prompt two.
[0089] Figure 4 is a schematic flowchart of another embodiment of the data processing method for change detection provided in this application. It shows the iterative process of Figure 3, in which the object transformation module and the quality control module carry out the rationality verification to finally generate training image data that meets the requirements.
[0090] Specifically, as shown in Figure 4, the image before the change and the image after the change are stitched into a single image, which serves as the image input of the quality control module. For example, as in Figure 2, the bounding boxes of the first and second target objects can be drawn as markers before the images are stitched. The other input of the quality control module is text prompt two, whose content is concatenated from text that depends on the task and on the object categories. As an example, text prompt two for one task may include:
[0091] "The image shows two objects with the same background, marked with a red box. Answer the following 5 questions based on the image and output the results in the standard format."
[0092] Question 1: Determine if the second object is... If both are true, then answer 1 is True; otherwise, it is False.
[0093] Question 2: Determine whether the second object is clearly visible and free of missing or incomplete parts and image quality issues. If there are no such issues, answer 2 is True; otherwise, it is False.
[0094] Question 3: Determine whether the appearance of the second object in the background image corresponding to the first object meets the background information conditions. If it does, then answer 3 is True; otherwise, output a reasonable category as answer 3.
[0095] Question 4: Describe the characteristics of the first object using a noun array, denoted as Answer 4, such as ['aquatic life', 'yellow'].
[0096] Question 5: Describe the characteristics of the second object using a noun array, denoted as Answer 5, such as ['aquatic life', 'blue'].
[0097] Standard output format: {'Question 1':'Answer 1', 'Question 2':'Answer 2', 'Question 3':'Answer 3', 'Feature 1':'Answer 4', 'Feature 2':'Answer 5'}
[0098] Note: The standard output format is JSON, where the keys are for illustrative purposes only and do not require any actual content.
[0099] At this point, the input of the quality control module is ready: the stitched image and text prompt two. The quality control module then performs inference on this input to produce output in the required format. This output is fed into the decision logic, which determines whether the currently generated sample is a positive or a negative sample.
[0100] The sample decision logic is as follows. First, the output text is parsed. If 'Question 1', 'Question 2', and 'Question 3' are all True, the parsed 'Feature 1' and 'Feature 2' are recorded and filled into the text label according to the change matching tag format in Table 1, and the sample is kept as a positive sample for subsequent training and testing. If any of these outputs has another value, the features are not filled in and only the change category is recorded; the sample is kept as a negative sample for subsequent training and testing, a parameter of the object transformation module (for example, the random seed) is modified, and the iteration is restarted. In particular, if the 'Question 3' output of the quality control module is not True (that is, the new object does not fit the background), the random seed or the target category of the object transformation module is adjusted and the sample is regenerated. This process repeats until a positive sample that satisfies all verification conditions is obtained or the maximum number of retries (for example, 5) is reached, which avoids an infinite loop.
[0101] The above steps yield the labels used for subsequent training of the change detection model. The image matching module then pastes the generated image region back into the background of the original image to obtain the training image data. Its specific form, together with the data labels, is shown on the right of Figure 3. If the iteration produced multiple negative samples, this module pastes each of them back in turn, producing multiple images that are marked as negative samples for subsequent training.
[0102] As can be seen, through the above steps and together with the acquired raw image data, all the data required for subsequent model training has been prepared, specifically the raw image data, the training image data, and the tags.
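As a non-limiting illustration, the decision logic and retry loop described above can be sketched as follows. The sketch assumes the quality control module returns valid JSON using the keys of the standard output format; `transform` and `verify` are hypothetical callables standing in for the object transformation module and the quality control module, and advancing the random seed by one per attempt is only one possible way of modifying the parameters.

```python
import json

MAX_RETRIES = 5  # example limit mentioned in the text
CHECKS = ("Question 1", "Question 2", "Question 3")


def is_true(value):
    """Accept a JSON boolean true or the string 'True'."""
    return value is True or value == "True"


def judge(qc_output):
    """Parse one quality control reply and build the sample record."""
    answers = json.loads(qc_output)
    if all(is_true(answers.get(key)) for key in CHECKS):
        return {
            "positive": True,
            "feature_1": answers["Feature 1"],
            "feature_2": answers["Feature 2"],
        }
    # Negative sample: the features are not filled in.
    return {"positive": False, "feature_1": None, "feature_2": None}


def generate_until_positive(transform, verify, seed=0, max_retries=MAX_RETRIES):
    """Regenerate with a new seed until a positive sample is obtained.

    Returns (positive_record_or_None, list_of_negative_records).
    """
    negatives = []
    for attempt in range(max_retries):
        image = transform(seed + attempt)  # hypothetical generation call
        record = judge(verify(image))      # hypothetical verification call
        record["image"] = image
        if record["positive"]:
            return record, negatives
        negatives.append(record)
    return None, negatives
```

The negative records collected along the way correspond to the negative samples that are kept for subsequent training and testing.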
[0103] Specifically, the data processing method for change detection provided in this application utilizes publicly available detection datasets as source material. By combining text prompts with an image input diffusion model, it can controllably generate semantically consistent post-change images and automatically generate precise instance-level change labels (category, bounding box, change type), constructing a large-scale training set. This effectively overcomes the data scarcity problem during model training for change detection. In particular, by using publicly available general detection datasets (such as COCO and Objects365) as raw material and combining them with a text-guided diffusion generation model to automatically construct large-scale, precisely labeled "before-and-after image pairs," it significantly reduces data acquisition costs and time without the need to collect real-world change scene data. Furthermore, when forming training image data, it can generate samples with arbitrary category combinations and arbitrary change types (appearance, disappearance, replacement) as needed, covering long-tail scenes. The generated labels (change type, category, bounding box) are automatically generated and naturally aligned by the generation process, without annotation noise.
[0104] Figure 5 is a flowchart of an embodiment of the change detection model training method provided in this application. The method shown in Figure 5 can be used to train a multimodal change detection model for change detection, and it can be applied to any electronic device with the relevant data processing capability, such as a computer, server, or workstation. Specifically, the change detection model training method shown in Figure 5 includes:
[0105] S201: Acquire raw image data. Raw image data is the image data as acquired, before any processing. It may include a large amount of known, working-condition detection image data. The collected public dataset does not need to be tailored to the change detection task; any detection dataset suffices, for example Pascal VOC, COCO, or Objects365.
[0106] S202: Generate training image data according to the data processing method for change detection shown in Figure 1. For details on how the training image data is generated, refer to the description of Figure 1 above, which is not repeated here.
[0107] In one embodiment, a data preparation process is included before training the multimodal change detection model, specifically for the data required to train the multimodal change detection model.
[0108] Specifically, the data processing method for change detection yields the raw image data, the training image data, and the tags. Each tag describes a single object and is denoted as a Sample. Inevitably, the same image will contain multiple object changes with the same category or the same features. All changes with the same description therefore need to be merged so that the network does not see ambiguous samples during training. Specifically, the following merging method can be used:
[0109] Collect the text prompts of all Samples of an image and gather the sets of categories, of feature one, and of feature two, which can be expressed by the following formulas:
[0110]
[0111]
[0112]
[0113]
[0114] Where N is the number of Samples belonging to the same image in the above formulas. The elements of all the sets obtained above are then combined to form the set of all possible change descriptions. This set is sampled, and all change instances of the image are traversed: if an instance satisfies the sampled combination, its bounding box is recorded under that combination, forming a positive sample. If no object description in the current image satisfies the sampled combination (that is, no real change instance matches the category-feature pair), the combination is marked as a negative sample and its bounding box set is set to the empty set.
[0115] The above example describes how the final training samples are generated for one type of task. The data preparation method for the other type of task is the same, except that the fields that do not apply to it are ignored. Furthermore, in actual data preparation, the original image before the change can be treated as a state without the target object, and the newly appearing object in the changed image can be used as the detection target, which extends the data to a third type of task: object appearance. For this task, the finally recorded bounding box remains unchanged, and the corresponding change matching tag keeps the "Feature 1: []" field. At this point, the training data for the entire network is ready.
[0116] S203: Train a multimodal change detection model based on the original image data obtained in S201 and the training image data generated in S202. The resulting multimodal change detection model can be used to detect changes in arbitrary objects.
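As a non-limiting illustration, the merging and negative-sample construction described above can be sketched as follows. For brevity the sketch assumes each Sample carries a single category, a single feature list, and one bounding box; the dictionary keys `category`, `feature`, and `box` are hypothetical names chosen for the example, and the sketch enumerates every combination rather than sampling a subset.

```python
from itertools import product


def merge_samples(samples):
    """Group the boxes of one image by identical description."""
    merged = {}
    for s in samples:
        key = (s["category"], tuple(s["feature"]))
        merged.setdefault(key, []).append(s["box"])
    return merged


def build_training_samples(samples):
    """Enumerate category-feature combinations for one image.

    A combination matched by at least one real change instance becomes a
    positive sample with all matching boxes; an unmatched combination
    becomes a negative sample with an empty box set.
    """
    merged = merge_samples(samples)
    categories = sorted({s["category"] for s in samples})
    features = sorted({tuple(s["feature"]) for s in samples})
    result = []
    for category, feature in product(categories, features):
        boxes = merged.get((category, feature), [])
        result.append({
            "category": category,
            "feature": list(feature),
            "boxes": boxes,
            "positive": bool(boxes),
        })
    return result
```

For an image with two blue fish and one red crab, this yields two positive samples (the fish sample holding both boxes) and two negative samples with empty box sets.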
[0117] In one embodiment, the multimodal change detection model provided in this application includes: a dual-stream encoder, a feature fusion module, and a decoding module, wherein the dual-stream encoder is used to extract image features and text semantic features respectively; the feature fusion module is used to integrate image features and text semantic features into image-text fusion features; and the decoding module is used to parse the fusion features and output the position and confidence level of the target object in the image.
[0118] Specifically, the multimodal change detection model provided in this application first converts the input image and text into feature forms that can be processed by the deep learning model through the corresponding image encoder and text encoder, respectively. Then, the image feature extractor and text feature extractor extract the visual features of the image and the semantic features of the text, respectively. Next, these two types of features are fed into the feature fusion module to integrate them into a unified image-text fusion feature. Subsequently, the fused feature enters the decoding module for parsing and finally outputs the target location and confidence level.
[0119] For example, Figure 6 is a schematic diagram of the model structure of the multimodal change detection model provided in this application. The multimodal change detection model is explained below with reference to Figure 6.
[0120] As shown in Figure 6, the feature processing includes:
[0121] Text feature processing: a text encoder and a text feature extractor of the kind commonly used in language processing tasks convert the text prompt into a text feature vector, which can be expressed by the following formulas:
[0122]
[0123]
[0124] Here, the text encoder maps each discrete symbol (such as a word or subword) to a low-dimensional continuous vector (a "word embedding"); these vectors are learned during pre-training or model training and form a distributed representation of the symbols. The text feature extractor consists of multiple stacked Transformer encoders. Through the self-attention mechanism, the embedding vector of each symbol attends to the other symbols in the sequence, producing features that capture contextual relationships. The resulting text feature vector incorporates global context and can accurately represent the semantic information of the text prompt.
[0125] Image feature processing: a visual encoder encodes the original image data and the training image data separately, transforming them into a feature space to obtain their feature representations, which can be expressed by the following formulas:
[0126]
[0127]
[0128] Specifically, Figure 7 is a schematic diagram of the processing logic of the image feature extractor in the multimodal change detection model provided in this application. As shown in Figure 7, the visual encoder first divides the image into N image patches of equal size with a fixed side length K:
[0129]
[0130] The pixel values in each image patch are then flattened into multiple 1-dimensional vectors:
[0131]
[0132] Each flattened vector is then encoded by a linear layer to obtain:
[0133]
[0134] Where D is the output dimension of the linear layer. Since the feature extraction structure of the subsequent feature extractor is mainly a Transformer, and the Transformer structure itself is "order-independent," spatial location information of the image patch needs to be injected through positional encoding. Therefore:
[0135]
[0136] Here, the positional encoding consists of N specially encoded vectors, each representing a unique position. Furthermore, when the visual encoder processes the raw image data and the training image data, it uses a linear layer with the same weights and the same positional encoding for both, so that the two encoded feature representations share the same feature space and features at the same index correspond to the same location in the image, which makes it easier for the model to compare change information.
[0137] As shown in Figure 7, the image feature extractor provided in this embodiment is a two-stream network built from Transformer structures. To accurately model the interaction between the features of the image before the change and the features of the image after the change, each stream independently passes through N Transformer layers before the output features of the two streams are fused and interact. The entire feature extraction structure consists of L such modules with feature fusion capability.
[0138] Furthermore, the core of the multimodal change detection model is the computation performed by the feature fusion module. It takes as input the features extracted from the image feature vector by its own Transformer encoder and the features extracted from the changed-image feature vector by its own Transformer encoder. Through feature interaction and aggregation, it strengthens the modeling of change-related information between the two sets of features and outputs joint feature vectors that represent both the basic image features and the change relationship between them. The feature representations of the two images can be expressed by the formula:
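As a non-limiting illustration, the patch splitting, flattening, linear encoding, and positional encoding described above can be sketched in plain Python for a single-channel image. The function names and the explicit `weight`, `bias`, and `pos` arguments are assumptions made for the example; in practice these would be learned tensors in a deep learning framework.

```python
def patchify(image, k):
    """Split an H x W image (list of rows) into flattened k x k patches."""
    h, w = len(image), len(image[0])
    if h % k or w % k:
        raise ValueError("image height and width must be divisible by k")
    patches = []
    for top in range(0, h, k):
        for left in range(0, w, k):
            patches.append(
                [image[top + r][left + c] for r in range(k) for c in range(k)]
            )
    return patches


def linear(vec, weight, bias):
    """Apply a linear layer; weight has shape D x len(vec)."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]


def embed(image, k, weight, bias, pos):
    """Encode every patch and add its positional vector (pos is N x D)."""
    tokens = [linear(p, weight, bias) for p in patchify(image, k)]
    return [[t + e for t, e in zip(tok, pe)] for tok, pe in zip(tokens, pos)]


# The same weight, bias and pos are reused for the image before and after
# the change, so token i in both outputs refers to the same image location.
image = [[4 * r + c for c in range(4)] for r in range(4)]
weight = [[1, 0, 0, 0], [0, 0, 0, 1]]  # D = 2, illustrative values only
bias = [0, 0]
pos = [[10 * i, 0] for i in range(4)]
tokens = embed(image, 2, weight, bias, pos)
```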
[0139]
[0140] More specifically, the fusion is a cross-attention computation in which the features of one image serve as the query and the features of the other image serve as the keys and values, which can be expressed by the following formula:
[0141]
[0142] Where Q is the query vector of the current image, and K and V are constraint vectors supplied by the other image that take part in the subsequent computation. The normalization parameter is set to a constant chosen empirically, and the computation can be expressed by the following formulas:
[0143]
[0144]
[0145]
[0146] Here the projection weights are parameter weights of the inference network. The features produced by Fusion serve as the interaction vectors for image features in subsequent modules.
[0147] In the final stage of the feature extraction network, the output features of the two streams are combined to form the final image feature output. Finally, the image features and the text features are input to the decoding module, which decodes them into bounding boxes.
[0148] In one embodiment, the decoding module consists of multiple cascaded standard Transformer decoders, whose input queries are a fixed number of learnable parameters, where N is the number of object queries and D is the feature dimension of each query, which only needs to match the configured decoder dimension. A key step here is to fuse the queries with the text features to obtain query features for the specified change event, so that the specified event is predicted. The fusion still uses the FREN module and can be expressed by the following formula:
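As a non-limiting illustration, the cross-attention between the two image streams can be sketched as scaled dot-product attention in plain Python. The learned projections that produce Q, K, and V are omitted, and using the square root of the feature dimension as the default normalization constant is a common convention assumed here, not a value stated in this application.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(q, k, v, scale=None):
    """q: Nq x d from one image; k: Nk x d and v: Nk x dv from the other.

    Returns an Nq x dv list in which each query row is a weighted sum of
    the value rows supplied by the other image.
    """
    if scale is None:
        scale = math.sqrt(len(q[0]))  # assumed default normalization constant
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / scale for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out


# Before-change tokens query the after-change tokens, and vice versa.
before = [[1.0, 0.0], [0.0, 1.0]]
after = [[0.0, 0.0], [0.0, 0.0]]
fused = cross_attention(before, after, [[1.0, 2.0], [3.0, 4.0]])
```

Because both keys in the example are identical, every query attends equally to the two value rows, so each fused row is their average.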
[0149]
[0150] Here, FFN is a linear layer whose purpose is to align the feature dimensions of the two inputs. Finally, the fused queries and the image features pass through the decoding module to produce the final output features.
[0151] In one embodiment, to obtain more accurate results, the prediction of the change category is decoupled from the prediction of the bounding box, giving:
[0152]
[0153]
[0154]
[0155]
[0156] Here, the first output is the category that the model predicts for each query (one of two categories: change and background), and the second output is the coordinate information that the model predicts (the top-left and bottom-right corners of each target box).
[0157] MLP is the multilayer perceptron architecture commonly used in deep learning, consisting of linear layers and activation layers. The category output has final dimension [x×2], where the dimension of size 2 indicates whether the specified change occurs for the query or whether the box is background. The box output has final dimension [x×4], representing the coordinate position of each query box. The overall network structure is now complete.
[0158] In one embodiment, this application trains the entire network with the following loss function:
[0159]
[0160] Here, the terms are: the category labels of the real targets; the set of outputs of all predicted queries; the optimal predicted query matched to the i-th real target by bipartite graph matching, which is solved with the Hungarian algorithm; an indicator function that equals 1 when the target is not background and 0 otherwise (background targets are excluded from the bounding box loss); and M, the number of real targets in the image. The category term uses the standard cross-entropy loss, calculated as follows:
[0161]
[0162] Here, the probability term is the probability that the j-th predicted query assigns to the true category, and the small constant is a smoothing term that prevents the input of the logarithm from being 0.
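As a non-limiting illustration, the query-to-target matching and the smoothed cross-entropy term can be sketched as follows. The matching here is an exhaustive search over assignments, used only because it is short and adequate for a handful of targets; it stands in for the Hungarian algorithm named in the text and finds the same minimum-cost assignment. The value of `EPS`, the use of class 0 for background, and averaging over queries are assumptions made for the example.

```python
import math
from itertools import permutations

EPS = 1e-8  # assumed smoothing constant that keeps log() away from 0


def match(cost):
    """cost[i][j] is the cost of assigning real target i to query j.

    Returns, for each target, the index of its matched query. Requires
    the number of targets M to be at most the number of queries N.
    """
    m, n = len(cost), len(cost[0])
    best = min(
        permutations(range(n), m),
        key=lambda p: sum(cost[i][p[i]] for i in range(m)),
    )
    return list(best)


def class_loss(probs, labels, assignment, background=0):
    """Cross-entropy over all queries; unmatched queries are background.

    probs[j][c] is the probability query j assigns to class c, labels[i]
    is the class of real target i, assignment[i] is its matched query.
    """
    target = [background] * len(probs)
    for i, j in enumerate(assignment):
        target[j] = labels[i]
    return -sum(math.log(p[t] + EPS) for p, t in zip(probs, target)) / len(probs)


cost = [[0.9, 0.1, 0.5],
        [0.2, 0.8, 0.7]]
assignment = match(cost)
```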
[0163] The bounding box regression loss uses an L1+GIoU weighted form, and its specific calculation method can be expressed by the following formula:
[0164]
[0165] Here, the first term is the L1 norm (Manhattan distance), which measures the absolute error of the bounding box coordinates; GIoU is the generalized intersection over union function; and the two weighting coefficients each take a default value.
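As a non-limiting illustration, the weighted L1 plus GIoU box regression term can be sketched as follows for boxes given as (x1, y1, x2, y2) corner coordinates. The weights `w_l1` and `w_giou` default to 1.0 purely as neutral placeholders; they are not the default coefficients of this application.

```python
def giou(a, b):
    """Generalized IoU of two boxes (x1, y1, x2, y2) with positive area."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def box_loss(pred, target, w_l1=1.0, w_giou=1.0):
    """Weighted sum of the L1 coordinate error and (1 - GIoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return w_l1 * l1 + w_giou * (1.0 - giou(pred, target))
```

Unlike plain IoU, GIoU stays informative for boxes that do not overlap: it becomes more negative as the boxes move apart, so the loss still provides a signal in that case.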
[0166] In one embodiment, at the algorithm level, an auxiliary loss can be introduced into the training loss to add more constraints, such as adding a constraint to determine whether the image before and after the change only contains changes in light and shadow.
[0167] Through the above process, a multimodal object change detection model can be trained and then used to detect changes in arbitrary objects. For example, large-scale training and testing data are constructed with the sample generation method shown in Figure 1, and the change detection model is built according to the model structure shown in Figure 5. During model training, the built model is trained with the training loss function described above to obtain the final trained model, which can then be used in actual production scenarios.
[0168] As can be seen, the change detection model training method provided in this embodiment achieves end-to-end instance change detection by fusing dual-temporal image features and textual prompt semantics, and adopts a modal fusion query-target matching mechanism. It also outputs structured results: the location, category, and change semantics of the changed object. Thus, through the technical route of "synthetic data generation and multimodal change detection network", it breaks through the limitations of traditional change detection methods in terms of data, generalization, and semantic understanding, and has significant technical advancement, economic feasibility, and social application value.
[0169] In particular, this application designs an end-to-end multimodal detection network during the training of the multimodal object change detection model, integrates visual and textual prompts, and employs Hungarian matching, L1, and GIoU loss for instance-level supervision. This not only determines whether a change has occurred, but also outputs the category, location, and semantics of the changed object (e.g., 'car-fire truck'). It maintains high accuracy even in complex contexts such as surveillance and drone inspections, significantly outperforming traditional difference methods and pure large model solutions.
[0170] Furthermore, the multimodal change detection model provided in this application utilizes a pre-trained diffusion model (one-time input) in the generation stage and a lightweight Transformer network (distillable and compressible) in the detection stage. Therefore, the training stage can fully utilize synthetic data for pre-training, reducing reliance on expensive real data. In the subsequent inference stage, the model can be deployed on edge devices (such as cameras and drone-borne computers), enabling it to empower key areas such as smart cities (anomaly detection), power line inspection (equipment missing alarm), and ecological protection (illegal building monitoring), thereby enhancing the level of automated perception and other social value.
[0171] This application also provides a method for change detection, including: acquiring image data to be detected; and detecting changes of targets in the image data to be detected using a multimodal change detection model trained by the change detection model training method shown in Figure 5, thereby applying the multimodal change detection model. In one embodiment, the change detection results include structured results such as change type, target object category, and bounding box coordinates.
[0172] In the foregoing embodiments of this application, the data processing method, model training method, and change detection method for change detection provided by the embodiments of this application have been described. To implement the functions of the methods provided by the embodiments of this application, the electronic device, as the execution subject, can implement the above functions through hardware structures and / or software modules. Whether a particular function is executed through hardware structures, software modules, or a combination of hardware structures and software modules depends on the specific application and design constraints of the technical solution.
[0173] For example, this application provides a data processing apparatus for change detection, comprising: an acquisition module for acquiring original image data, the original image data including a first target object; a change module for generating changed image data based on the first target object, the changed image data including a second target object; and a verification module for performing a reasonableness verification on the background scene of the second target object and the original image data, and generating training image data based on the reasonableness verification result, the training image data being used to train a change detection model.
[0174] For example, this application provides a model training device for change detection, including: an acquisition module for acquiring original image data; a generation module for generating training image data according to the data processing method for change detection; and a training module for training a multimodal change detection model based on the original image data and the training image data.
[0175] For example, this application provides a change detection apparatus, comprising: an acquisition module for acquiring image data to be detected; and a detection module for detecting changes in targets in the image data to be detected based on a multimodal change detection model trained by a change detection model training method.
[0176] It should be understood that the division of the various modules in the above device is merely a logical functional division. In actual implementation, they can be fully or partially integrated into a single physical entity, or they can be physically separated. Furthermore, these modules can be implemented entirely in software via processing element calls; they can be fully implemented in hardware; or some modules can be implemented by processing element calls to software, while others are implemented in hardware. For example, a module can be a separately established processing element, or it can be integrated into a chip within the above device. Alternatively, it can be stored as program code in the memory of the above device, and its functions can be called and executed by a processing element of the device. The implementation of other modules is similar. Moreover, these modules can be fully or partially integrated together, or they can be implemented independently. The processing element here can be an integrated circuit with signal processing capabilities. In the implementation process, each step of the above method or each of the above modules can be completed through integrated logic circuits in the hardware of the processor element or through software instructions.
[0177] For example, these modules can be one or more integrated circuits configured to implement the above methods, such as one or more application-specific integrated circuits (ASICs), one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs). As another example, when a module is implemented by a processing element invoking program code, the processing element can be a general-purpose processor, such as a central processing unit (CPU) or another processor capable of calling program code. Furthermore, these modules can be integrated together and implemented as a system-on-a-chip (SOC).
[0178] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the flow or function according to the embodiments of this application is generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) means. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid-state disk (SSD)).
[0179] For example, Figure 8 is a schematic diagram of the structure of an embodiment of the electronic device provided in this application. The electronic device 2000 shown in Figure 8 can be used to execute the data processing method for change detection, the model training method for change detection, and the method for change detection provided in any embodiment of this application.
[0180] In one embodiment, as shown in Figure 8, the electronic device 2000 includes at least one processor 2001 and a memory 2002 communicatively connected to the at least one processor 2001. The memory 2002 stores instructions executable by the at least one processor 2001. When the instructions are executed by the processor 2001, the processor 2001 implements the data processing method for change detection, the model training method for change detection, and the method for change detection provided in any of the foregoing embodiments of this application.
[0181] In the above embodiments, it should be understood that the processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in this invention can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules within the processor.
[0182] The memory may include random access memory (RAM) and may also include non-volatile memory (NVM), such as at least one disk storage device.
[0183] The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, or an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of illustration, the buses shown in the accompanying drawings are not limited to a single bus or a single type of bus.
[0184] This application also provides a computer-readable storage medium storing computer-executable instructions. When a processor executes the computer-executable instructions, it implements the data processing method for change detection, the model training method for change detection, and the method for change detection provided in any of the foregoing embodiments.
[0185] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the data processing method for change detection, the model training method for change detection, and the method for change detection provided in any of the foregoing embodiments.
[0186] This application also provides a chip for executing instructions, which is used to execute the data processing method for change detection, the model training method for change detection, and the method for change detection as provided in any of the foregoing embodiments of this application.
[0187] It should be understood that the aforementioned processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), etc. A general-purpose processor can be a microprocessor or any conventional processor. The steps of the method disclosed in the application can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor. The memory may include high-speed RAM (Random Access Memory), and may also include non-volatile memory (NVM), such as at least one disk storage device, and may also be a USB flash drive, external hard drive, read-only memory, disk, or optical disc, etc.
[0188] The aforementioned storage media can be implemented from any type of volatile or non-volatile storage device or a combination thereof, such as Static Random-Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk. The storage media can be any available medium accessible to general-purpose or special-purpose computers.
[0189] An exemplary storage medium is coupled to a processor, enabling the processor to read information from and write information to the storage medium. Alternatively, the storage medium can be an integral part of the processor. Both the processor and the storage medium can reside in an Application Specific Integrated Circuit (ASIC). Alternatively, the processor and storage medium can exist as discrete components in an electronic device or host device.
[0190] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Unless otherwise specified, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element.
[0191] The sequence numbers of the embodiments in this application are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0192] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal device (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods of the various embodiments of this application.
[0193] The collection, storage, use, processing, transmission, provision, and disclosure of user data and other information involved in the technical solution of this application all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
[0194] The above are merely preferred embodiments of this application and do not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
[0195] Finally, it should be noted that other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This invention is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or customary techniques in the art not disclosed herein, and is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims
1. A data processing method for change detection, characterized by comprising: acquiring original image data, wherein the original image data includes a first target object; generating changed image data based on the first target object, wherein the changed image data includes a second target object; and performing a reasonableness check on the second target object and the background scene of the original image data, and generating training image data based on the reasonableness check result, wherein the training image data is used to train a change detection model.
2. The method of claim 1, wherein generating the changed image data based on the first target object comprises: obtaining a text prompt corresponding to the data processing; determining position information and category information of the first target object in the original image data; and, based on the text prompt, the position information, and the category information, replacing the first target object with the second target object in the background scene to generate the changed image data.
3. The method of claim 2, wherein performing the reasonableness check on the second target object and the background scene of the original image data, and generating the training image data based on the reasonableness check result, comprises: inputting the changed image data and the text prompt into a multimodal large model, which verifies the semantic consistency between the second target object and the background scene; if the verification result is reasonable, using the changed image data as the training image data; and if the verification result is unreasonable, regenerating the changed image data based on the first target object until the verification result is reasonable or the maximum number of regenerations is reached.
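As a non-limiting illustration, the generate, verify, and regenerate flow recited in claim 3 could be sketched as follows. The names `ChangedImage`, `generate_training_image`, `generate`, `is_reasonable`, and the default retry budget are hypothetical and are not taken from this application; the two callables merely stand in for the generation step and the multimodal large model check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChangedImage:
    """Hypothetical container for one generated candidate."""
    second_object: str  # category of the second target object
    placement: str      # where the generator placed it in the background scene


def generate_training_image(
    generate: Callable[[int], ChangedImage],
    is_reasonable: Callable[[ChangedImage, str], bool],
    prompt: str,
    max_regenerations: int = 3,  # assumed budget; the application gives no value
) -> Optional[ChangedImage]:
    """Return the first candidate that passes the check, or None.

    `generate` stands in for the replacement step and `is_reasonable`
    for the semantic-consistency check; both are caller-supplied stubs.
    """
    # One initial generation plus up to `max_regenerations` retries.
    for attempt in range(max_regenerations + 1):
        candidate = generate(attempt)
        if is_reasonable(candidate, prompt):
            return candidate
    return None  # retry budget exhausted; no training image is produced


# Stub callables so the sketch runs without any model.
def fake_generate(attempt: int) -> ChangedImage:
    placements = ["ceiling", "floor"]
    return ChangedImage("sofa", placements[min(attempt, 1)])


def fake_check(image: ChangedImage, prompt: str) -> bool:
    return image.placement == "floor"


result = generate_training_image(
    fake_generate, fake_check, "replace the chair with a sofa"
)
```

In this sketch a candidate that never passes the check is dropped after the retry budget is used up; what the method does with such samples is not specified in the claim, so that behaviour is an assumption.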
4. The method according to any one of claims 1 to 3, characterized by further comprising: generating labels for the original image data and the training image data, the labels being used to train the change detection model; wherein each label includes at least one of the following: a change type, the change type being replacement, disappearance, or appearance; categories, including the category of the first target object and the category of the second target object; and bounding box coordinates, including the bounding box coordinates of the first target object in the original image data and the bounding box coordinates of the second target object in the training image data.
5. A model training method for change detection, characterized by comprising: obtaining original image data; generating training image data using the data processing method for change detection according to any one of claims 1-4; and training a multimodal change detection model based on the original image data and the training image data.
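As a non-limiting illustration, the label recited in claim 4 could be represented by a record such as the one below. The field names, the `(x_min, y_min, x_max, y_max)` pixel convention for bounding boxes, and the use of `None` for fields that do not apply are assumptions made for this sketch, not details given in this application.

```python
from dataclasses import dataclass, asdict
from enum import Enum
from typing import Optional, Tuple

# Assumed convention: (x_min, y_min, x_max, y_max) in pixels.
BBox = Tuple[int, int, int, int]


class ChangeType(str, Enum):
    REPLACEMENT = "replacement"
    DISAPPEARANCE = "disappearance"
    APPEARANCE = "appearance"


@dataclass
class ChangeLabel:
    change_type: ChangeType
    first_category: Optional[str]   # category of the first target object
    second_category: Optional[str]  # category of the second target object
    first_bbox: Optional[BBox]      # box in the original image data
    second_bbox: Optional[BBox]     # box in the training image data


# Example label for a replacement change (values are made up).
label = ChangeLabel(
    change_type=ChangeType.REPLACEMENT,
    first_category="chair",
    second_category="sofa",
    first_bbox=(40, 60, 120, 200),
    second_bbox=(38, 55, 150, 205),
)
record = asdict(label)  # plain dict, convenient for serialisation
```

The optional fields reflect that the claim requires only "at least one of" the listed items, so a label may omit some of them.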
6. The method of claim 5, wherein the multimodal change detection model includes: a dual-stream encoder, used to extract image features and text semantic features separately; a feature fusion module, used to fuse the image features and the text semantic features into image-text fusion features; and a decoding module, used to parse the image-text fusion features and output the position and confidence level of the target object in the image.
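As a non-limiting illustration, the data flow between the three components recited in claim 6 could be sketched as follows. Only the order of operations is taken from the claim; the class name `MultimodalChangeDetector`, the callable signatures, and the toy stub functions are hypothetical placeholders and do not represent an actual network.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
# Assumed output form: (bounding box, confidence).
Detection = Tuple[Tuple[int, int, int, int], float]


@dataclass
class MultimodalChangeDetector:
    image_encoder: Callable[[Sequence[float]], Vector]  # image stream
    text_encoder: Callable[[str], Vector]                # text stream
    fuse: Callable[[Vector, Vector], Vector]             # feature fusion module
    decode: Callable[[Vector], List[Detection]]          # decoding module

    def __call__(self, image: Sequence[float], prompt: str) -> List[Detection]:
        # The two streams are encoded separately, then fused, then decoded.
        image_features = self.image_encoder(image)
        text_features = self.text_encoder(prompt)
        fused = self.fuse(image_features, text_features)
        return self.decode(fused)


# Toy stubs so the sketch runs; real encoders would be learned models.
detector = MultimodalChangeDetector(
    image_encoder=lambda img: [sum(img) / len(img)],
    text_encoder=lambda text: [float(len(text.split()))],
    fuse=lambda img_f, txt_f: img_f + txt_f,  # concatenation as a stand-in
    decode=lambda fused: [((0, 0, 10, 10), fused[0])],
)
detections = detector([0.5, 0.5], "chair replaced by sofa")
```

Concatenation is used here only as the simplest possible stand-in for fusion; the claim does not state which fusion operation is used.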
7. A change detection method, characterized by comprising: acquiring image data to be detected; and performing change detection on a target in the image data to be detected using a multimodal change detection model trained by the model training method for change detection according to any one of claims 5-6.
8. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, which, when executed by the at least one processor, cause the electronic device to perform the method according to any one of claims 1-7.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions which, when executed by a processor, implement the method according to any one of claims 1-7.
10. A computer program product, characterized by comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-7.