Manifold matching based method and system for recovering low-illumination image in dark scene

By using a manifold matching-based image processing method, combined with a manifold matching model and a Transformer module, efficient, real-time, and high-quality video restoration under low-light conditions is achieved. This solves the problems of high cost, poor compatibility, and slow processing speed of existing night vision surveillance systems, and provides an all-weather monitoring solution.

CN122243751A · Pending · Publication Date: 2026-06-19 · WUHAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: WUHAN UNIV
Filing Date: 2026-02-26
Publication Date: 2026-06-19

Abstract

This invention discloses a method and system for restoring low-light images in dark scenes based on manifold matching. The method includes the following steps: extracting features from the low-light image to obtain original latent space features and a first feature map; adding the first feature map element-wise to the original latent space features to obtain intermediate features; obtaining semantic guidance information and adding it element-wise to the intermediate features to obtain fused features; inputting Gaussian noise and the fused features into a manifold matching module, and generating latent space encoding of the target image through a linear sampling mechanism; and outputting the restored image after decoding. This invention also discloses a corresponding restoration system. Through this invention, low-light videos can be accurately and efficiently restored with low processing latency, and it has the advantages of low cost, high accuracy, and strong real-time performance. It is a high-efficiency low-light video restoration solution suitable for all-weather monitoring scenarios.

Description

Technical Field

[0001] This invention belongs to the field of low-light image and video restoration. It relates to a model that gives a camera night vision capability in dark scenes, and specifically to a method and system for restoring low-light images in dark scenes based on manifold matching.

Background Technology

[0002] In the field of video surveillance, achieving high-definition video imaging under nighttime or low-light conditions is crucial for enhancing the all-weather application capabilities of surveillance systems. With the development of artificial intelligence technology, modern surveillance systems have evolved from traditional passive recording to proactive intelligent analysis, using deep learning algorithms to achieve functions such as abnormal behavior recognition and risk prediction. This places higher demands on the image quality of the input video. However, nighttime is a high-incidence period for criminal activity, and the problem of insufficient surveillance effectiveness has long existed. Statistics show that the effective recognition rate of many surveillance cameras in China drops significantly at night; the recognition accuracy of traditional surveillance at night is often less than 60%, with excessively long response times, easily leading to missed opportunities for optimal response.

[0003] To address the challenges of nighttime surveillance, existing technological approaches mainly fall into the following categories: Firstly, there's the low-light camera solution based on active illumination. This solution achieves night vision by using an infrared or white light illumination system, and its cost is relatively low. However, the image quality of this solution heavily relies on the stability and response efficiency of the illumination equipment. Illumination equipment has limited lifespan, energy consumption issues, and is susceptible to factors such as power supply stability and environmental interference, resulting in a high failure rate and making it difficult to guarantee stable operation around the clock. Furthermore, active illumination poses a risk of exposure in scenarios requiring covert monitoring, thus limiting its application.

[0004] Secondly, there is the thermal imaging night vision solution. This solution uses a thermal imaging sensor array to detect the 8-14 μm infrared radiation emitted by objects and convert temperature differences into a visible image. Its core component is a microbolometer array made of vanadium oxide or amorphous silicon. A readout circuit converts the resistance changes caused by temperature into electrical signals, which are amplified, filtered, and converted from analog to digital to form temperature distribution data, which is then presented through pseudo-color encoding. This solution has the following drawbacks in practical applications:

(1) High cost. Thermal imaging sensors are expensive. A single uncooled microbolometer usually costs several thousand to tens of thousands of RMB, far more than an ordinary visible light sensor, which makes large-scale adoption in civilian monitoring difficult.

[0005] (2) Low resolution. Due to manufacturing process and cost limitations, the mainstream thermal imaging resolution is usually 160×120 to 640×480 pixels, which is far lower than the resolution of millions of pixels of visible light cameras, making it difficult to meet the needs of detail recognition.

[0006] (3) Lack of detailed information. Thermal imaging can only display the temperature outline and cannot obtain detailed information such as the texture, color, and text of the target. It is not effective in scenarios that require recognition of license plates, faces, etc.

[0007] (4) Limited temperature contrast. When the target and background temperatures are close (such as when the human body and ambient temperature are close in summer), the thermal imaging system has difficulty distinguishing the target effectively, and the detection effect is significantly reduced.

[0008] (5) Poor environmental adaptability. The system is greatly affected by changes in ambient temperature and requires regular non-uniformity correction, which affects the continuity of monitoring.

[0009] (6) High difficulty in image interpretation. Pseudo-color display lacks intuitiveness, operators need to have professional interpretation skills, and ordinary users find it difficult to quickly and accurately understand the image content.

[0010] (7) Limited penetration capability. It has insufficient penetration capability through transparent or reflective materials such as glass and water surfaces, and cannot effectively detect targets behind glass.

[0011] Thirdly, there are video restoration solutions based on image post-processing. Most existing video restoration systems use neural networks or diffusion models to enhance surveillance videos. However, diffusion models must sample step by step along a Markov chain, which creates an inherent trade-off between sampling speed and image quality. This makes it difficult to achieve efficient processing and high-quality restoration at the same time, so such systems fail to meet the needs of real-time monitoring scenarios.

[0012] Furthermore, existing camera products generally integrate algorithms and hardware tightly. To protect the exclusivity of their technologies, manufacturers often encapsulate and encrypt key processing modules. The resulting products are highly specialized, and their algorithms are difficult to decouple and reuse across manufacturers and models. Improving night vision therefore typically requires replacing the dedicated camera hardware, with costs proportional to the number of cameras that need modification, plus substantial ongoing maintenance costs.

[0013] In summary, there is an urgent need for a system that is compatible with mainstream surveillance systems without replacing existing camera hardware and that achieves efficient, real-time, and high-quality video restoration under low-light conditions, in order to solve the problems of high cost, slow processing speed, and poor algorithm reusability in existing technologies.

Summary of the Invention

[0014] This invention addresses the technical problems of high cost, poor compatibility, slow processing speed, and insufficient generalization ability in existing night vision surveillance systems. It provides a method for restoring low-light images in dark scenes based on manifold matching. An image processing module is connected to the video output of the surveillance camera, enabling a single software program to adapt to different hardware without requiring camera replacement. Simultaneously, a manifold matching model replaces traditional neural networks or diffusion models, utilizing its linear sampling mechanism to accelerate processing. A model training and optimization module is added, using incremental training with daytime and nighttime similar scene samples to enhance generalization ability. This technology enables real-time processing and restoration of dark videos captured by cameras in the dark, achieving low-cost, high-compatibility, and high-efficiency real-time enhancement of nighttime video, realizing all-weather monitoring performance akin to a "dark eagle eye."

[0015] According to one aspect of the present invention, a method for restoring low-light images in dark scenes based on manifold matching is provided, comprising:
Acquiring a low-light input image;
Inputting the low-light input image into the encoder and the pre-trained Transformer module respectively to obtain the original latent space features and the first feature map, and adding the first feature map element-wise to the original latent space features to obtain intermediate features;
Obtaining semantic guidance information, and adding the intermediate features element-wise to the semantic guidance information to obtain the fused features;
Inputting Gaussian noise and the fused features into the manifold matching module, where the manifold matching model generates the latent space code of the target image using a straight-line sampling mechanism;
Inputting the latent space code into the decoder for decoding, and outputting the restored high-quality image.
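The two element-wise additions in the method above can be illustrated with a minimal Python sketch. The function names `add_elementwise` and `fuse_features` and the toy vectors are hypothetical; they stand in for real encoder, Transformer, and semantic module outputs, which are assumed here to share one shape.

```python
from typing import List

Vector = List[float]


def add_elementwise(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equally sized feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature shapes must match for element-wise addition")
    return [x + y for x, y in zip(a, b)]


def fuse_features(z_latent: Vector, f_transformer: Vector, s_semantic: Vector) -> Vector:
    """Intermediate = latent + first feature map; fused = intermediate + semantic guidance."""
    intermediate = add_elementwise(z_latent, f_transformer)
    return add_elementwise(intermediate, s_semantic)


# Toy values standing in for encoder, Transformer, and semantic module outputs (assumed).
fused = fuse_features([0.5, 1.0, -2.0], [0.25, 0.0, 1.0], [0.0, 0.5, 0.5])
```

Because both fusions are plain additions, all three inputs must already have the same shape; how the real modules achieve this is not specified in the text.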

[0016] As a further technical solution, the semantic guidance information comes from a replaceable semantic input module, which provides at least one of the following three input modes: natural language semantic input, specific reference image semantic input, and no additional semantic input.

[0017] As a further technical solution, the natural language semantic input extracts semantic features through a pre-trained natural language processing model; the specific reference image semantic input extracts deep features through a pre-trained convolutional neural network; in the mode with no additional semantic input, a default feature is generated; and the semantic features, the deep features, and the default feature are linearly combined to obtain a combined feature that serves as the semantic guidance information.

[0018] As a further technical solution, the following training steps are also included:
Constructing a training dataset containing paired low-light images and corresponding high-quality images;
Enhancing the low-light images, and darkening and adding noise to the high-quality images, to generate augmented training samples;
Inputting the processed low-light image and high-quality image into the encoder respectively, and, after processing by the pre-trained Transformer module, adding noise to the latent space features of the high-quality image to obtain the noisy image;
Inputting the noisy image, together with the low-light image features fused with the semantic guidance information, into the manifold matching module for training, and optimizing the model parameters with a linear combination of content diffusion loss and color consistency loss.

[0019] As a further technical solution, the encoder and decoder are pre-trained using autoregressive loss, and the gradients of the encoder and decoder are frozen when training the manifold matching module after the pre-training is completed.

[0020] According to another aspect of the present invention, a system for restoring low-light images in dark scenes based on manifold matching is provided, comprising:
An encoder, used to map an image into a latent space code;
A Transformer module, used for feature extraction from images;
A replaceable semantic input module, used to provide semantic guidance information;
A manifold matching module, which receives Gaussian noise and the features fused with semantic guidance information, and generates the latent space code of the target image through a straight-line sampling mechanism;
A decoder, used to decode the latent space code generated by the manifold matching module into a restored high-quality image.
The output feature map of the Transformer module is added element-wise to the original latent space features output by the encoder, and the result is then added element-wise to the semantic guidance information provided by the replaceable semantic input module to obtain the fused features, which are input to the manifold matching module.

[0021] As a further technical solution, the replaceable semantic input module includes a natural language input unit, a specific image input unit, and a no-input unit, as well as a combination unit for linearly combining the outputs of each unit.

[0022] As a further technical solution, the replaceable semantic input module also includes an image resizing unit and a reference image feature extraction Transformer. The image resizing unit resizes a specific reference image to a uniform size, and the reference image feature extraction Transformer extracts features from the resized reference image. The extracted deep features are used as the output of the specific image input unit.

[0023] As a further technical solution, the natural language input unit is connected to a pre-trained natural language processing model, and the specific image input unit is connected to a pre-trained convolutional neural network.

[0024] As a further technical solution, a dataset augmentation module is also included, which is used to enhance low-light images in the training data and darken and add noise to high-quality images.

[0025] Compared with the prior art, the beneficial effects of the present invention are as follows:

1. This invention connects a complete image processing program to the video output of the surveillance camera and achieves night vision enhancement through software algorithms. It does not require replacing the existing camera hardware, and a single program can adapt to surveillance equipment from different manufacturers and of different models, which greatly reduces hardware procurement and deployment costs. It decouples the algorithm from the hardware, significantly improves the system's compatibility and reusability, and solves the technical problems of high cost and difficult large-scale adoption in existing night vision solutions.

[0026] 2. This invention uses a manifold matching module to replace the traditional neural network or diffusion model. By utilizing the straight-line sampling mechanism of manifold matching, it avoids the speed bottleneck caused by the layer-by-layer Markov chain sampling of the diffusion model. While ensuring the image restoration quality, it significantly improves the processing speed, which can meet the low latency requirements of real-time monitoring scenarios and solves the technical problem that existing video restoration systems cannot balance speed and quality.

[0027] 3. This invention provides three semantic guidance modes—natural language, specific reference image, and no input—and any linear combination thereof through a replaceable semantic input module. Users can flexibly choose the semantic input method according to actual needs, realizing functions such as text-controlled image restoration effects, style transfer, and target detail enhancement. This significantly improves the applicability of the system and the user experience, and solves the technical problem that existing night vision technologies cannot flexibly control the restoration effect.

[0028] 4. The training method of this invention adopts a dataset augmentation strategy to enhance low-light images and darken and add noise to high-quality images, effectively expanding the diversity of training samples and improving the model's generalization ability in rare scenes and extreme lighting conditions. It adopts a staged pre-training strategy, first pre-training the encoder-decoder and Transformer modules and freezing the gradients, and then training the manifold matching module. This not only ensures the feature extraction capabilities of each module, but also avoids the model instability caused by joint training, thus optimizing the overall training effect.

[0029] 5. The system of this invention adopts a modular design. The encoder, Transformer module, replaceable semantic input module, manifold matching module, and decoder each have clear division of labor and clear connections. The feature map output by the Transformer module is added element-wise to the original latent space features output by the encoder. This preserves the low-level details of the original image and integrates the high-level semantic features extracted by the Transformer. Then, it is added element-wise with the semantic guidance information to accurately inject semantic control conditions into the generation process. The replaceable semantic input module includes a natural language input unit, a specific image input unit, and a no-input unit. The outputs of each unit are linearly combined through a combination unit to achieve flexible configuration of semantic information. The image size adjustment module ensures that the input image size is uniform. The Transformer module, encoder, and decoder all use pre-trained modules and freeze gradients when training the manifold matching module, effectively preventing catastrophic forgetting of pre-trained features.

[0030] 6. This invention, through its manifold matching architecture, diverse semantic control mechanisms, and modular system design, achieves real-time nighttime video enhancement with low cost, high compatibility, high efficiency, and strong generalization capability without replacing existing camera hardware. It provides a practical technical solution for scenarios such as all-weather urban security governance, intelligent traffic management, and emergency response, and has significant practical value and broad application prospects.

Attached Figure Description

[0031] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the accompanying drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0032] Figure 1 is a flowchart illustrating the inference process of the low-light image restoration method for dark scenes based on manifold matching in this embodiment of the invention.

[0033] Figure 2 is a flowchart illustrating the training process of the low-light image restoration method for dark scenes based on manifold matching in this embodiment of the invention.

[0034] Figure 3 is a schematic diagram of the replaceable semantic input module in an embodiment of the present invention.

Detailed Implementation

[0035] The terms “comprising” and “having”, and any variations thereof, in the specification, claims, and accompanying drawings of this invention are intended to cover a non-exclusive inclusion, such as a process, method, system, product, or apparatus that includes a series of steps or units, not necessarily limited to those explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0036] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention. In addition, the technical features of the various embodiments or individual embodiments provided by the present invention can be arbitrarily combined to form new technical solutions. Such combinations are not bound by the order of steps and / or structural composition patterns, but must be based on the ability of those skilled in the art to implement them. When the combination of technical solutions is contradictory or cannot be implemented, it should be considered that such a combination of technical solutions does not exist and is not within the scope of protection claimed by the present invention.

[0037] Referring to Figure 1, this embodiment provides a method for restoring low-light images in dark scenes based on manifold matching, including the following steps:

Step S1: Acquire a low-light input image. This image can be a video frame captured by a surveillance camera at night or under low-light conditions.

[0038] Step S2: Input the low-light input image into the pre-trained encoder and the pre-trained Transformer module, respectively. The encoder maps the image to the original latent space features, and the Transformer module extracts the high-level semantic features of the image and outputs the first feature map.

[0039] Step S3: Add the first feature map to the original latent space features element by element to obtain intermediate features that integrate low-level details and high-level semantics.

[0040] Step S4: Obtain semantic guidance information. This semantic guidance information comes from the replaceable semantic input module and supports at least one of the following three input modes: natural language semantic input, specific reference image semantic input, and no additional semantic input. The features corresponding to the three modes are linearly combined to obtain the semantic guidance information.

[0041] Step S5: Add the intermediate features and semantic guidance information element by element to obtain the fused features.

[0042] Step S6: Input the Gaussian noise and the fused features together into the manifold matching module. Under the guidance of the semantic information, the manifold matching module uses a straight-line sampling mechanism to progressively transform the Gaussian noise into the latent space code of the target image.

[0043] Step S7: Input the generated latent space code into the decoder for decoding, and output the restored high-quality image.
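Step S6 can be pictured with a minimal Python sketch of straight-line sampling. It assumes the manifold matching model behaves like a learned velocity field that is integrated with a few fixed Euler steps; the text does not specify the integrator. `toy_velocity` is a hypothetical stand-in for the trained network, and using the fused feature itself as the target latent is a simplification made only so the example is checkable.

```python
import random
from typing import Callable, List

Vector = List[float]
VelocityFn = Callable[[Vector, float, Vector], Vector]


def sample_straight_line(velocity: VelocityFn, noise: Vector, cond: Vector,
                         steps: int = 4) -> Vector:
    """Euler-integrate dx/dt = velocity(x, t, cond) from t=0 (noise) to t=1 (latent code)."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity(x, t, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity(x: Vector, t: float, cond: Vector) -> Vector:
    """Hypothetical stand-in for a trained network: always points straight at `cond`."""
    return [(ci - xi) / (1.0 - t) for xi, ci in zip(x, cond)]


rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]   # Gaussian noise input (step S6)
fused = [0.75, 1.5, -0.5]                          # toy fused features (step S5)
latent_code = sample_straight_line(toy_velocity, noise, fused, steps=4)
```

With a straight path the number of steps can stay small, which is the speed argument the description makes against step-by-step Markov chain sampling.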

[0044] Referring to Figure 2, this embodiment also provides a training method for the manifold matching module and related components used in the above restoration method, specifically including the following steps:

Step T1: Construct the training dataset. This dataset contains paired low-light images and their corresponding high-quality images, where the high-quality images are the sharp images expected as output.

[0045] Step T2: Dataset Augmentation. Augment low-light images (e.g., contrast stretching, denoising) to improve their recognizability; darken and add noise to high-quality images to generate new low-light image samples, thereby expanding the diversity of training data, improving the model's generalization ability, and increasing data utilization.

[0046] Step T3: Input the enhanced low-light image and the high-quality image into the image resizing module respectively, adjust them to a uniform size, and save the original size information for subsequent restoration.

[0047] Step T4: Input the resized low-light image and the high-quality image into the encoder respectively to obtain their original latent space features z_low and z_high. Simultaneously, input the resized low-light image and the high-quality image into the pre-trained Transformer module to obtain their first feature maps f_low and f_high. The first feature map f_low of the low-light image is added element-wise to its original latent space features z_low to obtain the intermediate features of the low-light image, m_low = z_low + f_low. The first feature map f_high of the high-quality image is added element-wise to its original latent space features z_high to obtain the intermediate features of the high-quality image, m_high = z_high + f_high. m_high is saved as the target data for training the manifold matching module.

[0048] Step T5: Add noise to the original latent space features z_high of the high-quality image to obtain the noisy image z_noisy, which serves as the input noise for the manifold matching module.

[0049] Step T6: Add the intermediate features m_low of the low-light image element-wise to the semantic guidance information s provided by the replaceable semantic input module to obtain the fused feature c, which serves as the semantic condition of the manifold matching module.

[0050] Step T7: Input the noisy image z_noisy together with the fused feature c into the manifold matching module. Guided by the semantic condition c, the manifold matching model gradually restores the latent space code z_pred of the high-quality image from the noisy image z_noisy.

[0051] Step T8: Send the generated latent space code z_pred to the decoder for decoding, then restore the result to its original size with the image size restoration module to obtain the predicted restored image.

[0052] Step T9: Using a linear combination of content diffusion loss and color consistency loss as the optimization objective, calculate the loss between the predicted image and the real high-quality image, and backpropagate to optimize the parameters of the manifold matching module.
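The noising in step T5 and the regression target behind steps T7 to T9 can be sketched as follows. This is an illustrative guess rather than the patented procedure: it assumes the noisy latent lies on the straight line between a Gaussian sample and z_high at a time t, and that the model regresses the constant velocity along that line. The text only states that noise is added to z_high. The names `make_training_pair` and `mse` are hypothetical.

```python
import random
from typing import List, Tuple

Vector = List[float]


def make_training_pair(z_high: Vector, t: float,
                       rng: random.Random) -> Tuple[Vector, Vector]:
    """Return (z_noisy, velocity_target) on the straight line from noise to z_high."""
    noise = [rng.gauss(0.0, 1.0) for _ in z_high]
    # Assumed interpolation: t=0 is pure noise, t=1 is the clean latent.
    z_noisy = [(1.0 - t) * n + t * z for n, z in zip(noise, z_high)]
    velocity_target = [z - n for n, z in zip(noise, z_high)]
    return z_noisy, velocity_target


def mse(pred: Vector, target: Vector) -> float:
    """Mean squared error between a predicted and a target vector."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
z_high = [1.0, -1.0, 0.5]                      # toy latent of the high-quality image
z_noisy, v_target = make_training_pair(z_high, t=0.25, rng=rng)

# Following the target velocity for the remaining time (1 - t) lands on z_high.
recovered = [x + (1.0 - 0.25) * v for x, v in zip(z_noisy, v_target)]
```

In a real training loop the network's predicted velocity, conditioned on the fused feature c, would replace `v_target` in the first argument of `mse`.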

[0053] The loss function used during training combines two terms. The first is the content diffusion loss [formula omitted], which is the training objective of manifold matching. In it, [symbol omitted] is the predicted target output by manifold matching, t is the current step number of manifold matching, T is the total number of manifold matching steps, q(x0|x) is the distribution of the predicted target given the low-light image, and E denotes the expected value of the loss over that distribution. The second is the color consistency loss [formula omitted], computed from the color histogram features of the predicted restored image and of the target image in the dataset, where c is the index of the current bin in the color histogram and C is a small constant used to prevent the denominator from being zero. The two losses are linearly combined and used to train the manifold matching module and its related components. The overall loss function is [formula omitted], where the linear combination coefficients of the two losses can be freely chosen.
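Since the formulas themselves are not available in the text, the following Python sketch only illustrates the general shape of such an objective. The chi-square style histogram distance, the bin count, and the weights `w_content` and `w_color` are all assumptions chosen to match the description (a per-bin term with a small constant in the denominator, and a freely weighted sum of two losses); they are not the patent's formulas.

```python
from typing import List, Sequence


def color_histogram(pixels: Sequence[int], bins: int = 4,
                    max_value: int = 256) -> List[float]:
    """Normalized histogram of one color channel with values in [0, max_value)."""
    counts = [0] * bins
    for p in pixels:
        counts[p * bins // max_value] += 1
    total = float(len(pixels))
    return [c / total for c in counts]


def color_consistency_loss(h_pred: Sequence[float], h_target: Sequence[float],
                           eps: float = 1e-6) -> float:
    """Assumed chi-square style distance; eps keeps the denominator non-zero."""
    return sum((p - q) ** 2 / (p + q + eps) for p, q in zip(h_pred, h_target))


def total_loss(content_loss: float, color_loss: float,
               w_content: float = 1.0, w_color: float = 0.1) -> float:
    """Linear combination of the two losses with freely chosen coefficients."""
    return w_content * content_loss + w_color * color_loss


h_a = color_histogram([0, 10, 200, 255])
h_b = color_histogram([0, 10, 200, 255])      # identical color distribution
h_c = color_histogram([255, 255, 255, 255])   # very different distribution
```

Identical histograms give a color loss of zero, and the loss grows as the color distributions of the predicted and target images diverge.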

[0054] The encoder and decoder modules described above are pre-trained with an autoregressive loss. This loss measures the difference between the latent space features of the image after it passes through the encoder and the image after it passes through the decoder, and training aims to minimize it. Note that the encoder should be trained first so that it meets the semantic input requirements of the manifold matching module. When the decoder is then trained, the encoder gradients are frozen and the autoregressive loss is used to minimize the loss after the decoder. The loss function is [formula omitted], whose terms are the image that the manifold matching module is expected to receive as semantic input, the decoder layer and the encoder layer, and the inputs, which consist of the original low-light image and the high-quality image.

[0055] In addition, after the Transformer and encoder-decoder layers have been pre-trained, their gradients need to be frozen when training the manifold matching module, etc., to prevent the original training effect from being destroyed during the training process.
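The effect of freezing can be shown with a small framework-free Python sketch. The `Param` class, the parameter names, and the plain gradient-descent update are hypothetical; a real implementation would rely on the training framework's own mechanism for excluding parameters from updates.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Param:
    value: float
    grad: float = 0.0
    trainable: bool = True


def freeze(params: Dict[str, Param], prefixes: Iterable[str]) -> None:
    """Mark every parameter whose name starts with one of the prefixes as frozen."""
    prefix_tuple = tuple(prefixes)
    for name, p in params.items():
        if name.startswith(prefix_tuple):
            p.trainable = False


def sgd_step(params: Dict[str, Param], lr: float = 0.1) -> None:
    """Gradient-descent update that skips frozen parameters."""
    for p in params.values():
        if p.trainable:
            p.value -= lr * p.grad


params = {
    "encoder.w": Param(1.0, grad=0.5),
    "decoder.w": Param(2.0, grad=0.5),
    "transformer.w": Param(3.0, grad=0.5),
    "matching.w": Param(4.0, grad=0.5),
}
freeze(params, ["encoder.", "decoder.", "transformer."])
sgd_step(params, lr=0.1)  # only the manifold matching parameter moves
```

Only `matching.w` changes after the step, which mirrors the staged training described above: pre-trained components keep their weights while the manifold matching module is optimized.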

[0056] Referring to Figure 3, the specific implementation of the replaceable semantic input module is described in detail. The module supports three semantic input modes that provide diverse semantic guidance for the image restoration process:

Natural language input: This mode uses a natural language description to control the image reconstruction effect, achieving an effect close to text-to-image generation. In this mode, the input text is first language-encoded and then fed into a pre-trained natural language processing module (such as the BERT model) for semantic feature extraction, resulting in the corresponding semantic feature vector.

[0057] Specific image input: This mode is used to perform style transfer on the desired image to make certain image content more prominent and clearer. In this mode, the specific reference image is first fed into a pre-trained convolutional neural network (such as a VGG network) for deep feature extraction, resulting in a corresponding deep feature map, which serves as style or content guidance information.

[0058] No input: When the user does not provide any additional semantic information, only the processed low-light image itself is used as input. In this mode, the module generates default features (such as zero vectors or unit vectors), indicating that no external semantic guidance is introduced.

[0059] The processing results of the three input modes—semantic features, deep features, and default features—are linearly combined in the combination unit to generate the final semantic guidance information. The coefficients of the linear combination can be dynamically adjusted according to actual application needs to achieve flexible fusion of different semantic inputs and comprehensive processing of image effects. This semantic guidance information is then added element-wise with the image features for subsequent manifold matching.
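The combination unit can be sketched in a few lines of Python. The function `combine_semantics`, its weight parameters, and the choice of a zero vector as the default feature are illustrative assumptions; the text-derived and image-derived features are assumed to have already been projected to a common dimension by their respective pre-trained models.

```python
from typing import List, Optional

Vector = List[float]


def combine_semantics(text_feat: Optional[Vector], image_feat: Optional[Vector],
                      dim: int, w_text: float = 1.0, w_image: float = 1.0,
                      w_default: float = 1.0) -> Vector:
    """Linearly combine text, reference-image, and default features."""
    default_feat = [0.0] * dim  # zero-vector default: no external guidance
    text = text_feat if text_feat is not None else default_feat
    image = image_feat if image_feat is not None else default_feat
    if len(text) != dim or len(image) != dim:
        raise ValueError("all semantic features must have length `dim`")
    return [w_text * t + w_image * i + w_default * d
            for t, i, d in zip(text, image, default_feat)]


# Text guidance only, down-weighted to half strength.
text_only = combine_semantics([1.0, 2.0], None, dim=2, w_text=0.5)
# No guidance at all: the result is the default feature.
no_input = combine_semantics(None, None, dim=3)
```

Adjusting the three weights at run time corresponds to the dynamically adjustable combination coefficients mentioned above.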

[0060] Based on the same inventive concept as the aforementioned method embodiments, this invention also provides a manifold matching-based system for restoring low-light images in dark scenes. Referring to Figure 1 and Figure 2, the system includes:

The encoder, which maps the input image to a latent space code. Its input is connected to the image resizing module, and its output is connected to the first addition node.

[0061] The Transformer module is used to extract features from the input image and output a first feature map. Its input is connected to the image resizing module, and its output is connected to the first addition node.

[0062] The replaceable semantic input module is used to provide semantic guidance information. Its output is connected to the second addition node.

[0063] The first addition node is used to add the first feature map output by the Transformer module element-wise to the original latent space features output by the encoder to obtain intermediate features. Its output is connected to the second addition node.

[0064] The second addition node is used to add the intermediate features element-wise to the semantic guidance information output by the replaceable semantic input module to obtain the fused features. Its output is connected to the manifold matching module.

[0065] The manifold matching module receives Gaussian noise and fused features, and generates the latent space code of the target image using a straight-line sampling mechanism through a manifold matching model. Its output is connected to the decoder.

[0066] The decoder is used to decode the latent space code generated by the manifold matching module into a restored high-quality image. Its output is connected to the image size restoration module.

[0067] The image resizing module is used to resize the input low-light images to a uniform size before they are input into the encoder and the Transformer module, respectively.

[0068] The image size restoration module is used to restore the image output by the decoder to its original size.
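The connections between the modules listed above can be sketched as a simple data flow. The following Python example is only a wiring illustration under stated assumptions: the encoder, Transformer module, manifold matching module, and decoder are passed in as placeholder callables, the nearest-neighbour resizing and the 64x64 working size are choices made for this example, and none of the names are taken from the invention.

```python
import numpy as np

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of an (H, W, C) array."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]

def restore(img, encoder, transformer, manifold_matching, decoder,
            guidance=None, size=(64, 64), rng=None):
    rng = np.random.default_rng() if rng is None else rng
    orig_h, orig_w = img.shape[:2]
    x = resize_nearest(img, *size)          # image resizing module
    latent = encoder(x)                     # original latent space features
    feat_map = transformer(x)               # first feature map
    if latent.shape != feat_map.shape:
        raise ValueError("encoder and Transformer outputs must match in shape")
    intermediate = latent + feat_map        # first adder node
    if guidance is None:                    # no-input mode: zero default features
        guidance = np.zeros_like(intermediate)
    fused = intermediate + guidance         # second adder node
    noise = rng.standard_normal(fused.shape)
    target_latent = manifold_matching(noise, fused)
    out = decoder(target_latent)
    return resize_nearest(out, orig_h, orig_w)  # image size restoration module

# Placeholder modules (assumptions, not trained models).
encoder = lambda x: x[::4, ::4]
transformer = lambda x: 0.1 * x[::4, ::4]
manifold_matching = lambda noise, cond: cond   # ignores the noise; stub only
decoder = lambda z: np.repeat(np.repeat(z, 4, axis=0), 4, axis=1)

img = np.random.default_rng(0).random((48, 80, 3))
restored = restore(img, encoder, transformer, manifold_matching, decoder)
```

The stubs only check that shapes agree at the two adder nodes and that the output returns to the original image size; they do not perform any restoration.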

[0069] With reference to Figure 3, the replaceable semantic input module specifically includes: a natural language input unit, which receives natural language text descriptions and extracts semantic features through a pre-trained natural language processing model; a specific image input unit, which receives a reference image that is first normalized in size by an image resizing unit and then input into a pre-trained reference image feature extraction Transformer to extract deep features; a no-input unit, which generates default features when the user does not provide additional semantic information; and a combination unit, which linearly combines the semantic features, deep features, and default features output by the above three units to generate the final semantic guidance information.

[0070] In addition, the system may include a dataset augmentation module used during the training phase to enhance low-light images in the training data and darken and add noise to high-quality images.
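As a non-limiting illustration of the second half of this augmentation, the following Python sketch darkens a high-quality image and adds noise to it to synthesize a low-light training sample. The gamma curve, the brightness scale, the Gaussian noise model, and all parameter values are assumptions chosen for the example; the description does not state which degradation model is used.

```python
import numpy as np

def darken_and_add_noise(img, brightness=0.2, gamma=2.0, noise_std=0.02, rng=None):
    """Darken an image with values in [0, 1] and add Gaussian noise (assumed model)."""
    rng = np.random.default_rng() if rng is None else rng
    img = np.asarray(img, dtype=np.float32)
    dark = brightness * np.power(img, gamma)             # suppress brightness
    noisy = dark + rng.normal(0.0, noise_std, size=img.shape)
    return np.clip(noisy, 0.0, 1.0).astype(np.float32)   # keep a valid pixel range

rng = np.random.default_rng(0)
high_quality = rng.random((32, 32, 3))   # stand-in for a high-quality image
synthetic_low_light = darken_and_add_noise(high_quality, rng=rng)
```

The resulting pair (`synthetic_low_light`, `high_quality`) could then serve as an additional paired training sample.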

[0071] During the inference phase, the system operates according to the process described in Method Implementation Example 1; during the training phase, the manifold matching module is trained according to the process described in Method Implementation Example 2, during which the parameters of the Transformer module, encoder, and decoder must be kept frozen so that they receive no gradient updates.

[0072] The system of this invention adopts a modular architecture and achieves high-efficiency linear sampling through a manifold matching module, which ensures the restoration quality while meeting real-time processing requirements. The replaceable semantic input module provides a variety of semantic guidance modes to achieve flexible image restoration control. It does not require replacement of existing camera hardware, has strong compatibility and low deployment cost, and provides an efficient and intelligent night vision enhancement solution for all-weather monitoring scenarios.
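One possible reading of the linear (straight-line) sampling mechanism is a small number of Euler steps along a straight path from Gaussian noise to the target latent code, in the manner of flow-matching samplers. The following Python sketch illustrates that reading only; it is an assumption, not the disclosed model. The `toy_velocity` function is a hypothetical stand-in for a trained network and, for demonstration, simply treats the conditioning features as the target.

```python
import numpy as np

def straight_line_sample(velocity, cond, num_steps=4, rng=None):
    """Euler integration from Gaussian noise (t = 0) towards the target (t = 1)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal(cond.shape)   # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        z = z + dt * velocity(z, t, cond)  # one straight-line step
    return z

def toy_velocity(z, t, cond):
    """Hypothetical velocity field that points straight at `cond` (demo only)."""
    return (cond - z) / (1.0 - t)

cond = np.random.default_rng(1).random((4, 4))   # stand-in for fused features
z_one_step = straight_line_sample(toy_velocity, cond, num_steps=1,
                                  rng=np.random.default_rng(0))
z_four_steps = straight_line_sample(toy_velocity, cond, num_steps=4,
                                    rng=np.random.default_rng(0))
```

Because the path in this toy case is exactly straight, one step and four steps arrive at the same point, which illustrates why straight sampling paths permit few sampling steps and low latency.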

[0073] It should be noted that the system embodiments provided by the present invention can be used to implement not only the methods in the above method embodiments but also the methods in the other method embodiments provided by the present invention. The only difference lies in the functional modules that are provided; the principle is basically the same as that of the above system embodiments. On the basis of the above system embodiments, and with reference to the specific technical solutions in the other method embodiments, those skilled in the art can adapt the modules and combine technical features to obtain corresponding technical means and technical solutions composed of these technical means. Provided that the practicality of the technical solutions is ensured, they can thereby obtain corresponding system embodiments for implementing the methods in the other method embodiments.

[0074] In summary, this invention discloses a method and system for restoring low-light images in dark scenes based on manifold matching. The method includes the following steps: extracting features from the low-light image to obtain original latent space features and a first feature map; adding the first feature map element-wise to the original latent space features to obtain intermediate features; obtaining semantic guidance information and adding it element-wise to the intermediate features to obtain fused features; inputting Gaussian noise and the fused features into a manifold matching module, generating latent space encoding of the target image through a linear sampling mechanism; and outputting the restored image after decoding. This invention also discloses a restoration system. Through this invention, low-light videos can be accurately and efficiently restored with low processing latency, and it has the advantages of low cost, high accuracy, and strong real-time performance. It is a high-efficiency low-light video restoration solution suitable for all-weather monitoring scenarios.

[0075] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the technical solutions of the embodiments of the present invention.

Claims

1. A method for restoring low-light images in dark scenes based on manifold matching, characterized in that it comprises: acquiring a low-light input image; inputting the low-light input image into an encoder and a pre-trained Transformer module respectively to obtain original latent space features and a first feature map; adding the first feature map element-wise to the original latent space features to obtain intermediate features; obtaining semantic guidance information, and adding the intermediate features element-wise to the semantic guidance information to obtain fused features; inputting Gaussian noise and the fused features into a manifold matching module, in which a manifold matching model generates a latent space code of a target image using a straight-line sampling mechanism; and inputting the latent space code into a decoder for decoding, and outputting a restored high-quality image.

2. The method for restoring low-light images in dark scenes based on manifold matching according to claim 1, characterized in that the semantic guidance information comes from a replaceable semantic input module, which provides at least one of the following three input modes: natural language semantic input, specific reference image semantic input, and no additional semantic input.

3. The method for restoring low-light images in dark scenes based on manifold matching according to claim 2, characterized in that the natural language semantic input extracts semantic features through a pre-trained natural language processing model; the specific reference image semantic input extracts deep features through a pre-trained convolutional neural network; in the absence of additional semantic input, a default feature is generated; and the semantic features, the deep features, and the default feature are linearly combined to obtain a combined feature that serves as the semantic guidance information.

4. The method for restoring low-light images in dark scenes based on manifold matching according to claim 1, characterized in that it further comprises the following training steps: constructing a training dataset containing paired low-light images and corresponding high-quality images; enhancing the low-light images, and darkening and adding noise to the high-quality images, to generate enhanced training samples; inputting the processed low-light image and high-quality image into the encoder respectively; after processing by the pre-trained Transformer module, adding noise to the latent space of the high-quality image to obtain a noisy image; and inputting the noisy image together with the low-light image features fused with semantic guidance information into the manifold matching module for training, the model parameters being optimized by a linear combination of a content diffusion loss and a color consistency loss.

5. The method for restoring low-light images in dark scenes based on manifold matching according to claim 4, characterized in that the encoder and decoder are pre-trained using an autoregressive loss, and, after pre-training, the gradients of the encoder and decoder are frozen when training the manifold matching module.

6. A system for restoring low-light images in dark scenes based on manifold matching, characterized in that it comprises: an encoder for mapping an image into a latent space code; a Transformer module for extracting features from the image; a replaceable semantic input module for providing semantic guidance information; a manifold matching module for receiving Gaussian noise and features fused with the semantic guidance information, and generating a latent space code of a target image through a linear sampling mechanism; and a decoder for decoding the latent space code generated by the manifold matching module into a restored high-quality image; wherein the feature map output by the Transformer module is added element-wise to the original latent space features output by the encoder, and the result is then added element-wise to the semantic guidance information provided by the replaceable semantic input module to obtain the fused features, which are input to the manifold matching module.

7. The system for restoring low-light images in dark scenes based on manifold matching according to claim 6, characterized in that the replaceable semantic input module includes a natural language input unit, a specific image input unit, and a no-input unit, as well as a combination unit for linearly combining the outputs of these units.

8. The system for restoring low-light images in dark scenes based on manifold matching according to claim 7, characterized in that the replaceable semantic input module further includes an image resizing unit and a reference image feature extraction Transformer; the image resizing unit is used to resize a specific reference image to a uniform size, the reference image feature extraction Transformer is used to extract features from the resized specific reference image, and the extracted deep features are used as the output of the specific image input unit.

9. The system for restoring low-light images in dark scenes based on manifold matching according to claim 7, characterized in that the natural language input unit is connected to a pre-trained natural language processing model, and the specific image input unit is connected to a pre-trained convolutional neural network.

10. The system for restoring low-light images in dark scenes based on manifold matching according to claim 6, characterized in that it further includes a dataset augmentation module for enhancing low-light images in the training data and for darkening and adding noise to high-quality images.