Focusing method and imaging device

By utilizing image registration and low-frequency feature matching in a dual-camera system to establish a mapping relationship, the focus position can be quickly determined, solving the focusing problem in long-distance and significantly out-of-focus scenarios. This achieves efficient and accurate autofocus, resulting in good image quality, low hardware cost, and wide applicability.

WO2026124471A1 · PCT designated stage · Publication Date: 2026-06-18 · CONVERGENCE TECH CO LTD

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: CONVERGENCE TECH CO LTD
Filing Date: 2025-12-09
Publication Date: 2026-06-18

AI Technical Summary

⚠Technical Problem

Existing focusing methods struggle to achieve high-precision focusing at long distances and significant defocusing. Traditional methods are slow, deep learning methods face data acquisition difficulties, and active focusing methods have limited range, affecting image quality. Furthermore, existing methods cannot simultaneously meet the demands for fast, accurate, wide-range, and low-cost focusing.

⚗Method used

By acquiring images between the first and second shooting modules, and using image registration and low-frequency feature matching to establish a mapping relationship, the focus position is quickly determined. Combined with a predictive model to optimize the focusing process, fast and accurate focusing is achieved in long-distance and large-scale out-of-focus scenarios.

🎯Benefits of technology

It achieves fast and accurate autofocus in long-distance, large-scale out-of-focus scenarios, with high image quality, low hardware cost, wide applicability to focusing scenarios, reduced prediction model complexity and hardware dependence, adaptive capability, and reduced calibration and maintenance costs.



Abstract

The present invention relates to the technical field of photography, and provides a focusing method and an imaging device. The focusing method comprises the following steps: S1, acquiring a first image; S2, acquiring a second image; S3, acquiring a second pixel position of a matching region in the second image; S4, calculating a third pixel position of the first image in the second image on the basis of a first pixel position and the second pixel position; S5, acquiring a mapping relationship between a focus position of a first photographing module and the third pixel position, and calculating a first preset focus position corresponding to a target image scene; and S6, adjusting the focus position of the first photographing module on the basis of the first preset focus position corresponding to the target image scene. In the present invention, a focus position can be quickly determined by acquiring a first image captured by a first photographing module and a second image captured by a second photographing module, thereby eliminating significant defocus states.


Description

A focusing method and imaging device

[0001] This application claims priority to application number 202411828689.7, filed on December 12, 2024, the entire contents of which are incorporated herein by reference.

Technical Field

[0002] This invention relates to the field of photography technology, and more specifically, to a focusing method and an imaging device.

Background Technology

[0003] Current focusing methods can be categorized, by implementation, into traditional methods and deep learning methods. Traditional methods mainly include contrast-based and phase-based methods. Contrast-based methods rely on maximizing image contrast: the lens position is adjusted repeatedly, and after each adjustment a new image is acquired and its contrast is calculated. By gradually moving the lens, the position with the highest contrast is found, which is the sharpest focus point. This method has high accuracy and is suitable for still photography, but it is slow because the lens must move back and forth. Phase-based methods divide a single pixel into left and right sub-pixels, which receive light from the left and right parts of the lens respectively. If the focus is correct, the two beams of light align on the left and right sub-pixels; otherwise, a phase difference occurs. By calculating this phase difference, the direction and extent of lens movement can be determined. Phase-based methods are fast, but because some pixels are set aside as dedicated focusing pixels, sensor dynamic range is lost, affecting overall image quality. Deep learning methods train focusing models on a training set formed from images at different focus settings and their corresponding focus positions. However, due to the complexity of indoor and outdoor focusing scenarios, data acquisition for deep learning models is labor-intensive and can hardly cover all scenarios. Furthermore, the model size is limited by hardware resources, further restricting the speed and accuracy of focusing. Another approach is active focusing, such as LiDAR focusing, which uses a laser emission module to emit a beam of light towards the target and estimates the distance by measuring the time it takes for the emitted light to return. This method offers fast focusing speed and good performance in low-light environments, but its focusing distance is limited (typically within 10 meters) and it requires additional hardware space.
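The cost of the contrast-based procedure described above can be illustrated with a minimal sketch. The contrast measure (sum of squared horizontal differences), the `fake_capture` camera, and its `true_focus` value are assumptions made for illustration only; they are not taken from this application. The point is simply that every candidate lens position costs one capture.

```python
from typing import Callable, Iterable, List, Tuple

Image = List[List[float]]


def contrast(image: Image) -> float:
    """Sum of squared differences between horizontally adjacent pixels."""
    return sum(
        (row[c + 1] - row[c]) ** 2
        for row in image
        for c in range(len(row) - 1)
    )


def sweep_focus(capture: Callable[[int], Image],
                positions: Iterable[int]) -> Tuple[int, int]:
    """Visit every lens position; return (sharpest position, images captured)."""
    best_pos, best_score, captures = -1, float("-inf"), 0
    for pos in positions:
        score = contrast(capture(pos))  # one new image per lens adjustment
        captures += 1
        if score > best_score:
            best_pos, best_score = pos, score
    return best_pos, captures


def fake_capture(pos: int, true_focus: int = 40) -> Image:
    """Hypothetical camera: edge amplitude decays with distance from true focus."""
    amplitude = 1.0 / (1.0 + abs(pos - true_focus))
    return [[0.0, 0.0, amplitude, amplitude] for _ in range(4)]


best, n_captures = sweep_focus(fake_capture, range(0, 101, 5))
```

In this toy setup the sweep finds the sharpest position only after capturing an image at every one of the 21 candidate positions, which is the slowness the paragraph refers to.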

[0004] Currently, most focusing methods are optimizations and improvements of single methods, but they cannot solve the problem of high-precision focusing when the camera is far away and significantly out of focus. Traditional contrast-based focusing is slow, phase-based focusing affects image quality, active focusing has a short focusing distance, and deep learning methods have high data acquisition difficulty. Furthermore, when the camera is significantly out of focus, high-frequency information in the image is drastically attenuated, resulting in severe homogenization of image content and texture loss, leading to low prediction accuracy of models that rely on texture detail information.

Summary of the Invention

[0005] To address the aforementioned problems, embodiments of this application provide a focusing method and an imaging device.

[0006] In a first aspect, embodiments of this application provide a focusing method, comprising the following steps:

[0007] S1: Obtain a first image from the first shooting module, obtain a target image from the first image, and obtain the first pixel position of the target image in the first image, wherein the target image is the first image or a partial image of the first image;

[0008] S2: Obtain a second image from a second shooting module at a first preset distance from the first shooting module, wherein the field of view of the second image covers the field of view of the first image or the field of view of the second image at least partially overlaps with the field of view of the first image;

[0009] S3: Match the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, and obtain the second pixel position of the matching region in the second image;

[0010] S4: Calculate the third pixel position of the first image in the second image based on the first pixel position and the second pixel position;

[0011] S5: Obtain the mapping relationship between the third pixel position and the focus position of the first shooting module, and calculate the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position;

[0012] S6: Adjust the focus position of the first shooting module according to the first preset focus position corresponding to the target image scene.
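Steps S4 and S5 can be sketched as follows. The formula used for the third pixel position is an assumption: it treats the third pixel position as the origin of the first image expressed in second-image pixels, with a hypothetical `scale` factor between the two images. The linear `mapping` lambda is likewise a placeholder; this text does not fix either formula.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]


def third_pixel_position(first_pos: Point, second_pos: Point,
                         scale: float = 1.0) -> Point:
    """Assumed S4: locate the first image's origin inside the second image.

    first_pos  - target position in the first image (S1)
    second_pos - matching region position in the second image (S3)
    scale      - hypothetical first-image to second-image pixel scale
    """
    return (second_pos[0] - scale * first_pos[0],
            second_pos[1] - scale * first_pos[1])


def preset_focus_position(third_pos: Point,
                          mapping: Callable[[Point], float]) -> float:
    """S5: apply the stored mapping to the third pixel position."""
    return mapping(third_pos)


# Illustrative numbers only.
third = third_pixel_position((120.0, 80.0), (400.0, 300.0), scale=0.5)
focus = preset_focus_position(third, lambda p: 1000.0 - 2.0 * p[0])
```

Because the two shooting modules are separated by the first preset distance, the location of the first image inside the second image shifts with scene distance, which is why a mapping from that location to a focus position is plausible.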

[0013] Preferably, the optical axis of the first shooting module when capturing the first image and the optical axis of the second shooting module when capturing the second image are set parallel to each other.

[0014] Preferably, the optical axes of the first shooting module and the second shooting module are both fixed; or, at least one of the optical axes of the first shooting module and the second shooting module is rotatable.

[0015] Preferably, step S3 includes, before matching the target image with the second image through image registration, performing distortion correction on at least one of the first image and the second image.

[0016] Preferably, step S3 specifically includes: S31: performing image transformation on the target image using an image transformation matrix to obtain a target transformed image, such that the target transformed image, the overlapping field of view of the target image and the second image have the same image resolution;

[0017] S32: Match the target transformed image and the second image through image registration to obtain a matching region in the target transformed image that matches the second image, and obtain the second pixel position of the matching region in the second image.
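A minimal sketch of step S31 follows, assuming the image transformation matrix is a pure scaling matrix and using nearest-neighbor resampling. Both choices, and the scale value 0.5, are assumptions for illustration; the text only requires that the transformed target image and the overlapping region of the second image end up with the same image resolution.

```python
from typing import List

Image = List[List[float]]
Matrix = List[List[float]]


def scaling_matrix(sx: float, sy: float) -> Matrix:
    """3x3 homogeneous matrix that scales x by sx and y by sy."""
    return [[sx, 0.0, 0.0], [0.0, sy, 0.0], [0.0, 0.0, 1.0]]


def apply_scaling(image: Image, matrix: Matrix) -> Image:
    """Resample an image under a pure scaling matrix (nearest neighbor)."""
    sx, sy = matrix[0][0], matrix[1][1]
    rows, cols = len(image), len(image[0])
    out_rows, out_cols = round(rows * sy), round(cols * sx)
    out = []
    for r in range(out_rows):
        src_r = min(rows - 1, int(r / sy))  # inverse-map output row to source row
        out.append([image[src_r][min(cols - 1, int(c / sx))]
                    for c in range(out_cols)])
    return out


# Hypothetical 4x4 target image shrunk to match a second image whose
# pixels cover twice the scene extent per pixel.
target = [[float(4 * r + c) for c in range(4)] for r in range(4)]
transformed = apply_scaling(target, scaling_matrix(0.5, 0.5))
```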

[0018] Preferably, step S3, which involves matching the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, specifically includes: obtaining at least one corresponding low-frequency feature in the target image and the second image, and performing image registration on the target image and the second image based on the low-frequency feature to obtain a matching region in the target image that matches the second image.

[0019] Preferably, obtaining the low-frequency features corresponding to the image specifically includes: performing at least one low-frequency transformation on the image to obtain corresponding first operation data; when a single low-frequency transformation is performed on the image, using the first operation data as the low-frequency features corresponding to the image; and when multiple low-frequency transformations are performed on the image, combining the multiple first operation data to obtain the low-frequency features corresponding to the image.

[0020] Preferably, the low-frequency features include one or more of the following: the overall structural features of the image, the macroscopic statistical distribution features of the image, the global semantic features of the image, or the first feature data of the image; the high-frequency components or local pixel mutations in the first feature data of the image are suppressed or ignored.

[0021] Preferably, the low-frequency transformation is one of the following: calculating and obtaining the average brightness value of the image, obtaining the color histogram of the image, obtaining the brightness histogram of the image, performing a downsampling operation on the image and obtaining low-resolution data of the downsampled image, and processing the image through a low-pass filter and obtaining filtered data of the processed image. The first operation data is one of the following: average brightness value, color histogram, brightness histogram, low-resolution data, and filtered data.
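Three of the low-frequency transformations listed above (average brightness, brightness histogram, downsampling) can be sketched in a few lines. The bin count, the 0 to 255 value range, and block averaging as the downsampling method are assumptions chosen for illustration.

```python
from typing import List

Image = List[List[float]]


def mean_brightness(image: Image) -> float:
    """Average brightness value of the image."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def brightness_histogram(image: Image, bins: int = 4,
                         max_value: int = 256) -> List[float]:
    """Normalized brightness histogram with equal-width bins."""
    hist = [0] * bins
    for row in image:
        for p in row:
            hist[min(bins - 1, int(p * bins / max_value))] += 1
    total = sum(hist)
    return [h / total for h in hist]


def downsample(image: Image, factor: int) -> Image:
    """Low-resolution data obtained by averaging factor x factor blocks."""
    rows, cols = len(image) // factor, len(image[0]) // factor
    return [
        [sum(image[r * factor + i][c * factor + j]
             for i in range(factor) for j in range(factor)) / factor ** 2
         for c in range(cols)]
        for r in range(rows)
    ]


image = [[0, 0, 200, 200],
         [0, 0, 200, 200],
         [100, 100, 255, 255],
         [100, 100, 255, 255]]
# Combining several first operation data into one low-frequency feature.
feature = (mean_brightness(image), brightness_histogram(image),
           downsample(image, 2))
```

All three outputs change little when fine texture is blurred away, which is what makes them usable in a significantly out-of-focus state.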

[0022] Preferably, the low-frequency transformation is one of performing a Fourier transform on the image and obtaining the first low-frequency coefficients of the transformed image, performing a discrete cosine transform on the image and obtaining the second low-frequency coefficients of the transformed image, and performing a wavelet transform on the image and obtaining the low-frequency subband coefficients of the transformed image, wherein the first operation data is one of the first low-frequency coefficients, the second low-frequency coefficients, and the low-frequency subband coefficients.
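As one instance of the frequency-domain transformations above, the following sketch computes a 2D discrete cosine transform (orthonormal DCT-II) directly from its definition and keeps only the top-left `keep` x `keep` block as the low-frequency coefficients. The block size and the orthonormal scaling are assumptions; a practical implementation would use a fast transform rather than this direct summation.

```python
import math
from typing import List

Image = List[List[float]]


def dct2_low(image: Image, keep: int) -> List[List[float]]:
    """Return the keep x keep lowest-frequency 2D DCT-II coefficients."""
    n_rows, n_cols = len(image), len(image[0])
    coeffs = []
    for u in range(keep):
        au = math.sqrt((1.0 if u == 0 else 2.0) / n_rows)
        row = []
        for v in range(keep):
            av = math.sqrt((1.0 if v == 0 else 2.0) / n_cols)
            total = 0.0
            for r in range(n_rows):
                for c in range(n_cols):
                    total += (image[r][c]
                              * math.cos(math.pi * (2 * r + 1) * u / (2 * n_rows))
                              * math.cos(math.pi * (2 * c + 1) * v / (2 * n_cols)))
            row.append(au * av * total)
        coeffs.append(row)
    return coeffs


# A uniform image has all of its energy in the DC coefficient.
flat = [[10.0] * 4 for _ in range(4)]
low = dct2_low(flat, keep=2)
```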

[0023] Preferably, the low-frequency transformation is one of obtaining the density distribution map of the edges in the image and obtaining the overall direction statistics of the edges in the image, and the first operation data is one of the density distribution map and the overall direction statistics.
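The edge-based statistics above can be sketched as follows. Central-difference gradients, the magnitude threshold, the grid cell size, and the number of orientation bins are all assumptions for illustration; the text does not specify how edges are detected or how the statistics are binned.

```python
import math
from typing import List, Tuple

Image = List[List[float]]


def edge_pixels(image: Image, threshold: float) -> List[Tuple[int, int, float]]:
    """Return (row, col, orientation) for interior pixels with a strong gradient."""
    edges = []
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0
            if math.hypot(gx, gy) > threshold:
                # Fold the gradient angle into [0, pi): direction without sign.
                edges.append((r, c, math.atan2(gy, gx) % math.pi))
    return edges


def edge_density_map(image: Image, cell: int, threshold: float) -> List[List[float]]:
    """Fraction of edge pixels in each cell x cell block of the image."""
    grid_rows = -(-len(image) // cell)      # ceiling division
    grid_cols = -(-len(image[0]) // cell)
    grid = [[0.0] * grid_cols for _ in range(grid_rows)]
    for r, c, _ in edge_pixels(image, threshold):
        grid[r // cell][c // cell] += 1.0 / (cell * cell)
    return grid


def direction_histogram(image: Image, bins: int, threshold: float) -> List[float]:
    """Normalized histogram of edge orientations over the whole image."""
    edges = edge_pixels(image, threshold)
    hist = [0] * bins
    for _, _, angle in edges:
        hist[min(bins - 1, int(angle / math.pi * bins))] += 1
    total = len(edges) or 1
    return [h / total for h in hist]


# 8x8 image with one vertical step edge between columns 3 and 4.
step = [[0.0] * 4 + [100.0] * 4 for _ in range(8)]
density = edge_density_map(step, cell=4, threshold=10.0)
directions = direction_histogram(step, bins=4, threshold=10.0)
```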

[0024] Preferably, step S5, obtaining the mapping relationship between the third pixel position and the focus position of the first shooting module, specifically includes: constructing a first prediction model; moving the first shooting module multiple times, obtaining the first adjustment focus position corresponding to each movement of the first shooting module, and obtaining a third image from the first shooting module; obtaining the pixel coordinates of the second preset pixel in the third image in the second image based on the third image; constructing a dataset based on the pixel coordinates of the second preset pixel in the second image and multiple first adjustment focus positions; fitting the first prediction model with the dataset to obtain a first fitting model; and obtaining the mapping relationship between the third pixel position and the focus position of the first shooting module based on the first fitting model.
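The fitting step can be sketched with an ordinary least-squares line. The linear form of the first prediction model, the use of only the horizontal pixel coordinate, and the sample numbers are assumptions; the text leaves the model family open.

```python
from typing import Callable, List, Tuple


def fit_linear(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of ys = slope * xs + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def make_mapping(xs: List[float], ys: List[float]) -> Callable[[float], float]:
    """Return the fitted mapping from pixel coordinate to focus position."""
    slope, intercept = fit_linear(xs, ys)
    return lambda x: slope * x + intercept


# Hypothetical dataset: horizontal pixel coordinate of the second preset
# pixel in the second image, paired with the first adjustment focus
# position recorded for each movement.
pixel_x = [300.0, 310.0, 320.0, 330.0]
focus_positions = [500.0, 480.0, 460.0, 440.0]
mapping = make_mapping(pixel_x, focus_positions)
predicted = mapping(315.0)
```

Once fitted, the mapping is evaluated with a single multiplication and addition, so the coarse focus position is available without any lens sweep.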

[0025] Preferably, step S5, which calculates the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position, specifically includes: obtaining a first adjustment focus position based on the mapping relationship and the third pixel position; obtaining a fourth image from the first shooting module when the first shooting module moves to the first adjustment focus position, and obtaining a first target adjustment image from the fourth image; determining whether the clarity of the first target adjustment image meets the first preset focus condition, stopping execution if it does, and continuing execution if it does not; constructing a second prediction model, and predicting the first predicted focus position of the first shooting module based on the first target adjustment image using the second prediction model; and using the first predicted focus position as the first preset focus position.
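The control flow of this two-stage calculation can be sketched with placeholder callables. Every callable below (`mapping`, `capture_target`, `sharpness`, `second_model`) is a hypothetical stand-in, and returning the first adjustment focus position when the clarity condition is already met is an assumption about what "stopping execution" yields.

```python
from typing import Any, Callable, Tuple

Point = Tuple[float, float]


def first_preset_focus(third_pos: Point,
                       mapping: Callable[[Point], float],
                       capture_target: Callable[[float], Any],
                       sharpness: Callable[[Any], float],
                       threshold: float,
                       second_model: Callable[[Any], float]) -> float:
    """Coarse position from the mapping, refined by a model only if needed."""
    adjusted = mapping(third_pos)              # first adjustment focus position
    target_image = capture_target(adjusted)    # first target adjustment image
    if sharpness(target_image) >= threshold:   # first preset focus condition
        return adjusted
    return second_model(target_image)          # first predicted focus position


# Stub components for illustration only.
stub_mapping = lambda p: 470.0
stub_capture = lambda pos: {"pos": pos}
stub_sharpness = lambda img: 1.0 / (1.0 + abs(img["pos"] - 475.0))
stub_model = lambda img: img["pos"] + 5.0

refined = first_preset_focus((340.0, 260.0), stub_mapping, stub_capture,
                             stub_sharpness, 0.5, stub_model)
accepted = first_preset_focus((340.0, 260.0), stub_mapping, stub_capture,
                              stub_sharpness, 0.1, stub_model)
```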

[0026] Preferably, obtaining the first pixel position of the target image in the first image in step S1 specifically includes: obtaining the first pixel position by detecting it through a target detection algorithm or receiving the first pixel position sent by the input module.

[0027] Preferably, the target image is an irregularly shaped image, and the first pixel position of the target image in the first image is the pixel coordinate of the upper left vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the first image. The second pixel position of the matching region in the second image is the pixel coordinate of the upper left vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the second image.

[0028] Preferably, the target image is a rectangular image, and the first pixel position of the target image in the first image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the first image, or the second pixel position of the matching region in the second image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the second image.
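The reference points named above can be computed as in the following sketch, which assumes the region is given as a list of (x, y) pixel coordinates and that the minimum bounding rectangle is axis-aligned. Under that assumption the center point and the centroid of the rectangle coincide, so a single function covers both.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]


def bounding_rectangle(pixels: List[Pixel]) -> Tuple[int, int, int, int]:
    """Axis-aligned minimum bounding rectangle as (min_x, min_y, max_x, max_y)."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


def upper_left_vertex(pixels: List[Pixel]) -> Tuple[int, int]:
    """Upper left vertex of the minimum bounding rectangle."""
    min_x, min_y, _, _ = bounding_rectangle(pixels)
    return min_x, min_y


def rectangle_center(pixels: List[Pixel]) -> Tuple[float, float]:
    """Center point (equal to the centroid) of the minimum bounding rectangle."""
    min_x, min_y, max_x, max_y = bounding_rectangle(pixels)
    return (min_x + max_x) / 2.0, (min_y + max_y) / 2.0


# Hypothetical L-shaped (irregular) region.
region = [(2, 3), (3, 3), (4, 3), (2, 4), (2, 5)]
corner = upper_left_vertex(region)
center = rectangle_center(region)
```

Whichever reference point is chosen, the same convention has to be used in both images so that the first and second pixel positions are comparable.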

[0029] Preferably, step S6 specifically includes:

[0030] S61: Obtain the current focus position of the first shooting module;

[0031] S62: Calculate the number of steps and direction of movement required for the first shooting module to move from the current focus position to the first preset focus position based on the first preset focus position and the current focus position;

[0032] S63: Adjust the focus position of the first shooting module according to the number of steps and the direction of movement to match the first preset focus position corresponding to the target image scene.
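Steps S61 to S63 reduce to a signed difference, as in this minimal sketch. The `step_size` parameter (focus units per motor step) and the sign convention for direction are assumptions for illustration.

```python
from typing import Tuple


def plan_move(current: float, target: float,
              step_size: float = 1.0) -> Tuple[int, int]:
    """Return (number of steps, direction) to go from current to target.

    direction is +1 toward larger focus positions, -1 toward smaller
    ones, and 0 when no movement is needed.
    """
    delta = target - current
    steps = round(abs(delta) / step_size)
    if steps == 0:
        return 0, 0
    return steps, (1 if delta > 0 else -1)


move = plan_move(current=500.0, target=470.0)
```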

[0033] Preferably, after step S63, the method further includes:

[0034] S64: Calculate and obtain multiple image contrast or image sharpness information at multiple positions adjacent to the first preset focus position;

[0035] S65: Obtain the maximum value of image contrast or image sharpness information from multiple image contrast or image sharpness information, and obtain the focus position corresponding to the maximum value of the image contrast or image sharpness information;

[0036] S66: Move the focus position of the first shooting module to the focus position corresponding to the maximum value.
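Steps S64 to S66 amount to a small local search around the first preset focus position. In the sketch below, the offsets and the `sharpness_at` function are hypothetical; in a real device each evaluation would require moving the lens and capturing an image.

```python
from typing import Callable, Iterable


def refine_focus(preset: float, offsets: Iterable[float],
                 sharpness_at: Callable[[float], float]) -> float:
    """Evaluate sharpness at positions near the preset and return the best one."""
    candidates = [preset + offset for offset in offsets]
    scores = {pos: sharpness_at(pos) for pos in candidates}  # S64
    return max(scores, key=scores.get)                       # S65


# Hypothetical sharpness curve peaking at focus position 474.
best_position = refine_focus(
    preset=470.0,
    offsets=[-4.0, -2.0, 0.0, 2.0, 4.0],
    sharpness_at=lambda pos: -(pos - 474.0) ** 2,
)
```

Only five positions are evaluated here, compared with a full sweep of the focus range, because the preset position is assumed to be already close to the optimum.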

[0037] In a second aspect, embodiments of this application provide an imaging device, including a first shooting module, a second shooting module, a first image acquisition module, a second image acquisition module, a first matching module, a first storage module, a first calculation module, a second calculation module, and a first adjustment module, wherein the distance between the second shooting module and the first shooting module is a first preset length;

[0038] The first and second shooting modules are used to capture images respectively;

[0039] The first image acquisition module is used to acquire a first image from the first shooting module, acquire a target image from the first image, and acquire the first pixel position of the target image in the first image, wherein the target image is the first image or a partial image of the first image;

[0040] The second image acquisition module is used to acquire a second image from the second shooting module, wherein the field of view of the second image covers the field of view of the first image or the field of view of the second image at least partially overlaps with the field of view of the first image;

[0041] The first matching module is used to match the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, and to obtain the second pixel position of the matching region in the second image;

[0042] The first calculation module is used to calculate the position of the third pixel in the second image based on the position of the first pixel and the position of the second pixel;

[0043] The first storage module is used to store the mapping relationship between the position of the third pixel and the focus position of the first shooting module;

[0044] The second calculation module is used to obtain the mapping relationship between the focus position of the first shooting module and the third pixel position from the first storage module, and calculate the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position.

[0045] The first adjustment module is used to adjust the focus position of the first shooting module according to the first preset focus position corresponding to the target image scene.

[0046] Preferably, the optical axis of the first shooting module when capturing the first image and the optical axis of the second shooting module when capturing the second image are set parallel to each other.

[0047] Preferably, the optical axes of the first shooting module and the second shooting module are both fixed; or, at least one of the optical axes of the first shooting module and the second shooting module is rotatable.

[0048] Preferably, the first matching module includes a first image transformation unit, which is used to perform image transformation on the target image through an image transformation matrix to obtain a target transformed image, such that the target transformed image, the overlapping field of view of the target image and the second image have the same image resolution;

[0049] The first image matching unit is used to match the target transformed image and the second image through image registration to obtain a matching region in the target transformed image that matches the second image, and to obtain the second pixel position of the matching region in the second image.

[0050] Preferably, the first matching module performs image registration to match the target image and the second image to obtain a matching region in the target image that matches the second image. Specifically, this includes: obtaining at least one corresponding low-frequency feature in the target image and the second image, and performing image registration on the target image and the second image based on the low-frequency feature to obtain a matching region in the target image that matches the second image.

[0051] Preferably, the first matching module obtains the low-frequency features corresponding to the image by: performing at least one low-frequency transformation on the image to obtain corresponding first operation data; when a single low-frequency transformation is performed on the image, using the first operation data as the low-frequency feature corresponding to the image; and when multiple low-frequency transformations are performed on the image, combining the multiple first operation data to obtain the low-frequency feature corresponding to the image.

[0052] Preferably, the low-frequency features include one or more of the following: the overall structural features of the image, the macroscopic statistical distribution features of the image, the global semantic features of the image, or the first feature data of the image; the high-frequency components or local pixel mutations in the first feature data of the image are suppressed or ignored.

[0053] Preferably, the low-frequency transformation is one of the following: calculating and obtaining the average brightness value of the image, obtaining the color histogram of the image, obtaining the brightness histogram of the image, performing a downsampling operation on the image and obtaining low-resolution data of the downsampled image, and processing the image through a low-pass filter and obtaining filtered data of the processed image. The first operation data is one of the following: average brightness value, color histogram, brightness histogram, low-resolution data, and filtered data.

[0054] Preferably, the low-frequency transformation is one of performing a Fourier transform on the image and obtaining the first low-frequency coefficients of the transformed image, performing a discrete cosine transform on the image and obtaining the second low-frequency coefficients of the transformed image, and performing a wavelet transform on the image and obtaining the low-frequency subband coefficients of the transformed image, wherein the first operation data is one of the first low-frequency coefficients, the second low-frequency coefficients, and the low-frequency subband coefficients.

[0055] Preferably, the low-frequency transformation is one of obtaining the density distribution map of the edges in the image and obtaining the overall direction statistics of the edges in the image, and the first operation data is one of the density distribution map and the overall direction statistics.

[0056] Preferably, the first matching module acquires the low-frequency features corresponding to the target image and the second image respectively. The first matching module performs image registration on the target image and the second image based on the low-frequency features to obtain the matching region in the target image that matches the second image. Specifically, this includes: performing a similarity calculation operation on the low-frequency features of the target image and the second image, and taking the region with the highest similarity as the matching region in the target image that matches the second image.
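The similarity calculation can be sketched as an exhaustive sliding-window search. Using the negated sum of absolute differences as the similarity score is an assumption; the text does not name a similarity measure. The inputs are assumed to be low-frequency data already, for example downsampled versions of the target image and the second image.

```python
from typing import List, Tuple

Image = List[List[float]]


def best_match(template: Image, image: Image) -> Tuple[Tuple[int, int], float]:
    """Slide the template over the image; return ((x, y), similarity) of the best fit.

    Similarity is the negated sum of absolute differences, so 0 is a
    perfect match and more negative values are worse.
    """
    t_rows, t_cols = len(template), len(template[0])
    best_pos, best_score = (0, 0), float("-inf")
    for y in range(len(image) - t_rows + 1):
        for x in range(len(image[0]) - t_cols + 1):
            score = -sum(abs(image[y + i][x + j] - template[i][j])
                         for i in range(t_rows) for j in range(t_cols))
            if score > best_score:
                best_pos, best_score = (x, y), score
    return best_pos, best_score


# Hypothetical low-resolution second image and target patch.
second_low = [[float(r * 5 + c) for c in range(5)] for r in range(4)]
target_low = [[7.0, 8.0], [12.0, 13.0]]
position, similarity = best_match(target_low, second_low)
```

The returned (x, y) is the upper left vertex of the best-matching window, which corresponds to one of the conventions for the second pixel position.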

[0057] Preferably, it further includes a mapping relationship acquisition module; the mapping relationship acquisition module is used to construct a first prediction model; move the first shooting module multiple times, obtain the first adjustment focus position corresponding to each movement of the first shooting module and obtain a third image from the first shooting module, obtain the pixel coordinates of the second preset pixel in the third image in the second image based on the third image, and construct a dataset based on the pixel coordinates of the second preset pixel in the second image and the multiple first adjustment focus positions; fit the first prediction model with the dataset to obtain a first fitting model; and obtain the mapping relationship between the third pixel position and the focus position of the first shooting module based on the first fitting model.

[0058] Preferably, the second calculation module for calculating the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position specifically includes: obtaining a first adjustment focus position based on the mapping relationship and the third pixel position; obtaining a fourth image from the first shooting module when the first shooting module moves to the first adjustment focus position, and obtaining a first target adjustment image from the fourth image; determining whether the clarity of the first target adjustment image meets the first preset focus condition, and if not, constructing a second prediction model, and obtaining the first predicted focus position of the first shooting module based on the first target adjustment image through the second prediction model; and using the first predicted focus position as the first preset focus position.

[0059] Preferably, it further includes an input module; the first image acquisition module is used to detect and acquire the first pixel position through a target detection algorithm or to receive the first pixel position sent by the input module.

[0060] Preferably, the target image is an irregularly shaped image, and the first pixel position of the target image in the first image is the pixel coordinate of the upper left vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the first image. The second pixel position of the matching region in the second image is the pixel coordinate of the upper left vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the second image.

[0061] Preferably, the target image is a rectangular image, and the first pixel position of the target image in the first image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the first image, or the second pixel position of the matching region in the second image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the second image.

[0062] Preferably, the first adjustment module includes a first position acquisition unit for acquiring the current focus position of the first shooting module;

[0063] The movement calculation unit is used to calculate the number of steps and direction of movement required for the first shooting module to move from the current focus position to the first preset focus position based on the first preset focus position and the current focus position.

[0064] The focusing unit is used to adjust the focusing position of the first shooting module according to the number of steps and the direction of movement to match the first preset focusing position corresponding to the target image scene.

[0065] Preferably, after adjusting the focus position of the first shooting module to match the first preset focus position corresponding to the target image scene according to the number of moving steps and the moving direction, the focusing unit is further used for: calculating and obtaining multiple image contrast or image sharpness information at multiple positions adjacent to the first preset focus position; obtaining the maximum value of the image contrast or image sharpness information from the multiple image contrast or image sharpness information, and obtaining the focus position corresponding to that maximum value; and moving the focus position of the first shooting module to the focus position corresponding to the maximum value.

[0066] Preferably, it also includes a display module, which can display a first image and / or a second image.

[0067] Preferably, the display module is a touch display module capable of displaying at least a first image; the touch display module can sense the click or touch position on the first image and generate the target image position.

[0068] In a third aspect, embodiments of this application provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method provided in the first aspect or any possible implementation of the first aspect.

[0069] In a fourth aspect, embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the method provided in the first aspect or any possible implementation thereof.

[0070] The beneficial effects of this invention are as follows:

[0071] 1. The field of view of the second image covers the field of view of the first image, or the field of view of the second image at least partially overlaps with the field of view of the first image. The first pixel position of the target image in the first image and the second pixel position of the matching region in the second image are obtained respectively. Based on the mapping relationship between the third pixel position and the focus position of the first shooting module, the first preset focus position corresponding to the target image scene is calculated based on the third pixel position. The focus position of the first shooting module is adjusted according to the first preset focus position corresponding to the target image scene. Compared with the traditional contrast focusing method, in this application, the focus position can be quickly determined by acquiring a first image acquired by the first shooting module and a second image acquired by the second shooting module, eliminating the large defocus state.

[0072] 2. Compared with phase detection autofocus, this application does not require a special dual-pixel sensor for focusing, and the focusing process does not lose image resolution or dynamic range, resulting in good image quality; in addition, compared with phase detection autofocus, this method has no strict requirements on the sensor model.

[0073] 3. Compared with deep learning focusing methods, this application inputs a third image, obtained at a position close to the optimal focus point, into a prediction model for focus prediction. This limits the prediction range of the prediction model to the quasi-focused neighborhood close to the optimal focus point, reducing the complexity and number of parameters of the prediction model, reducing the dependence on massive full-scene training data, and effectively improving the speed and accuracy of focusing.

[0074] 4. Compared with the LiDAR focusing method, this application has a wide focusing range and can achieve focusing in both long-distance and short-distance scenes, without requiring an additional hardware emission module.

[0075] 5. Compared with conventional dual-camera focusing methods, this application does not rely on the intrinsic parameter information of camera calibration. Based on dual-camera images, the system has adaptive capabilities through feature registration and dynamic mapping. Even when the optical axis undergoes slight deformation or mechanical displacement due to long-term use of the equipment, the real-time registration mechanism based on image content can still ensure focusing accuracy, reducing the labor costs of production line calibration and the maintenance difficulty throughout the equipment's life cycle.

[0076] 6. When faced with distant and significantly out-of-focus scenes, single focusing methods rely on high-frequency signals (texture, edge) that are either absent or unreliable, leading to slow focusing or even failure. To address this issue, this application utilizes the relatively stable low-frequency features of images to achieve fast and accurate autofocus in distant and significantly out-of-focus scenes, while also providing high image quality, low hardware cost, and wide applicability.

Attached Figure Description

[0077] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0078] Figure 1 is a schematic flowchart of a focusing method provided in an embodiment of this application;

[0079] Figure 2 is a schematic diagram of an imaging device provided in an embodiment of this application;

[0080] Figure 3 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application;

[0081] Figure 4 is a schematic diagram of the first image and the location of the target image in the first image provided in the embodiment of this application;

[0082] Figure 5 is a schematic diagram showing the positions of the second image, the first image, and the target image provided in an embodiment of this application;

[0083] Figure 6 is a schematic diagram showing the positions of the linearly scaled first image and the target image in the linearly scaled first image provided in an embodiment of this application.

[0084] Figure 7 is a schematic diagram showing the relationship between the focus position of the first shooting module and the contrast of the target image provided in the embodiment of this application;

[0085] Figure 8 is a schematic diagram of the interface of a touch display unit in an imaging device provided in an embodiment of this application;

[0086] Figure 9 is a flowchart illustrating the implementation of focusing based on image registration and pre-calibration mapping relationships provided in an embodiment of this application;

[0087] Figure 10 is a flowchart illustrating the implementation of image prediction-based focusing according to an embodiment of this application;

[0088] Figure 11 is a flowchart illustrating the implementation of focusing by sequentially moving the focusing position according to an embodiment of this application.

Detailed Implementation

[0089] The technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings.

[0090] In the following description, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance. The following description provides multiple embodiments of this application, which can be substituted or combined with each other. Therefore, this application can also be considered to include all possible combinations of the same and / or different embodiments described. Thus, if one embodiment includes features A, B, and C, and another embodiment includes features B and D, then this application should also be considered to include embodiments containing one or more other possible combinations of A, B, C, and D, even if such embodiments are not explicitly described in the following text.

[0091] The following description provides examples and does not limit the scope, applicability, or examples set forth in the claims. Changes may be made to the function and arrangement of the described elements without departing from the scope of this application. Various processes or components may be appropriately omitted, substituted, or added to the examples. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Furthermore, features described with respect to some examples may be combined into other examples.

[0092] Please refer to Figures 1 and 4-7. Figure 1 is a flowchart illustrating a focusing method provided in an embodiment of this application. Figure 4 is a schematic diagram showing the positions of the first image and the target image within the first image, provided in an embodiment of this application. Figure 5 is a schematic diagram showing the positions of the second image, the first image, and the target image, provided in an embodiment of this application. Figure 6 is a schematic diagram showing the positions of the linearly scaled first image and the target image within the linearly scaled first image, provided in an embodiment of this application. Figure 7 is a schematic diagram showing the relationship between the focus position of the first shooting module and the contrast of the target image, provided in an embodiment of this application. In this embodiment, the method includes the following steps:

[0093] S1: Obtain a first image from the first shooting module, obtain a target image from the first image, and obtain the first pixel position of the target image in the first image, wherein the target image is the first image or a partial image of the first image;

[0094] S2: Obtain a second image from a second shooting module at a first preset distance from the first shooting module, wherein the field of view of the second image covers the field of view of the first image or the field of view of the second image at least partially overlaps with the field of view of the first image;

[0095] S3: Match the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, and obtain the second pixel position of the matching region in the second image;

[0096] S4: Calculate the third pixel position in the second image based on the first pixel position and the second pixel position;

[0097] S5: Obtain the mapping relationship between the third pixel position and the focus position of the first shooting module, and calculate the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position;

[0098] S6: Adjust the focus position of the first shooting module according to the first preset focus position corresponding to the target image scene.

[0099] In the embodiments of this application, the method of adjusting the focus position of the first shooting module should not be understood as only the mechanical displacement of the optical element or imaging sensor in physical space, but should be broadly understood as all adjustment methods that change the focus state of the imaging system and cause the focus position to change relative to the imaging medium; including but not limited to at least one of the following: physical displacement adjustment method, adjustment method based on optical degree change, optical path adjustment method, or imaging algorithm adjustment method.

[0100] In this embodiment, a first image and a second image are acquired by the first shooting module and the second shooting module, respectively, and a target image is obtained from the first image. The field of view of the second image covers the field of view of the first image, or at least partially overlaps with it. Image registration is used to match the target image and the second image to obtain a matching region. The first pixel position of the target image in the first image and the second pixel position of the matching region in the second image are obtained, and the third pixel position is calculated from them. Based on the mapping relationship between the third pixel position and the focus position of the first shooting module, a first preset focus position corresponding to the target image scene is calculated from the third pixel position. The focus position of the first shooting module is then adjusted according to that first preset focus position. Compared with the traditional contrast focusing method, this application can quickly determine the focus position by acquiring one first image from the first shooting module and one second image from the second shooting module, eliminating the large defocus state. Compared with phase detection autofocus methods, this application does not require a special dual-pixel sensor for focusing, and the focusing process does not lose image resolution or dynamic range, resulting in good image quality; furthermore, this application has no strict requirements on the sensor model. Compared with deep learning autofocus methods, this application inputs a third image obtained near the optimal focus point into a prediction model for focus prediction, which limits the prediction range of the prediction model to the quasi-focused neighborhood near the optimal focus point, reduces the complexity and number of parameters of the prediction model, reduces the dependence on massive full-scene training data, and effectively improves the speed and accuracy of focusing. Compared with LiDAR autofocus methods, this application has a wide focusing range, can achieve focusing in both long-distance and short-distance scenes, and does not require additional hardware transmission modules. Compared with conventional dual-camera autofocus methods, this application does not rely on the intrinsic parameter information of camera calibration; based on dual-camera images, the system gains adaptive capabilities by establishing feature registration and dynamic mapping relationships. Even when the optical axis undergoes slight deformation or mechanical displacement due to long-term use of the equipment, the real-time registration mechanism based on image content can still ensure the accuracy of focusing, reducing the labor cost of production line calibration and the maintenance difficulty throughout the equipment's life cycle.

[0101] In this embodiment, the mapping relationship between the third pixel position and the focus position of the first shooting module can be preset by the user, and the mapping relationship between the third pixel position and the focus position of the first shooting module can be stored in the storage module in advance.
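The pre-stored mapping relationship described above could be queried as in the following minimal sketch. The source does not specify the form of the mapping, so everything here is an assumption made for illustration: the function name focus_from_offset, the calibration table values, the use of only the horizontal component x3 of the third pixel position, and linear interpolation between stored entries.

```python
from bisect import bisect_right

# Hypothetical pre-calibrated table: (x3 offset in pixels, focus position in motor steps).
# The values are illustrative only and are not taken from the source.
CALIBRATION_TABLE = [(840.0, 120.0), (850.0, 200.0), (860.0, 320.0)]


def focus_from_offset(x3, table=CALIBRATION_TABLE):
    """Return a focus position for offset x3 by linear interpolation.

    The table must be sorted by offset. Offsets outside the calibrated
    range are clamped to the nearest stored focus position.
    """
    offsets = [offset for offset, _ in table]
    if x3 <= offsets[0]:
        return table[0][1]
    if x3 >= offsets[-1]:
        return table[-1][1]
    i = bisect_right(offsets, x3)          # first entry strictly greater than x3
    (o0, f0), (o1, f1) = table[i - 1], table[i]
    t = (x3 - o0) / (o1 - o0)              # fractional position between the two entries
    return f0 + t * (f1 - f0)


print(focus_from_offset(855.0))  # 260.0
```

A lookup table with interpolation is only one possible representation; a fitted curve or a two-dimensional table over (x3, y3) would serve the same role.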

[0102] In one possible implementation, the optical axis of the first shooting module when capturing the first image and the optical axis of the second shooting module when capturing the second image are set parallel to each other.

[0103] In this embodiment, the first shooting module can be a first camera, and the second shooting module can be a second camera. The first shooting module can be a telephoto camera or a wide-angle camera, and the second shooting module can likewise be a wide-angle camera or a telephoto camera; neither shooting module is limited to telephoto scenes.

[0104] In this embodiment, the first and second shooting modules are integrated into the same product, with the second shooting module assisting the first shooting module in focusing. The product can be a telescope with dual cameras, or something similar. The method is executed by the first and second shooting modules, or it can be executed by a control unit within the product.

[0105] In one possible implementation, the optical axis of the first shooting module and the optical axis of the second shooting module are both fixed; or, at least one of the optical axes of the first shooting module and the second shooting module is rotatable.

[0106] In this embodiment, the optical axes of both the first and second shooting modules can be fixed, which ensures that they remain parallel. Alternatively, at least one of the two optical axes is rotatable, so that the axes can be adjusted to be parallel when the first shooting module captures the first image and the second shooting module captures the second image.

[0107] In one possible implementation, step S3 includes, before matching the target image with the second image through image registration, performing distortion correction on at least one of the first image and the second image.

[0108] In this embodiment of the application, before matching the target image with the second image, distortion correction is performed on one or both of the target image and the second image to improve matching accuracy and thus improve focusing efficiency.

[0109] In one possible implementation, step S3 specifically includes: S31: performing image transformation on the target image using an image transformation matrix to obtain a target transformed image, such that the target transformed image and the overlapping field-of-view region of the target image and the second image have the same image resolution;

[0110] S32: Match the target transformed image and the second image through image registration to obtain a matching region in the target transformed image that matches the second image, and obtain the second pixel position of the matching region in the second image.

[0111] In the embodiments of this application, having the same image resolution means that the target image and the second image have the same pixel dimension, and the object-side field of view corresponding to the target image is consistent with the object-side field of view corresponding to the overlapping field of view region of the target image and the second image.

[0112] In one possible implementation, step S4 specifically includes:

[0113] S41: Obtain the first pixel coordinates (x1, y1) of the first preset pixel point at the first preset position in the first image.

[0114] S42: Obtain the second pixel coordinates (x2, y2) of the first preset pixel point in the second image;

[0115] S43: Perform image transformation on the first image using the image transformation matrix to obtain the first transformed image, and obtain the third pixel coordinates (x1', y1') of the first preset pixel point in the first transformed image.

[0116] S44: Calculate the fourth pixel coordinates (x3, y3) of the third pixel position based on the second pixel coordinates (x2, y2) and the third pixel coordinates (x1', y1'), where x3 = x2 - x1' and y3 = y2 - y1'.

[0117] In one possible implementation, the target image is an image with a rectangular outline, the matching region is a region with a rectangular outline, and the first preset pixel at a first preset position in the matching region can be any vertex of the rectangular outline of the matching region.

[0118] In one possible implementation, the image transformation matrix is a linear scaling matrix A = [r1 0; 0 r2], that is, a diagonal matrix with diagonal entries r1 and r2, where r1 represents the linear scaling ratio of the first or second image in the x direction, and r2 represents the linear scaling ratio of the first or second image in the y direction. The third pixel coordinates (x1', y1') in step S43 then satisfy x1' = r1*x1 and y1' = r2*y1, and the fourth pixel coordinates (x3, y3) in step S44 satisfy x3 = x2 - x1' = x2 - r1*x1 and y3 = y2 - y1' = y2 - r2*y1.
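Steps S41 to S44 with a linear scaling matrix reduce to a few arithmetic operations, sketched below. The function name third_pixel_position and the sample coordinates and scaling ratios are hypothetical; only the formulas x3 = x2 - r1*x1 and y3 = y2 - r2*y1 come from the text.

```python
def third_pixel_position(first_xy, second_xy, r1, r2):
    """Compute (x3, y3) from the first and second pixel coordinates.

    first_xy  -- (x1, y1): the preset pixel point in the first image (S41)
    second_xy -- (x2, y2): the same point located in the second image (S42)
    r1, r2    -- linear scaling ratios in the x and y directions
    """
    x1, y1 = first_xy
    x2, y2 = second_xy
    x1_scaled = r1 * x1          # S43: x1' = r1 * x1
    y1_scaled = r2 * y1          # S43: y1' = r2 * y1
    return (x2 - x1_scaled, y2 - y1_scaled)   # S44


# Illustrative numbers only: a point at (400, 300) in the first image that
# registers at (950, 520) in the second image, with both ratios equal to 0.25.
print(third_pixel_position((400, 300), (950, 520), 0.25, 0.25))  # (850.0, 445.0)
```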

[0119] In one possible implementation, step S3, which involves matching the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, specifically includes: obtaining at least one corresponding low-frequency feature in the target image and the second image, and performing image registration on the target image and the second image based on the low-frequency feature to obtain a matching region in the target image that matches the second image.

[0120] This application addresses the problem of high-precision autofocus in long-distance, significantly out-of-focus scenarios. As mentioned above, no single existing focusing method simultaneously offers fast focusing, high accuracy, good image quality, a wide focusing distance range, low hardware cost, and applicability to complex and varied focusing scenarios. In long-distance, significantly out-of-focus scenarios in particular, the core signals on which these methods rely are absent or unreliable, so the focusing process is slow or fails altogether. To solve these problems, this application proposes a dual-camera focusing method that exploits the relatively stable low-frequency features of images. It can achieve fast and accurate autofocus in long-distance, significantly out-of-focus scenarios, with high image quality, low hardware cost, and wide applicability to focusing scenarios.

[0121] In one possible implementation, obtaining the low-frequency features corresponding to an image specifically includes: performing at least one low-frequency transformation on the image to obtain corresponding first operation data; when a single low-frequency transformation is performed on the image, using the first operation data as the low-frequency features corresponding to the image; and when multiple low-frequency transformations are performed on the image, combining the resulting multiple first operation data to obtain the low-frequency features corresponding to the image.

[0122] In this embodiment, at least one low-frequency transform is performed on the target image to obtain corresponding first operation data. At least one low-frequency transform is performed on the second image to obtain corresponding first operation data.

[0123] In one possible implementation, the low-frequency features include one or more of the following: the overall structural features of the image, the macroscopic statistical distribution features of the image, the global semantic features of the image, or first feature data of the image in which high-frequency components or abrupt local pixel changes are suppressed or ignored.

[0124] In one possible implementation, the low-frequency transformation is one of the following: calculating and obtaining the average brightness value of the image, obtaining the color histogram of the image, obtaining the brightness histogram of the image, performing a downsampling operation on the image and obtaining low-resolution data of the downsampled image, and processing the image through a low-pass filter and obtaining filtered data of the processed image. The first operation data is one of the following: average brightness value, color histogram, brightness histogram, low-resolution data, and filtered data.

[0125] In this embodiment, the low-frequency transformation can be one of the following: calculating the average brightness value of an image, obtaining the color histogram of the image, obtaining the brightness histogram of the image, performing a downsampling operation on the image and obtaining low-resolution data of the downsampled image, and processing the image through a low-pass filter and obtaining filtered data of the processed image. By performing one of the above low-frequency transformations, first operation data is obtained, that is, when performing the low-frequency transformation to calculate the average brightness value of the image, the average brightness value is used as the first operation data; when performing the low-frequency transformation to obtain the color histogram of the image, the color histogram is used as the first operation data; when performing the low-frequency transformation to obtain the brightness histogram of the image, the brightness histogram is used as the first operation data; when performing the downsampling operation on the image and obtaining low-resolution data of the downsampled image, the low-resolution data is used as the first operation data; when performing the low-frequency transformation to process the image through a low-pass filter and obtaining filtered data of the processed image, the filtered data is used as the first operation data, and the first operation data is used as a low-frequency feature. By performing various low-frequency transformations described above, that is, combining multiple low-frequency transformations to perform multiple processing on the same image, multiple corresponding first operation data are obtained. These first operation data are then combined to form low-frequency features. The combination can be averaging, weighting, addition, subtraction, multiplication, division, or direct concatenation. 
For example, the target image can be both downsampled, yielding low-resolution data, and processed through a low-pass filter, yielding filtered data. The low-resolution data and the filtered data are then averaged to obtain new data, which is used as the low-frequency feature.
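As one illustration of combining several first operation data, the sketch below computes an average brightness value and a brightness histogram for a grayscale image and joins them by direct concatenation. The function names, the 4-bin histogram, the 8-bit value range, and the normalization are assumptions chosen for brevity; plain nested lists stand in for image arrays so the example has no dependencies.

```python
def mean_brightness(img):
    """Average brightness value of a grayscale image (list of rows)."""
    pixels = [p for row in img for p in row]
    return sum(pixels) / len(pixels)


def brightness_histogram(img, bins=4, max_value=256):
    """Normalized brightness histogram with equally wide bins."""
    hist = [0] * bins
    count = 0
    for row in img:
        for p in row:
            index = min(int(p * bins / max_value), bins - 1)
            hist[index] += 1
            count += 1
    return [c / count for c in hist]


def low_frequency_feature(img):
    """Combine two first operation data by direct concatenation."""
    return [mean_brightness(img) / 255.0] + brightness_histogram(img)


image = [[0, 64], [128, 255]]
print(low_frequency_feature(image))
```

Both quantities depend only on the overall brightness distribution, so they change little when fine texture is blurred away by defocus.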

[0126] In this embodiment of the application, the low-frequency transformation of performing downsampling operation on the image and obtaining low-resolution data of the downsampled image specifically includes: downsampling the target image and the second image based on the horizontal field of view ratio and the vertical field of view ratio of the first shooting module and the second shooting module, respectively, to obtain low-frequency images of the target image and the second image.
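A downsampling operation of this kind can be sketched as block averaging. How the factors are derived from the horizontal and vertical field-of-view ratios is not detailed here, so the integer factors fx and fy are simply passed in as assumptions, and the function name downsample is hypothetical.

```python
def downsample(img, fx, fy):
    """Downsample a grayscale image by averaging fx-by-fy pixel blocks.

    img    -- list of rows of numeric pixel values
    fx, fy -- integer downsampling factors in x and y (assumed to be
              derived elsewhere from the field-of-view ratios)
    Rows or columns that do not fill a whole block are dropped.
    """
    height, width = len(img), len(img[0])
    result = []
    for top in range(0, height - fy + 1, fy):
        row = []
        for left in range(0, width - fx + 1, fx):
            block = [img[y][x]
                     for y in range(top, top + fy)
                     for x in range(left, left + fx)]
            row.append(sum(block) / len(block))
        result.append(row)
    return result


image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
print(downsample(image, 2, 2))  # [[2.5, 4.5], [10.5, 12.5]]
```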

[0127] In the embodiments of this application, the low-pass filter used in the low-frequency transformation of the image to process the image and obtain the filtered data of the processed image may include one or a combination of the following types: a) spatial domain linear convolution filter, including mean filter and / or Gaussian filter; b) frequency domain filter, including ideal low-pass filter and / or Butterworth low-pass filter; c) nonlinear filter, including median filter and / or bilateral filter.
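Of the filter types listed, the mean filter is the simplest to illustrate. The following is a minimal sketch, not the source's implementation; the function name mean_filter, the square window, and the edge handling (replicating border pixels) are assumptions.

```python
def mean_filter(img, radius=1):
    """Apply a (2*radius+1)-square mean filter to a grayscale image.

    Border pixels are handled by clamping coordinates to the image,
    which replicates the nearest edge pixel.
    """
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            total = 0.0
            count = 0
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    yy = min(max(y + dy, 0), height - 1)
                    xx = min(max(x + dx, 0), width - 1)
                    total += img[yy][xx]
                    count += 1
            out[y][x] = total / count
    return out


# A single bright pixel is spread over its 3x3 neighborhood.
spike = [[0] * 5 for _ in range(5)]
spike[2][2] = 9
print(mean_filter(spike)[2][2])  # 1.0
```

A Gaussian filter follows the same structure with distance-dependent weights in place of the uniform average.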

[0128] In one possible implementation, the low-frequency transformation is one of performing a Fourier transform on the image and obtaining the first low-frequency coefficients of the transformed image, performing a discrete cosine transform on the image and obtaining the second low-frequency coefficients of the transformed image, and performing a wavelet transform on the image and obtaining the low-frequency subband coefficients of the transformed image, wherein the first operation data is one of the first low-frequency coefficients, the second low-frequency coefficients, and the low-frequency subband coefficients.

[0129] In this embodiment, low-frequency features are obtained through the transform domain, including one or a combination of the following methods: extracting low-frequency coefficients after Fourier transform of the image, extracting low-frequency coefficients after discrete cosine transform of the image, and extracting low-frequency sub-band coefficients after wavelet transform of the image. The low-frequency transform can be one of the following: performing a Fourier transform on the image and obtaining the first low-frequency coefficients of the transformed image, performing a discrete cosine transform on the image and obtaining the second low-frequency coefficients of the transformed image, and performing a wavelet transform on the image and obtaining the low-frequency sub-band coefficients of the transformed image. By performing one of the above low-frequency transforms, first operation data is obtained. That is, when performing a low-frequency transform to obtain the first low-frequency coefficients of the transformed image through Fourier transform, the first low-frequency coefficients are used as the first operation data; when performing a low-frequency transform to obtain the second low-frequency coefficients of the transformed image through discrete cosine transform, the second low-frequency coefficients are used as the first operation data; and when performing a wavelet transform to obtain the low-frequency sub-band coefficients of the transformed image, the low-frequency sub-band coefficients are used as the first operation data. The first operation data is used as the low-frequency feature. By performing multiple low-frequency transformations, that is, combining multiple low-frequency transformations to perform multiple processing on the same image, multiple first operation data are obtained. These first operation data are combined to form low-frequency features. 
The combination can be averaging, weighting, addition, subtraction, multiplication, division, or direct concatenation.
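For the transform-domain route, the sketch below takes an unnormalized two-dimensional DCT-II of a small grayscale block and keeps only the top-left k-by-k coefficients, which correspond to the lowest spatial frequencies. The function names, the lack of normalization, and the direct (non-fast) summation are assumptions made to keep the example dependency-free; a practical implementation would normally use an optimized library routine.

```python
import math


def dct2(block):
    """Unnormalized 2-D DCT-II of a rectangular block (list of rows)."""
    rows, cols = len(block), len(block[0])
    out = [[0.0] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0.0
            for y in range(rows):
                cy = math.cos(math.pi * (2 * y + 1) * u / (2 * rows))
                for x in range(cols):
                    cx = math.cos(math.pi * (2 * x + 1) * v / (2 * cols))
                    total += block[y][x] * cy * cx
            out[u][v] = total
    return out


def low_frequency_coefficients(img, k):
    """Keep the k-by-k lowest-frequency DCT coefficients as first operation data."""
    coefficients = dct2(img)
    return [row[:k] for row in coefficients[:k]]


flat = [[5] * 4 for _ in range(4)]
low = low_frequency_coefficients(flat, 2)
print(round(low[0][0], 6))  # 80.0 (the DC term: sum of all pixels)
```

For a constant image, all energy sits in the DC coefficient and the remaining low-frequency terms are zero up to rounding error.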

[0130] In one possible implementation, the low-frequency transformation is one of obtaining a density distribution map of the edges in the image and obtaining overall direction statistics of the edges in the image, wherein the first operation data is one of the density distribution map and overall direction statistics.

[0131] In this embodiment, the low-frequency feature is a macroscopic representation based on edge information, including one or a combination of the following methods: extracting the density distribution map of edges in an image, and extracting the overall direction statistics of edges in an image. The low-frequency transformation can be one of the following methods: obtaining the density distribution map of edges in an image and obtaining the overall direction statistics of edges in an image. By performing one of the above low-frequency transformations, first operation data is obtained; that is, when performing the low-frequency transformation to obtain the density distribution map of edges in an image, the density distribution map is used as the first operation data; when performing the low-frequency transformation to obtain the overall direction statistics of edges in an image, the overall direction statistics are used as the first operation data, and the first operation data is used as the low-frequency feature. By performing multiple of the above low-frequency transformations, that is, combining multiple low-frequency transformations to perform multiple processing on the same image, multiple corresponding first operation data are obtained. These first operation data are combined to form a low-frequency feature; the combination can be averaging, weighting, addition, subtraction, multiplication, division, or direct concatenation.
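The two edge-based representations can be sketched together. The gradient operator (central differences), the magnitude threshold, the cell size for the density map, and the number of orientation bins are all assumptions for illustration, as is the function name edge_statistics.

```python
import math


def edge_statistics(img, threshold=10.0, cell=2, bins=4):
    """Return (edge density map, edge orientation histogram).

    An interior pixel counts as an edge when its central-difference
    gradient magnitude exceeds the threshold. The density map holds the
    fraction of edge pixels per cell-by-cell block; the histogram counts
    gradient orientations folded into the range [0, pi).
    """
    height, width = len(img), len(img[0])
    cells_y = (height + cell - 1) // cell
    cells_x = (width + cell - 1) // cell
    density = [[0.0] * cells_x for _ in range(cells_y)]
    orientation = [0] * bins
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = img[y][x + 1] - img[y][x - 1]
            gy = img[y + 1][x] - img[y - 1][x]
            if math.hypot(gx, gy) > threshold:
                density[y // cell][x // cell] += 1.0 / (cell * cell)
                angle = math.atan2(gy, gx) % math.pi
                orientation[min(int(angle / math.pi * bins), bins - 1)] += 1
    return density, orientation


# A vertical step edge: dark left half, bright right half.
step = [[0, 0, 0, 100, 100, 100] for _ in range(4)]
density_map, orientation_hist = edge_statistics(step)
print(density_map)
print(orientation_hist)
```

Both outputs summarize where edges concentrate and how they are oriented overall, rather than the exact location of each edge pixel.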

[0132] In this embodiment, the low-frequency transformation is one of extracting a depth feature map of an image through a neural network and extracting an image embedding vector of an image through a neural network, and the first operation data is one of the depth feature map and the image embedding vector.

[0133] In this embodiment, low-frequency features refer to the depth feature map or image embedding vector extracted from an image using a neural network. Low-frequency transformation can be performed in one of the following ways: extracting the depth feature map of an image using a neural network and extracting the image embedding vector of an image using a neural network. By performing one of the above low-frequency transformations, first operation data is obtained; that is, when performing a low-frequency transformation of the depth feature map extracted from the image using a neural network, the depth feature map is used as the first operation data; when performing a low-frequency transformation of the image embedding vector extracted from the image using a neural network, the image embedding vector is used as the first operation data, and the first operation data is used as the low-frequency feature. By performing multiple of the above low-frequency transformations, that is, combining multiple low-frequency transformations to perform multiple processing on the same image, multiple corresponding first operation data are obtained. These first operation data are combined to form low-frequency features. The combination can be averaging, weighting, addition, subtraction, multiplication, division, or direct concatenation. Neural networks include, but are not limited to, ResNet series neural networks, EfficientNet series neural networks, MobileNet series neural networks, or Vision Transformer. ResNet series neural networks are a classic architecture of deep convolutional neural networks (CNNs), which solves the gradient vanishing problem in deep networks through residual modules and skip connections. The EfficientNet series of neural networks are high-efficiency convolutional neural networks designed based on Neural Architecture Search (NAS). 
They balance accuracy and efficiency by jointly scaling network depth, width, and resolution through a compound scaling method. The MobileNet series of neural networks are lightweight convolutional neural networks designed specifically for mobile devices and embedded systems. Vision Transformer (ViT) is an image processing model proposed in 2020 that brings the Transformer architecture from natural language processing into the field of computer vision, moving beyond the limitations of traditional convolutional neural networks (CNNs).

[0134] In this embodiment of the application, a corresponding low-frequency feature is obtained from one of the target image and the second image through image transformation; the low-frequency feature is the transformed image obtained from that image. In this case, performing image registration on the target image and the second image based on the low-frequency feature to obtain a matching region in the target image that matches the second image specifically includes: performing a similarity calculation operation between the transformed image and the other of the target image and the second image, and taking the region with the highest similarity as the matching region in the target image that matches the second image.

[0135] In this embodiment of the application, the low-frequency features corresponding to the target image and the second image are obtained respectively. The image registration of the target image and the second image based on the low-frequency features to obtain the matching region in the target image that matches the second image specifically includes: performing a similarity calculation operation on the low-frequency features of the target image and the second image, and taking the region with the highest similarity as the matching region in the target image that matches the second image.

[0136] In this embodiment of the application, the low-frequency features of the second image and the low-frequency features of the target image are used to calculate the similarity, and the region with the highest similarity is determined as the matching region in the target image that matches the second image.
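Selecting the region with the highest similarity can be sketched as an exhaustive sliding-window search. This example assumes the low-frequency features are small grayscale arrays at the same resolution, uses the sum of squared differences as the similarity measure (so the best match is the minimum), and searches for the smaller array inside the larger one; the function names are hypothetical.

```python
def sum_squared_differences(a, b):
    """Sum of squared differences between two equally sized 2-D arrays."""
    return sum((p - q) ** 2
               for row_a, row_b in zip(a, b)
               for p, q in zip(row_a, row_b))


def find_matching_region(template, image):
    """Slide the template over the image and return (x, y, score).

    (x, y) is the top-left corner of the window with the smallest sum
    of squared differences, i.e. the highest similarity.
    """
    t_height, t_width = len(template), len(template[0])
    i_height, i_width = len(image), len(image[0])
    best = None
    for y in range(i_height - t_height + 1):
        for x in range(i_width - t_width + 1):
            window = [row[x:x + t_width] for row in image[y:y + t_height]]
            score = sum_squared_differences(template, window)
            if best is None or score < best[2]:
                best = (x, y, score)
    return best


wide = [[1, 2, 3, 4, 5],
        [6, 7, 8, 9, 10],
        [11, 12, 13, 14, 15],
        [16, 17, 18, 19, 20]]
patch = [[8, 9],
         [13, 14]]
print(find_matching_region(patch, wide))  # (2, 1, 0)
```

The returned corner plays the role of a pixel position of the matching region; any of the other similarity measures listed in this description could replace the scoring function, with the comparison flipped for measures where larger means more similar.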

[0137] In this embodiment, the similarity calculation operation is one of the following, computed between the low-frequency features of two images: the cross-correlation value, the normalized cross-correlation value, the absolute difference, the sum of squared differences, the mean absolute error, the mean squared error, the Bhattacharyya distance, the chi-square distance, the Earth Mover's Distance (EMD) between the histograms of the low-frequency features, the mutual information, or the structural similarity index; or one of the following, computed between the low-frequency feature vectors of two images: the cosine similarity, the Euclidean distance, the Manhattan distance, or the Mahalanobis distance. Correspondingly, the region with the highest similarity is one of the following: the region with the maximum cross-correlation value, the region with the maximum normalized cross-correlation value, the region with the minimum absolute difference, the region with the minimum sum of squared differences, the region with the minimum mean absolute error, the region with the minimum mean squared error, the region with the minimum Bhattacharyya distance, the region with the minimum chi-square distance, the region with the minimum EMD, the region with the maximum mutual information, the region with the maximum structural similarity index, the region with the maximum cosine similarity, the region with the minimum Euclidean distance, the region with the minimum Manhattan distance, or the region with the minimum Mahalanobis distance.

[0138] In this embodiment, the similarity calculation operations can be grouped as follows. Pixel-wise operations include one of the following methods: calculating the cross-correlation value or normalized cross-correlation value between the low-frequency features of the two images; calculating the sum of absolute differences or sum of squared differences between the low-frequency features of the two images; calculating the mean absolute error or mean squared error between the low-frequency features of the two images. Histogram-based operations include one of the following methods: calculating the Bhattacharyya distance between the low-frequency feature histograms of the two images; calculating the chi-square distance between the low-frequency feature histograms of the two images; calculating the Earth Mover's Distance between the low-frequency feature histograms of the two images. The similarity calculation operation can also calculate the mutual information between the low-frequency features of the two images, or the structural similarity index between the low-frequency features of the two images. Vector-based operations include one of the following methods: calculating the cosine similarity between the low-frequency feature vectors of the two images; calculating the Euclidean distance or Manhattan distance between the low-frequency feature vectors of the two images; calculating the Mahalanobis distance between the low-frequency feature vectors of the two images.
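
As an illustration of the similarity calculation described above, the following is a minimal sketch that slides a smaller low-frequency feature map over a larger one and keeps the region with the maximum normalized cross-correlation. The function names, the use of plain 2-D lists as feature maps, and the choice of which map is slid over which are assumptions made for illustration only; the application does not prescribe them.

```python
import math

def ncc(patch, template):
    """Normalized cross-correlation between two equally sized 2-D lists."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat region: correlation is undefined, treat as no match
    return sum(x * y for x, y in zip(da, db)) / denom

def find_matching_region(search_features, template_features):
    """Return (row, col, score) of the window with the highest NCC."""
    th, tw = len(template_features), len(template_features[0])
    best = (0, 0, -2.0)  # NCC lies in [-1, 1], so -2 is below any real score
    for r in range(len(search_features) - th + 1):
        for c in range(len(search_features[0]) - tw + 1):
            patch = [row[c:c + tw] for row in search_features[r:r + th]]
            score = ncc(patch, template_features)
            if score > best[2]:
                best = (r, c, score)
    return best
```

For a distance-type measure such as the sum of squared differences, the same loop would keep the minimum rather than the maximum.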

[0139] In one possible implementation, obtaining the mapping relationship between the third pixel position and the focus position of the first shooting module in step S5 specifically includes: constructing a first prediction model; moving the first shooting module multiple times, obtaining the first adjustment focus position corresponding to each movement of the first shooting module, and obtaining a third image from the first shooting module; obtaining the pixel coordinates of a second preset pixel in the third image in the second image based on the third image; constructing a dataset based on the pixel coordinates of the second preset pixel in the second image and multiple first adjustment focus positions; fitting the first prediction model with the dataset to obtain a first fitting model; and obtaining the mapping relationship between the third pixel position and the focus position of the first shooting module based on the first fitting model.

[0140] In this embodiment, the first shooting module is moved multiple times to change its first adjustment focus position. The first shooting module can be moved arbitrarily, or sequentially from the beginning of its travel to the end. After each movement of the first shooting module, a third image is acquired, a first mapped target image is obtained from the third image, and the fourth pixel position of the first mapped target image within the third image is obtained. Image registration is used to match the first mapped target image with the second image to obtain a mapping matching region in the first mapped target image that matches the second image. The fifth pixel position of this mapping matching region in the second image is then obtained. Based on the fourth and fifth pixel positions, the sixth pixel position of the third image in the second image is calculated. For example, when the images are rectangular, the fourth pixel position is the pixel coordinate of the top-left corner vertex of the mapping matching region in the third image, and the fifth pixel position is the pixel coordinate of the top-left corner vertex of the mapping matching region in the second image. Thus, the pixel coordinate p_i of the top-left corner vertex of the third image in the second image is obtained under each of the different first adjustment focus positions f_i. The dataset D is constructed from the multiple first adjustment focus positions and the corresponding pixel coordinates of the top-left corner vertex of the third image in the second image. A first prediction model is constructed, and the first prediction model is fitted to the dataset D to obtain a first fitting model. The mapping relationship between the third pixel position and the focus position of the first shooting module is obtained based on the first fitting model.
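
The calculation of the sixth pixel position from the fourth and fifth pixel positions can be sketched as below. This sketch assumes that the third image and the second image differ only by a translation and a uniform scale (plausible when the optical axes are parallel, but not stated in this paragraph); the `scale` parameter and the function name are hypothetical.

```python
def origin_in_second_image(p4, p5, scale=1.0):
    """Top-left corner of the third image, expressed in second-image pixels.

    p4: (x, y) of the region's top-left corner inside the third image.
    p5: (x, y) of the same region's top-left corner inside the second image.
    scale: second-image pixels per third-image pixel (assumed to be known).
    """
    # Walk back from the matched corner by the scaled in-image offset.
    return (p5[0] - scale * p4[0], p5[1] - scale * p4[1])
```

For example, with p4 = (100, 40), p5 = (350, 220) and a scale of 0.25, the top-left corner of the third image falls at (325.0, 210.0) in the second image.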

[0141] In this embodiment, the second preset pixel can be the top-left corner vertex of the third image, the center point of the third image, or the centroid of the third image. The first prediction model can include a prediction model formed based on one or a combination of the following methods: linear fitting, polynomial fitting, exponential fitting, logarithmic fitting, power function fitting, logistic regression fitting, Gaussian fitting, spline fitting, local regression, and nonlinear fitting methods based on machine learning. The first prediction model can be a prediction model based on a linear fitting method, which fits a straight line f_i = k·p_i + b to the dataset D, where k represents the slope of the line and b represents the intercept of the line. The fitting method can be at least one of least squares, total least squares, Hough transform, RANSAC, and M-estimators. The first prediction model can also be a prediction model based on a multi-segment linear fitting method, where the number of segments n is selected manually and satisfies n ≥ 2, or where the number of segments n is calculated automatically using at least one of dynamic programming-based methods, greedy search-based methods, and RANSAC-based fitting methods.
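
A minimal sketch of the linear case, fitting f_i = k·p_i + b by ordinary least squares, is given below. It assumes that p_i is a single scalar coordinate (for example, the component along the baseline between the two shooting modules); the function names and the sample values are illustrative and not taken from the application.

```python
def fit_line(pixels, focus_positions):
    """Ordinary least-squares fit of focus = k * pixel + b."""
    n = len(pixels)
    if n < 2 or n != len(focus_positions):
        raise ValueError("need at least two (pixel, focus) pairs")
    mean_p = sum(pixels) / n
    mean_f = sum(focus_positions) / n
    sxx = sum((p - mean_p) ** 2 for p in pixels)
    if sxx == 0:
        raise ValueError("pixel coordinates must not all be identical")
    sxy = sum((p - mean_p) * (f - mean_f)
              for p, f in zip(pixels, focus_positions))
    k = sxy / sxx
    b = mean_f - k * mean_p
    return k, b

def predict_focus(k, b, pixel):
    """Apply the fitted mapping to a new pixel coordinate."""
    return k * pixel + b

# Hypothetical calibration data D: pixel coordinate -> focus motor position.
k, b = fit_line([100, 120, 140, 160], [300, 340, 380, 420])
```

A robust estimator such as RANSAC would replace `fit_line` when the calibration data contain registration outliers.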

[0142] In one possible implementation, step S5, calculating the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position, specifically includes: obtaining a first adjustment focus position based on the mapping relationship and the third pixel position; obtaining a fourth image from the first shooting module when the first shooting module moves to the first adjustment focus position, and obtaining a first target adjustment image from the fourth image; determining whether the sharpness of the first target adjustment image meets the first preset focus condition, stopping execution if it does, and continuing execution if it does not; constructing a second prediction model, predicting the first predicted focus position of the first shooting module based on the first target adjustment image using the second prediction model; and using the first predicted focus position as the first preset focus position.

[0143] In this embodiment, the first target adjustment image can be the fourth image or a partial image of the fourth image. The first target adjustment image can also be an image set composed of multiple target images obtained by the first shooting module while the focusing motor is moved continuously; continuously moving the focusing motor means moving it either in fixed steps or in non-fixed steps.

[0144] In this embodiment, the first preset focusing condition can be: the sharpness exceeds a first preset value, the blur level is lower than a second preset value, the sharpness score exceeds a third preset value, or the sharpness is classified as sharp. Determining whether the sharpness of the first target adjustment image meets the first preset focusing condition specifically includes: obtaining the sharpness of the first target adjustment image; if the first preset focusing condition is that the sharpness exceeds the first preset value, the sharpness of the first target adjustment image is compared directly with the first preset value; if the first preset focusing condition is that the blur level is lower than the second preset value, the blur level of the first target adjustment image is obtained from its sharpness and compared with the second preset value; if the first preset focusing condition is that the sharpness score exceeds the third preset value, the sharpness score of the first target adjustment image is obtained from its sharpness and compared with the third preset value; if the first preset focusing condition is that the sharpness is classified as sharp, the sharpness classification of the first target adjustment image is obtained from its sharpness and compared with the "sharp" class. Methods for obtaining image sharpness values include, but are not limited to, traditional algorithms and neural network algorithms. Traditional algorithms may include one of the following: the Laplacian Energy Method, the Tenengrad gradient function, the image variance method, the FFT frequency domain method, the DCT frequency domain method, and the NIQE algorithm.
The Laplacian Energy Method is an algorithm based on the second derivative of the image; it measures image sharpness by calculating the response value of the Laplacian operator. The Tenengrad gradient method uses the Sobel operator to extract the gradients of the image in the horizontal and vertical directions and evaluates image sharpness from their magnitude. The image variance method evaluates image sharpness by calculating the variance of the pixel values. The FFT frequency domain method determines the degree of blur by analyzing the spectral characteristics of the image. The DCT frequency domain method evaluates sharpness from the high-frequency DCT coefficients of the image. The NIQE (Natural Image Quality Evaluator) algorithm is a no-reference image quality evaluation algorithm designed to assess the naturalness of an image, i.e., whether the image looks like a natural scene. The neural network algorithm includes at least one deep learning prediction algorithm based on the MobileNet series, Transformer series, GhostNet series, or ShuffleNet series. The first predicted focus position of the first shooting module is obtained by feeding the first target adjustment image to the second prediction model; that is, the input image is passed to a trained algorithm model, which outputs the first predicted focus position. The second prediction model can be a model based on a trained neural network algorithm, including one of the deep learning prediction algorithms based on the MobileNet series, Transformer series, GhostNet series, and ShuffleNet series. Alternatively, the algorithm model can be a model based on feature extraction and model prediction methods.
Feature extraction includes at least one of HOG features, color histograms, color moments, LBP, Haralick features, Gabor features, SIFT features, SURF features, ORB features, and Haar-like features. Model prediction methods include at least one of SVM methods, random forests, AdaBoost, and K-nearest neighbor methods. SVM (Support Vector Machine) is a supervised learning algorithm used for classification and regression. Its core idea is to find an optimal hyperplane that maximizes the margin between different categories. Random forest is an ensemble learning method that constructs multiple decision trees, each trained on randomly selected data and features, and improves prediction accuracy by voting on or averaging their results. AdaBoost (Adaptive Boosting) is an ensemble learning algorithm that iteratively adjusts sample weights, focusing on misclassified samples and gradually combining multiple weak classifiers into a strong classifier. K-Nearest Neighbors (KNN) is a distance-based nonparametric classification algorithm that finds the K nearest neighbors of a new sample in the training set and selects the majority class among them as the prediction result.
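
To make the "sharpness exceeds a first preset value" condition concrete, the following is a minimal sketch of one traditional measure, the Laplacian energy, computed on a grayscale image stored as a 2-D list. The 4-neighbour kernel, the averaging over interior pixels, and the function names are assumptions chosen for brevity; the threshold would have to be calibrated for a real device.

```python
def laplacian_energy(img):
    """Mean squared 4-neighbour Laplacian response over interior pixels."""
    h, w = len(img), len(img[0])
    total = 0.0
    count = 0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            lap = (img[y - 1][x] + img[y + 1][x]
                   + img[y][x - 1] + img[y][x + 1]
                   - 4 * img[y][x])
            total += lap * lap
            count += 1
    return total / count if count else 0.0

def meets_focus_condition(img, first_preset_value):
    """True when the sharpness value exceeds the preset threshold."""
    return laplacian_energy(img) > first_preset_value
```

A sharply focused image has strong local intensity changes and therefore a large Laplacian energy, whereas a defocused image of the same scene yields a smaller value.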

[0145] In this embodiment, the image registration algorithm can be a template matching algorithm or a feature point matching algorithm. The template matching algorithm can use normalized cross-correlation or normalized squared difference as the matching similarity criterion. The feature point matching algorithm can include one of the following: a feature point matching algorithm based on SIFT feature points, a feature point matching algorithm based on SURF feature points, or a feature point matching algorithm based on ORB feature points.

[0146] In the embodiments of this application, the image registration algorithm may include at least one of the following: a deep learning registration algorithm based on Siamese networks, a deep learning registration algorithm based on MatchNet, and a deep learning registration algorithm based on SuperPoint and SuperGlue.

[0147] In one possible implementation, obtaining the first pixel position of the target image in the first image in step S1 specifically includes: obtaining the first pixel position by detecting it through a target detection algorithm or receiving the first pixel position sent by the input module.

[0148] In this embodiment of the application, step S1 of obtaining the target image from the first image specifically includes: obtaining the target image from the first image based on the user input location, obtaining the target image from the first image based on a target detection algorithm, or obtaining the target image from the first image based on a subject detection algorithm.

[0149] In this embodiment, user input refers to directly obtaining a fixed-size region based on the user's input location. User input also refers to the user manually selecting a region of a specified size and location. A target image is obtained from the first image based on a target detection algorithm. This target detection algorithm can be a single-stage or two-stage deep learning target detection algorithm, or it can be a traditional target detection algorithm based on HOG and SVM. A target image is obtained from the first image based on a subject detection algorithm. This subject detection algorithm can include at least one of the following: a traditional saliency detection method based on GBVS, a traditional saliency detection method based on Itti-Koch, or a traditional saliency detection method based on HS. It can also include at least one of the following: a deep learning saliency detection algorithm based on the CNN series or a deep learning saliency detection algorithm based on the Transformer series.

[0150] In one possible implementation, the target detection algorithm is one of the following: a YOLO-based deep learning target detection algorithm, an R-CNN-based deep learning target detection algorithm, a Transformer-based deep learning target detection algorithm, a traditional target detection algorithm based on a Haar cascade classifier, or a traditional target detection algorithm based on DPM.

[0151] In one possible implementation, the target image is an irregularly shaped image, and the first pixel position of the target image in the first image is the pixel coordinates of the upper left vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the first image. The second pixel position of the matching region in the second image is the pixel coordinates of the upper left vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the second image.

[0152] In one possible implementation, the target image is a rectangular image, the first pixel position of the target image in the first image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the first image, and the second pixel position of the matching region in the second image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the second image.

[0153] In this embodiment, the first pixel position can be the pixel coordinates of the top-left corner vertex of the matching region in the first image, and the third pixel position can be the pixel coordinates of the top-left corner vertex of the first image in the second image. The mapping relationship is the relationship between the third pixel position and the focus position of the first shooting module, or the relationship between the focus position of the first shooting module and the pixel coordinates of the top-left corner vertex of the first image in the second image. Alternatively, the first pixel position can be the pixel coordinates of the center point of the matching region in the first image, and the third pixel position can be the pixel coordinates of the center point of the first image in the second image. The mapping relationship is the relationship between the third pixel position and the focus position of the first shooting module, or the relationship between the focus position of the first shooting module and the pixel coordinates of the center point of the first image in the second image. Furthermore, the first pixel position can be the pixel coordinates of the centroid of the matching region in the first image, and the third pixel position can be the pixel coordinates of the centroid of the first image in the second image. The mapping relationship is the relationship between the third pixel position and the focus position of the first shooting module, or the relationship between the focus position of the first shooting module and the pixel coordinates of the centroid of the first image in the second image.

[0154] In one possible implementation, step S6 specifically includes:

[0155] S61: Obtain the current focus position of the first shooting module;

[0156] S62: Calculate the number of steps and direction of movement required for the first shooting module to move from the current focus position to the first preset focus position based on the first preset focus position and the current focus position;

[0157] S63: Adjust the focus position of the first shooting module according to the number of steps and the direction of movement to match the first preset focus position corresponding to the target image scene.

[0158] In this embodiment, the number of steps and direction of movement required for the first shooting module to move from the current focus position to the first preset focus position corresponding to the target image scene are calculated based on the current focus position and the first preset focus position. The focus position of the first shooting module is then adjusted according to the obtained number of steps and direction of movement so that it is consistent with the first preset focus position corresponding to the target image scene, thereby completing the focusing of the first shooting module and realizing the first adjustment of the first shooting module.
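
Steps S61 to S63 amount to a signed difference between two motor positions. A minimal sketch follows, assuming that focus positions are expressed as integer motor steps and that +1 and -1 denote the two travel directions; both conventions and the function name are illustrative.

```python
def plan_move(current_position, preset_position):
    """Return (number_of_steps, direction) for a focus move.

    direction is +1 (towards larger positions), -1 (towards smaller
    positions), or 0 when the module is already at the preset position.
    """
    delta = preset_position - current_position
    direction = (delta > 0) - (delta < 0)
    return abs(delta), direction
```

For instance, moving from position 120 to position 455 requires 335 steps in the +1 direction.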

[0159] In one possible implementation, step S63 is followed by:

[0160] S64: Calculate and obtain multiple image contrast or image sharpness information at multiple positions adjacent to the first preset focus position;

[0161] S65: Obtain the maximum value of image contrast or image sharpness information from multiple image contrast or image sharpness information, and obtain the focus position corresponding to the maximum value of the image contrast or image sharpness information;

[0162] S66: Move the focus position of the first shooting module to the focus position corresponding to the maximum value.

[0163] In this embodiment, after adjusting the focus position of the first shooting module to match the first preset focus position corresponding to the target image scene based on the acquired number of movement steps and movement direction, the method further includes: acquiring image contrast or image sharpness information at multiple positions adjacent to the focus position corresponding to the target image scene, and moving the focus position of the first shooting module to the position at which the image contrast or image sharpness is at its maximum, thereby achieving fine-tuning of the first shooting module. In this application, moving the first shooting module is not limited to physical displacement and should not be understood only as the mechanical displacement of optical elements or imaging sensors in physical space. It should be broadly understood as covering all adjustment methods that change the focus state of the imaging system and cause the focus position to change relative to the imaging medium, including but not limited to at least one of the following: physical displacement adjustment methods, adjustment methods based on a change in optical power, adjustment methods based on optical path adjustment, or imaging algorithm adjustment methods.
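
The fine-tuning of steps S64 to S66 can be sketched as a small local search around the first preset focus position. In this sketch, `measure_sharpness` is a hypothetical callable that moves the module to a position and returns the sharpness (or contrast) measured there; the search radius, step, and travel limits are assumed values.

```python
def fine_tune(preset_position, measure_sharpness,
              radius=3, step=1, lo=0, hi=1000):
    """Return the neighbouring position with the highest sharpness."""
    candidates = [preset_position + i * step
                  for i in range(-radius, radius + 1)]
    # Discard positions outside the motor's travel range.
    candidates = [p for p in candidates if lo <= p <= hi]
    return max(candidates, key=measure_sharpness)
```

The preset position itself is among the candidates, so the result is never worse than the position reached in step S63, provided the sharpness measurements are repeatable.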

[0164] In this embodiment, when the first shooting module moves to the first preset focus position, a fifth image is acquired from the first shooting module, and a second target adjustment image is acquired from the fifth image; it is determined whether the sharpness of the second target adjustment image meets the second preset focus condition. If it does, execution stops; otherwise, execution continues. The first shooting module is moved multiple times, and multiple sixth images are acquired corresponding to each completed movement of the first shooting module. Based on the sharpness of the multiple sixth images, a second adjustment focus position corresponding to the first shooting module is calculated. The focus position of the first shooting module is moved to the second adjustment focus position. The second target adjustment image can be the fifth image or a partial image of the fifth image.

[0165] In this embodiment, the second preset focusing condition can be: the sharpness exceeds a fourth preset value, the blur level is below a fifth preset value, the sharpness score exceeds a sixth preset value, or the sharpness is classified as sharp. Determining whether the sharpness of the second target adjustment image meets the second preset focusing condition specifically includes: acquiring the sharpness of the second target adjustment image; if the second preset focusing condition is that the sharpness exceeds the fourth preset value, the sharpness of the second target adjustment image is compared directly with the fourth preset value; if the second preset focusing condition is that the blur level is below the fifth preset value, the blur level of the second target adjustment image is obtained from its sharpness and compared with the fifth preset value; if the second preset focusing condition is that the sharpness score exceeds the sixth preset value, the sharpness score of the second target adjustment image is obtained from its sharpness and compared with the sixth preset value; if the second preset focusing condition is that the sharpness is classified as sharp, the sharpness classification of the second target adjustment image is obtained from its sharpness and compared with the "sharp" class.

[0166] In this embodiment, the method for determining whether the sharpness of the second target adjustment image meets the second preset focus condition can be the same as the method for determining whether the sharpness of the first target adjustment image meets the first preset focus condition, or the two methods can be different. When the two methods differ, the method used for the first preset focus condition has a faster calculation speed than the method used for the second preset focus condition. Likewise, the first preset focus condition and the second preset focus condition can be the same or different. When the two conditions differ, the first preset focus condition has a higher requirement for sharpness than the second preset focus condition.

[0167] In this embodiment, moving the first shooting module multiple times specifically includes: moving the first shooting module multiple times according to a movement rule. The movement rule refers to moving the first shooting module according to a fixed step size S, where 1 ≤ S ≤ 0.1 × S_all and S_all represents the total number of steps in the travel range of the focusing motor. Alternatively, the movement rule refers to moving the first shooting module according to an adaptive step size, the selection of which can follow traditional or improved hill-climbing methods.

[0168] In this embodiment, calculating the second adjustment focus position corresponding to the first shooting module based on the sharpness of the multiple sixth images specifically includes: calculating the sharpness of each of the multiple sixth images; selecting the sixth image with the highest sharpness; obtaining the focus position of the first shooting module at which that sixth image was captured; and using this position as the second adjustment focus position. The sharpness of the multiple sixth images can be calculated using a trained neural network algorithm, including at least one of the deep learning prediction algorithms based on the MobileNet series, Transformer series, GhostNet series, and ShuffleNet series. Alternatively, the sharpness of the multiple sixth images can be calculated using a traditional algorithm, which can be one of the Laplacian Energy Method, the Tenengrad gradient function, the image variance method, the FFT frequency domain method, the DCT frequency domain method, or the NIQE algorithm. The Laplacian Energy Method is an algorithm based on the second derivative of an image; it measures image sharpness by calculating the response value of the Laplacian operator. The Tenengrad gradient method uses the Sobel operator to extract the gradients of an image in the horizontal and vertical directions and evaluates image sharpness from their magnitude. The image variance method assesses image sharpness by calculating the variance of the pixel values. The FFT frequency domain method determines the degree of blur by analyzing the spectral characteristics of an image. The DCT frequency domain method evaluates sharpness from the high-frequency DCT coefficients of the image. The NIQE (Natural Image Quality Evaluator) algorithm is a no-reference image quality assessment algorithm designed to evaluate the naturalness of an image, i.e., whether the image looks like a natural scene.
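
The fixed-step movement rule and the selection of the sharpest sixth image can be sketched together as follows. This sketch assumes that positions are integer motor steps and that each sixth image has already been reduced to a (focus position, sharpness) pair; the function names and numbers are illustrative only.

```python
def sweep_positions(start, stop, step, total_steps):
    """Focus positions visited with a fixed step S, 1 <= S <= 0.1 * S_all."""
    if not (1 <= step <= 0.1 * total_steps):
        raise ValueError("step must satisfy 1 <= S <= 0.1 * S_all")
    return list(range(start, stop + 1, step))

def second_adjustment_position(samples):
    """Pick the focus position whose image has the highest sharpness.

    samples: list of (focus_position, sharpness) pairs, one per sixth image.
    """
    position, _ = max(samples, key=lambda s: s[1])
    return position
```

With a motor of 1000 total steps, a step of 10 is allowed while a step of 200 is rejected, because 200 exceeds 0.1 × 1000.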

[0169] The method of this application may also include seven embodiments:

[0170] First embodiment

[0171] This invention proposes a focusing method, as shown in Figure 9. Figure 9 is a flowchart illustrating the implementation of focusing based on image registration and pre-calibration mapping relationships provided in this embodiment, including the following steps:

[0172] S700: Acquire a first image from the first camera as target image one;

[0173] S701: Acquire a second image from a second camera located at a distance from the first camera, the second image covering the field of view of the first image; the optical axes of the first camera and the second camera are parallel;

[0174] S702: Use an image registration method to match the first image and the second image, and obtain the pixel coordinates of the top-left corner vertex of the first image in the second image;

[0175] In this embodiment, a trained neural network is used to perform image registration on the first image and the second image; the neural network is a convolutional neural network, which requires two images as input, namely the first image scaled to 128x128 and the second image scaled to 256x256, and the output of the neural network is the pixel coordinates of the top left corner vertex of the first image on the second image;

[0176] S703: Calculate the focus position of the first camera (focus position one) by using the mapping relationship, stored in the memory, between the pixel coordinates of the top-left corner vertex of the first image in the second image and the focus position of the first camera;

[0177] S704: Adjust the first camera to focus position one, and acquire the third image from the first camera as target image two;

[0178] S705: Determine whether target image two is clear;

[0179] In this embodiment, determining in S705 whether target image two is clear means determining whether target image two meets predetermined condition one. A trained neural network is used to determine whether target image two is clear; the neural network is a convolutional neural network, its input is a local image of size 512x512 at the center position of the first image, and its output is 0 or 1, where 0 indicates that the image is blurry and 1 indicates that the image is clear;

[0180] S720: If target image two is clear, stop focusing;

[0181] S706: If target image two is not clear, predict focus position two of the first camera based on target image two;

[0182] In this embodiment, a trained neural network is used to predict focus position two of the first camera; the neural network is a convolutional neural network, its input is a local image of size 512x512 at the center position of the first image, and its output is a number representing the focus position of the first camera;

[0183] Referring to Figure 10, which is a flowchart of the image prediction-based focusing implementation provided in this embodiment, and is an extension of Figure 9, the following steps are included:

[0184] S707: Based on focus position two of the first camera predicted in S706, move the first camera to focus position two and acquire the fourth image as target image three;

[0185] S708: Determine whether target image three is clear;

[0186] In this embodiment, the method for determining whether target image three is clear is the same as that used in step S705;

[0187] S720: If target image three is clear, stop focusing;

[0188] S709: If target image three is not clear, move the focus position of the first camera sequentially, and acquire the corresponding image and its clarity after each move;

[0189] In this embodiment, obtaining the image sharpness refers to obtaining the image sharpness value using a trained neural network; the neural network is a convolutional neural network, whose input is a local image with a resolution of 512 x 512 at the center position of the first image, and the output of the neural network is a number representing the sharpness value of the input image;

[0190] In this embodiment, the strategy of moving sequentially is to first move in a certain direction with a fixed step size. If the clarity of the acquired image continues to decrease after multiple consecutive moves, then move in the opposite direction until the focus position with the highest clarity is reached.

[0191] Referring to Figure 11, which is a flowchart of the implementation of focusing by sequentially moving the focusing position according to this embodiment, and is an extension of Figure 10, the following steps are included:

[0192] S710. Save the image clarity and corresponding focus position after each movement of the first camera to the memory; that is, calculate the image clarity after each movement of the first camera and save it to the memory.

[0193] S711. Find focus position three, the focus position with the highest clarity, in the memory; that is, find the image with the highest clarity recorded in the memory and obtain focus position three, the focus position of the first camera at which that image was captured;

[0194] S712, Move the first camera to focus position three;

[0195] S720, focusing complete.
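Steps S709 to S712 can be sketched as a fixed-step search that records clarity per focus position and then returns to the best one. This is a minimal sketch under stated assumptions: `measure_clarity` is a hypothetical callable standing in for "move the first camera, capture an image, and score it with the trained network", and `patience` (the number of consecutive decreases that triggers a stop or reversal) is an illustrative parameter that the text leaves unspecified.

```python
def sequential_focus_search(measure_clarity, start, step, lo, hi, patience=3):
    """Return (best_position, memory) after a fixed-step search around `start`."""
    memory = {start: measure_clarity(start)}  # S710: focus position -> clarity
    for direction in (+1, -1):
        position, previous, drops = start, memory[start], 0
        while drops < patience:
            position += direction * step
            if not lo <= position <= hi:
                break  # reached the end of the focus travel
            clarity = measure_clarity(position)
            memory[position] = clarity
            drops = drops + 1 if clarity < previous else 0
            previous = clarity
        if max(memory, key=memory.get) != start:
            break  # a sharper position was found; the other direction is not needed
    best = max(memory, key=memory.get)  # S711: position with the highest clarity
    return best, memory


def clarity(position):
    # Hypothetical clarity curve peaking at focus position 80.
    return -(position - 80) ** 2


best, memory = sequential_focus_search(clarity, start=100, step=5, lo=0, hi=200)
print(best)  # 80
```

In this run the first direction only lowers the clarity, so the search reverses, passes the peak, and stops after three consecutive decreases. Moving the camera to `best` corresponds to S712.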

[0196] Second embodiment

[0197] In this embodiment, steps S700 to S706 are consistent with those in the first embodiment. The difference is that after predicting the focus position two of the first camera based on the target image two in step S706, this embodiment directly moves the first camera to the focus position two. At this time, it is not necessary to continue to acquire the fourth image from the first camera. After the movement is completed, the focusing ends directly.

[0198] Third embodiment

[0199] In this embodiment, steps S700 to S705 are consistent with those in the first embodiment. The difference is that after step S705 is executed, this embodiment proceeds directly to steps S709 to S720. In other words, step S705 takes the place of step S708, steps S706 to S707 are skipped, and target image two is treated as target image three.

[0200] Fourth embodiment

[0201] This invention proposes a focusing method, comprising the following steps:

[0202] S110: Acquire a first image from the first camera and use it as the target image;

[0203] S120: Acquire a second image from a second camera located at a distance from the first camera, wherein the field of view of the second image covers the field of view of the first image;

[0204] The optical axes of the first and second cameras are parallel.

[0205] S130: Use image registration to match the first image and the second image, and obtain the pixel coordinates of the top left corner vertex of the first image in the second image;

[0206] The image registration algorithm is an image registration algorithm based on SIFT feature points;

[0207] S140: Calculate the focal position corresponding to the scene in the first image by using the mapping relationship between the focal position of the first camera stored in the memory and the pixel coordinate position of the top left corner vertex of the first image in the second image;

[0208] S150: Calculate the number of steps and direction of movement required for the first camera to move from the current focus position to the focus position corresponding to the first image scene based on the focus position corresponding to the first image scene obtained in step S140;

[0209] S160: Adjust the focus position of the first camera according to the acquired number of steps and direction of movement so that it matches the focus position corresponding to the first image scene and complete the focusing.
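Steps S140 to S150 can be illustrated as a lookup followed by a signed difference. The text does not say how the stored mapping relationship is represented, so the sketch below assumes a hypothetical calibration table of (pixel x-coordinate, focus position) pairs with linear interpolation between entries. The table values and the function names are illustrative only.

```python
from bisect import bisect_left


def focus_from_pixel_x(x, table):
    """Interpolate a focus position from sorted (pixel_x, focus_position) pairs."""
    xs = [px for px, _ in table]
    if x <= xs[0]:
        return table[0][1]
    if x >= xs[-1]:
        return table[-1][1]
    i = bisect_left(xs, x)
    (x0, f0), (x1, f1) = table[i - 1], table[i]
    return f0 + (f1 - f0) * (x - x0) / (x1 - x0)


def steps_and_direction(current, target):
    """Return (number of steps, direction), where direction is +1, -1 or 0."""
    delta = round(target - current)
    direction = (delta > 0) - (delta < 0)
    return abs(delta), direction


# Hypothetical mapping stored in memory: pixel x-coordinate -> focus position.
mapping = [(400, 100), (420, 300), (440, 700)]
target = focus_from_pixel_x(430, mapping)
print(target)                            # 500.0
print(steps_and_direction(350, target))  # (150, 1)
```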

[0210] Fifth embodiment

[0211] This invention proposes a focusing method, as shown in Figures 4-6. This embodiment includes the following steps:

[0212] S210: Obtain a partial image of a rectangular shape outline from the first image 211 from the first camera as the target image 213;

[0213] The image coordinates of the upper left corner vertex 201 of the target image 213 in the first image 211 can be the first pixel coordinates, i.e. (x1, y1).

[0214] S220: Acquire a second image 215 from a second camera located at a distance from the first camera, wherein the field of view of the second image 215 covers the field of view of the first image 211;

[0215] The optical axes of the first and second cameras are parallel.

[0216] S230: Apply an image transformation matrix to the target image 213 so that the transformed target image 223 and the overlapping field of view 219 of the target image 213 and the second image 215 have the same image resolution;

[0217] The image transformation matrix is a linear scaling matrix A, satisfying: $A = \begin{pmatrix} r_1 & 0 \\ 0 & r_2 \end{pmatrix}$, where r1 represents the linear scaling ratio of the first image 211 or the second image 215 in the x direction, and r2 represents the linear scaling ratio of the first image 211 or the second image 215 in the y direction;

[0218] The same image resolution means that the transformed target image 223 and the overlapping field of view 219 of the target image 213 and the second image 215 have the same image dimensions, and that they correspond to the same object-side field of view.

[0219] S240: Use image registration to match the target image 223 after linear scaling transformation with the second image 215, and obtain the pixel position of the upper left corner vertex 205 of the overlapping field of view region 219 in the second image 215. The pixel position can be the second pixel coordinate, i.e. (x2, y2).

[0220] The image registration algorithm can be a template matching algorithm that uses the normalized correlation coefficient as the matching similarity criterion.

[0221] S250: Perform image transformation on the first image 211 using a linear scaling matrix A, and obtain the pixel coordinates of the top left vertex 207 of the transformed target image 223 in the transformed first image 221 after linear scaling. These pixel coordinates can be the third pixel coordinates, i.e. (x1', y1'), and the third pixel coordinates (x1', y1') satisfy: x1'=r1*x1, y1'=r2*y1;

[0222] S260: Using the pixel coordinates of the top left vertex 207 of the transformed target image 223 in the transformed first image 221 and the pixel coordinates of the top left vertex 205 of the overlapping field of view region 219 in the second image 215, calculate the pixel coordinates of the top left vertex 203 of the first image 217 in the second image 215. This pixel coordinate can be the fourth pixel coordinate, i.e. (x3, y3). The fourth pixel coordinate (x3, y3) satisfies: x3=x2-x1'=x2-r1*x1, y3=y2-y1'=y2-r2*y1;

[0223] S270: Calculate the focus position corresponding to the target image scene by using the mapping relationship between the focus position of the first camera stored in the memory and the pixel coordinates of the upper left corner vertex 203 of the first image 217 in the second image 215;

[0224] S280: Adjust the focus position of the first camera according to the focus position corresponding to the scene in the acquired target image;

[0225] The method for adjusting the focus position of the first camera includes two steps: a. Calculating the number of steps and direction of movement required to adjust the focus position of the first camera based on the difference between the current focus position of the first camera and the focus position of the scene contained in the target image; b. Adjusting the focus position of the first camera according to the number of steps and the direction of movement so that it matches the focus position corresponding to the scene in the target image and completing the focusing.
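The coordinate arithmetic of steps S250 and S260 can be checked with a short worked example. The function name and the numeric values below are hypothetical; only the relations x1' = r1*x1, y1' = r2*y1, x3 = x2 - x1' and y3 = y2 - y1' come from the text.

```python
def first_image_origin_in_second(x1, y1, x2, y2, r1, r2):
    """Locate the top-left vertex of the first image inside the second image."""
    # S250: scale the target image's top-left vertex with the scaling ratios.
    x1_scaled, y1_scaled = r1 * x1, r2 * y1
    # S260: subtract the scaled offset from the registered position (x2, y2).
    return x2 - x1_scaled, y2 - y1_scaled


# Hypothetical values: target at (704, 284) in the first image, scaling ratio 0.25
# in both directions, registration result (900, 500) in the second image.
print(first_image_origin_in_second(704, 284, 900, 500, 0.25, 0.25))  # (724.0, 429.0)
```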

[0226] Sixth embodiment

[0227] In this embodiment, the first few steps are the same as S210~S270 in the preceding embodiment. The specific method for adjusting the focus position of the first camera according to the focus position corresponding to the acquired target image scene in step S280 includes the following steps: a. Calculate the number of movement steps and the movement direction required to adjust the focus position of the first camera based on the difference between the current focus position of the first camera and the focus position corresponding to the target image scene; b. Adjust the focus position of the first camera, according to the number of movement steps and the movement direction, so that it is consistent with the focus position corresponding to the target image scene obtained in step S270; c. Continue to acquire the image contrast at multiple positions adjacent to the focus position corresponding to the target image scene, and adjust the focus position of the first camera to the position with the maximum image contrast among those positions.

[0228] Step c specifically involves continuing to move the first camera in the movement direction determined in step a, and acquiring the target image contrast during the movement. Figure 7 shows a schematic diagram of the relationship between the focus position of the first shooting module and the target image contrast. If the direction of movement of the first camera determined in step a is the positive x-axis direction, there are two possibilities:

[0229] In scenario one, the focus position of the target image obtained in steps S210~S270 is to the left of the actual focus position of the target image, as shown at 301 in Figure 7. The actual focus position of the target image is 303, the obtained focus position is 301, and the movement direction is to the right. If the first shooting module continues to be adjusted in this direction, the image contrast value first increases and then decreases. When this rise followed by a fall is detected, the first shooting module stops moving in that direction, and the focus position of the first shooting module corresponding to the maximum image contrast value 303 during the movement is calculated. The focus position of the first shooting module is then adjusted to the focus position corresponding to 303 to complete the focusing.

[0230] In scenario two, the focus position of the target image obtained in steps S210~S270 is to the right of the actual focus position of the target image, as shown at 305 in Figure 7. The actual focus position of the target image is 303, and the obtained focus position is 305. If the first shooting module continues to be adjusted in the movement direction, the image contrast value decreases continuously during the movement. When a continuous decrease is detected, the movement direction of the first shooting module is reversed and the module continues moving while the image contrast is recorded. When the image contrast value is detected to have first increased and then decreased, the first shooting module stops moving in the current direction, and the focus position corresponding to the maximum image contrast value 303 observed after the reversal is calculated. The focus position of the first shooting module is then adjusted to the focus position corresponding to 303 to complete focusing.

In this embodiment, image contrast refers to the Sobel response of the image. Specifically, if the input image is denoted as... The image contrast C satisfies:

In this embodiment, the focus position of the target image obtained in steps S210 to S270 is already close to the actual focus position of the target image. Therefore, it is only necessary to move the first shooting module a few steps near the focus position corresponding to the target image scene in step S270 to determine the actual focus position corresponding to the target image scene, thereby achieving fast focusing.
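A Sobel-based contrast measure of the kind referred to above can be sketched as follows. The exact expression for C is not reproduced in this text, so the sketch assumes one common choice, the mean squared Sobel gradient magnitude over interior pixels (a Tenengrad-style measure); it should not be read as the formula of this embodiment. The function name and test images are illustrative.

```python
def sobel_contrast(image):
    """Mean squared Sobel gradient magnitude of a grayscale image (list of rows)."""
    height, width = len(image), len(image[0])
    total = 0.0
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            # Horizontal and vertical 3x3 Sobel responses at pixel (x, y).
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            total += gx * gx + gy * gy
    return total / ((height - 2) * (width - 2))


# A sharp vertical edge scores higher than the same edge spread into a ramp.
sharp = [[0, 0, 0, 255, 255, 255] for _ in range(6)]
blurred = [[0, 51, 102, 153, 204, 255] for _ in range(6)]
print(sobel_contrast(sharp) > sobel_contrast(blurred))  # True
```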

[0231] The imaging apparatus provided in the embodiments of this application will now be described in detail with reference to Figure 2. It should be noted that the imaging apparatus shown in Figure 2 is used to perform the method of the embodiment shown in Figure 1 of this application. For ease of explanation, only the parts related to the embodiments of this application are shown. For specific technical details not disclosed, please refer to the embodiments shown in Figure 1 of this application.

[0232] Please refer to Figures 2 and 4-8. Figure 2 is a structural schematic diagram of an imaging device provided in an embodiment of this application. Figure 4 is a schematic diagram of the position of the first image and the target image in the first image provided in an embodiment of this application. Figure 5 is a schematic diagram of the position of the second image, the first image, and the target image provided in an embodiment of this application. Figure 6 is a schematic diagram of the position of the first image after linear scaling and the target image in the first image after linear scaling provided in an embodiment of this application. Figure 7 is a schematic diagram of the relationship curve between the focus position of the first shooting module and the contrast of the target image provided in an embodiment of this application. Figure 8 is a schematic diagram of the interface of the touch display unit in an imaging device provided in an embodiment of this application. As shown in Figures 2 and 4-8, the device includes a first shooting module 401, a second shooting module 402, a first image acquisition module 403, a second image acquisition module 404, a first matching module 405, a first storage module 406, a first calculation module 407, a second calculation module 408, and a first adjustment module 409. The distance between the second shooting module 402 and the first shooting module 401 is a first preset length.

[0233] The first shooting module 401 and the second shooting module 402 are used to capture images respectively;

[0234] The first image acquisition module 403 is used to acquire a first image from the first shooting module 401, acquire a target image from the first image, and acquire the first pixel position of the target image in the first image, wherein the target image is the first image or a partial image of the first image;

[0235] The second image acquisition module 404 is used to acquire a second image from the second shooting module 402, wherein the field of view of the second image covers the field of view of the first image or the field of view of the second image at least partially overlaps with the field of view of the first image;

[0236] The first matching module 405 is used to match the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, and to obtain the pixel position of the matching region in the second image, wherein the pixel position is the second pixel position;

[0237] The first calculation module 407 is used to calculate the pixel position of the first image in the second image based on the first pixel position and the second pixel position, wherein the pixel position is the third pixel position;

[0238] The first storage module 406 is used to store the mapping relationship between the position of the third pixel and the focus position of the first shooting module;

[0239] The second calculation module 408 is used to obtain the mapping relationship between the focus position of the first shooting module and the third pixel position from the first storage module, and calculate the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position.

[0240] The first adjustment module 409 is used to adjust the focus position of the first shooting module according to the first preset focus position corresponding to the target image scene.

[0241] In the embodiments of this application, the method of adjusting the focus position of the first shooting module should not be understood as only the mechanical displacement of the optical element or imaging sensor in physical space, but should be broadly understood as all adjustment methods that change the focus state of the imaging system and cause the focus position to change relative to the imaging medium; including but not limited to at least one of the following: physical displacement adjustment method, adjustment method based on optical degree change, optical path adjustment method, or imaging algorithm adjustment method.

[0242] In this embodiment, the first shooting module 401 and the second shooting module 402 respectively capture images. The first image acquisition module 403 acquires a first image from the first shooting module 401, acquires a target image from the first image, and acquires the first pixel position of the target image in the first image. The second image acquisition module 404 acquires a second image from the second shooting module 402. The field of view of the second image covers the field of view of the first image, or at least partially overlaps with it. The first matching module 405 matches the target image and the second image through image registration to acquire a matching region and the pixel position of the matching region in the second image, i.e., the second pixel position. The first calculation module 407 calculates the pixel position of the first image in the second image, i.e., the third pixel position, based on the first pixel position and the second pixel position. The second calculation module 408 calculates the first preset focus position corresponding to the target image scene based on the third pixel position and the mapping relationship between the third pixel position and the focus position of the first shooting module. The first adjustment module 409 adjusts the focus position of the first shooting module according to the first preset focus position corresponding to the target image scene.
Compared with traditional contrast-based focusing methods, this application can quickly determine the focus position by acquiring a first image captured by the first shooting module and a second image captured by the second shooting module. Compared with phase-based focusing methods, this application does not require a special dual-pixel sensor for focusing, the focusing process does not lose image resolution or dynamic range, resulting in good image quality, and there are no strict requirements on the sensor model. Compared with deep learning focusing methods, this application does not require a large amount of data collection for focusing scene training, saving considerable manpower and machine costs; the input of the neural-network-based focus prediction is the third image captured by the first camera, which is already close to the optimal focus point, so the neural network does not have to make predictions on images taken too far from the optimal focus point, which effectively improves focusing accuracy. Compared with LiDAR focusing methods, this application has a wide focusing range and can focus in both long-distance and short-distance scenes without requiring additional hardware transmission modules. Compared with conventional dual-camera focusing methods, this application does not rely on the intrinsic parameter information obtained from camera calibration and can complete focusing based solely on the dual-camera images, reducing the manpower costs associated with calibration.

[0243] In this embodiment of the application, both the first shooting module 401 and the second shooting module 402 can be cameras.

[0244] In one possible implementation, the optical axis of the first imaging module 401 when capturing the first image and the optical axis of the second imaging module 402 when capturing the second image are set parallel to each other.

[0245] In one possible implementation, the optical axis of the first shooting module 401 and the optical axis of the second shooting module 402 are both fixed; or, at least one of the optical axes of the first shooting module 401 and the second shooting module 402 is rotatable.

[0246] In this embodiment, the optical axes of both the first imaging module 401 and the second imaging module 402 can be fixed, ensuring that they are parallel. At least one of the optical axes of the first imaging module 401 and the second imaging module 402 is rotatable, allowing the optical axes of the first imaging module 401 and the second imaging module 402 to be adjusted to be parallel when the first imaging module 401 captures the first image and the second imaging module 402 captures the second image.

[0247] In this embodiment of the application, the image resolution of both the first image and the second image is 1920 x 1080.

[0248] In one possible implementation, the first matching module 405 includes a first image transformation unit and a first image matching unit. The first image transformation unit is used to perform image transformation on the target image through an image transformation matrix to obtain a target transformed image, such that the target transformed image and the overlapping field of view of the target image and the second image have the same image resolution.

[0249] The first image matching unit is used to match the target transformed image and the second image through image registration to obtain a matching region in the target transformed image that matches the second image, and to obtain the second pixel position of the matching region in the second image.

[0250] In this embodiment, the first image transformation unit linearly scales the target image so that the scaled target image and the overlapping field of view of the target image and the second image have the same pixel dimensions and correspond to the same object-side field of view; the first image matching unit uses a template matching algorithm to match the linearly scaled target image and the second image.

[0251] In one possible implementation, the first calculation module 407 includes:

[0252] The first coordinate acquisition unit is used to acquire the first pixel coordinates (x1, y1) of the first preset pixel point at the first preset position in the matching area in the first image.

[0253] The second coordinate acquisition unit is used to acquire the second pixel coordinates (x2, y2) of the first preset pixel point in the second image.

[0254] The third coordinate acquisition unit is used to perform image transformation on the first image through the image transformation matrix to obtain the first transformed image, and to acquire the third pixel coordinates (x1', y1') of the first preset pixel point in the first transformed image.

[0255] The fourth coordinate acquisition unit is used to calculate and obtain the fourth pixel coordinate (x3, y3) of the third pixel position based on the second pixel coordinate (x2, y2) and the third pixel coordinate (x1', y1'), where x3 = x2 - x1' and y3 = y2 - y1'.

[0256] In one possible implementation, the image transformation matrix is a linear scaling matrix $A = \begin{pmatrix} r_1 & 0 \\ 0 & r_2 \end{pmatrix}$, where r1 represents the linear scaling ratio of the first or second image in the x-direction, and r2 represents the linear scaling ratio of the first or second image in the y-direction. The third pixel coordinates (x1', y1') satisfy: x1' = r1*x1, y1' = r2*y1. The fourth pixel coordinates (x3, y3) satisfy: x3 = x2 - x1' = x2 - r1*x1, y3 = y2 - y1' = y2 - r2*y1.

[0257] In this embodiment of the application, the first coordinate acquisition unit obtains the coordinates of the top left corner vertex in the first image, i.e., the first pixel coordinates (x1, y1); the second coordinate acquisition unit obtains the pixel coordinates of the top left corner vertex 205 of the linearly scaled target image in the second image, i.e., the second pixel coordinates (x2, y2); the third coordinate acquisition unit calculates the pixel coordinates of the top left corner vertex 207 of the linearly scaled target image in the first image, i.e., the third pixel coordinates (x1', y1'); the fourth coordinate acquisition unit calculates the pixel coordinates of the top left corner vertex of the first image in the second image based on the coordinates of the top left corner vertex of the target image in the first image (x1, y1), the pixel coordinates of the top left corner vertex of the target image in the second image (x2, y2), and the pixel coordinates of the top left corner vertex of the linearly scaled target image in the first image (x1', y1'), i.e., the fourth pixel coordinates (x3, y3), where x3 = x2 - x1' and y3 = y2 - y1'. The second calculation module obtains the mapping relationship between the focus position of the first shooting module and the third pixel position from the first storage module, and calculates the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position. The movement calculation unit in the first adjustment module calculates the number of steps and the direction of movement of the first shooting module from the current focus position to the first preset focus position corresponding to the target image scene based on the current focus position of the first shooting module and the focus position corresponding to the target image scene. 
The focusing unit in the first adjustment module calculates the image contrast or sharpness information at multiple adjacent positions of the focus position corresponding to the target image scene. The focusing unit obtains the maximum value of the image contrast or sharpness information from the multiple image contrast or sharpness information, obtains the focusing position corresponding to the maximum value of the image contrast or sharpness information, and moves the focus position of the first shooting module to the focusing position to complete focusing.

[0258] In the embodiments of this application, the image registration algorithm is a template matching algorithm or a feature point matching algorithm.

[0259] In this embodiment of the application, the template matching algorithm uses normalized cross-correlation or normalized squared difference as the matching similarity criterion.
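Template matching with a normalized correlation coefficient as the similarity criterion can be sketched in plain Python as an exhaustive search. This is an illustrative brute-force version, not the implementation of this application; the function name and test data are assumptions. The mean-subtracted, normalized score used here is the variant that OpenCV exposes as `TM_CCOEFF_NORMED`.

```python
from math import sqrt


def match_template_ncc(image, template):
    """Return ((x, y), score) of the best normalized-correlation-coefficient match."""
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_dev = [v - t_mean for v in t_vals]
    t_norm = sqrt(sum(d * d for d in t_dev))
    best_pos, best_score = None, -2.0
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            w_vals = [image[y + j][x + i] for j in range(th) for i in range(tw)]
            w_mean = sum(w_vals) / len(w_vals)
            w_dev = [v - w_mean for v in w_vals]
            w_norm = sqrt(sum(d * d for d in w_dev))
            if t_norm == 0 or w_norm == 0:
                continue  # flat patch: the coefficient is undefined
            score = sum(a * b for a, b in zip(t_dev, w_dev)) / (t_norm * w_norm)
            if score > best_score:
                best_pos, best_score = (x, y), score
    return best_pos, best_score


# Hypothetical data: a 2x2 pattern placed at x=4, y=2 in an otherwise empty image.
image = [[0] * 8 for _ in range(6)]
template = [[1, 2], [3, 9]]
for j, row in enumerate(template):
    for i, value in enumerate(row):
        image[2 + j][4 + i] = value

position, score = match_template_ncc(image, template)
print(position)  # (4, 2)
```

The returned position plays the role of the second pixel coordinates (x2, y2), the top-left vertex of the matched region in the second image.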

[0260] In this embodiment of the application, the feature point matching algorithm is one of the following: a feature point matching algorithm based on SIFT feature points, a feature point matching algorithm based on SURF feature points, or a feature point matching algorithm based on ORB feature points.

[0261] In one possible implementation, the first matching module 405 performs image registration to match the target image and the second image to obtain a matching region in the target image that matches the second image. Specifically, this includes: obtaining at least one corresponding low-frequency feature in the target image and the second image, and performing image registration on the target image and the second image based on the low-frequency feature to obtain a matching region in the target image that matches the second image.

[0262] In one possible implementation, the first matching module 405 acquires the low-frequency feature corresponding to an image as follows: at least one low-frequency transformation is performed on the image to obtain corresponding first operation data; when a single low-frequency transformation is performed, its first operation data is used as the low-frequency feature of the image; when multiple low-frequency transformations are performed, the multiple first operation data are combined to obtain the low-frequency feature of the image.

[0263] In this embodiment, at least one low-frequency transform is performed on the target image to obtain corresponding first operation data. At least one low-frequency transform is performed on the second image to obtain corresponding first operation data.

[0264] In one possible implementation, the low-frequency features include one or more of the following: the overall structural features of the image, the macroscopic statistical distribution features of the image, the global semantic features of the image, or the first feature data of the image; the high-frequency components or local pixel mutations in the first feature data of the image are suppressed or ignored.

[0265] In one possible implementation, the low-frequency transformation is one of the following: calculating and obtaining the average brightness value of the image, obtaining the color histogram of the image, obtaining the brightness histogram of the image, performing a downsampling operation on the image and obtaining low-resolution data of the downsampled image, and processing the image through a low-pass filter and obtaining filtered data of the processed image. The first operation data is one of the following: average brightness value, color histogram, brightness histogram, low-resolution data, and filtered data.

[0266] In this embodiment, the low-frequency transformation can be one of the following: calculating the average brightness value of an image, obtaining the color histogram of the image, obtaining the brightness histogram of the image, performing a downsampling operation on the image and obtaining low-resolution data of the downsampled image, and processing the image through a low-pass filter and obtaining filtered data of the processed image. By performing one of the above low-frequency transformations, first operation data is obtained, that is, when performing the low-frequency transformation to calculate the average brightness value of the image, the average brightness value is used as the first operation data; when performing the low-frequency transformation to obtain the color histogram of the image, the color histogram is used as the first operation data; when performing the low-frequency transformation to obtain the brightness histogram of the image, the brightness histogram is used as the first operation data; when performing the downsampling operation on the image and obtaining low-resolution data of the downsampled image, the low-resolution data is used as the first operation data; when performing the low-frequency transformation to process the image through a low-pass filter and obtaining filtered data of the processed image, the filtered data is used as the first operation data, and the first operation data is used as a low-frequency feature. By performing various low-frequency transformations described above, that is, combining multiple low-frequency transformations to perform multiple processing on the same image, multiple corresponding first operation data are obtained. These first operation data are then combined to form low-frequency features. The combination can be averaging, weighting, addition, subtraction, multiplication, division, or direct concatenation. 
For example, the target image can be subjected to low-frequency transformations that perform downsampling operations to obtain low-resolution data of the downsampled image, and low-frequency transformations that process the image through a low-pass filter to obtain filtered data of the processed image. This yields low-resolution data and filtered data respectively. The low-resolution data and filtered data are then averaged to obtain new data, which is used as low-frequency features.
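Several of the low-frequency transformations listed above, and one way of combining their first operation data, can be sketched as follows. The sketch assumes grayscale images stored as lists of rows with values in 0..255, block averaging as the downsampling operation, and direct concatenation as the combination; the function names, bin count, and downsampling factor are illustrative choices, not values from this application.

```python
def mean_brightness(image):
    """Average brightness value of the image."""
    pixels = [v for row in image for v in row]
    return sum(pixels) / len(pixels)


def brightness_histogram(image, bins=4, max_value=256):
    """Normalized brightness histogram with `bins` equal-width bins."""
    counts = [0] * bins
    total = 0
    for row in image:
        for v in row:
            counts[min(v * bins // max_value, bins - 1)] += 1
            total += 1
    return [c / total for c in counts]


def downsample(image, factor):
    """Average non-overlapping factor x factor blocks into low-resolution data."""
    h, w = len(image) // factor, len(image[0]) // factor
    return [[sum(image[y * factor + j][x * factor + i]
                 for j in range(factor) for i in range(factor)) / factor ** 2
             for x in range(w)] for y in range(h)]


def low_frequency_feature(image, factor=2, bins=4):
    """Concatenate the first operation data of three low-frequency transformations."""
    low_res = [v for row in downsample(image, factor) for v in row]
    return [mean_brightness(image)] + brightness_histogram(image, bins) + low_res


sample = [[0, 0, 255, 255],
          [0, 0, 255, 255],
          [64, 64, 128, 128],
          [64, 64, 128, 128]]
print(low_frequency_feature(sample))
# [111.75, 0.25, 0.25, 0.25, 0.25, 0.0, 255.0, 64.0, 128.0]
```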

[0267] In this embodiment of the application, the low-frequency transformation of performing downsampling operation on the image and obtaining low-resolution data of the downsampled image specifically includes: downsampling the target image and the second image based on the horizontal field of view ratio and the vertical field of view ratio of the first shooting module and the second shooting module, respectively, to obtain low-frequency images of the target image and the second image.

[0268] In the embodiments of this application, the low-pass filter used to process the image and obtain the filtered data may include one or a combination of the following types: a) a spatial-domain linear convolution filter, including a mean filter and/or a Gaussian filter; b) a frequency-domain filter, including an ideal low-pass filter and/or a Butterworth low-pass filter; c) a nonlinear filter, including a median filter and/or a bilateral filter.

[0269] In one possible implementation, the low-frequency transformation is one of performing a Fourier transform on the image and obtaining the first low-frequency coefficients of the transformed image, performing a discrete cosine transform on the image and obtaining the second low-frequency coefficients of the transformed image, and performing a wavelet transform on the image and obtaining the low-frequency subband coefficients of the transformed image, wherein the first operation data is one of the first low-frequency coefficients, the second low-frequency coefficients, and the low-frequency subband coefficients.

[0270] In this embodiment, low-frequency features are obtained in the transform domain. The low-frequency transformation can be one of the following: performing a Fourier transform on the image and obtaining the first low-frequency coefficients of the transformed image; performing a discrete cosine transform on the image and obtaining the second low-frequency coefficients of the transformed image; or performing a wavelet transform on the image and obtaining the low-frequency sub-band coefficients of the transformed image. Performing one of these low-frequency transformations yields the first operation data: the first low-frequency coefficients, the second low-frequency coefficients, or the low-frequency sub-band coefficients, respectively. The first operation data is then used as the low-frequency feature. Alternatively, several of these low-frequency transformations can be applied to the same image, yielding multiple first operation data, which are combined to form the low-frequency feature. The combination can be averaging, weighting, addition, subtraction, multiplication, division, or direct concatenation.
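As an illustration of the Fourier-transform variant in paragraph [0270], the following sketch keeps only a small block of coefficients around the zero-frequency term. The function name and the `keep` parameter are hypothetical, and using coefficient magnitudes rather than complex values is an assumption made for simplicity.

```python
import numpy as np

def fft_low_frequency(img, keep=8):
    """Magnitudes of the keep x keep lowest-frequency 2-D FFT coefficients.

    Assumes both image dimensions are at least `keep` and `keep` is even.
    """
    img = np.asarray(img, dtype=np.float64)
    # fftshift moves the zero-frequency (DC) term to index (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(img))
    cy, cx = img.shape[0] // 2, img.shape[1] // 2
    half = keep // 2
    block = spectrum[cy - half:cy + half, cx - half:cx + half]
    return np.abs(block)
```

For a constant image, all energy sits in the DC term, so only the entry at `[keep // 2, keep // 2]` of the returned block is non-zero.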

[0271] In one possible implementation, the low-frequency transformation is one of obtaining a density distribution map of the edges in the image and obtaining overall direction statistics of the edges in the image, wherein the first operation data is one of the density distribution map and overall direction statistics.

[0272] In this embodiment, the low-frequency feature is a macroscopic representation based on edge information, including one or a combination of the following methods: extracting the density distribution map of edges in an image, and extracting the overall direction statistics of edges in an image. The low-frequency transformation can be one of the following methods: obtaining the density distribution map of edges in an image and obtaining the overall direction statistics of edges in an image. By performing one of the above low-frequency transformations, first operation data is obtained; that is, when performing the low-frequency transformation to obtain the density distribution map of edges in an image, the density distribution map is used as the first operation data; when performing the low-frequency transformation to obtain the overall direction statistics of edges in an image, the overall direction statistics are used as the first operation data, and the first operation data is used as the low-frequency feature. By performing multiple of the above low-frequency transformations, that is, combining multiple low-frequency transformations to perform multiple processing on the same image, multiple corresponding first operation data are obtained. These first operation data are combined to form a low-frequency feature; the combination can be averaging, weighting, addition, subtraction, multiplication, division, or direct concatenation.
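A rough sketch of the two edge-based representations in paragraph [0272] is given below. The gradient operator (`numpy.gradient`), the relative threshold `thresh`, the cell size `cell`, and the number of orientation bins `bins` are assumptions made for illustration; the source does not specify how the density distribution map or the direction statistics are computed.

```python
import numpy as np

def edge_features(img, cell=8, bins=8, thresh=0.1):
    """Return (edge density map, normalized edge-orientation histogram)."""
    img = np.asarray(img, dtype=np.float64)
    gy, gx = np.gradient(img)            # derivatives along rows, columns
    mag = np.hypot(gx, gy)
    peak = mag.max()
    # A pixel counts as an edge if its gradient exceeds a fraction of the peak.
    edges = mag > thresh * peak if peak > 0 else np.zeros(mag.shape, dtype=bool)

    # Density map: fraction of edge pixels in each cell x cell block.
    h2, w2 = img.shape[0] // cell, img.shape[1] // cell
    cropped = edges[:h2 * cell, :w2 * cell]
    density = cropped.reshape(h2, cell, w2, cell).mean(axis=(1, 3))

    # Direction statistics: gradient orientation folded into [0, pi),
    # weighted by gradient magnitude.
    theta = np.mod(np.arctan2(gy, gx), np.pi)
    hist, _ = np.histogram(theta[edges], bins=bins, range=(0.0, np.pi),
                           weights=mag[edges])
    total = hist.sum()
    if total > 0:
        hist = hist / total
    return density, hist
```

Both outputs are far smaller than the input image, which is what makes them usable as coarse, blur-tolerant descriptors.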

[0273] In this embodiment, the low-frequency transformation is one of extracting a deep feature map of an image through a neural network and extracting an image embedding vector of an image through a neural network, and the first operation data is one of the deep feature map and the image embedding vector.

[0274] In this embodiment, low-frequency features refer to the deep feature map or image embedding vector extracted from an image using a neural network. The low-frequency transformation can be one of the following: extracting the deep feature map of an image using a neural network, or extracting the image embedding vector of an image using a neural network. Performing one of these low-frequency transformations yields the first operation data: the deep feature map or the image embedding vector, respectively. The first operation data is then used as the low-frequency feature. Alternatively, several of these low-frequency transformations can be applied to the same image, yielding multiple corresponding first operation data, which are combined to form the low-frequency feature. The combination can be averaging, weighting, addition, subtraction, multiplication, division, or direct concatenation. Neural networks include, but are not limited to, ResNet series neural networks, EfficientNet series neural networks, MobileNet series neural networks, or Vision Transformer. ResNet series neural networks are a classic deep convolutional neural network (CNN) architecture that mitigates the vanishing-gradient problem in deep networks through residual modules and skip connections. The EfficientNet series of neural networks are efficient convolutional neural networks designed using Neural Architecture Search (NAS); they trade off accuracy against efficiency by jointly scaling network depth, width, and input resolution through a compound scaling method. The MobileNet series of neural networks are lightweight convolutional neural networks designed specifically for mobile devices and embedded systems. Vision Transformer (ViT) is an image processing model proposed in 2020 that applies the Transformer architecture from natural language processing to computer vision as an alternative to traditional convolutional neural networks (CNNs).

[0275] In this embodiment of the application, the corresponding low-frequency feature is obtained from one of the target image and the second image through image transformation; the low-frequency feature is the transformed image so obtained. Performing image registration of the target image and the second image based on the low-frequency feature to obtain a matching region in the target image that matches the second image specifically includes: performing a similarity calculation operation between the transformed image and the other of the target image and the second image, and taking the region with the highest similarity as the matching region in the target image that matches the second image.

[0276] In this embodiment of the application, the first matching module 405 obtains the low-frequency features corresponding to the target image and the second image respectively. The first matching module 405 performs image registration on the target image and the second image based on the low-frequency features to obtain the matching region in the target image that matches the second image. Specifically, it performs a similarity calculation operation on the low-frequency features of the target image and the second image, and takes the region with the highest similarity as the matching region in the target image that matches the second image.

[0277] In this embodiment of the application, the low-frequency features of the second image and the low-frequency features of the target image are used to calculate the similarity, and the region with the highest similarity is determined as the matching region in the target image that matches the second image.
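The search for the region with the highest similarity can be sketched as an exhaustive sliding-window comparison. The sketch below uses normalized cross-correlation, one of the measures the application lists; the function names and the roles of the two inputs (`search_feat` as the larger low-frequency feature map, `template_feat` as the smaller one) are assumptions for illustration, and a production system would likely use an optimized routine rather than Python loops.

```python
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized arrays."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def find_matching_region(search_feat, template_feat):
    """Return ((x, y), score) of the best-matching window's top-left corner."""
    search_feat = np.asarray(search_feat, dtype=np.float64)
    template_feat = np.asarray(template_feat, dtype=np.float64)
    H, W = search_feat.shape
    h, w = template_feat.shape
    best_score, best_pos = -np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            score = ncc(search_feat[y:y + h, x:x + w], template_feat)
            if score > best_score:      # higher NCC means more similar
                best_score, best_pos = score, (x, y)
    return best_pos, best_score
```

For distance-type measures (for example the sum of squared differences), the comparison would be inverted so that the smallest value wins.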

[0278] In this embodiment, the similarity calculation operation is one of the following: calculating the cross-correlation value between low-frequency features of two images, calculating the normalized cross-correlation value between low-frequency features of two images, calculating the sum of absolute differences between low-frequency features of two images, calculating the sum of squared differences between low-frequency features of two images, calculating the mean absolute error between low-frequency features of two images, calculating the mean squared error between low-frequency features of two images, calculating the Bhattacharyya distance between low-frequency features of two images, calculating the chi-square distance between low-frequency features of two images, calculating the Earth Mover's Distance (EMD) between the histograms of low-frequency features of two images, calculating the mutual information between low-frequency features of two images, calculating the structural similarity index between low-frequency features of two images, calculating the cosine similarity between low-frequency feature vectors of two images, calculating the Euclidean distance between the low-frequency feature vectors of two images, calculating the Manhattan distance between the low-frequency feature vectors of two images, and calculating the Mahalanobis distance between the low-frequency feature vectors of two images. Correspondingly, the region with the highest similarity is one of the following: the region corresponding to the maximum cross-correlation value, the region corresponding to the maximum normalized cross-correlation value, the region corresponding to the minimum sum of absolute differences, the region corresponding to the minimum sum of squared differences, the region corresponding to the minimum mean absolute error, the region corresponding to the minimum mean squared error, the region corresponding to the minimum Bhattacharyya distance, the region corresponding to the minimum chi-square distance, the region corresponding to the minimum EMD, the region corresponding to the maximum mutual information, the region corresponding to the maximum structural similarity index, the region corresponding to the maximum cosine similarity, the region corresponding to the minimum Euclidean distance, the region corresponding to the minimum Manhattan distance, and the region corresponding to the minimum Mahalanobis distance.

[0279] In this embodiment, the similarity calculation operation may include one of the following methods: calculating the cross-correlation value or normalized cross-correlation value between the low-frequency features of two images; calculating the sum of absolute differences or sum of squared differences between the low-frequency features of two images; or calculating the mean absolute error or mean squared error between the low-frequency features of two images. Alternatively, the similarity calculation operation may include one of the following methods: calculating the Bhattacharyya distance between the low-frequency feature histograms of two images; calculating the chi-square distance between the low-frequency feature histograms of two images; or calculating the Earth Mover's Distance between the low-frequency feature histograms of two images. Alternatively, the similarity calculation operation may calculate the mutual information between the low-frequency features of two images, or the structural similarity index between the low-frequency features of two images. Alternatively, the similarity calculation operation may include one of the following methods: calculating the cosine similarity between the low-frequency feature vectors of two images; calculating the Euclidean distance or Manhattan distance between the low-frequency feature vectors of two images; or calculating the Mahalanobis distance between the low-frequency feature vectors of two images.
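A few of the listed measures are short enough to state directly. The following sketch shows textbook forms of the sum of absolute differences, the sum of squared differences, the Bhattacharyya and chi-square distances between histograms, and the cosine similarity between feature vectors. The function names and the small `eps` guard are assumptions; the application does not prescribe particular formulas, and several variants of the chi-square distance exist.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences (lower means more similar)."""
    return float(np.abs(np.asarray(a, float) - np.asarray(b, float)).sum())

def ssd(a, b):
    """Sum of squared differences (lower means more similar)."""
    d = np.asarray(a, float) - np.asarray(b, float)
    return float((d * d).sum())

def _normalize(h):
    h = np.asarray(h, float)
    return h / h.sum()

def bhattacharyya_distance(p, q, eps=1e-12):
    """-ln of the Bhattacharyya coefficient of two histograms."""
    p, q = _normalize(p), _normalize(q)
    bc = np.sqrt(p * q).sum()
    return float(-np.log(max(bc, eps)))

def chi_square_distance(p, q, eps=1e-12):
    """Symmetric chi-square distance between two histograms."""
    p, q = _normalize(p), _normalize(q)
    return float(0.5 * (((p - q) ** 2) / (p + q + eps)).sum())

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (higher means more similar)."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```

When choosing the matching region, distance-type measures are minimized and similarity-type measures are maximized, as enumerated in paragraph [0278].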

[0280] In one possible embodiment, a mapping relationship acquisition module is further included; the mapping relationship acquisition module is used to construct a first prediction model; move the first shooting module 401 multiple times, acquire the first adjustment focus position corresponding to each movement of the first shooting module 401, and acquire a third image from the first shooting module 401; based on the third image, acquire the pixel coordinates of a second preset pixel in the third image in the second image; construct a dataset based on the pixel coordinates of the second preset pixel in the second image and multiple first adjustment focus positions; fit the first prediction model with the dataset to obtain a first fitting model; and acquire the mapping relationship between the third pixel position and the focus position of the first shooting module 401 based on the first fitting model.

[0281] In this embodiment, the first shooting module is moved multiple times to change its first adjustment focus position. The first shooting module can be moved arbitrarily or sequentially from beginning to end. After each movement of the first shooting module, a third image is acquired, together with a first mapped target image and the fourth pixel position of the first mapped target image within the third image. Image registration is used to match the first mapped target image with the second image to obtain a mapping matching region in the first mapped target image that matches the second image. The fifth pixel position of this mapping matching region in the second image is then obtained. Based on the fourth and fifth pixel positions, the sixth pixel position of the third image in the second image is calculated. The image is a rectangular image. The fourth pixel position is the pixel coordinate of the top-left corner vertex of the mapping matching region in the third image. The fifth pixel position is the pixel coordinate of the top-left corner vertex of the mapping matching region in the second image. Thus, the pixel coordinate position of the top-left corner vertex of the third image in the second image is obtained under each of the different first adjustment focus positions f_i. Dataset D is constructed from the multiple first adjustment focus positions and the corresponding pixel coordinates of the top-left corner vertex of the third image in the second image. A first prediction model is constructed, and the first prediction model is fitted to the dataset D to obtain a first fitting model. The mapping relationship between the third pixel position and the focus position of the first shooting module is obtained based on the first fitting model.

[0282] In this embodiment, the second preset pixel can be the top-left corner vertex of the third image, the center point of the third image, or the centroid of the third image. The first prediction model can include a prediction model formed based on one or a combination of the following methods: linear fitting, polynomial fitting, exponential fitting, logarithmic fitting, power function fitting, logistic regression fitting, Gaussian fitting, spline fitting, local regression, and nonlinear fitting methods based on machine learning. For example, the first prediction model can be based on a linear fitting method that fits a straight line f_i = k·p_i + b to the dataset D, where f_i is the i-th first adjustment focus position, p_i is the corresponding pixel coordinate, k represents the slope of the line, and b represents the intercept of the line. The fitting method can be at least one of least squares, total least squares, Hough transform, RANSAC, and M-estimators. The first prediction model can also be based on a multi-segment linear fitting method, where the number of segments n, with n ≥ 2, is either selected manually or calculated automatically using at least one of dynamic programming-based methods, greedy search-based methods, and RANSAC-based fitting methods.
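The straight-line case f_i = k·p_i + b with ordinary least squares can be sketched in a few lines. This is an illustration only: treating p_i as a single scalar coordinate (for example, one component of the pixel position) is an assumption, and the function names and sample values are hypothetical.

```python
import numpy as np

def fit_linear_mapping(p, f):
    """Least-squares fit of f = k * p + b; returns (k, b)."""
    k, b = np.polyfit(np.asarray(p, float), np.asarray(f, float), deg=1)
    return float(k), float(b)

def predict_focus(k, b, p):
    """Focus position predicted for pixel coordinate p."""
    return k * p + b

# Hypothetical calibration data: pixel coordinate p_i observed at
# first adjustment focus position f_i.
p_samples = [100, 120, 140, 160]
f_samples = [230, 270, 310, 350]
k, b = fit_linear_mapping(p_samples, f_samples)
```

A robust alternative such as RANSAC or an M-estimator would replace only `fit_linear_mapping`; the prediction step stays the same.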

[0283] In one possible implementation, the second calculation module 408 is used to calculate and obtain the first preset focus position corresponding to the target image scene according to the mapping relationship and the third pixel position, specifically including: obtaining a first adjustment focus position according to the mapping relationship and the third pixel position; obtaining a fourth image from the first shooting module 401 when the first shooting module 401 moves to the first adjustment focus position, and obtaining a first target adjustment image from the fourth image; determining whether the sharpness of the first target adjustment image meets the first preset focus condition, and if not, constructing a second prediction model, and obtaining the first predicted focus position of the first shooting module 401 based on the first target adjustment image through the second prediction model; and using the first predicted focus position as the first preset focus position.

[0284] In this embodiment, the first target adjustment image can be the fourth image or a partial image of the fourth image. The first target adjustment image can also be an image set composed of multiple target images obtained by the first shooting module while the focusing motor is moved continuously; here, continuously moving the focusing motor means moving the focusing motor in fixed steps or in non-fixed steps.

[0285] In this embodiment, the first preset focusing condition can be: the sharpness exceeds a first preset value, the blur level is lower than a second preset value, the sharpness score exceeds a third preset value, or the sharpness is classified as sharp. Determining whether the sharpness of the first target adjusted image meets the first preset focusing condition specifically includes: obtaining the sharpness of the first target adjusted image; if the first preset focusing condition is that the sharpness exceeds the first preset value, comparing the sharpness of the first target adjusted image directly with the first preset value; if the first preset focusing condition is that the blur level is lower than the second preset value, obtaining the blur level of the first target adjusted image from its sharpness and comparing the blur level with the second preset value; if the first preset focusing condition is that the sharpness score exceeds the third preset value, obtaining the sharpness score of the first target adjusted image from its sharpness and comparing the sharpness score with the third preset value; if the first preset focusing condition is that the sharpness is classified as sharp, obtaining the sharpness classification of the first target adjusted image from its sharpness and checking whether that classification is sharp. Methods for obtaining image sharpness values include, but are not limited to, traditional algorithms or neural network algorithms. Traditional algorithms may include at least one of the following: the Laplacian Energy Method, the Tenengrad Gradient Function, the Image Variance Method, the FFT Frequency Domain Method, the DCT Frequency Domain Method, and the NIQE algorithm.
The Laplacian Energy Method is based on the second derivative of the image and measures image sharpness by calculating the response of the Laplacian operator. The Tenengrad Gradient Function uses the Sobel operator to extract the gradients of the image in the horizontal and vertical directions and evaluates image sharpness from their magnitude. The Image Variance Method evaluates image sharpness by calculating the variance of the pixel values. The FFT Frequency Domain Method determines the degree of blur by analyzing the spectral characteristics of the image. The DCT Frequency Domain Method evaluates sharpness from the high-frequency DCT coefficients of the image. The NIQE (Natural Image Quality Evaluator) algorithm is a no-reference image quality evaluation algorithm designed to assess the naturalness of an image, i.e., whether the image looks like a natural scene. Neural network algorithms include at least one of the deep learning prediction algorithms based on the MobileNet series, Transformer series, GhostNet series, or ShuffleNet series. Obtaining the first predicted focus position of the first shooting module through the second prediction model based on the first target adjustment image means that a trained algorithm model takes the first target adjustment image as input and predicts the first predicted focus position. The second prediction model can be a model based on a trained neural network algorithm, including at least one of the deep learning prediction algorithms based on the MobileNet series, Transformer series, GhostNet series, and ShuffleNet series. Alternatively, the second prediction model can be a model based on feature extraction and model prediction methods.
Feature extraction methods include at least one of HOG features, color histograms, color moments, LBP, Haralick features, Gabor features, SIFT features, SURF features, ORB features, and Haar-like features. Model prediction methods include at least one of SVM methods, random forests, AdaBoost, and K-nearest neighbors. SVM (Support Vector Machine) is a supervised learning algorithm used for classification and regression. Its core idea is to find an optimal hyperplane that maximizes the margin between different categories. Random forest is an ensemble learning method that constructs multiple decision trees, each trained on randomly selected data and features, and improves prediction accuracy by voting on or averaging their results. AdaBoost (Adaptive Boosting) is an ensemble learning algorithm that iteratively adjusts sample weights to focus on misclassified samples and gradually combines multiple weak classifiers into a strong classifier. K-Nearest Neighbors (KNN) is a distance-based nonparametric classification algorithm that finds the K nearest neighbors of a new sample in the training set and selects the majority class among them as the prediction result.
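Three of the traditional sharpness measures named in paragraph [0285] can be sketched with NumPy alone. The kernels are the standard 4-neighbor Laplacian and 3x3 Sobel operators; the function names, the use of the mean as the aggregate, and the valid-mode border handling are assumptions made for this sketch.

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0], [1, -4, 1], [0, 1, 0]], dtype=np.float64)
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
SOBEL_Y = SOBEL_X.T

def _filter3(img, kernel):
    """3x3 correlation over the valid region (output shrinks by 2 per axis)."""
    img = np.asarray(img, dtype=np.float64)
    h, w = img.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float64)
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out

def laplacian_energy(img):
    """Mean squared Laplacian response."""
    return float((_filter3(img, LAPLACIAN) ** 2).mean())

def tenengrad(img):
    """Mean squared Sobel gradient magnitude."""
    gx, gy = _filter3(img, SOBEL_X), _filter3(img, SOBEL_Y)
    return float((gx ** 2 + gy ** 2).mean())

def image_variance(img):
    """Variance of the pixel values."""
    return float(np.asarray(img, dtype=np.float64).var())
```

All three scores increase as edges become steeper, so a threshold comparison such as "sharpness exceeds the first preset value" can be applied to any of them.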

[0286] In one possible implementation, an input module is further included; the first image acquisition module is used to detect and acquire the first pixel position through a target detection algorithm or to receive the first pixel position sent by the input module.

[0287] In one possible implementation, the target detection algorithm is one of the following: a YOLO-based deep learning object detection algorithm, an R-CNN-based deep learning object detection algorithm, a Transformer-based deep learning object detection algorithm, a traditional object detection algorithm based on a Haar cascade classifier, or a traditional object detection algorithm based on DPM.

[0288] In one possible implementation, the target image is an irregularly shaped image, and the first pixel position of the target image in the first image is the pixel coordinates of the upper left vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the first image. The second pixel position of the matching region in the second image is the pixel coordinates of the upper left vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the second image.

[0289] In one possible implementation, the target image is a rectangular image, the first pixel position of the target image in the first image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the first image, and the second pixel position of the matching region in the second image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the second image.

[0290] In this embodiment, the first pixel position can be the pixel coordinates of the top-left corner vertex of the matching region in the first image, and the third pixel position can be the pixel coordinates of the top-left corner vertex of the first image in the second image. The mapping relationship is the relationship between the third pixel position and the focus position of the first shooting module, or the relationship between the focus position of the first shooting module and the pixel coordinates of the top-left corner vertex of the first image in the second image. Alternatively, the first pixel position can be the pixel coordinates of the center point of the matching region in the first image, and the third pixel position can be the pixel coordinates of the center point of the first image in the second image. The mapping relationship is the relationship between the third pixel position and the focus position of the first shooting module, or the relationship between the focus position of the first shooting module and the pixel coordinates of the center point of the first image in the second image. Furthermore, the first pixel position can be the pixel coordinates of the centroid of the matching region in the first image, and the third pixel position can be the pixel coordinates of the centroid of the first image in the second image. The mapping relationship is the relationship between the third pixel position and the focus position of the first shooting module, or the relationship between the focus position of the first shooting module and the pixel coordinates of the centroid of the first image in the second image.

[0291] In one possible implementation, the first adjustment module 409 includes a first position acquisition unit for acquiring the current focus position of the first shooting module;

[0292] The movement calculation unit is used to calculate, based on the first preset focus position and the current focus position, the number of movement steps and the movement direction required for the first shooting module to move from the current focus position to the first preset focus position.

[0293] The focusing unit is used to adjust the focusing position of the first shooting module according to the number of steps and the direction of movement to match the first preset focusing position corresponding to the target image scene.

[0294] In this embodiment, the number of movement steps and the movement direction required for the first shooting module to move from the current focus position to the first preset focus position corresponding to the target image scene are calculated based on the first preset focus position and the current focus position. The focus position of the first shooting module is then adjusted according to the obtained number of movement steps and movement direction so that it is consistent with the first preset focus position corresponding to the target image scene, thereby completing the focusing of the first shooting module and realizing the first adjustment of the first shooting module.
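The step-and-direction calculation amounts to a signed difference. In the sketch below, focus positions are assumed to be expressed in integer motor steps, and the function name and the sign convention (+1 toward larger positions, -1 toward smaller, 0 for no movement) are hypothetical.

```python
def plan_focus_move(current_position, preset_position):
    """Return (number_of_steps, direction) to reach the preset focus position.

    Assumes both positions are integer motor-step counts.
    """
    delta = preset_position - current_position
    steps = abs(delta)
    direction = 1 if delta > 0 else -1 if delta < 0 else 0
    return steps, direction
```

For example, moving from position 120 to position 450 requires 330 steps in the positive direction.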

[0295] In one possible implementation, after adjusting the focus position of the first shooting module to match the first preset focus position corresponding to the target image scene according to the number of movement steps and the movement direction, the focusing unit is further used for: calculating and obtaining multiple pieces of image contrast or image sharpness information at multiple positions adjacent to the first preset focus position; obtaining the maximum value among the multiple pieces of image contrast or image sharpness information and the focus position corresponding to that maximum value; and moving the focus position of the first shooting module to that focus position.

[0296] In this embodiment, after adjusting the focus position of the first shooting module to match the first preset focus position corresponding to the target image scene based on the acquired number of movement steps and movement direction, the method further includes: acquiring image contrast or image sharpness information at multiple positions adjacent to the focus position corresponding to the target image scene, and moving the focus position of the first shooting module to the position at which the image contrast or image sharpness is maximal, thereby achieving fine-tuning of the first shooting module. In this application, moving the first shooting module is not limited to physical displacement and should not be understood only as the mechanical displacement of optical elements or imaging sensors in physical space. It should be broadly understood as covering all adjustment methods that change the focus state of the imaging system and cause the focus position to change relative to the imaging medium, including but not limited to at least one of the following: physical displacement adjustment methods, adjustment methods based on optical power (diopter) change, adjustment methods based on optical path adjustment, or imaging algorithm adjustment methods.
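The fine-tuning step can be sketched as a small local search around the preset position. The callback `sharpness_at` is hypothetical: in a real device it would move the focus to the given position, capture an image, and return a contrast or sharpness score. The search radius and step size are likewise assumptions.

```python
def fine_tune_focus(preset_position, sharpness_at, radius=3, step=1):
    """Return the neighboring focus position with the highest sharpness.

    sharpness_at: hypothetical callable mapping a focus position to a score.
    """
    candidates = [preset_position + i * step
                  for i in range(-radius, radius + 1)]
    scores = [sharpness_at(p) for p in candidates]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return candidates[best_index]
```

Because the coarse prediction has already placed the lens near the peak, only a handful of positions needs to be evaluated, in contrast to a full contrast-detection sweep.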

[0297] In this embodiment, when the first shooting module moves to the first preset focus position, a fifth image is acquired from the first shooting module, and a second target adjustment image is acquired from the fifth image; it is determined whether the sharpness of the second target adjustment image meets the second preset focus condition. If it does, execution stops; otherwise, execution continues. The first shooting module is moved multiple times, and multiple sixth images are acquired corresponding to each completed movement of the first shooting module. Based on the sharpness of the multiple sixth images, a second adjustment focus position corresponding to the first shooting module is calculated. The focus position of the first shooting module is moved to the second adjustment focus position. The second target adjustment image can be the fifth image or a partial image of the fifth image.

[0298] In this embodiment, the second preset focus condition can be: sharpness exceeding a fourth preset value, blur level below a fifth preset value, sharpness score exceeding a sixth preset value, or sharpness classified as sharp. Determining whether the sharpness of the second target adjustment image meets the second preset focus condition specifically includes: acquiring the sharpness of the second target adjustment image; if the second preset focus condition is sharpness exceeding the fourth preset value, directly comparing the sharpness of the second target adjustment image with the fourth preset value; if the second preset focus condition is blur level below the fifth preset value, acquiring the blur level of the second target adjustment image based on its sharpness, and comparing that blur level with the fifth preset value; if the second preset focus condition is sharpness score exceeding the sixth preset value, acquiring the sharpness score of the second target adjustment image based on its sharpness, and comparing that sharpness score with the sixth preset value; if the second preset focus condition is sharpness classified as sharp, acquiring the sharpness classification of the second target adjustment image based on its sharpness, and checking whether that classification is "sharp".
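The four variants of the condition can be sketched as a single dispatch function. This is a minimal sketch under stated assumptions: the application does not say how blur level, sharpness score, or sharpness classification are derived from sharpness, so the three helper mappings below, as well as the names `FocusCondition` and `meets_condition`, are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FocusCondition:
    kind: str               # "sharpness", "blur", "score", or "class"
    threshold: float = 0.0  # preset value; unused for kind == "class"


def blur_level(sharpness: float) -> float:
    # Hypothetical mapping: blur falls as sharpness rises.
    return 1.0 / (1.0 + sharpness)


def sharpness_score(sharpness: float) -> float:
    # Hypothetical mapping of sharpness onto a 0..100 score.
    return 100.0 * sharpness / (1.0 + sharpness)


def sharpness_class(sharpness: float, boundary: float = 50.0) -> str:
    # Hypothetical two-class split at an assumed boundary.
    return "sharp" if sharpness >= boundary else "blurry"


def meets_condition(sharpness: float, cond: FocusCondition) -> bool:
    """Check one of the four condition variants against a sharpness value."""
    if cond.kind == "sharpness":
        return sharpness > cond.threshold
    if cond.kind == "blur":
        return blur_level(sharpness) < cond.threshold
    if cond.kind == "score":
        return sharpness_score(sharpness) > cond.threshold
    if cond.kind == "class":
        return sharpness_class(sharpness) == "sharp"
    raise ValueError(f"unknown condition kind: {cond.kind}")
```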

[0299] In this embodiment, the method for determining whether the sharpness of the second target adjustment image meets the second preset focus condition can be the same as the method for determining whether the sharpness of the first target adjustment image meets the first preset focus condition. Alternatively, the two methods can be different, in which case the method used for the first target adjustment image has a faster calculation speed than the method used for the second target adjustment image. The first preset focus condition and the second preset focus condition can be the same. Alternatively, the first preset focus condition and the second preset focus condition can be different conditions, with the first preset focus condition having a higher requirement for sharpness than the second preset focus condition.

[0300] In this embodiment, moving the first shooting module multiple times specifically includes: moving the first shooting module multiple times according to a movement rule. The movement rule refers to moving the first shooting module according to a fixed step size S, where 1 ≤ S ≤ 0.1 * S_all, and S_all represents the total number of steps of the focusing motor. Alternatively, the movement rule refers to moving the first shooting module according to an adaptive step size, the selection of which can follow traditional or improved hill-climbing methods.
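The two movement rules can be sketched as follows. This is an illustrative sketch only: the bound check follows the stated range 1 ≤ S ≤ 0.1 * S_all, while the halve-and-reverse hill climb is one simple textbook variant chosen for illustration, not the specific method of this application. The function names and the `measure` callback are assumptions.

```python
from typing import Callable, List


def validate_fixed_step(step: int, total_steps: int) -> int:
    """Check 1 <= S <= 0.1 * S_all using integer arithmetic."""
    if not (step >= 1 and 10 * step <= total_steps):
        raise ValueError("fixed step must satisfy 1 <= S <= 0.1 * S_all")
    return step


def sweep_positions(start: int, step: int, count: int, total_steps: int) -> List[int]:
    """Positions visited by `count` fixed-size moves, clamped to the travel range."""
    validate_fixed_step(step, total_steps)
    return [min(max(start + i * step, 0), total_steps) for i in range(1, count + 1)]


def hill_climb(start: int, measure: Callable[[int], float],
               total_steps: int, initial_step: int) -> int:
    """Adaptive-step search: keep moving while sharpness improves;
    otherwise reverse direction and halve the step."""
    pos, best = start, measure(start)
    step, direction = initial_step, 1
    while step >= 1:
        nxt = min(max(pos + direction * step, 0), total_steps)
        score = measure(nxt) if nxt != pos else best
        if score > best:
            pos, best = nxt, score      # improvement: accept the move
        else:
            direction = -direction      # no improvement: turn around
            step //= 2                  # and refine the step size
    return pos
```

The fixed-step rule visits a predictable set of positions, whereas the adaptive rule trades a few extra evaluations near the peak for a finer final position.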

[0301] In this embodiment, calculating the second adjustment focus position corresponding to the first shooting module based on the sharpness of the multiple sixth images specifically includes: calculating the sharpness of each of the multiple sixth images; selecting, based on their sharpness, the sixth image with the highest sharpness; and obtaining the position of the first shooting module at the time that sixth image was captured and using this position as the second adjustment focus position. The method for obtaining the sharpness of a sixth image includes, but is not limited to, traditional algorithms or neural network algorithms. Traditional algorithms may include at least one of the following: the Laplacian energy method, the Tenengrad gradient method, the image variance method, the FFT frequency domain method, the DCT frequency domain method, and the NIQE algorithm. The Laplacian energy method is based on the second derivative of an image and measures image sharpness by calculating the response of the Laplacian operator. The Tenengrad gradient method uses the Sobel operator to extract the image gradients in the horizontal and vertical directions and evaluates image sharpness from them. The image variance method evaluates image sharpness by calculating the variance of the pixel values. The FFT frequency domain method determines the degree of blur by analyzing the spectral characteristics of the image. The DCT (Discrete Cosine Transform) frequency domain method evaluates image sharpness from the high-frequency coefficients of the transformed image. NIQE (Naturalness Image Quality Evaluator) is a no-reference image quality assessment algorithm designed to evaluate the naturalness of an image, i.e., whether the image looks like a natural scene. Neural network algorithms include at least one deep learning prediction algorithm based on the MobileNet, Transformer, GhostNet, or ShuffleNet series.
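Three of the traditional measures named above, together with the selection of the sharpest frame, can be sketched in plain Python. This is a minimal sketch assuming grayscale images stored as lists of rows; border pixels are skipped for simplicity, and the function names and the `(position, image)` pairing are assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]  # grayscale image as rows of pixel values


def laplacian_energy(img: Image) -> float:
    """Sum of squared 4-neighbour Laplacian responses over interior pixels."""
    total = 0.0
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            lap = (img[y - 1][x] + img[y + 1][x] + img[y][x - 1]
                   + img[y][x + 1] - 4 * img[y][x])
            total += lap * lap
    return total


def tenengrad(img: Image) -> float:
    """Sum of squared Sobel gradient magnitudes over interior pixels."""
    total = 0.0
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            gx = ((img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1])
                  - (img[y - 1][x - 1] + 2 * img[y][x - 1] + img[y + 1][x - 1]))
            gy = ((img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1])
                  - (img[y - 1][x - 1] + 2 * img[y - 1][x] + img[y - 1][x + 1]))
            total += gx * gx + gy * gy
    return total


def variance(img: Image) -> float:
    """Population variance of all pixel values."""
    pixels = [p for row in img for p in row]
    mean = sum(pixels) / len(pixels)
    return sum((p - mean) ** 2 for p in pixels) / len(pixels)


def sharpest_position(frames: List[Tuple[int, Image]],
                      metric: Callable[[Image], float] = laplacian_energy) -> int:
    """Return the module position whose frame scores highest under `metric`."""
    return max(frames, key=lambda frame: metric(frame[1]))[0]
```

For example, a frame containing a hard step edge scores higher than a smooth ramp spanning the same intensity range under all three measures, so the position that produced the step-edge frame would be selected.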

[0302] In one possible implementation, a display module is also included, which can display a first image and / or a second image.

[0303] In one possible implementation, the display module is a touch display module capable of displaying at least a first image; the touch display module can sense the click or touch position on the first image and generate the target image position.

[0304] In this embodiment, as shown in Figure 8, the touch display module is used to display a first image 501 and a second image 503, and can sense the image pixel coordinates corresponding to the user's click position 507 on the displayed first image. The imaging device can obtain the image pixel coordinates corresponding to the user's click position on the first image displayed on the touch display module, and use them as the coordinates of the upper left corner vertex of the target image in the first image, i.e., the first pixel coordinates (x1, y1). In this embodiment, the target image 505 has a square outline with dimensions of 400 x 400.
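Deriving the target image region from a click position can be sketched as below. The click coordinates are taken as the upper left corner vertex, as described above; clamping the region so that it stays inside the first image is an assumption of this sketch, since the text does not say how clicks near the image border are handled. The function name and parameters are hypothetical.

```python
from typing import Tuple


def target_region(click_xy: Tuple[int, int],
                  image_size: Tuple[int, int],
                  side: int = 400) -> Tuple[int, int, int, int]:
    """Return (x1, y1, width, height) of the target image in the first image.

    click_xy   -- pixel coordinates of the click, used as the upper left corner
    image_size -- (width, height) of the first image
    side       -- side length of the square target outline
    """
    x1, y1 = click_xy
    width, height = image_size
    # Assumed behaviour: shift the region back inside the image if needed.
    x1 = min(max(x1, 0), max(width - side, 0))
    y1 = min(max(y1, 0), max(height - side, 0))
    return x1, y1, min(side, width), min(side, height)
```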

[0305] Those skilled in the art will clearly understand that the technical solutions of the embodiments of this application can be implemented by means of software and / or hardware. In this specification, "unit", "module" and "part" refer to software and / or hardware that can independently complete or cooperate with other components to complete a specific function, wherein the hardware may be, for example, a field-programmable gate array (FPGA), an integrated circuit (IC), etc.

[0306] Each processing unit and / or module in the embodiments of this application can be implemented by an analog circuit that implements the functions described in the embodiments of this application, or by software that executes the functions described in the embodiments of this application.

[0307] Referring to Figure 3, which shows a schematic diagram of the structure of an electronic device according to an embodiment of this application, the electronic device can be used to implement the method in the embodiment shown in Figure 1. As shown in Figure 3, the electronic device 600 may include: at least one central processing unit 601, at least one network interface 604, a user interface 603, a memory 605, and at least one communication bus 602.

[0308] The communication bus 602 is used to enable communication between these components.

[0309] The user interface 603 may include a display screen and a camera. Optionally, the user interface 603 may also include a standard wired interface and a wireless interface.

[0310] The network interface 604 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface).

[0311] The central processing unit 601 may include one or more processing cores. The central processing unit 601 connects to various parts within the electronic device 600 using various interfaces and lines. It executes various functions of the electronic device 600 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 605, and by calling data stored in the memory 605. Optionally, the central processing unit 601 may be implemented in at least one of the following hardware forms: a Digital Signal Processor (DSP), a Field-Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA). The central processing unit 601 may integrate one or a combination of several of the following: a central processing unit (CPU), a graphics processing unit (GPU), and a modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the content required for display on the screen; and the modem handles wireless communication. It is understood that the modem may alternatively not be integrated into the central processing unit 601 and may instead be implemented as a separate chip.

[0312] The memory 605 may include random access memory (RAM) or read-only memory (ROM). Optionally, the memory 605 may include a non-transitory computer-readable storage medium. The memory 605 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 605 may include a program storage area and a data storage area. The program storage area may store instructions for implementing an operating system, instructions for at least one function (such as touch functionality, sound playback functionality, image playback functionality, etc.), instructions for implementing the various method embodiments described above, etc.; the data storage area may store data involved in the various method embodiments described above, etc. Optionally, the memory 605 may also be at least one storage device located remotely from the aforementioned central processing unit 601. As shown in Figure 3, the memory 605, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and program instructions.

[0313] In the electronic device 600 shown in Figure 3, the user interface 603 is mainly used to provide an input interface for the user and to acquire user input data; while the central processing unit 601 can be used to call an application program of a focusing method stored in the memory 605 and specifically perform the following operations:

[0314] S1: Obtain a first image from the first shooting module, obtain a target image from the first image, and obtain the first pixel position of the target image in the first image, wherein the target image is the first image or a partial image of the first image;

[0315] S2: Obtain a second image from a second shooting module at a first preset distance from the first shooting module, wherein the field of view of the second image covers the field of view of the first image or the field of view of the second image at least partially overlaps with the field of view of the first image;

[0316] S3: Match the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, and obtain the second pixel position of the matching region in the second image;

[0317] S4: Calculate the third pixel position in the second image based on the first pixel position and the second pixel position;

[0318] S5: Obtain the mapping relationship between the third pixel position and the focus position of the first shooting module, and calculate the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position;

[0319] S6: Adjust the focus position of the first shooting module according to the first preset focus position corresponding to the target image scene.
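The mapping used in step S5 can be illustrated with a small fitted model. This is a sketch under strong assumptions: the third pixel position is reduced to a single horizontal coordinate, the mapping is taken to be linear, and the calibration samples are invented for the example. The application does not state the model form in this passage, and the function names are hypothetical.

```python
from typing import List, Tuple


def fit_linear(samples: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Least-squares fit of focus_position = slope * pixel_x + intercept.

    samples -- (pixel_x, focus_position) pairs collected during calibration
    """
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("need at least two distinct pixel positions")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_focus(pixel_x: float, model: Tuple[float, float]) -> int:
    """Map a third pixel position to a focus position in whole motor steps."""
    slope, intercept = model
    return round(slope * pixel_x + intercept)


# Invented calibration data for illustration only.
model = fit_linear([(100, 250), (200, 450), (300, 650)])
first_preset_focus_position = predict_focus(150, model)
```

Once fitted, the model is evaluated with a single multiplication and addition per focusing request, which is why a stored mapping can replace an iterative search for the initial focus position.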

[0320] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the above-described method. The computer-readable storage medium may include, but is not limited to, any type of disk, including floppy disks, optical disks, DVDs, CD-ROMs, microdrives, as well as magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, DRAMs, VRAMs, flash memory devices, magnetic cards or optical cards, nanosystems (including molecular memory ICs), or any type of medium or device suitable for storing instructions and / or data.

[0321] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, as some steps may be performed in other orders or simultaneously according to this application. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0322] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions in other embodiments.

[0323] In the several embodiments provided in this application, it should be understood that the disclosed apparatus can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules or units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some service interfaces; indirect couplings or communication connections between apparatuses or units may be electrical or other forms.

[0324] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0325] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0326] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable memory. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a memory and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned memory includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0327] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, which may include: flash drive, read-only memory (ROM), random access memory (RAM), disk or optical disk, etc.

[0328] The foregoing description is merely an exemplary embodiment of this disclosure and should not be construed as limiting the scope of this disclosure. Any equivalent changes and modifications made in accordance with the teachings of this disclosure shall still fall within the scope of this disclosure. Other embodiments of this disclosure will be readily apparent to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not described herein. The specification and embodiments are to be considered exemplary only, and the scope and spirit of this disclosure are defined by the claims.

Claims

1. A focusing method, characterized in that it includes the following steps: S1: Obtain a first image from the first shooting module, obtain a target image from the first image, and obtain the first pixel position of the target image in the first image, wherein the target image is the first image or a partial image of the first image; S2: Obtain a second image from a second shooting module at a first preset distance from the first shooting module, wherein the field of view of the second image covers the field of view of the first image or the field of view of the second image at least partially overlaps with the field of view of the first image; S3: Match the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, and obtain the second pixel position of the matching region in the second image; S4: Calculate the third pixel position in the second image based on the first pixel position and the second pixel position; S5: Obtain the mapping relationship between the third pixel position and the focus position of the first shooting module, and calculate the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position; S6: Adjust the focus position of the first shooting module according to the first preset focus position corresponding to the target image scene.

2. A focusing method as described in claim 1, characterized in that: The optical axis of the first shooting module when capturing the first image is set parallel to the optical axis of the second shooting module when capturing the second image.

3. A focusing method as described in claim 2, characterized in that: The optical axes of the first and second shooting modules are both fixed; or, at least one of the optical axes of the first and second shooting modules is rotatable.

4. A focusing method as described in claim 1, 2, or 3, characterized in that: Step S3 includes, before matching the target image with the second image through image registration, performing distortion correction on at least one of the first image and the second image.

5. A focusing method as described in claim 1, 2, or 3, characterized in that, Step S3 specifically includes: S31: Performing image transformation on the target image using an image transformation matrix to obtain a target transformed image, such that the target transformed image, the overlapping field of view of the target image and the second image have the same image resolution; S32: Match the target transformed image and the second image through image registration to obtain a matching region in the target transformed image that matches the second image, and obtain the second pixel position of the matching region in the second image.

6. A focusing method as described in claim 1, 2, or 3, characterized in that, Step S3, which involves matching the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, specifically includes: obtaining at least one corresponding low-frequency feature in the target image and the second image, and performing image registration on the target image and the second image based on the low-frequency feature to obtain a matching region in the target image that matches the second image.

7. A focusing method as described in claim 6, characterized in that, Obtaining the low-frequency features corresponding to an image specifically includes: performing at least one low-frequency transformation on the image to obtain corresponding first operation data; using the first operation data as the low-frequency features corresponding to the image when performing a low-frequency transformation on the image; and combining the multiple first operation data when performing multiple low-frequency transformations on the image to obtain the low-frequency features corresponding to the image.

8. A focusing method as described in claim 6, characterized in that: The low-frequency features include one or more of the following: the overall structural features of the image, the macroscopic statistical distribution features of the image, the global semantic features of the image, or the first feature data of the image; the high-frequency components or local pixel mutations in the first feature data of the image are suppressed or ignored.

9. A focusing method as described in claim 7, characterized in that: The low-frequency transformation includes one of the following: calculating the average brightness value of the image, obtaining the color histogram of the image, obtaining the brightness histogram of the image, performing a downsampling operation on the image and obtaining low-resolution data of the downsampled image, and processing the image through a low-pass filter and obtaining filtered data of the processed image. The first operation data is one of the following: average brightness value, color histogram, brightness histogram, low-resolution data, and filtered data.

10. A focusing method as described in claim 7, characterized in that: The low-frequency transformation is one of the following: performing a Fourier transform on the image and obtaining the first low-frequency coefficients of the transformed image; performing a discrete cosine transform on the image and obtaining the second low-frequency coefficients of the transformed image; and performing a wavelet transform on the image and obtaining the low-frequency subband coefficients of the transformed image. The first operation data is one of the first low-frequency coefficients, the second low-frequency coefficients, and the low-frequency subband coefficients.

11. A focusing method as described in claim 7, characterized in that: The low-frequency transformation is one of obtaining the density distribution map of the edges in the image and obtaining the overall direction statistics of the edges in the image, and the first operation data is one of the density distribution map and the overall direction statistics.

12. A focusing method as described in claim 1, 2, or 3, characterized in that: Step S5, obtaining the mapping relationship between the third pixel position and the focus position of the first shooting module, specifically includes: constructing a first prediction model; moving the first shooting module multiple times, obtaining the first adjustment focus position corresponding to each movement of the first shooting module, and obtaining a third image from the first shooting module; obtaining the pixel coordinates of the second preset pixel in the third image in the second image based on the third image; constructing a dataset based on the pixel coordinates of the second preset pixel in the second image and multiple first adjustment focus positions; fitting the first prediction model with the dataset to obtain a first fitting model; and obtaining the mapping relationship between the third pixel position and the focus position of the first shooting module based on the first fitting model.

13. A focusing method as described in claim 1, 2, or 3, characterized in that: Step S5, which calculates the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position, specifically includes: obtaining the first adjustment focus position based on the mapping relationship and the third pixel position; obtaining the fourth image from the first shooting module when the first shooting module moves to the first adjustment focus position, and obtaining the first target adjustment image from the fourth image; determining whether the sharpness of the first target adjustment image meets the first preset focus condition, and stopping execution if it does, and continuing execution if it does not; constructing a second prediction model, and obtaining the first predicted focus position of the first shooting module based on the first target adjustment image through the second prediction model; and using the first predicted focus position as the first preset focus position.

14. A focusing method as described in claim 1, 2, or 3, characterized in that, Step S1, obtaining the first pixel position of the target image in the first image, specifically includes: obtaining the first pixel position by detecting it through a target detection algorithm or receiving the first pixel position sent by the input module.

15. A focusing method according to claim 1, 2, or 3, characterized in that: The target image is an irregularly shaped image. The first pixel position of the target image in the first image is the pixel coordinates of the upper left corner vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the first image. The second pixel position of the matching region in the second image is the pixel coordinates of the upper left corner vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the second image.

16. A focusing method as described in claim 1, 2, or 3, characterized in that: The target image is a rectangular image. The first pixel position of the target image in the first image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the first image. The second pixel position of the matching region in the second image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the second image.

17. A focusing method as described in claim 1, 2, or 3, characterized in that, Step S6 specifically includes: S61: Obtain the current focus position of the first shooting module; S62: Calculate the number of steps and direction of movement required for the first shooting module to move from the current focus position to the first preset focus position based on the first preset focus position and the current focus position; S63: Adjust the focus position of the first shooting module according to the number of steps and the direction of movement to match the first preset focus position corresponding to the target image scene.

18. A focusing method as described in claim 17, characterized in that step S63 is followed by: S64: Calculate and obtain multiple pieces of image contrast or image sharpness information at multiple positions adjacent to the first preset focus position; S65: Obtain the maximum value from the multiple pieces of image contrast or image sharpness information, and obtain the focus position corresponding to that maximum value; S66: Move the focus position of the first shooting module to the focus position corresponding to the maximum value.

19. An imaging device, characterized in that it includes a first shooting module, a second shooting module, a first image acquisition module, a second image acquisition module, a first matching module, a first storage module, a first calculation module, a second calculation module, and a first adjustment module. The distance between the second shooting module and the first shooting module is a first preset length. The first and second shooting modules are each used to capture images; The first image acquisition module is used to acquire a first image from the first shooting module, acquire a target image from the first image, and acquire the first pixel position of the target image in the first image, wherein the target image is the first image or a partial image of the first image; The second image acquisition module is used to acquire a second image from the second shooting module, wherein the field of view of the second image covers the field of view of the first image or the field of view of the second image at least partially overlaps with the field of view of the first image; The first matching module is used to match the target image and the second image through image registration to obtain a matching region in the target image that matches the second image, and to obtain the second pixel position of the matching region in the second image; The first calculation module is used to calculate the third pixel position in the second image based on the first pixel position and the second pixel position; The first storage module is used to store the mapping relationship between the third pixel position and the focus position of the first shooting module; The second calculation module is used to obtain the mapping relationship between the focus position of the first shooting module and the third pixel position from the first storage module, and calculate the first preset focus position corresponding to the target image scene based on the mapping relationship and the third pixel position; The first adjustment module is used to adjust the focus position of the first shooting module according to the first preset focus position corresponding to the target image scene.

20. An imaging device as claimed in claim 19, characterized in that: The optical axis of the first shooting module when capturing the first image is set parallel to the optical axis of the second shooting module when capturing the second image.

21. An imaging device as claimed in claim 19, characterized in that: The optical axes of the first and second shooting modules are both fixed; or, at least one of the optical axes of the first and second shooting modules is rotatable.

22. An imaging device as claimed in claim 19, 20, or 21, characterized in that: The first matching module includes a first image transformation unit and a first image matching unit; The first image transformation unit is used to perform image transformation on the target image through an image transformation matrix to obtain a target transformed image, such that the target transformed image, the overlapping field of view of the target image and the second image have the same image resolution; The first image matching unit is used to match the target transformed image and the second image through image registration to obtain a matching region in the target transformed image that matches the second image, and to obtain the second pixel position of the matching region in the second image.

23. An imaging device as claimed in claim 19, 20, or 21, characterized in that: The first matching module matching the target image and the second image through image registration to obtain a matching region in the target image that matches the second image specifically includes: obtaining at least one corresponding low-frequency feature in the target image and the second image, and performing image registration on the target image and the second image based on the low-frequency feature to obtain a matching region in the target image that matches the second image.

24. An imaging device as claimed in claim 23, characterized in that: The first matching module obtains the low-frequency features corresponding to the image by performing at least one low-frequency transformation on the image to obtain corresponding first operation data, using the first operation data as the low-frequency features corresponding to the image when performing a low-frequency transformation on the image, and combining the multiple first operation data when performing multiple low-frequency transformations on the image to obtain the low-frequency features corresponding to the image.

25. An imaging device as claimed in claim 23, characterized in that: The low-frequency features include one or more of the following: the overall structural features of the image, the macroscopic statistical distribution features of the image, the global semantic features of the image, or first feature data of the image, in which high-frequency components or abrupt local pixel changes are suppressed or ignored.

26. An imaging device as claimed in claim 24, characterized in that: The low-frequency transformation includes one of the following: calculating the average brightness value of the image, obtaining the color histogram of the image, obtaining the brightness histogram of the image, performing a downsampling operation on the image and obtaining low-resolution data of the downsampled image, and processing the image through a low-pass filter and obtaining filtered data of the processed image. The first operation data is one of the following: average brightness value, color histogram, brightness histogram, low-resolution data, and filtered data.
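
The operations listed in this claim can be sketched in a few lines. The example below computes an average brightness value, a brightness histogram, and block-averaged low-resolution data, then concatenates them into one feature vector as one possible way of combining multiple first operation data. The function names, the 8-bit grayscale list-of-rows layout, the bin count, and the concatenation scheme are assumptions for illustration only.

```python
from typing import List

Gray = List[List[int]]  # assumed layout: rows of 8-bit grayscale values (0..255)


def mean_brightness(img: Gray) -> float:
    """Average brightness over all pixels."""
    total = sum(sum(row) for row in img)
    count = sum(len(row) for row in img)
    return total / count


def brightness_histogram(img: Gray, bins: int = 4) -> List[int]:
    """Count pixels per brightness bin; bins split 0..255 evenly."""
    hist = [0] * bins
    for row in img:
        for v in row:
            hist[min(v * bins // 256, bins - 1)] += 1
    return hist


def downsample(img: Gray, factor: int = 2) -> List[List[float]]:
    """Low-resolution data obtained by averaging factor x factor blocks."""
    out = []
    for r in range(len(img) // factor):
        out_row = []
        for c in range(len(img[0]) // factor):
            block = [img[r * factor + dy][c * factor + dx]
                     for dy in range(factor) for dx in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


def low_frequency_features(img: Gray, bins: int = 4, factor: int = 2) -> List[float]:
    """Concatenate several low-frequency measurements into one vector."""
    features: List[float] = [mean_brightness(img)]
    features.extend(brightness_histogram(img, bins))
    for row in downsample(img, factor):
        features.extend(row)
    return features
```

In practice the individual measurements would usually be normalized before being combined, since a histogram count and a brightness value live on different scales; that step is omitted here.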

27. An imaging device as claimed in claim 24, characterized in that: The low-frequency transformation is one of the following: performing a Fourier transform on the image and obtaining the first low-frequency coefficients of the transformed image; performing a discrete cosine transform on the image and obtaining the second low-frequency coefficients of the transformed image; and performing a wavelet transform on the image and obtaining the low-frequency subband coefficients of the transformed image. The first operation data is one of the first low-frequency coefficients, the second low-frequency coefficients, and the low-frequency subband coefficients.
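
To make the idea of low-frequency coefficients concrete, here is a minimal sketch that evaluates an unnormalized 2-D DCT-II directly from its definition and keeps only the top-left `keep` x `keep` block, which holds the lowest spatial frequencies. The function name, the `keep` parameter, and the absence of normalization are choices made for this example rather than details from the claim; a production system would normally use an optimized transform library.

```python
import math
from typing import List

Gray = List[List[float]]  # assumed layout: rows of grayscale values


def dct_low_frequency(img: Gray, keep: int = 2) -> List[List[float]]:
    """Return the keep x keep lowest-frequency coefficients of an
    unnormalized 2-D DCT-II of the image."""
    h, w = len(img), len(img[0])
    coeffs = []
    for u in range(min(keep, h)):          # vertical frequency index
        row = []
        for v in range(min(keep, w)):      # horizontal frequency index
            total = 0.0
            for y in range(h):
                cy = math.cos(math.pi * (2 * y + 1) * u / (2 * h))
                for x in range(w):
                    cx = math.cos(math.pi * (2 * x + 1) * v / (2 * w))
                    total += img[y][x] * cy * cx
            row.append(total)
        coeffs.append(row)
    return coeffs
```

For a perfectly flat image only the `[0][0]` (DC) coefficient is non-zero, and an image that varies only from left to right has no energy in the purely vertical coefficient `[1][0]`.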

28. An imaging device as claimed in claim 24, characterized in that: The low-frequency transformation is one of obtaining the density distribution map of the edges in the image and obtaining the overall direction statistics of the edges in the image, and the first operation data is one of the density distribution map and the overall direction statistics.

29. An imaging apparatus as claimed in claim 19, 20, or 21, characterized in that: It also includes a mapping relationship acquisition module, which is used to: construct a first prediction model; move the first shooting module multiple times and, for each movement, obtain the corresponding first adjustment focus position and a third image from the first shooting module; obtain, based on the third image, the pixel coordinates in the second image of a second preset pixel of the third image; construct a dataset from these pixel coordinates and the multiple first adjustment focus positions; fit the first prediction model with the dataset to obtain a first fitting model; and obtain, based on the first fitting model, the mapping relationship between the third pixel position and the focus position of the first shooting module.
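
The calibration loop in this claim amounts to collecting (pixel coordinate, focus position) pairs and fitting a model to them. The sketch below assumes, purely for illustration, that the first prediction model is a straight line in a single pixel coordinate and fits it by ordinary least squares; the claim does not state the model form, and the names `fit_linear_mapping` and `predict_focus` are hypothetical.

```python
from typing import List, Tuple


def fit_linear_mapping(pixel_xs: List[float],
                       focus_positions: List[float]) -> Tuple[float, float]:
    """Least-squares fit of focus = slope * pixel_x + intercept."""
    n = len(pixel_xs)
    if n < 2 or n != len(focus_positions):
        raise ValueError("need at least two matching samples")
    mean_x = sum(pixel_xs) / n
    mean_f = sum(focus_positions) / n
    sxx = sum((x - mean_x) ** 2 for x in pixel_xs)
    if sxx == 0:
        raise ValueError("pixel coordinates must not all be identical")
    sxf = sum((x - mean_x) * (f - mean_f)
              for x, f in zip(pixel_xs, focus_positions))
    slope = sxf / sxx
    return slope, mean_f - slope * mean_x


def predict_focus(model: Tuple[float, float], pixel_x: float) -> float:
    """Apply the fitted mapping to a new pixel coordinate."""
    slope, intercept = model
    return slope * pixel_x + intercept
```

A call such as `predict_focus(fit_linear_mapping(xs, fs), x_new)` then plays the role of looking up a focus position from a pixel position. If the real relationship is nonlinear, a polynomial or lookup-table model could be swapped in without changing the surrounding procedure.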

30. An imaging device as claimed in claim 19, 20, or 21, characterized in that: The second calculation module is used to calculate and obtain the first preset focus position corresponding to the target image scene according to the mapping relationship and the third pixel position. Specifically, this includes: obtaining the first adjustment focus position according to the mapping relationship and the third pixel position; obtaining the fourth image from the first shooting module when the first shooting module moves to the first adjustment focus position, and obtaining the first target adjustment image from the fourth image; determining whether the sharpness of the first target adjustment image meets the first preset focus condition; and, if it does not meet the condition, constructing a second prediction model, obtaining the first predicted focus position of the first shooting module by applying the second prediction model to the first target adjustment image, and using the first predicted focus position as the first preset focus position.

31. An imaging apparatus as claimed in claim 19, 20, or 21, characterized in that: It also includes an input module; the first image acquisition module is used to detect and acquire the first pixel position through an object detection algorithm or to receive the first pixel position sent by the input module.

32. An imaging device as claimed in claim 19, 20, or 21, characterized in that: The target image is an irregularly shaped image. The first pixel position of the target image in the first image is the pixel coordinates of the upper left corner vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the first image. The second pixel position of the matching region in the second image is the pixel coordinates of the upper left corner vertex of the minimum bounding rectangle of the matching region, or the center point or centroid of the minimum bounding rectangle in the second image.
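
For an irregularly shaped region, the candidate reference points named in this claim can be derived from the region's pixel coordinates. The sketch below assumes an axis-aligned minimum bounding rectangle (a rotated minimum-area rectangle is another possible reading) and a hypothetical input format of `(x, y)` tuples. Note that for a rectangle the centroid coincides with the center point, so only one of the two is computed.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]  # assumed (x, y) pixel coordinate


def bounding_rectangle(pixels: List[Pixel]) -> Tuple[int, int, int, int]:
    """Axis-aligned bounding rectangle as (left, top, right, bottom)."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


def reference_points(pixels: List[Pixel]) -> Dict[str, Tuple[float, float]]:
    """Upper-left vertex and center point of the bounding rectangle."""
    left, top, right, bottom = bounding_rectangle(pixels)
    return {
        "upper_left": (left, top),
        "center": ((left + right) / 2, (top + bottom) / 2),
    }
```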

33. An imaging device as claimed in claim 19, 20, or 21, characterized in that: The target image is a rectangular image. The first pixel position of the target image in the first image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the first image. The second pixel position of the matching region in the second image is the pixel coordinate of the upper left vertex, center point, or centroid of the matching region in the second image.

34. An imaging apparatus as claimed in claim 19, 20, or 21, characterized in that: The first adjustment module includes a first position acquisition unit, a movement calculation unit, and a focusing unit. The first position acquisition unit is used to acquire the current focus position of the first shooting module. The movement calculation unit is used to calculate, based on the first preset focus position and the current focus position, the number of steps and the direction of movement required for the first shooting module to move from the current focus position to the first preset focus position. The focusing unit is used to adjust the focus position of the first shooting module according to the number of steps and the direction of movement so that it matches the first preset focus position corresponding to the target image scene.
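
The step and direction calculation is simple arithmetic once focus positions are expressed in motor steps. The sketch below assumes integer step positions and encodes direction as +1, -1, or 0; both conventions, and the name `plan_move`, are assumptions for this example.

```python
from typing import Tuple


def plan_move(current: int, preset: int) -> Tuple[int, int]:
    """Return (number_of_steps, direction) to go from `current` to `preset`.

    direction is +1 toward larger positions, -1 toward smaller ones,
    and 0 when the focus motor is already at the preset position.
    """
    delta = preset - current
    direction = (delta > 0) - (delta < 0)
    return abs(delta), direction
```

For example, `plan_move(120, 95)` yields 25 steps in the negative direction.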

35. An imaging device as described in claim 34, characterized in that, after adjusting the focus position of the first shooting module to match the first preset focus position corresponding to the target image scene according to the number of movement steps and the movement direction, the focusing unit is further used to: calculate and obtain multiple items of image contrast or image sharpness information at multiple positions adjacent to the first preset focus position; obtain the maximum value among the multiple items of image contrast or image sharpness information; obtain the focus position corresponding to that maximum value; and move the focus position of the first shooting module to the focus position corresponding to the maximum value.
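
This fine adjustment is a small local search around the preset position. The sketch below evaluates a caller-supplied sharpness function at a few neighbouring positions and returns the one with the highest score. The `sharpness_at` callback stands in for "move the lens, capture a frame, and measure contrast or sharpness"; it, the search radius, and the step spacing are hypothetical parameters introduced for this example.

```python
from typing import Callable


def refine_focus(preset: int,
                 sharpness_at: Callable[[int], float],
                 radius: int = 2,
                 step: int = 1) -> int:
    """Return the position near `preset` with the highest sharpness score.

    Candidates are preset - radius*step ... preset + radius*step.
    On ties, the candidate with the smallest position is returned.
    """
    candidates = [preset + k * step for k in range(-radius, radius + 1)]
    return max(candidates, key=sharpness_at)
```

The search can only reach positions within `radius * step` of the preset, so it corrects small residual errors of the predicted position rather than replacing the prediction itself.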

36. An imaging apparatus as claimed in claim 19, 20, or 21, characterized in that: It also includes a display module that can display a first image and / or a second image.

37. An imaging device as claimed in claim 36, characterized in that: The display module is a touch display module capable of displaying at least a first image; the touch display module can sense the click or touch position on the first image and generate the target image position.

38. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the steps of the method as described in any one of claims 1-18.

39. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by a processor, it implements the steps of the method as described in any one of claims 1-18.