A method for automatic composition of a camera shot in an offline state

By deploying lightweight image analysis models and mathematical aesthetic algorithms locally on the shooting device, the problems of offline use, data privacy and lack of intelligence in existing technologies are solved. It realizes real-time, intelligent and personalized composition guidance in environments without network connection, and provides a smooth user experience and security.

CN122244045A · Pending · Publication Date: 2026-06-19 · CHANGSHA LINGWEI INNOVATION TECHNOLOGY CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: CHANGSHA LINGWEI INNOVATION TECHNOLOGY CO LTD
Filing Date: 2026-05-21
Publication Date: 2026-06-19


IPC: G06T7/00; G06T5/70; G06T5/94; G06T5/80; G06T7/13; G06T7/62; G06T7/66; G06T7/64; G06N3/0495; G06N5/04; G06N3/096; G06V10/82; G06V10/52; G06V10/80

AI Tagging

Application Domain

Image enhancement; Image analysis


AI Technical Summary

Technical Problem

Existing photo composition assistance technologies cannot be used offline, rely on network connections which limits their use cases, pose data privacy risks, have significant response delays, lack sufficient intelligence, and are costly to use, making it difficult to meet users' needs for real-time, intelligent, personalized, and privacy-secure composition in various shooting environments.

Method used

A lightweight image analysis model is deployed locally on the shooting device. Combining mathematical aesthetic algorithms such as the Fibonacci spiral, the golden ratio, and visual center of gravity balance, it provides real-time composition guidance through multimodal methods such as voice, text, and visual markers. This enables geometric analysis of the shooting scene and calculation of the optimal composition point without the need for a network connection.

Benefits of technology

It enables real-time intelligent composition in various shooting environments, ensures data privacy and security, reduces usage costs, provides a smooth user experience and personalized composition suggestions, and breaks through the limitations of traditional static guide lines.



Abstract

This invention discloses a method for automatic composition in offline camera shooting. The method includes: acquiring continuous image frame data of the shooting scene in real time through the camera of the shooting device; performing geometric processing on the continuous image frame data to calculate the geometric parameters of visual elements; determining the main element and secondary elements in the current frame based on the geometric parameters of each visual element, and calculating the optimal composition point coordinates of the current frame using the visual center of gravity of the main element as a reference; and guiding the user to adjust the shooting angle and position in real time on the screen of the shooting device in a multimodal manner based on the deviation between the optimal composition point coordinates and the current position of the main element. This invention runs entirely locally on the device, requiring no network connection, and the image data is always stored locally on the user's device. It enables low-latency real-time composition guidance, effectively protects user privacy, reduces user costs, and is suitable for various indoor and outdoor shooting environments.


Description

Technical Field

[0001] This invention relates to the fields of image processing and computer vision technology, and in particular to a method for automatic composition when taking pictures with a camera in offline mode.

Background Technology

[0002] Photographic composition is one of the core elements determining photographic quality. Good composition effectively highlights the subject, balances the visual center of gravity, and enhances overall aesthetics and visual impact. As a crucial component of photographic skill, composition ability requires long-term training and aesthetic accumulation to master, posing a significant technical barrier for most non-professional photographers. With the widespread use of smartphones and portable cameras, more and more ordinary users are participating in daily photography activities. How to help non-professional users quickly achieve professional-level compositional effects has become an important research topic in the field of mobile photography assistance technology.

[0003] Currently, mainstream shooting assistance technologies rely primarily on traditional composition guidelines. These guidelines overlay static geometric reference lines such as the rule of thirds, diagonals, and golden ratio lines onto the viewfinder to provide positional references for composition. This approach is widely used in various shooting devices due to its simplicity and low resource consumption. However, this method merely mechanically divides the image into fixed reference areas, providing a static geometric framework unrelated to the content of the shooting scene. It cannot perform any form of dynamic analysis or targeted guidance based on the content, position, and size of actual elements such as people, objects, and buildings in the scene. In actual shooting, users still need to rely on their personal experience to determine the proper placement of the subject within the frame. Composition decisions depend entirely on the user's aesthetic sense and composition skills. The assistance tools themselves lack any intelligent composition analysis or suggestion capabilities, thus failing to meet users' needs for intelligent and personalized composition assistance services.

[0004] In recent years, with the rapid development of artificial intelligence (AI) technology, some products have begun to introduce large-scale cloud-based AI models for intelligent composition assistance. The basic principle is to upload real-time images captured by the camera to a cloud server via the internet. A large-scale AI model deployed in the cloud then identifies and semantically understands the scene content, infers composition optimization suggestions, and returns them to the device for display. While this solution is more intelligent than traditional guide-line solutions, its cloud-based processing architecture introduces a core, unavoidable flaw: a strong dependence on network connectivity. Because image data must be uploaded to the cloud for processing, the shooting device needs a consistently stable network connection to function properly. If the user is in an outdoor, mountainous, underwater, or underground shooting environment with no network coverage or a weak network signal, the solution becomes completely ineffective, which severely limits its application scenarios. Furthermore, the requirement to upload real-time image data to a third-party cloud server exposes users to the risk of data interception, storage, or misuse, which significantly reduces user acceptance of this solution given the increasing awareness of data security. At the same time, the entire chain of image uploading, cloud inference, and result delivery has inherent communication and computing latency, making it difficult to achieve a smooth real-time composition guidance experience. In addition, cloud-based large model services usually involve recurring API call fees, which increases the cost to the user.

[0005] In summary, existing photo composition assistance technologies have varying degrees of shortcomings in terms of intelligence, offline availability, data privacy protection, and usage costs, making it difficult to simultaneously meet users' needs for real-time, intelligent, personalized, and privacy-secure composition guidance services in various shooting environments. Therefore, the industry urgently needs a technical solution that can operate entirely locally on the shooting device, without requiring a network connection, to achieve real-time intelligent composition analysis, personalized composition suggestions, and multimodal guidance.

Summary of the Invention

[0006] To address the technical problems of existing technologies in camera composition assistance, such as inability to be used offline, high data privacy risks, significant response delays, insufficient intelligence, and high usage costs, this invention provides a method for automatic composition in offline camera shooting. This method deploys a lightweight image analysis model locally on the mobile shooting device, combining mathematical aesthetic algorithms such as the Fibonacci spiral, the golden ratio, and visual center of gravity balance to achieve real-time geometric analysis and optimal composition point calculation for various elements in the shooting scene. It then guides the user to complete professional-level composition adjustments in real time through multimodal methods such as voice, text, and visual markers. The entire process requires no network connection, and the image data is always stored locally on the user's device.

[0007] This invention provides a method for automatic framing of a camera image in offline mode, comprising the following steps:

[0008] Step 1: The camera's local scene real-time acquisition module calls the camera of the shooting device to acquire continuous image frame data of the shooting scene in real time at a preset frame rate;

[0009] Step 2: The local image real-time analysis module performs geometric processing on the continuous image frame data, extracts the geometric contours of each visual element in the image, and calculates the geometric center coordinates, geometric area, and geometric boundary rectangle parameters of each visual element.

[0010] Step 3: The local composition decision module determines the main and secondary elements in the current image based on the geometric center coordinates, geometric area, and geometric boundary rectangle parameters of each visual element, and calculates the optimal composition point coordinates of the current image based on the visual center of gravity of the main element.

[0011] Step 4: The real-time guidance module guides the user to adjust the shooting angle and position in real time on the shooting device screen in a multimodal manner based on the deviation between the optimal composition point coordinates and the current position of the main element.

[0012] In a preferred embodiment of the automatic framing method for offline camera photography provided by the present invention, step one includes a camera initialization stage, a parameter configuration stage, a real-time video stream acquisition stage, an image frame preprocessing stage, and a frame buffer queue management stage. The parameter configuration stage dynamically configures the acquisition frame rate, image resolution, color space, and exposure mode parameters according to the device hardware performance. The image frame preprocessing stage performs size normalization, color space conversion, and noise suppression filtering on the acquired raw image frames. The frame buffer queue management stage uses a double buffer or circular queue mechanism to manage the acquisition and transmission of continuous image frames.

[0013] In a preferred embodiment of the offline camera automatic composition method provided by the present invention, step two includes a target detection inference stage, a contour extraction and geometric abstraction stage, and a geometric parameter calculation stage. The target detection inference stage uses a lightweight neural network model deployed locally on the device to perform target detection inference on continuous image frame data, identify each visual element in the image and output the detection box and category label of each visual element. The contour extraction and geometric abstraction stage extracts the target contour of the image region within the detection box and abstracts the target contour of each visual element into a standard geometric shape for representation. The geometric parameter calculation stage calculates the geometric center coordinates, geometric area and geometric boundary rectangle parameters of each visual element that has completed the geometric processing in real time.
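The geometric parameter calculation stage can be illustrated with a minimal sketch. It assumes the extracted contour is available as an ordered list of polygon vertices in pixel coordinates and uses the shoelace formula, which is one common choice and not necessarily the one used by the invention. The function name `polygon_geometry` is hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def polygon_geometry(contour: List[Point]):
    """Return (area, (cx, cy), (x_min, y_min, width, height)) of a closed polygon.

    Assumption: the contour is a simple (non self-intersecting) polygon.
    """
    n = len(contour)
    if n < 3:
        raise ValueError("a contour needs at least three vertices")
    twice_area = 0.0  # signed, positive for counter-clockwise vertex order
    cx_acc = 0.0
    cy_acc = 0.0
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]  # wrap around to close the polygon
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx_acc += (x0 + x1) * cross
        cy_acc += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate contour with zero area")
    # Centroid of a polygon: sum / (6 * signed_area) = sum / (3 * twice_area)
    center = (cx_acc / (3.0 * twice_area), cy_acc / (3.0 * twice_area))
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    bounding_rect = (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))
    return abs(twice_area) / 2.0, center, bounding_rect
```

For a 4 by 2 rectangle with one corner at the origin, this yields an area of 8, a geometric center of (2, 1), and a bounding rectangle equal to the rectangle itself.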

[0014] In a preferred embodiment of the offline camera automatic composition method provided by the present invention, the target detection inference stage employs multi-scale feature fusion technology to enable the lightweight neural network model to perform target detection simultaneously on feature maps of different resolutions.

[0015] In a preferred embodiment of the method for automatic composition of camera images in offline mode provided by the present invention, the geometric parameter calculation stage further includes perspective distortion correction; the perspective distortion correction is performed by establishing a lens distortion correction model to correct the coordinates of the detection box in real time before calculating the geometric center and area, and the lens distortion correction model parameters include radial distortion coefficient and tangential distortion coefficient.
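The text names radial and tangential distortion coefficients but does not state the correction model. The sketch below assumes the widely used Brown-Conrady model with two radial terms (`k1`, `k2`) and two tangential terms (`p1`, `p2`), and inverts it by fixed-point iteration. The intrinsics `fx`, `fy`, `cx`, `cy` and all function names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Distortion:
    k1: float = 0.0  # radial coefficients
    k2: float = 0.0
    p1: float = 0.0  # tangential coefficients
    p2: float = 0.0


def distort(x: float, y: float, d: Distortion) -> Tuple[float, float]:
    """Forward model on normalized coordinates (ideal -> distorted)."""
    r2 = x * x + y * y
    radial = 1.0 + d.k1 * r2 + d.k2 * r2 * r2
    xd = x * radial + 2.0 * d.p1 * x * y + d.p2 * (r2 + 2.0 * x * x)
    yd = y * radial + d.p1 * (r2 + 2.0 * y * y) + 2.0 * d.p2 * x * y
    return xd, yd


def undistort(xd: float, yd: float, d: Distortion, iterations: int = 20) -> Tuple[float, float]:
    """Invert the forward model by fixed-point iteration (suits mild distortion)."""
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        radial = 1.0 + d.k1 * r2 + d.k2 * r2 * r2
        dx = 2.0 * d.p1 * x * y + d.p2 * (r2 + 2.0 * x * x)
        dy = d.p1 * (r2 + 2.0 * y * y) + 2.0 * d.p2 * x * y
        x = (xd - dx) / radial
        y = (yd - dy) / radial
    return x, y


def correct_pixel(u: float, v: float, fx: float, fy: float,
                  cx: float, cy: float, d: Distortion) -> Tuple[float, float]:
    """Correct one detection-box corner given in pixel coordinates."""
    x, y = undistort((u - cx) / fx, (v - cy) / fy, d)
    return x * fx + cx, y * fy + cy
```

Applying `correct_pixel` to the corners of a detection box before computing its center and area matches the ordering described above, in which the correction precedes the geometric calculation.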

[0016] In a preferred embodiment of the offline automatic composition method for camera photography provided by the present invention, step three includes a subject element scoring and determination stage, a multi-algorithm collaborative composition calculation stage, and an optimal composition point output stage. The subject element scoring and determination stage employs a multi-dimensional scoring mechanism to assign a subject weight score to each visual element. The visual element with the highest comprehensive score is determined as the subject element of the image, and the remaining visual elements are arranged as secondary elements according to their scores from highest to lowest. The multi-algorithm collaborative composition calculation stage comprehensively uses multiple algorithms to calculate the optimal composition point coordinates. The optimal composition point output stage performs a weighted comprehensive calculation based on the priority weights of various algorithms in the multi-algorithm collaborative composition calculation stage to obtain and output the optimal composition point coordinates (O_x, O_y).

[0017] In a preferred embodiment of the automatic composition method for offline camera shooting provided by the present invention, the multi-dimensional scoring mechanism includes geometric area weight, distance from the center of the image weight, target category priority weight, and detection confidence weight; the main element scoring stage adopts a dynamic weight adjustment strategy, which includes: when there is a single significant visual element in the image, the geometric area weight and the distance from the center of the image weight dominate the comprehensive score; when there are multiple visual elements with similar areas in the image, the target category priority weight and the detection confidence weight play a role.
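A minimal sketch of such a scoring mechanism follows. The four score terms mirror the four weights named above, but every numeric value (the two weight sets and the `dominance_ratio` threshold that decides whether a single significant element is present) is a hypothetical placeholder, since the text gives no figures.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Element:
    label: str
    area: float                   # geometric area in pixels
    center: Tuple[float, float]   # geometric center in pixels
    category_priority: float      # 0..1, higher means more likely a subject
    confidence: float             # 0..1 detection confidence


def pick_weights(elements: List[Element], dominance_ratio: float = 2.0):
    """Return (area, centrality, category, confidence) weights.

    Assumed rule: one element is 'significant' when its area is at least
    dominance_ratio times the next largest area.
    """
    areas = sorted((e.area for e in elements), reverse=True)
    single_dominant = len(areas) == 1 or areas[0] >= dominance_ratio * areas[1]
    if single_dominant:
        return (0.45, 0.35, 0.10, 0.10)  # area and centrality dominate
    return (0.20, 0.20, 0.30, 0.30)      # category and confidence break ties


def rank_elements(elements: List[Element], frame_w: float, frame_h: float) -> List[Element]:
    """Sort elements by score; the first is the subject, the rest are secondary."""
    if not elements:
        raise ValueError("no visual elements detected")
    w_area, w_center, w_cat, w_conf = pick_weights(elements)
    half_diag = math.hypot(frame_w / 2.0, frame_h / 2.0)

    def score(e: Element) -> float:
        area_term = min(e.area / (frame_w * frame_h), 1.0)
        dist = math.hypot(e.center[0] - frame_w / 2.0, e.center[1] - frame_h / 2.0)
        center_term = max(0.0, 1.0 - dist / half_diag)
        return (w_area * area_term + w_center * center_term
                + w_cat * e.category_priority + w_conf * e.confidence)

    return sorted(elements, key=score, reverse=True)
```

With two elements of similar area, the second weight set applies, so a confidently detected person can outrank a slightly larger but lower-priority building.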

[0018] In a preferred embodiment of the automatic composition method for offline camera shooting provided by the present invention, the multi-algorithm collaborative composition calculation stage comprehensively uses multiple algorithms to calculate the optimal composition point coordinates, including: using the Fibonacci spiral method as the first priority algorithm to generate a spiral curve based on the Fibonacci sequence within the image, with the center of the spiral as the most ideal visual focal point; using the golden section method as the second priority algorithm to divide the image horizontally and vertically according to the golden ratio to obtain four golden section points; using the visual center of gravity balance method as the third priority algorithm to perform a weighted average calculation based on the geometric center coordinates of each visual element combined with visual weights; and using the rule of thirds as the fourth priority algorithm to evenly divide the image into a three-by-three grid, with the four intersections of the grid lines as composition reference points.
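Three of the four candidate-point calculations reduce to short formulas, sketched here under the assumption that coordinates are in pixels with the origin at the top-left corner. The Fibonacci spiral is left out because the text does not specify how the spiral is placed or oriented in the frame. Function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
PHI = (1 + 5 ** 0.5) / 2  # golden ratio, about 1.618


def golden_section_points(width: float, height: float) -> List[Point]:
    """Four points where the golden-ratio division lines cross."""
    xs = (width / PHI ** 2, width / PHI)    # about 0.382 and 0.618 of the width
    ys = (height / PHI ** 2, height / PHI)
    return [(x, y) for y in ys for x in xs]


def thirds_points(width: float, height: float) -> List[Point]:
    """Four grid-line intersections of a three-by-three grid."""
    xs = (width / 3, 2 * width / 3)
    ys = (height / 3, 2 * height / 3)
    return [(x, y) for y in ys for x in xs]


def visual_center_of_gravity(centers: Sequence[Point], weights: Sequence[float]) -> Point:
    """Weighted average of element centers; weights are the visual weights."""
    total = sum(weights)
    if len(centers) != len(weights) or total <= 0:
        raise ValueError("need one positive-sum weight per center")
    x = sum(c[0] * w for c, w in zip(centers, weights)) / total
    y = sum(c[1] * w for c, w in zip(centers, weights)) / total
    return x, y
```

Because 1/φ² + 1/φ = 1, the two golden division lines on each axis are mirror images about the frame center, just as the thirds lines are.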

[0019] In a preferred embodiment of the automatic composition method for offline camera shooting provided by the present invention, an adaptive algorithm weight allocation mechanism is adopted in the multi-algorithm collaborative composition calculation stage. The adaptive algorithm weight allocation mechanism includes: dynamically adjusting the weight coefficients of each algorithm according to the spatial distribution characteristics of visual elements in the current image; increasing the weight coefficient of the visual center of gravity balance method when the distribution of visual elements is relatively concentrated; increasing the weight coefficients of the Fibonacci spiral method and the golden section method when the distribution of visual elements is relatively dispersed; and increasing the weight coefficient of the rule of thirds method when computing resources are scarce or the frame rate decreases.
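The adaptive allocation and the subsequent weighted combination can be sketched as follows. The base weights follow the stated priority order, but their values, the multipliers, and the `spread` measure (assumed here to be a number in [0, 1] describing how dispersed the elements are) are all hypothetical; the text describes only the direction of each adjustment.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]

# Hypothetical base weights, ordered by the stated algorithm priority.
BASE_WEIGHTS = {"spiral": 0.4, "golden": 0.3, "balance": 0.2, "thirds": 0.1}


def adaptive_weights(spread: float, low_resources: bool = False,
                     concentrated_below: float = 0.5) -> Dict[str, float]:
    """Adjust and renormalize algorithm weights for the current frame."""
    raw = dict(BASE_WEIGHTS)
    if spread < concentrated_below:
        raw["balance"] *= 2.0   # concentrated elements: favor center-of-gravity balance
    else:
        raw["spiral"] *= 1.5    # dispersed elements: favor spiral and golden section
        raw["golden"] *= 1.5
    if low_resources:
        raw["thirds"] *= 3.0    # cheap fallback when compute is scarce or frame rate drops
    total = sum(raw.values())
    return {name: value / total for name, value in raw.items()}


def fuse(candidates: Dict[str, Point], weights: Dict[str, float]) -> Point:
    """Weighted average of the per-algorithm candidate points, giving (O_x, O_y)."""
    total = sum(weights[name] for name in candidates)
    if total <= 0:
        raise ValueError("weights of the supplied candidates must sum to a positive value")
    ox = sum(weights[name] * candidates[name][0] for name in candidates) / total
    oy = sum(weights[name] * candidates[name][1] for name in candidates) / total
    return ox, oy
```

`fuse` renormalizes over the candidates actually supplied, so an algorithm skipped for a given frame does not bias the result toward the origin.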

[0020] In a preferred embodiment of the method for automatic composition of camera photos in offline mode provided by the present invention, in step four, the multimodal approach includes visual guidance, text guidance, voice guidance, and a trigger and exit mechanism; the visual guidance approach includes: displaying a guide arrow pointing from the geometric center of the current subject element to the coordinates of the optimal composition point in real time by overlaying layers; displaying the coordinates of the optimal composition point in real time by a circle with a crosshair or a flashing cursor symbol; highlighting the geometric boundary of the detected subject element in real time with a green border; and displaying a composition quality score progress bar in real time at the edge of the screen.
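The guidance outputs all derive from the deviation between the subject's geometric center and the optimal composition point. The sketch below computes the arrow vector, a text prompt, and a composition quality score. The scoring formula, the `tolerance_px` threshold used as an exit condition, and the wording of the prompts are assumptions for illustration only.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Guidance:
    arrow: Tuple[float, float]  # vector from subject center to optimal point
    distance: float             # deviation in pixels
    score: float                # 0..1, drives the quality progress bar
    aligned: bool               # True once the deviation is within tolerance
    text: str                   # text (or spoken) prompt


def make_guidance(subject: Tuple[float, float], target: Tuple[float, float],
                  frame_w: float, frame_h: float, tolerance_px: float = 12.0) -> Guidance:
    dx = target[0] - subject[0]
    dy = target[1] - subject[1]  # image y axis grows downward
    distance = math.hypot(dx, dy)
    score = max(0.0, 1.0 - distance / math.hypot(frame_w, frame_h))
    aligned = distance <= tolerance_px
    if aligned:
        text = "Composition aligned"
    else:
        parts = []
        if abs(dx) > tolerance_px / 2:
            parts.append(f"{abs(dx):.0f} px {'right' if dx > 0 else 'left'}")
        if abs(dy) > tolerance_px / 2:
            parts.append(f"{abs(dy):.0f} px {'down' if dy > 0 else 'up'}")
        text = "Shift subject " + " and ".join(parts)
    return Guidance((dx, dy), distance, score, aligned, text)
```

The prompt is phrased in terms of where the subject should appear in the frame. Translating that into a camera movement depends on device orientation and is not attempted here.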

[0021] Compared with existing technologies, the method for automatic framing of offline camera shots provided by this invention has the following advantages:

[0022] 1. This invention deploys all computing and processing modules locally on the shooting device, enabling the entire composition analysis process to work normally without any network connection. This effectively breaks through the usage scenario limitations of existing cloud solutions and is suitable for various indoor and outdoor shooting environments, especially outdoor, mountainous, and water scenes without network coverage.

[0023] 2. By enabling image data to be processed entirely locally on the user's device without uploading to any external server, this invention fundamentally eliminates the risk of privacy breaches caused by the interception, storage, or misuse of user data, providing users with reliable data security.

[0024] 3. This invention achieves low-latency real-time composition guidance by employing a lightweight neural network model and efficient geometric algorithms in synergistic optimization. The total processing time for a single frame is controlled within 50 milliseconds, ensuring smoothness of the screen and real-time user guidance, and providing a smooth user experience.

[0025] 4. This invention reduces the economic burden and technical barriers for users by eliminating the need to subscribe to any cloud services and incurring no API call fees, which is conducive to the large-scale popularization and application of this technology.

[0026] 5. This invention dynamically calculates the optimal composition scheme based on the visual elements detected in the actual shooting scene, breaking through the limitation of traditional static auxiliary lines that only provide a fixed geometric reference frame, and providing users with a truly intelligent composition guidance service based on scene content.

Attached Figure Description

[0027] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0028] Figure 1 is a flowchart of a method for automatic framing of a camera in offline mode, provided in an embodiment of the present invention;

[0029] Figure 2 is a schematic diagram of the workflow of the camera local scene real-time acquisition module provided in an embodiment of the present invention;

[0030] Figure 3 is a schematic diagram of the geometric processing and parameter calculation of the local image real-time analysis module provided in an embodiment of the present invention;

[0031] Figure 4 is a schematic diagram of subject identification and optimal composition point calculation in the local composition decision module provided in an embodiment of the present invention;

[0032] Figure 5 is a schematic diagram of the multimodal guidance interface of the real-time guidance module provided in an embodiment of the present invention;

[0033] Figure 6 is a schematic diagram of the distribution of composition points based on the golden ratio and Fibonacci sequence provided in an embodiment of the present invention.

Detailed Implementation

[0034] The technical solution of the present invention will now be described with reference to the accompanying drawings.

[0035] In embodiments of the present invention, words such as "exemplarily" and "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design described as "exemplary" in the present invention should not be construed as superior or more advantageous than other embodiments or designs. Rather, the use of the term "exemplary" is intended to present the concept in a specific manner. Furthermore, in embodiments of the present invention, the meaning expressed by "and / or" can be both, or either one.

[0036] Please refer to Figures 1 to 6, in which: Figure 1 is a flowchart of a method for automatic framing of a camera in offline mode, provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the workflow of the camera local scene real-time acquisition module; Figure 3 is a schematic diagram of the geometric processing and parameter calculation of the local image real-time analysis module; Figure 4 is a schematic diagram of subject identification and optimal composition point calculation in the local composition decision module; Figure 5 is a schematic diagram of the multimodal guidance interface of the real-time guidance module; and Figure 6 is a schematic diagram of the distribution of composition points based on the golden ratio and Fibonacci sequence.

[0037] This invention provides a method for automatic composition in offline camera photography. The method comprises four core modules: a local scene real-time acquisition module, a local image real-time analysis module, a local composition decision module, and a real-time guidance module. These modules form a linear, unidirectional data flow, constituting a complete localized processing pipeline. The local scene real-time acquisition module acts as the data input layer, taking the physical shooting scene as input and outputting raw image frame data to the local image real-time analysis module. The local image real-time analysis module acts as the core computation layer, taking the raw image frames as input and outputting structured geometric description data to the local composition decision module. The local composition decision module acts as the core intelligent decision layer, taking geometric parameters and visual weights as input and outputting the optimal composition point coordinates to the real-time guidance module. The real-time guidance module acts as the human-computer interaction output layer, taking the composition point coordinates and deviations as input and outputting multimodal guidance information to the user. This pipelined architecture ensures that the entire method completes all computational processing locally on the device and can operate normally without any network connection.

[0038] The method specifically includes the following steps:

[0039] Step S1: The camera local scene real-time acquisition module calls the camera of the shooting device to acquire continuous image frame data of the shooting scene in real time at a preset frame rate.

[0040] Specifically, the camera local scene real-time acquisition module is the data input layer of this invention. Its function is to acquire continuous image frame data of the shooting scene in real time, providing raw input for the subsequent local image real-time analysis module. As shown in Figure 2, the workflow of this module includes the camera initialization stage, the acquisition parameter configuration stage, the real-time video stream acquisition stage, the image frame preprocessing stage, and the frame buffer queue management stage. The stages are connected by data streams to form a complete data acquisition link.

[0041] During the camera initialization phase, this module accesses and controls the camera of the shooting device by calling the standard camera interface provided by the operating system. Specifically, on the Android operating system platform, the Camera2 API interface is used for camera access. This interface provides complete control over the camera hardware and supports fine-tuning of parameters such as focus, exposure, and white balance. On the iOS operating system platform, the AVFoundation framework interface is used for camera control. This framework encapsulates high-quality image acquisition and video recording functions. On the embedded Linux operating system platform, the V4L2 (Video for Linux 2) interface is used for video device access. This interface is the standard driver framework for video devices under the Linux system. The camera initialization phase includes operations such as enumerating available camera devices, opening the target camera, creating a camera session, and configuring the camera's working mode to ensure that the camera is available and can continuously output image data.

[0042] During the parameter configuration phase, key image acquisition parameters are dynamically configured based on the device's hardware performance and the actual needs of the application scenario. The frame rate parameter can be set to a range of 15–30 frames per second. This frame rate range ensures smooth image updates, allowing the real-time guidance module to respond promptly to user position adjustments, while also ensuring stable operation on resource-constrained mobile devices. The image resolution parameter can be set to no less than 720 pixels. This resolution level is sufficient to support the detail extraction requirements of object detection and geometric analysis algorithms, while avoiding the computational burden of excessively high resolution. The color space parameter can be configured using the RGB24 format. In this format, each pixel occupies three bytes of storage space, representing the intensity values of the red, green, and blue color channels respectively. The RGB format facilitates image moment calculations in the subsequent geometric parameter calculation phase. The exposure mode parameter supports both automatic and manual exposure modes. In shooting environments with drastically changing lighting conditions, the automatic exposure mode can quickly adapt to changes in ambient brightness, while in scenes with stable lighting, the manual exposure mode can maintain consistent exposure parameters.
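The parameter ranges above can be captured in a small validated configuration object, which also makes the memory cost of RGB24 easy to check. This is a sketch under assumptions: "no less than 720 pixels" is read as a minimum frame height of 720, and the device tiers in `config_for_tier` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AcquisitionConfig:
    width: int = 1280
    height: int = 720
    fps: int = 30
    bytes_per_pixel: int = 3   # RGB24: one byte each for red, green, blue
    auto_exposure: bool = True

    def __post_init__(self) -> None:
        if not 15 <= self.fps <= 30:
            raise ValueError("fps must lie in the 15 to 30 range given in the text")
        if self.height < 720:
            # Assumption: the 720-pixel minimum refers to frame height.
            raise ValueError("height must be at least 720 pixels")

    def frame_bytes(self) -> int:
        """Uncompressed size of one frame."""
        return self.width * self.height * self.bytes_per_pixel

    def bytes_per_second(self) -> int:
        """Raw data rate the pipeline must sustain."""
        return self.frame_bytes() * self.fps


def config_for_tier(tier: str) -> AcquisitionConfig:
    """Hypothetical mapping from a device performance tier to settings."""
    if tier == "high":
        return AcquisitionConfig(width=1920, height=1080, fps=30)
    return AcquisitionConfig(width=1280, height=720, fps=15)
```

At 1280 by 720 in RGB24, one frame occupies 2,764,800 bytes, so 30 frames per second amounts to roughly 83 MB of raw data per second. This is why later stages scale frames down before analysis.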

[0043] During the real-time video stream acquisition phase, the camera continuously outputs image frame data according to the configured acquisition parameters. This phase uses the operating system's camera driver interface to continuously read the raw image data captured by the camera sensor, and the data stream is continuously piped to subsequent processing phases. To ensure the stability of the acquisition process, this phase is executed in a background thread to avoid blocking the main thread's response to the user interface. When a camera device malfunction or data stream interruption is detected, the system automatically triggers a reconnection mechanism to attempt to reinitialize the camera and resume data acquisition.
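The background-thread acquisition and reconnection behavior can be sketched without any platform camera API. In the sketch below, `open_camera` is a hypothetical callable returning an iterator of frames, `CameraError` is a stand-in for a device fault, and the fake camera exists only so the example runs anywhere. A real implementation would wrap Camera2, AVFoundation, or V4L2.

```python
import threading
from typing import Callable, Iterator, List


class CameraError(Exception):
    """Stand-in for a device malfunction or stream interruption."""


def acquisition_loop(open_camera: Callable[[], Iterator[str]],
                     on_frame: Callable[[str], None],
                     stop: threading.Event,
                     max_reconnects: int = 3) -> int:
    """Read frames until the stream ends or stop is set; reopen on errors."""
    reconnects = 0
    while not stop.is_set():
        try:
            for frame in open_camera():
                if stop.is_set():
                    return reconnects
                on_frame(frame)
            return reconnects          # stream ended normally
        except CameraError:
            reconnects += 1            # trigger the reconnection mechanism
            if reconnects > max_reconnects:
                raise
    return reconnects


def make_fake_camera() -> Callable[[], Iterator[str]]:
    """Fake device: the first session fails after two frames, the second succeeds."""
    state = {"opens": 0}

    def open_camera() -> Iterator[str]:
        state["opens"] += 1
        if state["opens"] == 1:
            yield "frame-0"
            yield "frame-1"
            raise CameraError("stream interrupted")
        yield "frame-2"
        yield "frame-3"

    return open_camera


def run_demo() -> List[str]:
    frames: List[str] = []
    stop = threading.Event()
    worker = threading.Thread(target=acquisition_loop,
                              args=(make_fake_camera(), frames.append, stop),
                              daemon=True)
    worker.start()            # acquisition runs off the main (UI) thread
    worker.join(timeout=2.0)
    return frames
```

Calling `run_demo()` collects all four frames even though the first camera session is interrupted, because the loop reopens the device once.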

[0044] In the image frame preprocessing stage, necessary preprocessing operations are performed on the acquired raw image frames to improve the processing accuracy and stability of subsequent analysis modules. Image size normalization is used to uniformly scale input images of different resolutions to the standard input size required by the local real-time image analysis module, such as scaling the original image to preset sizes like 320×240 pixels or 640×480 pixels. This process eliminates the impact of input resolution differences on the analysis model. Color space conversion is used to convert the input image from YUV format to RGB format to adapt to the color channel processing requirements in subsequent workflows. Noise suppression filtering uses a Gaussian blur algorithm to lightly smooth the image. This process effectively suppresses random noise introduced by the image sensor while preserving the edge features of target objects in the image, providing clearer input for the subsequent target detection inference stage.
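Two of the preprocessing operations are sketched below in plain Python for clarity rather than speed. The YUV variant is not specified in the text, so full-range BT.601 coefficients are assumed. The Gaussian filter is shown as a one-dimensional pass, which would be applied to rows and then columns for a two-dimensional blur.

```python
import math
from typing import List, Tuple


def yuv_to_rgb(y: int, u: int, v: int) -> Tuple[int, int, int]:
    """Convert one 8-bit YUV pixel to RGB (assumes full-range BT.601)."""
    def clamp(value: float) -> int:
        return max(0, min(255, int(round(value))))

    r = y + 1.402 * (v - 128)
    g = y - 0.344136 * (u - 128) - 0.714136 * (v - 128)
    b = y + 1.772 * (u - 128)
    return clamp(r), clamp(g), clamp(b)


def gaussian_kernel(radius: int, sigma: float) -> List[float]:
    """Normalized 1D Gaussian weights of length 2 * radius + 1."""
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_row(row: List[float], kernel: List[float]) -> List[float]:
    """1D convolution with edge clamping (border pixels are repeated)."""
    radius = len(kernel) // 2
    last = len(row) - 1
    out = []
    for i in range(len(row)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = min(max(i + k - radius, 0), last)
            acc += weight * row[j]
        out.append(acc)
    return out
```

A small `sigma` (for example about 1 pixel) keeps the smoothing light. Larger values suppress more noise but also soften the edges that the detection stage relies on.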

[0045] During the frame buffer queue management phase, a double-buffering or circular queue mechanism is employed to manage the acquisition and transmission of consecutive image frames. The double-buffering mechanism maintains two independent frame buffers. While one buffer is being read by the current processing stage, the other buffer receives new data frames from the camera. This alternating use of the two buffers ensures parallel execution of data acquisition and processing. The circular queue mechanism maintains a fixed-capacity frame buffer queue. New image frames are continuously enqueued, and when the queue is full, the oldest frame is overwritten. This mechanism manages the continuous data stream by cyclically utilizing storage space. The core function of frame buffer queue management is to eliminate image jitter or frame loss caused by uneven frame processing times. When the processing time of a certain image frame exceeds the acquisition interval, the buffering mechanism can absorb this time fluctuation, ensuring the relatively smooth operation of the processing pipeline.
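The circular queue behavior described above (fixed capacity, oldest frame overwritten when full) maps directly onto a bounded `collections.deque`. The class name `FrameRing` and the `dropped` counter are illustrative additions, not part of the described method.

```python
from collections import deque
from threading import Lock
from typing import Any, Optional


class FrameRing:
    """Fixed-capacity frame queue that overwrites the oldest frame when full."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._frames = deque(maxlen=capacity)
        self._lock = Lock()   # producer and consumer run on different threads
        self.dropped = 0      # frames overwritten before being consumed

    def push(self, frame: Any) -> None:
        with self._lock:
            if len(self._frames) == self._frames.maxlen:
                self.dropped += 1   # a full deque discards its oldest entry on append
            self._frames.append(frame)

    def pop_oldest(self) -> Optional[Any]:
        """Return the oldest buffered frame, or None when the queue is empty."""
        with self._lock:
            return self._frames.popleft() if self._frames else None

    def __len__(self) -> int:
        with self._lock:
            return len(self._frames)
```

If the analysis stage stalls for a few frames, the acquisition thread keeps pushing without blocking, and the consumer resumes from the most recent frames that fit in the buffer.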

[0046] Step S2: The local image real-time analysis module performs geometric processing on the continuous image frame data, extracts the geometric contours of each visual element in the image, and calculates the geometric center coordinates, geometric area, and geometric boundary rectangle parameters of each visual element.

[0047] Specifically, the local image real-time analysis module is the core computing layer of this invention. It runs entirely on the local shooting device and is responsible for real-time geometric processing, geometric parameter calculation, and geometric relationship analysis of the image frames input from the acquisition module. It outputs structured geometric description data of each visual element in the image, providing a numerical basis for the subsequent local composition decision module. As shown in Figure 3, the processing flow of this module includes the target detection inference stage, the contour extraction and geometric abstraction stage, and the geometric parameter calculation stage.

[0048] In the object detection inference stage, this module uses a lightweight neural network model deployed locally on the device to perform object detection inference on image frames, identifying various visual elements in the image, including objects such as people, animals, buildings, plants, and vehicles, and outputting detection boxes and category labels for each visual element. The lightweight neural network model is selected from one or more of MobileNet, ShuffleNet, EfficientNet-Lite, and YOLO-Tiny. These model architectures are specially designed to significantly reduce computational complexity and the number of model parameters while maintaining high detection accuracy. The MobileNet model uses depthwise separable convolution technology, decomposing the standard convolution operation into two stages: depthwise convolution and pointwise convolution, significantly reducing the number of multiplication and addition operations. The ShuffleNet model introduces channel shuffling, reducing the network width requirement while maintaining feature representation capabilities. The EfficientNet-Lite model coordinates and adjusts the network depth, width, and resolution through a composite scaling method, achieving a good balance between efficiency and accuracy. The YOLO-Tiny model is a lightweight version of the YOLO object detection algorithm, using a single-stage detection architecture to directly output the object's category and location information, resulting in extremely fast inference speed.

[0049] To enable these lightweight neural network models to run in real-time on CPUs or mobile NPUs / DSPs, this module also employs a series of model compression techniques for optimization. INT8 quantization converts model parameters and intermediate activation values from 32-bit floating-point format to 8-bit integer format. This conversion is achieved through post-training quantization or quantization-aware training. The quantized lightweight neural network model maintains high accuracy while reducing memory usage and computation to about one-quarter of the original. Model pruning evaluates the importance of each neuron or convolutional kernel, removes less important parameter structures, and retrains the sparse model to restore accuracy. This technique can reduce model complexity by 30-50%. Knowledge distillation utilizes a large, high-precision model as the teacher network to guide the lightweight student network in learning the teacher network's feature representations and decision logic, enabling the student network to achieve near-teacher network detection accuracy while maintaining a lightweight architecture. The combined application of these optimization techniques ensures that the module's single-frame inference time does not exceed 50 milliseconds, the model file size does not exceed 20MB, and it can achieve real-time processing capabilities of over 20 frames per second on mid-to-high-end mobile devices.
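To make the INT8 quantization step concrete, the sketch below shows symmetric per-tensor quantization of a list of floating-point weights. It is an assumption-laden illustration: the source does not say which quantization scheme is used, and the function names are hypothetical.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Returns (quantized_values, scale) such that value ~= quantized * scale.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Avoid division by zero for an all-zero (or empty) tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map 8-bit integers back to approximate floating-point values."""
    return [q * scale for q in quantized]
```

Each value is stored in 8 bits instead of 32, which is where the roughly one-quarter memory figure comes from; the rounding error per value is at most half of `scale`.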

[0050] In the object detection inference stage, this module also employs multi-scale feature fusion technology, enabling the lightweight neural network model to simultaneously perform object detection on feature maps of different resolutions, thereby effectively identifying visual elements of different sizes in the scene. Specifically, a feature pyramid structure is used. During the model's forward propagation, feature maps of different levels are extracted. High-level feature maps have greater semantic information but lower spatial resolution, suitable for detecting large targets; low-level feature maps retain more spatial details but have shallower semantic levels, suitable for detecting small targets. The feature pyramid merges feature maps of different levels through upsampling and fusion operations to generate multi-scale fused features, on which object detection prediction is performed. This design allows the model to maintain high detection accuracy for both large objects at close range and small secondary objects at a distance, effectively addressing the practical situation of large differences in target object size in shooting scenes.

[0051] In the contour extraction and geometric abstraction stage, this module processes the image region within the detection box output from the object detection inference stage, extracts the precise contour boundaries of each visual element, and abstracts each visual element into a standard geometric shape for representation. Contour extraction employs a combination of edge detection and contour tracking. First, the Canny or Sobel operator is used to perform edge detection on the detection region, obtaining a set of pixels with drastic grayscale changes in the image. Then, the Suzuki algorithm or a curvature-based algorithm is used to link and organize these edge pixels, forming a closed contour curve. The geometric abstraction process maps the extracted target contours to standard geometric shapes such as the minimum bounding rectangle, the minimum bounding convex polygon, or an equivalent ellipse. The minimum bounding rectangle is obtained by solving the axis-aligned bounding box or the rotated minimum bounding box of the target contour. The axis-aligned bounding box is parallel to the image coordinate axes, which is simple and fast to calculate. The rotated minimum bounding box allows for more compact wrapping of the target contour at any angle. The minimum bounding convex polygon is obtained by calculating the convex hull of the target contour. The convex hull is the minimum convex polygon that contains all points of the contour and can represent the overall shape features of the target. The equivalent ellipse is obtained by fitting the second central moment of the target contour. The major axis, minor axis and orientation angle of the equivalent ellipse can represent the size and orientation information of the target.
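As an illustration of the convex-polygon abstraction, the sketch below computes the convex hull of a set of contour points with Andrew's monotone chain algorithm. The source does not name a hull algorithm, so this choice is an assumption, and `convex_hull` is a hypothetical helper name.

```python
def convex_hull(points):
    """Convex hull of 2D points via Andrew's monotone chain.

    Returns hull vertices in counter-clockwise order for a y-up coordinate
    system (the same list reads clockwise in y-down image coordinates).
    """
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        # Z-component of the cross product (a - o) x (b - o).
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower = []
    for p in pts:
        # Drop points that would make a non-left turn (interior or collinear).
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)

    upper = []
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)

    # The last point of each chain is the first point of the other chain.
    return lower[:-1] + upper[:-1]
```

For a contour with n points this runs in O(n log n) time, dominated by the sort.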

[0052] During the geometric parameter calculation phase, this module calculates three geometric parameters in real time for each visual element that has completed the geometricization process: geometric center coordinates, geometric area, and geometric boundary rectangle. These parameters constitute the core input data for the subsequent local composition decision module.

[0053] The geometric center coordinates are calculated using the image moment method to determine the centroid coordinates of the target contour region of the visual element. Image moments are statistical quantities describing the pixel distribution characteristics of an image, obtained by integrating over the pixels within the contour region. For a two-dimensional image region, the moment $M_{pq}$ of order $(p+q)$ is defined as the weighted integral of the coordinates of each point within the region, where $p$ and $q$ are non-negative integers representing the orders in the x and y directions, respectively. After calculating the moments based on the binary contour image, the centroid coordinates of the target contour region are given by $C_x = M_{10} / M_{00}$ and $C_y = M_{01} / M_{00}$, where $M_{00}$ is the zeroth moment, i.e., the area, of the target contour region, and $M_{10}$ and $M_{01}$ are the first moments in the x and y directions, respectively. The centroid coordinates calculated by this method represent the visual center position of the visual element in the image, and are a key reference point for composition analysis.

[0054] Geometric area is represented by the area of the pixel region covered by the geometric shape of a visual element. This area value is measured in square pixels and reflects the visual proportion of the visual element in the image. In the implementation of this module, the geometric area is taken directly from the denominator $M_{00}$ of the centroid formula, which is the total number of pixels within the target contour region. The larger this value is, the more space the visual element occupies in the image, and the higher its visual weight.

[0055] The geometric boundary rectangle is calculated as the minimum bounding rectangle aligned to the axes of a visual element. This rectangle is parallel to the image coordinate axes and can be quickly obtained by traversing all pixels within the contour to find the minimum and maximum values of the x and y coordinates. The calculation result is represented in the form of a quadruple: minimum x-axis value, minimum y-axis value, maximum x-axis value, and maximum y-axis value. The geometric boundary rectangle determines the positional range and size of the visual element in the image and is the basic data for calculating the subject offset.
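The three geometric parameters can be computed in a single pass over a binary mask of the element, as in the following sketch. The mask representation (a list of rows of 0/1 values) and the function name `region_geometry` are assumptions made for illustration.

```python
def region_geometry(mask):
    """Return (centroid, area, bbox) for the nonzero pixels of a binary mask.

    mask is a list of rows of 0/1 values; bbox is (x_min, y_min, x_max, y_max).
    Returns None when the mask has no foreground pixels.
    """
    m00 = m10 = m01 = 0
    x_min = y_min = x_max = y_max = None
    for y, row in enumerate(mask):
        for x, value in enumerate(row):
            if not value:
                continue
            m00 += 1  # zeroth moment: pixel count, i.e. the area
            m10 += x  # first moment in the x direction
            m01 += y  # first moment in the y direction
            x_min = x if x_min is None else min(x_min, x)
            x_max = x if x_max is None else max(x_max, x)
            y_min = y if y_min is None else min(y_min, y)
            y_max = y if y_max is None else max(y_max, y)
    if m00 == 0:
        return None
    centroid = (m10 / m00, m01 / m00)
    return centroid, m00, (x_min, y_min, x_max, y_max)
```

Sharing the loop means the area comes for free from the centroid computation, as the description of the geometric area notes.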

[0056] Visual weights assign a numerical weight to each visual element by jointly considering several dimensions: geometric area, image position, target category, and detection confidence. In terms of geometric area, visual elements with larger areas receive higher weight coefficients. In terms of image position, visual elements located in the center of the image receive additional positional weights because the center area typically attracts more visual attention. In terms of target category, this module predefines a category priority table, in which visual elements with clear subject characteristics, such as people and animals, have higher priority than background visual elements such as buildings and vegetation. In terms of detection confidence, a higher confidence indicates that the model is more certain of its recognition of the visual element, and the corresponding confidence weight is also higher. The final visual weight is calculated by weighted summation, combining the normalized scores from the above dimensions to obtain the final visual weight value for each visual element.

[0057] It is worth noting that because mobile device cameras typically use wide-angle lenses, there is significant perspective distortion in the edge areas of the captured image. Directly calculating geometric parameters based on the original pixel coordinates can lead to subject positioning errors. To address this issue, this module introduces perspective distortion correction during the geometric parameter calculation stage. This correction first establishes a lens distortion correction model, whose parameters include radial and tangential distortion coefficients. These parameters are obtained through the camera calibration process or read from the device's factory settings. During geometric parameter calculation, the coordinates of the detection box are first substituted into the distortion correction model for inverse transformation to eliminate barrel or pincushion distortion caused by radial distortion. Then, geometric parameters such as centroid and area are calculated based on the corrected coordinates. This process effectively eliminates the impact of wide-angle lens perspective distortion on the accuracy of composition analysis, making the calculated geometric parameters more accurate and reliable.
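A minimal sketch of the inverse distortion step is shown below. It assumes the widely used radial-plus-tangential (Brown-Conrady) model with two radial coefficients `k1`, `k2` and two tangential coefficients `p1`, `p2`; the source does not specify the exact model or the number of coefficients, and the intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical calibration values.

```python
def distort_normalized(x, y, k1, k2, p1, p2):
    """Forward model: ideal normalized coordinates -> distorted coordinates."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return xd, yd


def undistort_normalized(xd, yd, k1, k2, p1, p2, iterations=20):
    """Invert the forward model by fixed-point iteration."""
    x, y = xd, yd  # initial guess: the distorted point itself
    for _ in range(iterations):
        r2 = x * x + y * y
        radial = 1.0 + k1 * r2 + k2 * r2 * r2
        dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
        dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
        x = (xd - dx) / radial
        y = (yd - dy) / radial
    return x, y


def undistort_pixel(u, v, fx, fy, cx, cy, k1, k2, p1, p2):
    """Correct one pixel coordinate, e.g. a detection-box corner."""
    x, y = undistort_normalized((u - cx) / fx, (v - cy) / fy, k1, k2, p1, p2)
    return x * fx + cx, y * fy + cy
```

The fixed-point iteration converges quickly for mild distortion; strongly distorted lenses may need more iterations or a different solver.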

[0058] Step S3: The local composition decision module determines the main element and secondary elements in the current image based on the geometric center coordinates, geometric area and geometric boundary rectangle parameters of each visual element, and calculates the optimal composition point coordinates of the current image based on the visual center of gravity of the main element.

[0059] Specifically, the local composition decision module is the core intelligent decision layer of this invention. It takes the geometric parameters and visual weight data of each visual element output by the local real-time image analysis module as input to complete core decision-making tasks such as subject element determination, composition strategy selection, and optimal composition point calculation. As shown in Figure 4, the processing flow of this module includes the main element scoring and determination stage, the multi-algorithm collaborative composition calculation stage, and the optimal composition point output stage.

[0060] In the main element scoring stage, this module uses a multi-dimensional scoring mechanism to calculate the main weight score of each visual element in the current image. The visual element with the highest comprehensive score is judged as the main element of the image, and the remaining visual elements are arranged as secondary elements according to their scores from high to low. The multi-dimensional scoring mechanism comprehensively considers the following four scoring dimensions: Geometric area weight dimension, which considers the size of the space occupied by the visual element in the picture. The larger the visual element, the higher its visual salience. This dimension has a weight coefficient of 0.35 in the comprehensive score; Distance from the center of the picture weight dimension, which considers the distance of the geometric center of the visual element from the exact center of the picture. The closer the visual element is, the more it conforms to the distribution law of visual focus. This dimension has a weight coefficient of 0.30 in the comprehensive score; Target category priority weight dimension, which considers the degree of prominence of the subject features of the category to which the visual element belongs in the photographic composition. Categories such as people and animals have higher subject priority, while categories such as buildings and vegetation have relatively lower priority. This dimension has a weight coefficient of 0.25 in the comprehensive score; Detection confidence weight dimension, which considers the degree of certainty of the target detection model. The higher the confidence, the more reliable the detection result. This dimension has a weight coefficient of 0.10 in the comprehensive score.
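The four-dimension scoring can be sketched as below. The weight coefficients 0.35, 0.30, 0.25, and 0.10 come from the description; everything else is an assumption for illustration: the normalization of each dimension to [0, 1], the contents of `CATEGORY_PRIORITY`, and the dictionary layout of an element.

```python
import math

# Weight coefficients stated in the description.
WEIGHTS = {"area": 0.35, "center": 0.30, "category": 0.25, "confidence": 0.10}

# Hypothetical priority table: the source only says that people and animals
# rank above buildings and vegetation.
CATEGORY_PRIORITY = {"person": 1.0, "animal": 0.9, "vehicle": 0.6,
                     "building": 0.4, "plant": 0.3}


def subject_score(element, width, height, max_area):
    """Weighted sum of four normalized scores for one visual element."""
    cx, cy = element["center"]
    half_diagonal = math.hypot(width, height) / 2.0
    distance = math.hypot(cx - width / 2.0, cy - height / 2.0)
    scores = {
        "area": element["area"] / max_area,          # assumed normalization
        "center": 1.0 - distance / half_diagonal,    # assumed normalization
        "category": CATEGORY_PRIORITY.get(element["category"], 0.0),
        "confidence": element["confidence"],
    }
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rank_elements(elements, width, height):
    """Sort elements by score, highest first; the first one is the subject."""
    if not elements:
        return []
    max_area = max(e["area"] for e in elements)
    return sorted(elements,
                  key=lambda e: subject_score(e, width, height, max_area),
                  reverse=True)
```

The first item of the ranked list plays the role of the main element and the rest are the secondary elements in descending score order.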

[0061] This module employs a dynamic weight adjustment strategy in the subject element scoring stage to adapt to different shooting scenarios. When a single prominent visual element exists in the image, the weights of geometric area and distance from the center of the image dominate the overall score. The system prioritizes identifying the largest and most prominent visual element as the subject element, a strategy consistent with photographic composition conventions in single-subject scenes. When multiple visual elements of similar size exist in the image, the target category priority weights and detection confidence weights play a crucial role. The system determines the visual element with more photographic subject characteristics as the subject element based on a pre-set category priority table. This strategy effectively distinguishes the true subject element from multiple candidate targets. The dynamic weight adjustment strategy makes the subject element determination results more consistent with the compositional habits and aesthetic standards of professional photography.

[0062] In the multi-algorithm collaborative composition calculation stage, this module uses four mathematical algorithms to calculate the coordinates of the optimal composition point. The algorithms are executed sequentially in order of priority, from highest to lowest, and the final optimal composition point is obtained through a weighted summation of their results. As shown in Figure 6, the four algorithms are the Fibonacci spiral method, the golden section method, the visual center of gravity balance method, and the rule of thirds (trisection) method.

[0063] The Fibonacci spiral method, as the highest priority algorithm, generates a spiral curve based on the Fibonacci sequence within the image. This spiral extends across the image according to the increasing trend of the Fibonacci sequence, with its center being the ideal visual focal point. The Fibonacci spiral is composed of a series of quarter-circle arcs, with the radius of each arc increasing according to the Fibonacci sequence. This curve closely approximates a logarithmic spiral in polar coordinates, for which the distance from any point on the curve to the spiral center grows exponentially with the angle between the line connecting that point to the spiral center and a fixed direction. In composition calculations, the position and orientation of the spiral are adjusted according to the distribution characteristics of visual elements in the current image, with the spiral center, as the core compositional reference point, having the highest weight coefficient.
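As a worked form of the logarithmic spiral mentioned above, the golden spiral can be written as follows. The symbols $a$, $b$, and $\varphi$ are introduced here for illustration and do not appear in the source.

$$
r(\theta) = a\,e^{b\theta}, \qquad b = \frac{2\ln\varphi}{\pi}, \qquad \varphi = \frac{1+\sqrt{5}}{2} \approx 1.618
$$

With this choice of $b$, $r(\theta + \pi/2) = \varphi\, r(\theta)$, so the radius grows by the golden ratio every quarter turn. The quarter-circle construction tends to the same growth rate because the ratio of consecutive Fibonacci numbers $F_{n+1}/F_n$ converges to $\varphi$.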

[0064] The golden section method, as the second priority algorithm, divides the image horizontally and vertically according to the golden ratio (approximately 1:0.618), resulting in four golden section points. These points lie at the intersections of the two golden section lines across the width of the image with the two across its height. The optimal composition goal of this algorithm is to move the visual center of the subject to one of these golden section points, placing the subject at a golden section position within the image. The golden ratio is a mathematical relationship that occurs widely in nature and is considered to have high aesthetic appeal. Applying it in photographic composition can produce harmonious and pleasing visual effects.

[0065] The visual center-of-gravity balancing method, as the third priority algorithm, calculates the overall visual center-of-gravity coordinates of the image as a weighted average of the geometric center coordinates of each visual element, using the visual weights. The calculation formulas are $V_x = \sum_i (W_i \times C_{xi}) / \sum_i W_i$ and $V_y = \sum_i (W_i \times C_{yi}) / \sum_i W_i$, where $(C_{xi}, C_{yi})$ are the geometric center coordinates of the $i$-th visual element and the weight $W_i$ is its geometric area or combined visual weight. This algorithm considers the spatial distribution of all visual elements in the image, aiming for the overall visual balance of the image rather than the optimal position of a single subject. It is suitable for scenes where the elements are relatively evenly distributed.
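The weighted average used by the visual center-of-gravity method can be sketched in a few lines. The tuple layout `(weight, cx, cy)` and the function name are assumptions made for illustration.

```python
def visual_center_of_gravity(elements):
    """Weighted average of element centers.

    elements is an iterable of (weight, cx, cy) tuples, where weight is the
    geometric area or combined visual weight of the element.
    """
    items = list(elements)
    total = sum(w for w, _, _ in items)
    if total <= 0:
        raise ValueError("total weight must be positive")
    vx = sum(w * cx for w, cx, _ in items) / total
    vy = sum(w * cy for w, _, cy in items) / total
    return vx, vy
```

For example, three elements with areas 5000, 3000, and 2000 centered at (100, 200), (300, 100), and (500, 400) give a visual center of gravity of (240.0, 210.0), pulled toward the largest element.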

[0066] The rule of thirds, as the fourth priority algorithm, divides the image into a 3×3 grid and uses the four intersections of the grid lines as compositional reference points. The rule of thirds is one of the most commonly used and fundamental compositional rules in photography. Its advantages lie in its simple calculation and fast response time, making it suitable for rapid decision-making scenarios where computing resources are limited.
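The golden section points and the rule-of-thirds points are both grid intersections and differ only in the division fraction, so one helper can produce both. The sketch below is illustrative; the function names are hypothetical, and picking the candidate nearest to the subject is an assumed selection rule that the source does not state.

```python
# 1 / phi, approximately 0.618, where phi is the golden ratio.
PHI_INV = (5 ** 0.5 - 1) / 2


def grid_points(width, height, fraction):
    """Four intersections of the lines at `fraction` and `1 - fraction`."""
    xs = (width * (1 - fraction), width * fraction)
    ys = (height * (1 - fraction), height * fraction)
    return [(x, y) for y in ys for x in xs]


def golden_points(width, height):
    """Intersections of the golden section lines (about 38.2% and 61.8%)."""
    return grid_points(width, height, PHI_INV)


def thirds_points(width, height):
    """Intersections of the rule-of-thirds lines (1/3 and 2/3)."""
    return grid_points(width, height, 2 / 3)


def nearest_point(points, target):
    """Candidate point closest to target (assumed selection rule)."""
    tx, ty = target
    return min(points, key=lambda p: (p[0] - tx) ** 2 + (p[1] - ty) ** 2)
```

On a 1000-pixel-wide frame the golden section lines fall near x = 382 and x = 618, slightly closer to the center than the thirds lines at about 333 and 667.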

[0067] In this module, the relationship between the Fibonacci spiral method and the golden section method is as follows: the Fibonacci spiral is a dynamic extension of the golden section, and the center of the spiral convergence is the optimal point of the golden section. This invention preferentially uses the center of the Fibonacci spiral as the optimal composition point, and uses the golden section intersection point for auxiliary verification.

[0068] This module introduces an adaptive algorithm weight allocation mechanism in the multi-algorithm collaborative composition calculation stage. This mechanism dynamically adjusts the weight coefficients of each algorithm based on the spatial distribution characteristics of visual elements in the current image. When visual elements are relatively concentrated, the weight coefficient of the visual center of gravity balance method is automatically increased, prioritizing the calculation of the optimal composition point for overall visual balance. When visual elements are more dispersed, the weight coefficients of the Fibonacci spiral method and the golden ratio method are automatically increased, prioritizing the calculation of composition points that conform to the golden ratio aesthetic standard. When computing resources are limited or the frame rate decreases, the system automatically increases the weight coefficient of the rule of thirds method to ensure the real-time nature of composition decisions. This adaptive mechanism ensures that reasonable and effective composition suggestions are obtained in different shooting scenarios.

[0069] In the optimal composition point output stage, this module performs a weighted comprehensive calculation based on the priority weights of each algorithm. The coordinates of the composition points calculated by the four algorithms are averaged using preset weights to obtain the final coordinates of the optimal composition point, which are then output. These coordinates give the point's position in the image in pixels and are the core basis on which the real-time guidance module guides the user. At the same time, this module calculates the deviation vector between the geometric center of the current subject element and the optimal composition point, consisting of a horizontal offset Δx and a vertical offset Δy, and a composition quality score calculated from these offsets. This score reflects how close the current composition is to the optimal composition; a higher score indicates that the composition is closer to the optimal state.
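The output stage can be sketched as a weighted average followed by an offset and a score. The source gives neither the preset algorithm weights nor the score formula, so both are assumptions here: the caller supplies the weights, and `quality_score` uses a hypothetical linear falloff with distance that is not expected to reproduce the score values quoted in the example embodiment.

```python
import math


def fuse_points(candidates):
    """Weighted average of candidate points.

    candidates is a list of ((x, y), weight) pairs, one per algorithm.
    """
    total = sum(w for _, w in candidates)
    if total <= 0:
        raise ValueError("total weight must be positive")
    x = sum(p[0] * w for p, w in candidates) / total
    y = sum(p[1] * w for p, w in candidates) / total
    return x, y


def deviation(subject_center, optimal_point):
    """Offset (dx, dy) from the subject center to the optimal point.

    In image coordinates, negative dx means left and negative dy means up.
    """
    return (optimal_point[0] - subject_center[0],
            optimal_point[1] - subject_center[1])


def quality_score(dx, dy, width, height):
    """Hypothetical score in [0, 100]: 100 at zero offset, falling linearly
    with the offset length relative to the image diagonal."""
    diagonal = math.hypot(width, height)
    return max(0.0, 100.0 * (1.0 - math.hypot(dx, dy) / diagonal))
```

With this sign convention, an optimal point that lies 120 pixels to the left of and 80 pixels above the subject center yields the offset (-120, -80).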

[0070] Step S4: The real-time guidance module guides the user to adjust the shooting angle and position in real time on the screen of the shooting device in a multimodal manner based on the deviation between the optimal composition point coordinates and the current position of the main element.

[0071] Specifically, the real-time guidance module is the human-computer interaction output layer of this invention. It is responsible for presenting the optimal composition point and composition adjustment information calculated by the local composition decision module on the shooting device screen in real time through a multimodal approach, guiding the user to adjust the shooting direction and position to ultimately achieve the optimal composition state. As shown in Figure 5, the multimodal guidance methods in this module include three types: visual guidance, text guidance, and voice guidance.

[0072] Visual guidance conveys compositional information to users by overlaying graphic elements on the screen. For example, the guide arrows marked in Figure 5 are displayed in real time as an overlay layer, pointing from the geometric center of the current subject element to the coordinates of the optimal composition point. The arrow direction intuitively indicates to the user which direction to adjust the shooting angle. The arrow length is proportional to the distance from the subject element to the optimal point. By following the arrow, the user can gradually move the mobile device toward the optimal composition position. The progressive guidance strategy dynamically changes the display parameters of the guide arrows according to the degree of deviation between the subject element and the optimal composition point. When the deviation is large, the arrow is displayed with a thicker line width and a brighter color, accompanied by a higher flashing frequency to attract the user's attention. As the subject element gradually approaches the optimal composition point, the line width and brightness of the arrow gradually decrease, and the flashing frequency also slows down accordingly, guiding the user to complete the composition adjustment smoothly rather than moving the mobile device drastically.

[0073] The optimal composition point marker is displayed in real-time using a circle with a crosshair or a flashing cursor, indicating the calculated optimal composition point position. This marker flashes continuously at a fixed position on the screen. When the user adjusts the device to move the main element to the marked position, the marker coincides with the highlighted area of the main element's outline, indicating that the composition has reached its optimal state. The main element outline highlighter displays the detected main element's geometric boundary in real-time with a green border. This green border perfectly matches the shape of the detection frame, helping the user confirm that the system has correctly identified the subject. The composition quality progress bar is displayed in real-time on the screen edge, such as the right edge. This progress bar is presented as a vertical bar graph, and its height is proportional to the current composition quality score. Different colored score intervals are marked on the progress bar. When the progress bar reaches the green optimal interval, it indicates that the current composition has reached its optimal state.

[0074] The text guidance method maps the subject's offset direction to concise text guidance information, which is displayed in real time at the top or bottom of the screen. For example, as shown in Figure 5, the text guidance information is displayed as follows: when the horizontal offset Δx is negative, it displays "Move Left"; when it is positive, it displays "Move Right". When the vertical offset Δy is negative, it displays "Move Up"; when it is positive, it displays "Move Down". When both Δx and Δy are non-zero, they are combined to display directional descriptions such as "Move Up to the Left" or "Move Down to the Right". When the composition quality score exceeds a preset threshold, it displays a prompt such as "The composition is optimal, you can take the picture!" The text guidance uses natural language generation technology based on semantic understanding. It automatically generates guidance statements that conform to natural language habits based on the vector direction and magnitude of the current subject offset, rather than simple directional codes. This design makes the text guidance more intuitive and easy to understand, reducing the user's cognitive burden. The text guidance supports multi-language adaptation and can automatically switch the language version of the guidance text according to the system language settings of the user's device.
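The sign-to-text mapping described above can be sketched as a small function. The prompt strings are taken from the description; the default `threshold` value and the empty-string result for a zero offset below the threshold are assumptions.

```python
def direction_hint(dx, dy, score, threshold=85.0):
    """Map an offset (dx, dy) and a quality score to a text prompt.

    Negative dx means left and negative dy means up (image coordinates).
    threshold is a hypothetical preset value.
    """
    if score > threshold:
        return "The composition is optimal, you can take the picture!"
    vertical = "Up" if dy < 0 else "Down" if dy > 0 else ""
    horizontal = "Left" if dx < 0 else "Right" if dx > 0 else ""
    if vertical and horizontal:
        return f"Move {vertical} to the {horizontal}"
    if vertical or horizontal:
        return f"Move {vertical or horizontal}"
    return ""  # no offset and score not above threshold: nothing to say
```

A production version would likely ignore offsets smaller than a few pixels so that the prompt does not flicker between directions near the optimal point.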

[0075] The voice guidance method utilizes the device's local text-to-speech interface to simultaneously broadcast text instructions as voice, enabling screen-free operation. Users do not need to continuously look at the screen during shooting; they can understand the current composition and adjustment direction simply by listening to the voice guidance. This function is particularly suitable for blind shooting or shooting from special angles. The voice broadcast frequency is dynamically adjusted based on the composition quality score. When the composition deviation is large (i.e., the score is low), it broadcasts every 2-3 seconds to avoid frequent broadcasts causing interference; when the composition is close to optimal (i.e., the score is high), it broadcasts every 0.5-1 second, increasing the broadcast frequency to help users accurately grasp the optimal composition moment.
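One way to realize the score-dependent broadcast frequency is a linear interpolation between a slow and a fast interval, sketched below. The source only gives the ranges (every 2-3 seconds at low scores, every 0.5-1 second at high scores); the linear mapping, the 0-100 score range, and the default endpoints are assumptions.

```python
def broadcast_interval(score, slow=3.0, fast=0.5):
    """Seconds between voice prompts, shrinking linearly as the score rises.

    score is assumed to lie in [0, 100]; values outside are clamped.
    """
    t = min(max(score / 100.0, 0.0), 1.0)
    return slow - (slow - fast) * t
```

With the defaults, a score of 50 gives an interval of 1.75 seconds.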

[0076] Under the triggering and exit mechanism of the real-time guidance module, the module starts automatically when the user activates shooting mode and operates synchronously with the camera's local scene real-time acquisition module and the local image real-time analysis module to provide continuous real-time composition guidance. When the composition quality score continuously exceeds a preset threshold for a preset duration, the system automatically broadcasts a voice prompt stating "Optimal composition," and the automatic shooting function can be triggered. This buffered judgment logic prevents accidental triggering: the composition is determined to be optimal only after the score has stayed above the threshold for the whole preset duration. This duration is set within a reasonable time range for the user to complete a stable press operation, such as 1.5 seconds. This design avoids accidental triggering caused by brief user pauses and repeated misjudgments caused by score fluctuations during rapid user adjustments, improving the reliability of system judgments and the accuracy of user operations.
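The hold-for-a-duration logic can be sketched as a small state machine that remembers when the score first rose above the threshold. The class name and the explicit `now` timestamp argument are hypothetical; the 1.5-second default comes from the description.

```python
class OptimalHoldDetector:
    """Report 'optimal' only after the score stays above a threshold
    continuously for hold_seconds."""

    def __init__(self, threshold, hold_seconds=1.5):
        self.threshold = threshold
        self.hold_seconds = hold_seconds
        self._above_since = None  # time at which the score first exceeded

    def update(self, score, now):
        """Feed one score sample taken at time `now` (in seconds).

        Returns True once the score has been above the threshold for at
        least hold_seconds without interruption.
        """
        if score > self.threshold:
            if self._above_since is None:
                self._above_since = now
            return now - self._above_since >= self.hold_seconds
        # Any dip at or below the threshold resets the timer.
        self._above_since = None
        return False
```

Passing the timestamp in explicitly, rather than reading a clock inside the class, keeps the logic deterministic and easy to test frame by frame.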

[0077] Example

[0078] To more clearly illustrate the specific operation of the method of the present invention, the following is a detailed description of the complete process using an actual shooting scenario. Assume a user uses a smartphone equipped with the system of the present invention to take a photo containing a person as the main subject.

[0079] After the user opens the phone's camera app and switches to composition assist mode, the camera's local scene real-time acquisition module begins operation. The camera interface is initialized, and the acquisition parameters are set according to preset configurations, including a capture frame rate of 20 frames per second, an image resolution of 1080 pixels, and an RGB24 color space format. Real-time video streaming begins, and raw image frames are preprocessed and sent to the frame buffer queue for further processing.

[0080] Image frames in the frame buffer queue are sequentially processed by the local real-time image analysis module. A lightweight neural network model identifies visual elements such as people, background buildings, and trees in the image, outputting detection boxes and category labels for each element. In the contour extraction and geometric abstraction stage, edge detection and contour tracking are performed on the region within the person detection box, abstracting the person's contour into a minimum bounding rectangle and an equivalent ellipse. In the geometric parameter calculation stage, the geometric center coordinates, geometric area, geometric boundary rectangle, and visual weights of the person, buildings, and trees are calculated respectively. The geometric area of the person is 5,000 square pixels, with the highest visual weight; the geometric area of the buildings is 3,000 square pixels, with the second highest visual weight; and the geometric area of the trees is 2,000 square pixels, with the lowest visual weight.

[0081] Geometric parameter data is fed into the local composition decision module for composition analysis. In the subject scoring stage, the scores from four dimensions are combined to determine the person as the main element of the image, and buildings and trees as secondary elements. In the multi-algorithm collaborative composition calculation stage, the Fibonacci spiral method is first applied to calculate the spiral's center position, then the golden ratio method is applied to calculate the positions of the four golden ratio points, followed by the visual center of gravity balance method to calculate the weighted visual center of gravity position, and finally the rule of thirds method to calculate the grid intersection points. The comprehensive weighted calculation yields the optimal composition point coordinates, located at the golden ratio point on the right side of the image. This point deviates from the current geometric center of the person by 120 pixels to the left and 80 pixels upwards, resulting in a composition quality score of 65.

[0082] The optimal composition point coordinates and deviation data are sent to the real-time guidance module for user guidance. A large green guide arrow pointing to the upper left is displayed on the screen, and a flashing circle-and-crosshair marker is displayed at the optimal composition point. The subject's outline is highlighted with a green border, and the composition quality progress bar on the right is at 65%. The text prompt "Move to the upper left" is displayed at the top of the screen, and the same guidance is simultaneously broadcast by voice. The user gradually moves the phone to the upper left according to the guidance, and the guide arrow gradually becomes smaller and dimmer, while the composition quality score gradually improves. When the user moves the phone to a position where the geometric center of the subject coincides with the optimal composition point, the composition quality score reaches 88 points and remains at that level for 1.5 seconds. The system automatically plays the voice prompt "The composition is now optimal," the guide arrow on the screen disappears, and the user can press the shutter to obtain a professionally composed photo.

[0083] Throughout the entire shooting guidance process, all image data is processed locally on the phone and is never uploaded to any network server. Users can use the intelligent composition assistance without a network connection, and because no image leaves the device, the privacy of the captured images is protected.

[0084] The above are merely specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for automatic framing of a camera image in offline mode, characterized in that it includes the following steps: Step 1: the local scene real-time acquisition module calls the camera of the shooting device to acquire continuous image frame data of the shooting scene in real time at a preset frame rate; Step 2: the local image real-time analysis module performs geometric processing on the continuous image frame data, extracts the geometric contours of each visual element in the image, and calculates the geometric center coordinates, geometric area, and geometric boundary rectangle parameters of each visual element; Step 3: the local composition decision module determines the main element and the secondary elements in the current image based on the geometric center coordinates, geometric area, and geometric boundary rectangle parameters of each visual element, and calculates the optimal composition point coordinates of the current image based on the visual center of gravity of the main element; Step 4: the real-time guidance module guides the user, in real time and through a multimodal approach on the screen of the shooting device, to adjust the shooting angle and position based on the deviation between the optimal composition point coordinates and the current position of the main element.

2. The method for automatic framing of a camera in offline mode according to claim 1, characterized in that Step 1 includes a camera initialization stage, an acquisition parameter configuration stage, a real-time video stream acquisition stage, an image frame preprocessing stage, and a frame buffer queue management stage; the acquisition parameter configuration stage dynamically configures the acquisition frame rate, image resolution, color space, and exposure mode parameters according to the device hardware performance; the image frame preprocessing stage performs size normalization, color space conversion, and noise suppression filtering on the acquired raw image frames; the frame buffer queue management stage employs a double-buffering or circular queue mechanism to manage the acquisition and transmission of consecutive image frames.
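The circular queue option named in claim 2 can be sketched as a fixed-capacity buffer that drops the oldest frame when full, so that analysis always sees recent frames. The class name `FrameRingBuffer`, the capacity of 4, and the "consume newest, discard older" policy are illustrative assumptions; the claim does not specify them.

```python
from collections import deque
from threading import Lock


class FrameRingBuffer:
    """Fixed-capacity circular queue of frames (hypothetical helper)."""

    def __init__(self, capacity: int = 4):
        self._frames = deque(maxlen=capacity)  # deque evicts the oldest item when full
        self._lock = Lock()                    # capture and analysis may run on different threads
        self.dropped = 0                       # number of frames overwritten before being read

    def push(self, frame) -> None:
        """Called by the capture side for every acquired frame."""
        with self._lock:
            if len(self._frames) == self._frames.maxlen:
                self.dropped += 1
            self._frames.append(frame)

    def pop_latest(self):
        """Return the newest frame and discard older ones; None if empty."""
        with self._lock:
            if not self._frames:
                return None
            frame = self._frames.pop()
            self._frames.clear()
            return frame


buf = FrameRingBuffer(capacity=3)
for i in range(5):
    buf.push(f"frame-{i}")
print(buf.pop_latest(), buf.dropped)
```

Dropping stale frames rather than queuing them trades completeness for latency, which suits live guidance where only the current view matters.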

3. The method for automatic framing of a camera in offline mode according to claim 1, characterized in that Step 2 includes an object detection inference stage, a contour extraction and geometric abstraction stage, and a geometric parameter calculation stage; the object detection inference stage uses a lightweight neural network model deployed locally on the device to perform object detection inference on the continuous image frame data, identify each visual element in the image, and output the detection box and category label of each visual element; the contour extraction and geometric abstraction stage extracts the target contour from the image region within the detection box and abstracts the target contour of each visual element into a standard geometric shape for representation; the geometric parameter calculation stage calculates, in real time, the geometric center coordinates, geometric area, and geometric boundary rectangle parameters of each visual element after geometric processing.
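As an illustration of the geometric parameter calculation stage, the sketch below computes the three quantities named in claim 3 for a contour given as polygon vertices. Representing the abstracted shape as a simple (non-self-intersecting) polygon and the function name `geometric_params` are assumptions; the area and centroid use the standard shoelace formulas.

```python
def geometric_params(contour):
    """Return (center, area, bounding_rect) for a simple polygon.

    contour: list of (x, y) vertices in order (clockwise or counter-clockwise).
    bounding_rect is (x_min, y_min, width, height).
    """
    n = len(contour)
    if n < 3:
        raise ValueError("a polygon needs at least three vertices")

    twice_area = 0.0  # signed; sign depends on vertex order
    cx_sum = 0.0
    cy_sum = 0.0
    for i in range(n):
        x0, y0 = contour[i]
        x1, y1 = contour[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx_sum += (x0 + x1) * cross
        cy_sum += (y0 + y1) * cross

    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")

    area = abs(twice_area) / 2.0
    # Centroid = sum / (6 * signed_area) = sum / (3 * twice_area)
    center = (cx_sum / (3.0 * twice_area), cy_sum / (3.0 * twice_area))

    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    bounding_rect = (min(xs), min(ys), max(xs) - min(xs), max(ys) - min(ys))
    return center, area, bounding_rect


center, area, rect = geometric_params([(0, 0), (4, 0), (4, 2), (0, 2)])
print(center, area, rect)
```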

4. The method for automatic framing of a camera in offline mode according to claim 3, characterized in that the object detection inference stage employs multi-scale feature fusion technology, enabling the lightweight neural network model to perform object detection simultaneously on feature maps of different resolutions.

5. The method for automatic framing of a camera in offline mode according to claim 3, characterized in that the geometric parameter calculation stage also includes perspective distortion correction; the perspective distortion correction establishes a lens distortion correction model and uses it to correct the coordinates of the detection box in real time before the geometric center and area are calculated; the parameters of the lens distortion correction model include radial distortion coefficients and tangential distortion coefficients.
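A correction model with radial and tangential coefficients is commonly written in the Brown-Conrady form; the sketch below assumes that form with two radial coefficients (k1, k2) and two tangential coefficients (p1, p2). The claim does not name a specific model, so the model choice, the fixed-point inversion, the function names, and the intrinsics `fx, fy, cx, cy` are all assumptions for illustration.

```python
def distort(x, y, k1, k2, p1, p2):
    """Map an ideal normalized point to its distorted position."""
    r2 = x * x + y * y
    radial = 1.0 + k1 * r2 + k2 * r2 * r2
    xd = x * radial + 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
    yd = y * radial + p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
    return xd, yd


def undistort(xd, yd, k1, k2, p1, p2, iterations=20):
    """Invert distort() by fixed-point iteration (adequate for mild distortion)."""
    x, y = xd, yd
    for _ in range(iterations):
        r2 = x * x + y * y
        radial = 1.0 + k1 * r2 + k2 * r2 * r2
        dx = 2.0 * p1 * x * y + p2 * (r2 + 2.0 * x * x)
        dy = p1 * (r2 + 2.0 * y * y) + 2.0 * p2 * x * y
        x = (xd - dx) / radial
        y = (yd - dy) / radial
    return x, y


def correct_box(box, fx, fy, cx, cy, coeffs):
    """Correct the two corners of a pixel-space box (x_min, y_min, x_max, y_max)."""
    out = []
    for u, v in ((box[0], box[1]), (box[2], box[3])):
        xn, yn = undistort((u - cx) / fx, (v - cy) / fy, *coeffs)
        out.extend((xn * fx + cx, yn * fy + cy))
    return tuple(out)


coeffs = (-0.1, 0.01, 0.001, -0.0005)  # illustrative values only
print(correct_box((100, 100, 500, 400), 1000.0, 1000.0, 960.0, 540.0, coeffs))
```

The corrected corners would then feed the center and area calculation, in the order the claim describes.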

6. The method for automatic framing of a camera in offline mode according to claim 1, characterized in that Step 3 includes a main element scoring and determination stage, a multi-algorithm collaborative composition calculation stage, and an optimal composition point output stage; the main element scoring and determination stage uses a multi-dimensional scoring mechanism to score the main-element weight of each visual element, the visual element with the highest comprehensive score is determined as the main element of the picture, and the remaining visual elements are ranked as secondary elements in descending order of score; the multi-algorithm collaborative composition calculation stage comprehensively utilizes multiple algorithms to calculate the coordinates of the optimal composition point; the optimal composition point output stage performs a weighted comprehensive calculation, according to priority weights, on the optimal composition point coordinates obtained by the various algorithms in the multi-algorithm collaborative composition calculation stage, to obtain and output the optimal composition point coordinates (O_x, O_y).
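One plausible reading of the "weighted comprehensive calculation" in claim 6 is a weighted mean of the per-algorithm candidate points. The sketch below assumes that reading; the function name `fuse_points` and the sample weights are hypothetical, and the claim itself does not fix the formula.

```python
def fuse_points(candidates):
    """Weighted mean of candidate composition points.

    candidates: list of ((x, y), weight) pairs, one per algorithm.
    Returns (O_x, O_y).
    """
    total = sum(weight for _, weight in candidates)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    o_x = sum(point[0] * weight for point, weight in candidates) / total
    o_y = sum(point[1] * weight for point, weight in candidates) / total
    return o_x, o_y


# Two candidates, the second weighted three times as heavily as the first.
print(fuse_points([((100, 100), 1.0), ((200, 300), 3.0)]))  # (175.0, 250.0)
```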

7. The method for automatic framing of a camera in offline mode according to claim 6, characterized in that the multi-dimensional scoring mechanism includes a geometric area weight, a distance-from-image-center weight, a target category priority weight, and a detection confidence weight; the main element scoring and determination stage adopts a dynamic weight adjustment strategy, which includes: when there is a single significant visual element in the picture, the geometric area weight and the distance-from-image-center weight dominate the comprehensive score; when there are multiple visual elements of similar size in the picture, the target category priority weight and the detection confidence weight come into play.
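The four-dimensional scoring with dynamic weights might look like the sketch below. Every number in it is an assumption made for illustration: the category priorities, the two weight sets, and the "largest area at least twice the second largest" test for a single significant element are not given in the claims.

```python
import math
from dataclasses import dataclass


@dataclass
class Element:
    label: str
    center: tuple      # (x, y) geometric center in pixels
    area: float        # geometric area in square pixels
    confidence: float  # detection confidence in [0, 1]


# Assumed category priorities; unknown labels fall back to 0.2.
CATEGORY_PRIORITY = {"person": 1.0, "building": 0.5, "tree": 0.3}


def choose_weights(elements):
    """Pick a weight set depending on whether one element clearly dominates."""
    areas = sorted((e.area for e in elements), reverse=True)
    single_dominant = len(areas) == 1 or areas[0] >= 2.0 * areas[1]
    if single_dominant:
        return {"area": 0.4, "center": 0.4, "category": 0.1, "confidence": 0.1}
    return {"area": 0.15, "center": 0.15, "category": 0.4, "confidence": 0.3}


def rank_elements(elements, width, height):
    """Return (score, element) pairs sorted from highest to lowest score."""
    w = choose_weights(elements)
    frame_area = width * height
    half_diag = math.hypot(width, height) / 2.0
    scored = []
    for e in elements:
        dist = math.hypot(e.center[0] - width / 2.0, e.center[1] - height / 2.0)
        score = (
            w["area"] * min(e.area / frame_area, 1.0)
            + w["center"] * (1.0 - min(dist / half_diag, 1.0))
            + w["category"] * CATEGORY_PRIORITY.get(e.label, 0.2)
            + w["confidence"] * e.confidence
        )
        scored.append((score, e))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


scene = [
    Element("building", (400, 300), 220_000, 0.8),
    Element("person", (960, 540), 200_000, 0.9),
]
ranked = rank_elements(scene, 1920, 1080)
print("main:", ranked[0][1].label, "secondary:", [e.label for _, e in ranked[1:]])
```

In this sample the two elements have similar areas, so the category and confidence weights decide the ranking and the person is selected as the main element.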

8. The method for automatic framing of a camera in offline mode according to claim 6, characterized in that the multi-algorithm collaborative composition calculation stage comprehensively utilizes multiple algorithms to calculate the optimal composition point coordinates, including: using the Fibonacci spiral method as the first priority algorithm to generate, within the image, a spiral curve based on the Fibonacci sequence, with the center of the spiral as the ideal visual focal point; using the golden section method as the second priority algorithm to divide the image horizontally and vertically according to the golden ratio, obtaining four golden section points; using the visual center of gravity balance method as the third priority algorithm to calculate a weighted average of the geometric center coordinates of the visual elements using their visual weights; and using the rule of thirds method as the fourth priority algorithm to divide the image evenly into a three-by-three grid, with the four intersections of the grid lines as composition reference points.
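Three of the four candidate-point calculations in claim 8 follow directly from their definitions and can be sketched as below. The Fibonacci spiral center is deliberately left out because the claim does not say how the spiral is fitted to an arbitrary frame. The function names are hypothetical, and using element area as the visual weight in the example is an assumption.

```python
PHI = (1 + 5 ** 0.5) / 2  # golden ratio, about 1.618


def golden_section_points(width, height):
    """Four points where the golden-ratio dividing lines cross."""
    xs = (width - width / PHI, width / PHI)     # about 0.382*w and 0.618*w
    ys = (height - height / PHI, height / PHI)
    return [(x, y) for y in ys for x in xs]


def thirds_points(width, height):
    """Four grid-line intersections of a three-by-three grid."""
    return [(width * i / 3, height * j / 3) for j in (1, 2) for i in (1, 2)]


def visual_center_of_gravity(elements):
    """Weighted average of element centers.

    elements: list of ((x, y), visual_weight) pairs.
    """
    total = sum(weight for _, weight in elements)
    if total <= 0:
        raise ValueError("visual weights must sum to a positive value")
    gx = sum(center[0] * weight for center, weight in elements) / total
    gy = sum(center[1] * weight for center, weight in elements) / total
    return gx, gy


print(thirds_points(1920, 1080))
print([(round(x), round(y)) for x, y in golden_section_points(1920, 1080)])
print(visual_center_of_gravity([((0, 0), 1.0), ((10, 10), 3.0)]))
```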

9. The method for automatic framing of a camera in offline mode according to claim 8, characterized in that the multi-algorithm collaborative composition calculation stage adopts an adaptive algorithm weight allocation mechanism, which includes: dynamically adjusting the weight coefficient of each algorithm according to the spatial distribution characteristics of the visual elements in the current image; increasing the weight coefficient of the visual center of gravity balance method when the distribution of the visual elements is relatively concentrated; increasing the weight coefficients of the Fibonacci spiral method and the golden section method when the distribution of the visual elements is relatively dispersed; and increasing the weight coefficient of the rule of thirds method when computing resources are scarce or the frame rate drops.
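The adaptive allocation in claim 9 can be sketched with a simple dispersion measure and multiplicative boosts. The base weights, the thresholds, the boost factor, the 80% frame-rate trigger, and the choice of RMS distance from the mean center as the dispersion measure are all assumptions; the claim states only the direction of each adjustment.

```python
import math

# Assumed base weights, ordered by the priorities listed in claim 8.
BASE_WEIGHTS = {"spiral": 0.4, "golden": 0.3, "gravity": 0.2, "thirds": 0.1}


def dispersion(centers, width, height):
    """RMS distance of element centers from their mean, relative to the frame diagonal."""
    n = len(centers)
    mx = sum(x for x, _ in centers) / n
    my = sum(y for _, y in centers) / n
    rms = math.sqrt(sum((x - mx) ** 2 + (y - my) ** 2 for x, y in centers) / n)
    return rms / math.hypot(width, height)


def adapt_weights(centers, width, height, fps, target_fps=30.0,
                  concentrated_below=0.10, dispersed_above=0.25, boost=1.5):
    """Return normalized algorithm weights adjusted for layout and load."""
    w = dict(BASE_WEIGHTS)
    d = dispersion(centers, width, height)
    if d < concentrated_below:
        w["gravity"] *= boost            # concentrated layout: favor balance
    elif d > dispersed_above:
        w["spiral"] *= boost             # dispersed layout: favor spiral and golden section
        w["golden"] *= boost
    if fps < 0.8 * target_fps:
        w["thirds"] *= boost             # under load: favor the cheapest rule
    total = sum(w.values())
    return {name: value / total for name, value in w.items()}


print(adapt_weights([(960, 540), (970, 545)], 1920, 1080, fps=30.0))
print(adapt_weights([(100, 100), (1800, 1000)], 1920, 1080, fps=15.0))
```

Normalizing after the boosts keeps the weights summing to one, so they can be passed straight into a weighted mean of candidate points.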

10. The method for automatic framing of a camera in offline mode according to claim 1, characterized in that, in Step 4, the multimodal approach includes visual guidance, text guidance, voice guidance, and a trigger and exit mechanism; the visual guidance includes: displaying in real time, on an overlay layer, a guide arrow pointing from the geometric center of the current main element to the optimal composition point coordinates; marking the optimal composition point coordinates in real time with a circle-and-crosshair or flashing cursor symbol; highlighting the detected geometric boundary of the main element with a green border in real time; and displaying a composition quality score progress bar in real time at the edge of the screen.
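The guide arrow and the matching text prompt can both be derived from the deviation between the main element's center and the optimal composition point. The sketch below assumes screen coordinates with y increasing downward, an eight-direction prompt vocabulary, and a 10-pixel dead zone; the function name `guide_arrow` and those values are illustrative, not taken from the claims.

```python
import math


def guide_arrow(subject, target, dead_zone=10.0):
    """Describe the arrow from the subject center to the optimal composition point.

    subject, target: (x, y) in screen pixels, y increasing downward.
    Returns a dict with the arrow length, its angle, and a text prompt.
    """
    dx = target[0] - subject[0]
    dy = target[1] - subject[1]
    distance = math.hypot(dx, dy)
    if distance <= dead_zone:
        return {"distance": distance, "angle_deg": None,
                "prompt": "The composition is now optimal"}

    # A component counts only if it is a meaningful share of the offset,
    # which splits the circle into eight sectors of roughly 45 degrees.
    cut = 0.4 * distance
    horiz = "left" if dx < -cut else "right" if dx > cut else ""
    vert = "upper" if dy < -cut else "lower" if dy > cut else ""
    if horiz and vert:
        direction = f"the {vert} {horiz}"
    elif horiz:
        direction = f"the {horiz}"
    else:
        direction = "the top" if vert == "upper" else "the bottom"

    # Angle measured counter-clockwise from the positive x axis, with y flipped upward.
    angle = math.degrees(math.atan2(-dy, dx))
    return {"distance": distance, "angle_deg": angle, "prompt": f"Move to {direction}"}


# Optimal point 120 px left of and 80 px above the subject center.
print(guide_arrow((1000, 600), (880, 520)))
```

The arrow length returned here could also drive the shrinking-arrow effect described in the embodiment, since it falls to zero as the two points converge.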