Virtual meeting area establishment method based on depth estimation model and related device
By combining a single camera with a depth estimation model and hand-raising detection, the problems of high hardware cost and insufficient boundary recognition in the establishment of virtual meeting areas are solved, achieving accurate boundary recognition and personnel identification of virtual meeting areas, and reducing system complexity and false recognition rate.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- DTEN TECH CORP LTD HANGZHOU
- Filing Date
- 2026-02-26
- Publication Date
- 2026-06-19
Figure CN122244298A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a method and related apparatus for establishing a virtual meeting area based on a depth estimation model.
Background Technology
[0002] Existing methods for establishing virtual meeting areas generally employ depth sensors, using binocular cameras, structured light, or Time-of-Flight (ToF) depth sensors to acquire 3D depth information of the scene. The boundaries of the meeting area are then defined in a 3D coordinate system, and the spatial position of individuals determines whether they are within the meeting area. Alternatively, methods based on individual position are used, employing human detection or head detection models to obtain the position of participants in the image, combining the camera's field of view and a pre-defined region of interest (ROI) to determine whether individuals are within the meeting room area. Another approach is based on the size of the head frame, using the detected head frame area and position to estimate the approximate distance between the individual and the camera, and estimating whether the individual is inside or outside the meeting area based on the changes in the head frame at different distances.
[0003] These methods have different technical drawbacks. Some depend on costly hardware: existing systems typically rely on multiple cameras (binocular / multi-camera) or dedicated depth sensors (such as ToF and structured light) to obtain depth information. Such hardware is expensive, complex to install and maintain, and unsuitable for large-scale deployment in ordinary conference equipment. Others lack boundary recognition: defining the conference area by the camera's field of view or a fixed region of interest (ROI) is a static method that cannot accurately reflect the actual boundaries of the conference room, so people behind glass or in the corridor are easily identified as conference participants. Others lack a dynamic calibration mechanism: once the conference layout or camera installation position changes, the detection area must be reset manually, which is cumbersome and adapts poorly. Still others have a high false recognition rate: relying solely on image coordinates or the apparent size of people to determine their position easily leads to misjudgments in multi-person activities or glass scenes. In particular, when someone passes through the corridor outside the conference room, the system has difficulty distinguishing them from the participants inside the room.
Summary of the Invention
[0004] The purpose of this invention is to overcome the shortcomings of the prior art. This invention provides a method and related device for establishing a virtual meeting area based on a depth estimation model, which can establish a virtual meeting area using a single camera and accurately determine the boundary between the inside and outside of the virtual meeting area, facilitating the subsequent determination of whether personnel are inside or outside the virtual meeting area and avoiding interference from personnel outside it.
[0005] To address at least one of the aforementioned technical problems, embodiments of the present invention provide a method for establishing a virtual meeting area based on a depth estimation model, the method comprising: performing real-time image acquisition processing based on a preset image acquisition device to obtain real-time scene images of the conference room scene; outputting, based on the depth estimation model, the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates; and establishing the virtual meeting area based on the preset posture detection of the participants and their position distribution in image coordinates and depth coordinates.
[0006] Optionally, the step of performing real-time image acquisition processing based on a preset image acquisition device to obtain real-time scene images in the conference room includes: A preset image acquisition device is set up outside the conference room scene so that the image acquired by the preset image acquisition device covers the conference room scene. The preset image acquisition device is a single camera device. The preset image acquisition device is controlled to perform real-time image acquisition and processing on the conference room scene to obtain real-time scene images in the conference room scene.
[0007] Optionally, the step of outputting the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates based on the depth estimation model includes: The real-time scene image is input into the depth estimation model, which performs position distribution recognition of the participants and outputs the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates. The depth estimation model is a monocular depth estimation neural network with an encoder-decoder structure combined with a multi-scale feature fusion network. The encoder uses a convolutional or vision Transformer backbone to extract global-local semantics. The decoder fuses upsampling with skip connections and outputs pixel-level depth / inverse depth maps. During training, the depth estimation model employs a hybrid loss consisting of scale-invariant error, BerHu / L1, an edge-fidelity gradient term, and normal consistency. Metric depth is obtained through scale regression / data calibration and distribution alignment.
[0008] Optionally, the preset posture detection is unilateral hand-raising detection. Establishing the virtual meeting area based on the preset posture detection of the participants and their position distribution in image coordinates and depth coordinates includes: Based on the preset posture detection of the participants, the real-time scene images are screened using corresponding judgment rules to obtain selected real-time scene images. The judgment rules define a hand-raising condition and define each real-time scene image as a valid or invalid frame. The hand-raising condition is defined as $h_{\text{wrist}} - h_{\text{shoulder}} > \tau$, where $\tau$ is a relative threshold, $h_{\text{shoulder}}$ is the shoulder joint height, and $h_{\text{wrist}}$ is the wrist joint height. A valid or invalid frame is defined as follows: when only one person in the same real-time scene image meets the hand-raising condition, it is a valid frame; when multiple people in the same real-time scene image meet the hand-raising condition, or when the same person meets the hand-raising condition with both hands, it is an invalid frame. After the selected real-time scene image is acquired, the head center pixel of the participant who meets the hand-raising condition is extracted from the selected real-time scene image based on the position distribution of the participants in image coordinates and depth coordinates. The median neighborhood depth of the head center pixel is then taken and used to establish the virtual meeting area.
[0009] Optionally, the step of using the median neighborhood depth of the head center pixel to establish a virtual meeting area, thereby obtaining the virtual meeting area, includes: The virtual meeting area is established by using the median neighborhood depth of the head center pixel in the virtual boundary fitting algorithm based on the maximum column depth.
[0010] Optionally, establishing the virtual meeting area from the median neighborhood depth of the head center pixel using the virtual boundary fitting algorithm based on the maximum column depth includes: For each selected real-time scene image (valid frame), the corresponding median neighborhood depth and its image coordinates are taken and aggregated to obtain an array with a length equal to the image width. Interpolation and boundary extrapolation are performed on this array to obtain a continuous initial virtual meeting boundary curve. The continuous initial virtual meeting boundary curve is corrected and smoothed to form the virtual meeting area.
[0011] Optionally, the step of correcting and smoothing the continuous initial virtual meeting boundary curve to form the virtual meeting area includes: Isolated jumps in the continuous initial virtual meeting boundary curve are corrected to form a corrected virtual meeting boundary curve. The corrected virtual meeting boundary curve is smoothed using a window median smoothing algorithm to form a smoothed virtual meeting boundary curve. Based on the smoothed virtual meeting boundary curve, positions are judged as inside or outside the boundary according to the maximum depth curve, and the virtual meeting area is formed from the results of this inside / outside determination.
[0012] In addition, embodiments of the present invention also provide a virtual meeting area establishment device based on a depth estimation model, the device comprising: Image acquisition module: Used to perform real-time image acquisition and processing based on preset image acquisition devices to obtain real-time scene images in the conference room; Output module: used to output the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates based on the depth estimation model; The establishment module is used to establish a virtual meeting area based on the preset posture detection of the participants and the position distribution of the participants in the image coordinates and depth coordinates.
[0013] In addition, embodiments of the present invention also provide an electronic device, including a processor and a memory, wherein the processor runs a computer program or code stored in the memory to implement the virtual meeting area establishment method as described in any of the above.
[0014] In addition, embodiments of the present invention also provide a computer-readable storage medium for storing a computer program or code, which, when executed by a processor, implements the virtual meeting area establishment method as described above.
[0015] In this embodiment of the invention, real-time image acquisition is performed by a preset image acquisition device to obtain a real-time scene image of the conference room scene; the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates is output based on a depth estimation model; and a virtual meeting area is established based on the preset posture detection of the participants and their position distribution in image coordinates and depth coordinates. A virtual meeting area can thus be established using a single camera, and the boundary between the inside and outside of the virtual meeting area can be accurately determined, which facilitates the subsequent determination of whether people are inside or outside the virtual meeting area and avoids interference from people outside it.
Description of the Drawings
[0016] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0017] Figure 1 is a flowchart illustrating the method for establishing a virtual meeting area based on a depth estimation model in an embodiment of the present invention. Figure 2 is a schematic diagram of the structural composition of the virtual meeting area establishment device based on the depth estimation model in an embodiment of the present invention. Figure 3 is a schematic diagram of the structural composition of the electronic device in an embodiment of the present invention.
Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] Embodiment 1. Referring to Figure 1, Figure 1 is a flowchart illustrating the method for establishing a virtual meeting area based on a depth estimation model in an embodiment of the present invention.
[0020] As shown in Figure 1, a method for establishing a virtual meeting area based on a depth estimation model comprises: S101: Performing real-time image acquisition processing based on a preset image acquisition device to obtain real-time scene images of the conference room scene. In a specific implementation of the present invention, this step includes: setting up a preset image acquisition device outside the conference room scene so that the image acquired by the preset image acquisition device covers the conference room scene, wherein the preset image acquisition device is a single-camera device; and controlling the preset image acquisition device to perform real-time image acquisition processing on the conference room scene to obtain a real-time scene image of the conference room scene.
[0021] Specifically, the first step is to determine the meeting room scene. Then, a single-camera device is set up based on the meeting room scene. The image captured by this single-camera device should cover the meeting room scene. The single camera used to capture the meeting room scene is movable and does not necessarily have to be fixed in one position. However, during image capture, the single camera should be fixed and not shaken, but its fixed position can be adjusted according to specific needs. At the start of the meeting, the single camera is controlled to perform continuous image capture (which can also be considered as video capture) at the corresponding exposure frequency according to the corresponding control signals, thereby obtaining real-time scene images of the meeting room scene.
[0022] S102: Based on the depth estimation model, output the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates. In the specific implementation of this invention, this step includes: inputting the real-time scene image into the depth estimation model, performing participant position distribution recognition in the depth estimation model, and outputting the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates. The depth estimation model is a monocular depth estimation neural network with an encoder-decoder structure combined with a multi-scale feature fusion network; the encoder uses a convolutional or vision Transformer backbone to extract global-local semantics; the decoder fuses upsampling with skip connections to output a pixel-level depth / inverse depth map; during training, the depth estimation model adopts a hybrid loss of scale-invariant error, BerHu / L1, an edge-fidelity gradient term, and normal consistency, and metric depth is achieved through scale regression / data calibration and distribution alignment.
[0023] Specifically, a depth estimation model is used to estimate the depth of a real-time scene image, which outputs the relative depth of each pixel or target region in the real-time scene image. This allows the determination of the position distribution of each participant in the real-time scene image in both image coordinates and depth coordinates. In other words, the real-time scene image needs to be input into the depth estimation model, where the model identifies the position distribution of participants and ultimately outputs the position distribution of each participant in the real-time scene image in both image coordinates and depth coordinates.
[0024] The depth estimation model is a monocular depth estimation neural network with an encoder-decoder structure combined with a multi-scale feature fusion network. The encoder uses a convolutional or vision Transformer (ViT) backbone to extract global-local semantics. The decoder uses upsampling and skip connections to output pixel-level depth / inverse depth maps (optionally, confidence scores can also be output). The training objective is a hybrid loss of scale-invariant error (SI-Log), BerHu / L1, an edge-fidelity gradient term, and (optionally) normal consistency. The metric-depth version uses scale regression / data calibration and distribution alignment to make the network output approximately consistent with real units (meters / millimeters), which facilitates threshold discrimination and geometric fitting.
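As an illustration of two of the loss terms named above, the following is a minimal sketch of a scale-invariant log error and a BerHu (reverse Huber) term over flat lists of positive depth values. The function names, the weighting factor `lam`, the `c_ratio` cut-off rule, and the weights in `hybrid_loss` are assumptions chosen for illustration; the source does not specify them, and the edge-fidelity gradient and normal consistency terms are omitted.

```python
import math

def si_log_loss(pred, gt, lam=0.5, eps=1e-6):
    """Scale-invariant log error; lam=1.0 makes it fully scale invariant."""
    d = [math.log(p + eps) - math.log(g + eps) for p, g in zip(pred, gt)]
    n = len(d)
    mean_sq = sum(v * v for v in d) / n
    mean = sum(d) / n
    return mean_sq - lam * mean * mean

def berhu_loss(pred, gt, c_ratio=0.2):
    """Reverse Huber: L1 for small residuals, scaled L2 for large ones."""
    err = [abs(p - g) for p, g in zip(pred, gt)]
    c = c_ratio * max(err)  # cut-off as a fraction of the largest residual (assumed rule)
    if c == 0.0:
        return 0.0
    total = sum(e if e <= c else (e * e + c * c) / (2.0 * c) for e in err)
    return total / len(err)

def hybrid_loss(pred, gt, w_si=1.0, w_berhu=0.5):
    """Weighted sum of the two terms; the weights are illustrative only."""
    return w_si * si_log_loss(pred, gt) + w_berhu * berhu_loss(pred, gt)
```

With `lam=1.0`, multiplying every prediction by a constant leaves the scale-invariant term essentially unchanged, which is why a separate scale regression or calibration step is needed to recover metric depth.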
[0025] Furthermore, selecting a monocular depth estimation neural network offers the following advantages. Geometric determination and correction: the depth output facilitates setting absolute thresholds (indoor / outdoor, person / behind glass) and directly interfaces with the glass plane fitting and "flattening" correction of this invention, eliminating the need for additional scale recalibration. Scene robustness: the encoder-decoder with multi-scale fusion provides stronger zero-shot generalization and edge fidelity for low-texture, high-reflection / high-gloss, and near-far mixed conference room scenes, stably providing usable contour and planar information. Edge-side friendliness: the structure is regular and operators are replaceable (for example, LayerNorm / activation can be equivalently implemented on the edge), supporting quantization (INT8) and pruning and meeting the real-time requirements of this system at medium resolution. Engineering adaptability: the output is aligned with the first-stage head detection results at the same resolution and can be called directly on the ROI / puzzle path; the output depth and confidence maps can be used as joint features for collision detection, indoor / outdoor discrimination, and reflection filtering.
[0026] Furthermore, in this embodiment, the monocular depth estimation neural network adopts an encoder-decoder structure with a Transformer backbone and outputs metric depth. It is exported to a general model format and deployed in a lightweight manner (such as quantized deployment) on an edge inference engine. Replacing it with a network that has equivalent output and error characteristics can achieve the same technical effect.
[0027] S103: Based on the preset posture detection of the participants and the position distribution of the participants in the image coordinates and depth coordinates, a virtual meeting area is established to obtain the virtual meeting area.
[0028] In the specific implementation of this invention, the preset posture detection is unilateral hand-raising detection. Establishing the virtual meeting area based on the preset posture detection of the participants and their position distribution in image coordinates and depth coordinates includes: screening the real-time scene images based on the preset posture detection of the participants using corresponding judgment rules to obtain a selected real-time scene image, wherein the judgment rules define a hand-raising condition and define each real-time scene image as a valid frame or an invalid frame. The hand-raising condition is defined as $h_{\text{wrist}} - h_{\text{shoulder}} > \tau$, where $\tau$ is a relative threshold, $h_{\text{shoulder}}$ is the shoulder joint height, and $h_{\text{wrist}}$ is the wrist joint height. A valid frame is one where only one person in the same real-time scene image meets the hand-raising condition; an invalid frame is one where multiple people in the same real-time scene image meet the hand-raising condition or one person meets the hand-raising condition with both hands. After the selected real-time scene image is acquired, the head center pixel of the participant meeting the hand-raising condition is extracted based on the position distribution of the participants in image coordinates and depth coordinates. The median neighborhood depth of the head center pixel is then taken and used to establish the virtual meeting area.
[0029] Furthermore, the process of establishing a virtual meeting region using the median neighborhood depth of the head center pixel to obtain the virtual meeting region includes: using a virtual boundary fitting algorithm based on the maximum column depth to establish the virtual meeting region using the median neighborhood depth of the head center pixel to obtain the virtual meeting region.
[0030] Furthermore, establishing the virtual meeting area from the median neighborhood depth of the head center pixel using the virtual boundary fitting algorithm based on the maximum column depth includes: for each selected real-time scene image (valid frame), taking the corresponding median neighborhood depth and its image coordinates and aggregating them to obtain an array with a length equal to the image width; performing interpolation and boundary extrapolation on this array to obtain a continuous initial virtual meeting boundary curve; and correcting and smoothing the continuous initial virtual meeting boundary curve to form the virtual meeting area.
[0031] Furthermore, the step of correcting and smoothing the continuous initial virtual meeting boundary curve to form the virtual meeting area includes: correcting isolated jumps in the continuous initial virtual meeting boundary curve to form a corrected virtual meeting boundary curve; smoothing the corrected virtual meeting boundary curve using a window median smoothing algorithm to form a smoothed virtual meeting boundary curve; and judging positions as inside or outside the boundary based on the smoothed virtual meeting boundary curve according to the maximum depth curve, and forming the virtual meeting area from the results of this inside / outside determination.
[0032] Specifically, the preset posture detection in this embodiment is unilateral hand raising detection. Choosing unilateral hand raising detection has the following advantages: strong detectability, relying only on the relative vertical relationship between the "shoulder" and "wrist", simple judgment, good real-time performance, and suitable for edge deployment; less ambiguity, the unilateral hand raising has a clear shape in the image, fewer false triggers, and facilitates stable acquisition of sample points near the boundary; occlusion robustness, even if the elbow or part of the upper limb is occluded, it can be determined as long as the shoulder and wrist are visible.
[0033] The hand-raising condition is defined as $h_{\text{wrist}} - h_{\text{shoulder}} > \tau$, where $\tau$ is a relative threshold (preferably 3% to 10% of the image height), $h_{\text{wrist}}$ denotes the wrist joint height, and $h_{\text{shoulder}}$ denotes the shoulder joint height. Each real-time scene image is then classified as a valid or invalid frame. A valid frame is one where only one person in the same real-time scene image meets the hand-raising condition, while an invalid frame is one where multiple people in the same real-time scene image meet the hand-raising condition or one person meets it with both hands. Valid frames are retained, while invalid frames are discarded.
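The judgment rules above can be sketched as follows. This is an illustrative sketch only: the keypoint dictionary layout, the key names such as `left_wrist`, and the function names are assumptions, and pixel y coordinates are assumed to grow downward, so a wrist that is "higher" has a smaller y value. A frame in which nobody raises a hand is treated here as simply not usable for calibration.

```python
def raised_hands(person, image_height, rel_threshold=0.05):
    """Return the sides whose wrist is above the shoulder by the threshold.

    `person` maps hypothetical keypoint names to (x, y) pixel coordinates,
    with y growing downward (assumed convention).
    """
    margin = rel_threshold * image_height  # e.g. 3% to 10% of the image height
    sides = []
    for side in ("left", "right"):
        wrist = person.get(side + "_wrist")
        shoulder = person.get(side + "_shoulder")
        if wrist is None or shoulder is None:
            continue  # a side with a missing joint cannot be judged
        if shoulder[1] - wrist[1] > margin:
            sides.append(side)
    return sides

def select_calibration_person(people, image_height, rel_threshold=0.05):
    """Index of the single person raising exactly one hand, else None."""
    raising = []
    for index, person in enumerate(people):
        sides = raised_hands(person, image_height, rel_threshold)
        if sides:
            raising.append((index, sides))
    if len(raising) != 1:
        return None  # nobody, or several people, raise a hand
    index, sides = raising[0]
    if len(sides) != 1:
        return None  # the same person raises both hands: invalid frame
    return index
```

A caller would keep the frame only when `select_calibration_person` returns an index, and then sample depth at that person's head.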
[0034] This allows for the selection of real-time scene images through preset pose detection, resulting in selected real-time scene images. Then, based on the positional distribution of participants in the image coordinates and depth coordinates, the center pixel of the head of the selected participants who meet the hand-raising condition is extracted from the real-time scene images. Furthermore, the median depth of the neighborhood of the selected head center pixel is obtained. Finally, the median depth of the neighborhood of the head center pixel is used to establish a virtual meeting area, thus obtaining a virtual meeting area. Based on these data points, the spatial range of the meeting area can be calibrated and fitted, that is, a virtual boundary is established in the depth dimension and the image plane.
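The sampling step described above, taking the median depth in a small neighborhood of the head center pixel, might look like the sketch below. The depth map is assumed to be a row-major list of rows, the head is assumed to be given as a bounding box `(x0, y0, x1, y1)`, non-positive depths are assumed to be invalid, and the window radius is an illustrative parameter that the source does not specify.

```python
from statistics import median

def head_center(box):
    """Center pixel (column, row) of a head box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return (x0 + x1) // 2, (y0 + y1) // 2

def neighborhood_median_depth(depth_map, cx, cy, radius=2):
    """Median of valid depths in a (2*radius+1)^2 window around (cx, cy)."""
    height, width = len(depth_map), len(depth_map[0])
    values = []
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            d = depth_map[y][x]
            if d > 0:  # non-positive depth is treated as invalid (assumption)
                values.append(d)
    return median(values) if values else None
```

Using the median rather than the single center value makes the sample less sensitive to an isolated bad pixel in the estimated depth map.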
[0035] The specific virtual boundary fitting algorithm is based on the maximum depth of each column, as follows. Input: sparse depth samples obtained from hand-raising calibration (the head center pixel of the participant who meets the hand-raising condition in each selected real-time scene image, and the median neighborhood depth of that pixel). For each valid frame (selected real-time scene image), take the depth $d$ of the head center pixel of the person raising a hand and its image column coordinate $x$. Aggregating by column yields an array whose length equals the image width $W$: $D = [D_0, D_1, \dots, D_{W-1}]$, where $D_x$ is the maximum depth observed at column $x$ (set to 0 if no observation is made). Interpolation and boundary extrapolation handle the uncovered columns ($D_x = 0$):
1. Zero segment with values at both ends: if a zero segment $[a, b]$ is adjacent at both ends to the non-zero values $D_{a-1}$ and $D_{b+1}$, perform linear interpolation at each position $x \in [a, b]$: $D_x = D_{a-1} + \frac{x - (a-1)}{(b+1) - (a-1)}\,(D_{b+1} - D_{a-1})$.
2. Zero segment with a value at only one end: fill the segment by extending the value of the known end outward across the zero segment (constant-value extrapolation).
3. All columns are zero: keep the values at zero, pending subsequent sampling or handling by a default strategy.
[0036] By interpolation and boundary extrapolation, a continuous initial boundary curve can be obtained, which can solve the problem of lateral incompleteness in hand-raising sampling.
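The aggregation and gap-filling steps can be sketched as below. This is a minimal illustration under stated assumptions: samples are `(column, depth)` pairs, a value of 0 marks a column without observations, and the function names are hypothetical.

```python
def aggregate_max_depth(samples, width):
    """Per-column maximum of (column, depth) samples; 0 means no observation."""
    boundary = [0.0] * width
    for col, depth in samples:
        if 0 <= col < width and depth > boundary[col]:
            boundary[col] = depth
    return boundary

def fill_gaps(boundary):
    """Linear interpolation between known columns, constant extrapolation at the ends."""
    known = [i for i, v in enumerate(boundary) if v > 0]
    if not known:
        return list(boundary)  # all zero: leave for later sampling
    out = list(boundary)
    first, last = known[0], known[-1]
    for i in range(first):                 # zero segment with a value only on the right
        out[i] = boundary[first]
    for i in range(last + 1, len(out)):    # zero segment with a value only on the left
        out[i] = boundary[last]
    for left, right in zip(known, known[1:]):
        span = right - left
        for i in range(left + 1, right):   # zero segment with values at both ends
            t = (i - left) / span
            out[i] = boundary[left] + t * (boundary[right] - boundary[left])
    return out
```

For example, samples at columns 2 and 6 of a 9-column image yield a curve that is flat up to column 2, rises linearly to column 6, and stays flat afterwards.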
[0037] Jump suppression and robust smoothing are performed to eliminate single-point anomalies and noise, and the following processing is executed several times (passes = 1–3): 1. Isolated jump correction: for each interior point $x$ ($0 < x < W-1$) of the boundary array $D$, if $|D_x - D_{x-1}| > T$ and $|D_x - D_{x+1}| > T$, then $D_x$ is considered an isolated spike and is replaced by the bilateral mean $D_x = (D_{x-1} + D_{x+1})/2$, where the threshold $T$ is set to 1.0–2.0 (preferably in units consistent with depth).
[0038] 2. Window median smoothing: for each position $x$ of the boundary array $D$, $D_x$ is replaced by the median of the non-zero values within the window $[x - r, x + r]$, where $r$ is the half window width (preferably 20–80).
[0039] By employing jump suppression and robust smoothing, sensor noise and false acquisition spikes / cliffs are suppressed, while preserving the large-scale trend of the boundary.
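A possible sketch of the two smoothing operators is given below. The spike test used here (a point that differs from both neighbors by more than the threshold) is an assumed reading of the isolated-jump criterion, and the function names and default parameter values are illustrative rather than taken from the source.

```python
from statistics import median

def fix_isolated_jumps(curve, threshold=1.5):
    """Replace single-point spikes with the mean of their two neighbors."""
    out = list(curve)
    for i in range(1, len(curve) - 1):
        left, mid, right = curve[i - 1], curve[i], curve[i + 1]
        # Assumed criterion: the point deviates from both neighbors.
        if abs(mid - left) > threshold and abs(mid - right) > threshold:
            out[i] = (left + right) / 2.0
    return out

def median_smooth(curve, half_window=20):
    """Replace each value with the median of the non-zero values in its window."""
    out = []
    for i in range(len(curve)):
        lo = max(0, i - half_window)
        hi = min(len(curve), i + half_window + 1)
        window = [v for v in curve[lo:hi] if v > 0]
        out.append(median(window) if window else 0.0)
    return out

def smooth_boundary(curve, passes=2, threshold=1.5, half_window=20):
    """Run jump correction followed by median smoothing for a few passes."""
    for _ in range(passes):
        curve = median_smooth(fix_isolated_jumps(curve, threshold), half_window)
    return curve
```

Both operators visit each column once per pass with a bounded window, so the cost stays proportional to the image width for a fixed window size.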
[0040] The final output is a smooth and continuous per-column maximum depth curve $D_x$ (the virtual boundary), which can be used for indoor / outdoor discrimination: if the depth $d$ of a detected head at column $x$ is greater than $D_x + m$, where $m$ is a margin, the person is judged to be outside the boundary.
[0041] In terms of identifying the virtual meeting area, during subsequent operation, the system can determine whether a person is within the virtual meeting area based on the personnel detection results and their depth value; personnel outside the virtual meeting area are filtered out to ensure that only participants within the virtual meeting area are identified and tracked in the future.
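The run-time filtering described above could be sketched as follows. The detection dictionary keys `col` and `depth`, the margin value, and the choice to keep a person when no boundary value exists for a column are all assumptions made for illustration.

```python
def is_inside(head_col, head_depth, boundary, margin=0.3):
    """True if the head depth does not exceed the boundary depth plus a margin."""
    col = min(max(head_col, 0), len(boundary) - 1)  # clamp to a valid column
    limit = boundary[col]
    if limit <= 0:
        return True  # no boundary known for this column: keep (assumed default)
    return head_depth <= limit + margin

def filter_participants(detections, boundary, margin=0.3):
    """Keep only detections judged to be inside the virtual meeting area."""
    return [d for d in detections
            if is_inside(d["col"], d["depth"], boundary, margin)]
```

Only the detections returned by `filter_participants` would then be passed on to identification and tracking.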
[0042] When the software system corresponding to this method is deployed on edge devices, it can ensure that the system algorithm is lightweight and optimized, and can run at millisecond-level speeds on edge hardware such as RKNN, thus ensuring the feasibility of practical applications.
[0043] Lightweight: ① Posture triggering adopts a unilateral hand-raising judgment that relies only on "wrist higher than shoulder", avoiding multi-keypoint confidence evaluation and angle solving; ② Only a small neighborhood depth around the head center of the person raising a hand is captured in each valid frame, and the virtual boundary is represented by a one-dimensional array whose length equals the image width; the physical fact that the same column ≈ the same line of sight makes the one-dimensional "maximum depth by column" boundary both geometrically meaningful and extremely simple and efficient, which suits real-time on-device applications of cameras installed in conference rooms.
[0044] ③ Boundary completion and denoising adopt low-complexity operators of linear interpolation / constant extrapolation and isolated jump correction + median smoothing; the overall computational complexity of the above steps increases linearly with the image width (O(W)), with few parameters and the operators can be fixed-point quantized, which is convenient for quantization deployment on edge devices and maintaining real-time performance.
[0045] ④ The detection, feature extraction, and depth estimation modules are implemented using lightweight, low-capacity networks (e.g., about 15–25M parameters and backbone compute of 20–40 GFLOPs at an input short side of 480–720), and support INT8 quantization and edge-accelerated deployment; the networks can be obtained by pruning, distillation, or a small backbone (Mobile / Shuffle / shallow Res / lightweight ViT), without being limited to a specific model.
[0046] In this embodiment of the invention, real-time image acquisition is performed by a preset image acquisition device to obtain a real-time scene image of the conference room scene; the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates is output based on a depth estimation model; and a virtual meeting area is established based on the preset posture detection of the participants and their position distribution in image coordinates and depth coordinates. A virtual meeting area can thus be established using a single camera, and the boundary between the inside and outside of the virtual meeting area can be accurately determined, which facilitates the subsequent determination of whether people are inside or outside the virtual meeting area and avoids interference from people outside it.
[0047] Example 2: Please refer to Figure 2, which is a schematic diagram of the structural composition of the virtual meeting area establishment device based on the depth estimation model in an embodiment of the present invention.
[0048] As shown in Figure 2, a virtual meeting area establishment device based on a depth estimation model is provided, the device comprising: Image acquisition module 201: used to perform real-time image acquisition processing based on a preset image acquisition device to obtain real-time scene images in the conference room scene. In a specific implementation of the present invention, the step of performing real-time image acquisition processing based on a preset image acquisition device to obtain a real-time scene image in the conference room scene includes: setting up a preset image acquisition device outside the conference room scene so that the image acquired by the preset image acquisition device covers the conference room scene, wherein the preset image acquisition device is a single-camera device; and controlling the preset image acquisition device to perform real-time image acquisition processing on the conference room scene to obtain a real-time scene image in the conference room scene.
[0049] Specifically, the first step is to determine the meeting room scene. Then, a single-camera device is set up based on the meeting room scene. The image captured by this single-camera device should cover the meeting room scene. The single camera used to capture the meeting room scene is movable and does not necessarily have to be fixed in one position. However, during image capture, the single camera should be fixed and not shaken, but its fixed position can be adjusted according to specific needs. At the start of the meeting, the single camera is controlled to perform continuous image capture (which can also be considered as video capture) at the corresponding exposure frequency according to the corresponding control signals, thereby obtaining real-time scene images of the meeting room scene.
[0050] Output module 202: used to output the position distribution of each participant in the real-time scene image in the image coordinates and depth coordinates based on the depth estimation model; In the specific implementation of this invention, the step of outputting the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates based on the depth estimation model includes: inputting the real-time scene image into the depth estimation model, performing participant position distribution recognition processing in the depth estimation model, and outputting the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates; the depth estimation model is a monocular depth estimation neural network, which is an encoder-decoder structure and is combined with a multi-scale feature fusion network; the encoder uses convolution or visual Transformer as the backbone to extract global-local semantics; the decoder is an upsampling and skip connection fusion to output pixel-level depth / inverse depth map; during training, the depth estimation model adopts a hybrid loss of scale-invariant error, BerHu / L1, edge-fidelity gradient term and normal consistency, wherein the depth measurement is achieved through scale regression / data calibration and distribution alignment.
[0051] Specifically, a depth estimation model is used to estimate the depth of a real-time scene image, which outputs the relative depth of each pixel or target region in the real-time scene image. This allows the determination of the position distribution of each participant in the real-time scene image in both image coordinates and depth coordinates. In other words, the real-time scene image needs to be input into the depth estimation model, where the model identifies the position distribution of participants and ultimately outputs the position distribution of each participant in the real-time scene image in both image coordinates and depth coordinates.
[0052] The depth estimation model is a monocular depth estimation neural network, which has an encoder-decoder structure and is combined with a multi-scale feature fusion network. The encoder uses a convolutional or visual Transformer (ViT) backbone to extract global-local semantics. The decoder uses upsampling and skip connections to output pixel-level depth / inverse depth maps (optionally, confidence scores can also be output). The training objective is to use a hybrid loss of scale-invariant error (SI-Log), BerHu / L1, edge-fidelity gradient term, and (optionally) normal consistency. The depth measurement version uses scale regression / data calibration and distribution alignment to make the network output approximately consistent with the real units (meters / millimeters), which facilitates threshold discrimination and geometric fitting.
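As an illustration of two of the loss terms named above, the following is a minimal sketch of a scale-invariant log error and a BerHu (reverse Huber) term in their commonly used forms. The weighting factor `lam`, the ratio `c_ratio`, and the function names are assumptions made for illustration; the embodiment does not specify them, and the edge-fidelity gradient and normal consistency terms are omitted.

```python
import math
from typing import Sequence


def si_log_loss(pred: Sequence[float], target: Sequence[float], lam: float = 0.5) -> float:
    """Scale-invariant log error over positive depth values (lam is an assumed weight)."""
    diffs = [math.log(p) - math.log(t) for p, t in zip(pred, target)]
    n = len(diffs)
    mean_sq = sum(d * d for d in diffs) / n
    mean = sum(diffs) / n
    # With lam = 1 a global scale factor on the prediction does not change the loss.
    return mean_sq - lam * mean * mean


def berhu_loss(pred: Sequence[float], target: Sequence[float], c_ratio: float = 0.2) -> float:
    """BerHu: L1 for residuals up to c, scaled L2 above c (c_ratio is an assumed setting)."""
    residuals = [abs(p - t) for p, t in zip(pred, target)]
    c = c_ratio * max(residuals)
    if c == 0.0:
        return 0.0  # prediction equals target everywhere
    total = 0.0
    for r in residuals:
        total += r if r <= c else (r * r + c * c) / (2.0 * c)
    return total / len(residuals)
```

In a hybrid loss these terms would be combined as a weighted sum with the remaining terms; the weights would have to be chosen during training and are not given here.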
[0053] Furthermore, selecting a monocular depth estimation neural network provides the following advantages: (1) Geometric determination and correction: the depth output facilitates setting absolute thresholds (indoor / outdoor, person / behind glass) and interfaces directly with the glass plane fitting and "flattening" correction of this invention, eliminating the need for additional scale recalibration; (2) Scene robustness: the encoder-decoder with multi-scale fusion provides stronger zero-shot generalization and edge fidelity for low-texture, highly reflective / high-gloss, and mixed near-far conference room scenes, stably providing usable contour and planar information; (3) Edge-side friendliness: the structure is regular and its operators are replaceable (for example, LayerNorm / activation functions can be implemented equivalently on the edge), supporting quantization (INT8) and pruning and meeting the real-time requirements of this system at medium resolution; (4) Engineering adaptability: the output is aligned with the first-stage head detection results at the same resolution and can be called directly on the ROI / puzzle path; the output depth and confidence maps can be used as joint features for collision detection, indoor / outdoor discrimination, and reflection filtering.
[0054] Furthermore, in this embodiment, the monocular depth estimation neural network adopts the encoder-decoder structure of the Transformer backbone and outputs a depth metric, which is exported as a general model format and deployed in a lightweight manner (such as quantized deployment) on the edge inference engine; replacing it with a network with equivalent output and error characteristics can also achieve the same technical effect.
[0055] Establishment module 203: used to establish a virtual meeting area based on the preset posture detection of the participants and the position distribution of the participants in the image coordinates and depth coordinates, so as to obtain the virtual meeting area.
[0056] In the specific implementation of this invention, the preset posture detection is unilateral hand-raising detection. The process of establishing a virtual meeting area based on the preset posture detection of the participants and the position distribution of the participants in the image coordinates and depth coordinates includes: selecting the real-time scene image based on the preset posture detection of the participants using corresponding judgment rules to obtain a selected real-time scene image, wherein the judgment rules define a hand-raising condition and define the real-time scene image as a valid frame or an invalid frame. The hand-raising condition is defined as $h_{\text{wrist}} - h_{\text{shoulder}} > \tau$, where $\tau$ is a relative threshold, $h_{\text{shoulder}}$ is the shoulder joint height, and $h_{\text{wrist}}$ is the wrist joint height. A valid or invalid frame is defined as follows: a valid frame is one in which only one person in the same real-time scene image meets the hand-raising condition; an invalid frame is one in which multiple people in the same real-time scene image meet the hand-raising condition, or one person meets the hand-raising condition with both hands. After the selected real-time scene image is acquired, the head center pixel of the participant meeting the hand-raising condition is extracted based on the position distribution of the participants in the image coordinates and depth coordinates; the median depth of the neighborhood of the head center pixel is then selected and used to establish the virtual meeting area.
[0057] Furthermore, the process of establishing a virtual meeting region using the median neighborhood depth of the head center pixel to obtain the virtual meeting region includes: using a virtual boundary fitting algorithm based on the maximum column depth to establish the virtual meeting region using the median neighborhood depth of the head center pixel to obtain the virtual meeting region.
[0058] Furthermore, the virtual meeting region establishment process using the median neighborhood depth of the head center pixel to establish the virtual meeting region according to the virtual boundary fitting algorithm based on the maximum column depth includes: selecting a real-time scene image for each frame, selecting the corresponding median neighborhood depth and the image coordinates of the median neighborhood depth, and aggregating them to obtain an array with a length equal to the image width; performing interpolation and boundary extrapolation processing based on the array with a length equal to the image width to obtain continuous initial virtual meeting boundary curves; and correcting and smoothing the continuous initial virtual meeting boundary curves to form the virtual meeting region.
[0059] Furthermore, the step of correcting and smoothing the continuous initial virtual meeting boundary curve to form a virtual meeting region includes: correcting isolated jumps in the continuous initial virtual meeting boundary curve to form a corrected virtual meeting boundary curve; smoothing the corrected virtual meeting boundary curve with a window median smoothing algorithm to form a smoothed virtual meeting boundary curve; and, taking the smoothed virtual meeting boundary curve as the maximum depth curve, determining whether positions are inside or outside the boundary and forming the virtual meeting region based on the inside / outside determination results.
[0060] Specifically, the preset posture detection in this embodiment is unilateral hand raising detection. Choosing unilateral hand raising detection has the following advantages: strong detectability, relying only on the relative vertical relationship between the "shoulder" and "wrist", simple judgment, good real-time performance, and suitable for edge deployment; less ambiguity, the unilateral hand raising has a clear shape in the image, fewer false triggers, and facilitates stable acquisition of sample points near the boundary; occlusion robustness, even if the elbow or part of the upper limb is occluded, it can be determined as long as the shoulder and wrist are visible.
[0061] The hand-raising condition is defined as $h_{\text{wrist}} - h_{\text{shoulder}} > \tau$, where $\tau$ is the relative threshold (preferably 3% to 10% of the image height), $h_{\text{wrist}}$ is the wrist joint height, and $h_{\text{shoulder}}$ is the shoulder joint height. A valid or invalid frame of a real-time scene image is defined as follows: when only one person in the same real-time scene image meets the hand-raising condition, it is a valid frame; when multiple people in the same real-time scene image meet the hand-raising condition, or when the same person meets the hand-raising condition with both hands, it is an invalid frame. Valid frames are retained, and invalid frames are discarded.
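The frame selection rule above can be sketched as follows. This is only an illustrative sketch: the `Person` keypoint layout, the function names, and the default 5% threshold are assumptions, and joint positions are taken as image row coordinates that grow downward, so "wrist higher than shoulder" becomes a smaller row value for the wrist.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Person:
    # Hypothetical keypoint layout: image row coordinates, y grows downward.
    left_shoulder_y: float
    left_wrist_y: float
    right_shoulder_y: float
    right_wrist_y: float


def raised_hands(p: Person, image_height: int, rel_threshold: float = 0.05) -> int:
    """Count the hands whose wrist is above the same-side shoulder by the margin."""
    margin = rel_threshold * image_height
    left = (p.left_shoulder_y - p.left_wrist_y) > margin
    right = (p.right_shoulder_y - p.right_wrist_y) > margin
    return int(left) + int(right)


def select_calibrator(people: List[Person], image_height: int,
                      rel_threshold: float = 0.05) -> Optional[int]:
    """Return the index of the single one-handed raiser, or None for an invalid frame."""
    counts = [raised_hands(p, image_height, rel_threshold) for p in people]
    raisers = [i for i, c in enumerate(counts) if c > 0]
    # Valid only if exactly one person raises exactly one hand.
    if len(raisers) == 1 and counts[raisers[0]] == 1:
        return raisers[0]
    return None
```

A frame in which nobody raises a hand also returns `None` here, since it provides no calibration sample.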
[0062] This allows for the selection of real-time scene images through preset pose detection, resulting in selected real-time scene images. Then, based on the positional distribution of participants in the image coordinates and depth coordinates, the center pixel of the head of the selected participants who meet the hand-raising condition is extracted from the real-time scene images. Furthermore, the median depth of the neighborhood of the selected head center pixel is obtained. Finally, the median depth of the neighborhood of the head center pixel is used to establish a virtual meeting area, thus obtaining a virtual meeting area. Based on these data points, the spatial range of the meeting area can be calibrated and fitted, that is, a virtual boundary is established in the depth dimension and the image plane.
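The sampling of the neighborhood median depth around the head center pixel might look like the sketch below. The square window, its `radius`, the treatment of non-positive depths as missing, and the function name are assumptions for illustration; the embodiment only states that the median depth of a small neighborhood is taken.

```python
from statistics import median
from typing import List, Optional, Sequence


def head_depth_sample(depth_map: Sequence[Sequence[float]], cx: int, cy: int,
                      radius: int = 2) -> Optional[float]:
    """Median of the positive depths in a square window around (cx, cy).

    The window is clipped to the image; None is returned if no valid depth is found.
    """
    height = len(depth_map)
    width = len(depth_map[0])
    values: List[float] = []
    for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
            d = depth_map[y][x]
            if d > 0:  # assumption: non-positive depth means "no measurement"
                values.append(d)
    return median(values) if values else None
```

Using the median rather than the single center pixel makes the sample less sensitive to an isolated bad depth value at the head center.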
[0063] The specific virtual boundary fitting algorithm is based on the maximum depth of each column and proceeds as follows.
Input: sparse depth samples obtained from hand-raising calibration (the head center pixel of the participant who meets the hand-raising condition in each selected real-time scene image, together with the median depth of the neighborhood of that pixel). For each valid frame (selected real-time scene image), the depth $d$ of the head center of the person raising a hand and its image column coordinate $x$ are taken. Aggregating by column yields an array $B$ whose length equals the image width $W$: $B[x] = \max\{d\}$ over the samples on column $x$, that is, $B[x]$ is the maximum depth observed on column $x$ (set to 0 if no observation has been made).
Interpolation and boundary extrapolation, for handling uncovered columns ($B[x] = 0$):
1. Zero segment with values at both ends: if a zero segment $[l, r]$ is adjacent to the non-zero values $B[l-1]$ and $B[r+1]$, linear interpolation is performed at each position $x \in [l, r]$: $B[x] = B[l-1] + \frac{x-(l-1)}{(r+1)-(l-1)}\,\bigl(B[r+1]-B[l-1]\bigr)$;
2. Zero segment with a value at only one end: the segment is filled by extending the value of the known end outwards across the zero segment (constant value extrapolation);
3. All columns are zero: the values are kept at zero, pending subsequent sampling or processing by the default strategy.
[0064] By interpolation and boundary extrapolation, a continuous initial boundary curve can be obtained, which can solve the problem of lateral incompleteness in hand-raising sampling.
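The column aggregation and gap filling described above can be sketched as follows, assuming each calibration sample is a `(column, depth)` pair and that 0 marks an uncovered column. The function names are illustrative and not taken from the embodiment.

```python
from typing import List, Sequence, Tuple


def aggregate_column_max(samples: Sequence[Tuple[int, float]], width: int) -> List[float]:
    """Keep the maximum sampled depth per image column; 0.0 marks 'no observation'."""
    boundary = [0.0] * width
    for x, d in samples:
        if 0 <= x < width and d > boundary[x]:
            boundary[x] = d
    return boundary


def fill_gaps(boundary: Sequence[float]) -> List[float]:
    """Linear interpolation inside gaps, constant extrapolation at the two ends."""
    out = list(boundary)
    known = [i for i, v in enumerate(out) if v > 0]
    if not known:
        return out  # everything is zero: leave as is for later sampling
    # Constant extrapolation for zero segments that have a value at one end only.
    for i in range(known[0]):
        out[i] = out[known[0]]
    for i in range(known[-1] + 1, len(out)):
        out[i] = out[known[-1]]
    # Linear interpolation for zero segments enclosed by two known columns.
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = out[a] + t * (out[b] - out[a])
    return out
```

For example, samples on columns 1 and 5 of an 8-column image give a curve that is flat to the left of column 1, rises linearly between the two columns, and is flat again to the right of column 5.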
[0065] Jump suppression and robust smoothing are performed to eliminate single-point anomalies and noise; the following processing is executed several times (passes = 1–3): 1. Isolated jump correction: for each intermediate point $x$ ($0 < x < W-1$), if $|B[x] - B[x-1]| > T$ and $|B[x] - B[x+1]| > T$, then $B[x]$ is regarded as an isolated spike and is replaced by the bilateral mean $B[x] = \frac{B[x-1] + B[x+1]}{2}$, where the threshold $T$ is set to 1.0–2.0 (in units consistent with the depth).
[0066] 2. Window median smoothing: for each position $x$, the median of the non-zero values within the window $[x-k, x+k]$ (where $k$ is the half window width, preferably 20–80) is used to replace $B[x]$.
[0067] By employing jump suppression and robust smoothing, sensor noise and false acquisition spikes / cliffs are suppressed, while preserving the large-scale trend of the boundary.
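A minimal sketch of the two denoising operators follows. The spike test used here (a point differs from both neighbors by more than the threshold) is an assumption about how an isolated jump is detected, and the function names and default values are illustrative.

```python
from statistics import median
from typing import List, Sequence


def correct_isolated_jumps(boundary: Sequence[float], threshold: float = 1.5) -> List[float]:
    """Replace single-point spikes by the mean of their two neighbors."""
    src = list(boundary)
    out = list(src)
    for x in range(1, len(src) - 1):
        left, mid, right = src[x - 1], src[x], src[x + 1]
        # Assumed spike test: the point jumps away from both neighbors.
        if abs(mid - left) > threshold and abs(mid - right) > threshold:
            out[x] = 0.5 * (left + right)
    return out


def median_smooth(boundary: Sequence[float], half_window: int = 40) -> List[float]:
    """Replace each value by the median of the non-zero values in its window."""
    src = list(boundary)
    out = list(src)
    for x in range(len(src)):
        lo = max(0, x - half_window)
        hi = min(len(src), x + half_window + 1)
        values = [v for v in src[lo:hi] if v > 0]
        if values:
            out[x] = median(values)
    return out


def denoise(boundary: Sequence[float], passes: int = 2,
            threshold: float = 1.5, half_window: int = 40) -> List[float]:
    """Run jump correction followed by median smoothing for the given number of passes."""
    current = list(boundary)
    for _ in range(passes):
        current = median_smooth(correct_isolated_jumps(current, threshold), half_window)
    return current
```

Both operators read from a copy of the input, so a correction at one column does not influence the test at the next column within the same pass.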
[0068] The final output is a smooth, continuous per-column maximum depth curve $B[x]$ (the virtual boundary), which can be used for indoor / outdoor discrimination: if the depth $d$ of a detected head at column $x$ is greater than $B[x] + m$ (where $m$ is a margin), the person is judged to be outside the boundary.
[0069] In terms of identifying the virtual meeting area, during subsequent operation, the system can determine whether a person is within the virtual meeting area based on the personnel detection results and their depth value; personnel outside the virtual meeting area are filtered out to ensure that only participants within the virtual meeting area are identified and tracked in the future.
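The inside / outside decision and the subsequent filtering can be sketched as below, assuming each detection is a `(column, head_depth)` pair. The margin value, the function names, and the policy of keeping detections on uncalibrated columns (boundary value 0) are assumptions; the embodiment leaves the default strategy for such columns open.

```python
from typing import List, Sequence, Tuple


def is_inside(boundary: Sequence[float], column: int, head_depth: float,
              margin: float = 0.3) -> bool:
    """A head is outside if its depth exceeds the column boundary plus the margin."""
    limit = boundary[column]
    if limit <= 0:
        return True  # assumption: an uncalibrated column does not reject anyone
    return head_depth <= limit + margin


def filter_participants(boundary: Sequence[float],
                        detections: Sequence[Tuple[int, float]],
                        margin: float = 0.3) -> List[Tuple[int, float]]:
    """Keep only the detections judged to be inside the virtual meeting area."""
    return [(x, d) for x, d in detections if is_inside(boundary, x, d, margin)]
```

Only the detections returned by `filter_participants` would be passed on to identification and tracking.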
[0070] When the software system corresponding to this method is deployed on edge devices, the algorithm is lightweight and optimized so that it can run with millisecond-level latency on edge inference platforms such as RKNN, thus ensuring the feasibility of practical applications.
[0071] Lightweight design: ① Posture triggering uses a single-sided hand-raising test that relies only on "wrist higher than shoulder", avoiding multi-keypoint confidence fusion and angle computation; ② In each valid frame, only the depth of a small neighborhood around the head center of the person raising a hand is sampled, and the virtual boundary is represented by a one-dimensional array whose length equals the image width. Because pixels in the same image column lie approximately along the same line of sight, the one-dimensional "maximum depth per column" boundary is geometrically meaningful while remaining simple and efficient, which suits real-time on-device use by cameras installed in conference rooms.
[0072] ③ Boundary completion and denoising use low-complexity operators: linear interpolation / constant extrapolation, isolated jump correction, and median smoothing. The overall computational cost of these steps grows linearly with the image width, O(W), and few parameters are involved; the operators can be implemented in fixed point, which eases quantized deployment on edge devices while maintaining real-time performance.
[0073] ④ The detection, feature extraction, and depth estimation modules are implemented using lightweight, low-capacity networks (e.g., 15–25M parameters and a backbone cost of 20–40 GFLOPs at an input short side of 480–720) and support INT8 quantization and edge-accelerated deployment; the networks can be obtained by pruning, distillation, or a small backbone (Mobile / Shuffle / shallow Res / lightweight ViT), without being limited to a specific model.
[0074] In this embodiment of the invention, real-time image acquisition is performed by a preset image acquisition device to obtain a real-time scene image of the conference room scene; the position distribution of each participant in the real-time scene image, in image coordinates and depth coordinates, is output based on a depth estimation model; and a virtual meeting area is established based on the preset posture detection of the participants and their position distribution in image coordinates and depth coordinates. A virtual meeting area can thus be established with a single camera and its boundary determined accurately, which facilitates the subsequent judgment of whether participants are inside or outside the virtual meeting area and avoids interference from people outside it.
[0075] This invention provides a computer-readable storage medium storing a computer program. When executed by a processor, this program implements the virtual meeting area creation method of any of the above embodiments. The computer-readable storage medium includes, but is not limited to, any type of disk (including floppy disk, hard disk, optical disk, CD-ROM, and magneto-optical disk), ROM (Read-Only Memory), RAM (Random Access Memory), EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory, magnetic cards, or optical cards. In other words, the storage device includes any medium that stores or transmits information in a readable form by a device (e.g., a computer, a mobile phone), and can be a read-only memory, a disk, or an optical disk, etc.
[0076] This invention also provides a computer application running on a computer, which is used to execute the virtual meeting area creation method of any of the above embodiments.
[0077] In addition, Figure 3 is a schematic diagram of the structural composition of the electronic device in an embodiment of the present invention.
[0078] This invention also provides an electronic device, as shown in Figure 3. The electronic device includes a processor 302, a memory 303, an input unit 304, and a display unit 305, among other devices. Those skilled in the art will understand that the structure of the electronic device illustrated in Figure 3 does not constitute a limitation on all devices, which may include more or fewer components than illustrated or combine certain components. Memory 303 can be used to store application program 301 and various functional modules. Processor 302 runs application program 301 stored in memory 303, thereby performing various functional applications and data processing of the device. The memory can be internal memory, external memory, or both. Internal memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), flash memory, or random access memory. External memory may include hard disks, floppy disks, ZIP disks, USB flash drives, magnetic tapes, etc. The memory disclosed in this invention includes, but is not limited to, these types of memory, which are given only as examples and not as limitations.
[0079] Input unit 304 is used to receive signal input and user-input keywords. Input unit 304 may include a touch panel and other input devices. The touch panel can collect user touch operations on or near it (such as operations performed by the user using a finger, stylus, or any suitable object or accessory on or near the touch panel) and drive the corresponding connection device according to a pre-set program; other input devices may include, but are not limited to, one or more of physical keyboards, function keys (such as play control buttons, power buttons, etc.), trackballs, mice, joysticks, etc. Display unit 305 can be used to display user-input information or information provided to the user, as well as various menus of the terminal device. Display unit 305 may be in the form of a liquid crystal display, organic light-emitting diode, etc. Processor 302 is the control center of the terminal device, connecting various parts of the entire device through various interfaces and lines, and performing various functions and processing data by running or executing software programs and / or modules stored in memory 303, and calling data stored in memory.
[0080] As one embodiment, the electronic device includes: one or more processors 302, a memory 303, and one or more applications 301, wherein the one or more applications 301 are stored in the memory 303 and configured to be executed by the one or more processors 302, and the one or more applications 301 are configured to execute the virtual meeting area establishment method corresponding to any of the above embodiments.
[0081] In this embodiment of the invention, real-time image acquisition is performed by a preset image acquisition device to obtain a real-time scene image of the conference room scene; the position distribution of each participant in the real-time scene image, in image coordinates and depth coordinates, is output based on a depth estimation model; and a virtual meeting area is established based on the preset posture detection of the participants and their position distribution in image coordinates and depth coordinates. A virtual meeting area can thus be established with a single camera and its boundary determined accurately, which facilitates the subsequent judgment of whether participants are inside or outside the virtual meeting area and avoids interference from people outside it.
[0082] Furthermore, the above provides a detailed description of a method and related apparatus for establishing a virtual meeting area based on a depth estimation model provided by the embodiments of the present invention. Specific examples have been used to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. At the same time, for those skilled in the art, there will be changes in the specific implementation methods and application scope based on the ideas of the present invention. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A method for establishing a virtual meeting area based on a depth estimation model, characterized in that the method comprises: performing real-time image acquisition processing based on a preset image acquisition device to obtain a real-time scene image in a conference room scene; outputting the position distribution of each conference participant in the real-time scene image in image coordinates and depth coordinates based on a depth estimation model; and establishing a virtual conference area based on preset posture detection of the conference participant and the position distribution of the conference participant in image coordinates and depth coordinates to obtain the virtual conference area.
2. The virtual meeting area establishing method of claim 1, wherein the method comprises: setting a preset image acquisition device outside the conference room scene so that the image acquired by the preset image acquisition device covers the conference room scene, wherein the preset image acquisition device is a single camera device; and controlling the preset image acquisition device to perform real-time image acquisition processing on the conference room scene to obtain a real-time scene image in the conference room scene.
3. The virtual meeting area establishing method of claim 1, wherein the method comprises: inputting the real-time scene image into the depth estimation model to identify the position distribution of the conference participant in the depth estimation model and output the position distribution of each conference participant in the real-time scene image in image coordinates and depth coordinates; the depth estimation model is a monocular depth estimation neural network, which has an encoder-decoder structure and cooperates with a multi-scale feature fusion network; the encoder uses a convolutional or visual Transformer backbone to extract global-local semantics; the decoder uses upsampling and skip connection fusion to output pixel-level depth / inverse depth maps; the depth estimation model uses a hybrid loss of scale-invariant error, BerHu / L1, an edge fidelity gradient term, and normal consistency during training, wherein metric depth is obtained by scale regression / data calibration and distribution alignment.
4. The virtual meeting area establishing method of claim 1, wherein the preset posture detection is single-side hand-raising detection, and the method comprises: selecting the real-time scene image based on the preset posture detection of the participants using corresponding judgment rules to obtain a selected real-time scene image, wherein the judgment rules define a hand-raising condition and define the real-time scene image as a valid frame or an invalid frame; wherein the hand-raising condition is defined as $h_{\text{wrist}} - h_{\text{shoulder}} > \tau$, where $\tau$ is a relative threshold value, $h_{\text{shoulder}}$ is a shoulder joint height, and $h_{\text{wrist}}$ is a wrist joint height; and wherein a frame of real-time scene image is a valid frame when only one person in that frame satisfies the hand-raising condition, and is an invalid frame when multiple persons in that frame satisfy the hand-raising condition or the same person satisfies the hand-raising condition with both hands; after obtaining the selected real-time scene image, extracting the head center pixel of the conference participant that meets the hand-raising condition in the selected real-time scene image according to the position distribution of the conference participant in image coordinates and depth coordinates; and selecting the neighborhood depth median of the head center pixel and using the neighborhood depth median of the head center pixel to establish the virtual conference area to obtain the virtual conference area.
5. The virtual meeting area establishing method of claim 4, wherein the method comprises: using the neighborhood depth median of the head center pixel to establish the virtual conference area according to a virtual boundary fitting algorithm based on the maximum column depth to obtain the virtual conference area.
6. The virtual meeting area establishing method of claim 5, wherein using the neighborhood depth median of the head center pixel to establish the virtual conference area according to the virtual boundary fitting algorithm based on the maximum column depth comprises: for each frame of selected real-time scene image, selecting the corresponding neighborhood depth median and the image coordinates of the neighborhood depth median, and aggregating them to obtain an array with a length equal to the image width; performing interpolation and boundary extrapolation on the array with a length equal to the image width to obtain a continuous initial virtual meeting boundary curve; and correcting and smoothing the continuous initial virtual meeting boundary curve to form the virtual meeting area.
7. The virtual meeting area establishing method of claim 6, wherein correcting and smoothing the continuous initial virtual meeting boundary curve to form the virtual meeting area comprises: correcting isolated jumps in the continuous initial virtual meeting boundary curve to form a corrected virtual meeting boundary curve; smoothing the corrected virtual meeting boundary curve using a window median smoothing algorithm to form a smoothed virtual meeting boundary curve; and, taking the smoothed virtual meeting boundary curve as the maximum depth curve, determining whether positions are inside or outside the boundary and forming the virtual meeting area based on the inside / outside determination results.
8. A virtual meeting area establishing apparatus based on a depth estimation model, characterized in that the device includes: an image acquisition module, used to perform real-time image acquisition processing based on a preset image acquisition device to obtain real-time scene images in the conference room scene; an output module, used to output the position distribution of each participant in the real-time scene image in image coordinates and depth coordinates based on the depth estimation model; and an establishment module, used to establish a virtual meeting area based on the preset posture detection of the participants and the position distribution of the participants in the image coordinates and depth coordinates.
9. An electronic device comprising a processor and a memory, characterized in that the processor runs a computer program or code stored in the memory to implement the virtual meeting area establishment method as described in any one of claims 1 to 7.
10. A computer readable storage medium for storing a computer program or code, characterized in that, when the computer program or code is executed by a processor, the virtual meeting area establishment method as described in any one of claims 1 to 7 is implemented.