System and apparatus for creating a spatial model of a multi-device environment

By capturing images with video devices and using machine learning data architecture to identify and estimate the positions of audio and video devices, the problem of insufficient device coordination in multi-device environments is solved, enabling more efficient audio and video signal processing.

CN122199781A · Pending · Publication Date: 2026-06-12 · GN HEARING AS

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: GN HEARING AS
Filing Date: 2025-12-10
Publication Date: 2026-06-12

Application Information

Patent Timeline

10 Dec 2025: Application
12 Jun 2026: Publication (CN122199781A)

IPC: G06T17/00; G06V20/40; G06V10/774; G06V10/82

CPC: H04N7/147; H04N7/142; G06T7/70; G06T17/00; H04N7/157

AI Tagging

Application Domain

Character and pattern recognition; Television systems

AI Technical Summary

Technical Problem

In multi-device environments, insufficient understanding of each device's physical orientation in the room and its position relative to other devices leads to inadequate coordination and processing capabilities. Existing systems primarily rely on manual configuration or audio cues, which limits accuracy.

Method used

A video device captures images of the room, and a machine learning data architecture processes the images to identify the audio and video devices in the room and estimate their locations. The resulting spatial model supports automated configuration and location estimation of the devices.

Benefits of technology

It provides a simple and effective way to create a spatial model of a room, improves the coordination and processing capabilities of the devices, adapts to environmental changes, and enhances the audio and video signal processing quality of online meetings.

✦ Generated by Eureka AI based on patent content.

Abstract

The present disclosure relates to systems and devices for creating a spatial model of a multi-device environment. A system for creating a spatial model of a room comprising a plurality of audio and / or video devices is disclosed. The system comprises a first device, which is a video device capable of capturing a stream of images of the room, and a second device, which is an audio and / or video device. The system comprises one or more processing units configured to: obtain a first image captured by the first device; and create a spatial model of the room by processing the first image to identify the second device and estimate a position of the second device in the room.

Description

Technical Field

[0001] This disclosure relates to a system for creating spatial models of rooms, a video device used in a system for creating spatial models of rooms, a method for creating spatial models of rooms, a computer program product, and a conference room for conducting online meetings.

Background Technology

[0002] In multi-device environments (e.g., conference rooms or meeting rooms equipped with several A / V devices), the coordination and processing capabilities of these devices are often suboptimal because of insufficient knowledge of each device's physical orientation in the room and its position relative to other devices.

[0003] Current systems primarily rely on manual configuration or audio cues for localization, which can be limited in terms of accuracy.

[0004] As an example, if the meeting room is equipped with multiple cameras or microphones that process different recording signals to optimize the online meeting experience, then the location of individual devices needs to be known.

[0005] Furthermore, optimizing the performance of individual devices should preferably be automated to save costs and enable the system to adapt to changes in the environment.

[0006] Examples of such changes could include providing new devices or positioning a movable microphone or camera.

[0007] Therefore, providing an improved system / device for allowing improved processing of signals recorded by A / V devices remains a problem.

Summary of the Invention

[0008] According to a first aspect, this disclosure relates to a system for creating a spatial model of a room including multiple audio and / or video devices, wherein the system includes a first device and a second device, the first device being a video device capable of capturing an image stream of the room, and the second device being an audio and / or video device, and wherein the system further includes one or more processing units configured to: obtain a first image captured by the first device; and create a spatial model of the room by processing the first image to identify the second device and estimate the location of the second device in the room.

[0009] As a result, by processing the images captured by the first device to form a spatial model, a simple and effective way to create a spatial model of a room is provided.

[0010] The spatial model can specify the estimated position of the first device relative to the second device. The spatial model can further specify the positions of the devices relative to other elements of the room, such as the floor, walls, ceiling, and additional fixtures. As an example, the spatial model may include a digital landmark for each device and / or other identified element. A digital landmark may be an estimated position and, optionally, an estimated orientation.
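
As an illustration only, the landmark-based spatial model described above could be represented as follows. This is a minimal sketch: the class names `Landmark` and `SpatialModel`, the metre units, and the representation of orientation as a direction vector are assumptions, not details taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Landmark:
    """Digital landmark: an estimated position and, optionally, an orientation."""
    position: Vec3                      # metres in a shared room frame (assumed)
    orientation: Optional[Vec3] = None  # direction vector, if it was estimated


@dataclass
class SpatialModel:
    """Hypothetical container mapping device or room-element names to landmarks."""
    landmarks: Dict[str, Landmark] = field(default_factory=dict)

    def add(self, name: str, position: Vec3,
            orientation: Optional[Vec3] = None) -> None:
        self.landmarks[name] = Landmark(position, orientation)

    def relative_position(self, of: str, relative_to: str) -> Vec3:
        """Offset of one landmark from another, component by component."""
        a = self.landmarks[of].position
        b = self.landmarks[relative_to].position
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


# Example values are invented for illustration.
model = SpatialModel()
model.add("first_device", (0.0, 0.0, 1.2), orientation=(1.0, 0.0, 0.0))
model.add("second_device", (2.5, 1.0, 0.8))
offset = model.relative_position("second_device", "first_device")
```

Storing one landmark per named item keeps the model easy to extend with room elements such as walls or the ceiling.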

[0011] Each of the one or more processing units can be any type of processing unit, such as a central processing unit (CPU), a graphics processing unit (GPU), a microcontroller unit (MCU), a field-programmable gate array (FPGA), or any combination thereof. A processing unit may include one or more physical processors and / or may be a combination of multiple individual processing units.

[0012] The first device may be a purely video device capable only of capturing an image stream of the room, or a combined audio and video device. The audio device may be capable of recording audio, playing audio, or both. The first device may be a video device suitable for online conferencing systems. The second device may be a purely video device, a purely audio device, or a combined audio and video device. The one or more processing units may be located in the first and / or second device. Alternatively / additionally, one or more processing units may be located in another device communicatively coupled to the first device. The first device may be able to acquire other signals, such as audio signals, and provide those signals to the one or more processing units, which may additionally use these signals to create the spatial model.

[0013] One or more processing units may be configured to process the first image using conventional image processing techniques. As an example, landmarks on the second device may be identified in the first image and used to estimate the location of the second device within the room.

[0014] In some embodiments, the one or more processing units are configured to process a first image and identify the location of a second device in the room by using a machine learning data architecture trained to recognize audio and / or video devices in the image.

[0015] As a result, an efficient method for creating spatial models is provided. Accurate location estimates can be provided by using machine learning data architectures, as these architectures can be trained to estimate the location of specific devices that are typically used together.

[0016] A machine learning data architecture can be a supervised machine learning architecture trained on a training dataset that includes images from a room, where different devices are provided in the images, and where the locations of the different devices in the room are known.

[0017] In some implementations, the machine learning data architecture is an artificial neural network, such as a deep structured learning architecture.

[0018] The images can be real images obtained by placing different devices at different locations in multiple test rooms.

[0019] Alternatively or additionally, the images can be artificially generated using 3D computer software. As an example, a large training dataset can be generated by randomly generating 3D models of multiple rooms of different sizes, each furnished with typical furniture. The 3D models of different devices can then be randomly arranged in different locations within the rooms, and artificial images can be generated by 3D rendering of the 3D scene. The 3D rendering should preferably take into account the optics of the first device, such that the 3D rendering corresponds as closely as possible to the real-world image. This will allow for the creation of a large training dataset where the precise locations of the devices in the images are known.
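
The random scene generation described above can be sketched as follows. This is a simplified illustration under stated assumptions: the room size ranges, the device type names, and the dictionary layout are invented for the example, and the 3D rendering step (which would turn each scene into an image matching the optics of the first device) is deliberately left out.

```python
import random
from typing import Dict, List, Tuple


def random_room(rng: random.Random) -> Tuple[float, float, float]:
    """Random room size (width, depth, height) in metres; ranges are assumptions."""
    return (rng.uniform(3.0, 10.0), rng.uniform(3.0, 10.0), rng.uniform(2.4, 3.5))


def random_scene(rng: random.Random, device_types: List[str]) -> Dict:
    """One synthetic scene: a room plus randomly placed devices with known poses."""
    width, depth, height = random_room(rng)
    devices = []
    for kind in device_types:
        devices.append({
            "type": kind,
            "position": (rng.uniform(0.0, width),
                         rng.uniform(0.0, depth),
                         rng.uniform(0.0, height)),
            "yaw_deg": rng.uniform(0.0, 360.0),  # ground-truth orientation label
        })
    return {"room": (width, depth, height), "devices": devices}


def make_dataset(n: int, seed: int = 0) -> List[Dict]:
    """Generate n labelled scenes; a renderer would later produce one image each."""
    rng = random.Random(seed)
    kinds = ["camera", "speakerphone", "microphone"]  # hypothetical device types
    return [random_scene(rng, kinds) for _ in range(n)]


dataset = make_dataset(100)
```

Because every position is drawn by the generator, the label for each rendered image is known exactly, which is the property the paragraph above relies on.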

[0020] In some implementations, the machine learning data architecture is further trained to identify the position of the audio and / or video device relative to the device capturing the image, and wherein the machine learning data architecture is further configured to estimate the position of the second device relative to the first device.

[0021] In some implementations, the machine learning data architecture is further trained to identify the orientation of the audio and / or video device relative to the device capturing the image, and wherein the machine learning data architecture is further configured to estimate the orientation of the second device relative to the first device.

[0022] Therefore, this can be used to create a more accurate spatial model that takes into account the directionality of the different sensors in the device. As an example, both speakers and microphones are typically highly directional. This directionality, in turn, allows for better audio processing.

[0023] Machine learning data architectures can be trained to identify orientation by attaching the orientation of a recording device to each image in the training dataset.

[0024] In some implementations, the machine learning data architecture is further trained to identify the positions of room elements relative to the first device, and the machine learning data architecture is further configured to estimate the positions of room elements relative to the first device.

[0025] In some implementations, the elements of the room are the ceiling, floor, and / or walls.

[0026] As a result, a more accurate spatial model of the room can be provided.

[0027] Machine learning data architectures can be trained to identify the location of elements in a room by additionally labeling the location of elements in a room for each image in the training dataset.

[0028] In some embodiments, the second device is an audio and video device, and the one or more processing units are further configured to: obtain a second image captured by the second device; and create the spatial model of the room by processing the first and second images.

[0029] As a result, by additionally processing the images captured by the second device, a more accurate spatial model can be created.

[0030] The second image can be processed in the same way as the first image, for example, using traditional image processing techniques or machine learning data architectures.

[0031] In some embodiments, the first device includes a first processing unit, which is one of the one or more processing units.

[0032] As a result, spatial models can be created without providing the device with access to external processing capabilities.

[0033] The one or more processing units may consist of the first processing unit alone or of the first processing unit in combination with another processing unit.

[0034] In some embodiments, the system includes a second processing unit disposed outside the room, the second processing unit being one of the one or more processing units, and the first device being communicatively coupled to the second processing unit.

[0035] As a result, the first and second devices can be simpler because they do not need to be able to perform complex calculations.

[0036] The second device can also be communicatively coupled to the second processing unit. The second processing unit and the first device can be communicatively coupled via a LAN or WAN (such as the Internet).

[0037] The one or more processing units may consist of the second processing unit alone or of the second processing unit in combination with other processing units.

[0038] In some embodiments, the first and second devices are configured to automatically and communicatively pair in response to the recognition of the second device in the first image.

[0039] Therefore, an efficient and simple way to pair individual devices is provided. This makes setting up new rooms for online meetings even simpler.

[0040] A spatial model can be used to process video signals recorded by one or more video devices in a room, thereby generating an extended spatial model that further specifies the location of a person in the room. For example, if a person is identified in a view from one or more video devices and the corresponding location of one or more video devices is provided by the initial spatial model, the person's location in the room can be estimated. The extended spatial model can be continuously updated.
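
One way to picture the person-location estimate is a 2D triangulation: two video devices with known positions each report a bearing towards the person, and the two lines of sight are intersected. The disclosure does not specify the estimation method, so the sketch below is an assumption; the function name `intersect_bearings` and the use of room-frame bearing angles are hypothetical.

```python
import math
from typing import Optional, Tuple

Point = Tuple[float, float]


def intersect_bearings(p1: Point, theta1: float,
                       p2: Point, theta2: float) -> Optional[Point]:
    """Intersect two lines of sight given by origin and bearing (radians).

    Returns None when the lines are (nearly) parallel and no stable
    intersection exists.
    """
    d1 = (math.cos(theta1), math.sin(theta1))
    d2 = (math.cos(theta2), math.sin(theta2))
    denom = d1[0] * d2[1] - d1[1] * d2[0]  # 2D cross product of the directions
    if abs(denom) < 1e-9:
        return None
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    t = (dx * d2[1] - dy * d2[0]) / denom  # distance along the first line
    return (p1[0] + t * d1[0], p1[1] + t * d1[1])


# Two cameras 4 m apart, both seeing the person at 45 degrees inward.
person = intersect_bearings((0.0, 0.0), math.radians(45.0),
                            (4.0, 0.0), math.radians(135.0))
```

With more than two video devices, the pairwise intersections could be averaged, which also gives a rough consistency check on the underlying device positions.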

[0041] In some embodiments, the one or more processing units are configured to use the spatial model to process audio signals recorded in the room or audio signals to be played back in the room, preferably processing the audio signals to provide: optimized voice pickup; noise suppression; echo cancellation; source localization; playback optimization; dynamic volume adjustment; or a single listening area.

[0042] The spatial model used can be the initial spatial model, which specifies the locations of the devices in the room, or the extended spatial model, which further specifies the locations of people in the room.

[0043] The spatial model can be further used to stitch together video signals from different video devices and / or to estimate a 3D model of the conference room and the conference participants, allowing remote participants to select a custom view of the estimated 3D model. The 3D model can be continuously updated.

[0044] According to a second aspect, this disclosure relates to a video apparatus for creating a spatial model of a room as disclosed in the first aspect of this disclosure, wherein the video apparatus is configured to: capture a first image of the second device in the room; and provide the first image to one or more processing units, thereby allowing the one or more processing units to create a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room.

[0045] In some embodiments, the video device includes a first processing unit, which is configured to: create a spatial model of the room by processing the first image to identify the second device and estimate the location of the second device in the room.

[0046] In some embodiments, the second device is an audio and video device, and the processing unit is further configured to: obtain a second image captured by the second device; and create a spatial model of the room by processing the first and second images.

[0047] According to a third aspect, this disclosure relates to a method for creating a spatial model of a room including a first device and a second device, wherein the first device is a video device capable of capturing an image stream of the room, and the second device is an audio and / or video device, and wherein the method includes: obtaining a first image captured by the first device; and creating a spatial model of the room by processing the first image to identify the second device and estimate the location of the second device in the room.

[0048] In some implementations, a machine learning data architecture trained to identify audio and / or video devices in an image is used to process the first image.

[0049] In some implementations, the machine learning data architecture is further trained to identify the position of the audio and / or video device relative to the device capturing the image, and the machine learning data architecture is further used to estimate the position of the second device relative to the first device.

[0050] In some implementations, the machine learning data architecture is further trained to identify the orientation of the audio and / or video device relative to the device capturing the image, and the machine learning data architecture is further used to estimate the orientation of the second device relative to the first device.

[0051] In some implementations, the machine learning data architecture is further trained to identify the positions of room elements relative to the first device, and the machine learning data architecture is further used to estimate the positions of room elements relative to the first device.

[0052] In some implementations, the elements of the room are the ceiling, floor, and / or walls.

[0053] In some embodiments, the second device is an audio and video device, and the method further includes: obtaining a second image captured by the second device; and creating a spatial model of the room by processing the first and second images.

[0054] In some embodiments, the first device includes a first processing unit for creating a spatial model of the room by processing the first image and / or the second image.

[0055] In some embodiments, a second processing unit is arranged outside the room, the first device is communicatively coupled to the second processing unit, and wherein the first processing unit and / or the second processing unit is used to create a spatial model of the room by processing the first image and / or the second image.

[0056] In some embodiments, the first and second devices are configured to automatically and communicatively pair in response to the recognition of the second device in the first image.

[0057] In some implementations, the spatial model is used to process audio signals recorded in the room or audio signals to be played back in the room, preferably processing the audio signals to provide: optimized voice pickup; noise suppression; echo cancellation; source localization; playback optimization; dynamic volume adjustment; or a single listening area.

[0058] According to the fourth aspect, this disclosure relates to a computer program product including program code means that, when executed on a data processing system, are adapted to cause the data processing system to perform the steps of the method disclosed in the third aspect of this disclosure.

[0059] In some implementations, the computer program product includes a non-transitory computer-readable medium on which the program code means are stored.

[0060] According to the fifth aspect, this disclosure relates to a conference room for conducting online meetings, the conference room including a display for displaying an image stream of the online meeting and a video device as disclosed in the second aspect of this disclosure.

[0061] In some implementations, the video device forms part of a system as disclosed in the first aspect of this disclosure, wherein the system is configured to create a spatial model of a room for enhancing online meetings.

[0062] The different aspects of this disclosure can be implemented in different ways, including the systems, apparatuses, methods, computer program products, and conference rooms described above and below, each producing one or more benefits and advantages as described in connection with at least one of the foregoing aspects, and each having one or more preferred embodiments corresponding to the preferred embodiments described in connection with at least one of the foregoing aspects and / or disclosed in the dependent claims. Furthermore, it should be understood that embodiments described in connection with one aspect of the aspects described herein can be equivalently applied to the other aspects.

Attached Figure Description

[0063] Referring to the accompanying drawings, the above and / or additional objects, features, and advantages of this disclosure will be further clarified by the following illustrative and non-limiting detailed description of embodiments thereof, wherein: Figure 1 shows a schematic diagram of a system for creating a spatial model of a room according to an embodiment of the present disclosure.

[0064] Figure 2 shows a schematic diagram of a conference room including a system for creating a spatial model of a room according to an embodiment of the present disclosure.

[0065] Figure 3 shows a schematic diagram of a video apparatus used in a system for creating a spatial model of a room according to an embodiment of the present disclosure.

[0066] Figure 4 shows a flowchart of a method for creating a spatial model of a room that includes a first device and a second device.

Detailed Implementation

[0067] In the following description, reference is made to the accompanying drawings, which show, by way of illustration, how embodiments of the present disclosure may be implemented.

[0068] Figure 1 shows a schematic diagram of a system 100 for creating a spatial model of a room according to an embodiment of the present disclosure. System 100 includes a first device 101 and a second device 102, the first device 101 being a video device capable of capturing an image stream of the room, and the second device 102 being an audio and / or video device. The system further includes one or more processing units 103 configured to: acquire a first image captured by the first device 101; and create a spatial model of the room by processing the first image to identify the second device 102 and estimate the position of the second device 102 within the room. In this embodiment, the one or more processing units consist of a single processing unit 103. In this embodiment, the processing unit 103 is located outside the first device 101 and the second device 102. However, in other embodiments, the processing unit 103 may be located within the first device 101 or the second device 102. The processing unit 103 should be communicatively coupled to the first device 101 to receive the first image directly or indirectly (e.g., via a server). In this embodiment, the processing unit 103 is communicatively coupled to the second device 102. However, in other embodiments, the second device 102 may not be communicatively coupled to the processing unit 103.

[0069] Preferably, the processing unit 103 is configured to process the first image and identify the position of the second device in the room using a machine learning data architecture trained to identify audio and / or video devices in an image. This allows the processing unit 103 to effectively identify different types of AV devices. The machine learning data architecture can also be trained to identify the position of the audio and / or video devices relative to the device capturing the image. This allows the machine learning data architecture to further estimate the position of the second device 102 relative to the first device 101. The machine learning data architecture can also be trained to identify the orientation of the audio and / or video devices relative to the device capturing the image. This allows the machine learning data architecture to further estimate the orientation of the second device 102 relative to the first device 101.

[0070] Figure 2 shows a schematic diagram of a conference room 200 including a system for creating a spatial model of a room according to an embodiment of the present disclosure. The system includes a first device 210, a second device 211, and a third device 212. The first device 210, the second device 211, and the third device 212 are all video devices capable of capturing image streams of the room 200. The system further includes one or more processing units (not shown) configured to: acquire a first image captured by the first device 210, a second image captured by the second device 211, and a third image captured by the third device 212; and create a spatial model of the room 200 by processing the first, second, and third images. The room also includes a first audio device 220, a second audio device 221, and a third audio device 222. Furthermore, the room 200 includes a table 231 and a monitor 230. The first, second, and third audio devices 220-222 may be a combination of microphones and speakers. The spatial model may specify the estimated positions of the video devices 210-212 and the audio devices 220-222 within the room 200.

[0071] As an example, a first image can be processed to estimate the positions of all devices visible in the first image relative to a first video device 210, a second image can be processed to estimate the positions of all devices visible in the second image relative to a second video device 211, and a third image can be processed to estimate the positions of all devices visible in the third image relative to a third video device 212. The images can be processed by a machine learning data architecture trained to identify audio and / or video devices in the images and to identify the positions of the audio and / or video devices relative to the device capturing the image. If the first video device, the second video device, and the third video device are the same, the machine learning data architecture used to process the images can be the same. Alternatively, if the first video device 210, the second video device 211, and the third video device 212 are different, different machine learning data architectures can be used to process different images. This will result in position estimation of the devices in three local coordinate systems, which can then be merged into a global coordinate system forming a spatial model of the room.
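
The merge of local coordinate systems into a global one can be illustrated with a 2D rigid transform. The disclosure does not state how the merge is performed; the sketch below assumes that the pose of each video device in the global frame is already known (for example because the devices have observed each other) and simply averages the resulting estimates. The names `local_to_global` and `merge` are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Tuple

Pose = Tuple[float, float, float]  # camera (x, y, yaw in radians), global frame
Point = Tuple[float, float]


def local_to_global(pose: Pose, local: Point) -> Point:
    """Rotate a camera-frame point by the camera yaw, then translate it."""
    x, y, yaw = pose
    lx, ly = local
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * lx - s * ly, y + s * lx + c * ly)


def merge(poses: Dict[str, Pose],
          local_estimates: Dict[str, Dict[str, Point]]) -> Dict[str, Point]:
    """Average the global-frame estimates of each device across all cameras."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for camera, detections in local_estimates.items():
        for device, local in detections.items():
            gx, gy = local_to_global(poses[camera], local)
            acc = sums[device]
            acc[0] += gx
            acc[1] += gy
            acc[2] += 1
    return {dev: (sx / n, sy / n) for dev, (sx, sy, n) in sums.items()}


# Invented example: two cameras facing each other both see audio device 220.
poses = {"210": (0.0, 0.0, 0.0), "211": (4.0, 0.0, math.pi)}
local_estimates = {"210": {"220": (2.0, 1.0)}, "211": {"220": (2.0, -1.0)}}
merged = merge(poses, local_estimates)
```

A plain average treats all cameras as equally reliable; weighting each estimate by distance or detection confidence would be a natural refinement.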

[0072] Alternatively, a spatial model of the room can be created directly by processing the first, second, and third images together. As an example, the images can be processed by a machine learning data architecture trained to identify audio and / or video devices in multiple images and create a spatial model directly from the multiple images.

[0073] Once a spatial model has been created, it can be used to improve the processing of the recorded signal. For example, if a person is located at position A, the processing unit can use the room's spatial model to process the audio signal recorded by audio devices 220-222 for, for example, speech pickup optimization of the audio signal recorded by focusing / beamforming at position A. Additionally, dynamic volume adjustment can be used to correct loudness so that the loudness of the recorded audio signal is the same regardless of whether the person is located at position G very close to audio devices 220-222 or at positions C or E, which are at a greater distance from audio devices 220-222.
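
To make the two uses above concrete, the sketch below derives delay-and-sum steering delays and a distance-based gain from positions held in the spatial model. The disclosure does not name a specific beamforming or loudness algorithm, so both functions are assumptions: the delays implement a basic delay-and-sum alignment, and the gain assumes a free-field 1/r level drop, which real rooms only approximate.

```python
import math
from typing import List, Sequence, Tuple

SPEED_OF_SOUND = 343.0  # m/s, approximate value in air at room temperature

Point = Tuple[float, float, float]


def steering_delays(mics: Sequence[Point], target: Point) -> List[float]:
    """Per-microphone delays (seconds) that time-align sound from the target.

    The farthest microphone gets zero delay; closer ones are delayed so that
    all channels line up before summation.
    """
    dists = [math.dist(m, target) for m in mics]
    farthest = max(dists)
    return [(farthest - d) / SPEED_OF_SOUND for d in dists]


def distance_gain(mic: Point, talker: Point, reference_m: float = 1.0) -> float:
    """Linear gain compensating a 1/r level drop relative to reference_m."""
    return math.dist(mic, talker) / reference_m


# Invented positions: two microphones on a line, talker 1 m from the first.
mics = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
position_a = (-1.0, 0.0, 0.0)
delays = steering_delays(mics, position_a)
gain = distance_gain(mics[1], position_a)
```

The same distance values drive both functions, which is why a single spatial model can serve pickup optimization and dynamic volume adjustment at once.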

[0074] The spatial model can also be used to estimate the speaker's location. As an example, if the meeting room 200 is used for online or hybrid meetings, the spatial model can be used to process audio signals recorded in the room by audio devices 220-222 to identify the speaker and their location. This can then allow video devices 210-212 to focus on that person. For example, on the monitors of people participating in the meeting from other locations, the video system can magnify the image of a person located at position B in room 200 while that person is speaking, and magnify the image of a person located at position F while that person is speaking.
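
One simple way to decide which video device should focus on the active speaker is to pick the camera whose optical axis points most directly at the estimated speaker location. This selection rule is an assumption made for illustration; the disclosure only states that the video devices can focus on the speaker. The 2D camera tuple and the function names are hypothetical.

```python
import math
from typing import Dict, Tuple

Camera = Tuple[float, float, float]  # (x, y, yaw in radians) from the spatial model


def off_axis_angle(camera: Camera, target: Tuple[float, float]) -> float:
    """Absolute angle between the camera's optical axis and the target direction."""
    x, y, yaw = camera
    bearing = math.atan2(target[1] - y, target[0] - x)
    # Wrap the difference into [-pi, pi) before taking the magnitude.
    diff = (bearing - yaw + math.pi) % (2.0 * math.pi) - math.pi
    return abs(diff)


def pick_camera(cameras: Dict[str, Camera], speaker: Tuple[float, float]) -> str:
    """Name of the camera that faces the speaker most directly."""
    return min(cameras, key=lambda name: off_axis_angle(cameras[name], speaker))


# Invented layout: camera 211 looks straight at the speaker.
cameras = {"210": (0.0, 0.0, 0.0), "211": (4.0, 0.0, math.pi / 2.0)}
chosen = pick_camera(cameras, (4.0, 3.0))
```

A fuller selection rule might also consider distance and occlusion, neither of which this sketch models.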

[0075] Figure 3 shows a schematic diagram of a video device 300 in a system for creating a spatial model of a room according to an embodiment of the present disclosure. The video device 300 includes a video capture unit 301 and a processing unit 302. The video device 300 is configured to: capture a first image of a second device in the room using the video capture unit 301; and provide the first image to one or more processing units, thereby allowing the one or more processing units to create a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room. The processing unit 302 may be one of the one or more processing units. Alternatively, the one or more processing units may be disposed remotely from the video device 300 and directly or indirectly communicatively coupled to the device, thereby allowing the device to send the first image to the one or more processing units. The video device 300 may correspond to the video device 101 of Figure 1 or to any of the video devices 210, 211, and 212 of Figure 2.

[0076] Figure 4 shows a flowchart of a method for creating a spatial model of a room including a first device and a second device. The first device is a video device capable of capturing an image stream of the room, and the second device is an audio and / or video device. For example, the first device could be the video device 300 disclosed in relation to Figure 3, and the second device could likewise be the video device 300 disclosed in relation to Figure 3 or one of the audio devices 220-222 disclosed in relation to Figure 2. The method includes: at 401, obtaining a first image captured by the first device; and at 402, creating a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room.
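
The two steps of the method can be outlined as a small pipeline. This is only a structural sketch: `capture` and `detect_devices` are hypothetical callables standing in for the first device and for whatever image processing or machine learning data architecture performs the identification and position estimate, and the stub values are invented.

```python
from typing import Callable, Dict, List, Tuple

Image = List[List[int]]                             # placeholder image type
Position = Tuple[float, float, float]
Detection = Tuple[str, Position]                    # (device id, estimated position)


def create_spatial_model(
    capture: Callable[[], Image],
    detect_devices: Callable[[Image], List[Detection]],
) -> Dict[str, Position]:
    """Run steps 401 and 402 and return device positions keyed by device id."""
    first_image = capture()                         # 401: obtain the first image
    detections = detect_devices(first_image)        # 402: identify and locate
    return {device_id: position for device_id, position in detections}


# Stubs used only to make the sketch runnable.
def fake_capture() -> Image:
    return [[0, 0], [0, 0]]


def fake_detector(image: Image) -> List[Detection]:
    return [("second_device", (2.5, 1.0, 0.8))]


spatial_model = create_spatial_model(fake_capture, fake_detector)
```

Passing the capture and detection steps in as callables mirrors the disclosure's point that the processing may run inside the first device or on a remote processing unit.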

[0077] While some embodiments have been described and illustrated in detail, this disclosure is not limited to these embodiments, but may be embodied in other ways within the scope of the subject matter defined in the appended claims. Specifically, it should be understood that other embodiments may be utilized and structural and functional modifications may be made without departing from the scope of this disclosure.

[0078] In an apparatus claim that enumerates several means, several of these means may be embodied by the same item of hardware. The fact that certain measures are described in different dependent claims or in different embodiments does not mean that a combination of these measures cannot be used advantageously.

[0079] It should be emphasized that the term "comprises / comprising", when used in this specification, specifies the presence of stated features, integers, steps or components, but does not exclude the presence or addition of one or more other features, integers, steps, components or groups thereof.

[0080] Listed items

[0081] Other embodiments of this disclosure are provided in the following list: 1. A system for creating a spatial model of a room including multiple audio and / or video devices, wherein the system includes a first device and a second device, the first device being a video device capable of capturing an image stream of the room, and the second device being an audio and / or video device, and wherein the system includes one or more processing units configured to: obtain a first image captured by the first device; and create a spatial model of the room by processing the first image to identify the second device and estimate the location of the second device in the room.

[0082] 2. The system according to item 1, wherein the one or more processing units are configured to process the first image and identify the position of the second device in the room by using a machine learning data architecture trained to recognize audio and / or video devices in an image.

[0083] 3. The system according to item 2, wherein the machine learning data architecture is further trained to identify the position of the audio and / or video device relative to the device capturing the image, and wherein the machine learning data architecture is further configured to estimate the position of the second device relative to the first device.

[0084] 4. The system according to item 2 or 3, wherein the machine learning data architecture is further trained to identify the orientation of the audio and / or video device relative to the device capturing the image, and wherein the machine learning data architecture is further configured to estimate the orientation of the second device relative to the first device.

[0085] 5. The system according to any one of items 2 to 4, wherein the machine learning data architecture is further trained to identify the positions of elements of the room relative to the first device, and wherein the machine learning data architecture is further configured to estimate the positions of elements of the room relative to the first device.

[0086] 6. The system according to item 5, wherein the elements of the room are a ceiling, a floor, and / or walls.

[0087] 7. The system according to any one of items 1 to 6, wherein the second device is an audio and video device, and the one or more processing units are further configured to: obtain a second image captured by the second device; and create the spatial model of the room by processing the first image and the second image.

[0088] 8. The system according to any one of items 1 to 7, wherein the first device includes a first processing unit, the first processing unit being a processing unit among one or more processing units.

[0089] 9. The system according to any one of items 1 to 8, wherein the system includes a second processing unit disposed outside the room, the second processing unit being a processing unit among one or more processing units, and the first device being communicatively coupled to the second processing unit.

[0090] 10. The system according to any one of items 1 to 9, wherein the first device and the second device are configured to automatically communicate and pair in response to the recognition of the second device in the first image.

[0091] 11. The system according to any one of items 1 to 10, wherein the one or more processing units are configured to use the spatial model to process audio signals recorded in the room or audio signals to be played back in the room, preferably processing the audio signals to provide: optimized voice pickup; noise suppression; echo cancellation; source localization; playback optimization; dynamic volume adjustment; or a single listening area.

[0092] 12. A video apparatus for use in a system for creating a spatial model of a room according to any one of items 1 to 11, wherein the video apparatus is configured to: capture a first image of the second device in the room; and provide the first image to the one or more processing units, thereby allowing the one or more processing units to create a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room.

[0093] 13. The video apparatus according to item 12, wherein the video apparatus includes a first processing unit, the first processing unit being configured to create a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room.

[0094] 14. The video apparatus according to item 13, wherein the second device is an audio and video device, and wherein the first processing unit is further configured to: obtain a second image captured by the second device; and create the spatial model of the room by processing the first image and the second image.

[0095] 15. A method for creating a spatial model of a room including a first device and a second device, wherein the first device is a video device capable of capturing an image stream of the room, and the second device is an audio and / or video device, and wherein the method comprises: obtaining a first image captured by the first device; and creating a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room.

[0096] 16. The method of item 15, wherein a machine learning data architecture trained to identify audio and / or video devices in an image is used to process the first image.

[0097] 17. The method of item 16, wherein the machine learning data architecture is further trained to identify the position of the audio and / or video device relative to the device capturing the image, and wherein the machine learning data architecture is further used to estimate the position of the second device relative to the first device.

[0098] 18. The method of item 17, wherein the machine learning data architecture is further trained to identify the orientation of the audio and / or video device relative to the device capturing the image, and wherein the machine learning data architecture is further used to estimate the orientation of the second device relative to the first device.

[0099] 19. The method of any one of items 16 to 18, wherein the machine learning data architecture is further trained to identify the positions of the elements of the room relative to the first device, and wherein the machine learning data architecture is further used to estimate the positions of the elements of the room relative to the first device.

[0100] 20. The method of item 19, wherein the elements of the room are a ceiling, a floor, and / or walls.

[0101] 21. The method according to any one of items 15 to 20, wherein the second device is an audio and video device, and wherein the method further comprises: obtaining a second image captured by the second device; and creating the spatial model of the room by processing the first image and the second image.

[0102] 22. The method of any one of items 15 to 21, wherein the first device includes a first processing unit for creating the spatial model of the room by processing the first image and / or the second image.

[0103] 23. The method of any one of items 15 to 22, wherein a second processing unit is arranged outside the room, the first device is communicatively coupled to the second processing unit, and wherein the first processing unit and / or the second processing unit is used to create the spatial model of the room by processing the first image and / or the second image.

[0104] 24. The method of any one of items 15 to 23, wherein the first device and the second device are configured to automatically communicate and pair in response to the identification of the second device in the first image.

[0105] 25. The method according to any one of items 15 to 24, wherein the spatial model is used to process audio signals recorded in the room or audio signals to be played back in the room, preferably processing the audio signals to provide: optimized voice pickup; noise suppression; echo cancellation; source localization; playback optimization; dynamic volume adjustment; or a single listening area.

[0106] 26. A computer program product comprising program code means which, when executed on a data processing system, are adapted to cause the data processing system to perform the steps of the method according to any one of items 15 to 25.

[0107] 27. The computer program product according to item 26, wherein the computer program product includes a non-transitory computer-readable medium on which the program code means are stored.

[0108] 28. A conference room for conducting online meetings, the conference room including a display for displaying an image stream of the online meeting and a video apparatus according to any one of items 12 to 14, the video apparatus forming part of a system according to any one of items 1 to 11, wherein the system is configured to create a spatial model of the room for enhancing the online meeting.
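Items 11 and 25 leave open how the spatial model feeds into audio processing. As one hedged illustration (not a technique stated in the disclosure), estimated device positions could be turned into relative propagation delays for time-aligning microphone signals toward a talker. The function name `relative_delays`, the device names, the source position and the fixed speed of sound are all assumptions made for this sketch.

```python
import math
from typing import Dict, Tuple

SPEED_OF_SOUND_M_S = 343.0  # assumed: approximate value for air at room temperature

Point = Tuple[float, float, float]  # assumed: metres, in the spatial model's frame


def relative_delays(device_positions: Dict[str, Point], source: Point) -> Dict[str, float]:
    """Return, per device, the extra time (seconds) sound from `source` needs to
    reach it compared with the nearest device."""
    distances = {name: math.dist(pos, source) for name, pos in device_positions.items()}
    nearest = min(distances.values())
    return {name: (d - nearest) / SPEED_OF_SOUND_M_S for name, d in distances.items()}


# Hypothetical positions, as a spatial model might report them.
positions = {
    "speakerphone_a": (0.0, 0.0, 0.0),
    "speakerphone_b": (3.43, 0.0, 0.0),
}
delays = relative_delays(positions, source=(0.0, 0.0, 0.0))
```

Delaying each recorded signal so that all devices line up with the farthest one is a common first step before summing microphone signals; whether the claimed system does this is not specified.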

Claims

1. A system for creating a spatial model of a room, the room comprising multiple audio and / or video devices, wherein the system includes a first device and a second device, the first device being a video device capable of capturing an image stream of the room, and the second device being an audio and / or video device, wherein the system includes one or more processing units configured to: obtain a first image captured by the first device; and create a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room, wherein the one or more processing units are configured to process the first image and identify the position of the second device in the room by using a machine learning data architecture trained to identify audio and / or video devices in an image.

2. The system according to claim 1, wherein the machine learning data architecture is further trained to identify the position of the audio and / or video device relative to the device capturing the image, and wherein the machine learning data architecture is further configured to estimate the position of the second device relative to the first device.

3. The system according to claim 1 or 2, wherein the machine learning data architecture is further trained to identify the orientation of the audio and / or video device relative to the device capturing the image, and wherein the machine learning data architecture is further configured to estimate the orientation of the second device relative to the first device.

4. The system according to any one of claims 1 to 2, wherein the second device is an audio and video device, and wherein the one or more processing units are further configured to: obtain a second image captured by the second device; and create the spatial model of the room by processing the first image and the second image.

5. The system according to any one of claims 1 to 2, wherein the first device includes a first processing unit, which is one of the one or more processing units.

6. The system according to any one of claims 1 to 2, wherein the system includes a second processing unit disposed outside the room, the second processing unit being one of the one or more processing units, and the first device being communicatively coupled to the second processing unit.

7. The system according to any one of claims 1 to 2, wherein the one or more processing units are configured to use the spatial model to process audio signals recorded in the room or audio signals to be played back in the room, preferably processing the audio signals to provide: optimized voice pickup; noise suppression; echo cancellation; source localization; playback optimization; dynamic volume adjustment; or a single listening area.

8. A video device for use in a system for creating a spatial model of a room according to any one of claims 1 to 2, wherein the video device is configured to: capture a first image of the second device in the room; and provide the first image to the one or more processing units, thereby allowing the one or more processing units to create a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room.

9. The video device according to claim 8, wherein the video device includes a first processing unit, which is configured to create a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room.

10. The video device according to claim 9, wherein the second device is an audio and video device, and wherein the first processing unit is further configured to: obtain a second image captured by the second device; and create the spatial model of the room by processing the first image and the second image.

11. A method for creating a spatial model of a room, the room comprising a first device and a second device, wherein the first device is a video device capable of capturing an image stream of the room, and the second device is an audio and / or video device, and wherein the method includes: obtaining a first image captured by the first device; and creating a spatial model of the room by processing the first image to identify the second device and estimate the position of the second device in the room, wherein a machine learning data architecture trained to identify audio and / or video devices in an image is used to process the first image.

12. A computer program product comprising program code means, wherein when the program code means is executed on a data processing system, the program code means is adapted to cause the data processing system to perform the steps of the method according to claim 11.

13. The computer program product according to claim 12, wherein the computer program product includes a non-transitory computer-readable medium on which the program code means is stored.

14. A conference room for conducting online meetings, the conference room comprising a display for displaying an image stream of the online meeting and a video device according to any one of claims 8 to 10, the video device forming part of a system according to any one of claims 1 to 7, wherein the system is configured to create a spatial model of the room to enhance the online meeting.