Method and system for training machine learning models related to a sensor
By generating synthetic data using AI models to create 3D virtual environments and simulate scenarios, the method addresses the data availability and privacy issues in training machine learning models for public space detection, enhancing accuracy and reducing costs.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- SCHRÉDER ILUMINAÇAO SA
- Filing Date
- 2025-12-18
- Publication Date
- 2026-07-02
AI Technical Summary
Existing machine learning models for detecting and classifying real-world objects in public spaces face challenges due to the lack of large-scale, representative data, privacy concerns from real-world sensing, and the limitations of existing databases, leading to performance issues and high costs.
The method generates synthetic data using artificial intelligence models to train machine learning models; this includes creating 3D virtual environments and simulating various scenarios to provide diverse and dynamic training data, reducing reliance on real-world sensing and databases.
This approach enhances the accuracy and performance of machine learning models by providing flexible, large-scale training data, so that the trained models accurately detect and classify objects in real-world scenarios, including rare or unseen situations, while avoiding privacy concerns and reducing costs.
Abstract
Description
[0001] METHOD AND SYSTEM FOR TRAINING MACHINE LEARNING MODELS RELATED TO A SENSOR
FIELD OF INVENTION
[0002] The present invention relates to a computer-implemented method, a system and a computer program product for training a machine learning model. Particular embodiments relate to a computer-implemented method, a system and a computer program product for training machine learning models for detecting and classifying real-world objects, notably related to public spaces and their users, as well as machine learning models related to a sensor.
[0003] BACKGROUND
[0004] With rapidly increasing needs and services in the fields of mobility and logistics in urban areas, a demand arises for automated and preferably real-time detection and classification within real-world data related to public spaces and the behavior of their users. In particular, this pertains to data related to outdoor public spaces, traffic situations and road users’ behavior. Such automated detection and classification can be used in many different applications, such as detecting traffic congestions or full rubbish bins, predicting availability of free parking spots for dynamic parking pricing, detecting obstacles on the curbside limiting the mobility of users with disabilities, or (dynamic) management of curbside spaces in general.
[0005] It is known to use machine learning (ML) models for such detection and classification purposes. However, the performance of an ML model heavily depends on the data used to train the model, which should be of sufficiently large scale to guarantee an adequate training process, as well as sufficiently representative for the real-world data it will perform detection and classification on, yet amply diverse so as to be capable of also handling rare, unlikely or yet unseen (i.e. not yet recorded) real-world scenarios. Unfortunately, such public spaces and public space users’ behavior data are often not available on the required scales or not available at all, so that prior to training an ML model for these purposes, an additional step of data collection becomes inevitable.
[0006] When contemplating how to collect the required data, real-world data of public spaces and public space users’ behavior can be captured by sensors placed at various locations such as on or in the proximity of a curbside. Monitoring using sensors has been described in patent applications WO2019115599, WO2019175435, WO2022122750, WO2022122755, WO2022189601, WO2023006970, WO2024227897, PCT/EP2024/068925, PCT/EP2024/068933 and PCT/EP2024/081913, which are included herein by reference. However, such sensor-based monitoring may intrude into the public space and thus come with severe privacy implications. The collected data must then be subjected to an anonymization process to remove personal and sensitive information, such as license plates, face images, or any other information that could be used to identify individual people. Such additional data processing may be costly with respect to time and computational resources. Moreover, when collecting data with such sensor-based monitoring, one must first estimate where to place the required sensors, and which and how many sensor locations give rise to the most relevant and / or most diverse sensing data, taking into account that costs increase with the number of sensors placed.
[0007] An alternative is to rely on existing databases or libraries of public spaces situations and their users’ behavior, yet these are most often not free of charge and not always sufficiently apt for the specific detection or classification purposes envisaged. Moreover, such databases or libraries are intrinsically static, and therefore unable to evolve with rapidly changing public space (e.g. road or curbside users) behavioral patterns.
[0008] SUMMARY
[0009] The object of embodiments of the invention is to provide a computer-implemented method, a system and a computer program product for training a machine learning model, in such a way that when collecting data to be used for training the machine learning model, one does not need to rely entirely on existing databases or libraries of input data, nor deal with the privacy implications associated with capturing real-world data, without compromising the machine learning model’s performance.
[0010] According to a first aspect of the invention, there is provided a computer-implemented method comprising the following steps:
[0011] generating, using an artificial intelligence, AI, model, synthetic data of at least one object to be classified in a three-dimensional, 3D, environment;
[0012] training a machine learning, ML, model using the synthetic data;
[0013] wherein the ML model comprises a classification model configured to detect and classify the at least one object.
[0014] By training the ML model with generated synthetic data, the disadvantages associated with the collection of real-world sensing data with sensors may be circumvented, such as privacy implications, additional costs and time spent on anonymizing such data, as well as dependency on sensor locations and numbers of sensors used. Moreover, existing databases or libraries with input data need not be relied on.
[0015] Furthermore, synthetic data generation offers a large-scale source of highly dynamic training data for an ML model, with sufficient flexibility to ensure that the training data are representative for the specific detection and / or classification purposes envisaged, as well as guarantee a certain level of diversity of the training data. As such, a high level of accuracy and performance of the ML model, in the sense of processing speed, computational efficiency, model size and / or time needed for convergence, can be achieved, both for common as well as rare or novel detection and / or classification situations, since accuracy and performance of an ML model may depend on the amount and diversity of training data received.
[0016] The synthetic data resulting from the step of generating may further comprise certain metadata indicating the presence of at least one object within the synthetic data, and a classification label attributed to said object. For example, the synthetic data can be image data such as pictures or videos in two or three dimensions, and the metadata can take the form of a bounding box encompassing an object in said synthetic image data, accompanied by a label specifying that said object belongs to a certain class, for example the class of pedestrians, the class of motorized vehicles, or the class of trees.
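By way of non-limiting illustration, the metadata described above may be represented as a simple data structure. The following Python sketch is an assumption made for clarity only: the class names `BoundingBox`, `ObjectAnnotation` and `SyntheticSample`, the pixel-coordinate convention and the label strings are hypothetical and are not prescribed by the method.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    """Axis-aligned box in pixel coordinates (hypothetical convention)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        # Clamp at zero so that a degenerate box never yields a negative area.
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


@dataclass
class ObjectAnnotation:
    """Metadata for one object: where it is and which class it belongs to."""
    box: BoundingBox
    label: str


@dataclass
class SyntheticSample:
    """One synthetic image together with its object-level metadata."""
    image_id: str
    width: int
    height: int
    annotations: List[ObjectAnnotation] = field(default_factory=list)

    def labels(self) -> List[str]:
        return [annotation.label for annotation in self.annotations]


# Illustrative sample with two labeled objects (all values are made up).
sample = SyntheticSample(
    image_id="scene_0001",
    width=1280,
    height=720,
    annotations=[
        ObjectAnnotation(BoundingBox(100, 200, 180, 420), "pedestrian"),
        ObjectAnnotation(BoundingBox(600, 300, 1000, 560), "motorized_vehicle"),
    ],
)
print(sample.labels())
```

Because the labels are produced together with the synthetic data, a structure of this kind can be filled in at generation time rather than by manual annotation.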
[0017] An ML model, trained with synthetic data of at least one object in a real or virtual 3D environment, which synthetic data was generated using an AI model, and comprising a classification model configured to detect the at least one object and classify the at least one object as belonging to a certain class of objects, may thus accurately detect and classify other members of this class of objects, in a 3D environment.
[0018] In particular, a computer-implemented method as described above can be applied for automated and preferably real-time detection and classification within real-world data related to public spaces situations and public space users’ behavior, such as traffic situations, situations regarding urban areas, roads and curbsides, situations regarding public furniture such as rubbish bins, bus stops, public benches and restaurant terraces, as well as road user behavior, body movements, postures, gestures, walking and interaction patterns of public space users. This can be of use for many different purposes, such as detecting traffic congestions and / or full rubbish bins, predicting the availability of free parking spots for dynamic parking pricing, detecting and predicting relevant activities of public space users such as aggressive behavior, detecting obstacles on the curbside limiting the mobility of users with disabilities, or management of curbside spaces in general.
The AI model used for the step of generating synthetic training data may be an ML model, e.g. a Large Language Model (LLM), different from the ML model that is being trained as part of the considered computer-implemented method, or may be a different type of AI model.
[0019] The training data need not be limited to generated synthetic data, but may also comprise a certain amount of pre-existing training data, e.g. from databases or libraries of appropriate data, and the ML model may be an ML model that was pre-trained with certain pre-existing data. In both situations, the generated synthetic data will aid to improve the ML model’s accuracy and performance for detection and / or classification purposes, since the synthetic data can be generated in such a way that they are tailored to the specific situations or behavior envisaged to be detected or classified. At the same time, by allowing such hybrid training datasets, the amount of synthetic data to be generated, and thereby the costs and time required for the step of generating synthetic data, can be reduced.
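As a minimal sketch of such a hybrid training dataset, the following Python example mixes synthetic and pre-existing samples in a chosen proportion. The function name `build_hybrid_dataset`, the `synthetic_fraction` parameter and the sample identifiers are hypothetical and serve only to illustrate the idea; the method does not prescribe any particular mixing ratio or sampling strategy.

```python
import random
from typing import List, Sequence, Tuple

Sample = Tuple[str, str]  # (sample identifier, source tag)


def build_hybrid_dataset(
    synthetic: Sequence[str],
    pre_existing: Sequence[str],
    synthetic_fraction: float,
    size: int,
    seed: int = 0,
) -> List[Sample]:
    """Draw `size` samples, a given fraction of which are synthetic."""
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must lie in [0, 1]")
    n_synthetic = round(size * synthetic_fraction)
    n_pre_existing = size - n_synthetic
    if n_synthetic > len(synthetic) or n_pre_existing > len(pre_existing):
        raise ValueError("not enough samples available for the requested mix")

    rng = random.Random(seed)  # seeded so that the mix is reproducible
    chosen = [(s, "synthetic") for s in rng.sample(list(synthetic), n_synthetic)]
    chosen += [(p, "pre_existing") for p in rng.sample(list(pre_existing), n_pre_existing)]
    rng.shuffle(chosen)  # avoid presenting all synthetic samples first
    return chosen


# Illustrative pools of sample identifiers (made-up names).
synthetic_pool = [f"syn_{i}" for i in range(50)]
library_pool = [f"lib_{i}" for i in range(50)]
dataset = build_hybrid_dataset(synthetic_pool, library_pool, synthetic_fraction=0.3, size=20)
print(sum(1 for _, source in dataset if source == "synthetic"), "synthetic samples out of", len(dataset))
```

Lowering `synthetic_fraction` reduces the number of synthetic samples that must be generated, which corresponds to the cost and time saving mentioned above.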
[0020] According to an exemplary embodiment, the generating comprises the following steps:
[0021] obtaining sensing data of a real 3D environment and / or of the at least one object;
generating, using the AI model, synthetic data of the at least one object in the 3D environment, further based on the obtained sensing data.
[0022] It is noted that sensed data may include any one of the following: image data, light data, sound data, radar data, LIDAR data, humidity data, pollution data, temperature data, motion data, biological hazard data, proximity data, data measured by a gyroscope of an object, data measured by an accelerometer of an object, signals emitted by an object, vibration data, data measured by an object, such as power consumption data measured by a luminaire.
[0023] For example, the sound data may include a sound propagation map that is obtained for example by using a sound emitter and sensing how the sound is reflected by the environment and / or object to be detected and classified. For example, the signal emitted by an object may be any one of the following: LiFi signals, RF signals, Bluetooth signals, etc., emitted by an object such as a mobile phone, a vehicle, etc.
[0024] According to a preferred embodiment, the generating comprises the following steps:
[0025] obtaining sensing data of a real 3D environment;
[0026] modeling the 3D environment as a virtual representation of the real 3D environment based on the obtained sensing data; and
generating, using the AI model, synthetic data of the at least one object in the virtual 3D environment.
[0027] The modeling of a virtual 3D environment on the basis of sensing data of a real-world 3D environment has the advantage that information regarding said real-world 3D environment can be exploited during the generating of synthetic data, and particularly of the associated metadata that can include a classification label attributed to environmental objects. These classification labels can be assigned to the respective sensed environmental objects during the step of obtaining the sensing data and later attributed to the corresponding virtual environmental objects in the modeled virtual 3D environment, without any need for manual labeling. As such, the disadvantages of manual labeling, such as being error-prone, time-consuming, cost-intensive and cumbersome to perform on a large scale, may be circumvented. Note that when environmental objects are sensed, various types of sensed data may be obtained, such as various types of image data, sound data, temperature data, and so on, or combinations thereof.
[0028] The step of obtaining sensing data of a real 3D environment may comprise collecting 2D pictures or videos of the real 3D environment using sensors placed at various locations within the environment, and / or scanning the real 3D environment from any individual viewpoint using a low-level camera device such as the camera function on a smartphone. Other examples of sensors that may be used in the step of obtaining sensing data include an image sensor, an optical sensor such as a photodetector or any other light sensor, a sound sensor, a radar such as a Doppler effect radar, a LIDAR, a humidity sensor, a pollution sensor, a temperature sensor, a motion sensor, a biological hazard sensor, a proximity sensor, a gyroscope, an accelerometer, an antenna, an RF sensor, a vibration sensor, a metering device, an alarm device, or combinations thereof.
[0029] As such, when environmental objects are sensed, various sensed data may be obtained such as various sorts of image data, sound data, temperature data, etc., or combinations thereof. As an alternative for or in addition to the usage of sensing data of a real 3D environment, pre-existing public space databases can be exploited to provide environmental data related to buildings and other infrastructure, such as precise locations of trees and / or bus stops, and / or databases of data related to users present in the environment may be consulted.
[0030] The obtained sensing data pertain to the 3D environment, which may be to a large extent composed of static objects, such as buildings and roads. Such sensing data can therefore be collected relatively easily in an off-line fashion during a single phase preceding the step of generating the synthetic data, without any need for continuous and / or real-time monitoring of the 3D environment. Most often, these sensing data can take the form of 2D vision image data, but may just as well take the form of 3D vision image data, radar data, LIDAR data, or any other form of sensed data, such as sound data, temperature data, or any other type of data that may be sensed with one of the sensors or that may be represented as image data.
[0031] Various techniques can subsequently be exploited to extract additional information, e.g. distance metrics as well as 3D structure, texture, color and depth information, from the obtained sensing data, such as using mapping wheels to measure distances on a large scale, or LIDAR and / or photogrammetry measurements performed during the collecting of sensing data. Thereafter, in case the sensing data take the form of 2D image data, data formats such as USD (Universal Scene Description) can be utilized to model the geometry of the 3D environment, and thus transform the collected 2D image data of the real 3D environment, with the aid of said distance metrics and structure / texture and / or color information, into a static 3D virtual representation of the environment. Finally, synthetic data of the at least one object within the created virtual 3D environment may be generated using an AI model, and said synthetic data may be used to train the ML model.
[0032] According to a preferred embodiment, the generating comprises the following steps:
[0033] obtaining sensing data of at least one real object in a real 3D environment;
[0034] modeling the at least one object as a virtual representation of the at least one real object based on the obtained sensing data; and
[0035] generating, using the AI model, synthetic data of the at least one object in the 3D environment.
[0036] Like the previously described embodiment, wherein sensing data may be obtained of a real 3D environment, and a virtual 3D environment may be modeled as a virtual representation of the real 3D environment based on the obtained sensing data, the above embodiment has the advantage that any available information regarding the at least one sensed real object can be exploited during the step of generating synthetic data by assigning said information as (at least part of) a classification label to be assigned to the virtual representation of said at least one real object. This may reduce the need for manual labeling of objects within the generated synthetic data. Said information can pertain to a classification of the at least one object as belonging to a certain class of objects, and / or to the object’s location and / or orientation within the 3D environment, and / or the object’s speed if moving.
[0037] The at least one sensed real object under consideration need not be static, but can be a dynamic or moving object such as a pedestrian, a vehicle or a rubbish bin that is progressively being filled. Thus, continuous and / or real-time monitoring of said object may be performed to obtain sufficiently representative sensing data, and a sufficient amount of such data, to guarantee an adequate accuracy and performance of the ML model. This can be achieved by collecting the sensing data with one or more field sensors placed at several locations in a public space, e.g. along a road and / or curbside, which are subsequently communicated in real-time to a processing means wherein the step of modeling the at least one object and the step of generating synthetic data may be performed. Said latter step of generating synthetic data may comprise augmenting data of a 3D environment with the modeled virtual representation of the at least one real object, wherein the data of a 3D environment may be AI-generated environmental data or real-world sensing data. Indeed, the above embodiment can be readily combined with the previously described embodiment to obtain a method in which sensing data of both a real 3D environment and at least one object in said real 3D environment may be used to train an ML model.
[0038] According to a preferred embodiment, the generating further comprises:
[0039] simulating, using the AI model, one or more virtual scenes involving the at least one object in the 3D environment; and
[0040] generating, using the AI model, synthetic image data of the at least one object in the 3D environment by rendering image data of the one or more virtual scenes.
[0041] The one or more virtual scenes may offer a more dynamic perspective on the at least one object, in the sense that they may represent different conditions of the at least one object, such as different phases of motion, and / or may represent the at least one object within different conditions of the 3D environment, such as different lighting and / or weather conditions. By generating synthetic data on the basis of such virtual scenes and training the ML model with said synthetic data, performance of the ML model may be improved regarding its ability to detect, recognize and classify other instances of the class to which the at least one object belongs, when confronted with real-world data in which such instances and their environment are presented under different conditions.
[0042] Dynamic representations of an object may also be created within the previously described embodiments, wherein the step of generating synthetic data comprises obtaining sensing data of at least one real object in a real 3D environment and modeling the at least one object as a virtual representation of the at least one real object based on the obtained sensing data.
[0043] In addition, the step of simulating one or more virtual scenes involving the at least one object in the 3D environment may have the advantage that virtual scenes may provide not only a replication of reality, for example with synthetic data in the form of a video replicating the sensed real-world motion of the sensed real-world object, but may also allow visualizing scenarios that are not a replication of reality and that may be rather unlikely to occur in the real world, and thus unlikely to form part of any obtained sensing data. Said rare or yet unseen scenarios may be difficult to classify for an ML model trained merely on real-world rather than synthetic data, due to the low probability of such scenarios occurring within these models’ training databases.
[0044] In certain situations, it may be desirable for an ML model to accurately classify not only frequently occurring, but also yet unseen scenarios. For example, when a city introduces a new system for bicycle sharing and would like to rely on ML models for detecting those bicycles, said ML models may be capable of detecting the particular bicycles concerned and scenarios involving these bicycles from the moment they are introduced onwards, when no data of these bicycles in that city are available yet and said scenarios are thus yet unseen (not yet recorded). For this reason, it is beneficial to train the ML model with data of such rare, unlikely or yet unseen scenarios, as can be achieved by simulating virtual scenes within the step of generating synthetic data.
[0045] According to the above preferred embodiment, the step of simulating, using the AI model, one or more virtual scenes involving the at least one object in the 3D environment, may comprise several sub-steps, as described in the following.
[0046] Firstly, the envisaged virtual scenes may be verbally described in terms of their components, i.e., the considered 3D environment, the static and dynamic objects placed within the 3D environment, the location of these objects, and / or the objects’ motion and / or behavior. The determination of which virtual scenes to take into account may largely depend on the preferred use cases, i.e., the preferred public space (e.g. traffic and / or curbside) or user behavior situations one aims to classify with the ML model. An example of such a verbal description may be: “a street with two driving lanes, an ambulance passing by on one of the lanes, and a pedestrian standing still on the curbside”.
[0047] Secondly, a virtual representation may be modeled, e.g. a USD representation, of each of the objects, particularly the dynamic objects, that may play a role in the considered virtual scene. In the exemplary virtual scene described above, this would amount to modeling the ambulance and the pedestrian within a static representation of the street with two driving lanes. Said representations of the objects and / or the 3D environment may be obtained by modeling on the basis of obtained sensing data, or by some other type of modeling that may use an AI model.
[0048] Thirdly, several versions or variants may be created of each of the relevant virtual representations of the objects playing a role in the virtual scene, wherein such versions or variants may comprise the object placed at different locations, in different phases of motion, under different lighting and / or weather conditions, viewed from different viewpoints or angles, and / or representations of the object created on the basis of different qualities of sensed data. The envisaged virtual scenes may then arise upon suitably combining the created versions or variants of the virtual representations of each of the considered objects, placed within the considered representation of the 3D environment.
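The three sub-steps above may be sketched, purely by way of example, as an exhaustive combination of object variants with environmental conditions. In the following Python sketch, the names `ObjectVariant`, `variants_for` and `compose_scenes`, as well as the coordinates and condition labels, are hypothetical; an actual implementation would operate on full virtual representations (e.g. USD assets) rather than on these simple records.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class ObjectVariant:
    """One variant of an object: where it is placed and its phase of motion."""
    name: str
    location: Tuple[float, float]
    motion_phase: str


def variants_for(
    name: str,
    locations: Sequence[Tuple[float, float]],
    motion_phases: Sequence[str],
) -> List[ObjectVariant]:
    """Create every location/phase variant of a single object."""
    return [ObjectVariant(name, loc, phase) for loc, phase in product(locations, motion_phases)]


def compose_scenes(
    environment: str,
    per_object_variants: Sequence[Sequence[ObjectVariant]],
    conditions: Sequence[str],
) -> List[Dict[str, object]]:
    """Combine one variant per object with each environmental condition."""
    scenes = []
    for condition in conditions:
        for combination in product(*per_object_variants):
            scenes.append(
                {"environment": environment, "condition": condition, "objects": list(combination)}
            )
    return scenes


# The example scene from the description: an ambulance and a pedestrian in a two-lane street.
ambulance = variants_for("ambulance", [(0.0, 1.5), (20.0, 1.5)], ["driving"])
pedestrian = variants_for("pedestrian", [(5.0, -2.0)], ["standing", "walking"])
scenes = compose_scenes("two_lane_street", [ambulance, pedestrian], ["day_clear", "night_rain"])
print(len(scenes), "virtual scenes")
```

The number of scenes grows multiplicatively with the number of variants per object, so in practice a subset of combinations relevant to the envisaged use cases may be selected.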
[0049] According to the above preferred embodiment, the step of generating, using the AI model, synthetic image data of the at least one object in the 3D environment may comprise rendering image data of the one or more virtual scenes, as described in the following.
[0050] The created virtual scenes can be thought of as scenes or scenarios taking place within a 3D environment, and the rendering of image data of such a virtual scene may amount to creating one or more photorealistic 2D images that may represent the virtual scene. In an example, the rendering of image data may comprise making pictures of the 3D virtual scene as seen from the viewpoint of one or more virtual sensors placed within the 3D environment. In case several such virtual sensors are considered, the rendered image data may represent one and the same virtual scene from different perspectives. In another example, the rendered image data may represent different phases of motion of an object in a 3D environment, and the ensemble of these rendered image data, displayed subsequently in time, may be considered as a video representing the motion of said object.
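To illustrate how one and the same virtual scene may appear from the viewpoints of different virtual sensors, the following Python sketch projects a 3D point onto the image plane of a simplified pinhole camera. This is an assumption made for illustration only: the description does not specify a camera model, and the class `VirtualCamera`, its parameters and the coordinate convention (x to the right, y downwards, z forwards) are hypothetical. Photorealistic rendering involves considerably more than this geometric projection.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

Point3D = Tuple[float, float, float]


@dataclass
class VirtualCamera:
    """Simplified pinhole camera that can only rotate about the vertical axis."""
    position: Point3D
    yaw: float       # rotation about the vertical (y) axis, in radians
    focal_px: float  # focal length expressed in pixels
    cx: float        # principal point, horizontal pixel coordinate
    cy: float        # principal point, vertical pixel coordinate

    def project(self, point: Point3D) -> Optional[Tuple[float, float]]:
        """Return the pixel coordinates of `point`, or None if it lies behind the camera."""
        dx = point[0] - self.position[0]
        dy = point[1] - self.position[1]
        dz = point[2] - self.position[2]
        # Express the offset in the camera frame (right, down, forward).
        x_cam = dx * math.cos(self.yaw) - dz * math.sin(self.yaw)
        z_cam = dx * math.sin(self.yaw) + dz * math.cos(self.yaw)
        if z_cam <= 0.0:
            return None
        u = self.cx + self.focal_px * x_cam / z_cam
        v = self.cy + self.focal_px * dy / z_cam
        return (u, v)


# Two virtual sensors observing the same point of a scene from different perspectives.
front_camera = VirtualCamera((0.0, 0.0, 0.0), 0.0, 1000.0, 640.0, 360.0)
side_camera = VirtualCamera((-10.0, 0.0, 10.0), math.pi / 2, 1000.0, 640.0, 360.0)
scene_point = (1.0, 0.5, 10.0)
print(front_camera.project(scene_point))
print(side_camera.project(scene_point))
```

Applying the same projection to the corners of an object's 3D extent would yield a 2D bounding box per virtual sensor, so that the metadata remains consistent across perspectives.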
[0051] In embodiments involving such a simulation of virtual scenes, the simulating preferably comprises simulating, using the AI model, a plurality of different virtual scenes involving the at least one object based on at least one of the following: different locations and / or orientations of the at least one object; different environmental conditions, such as different lighting and / or weather conditions of the 3D environment; different levels of occlusion of the at least one object by one or more obstacles, such as a leaf which is stuck in front of the camera or sensor; different levels of background noise; different sensing characteristics of one or more virtual image sensors rendering said image data, such as different resolutions, fields of view, focal lengths, exposure times, frame rates, qualities and / or cleanness of lenses.
[0052] All the above different virtual scenes may contribute to the extent of representativeness and diversity of the training data, which may be sufficiently high in order for the ML model to achieve a sufficiently high performance and accuracy in detecting and / or classifying real-world data.
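One possible way to obtain such a plurality of different virtual scenes is to sample the listed factors at random. The following Python sketch illustrates this under stated assumptions: the class `SceneConfig`, the value ranges and the category names are hypothetical, and random sampling is merely one strategy that is not mandated by the description.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical categories and ranges; real values depend on the envisaged use case.
LIGHTING = ("day", "dusk", "night")
WEATHER = ("clear", "rain", "fog", "snow")
RESOLUTIONS = ((640, 480), (1280, 720), (1920, 1080))
FRAME_RATES = (5, 15, 30)


@dataclass(frozen=True)
class SceneConfig:
    """Settings of one virtual scene and of the virtual sensor rendering it."""
    object_position: Tuple[float, float]
    object_heading_deg: float
    lighting: str
    weather: str
    occlusion_fraction: float
    noise_level: float
    resolution: Tuple[int, int]
    field_of_view_deg: float
    frame_rate: int


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw one scene configuration from the assumed ranges."""
    return SceneConfig(
        object_position=(rng.uniform(-10.0, 10.0), rng.uniform(0.0, 30.0)),
        object_heading_deg=rng.uniform(0.0, 360.0),
        lighting=rng.choice(LIGHTING),
        weather=rng.choice(WEATHER),
        occlusion_fraction=rng.uniform(0.0, 0.8),
        noise_level=rng.uniform(0.0, 0.2),
        resolution=rng.choice(RESOLUTIONS),
        field_of_view_deg=rng.uniform(60.0, 120.0),
        frame_rate=rng.choice(FRAME_RATES),
    )


def sample_scenes(count: int, seed: int = 0) -> List[SceneConfig]:
    """Draw `count` configurations; a fixed seed makes the set reproducible."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(count)]


configs = sample_scenes(100, seed=1)
print(configs[0])
```

A fixed seed allows the same set of scene configurations to be regenerated later, for example when the ML model is retrained with otherwise modified settings.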
[0053] According to a preferred embodiment, the generating comprises obtaining textual information about the 3D environment and / or the at least one object; and generating, using the AI model, synthetic data of the at least one object in the 3D environment based on the received textual information.
In particular, the step of generating synthetic data to be used for training the ML model may comprise prompting a generative AI model to generate sound and / or image data of a 3D environment and / or one or more static or dynamic objects relevant to the envisaged classification use cases, on the basis of a textual description of said environment or objects.
[0054] Such preferred embodiments have the advantage of saving computation time and not relying on sensing data. In addition, contextual information extractable from sensing data can be used to facilitate the step of generating the metadata comprising the classification label of objects detectable within the synthetic data. For most object classes, dedicated inference algorithms may be available that allow for an automated assignment of classification labels within said generated synthetic data, and only for those object classes for which no such inference algorithm is available, a manual labeling step may be needed. The resulting labeled data may then be subsequently used to train the ML model, with the assigned labels providing a source of information for the training process.
[0055] According to a preferred embodiment, the generating further comprises the following steps:
[0056] obtaining sensing data of a real 3D environment;
[0057] obtaining textual information about the at least one object; and
[0058] generating, using the AI model, synthetic data of the at least one object in the real 3D environment based on the received textual information.
[0059] Within such a preferred embodiment, the step of generating synthetic data may comprise combining synthetic data of the at least one object, that is generated based on the received textual information, with the obtained sensing data of a real 3D environment, to obtain synthetic data representing the at least one object within the real 3D environment. For example, the step of generating synthetic data may comprise augmenting sensing data of a real-world 3D environment, e.g. a static picture of a street with driving lanes, building facades and trees, with synthetic images of dynamic objects, such as vehicles or pedestrians, where the latter may be created by text-prompting a generative AI model.
[0060] Such preferred embodiments may thus combine the low complexity, low computational cost and flexibility of generative AI models controlled by text prompts, with the reliability and representativeness of the sensing data of a real-world 3D environment, as well as the exploitation of said sensing data as a source of contextual information for the creation of said classification labels.
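The augmentation of a real-world picture with a synthetic object may be sketched as a simple image compositing operation that also yields the classification label. In the following Python sketch, images are represented as nested lists of grayscale values for self-containedness; the function `paste_object`, the use of `None` for transparent pixels and the annotation format are hypothetical. The generation of the object image itself by a text-prompted generative AI model is outside the scope of this sketch.

```python
from typing import Dict, List, Optional, Tuple

Image = List[List[int]]            # grayscale background, values 0..255
Patch = List[List[Optional[int]]]  # object image; None marks a transparent pixel


def paste_object(
    background: Image,
    patch: Patch,
    top: int,
    left: int,
    label: str,
) -> Tuple[Image, Dict[str, object]]:
    """Overlay `patch` on a copy of `background` and return the image with its annotation."""
    height, width = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + height > len(background) or left + width > len(background[0]):
        raise ValueError("patch does not fit inside the background")

    composite = [row[:] for row in background]  # leave the sensed image untouched
    for r in range(height):
        for c in range(width):
            if patch[r][c] is not None:
                composite[top + r][left + c] = patch[r][c]

    # The label is known at generation time, so no manual labeling is needed here.
    annotation = {"label": label, "bbox": (left, top, left + width, top + height)}
    return composite, annotation


# A 6 x 8 "street picture" and a 2 x 3 synthetic "pedestrian" patch (toy values).
street = [[0] * 8 for _ in range(6)]
pedestrian_patch = [[255, 255, None], [255, 200, 255]]
augmented, annotation = paste_object(street, pedestrian_patch, top=1, left=2, label="pedestrian")
print(annotation)
```

The bounding box follows directly from the position at which the object is placed, which is one way in which contextual information can be turned into metadata automatically.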
[0061] According to an exemplary embodiment, the generating comprises:
[0062] obtaining textual information about the at least one object; and
[0063] generating, using the AI model, a 3D model of the at least one object based on the received textual information.
For example, a generative AI model may be prompted not only to generate 2D images of static or dynamic objects on the basis of the received textual information, but also to use said textual information to create a virtual representation of said objects in three dimensions.
[0064] Such an exemplary embodiment may be advantageously combined with embodiments wherein a virtual 3D environment is created, for example on the basis of obtained sensing data, as well as with embodiments comprising a step of simulating one or more virtual scenes with the use of an Al model.
[0065] The generating of a 3D model of at least one object has the advantage that data, for example image data, of said object as seen from several different viewpoints may be rendered on the basis of said 3D model, thus complementing the resulting training dataset with different perspectives on said object. The higher the number of perspectives available within the training dataset, the better the performance of the ML model to accurately and rapidly detect and classify other instances of a class said object belongs to, when confronted with new data to be classified. Indeed, these new data are likely to be real-world data captured with a sensor, and the more perspectives are available within the training dataset, the higher the chances are that one of these perspectives matches the perspective of this sensor.
[0066] According to a preferred embodiment, the at least one object may be a moving object such as a vehicle or a pedestrian. Preferably, the vehicle is selected from a car, a truck such as a garbage truck or a fire truck, a bicycle, a scooter, a motorbike, an ambulance, a police car, a military vehicle, a taxi, a delivery vehicle, a bus, a tram, a train, a boat, an airplane, a drone. It should be noted that different types of vehicles may involve different driving routines, and different patterns and frequencies of stopping along the curbside. Alternatively, the at least one object may be a static object such as a tree, a luminaire, a traffic light, a traffic sign, a parking place, a parked vehicle, urban furniture such as a public bench, a bus stop, or a public bin. Preferably, the 3D environment comprises at least one of a road such as a street road, a sidewalk, a curbside, a pedestrian area, a parking area such as a parking lot, and a building facade.
[0067] When deciding which objects and which 3D environment to represent with the synthetic data, one may start from the envisaged use cases. For example, when considering using the ML model for indicating when a rubbish bin is full and requires emptying on the basis of image data obtained by a sensor placed in the vicinity of said rubbish bin, synthetic data comprising data of rubbish bins may be generated and used for training the ML model.
According to a preferred embodiment, the training further comprises training the ML model using data of at least one real object in a real 3D environment.
[0068] In particular, the training dataset need not comprise merely synthetic data, but may also comprise real-world data of a real object within a real 3D environment, such as real sensor data, so that a more complete training process is achieved. It may be particularly beneficial to generate synthetic data of rare, unlikely or yet unseen scenarios or behaviors as seen from the curbside, since such scenarios or behaviors may be unlikely to form part of a typical set of real-world sensing data, whilst training data of common public space, curbside or traffic scenarios or behaviors may be unaltered real-world sensing data comprised in e.g. pre-existing libraries. As such, it is possible to save unnecessary computational resources and generating time on obtaining training data that is in fact readily available from other resources.
[0069] According to a preferred embodiment, the step of training the ML model further comprises the following steps:
[0070] validating the synthetic data against ground-truth data to evaluate convergence of the output of the trained ML model; and
[0071] upon determining that said convergence is not reached:
[0072] modifying at least one of a set of parameters for training the ML model and a set of settings for generating the synthetic data; and
[0073] retraining the ML model using the modified set of parameters and / or set of settings until determining that said convergence is reached.
[0074] In particular, the method may comprise, after an initial step of generating synthetic data and an initial step of training the ML model, a step of validating the synthetic data against ground-truth data. In other words, within said validating step, the ML model is confronted with ground-truth data, i.e., a set of real-world data that are sufficiently representative of the envisaged use cases and for which a fully accurate classification is available but is not transferred to the ML model.
[0075] Then, the ML model is instructed to perform detection and / or classification on said set of ground-truth data and the outcomes are compared with the fully accurate classification labels. The convergence of the outcomes is a measure for the performance of the ML model: it expresses to which extent the outcomes agree with the fully accurate classification labels. If a sufficient extent of convergence is reached, for example exceeding a certain threshold, then the ML model is deemed to be ready for being put into practice. In case the extent of convergence reached is determined not to be sufficient, either a set of parameters for training the ML model, or a set of settings for generating the synthetic data, or both, is / are modified, and the ML model is retrained using the modified set of parameters and / or set of settings.
[0076] In the above first alternative, the training parameters, which may influence which information an ML model is able to extract from a set of training data and thus how the model will eventually perform, may be modified, and the ML model may be retrained on the basis of the same training dataset as used before, yet with the training process dictated by the new set of training parameters. Examples of such training parameters are the number of neurons and the number of layers in the ML model.
[0077] Alternatively or in addition, since in the present invention the ML model is trained on the basis of generated synthetic data, in the above second alternative, a set of settings for generating synthetic data may be modified. For example, in case the ML model underperforms when detecting and / or classifying a certain class of objects, said set of settings may be modified in such a way that regenerating synthetic data on the basis of these modified settings results, for instances of said class, in more representative, more detailed and / or more (photo)realistic synthetic data.
[0078] The above step of modifying may be followed by a step of retraining the ML model using the modified set of parameters and / or set of settings, with the latter alternative amounting to generating a new set of synthetic data on the basis of said settings, and retraining the ML model with either the new set of synthetic data or a combination of said new set of synthetic data with part or all of the previously generated synthetic data.
[0079] The above step of retraining may be followed by another step of validating the synthetic data against ground-truth data to evaluate convergence of the output of the trained ML model, and these steps may be repeated iteratively in a feedback loop until a sufficient extent of convergence, for example exceeding a certain threshold, is reached.
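By way of non-limiting illustration, the feedback loop described above may be sketched in Python as follows. All names (`GenerationSettings`, `agreement`, `train_until_converged`, `toy_train_and_predict`) are hypothetical, the convergence measure is assumed to be a simple fraction of matching labels, and the only modification applied here is a doubling of an assumed rendering resolution. The toy callable merely stands in for the actual steps of generating synthetic data, training the ML model and classifying the ground-truth data.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Sequence, Tuple


@dataclass(frozen=True)
class GenerationSettings:
    """Hypothetical settings for generating the synthetic data."""
    render_resolution: int = 256


def agreement(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of outcomes that agree with the ground-truth labels."""
    if len(predicted) != len(truth) or len(truth) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(p == t for p, t in zip(predicted, truth)) / len(truth)


def train_until_converged(
    train_and_predict: Callable[[GenerationSettings], List[str]],
    truth: Sequence[str],
    settings: GenerationSettings,
    threshold: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[GenerationSettings, float, int, bool]:
    """Regenerate, retrain and validate until the agreement reaches the threshold."""
    score = 0.0
    for round_index in range(1, max_rounds + 1):
        predicted = train_and_predict(settings)   # generate, train, classify
        score = agreement(predicted, truth)       # validate against ground truth
        if score >= threshold:
            return settings, score, round_index, True
        # Convergence not reached: modify the generation settings and retry.
        settings = replace(settings, render_resolution=settings.render_resolution * 2)
    return settings, score, max_rounds, False


GROUND_TRUTH = ["car", "pedestrian", "car", "pedestrian"]


def toy_train_and_predict(settings: GenerationSettings) -> List[str]:
    # Toy stand-in: pedestrians are only recognised once the assumed
    # rendering resolution reaches 1024; below that they are labelled "car".
    sees_pedestrians = settings.render_resolution >= 1024
    return [label if (label == "car" or sees_pedestrians) else "car" for label in GROUND_TRUTH]


final_settings, final_score, rounds, converged = train_until_converged(
    toy_train_and_predict, GROUND_TRUTH, GenerationSettings()
)
```

The `max_rounds` bound is a design choice of this sketch: it prevents the loop from running indefinitely when the threshold cannot be reached with the available modifications.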
[0080] According to an embodiment wherein the training comprises a validation step as described above, the modifying of the set of settings for generating the synthetic data preferably comprises at least one of:
[0081] changing a library storing information about the at least one object and modifying rendering settings, such as a resolution, for generating the synthetic data; and
[0082] regenerating synthetic data of the at least one object based on the changed library and / or the modified rendering settings.
In particular, it may be concluded from a validation step within the feedback loop that the ML model underperforms when detecting and / or classifying a certain class of objects. For example, the trained ML model may be able to accurately detect and / or classify vehicles, but not to detect and / or classify pedestrians to the same extent of accuracy. A potential cause may be that the synthetic data represent pedestrians in an insufficiently “human-like” fashion. It may then be beneficial to modify the set of settings for generating the synthetic data related to the class of pedestrians. In case said generating was previously performed using a certain library storing information about a certain class of objects, for example the class of pedestrians, said step of modifying may amount to switching to a different library storing information about the same class of objects, with said library being more extensive, more advanced and / or storing more detailed information.
[0083] Alternatively or in addition, the settings for generating synthetic data may be modified by modifying rendering settings for generating the synthetic data. In other words, in embodiments wherein the synthetic data are generated by rendering 2D image data from a 3D virtual model, such as seen from the point of view of one or several virtual sensors, it may be advantageous to modify the settings of said rendering so that higher-quality images result from the rendering. This may amount to increasing the resolution used to render the synthetic data, or this may amount to allowing the rendering process to take more time so that more detailed and more photorealistic images result from the rendering process. Another option is to maintain the rendering settings, but to adapt the weather, lighting and / or occlusion conditions under which the rendering takes place.
[0084] On the basis of the changed library and / or the modified rendering settings and / or the modified conditions under which the rendering takes place, a new set of synthetic data may then be generated, the training data set may be replaced by or complemented with the newly generated synthetic data, and the model may be retrained with the new training data set.
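As a non-limiting illustration of how an underperforming class may be identified before modifying the corresponding generation settings, the following Python sketch computes the accuracy per ground-truth class and lists the classes falling below a threshold. The function names and the threshold value of 0.8 are assumptions made for this example only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def per_class_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> Dict[str, float]:
    """Accuracy of the outcomes, computed separately for each ground-truth class."""
    if len(predicted) != len(truth):
        raise ValueError("label sequences must be of equal length")
    totals: Dict[str, int] = defaultdict(int)
    hits: Dict[str, int] = defaultdict(int)
    for p, t in zip(predicted, truth):
        totals[t] += 1
        hits[t] += int(p == t)
    return {label: hits[label] / totals[label] for label in totals}


def classes_to_regenerate(predicted: Sequence[str], truth: Sequence[str],
                          threshold: float = 0.8) -> List[str]:
    """Classes whose accuracy is below the threshold, in alphabetical order."""
    scores = per_class_accuracy(predicted, truth)
    return sorted(label for label, score in scores.items() if score < threshold)


truth = ["vehicle"] * 4 + ["pedestrian"] * 4
predicted = ["vehicle"] * 4 + ["pedestrian", "vehicle", "vehicle", "pedestrian"]
weak_classes = classes_to_regenerate(predicted, truth)
```

In this toy example only the class of pedestrians falls below the threshold, so only the library and / or rendering settings related to that class would be candidates for modification.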
[0085] In embodiments comprising a validation step as presented above, said validation step may form part of the training of the ML model. After the training of the ML model is completed, the ML model may be subject to a stress-test in order to evaluate its final performance. Such a stress-test may be performed on the basis of real-world data representative of the envisaged use cases, such as obtained with sensors, that the ML model is instructed to classify. In parallel, the same set of data may be classified manually or with a pre-existing ML model known to perform well, and the outcomes of both classifications may be compared. The extent of agreement between the two or more sets of classification outcomes forms a measure for the performance of the ML model.
Alternatively, the testing data may also comprise synthetic data representing conditions that differ substantially from those represented by the training data, for example representing extreme weather conditions or weather conditions that are highly unlikely to occur in the considered geographic areas, and the ML model may be tested with respect to its performance when confronted with such data.
[0086] According to a preferred embodiment, the synthetic data comprises at least one of synthetic image data and synthetic sound data.
[0087] Indeed, the detection and classification purposes of the method may not only pertain to the visual detection and classification of objects within image data, but may also pertain to the detection and classification of certain sounds as belonging to a certain class of sounds.
[0088] Herein, and throughout the whole text, image data should be understood as not limited to simply vision images, but rather as comprising any type of data that may be represented by means of images. In particular, this includes radar data and LIDAR data, but also sound data. For example, a sound sample may be represented as a collection of images, with each image corresponding to the sound collected at a certain point in time, each pixel of such an image corresponding to a certain frequency, and the color of each pixel corresponding to the amplitude or volume captured for said frequency at said point in time. More generally, any data stream, whichever the data type, may be formulated in a binary format, which in its turn is straightforwardly translated to a collection of images. Nevertheless, various types of data may also be considered without requiring to represent said data as image data. For example, the sensing data and / or the synthetic data may also comprise sound data, temperature data, pollution data or any other type of data, either directly or represented as image data.
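By way of non-limiting illustration, a sound sample may be turned into an image as sketched below in Python using NumPy. The sketch assumes a short-time Fourier transform with a Hann window, which is one common way of obtaining such a representation and is not prescribed by the present description; the function name `spectrogram` and the parameters `frame_length` and `hop` are assumptions of this example. Each column of the result corresponds to a point in time, each row to a frequency, and each value to the amplitude captured for said frequency.

```python
import numpy as np


def spectrogram(samples, frame_length: int = 256, hop: int = 128) -> np.ndarray:
    """Return a 2D array: rows are frequency bins, columns are time frames."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim != 1 or samples.size < frame_length:
        raise ValueError("need a 1D signal at least one frame long")
    window = np.hanning(frame_length)
    n_frames = 1 + (samples.size - frame_length) // hop
    # Cut the signal into overlapping, windowed frames.
    frames = np.stack([
        samples[i * hop: i * hop + frame_length] * window
        for i in range(n_frames)
    ])
    # Magnitude of the real-input FFT of each frame gives amplitude per frequency.
    magnitudes = np.abs(np.fft.rfft(frames, axis=1))
    return magnitudes.T


# One second of a 1000 Hz tone sampled at 8000 Hz (toy input).
sample_rate = 8000
time_axis = np.arange(sample_rate) / sample_rate
image = spectrogram(np.sin(2.0 * np.pi * 1000.0 * time_axis))
```

The resulting array can be stored or displayed like any single-channel image, so that an ML model designed for image data may be trained on it.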
[0089] As such, a large variety of use cases may be handled and subjected to detection and / or classification by the resulting ML models. For example, an ML model trained on the basis of synthetic sound data may be exploited for the detection of approaching emergency vehicles with sirens, such as ambulances or fire trucks, whereas ML models trained with radar data may be configured for distinguishing between bikes and motor scooters based on their speed as extracted from sensed radar data.
[0090] According to a preferred embodiment comprising a step of obtaining sensing data, the sensing data comprises at least one of sensing image data and sensing sound data.
[0091] Again, image data should be understood as comprising not only vision images, but also radar data, LIDAR data, or any other type of data that may be represented by means of images. The obtained sensing data may be of the same data type as the synthetic data that is being generated, or of a different type. For example, in case the obtained sensing data are image data, e.g. radar data, the generated synthetic data may be image data of the same type (radar data) or a different type (e.g. vision image data), or may be sound data, and vice versa. In case data of different types are used as sensing data and synthetic data respectively, an additional conversion step may be included into the step of generating synthetic data.
[0092] Although the description of the above embodiments is mostly directed to image data, the reader will understand that said description also applies to sound data as well as other types of data, mutatis mutandis. Such detection and classification of sound data may be applied for different or similar purposes as presented above for image data. For example, sound data may be captured by a sound sensor, such as a microphone, placed at some location on a curbside and / or in a vicinity of a curbside and the identified sound samples may be classified as to whether or not they originate from e.g. a priority vehicle, such as an ambulance, a fire engine or a police truck. Such sound data classification may then for example be used for real-time prediction of approaching priority vehicles within road user devices.
[0093] According to a preferred embodiment, the synthetic data comprises at least one of two-dimensional (2D) synthetic data and 3D synthetic data.
[0094] Indeed, as already illustrated in the previous embodiments, the process of generating synthetic image data may either involve only 2D (real-world or virtual) image data, or involve only 3D (real-world or virtual) image data, or involve the creation of a 3D virtual environment and / or a 3D virtual representation of objects as well as involve the rendering of 2D image data from said 3D models.
[0095] According to a second aspect of the invention, there is provided a computer-implemented method comprising the following steps:
[0096] obtaining a first sensing data having first sensing characteristics;
[0097] obtaining sensing characteristics of a sensor;
[0098] converting, using an artificial intelligence, AI, model, the first sensing data into a second sensing data having second sensing characteristics different from the first sensing characteristics, wherein the second sensing characteristics match the sensing characteristics of the sensor; and
training a machine learning, ML, model related to the sensor using the second sensing data.
[0099] Thus, according to said second aspect of the invention, a method is provided that allows to train an ML model with data that correspond to a certain set of desired sensor sensing characteristics, even when no data captured by such a sensor are available. In such a situation, the method may start from a given set of sensing data with first sensing characteristics that may be considered suboptimal and that are different from the preferred second sensing characteristics, whilst still providing an ML model tailored to the preferred second sensing characteristics. In particular, this ML model will be configured to perform certain operations, such as detection and / or classification of at least one object sensed by the sensor, when it receives sensing data having said preferred second sensing characteristics.
[0100] The second sensing characteristics are known to match the sensing characteristics of a certain sensor. The first sensing characteristics may or may not match the sensing characteristics of another sensor, in which case the latter sensor may be referred to as the first sensor and the former sensor may be referred to as the second sensor. Each of the considered sensors may either be a real-world sensor with realistic sensing characteristics, such as a limited resolution and focal length, or a virtual sensor with optimal / ideal or suboptimal realistic sensing characteristics.
[0101] The method according to said second aspect is particularly advantageous in case the first sensing characteristics correspond to an ideal virtual sensor and the second sensing characteristics correspond to a real-world sensor, in which case the step of converting the first sensing data into second sensing data amounts to transforming sensing data of certain scenes or scenarios as captured by an ideal virtual sensor into data that would hypothetically be obtained if the same scenes or scenarios were captured by said real-world sensor. In other words, the step of converting the first sensing data into second sensing data then serves at least to mimic a real-world sensing device and its effect on certain, not necessarily real-world, first sensing data.
[0102] According to a preferred embodiment, the converting comprises degrading the first sensing characteristics by at least one of lowering a resolution or quality of the first sensing data, adding noise to the first sensing data, adding blur to the first sensing data, adding compression artifacts or other distortion to the first sensing data, and changing a level of occlusion of the first sensing data by one or more obstacles. For example, such a distortion may comprise a slight movement of the involved sensor due to the wind or to air displacements caused by passing vehicles.
[0103] In this way, the data with which the ML model is trained may be degraded in terms of sensing characteristics compared to the first sensing data initially input to the method, and thus may be considered as more realistic, or more representative of real-world sensing data captured with low-level sensing infrastructure, than these first sensing data.
The sensing characteristics of the sensor preferably comprise at least one of a resolution or a quality. When the sensor is an image sensor, the sensing characteristics of the sensor preferably comprise at least one of a resolution, a field of view, a focal length, an exposure time, a frame rate, a quality, a cleanness of lenses, an extent of image deformation and / or an extent of brightness.
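As a non-limiting illustration of the degrading described above, the following Python sketch lowers the resolution of an image, blurs it and adds noise. It is a hand-written approximation using NumPy and not the AI model of the described method. The function name `degrade`, its parameters, and the assumptions that the image is a 2D grayscale array with values between 0 and 1 are made for this example only; compression artifacts and occlusion are not covered.

```python
import numpy as np


def degrade(image, factor: int = 2, noise_std: float = 0.02,
            blur: bool = True, seed: int = 0) -> np.ndarray:
    """Degrade a 2D grayscale image with values in [0, 1]."""
    image = np.asarray(image, dtype=float)
    if image.ndim != 2:
        raise ValueError("expected a 2D grayscale image")
    if factor < 1:
        raise ValueError("factor must be at least 1")
    # Crop so that both dimensions are multiples of the downsampling factor.
    rows = image.shape[0] - image.shape[0] % factor
    cols = image.shape[1] - image.shape[1] % factor
    if rows == 0 or cols == 0:
        raise ValueError("image is smaller than the downsampling factor")
    # Lower the resolution by averaging factor x factor blocks of pixels.
    low = image[:rows, :cols].reshape(
        rows // factor, factor, cols // factor, factor
    ).mean(axis=(1, 3))
    if blur:
        # 3x3 box blur with edge padding, so the output size is unchanged.
        out_rows, out_cols = low.shape
        padded = np.pad(low, 1, mode="edge")
        low = sum(
            padded[dy:dy + out_rows, dx:dx + out_cols]
            for dy in range(3) for dx in range(3)
        ) / 9.0
    # Add zero-mean Gaussian noise, then keep values in the valid range.
    rng = np.random.default_rng(seed)
    low = low + rng.normal(0.0, noise_std, size=low.shape)
    return np.clip(low, 0.0, 1.0)


# Toy "ideal sensor" image: a smooth gradient of 8 x 8 pixels.
ideal = np.linspace(0.0, 1.0, 64).reshape(8, 8)
degraded = degrade(ideal, factor=2)
```

In practice the values of `factor` and `noise_std` would be chosen to match the obtained sensing characteristics of the sensor, for example its resolution.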
[0104] In this respect, it is noted that the choice of the type of sensing characteristics of the sensor may depend on the envisaged use cases.
[0105] Preferably, the sensor is one of a virtual sensor configured to sense virtual objects in a virtual three-dimensional, 3D, environment and a real sensor configured to sense real objects in a real 3D environment.
[0106] Both types of sensors have already been discussed above with respect to the first aspect of the invention. Virtual sensors can be used to render data, for example 2D image data, out of a 3D model of an environment which may be augmented with virtual 3D representations of certain static or dynamic objects. The ML model related to such a virtual sensor can then for example be used to predict to some extent the results of said rendering process. Real-world sensors on the other hand can be placed at any location within a real 3D environment, e.g. at some location on and / or nearby a curbside, and can be used for collecting sensing data, e.g. of certain behavior and / or public space situations, as seen from the perspective of said sensors. ML models related to such a real sensor can then e.g. be used to predict how certain real-world 3D scenarios will be visualized or otherwise rendered from the point of view of said sensor.
[0107] According to a preferred embodiment, the first and second sensing data each comprise at least one of image data and sound data, and the sensor comprises at least one of an image sensor, such as a fish-eye camera or an infrared camera, and a sound sensor.
[0108] Indeed, as illustrated above, ML models may be utilized for prediction, detection and / or classification purposes related to image data, but also to purposes related to other types of data such as sound data, temperature data or any other type of data that may be captured with a sensor, e.g. to sound data in order to identify whether or not certain sound samples originate from e.g. a priority vehicle. The relevant sensing characteristics may then pertain to the extent to which the sound sensor is able to capture and render sound data out of a physical sound.
[0109] Herein, and throughout the whole text, image data should be understood as not limited to simply vision images, but rather as comprising any type of data that may be represented by means of images. In particular, and as illustrated above, this includes radar data, LIDAR data and sound data. As such, a large variety of use cases may be handled by the resulting ML models.
[0110] The first sensing data is preferably obtained from another sensor, wherein preferably the other sensor is one of a high-quality or high-resolution virtual sensor configured to sense virtual objects in a virtual 3D environment and a high-quality or high-resolution real sensor configured to sense real objects in a real 3D environment.
[0111] In other words, the other (first) sensor with which the first sensing data are obtained may be an idealized, seemingly perfect virtual sensor capable of reproducing within its produced data every detail of the sensed virtual environment, or may be a high-quality or high-resolution real sensor, and the second sensor with which the second sensing data are obtained may be another virtual sensor with degraded and more real-world-like sensing characteristics, or may be a real-world sensor with suboptimal sensing characteristics.
[0112] In both situations, the step of converting the first sensing data into a set of second sensing data has the effect of modifying the given first sensing data in such a way that they become more representative of real-world sensing data. Consequently, even though the initial first sensing data may be idealized virtual data, the data that will eventually be used to train the ML model may be, at least insofar as the sensing characteristics are concerned, similar to real-world sensing data.
[0113] In a preferred embodiment, the first and second sensing data each comprise at least one of two-dimensional, 2D, sensing data and 3D sensing data.
[0114] Indeed, as illustrated above both 2D and 3D sensing data may be of interest when considering real or virtual sensors.
[0115] According to a third aspect of the invention, there is provided a computer-implemented method comprising the following steps:
[0116] obtaining sensing data of a real three-dimensional, 3D, environment;
[0117] generating, using an artificial intelligence, AI, model, synthetic data of at least one object in the real 3D environment based on the obtained sensing data; and
[0118] training a machine learning, ML, model using the synthetic data.
[0119] Thus, according to said third aspect of the invention, such a computer-implemented method combines the advantages of training an ML model with synthetic data rather than real-world data, as presented above in connection with the first aspect of the invention, with the benefits of relying on obtained sensing data of a real 3D environment.
[0120] In particular, the dataset used to train the ML model may comprise sensing data of a real 3D environment augmented with at least one AI-generated representation of at least one object. Thus, the disadvantages associated with the collection of real-world sensing data of said at least one object with sensors may be circumvented, such as privacy implications, additional costs and time spent on anonymizing such data, as well as dependency on sensor locations and numbers of sensors used.
[0121] Moreover, existing databases or libraries with input data need not be relied on to collect training data of said at least one object. Furthermore, synthetic data generation offers a large-scale source of highly dynamic training data for an ML model, with sufficient flexibility to ensure that the training data are representative for the specific detection and / or classification purposes envisaged, as well as guarantee a certain level of diversity of the training data. Thus, a high level of accuracy and performance of the ML model can be achieved, both on common and rare use cases, since accuracy and performance of an ML model may depend on the amount and diversity of training data received.
[0122] The 3D environment, on the other hand, is represented within the training data in the form of real-world sensing data. This is beneficial since in practice, the ML model will typically be used to perform operations on the basis of certain input data comprising real-world sensing data of a certain use case within a 3D environment, such as detection and / or classification of at least one object. An ML model trained with AI-augmented real-world sensing data of a 3D environment is particularly suited for recognizing such environments when presented with the typical use case data.
[0123] The method according to said third aspect is particularly advantageous in case the ML model is intended to be used solely with input data that are sensing data of certain situations or behaviors within one single predetermined 3D environment, such as sensing data captured by a single static sensor. In this case, it suffices in fact to obtain a single datum of said predetermined environment as captured by said static sensor, or a small-sized set of sensing data of said predetermined environment within e.g. different environmental conditions, e.g. different lighting and / or weather conditions for an image sensor or different levels of background noise for a sound sensor, as captured by said static sensor, to augment the resulting datum or data with AI-generated representations of at least one object, and to use the resulting set of synthetic data as (at least part of the) training dataset. Said training dataset will then be particularly tailored to this particular environment, and thus to the envisaged use cases, so that a high level of accuracy and performance can be expected of the ML model, whilst keeping the amount of computational time and resources readily manageable.
As mentioned above in connection with the first aspect of the invention, the training dataset need not be limited to generated synthetic data, but may also comprise a certain amount of pre-existing training data, e.g. from databases or libraries of appropriate data, and the ML model may be an ML model that was pre-trained with certain pre-existing data. In both situations, the generated synthetic data may help to improve the ML model’s accuracy and performance, since the synthetic data can be generated in such a way that they are tailored to the specific use cases envisaged. At the same time, by allowing such hybrid training datasets, the amount of synthetic data to be generated, and thereby the costs and time required for the step of generating synthetic data, can be reduced.
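As a non-limiting illustration of augmenting a single datum captured by a static sensor with a representation of an object, the following Python sketch blends an object patch into a background frame using a mask and returns the resulting frame together with the bounding box of the inserted object. The function name `paste_object`, the use of 2D grayscale arrays and the simple mask-based blending are assumptions of this example; in the described method the object representation would be generated by the AI model, which is not shown here.

```python
import numpy as np


def paste_object(background, patch, mask, top: int, left: int):
    """Blend `patch` into a copy of `background` at (top, left).

    `mask` holds values in [0, 1]: 1 keeps the patch pixel, 0 keeps the background.
    Returns the augmented frame and the bounding box (left, top, width, height).
    """
    frame = np.array(background, dtype=float)  # copy, the input stays untouched
    patch = np.asarray(patch, dtype=float)
    mask = np.asarray(mask, dtype=float)
    if patch.ndim != 2 or mask.shape != patch.shape:
        raise ValueError("patch and mask must be 2D arrays of equal shape")
    height, width = patch.shape
    if (top < 0 or left < 0
            or top + height > frame.shape[0] or left + width > frame.shape[1]):
        raise ValueError("patch does not fit inside the background")
    region = frame[top:top + height, left:left + width]
    frame[top:top + height, left:left + width] = mask * patch + (1.0 - mask) * region
    return frame, (left, top, width, height)


# Toy example: a bright 2 x 2 object placed on an empty 4 x 4 frame.
empty_frame = np.zeros((4, 4))
augmented, bounding_box = paste_object(
    empty_frame, np.ones((2, 2)), np.ones((2, 2)), top=1, left=1
)
```

Because the position of the inserted object is chosen by the generating step, the bounding box is known without manual annotation and may serve as a label in the training dataset.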
[0124] In a preferred embodiment, the generating comprises simulating, using the AI model, one or more virtual scenes involving the at least one object in the real 3D environment, and generating, using the AI model, synthetic image data of the at least one object in the real 3D environment by rendering image data of the one or more virtual scenes.
[0125] Preferably, the simulating comprises simulating, using the AI model, a plurality of different virtual scenes involving the at least one object based on at least one of the following: different locations and / or orientations of the at least one object, different conditions, e.g. lighting and / or weather conditions, of the 3D environment, different levels of occlusion of the at least one object by one or more obstacles, different sensing characteristics of one or more virtual sensors rendering said data, such as different resolutions, fields of view, focal lengths, exposure times, frame rates, qualities and / or cleanness of lenses.
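By way of non-limiting illustration, the variation between virtual scenes may be sketched in Python as a random sampling of scene parameters. The class `SceneConfig`, the function `sample_scenes`, the value ranges and the lists of conditions are all assumptions made for this example; the sketch only produces scene descriptions and does not perform the simulation or rendering.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical value sets for this example.
LIGHTING = ("day", "dusk", "night")
WEATHER = ("clear", "rain", "fog", "snow")
RESOLUTIONS = ((640, 480), (1280, 720), (1920, 1080))


@dataclass(frozen=True)
class SceneConfig:
    position_m: Tuple[float, float]   # location of the object in metres
    heading_deg: float                # orientation of the object
    lighting: str                     # lighting condition of the environment
    weather: str                      # weather condition of the environment
    occlusion: float                  # fraction of the object hidden by obstacles
    resolution: Tuple[int, int]       # resolution of the virtual sensor


def sample_scenes(count: int, seed: int = 0) -> List[SceneConfig]:
    """Draw `count` scene descriptions; the same seed gives the same scenes."""
    rng = random.Random(seed)
    return [
        SceneConfig(
            position_m=(rng.uniform(-20.0, 20.0), rng.uniform(0.0, 50.0)),
            heading_deg=rng.uniform(0.0, 360.0),
            lighting=rng.choice(LIGHTING),
            weather=rng.choice(WEATHER),
            occlusion=rng.uniform(0.0, 0.8),
            resolution=rng.choice(RESOLUTIONS),
        )
        for _ in range(count)
    ]


scenes = sample_scenes(count=5, seed=1)
```

Using a fixed seed is a design choice of this sketch: it makes a generated set of scenes reproducible, which may be useful when synthetic data are regenerated with modified settings and compared with an earlier set.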
[0126] In a preferred embodiment, the generating comprises obtaining sensing data of at least one real object in the real 3D environment, modeling the at least one object as a virtual representation of the at least one real object based on the obtained object sensing data, and generating, using the AI model, synthetic data of the at least one object in the real 3D environment.
[0127] In a preferred embodiment, the generating comprises obtaining textual information about the at least one object, and generating, using the AI model, synthetic data of the at least one object in the real 3D environment, and / or a 3D model of the at least one object, based on the received textual information.
[0128] Preferably, the synthetic data comprises at least one of synthetic image data and synthetic sound data. Preferably, the sensing data comprises at least one of sensing image data and sensing sound data. The obtained sensing data may be of the same data type as the synthetic data that is being generated, or of a different type. For example, in case the obtained sensing data are image data, e.g. radar data, the generated synthetic data may be image data of the same type (radar data) or a different type (e.g. vision image data), or may be sound data, and vice versa. In case data of different types are used as sensing data and synthetic data respectively, an additional conversion step may be included into the step of generating synthetic data.
[0129] Herein, and throughout the whole text, image data should not be limited to simply vision images, but may also comprise radar data, LIDAR data, or any other type of data that may be represented by means of images. As such, a large variety of use cases may be handled by the resulting ML models.
[0130] Indeed, the use case data intended to be presented to the ML model as input data may be image data, sound data and / or other types of data, and the operations intended to be performed by the ML model may rely on certain visual, auditive, audiovisual or other properties of the input data. For example, the purposes of the ML model may pertain to the visual detection and classification of objects within image data, to the detection and classification of certain sounds as belonging to a certain class of sounds, or to the detection and classification of certain patterns or characteristics within any type of data.
[0131] Although the description of the above embodiments is mostly directed to image data, the reader will understand that said description also applies to other data types such as sound data or temperature data, mutatis mutandis. Such detection and classification of image data and / or sound data and / or other data may be applied for different or similar purposes as presented above in connection with the first and second aspects of the invention.
[0132] In a preferred embodiment, the at least one object is a moving object such as a vehicle or a pedestrian, wherein preferably the vehicle is selected from a car, a truck such as a garbage truck, a bicycle, a scooter, a motorbike, an ambulance, a taxi, a delivery vehicle, or the at least one object is a static object such as a tree, a luminaire, a traffic light, a traffic sign, a parking place, urban furniture such as a public bench, a bus stop, or a public bin.
[0133] Preferably, the 3D environment comprises at least one of a road such as a street road, a sidewalk, a curbside, a pedestrian area, a parking area such as a parking lot, and a building facade.
[0134] Preferably, the training further comprises training the ML model using data of at least one real object in a real 3D environment.
Preferably, the training further comprises validating the synthetic data against ground-truth data to evaluate convergence of the output of the trained ML model, and upon determining that said convergence is not reached: modifying at least one of a set of parameters for training the ML model and a set of settings for generating the synthetic data, and retraining the ML model using the modified set of parameters and / or set of settings until determining that said convergence is reached.
[0135] The modifying of the set of settings for generating the synthetic data preferably comprises at least one of changing a library storing information about the at least one object and modifying rendering settings, such as a resolution, for generating the synthetic data, and regenerating synthetic data of the at least one object based on the changed library and / or the modified rendering settings.
[0136] According to a preferred embodiment, the synthetic data comprises at least one of two-dimensional (2D) synthetic data and 3D synthetic data.
[0137] Both options may be of interest, depending on the envisaged use cases of the ML model. Indeed, if the ML model is intended to perform operations solely on the basis of 2D input data, it may suffice to train the model solely with 2D training data, whereas if the ML model is intended to perform operations on the basis of 3D input data, or if said operations may depend on certain spatial properties of the input data, then it may be beneficial to train the ML model with 3D instead of 2D synthetic data.
[0138] The preferred and optional properties discussed above in connection to the first and second aspect, as well as their technical effects and advantages, apply mutatis mutandis to the third aspect as well.
[0139] According to a fourth aspect of the invention, there is provided a system comprising:
[0140] a processor; and
[0141] a memory in communication with the processor, the memory containing instructions that, when executed by the processor, cause the processor to perform any one of the computer-implemented methods presented above.
[0142] Such a system according to said fourth aspect then adopts the technical effects and advantages, as presented above in connection with the first, second, and third aspects of the invention, of the method to be performed by the processor.
[0143] According to a fifth aspect of the invention, there is provided a computer program product comprising a computer readable storage medium having program instructions executable by a computer to cause the computer to perform any one of the computer-implemented methods presented above. Such a computer program product according to said fifth aspect then adopts the technical effects and advantages, as presented above in connection with the first, second, and third aspects of the invention, of the method to be performed by the computer.
[0144] BRIEF DESCRIPTION OF THE FIGURES
[0145] The accompanying drawings are used to illustrate presently preferred non-limiting exemplary embodiments of methods, systems and computer program products of the present invention. The above and other advantages of the features and objects of the invention will become more apparent and the invention will be better understood from the following detailed description when read in conjunction with the accompanying drawings, in which:
[0146] Figure 1 is a flow chart illustrating the steps of a method according to the first aspect;
[0147] Figure 2 is a flow chart illustrating the steps of several exemplary embodiments of a method according to the first aspect;
[0148] Figure 3 is a flow chart illustrating the steps of several further exemplary embodiments of a method according to the first aspect;
[0149] Figure 4 is a flow chart illustrating the steps of another exemplary embodiment of a method according to the first aspect;
[0150] Figure 5 is a flow chart illustrating the steps of yet another exemplary embodiment of a method according to the first aspect;
[0151] Figure 6 is a flow chart illustrating the steps of a method according to the second aspect;
[0152] Figure 7 is a flow chart illustrating the steps of an exemplary embodiment of a method according to the second aspect;
[0153] Figure 8 is a flow chart illustrating the steps of a method according to the third aspect;
[0154] Figure 9 illustrates schematically a first exemplary embodiment of synthetic image data that may be generated as part of a method according to the first, second, or third aspect;
[0155] Figure 10 illustrates schematically a second exemplary embodiment of synthetic image data that may be generated as part of a method according to the first, second, or third aspect; and
[0156] Figure 11 illustrates schematically a third exemplary embodiment of synthetic image data that may be generated as part of a method according to the first, second, or third aspect.
[0157] DETAILED DESCRIPTION OF EMBODIMENTS
Figure 1 shows a flow chart illustrating a computer-implemented method according to the first aspect of the invention.
[0158] In step 100, synthetic data of at least one object to be classified in a three-dimensional (3D) environment is generated using an artificial intelligence (AI) model. In some embodiments, the synthetic data resulting from step 100 may comprise certain metadata indicating e.g. the presence of at least one object within the synthetic data, and a classification label attributed to said object. For example, the synthetic data can be image data such as pictures or videos in two or three dimensions, and the metadata can take the form of a bounding box encompassing an object in said synthetic image data, accompanied by a label specifying that said object belongs to a certain class, for example the class of pedestrians, the class of motorized vehicles, or the class of trees. In another example, the synthetic data can be sound data, and the metadata serve to identify certain sound fragments or frequencies as originating from at least one object that belongs to a certain class of objects.
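By way of illustration only, the bounding-box and class-label metadata described above might be held in a structure such as the following minimal Python sketch. All names (BoundingBox, ObjectAnnotation, SyntheticSample, the pixel-coordinate convention and the file path) are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BoundingBox:
    # Assumed convention: pixel coordinates with the origin at the top-left corner.
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        """Area of the box in pixels (zero for degenerate boxes)."""
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


@dataclass
class ObjectAnnotation:
    label: str        # class label, e.g. "pedestrian", "motorized_vehicle", "tree"
    box: BoundingBox  # box encompassing the object in the synthetic image


@dataclass
class SyntheticSample:
    data_uri: str  # placeholder location of the synthetic image or video frame
    annotations: List[ObjectAnnotation] = field(default_factory=list)

    def labels(self) -> List[str]:
        """Sorted list of the distinct classes present in this sample."""
        return sorted({a.label for a in self.annotations})


# Example: one synthetic frame containing a pedestrian and a tree.
sample = SyntheticSample(
    data_uri="synthetic/frame_0001.png",
    annotations=[
        ObjectAnnotation("pedestrian", BoundingBox(40, 60, 90, 200)),
        ObjectAnnotation("tree", BoundingBox(300, 10, 420, 260)),
    ],
)
```

For sound data, an analogous structure could carry time or frequency intervals instead of pixel boxes.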
[0159] In step 200, a machine learning (ML) model is trained using the synthetic data, wherein the ML model comprises a classification model configured to detect and classify the at least one object. In some embodiments, the resulting ML model can be applied for various purposes, including but not limited to automated and preferably real-time detection and classification within real-world data related to public space situations and the behavior of public space users. This can be of use for many different purposes, such as detecting traffic congestions and / or full rubbish bins, predicting the availability of free parking spots for dynamic parking pricing, detecting obstacles on the curbside limiting the mobility of users with disabilities, or management of curbside spaces in general.
[0160] An ML model, trained in step 200 with synthetic data of at least one object in a 3D environment, which was generated using an AI model in step 100, and comprising such a classification model, may thus accurately detect and classify other members of this class of objects, in a 3D environment.
[0161] The AI model used in step 100 may be an ML model, e.g. a Large Language Model (LLM), different from the ML model that is being trained in step 200, or may be a different type of AI model.
[0162] Figure 2 shows a flow chart illustrating several exemplary embodiments of a computer-implemented method according to the first aspect of the invention.
[0163] It is noted that the flow chart of Figure 2 encompasses several exemplary embodiments of said method, each comprising the above steps 100 and 200, but wherein step 100 may be preceded by and / or may encompass any combination of the steps outlined below. In an exemplary embodiment according to Figure 2, the method may be initiated by step 110 of obtaining sensing data of a real 3D environment, followed by step 111 of modeling the 3D environment as a virtual representation of the real 3D environment based on the obtained sensing data, in turn followed by step 100 of generating synthetic data using the AI model, wherein the synthetic data comprise synthetic data of the at least one object in the virtual 3D environment.
[0164] Examples of sensors, placed at various locations in a public space, e.g. along the road and / or the curbside, that may be used within step 110 may include but are not limited to an image sensor, an optical sensor such as a photodetector or any other light sensor, a sound sensor, a radar such as a Doppler effect radar, a LIDAR, a humidity sensor, a pollution sensor, a temperature sensor, a motion sensor, a biological hazard sensor, a proximity sensor, a gyroscope, an antenna, an RF sensor, a vibration sensor, a metering device, a measurement device for measuring a maintenance related parameter of a component of an edge device, an alarm device, or combinations thereof.
[0165] The obtained sensing data pertain to the real 3D environment, which may be to a large extent composed of static objects, such as buildings and roads. Such sensing data can therefore be collected relatively easily in an off-line fashion during a single phase preceding step 100, without any need for continuous and / or real-time monitoring of the real 3D environment.
[0166] Preferably, the obtained sensing data of the real 3D environment take the form of two-dimensional image data captured with one or various cameras, completed with certain distance metrics as well as 3D structure, texture, color and depth information collected using for example LIDAR or photogrammetry measurements. Within step 111, data formats such as USD (Universal Scene Description) can be utilized to model the geometry of the 3D environment, and thus transform the collected 2D image data of the real 3D environment, with the aid of said distance metrics and structure / texture and / or color information, into a static 3D virtual representation of the environment. Alternatively, one may consult the AI model that is being used within the step of generating synthetic data, in order to infer 3D structure, texture, depth and occlusion information simply from the collected 2D image data, and thus transform said 2D image data into a model of a 3D environment. In this case, AI techniques such as neural radiance fields may be exploited. Nevertheless, other types of data may be considered just as well.
[0167] When inspecting the resulting 3D virtual representation of the environment, one may notice certain noise within the model, as well as locations at which the model fails to realistically represent the sensed real-world 3D environment. Such artifacts may be typically caused by the limitations of the sensing infrastructure, such as a limited visibility of a used mobile LIDAR scanner, and / or a sensing resolution that unintentionally varies with the distance to the different scanned objects. This may prompt a new phase of collecting sensing data, with different settings of the sensing infrastructure, so that the captured sensing data may be cleaned from noise and artifacts on the basis of these newly collected sensing data when the latter are merged into the former. As such, steps 110 and 111 may in fact proceed iteratively in a feedback loop, wherein different phases of collecting sensing data, importing said data into the 3D model, and recalculating the settings and conditions for data collection on the basis of the artifacts spotted in the model, may succeed each other.
[0168] In another exemplary embodiment according to Figure 2, the method may be initiated by step 120 of obtaining sensing data of at least one real object in a real 3D environment, followed by step 121 of modeling the at least one object as a virtual representation of the at least one real object based on the obtained sensing data, in turn followed by step 100 of generating synthetic data using the AI model, wherein the synthetic data comprise synthetic data of the at least one object in the 3D environment. Herein, said 3D environment may be the real 3D environment of which sensing data were obtained, or may be a virtual 3D environment modeled as a virtual representation of said real 3D environment based on the obtained sensing data. The at least one sensed real object under consideration need not be static, but can be a dynamic or moving object such as a pedestrian, a vehicle or a rubbish bin that is continuously being filled.
[0169] For example, step 120 may comprise collecting the sensing data with one or more field sensors placed at several locations in a public space, e.g. along a road and / or curbside, and subsequently communicating these data, preferably in real-time, to a processing means configured to perform step 121 as well as step 100. Step 100 may then comprise augmenting data of a 3D environment with the modeled virtual representation of the at least one real object, wherein the data of a 3D environment may be real-world sensing data or AI-generated environmental data, such as virtual 3D environment data modeled as a virtual representation of a real 3D environment. As such, one may in fact integrate or incorporate an animated or abstracted version of said real object into a modeled 3D environment, and display any motion or behavior of said object in real-time within the modeled 3D environment. Consequently, the resulting 3D model may be capable of virtually mirroring any sensed real-world behavior or motion while still remaining detached from reality and maintaining the additional advantages and functionality of a virtual 3D model.
[0170] In yet another exemplary embodiment according to Figure 2, the method may be initiated by step 130 of simulating, using the AI model, one or more virtual scenes involving the at least one object in the 3D environment, followed by step 100 of generating synthetic data using the AI model, wherein the generating comprises generating synthetic data, for example synthetic image data, of the at least one object in the real or virtual 3D environment by rendering data, for example image data, of the one or more virtual scenes.
[0171] The above one or more virtual scenes involving the at least one object in the 3D environment that are being rendered in step 100 may represent different conditions of the at least one object, such as different locations and / or orientations and / or phases of motion of the at least one object, and / or may represent the at least one object within different conditions of the 3D environment, such as different lighting and / or weather conditions, different levels of occlusion of the at least one object by one or more obstacles, such as a leaf which is stuck in front of the camera or sensor, and / or different sensing characteristics of one or more virtual sensors rendering said data, such as different qualities, resolutions, fields of view, focal lengths, exposure times, frame rates, qualities and / or cleanness of lenses. As such, said virtual scenes may provide not only a replication of reality, for example with synthetic data in the form of a video replicating the sensed real-world motion of the sensed real-world object, but may also allow to visualize scenarios that may not be a replication of reality, but that may be rather unlikely to occur in the real world, and thus unlikely to form part of any obtained sensing data. An exemplary virtual scene may offer a view on a street in a geographical region known for its warm and dry climate, yet depicted in highly unlikely weather conditions like snowstorms or heavy rainfall. Another such exemplary virtual scene may represent a rare traffic situation, such as pedestrians or livestock blocking a road where high driving speeds are allowed, such as a highway. In exemplary embodiments involving step 130, step 100 may comprise rendering photorealistic image data of the one or more virtual scenes. In an example, the rendering of image data may comprise making pictures of the 3D virtual scene as seen from the viewpoint of one or more virtual sensors placed within the 3D environment. 
In case several such virtual sensors are considered, the rendered image data may represent one and the same virtual scene from different perspectives. In another example, the rendered image data may represent different phases of motion of an object in a 3D environment, and the ensemble of these rendered image data, displayed subsequently in time, may be considered as a video representing the motion of said object.
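As a purely illustrative sketch of how the scene conditions listed above (lighting, weather, occlusion level, virtual-sensor characteristics) could be varied programmatically before rendering, the following Python fragment samples randomized scene configurations. The class SceneConfig, the value lists and the numeric ranges are assumptions chosen for the example; the disclosure does not prescribe any of them, and the rendering itself is outside the scope of the sketch.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical value sets; unlikely conditions such as "snowstorm" are deliberately included.
LIGHTING = ["day", "dusk", "night"]
WEATHER = ["clear", "rain", "snowstorm", "fog"]
RESOLUTIONS = [(640, 480), (1280, 720), (1920, 1080)]


@dataclass(frozen=True)
class SceneConfig:
    lighting: str
    weather: str
    occlusion: float              # fraction of the object hidden by obstacles, 0.0 to 0.8
    resolution: Tuple[int, int]   # virtual sensor resolution (width, height)
    focal_length_mm: float        # virtual sensor focal length
    frame_rate: int               # virtual sensor frames per second


def sample_scene(rng: random.Random) -> SceneConfig:
    """Draw one scene configuration at random from the assumed value sets."""
    return SceneConfig(
        lighting=rng.choice(LIGHTING),
        weather=rng.choice(WEATHER),
        occlusion=round(rng.uniform(0.0, 0.8), 2),
        resolution=rng.choice(RESOLUTIONS),
        focal_length_mm=rng.choice([2.8, 4.0, 8.0]),
        frame_rate=rng.choice([5, 15, 30]),
    )


def sample_scenes(n: int, seed: int = 0) -> List[SceneConfig]:
    """Draw n configurations; a fixed seed makes the set reproducible."""
    rng = random.Random(seed)
    return [sample_scene(rng) for _ in range(n)]


scenes = sample_scenes(20, seed=1)
```

Each sampled configuration would then parameterize one rendering pass of a virtual scene; seeding the generator allows a given training set to be regenerated exactly.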
[0172] The skilled reader will readily understand from the above disclosure that the exemplary embodiments presented above with respect to Figure 2 may be combined with one another. For example, step 100 may comprise or be preceded by steps 110, 111, and 130, by steps 120, 121, and 130, or by a combination of each of steps 110, 111, 120, 121, and 130.
[0173] Figure 3 shows a flow chart illustrating several further exemplary embodiments of a computer-implemented method according to the first aspect of the invention. As for Figure 2 above, it is noted that the flow chart of Figure 3 encompasses several embodiments of said method, each comprising the above steps 100 and 200, but wherein step 100 may be preceded by and / or may encompass any combination of the steps outlined below.
[0174] In the exemplary embodiments according to Figure 3, the method may be initiated by a step of obtaining textual information. In an exemplary embodiment, this step may correspond to step 140 of obtaining textual information about the 3D environment. As an alternative for step 140, or in combination and in parallel with step 140, in another exemplary embodiment the method may be initiated with step 150 of obtaining textual information about at least one object to be classified in a 3D environment, which may be combined with either step 151 of obtaining sensing data of a real 3D environment, and / or with step 152 of generating, using the AI model, a 3D model of the at least one object based on the received textual information. Subsequently, within step 100, synthetic data is generated, using the AI model, of the at least one object in the 3D environment, based on the received textual information. Finally, within step 200 an ML model is trained using the synthetic data.
[0175] In exemplary embodiments involving step 140, step 100 may comprise prompting a generative AI model, such as an LLM, to generate from scratch image data showing both static environmental objects and static and / or dynamic objects placed within said environment, on the basis of a textual description of said objects.
[0176] In exemplary embodiments wherein steps 150 and 151 are combined, step 100 may comprise augmenting 2D pictures of a real-world 3D environment, showing e.g. building facades, driving lanes, curbsides and trees, with virtual representations of static and / or dynamic objects such as pedestrians, vehicles or rubbish bins.
[0177] In exemplary embodiments wherein steps 150 and 152 are combined, step 100 may comprise prompting a generative AI model, such as an LLM, not only to generate 2D images of static or dynamic objects on the basis of the received textual information, but also to use said textual information to create a virtual representation of said objects in three dimensions. On the basis of said 3D model, image data of said object as seen from several different viewpoints may then be rendered as part of step 100.
[0178] In order for the generated synthetic data to be suitable for training an ML model, the output of the used generative AI model may be completed with certain metadata indicating e.g. the presence of at least one object within the synthetic data, and / or a classification label attributed to said object. For example, the synthetic data can be image data such as pictures or videos in two or three dimensions, and the metadata can take the form of a bounding box encompassing an object in said synthetic image data, accompanied by a label specifying that said object belongs to a certain class, for example the class of pedestrians, the class of motorized vehicles, or the class of trees. Step 100 of generating synthetic data may therefore comprise an additional substep of creating such metadata. This substep may be performed in an automated fashion, e.g. by exploiting a pre-existing inference algorithm configured to detect and label instances of a certain specified class of objects, and optionally manually reviewing and correcting the outcomes of said inference algorithm.
[0179] Alternatively, especially in case no such dedicated inference algorithms are available, said substep of creating metadata may need to be manually performed. Another alternative may be to proceed with a hybrid substep of creating metadata, wherein the first few pieces of AI-generated data may be subjected to a manual detection and labeling process, and the results of said process may be subsequently used to train an inference algorithm configured for detecting and labeling instances of the considered class of objects within the further AI-generated data.
[0180] The skilled reader will readily understand from the above disclosure that the exemplary embodiments presented above with respect to Figure 3 may be combined with one another. For example, step 100 may comprise or be preceded by step 140, by steps 150 and 151, by steps 150 and 152, or by a combination of each of steps 140, 150, 151, and 152.
[0181] Alternatively or in addition to the above combinations, the skilled reader will readily understand from the above disclosure that the exemplary embodiments presented above with respect to Figure 3 may be combined with the exemplary embodiments presented above with respect to Figure 2. For example, step 100 may be performed on the basis of both sensing data of a real 3D environment and / or at least one real object, as well as textual information about said 3D environment and / or said real object and / or a different, not necessarily real object. In other words, step 100 may comprise any combination of the steps 110, 111, 120, 121, 130, 140, 150, 151, and 152. In particular, exemplary embodiments of the method may combine obtaining sensing data of at least one real object and / or a real 3D environment with obtaining textual information about said real environment and / or about said real object and / or about a different object, and synthetic data may be generated on the basis of the ensemble of information collected as such.
[0182] Figure 4 shows a flow chart illustrating the steps of another exemplary embodiment of a computer-implemented method according to the first aspect of the invention. In step 100, synthetic data of at least one object to be classified in a three-dimensional (3D) environment is generated using an artificial intelligence (AI) model. Step 100 illustrated in Figure 4 may correspond to step 100 illustrated in any of Figures 1-3.
[0183] In step 200, a machine learning (ML) model is trained using the synthetic data, wherein the ML model comprises a classification model configured to detect and classify the at least one object. Step 200 illustrated in Figure 4 may correspond to step 200 illustrated in any of Figures 1-3.
[0184] In step 300, the synthetic data may be validated against ground-truth data to evaluate convergence of the output of the trained ML model.
[0185] In step 310, it may be determined whether or not said convergence is reached. This determining of the convergence may be performed on the basis of several different ML model performance metrics, such as the model’s accuracy, precision, recall or F1-score.
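For reference, the metrics named above follow their standard definitions in terms of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN). The Python sketch below computes them for a single class treated as the positive class; the function names and the F1 threshold used as a convergence criterion are assumptions made for the example, since the disclosure does not fix a particular criterion.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[str], y_pred: Sequence[str], positive: str) -> Dict[str, float]:
    """Accuracy, precision, recall and F1-score for one class treated as 'positive'."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


def has_converged(metrics: Dict[str, float], min_f1: float = 0.9) -> bool:
    """Assumed convergence test for step 310: F1-score at or above a chosen threshold."""
    return metrics["f1"] >= min_f1


# Ground-truth labels versus labels predicted by the trained model.
truth = ["car", "car", "tree", "car"]
predicted = ["car", "tree", "tree", "car"]
metrics = binary_metrics(truth, predicted, positive="car")
```

In a multi-class setting the same computation could be repeated per class and averaged; which averaging scheme is appropriate would depend on the envisaged use case.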
[0186] In case it is determined in step 310 that said convergence is not reached, the method may proceed with step 320 of modifying a set of parameters for training the ML model, and / or with step 330 of modifying a set of settings for generating the synthetic data.
[0187] Step 330 may comprise any one or a combination of step 331 of changing a library storing information about the at least one object and step 332 of modifying rendering settings, such as a resolution, for generating the synthetic data.
[0188] After steps 331 and / or 332, the method may proceed with step 333 of regenerating synthetic data of the at least one object based on the changed library and / or the modified rendering settings. Subsequently, step 200 may be repeated, in the sense that the ML model may be retrained using the modified set of parameters as obtained in step 320, and / or the set of settings as obtained in steps 330, 331 and / or 332, and / or the newly generated synthetic data as obtained in step 333.
[0189] What may follow is an iterative feedback loop of the presented steps, that may be repeated until it is determined, as an outcome of the step 310, that convergence is reached. It should be noted that such an exemplary embodiment may require flexible workflows capable of processing rapidly evolving streams of input training data, typically in large amounts and of large volumes.
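The feedback loop of steps 100 (or 333), 200, 300, 310, 320 and 330 can be outlined as in the following Python sketch. The callables generate, train, evaluate and adjust are placeholders to be supplied by the implementer, and the toy stand-ins at the bottom exist only to make the example executable; none of them reflects an actual model or renderer.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Settings = Dict[str, Any]
Params = Dict[str, Any]


def train_until_converged(
    generate: Callable[[Settings], Any],
    train: Callable[[Any, Params], Any],
    evaluate: Callable[[Any], float],
    adjust: Callable[[Settings, Params, float], Tuple[Settings, Params]],
    settings: Settings,
    params: Params,
    threshold: float,
    max_rounds: int = 10,
) -> Tuple[Optional[Any], List[float]]:
    """Repeat generation, training and validation until the score reaches the threshold."""
    history: List[float] = []
    for _ in range(max_rounds):
        data = generate(settings)       # step 100 (first round) or step 333 (later rounds)
        model = train(data, params)     # step 200
        score = evaluate(model)         # step 300
        history.append(score)
        if score >= threshold:          # step 310: convergence reached
            return model, history
        settings, params = adjust(settings, params, score)  # steps 320 and / or 330
    return None, history                # no convergence within max_rounds


# Toy stand-ins (assumptions): the "model" is a number that grows with the rendering resolution.
def toy_generate(settings: Settings) -> int:
    return settings["resolution"]


def toy_train(data: int, params: Params) -> int:
    return data * params["epochs"]


def toy_evaluate(model: int) -> float:
    return min(1.0, model / 1024)


def toy_adjust(settings: Settings, params: Params, score: float) -> Tuple[Settings, Params]:
    # Step 332: modify a rendering setting (here, double the resolution); params left unchanged.
    return {**settings, "resolution": settings["resolution"] * 2}, params


model, history = train_until_converged(
    toy_generate, toy_train, toy_evaluate, toy_adjust,
    settings={"resolution": 64}, params={"epochs": 2}, threshold=0.9,
)
```

The max_rounds bound is an addition of the sketch, included so that the loop terminates even if convergence is never reached.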
[0190] In case it is determined in step 310 that convergence is reached, and hence that the step of training the ML model is completed, an optional final step 340 of stress-testing the created ML model may be performed in order to evaluate its final performance. Such a stress-test may be performed on the basis of real-world data representative of the envisaged use cases, such as obtained with sensors, that the ML model is instructed to classify. In parallel, the same set of data may be classified manually or with a pre-existing ML model known to perform well, and the outcomes of both classifications may be compared. The extent of agreement between the two or more sets of classification outcomes forms a measure for the performance of the ML model. Alternatively, the testing data may also comprise synthetic data representing conditions that differ substantially from those represented by the training data, for example representing extreme weather conditions or weather conditions that are highly unlikely to occur in the considered geographic areas, and during optional step 340 the ML model may be tested with respect to its performance when confronted with such data.
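One possible way to quantify the extent of agreement between two sets of classification outcomes is sketched below in Python. The raw agreement rate follows directly from the description; Cohen's kappa, which corrects that rate for agreement expected by chance, is a standard statistic that is not mentioned in the disclosure and is shown here only as an optional refinement. Function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Fraction of items to which both classifications assign the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    return sum(1 for a, b in zip(labels_a, labels_b) if a == b) / len(labels_a)


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement: (p_o - p_e) / (1 - p_e)."""
    p_o = agreement_rate(labels_a, labels_b)
    n = len(labels_a)
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both classifiers labelled independently with their own label frequencies.
    p_e = sum(count_a[c] * count_b[c] for c in set(count_a) | set(count_b)) / (n * n)
    if p_e == 1.0:
        return 1.0  # both classifiers always emit the same single label
    return (p_o - p_e) / (1.0 - p_e)


# Outcomes of the ML model under test versus a manual (or reference-model) classification.
model_labels = ["car", "car", "tree", "pedestrian"]
reference_labels = ["car", "tree", "tree", "pedestrian"]
rate = agreement_rate(model_labels, reference_labels)
kappa = cohen_kappa(model_labels, reference_labels)
```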
[0191] In step 341, the performance of the ML model during the stress-test may be assessed. Either it may be determined that convergence is reached within the stress-test, so that the ML model is deemed ready for being put to use, or a further step 200 of training the ML model may be prompted, wherein the training dataset may be determined in relation to the use cases reflected by the testing dataset applied in step 340.
[0192] Step 340 of stress-testing may also comprise an assessment of the performance of the used hardware when running the ML model. In this case, several different characteristics may be considered as a measure for said performance in step 341, such as the processing speed, the amount of data processed per second, the number of image frames processed per second and the amount of heat produced by the hardware.
[0193] The outcomes of the above-mentioned measurements may be interpreted on a use case-specific basis. For example, for highly dynamic traffic situations like highway monitoring, much higher numbers of image frames may need to be processed than for less dynamic public space or traffic situations like parking lot monitoring, whereas when the relevant hardware is placed in a warm climate, excessive heat production may be deemed more undesirable than a low processing speed. In case the relevant measurement characteristic were to indicate an unsatisfactory performance for the specific use case, one may either upgrade the hardware without altering the ML model, or optimize the ML model for the used hardware e.g. by reducing the number of frames per second for processing the ML model.
[0194] Figure 5 shows a flow chart of yet another exemplary embodiment of a computer-implemented method according to the first aspect of the invention. According to Figure 5, the training data need not be limited to the generated synthetic data, but may also comprise a certain amount of pre-existing training data, e.g. from databases or libraries of appropriate data, and the ML model may be an ML model that was pre-trained with certain pre-existing data. For example, pre-existing public spaces databases can be exploited to provide environmental data related to buildings and other infrastructure, such as precise locations of trees and / or bus stops.
[0195] More precisely, step 100 of generating, using an AI model, synthetic data of at least one object to be classified in a 3D environment may be followed either by step 200 of training an ML model using said synthetic data, or may be followed by first step 201 of training an ML model using pre-existing training data and subsequently step 200 of further training said pre-trained ML model with the generated synthetic data. The training dataset may thus either only comprise synthetic data, or may be a hybrid, mixed dataset of synthetic and pre-existing training data. Steps 100 and 200 illustrated in Figure 5 may respectively correspond to steps 100 and 200 illustrated in any of Figures 1-4.
[0196] Figure 6 shows a flow chart illustrating the steps of a computer-implemented method according to the second aspect of the invention.
[0197] In step 400, first sensing data having first sensing characteristics are obtained.
[0198] In step 410, second sensing characteristics of a sensor are obtained. The second sensing characteristics are known to match the sensing characteristics of a certain sensor. The first sensing characteristics may or may not match the sensing characteristics of another sensor, in which case the latter sensor may be referred to as the first sensor and the former sensor may be referred to as the second sensor. Each of the considered sensors may either be a real-world sensor with realistic sensing characteristics, such as a limited resolution and focal length, or a virtual sensor with optimal / ideal or suboptimal realistic sensing characteristics.
[0199] In step 420, the first sensing data are converted into second sensing data using an AI model, wherein the second sensing data have second sensing characteristics different from the first sensing characteristics, and wherein the second sensing characteristics match the sensing characteristics of the sensor. In case the first sensing characteristics correspond to an ideal virtual sensor and the second sensing characteristics correspond to a real-world sensor, as illustrated in Figure 7, step 420 may amount to transforming sensing data of certain scenes or scenarios as captured by an ideal virtual sensor into data that would hypothetically be obtained if the same scenes or scenarios were captured by said real-world sensor. In other words, the step of converting the first sensing data into second sensing data then serves at least to mimic a real-world sensing device and its effect on certain, not necessarily real-world, first sensing data.
[0200] In step 430, an ML model related to the sensor is trained using the second sensing data. In other words, the data that is being used to train the ML model in step 430 corresponds to a certain set of desired sensor sensing characteristics, even though a sensor with these characteristics, or data captured by such a sensor, may not be available. As such, one may accommodate for differences in sensing characteristics between the data available as training data, i.e. the first sensing data, and the data envisaged as inputs for the future ML model.
[0201] Figure 7 is a flow chart illustrating the steps of an exemplary embodiment of a method according to the second aspect.
[0202] As illustrated in Figure 7, the first sensing characteristics may correspond to an ideal virtual sensor configured to sense virtual objects in a virtual 3D environment, and the second sensing characteristics may correspond to a real-world sensor configured to sense real objects in a real 3D environment. The method may then comprise step 401 of obtaining virtual sensing data as captured by a first sensor which is an ideal virtual sensor, step 411 of obtaining the sensing characteristics of a second sensor which is a real-world sensor, step 421 of converting the obtained virtual sensing data into second sensing data having the sensor characteristics of the second sensor, with the use of an Al model, and finally step 430 of training a ML model, related to the second sensor, with the second sensing data. Step 421 of converting the first (virtual) sensing data into second sensing data may amount to transforming sensing data of certain scenes or scenarios as captured by an ideal virtual sensor into data that would hypothetically be obtained if the same scenes or scenarios were captured by said real-world sensor. In other words, step 421 then serves at least to mimic a real-world sensing device and its effect on certain first sensing data.
[0203] Within the embodiments of Figures 6 and 7, the considered sensors may be any sensor from the following non-exhaustive list of exemplary sensors: an image sensor such as a fish-eye camera or an infrared camera, an optical sensor such as a photodetector or any other light sensor, a sound sensor, a radar such as a Doppler effect radar, a LIDAR, a humidity sensor, a pollution sensor, a temperature sensor, a motion sensor, a biological hazard sensor, a proximity sensor, a gyroscope, an antenna, an RF sensor, a vibration sensor, a metering device, a measurement device for measuring a maintenance related parameter of a component of an edge device, an alarm device, or combinations thereof.The considered sensing data may then take the form of 2D or 3D sensing data, and the considered sensing characteristics may comprise at least one of a resolution, a field of view, a focal length, an exposure time, a frame rate, a quality, a cleanness of lenses, an extent of image deformation and / or an extent of brightness, wherein the precise choice of sensing characteristics is determined by the envisaged use cases.
[0204] Since, within the embodiment of Figure 7, the first sensing characteristics correspond to an ideal virtual sensor, whereas the second sensing characteristics correspond to a real-world sensor, said converting may comprise degrading the sensing characteristics of the first sensing data, for example by at least one of lowering a resolution of the first sensing data, adding noise to the first sensing data, adding blur to the first sensing data, and adding compression artifact or other distortion to the first sensing data, or changing a level of occlusion of the first sensing data by one or more obstacles. Step 421 of converting by degrading may lead to second sensing data which, in comparison with the first sensing data, can be considered more realistic, or more representative of real-world sensing data captured with low-level sensing infrastructure.
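The degrading operations listed above (lowering a resolution, adding blur, adding noise) may be illustrated by the following minimal sketch. It is not the AI model of step 421: it applies hand-written classical filters to a small grayscale image held as a list of rows, purely to show the kind of transformation involved. The function names, the parameters `factor`, `sigma` and `seed`, and the 8x8 test frame are all hypothetical.

```python
import random


def lower_resolution(img, factor):
    """Downsample a grayscale image (list of rows) by averaging factor x factor blocks."""
    h, w = len(img) // factor, len(img[0]) // factor
    return [
        [
            sum(img[y * factor + dy][x * factor + dx]
                for dy in range(factor) for dx in range(factor)) / factor ** 2
            for x in range(w)
        ]
        for y in range(h)
    ]


def box_blur(img):
    """Blur with a 3x3 mean filter; border pixels average over the neighbours that exist."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[j][i]
                    for j in range(max(0, y - 1), min(h, y + 2))
                    for i in range(max(0, x - 1), min(w, x + 2))]
            row.append(sum(vals) / len(vals))
        out.append(row)
    return out


def add_noise(img, sigma, rng):
    """Add zero-mean Gaussian noise and clamp each pixel to the [0, 255] range."""
    return [[min(255.0, max(0.0, p + rng.gauss(0.0, sigma))) for p in row] for row in img]


def degrade(img, factor=2, sigma=5.0, seed=0):
    """Chain the three degradations; the seed makes the noise reproducible."""
    rng = random.Random(seed)
    return add_noise(box_blur(lower_resolution(img, factor)), sigma, rng)


# Hypothetical 8x8 "ideal virtual sensor" frame: a bright square on a dark background.
ideal = [[255.0 if 2 <= x < 6 and 2 <= y < 6 else 0.0 for x in range(8)] for y in range(8)]
degraded = degrade(ideal)
```

In this sketch the degraded frame has half the resolution of the ideal frame in each dimension. A learned conversion model, as contemplated in step 421, would replace the fixed filters with a transformation fitted to the sensing characteristics of the real-world sensor.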
[0205] Figure 8 shows a flow chart illustrating the steps of a computer-implemented method according to the third aspect of the invention.
[0206] In step 500, sensing data of a real 3D environment are obtained. Step 500 illustrated in Figure 8 may correspond to step 110 illustrated in Figure 2 and / or to step 151 illustrated in Figure 3. Step 500 may amount to capturing sensing data with a single static sensor, aimed at a single predetermined 3D environment. In this case, within step 500 it suffices to obtain a single datum of said predetermined environment as captured by said static sensor, or a small-sized set of sensing data of said predetermined environment within different lighting and / or weather conditions as captured by said static sensor.
[0207] In step 510, synthetic data are generated, using an AI model, of at least one object in the real 3D environment based on the obtained sensing data. Step 510 illustrated in Figure 8 may correspond to step 100 illustrated in any of Figures 1-5. Step 510 may amount to augmenting the resulting environmental datum or data with AI-generated representations of the at least one object, and using the resulting set of synthetic data as (at least part of) the training dataset within step 520 (see below). Said training dataset will then be particularly tailored to this particular environment, and thus to the envisaged use cases, so that a high level of accuracy and performance can be expected of the ML model, whilst keeping the amount of computational time and resources readily manageable.
In step 520, an ML model is trained using the synthetic data. Step 520 illustrated in Figure 8 may correspond to step 200 illustrated in any of Figures 1-5. For example, the dataset used to train the ML model may comprise sensing data of a real 3D environment augmented with at least one AI-generated representation of at least one object. The resulting ML model may be used for detection and / or classification purposes, for example for detection and / or classification of certain objects within the sensed real 3D environment, so that it may be beneficial to choose the at least one object within step 510 as a function of the classes of objects envisaged to be detected.
[0208] The sensing data obtained within step 500, and the synthetic data generated within step 510, may be 2D image data and / or 3D image data and / or sound data and / or other types of data, and it may be so that the synthetic data generated in step 510 are 3D data even though the sensing data obtained in step 500 are 2D data, since additional distance, color, structure, texture or depth metrics such as LIDAR and / or photogrammetry measurements, and / or AI techniques such as neural radiance fields, may be exploited to generate 3D spatial models on the basis of 2D image data.
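The augmentation described for step 510 may be sketched as follows, under strong simplifying assumptions: a single environmental datum is held as a small grid of pixel values, and small fixed "sprites" stand in for the AI-generated object representations. Each generated sample records the class label and bounding box of the pasted object, so that the set could serve as a labelled training dataset in step 520. The names `paste_object` and `build_dataset`, the sprite values and the grid size are hypothetical.

```python
import random


def paste_object(background, sprite, top, left):
    """Return a copy of background with sprite pasted at (top, left), plus its bounding box.

    None entries in the sprite are treated as transparent.
    """
    out = [row[:] for row in background]
    for dy, sprite_row in enumerate(sprite):
        for dx, value in enumerate(sprite_row):
            if value is not None:
                out[top + dy][left + dx] = value
    box = {"x_min": left, "y_min": top,
           "x_max": left + len(sprite[0]) - 1, "y_max": top + len(sprite) - 1}
    return out, box


def build_dataset(background, sprites, n_samples, seed=0):
    """Create n_samples augmented copies of one background, each with one labelled object."""
    rng = random.Random(seed)
    h, w = len(background), len(background[0])
    dataset = []
    for _ in range(n_samples):
        label = rng.choice(sorted(sprites))
        sprite = sprites[label]
        # Choose a position at which the whole sprite fits inside the image.
        top = rng.randint(0, h - len(sprite))
        left = rng.randint(0, w - len(sprite[0]))
        image, box = paste_object(background, sprite, top, left)
        dataset.append({"image": image, "label": label, "box": box})
    return dataset


# Hypothetical single datum of the static environment and two placeholder object sprites.
background = [[0] * 8 for _ in range(6)]
sprites = {"pedestrian": [[9], [9]], "vehicle": [[5, 5, 5], [5, None, 5]]}
dataset = build_dataset(background, sprites, n_samples=10)
```

Because every sample reuses the same background, the resulting dataset is specific to the one sensed environment, which mirrors the tailoring to a single static sensor discussed for step 500.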
[0209] Figures 9-11 illustrate exemplary embodiments of synthetic image data that may be generated by a method according to the first, second, or third aspect, notably in step 100 or 101 of the method illustrated in Figures 1-5, in step 420 or 421 of the method illustrated in Figures 6-7, or in step 510 of the method illustrated in Figure 8.
[0210] Figure 9 shows an example of a synthetic 2D image 610 that may be generated with the use of an AI model as part of the method of any one of Figures 1-8, which comprises a synthetically generated 2D image 600 as well as metadata in the form of bounding boxes that encompass certain objects that have been detected within the image 600.
[0211] The image 600 shows a 2D image of a real or virtual 3D environment that comprises a street with two driving lanes, a curbside, building facades along the street and at the horizon, a river and a shore with rocks, a bridge over said river, a terrace with chairs, tables and parasols, surrounded by a balustrade, and urban furniture / decoration such as luminaires and palm trees. The image 600 also shows several dynamic objects that appear to be in motion within the depicted 3D environment, such as pedestrians, motorized vehicles and rubbish bins.
[0212] The image 610 shows the result of complementing the image 600 with certain metadata that indicate the presence of several such dynamic objects within the depicted environment. These objects have been detected either in a manual fashion, or by presenting the image 600 as an input to a dedicated inference algorithm or AI model. As a result of such a detection and inference process, a bounding box is drawn around each detected object, and a classification label is added to each such bounding box, indicating which class the detected object is asserted to form part of.
[0213] Within the image 610, said classification label is encoded in the color and the line style of the drawn bounding box. Indeed, a bounding box drawn in black solid lines indicates that the object detected within said box is a pedestrian. A bounding box drawn in black dashed lines indicates a motorized vehicle that appears to be in motion. A bounding box drawn in grey dashed lines indicates a motorized vehicle that, by its position on or near the curbside instead of on a driving lane, has been classified by the manual classifier or the dedicated inference algorithm or AI model as a parked car. Finally, a bounding box drawn in grey solid lines indicates a rubbish bin.
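The encoding of classification labels in the color and line style of the bounding boxes of the image 610 may be represented, by way of a minimal sketch, as a lookup table attached to a bounding-box record. The class identifiers, the `BoxAnnotation` type and the `decode_label` helper are hypothetical names chosen for illustration; only the four color and line style pairings are taken from the description of Figure 9.

```python
from dataclasses import dataclass

# Color and line style per class, as described for image 610 of Figure 9.
STYLE_BY_CLASS_610 = {
    "pedestrian": ("black", "solid"),
    "moving_vehicle": ("black", "dashed"),
    "parked_vehicle": ("grey", "dashed"),
    "rubbish_bin": ("grey", "solid"),
}
# Inverse table, used to recover the class label from a drawn box.
CLASS_BY_STYLE_610 = {style: name for name, style in STYLE_BY_CLASS_610.items()}


@dataclass(frozen=True)
class BoxAnnotation:
    """One bounding box in pixel coordinates together with its classification label."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int
    label: str

    @property
    def style(self):
        """Return the (color, line style) pair used to draw this box."""
        return STYLE_BY_CLASS_610[self.label]


def decode_label(color, line_style):
    """Recover the classification label from a box's color and line style."""
    return CLASS_BY_STYLE_610[(color, line_style)]


annotation = BoxAnnotation(x_min=10, y_min=20, x_max=40, y_max=90, label="pedestrian")
```

Storing the label explicitly and deriving the drawing style from it keeps the metadata machine-readable, whereas the drawn color and line style serve only as a visual rendering of that metadata.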
[0214] Similarly, the image 620 shown in Figure 10 and the image 630 shown in Figure 11 may be generated with the use of an AI model as part of the method of any one of Figures 1-8, and may comprise a 2D image of an urban 3D environment in which several dynamic objects have been detected and classified either manually or by a dedicated inference algorithm or AI model. The metadata associated with said detection and classification also take the form of bounding boxes encompassing said objects, in a particular color and line style that indicates the object’s classification label.
[0215] In the image 620 of Figure 10, bounding boxes drawn in black solid lines indicate pedestrians, bounding boxes drawn in black dashed lines indicate motorized vehicles that appear to be in motion, bounding boxes drawn in grey dashed lines indicate motorized vehicles that appear to be stationary, e.g. parked, bounding boxes drawn in grey solid lines indicate trees, and bounding boxes drawn in light grey solid lines indicate other traffic users such as cyclists.
[0216] Similarly, in the image 630 of Figure 11, bounding boxes drawn in white dashed lines indicate motorized vehicles that appear to be in motion, bounding boxes drawn in grey dashed lines indicate motorized vehicles that appear to be stationary, and bounding boxes drawn in black solid lines indicate pedestrians. The images 620 and 630 further depict a 3D environment comprising a street with driving lanes, a curbside at both sides of the street, building facades, and urban furniture such as luminaires, rubbish bins and traffic signs.
[0217] The weather and lighting conditions depicted by the different images 610, 620, and 630 may be substantially different: the images 610 and 620 appear to show the considered environment under sunny weather conditions, whereas the image 630 depicts fog, so that little sunlight is cast upon the environment, and the horizon and any objects at larger distances from the position at which the image appears to have been made are to some extent obscured. The image 630 may thus form part of the result of step 130 of simulating virtual scenes of the objects in the 3D environment as in the method of Figure 2, wherein the other virtual scenes simulated within the same step may depict the same 3D urban environment in different weather and / or lighting conditions, e.g. sunny and dry weather, rainfall, snowfall, storm with strong winds, at dawn, at dusk or during the night.
[0218] Alternatively, a collection of images, comprising the image 630 as well as images of the same 3D urban environment under the enumerated weather and / or lighting conditions, may be generated as part of step 333 of regenerating the synthetic data, in response to a desired convergence not being achieved in a validation phase (step 300), thus prompting to consider a greater variability of imaging conditions within the synthetic training data, or as part of step 340 of stress-testing with data representing unlikely conditions, within the method of Figure 4.
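The variation of weather and lighting conditions enumerated above may be sketched as a Cartesian product of two lists, each combination describing one virtual scene to simulate or regenerate. The identifiers below are hypothetical, and the entry `daylight` is an assumption added so that the weather conditions can also be combined with ordinary daytime lighting.

```python
from itertools import product

# Weather conditions named in the description, including the fog shown in image 630.
WEATHER = ["sunny_dry", "rainfall", "snowfall", "storm_strong_winds", "fog"]
# Lighting conditions; "daylight" is an assumed default not listed in the description.
LIGHTING = ["daylight", "dawn", "dusk", "night"]


def enumerate_scenarios(weather=WEATHER, lighting=LIGHTING):
    """Return one scenario description per weather and lighting combination."""
    return [{"weather": w, "lighting": l} for w, l in product(weather, lighting)]


scenarios = enumerate_scenarios()
```

Each scenario dictionary could then be passed to whichever simulation or rendering step is used, so that the same 3D urban environment is depicted under every listed combination.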
[0219] Moreover, it should be noted that the image 630 of Figure 11 is highly photorealistic in comparison to the image 610 of Figure 9 and the image 620 of Figure 10. As such, the image 630 may for example be the result of a step of rendering 2D image data of a 3D virtual scene or virtual 3D environment, e.g. as seen from the viewpoint of a virtual sensor placed within the virtual 3D environment, as part of step 100 of generating synthetic data within embodiments of the method of Figure 2 that involve step 130 of simulating virtual scenes, or of step 101 of generating synthetic data within embodiments of the method of Figure 3 that involve step 152 of generating a 3D model of an object.
[0220] Alternatively, the high extent of photorealism in the image 630 may be the result of step 332 of modifying rendering settings in response to desired convergence not being achieved in a validation phase (step 300) in the method of Figure 4. As mentioned before, step 332 may amount to allowing the rendering process to take more time so that more detailed and more photorealistic images result from the rendering process.
[0221] In yet another alternative, the image 630 may form part of the second sensing data obtained as a result of step 420 or 421 of converting first sensing data into second sensing data having second sensing characteristics, within the method of Figure 6 or 7. For example, the image 630 may be the result of a sensing characteristics degrading process applied onto the sensing data of an ideal virtual sensor, that were created by rendering virtual 2D sensing data from a virtual 3D model. Such a process is expected to give rise to photorealistic images, yet reminiscent of realistic, degraded sensing characteristics.
[0222] As illustrated in the images 610, 620, and 630 of Figures 9, 10, and 11, respectively, the detection and inference process performed when complementing the initially generated synthetic data with metadata in the form of colored bounding boxes, may not result in a detection and classification of each of the static or dynamic objects one may distinguish in said images.
[0223] Indeed, not all instances of the class of pedestrians or the class of motorized vehicles have been recognized in the images 610-630. This is particularly the case in Figure 11, wherein the foggy weather conditions may have compromised the performance of the inference algorithms used for the creation of the metadata. In such cases, step 330 of modifying the generating settings may be prompted as in the method of Figure 4, so that a more performant inference algorithm is used during step 333 of regenerating synthetic data.
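The loop of validating, modifying the generating settings and regenerating the synthetic data may be sketched as follows. This is a structural illustration only: the generation, training and validation steps are passed in as callables, and the toy callables in the usage lines are placeholders that do not train any model. The function name `train_until_converged`, the setting `render_samples`, the convergence threshold `target` and the limit `max_rounds` are hypothetical.

```python
def train_until_converged(generate, train, validate, settings, adjust, target, max_rounds=5):
    """Regenerate data and retrain until the validation score reaches target.

    Returns the converged model (or None if max_rounds is exhausted) and the
    history of (settings, score) pairs, one per round.
    """
    history = []
    for _ in range(max_rounds):
        data = generate(settings)    # generate or regenerate the synthetic data
        model = train(data)          # train the ML model on the synthetic data
        score = validate(model)      # validation phase against ground-truth data
        history.append((dict(settings), score))
        if score >= target:
            return model, history
        settings = adjust(settings)  # modify the generating settings and try again
    return None, history


# Toy placeholders: the "model" is just a number that grows with the render setting.
model, history = train_until_converged(
    generate=lambda s: s["render_samples"],
    train=lambda data: data,
    validate=lambda m: m / 100,
    settings={"render_samples": 16},
    adjust=lambda s: {**s, "render_samples": s["render_samples"] * 2},
    target=0.5,
)
```

With these placeholders the loop stops in the third round, once the doubled setting first lifts the toy score above the threshold. In practice the `adjust` callable could change the inference algorithm used for the metadata, the object library or the rendering settings.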
[0224] Moreover, contrary to Figure 9, Figure 10 shows an image in which rubbish bins have not been attributed a bounding box and a classification label. This may be because the class of rubbish bins has been determined not to be of interest within the method of Figure 10, or because the used inference algorithm may have failed to identify the distinguishable rubbish bins. Subsequently, an ML model trained on the basis of the image 620 of Figure 10 may not be configured for properly classifying rubbish bins.
[0225] The image 610 of Figure 9 shows, amongst regular cars, also a truck, which has been enclosed by a black dashed bounding box and thus has been identified simply as a motorized vehicle. The used detection and inference process can thus be concluded not to make a distinction between regular cars and trucks. Depending on the envisaged use cases, it may be of interest to make a further distinction between said two classes of motorized vehicles as part of the creation of the metadata of the generated synthetic data.
[0226] Finally, it should be noted that a large number of object classes may be envisaged when generating synthetic data, and the used inference algorithms may be configured for detecting and classifying instances of all such classes. For example, the image 620 of Figure 10 shows a bounding box drawn in light grey solid lines around an instance of the class of cyclists. However, if the used inference algorithm were not capable of distinguishing between cyclists and pedestrians, a different classification label may have been attributed to said cyclist, and this may affect the subsequent training of an ML model with said synthetic data and thus the performance of the trained ML model, in particular when confronted with input data depicting one or more cyclists.
[0227] Whilst the principles of the invention have been set out above in connection with specific embodiments, it is understood that this description is intended merely by way of example and not as a limitation of the scope of protection which is determined by the appended claims.
Additional aspects, advantages, and characteristics of computer-implemented method embodiments, system embodiments, and computer program product embodiments are defined by the following set of clauses:
[0228] 1. A computer-implemented method comprising:
[0229] generating, using an artificial intelligence, AI, model, synthetic data of at least one object to be classified in a three-dimensional, 3D, environment; and
[0230] training a machine learning, ML, model using the synthetic data;
[0231] wherein the ML model comprises a classification model configured to detect and classify the at least one object.
[0232] 2. The method of clause 1, wherein the generating comprises:
[0233] obtaining sensing data of a real 3D environment;
[0234] modeling the 3D environment as a virtual representation of the real 3D environment based on the obtained sensing data; and
[0235] generating, using the AI model, synthetic data of the at least one object in the virtual 3D environment.
[0236] 3. The method of clause 1 or 2, wherein the generating comprises:
[0237] simulating, using the AI model, one or more virtual scenes involving the at least one object in the 3D environment; and
[0238] generating, using the AI model, synthetic image data of the at least one object in the 3D environment by rendering image data of the one or more virtual scenes.
[0239] 4. The method of clause 3, wherein the simulating comprises simulating, using the AI model, a plurality of different virtual scenes involving the at least one object based on at least one of the following: different locations and / or orientations of the at least one object, different lighting and / or weather conditions of the 3D environment, different levels of occlusion of the at least one object by one or more obstacles, different sensing characteristics of one or more virtual image sensors rendering said image data, such as different resolutions, fields of view, focal lengths, exposure times, frame rates, qualities and / or cleanness of lenses.
[0240] 5. The method of any of the previous clauses, wherein the generating comprises: obtaining sensing data of at least one real object in a real 3D environment;
[0241] modeling the at least one object as a virtual representation of the at least one real object based on the obtained sensing data; and
[0242] generating, using the AI model, synthetic data of the at least one object in the 3D environment.
6. The method of any one of the previous clauses, wherein the generating comprises: obtaining textual information about the 3D environment and / or the at least one object; and generating, using the AI model, synthetic data of the at least one object in the 3D environment based on the received textual information.
[0243] 7. The method of any one of the previous clauses, wherein the generating comprises: obtaining sensing data of a real 3D environment;
[0244] obtaining textual information about the at least one object; and
[0245] generating, using the AI model, synthetic data of the at least one object in the real 3D environment based on the received textual information.
[0246] 8. The method of any one of the previous clauses, wherein the generating comprises: obtaining textual information about the at least one object; and
[0247] generating, using the AI model, a 3D model of the at least one object based on the received textual information.
[0248] 9. The method of any one of the previous clauses, wherein the synthetic data comprises at least one of synthetic image data and synthetic sound data.
[0249] 10. The method of any one of clauses 2, 5 and 7, wherein the sensing data comprises at least one of sensing image data and sensing sound data.
[0250] 11. The method of any one of the previous clauses, wherein the at least one object is a moving object such as a vehicle or a pedestrian, wherein preferably the vehicle is selected from a car, a truck such as a garbage truck, a bicycle, a scooter, a motorbike, an ambulance, a taxi, a delivery vehicle; or
[0251] wherein the at least one object is a static object such as a tree, a luminaire, a traffic light, a traffic sign, a parking place, urban furniture such as a public bench, a bus stop, or a public bin.
[0252] 12. The method of any one of the previous clauses, wherein the 3D environment comprises at least one of a road such as a street road, a sidewalk, a curbside, a pedestrian area, a parking area such as a parking lot, and a building facade.
[0253] 13. The method of any one of the previous clauses, wherein the training further comprises training the ML model using data of at least one real object in a real 3D environment.
14. The method of any one of the previous clauses, wherein the training further comprises: validating the synthetic data against ground-truth data to evaluate convergence of the output of the trained ML model; and
[0254] upon determining that said convergence is not reached:
[0255] modifying at least one of a set of parameters for training the ML model and a set of settings for generating the synthetic data; and
[0256] retraining the ML model using the modified set of parameters and / or set of settings until determining that said convergence is reached.
[0257] 15. The method of clause 14, wherein the modifying of the set of settings for generating the synthetic data comprises:
[0258] at least one of changing a library storing information about the at least one object and modifying rendering settings, such as a resolution, for generating the synthetic data; and
[0259] regenerating synthetic data of the at least one object based on the changed library and / or the modified rendering settings.
[0260] 16. The method of any one of the previous clauses, wherein the synthetic data comprises at least one of two-dimensional, 2D, synthetic data and 3D synthetic data.
[0261] 17. A computer-implemented method comprising:
[0262] obtaining sensing data of a real three-dimensional, 3D, environment;
[0263] generating, using an artificial intelligence, AI, model, synthetic data of at least one object in the real 3D environment based on the obtained sensing data; and
[0264] training a machine learning, ML, model using the synthetic data.
[0265] 18. The method of clause 17, wherein the generating comprises:
[0266] simulating, using the AI model, one or more virtual scenes involving the at least one object in the real 3D environment; and
[0267] generating, using the AI model, synthetic image data of the at least one object in the real 3D environment by rendering image data of the one or more virtual scenes.
[0268] 19. The method of the previous clause, wherein the simulating comprises simulating, using the AI model, a plurality of different virtual scenes involving the at least one object based on at least one of the following: different locations and / or orientations of the at least one object, different lighting and / or weather conditions of the 3D environment, different levels of occlusion of the at least one object by one or more obstacles, different sensing characteristics of one or more virtual image sensors rendering said image data, such as different resolutions, fields of view, focal lengths, exposure times, frame rates, qualities and / or cleanness of lenses.
[0269] 20. The method of any one of the clauses 17-19, wherein the generating comprises: obtaining sensing data of at least one real object in the real 3D environment;
[0270] modeling the at least one object as a virtual representation of the at least one real object based on the obtained object sensing data; and
[0271] generating, using the AI model, synthetic data of the at least one object in the real 3D environment.
[0272] 21. The method of any one of the clauses 17-20, wherein the generating comprises: obtaining textual information about the at least one object; and
[0273] generating, using the AI model, synthetic data of the at least one object in the real 3D environment, and / or a 3D model of the at least one object, based on the received textual information.
[0274] 22. The method of any one of the clauses 17-21, wherein the synthetic data comprises at least one of synthetic image data and synthetic sound data.
[0275] 23. The method of any one of the clauses 17-22, wherein the sensing data comprises at least one of sensing image data and sensing sound data.
[0276] 24. The method of any one of the clauses 17-23, wherein the at least one object is a moving object such as a vehicle or a pedestrian, wherein preferably the vehicle is selected from a car, a truck such as a garbage truck, a bicycle, a scooter, a motorbike, an ambulance, a taxi, a delivery vehicle; or
[0277] wherein the at least one object is a static object such as a tree, a luminaire, a traffic light, a traffic sign, a parking place, urban furniture such as a public bench, a bus stop, or a public bin.
[0278] 25. The method of any one of the clauses 17-24, wherein the 3D environment comprises at least one of a road such as a street road, a sidewalk, a curbside, a pedestrian area, a parking area such as a parking lot, and a building facade.
[0279] 26. The method of any one of the clauses 17-25, wherein the training further comprises training the ML model using data of at least one real object in a real 3D environment.
[0280] 27. The method of any one of the clauses 17-26, wherein the training further comprises: validating the synthetic data against ground-truth data to evaluate convergence of the output of the trained ML model; and
[0281] upon determining that said convergence is not reached:
[0282] modifying at least one of a set of parameters for training the ML model and a set of settings for generating the synthetic data; and
[0283] retraining the ML model using the modified set of parameters and / or set of settings until determining that said convergence is reached.
[0284] 28. The method of the previous clause, wherein the modifying of the set of settings for generating the synthetic data comprises:
[0285] at least one of changing a library storing information about the at least one object and modifying rendering settings, such as a resolution, for generating the synthetic data; and
[0286] regenerating synthetic data of the at least one object based on the changed library and / or the modified rendering settings.
[0287] 29. The method of any one of clauses 17-28, wherein the synthetic data comprises at least one of two-dimensional, 2D, synthetic data and 3D synthetic data.
[0288] 30. The method of any one of clauses 17-29, wherein the ML model comprises a classification model configured to detect and classify the at least one object.
[0289] 31. A system comprising:
[0290] a processor; and
[0291] a memory in communication with the processor, the memory containing instructions that, when executed by the processor, cause the processor to perform the method of any one of the previous clauses.
[0292] 32. A computer program product comprising a computer readable storage medium having program instructions executable by a computer to cause the computer to perform the method of any one of clauses 1-30.
Claims
1. A computer-implemented method comprising:
obtaining a first sensing data having first sensing characteristics;
obtaining sensing characteristics of a sensor;
converting, using an artificial intelligence, AI, model, the first sensing data into a second sensing data having second sensing characteristics different from the first sensing characteristics, wherein the second sensing characteristics match the sensing characteristics of the sensor; and
training a machine learning, ML, model related to the sensor using the second sensing data.
2. The method of claim 1, wherein the converting comprises degrading the first sensing characteristics by at least one of lowering a resolution or quality of the first sensing data, adding noise to the first sensing data, adding blur to the first sensing data, and adding compression artifact or other distortion to the first sensing data, changing a level of occlusion of the first sensing data by one or more obstacles.
3. The method of any one of the previous claims, wherein the sensing characteristics of the sensor comprise at least one of a resolution or a quality.
4. The method of any one of the previous claims, wherein the sensor is an image sensor and the sensing characteristics of the sensor comprise at least one of a resolution, a field of view, a focal length, an exposure time, a frame rate, a quality and / or a cleanness of lenses, an extent of image deformation, and an extent of brightness.
5. The method of any one of the previous claims, wherein the sensor is one of a virtual sensor configured to sense virtual objects in a virtual three-dimensional, 3D, environment and a real sensor configured to sense real objects in a real 3D environment.
6. The method of any one of the previous claims, wherein the first and second sensing data each comprise at least one of image data and sound data;
wherein the sensor comprises at least one of an image sensor, such as a fish-eye camera or an infrared camera, and a sound sensor.
7. The method of any one of the previous claims, wherein the first sensing data is obtained from another sensor, wherein preferably the other sensor is one of a high-quality or high-resolution virtual sensor configured to sense virtual objects in a virtual 3D environment and a high-resolution real sensor configured to sense real objects in a real 3D environment.
8. The method of any one of the previous claims, wherein the first and second sensing data each comprise at least one of two-dimensional, 2D, sensing data and 3D sensing data.
9. The method of any one of the previous claims, wherein the ML model comprises a classification model configured to detect and classify at least one object sensed by the sensor.
10. A computer-implemented method comprising:
obtaining sensing data of a real three-dimensional, 3D, environment;
generating, using an artificial intelligence, AI, model, synthetic data of at least one object in the real 3D environment based on the obtained sensing data; and
training a machine learning, ML, model using the synthetic data.
11. The method of claim 10, wherein the sensing data and the synthetic data each comprise at least one of synthetic image data and synthetic sound data.
12. The method of claim 10 or 11, wherein the sensing data comprises at least one of sensing image data and sensing sound data.
13. The method of any one of claims 10-12, wherein the synthetic data comprises at least one of two-dimensional, 2D, synthetic data and 3D synthetic data.
14. The method of any one of claims 10-13, wherein the ML model comprises a classification model configured to detect and classify the at least one object.
15. A system comprising:
a processor; and
a memory in communication with the processor, the memory containing instructions that, when executed by the processor, cause the processor to perform the method of any one of the previous claims.
16. A computer program product comprising a computer readable storage medium having program instructions executable by a computer to cause the computer to perform the method of any one of claims 1-14.