Method and apparatus for improving detection accuracy of point of interest under various lighting conditions
A CNN-based deep learning model enhances indoor mapping and localization by training on synthetic and real-world images under varying lighting conditions, addressing the challenge of inconsistent object appearance due to lighting changes.
Patent Information
- Authority / Receiving Office
- WO
- Patent Type
- Applications
- Current Assignee / Owner
- LG ELECTRONICS INC
- Filing Date
- 2025-12-23
- Publication Date
- 2026-07-02
AI Technical Summary
Challenging lighting conditions, such as low light, strong sunlight, and artificial LED lighting, pose technical difficulties for stable mapping and tracking camera movement in indoor environments, causing objects to appear significantly different to robot cameras.
A convolutional neural network (CNN)-based deep learning model is used to detect image points of interest, trained using synthetic features, real photographs, and multiple object images under varying lighting conditions, incorporating techniques like homography and co-illumination to enhance mapping stability.
Improves mapping and localization accuracy under diverse lighting conditions by stabilizing image point detection, enabling precise robot navigation and object recognition.
Abstract
Description
Method and device for improving point of interest detection accuracy under various lighting conditions
[0001] The present disclosure relates to a method and apparatus for improving the accuracy of point of interest detection under various lighting conditions.
[0002] A robot can refer to a machine that automatically processes or operates assigned tasks using its own capabilities. In particular, robots equipped with the ability to perceive their surroundings and perform tasks autonomously are called intelligent robots. Depending on their purpose of use or field, robots can be classified into various categories, such as industrial robots, medical robots, household robots, and military robots.
[0003] The driving device of a robot may include actuators or motors and can perform various physical movements, such as moving robot joints. Additionally, a mobile robot may include wheels, brakes, propellers, etc. in the driving device and can move on the ground or fly in the air.
[0004] Indoor robot navigation is generally performed using two-dimensional (2D) or three-dimensional (3D) maps. These maps can be used to guide robot navigation in various applications such as autonomous cleaning, food and service delivery, tourist assistance, and automated roaming operations.
[0005] A 3D map can be generated by using an RGB-D (Red-Green-Blue-Depth) camera together with a SLAM (Simultaneous Localization and Mapping) algorithm. The map data may include object identification information for various objects placed in the space where the robot moves. For example, it may include object identification information for fixed objects such as walls or doors, and movable objects such as furniture or desks. The object identification information may include the name, type, distance, and location of the objects. Additionally, the object identification information may include specific 2D image patterns and 3D shape information.
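As an illustration only, the object identification information described above could be held in a small record type. This is a minimal sketch: the class name `MapObject`, its field names, and the sample values are hypothetical and are not taken from the disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MapObject:
    """Hypothetical record for one object entry in the map data."""
    name: str                              # e.g. "wall", "desk"
    kind: str                              # the object's type
    movable: bool                          # False for walls/doors, True for furniture
    distance_m: float                      # distance from the robot, in meters
    location: tuple[float, float, float]   # position in map coordinates


# Illustrative values only.
objects = [
    MapObject("wall", "structure", False, 3.2, (3.2, 0.0, 0.0)),
    MapObject("desk", "furniture", True, 1.5, (1.0, 1.1, 0.0)),
]

# Separate fixed objects from movable ones.
fixed_names = [o.name for o in objects if not o.movable]
print(fixed_names)
```

2D image patterns and 3D shape information could be attached as further fields; they are omitted here to keep the sketch short.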
[0006] The robot can determine a movement path and a movement plan using at least one of map data, object information detected by one of the sensors, or object information acquired from an external source, and can control a driving device so that the robot moves along the determined movement path and movement plan.
[0007] However, performing stable mapping under challenging lighting conditions, such as low light, strong sunlight, and artificial LED lighting, poses technical difficulties for tracking camera movement and calculating camera position and orientation. For example, the reflected light from objects placed in the space where the robot moves can vary significantly depending on various lighting conditions. Consequently, from the perspective of the robot's camera sensor, objects can appear considerably different depending on these conditions.
[0008] Various aspects of the present invention relate to a method and apparatus for more stably mapping an indoor environment by detecting image points of interest using a convolutional neural network (CNN)-based deep learning model.
[0009] According to at least one aspect, camera tracking and image matching algorithms utilize deep learning to improve mapping and localization stability under various lighting conditions, including strong sunlight, artificial LED lighting, and low-light environments. In the case of artificial LED lighting, visible light and infrared light can be used as separate light sources. For example, when training an interest point extractor, the infrared light source and the visible light LED light source can be turned on or off (e.g., individually).
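Because the two light sources can be switched individually, the set of capture conditions can be enumerated as on/off combinations. The sketch below assumes exactly two sources and uses hypothetical names (`SOURCES`, `lighting_configurations`); the disclosure does not specify how the combinations are scheduled.

```python
from itertools import product

# Hypothetical labels for the two sources named in the text.
SOURCES = ("infrared", "visible_led")


def lighting_configurations():
    """Return every on/off combination of the light sources."""
    return [
        dict(zip(SOURCES, states))
        for states in product((False, True), repeat=len(SOURCES))
    ]


configs = lighting_configurations()
for cfg in configs:
    print(cfg)
```

Each configuration could then be used to capture one image per object, giving a set of images of the same scene under different lighting.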
[0010] According to at least one embodiment, a computer-implemented method of training a neural network for indoor environment mapping comprises: training the neural network in a first step using a first dataset based on synthetic features; and training the neural network in a second step using a second dataset based on real photographs. Additionally, the method comprises: collecting a plurality of object images taken under different lighting conditions for each of a plurality of objects; and training the neural network in a third step using the plurality of object images.
[0011] According to at least one embodiment, an artificial intelligence (AI) device is configured to train a neural network for mapping an indoor environment. The AI device includes at least one transceiver and one or more processors. The one or more processors are configured to train a neural network in a first step using a first dataset based on synthetic shapes, train a neural network in a second step using a second dataset based on actual photographs, collect multiple object images taken under different lighting conditions for each of a plurality of objects, and train a neural network in a third step using the plurality of object images.
[0012] According to at least one embodiment, a non-transitory storage medium stores instructions that, when executed, cause at least one processor to perform operations. The operations include: training a neural network in a first step using a first dataset based on synthetic shapes; training the neural network in a second step using a second dataset based on real photographs; collecting a plurality of object images taken under different lighting conditions for each of a plurality of objects; and training the neural network in a third step using the plurality of object images.
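The three training steps can be read as a staged schedule in which the same network keeps its weights from one step to the next. The following is a toy sketch under strong assumptions: a logistic-regression model stands in for the CNN, and randomly generated placeholder data stands in for the three datasets. The function names and hyperparameters are hypothetical and are not the method of the disclosure.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_toy_dataset(n, shift):
    """Placeholder data: 2-D features with binary 'interest point' labels."""
    x = rng.normal(size=(n, 2)) + shift
    y = (x[:, 0] + x[:, 1] > 2 * shift).astype(float)
    return x, y


def train_stage(w, b, x, y, lr=0.1, epochs=200):
    """Continue gradient-descent training from the weights passed in."""
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))   # predicted probability
        grad = p - y                              # gradient of the log loss
        w = w - lr * x.T @ grad / len(y)
        b = b - lr * grad.mean()
    return w, b


# One placeholder dataset per training step named in the text.
stages = [
    ("step 1: synthetic dataset", make_toy_dataset(200, 0.0)),
    ("step 2: real photographs", make_toy_dataset(200, 0.5)),
    ("step 3: objects under varied lighting", make_toy_dataset(200, 1.0)),
]

w, b = np.zeros(2), 0.0
history = []
for name, (x, y) in stages:
    w, b = train_stage(w, b, x, y)               # weights carry over
    acc = float((((x @ w + b) > 0) == (y > 0.5)).mean())
    history.append((name, acc))
    print(f"{name}: accuracy on its own data = {acc:.2f}")
```

The point of the sketch is the ordering and the weight carry-over between steps, not the model or the data.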
[0013] To aid in a further understanding of this specification, the attached drawings illustrate embodiments of this specification and are used to explain various aspects of this specification together with the description.
[0014] FIG. 1 is a block diagram of an artificial intelligence (AI) device according to at least one embodiment of the present disclosure.
[0015] FIG. 2 shows a block diagram of an AI server according to at least one embodiment of the present disclosure.
[0016] FIG. 3 illustrates an AI system according to at least one embodiment of the present disclosure.
[0017] FIG. 4 shows a perspective view of a robot according to at least one embodiment.
[0018] FIG. 5 is a block diagram of a control module of a robot according to at least one embodiment.
[0019] FIG. 6 shows a generalized flowchart of a neural network training method according to at least one embodiment.
[0020] FIG. 7 illustrates joint illumination and homography training according to at least one embodiment.
[0021] FIG. 8 shows a flowchart of a neural network training method for indoor environment mapping according to at least one embodiment.
[0022] Specific embodiments of the present invention will be described in more detail below with reference to the drawings.
[0023] When describing that one element is "fixed" or "connected" to another element, this may mean that the two elements are directly fixed or connected, or that a third element exists between them and that third element fixes or connects them to each other. On the other hand, when describing that one element is "directly fixed" or "directly connected" to another element, it may be understood that no third element exists between the two elements.
[0024] Autonomous driving refers to technology in which a vehicle drives on its own, and an autonomous vehicle refers to a vehicle that drives without driver intervention or with minimal intervention.
[0025] For example, autonomous driving may include lane keeping technology while driving, technology that automatically adjusts speed such as adaptive cruise control, technology that automatically moves along a predetermined route, and technology that automatically sets a route and moves when a destination is set.
[0026] Vehicles may include vehicles with only internal combustion engines, hybrid vehicles equipped with both internal combustion engines and electric motors, and electric vehicles with only electric motors; they may also include trains, motorcycles, etc., in addition to automobiles.
[0027] Autonomous vehicles can be considered robots equipped with autonomous driving capabilities.
[0028] Artificial Intelligence (AI) refers to the field that studies artificial intelligence or the methodologies for implementing it, while machine learning refers to the field that defines various problems addressed within the AI sector and studies methodologies to solve them. Machine learning is also defined as an algorithm that improves its performance on a specific task through continued experience with that task.
[0029] An Artificial Neural Network (ANN) is a model used in machine learning that can refer to an entire model of problem-solving ability composed of artificial neurons (nodes) that form a network through synaptic connections. An artificial neural network can be defined by connection patterns between neurons of different layers, a learning process that updates model parameters, and an activation function that generates output values.
[0030] An artificial neural network (ANN) may include an input layer, an output layer, and optionally one or more hidden layers. Each layer includes one or more neurons, and the ANN may include synapses connecting the neurons to each other. In an ANN, each neuron may output the value of an activation function applied to the input signals received through its synapses, the corresponding weights, and a bias.
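A single neuron's output can be sketched as an activation function applied to the weighted sum of its inputs plus a bias. The sigmoid used here is one common choice and is an assumption; the text does not name a specific activation function.

```python
import math


def sigmoid(z):
    """Logistic activation, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(inputs, weights, bias, activation=sigmoid):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return activation(z)


# 1.0 * 0.5 + 2.0 * (-0.25) + 0.0 = 0.0, and sigmoid(0.0) = 0.5
out = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0)
print(out)
```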
[0031] Model parameters are parameters determined through learning, including the weight values of synaptic connections and the degree of bias of neurons. Hyperparameters are parameters set before training in machine learning algorithms, including the learning rate, number of iterations, mini-batch size, and initialization function.
[0032] The objective of artificial neural network training may be to find model parameters that minimize the loss function. The loss function can be used as an indicator to determine the optimal model parameters during the artificial neural network training process.
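To make the idea of minimizing a loss function concrete, the sketch below fits a single model parameter `w` by gradient descent on a mean squared error loss. The data, the learning rate, and the iteration count are illustrative assumptions; the last two are examples of hyperparameters, while `w` is a model parameter.

```python
def mse_loss(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def mse_grad(w, xs, ys):
    """Derivative of the loss with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]      # generated by y = 2 * x

w = 0.0                   # model parameter, learned
learning_rate = 0.05      # hyperparameter, set before training
for _ in range(200):      # number of iterations, also a hyperparameter
    w -= learning_rate * mse_grad(w, xs, ys)

print(w, mse_loss(w, xs, ys))
```

Training a CNN follows the same principle with many more parameters and a task-specific loss.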
[0033] Machine learning can be classified into supervised learning, unsupervised learning, and reinforcement learning depending on the learning method.
[0034] Supervised learning refers to the method of training an artificial neural network (ANN) with labels provided for the training data; these labels represent the correct answer (or result) that the ANN must infer when the training data is input. Unsupervised learning refers to the method of training an ANN without labels provided for the training data. Reinforcement learning refers to a method in which an agent defined in a specific environment learns to select an action or sequence of actions that maximizes the cumulative reward in each state.
[0035] Machine learning implemented using a Deep Neural Network (DNN), that is, an artificial neural network (ANN) containing multiple hidden layers, is also called deep learning, and deep learning is a subfield of machine learning. Hereinafter, the term "machine learning" is used in a sense that includes deep learning.
[0036] FIG. 1 is a block diagram of an AI device (10) according to at least one embodiment of the present disclosure. As described below, the AI device (10) may be a robot (or may include a robot).
[0037] The AI device (10) may be stationary or mobile. For example, the AI device may include a TV, projector, mobile phone, smartphone, desktop computer, laptop, digital broadcast terminal, personal digital assistant (PDA), portable multimedia player (PMP), navigation device, tablet PC, wearable device, set-top box (STB), DMB receiver, radio, washing machine, refrigerator, digital signage, robot, vehicle, etc.
[0038] The AI device (10) may include a communication interface (11), an input interface (12), a learning processor (13), a sensor (14), an output interface (15), a memory (17), and a processor (18).
[0039] The communication interface (11) can transmit data to and receive data from external devices such as other AI devices (10a to 10e) and an AI server (20) using wired or wireless communication technology (e.g., see FIG. 3). For example, the communication interface (11) can exchange sensor information, user input, learning models, and control signals with external devices.
[0040] Communication technologies used in the communication interface (11) include GSM (Global System for Mobile Communications), CDMA (Code Division Multiple Access), LTE (Long Term Evolution), 5G, WLAN (Wireless LAN), Wi-Fi, Bluetooth, RFID (Radio Frequency Identification), IrDA (Infrared Data Association), ZigBee, NFC (Near Field Communication), etc.
[0041] The input interface (12) can acquire various types of data.
[0042] For example, the input interface (12) may include a camera for inputting a video signal, a microphone for receiving an audio signal, and a user input interface for receiving information from a user. The camera or microphone may be considered a sensor, and the signal obtained from the camera or microphone may be referred to as detection data or sensor information.
[0043] The input interface (12) can obtain learning data for model training and input data to be used when obtaining output using the learning model. The input interface (12) can also obtain raw input data. In this case, the processor (18) or the learning processor (13) can preprocess the input data to extract input features.
[0044] The learning processor (13) can train a model composed of an artificial neural network (ANN) using the learning data. The trained artificial neural network can be called a learning model. The learning model can be used to infer a result value for new input data rather than the learning data, and the inferred value can be used as a basis for deciding whether to perform a specific operation.
[0045] The learning processor (13) can perform AI processing together with the learning processor (24) of the AI server (20) (e.g., see FIG. 2).
[0046] The learning processor (13) may include memory integrated into or implemented in the AI device (10). Alternatively, the learning processor (13) may be implemented using the memory (17), external memory directly connected to the AI device (10), or memory maintained in an external device.
[0047] The sensor (14) can acquire at least one of internal information about the AI device (10), surrounding environment information about the AI device (10), or user information using various sensors.
[0048] Examples of sensors included in the sensor (14) include a proximity sensor, an ambient light sensor, an accelerometer, a magnetic sensor, a gyroscope, an inertial sensor, a red-green-blue (RGB) sensor, an infrared (IR) sensor, a fingerprint recognition sensor, an ultrasonic sensor, an optical sensor, a microphone, a lidar, and a radar.
[0049] The output interface (15) can generate output related to visual, auditory, or tactile senses.
[0050] The output interface (15) may include a display device that outputs visual information, a speaker that outputs auditory information, and a tactile module that outputs tactile information.
[0051] The memory (17) can store data that supports various functions of the AI device (10). For example, the memory (17) can store input data, training data, training models, training history, etc. obtained from the input interface (12).
[0052] The processor (18) can determine one or more executable actions of the AI device (10) based on information determined or generated using a data analysis algorithm or a machine learning algorithm. The processor (18) can control components of the AI device (10) to execute the determined actions.
[0053] The processor (18) can request, retrieve, receive, or utilize data from the learning processor (13) or memory (17). The processor (18) can control the components of the AI device (10) to execute a predicted task or at least one task that is determined to be desirable.
[0054] If a connection to an external device is required to perform a specified operation, the processor (18) can generate a control signal to control the external device and transmit the generated control signal to the external device.
[0055] The processor (18) can obtain intention information regarding user input and determine the user's requirements based on the obtained intention information.
[0056] The processor (18) can collect historical information, including the operation of the AI device (10) or the user's feedback on that operation, and can store the collected historical information in the memory (17) or the learning processor (13) or transmit it to an external device such as an AI server (20). The collected historical information can be used to update the learning model.
[0057] The processor (18) can control at least some of the components of the AI device (10) to run an application stored in memory (17). Additionally, the processor (18) can operate two or more components included in the AI device (10) in combination to run the application.
[0058] FIG. 2 illustrates a block diagram of an AI server (20) according to at least one embodiment of the present disclosure. As shown in FIG. 2, the AI server (20) is connected to an AI device (10).
[0059] The AI server (20) may refer to a device that uses a machine learning algorithm to train an artificial neural network (ANN) or uses a trained artificial neural network. The AI server (20) may include multiple servers to perform distributed processing and may be defined as a 5G network. The AI server (20) may be included as a partial component of the AI device (10) and may perform at least a portion of the AI processing together.
[0060] The AI server (20) may include a communication interface (21), memory (23), a learning processor (24), a processor (26), etc.
[0061] The communication interface (21) can transmit and receive data with an external device such as an AI device (10).
[0062] The memory (23) may include a model storage device (23a). The model storage device (23a) may store a model (or ANN (26b)) that has been trained through the learning processor (24).
[0063] The learning processor (24) can train the ANN (26b) using the learning data. The learning model may be used while mounted on the AI server (20) or while mounted on an external device such as the AI device (10).
[0064] The learning model may be implemented in hardware, software, or a combination of hardware and software. If all or part of the learning model is implemented in software, one or more instructions constituting the learning model may be stored in memory (23).
[0065] The processor (26) can use a learning model to infer a result value for new input data and can generate a response or control command based on the inferred result value.
[0066] FIG. 3 illustrates an AI system (1) according to at least one embodiment of the present disclosure.
[0067] In the AI system (1), at least one of an AI server (20), a robot (10a), an autonomous vehicle (10b), an XR device (10c), a smartphone (10d), or a home appliance (10e) is connected to a cloud network (2). The robot (10a), the autonomous vehicle (10b), the XR device (10c), the smartphone (10d), or the home appliance (10e) to which AI technology is applied may be collectively referred to as AI devices (10a to 10e).
[0068] The cloud network (2) may mean a network that constitutes part of a cloud computing infrastructure or exists within a cloud computing infrastructure. The cloud network (2) may be configured using a 3G network, a 4G or LTE network, or a 5G network.
[0069] That is, the devices (10a to 10e) and the server (20) constituting the AI system (1) can be connected to each other through a cloud network (2). In particular, the devices (10a to 10e) and the server (20) may communicate with each other through a base station, but they may also communicate directly without using a base station.
[0070] The AI server (20) may include a server that performs AI processing and a server that performs operations on big data.
[0071] The AI server (20) can be connected to at least one of the AI devices constituting the AI system (1), namely a robot (10a), an autonomous vehicle (10b), an XR device (10c), a smartphone (10d), or a home appliance (10e), via a cloud network (2), and can support at least a portion of the AI processing of the connected AI devices (10a to 10e).
[0072] For example, the AI server (20) can train the ANN according to a machine learning algorithm on behalf of the AI devices (10a to 10e), and can store the trained model itself or transmit the trained model to the AI devices (10a to 10e).
[0073] The AI server (20) receives input data from AI devices (10a to 10e), infers a result value for the received input data using a learning model, generates a response or control command based on the inferred result value, and can transmit the response or control command to the AI devices (10a to 10e).
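The receive, infer, and respond flow described above can be sketched as follows. Everything here is hypothetical: the class `AIServerSketch`, the `ControlCommand` type, the placeholder model, and the threshold rule are stand-ins chosen for illustration, not an interface defined by the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ControlCommand:
    """Hypothetical response sent back to an AI device."""
    action: str
    score: float


class AIServerSketch:
    """Receives input data, infers a result value, returns a command."""

    def __init__(self, model: Callable[[Sequence[float]], float],
                 threshold: float = 0.5):
        self.model = model
        self.threshold = threshold

    def handle(self, sensor_data: Sequence[float]) -> ControlCommand:
        score = self.model(sensor_data)                      # inference
        action = "stop" if score >= self.threshold else "continue"
        return ControlCommand(action, score)                 # response


def toy_model(sensor_data: Sequence[float]) -> float:
    """Placeholder for a trained learning model."""
    return max(sensor_data)


server = AIServerSketch(toy_model)
command = server.handle([0.1, 0.8, 0.3])
print(command)
```

The same `handle` logic could run on the device itself when the learning model is used locally rather than on the server.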
[0074] Alternatively, the AI device (10a to 10e) may use a learning model directly to infer a result value for the input data and generate a response or control command based on the inference result.
[0075] Hereinafter, various embodiments of AI devices (10a to 10e) to which the above technology is applied will be described in more detail. The AI devices (10a to 10e) of FIG. 3 can be seen as specific embodiments of the AI device (10) of FIG. 1.
[0076] A robot (10a) with AI technology applied can be implemented as a guide robot, transport robot, cleaning robot, wearable robot, entertainment robot, pet robot, unmanned flying robot, etc.
[0077] The robot (10a) may include a robot control module for controlling operation, and the robot control module may mean a software module or a chip that implements the software module in hardware.
[0078] The robot (10a) can obtain state information about the robot (10a) using sensor information obtained from various types of sensors, detect (recognize) the surrounding environment and objects, generate map data, determine a path and movement plan, determine a response to user interaction, or determine a method of operation.
[0079] The robot (10a) can determine a movement path and a movement plan using sensor information obtained from at least one sensor among a lidar, radar and a camera.
[0080] The robot (10a) can perform the aforementioned actions using a learning model composed of at least one artificial neural network (ANN). For example, the robot (10a) can recognize the surrounding environment and objects using the learning model and determine actions using the recognized surrounding information or object information. The learning model may be trained directly by the robot (10a) itself or may be trained by an external device such as an AI server (20).
[0081] The robot (10a) can perform the task by directly using a learning model to generate results, but sensor information can also be transmitted to an external device such as an AI server (20) to receive the generated results and perform the task.
[0082] The robot (10a) can determine a movement path and movement plan using at least one of map data, object information detected from sensor information or object information obtained from an external device, and can control a driving device so that the robot (10a) moves along the determined movement path and movement plan.
[0083] Additionally, the robot (10a) can perform operation or movement by controlling the drive unit according to the user's control / interaction. The robot (10a) can acquire intention information of interaction resulting from the user's operation or voice utterance, and can perform operation by determining a response based on the acquired intention information.
[0084] A robot (10a) equipped with artificial intelligence technology and autonomous driving technology can be implemented as a guide robot, transport robot, cleaning robot, wearable robot, entertainment robot, pet robot, unmanned flying robot, etc.
[0085] A robot (10a) equipped with AI technology and autonomous driving technology may mean the robot itself equipped with autonomous driving capabilities, or it may mean a robot (10a) that interacts with an autonomous vehicle (10b).
[0086] A robot (10a) equipped with autonomous driving capabilities can be collectively referred to as a device that moves along a given path without user control, or moves by itself by determining a path.
[0087] The robot (10a) may be a guide robot that provides various information to users in airports, subways, bus terminals, etc., a serving robot that provides various items to guests in restaurants, hotels, etc., a delivery robot that transports items such as food, medicine, and delivery items (hereinafter referred to as “items”), or an industrial robot that transports a cart loaded with parts to a destination in a factory.
[0088] According to various embodiments, a robot includes devices used for specific purposes (cleaning, security, monitoring, guidance, etc.) or devices that move to provide functions according to the characteristics of the space in which the robot moves. Therefore, devices equipped with means of transport capable of moving using predetermined information and sensors, and providing predetermined functions, are generally referred to as robots.
[0089] The robot can move using an internally stored map. This map contains information about stationary objects that do not move within the space, such as fixed walls and stairs. Additionally, information about movable obstacles that are periodically placed (that is, dynamic objects) can also be stored in the map.
[0090] For example, information about obstacles placed within a specific range based on the robot's forward direction can also be stored in the map. In this case, unlike the map where fixed objects described earlier are stored, the information about obstacles is temporarily registered and then deleted after the robot moves.
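One way to picture the two kinds of map content is a persistent layer for fixed objects and a temporary layer that is cleared after the robot moves. This is a minimal sketch on grid cells; the class name `RobotMapSketch`, its methods, and the cell coordinates are hypothetical.

```python
class RobotMapSketch:
    """Map with persistent fixed cells and a temporary obstacle layer."""

    def __init__(self, fixed_cells):
        self.fixed = set(fixed_cells)   # walls, stairs: kept permanently
        self.temporary = set()          # obstacles seen ahead of the robot

    def register_temporary(self, cells):
        """Temporarily register obstacles detected in front of the robot."""
        self.temporary.update(cells)

    def clear_temporary(self):
        """Delete the temporary entries once the robot has moved on."""
        self.temporary.clear()

    def is_blocked(self, cell):
        return cell in self.fixed or cell in self.temporary


robot_map = RobotMapSketch(fixed_cells=[(0, 0), (0, 1)])
robot_map.register_temporary([(2, 3)])
blocked_before = robot_map.is_blocked((2, 3))

robot_map.clear_temporary()             # the robot has moved
blocked_after = robot_map.is_blocked((2, 3))
print(blocked_before, blocked_after)
```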
[0091] In addition, the robot can detect external dynamic objects using various sensors. After detecting external dynamic objects, when the robot moves to a destination in a crowded environment with many pedestrians, it can identify whether the waypoints to the destination are occupied by obstacles.
[0092] In addition, the robot can determine whether it has reached a waypoint based on the angle of change of direction. The robot then moves to the next waypoint, and as a result, can successfully reach the destination.
[0093] FIG. 4 illustrates a perspective view of a robot (100) according to at least one embodiment. FIG. 4 shows an exemplary appearance, and it should be understood that the robot may be implemented with various appearances in addition to the appearance shown in FIG. 4. Specifically, each component may be positioned at different locations in the up, down, left, and right directions depending on the shape of the robot.
[0094] The main body (120) can be configured to be long in the vertical direction and can have the shape of a roly-poly toy that gradually tapers from bottom to top.
[0095] The main body (120) may include a case (30) that forms the outer shape of the robot (100). The case (30) may include an upper cover (31) positioned at the top, a first intermediate cover (32) positioned below the upper cover (31), a second intermediate cover (33) positioned below the first intermediate cover (32), and a lower cover (34) positioned below the second intermediate cover (33). The first intermediate cover (32) and the second intermediate cover (33) may form a single intermediate cover.
[0096] The upper cover (31) can be placed at the top of the robot (100) and may be hemispherical or dome-shaped. The upper cover (31) may be placed at a height lower than the average height of an adult so that the user can easily understand the instructions. Additionally, the upper cover (31) may be configured to rotate at a predetermined angle.
[0097] The robot (100) may further include a control module (150) (e.g., see FIG. 5). The control module (150) controls the robot (100) like a computer or processor. Thus, the control module (150) is placed within the robot (100) to perform functions similar to a main processor and can interact with a user.
[0098] The control module (150) is placed inside the robot (100) and controls the robot during movement by detecting objects around the robot. The robot's control module (150) can be implemented as a software module, a chip in which the software module is implemented as hardware, etc.
[0099] A display device (31a) that receives commands from a user or outputs information and a sensor (e.g., camera 31b and microphone 31c) may be placed on one side of the front of the upper cover (31).
[0100] In addition to the display device (31a) of the upper cover (31), a display device (22) is placed on one side of the first intermediate cover (32).
[0101] Depending on the function of the robot, information may be output on both display devices (31a, 22) or on only one of the two display devices (31a, 22).
[0102] Additionally, various obstacle sensors (e.g., the sensor (220) of FIG. 5) are positioned on one side or the entire bottom of the robot (100) (see 35a, 35b). For example, obstacle sensors include Time-of-Flight (TOF) sensors, ultrasonic sensors, infrared sensors, depth sensors, laser sensors, LiDAR sensors, etc. These sensors detect obstacles outside the robot (100) in various ways.
[0103] Additionally, the robot (100) further includes a moving device, such as wheels, located at the bottom of the robot, which moves the robot.
[0104] The shape of the robot shown in FIG. 4 is exemplary. Embodiments of the present invention are not limited to the illustrated examples. Additionally, various cameras and sensors may be placed on various parts of the robot (100). For example, the robot (100) may be a guide robot that provides information to a user and guides the user by moving to a specific point.
[0105] The robot (100) may also be a robot that provides cleaning services, security services, or other functions. The robot (100) can perform various functions.
[0106] With multiple robots (100) deployed in a service space, the robots can perform specific functions (guidance service, cleaning service, security service, etc.). In this process, the robot (100) can store its location information, check its current location in the entire space, and generate a path necessary to move to a destination.
[0107] FIG. 5 is a block diagram of a control module (150) of a robot (100) according to at least one embodiment.
[0108] The robot (100) can perform both the function of generating a map and the function of estimating the robot's position using the map.
[0109] Alternatively, the robot (100) may only provide map generation functions.
[0110] Alternatively, the robot (100) may only provide a function to estimate the robot's position using a map. According to various embodiments, in addition to the function to estimate the robot's position using a map, the robot (100) may provide a function to create or modify a map.
[0111] The LiDAR sensor (220) can detect surrounding objects in two or three dimensions. A two-dimensional LiDAR sensor can detect the position of an object within a 360-degree range relative to the robot (100). LiDAR information detected at a specific location can form a single LiDAR frame. That is, the LiDAR sensor (220) detects the distance between the robot and an object placed outside the robot (100) to generate a LiDAR frame.
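By way of illustration only, a two-dimensional LiDAR frame can be pictured as a list of distances sampled over a 360-degree sweep around the robot. The following Python sketch converts such a frame into Cartesian points in the robot's coordinate frame. The function name `lidar_frame_to_points`, the uniform angular spacing, and the convention that non-finite or non-positive distances mean "no return" are assumptions made for this example and are not taken from the disclosure.

```python
import math


def lidar_frame_to_points(ranges, angle_min=0.0):
    """Convert one 2D LiDAR frame (distances over a 360-degree sweep) to (x, y) points.

    Assumption: the beams are evenly spaced over the full circle, starting at angle_min.
    """
    n = len(ranges)
    if n == 0:
        return []
    step = 2.0 * math.pi / n  # angular spacing between neighbouring beams
    points = []
    for k, r in enumerate(ranges):
        # Skip beams with no usable return (assumed convention).
        if not math.isfinite(r) or r <= 0.0:
            continue
        theta = angle_min + k * step
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points
```

For example, a frame of four beams with distances 1.0, 2.0, no return, and 1.0 would yield three points, at 0, 90, and 270 degrees around the robot.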
[0112] For example, the camera sensor (230) is a standard camera. Two or more camera sensors (230) may be used to overcome the field of view limitation. An image captured at a specific location constitutes visual information. That is, the camera sensor (230) captures an object outside the robot (100) to generate a visual frame containing visual information.
[0113] According to various embodiments, the robot (100) performs fusion simultaneous localization and mapping (Fusion-SLAM) using a LiDAR sensor (220) and a camera sensor (230).
[0114] In fused SLAM, LiDAR information and vision information can be combined and used. LiDAR information and vision information can be organized in the form of a map.
[0115] Unlike robots using a single sensor (LiDAR-only SLAM, visual information-only SLAM), robots using fusion SLAM can improve position estimation accuracy. In other words, performing fusion SLAM by combining LiDAR information and visual information can improve map quality.
[0116] Map quality is a standard that applies to both vision maps composed of visual information fragments and LiDAR maps composed of LiDAR information fragments. In fused SLAM, the quality of both the vision map and the LiDAR map is improved because the sensors can share information that each sensor has not fully acquired.
[0117] Additionally, LiDAR information or vision information can be extracted and used from a single map. For example, depending on the memory capacity of the robot (100) or the computational capability of the computational processor, LiDAR information or vision information, or both LiDAR information and vision information, can be used to determine the robot's location.
[0118] The interface unit (290) receives information input by the user. When the user provides input, for example by touch or voice, the interface unit (290) outputs the result. Additionally, the interface unit (290) may output a map stored in the robot (100) or output a path along which the robot (100) moves on the map.
[0119] Additionally, the interface unit (290) can provide the user with predetermined information.
[0120] The controller (250) generates a map and estimates the position of the robot (100) during the process of the robot moving based on the map.
[0121] The communication device (280) enables the robot (100) to communicate with other robots or an external server and to receive and transmit information.
[0122] The robot (100) can generate each map using each sensor (LiDAR sensor and camera sensor), or generate one map using each sensor and then generate another map by extracting only the details corresponding to a specific sensor from that single map.
[0123] Additionally, the map may include driving distance information based on wheel rotation. The driving distance information is information about the distance traveled by the robot (100) and is calculated using the rotation frequency of the robot's wheels or the difference in rotation frequencies between the two wheels of the robot. The robot (100) can calculate the robot's travel distance based not only on driving distance information but also on information generated using sensors.
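By way of illustration only, the following Python sketch shows one common way to derive a travel distance and a heading change from the rotations of two wheels. It assumes a differential-drive robot; the function name `update_odometry` and the parameters `wheel_radius` and `wheel_base` are hypothetical and are not specified in the disclosure.

```python
import math


def update_odometry(pose, left_rev, right_rev, wheel_radius, wheel_base):
    """Update an (x, y, heading) pose from wheel revolutions (differential-drive assumption).

    Returns the new pose and the distance travelled by the robot centre.
    """
    x, y, theta = pose
    # Distance rolled by each wheel: circumference times number of revolutions.
    d_left = 2.0 * math.pi * wheel_radius * left_rev
    d_right = 2.0 * math.pi * wheel_radius * right_rev
    # The average gives the travel distance; the difference gives the heading change.
    d_center = 0.5 * (d_left + d_right)
    d_theta = (d_right - d_left) / wheel_base
    # Integrate along the mid-step heading.
    mid = theta + 0.5 * d_theta
    new_pose = (x + d_center * math.cos(mid), y + d_center * math.sin(mid), theta + d_theta)
    return new_pose, d_center
```

When both wheels turn by the same amount, the heading is unchanged and the travel distance equals the distance rolled by either wheel; when they turn by opposite amounts, the robot rotates in place and the travel distance is zero.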
[0124] The controller (250) may additionally include an artificial intelligence unit (255) for artificial intelligence tasks and processing.
[0125] Multiple LiDAR sensors (220) and camera sensors (230) can be placed outside the robot (100) to identify external objects.
[0126] In addition to the LiDAR sensor (220) and camera sensor (230), various other types of sensors (infrared sensor, ultrasonic sensor, depth sensor, image sensor, microphone, etc.) are placed outside the robot (100). The controller (250) collects and processes information detected by the sensors.
[0127] The artificial intelligence unit (255) receives as input information processed by the LiDAR sensor (220), camera sensor (230) and other sensors, or information accumulated and stored while the robot (100) is moving, and outputs results necessary for the controller (250) to determine external conditions, process information, and generate a movement path.
[0128] For example, the robot (100) can store location information of various objects placed in the space where the robot moves in the form of a map. These objects may include fixed objects such as walls and doors, and movable objects such as flowerpots and desks. The artificial intelligence unit (255) can output data regarding the movement path of the robot (100), the range of work performed by the robot, etc., using the map information and information provided by the LiDAR sensor (220), camera sensor (230), and other sensors.
[0129] Additionally, the artificial intelligence unit (255) can recognize objects placed around the robot (100) using information provided by the LiDAR sensor (220), camera sensor (230), and other sensors. The artificial intelligence unit (255) can receive an image and output metadata about the image. This metadata includes the name of the object in the image, the distance between the object and the robot, the type of the object, and whether the object is placed on a map.
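By way of illustration only, the metadata described above could be held in a small record such as the following Python sketch. The class name `ObjectMetadata` and its field names are hypothetical; the disclosure lists only the kinds of information involved (object name, distance to the robot, object type, and whether the object is placed on a map).

```python
from dataclasses import dataclass


@dataclass
class ObjectMetadata:
    """Hypothetical container for the metadata output for one object in an image."""

    name: str          # name of the object in the image
    distance_m: float  # distance between the object and the robot (metres assumed)
    object_type: str   # type of the object, e.g. "fixed" or "movable"
    on_map: bool       # whether the object is placed on the map


# Example: a door detected 2.5 m away that is already recorded on the map.
door = ObjectMetadata(name="door", distance_m=2.5, object_type="fixed", on_map=True)
```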
[0130] Information provided by the LiDAR sensor (220), camera sensor (230) and other sensors is input to the input node of the deep learning network of the artificial intelligence unit (255), and the result is output from the output node of the artificial intelligence unit (255) after undergoing information processing in the hidden layer of the deep learning network of the artificial intelligence unit (255).
[0131] The controller (250) can calculate the robot's movement path using data calculated by the artificial intelligence unit (255) or data processed by various sensors.
[0132] FIG. 6 illustrates a generalized flowchart of a neural network training method according to at least one embodiment. The neural network may include a model for detecting points of interest in an image. Points of interest are two-dimensional locations in an image that are stable and repeatable under various lighting conditions and viewpoints. The neural network may be trained using convolutional deep learning. With reference to FIG. 6, this method is described in three steps.
[0133] In step 1 602 (e.g., the interest point pre-training step), the neural network is trained to generate pseudo-correct interest point labels for unlabeled images. This training can be conducted using self-supervised learning. For this training, a dataset (612) containing or based on synthetic features is used as input. Examples of such synthetic features include triangles, rectangles, crosses, or check marks. Using a dataset (612) based on these simple shapes instead of full-size images speeds up training in step 1 602. Through this training, a basic detection model is created, and this model is further trained in step 2 604.
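By way of illustration only, the following Python sketch generates one synthetic training sample of the kind described above: a small image containing a filled rectangle, together with the rectangle's corner coordinates, which can serve as interest point labels without manual annotation. The function name `make_synthetic_rectangle`, the image size, and the margins are assumptions for this example; the disclosure does not specify how the synthetic dataset (612) is produced.

```python
import numpy as np


def make_synthetic_rectangle(size=64, rng=None):
    """Return (image, corners) for a filled axis-aligned rectangle on a black background.

    corners is a (4, 2) array of (x, y) pixel coordinates usable as interest point labels.
    """
    if size < 16:
        raise ValueError("size must be at least 16 for the margins used here")
    rng = np.random.default_rng() if rng is None else rng
    image = np.zeros((size, size), dtype=np.float32)
    # Top-left corner in the upper-left quadrant, bottom-right corner at least 4 px away.
    x0, y0 = rng.integers(4, size // 2, size=2)
    x1 = rng.integers(x0 + 4, size - 4)
    y1 = rng.integers(y0 + 4, size - 4)
    image[y0:y1 + 1, x0:x1 + 1] = 1.0
    corners = np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], dtype=np.int64)
    return image, corners
```

Other shapes mentioned above, such as triangles or crosses, could be rendered in a similar way, with their vertices or junctions used as labels.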
[0134] In step 2 604 (e.g., the interest point self-labeling step), the accuracy of interest point detection is improved by performing training using full-size images. Training is performed using real photos or a dataset (614) based on real photos. For example, the dataset (614) may include photos of building exteriors, room furniture, windows, toys, streetscapes, plants, animals, and pedestrians. The dataset (614) is larger than the dataset (612) used in step 1 602.
[0135] Here, different images in the dataset (614) may have been taken at different points in time and / or from different distances. To account for this, a detection model (e.g., the basic detection model output in step 1 602) is trained using a technique called homography. Homography is a technique that permutes the same image through random cropping, translation, scaling, image rotation, and / or perspective distortion. A superpoint detection model is generated through training based on these permuted images, and this model is further trained in step 3 606.
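By way of illustration only, the following Python sketch samples a random 3x3 homography matrix by composing translation, scaling, rotation, and a mild perspective distortion about the image centre. The function name `sample_random_homography` and all parameter ranges are assumptions chosen for this example; random cropping is omitted for brevity, and the disclosure does not specify how the random transformations are sampled.

```python
import numpy as np


def sample_random_homography(width, height, rng=None, max_shift=0.1,
                             scale_range=(0.8, 1.2), max_angle_deg=30.0,
                             max_perspective=1e-4):
    """Sample a random homography as a 3x3 matrix normalised so that H[2, 2] == 1."""
    rng = np.random.default_rng() if rng is None else rng
    cx, cy = width / 2.0, height / 2.0
    # Move the image centre to the origin so rotation and scaling act about the centre.
    to_origin = np.array([[1.0, 0.0, -cx], [0.0, 1.0, -cy], [0.0, 0.0, 1.0]])
    from_origin = np.array([[1.0, 0.0, cx], [0.0, 1.0, cy], [0.0, 0.0, 1.0]])
    s = rng.uniform(scale_range[0], scale_range[1])
    a = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    rot_scale = np.array([[s * np.cos(a), -s * np.sin(a), 0.0],
                          [s * np.sin(a), s * np.cos(a), 0.0],
                          [0.0, 0.0, 1.0]])
    # Small entries in the bottom row produce a perspective (tilt) effect.
    px, py = rng.uniform(-max_perspective, max_perspective, size=2)
    perspective = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [px, py, 1.0]])
    # Translation expressed as a fraction of the image size.
    tx, ty = rng.uniform(-max_shift, max_shift, size=2) * np.array([width, height])
    translate = np.array([[1.0, 0.0, tx], [0.0, 1.0, ty], [0.0, 0.0, 1.0]])
    H = translate @ from_origin @ perspective @ rot_scale @ to_origin
    return H / H[2, 2]
```

Because each factor is invertible, the sampled matrix is invertible as well, which is what later allows detections made on a distorted copy to be mapped back to the original image.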
[0136] In step 3 606 (e.g., learning of co-illumination conditions), the accuracy of point of interest detection is further improved. Here, deep learning of co-illumination conditions is used, and the improvement in accuracy can be significant. The learning in step 3 606 is performed using images taken in specific environments such as restaurants or hotels, or using a dataset (616) based thereon.
[0137] In experiments conducted in commercial environments and buildings (e.g., restaurants, hotels, airports), it was found that various lighting conditions have a significant impact on the robustness and accuracy of detection. This is because the light reflected from the surface of a specific physical object can vary significantly depending on the lighting conditions. Therefore, from the perspective of a camera sensor (e.g., the camera sensor (230) in FIG. 5), the object may appear significantly different depending on these conditions. For example, under ambient light conditions, feature points of the object can be perfectly extracted. However, even for the same object, the accuracy of point detection may decrease under lighting conditions different from ambient light conditions (e.g., artificial LED lighting, low light conditions, etc.).
[0138] Various aspects of the present invention aim to improve the accuracy of point of interest detection under various lighting conditions. For example, in at least one aspect, learning is performed by considering multiple lighting conditions (or complex lighting conditions). In another aspect, learning is performed in conjunction with homography deep learning. With respect to artificial LED lighting conditions, visible light and infrared light can be considered as separate light sources. For example, when training a point of interest extractor, an infrared light source and a visible light LED light source can be turned on or off (e.g., individually).
[0139] In typical indoor environments, lighting conditions can be controlled by turning artificial lights on and off or adjusting them, or by adjusting window blinds to control the amount of natural light. The training dataset (e.g., dataset 616) is generated by collecting photographic images taken in various locations (e.g., multiple rooms or zones of a commercial facility) under various lighting conditions. This dataset is more comprehensive than the datasets (612, 614) used in steps 1 and 2 (602, 604). This improves the robustness of the generated point of interest detection model.
[0140] Now, with reference to FIG. 7, the features of joint illumination and homography training according to at least one embodiment will be described in more detail.
[0141] For ease of understanding, features and / or processes are described below with reference to multiple images of the same object (e.g., a refrigerator). However, it should be understood that the disclosed features and / or processes can be similarly applied to various objects that a robot, such as the robot (100) in FIG. 4, may encounter while moving in a commercial environment.
[0142] Referring to FIG. 7, the training dataset includes multiple images of the same object. The images of the object were taken under different lighting conditions. For example, the dataset includes an image of the object taken under bright natural light conditions (e.g., strong sunlight) (702-1), an image of the object taken under artificial light conditions (e.g., LED lighting) (702-2), and an image of the object taken under weak natural light conditions (e.g., darkness) (702-3). In FIG. 7, the uppercase letter 'A' is used to represent the object.
[0143] As previously mentioned regarding artificial LED lighting conditions, visible light and infrared light can each be used as artificial LED light sources. For instance, when training a point of interest extractor, infrared and visible-light LED light sources can be turned on or off (e.g., individually). Compared to visible light, infrared light has a longer wavelength and is invisible to the human eye. Infrared light can be particularly useful when tracking objects in dark environments or on reflective glass or metal surfaces.
[0144] Various lighting conditions may correspond to different illuminance levels in units of lux. For example, the bright natural light lighting conditions in image (702-1) may correspond to illuminance in the range of 10,000 to 25,000 lux. The artificial light lighting conditions in image (702-2) may correspond to illuminance in the range of 50 to 1,000 lux. The weak natural light lighting conditions in image (702-3) may correspond to illuminance in the range of 3.4 to 40 lux. As previously mentioned, infrared light can be used as a light source for artificial LED lighting. Since lux is a unit specific to visible light and reflects the brightness perceived by the human eye, infrared light is not measured in lux. Instead, infrared light is quantified based on irradiance, which is generally expressed in milliwatts per square centimeter (mW / cm²). According to various embodiments, by adjusting the distance between an object and an infrared LED light source, the irradiance level can be varied, for example, in the range of 0.1 to 10 mW / cm².
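By way of illustration only, the illuminance ranges given above can be used to assign a measured lux value to a lighting group, as in the following Python sketch. The group names, the table `LIGHTING_GROUPS_LUX`, and the function name `lighting_group` are hypothetical; only the numeric ranges are taken from the description. Values that fall between the stated ranges are returned as `None` rather than being assigned to the nearest group.

```python
# Illuminance ranges in lux, as stated in the description (inclusive bounds assumed).
LIGHTING_GROUPS_LUX = {
    "bright_natural": (10_000.0, 25_000.0),
    "artificial": (50.0, 1_000.0),
    "weak_natural": (3.4, 40.0),
}


def lighting_group(lux):
    """Return the name of the lighting group containing the given illuminance, or None."""
    for name, (low, high) in LIGHTING_GROUPS_LUX.items():
        if low <= lux <= high:
            return name
    return None
```

Infrared irradiance, which is expressed in mW / cm² rather than lux, would need a separate table and is not covered by this sketch.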
[0145] For the object in question, a set of pseudo-correct interest point locations (e.g., superset (752)) is generated as a result of the learning.
[0146] The superset (752) of interest point locations is based on the interest point locations generated for image (702-1), the interest point locations generated for image (702-2), and the interest point locations generated for image (702-3).
[0147] To simplify the explanation, the following description explains how to generate point of interest locations for image (702-1) and how these point of interest locations contribute to the superset (752). However, it should be understood that point of interest locations for images (702-2, 702-3) can also be generated in a similar way, and that these point of interest locations also contribute to the superset (752) in a similar way.
[0148] As will be explained in more detail below, random homographic transformations are individually applied to the same input image to generate transformed copies of the input image. Each transformed image is input into a trained neural network model, from which a corresponding set of points of interest is extracted. Finally, the extracted points of interest are mapped back to the corresponding untransformed image and combined to construct the complete set, which is the final output.
[0149] Referring again to FIG. 7, random homographic transformations are applied to image (702-1) to generate N modified copies of the input image. For example, the value of N can be 100. In this case, 100 random homographic transformations are applied to image (702-1) to generate 100 modified copies of the input image.
[0150] Each distorted copy can be considered as a different permutation of the image (702-1). For example, the distorted copies may include a permutation of the image (702-1) by cropping, a permutation of the image (702-1) by translation, a permutation of the image (702-1) by scaling, a permutation of the image (702-1) by rotation, and / or a permutation of the image (702-1) that is a combination of two or more of these distortions.
[0151] FIG. 7 shows examples (712-1), (712-2), and (712-N), which are distorted copies of image (702-1). Example (712-1) is image (702-1) rotated by a first angle (clockwise rotation) and simultaneously tilted. Example (712-2) is image (702-1) rotated by a second angle (counterclockwise rotation) and simultaneously tilted. Example (712-N) is image (702-1) reduced in size (size reduction), rotated by a third angle, and simultaneously tilted.
[0152] Referring again to FIG. 7, each distorted copy is input into a base detector (722). According to at least one embodiment, the base detector is the superpoint detection model described earlier with reference to FIG. 6. As described earlier with reference to FIG. 6, the superpoint detection model is generated through two-stage learning.
[0153] However, it should be understood that the base detector (722) may be a different model. For example, the base detector (722) may be the basic detection model described earlier with reference to FIG. 6. As described earlier with reference to FIG. 6, the basic detection model is generated through the first stage of learning.
[0154] Returning to FIG. 7, each distorted copy is input into the base detector (722). For example, Example (712-1) is input into the base detector (722). In response, the base detector (722) outputs a set of point of interest locations (732-1) based on Example (712-1). Each point of interest location in the set (732-1) is indicated by a shaded circle in FIG. 7.
[0155] Here, the manner in which image (702-1) was transformed to generate the example image (712-1) is known. Therefore, reverse processing to de-distort the example image (712-1) can be performed. The same reverse processing also de-distorts the set of interest point locations (732-1). The de-distorted set of interest point locations (732-1) is input into an aggregator (742), which aggregates multiple sets of interest point locations.
[0156] Here, the multiple sets of point of interest locations include a de-distorted set of point of interest locations (732-2, corresponding to example 712-2) and a de-distorted set of point of interest locations (732-N, corresponding to example 712-N).
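By way of illustration only, the reverse processing of point locations can be sketched in Python as follows: because the homography used to distort the image is known, its inverse maps points detected in the distorted copy back to the coordinates of the original image. The function names `warp_points` and `undistort_points` are hypothetical and are not part of the disclosure.

```python
import numpy as np


def warp_points(points, H):
    """Apply a 3x3 homography H to an array of (x, y) points."""
    pts = np.asarray(points, dtype=float).reshape(-1, 2)
    # Lift to homogeneous coordinates, transform, then divide by the last coordinate.
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]


def undistort_points(points, H):
    """Map points detected in a distorted copy back to the original image frame."""
    return warp_points(points, np.linalg.inv(H))
```

Applying `warp_points` with a given matrix and then `undistort_points` with the same matrix returns the original coordinates, up to floating-point error.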
[0157] The generation of point of interest locations for image (702-1) and the way these point of interest locations contribute to the superset (752) have been described above. As previously mentioned, it should be understood that point of interest locations for images generated under different lighting conditions (e.g., 702-2 and 702-3) can be generated in a similar manner, and that these point of interest locations also contribute to the superset (752) in a similar manner.
[0158] For example, if the value of N is 100, 100 random homographic transformations are each applied to image (702-2) to generate 100 modified copies of the input image. Similarly, 100 random homographic transformations are each applied to image (702-3) to generate 100 modified copies of the input image. Similar to the method described earlier with reference to examples (712-1, 712-2, and 712-N), each modified copy is input to the base detector (722).
[0159] Therefore, 300 sets of point of interest locations are combined to generate a superset (752) of point of interest locations that captures how the same object appears under different lighting conditions.
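By way of illustration only, the following Python sketch pools de-distorted sets of point locations into a superset that keeps one subset per lighting condition, and counts how many sets were combined. With three lighting conditions and N = 100 sets each, the count is 3 x 100 = 300. The function name `build_superset`, the dictionary layout, and the rounding of locations to whole pixels are assumptions for this example; the disclosure does not specify how the aggregator (742) merges the sets.

```python
def build_superset(sets_by_lighting):
    """Pool point sets per lighting condition.

    sets_by_lighting maps a lighting-condition name to a list of point sets, where each
    point set is an iterable of (x, y) locations already mapped back to the original image.
    Returns (superset, total_sets), where superset maps each lighting condition to a set
    of pixel-rounded locations.
    """
    superset = {}
    total_sets = 0
    for lighting, point_sets in sets_by_lighting.items():
        merged = set()
        for points in point_sets:
            total_sets += 1
            # Round to the pixel grid so repeated detections collapse to one location.
            merged.update((int(round(x)), int(round(y))) for x, y in points)
        superset[lighting] = merged
    return superset, total_sets
```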
[0160] The joint training shown in FIG. 7 is mathematically explained as follows.
[0161] $f_\theta(\cdot)$ represents the initial point of interest function to be adjusted. $I$ represents the input image. $x$ represents the resulting generated points of interest. $H$ represents a random homography. The relationship between $f_\theta(\cdot)$ and $x$ is expressed as follows:
[0162] $$x = f_\theta(I)$$
[0163] An ideal point of interest operator must be covariant with respect to homographies. A function $f_\theta(\cdot)$ is covariant with $H$ if its output transforms together with its input. In other words, a covariant detector satisfies the following condition for all $H$:
[0164] $$Hx = f_\theta(H(I))$$
[0165] Variable $i$ can be used as an index over image lighting conditions (or lighting groups), and variable $j$ can be used as an index over random homographic distortions, as follows:
[0166] $$x_i = H_{ij}^{-1} f_\theta(H_{ij}(I_i))$$
[0167] According to at least one aspect of the present disclosure, for each lighting condition, empirical aggregation is performed over a sufficiently large number of random homographic distortion samples. The resulting aggregation over the samples yields a new and improved point of interest detector $\hat{F}(\cdot)$:
[0168] $$\hat{F}(I; f_\theta) = \frac{1}{N_l N_h} \sum_{i=1}^{N_l} \sum_{j=1}^{N_h} H_{ij}^{-1} f_\theta(H_{ij}(I_i))$$
[0169] $N_l$ represents the number of lighting groups. $N_h$ represents the number of random homographic distortions. According to at least one embodiment, the maximum number of lighting groups is from 2 to 4. For example, referring to FIG. 7, three different lighting groups are described. However, it should be understood that the maximum number of lighting groups may be greater.
[0170] According to at least one embodiment, the maximum number of random homographic distortions is 100. However, it should be understood that the maximum number of random homographic distortions may be greater than this.
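By way of illustration only, the aggregation over lighting groups and random homographic distortions described above can be sketched in Python as an average of detector responses: each image is distorted, passed through the detector, the response is mapped back to the original frame, and all responses are averaged. The function names `warp_image` and `aggregate_detections`, the nearest-neighbour warping, and the `detector` callable (any function that maps an image to a same-sized response map) are assumptions for this example and do not represent the implementation of the disclosure.

```python
import numpy as np


def warp_image(image, H):
    """Nearest-neighbour warp of a 2D array: output(p) = image(H^-1 p), zero outside."""
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dst = np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    src = np.linalg.inv(H) @ dst
    sx = np.rint(src[0] / src[2]).astype(int)
    sy = np.rint(src[1] / src[2]).astype(int)
    valid = (sx >= 0) & (sx < w) & (sy >= 0) & (sy < h)
    out = np.zeros(h * w, dtype=float)
    out[valid] = image[sy[valid], sx[valid]]
    return out.reshape(h, w)


def aggregate_detections(images, homographies, detector):
    """Average de-distorted detector responses over lighting groups and homographies.

    images[i] is the image for lighting group i; homographies[i] is the list of 3x3
    matrices applied to that image.
    """
    total = np.zeros(images[0].shape, dtype=float)
    count = 0
    for image, group in zip(images, homographies):
        for H in group:
            response = detector(warp_image(image, H))
            # Undo the distortion so that all responses share the original frame.
            total += warp_image(response, np.linalg.inv(H))
            count += 1
    if count == 0:
        raise ValueError("at least one homography is required")
    return total / count
```

Locations that respond consistently across distortions and lighting groups keep a high average, while responses that appear under only some conditions are attenuated; a threshold on the averaged map could then be used to select point locations.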
[0171] FIG. 8 shows a flowchart of a neural network training method 800 for indoor environment mapping according to at least one embodiment.
[0172] In block 802, the neural network performs a first-stage training using a first dataset based on synthetic features.
[0173] For example, as previously described with reference to FIG. 6, in the first step 602, the neural network is trained to generate pseudo-correct point of interest labels for unlabeled images. A dataset (612) containing or based on synthetic features is used as input.
[0174] In block 804, the neural network is trained in the second stage using a second dataset based on real photos.
[0175] For example, as previously explained with reference to FIG. 6, in the second step 604, the accuracy of point of interest detection is improved by performing training using full-size images. This training is performed using actual photos or a dataset (614) based on actual photos.
[0176] In block 806, multiple images are collected for a specific object among multiple objects. At this time, the multiple images of the object are each generated under different lighting conditions. The different lighting conditions under which the multiple images are each generated may include lighting conditions other than ambient lighting conditions.
[0177] For example, lighting conditions other than ambient lighting conditions may include at least one of bright natural light conditions, weak natural light conditions, or artificial lighting conditions. As another example, lighting conditions other than ambient lighting conditions may include bright natural light conditions, weak natural light conditions, and artificial lighting conditions.
[0178] According to another embodiment, bright natural light lighting conditions correspond to an illuminance range of 10,000 to 25,000 lux, weak natural light lighting conditions correspond to an illuminance range of 3.4 to 40 lux, and artificial light lighting conditions correspond to an illuminance range of 50 to 1,000 lux.
[0179] According to another embodiment, for each of the plurality of objects, the plurality of images of the object include a first image of the object generated under bright natural light lighting conditions, a second image of the object generated under weak natural light lighting conditions, and a third image of the object generated under artificial lighting conditions.
[0180] For example, as previously described with reference to FIG. 7, the collected images include an image of an object (702-1) generated under bright natural light conditions (e.g., strong sunlight), an image of an object (702-2) generated under artificial light conditions (e.g., LED light), and an image of an object (702-3) generated under weak natural light conditions (e.g., dim light).
[0181] Multiple images of an object can be permuted in block 810. For example, multiple images of an object can be permuted based on arbitrary homography transformations.
[0182] According to another embodiment, a plurality of permuted images of an object include a first image of an object permuted by cropping, a second image of an object permuted by translation, a third image of an object permuted by scaling, and a fourth image of an object permuted by image rotation.
[0183] For example, as previously described with reference to FIG. 7, examples (712-1, 712-2, and 712-N) are illustrated as distorted copies of image (702-1). Example (712-1) is image (702-1) rotated by a first angle (clockwise rotation) and simultaneously tilted. Example (712-2) is image (702-1) rotated by a second angle (counterclockwise rotation) and simultaneously tilted. Example (712-N) is image (702-1) reduced in size (size reduction), rotated by a third angle, and simultaneously tilted.
[0184] In block 812, the neural network is trained in the third step using multiple images of an object. Training the neural network in the third step may include training the neural network using the first, second, and third images of the object. For example, training the neural network in the third step may include using a permutation of multiple images of the object.
[0185] For example, as previously described with reference to FIG. 7, each distorted copy is input into the base detector (722). For example, an example (712-1) is input into the base detector (722). In response to this, the base detector (722) outputs a set of point of interest locations (732-1) based on the example (712-1).
[0186] According to another embodiment, for each of the plurality of objects, the neural network training in the third step trains the neural network to output a superset of points of interest, and each subset of the superset corresponds to a respective one of the different lighting conditions.
[0187] For example, as previously described with reference to FIG. 7, point of interest locations are generated for image (702-1), and these point of interest locations are included in the superset (752). Additionally, point of interest locations for images generated under different lighting conditions (e.g., 702-2 and 702-3) are generated in a similar manner, and these point of interest locations are also included in the superset (752) in a similar manner.
[0188] Here, it should be understood that blocks 808, 810, and 812 can be performed for each of the multiple objects.
[0189] The aspects and features described herein with reference to various embodiments relate to the generation of indoor environment maps to support autonomous robot navigation. These aspects and features can improve the quality, reliability, and efficiency of map generation. The generated maps can be used to guide robot navigation in various applications, such as autonomous cleaning, food and service delivery, tourist guidance, and automated roaming operations.
[0190] The embodiments described above are specific combinations of the components and features of the present disclosure. Each component or feature should be considered optional unless explicitly stated otherwise. Each component or feature may be implemented without being combined with other components or features. Additionally, some components and / or features may be combined to implement the embodiments of the present disclosure. The order of operations described in the embodiments of the present disclosure may be rearranged. Some components or features of one embodiment may be included in another embodiment, or may be replaced by corresponding components or features of another embodiment. It is evident that claims that do not explicitly cite one another in the appended claims may be combined to form an embodiment or incorporated as new claims through amendment after filing.
[0191] Those skilled in the art will readily understand that the features disclosed in this invention may be implemented in various specific forms within the scope of the invention. Accordingly, the above detailed description should not be interpreted restrictively in all respects and should be considered exemplary. The scope of this invention should be determined by a reasonable interpretation of the appended claims, and all modifications within the same scope as the invention are included within the scope of this invention.
Claims
1. A computer-implemented method for training a neural network for indoor environment mapping, the method comprising: a first step of training the neural network using a first dataset based on synthetic shapes; a second step of training the neural network using a second dataset based on real photos; and for each of a plurality of objects: collecting a plurality of images of the object, wherein each of the plurality of images is generated under a different lighting condition; and a third step of training the neural network using the plurality of images of the object.
2. The computer-implemented method of claim 1, wherein the different lighting conditions under which the plurality of images are generated include lighting conditions other than ambient lighting conditions.
3. The computer-implemented method of claim 2, wherein the lighting conditions other than the ambient lighting conditions include at least one of: bright natural light lighting conditions; weak natural light lighting conditions; or artificial lighting conditions.
4. The computer-implemented method of claim 2, wherein the lighting conditions other than the ambient lighting conditions include: bright natural light lighting conditions; weak natural light lighting conditions; and artificial lighting conditions.
5. The computer-implemented method of claim 4, wherein the artificial lighting conditions are provided using one or more infrared light sources or one or more visible-light light-emitting diode (LED) light sources.
6. The computer-implemented method of claim 4, wherein the bright natural light lighting conditions correspond to an illuminance in the range of 10,000 to 25,000 lux, the weak natural light lighting conditions correspond to an illuminance in the range of 3.4 to 40 lux, and the artificial lighting conditions correspond to an illuminance in the range of 50 to 1,000 lux.
7. The computer-implemented method of claim 4, wherein, for each of the plurality of objects: the plurality of images of the object include: a first image of the object taken under the bright natural light lighting conditions; a second image of the object taken under the weak natural light lighting conditions; and a third image of the object taken under the artificial lighting conditions; and training the neural network in the third step includes training the neural network using the first image, the second image, and the third image of the object.
8. The computer-implemented method of claim 1, further comprising, for each of the plurality of objects, permuting the plurality of images of the object, wherein training the neural network in the third step includes training the neural network using the permuted plurality of images of the object.
9. The computer-implemented method of claim 8, wherein the plurality of images of the object are permuted based on a random homography transformation.
10. The computer-implemented method of claim 9, wherein the permuted plurality of images of the object comprise: a first image of the object permuted by cropping; a second image of the object permuted by translation; a third image of the object permuted by scaling; and a fourth image of the object permuted by image rotation.
11. The computer-implemented method of claim 1, wherein, for each of the plurality of objects, training the neural network in the third step includes training the neural network to output a superset of points of interest, and each subset of the superset corresponds to a respective one of the different lighting conditions.
12. An artificial intelligence (AI) device configured to train a neural network for indoor environment mapping, the AI device comprising: one or more transceivers; and one or more processors configured to: train the neural network in a first step using a first dataset based on synthetic shapes; train the neural network in a second step using a second dataset based on real photos; and for each of a plurality of objects: collect a plurality of images of the object, wherein each of the plurality of images is generated under a different lighting condition; and train the neural network in a third step using the plurality of images of the object.
13. The AI device of claim 12, wherein the different lighting conditions under which the plurality of images are generated include lighting conditions other than an ambient lighting condition.
14. The AI device of claim 13, wherein the lighting conditions other than the ambient lighting condition include at least one of: a bright natural light lighting condition; a lighting condition with insufficient natural light; or an artificial lighting condition.
15. The AI device of claim 14, wherein the bright natural light lighting condition corresponds to an illuminance in the range of 10,000 to 25,000 lux, the lighting condition with insufficient natural light corresponds to an illuminance in the range of 3.4 to 40 lux, and the artificial lighting condition corresponds to an illuminance in the range of 50 to 1,000 lux.
16. The AI device of claim 14, wherein, for each of the plurality of objects, the plurality of images of the object includes: a first image of the object taken under the bright natural light lighting condition; a second image of the object taken under the lighting condition with insufficient natural light; and a third image of the object taken under the artificial lighting condition, and wherein the one or more processors train the neural network in the third step using the first image, the second image, and the third image of the object.
17. The AI device of claim 12, wherein, for each of the plurality of objects, the one or more processors permute the plurality of images of the object, and wherein training the neural network in the third step includes training the neural network using the permuted plurality of images of the object.
18. The AI device of claim 17, wherein the plurality of images of the object are permuted based on a random homography transformation.
19. The AI device of claim 12, wherein, for each of the plurality of objects, training the neural network in the third step includes training the neural network to output a superset of points of interest, each subset of the superset corresponding to a respective lighting condition among the different lighting conditions.
20. A non-transitory storage medium storing instructions that, when executed, cause one or more processors to perform operations comprising: a first step of training a neural network using a first dataset based on synthetic shapes; a second step of training the neural network using a second dataset based on real photographs; for each of a plurality of objects, collecting a plurality of images of the object, wherein each of the plurality of images is generated under a different lighting condition; and a third step of training the neural network using the plurality of images of the object.
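
For illustration only, the illuminance ranges recited in claim 15 and the random homography transformation recited in claims 9 and 18 might be sketched as follows. This is a minimal sketch and not the disclosed implementation. The function names (`classify_lighting`, `random_homography`, `warp_point`), the parameter defaults, the use of normalized image coordinates in [-1, 1], and the composition of the homography from translation, rotation, scaling, and a small perspective term are all assumptions made for the example; only the lux ranges come from the claims.

```python
import math
import random

# Illuminance ranges in lux, taken from claim 15. The dictionary keys are
# hypothetical labels chosen for this sketch.
LIGHTING_RANGES_LUX = {
    "bright_natural": (10_000.0, 25_000.0),
    "insufficient_natural": (3.4, 40.0),
    "artificial": (50.0, 1_000.0),
}


def classify_lighting(lux):
    """Return the label whose claimed range contains `lux`, or None."""
    for name, (low, high) in LIGHTING_RANGES_LUX.items():
        if low <= lux <= high:
            return name
    return None


def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]


def random_homography(rng, max_shift=0.1, scale_range=(0.8, 1.2),
                      max_angle_deg=30.0, max_perspective=0.1):
    """Sample a 3x3 homography as translation * rotation * scale * perspective.

    Coordinates are assumed to be normalized to [-1, 1] around the image
    center, so the default perspective term keeps the projective divisor
    positive (at least 0.8) for every point inside the image.
    """
    tx = rng.uniform(-max_shift, max_shift)
    ty = rng.uniform(-max_shift, max_shift)
    s = rng.uniform(scale_range[0], scale_range[1])
    angle = math.radians(rng.uniform(-max_angle_deg, max_angle_deg))
    px = rng.uniform(-max_perspective, max_perspective)
    py = rng.uniform(-max_perspective, max_perspective)

    translation = [[1, 0, tx], [0, 1, ty], [0, 0, 1]]
    rotation = [
        [math.cos(angle), -math.sin(angle), 0],
        [math.sin(angle), math.cos(angle), 0],
        [0, 0, 1],
    ]
    scaling = [[s, 0, 0], [0, s, 0], [0, 0, 1]]
    perspective = [[1, 0, 0], [0, 1, 0], [px, py, 1]]
    return matmul3(translation, matmul3(rotation, matmul3(scaling, perspective)))


def warp_point(h, x, y):
    """Apply homography `h` to the point (x, y) and dehomogenize."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)


if __name__ == "__main__":
    rng = random.Random(0)
    h = random_homography(rng)
    print(classify_lighting(500.0))
    print(warp_point(h, 0.5, -0.25))
```

In a training pipeline of the kind claimed, the same sampled matrix would typically be applied both to an image and to its point-of-interest coordinates so that the labels stay aligned with the warped image. Note also that the claimed ranges leave gaps (for example, between 40 and 50 lux), which is why `classify_lighting` may return `None`.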