High-precision localization of objects moving along a trajectory.

The system uses a neural network trained with reference images to select and aggregate candidate images for precise localization, addressing errors in traditional methods and achieving centimeter-level accuracy for moving objects.

JP2026521183A · Pending · Publication Date: 2026-06-26 · ORACLE INT CORP

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
ORACLE INT CORP
Filing Date
2024-06-14
Publication Date
2026-06-26

Abstract

A technique is provided to generate highly accurate localization of objects moving along a trajectory. In one technique, a specific image associated with a moving object is identified. A set of candidate images is selected from multiple images used to train a neural network. For each candidate image in the set of candidate images, (1) the output from the neural network is generated based on inputting the specific image and each of the candidate images into the neural network, (2) the predicted position of the specific image is determined based on the output and the position associated with each of the candidate images, and (3) the predicted position is added to the set of predicted positions. The set of predicted positions is aggregated to generate an aggregated position for the specific image.

Description

[Technical Field]

[0001] This disclosure relates to the field of object localization, and more specifically, to the use of machine learning to obtain highly accurate localization in a known environment.

[Background technology]

[0002] Various applications require solutions to determine the position of moving objects in new environments using cameras attached to those objects. Examples of such applications include autonomous navigation of drones, robots, and self-driving cars, analysis of vehicle trajectories, augmented reality, and surveying and mapping in situations where GPS is unavailable.

[0003] Traditional approaches to determining the position of moving objects use methods based on mapping and localization. However, such approaches are prone to errors in both the mapping and localization stages, leading to very poor positioning accuracy. These errors stem from two root causes: visual feature engineering and feature sparsity. Regarding the former, visual features (e.g., elements of the landscape) cannot perfectly depict all surrounding conditions. If appropriate image features are not selected, the accuracy of both mapping and localization will be poor. Regarding the latter (feature sparsity), even if appropriate visual features are selected, only a very small number of features will be present in each image. This sparsity directly affects the accuracy of localization.

[0004] The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless explicitly stated otherwise, the approaches described in this section should not be assumed to qualify as prior art merely because they are included in this section.

[Summary of the Invention]

[0005] In one embodiment, a first method includes (1) identifying a specific image associated with a moving object; (2) selecting a set of candidate images from a set of images used to train a neural network; (3) for each candidate image in the set of candidate images, (i) generating an output from the neural network based on inputting the specific image and that candidate image into the neural network; (ii) determining a predicted position of the specific image based on the output and the position associated with that candidate image; (iii) adding the predicted position to a set of predicted positions; (4) aggregating the set of predicted positions to generate an aggregated position for the specific image; and (5) associating the aggregated position with the moving object.

[0006] In one related embodiment, the first method further includes (a) for each image in a plurality of images, (i) determining the position of each image, (ii) identifying a set of images whose positions are within a threshold distance from the position of each image, (iii) storing pairs of data relating the set of images to each image, and (b) training a neural network based on the pairs of data associated with each image in the plurality of images.

[0007] In one embodiment of the related third method, the third method comprises the first method, wherein selecting a set of candidate images includes (a) determining the estimated location of a particular image, and (b) selecting a set of candidate images based on a threshold distance between the estimated location of the particular image and the location of each candidate image in the set of candidate images.

[0008] In one embodiment of the related third method, each image in a plurality of images is associated with positional data indicating the location of an object, and the third method further includes storing an index that indexes the plurality of images based on the positional data associated with each image in the plurality of images, and selecting a set of candidate images includes using estimated positions to identify the set of candidate images in the index.

[0009] In one embodiment of the related third method, determining the estimated position of a particular image includes determining the estimated position based on the predicted positions of one or more images that precede the particular image in time.

[0010] In one embodiment of the third related method, selecting a set of candidate images based on a threshold distance includes (a) selecting a plurality of candidate images based on a threshold distance, (b) determining the number of candidate images among the plurality of candidate images, and (c) selecting a set of candidate images whose respective positions are closest to the estimated positions, in accordance with the determination that the number of candidate images is greater than a certain threshold number.

[0011] In one embodiment of the third related method, selecting a set of candidate images based on a threshold distance includes (a) determining that there are no candidate images within a distance threshold, and (b) increasing the threshold distance to a larger threshold distance in response to determining that there are no candidate images within the distance threshold, and including one or more candidate images within the larger threshold distance in the set of candidate images.

[0012] In one embodiment of the related first method, the aggregated position includes a value s and a value d, where s is a first distance along a predefined path from a starting position on the predefined path, and d is a second distance from the point on the predefined path indicated by value s.

[0013] In one embodiment of the related first method, the output includes (a) the positional displacement of each candidate image and (b) the angular displacement of each candidate image.

[0014] In one embodiment of the related first method, aggregating the set of predicted locations includes calculating the average of the set of predicted locations.

[0015] In one embodiment of the related first method, the first method further includes determining, for each candidate image in a set of candidate images, an estimated distance between the location of each candidate image and the location of a particular image, and aggregating the set of predicted locations includes calculating a weighted average of the set of predicted locations based on the estimated distance associated with each candidate image in the set of candidate images.

[Brief explanation of the drawing]

[0016]
[Figure 1] A block diagram showing an exemplary localization system in one embodiment.
[Figure 2] A diagram showing how candidate images may be selected in one embodiment.
[Figure 3] A flowchart illustrating an exemplary process for predicting the position of a moving object in one embodiment.
[Figure 4] A block diagram illustrating a computer system in which the present invention can be implemented.
[Figure 5] A block diagram of a basic software system that can be used to control the operation of a computer system.

[Modes for carrying out the invention]

Detailed explanation

[0017] In the following description, numerous specific details are set forth, for purposes of explanation, in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid unnecessarily obscuring the present invention.

General Overview

[0018] A system and method for predicting the position of a moving object are provided. In one technique, a set of reference images with known positions is stored and used to train a neural network. Then, given an image that is not associated with a position, a set of candidate images is selected from the set of reference images. For each candidate image in the set of candidate images, the candidate image and the given image are input into the neural network, and a predicted position of the given image is obtained from the output of the neural network. The predicted positions are aggregated to generate the final predicted position of the moving object at the time corresponding to the given image.

[0019] Embodiments improve computer-related technology, namely localization technology. Because the calculated position of the moving object is anchored to the known positions of the candidate images, the localization system does not suffer from the drift that is characteristic of many simultaneous localization and mapping (SLAM) systems. Another improvement is that, in one embodiment, the localization system can robustly maintain accurate localization over a very long trajectory, such as one extending several kilometers.

System Overview

[0020] Figure 1 is a block diagram showing an exemplary localization system 100 in one embodiment. The localization system 100 includes an image database 110, a machine learning (ML) model trainer 120, an ML model 130, an input image data source 140, a candidate image selector 150, an ML model initiator 160, a model output 162, and an aggregator 170. Each of the ML model trainer 120, the candidate image selector 150, the ML model initiator 160, and the aggregator 170 may be implemented in software, hardware, or a combination of software and hardware.

[0021] The image database 110 includes multiple sets of images (or "multiple image sets"), each image set corresponding to different video footage or a different video stream. For example, one or more moving objects such as cars, airplanes, robots, and drones include a digital camera that generates images (or a video stream) and a position detection component that determines the position and yaw angle of the moving object at each time interval, such as every 50 milliseconds. If the moving object is moving in three-dimensional space (e.g., a flying drone), the position detection component also determines the pitch and roll of the moving object. (Wherever position is mentioned hereafter, the yaw angle is included even if not explicitly mentioned.) The position detection component is time-aligned with the generated images in order to determine the time (or timestamp) of each image.

[0022] Multiple image sets are needed to generalize ML model 130, so that it can cope with the real-world variation in the digital images that may be supplied when ML model 130 is invoked. The same moving object can be used to generate multiple sets of images. For example, a drone may fly from point A to point B multiple times. Additionally or alternatively, multiple moving objects can be used to generate multiple sets of images. For example, multiple racing cars may run from point A to point B once or multiple times. When multiple different sets of images are used, the moving object(s) can be instructed to start generating images at slightly different locations and to move laterally and / or vertically within a defined range. This allows future moving objects that rely on ML model 130 to also move within the same range and still have a precise understanding of their absolute position in two- or three-dimensional space.

[0023] In one embodiment, one or more images in a set of images are modified to simulate alternative weather and / or lighting conditions. For example, video footage shot during the daytime may be input into a filter that darkens images within the video footage to simulate how the recorded environment would look at sunset or twilight. As another example, video footage shot under bright blue skies may be input into a filter that adds simulated cloudy, rainy, and / or foggy conditions to the video footage. In this way, if the recorded environment is rainy or poorly lit during a localization session of a moving object, the simulated video footage can be used to appropriately localize the moving object.
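The darkening filter mentioned above could be as simple as scaling pixel intensities. The following is a minimal sketch only: the function name `darken`, the grayscale list-of-rows image representation, and the scaling factor are illustrative assumptions, not part of the disclosure, which does not specify how the filter is implemented.

```python
def darken(image, factor):
    """Scale every pixel of a grayscale image by `factor`.

    `image` is a list of rows of 0-255 integers (an assumed, simplified
    representation). A factor below 1.0 simulates lower light.
    """
    if not 0.0 <= factor <= 1.0:
        raise ValueError("factor must be in [0, 1]")
    return [[int(round(pixel * factor)) for pixel in row] for row in image]


# Hypothetical 2x2 daytime frame and its simulated twilight counterpart.
daytime = [[200, 180], [90, 255]]
twilight = darken(daytime, 0.4)
```

A production augmentation pipeline would more likely operate on color frames and model weather effects such as rain or fog, which a plain intensity scale does not capture.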

[0024] In one embodiment, the multiple sets of images include image sets corresponding to video footage of objects moving in different directions. For example, the image database 110 stores (1) a first set of images of one or more moving objects from point A to point B, and (2) a second set of images of one or more moving objects from point B to point A.

[0025] If the image database 110 includes image sets with varying angles, the number of image sets required to train the ML model 130 (and to achieve a certain degree of accuracy and usefulness) can be reduced compared to the number of images required when the image set has fewer variations.

Model training

[0026] The ML model trainer 120 trains the ML model 130 based on images in the image database 110 using one or more machine learning techniques. The ML model 130 may take any model form, one example being an augmented optical flow neural network. Examples of such neural networks include FlowNetS, FlowNet2C, and FlowNet2SD. Such neural networks are trained by showing them pairs of images. As described herein, these images are captured from a known environment along with camera position information (CPI) recorded with the images. CPI includes yaw, pitch, and roll. CPI can be recorded using external sensors such as a global positioning system (GPS), an inertial navigation system (INS), or a real-time kinematic positioning system (RTK). Alternatively, CPI can be generated offline using techniques such as Differential GPS (DGPS), post-processed kinematics (PPK), or structure from motion (SFM). Images can be captured at different times, under different lighting or weather conditions, and using different cameras.

[0027] In one embodiment, the structure of the ML model 130 is based on a predefined or "off-the-shelf" neural network. For example, the number of layers and the number of neurons per layer are predefined, in addition to some of the weights connecting the neurons. However, during the training process, the ML model trainer 120 modifies these weights through methods such as backpropagation. The final layer of the model structure is also transformed to enable it to predict multiple values, such as longitudinal distance (s), transverse distance (d), and camera angle.

[0028] Each image used to train ML model 130 is associated with a location. The location can be an absolute position in space, such as a Global Positioning System (GPS) position with latitude and longitude coordinates. Alternatively, the location may be a position relative to a fixed geographical location, such as the starting position of a competition field. Or, the location may be a position whose measurement is only meaningful within a known environment, such as the distance along an arc from the starting position and the distance from the arc, where the arc defines the accepted path of movement for the moving object whose position is being determined. In addition to the location coordinates ((x,y) or (s,d)), each image is associated with the camera angle relative to a given reference plane. This is important because embodiments predict the angle of a moving object, and this angle is used to indicate the state of the moving object in its physical environment. Two moving objects may be in the same physical position (at different times) yet point in very different directions. Therefore, the angle is important for use in localization calculations.

[0029] The training data for ML model 130 includes multiple training instances, each training instance containing a pair of images (included in image database 110) at a known location. Pairs of images can be selected based on their relative distance from one another. For example, if two images are within a certain threshold distance from each other, they are added as a pair for training purposes. Therefore, a single image may be included in many pairs used to train ML model 130.
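The pairing rule described above can be sketched as follows. This is an illustrative sketch under stated assumptions: the record layout (`id`, `video`, `pos`), the planar (x, y) positions, and the function name `build_training_pairs` are hypothetical. The optional `cross_video_only` flag reflects the related embodiment in which the two images of a pair come from different video footage.

```python
import math
from itertools import combinations


def build_training_pairs(images, threshold, cross_video_only=False):
    """Return (id, id) pairs of images whose positions lie within `threshold`.

    `images` is a list of dicts with keys 'id', 'video', and 'pos', where
    'pos' is an (x, y) tuple. All of these names are assumptions.
    """
    pairs = []
    for a, b in combinations(images, 2):
        if cross_video_only and a["video"] == b["video"]:
            continue  # keep only pairs drawn from different footage
        if math.dist(a["pos"], b["pos"]) <= threshold:
            pairs.append((a["id"], b["id"]))
    return pairs


images = [
    {"id": "a1", "video": "A", "pos": (0.0, 0.0)},
    {"id": "a2", "video": "A", "pos": (1.0, 0.0)},
    {"id": "b1", "video": "B", "pos": (0.5, 0.5)},
    {"id": "b2", "video": "B", "pos": (10.0, 0.0)},
]
pairs = build_training_pairs(images, threshold=1.5)
```

Here `a1`, `a2`, and `b1` each appear in two pairs, which illustrates how a single image can contribute to many training instances. The all-pairs loop is quadratic in the number of images; a spatial index would be preferable for a large database.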

[0030] In one relevant embodiment, the images in each pair of images used to train the ML model 130 are from different video footage and may contain at least slightly different trajectories. For example, the first video footage may follow the left boundary of the trajectory (or accepted path of movement), the second video footage may follow the right boundary of the trajectory, and the third video footage may follow the center of the trajectory. Thus, given a pair of images {v1,v2} used to train the ML model 130, v1 and v2 are from different video footage (and therefore have different trajectories).

[0031] ML model 130 is trained to "learn" the difference in physical distance and angle between two images, where the position and angle of one of the two images are known and the position and angle of the other image are unknown. During training, the positions of both images are known, and therefore the distance between the two positions is known. ML model trainer 120 uses a pre-trained version of ML model 130 to output (i.e., "estimate") the distance and compares the estimate with the known distance. ML model trainer 120 uses the comparison result (the error) to change the weights of ML model 130, such as the weights associated with different neurons in a neural network. The larger the error, the larger the change in the weights. Conversely, if the error is 0, ML model trainer 120 may not change any weights.
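The relationship between the size of the comparison result and the size of the weight change can be illustrated with a single gradient-descent step on a one-weight model. This is a toy sketch, not the training procedure of ML model 130: the squared-error loss, the learning rate, and the function name `update_weight` are assumptions chosen only to show the proportionality.

```python
def update_weight(w, x, target, lr=0.1):
    """One gradient-descent step for the toy model `prediction = w * x`.

    Uses squared error, so the gradient (and hence the weight change)
    grows with the difference between prediction and target.
    """
    error = w * x - target       # comparison of estimate with known value
    gradient = 2.0 * error * x   # derivative of (w * x - target) ** 2 w.r.t. w
    return w - lr * gradient


no_change = update_weight(2.0, 1.0, 2.0)   # error is 0, weight stays 2.0
small_step = update_weight(0.0, 1.0, 1.0)  # small error, small change
large_step = update_weight(0.0, 1.0, 4.0)  # large error, large change
```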

[0032] Once trained, the ML model 130 processes image pairs and predicts the spatial correlation (or displacement) between the points where the images of each pair were taken. This spatial correlation can take the form of a custom correlation, such as (a) a three-dimensional or two-dimensional transformation, or (b) a difference along custom axes and in rotation angles. This spatial correlation is then translated into a relative position. Estimating a relative position tends to work better than estimating an absolute position, which is prone to drift. An exemplary spatial correlation or displacement is a set of three values, namely s, d, ψ, where s is the longitudinal distance along a curve or predefined path, d is the lateral distance from the curve, and ψ is the viewing angle.

[0033] In one relevant embodiment, the ML model 130 outputs displacements that are added to a known relative position to generate a predicted relative position, and the predicted relative position can then be converted to an absolute position, such as Global Positioning System coordinates. This conversion is useful later when working with equations of motion. For example, the ML model 130 is used to predict a relative position along a trajectory, which is then converted to Global Positioning System coordinates, which are then input into other models.
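One way to picture the conversion from a relative position to an absolute one is to walk a distance s along the predefined path and then step a distance d perpendicular to it. The sketch below makes several assumptions that the source does not state: the path is a polyline of planar (x, y) vertices, positive d lies to the left of the direction of travel, and the result is a local planar coordinate (a further map projection step, not shown, would be needed to obtain latitude and longitude). The function name `sd_to_xy` is hypothetical.

```python
import math


def sd_to_xy(path, s, d):
    """Convert path-relative (s, d) into planar (x, y).

    `path` is a list of (x, y) vertices. `s` is measured along the path
    from path[0]; `d` is a signed offset perpendicular to the path,
    positive to the left of the direction of travel (an assumed convention).
    Values of s beyond the end are extrapolated along the last segment.
    """
    segments = [(p, q) for p, q in zip(path, path[1:]) if p != q]
    if not segments:
        raise ValueError("path needs at least one segment of non-zero length")
    remaining = s
    for i, ((x0, y0), (x1, y1)) in enumerate(segments):
        seg_len = math.hypot(x1 - x0, y1 - y0)
        if remaining <= seg_len or i == len(segments) - 1:
            tx, ty = (x1 - x0) / seg_len, (y1 - y0) / seg_len  # unit tangent
            nx, ny = -ty, tx                                   # left-hand normal
            return (x0 + remaining * tx + d * nx,
                    y0 + remaining * ty + d * ny)
        remaining -= seg_len


# An L-shaped path: 10 units east, then 10 units north.
path = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
on_first_leg = sd_to_xy(path, 4.0, 1.0)    # 4 along, 1 to the left
on_second_leg = sd_to_xy(path, 13.0, 1.0)  # 3 units into the second leg
```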

[0034] [Equation]

Inference time

[0035] The input image data source 140 is the source of input images whose predicted positions are calculated using the ML model 130. The input image data source 140 may be a digital camera attached to a moving object for which a localization session is being performed. Alternatively, the input image data source 140 may be storage local to the candidate image selector 150, which stores images streamed from the digital camera. Thus, while the candidate image selector 150 retrieves and processes a first digital image from the input image data source 140, additional images from the same input data stream are stored in (or uploaded to) the input image data source 140. For example, a wireless connection may exist between the input image data source 140 and the digital camera generating the images stored in the input image data source 140.

[0036] The candidate image selector 150 retrieves an input image from the input image data source 140 and determines the estimated position of the input image in one of several ways. If the input image is the first image in a video stream, it may be assumed that the moving object started at the starting position, where the starting position is associated with position data in the same units as the position data associated with the training images used to train the ML model 130, e.g., Global Positioning System coordinates (an example of an absolute position) or the position {s,d} (an example of a relative position). If the starting position is unknown, but the first image is known to have been taken somewhere within a known area or space (e.g., a competition field), the first image may be compared with all images contained in a corpus of images (e.g., the image database 110), and the position of the most similar image is used as the starting position.

[0037] If the input image is not the first image in the video stream, the estimated position of the input image is based on the predicted position of each of one or more previous input images, that is, images received or processed before the current input image. The estimated position may then be an extrapolation from the previous predicted position(s). For example, the predicted positions of v1 and v2 are used to determine the estimated position of v3 (the current input image) by assuming that v3 is on the same line as v1 and v2. The estimated position may also be based on the velocity or speed determined for the moving object. The velocity may be calculated, for example, by dividing (a) the difference between the predicted positions of the last two images by (b) the difference in time between the last two images. Taking the velocity of the moving object into account is useful in scenarios where the moving object accelerates or decelerates. An accurate estimated position of the moving object is useful for selecting a suitable set of candidate images to input into the ML model 130.
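The straight-line extrapolation from v1 and v2 to v3 described above can be sketched as a constant-velocity estimate. The function name `extrapolate_position`, the tuple-based positions, and the timestamps in seconds are assumptions for illustration.

```python
def extrapolate_position(p1, t1, p2, t2, t3):
    """Estimate the position at time t3 from two earlier predicted positions.

    Velocity is the position difference of the last two images divided by
    their time difference; the estimate continues along the same line.
    """
    dt = t2 - t1
    if dt <= 0:
        raise ValueError("t2 must be later than t1")
    velocity = tuple((b - a) / dt for a, b in zip(p1, p2))
    return tuple(b + v * (t3 - t2) for b, v in zip(p2, velocity))


# v1 at t=0.0 s, v2 at t=0.5 s; estimate where v3 is at t=1.0 s.
estimate_v3 = extrapolate_position((0.0, 0.0), 0.0, (2.0, 1.0), 0.5, 1.0)
```

Because the estimate only seeds the candidate search, it does not need to be exact; it needs to be close enough that the search region contains suitable reference images.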

Selecting candidate images

[0038] Given the estimated position of an input image, the candidate image selector 150 selects a set of candidate images based on the estimated position. The set of candidate images comes from the same set of images used to train the ML model 130. Thus, the set of candidate images is from the image database 110 and is a proper subset of the set of images used to train the ML model 130. Therefore, the images contained in the image database 110 serve not only as training images for the ML model 130 but also as reference images. The selected set of candidate images is based on the known positions of the candidate images within that set. The closer an image's position is to the estimated position, the more likely the candidate image selector 150 will be to select that image as a candidate image.

[0039] The candidate image selector 150 may use a threshold distance value when selecting a set of candidate images. For example, if an image in the image database 110 is associated with a location within the threshold distance range of the estimated location, the candidate image selector 150 adds that image to the set of candidate images (which is initially empty).

[0040] Additionally or alternatively, the candidate image selector 150 uses a threshold number to select a number of candidate images equal to the threshold number. For example, if the threshold number is 7, the candidate image selector 150 selects the 7 images (from the image database 110) whose known positions are closest to the estimated position of the input image.

[0041] As another example, the candidate image selector 150 identifies all images (in the image database 110) that have a known position within a threshold distance of the estimated position of the input image. If the number of identified images is greater than the threshold number, the candidate image selector 150 removes one or more identified images from the identified images until the number of remaining images equals the threshold number. The removed images may be those with a known position furthest from the estimated position compared to the positions of the remaining images. Alternatively, the removed images may be selected randomly.

[0042] If the number of identified images within the threshold distance is less than the threshold number (e.g., 1), the candidate image selector 150 increases the threshold distance (at least for the current input image, and not necessarily for the next input image) and identifies zero or more additional images. If one or more additional images are identified as a result of increasing the threshold distance, the candidate image selector 150 determines whether the total number of identified images has reached the threshold number. If so, the process for the current input image stops. Otherwise, the threshold distance is increased again, and the process of identifying additional images is repeated. One reason not to increase the search range more than necessary is that a larger search range forces the model to extrapolate (that is, to predict outside the range over which the model was trained to perform well), which can lead to large errors.
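The selection rules in the preceding paragraphs (a distance threshold, an upper limit on the number of candidates, and widening the search when too few are found) can be combined as in the sketch below. The function name `select_candidates`, the parameter names, the doubling growth factor, and the `max_radius` cap are all assumptions; the cap reflects the caution above against widening the search too far.

```python
import math


def select_candidates(reference, estimate, radius, max_count,
                      min_count=1, growth=2.0, max_radius=50.0):
    """Pick reference image ids near `estimate`.

    `reference` is a list of (image_id, (x, y)) entries. Images within
    `radius` are kept, nearest first, and at most `max_count` are returned.
    If fewer than `min_count` are found, the radius is widened by `growth`
    until enough are found or `max_radius` is reached.
    """
    if radius <= 0 or growth <= 1.0:
        raise ValueError("radius must be > 0 and growth must be > 1")
    scored = sorted((math.dist(pos, estimate), image_id)
                    for image_id, pos in reference)
    while True:
        in_range = [(dist, image_id) for dist, image_id in scored
                    if dist <= radius]
        if len(in_range) >= min_count or radius >= max_radius:
            break
        radius = min(radius * growth, max_radius)
    return [image_id for _, image_id in in_range[:max_count]]


reference = [("r1", (0.0, 0.0)), ("r2", (1.0, 0.0)),
             ("r3", (2.0, 0.0)), ("r4", (8.0, 0.0))]
nearby = select_candidates(reference, (0.9, 0.0), radius=1.5, max_count=2)
widened = select_candidates(reference, (5.0, 0.0), radius=1.0, max_count=2)
```

In the second call no image lies within the initial radius, so the radius doubles twice before any candidates are found. Dropping the farthest images first is one of the two removal strategies the text mentions; random removal is the other.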

[0043] In one embodiment, the images in the image database 110 are indexed based on the position data of the images. The position data may indicate an absolute position or a relative position. This position data is one value or a plurality of values. Indexing the images based on the position data increases the speed of the candidate image selection process so that the candidate image selector 150 does not need to scan each image in the image database 110 for candidate images. For example, the position data may be indicated by the coordinates of a global positioning system. As another example, the position data can include values for variables s, d, where (1) s indicates the distance from the starting position along a predefined path, curve, or trajectory, and (2) d indicates the distance from point s on the predefined path, and the straight line from s to d is perpendicular to the tangent at s of the predefined path. When the images in the image database 110 are indexed by s and an estimated position $(s_e, d_e)$ is given, the candidate image selector 150 uses $T_s$, where $T_s$ is the threshold distance along the predefined path, to identify one or more images with s within the range of $s_e - T_s$ to $s_e + T_s$. When the images in the image database 110 are also indexed by d, for each image identified by the candidate image selector 150 based on $s_e$, the candidate image selector 150 can determine whether the image has a d with a value between $d_e - T_d$ and $d_e + T_d$, where $T_d$ is the lateral threshold distance. $T_d$ can be increased or decreased if $d_e$ is close to the edge or boundary of the predefined path of the moving object. One possible method for fast nearest neighbor search is a KD tree, which is a binary tree where each node represents a K-dimensional point. Each non-leaf node in the binary tree acts as a hyperplane, dividing the space into two divisions. This hyperplane is perpendicular to a selected axis associated with one of the K dimensions.
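The range lookup on s followed by a filter on d can be sketched with a sorted list and binary search. This is a simplified stand-in for the index, not the KD tree mentioned above; the class name `SDIndex` and its method names are hypothetical, and only Python's standard `bisect` module is used.

```python
import bisect


class SDIndex:
    """Index of reference images keyed by s, with a secondary filter on d."""

    def __init__(self, images):
        # `images` is an iterable of (image_id, s, d) tuples (assumed layout).
        self._rows = sorted(images, key=lambda row: row[1])
        self._s_values = [row[1] for row in self._rows]

    def query(self, s_e, d_e, t_s, t_d):
        """Return ids with s in [s_e - t_s, s_e + t_s] and d in [d_e - t_d, d_e + t_d]."""
        lo = bisect.bisect_left(self._s_values, s_e - t_s)
        hi = bisect.bisect_right(self._s_values, s_e + t_s)
        return [image_id for image_id, _, d in self._rows[lo:hi]
                if d_e - t_d <= d <= d_e + t_d]


index = SDIndex([("a", 10.0, 0.5), ("b", 12.0, -2.0),
                 ("c", 14.0, 0.0), ("d", 30.0, 0.2)])
hits = index.query(s_e=12.0, d_e=0.0, t_s=2.5, t_d=1.0)
```

The binary search narrows the scan to images whose s falls in the window, so only that slice is checked against the lateral threshold. A KD tree generalizes the same idea to several dimensions at once.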

Exemplary candidate image selection process

[0044] Figure 2 shows how candidate images are selected in one embodiment. Lines 202 and 204 represent edges or boundaries that the moving object is not intended (or expected) to cross. Thus, the moving object remains between lines 202 and 204 throughout its movement. The region between lines 202 and 204 represents part of the overall trajectory that the moving object may pass over (or through).

[0045] Lines 210 through 230 represent trajectories that one or more moving objects have traversed in the past. A digital camera (attached to each of the one or more moving objects) generated digital images during each of these traversals. The ML model trainer 120 uses the generated digital images to train a neural network (e.g., ML model 130).

[0046] Objects 212 and 214 represent points on the trajectory 210 where digital images were generated. These images are associated with their respective locations or positions on the trajectory 210. Object 214 represents a time later than the time indicated by object 212. The digital images associated with objects 212 and 214 come from the same video footage. Similarly, objects 222 and 224 represent points on the trajectory 220 where digital images were generated. These images are associated with their respective locations or positions on the trajectory 220. Object 224 represents a time later than the time indicated by object 222. The digital images associated with objects 222 and 224 come from the same video footage, which is different from the video footage from which the digital images associated with objects 212 and 214 originate. Similarly, object 232 represents a point on the trajectory 230 where a digital image was generated. This image is associated with a specific location or position on the trajectory 230 and originates from different video footage than the previous two.

[0047] Object 240 represents the previous position of the moving object at time t-1, which may be the position predicted using the embodiments described herein. The value of t may be a second, a millisecond, or any other unit of time. Thus, the value of t-1 may not be literally t-1, but may be some value less than t and indicating a time earlier than t.

[0048] Location 242 represents the estimated position of the moving object at time t. The candidate image selector 150 (or another component of the localization system 100) determines the estimated position based on the estimated velocity of the moving object and / or one or more other positions prior to time t-1, and one or more other factors. The search region 244 represents the region (around location 242) in which the candidate image selector 150 searches for candidate images among the images used to train the neural network. In this example, the candidate image selector 150 selects five digital images, namely those corresponding to objects 212, 214, 222, 224, and 232.

[0049] Object 250 represents the moving object at its current location at time t. Object 250 is the moving object whose position is predicted based on a digital image generated at time t. As described in more detail herein, the ML model initiator 160 inputs this digital image (referred to as the "current" image) along with a candidate image into a neural network (e.g., ML model 130), and the neural network outputs predicted displacements, which are used to calculate an absolute position (e.g., latitude and longitude coordinates) or a relative position on a predefined path (e.g., values of s and d). The ML model initiator 160 repeats this process for each candidate image in the set of candidate images, resulting in a set of predicted positions. Thus, the current image is input into the neural network multiple times to obtain multiple predicted positions for object 250 at time t.

[0050] The final predicted position 252 is determined based on the set of predicted positions. Such a determination may involve performing one or more operations, such as an aggregation operation, as described in more detail herein. In this example, the final predicted position 252 is different from the current position of the moving object represented by object 250. This difference may be due to prediction error.
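Two aggregation operations named in the summary are a plain average and a weighted average based on each candidate's estimated distance. The sketch below shows both for (s, d) positions. The function names and the use of inverse-distance weights are assumptions: the source says only that the weights are based on the estimated distance, not how they are derived.

```python
def mean_position(positions):
    """Average a non-empty list of equal-length position tuples."""
    if not positions:
        raise ValueError("positions must not be empty")
    n = len(positions)
    return tuple(sum(p[i] for p in positions) / n
                 for i in range(len(positions[0])))


def weighted_mean_position(positions, distances, eps=1e-6):
    """Average positions, giving nearer candidates more weight.

    Weights are 1 / (distance + eps), an assumed weighting scheme; `eps`
    avoids division by zero when a candidate coincides with the estimate.
    """
    if not positions or len(positions) != len(distances):
        raise ValueError("need one distance per position")
    weights = [1.0 / (dist + eps) for dist in distances]
    total = sum(weights)
    return tuple(sum(w * p[i] for w, p in zip(weights, positions)) / total
                 for i in range(len(positions[0])))


predicted = [(10.0, 1.0), (12.0, 3.0)]
final_plain = mean_position(predicted)
final_weighted = weighted_mean_position(predicted, distances=[1.0, 3.0])
```

If an angle is aggregated as well, a plain arithmetic mean can misbehave near the wrap-around point (for example, averaging 359° and 1°), so a circular mean would be a safer choice for that component.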

Invoking the ML Model

[0051] After the candidate image selector 150 has selected a set of candidate images, the ML model initiator 160 invokes the ML model 130 by passing a candidate image and the input image as input to the ML model 130. This invocation is repeated for each candidate image in the set of candidate images. The candidate image selector 150 (or another component of the localization system 100) invokes or triggers the ML model initiator 160. Thus, the candidate image selector 150 may pass the input image and a single candidate image to the ML model initiator 160. Such passing can occur while the candidate image selector 150 is still selecting candidate images for a given input image. Alternatively, such passing can occur only after the candidate image selector 150 has finished selecting the set of candidate images.

[0052] Instead of passing one image pair at a time to the ML model initiator 160 (which invokes the ML model 130 for each pair), the candidate image selector 150 may pass the input image and the entire set of candidate images that it has selected for that input image in a single call or transmission. In this scenario, the ML model initiator 160 determines which of the images is the input image and which images belong to the set of candidate images. Then, for each candidate image in the set, the ML model initiator 160 invokes the ML model 130 by passing the candidate image and the input image.

[0053] Each invocation of the ML model 130 yields an output (model output 162), from which the predicted position of the moving object is ultimately determined for a given input image. Therefore, if there are 10 candidate images for an input image, the ML model 130 is invoked 10 times, generating 10 instances of the model output, each instance containing difference or displacement information.

[0054] Each instance of the model output 162 is thus a predicted displacement containing one or more values, such as a difference in position (which may itself contain multiple values) or a difference in angle. To generate an (intermediate) predicted position for the input image, the ML model initiator 160 (or another component of the localization system 100) adds the predicted displacement (which may be positive or negative) to the known position of the candidate image that was input to generate that predicted displacement. For each invocation, the ML model 130 may output multiple values, each corresponding to a different component or dimension of the position data. For example, the output from a single invocation of the ML model 130 might be a value of s, a value of d, and a value of ψ. In that case, the three predicted displacements are added to the corresponding known position values of the input candidate image to compute the predicted position (which contains three position values) associated with the input image.
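The addition described in this paragraph can be sketched as follows. This is a minimal illustration only: the `Pose` class, the `apply_displacement` function, and the interpretation of s, d, and ψ given in the comments are assumptions introduced for this example and are not defined by the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Hypothetical container for the three position components named in the text."""
    s: float    # assumed here: distance along the predefined path
    d: float    # assumed here: lateral offset from the path
    psi: float  # assumed here: heading angle


def apply_displacement(candidate: Pose, displacement: Pose) -> Pose:
    """Add a predicted displacement, component by component, to the
    known pose of the candidate image that produced it."""
    return Pose(
        s=candidate.s + displacement.s,
        d=candidate.d + displacement.d,
        psi=candidate.psi + displacement.psi,
    )


# Known pose of one candidate image and the displacement the model predicted for it.
candidate_pose = Pose(s=120.0, d=0.5, psi=0.25)
predicted_displacement = Pose(s=1.5, d=-0.25, psi=0.125)

# Intermediate predicted pose of the input image for this one candidate.
predicted_pose = apply_displacement(candidate_pose, predicted_displacement)
```

Repeating this for every candidate image yields one intermediate predicted pose per candidate.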

Aggregation

[0055] In one embodiment, the aggregator 170 performs an aggregation operation on the plurality of predicted positions determined for the input image in order to generate a final predicted position. The aggregation operation may be the calculation of a mean, a median, or a weighted average. In the case of a weighted average, the weights applied to different predicted positions may vary depending on the distance from the corresponding candidate image to the estimated position of the input image. For example, if the position of a first candidate image is closer to the estimated position than the position of a second candidate image, the predicted position derived from the first candidate image is given a higher weight than the predicted position derived from the second candidate image.
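The three aggregation options can be illustrated for a single position component. The sketch below is only one possible reading: the specification says that nearer candidates receive higher weights but does not give a weighting formula, so the inverse-distance weights, the `weighted_average` helper, and the sample values are assumptions.

```python
from statistics import mean, median


def weighted_average(values, distances, eps=1e-9):
    """Weighted mean in which a smaller candidate-to-estimate distance
    yields a larger weight. Inverse-distance weighting is an assumption."""
    weights = [1.0 / (dist + eps) for dist in distances]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Predicted values of one component (e.g., s) obtained from three candidate images.
predicted_s = [100.0, 102.0, 110.0]
# Distance from each candidate image's known position to the estimated position.
candidate_distances = [1.0, 1.0, 4.0]

mean_s = mean(predicted_s)
median_s = median(predicted_s)
weighted_s = weighted_average(predicted_s, candidate_distances)
```

In this sample the far candidate contributes an outlying prediction, and both the median and the distance-weighted average are pulled toward it less than the plain mean.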

[0056] In one embodiment, the predicted differences or displacements from the ML model 130 are first aggregated. In this case, the aggregated differences or displacements would be the differences from the midpoint between all the input candidate images. The absolute position of the midpoint can be calculated as the average of all the absolute positions of the candidate images. The aggregated differences or displacements are then added to the absolute position of the midpoint. However, the following embodiments are based on an approach in which aggregation occurs after each predicted difference has been added to the absolute position of its corresponding candidate image.
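For an unweighted mean, the two orderings give the same result, because the mean is linear: the mean of (position + displacement) equals the mean of the positions plus the mean of the displacements. The sketch below checks this on made-up values for a single component; the variable names are hypothetical, and the equivalence does not generally hold for a median or for unequal weights.

```python
from statistics import mean

# Known values of one component (e.g., s) for three candidate images,
# and the displacement predicted for each (illustrative numbers only).
candidate_s = [98.0, 100.0, 105.0]
displacement_s = [3.0, 1.5, -4.5]

# Approach in this paragraph: aggregate the displacements first, then add
# the result to the midpoint (mean) of the candidate positions.
midpoint = mean(candidate_s)
aggregate_first = midpoint + mean(displacement_s)

# Approach used in the following embodiments: add each displacement to its
# own candidate position first, then aggregate the predicted positions.
add_first = mean([c + dlt for c, dlt in zip(candidate_s, displacement_s)])
```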

[0057] In scenarios where the predicted location contains multiple values, the aggregation operations are performed one for each component of the location data. For example, one aggregation operation is performed on the predicted value s, another aggregation operation (of the same type, for example) is performed on the predicted value d, and yet another aggregation operation is performed on the predicted value ψ.

[0058] In one embodiment, one or more predicted locations are not considered. For example, a predicted location may be associated with a confidence score that the ML model 130 outputs along with the predicted location. If the confidence score is below a certain threshold, the predicted location is not input into the aggregation operation.
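A minimal sketch of this filtering step, assuming each prediction is represented as a (position, confidence) pair. The pair representation, the function name, and the threshold value are illustrative assumptions.

```python
def filter_by_confidence(predictions, threshold):
    """Keep only the predicted positions whose confidence score meets the threshold.

    predictions: list of (predicted_position, confidence_score) pairs.
    """
    return [position for position, confidence in predictions if confidence >= threshold]


# Three predictions for one component; the second has a low confidence score.
predictions = [(101.0, 0.9), (250.0, 0.2), (102.0, 0.75)]
kept = filter_by_confidence(predictions, threshold=0.5)
```

Only the positions in `kept` would then be passed to the aggregation operation.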

[0059] The final predicted (and aggregated) position can be used in one or more ways. For example, the localization system 100 displays the final predicted position on a computer screen indicating the position or location of a moving object in a trajectory or three-dimensional space. The computer screen may be a handheld device containing a control device for controlling the movement of the moving object. Alternatively, the computer screen may be part of a computer monitor viewed by multiple people. In another example, the localization system 100 may communicate the final predicted position to an electronic receiver of the moving object in order to cause the moving object to make one or more adjustments based on the final predicted position, such as changing speed and / or direction. In this example, the objectives of the computer software controlling the moving object may be to maintain a specific speed, stay as close as possible to a predefined path (e.g., the center of the trajectory), and avoid obstacles such as the edge of the trajectory and / or other moving objects. In competition, the final prediction can be used for detailed analysis. For example, if video footage is collected of participants in a competition, predictions could reveal their strategies, tactical choices, or specific driving styles, which could provide further assistance to certain drivers in winning future competitions. The predicted data can also be used for analyzing known vehicles / drivers to find new ways to optimize routes and / or vehicle speeds as much as possible.

[0060] [Equation not reproduced in the source text.]

Exemplary Process

[0061] Figure 3 is a flowchart illustrating an exemplary process 300 for predicting the position of a moving object in one embodiment. Process 300 may be performed by different components of the localization system 100.

[0062] In block 310, a digital image is received. The digital image is received after the neural network has been trained using one or more machine learning techniques. The entity performing process 300 may be the same as or different from the entity that trained the neural network. The digital image may originate from a digital camera associated with (e.g., mounted on) a moving object. The digital image may be received within milliseconds of the digital camera generating it.

[0063] In block 320, the set of candidate images is selected from the set of training images used to train the neural network. Block 320 may involve determining the expected position of a received digital image, and then selecting a set of candidate images based on a threshold distance between the expected position of the digital image and the absolute position of each candidate image in the set of candidate images. Each training image is associated with an absolute position that has been previously determined and associated with that training image.

[0064] If the number of initially selected candidate images exceeds a threshold number (e.g., 10), some of those candidate images, such as those associated with the absolute positions furthest from the expected position, may be removed. Conversely, if the number of initially selected candidate images is below the threshold number, or is zero, the threshold distance may be increased (e.g., by 20%) to identify one or more additional candidate images for selection.
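The selection logic of blocks 320 and the adjustments described above can be sketched as follows. This is an illustration under stated assumptions: positions are 2-D tuples, `select_candidates` and all of its parameters are hypothetical names, a separate `min_count` stands in for the lower bound on the number of candidates, and `max_rounds` is added only so that the loop is guaranteed to terminate.

```python
import math


def select_candidates(positions, expected, radius, max_count=10,
                      min_count=1, growth=1.2, max_rounds=20):
    """Return IDs of training images near the expected position, nearest first.

    positions: dict mapping image ID -> (x, y) absolute position.
    The radius is enlarged by `growth` (20% per round here) until at least
    `min_count` images fall inside it; at most `max_count` are returned.
    """
    # Distances do not depend on the radius, so compute and sort them once.
    scored = sorted(
        (math.dist(pos, expected), image_id)
        for image_id, pos in positions.items()
    )
    for _ in range(max_rounds):
        in_range = [(dist, image_id) for dist, image_id in scored if dist <= radius]
        if len(in_range) >= min_count:
            # Too many candidates: drop those furthest from the expected position.
            return [image_id for _, image_id in in_range[:max_count]]
        radius *= growth  # too few candidates: widen the search region
    return []


training_positions = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (10.0, 0.0)}

near_origin = select_candidates(training_positions, (0.0, 0.0), radius=6.0)
capped = select_candidates(training_positions, (0.0, 0.0), radius=6.0, max_count=1)
widened = select_candidates(training_positions, (20.0, 0.0), radius=9.0)
```

In the last call no image lies within the initial radius of 9, so the radius grows to 10.8 and image "c", at distance 10, is selected.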

[0065] The expected position of a received digital image can be determined in several ways, including based on the predicted positions of one or more digital images received prior to the currently received digital image. For example, if the distance between the last two predicted positions is N, the expected position of the received digital image may be the predicted position of the most recent digital image advanced by N along the same line or arc as the last two predicted positions.
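For the straight-line case, this extrapolation amounts to adding the last step to the most recent predicted position. The sketch below covers only that case (not motion along an arc); the function name and the 2-D tuple representation are assumptions.

```python
def expected_position(previous, latest):
    """Linear extrapolation: advance from `latest` by the same step that
    led from `previous` to `latest` (constant velocity, straight line)."""
    return tuple(lat + (lat - prev) for prev, lat in zip(previous, latest))


# The last two predicted positions, oldest first (illustrative values).
expected = expected_position((10.0, 5.0), (12.5, 5.5))
```

The same function works for a single path coordinate if positions are passed as one-element tuples.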

[0066] In block 330, a candidate image is selected from the set of candidate images. Block 330 may include randomly selecting a candidate image from the set of candidate images. Alternatively, the set of candidate images may first be ordered from closest to furthest from the expected position of the received digital image, and the closest candidate image not yet processed in block 340 may be selected.

[0067] In block 340, the selected candidate image and the received digital image are input to the trained neural network, which generates an output used to generate a predicted position of the moving object at the time corresponding to the received digital image. Block 340 may include the trained neural network outputting a predicted displacement, which is added to the known position of the selected candidate image to calculate the predicted position.

[0068] In block 350, the predicted position is added to a set of predicted positions. Initially, the set of predicted positions is empty. Each subsequent iteration of block 350 for a particular received digital image adds another predicted position to the then non-empty set of predicted positions.

[0069] In block 360, it is determined whether there are any further candidate images in the set of candidate images to be selected. If so, process 300 returns to block 330; otherwise, process 300 proceeds to block 370.

[0070] In block 370, the predicted positions in the set of predicted positions are aggregated to generate an aggregated position for the received digital image. Block 370 may involve averaging the predicted positions or calculating a weighted average of the predicted positions, in which predicted positions associated with absolute positions closer to the expected position are weighted more highly than predicted positions associated with absolute positions further from the expected position.

[0071] In block 380, the aggregated position is associated with the moving object. Block 380 may include storing the aggregated position along with time data indicating the current time or the time when the localization system 100 received the digital image (received in block 310) or when the digital image was generated. Block 380 may also include storing an object identifier that uniquely identifies the moving object.

[0072] After block 380, process 300 may return to block 310 in response to detecting another digital image to process.
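Blocks 330 through 370 of process 300 can be summarized as a short loop. The sketch below is not the specification's implementation: the trained neural network is replaced by a stub callable, images are represented by single numbers so that the example runs without any image data, and positions are reduced to one component.

```python
from statistics import mean


def localize(image, candidates, model):
    """Predict one position component for `image`.

    candidates: list of (candidate_image, known_position) pairs (block 320 output).
    model: callable (image, candidate_image) -> predicted displacement.
    """
    if not candidates:
        raise ValueError("at least one candidate image is required")
    predicted_positions = []                                # empty before the first pass
    for candidate_image, known_position in candidates:      # blocks 330 and 360
        displacement = model(image, candidate_image)        # block 340
        predicted_positions.append(known_position + displacement)  # block 350
    return mean(predicted_positions)                        # block 370


def stub_model(image, candidate_image):
    """Stand-in for the trained network. Each 'image' here is just the position
    at which it was taken, so the exact displacement is their difference."""
    return image - candidate_image


# Three candidates whose known positions are 98.0, 100.0 and 103.0.
candidates = [(98.0, 98.0), (100.0, 100.0), (103.0, 103.0)]
aggregated = localize(100.5, candidates, stub_model)
```

With a real network the per-candidate predictions would differ slightly, and the aggregation in the final step is what smooths out those differences.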

Advantages

[0073] Existing solutions to localization problems include visual SLAM and methods based on offline mapping and online localization. Such methods rely on finding features in input images or videos, combining these features to create a map, and matching newly observed features against the features recorded in the map to find the camera's current position on the map.

[0074] The embodiments are superior to such current approaches because, whereas those approaches use sparse feature data, the embodiments use dense image data to pinpoint the location of moving objects.

[0075] The embodiments are also superior to current approaches because they provide centimeter-level accuracy at each step. Experiments have shown that the embodiments provide an accuracy of 3–10 centimeters for fast vehicles traveling at speeds of around 300 kilometers per hour (83 meters per second), using cameras that capture the movement at frame rates between 30 and 60 frames per second (2.5–1.2 meters of movement per frame), and possibly at frame rates outside this range as well. This level of accuracy at high speeds has been confirmed by more than 100 experiments using, as an example or test case, 10 images captured from cameras mounted on vehicles that traveled at the aforementioned speeds on several closed ring roads with known geometry. Current approaches provide high-accuracy localization only intermittently, when there is a unique correlation between the features of the current image and the map, and lose localization within seconds at such high speeds.

[0076] The embodiments may be applied to different use cases or scenarios that require solutions for determining the position of moving objects in new environments using cameras attached to moving objects. Specific examples of such use cases include autonomous navigation of drones, robots, and self-driving cars, analysis of vehicle trajectories, augmented reality, and surveying and mapping in situations where GPS is unavailable.

Hardware Overview

[0077] According to one embodiment, the techniques described herein are implemented by one or more dedicated computing devices. These dedicated computing devices may be hardwired to implement the techniques, may include digital electronic devices such as application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs) that are persistently programmed to implement the techniques, or may include one or more general-purpose hardware processors programmed to implement the techniques according to program instructions in firmware, memory, other storage, or a combination of these. Such dedicated computing devices may also combine custom hardwired logic, ASICs, or FPGAs with custom programming to accomplish the techniques. The dedicated computing devices may be desktop computer systems, portable computer systems, mobile devices, network-connected devices, or any other devices incorporating hardwired and/or programmable logic to implement the techniques.

[0078] For example, Figure 4 is a block diagram illustrating a computer system 400 in which a specific embodiment of the present invention may be realized. The computer system 400 includes a bus 402 or other communication mechanism for transmitting information, and a hardware processor 404 for information processing connected to the bus 402. The hardware processor 404 may be, for example, a general-purpose microprocessor.

[0079] The computer system 400 also includes a main memory 406, such as random access memory (RAM) or another dynamic storage device, connected to the bus 402 for storing information and instructions to be executed by the processor 404. The main memory 406 may also be used to store temporary variables or other intermediate information while instructions are being executed by the processor 404. When such instructions are stored in non-transitory storage media accessible to the processor 404, the computer system 400 becomes a dedicated device customized to perform the operations specified in the instructions.

[0080] The computer system 400 also includes read-only memory (ROM) 408 or other static storage device connected to bus 402 for storing static information and instructions for processor 404. Storage devices 410, such as magnetic disks, optical disks, and solid-state devices, are provided for storing information and instructions and are connected to bus 402.

[0081] The computer system 400 may be connected via the bus 402 to a display 412, such as a cathode ray tube (CRT), for presenting information to a computer user. An input device 414, including alphanumeric and other keys, is connected to the bus 402 to communicate information and command selections to the processor 404. Another type of user input device is a cursor control 416, such as a mouse, a trackball, or cursor direction keys, for communicating direction information and command selections to the processor 404 and for controlling cursor movement on the display 412. This input device typically has two degrees of freedom in two axes, a first axis (e.g., x) and a second axis (e.g., y), which allows the device to specify positions in a plane.

[0082] The computer system 400 may implement the techniques described herein using customized hardwired logic, one or more ASICs or FPGAs, firmware, and / or program logic that, in combination with the computer system, causes or programs the computer system 400 to become a dedicated device. According to one embodiment, the techniques described herein are executed by the computer system 400 in response to the processor 404 executing one or more sequences of one or more instructions contained in the main memory 406. Such instructions may be read into the main memory 406 from other storage media, such as a storage device 410. The execution of the sequence of instructions contained in the main memory 406 causes the processor 404 to execute the steps of the process described herein. In alternative embodiments, hardwired circuit configurations may be used instead of or in combination with software instructions.

[0083] As used herein, the term “storage media” refers to any non-transitory media that store data and/or instructions that cause a device to operate in a particular manner. Such storage media may include non-volatile and/or volatile media. Non-volatile media include, for example, optical disks, magnetic disks, or solid-state devices, such as storage device 410. Volatile media include dynamic memory, such as main memory 406. Common forms of storage media include, for example, floppy disks, flexible disks, hard disks, solid-state devices, magnetic tapes, or any other magnetic data storage media, CD-ROMs (compact disc ROMs), any other optical data storage media, any physical media with a pattern of holes, RAM, PROMs (programmable ROMs), EPROMs (erasable PROMs), FLASH-EPROMs, NVRAMs (nonvolatile RAMs), and any other memory chips or cartridges.

[0084] Storage media are distinct from transmission media, but may be used in conjunction with them. Transmission media participate in transferring information between storage media. For example, transmission media include coaxial cables, copper wires, and optical fibers, including the wires that constitute the bus 402. Transmission media can also take the form of acoustic or light waves, such as those generated during radio-wave and infrared data communications.

[0085] Various forms of media may be involved in carrying one or more sequences of one or more instructions to the processor 404 for execution. For example, the instructions may initially be carried on a magnetic disk or solid-state device of a remote computer. The remote computer may load the instructions into its dynamic memory and transmit the instructions over a telephone line using a modem. A modem local to the computer system 400 may receive the data on the telephone line and may use an infrared transmitter to convert the data into an infrared signal. An infrared detector may receive the data carried in the infrared signal, and suitable circuitry may place the data on the bus 402. The bus 402 carries the data to the main memory 406, from which the processor 404 retrieves and executes the instructions. Instructions received by the main memory 406 may optionally be stored on the storage device 410 either before or after execution by the processor 404.

[0086] The computer system 400 also includes a communication interface 418 connected to the bus 402. The communication interface 418 provides a two-way data communication coupling to a network link 420 that is connected to a local network 422. For example, the communication interface 418 may be an integrated services digital network (ISDN) card, a cable modem, a satellite modem, or a modem that provides a data communication connection to a corresponding type of telephone line. As another example, the communication interface 418 may be a LAN card that provides a data communication connection to a compatible LAN (local area network). Wireless links may also be implemented. In any such implementation, the communication interface 418 sends and receives electrical, electromagnetic, or optical signals that carry digital data streams representing various types of information.

[0087] Network link 420 typically provides data communication to other data devices over one or more networks. For example, network link 420 may provide a connection through the local network 422 to a host computer 424 or to data equipment operated by an Internet Service Provider (ISP) 426. The ISP 426 in turn provides data communication services over the global packet data communication network now commonly referred to as the “Internet” 428. Both the local network 422 and the Internet 428 use electrical, electromagnetic, or optical signals that carry digital data streams. The signals through the various networks, and the signals on network link 420 and through the communication interface 418 that carry digital data to and from computer system 400, are exemplary forms of transmission media.

[0088] The computer system 400 can send messages and receive data, including program code, through the network(s), the network link 420, and the communication interface 418. In the Internet example, a server 430 may transmit requested code for an application program through the Internet 428, the ISP 426, the local network 422, and the communication interface 418.

[0089] The received code may be executed by the processor 404 as it is received, and/or stored in the storage device 410 or other non-volatile storage for later execution.

[0090] According to some embodiments, a system comprising one or more computing devices has means for performing operations including the operations described in any of the method claims. According to a further embodiment, a computer program product includes instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations including the operations described in any of the method claims. In some further embodiments, a computer-readable medium includes instructions that, when executed by one or more computing devices, cause the one or more computing devices to perform operations including the operations described in any of the method claims.

Software Overview

[0091] Figure 5 is a block diagram of a basic software system 500 that may be used to control the operation of computer system 400. The software system 500 and its components, including their connections, relationships, and functions, are meant to be exemplary only and not to limit implementations of the exemplary embodiment(s). Other software systems suitable for implementing the exemplary embodiment(s) may have different components, including components with different connections, relationships, and functions.

[0092] The software system 500 is provided to direct the operation of the computer system 400. The software system 500, which may be stored in system memory (RAM) 406 and on fixed storage (e.g., hard disk or flash memory), includes a kernel or operating system (OS) 510.

[0093] The OS 510 manages low-level aspects of computer operation, including process execution, memory allocation, file input and output (I/O), and device I/O. One or more application programs, represented as 502A, 502B, 502C ... 502N, can be “loaded” (e.g., transferred from fixed storage 410 to memory 406) for execution by the system 500. Applications or other software intended for use on computer system 400 can also be stored as a set of downloadable computer-executable instructions, for example, for download and installation from an Internet location (e.g., a web server, application store, or other online service).

[0094] The software system 500 includes a graphical user interface (GUI) 515 for receiving user commands and data in a graphical manner (e.g., "point-and-click" or "touch gesture"). These inputs can, in turn, be acted upon by the system 500 in accordance with instructions from the operating system 510 and/or the application(s) 502. The GUI 515 also displays the results of operations from the OS 510 and the application(s) 502, after which the user can provide additional input or terminate the session (e.g., log off).

[0095] The OS 510 can run directly on the bare hardware 520 (e.g., processor(s) 404) of the computer system 400. Alternatively, a hypervisor or virtual machine monitor (VMM) 530 may be placed between the bare hardware 520 and the OS 510. In this configuration, the VMM 530 acts as a software "cushion" or virtualization layer between the OS 510 and the bare hardware 520 of the computer system 400.

[0096] The VMM 530 instantiates and runs one or more virtual machine instances (guest machines). Each guest machine contains a "guest" operating system, such as the OS 510, and one or more applications, such as the application(s) 502, designed to run on the guest operating system. The VMM 530 provides the guest operating system with a virtual operating platform and manages the execution of the guest operating system.

[0097] In some instances, the VMM 530 can enable the guest operating system to run as if it were running directly on the bare hardware 520 of computer system 400. In these instances, the same version of the guest operating system that is configured to run directly on the bare hardware 520 can also run on the VMM 530 without modification or reconfiguration. In other words, the VMM 530 may provide full hardware and CPU virtualization to the guest operating system in some instances.

[0098] In other instances, the guest operating system may be specifically designed or configured to run on the VMM 530 for efficiency. In those instances, the guest operating system is "aware" that it is running on a virtual machine monitor. In other words, the VMM 530 may provide paravirtualization to the guest operating system in some instances.

[0099] A computer system process comprises an allotment of hardware processor time and an allotment of (physical and/or virtual) memory. The memory allotment is used for storing instructions executed by the hardware processor, for storing data generated by the hardware processor executing the instructions, and/or for storing the state of the hardware processor (e.g., the contents of registers) between allotments of hardware processor time when the computer system process is not running. Computer system processes run under the control of the operating system and may also run under the control of other programs executing on the computer system.

[0100] The basic computer hardware and software described above are provided for the purpose of illustrating the underlying fundamental computer components that may be used to implement exemplary embodiments. However, exemplary embodiments are not necessarily limited to any particular computing environment or computing device configuration. Instead, exemplary embodiments can be implemented in various forms of system configurations or processing environments, which a person skilled in the art will understand in light of this disclosure as being capable of supporting the features and functions of the exemplary embodiments presented herein.

Cloud Computing

[0101] The term "cloud computing" is used herein to generally describe a computing model that enables on-demand access to a shared pool of computing resources, such as computer networks, servers, software applications, and services, and that allows for rapid provisioning and release of resources with minimal management effort or service provider interaction.

[0102] A cloud computing environment (sometimes referred to as a cloud environment or simply a cloud) can be implemented in a variety of ways to best suit different requirements. For example, in a public cloud environment, the underlying computing infrastructure is owned by an organization that makes its cloud services available to other organizations or to the general public. In contrast, a private cloud environment is generally intended solely for use by or within a single organization. A community cloud is intended to be shared by several organizations within a community, while a hybrid cloud comprises two or more types of cloud (e.g., private, community, or public) bound together by data and application portability.

[0103] Generally, a cloud computing model allows some of the responsibilities that may previously have been handled by an organization's own information technology department to be provided instead as service layers within the cloud environment, for use by consumers (whether internal or external to the organization, depending on the public or private nature of the cloud). Depending on the specific implementation, the precise definition of the components or features provided by or within each cloud service layer may vary, but common examples include the following. In SaaS (Software as a Service), consumers use software applications running on the cloud infrastructure, while the SaaS provider manages or controls the underlying cloud infrastructure and applications. In PaaS (Platform as a Service), consumers can use software programming languages and development tools supported by the PaaS provider to develop, deploy, and otherwise control their own applications, while the PaaS provider manages or controls other aspects of the cloud environment (i.e., everything below the runtime environment). In IaaS (Infrastructure as a Service), consumers can deploy and run arbitrary software applications and/or provision processing, storage, networking, and other basic computing resources, while the IaaS provider manages or controls the underlying physical cloud infrastructure (i.e., everything below the operating system layer). In DBaaS (Database as a Service), consumers use a database server or database management system running on the cloud infrastructure, while the DBaaS provider manages or controls the underlying cloud infrastructure, applications, and servers, including one or more database servers.

[0104] In the foregoing specification, embodiments of the invention have been described with reference to numerous specific details that may vary from implementation to implementation. The specification and drawings are, accordingly, to be regarded as illustrative rather than restrictive. The sole and exclusive indicator of the scope of the invention is the scope of the set of claims that issue from this application, in the specific form in which those claims issue, including any subsequent amendment.

Claims

1. A method comprising: identifying a specific image associated with a moving object; selecting a set of candidate images from a plurality of images used to train a neural network; for each candidate image in the set of candidate images: generating an output from the neural network based on inputting the specific image and the candidate image into the neural network, determining a predicted position of the specific image based on the output and a position associated with the candidate image, and adding the predicted position to a set of predicted positions; aggregating the set of predicted positions to generate an aggregated position for the specific image; and associating the aggregated position with the moving object; wherein the method is performed by one or more computing devices.

2. The method according to claim 1, further comprising: for each image of the plurality of images: determining a position of the image, identifying a set of images whose positions are within a threshold distance of the position of the image, and storing pair data that associates the image with the set of images; and training the neural network based on the pair data associated with each image of the plurality of images.

3. The method according to claim 1 or 2, wherein selecting the set of candidate images comprises: determining an estimated position of the specific image; and selecting the set of candidate images based on a threshold distance between the estimated position of the specific image and the position of each candidate image in the set of candidate images.

4. The method according to claim 3, wherein each image of the plurality of images is associated with position data indicating a position of an object, the method further comprising storing an index that indexes the plurality of images based on the position data associated with each image of the plurality of images, wherein selecting the set of candidate images includes using the estimated position to identify the set of candidate images in the index.

5. The method according to claim 3, wherein determining the estimated position of the specific image includes determining the estimated position based on predicted positions of one or more images that precede the specific image in time.

6. The method according to claim 3, wherein selecting the set of candidate images based on the threshold distance comprises: selecting a plurality of candidate images based on the threshold distance; determining the number of candidate images in the plurality of candidate images; determining that the number of candidate images is greater than a particular threshold number; and selecting, as the set of candidate images, the candidate images whose respective positions are closest to the estimated position.

7. The method according to claim 3, wherein selecting the set of candidate images based on the threshold distance comprises: determining that no candidate images are within the threshold distance; in response to determining that no candidate images are within the threshold distance, increasing the threshold distance to a larger threshold distance; and including, in the set of candidate images, one or more candidate images that are within the larger threshold distance.

8. The method according to any one of claims 1 to 7, wherein the aggregated position includes a value s and a value d, the value s is a first distance along a predetermined path from a starting position on the predetermined path, and the value d is a second distance from the point on the predetermined path indicated by the value s.

9. The method according to any one of claims 1 to 8, wherein the output includes (1) the displacement of the position of each candidate image and (2) the displacement of the angle of each candidate image.

10. The method according to any one of claims 1 to 9, wherein aggregating the set of predicted positions includes calculating the average of the set of predicted positions.

11. The method according to any one of claims 1 to 9, further comprising, for each candidate image in the set of candidate images, determining an estimated distance between the position of the candidate image and the position of the specific image, wherein aggregating the set of predicted positions includes calculating a weighted average of the set of predicted positions based on the estimated distance associated with each candidate image in the set of candidate images.

12. One or more storage media storing instructions that, when executed by one or more computing devices, cause performance of the method according to any one of claims 1 to 11.

13. A system comprising: one or more computing devices; and one or more storage media storing instructions that, when executed by the one or more computing devices, cause performance of the method according to any one of claims 1 to 11.
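
For illustration only, the flow recited in claims 1, 3, 6, 7, 10, and 11 might be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation described in the specification. The names `select_candidates`, `localize`, `model`, and `growth` are hypothetical. Positions are assumed to be two-dimensional (s, d) pairs compared by Euclidean distance, the network is assumed to return an additive displacement from the candidate image to the specific image, the threshold is assumed to grow by a fixed factor when no candidate is in range, and the weighted average is assumed to use inverse-distance weights. The final step of claim 1, associating the aggregated position with the moving object, is left to the caller.

```python
import math
from typing import Callable, List, Tuple

Position = Tuple[float, float]        # assumed (s, d) pair, as in claim 8
Reference = Tuple[object, Position]   # (training image, its known position)
# Hypothetical stand-in for the trained network: given the specific image and
# one candidate image, it returns the displacement (ds, dd) from the candidate.
Model = Callable[[object, object], Position]


def select_candidates(estimate: Position, references: List[Reference],
                      threshold: float, max_count: int,
                      growth: float = 2.0) -> List[Reference]:
    """Pick reference images near the estimated position (claims 3, 6, 7)."""
    if threshold <= 0 or growth <= 1 or max_count < 1:
        raise ValueError("need threshold > 0, growth > 1 and max_count >= 1")
    if not references:
        return []
    # Rank every reference image by its distance to the estimated position.
    ranked = sorted(((math.dist(pos, estimate), img, pos)
                     for img, pos in references), key=lambda item: item[0])
    radius = threshold
    # Claim 7: nothing within the threshold, so enlarge it (assumed: by a
    # fixed factor) until at least the nearest reference is in range.
    while ranked[0][0] > radius:
        radius *= growth
    near = [(img, pos) for dist, img, pos in ranked if dist <= radius]
    # Claim 6: more candidates than allowed, so keep only the closest ones.
    return near[:max_count]


def localize(image: object, estimate: Position, references: List[Reference],
             model: Model, threshold: float = 1.0, max_count: int = 5,
             weighted: bool = True) -> Position:
    """Aggregate one predicted position per candidate (claims 1, 10, 11)."""
    candidates = select_candidates(estimate, references, threshold, max_count)
    if not candidates:
        raise ValueError("no reference images available")
    predictions: List[Position] = []
    weights: List[float] = []
    for cand_image, (cand_s, cand_d) in candidates:
        ds, dd = model(image, cand_image)               # network output
        predictions.append((cand_s + ds, cand_d + dd))  # predicted position
        # Assumed weighting: inverse of the estimated distance between the
        # candidate and the specific image; uniform weights give claim 10.
        weights.append(1.0 / (math.hypot(ds, dd) + 1e-6) if weighted else 1.0)
    total = sum(weights)
    s = sum(w * p[0] for w, p in zip(weights, predictions)) / total
    d = sum(w * p[1] for w, p in zip(weights, predictions)) / total
    return (s, d)
```

In this sketch, calling `localize` with `weighted=False` yields the plain average of claim 10, while the default yields a weighted average in the sense of claim 11. A linear scan over all reference images is used for brevity; the position index of claim 4 would replace that scan in a larger deployment.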