A robot and a matching method
By using object descriptors for object matching in robots, the problems of low accuracy and high computational cost in existing technologies are solved, enabling efficient and accurate object recognition and localization in embedded devices.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: BEIJING GEEKPLUS TECH CO LTD
- Filing Date: 2021-05-24
- Publication Date: 2026-06-26
AI Technical Summary
In existing technologies, object matching accuracy is low and computational load is high, making it difficult to deploy in embedded devices such as robots.
Object matching is performed using object descriptors. Images are acquired through visual sensors, and object descriptors are generated using feature extraction and graph neural networks. Object matching is then performed by combining sparsification and feature aggregation processing to reduce computational load and improve accuracy.
It improves the accuracy of object matching, reduces computational requirements, and makes deployment in embedded devices easier.
Figure CN115393614B_ABST
Description
TECHNICAL FIELD
[0001] The present disclosure relates to the technical field of machine vision, and in particular, to a robot and a matching method.
BACKGROUND
[0002] For semantic simultaneous localization and mapping (SLAM) and visual place recognition (VPR), object description and object matching play a crucial role. When matching objects, there is a problem of low matching accuracy.
SUMMARY
[0003] Embodiments of the present disclosure at least provide a robot and a matching method.
[0004] In a first aspect, embodiments of the present disclosure provide a robot, comprising: a vision sensor, and a control component; wherein the vision sensor is configured to acquire a first image during driving of the robot; the control component is configured to: perform feature extraction on the first image to obtain a feature map of the first image; obtain a first object descriptor of at least one first object in the first image based on the feature map; and perform matching on the first object and at least one second object in a second image based on the first object descriptor and a second object descriptor of the second object, to obtain a matching result of the first object and the second object.
[0005] In a possible implementation, the feature map comprises a first feature map and a second feature map; when obtaining the first object descriptor of the at least one first object in the first image based on the feature map, the control component is configured to: perform feature point detection processing on the first feature map to obtain feature point position information and a feature point descriptor corresponding to a feature point in the first feature map; and perform object detection processing on the second feature map to obtain object position information of the first object in the second feature map; and obtain the first object descriptor based on the feature point position information, the object position information, and the feature point descriptor.
[0006] In a possible implementation, the feature point comprises an endpoint and / or a fixed point of a contour corresponding to the first object.
[0007] In a possible implementation, the control component, when obtaining the first object descriptor of the first object based on the feature point position, the object position, and the feature point descriptor, is configured to: determine a target feature point of the first object based on the object position information and the feature point position information; and obtain the first object descriptor of the first object based on the feature point position information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point.
[0008] In a possible implementation, the control component, when obtaining the first object descriptor of the first object based on the feature point position information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point, is configured to: perform attention processing on the feature point position information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point using a pre-trained graph neural network to obtain feature data characterizing appearance features and / or structural features of the first object; and obtain the first object descriptor of the first object by performing feature aggregation processing on the feature data.
[0009] In a possible implementation, before the control component obtains the first object descriptor of the first object by performing feature aggregation processing on the feature data, the control component is further configured to: perform sparse processing on the feature data; and the obtaining, by the control component, the first object descriptor of the first object by performing feature aggregation processing on the feature data includes: obtaining the first object descriptor of the first object by performing feature aggregation processing on the feature data after the sparse processing.
[0010] In a possible implementation, when the control component matches the first object and the second object based on the first object descriptor and a second object descriptor of at least one second object in the second image, the control component is configured to: determine similarity information of the first object and the second object based on the first object descriptor and the second object descriptor; compare the similarity information with a preset similarity threshold; and determine that the first object and the second object are the same object in a case where the similarity information is greater than the similarity threshold.
[0011] In a possible implementation, the second image includes a historical image with a time stamp earlier than that of the first image.
[0012] In a possible implementation, the control component is further configured to: determine a position of the robot in the target scene when the first image is acquired based on the matching result.
[0013] In a second aspect, embodiments of this disclosure also provide a matching method, comprising: extracting features from a first image to obtain a feature map of the first image; obtaining a first object descriptor for at least one first object in the first image based on the feature map; and matching the first object and the second object based on the first object descriptor and a second object descriptor for at least one second object in a second image to obtain a matching result of the first object and the second object.
[0014] In one possible implementation, the feature map includes a first feature map and a second feature map; obtaining a first object descriptor for at least one first object in the first image based on the feature map includes: performing feature point detection processing on the first feature map to obtain feature point location information and feature point descriptors corresponding to the feature points in the first feature map; performing object detection processing on the second feature map to obtain object location information of the first object in the second feature map; and obtaining a first object descriptor for the first object based on the feature point location information, the object location information, and the feature point descriptor.
[0015] In one possible implementation, the feature points include: endpoints and / or fixed points of the contour corresponding to the first object.
[0016] In one possible implementation, obtaining a first object descriptor for the first object based on the feature point location, the object location, and the feature point descriptor includes: determining a target feature point of the first object based on the object location information and the feature point location information; and obtaining a first object descriptor for the first object based on the feature point location information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point.
[0017] In one possible implementation, obtaining a first object descriptor for the first object based on the feature point location information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point includes: performing attention processing on the feature point location information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point using a pre-trained graph neural network to obtain feature data characterizing the appearance features and / or structural features of the first object; and performing feature aggregation processing on the feature data to obtain the first object descriptor for the first object.
[0018] In one possible implementation, before performing feature aggregation processing on the feature data to obtain the first object descriptor of the first object, the method further includes: performing sparsification processing on the feature data; the step of performing feature aggregation processing on the feature data to obtain the first object descriptor of the first object includes: performing feature aggregation processing on the sparsified feature data to obtain the first object descriptor of the first object.
[0019] In one possible implementation, matching the first object and the second object based on the first object descriptor and the second object descriptor of at least one second object in the second image includes: determining similarity information between the first object and the second object based on the first object descriptor and the second object descriptor; comparing the similarity information with a preset similarity threshold; and determining that the first object and the second object are the same object if the similarity information is greater than the similarity threshold.
[0020] In one possible implementation, the second image includes a historical image with a timestamp earlier than the first image.
[0021] In one possible implementation, the method further includes: determining the position of the robot in the target scene when acquiring the first image based on the matching result.
[0022] This embodiment of the disclosure utilizes a feature map obtained by feature extraction from a first image to determine a first object descriptor for at least one first object in the first image, and determines a second object descriptor for at least one second object in the second image, so as to match the first object and the second object to obtain a matching result for the first object and the second object. Since the object descriptor describes the object as a whole, compared with feature points which can only describe the local features of the object, using object descriptors for matching has higher matching accuracy.
[0023] In addition, since the amount of data is smaller when using the first object descriptor to represent the first object compared to using feature points, the amount of computation required for object matching of the entire map is also less, and the requirement for computing power is lower, making it easier to deploy in embedded devices.
[0024] To make the above-mentioned objects, features and advantages of this disclosure more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
[0025] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings used in the embodiments will be briefly described below. These drawings are incorporated in and constitute a part of this specification. They illustrate embodiments conforming to this disclosure and, together with the specification, serve to explain the technical solutions of this disclosure. It should be understood that the following drawings only show some embodiments of this disclosure and should not be considered as limiting the scope. Those skilled in the art can obtain other related drawings based on these drawings without creative effort.
[0026] Figure 1 shows a schematic diagram of the structure of a robot provided in an embodiment of this disclosure;
[0027] Figure 2 shows a flowchart of a matching method provided by an embodiment of this disclosure.
DETAILED DESCRIPTION
[0028] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this disclosure, and not all of them. The components of the embodiments of this disclosure described and shown herein can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of this disclosure is not intended to limit the scope of the claimed disclosure, but merely represents selected embodiments of this disclosure. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without inventive effort are within the scope of protection of this disclosure.
[0029] Research has found that when using feature point matching to determine whether two objects are the same, the method typically relies on the number or proportion of matched feature points on the two objects. However, since each object has a different size in an image, the number of its corresponding feature points varies considerably. Furthermore, when the viewing angle changes, different parts of the same object are projected onto the image, meaning that the feature points determined at different viewing angles will also differ for the same object. Therefore, using the number or proportion of matched feature points to determine whether two objects are the same has limitations; it is difficult to apply to matching objects of different sizes or with changing viewing angles, leading to a decrease in matching accuracy.
[0030] In addition, when performing object matching on the entire map, the number of object feature points grows rapidly because the map contains a large number of objects. This results in a huge amount of computation during matching, and the computing power of embedded devices is usually insufficient to meet it, making it difficult to deploy existing object matching algorithms on embedded devices such as robots.
[0031] Based on the above research, this disclosure provides a matching method that determines the first object descriptor of each first object in the first image and uses the first object descriptor and the second object descriptor for matching to achieve the matching between the first object and the second object. The object descriptor describes the entire object, and compared with feature points which can only describe the local features of the object, using the object descriptor for matching has higher matching accuracy.
[0032] In addition, since the amount of data is smaller when using the first object descriptor to represent the first object compared to using feature points, the amount of computation required for object matching of the entire map is also less, and the requirement for computing power is lower, making it easier to deploy in embedded devices.
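As a rough illustration of the data-volume argument above, the following sketch counts descriptor comparisons under both schemes. The object count, the number of feature points per object, and the brute-force pairing strategy are illustrative assumptions and are not figures from this disclosure.

```python
def point_level_comparisons(objects_a: int, objects_b: int, points_per_object: int) -> int:
    """Brute-force comparisons when every feature point in one image is
    compared against every feature point in the other image."""
    return (objects_a * points_per_object) * (objects_b * points_per_object)


def object_level_comparisons(objects_a: int, objects_b: int) -> int:
    """Comparisons when each object is represented by one object descriptor."""
    return objects_a * objects_b


# Hypothetical scene: 20 objects per image, 50 feature points per object.
point_cost = point_level_comparisons(20, 20, 50)
object_cost = object_level_comparisons(20, 20)
print(point_cost, object_cost)  # 1000000 400
```

Under these assumed numbers, object-level matching needs 2500 times fewer comparisons; the actual ratio depends on the scene and on the matching strategy used.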
[0033] The shortcomings of the above solutions were identified by the inventor through practice and careful research. Therefore, both the discovery of the above problems and the solutions proposed below in this disclosure should be regarded as the inventor's contribution to this disclosure.
[0034] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures.
[0035] To facilitate understanding of this embodiment, a robot disclosed in this disclosure will first be described in detail, followed by a detailed description of a matching method disclosed in this disclosure. The execution entity of the matching method provided in this disclosure may include a robot or a server controlling the robot. In some possible implementations, the matching method can be implemented by a processor calling computer-readable instructions stored in memory.
[0036] This matching method enables the robot to match objects in different images acquired as it moves within a target scene. Furthermore, the robot can be positioned based on the matching results, or controlled to precisely move to a desired location within the target scene.
[0037] The robot and matching method provided in the embodiments of this disclosure are described below.
[0038] Referring to Figure 1, which is a structural schematic of a robot provided in an embodiment of this disclosure, the robot includes: a vision sensor 10 and a control component 20.
[0039] The vision sensor 10 is configured to acquire a first image during the robot's movement.
[0040] The control component 20 is configured to: extract features from a first image to obtain a feature map of the first image; obtain a first object descriptor for at least one first object in the first image based on the feature map; and match the first object and the second object based on the first object descriptor and a second object descriptor for at least one second object in a second image to obtain a matching result for the first object and the second object.
[0041] In one possible implementation, the feature map includes a first feature map and a second feature map; the control component 20, when obtaining a first object descriptor of at least one first object in the first image based on the feature map, is configured to: perform feature point detection processing on the first feature map to obtain feature point location information and feature point descriptors corresponding to the feature points in the first feature map; and perform object detection processing on the second feature map to obtain object location information of the first object in the second feature map; and obtain a first object descriptor of the first object based on the feature point location information, the object location information, and the feature point descriptor.
[0042] In one possible implementation, the feature points include: endpoints and / or fixed points of the contour corresponding to the first object.
[0043] In one possible implementation, when the control component 20 obtains the first object descriptor of the first object based on the feature point location, the object location, and the feature point descriptor, it is configured to: determine the target feature point of the first object based on the object location information and the feature point location information; and obtain the first object descriptor of the first object based on the feature point location information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point.
[0044] In one possible implementation, when the control component 20 obtains the first object descriptor of the first object based on the feature point location information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point, it is configured to: perform attention processing on the feature point location information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point using a pre-trained graph neural network to obtain feature data characterizing the appearance features and / or structural features of the first object; and perform feature aggregation processing on the feature data to obtain the first object descriptor of the first object.
[0045] In one possible implementation, the control component 20 is further configured to: perform sparsification processing on the feature data before performing feature aggregation processing on the feature data to obtain the first object descriptor of the first object; the feature aggregation processing on the feature data to obtain the first object descriptor of the first object includes: performing feature aggregation processing on the sparsified feature data to obtain the first object descriptor of the first object.
[0046] In one possible implementation, when the control component 20 matches the first object and the second object based on the first object descriptor and the second object descriptor of at least one second object in the second image, it is configured to: determine similarity information between the first object and the second object based on the first object descriptor and the second object descriptor; compare the similarity information with a preset similarity threshold; and determine that the first object and the second object are the same object if the similarity information is greater than the similarity threshold.
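The threshold comparison described above can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the use of cosine similarity as the "similarity information", the best-candidate search, the function names, and the threshold value of 0.8 are all assumptions made for illustration.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two descriptors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_objects(
    first_descriptors: Sequence[Sequence[float]],
    second_descriptors: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical preset similarity threshold
) -> List[Tuple[int, Optional[int], float]]:
    """For each first object, find the most similar second object and accept
    the pair only if the similarity is greater than the threshold."""
    results = []
    for i, d1 in enumerate(first_descriptors):
        best_j, best_sim = None, -1.0
        for j, d2 in enumerate(second_descriptors):
            sim = cosine_similarity(d1, d2)
            if sim > best_sim:
                best_j, best_sim = j, sim
        if best_j is not None and best_sim > threshold:
            results.append((i, best_j, best_sim))  # judged to be the same object
        else:
            results.append((i, None, best_sim))  # no match above the threshold
    return results


first = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
second = [[0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
matches = match_objects(first, second)
```

Here the first object in `first` is paired with the first object in `second`, while the second object in `first` has no candidate above the threshold and is left unmatched.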
[0047] In one possible implementation, the second image includes a historical image with a timestamp earlier than the first image.
[0048] In one possible implementation, the control component 20 is further configured to: determine the position of the robot in the target scene when acquiring the first image, based on the matching result.
[0049] Based on the same inventive concept, this disclosure also provides a matching method corresponding to the robot.
[0050] Referring to Figure 2, which is a flowchart of a matching method provided in an embodiment of this disclosure, the method includes:
[0051] S201: Perform feature extraction on the first image to obtain the feature map of the first image;
[0052] S202: Based on the feature map, obtain a first object descriptor for at least one first object in the first image;
[0053] S203: Based on the first object descriptor and the second object descriptor of at least one second object in the second image, match the first object and the second object to obtain the matching result of the first object and the second object.
[0054] The above S201 to S203 are explained in detail below.
[0055] This embodiment of the disclosure utilizes a feature map obtained by feature extraction from a first image to determine a first object descriptor for at least one first object in the first image, and determines a second object descriptor for at least one second object in the second image, so as to match the first object and the second object to obtain a matching result for the first object and the second object. Since the object descriptor describes the object as a whole, compared with feature points which can only describe the local features of the object, using object descriptors for matching has higher matching accuracy.
[0056] Regarding S201 above, the way the first image is acquired differs between scenarios.
[0057] For example, in a smart warehousing scenario, a vision sensor can be mounted on a freight robot. The robot moves within its drivable space, and the onboard vision sensor captures images to obtain a first image.
[0058] In addition, this matching method can also be applied to autonomous driving scenarios. For example, a visual sensor can be installed on an autonomous vehicle to collect images of the area that the autonomous vehicle can drive in during the driving process, so as to obtain a first image.
[0059] The following explanation uses the example of a robot acquiring the first image while moving through a warehouse space.
[0060] When extracting features from the first image, a shared feature extraction module (SharedEncoder) can be used to obtain feature maps of the first image. Specifically, for example, convolutional kernels of different sizes can be used to obtain first and second feature maps of different sizes. Alternatively, a VGG (Visual Geometry Group) convolutional neural network can be used directly to determine first and second feature maps of different sizes.
[0061] The dimensions of the first image may include, for example, height H, width W, and the number of channels n. For instance, the dimensions of the first image may be represented as (H, W, n). After feature extraction from the first image, feature maps of different sizes can be obtained. The sizes of these different feature maps may include, for example, (H / 2, W / 2, 64), (H / 4, W / 4, 128), (H / 8, W / 8, 256), (H / 16, W / 16, 512), and (H / 32, W / 32, 512).
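The sizes listed above can be reproduced with a short calculation. In this sketch the (stride, channels) pairs are taken from the text; the use of integer division and the 480 x 640 input size are assumptions for illustration.

```python
from typing import List, Tuple

# (downsampling stride, number of channels) for each feature map listed in the text.
PYRAMID_LEVELS = [(2, 64), (4, 128), (8, 256), (16, 512), (32, 512)]


def feature_map_shapes(height: int, width: int) -> List[Tuple[int, int, int]]:
    """Return (H / s, W / s, channels) for every level, using integer division."""
    return [(height // s, width // s, c) for s, c in PYRAMID_LEVELS]


# Hypothetical input image of height 480 and width 640.
shapes = feature_map_shapes(480, 640)
for shape in shapes:
    print(shape)
# (240, 320, 64)
# (120, 160, 128)
# (60, 80, 256)
# (30, 40, 512)
# (15, 20, 512)
```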
[0062] Feature maps of different sizes carry different information. Feature maps with a larger number of channels, such as a feature map of size (H / 32, W / 32, 512), contain higher-level semantic information and are therefore more suitable for feature point detection processing. Feature maps with a smaller number of channels, such as a feature map of size (H / 8, W / 8, 256), more accurately reflect the shape (e.g., the appearance and structural features) and position of the first object in the first image, and are therefore more suitable for object detection processing.
[0063] In addition, if the size information of the first image indicates that its height and width are small, feature extraction may yield a single feature map that is well suited to both feature point detection processing and object detection processing. In this case, the feature map of that size can be used as both the first feature map and the second feature map.
[0064] Regarding S202 above, given a determined feature map, a first object descriptor for at least one object in the first image can be obtained based on the determined feature map.
[0065] The objects included in the first image may include, but are not limited to, at least one of the following: shelves, boxes, other robots, ground markings, and location point markings.
[0066] For example, the objects included in the first image include a shelf and two boxes.
[0067] Specifically, when determining the first object descriptor of at least one object in the first image, the following methods can be used, for example: performing feature point detection processing on the first feature map to obtain feature point location information and feature point descriptors corresponding to the feature points in the first feature map; performing object detection processing on the second feature map to obtain object location information of the first object in the second feature map; and obtaining the first object descriptor of the first object based on the feature point location information, the object location information, and the feature point descriptor.
[0068] The following sections will explain the processes of feature point detection on the first feature map and object detection on the second feature map.
[0069] When performing feature point detection processing on the first feature map, a sparse feature point and descriptor extraction module (Point Detector) can be used, for example. Specifically, the sparse feature point and descriptor extraction module can use the deep learning network SuperPoint to perform feature point detection processing on the first feature map to determine the feature points in the first feature map.
[0070] Here, the SuperPoint network concentrates the computation on determining the feature point location information and the corresponding feature point descriptors in the first feature map. While determining feature points with high accuracy, it can also obtain a large number of feature points that are well dispersed, which makes it easier to describe the first object accurately.
[0071] The feature points in the first feature map may include, for example, the endpoints and / or fixed points of the contour corresponding to the first object. For instance, if the first object includes a shelf, the feature points in the determined first feature map may include, for example, the endpoints on the contour of the shelf mapped in the first feature map that correspond to the vertices of the shelf.
[0072] Furthermore, feature point detection processing yields both the location information of the feature points in the first feature map and the corresponding feature point descriptors. The feature point descriptors can, for example, correspond one-to-one with the determined feature points, remain stable under changes in lighting, viewing angle, and the like, and be highly distinctive. Thus, using feature point descriptors can improve matching accuracy.
[0073] Here, since the first feature map is obtained by feature extraction from the first image, by determining the feature point location information and feature point descriptor corresponding to the feature points in the first feature map, the location information and feature point descriptor of the actual corresponding feature points of the first object in the first image can be determined according to the association between the first feature map and the first image.
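One simple form of the association between a feature map and the first image is a fixed downsampling stride. The sketch below maps a feature-map cell to the image pixel at the cell's center; the stride-based relationship, the cell-center convention, and the function name are assumptions, since the disclosure does not specify how the association is computed.

```python
from typing import Tuple


def feature_to_image_coords(row: int, col: int, stride: int) -> Tuple[float, float]:
    """Map a feature-map cell (row, col) to image coordinates (y, x).

    Assumes each cell covers a stride x stride block of pixels and returns the
    center of that block in zero-based pixel coordinates.
    """
    y = (row + 0.5) * stride - 0.5
    x = (col + 0.5) * stride - 0.5
    return (y, x)


# Cell (2, 3) of a feature map with stride 8 covers image rows 16..23 and columns 24..31.
print(feature_to_image_coords(2, 3, 8))  # (19.5, 27.5)
```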
[0074] When performing object detection processing on the second feature map, an instance segmentation module can be used, for example. Specifically, this instance segmentation module can use Mask R-CNN (Mask Region-based Convolutional Neural Network) to perform object detection processing on the second feature map, thereby obtaining the object location information of the first object in the second feature map.
[0075] Specifically, when the instance segmentation module performs object detection processing on the second feature map, it can determine the object location information of the first object in the second feature map by determining the instance segmentation mask of the first object, for example.
[0076] Here, since the second feature map is also obtained by extracting features from the first image, by determining the object position information of the first object in the second feature map, the actual object position information of the first object in the first image can be determined according to the correlation between the second feature map and the first image.
[0077] Based on the feature point location information, object location information, and feature point descriptors, the first object descriptor of the first object can be obtained.
[0078] In a specific implementation, when determining the first object descriptor of the first object, the following method can be used, for example: based on the object location information and the feature point location information, determine the target feature point of the first object; based on the feature point location information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point, obtain the first object descriptor of the first object.
[0079] In one possible implementation, an instance segmentation module can determine an instance segmentation mask representing the object location information of the first object in the second feature map. Using this instance segmentation mask, the specific location of the first object in the first image can be determined. Here, the specific location of the first object includes the region it occupies in the first image. Then, using the feature point location information, feature points falling within the region occupied by the first object in the first image can be determined based on the specific location of the first object, and these points are used as target feature points corresponding to the first object.
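The selection of target feature points can be sketched as a mask lookup. This is a minimal sketch assuming the instance segmentation mask has already been mapped to image resolution and is stored as a binary grid, and that feature points are given as integer (row, column) pixel positions; the function and variable names are hypothetical.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]   # 1 inside the object's region, 0 elsewhere
Point = Tuple[int, int]          # (row, col) in image coordinates


def select_target_points(mask: Mask, points: Sequence[Point]) -> List[int]:
    """Return the indices of the feature points that fall inside the mask."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    selected = []
    for idx, (row, col) in enumerate(points):
        inside_image = 0 <= row < height and 0 <= col < width
        if inside_image and mask[row][col]:
            selected.append(idx)
    return selected


# Hypothetical 4 x 4 mask whose object occupies the central 2 x 2 region.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
points = [(0, 0), (1, 1), (2, 2), (3, 1), (5, 5)]
print(select_target_points(mask, points))  # [1, 2]
```

Returning indices rather than coordinates lets the caller pick out the matching feature point descriptors with the same indices.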
[0080] After determining the target feature points, attention processing is performed on the feature point location information corresponding to the target feature points and the feature point descriptors corresponding to the target feature points using a pre-trained graph neural network to obtain feature data characterizing the appearance features and / or structural features of the first object; then, feature aggregation processing is performed on the feature data to obtain the first object descriptor of the first object.
[0081] Here, after determining the feature point location information and the feature point descriptor corresponding to the target feature point, the information that can be expressed by these two features remains at the level of describing the feature point itself; that is, the information they can express is still insufficient to fully and accurately represent the first object. Therefore, a graph neural network is used to sequentially perform attention processing on the feature point location information and the feature point descriptor corresponding to the target feature point to learn the appearance features and / or structural features of the first object, thereby obtaining feature data of the appearance features and / or structural features of the first object. In this way, the resulting feature data represents the first object as a whole better than the individual feature point locations and feature point descriptors do.
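As an illustration only, attention processing over the target feature points of one object might look like the following single-head self-attention pass, in which every target feature point is treated as a node of a fully connected graph. The disclosure does not specify the network architecture; the linear position encoding, the residual update, and all weight matrices here are assumptions, and the weights are random placeholders rather than a pre-trained graph neural network.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_layer(positions, descriptors, w_pos, w_q, w_k, w_v):
    """One self-attention pass over the target feature points of one object.

    positions: (N, 2) feature point locations.
    descriptors: (N, D) feature point descriptors.
    w_pos, w_q, w_k, w_v: placeholder weights (hypothetical, untrained).
    """
    # Inject location information into each descriptor, shape (N, D).
    x = descriptors + positions @ w_pos
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Each row holds one point's attention weights over all points.
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]))
    # Residual update: each point absorbs context from the other points.
    return x + attn @ v

rng = np.random.default_rng(0)
n_points, dim = 5, 16
positions = rng.uniform(0.0, 1.0, size=(n_points, 2))
descriptors = rng.normal(size=(n_points, dim))
w_pos = rng.normal(size=(2, dim)) * 0.1
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))
feature_data = self_attention_layer(positions, descriptors, w_pos, w_q, w_k, w_v)
print(feature_data.shape)
```

The point of the sketch is that each output row depends on all target feature points of the object, not only on the point it started from, which is how per-point information can come to reflect object-level appearance and structure.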
[0082] Before performing feature aggregation on the obtained feature data, sparsification can also be applied to the feature data. Here, since there are multiple target feature points corresponding to the first object, after a graph neural network is used to perform attention processing on the feature point locations and feature point descriptors of these target feature points, the obtained feature data likewise comprises multiple items, one for each target feature point.
[0083] In this case, since the amount of feature data may be large, the computing power required for matching using such feature data is also large. In addition, the multiple target feature points identified for the first object carry different weights in representing the first object. Therefore, the feature data can be sparsified to reduce the impact of any single target feature point on the feature data of the first object as a whole.
[0084] For example, when the first object includes a shelf, the corresponding determined target feature points include a feature point corresponding to an apex corner of the shelf, and also a feature point on the edge line of one of the shelf's sides. Since the apex corner is quite distinctive in the actual appearance and / or structure of a shelf, the target feature point corresponding to the apex corner of the shelf can effectively characterize the shelf's appearance and / or structural features.
[0085] By contrast, target feature points located on a shelf edge are less effective than the one at the apex corner in representing the appearance and / or structural features of the shelf. Furthermore, a single edge of a shelf may correspond to multiple target feature points, and these multiple target feature points may express similar semantics. Therefore, using a large number of target feature points corresponding to the edge for matching may lead to a waste of computational resources and reduce matching efficiency.
[0086] Therefore, sparsification of the feature data is adopted to further reduce redundant data in the feature data and improve the accuracy of the sparsified feature data in representing the first object.
[0087] After sparsifying the feature data, feature aggregation is performed on the sparsified feature data to obtain the first object descriptor of the first object.
[0088] Here, after further feature aggregation processing on the feature data obtained after sparsification, the feature point descriptors from multiple sparsified feature data can be further aggregated into a single descriptor, which is then used as the first object descriptor of the first object. The first object descriptor of the first object can, for example, include a multi-dimensional vector, such as a 2048-dimensional descriptor vector.
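The disclosure does not state which sparsification or aggregation operators are used, so the following is only a sketch under stated assumptions: sparsification is modeled as keeping the k feature vectors with the highest importance score, and aggregation as element-wise max pooling followed by L2 normalization. The function names, the use of the vector norm as a stand-in importance score, and the value k=8 are all hypothetical.

```python
import numpy as np

def sparsify_top_k(feature_data, scores, k):
    """Keep the k rows of feature_data with the highest scores."""
    k = min(k, len(feature_data))
    keep = np.argsort(scores)[::-1][:k]
    return feature_data[keep]

def aggregate(feature_data):
    """Pool per-point features into one unit-length object descriptor."""
    pooled = feature_data.max(axis=0)       # element-wise max over points
    norm = np.linalg.norm(pooled)
    return pooled / norm if norm > 0 else pooled

rng = np.random.default_rng(1)
feature_data = rng.normal(size=(40, 2048))      # 40 target feature points
scores = np.linalg.norm(feature_data, axis=1)   # stand-in importance score
sparse = sparsify_top_k(feature_data, scores, k=8)
object_descriptor = aggregate(sparse)
print(sparse.shape, object_descriptor.shape)
```

Whatever operators are actually chosen, the shape change is the relevant part: many per-point feature vectors are reduced to a few, and those few are collapsed into a single vector (2048-dimensional in the example given in the text).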
[0089] Thus, for the first object in the first image, it can be represented by a first object descriptor; since the feature point descriptor used when determining the first object descriptor can better reflect the semantic information of the corresponding feature points, the determined first object descriptor can more accurately represent the semantic information of the first object.
[0090] Regarding S203 above, the second image may be, for example, an image, different from the first image, acquired by the robot while it is moving in the warehouse space. For example, the second image may include a historical image with a timestamp earlier than that of the first image.
[0091] The method for obtaining the second object descriptor of at least one second object in the second image is similar to the method, described above for S201 and S202 of Figure 2, for determining the first object descriptor of at least one first object in the first image, and will not be described again here.
[0092] After determining the first object descriptor and the second object descriptor, the first object and the second object can be matched using the first object descriptor and the second object descriptor to obtain the matching result.
[0093] The first image may contain one or more first objects, and the second image may contain one or more second objects. Therefore, when determining the first object descriptor and the second object descriptor, multiple first object descriptors corresponding to multiple first objects and multiple second object descriptors corresponding to multiple second objects can be obtained.
[0094] In specific implementation, for example, the following method can be adopted: based on the first object descriptor and the second object descriptor, determine the similarity information of the first object and the second object; compare the similarity information with a preset similarity threshold; if the similarity information is greater than the similarity threshold, determine that the first object and the second object are the same object.
[0095] Specifically, when there are multiple first object descriptors and multiple second object descriptors, the similarity information between each first object descriptor and each second object descriptor can, for example, be determined one by one. It can then be determined whether there exists a first object that is the same as a given second object. The specific process can be determined according to the actual situation and will not be elaborated here.
[0096] When determining the similarity information between the first object and the second object, for example, the inner product of the first object descriptor and the second object descriptor can be calculated, and the result obtained after calculating the inner product can be used as the similarity information between the first object and the second object.
[0097] For example, if both the first and second object descriptors correspond to a 2048-dimensional descriptor vector, the product of the corresponding dimensions of the two descriptor vectors is calculated, and then the 2048 products are summed. The result is used as the similarity information between the first and second objects. Here, the similarity information between the first and second objects can be represented as Sim, for example.
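The inner-product computation described above can be written out as a short sketch. The function name `similarity` and the toy descriptor values are illustrative assumptions; only the operation itself (multiply corresponding dimensions, then sum) comes from the text.

```python
import numpy as np

def similarity(desc_a, desc_b):
    """Inner product of two object descriptors of equal dimension."""
    a = np.asarray(desc_a, dtype=float)
    b = np.asarray(desc_b, dtype=float)
    if a.shape != b.shape:
        raise ValueError("descriptors must have the same shape")
    # Multiply corresponding dimensions, then sum the products.
    return float(np.sum(a * b))

# Two 2048-dimensional toy descriptors that are zero except in 4 dimensions.
a = np.zeros(2048)
a[:4] = [1.0, 2.0, 3.0, 4.0]
b = np.zeros(2048)
b[:4] = [0.5, 0.0, 1.0, 0.25]
sim = similarity(a, b)   # 1*0.5 + 2*0 + 3*1 + 4*0.25
print(sim)               # 4.5
```

The same value is obtained with `a @ b`; the explicit multiply-and-sum form is used here only to mirror the wording of the paragraph.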
[0098] Here, since the obtained first object descriptor and second object descriptor can express the first object and the second object relatively comprehensively and accurately, if the first object and the second object are the same object, the determined first object descriptor and second object descriptor should be similar; that is, in this case, the inner product of the calculated first object descriptor and second object descriptor should be large.
[0099] Therefore, a similarity threshold SIM can be preset and compared with the similarity information Sim to determine whether the first object and the second object are the same object.
[0100] In one possible implementation, if the similarity information is greater than the similarity threshold, the first object and the second object are determined to be the same object; in another possible implementation, if the similarity information is less than or equal to the similarity threshold, the first object and the second object are determined to be different objects.
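When several first objects and several second objects are present, the threshold rule can be applied to every pair. The sketch below assumes one simple policy that the disclosure does not prescribe: for each first object, take the second object with the highest inner product and accept the pair only if that similarity is strictly greater than the threshold. The function name, the 3-dimensional toy descriptors, and the threshold value 0.8 are hypothetical.

```python
import numpy as np

def match_objects(first_descs, second_descs, sim_threshold):
    """Match each first object to its most similar second object.

    first_descs: (M, D) array, one descriptor per first object.
    second_descs: (N, D) array, one descriptor per second object.
    Returns a list of (first_index, second_index, similarity) tuples.
    """
    sims = first_descs @ second_descs.T   # (M, N) pairwise inner products
    matches = []
    for i, row in enumerate(sims):
        j = int(np.argmax(row))
        # Same object only if similarity is greater than the threshold;
        # less than or equal means different objects.
        if row[j] > sim_threshold:
            matches.append((i, j, float(row[j])))
    return matches

first = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
second = np.array([[0.0, 0.9, 0.1],
                   [0.2, 0.1, 0.9],
                   [0.95, 0.0, 0.05]])
matches = match_objects(first, second, sim_threshold=0.8)
print(matches)
```

Here first object 0 pairs with second object 2 and first object 1 pairs with second object 0. Raising the threshold makes the decision stricter and can leave some first objects unmatched.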
[0101] In this way, once the similarity information is determined, it only needs to be compared with the similarity threshold to quickly match the first object and the second object. Compared with matching by using the feature points corresponding to the first object and the second object respectively, the amount of data required for calculation is greatly reduced, the calculation is simple, and less computing power is required, which makes the method suitable for deployment on actual systems. Moreover, matching efficiency can be improved while matching accuracy is ensured.
[0102] Furthermore, while feature point matching methods are generally suitable for matching objects in adjacent frames, they cannot eliminate the influence of factors such as changes in viewing angle on the identification of object feature points. The matching method provided in this embodiment, by contrast, uses object descriptors for object matching. Therefore, the time interval between the first image and the second image, as indicated by their timestamps, can be relatively long without affecting the matching of the first object in the first image and the second object in the second image.
[0103] In another embodiment of this disclosure, the position of the robot in the target scene when acquiring the first image can also be determined based on the matching results.
[0104] Here, the matching method provided in the embodiments of this disclosure can also be used for loop closure detection and relocalization.
[0105] For example, the target scene may include location A. After the robot acquires the second image at location A, it continues to move within the target scene. After a period of time, using the localization result obtained from the determined second image, the robot is controlled to return to location A and acquire the first image. At this point, if the matching result shows that the first object in the first image and the second object in the second image are the same object, it can be considered that the robot was controlled to reach the same location using the localization result, meaning the localization result for location A is relatively accurate, and the robot's position can be marked. However, if the matching result shows that the first object and the second object do not match, it indicates that the robot was not returned to the same location when the localization result was used to return to location A, meaning the current localization is inaccurate, and a new localization test is performed.
[0106] Those skilled in the art will understand that, in the above-described method of the specific implementation, the order in which each step is written does not imply a strict execution order and does not constitute any limitation on the implementation process. The specific execution order of each step should be determined by its function and possible internal logic.
[0107] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, for the specific working processes of the systems and devices described above, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here. In the several embodiments provided in this disclosure, it should be understood that the disclosed systems, devices, and methods can be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of units is only a logical functional division; in actual implementation, there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the displayed or discussed mutual coupling, direct coupling, or communication connection may be through some communication interfaces; the indirect coupling or communication connection of devices or units may be electrical, mechanical, or in other forms.
[0108] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0109] In addition, the functional units in the various embodiments of this disclosure can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.
[0110] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a processor-executable, non-volatile, computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0111] Finally, it should be noted that the above-described embodiments are merely specific implementations of this disclosure, used to illustrate the technical solutions of this disclosure, and not to limit it. The protection scope of this disclosure is not limited thereto. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify or easily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features, within the scope of the technology disclosed in this disclosure. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure, and should all be covered within the protection scope of this disclosure. Therefore, the protection scope of this disclosure should be determined by the protection scope of the claims.
Claims
1. A robot, characterized in that it comprises: a vision sensor and a control component; the vision sensor is configured to acquire a first image during the robot's movement; the control component is configured to: extract features from the first image to obtain a feature map of the first image; based on the feature map, obtain a first object descriptor for at least one first object in the first image, wherein the first object descriptor is obtained by feature aggregation processing of the feature data of the first object; the feature data of the first object is obtained by using a pre-trained graph neural network to perform attention processing on the feature point location information corresponding to the target feature points of the first object and the feature point descriptors corresponding to the target feature points; the target feature points of the first object are used to represent the feature points in the region occupied by the first object in the first image; and, based on the first object descriptor and a second object descriptor of at least one second object in a second image, match the first object and the second object to obtain a matching result of the first object and the second object, wherein the matching result includes at least whether the first object and the second object are the same object; the second image is an image, different from the first image, obtained during the robot's movement.
2. The robot according to claim 1, characterized in that the feature map includes a first feature map and a second feature map; the control component, when obtaining a first object descriptor for at least one first object in the first image based on the feature map, is configured to: perform feature point detection processing on the first feature map to obtain the feature point location information and feature point descriptors corresponding to the feature points in the first feature map; perform object detection processing on the second feature map to obtain the object location information of the first object in the second feature map; and, based on the feature point location information, the object location information, and the feature point descriptors, obtain the first object descriptor of the first object.
3. The robot according to claim 2, characterized in that, The feature points include: the endpoints and / or fixed points of the contour corresponding to the first object.
4. The robot according to claim 2 or 3, characterized in that, The control component, when obtaining the first object descriptor of the first object based on the feature point location, the object location, and the feature point descriptor, is configured as follows: Based on the object location information and the feature point location information, the target feature points of the first object are determined; Based on the feature point location information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point, the first object descriptor of the first object is obtained.
5. The robot according to claim 1, characterized in that, The feature data of the first object includes feature data of the appearance features and / or structural features of the first object.
6. The robot according to claim 5, characterized in that, before performing feature aggregation processing on the feature data to obtain the first object descriptor of the first object, the control component is further configured to: perform sparsification processing on the feature data; and performing feature aggregation processing on the feature data to obtain the first object descriptor of the first object includes: performing feature aggregation processing on the sparsified feature data to obtain the first object descriptor of the first object.
7. The robot according to claim 1, characterized in that, When the control component matches the first object and the second object based on the first object descriptor and the second object descriptor of at least one second object in the second image, it is configured to: Based on the first object descriptor and the second object descriptor, the similarity information between the first object and the second object is determined; The similarity information is compared with a preset similarity threshold; If the similarity information is greater than the similarity threshold, the first object and the second object are determined to be the same object.
8. The robot according to claim 1, characterized in that, The second image includes historical images whose timestamps are earlier than the first image.
9. The robot according to claim 1, characterized in that, The control component is also configured to: determine the position of the robot in the target scene when acquiring the first image, based on the matching result.
10. A matching method, characterized in that it comprises: performing feature extraction on a first image to obtain a feature map of the first image; based on the feature map, obtaining a first object descriptor for at least one first object in the first image, wherein the first object descriptor is obtained by feature aggregation processing of the feature data of the first object; the feature data of the first object is obtained by using a pre-trained graph neural network to perform attention processing on the feature point location information corresponding to the target feature points of the first object and the feature point descriptors corresponding to the target feature points; the target feature points of the first object are used to represent the feature points in the region occupied by the first object in the first image; and, based on the first object descriptor and a second object descriptor of at least one second object in a second image, matching the first object and the second object to obtain a matching result of the first object and the second object, wherein the matching result includes at least whether the first object and the second object are the same object; the second image is an image, different from the first image, obtained during the robot's movement.
11. The matching method according to claim 10, characterized in that the feature map includes a first feature map and a second feature map; the step of obtaining a first object descriptor for at least one first object in the first image based on the feature map includes: performing feature point detection processing on the first feature map to obtain the feature point location information and feature point descriptors corresponding to the feature points in the first feature map; performing object detection processing on the second feature map to obtain the object location information of the first object in the second feature map; and, based on the feature point location information, the object location information, and the feature point descriptors, obtaining the first object descriptor of the first object.
12. The matching method according to claim 11, characterized in that, The feature points include: the endpoints and / or fixed points of the contour corresponding to the first object.
13. The matching method according to claim 11 or 12, characterized in that, Based on the feature point location, the object location, and the feature point descriptor, a first object descriptor for the first object is obtained, including: Based on the object location information and the feature point location information, the target feature points of the first object are determined; Based on the feature point location information corresponding to the target feature point and the feature point descriptor corresponding to the target feature point, the first object descriptor of the first object is obtained.
14. The matching method according to claim 13, characterized in that, The feature data of the first object includes feature data of the appearance features and / or structural features of the first object.
15. The matching method according to claim 14, characterized in that, before performing feature aggregation processing on the feature data to obtain the first object descriptor of the first object, the method further includes: performing sparsification processing on the feature data; and the step of performing feature aggregation processing on the feature data to obtain the first object descriptor of the first object includes: performing feature aggregation processing on the sparsified feature data to obtain the first object descriptor of the first object.
16. The matching method according to claim 10, characterized in that, Matching the first object and the second object based on the first object descriptor and the second object descriptor of at least one second object in the second image includes: Based on the first object descriptor and the second object descriptor, the similarity information between the first object and the second object is determined; The similarity information is compared with a preset similarity threshold; If the similarity information is greater than the similarity threshold, the first object and the second object are determined to be the same object.
17. The matching method according to claim 10, characterized in that, The second image includes historical images whose timestamps are earlier than the first image.
18. The matching method according to claim 10, characterized in that, Also includes: Based on the matching results, the robot's position in the target scene when acquiring the first image is determined.