Information processing device, information processing method, and program

The information processing device adjusts tracking duration based on target detection in time-series images, ensuring continuous tracking and reducing user intervention by setting durations based on candidate object proximity and detection results.

JP7876357B2 · Active · Publication Date: 2026-06-19 · CANON KK

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Patents
Current Assignee / Owner
CANON KK
Filing Date
2022-07-05
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Conventional tracking techniques fail to appropriately set the tracking duration, leading to unnecessary user intervention when the target is not detected, or continued tracking when it should be terminated, due to varying shooting conditions and target characteristics.

Method used

An information processing device that acquires time-series images, identifies the tracking target, detects candidate objects, calculates the distance between the target and candidates, and sets the tracking duration based on this information, terminating the process if the target is not detected for longer than the set duration.

Benefits of technology

Allows tracking to continue through periods in which the target is temporarily not detected in the image, and reduces user intervention by adjusting the tracking duration based on detection results.



Abstract

PROBLEM TO BE SOLVED: To appropriately set a tracking continuation time for continuing processing of tracking a tracking target even when the tracking target is not detected from images.

SOLUTION: An information processing apparatus performs processing of tracking a tracking target in time-series images. The information processing apparatus acquires the time-series images. The information processing apparatus acquires information for specifying the tracking target. The information processing apparatus performs detection processing of detecting, from the time-series images, an object according to the information for specifying the tracking target. The information processing apparatus sets a tracking continuation time based on information indicating a result of the detection processing. When the length of time during which the tracking target is not detected from the images exceeds the tracking continuation time, the information processing apparatus ends the processing of tracking the tracking target.

SELECTED DRAWING: Figure 2

Description

Technical Field

[0001] The present invention relates to an information processing apparatus, an information processing method, and a program, and particularly to a technique for tracking a tracking target in an image.

Background Art

[0002] As techniques for tracking a specific subject in an image, techniques using luminance or color information and techniques using template matching are known. In recent years, a technique using a Deep Neural Network (hereinafter abbreviated as DNN) has attracted attention as a highly accurate tracking technique. Non-Patent Document 1 discloses a Siamese type DNN for tracking a specific subject in an image. In this type of DNN, an image in which a tracking target appears and an image to be searched are respectively input to a Convolutional Neural Network (hereinafter abbreviated as CNN) having the same weights. Then, by calculating the cross-correlation of the outputs obtained from the CNN for each image, the position where the tracking target exists in the image to be searched is specified.

[0003] On the other hand, in these tracking techniques, when the appearance of the tracking target changes greatly, or when the tracking target is hidden by an obstacle, etc., the detection of the tracking target may fail temporarily. Therefore, Patent Document 1 proposes that when the time during which the tracking target cannot be tracked is shorter than a predetermined holding time, tracking is restarted when the tracking target is redetected, and when the holding time is exceeded, tracking is terminated. Further, Patent Document 1 discloses that this holding time is changed according to control information of an imaging device such as the focal length of the imaging device or the control information of an anti-shake device. Furthermore, Patent Document 2 proposes setting this holding time according to the authentication result of the tracking target.

Prior Art Documents

Patent Documents

[0004]

Patent Document 1

[0005] [Non-Patent Document 1] L. Bertinetto et al., "Fully-Convolutional Siamese Networks for Object Tracking", arXiv:1606.09549, 2016.
[Non-Patent Document 2] W. Liu et al., "High-level Semantic Feature Detection: A New Perspective for Pedestrian Detection", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[Non-Patent Document 3] J. Zhou et al., "Discriminative and Robust Online Learning for Siamese Visual Tracking", arXiv:1909.02959, 2019.

Summary of the Invention

Problems to Be Solved by the Invention

[0006] In the conventional technology described above, if the holding time is too long, the user may not realize that tracking has failed for a long time. On the other hand, if tracking is terminated, the user will be able to re-select the target to be tracked. Therefore, if the holding time is too short, the user will have to go through the trouble of re-selecting the target to be tracked. Thus, the appropriate tracking duration varies depending on various factors such as shooting conditions and the characteristics of the target to be tracked.

[0007] The present invention aims to appropriately set the tracking duration during which the tracking process continues even when the target is not detected in the image.

Means for Solving the Problems

[0008] An information processing device according to one embodiment of the present invention has the following configuration. That is, it is an information processing device that performs processing of tracking a tracking target in time-series images, comprising:
a first acquisition means for acquiring the time-series images;
a second acquisition means for acquiring information that identifies the tracking target;
a detection means for performing detection processing that detects, from the time-series images, one or more candidate objects for the tracking target according to the information that identifies the tracking target, and for determining the tracking target from among the candidate objects;
a setting means for setting a tracking duration based on the distance between the position of the tracking target determined from among a plurality of the candidate objects and the positions of the candidate objects other than the tracking target; and
a control means for terminating the processing of tracking the tracking target if the length of time during which the tracking target is not detected in the images exceeds the tracking duration.

Effects of the Invention

[0009] The tracking duration, during which tracking of the target continues even when the target is not detected in the image, can be set appropriately.

Brief Description of the Drawings

[0010]
[Figure 1] A diagram showing an example of the hardware configuration of an information processing device according to one embodiment.
[Figure 2] A block diagram showing an example of the functional configuration of an information processing device according to one embodiment.
[Figure 3] A flowchart showing the processing procedure in an information processing method according to one embodiment.
[Figure 4] A diagram illustrating the flow of the method for calculating the tracking duration in one embodiment.
[Figure 5] A diagram illustrating an example of detecting the position of a tracked target.
[Figure 6] A diagram for explaining an example of a method for calculating the tracking duration in one embodiment.
[Figure 7] A diagram for explaining an example of a method for calculating the tracking duration in one embodiment.
[Figure 8] A block diagram showing an example of the functional configuration of an information processing apparatus according to one embodiment.
[Figure 9] A diagram for explaining the flow of a method for calculating the tracking duration in one embodiment.
[Figure 10] A diagram showing an example of notification of the tracking duration.

Embodiments for Carrying Out the Invention

[0011] Hereinafter, embodiments will be described in detail with reference to the accompanying drawings. Note that the following embodiments do not limit the invention according to the claims. Although a plurality of features are described in the embodiments, not all of these plurality of features are essential for the invention, and the plurality of features may be arbitrarily combined. Further, in the accompanying drawings, the same or similar configurations are denoted by the same reference numerals, and redundant descriptions are omitted.

(Hardware Configuration)

[0012] An information processing apparatus according to an embodiment of the present invention can be realized by a computer including a processor and a memory. FIG. 1 is a diagram showing an example of the hardware configuration of an information processing apparatus according to an embodiment of the present invention. The information processing apparatus shown in FIG. 1 has a CPU 101, a ROM 102, a RAM 103, a storage unit 104, a display unit 105, an operation unit 106, and a network interface (I/F) 107.

[0013] The CPU 101 is a central processing unit that can perform calculations and logical decisions for various processes and control each component connected to the system bus 108. The Read-Only Memory (ROM) 102 is program memory that stores programs for control by the CPU 101 that define various processing procedures, which will be described later. The Random Access Memory (RAM) 103 is used as the main memory or as a temporary storage area such as the work area of the CPU 101. Note that the program memory may be implemented by loading a program into the RAM 103 from a storage medium such as the storage unit 104 connected to the information processing device.

[0014] The storage unit 104 can store electronic data and programs used in each embodiment. Such a storage unit 104 can be implemented, for example, using a medium (recording medium) and an external storage drive that enables access to this medium. Examples of such media include hard disk drives (HDDs), flexible disks (FDs), CD-ROMs, DVDs, USB memory sticks, MOs, and flash memory. The storage unit 104 may also be a server device connected via a network. The storage unit 104 can store data to be processed in this embodiment, and can store data to be tracked, for example.

[0015] The display unit 105 is a device that outputs an image onto a display screen. The display unit 105 may be, for example, a CRT display, a liquid crystal display, or an organic EL display. The operation unit 106 can accept various operations from the user. The operation unit 106 may include, for example, a switch, a keyboard, or a mouse, and is used when setting the tracking target. The operation unit 106 may also be an external device connected to the information processing device by wire or wireless. The configuration of the display unit 105 and the operation unit 106 is not limited to this. For example, the operation unit 106 may be a pen tablet, or a tablet having the functions of both the display unit 105 and the operation unit 106 may be used.

[0016] The I/F 107 functions as an interface for connecting the information processing device to external devices. The CPU 101, ROM 102, RAM 103, storage unit 104, display unit 105, operation unit 106, and I/F 107 are all connected to the system bus 108.

(Example of Functional Configuration)

[0017] Figure 2 is a diagram showing an example of the functional configuration of an information processing device according to one embodiment of the present invention. The information processing device 200 according to this embodiment performs a process of tracking a target in time-series images. The information processing device 200 has a first image acquisition unit 201, a feature quantity calculation unit 202, a second image acquisition unit 203, a template generation unit 204, a detection unit 205, a time calculation unit 206, and a tracking control unit 207. The information processing device 200 may also have a learning unit 208, although the learning process performed by the learning unit 208, described later, may instead be performed by another information processing device. The functions of these processing units can be realized by a processor such as the CPU 101 executing a program stored in a memory such as the ROM 102, RAM 103, or storage unit 104. However, some or all of the functions of the information processing device 200 may be realized by dedicated hardware. Furthermore, the information processing device according to one embodiment of the present invention may be composed of multiple information processing devices connected via a network (e.g., an information processing system).

[0018] The first image acquisition unit 201 acquires time-series images. In the following example, the first image acquisition unit 201 acquires images taken at times t = 1, ..., n. These images show the region in which the tracking target is searched for during the tracking process, and can be called search region images. Such time-series images can be obtained, for example, by continuously taking images using an imaging device set up to capture the region in which the tracking target is searched for. The time-series images may also be moving images.

[0019] The second image acquisition unit 203 acquires information to identify the tracking target. In this embodiment, the second image acquisition unit 203 acquires an image of the tracking target. The second image acquisition unit 203 may acquire the captured image of the tracking target from the storage unit 104 or from an external source, or it may acquire a partial image of a position specified by the user from the image acquired by the first image acquisition unit 201. In one embodiment, the information processing device 200 may acquire information representing the characteristics of the tracking target, such as feature quantities of the image of the tracking target, as information to identify the tracking target.

[0020] The feature quantity calculation unit 202, the template generation unit 204, and the detection unit 205 perform detection processing to detect, from the time-series images, objects that conform to the information identifying the tracking target. In this embodiment, the method for detecting the tracking target is not particularly limited, and for example, techniques using brightness or color information, techniques using template matching, and techniques using DNNs can be applied. Below, a configuration using a Siamese-type neural network as described in Non-Patent Document 1 will be described with reference to Figure 4. According to the embodiment shown below, the position or range of objects conforming to the information identifying the tracking target is identified using a Siamese-type neural network.

[0021] The template generation unit 204 generates a tracking target template 403 representing the characteristics of the tracking target from the tracking target image 401 acquired by the second image acquisition unit 203. The template generation unit 204 can perform this process using a CNN 402. That is, when image 401 is input to the CNN 402, it generates a tracking target template 403. The CNN 402 is pre-trained to obtain a tracking target template that makes it easy to distinguish between tracking targets and non-tracking targets. Such training of the CNN 402 can be carried out, for example, according to Non-Patent Document 1.

[0022] Furthermore, the feature quantity calculation unit 202 generates feature quantities 406 from image 404, which is the search region image acquired by the first image acquisition unit 201. The feature quantity calculation unit 202 can perform this process using CNN 405. That is, when image 404 is input to CNN 405, CNN 405 generates feature quantities 406. In this embodiment, since the cross-correlation between the tracking target template 403 and feature quantities 406 is calculated, CNN 402 and CNN 405 use common weight parameters.

[0023] Furthermore, the detection unit 205 performs a detection process to detect objects that conform to the tracking target template 403 from the image 404, using the tracking target template 403 and the feature quantities 406 of the image 404. Here, the detection unit 205 can generate information indicating the result of the detection process based on the cross-correlation between the feature quantities at multiple locations in the image and the feature quantities of the tracking target image. For example, the detection unit 205 first outputs a cross-correlation tensor 408 by performing a cross-correlation process 407 that calculates the cross-correlation between the tracking target template 403 and the feature quantities 406. As this cross-correlation process 407, a Siamese-type CNN similar to the one shown in Non-Patent Document 1 can be used. The cross-correlation process 407 is trained so that the value of the cross-correlation tensor 408 is large at locations in the image 404 where the probability of the tracking target indicated by the tracking target template 403 being present is high.
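The cross-correlation step can be illustrated with a minimal sketch. It assumes single-channel 2-D feature maps held as plain Python lists, whereas the features produced by a CNN are multi-channel and the process 407 in this embodiment is trained; the function name `cross_correlate` and the sample values are hypothetical and serve only to show how a response map peaks where the search features match the template.

```python
def cross_correlate(features, template):
    """Slide `template` over `features` (both 2-D lists) and return the map
    of inner products for every position where the template fits fully."""
    fh, fw = len(features), len(features[0])
    th, tw = len(template), len(template[0])
    response = []
    for y in range(fh - th + 1):
        row = []
        for x in range(fw - tw + 1):
            total = 0.0
            for dy in range(th):
                for dx in range(tw):
                    total += features[y + dy][x + dx] * template[dy][dx]
            row.append(total)
        response.append(row)
    return response


# Hypothetical search-region features containing one copy of the template.
features = [
    [0, 0, 0, 0],
    [0, 1, 2, 0],
    [0, 3, 4, 0],
    [0, 0, 0, 0],
]
template = [[1, 2], [3, 4]]
response = cross_correlate(features, template)
```

The largest value in `response` appears at the offset where the template overlaps its own copy in `features`, which mirrors the role of the cross-correlation tensor 408 described above.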

[0024] Furthermore, the detection unit 205 estimates the position of an object in the image 404 that conforms to the tracking target template 403 based on the cross-correlation tensor 408. The detection unit 205 can perform this process using a CNN 409. That is, when the cross-correlation tensor 408 is input to the CNN 409, the CNN 409 outputs an inference map 410 that shows the position of the tracking target.

[0025] An example of the inference map 410 is shown in Figure 5. Map 501 shows the estimated position of the tracking target in image 404, obtained based on the inference map 410. Image 404 contains two people 502 and 503, and the inference values (i.e., likelihoods) indicating the possibility of the tracking target existing at each position, as shown by the inference map 410, are shown in different colors, as in pixels 504 to 506. The detection unit 205 detects objects at pixels where these inference values are above a threshold as objects that conform to the tracking target template 403. Through the above process, the detection unit 205 can detect objects that conform to the tracking target template 403 from each image in the time series, and can record this detection result as the tracking target detection result.

[0026] In Figure 5, since the inference values for pixels 504 and 506 are above the threshold, objects located at the positions corresponding to pixels 504 and 506 are detected. Thus, the detection process that detects objects according to the tracking target template 403 may detect not only the tracking target but also objects similar to the tracking target. Furthermore, the tracking target may not be detected from image 404 for reasons such as a change in the tracking target's orientation or the tracking target being hidden behind another person or obstacle.
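The thresholding of the inference map can be sketched as follows. This is a minimal illustration under assumptions: the map is a 2-D list of likelihoods, the comparison is taken as "greater than or equal to", and the function name `detect_candidates` and the sample values are hypothetical.

```python
def detect_candidates(inference_map, threshold):
    """Return (x, y, value) for every pixel whose inference value reaches
    the threshold. Each hit is treated as one detected object."""
    candidates = []
    for y, row in enumerate(inference_map):
        for x, value in enumerate(row):
            if value >= threshold:
                candidates.append((x, y, value))
    return candidates


# Hypothetical 3x3 inference map with two strong responses.
inference_map = [
    [0.1, 0.2, 0.1],
    [0.9, 0.4, 0.7],
    [0.1, 0.2, 0.1],
]
candidates = detect_candidates(inference_map, threshold=0.6)
```

With these values two pixels pass the threshold, which corresponds to the situation in Figure 5 where both the tracking target and a similar object are detected.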

[0027] Incidentally, when image 404 can contain at most one tracking target, such as when the tracking target is a specific person, and multiple objects are detected, the detection unit 205 can detect one of the detected objects as the tracking target. The detected objects can be called candidate objects for the tracking target. That is, the detection unit 205 can detect one or more candidate objects for the tracking target according to the information that identifies the tracking target, and can further determine the tracking target from among the candidate objects. In this case, the detection unit 205 may select the candidate object located at the pixel with the highest inference value as the tracking target. Alternatively, the detection unit 205 can select as the tracking target the candidate object that is closest to the position of the tracking target detected from the image at the previous time, or to the position of the tracking target predicted from its movement at previous times. As a specific method for selecting a tracking target from candidate objects, for example, the method described in Non-Patent Document 2 can be used.
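The two selection rules described above can be sketched briefly. The candidate format `(x, y, value)` and the function names are assumptions made for illustration, and the reference position may be either the previous position or a predicted one.

```python
import math


def select_by_score(candidates):
    """Pick the candidate with the highest inference value."""
    return max(candidates, key=lambda c: c[2])


def select_by_proximity(candidates, reference_xy):
    """Pick the candidate closest to a reference position, such as the
    previous or predicted position of the tracking target."""
    return min(candidates, key=lambda c: math.dist((c[0], c[1]), reference_xy))


# Hypothetical candidates as (x, y, inference value).
candidates = [(10, 12, 0.9), (40, 15, 0.7)]
best = select_by_score(candidates)
nearest = select_by_proximity(candidates, reference_xy=(38, 14))
```

In this example the two rules disagree: the score rule picks the first candidate and the proximity rule picks the second, which shows why the choice of rule matters when a similar object is present.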

[0028] Furthermore, the detection unit 205 does not have to detect a tracking target if no candidate object meets specific conditions. For example, the detection unit 205 can determine the tracking target such that the distance between the current position of the tracking target and the position of the tracking target at a previous time is within a threshold. In this case, the detection unit 205 may also consider other parameters such as the speed of the tracking target. If such a method is adopted, a tracking target may not be selected from the candidate objects, i.e., the tracking target may not be detected from the image 404, for reasons such as the position of the candidate object being significantly different from the position of the tracking target at the previous time.
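A distance gate of this kind can be sketched as follows, assuming candidates are `(x, y)` positions and that only the distance condition is applied (speed and other parameters are omitted). The function name `select_with_gate` and the parameter `max_jump` are hypothetical.

```python
import math


def select_with_gate(candidates, previous_xy, max_jump):
    """Return the candidate nearest to the previous position, or None when
    even the nearest candidate is farther away than `max_jump`."""
    if not candidates:
        return None
    nearest = min(candidates, key=lambda c: math.dist(c, previous_xy))
    if math.dist(nearest, previous_xy) <= max_jump:
        return nearest
    return None
```

A return value of `None` corresponds to the case in which the tracking target is not detected from the image even though candidate objects exist.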

[0029] In any case, the detection unit 205 detects an object from the time-series images according to the tracking target template 403, and the result is obtained as the detection result of the tracking target from the time-series images. In this way, the tracking process of the tracking target is performed. As a result of the tracking process, the detection unit 205 can record information indicating the position of the detected tracking target in each of the time-series images in correspondence with each other.

[0030] Here, an example is shown in which the detection unit 205 generates an inference map 410 showing the estimated position of the tracked object. However, other forms of detection of the tracked object by the detection unit 205 are also possible. In one embodiment, the detection unit 205 can identify the position or range of the tracked object on the image. For example, the detection unit 205 may use a CNN 409 to further calculate two maps in addition to the inference map 410, showing the estimated vertical and horizontal sizes of the tracked object. Alternatively, the detection unit 205 may calculate the estimated position and size of the tracked object, i.e., the estimated range in which the tracked object is captured. In another form, the detection unit 205 may calculate a more accurate estimated position of the center of the tracked object on the cell 503 where the tracked object is located. For example, the detection unit 205 can further calculate two maps showing the estimated amount of small displacement (Δx, Δy) between the center of the tracked object (person 502) and the center of the cell 503. Furthermore, the detection unit 205 can calculate various estimated values related to the tracked object, such as the orientation or angle of the tracked object, the foreground area of the tracked object, or the distance to the upper, lower, left, or right edge of the tracked object. These configurations can be realized, for example, according to Non-Patent Document 2.

[0031] The time calculation unit 206 sets the tracking duration based on information indicating the results of the detection process. For example, the time calculation unit 206 can set the tracking duration based on information generated by the detection unit 205. The type of information indicating the results of the detection process used by the time calculation unit 206 is not particularly limited. For example, the information indicating the results of the detection process may be information indicating the position of a candidate object. As described above, the position of a candidate object can be determined based on the inference map 410 and the cross-correlation tensor 408, so the information indicating the position of a candidate object includes the inference map 410 and the cross-correlation tensor 408. The information indicating the results of the detection process may also be the result of the cross-correlation between the features of the image and the features of the image being tracked, and this includes the cross-correlation tensor 408. Furthermore, the information indicating the results of the detection process may also be information indicating the likelihood that a tracked object exists at each position in the image. As described above, the likelihood that a tracked object exists at each position in the image can be determined based on the inference map 410 and the cross-correlation tensor 408, so such likelihood information includes the inference map 410 and the cross-correlation tensor 408. Thus, the information indicating the results of the detection process also includes the intermediate calculation results in the detection process performed by the detection unit 205. Furthermore, as information indicating the results of the detection process, information indicating the position of objects in pixels where the inference value in the inference map 410 is greater than or equal to a threshold may be used, and this threshold may be different from the threshold used when detecting candidate objects based on the inference map 410.

[0032] The following describes the case where the tracking duration is set based on the inference map 410, which shows the result of the tracking target detection process in image 404. In this example, the time calculation unit 206 obtains the tracking duration 412 by inputting the inference map 410 to the tracking duration calculation process 411.

[0033] In the tracking duration calculation process 411, the time calculation unit 206 sets the tracking duration based on information indicating the position of the candidate object. The candidate object is an object that conforms to the information identifying the target to be tracked, and can also be called a similar object that is similar to the target to be tracked. The time calculation unit 206 may also set the tracking duration based on information indicating the position of the similar object.

[0034] For example, the time calculation unit 206 can determine whether or not a candidate object exists within a predetermined range from the detected tracking target, and calculate the tracking duration based on this determination. The time calculation unit 206 can also calculate the tracking duration based on the distance between the position of the tracking target, which is determined from among multiple candidate objects, and the positions of candidate objects other than the tracking target. Specifically, the time calculation unit 206 calculates the tracking duration such that the tracking duration when a candidate object exists within a predetermined range from the tracking target is shorter than the tracking duration when no candidate object exists within a predetermined range from the tracking target. As described above, since the inference map 410 shows the positions of candidate objects, it is possible to determine whether or not a candidate object exists within a predetermined range from the tracking target by referring to the inference map 410.

[0035] Figures 6(A) and 6(B) show the results of the detection process when a target to be tracked and a candidate object are detected from the image. Inference maps 601 and 611 correspond to inference map 410 and show the targets to be tracked 602 and 612, and candidate objects 603 and 613. Boundary lines 604 and 614 indicate the area (neighborhood area) within a predetermined range from the targets to be tracked 602 and 612.

[0036] The time calculation unit 206 determines whether candidate objects 603 and 613 are within the boundary line. In Figure 6(A), candidate object 603 is located outside the boundary line 604, and in Figure 6(B), candidate object 613 is located inside the boundary line 614. Based on this determination, the time calculation unit 206 sets the tracking duration to T1 in the case of Figure 6(A) and to T2 in the case of Figure 6(B). Here, T1 and T2 may be predetermined design values, and the time calculation unit 206 can read these values from, for example, the storage unit 104. T1 > T2 holds; that is, a shorter tracking duration is set when the candidate object is close to the tracking target. For example, T1 = 4 seconds and T2 = 2 seconds.
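The two-level rule of Figure 6 can be sketched as follows. The values T1 = 4 seconds and T2 = 2 seconds come from the example above; the circular neighborhood, the treatment of a candidate exactly on the boundary as inside, and the names `tracking_duration` and `radius` are assumptions made for illustration.

```python
import math

T1 = 4.0  # seconds, no candidate object inside the neighborhood
T2 = 2.0  # seconds, a candidate object inside the neighborhood


def tracking_duration(target_xy, other_candidates, radius):
    """Return T2 if any candidate other than the tracking target lies within
    `radius` of the target (assumed circular boundary), otherwise T1."""
    has_neighbor = any(
        math.dist(target_xy, c) <= radius for c in other_candidates
    )
    return T2 if has_neighbor else T1
```

For example, `tracking_duration((0, 0), [(100, 0)], 50.0)` corresponds to Figure 6(A) and `tracking_duration((0, 0), [(30, 0)], 50.0)` to Figure 6(B).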

[0037] The method for calculating the tracking duration based on the candidate object's position is not limited to the method described above. For example, multiple boundary lines may be defined. For instance, boundary lines can be defined such that a first region surrounds the tracking target, a second region is located outside the first region, and a third region is located outside the second region. A tracking duration can then be assigned to each region. For example, if a candidate object is located in the first region, the tracking duration can be set to T3; if no candidate object is located in the first region but one is located in the second region, the tracking duration can be set to T2. Furthermore, if no candidate object is located in the first or second region but one is located in the third region, the tracking duration can be set to T1. Here, T1 > T2 > T3, so that a candidate object closer to the tracking target yields a shorter tracking duration.
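The multi-boundary variant can be sketched with concentric regions listed from the innermost outward, each paired with a duration so that a nearer candidate object yields a shorter duration (T1 > T2 > T3). The circular boundaries, the radii, the value of T3, and the choice to return T1 when no candidate lies inside any region are all assumptions for illustration, since the text does not specify them.

```python
import math

# (outer radius of region, tracking duration in seconds), innermost first.
# Radii and the 1.0 s value are hypothetical; 4.0 s and 2.0 s follow the
# T1 and T2 example values given for Figure 6.
REGIONS = [(50.0, 1.0), (100.0, 2.0), (150.0, 4.0)]
DEFAULT_DURATION = 4.0  # assumed when no candidate lies inside any region


def tracking_duration_by_region(target_xy, other_candidates):
    """Return the duration of the innermost region containing a candidate."""
    if not other_candidates:
        return DEFAULT_DURATION
    nearest = min(math.dist(target_xy, c) for c in other_candidates)
    for outer_radius, duration in REGIONS:
        if nearest <= outer_radius:
            return duration
    return DEFAULT_DURATION
```

Checking regions from the inside outward means that only the nearest candidate object decides the result, which matches the "not in the nearer region but in the next one" wording of the rule.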

[0038] Alternatively, the time calculation unit 206 may calculate the tracking duration based on the distance between the target being tracked and the candidate objects. In this case, a continuous function that gives the tracking duration corresponding to the distance can be defined and used. For example, when the distance between the target being tracked and the candidate object closest to the target being tracked is a first distance, the tracking duration can be calculated to be longer compared to when this distance is a second distance shorter than the first distance. Furthermore, the time calculation unit 206 may calculate the tracking duration based on candidate objects located within a predetermined range from the target being tracked. For example, when the number of candidate objects located within a predetermined range from the target being tracked is a first number, the tracking duration can be calculated to be longer compared to when the number of such candidate objects is a second number greater than the first number.
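Both ideas in this paragraph can be sketched as simple functions. The linear shape, the clamping, and every parameter value below are assumptions chosen for illustration; the text only requires that a larger distance give a longer duration and that more nearby candidate objects give a shorter one.

```python
def duration_from_distance(distance, t_min=1.0, t_max=4.0, d_max=200.0):
    """Continuous, non-decreasing mapping from the distance to the nearest
    candidate object to a tracking duration (linear, clamped to [t_min, t_max])."""
    ratio = min(max(distance / d_max, 0.0), 1.0)
    return t_min + (t_max - t_min) * ratio


def duration_from_count(count, t_max=4.0, t_min=1.0, step=1.0):
    """Shorten the duration by `step` seconds per candidate object inside the
    predetermined range, never going below t_min."""
    return max(t_min, t_max - step * count)
```

Note that the count-based sketch saturates at `t_min`, so beyond a certain number of nearby candidate objects the duration no longer decreases; a design that must keep decreasing would need a different function.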

[0039] Alternatively, the time calculation unit 206 may set the tracking duration based on the cross-correlation tensor 408 instead of the inference map 410. The cross-correlation tensor 408 shows the cross-correlation between the features 406 of the image 404 and the tracking target template 403. As described above, the values in the cross-correlation tensor 408 corresponding to positions where the probability of a tracking target existing in the image 404 is high are larger. Thus, the cross-correlation tensor 408 also shows the result of the tracking target detection process in the image 404. In this case, the time calculation unit 206 can obtain the tracking duration 412 by inputting the cross-correlation tensor 408 to the tracking duration calculation process 411. If the cross-correlation tensor is three-dimensional or more, the time calculation unit 206 can extract a specific two-dimensional map and determine the tracking duration based on the extracted two-dimensional map. Alternatively, the time calculation unit 206 may generate a two-dimensional map by calculating pixel-by-pixel statistics of the cross-correlation tensor and determine the tracking duration based on the generated two-dimensional map.
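The reduction of a three-dimensional tensor to a two-dimensional map can be sketched as follows. The layout `tensor[channel][y][x]`, the use of the per-pixel maximum as the statistic, and the function names are assumptions; other statistics such as the mean would fit the description equally well.

```python
def select_channel(tensor, index):
    """Extract one specific two-dimensional map from a [channel][y][x] tensor."""
    return tensor[index]


def channel_max_map(tensor):
    """Build a two-dimensional map holding, for each pixel, the maximum value
    across all channels of a [channel][y][x] tensor."""
    channels = len(tensor)
    height, width = len(tensor[0]), len(tensor[0][0])
    return [
        [max(tensor[c][y][x] for c in range(channels)) for x in range(width)]
        for y in range(height)
    ]


# Hypothetical tensor with 2 channels of 2x2 values.
tensor = [
    [[1, 5], [2, 0]],
    [[3, 4], [1, 6]],
]
max_map = channel_max_map(tensor)
```

Either resulting map can then be thresholded or searched for nearby peaks in the same way as an inference map.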

[0040] The tracking control unit 207 terminates the tracking process when the length of time during which the tracking target is not detected in the image exceeds the tracking duration. At this time, the tracking control unit 207 may notify the user that the tracking process has ended. The tracking control unit 207 can also prompt the user to re-specify new information to identify the tracking target (for example, an image of the tracking target). Subsequently, when the second image acquisition unit 203 acquires new information to identify the tracking target, the detection unit 205 can start the tracking process based on the new information.
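The termination rule can be sketched as a small controller. This is only an illustration under assumptions: timestamps are in seconds, the undetected time is measured from the last frame in which the target was detected (or from the start time), and the class and method names are hypothetical.

```python
class TrackingController:
    """Ends tracking once the target has stayed undetected for longer than
    the currently set tracking duration."""

    def __init__(self, start_time):
        self.last_detected_at = start_time
        self.active = True

    def update(self, timestamp, detected, tracking_duration):
        """Process one frame and return True while tracking continues."""
        if not self.active:
            return False
        if detected:
            self.last_detected_at = timestamp
        elif timestamp - self.last_detected_at > tracking_duration:
            # Undetected time exceeded the duration: stop tracking. A caller
            # could notify the user and request a new target at this point.
            self.active = False
        return self.active
```

Once `update` returns `False`, the controller stays inactive; restarting would correspond to creating a new controller after the user specifies the tracking target again.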

[0041] (Process flow) Figures 3(A) and 3(B) are flowcharts showing the processing flow in the information processing method according to this embodiment. In the following description, the word "step" is omitted and each step is denoted by a reference numeral prefixed with S. Note that the information processing device 200 does not necessarily have to perform all the steps shown in these flowcharts. In Figure 3, the processing performed by the CPU 101 as described above is shown as a block.

[0042] Figure 3(A) is a flowchart of the process for detecting the target to be tracked from each of the time-series images. Figure 3(B) is a detailed flowchart of the process in S305, and is a flowchart of the process for determining whether the target to be tracked was detected or not, and the process for determining whether the tracking has ended.

[0043] First, the flowchart in Figure 3(A) will be described. In the following example, information identifying the target to be tracked is first acquired, and based on this information, a process is performed to detect the target in the image captured at time t=1. Furthermore, based on the same information identifying the target to be tracked, a process is performed to detect the target in each image captured from time t=2 onwards. Therefore, the process according to the flowchart in Figure 3(A) can be performed for each image in the time series.

[0044] In S301, the information processing device 200 acquires information to identify the tracking target. In this example, as described above, the second image acquisition unit 203 acquires an image of the tracking target, and the template generation unit 204 generates a tracking target template from the image of the tracking target. The second image acquisition unit 203 may acquire an image captured by an imaging device connected to the information processing device 200, or it may acquire an image stored in the storage unit 104. In this example, in the process of detecting the tracking target in an image captured at time t=1, the second image acquisition unit 203 acquires information to identify the tracking target and stores this information in the storage unit 104. Furthermore, in the process of detecting the tracking target in images captured at time t=2 and later, the second image acquisition unit 203 acquires the already acquired information to identify the tracking target from the storage unit 104. In this way, by performing tracking processing on images captured at each time using the same information to identify the tracking target, the same tracking target can be detected from images captured at each time.

[0045] In S302, the first image acquisition unit 201 acquires the search area image. As described above, in the first process following the flowchart in Figure 3(A), the first image acquisition unit 201 can acquire the image captured at time t=1. Furthermore, in the second and subsequent processes following the flowchart in Figure 3(A), the first image acquisition unit 201 can acquire the image captured at time t=2 or later. In this example, the feature calculation unit 202 further generates feature quantities of the search area image as described above in order to detect the target to be tracked.

[0046] In S303, the detection unit 205 detects the tracking target from each of the time-series images based on information that identifies the tracking target. In this example, as described above, the detection unit 205 calculates a cross-correlation tensor between the feature quantities of the search area and the tracking target template, and detects the tracking target based on the cross-correlation tensor. As described above, the detection unit 205 can track the tracking target by associating the tracking target detected from the images captured at each time point.

[0047] In S304, the time calculation unit 206 updates the tracking duration by calculating the tracking duration based on the results of the detection process as described above. The time calculation unit 206 can store the calculated tracking duration in the storage unit 104. In this embodiment, the time calculation unit 206 updates the tracking duration even if the detection of the tracking target fails in S303. In this case, the time calculation unit 206 may calculate the tracking duration based on the results of the detection process of the tracking target in an image taken at a time earlier than the image acquired in S302. For example, the time calculation unit 206 may calculate the tracking duration based on the position of the tracking target in an image taken at a time earlier than the image acquired in S302 and the position of the candidate object in the image acquired in S302. On the other hand, if the detection of the tracking target fails in S303, the time calculation unit 206 may skip updating the tracking duration.

[0048] In S305, the tracking control unit 207 determines whether tracking is successful and whether tracking has ended. The detailed processing flow in S305 is described below with reference to Figure 3(B).

[0049] In S311, the tracking control unit 207 determines whether or not it was able to detect the target being tracked in S303. If it is determined that the target being tracked was not detected, the process proceeds to S312.

[0050] In S312, the tracking control unit 207 updates the disappearance time, which indicates the time during which the tracking target has not been detected. For example, the tracking control unit 207 can add to the disappearance time the difference between the imaging time of the previously processed search area image and the imaging time of the currently processed search area image.

[0051] In S313, the tracking control unit 207 determines whether the disappearance time is longer than the tracking duration. If the disappearance time is longer than the tracking duration, the process proceeds to S314.

[0052] In S314, the tracking control unit 207 decides to terminate the tracking process. Subsequently, the information processing device 200 can start a process to track the target based on new information that identifies the target, as described above.

[0053] If it is determined in S311 that the target to be tracked has been detected, or if it is determined in S313 that the disappearance time is less than or equal to the tracking duration, the tracking process will not end. In this case, tracking processing will be performed on the image captured at the next time, based on the information that identifies the target to be tracked. As described above, the tracking process for the target to be tracked ends when the time during which the target to be tracked has not been detected exceeds the tracking duration. For this reason, if the target to be tracked has been detected in S303, the tracking control unit 207 can reset the disappearance time to 0. The tracking control unit 207 can also store the current disappearance time in the storage unit 104.
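The decision flow of S311 to S314 can be sketched as below. The class name `TrackingController`, the method `step`, and the use of capture times in seconds are assumptions for illustration; the reset to 0 on detection and the strict "exceeds" comparison follow the description above.

```python
class TrackingController:
    """Minimal sketch of the S311-S314 decision flow (names are hypothetical)."""

    def __init__(self):
        self.disappearance_time = 0.0
        self.prev_capture_time = None

    def step(self, detected, capture_time, tracking_duration):
        """Process one image; return True if tracking should be terminated."""
        if self.prev_capture_time is None:
            elapsed = 0.0  # first image: no previous capture time
        else:
            elapsed = capture_time - self.prev_capture_time
        self.prev_capture_time = capture_time

        if detected:                         # S311: target was detected
            self.disappearance_time = 0.0    # reset the disappearance time
            return False
        self.disappearance_time += elapsed   # S312: accumulate time without detection
        # S313: terminate (S314) only when the duration is exceeded
        return self.disappearance_time > tracking_duration
```

For example, with a tracking duration of 1.0 and images captured every 0.5, tracking ends on the third consecutive image in which the target is missing.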

[0054] (Learning Methods) In this embodiment, the parameters of CNN402, CNN405, cross-correlation processing 407, and CNN409 can be determined by the learning process performed by the learning unit 208. For example, similar to the process shown in Figure 3(A), the first image acquisition unit 201 can acquire a search area image for learning, and the second image acquisition unit 203 can acquire a tracking target image for learning. The feature calculation unit 202, template generation unit 204, and detection unit 205 can then calculate an inference map 410 based on these images. The learning unit 208 calculates a loss by comparing the obtained inference map 410 with ground truth data prepared for the pair of the search area image and tracking target image for learning. The learning unit 208 can then update the parameters of CNN402, CNN405, cross-correlation processing 407, and CNN409 based on the loss thus calculated.

[0055] Here, the ground truth data can be prepared in accordance with the configuration of the inference map 410. For example, the ground truth data can be provided for the estimated position of the tracked object and the estimated size of the tracked object. The ground truth data for position may be, for example, a map having Gaussian function values centered on the position where the tracked object exists. The ground truth data for size may be a regression map where the pixels where the tracked object exists have values indicating the size of the tracked object. Backpropagation can be used to update the parameters based on the loss. As for specific learning methods, for example, the method described in Non-Patent Document 1 can be used.
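A position ground truth map of the kind described here might be generated as in the sketch below. The function name, the isotropic Gaussian form, and the default `sigma` are assumptions; the specification states only that the map holds Gaussian function values centered on the position of the tracked object.

```python
import math

def gaussian_position_map(height, width, center_y, center_x, sigma=2.0):
    """H x W map whose values peak at 1.0 on the object position (sigma is assumed)."""
    return [[math.exp(-((y - center_y) ** 2 + (x - center_x) ** 2)
                      / (2.0 * sigma ** 2))
             for x in range(width)]
            for y in range(height)]
```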

[0056] According to the embodiment described above, the tracking duration can be set based on information indicating the results of the detection process. In particular, according to the example above, the tracking duration is set based on the position of the target being tracked and the positions of candidate objects similar to the target being tracked. If the position of the target being tracked and the position of a candidate object are close together, it may become temporarily impossible to detect an object that matches the information identifying the target being tracked. If an object is then detected again, it is highly likely that a candidate object, not the target being tracked, has been detected. In this way, the difficulty of correctly re-detecting the target being tracked can be estimated based on the position of the target being tracked and the positions of candidate objects similar to the target being tracked. According to this embodiment, if the difficulty of re-detecting the target being tracked is high, and the probability of correctly re-detecting the target being tracked is low even if tracking is continued after the initial detection fails, the tracking duration is appropriately set to be shorter.

[0057] (Modification 1 of the method for determining the tracking duration) Up to this point, a method has been described for calculating the tracking duration by determining whether or not a candidate object exists within a predetermined range from the target being tracked. Another way to calculate the tracking duration is to use a neural network. For example, the time calculation unit 206 can use a neural network to calculate the tracking duration from information indicating the results of the detection process. Below, a method for calculating the tracking duration using a CNN is described with reference to Figures 7(A) and 7(B).

[0058] In this modified example, the tracking duration is calculated by performing a tracking duration calculation process using the cross-correlation tensor 408, which represents the result of the detection process. The details of the tracking duration calculation process are shown in Figure 7(A). First, the time calculation unit 206 calculates the tracking difficulty 722 by inputting the cross-correlation tensor 408 into the CNN 721. The learning method for the CNN 721 will be described later. The tracking difficulty 722 is a binary value, where 1 indicates that tracking is easy and 0 indicates that tracking is difficult. Furthermore, the time calculation unit 206 calculates the tracking duration 724 by inputting this tracking difficulty 722 into the time calculation process 723. For example, as shown in Figure 7(B), the storage unit 104 can hold a correspondence table 731 between tracking difficulty and tracking duration. In this case, the time calculation unit 206 can calculate the tracking duration 724 based on the correspondence table 731.
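The lookup in the time calculation process 723 could be as simple as the following sketch. The table contents are invented for illustration (the actual values in Figure 7(B) are not reproduced here), and the names `CORRESPONDENCE_TABLE` and `tracking_duration_from_difficulty` are hypothetical.

```python
# Hypothetical correspondence table: tracking difficulty -> duration in seconds.
# Per the text, 1 means tracking is easy and 0 means tracking is difficult.
CORRESPONDENCE_TABLE = {1: 3.0, 0: 1.0}

def tracking_duration_from_difficulty(difficulty, table=CORRESPONDENCE_TABLE):
    """Return the tracking duration associated with a binary tracking difficulty."""
    return table[difficulty]
```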

[0059] The learning unit 208 can train the CNN 721 as follows. First, the learning unit 208 creates ground truth data for the tracking difficulty 722. Ground truth data can be created by performing tracking processing using training images and comparing the obtained tracking results with the ground truth tracking results. That is, if a correct tracking result is obtained, ground truth data indicating that tracking processing is easy can be associated with the training images. Conversely, if an incorrect tracking result is obtained, ground truth data indicating that tracking processing is difficult can be associated with the training images.

[0060] As a concrete example, the ground truth data can be created as follows. First, the detection unit 205 obtains an inference map 410 based on the training search region image and the training target image, using the parameters of CNN 402, CNN 405, cross-correlation processing 407, and CNN 409 that have been trained using the method described above. In this example, an inference map 410 is generated for each of the multiple training time series images, and the tracking process for the target is performed based on these inference maps 410. Then, the learning unit 208 determines whether the tracking of the target in the last image of the multiple time series images was successful. Specifically, the learning unit 208 can determine whether the estimated position of the target matches the ground truth data of the target's position associated with the multiple training time series images and the training target image. In this modified example, the ground truth data for tracking difficulty 722 is 1 when tracking is successful, and 0 when tracking fails. The learning unit 208 stores the ground truth data for the tracking difficulty 722 obtained in this way in the storage unit 104, associating it with the learning search area image and the learning tracking target image.
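The labeling rule for the last image can be sketched as below. The specification does not say how a "match" between the estimated and ground truth positions is judged, so the Euclidean distance test and the `tolerance` value are assumptions, as is the function name.

```python
import math

def difficulty_label(estimated_pos, true_pos, tolerance=16.0):
    """Return 1 (tracking succeeded, easy) or 0 (tracking failed, difficult).

    A match is assumed here to mean that the estimate lies within
    `tolerance` pixels of the ground truth position.
    """
    return 1 if math.dist(estimated_pos, true_pos) <= tolerance else 0
```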

[0061] In this example, the learning unit 208 calculated one item of ground truth data for the multiple time-series images used for training. However, the learning unit 208 may also calculate ground truth data for the tracking difficulty 722 for each image. In this case, the learning unit 208 can calculate the ground truth data for the tracking difficulty 722 depending on whether or not the target was correctly tracked in each image. Alternatively, the learning unit 208 may calculate the ground truth data for the tracking difficulty 722 depending on whether or not the target was correctly detected in each image. In this case, the learning unit 208 can determine whether or not the target was correctly detected by comparing the inference map 410 obtained based on each image with the ground truth map showing the position of the target. Furthermore, the learning unit 208 may also acquire ground truth data (for example, ground truth data for the tracking difficulty 722) input by the user via the operation unit 106.

[0062] The learning unit 208 then uses the ground truth data for the tracking difficulty 722 obtained as described above to train the parameters of the CNN 721. In this case as well, the feature calculation unit 202, template generation unit 204, and detection unit 205 can calculate the cross-correlation tensor 408 based on the training search region image and the training tracking target image. The time calculation unit 206 calculates the tracking difficulty 722 by inputting the cross-correlation tensor 408 into the CNN 721. The learning unit 208 then calculates the loss according to the error between the calculated tracking difficulty 722 and the ground truth data for the tracking difficulty 722. A cross-entropy loss or the like can be used as the loss function. The loss can also be calculated according to the error evaluated for multiple training images. After that, the learning unit 208 updates the parameters of the CNN 721 based on the calculated loss. Parameter updates can be performed based on backpropagation using Momentum SGD or the like. The learning unit 208 can update the coupling weight coefficients between layers of the model so that the loss calculated based on multiple images is smaller than a predetermined threshold.
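For reference, the loss and update rule named in this paragraph have the following standard scalar forms. This is a generic sketch of binary cross-entropy and one momentum SGD step, not the training code of the CNN 721; the learning rate, momentum coefficient, and function names are assumptions.

```python
import math

def binary_cross_entropy(predicted, label, eps=1e-7):
    """Cross-entropy between a predicted probability and a 0/1 ground truth label."""
    p = min(max(predicted, eps), 1.0 - eps)  # avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def momentum_sgd_step(param, grad, velocity, lr=0.01, momentum=0.9):
    """One momentum SGD update for a single scalar parameter."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity
```

In practice the gradient passed to `momentum_sgd_step` would come from backpropagating the loss through the network, which is omitted here.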

[0063] The parameters of CNN721 are learned by repeating the learning process for one batch (group of image data) as described above until predetermined conditions are met. Termination conditions include reaching a predetermined number of iterations or the loss falling below a predetermined value. In this modified example, the parameters of CNN721 are updated, but the parameters of CNN402, CNN405, and the cross-correlation process 407 may also be updated simultaneously. Furthermore, the learning unit 208 may calculate a loss according to the error related to the results of the detection process. For example, when calculating the loss related to the inference map 410 and performing backpropagation of the error in the inference map 410, the parameters can be updated based on a weighted average of the error propagated from the inference map 410 and the error propagated from the tracking difficulty 722.

[0064] According to the above modification, the tracking difficulty of the target can be estimated based on the information indicating the results of the detection process. A high tracking difficulty indicates that, when an object conforming to the information identifying the target can no longer be detected, the detection failure is unlikely to be temporary and the target is difficult to re-detect correctly. According to this modification, if the difficulty of re-detecting the target is high and the probability of correctly re-detecting the target is low even if tracking continues after a failure to detect the target, the tracking duration is appropriately set to be shorter. Note that the method for calculating the tracking duration using the results of the detection process and the CNN is not limited to the method described above. For example, the inference map 410 may be used as input to the CNN 721. Also, the number of tracking difficulty classes calculated by the CNN 721 may be three or more. Furthermore, instead of outputting the tracking difficulty 722, the CNN 721 may directly output the tracking duration 724. Even with such methods, the tracking duration can be appropriately set based on the difficulty of re-detecting the target estimated based on the features of the results of the detection process.

[0065] (Modification 2 of the method for determining the tracking duration) In the above embodiment, the tracking duration was calculated based on the result of detection processing from one image in the time series. Alternatively, the tracking duration may be set based on information indicating the results of detection processing for two or more images in the time series.

[0066] In this modified example, the time calculation unit 206 calculates the tracking duration using the results of the detection process for each of a plurality of consecutive search region images in a time series. For example, the time calculation unit 206 can calculate the tracking duration based on a cross-correlation tensor 408 obtained from each of the plurality of images in the time series. In this case, the detection unit 205 can store the calculated cross-correlation tensor 408 in the storage unit 104. The time calculation unit 206 then inputs the cross-correlation tensor 408 based on the currently processed search region image and the cross-correlation tensors previously calculated and held by the storage unit 104 to the tracking duration calculation process 411 to calculate the tracking duration 412.

[0067] The number of cross-correlation tensors 408 that the time calculation unit 206 reads from the storage unit 104 can be predetermined, for example, set by the user in advance. The time calculation unit 206 can also integrate multiple cross-correlation tensors and input the data obtained by the integration into the tracking duration calculation process 411. For example, when calculating the tracking duration by inputting cross-correlation tensors into the CNN 721 as in Modification 1 above, the time calculation unit 206 can concatenate multiple cross-correlation tensors and input the concatenated cross-correlation tensor into the CNN 721. Note that the method of integrating multiple cross-correlation tensors is not limited to concatenation. For example, the time calculation unit 206 may calculate statistics such as the mean value. Specifically, it can calculate the mean value of corresponding elements in multiple cross-correlation tensors and input a tensor having these mean values as elements into the CNN 721. The time calculation unit 206 may also use the inference map 410 instead of the cross-correlation tensor 408 as the result of the detection process from the search region image.
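The two integration methods mentioned here, concatenation and element-wise mean, can be sketched as follows. The channel-first nested-list layout and the function names are assumptions, and concatenation along the channel axis is one possible choice that the specification does not fix.

```python
def concatenate_channels(tensors):
    """Concatenate equally sized C x H x W tensors along the channel axis."""
    return [channel for tensor in tensors for channel in tensor]

def mean_of_tensors(tensors):
    """Element-wise mean of equally shaped C x H x W tensors."""
    n = len(tensors)
    first = tensors[0]
    return [[[sum(t[c][y][x] for t in tensors) / n
              for x in range(len(first[c][y]))]
             for y in range(len(first[c]))]
            for c in range(len(first))]
```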

[0068] In this modified example, in S303, the detection unit 205 can store the results of the detection process in the storage unit 104. For example, if processing is being performed on a search region image captured at time t=n, the results of the detection process at time t=n (e.g., the cross-correlation tensor 408 or the inference map 410) can be stored in the storage unit 104. Also, in S304, the time calculation unit 206 can obtain the results of past detection processes from the storage unit 104. In this example, the time calculation unit 206 can obtain the results of the detection process at times t=1, ..., and n-1 (e.g., the cross-correlation tensor 408 or the inference map 410) from the storage unit 104. Furthermore, as described above, the number of detection process results to be obtained can be set in advance. The time calculation unit 206 then calculates the tracking duration based on the results of multiple detection processes, for example, the results of the detection process at times t=1, ..., n-1, and n.
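Holding a preset number of past detection results, as described for S303 and S304, can be sketched with a bounded buffer. The class name `DetectionHistory` and its methods are hypothetical, and an in-memory `deque` merely stands in for the storage unit 104.

```python
from collections import deque

class DetectionHistory:
    """Keeps only the most recent `max_results` detection results."""

    def __init__(self, max_results=5):
        self._results = deque(maxlen=max_results)  # oldest entries drop out automatically

    def store(self, result):
        """Called in S303 with, e.g., a cross-correlation tensor or inference map."""
        self._results.append(result)

    def recent(self):
        """Called in S304; returns stored results from oldest to newest."""
        return list(self._results)
```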

[0069] In this way, by storing information indicating the results of the detection process and using information indicating the results of multiple detection processes to calculate the tracking duration, the tracking duration can be calculated based on the time-dependent characteristics of the detection process results.

[0070] (Modification 3 of the method for determining the tracking duration) There are various methods for the time calculation unit 206 to calculate the tracking duration based on the results of the detection process. For example, the time calculation unit 206 can set the tracking duration based on information indicating the results of the detection process at the location of the target being tracked. Specifically, the time calculation unit 206 can calculate the tracking duration based on the inference value shown in the inference map 410, which indicates the probability of the target being tracked. If the inference value at the location of the target being tracked is large, it is estimated that the difficulty of detecting the target being tracked is low, and therefore the tracking duration can be increased. For example, if the inference value at the location of the target being tracked is a first value, the tracking duration can be set to T1, and if the inference value at the location of the target being tracked is a second value smaller than the first value, the tracking duration can be set to T2, which is shorter than T1.

[0071] Furthermore, the time calculation unit 206 may calculate statistics of the information indicating the results of the detection process for two or more images in the time series, and calculate the tracking duration based on these statistics. For example, the average of the inference values at the position of the tracking target on each image, as shown by the inference map 410 obtained from each of the multiple images in the time series, can be calculated, and the tracking duration can be calculated based on this average value. A large average value suggests that the difficulty of detecting the tracking target remains low. In this case as well, a larger average value allows for a longer tracking duration. Specifically, the time calculation unit 206 can set the tracking duration to T1 if the average value is greater than the threshold X, and to T2 (T1>T2) if it is less than the threshold X. Here, the threshold X and the tracking durations T1 and T2 may be predetermined design values, and the time calculation unit 206 can read these values from, for example, the storage unit 104. Note that instead of selecting the tracking duration from two classes such as T1 and T2, the tracking duration may be selected from three or more classes. Furthermore, the time calculation unit 206 may calculate the tracking duration from the statistics using a defined continuous function. With this configuration, even if a tracking target that was stably detected becomes temporarily undetectable due to being hidden by an obstacle or for other reasons, it is possible to prevent the tracking duration from becoming extremely short.
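The threshold rule with X, T1, and T2 can be sketched as below. The numeric defaults are invented design values, and because the text does not say what happens when the average equals X exactly, this sketch assumes that case falls to T2.

```python
def duration_from_mean_score(scores, threshold_x=0.5, t1=3.0, t2=1.0):
    """Select T1 or T2 (T1 > T2) from the mean inference value at the target position.

    `scores` holds one inference value per image in the time series.
    An average exactly equal to the threshold is assumed to give T2.
    """
    mean = sum(scores) / len(scores)
    return t1 if mean > threshold_x else t2
```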

[0072] Furthermore, the time calculation unit 206 may calculate the tracking duration according to the detection status of the target being tracked. For example, the time calculation unit 206 can set the tracking duration based on the detection status of the target being tracked from two or more images in a time series. Specifically, the time calculation unit 206 can set the tracking duration based on the number of times the target being tracked has not been detected from two or more images in a time series. That is, the time calculation unit 206 can calculate the tracking duration based on the number of times the target being tracked could not be detected (number of disappearances) in a predetermined number of past images. In this case, the fewer the number of disappearances, the longer the tracking duration can be. For example, if the number of disappearances exceeds a predetermined number, the time calculation unit 206 can shorten the tracking duration compared to when the number of disappearances does not exceed the predetermined number. With such a configuration, the tracking duration can be shortened when the target being tracked is difficult to detect, for example, due to the low quality of the tracking target template. In this case, the user can be prompted earlier to reset the tracking target image.
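Counting disappearances over a window of past images might look like the sketch below. The function name, the limit `max_misses`, and the two durations are assumed values chosen only to show the comparison described in the text.

```python
def duration_from_disappearances(detected_flags, max_misses=2,
                                 t_long=3.0, t_short=1.0):
    """Shorten the duration when misses in the recent window exceed `max_misses`.

    `detected_flags` holds one boolean per past image (True = target detected).
    """
    misses = sum(1 for detected in detected_flags if not detected)
    return t_short if misses > max_misses else t_long
```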

[0073] As yet another example, the time calculation unit 206 can set the tracking duration based on the movement of the tracked object, which is determined based on the position of the tracked object detected from two or more images in the time series. For example, the time calculation unit 206 may calculate the duration based on the motion vector of the tracked object shown by the result of the detection process. As the motion vector of the tracked object, the difference in the position of the tracked object detected in two consecutive search area images in the time series can be used. In this case, the larger the motion vector, the shorter the tracking duration can be. With such a configuration, the tracking duration can be set appropriately according to the motion vector. For example, the larger the movement of the tracked object, the more difficult it is expected to be to re-detect the tracked object after it is no longer detected. With such a configuration, the tracking duration is shorter when the movement of the tracked object is large, so tracking ends sooner when the tracked object is no longer detected.
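The motion-based rule can be illustrated with a function that decreases continuously as the motion vector grows. The particular reciprocal form, the constants, and the function name are assumptions; the text requires only that larger motion give a shorter duration.

```python
import math

def duration_from_motion(prev_pos, curr_pos, t_max=3.0, t_min=0.5, scale=50.0):
    """Map the motion vector length between two consecutive images to a duration.

    Zero motion gives `t_max`; the duration approaches `t_min` as motion grows.
    """
    magnitude = math.dist(prev_pos, curr_pos)  # length of the motion vector
    return t_min + (t_max - t_min) / (1.0 + magnitude / scale)
```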

[0074] (Modification 4 of the method for determining the tracking duration) The method by which the time calculation unit 206 calculates the tracking duration based on the results of the detection process is not limited to using the results of the cross-correlation process 407. For example, the tracking target can be detected from the search area image using an object recognition model that distinguishes the tracking target from other objects. In this case as well, the tracking duration can be calculated based on the results of the detection process from the search area image. As for specific methods for calculating the tracking duration, the methods already described can be used.

[0075] The following describes a method for calculating the tracking duration using the results of the detection process by the detection unit 205 and the object identification results using the object identification model. An example of the functional configuration in this modified example is shown in Figure 8. In addition to the configuration shown in Figure 2 already described, the information processing device 1200 in this modified example has an object identification unit 1201. The function of the object identification unit 1201 can be realized, for example, by the CPU 101 reading and executing a program stored in the ROM 102 or the storage unit 104.

[0076] In this embodiment, the object identification unit 1201 uses an object recognition model to identify a specific object in the image. In this example, the object identification unit 1201 uses an object recognition model that distinguishes between the tracked object and other objects, and performs object identification on the image 404 based on the feature quantities 406 of the image 404 generated by the feature calculation unit 202. The object identification method used by the object identification unit 1201 will be explained with reference to Figure 9. The object identification unit 1201 inputs the feature quantities 406 of the image 404 into the CNN 1401. The CNN 1401 is a model that distinguishes between the tracked object and other objects. The weight parameters of the CNN 1401 are obtained in advance through training and are stored in the storage unit 104. The CNN 1401, having received the feature quantities 406 as input, outputs an identification map 1402. The identification map 1402 can indicate the position of the tracked object in the image 404.

[0077] The time calculation unit 206 sets the tracking duration based on the information indicating the results of the detection process and the identification result by the object identification unit 1201. In this example, the time calculation unit 206 performs a tracking duration calculation process 1403 using the cross-correlation tensor 408 and the identification map 1402 as inputs, and calculates the tracking duration 1404. A CNN can be used for the tracking duration calculation process 1403. In this case, similar to Modification 2, the time calculation unit 206 can integrate the cross-correlation tensor 408 and the identification map 1402, and input the data obtained by the integration into the tracking duration calculation process 1403. For example, the time calculation unit 206 may perform this integration by concatenating the cross-correlation tensor 408 and the identification map 1402. Alternatively, the time calculation unit 206 may perform this integration by multiplying the cross-correlation tensor 408 and the identification map 1402 element by element. Alternatively, the time calculation unit 206 may calculate the tracking duration 1404 using the inference map 410 instead of the cross-correlation tensor 408. Furthermore, the time calculation unit 206 may input the tracking results of multiple tracking targets to the tracking duration calculation process 1403, similar to Modification 3.
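The element-by-element integration described here can be sketched for two maps of the same size. Treating both inputs as two-dimensional H x W nested lists is an assumption (for example, a single slice of the cross-correlation tensor 408), and the function name is hypothetical.

```python
def multiply_elementwise(correlation_map, identification_map):
    """Element-by-element product of two equally sized H x W maps."""
    return [[a * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(correlation_map, identification_map)]
```

With this form, positions that the identification map scores near zero suppress the corresponding correlation responses before the result is passed on.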

[0078] The learning method for the parameters of CNN402, CNN405, cross-correlation processing 407, CNN409, and tracking duration calculation processing 1403 is the same as in Modification 1 already described. Here, the case in which the weight parameters of CNN1401 are obtained by learning and stored in the storage unit 104 has been described. Alternatively, online learning of the weight parameters of CNN1401 may be performed based on an inference map 410 showing the results of the detection process, which is obtained based on the tracking target template 403 and the image 404. Such online learning can be performed, for example, according to the method described in Non-Patent Document 3.

[0079] The processing of this modified example can be achieved in S304 by having the object identification unit 1201 output the identification map 1402 as described above, and the time calculation unit 206 calculate the tracking duration using the cross-correlation tensor 408 and the identification map 1402 as described above.

[0080] On the other hand, the object identification unit 1201 may identify a different object instead of the target being tracked. For example, the object identification unit 1201 may identify an object that acts as an obstacle in tracking, such as a car or a building. The obstacle in tracking may be an object of the same category as the target being tracked, such as a person. By detecting such an object from image 404, it becomes possible to estimate the difficulty of detecting the target being tracked. That is, if such an obstacle is present, it is expected that the difficulty of detection will be high. Therefore, if the object identification unit 1201 detects such an obstacle, the tracking duration can be shortened.

[0081] As one variation, the object identification unit 1201 can use a model that identifies obstacle objects as the CNN 1401 shown in Figure 9. The weight parameters of such a CNN 1401 can be parameters obtained through learning and stored in the memory unit 104. For learning the CNN parameters, for example, the method described in Non-Patent Document 2 can be used. In this case as well, the time calculation unit 206 can calculate the tracking duration using the cross-correlation tensor 408 (or inference map 410) and the identification map 1402 as described above.

[0082] In this way, by using an object identification model that identifies the tracking target, the tracking duration can be set more appropriately. For example, even if a candidate object is close to the tracking target during tracking, if the object identification model can distinguish the tracking target from the candidate object, the tracking duration is set longer, which reduces the burden on the user of resetting the tracking target. Furthermore, by using a model that identifies obstacles as the object identification model, situations where tracking is likely to fail can be recognized, and the tracking duration can be set more appropriately.
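
The behavior summarized in this paragraph can be approximated with a hand-written rule. In the embodiment the tracking duration is calculated by a CNN, so the sketch below is only a stand-in under stated assumptions: durations are counted in frames, a nearby candidate object shortens the duration, a confident identification of the tracking target lengthens it, and a likely obstacle shortens it. All function names, thresholds, and scaling factors are hypothetical.

```python
import math
from typing import Iterable, Tuple

Point = Tuple[float, float]


def duration_from_distance(target: Point, others: Iterable[Point],
                           near_px: float = 50.0, far_px: float = 200.0,
                           short_frames: int = 10, long_frames: int = 60) -> int:
    """Map the distance to the nearest other candidate onto a duration in frames."""
    distances = [math.dist(target, p) for p in others]
    if not distances:
        return long_frames  # no other candidate object was detected
    # Linear interpolation between the short and long durations, clamped to [0, 1].
    t = (min(distances) - near_px) / (far_px - near_px)
    t = max(0.0, min(1.0, t))
    return round(short_frames + t * (long_frames - short_frames))


def adjust_by_identification(base_frames: int, target_score: float,
                             obstacle_score: float, max_frames: int = 120) -> int:
    """Lengthen when the target is confidently identified, shorten when an obstacle is likely."""
    for score in (target_score, obstacle_score):
        if not 0.0 <= score <= 1.0:
            raise ValueError("scores must lie in [0, 1]")
    factor = (1.0 + target_score) * (1.0 - 0.5 * obstacle_score)
    return max(1, min(max_frames, round(base_frames * factor)))
```

For example, a candidate object 50 pixels from the tracking target gives a base duration of 10 frames, which a confident identification (score 1.0, no obstacle) extends to 20 frames.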

(Notification method for tracking duration)

[0083] The time calculation unit 206 can notify the user of the tracking duration. The notification method is not particularly limited. For example, the time calculation unit 206 may display a numerical value indicating the tracking duration on the display unit 105, or it may display visual information other than a numerical value indicating the tracking duration on the display unit 105. In addition, the tracking control unit 207 may notify the user of the completion of the tracking process of the target, for example, via the display unit 105.

[0084] An example of display on the display unit 105 will be explained with reference to Figure 10. As shown in Figure 10, the display unit 105 can display the search area image 1501. The search area image 1501 shows the tracking target 1502, and further displays a detection frame 1503 indicating the tracking target detected by the tracking process. In the example in Figure 10, the detection frame 1503 is displayed in a format corresponding to the tracking duration. For example, one of several formats is selected depending on the length of the tracking duration. Each of the multiple formats differs in, for example, the color of the border, the type of border (e.g., solid or dashed line), or the thickness of the border. The notification method is not limited to this method; for example, the color of the area surrounded by the detection frame 1503 may be changed according to the tracking duration.
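
The selection of a display format according to the tracking duration can be sketched as a simple lookup. The thresholds, colors, and the `FrameStyle` type below are hypothetical and chosen only for illustration. The duration is assumed to be expressed in frames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FrameStyle:
    color: str      # border color
    line: str       # "solid" or "dashed"
    thickness: int  # border thickness in pixels


# (exclusive upper bound in frames, style), checked in ascending order.
_STYLES = [
    (10, FrameStyle("red", "dashed", 1)),
    (30, FrameStyle("yellow", "solid", 2)),
]
_LONGEST = FrameStyle("green", "solid", 3)


def style_for_duration(duration_frames: int) -> FrameStyle:
    """Pick the detection-frame style that corresponds to the tracking duration."""
    for upper_bound, style in _STYLES:
        if duration_frames < upper_bound:
            return style
    return _LONGEST
```

With these example thresholds, a short duration produces a thin red dashed frame, which can prompt the user to reset the tracking target template early.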

[0085] If the tracking target template set by the user is inappropriate, or if the appearance of the tracked target changes significantly, the tracking duration is likely to be shortened. By notifying the user of the tracking duration in this way, it is possible to prompt the user to reset the tracking target template early when the tracking duration is shortened.

(Other examples)

[0086] According to one embodiment of the present invention, when the tracking target is not detected in the image, the tracking control unit 207 determines whether or not to terminate the tracking process based on the difficulty of detecting the tracking target estimated based on information indicating the results of the detection process. As already described, the difficulty of detecting the tracking target can be estimated by the time calculation unit 206 based on information indicating the results of the detection process. Furthermore, as described above, the tracking control unit 207 can terminate the tracking process depending on whether the length of time during which the tracking target is not detected in the image exceeds the tracking duration determined according to the difficulty of detection, and the higher the difficulty of detection, the shorter the tracking duration can be. However, it is not essential for the tracking control unit 207 to set the tracking duration. For example, when the tracking target is not detected in the image, the tracking control unit 207 may terminate the tracking process if the difficulty of detection exceeds a threshold, and may continue the tracking process if the difficulty of detection is below the threshold. With such a configuration, it is possible to appropriately control whether or not to continue the tracking process according to the difficulty of detecting the tracking target.
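
The two termination rules described in this paragraph can be sketched as follows. The sketch assumes that time is counted in frames and that a difficulty equal to the threshold continues tracking, which the text leaves open. The class and function names are hypothetical.

```python
class TrackingController:
    """Duration-based rule: stop when the target stays undetected longer than the duration."""

    def __init__(self) -> None:
        self.missed_frames = 0

    def update(self, detected: bool, tracking_duration: int) -> bool:
        """Return True to continue tracking, False to terminate."""
        if detected:
            self.missed_frames = 0  # the non-detection period ends
            return True
        self.missed_frames += 1
        return self.missed_frames <= tracking_duration


def continue_by_difficulty(detected: bool, difficulty: float, threshold: float) -> bool:
    """Threshold-based rule that does not set a tracking duration.

    Assumption: a difficulty exactly equal to the threshold continues tracking.
    """
    if detected:
        return True
    return difficulty <= threshold
```

A possible use is to call `update` once per acquired image with the latest tracking duration, so that a duration shortened by a high difficulty of detection takes effect immediately.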

[0087] The description so far has mainly covered the case where tracking processing is performed using a CNN. However, the convolutional processing performed by the feature calculation unit 202, template generation unit 204, and detection unit 205 using a CNN can also be performed using other neural network modules. For example, the convolutional layers of the CNN can be replaced with a fully connected neural network or with a Transformer module that includes a self-attention mechanism. By using a self-attention mechanism, high-precision inference becomes possible even when using a DNN with a small number of weight parameters.

[0088] For example, the feature calculation unit 202 and the template generation unit 204 can generate features from an image using a neural network including fully connected layers, a neural network including convolutional layers, or a Transformer-type neural network. The time calculation unit 206 can calculate the tracking duration from information indicating the result of the tracking target detection process using a neural network including fully connected layers or a neural network including convolutional layers. Furthermore, the object identification model used by the object identification unit 1201 may have a neural network including fully connected layers or a neural network including convolutional layers.

[0089] The present invention can also be realized by supplying a program that implements one or more of the functions of the above-described embodiments to a system or device via a network or storage medium, and by having one or more processors in the computer of that system or device read and execute the program. It can also be realized by a circuit (e.g., an ASIC) that implements one or more functions.

[0090] The disclosures herein include the following information processing devices, information processing methods, and programs.

[0091] (Item 1) An information processing device that performs processing of tracking a tracking target in time-series images, the information processing device comprising: a first acquisition means for acquiring the time-series images; a second acquisition means for acquiring information that identifies the tracking target; a detection means for performing a detection process that detects, from the time-series images, an object conforming to the information that identifies the tracking target; a setting means for setting a tracking duration based on information indicating the results of the detection process; and a control means for terminating the processing of tracking the tracking target if the length of time during which the tracking target is not detected in the images exceeds the tracking duration.

[0092] (Item 2) The information processing device according to item 1, wherein the detection means detects one or more candidate objects for the tracking target according to the information that identifies the tracking target and further determines the tracking target from among the candidate objects, and the setting means sets the tracking duration based on information indicating the positions of the candidate objects.

[0093] (Item 3) The information processing device according to item 2, wherein the setting means sets the tracking duration based on the distance between the position of the tracking target determined from among the plurality of candidate objects and the positions of candidate objects other than the tracking target.

[0094] (Item 4) The information processing device according to item 1, wherein the setting means sets the tracking duration based on information indicating the results of the detection process for two or more images among the time-series images.

[0095] (Item 5) The information processing device according to item 1 or 4, wherein the detection means generates the information indicating the result of the detection process based on the cross-correlation between feature quantities of the image and feature quantities of an image of the tracking target, and the setting means calculates the tracking duration from the information indicating the result of the detection process using a neural network.

[0096] (Item 6) The information processing apparatus according to item 5, characterized in that the neural network is a neural network including a fully connected layer or a neural network including a convolutional layer.

[0097] (Item 7) The information processing device according to item 1 or 4, wherein the detection means generates the information indicating the result of the detection process based on the cross-correlation between feature quantities at multiple locations in the image and feature quantities of an image of the tracking target, and the setting means sets the tracking duration based on the information indicating the result of the detection process at the position of the tracking target.

[0098] (Item 8) The information processing apparatus according to any one of items 1 to 7, characterized in that the information showing the result of the detection process is the result of the cross-correlation between the feature quantities of the image and the feature quantities of the image of the target to be tracked, or information showing the likelihood that the target to be tracked exists at each position of the image.

[0099] (Item 9) The information processing device according to item 4, wherein the setting means sets the tracking duration based on the detection status of the tracking target from two or more images among the time-series images.

[0100] (Item 10) The information processing device according to item 9, wherein the setting means sets the tracking duration based on the number of times the tracking target has not been detected from two or more images in the time series.

[0101] (Item 11) The information processing device according to item 10, wherein the setting means sets the tracking duration based on the movement of the tracking target, which is determined based on the position of the tracking target detected from two or more images in the time series.

[0102] (Item 12) The information processing device according to any one of items 1 to 11, further comprising an identification means that identifies a specific object in the image using an object identification model, wherein the setting means sets the tracking duration based on the information indicating the result of the detection process and the identification result by the identification means.

[0103] (Item 13) The information processing apparatus according to item 12, characterized in that the aforementioned specific object is the target to be tracked or an obstacle.

[0104] (Item 14) The information processing apparatus according to item 12 or 13, characterized in that the object identification model has a neural network including a fully connected layer or a neural network including a convolutional layer.

[0105] (Item 15) The information processing device according to any one of items 1 to 14, characterized in that the detection means uses a Siamese-type neural network to identify the position or range of an object according to the information that identifies the target to be tracked.

[0106] (Item 16) The information processing device according to any one of items 1 to 15, wherein the second acquisition means acquires an image of the tracking target, and the detection means generates feature quantities of the time-series images from the images and generates feature quantities of the tracking target from the image of the tracking target, using a neural network including a fully connected layer, a neural network including a convolutional layer, or a Transformer-type neural network.

[0107] (Item 17) The information processing device according to any one of items 1 to 16, characterized in that the detection means identifies the position or range of the tracking target on the image based on the detection process.

[0108] (Item 18) The information processing device according to item 17, wherein the detection means further tracks the tracking target between the time-series images based on information indicating the position or range of the tracking target in the time-series images.

[0109] (Item 19) The information processing device according to any one of items 1 to 18, wherein the setting means notifies the user of the tracking duration.

[0110] (Item 20) An information processing device that performs processing of tracking a tracking target in time-series images, the information processing device comprising: a first acquisition means for acquiring each of the time-series images; a second acquisition means for acquiring information that identifies the tracking target; a detection means for performing a detection process that detects, from the time-series images, an object conforming to the information that identifies the tracking target; and a control means for determining, if the tracking target is not detected in the image, whether or not to terminate the processing of tracking the tracking target, based on the difficulty of detecting the tracking target estimated from the information indicating the result of the detection process.

[0111] (Item 21) An information processing method performed by an information processing device that performs processing of tracking a tracking target in time-series images, the method comprising: a first acquisition step of acquiring the time-series images; a second acquisition step of acquiring information that identifies the tracking target; a detection step of performing a detection process that detects, from the time-series images, an object conforming to the information that identifies the tracking target; a setting step of setting a tracking duration based on information indicating the results of the detection process; and a control step of terminating the processing of tracking the tracking target if the length of time during which the tracking target is not detected in the images exceeds the tracking duration.

[0112] (Item 22) A program that causes a computer to function as the information processing device according to any one of items 1 to 19.
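
Items 9 and 10 above set the tracking duration from the detection status over two or more images. One possible rule is sketched below. It assumes that durations are counted in frames and that more frequent non-detection in a recent window indicates a harder target and therefore a shorter duration. That direction, the window size, and the class name are assumptions made for illustration and are not stated in the items.

```python
from collections import deque


class MissCountDuration:
    """Derive a tracking duration from the number of recent non-detections."""

    def __init__(self, window: int = 30, short_frames: int = 10,
                 long_frames: int = 60) -> None:
        self.history = deque(maxlen=window)  # True = detected, False = not detected
        self.short_frames = short_frames
        self.long_frames = long_frames

    def update(self, detected: bool) -> int:
        """Record one detection result and return the updated duration in frames."""
        self.history.append(detected)
        misses = sum(1 for d in self.history if not d)
        miss_ratio = misses / len(self.history)
        span = self.long_frames - self.short_frames
        # Assumption: the duration shrinks linearly as the miss ratio grows.
        return round(self.long_frames - miss_ratio * span)
```

Because the window is bounded, old non-detections drop out of the count and the duration recovers once the tracking target is detected reliably again.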

[0113] The invention is not limited to the embodiments described above, and various modifications and variations are possible without departing from the spirit and scope of the invention. Accordingly, claims are attached to disclose the scope of the invention.

[Explanation of Symbols]

[0114] 200: Information processing device, 201: First image acquisition unit, 202: Feature calculation unit, 203: Second image acquisition unit, 204: Template generation unit, 205: Detection unit, 206: Time calculation unit, 207: Tracking control unit, 208: Learning unit, 1200: Information processing unit, 1201: Object identification unit

Claims

1. An information processing device that performs processing of tracking a tracking target in time-series images, the information processing device comprising: a first acquisition means for acquiring the time-series images; a second acquisition means for acquiring information that identifies the tracking target; a detection means for performing a detection process that detects, from the time-series images, one or more candidate objects for the tracking target according to the information that identifies the tracking target, and for determining the tracking target from among the candidate objects; a setting means for setting a tracking duration based on the distance between the position of the tracking target determined from among a plurality of the candidate objects and the positions of the candidate objects other than the tracking target; and a control means for terminating the processing of tracking the tracking target if the length of time during which the tracking target is not detected in the images exceeds the tracking duration.

2. The information processing apparatus according to claim 1, wherein the setting means sets the tracking duration based on information indicating the results of the detection process for two or more images among the time-series images.

3. The information processing apparatus according to claim 1, wherein the detection means generates information indicating the result of the detection process based on the cross-correlation between feature quantities of the image and feature quantities of an image of the tracking target, and the setting means calculates the tracking duration from the information indicating the result of the detection process using a neural network.

4. The information processing apparatus according to claim 3, characterized in that the neural network is a neural network including a fully connected layer or a neural network including a convolutional layer.

5. The information processing apparatus according to claim 1, wherein the detection means generates information indicating the result of the detection process based on the cross-correlation between feature quantities at multiple locations in the image and feature quantities of an image of the tracking target, and the setting means sets the tracking duration based on the information indicating the result of the detection process at the position of the tracking target.

6. The information processing apparatus according to claim 1, characterized in that the information indicating the result of the detection process is the result of the cross-correlation between the feature quantities of the image and the feature quantities of the image of the target to be tracked, or information indicating the likelihood that the target to be tracked exists at each position of the image.

7. The information processing apparatus according to claim 2, wherein the setting means sets the tracking duration based on the detection status of the tracking target from two or more images among the time-series images.

8. An information processing device that performs processing of tracking a tracking target in time-series images, the information processing device comprising: a first acquisition means for acquiring the time-series images; a second acquisition means for acquiring information that identifies the tracking target; a detection means for performing a detection process that detects, from the time-series images, an object conforming to the information that identifies the tracking target; a setting means for setting a tracking duration based on the number of times the tracking target was not detected from two or more images among the time-series images; and a control means for terminating the processing of tracking the tracking target if the length of time during which the tracking target is not detected in the images exceeds the tracking duration.

9. The information processing apparatus according to claim 8, wherein the setting means sets the tracking duration based on the movement of the tracking target, which is determined based on the position of the tracking target detected from two or more images in the time series.

10. The information processing apparatus according to claim 1, further comprising an identification means that identifies a specific object in the image using an object identification model, wherein the setting means sets the tracking duration based on information indicating the result of the detection process and the identification result by the identification means.

11. The information processing apparatus according to claim 10, characterized in that the specific object is the target to be tracked or an obstacle.

12. The information processing apparatus according to claim 10, characterized in that the object identification model has a neural network including a fully connected layer or a neural network including a convolutional layer.

13. The information processing apparatus according to claim 1, wherein the detection means uses a Siamese-type neural network to identify the position or range of an object according to the information that identifies the target to be tracked.

14. The information processing apparatus according to claim 1, wherein the second acquisition means acquires an image of the tracking target, and the detection means generates feature quantities of the time-series images from the images and generates feature quantities of the image of the tracking target from that image, using a neural network including fully connected layers, a neural network including convolutional layers, or a Transformer-type neural network.

15. The information processing apparatus according to claim 1, wherein the detection means identifies the position or range of the tracking target on the image based on the detection process.

16. The information processing apparatus according to claim 15, wherein the detection means further tracks the tracking target between the time-series images based on information indicating the position or range of the tracking target in the time-series images.

17. The information processing device according to claim 1, wherein the setting means notifies the user of the tracking duration.

18. An information processing method performed by an information processing device that performs processing of tracking a tracking target in time-series images, the method comprising: a first acquisition step of acquiring the time-series images; a second acquisition step of acquiring information that identifies the tracking target; a detection step of performing a detection process that detects, from the time-series images, one or more candidate objects for the tracking target according to the information that identifies the tracking target, and of determining the tracking target from among the candidate objects; a setting step of setting a tracking duration based on the distance between the position of the tracking target determined from among a plurality of the candidate objects and the positions of the candidate objects other than the tracking target; and a control step of terminating the processing of tracking the tracking target if the length of time during which the tracking target is not detected in the images exceeds the tracking duration.

19. An information processing method performed by an information processing device that performs processing of tracking a tracking target in time-series images, the method comprising: a first acquisition step of acquiring the time-series images; a second acquisition step of acquiring information that identifies the tracking target; a detection step of performing a detection process that detects, from the time-series images, an object conforming to the information that identifies the tracking target; a setting step of setting a tracking duration based on the number of times the tracking target was not detected from two or more images among the time-series images; and a control step of terminating the processing of tracking the tracking target if the length of time during which the tracking target is not detected in the images exceeds the tracking duration.

20. A program for causing a computer to function as an information processing device according to any one of claims 1 to 17.