Information processing device, information processing method, and recording medium
The information processing device addresses inefficiencies in image processing by standardizing target region sizes and shapes, reducing computational load and enhancing tracking accuracy through integrated feature vectors.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- NEC CORP
- Filing Date
- 2024-12-16
- Publication Date
- 2026-06-25
AI Technical Summary
Existing image processing technologies face inefficiencies in processing images of varying sizes and shapes of tracking targets, leading to increased computational load and complexity in obtaining visual feature vectors.
An information processing device that cuts out target regions of a common size and shape for multiple tracking targets, allowing direct processing without resizing, and integrates visual and positional feature vectors to reduce computational load.
Reduces processing load and enhances accuracy in tracking targets across multiple images by standardizing target region sizes and shapes, facilitating efficient generation of visual feature vectors.
Description
[0001] This disclosure relates to an information processing device, an information processing method, and a recording medium.
[0002] In recent years, various image processing technologies have emerged. Patent Document 1 discloses a technology for detecting a person through a bounding box that encloses the area occupied by the person, and for processing the image within the bounding box. The shape and size of the bounding box differ for each person.
[0003] Patent Document 1: Japanese Patent Publication No. 2024-76312
[0004] One example of the purpose of this disclosure is to advance image processing technology.
[0005] According to one aspect of this disclosure, an information processing device is provided comprising: detection means for detecting a target to be tracked from an image and acquiring location information of the target to be tracked; extraction means for cutting out a target region within the image that relates to the target to be tracked and has a common size for multiple targets to be tracked, based on the location information; and acquisition means for acquiring a visual feature vector from the target region.
[0006] Furthermore, according to one aspect of this disclosure, an information processing method is provided in which one or more computers detect a target to be tracked from an image, acquire location information of the target to be tracked, cut out, based on the location information, a target region within the image that relates to the target to be tracked and has a common size for multiple targets to be tracked, and acquire a visual feature vector from the target region.
[0007] Furthermore, according to one aspect of this disclosure, a recording medium is provided that records a program causing a computer to function as: detection means for detecting a target to be tracked from an image and acquiring location information of the target to be tracked; extraction means for cutting out, based on the location information, a target region within the image that relates to the target to be tracked and has a common size for multiple targets to be tracked; and acquisition means for acquiring a visual feature vector from the target region.
[0008] Figure 1 is a diagram showing an example of a functional block diagram of an information processing device.
Figure 2 is a flowchart showing an example of the processing flow of an information processing device.
Figure 3 is a diagram illustrating an example of the processing of an information processing device.
Figure 4 is a diagram illustrating an example of the processing of a comparative example.
Figure 5 is a diagram showing an example of the hardware configuration of an information processing device.
Figure 6 is a diagram illustrating another example of the processing of an information processing device.
Figure 7 is a diagram showing another example of a functional block diagram of an information processing device.
Figure 8 is a diagram illustrating another example of the processing of an information processing device.
Figure 9 is a diagram illustrating another example of the processing of an information processing device.
Figure 10 is a diagram illustrating another example of the processing of an information processing device.
Figure 11 is a diagram showing an example of a functional block diagram of a cross-attention mechanism of an information processing device.
Figure 12 is a diagram illustrating another example of the processing flow of an information processing device.
Figure 13 is a flowchart illustrating another example of the processing flow of an information processing device.
Figure 14 is a diagram illustrating another example of the processing of an information processing device.
[0009] The embodiments of this disclosure will be described below with reference to the drawings. In this disclosure, the drawings are associated with one or more embodiments. In all drawings, similar components are denoted by the same reference numerals, and their descriptions are omitted where appropriate.
[0010] <<First Embodiment>> Figure 1 is a functional block diagram showing an overview of the information processing device 10. Figure 2 is a flowchart showing an example of the processing flow executed by the information processing device 10.
[0011] As shown in Figure 1, the information processing device 10 includes a detection unit 11, a cropping unit 12, and an acquisition unit 13. These functional units execute the processes shown in the flowchart of Figure 2.
[0012] In S10, the detection unit 11 detects the target to be tracked from the image and acquires the location information of the target to be tracked. In S11, the cropping unit 12 crops out a target region from the image that relates to the target to be tracked and has a common size for multiple targets to be tracked, based on the location information. In S12, the acquisition unit 13 acquires visual feature vectors from the target region.
[0013] In this way, the information processing device 10 extracts a "target region of a common size for multiple tracking targets" for each tracking target, processes the extracted target region, and obtains a visual feature vector for each tracking target. That is, the size of the target region extracted from the image for each tracking target in order to obtain a visual feature vector for each tracking target is common (identical) for multiple tracking targets (all tracking targets).
[0014] These features are illustrated using Figure 3. As shown in Figure 3, the information processing device 10 cuts out target regions w1 to w3, one for each of the multiple tracking targets, from the image P. The information processing device 10 then processes each of the target regions w1 to w3 in a conversion unit (the component that converts an image into a visual feature vector) to obtain a visual feature vector for each tracking target. The size and shape of the target regions w1 to w3 are common (identical) across the tracking targets. They are, for example, the size and shape that match the input of the conversion unit. In this case, the cut-out target regions w1 to w3 can be input to the conversion unit and processed without any preprocessing such as image resizing.
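The fixed-size cut-out described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: the function name `crop_fixed`, centring the window on a single point, and zero-padding at the image border are all assumptions that the source does not specify.

```python
from typing import List, Tuple

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def crop_fixed(image: Image, center: Tuple[int, int],
               size: Tuple[int, int]) -> Image:
    """Cut out a size=(width, height) window centred on center=(x, y).

    Pixels that fall outside the image are filled with 0 (assumption:
    the source does not say how image borders are handled).
    """
    cx, cy = center
    w, h = size
    x0, y0 = cx - w // 2, cy - h // 2
    img_h, img_w = len(image), len(image[0])
    out = []
    for y in range(y0, y0 + h):
        row = []
        for x in range(x0, x0 + w):
            inside = 0 <= y < img_h and 0 <= x < img_w
            row.append(image[y][x] if inside else 0)
        out.append(row)
    return out


# Hypothetical 6x6 image and two target positions; both cut-outs are 4x4.
image = [[10 * y + x for x in range(6)] for y in range(6)]
w1 = crop_fixed(image, (1, 1), (4, 4))
w2 = crop_fixed(image, (4, 3), (4, 4))
```

Because every cut-out already has the same width and height, both `w1` and `w2` could be passed to the same conversion step without a resize in between.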
[0015] Next, Figure 4 illustrates the processing of the comparative example. As shown in Figure 4, the comparative example cuts out rectangular regions w'1 to w'3, one for each of the multiple tracking targets, from the image P. Each rectangular region w'1 to w'3 encompasses the entirety of the corresponding tracking target. The size and shape of each rectangular region w'1 to w'3 are determined by the size and shape of the corresponding tracking target within the image. Therefore, as shown in the figure, the sizes and shapes of the rectangular regions w'1 to w'3 may differ from one another. The comparative example then resizes each of the rectangular regions w'1 to w'3 to match the input of the conversion unit, and processes them in the conversion unit to obtain a visual feature vector for each tracking target.
[0016] In this way, the comparative example cuts out from the image a rectangular region whose size and shape depend on the size and shape of the detected tracking target within the image. Next, the comparative example resizes the images of the rectangular regions, whose sizes and shapes vary for each tracking target, to unify their size and shape. Then, the comparative example processes the resized images with the conversion unit to obtain visual feature vectors. In contrast, the information processing device 10 cuts out from the image a target region whose size and shape are common to all tracking targets, regardless of the size and shape of each detected tracking target within the image. For this reason, the size and shape are already unified at the time an image is cut out for each tracking target. The information processing device 10 processes the image cut out for each tracking target with the conversion unit, without resizing or the like, to obtain a visual feature vector.
[0017] Because of this difference in processing, the information processing device 10 can obtain a visual feature vector with fewer layers than the comparative example. That is, the information processing device 10 can reduce the processing load on the computer for obtaining the visual feature vector compared to the comparative example. Such an information processing device 10 can thus reduce the processing load on the computer for generating a visual feature vector for each tracking target detected from the image.
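To make the contrast concrete, the sketch below shows the extra resizing step that the comparative example needs. It is illustrative only: `crop_box` and `resize_nearest` are hypothetical helpers, and nearest-neighbour sampling is an assumption, since the source does not say how the comparative example resizes.

```python
def crop_box(image, box):
    """Cut out the detected rectangle box=(x, y, width, height)."""
    x, y, w, h = box
    return [row[x:x + w] for row in image[y:y + h]]


def resize_nearest(patch, out_w, out_h):
    """Resize a patch to out_w x out_h by nearest-neighbour sampling."""
    in_h, in_w = len(patch), len(patch[0])
    return [[patch[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]


# Hypothetical 8x8 image with two differently sized detections.
image = [[10 * y + x for x in range(8)] for y in range(8)]
boxes = [(0, 0, 2, 2), (2, 1, 3, 6)]
# Comparative example: each patch must be resized to the common input size.
inputs = [resize_nearest(crop_box(image, b), 4, 4) for b in boxes]
```

With a fixed-size target region, the `resize_nearest` call disappears, because each cut-out already has the input size.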
[0018] <<Second Embodiment>> <Overview> The information processing device 10 according to the second embodiment is a more specific form of the information processing device 10 according to the first embodiment. It tracks each tracking target based on an integrated vector obtained, for each tracking target, by integrating the visual feature vector generated in the processing according to the first embodiment with a position feature vector generated based on the target's position within the image. This will be described in detail below.
[0019] <Hardware Configuration> First, an example of the hardware configuration of the information processing device 10 will be described. Each functional unit of the information processing device 10 is realized by an arbitrary combination of hardware and software. Those skilled in the art will understand that there are various modifications to the method and device for realizing them. The software includes programs stored in advance at the stage of shipping the device, programs downloaded from recording media such as CDs (Compact Discs) or from servers on the Internet, and the like.
[0020] Figure 5 is a block diagram illustrating the hardware configuration of the information processing device 10. As shown in Figure 5, the information processing device 10 includes a processor 1A, a memory 2A, an input/output interface 3A, a peripheral circuit 4A, and a bus 5A. The peripheral circuit 4A includes various modules. The information processing device 10 need not include the peripheral circuit 4A. Note that the information processing device 10 may be composed of a plurality of physically and/or logically separate devices. In this case, each of the plurality of devices can have the above hardware configuration.
[0021] The bus 5A is a data transmission path through which the processor 1A, the memory 2A, the peripheral circuit 4A, and the input/output interface 3A transmit and receive data to and from each other. The processor 1A is an arithmetic processing device such as a CPU (Central Processing Unit) or a GPU (Graphics Processing Unit). The memory 2A is a memory such as a RAM (Random Access Memory) or a ROM (Read Only Memory). The input/output interface 3A includes an interface for acquiring information from an input device, an external device, an external server, an external sensor, a camera, etc., and an interface for outputting information to an output device, an external device, an external server, etc. Further, the input/output interface 3A includes an interface for connecting to a communication network such as the Internet. The input device is, for example, a keyboard, a mouse, a microphone, a physical button, a touch panel, etc. The output device is, for example, a display, a projection device, a speaker, a printer, a mailer, etc. The processor 1A can issue commands to each module and perform arithmetic operations based on their results.
[0022] <Tracking Method> Next, the tracking method executed by the information processing device 10 will be explained with reference to Figure 6. Figure 6 is a conceptual diagram showing an example of a tracking method using query propagation.
[0023] In Figure 6, the information processing device 10 is configured to perform a tracking process for tracking targets included in a video. In the tracking process performed by the information processing device 10, a combined vector for each target is obtained as a "detection query," which is a combination of the visual feature vector for each target generated in the processing of the first embodiment and the position feature vector for each target generated based on its position in the image. Then, using this detection query, the "tracking query," which is a query for tracking the target, is updated. By updating this tracking query while propagating it over time, the information processing device 10 can acquire features that are effective for tracking at a specific time, enabling highly accurate tracking. These queries are set for each target. Therefore, if a video contains multiple targets, the query corresponding to each of the multiple targets will be updated.
[0024] For example, in the example shown in Figure 6, tracking targets A and B are detected from a frame captured at time T1. Therefore, the detection query at time T1 includes detection queries for tracking targets A and B. The tracking query is then updated using the detection query at time T1. As a result, the tracking query at time T2 includes tracking queries for tracking targets A and B.
[0025] Next, a new target C is detected in the frame captured at time T2. Therefore, the detection query at time T2 includes a detection query for target C. The tracking query is then updated using the detection query at time T2. As a result, the tracking query at time T3 includes the tracking queries for targets A and B that were included in the tracking query at time T2 (in other words, propagated from the previous time), and the tracking query for the newly detected target C.
[0026] Furthermore, only tracking targets A and C were detected in the frame captured at time T4, while tracking target B was not detected. Therefore, the tracking query for tracking target B is missing from the tracking query at time T4, and the tracking queries for tracking targets A and C are included.
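The bookkeeping in the walk-through above (targets A and B first, C added later, B dropped when undetected) can be sketched as follows. This is a simplified illustration under strong assumptions: target identities are taken as already known, and the blend controlled by the hypothetical parameter `alpha` merely stands in for the actual query update, which this passage does not detail.

```python
def propagate(tracking, detections, alpha=0.5):
    """Return the next tracking queries from this frame's detection queries.

    tracking / detections: dict mapping a target id to a feature vector.
    alpha: hypothetical blend weight given to the propagated query.
    """
    updated = {}
    for target_id, det in detections.items():
        prev = tracking.get(target_id)
        if prev is None:
            # Newly detected target: start its tracking query from the detection.
            updated[target_id] = list(det)
        else:
            # Known target: mix the propagated query with the new detection.
            updated[target_id] = [alpha * p + (1 - alpha) * d
                                  for p, d in zip(prev, det)]
    # Targets without a detection in this frame are not carried over.
    return updated


queries = propagate({}, {"A": [1.0, 0.0], "B": [0.0, 1.0]})
queries = propagate(queries, {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]})
queries = propagate(queries, {"A": [3.0, 0.0], "C": [1.0, 1.0]})  # B not detected
```

After the last call, `queries` holds entries for A and C only, mirroring the case in which tracking target B is no longer detected.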
[0027] <Functional Configuration> Next, the functional configuration of the information processing device 10 will be described in detail. Figure 7 is an example of a functional block diagram of the information processing device 10. As shown in the figure, the information processing device 10 has a detection unit 11, a cropping unit 12, an acquisition unit 13, an integration unit 14, an update unit 15, and a restoration unit 16.
[0028] The detection unit 11 detects the target to be tracked from the image and acquires the location information of the target to be tracked.
[0029] An "image" is one of several images in a time-series collection. For example, an image may be one of a group of frame images that make up a video. Alternatively, an image may be one of a group of still images taken in sequence using a continuous shooting function, etc.
[0030] The detection unit 11 can acquire images one by one as processing targets from a time-series collection of images, in chronological order (that is, in the order in which they were captured). Alternatively, the detection unit 11 may acquire the images one by one in reverse order of capture. The latter is suitable, for example, when tracking a target while playing back the time-series images in reverse (from the end to the beginning).
[0031] The detection unit 11 may acquire all of the multiple time-series images as processing targets in chronological order, or it may acquire a portion of the multiple time-series images as processing targets intermittently at predetermined intervals. Alternatively, the detection unit 11 may acquire images generated by the camera in real time and use the acquired images as processing targets. In addition, the detection unit 11 may retrospectively acquire a collection of multiple time-series images generated by the camera and then acquire them one by one as processing targets from that collection of multiple images. The camera and the information processing device 10 may be connected in a way that allows communication. Furthermore, images generated by the camera may be input to the information processing device 10 by any means.
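The acquisition orders described above (every image, a subset at a fixed interval, forward or reverse) amount to simple sequence slicing. The helper below is a hypothetical sketch; the name `select_frames` and its parameters are not from the source.

```python
def select_frames(frames, step=1, reverse=False):
    """Pick processing targets from a time-ordered list of frames.

    step:    take every step-th frame (1 means all frames).
    reverse: iterate from the last frame to the first.
    """
    ordered = list(frames)[::-1] if reverse else list(frames)
    return ordered[::step]


frame_ids = [0, 1, 2, 3, 4, 5]
every_frame = select_frames(frame_ids)
every_other = select_frames(frame_ids, step=2)
backwards = select_frames(frame_ids, reverse=True)
```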
[0032] A "camera" has the function of detecting visible light and creating an image. As an alternative, the camera may also have the function of detecting and creating an image of other electromagnetic waves such as infrared rays. The camera may also be a surveillance camera installed in a predetermined location. In addition, the camera may be a mobile camera installed on a moving object and used to photograph the inside or outside of the moving object. Examples of mobile objects include, but are not limited to, bicycles, motorcycles, passenger cars, buses, trucks, trains, airplanes, ships, and drones. In addition, the camera may be a camera carried by a person and used to take pictures in various locations.
[0033] A "tracking target" is an object detected within an image and tracked across multiple images in a time series. The tracking target may be a person, an animal such as a dog, cat, or horse, or other living organisms such as insects. The tracking target may also be a mobile object with mobility capabilities (bicycle, motorcycle, car, bus, truck, train, airplane, ship, drone, etc.). In addition, the tracking target may be an object carried by a person or other entity (bag, pouch, case, umbrella, etc.). Furthermore, the tracking target may be an object monitored to prevent theft or other illegal activities (materials, tools, machinery at a work site, etc.). Note that the examples of tracking targets given here are merely examples and are not limited to those given.
[0034] The detection unit 11 can detect the target to be tracked from an image using widely known techniques. For example, the detection unit 11 may use a trained detection model composed of a neural network to detect the target to be tracked from an image. The detection model may further determine the type of object (person, car, etc.) of the detected target to be tracked. In this case, the detection unit 11 can, for example, detect a rectangular region that encompasses the entire target to be tracked. The detection unit 11 can also recognize the type of object of the detected target to be tracked.
[0035] As another example, the detection unit 11 may detect the target from the image by using face detection technology to detect a rectangular region containing a face, which is part of the target to be tracked. Alternatively, the detection unit 11 may detect the target from the image by using pose estimation technology such as OpenPose or MMPose to detect a set of feature points (such as feature points of a person's body) of the target to be tracked from the image.
[0036] "Location information of the tracked object" indicates the position of the detected tracked object within the image. The location information of the tracked object may also be information about a rectangular region encompassing the tracked object. The rectangular region encompassing the tracked object is a concept that includes a rectangular region encompassing the entire tracked object and a rectangular region encompassing the face, which is a part of the tracked object. In this case, the location information of the tracked object may indicate the coordinates of the vertices of the rectangular region, the width of the rectangular region, the height of the rectangular region, etc. In addition, the location information of the tracked object may indicate the coordinates of a predetermined feature point (e.g., a feature point corresponding to the head) within the set of feature points of the tracked object detected by the pose estimation technique. Note that the coordinates, width, and height here refer to the coordinates, width, and height within the image. After detecting the tracked object from the image using the method described above, the detection unit 11 can acquire the location information of the tracked object as described here based on the results of the detection.
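One possible in-memory representation of this location information, for the rectangular-region case, is sketched below. The class name `BoundingBox` and its field names are assumptions chosen for illustration; the source only lists the kinds of values (vertex coordinates, width, height) the information may contain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BoundingBox:
    """Rectangle around a tracked object, in image pixel coordinates."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self) -> float:
        return self.x_max - self.x_min

    @property
    def height(self) -> float:
        return self.y_max - self.y_min

    @property
    def center(self) -> tuple:
        return ((self.x_min + self.x_max) / 2, (self.y_min + self.y_max) / 2)

    def vertices(self) -> list:
        """Corner coordinates, starting at the top-left and going clockwise."""
        return [(self.x_min, self.y_min), (self.x_max, self.y_min),
                (self.x_max, self.y_max), (self.x_min, self.y_max)]


box = BoundingBox(10, 20, 50, 120)  # hypothetical detection result
```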
[0037] If there is one person (or one object) to be tracked in the image, the detection unit 11 detects that one person (or one object) to be tracked and acquires the location information of that one person (or one object). On the other hand, if there are multiple objects to be tracked in the image, the detection unit 11 detects multiple objects to be tracked and acquires the location information of each of the multiple objects to be tracked.
[0038] The cropping unit 12 cuts out, for each tracking target, a target region from the image based on the location information of that tracking target.
[0039] The "target region" is a region within the image, and more specifically, it is a partial region within the image. The target region is a region related to the tracking target, and more specifically, it is a region containing information about the tracking target. The target region can enclose a part or the whole of the tracking target. When the tracking target is a person, the target region can enclose a part of the body of the tracking target who is a person, or the whole body.
[0040] The size and shape of the target region are fixed. They are a common size and a common shape for a plurality of tracking targets, and are common (identical) regardless of the size and shape of each tracking target within the image.
[0041] Hereinafter, an example in which the target regions have a common size and a common shape for a plurality of tracking targets will be described using the drawings.
[0042] In Figure 8, a plurality of tracking targets having different sizes and shapes within the image are detected from one image P. In Figure 8, the tracking target enclosed by the frame DW1 and the tracking target enclosed by the frame DW2 are detected. In this case, the cropping unit 12 cuts out target regions OW1 and OW2 corresponding to each of the plurality of tracking targets. Note that although the sizes and shapes of the detected tracking targets within the image (the sizes and shapes of the frames DW1 and DW2) differ from each other, the sizes and shapes of the target regions OW1 and OW2 cut out for each of the detected tracking targets are common (identical).
[0043] In Figure 9, tracking targets having different sizes and shapes within the image are detected from each of two images P1 and P2. In Figure 9, the tracking target enclosed by the frame DW1 is detected from the image P1, and the tracking target enclosed by the frame DW2 is detected from the image P2. In this case, the cropping unit 12 cuts out target regions OW1 and OW2 corresponding to each of the multiple tracking targets. Although the sizes and shapes of the detected tracking targets within the images P1 and P2 (the sizes and shapes of the frames DW1 and DW2) differ from each other, the sizes and shapes of the target regions OW1 and OW2 cut out for each of the detected tracking targets are common (identical).
[0044] Incidentally, in one example, information about a rectangular region encompassing the tracked object is acquired as location information for the tracked object. The size and shape of this rectangular region may change depending on the size and orientation of the tracked object in the image. On the other hand, as mentioned above, the size and shape of the target region are fixed regardless of the size and orientation of the tracked object in the image. Therefore, the size of the target region (common size) may be greater than or equal to the size of the rectangular region, or it may be less than the size of the rectangular region. Figure 10 illustrates this situation.
[0045] In Figure 10, multiple tracking targets with different sizes and shapes within the image are detected from a single image P. In Figure 10, the tracking target enclosed by the frame DW1 and the tracking target enclosed by the frame DW2 are detected. In this case, the cropping unit 12 cuts out target regions OW1 and OW2 corresponding to each of the multiple tracking targets. The sizes and shapes of the target regions OW1 and OW2 are common (identical).
[0046] Note that, as in the example of the frame DW1 and the target region OW1, the size of the rectangular region (the size of the frame DW1) may be larger than the common size of the target region OW1. In this case, the cropping unit 12 cuts out, for example, a part of the area enclosed by the rectangular region (frame DW1) as the target region OW1.
[0047] Also, as in the example of the frame DW2 and the target region OW2, the size of the rectangular region (the size of the frame DW2) may be smaller than the common size of the target region OW2. In this case, the cropping unit 12 cuts out, for example, a part of the image that includes the area enclosed by the rectangular region (frame DW2) as the target region OW2.
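The two cases above (a frame larger than the target region, and a frame smaller than it) can be checked numerically by measuring how much of the detected rectangle the fixed-size region covers. The function `covered_fraction` and all coordinates below are hypothetical values chosen for illustration.

```python
def covered_fraction(box, region):
    """Fraction of the box area lying inside the region.

    Both arguments are (x, y, width, height) rectangles in pixels.
    """
    bx, by, bw, bh = box
    rx, ry, rw, rh = region
    overlap_w = max(0, min(bx + bw, rx + rw) - max(bx, rx))
    overlap_h = max(0, min(by + bh, ry + rh) - max(by, ry))
    return (overlap_w * overlap_h) / (bw * bh)


# Large frame, 32x32 region: only part of the frame is inside the region.
partial = covered_fraction((10, 10, 60, 120), (24, 10, 32, 32))
# Small frame, 32x32 region: the whole frame is inside the region.
full = covered_fraction((100, 50, 20, 24), (94, 46, 32, 32))
```

A value below 1 corresponds to the first case, and a value of exactly 1 to the second.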
[0048] Next, we will explain the process of determining the target region for each tracking target.
[0049] Once the size and shape of the target region, as well as its position within the image, are determined, the target region within the image can be identified.
[0050] As described above, the size and shape of the target region are common to multiple tracking targets. They are fixed values, and the operator can determine them in advance and register them in the information processing device 10. The cropping unit 12 can identify the size and shape of the target region based on the registered information. The shape of the target region is, for example, a rectangle, but is not limited to this.
[0051] The size of the target region can be determined by various criteria. For example, the common size and shape for multiple tracking targets may be set to the size and shape of the input image accepted by the model (described later) that converts the image of the target region into a visual feature vector.
[0052] Furthermore, the operator may determine the size of the target area based on the tendency of how the tracked object appears in the image. For example, by analyzing multiple images, the operator can identify the range (size trend) of the size of the tracked object within the image. The operator may then determine the size of the target area to be such that it just encloses all or most of the relatively small tracked object within that range. For example, the operator may make the width and height of the target area closer to the width and height of a relatively small tracked object (e.g., the width and height of a rectangular area containing the tracked object) than to the width and height of a relatively large tracked object (e.g., the width and height of a rectangular area containing the tracked object).
[0053] A relatively small tracking target may be one whose size is smaller than the median of the size range (size trend) of tracking targets within an image. For example, a relatively small tracking target may be the smallest tracking target within the size range (size trend) of tracking targets within an image. Conversely, a relatively large tracking target may be one whose size is larger than the median of the size range (size trend) of tracking targets within an image. For example, a relatively large tracking target may be the largest tracking target within the size range (size trend) of tracking targets within an image.
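The size analysis described above could be carried out along the following lines. This is a sketch under assumptions: area is used as the measure of size, and the helper names `split_by_median` and `common_size_from_smallest` are hypothetical. The source leaves the actual choice to the operator.

```python
from statistics import median


def split_by_median(sizes):
    """Split (width, height) pairs into relatively small and relatively
    large groups, comparing each area with the median area."""
    areas = [w * h for w, h in sizes]
    mid = median(areas)
    small = [s for s, a in zip(sizes, areas) if a < mid]
    large = [s for s, a in zip(sizes, areas) if a > mid]
    return small, large


def common_size_from_smallest(sizes):
    """Take the smallest observed target as the common target-region size."""
    return min(sizes, key=lambda s: s[0] * s[1])


# Hypothetical rectangle sizes collected from sample images.
observed = [(20, 40), (30, 60), (100, 200)]
small, large = split_by_median(observed)
region_size = common_size_from_smallest(observed)
```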
[0054] When the size of the target region is determined in this way, the target region corresponding to a relatively large tracking target encloses only a part of the tracking target, as in the example of the target region OW1 in Figure 8. Therefore, the target region corresponding to a relatively large tracking target contains less visual information about the tracking target. On the other hand, the target region corresponding to a relatively small tracking target encloses all or most of the tracking target in an area just large enough to do so, as in the example of the target region OW2 in Figure 8. Therefore, the target region corresponding to a relatively small tracking target contains a lot of visual information about the tracking target. Also, because the region is sized to just enclose all or most of the tracking target, the drawback of the target region for a relatively small tracking target containing a lot of information other than the tracking target is suppressed. The effects of determining the size of the target region in this way will be discussed later.
[0055] Next, we will explain the position of the target region within the image. The position of the target region within the image differs for each tracking target. The cropping unit 12 determines the position of the target region within the image for each tracking target based on the tracking target's position information.
[0056] The cropping unit 12 can determine the position in the image that satisfies a predetermined positional relationship with the position of the tracked target indicated by the tracked target's location information, as the position of the target region corresponding to that tracked target. The predetermined positional relationship is defined such that the target region of each tracked target encompasses part or all of each tracked target. If the tracked target is a person, the predetermined positional relationship is defined such that the target region of each tracked target encompasses part or all of the body of each tracked target that is a person. The following embodiments will describe specific examples of the process of determining the position of the target region in the image for each tracked target based on the tracked target's location information, as well as other examples.
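As an illustration of a predetermined positional relationship, the sketch below places a fixed-size target region relative to a detected rectangle. Both placement rules (`"center"` and `"top"`) and the function name `place_region` are assumptions made for this example; the source defers its own specific examples to later embodiments.

```python
def place_region(box, size, anchor="center"):
    """Return the top-left corner (x0, y0) of a size=(w, h) target region
    positioned relative to the detected rectangle box=(x, y, w, h)."""
    bx, by, bw, bh = box
    w, h = size
    x0 = bx + bw // 2 - w // 2          # horizontally centred on the box
    if anchor == "center":
        y0 = by + bh // 2 - h // 2      # vertically centred on the box
    elif anchor == "top":
        y0 = by                         # aligned with the top edge of the box
    else:
        raise ValueError(f"unknown anchor: {anchor}")
    return x0, y0


box = (10, 20, 40, 100)                 # hypothetical detection
centred = place_region(box, (32, 32))
top_aligned = place_region(box, (32, 32), anchor="top")
```

For a person, the top-aligned rule would tend to enclose the head and upper body, while the centred rule would enclose the torso. Either way the region encloses part of the tracked target, as the text requires.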
[0057] Returning to Figure 7, the acquisition unit 13 acquires visual feature vectors from the target region extracted by the cropping unit 12. The acquisition unit 13 also acquires positional feature vectors from the positional information of the tracked target acquired by the detection unit 11.
[0058] A "visual feature vector" is a vector (feature quantity) that represents the visual features of a target region, that is, the visual features of the tracked object present in the target region. If the target region includes all or most of the tracked object, the visual feature vector obtained from that region represents the visual features of all or most of the tracked object. On the other hand, if the target region includes only a part of the tracked object, the visual feature vector obtained from that region represents the visual features of only a part of the tracked object.
[0059] There are various methods for converting an image of a target region into a visual feature vector. For example, the acquisition unit 13 may acquire a visual feature vector by inputting the image of the target region into a convolutional layer. Alternatively, the acquisition unit 13 may acquire a visual feature vector by inputting the image of the target region into a downsampling layer. For example, the acquisition unit 13 can acquire a visual feature vector of a target region by inputting the image of the target region into a CNN (Convolutional Neural Network) having convolutional layers and pooling layers.
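As a minimal sketch of the convolution-plus-pooling route described above, the following pure-Python example turns the image of a target region into a visual feature vector. The function names, the 2x2 kernel, and the 5x5 region are illustrative assumptions; in an actual CNN the kernel weights would be learned and there would be many layers. Because every target region has the common size, the resulting vector has the same length for every tracking target without a resizing step.

```python
from typing import List

Grid = List[List[float]]

def conv2d_valid(image: Grid, kernel: Grid) -> Grid:
    """Slide the kernel over the image without padding
    (cross-correlation, as in most CNN libraries)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]

def max_pool(feature_map: Grid, window: int = 2) -> Grid:
    """Downsample by keeping the maximum of each non-overlapping window."""
    rows = len(feature_map) // window
    cols = len(feature_map[0]) // window
    return [
        [
            max(feature_map[y * window + i][x * window + j]
                for i in range(window) for j in range(window))
            for x in range(cols)
        ]
        for y in range(rows)
    ]

def visual_feature_vector(region: Grid, kernel: Grid) -> List[float]:
    """Convolve, pool, then flatten the result into a vector."""
    pooled = max_pool(conv2d_valid(region, kernel))
    return [value for row in pooled for value in row]

# Hypothetical 5x5 grayscale target region and a hand-picked kernel (not learned).
region = [[float(x + y) for x in range(5)] for y in range(5)]
kernel = [[1.0, 0.0], [0.0, -1.0]]
print(len(visual_feature_vector(region, kernel)))  # 4
```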
[0060] A "position feature vector" is a vector (feature quantity) that indicates the position of the object being tracked within the image. The position of the object being tracked within the image is indicated by the position information of the object being tracked acquired by the detection unit 11.
[0061] The acquisition unit 13 may use an encoder equipped with multiple feature extraction blocks to convert the location information of the tracked object into a position feature vector. The encoder may be constructed, for example, with about three fully connected layers in a neural network.
[0062] The integration unit 14 integrates the visual feature vector and position feature vector acquired by the acquisition unit 13 to obtain an integrated vector.
[0063] In one example, the integration unit 14 can generate an integrated vector by concatenating (arranging) visual feature vectors and positional feature vectors in a predetermined order. In this case, for example, integrating a 32-dimensional visual feature vector and a 32-dimensional positional feature vector generates a 64-dimensional integrated vector. The integration unit 14 can perform this integration using, for example, the CONCAT function. Note that 32 and 64 dimensions here are just examples, and the dimensions of each vector are not limited to these.
[0064] As another example, the integration unit 14 can generate an integrated vector by adding together the same elements of the visual feature vector and the positional feature vector. In this case, for example, a 32-dimensional integrated vector is generated by integrating a 32-dimensional visual feature vector and a 32-dimensional positional feature vector. Note that 32 dimensions here is just an example, and the dimensions of each vector are not limited to this.
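The two integration methods described above (concatenation and element-wise addition) can be sketched as follows. The function names are hypothetical, and the order "visual first, then positional" is only one possible predetermined order.

```python
from typing import List

Vector = List[float]

def integrate_concat(visual: Vector, positional: Vector) -> Vector:
    """Concatenate in a fixed order: visual first, then positional (assumed order)."""
    return list(visual) + list(positional)

def integrate_add(visual: Vector, positional: Vector) -> Vector:
    """Add elements at the same index; both vectors must have equal length."""
    if len(visual) != len(positional):
        raise ValueError("element-wise addition needs equal dimensions")
    return [v + p for v, p in zip(visual, positional)]

visual = [0.1] * 32
positional = [0.2] * 32
print(len(integrate_concat(visual, positional)))  # 64
print(len(integrate_add(visual, positional)))     # 32
```

Concatenation keeps the two kinds of information in separate components at the cost of a longer vector, while addition keeps the dimension unchanged but requires both vectors to have the same length.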
[0065] The update unit 15 updates the integrated vector using a cross-attention mechanism that can match the target being tracked across multiple images.
[0066] The "Cross-attention mechanism" has the function of matching the integrated vector of the tracked object between two images. Specifically, the cross-attention mechanism has the function of matching which tracked objects are the same object for each of the consecutive images acquired in a time series. The specific configuration of the cross-attention mechanism will be explained in detail later. The integrated vector updated by the update unit 15 (hereinafter referred to as the "updated integrated vector") may be used, for example, as a tracking query (see Figure 6) used in the tracking process. In this case, the updated integrated vector may be temporarily stored in a storage unit that stores the tracking query. The storage unit may be provided within the information processing device 10, or it may be provided in an external device that is communicably connected to the information processing device 10.
[0067] The restoration unit 16 is configured to restore the updated integrated vector, updated by the update unit 15, to the positional information and visual information that were in their state before the vector transformation. The restoration unit 16 may be configured as a decoder equipped with multiple feature extraction blocks. The restoration unit 16 may be constructed, for example, with about three fully connected layers in a neural network.
[0068] The memory unit is configured to store location information, the various vectors mentioned above, and other various information for each tracked target. For example, the memory unit may be configured to store detection queries and tracking queries for each tracked target. For example, the memory unit may have multiple memory areas, and one memory area may store tracking queries performed for a particular person (tracked target) in chronological order. Alternatively, the memory unit may be configured to store location information, the various vectors mentioned above, and other various information linked to a corresponding ID (Identifier) for each tracked target. The memory unit may be located within the information processing device 10, or it may be located within an external device that is communicatively connected to the information processing device 10.
[0069] "Cross-Attention Mechanism" Next, the configuration and operation of the cross-attention mechanism will be explained with reference to Figure 11. Figure 11 is a block diagram showing the configuration of the cross-attention mechanism used by the update unit 15.
[0070] As shown in Figure 11, the cross-attention mechanism comprises three feature embedding processing units 210, 220, and 230 corresponding to the query, key, and value, respectively, a matrix multiplication unit 240, a normalization unit 250, a matrix multiplication unit 260, a residual processing unit 270, and a memory update unit 280.
[0071] The feature embedding processing unit 210 is configured to extract queries from the integrated vector at time t (i.e., feature quantities corresponding to frames captured at time t) input from the integration unit 14. The feature embedding processing unit 220 is configured to extract keys from the integrated vector at time t-τ calculated in past tracking processes (i.e., feature quantities corresponding to frames captured at time t-τ prior to time t). The feature embedding processing unit 230 is configured to extract values from the integrated vector at time t-τ calculated in past tracking processes. The queries and keys are output to the matrix multiplication unit 240. The values, on the other hand, are output to the matrix multiplication unit 260.
[0072] The matrix multiplication unit 240 is configured to calculate an Attention Weight indicating the correlation between a query and a key by performing a matrix product of the query and the key. That is, the matrix multiplication unit 240 is configured to calculate a weight indicating the correlation between the integrated vector corresponding to the frame captured at time t and the integrated vector corresponding to the frame captured at time t-τ. For example, the matrix multiplication unit 240 may calculate (use) a similarity matrix (Affinity Matrix) with the integrated vector corresponding to the frame captured at time t on the vertical axis and the integrated vector corresponding to the frame captured at time t-τ on the horizontal axis as the Attention Weight of the cross-attention mechanism.
[0073] The normalization unit 250 is configured to perform normalization processing on the weights calculated by the matrix multiplication unit 240. For example, the normalization unit 250 may perform normalization processing on the similarity matrix calculated by the matrix multiplication unit 240 using the Cross-softmax function. The weights normalized by the normalization unit 250 are output to the matrix multiplication unit 260.
[0074] The matrix multiplication unit 260 is configured to perform a process that reflects the weights in the value by performing a matrix multiplication calculation between the output from the normalization unit 250 and the value. Typically, this matrix multiplication may be a tensor product (in other words, a Cartesian product). For example, the matrix multiplication may be a Kronecker product. The calculation result of the matrix multiplication unit 260 is output to the residual processing unit 270.
[0075] The residual processing unit 270 is configured to perform residual processing on the calculation result of the matrix multiplication unit 260. This residual processing may be a process of adding the integrated vector input to the cross-attention mechanism (specifically, the feature quantity at time t) to the calculation result of the matrix multiplication unit 260. This is to prevent the calculated value of the cross-attention mechanism from disappearing when the calculated correlation is low. For example, if 0 is calculated as the correlation (weight), the value will be multiplied by that 0, causing the feature quantity in the calculation result of the matrix multiplication unit 260 to become 0 (disappear). To prevent this, the residual processing unit 270 performs the residual processing described above. The calculation result of the residual processing unit 270 is output from the cross-attention mechanism as an updated integrated vector at time t.
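The data flow through units 210 to 270 can be sketched as follows. This is a minimal illustration under several assumptions that are not in the source: the feature embeddings are replaced by the identity, the normalization is an ordinary row-wise softmax (the source names a "Cross-softmax function" without defining it), and the second multiplication is an ordinary matrix product rather than a tensor or Kronecker product. All function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]

def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Ordinary matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]

def softmax_rows(m: Matrix) -> Matrix:
    """Normalize each row so that its entries are positive and sum to 1."""
    out = []
    for row in m:
        peak = max(row)  # subtract the maximum for numerical stability
        exps = [math.exp(x - peak) for x in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def cross_attention(current: Matrix, past: Matrix) -> Matrix:
    """current: integrated vectors at time t, one row per tracking target.
    past: integrated vectors at time t - tau, one row per tracking target."""
    query, key, value = current, past, past  # identity embeddings (assumption)
    weights = softmax_rows(matmul(query, transpose(key)))  # units 240 and 250
    attended = matmul(weights, value)                      # unit 260
    # Unit 270: the residual connection keeps the input even if attention adds little.
    return [[a + c for a, c in zip(a_row, c_row)]
            for a_row, c_row in zip(attended, current)]

current = [[1.0, 0.0], [0.0, 1.0]]
past = [[0.0, 1.0], [1.0, 0.0]]
updated = cross_attention(current, past)
print([[round(v, 3) for v in row] for row in updated])
# [[1.731, 0.269], [0.269, 1.731]]
```

When every stored vector is zero, the attended term vanishes and the output equals the input, which is the situation the residual processing is meant to handle.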
[0076] The memory update unit 280 updates the integrated vector (feature) corresponding to the tracked target stored in the memory. The memory update unit 280 may update only the integrated vector stored in the memory means corresponding to the updated integrated vector output by the matrix multiplication unit 260, or it may overwrite and update the integrated vector in the memory means with the integrated vector output by the calculation result of the residual processing unit 270. For example, the tracked target may be identified by the weight calculated from the query and key in the matrix multiplication unit 240, and it may be determined which tracked target to update from among the multiple tracked targets stored in the memory unit 155. Furthermore, the updated integrated vector calculated from the normalized weights and values in the matrix multiplication unit 260 may be determined as the update amount of the tracked target's integrated vector stored in the memory unit 155.
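The overwrite variant of this update can be sketched as follows. The dictionary-based memory, the `stored_ids` list, and the choice of the largest weight in each row to identify the tracking target are assumptions made for illustration; the source does not fix these details.

```python
from typing import Dict, List

def update_memory(
    memory: Dict[int, List[float]],
    stored_ids: List[int],
    weights: List[List[float]],
    updated_vectors: List[List[float]],
) -> Dict[int, List[float]]:
    """Overwrite the stored integrated vector of each matched tracking target.

    weights[i][j] is the weight between current target i and stored target j,
    where column j belongs to the target with ID stored_ids[j].
    """
    for row, vector in zip(weights, updated_vectors):
        # Identify the stored target by the largest weight in this row (assumption).
        best_column = max(range(len(row)), key=row.__getitem__)
        memory[stored_ids[best_column]] = list(vector)
    return memory

memory = {7: [0.0, 0.0], 9: [1.0, 1.0]}
weights = [[0.1, 0.9], [0.8, 0.2]]
updated = [[5.0, 5.0], [6.0, 6.0]]
print(update_memory(memory, [7, 9], weights, updated))
# {7: [6.0, 6.0], 9: [5.0, 5.0]}
```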
[0077] Furthermore, because the operations performed in the tracking process and in the cross-attention mechanism are essentially similar, the information processing device 10 can be said to update the integrated vector using the information generated when matching objects. For example, the tracking process detects the object to be tracked, matches the object to be tracked, and updates the detection result of the object to be tracked. The cross-attention mechanism, in turn, obtains an integrated vector related to the object to be tracked, calculates weights, and updates that integrated vector. The information processing device 10 essentially uses the weight calculation in the cross-attention mechanism also as the matching step of the tracking process, and conversely uses the matching step of the tracking process also as the weight calculation in the cross-attention mechanism. Therefore, it can be said that the information processing device 10 realizes the operations of detecting an object, matching an object, and updating the detection result using the cross-attention mechanism.
[0078] More specifically, the cross-attention mechanism uses the integrated vector corresponding to the frame captured at time t as a query, as described above. The cross-attention mechanism also retrieves the integrated vectors contained in frames captured up to time t-τ, prior to time t, as key and value data from the storage unit 160. In this way, the cross-attention mechanism performs the process of tracking the target. The memory update unit 280 may also update the storage means by overwriting it with the integrated vector output as the calculation result of the residual processing unit 270. Alternatively, the tracking target may be identified from the weights calculated by the matrix multiplication unit 240, and the integrated vector corresponding to the ID of the identified tracking target stored in the storage means may be updated using the updated integrated vector output by the matrix multiplication unit 260. In this way, it is possible to perform the tracking process for targets included in the video with a relatively simple algorithm and with high accuracy.
[0079] "Similarity Matrix" Next, with reference to Figure 12, we will specifically explain the similarity matrix calculated by the cross-attention mechanism described above. Figure 12 is a plan view showing an example of a similarity matrix calculated by the cross-attention mechanism.
[0080] As shown in Figure 12, the similarity matrix AM used as a weight by the cross-attention mechanism provides information indicating the correspondence between the tracked object Ot-τ at time t-τ and the tracked object Ot at time t. For example, the similarity matrix AM shows the following: (1) The first tracked object Ot-τ among multiple tracked objects Ot-τ corresponds to the first tracked object Ot among multiple tracked objects Ot (i.e., they are the same person). (2) The second tracked object Ot-τ among multiple tracked objects Ot-τ corresponds to the second tracked object Ot among multiple tracked objects Ot. (N) The Nth tracked object Ot-τ among multiple tracked objects Ot-τ corresponds to the Nth tracked object Ot among multiple tracked objects Ot.
[0081] Furthermore, since the similarity matrix AM shows the correspondence between the tracked object Ot-τ and the tracked object Ot, it may also be called correspondence information.
[0082] Specifically, the similarity matrix AM can be considered as a matrix whose vertical axis corresponds to the vector components of the integrated vector CVt-τ and whose horizontal axis corresponds to the vector components of the integrated vector CVt. Therefore, the size of the vertical axis of the similarity matrix AM is the size of the integrated vector CVt-τ, which corresponds to the size (i.e., number of pixels) of the image taken at time t-τ. Similarly, the size of the horizontal axis of the similarity matrix AM is the size of the integrated vector CVt, which corresponds to the size (i.e., number of pixels) of the image taken at time t. In other words, the similarity matrix AM can be considered as a matrix whose vertical axis corresponds to the detection result of the tracked object Ot-τ captured in the image at time t-τ, and whose horizontal axis corresponds to the detection result of the tracked object Ot captured in the image at time t. The detection result shows the positional and visual features of the tracked object.
[0083] In this case, the elements of the similarity matrix AM react (typically have a non-zero value) at the point where the vector component corresponding to a tracking target Ot-τ on the vertical axis intersects with the vector component corresponding to the same tracking target Ot on the horizontal axis. In other words, the elements of the similarity matrix AM react at the point where the detection result of tracking target Ot-τ on the vertical axis intersects with the detection result of tracking target Ot on the horizontal axis.
[0084] In other words, the similarity matrix AM is typically a matrix where the value of the element at the intersection of the vector component corresponding to the tracked object Ot-τ contained in the integrated vector CVt-τ and the vector component corresponding to the same tracked object Ot contained in the integrated vector CVt satisfies the condition. The condition is, for example, that the value of the element at the above intersection is the value obtained by multiplying the two vector components (i.e., a non-zero value), while the values of the other elements are zero. Hereafter, the "intersection of the vector component corresponding to the tracked object Ot-τ contained in the integrated vector CVt-τ and the vector component corresponding to the same tracked object Ot contained in the integrated vector CVt" may be referred to as the "intersection position of the tracked object O".
[0085] For example, in the example shown in Figure 12, the elements of the similarity matrix AM react at the intersection of the vector component corresponding to the tracked target O#k in the integrated vector CVt-τ and the vector component corresponding to the same tracked target O#k in the integrated vector CVt. Here, k is an index identifying each detected tracked target O, and in the example shown in Figure 12, k = 1, 2, 3, or 4. In other words, the elements of the similarity matrix AM react at the intersection of the detection result of the tracked target O#k captured in the image taken at time t-τ and the detection result of the tracked target O#k captured in the image taken at time t.
[0086] Conversely, if the elements of the similarity matrix AM do not react (typically become 0) at the intersection of the tracked object O, it can be inferred that the tracked object Ot-τ, which was captured in the image taken at time t-τ, was not captured in the image taken at time t. For example, it can be assumed that the tracked object Ot-τ, which was captured in the image taken at time t-τ, went out of frame and outside the camera's shooting range at time t.
[0087] Thus, the similarity matrix AM can be used as information showing the correspondence between the target object Ot-τ and the target object Ot. In other words, the similarity matrix AM can be used as information showing the result of matching the target object Ot-τ captured in the image taken at time t-τ with the object Ot captured in the image taken at time t. Therefore, the similarity matrix AM can be used as information for tracking the position of the target object Ot-τ, which was captured in the image taken at time t-τ, within the image taken at time t. By using the similarity matrix AM in this way, it is possible to perform the tracking process of targets included in a video with high accuracy.
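Reading the matching result out of a similarity matrix can be sketched as follows. For simplicity this example assumes one row per tracked object at time t-τ and one column per tracked object at time t, and treats a row whose largest element does not exceed a threshold as "no reaction". The function name and the threshold are hypothetical.

```python
from typing import List, Optional

def match_targets(
    similarity: List[List[float]], threshold: float = 0.0
) -> List[Optional[int]]:
    """For each object at time t - tau (row), return the index of the matching
    object at time t (column), or None if no element in the row reacts."""
    matches: List[Optional[int]] = []
    for row in similarity:
        best = max(range(len(row)), key=row.__getitem__) if row else None
        if best is None or row[best] <= threshold:
            matches.append(None)  # e.g. the object left the camera's shooting range
        else:
            matches.append(best)
    return matches

similarity = [
    [0.9, 0.0, 0.0],  # object 0 at t - tau matches object 0 at t
    [0.0, 0.0, 0.8],  # object 1 at t - tau matches object 2 at t
    [0.0, 0.0, 0.0],  # object 2 at t - tau has no match at t
]
print(match_targets(similarity))  # [0, 2, None]
```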
[0088] <Processing Flow> Next, an example of the processing flow of the information processing device 10 will be explained using the flowchart in Figure 13. The purpose here is to explain the processing flow. Details of each process have been described above, so explanations will be omitted here as appropriate.
[0089] When the tracking process is started, the information processing device 10 first acquires an image that is one of a collection of multiple images in a time series (S20). Then, the information processing device 10 detects the target to be tracked from the image acquired in S20 and acquires the location information of the detected target to be tracked (S21).
[0090] Next, the information processing device 10 obtains a position feature vector from the location information of the tracked target acquired in S21 (S22).
[0091] Furthermore, the information processing device 10 extracts a target region from the image acquired in S20 based on the location information of the tracked object acquired in S21 (S23). The target region is a region within the image acquired in S20 that relates to the tracked object and has a common size for multiple tracked objects. Next, the information processing device 10 acquires a visual feature vector from the image of the target region (S24).
[0092] Next, the information processing device 10 integrates the position feature vector acquired in S22 and the visual feature vector acquired in S24 to obtain an integrated vector (S25). Then, the information processing device 10 updates the integrated vector acquired in S25 using the cross-attention mechanism (S26). After that, the information processing device 10 restores the integrated vector updated in S26 to the position information and visual information in the state before the vector conversion (S27).
[0093] Subsequently, the information processing device 10 determines whether or not to terminate the tracking process (S28). If it does not terminate the tracking process (No in S28), it returns to S20 and repeats the process. On the other hand, if it does terminate the tracking process (Yes in S28), the information processing device 10 terminates the series of processes.
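The flow from S20 to S28 can be sketched as a loop over the images. Every callable passed to `run_tracking` is a hypothetical stand-in for the corresponding unit, and the lambdas in the usage part are placeholders chosen only to make the sketch runnable, not real detectors or encoders.

```python
from typing import Any, Callable, Iterable, List

def run_tracking(
    images: Iterable[Any],
    detect: Callable, encode_position: Callable, crop: Callable,
    encode_visual: Callable, integrate: Callable,
    update: Callable, restore: Callable,
) -> List[List[Any]]:
    results = []
    for image in images:  # S20; exhausting the images plays the role of S28
        integrated = []
        for location in detect(image):                   # S21
            position_vector = encode_position(location)  # S22
            region = crop(image, location)               # S23
            visual_vector = encode_visual(region)        # S24
            integrated.append(integrate(visual_vector, position_vector))  # S25
        updated = update(integrated)                     # S26
        results.append([restore(vector) for vector in updated])  # S27
    return results

# Placeholder units: one fixed detection per frame, trivial encoders.
results = run_tracking(
    images=["frame0", "frame1"],
    detect=lambda image: [(1, 2)],
    encode_position=lambda loc: [float(loc[0]), float(loc[1])],
    crop=lambda image, loc: image,
    encode_visual=lambda region: [float(len(region))],
    integrate=lambda visual, position: visual + position,
    update=lambda vectors: vectors,
    restore=lambda vector: vector,
)
print(results)  # [[[6.0, 1.0, 2.0]], [[6.0, 1.0, 2.0]]]
```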
[0094] <Operation and Effects> The information processing device 10 of the second embodiment can achieve the same operation and effects as the information processing device 10 of the first embodiment.
[0095] Furthermore, the information processing device 10 extracts a "target region with a common size for multiple tracking targets" for each tracking target, processes the extracted target region, and obtains a visual feature vector for each tracking target. The information processing device 10 can make the size of the target region a common size regardless of the size of the tracking target within the image. Therefore, even if multiple tracking targets with different sizes within the image are detected from a single image, the information processing device 10 can make the size of the extracted target region corresponding to each of the multiple tracking targets the same. Also, even if multiple tracking targets with different sizes within the image are detected from each of the multiple images, the information processing device 10 can make the size of the extracted target region corresponding to each of the multiple tracking targets the same. Due to this processing, if the size of the rectangular region containing the tracking target is smaller than the common size, the information processing device 10 can extract the target region containing the rectangular region. Also, if the size of the rectangular region containing the tracking target is larger than the common size, the information processing device 10 can extract a part of the rectangular region as the target region. With such an information processing device 10, the effects and advantages described in the first embodiment using Figures 3 and 4 can be realized. In other words, it reduces the processing load on the computer required to generate the visual feature vectors for each tracked object detected from the image.
[0096] Furthermore, the information processing device 10 can, in one example, obtain a visual feature vector by inputting an image of the target region into a convolutional layer. Alternatively, the information processing device 10 can, in one example, obtain a visual feature vector by inputting an image of the target region into a downsampling layer. With such an information processing device 10, it is possible to obtain a visual feature vector that represents the visual features of the target being tracked, which are included in the target region.
[0097] Furthermore, the information processing device 10 performs tracking of the target using an integrated vector which combines the visual feature vector and position feature vector of the target to be tracked.
[0098] The technology of tracking an object based on its position within an image is widely known. This technology allows for tracking an object within an image with a reasonably sufficient level of accuracy. However, the position (coordinates) of the object within an image can change not only due to the movement of the object but also due to changes in the camera's shooting direction caused by camera shake or the surrounding environment (wind, etc.). When such changes in position within an image due to changes in the camera's shooting direction occur, it can become difficult to track the object within the image with sufficient accuracy. The information processing device 10 suppresses this problem by tracking the object using an integrated vector that combines the object's visual feature vector and position feature vector. By utilizing not only the object's position information but also its visual information, tracking within the image can be continued with a reasonably sufficient level of accuracy even when the object's position within the image changes due to changes in the camera's shooting direction, etc. As a result, the tracking accuracy is improved.
[0099] Furthermore, the information processing device 10 can employ both the aforementioned "technique for extracting a target area of a common size for each tracking target" and the aforementioned "technique for utilizing integrated vectors" by employing a distinctive method.
[0100] To track an object based on visual information, it is preferable to obtain visual feature vectors from a target region that includes the majority of the object, for example, the entire object, and does not contain any extraneous information in the background. However, since the size of objects to be tracked varies within an image, determining the target region for each object in this manner may result in the target regions not being identical and exhibiting variability.
[0101] If the size of the target region is set to a common size regardless of the size of each tracked object within the image, the target region of one tracked object may satisfy the above-mentioned preferred conditions, while the target regions of other tracked objects may not satisfy them and may be too large or too small. This problem can arise when employing both the "technique of extracting a target region of a common size for each tracked object" and the "technique of using integrated vectors".
[0102] To address this problem, the information processing device 10 can set the size of the target area to a size that just encloses the whole or most of the tracking object, which is relatively small in size within the image. In other words, the size of the target area can be set to a size suitable for a tracking object that is relatively small in size within the image. Here, "suitable size" is a size that satisfies the above-mentioned preferred conditions (a target area that includes most of the tracking object, e.g., the whole, and does not contain any extraneous information in the background).
[0103] When the size of the target area is determined in this way, the target area corresponding to a relatively small tracking target encloses the whole or most of that tracking target in a region just large enough to do so, as shown in the example of target area OW2 in Figure 8. Therefore, the target region corresponding to a relatively small tracking target will contain a lot of visual information about the target. Also, because the region is just large enough to enclose the whole or most of the tracking target, the disadvantage of the target region containing a lot of information other than the tracking target can be suppressed. As a result, sufficient visual information (visual feature vectors) can be obtained to track a relatively small target within the image.
[0104] On the other hand, the target area corresponding to a relatively large tracking target encloses only a portion of that tracking target, as in the example of target area OW1 in Figure 8. Therefore, a target region corresponding to a relatively large tracking target will contain limited visual information (visual feature vectors) about that target. Tracking with sufficient accuracy based on such limited visual information is difficult.
[0105] However, for the following reasons, this configuration allows for tracking both relatively small and relatively large targets with sufficient accuracy.
[0106] Tracking targets that are relatively close to the camera and appear relatively large in the image are less affected by slight changes in the camera's shooting direction. Therefore, even if a slight change in the camera's shooting direction causes a change in the target's position in the image, tracking can continue with a reasonably sufficient level of accuracy. In other words, for tracking targets that appear relatively large in the image, tracking can be performed with a reasonably sufficient level of accuracy based on positional information, thus reducing the need for visual tracking support.
[0107] On the other hand, tracking targets that are relatively far from the camera and appear relatively small in the image are more susceptible to changes in their position within the image caused by even slight changes in the camera's shooting direction. Therefore, if a change in position within the image occurs due to a slight change in the camera's shooting direction, it can become difficult to continue tracking the target within the image with sufficient accuracy. In other words, for tracking targets that appear relatively small in the image, the need for tracking support based on visual information is relatively greater.
[0108] Therefore, the information processing device 10 sets the size of the target area to a size suitable for tracking targets that are relatively small within the image, thereby providing relatively stronger visual-information-based tracking support for those targets. As a result, tracking based on both positional information and visual information enables relatively small targets to be tracked with sufficient accuracy. With the target area sized in this way, visual-information-based tracking support becomes relatively weaker for tracking targets that are relatively large within the image. However, as mentioned above, relatively large tracking targets can be tracked with reasonably sufficient accuracy based on positional information, so the need for tracking support using visual feature vectors is relatively low. For this reason, even when the size of the target area is set to a size suitable for relatively small tracking targets, relatively large targets can be tracked with sufficient accuracy based on positional information.
[0109] Furthermore, the information processing device 10 can track targets using a cross-attention mechanism. With such an information processing device 10, it is possible to perform tracking processing more appropriately compared to cases where a cross-attention mechanism is not used. For example, when attempting to use a self-attention mechanism for tracking processing, learning is required so that the weights in the self-attention mechanism respond strongly to the same targets. However, realizing such a configuration requires a large number of self-attention mechanisms, resulting in the technical problem of complicating the algorithm used for tracking processing. With the information processing device 10 using a cross-attention mechanism, the algorithm for tracking processing can be constructed with a simple structure, making it possible to achieve highly accurate tracking processing while suppressing computational costs.
[0110] <<Third Embodiment>> The information processing device 10 of the third embodiment uses a characteristic process to determine the position of the target area to be cut out corresponding to the tracked target, based on the location information of the tracked target. This will be explained in detail below.
[0111] The cropping unit 12 can determine a position that satisfies a predetermined positional relationship with the position of the tracking target indicated by the tracking target's location information, and set that position as the position of the target area corresponding to the tracking target.
[0112] For example, the cropping unit 12 can identify a reference position based on the location information of the target to be tracked, and cut out a target area whose relative positional relationship with the reference position satisfies the positional conditions.
[0113] When the location information of the tracked object is information about a rectangular area containing the tracked object, the reference position may be a predefined position within that rectangular area. Examples of predefined positions include, but are not limited to, the center of the rectangular area, the upper right vertex of the rectangular area, the upper left vertex of the rectangular area, the lower right vertex of the rectangular area, and the lower left vertex of the rectangular area.
[0114] Furthermore, if the location information of the tracked object is information about a set of feature points of the tracked object detected by the posture estimation technology, the reference position may be the position of a predefined feature point within that set of feature points. The predefined feature point may be a feature point corresponding to the head, a feature point corresponding to the neck, a feature point corresponding to the waist, or any other feature point.
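The two ways of identifying a reference position described above can be sketched as follows. This is only an illustration: the box layout (left, top, right, bottom) with the y-axis pointing downward, the function names, and the keypoint names are assumptions that do not appear in the disclosure.

```python
from typing import Dict, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), y grows downward (assumed)

def reference_from_box(box: Box, which: str = "center") -> Point:
    """Return a predefined position within the rectangular area."""
    left, top, right, bottom = box
    options = {
        "center": ((left + right) / 2, (top + bottom) / 2),
        "upper_left": (left, top),
        "upper_right": (right, top),
        "lower_left": (left, bottom),
        "lower_right": (right, bottom),
    }
    return options[which]

def reference_from_keypoints(keypoints: Dict[str, Point], name: str = "waist") -> Point:
    """Return the position of a predefined feature point (e.g. head, neck, waist)."""
    return keypoints[name]
```

For example, reference_from_box((10, 20, 50, 100)) yields the center (30.0, 60.0) of the rectangular area, while reference_from_keypoints selects one feature point from a posture estimation result.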
[0115] Positional conditions are defined by the relative positional relationship with the reference position as described above. Examples of positional conditions include, but are not limited to, the following:
• The reference position coincides with the center of the target area.
• The reference position coincides with the upper right vertex of the target area.
• The reference position coincides with the upper left vertex of the target area.
• The reference position coincides with the lower right vertex of the target area.
• The reference position coincides with the lower left vertex of the target area.
• The position reached after moving from the reference position according to a predetermined rule coincides with the center of the target area.
• The position reached after moving from the reference position according to a predetermined rule coincides with the upper right vertex of the target area.
• The position reached after moving from the reference position according to a predetermined rule coincides with the upper left vertex of the target area.
• The position reached after moving from the reference position according to a predetermined rule coincides with the lower right vertex of the target area.
• The position reached after moving from the reference position according to a predetermined rule coincides with the lower left vertex of the target area.
[0116] The predetermined rules include, for example, "move x1 units in the x-axis direction and y1 units in the y-axis direction," but are not limited to these.
[0117] The cutting unit 12 can determine the position of the target area using this method. The size of the target area is the same for multiple tracking targets (common size), as in the first and second embodiments.
[0118] With such a cutting unit 12, for example, it is possible to cut out a target area whose center position coincides with the center position of the tracking target (e.g., the center of a rectangular area, or the position of a feature point at the center of the body such as the waist) and whose size is the common size.
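The positional conditions and the predetermined movement rule can be combined in a short sketch. The function name target_area, the anchor labels, and the coordinate convention (y-axis pointing downward) are assumptions for illustration; handling of areas that extend beyond the image boundary is not described in this section and is omitted here.

```python
def target_area(reference, common_size, anchor="center", offset=(0, 0)):
    """Return (left, top, right, bottom) of a target area of the common size.

    reference: (x, y) reference position.
    anchor: which point of the target area must coincide with the (moved) reference.
    offset: (x1, y1) of the rule "move x1 units along x and y1 units along y".
    """
    width, height = common_size
    # Apply the predetermined rule to the reference position.
    px = reference[0] + offset[0]
    py = reference[1] + offset[1]
    # Upper left corner of the target area for each positional condition.
    corners = {
        "center": (px - width / 2, py - height / 2),
        "upper_left": (px, py),
        "upper_right": (px - width, py),
        "lower_left": (px, py - height),
        "lower_right": (px - width, py - height),
    }
    left, top = corners[anchor]
    return (left, top, left + width, top + height)
```

With a reference position of (100, 80) and a common size of 64 x 64, the default "center" condition gives the area (68.0, 48.0, 132.0, 112.0). Changing only the anchor or the offset moves the area without changing its size, which corresponds to the ease of changing the extracted target area noted in this embodiment.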
[0119] Other configurations of the information processing device 10 can be the same as those in the first and second embodiments.
[0120] The information processing device 10 of the third embodiment can achieve the same effects as the information processing device 10 of the first and second embodiments. Furthermore, the information processing device 10 can identify a reference position based on the location information of the tracked object, and can extract a target area whose relative positional relationship with the reference position satisfies positional conditions and whose size is a common size. With such an information processing device 10, it is possible to extract a desired target area by appropriately setting the definition of the reference position and positional conditions. For example, it is possible to extract a target area that includes the whole or a part of the tracked object. In addition, the target area to be extracted can be easily changed by changing the definition of the reference position and positional conditions.
[0121] Furthermore, with such an information processing device 10, for example, it is possible to extract a target area whose center position coincides with the center position of the tracking target (e.g., the center of a rectangular area, or the position of a feature point at the center of the body such as the waist), and whose size is the common size. In this way, by extracting a predetermined range around the center position of the tracking target as the target area, it becomes possible, for a tracking target that is relatively small within the image, to extract a target area of a size that just encloses the whole or a large portion of that tracking target.
[0122] <<Fourth Embodiment>> The information processing device 10 of the fourth embodiment determines the position of the target area to be cut out in accordance with the tracking target so as to satisfy predetermined cutting conditions. This will be described in detail below.
[0123] The cutting unit 12 determines the position of the target area to be cut out in accordance with the tracking target so as to satisfy predetermined cutting conditions. The size of the target area is common to multiple tracking targets (common size), as in the first to third embodiments.
[0124] The cutting conditions may include, but are not limited to, one of the following:
• The target area encompasses a predetermined portion of the target being tracked.
• The upper part of the target being tracked is preferentially included.
[0125] The designated part of the target being tracked is a part that is useful for distinguishing the target from other targets (a part that has distinctiveness). If the target being tracked is a person, the face (head) may be defined as the designated part. If the target being tracked is a car, the location where the emblem is installed (the center of the front grille) may be defined as the designated part. Note that the examples given here are merely examples and are not limited to these.
[0126] In this example, the cutting unit 12 can, for example, identify the location of the tracked object within the image based on the location information of the tracked object, and then analyze the image at that location to detect the predetermined portion of the tracked object. The cutting unit 12 can then determine the position of the target area so as to enclose the detected predetermined portion of the tracked object.
[0127] As another example, the cutting unit 12 can determine the position of the target area so as to preferentially include the upper part of the object to be tracked. For example, the cutting unit 12 can determine the position of the target area so that the top edge of the target area overlaps with the top edge of the rectangular area that encompasses the entire object to be tracked.
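The two cutting conditions can be sketched as follows. The function names are hypothetical, and two choices are assumptions not stated in the disclosure: the top-aligned area is centered horizontally on the rectangular area, and the enclosing area is centered on the detected portion. Detection of the predetermined portion itself (for example, a face detector) is outside this sketch.

```python
def top_aligned_area(target_box, common_size):
    """Target area whose top edge overlaps the top edge of the tracked object's rectangle."""
    left, top, right, bottom = target_box
    width, height = common_size
    center_x = (left + right) / 2  # horizontal centering is an assumption
    area_left = center_x - width / 2
    return (area_left, top, area_left + width, top + height)

def area_enclosing_part(part_box, common_size):
    """Target area of the common size that encloses a detected portion (e.g. the head)."""
    left, top, right, bottom = part_box
    width, height = common_size
    if right - left > width or bottom - top > height:
        raise ValueError("the detected portion does not fit in the common size")
    # Centering the area on the portion is an assumption; any enclosing position would do.
    center_x = (left + right) / 2
    center_y = (top + bottom) / 2
    return (center_x - width / 2, center_y - height / 2,
            center_x + width / 2, center_y + height / 2)
```

For a rectangular area (100, 50, 300, 650) and a common size of 64 x 64, top_aligned_area returns (168.0, 50, 232.0, 114), that is, an area covering only the uppermost part of a relatively large tracked object.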
[0128] Other configurations of the information processing device 10 can be the same as those of the first to third embodiments.
[0129] The information processing device 10 of the fourth embodiment can achieve the same effects and advantages as the information processing device 10 of the first to third embodiments. Furthermore, the information processing device 10 can determine the position of the target area so as to enclose a predetermined portion of the object to be tracked.
[0130] As described in detail in the second embodiment, in one example, the size of the target region is set to a size suitable for tracking targets that are relatively small in size within the image. In this case, a target region corresponding to a relatively large tracking target will contain only a portion of the tracking target. If the target region contains only a portion of the tracking target, it may not be possible to obtain sufficient visual feature vectors from the target region to identify that tracking target.
[0131] To address this problem, the information processing device 10 determines the position of the target area so as to satisfy the above-mentioned cutting conditions, which enables it to extract, from a relatively large tracking target, a target area containing a portion that is useful for distinguishing that target from other tracking targets (a portion with discriminative power). Specifically, it can extract a target area that contains the head (face) of the person being tracked. Furthermore, the information processing device 10 can extract a target area that preferentially contains the upper part of the tracking target. When such a target area is extracted, it contains the head (face) of the person being tracked with high probability.
[0132] With this information processing device 10, it becomes possible to distinguish not only relatively small tracking targets but also relatively large tracking targets with sufficient accuracy using visual feature vectors. Furthermore, by using visual feature vectors in addition to positional feature vectors, the tracking accuracy can be significantly improved not only for relatively small tracking targets but also for relatively large targets.
[0133] <<Fifth Embodiment>> The information processing device 10 of the fifth embodiment changes the method of cropping the target area according to the size of the tracking target within the image. That is, the information processing device 10 crops the target area corresponding to each tracking target using an appropriate method according to the size of the tracking target within the image. This will be explained in detail below.
[0134] The cropping unit 12 changes the method of cropping the target area, specifically the cropping position of the target area, according to the size of the tracking target within the image. The size of the target area is the same (common size) regardless of the size of the tracking target within the image, as in the first to fourth embodiments.
[0135] The "size of the tracked object within the image" may be expressed as the size of the rectangular region containing the tracked object. In this case, the cropping unit 12 may calculate, for example, the height or area (number of pixels contained within, etc.) of the rectangular region containing the tracked object as the size of the tracked object within the image.
[0136] In addition, the size of the object being tracked within the image may be expressed as the distance between a first and second feature point within a set of feature points of the object being tracked (such as feature points of a human body). There are no particular restrictions on which feature points are designated as the first and second feature points. For example, the feature point corresponding to the head may be designated as the first feature point, and the feature point corresponding to the toes may be designated as the second feature point.
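The size measures described above can be sketched as follows. The function names, the box layout (left, top, right, bottom), and the keypoint names are assumptions for illustration.

```python
import math

def size_from_box(box, measure="height"):
    """Size of the tracked object as the height or area of its rectangular region."""
    left, top, right, bottom = box
    if measure == "height":
        return bottom - top
    if measure == "area":
        return (right - left) * (bottom - top)
    raise ValueError("measure must be 'height' or 'area'")

def size_from_keypoints(keypoints, first="head", second="toes"):
    """Size of the tracked object as the distance between two feature points."""
    (x1, y1), (x2, y2) = keypoints[first], keypoints[second]
    return math.hypot(x2 - x1, y2 - y1)
```

Whichever measure is used, the result is a single number that can be compared against the predetermined value when selecting the cropping method.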
[0137] The method for cropping the target area is defined in advance for each size of the tracking target within the image. The cropping unit 12 calculates the size within the image of each tracking target detected from the image. Then, the cropping unit 12 determines the position of the target area using the cropping method corresponding to the calculated size.
[0138] For example, for the case where the size of the object to be tracked within the image is greater than or equal to a predetermined value, a cropping condition may be defined to crop a target area that preferentially includes the upper part of the object to be tracked. The cropping unit 12 may then crop the target area so as to satisfy this cropping condition if the size of the object to be tracked within the image is greater than or equal to the predetermined value.
[0139] In another example, for the case where the size of the object to be tracked within the image is greater than or equal to a predetermined value, a cropping condition may be defined to crop a target area that includes a predetermined part of the object to be tracked (such as the head (face)). The cropping unit 12 may then crop the target area so as to satisfy this cropping condition if the size of the object to be tracked within the image is greater than or equal to the predetermined value.
[0140] Furthermore, if the size of the object to be tracked within the image is less than a predetermined value, the target area may be extracted based on a first rule that extracts the target area whose center position coincides with the center position of the object to be tracked. If the size of the object to be tracked within the image is greater than or equal to a predetermined value, the target area may be extracted based on a rule different from the first rule. The rule different from the first rule may be, for example, a rule that extracts the target area in such a way as to satisfy the extraction conditions described above.
[0141] In this case, if the size of the object being tracked within the image is less than a predetermined value, the cropping unit 12 will crop a central region where the center position coincides with the center position of the object being tracked as the target region. If the size of the object being tracked within the image is greater than or equal to the predetermined value, the cropping unit 12 will crop a region different from the central region as the target region. In other words, the position of the target region differs for each object being tracked.
[0142] An example of this situation will be explained using Figure 14. In Figure 14, multiple tracking targets with different sizes and shapes within the image are detected from a single image P. Specifically, a tracking target contained within frame DW1 and a tracking target contained within frame DW2 are detected. In this case, the cropping unit 12 cuts out target regions OW1 and OW2 corresponding to the respective tracking targets.
[0143] The size within the image of the tracking target contained within frame DW2 is less than a predetermined value. Therefore, the cropping unit 12 cuts out, as the target region OW2, the central region whose center position coincides with the center position of the tracking target. Meanwhile, the size within the image of the tracking target contained within frame DW1 is greater than the predetermined value. Therefore, the cropping unit 12 cuts out a region different from the central region as the target region OW1. For example, the cropping unit 12 cuts out the target region OW1 that satisfies the "cut-out condition for cutting out the target region that includes the head (face) of the target being tracked". The target regions OW1 and OW2 have the same size and shape.
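The size-dependent selection of the cropping position can be sketched as follows. The coordinates, the threshold, and the common size are hypothetical values chosen for illustration and are not taken from Figure 14. As an assumption, the size is measured as the height of the frame, and the rule applied to large tracking targets is the top-aligned rule (preferentially including the upper part) rather than head detection.

```python
def cut_out(target_box, common_size, threshold):
    """Return (left, top, right, bottom) of the target region for one tracking target."""
    left, top, right, bottom = target_box
    width, height = common_size
    center_x = (left + right) / 2
    area_left = center_x - width / 2
    if bottom - top < threshold:
        # First rule: central region whose center coincides with the target's center.
        center_y = (top + bottom) / 2
        area_top = center_y - height / 2
    else:
        # Different rule: align the top edges so that the upper part is included.
        area_top = top
    return (area_left, area_top, area_left + width, area_top + height)

# Hypothetical frames: dw1 is a large tracking target, dw2 a small one.
dw1 = (100, 50, 300, 650)
dw2 = (400, 200, 430, 250)
ow1 = cut_out(dw1, common_size=(64, 64), threshold=100)
ow2 = cut_out(dw2, common_size=(64, 64), threshold=100)
```

Here ow2 is the central region of the small target, ow1 covers the uppermost part of the large target, and both regions are 64 x 64, so they can be passed to the same feature extraction without resizing.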
[0144] Other configurations of the information processing device 10 can be the same as those of the first to fourth embodiments.
[0145] The information processing device 10 of the fifth embodiment can achieve the same effects as the information processing device 10 of the first to fourth embodiments. Furthermore, the information processing device 10 can change the position at which the target region is cut out according to the size of the tracking target within the image. For example, if the size of the tracking target within the image is less than a predetermined value, the information processing device 10 can cut out the target region based on a first rule under which the center position of the target region coincides with the center position of the tracking target. If the size of the tracking target within the image is greater than or equal to the predetermined value, the information processing device 10 can cut out the target region based on a rule different from the first rule. In addition, if the size of the tracking target within the image is less than the predetermined value, the information processing device 10 can cut out, as the target region, the central region whose center position coincides with the center position of the tracking target. If the size of the tracking target within the image is greater than or equal to the predetermined value, the information processing device 10 can cut out a region different from the central region as the target region. With such an information processing device 10, a target region can be cut out for each tracking target in a manner suited to the size of that tracking target.
[0146] Furthermore, as described in detail in the second embodiment, in one example, the size of the target area is set to a size suitable for a tracking target that is relatively small in size within the image. In this case, a target area corresponding to a relatively large tracking target will contain only a portion of the tracking target. If the target area contains only a portion of the tracking target, it may not be possible to obtain sufficient visual feature vectors from the target area to identify that tracking target. Therefore, if the size of the tracking target within the image is greater than or equal to a predetermined value, the information processing device 10 can extract a target area that preferentially contains the upper part of the tracking target. In addition, if the size of the tracking target within the image is greater than or equal to a predetermined value, the information processing device 10 can extract a target area that contains a predetermined part of the tracking target (such as the head). With such an information processing device 10, even for relatively large tracking targets, it is possible to extract a target area that contains a portion of the tracking target with high discriminatory power. As a result, even for relatively large tracking targets, it becomes possible to distinguish them from other tracking targets with sufficient accuracy using the visual feature vectors obtained from such a target area.
[0147] <<Modifications>> The methods of utilizing the information processing device 10 are not limited to the examples described in the first to fifth embodiments. The information processing device 10 can be used in tracking technologies in general. Furthermore, the information processing device 10 can also be used in online deep neural network models that use time-series data. The same effects and advantages as those of the first to fifth embodiments can be achieved in these modified forms as well.
[0148] Although this disclosure has been described above with reference to embodiments, this disclosure is not limited to the embodiments described above. Various modifications to the structure and details of this disclosure are possible, which can be understood by those skilled in the art within the scope of this disclosure. Furthermore, each embodiment can be combined with other embodiments as appropriate.
[0149] Furthermore, the flowchart used in the above explanation shows multiple steps (processes) in sequence. However, the execution order of the steps performed in each embodiment is not limited to the order in which they are described. In each embodiment, the order of the illustrated steps can be changed to the extent that the change does not interfere with the processing.
[0150] Some or all of the above embodiments may also be described as follows, but are not limited to the following:
1. An information processing apparatus comprising: detection means for detecting a tracking target from an image and acquiring location information of the tracking target; extraction means for extracting a target region from the image that is a region relating to the tracking target and has a common size for a plurality of tracking targets, based on the location information; and acquisition means for acquiring a visual feature vector from the target region.
2. The information processing apparatus according to 1, characterized in that the extraction means sets the size of the target region to the common size regardless of the size of the tracking target in the image.
3. The information processing apparatus according to 1 or 2, characterized in that the extraction means identifies a reference position based on the location information and extracts the target region whose relative positional relationship with the reference position satisfies a condition and whose size is the common size.
4. The information processing apparatus according to any one of 1 to 3, characterized in that the extraction means extracts the target region whose center position coincides with the center position of the tracking target and whose size is the common size.
5. The information processing apparatus according to any one of 1 to 4, wherein the extraction means changes the cropping position of the target region according to the size of the tracking target in the image.
6. The information processing apparatus according to 5, wherein the extraction means crops the target region that preferentially includes the upper part of the tracking target when the size of the tracking target in the image is greater than or equal to a predetermined value.
7. The information processing apparatus according to 5 or 6, wherein the extraction means crops the target region based on a first rule which crops the target region whose center position coincides with the center position of the tracking target when the size of the tracking target in the image is less than a predetermined value, and crops the target region based on a rule different from the first rule when the size of the tracking target in the image is greater than or equal to the predetermined value.
8. The information processing apparatus according to any one of 5 to 7, characterized in that the extraction means crops a central region whose center position coincides with the center position of the tracking target as the target region when the size of the tracking target in the image is less than a predetermined value, and crops a region different from the central region as the target region when the size of the tracking target in the image is greater than or equal to the predetermined value.
9. The information processing apparatus according to any one of 5 to 8, characterized in that the tracking target is a person, and the extraction means crops the target region that includes the head of the tracking target when the size of the tracking target is greater than or equal to a predetermined value.
10. The information processing apparatus according to any one of 1 to 9, characterized in that the tracking target is a person, and the extraction means crops the target region that includes the head of the tracking target and whose size is the common size.
11. The information processing apparatus according to any one of 1 to 10, characterized in that the acquisition means acquires the visual feature vector by inputting the image of the target region into a convolutional layer.
12. The information processing apparatus according to any one of 1 to 10, characterized in that the acquisition means acquires the visual feature vector by inputting the image of the target region to a downsampling layer.
13. The information processing apparatus according to any one of 1 to 12, characterized in that the acquisition means further comprises an integration means that acquires a position feature vector from the position information and integrates the visual feature vector and the position feature vector to acquire an integrated vector.
14. The information processing apparatus according to 13, characterized in that it further comprises an update means that updates the integrated vector using a cross-attention mechanism capable of matching the tracking target among a plurality of images.
15. The information processing apparatus according to any one of 1 to 14, characterized in that the position information indicates a rectangular region encompassing the tracking target, and the extraction means crops a part of the rectangular region as the target region when the size of the rectangular region is larger than the common size.
16. The information processing apparatus according to any one of 1 to 15, characterized in that the position information indicates a rectangular region encompassing the tracking target, and the extraction means crops out the target region encompassing the rectangular region when the size of the rectangular region is smaller than the common size.
17. The information processing apparatus according to any one of 1 to 16, characterized in that when a plurality of tracking targets of different sizes within the image are detected from a single image, the extraction means crops out the target region corresponding to each of the plurality of tracking targets, and the size of the target region cropped corresponding to each of the plurality of tracking targets is the same.
18. The information processing apparatus according to any one of 1 to 17, characterized in that when a plurality of tracking targets of different sizes within the image are detected from each of a plurality of images, the extraction means crops out the target region corresponding to each of the plurality of tracking targets, and the size of the target region cropped corresponding to each of the plurality of tracking targets is the same.
19. An information processing method comprising: one or more computers detecting a target to be tracked from an image, acquiring location information of the target to be tracked, cutting out a target region within the image that relates to the target to be tracked and has a common size for multiple targets to be tracked, based on the location information, and acquiring a visual feature vector from the target region.
20. A recording medium for recording a program that causes a computer to function as: detection means for detecting a target to be tracked from an image and acquiring location information of the target to be tracked; extraction means for cutting out a target region within the image that relates to the target to be tracked and has a common size for multiple targets to be tracked, based on the location information; and acquisition means for acquiring a visual feature vector from the target region.
[0151] Some or all of appendices 2 to 18, which depend on the information processing apparatus described in appendix 1 above, may also depend on the information processing method of appendix 19 and the recording medium of appendix 20, in the same dependency relationship as that between appendix 1 and appendices 2 to 18. Furthermore, to the extent that it does not depart from the embodiments described above, some or all of the configurations described as appendices can be realized in various hardware, software, various recording means for recording software, or systems.
[0152]
10 Information Processing Device
11 Detection Unit
12 Extraction Unit
13 Acquisition Unit
14 Integration Unit
15 Update Unit
16 Restoration Unit
1A Processor
2A Memory
3A Input/Output I/F
4A Peripheral Circuit
5A Bus
Claims
1. An information processing device comprising: detection means for detecting a target to be tracked from an image and acquiring location information of the target to be tracked; extraction means for cutting out a target region within the image that relates to the target to be tracked and has a common size for multiple targets to be tracked, based on the location information; and acquisition means for acquiring a visual feature vector from the target region.
2. The information processing device according to claim 1, characterized in that the extraction means sets the size of the target region to the common size regardless of the size of the tracking target within the image.
3. The information processing device according to claim 1 or 2, characterized in that the extraction means identifies a reference position based on the location information and cuts out the target region whose relative positional relationship with the reference position satisfies a positional condition and whose size is the common size.
4. The information processing device according to any one of claims 1 to 3, characterized in that the extraction means cuts out the target region whose center position coincides with the center position of the tracking target and whose size is the common size.
5. The information processing device according to any one of claims 1 to 4, characterized in that the extraction means changes the cropping position of the target region according to the size of the tracking target within the image.
6. The information processing device according to claim 5, characterized in that, when the size of the tracking target in the image is greater than or equal to a predetermined value, the extraction means crops out the target region that preferentially includes the upper part of the tracking target.
7. The information processing device according to claim 5 or 6, wherein the extraction means crops the target region based on a first rule which crops the target region whose center position coincides with the center position of the tracking target when the size of the tracking target in the image is less than a predetermined value, and crops the target region based on a rule different from the first rule when the size of the tracking target in the image is greater than or equal to the predetermined value.
8. The information processing device according to any one of claims 5 to 7, characterized in that the extraction means crops a central region whose center position coincides with the center position of the tracking target as the target region when the size of the tracking target in the image is less than a predetermined value, and crops a region different from the central region as the target region when the size of the tracking target in the image is greater than or equal to the predetermined value.
9. The information processing device according to any one of claims 5 to 8, wherein the tracking target is a person, and the extraction means cuts out the target region that includes the head of the tracking target when the size of the tracking target is greater than or equal to a predetermined value.
10. The information processing device according to any one of claims 1 to 9, characterized in that the tracking target is a person, and the extraction means cuts out the target region which includes the head of the tracking target and whose size is the common size.
11. The information processing device according to any one of claims 1 to 10, characterized in that the acquisition means acquires the visual feature vector by inputting the image of the target region into a convolutional layer.
12. The information processing device according to any one of claims 1 to 10, characterized in that the acquisition means acquires the visual feature vector by inputting the image of the target region to a downsampling layer.
13. The information processing device according to any one of claims 1 to 12, wherein the acquisition means further comprises an integration means that acquires a position feature vector from the location information and integrates the visual feature vector and the position feature vector to acquire an integrated vector.
14. The information processing device according to claim 13, further comprising an update means for updating the integrated vector using a cross-attention mechanism capable of matching the tracking target among a plurality of images.
15. The information processing device according to any one of claims 1 to 14, wherein the location information indicates a rectangular region encompassing the tracking target, and the extraction means cuts out a portion of the rectangular region as the target region when the size of the rectangular region is larger than the common size.
16. The information processing device according to any one of claims 1 to 15, characterized in that the location information indicates a rectangular region encompassing the tracking target, and the extraction means cuts out the target region encompassing the rectangular region when the size of the rectangular region is smaller than the common size.
17. The information processing apparatus according to any one of claims 1 to 16, characterized in that, when a plurality of tracking targets of different sizes are detected from a single image, the cropping means crops a target region corresponding to each of the plurality of tracking targets, and the target regions cropped for the plurality of tracking targets are the same size.
18. The information processing apparatus according to any one of claims 1 to 17, characterized in that, when tracking targets of different sizes are detected from each of a plurality of images, the cropping means crops the target region corresponding to each of the plurality of tracking targets, and the target regions cropped for the plurality of tracking targets are the same size.
19. An information processing method comprising: one or more computers detecting a target to be tracked from an image; obtaining location information of the target to be tracked; cutting out a target region within the image that relates to the target to be tracked and has a common size for multiple targets to be tracked, based on the location information; and obtaining a visual feature vector from the target region.
20. A recording medium for recording a program that causes a computer to function as: a detection means for detecting a target to be tracked from an image and acquiring location information of the target to be tracked; an extraction means for cutting out a target region within the image that is related to the target to be tracked and has a common size for multiple targets to be tracked, based on the location information; and an acquisition means for acquiring a visual feature vector from the target region.
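Illustrative note (not part of the claims): the cropping behavior recited in claims 8, 9, and 15 to 17 can be sketched as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. The names `Box`, `clamp_window`, and `target_region` are hypothetical, the "size" of a tracking target is assumed to be the longer side of its bounding rectangle, the head of a person is assumed to lie near the top edge of that rectangle, and the image is assumed to be at least as large as the common size.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in pixels; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def clamp_window(left: int, top: int, size: int, img_w: int, img_h: int) -> Box:
    """Shift a size x size window so that it lies fully inside the image."""
    if size > img_w or size > img_h:
        raise ValueError("image is smaller than the common size")
    left = max(0, min(left, img_w - size))
    top = max(0, min(top, img_h - size))
    return Box(left, top, left + size, top + size)


def target_region(bbox: Box, common_size: int, img_w: int, img_h: int,
                  size_threshold: int) -> Box:
    """Return a common_size x common_size target region for one tracking target.

    Small targets get a window centered on the bounding rectangle.
    Large targets get a window aligned with the top of the rectangle
    (assumption: this is where the head of a person is located).
    """
    width = bbox.right - bbox.left
    height = bbox.bottom - bbox.top
    center_x = (bbox.left + bbox.right) // 2
    left = center_x - common_size // 2

    if max(width, height) < size_threshold:
        # Central region: window center coincides with the target center.
        center_y = (bbox.top + bbox.bottom) // 2
        top = center_y - common_size // 2
    else:
        # Region different from the central region: top-aligned window.
        top = bbox.top

    return clamp_window(left, top, common_size, img_w, img_h)


if __name__ == "__main__":
    # Two targets of different sizes in one 640 x 480 image.
    small = Box(100, 100, 130, 160)
    large = Box(300, 50, 400, 350)
    for bbox in (small, large):
        print(target_region(bbox, 64, 640, 480, 64))
```

Because every returned window has the same width and height, the cropped images can be stacked and passed to a feature extractor without resizing. With a common size of 64, the small rectangle is fully enclosed by its window, while the window for the large rectangle covers only the upper part of that rectangle.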