Railway station passenger identification method and system
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA ACADEMY OF RAILWAY SCI CORP LTD
- Filing Date
- 2026-03-26
- Publication Date
- 2026-06-19
Figure CN122244533A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of intelligent personnel sensing technology, and in particular to a method and system for identifying key passengers at railway stations.
Background Technology
[0002] Railway passenger stations are crucial transportation hubs with high-density, high-volume passenger flows, which makes it difficult to serve key passengers, particularly those carrying large or multiple pieces of luggage or relying on assistive devices. Existing service models rely on manual identification, which has limited coverage and low response efficiency and fails to meet actual needs.
[0003] Therefore, developing intelligent automatic identification technology for key passengers is of great significance for improving the quality of passenger station services and ensuring passenger safety.
Summary of the Invention
[0004] In view of this, embodiments of the present invention provide a method and system for identifying key passengers at railway stations, so as to eliminate the defects of existing manual identification technologies for key passengers.
[0005] One aspect of the present invention provides a method for identifying key passengers at railway stations, the method comprising the following steps:
Obtaining an image to be identified captured by a camera at a railway passenger station;
Inputting the image to be identified into a pre-trained target detection model, which outputs the bounding box information of each pedestrian target, as well as the category information, bounding box information, and detection confidence of the object carried by the pedestrian target, wherein the categories of carried objects include luggage and assistive devices;
When the category of the carried object is luggage, determining the luggage size based on the bounding box information of the carried object and the spatial relationship between the camera and the carried object, and determining the number of luggage items based on the intersection over union (IoU) between the bounding box of the carried object and the bounding box of the corresponding pedestrian target; if the determined luggage size and / or luggage quantity meets the set luggage conditions, identifying the pedestrian target corresponding to the carried object as a key passenger carrying large and / or multiple pieces of luggage;
When the category of the carried object is an assistive device and the detection confidence of the carried object reaches a set confidence threshold, identifying the pedestrian target corresponding to the carried object as a key passenger who needs an assistive device.
The target detection model is an improved YOLO model comprising a backbone network, a neck network, and a detection head. Each convolutional layer in the target detection model is a region-dynamically perceptive depthwise separable convolutional layer. The neck network includes a top-down feature fusion module, a bottom-up feature fusion module, and a multi-scale feature map output module. The top-down feature fusion module is used to upsample the multi-scale feature map output by the backbone network. The bottom-up feature fusion module includes a small target detection layer, which is used to downsample the multi-scale feature map output by the top-down feature fusion module.
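The two decision rules in the method above can be sketched as follows. This is only an illustrative sketch: the class `CarriedObject`, the function `key_passenger_labels`, the label strings, and the confidence value 0.5 are hypothetical, and the count of 3 pieces is borrowed from the example definition of multiple luggage given in the detailed description.

```python
from dataclasses import dataclass

# Hypothetical values: the text only refers to "set luggage conditions"
# and a "set confidence threshold".
MIN_LUGGAGE_COUNT = 3        # pieces of luggage
CONFIDENCE_THRESHOLD = 0.5   # assumed detection confidence threshold


@dataclass
class CarriedObject:
    category: str            # "luggage", "assistive_device", or "other"
    confidence: float        # detection confidence in [0, 1]
    is_large: bool = False   # outcome of the luggage size check (luggage only)


def key_passenger_labels(carried):
    """Return the key-passenger labels for one pedestrian target."""
    labels = set()
    luggage = [obj for obj in carried if obj.category == "luggage"]
    # Luggage rule: size condition and / or quantity condition.
    if any(obj.is_large for obj in luggage):
        labels.add("large_luggage")
    if len(luggage) >= MIN_LUGGAGE_COUNT:
        labels.add("multiple_luggage")
    # Assistive device rule: detection confidence must reach the threshold.
    if any(obj.category == "assistive_device"
           and obj.confidence >= CONFIDENCE_THRESHOLD for obj in carried):
        labels.add("assistive_device")
    return labels
```

An empty result means that the carried objects alone do not mark the pedestrian target as a key passenger.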
[0006] In some embodiments of the present invention, the region-dynamically aware depthwise separable convolutional layer includes a depthwise separable convolutional layer, a learnable guide, and a filter generator, wherein: the depthwise separable convolutional layer is used to generate a guide feature map based on the input image; the guide is used to perform dimensionality reduction on the guide feature map and to use the softmax function to determine, for each pixel in the dimensionality-reduced guide feature map, its weight in each of multiple preset regions; and the filter generator is used to obtain basic filter tensors based on the input image and to perform a weighted summation of them using the weights output by the guide, obtaining the convolution kernel corresponding to each pixel in the input image, which is then used to perform the convolution operation on the input image.
[0007] In some embodiments of the present invention, the bounding box information includes the size and position of the bounding box. The dimensions of the luggage are determined as follows: if the category of the carried object is determined to be luggage, the actual physical distance between the carried object and the camera is determined based on a pre-trained distance prediction model; the product of the bounding box width of the carried object and the actual physical distance is calculated to obtain the first theoretical distance, and the product of the bounding box height of the carried object and the actual physical distance is calculated to obtain the second theoretical distance; the luggage dimensions are obtained by calculating the ratios of the first and second theoretical distances to the camera focal length parameter.
[0008] In some embodiments of the present invention, determining the actual physical distance between the object target and the camera based on a pre-trained distance prediction model includes: The bounding box information of the target object and the position information of a specific camera are input into a pre-trained distance prediction model, and the actual physical distance between the target object and the camera is output; wherein, the specific camera is a camera that captures an image containing the target object to be identified.
[0009] In some embodiments of the present invention, the features corresponding to pedestrian targets whose passenger types are not determined are input into a pre-trained attribute space memory model, and the attribute similarity features corresponding to the pedestrian target are output. The attribute similarity features are then input into a preset age classification algorithm to obtain the age prediction result of the pedestrian target. Based on the age prediction result, it is determined whether the pedestrian target is an elderly key passenger. The features corresponding to the pedestrian target are obtained based on the pixel information within the bounding box of the pedestrian target. The attribute space memory model is used to extract the features corresponding to the preset attributes from the features corresponding to the input pedestrian target as a query, and to calculate the similarity between the query and the standard attributes pre-stored in the attribute space memory to obtain the attribute similarity features corresponding to the pedestrian target.
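The similarity calculation between the query and the stored standard attributes can be illustrated with a short sketch. Cosine similarity is assumed here; the text does not name the similarity measure, and the function names and the vectors in the usage note are made up for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def attribute_similarity(query, memory):
    """Compare one query feature with every stored standard attribute.

    Returns one similarity value per memory entry; together these values
    play the role of the attribute similarity features.
    """
    return [cosine(query, standard) for standard in memory]
```

For example, with two stored attributes `[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]` and the query `[0.9, 0.1, 0.0]`, the first similarity is much larger than the second, so the query lies closest to the first stored attribute.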
[0010] In some embodiments of the present invention, the features corresponding to pedestrian targets whose passenger type is not determined are obtained through the following methods: Select pedestrian targets with undetermined passenger types from the image to be identified, and divide the image within the bounding box of the selected pedestrian targets into multiple image blocks; The corresponding features are extracted from multiple image patches using an attention mechanism, and the extracted features are projected into a metric space that is the same as the standard attribute through feature projection, thereby obtaining the features corresponding to the pedestrian target.
[0011] In some embodiments of the present invention, after identifying key passengers, the method further includes: if the pedestrian target is determined to be a key passenger, the key passenger image is cropped from the image to be identified based on the bounding box information of the pedestrian target; the key passenger image is input into a pre-trained cascaded variational autoencoder, which outputs the latent variable corresponding to the key passenger image. If the similarity between two latent variables obtained from different images to be identified meets the preset similarity condition, the key passenger images corresponding to these two latent variables belong to the same pedestrian target, thereby realizing the tracking of key passengers.
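The matching step can be sketched as below, assuming that the preset similarity condition is a cosine similarity threshold. The threshold value 0.9, the greedy best-match strategy, and all names are assumptions made for illustration; the latent variables would come from the cascaded variational autoencoder, which is not modelled here.

```python
import math

SIMILARITY_THRESHOLD = 0.9  # assumed value of the preset similarity condition


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))


def match_passengers(latents_a, latents_b, threshold=SIMILARITY_THRESHOLD):
    """Pair each latent from image A with its most similar latent in image B.

    A pair is kept only if its similarity reaches the threshold, in which
    case both crops are treated as the same pedestrian target.
    """
    matches = {}
    for id_a, z_a in latents_a.items():
        best_id, best_sim = None, threshold
        for id_b, z_b in latents_b.items():
            sim = cosine(z_a, z_b)
            if sim >= best_sim:
                best_id, best_sim = id_b, sim
        if best_id is not None:
            matches[id_a] = best_id
    return matches
```

Passengers whose best similarity stays below the threshold remain unmatched and would be treated as new identities.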
[0012] In some embodiments of the present invention, after obtaining the latent variables corresponding to the images of key passengers, the method further includes: the latent variables corresponding to the key passenger images are input into a pre-trained adversarial domain discriminator to obtain the optimized latent variables corresponding to the key passenger images.
[0013] Another aspect of the present invention provides a system for identifying key passengers at railway stations, including a processor, a memory, and a computer program / instructions stored in the memory. The processor is used to execute the computer program / instructions, and when the computer program / instructions are executed, the system implements the steps of the method described in any of the above embodiments.
[0014] Another aspect of the present invention provides a computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implement the steps of the method described in any of the above embodiments.
[0015] The present invention proposes a method and system for identifying key passengers at railway stations. Utilizing an improved target detection model based on the existing YOLO model, it can automatically identify key passengers carrying large / multiple pieces of luggage, as well as those requiring assistance with mobility devices. The method proposed in this application can significantly improve the intelligence, accuracy, and timeliness of services for key passengers at railway stations, providing effective technical support for ensuring the safety of key passengers' travel, and even enabling the construction of an "identification-tracking-service" support technology.
[0016] Additional advantages, objects, and features of the invention will be set forth in part in the description which follows, and will also become apparent in part to those skilled in the art upon studying the description, or may be learned by practice of the invention. The objects and other advantages of the invention can be realized and obtained by means of the structures specifically pointed out in the description and drawings.
[0017] Those skilled in the art will understand that the objectives and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objectives achievable with the present invention will become clearer from the following detailed description.
Attached Figure Description
[0018] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, are not intended to limit the scope of the invention. In the drawings:
Figure 1 is a flowchart illustrating a method for identifying key passengers at railway stations according to an embodiment of the present invention.
[0019] Figure 2 is a flowchart illustrating a method for identifying key passengers at railway stations according to another embodiment of the present invention.
[0020] Figure 3 is a schematic diagram of the architecture of a region-dynamically perceptive depthwise separable convolutional layer in one embodiment of the present invention.
[0021] Figure 4 is an architecture diagram of a target detection model in one embodiment of the present invention.
Detailed Implementation
[0022] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and descriptions of this invention are used to explain the invention, but are not intended to limit the invention.
[0023] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and / or processing steps closely related to the solution according to the invention are shown in the accompanying drawings, while other details that are not closely related to the invention are omitted.
[0024] It should be emphasized that the term "including / comprises" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
[0025] In the following description, embodiments of the invention will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.
[0026] The highly mobile and complex operating environment of railway passenger stations, coupled with the limitations of traditional manual service models, easily leads to delays in response to key passenger services and regulatory blind spots. This application proposes a method for identifying key passengers at railway passenger stations, aiming to improve the technological level in the field of public safety. Specifically, it can improve the accuracy of automatic identification of key passenger services. Its application will help achieve legally mandated public safety goals more efficiently and in a more standardized manner, and has positive social benefits.
[0027] Based on existing definitions of key passengers and the passenger types with concentrated safety risks in actual railway station operations, this application focuses on three categories of passengers: elderly key passengers, key passengers requiring assistive devices, and key passengers carrying large and / or multiple pieces of luggage. For example, elderly passengers can be defined as passengers aged ≥65 years, and passengers carrying large and / or multiple pieces of luggage can be defined as passengers carrying luggage that exceeds standard dimensions and / or carrying ≥3 pieces. In this application, elderly key passengers are identified using a combination of the target detection model and an attribute recognition model, whereas key passengers carrying large and / or multiple pieces of luggage and key passengers requiring assistive devices can be identified automatically by the target detection model alone. This is because accurately identifying elderly passengers depends mainly on the overall characteristics of the pedestrian target (such as facial expressions and posture), which existing target detection models cannot assess, while the other two categories of key passengers can be determined from the objects carried by the pedestrian target.
[0028] As shown in Figure 1, the method proposed in this application is implemented by a cascaded network for key passenger identification. Through the coordinated operation of three core modules (target detection, attribute recognition, and automatic tracking), this network shifts service for key passengers from "passive service" to "proactive support", thereby constructing an "identification-tracking-service" support technology. Specifically, the target detection module, primarily implemented by a target detection model, receives images of passengers and their belongings captured by railway station cameras. It uses a YOLO model that combines a small target detection enhancement layer with region dynamic perception depthwise separable convolutional layers (DR-DP Conv) to automatically locate and identify passengers and their belongings in the received images. The dedicated small target detection layer accurately locates key small targets such as assistive devices (for example, wheelchairs) and luggage. Furthermore, the DR-DP Conv layers address occlusion problems in dense passenger flow, enabling reliable detection in complex station environments. The attribute recognition module receives the target instances within the bounding boxes of pedestrian targets whose passenger types are undetermined (in this application, a target instance within a bounding box refers to the specific object enclosed by the bounding box and identified and classified by the model). It extracts discriminative features through an attribute space memory model to accurately distinguish elderly key passengers (it can also be used to identify staff in the images). The automatic tracking module, after key passengers have been automatically identified by the target detection and attribute recognition modules, extracts domain-independent features using cascaded variational autoencoders to ensure stable identity representation, enabling real-time tracking of the entire trajectory of key passengers across cameras. During tracking, it also monitors the status of staff accompanying the passengers, providing support for service and escort.
[0029] As shown in Figure 2, the key passenger identification method proposed in this application includes steps S110 to S140, as follows:
Step S110: Obtain the image to be identified captured by a specific camera in the railway passenger station. The specific cameras are located in key areas of the railway passenger station, including but not limited to the entrance area, waiting hall area, platform area, and exit area.
[0030] Since the purpose of this application is to identify key passengers based on pedestrian target features and the features of the items carried by the pedestrian targets, this application requires that the image to be identified must contain one or more pedestrian targets. The image to be identified may or may not contain items carried by the corresponding pedestrian targets, so that subsequent step S120 uses a pre-trained target detection model to identify whether there are pedestrian targets in the image to be identified, whether the pedestrian targets are carrying items, and the type of the items carried. It should be noted that if the acquired image to be identified does not contain pedestrian targets or items carried, the target detection model will not output bounding box information or detection confidence results. If the acquired image to be identified only includes items carried but not the corresponding pedestrian targets, the target detection model can still output relevant results, but these output results will not participate in the key passenger determination in steps S130 and S140. If the acquired image to be identified only includes pedestrian targets but not the corresponding items carried, the attribute recognition module can be used to determine whether these pedestrian targets are elderly key passengers.
[0031] The image to be identified consists of one or more image frames. Furthermore, the pedestrian targets and object targets mentioned in this application are all detected targets or target instances in the image to be identified. Therefore, the target detection model may not be able to accurately determine the correspondence between pedestrian targets and object targets. Thus, this application considers objects located within a defined detection area of a pedestrian target as the corresponding object. For example, the defined detection area of a pedestrian target can be a region with radius r centered on the pedestrian target. Similarly, the corresponding pedestrian target can be determined based on the object. This application does not specifically limit the method for determining the correspondence between pedestrian targets and object targets. Additionally, the size of the defined detection area can vary for different types of object targets.
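One possible way to realize the radius-based correspondence is sketched below. Choosing the nearest pedestrian center when several pedestrians fall within the radius is an assumption added for illustration; boxes are taken to be `(cx, cy, w, h)` tuples in pixels, and the function names are hypothetical.

```python
import math


def center(box):
    """Return the center point of a box given as (cx, cy, w, h) in pixels."""
    return box[0], box[1]


def associate(pedestrians, objects, radius):
    """Map each object index to the nearest pedestrian index within `radius`.

    Objects with no pedestrian center inside the radius are left out of the
    result and would not take part in the key passenger determination.
    """
    result = {}
    for obj_idx, obj_box in enumerate(objects):
        best_idx, best_dist = None, radius
        for ped_idx, ped_box in enumerate(pedestrians):
            dist = math.dist(center(obj_box), center(ped_box))
            if dist <= best_dist:
                best_idx, best_dist = ped_idx, dist
        if best_idx is not None:
            result[obj_idx] = best_idx
    return result
```

A different radius can be passed for each object category, in line with the remark that the detection area may vary by object type.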
[0032] Step S120: Input the image to be identified into the pre-trained target detection model, and output the bounding box information of the pedestrian target, the category information of the corresponding object carried by the pedestrian target, the bounding box information of the corresponding object carried by the pedestrian target, and the detection confidence of the corresponding object carried by the pedestrian target. Since this application subsequently determines the key passenger and its type based on the category of the object carried, the category of the object carried can be set to baggage, assistive devices, and others. Furthermore, the bounding box information includes the size and position of the bounding box, such as the length, width, and center point coordinates of the bounding box.
[0033] In some embodiments of the present invention, considering that the existing YOLO model can perform well in general scenarios by adopting a multi-scale feature fusion strategy, this application designs an object detection model based on the existing YOLO model. Therefore, the network architecture of the object detection model in this application mainly includes a backbone network, a neck network, and a detection head.
[0034] However, existing YOLO models struggle to adapt to the unique environment of railway stations for small target detection. This is because: ① Station cameras are installed at a high position, and passengers occupy a small proportion of the frame. In addition, there is severe occlusion between targets when passenger flow is dense, leading to a further reduction in effective pixels; ② Assistive devices such as wheelchairs and canes are small in size, resulting in insufficient representation of target features under long-distance shooting conditions. This leads to a significant lack of localization accuracy in standard YOLO models, and even the problem of missing detection for targets with extremely small pixel sizes; ③ Existing YOLO models have large downsampling rates, such as 8x, 16x, and 32x downsampling, which may cause the feature information of these tiny targets to be over-compressed or even lost in deep feature maps.
[0035] Based on this, to improve the detection accuracy of YOLO-based target detection models for small objects such as assistive devices, this application introduces a small target detection enhancement mechanism into the YOLO model to improve the feature extraction capability and recognition accuracy of key small-sized targets. Specifically, the neck network in the target detection model may include a top-down feature fusion module, a bottom-up feature fusion module, and a multi-scale feature map output module. The top-down feature fusion module upsamples the multi-scale feature map output by the backbone network, the bottom-up feature fusion module includes a small target detection layer for downsampling the multi-scale feature map output by the top-down feature fusion module, and the multi-scale feature map output module outputs the multi-scale feature map obtained by the bottom-up feature fusion module to the detection head. For example, after introducing the small target detection layer, the bottom-up feature fusion module can extract a 4-fold downsampled feature map from the output of the top-down feature fusion module and stitch them together to form a shallow feature map. Compared to the existing YOLO model, the shallow feature map obtained by introducing the small target detection layer retains more detailed information, which is beneficial for the recognition of small targets.
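The benefit of a 4x downsampled feature map for small targets can be seen from a simple calculation of grid sizes. The sketch below assumes a 640×640 input image and a 16-pixel-wide target, neither of which is stated in the text; the function names are hypothetical.

```python
def feature_map_sizes(height, width, strides=(4, 8, 16, 32)):
    """Spatial size of the feature map at each downsampling rate."""
    return {s: (height // s, width // s) for s in strides}


def cells_covered(target_px, stride):
    """Approximate number of feature-map cells covered by a square target."""
    return (target_px / stride) ** 2


sizes = feature_map_sizes(640, 640)   # assumed input resolution
small_target = 16                     # assumed target side length in pixels
coverage = {s: cells_covered(small_target, s) for s in sizes}
```

Under these assumptions the target spans 16 cells at 4x downsampling but only a quarter of a cell at 32x, which is consistent with the statement that the shallow feature map retains more detail for small targets.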
[0036] This application does not impose specific limitations on the architecture of the small object detection layer. For example, a path aggregation network (PAN) can be used to construct the small object detection layer to achieve feature enhancement, or it can be constructed by fusing a high-resolution feature enhancement network with a multi-scale receptive field.
[0037] As an example, adding a small target detection layer to the neck network of an existing YOLO model can enhance the network's ability to detect small targets and improve detection efficiency. The number of parameters in this improved YOLO model can be calculated using the following formula: $P = K^2 \cdot C_{in} \cdot C_{out}$, where $K$ represents the kernel size of the YOLO model, $C_{in}$ represents the number of channels in the image input to the YOLO model (specifically, the number of channels in the image input to the backbone network), and $C_{out}$ represents the number of channels in the image output by the YOLO model (specifically, the number of channels in the image output by the detection head). As the formula shows, introducing a small target feature enhancement mechanism increases the number of model parameters and reduces the overall inference and detection speed. However, the improved YOLO model still achieves a frame rate of 25 frames per second or more (a conventional value), which meets the requirements for real-time detection.
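For a single convolutional layer, the standard weight count (ignoring bias terms) is the kernel area multiplied by the input and output channel counts. The sketch below computes this count and, for comparison, the usual count for a plain depthwise separable convolution (one depthwise k×k filter per input channel followed by a 1×1 pointwise convolution). The function names and the channel numbers in the usage note are illustrative only, and the additional guide and filter generator parameters of the layer described in this application are not included.

```python
def conv_params(kernel_size, in_channels, out_channels):
    """Weights of a standard convolution: k * k * C_in * C_out (no bias)."""
    return kernel_size * kernel_size * in_channels * out_channels


def depthwise_separable_params(kernel_size, in_channels, out_channels):
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise one."""
    depthwise = kernel_size * kernel_size * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise
```

For a 3×3 layer with 64 input and 128 output channels, these functions give 73,728 and 8,768 weights respectively.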
[0038] In some embodiments of the present invention, to further improve the model's recognition accuracy, this application, in addition to introducing a small target enhancement layer, also introduces a region-dynamically perceptive depth-separable convolutional layer to replace the convolutional layers in the YOLO model, making each convolutional layer in the target detection model a region-dynamically perceptive depth-separable convolutional layer. This model improvement method can significantly enhance the model's visual feature extraction capability and spatial context understanding capability without significantly increasing computational costs.
[0039] The original convolutional layers in the YOLO model typically extract features using filters (also known as convolutional kernels), and the feature representation capability is improved by increasing the number of filters. However, increasing the number of filters often leads to a significant increase in computational complexity. This application introduces a region-dynamically aware depthwise separable convolutional layer, which adds a learnable guide in the spatial dimension to improve the feature extraction capability of the convolutional layer, rather than simply increasing the number of filters. Specifically, as shown in Figure 3, the region-dynamically aware depthwise separable convolutional layer may include a depthwise separable convolutional layer, a learnable guide, and a filter generator. The depthwise separable convolutional layer processes the input image (the feature map output from the previous network layer, or the image to be identified captured by a camera) to generate guiding features for subsequent operations. Based on the guiding features, the learnable guide divides the spatial dimension into multiple regions and generates a different feature representation for each region. Within each region, the filter generator dynamically generates two-dimensional convolutional filters suitable for that region, and the convolution is performed using these filters. This design not only reduces the number of parameters in the convolutional module but also improves the model's adaptability to features from different regions, thereby enhancing representational capability while maintaining computational cost.
[0040] A depthwise separable convolutional layer is used to generate a guiding feature map based on the input image. The guider is used to reduce the dimensionality of the guiding feature map and uses the softmax function to determine the weights of each pixel in the dimensionality-reduced guiding feature map in multiple preset regions. The filter generator is used to obtain the basic filter tensors based on the input image (each preset region has a corresponding basic filter tensor, and the number of basic filter tensors corresponding to each preset region is the same), and uses the weights output by the guider to perform a weighted summation to obtain the convolution kernel corresponding to each pixel in the input image, thereby using the obtained convolution kernel to perform a convolution operation on the input image.
[0041] As an example, a learnable guide can generate a region attribution weight for each pixel on the feature map, as follows: ① Input and dimensionality reduction: The guide receives the feature map $F$ from the depthwise separable convolutional unit as the guide feature map. Assume the size of $F$ is $H \times W \times C$, where $H$ and $W$ are the length and width, respectively, and $C$ is the number of channels. Using a lightweight convolutional layer (such as a 1×1 convolution) or a fully connected layer, the number of channels $C$ is compressed to a very small dimension $M$ (for example, $M$ = 4 or 8; $M$ can be understood as a preset maximum number of region categories), resulting in a low-dimensional feature map $F' \in \mathbb{R}^{H \times W \times M}$.
[0042] ② Generate spatial weights: The softmax function is applied to the feature map $F'$ along the channel dimension. After softmax processing, each spatial location (i.e., each pixel) on the feature map receives an $M$-dimensional vector, whose elements represent the probabilities or weights of that pixel belonging to each of the $M$ regions (region 1, region 2, ..., region $M$). Each element of the $M$-dimensional vector is between 0 and 1, and the elements of each vector sum to 1. For example, if $M$ = 4 and the $M$-dimensional vector obtained for a certain pixel is [0.05, 0.8, 0.1, 0.05], this vector indicates that the pixel mainly belongs to the second region.
[0043] ③ Output: The output of the guide is not a map with clear boundaries, but a three-dimensional weight tensor $A \in \mathbb{R}^{H \times W \times M}$ formed from the $M$-dimensional vectors of all pixels.
[0044] The filter generator can transform the region assignment weights obtained from the guide into convolution kernels that act on each region, and then use the generated convolution kernels to perform convolution operations. The convolution kernel generation process is as follows: ① Generate the basic filter bank: The core of the filter generator is a small neural network (such as one consisting of fully connected layers or 1×1 convolutions) that, based on aggregated features of the input image $X$ of the depthwise separable convolutional layer (such as the global context vector obtained by applying global average pooling to $X$), produces the basic filter tensor $B \in \mathbb{R}^{M \times C_{out} \times k \times k}$, where $C_{out}$ indicates the number of channels in the DR-DP Conv output feature map and $k \times k$ indicates the kernel size (e.g., 3×3). In other words, the filter generator generates $M$ different groups of complete convolution kernels at once, with $C_{out}$ convolution kernels in each group.
[0045] ② Weighting: For each spatial location $(i, j)$ on the input image $X$, a unique convolution kernel suitable for that location can be generated. The convolution kernel $K_{(i,j)}$ corresponding to pixel $(i, j)$ is obtained by weighted summation of the basic filters, using the formula $K_{(i,j)} = \sum_{m=1}^{M} a_{(i,j),m} \, B_m$, where $a_{(i,j),m}$ represents the weight of pixel $(i, j)$ belonging to region $m$, and $B_m$ is the basic filter tensor of the $m$-th group.
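The guide and filter generator steps can be illustrated with a deliberately small sketch: one input channel, one output channel, and basic filters supplied directly rather than produced by a neural network. These simplifications, the zero padding at the image border, and all function names are assumptions made for illustration; the sketch only shows the per-pixel softmax weighting and the weighted summation of basic filters.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of region logits."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def pixel_kernel(weights, base_filters):
    """Weighted sum of the basic k x k filters, giving one k x k kernel."""
    k = len(base_filters[0])
    return [[sum(w * f[r][c] for w, f in zip(weights, base_filters))
             for c in range(k)] for r in range(k)]


def dynamic_conv(image, logits, base_filters):
    """Apply a per-pixel kernel to a single-channel image.

    logits[i][j] holds the region logits of pixel (i, j). Border pixels use
    zero padding, and the kernel is applied without flipping, as is usual
    for convolutional layers in neural networks.
    """
    h, w = len(image), len(image[0])
    k = len(base_filters[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            kern = pixel_kernel(softmax(logits[i][j]), base_filters)
            acc = 0.0
            for r in range(k):
                for c in range(k):
                    y, x = i + r - pad, j + c - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += kern[r][c] * image[y][x]
            out[i][j] = acc
    return out


# Two basic filters (two regions): identity and doubled identity.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
doubled = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]
image = [[1.0, 2.0], [3.0, 4.0]]
logits = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
out = dynamic_conv(image, logits, [identity, doubled])
```

With equal logits every pixel weights both regions by 0.5, so each per-pixel kernel is the average of the two basic filters and the output is the input scaled by 1.5.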
[0046] The object detection module is designed and integrated with a small object detection layer to improve the detection accuracy and recall of key points of small objects. In addition, a region-dynamic perception depth-separable convolutional layer is used to replace the original convolutional layer in the YOLO model (if the small object detection layer includes a convolutional layer, it can also be replaced), to achieve effective fusion of multi-scale features and enhancement of global information, thereby further improving the training efficiency and detection performance of the model.
[0047] Step S130: When the category of the carried item is luggage, the luggage size is determined based on the bounding box information of the carried item and the spatial relationship between the camera and the carried item, and the luggage quantity is determined based on the intersection over union (IoU) between the bounding box of the carried item and the bounding box of the corresponding pedestrian target. If the determined luggage size and / or luggage quantity meets the set luggage conditions, the pedestrian target corresponding to the carried item is identified as a key passenger carrying large and / or multiple pieces of luggage (that is, a key passenger carrying large luggage, a key passenger carrying multiple pieces of luggage, or a key passenger carrying both large and multiple pieces of luggage).
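The intersection over union between a luggage bounding box and a pedestrian bounding box can be computed as in the sketch below. Counting a piece of luggage toward a pedestrian when the IoU reaches a threshold is one plausible reading of this step; the threshold 0.05 and the function names are assumptions, and boxes are taken to be `(cx, cy, w, h)` tuples in pixels.

```python
def to_corners(box):
    """Convert (cx, cy, w, h) to (x1, y1, x2, y2)."""
    cx, cy, w, h = box
    return cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (cx, cy, w, h)."""
    ax1, ay1, ax2, ay2 = to_corners(box_a)
    bx1, by1, bx2, by2 = to_corners(box_b)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def count_luggage(pedestrian_box, luggage_boxes, iou_threshold=0.05):
    """Count luggage boxes whose IoU with the pedestrian box reaches the threshold."""
    return sum(1 for box in luggage_boxes
               if iou(pedestrian_box, box) >= iou_threshold)
```

Because a suitcase is usually much smaller than the person carrying it, the IoU values involved are small, which is why a low threshold is assumed here.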
[0048] In some embodiments of the present invention, the spatial relationship between the camera and the object being carried is the actual physical distance between the object being carried and the camera. The size of the luggage can be determined in the following way: when the category of the object being carried is determined to be luggage, the actual physical distance between the object being carried and the camera is determined based on a pre-trained distance prediction model; the product of the bounding box width of the object being carried and the actual physical distance is calculated to obtain a first theoretical distance, and the product of the bounding box height of the object being carried and the actual physical distance is calculated to obtain a second theoretical distance; the ratios of the first theoretical distance and the second theoretical distance to the camera focal length parameter are calculated to obtain the size of the luggage.
[0049] As an example, the bounding box of a pedestrian target and the bounding box of each carried object categorized as luggage (referred to as the luggage bounding box) are obtained from the image to be identified using the object detection model. The actual width of the luggage is calculated from the width of the luggage bounding box. The formula is: $W_{real} = \frac{w_{px} \cdot D}{f}$; where $w_{px}$ is the pixel width of the luggage in the image to be identified (i.e., the width of the luggage bounding box), $D$ is the actual physical distance between the luggage and the camera, and $f$ is the camera's focal length parameter.
[0050] Similarly, the actual height of the luggage can be calculated. The formula is: $H_{real} = \frac{h_{px} \cdot D}{f}$; where $h_{px}$ is the pixel height of the luggage in the image to be identified (i.e., the height of the luggage bounding box).
[0051] When either calculated luggage dimension exceeds its set size threshold (in this example, 2 meters for one dimension or 1 meter for the other), the pedestrian is considered a key passenger carrying large luggage.
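The size estimate described above (bounding-box extent times distance, divided by focal length) can be sketched as follows. This is a minimal illustration under stated assumptions: the focal length is expressed in pixels, the function names are hypothetical, and applying the 2 m limit to width and the 1 m limit to height is an assumption made only for this example.

```python
from typing import Tuple


def luggage_size_m(box_w_px: float, box_h_px: float,
                   distance_m: float, focal_px: float) -> Tuple[float, float]:
    """Estimate real width and height in meters from a bounding box."""
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    width_m = box_w_px * distance_m / focal_px
    height_m = box_h_px * distance_m / focal_px
    return width_m, height_m


def is_large_luggage(width_m: float, height_m: float,
                     width_limit_m: float = 2.0,
                     height_limit_m: float = 1.0) -> bool:
    """Assumed mapping of the example thresholds: width > 2 m or height > 1 m."""
    return width_m > width_limit_m or height_m > height_limit_m


# Hypothetical detection: a 300 x 200 px box, 8 m from a camera with f = 1000 px.
w, h = luggage_size_m(300, 200, distance_m=8.0, focal_px=1000.0)
large = is_large_luggage(w, h)
```

In practice `distance_m` would come from the distance prediction model, and the thresholds would be configured per station.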
[0052] In some embodiments of the present invention, the actual physical distance between the target object and the camera can be determined as follows: the bounding box information of the target object and the location information of a specific camera (here, the specific camera is the camera that captures the image to be identified containing the target object) are input into a pre-trained distance prediction model (which can be a regression model trained based on a pre-constructed scene dataset), and the actual physical distance between the target object and the camera is output. Additionally, the location information of all cameras within the railway station can be pre-stored in the distance prediction model, such as by storing it as a mapping relationship between camera identifiers and camera locations. When determining the actual physical distance, inputting the camera identifier will automatically determine the camera location.
[0053] As an example, the intersection-over-union (IoU) ratio is calculated between the bounding box of the pedestrian target and the bounding box of each corresponding luggage target. If the calculated IoU ratio is greater than a set IoU threshold (e.g., 0.3), the luggage quantity is increased by one piece; if the calculated IoU ratio is not greater than the set IoU threshold, the luggage item is not counted. If the luggage quantity corresponding to the pedestrian target is not less than a set luggage quantity threshold (e.g., 3 pieces), the pedestrian target is determined to be a key passenger carrying multiple pieces of luggage. Assume the bounding box of the pedestrian target is $B_p$, and that the target detection model simultaneously detects $n$ carried items categorized as luggage, with bounding boxes $B_1, \dots, B_n$. Then the luggage quantity $N$ can be expressed as: $N = \sum_{i=1}^{n} \mathbb{1}\left[\mathrm{IoU}(B_p, B_i) > \tau\right]$, where $\tau$ is the set IoU threshold and $\mathbb{1}[\cdot]$ equals 1 when the condition holds and 0 otherwise.
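The IoU-based counting rule can be sketched as below. This is an illustrative sketch, not the application's implementation: boxes are assumed to be `(x1, y1, x2, y2)` tuples in pixels, the helper names are hypothetical, and luggage boxes at or below the threshold are simply not counted.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_luggage(person: Box, bags: List[Box], iou_threshold: float = 0.3) -> int:
    """Count luggage boxes whose IoU with the pedestrian box exceeds the threshold."""
    return sum(1 for bag in bags if iou(person, bag) > iou_threshold)


person_box = (0, 0, 100, 200)
bag_boxes = [(20, 80, 100, 200), (90, 150, 150, 210), (0, 100, 70, 200)]
n_bags = count_luggage(person_box, bag_boxes)
carries_multiple = n_bags >= 3  # example quantity threshold of 3 pieces
```

Here two of the three hypothetical boxes overlap the pedestrian strongly enough to be counted, so the example passenger does not reach the quantity threshold.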
[0054] The above-mentioned size thresholds of 2 meters and 1 meter and the luggage quantity threshold of 3 pieces are merely examples, and the present invention is not limited thereto.
[0055] Step S140: When the category of the object being carried is an assistive device, if the detection confidence of the object reaches a set confidence threshold (e.g., 0.75 or higher), it is considered a valid detection, and the pedestrian target corresponding to the object is identified as a key passenger who needs an assistive device.
[0056] Steps S130 and S140 may be executed in either order and do not affect each other. Based on YOLOv8, after introducing the small object detection layer and the region-dynamic perception depth-separable convolutional layer (DR-DP Conv), the model architecture can be as shown in Figure 4. Using the target detection model shown in Figure 4, pedestrian and carried-object detection can be performed as follows: a 640×640×3 image to be identified is input into the first layer of the backbone network, a DR-DP Conv layer. After passing through a series of DR-DP Conv layers, CSP layers, normalization layers, and activation layers, four feature maps of different scales are obtained. These four feature maps are then input into the top-down feature fusion module, where they are fused with the laterally input feature maps through upsampling. The fused features are then optimized through DR-DP Conv and CSP layers. Simultaneously, the fused features are enhanced through a bottom-up path, ultimately generating four fused feature maps of different scales. These feature maps are then fed into the detection head, where the classification and regression branches output the target category, location information, and confidence score. In Figure 4, K, S, and P represent the size of the convolution kernel, the stride, and the boundary expansion (padding) size of the feature map input to the network layer, respectively (the values K=3, S=2, P=1 in Figure 4 are just examples, and the invention is not limited thereto); the CSP layer can perform convolution operations on the input features, but does not use convolutional layers; and in the detection head, Bbox Loss represents the regression box loss, and Cls Loss represents the classification loss.
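The effect of the example parameters K=3, S=2, P=1 on feature map size follows from the standard convolution output-size formula, floor((n + 2P − K) / S) + 1. The short sketch below applies it to the 640-pixel input; the helper name and the number of stages shown are illustrative assumptions, not values read from Figure 4.

```python
def conv_output_size(n: int, kernel: int = 3, stride: int = 2, padding: int = 1) -> int:
    """Spatial size after one convolution: floor((n + 2P - K) / S) + 1."""
    return (n + 2 * padding - kernel) // stride + 1


# Repeated stride-2 convolutions halve a 640-pixel side at each stage.
sizes = [640]
for _ in range(5):  # five stages chosen only for illustration
    sizes.append(conv_output_size(sizes[-1]))
```

With these settings each stride-2 layer halves the spatial resolution, which is how feature maps of several different scales arise from one input.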
[0057] The target detection model can only detect the specific category of the object being carried. Pedestrian targets are identified only as pedestrians, and no more detailed information about them is obtained. Considering that elderly passengers are also key passengers but may not rely on assistive devices for movement, and that the target detection model lacks the ability to distinguish fine-grained features of special groups, this application can also introduce an attribute recognition module to identify elderly key passengers not identified in steps S130 and S140, thereby achieving a deeper level of key passenger identification. The attribute recognition module is mainly implemented by an attribute recognition model, which includes an attribute spatial memory (ASM) model and a Transformer structure. The Transformer structure can be used as the backbone network to adaptively focus on key information such as pedestrian features, and the ASM model can selectively strengthen discriminative features, thereby improving the model's recognition robustness under diverse appearance conditions.
[0058] As shown in Figure 1, the key passenger identification method proposed in this application further includes: step S210, inputting the features corresponding to a pedestrian target whose passenger type is not determined into a pre-trained attribute space memory model, and outputting the attribute similarity features corresponding to the pedestrian target; and step S220, inputting the attribute similarity features into a preset age classification algorithm to obtain the age prediction result of the pedestrian target, and determining whether the pedestrian target is an elderly key passenger based on the age prediction result.
[0059] In step S210 of this application, the feature extraction method of the attribute space memory model is the same as that of existing schemes, and the process is as follows: Features corresponding to preset attributes (such as obtaining attribute attention feature maps corresponding to each preset attribute) are extracted from the features corresponding to the input pedestrian target as a query. The query is then compared with the standard attributes pre-stored in the attribute space memory bank to calculate the similarity, thus obtaining the attribute similarity features corresponding to the pedestrian target. The preset attributes can be age-related attributes, such as posture, facial expression, and clothing, etc., and this application does not impose specific limitations. Since the preset attributes are age-related, the features output by the attribute space memory model are also age-related features. Furthermore, the method by which the attribute space memory model extracts features corresponding to the preset attributes can be to weight the features received by the ASM model according to the preset weights of each attribute. The standard attributes are the standard features of the preset attributes, which can be set according to existing datasets.
[0060] In some embodiments of the present invention, the features corresponding to pedestrian targets with undetermined passenger types are obtained based on pixel information within the bounding box of the pedestrian target. Specifically, the process involves selecting pedestrian targets with undetermined passenger types from the image to be identified, dividing the image within the bounding box of the selected pedestrian target into multiple image blocks, extracting corresponding preliminary features from each of the image blocks using an attention mechanism (such as the Swin Transformer), and projecting the extracted preliminary features into the same metric space as the standard attributes through feature projection, thereby obtaining the features corresponding to the pedestrian target. The image blocks can be divided according to a set number of image blocks or a set width and height; this application does not specifically limit the image block division method. The purpose of block processing is to obtain more refined local information, thereby better identifying whether a person is a key passenger. Feature projection can be implemented using lightweight neural networks or linear embedding layers. Furthermore, the purpose of adding an attention mechanism to the ASM model to extract preliminary features is to select features from more robust attribute-related regions, thereby improving the performance of pedestrian identification.
[0061] As an example, the image within the bounding box of a pedestrian target whose passenger type is not determined is divided into $N$ image patches, and each image patch is input into a Swin Transformer to enhance the attention weights for key regions such as the face. For the $i$-th image patch ($i = 1, \dots, N$), the feature $F_i$ output by the Swin Transformer is calculated as follows: $F_i = \mathrm{softmax}\!\left(\frac{Q X_i^{\top}}{\sqrt{d}}\right) X_i$; where $Q$ is the query vector for the key region, $X_i^{\top}$ is the transpose of the feature vectors input to the Swin Transformer, and $\sqrt{d}$ is the dimension scaling factor ($d$ indicates the dimension of the query vector).
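The role of the query vector and the dimension scaling factor can be illustrated with generic scaled dot-product attention. This is a simplified sketch, not the Swin Transformer's windowed attention: it assumes a single query vector and uses the patch feature vectors as both keys and values, and the name `attend` and the toy vectors are hypothetical.

```python
import math
from typing import List, Tuple


def attend(query: List[float], tokens: List[List[float]]) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention of one query over a list of feature vectors.

    Returns the attention weights and the weighted sum of the feature vectors.
    """
    d = len(query)
    scores = [sum(q * t for q, t in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    output = [sum(w * tok[j] for w, tok in zip(weights, tokens)) for j in range(d)]
    return weights, output


# The query aligns with the first feature vector, so it receives more weight.
attn_weights, attn_output = attend([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Dividing by the square root of the query dimension keeps the scores in a range where the softmax does not saturate as the dimension grows.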
[0062] The attribute space memory model combines attention and selection mechanisms to achieve accurate attribute localization based on memory information. For the $l$-th feature map input to the ASM model and the $a$-th attribute ($a = 1, \dots, A$, where $A$ is the preset total number of attributes), the attribute attention feature map is calculated from $w_a$, the weight vector related to the $a$-th attribute, and $x_{h,w}^{l}$, the feature vector at position $(h, w)$ in the $l$-th feature map input to the ASM model; $H_l$, $W_l$, and $C_l$ represent the height, width, and number of channels of the $l$-th feature map, respectively.
[0063] The similarity between the attribute attention feature map and the standard attributes pre-stored in the attribute space memory is then calculated, and the attribute similarity feature is output. Overall, the ASM model maps its input feature maps to the age-related attribute similarity features that it outputs. Additionally, attribute attention maps with high confidence obtained during the inference phase can be stored in the attribute space memory.
[0064] In step S220, an age classification algorithm based on a classification function or model (such as a support vector machine) can be used to predict the probability that a passenger is an elderly key passenger. The age classification algorithm defined using a support vector machine can be expressed as: $y = w^{\top} \phi(s) + b$; where $\phi(s)$ represents a linear transformation of the age features $s$, $b$ is the translation amount in the feature space, and $w$ is the weight vector of the support vector machine, which is determined during the training of the age classification algorithm.
[0065] This application does not specifically limit the method for determining whether a pedestrian target is an elderly key passenger based on the age prediction result; for example, when the age prediction result $y$ reaches a set value (e.g., 0.75), the passenger is identified as an elderly key passenger.
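A minimal sketch of the age classification decision is given below. It assumes a linear decision function (weights and offset learned elsewhere) and, as a labeled assumption not stated in the application, maps the decision value to a probability with a sigmoid before comparing it with the example value of 0.75. All names and numbers are hypothetical.

```python
import math
from typing import Sequence


def decision_value(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Linear decision function: dot(weights, features) + bias."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(w * f for w, f in zip(weights, features)) + bias


def elderly_probability(features: Sequence[float], weights: Sequence[float], bias: float) -> float:
    """Assumed sigmoid mapping from decision value to a probability-like score."""
    return 1.0 / (1.0 + math.exp(-decision_value(features, weights, bias)))


def is_elderly(features: Sequence[float], weights: Sequence[float], bias: float,
               threshold: float = 0.75) -> bool:
    return elderly_probability(features, weights, bias) >= threshold


# Hypothetical trained parameters and one attribute similarity feature vector.
svm_weights, svm_bias = [2.0, -1.0], -0.5
score = decision_value([1.5, 0.5], svm_weights, svm_bias)
flagged = is_elderly([1.5, 0.5], svm_weights, svm_bias)
```

If the classifier already outputs calibrated probabilities, the sigmoid step would be dropped and the threshold applied directly.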
[0066] To enable staff to provide services to key passengers, this application may also store features related to staff clothing characteristics in the attribute space memory, such as discriminative visual patterns of various uniforms (e.g., specific color distributions, texture features, or distinctive markings). By inputting appearance features related to pedestrian targets with undetermined passenger types during the inference stage, the similarity between the pedestrian target and the standard uniform features in the attribute space memory is determined, thereby determining whether the pedestrian target is a staff member.
[0067] Regarding target tracking, this application proposes a feature decoupling method based on Cascaded Variational Autoencoder (CVAE) technology. Compared with traditional methods, this method can effectively eliminate domain difference interference in cross-camera scenes and enhance adaptability to changes in target appearance by leveraging the cascaded coding structure. Specifically, after identifying key passengers, the method proposed in this application may further include: when the pedestrian target is determined to be a key passenger (a key passenger carrying large and / or multiple pieces of luggage, a key passenger requiring assistive devices, or an elderly key passenger), the key passenger image is segmented from the image to be identified based on the bounding box information of the pedestrian target, i.e., the target instance within the bounding box of the pedestrian target is obtained; the key passenger image is input into a pre-trained cascaded variational encoder, and the latent variables corresponding to the key passenger image are output; if the similarity between two latent variables obtained from different images to be identified meets a preset similarity condition, then the key passenger images corresponding to these two latent variables belong to the same pedestrian target, thereby realizing the tracking of key passengers.
[0068] This application utilizes a cascaded variational autoencoder (CVAE) to decouple and learn the features of pedestrian targets, enabling real-time tracking of detected key passengers and monitoring of the presence of staff within a preset radius. The automatic tracking module primarily extracts more discriminative identity features, such as posture, shape, and color, through a two-level coding structure (appearance encoder and identity encoder). The pre-trained CVAE includes an appearance encoder and an identity encoder, with the output of the appearance encoder serving as the input to the identity encoder. Specifically, the appearance encoder extracts variable, identity-independent transient attributes from the input image, such as facial expressions, posture, lighting, hairstyle, and makeup. The identity encoder extracts invariant core identity information from the image, such as facial skeletal structure and the relative positions of facial features. The network architectures of the appearance encoder and the identity encoder follow that of the encoder in a variational autoencoder, and the two architectures can be identical. For example, both the appearance encoder and the identity encoder include convolutional layers (used to extract local features of the input image), pooling layers (used to reduce the dimensionality of the feature map and the amount of computation), fully connected layers (used to map the extracted features to the latent space), and variational layers (used to learn the distribution of latent variables, usually including the mean and variance).
[0069] As an example, the appearance encoder takes as input the features of an input image $x$ (i.e., a key passenger image); these features can be the deep features extracted from the input image by a convolutional neural network. The features are passed through a parameterized projection layer (which outputs the mean and variance of a Gaussian distribution) to obtain a probabilistic feature representation, i.e., the posterior distribution of the input image $x$ in the latent space, which maps the image to the appearance latent variable $z_a$. The approximate posterior distribution of the appearance latent variable $z_a$ can be expressed by the formula: $q_{\phi_a}(z_a \mid x) = \mathcal{N}\!\left(z_a;\ \mu_{\phi_a}(x),\ \sigma_{\phi_a}^{2}(x)\right)$, where $\phi_a$ represents the parameters of the appearance encoder determined through training, $\mathcal{N}$ indicates a normal distribution, $\mu_{\phi_a}(x)$ is the mean predicted by the pre-trained appearance encoder from the input $x$, $\sigma_{\phi_a}^{2}(x)$ is the variance predicted by the pre-trained appearance encoder from the input $x$, and $q_{\phi_a}(z_a \mid x)$ is the distribution of $z_a$ obtained by the pre-trained appearance encoder from the input $x$.
[0070] The identity encoder takes the appearance latent variable $z_a$ as input, strips away interference information related to the camera domain, and extracts the latent variable of identity features $z_{id}$, which is invariant to the field of view. The approximate posterior distribution of the latent variable of identity features $z_{id}$ is: $q_{\phi_{id}}(z_{id} \mid z_a) = \mathcal{N}\!\left(z_{id};\ \mu_{\phi_{id}}(z_a),\ \sigma_{\phi_{id}}^{2}(z_a)\right)$, where $\phi_{id}$ represents the parameters of the identity encoder determined through training, $\mathcal{N}$ indicates a normal distribution, $\mu_{\phi_{id}}(z_a)$ is the mean predicted by the pre-trained identity encoder from the input $z_a$, $\sigma_{\phi_{id}}^{2}(z_a)$ is the variance predicted by the pre-trained identity encoder from the input $z_a$, and $q_{\phi_{id}}(z_{id} \mid z_a)$ is the distribution of $z_{id}$ obtained by the pre-trained identity encoder from the input $z_a$.
[0071] For image frames of the same target instance captured by different cameras, the distribution of latent variables output by the identity encoder is approximately the same. Therefore, by comparing the similarity of the latent vectors $z_{id}$ corresponding to different input images $x$, the same target can be identified in different images.
[0072] As an example, cameras capture images 1 and 2 to be identified, respectively. The target detection model is used to detect pedestrian targets in images 1 and 2. The cascaded variational encoder is then used to extract the latent variable $z_{id}^{(1)}$ of a pedestrian target in image 1 who is a key passenger, and the latent variable $z_{id}^{(2)}$ of a pedestrian target in image 2 who is a key passenger. The cosine similarity between the two latent variables is calculated using the following formula: $\mathrm{sim}\!\left(z_{id}^{(1)}, z_{id}^{(2)}\right) = \frac{z_{id}^{(1)} \cdot z_{id}^{(2)}}{\lVert z_{id}^{(1)} \rVert \, \lVert z_{id}^{(2)} \rVert}$.
[0073] If the similarity between the latent vectors $z_{id}^{(1)}$ and $z_{id}^{(2)}$ satisfies the preset similarity condition, it is determined that the pedestrian target in image 1 corresponding to $z_{id}^{(1)}$ and the pedestrian target in image 2 corresponding to $z_{id}^{(2)}$ are the same target, enabling continuous tracking of key passenger targets. The preset similarity condition can be that the cosine similarity exceeds a preset similarity threshold; however, this invention is not limited to this.
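The cross-camera matching rule can be sketched as a cosine similarity comparison between two identity latent vectors. The vectors and the threshold value below are hypothetical; the application does not state a specific similarity threshold, so it is passed in as a parameter.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two latent vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("latent vectors must have the same length")
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def same_target(z1: Sequence[float], z2: Sequence[float], threshold: float) -> bool:
    """Treat two detections as one passenger if similarity exceeds the threshold."""
    return cosine_similarity(z1, z2) > threshold


z_cam1 = [1.0, 2.0, 3.0]
z_cam2 = [2.0, 4.0, 6.0]   # same direction as z_cam1
z_other = [-3.0, 0.0, 1.0]  # orthogonal to z_cam1
matched = same_target(z_cam1, z_cam2, threshold=0.9)  # 0.9 is an assumed value
```

Because cosine similarity ignores vector length, two latent vectors that differ only in scale are still treated as the same identity.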
[0074] In some embodiments of the present invention, this application also introduces an adversarial domain discriminator on top of the cascaded variational encoder to ensure that the latent variables of identity features are invariant to the field of view. Specifically, the output of the identity encoder is input into the adversarial domain discriminator, so that optimized latent variables of identity features can be extracted. To improve robustness, similarity can also be calculated using the optimized latent variables in order to identify the same pedestrian target.
[0075] As an example, the adversarial domain discriminator can be trained by maximizing its loss function (with the aim of minimizing its own discrimination error). The loss function is defined as an expectation taken over the images input to the cascaded variational encoder, with the term inside the expectation given by a conditional probability distribution.
[0076] In some embodiments of the present invention, while identifying the movement trajectory of key passengers, it is also possible to detect in real time whether there are targets matching the clothing characteristics of staff within the service monitoring area surrounding the key passengers. The process of detecting staff is as follows: after identifying key passengers, all pedestrian targets (target instances within the bounding box) not identified as key passengers are extracted from the service monitoring area, and these are input into the attribute recognition module to determine the staff and their location.
[0077] For example, the service monitoring area can be a circular region centered at the position $p_k$ of the key passenger target with a radius of 10 m, which can be represented by the formula: $A = \{\, p : \lVert p - p_k \rVert \le 10\ \text{m} \,\}$. The location of key passenger targets can be determined based on the pre-trained distance prediction model.
[0078] If staff are detected within the service monitoring area, the minimum physical distance between the key passenger target and the staff is no greater than 1 m, and the interaction time is not less than 30 seconds, then the key passenger is determined to be in an assisted state. Otherwise, the key passenger is judged to be in an unassisted state, and the passenger station service system automatically pushes service instructions to the handheld terminal of the staff member closest to the passenger, based on the location detected by the cascaded variational encoder. The service instructions include the real-time coordinates of the key passenger target, a characteristic description (such as the type of key passenger), and service suggestions. This mechanism not only realizes proactive service triggering, but also builds a complete "detection-identification-service verification" process, providing quantitative data support for service quality assessment and process optimization.
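The assisted / unassisted decision can be sketched as follows. This is an illustrative sketch under assumptions: positions are 2-D coordinates in meters, each staff record carries a position and an accumulated interaction time, and the function name and data layout are hypothetical. The 10 m radius, 1 m distance, and 30 s duration are the example values from the description.

```python
import math
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]


def service_state(passenger: Point,
                  staff: Dict[str, Tuple[Point, float]],
                  radius_m: float = 10.0,
                  assist_dist_m: float = 1.0,
                  min_interaction_s: float = 30.0) -> Tuple[bool, Optional[str]]:
    """Return (assisted, staff_id_to_notify).

    staff maps a staff id to (position, interaction time in seconds).
    When the passenger is unassisted, the nearest staff member is selected
    for notification; None means no one needs to be notified.
    """
    distances = {sid: math.dist(passenger, pos) for sid, (pos, _) in staff.items()}
    assisted = any(
        distances[sid] <= radius_m
        and distances[sid] <= assist_dist_m
        and secs >= min_interaction_s
        for sid, (_, secs) in staff.items()
    )
    if assisted or not distances:
        return assisted, None
    return False, min(distances, key=distances.get)


helped = service_state((0.0, 0.0), {"A": ((0.5, 0.5), 45.0)})
waiting = service_state((0.0, 0.0), {"A": ((0.5, 0.5), 10.0), "B": ((3.0, 4.0), 0.0)})
```

In the second call staff member A is close but has interacted for only 10 seconds, so the passenger counts as unassisted and A, being nearest, would receive the service instruction.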
[0079] The following verification experiments demonstrate that the key passenger identification cascade network proposed in this application can significantly improve the intelligence, accuracy, and timeliness of key passenger services at railway stations, providing effective technical support for ensuring the travel safety of key passengers.
[0080] This application systematically evaluates the performance of three core modules in sequence: the target detection model is tested for target recognition accuracy, the feature extraction capability of the attribute recognition model is verified, and the trajectory continuity performance of the automatic tracking model based on cascaded variational autoencoder is evaluated in multi-camera scenarios; then, through on-site testing in actual station scenarios, the overall verification of automatic identification and service for key passengers is completed.
[0081] ① The experimental environment was configured as follows: the hardware platform consisted of an Intel Core i7-8750H processor, an NVIDIA GeForce RTX 3080 graphics card, and 16 GB of RAM; the software environment was the Windows 10 operating system. During the training of the object detection model, the initial learning rate was set to 0.01, a periodic learning rate decay strategy (decay coefficient 0.2) was adopted, the weight decay was 0.0005, and the batch size was 8. The optimizer momentum parameter was set to 0.8 for the warm-up phase and 0.937 for the regular training phase. All experiments used an input image size of 640×640 pixels.
[0082] (1) Performance verification and analysis of the target detection model
In the experiments, YOLOv8 was used as the baseline model, trained using the Stochastic Gradient Descent (SGD) optimizer, and tested and validated on the PASCAL VOC public dataset. The validation results are shown in Table 1. To systematically evaluate the impact of network structure improvements on performance, mean average precision (mAP), parameter count, and inference latency were selected as core evaluation metrics. As shown in Table 1, by replacing the original standard convolutional layers in YOLOv8 with region-dynamic perception depth-separable convolutional layers, the improved model achieved improvements of 0.6% and 0.3% in the mAP50 and mAP50-95 metrics, respectively, indicating that its enhanced feature extraction capability is better suited to capturing subtle features such as assistive devices carried by passengers. Although the introduction of the small target detection layer increased inference latency by 5.2 ms, it significantly improved the detection rate of key small-sized targets such as wheelchairs and canes, which is crucial for ensuring the integrity of services for special groups. Here, mAP50 refers to the mAP calculated at an IoU threshold of 0.5, and mAP50-95 refers to the mAP averaged over IoU thresholds from 0.5 to 0.95 (usually in steps of 0.05).
[0083] Table 1 Experimental results of the improved target detection model
To verify the performance of the YOLO series models, this application also compares YOLOv8 with YOLOv4, YOLOv5, and YOLOv7, and the specific results are shown in Table 2. Compared with the existing YOLOv8 benchmark model, the improved model proposed in this application has a larger model size and more parameters, but it performs better in detecting key small targets such as wheelchairs and canes because it introduces a small target detection layer and uses DR-DP Conv to optimize the convolutional structure. The improved model does bring some computational overhead, such as a slight decrease in detection speed compared with versions with simpler network structures like YOLOv5 and YOLOv4. Nevertheless, the improved YOLOv8 proposed in this application has significant value in improving detection accuracy in complex passenger station environments: on the one hand, it alleviates the scale change problem caused by long-distance shooting through multi-scale feature fusion; on the other hand, it ensures reliable identification of key passenger markers by enhancing the perception of small targets. The model improvement scheme proposed in this application is therefore more suitable for real-time detection scenarios in railway passenger stations and provides effective technical support for intelligent services there.
[0084] Table 2 Experimental results of different YOLO models on the PASCAL VOC dataset
(2) Performance verification and analysis of the attribute recognition model
The evaluation metrics are mean accuracy (mA), accuracy (Accu), precision (Prec), recall, and F1 score. Comparing the attribute recognition model proposed in this application with existing attribute recognition algorithms gives the performance comparison results shown in Table 3. The mean accuracy (mA) is calculated as follows: $mA = \frac{1}{2R} \sum_{r=1}^{R} \left( \frac{TP_r}{P_r} + \frac{TN_r}{N_r} \right)$; where $r$ is the index of the attribute category in the test, $R$ is the total number of categories, $TP_r$ is the number of true positives (the number of samples that the attribute recognition model correctly predicts as category $r$), $P_r$ is the number of positive examples (the total number of samples that truly belong to category $r$), $TN_r$ is the number of true negatives (the number of samples that the attribute recognition model correctly predicts as not belonging to category $r$), and $N_r$ is the number of negative examples (the total number of samples in the dataset that do not belong to category $r$).
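The mean accuracy metric can be computed as in the sketch below, which uses the label-based definition common in pedestrian attribute recognition: for each attribute, the recall on positive samples and the recall on negative samples are averaged, and the result is averaged over attributes. The counts in the example are made up for illustration and are not results from this application.

```python
from typing import List, Tuple

# (true positives, positives, true negatives, negatives) for one attribute
AttrCounts = Tuple[int, int, int, int]


def mean_accuracy(per_attribute: List[AttrCounts]) -> float:
    """mA = 1/(2R) * sum over attributes of (TP/P + TN/N)."""
    if not per_attribute:
        raise ValueError("at least one attribute is required")
    total = 0.0
    for tp, pos, tn, neg in per_attribute:
        if pos <= 0 or neg <= 0:
            raise ValueError("each attribute needs positive and negative samples")
        total += tp / pos + tn / neg
    return total / (2 * len(per_attribute))


# Two hypothetical attributes with invented counts.
example_counts = [(80, 100, 90, 100), (45, 50, 120, 150)]
ma = mean_accuracy(example_counts)
```

Averaging positive and negative recall per attribute keeps rare attributes from being dominated by the majority class, which plain accuracy would not do.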
[0085] As shown in Table 3, the ASM-Transformer model proposed in this application demonstrates superior performance on both the PETA and PA100K public datasets. Compared to existing methods such as MsVAA (Multi-scale Visual Attention Aggregation), VAC (Visual Attention Consistency), and ALM (Attribute Localization Module), this application significantly improves key metrics with only a small increase in learnable parameters. Specifically, compared to the MsVAA model built on ResNet101, this method improves the average accuracy and F1 score on the PETA dataset by 4.95% and 1.59%, respectively; compared to the ALM model, which increases the number of parameters by 17% by introducing complex module combinations, the mA metrics are improved by 3.93% and 3.24% on both datasets, respectively. The main advantage of the attribute recognition model proposed in this application lies in the extraction of preliminary features using the Transformer structure and the automatic generation of stable and reliable attribute attention regions through the Attribute Spatial Memory (ASM) module. Compared with the VAC method, which only focuses on global attention consistency, the attribute recognition model of this application can generate accurate local attention regions for each fine-grained attribute, effectively solving the problem of attention region bias. This feature makes the ASM-Transformer model show unique advantages in recognizing elderly passengers: it can accurately locate and recognize the facial features of elderly passengers.
[0086] Table 3 Performance comparison of different attribute recognition algorithms
(3) Verification of multi-camera trajectory continuity based on cascaded variational autoencoder technology
To verify the model's generalization performance, 20%, 40%, 60%, 80%, and 100% of the publicly available datasets Market1501 and MSMT17 were used as training samples for the cascaded variational encoders. This means that five different cascaded variational encoders were trained on each dataset. The tracking performance of these five encoders was then verified on the corresponding Market1501 and MSMT17 datasets, and the results were averaged to obtain the results in Table 4. Table 4 verifies the superior tracking performance of the cascaded variational autoencoder technique. The data comparison shows that the automatic tracking module proposed in this application achieved the best performance on both the challenging Market1501 and MSMT17 datasets. The main reasons are as follows. First, the two-level coding structure can effectively decouple identity features from the camera domain, making the feature expression of the same target more consistent under different cameras, which is directly reflected in the 72.3% mAP on the MSMT17 multi-camera dataset. Second, the adversarial training mechanism ensures the field-of-view invariance of features, maintaining stable recognition ability even in severely occluded scenes. Finally, the similarity measurement in the latent space is more discriminative than traditional appearance feature matching, which is particularly important for distinguishing passengers with similar clothing in complex passenger station environments.
[0087] Table 4 Comparative experimental results under different datasets and trajectory verification methods
② On-site testing
In a 30-day field test at Zhengzhou East Railway Station, the proposed method demonstrated good performance in identifying key passengers and delivering services. During the test, 2,847 key passenger identification events were recorded, and service downtime was reduced from 15 minutes in the traditional model to 3.5 minutes, significantly improving service efficiency. Table 5 compares the results for different types of key passengers, verifying that the proposed method achieved an accuracy rate exceeding 92% for all three types of key passengers. Specifically, the accuracy rate reached 98.2% for wheelchair users, 95.6% for elderly passengers, and 92.3% for passengers carrying large luggage, a significant improvement over the 76.8% accuracy rate of manual identification. Regarding response efficiency, the average response time was reduced to approximately 30 seconds: the response for wheelchair users was fastest at 25 seconds, followed by elderly passengers at 28 seconds and passengers carrying large luggage at 32 seconds. Compared with the traditional manual identification method, which takes 5-8 minutes, the response efficiency is improved by over 85%.
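The percentage improvements quoted above can be cross-checked with a relative reduction calculation, (before − after) / before. The sketch only re-derives ratios from the figures stated in the text; the helper name is hypothetical.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a duration shrinks: (before - after) / before."""
    if before <= 0:
        raise ValueError("the baseline duration must be positive")
    return (before - after) / before


# Service downtime: 15 minutes down to 3.5 minutes.
downtime_cut = relative_reduction(15 * 60, 3.5 * 60)

# Response time: about 30 seconds versus 5 to 8 minutes for manual identification.
response_cut_vs_5min = relative_reduction(5 * 60, 30)
response_cut_vs_8min = relative_reduction(8 * 60, 30)
```

The response time reduction works out to between 90% and about 94%, consistent with the stated improvement of over 85%, while the service downtime reduction is roughly 77%.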
[0088] Table 5 Comparison of Accuracy Rates for Identifying Key Passengers
A comparison of data before and after the deployment of the key passenger identification cascade network shows that key passenger satisfaction increased from 83% to 96% and the service complaint rate decreased by 62%. In services for special groups in particular, the service response time for wheelchair passengers was reduced to less than one minute, and the guidance service completion rate for elderly passengers reached 98.7%. The on-site data validates the effectiveness of the key passenger identification cascade network proposed in this application in a real passenger station, providing reliable technical support for the intelligent upgrading of key passenger services.
[0089] The proposed cascaded network for key passenger identification is composed of three main modules (a target detection model, an ASM-Transformer attribute recognition model, and an automatic tracking module based on cascaded variational autoencoder technology) and can construct a complete "identification-tracking-service verification" system for key passengers. Field tests also demonstrate that the network achieves an accuracy rate exceeding 92% for identifying three types of key passengers, with the response time reduced to approximately 30 seconds. This indicates that the proposed solution effectively addresses the low efficiency and slow response of traditional manual methods, providing a reliable solution for intelligent services at railway stations. It should be noted that, because camera-based detection of key passengers such as children and pregnant women has a high false alarm rate, the method proposed in this application does not cover the identification of these passengers.
[0090] During the research, development, training, and testing of the technical solution of this application, the following strict safeguards were implemented for processing personal images and identification information from public places: ① The primary and sole purpose of all image and identification information collection in public places is to maintain public safety (e.g., responding to public safety emergencies); ② Clear and prominent signage has been installed in the information collection area to inform the public of the collection area, purpose, information type, and responsible agency, ensuring the public's right to know; ③ All collected information is used solely for the aforementioned public safety purposes and will not be used for commercial marketing, personalized recommendations, or any other purpose unrelated to maintaining public safety; ④ All information is stored in a dedicated system or controlled experimental environment with high-level security protection, with strict access control and operational auditing.
[0091] Corresponding to the above method, the present invention also provides a system for identifying key passengers at railway stations. The system includes a computer device, which includes a processor and a memory. The memory stores computer programs / instructions, and the processor is used to execute the computer programs / instructions stored in the memory. When the computer programs / instructions are executed by the processor, the system implements the steps of the method described above.
[0092] This invention also provides a computer-readable storage medium storing a computer program / instructions thereon, which, when executed by a processor, implements the steps of the aforementioned method for identifying key passengers at railway stations. The computer-readable storage medium can be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, floppy disks, hard disks, removable storage disks, CD-ROMs, or any other form of storage medium known in the art.
[0093] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the desired tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave.
[0094] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.
[0095] In this invention, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.
[0096] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the embodiments of the present invention are possible. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for identifying key passengers at railway passenger stations, characterized in that the method comprises the following steps:
obtaining an image to be identified captured by a camera at a railway passenger station;
inputting the image to be identified into a pre-trained target detection model, which outputs the bounding box information of each pedestrian target, as well as the category information, bounding box information, and detection confidence of each object carried by the pedestrian target, wherein the categories of carried objects include luggage and assistive devices;
when the category of the carried object is luggage, determining the size of the luggage based on the bounding box information of the carried object and the spatial relationship between the camera and the carried object, and determining the number of luggage items based on the intersection over union (IoU) between the bounding box of the carried object and the bounding box of the corresponding pedestrian target, so that, if the determined luggage size and / or luggage quantity meets the set luggage conditions, the pedestrian target corresponding to the carried object is identified as a key passenger carrying large and / or multiple pieces of luggage;
when the category of the carried object is an assistive device and the detection confidence of the carried object reaches a set confidence threshold, identifying the pedestrian target corresponding to the carried object as a key passenger who needs an assistive device;
wherein the target detection model is an improved YOLO model comprising a backbone network, a neck network, and a detection head; each convolutional layer in the target detection model is a region-aware dynamic depthwise separable convolutional layer; the neck network includes a top-down feature fusion module, a bottom-up feature fusion module, and a multi-scale feature map output module; the top-down feature fusion module is used to upsample the multi-scale feature maps output by the backbone network; and the bottom-up feature fusion module includes a small target detection layer, which is used to downsample the multi-scale feature maps output by the top-down feature fusion module.
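As an illustration of the luggage-counting step in claim 1, the following minimal sketch computes the intersection over union between a pedestrian box and each luggage box and counts the luggage boxes that overlap the pedestrian. The `Box` type, the function names, the IoU threshold of 0.05, and the luggage conditions (two or more items, or a longest side of at least 0.7 m) are assumptions chosen for illustration; the source does not state these values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in pixel coordinates, (x1, y1) to (x2, y2)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes; 0.0 when they do not overlap."""
    inter_w = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    inter_h = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = inter_w * inter_h
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def count_luggage(pedestrian: Box, luggage_boxes: Sequence[Box],
                  iou_threshold: float = 0.05) -> int:
    """Count luggage boxes attributed to the pedestrian (threshold is hypothetical)."""
    return sum(1 for item in luggage_boxes if iou(pedestrian, item) >= iou_threshold)


def is_key_luggage_passenger(count: int, largest_side_m: float,
                             min_count: int = 2, min_side_m: float = 0.7) -> bool:
    """Hypothetical luggage condition: several items or one large item."""
    return count >= min_count or largest_side_m >= min_side_m


pedestrian = Box(0, 0, 100, 200)
luggage_boxes = [Box(80, 120, 140, 200), Box(300, 100, 360, 180)]
luggage_count = count_luggage(pedestrian, luggage_boxes)
```

In this example only the first luggage box overlaps the pedestrian, so one item is attributed to that passenger; whether the passenger is then flagged depends on the separately estimated luggage size.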
2. The method according to claim 1, characterized in that the region-aware dynamic depthwise separable convolutional layer includes a depthwise separable convolutional module, a learnable guide, and a filter generator, wherein: the depthwise separable convolutional module is used to generate a guide feature map based on the input image; the guide is used to perform dimensionality reduction on the guide feature map and to use the softmax function to determine, for each pixel in the dimensionality-reduced guide feature map, its weight in each of multiple preset regions; and the filter generator is used to obtain a basic filter tensor based on the input image and to perform a weighted summation of the basic filters using the weights output by the guide, thereby obtaining the convolution kernel corresponding to each pixel in the input image and using the obtained convolution kernels to perform a convolution operation on the input image.
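The per-pixel kernel construction in claim 2 can be sketched for a single channel as follows. This is a simplified illustration under stated assumptions, not the claimed layer: the guide logits and the base filters are passed in directly (in the claim they are produced from the input by learned modules), zero padding is assumed at the borders, and the name `region_aware_conv` is hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of region logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def region_aware_conv(image: Matrix, guide_logits: List[Matrix],
                      base_filters: List[Matrix]) -> Matrix:
    """image: H x W; guide_logits: R maps of H x W; base_filters: R kernels of k x k (odd k)."""
    h, w = len(image), len(image[0])
    k = len(base_filters[0])
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Guide: softmax weight of this pixel in each preset region.
            weights = softmax([g[y][x] for g in guide_logits])
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - pad, x + j - pad
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding outside the image
                        # Per-pixel kernel tap: weighted sum of the base filters.
                        tap = sum(wt * f[i][j] for wt, f in zip(weights, base_filters))
                        acc += tap * image[yy][xx]
            out[y][x] = acc
    return out


ones = [[1.0] * 3 for _ in range(3)]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
box_sum = [[1.0] * 3 for _ in range(3)]
equal_logits = [[[0.0] * 3 for _ in range(3)] for _ in range(2)]
result = region_aware_conv(ones, equal_logits, [identity, box_sum])
```

With equal logits each pixel blends the two base filters half and half, so the kernel varies only through the guide weights; changing the logits of one pixel changes the kernel applied at that pixel alone.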
3. The method according to claim 1, characterized in that the bounding box information includes the size and position of the bounding box, and the dimensions of the luggage are determined in the following way: if the category of the carried object is determined to be luggage, the actual physical distance between the carried object and the camera is determined based on a pre-trained distance prediction model; the product of the bounding box width of the carried object and the actual physical distance is calculated to obtain a first theoretical distance, and the product of the bounding box height of the carried object and the actual physical distance is calculated to obtain a second theoretical distance; and the luggage dimensions are obtained by calculating the ratios of the first and second theoretical distances to the camera focal length parameter.
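The computation in claim 3 matches the pinhole camera relation, in which physical size equals pixel size times distance divided by focal length. A minimal sketch follows, assuming the focal length is expressed in pixels and the distance in meters; the function name and the example values are hypothetical, and the distance would in practice come from the distance prediction model.

```python
from typing import Tuple


def estimate_luggage_size(bbox_width_px: float, bbox_height_px: float,
                          distance_m: float, focal_length_px: float) -> Tuple[float, float]:
    """Return (width_m, height_m) of the luggage under a pinhole camera assumption."""
    if focal_length_px <= 0:
        raise ValueError("focal length must be positive")
    first_theoretical = bbox_width_px * distance_m    # width x distance
    second_theoretical = bbox_height_px * distance_m  # height x distance
    return (first_theoretical / focal_length_px,
            second_theoretical / focal_length_px)


# Example: a 120 x 180 px box seen 5 m away by a camera with a 1000 px focal length.
width_m, height_m = estimate_luggage_size(120, 180, 5.0, 1000.0)
```

For these example values the estimate is roughly 0.6 m wide and 0.9 m high.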
4. The method according to claim 1, characterized in that determining the actual physical distance between the target object and the camera by the pre-trained distance prediction model includes: inputting the bounding box information of the target object and the position information of a specific camera into the pre-trained distance prediction model, and outputting the actual physical distance between the target object and the camera, wherein the specific camera is the camera that captured the image to be identified containing the target object.
5. The method according to claim 1, characterized in that the method further includes: inputting the features corresponding to a pedestrian target whose passenger type is undetermined into a pre-trained attribute space memory model, which outputs the attribute similarity features corresponding to the pedestrian target; inputting the attribute similarity features into a preset age classification algorithm to obtain the age prediction result of the pedestrian target; and determining, based on the age prediction result, whether the pedestrian target is an elderly key passenger; wherein the features corresponding to the pedestrian target are obtained from the pixel information within the bounding box of the pedestrian target, and the attribute space memory model is used to extract, from the input features of the pedestrian target, the features corresponding to the preset attributes as a query, and to calculate the similarity between the query and the standard attributes pre-stored in the attribute space memory, thereby obtaining the attribute similarity features corresponding to the pedestrian target.
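The query-against-memory step in claim 5 can be sketched as follows. Cosine similarity, the two-entry memory, and the argmax rule used in place of the preset age classification algorithm are all assumptions made for illustration; the source does not specify the similarity measure or the classifier.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute_similarity(query: Sequence[float],
                         memory: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """Compare the query with every standard attribute stored in the memory."""
    return {name: cosine(query, prototype) for name, prototype in memory.items()}


def predict_age_group(similarities: Dict[str, float]) -> str:
    """Stand-in classifier: pick the most similar stored attribute."""
    return max(similarities, key=similarities.get)


# Hypothetical memory of standard attributes and a hypothetical query feature.
memory = {"adult": [1.0, 0.0, 0.0], "elderly": [0.0, 1.0, 0.0]}
similarities = attribute_similarity([0.2, 0.9, 0.1], memory)
age_group = predict_age_group(similarities)
```

A pedestrian whose predicted group is the elderly one would then be treated as an elderly key passenger.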
6. The method according to claim 5, characterized in that the features corresponding to the pedestrian targets with undetermined passenger types are obtained as follows: selecting the pedestrian targets with undetermined passenger types from the image to be identified, and dividing the image within the bounding box of each selected pedestrian target into multiple image patches; and extracting the corresponding features from the multiple image patches using an attention mechanism, and projecting the extracted features, through feature projection, into the same metric space as the standard attributes, thereby obtaining the features corresponding to the pedestrian target.
7. The method according to claim 1, characterized in that, after identifying key passengers, the method further includes: if the pedestrian target is determined to be a key passenger, obtaining a key passenger image by segmenting it from the image to be identified based on the bounding box information of the pedestrian target; inputting the key passenger image into a pre-trained cascaded variational encoder, which outputs the latent variable corresponding to the key passenger image; and, if the similarity between two latent variables obtained from different images to be identified meets a preset similarity condition, determining that the key passenger images corresponding to these two latent variables belong to the same pedestrian target, thereby realizing the tracking of key passengers.
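The matching rule in claim 7 can be illustrated with a small sketch that assigns a track identifier to each latent variable. Cosine similarity, the threshold of 0.9, and the greedy assignment against the first latent of each track are assumptions; the source only requires that the similarity meet a preset condition.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_track_ids(latents: Sequence[Sequence[float]],
                     threshold: float = 0.9) -> List[int]:
    """Give each latent the ID of its best-matching earlier track, or a new ID."""
    representatives: List[Sequence[float]] = []  # first latent seen for each track
    ids: List[int] = []
    for z in latents:
        best_id, best_sim = None, threshold
        for track_id, rep in enumerate(representatives):
            sim = cosine_similarity(z, rep)
            if sim >= best_sim:
                best_id, best_sim = track_id, sim
        if best_id is None:
            # No existing track is similar enough: start a new one.
            representatives.append(z)
            best_id = len(representatives) - 1
        ids.append(best_id)
    return ids


# Hypothetical latents: two views of one passenger, then a different passenger.
track_ids = assign_track_ids([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])
```

Here the first two latents share a track because their similarity exceeds the threshold, while the third starts a new track.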
8. The method according to claim 1, characterized in that, after obtaining the latent variables corresponding to the key passenger images, the method further includes: inputting the latent variables corresponding to the key passenger images into a pre-trained adversarial domain discriminator to obtain the optimized latent variables corresponding to the key passenger images.
9. A system for identifying key passengers at railway stations, comprising a processor, a memory, and a computer program / instructions stored in the memory, characterized in that the processor is configured to execute the computer program / instructions, and when the computer program / instructions are executed, the system implements the steps of the method as described in any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program / instructions stored thereon, characterized in that, when the computer program / instructions are executed by a processor, they implement the steps of the method as described in any one of claims 1 to 8.