Video person retrieval method, apparatus, device, and storage medium

By establishing a video database, extracting character segments and audio from target videos, performing semantic analysis and text combination, and using a trained convolutional neural network model, the problem of inaccurate retrieval results in existing video search technologies is solved, achieving efficient and accurate video character retrieval.

CN115238124B · Active · Publication Date: 2026-06-23 · PING AN TECH (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: PING AN TECH (SHENZHEN) CO LTD
Filing Date: 2022-08-09
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

In existing video search technologies, the accuracy of retrieval results obtained by extracting human segments through facial recognition is not high, resulting in low efficiency.

Method used

A video-based person retrieval method is adopted. By establishing an original video database, person segments and audio from target videos are extracted, semantic parsing and text combination are performed, and a pre-set person identification model is used to extract the person's identity from the combined text. The model is obtained by training a convolutional neural network.

Benefits of technology

It improves the accuracy and efficiency of video person retrieval, and can automatically extract the identity of people in target videos. It combines audio and video text to achieve efficient and accurate retrieval.



Abstract

The video character retrieval method, device, equipment and storage medium of the present application, wherein the method comprises: establishing an original video database, obtaining a target video from the original video database; extracting a character segment in the target video; extracting a character audio in the character segment; performing semantic analysis on the character audio to obtain an audio text; combining the audio text and a video text in the target video to obtain a combined text; and using a preset character identity recognition model to extract a character identity in the combined text, wherein the preset character identity recognition model is obtained by training a convolutional neural network. The combined text combining the audio text and the video text includes the character name and the character relationship of the target video, and the character identity in the combined text can be automatically extracted by the trained preset character identity recognition model, thereby achieving high retrieval accuracy and efficiency.

Description

Technical Field

[0001] This application relates to the field of video retrieval technology, such as methods, apparatus, devices, and storage media for retrieving people in videos.

Background Technology

[0002] With the explosive growth of video data, finding the videos users need from this massive amount of content has become a crucial issue. The most important element in videos is the people, who often embody the most abstract and important content. Understanding and analyzing the people in videos is a vital task for current video retrieval systems.

[0003] Traditional video databases often rely on visual features of people for video retrieval. For example, when a user wants to search for a specific person, such as a favorite celebrity, facial recognition is typically used to extract clips of that person. However, this retrieval method is relatively inefficient and yields low accuracy results.

Summary of the Invention

[0004] This application provides a video person retrieval method, apparatus, device, and storage medium, aiming to solve the problem that existing video search technologies only use facial recognition to extract person segments, resulting in low accuracy of retrieval results.

[0005] To solve the above problems, this application adopts the following technical solution:

[0006] This application provides a method for retrieving people from videos, the method including:

[0007] Establish an original video database and obtain the target video from the original video database;

[0008] Extract the human segments from the target video;

[0009] Extract the audio of the characters from the aforementioned character segments;

[0010] Semantic analysis is performed on the audio of the person to obtain the audio text;

[0011] The audio text and the video text in the target video are combined to obtain combined text;

[0012] The identities of individuals in the combined text are extracted using a preset individual identity recognition model, wherein the preset individual identity recognition model is obtained by training a convolutional neural network.

[0013] The extraction of character segments from the target video includes:

[0014] Extract key frames from the target video, and input the key frames into the feature extraction layer of the preset person identification model for sequential convolution, activation and pooling to obtain a feature map;

[0015] The feature map is input into the region candidate layer of the preset person identity recognition model for region segmentation, and the region of interest is extracted.

[0016] The region of interest is input into the interest region pooling layer of the preset person identity recognition model for pooling to obtain pooled features;

[0017] The pooled features are input into the classification and regression layers of the preset person identification model for binary classification to extract the person fragments.

[0018] The semantic parsing of the character's audio to obtain audio text includes:

[0019] Convert the character's audio into a character audio vector;

[0020] The audio vector of the person is input into a preset speech recognition model, and the keywords and triplet relationships of the audio vector are extracted by the preset speech recognition model to obtain the audio text.

[0021] The process of combining the audio text and the video text in the target video to obtain combined text includes:

[0022] The combined text is obtained by combining the audio text with the video title and video summary of the target video.

[0023] The preset person identification model is obtained by training a convolutional neural network, including:

[0024] Obtain the training text set;

[0025] The training text set is input into the convolutional neural network, and the loss function value is calculated.

[0026] The loss function value is backpropagated to update the network parameters of the convolutional neural network.

[0027] Determine whether the network parameters are less than or equal to the parameter threshold. If so, stop training the convolutional neural network and obtain the preset person identification model.

[0028] The establishment of the original video database includes:

[0029] Use web crawlers to obtain video resources from the internet;

[0030] The video resources are added to the original video database.

[0031] The method of using web crawlers to obtain video resources from the internet includes:

[0032] Use the request library of the web crawler to obtain web page data;

[0033] The webpage data is regularized to obtain regularized webpage data;

[0034] The XPath path language is used to traverse the nodes of the regularized web page data to obtain data nodes;

[0035] The video resources in the regularized web page data are extracted using the Beautiful Soup parsing library based on the data nodes.

[0036] This application also provides a video character retrieval device, including:

[0037] The target video acquisition module is used to establish an original video database and acquire target videos from the original video database;

[0038] The character segment extraction module is used to extract character segments from the target video;

[0039] The character audio extraction module is used to extract the character audio from the character segment;

[0040] The semantic parsing module is used to perform semantic parsing on the audio of the person to obtain audio text;

[0041] The text combining module is used to combine the audio text and the video text in the target video to obtain combined text;

[0042] The person identification module is used to extract the person's identity from the combined text using a preset person identification model, wherein the preset person identification model is obtained by training a convolutional neural network.

[0043] This application also provides a computer device, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the video person retrieval method described in any of the above claims.

[0044] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the video person retrieval method described in any of the above claims.

[0045] This application's video person retrieval method includes: establishing an original video database; obtaining target videos from the original video database; extracting person segments from the target videos; extracting audio from the person segments; performing semantic analysis on the audio to obtain audio text; combining the audio text and the video text from the target videos to obtain combined text; and using a preset person identification model to extract the person's identity from the combined text. The preset person identification model is obtained by training a convolutional neural network. The combined text, which combines audio and video text, includes the person's name and relationships from the target videos. The trained preset person identification model can automatically extract the person's identity from the combined text, resulting in high retrieval accuracy and efficiency.

Attached Figure Description

[0046] Figure 1 is a flowchart illustrating a video person retrieval method according to one embodiment;

[0047] Figure 2 is a schematic diagram illustrating the process of obtaining a target video from an original video database according to one embodiment;

[0048] Figure 3 is a schematic diagram illustrating the process of extracting human segments from a target video according to one embodiment;

[0049] Figure 4 is a schematic diagram illustrating the process of semantic parsing of a person's audio according to one embodiment;

[0050] Figure 5 is a schematic diagram illustrating the process of training a convolutional neural network to obtain a preset person identification model according to one embodiment;

[0051] Figure 6 is a schematic block diagram of the structure of a video person retrieval device according to one embodiment;

[0052] Figure 7 is a schematic block diagram of the structure of a computer device according to one embodiment.

[0053] The realization of the purpose, functional features and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed Implementation

[0054] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0055] Those skilled in the art will understand that, unless explicitly stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in the specification of this application means the presence of the stated features, integers, steps, operations, elements, units, cells, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, units, cells, components, and/or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless couplings. The term “and/or” as used herein includes all or any one of, and all combinations of, one or more of the associated listed items.

[0056] Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.

[0057] In one embodiment, referring to Figure 1, which is a flowchart illustrating the video person retrieval method of this application, the method includes the following steps S1 to S6:

[0058] S1: Establish an original video database and obtain the target video from the original video database.

[0059] Establishing the original video database includes:

[0060] Use web crawlers to obtain video resources from the internet;

[0061] The video resources are added to the original video database.

[0062] A web crawler, also known as a web spider or web robot, simulates a client to send web requests and retrieve response data. In other words, a web crawler is a program or script that automatically retrieves information from the World Wide Web according to certain rules.

[0063] Video resources can be downloaded to the original video database, or the links to the video resources can be saved to the original video database, depending on the actual situation, and no limitation is made here. This application embodiment takes downloading video resources to the original video database as an example.

[0064] The method of using web crawlers to obtain video resources from the internet includes:

[0065] Use the request library of the web crawler to obtain web page data;

[0066] The webpage data is regularized to obtain regularized webpage data;

[0067] The XPath path language is used to traverse the nodes of the regularized web page data to obtain data nodes;

[0068] The video resources in the regularized web page data are extracted using the Beautiful Soup parsing library based on the data nodes.

[0069] The request is sent to the web page using commands from the request library. If the web page contains content matching the request, that content, which here is a video, is returned.

[0070] Web page data includes videos, images, and strings. Regularizing the web page data, that is, filtering it with a regular expression, yields the regularized web page data: the valid data that matches the regular expression.

[0071] The XPath language is used to traverse the regularized web page data, extract the tags and attributes of the regularized web page data, and filter out the data nodes whose tags and attributes are video resources.

[0072] The Beautiful Soup parsing library is used to download the video resources corresponding to data nodes with tags and attributes of video resources to the original video database.
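The extraction steps in [0065] to [0072] can be sketched as follows. This is a minimal sketch under stated assumptions, not the application's crawler: it uses only the Python standard library, with `html.parser` standing in for both the XPath traversal and the Beautiful Soup parsing, and it works on a hard-coded page instead of sending a request. The sample page, the `VIDEO_URL` pattern, and the names `VideoNodeParser` and `extract_video_urls` are hypothetical.

```python
import re
from html.parser import HTMLParser

# Assumed regular expression: keep only absolute URLs with a video extension.
VIDEO_URL = re.compile(r"^https?://\S+\.(?:mp4|webm|m3u8)$", re.IGNORECASE)


class VideoNodeParser(HTMLParser):
    """Walk the page nodes and collect the src attribute of <video>/<source> tags."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        if tag in ("video", "source"):
            src = dict(attrs).get("src")
            if src:
                self.candidates.append(src)


def extract_video_urls(html):
    """Return the video URLs found in data nodes that also match VIDEO_URL."""
    parser = VideoNodeParser()
    parser.feed(html)
    return [url for url in parser.candidates if VIDEO_URL.match(url)]


# Hypothetical page content; a real crawler would fetch this over HTTP.
SAMPLE_PAGE = """
<html><body>
<img src="https://example.com/poster.jpg">
<video src="https://example.com/a.mp4"></video>
<video><source src="https://example.com/b.webm"><source src="not a url"></video>
</body></html>
"""

urls = extract_video_urls(SAMPLE_PAGE)
print(urls)
```

In this sketch the image node is skipped because of its tag, and the malformed `src` is dropped by the regular expression, so only the two video URLs remain.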

[0073] The original video database includes both target and non-target videos. The amount of video in the original video database is large enough that the identities of the people in the videos can be retrieved from the original video database.

[0074] The search can begin with an initial search of the original video database based on the video type of the person being searched for, yielding the first set of search results. Then, a second search can be performed within these first results based on the video's release date to retrieve the target video.

[0075] There can be one or more target videos.

[0076] For example, if the person to be searched is Liu Moumou, and Liu Moumou has acted in many period dramas, the first search would be to search for period dramas in the original video database, obtaining all period dramas. The second search would be to search for period dramas from the last two years among all period dramas, obtaining multiple target videos.
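The two-stage search in this example can be sketched as two successive filters. This is a minimal illustration; the `Video` record, its field names, the sample titles, and the cutoff date are all hypothetical and are not taken from the application.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Video:
    title: str
    genre: str
    released: date


def two_stage_search(database, genre, released_since):
    """First filter by video type, then filter those results by release date."""
    first_results = [v for v in database if v.genre == genre]
    return [v for v in first_results if v.released >= released_since]


# Hypothetical database contents.
database = [
    Video("Drama A", "period drama", date(2021, 5, 1)),
    Video("Drama B", "period drama", date(2015, 3, 9)),
    Video("Show C", "variety show", date(2021, 8, 20)),
    Video("Drama D", "period drama", date(2022, 1, 15)),
]

targets = two_stage_search(database, "period drama", date(2020, 8, 9))
print([v.title for v in targets])  # ['Drama A', 'Drama D']
```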

[0077] S2: Extract the human segments from the target video.

[0078] Extract key frames from the target video, and input the key frames into the feature extraction layer of the preset person identification model for sequential convolution, activation and pooling to obtain a feature map;

[0079] The feature map is input into the region candidate layer of the preset person identity recognition model for region segmentation, and the region of interest is extracted.

[0080] The region of interest is input into the interest region pooling layer of the preset person identity recognition model for pooling to obtain pooled features;

[0081] The pooled features are input into the classification and regression layers of the preset person identification model for binary classification to extract the person fragments.

[0082] The preset person identification model can quickly divide the target video into person segments and non-person segments, and the person segments can be used to extract the person's audio.

[0083] S3: Extract the audio of the characters from the character segments.

[0084] Character clips include character videos and character audio, with the audio including dialogue, monologues, and background sounds.

[0085] S4: Perform semantic analysis on the audio of the person to obtain the audio text.

[0086] The audio of the person is input into a preset speech recognition model, and the keywords and triple relationships of the audio are extracted through the preset speech recognition model to obtain the audio text.

[0087] The preset speech recognition model is trained on a convolutional neural network using real audio of people. By extracting keywords from the audio, the name of the person can be obtained. The triple relationship includes the subject, object, and predicate of the sentence. The relationship between people can be determined through the triple relationship.

[0088] The audio text includes the names of the characters and the relationships between them.

[0089] S5: Combine the audio text and the video text in the target video to obtain combined text.

[0090] The combined text is obtained by combining the audio text with the video title and video summary of the target video.

[0091] Combining the video title and video summary can help to better identify the person in the target video.

[0092] Video titles and summaries are shorter than audio text, but they contain important guidance. Based on video titles and summaries, the range of identities of individuals can be quickly narrowed down, reducing the workload of searching for identities from audio text.
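A minimal sketch of the text combination step follows. The application does not specify how the texts are joined, so the field order (title, then summary, then audio text), the newline separator, and the function name `combine_text` are assumptions made for illustration.

```python
def combine_text(title, summary, audio_text, sep="\n"):
    """Join the video title, video summary, and audio text into one string.

    Empty or whitespace-only fields are skipped so that the combined text
    does not contain blank lines.
    """
    parts = [title, summary, audio_text]
    return sep.join(p.strip() for p in parts if p and p.strip())


# Hypothetical inputs.
combined = combine_text(
    "Period Drama, Episode 3",
    "The general returns to the capital.",
    "Person A is the father of Person B.",
)
print(combined)
```

Placing the short title and summary first reflects the guidance role described above; a different order would work equally well with this function.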

[0093] S6: Use a preset person identification model to extract the person's identity from the combined text, wherein the preset person identification model is obtained by training a convolutional neural network.

[0094] The preset person identification model is obtained by training a convolutional neural network, including:

[0095] Obtain the training text set;

[0096] The training text set is input into the convolutional neural network, and the loss function value is calculated.

[0097] The loss function value is backpropagated to update the network parameters of the convolutional neural network.

[0098] Determine whether the network parameters are less than or equal to the parameter threshold. If so, stop training the convolutional neural network and obtain the preset person identification model.

[0099] The preset person identification model is trained by combining real text and corresponding person identities. Before training, the corresponding person identities are manually labeled based on the real text.

[0100] By inputting the combined text into the preset person identification model, the person's identity can be obtained.

[0101] The video person retrieval method of this application includes establishing an original video database and obtaining target videos from the original video database; extracting person segments from the target videos; extracting person audio from the person segments; performing semantic analysis on the person audio to obtain audio text; combining the audio text and the video text from the target video to obtain combined text; and using a preset person identification model to extract the person's identity from the combined text, wherein the preset person identification model is obtained by training a convolutional neural network. The combined text, which combines audio text and video text, includes the person's name and relationship in the target video. The trained preset person identification model can automatically extract the person's identity from the combined text, resulting in high retrieval accuracy and efficiency.

[0102] Referring to Figure 2, in one embodiment, step S1, which involves establishing an original video database and obtaining the target video from the original video database, includes the following steps S11-S13:

[0103] S11: Use web crawlers to obtain video resources from the web.

[0104] Use a Python-based web crawler to send requests to various video websites to obtain video resources. Once the requests are answered, retrieve the video resources from the various video websites.

[0105] A web crawler, also known as a web spider or web robot, simulates a client to send web requests and retrieve response data. In other words, a web crawler is a program or script that automatically retrieves information from the World Wide Web according to certain rules.

[0106] S12: Add the video resources to the original video database.

[0107] Video resources can be downloaded to the original video database, or a link to the video resources can be added to the original video database. Other methods can also be used to add video resources to the original video database, depending on the actual situation. No restrictions are imposed here.

[0108] S13: Obtain the target video from the original video database.

[0109] The original video database includes both target and non-target videos. The amount of video in the original video database is large enough that the identities of the people in the videos can be retrieved from the original video database.

[0110] The search can begin with an initial search of the original video database based on the video type of the person being searched for, yielding the first set of search results. Then, a second search can be performed within these first results based on the video's release date to retrieve the target video.

[0111] There can be one or more target videos.

[0112] For example, if the person to be searched is Liu Moumou, and Liu Moumou has acted in many period dramas, the first search would be to search for period dramas in the original video database, obtaining all period dramas. The second search would be to search for period dramas from the last two years among all period dramas, obtaining multiple target videos.

[0113] Web crawlers can automatically retrieve video resources from the internet and add them to the original video database, saving manpower and improving the efficiency of acquiring video resources.

[0114] As described above, establishing the original video database involves using web crawlers to retrieve video resources from the internet and adding these resources to the original video database. Target videos are then retrieved from the original video database. Web crawlers can automatically retrieve video resources from the internet and add them to the original video database, saving manpower and improving the efficiency of video resource retrieval. The original video database includes both target and non-target videos, and its large volume allows for the retrieval of the identities of the people in the videos.

[0115] Referring to Figure 3, in one embodiment, step S2, which extracts human segments from the target video, includes the following steps S21-S24:

[0116] S21: Extract the key frames of the target video, and input the key frames into the feature extraction layer of the preset person identity recognition model to perform convolution, activation and pooling in sequence to obtain a feature map.

[0117] The preset person identification model includes a feature extraction layer, a region candidate layer, a region of interest pooling layer, and a classification and regression layer.

[0118] Keyframes reflect the main content of a video, and their extraction is fundamental to video analysis and retrieval. Keyframe extraction methods include extracting keyframes based on shot boundaries, extracting keyframes based on motion analysis, extracting keyframes based on image information, and extracting keyframes based on Euclidean distance.
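Of the methods listed, extraction based on Euclidean distance is the simplest to illustrate. The sketch below assumes each frame is already flattened into a list of pixel intensities and keeps a frame as a keyframe when its distance from the most recent keyframe exceeds a threshold. The threshold value, the toy frames, and the function names are hypothetical; the application does not fix any of them.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally sized frame vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_keyframes(frames, threshold):
    """Return indices of frames that differ enough from the last keyframe."""
    if not frames:
        return []
    keys = [0]  # the first frame is always kept
    for i in range(1, len(frames)):
        if euclidean(frames[i], frames[keys[-1]]) > threshold:
            keys.append(i)
    return keys


# Toy 4-pixel frames: two near-identical pairs followed by a scene change back.
frames = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [5, 5, 5, 5],
    [5, 5, 5, 6],
    [0, 0, 0, 0],
]
keyframes = select_keyframes(frames, threshold=3.0)
print(keyframes)  # [0, 2, 4]
```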

[0119] By setting a convolution kernel to perform convolution on keyframes, the features of the keyframes can be extracted. Different convolution kernels will extract different features.

[0120] Activating the features extracted by convolution using an activation function can enhance the non-linearity of the features. The activation function can be the ReLU activation function, or other activation functions.

[0121] Pooling is performed on the features activated by the activation function to obtain a feature map. Pooling can reduce the number of redundant features and improve computational efficiency.
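The convolution, activation, and pooling sequence of step S21 can be sketched in plain Python on a toy keyframe. This is an illustration only: the 5x5 image, the 2x2 edge-detecting kernel, and the 2x2 max pooling window are assumptions, and the convolution is written as the sliding dot product (cross-correlation) commonly used in deep learning libraries.

```python
def conv2d(image, kernel):
    """Valid sliding-window convolution (cross-correlation form, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(feature_map):
    """ReLU activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in feature_map]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling with a size x size window."""
    return [
        [
            max(feature_map[i + a][j + b]
                for a in range(size) for b in range(size))
            for j in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for i in range(0, len(feature_map) - size + 1, size)
    ]


# Toy keyframe with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1] for _ in range(5)]
# Kernel that responds to a dark-to-bright transition from left to right.
kernel = [[-1, 1], [-1, 1]]

feature_map = max_pool(relu(conv2d(image, kernel)))
print(feature_map)
```

The 4x4 convolution output is reduced to a 2x2 feature map, and the edge response survives in the left column, which illustrates how pooling discards redundant values while keeping the strongest feature.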

[0122] S22: Input the feature map into the region candidate layer of the preset person identity recognition model to divide the region and extract the region of interest.

[0123] The feature map is input into the region candidate layer for binary classification to obtain positive and negative sample regions. Bounding box regression is then used to refine the positive and negative sample regions to obtain the region of interest.

[0124] S23: Input the region of interest into the interest region pooling layer of the preset person identity recognition model for pooling to obtain pooled features.

[0125] The pooling method can be max pooling or average pooling, or other pooling methods, depending on the actual situation. No restrictions are imposed here.
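A minimal sketch of region of interest pooling with max pooling follows. It assumes the region is given as half-open `(top, left, bottom, right)` bounds on the feature map and that the output is a fixed 2x2 grid; both choices, along with the function name `roi_max_pool` and the toy feature map, are illustrative assumptions.

```python
def roi_max_pool(feature_map, roi, out_size=2):
    """Max-pool a rectangular region into a fixed out_size x out_size grid.

    roi is (top, left, bottom, right) with bottom and right exclusive.
    Regions of different sizes all produce the same output shape.
    """
    top, left, bottom, right = roi
    height, width = bottom - top, right - left
    pooled = []
    for i in range(out_size):
        r0 = top + (i * height) // out_size
        r1 = max(top + ((i + 1) * height) // out_size, r0 + 1)
        row = []
        for j in range(out_size):
            c0 = left + (j * width) // out_size
            c1 = max(left + ((j + 1) * width) // out_size, c0 + 1)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1) for c in range(c0, c1)))
        pooled.append(row)
    return pooled


# Toy 4x6 feature map holding the values 1..24 row by row.
feature_map = [[r * 6 + c + 1 for c in range(6)] for r in range(4)]

square_roi = roi_max_pool(feature_map, (0, 1, 4, 5))   # 4x4 region
uneven_roi = roi_max_pool(feature_map, (0, 0, 3, 5))   # 3x5 region
print(square_roi)  # [[9, 11], [21, 23]]
print(uneven_roi)  # [[2, 5], [14, 17]]
```

Both regions yield a 2x2 result, which is what allows the following classification and regression layer to accept regions of any size.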

[0126] S24: Input the pooled features into the classification and regression layers of the preset person identification model for binary classification and extract the person fragments.

[0127] The classification function used for binary classification can be the softmax function, the sigmoid function, or other classification functions, depending on the actual situation. No restrictions are imposed here.
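The two classification functions named above can be compared in a short sketch. For a two-class problem, a softmax over the logits `[z, 0]` gives the same probability as the sigmoid of `z`, so either can serve for the person / non-person decision. The scores, the 0.5 threshold, and the helper name `is_person` are hypothetical.

```python
import math


def sigmoid(z):
    """Map a single score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def softmax(logits):
    """Map a list of scores to probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def is_person(score, threshold=0.5):
    """Binary decision: True for a person segment, False otherwise."""
    return sigmoid(score) >= threshold


# Hypothetical classifier scores for three candidate segments.
scores = [2.0, -1.5, 0.3]
labels = [is_person(s) for s in scores]
print(labels)  # [True, False, True]
```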

[0128] Convolution can extract features from keyframes, region segmentation can find regions of interest from multiple regions, pooling can reduce the number of redundant features, binary classification can obtain labels for human segments and non-human segments, and the corresponding human segments can be extracted based on the human segment labels.

[0129] This embodiment of the application extracts human segments from a target video, including binary classification of the target video using the preset human identification model to obtain human segments. The preset human identification model includes a feature extraction layer, a region candidate layer, a region of interest pooling layer, and a classification and regression layer. Convolution can extract features from keyframes, region segmentation can find regions of interest from multiple regions, pooling can reduce the number of redundant features, binary classification obtains human segment labels and non-human segment labels, and the corresponding human segments can be extracted based on the human segment labels.

[0130] Referring to Figure 4, in one embodiment, step S4, which involves semantic parsing of the audio of the person to obtain audio text, includes the following steps S41-S42:

[0131] S41: Convert the character's audio into a character audio vector.

[0132] Each character's audio is split into multiple character audio vectors, with each audio vector representing a single sentence.
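The application does not say how sentence boundaries are found. One common approach, shown here purely as an assumption, is to split the sample sequence wherever a sufficiently long run of near-silent samples occurs. The amplitude threshold, the minimum gap length, the toy samples, and the name `split_on_silence` are all hypothetical.

```python
def split_on_silence(samples, threshold=0.05, min_gap=3):
    """Split audio samples into segments separated by long quiet runs.

    A sample is quiet when its absolute amplitude is below threshold.
    A run of at least min_gap quiet samples ends the current segment.
    Quiet samples are not copied into the output segments.
    """
    segments, current, quiet_run = [], [], 0
    for sample in samples:
        if abs(sample) < threshold:
            quiet_run += 1
            if quiet_run >= min_gap and current:
                segments.append(current)
                current = []
        else:
            quiet_run = 0
            current.append(sample)
    if current:
        segments.append(current)
    return segments


# Toy signal: a short pause inside the first sentence, a long pause after it.
samples = [0.0, 0.4, 0.5, 0.0, 0.3, 0.0, 0.0, 0.0, 0.6, 0.7, 0.0]
audio_vectors = split_on_silence(samples)
print(audio_vectors)  # [[0.4, 0.5, 0.3], [0.6, 0.7]]
```

Each resulting list plays the role of one character audio vector, that is, one sentence; a real system would work on much longer sample arrays and would typically convert them to spectral features first.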

[0133] S42: Input the audio vector of the person into a preset speech recognition model, and extract the keywords and triplet relationships of the audio vector of the person through the preset speech recognition model to obtain the audio text.

[0134] The preset speech recognition model is trained on a convolutional neural network using real audio of people. By extracting keywords from the audio, the name of the person can be obtained. The triple relationship includes the subject, object, and predicate of the sentence. The relationship between people can be determined through the triple relationship.

[0135] The audio text includes the names of the characters and the relationships between them.
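The structure of the audio text can be illustrated with a small data model. The sketch assumes the keywords and triples have already been produced by the speech recognition model and only shows how names and relationships might be assembled from them. The `Triple` type, the output dictionary layout, and the sample values are hypothetical.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


def build_audio_text(keywords, triples):
    """Assemble audio text from extracted keywords and triples.

    Names come from the keywords; relationships come from the triples.
    """
    names = sorted(set(keywords))
    relations = [f"{t.subject} {t.predicate} {t.obj}" for t in triples]
    return {"names": names, "relations": relations}


# Hypothetical model output for one sentence.
keywords = ["Person B", "Person A", "Person A"]
triples = [Triple("Person A", "is the father of", "Person B")]

audio_text = build_audio_text(keywords, triples)
print(audio_text)
```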

[0136] This application embodiment performs semantic parsing on audio of individuals to obtain audio text, including converting the audio into audio vectors. The audio vectors are then input into a preset speech recognition model, which extracts keywords and triplet relationships from the audio vectors to obtain the audio text. The preset speech recognition model is trained on a convolutional neural network using real audio of individuals. By extracting keywords from the audio, the individual's name can be obtained. The triplet relationships include the subject, object, and predicate of the statement, and the relationships between individuals can be determined through these triplet relationships.

[0137] In one embodiment, the combination of the audio text and the video text in the target video to obtain combined text includes:

[0138] The combined text is obtained by combining the audio text with the video title and video summary of the target video.

[0139] Combining the video title and video summary can help to better identify the person in the target video.

[0140] Video titles and summaries are shorter than audio text, but they contain important guidance. Based on video titles and summaries, the range of identities of individuals can be quickly narrowed down, reducing the workload of searching for identities from audio text.

[0141] This application embodiment combines the audio text and the video text in the target video to obtain combined text. This includes combining the audio text with the video title and video summary of the target video. The video title and video summary are shorter than the audio text, but contain important guiding content. Based on the video title and video summary, the range of possible identities can be quickly narrowed down, reducing the workload of retrieving identities from the audio text.

[0142] Referring to Figure 5, in one embodiment, the preset person identification model is obtained by training a convolutional neural network, including the following steps S51'-S54', which are performed before step S6.

[0143] S51': Obtain the training text set.

[0144] Real training text sets can be obtained from real training text databases, simulated training text sets can be obtained from simulated training text databases, or other methods of obtaining training text sets can be used, depending on the actual situation. No restrictions are imposed here.

[0145] S52': Input the training text set into the convolutional neural network and calculate the loss function value.

[0146] The loss function can be the cross-entropy loss function, the mean squared error loss function, or other loss functions, depending on the specific circumstances. No restrictions are imposed here.

[0147] The loss function value represents the error between the output result and the expected result. The smaller the loss function value, the smaller the error between the output result and the expected result, and the higher the accuracy of the convolutional neural network in identifying people.
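For illustration only, the two loss functions named above may be written in Python as follows for a single sample. The function names and the example probability vectors are assumptions of this sketch; the embodiment does not limit how the loss is computed.

```python
import math


def cross_entropy(probs, target_index):
    """Cross-entropy for one sample: minus the log of the true-class probability."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(probs[target_index], eps))


def mean_squared_error(outputs, expected):
    """Mean of the squared differences between output and expected values."""
    return sum((o - e) ** 2 for o, e in zip(outputs, expected)) / len(outputs)


# Hypothetical outputs for a sample whose expected identity is class 1
good = [0.05, 0.90, 0.05]  # close to the expected result
bad = [0.40, 0.30, 0.30]   # far from the expected result

print(cross_entropy(good, 1), cross_entropy(bad, 1))
print(mean_squared_error(good, [0.0, 1.0, 0.0]), mean_squared_error(bad, [0.0, 1.0, 0.0]))
```

In both cases the output closer to the expected result yields the smaller loss value, consistent with the description above.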

[0148] S53': Backpropagate the loss function value to update the network parameters of the convolutional neural network.

[0149] Each time the loss function value is calculated, it is backpropagated to calculate the update amount of each network parameter, and then each network parameter is updated according to the update amount.

[0150] S54': Determine whether the network parameters are less than or equal to the parameter threshold. If so, stop training the convolutional neural network and obtain the preset person identification model.

[0151] A parameter threshold is set for each network parameter, and it is determined whether each network parameter is less than or equal to its corresponding parameter threshold. If so, the convolutional neural network has met the standard; training of the convolutional neural network stops, and the preset person identification model is obtained.

[0152] If any network parameter is greater than the corresponding parameter threshold, it means that the convolutional neural network does not yet meet the standard and needs to continue training until all network parameters are less than or equal to the corresponding parameter threshold. Then, stop training the convolutional neural network and obtain the preset person identification model.

[0153] Training a convolutional neural network (CNN) can continuously optimize its network parameters, making the output result closer and closer to the expected result, until all network parameters are less than or equal to the corresponding parameter thresholds. This indicates that the CNN meets the standard, and the trained CNN can be used as the preset person identification model.

[0154] The preset person identification model in this embodiment is obtained by training a convolutional neural network (CNN). The process includes acquiring a training text set, inputting the training text set into the CNN, and calculating a loss function value. The loss function value is then backpropagated to update the CNN's network parameters. It is determined whether the network parameters are less than or equal to a parameter threshold; if so, training the CNN stops, and the preset person identification model is obtained. Training the CNN continuously optimizes its network parameters, making the output increasingly closer to the expected result. This continues until all network parameters are less than or equal to their corresponding parameter thresholds, indicating that the CNN meets the standard. The trained CNN is then used as the preset person identification model.
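The control flow of steps S51'-S54' may be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: a single-layer logistic model stands in for the convolutional neural network so that the sketch runs with the Python standard library only, and the toy samples, the learning rate `lr` and the safety cap `max_steps` are hypothetical. The stop test in S54' follows the text literally (every parameter less than or equal to its threshold).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, params, thresholds, lr=0.5, max_steps=1000):
    """samples: list of (features, label). Returns (params, loss history)."""
    history = []
    n = len(samples)
    for _ in range(max_steps):  # max_steps is an assumed safety cap
        grads = [0.0] * len(params)
        loss = 0.0
        for x, y in samples:
            # S52': forward pass and cross-entropy loss
            p = sigmoid(sum(w * xi for w, xi in zip(params, x)))
            p = min(max(p, 1e-12), 1 - 1e-12)
            loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            # S53': backpropagation; d(loss)/d(w_i) = (p - y) * x_i
            for i, xi in enumerate(x):
                grads[i] += (p - y) * xi
        history.append(loss / n)
        # S53': update each parameter by its update amount
        params = [w - lr * g / n for w, g in zip(params, grads)]
        # S54': stop once every parameter is <= its threshold
        if all(w <= t for w, t in zip(params, thresholds)):
            break
    return params, history


# S51': a toy training set (hypothetical): both samples carry label 0
samples = [([1.0, 0.0], 0), ([0.0, 1.0], 0)]
final_params, history = train(samples, params=[3.0, 2.0], thresholds=[0.0, 0.0])
print(final_params, history[0], history[-1])
```

In this toy setting the loss value falls as the parameters are updated, and training stops at the first step where all parameters satisfy the threshold test.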

[0155] Referring to Figure 6, which is a schematic block diagram of a video person retrieval device proposed in this application, the device includes:

[0156] Target video acquisition module 10 is used to establish an original video database and acquire target videos from the original video database;

[0157] Establishing the original video database includes:

[0158] Use web crawlers to obtain video resources from the internet;

[0159] The video resources are added to the original video database.

[0160] A web crawler, also known as a web spider or web robot, simulates a client to send web requests and retrieve response data. In other words, a web crawler is a program or script that automatically retrieves information from the World Wide Web according to certain rules.

[0161] Video resources can be downloaded to the original video database, or the links to the video resources can be saved to the original video database, depending on the actual situation, and no limitation is made here. This application embodiment takes downloading video resources to the original video database as an example.

[0162] The method of using web crawlers to obtain video resources from the internet includes:

[0163] Use the request library of the web crawler to obtain web page data;

[0164] The webpage data is regularized to obtain regularized webpage data;

[0165] The XPath path language is used to traverse the nodes of the regularized web page data to obtain data nodes;

[0166] The video resources in the regularized web page data are extracted using the Beautiful Soup parsing library based on the data nodes.

[0167] A request is sent to the web page using the commands of the request library. If the web page contains content corresponding to the request, the requested content, which here is a video, is returned.

[0168] Web page data includes videos, images, and strings. By regularizing the web page data, that is, by matching it against a regular expression, regularized web page data is obtained, which is the valid data that conforms to the regular expression.

[0169] The XPath language is used to traverse the regularized web page data, extract the tags and attributes of the regularized web page data, and filter out the data nodes whose tags and attributes are video resources.

[0170] The Beautiful Soup parsing library is used to download the video resources corresponding to data nodes with tags and attributes of video resources to the original video database.
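The regularization and node-traversal steps may be illustrated by the sketch below. It is an assumption-laden stand-in rather than the crawler of this embodiment: to stay self-contained it uses only the Python standard library (`re` for the regular expression step, and `xml.etree.ElementTree`, which supports a limited subset of XPath, in place of a full XPath engine and the Beautiful Soup parsing library), and it runs on an inline sample page instead of data fetched with a request library. The page content and the `example.com` URLs are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample page standing in for fetched web page data
PAGE = """
<html><body>
  <img src="poster.jpg"/>
  <video id="v1"><source src="https://example.com/a.mp4" type="video/mp4"/></video>
  <p>some text</p>
  <video id="v2" src="https://example.com/b.mp4"></video>
</body></html>
"""

# Regular expression that keeps only <video>...</video> blocks
VIDEO_BLOCK = re.compile(r"<video\b.*?</video>", re.DOTALL)


def extract_video_urls(page):
    # Regularization step: discard images, strings and other non-video data
    blocks = VIDEO_BLOCK.findall(page)
    # Wrap the kept blocks in one root element so they parse as a single tree
    root = ET.fromstring("<root>" + "".join(blocks) + "</root>")
    # Traversal step: the XPath-style query selects every node with a src attribute
    return [node.get("src") for node in root.findall(".//*[@src]")]


print(extract_video_urls(PAGE))
```

The returned links could then either be saved to the original video database or used to download the video resources, as described above.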

[0171] The original video database includes both target and non-target videos. The amount of video in the original video database is large enough that the identities of the people in the videos can be retrieved from the original video database.

[0172] The search can begin with an initial search of the original video database based on the video type of the person being searched for, yielding the first set of search results. Then, a second search can be performed within these first results based on the video's release date to retrieve the target video.

[0173] There can be one or more target videos.

[0174] For example, if the person to be searched is Liu Moumou, and Liu Moumou has acted in many period dramas, the first search would be to search for period dramas in the original video database, obtaining all period dramas. The second search would be to search for period dramas from the last two years among all period dramas, obtaining multiple target videos.
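The two-stage search in this example may be sketched as follows. The record type `VideoRecord`, its field names and the sample entries are hypothetical and serve only to illustrate filtering first by video type and then by release date.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class VideoRecord:
    title: str
    video_type: str
    release_date: date


def find_target_videos(database, video_type, released_on_or_after):
    # First search: keep only videos of the requested type
    first_results = [v for v in database if v.video_type == video_type]
    # Second search: within the first results, keep sufficiently recent videos
    return [v for v in first_results if v.release_date >= released_on_or_after]


# Hypothetical original video database
database = [
    VideoRecord("Drama A", "period drama", date(2021, 5, 1)),
    VideoRecord("Drama B", "period drama", date(2015, 3, 9)),
    VideoRecord("Variety C", "variety show", date(2021, 8, 20)),
]

targets = find_target_videos(database, "period drama", date(2020, 8, 9))
print([v.title for v in targets])
```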

[0175] The character segment extraction module 20 is used to extract character segments from the target video;

[0176] Extract key frames from the target video, and input the key frames into the feature extraction layer of the preset person identification model for sequential convolution, activation and pooling to obtain a feature map;

[0177] The feature map is input into the region candidate layer of the preset person identity recognition model for region segmentation, and the region of interest is extracted.

[0178] The region of interest is input into the interest region pooling layer of the preset person identity recognition model for pooling to obtain pooled features;

[0179] The pooled features are input into the classification and regression layers of the preset person identification model for binary classification to extract the person fragments.

[0180] The preset person identification model can quickly divide the target video into person segments and non-person segments, and the person segments can be used to extract the person's audio.
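The sequence of convolution, activation and pooling performed by the feature extraction layer may be illustrated on a tiny single-channel frame as follows. The kernel values, the ReLU activation, the 2x2 max pooling and the sample frame are assumptions of this sketch; the embodiment does not fix these choices, and the region candidate, region-of-interest pooling and classification layers are not shown.

```python
def conv2d(image, kernel):
    """Valid 2D convolution as used in CNNs (cross-correlation, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(fmap):
    """Activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling over size x size windows."""
    return [
        [
            max(fmap[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, len(fmap[0]) - size + 1, size)
        ]
        for r in range(0, len(fmap) - size + 1, size)
    ]


# Hypothetical 5x5 key frame containing a bright vertical stripe
frame = [[0, 0, 1, 1, 0] for _ in range(5)]
edge_kernel = [[-1, 1], [-1, 1]]  # responds to dark-to-bright vertical edges

feature_map = max_pool(relu(conv2d(frame, edge_kernel)))
print(feature_map)
```

The resulting feature map is smaller than the key frame and retains only the strongest response in each pooled window.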

[0181] The character audio extraction module 30 is used to extract the character audio from the character segment;

[0182] A character segment includes character video and character audio, and the character audio includes dialogue, monologue, and background sound.

[0183] Semantic parsing module 40 is used to perform semantic parsing on the character's audio to obtain audio text;

[0184] The audio of the person is input into a preset speech recognition model, and the keywords and triple relationships of the audio are extracted through the preset speech recognition model to obtain the audio text.

[0185] The preset speech recognition model is obtained by training a convolutional neural network on real audio of people. By extracting keywords from the audio, the names of the people can be obtained. A triple relationship includes the subject, object, and predicate of a sentence, and the relationships between people can be determined through the triple relationships.

[0186] The audio text includes the names of the characters and the relationships between them.
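One possible way to represent the keywords and triple relationships and to assemble them into audio text is sketched below. The `Triple` type, the function `build_audio_text`, the output layout and the sample names are hypothetical; the extraction itself is performed by the preset speech recognition model and is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    subject: str
    predicate: str
    obj: str


def build_audio_text(keywords, triples):
    """Assemble audio text from person names and person-to-person triples."""
    names = set(keywords)
    lines = ["Names: " + ", ".join(sorted(names))]
    for t in triples:
        # Keep only triples whose subject and object are both recognized names,
        # since these express a relationship between people
        if t.subject in names and t.obj in names:
            lines.append(f"{t.subject} {t.predicate} {t.obj}")
    return "\n".join(lines)


# Hypothetical extraction results for one character segment
keywords = ["Liu Moumou", "Zhang Moumou"]
triples = [
    Triple("Liu Moumou", "is the teacher of", "Zhang Moumou"),
    Triple("Liu Moumou", "bought", "a horse"),
]

audio_text = build_audio_text(keywords, triples)
print(audio_text)
```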

[0187] The text combination module 50 is used to combine the audio text and the video text in the target video to obtain combined text;

[0188] The combined text is obtained by combining the audio text with the video title and video summary of the target video.

[0189] Combining the video title and video summary can help to better identify the person in the target video.

[0190] Video titles and summaries are shorter than audio text, but they contain important guidance. Based on video titles and summaries, the range of identities of individuals can be quickly narrowed down, reducing the workload of searching for identities from audio text.

[0191] The person identification module 60 is used to extract the person's identity from the combined text using a preset person identification model, wherein the preset person identification model is obtained by training a convolutional neural network.

[0192] The preset person identification model is obtained by training a convolutional neural network, including:

[0193] Obtain the training text set;

[0194] The training text set is input into the convolutional neural network, and the loss function value is calculated.

[0195] The loss function value is backpropagated to update the network parameters of the convolutional neural network.

[0196] Determine whether the network parameters are less than or equal to the parameter threshold. If so, stop training the convolutional neural network and obtain the preset person identification model.

[0197] The preset person identification model is trained by combining real text and corresponding person identities. Before training, the corresponding person identities are manually labeled based on the real text.

[0198] By inputting the combined text into the preset person identification model, the person's identity can be obtained.

[0199] The video person retrieval device of this application embodiment can implement the video person retrieval method. The video person retrieval method of this application embodiment includes establishing an original video database; obtaining target videos from the original video database; extracting person segments from the target videos; extracting person audio from the person segments; performing semantic analysis on the person audio to obtain audio text; combining the audio text and the video text from the target video to obtain combined text; and using a preset person identification model to extract the person's identity from the combined text, wherein the preset person identification model is obtained by training a convolutional neural network. The combined text, combining audio text and video text, includes the person's name and relationship in the target video. The trained preset person identification model can automatically extract the person's identity from the combined text, resulting in high retrieval accuracy and efficiency.

[0200] Referring to Figure 7, this application also provides a computer device, which may be a server and whose internal structure may be as shown in Figure 7. The computer device includes a processor, memory, a network interface, and a database connected via a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores an operating system, a computer program, and the database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used to store data such as the combined text. The network interface of the computer device is used to communicate with external terminals via a network connection. When the computer program is executed by the processor, it implements a video person retrieval method.

[0201] Those skilled in the art will understand that the structure shown in Figure 7 is merely a block diagram of the portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied.

[0202] One embodiment of this application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements a video person retrieval method. It is understood that the computer-readable storage medium in this embodiment can be a volatile readable storage medium or a non-volatile readable storage medium.

[0203] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any references to memory, storage, databases, or other media used in this application and in the embodiments can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), etc.

[0204] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, apparatus, article, or method. Unless otherwise specified, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, apparatus, article, or method that includes that element.

[0205] The above description is only a preferred embodiment of this application and does not limit the patent scope of this application. Any equivalent structural or procedural changes made based on the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.

Claims

1. A method for retrieving people from videos, characterized in that the method includes: establishing an original video database and obtaining a target video from the original video database; extracting character segments from the target video; extracting character audio from the character segments; performing semantic parsing on the character audio, wherein the character audio is converted into a character audio vector, and keywords and triple relationships of the character audio vector are extracted by a preset speech recognition model to obtain audio text, the keywords being the names of the characters, and the triple relationships including the subject, object and predicate of a sentence and being used to determine the relationships between the characters in the video; combining the audio text and the video text in the target video to obtain combined text, wherein the video text is the video title and video summary of the target video; and extracting the character identities from the combined text using a preset character identification model, wherein the preset character identification model is obtained by training a convolutional neural network on real combined text and corresponding character identities, the corresponding character identities being manually labeled according to the real combined text before training; the combined text, which combines the audio text and the video text, includes the names of the characters in the target video and their relationships, and the trained preset character identification model can automatically extract the character identities from the combined text;
wherein extracting the character segments from the target video includes: extracting key frames from the target video, and inputting the key frames into the feature extraction layer of the preset character identification model for sequential convolution, activation and pooling to obtain a feature map; inputting the feature map into the region candidate layer of the preset character identification model for region segmentation to extract a region of interest; inputting the region of interest into the region-of-interest pooling layer of the preset character identification model for pooling to obtain pooled features; and inputting the pooled features into the classification and regression layers of the preset character identification model for binary classification to extract the character segments.

2. The video person retrieval method according to claim 1, characterized in that the preset person identification model being obtained by training a convolutional neural network includes: obtaining a training text set; inputting the training text set into the convolutional neural network and calculating a loss function value; backpropagating the loss function value to update the network parameters of the convolutional neural network; and determining whether the network parameters are less than or equal to the parameter threshold, and if so, stopping training the convolutional neural network and obtaining the preset person identification model.

3. The video person retrieval method according to claim 1, characterized in that establishing the original video database includes: using a web crawler to obtain video resources from the internet; and adding the video resources to the original video database.

4. The video person retrieval method according to claim 3, characterized in that using the web crawler to obtain video resources from the internet includes: using the request library of the web crawler to obtain web page data; regularizing the web page data to obtain regularized web page data; using the XPath path language to traverse the nodes of the regularized web page data to obtain data nodes; and extracting the video resources in the regularized web page data using the Beautiful Soup parsing library based on the data nodes.

5. A video person retrieval device, characterized in that it includes: a target video acquisition module, used to establish an original video database and acquire a target video from the original video database; a character segment extraction module, used to extract character segments from the target video; a character audio extraction module, used to extract character audio from the character segments; a semantic parsing module, used to perform semantic parsing on the character audio, convert the character audio into a character audio vector, and extract the keywords and triple relationships of the character audio vector through a preset speech recognition model to obtain audio text, wherein the keywords are the names of the people, and the triple relationships include the subject, object and predicate of a sentence and are used to determine the relationships between the people in the video; a text combination module, used to combine the audio text and the video text in the target video to obtain combined text, wherein the video text is the video title and video summary of the target video; and a person identification module, used to extract the person identity from the combined text using a preset person identification model, wherein the preset person identification model is obtained by training a convolutional neural network on real combined text and corresponding person identities, the corresponding person identities being manually labeled according to the real combined text before training; the combined text, which combines the audio text and the video text, includes the names of the people in the target video and their relationships, and the trained preset person identification model can automatically extract the person identity from the combined text;
wherein the character segment extraction module extracting character segments from the target video includes: extracting key frames from the target video, and inputting the key frames into the feature extraction layer of the preset person identification model for sequential convolution, activation and pooling to obtain a feature map; inputting the feature map into the region candidate layer of the preset person identification model for region segmentation to extract a region of interest; inputting the region of interest into the region-of-interest pooling layer of the preset person identification model for pooling to obtain pooled features; and inputting the pooled features into the classification and regression layers of the preset person identification model for binary classification to extract the character segments.

6. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the video person retrieval method according to any one of claims 1 to 4.

7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the video person retrieval method according to any one of claims 1 to 4.