A method for matching images and text, an electronic device, and a storage medium.
By adding phrases to the word segmenter and using different embedding modules of the CLIP model to process single characters and phrases, text semantic vectors are generated, solving the problem of low accuracy in image-text matching and improving the accuracy of image-text matching and user experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HONOR DEVICE CO LTD
- Filing Date
- 2024-03-07
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, image-text matching tasks often result in images that are irrelevant to the user's search text, leading to low matching accuracy and impacting user experience.
By expanding the vocabulary of the lexicon to include single characters and phrases, different embedding modules of the pre-trained CLIP model are used to process single characters and phrases separately, encode feature vectors, and merge them into text semantic vectors, thereby improving the accuracy of image-text matching.
It effectively improves the accuracy of image and text matching, enhances the user experience, and ensures that the matching results better match the user's search intent.
[Figure: abstract drawing, CN120656181B]
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a text-image matching method, electronic device, and storage medium.
Background Technology
[0002] With the rapid development of information technology and the network society, users have developed a demand for image-text matching. For example, the gallery applications on electronic devices such as mobile phones and tablets may contain multiple images. When a user wants to find an image, they can enter search text in the gallery application's search box, and the electronic device will display images that match the search text. Alternatively, when a user needs to insert an image into an article, they can use the text content of the article to search for images that match the text content in a browser.
[0003] However, in image-text matching tasks, the images found by the electronic device are often irrelevant to the search text entered by the user, which lowers the accuracy of image-text matching and degrades the user experience.
Summary of the Invention
[0004] To address the aforementioned issues, this application provides a text-image matching method, an electronic device, and a storage medium, with the aim of improving the accuracy of text-image matching and effectively enhancing the user experience.
[0005] In a first aspect, this application provides a method for matching text and images. Exemplarily, this method can be applied to electronic devices, which can be terminals such as mobile phones, tablets, and laptops, or servers such as cloud servers and independent physical servers.
[0006] In this method, the electronic device first acquires text. For example, the text can be text input by the user, text sent by other electronic devices communicating with the electronic device, or pre-stored text. The electronic device then performs word segmentation on the text to obtain word segmentation results including single characters and phrases. A phrase includes at least two single characters. For example, taking the text "water cup on the desktop" as an example, the word segmentation result includes the phrase "water cup," and the remaining text can be segmented into single characters. The pre-trained text encoder includes a first embedding module and a second embedding module, which are trained separately at different training stages. The electronic device can determine the feature vector corresponding to the single character in the word segmentation result through the first embedding module and the feature vector corresponding to the phrase in the word segmentation result through the second embedding module. Subsequently, the electronic device encodes the feature vectors corresponding to the single characters and the feature vectors corresponding to the phrases to determine the text semantic vector corresponding to the acquired text. Finally, the electronic device performs image-text matching based on the text semantic vector. For example, it can perform a task of obtaining an image by matching text or a task of matching text by matching an image.
[0007] This solves the problem of semantic errors in the text encoder caused by splitting phrases into individual characters. The generated text semantic vectors have strong representation capabilities and can preserve the continuous semantics of phrases. Furthermore, the pre-trained text encoder includes a first embedding module and a second embedding module, allowing for parameter decomposition of these modules. During the text encoder training process, both modules can be trained separately at different stages. This means the text encoder can be trained independently for semantic understanding of individual characters or phrases, avoiding mutual interference and further improving the overall semantic understanding of the text. Therefore, the image-text matching method provided in this application can improve the accuracy of image-text matching and effectively enhance the user experience.
[0008] In one possible implementation, the pre-trained text encoder described above belongs to a pre-trained CLIP model, that is, the pre-trained CLIP model includes the pre-trained text encoder. The training steps of the pre-trained CLIP model may include: the electronic device first acquires a first image training sample and a first text training sample corresponding to the first image training sample. For example, it may acquire an image as the first image training sample and then acquire the content description text of the image as the corresponding first text training sample; the CLIP model to be fine-tuned includes a first embedding module and a second embedding module. The electronic device may adjust the first model parameters of the second embedding module included in the CLIP model to be fine-tuned based on the first text training sample, the first image training sample, and a first preset loss function, while keeping the other model parameters unchanged, that is, freezing the model parameters in the CLIP model to be fine-tuned except for the first model parameters. For example, during the training process, a contrastive loss training method may be used for training. After training, the pre-trained CLIP model is obtained.
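As a hedged illustration of the contrastive loss training mentioned above, the sketch below computes a symmetric contrastive loss over a small batch of paired image and text vectors. The text does not specify the form of the first preset loss function; the symmetric cross-entropy form, the temperature value 0.07, and all function names here are assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy(logits, target):
    """Numerically stable cross-entropy of one row of logits against a target index."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[target]

def contrastive_loss(image_vecs, text_vecs, temperature=0.07):
    """Symmetric contrastive loss; pair i of images and texts is the positive pair."""
    imgs = [l2_normalize(v) for v in image_vecs]
    txts = [l2_normalize(v) for v in text_vecs]
    n = len(imgs)
    # logits[i][j] is the scaled cosine similarity of image i and text j.
    logits = [[sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature
               for j in range(n)] for i in range(n)]
    loss_i2t = sum(cross_entropy(logits[i], i) for i in range(n)) / n
    loss_t2i = sum(cross_entropy([logits[i][j] for i in range(n)], j)
                   for j in range(n)) / n
    return (loss_i2t + loss_t2i) / 2
```

The loss is small when each image vector is closest to its own text vector and large when the pairs are mismatched.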
[0009] Thus, during the training of the CLIP model for fine-tuning, only the first model parameters of the second embedding module are adjusted. The second embedding module is used for semantic understanding of phrases in the text. Therefore, it is possible to train the semantic understanding ability of the second embedding module for phrases in the text separately, avoiding interference with the semantic understanding ability of the first embedding module for individual characters in the text.
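The freezing behavior described above can be sketched as an update step that skips every parameter group except the second embedding module. This is a minimal sketch, not the training code of this application: the parameter names, the plain gradient-descent update, and the placeholder gradients are all hypothetical.

```python
def sgd_step(params, grads, trainable, lr=0.1):
    """Apply one gradient-descent step, leaving frozen parameter groups unchanged."""
    updated = {}
    for name, values in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(values, grads[name])]
        else:
            updated[name] = list(values)  # frozen: copied without modification
    return updated

params = {
    "first_embedding": [0.5, -0.2],   # single characters (frozen)
    "second_embedding": [0.1, 0.3],   # phrases (the only group that is fine-tuned)
    "transformer": [0.7, 0.7],        # frozen
}
grads = {name: [1.0, 1.0] for name in params}  # placeholder gradients
new_params = sgd_step(params, grads, trainable={"second_embedding"})
```

Only the values under `second_embedding` change, so the representation learned for single characters is left untouched.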
[0010] In one possible implementation, the steps for obtaining the first and second embedding modules of the CLIP model to be fine-tuned may include: the electronic device first acquires second image training samples and second text training samples corresponding to the second image samples. For example, the number of second image training samples may be greater than the number of first image training samples, and the same applies to the number of second text training samples; the electronic device then adjusts the model parameters of the CLIP model to be pre-trained based on the second text training samples, the second image training samples, and the second preset loss function. The CLIP model to be pre-trained includes a third embedding module, meaning the model parameters of the third embedding module are trained and adjusted; subsequently, when the CLIP model with adjusted parameters meets the pre-training cutoff condition, i.e., training can be stopped, the electronic device decomposes the third embedding module included in the CLIP model with adjusted parameters to obtain the first and second embedding modules of the CLIP model to be fine-tuned. In this way, the model parameters of the first and second embedding modules can be decomposed, facilitating the subsequent separate training of the model parameters of the second embedding module.
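One way to picture the decomposition is to split a single ID-indexed embedding table into two tables. The sketch below assumes that single characters and phrases occupy disjoint ID ranges and that the split is made at a threshold ID; this passage does not state how the decomposition is performed, so the threshold and the function name are hypothetical.

```python
def decompose_embedding(table, phrase_id_start):
    """Split one ID-indexed table into a single-character table and a phrase table."""
    first = {i: vec for i, vec in table.items() if i < phrase_id_start}
    second = {i: vec for i, vec in table.items() if i >= phrase_id_start}
    return first, second

# Hypothetical pre-trained table: IDs 1 and 2 are single characters, 10000 is a phrase.
third_embedding = {1: [0.1, 0.2], 2: [0.3, 0.4], 10000: [0.5, 0.6]}
first_embedding, second_embedding = decompose_embedding(third_embedding, 10000)
```

After the split, the two tables hold separate parameters, so one can be updated while the other stays fixed.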
[0011] In one possible implementation, the image-text matching method may further include: the electronic device adding multiple phrases to the word segmenter's vocabulary to obtain a post-added vocabulary, where the original vocabulary included multiple single characters, and the post-added vocabulary includes both single characters and multiple phrases; correspondingly, the aforementioned step of performing word segmentation processing on the text to obtain the segmentation result may include: the electronic device performing word segmentation processing on the text using a word segmenter based on the post-added vocabulary to obtain the segmentation result. This ensures that the word segmentation result includes both single characters and phrases, avoiding the splitting of phrases with continuous semantics into single characters, which helps improve the accuracy of image-text matching.
[0012] In one possible implementation, the added vocabulary also includes a first mapping relationship where multiple characters and multiple identifiers correspond one-to-one, and a second mapping relationship where multiple phrases and multiple identifiers correspond one-to-one. For example, character 1 has a first mapping relationship with 1, character 2 has a first mapping relationship with 2, and phrase 1 has a second mapping relationship with 10000, etc. Correspondingly, the image-text matching method may further include: the electronic device performing identifier mapping on the word segmentation results based on the first and second mapping relationships to obtain an identifier sequence corresponding to the word segmentation results. For example, the electronic device performs identifier mapping on the characters in the word segmentation results based on the first mapping relationship, and performs identifier mapping on the phrases in the word segmentation results based on the second mapping relationship, obtaining an identifier sequence including the identifiers of characters and phrases; the electronic device then determines the feature vector corresponding to the character based on the identifier sequence through the first embedding module of a pre-trained text encoder, and determines the feature vector corresponding to the phrase through the second embedding module of a pre-trained text encoder. In this way, complex text can be mapped into a simpler identifier sequence, reducing the difficulty of feature extraction and improving computational efficiency.
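The identifier mapping can be sketched with two dictionaries, one per mapping relationship. The Chinese tokens are an assumed rendering of the "water cup on the desktop" example, and the concrete ID values other than the 10000 mentioned above are hypothetical.

```python
# First mapping relationship: single characters to identifiers (values are illustrative).
char_to_id = {"桌": 1, "面": 2, "上": 3, "的": 4, "水": 5, "杯": 6}
# Second mapping relationship: phrases to identifiers.
phrase_to_id = {"水杯": 10000}

def to_id_sequence(tokens, char_to_id, phrase_to_id):
    """Map a word segmentation result to an identifier sequence."""
    ids = []
    for token in tokens:
        # A token of length 1 is a single character; anything longer is a phrase.
        mapping = char_to_id if len(token) == 1 else phrase_to_id
        ids.append(mapping[token])
    return ids

segmentation = ["桌", "面", "上", "的", "水杯"]
id_sequence = to_id_sequence(segmentation, char_to_id, phrase_to_id)
```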
[0013] In one possible implementation, the steps described above—determining the feature vector corresponding to a single character using a first embedding module of a pre-trained text encoder based on the identifier sequence, and determining the feature vector corresponding to a phrase using a second embedding module of a pre-trained text encoder—may include: the electronic device first splits the identifier sequence to obtain a first identifier sequence representing a single character and a second identifier sequence representing a phrase; then, the electronic device extracts features from the first identifier sequence using the first embedding module of the pre-trained text encoder to obtain the feature vector corresponding to a single character, and extracts features from the second identifier sequence using the second embedding module of the pre-trained text encoder to obtain the feature vector corresponding to a phrase; correspondingly, the electronic device then merges the feature vectors corresponding to the single character and the phrase based on the position of the single character and the position of the phrase in the text to obtain the feature vector corresponding to the text; finally, the feature vector corresponding to the text is encoded to obtain the text semantic vector corresponding to the text. Thus, splitting the identifier sequence into a first identifier sequence and a second identifier sequence facilitates the distinction between single characters and phrases, which are then input into the first embedding module and the second embedding module respectively.
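The split, embed, and merge steps above can be sketched as follows. The sketch assumes that phrase identifiers are distinguished from single-character identifiers by a threshold ID, and it models each embedding module as a plain lookup table; both are simplifying assumptions, and the function names are hypothetical.

```python
def split_ids(ids, phrase_id_start):
    """Return (position, id) pairs for single characters and for phrases."""
    chars = [(pos, i) for pos, i in enumerate(ids) if i < phrase_id_start]
    phrases = [(pos, i) for pos, i in enumerate(ids) if i >= phrase_id_start]
    return chars, phrases

def embed_and_merge(ids, first_embedding, second_embedding, phrase_id_start):
    """Look up each subsequence in its own table, then merge by original position."""
    chars, phrases = split_ids(ids, phrase_id_start)
    merged = [None] * len(ids)
    for pos, i in chars:
        merged[pos] = first_embedding[i]    # feature vector of a single character
    for pos, i in phrases:
        merged[pos] = second_embedding[i]   # feature vector of a phrase
    return merged

first_embedding = {1: [1.0, 0.0], 2: [0.0, 1.0]}
second_embedding = {10000: [0.5, 0.5]}
text_features = embed_and_merge([1, 10000, 2], first_embedding, second_embedding, 10000)
```

Recording the position of every identifier during the split is what allows the merged sequence to keep the order of the original text.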
[0014] In one possible implementation, the above-described image-text matching step based on text semantic vectors may include: an electronic device determining an image that matches the text from a pre-stored set of images based on the text semantic vectors. Thus, in the image-text matching task, the representational power of the text semantic vectors obtained in this application is enhanced, thereby improving the accuracy of image-text matching for this task.
[0015] In one possible implementation, the aforementioned pre-stored multiple images are each associated with a corresponding visual semantic vector. Correspondingly, the step of determining images matching the text from the pre-stored multiple images based on the text semantic vector can include: the electronic device first calculates the similarity between the visual semantic vector corresponding to each image and the text semantic vector corresponding to the text, obtaining a similarity calculation result for each image that indicates how close the image is to the text; the electronic device then sorts the multiple similarity calculation results in descending order, so that images closer to the text come first; subsequently, the electronic device determines the K images with the highest similarity calculation results as the images matching the text, where K is a positive integer less than or equal to the number of images. Thus, K images can be matched based on the text semantic vector of the text, and because this vector preserves the continuous semantics of phrases and has stronger representational capability, the K images obtained match the text better, improving the accuracy of image-text matching.
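The similarity calculation, descending sort, and selection of K images can be sketched as below. Cosine similarity is an assumption, since the text only refers to a similarity calculation, and the file names and vectors are made up for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_images(text_vec, image_vecs, k):
    """Return the names of the K images most similar to the text, best first."""
    scores = {name: cosine_similarity(text_vec, vec) for name, vec in image_vecs.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)  # descending similarity
    return ranked[:k]

# Hypothetical pre-computed visual semantic vectors.
image_vecs = {"cup.jpg": [0.9, 0.1], "lake.jpg": [0.1, 0.9], "mug.jpg": [0.8, 0.3]}
matches = top_k_images([1.0, 0.0], image_vecs, k=2)
```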
[0016] In one possible implementation, the visual semantic vectors corresponding to the aforementioned multiple images are obtained through an image encoder based on a pre-trained CLIP model. Thus, based on the pre-trained CLIP model, visual semantic vectors capable of representing image semantics can be obtained, which helps improve the accuracy of image-text matching.
[0017] In one possible implementation, the text may include multiple texts. In this image-text matching method, before the electronic device performs image-text matching based on the text semantic vector, an image can also be acquired; the visual semantic vector of the image is determined. Accordingly, the image-text matching method may include: the electronic device first acquires multiple texts, for example, the multiple texts may be multiple pre-stored texts; the electronic device then performs word segmentation processing on the multiple texts respectively to obtain the word segmentation results corresponding to the multiple texts; then, for the word segmentation results corresponding to each text, the electronic device determines the feature vector corresponding to a single character through the first embedding module of a pre-trained text encoder, and determines the feature vector corresponding to a phrase through the second embedding module of a pre-trained text encoder; subsequently, the electronic device encodes the feature vector corresponding to the single character and the feature vector corresponding to the phrase for the word segmentation results corresponding to each text to determine the text semantic vector corresponding to that text; finally, the electronic device determines the text that matches the image from the multiple texts based on the visual semantic vector and the multiple text semantic vectors. In this way, the task of obtaining text by image matching in the image-text matching task is realized, and the representation ability of the multiple text semantic vectors obtained in this application is enhanced, thus improving the image-text matching accuracy of this task.
[0018] In one possible implementation, the above-mentioned text acquisition steps may include: acquiring the text sent by the terminal. For example, the image-text matching method can be applied to a server such as a cloud server, whereby the cloud server acquires the text sent by the terminal; the cloud server then performs word segmentation processing on the text to obtain the segmentation results; then, the cloud server determines the feature vector corresponding to a single character through the first embedding module of a pre-trained text encoder, and determines the feature vector corresponding to a phrase through the second embedding module of a pre-trained text encoder; subsequently, the cloud server encodes the feature vectors corresponding to the single characters and the feature vectors corresponding to the phrases to determine the text semantic vector corresponding to the text; the cloud server performs image-text matching based on the text semantic vector, and finally sends the image-text matching result obtained from the image-text matching to the terminal so that the terminal can present it to the user. Thus, it is shown that the terminal can implement the image-text matching task based on the image-text matching method provided in this application embodiment, and the communicating terminal and server can also implement the image-text matching task based on the image-text matching method provided in this application embodiment.
[0019] In a second aspect, this application provides an electronic device including a memory and one or more processors; the memory stores computer program code, which includes computer instructions; the one or more processors invoke the computer instructions to cause the electronic device to execute the image-text matching method described in the first aspect.
[0020] In a third aspect, this application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the image-text matching method described in the first aspect.
Brief Description of the Drawings
[0021] Figure 1 is a schematic diagram illustrating a search result provided in an embodiment of this application;
[0022] Figure 2 is a schematic diagram of a model structure of a text encoder provided in an embodiment of this application;
[0023] Figure 3 is a flowchart of a text encoder generating text semantic vectors, provided in an embodiment of this application;
[0024] Figure 4 is a schematic diagram illustrating word vector mapping in an embedding layer, provided in an embodiment of this application;
[0025] Figure 5 is a schematic diagram illustrating word vector mapping in another embedding layer, provided in an embodiment of this application;
[0026] Figure 6 is a schematic diagram of the pre-training stage of the CLIP model training process provided in an embodiment of this application;
[0027] Figure 7 is a schematic diagram of the fine-tuning stage of the CLIP model training process provided in an embodiment of this application;
[0028] Figure 8 is a signaling interaction diagram of a text-image matching method provided in an embodiment of this application.
Detailed Implementation
[0029] To ensure clarity and conciseness in the description of the following embodiments, the terminology used in the embodiments of this application will first be explained. It should be understood that this explanation is for the purpose of better understanding the embodiments of this application and does not necessarily constitute a limitation on the embodiments of this application.
[0030] CLIP Model: The CLIP (Contrastive Language-Image Pre-Training) model is a neural network model for matching images and text. In some embodiments, training the text encoder and image encoder of the CLIP model yields a text encoder for outputting text semantic vectors and an image encoder for outputting visual semantic vectors of images.
[0031] Text semantic vector: This is a vector that represents the semantic features of the entire text, obtained by inputting text into a text encoder. For example, the text encoder can employ models such as the transformer commonly used in Natural Language Processing (NLP), and this application does not limit this to any particular model. In the embodiments of this application, the text encoder includes an original embedding module (also referred to as a first embedding module), a new embedding module (also referred to as a second embedding module), a transformer layer, and a mapping layer. Text can be input into the text encoder to obtain the corresponding text semantic vector.
[0032] Visual semantic vector: This can be obtained by inputting an image into an image encoder. For example, the image encoder can employ a CNN model or a ViT (Vision Transformer) model; this application does not limit this. In the embodiments of this application, images stored in a gallery application can be input into the image encoder to obtain the visual semantic vector corresponding to each image.
[0033] Vocabulary: The dictionary or vocabulary used by the word segmenter. In the embodiments of this application, the word segmenter's vocabulary includes single characters, phrases (a phrase includes at least two single characters), punctuation marks, and special symbols used for word segmentation.
[0034] Embedding layer: A layer used to extract feature vectors from the input data by converting discrete words or symbols in the input data into continuous vector representations. In this embodiment, the embedding layer is used to map single characters or phrases in the text to a continuous vector space to obtain the initial text vector corresponding to the text.
[0035] Transformer layer: A layer used to capture dependencies and contextual information in the input sequence through a self-attention mechanism and an encoder-decoder structure. In this embodiment, the encoder of the transformer layer is used to capture the dependencies and contextual information of the initial text vector output by the embedding layer, and outputs a text encoding vector.
[0036] Images: In this embodiment of the application, images include pictures and video frames from videos.
[0037] The following section compares and explains the technical advantages of the image-text matching method, electronic device, and storage medium provided in this application, in conjunction with relevant technologies. For ease of understanding, an example scenario is used for illustration.
[0038] In related technologies, after an electronic device obtains the search text input by a user, it typically performs image-text matching based on the CLIP model. The electronic device can input the search text into the text encoder of the CLIP model to obtain the vector corresponding to the search text. Then, it can determine the vector with high similarity from the vectors corresponding to multiple images and present the corresponding image as the search result to the user. The vectors corresponding to multiple images can be generated in advance based on the image encoder of the CLIP model.
[0039] Assuming the electronic device is a mobile phone, and the phone's gallery application stores multiple images, the image encoder of the CLIP model can obtain the vectors corresponding to each image. When a user enters the search text "water cup" in the search box of the gallery application, the phone can obtain the vector corresponding to "water cup" through the text encoder of the CLIP model. Based on this, the phone can search and match the vectors corresponding to the multiple images and display the images that match "water cup" as the search results.
[0040] However, as shown in Figure 1, the phone finds images 1 and 2, which match "water cup", but it also finds images 3 and 4, which only match "water". In other words, images 3 and 4 are not related to the search text "water cup", which reduces the accuracy of image-text matching and affects the user experience.
[0041] The inventors discovered that the reason for the low accuracy of image-text matching in the above situation is that the word segmenter will divide the text into individual characters (also known as single characters). In other words, even when the text contains phrases (a phrase includes at least two single characters), it will also split phrases with continuous semantics into single characters, causing errors in the vector representation of the text, which in turn reduces the accuracy of image-text matching.
[0042] Therefore, to solve the above problems, this application provides a text-image matching method. In this method, the vocabulary of the word segmenter is expanded by adding phrases to the vocabulary containing single characters. After the electronic device acquires text, it segments the text using the word segmenter to obtain segmentation results containing single characters and phrases. The electronic device splits the single characters and phrases contained in the segmentation results, inputs the single characters into the original embedding module of the CLIP model's text encoder to obtain initial single-character text vectors, and inputs the phrases into the newly added embedding module of the text encoder to obtain initial phrase text vectors. Then, based on the original positions of the single characters and phrases in the text, the initial single-character text vectors and the initial phrase text vectors are merged to obtain the initial text vector corresponding to the text. Subsequently, based on the initial text vectors, the text semantic vector corresponding to the text is obtained through the transformer layer and mapping layer of the text encoder. Finally, the electronic device performs text-image matching based on the text semantic vector.
[0043] This solves the problem of the CLIP model misinterpreting phrases because phrases in the text are split into individual characters. Furthermore, because single characters and phrases are input into different embedding modules, the weights of the original and newly added embedding modules are decomposed. During training, the newly added embedding module can be fine-tuned independently, which improves the CLIP model's semantic understanding of phrases without affecting its semantic understanding of single characters. This enhances the performance of the CLIP model, improves its semantic understanding of text, and yields highly representative text semantic vectors, which helps improve the accuracy of image-text matching and effectively improves the user experience.
[0044] Next, the structure of the text encoder of the CLIP model provided in the embodiments of this application is described in detail with reference to Figure 2.
[0045] As shown in Figure 2, the text encoder includes a splitting module, an embedding layer, a merging module, a transformer layer, and a mapping layer.
[0046] The splitting module is used to split the ID sequence, which the word segmenter obtains by segmenting, serializing, and ID-mapping the search text, into a single-character ID subsequence and a phrase ID subsequence. In other words, the splitting module separates the single-character ID subsequence, which represents single characters, from the phrase ID subsequence, which represents phrases.
[0047] The embedding layer is used to capture the semantic information of the text. In some embodiments, the embedding layer may include an original embedding module and a new embedding module. The original embedding module is responsible for receiving the single-character ID subsequence output by the splitting module, converting the single-character ID subsequence into a fixed-dimensional real number vector (also known as feature extraction) to obtain the initial single-character text vector (also known as the feature vector corresponding to the single character). The new embedding module is responsible for receiving the phrase ID subsequence output by the splitting module, converting the phrase ID subsequence into a fixed-dimensional real number vector to obtain the initial phrase text vector (also known as the feature vector corresponding to the phrase).
[0048] The merging module is used to merge the initial single-character text vector and the initial phrase text vector to obtain an initial text vector (which may be called the feature vector corresponding to the text). In some embodiments, the merging module can merge the initial single-character text vector and the initial phrase text vector according to the original position of the single character and the original position of the phrase in the search text. The resulting initial text vector has the same position for the initial single-character text vector as the single character in the search text, and the same position for the initial phrase text vector as the phrase in the search text.
[0049] The transformer layer receives the initial text vector output by the merging module and generates a richer and more abstract text representation (which can be called the text encoding vector corresponding to the search text). For example, as shown in Figure 2, the transformer layer may include a transformer encoder. In some embodiments, the transformer encoder contains multiple identical layers, each including a self-attention mechanism and a feedforward neural network, which can capture contextual information and the relationships between text units (e.g., between single characters and phrases, between single characters, or between phrases) in the input initial text vector.
[0050] The mapping layer is used to receive the text encoding vector output by the transformer layer. It can transform the text encoding vector output by the transformer layer and the visual semantic vector output by the image encoder to the same dimension and distribution, and output the text semantic vector. This makes it easier to calculate the vector similarity between the two through the CLIP model, thereby achieving cross-modal matching.
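As a rough illustration of the mapping layer, the sketch below projects an encoder output into a lower-dimensional shared space and normalizes it to unit length. A single linear projection followed by L2 normalization is an assumption; the text only states that the mapping layer brings the text and image vectors to the same dimension and distribution. The weights and dimensions are made up.

```python
import math

def project(vec, weight):
    """Multiply a vector by an (out_dim x in_dim) weight matrix, then L2-normalize."""
    out = [sum(w * x for w, x in zip(row, vec)) for row in weight]
    norm = math.sqrt(sum(v * v for v in out))
    return [v / norm for v in out]

text_encoding = [3.0, 0.0, 4.0]               # hypothetical 3-dimensional encoder output
weight = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]   # hypothetical weights mapping 3 dimensions to 2
text_semantic = project(text_encoding, weight)
```

With unit-length vectors on both the text and image side, a dot product between them equals their cosine similarity.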
[0051] Furthermore, in some embodiments, the image encoder of the CLIP model can employ a deep convolutional neural network as its underlying architecture, such as ResNet-50, and this application does not limit this to any particular model. The image encoder can perform a series of convolution, pooling, and activation operations on the image to extract image features and convert them into visual semantic vectors with the same dimension and distribution as the text semantic vectors, so as to provide a foundation for subsequent cross-modal image-text matching.
[0052] Next, with reference to Figure 3, the process of generating a text semantic vector from the search text using a text encoder with the aforementioned structure is described in detail. For example, as shown in Figure 3, the text semantic vector can be obtained through the following steps:
[0053] S301: The word segmenter adds phrases to the vocabulary.
[0054] In the related art, the vocabulary of a tokenizer (that is, the word segmenter) usually contains only Chinese single characters. When the tokenizer tokenizes a search text, it splits a phrase with continuous meaning into single characters, which easily confuses the meaning of the phrase with the meanings of its single characters. For example, the tokenizer splits "water cup" into "water" and "cup", and the meaning of "water" is different from that of "water cup"; similarly, the tokenizer splits "backpack" into "back" and "pack", and the meaning of "back" is different from that of "backpack".
[0055] Therefore, in some embodiments, phrases with continuous meaning can be added to the vocabulary of the tokenizer (the vocabulary after adding phrases can also be referred to as the post-added vocabulary), so as to obtain a tokenization result including phrases and retain the semantics of the phrases in the search text.
[0056] In some embodiments, step S301 can be understood as a preparatory step before generating the text semantic vector based on the search text and can be completed in advance.
[0057] S302: The tokenizer obtains the search text input by the user.
[0058] In some embodiments, the search text can be the text input by the user into the search box of the gallery application or the text input by the user into the search box of the browser application. This application does not make any limitations in this regard.
[0059] It should be understood that using search text to match pictures (which can be referred to as the text-to-picture matching task) is only one scenario of the image-text matching task. The image-text matching task can also include using pictures to match text (the picture-to-text matching task).
[0060] In some embodiments, before the picture-to-text matching task is performed, the tokenizer can obtain multiple texts in advance, and the text encoder can generate a corresponding text semantic vector for each of them. When the picture-to-text matching task is performed, these text semantic vectors can be matched against the picture to be matched.
[0061] S303: The tokenizer tokenizes and serializes the search text to obtain a text unit sequence.
[0062] When the tokenizer tokenizes the search text, the search text is split into multiple text units (tokens). A text unit can be a single character or a phrase.
[0063] In some embodiments, the tokenizer can query from the vocabulary using the maximum forward matching method. Exemplarily, it can start from the first character of the search text and attempt to match the longest phrase in the vocabulary. If the match is successful, the phrase is split from the search text as a text unit, and the remaining text is processed; if the match fails, it attempts to match the second-longest phrase in the vocabulary, and so on.
[0064] For example, assume the search text is "a water cup on the table" and the longest phrase in the vocabulary has 5 characters. Starting from the first character of the search text, the tokenizer selects 5 consecutive characters for matching. If the match fails, it selects the 4 consecutive characters starting from the first character, and so on, until a match succeeds. Assume "table" matches: "table" is then cut from the search text as a text unit, and matching continues on the remaining text. Finally, the 6 text units "table", "on", "of", "one", "each", and "water cup" are obtained as the tokenization result.
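The forward maximum matching procedure described above can be sketched as follows. This is a minimal illustration rather than the tokenizer of the embodiment: the function name `forward_max_match`, the toy vocabulary, and the Chinese string (an assumed back-translation of the example search text) are hypothetical, and emitting an unmatched single character as its own text unit is an assumed fallback.

```python
def forward_max_match(text, vocab, max_len):
    """Greedy forward maximum matching against a set-based vocabulary."""
    units = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink by one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Assumed fallback: a single character is emitted even if it is not in the vocabulary.
            if candidate in vocab or size == 1:
                units.append(candidate)
                i += size
                break
    return units


# Toy vocabulary: single characters plus the added phrase "水杯" (water cup).
vocab = {"桌", "上", "的", "一", "个", "水", "杯", "水杯"}
units = forward_max_match("桌上的一个水杯", vocab, max_len=5)
```

With the phrase in the vocabulary the sketch yields 6 text units; without it, "水杯" falls apart into two single characters and 7 text units result.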
[0065] Subsequently, the tokenizer serializes the tokenization result, that is, converts the tokenization result into a preset format that can be transmitted for subsequent processing. For example, the preset format can be binary format, JSON, XML or other data formats, and this application does not limit this.
[0066] For example, when the tokenizer serializes "table", "on", "of", "one", "each", and "water cup", it can obtain '[CLS]a water cup on the table[PAD]...[PAD]' as the text unit sequence. [CLS] is a special token marking the beginning of the sequence. The text unit sequence usually has a preset length, and [PAD] is used to pad the sequence to that length. Assume the preset length of the text unit sequence is 50; [CLS] and the 6 text units occupy 7 positions, so 43 [PAD] tokens need to be filled.
[0067] In contrast, the tokenizer in the related art tokenizes and serializes "a water cup on the table" into '[CLS]a water cup on the table[PAD][PAD]...[PAD]', in which "water cup" is split into "water" and "cup". Assume the preset length of the text unit sequence is 50; [CLS] and the 7 text units "table", "on", "of", "one", "each", "water", and "cup" occupy 8 positions, so 42 [PAD] tokens need to be filled.
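The padding arithmetic in the two cases above can be checked with a short sketch. The function name `serialize` and the use of a Python list as the serialized form are assumptions made for illustration; raising an error for over-long input is likewise an assumed behavior that this application does not specify.

```python
CLS, PAD = "[CLS]", "[PAD]"


def serialize(units, seq_len=50):
    """Prefix [CLS] and pad the text units with [PAD] up to seq_len positions."""
    seq = [CLS] + list(units)
    if len(seq) > seq_len:
        # Assumed behavior: reject input that does not fit the preset length.
        raise ValueError("text unit sequence exceeds the preset length")
    return seq + [PAD] * (seq_len - len(seq))


# Tokenization with the added phrase "water cup" (6 text units).
with_phrase = serialize(["table", "on", "of", "one", "each", "water cup"])
# Tokenization in the related art, where "water cup" is split (7 text units).
related_art = serialize(["table", "on", "of", "one", "each", "water", "cup"])

pad_with_phrase = with_phrase.count(PAD)
pad_related_art = related_art.count(PAD)
```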
[0068] S304: The tokenizer maps the text unit sequence to an ID sequence based on the vocabulary.
[0069] In some embodiments, the word segmenter's vocabulary assigns a unique ID to each stored character or phrase. In this application embodiment, the mapping relationship between a character and its ID can be referred to as the first mapping relationship, and the mapping relationship between a phrase and its ID can be referred to as the second mapping relationship.
[0070] The word segmenter can then search the vocabulary to match the corresponding IDs for multiple text units in the text unit sequence, thus obtaining an ID sequence.
[0071] For example, by using the mapping relationship between text units and IDs stored in the vocabulary, the IDs corresponding to “table”, “on”, “of”, “one”, “each” and “water cup” can be matched as “101”, “890”, “954”, “100”, “380” and “22230” respectively, and the ID sequence is [101 890 954 100 380 22230].
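As a minimal sketch of this lookup, the vocabulary can be modeled as a dictionary from text unit to ID. The dictionary `VOCAB_IDS` and the function `map_to_ids` are hypothetical names; the ID values are those of the example above, and special tokens are omitted as in that example.

```python
# Hypothetical excerpt of the vocabulary: text unit -> unique ID.
VOCAB_IDS = {
    "table": 101,
    "on": 890,
    "of": 954,
    "one": 100,
    "each": 380,
    "water cup": 22230,  # newly added phrase
}


def map_to_ids(units, vocab_ids):
    """Look up the ID of every text unit; an unknown unit raises KeyError."""
    return [vocab_ids[unit] for unit in units]


id_sequence = map_to_ids(["table", "on", "of", "one", "each", "water cup"], VOCAB_IDS)
```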
[0072] S305: The splitting module splits the ID sequence to obtain a single-character ID subsequence and a phrase ID subsequence.
[0073] It should be understood that, because the word segmenter's vocabulary includes the newly added phrases, the ID range corresponding to single characters and the ID range corresponding to phrases can be determined. For example, the vocabulary initially includes 20,000 single characters, mapped one-to-one to IDs 1-20,000, and 5,000 phrases are subsequently added, mapped one-to-one to IDs 20,001-25,000. Based on the value of each ID in the ID sequence, the splitting module can determine which IDs correspond to single characters and which to phrases: IDs within the range [1, 20,000] are grouped into the single-character ID subsequence, and IDs within the range [20,001, 25,000] are grouped into the phrase ID subsequence. This yields [101 890 954 100 380] as the single-character ID subsequence (also called the first identifier sequence) and [22230] as the phrase ID subsequence (also called the second identifier sequence).
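The range-based split can be sketched as follows, using the ID ranges of the example. The function name `split_id_sequence` is hypothetical, and recording each ID's original position alongside it is an assumption made here so that the vectors can later be merged in the original order.

```python
CHAR_ID_RANGE = range(1, 20001)        # single characters: IDs 1..20,000
PHRASE_ID_RANGE = range(20001, 25001)  # added phrases: IDs 20,001..25,000


def split_id_sequence(ids):
    """Split an ID sequence into (position, ID) pairs for characters and phrases."""
    char_part, phrase_part = [], []
    for position, token_id in enumerate(ids):
        if token_id in CHAR_ID_RANGE:
            char_part.append((position, token_id))
        elif token_id in PHRASE_ID_RANGE:
            phrase_part.append((position, token_id))
        else:
            raise ValueError(f"ID {token_id} is outside both ranges")
    return char_part, phrase_part


chars, phrases = split_id_sequence([101, 890, 954, 100, 380, 22230])
char_ids = [token_id for _, token_id in chars]
phrase_ids = [token_id for _, token_id in phrases]
```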
[0074] S306: The original embedding module of the embedding layer performs word vector mapping on the single character ID subsequence to obtain the initial single character text vector, and the newly added embedding module of the embedding layer performs word vector mapping on the phrase ID subsequence to obtain the initial phrase text vector.
[0075] As shown in Figure 4, in the related art, the embedding layer includes a single embedding module, which directly performs word vector mapping on the ID sequence [101 890 954 22230 100] to obtain the initial text vector [word vector a, word vector b, word vector c, word vector d, word vector e]. However, supporting the newly added phrases in this way requires training the entire embedding layer, which requires a large amount of training data and GPU resources. Furthermore, using the newly added phrases as training samples may interfere with the ability of the CLIP model's embedding layer to map single characters to word vectors.
[0076] Therefore, in this embodiment, the embedding module is split into an original embedding module and a new embedding module, so that only the word vector mapping capability of the new embedding module for the newly added phrases is trained.
[0077] As shown in Figure 5, assuming the ID sequence is [101 890 954 22230 100], the splitting module can first split it into [101 890 954 100] and [22230]. The original embedding module then performs word vector mapping on [101 890 954 100] to obtain [word vector 1, word vector 2, word vector 3, word vector 4], and the newly added embedding module performs word vector mapping on [22230] to obtain [word vector 5].
[0078] S307: The merging module merges the initial single-word text vector and the initial phrase text vector based on the original position of each text unit in the search text to obtain the initial text vector.
[0079] It should be understood that the positions of phrases and words in the search text are not fixed. In order to avoid affecting the semantics of the search text, it is necessary to merge the initial word text vector and the initial phrase text vector according to their original positions.
[0080] As shown in Figure 5, the merging module places word vector 5 between word vector 3 and word vector 4, resulting in [word vector 1, word vector 2, word vector 3, word vector 5, word vector 4] as the initial text vector.
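The split, separate mapping, and position-preserving merge of steps S305 to S307 can be sketched together. The two dictionaries below are hypothetical stand-ins for the original and newly added embedding modules, and string labels are used in place of real vectors; the helper names `split`, `embed`, and `merge` are assumptions.

```python
# Hypothetical lookup tables standing in for the two embedding modules.
ORIGINAL_TABLE = {101: "word vector 1", 890: "word vector 2",
                  954: "word vector 3", 100: "word vector 4"}
NEW_TABLE = {22230: "word vector 5"}
CHAR_MAX_ID = 20000  # IDs above this value are treated as phrase IDs


def split(ids):
    """Split into (position, ID) pairs so the original order can be restored."""
    chars = [(pos, i) for pos, i in enumerate(ids) if i <= CHAR_MAX_ID]
    phrases = [(pos, i) for pos, i in enumerate(ids) if i > CHAR_MAX_ID]
    return chars, phrases


def embed(part, table):
    """Map each ID to its vector while keeping the recorded position."""
    return [(pos, table[i]) for pos, i in part]


def merge(char_vectors, phrase_vectors):
    """Interleave both vector lists according to the original positions."""
    ordered = sorted(char_vectors + phrase_vectors, key=lambda item: item[0])
    return [vector for _, vector in ordered]


chars, phrases = split([101, 890, 954, 22230, 100])
initial_text_vector = merge(embed(chars, ORIGINAL_TABLE), embed(phrases, NEW_TABLE))
```

Keeping the position with each ID is one simple way to let the merging module restore the original order; other bookkeeping schemes would work equally well.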
[0081] S308: The transformer encoder encodes the initial text vector to obtain the text encoded vector.
[0082] S309: The mapping layer performs vector mapping on the text encoding vector to obtain the text semantic vector corresponding to the search text.
[0083] It should be understood that, in order to improve the performance, generalization ability, and adaptability of the CLIP model, it needs to be trained before application. Next, the training process of the CLIP model is described in detail with reference to Figure 6 and Figure 7.
[0084] In some embodiments, the image encoder and text encoder of the CLIP model can be trained together based on a contrastive learning training method.
[0085] In some embodiments, the training process of the CLIP model may include a pre-training phase and a fine-tuning phase.
[0086] The pre-training phase of the CLIP model can be divided into the following steps 1-8.
[0087] In some embodiments, as shown in Figure 6, during the pre-training phase, the embedding layer of the text encoder includes one embedding module (also known as the third embedding module), the word segmenter's vocabulary contains single characters, and the parameters of both the text encoder and the image encoder of the CLIP model (also known as the CLIP model to be pre-trained) are adjustable.
[0088] Step 1: Obtain the image training samples and the corresponding text training samples.
[0089] Both the image encoder and text encoder of the CLIP model need to be trained in advance with a large number of training samples. Therefore, before training the models, it is necessary to obtain the training samples of the image encoder of the CLIP model, i.e., the image training samples (also known as the first image training samples), and the training samples of the text encoder of the CLIP model, i.e., the text training samples corresponding to the image training samples (also known as the first text training samples).
[0090] In some embodiments, the image training samples may include picture training samples and video frame training samples. A video frame refers to any frame of a video; a frame is a still image in the video, and consecutive frames form a video. A picture training sample can be any picture, and a video frame training sample can be any video frame from a video. The text training sample corresponding to an image training sample is the text that expresses the content displayed by that image training sample. For example, if the image training sample is image 1 shown in Figure 1, the corresponding text training sample can be: "water glass on the table".
[0091] It should be noted that this application does not limit the method of obtaining the text training samples corresponding to the image training samples.
[0092] For example, the text training samples corresponding to the image training samples can be manually annotated, based on the annotator's own understanding of the semantic meaning of the images. As another example, the text training samples can be automatically generated by recognizing objects, scenes, and actions within the image training samples. As yet another example, the text training samples can be automatically generated by a text generation model used to generate descriptive text for images.
[0093] Furthermore, it should be noted that this application does not limit the number of image training samples. It is understood that the text training samples correspond to the image training samples, and therefore their numbers are the same.
[0094] As shown in Figure 6, N image training samples can be obtained, together with N text training samples corresponding one-to-one with the N image training samples. For example, image training sample 1 corresponds to text training sample 1.
[0095] Step 2: Input the image training samples into the image encoder, and the image encoder outputs the visual semantic vectors corresponding to the image training samples.
[0096] As shown in Figure 6, the N image training samples are input into the image encoder to obtain the visual semantic vectors I1, I2, I3, ..., IN corresponding to the N image training samples.
[0097] Step 3: Input the text training samples into the word segmenter to obtain the ID sequence corresponding to the text training samples.
[0098] As shown in Figure 6, the N text training samples are input into the word segmenter to obtain the ID sequences corresponding to the N text training samples respectively.
[0099] In some embodiments, the ID sequence can be obtained by a word segmenter performing word segmentation, serialization, and ID mapping on the text training samples. The implementation method can be found in the above embodiments, and will not be repeated here.
[0100] Step 4: Input the ID sequence of the text training samples into the text encoder, and the text encoder outputs the text semantic vector corresponding to the text training samples.
[0101] As shown in Figure 6, the ID sequences corresponding to the N text training samples are input into the text encoder to obtain the text semantic vectors T1, T2, T3, ..., TN corresponding to the N text training samples.
[0102] In some embodiments, after receiving the ID sequence of the text training samples, the text encoder outputs the text semantic vector through the embedding module of the embedding layer, the transformer encoder of the transformer layer, and the mapping layer. The implementation method can be found in the above embodiments, and will not be repeated here.
[0103] Step 5: Combine each visual semantic vector and multiple text semantic vectors to obtain multiple vector pairs. From these multiple vector pairs, determine the vector pairs with corresponding relationships as positive sample vector pairs, and determine the remaining vector pairs as negative sample vector pairs.
[0104] Contrastive learning is an unsupervised training method, therefore it is necessary to define positive and negative samples from the training samples. In this embodiment, positive and negative sample vector pairs are determined from multiple vector pairs.
[0105] In some embodiments, assuming there are N visual semantic vectors and N text semantic vectors, combining each visual semantic vector and the N text semantic vectors yields N×N vector pairs. It is understood that the image training samples and the text training samples have a correspondence, therefore their corresponding vectors also have a correspondence. Among the N×N vector pairs, the vector pairs consisting of the corresponding visual semantic vectors and text semantic vectors are determined as positive sample vector pairs, i.e., there can be N positive sample vector pairs; the remaining vector pairs are determined as negative sample vector pairs, i.e., there can be N×(N-1) negative sample vector pairs.
[0106] As shown in Figure 6, taking I1 as an example, combining it with T1, T2, T3, ..., TN respectively yields the N vector pairs I1·T1, I1·T2, I1·T3, ..., I1·TN. Treating I2, I3, ..., IN in the same way yields N×N vector pairs in total. The vector pairs with a corresponding relationship, I1·T1, I2·T2, I3·T3, ..., IN·TN, are determined as positive sample vector pairs, and the rest are determined as negative sample vector pairs.
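The pair construction can be sketched by enumerating index pairs, where the pair (i, j) stands for the combination of the i-th visual semantic vector and the j-th text semantic vector. The function name `build_pairs` is hypothetical, and the indices are zero-based for convenience, so (0, 0) corresponds to I1·T1.

```python
def build_pairs(n):
    """Enumerate all n*n (image, text) index pairs and label them."""
    positives, negatives = [], []
    for i in range(n):
        for j in range(n):
            # Matching indices mean the image and text come from the same sample.
            if i == j:
                positives.append((i, j))
            else:
                negatives.append((i, j))
    return positives, negatives


positives, negatives = build_pairs(4)
```

For N = 4 this gives 4 positive pairs and 4 × 3 = 12 negative pairs, matching the N and N×(N-1) counts stated above.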
[0107] Step 6: Calculate the vector similarity between the visual semantic vector and the textual semantic vector in each vector pair.
[0108] For example, the vector cosine similarity between the visual semantic vector and the textual semantic vector in each vector pair can be calculated.
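Cosine similarity is the dot product of two vectors divided by the product of their lengths. A minimal sketch using plain Python lists follows; the function names `cosine_similarity` and `similarity_matrix` are hypothetical, and treating a zero vector as an error is an assumption.

```python
import math


def cosine_similarity(u, v):
    """Return the cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_matrix(image_vectors, text_vectors):
    """Entry [i][j] is the similarity of image vector i and text vector j."""
    return [[cosine_similarity(i_vec, t_vec) for t_vec in text_vectors]
            for i_vec in image_vectors]


sims = similarity_matrix([[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [0.0, 1.0]])
```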
[0109] Step 7: Adjust the parameters of the image encoder and text encoder based on the loss function, the vector similarity of positive sample vector pairs, and the vector similarity of negative sample vector pairs.
[0110] It is understood that, in the embodiments of this application, the model training objective is to maximize the vector similarity of the positive sample vector pairs and to minimize the vector similarity of the negative sample vector pairs.
[0111] In addition, in some embodiments, the true (target) similarity of a positive sample vector pair can be represented as 1, and the true similarity of a negative sample vector pair can be represented as 0. The parameters of the image encoder and the text encoder are adjusted until the predicted similarity of the positive sample vector pairs approaches 1 as closely as possible and the predicted similarity of the negative sample vector pairs approaches 0 as closely as possible, that is, until the value of the loss function (also known as the first preset loss function) is minimized.
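This application does not specify the form of the first preset loss function. As one hedged illustration of the target-1 and target-0 description, the sketch below uses the mean squared error between the similarity matrix and the identity matrix; the function name `pairwise_target_loss` and the choice of squared error are assumptions. The published CLIP training procedure instead uses a symmetric cross-entropy over temperature-scaled similarities.

```python
def pairwise_target_loss(sim):
    """Mean squared error between sim[i][j] and the target (1 if i == j else 0)."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            target = 1.0 if i == j else 0.0  # diagonal entries are positive pairs
            total += (sim[i][j] - target) ** 2
    return total / (n * n)


perfect = pairwise_target_loss([[1.0, 0.0], [0.0, 1.0]])
uninformative = pairwise_target_loss([[0.5, 0.5], [0.5, 0.5]])
```

The loss is zero only when every positive pair has similarity 1 and every negative pair has similarity 0, so reducing it moves the encoders toward the stated objective.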
[0112] Step 8: When the training cutoff condition is met, end the training to obtain the pre-trained CLIP model.
[0113] For example, the training cutoff condition (also known as the pre-training cutoff condition) may be that the model reaches a pre-set number of training iterations during training, or that the loss value of the loss function is less than a loss value threshold during training.
[0114] Next, we will introduce the fine-tuning stage of the CLIP model.
[0115] The fine-tuning phase of the CLIP model can also be subdivided into steps 1-8.
[0116] In some embodiments, as shown in Figure 7, during the fine-tuning phase, the embedding layer of the text encoder is split into an original embedding module and a newly added embedding module. The parameters of the newly added embedding module are adjustable. Apart from this, the parameters (also called model parameters) of the CLIP model (also known as the CLIP model to be fine-tuned) are the same as those of the pre-trained CLIP model; in particular, the parameters of the original embedding module are the same as those of the embedding module in Figure 6 and are fixed. During fine-tuning, only the parameters of the newly added embedding module (also known as the first model parameters) are adjusted. The word segmenter's vocabulary includes single characters and newly added phrases.
[0117] Step 1: Obtain the image training samples and the corresponding text training samples.
[0118] In some embodiments, the fine-tuning stage requires fewer parameters to be adjusted compared to the pre-training stage, thus allowing the CLIP model to be trained with fewer training samples than the aforementioned N text training samples and N image training samples.
[0119] For example, as shown in Figure 7, M image training samples (also known as second image training samples) can be obtained, together with M text training samples (also known as second text training samples) corresponding one-to-one with the M image training samples, where M is an integer less than N.
[0120] Step 2: Input the image training samples into the image encoder, and the image encoder outputs the visual semantic vectors corresponding to the image training samples.
[0121] As shown in Figure 7, the M image training samples are input into the image encoder to obtain the visual semantic vectors I1, I2, I3, ..., IM corresponding to the M image training samples.
[0122] Step 3: Input the text training samples into the word segmenter to obtain the ID sequence corresponding to the text training samples.
[0123] As shown in Figure 7, the M text training samples are input into the word segmenter to obtain the ID sequences corresponding to the M text training samples respectively.
[0124] Step 4: Input the ID sequence of the text training samples into the text encoder, and the text encoder outputs the text semantic vector corresponding to the text training samples.
[0125] As shown in Figure 7, the M ID sequences are input into the text encoder to obtain the text semantic vectors T1, T2, T3, ..., TM corresponding to the M text training samples.
[0126] In some embodiments, the model structure of the text encoder is as shown in Figure 2, and the corresponding text semantic vectors are generated from the text training samples following the steps shown in Figure 3. The implementation can be found in the above embodiments and is not repeated here.
[0127] Step 5: Combine each visual semantic vector and multiple text semantic vectors to obtain multiple vector pairs. From these multiple vector pairs, determine the vector pairs with corresponding relationships as positive sample vector pairs, and determine the remaining vector pairs as negative sample vector pairs.
[0128] Based on the above examples, in some embodiments, assuming there are M visual semantic vectors and M text semantic vectors, there can be M positive sample vector pairs and M×(M-1) negative sample vector pairs.
[0129] As shown in Figure 7, I1·T1, I2·T2, I3·T3, ..., IM·TM are determined as positive sample vector pairs, and the rest are determined as negative sample vector pairs.
[0130] Step 6: Calculate the vector similarity between the visual semantic vector and the textual semantic vector in each vector pair.
[0131] Step 7: Based on the loss function, the vector similarity of the positive sample vector pairs, and the vector similarity of the negative sample vector pairs, adjust the parameters of the newly added embedding module in the text encoder.
[0132] The loss function may also be referred to as the second preset loss function. In some embodiments, the second preset loss function may be the same as or different from the first preset loss function in the above embodiments.
[0133] In some embodiments, during the fine-tuning phase of the CLIP model, all parameters of the CLIP model other than those of the newly added embedding module are the same as the parameters of the pre-trained CLIP model and are fixed (also known as freezing the parameters of the pre-trained CLIP model). Only the parameters of the newly added embedding module are adjusted during the fine-tuning process.
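The effect of freezing can be sketched with a single gradient-descent step in which only the newly added embedding module is updated. The parameter group names, the toy weights and gradients, the learning rate, and the function name `fine_tune_step` are all hypothetical; a real implementation would typically rely on the training framework's own mechanism for excluding parameters from updates.

```python
def fine_tune_step(params, grads, trainable, lr=0.5):
    """Apply one gradient-descent step to the trainable parameter groups only."""
    updated = {}
    for name, weights in params.items():
        if name in trainable:
            updated[name] = [w - lr * g for w, g in zip(weights, grads[name])]
        else:
            updated[name] = list(weights)  # frozen: copied unchanged
    return updated


# Toy parameter groups standing in for the parts of the CLIP model.
params = {
    "image_encoder": [1.0, 2.0],
    "original_embedding": [3.0, 4.0],
    "new_embedding": [5.0, 6.0],
    "transformer_encoder": [7.0],
    "mapping_layer": [8.0],
}
grads = {name: [1.0] * len(weights) for name, weights in params.items()}
after = fine_tune_step(params, grads, trainable={"new_embedding"})
```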
[0134] Step 8: When the training cutoff condition is met, end the training to obtain the trained CLIP model.
[0135] In some embodiments, a trained CLIP model may also be referred to as a pre-trained CLIP model.
[0136] It should be noted that other implementation methods for steps 1-8 can be found in steps 1-8 of the pre-training stage of the CLIP model, and will not be repeated here.
[0137] Next, the image-text matching method provided in the embodiments of this application will be introduced in conjunction with the above-trained CLIP model.
[0138] In some embodiments, the electronic device runs an operating system such as Android. A layered architecture divides the operating system into several layers, each with a clear role and function, and the layers communicate with each other through software interfaces. Taking a mobile phone as an example of the electronic device, as shown in Figure 8, the mobile phone may include a gallery service module 810, a search module 820, a multimodal understanding module 830, and a natural language understanding module 840, which are used to implement the image-text matching method provided in the embodiments of this application. These modules may all be located in the same layer of the electronic device, or in different layers, or a module may span multiple layers, with its functions implemented through the software interfaces between layers. This application does not limit this.
[0139] Below, with reference to Figure 8, the image-text matching method provided in the embodiments of this application is explained in detail, taking a mobile phone as an example and taking a user searching for images in the phone's gallery application as the scenario.
[0140] As shown in Figure 8, the image-text matching method provided in the embodiments of this application can be divided into two stages: an index building stage and a search stage.
[0141] First, the steps of the index building stage are described in detail with reference to Figure 8.
[0142] S801: The gallery service module 810 receives a user operation of adding or modifying an image and its attribute information.
[0143] Adding an image refers to the action of a user storing an image in the gallery application. For example, adding an image could be a user taking a picture with their phone, downloading an image, or capturing a screenshot on their phone. Modifying an image refers to the action of a user editing an image already stored in the gallery application. For example, modifying an image could be a user cropping the image, stitching images, adding effects, or adding captions.
[0144] In some embodiments, the attribute information of an image may include: the location where the image was acquired, the time the image was acquired, names of people, category tags, and events, etc., which are not limited in this application. Category tags can be used to indicate the type of object shown in the image, such as people, plants, animals, buildings, or natural scenery. Events can be used to indicate what the object shown in the image is doing. For example, an event can be a game, sports, etc.
[0145] In some embodiments, the category labels can be manually configured by the user or automatically generated by the mobile phone to classify the images; this application does not limit this.
[0146] S802: The gallery service module 810 stores images and their attribute information.
[0147] The gallery service module 810 responds to user operations on adding or modifying images and their attribute information by storing the images and their attribute information on the phone.
[0148] For example, images and their attribute information can be stored in local files on the phone, and users can view the images through various means such as local folders on the phone and the gallery application.
[0149] For example, with the user's authorization, images and their attribute information can be stored in the cloud for backup, reducing the memory pressure on the mobile phone.
[0150] S803: The gallery service module 810 calls the multimodal understanding module 830 to perform visual semantic understanding on the image.
[0151] Visual semantic understanding refers to enabling mobile phones to understand the meaning expressed in the content displayed in an image, such as understanding the type, quantity, location, and relationships between objects in the image.
[0152] It should be understood that visual semantic understanding requires a significant amount of computing resources. In some embodiments, to avoid affecting user experience, this step S803 can be performed while the phone is charging and the screen is off.
[0153] In some embodiments, the multimodal understanding module 830 can achieve visual semantic understanding based on the CLIP model trained in the above embodiments, inputting the image into the image encoder of the CLIP model to generate the visual semantic vector corresponding to the image.
[0154] Based on the above introduction, the CLIP model's image encoder and text encoder can map images and search text into vectors of the same dimension and distribution, facilitating subsequent matching between visual semantic vectors and text semantic vectors.
[0155] S804: The multimodal understanding module 830 returns the visual semantic vector of the image to the gallery service module 810.
[0156] S805: The gallery service module 810 stores the visual semantic vectors of images.
[0157] After receiving the visual semantic vector of the image returned by the multimodal understanding module 830, the gallery service module 810 can store it. In some embodiments, if the gallery service module 810 is configured with a database, it can store the visual semantic vector of the image in the database.
[0158] S806: The gallery service module 810 sends the visual semantic vector and attribute information of the image to the search module 820.
[0159] In some embodiments, the gallery service module 810 can store the visual semantic vectors of the images returned by the multimodal understanding module 830, and then send the attribute information of the images and their visual semantic vectors to the search module 820 in batches.
[0160] S807: Search module 820 constructs the index corresponding to the image.
[0161] In some embodiments, the search module 820 can combine the visual semantic vector of an image with the image's attribute information to construct an index corresponding to the image. Based on the above example, the image index may include its visual semantic vector, as well as one or more of the following: acquisition time, acquisition location, person's name, category tag, or event, etc., which are not limited in this application.
[0162] For example, the search module 820 may include an index library, in which it stores the index corresponding to each image so that subsequent search matching can be performed based on the index library.
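One possible shape for such an index entry is sketched below: a record that combines an image's visual semantic vector with whichever attribute fields are available. The class `ImageIndex`, the function `build_index`, the field names, and the sample values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ImageIndex:
    image_id: str
    visual_vector: list
    attributes: dict = field(default_factory=dict)


def build_index(image_id, visual_vector, **attributes):
    """Combine a visual semantic vector with the attribute fields that are present."""
    present = {key: value for key, value in attributes.items() if value is not None}
    return ImageIndex(image_id, list(visual_vector), present)


# A toy index library holding two entries.
index_library = [
    build_index("img_001", [0.1, 0.9], location="city B", time="National Day", category="sky"),
    build_index("img_002", [0.8, 0.2], category="water glass", person=None),
]
```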
[0163] Next, the steps of the search stage are described in detail with reference to Figure 8.
[0164] S808: The gallery service module 810 receives the search text input by the user.
[0165] Users can enter search text in the search interface provided on their mobile phones. For example, interface 3 shown in Figure 1 includes a search box in which the user can enter search text such as "water glass" or "water glass on the table".
[0166] S809: The gallery service module 810 sends the search text to the search module 820.
[0167] S810: The search module 820 calls the multimodal understanding module 830 to perform text semantic understanding on the search text.
[0168] The search module 820 calls the multimodal understanding module 830 to perform text semantic understanding on the search text and obtain the text semantic vector of the search text.
[0169] Text semantic understanding refers to enabling mobile phones to understand the meaning expressed by text.
[0170] In some embodiments, the multimodal understanding module 830 provides the pre-trained CLIP model described in the above embodiments. With reference to the model structure of the text encoder shown in Figure 2 and the steps shown in Figure 3, the search text is first input into the word segmenter to obtain an ID sequence, and the ID sequence is then input into the text encoder of the CLIP model to obtain the text semantic vector corresponding to the search text.
[0171] S811: The multimodal understanding module 830 returns the text semantic vector corresponding to the search text to the search module 820.
[0172] S812: The search module 820 calls the natural language understanding module 840 to perform named entity recognition on the search text.
[0173] In some embodiments, the natural language understanding module 840 can perform named entity recognition on the search text based on a natural language understanding model to obtain the entities contained in the search text.
[0174] For example, entities can include: time, location, name, category label, etc. Based on the example "a water glass on a table" above, the entity included is "water glass".
[0175] S813: The natural language understanding module 840 returns the entities contained in the search text to the search module 820.
[0176] S814: The search module 820 performs recall in the index library based on the text semantic vector corresponding to the search text and the entities contained in the search text.
[0177] Text semantic vector recall (also known as vector recall) refers to the process of retrieving indexes from the index database that match the text semantic vectors corresponding to the search text.
[0178] In some embodiments, the vector similarity between the text semantic vector corresponding to the search text and the visual semantic vector included in each of the multiple indices in the index library can be calculated to obtain a similarity calculation result for each index. The K indices with the highest vector similarity are then selected as the vector recall results, where K is an integer greater than 0. For example, K can be a preset number of vector recall results, such as 5, 8, or 10.
[0179] Vector similarity refers to the degree of similarity between two vectors, and it can be calculated using various methods. For example, the degree of similarity can be determined by calculating the cosine similarity between two vectors, but other methods can also be used, and this application does not limit this to any particular method.
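As one possible illustration of vector recall with cosine similarity, the sketch below scores every index entry against the text semantic vector and keeps the K best. The entry layout (`image_id`, `visual_vector`), the function names, and the two-dimensional toy vectors are assumptions made for the example; real CLIP vectors have far more dimensions, and a production system would likely use an approximate nearest-neighbor index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def vector_recall(text_vector, index_library, k):
    """Return the k (similarity, image_id) pairs with the highest similarity."""
    scored = [
        (cosine_similarity(text_vector, entry["visual_vector"]), entry["image_id"])
        for entry in index_library
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical index library with toy visual semantic vectors.
index_library = [
    {"image_id": "img_a", "visual_vector": [1.0, 0.0]},
    {"image_id": "img_b", "visual_vector": [0.0, 1.0]},
    {"image_id": "img_c", "visual_vector": [1.0, 1.0]},
]
top = vector_recall([1.0, 0.1], index_library, k=2)
```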
[0180] Entity recall (also known as entity retrieval) refers to the process of retrieving, from the index library, the indexes that match the entities in the search text.
[0181] Based on the above embodiments, the index can include image attribute information, which includes entities. For example, the time, location, and category tags may all contain entities. Taking the search text "sky taken in city B on National Day" as an example, there are entities "city B" for location, "National Day" for time, and "sky" related to the content displayed in the image. These entities can then be matched among multiple indices to obtain the indices that match the entities in the search text as the entity retrieval results.
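Entity recall of this kind could be sketched as a set-containment check, as below. The sketch assumes that each index stores its attribute entities as a flat set of strings and that an index matches only when it contains every entity in the search text; this application does not state whether all entities or only some must match, and the entry layout and `entity_recall` name are hypothetical.

```python
def entity_recall(query_entities, index_library):
    """Return IDs of index entries whose attributes contain every query entity."""
    wanted = set(query_entities)
    if not wanted:
        return []  # assumption: no entities in the search text means no entity hits
    return [
        entry["image_id"]
        for entry in index_library
        if wanted <= entry["entities"]
    ]


# Hypothetical index library: each entry lists the entities found in its
# time, location, and category attributes.
index_library = [
    {"image_id": "img_1", "entities": {"city B", "National Day", "sky"}},
    {"image_id": "img_2", "entities": {"city B", "sky"}},
    {"image_id": "img_3", "entities": {"city A", "National Day", "sky"}},
]
hits = entity_recall(["city B", "National Day", "sky"], index_library)
```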
[0182] In some embodiments, the intersection or union of vector recall results and entity recall results can be sorted, and the sorted results can be used as search results.
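The combination step might look like the following sketch. The ranking rule used here (entity-matched results first, then descending vector similarity, with entity-only results given a similarity of 0.0) is purely an assumption for illustration, since this application only states that the intersection or union is sorted; the `merge_results` name and the input shapes are likewise hypothetical.

```python
def merge_results(vector_hits, entity_hits, mode="intersection"):
    """Combine recall results and return image IDs in ranked order.

    vector_hits: list of (similarity, image_id) pairs from vector recall.
    entity_hits: iterable of image IDs from entity recall.
    """
    scores = {image_id: score for score, image_id in vector_hits}
    entity_ids = set(entity_hits)
    if mode == "intersection":
        candidates = set(scores) & entity_ids
    elif mode == "union":
        candidates = set(scores) | entity_ids
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumed ranking: entity matches first, then higher similarity,
    # then image ID so that ties resolve deterministically.
    return sorted(
        candidates,
        key=lambda image_id: (
            image_id not in entity_ids,
            -scores.get(image_id, 0.0),
            image_id,
        ),
    )


vector_hits = [(0.9, "img_a"), (0.7, "img_c")]
entity_hits = ["img_c", "img_d"]
strict = merge_results(vector_hits, entity_hits, mode="intersection")
broad = merge_results(vector_hits, entity_hits, mode="union")
```

Using the intersection favors precision, while the union favors coverage when either recall path alone would miss a relevant image.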
[0183] S815: The search module 820 returns the search results to the gallery service module 810.
[0184] The search module 820 returns the sorted search results to the gallery service module 810.
[0185] S816: The gallery service module 810 displays the search results to the user.
[0186] Furthermore, it can be understood that in practical applications, the gallery application on a mobile phone typically stores both images and videos. Therefore, when a search is performed through the search interface provided in the gallery application, image search results and video search results can be displayed together. In other words, the index library includes not only indexes for images but also indexes for videos.
[0187] In some embodiments, a video can have multiple indexes. That is, a video is divided into multiple video segments, a video frame is selected from each video segment, its corresponding visual semantic vector is determined by the CLIP model, and the visual semantic vector is used to construct the index corresponding to the video segment. Then the video search results can be the video segments.
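A minimal sketch of building per-segment video indexes is given below. It assumes fixed-length segments and picks the frame at the middle of each segment; this application does not specify how segments are cut or which frame is selected. The `encode_frame` callable is a hypothetical placeholder for running the CLIP image encoder on the selected frame.

```python
def build_video_index(video_id, duration_s, segment_s, encode_frame):
    """Split a video into fixed-length segments and index one frame per segment."""
    if segment_s <= 0:
        raise ValueError("segment_s must be positive")
    index = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        frame_time = (start + end) / 2.0  # assumption: use the middle frame
        index.append({
            "video_id": video_id,
            "segment": (start, end),
            "frame_time": frame_time,
            "visual_vector": encode_frame(video_id, frame_time),
        })
        start = end
    return index


def fake_encoder(video_id, frame_time):
    """Stand-in for the image encoder; returns a toy vector."""
    return [frame_time, 1.0]


entries = build_video_index(
    "clip_1", duration_s=25.0, segment_s=10.0, encode_frame=fake_encoder
)
```

Each entry can then be searched like an image index, and a hit identifies the matching video segment by its start and end times.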
[0188] In some embodiments, the index corresponding to a video segment may also include attribute information of the video segment.
[0189] It should be noted that locating the gallery service module 810, search module 820, multimodal understanding module 830, and natural language understanding module 840 on the mobile phone is merely an example. These modules can also reside on a cloud server, in which case the cloud server uses the interaction of these four modules to implement the steps of the index building phase, and the steps of the search phase can be implemented through interaction between the mobile phone and the cloud server. This application does not limit this.
[0190] For example, the mobile phone can send the search text entered by the user to the gallery service module of the cloud server, so that the gallery service module of the cloud server can interact with other modules to implement the steps included in the search stage. The gallery service module of the cloud server then sends the search results to the mobile phone so that the mobile phone can display the search results to the user.
[0191] It should be noted that the above scenarios are merely examples, and the image-text matching method provided in this application embodiment can also be applied to other scenarios.
[0192] In some embodiments, the image-text matching method provided in this application can also implement the task of matching text based on images, that is, the image-to-text matching task mentioned above.
[0193] In one possible implementation, the steps for an electronic device to perform the above-mentioned image-to-text matching task may also include: an index building phase and a search phase.
[0194] In the index building stage, the electronic device can first acquire multiple texts; then, through the text encoder in the CLIP model trained in this application embodiment, perform text semantic understanding on the multiple texts to generate text semantic vectors corresponding to the multiple texts respectively; subsequently, the electronic device constructs the text semantic vectors into the index corresponding to the text and stores them in the text index library.
[0195] For example, the aforementioned texts can be stored in local files on the mobile phone or on a cloud server; this application does not limit this.
[0196] During the search phase, the electronic device can acquire the image to be searched and call the image encoder in the trained CLIP model to perform visual semantic understanding of the image and generate a visual semantic vector corresponding to the image. Then, the electronic device performs vector retrieval in the text index based on the visual semantic vector and obtains the text corresponding to at least one text semantic vector that matches the visual semantic vector as the search result. Finally, the electronic device can display the search result to the user.
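The two phases of image-to-text matching could be organized as in the sketch below: texts are encoded once and stored at index building time, and an image's visual semantic vector is compared against them at search time. The `TextIndexLibrary` class, its method names, and the toy vectors are hypothetical; `encode_text` stands in for the trained CLIP text encoder, and comparing unit-length vectors by dot product is one assumed choice of similarity.

```python
import math


def normalize(vector):
    """Scale a vector to unit length so that dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


class TextIndexLibrary:
    def __init__(self, encode_text):
        self._encode_text = encode_text
        self._entries = []  # list of (text, unit-length text semantic vector)

    def add(self, text):
        """Index building phase: encode the text and store its vector."""
        self._entries.append((text, normalize(self._encode_text(text))))

    def search(self, visual_vector, k=1):
        """Search phase: return the k texts closest to the visual semantic vector."""
        query = normalize(visual_vector)
        scored = [
            (sum(q * t for q, t in zip(query, vector)), text)
            for text, vector in self._entries
        ]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [text for _, text in scored[:k]]


# Toy text encoder: a lookup table in place of the real model.
TOY_TEXT_VECTORS = {"a water glass": [1.0, 0.0], "a blue sky": [0.0, 1.0]}
library = TextIndexLibrary(encode_text=TOY_TEXT_VECTORS.__getitem__)
for text in TOY_TEXT_VECTORS:
    library.add(text)

best = library.search([0.2, 0.9], k=1)
```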
[0197] For example, obtaining the image to be matched can be achieved by the user dragging and dropping the image into the image search box, or by the gallery application receiving the user's selection of images stored in the gallery application. This application does not limit this to any particular method.
[0198] In some embodiments, the electronic device may intelligently caption images stored in a gallery application. After a user selects an image and triggers the intelligent captioning function of the gallery application, the electronic device can perform visual semantic understanding on the image using an image encoder based on the CLIP model, generate its corresponding visual semantic vector, and then determine the text semantic vector that matches the visual semantic vector from multiple pre-stored text semantic vectors. The text corresponding to the matched text semantic vector is then presented to the user as the intelligent captioning result.
[0199] For example, the text semantic vectors corresponding to the multiple texts pre-stored above may be pre-stored on a cloud server or pre-stored in local files on the mobile phone, and this application does not limit them.
[0200] It should be noted that this application does not limit the type of electronic device. For example, electronic devices can be mobile phones, tablets, desktops, laptops, notebook computers, ultra-mobile personal computers (UMPCs), handheld computers, netbooks, personal digital assistants (PDAs), wearable electronic devices, smartwatches, etc. Electronic devices can also be servers, such as independent physical servers, server clusters or distributed systems composed of multiple physical servers, and cloud servers, etc. This application does not impose any special restrictions on the specific form of the aforementioned electronic devices.
[0201] This application also provides a computer-readable storage medium storing a computer program, which, when executed by a computer, can implement one or more steps of any of the above-described image-text matching methods.
[0202] The computer-readable storage medium can be a non-transitory computer-readable storage medium, such as read-only memory (ROM), random access memory (RAM), CD-ROM, magnetic tape, a floppy disk, or an optical data storage device.
[0203] Another embodiment of this application provides a computer program product containing instructions. When this computer program product is executed by a computer, it can implement one or more steps of any of the above-described image-text matching methods.
[0204] The electronic device, computer-readable storage medium, and computer program product provided in this embodiment are all used to execute the corresponding image-text matching method provided above. Therefore, for the beneficial effects they can achieve, refer to the beneficial effects of the corresponding image-text matching method provided above, which will not be repeated here.
[0205] The terms "first," "second," and "third," etc., used in this application specification, claims, and drawings are used to distinguish different objects, not to limit a specific order.
[0206] In the embodiments of this application, the terms "exemplary" or "for example" are used to indicate that something is an example, illustration, or description. Any embodiment or design that is described as "exemplary" or "for example" in the embodiments of this application should not be construed as being more preferred or advantageous than other embodiments or designs. Specifically, the use of the terms "exemplary" or "for example" is intended to present the relevant concepts in a specific manner.
[0207] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A method for matching images and text, characterized in that the method includes: acquiring text; segmenting the text to obtain a segmentation result, wherein the segmentation result includes single characters and phrases, and a phrase includes at least two single characters; determining the feature vector corresponding to the single character through the first embedding module of a pre-trained text encoder, and determining the feature vector corresponding to the phrase through the second embedding module of the pre-trained text encoder, wherein the first embedding module and the second embedding module are trained separately at different training stages, and the pre-trained text encoder belongs to a pre-trained CLIP model; encoding the feature vectors corresponding to the single characters and the feature vectors corresponding to the phrases to determine the text semantic vector corresponding to the text; and performing image-text matching based on the text semantic vector; wherein the encoding of the feature vectors corresponding to the single characters and the feature vectors corresponding to the phrases to determine the text semantic vector corresponding to the text includes: merging the feature vectors corresponding to the single characters and the feature vectors corresponding to the phrases to obtain the feature vector corresponding to the text; processing the feature vector corresponding to the text through the transformer layer of the pre-trained text encoder to generate the text encoding vector corresponding to the search text; and transforming, through the mapping layer of the pre-trained text encoder, the text encoding vector and the visual semantic vector output by the image encoder to the same dimension and distribution, thereby obtaining the text semantic vector corresponding to the text; and wherein the image-text matching based on the text semantic vector includes: performing image-text matching based on the similarity between the visual semantic vector and the text semantic vector corresponding to the text.
2. The method according to claim 1, characterized in that the pre-trained CLIP model is trained through the following steps: obtaining a first image training sample and a first text training sample corresponding to the first image training sample; and, based on the first text training sample, the first image training sample, and a first preset loss function, adjusting the first model parameters of the second embedding module included in the CLIP model to be fine-tuned, while keeping the model parameters of the CLIP model to be fine-tuned other than the first model parameters unchanged, to obtain the pre-trained CLIP model through training; wherein the CLIP model to be fine-tuned includes the first embedding module and the second embedding module.
3. The method according to claim 2, characterized in that the CLIP model to be fine-tuned, which includes the first embedding module and the second embedding module, is obtained through the following steps: obtaining a second image training sample and a second text training sample corresponding to the second image training sample; based on the second text training sample, the second image training sample, and a second preset loss function, adjusting the model parameters of the CLIP model to be pre-trained, wherein the CLIP model to be pre-trained includes a third embedding module; and, if the CLIP model after parameter adjustment meets the pre-training cutoff condition, splitting the third embedding module included in the CLIP model after parameter adjustment to obtain the first embedding module and the second embedding module included in the CLIP model to be fine-tuned; wherein the pre-training cutoff condition includes the number of training iterations reaching a preset number during model training, and/or the loss value of the loss function during model training being less than a loss value threshold.
4. The method according to claim 1, characterized in that the method further includes: adding multiple phrases to the vocabulary of the word segmenter to obtain an expanded vocabulary, wherein the vocabulary includes multiple single characters; and the segmenting of the text to obtain the segmentation result includes: segmenting the text using the word segmenter, based on the expanded vocabulary, to obtain the segmentation result.
5. The method according to claim 4, characterized in that the expanded vocabulary also includes a first mapping relationship in which the multiple single characters correspond one-to-one to multiple identifiers, and a second mapping relationship in which the multiple phrases correspond one-to-one to multiple identifiers; the method further includes: mapping the segmentation result based on the first mapping relationship and the second mapping relationship to obtain the identifier sequence corresponding to the segmentation result; and the determining of the feature vector corresponding to the single character through the first embedding module of the pre-trained text encoder, and of the feature vector corresponding to the phrase through the second embedding module of the pre-trained text encoder, includes: based on the identifier sequence, determining the feature vector corresponding to the single character through the first embedding module of the pre-trained text encoder, and determining the feature vector corresponding to the phrase through the second embedding module of the pre-trained text encoder.
6. The method according to claim 5, characterized in that the determining, based on the identifier sequence, of the feature vector corresponding to the single character through the first embedding module of the pre-trained text encoder, and of the feature vector corresponding to the phrase through the second embedding module of the pre-trained text encoder, includes: splitting the identifier sequence to obtain a first identifier sequence representing the single character and a second identifier sequence representing the phrase; and extracting features from the first identifier sequence through the first embedding module of the pre-trained text encoder to obtain the feature vector corresponding to the single character, and extracting features from the second identifier sequence through the second embedding module of the pre-trained text encoder to obtain the feature vector corresponding to the phrase; and the encoding of the feature vectors corresponding to the single characters and the feature vectors corresponding to the phrases to determine the text semantic vector corresponding to the text includes: merging, based on the position of the single character and the position of the phrase in the text, the feature vector corresponding to the single character and the feature vector corresponding to the phrase to obtain the feature vector corresponding to the text; and encoding the feature vector corresponding to the text to obtain the text semantic vector corresponding to the text.
7. The method according to any one of claims 1-6, characterized in that the image-text matching based on the text semantic vector includes: determining, based on the text semantic vector, an image matching the text from a plurality of pre-stored images.
8. The method according to claim 7, characterized in that the pre-stored plurality of images includes visual semantic vectors corresponding to each of the plurality of images; and the determining of the image matching the text from the pre-stored plurality of images based on the text semantic vector includes: calculating the similarity between each of the multiple visual semantic vectors and the text semantic vector to obtain similarity calculation results for the multiple images respectively; sorting the multiple similarity calculation results in descending order; and determining, from the plurality of images, the K images with the highest similarity calculation results as the images that match the text, where K is a positive integer and K is less than or equal to the number of the plurality of images.
9. The method according to claim 8, characterized in that the visual semantic vectors corresponding to the multiple images are obtained through the image encoder of the pre-trained CLIP model.
10. The method according to any one of claims 1-6, characterized in that the text includes multiple texts; before the image-text matching is performed based on the text semantic vector, the method further includes: acquiring an image; and determining the visual semantic vector of the image; and the steps of acquiring the text, segmenting the text to obtain the segmentation result, determining the feature vector corresponding to the single character through the first embedding module of the pre-trained text encoder and the feature vector corresponding to the phrase through the second embedding module of the pre-trained text encoder, encoding the feature vectors corresponding to the single characters and the feature vectors corresponding to the phrases to determine the text semantic vector corresponding to the text, and performing image-text matching based on the text semantic vector include: acquiring the multiple texts; segmenting the multiple texts to obtain the segmentation result corresponding to each of the multiple texts; for the segmentation result corresponding to each text, determining the feature vector corresponding to the single character through the first embedding module of the pre-trained text encoder, and determining the feature vector corresponding to the phrase through the second embedding module of the pre-trained text encoder; for the segmentation result corresponding to each text, encoding the feature vector corresponding to the single character and the feature vector corresponding to the phrase to determine the text semantic vector corresponding to the text; and, based on the visual semantic vector and the multiple text semantic vectors, determining the text that matches the image from the multiple texts.
11. The method according to any one of claims 1-6, 8, and 9, characterized in that the acquiring of the text includes: acquiring the text sent by a terminal; and, after the image-text matching is performed based on the text semantic vector, the method further includes: sending the image-text matching result to the terminal.
12. An electronic device, characterized in that it includes a memory and a processor; the memory is coupled to the processor and is used to store computer program code, the computer program code including computer instructions, and one or more of the processors call the computer instructions to cause the electronic device to perform the image-text matching method as described in any one of claims 1-11.
13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the image-text matching method as described in any one of claims 1-11.