Systems and methods for product search by embedding visual representations into text sequences
By extracting features from product images and approximating them as text features, and combining convolutional neural networks and bidirectional encoder representation models, the problem of inaccurate product search in e-commerce platforms is solved, improving the accuracy of search results and customer experience.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING WODONG TIANJUN INFORMATION TECH CO LTD
- Filing Date
- 2022-11-30
- Publication Date
- 2026-06-12
AI Technical Summary
Existing e-commerce platforms lack the ability to utilize product image information in product searches, resulting in inaccurate search results, negatively impacting customer experience, and causing ranking algorithm bias.
By extracting features from product images and approximating them as text features, an end-to-end relevance model is designed. This model combines product text information for searching and uses a convolutional neural network and a pre-trained bidirectional encoder representation model to embed and match image and text features.
It improves the accuracy of product search, enhances the customer experience, reduces noise behavior caused by mismatched results, and improves the accuracy of the search algorithm.
Figure CN115757926B_ABST
Abstract
Description
[0001] Citation of relevant applications
[0002] References are cited and discussed in the description of this disclosure, which may include patents, patent applications, and various publications. The citation and / or discussion of such references is provided solely to clarify the description of this disclosure and is not an admission that any such reference is “prior art” to the disclosure described herein. All references cited and discussed in the specification are incorporated herein by reference in their entirety, to the same extent as if each reference were individually incorporated by reference.
Technical Field
[0003] This disclosure generally relates to e-commerce, and more specifically, to systems and methods for extending product search engines in e-commerce by embedding visual representations into text sequences.
Background Technology
[0004] The background description provided herein is intended to provide a general overview of the context of this disclosure. The inventors' work, to the extent it is described in this background section, and aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither explicitly nor implicitly acknowledged as prior art to this disclosure.
[0005] In e-commerce platforms, customers frequently search product databases to find products that match their search queries. The lack of product search functionality on e-commerce platforms is detrimental to customers and diminishes their shopping experience. When search functionality is available, the accuracy of search results is crucial; mismatched results not only worsen the customer experience but also lead to biased ranking algorithms and noisy behavioral feedback in search logs, such as clicks or purchases. Existing search engines can utilize product names, descriptions, and user profile information to retrieve products that match queries; however, searches based on this information often result in inaccurate search results.
[0006] Therefore, there is a need in this field to address the aforementioned defects and shortcomings.
Summary of the Invention
[0007] Product images, carrying more descriptive information, are a key factor driving e-commerce conversion. The main product images, carefully designed, selected, and uploaded by sellers, contain far more information than we might imagine. For example, a fabric image can easily tell people its color, texture, and style, which is more useful than thousands of descriptive words. Therefore, in some respects, this disclosure uses product images to retrieve products that match a query, in addition to product titles, product descriptions, and user profile information. In some embodiments, this disclosure (1) extracts image features from product images and approximates these image features as text features; and (2) designs and implements an end-to-end relevance model between query information and product information. The relevance model can accept dynamic input from text, images, or a combination of text and images.
[0008] In some aspects, this disclosure relates to a computer-implemented method for searching for products corresponding to queries from customers. In some embodiments, the method includes:
[0009] Embedding the query to obtain a query embedding;
[0010] Retrieving product information including product text and a product image;
[0011] Embedding the product text to obtain a product text embedding, embedding the product image to obtain a product image embedding, and combining the product text embedding and the product image embedding to obtain a product embedding, wherein the product image embedding has the same format as the product text embedding;
[0012] Providing the query embedding and the product embedding to a transformer (referred to in this description as the converter) to determine whether the query and the product are relevant; and
[0013] When the query is relevant to the product, providing the product as the search result.
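The steps in [0009] to [0013] can be read as a simple loop over candidate products. The following is a minimal sketch only: `search_products` and the three `toy_*` helpers are hypothetical names, and the toy "embeddings" are plain token lists standing in for the vector embeddings and converter-based relevance decision described in this disclosure.

```python
from typing import Any, Callable, Iterable, List


def search_products(
    query: Any,
    products: Iterable[Any],
    embed_query: Callable[[Any], Any],
    embed_product: Callable[[Any], Any],
    is_relevant: Callable[[Any, Any], bool],
) -> List[Any]:
    """Return the products judged relevant to the query."""
    query_embedding = embed_query(query)                      # [0009]
    results = []
    for product in products:                                  # [0010]
        product_embedding = embed_product(product)            # [0011]
        if is_relevant(query_embedding, product_embedding):   # [0012]
            results.append(product)                           # [0013]
    return results


# Toy stand-ins (hypothetical): an "embedding" here is just a list of tokens.
def toy_embed_query(query: str) -> List[str]:
    return query.lower().split()


def toy_embed_product(product: dict) -> List[str]:
    text_part = product["title"].lower().split()
    image_part = list(product["image_tokens"])  # same format as the text part
    return text_part + image_part               # combined product embedding


def toy_is_relevant(query_emb: List[str], product_emb: List[str]) -> bool:
    # Placeholder for the converter-based relevance decision.
    return all(token in product_emb for token in query_emb)
```

With these stand-ins, a query such as "red dress" can match a product titled "Cotton dress" only because "red" is supplied by the image side, which is the motivation given above for combining image and text information.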
[0014] In some embodiments, both the query embedding and the product embedding are in vector form. In some embodiments, the query is text, and the product image embedding has the same format as the query embedding.
[0015] In some embodiments, the converter includes a query converter for processing query embeddings and a product converter for processing product embeddings. In some embodiments, both the query embedding and the product embedding are provided to the same converter. In some embodiments, one or more converters update the query embedding and the product embedding, and determine the relevance between the query and the product based on the updated query embedding and product embedding.
[0016] In some embodiments, the query includes query text and query image, and the query embedding includes query text embedding corresponding to the query text and query image embedding corresponding to the query image. In some embodiments, the query includes only query text and does not include query image.
[0017] In some embodiments, embedding a product image includes: normalizing the product image to obtain a normalized product image; segmenting the normalized product image into multiple grids; concatenating the multiple grids into a grid sequence; and extracting product image features from the grid sequence to obtain a grid feature element for each grid in the grid sequence. In some embodiments, normalization includes converting pixel intensity values to a range of -1 to 1. In some embodiments, the normalized product image is segmented into a 4×4 grid. In some embodiments, concatenation is performed by placing the grids of the first row in sequence, followed by the grids of the next row in sequence. In some embodiments, a convolutional neural network (CNN) is used for product image feature extraction.
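The normalization, 4×4 segmentation, and row-by-row concatenation described in [0017] can be sketched as follows. This is an illustration under stated assumptions, not the disclosed implementation: the input is assumed to be a single-channel image with 8-bit intensities (0 to 255), the image size is assumed to be divisible by the grid shape, and the function names `normalize` and `split_into_grids` are hypothetical. The CNN feature extraction step is not shown.

```python
from typing import List

Image = List[List[float]]  # rows of pixel intensities (single channel)


def normalize(image: Image) -> Image:
    """Map 8-bit intensities in [0, 255] to the range [-1, 1] (assumed input range)."""
    return [[px / 127.5 - 1.0 for px in row] for row in image]


def split_into_grids(image: Image, rows: int = 4, cols: int = 4) -> List[Image]:
    """Cut the image into rows x cols grids and return them in row-major order."""
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid shape")
    grid_h, grid_w = height // rows, width // cols
    sequence = []
    for r in range(rows):        # first row of grids first
        for c in range(cols):    # left to right within a row
            block = image[r * grid_h:(r + 1) * grid_h]
            sequence.append([line[c * grid_w:(c + 1) * grid_w] for line in block])
    return sequence
```

Each element of the returned sequence would then be passed to a feature extractor such as a CNN to obtain one grid feature element per grid.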
[0018] In some embodiments, embedding a product image further includes: adding a position vector to each grid feature element, the position vector representing the grid's position in the grid sequence; adding a fragment vector to each grid feature element, the fragment vector representing an identifier of the product image; adding a mask vector to each grid feature element, wherein each mask vector has a value of 0 or 1, and when a mask vector has a value of 0, the values of the corresponding grid feature element are set to 0; and defining a category identifier, the category identifier representing the product's category on the e-commerce platform. The product embedding includes the grid feature elements, position vectors, fragment vectors, mask vectors, and category identifier. In some embodiments, the product category is retrieved directly from product information stored in a product database. In some embodiments, the product category can also be inferred from the product image.
[0019] In some embodiments, approximately 10–15% of the mask vectors have a value of 0. In some embodiments, the mask vectors with a value of 0 are randomly defined.
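As a rough illustration of [0018] and [0019], the sketch below attaches position, fragment, and mask values to a sequence of grid feature elements and zeroes the features whose mask value is 0. The function names, the dictionary layout, and the default masking fraction of 0.125 (chosen from within the stated 10–15% range) are assumptions made for this example only.

```python
import random
from typing import Dict, List, Optional


def random_mask(n: int, zero_fraction: float = 0.125,
                seed: Optional[int] = None) -> List[int]:
    """Return n mask values (0 or 1) with randomly placed zeros."""
    rng = random.Random(seed)
    n_zero = round(n * zero_fraction)
    zeroed = set(rng.sample(range(n), n_zero))
    return [0 if i in zeroed else 1 for i in range(n)]


def embed_image_grids(grid_features: List[List[float]], image_id: int,
                      category_id: int, zero_fraction: float = 0.125,
                      seed: Optional[int] = None) -> Dict[str, object]:
    """Combine grid features with position, fragment, mask, and category data."""
    n = len(grid_features)
    mask = random_mask(n, zero_fraction, seed)
    # A mask value of 0 converts the corresponding grid feature element to zeros.
    masked = [list(f) if m == 1 else [0.0] * len(f)
              for f, m in zip(grid_features, mask)]
    return {
        "category": category_id,        # product category identifier
        "features": masked,             # grid feature elements after masking
        "positions": list(range(n)),    # position of each grid in the sequence
        "fragments": [image_id] * n,    # identifier of the product image
        "mask": mask,
    }
```

For a 4×4 grid (16 elements), a fraction of 0.125 masks exactly two grids per image.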
[0020] In some embodiments, product image features are extracted from the grid sequence by applying a CNN to each grid, and the converter is a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model.
[0021] In some embodiments, embedding product text includes acquiring feature elements of the product text and adding a position vector, fragment vector, and mask vector to the acquired feature elements. In some embodiments, the query includes query text, and embedding the query text includes acquiring feature elements of the query text and adding a position vector, fragment vector, and mask vector to the acquired feature elements. In some embodiments, word2vec, GloVe, or fastText is used to acquire feature elements of the product text or query text.
[0022] In some embodiments, for each query, the method is executed against all products in the product database or product categories in the product database, so that the client can receive a list of products as the query result.
[0023] In some embodiments, the method further includes sending the query results to a client's terminal and displaying the query results on the terminal. The terminal may be, for example, the screen of a computer or smartphone.
[0024] In some aspects, this disclosure relates to a system for searching for products corresponding to a query provided by a customer. In some embodiments, the system includes a computing device having a processor and a storage device storing computer-executable code. The computer-executable code, when executed at the processor, is configured to:
[0025] Embed the query to obtain a query embedding;
[0026] Retrieve product information including product text and a product image;
[0027] Embed the product text to obtain a product text embedding, embed the product image to obtain a product image embedding, and combine the product text embedding and the product image embedding to obtain a product embedding, wherein the product image embedding has the same format as the product text embedding;
[0028] Provide the query embedding and the product embedding to a transformer (referred to in this description as the converter) to determine whether the query and the product are relevant; and
[0029] When the query is relevant to the product, provide the product as the search result.
[0030] In some embodiments, both the query embedding and the product embedding are in vector form. In some embodiments, the query is text, and the product image embedding has the same format as the query embedding.
[0031] In some embodiments, the converter includes a query converter for processing query embeddings and a product converter for processing product embeddings. In some embodiments, the query embedding and product embedding are provided to the same converter. In some embodiments, one or more converters update the query embedding and product embedding, and determine the relevance between the query and the product based on the updated query embedding and product embedding.
[0032] In some embodiments, the query includes query text and query image, and the query embedding includes a query text embedding corresponding to the query text and a query image embedding corresponding to the query image. In some embodiments, the query includes only query text and does not include a query image.
[0033] In some embodiments, the computer-executable code is configured to embed a product image by: normalizing the product image to obtain a normalized product image; segmenting the normalized product image into a plurality of grids; concatenating the plurality of grids into a grid sequence; and extracting product image features from the grid sequence to obtain a grid feature element for each grid in the grid sequence. In some embodiments, normalization includes converting pixel intensity values to a range of -1 to 1. In some embodiments, the normalized product image is segmented into a 4×4 grid. In some embodiments, concatenation is performed by placing the grids of the first row in sequence, followed by the grids of the next row in sequence. In some embodiments, a convolutional neural network (CNN) is used for product image feature extraction.
[0034] In some embodiments, the computer-executable code is further configured to embed a product image by: adding a position vector to each grid feature element, the position vector representing the grid's position in the grid sequence; adding a fragment vector to each grid feature element, the fragment vector representing an identifier of the product image; adding a mask vector to each grid feature element, wherein each mask vector has a value of 0 or 1, and when a mask vector has a value of 0, the values of the corresponding grid feature element are set to 0; and defining a category identifier, the category identifier representing the product's category on the e-commerce platform. The product embedding includes the grid feature elements, position vectors, fragment vectors, mask vectors, and category identifier.
[0035] In some embodiments, approximately 10–15% of the mask vectors have a value of 0. In some embodiments, the mask vectors with a value of 0 are randomly defined.
[0036] In some embodiments, the computer-executable code is also configured to extract product image features from the grid sequence by applying a CNN to each grid, and the converter is a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model.
[0037] In some embodiments, the computer-executable code is configured to embed product text by acquiring feature elements of the product text and adding a position vector, fragment vector, and mask vector to the acquired feature elements. In some embodiments, the query includes query text, and the computer-executable code is configured to embed query text by acquiring feature elements of the query text and adding a position vector, fragment vector, and mask vector to the acquired feature elements. In some embodiments, word2vec, GloVe, or fastText is used to acquire feature elements of the product text or query text.
[0038] In some embodiments, for each query, the computer-executable code is configured to process all products in the product database or product categories in the product database, so that the customer can receive a list of products as the query result.
[0039] In some embodiments, the computer-executable code is also configured to send the query results to a client's terminal and display the query results on the terminal. The terminal may be, for example, the screen of a computer or a smartphone.
[0040] In some respects, this disclosure relates to a non-transitory computer-readable medium storing computer-executable code. The computer-executable code, when executed at a processor of a computing device, is configured to perform the methods described above.
[0041] These and other aspects of this disclosure will become apparent from the following description of preferred embodiments, taken in conjunction with the accompanying drawings and their captions, although variations and modifications therein may be made without departing from the spirit and scope of the novel concepts of this disclosure.
Description of the Drawings
[0042] The accompanying drawings illustrate one or more embodiments of this disclosure and, together with the written description, serve to explain the principles of this disclosure. Where possible, the same reference numerals are used throughout the drawings to refer to the same or similar elements of the embodiments.
[0043] Figure 1 schematically depicts a product search system according to certain embodiments of the present disclosure.
[0044] Figure 2 schematically depicts an example of query text embedding according to certain embodiments of this disclosure.
[0045] Figure 3 schematically depicts a product image feature module according to certain embodiments of the present disclosure.
[0046] Figure 4 schematically depicts the segmentation of a product image and the concatenation of the segmented grids according to certain embodiments of the present disclosure.
[0047] Figure 5 schematically depicts a product image embedding module according to certain embodiments of the present disclosure.
[0048] Figure 6 schematically depicts global features of an image according to certain embodiments of the present disclosure.
[0049] Figure 7 schematically depicts product image embeddings according to certain embodiments of this disclosure.
[0050] Figure 8 schematically depicts a product search system according to certain embodiments of the present disclosure.
[0051] Figure 9 schematically depicts an improvement to the product search system according to certain embodiments of the present disclosure.
[0052] Figure 10 schematically depicts a product search method according to certain embodiments of the present disclosure.
[0053] Figure 11 schematically depicts a product image extraction method according to certain embodiments of the present disclosure.
[0054] Figure 12 schematically depicts a product image embedding method according to certain embodiments of the present disclosure.
Detailed Implementation
[0055] The present disclosure is described in more detail in the following examples, which are intended to be illustrative only, as many modifications and variations therein will be apparent to those skilled in the art. Various embodiments of the present disclosure are now described in detail. Referring to the accompanying drawings, like numerals indicate like parts throughout the views. Unless the context clearly specifies otherwise, the terms “a,” “an,” and “the” as used herein and throughout the claims include plural reference. Furthermore, as used in the description and claims of this disclosure, unless the context clearly specifies otherwise, the meaning of “in” includes “in” and “on”. As stated herein, “a plurality” means two or more. As stated herein, the terms “comprising,” “including,” “carrying,” “having,” “containing,” “involving,” etc., should be understood as open-ended, meaning including but not limited to.
[0056] The terms used in this specification generally have their ordinary meanings in the art, in the context of this disclosure, and in the specific context in which each term is used. Certain terms used to describe this disclosure are discussed below or elsewhere in the specification to provide practitioners with additional guidance regarding the description of this disclosure. It will be understood that the same thing can be expressed in more than one way. Therefore, alternative language and synonyms may be used for any one or more terms discussed herein, and have no particular significance in whether a term is elaborated or discussed herein. The use of one or more synonyms does not preclude the use of other synonyms. The use of examples anywhere in this specification, including examples of any terms discussed herein, is merely illustrative and in no way limits the scope and meaning of this disclosure or any exemplary terms. Similarly, this disclosure is not limited to the various embodiments given in this specification.
[0057] As described herein, the phrase "at least one of A, B, and C" should be interpreted to mean a logical (A or B or C), using a non-exclusive logical OR. It should be understood that one or more steps within the method may be performed in different orders (or simultaneously) without altering the principles of this disclosure. As described herein, the term "and / or" includes any and all combinations of one or more of the related listed items.
[0058] As described herein, the term "module" can refer to or include: application-specific integrated circuits (ASICs); electronic circuitry; combinational logic circuitry; field-programmable gate arrays (FPGAs); processors (shared, dedicated, or grouped) that execute code; other suitable hardware components that provide the described functionality; or some or all of the above, such as in a system-on-a-chip. The term "module" can include memory (shared, dedicated, or grouped) that stores code executed by a processor.
[0059] As described herein, the term "code" can include software, firmware, and / or microcode, and can refer to programs, routines, functions, classes, and / or objects. The term "shared" as used above means that some or all of the code from multiple modules can be executed using a single (shared) processor. Furthermore, some or all of the code from multiple modules can be stored in a single (shared) memory. The term "group" as used above means that some or all of the code from a single module can be executed using a group of processors. Furthermore, a group of memories can be used to store some or all of the code from a single module.
[0060] As described herein, the term "interface" generally refers to a communication tool or device used at the interaction point between components to perform data communication between components. Generally, interfaces can be applied at both the hardware and software levels, and can be unidirectional or bidirectional. Examples of physical hardware interfaces can include electrical connectors, buses, ports, cables, terminals, and other I / O devices or components. Components communicating with the interface can be, for example, multiple components of a computer system or peripheral devices.
[0061] This disclosure relates to computer systems. As illustrated in the accompanying drawings, computer components may include physical hardware components, shown as solid line blocks, and virtual software components, shown as dashed line blocks. Those skilled in the art will understand that, unless otherwise stated, these computer components may be implemented as software, firmware, or hardware components or combinations thereof, but are not limited to these forms. The apparatuses, systems, and methods described herein may be implemented by one or more computer programs executed by one or more processors. The computer program includes processor-executable instructions stored on a non-transitory tangible computer-readable medium. The computer program may also include stored data. Non-limiting examples of non-transitory tangible computer-readable media are non-volatile memory, magnetic storage, and optical storage.
[0062] This disclosure will now be described more fully below with reference to the accompanying drawings, in which embodiments of the disclosure are illustrated. However, this disclosure may be embodied in many different forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete and will fully convey the scope of the disclosure to those skilled in the art.
[0063] In some aspects, this disclosure provides a multimodal visual language model. In some embodiments, a concatenated sequence of text features and image features is input into a machine learning model, such as a transformer encoder model. This approach can achieve good results in fields such as visual question answering and text-image matching. Regions of interest (RoIs) in an image function similarly to "word tokens" in language, representing "instance-level" information in the image. A series of RoIs in an image is treated as a "sentence" in language. Once image features are extracted and treated as language features, they are concatenated with text features and input into a language model. However, this feature extraction method has three limitations: 1) RoIs tend to provide "instance-level" rather than fine-grained information. Product attribute descriptions may not have matching image regions. 2) There is too much overlap between different RoIs. These RoIs give similar features and contribute little to the modeling. 3) Without knowing the objects in the image, RoIs may act as noise and may render feature elements invalid.
[0064] In some aspects, this disclosure provides an improved multimodal model. Figure 1 schematically depicts a product search system according to certain embodiments of this disclosure. As shown in Figure 1, system 100 includes computing device 110. In some embodiments, computing device 110 may be a server computer, cluster, cloud computer, general-purpose computer, or special-purpose computer that provides product search services. In some embodiments, computing device 110 may communicate with other computing devices or services to obtain product information and order products. Product information may include a product title, description, main image, and optional other images. In some embodiments, communication is conducted over a network, which may be a wired or wireless network and may take various forms, such as a public or private network, or via non-transitory computer media, including but not limited to flash drives, USB drives, hard disk drives, floppy disks, SD cards, optical drives, or any other portable data storage media.
[0065] As shown in Figure 1, computing device 110 may include, but is not limited to, processor 112, memory 114, and storage device 116. In some embodiments, computing device 110 may include other hardware and software components (not shown) to perform their respective tasks. Examples of such hardware and software components may include, but are not limited to, other required memories, interfaces, buses, input / output (I / O) modules or devices, network interfaces, and peripheral devices. Processor 112 may be a central processing unit (CPU) configured to control the operation of computing device 110. Processor 112 may execute an operating system (OS) or other applications of computing device 110. In some embodiments, computing device 110 may have multiple CPUs as processors, such as two CPUs, four CPUs, eight CPUs, or any suitable number of CPUs. Memory 114 may be volatile memory, such as random access memory (RAM), used to store data and information during the operation of computing device 110. In some embodiments, memory 114 may be an array of volatile memory. In some embodiments, computing device 110 may run on multiple memories 114. Storage device 116 is a non-volatile data storage medium used to store the operating system (not shown) and other applications of computing device 110. Examples of storage device 116 may include non-volatile memory such as flash memory, memory cards, USB drives, hard disk drives, floppy disks, optical drives, solid-state drives (SSDs), or any other type of data storage device. In some embodiments, storage device 116 may be local storage, remote storage, or cloud storage. In some embodiments, computing device 110 may have multiple storage devices 116, which may be the same storage device or different types of storage devices, and applications of computing device 110 may be stored in one or more storage devices 116 of computing device 110. In some embodiments, computing device 110 is a cloud computer, where processor 112, memory 114, and storage device 116 are shared resources provided on demand via the Internet.
[0066] As shown in Figure 1, storage device 116 includes a product search application 118 and a product database 144. The product search application 118 is configured to provide a product search interface to a customer, allowing the customer to search for one or more products using text, images, or a combination of text and images as queries. The searched products match the query. The product database 144 includes product information, such as the product's title, description, main image, and optional other text or images. In some embodiments, the product database 144 may also be stored in a remote computing device communicating with computing device 110, provided that the product database 144 is accessible to the product search application 118. Among other things, the product search application 118 includes a query text feature module 120, a query text embedding module 122, a query image feature module 124, a query image embedding module 126, a query converter 128, a product text feature module 130, a product text embedding module 132, a product image feature module 134, a product image embedding module 136, a product converter 138, a relevance module 140, and a user interface 142. In some embodiments, the product search application 118 may include other applications or modules required for its operation. It should be noted that the various modules are implemented by computer-executable code or instructions, or by data tables or databases, which together constitute an application. In some embodiments, each module may also include sub-modules. Alternatively, some modules may be combined into a stack. In some embodiments, the query image feature module 124 and the query image embedding module 126 may not be necessary, for example when the customer uses only text during the query process. In some embodiments, some modules may be implemented as circuits rather than executable code. In some embodiments, some or all of the modules of the product search application 118 may be located at a remote computing device or distributed in the cloud.
[0067] The query text feature module 120 is configured to receive query text, extract query text features from the received query text, and send the query text features to the query text embedding module 122. In some embodiments, the query text feature module 120 is configured to receive query text from a user interface 142, where a customer can input query text through the user interface 142. In some embodiments, query text features are extracted by embedding the query text into text feature element embeddings, for example using Google's Word2Vec, Stanford's GloVe, Facebook's fastText, or other types of pre-trained word embedding models. In some embodiments, the query text is embedded into text feature element embeddings using a custom lookup table based on the pre-trained word embedding model, where the pre-trained word embedding model is further fine-tuned using e-commerce text. In some embodiments, the obtained text feature element embeddings are in vector form, with each vector corresponding to a word in the query text. In some embodiments, when one or more punctuation marks are present in the query text, the text feature element embeddings may also include a vector corresponding to each punctuation mark. In some embodiments, [CLS] may be added at the beginning of the query text, and [SEP] may be added between sentences and at the end of the query text. The dimensionality of each vector may vary, for example 1024 or 768, and each dimension can be a floating-point value.
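The lookup-table approach in [0067] can be illustrated with a small sketch. Everything named here is an assumption for illustration: the `tokenize` and `lookup` helpers, the sentence-splitting rule, the zero vector used for unknown tokens, and the tiny table passed in by the caller. A real system would use a pre-trained model such as Word2Vec, GloVe, or fastText with 768 or 1024 dimensions.

```python
import re
from typing import Dict, List


def tokenize(query: str) -> List[str]:
    """Split text into word and punctuation tokens, adding [CLS] and [SEP]."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", query.strip()) if s]
    tokens = ["[CLS]"]                      # marks the start of the query text
    for sentence in sentences:
        # Words and punctuation marks each become one token.
        tokens.extend(re.findall(r"\w+|[^\w\s]", sentence.lower()))
        tokens.append("[SEP]")              # closes each sentence
    return tokens


def lookup(tokens: List[str], table: Dict[str, List[float]],
           dim: int) -> List[List[float]]:
    """Map each token to its vector; unknown tokens get a zero vector (assumption)."""
    unknown = [0.0] * dim
    return [list(table.get(token, unknown)) for token in tokens]
```

For example, `tokenize("Red dress. Cotton fabric")` yields `[CLS]`, `red`, `dress`, `.`, `[SEP]`, `cotton`, `fabric`, `[SEP]`, and `lookup` returns one vector per token.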
[0068] The query text embedding module 122 is configured to embed the extracted query text features into a query text embedding upon receiving the extracted query text features, and then send the query text embedding to the query converter 128. In some embodiments, the query text embedding module 122 is configured to perform embedding by adding fragment embedding, position embedding, and optional mask embedding to the text feature element embedding. Fragment embedding represents the sentence identifier of a text word. For example, if the query text has two sentences, the words in the first sentence may have one fragment embedding, such as 1; the words in the second sentence may have another fragment embedding, such as 2. Position embedding represents the position of a word in the query text. For example, if the query text has 10 words sequentially, the position embeddings of these 10 words may be 0, 1, 2, 3, ..., 7, 8, and 9, respectively.
[0069] Figure 2 schematically depicts an example of query text embedding according to certain embodiments of this disclosure. As shown in Figure 2, the query text consists of two sentences. The first sentence contains three words, w11, w12, and w13, and the second sentence contains four words, w21, w22, w23, and w24. After executing the query text feature module 120, the result is the text feature element embedding. After executing the query text embedding module 122, the result includes text feature element embedding, fragment embedding, position embedding, and mask embedding.
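For the two-sentence example of Figure 2, the fragment and position values described in [0068] can be derived as in the sketch below. The function name `build_ids` is hypothetical, only the seven words are shown (no [CLS] or [SEP] tokens), and the mask is assumed to be all ones, that is, nothing masked.

```python
from typing import List, Tuple


def build_ids(sentences: List[List[str]]) -> Tuple[List[str], List[int], List[int], List[int]]:
    """Return tokens with their fragment ids, position ids, and mask values."""
    tokens: List[str] = []
    fragments: List[int] = []
    for fragment_id, sentence in enumerate(sentences, start=1):
        tokens.extend(sentence)
        fragments.extend([fragment_id] * len(sentence))  # same id within a sentence
    positions = list(range(len(tokens)))  # 0, 1, 2, ... across the whole text
    mask = [1] * len(tokens)              # assumption: no token is masked
    return tokens, fragments, positions, mask


tokens, fragments, positions, mask = build_ids(
    [["w11", "w12", "w13"], ["w21", "w22", "w23", "w24"]]
)
```

Here the words of the first sentence share fragment value 1, the words of the second share fragment value 2, and the positions run from 0 to 6.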
[0070] The query image feature module 124 and the query image embedding module 126 are configured to mimic the functions of the query text feature module 120 and the query text embedding module 122, where query images are processed instead of query text. The query image feature module 124 is configured to, for example, receive a query image from the user interface 142, extract query image features (query image feature element embedding), and send the query image features to the query image embedding module 126. The query image embedding module 126 is configured to, upon receiving query image features, add fragment embedding, position embedding, and mask embedding to the query image features to form a query image embedding, and provide the query image embedding to the query converter 128. The query image feature module 124 and the query image embedding module 126 are optional and have a similar module structure and function to the product image feature module 134 and the product image embedding module 136. The query image feature module 124 and the query image embedding module 126 will be described in detail below with reference to the product image feature module 134 and the product image embedding module 136.
[0071] Query converter 128 is configured to, upon receiving query embeddings (query text embeddings and optional query image embeddings), use the query embeddings as input, run a converter (transformer) model to update the query embeddings, and provide the latent features of the updated query embeddings to relevance module 140. In some embodiments, query converter 128 has a Bidirectional Encoder Representations from Transformers (BERT) structure. In some embodiments, query converter 128 has 3 to 12 BERT layers. In one example, query converter 128 has three BERT layers to ensure system efficiency.
[0072] Product text feature module 130 is configured to retrieve product text, extract product text features (text feature element embedding) from the product text, and provide the product text features to product text embedding module 132. Product text embedding module 132 is configured to, upon receiving product text features, add fragment embedding, position embedding, and optional mask embedding to the product text features to form a product text embedding, and provide the product text embedding to product converter 138. The structure and function of product text feature module 130 and product text embedding module 132 are basically the same as those of query text feature module 120 and query text embedding module 122, the difference being that the input of product text feature module 130 is the title and description of the product from product database 144, while the input of query text feature module 120 is the query text received from user interface 142.
[0073] The product image feature module 134 is configured to retrieve product images from the product database 144, extract product image features (product image feature element embedding) from the product images, and send the product image features to the product image embedding module 136. Figure 3 schematically depicts a product image feature module 134 according to certain embodiments of the present disclosure. As shown in Figure 3, the product image feature module 134 includes a product image retrieval module 1340, a product image normalization module 1342, a product image segmentation module 1344, a product image connection module 1346, and a product image feature extraction module 1348.
[0074] Product image retrieval module 1340 is configured to retrieve product images, such as main product images, from product database 144 and send the retrieved product images to product image normalization module 1342. In some embodiments, product image retrieval module 1340 may cooperate with product text feature module 130 to retrieve text and main product images of the same product substantially simultaneously. The main product image may be an RGB image, where each pixel has three channels corresponding to red, green, and blue, respectively. Each channel may have a value in the range of 0 to 255, indicating the intensity of the red, green, or blue pixel.
[0075] The product image normalization module 1342 is configured to, upon receiving a retrieved main product image, normalize the main product image to obtain a normalized product image, and send the normalized product image to the product image segmentation module 1344. In some embodiments, normalization is performed by converting each intensity value from 0 to 255 to a floating-point value in the range of -1 to 1. In some embodiments, intensity values from [0-255] are converted to intensity values from [-1, 1] using the following formula:
[0076] I_{[-1,1]} = (I_{[0,255]} - 127.5) / 127.5
[0077] where I_{[0,255]} is the intensity value within the range [0, 255] and I_{[-1,1]} is the intensity value within the range [-1, 1]. In some embodiments, the intensity distribution in the normalized image has a Gaussian distribution. The normalized product image contains all pixels, each pixel having three channels, and the value of each of the three channels is the corresponding normalized intensity. In some embodiments, the product image normalization module 1342 may use other types of normalization methods.
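The normalization step can be sketched as follows, assuming the linear map that sends intensity 0 to -1 and intensity 255 to 1. The function names and the nested-list image layout are illustrative assumptions; as noted above, other normalization methods may be used instead.

```python
def normalize_intensity(value):
    """Map an 8-bit intensity in [0, 255] to a float in [-1, 1].

    Assumption: a linear map with 0 -> -1 and 255 -> 1.
    """
    if not 0 <= value <= 255:
        raise ValueError("intensity must be in [0, 255]")
    return (value - 127.5) / 127.5


def normalize_image(image):
    """Normalize an H x W image whose pixels are (R, G, B) triples."""
    return [
        [[normalize_intensity(channel) for channel in pixel] for pixel in row]
        for row in image
    ]


# A 1 x 2 toy image: each pixel keeps its three channels after normalization.
image = [[(0, 128, 255), (255, 0, 64)]]
normalized = normalize_image(image)
```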
[0078] The product image segmentation module 1344 is configured to, upon receiving a normalized product image, segment the normalized product image into a grid and send the grid to the product image connection module 1346. In some embodiments, the image is cut into fixed square or rectangular non-overlapping grid blocks. In some embodiments, the normalized product image is cut into a 2×2, 3×3, 4×4, ..., or 16×16 grid. In some embodiments, the normalized product image is divided into 16 (4×4) grids. In some embodiments, the number of segments along the horizontal and vertical directions of the product image may be different.
[0079] The product image connection module 1346 is configured to, upon receiving a grid of product images, flatten and connect the grids to form a grid sequence, and send the grid sequence to the product image feature extraction module 1348. In some embodiments, the grids are arranged from left to right and from top to bottom. In other words, the grid sequence starts from the first row of grids from left to right, then the second row of grids from left to right, and so on until the last row of grids from left to right.
[0080] Figure 4 schematically depicts segmenting a normalized product image into a grid and connecting the grids into a grid sequence according to certain embodiments of the present disclosure. As shown in Figure 4, the product image is divided into 16 grids, consisting of four rows of four grids each. These 16 grids are then arranged from left to right, from the first row to the fourth row, to form a sequence of 16 grids. By analogy with natural language processing (NLP), each grid can be considered a word, and the sequence of grids mimics the structure of a sentence. Therefore, the sequence of 16 grids shown in Figure 4 can be viewed as a sentence of 16 words. In some embodiments, the grids can be connected in other ways; for example, the sequence of grids can start from the first column of grids from top to bottom, then the second column from top to bottom, and so on until the last column. However, the way the grids are connected should be consistent during the training and use of the product search application 118.
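The segmentation and connection steps above can be sketched as a single function that cuts an image into non-overlapping blocks and returns them in row-major order (left to right, then top to bottom). The function name `split_into_grids` and the requirement that the image size be divisible by the grid shape are simplifying assumptions for illustration.

```python
def split_into_grids(image, rows=4, cols=4):
    """Cut an H x W image (nested lists) into rows x cols non-overlapping blocks.

    Blocks are returned as a flat sequence in row-major order.
    Assumes H is divisible by rows and W by cols (a simplification).
    """
    height, width = len(image), len(image[0])
    if height % rows or width % cols:
        raise ValueError("image size must be divisible by the grid shape")
    block_h, block_w = height // rows, width // cols
    sequence = []
    for r in range(rows):          # top to bottom
        for c in range(cols):      # left to right within a row
            block = [
                line[c * block_w:(c + 1) * block_w]
                for line in image[r * block_h:(r + 1) * block_h]
            ]
            sequence.append(block)
    return sequence


# 8 x 8 toy image whose "pixel" records its own (row, column) position.
image = [[(y, x) for x in range(8)] for y in range(8)]
grid_sequence = split_into_grids(image)
```

Swapping the two loops would give the column-major ordering mentioned above; whichever ordering is chosen has to be the same at training time and at search time.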
[0081] The product image feature extraction module 1348 is configured to, upon receiving the grid sequence from the product image connection module 1346, extract features from the grid sequence to obtain product image sequence features, and then send these product image sequence features to the product image embedding module 136. In some embodiments, a convolutional neural network (CNN) or a transformer model is used for feature extraction. Specifically, each grid is used as input to the CNN, each grid contains multiple pixels, and each pixel has its normalized intensity. The output of the CNN for each grid can be a vector of floating-point numbers representing the image grid features. Each floating-point number is one dimension of the vector, and the number of dimensions depends on the configuration of the CNN model. In some embodiments, the CNN model is ResNet50, with dimensions such as 512, 1024, or 2048. In some embodiments, the feature extraction model is a vision transformer (ViT), with dimensions such as 768. It should be noted that the dimensions can be different and can be adjusted in the last layer using a multilayer perceptron (MLP). The tokenization process described above for obtaining feature elements in the image domain mimics the feature element extraction process for text in NLP. In some embodiments, the dimensions of the extracted product image features are the same as the dimensions of the extracted product text features. In some embodiments, the dimensions of each product text feature and image feature are also the same as the dimensions of the query text features.
[0082] It's important to note that the key to cross-modal search lies in approximating image features as text features. Product images are segmented into non-overlapping grid tiles. Image features extracted for each grid tile using a CNN are used as word feature elements in NLP. The flattened grid tile features after positional embedding are treated as a sentence in NLP. Once the two sets of features are aligned, cross-modal search can be performed.
[0083] Figure 5 schematically depicts the structure of the product image embedding module 136. As shown in Figure 5, the product image embedding module 136 includes a product image location embedding module 1360, a product image fragment embedding module 1362, a product image mask embedding module 1364, and a product image category embedding module 1366. As previously described, the product image embedding module 136 essentially has the same structure and function as the query image embedding module 126. The difference lies in that the query image embedding module 126 processes images provided by the customer, such as images taken by the customer using a smartphone and uploaded to the product search application 118 via the user interface 142, while the product image embedding module 136 processes product images retrieved from the product database 144. Furthermore, the product image embedding module 136 has the product image category embedding module 1366, while the query image embedding module 126 may not have a category embedding module.
[0084] The product image location embedding module 1360 is configured to, upon receiving product image sequence features (or product image features) from the product image feature extraction module 1348, add location embeddings to the product image features and send the product image features and location embeddings to the product image fragment embedding module 1362. In the NLP field, each word is first mapped to a series of floating-point numbers, called word feature elements or word features. For a model to understand an article, the sequence of word features alone is insufficient. The position of words within a sentence is also crucial, because the same word can have different meanings in different positions. Location embedding embeds this location information into the word feature sequence. To treat image processing like text processing, this disclosure adds location information to the image grid. The product image features contain the grid vectors in sequence, and location embedding adds a location vector to each grid vector. In some embodiments, the location vectors are defined by four numerical values representing the grid position in the image. For example, the top-left and bottom-right corners of the image can be defined as having coordinates of (0, 0) and (1, 1), respectively. Each grid is then defined by the coordinates of its top-left and bottom-right corners. Assume the image has 16 grids, 4 horizontally and 4 vertically. Then, for the four grid cells in the first row of the 4×4 grid, the position vectors are (0, 0, 0.25, 0.25), (0.25, 0, 0.5, 0.25), (0.5, 0, 0.75, 0.25), and (0.75, 0, 1.0, 0.25), respectively. For the four grid cells in the fourth row of the 4×4 grid, the position vectors are (0, 0.75, 0.25, 1.0), (0.25, 0.75, 0.5, 1.0), (0.5, 0.75, 0.75, 1.0), and (0.75, 0.75, 1.0, 1.0), respectively.
In some embodiments, unlike text position embedding, the product image location embedding module 1360 is also configured to define a global feature for the entire image and add the global feature at the beginning of the sequence of 16 grid cells. As shown in Figure 6, in some embodiments, the global feature is generated by averaging all grid features, and the position of the global feature is defined as (0, 0, 1, 1), that is, the box spanning the entire normalized image. In other words, the global feature is a vector whose value is the average of the 16 image grid vectors, and the position embedding of the global feature vector is (0, 0, 1, 1).
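The position vectors and the global feature described above can be illustrated with a short sketch. The function names and the toy two-dimensional grid features are assumptions for illustration; how the four-value position vector is combined with a grid vector inside the model is not shown here.

```python
def grid_positions(rows=4, cols=4):
    """Return (x0, y0, x1, y1) boxes in row-major order on a unit-square image.

    (x0, y0) is the top-left corner and (x1, y1) the bottom-right corner.
    """
    return [
        (c / cols, r / rows, (c + 1) / cols, (r + 1) / rows)
        for r in range(rows)
        for c in range(cols)
    ]


def prepend_global_feature(grid_features, positions):
    """Prepend the mean of all grid features, positioned at (0, 0, 1, 1)."""
    count = len(grid_features)
    dim = len(grid_features[0])
    global_feature = [
        sum(feature[d] for feature in grid_features) / count for d in range(dim)
    ]
    return [global_feature] + grid_features, [(0.0, 0.0, 1.0, 1.0)] + positions


positions = grid_positions()
# Toy 2-dimensional features standing in for CNN outputs (assumption).
features = [[float(i), 2.0 * i] for i in range(16)]
all_features, all_positions = prepend_global_feature(features, positions)
```

With the default 4×4 grid, `positions[0]` is (0, 0, 0.25, 0.25) and `positions[15]` is (0.75, 0.75, 1.0, 1.0), matching the first-row and fourth-row examples in the text.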
[0085] The product image fragment embedding module 1362 is configured to, upon receiving product image features and location embeddings from the product image location embedding module 1360, add fragment embeddings to the product image sequence features and send the product image features, location embeddings, and fragment embeddings to the product image mask embedding module 1364. Since an image is treated as a sentence, all grid features are assigned the same fragment identifier (ID). Specifically, in NLP, the position of a sentence within an article is important, and fragment embedding assigns different labels to different sentences. Similarly, for image processing, the product image fragment embedding module 1362 is configured to assign a fragment ID to each product image. When only the main product image is used in the product search application 118, only one image fragment ID exists. The image fragment ID is used to distinguish between images and text sentences. For example, for a product, if two sentences and a main product image are provided, the fragment IDs for the two sentences and the main product image can be defined as 0, 1, and 2, respectively.
[0086] The product image mask embedding module 1364 is configured to, upon receiving product image features, position embeddings, and fragment embeddings from the product image fragment embedding module 1362, add a mask embedding to the product image features and send the product image features, position embeddings, fragment embeddings, and image mask embeddings to the product image category embedding module 1366. As described above, the product image features are in the form of a series of vectors, each vector representing a grid. To learn the relationships between grids, the product image mask embedding module 1364 is configured to add a mask embedding to each vector. In some embodiments, the mask feature elements at 10% to 15% of the positions, chosen at random, are assigned a value of N/A. Vectors whose mask feature element equals N/A are set to 0 in each of their dimensions. The remaining vectors are used to predict the masked image sequence features. The more accurate the model's predictions, the better the model has learned the relationships between grids.
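A minimal sketch of the random masking described above follows. The function name, the use of a boolean flag in place of the N/A marker, and the rounding rule that turns a ratio into a count are illustrative assumptions.

```python
import random


def mask_grid_features(features, mask_ratio=0.15, seed=None):
    """Zero out a random subset of grid feature vectors.

    Returns (masked_features, mask_flags); mask_flags[i] is True where
    vector i was masked. Rounding the count and masking at least one
    vector are assumptions, not requirements of the disclosure.
    """
    rng = random.Random(seed)
    count = len(features)
    n_masked = max(1, round(count * mask_ratio))
    masked_idx = set(rng.sample(range(count), n_masked))
    masked = [
        [0.0] * len(vec) if i in masked_idx else list(vec)
        for i, vec in enumerate(features)
    ]
    flags = [i in masked_idx for i in range(count)]
    return masked, flags


# 16 toy grid vectors of dimension 4 (stand-ins for CNN features).
features = [[float(i + 1)] * 4 for i in range(16)]
masked, flags = mask_grid_features(features, mask_ratio=0.15, seed=0)
```

During training, the unmasked vectors would be used to predict the original values at the flagged positions; that prediction step is outside the scope of this sketch.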
[0087] The product image category embedding module 1366 is configured to, upon receiving product image features, location embeddings, fragment embeddings, and mask embeddings from the product image mask embedding module 1364, add category embeddings to the product image features and send the product image features, location embeddings, fragment embeddings, mask embeddings, and category embeddings to the product converter 138. In the e-commerce field, images from different product categories carry different information. Fashion images typically carry more useful information than images of electronic products. Descriptions of hard drive size may not be reflected in the image, but textual descriptions and style descriptions of clothing can be easily found within the image. To emphasize the differences between product categories, the product image category embedding module 1366 is configured to add a category ID to the product image sequence features. In some embodiments, products are classified into approximately 41 different categories, which may include clothing, electronic products, home furnishings, home appliances, computers, etc. In some embodiments, the product image features, location embeddings, fragment embeddings, and mask embeddings are provided to the MLP with the category ID as a parameter, and the result is used as input to the product converter 138. It should be noted that the query image embedding module 126 may not have a corresponding category embedding module. In some embodiments, the product image embedding module 136 may not include the product image category embedding module 1366, and the product converter 138 is configured to add a category ID before running the converter on the product embedding.
[0088] Figure 7 schematically depicts the results from the product image embedding module 136, which are combined with those from the product text embedding module 132 as input to the product converter 138. As shown in Figure 7, a CNN is run on the 16 image grids to assign a vector value to each image grid. These 16 vectors are G_{01} to G_{16}. Each vector can have, for example, 1024 dimensions or 768 dimensions. These 16 grids come from the same image and are considered as a sentence, thus being assigned the same fragment embedding S2. The fragment ID differs from the one or more fragment IDs of the sentences from the product text embedding module 132 in order to distinguish the image from the sentences. Mask embedding indicates that 10% to 15% of the image feature element embeddings are randomly masked. In the example, the embeddings of the 7th and 15th grids are masked. Therefore, the vector values G_{07} and G_{15} are 0 in every dimension. Furthermore, the product is categorized into the 12th product category, namely food, and therefore the image feature element embedding has a category embedding C_{12}. In some embodiments, the category embedding may not be added by the product image embedding module 136, but rather directly by the product converter 138 as a parameter before operating the converter.
[0089] Product converter 138 is configured to, upon receiving product text embeddings and product image embeddings, combine the embeddings into a product embedding, perform converter encoding on the product embedding to update the product embedding, extract hidden features from the updated product embedding, and send the extracted hidden features to relevance module 140. In some embodiments, the product converter 138 for analyzing product text and image embeddings is inspired by the concept of "Attention Is All You Need." In some embodiments, a classic converter encoding architecture can be used. In some embodiments, product converter 138 includes one or more BERT layers, such as three BERT layers, six BERT layers, or twelve BERT layers. The output of product converter 138 is a hidden representation of the product text and image. In some embodiments, product converter 138 is configured to extract the hidden representation from the last layer of the product converter as the final result and send the extracted hidden representation to relevance module 140. In some embodiments, the extracted hidden representation is the first vector of the hidden representation.
[0090] It is important to note that a transformer is a deep learning model (design template) that takes an input feature sequence and outputs another feature sequence. Generally, a transformer consists of an encoder and a decoder. In some embodiments of this disclosure, the product converter 138 is the encoder portion. In some embodiments, the encoder mainly consists of attention layers and feedforward layers. The feedforward layers transform each embedding in the input sequence individually, to provide more modeling capacity or to change the dimensionality.
[0091] The attention layer first multiplies the input sequence embedding E by three weight matrices to transform it into three distinct feature sequences: Query(Q), Key(K), and Value(V). These three feature sequences are then combined to produce an output embedding sequence E′ of the same length as the input:
[0092] Take the dot product of Q_i and K_j for all j: D_{ij}=<Q_i,K_j>
[0093] These dot products are normalized using a softmax operation: W_{ij}=e^{D_{ij}} / sum_k e^{D_{ik}}. Here, the value of k ranges from 1 to the sequence length.
[0094] We obtain the weighted sum of V_j through W_{ij}: E′_i=sum_j W_{ij}V_j.
[0095] Therefore, E′_i is the weighted average of V_j, with the weights provided by the pairing of Q and K.
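The three steps above (dot products, softmax normalization, weighted sum) can be sketched in plain Python as a single attention head. The helper names and the toy weight matrices are illustrative assumptions. The sketch follows the formulas as written, without the 1/sqrt(d) scaling of the dot products that standard transformer implementations apply, and subtracts the row maximum before exponentiation purely for numerical stability, which does not change the softmax result.

```python
import math


def project(rows, weight):
    """Multiply each row vector by a weight matrix given as a list of rows."""
    return [
        [sum(x * w for x, w in zip(row, col)) for col in zip(*weight)]
        for row in rows
    ]


def attention(E, Wq, Wk, Wv):
    """Single-head attention: returns E' with the same length as E."""
    Q, K, V = project(E, Wq), project(E, Wk), project(E, Wv)
    out = []
    for q in Q:
        d = [sum(a * b for a, b in zip(q, k)) for k in K]   # D_ij = <Q_i, K_j>
        peak = max(d)
        e = [math.exp(x - peak) for x in d]
        total = sum(e)
        w = [x / total for x in e]                          # W_ij (softmax over j)
        out.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )                                                   # E'_i = sum_j W_ij V_j
    return out


E = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

E_out = attention(E, identity, identity, identity)
# With an all-zero query projection every weight is equal, so each output
# row is the plain average of the value vectors.
E_uniform = attention(E, zeros, identity, identity)
```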
[0096] In some embodiments, preserving the sequence length has the advantage that the above process can be iterated any number of times, thereby allowing the model to achieve arbitrary depth (expressiveness).
[0097] In another embodiment, an approach that predates attention is used to process the embedding sequence: the embeddings are simply summed to produce a single embedding, E′ = sum_i E_i. E′ is then fed into a feedforward network (also known as a multilayer perceptron) to produce a score or another embedding. Its main drawback is that it does not capture (at least not explicitly) pairwise interactions between elements in the sequence E. In yet another embodiment, higher-order interactions can also be captured by stacking multiple attention / transformer layers.
[0098] The relevance module 140 is configured to, upon receiving a hidden representation of a query extracted from the query converter 128 and a hidden representation of a product extracted from the product converter 138, determine whether the query and product are relevant based on the extracted hidden representations, and provide the relevance result to the user interface 142 if the query and product are relevant. In some embodiments, the relevance module 140 is a multilayer perceptron (MLP). The input to the MLP is the extracted hidden features of the query and product, and the output is a relevance value. The extracted hidden features can be extracted from the last layer of the query converter 128 or the product converter 138, and may include only the head (first vector) of the query hidden representation or of the product hidden representation. The relevance value can be a real value in the range of 0 to 1, where 0 is irrelevant and 1 is highly relevant. In some embodiments, the relevance value can also be represented as a category of relevant or irrelevant. A threshold, such as 0.7, can be preset. If the relevance value is equal to or greater than 0.7, the query-product pair is classified as relevant; if the relevance value is less than 0.7, the pair is classified as irrelevant. In some embodiments, the model is trained by minimizing a cross-entropy loss. Specifically, relevance is treated as a classification problem, where 1 represents relevant and 0 represents irrelevant, and minimizing the cross-entropy drives the model toward correct predictions.
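The thresholding and the cross-entropy objective described above can be sketched as follows. A single logistic layer over the concatenated query and product features stands in for the MLP; the concatenation, the function names, and the toy weights are assumptions for illustration, while the 0.7 threshold is the example value from the text.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def relevance_score(query_vec, product_vec, weights, bias):
    """One-layer stand-in for the MLP: a score in (0, 1).

    Assumption: the query and product hidden features are concatenated.
    """
    features = list(query_vec) + list(product_vec)
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


def classify(score, threshold=0.7):
    """Relevant if the score reaches the preset threshold."""
    return "relevant" if score >= threshold else "irrelevant"


def binary_cross_entropy(score, label, eps=1e-12):
    """Loss for one example; label is 1 (relevant) or 0 (irrelevant)."""
    score = min(max(score, eps), 1.0 - eps)   # avoid log(0)
    return -(label * math.log(score) + (1 - label) * math.log(1.0 - score))


score = relevance_score([1.0, 0.0], [1.0, 0.0], weights=[1.0, 0.0, 1.0, 0.0], bias=0.0)
label = classify(score)
loss_if_relevant = binary_cross_entropy(score, 1)
loss_if_irrelevant = binary_cross_entropy(score, 0)
```

A confident score on a truly relevant pair gives a small loss, and the same score on an irrelevant pair gives a large one, which is what pushes training toward correct predictions.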
[0099] User interface 142 is configured to display results to the customer who submitted the query via the interface upon receiving relevant products from relevance module 140. In some embodiments, user interface 142 may communicate with a terminal such as a customer's remote computing device or smartphone and display a graphical user interface (GUI) on the remote computing device or smartphone. The customer can enter and submit their query through the GUI, and user interface 142 can display the query results on the GUI. In some embodiments, user interface 142 may be configured to display products only if the product is relevant to the query. In some embodiments, product search application 118 will perform a relevance analysis between the query and many or all products from product database 144, rank the products based on their relevance to the query, and display only the top-ranked products to the customer. The top-ranked products have the highest relevance value to the query. The top-ranked products may be the top five products, the top ten products, or a number selected by the customer.
[0100] In some embodiments, the product search application 118 may also include an ordering module, allowing customers to place an order for one or more products they wish to purchase when they browse the search results and find them. In some embodiments, the product search application 118 may also include a clickable link to an ordering interface, redirecting the customer to the ordering interface to order the selected products.
[0101] In some embodiments, for products in product database 144, product search application 118 can perform text and image feature analysis on each product in advance by running product text feature module 130, product text embedding module 132, product image feature module 134, product image embedding module 136, and product converter 138. Therefore, the hidden features produced by the converter for each product can be extracted and stored offline. When a customer queries a product, product search application 118 only needs to execute the functions of modules 120 to 128, and then run relevance module 140 on the resulting query hidden features against the stored hidden features of each product. In some embodiments, the customer can provide only a text query, in which case the functions of query image feature module 124 and query image embedding module 126 are not executed.
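The offline precomputation and the top-ranked display described above can be sketched as a small index-and-search routine. All names here are hypothetical; in particular, the identity "encoder" and the cosine-based scoring function are toy stand-ins for the product converter 138 and the relevance module 140, chosen only so that the example runs on its own.

```python
def build_product_index(products, encode_product):
    """Precompute hidden features offline: product_id -> feature vector."""
    return {pid: encode_product(data) for pid, data in products.items()}


def search(query_features, product_index, score_fn, top_k=5, threshold=0.7):
    """Score the query against cached product features.

    Returns up to top_k (product_id, score) pairs whose score reaches the
    threshold, sorted from most to least relevant.
    """
    scored = [
        (pid, score_fn(query_features, feats)) for pid, feats in product_index.items()
    ]
    relevant = [item for item in scored if item[1] >= threshold]
    relevant.sort(key=lambda item: item[1], reverse=True)
    return relevant[:top_k]


def cosine01(a, b):
    """Toy relevance function (assumption): cosine similarity clipped to [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return max(0.0, dot / norm) if norm else 0.0


index = build_product_index(
    {"shoe": [1.0, 0.0], "boot": [0.9, 0.1], "lamp": [0.0, 1.0]},
    encode_product=lambda feats: feats,   # identity stand-in for the encoder
)
results = search([1.0, 0.0], index, cosine01, top_k=2)
```

Only the query side is computed at search time; the product side is a dictionary lookup, which is the source of the reduced online workload.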
[0102] Figure 8 schematically illustrates a product query system according to certain embodiments of the present disclosure. The components in Figure 8 are similar to those in Figure 1; the difference is that a single converter 838 receives input from the query text embedding module 822, the product text embedding module 832, the product image embedding module 836, and the optional query image embedding module 826, and runs converter encoding on all of these inputs. The designs of Figure 1 and Figure 8 each generally consist of three parts: the query part, the product part, and the relevance calculation part. The dual-stream design shown in Figure 1 extracts query hidden features and product hidden features separately using converter encoding. Its advantage is that product features can be calculated offline, reducing the workload of online computation. However, its accuracy is slightly lower than that of the single-stream design. On the other hand, the single-stream design shown in Figure 8 concatenates the query sequence and the product information sequence together and extracts joint features. Its accuracy is slightly better than that of the dual-stream design, but the online computational load is also higher, which may result in longer search latency.
[0103] Figure 9 schematically illustrates certain embodiments of this disclosure. The designs shown in Figure 1 and Figure 8 are an improvement over text-to-text and image-to-image search systems. The upper part of Figure 9 illustrates text-to-text search and image-to-image search according to certain embodiments of this disclosure. In a search method based purely on text features using NLP, each word is treated as a feature element, and the sentence is approximated as a sequence of embedded feature elements through positional embedding, fragment embedding, and mask embedding. This method has achieved great success in handling text tasks. Search engines can calculate the relevance between a query and a title, and then estimate the query intent. However, image information is lacking in the intent estimation calculation. A search method based purely on image features uses a similarity score between high-dimensional image features of two images. High-dimensional features can be extracted using convolutional neural networks (CNNs) or transformer encoding. Both methods calculate high-dimensional tensors to represent the features of the entire image. However, textual information, such as words in a text query or product description, is not taken into account in the relevance calculation. As shown in the lower part of Figure 9, the designs shown in Figure 1 and Figure 8, according to certain embodiments of this disclosure, enable cross-modal verification between images and text. Unlike text-to-text or image-to-image search systems, the embodiments are capable of handling mixed cases, such as text-to-image, image-to-text, and even more complex cases, such as text-to-(text+image), image-to-(text+image), and (text+image)-to-(text+image).
[0104] Referring back to Figure 1, in some embodiments, the product search application 118 may further include a scheduler configured to schedule data flows between other modules of the product search application 118. The scheduler may determine the training and search modes of the product search application 118, load retrieved product text and images into memory 114, and invoke different modules to perform module functions on the retrieved text and images. In some embodiments, the product search application 118 may further include a management interface for system administrators to configure and adjust module parameters, train the product search application 118 using product text and images from the product database 144, and provide search functionality to customers using the trained modules. In some embodiments, management functionality may also be incorporated into a user interface 142 configured to provide a management interface to administrators and a user interface to customers.
[0105] In some aspects, this disclosure relates to a product search method. Figure 10 schematically depicts a product search method according to certain embodiments of the present disclosure. In some embodiments, the method is implemented by the computing device 110 shown in Figure 1. It should be specifically noted that, unless otherwise stated in this disclosure, the steps of the method may be arranged in a different order, and are therefore not limited to the order shown in Figure 10.
[0106] In step 1002, the query text feature module 120 receives the query text from the customer, extracts query text features from the query text, and sends the extracted query text features to the query text embedding module 122. In some embodiments, the query text features may be in the form of the text feature element embeddings shown in Figure 2. In some embodiments, word2vec, GloVe, fastText, or other word embedding models are used for extraction. In some embodiments, the extracted query text features are in vector form.
[0107] In step 1004, upon receiving the extracted query text features, the query text embedding module 122 embeds the extracted query text features to form a query text embedding and sends the query text embedding to the query converter 128. In some embodiments, the embedding process may include adding fragment embedding, position embedding, and mask embedding to the query text features. Therefore, query text embedding may include text feature element embedding, fragment embedding, position embedding, and mask embedding. When the query contains only text but not an image, the query text embedding is also called query embedding. When both query text and query image are present, the query embedding is a combination of query text embedding and query image embedding.
[0108] In step 1006, upon receiving the query embedding, the query converter 128 uses the query embedding as input to operate the converter to update the query embedding and make the hidden representation of the query embedding available to the relevance module 140. In some embodiments, the query converter 128 has one or more BERT layers. The updated query embedding includes the hidden representation of the query, where the vector values are updated by multiple converter layers.
[0109] When a customer provides a query that includes query text and a query image, the query image feature module 124 will also use a CNN model to extract query image features, add fragment embeddings, position embeddings, mask embeddings, and optional category embeddings to form a query image embedding, and send this query image embedding to the query converter 128. The query image embedding is combined with the query text embedding to form a query embedding, which is used as input to the query converter 128.
[0110] In step 1008, the product text feature module 130 retrieves product text from the product database 144, extracts product text features from the product text, and sends the extracted product text features to the product text embedding module 132. In some embodiments, the product text includes the product's title and description. In some embodiments, the product text features may take the form of the text feature element embeddings shown in Figure 2. In some embodiments, word2vec or another word embedding model is used for the extraction.
[0111] In step 1010, upon receiving the extracted product text features, the product text embedding module 132 embeds the extracted product text features to form a product text embedding, and sends the product text embedding to the product converter 138. In some embodiments, the product text embedding may include text feature element embedding (product text features), fragment embedding, position embedding, and mask embedding.
[0112] In step 1012, the product image feature module 134 retrieves product images from the product database 144, extracts product image features from the product images, and sends the extracted product image features to the product image embedding module 136. In some embodiments, the text and the image can be retrieved simultaneously, and the text and the image correspond to the same product. In some embodiments, the product image is the main image of the product. In some embodiments, the product image features may take the form of the image feature element embeddings shown in Figure 7. In some embodiments, a CNN is used to extract the product image features.
[0113] In step 1014, upon receiving the extracted product image features, the product image embedding module 136 embeds the extracted product image features to form a product image embedding, and sends the product image embedding to the product converter 138. In some embodiments, the product image embedding may include, as shown in Figure 7, image feature element embeddings, fragment embeddings, position embeddings, mask embeddings, and optional category embeddings.
[0114] In step 1016, upon receiving the product text embedding and the product image embedding, the product converter 138 combines the product text embedding and the product image embedding to form a product embedding. Using the product embedding as input, the product converter 138 operates and sends the updated product embedding or makes the updated product embedding available to the relevance module 140. The product text embedding and the product image embedding originate from the same product.
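The combination in step 1016 can be sketched as follows, assuming that "combining" means concatenating the two embeddings along the sequence axis (paragraph [0131] speaks of features from different domains being concatenated). The function name `combine_product_embedding` is hypothetical. The width check reflects the requirement that the product image embedding have the same format as the product text embedding.

```python
from typing import List

Vector = List[float]


def combine_product_embedding(
    text_embedding: List[Vector], image_embedding: List[Vector]
) -> List[Vector]:
    """Concatenate text and image embeddings of one product into one sequence.

    Assumption: both embeddings are sequences of equal-width vectors, so the
    image part can follow the text part in a single input sequence.
    """
    widths = {len(vector) for vector in text_embedding + image_embedding}
    if len(widths) > 1:
        raise ValueError("text and image embeddings must share one vector width")
    return text_embedding + image_embedding


# Hypothetical product with one text element and two image grid elements.
product_embedding = combine_product_embedding(
    [[1.0, 2.0]], [[3.0, 4.0], [5.0, 6.0]]
)
```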
[0115] In step 1018, the relevance module 140 retrieves the hidden representation of the query from the query converter 128 and the hidden representation of the product from the product converter 138, determines whether the query and the product are relevant, and provides the relevance value to the user interface 142. In some embodiments, the hidden representations are extracted from the last layer of the query converter 128 and the last layer of the product converter 138, respectively. In some embodiments, an MLP is used to determine the relevance value.
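A minimal sketch of an MLP-based relevance value follows. The source only states that an MLP is used; concatenating the two hidden representations, using a single ReLU hidden layer, and applying a sigmoid output are assumptions made for illustration, as are the names `dense` and `relevance_score` and the toy weights.

```python
import math
from typing import List

Vector = List[float]


def dense(inputs: Vector, weights: List[Vector], biases: Vector) -> Vector:
    """Fully connected layer: one weight row and one bias per output unit."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def relevance_score(
    query_hidden: Vector,
    product_hidden: Vector,
    w1: List[Vector],
    b1: Vector,
    w2: List[Vector],
    b2: Vector,
) -> float:
    """Map a (query, product) pair of hidden vectors to a value in (0, 1)."""
    joint = query_hidden + product_hidden           # concatenation (assumed)
    hidden = [max(0.0, v) for v in dense(joint, w1, b1)]  # ReLU hidden layer
    logit = dense(hidden, w2, b2)[0]                 # single output unit
    return 1.0 / (1.0 + math.exp(-logit))            # sigmoid


# Toy weights chosen by hand; a real model would learn them during training.
score = relevance_score([1.0], [2.0], [[1.0, 1.0]], [0.0], [[1.0]], [0.0])
```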
[0116] In step 1020, the relevance between the query and many or all products in the product database 144 is determined, and the relevance module 140 sends the products with the highest relevance to the user interface 142. The customer can then view the ranked related products and, if satisfied with some of them, choose to order those products.
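The ranking in step 1020 can be sketched as a top-k selection over precomputed relevance values. The function name `top_products`, the cutoff `k`, and the optional `threshold` are hypothetical and are not specified in this disclosure.

```python
import heapq
from typing import Dict, List, Tuple


def top_products(
    relevance_by_product: Dict[str, float], k: int, threshold: float = 0.0
) -> List[Tuple[str, float]]:
    """Return up to k (product_id, relevance) pairs, highest relevance first."""
    candidates = [
        (product_id, value)
        for product_id, value in relevance_by_product.items()
        if value >= threshold
    ]
    return heapq.nlargest(k, candidates, key=lambda item: item[1])


# Hypothetical relevance values for three products.
ranked = top_products({"sku-a": 0.2, "sku-b": 0.9, "sku-c": 0.5}, k=2)
```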
[0117] In some embodiments, steps 1008-1018 are performed iteratively for all pre-stored products, such that when providing a query, the product search application 118 only needs to run the query text feature module 120, the query text embedding module 122, the query converter 128, the relevance module 140, and the user interface 142. In some embodiments, when the query also includes images, the product search application 118 may also need to run the query image feature module 124 and the query image embedding module 126.
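One reading of this paragraph is that the product-side representations are computed ahead of time and reused for every query. The sketch below illustrates that reading only. The class `ProductIndex`, its methods, and the injected `encode_product` and `score` callables are hypothetical stand-ins for the product-side modules and the relevance module.

```python
from typing import Callable, Dict, List

Vector = List[float]


class ProductIndex:
    """Cache of product hidden representations computed once, offline."""

    def __init__(self, encode_product: Callable[[str], Vector]) -> None:
        self._encode_product = encode_product
        self._hidden: Dict[str, Vector] = {}

    def build(self, product_ids: List[str]) -> None:
        """Run the product-side pipeline once per pre-stored product."""
        for product_id in product_ids:
            self._hidden[product_id] = self._encode_product(product_id)

    def search(
        self, query_hidden: Vector, score: Callable[[Vector, Vector], float]
    ) -> Dict[str, float]:
        """At query time only the query side and the scoring step run."""
        return {
            product_id: score(query_hidden, hidden)
            for product_id, hidden in self._hidden.items()
        }
```

With this arrangement, the cost of encoding products is paid when the index is built, and each incoming query only requires encoding the query and scoring it against the cached representations.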
[0118] Figure 11 schematically depicts a product image extraction method according to certain embodiments of the present disclosure. In some embodiments, the method is implemented by the product image feature module 134 shown in Figure 3. It should be specifically noted that, unless otherwise stated in this disclosure, the steps of the method can be arranged in a different order, and are therefore not limited to the order shown in Figure 11.
[0119] In step 1102, the product image retrieval module 1340 retrieves product images from the product database 144 and sends the retrieved product images to the product image normalization module 1342.
[0120] In step 1104, upon receiving the retrieved product image, the product image normalization module 1342 normalizes the product image to obtain a normalized product image, and sends the normalized product image to the product image segmentation module 1344.
[0121] In step 1106, upon receiving a normalized product image, the product image segmentation module 1344 segments the normalized product image into a product image grid and sends the product image grid to the product image connection module 1346.
[0122] In step 1108, upon receiving the product image grid, the product image connection module 1346 flattens the grid, connects the grids into a product image grid sequence, and sends the product image grid sequence to the product image feature extraction module 1348.
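Steps 1106 and 1108 can be sketched as follows. The sketch assumes a single-channel image stored as a list of pixel rows, square non-overlapping blocks, and row-major ordering when the grid is flattened; none of these details is fixed by the source, and the function name `split_into_grid_sequence` is hypothetical.

```python
from typing import List

Image = List[List[float]]  # rows of pixel values, single channel for brevity


def split_into_grid_sequence(image: Image, block: int) -> List[Image]:
    """Cut an image into non-overlapping block x block grids, then flatten
    the grid of blocks into a sequence in row-major order (assumed order)."""
    if not image or not image[0]:
        raise ValueError("image must not be empty")
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image size must be a multiple of the block size")

    sequence = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            sequence.append(
                [row[left:left + block] for row in image[top:top + block]]
            )
    return sequence


# Hypothetical 4 x 4 image with pixel values 0..15, split into 2 x 2 blocks.
toy_image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
grid_sequence = split_into_grid_sequence(toy_image, block=2)
```

Each element of the resulting sequence plays the role that a word plays in a sentence, which is what allows the image to be processed like text in the later steps.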
[0123] In step 1110, upon receiving the sequence of product image grids, the product image feature extraction module 1348 extracts features from each grid to form extracted product image features, and sends the extracted product image features to the product image embedding module 136. In some embodiments, the product image feature extraction module 1348 sends the extracted product image features to the product image position embedding module 1360 of the product image embedding module 136.
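The per-grid feature extraction of step 1110 can be illustrated with a deliberately small stand-in for a CNN: one convolution layer, a ReLU, and global average pooling, yielding one value per filter for each grid. A practical system would presumably use a trained, much deeper network; the function names and the hand-written kernels below are hypothetical.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def conv2d_valid(block: Matrix, kernel: Matrix) -> Matrix:
    """Slide the kernel over the block without padding (cross-correlation,
    which is what deep learning libraries usually call convolution)."""
    kh, kw = len(kernel), len(kernel[0])
    output = []
    for i in range(len(block) - kh + 1):
        row = []
        for j in range(len(block[0]) - kw + 1):
            row.append(
                sum(
                    block[i + a][j + b] * kernel[a][b]
                    for a in range(kh)
                    for b in range(kw)
                )
            )
        output.append(row)
    return output


def grid_feature(block: Matrix, kernels: List[Matrix]) -> Vector:
    """One feature value per kernel: convolution, ReLU, then average pooling."""
    feature = []
    for kernel in kernels:
        activations = [
            max(0.0, value) for row in conv2d_valid(block, kernel) for value in row
        ]
        feature.append(sum(activations) / len(activations))
    return feature


def extract_grid_features(
    grid_sequence: List[Matrix], kernels: List[Matrix]
) -> List[Vector]:
    """Apply the same feature extractor to every grid in the sequence."""
    return [grid_feature(block, kernels) for block in grid_sequence]
```

The output has one feature vector per grid, in the same order as the grid sequence, so a position can later be attached to each vector.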
[0124] Figure 12 schematically depicts a product image embedding method according to certain embodiments of the present disclosure. In some embodiments, the method is implemented by the product image embedding module 136 shown in Figure 5. It should be specifically noted that, unless otherwise stated in this disclosure, the steps of the method can be arranged in a different order, and are therefore not limited to the order shown in Figure 12.
[0125] In step 1202, upon receiving the extracted product image features, the product image position embedding module 1360 adds the position embedding to the product image features and sends the product image features and position embedding to the product image fragment embedding module 1362.
[0126] In step 1204, upon receiving the product image features and position embedding, the product image fragment embedding module 1362 adds the fragment embedding to the product image features and sends the product image features, position embedding, and fragment embedding to the product image mask embedding module 1364.
[0127] In step 1206, upon receiving the product image features, position embedding, and fragment embedding, the product image mask embedding module 1364 adds the mask embedding to the product image features and sends the product image features, position embedding, fragment embedding, and mask embedding to the product image category embedding module 1366.
[0128] In step 1208, upon receiving the product image features, position embedding, fragment embedding, and mask embedding, the product image category embedding module 1366 adds the category embedding to the product image features and sends the product image features, position embedding, fragment embedding, mask embedding, and category embedding to the product converter 138. In some embodiments, an MLP is used to merge the category IDs.
[0129] The position embedding, fragment embedding, and mask embedding can be added in parallel or in any order. Category IDs can be provided via the product image embedding module 136 or the product converter 138. The product image features, position embedding, fragment embedding, mask embedding, and category embedding are combined into a product image embedding. The product text embedding and the product image embedding are combined into a product embedding.
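The embedding method of Figure 12 can be sketched as below. The sketch assumes the components are combined by element-wise summation, which makes the result independent of the order in which they are added. It also follows the claims in letting a mask value of 0 zero the corresponding grid feature element, and it treats the category embedding as an optional vector. Choosing the masked positions at random, and the names `random_mask` and `embed_image_grids`, are assumptions for illustration.

```python
import random
from typing import List, Optional

Vector = List[float]


def random_mask(length: int, zero_fraction: float, rng: random.Random) -> List[int]:
    """Return a 0/1 mask with roughly zero_fraction of the entries set to 0.

    The claims mention 10-15% zeros; picking them at random is an assumption.
    """
    zeros = set(rng.sample(range(length), round(length * zero_fraction)))
    return [0 if index in zeros else 1 for index in range(length)]


def embed_image_grids(
    grid_features: List[Vector],
    position_table: List[Vector],
    fragment_vector: Vector,
    mask: List[int],
    category_vector: Optional[Vector] = None,
) -> List[Vector]:
    """Sum masked grid features with position, fragment, and category vectors."""
    if len(grid_features) != len(mask):
        raise ValueError("one mask value is required per grid feature element")
    embedded = []
    for index, (feature, keep) in enumerate(zip(grid_features, mask)):
        parts = [
            [keep * value for value in feature],  # mask 0 zeroes the feature
            position_table[index],                # position in the grid sequence
            fragment_vector,                      # identifies the image fragment
        ]
        if category_vector is not None:
            parts.append(category_vector)         # optional category embedding
        embedded.append([sum(values) for values in zip(*parts)])
    return embedded


# Hypothetical usage: mask 15% of a 20-grid sequence with a fixed seed.
mask = random_mask(20, 0.15, random.Random(0))
```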
[0130] In some aspects, this disclosure relates to a non-transitory computer-readable medium storing computer-executable code. The code, when executed by a processor of a computing device, can perform the methods described above. In some embodiments, the non-transitory computer-readable medium may include, but is not limited to, any physical or virtual storage medium. In some embodiments, the non-transitory computer-readable medium may be implemented as the storage device 116 of the computing device 110 shown in Figure 1.
[0131] Furthermore, certain embodiments of this disclosure have the following advantages: (1) This disclosure provides an end-to-end model capable of cross-domain relevance calculation. The model is designed to accept flexible inputs, including text-to-text search, image-to-image search, and more advanced text-to-(text+image), image-to-(text+image), text-to-image, image-to-text, and (text+image)-to-(text+image) search. (2) This disclosure approximates image features as feature sequences similar to text sequences, thereby allowing features from different domains to be concatenated and fed into a converter model. Specifically, a given product image is first segmented into non-overlapping grid blocks, each of which approximates a word in a language. The grid blocks are then flattened into a sequence, so that the image is treated as a sentence. Subsequently, image features are extracted from each grid block using a CNN, and the features of each grid block serve as the counterpart of a language feature element in the NLP domain. Finally, these image features are embedded using position embedding, fragment embedding, and mask embedding in a manner similar to that used for language features. Approximating image processing with text processing therefore breaks down the barrier between the image domain and the text domain.
[0132] The foregoing description of exemplary embodiments of this disclosure is presented for illustrative and descriptive purposes only and is not intended to be exhaustive or to limit this disclosure to the precise form disclosed. Many modifications and variations are possible in accordance with the foregoing teachings.
[0133] The embodiments were chosen and described to explain the principles of this disclosure and its practical application, thereby enabling others skilled in the art to utilize this disclosure and various embodiments, as well as various modifications suitable for the particular intended use. Alternative embodiments will become apparent to those skilled in the art to which this disclosure pertains without departing from the spirit and scope of this disclosure. Therefore, the scope of this disclosure is defined by the appended claims rather than the foregoing description and the exemplary embodiments described therein.
Claims
1. A computer-implemented method for searching for products corresponding to a query from a customer, comprising:
embedding the query to obtain a query embedding;
retrieving product information including product text and product images;
embedding the product text to obtain a product text embedding, embedding the product image to obtain a product image embedding, and combining the product text embedding and the product image embedding to obtain a product embedding, wherein the product image embedding has the same format as the product text embedding;
providing the query embedding and the product embedding to a converter to determine whether the query and the product are relevant; and
when the query is relevant to the product, providing the product as a search result for the query;
wherein embedding the product image includes:
normalizing the product image to obtain a normalized product image;
dividing the normalized product image into multiple grids;
connecting the multiple grids into a grid sequence;
extracting product image features from the grid sequence to obtain grid feature elements for each grid sequence;
adding a position vector to each of the grid feature elements, the position vector representing the position of the grid in the grid sequence;
adding a fragment vector to each of the grid feature elements, the fragment vector representing the identifier of the product image;
adding a mask vector to each of the grid feature elements, wherein the value of the mask vector is 0 or 1, and when the value of one of the mask vectors is 0, the value of the corresponding grid feature element is converted to 0; and
defining a category identifier, which indicates the category of the product on the e-commerce platform;
wherein the product embedding includes the grid feature elements, the position vector, the fragment vector, the mask vector, and the category identifier.
2. The method according to claim 1, wherein the converter includes a query converter for processing the query embedding and a product converter for processing the product embedding.
3. The method according to claim 1, wherein the query includes query text and a query image, and the query embedding includes a query text embedding corresponding to the query text and a query image embedding corresponding to the query image.
4. The method according to claim 1, wherein 10-15% of the mask vectors have a value of 0.
5. The method according to claim 1, wherein extracting product image features from the grid sequence includes applying a convolutional neural network (CNN) to each grid, and the converter is a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model.
6. A system for searching for products corresponding to a query from a customer, wherein the system includes a computing device, which includes a processor and a storage device storing computer-executable code, the computer-executable code being configured to:
embed the query to obtain a query embedding;
retrieve product information including product text and product images;
embed the product text to obtain a product text embedding, embed the product image to obtain a product image embedding, and combine the product text embedding and the product image embedding to obtain a product embedding, wherein the product image embedding has the same format as the product text embedding;
provide the query embedding and the product embedding to a converter to determine whether the query and the product are relevant; and
when the query is relevant to the product, provide the product as a search result for the query;
wherein the computer-executable code is configured to embed the product image through the following steps:
normalizing the product image to obtain a normalized product image;
dividing the normalized product image into multiple grids;
connecting the multiple grids into a grid sequence;
extracting product image features from the grid sequence to obtain grid feature elements for each grid sequence;
adding a position vector to each of the grid feature elements, the position vector representing the position of the grid in the grid sequence;
adding a fragment vector to each of the grid feature elements, the fragment vector representing the identifier of the product image;
adding a mask vector to each of the grid feature elements, wherein the value of the mask vector is 0 or 1, and when the value of one of the mask vectors is 0, the value of the corresponding grid feature element is converted to 0; and
defining a category identifier, which indicates the category of the product on the e-commerce platform;
wherein the product embedding includes the grid feature elements, the position vector, the fragment vector, the mask vector, and the category identifier.
7. The system according to claim 6, wherein the converter includes a query converter for processing the query embedding and a product converter for processing the product embedding.
8. The system according to claim 6, wherein the query includes query text and a query image, and the query embedding includes a query text embedding corresponding to the query text and a query image embedding corresponding to the query image.
9. The system according to claim 6, wherein 10-15% of the mask vectors have a value of 0.
10. The system according to claim 6, wherein the computer-executable code is configured to extract product image features from the grid sequence by applying a convolutional neural network (CNN) to each grid, and the converter is a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model.
11. A non-transitory computer-readable medium storing computer-executable code, wherein the computer-executable code is configured to, when executed at a processor of a computing device:
embed a query to obtain a query embedding;
retrieve product information including product text and product images;
embed the product text to obtain a product text embedding, embed the product image to obtain a product image embedding, and combine the product text embedding and the product image embedding to obtain a product embedding, wherein the product image embedding has the same format as the product text embedding;
provide the query embedding and the product embedding to a converter to determine whether the query and the product are relevant; and
when the query is relevant to the product, provide the product as a search result for the query;
wherein the computer-executable code is configured to embed the product image through the following steps:
normalizing the product image to obtain a normalized product image;
dividing the normalized product image into multiple grids;
connecting the multiple grids into a grid sequence;
extracting product image features from the grid sequence to obtain grid feature elements for each grid sequence;
adding a position vector to each of the grid feature elements, the position vector representing the position of the grid in the grid sequence;
adding a fragment vector to each of the grid feature elements, the fragment vector representing the identifier of the product image;
adding a mask vector to each of the grid feature elements, wherein the value of the mask vector is 0 or 1, and when the value of one of the mask vectors is 0, the value of the corresponding grid feature element is converted to 0; and
defining a category identifier, which indicates the category of the product on the e-commerce platform;
wherein the product embedding includes the grid feature elements, the position vector, the fragment vector, the mask vector, and the category identifier.