Plug-and-play embedding enhancement in vector databases for search-based applications
By decomposing pre-computed image embeddings into linear combinations of task-specific vocabularies and using Dice coefficients for similarity search, the accuracy and efficiency issues of image retrieval in vector databases are addressed, resulting in more efficient image retrieval and improved performance for downstream tasks.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ROBERT BOSCH GMBH
- Filing Date
- 2025-12-31
- Publication Date
- 2026-06-30
AI Technical Summary
Existing vector database embeddings are not optimized for specific downstream retrieval applications, resulting in the omission of objects of interest in image retrieval and low computational efficiency.
By decomposing pre-computed image embeddings into a linear combination of task-specific vocabularies, a sparse weight set is generated, and similarity search is performed using Dice coefficients, thereby enhancing the retrieval performance of vector databases.
It improves the accuracy and diversity of image retrieval, enhances the performance of downstream tasks, and reduces computational costs.
Figure CN122309790A_ABST
Abstract
Description
Cross-reference to related applications
[0001] This patent application claims the benefit of U.S. Provisional Patent Application No. 63/740,802, filed December 31, 2024, which is incorporated herein by reference in its entirety.
Technical Field
[0002] This disclosure generally relates to digital data processing, and more specifically to systems and methods for enhancing vector databases used in retrieval-based applications and generating curated datasets from retrieved digital image data for training machine learning models.
Background Technology
[0003] Transforming unstructured data into a vector database of semantically rich embeddings enables a variety of retrieval-based applications (e.g., retrieval-augmented generation and data curation) that are crucial for the training and deployment of foundation models (FMs). However, embeddings are often pre-computed using FMs that have not been optimized for specific downstream retrieval applications. For example, image retrieval using embeddings generated by a Contrastive Language-Image Pre-training (CLIP) encoder may return irrelevant images that miss one or more objects of interest because they share other background elements with the query.
Summary of the Invention
[0004] The following is an overview of certain embodiments described in detail below. The described aspects are presented merely to provide the reader with a brief overview of these specific embodiments, and the description of these aspects is not intended to limit the scope of this disclosure. In fact, this disclosure may include a wide variety of aspects that may not be explicitly set forth below.
[0005] According to at least one aspect, a computer-implemented method relates to digital image retrieval. According to at least one aspect, the computer-implemented method may further relate to using digital image retrieval to generate a curated dataset for training a machine learning model. The method includes generating a vocabulary of visual concepts for a specific task using a target dataset. The vocabulary includes representative image embeddings or representative patch embeddings for each visual concept. The method includes retrieving pre-computed image embeddings from a vector database. The method includes decomposing each pre-computed image embedding into a linear combination of visual concepts. The method includes generating a weight set for each pre-computed image embedding based on the vocabulary. Each weight indicates the prominence of the corresponding representative image embedding or corresponding representative patch embedding. The method includes storing the weight set of each pre-computed image embedding in an enhanced vector database. The method includes retrieving a digital image set using the enhanced vector database in response to a query. For example, this digital image set is used to create a curated dataset for training the machine learning model, such as a classifier.
[0006] According to at least one aspect, a system includes one or more processors and one or more computer memories. The one or more computer memories are in data communication with the one or more processors. The one or more computer memories have computer-readable data stored thereon. The computer-readable data includes instructions that, when executed by the one or more processors, cause the one or more processors to perform a method for digital image retrieval. According to at least one aspect, the method may further involve using digital image retrieval to generate a curated dataset for training a machine learning model. The method includes generating a vocabulary of visual concepts for a specific task using a target dataset. The vocabulary includes representative image embeddings or representative patch embeddings for each visual concept. The method includes retrieving pre-computed image embeddings from a vector database. The method includes decomposing each pre-computed image embedding into a linear combination of the visual concepts. The method includes generating a weight set for each pre-computed image embedding based on the vocabulary. Each weight indicates the salience of a corresponding representative image embedding or corresponding representative patch embedding. The method includes storing the weight set of each pre-computed image embedding in an enhanced vector database. The method includes retrieving a digital image set using the enhanced vector database in response to a query. For example, this set of digital images is used to create a curated dataset for training machine learning models, such as classifiers.
[0007] These and other features, aspects, and advantages of the invention are discussed in the following detailed description with reference to the accompanying drawings, throughout which like characters denote similar or identical parts. Furthermore, the drawings are not necessarily drawn to scale, as some features may be enlarged or reduced to show details of specific components.
Brief Description of the Drawings
[0008] Figure 1 is an illustration of an example of a process associated with a Plug-and-Play Embedding Enhancement (PERA) system for retrieval-based applications, according to an example embodiment of this disclosure.
[0009] Figure 2 is a diagram comparing an example of image retrieval results using PERA with image retrieval results using a vector database, according to an example embodiment of this disclosure.
[0010] Figure 3 is an illustration of a first example of image retrieval results using image embeddings and image retrieval results using PERA, according to an example embodiment of this disclosure.
[0011] Figure 4 is an illustration of a second example of image retrieval results using image embeddings and image retrieval results using PERA, according to an example embodiment of this disclosure.
[0012] Figure 5 is an illustration of a third example of image retrieval results using image embeddings and image retrieval results using PERA, according to an example embodiment of this disclosure.
[0013] Figure 6 is an illustration of a first example of an application of PERA, according to an example embodiment of this disclosure.
[0014] Figure 7 is an illustration of a second example of an application of PERA, according to an example embodiment of this disclosure.
[0015] Figure 8 is an illustration of an example of a system including PERA, according to an example embodiment of this disclosure.
[0016] Figure 9 is a schematic diagram of the interaction between a computer-controlled machine and a control system, according to an example embodiment of this disclosure.
[0017] Figure 10 is a schematic diagram of the control system of Figure 9 configured to control a mobile machine that is at least partially or fully autonomous, according to an example embodiment of this disclosure.
[0018] Figure 11 is a schematic diagram of the control system of Figure 9 configured to control a manufacturing machine of a manufacturing system, such as a part of a production line, according to an example embodiment of this disclosure.
[0019] Figure 12 is a schematic diagram of the control system of Figure 9 configured as a control and monitoring system, according to an example embodiment of this disclosure.
[0020] Figure 13 is a schematic diagram of the control system of Figure 9 configured to control a medical imaging system, according to an example embodiment of this disclosure.
Detailed Description
[0021] From the foregoing description, it will become clear that the embodiments described herein, and their many advantages, have been shown and described by way of example, and that various changes in the form, construction, and arrangement of the components can be made without departing from the disclosed subject matter or sacrificing one or more of its advantages. In fact, the description of these embodiments is merely illustrative. These embodiments allow for various modifications and alternatives, and the appended claims are intended to cover and include such changes, and are not limited to the specific forms disclosed, but rather cover all modifications, equivalents, and alternatives falling within the spirit and scope of this disclosure.
[0022] Figure 1 illustrates an example overview of the process associated with Plug-and-Play Embedding Enhancement (PERA) 100 for retrieval-based applications. PERA 100 enhances the embeddings 50 stored in a vector database 140 to improve the performance of retrieval-based applications. PERA 100 includes a novel method that enhances the performance of retrieval applications without recomputing application-specific embeddings. Specifically, PERA 100 enhances the pre-computed embeddings 50 of the vector database 140 by decomposing them into linear combinations of embeddings tailored to downstream applications, which is computationally efficient. For each pre-computed embedding 50, the process includes generating a weight set 60 based on a vocabulary 130. Each weight set 60 is then stored in an enhanced vector database 160. Finally, the process includes using these decomposed sparse weights together with Dice coefficients for similarity searches via the enhanced vector database 160 to enhance the performance of downstream tasks.
[0023] For a given retrieval application, the process includes vocabulary generation 100A. Vocabulary generation 100A includes constructing an embedding dictionary for that retrieval application. This embedding dictionary is referred to herein as a task-specific vocabulary 130 (D). Specifically, the task-specific vocabulary 130 is derived from the target dataset 10 (D_T). For example, as shown in Figure 1, the process includes generating image embeddings 20 based on pixels of digital images from the target dataset 10 via an image encoder 110. Furthermore, the process includes a vocabulary generator 120 configured to generate the task-specific vocabulary 130 of visual concepts based on the image embeddings 20. The task-specific vocabulary 130 contains a representative set of embeddings tailored for downstream tasks. Each representative embedding is associated with a visual concept. The process includes a strategy of selecting embeddings that are both representative and sufficiently diverse to comprehensively cover the target dataset 10. For object-centric tasks, such as instance search and retrieval-augmented classification, the process uses image-level embeddings from the target dataset 10. For dense recognition tasks involving multiple objects per image, patch-level embeddings are employed. Furthermore, the vocabulary generator 120 is configured to group these embeddings into clusters 30 and select a centroid 40 from each cluster 30 to form the task-specific vocabulary 130. This approach accounts for both representativeness and diversity, thus balancing retrieval speed and memory requirements against performance.
[0024] To construct the task-specific vocabulary 130, the process employs slightly different strategies for different tasks. For instance search, where the query dataset is typically small (around 90 images), DBSCAN can be used to automatically determine the number of clusters. Conversely, for dense recognition (where the number of embeddings extracted from the target dataset may be large), the process may include using k-means clustering with the number of clusters set to 500. For retrieval-augmented classification designed to address long-tail problems, where few-shot categories may contain only 5 images, direct clustering may ignore these few-shot categories. Therefore, the process may include using the centroid 40 of the embeddings of each category as the vocabulary. For the hyperparameter settings of the Alternating Direction Method of Multipliers (ADMM), the process includes setting the maximum number of iterations k to 2000 and setting the penalty values τ and λ to 0.2 and 0.01, respectively.
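The per-category centroid strategy for long-tail classification can be sketched as follows. This is a minimal illustration in plain Python, not the patented implementation: the function name `class_centroid_vocabulary`, the toy two-dimensional embeddings, and the labels are assumptions made for the example.

```python
from collections import defaultdict


def class_centroid_vocabulary(embeddings, labels):
    """Return {label: centroid}, where each centroid is the mean embedding of one category.

    Hypothetical helper: every category contributes exactly one vocabulary entry,
    so few-shot categories are not lost the way they could be with direct clustering.
    """
    sums = {}
    counts = defaultdict(int)
    for vector, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vector)
        sums[label] = [s + x for s, x in zip(sums[label], vector)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


# Toy data (assumed): two "car" embeddings and a single "pedestrian" embedding.
embeddings = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = ["car", "car", "pedestrian"]
vocabulary = class_centroid_vocabulary(embeddings, labels)
```

For the instance-search and dense-recognition cases, the same role would be played by cluster centroids obtained from DBSCAN or k-means, respectively.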
[0025] Furthermore, the process includes sparse decomposition 100B. For example, as shown in Figure 1, the process includes decomposing the embeddings 50 of the vector database 140 (V_S) according to the task-specific vocabulary 130 via a linear solver 150. When decomposing the embeddings, the process considers two key properties: sparsity and non-negativity. Generally, sparse and non-negative combinations of embeddings are easier to understand, whereas negative values are often less intuitive and more difficult to interpret semantically. This motivates the optimization problem: reconstruct each embedding using a sparse, non-negative combination of representative embeddings from the task-specific vocabulary 130. Given the task-specific vocabulary 130 (D) and an embedding v, the sparse decomposition is obtained by minimizing the ℓ0 norm under the constraint of exact reconstruction, as expressed in Equation 1, where "subject to" is abbreviated as "s.t." and w denotes the weights.
[0026] Because minimizing the ℓ0 norm is a nondeterministic polynomial-time (NP) hard problem, the ℓ0 norm is replaced with the ℓ1 norm. The ℓ1 norm has been shown to also produce highly sparse solutions and has the advantage of being computationally more tractable due to its convexity, as expressed in Equation 2. The linearity in the weights w allows each weight in the weight set 60 to be interpreted as the prominence or salience of its corresponding embedding in the task-specific vocabulary 130. The sparse weights w are then used as the basis for similarity searches in retrieval applications.
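One plausible way to write the two problems described in paragraphs [0025] and [0026] is sketched below. It assumes that D holds the vocabulary embeddings as columns and that w is the weight vector for an embedding v; the exact form of Equations 1 and 2 may differ, for example by relaxing the exact-reconstruction constraint into a penalty term.

```latex
% Assumed form of the sparse, non-negative decomposition with exact reconstruction
\min_{w} \; \lVert w \rVert_{0} \quad \text{s.t.} \quad v = D w, \quad w \ge 0

% Assumed form of the convex relaxation, with the l0 norm replaced by the l1 norm
\min_{w} \; \lVert w \rVert_{1} \quad \text{s.t.} \quad v = D w, \quad w \ge 0
```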
[0027] As discussed above, the sparse decomposition 100B includes decomposing pre-computed embeddings 50 obtained from one or more vector databases 140. Specifically, the embeddings 50 are decomposed into linear combinations of embeddings from a dictionary (e.g., a task-specific vocabulary 130) via a linear solver 150. The pre-computed embeddings 50 are decomposed into sparse, non-negative combinations of the task-specific vocabulary 130. Furthermore, the process includes storing each weight set in an enhanced vector database 160 and using the decomposed sparse weights for retrieval applications to achieve better performance. Using this approach, the process enhances the pre-computed embeddings 50 with a lightweight decomposition method that balances computational cost and performance for downstream applications.
[0028] Furthermore, the process includes performing a similarity search. While cosine similarity is a widely used metric for calculating the similarity between two embeddings in a vector database, it is not well suited to PERA 100. Specifically, in the context of PERA 100, where image embeddings are decomposed and weights represent the presence and/or salience of vocabulary entries, cosine similarity may not adequately capture subtle overlaps between the weights w. To overcome this limitation, the process employs the Dice coefficient, as expressed in Equation 3. The Dice coefficient specifically quantifies the overlap between two sets, making it sensitive to overlap in the weights. By using the Dice coefficient for similarity search, the process ensures that instances with significant semantic overlap with the query instance are prioritized, thus focusing on the presence of key vocabulary entries rather than overall semantic information.
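A Dice-based similarity search over sparse weight sets might look like the sketch below. It assumes the classical set-based Dice definition applied to the non-zero entries of each weight set; Equation 3 may instead weight entries by magnitude. The names `dice_coefficient` and `search`, the dictionary layout (vocabulary index mapped to weight), and the toy database are all hypothetical.

```python
def dice_coefficient(weights_a, weights_b):
    """Dice overlap between the supports (positive entries) of two sparse weight sets."""
    support_a = {index for index, weight in weights_a.items() if weight > 0}
    support_b = {index for index, weight in weights_b.items() if weight > 0}
    if not support_a and not support_b:
        return 0.0
    return 2 * len(support_a & support_b) / (len(support_a) + len(support_b))


def search(query_weights, database, top_k=5):
    """Rank stored weight sets by Dice overlap with the query and return the top image IDs."""
    scored = [(dice_coefficient(query_weights, w), image_id) for image_id, w in database.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:top_k]]


# Toy enhanced database (assumed): image ID -> {vocabulary index: weight}.
database = {
    "img_a": {0: 0.6, 3: 0.4},
    "img_b": {1: 1.0},
    "img_c": {0: 0.2, 3: 0.5, 7: 0.3},
}
query = {0: 0.7, 3: 0.3}
results = search(query, database, top_k=2)
```

Because only shared vocabulary entries contribute to the score, an image that shares the query's key concepts ranks above one that is merely close in overall embedding direction.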
[0029] Furthermore, the process can be scaled by leveraging graphics processing units (GPUs). While the aforementioned optimization problem can be solved directly on the central processing unit (CPU) using widely used libraries such as scikit-learn, GPU acceleration becomes necessary for large vector databases with millions of instances. Therefore, the process includes an implementation of the Alternating Direction Method of Multipliers (ADMM) algorithm in GPU-enabled PyTorch for efficient decomposition. To apply ADMM to this task, Equation 2 is rewritten as Equation 4. Equation 5 then defines the Lagrangian function of Equation 4 with a penalty parameter.
[0030] With the other two variables held fixed, the update of w is computed via Equation 6. Furthermore, the update of the second primal variable is computed via Equation 7, in which S (subscripted with λ and τ) is an element-wise soft-thresholding operator. In addition, the dual update rule is computed via Equation 8.
[0031] As described above, the steps outlined in Equations 6, 7, and 8 can be executed efficiently. In practice, the process iterates until convergence or until the maximum number of iterations, set to 2000, is reached. A single GPU can process the decomposition of approximately 2000 embeddings per second, which is about 50 times faster than extracting new embeddings using an FM on the same hardware.
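The three-step iteration can be sketched as a standard ADMM loop for a non-negative, ℓ1-regularized least-squares problem. Several things here are assumptions rather than details taken from the disclosure: the relaxed objective (reconstruction error plus λ times the ℓ1 norm), the reading of τ as the ADMM penalty parameter and λ as the sparsity weight, the threshold λ/τ, the function name `admm_nonneg_lasso`, and the use of NumPy on the CPU in place of GPU-enabled PyTorch.

```python
import numpy as np


def admm_nonneg_lasso(D, v, lam=0.01, tau=0.2, max_iter=2000, tol=1e-8):
    """Approximately solve min_w 0.5*||v - D w||^2 + lam*||w||_1 subject to w >= 0.

    D has one vocabulary embedding per column; v is the embedding to decompose.
    """
    n_atoms = D.shape[1]
    system = D.T @ D + tau * np.eye(n_atoms)  # fixed across iterations
    Dtv = D.T @ v
    z = np.zeros(n_atoms)  # sparse, non-negative copy of the weights
    u = np.zeros(n_atoms)  # scaled dual variable
    for _ in range(max_iter):
        # Weight update: a ridge-like linear solve with z and u held fixed.
        w = np.linalg.solve(system, Dtv + tau * (z - u))
        # Element-wise soft threshold, clipped at zero to enforce non-negativity.
        z_new = np.maximum(0.0, w + u - lam / tau)
        # Dual update accumulates the remaining disagreement between w and z.
        u = u + w - z_new
        converged = np.linalg.norm(z_new - z) < tol and np.linalg.norm(w - z_new) < tol
        z = z_new
        if converged:
            break
    return z


# Toy orthonormal vocabulary (assumed): the embedding mixes entries 0 and 2 only.
D = np.eye(3)
v = np.array([0.7, 0.0, 0.3])
weights = admm_nonneg_lasso(D, v)  # approximately [0.69, 0.0, 0.29]
```

For a large database, the same updates could be batched over many embeddings at once, since the system matrix depends only on the vocabulary and not on the embedding being decomposed.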
[0032] Figure 2 The advantages of PERA image retrieval over contrastive image retrieval are illustrated for a given query image 200. Specifically, query image 200 is a digital image showing a road, sidewalk, trees, and buildings. The road has two lanes. Furthermore, query image 200 shows the front sides of some cars 200B parked on one side of the road, while also showing at least one motorcycle 200A traveling on the same side of the road. Query image 200 is then used as a query to obtain (i) PERA retrieval results 230 using weight set 210 and enhanced vector database 220, and (ii) contrastive retrieval results 260 using image embeddings and vector database 250.
[0033] The process associated with PERA image retrieval includes generating an image embedding using pixels from the query image 200 via an image encoder. The image embedding is decomposed into a linear combination of visual concepts from a task-specific vocabulary. A weight set 210 is generated based on the task-specific vocabulary. This weight set 210 is then used in a similarity search to retrieve a set of digital images from an enhanced vector database 220. Figure 2 illustrates an example of a PERA search result 230 based on the enhanced vector database 220. As shown, the PERA search result 230 is a digital image. Specifically, the PERA search result 230 displays a road with multiple lanes, trees, sidewalks, and buildings. Furthermore, the PERA search result 230 also displays a motorcycle 230A traveling in one lane and a car 230B traveling in another lane. In this respect, the PERA search result 230 succeeds in retrieving and capturing objects of interest (e.g., motorcycles, cars, etc.).
[0034] Conversely, the process associated with contrastive image retrieval involves generating an image embedding 240 using pixels from the query image 200 via an image encoder. The image embedding 240 is then used directly in the similarity search to retrieve a set of digital images from the vector database 250. Figure 2 illustrates an example of a contrastive search result 260 based on the vector database 250. As shown, the contrastive search result 260 is a digital image. Specifically, the contrastive search result 260 displays a road with multiple lanes, trees, sidewalks, and buildings. However, in contrast to the query image 200 and the PERA search result 230, the contrastive search result 260 misses many objects of interest (e.g., motorcycles, cars, etc.). Therefore, the PERA search result 230 is more similar to the query image 200 than the contrastive search result 260. When provided with the same query image 200, the PERA search result 230 thus provides a better and more valuable result than the contrastive search result 260.
[0035] Figures 3, 4, and 5 illustrate several examples of images retrieved for pre-training to highlight the advantages of PERA 100. In Figures 3, 4, and 5, the top image is the query image from the Cityscapes dataset, while the other images show the top 5 images retrieved from the nuImages dataset using CLIP embeddings with and without PERA enhancement. First, PERA 100 produces more diverse results than using CLIP embeddings alone. Further examination reveals that images retrieved using only CLIP embeddings often overlook important objects. For example, in a simple and clear scene (Figure 3) where the query image 300 includes a large truck, the images retrieved using CLIP embeddings do not contain large vehicles. Conversely, the results using PERA often contain large vehicles, such as buses (e.g., the boxes in Figure 3), which even match the yellow color of the truck in the query image. In a more complex scenario (Figure 4), the query image 400 includes multiple elements such as cars, pedestrians, buildings, trees, and intersections. Here, the top 5 results using CLIP embeddings fail to include pedestrians. This echoes the observation that embedding complexity can sometimes lead to important objects in a scene being overlooked. Conversely, the results using PERA include many pedestrians (e.g., the boxes in Figure 4). Furthermore, in scenarios with unusual features, such as the query image 500 containing a unique painted advertisement, the PERA results include similar advertisements on diverse vehicles such as trucks, cruise ships, and buses (e.g., the boxes in Figure 5), and are more specific and diverse than the CLIP results. These improvements in diversity and relevance by PERA improve not only pre-training performance but also effectiveness in subsequent downstream tasks.
[0036] Figure 6 illustrates a retrieval-based application using PERA image retrieval via an enhanced vector database 630. Specifically, Figure 6 illustrates an example of a retrieval-augmented generation/classification process 600. The core concept involves retrieving relevant information from external knowledge sources to enhance the performance of machine learning systems, such as classification models or classifiers. In computer vision, retrieval-augmented classification has been used to address long-tail challenges in classification.
[0037] As shown in Figure 6, process 600 involves training a more robust image classifier 610 by leveraging external knowledge through PERA 100. Specifically, in Figure 6, PERA 100 generates weights 620 for the query data. These weights 620 are used in a similarity search to retrieve digital images from the enhanced vector database 630. For example, as discussed with respect to Figure 1, the enhanced vector database 630 is generated via PERA 100 according to a specific task. Furthermore, as shown in Figure 6, the query data and the digital images from the PERA image retrieval results are used to train machine learning systems, such as the image classifier 610. As a performance metric, experiments have shown that retrieval-augmented classification using PERA 100 achieves an accuracy (ACC) gain of +7.5.
[0038] Figure 7 illustrates another retrieval-based application using PERA image retrieval via an enhanced vector database 730. Specifically, Figure 7 illustrates an example of a data curation process 700 for pre-training a machine learning system 710 (e.g., an FM). Pre-trained FMs have achieved significant performance gains across many computer vision tasks, driven primarily by large-scale pre-training datasets. However, raw web data may contain 60% to 90% noisy or uninformative content, which wastes computational resources and can degrade final performance. To address these challenges, the data curation process 700 starts from a well-curated dataset.
[0039] As shown in Figure 7, process 700 involves pre-training the machine learning system 710 using PERA image retrieval results obtained from curated data. Specifically, in Figure 7, PERA 100 generates weights 720 for the curated data. These weights 720 are used in a similarity search to retrieve digital images from the enhanced vector database 730. For example, as discussed with respect to Figure 1, the enhanced vector database 730 is generated via PERA 100 according to a specific task. Process 700 includes using only digital images from the PERA image retrieval results for model pre-training. The PERA image retrieval results are then used to pre-train the machine learning system 710 to improve performance on downstream tasks such as instance segmentation. Furthermore, as a performance metric, experiments have shown that this model pre-training improves the mean average precision (mAP) by up to 1.0 for downstream instance segmentation tasks.
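The curation step of process 700 can be pictured as collecting the retrieval results of every curated query into one de-duplicated pre-training set. The sketch below assumes a `search(query, top_k)` callable that returns image IDs; `curate_dataset`, `toy_search`, and the toy index are hypothetical stand-ins, not part of the disclosure.

```python
def curate_dataset(curated_queries, search, top_k=5):
    """Return the union of top-k retrieved image IDs over all queries, in first-seen order."""
    selected = []
    seen = set()
    for query in curated_queries:
        for image_id in search(query, top_k):
            if image_id not in seen:
                seen.add(image_id)
                selected.append(image_id)
    return selected


def toy_search(query, top_k):
    """Toy stand-in (assumed) for a similarity search against the enhanced vector database."""
    index = {"street": ["img_1", "img_2", "img_3"], "truck": ["img_2", "img_4"]}
    return index[query][:top_k]


pretraining_ids = curate_dataset(["street", "truck"], toy_search, top_k=2)
```

Only the images selected this way would be passed on to pre-training, which is how noisy or uninformative raw data is kept out of the training set.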
[0040] Figure 8 illustrates an example of a system 800 including PERA 100 according to at least one example embodiment. System 800 includes at least a processing system 802. Processing system 802 includes one or more processing devices. For example, processing system 802 includes at least one or more GPUs. Processing system 802 may further include an electronic processor, CPU, microprocessor, field-programmable gate array (FPGA), application-specific integrated circuit (ASIC), any suitable processing technology, or any number and combination thereof. Processing system 802 is operable to provide the functionality described herein.
[0041] System 800 includes at least a memory system 810 operatively connected to processing system 802. The memory system 810 is in data communication with processing system 802. In an example embodiment, memory system 810 includes at least one non-transitory computer-readable medium configured to store various types of data and provide access to that data, enabling at least processing system 802 to perform the operations and functionalities disclosed herein. In an example embodiment, memory system 810 includes a single device or multiple devices. Memory system 810 may include any suitable storage technology, whether electrical, electronic, magnetic, optical, semiconductor, or electromagnetic, that is operable with system 800. For example, in an example embodiment, memory system 810 may include random access memory (RAM), read-only memory (ROM), flash memory, a disk drive, a memory card, an optical storage device, a magnetic storage device, a memory module, any suitable type of memory device, or any combination thereof.
[0042] The memory system 810 includes at least PERA 100, an application program 812, various PERA data 814, and other related data 816 stored thereon. The memory system 810 includes computer-readable data configured to provide the functions and processes described in this disclosure when executed by the processing system 802. The computer-readable data may include instructions, code, routines, various related data, any software technology, or any number and combination thereof. Specifically, the application program 812 includes computer-readable data with instructions that, when executed by the processing system 802, provide an application platform for PERA 100 to operate with other components of the system 800 and interact with a user. Furthermore, PERA 100 includes computer-readable data with instructions configured to perform at least the process described with respect to Figure 1. PERA 100 also includes an image encoder 110, a vocabulary generator 120, a task-specific vocabulary 130, a vector database 140, a linear solver 150, and an enhanced vector database 160, or some applicable combinations/variations thereof. Furthermore, the various PERA data 814 include various image data, various image embedding data, various image identifiers (IDs), various weight data, various similarity calculation data, various parameter data, and any related PERA data (e.g., vector databases, enhanced vector databases, machine learning data, etc.) that enable system 800 to perform the functions disclosed herein. For example, various training data include at least various digital image/video data, etc. Meanwhile, the other related data 816 provides various data (e.g., operating systems, etc.) that enable system 800 to perform the functions discussed herein.
[0043] In example embodiments, as shown in Figure 8, system 800 is configured to include at least one sensor system 804. Sensor system 804 includes one or more sensors. For example, sensor system 804 includes an image sensor or camera configured to capture digital images and/or digital video. Sensor system 804 may also include a radar sensor, a light detection and ranging (LIDAR) sensor, a thermal sensor, an ultrasonic sensor, an infrared sensor, a motion sensor, an audio sensor, an inertial measurement unit (IMU), any suitable sensor, or any combination thereof. Sensor system 804 is operable to communicate with one or more other components of system 800, such as processing system 802 and memory system 810. More specifically, for example, processing system 802 is configured to acquire sensor data directly or indirectly from at least one sensor. Sensor system 804 and/or processing system 802 are configured to generate digital images and/or digital video. Processing system 802 is configured to process digital images and/or digital video in connection with PERA 100 and the various PERA data 814.
[0044] In addition, system 800 includes other components that contribute to PERA 100. For example, as shown in Figure 8, the memory system 810 is also configured to store additional related data 816 relating to the operation of one or more components, such as the sensor system 804, the input/output (I/O) system 806, and other functional modules 808. Furthermore, the I/O system 806 includes I/O interfaces and may include one or more devices (e.g., a display device, a keyboard device, a speaker device, etc.). Additionally, the system 800 includes other functional modules 808, such as any suitable hardware technology, software technology, or combination thereof that assists in or contributes to the operation of the system 800. For example, the other functional modules 808 include communication technologies, as described herein, that enable the components of the system 800 to communicate at least with each other. The communication technologies may also enable the system 800 to communicate with other network devices (not shown) via a communication network. With at least the configuration discussed in the example of Figure 8, PERA 100 is able to perform the functions discussed in this disclosure.
[0045] Figure 9 depicts a schematic diagram of the interaction between a computer-controlled machine 900 and a control system 902. The computer-controlled machine 900 includes an actuator 904 and a sensor 906. The actuator 904 may include one or more actuators, and the sensor 906 may include one or more sensors. The sensor 906 is configured to sense a condition of the computer-controlled machine 900. The sensor 906 may be configured to encode the sensed condition into a sensor signal 908 and transmit the sensor signal 908 to the control system 902. Non-limiting examples of the sensor 906 include video, radar, LiDAR, ultrasonic sensors, image sensors, audio sensors, motion sensors, etc. In some embodiments, the sensor 906 is an optical sensor configured to sense an optical image of the environment proximate to the computer-controlled machine 900.
[0046] The control system 902 is configured to receive sensor signals 908 from a computer-controlled machine 900. As described below, the control system 902 may be further configured to calculate an actuator control command 910 based on the sensor signals and transmit the actuator control command 910 to the actuator 904 of the computer-controlled machine 900.
[0047] As shown in Figure 9, the control system 902 includes a receiving unit 912. The receiving unit 912 can be configured to receive sensor signals 908 from sensor 906 and transform the sensor signals 908 into input signals x. In an alternative embodiment, the sensor signals 908 are received directly as input signals x without the receiving unit 912. Each input signal x may be a part of each sensor signal 908. The receiving unit 912 can be configured to process each sensor signal 908 to generate each input signal x. The input signals x may include data corresponding to the image recorded by sensor 906.
[0048] The control system 902 includes a classifier 914 (e.g., an image classifier 610) trained on a training dataset comprising at least one set of digital images retrieved via PERA 100. The classifier 914 can be configured to classify an input signal x into one or more labels using a machine learning (ML) algorithm. The classifier 914 is configured to be parameterized by parameters, such as those described above (e.g., θ). The parameter θ can be stored in and provided by a non-volatile storage device 916. The classifier 914 is configured to determine an output signal y from the input signal x. Each output signal y includes information assigning one or more labels to each input signal x. The classifier 914 can transmit the output signal y to a conversion unit 918. The conversion unit 918 is configured to convert the output signal y into an actuator control command 910. The control system 902 is configured to transmit actuator control command 910 to actuator 904, which is configured to drive computer-controlled machine 900 in response to actuator control command 910. In some embodiments, actuator 904 is configured to drive computer-controlled machine 900 directly based on output signal y.
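As a minimal sketch of the signal flow just described (sensor signal 908, input signal x, output signal y, actuator control command 910), the following Python example chains a receiving unit, a classifier, and a conversion unit. All function names, the min-max scaling, the labels, and the toy threshold classifier are hypothetical stand-ins chosen for illustration; the disclosure does not specify them, and a real classifier 914 would be a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class ActuatorCommand:
    """Hypothetical container for an actuator control command."""
    action: str


def receiving_unit(sensor_signal: Sequence[float]) -> list[float]:
    """Transform a raw sensor signal into an input signal x.

    Assumption: simple min-max scaling stands in for the real preprocessing.
    """
    lo, hi = min(sensor_signal), max(sensor_signal)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant signal
    return [(v - lo) / span for v in sensor_signal]


def toy_classifier(x: list[float]) -> str:
    """Stand-in for a trained classifier: thresholds the mean of x."""
    return "pedestrian" if sum(x) / len(x) > 0.5 else "background"


def conversion_unit(label: str) -> ActuatorCommand:
    """Convert an output signal y (a label) into an actuator control command."""
    return ActuatorCommand("brake" if label == "pedestrian" else "continue")


def control_step(
    sensor_signal: Sequence[float],
    classifier: Callable[[list[float]], str],
) -> ActuatorCommand:
    """One pass through the pipeline: sensor signal -> x -> y -> command."""
    x = receiving_unit(sensor_signal)
    y = classifier(x)
    return conversion_unit(y)
```

Passing the classifier as an argument keeps the pipeline independent of how the model was trained, which mirrors the separation between the classifier and the conversion unit in the description.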
[0049] Upon receiving an actuator control command 910, actuator 904 is configured to perform an action corresponding to the relevant actuator control command 910. Actuator 904 may include control logic configured to transform the actuator control command 910 into a second actuator control command used to control actuator 904. In one or more embodiments, actuator control command 910 may be used to control a display instead of, or in addition to, an actuator.
[0050] In some embodiments, the control system 902 includes sensor 906 instead of, or in addition to, the computer-controlled machine 900 including sensor 906. Likewise, the control system 902 may include actuator 904 instead of, or in addition to, the computer-controlled machine 900 including actuator 904. As shown in Figure 9, the control system 902 also includes a processor 920 and a memory 922. The processor 920 may include one or more processors. The memory 922 may include one or more memory devices. A classifier 914 of one or more embodiments may be implemented by the control system 902, which includes a non-volatile storage device 916, a processor 920, and a memory 922.
[0051] Non-volatile storage device 916 may include one or more permanent data storage devices, such as hard disk drives, optical disk drives, magnetic tape drives, non-volatile solid-state devices, cloud storage devices, or any other device capable of permanently storing information. Processor 920 may include one or more devices selected from a high-performance computing (HPC) system. Processor 920 may include one or more high-performance cores, graphics processing units, microprocessors, microcontrollers, digital signal processors, microcomputers, central processing units, field-programmable gate arrays, programmable logic devices, state machines, logic circuits, analog circuits, digital circuits, or any other device that manipulates (analog or digital) signals based on computer-executable instructions residing in memory 922. Memory 922 may include a single memory device or a number of memory devices, including but not limited to RAM, volatile memory, non-volatile memory, static random access memory (SRAM), dynamic random access memory (DRAM), flash memory, cache memory, or any other device capable of storing information.
[0052] Processor 920 may be configured to read into memory 922 and execute computer-executable instructions residing in non-volatile storage device 916, embodying one or more ML algorithms and / or methodologies of one or more embodiments. Non-volatile storage device 916 may include one or more operating systems and applications. Non-volatile storage device 916 may store computer programs compiled and / or interpreted using various programming languages and / or techniques, including but not limited to Java, C, C++, C#, Objective C, Fortran, Pascal, JavaScript, Python, Perl, and PL / SQL, individually or in combination.
[0053] The computer-executable instructions of the non-volatile storage device 916, when executed by the processor 920, can cause the control system 902 to implement one or more ML algorithms and / or methodologies as disclosed herein to employ the classifier 914. The non-volatile storage device 916 may also include ML data (including model parameters) supporting the functionality, features, and processes of one or more embodiments described herein.
[0054] Program code embodying the algorithms and / or methodologies described herein can be distributed individually or collectively as a program product in various different forms. The program code can be distributed using a computer-readable storage medium having computer-readable program instructions thereon, for causing a processor to perform aspects of one or more embodiments. Computer-readable storage media, which are inherently non-transitory, can include tangible media, implemented in any method or technology, that are volatile or non-volatile, and that are removable or non-removable, for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media can further include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state storage technologies, portable compact disc read-only memory (CD-ROM) or other optical storage devices, magnetic cassettes, magnetic tape, disk storage devices or other magnetic storage devices, or any other medium that can be used to store desired information and can be read by a computer. Computer-readable program instructions can be downloaded from the computer-readable storage medium to a computer, another type of programmable data processing device or another device, or downloaded via a network to an external computer or external storage device.
[0055] Computer-readable program instructions stored in a computer-readable medium can be used to direct a computer, other type of programmable data processing apparatus, or other device to operate in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instructions that implement the functions, actions, and / or operations specified in a flowchart or diagram. In some alternative embodiments, the functions, actions, and / or operations specified in the flowcharts and diagrams may be reordered, processed sequentially, and / or processed concurrently in accordance with one or more embodiments. Furthermore, any of the flowcharts and / or diagrams may include more or fewer nodes or blocks than illustrated in accordance with one or more embodiments. Additionally, suitable hardware components, such as ASICs, FPGAs, state machines, controllers, or other hardware components, or devices, or combinations of hardware, software, and firmware components, may be used to embody these processes, methods, or algorithms wholly or partially.
[0056] Figure 10 depicts a schematic diagram of a control system 902 configured to control a vehicle 1000, which may be an at least partially autonomous vehicle or an at least partially autonomous robot. The vehicle 1000 includes actuators 904 and sensors 906. Sensors 906 may include one or more video sensors, cameras, radar sensors, ultrasonic sensors, LiDAR sensors, and / or position sensors (e.g., Global Positioning System). One or more of these specific sensors may be integrated into the vehicle 1000. Instead of or in addition to the one or more specific sensors identified above, sensor 906 may include a software module configured to determine the state of actuator 904 upon execution. A non-limiting example of the software module includes a weather information software module configured to determine the current or future state of weather near the vehicle 1000 or at another location.
[0057] The classifier 914 of the control system 902 of vehicle 1000 can be configured to detect objects near vehicle 1000 depending on the input signal x. In such an embodiment, the output signal y may include information classifying or characterizing the objects near vehicle 1000. An actuator control command 910 can be determined based on this information. The actuator control command 910 can be used to avoid collisions with the detected objects.
[0058] In some embodiments, vehicle 1000 is at least partially autonomous or fully autonomous. Actuator 904 may be embodied in the vehicle 1000's brakes, propulsion system, engine, transmission, steering mechanism, etc. Actuator control command 910 may be determined to control actuator 904 so that vehicle 1000 avoids collisions with detected objects. Detected objects may also be classified according to what classifier 914 deems them most likely to be, such as pedestrians, trees, any suitable labels, etc. Actuator control command 910 may be determined based on the classification.
[0059] In some embodiments where the vehicle 1000 is at least partially autonomous, the vehicle 1000 may be a mobile robot configured to perform one or more functions, such as flying, swimming, diving, and walking. The mobile robot may be an at least partially autonomous lawnmower or an at least partially autonomous cleaning robot. In such embodiments, actuator control commands 910 may be determined to control the mobile robot's propulsion unit, steering unit, and / or braking unit, enabling the mobile robot to avoid collisions with identified objects.
[0060] In some embodiments, the vehicle 1000 is an at least partially autonomous robot in the form of a gardening robot. In such embodiments, the vehicle 1000 may use an optical sensor as sensor 906 to determine the state of plants in the environment proximate to the vehicle 1000. The actuator 904 may be a nozzle configured to spray chemicals. Depending on the identified plant species and / or the identified plant state, an actuator control command 910 may be determined to cause the actuator 904 to spray an appropriate amount of a suitable chemical onto the plant.
[0061] The vehicle 1000 may be a robot that is at least partially autonomous and takes the form of a household appliance. As a non-limiting example, the household appliance may include a washing machine, stove, oven, microwave oven, dishwasher, etc. In such a vehicle 1000, the sensor 906 may be an optical sensor configured to detect the state of an object to be processed by the household appliance. For example, in the case of a washing machine, the sensor 906 may detect the state of the laundry inside the washing machine. An actuator control command 910 may be determined based on the detected state of the laundry.
[0062] Figure 11 depicts a schematic diagram of a control system 902 configured to control a system 1100 (e.g., a manufacturing machine) of a manufacturing system 1102 (such as a part of a production line). The system 1100 may include stamping tools, cutting tools, or gun drills, etc. The control system 902 may be configured to control an actuator 904, which is configured to control the system 1100 (e.g., the manufacturing machine).
[0063] Sensor 906 of system 1100 (e.g., a manufacturing machine) may be an optical sensor configured to capture one or more attributes of the manufactured product 1104. Classifier 914 may be configured to determine the state of the manufactured product 1104 based on one or more captured attributes. Actuator 904 may be configured to control system 1100 (e.g., the manufacturing machine) based on the determined state of the manufactured product 1104 for subsequent manufacturing steps of the manufactured product 1104. Actuator 904 may also be configured to control functions of system 1100 (e.g., the manufacturing machine) applied to a subsequently manufactured product 1106, based on the determined state of the manufactured product 1104.
[0064] Figure 12 depicts a schematic diagram of a control system 902 configured to control a monitoring system 1200. The monitoring system 1200 can be configured to physically control access through a door 1202. A sensor 906 can be configured to detect scenarios related to determining whether access is permitted. The sensor 906 can be an optical sensor configured to generate and transmit image and / or video data. The control system 902 can use such data to detect a person's face.
[0065] The classifier 914 of the control system 902 of the monitoring system 1200 can be configured to interpret the image and / or video data by matching it against the identities of known persons stored in non-volatile storage device 916, thereby determining a person's identity. The classifier 914 can be configured to generate an actuator control command 910 in response to the interpretation of the image and / or video data. The control system 902 is configured to transmit the actuator control command 910 to the actuator 904. In this embodiment, the actuator 904 is configured to lock or unlock the door 1202 in response to the actuator control command 910. In some embodiments, non-physical logical access control is also possible.
[0066] The monitoring system 1200 can also be a supervision system. In such an embodiment, the sensor 906 can be an optical sensor configured to detect the supervised scene, and the control system 902 is configured to control a display 1204. The classifier 914 is configured to determine the classification of the scene, such as whether the scene detected by the sensor 906 is suspicious. The control system 902 is configured to transmit actuator control commands 910 to the display 1204 in response to the classification. The display 1204 can adjust the displayed content in response to the actuator control commands 910. For example, the display 1204 can highlight objects that the classifier 914 considers suspicious.
[0067] Figure 13 depicts a schematic diagram of a control system 902 configured to control an imaging system 1300, such as a magnetic resonance imaging (MRI) device, an X-ray imaging device, or an ultrasound device. A sensor 906 may be, for example, an imaging sensor. A classifier 914 may be configured to determine the classification of all or part of the sensed image. The classifier 914 may be configured to determine or select an actuator control command 910 in response to the classification obtained through a trained neural network. For example, the classifier 914 may interpret a region of the sensed image as a potential anomaly. In this case, the actuator control command 910 can be selected to cause the display 1302 to display the image and highlight the potentially anomalous region.
[0068] As described in this disclosure, embodiments include numerous advantageous features and benefits. For example, embodiments provide a technical solution to the problem: "Is it possible to enhance the performance of retrieval-based applications by improving pre-computed embeddings in a vector database without recomputing application-specific embeddings?" To address this problem, embodiments include PERA 100, which provides a novel approach of decomposing pre-computed embeddings into a linear combination of embeddings tailored to downstream applications (e.g., embeddings of foreground objects in an image). In this respect, PERA 100 addresses the challenge of improving pre-computed embeddings in a vector database for downstream retrieval applications without recomputing application-specific embeddings from the original dataset. Because the linear combination consists of embeddings that satisfy the requirements of the target retrieval application, PERA 100 enhances retrieval performance efficiently. Furthermore, PERA 100 is computationally efficient and does not use the original dataset.
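To make the decomposition concrete, the following Python sketch expresses one pre-computed embedding as a non-negative, sparse linear combination of vocabulary embeddings using the alternating direction method of multipliers (ADMM), which the claims name as one way to perform the decomposition. The objective (least squares with an L1 penalty and a non-negativity constraint), the function name, and the values of `lam`, `rho`, and `n_iter` are assumptions made for illustration; this section of the disclosure does not state the exact formulation, and the sketch runs on the CPU with NumPy rather than on a GPU.

```python
import numpy as np


def admm_nonneg_sparse(x, D, lam=0.05, rho=1.0, n_iter=200):
    """Approximately solve: min_w 0.5*||x - D w||^2 + lam*||w||_1, w >= 0.

    x : (d,) pre-computed embedding.
    D : (d, k) vocabulary, one representative embedding per column.
    Returns a (k,) non-negative, sparse weight vector.
    Assumption: this objective is illustrative, not taken from the source.
    """
    k = D.shape[1]
    # This matrix is identical in every w-update, so invert it once.
    A_inv = np.linalg.inv(D.T @ D + rho * np.eye(k))
    Dtx = D.T @ x
    z = np.zeros(k)  # constrained copy of the weights
    u = np.zeros(k)  # scaled dual variable
    for _ in range(n_iter):
        # w-update: ridge-regularized least squares.
        w = A_inv @ (Dtx + rho * (z - u))
        # z-update: soft-threshold, then clip at zero (non-negativity).
        z = np.maximum(0.0, w + u - lam / rho)
        # Dual update on the constraint w = z.
        u = u + w - z
    return z
```

Because only the vocabulary `D` and the stored embedding `x` are needed, a decomposition of this kind can be applied to an existing vector database without access to the original images, which is consistent with the property described above.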
[0069] Furthermore, experimental results confirm the usefulness and effectiveness of PERA 100, which significantly improves retrieval performance across various applications. Specifically, PERA 100 improves instance search performance by up to 23.1 mean average precision (mAP), enhances retrieval-augmented classification accuracy by up to 7.5%, and increases model pre-training accuracy for downstream instance segmentation tasks by up to 1.9 mAP.
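The retrieval step behind these results compares weight sets, and the claims name the Dice coefficient for the similarity search. The following Python sketch ranks stored weight sets against a query weight set using the classical set-based Dice coefficient, 2|A ∩ B| / (|A| + |B|), computed over the non-zero entries of each weight vector. Treating the non-zero support as the set, the function names, the `eps` threshold, and the image identifiers are assumptions for illustration; the disclosure may use a weighted variant.

```python
import numpy as np


def dice_coefficient(a, b, eps=1e-12):
    """Dice coefficient between the supports of two weight vectors.

    Assumption: an entry counts as present when its weight exceeds eps.
    """
    sa = np.asarray(a) > eps
    sb = np.asarray(b) > eps
    denom = sa.sum() + sb.sum()
    if denom == 0:
        return 0.0  # both weight sets are empty
    return float(2.0 * np.logical_and(sa, sb).sum() / denom)


def search(query_weights, db_weights, image_ids, top_k=3):
    """Return the top_k (image_id, score) pairs, best match first."""
    scores = np.array([dice_coefficient(query_weights, w) for w in db_weights])
    # A stable sort on the negated scores keeps database order among ties.
    order = np.argsort(-scores, kind="stable")[:top_k]
    return [(image_ids[i], float(scores[i])) for i in order]
```

For example, with a query whose non-zero weights fall on concepts 0 and 2, a stored image with the same two concepts scores 1.0, while an image sharing only concept 0 out of two scores 0.5.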
[0070] Furthermore, the above description is intended to be illustrative rather than restrictive, and is provided in the context of a particular application and its requirements. Those skilled in the art will understand from the foregoing description that the invention can be implemented in various forms, and various embodiments can be implemented individually or in combination. Therefore, although embodiments of the invention have been described in conjunction with specific examples of the invention, the general principles defined herein can be applied to other embodiments and applications without departing from the spirit and scope of the described embodiments, and the true scope of the embodiments and / or methods of the invention is not limited to the shown and described embodiments, as various modifications will become apparent to those skilled in the art upon review of the drawings, specification, and appended claims. Additionally or alternatively, components and functionality may be separated or combined in ways different from the various described embodiments, and may be described using different terms. These and other variations, modifications, additions, and improvements may fall within the scope of this disclosure as defined in the following claims.
Claims
1. A computer-implemented method for digital image retrieval, comprising: generating a vocabulary of visual concepts for a specific task using a target dataset, the vocabulary including a representative image embedding or a representative tile embedding for each visual concept; retrieving pre-computed image embeddings from a vector database; decomposing each pre-computed image embedding into a linear combination of the visual concepts; generating a weight set for each pre-computed image embedding based on the vocabulary, each weight indicating the salience of the corresponding representative image embedding or the corresponding representative tile embedding; storing the weight set for each pre-computed image embedding in an enhanced vector database; and retrieving a set of digital images using the enhanced vector database in response to a query.
2. The computer-implemented method of claim 1, wherein each linear combination of the visual concepts comprises non-negative and sparse weights.
3. The computer-implemented method according to claim 1, further comprising: implementing an alternating direction method of multipliers (ADMM) algorithm via a graphics processing unit (GPU) to perform the step of decomposing each pre-computed image embedding.
4. The computer-implemented method according to claim 1, further comprising: receiving another digital image as the query; generating another image embedding using pixels from the other digital image via an image encoder; decomposing the other image embedding into another linear combination of the visual concepts; generating a query weight set for the query based on the vocabulary; and performing a similarity search on the enhanced vector database using the query weight set.
5. The computer-implemented method of claim 1, wherein the similarity search is performed by employing the Dice coefficient.
6. The computer-implemented method according to claim 1, further comprising: receiving a target dataset including target images for the specific task; generating target image embeddings or target tile embeddings using pixels of the target images via an image encoder; selecting, for each visual concept, the representative image embedding or the representative tile embedding; and constructing the vocabulary to include each representative image embedding or each representative tile embedding for each cluster.
7. The computer-implemented method according to claim 6, further comprising: grouping the target image embeddings or the target tile embeddings into clusters; and computing a centroid of each cluster of the target image embeddings or the target tile embeddings, wherein each centroid is selected as the representative image for each cluster.
8. The computer-implemented method according to claim 1, further comprising: receiving the target dataset including target images; and generating tiles for each target image, wherein the target image embeddings are generated using the tiles.
9. The computer-implemented method according to claim 1, further comprising: storing, in the enhanced vector database, an image identifier corresponding to each weight set, wherein, when a similarity search is performed on the enhanced vector database using the query weight set, a corresponding set of image identifiers is used to retrieve the set of digital images.
10. The computer-implemented method according to claim 1, further comprising: generating a training dataset that includes the retrieved set of digital images; and training a machine learning model to perform the specific task using the training dataset.
11. A system comprising: one or more processors; and one or more computer memories in data communication with the one or more processors, the one or more computer memories having computer-readable data stored thereon, the computer-readable data including instructions that, when executed by the one or more processors, cause the one or more processors to perform a method for digital image retrieval, the method comprising: generating a vocabulary of visual concepts for a specific task using a target dataset, the vocabulary including a representative image embedding or a representative tile embedding for each visual concept; retrieving pre-computed image embeddings from a vector database; decomposing each pre-computed image embedding into a linear combination of the visual concepts; generating a weight set for each pre-computed image embedding based on the vocabulary, each weight indicating the salience of the corresponding representative image embedding or the corresponding representative tile embedding; storing the weight set for each pre-computed image embedding in an enhanced vector database; and retrieving a set of digital images using the enhanced vector database in response to a query.
12. The system of claim 11, wherein each linear combination of the visual concepts comprises non-negative and sparse weights.
13. The system according to claim 11, wherein: the one or more processors include a graphics processing unit (GPU); and an alternating direction method of multipliers (ADMM) algorithm is implemented via the GPU to perform the step of decomposing each pre-computed image embedding.
14. The system of claim 11, wherein the method further comprises: receiving another digital image as the query; generating another image embedding using pixels from the other digital image via an image encoder; decomposing the other image embedding into another linear combination of the visual concepts; generating a query weight set for the query based on the vocabulary; and performing a similarity search on the enhanced vector database using the query weight set.
15. The system of claim 11, wherein the similarity search is performed by employing the Dice coefficient.
16. The system of claim 11, wherein the method further comprises: receiving a target dataset including target images for the specific task; generating target image embeddings or target tile embeddings using pixels of the target images via an image encoder; selecting, for each visual concept, the representative image embedding or the representative tile embedding; and constructing the vocabulary to include each representative image embedding or each representative tile embedding for each cluster.
17. The system of claim 16, wherein the method further comprises: grouping the target image embeddings or the target tile embeddings into clusters; and computing a centroid of each cluster of the target image embeddings or the target tile embeddings, wherein each centroid is selected as the representative image for each cluster.
18. The system of claim 11, wherein the method further comprises: receiving the target dataset including target images; and generating tiles for each target image, wherein the target image embeddings are generated using the tiles.
19. The system of claim 11, wherein the method further comprises: storing, in the enhanced vector database, an image identifier corresponding to each weight set, wherein, when a similarity search is performed on the enhanced vector database using the query weight set, a corresponding set of image identifiers is used to retrieve the set of digital images.
20. The system of claim 11, wherein the method further comprises: generating a training dataset that includes the retrieved set of digital images; and training a machine learning model to perform the specific task using the training dataset.