Plug-and-play embedding improvement in vector databases for retrieval-based applications

PERA addresses the issue of suboptimal embeddings in vector databases by decomposing them into task-specific visual concepts, improving retrieval performance through sparse decomposition and similarity search, achieving enhanced relevance and diversity in image retrieval.

DE102025155331A1 · Undetermined · Publication Date: 2026-07-02 · ROBERT BOSCH GMBH

Patent Information

Authority / Receiving Office
DE · DE
Patent Type
Applications
Current Assignee / Owner
ROBERT BOSCH GMBH
Filing Date
2025-12-29
Publication Date
2026-07-02

AI Technical Summary

Technical Problem

Existing vector databases store embeddings that are not optimized for specific downstream retrieval applications. As a result, image retrieval can return irrelevant images that share background elements with the query but miss the objects of interest.

Method used

The Plug-and-Play Embedding Improvement (PERA) method decomposes pre-computed image embeddings into a linear combination of task-specific visual concepts. The combination is sparse and non-negative and is computed with the Alternating Direction Method of Multipliers (ADMM) using GPU acceleration. The resulting weights are compared with the Dice coefficient during similarity search to enhance retrieval performance.

Benefits of technology

PERA significantly improves retrieval performance by up to 23.1% in instance search, 7.5% in retrieval-enhanced classification, and 1.0 mAP in model pretraining for instance segmentation, enhancing the relevance and diversity of retrieved images.

✦ Generated by Eureka AI based on patent content.


Abstract

A computer-implemented method and a computer-implemented system relate to the retrieval of digital images and to data curation. Data curation can relate to training a machine learning model on at least one specific task. A vocabulary of visual concepts is generated for a specific task using a target dataset. The vocabulary contains a representative image embedding for each visual concept. Pre-computed image embeddings are retrieved from a vector database. Each pre-computed image embedding is decomposed into a linear combination of the visual concepts. For each pre-computed image embedding, a set of weights is generated based on the vocabulary. Each weight indicates the importance of a respective representative image embedding. The set of weights for each pre-computed image embedding is stored in an enhanced vector database. A set of digital images can be retrieved from the enhanced vector database in response to a query.

Description

Cross-reference to related application

This patent application claims priority to U.S. Provisional Patent Application No. 63/740,802, which was filed on December 31, 2024, and which is incorporated herein by reference in its entirety.

Technical field

This disclosure relates generally to digital data processing and in particular to systems and methods for improving vector databases used in retrieval-based applications and for generating curated datasets from retrieved digital image data for training machine learning models.

Background

Vector databases, which transform unstructured data into semantically rich embeddings, enable various retrieval-based applications (e.g., retrieval-enhanced generation and data curation) that are essential for training and deploying foundation models (FMs). However, the embeddings are often precomputed using FMs that are not optimized for specific downstream retrieval applications. For example, some image retrievals using embeddings generated by the Contrastive Language-Image Pretraining (CLIP) encoder can return irrelevant images that miss one or more objects of interest because they share other background elements with the query.

Summary

The following is a summary of certain embodiments, which are described in detail below. The aspects described are presented merely to provide the reader with a brief summary of these particular embodiments, and the description of these aspects is not intended to limit the scope of this disclosure. In fact, this disclosure may encompass a multitude of aspects that need not be expressly set forth below.

According to at least one aspect, a computer-implemented method relates to the retrieval of digital images. According to at least one aspect, the computer-implemented method may further relate to using the retrieval of digital images to generate curated datasets for training a machine learning model. The method includes generating a vocabulary of visual concepts for a specific task using a target dataset.
The vocabulary contains a representative image embedding or a representative patch embedding for each visual concept. The method includes retrieving pre-computed image embeddings from a vector database. The method includes decomposing each pre-computed image embedding into a linear combination of the visual concepts. The method includes generating a set of weights for each pre-computed image embedding based on the vocabulary. Each weight indicates the importance of a respective representative image embedding or a respective representative patch embedding. The method includes storing the set of weights for each pre-computed image embedding in an enhanced vector database. The method includes retrieving a set of digital images in response to a query using the enhanced vector database. As an example, the set of digital images is used to generate a curated dataset for training a machine learning model, such as a classifier.

According to at least one aspect, a system comprises one or more processors and one or more computer memories. The one or more computer memories are in data communication with the one or more processors. The one or more computer memories contain computer-readable data. The computer-readable data contains instructions which, when executed by the one or more processors, cause the one or more processors to perform a method for retrieving digital images. According to at least one aspect, the method may further relate to using the retrieval of digital images to generate curated datasets for training a machine learning model. The method includes generating a vocabulary of visual concepts for a specific task using a target dataset. The vocabulary contains a representative image embedding or a representative patch embedding for each visual concept. The method includes retrieving pre-computed image embeddings from a vector database. The method includes decomposing each pre-computed image embedding into a linear combination of the visual concepts.
The method includes generating a set of weights for each pre-computed image embedding based on the vocabulary. Each weight indicates the importance of a respective representative image embedding or a respective representative patch embedding. The method includes storing the set of weights for each pre-computed image embedding in an enhanced vector database. The method includes retrieving a set of digital images in response to a query using the enhanced vector database. As an example, the set of digital images is used to create a curated dataset for training a machine learning model, such as a classifier.

These and other features, aspects, and advantages of the present invention are discussed in detail below with reference to the accompanying drawings, in which like reference symbols represent identical or similar parts. Furthermore, the drawings are not necessarily to scale, as some features may be exaggerated or minimized to show detail of certain components.

Brief description of the drawings

Fig. 1 is a diagram of an example process associated with a plug-and-play embedding enhancement system for retrieval-based applications (PERA system) according to an exemplary embodiment of this disclosure. Fig. 2 is a diagram showing an example comparing an image retrieval result using PERA and an image retrieval result using a vector database according to an exemplary embodiment of this disclosure. Fig. 3 illustrates a first example of image retrieval results using image embedding and image retrieval results using PERA according to an exemplary embodiment of this disclosure. Fig. 4 illustrates a second example of image retrieval results using image embedding and image retrieval results using PERA according to an exemplary embodiment of this disclosure. Figure 5 illustrates a third example of image retrieval results using image embedding and image retrieval results using PERA according to an exemplary embodiment of this disclosure.
Figure 6 illustrates a first example of an application of PERA according to an exemplary embodiment of this disclosure. Figure 7 illustrates a second example of an application of PERA according to an exemplary embodiment of this disclosure. Figure 8 illustrates an example of a system incorporating PERA according to an exemplary embodiment of this disclosure. Figure 9 presents a schematic diagram of an interaction between a computer-controlled machine and a control system according to an exemplary embodiment of this disclosure. Figure 10 presents a schematic diagram of the control system of Figure 9, configured to control a mobile machine that is at least partially or completely autonomous, according to an exemplary embodiment of this disclosure. Figure 11 shows a schematic diagram of the control system of Figure 9, configured to control a manufacturing machine of a manufacturing system, such as part of a production line, according to an exemplary embodiment of this disclosure. Figure 12 shows a schematic diagram of the control system of Figure 9, configured to control a monitoring system, according to an exemplary embodiment of this disclosure. Figure 13 shows a schematic diagram of the control system of Figure 9, configured to control a medical imaging system, according to an exemplary embodiment of this disclosure.

Detailed description

The embodiments described herein, which have been shown and described by way of example, and many of their advantages will be understood from the preceding description, and it will be apparent that various modifications in the form, construction, and arrangement of the components can be made without departing from the disclosed subject matter or sacrificing one or more of its advantages. In fact, the described forms of these embodiments are merely illustrative.
These embodiments are susceptible to various modifications and alternative forms, and the following claims are intended to include and encompass such modifications and are not limited to the specific forms disclosed, but rather cover all modifications, equivalents, and alternatives that fall within the spirit and scope of this disclosure.

Figure 1 illustrates an overview of a process associated with Plug-and-Play Embedding Improvement for Retrieval-Based Applications (PERA) 100. PERA 100 improves the embeddings 50 stored in a vector database 140 to enhance the performance of retrieval-based applications. PERA 100 comprises a novel method that improves the performance of retrieval applications without recomputing application-specific embeddings. Specifically, PERA 100 improves the pre-computed embeddings 50 of a vector database 140 by decomposing them, in a computationally efficient manner, into a linear combination of embeddings adapted to a downstream application. For each pre-computed embedding 50, the process includes generating a set of weights 60 based on the vocabulary 130. Each set of weights 60 is then stored in the enhanced vector database 160. Finally, the process involves using these decomposed sparse weights together with the Dice coefficient for a similarity search over the enhanced vector database 160 to improve the performance of the downstream task.

For a given retrieval application, the process includes vocabulary generation 100A. Vocabulary generation 100A involves constructing a dictionary of embeddings for this retrieval application. The dictionary of embeddings is referred to here as a task-specific vocabulary 130. Specifically, a task-specific vocabulary 130, D, is constructed from the target dataset 10, D_T. For example, as shown in Fig. 1, the process includes generating image embeddings 20 based on pixels of digital images from the target dataset 10 using an image encoder 110.
The process also includes a vocabulary generator 120 configured to generate a task-specific vocabulary 130 of visual concepts based on the image embeddings 20. The task-specific vocabulary 130 contains a set of representative embeddings adapted to the downstream task. Each representative embedding relates to a visual concept. This process employs a strategy that involves selecting embeddings that are both representative and diverse enough to comprehensively cover the target dataset 10. For object-centric tasks, such as instance search and retrieval-enhanced classification, the process uses embeddings of images from the target dataset 10, D_T. For dense recognition tasks involving multiple objects per image, the process uses patch-level embeddings. Additionally, the vocabulary generator 120 is configured to cluster these embeddings into clusters 30 and select a centroid 40 from each cluster 30 to form the task-specific vocabulary 130. This approach considers both representativeness and diversity while balancing retrieval speed and memory requirements against performance.

To build the task-specific vocabulary 130, the process employs slightly different strategies for different tasks. For instance search, where the query dataset is typically small (around 90 images), DBSCAN can be used to identify the number of clusters automatically. Conversely, for dense recognition, where the number of embeddings extracted from the target dataset can be large, the process might involve using k-means clustering and setting the number of clusters to 500. For retrieval-enhanced classification, which is designed to address long-tail problems where few-shot classes might contain as few as 5 images, direct clustering might miss these few-shot classes.
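The clustering-based vocabulary construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the disclosed implementation: it uses a plain k-means loop in place of a library clusterer, and the function name `build_vocabulary` and its parameters `num_iters` and `seed` are hypothetical. The text mentions 500 clusters for dense recognition; the example below uses a tiny value so it runs instantly.

```python
import numpy as np


def build_vocabulary(embeddings, num_concepts, num_iters=50, seed=0):
    """Cluster embeddings with plain k-means and return one centroid per cluster.

    embeddings: (n, d) array of image or patch embeddings from the target dataset.
    Returns a (num_concepts, d) array; each row is one representative embedding.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(embeddings, dtype=float)
    # Initialise the centroids from distinct, randomly chosen data points.
    init_idx = rng.choice(len(x), size=num_concepts, replace=False)
    centroids = x[init_idx].copy()
    for _ in range(num_iters):
        # Assign every embedding to its nearest centroid (squared Euclidean distance).
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for k in range(num_concepts):
            members = x[labels == k]
            if len(members) > 0:  # keep the previous centroid if a cluster is empty
                new_centroids[k] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # assignments are stable
        centroids = new_centroids
    return centroids


# Usage: 40 synthetic 4-dimensional "embeddings" reduced to a 2-concept vocabulary.
rng = np.random.default_rng(1)
toy_embeddings = np.vstack([rng.normal(0.0, 0.1, (20, 4)),
                            rng.normal(5.0, 0.1, (20, 4))])
toy_vocabulary = build_vocabulary(toy_embeddings, num_concepts=2)
```

The number of concepts trades retrieval speed and memory against coverage of the target dataset, which is the balance the text describes.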
Thus, the process might include using the centroid 40 of the embeddings for each class as the vocabulary. For the hyperparameter settings in the Alternating Direction Method of Multipliers (ADMM) procedure, the process includes setting the maximum number of iterations k to 2000 and setting the penalty values τ and λ to 0.2 and 0.01, respectively.

Additionally, the process includes a sparse decomposition 100B. For example, as shown in Fig. 1, the process includes decomposing the embeddings 50 in the vector database 140, V_S, using a linear solver 150 according to the task-specific vocabulary 130. When decomposing embeddings, the process considers two main properties: sparsity and non-negativity. In general, a sparse and non-negative combination of embeddings is easier to understand, whereas negative values in a semantic representation are often less intuitive and more difficult to interpret. This motivates the optimization problem: reconstruct an embedding with a sparse, non-negative combination of the representative embeddings from the task-specific vocabulary 130. Given the task-specific vocabulary 130, D, and an embedding, v, the sparse decomposition can be obtained by minimizing the l0-norm subject to the condition of exact reconstruction using Equation 1, where "subject to" is abbreviated as "s.t." and w represents the weights. Since minimizing the l0-norm is an NP-hard (non-deterministic polynomial-time hard) problem, the l0-norm is replaced by the l1-norm. The l1-norm has been shown to yield highly sparse solutions and has the advantage of being computationally more tractable due to its convexity, as shown by Equation 2. The linearity in w allows each weight of a set of weights 60 to be interpreted as the importance or significance of the corresponding embedding in the task-specific vocabulary 130. The sparse weight, w, then serves as a basis for similarity searches in retrieval applications.
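The sparse, non-negative decomposition described above can be illustrated with a small ADMM solver. This is a sketch under stated assumptions, not the disclosed implementation: it assumes the l1 relaxation takes the penalized least-squares form min 0.5*||D w - v||^2 + λ*||z||_1 subject to w = z and z >= 0, uses a scaled dual variable `u` with penalty 1/τ, and runs on the CPU with NumPy for brevity. The function name `sparse_nonneg_decompose` and the stopping tolerance `tol` are hypothetical; the defaults for the iteration cap, τ, and λ follow the values stated in the text.

```python
import numpy as np


def sparse_nonneg_decompose(D, v, lam=0.01, tau=0.2, max_iters=2000, tol=1e-8):
    """Decompose embedding v into sparse, non-negative weights over vocabulary D.

    D: (d, m) array whose columns are the m representative embeddings.
    v: (d,) embedding to decompose.
    Returns an (m,) array of non-negative weights.
    """
    m = D.shape[1]
    rho = 1.0 / tau  # penalty parameter of the augmented Lagrangian
    # The w-update solves the same linear system every iteration, so invert once.
    # (For large vocabularies a Cholesky factorisation would be preferable.)
    system_inv = np.linalg.inv(D.T @ D + rho * np.eye(m))
    Dtv = D.T @ v
    z = np.zeros(m)  # sparse, non-negative copy of the weights
    u = np.zeros(m)  # scaled dual variable
    for _ in range(max_iters):
        # w-update: ridge-regularised least squares with z and u held fixed.
        w = system_inv @ (Dtv + rho * (z - u))
        # z-update: soft-thresholding with threshold lam * tau, clipped at zero.
        z_new = np.maximum(0.0, w + u - lam * tau)
        # Dual update: accumulate the constraint violation w - z.
        u = u + w - z_new
        converged = (np.linalg.norm(z_new - z) < tol
                     and np.linalg.norm(w - z_new) < tol)
        z = z_new
        if converged:
            break
    return z


# Usage: with an orthonormal vocabulary the solution is max(0, v - lam) per entry.
weights = sparse_nonneg_decompose(np.eye(3), np.array([1.0, 0.0, -0.5]))
```

Returning `z` rather than `w` guarantees that the stored weights are exactly non-negative and exactly zero where the threshold removed them.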
As discussed above, the sparse decomposition 100B involves decomposing precomputed embeddings 50 obtained from one or more vector databases 140. Specifically, the embeddings 50 are decomposed into a linear combination of the embeddings from the dictionary (e.g., the task-specific vocabulary 130) using a linear solver 150. The precomputed embeddings 50 are decomposed into sparse non-negative combinations of the task-specific vocabulary 130. Furthermore, the process includes storing each set of weights in the enhanced vector database 160 and using the decomposed sparse weights for the retrieval application to achieve better performance. Using this approach, the process improves the precomputed embeddings 50 with a lightweight decomposition procedure that balances computational cost and performance for downstream applications.

Additionally, the process includes performing a similarity search. While cosine similarity is a commonly used measure of the similarity between two embeddings in a vector database, it is not well suited to PERA 100. Specifically, in the context of PERA 100, which decomposes image embeddings and uses weights to represent the presence, importance, and/or significance of a particular vocabulary entry, cosine similarity might not adequately capture the nuanced overlap between weights, w. To address this limitation, the process employs the Dice coefficient, as expressed in Equation 3. The Dice coefficient specifically quantifies the overlap between two sets, making it sensitive to weight overlap. By using the Dice coefficient for similarity search, the process ensures that retrieval prioritizes instances that have significant semantic overlap with the query instance and focuses on the presence of critical vocabulary entries rather than on all semantic information. Furthermore, the process can include scaling up with GPU acceleration.
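The Dice-based similarity search described above can be sketched as follows. Equation 3 is not reproduced in this text, so the exact form is an assumption: the sketch uses a common generalization of the set-based Dice coefficient to non-negative weight vectors, 2 * sum(min(a_i, b_i)) / (sum(a) + sum(b)). The function names `dice_coefficient` and `retrieve_top_k` are hypothetical.

```python
def dice_coefficient(weights_a, weights_b):
    """Dice-style overlap between two non-negative weight vectors of equal length.

    Returns a value in [0, 1]: 1.0 for identical vectors, 0.0 when no
    vocabulary entry is active in both vectors.
    """
    overlap = sum(min(a, b) for a, b in zip(weights_a, weights_b))
    total = sum(weights_a) + sum(weights_b)
    return 2.0 * overlap / total if total > 0 else 0.0


def retrieve_top_k(query_weights, database_weights, k=5):
    """Return the indices of the k database entries most similar to the query."""
    scored = [(dice_coefficient(query_weights, w), idx)
              for idx, w in enumerate(database_weights)]
    # Highest score first; ties are broken by the lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Usage: the query activates concepts 0 and 1; entry 1 matches both, entry 2 one.
query = [0.5, 0.5, 0.0]
database = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.0], [0.5, 0.0, 0.5]]
top_two = retrieve_top_k(query, database, k=2)
```

Because only vocabulary entries that are active in both vectors contribute to the score, background concepts absent from the query cannot raise the similarity of a candidate.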
Although the optimization problem described above can be solved directly on a central processing unit (CPU) using commonly used libraries such as scikit-learn, GPU acceleration is necessary for large vector databases with millions of instances. Thus, the process involves implementing the Alternating Direction Method of Multipliers (ADMM) in PyTorch with GPU support for efficient decomposition. To apply ADMM to this task, Equation 2 is rewritten as Equation 4. Then, the Lagrangian function with the penalty parameter 1/τ > 0 for Equation 4 is defined by Equation 5. With both z^{k-1} and y^{k-1} fixed, the update of w is calculated using Equation 6. Additionally, z^{k} is calculated using Equation 7, where S_{λτ} is the soft-thresholding operator. Furthermore, the dual update rule is given by Equation 8. The steps represented in Equations 6, 7, and 8 can be executed efficiently. In practice, the process iterates until convergence or until the maximum number of iterations, which is set to 2000, is reached. A single GPU can decompose approximately 2000 embeddings per second, which is about 50 times faster than extracting a new embedding using an FM on the same hardware.

Fig. 2 illustrates the benefits of PERA image retrieval through a comparison of image retrievals for a given query image 200. Specifically, the query image 200 is a digital image showing a roadway, a sidewalk, trees, and a building. The roadway has two lanes. Additionally, the query image 200 shows the front of several passenger cars 200B parked on one side of the roadway, while also showing at least one motorcycle 200A traveling on the same side of the roadway. The query image 200 is then used as a query to obtain (i) a PERA retrieval result 230 using a set of weights 210 and an enhanced vector database 220, and (ii) a comparative retrieval result 260 using an image embedding and a vector database 250.
The process associated with PERA image retrieval involves generating an image embedding using pixels from the query image 200 via an image encoder. The image embedding is decomposed into a linear combination of visual concepts from a task-specific vocabulary. A set of weights 210 is generated based on the task-specific vocabulary. The set of weights 210 is then used in a similarity search to retrieve a set of digital images from the enhanced vector database 220. Figure 2 illustrates an example of a PERA retrieval result 230 based on the enhanced vector database 220. As shown, the PERA retrieval result 230 is a digital image. Specifically, the PERA retrieval result 230 depicts a multi-lane roadway, trees, a sidewalk, and buildings. Additionally, the PERA retrieval result 230 also shows a motorcycle 230A driving in one lane and a passenger car 230B in another lane. In this respect, the PERA retrieval result 230 succeeds in retrieving and capturing the objects of interest (e.g., a motorcycle, a passenger car, etc.).

In contrast, the process associated with a comparative image retrieval involves generating an image embedding 240 using pixels from the query image 200 via the image encoder. The image embedding 240 is then used directly in a similarity search to retrieve a set of digital images from the vector database 250. Figure 2 illustrates an example of a comparative retrieval result 260 based on the vector database 250. As shown, the comparative retrieval result 260 is a digital image. Specifically, the comparative retrieval result 260 shows a roadway with multiple lanes, trees, a sidewalk, and buildings. However, unlike the query image 200 and the PERA retrieval result 230, the comparative retrieval result 260 does not include a number of objects of interest (e.g., a motorcycle, a passenger car, etc.).
In this respect, the comparative retrieval result 260 misses a number of objects of interest. Therefore, the PERA retrieval result 230 is more similar to the query image 200 than the comparative retrieval result 260 is. The PERA retrieval result 230 thus provides better and more valuable results than the comparative retrieval result 260 for the same query image 200.

Figures 3, 4, and 5 illustrate several examples of retrieved images used for pretraining to highlight the benefits of PERA 100. In Figures 3, 4, and 5, the top image is a query image from the Cityscapes dataset, while the other images show the top five images retrieved from the nuImages dataset using CLIP embeddings, both with and without PERA enhancement. First, PERA 100 yields more diverse results than using CLIP embeddings alone. Closer examination reveals that images retrieved using only CLIP embeddings often miss important objects. For example, in a straightforward scenario (Figure 3), where the query image 300 contains a large truck, none of the images retrieved using CLIP embeddings include a large vehicle. Conversely, results using PERA often include large vehicles such as buses (e.g., boxes in Fig. 3), which even match the yellow color of the truck in the query image. In a more complex scenario (Fig. 4), the query image 400 contains several elements, such as a passenger car, a pedestrian, a building, a tree, and an intersection. Here, the top five results retrieved using CLIP embeddings fail to include the pedestrian. This observation reflects that the complexity of an embedding can sometimes lead to important objects in the scene being overlooked. Conversely, results using PERA include several pedestrians (e.g., boxes in Fig. 4). Furthermore, in scenarios with unusual features, such as a query image 500 containing unique painted advertisements, the PERA results include images with similar advertisements (e.g., boxes in Fig. 5) on various vehicles such as trucks, ships, and buses, whereas the results from CLIP do not exhibit this specificity and diversity. These improvements in diversity and relevance with PERA not only enhance performance during pretraining but also improve effectiveness in subsequent downstream tasks.

Figure 6 illustrates a retrieval-based application using PERA image retrieval via an enhanced vector database 630. In particular, Figure 6 illustrates an example of a process 600 of retrieval-enhanced generation/classification. The core concept involves retrieving relevant information from external knowledge sources to improve the performance of a machine learning system, such as a classification model or classifier. In computer vision, retrieval-enhanced classification has been used to address long-tail classification challenges. As shown in Fig. 6, the process 600 involves training a more robust image classifier 610 by effectively employing external knowledge via PERA 100. Specifically, in Fig. 6, PERA 100 generates weights 620 from the query data. The weights 620 are used in a similarity search to retrieve digital images from the enhanced vector database 630. The enhanced vector database 630 is generated according to a specific task using PERA 100, as discussed with respect to Fig. 1. Furthermore, as shown in Fig. 6, the query data and the digital images from the PERA image retrieval results are used to train a machine learning system, such as an image classifier 610. As a performance metric, experiments have shown that retrieval-enhanced classification using PERA 100 achieves an accuracy (ACC) gain of +7.5.

Figure 7 illustrates another retrieval-based application using PERA image retrieval via an enhanced vector database 730. In particular, Figure 7 illustrates an example of a data curation process 700 for pretraining a machine learning system 710 (e.g., an FM).
Pretrained FMs have achieved significant performance gains across many tasks in the computer vision domain, driven primarily by large pretraining datasets. However, raw internet data can contain as much as 60% to 90% noise or uninformative content, wasting computing resources and potentially weakening final performance. To address these challenges, the data curation process 700 involves starting with well-curated datasets. As shown in Fig. 7, process 700 involves pretraining a machine learning system 710 using PERA image retrieval results obtained from curated data. Specifically, in Fig. 7, PERA 100 generates weights 720 from the curated data. The weights 720 are used in a similarity search to retrieve digital images from the enhanced vector database 730. The enhanced vector database 730 is generated according to a specific task using PERA 100, as discussed with respect to Fig. 1. Process 700 involves using only digital images from PERA image retrieval results for model pretraining. The PERA image retrieval results are then used to pretrain a machine learning system 710 to improve the performance of downstream tasks such as instance segmentation. Furthermore, experiments have shown, as a performance metric, that model pretraining for downstream instance segmentation tasks is improved by up to 1.0 mean average precision (mAP).

Figure 8 illustrates an example of a system 800 containing PERA 100, according to at least one exemplary embodiment. The system 800 includes at least one processing system 802. The processing system 802 includes one or more processing units. For example, the processing system 802 includes one or more GPUs. The processing system 802 may further include an electronic processor, a CPU, a microprocessor, a field-programmable gate array (FPGA), an application-specific integrated circuit (ASIC), any suitable processing technology, or any number and combination thereof.
The processing system 802 is operable to provide the functionality as described herein.

The system 800 comprises at least one storage system 810 that is operatively connected to the processing system 802. The storage system 810 is in data communication with the processing system 802. In an exemplary embodiment, the storage system 810 comprises at least one non-transitory, computer-readable medium configured to store and provide access to various data, enabling the processing system 802 to perform the operations and functionality disclosed herein. In an exemplary embodiment, the storage system 810 comprises a single device or multiple devices. The storage system 810 may include electrical, electronic, magnetic, optical, semiconductor, electromagnetic, or any suitable storage technology that is operable with the system 800. For example, in an exemplary embodiment, the storage system 810 may include a random-access memory (RAM), a read-only memory (ROM), a flash memory, a disk drive, a memory card, an optical storage device, a magnetic storage device, a memory module, any suitable type of storage device, or any combination thereof.

The storage system 810 contains at least PERA 100, an application program 812, various PERA data 814, and other relevant data 816 stored therein. The storage system 810 contains computer-readable data which, when executed by the processing system 802, is configured to provide the functions and processes described in this disclosure. The computer-readable data may include instructions, code, routines, various related data, any software technology, or any number and combination thereof.
Specifically, the application program 812 contains computer-readable data with instructions which, when executed by the processing system 802, are configured to provide an application platform enabling PERA 100 to interact with other components of the system 800 and to provide an interface to a user. Furthermore, PERA 100 contains computer-readable data with instructions that, when executed by the processing system 802, are configured to perform the process described at least with respect to Fig. 1. PERA 100 also includes an image encoder 110, a vocabulary generator 120, a task-specific vocabulary 130, a vector database 140, a linear solver 150, and an enhanced vector database 160, or an applicable combination or variant thereof. Additionally, the various PERA data 814 include various image data, various image embedding data, various image identifiers (IDs), various weight data, various similarity calculation data, various parameter data, and any related PERA data (e.g., vector databases, enhanced vector databases, machine learning data, etc.) that enable the system 800 to perform the functions disclosed herein. For example, the various training data include at least different digital image/video data, etc. Meanwhile, the other relevant data 816 provides different data (e.g., an operating system, etc.) that enable the system 800 to perform the functions discussed here.

In an exemplary embodiment, as shown in Fig. 8, the system 800 is configured to include at least one sensor system 804. The sensor system 804 contains one or more sensors. For example, the sensor system 804 includes an image sensor or a camera configured to capture digital images and/or digital video. The sensor system 804 may also include a radar sensor, a light detection and ranging (LIDAR) sensor, a temperature sensor, an ultrasonic sensor, an infrared sensor, a motion sensor, an audio sensor, an inertial measurement unit (IMU), any suitable sensor, or any combination thereof.
The sensor system 804 is operable to communicate with one or more other components (e.g., the processing system 802 and the storage system 810) of the system 800. In particular, for example, the processing system 802 is configured to receive sensor data directly or indirectly from at least one sensor. The sensor system 804 and/or the processing system 802 are configured to generate digital images and/or digital video. The processing system 802 is configured to process digital images and/or digital video in conjunction with PERA 100 and the various PERA data 814.

In addition, the system 800 contains further components that contribute to PERA 100. For example, as shown in Fig. 8, the storage system 810 is also configured to store further relevant data 816 relating to the operation of one or more components (e.g., the sensor system 804, an input/output (I/O) system 806, and further function modules 808). The input/output system 806 includes an input/output interface and may contain one or more devices (e.g., a display device, a keyboard device, a speaker device, etc.). Furthermore, the system 800 contains further function modules 808, such as any suitable hardware technology, software technology, or combination thereof, that support or contribute to the operation of the system 800. For example, the further function modules 808 include a communication technology that enables components of the system 800 to communicate with each other, at least as described herein. The communication technology can enable the system 800 to communicate with other network devices (not shown) via a communication network. With at least the configuration discussed in the example of Fig. 8, the system 800 is configured to enable PERA 100 to perform the functions discussed in this disclosure.

Fig. 9 shows a schematic diagram of an interaction between the computer-controlled machine 900 and the control system 902.
The computer-controlled machine 900 includes an actuator 904 and a sensor 906. The actuator 904 can include one or more actuators, and the sensor 906 can include one or more sensors. The sensor 906 is configured to detect a state of the computer-controlled machine 900. The sensor 906 can be configured to encode the detected state into sensor signals 908 and to send the sensor signals 908 to the control system 902. Non-limiting examples of the sensor 906 include video, radar, LiDAR, and ultrasonic sensors, image sensors, audio sensors, motion sensors, etc. In some embodiments, the sensor 906 is a visual sensor configured to detect visual images of an environment near the computer-controlled machine 900. The control system 902 is configured to receive the sensor signals 908 from the computer-controlled machine 900. As outlined below, the control system 902 can further be configured to calculate actuator control instructions 910 based on the sensor signals 908 and to send the actuator control instructions 910 to the actuator 904 of the computer-controlled machine 900. As shown in Fig. 9, the control system 902 includes a receiver 912. The receiver 912 can be configured to receive the sensor signals 908 from the sensor 906 and convert these sensor signals 908 into input signals x. In an alternative embodiment, the sensor signals 908 are received directly as input signals x without a receiver 912. Each input signal x can be a part of a respective sensor signal 908. The receiver 912 can be configured to process each sensor signal 908 to generate each input signal x. The input signal x can contain data corresponding to an image recorded by the sensor 906. The control system 902 includes a classifier 914 (e.g., an image classifier 610) which is trained on a training dataset containing at least one set of digital images retrieved via PERA 100. The classifier 914 can be configured to classify input signals x into one or more labels using a machine learning (ML) algorithm.
The classifier 914 is configured to be parameterized by parameters such as those described above (e.g., the parameter θ). The parameter θ can be stored in and provided by a non-volatile memory 916. The classifier 914 is configured to determine output signals y from input signals x. Each output signal y contains information that assigns one or more labels to each input signal x. The classifier 914 can send the output signals y to a conversion unit 918. The conversion unit 918 is configured to convert the output signals y into actuator control instructions 910. The control system 902 is configured to send the actuator control instructions 910 to the actuator 904, which is configured to actuate the computer-controlled machine 900 in response to the actuator control instructions 910. In some embodiments, the actuator 904 is configured to actuate the computer-controlled machine 900 directly based on the output signals y. Upon receiving actuator control instructions 910, the actuator 904 is configured to perform an action corresponding to the related actuator control instruction 910. The actuator 904 may contain control logic configured to convert the actuator control instructions 910 into a second actuator control instruction used to control the actuator 904. In one or more embodiments, the actuator control instructions 910 may be used to control a display device instead of, or in addition to, an actuator. In some embodiments, the control system 902 includes the sensor 906 instead of, or in addition to, the computer-controlled machine 900 containing the sensor 906. The control system 902 may also include the actuator 904 instead of, or in addition to, the computer-controlled machine 900 containing the actuator 904. As shown in Fig. 9, the control system 902 also includes the processor 920 and the main memory 922. The processor 920 may contain one or more processors. The main memory 922 may contain one or more memory devices.
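For illustration only, the signal path of Fig. 9 (sensor signal 908, receiver 912, classifier 914, conversion unit 918, actuator control instruction 910) can be sketched as a chain of small functions. Everything concrete in this sketch is a hypothetical assumption and is not taken from the disclosure: the function names, the 8-bit intensity scaling, the threshold classifier standing in for the trained classifier 914, the labels "object" and "clear", and the commands "brake" and "continue".

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ActuatorInstruction:
    """Stand-in for an actuator control instruction 910 (hypothetical type)."""
    command: str


def receiver(sensor_signal: List[int]) -> List[float]:
    """Receiver 912: convert a raw sensor signal 908 into an input signal x.

    Assumption: the signal is a list of 8-bit intensities scaled to [0, 1].
    """
    return [value / 255.0 for value in sensor_signal]


def threshold_classifier(x: List[float]) -> str:
    """Placeholder for classifier 914; a real system would use a trained model."""
    mean_intensity = sum(x) / len(x)
    return "object" if mean_intensity > 0.5 else "clear"


def conversion_unit(label: str) -> ActuatorInstruction:
    """Conversion unit 918: map an output signal y (a label) to an instruction."""
    # Hypothetical mapping; unknown labels fall back to the safe command.
    mapping = {"object": "brake", "clear": "continue"}
    return ActuatorInstruction(command=mapping.get(label, "brake"))


def control_step(
    sensor_signal: List[int],
    classifier: Callable[[List[float]], str] = threshold_classifier,
) -> ActuatorInstruction:
    """One pass through the control system 902: signal 908 -> x -> y -> 910."""
    x = receiver(sensor_signal)
    y = classifier(x)
    return conversion_unit(y)


if __name__ == "__main__":
    print(control_step([255, 255, 0, 255]))
    print(control_step([0, 0, 0, 10]))
```

Passing the classifier as a parameter mirrors the text's separation between the classifier 914 and the conversion unit 918, so a model trained on images retrieved via PERA 100 could replace the placeholder without changing the rest of the chain.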
The classifier 914 of one or more embodiments may be implemented by the control system 902, which includes a non-volatile memory 916, a processor 920, and a main memory 922. The non-volatile memory 916 can include one or more persistent data storage devices, such as a hard disk, an optical drive, a tape drive, a non-volatile solid-state device, cloud storage, or any other device capable of persistently storing information. The processor 920 can include one or more devices selected from high-performance computing (HPC) systems. The processor 920 can include one or more high-performance cores, graphics processing units, microprocessors, microcontrollers, digital signal processors, microcomputers, central processing units, field-programmable gate arrays, programmable logic devices, state machines, logic circuits, analog circuits, digital circuits, or any other devices that manipulate (analog or digital) signals based on computer-executable instructions located in the main memory 922. The main memory 922 may contain, but is not limited to, a single memory device or multiple memory devices containing RAM, volatile memory, non-volatile memory, static random-access memory (SRAM), dynamic random-access memory (DRAM), flash memory, cache, or any other device capable of storing information. The processor 920 can be configured to read into the main memory 922, and to execute, computer-executable instructions that reside in the non-volatile memory 916 and embody one or more machine learning algorithms and/or methodologies of one or more embodiments. The non-volatile memory 916 can contain one or more operating systems and applications. The non-volatile memory 916 can store compiled and/or interpreted computer programs created using a wide variety of programming languages and/or technologies, including, without limitation, and either alone or in combination, Java, C, C++, C#, Objective-C, Fortran, Pascal, JavaScript, Python, Perl, and PL/SQL.
When executed by the processor 920, the computer-executable instructions of the non-volatile memory 916 can cause the control system 902 to implement one or more of the ML algorithms and/or methodologies for using the classifier 914, as disclosed herein. The non-volatile memory 916 can also contain ML data (including model parameters) that support the functions, features, and processes of one or more of the embodiments described herein. The program code embodying the algorithms and/or methodologies described herein can be distributed individually or collectively as a program product in a variety of different forms. The program code can be distributed using a computer-readable storage medium containing computer-readable program instructions to cause a processor to execute aspects of one or more embodiments. Computer-readable storage media, which are inherently non-transitory, can include volatile and non-volatile, removable and non-removable physical media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media may also include RAM, ROM, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other solid-state memory technologies, a compact disc read-only memory (CD-ROM) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to store the desired information and that can be read by a computer. Computer-readable program instructions may be downloaded from a computer-readable storage medium to a computer, another type of programmable data processing device, or another device, or may be downloaded via a network to an external computer or external storage device.
Computer-readable program instructions stored on a computer-readable medium can be used to instruct a computer, other types of programmable data processing equipment, or other devices to operate in a specific manner, such that the instructions stored on the computer-readable medium produce an article of manufacture containing instructions that implement the functions, processes, and/or operations specified in the flowcharts or diagrams. In certain alternative embodiments, the functions, processes, and/or operations specified in the flowcharts and diagrams can be rearranged, processed serially, and/or processed with temporal overlap, consistent with one or more embodiments. Furthermore, consistent with one or more embodiments, any flowcharts and/or diagrams may contain more or fewer nodes or blocks than those illustrated. In addition, the processes, procedures, or algorithms may be fully or partially embodied using suitable hardware components such as ASICs, FPGAs, state machines, control units, or other hardware components or devices, or a combination of hardware, software, and firmware components.
Fig. 10 shows a schematic diagram of a control system 902 configured to control a vehicle 1000, which can be an at least partially autonomous vehicle or an at least partially autonomous robot. The vehicle 1000 includes an actuator 904 and a sensor 906. The sensor 906 can include one or more video sensors, cameras, radar sensors, ultrasonic sensors, LiDAR sensors, and/or position sensors (e.g., a global positioning system). One or more of these specific sensors can be integrated into the vehicle 1000.
Alternatively or in addition to one or more of the specific sensors identified above, the sensor 906 can include a software module configured, upon execution, to determine a state of the actuator 904. A non-limiting example of a software module is a weather information software module configured to determine a present or future state of the weather near the vehicle 1000 or at another location. The classifier 914 of the control system 902 of the vehicle 1000 can be configured to detect objects in the vicinity of the vehicle 1000 depending on input signals x. In such an embodiment, the output signal y can contain information that classifies or characterizes objects in the vicinity of the vehicle 1000. The actuator control instruction 910 can be determined according to this information. The actuator control instruction 910 can be used to avoid collisions with the detected objects. In some embodiments, the vehicle 1000 is at least partially autonomous or fully autonomous. The actuator 904 can be embodied in a brake, a propulsion system, a power unit, a drivetrain, a steering system of the vehicle 1000, etc. Actuator control instructions 910 can be determined such that the actuator 904 is controlled in such a way that the vehicle 1000 avoids collisions with detected objects. Detected objects can also be classified according to what the classifier 914 considers them most likely to be, such as pedestrians, trees, or any other suitable label. The actuator control instructions 910 can be determined depending on the classification. In some embodiments where the vehicle 1000 is an at least partially autonomous robot, the vehicle 1000 can be a mobile robot configured to perform one or more functions, such as flying, swimming, diving, and walking. The mobile robot can be a lawnmower that is at least partially autonomous or a cleaning robot that is at least partially autonomous.
In such embodiments, the actuator control instruction 910 can be determined such that a propulsion unit, a steering unit, and/or a braking unit of the mobile robot is controlled in such a way that the mobile robot avoids collisions with identified objects. In some embodiments, the vehicle 1000 is an at least partially autonomous robot in the form of a garden robot. In such an embodiment, the vehicle 1000 can use a visual sensor as the sensor 906 to determine the condition of plants in an environment near the vehicle 1000. The actuator 904 can be a nozzle configured to spray chemicals. Depending on an identified type and/or condition of the plants, the actuator control instruction 910 can be determined to cause the actuator 904 to spray the plants with an appropriate quantity of suitable chemicals. The vehicle 1000 can also be a robot that is at least partially autonomous and takes the form of a household appliance. As a non-limiting example, a household appliance could be a washing machine, a stove, an oven, a microwave, a dishwasher, etc. In such a vehicle 1000, the sensor 906 can be a visual sensor configured to detect the state of an object undergoing processing by the household appliance. For example, in the case of a washing machine, the sensor 906 can detect the state of the laundry inside the washing machine. The actuator control instruction 910 can then be determined based on the detected state of the laundry.
Fig. 11 shows a schematic diagram of a control system 902 configured to control a system 1100 (e.g., a manufacturing machine), which may contain a punching tool, cutter, gun drill, or the like, of a manufacturing system 1102, such as part of a production line. The control system 902 may be configured to control an actuator 904, which is configured to control the system 1100 (e.g., the manufacturing machine). The sensor 906 of the system 1100 (e.g., the manufacturing machine) can be a visual sensor configured to detect one or more properties of a manufactured product 1104.
The classifier 914 can be configured to determine a state of the manufactured product 1104 from one or more of the detected properties. The actuator 904 can be configured to control the system 1100 (e.g., the manufacturing machine) for a subsequent production step of the manufactured product 1104, depending on the determined state of the manufactured product 1104. The actuator 904 can also be configured to control functions of the system 1100 (e.g., the manufacturing machine) on a subsequent manufactured product 1106, depending on the determined state of the manufactured product 1104.
Fig. 12 shows a schematic diagram of a control system 902 configured to control a monitoring system 1200. The monitoring system 1200 can be configured to physically control access through the door 1202. The sensor 906 can be configured to detect a scene relevant to deciding whether to grant access. The sensor 906 can be a visual sensor configured to generate and transmit image and/or video data. Such data can be used by the control system 902 to detect a person's face. The classifier 914 of the control system 902 of the monitoring system 1200 can be configured to interpret the image and/or video data by comparing it with the identities of known persons stored in the non-volatile memory 916, thereby determining a person's identity. The classifier 914 can be configured to generate an actuator control instruction 910 in response to the interpretation of the image and/or video data. The control system 902 is configured to send the actuator control instruction 910 to the actuator 904. In this embodiment, the actuator 904 is configured to lock or unlock the door 1202 in response to the actuator control instruction 910. In some embodiments, non-physical, logical access control is also possible. The monitoring system 1200 can also be an observation system.
In such an embodiment, the sensor 906 can be a visual sensor configured to detect a scene under observation, and the control system 902 is configured to control a display device 1204. The classifier 914 is configured to determine a classification of a scene, e.g., whether the scene detected by the sensor 906 is suspicious. The control system 902 is configured to send an actuator control instruction 910 to the display device 1204 in response to the classification. The display device 1204 can be configured to adjust the displayed content in response to the actuator control instruction 910. For example, the display device 1204 can highlight an object that is considered suspicious by the classifier 914.
Fig. 13 shows a schematic diagram of a control system 902 configured to control an imaging system 1300, such as a magnetic resonance imaging (MRI) device, an X-ray imaging device, or an ultrasound device. The sensor 906 can be, for example, an imaging sensor. The classifier 914 can be configured to determine a classification of all or part of the acquired image. The classifier 914 can be configured to determine or select an actuator control instruction 910 in response to the classification obtained by the trained neural network. For example, the classifier 914 can interpret an area of an acquired image as potentially anomalous. In this case, the actuator control instruction 910 can be selected to cause the display device 1302 to display the image and highlight the potentially anomalous area.
As described in this disclosure, the embodiments include a number of advantageous features and benefits.
For example, the embodiments provide a technical solution to the following problem: “Is it possible to improve pre-calculated embeddings in vector databases to enhance the performance of retrieval-based applications without recalculating application-specific embeddings?” To solve this problem, the embodiments include PERA 100, which provides a novel approach to decomposing the pre-calculated embeddings into a linear combination of embeddings adapted to the downstream application (e.g., embeddings of foreground objects in images). In this respect, PERA 100 addresses the challenge of improving pre-calculated embeddings in vector databases for downstream retrieval applications without recalculating application-specific embeddings. PERA 100 decomposes pre-calculated embeddings into a linear combination of embeddings tailored to specific applications, thereby efficiently improving performance. In this respect, PERA 100 enhances pre-calculated embeddings by decomposing them into a linear combination that meets the requirements of the target retrieval application. The embeddings in vector databases are thus improved for downstream retrieval applications without the need to recalculate embeddings from the original dataset. Furthermore, PERA 100 is computationally efficient and does not use the original dataset. In addition, PERA 100 has demonstrated significant improvements across various retrieval applications, confirming its utility and effectiveness. Specifically, experimental results show that PERA 100 increases instance search performance by up to 23.1 mean average precision (mAP), improves retrieval-enhanced classification accuracy by up to 7.5%, and enhances model pretraining for the downstream instance segmentation task by 1.9 mAP.
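For illustration only, the decomposition step can be sketched as a small non-negative sparse coding problem solved with ADMM, in line with the non-negative, sparse weights and the ADMM solver named in the claims. The specific objective (least squares plus an L1 penalty under a non-negativity constraint), the function name decompose_admm, the parameters lam, rho and n_iter, and the toy vocabulary are assumptions made for this sketch and are not taken from the disclosure. The sketch runs on a CPU with NumPy, whereas the claims recite a GPU implementation.

```python
import numpy as np


def decompose_admm(x, vocab, lam=0.05, rho=1.0, n_iter=200):
    """Approximate embedding x (d,) as vocab.T @ w with w >= 0 and sparse.

    vocab has shape (k, d): one representative embedding per visual concept.
    Assumed objective (not from the source):
        min_w 0.5 * ||x - vocab.T @ w||^2 + lam * ||w||_1   s.t.  w >= 0
    """
    D = vocab.T                                   # (d, k) dictionary matrix
    k = D.shape[1]
    # The linear system is the same in every iteration, so invert it once.
    A_inv = np.linalg.inv(D.T @ D + rho * np.eye(k))
    Dtx = D.T @ x
    z = np.zeros(k)                               # constrained copy of w
    u = np.zeros(k)                               # scaled dual variable
    for _ in range(n_iter):
        w = A_inv @ (Dtx + rho * (z - u))         # least-squares update
        z = np.maximum(0.0, w + u - lam / rho)    # non-negative soft threshold
        u = u + w - z                             # dual update
    return z                                      # non-negative, sparse weights


if __name__ == "__main__":
    # Toy vocabulary: three orthonormal "visual concepts" in a 4-d space.
    vocab = np.eye(3, 4)
    x = 0.8 * vocab[0] + 0.3 * vocab[2]           # stand-in pre-computed embedding
    print(decompose_admm(x, vocab))
```

With an orthonormal toy vocabulary the weights can be checked by hand: each weight is the projection of x onto a concept, reduced by lam and clipped at zero. Returning z rather than w guarantees that the stored weights are exactly non-negative and that unused concepts are exactly zero.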
Furthermore, the description above is intended to be illustrative and not limiting, and is provided in the context of a specific application and its requirements. Those skilled in the art can understand from the preceding description that the present invention can be implemented in a multitude of forms and that the various embodiments can be implemented alone or in combination. Therefore, while the embodiments of the present invention have been described in connection with certain examples thereof, the general principles defined herein can be applied to further embodiments and applications without departing from the concept and scope of the described embodiments, and the true scope of the embodiments and/or methods of the present invention is not limited to the embodiments shown and described, since various modifications will be apparent to the skilled practitioner upon study of the drawings, the application text, and the following claims. Additionally or alternatively, components and functionality can be separated or combined differently from the manner described in the various embodiments and can be described using different terminology. These and other variants, modifications, additions and improvements may fall within the scope of the disclosure defined in the claims that follow.
CITATIONS INCLUDED IN THE DESCRIPTION
This list of documents cited by the applicant was automatically generated and is included solely for the reader's convenience. The list is not part of the German patent or utility model application. The DPMA accepts no liability for any errors or omissions.
Cited patent literature
US 63/740,802

[0001]

Claims

1. A computer-implemented method for retrieving digital images, comprising: generating a vocabulary of visual concepts for a specific task using a target dataset, wherein the vocabulary contains a representative image embedding or a representative patch embedding for each visual concept; retrieving pre-computed image embeddings from a vector database; decomposing each pre-computed image embedding into a linear combination of the visual concepts; generating a set of weights for each pre-computed image embedding based on the vocabulary, wherein each weight represents a significance of a respective representative image embedding or a respective representative patch embedding; storing the set of weights for each pre-computed image embedding in an enhanced vector database; and retrieving a set of digital images in response to a query using the enhanced vector database.
2. The computer-implemented method according to claim 1, wherein each linear combination of the visual concepts contains non-negative and sparse weights.
3. The computer-implemented method according to claim 1, further comprising: implementing an alternating direction method of multipliers (ADMM) algorithm using a graphics processing unit (GPU) to perform the step of decomposing each pre-computed image embedding.
4. The computer-implemented method according to claim 1, further comprising: receiving a further digital image as the query; generating a further image embedding using pixels of the further digital image by means of an image encoder; decomposing the further image embedding into a further linear combination of the visual concepts; generating a query set of weights for the query based on the vocabulary; and performing a similarity search in the enhanced vector database using the query set of weights.
5. The computer-implemented method according to claim 1, wherein the similarity search is performed using a cube coefficient.
6. The computer-implemented method according to claim 1, further comprising: receiving the target dataset containing target images for the specific task; generating target image embeddings or target patch embeddings using pixels of the target images by means of an image encoder; selecting the representative image embedding or the representative patch embedding for each visual concept; and forming the vocabulary such that it contains each representative image embedding or each representative patch embedding of each group.
7. The computer-implemented method according to claim 6, further comprising: grouping the target image embeddings or the target patch embeddings into groups; and calculating a centroid for each group of target image embeddings or target patch embeddings, wherein each centroid is selected as the representative image of each group.
8. The computer-implemented method according to claim 1, further comprising: receiving the target dataset containing target images; and generating patches of each target image, wherein the target image embeddings are generated using the patches.
9. The computer-implemented method according to claim 1, further comprising: storing image identifiers corresponding to each set of weights in the enhanced vector database, wherein the set of digital images is retrieved using a corresponding set of image identifiers based on performing a similarity search in the enhanced vector database using the query set of weights.
10. The computer-implemented method according to claim 1, further comprising: generating a training dataset containing the set of retrieved images; and training a machine learning model to perform the specific task using the training dataset.
11. A system comprising: one or more processors; and one or more computer memories in data communication with the one or more processors, wherein the one or more computer memories contain computer-readable data, the computer-readable data containing instructions which, when executed by the one or more processors, cause the one or more processors to perform a method for retrieving digital images, the method comprising: generating a vocabulary of visual concepts for a specific task using a target dataset, wherein the vocabulary contains a representative image embedding or a representative patch embedding for each visual concept; retrieving pre-computed image embeddings from a vector database; decomposing each pre-computed image embedding into a linear combination of the visual concepts; generating a set of weights for each pre-computed image embedding based on the vocabulary, wherein each weight represents a significance of a respective representative image embedding or a respective representative patch embedding; storing the set of weights for each pre-computed image embedding in an enhanced vector database; and retrieving a set of digital images in response to a query using the enhanced vector database.
12. The system according to claim 11, wherein each linear combination of the visual concepts contains non-negative and sparse weights.
13. The system according to claim 11, wherein the one or more processors include a graphics processing unit (GPU), and an alternating direction method of multipliers (ADMM) algorithm is implemented by means of the GPU to perform the step of decomposing each pre-computed image embedding.
14. The system according to claim 11, wherein the method further comprises: receiving a further digital image as the query; generating a further image embedding using pixels of the further digital image by means of an image encoder; decomposing the further image embedding into a further linear combination of the visual concepts; generating a query set of weights for the query based on the vocabulary; and performing a similarity search in the enhanced vector database using the query set of weights.
15. The system according to claim 11, wherein the similarity search is performed using a cube coefficient.
16. The system according to claim 11, wherein the method further comprises: receiving the target dataset containing target images for the specific task; generating target image embeddings or target patch embeddings using pixels of the target images by means of an image encoder; selecting the representative image embedding or the representative patch embedding for each visual concept; and forming the vocabulary such that it contains each representative image embedding or each representative patch embedding of each group.
17. The system according to claim 16, wherein the method further comprises: grouping the target image embeddings or the target patch embeddings into groups; and calculating a centroid for each group of target image embeddings or target patch embeddings, wherein each centroid is selected as the representative image of each group.
18. The system according to claim 11, wherein the method further comprises: receiving the target dataset containing target images; and generating patches of each target image, wherein the target image embeddings are generated using the patches.
19. The system according to claim 11, wherein the method further comprises: storing image identifiers corresponding to each set of weights in the enhanced vector database, wherein the set of digital images is retrieved using a corresponding set of image identifiers based on performing a similarity search in the enhanced vector database using the query set of weights.
20. The system according to claim 11, wherein the method further comprises: generating a training dataset containing the set of retrieved images; and training a machine learning model to perform the specific task using the training dataset.
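
For illustration only, the similarity search over stored weight sets and image identifiers recited in the claims can be sketched as follows. The sketch assumes that the coefficient named in the claims corresponds to the Sørensen–Dice coefficient, here in its real-valued form 2⟨a, b⟩ / (‖a‖² + ‖b‖²); this reading is an assumption and is not confirmed by the translated text. The function names, the dictionary standing in for the enhanced vector database, the image identifiers, and the weight values are likewise hypothetical.

```python
from typing import Dict, List, Tuple


def dice_coefficient(a: List[float], b: List[float]) -> float:
    """Real-valued Dice coefficient 2<a, b> / (|a|^2 + |b|^2).

    For binary vectors this reduces to the set form 2|A and B| / (|A| + |B|).
    """
    if len(a) != len(b):
        raise ValueError("weight vectors must have the same length")
    dot = sum(p * q for p, q in zip(a, b))
    denom = sum(p * p for p in a) + sum(q * q for q in b)
    return 0.0 if denom == 0.0 else 2.0 * dot / denom


def search(
    query_weights: List[float],
    enhanced_db: Dict[str, List[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Rank stored weight sets against the query weights; return (image ID, score)."""
    scored = [
        (image_id, dice_coefficient(query_weights, weights))
        for image_id, weights in enhanced_db.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Hypothetical enhanced vector database: image ID -> weights over 4 concepts.
    enhanced_db = {
        "img_001": [0.9, 0.0, 0.1, 0.0],
        "img_002": [0.0, 0.8, 0.0, 0.2],
        "img_003": [0.7, 0.0, 0.0, 0.3],
    }
    query = [1.0, 0.0, 0.0, 0.0]  # query weights dominated by the first concept
    for image_id, score in search(query, enhanced_db):
        print(image_id, round(score, 3))
```

Because the weights are sparse and non-negative, images that share no active concept with the query score exactly zero, which is one plausible reason for preferring an overlap-style coefficient to a plain dot product in this setting.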