Methods and systems for optimising generative AI context
By optimizing generative AI prompts through relevance-based data pruning and enrichment, the method addresses computational challenges, improving output quality and resource efficiency.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- ARMSTRONG STUDIO LTD
- Filing Date
- 2025-12-15
- Publication Date
- 2026-06-18
AI Technical Summary
Generative AI models face challenges in managing computational load and resource demands due to the processing of prompts, while maintaining output quality, which is often compromised by insufficient or excessive contextual data.
A method to optimize prompts by computing weighted relevance values, ranking and pruning contextual data based on token limits and relevance thresholds, and generating revised prompts using the most relevant information.
Reduces computational load and latency while enhancing prompt quality and model output accuracy by focusing on the most relevant contextual data, thus optimizing GPU/CPU usage and memory requirements.
Abstract
Description
[0001] METHODS AND SYSTEMS FOR OPTIMISING GENERATIVE AI CONTEXT
[0002] Field of the Invention
[0003] The invention of the present disclosure relates to methods and systems for reducing computational load during inference of generative artificial intelligence (genAI) models. In particular, the invention relates to methods that balance the processing demands that contextual data places on a genAI model’s processing of prompts with the need for contextual data that makes genAI outputs relevant.
[0004] Background
[0005] Generative AI models are designed to receive a variety of different types of inputs from users, including strings of natural language, images, videos, and audio files. Collectively, these inputs are classified as “prompts”, in response to which the genAI model provides some form of output at the user interface (UI) and / or in the “back-end” of a computing system. The quality of outputs from the genAI model is dependent on numerous factors, but a significant predictor of outcome is the quality of the contextual data that the user prompt contains. The term “quality” in the context of genAI prompts and outputs is widely understood to mean that a prompt is decorated with sufficient contextual information that the genAI model can accurately discern the meaning and aims of the user prompt, and that the output accurately addresses the contents of the prompt. There are numerous metrics in the literature by which model output quality can be measured objectively, for example BLEU (BiLingual Evaluation Understudy), ROUGE (Recall-Oriented Understudy for Gisting Evaluation), and SPICE (Semantic Propositional Image Caption Evaluation). That is to say, the improvements to model output quality based on improved prompts can be determined with an objective standard and are not based on subjective user experience.
[0006] A prompt such as “Can you provide a summary of the key principles of quantum mechanics, focusing on wave-particle duality and the uncertainty principle?” succinctly defines the subject matter that the user is interested in at a general level, “quantum mechanics”, but also provides detailed contextual data indicating a specific sub-topic to focus on within this field, i.e. “wave-particle duality and the uncertainty principle”. There is a strong likelihood that the genAI model’s output will align with the user prompt’s aims, owing to the contextual data provided. In contrast, a low quality prompt could be “Tell me about science”, which lacks any substantial contextual data and dramatically increases the likelihood that the topic, selected effectively at random by the genAI model, will not be the one of particular interest to the user.
[0007] A common problem in the realm of user-facing genAI models is a lack of user knowledge around the subject of “prompt engineering”, being how to correctly phrase and decorate prompts in order to optimise a model’s outputs.
[0008] A second problem is the immense computer resources, and associated costs, resulting from genAI models processing the prompts and producing outputs.
[0009] It is therefore desirable to address the needs of facilitating higher quality prompts and outputs whilst managing the immense computer resources required to process said prompts and outputs. Maximising the volume of contextual data brings with it a massive strain on computer resources, but minimising contextual data presents a strong likelihood that model outputs become inaccurate and impractical for any meaningful use case and also increases latency due to larger numbers of iterations over irrelevant or less relevant pieces of information in the dataset.
[0010] Summary
[0011] According to a first aspect of the present disclosure, there is provided a computer-implemented method for reducing computational load during inference of a generative artificial intelligence (genAI) model. The method comprises receiving a first prompt at a user input field associated with the genAI model; computing weighted relevance values of pieces of information in storage; ranking the weighted relevance values of each of the pieces of information; removing ranked pieces of information from the contextual data at least in part based on a pre-defined token limit and / or a pre-defined degree of relevance; and generating a revised version of the first prompt utilising the remaining ranked pieces of information as contextual data.
[0012] The quantity of tokens used by a prompt to a genAI model corresponds directly to computational resources, in particular memory capacity (VRAM), compute (FLOPs), memory bandwidth, and energy. For example, processing complexity scales quadratically with input length N during the prefill stage of inference, i.e. O(N²) FLOPs, which has significant negative consequences for the GPU / CPU processing load incurred during inference. By way of example, if a user doubles their prompt length from 4000 to 8000 tokens, the GPU's computation load increases from 16 million to 64 million units of work, a fourfold increase. This also has significant negative consequences for latency, e.g. measured in Time to First Token (TTFT) or Time to Last Token (TTLT) in large language models. TTFT refers to the time elapsed from when an initial request is sent to the server until the LLM generates the first token. TTLT refers to the time elapsed from when an initial request is sent to the server until the final token is generated. TTLT can be a particularly informative metric of computational load, because it takes into account the total computational work done by the LLM to produce an entire output sequence, including both TTFT corresponding to the prefill stage of inference as well as tool / function calling during the decode stage of inference.
[0013] On the one hand, it is desirable to reduce the token count in order to relieve the computational load and reduce latency. However, indiscriminate truncation, compression or the like will compromise model output accuracy (error rate) and negatively affect TTLT in the decode stage of inference due to the model spending additional time considering irrelevant or low relevance pieces of information. The inventors have found that the exemplary methods disclosed herein enhance the quality of prompts to AI models, by pruning contextual data to the most relevant pieces and / or enriching contextual data with more relevant pieces, whilst at the same time greatly reducing the computational load during inference of generative AI models, for example with regard to both bandwidth and GPU / CPU usage. Importantly, the inventors have found that reducing computational load is not merely a function of reducing token count. Improving the quality of prompts in terms of relevance is also tied to a reduction in the computational load during inference, because pruning contextual data available from a database down to the computed most relevant contextual data will reduce TTLT. It has therefore been found that there is a synergy between the steps of pruning contextual data based on both token count and relevance to the original prompt.
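The arithmetic in the example above can be checked with a short sketch. The function name `prefill_work_units` is hypothetical, and the "unit of work" is simply N squared, mirroring the quadratic scaling stated in the text rather than any measured FLOP count.

```python
def prefill_work_units(num_tokens: int) -> int:
    """Relative prefill cost under the quadratic scaling described in the text."""
    return num_tokens ** 2


short_prompt = prefill_work_units(4000)   # 4000 tokens
long_prompt = prefill_work_units(8000)    # prompt length doubled
ratio = long_prompt / short_prompt        # growth factor when the length doubles

print(short_prompt, long_prompt, ratio)
```

Doubling the prompt length quadruples the relative work, which is why the pruning steps described in this disclosure focus on the token count of the contextual data.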
[0014] Optionally, before computing weighted relevance values of the pieces of information in the storage, the method further comprises determining the presence of one or more pieces of information in the storage that are not related to the contents of the first prompt, and removing said one or more unrelated pieces of information from the contextual data.
[0015] Optionally, computing weighted relevance values of the pieces of information in the storage comprises creating a vector representation of each of the pieces of information, creating a vector representation of the first prompt, and comparing the vector representation of each of the pieces of information with the vector representation of the first prompt.
[0016] Optionally, the method further comprises generating a summary of one or more of the ranked pieces of information, and computing the token sum of the summarised ranked piece(s) of information, wherein the step of removing ranked pieces of information from the contextual data comprises removing one or more summarised pieces based on the pre-defined token limit and / or pre-defined degree of relevance.
[0017] Optionally, an extent of summarisation of a ranked piece of information is based on its weighted relevance value. Optionally, a ranked piece of information with a maximal positive weighted relevance is not summarised, and a ranked piece of information with a weighted relevance equal to or exceeding a predefined threshold value is summarised.
[0018] Optionally, the method further comprises generating a plurality of summaries of each of a plurality of the ranked pieces of information, and caching the plurality of summaries.
[0019] Optionally, each of the plurality of summaries comprises a different length.
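A minimal sketch of caching several summaries of different lengths for one piece of information follows. The helper names `build_summary_cache` and `pick_summary` are hypothetical, word truncation stands in for a real summariser, whitespace-separated words stand in for tokens, and choosing the longest cached summary that fits a remaining budget is an illustrative assumption, not something the disclosure specifies.

```python
from typing import Optional


def build_summary_cache(text: str, lengths: list[int]) -> dict[int, str]:
    """Cache one summary per requested length (in words).

    Truncation is only a placeholder for a real summarisation step.
    """
    words = text.split()
    return {n: " ".join(words[:n]) for n in lengths}


def pick_summary(cache: dict[int, str], remaining_tokens: int) -> Optional[str]:
    """Return the longest cached summary that fits the remaining budget, if any."""
    fitting = [n for n in cache if n <= remaining_tokens]
    return cache[max(fitting)] if fitting else None


cache = build_summary_cache("one two three four five six", [2, 4])
print(pick_summary(cache, 5))
```

Because the summaries are computed once and cached, a later prompt can reuse them without repeating the summarisation step.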
[0020] Optionally, the storage comprises a cache or a database.
[0021] Optionally, the method further comprises storing the revised first prompt in storage for use in revising subsequent prompts.
[0022] Optionally, the method further comprises reviewing the contents of storage for additional piece(s) of information, identifying piece(s) of information relevant to the contents of the first prompt based on one or more criteria, and adding the piece(s) of information to the contextual data for revision of the first prompt.
[0023] Optionally, the method further comprises replacing one or more of the ranked piece(s) of information in the contextual data with a different piece of information from storage having a greater relevance to the content of the first prompt based on a weighted relevance value comparison. Optionally, the method further comprises assessing the similarity of at least two pieces of information in storage, and merging the at least two pieces in response to determining they are characterised by a similarity exceeding a predefined threshold value.
[0024] Optionally, assessing the similarity of the at least two pieces of information comprises determining a weighted relevance value of one or each relative to the other(s) and comparing the determined value(s).
[0025] According to a second aspect of the present disclosure there is provided a computer-implemented method for reducing computational load during inference of a generative artificial intelligence (genAI) model, comprising the following steps: receiving a plurality of pieces of information at a computing system; generating a summary of the content of at least some of the pieces of information; receiving a first prompt at a user input field associated with the genAI model; computing weighted relevance values of the pieces of information compared with the first prompt; adding a pre-defined number of the most relevant pieces of information to the contextual data; and generating a revised version of the first prompt using the pre-defined number of the most relevant pieces of information.
[0026] As noted above, the inventors have found that the exemplary methods disclosed herein reduce computational load during inference of a generative AI model. The disclosed methods enhance the quality of prompts to AI models by improving their contextual data, whilst at the same time managing the significant demands placed on computer resources that can result from increasing the amount of contextual information the genAI model is tasked with considering. In particular, the exemplary methods facilitate a greatly reduced load on computing systems utilising generative AI models both in terms of bandwidth and CPU usage.
[0027] Optionally, the method further comprises repeating the step of adding one or more of the most relevant pieces of information to the contextual data.
[0028] Optionally, the method further comprises repeating the step of adding a predefined number of the most relevant pieces of information to the contextual data until a predefined token limit is reached for the pieces of information. Optionally, the method further comprises replacing one or more of the ranked piece(s) of information in the contextual data with a different piece of information from storage having a greater relevance to the content of the first prompt based on a weighted relevance value comparison.
[0029] Optionally, the method further comprises assessing the similarity of at least two pieces of information in storage, and merging the at least two pieces in response to determining they are characterised by a similarity exceeding a predefined threshold value.
[0030] Optionally, comparing the similarity of the at least two pieces of information comprises determining a weighted relevance value of one or each relative to the other(s) and comparing the determined value(s).
[0031] Optionally, computing weighted relevance values of the pieces of information comprises creating a vector representation of each of the pieces of information, creating a vector representation of the first prompt, and comparing the vector representation of each of the pieces of information with the vector representation of the first prompt.
[0032] Optionally, the method further comprises storing the revised first prompt for use in revising subsequent prompts.
[0033] According to a further aspect of the present disclosure, there is provided a system for implementing the computer implemented method described above and disclosed herein. The system comprises a virtual environment in which a generative artificial intelligence (genAI) model is deployed. The virtual environment comprises at least one user input field for the user to input a prompt to the genAI model. The genAI model and / or the virtual environment is in communication with one or more storage elements comprising pieces of information.
[0034] Optionally, the virtual environment comprises a website, a web application, a mobile application, a virtual reality environment, or an augmented reality environment.
[0035] According to a further aspect of the present disclosure there is provided computer-readable media comprising instructions stored thereon that, when executed by one or more processors, cause the processor(s) to carry out the steps of the methods described above and disclosed herein.
[0036] Brief Description of Drawings
[0037] Figures 1-3 illustrate embodiments of a computer-implemented “scale-down” method for reducing the computational load during inference of a genAI model according to an exemplary embodiment of the present disclosure;
[0038] Figure 4 provides a flow diagram including steps of a computer-implemented “scale-up” method for reducing the computational load during inference of a genAI model according to an exemplary embodiment of the present disclosure;
[0039] Figure 5 provides a conceptual illustration of a computer-implemented “scale-up” method according to an exemplary embodiment of the present disclosure;
[0040] Figure 6 illustrates an aspect of a computer-implemented “scale-up” method according to an exemplary embodiment of the present disclosure.
[0041] Figure 7 illustrates a system for implementing the exemplary methods described in the embodiments of the present disclosure.
[0042] Detailed Description of Drawings
[0043] The present disclosure of methods and systems for reducing the computational load during inference of genAI models will now be made with reference to the accompanying drawings. The methods may be stored as a set of instructions on appropriate computer-readable media and are executable by one or more processors. The steps of the methods described herein may be performed by the genAI model receiving user prompts, or by another AI model, or by another non-AI software application that is appropriately configured for the relevant task. When referring to “storage” in the present disclosure, it will be understood to mean one or more data storage modules, a database, or a cache, located remotely or locally with respect to the deployment of the genAI model. The genAI model may comprise any suitably configured generative artificial intelligence model that permits inputting of prompts including one or more of natural language, images, videos, and audio files. For example, the genAI model may comprise or utilise any known model, such as but not limited to GPT-3, GPT-4, BERT, DALL-E, StyleGAN, VQ-VAE-2, WaveNet, and so on. Alternatively, the genAI model may be an in-house, custom-built model.
[0044] Referring to Figure 1, a flow diagram illustrates an exemplary computer-implemented method for reducing the computational load during inference of a genAI model.
[0045] In a first step 110, a first prompt is received at a user-input field of the genAI model. The user-input field may comprise a text box, an attachment button, and so on. The genAI model, or another software application, analyses the contents of the first prompt to identify one or more characteristics of the prompt. Characteristics may include, but are not limited to, tokenisation, semantic analysis, intent, an attention mechanism, context understanding, and so on.
[0046] The genAI model, or another software application, then analyses the contents of storage and computes 120 a relevance score for each of a number of pieces of information in storage in order to identify contextual information that can be utilised in revising the first prompt. The relevance score is a metric designed to determine the degree of relevance a piece of information has with respect to the contents of the user’s prompt. The relevance score may comprise a weighted relevance score involving any of a variety of known techniques. For example, determining the weighted relevance score for a piece of information may involve creating a vector representation of a piece of information, creating a vector representation of the first prompt, and comparing the vector representation of the pieces of information with the vector representation of the first prompt. The degree of relevance in the vector representation may correspond to an embedding distance between the vector representation of the first prompt and the vector representation of the piece of information in a feature space. In various embodiments, the embedding distance may comprise a Euclidean distance, a cosine similarity, or a Manhattan distance.
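The three comparison measures named above can be sketched in plain Python as follows. The function names and the toy two-dimensional vectors are illustrative assumptions; a real deployment would obtain the vectors from an embedding model, which the disclosure does not specify. Note that cosine similarity grows with relevance, whereas the Euclidean and Manhattan distances shrink.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def score_pieces(prompt_vec: Sequence[float],
                 piece_vecs: dict[str, Sequence[float]]) -> dict[str, float]:
    """Score each stored piece against the prompt using cosine similarity."""
    return {name: cosine_similarity(prompt_vec, vec)
            for name, vec in piece_vecs.items()}


# Toy vectors (hypothetical); real embeddings have many more dimensions.
scores = score_pieces([1.0, 0.0],
                      {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]})
print(scores)
```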
[0047] The genAI model, or another software application, may then rank 130 the relevance of each of the pieces of information that have been scored. The ranking may comprise a listing from most relevant to least relevant, or vice versa.
[0048] The token sum of the remaining ranked pieces of information is then computed, and one or more ranked pieces of information are then removed 140 from the contextual data at least in part based on a pre-defined token limit and / or a pre-defined degree of relevance threshold. The predefined token limit may be any number of tokens dependent on the use cases and is not intended to be limiting, whilst the pre-defined degree of relevance threshold may be a threshold value relevant to the relevance metric being implemented such as but not limited to a vector distance. Preferably, although not exclusively, the removal of ranked piece(s) of information begins with the least relevant pieces of information, working “up” the ranking. The method of Figure 1 may be described as a “scale-down” method of reducing computational load during inference of genAI models, because it starts with a ranking of some or all of the pieces of information in storage based on their relevance and cuts back pieces from the contextual data to be used based on one or more parameters such as the predefined token limit and / or the pre-defined degree of relevance threshold. When the predefined token limit and the pre-defined degree of relevance threshold are used in combination, the method may involve first removing all pieces of information beyond the pre-defined degree of relevance threshold, and then removing any pieces of information beyond the pre-defined token limit optionally starting with the least relevant pieces of information remaining after the step of removing pieces based on the pre-defined degree of relevance threshold.
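The combined use of the relevance threshold and the token limit described above can be sketched as follows. The `Piece` class, the `scale_down` function, and the 0-1 relevance scale on which higher means more relevant are assumptions made for illustration; if the relevance metric were a vector distance, the threshold comparison and sort order would be reversed.

```python
from dataclasses import dataclass


@dataclass
class Piece:
    name: str
    relevance: float  # assumed 0-1 scale, higher is more relevant
    tokens: int       # token count of the piece


def scale_down(pieces: list[Piece], token_limit: int,
               min_relevance: float) -> list[Piece]:
    """Prune by relevance threshold first, then by token limit."""
    # Remove every piece beyond the relevance threshold.
    kept = [p for p in pieces if p.relevance >= min_relevance]
    # Rank the survivors from most to least relevant.
    kept.sort(key=lambda p: p.relevance, reverse=True)
    # Work "up" the ranking, dropping the least relevant piece until the sum fits.
    while kept and sum(p.tokens for p in kept) > token_limit:
        kept.pop()
    return kept


pieces = [
    Piece("A", 0.9, 300),
    Piece("B", 0.7, 500),
    Piece("C", 0.4, 400),
    Piece("D", 0.1, 200),
]
remaining = scale_down(pieces, token_limit=1000, min_relevance=0.3)
print([p.name for p in remaining])
```

In this toy run, piece D falls beyond the relevance threshold, and piece C is then dropped because the three remaining pieces total 1200 tokens against a limit of 1000.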
[0049] In some embodiments, the token limit may be checked and, if necessary, updated with a predefined frequency. For example, the method may involve refreshing the token limit periodically to account for varied usage trends involving the AI model in question. Usage trends may vary based on the time of day, week, or year, and this will have consequences for the computational load. In some embodiments, “periodically” may comprise checking the current token limit for the model whenever a user prompt is received, or after receiving a predefined number of user prompts to the model, or based on the time. Checking the token limit based on the time may be purely based on the time, e.g. set to a particular time zone, or based on a combination of time and geographical location from where a prompt is received.
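One of the refresh policies mentioned above, re-checking the limit after a predefined number of prompts, might look like the sketch below. The class `RefreshingTokenLimit` and the `fetch_limit` callable are hypothetical; how the current limit is actually determined (for example from usage trends or time of day) is left to the supplied callable.

```python
from typing import Callable


class RefreshingTokenLimit:
    """Holds a token limit and refreshes it every `refresh_every` prompts."""

    def __init__(self, fetch_limit: Callable[[], int], refresh_every: int) -> None:
        self._fetch_limit = fetch_limit      # hypothetical source of the current limit
        self._refresh_every = refresh_every
        self._prompts_seen = 0
        self._limit = fetch_limit()          # initial limit

    def on_prompt(self) -> int:
        """Register one received prompt and return the limit to apply to it."""
        self._prompts_seen += 1
        if self._prompts_seen % self._refresh_every == 0:
            self._limit = self._fetch_limit()
        return self._limit


# Toy limit source: each call yields the next value, as if load were rising.
limits = iter([1000, 800, 600])
policy = RefreshingTokenLimit(lambda: next(limits), refresh_every=2)
applied = [policy.on_prompt() for _ in range(4)]
print(applied)
```

Setting `refresh_every=1` gives the "whenever a user prompt is received" variant; a time-based variant would compare timestamps instead of counting prompts.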
[0050] Pruning the contextual data to be used by the genAI model to only the most relevant pieces of data will reduce the computational load during inference. With regard to the prefill stage of inference, TTFT (and consequently TTLT) will be reduced because the model will spend less time parsing through irrelevant or less relevant information and achieve a predefined target confidence level more quickly before the decode stage. Consequently, the Key Value (KV) cache size and VRAM consumption are reduced. Conversely, TTLT is increased in scenarios where less relevant data is maintained for the revised prompt (for example, if all contextual data is retained irrespective of relevance), because the model spends more time in the prefill and / or decode stages of inference iterating over data of widely varying relevance to achieve a predefined confidence before the first token is selected.
[0051] Subsequent to removing ranked piece(s) of information from the contextual data, and once the token sum is less than or equal to the token limit, the genAI model may then utilise the remaining ranked pieces of information as contextual data to generate 150 a revised version of the first prompt. The revised first prompt may comprise the original user prompt and the selected contextual information from storage, or it may comprise an altered version of the original user prompt, or it may comprise an altered version of the original user prompt together with the selected contextual information from storage. The genAI model may then use the revised prompt in order to provide an output. The output may, for example, be provided at a virtual environment.
[0052] By weighting and ranking the relevance of pieces of information in storage relevant to the first prompt, and then dropping ranked, weighted pieces of information from the contextual data based on the pre-defined token limit and / or pre-defined degree of relevance threshold, the method of Figure 1 enhances the quality of the prompt whilst managing the demands placed on computer resources by increasing the amount of contextual information the genAI model must consider. In some embodiments the revised prompt may be stored for use in revising subsequent prompts received by the genAI model.
[0053] Referring to Figure 2, a method similar to Figure 1 is illustrated but with additional advantageous features that further optimise the contextual data that is to be utilised and contribute to reducing computational load during inference. A summary of one or more of the ranked pieces of information is generated 210, thus reducing the number of tokens that a piece of information occupies. A summary preferably comprises a natural language string of text of a given length describing the contents of the piece of information. For example, if the piece of information is a Microsoft Word® file comprising ten pages of text, a summary may be, but is not limited to, a 5-10 sentence overview of the contents of the file including salient information. In another embodiment, if the piece of information is an image file, the summary may be, but is not limited to, a 1-3 sentence overview of what is depicted in the image. The token sum of the summarised ranked piece(s) of information is then computed, and if the sum size of the pieces still does not fit the token limit, the summarised ranked piece(s) of information are then sequentially removed from the contextual data for the genAI model to use in revising the prompt, preferably starting with the pieces of the lowest relevance.
[0054] The inventors have found that incorporating summarisation step 210 into the scale-down method of Figure 1 further helps to reduce the bandwidth and CPU usage of computer systems implementing genAI models, whilst still permitting the decorating of prompts with important contextual data that improve their quality and ultimately the quality of model outputs. Moreover, model reliability is enhanced when the model is able to augment prompts within the bounds of its hardware resources. As noted above, pruning contextual data to remove pieces of information that are irrelevant or fall below a predefined level of relevance will also serve to reduce the computational load during inference, for example reflected in reduced latency (TTLT) and VRAM load.
[0055] In some embodiments, the extent of summarisation 210 of a ranked piece of information is based on its weighted relevance value. As one example, a ranked piece of information with a maximal positive weighted relevance may not be summarised at all and instead is used in full as contextual data by the genAI model, whilst a ranked piece of information with a weighted relevance equal to or exceeding a predefined threshold value may be summarised. For example, a piece of information with a low degree of relevance may be summarised extensively. This helps to prioritise computer resources based on the extent to which contextual data will actively improve prompt quality, and therefore model output quality.
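A minimal sketch of relevance-dependent summarisation follows. The function `summary_token_budget`, the linear scaling of the budget with relevance, and the choice to return `None` for pieces below the threshold are all assumptions for illustration; the disclosure states only that maximally relevant pieces may be used in full, that pieces at or above a threshold may be summarised, and that low-relevance pieces may be summarised more extensively.

```python
from typing import Optional


def summary_token_budget(tokens: int, relevance: float, threshold: float,
                         max_relevance: float = 1.0) -> Optional[int]:
    """Return how many tokens a piece may occupy after summarisation.

    - maximal relevance: the piece is kept in full;
    - relevance at or above the threshold: the budget shrinks with relevance;
    - below the threshold: None (the piece is not summarised in this sketch).
    """
    if relevance >= max_relevance:
        return tokens
    if relevance >= threshold:
        return max(1, int(tokens * relevance))  # assumed linear mapping
    return None


for relevance in (1.0, 0.5, 0.25, 0.1):
    print(relevance, summary_token_budget(1000, relevance, threshold=0.25))
```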
[0056] The “scale-down” method of Figures 1 and 2 may be adjusted at any point with replacements or additions of further contextual data from storage, in an iterative approach. For example, in one embodiment, the methods of Figures 1 and 2 may include an additional step in which the genAI model or another software application reviews the contents of storage for additional piece(s) of information that could be used to revise the first prompt. The identification of additional piece(s) of information relevant to the contents of the first prompt may be based on one or more criteria, but preferably involves a relevance score such as the weighted relevance value, with respect to the contents of the first prompt. The steps 130-140 of Figure 1 may be performed again one or more times, and the additional piece(s) of information may be added to the contextual data for revision of the first prompt. In another embodiment, the genAI model or another software application replaces one or more of the ranked piece(s) of information in the contextual data with a different piece of information from storage having a greater relevance to the content of the first prompt based on a weighted relevance value comparison. In other embodiments, the similarity of at least two pieces of information in storage may be assessed by the genAI model, and if it is determined that they are characterised by a similarity exceeding a predefined threshold value, they may be merged. In order to assess the similarity of the at least two pieces of information, their weighted relevance values in relation to each other may be determined. Only if the at least two pieces of information are similar to each other to within a predefined value and not to the user prompt, are they then merged.
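The pairwise similarity test that precedes merging could be sketched as below. The function names, the toy vectors, and the use of cosine similarity between embeddings are assumptions; the sketch only identifies candidate pairs whose mutual similarity exceeds the threshold, and leaves open both how the merged content is produced and how similarity to the user prompt is taken into account.

```python
import math
from itertools import combinations


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def merge_candidates(vectors: dict[str, list[float]],
                     threshold: float) -> list[tuple[str, str]]:
    """List every pair of pieces whose mutual similarity exceeds the threshold."""
    return [(m, n)
            for (m, u), (n, v) in combinations(vectors.items(), 2)
            if cosine(u, v) > threshold]


# Toy vectors (hypothetical): "a" and "b" point in nearly the same direction.
vectors = {"a": [1.0, 0.0], "b": [1.0, 0.1], "c": [0.0, 1.0]}
pairs = merge_candidates(vectors, threshold=0.9)
print(pairs)
```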
[0057] Figure 3 provides a further illustrative example of the scale-down method. Pieces of information are represented by a series of blocks of different sizes denoting the varied file sizes in storage. A token limit is pre-defined, for example by an administrator of the genAI model, and is represented conceptually by a horizontal dashed line.
[0058] In an optional step 1, any pieces of information that are deemed “unrelated” may be removed from the contextual data to be used at the outset. By “unrelated” it is meant that the piece of information is either unreferenced by the prompt, or the piece of information has a relevance close to or equal to nil (i.e. totally irrelevant to the content of the first prompt). The determination of a piece of information being unrelated may be based on relevance scoring as in step 120 of Figure 1.
[0059] In step 2, corresponding to step 120 of Figures 1 and 2, the weighted relevance values of the remaining pieces of information in storage for consideration are computed. In some embodiments involving workflow systems, the weighted relevance value could be based on a static distance of a piece of information in a workflow relative to the prompt’s place in the workflow. For example, the workflow system may be an e-learning platform having a course comprising 100 tasks to be completed by a student; the distance in this example is the number of tasks between the task related to the piece of information and the task in relation to which the user is inputting a prompt to the genAI model. In other embodiments the weighted relevance value could be based on a vector (embedding) distance. In step 3, corresponding to step 130 of Figure 1, the pieces of information are ranked according to their weighted relevance to the contents of the prompt. In the example of Figure 3 the relevance scoring is denoted on a scale of 0-1, with the most relevant pieces of information being labelled with a “1”, and the least relevant pieces of information being labelled with a number close to 0. As unrelated values score “0” in this metric, they may already have been removed from the contextual data in optional step 1.
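The workflow-distance variant can be sketched as follows. The function `workflow_relevance` and the linear normalisation onto the 0-1 scale used in Figure 3 are assumptions; the disclosure says only that the distance is the number of tasks between the two positions in the workflow.

```python
def workflow_relevance(piece_task: int, prompt_task: int, total_tasks: int) -> float:
    """Map the task distance in a workflow onto a 0-1 relevance score.

    Tasks are numbered 1..total_tasks. A distance of zero scores 1.0 and
    the largest possible distance scores 0.0 (assumed linear mapping).
    """
    if total_tasks < 2:
        return 1.0  # a single-task workflow has no distance to measure
    distance = abs(piece_task - prompt_task)
    return 1.0 - distance / (total_tasks - 1)


# E-learning course with 100 tasks; the prompt is entered at task 60.
print(workflow_relevance(60, 60, 100))
print(workflow_relevance(59, 60, 100))
print(workflow_relevance(10, 60, 100))
```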
[0060] A token sum of the weighted relevance ranked pieces of information may then be taken. Conceptually this can be seen in Figure 3 where the 11 most relevant pieces of information falling above the dashed token line are within the pre-defined token limit, and those pieces of information falling below the dashed token line exceed the pre-defined token limit. In the example of Figure 1, after step 3 the weighted relevance ranked pieces of information below the dashed token line are then removed 140 from the contextual data, and the remaining contextual data is utilised by the genAI model to generate a revised version of the first prompt. However, Figure 3 further comprises step 4 corresponding to step 210 of Figure 2, in which the weighted relevance ranked pieces of information are summarised, and in a final step 5 all summarised ranked pieces of information exceeding the pre-defined token limit are removed from the contextual data. In the embodiment of Figure 3, the degree of summarisation of a piece of information is a function of its relevance to the first prompt.
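The five steps of Figure 3 can be strung together in a short sketch. Everything here is illustrative: relevance scores are supplied directly (so step 2 is assumed to have run already), word truncation stands in for a real summariser, whitespace-separated words stand in for tokens, and the names `summarise` and `figure3_pipeline` are hypothetical.

```python
import math


def summarise(text: str, relevance: float) -> str:
    """Placeholder summariser: keep a share of the words proportional to relevance."""
    words = text.split()
    keep = max(1, math.ceil(len(words) * relevance))
    return " ".join(words[:keep])


def figure3_pipeline(scored: dict[str, tuple[str, float]],
                     token_limit: int) -> list[tuple[str, str]]:
    """scored maps a piece name to (text, relevance on a 0-1 scale)."""
    # Step 1: drop unrelated pieces (relevance of zero).
    related = {n: (t, r) for n, (t, r) in scored.items() if r > 0.0}
    # Step 2: relevance values are assumed to be computed already.
    # Step 3: rank from most to least relevant.
    ranked = sorted(related.items(), key=lambda item: item[1][1], reverse=True)
    # Step 4: summarise each piece to a degree that depends on its relevance.
    summarised = [(n, summarise(t, r)) for n, (t, r) in ranked]
    # Step 5: keep pieces from the top of the ranking until the limit is hit.
    kept, used = [], 0
    for name, summary in summarised:
        cost = len(summary.split())
        if used + cost > token_limit:
            break
        kept.append((name, summary))
        used += cost
    return kept


scored = {
    "notes": ("a b c d", 1.0),
    "essay": ("e f g h", 0.5),
    "misc": ("i j k l", 0.25),
    "spam": ("m n", 0.0),
}
context = figure3_pipeline(scored, token_limit=6)
print(context)
```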
[0061] Referring to Figures 4 and 5, an alternative solution to the “scale-down” method of Figures 1-3, that may be denoted the “scale-up” method, is provided. In the scale-down method of Figures 2 and 3, information already contained in storage is summarised in response to receipt of the first prompt. In contrast, the scale-up method of Figures 4 and 5 involves summarising information as it is received at a computing system and storing the summaries for consideration when a prompt is received, thus “building” a contextual data set rather than “stripping back” pieces from the contextual data as in the scale-down method. However, both the scale-up and scale-down methods address the same problem, namely the need for prompts with more relevant contextual data whilst managing the computer resources that this can entail.
[0062] In an initial step, a plurality of pieces of information are received 410 at a computing system.
[0063] The computing system may be a server associated over a distributed network with a virtual environment e.g. an e-learning website, and in this example the plurality of pieces of information may be answers to assignments and the like in the platform that have been uploaded by students. As pieces of information are received at the computing system, a summary of the content of each piece of information may be generated 420 and stored in memory. When a prompt is received 430 at a user input field of the genAI model, the relevance scores of the summarised pieces of information are computed 440 with respect to the content of the prompt. A predefined number of the most relevant pieces of information are then added 450 to the contextual data to be used by the genAI model. For example, in one embodiment the 5 highest relevance scoring pieces of information may be maintained in the contextual data and the other pieces removed. A revised version of the first prompt may then be generated 460 by the genAI model, utilising the predefined number of the most relevant pieces of information. The genAI model may then use the revised prompt in order to provide an output. The output may, for example, be provided at a virtual environment.
[0064] In preferred embodiments, revising the prompt in the scale-up method comprises using the original prompt alongside the un-summarised most relevant pieces of information selected for contextual data. A conceptual example illustrating the selection of a pre-defined number of the most relevant pieces of information is depicted in Figure 6.
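As a hedged sketch of the scale-up flow described above, the following example summarises each piece on receipt, scores the stored summaries against the prompt, and returns the un-summarised originals of the top-scoring pieces. The class and function names are hypothetical, and the truncation-based summariser and word-overlap score are simple stand-ins for whatever summarisation and relevance computation an embodiment actually uses.

```python
def summarise(text, max_words=8):
    """Placeholder summariser: keep the first few words (assumption)."""
    return " ".join(text.split()[:max_words])

def overlap_score(prompt, summary):
    """Placeholder relevance score: Jaccard overlap of lower-cased words."""
    prompt_words = set(prompt.lower().split())
    summary_words = set(summary.lower().split())
    if not prompt_words or not summary_words:
        return 0.0
    return len(prompt_words & summary_words) / len(prompt_words | summary_words)

class ScaleUpStore:
    def __init__(self):
        self._items = []  # list of (original text, stored summary)

    def receive(self, text):
        """Steps 410-420: summarise on receipt and store the summary."""
        self._items.append((text, summarise(text)))

    def select_context(self, prompt, k=5):
        """Steps 440-450: score summaries, return the top-k original pieces."""
        ranked = sorted(
            self._items,
            key=lambda item: overlap_score(prompt, item[1]),
            reverse=True,
        )
        return [original for original, _ in ranked[:k]]
```

Scoring the short summaries while returning the full originals keeps the per-prompt relevance computation cheap without discarding detail from the pieces that are ultimately selected.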
[0065] Figure 5 illustrates another example embodiment of the scale-up method of Figure 4, but further including an iterative step of the genAI model assessing whether further pieces of information are necessary / useful as contextual data. This process is continued iteratively until the genAI model confirms that no further pieces are needed or a pre-defined token limit is reached. In other embodiments, the iterative process in Figure 5 may involve replacing or merging one or more pieces of information with other pieces of information in storage.
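The iterative step of Figure 5 may be illustrated, under stated assumptions, by a loop that adds ranked pieces one at a time until either the model signals that no further context is needed or the token limit would be exceeded. The `needs_more` callback is a hypothetical stand-in for the genAI model's own assessment.

```python
def enrich_iteratively(ranked_pieces, token_limit, needs_more):
    """Add (text, tokens) pairs, most relevant first, until a stop condition.

    needs_more(context) is a hypothetical callback standing in for the genAI
    model deciding whether further pieces are necessary or useful.
    """
    context, used = [], 0
    for text, tokens in ranked_pieces:
        if not needs_more(context):
            break  # the model confirms that no further pieces are needed
        if used + tokens > token_limit:
            break  # the pre-defined token limit would be exceeded
        context.append(text)
        used += tokens
    return context
```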
[0066] It will be appreciated that combinations of the scale-up and scale-down methodologies in Figures 1-5 may be implemented in some embodiments. For example, in one embodiment the scale-down method of Figure 1 may further comprise steps 410-420 and 440-450 to enrich the context for the prompt. In another embodiment, the scale-up method of Figure 4 may further comprise step 140 and optionally 130 from Figure 1.
[0067] Referring to Figure 7, embodiments of the present disclosure also provide a system in which the methods in Figures 1-6 may be deployed. The system at least comprises a computing system 710, such as a virtual environment 710, in which is deployed a generative artificial intelligence model or models 720. The computing system 710 preferably comprises at least one user input field for a user to input a prompt to the genAI model(s) 720 of the system. The virtual environment 710 may comprise a website, a web application, a mobile application, a virtual reality environment, or an augmented reality environment. The virtual environment should be configured with an appropriate API in order to host the genAI model(s) 720. The model(s) 720 themselves may be deployed via any appropriate means such as but not limited to a cloud platform, e.g. Google® Cloud, Amazon® Web Services, or Microsoft Azure®. The genAI model 720 and / or the virtual environment 710 is in direct or indirect wireless communication with one or more storage elements 730 over a network. The storage element(s) 730 may comprise a cloud database(s), a cache(s), and the like, suitably configured to store the pieces of information. The storage element(s) 730 may or may not be part of the system itself. In some embodiments, the genAI model is in communication with the storage element(s) 730, whilst in other embodiments software application(s) associated with the virtual environment is / are in communication with the storage element(s) 730.
[0068] Case Study 1
[0069] In an example case study demonstrating the advantages of the scale-down method of Figure 1, one might consider an online learning or work virtual environment 710 that includes courses having lecture videos to watch and assignments to download, complete and upload. The course may comprise a “workflow” with a number of tasks, such as the lectures to watch, assignments to complete, and so on. When a task is completed, the workflow moves on to the next task and some indication of progress in the workflow is provided on the administrator end and possibly also on the user end. Any answers provided by the student, such as documents, may be uploaded to the virtual environment 710 and stored in storage element(s) 730. The course may have a course administrator, denoted “user A”, and at least one student denoted “user B”. In one example, there may be 100 tasks in the workflow. Each task has a task description, on average 5 supporting documents attached related to the task that must be completed, and requires user B to input (upload) one document with their answer.
[0070] The average token counts for these pieces of information are as follows:
[0071] • Task description: 200 tokens
[0072] • Supporting material: 2200 tokens each
[0073] • User input: 3300 tokens
This means that the token count of a typical task is 200 + 3300 + 5*2200 = 14,500 tokens. The token count in the full system of 100 tasks is therefore 100*14,500 = 1,450,000 tokens. Running this number of tokens in a genAI model 720 would be highly impractical and require substantial computational resources, if all of the tasks were to be utilised as contextual information.
[0074] In this example let's say that each task has 3 other tasks that are "related" to the one in question. Two are related for belonging in the same group and one is related because the course administrator user A has marked it explicitly as a prerequisite.
[0075] Now if one runs the scale-down method of Figures 1-3 and limits the full context in a way that one only takes into consideration the pieces of information tied to a single task, then one can rule out 99% of the pieces of information in storage 730. However, this may result in the model 720 missing important contextual information and producing an inaccurate or incomplete output. The dataset may instead be limited in such a way that it takes into consideration two levels: "current task" (1st level), and "any relation with current task" (2nd level: same group or prerequisite). Following the numbers above, this means one uses the data from 1 + 3 = 4 tasks corresponding to 58,000 tokens. Alternatively if we consider 3 levels, 1st level (1): current task; 2nd level (3): prerequisite of current + same group with current; 3rd level (3): prerequisite of 2nd level, the data usage originates from 1 + 3 + 3 = 7 tasks corresponding to 101,500 tokens. This demonstrates that, from e.g. step 1 in Figure 3 alone, the scale-down method is particularly effective in filtering a substantial part of the full dataset from storage 730, with 1 or 2 orders of magnitude less data for the next, more expensive steps to work on, where the genAI model 720 actually has to run through all remaining pieces and evaluate their weighted relevance values.
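The token arithmetic of this case study can be reproduced with the short calculation below. The constants are the averages stated above; the variable and function names are chosen for this illustration only.

```python
# Average token counts stated in the case study
TASK_DESCRIPTION_TOKENS = 200
SUPPORTING_DOC_TOKENS = 2200
SUPPORTING_DOCS_PER_TASK = 5
USER_INPUT_TOKENS = 3300
TOTAL_TASKS = 100

tokens_per_task = (
    TASK_DESCRIPTION_TOKENS
    + USER_INPUT_TOKENS
    + SUPPORTING_DOCS_PER_TASK * SUPPORTING_DOC_TOKENS
)  # 14,500 tokens

full_system_tokens = TOTAL_TASKS * tokens_per_task  # 1,450,000 tokens

def context_tokens(num_tasks):
    """Token count when only num_tasks tasks are kept as context."""
    return num_tasks * tokens_per_task

single_task = context_tokens(1)           # 14,500 tokens: 1% of the full system
two_levels = context_tokens(1 + 3)        # 58,000 tokens
three_levels = context_tokens(1 + 3 + 3)  # 101,500 tokens

# Reduction factors relative to using every task as context
reduction_two_levels = full_system_tokens / two_levels      # 25x
reduction_three_levels = full_system_tokens / three_levels  # about 14.3x
```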
[0076] Case Study 2
[0077] The inventors tested the “scale up” and “scale down” methods of Figures 1-5 by tracking the token count of AI queries structured to answer a question using a primary information fragment and then enriching the context with supplementary data available in a database. In Table 1, 10 measurements were taken implementing the scale-up method, with each measurement corresponding to a prompt to a genAI model. The measurements were based on prompting the AI model to create a document along predefined requirements (e.g. “Title: Description of my product, Description: This document should cover the following points. . .”), using a specific library of source materials comprised of documents previously created by the prompter, articles about all aspects of product development with some including information at least partially about how to create such a document. It was found that, using the scale-up methodology in Figures 4-5, it was possible to achieve a sum to used token ratio of between 5.92% and 84.43%, dependent on how much relevant contextual information exists in the database for a given prompt. Even reducing the token count by 15% provides significant improvements for GPU / CPU load when it is considered that processing complexity scales quadratically with input length (O(N²) FLOPs) during the prefill phase of inference. Moreover, using the scale-down method, only the most contextually relevant data is retained, such that even if the sum to used token ratio is high, irrelevant or less relevant data has been reduced which will improve TTLT and VRAM load.
[0078] TABLE 1
[0079] It is noted that the reported "used token" count includes the tokens generated by the AI for its response; therefore, the true input token count (which is our target for reduction) is actually even lower than the measured values in Tables 1-3 of the present disclosure. The “sum tokens” refers to the total number of tokens that could be considered as part of the context for the AI model, taking into account all of the contextual pieces of information available in the database.
[0080] In a second series of measurements, summarised in Table 2, the “scale-down” method in Figures 1-3 was implemented and measured. The measurements in Table 2 involved three key steps to answer a question prompt:
[0081] 1. Candidate Search. Embeddings and vector search were used to scan the entire database and find the most highly-ranked pieces of information that could potentially be relevant to the question.
[0082] 2. Context Selection. The top-ranked candidates were provided to the Large Language Model (LLM).
[0083] 3. Final Decision. The LLM was allowed to decide which specific pieces from those candidates should be included as enriched context for generating the final answer.
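The three steps above may be sketched as a small pipeline. This is a minimal illustration only: the store layout (dictionaries with `id` and `vec` keys), the use of cosine similarity for the vector search, and the `llm_select` callback standing in for the LLM's decision are all assumptions and do not describe the inventors' test setup.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def candidate_search(query_vec, store, top_n):
    """Step 1: rank every stored piece by similarity to the question."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]), reverse=True)
    return ranked[:top_n]

def build_context(query_vec, store, top_n, llm_select):
    """Steps 2-3: hand the top candidates to the LLM and keep what it chooses.

    llm_select is a hypothetical callback that receives the candidate ids
    and returns the ids to include as enriched context.
    """
    candidates = candidate_search(query_vec, store, top_n)
    chosen_ids = set(llm_select([c["id"] for c in candidates]))
    return [c for c in candidates if c["id"] in chosen_ids]
```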
[0084] TABLE 2
[0085] As demonstrated by the example in Table 2, the inventors have found that the scale-down method was particularly effective in compressing the ratio of sum:used tokens, representing a significant reduction in computational load during inference of a genAI model.
[0086] In a third series of measurements, presented in Table 3 below, the inventors implemented a combination of steps from the scale-up and scale-down methodologies from Tables 1 and 2. In this third set of measurements, the AI model was asked to generate its response using a large database for context. Context was enriched with some related pieces of information (based on an information map specifying which pieces can provide extra context for other pieces). In addition, embeddings and vector search were used to let the LLM enrich its context with extra pieces of information from the database.
[0087] TABLE 3
[0088] The combination of the scale-up and scale-down steps was found to have a particular synergy exceeding a mere additive benefit, as demonstrated by Table 3.
[0089] A key finding is that the number of used tokens does not scale linearly with the total tokens in the system. On the contrary, the number of used tokens remains constant, demonstrating its independence from the size of the core database.
[0090] The demonstrated reduction in input token volume, achieved through the present context minimization method, translates directly into significant computational efficiency and hardware resource optimization within large-scale AI processing systems.
[0091] Specifically, the reduced input sequence length, denoted as N, directly mitigates the computational burden associated with the transformer's attention mechanism, where the processing complexity scales quadratically with input length (O(N²) FLOPs) during the prefill phase. This optimization significantly lowers the required GPU / CPU processing load. Moreover, as noted above, retaining the most relevant contextual data for the revised prompt also has significant benefits in terms of a reduction of computational load, such as a reduced TTFT which corresponds to reduced overall latency and thus more GPU FLOP cycles freed up for other processes. Furthermore, a shorter input sequence reduces the necessary memory footprint for the Key-Value (KV) Cache, which is stored in the GPU's high-bandwidth memory (VRAM). The memory requirement for the KV Cache scales linearly with the context length, the batch size (B), and the model's depth and dimension (α × N × B × Depth × Dim), where α is a constant factor that helps calculate the precise memory required for the KV cache. Minimizing N thus linearly increases the maximum feasible concurrent requests, or batch size (B), leading to greater system throughput.
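The scaling relationships above can be made concrete with a rough calculation. In this sketch the constant factor is assumed to be 2 × bytes per value (one key and one value vector per token per layer, stored in 16-bit precision), and the example model shape (32 layers, dimension 4096) and 16 GiB memory budget are illustrative assumptions rather than figures from the disclosure.

```python
def attention_prefill_ratio(n_before, n_after):
    """Relative cost of the quadratic attention term after shortening the input."""
    return (n_after / n_before) ** 2

def kv_cache_bytes(n_tokens, batch, depth, dim, bytes_per_value=2):
    """KV cache size, linear in N, B, Depth and Dim.

    Assumption: alpha = 2 * bytes_per_value (keys plus values, fp16).
    """
    return 2 * bytes_per_value * n_tokens * batch * depth * dim

def max_batch(memory_budget_bytes, n_tokens, depth, dim, bytes_per_value=2):
    """Largest batch size whose KV cache fits within the memory budget."""
    per_request = kv_cache_bytes(n_tokens, 1, depth, dim, bytes_per_value)
    return memory_budget_bytes // per_request

budget = 16 * 2**30  # 16 GiB reserved for the KV cache (assumption)
ratio_after_15_percent_cut = attention_prefill_ratio(1000, 850)  # about 0.72
batch_at_1000_tokens = max_batch(budget, 1000, depth=32, dim=4096)  # 32
batch_at_500_tokens = max_batch(budget, 500, depth=32, dim=4096)    # 65
```

Under these assumptions, a 15% reduction in input length lowers the quadratic attention term to roughly 72% of its original cost, and halving the input length roughly doubles the number of concurrent requests that fit in the same memory budget.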
[0092] Whilst the disclosure of the present invention has been made with reference to the accompanying Figures, it will be appreciated that the invention is not limited to the specific examples in those Figures.
Claims
1. A computer-implemented method for reducing computational load during inference of a generative artificial intelligence (genAI) model, comprising the following steps: receiving a first prompt at a user input field associated with the genAI model; computing weighted relevance values of pieces of information in storage; ranking the weighted relevance values of each of the pieces of information; removing ranked pieces of information from the contextual data at least in part based on a pre-defined token limit; and generating a revised version of the first prompt utilising the remaining ranked pieces of information as contextual data.
2. The method of claim 1, wherein in addition or alternatively, the ranked pieces of information are removed from the contextual data based on a pre-defined degree of relevance threshold.
3. The method of claim 1 or claim 2, wherein before computing weighted relevance values of the pieces of information in the storage, the method further comprises: determining the presence of one or more pieces of information in the storage that are not related to the contents of the first prompt, and removing said one or more unrelated pieces of information from the contextual data.
4. The method of claims 1-3, wherein computing weighted relevance values of the pieces of information in the storage comprises: creating a vector representation of each of the pieces of information, creating a vector representation of the first prompt, and comparing the vector representation of each of the pieces of information with the vector representation of the first prompt.
5. The method of claims 1-4, further comprising generating a summary of one or more of the ranked pieces of information, and computing the token sum of the summarised ranked piece(s) of information, wherein the step of removing ranked pieces of information from the contextual data comprises removing one or more summarised pieces based on the pre-defined token limit.
6. The method of claim 5, wherein an extent of summarisation of a ranked piece of information is based on its weighted relevance value.
7. The method of claim 6, wherein a ranked piece of information with a maximal positive weighted relevance is not summarised, and a ranked piece of information with a weighted relevance equal to or exceeding a predefined threshold value is summarised.
8. The method of claims 1-7, further comprising generating a plurality of summaries of each of a plurality of the ranked pieces of information, and caching the plurality of summaries.
9. The method of claim 8, wherein each of the plurality of summaries comprises a different length.
10. The method of claims 1-9, wherein the storage comprises a cache or a database.
11. The method of claims 1-10, further comprising storing the revised first prompt in storage for use in revising subsequent prompts.
12. The method of claims 1-12, further comprising reviewing the contents of storage for additional piece(s) of information, identifying piece(s) of information relevant to the contents of the first prompt based on one or more criteria, and adding the piece(s) of information to the contextual data for revision of the first prompt.
13. The method of claims 1-13, further comprising replacing one or more of the ranked piece(s) of information in the contextual data with a different piece of information from storage having a greater relevance to the content of the first prompt based on a weighted relevance value comparison.
14. The method of claims 1-14, further comprising assessing the similarity of at least two pieces of information in storage, and merging the at least two pieces in response to determining they are characterised by a similarity exceeding a predefined threshold value.
15. The method of claim 14, wherein assessing the similarity of the at least two pieces of information comprises determining a weighted relevance value of one or each relative to the other(s) and comparing the determined value(s).
16. A computer-implemented method for reducing computational load during inference of a generative artificial intelligence (genAI) model, comprising the following steps: receiving a plurality of pieces of information at a computing system; generating a summary of the content of at least some of the pieces of information; receiving a first prompt at a user input field associated with the genAI model; computing weighted relevance values of the pieces of information compared with the first prompt; adding a number of the most relevant pieces of information to the contextual data; and generating a revised version of the first prompt using the pre-defined number of the most relevant pieces of information.
17. The method of claim 16, further comprising repeating the step of adding one or more of the most relevant pieces of information to the contextual data.
18. The method of claim 17, comprising repeating the step of adding a predefined number of the most relevant pieces of information to the contextual data until a predefined token limit is reached for the pieces of information.
19. The method of claims 16-18, further comprising replacing one or more of the ranked piece(s) of information in the contextual data with a different piece of information from storage having a greater relevance to the content of the first prompt based on a weighted relevance value comparison.
20. The method of claims 16-19, further comprising assessing the similarity of at least two pieces of information in storage, and merging the at least two pieces in response to determining they are characterised by a similarity exceeding a predefined threshold value.
21. The method of claims 16-20, wherein comparing the similarity of the at least two pieces of information comprises determining a weighted relevance value of one or each relative to the other(s) and comparing the determined value(s).
22. The method of claims 16-21, wherein computing weighted relevance values of the pieces of information comprises: creating a vector representation of each of the pieces of information, creating a vector representation of the first prompt, and comparing the vector representation of each of the pieces of information with the vector representation of the first prompt.
23. The method of claims 16-22, further comprising storing the revised first prompt for use in revising subsequent prompts.
24. A system for implementing the computer implemented method of any one of claims 1-23, comprising a virtual environment in which a generative artificial intelligence (genAI) model is deployed, wherein the virtual environment comprises at least one user input field for the user to input a prompt to the genAI model, and wherein the genAI model and / or the virtual environment is in communication with one or more storage elements comprising pieces of information.
25. The system of claim 24, wherein the virtual environment comprises a website, a web application, a mobile application, a virtual reality environment, or an augmented reality environment.
26. Computer-readable media comprising instructions stored thereon that, when executed by one or more processors, cause the processor(s) to carry out the steps of any of claims 1-23.