System and method for updating a vector database through continuous vectorization

WO2026143203A2 · PCT designated stage · Publication Date: 2026-07-02 · MORPHOS AI INC

Patent Information

Authority / Receiving Office: WO · WO
Patent Type: Applications
Current Assignee / Owner: MORPHOS AI INC
Filing Date: 2025-12-24
Publication Date: 2026-07-02


Abstract

A continuous vectorization system and method for updating a vector database is disclosed. The method includes receiving a new data and identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data. The target facet vector belongs to the target facet. The method also includes generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and an update vector. The update vector is the new data in a vectorized form. The update vector is multiplied by a weight w produced by a weighting function and the target facet vector is multiplied by (1-w). The method includes storing the updated facet vector within the vector database.

Description

Agent Reference: 22735-002WO-PCT

SYSTEM AND METHOD FOR UPDATING A VECTOR DATABASE THROUGH CONTINUOUS VECTORIZATION

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims the benefit of U.S. utility patent application 19/004,131, filed December 27, 2024, titled “SYSTEM AND METHOD FOR UPDATING A VECTOR DATABASE THROUGH CONTINUOUS VECTORIZATION,” the entirety of the disclosure of which is hereby incorporated by this reference.

TECHNICAL FIELD

[0002] Aspects of this document relate generally to vector storage.

BACKGROUND

[0003] Vector storage has emerged as a crucial element in the framework of modern artificial intelligence (AI) and machine learning (ML) systems. Vectors, serving as numerical representations of data, encapsulate semantic meanings, thereby facilitating the efficient processing, retrieval, and comparison of extensive and complex datasets. The surge in demand for AI and ML applications, which typically require vast quantities of data, underscores the need for efficient vector storage solutions. Nevertheless, traditional vector storage technologies encounter several challenges that can impede the performance, scalability, and cost-efficiency of AI systems.

[0004] Traditional vector storage struggles to efficiently manage continuous data streams. In AI applications handling substantial volumes of real-time data, such as social media platforms, recommendation engines, and financial analysis tools, the frequent updating of vector representations can impose a significant burden. Conventional approaches typically address this either by maintaining a single vector per data point or facet, or by storing each incoming piece of data as a distinct vector. Both approaches have significant drawbacks.

[0005] Utilizing the single vector approach necessitates re-vectorizing the entire dataset each time new data is received. This process entails recalculating all previous data for that data point to generate a new single vector representation, ensuring that the stored vectors remain concise and performant. However, the computational cost associated with this method is prohibitive. For example, in a social media application with a facet intended to represent the posts of a particular user, every new post would require re-vectorizing the entire history of posts and interactions, resulting in substantial computational overhead and inefficiency that will only increase over time. This approach is akin to rebuilding a house to change a lightbulb.

[0006] A more common approach is to store each new piece of data as an individual vector. This is attractive because it reduces the immediate computational costs by eliminating the need to reprocess existing data, and storage is inexpensive in comparison to compute. Nonetheless, this leads to unbounded growth in storage requirements, since every incoming piece of data adds another vector. As data continually streams in, the database rapidly becomes bloated with vectors, many of which contain redundant or minimally varied information. This causes elevated storage costs and deteriorates search performance. For example, a recommendation system frequently updating user preferences would accumulate vast quantities of nearly identical vectors representing each minor change, thereby slowing searches and consuming excessive resources.

[0007] Traditional vector storage methodologies also grapple with maintaining data relevance and accuracy. When each data point is stored as an individual vector, searches can become imprecise due to noise from redundant vectors, complicating the retrieval of relevant information swiftly, especially in real-time processing and decision-making applications.

[0008] In addition to increased costs that can be a barrier to innovation and performance that degrades over time, the inefficiencies found in traditional vector storage also have environmental consequences. The computational power required to continuously update and store vectors results in high energy consumption, contributing to increased carbon emissions.

SUMMARY

[0009] According to one aspect, a method for updating a vector database includes receiving a new data, and identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet. The method also includes generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and an update vector, the update vector being the new data in a vectorized form, with the update vector being multiplied by a weight w produced by a weighting function and the target facet vector being multiplied by (1-w). The method additionally includes storing the updated facet vector within the vector database.
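
The interpolation step described above can be illustrated with a minimal Python sketch. The function name `interpolate` and the example vectors are hypothetical and chosen only for illustration. The arithmetic follows the text: the update vector is multiplied by w and the target facet vector by (1-w).

```python
from typing import List, Sequence


def interpolate(target: Sequence[float], update: Sequence[float], w: float) -> List[float]:
    """Return (1 - w) * target + w * update, element by element."""
    if len(target) != len(update):
        raise ValueError("vectors must have the same dimension")
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight w must lie in [0, 1]")
    return [(1.0 - w) * t + w * u for t, u in zip(target, update)]


# Hypothetical three-dimensional vectors; real embeddings are much larger.
target_facet_vector = [1.0, 0.0, 2.0]
update_vector = [0.0, 1.0, 2.0]

updated_facet_vector = interpolate(target_facet_vector, update_vector, w=0.25)
print(updated_facet_vector)  # [0.75, 0.25, 2.0]
```

The updated facet vector has the same dimension as the target facet vector, so it can overwrite it in place without changing the size of the stored facet.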

[0010] Particular embodiments may comprise one or more of the following features. The method may further include generating the update vector by vectorizing the new data with an embedding model. The new data may be received as raw data. The target facet may include at most one vector. Storing the updated facet vector within the vector database may include overwriting the target facet vector with the updated facet vector. The method may further include storing the update vector within the vector database. The weighting function may depend, at least in part, on a vector count, the vector count being the number of vectors that have been combined through linear interpolation to yield the target facet vector. The weighting function may be average-based, and, if n is the vector count, the weight may be 1 / (n+1). The weighting function may be order-based, and the weight may be equal to a decay factor that is greater than 0 and less than 1. The decay factor may be a function and may be dependent on an elapsed time since the target facet vector was last updated.
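
The weighting functions listed above might be sketched as follows. The average-based and order-based forms follow the text directly. The time-dependent form is an assumption: the source states only that the decay factor may depend on elapsed time, so the exponential shape, the `half_life_seconds` parameter, and the clamping bounds are illustrative choices rather than the disclosed function.

```python
def average_weight(vector_count: int) -> float:
    """Average-based weight 1 / (n + 1), where n is the vector count."""
    if vector_count < 0:
        raise ValueError("vector count cannot be negative")
    return 1.0 / (vector_count + 1)


def order_weight(decay_factor: float) -> float:
    """Order-based weight: a decay factor strictly between 0 and 1."""
    if not 0.0 < decay_factor < 1.0:
        raise ValueError("decay factor must be greater than 0 and less than 1")
    return decay_factor


def time_decay_weight(elapsed_seconds: float, half_life_seconds: float = 3600.0) -> float:
    """Hypothetical time-dependent decay factor (not taken from the source).

    Assumption: the longer the facet vector has gone without an update,
    the more weight the new data receives.
    """
    if elapsed_seconds < 0 or half_life_seconds <= 0:
        raise ValueError("elapsed time must be >= 0 and half-life must be > 0")
    w = 1.0 - 0.5 ** (elapsed_seconds / half_life_seconds)
    # Clamp so the result stays strictly between 0 and 1.
    return min(max(w, 1e-6), 1.0 - 1e-6)


print(average_weight(3))          # 0.25
print(order_weight(0.3))          # 0.3
print(time_decay_weight(3600.0))  # 0.5
```

With a vector count of 0 the average-based weight is 1, so the first update vector becomes the facet vector unchanged.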

[0011] According to another aspect of the disclosure, a continuous vectorization system includes a vector database having a plurality of vectors and a plurality of facets, each facet describing at least one vector associated with the facet on the basis of at least one of a value and an attribute reflected by the vector. The system also includes a continuous vectorization server communicatively coupled to the vector database. The continuous vectorization server includes a processor and a memory, the memory having a weighting function and the processor configured to receive a new data and identify a target facet within the vector database using at least one of a value of the new data and an attribute of the new data. The processor is further configured to identify a target facet vector belonging to the target facet using the new data, retrieve the target facet vector from the vector database, and generate a weight w by applying the weighting function to at least a part of at least one of the target facet, the target facet vector, the new data in a raw data form, and the new data in a vectorized form. Additionally, the processor is configured to create an updated facet vector via a weighted linear interpolation between the target facet vector and an update vector by performing a linear interpolation between the update vector multiplied by the weight and the target facet vector multiplied by (1-w), and send the updated facet vector to the vector database for storage. The update vector is the new data in a vectorized form.
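
The server-side flow described above (identify the target facet, retrieve its vector, generate a weight, interpolate, store) can be sketched with an in-memory stand-in for the vector database. Everything here is hypothetical: the class names `Facet` and `InMemoryVectorStore`, the string facet key, and the choice of the average-based weight are assumptions made for illustration, and a real deployment would call an actual vector database and embedding model.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Facet:
    vector: Optional[List[float]] = None  # at most one vector per facet
    vector_count: int = 0                 # number of vectors merged so far


@dataclass
class InMemoryVectorStore:
    """Stand-in for the vector database, keyed by a facet attribute."""
    facets: Dict[str, Facet] = field(default_factory=dict)

    def update(self, facet_key: str, update_vector: List[float]) -> List[float]:
        # Identify (or create) the target facet from an attribute of the new data.
        facet = self.facets.setdefault(facet_key, Facet())
        if facet.vector is None:
            updated = list(update_vector)
        else:
            w = 1.0 / (facet.vector_count + 1)  # average-based weight
            updated = [(1.0 - w) * t + w * u
                       for t, u in zip(facet.vector, update_vector)]
        facet.vector = updated  # overwrite the target facet vector
        facet.vector_count += 1
        return updated


store = InMemoryVectorStore()
store.update("user:42", [1.0, 0.0])
store.update("user:42", [0.0, 1.0])

print(store.facets["user:42"].vector)  # [0.5, 0.5]
print(len(store.facets))               # 1
```

Two updates to the same facet leave a single stored vector, which is the storage behavior the overwrite embodiment describes.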

[0012] Particular embodiments may comprise one or more of the following features. The processor of the continuous vectorization server may be further configured to receive the new data from a client device communicatively coupled to the continuous vectorization server through a network. The vector database may be remote and may be communicatively coupled to the continuous vectorization server through a network. The new data may be raw data, and the processor of the continuous vectorization server may be further configured to generate the update vector by vectorizing the new data with an embedding model. The target facet may include, at most, one vector. Sending the updated facet vector to the vector database for storage may include instructing the vector database to overwrite the target facet vector with the updated facet vector. The processor may be further configured to send the update vector to the vector database for storage. The weighting function may depend, at least in part, on a vector count, the vector count being the number of vectors that have been combined through linear interpolation to yield the target facet vector. The weighting function may be average-based, and, if n is the vector count, the weight may be 1 / (n+1). The weighting function may be order-based, and the weight may be equal to a decay factor that is greater than 0 and less than 1. The decay factor may be a function and may be dependent on an elapsed time since the target facet vector was last updated.
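
A short worked derivation may help show what the two weighting choices above compute. The notation is introduced here for illustration and is not from the source: v_n is the facet vector after n update vectors u_1, ..., u_n have been combined, with v_1 = u_1, and d is a constant decay factor.

```latex
% Average-based weight w = 1/(n+1): the facet vector stays the arithmetic mean.
\begin{aligned}
v_{n+1} &= \Bigl(1 - \tfrac{1}{n+1}\Bigr) v_n + \tfrac{1}{n+1}\, u_{n+1}
         = \frac{n\, v_n + u_{n+1}}{n+1}, \\
v_n = \frac{1}{n}\sum_{i=1}^{n} u_i
  \;&\Longrightarrow\;
v_{n+1} = \frac{1}{n+1}\sum_{i=1}^{n+1} u_i .
\end{aligned}

% Order-based weight w = d with 0 < d < 1: an exponential moving average.
\begin{aligned}
v_{n+1} &= (1 - d)\, v_n + d\, u_{n+1}, \\
v_n &= (1 - d)^{\,n-1} u_1 + d \sum_{i=2}^{n} (1 - d)^{\,n-i}\, u_i .
\end{aligned}
```

Under the average-based weight, every update vector contributes equally regardless of arrival order. Under the order-based weight, the contribution of older update vectors shrinks geometrically, so recent data dominates.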

[0013] According to still another aspect of the disclosure, a method for updating a vector database includes receiving a new data. The new data is received as raw data including text. The method also includes identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet, as well as generating a weight w produced by a weighting function. Generating the weight w includes preprocessing the new data to generate a term profile, updating corpus statistics within a term statistics database, calculating using a relevance function, for each facet within the vector database, a relevance score for the new data relative to the facet, the relevance score being a function of the term profile and the corpus statistics. Generating the weight w also includes calculating the weight w using the weighting function, the weighting function being a function of the relevance scores, generating an update vector by vectorizing the new data with an embedding model, and generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w). The method also includes storing the updated facet vector within the vector database.

[0014] Particular embodiments may comprise one or more of the following features. The relevance function may be BM25(C, F) = SUM(s in C) [ IDF(s) * (tf(s, C) * (k1 + 1)) / (tf(s, C) + k1 * (1 - b + b * (|C| / avgdl))) ], where s represents each stem in chunk C, tf(s, C) is the term frequency of stem s in chunk C, |C| is the length of chunk C (in stems), avgdl is the average chunk length across the corpus, k1 is a tuning parameter controlling term frequency saturation, and b is a tuning parameter controlling length normalization. The IDF(s) may be the inverse document frequency of stem s and defined as IDF(s) = ln( (N - n(s) + 0.5) / (n(s) + 0.5) + 1 ), where N is the total number of chunks in the corpus, and n(s) is the number of chunks containing stem s. The relevance function may include an inverse document frequency calculation. The weighting function may be w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero. The weighting function may be w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter. The weighting function may also be a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated. Preprocessing may include removing stop words from the new data. Preprocessing may include applying stemming to generate stems for the term profile.
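
The relevance and weighting formulas above might be sketched as follows. Several points are assumptions: the formula as written does not state how the facet F enters the right-hand side, so this sketch simply takes the document-frequency table, chunk count, and average length as arguments and leaves open how they are maintained per facet. The sum is taken over distinct stems. The defaults k1 = 1.2 and b = 0.75 are common BM25 settings, not values from the source, and the facet names and scores in the example are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def idf(total_chunks: int, chunks_with_stem: int) -> float:
    """IDF(s) = ln((N - n(s) + 0.5) / (n(s) + 0.5) + 1)."""
    return math.log((total_chunks - chunks_with_stem + 0.5) / (chunks_with_stem + 0.5) + 1.0)


def bm25(stems: List[str], doc_freq: Dict[str, int], total_chunks: int,
         avgdl: float, k1: float = 1.2, b: float = 0.75) -> float:
    """Relevance score of a chunk given its stems and corpus statistics."""
    term_profile = Counter(stems)  # tf(s, C) for each distinct stem
    norm = 1.0 - b + b * (len(stems) / avgdl)
    score = 0.0
    for stem, tf in term_profile.items():
        score += idf(total_chunks, doc_freq.get(stem, 0)) * (tf * (k1 + 1.0)) / (tf + k1 * norm)
    return score


def proportional_weights(scores: Dict[str, float], epsilon: float = 1e-9) -> Dict[str, float]:
    """w_F = BM25(C, F) / (sum over facets of BM25(C, F') + epsilon)."""
    total = sum(scores.values()) + epsilon
    return {facet: s / total for facet, s in scores.items()}


def softmax_weights(scores: Dict[str, float], beta: float = 1.0) -> Dict[str, float]:
    """w_F = exp(beta * score_F) / sum over facets of exp(beta * score_F')."""
    peak = max(scores.values())  # shift by the max for numerical stability
    exps = {facet: math.exp(beta * (s - peak)) for facet, s in scores.items()}
    total = sum(exps.values())
    return {facet: e / total for facet, e in exps.items()}


# Made-up relevance scores for two hypothetical facets.
scores = {"sports": 3.0, "finance": 1.0}
print(proportional_weights(scores))
print(softmax_weights(scores, beta=0.5))
```

A larger beta concentrates the weight on the highest-scoring facet, while a beta near zero spreads the weight almost evenly.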

[0015] According to yet another aspect of the disclosure, a method for updating a vector database includes receiving a new data. The new data is received as raw data including a document having natural boundaries. The method also includes determining that a size of the new data exceeds a threshold, defining a root node with the document as a chunk, organizing the root node into a hierarchical tree of nodes by recursively splitting the chunk of each node along natural boundaries into smaller chunks and defining a new node for each chunk, each node having metadata specifying at least one of a parent node relationship pointing to a node with a larger chunk that includes the chunk and a child node relationship pointing to a node with a smaller chunk that is part of the chunk. The method also includes defining, for each node, a new facet within the vector database including the chunk of the node and at least one attribute having at least one of a parent facet relationship mirroring the parent node relationship of the node and a child facet relationship mirroring the child node relationship of the node. The method includes vectorizing the chunk of at least a subset of the new facets with an embedding model, with the subset comprising an update facet, identifying within the vector database a target facet and a target facet vector, using at least one of a value of the update facet and an attribute of the update facet, the target facet vector belonging to the target facet, and generating a weight w produced by a weighting function. The method includes generating an updated facet vector that reflects an update vector of the update facet by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w), and storing the updated facet vector within the vector database.
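
The recursive splitting and facet definition described above can be sketched as follows. This is a simplified illustration under stated assumptions: chunk size is measured in characters, the natural boundaries are limited to a hypothetical `SEPARATORS` list (paragraph break, sentence break, space), splitting drops the separator text, and facets are represented as plain dictionaries rather than records in a vector database. The names `Node`, `build_tree`, and `to_facets` are not from the source.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical natural boundaries, coarsest first.
SEPARATORS = ["\n\n", ". ", " "]


@dataclass(eq=False)
class Node:
    chunk: str
    level: int
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)


def build_tree(chunk: str, target_size: int, level: int = 0,
               parent: Optional[Node] = None) -> Node:
    """Recursively split a chunk until each leaf is at most target_size long."""
    node = Node(chunk=chunk, level=level, parent=parent)
    if len(chunk) <= target_size:
        return node
    for sep in SEPARATORS:
        parts = [p for p in chunk.split(sep) if p.strip()]
        if len(parts) > 1:  # use the coarsest boundary that actually splits
            node.children = [build_tree(p, target_size, level + 1, node) for p in parts]
            break
    return node


def to_facets(root: Node) -> List[Dict]:
    """Define one facet per node, mirroring parent and child relationships."""
    facets: List[Dict] = []

    def visit(node: Node, parent_id: Optional[int]) -> int:
        facet_id = len(facets)
        facet = {"id": facet_id, "chunk": node.chunk, "level": node.level,
                 "parent": parent_id, "children": []}
        facets.append(facet)
        for child in node.children:
            facet["children"].append(visit(child, facet_id))
        return facet_id

    visit(root, None)
    return facets


document = "Alpha beta. Gamma delta.\n\nEpsilon zeta."
facets = to_facets(build_tree(document, target_size=12))
print(len(facets))            # 7
print(facets[0]["children"])  # [1, 4]
```

Each facet records both its parent and its children, so the relationships can be followed in either direction when deciding which facets to vectorize.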

[0016] Particular embodiments may comprise one or more of the following features. All new facets may be vectorized. The metadata of each node may further include a level within the hierarchical tree of nodes. The attributes of each new facet may further include the level of the associated node. The subset of the new facets that are vectorized may be composed of new facets having at least one of the top N levels and the bottom M levels. N and M may each be equal to 1. Each parent facet relationship and / or child facet relationship may be bidirectional. Natural boundaries may include at least one of section boundaries, subsection boundaries, paragraph boundaries, sentence boundaries, and token boundaries. Recursively splitting the new data along natural boundaries into smaller chunks may continue until the size of the smaller chunks is at most equal to a target chunk size. Recursively splitting the chunk of each node along natural boundaries may include selecting the natural boundaries to split along based, at least in part, on the relative sizes of the resulting chunks. The threshold may be one of 50,000 tokens, 100,000 characters, and 50 pages. The threshold may vary based on a document type. The threshold may vary based on a source of the document. The method may further include dynamically adjusting the threshold based on at least one of available computing resources and use case requirements. Generating the weight w may include preprocessing the chunk of the update facet to generate a term profile, updating corpus statistics within a term statistics database, and calculating using a relevance function, for each facet within the vector database, a relevance score for the chunk of the update facet relative to the facet. The relevance score may be a function of the term profile and the corpus statistics. Generating the weight w may also include calculating the weight w using the weighting function, the weighting function being a function of the relevance scores. The relevance function may be BM25(C, F) = SUM(s in C) [ IDF(s) * (tf(s, C) * (k1 + 1)) / (tf(s, C) + k1 * (1 - b + b * (|C| / avgdl))) ], where s represents each stem in chunk C, tf(s, C) is the term frequency of stem s in chunk C, |C| is the length of chunk C (in stems), avgdl is the average chunk length across the corpus, k1 is a tuning parameter controlling term frequency saturation, b is a tuning parameter controlling length normalization, and IDF(s) may be the inverse document frequency of stem s and defined as IDF(s) = ln( (N - n(s) + 0.5) / (n(s) + 0.5) + 1 ), where N is the total number of chunks in the corpus, and n(s) is the number of chunks containing stem s. The relevance function may include an inverse document frequency calculation. The weighting function may be w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero. The weighting function may be w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter. The weighting function may also be a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated. Preprocessing may include removing stop words from the new data. Preprocessing may include applying stemming to generate stems for the term profile.

[0017] Aspects and applications of the disclosure presented here are described below in the drawings and detailed description. Unless specifically noted, it is intended that the words and phrases in the specification and the claims be given their plain, ordinary, and accustomed meaning to those of ordinary skill in the applicable arts. The inventors are fully aware that they can be their own lexicographers if desired. The inventors expressly elect, as their own lexicographers, to use only the plain and ordinary meaning of terms in the specification and claims unless they clearly state otherwise and then further, expressly set forth the “special” definition of that term and explain how it differs from the plain and ordinary meaning. Absent such clear statements of intent to apply a “special” definition, it is the inventors’ intent and desire that the simple, plain and ordinary meaning to the terms be applied to the interpretation of the specification and claims.

[0018] The inventors are also aware of the normal precepts of English grammar. Thus, if a noun, term, or phrase is intended to be further characterized, specified, or narrowed in some way, then such noun, term, or phrase will expressly include additional adjectives, descriptive terms, or other modifiers in accordance with the normal precepts of English grammar. Absent the use of such adjectives, descriptive terms, or modifiers, it is the intent that such nouns, terms, or phrases be given their plain, and ordinary English meaning to those skilled in the applicable arts as set forth above.

[0019] Further, the inventors are fully informed of the standards and application of the special provisions of 35 U.S.C. § 112(f). Thus, the use of the words “function,” “means” or “step” in the Detailed Description or Description of the Drawings or claims is not intended to somehow indicate a desire to invoke the special provisions of 35 U.S.C. § 112(f), to define the invention. To the contrary, if the provisions of 35 U.S.C. § 112(f) are sought to be invoked to define the inventions, the claims will specifically and expressly state the exact phrases “means for” or “step for”, and will also recite the word “function” (i.e., will state “means for performing the function of [insert function]”), without also reciting in such phrases any structure, material or act in support of the function. Thus, even when the claims recite a “means for performing the function of . . .” or “step for performing the function of . . . ,” if the claims also recite any structure, material or acts in support of that means or step, or that perform the recited function, then it is the clear intention of the inventors not to invoke the provisions of 35 U.S.C. § 112(f). Moreover, even if the provisions of 35 U.S.C. § 112(f) are invoked to define the claimed aspects, it is intended that these aspects not be limited only to the specific structure, material or acts that are described in the preferred embodiments, but in addition, include any and all structures, materials or acts that perform the claimed function as described in alternative embodiments or forms of the disclosure, or that are well known present or later-developed, equivalent structures, material or acts for performing the claimed function.

[0020] The foregoing and other aspects, features, and advantages will be apparent to those artisans of ordinary skill in the art from the DESCRIPTION and DRAWINGS, and from the CLAIMS.

BRIEF DESCRIPTION OF THE DRAWINGS

[0022] The disclosure will hereinafter be described in conjunction with the appended drawings, where like designations denote like elements, and:

[0023] FIGs. 1A and 1B are schematic views of two embodiments of a continuous vectorization system;

[0024] FIGs. 2A and 2B are process views of the continuous vectorization systems of FIGs. 1A and 1B, respectively;

[0025] FIG. 3 is a process flow of a method for updating a vector database through continuous vectorization;

[0026] FIG. 4 is a relevance plot of L2 distances from query vectors to result vectors obtained through standard and continuous vectorization;

[0027] FIG. 5 is a process view of the application of auto weighting within the system of FIG. 1B; and

[0028] FIG. 6 is a process view of a megachunking process for constructing a hierarchical tree representation of a document and defining facets corresponding to nodes of the tree.

DETAILED DESCRIPTION

[0029] This disclosure, its aspects and implementations, are not limited to the specific material types, components, methods, or other examples disclosed herein. Many additional material types, components, methods, and procedures known in the art are contemplated for use with particular implementations from this disclosure. Accordingly, for example, although particular implementations are disclosed, such implementations and implementing components may comprise any components, models, types, materials, versions, quantities, and / or the like as is known in the art for such systems and implementing components, consistent with the intended operation.

[0030] The word "exemplary," "example," or various forms thereof are used herein to mean serving as an example, instance, or illustration. Any aspect or design described herein as "exemplary" or as an “example” is not necessarily to be construed as preferred or advantageous over other aspects or designs. Furthermore, examples are provided solely for purposes of clarity and understanding and are not meant to limit or restrict the disclosed subject matter or relevant portions of this disclosure in any manner. It is to be appreciated that a myriad of additional or alternate examples of varying scope could have been presented, but have been omitted for purposes of brevity.

[0031] While this disclosure includes a number of embodiments in many different forms, there is shown in the drawings and will herein be described in detail particular embodiments with the understanding that the present disclosure is to be considered as an exemplification of the principles of the disclosed methods and systems, and is not intended to limit the broad aspect of the disclosed concepts to the embodiments illustrated.

[0032] Vector storage has emerged as a crucial element in the framework of modern artificial intelligence (AI) and machine learning (ML) systems. Vectors, serving as numerical representations of data, encapsulate semantic meanings, thereby facilitating the efficient processing, retrieval, and comparison of extensive and complex datasets. The surge in demand for AI and ML applications, which typically require vast quantities of data, underscores the need for efficient vector storage solutions. Nevertheless, traditional vector storage technologies encounter several challenges that can impede the performance, scalability, and cost-efficiency of AI systems.

[0033] Traditional vector storage struggles to efficiently manage continuous data streams. In AI applications handling substantial volumes of real-time data, such as social media platforms, recommendation engines, and financial analysis tools, the frequent updating of vector representations can impose a significant burden. Conventional approaches typically address this either by maintaining a single vector per data point or facet, or by storing each incoming piece of data as a distinct vector. Both approaches have significant drawbacks.

[0034] Utilizing the single vector approach necessitates re-vectorizing the entire dataset each time new data is received. This process entails recalculating all previous data for that data point to generate a new single vector representation, ensuring that the stored vectors remain concise and performant. However, the computational cost associated with this method is prohibitive. For example, in a social media application with a facet intended to represent the posts of a particular user, every new post would require re-vectorizing the entire history of posts and interactions, resulting in substantial computational overhead and inefficiency that will only increase over time. This approach is akin to rebuilding a house to change a lightbulb.

[0035] A more common approach is to store each new piece of data as an individual vector. This is attractive because it reduces the immediate computational costs by eliminating the need to reprocess existing data, and storage is inexpensive in comparison to compute. Nonetheless, this leads to unbounded growth in storage requirements, since every incoming piece of data adds another vector. As data continually streams in, the database rapidly becomes bloated with vectors, many of which contain redundant or minimally varied information. This causes elevated storage costs and deteriorates search performance. For example, a recommendation system frequently updating user preferences would accumulate vast quantities of nearly identical vectors representing each minor change, thereby slowing searches and consuming excessive resources.

[0036] Traditional vector storage methodologies also grapple with maintaining data relevance and accuracy. When each data point is stored as an individual vector, searches can become imprecise due to noise from redundant vectors, complicating the retrieval of relevant information swiftly, especially in real-time processing and decision-making applications.

[0037] In addition to increased costs that can be a barrier to innovation and performance that degrades over time, the inefficiencies found in traditional vector storage also have environmental consequences. The computational power required to continuously update and store vectors results in high energy consumption, contributing to increased carbon emissions.

[0038] Contemplated herein is a system and method for updating a vector database using continuous vectorization. The contemplated system and method addresses several critical challenges associated with traditional vector storage and computation. Similar to the conventional "single vector" approach, continuous vectorization (or CV) reduces a facet to one or a few vectors. However, rather than re-vectorizing an ever-expanding set of data, the contemplated system and method adjusts the existing facet vector to reflect the influence the updated data would have by performing a weighted linear interpolation between the facet vector and the new or updated data in vectorized form. This linear interpolation maintains the semantic meaning of the facet vector without polluting the search space with noisy redundant vectors, according to various embodiments.

[0039] The contemplated continuous vectorization approach provides the computational benefits of the conventional "everything as a separate vector" method and the storage efficiency of the conventional "single vector" method. A facet, in the context of CV, has a single vector, or a few vectors, rather than the hundreds or thousands found in conventional systems. More beneficial than just a conventional "single vector", these CV facet vectors can also be referred to as green vectors, reflecting the efficiencies and environmental benefits they provide through reduction in storage and energy use. This is in sharp contrast with the expensive inefficiencies of conventional vector storage approaches.

[0040] The use of weighted linear interpolation in the contemplated system and method eliminates the need for complete re-vectorization of datasets. By employing linear interpolation, continuous vectorization significantly reduces computational costs and storage bloat associated with handling high-dimensional data. In some applications, the contemplated CV approach reduced the storage requirements by as much as 90% when compared to conventional solutions, and those storage requirements remain at the same level over time, unlike those of conventional systems.

[0041] Advantageous over conventional methods, the continuous vectorization approach minimizes data redundancy and computational overhead by integrating new data into existing vectors rather than creating new (and often highly redundant) vectors for every minor change. This leads to a substantial reduction in energy consumption and carbon emissions due to the decreased computational power required.

[0042] The continuous vectorization technique also enhances data retrieval precision by reducing noise introduced by redundant vectors, facilitating quicker and more accurate information retrieval. Moreover, it addresses scalability issues as AI models grow in complexity and handle more diverse datasets, maintaining compact vector representations that scale efficiently without escalating storage and processing demands.

[0043] The contemplated CV system and method provides performant and accurate vector storage of high-dimensional data streams with lower computational and storage requirements than would be demanded by applying conventional methods to the same data stream.

[0044] The contemplated CV system and method introduces a novel approach to vector storage through the dynamic interaction between faceting and weighted interpolation, where facets serve as active semantic aggregation points that are continuously updated through a weighted interpolation process. This specific combination transforms how facets function in vector storage, moving beyond their traditional role as passive organizational units to become dynamic semantic maintainers that preserve meaning while dramatically reducing storage requirements.

[0045] Unlike conventional systems where facets merely organize static vectors, this approach enables facets to actively participate in maintaining semantic relationships through continuous weighted interpolation. This transformation is made possible through the specific interaction between faceting and weighted interpolation described herein, as neither component alone could achieve the dual benefits of semantic preservation and storage reduction.

[0046] It should be noted that while much of the following discussion is done in the context of a continuous vectorization system being applied in a social media-related use case where the vectors are being used for semantic search and comparison, the contemplated CV system and method may be applied to a wide range of additional vector storage use cases. The use cases discussed below are for illustrative purposes, and should not be taken as limitations to how the contemplated continuous vectorization system and method may be applied.

[0047] FIGs. 1A and 1B are schematic views of two non-limiting examples of a continuous vectorization system 100 for updating a vector database 104. Specifically, FIG. 1A shows a continuous vectorization system 100 providing "storage-as-a-service" through a network 112, with users interacting with the system 100 in a manner similar to that of a conventional vector storage service. FIG. 1B shows a continuous vectorization system 100 implemented as an in-house solution for a user, while coupled to a cloud-based vector database 104. These two embodiments, and others, will be discussed in greater detail below.

[0048] As shown, the continuous vectorization system 100 comprises a continuous vectorization server 102 that is communicatively coupled to a vector database 104. In some embodiments, including the non-limiting example shown in FIG. 1B, the continuous vectorization server 102 may be communicatively coupled to the vector database 104 through a network 112 (e.g., the Internet). Additionally, in some embodiments, the continuous vectorization server 102 may be communicatively coupled to one or more client devices 110 through the network 112.

[0049] In the non-limiting examples of a continuous vectorization system 100 shown in FIGs. 1A and 1B, the continuous vectorization server 102 is depicted as comprising a processor 106 and a memory 108, with the memory 108 holding various elements, such as a weighting function 120 and an embedding model 118. According to various embodiments, the continuous vectorization server 102 is a computing device comprising at least one processor 106 and able to perform the various functions that will be discussed below. The continuous vectorization server 102 may be implemented in a variety of hardware environments. In some embodiments it may be a discrete machine, while in other embodiments the continuous vectorization server 102 may be implemented in a distributed computing environment. In still other embodiments, the contemplated continuous vectorization server 102 may be implemented in a containerized environment, or as a virtual machine. Those skilled in the art will recognize that the continuous vectorization server 102 may be adapted to use a wide range of hardware environments. The depiction of the continuous vectorization server 102 as a single machine with a processor 106 in FIGs. 1A and 1B should not be interpreted as a limitation.

[0050] Additionally, the continuous vectorization server 102 comprises a memory 108, and that memory 108 comprises at least a weighting function 120. In the context of the present description and the claims that follow, for a server to comprise a memory 108 that comprises elements such as a weighting function 120 means that the weighting function 120 (or any other elements that are described as being in the memory 108) is maintained / stored in a fashion that makes it available to the processor 106 for use. This could be in long-term storage such as magnetic media or solid state storage, held in RAM, or otherwise accessible to the processor 106. The depiction of the weighting function 120 and an embedding model 118 being in the memory 108 of the continuous vectorization server 102 shown in FIGs. 1A and 1B should not be interpreted as a limitation. Those skilled in the art will recognize the various ways a routine such as the weighting function 120 or a model such as the embedding model 118 may be made readily available to the processor 106 for use.

[0051] The continuous vectorization system 100 comprises a vector database 104 that is communicatively coupled to the continuous vectorization server 102. In some embodiments, the vector database 104 may be a discrete computing device connected to the server, either locally or remotely through a network 112 (e.g., a cloud-based vector database 104, etc.). In other embodiments, the vector database 104 may be implemented in a distributed computing environment. In still other embodiments, the vector database 104 may be implemented within the same hardware environment as the continuous vectorization server 102 (e.g., same machine, same computing cluster, virtual machines or networked containers in the same hardware environment, etc.).

[0052] In some embodiments, the continuous vectorization server 102 may be communicatively coupled to one or more client devices 110, which may provide the continuous vectorization server 102 with new data streams, or request particular facets 116 and / or vectors 114 that have been stored in the vector database 104. In the context of the present description and the claims that follow, a client device 110 is any computing device able to interface with the continuous vectorization server 102 such that at least part of the server's functionality is made available. Examples range from a massive data center streaming large volumes of data to be stored in the vector database 104, down to a mobile device being used to perform a semantic search across a facet 116 using a web interface provided by the continuous vectorization server 102 over the network 112.

[0053] Those skilled in the art will recognize that the continuous vectorization server 102, the vector database 104, and the client device 110 may all be implemented in a wide variety of hardware environments. The specific environments depicted in the Figures and discussed herein are non-limiting examples.

[0054] According to various embodiments, the vector database 104 comprises a plurality of facets 116, each facet 116 describing (e.g., pointing to, identifying, etc.) at least one vector 114 that is associated with the facet 116. The vector database 104 may also comprise other elements such as a facet index 128 and / or vector index 126, as is known in the art. In some embodiments, the vector database 104 may be a conventional vector database 104, operating in a conventional way, unaware that it is being updated and maintained using the continuous vectorization methods contemplated herein. This means already existing vector storage can easily be adapted to employ the contemplated continuous vectorization method, and reap the associated benefits, without requiring substantial change. As a specific, non-limiting example, in one embodiment the continuous vectorization server 102 may utilize a cloud-based vector database 104 that is entirely unaware it is being updated in such a novel manner.

[0055] In other embodiments, the vector database 104 may differ from a conventional vector storage solution, having been modified to more deeply incorporate and implement continuous vectorization. For example, in some embodiments the continuous vectorization server 102 may be wrapped around a vector database 104 in such a way that the updating of facets 116 occurs in a seamless manner, such that the continuous vectorization server 102 acts as a conventional vector database from the point of view of a client device 110, but is actually the continuous vectorization server 102 blended with, and modifying, a vector database 104. As a specific, non-limiting example, in one embodiment, the continuous vectorization server 102 may present as a vector storage service that happens to provide a vector database that is fast and inexpensive because it is implementing continuous vectorization.

[0056] It should be noted that although much of the discussion of the vector database 104 will be done in the context of facets 116 and the vectors 114 that make them up, the vector database 104 does not have to be exclusively used for continuous vectorization. Any vectors 114 stored and updated via continuous vectorization will be interpolated, as will be discussed below. However, in some embodiments, those vectors 114 and facets 116 may be stored alongside non-CV vectors 114 and facets 116, in the same vector database 104. Continuous vectorization provides many benefits due to how it optimizes the updating of vectors. The initial storage and subsequent retrieval are not affected by the use of CV according to various embodiments; conventional vectors 114 and facets 116 may be mixed in with no ill effects.

[0057] The vector database 104 comprises a plurality of facets 116, each facet 116 describing at least one vector 114 associated with the facet 116. In the context of the present description and the claims that follow, a facet 116 is an organizational unit that encompasses multiple related vectors 114 based on some shared aspect. A facet 116 can serve as a category or filter for grouping vectors 114. Facets 116 allow for more efficient processing and retrieval of a collection of vectors 114, as is known in the art.

[0058] The vectors 114 identified or pointed to by a facet 116 are grouped together based on a shared aspect of some form. Often, these vectors 114 are being grouped together based on their intended use or use context. However, the intended use or use context of a vector is not always immediately discernable, so a more expansive definition may be that the vectors 114 belonging to a facet 116 are associated with that facet 116 on the basis of at least one of a value 122 and an attribute 124 reflected by that vector 114.

[0059] In the context of the present description and the claims that follow, a value 122 reflected by a vector 114 is some part of the actual data represented by the vector 114. For example, in a case where a facet 116 is defined to include all social media posts and comments from a particular account, the vectors 114 belonging to that facet 116 are gathered based on the value 122 "username".

[0060] Likewise, in the context of the present description and the claims that follow, an attribute 124 reflected by a vector 114 is a piece of information describing the vector 114 itself, or its data before being vectorized by an embedding model 118. Examples include, but are not limited to, a data type (e.g., image, text, etc.), a content type (e.g., social media post, social media comment, etc.), metadata (e.g., identifier, timestamp, tag, source information, etc.), a size / length (e.g., number of characters, number of words, number of paragraphs, average sentence length, etc.), and the like.

[0061] Furthermore, in the context of the present description and the claims that follow, a value 122 or attribute 124 being "reflected" by a vector 114 means that the value 122 or attribute 124 of interest can be extracted from said vector 114 or from the source material used to create said vector 114 (e.g., raw data, metadata, provenance, etc.). In some cases, it is part of the vector 114 as it exists within the vector database 104, like metadata. In other cases, it was part of the raw data that was vectorized and thus converted into the array of numbers that make up the vector 114. Those skilled in the art will recognize that there are other bases for grouping vectors into a facet 116.

[0062] There are a few differences between the facets 116 of a continuous vectorization system 100, and the facets 116 of a conventional vector storage solution. Conventional facets can be associated with hundreds, thousands, or even tens of thousands of vectors 114. The facets 116 of a continuous vectorization system 100 (meaning the facets 116 whose updates are handled using the methods contemplated herein), in contrast, will have a single vector 114, or perhaps a few vectors 114. This is not due to a limitation of continuous vectorization; the facets 116 of a continuous vectorization system 100 are capable of having just as many vectors 114 as the facets 116 of a conventional vector storage system. However, the CV facets 116 typically only require one or just a handful of vectors 114 to accomplish the facet's intended purpose. This will be illustrated below as part of a discussion of the computational, storage, and performance advantages provided by the contemplated continuous vectorization system 100 over conventional vector storage solutions.

[0063] How a facet 116 is defined and which criteria are used to identify associated vectors 114 is highly dependent on the specific use case. According to various embodiments, a facet 116 functions like a filter used to separate out a subset of vectors 114 from the larger set. The better defined the filtering criteria, the less extraneous information or "noise" will be gathered, leading to more accurate and timely results.

[0064] In some embodiments, the definition of a new facet 116 may begin with pre-filtering data to ensure that only relevant information is processed, thereby avoiding the inefficiencies associated with indiscriminate data inclusion. This pre-filtering process requires a clear understanding of the specific criteria or attributes that are significant to the intended searches, as these criteria form the basis for creating facets 116. Although facets 116 can be added later, doing so may require re-vectorization, which can be resource-intensive.

[0065] It should be noted that transitioning a conventional facet, where each data point is a separate vector 114, into a continuous vectorization facet 116, is simply a matter of performing a linear interpolation of those vectors 114, combining them into a single vector 114, for example. The computational cost to perform such a transition would be small, as linear interpolation can be an inexpensive operation, as are some (but not all) weighting functions.
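As a non-limiting illustration of such a transition, the following Python sketch folds the vectors of a conventional facet into a single facet vector by repeated weighted interpolation. The function names (`collapse_facet`, `equal_influence`) and the default weighting, which gives every merged vector the same influence, are assumptions made for this sketch; any other weighting function could be passed in its place.

```python
from typing import Callable, Sequence


def equal_influence(n_merged: int) -> float:
    """Assumed weighting: the next vector counts as much as each earlier one."""
    return 1.0 / (n_merged + 1)


def collapse_facet(
    vectors: Sequence[Sequence[float]],
    weight_for: Callable[[int], float] = equal_influence,
) -> list[float]:
    """Fold a facet's vectors into one by repeated weighted interpolation."""
    if not vectors:
        raise ValueError("a facet must hold at least one vector")
    merged = list(vectors[0])
    for n_merged, vector in enumerate(vectors[1:], start=1):
        w = weight_for(n_merged)  # influence of the vector being folded in
        merged = [(1.0 - w) * m + w * x for m, x in zip(merged, vector)]
    return merged


# Three per-post vectors of a conventional facet become one facet vector.
conventional_facet = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
cv_facet_vector = collapse_facet(conventional_facet)
```

With the assumed equal-influence weighting, the result is (up to floating-point rounding) the arithmetic mean of the original vectors, and the cost is one pass of multiplications and additions per vector.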

[0066] According to various embodiments, the vectors 114 of a continuous vectorization system 100 are no different from the vectors 114 found in any conventional vector database 104. The vectors 114 contain data (i.e., the array of numerical elements that are output by an embedding model) that reflect a value 122 (i.e., the raw data that was fed into the embedding model). As is known in the art, the vectors 114 may also each comprise metadata, or attributes 124. Examples include, but are not limited to, identifier / index, timestamp, label, category, tags, description, user ID, source, data type, creation date, update date, update iteration, score, and the like.

[0067] An embedding model 118 is a machine learning model that transforms high-dimensional data into a lower-dimensional vector space, preserving the semantic relationships between data points. It is utilized to convert various types of data, such as text, images, or audio, into numerical representations that can be efficiently processed by algorithms for tasks such as clustering, classification, and similarity search. As is known in the art, this is particularly valuable in applications requiring natural language processing, recommendation systems, and image recognition, where it enhances the ability to analyze and interpret complex data by mapping it into a continuous vector space.

[0068] In some embodiments, new data (e.g., a data update, a data stream, etc.) is provided to the continuous vectorization system 100 in a vector 114 format. In other embodiments, new data may be provided to the continuous vectorization system 100 in its raw form. In the context of the present description and the claims that follow, raw data is data in a form that is readily usable by a human, such as text, images, video, or sound. When provided with raw data, the continuous vectorization server 102 will vectorize it using an embedding model 118 that is available to the continuous vectorization system 100 (e.g., stored locally and executable by the server 102, available on a cloud computing platform, accessible through an API, etc.). Just as the vectors 114 of a continuous vectorization system 100 are the same as the vectors 114 of a conventional system, the continuous vectorization system 100 contemplated herein is agnostic to what embedding model 118 is used to create those vectors 114, so long as they are in a format that the continuous vectorization server 102 and / or the vector database 104 is configured to handle.

[0069] It should be noted that while the term "continuous vectorization" is used to describe the system, method, and server contemplated herein, it is not meant to be taken literally. The term "continuous" isn't referring to a non-stop process of vectorization so much as it is referring to vectorization that does not backtrack. In a conventional system, data will be vectorized and re-vectorized with every update (or the system will proliferate a large number of near-identical vectors 114). In the contemplated system, the incoming data is vectorized once and then used to update the facet vector(s) through linear interpolation. The term "continuous vectorization" should be treated as a general description, and not as a strict limitation. As mentioned above, these vectors 114 that are updated through linear interpolation may also be referred to as green vectors, due to their resource efficiency and reduced environmental impact.

[0070] The contemplated system 100 may be implemented in a number of different ways, and may be presented to the end user in a number of different forms. FIG. 1A shows a continuous vectorization system 100 providing "storage-as-a-service" through a network 112, with users interacting with the system 100 in a manner similar to that of a conventional vector storage service. FIG. 1B shows a continuous vectorization system 100 implemented as an in-house solution for a user, while coupled to a cloud-based vector database 104.

[0071] In some embodiments, the continuous vectorization system 100 may be used to provide inexpensive and fast vector storage as a service. Users may be able to send new data 206 to the continuous vectorization server 102 as though it were a conventional vector storage service. However, as previously discussed, the definition of facets 116 is highly dependent on the specific use case. According to various embodiments, the continuous vectorization system 100 may provide an interface (e.g., web portal, app, API interface, etc.) to assist a user in defining the search conditions for a facet 116 that will then be updated using the methods contemplated herein. This may appear largely similar to the user experience provided by conventional systems, except that the continuous vectorization system 100 can provide the user a way to specify aspects of the weighting function 120 to be used in the interpolation, a feature that does not appear in conventional vector storage systems. However, in other embodiments, the continuous vectorization system 100 may appear indistinguishable from a conventional vector database 104 to an end user, with details such as type and parameters of the weighting function 120 being hidden from the end user.

[0072] According to various embodiments, the continuous vectorization server 102 is configured to receive new data 206 from a source, and then use that new data 206 to update the appropriate vectors 114 and facets 116 in the vector database 104. In some embodiments, including the non-limiting example shown in FIG. 1A, the source of the new data 206 may be a client device 110 communicatively coupled to the continuous vectorization server 102 through a network 112. In other embodiments, the source of the new data 206 may be local to the continuous vectorization server 102 (e.g., obtained from an internal network, pulled from a database local to the server 102, etc.). See, for example, the embodiment shown in FIG. 1B.

[0073] Additionally, in some embodiments, the new data 206 may be provided to the system 100 in vectorized form, as shown in FIG. 1A. In other embodiments, the system 100 may begin with new data 206 that is raw data 208 that the continuous vectorization server 102 must first vectorize with an embedding model 118 before using the new data 206 (in vectorized form) to update the vector database 104. In some embodiments, the new data 206 may be accompanied by a weight or information to be fed to the weighting function 120. In other embodiments, the weighting function 120 may operate automatically, having already been parameterized during the onboarding of the user.

[0074] As shown in FIGs. 1A and 1B, in some embodiments the system 100 may further comprise a term statistics database 130 communicatively coupled to the continuous vectorization server 102. In the context of the present description and the claims that follow, a “term statistics database” 130 is a database, datastore, or other non-transitory digital storage object that stores statistical information about the content of the vector database 104 in its raw form (for example, before vectorization, or as text tokens and derived linguistic features). In some embodiments, the term statistics database 130 stores statistics regarding which terms are commonly found in different facets, which terms are more unique, and how frequently particular terms occur within particular facets and across the corpus as a whole. Such statistics can be maintained at one or more granularities, including at least one of document-level statistics, chunk-level statistics, facet-level statistics, and corpus-level statistics.

[0075] In some embodiments, the term statistics database 130 is used to implement an adaptive weighting system. For example, when new content is received for ingestion, the continuous vectorization server 102 may preprocess the new content to derive a term profile, may update corpus statistics within the term statistics database 130, and may compute one or more relevance scores that quantify lexical and statistical alignment between the new content and one or more facets. In such embodiments, the computed relevance scores may be used to generate a weight value that controls an interpolation operation used to update facet vectors. This adaptive weighting system is discussed in greater detail below (for example, with respect to FIG. 5).
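Purely as a non-limiting sketch of the flow just described (term profile, corpus statistics update, relevance score, weight), the Python below uses a deliberately simple scoring rule. The `TermStats` class, the overlap-based `relevance` score, and the `w_min` / `w_max` bounds are all assumptions invented for this illustration; they are not the scoring or weighting formulas of the disclosure.

```python
from collections import Counter


class TermStats:
    """Toy stand-in for a term statistics store (corpus- and facet-level counts)."""

    def __init__(self) -> None:
        self.corpus: Counter = Counter()
        self.facets: dict[str, Counter] = {}

    def add_to_corpus(self, profile: Counter) -> None:
        self.corpus.update(profile)

    def add_to_facet(self, facet: str, profile: Counter) -> None:
        self.facets.setdefault(facet, Counter()).update(profile)


def term_profile(text: str) -> Counter:
    """Whitespace tokenization only; a real system would do far more."""
    return Counter(text.lower().split())


def relevance(stats: TermStats, facet: str, profile: Counter) -> float:
    """Share of the new content's term occurrences already seen in the facet."""
    total = sum(profile.values())
    if total == 0:
        return 0.0
    seen = stats.facets.get(facet, Counter())
    return sum(n for term, n in profile.items() if term in seen) / total


def weight_from_relevance(score: float, w_min: float = 0.05, w_max: float = 0.5) -> float:
    """Map a score in [0, 1] to a weight that stays strictly inside (0, 1)."""
    return w_min + (w_max - w_min) * score


stats = TermStats()
seed = term_profile("espresso beans roast")
stats.add_to_corpus(seed)
stats.add_to_facet("coffee", seed)

profile = term_profile("dark roast espresso")   # derive the term profile
stats.add_to_corpus(profile)                    # update corpus statistics
score = relevance(stats, "coffee", profile)     # 2 of 3 terms are known to the facet
w = weight_from_relevance(score)                # weight that would drive interpolation
stats.add_to_facet("coffee", profile)           # fold terms into the facet after scoring
```

In this sketch the facet-level counts are updated only after scoring, so that new content is compared against what the facet held before its arrival; that ordering is a design choice of the sketch.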

[0076] It should be noted that, while the term statistics database 130 is depicted as a separate device directly connected to the continuous vectorization server 102 in the non-limiting examples shown in FIGs. 1A and 1B, in other embodiments the term statistics database 130 may be connected through a wide area network (WAN), such as a cloud service accessible over the Internet. In still other embodiments, the term statistics database 130 may be stored within the continuous vectorization server 102, may be co-located with the vector database 104 on a common computing device, or may be distributed across multiple devices and storage locations. Those skilled in the art will recognize that other configurations may be used to make the term statistics database 130 available to the continuous vectorization server 102, without departing from the scope of the present disclosure.

[0077] FIGs. 2A and 2B are process views of a non-limiting example of the application of the systems 100 of FIGs. 1A and 1B, respectively, to the updating of a vector database 104. First, the system 100 receives new data 206. See circle '1' of FIGs. 2A and 2B. In some embodiments, the new data 206 may be received by the continuous vectorization server 102 in a vectorized form (i.e., as the update vector 210), as shown in FIG. 2A. In other embodiments, the continuous vectorization system 100 may be provided with new data 206 that is raw data 208 (e.g., text, images, sound, etc.) which will need to be vectorized. In some embodiments, the continuous vectorization server 102 is configured to generate the update vector 210 by vectorizing the new data 206 with an embedding model 118. See circle 'A' of FIG. 2B.

[0078] Next, the continuous vectorization server 102 identifies a target facet 202 within the vector database 104 that the new data 206 would be associated with. See circle '2' of FIGs. 2A and 2B. According to various embodiments, the target facet 202 may be identified using a value 122 of the new data 206 and / or an attribute 124 (e.g., metadata, category, tag, etc.) of the new data 206.

[0079] Once the target facet 202 is identified, a target facet vector 204 belonging to the target facet 202 is identified using the new data 206 (e.g., using the value 122 and / or an attribute 124 of the new data 206, etc.). See circle '3' of FIGs. 2A and 2B. It should be noted that while the following example will only include one facet 116 having one vector 114, in use there may be multiple facets 116 that would be affected by the new data 206, and some or all of them may have more than one vector 114 that should be updated with the new data 206.
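The identification of target facets and target facet vectors may be sketched, in a non-limiting way, as a filter over facet definitions. In the Python below, the `NewData` and `Facet` classes, the `accepts` predicate, and the convention of keying each facet vector by content type are hypothetical constructs chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class NewData:
    value: dict        # part of the data itself, e.g. {"username": "alice"}
    attributes: dict   # information about the data, e.g. {"content_type": "post"}


@dataclass
class Facet:
    name: str
    accepts: Callable[[NewData], bool]  # facet membership test
    vectors: dict[str, list[float]] = field(default_factory=dict)


def find_targets(facets: list[Facet], data: NewData) -> list[tuple[str, str]]:
    """Return (facet name, vector key) pairs that the new data should update."""
    targets = []
    for facet in facets:
        if not facet.accepts(data):
            continue
        # Hypothetical convention: one facet vector per content type.
        key = data.attributes.get("content_type", "default")
        if key in facet.vectors:
            targets.append((facet.name, key))
    return targets


facets = [
    Facet("user:alice", lambda d: d.value.get("username") == "alice",
          {"post": [0.1, 0.9], "comment": [0.4, 0.6]}),
    Facet("user:bob", lambda d: d.value.get("username") == "bob",
          {"post": [0.7, 0.3]}),
    Facet("all-comments", lambda d: d.attributes.get("content_type") == "comment",
          {"comment": [0.5, 0.5]}),
]

incoming = NewData(value={"username": "alice", "text": "nice photo"},
                   attributes={"content_type": "comment"})
targets = find_targets(facets, incoming)
```

Here one facet is selected by a value (the username) and another by an attribute (the content type), so a single piece of new data yields two target facet vectors.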

[0080] After the target facet vector 204 is identified, it is retrieved from the vector database 104. See circle '4' of FIGs. 2A and 2B. Before the update vector 210 (i.e., the vectorized form of the new data 206) and the target facet vector 204 can be interpolated, they need to be weighted. According to various embodiments, a weight 200 is generated by applying the weighting function 120 to at least a part of the target facet 202, the target facet vector 204, the new data 206 in a raw form (i.e., human-readable), and / or the update vector 210. See circle '5' of FIGs. 2A and 2B. The weighting function 120 will be discussed in greater detail, below.

[0081] According to various embodiments, the weight 200 that is generated and the manner in which it is applied to the update vector 210 and the target facet vector 204 may take into account any previous updates (i.e., weighted linear interpolations) that this target facet vector 204 has been through. In some embodiments, the weight 200 may be used as a "mixing coefficient" or "contribution factor" that describes how much influence the update vector 210 has on the resulting updated facet vector 212. With the weight 200 w less than 1 and greater than 0, these vectors may be weighted by multiplying the update vector 210 by w and multiplying the target facet vector 204 by one minus the weight 200 w. The resulting weighted vectors are then summed, completing the linear interpolation that creates an updated facet vector 212. See circle '6' of FIGs. 2A and 2B.
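The weighted linear interpolation described above amounts to computing (1 - w) times the target facet vector plus w times the update vector, element by element. A minimal, non-limiting Python sketch follows; the function name `interpolate` and the example values are assumptions chosen for illustration.

```python
from typing import Sequence


def interpolate(target: Sequence[float], update: Sequence[float], w: float) -> list[float]:
    """Return (1 - w) * target + w * update, computed element by element."""
    if not 0.0 < w < 1.0:
        raise ValueError("weight w must lie strictly between 0 and 1")
    if len(target) != len(update):
        raise ValueError("vectors must have the same dimension")
    return [(1.0 - w) * t + w * u for t, u in zip(target, update)]


target_facet_vector = [1.0, 0.0, 2.0]
update_vector = [0.0, 1.0, 4.0]
updated_facet_vector = interpolate(target_facet_vector, update_vector, w=0.25)
# updated_facet_vector is [0.75, 0.25, 2.5]
```

A small w keeps the updated facet vector close to the target facet vector, while a w near 1 lets the update vector dominate; the output has the same dimension as its inputs, so storing it requires no more space than the vector it replaces.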

[0082] The use of a linear interpolation to update the facet vectors is advantageous, because it maintains semantic meaning of the facet vector without polluting the search space with noise. The interpolation operation itself is significantly less computationally expensive than vectorization, providing a quick, computationally easy way to update a vector without the usual downside of increased storage usage.

[0083] Next, the result of the linear interpolation, the updated facet vector 212, is stored in the vector database 104. See circle '7' of FIGs. 2A and 2B. In some embodiments, the target facet vector 204 may be overwritten by the updated facet vector 212, accomplishing an update without using up any additional storage space. In other embodiments, the target facet vector 204 may be replaced by the updated facet vector 212, but the previous facet vector may be retained to preserve a record of how the vector has evolved over time.
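The two storage behaviors described above (overwriting in place, or replacing while retaining prior versions) can be sketched as follows. The `FacetVectorStore` class and its `keep_history` flag are hypothetical and stand in for whatever persistence mechanism the vector database actually provides.

```python
class FacetVectorStore:
    """Toy store: overwrite facet vectors in place, optionally keeping old versions."""

    def __init__(self, keep_history: bool = False) -> None:
        self.keep_history = keep_history
        self.current: dict[str, list[float]] = {}
        self.history: dict[str, list[list[float]]] = {}

    def store(self, facet_id: str, updated: list[float]) -> None:
        previous = self.current.get(facet_id)
        if self.keep_history and previous is not None:
            # Retain the replaced vector as a record of how the facet evolved.
            self.history.setdefault(facet_id, []).append(previous)
        self.current[facet_id] = updated  # the active facet vector is replaced


overwriting = FacetVectorStore()
overwriting.store("user:alice", [1.0, 0.0])
overwriting.store("user:alice", [0.8, 0.2])

versioned = FacetVectorStore(keep_history=True)
versioned.store("user:alice", [1.0, 0.0])
versioned.store("user:alice", [0.8, 0.2])
```

In the overwriting store the footprint stays at one vector per facet, whereas the versioned store grows by one vector per update in exchange for an audit trail.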

[0084] Finally, after the updated facet vector 212 has been stored, the vector database 104 may update the vector index 126 and / or the facet index 128 to reflect the update. See circle '8' of FIGs. 2A and 2B.
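The sequence of steps described above may be summarized in a single non-limiting Python sketch. Everything named here is an assumption made for illustration: `InMemoryVectorDB` is a toy stand-in for the vector database 104, `toy_embed` is a placeholder that is not a real embedding model, and the facet is located by a simple username key.

```python
from typing import Callable, Optional

Vector = list[float]


class InMemoryVectorDB:
    """Toy stand-in for the vector database: facet name -> one facet vector."""

    def __init__(self) -> None:
        self.facet_vectors: dict[str, Vector] = {}
        self.index_refreshes = 0  # stand-in for vector / facet index maintenance

    def get(self, facet: str) -> Vector:
        return self.facet_vectors[facet]

    def put(self, facet: str, vector: Vector) -> None:
        self.facet_vectors[facet] = vector  # overwrite in place

    def reindex(self) -> None:
        self.index_refreshes += 1


def toy_embed(text: str) -> Vector:
    """Placeholder for an embedding model; NOT a meaningful embedding."""
    return [float(len(text)), float(len(text.split()))]


def update_facet(
    db: InMemoryVectorDB,
    username: str,
    weighting_fn: Callable[[Vector, Vector], float],
    raw_text: Optional[str] = None,
    update_vector: Optional[Vector] = None,
) -> Vector:
    # Step 1 (and A): receive new data; vectorize it only if it arrived raw.
    if update_vector is None:
        if raw_text is None:
            raise ValueError("provide raw_text or update_vector")
        update_vector = toy_embed(raw_text)
    # Steps 2-3: identify the target facet and its vector from a value.
    facet = f"user:{username}"
    # Step 4: retrieve the target facet vector.
    target = db.get(facet)
    # Step 5: generate the weight.
    w = weighting_fn(target, update_vector)
    # Step 6: weighted linear interpolation.
    updated = [(1.0 - w) * t + w * u for t, u in zip(target, update_vector)]
    # Step 7: store the updated facet vector over the old one.
    db.put(facet, updated)
    # Step 8: refresh the index.
    db.reindex()
    return updated


db = InMemoryVectorDB()
db.put("user:alice", [10.0, 2.0])
result = update_facet(db, "alice", lambda target, update: 0.5, raw_text="hello world")
```

After the call, the database still holds exactly one vector for the facet; the update vector produced in step 1 is never stored.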

[0085] In some embodiments, the update vector 210 may be discarded after the weighted interpolation has been performed. However, in other embodiments, the update vector 210 may also be stored in the vector database 104. This is what is done in conventional vector databases 104 that follow the "vectorize everything" approach.

[0086] Because continuous vectorization is so cheap, adding it on top of traditional vector storage may be beneficial if there is a need for categorizing partitions of the data. Although such an arrangement would not provide any storage savings, the functionality of the database will be enhanced, with more precise and performant search.

[0087] The continuous vectorization system 100 and method contemplated improves upon the technology for updating a vector database in a number of tangible ways. This is better illustrated by examining a use case, and how the contemplated system 100 would perform against a conventional vector storage system.

[0088] As a specific, non-limiting example, consider the use case where vector storage is being used to capture and analyze the activities of a particular user of a social media site. A facet 116 is formed that is defined by the username, so that all posts made by that user will be represented by the vectors 114 of that facet 116. Over time, as that user makes posts on the site, their posts are sent to vector storage where they are associated with the facet 116 in anticipation of a subsequent use of the database (e.g., performing an engagement analysis for that user and a particular brand, a general sentiment analysis of their posts, a content trend analysis, etc.).

[0089] Using a conventional vector database for this exemplary use case could be accomplished with either of the two approaches previously discussed. The least burdensome approach, at least initially, would be to vectorize each new post as a separate vector 114 which is then stored in the vector database 104 and associated with the facet 116. Over time, as this user makes more and more posts, the number of vectors 114 in the facet 116 will balloon, slowing down searches and eating up storage space.

[0090] The other conventional approach is to combine all the posts of the user into one block of data, which is then vectorized into a single vector 114. In this scenario, the facet 116 only has one vector 114, resulting in quick searches and minimal storage use. However, the entire block of data will have to be re-vectorized with every new post from that user, and the computational costs will quickly become untenable.

[0091] In contrast, the continuous vectorization system 100 contemplated herein would take the best of both; the facet 116 would have a single vector 114 representing all of the user's posts. However, as new posts are made and their vectorized forms are sent to the system 100, they are combined with the existing vector 114 of the facet 116 through weighted linear interpolation. They are weighted such that every vector 114 that has been interpolated into the target facet vector 204 has an equal impact. The storage requirement does not increase, and the computational cost is barely more than what is required to vectorize a post. There is nothing preventing hundreds of additional vectors 114 from being associated with that facet 116, but it certainly is not necessary in the continuous vectorization solution.
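One weighting that yields the equal impact described above, offered here as an assumption rather than as the weighting function of the disclosure, is w = 1 / (n + 1), where n is the number of posts already reflected in the facet vector. With v_i denoting the vectorized posts and f_n the facet vector after n posts (symbols introduced for this illustration), a short induction shows that the facet vector is then the arithmetic mean of all posts:

```latex
\begin{align*}
f_1 &= v_1, \qquad
f_{n+1} = (1 - w_{n+1})\, f_n + w_{n+1}\, v_{n+1}, \qquad
w_{n+1} = \frac{1}{n+1} \\[4pt]
\text{Assume } f_n &= \frac{1}{n} \sum_{i=1}^{n} v_i . \text{ Then} \\[4pt]
f_{n+1} &= \frac{n}{n+1} \cdot \frac{1}{n} \sum_{i=1}^{n} v_i + \frac{1}{n+1}\, v_{n+1}
         = \frac{1}{n+1} \sum_{i=1}^{n+1} v_i .
\end{align*}
```

Only the running count n needs to be kept alongside the facet vector (for example, as an update-iteration attribute) for every post to carry a coefficient of 1 / (n + 1) after the update.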

[0092] Continuous vectorization overcomes the weaknesses of both conventional approaches discussed above, and is on par with their best attributes. According to various embodiments, continuous vectorization has low computational requirements because, assuming the new data 206 is provided in vector form, updating the facet 116 does not require any vectorization (or re-vectorization, in the case of a conventional system).

[0093] Additionally, continuous vectorization has low storage requirements because once an update vector 210 has been interpolated with the target facet vector 204 and the updated facet vector 212 has been stored in the place of the target facet vector 204, the update vector 210 is no longer needed by continuous vectorization and is removed from memory unless there is some other use for it. Thus, the update does not increase the storage needed. According to various embodiments, the storage requirements for a continuous vectorization system 100 grow only with the number of facets 116 being updated, not the total amount of data coming in.

[0094] The conventional approach of vectorizing every new piece of data has another downside, apart from a rapidly expanding storage requirement. When every piece of data is vectorized and stored, the search space can quickly become noisy. All vectors appear the same - no matter how important the data is or how off-topic it is, one vector will have the same importance as another in the search that is being performed. Additionally, in the use case where a facet is being updated with streaming data, the vectors formed from the streaming data may be very similar to each other.

[0095] Advantageously, continuous vectorization focuses on the facet 116. Using the facet 116 as a target allows the incoming data to be vetted before going much further in the update process. Through focusing only on relevant data, and by limiting the number of vectors 114 associated with the facet 116 to just one, or a few, the search space becomes clearer, yielding search results that are more accurate, but in less time than what would come from a conventional vector storage system.

[0096] Of course, there are exceptions. For example, re-vectorization may be needed in the continuous vectorization system 100 in cases where the intended end use has changed and the facets 116 are redefined or redirected. The storage savings provided by continuous vectorization may be lessened if iterations of the target facet vector 204 or the update vector 210 are preserved for versioning purposes. However, even with these exceptions, the continuous vectorization system 100 is still much more efficient and effective than conventional systems, according to various embodiments.

[0097] As mentioned above, vectors (i.e., the update vector 210 and the target facet vector 204) are weighted before they are blended through linear interpolation. According to various embodiments, the weights 200 used are produced by the continuous vectorization server 102 using a weighting function 120. According to various embodiments, the weighting function 120 depends on at least a part (e.g., a value 122, an attribute 124, etc.) of the target facet 202, the target facet vector 204. The weighting function 120 provides a degree of control over the behavior of the facet 116, allowing the facet's 116 operation to be fine-tuned to better accomplish its intended purpose.

[0098] There is a wide range of weighting functions that may be used in the contemplated continuous vectorization system 100. The following discussion will examine four examples, but it should be noted that other types of weighting function 120 exist, and the following discussion is for illustrative purposes, and not meant to limit the possibilities of what can be done.

[0099] In some embodiments, the weighting function 120 may be count-based, where the weighting function 120 will depend, at least in part, on a vector count of the target facet vector 204. In the context of the present description and the claims that follow, a vector count is the number of vectors 114 that have been combined through linear interpolation to yield the target facet vector 204. For example, a new vector 114 would have a vector count of 1. After the first update via interpolation, that vector count will be 2 (e.g., the original vector 114 and the update vector 210), and so forth.

[0100] According to various embodiments, the vector count may be used to create a weighting function 120 that considers the number of vectors 114 that have already been interpolated when determining what degree of influence a new update vector 210 will have on the target facet vector 204.

[0101] As a specific, non-limiting example, in one embodiment the weighting function 120 may be average-based, where the weight 200 is chosen to give all component vectors 114 of the target facet vector 204 the same level of influence. For example, when updating a target facet vector 204 that is the result of 998 interpolations, or 999 vectors total, the update vector 210 may be given the weight 200 of w = 0.001, and the target facet vector 204 given the weight of (1-w) = 0.999, such that in the resulting updated facet vector 212, each of the 1000 component vectors 114 has the same impact.
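
The average-based example above can be sketched in a few lines of Python. This is a minimal illustration, not the disclosed implementation: the function names `average_weight` and `interpolate`, the use of plain lists as vectors, and the sample post vectors are all assumptions made for the sketch. It relies on the vector count described above, so that a target facet vector built from n vectors is updated with w = 1 / (n + 1).

```python
from typing import List


def average_weight(vector_count: int) -> float:
    """Weight for the update vector so every component vector has equal influence.

    vector_count is the number of vectors already blended into the target
    facet vector (1 for a brand-new vector).
    """
    if vector_count < 1:
        raise ValueError("vector_count must be at least 1")
    return 1.0 / (vector_count + 1)


def interpolate(target: List[float], update: List[float], w: float) -> List[float]:
    """Weighted linear interpolation: (1 - w) * target + w * update."""
    if len(target) != len(update):
        raise ValueError("vectors must have the same dimension")
    return [(1.0 - w) * t + w * u for t, u in zip(target, update)]


# Hypothetical vectorized posts for one facet (two dimensions for readability).
posts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]

facet_vector, count = posts[0], 1
for update_vector in posts[1:]:
    w = average_weight(count)
    facet_vector = interpolate(facet_vector, update_vector, w)
    count += 1  # the facet vector now represents one more component vector

# facet_vector now equals the arithmetic mean of all four posts,
# although only one vector was ever stored for the facet.
```

With a vector count of 999, `average_weight` returns 0.001, matching the figures in the example above.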

[0102] In some embodiments, the weighting function 120 may be order-based, where the weighting function 120 will make use of a decay factor α, which is less than 1 and greater than 0, according to various embodiments. Weighting the update vector 210 with α and the target facet vector 204 with (1-α) results in the impact of a vector 114 on the updated facet vector 212 decreasing as more and more vectors 114 are stacked in front of it. The value of α will determine how quickly that decrease happens. In other embodiments, the reverse may be implemented in similar fashion, where older vectors 114 increase in weight, and each new vector 114 has less impact than the previous vector 114.
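
To make the order-based decay concrete, the following Python sketch tracks the effective share that each component vector holds in the facet vector after a series of updates with a constant decay factor (named `alpha` in the code). The function name `effective_impacts` is hypothetical and the sketch only illustrates the arithmetic of repeated interpolation.

```python
from typing import List


def effective_impacts(alpha: float, n_updates: int) -> List[float]:
    """Share of the facet vector owed to each component, oldest first.

    Each update multiplies every existing share by (1 - alpha) and gives
    the newest update vector a share of alpha.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    impacts = [1.0]  # before any update, the original vector is the whole facet vector
    for _ in range(n_updates):
        impacts = [(1.0 - alpha) * share for share in impacts]
        impacts.append(alpha)
    return impacts


shares = effective_impacts(alpha=0.5, n_updates=3)
# shares == [0.125, 0.125, 0.25, 0.5]: the newest vector dominates, and each
# older update holds (1 - alpha) times the share of the one after it.
```

The shares always sum to 1, and a larger decay factor makes older vectors fade faster.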

[0103] In some embodiments, the weighting function 120 may be time-based, which is similar to the order-based function, except the decay factor is a function of the elapsed time since the target facet vector 204 was last updated, instead of a constant. This can be used to make the impact of a vector 114 drop off as time goes by.
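
One possible shape for a time-based decay factor is sketched below in Python. The description does not specify the function, so the exponential half-life form, the parameter names, and the clamping bounds are all assumptions chosen for illustration. The longer the facet has gone without an update, the more weight the update vector receives and the less the existing target facet vector retains.

```python
def time_based_weight(
    elapsed_seconds: float,
    half_life_seconds: float = 86400.0,  # assumed: old content loses half its influence per day
    min_weight: float = 0.05,            # assumed floor so new data always has some effect
    max_weight: float = 0.95,            # assumed ceiling so old data is never fully erased
) -> float:
    """Weight for the update vector as a function of time since the last update."""
    if elapsed_seconds < 0 or half_life_seconds <= 0:
        raise ValueError("elapsed time must be >= 0 and half-life must be > 0")
    retained = 0.5 ** (elapsed_seconds / half_life_seconds)  # share kept by the old vector
    return min(max_weight, max(min_weight, 1.0 - retained))


w_after_one_day = time_based_weight(86400.0)    # 0.5 with the default half-life
w_immediately = time_based_weight(0.0)          # clamped up to min_weight
w_after_long_gap = time_based_weight(1.0e9)     # clamped down to max_weight
```

The returned value would be used as the weight w in the interpolation, with the target facet vector multiplied by (1 - w).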

[0104] In some embodiments, the weighting function 120 may be content-based, where the weighting function 120 will produce a weight 200 that takes into account at least one of a value 122 of the update vector 210 and an attribute 124 of the update vector 210, or new data 206 in a raw form (i.e., pre-vectorization form). Put differently, content-based weighting functions 120 are applied to value(s) and / or attribute(s) of the new data 206, either in its raw, human-readable form or in a vectorized form, to produce a weight 200, according to various embodiments. The application of this weighting function 120 is highly dependent on the specific use case being addressed. For example, in the case of analyzing the sentiment of the social media account of an individual, an update vector 210 that represents a new submission may be given a different weight 200 depending on whether it is an original post or a response to another user’s original post. As another specific example, a content-based weighting function 120 could be used to make the weight 200 depend on the number of views, likes, or shares a post received in the first day, for a facet 116 defined to help determine the level of engagement with a particular product. As yet another specific example, a content-based weighting function 120 could be used to make the weight 200 depend on the number of people following the author of a reply, for a facet defined to help estimate the exposure a particular post may currently have.
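
The first content-based example (original post versus reply) might look like the following Python sketch. The attribute name `is_reply`, the two weight values, and the sample vectors are hypothetical and are not taken from the description. The sketch shows only that the weight is derived from an attribute of the new data before the interpolation is performed.

```python
from typing import Dict, List


def content_based_weight(
    post: Dict[str, object],
    original_weight: float = 0.2,  # assumed weight for an original post
    reply_weight: float = 0.1,     # assumed weight for a reply to another user's post
) -> float:
    """Pick the update weight from an attribute of the raw new data."""
    return reply_weight if post.get("is_reply", False) else original_weight


def interpolate(target: List[float], update: List[float], w: float) -> List[float]:
    """Weighted linear interpolation: (1 - w) * target + w * update."""
    return [(1.0 - w) * t + w * u for t, u in zip(target, update)]


sentiment_facet = [0.0, 0.0]
new_post = {"is_reply": False, "vector": [1.0, 1.0]}  # hypothetical vectorized post

w = content_based_weight(new_post)
sentiment_facet = interpolate(sentiment_facet, new_post["vector"], w)
# An original post moves the facet vector twice as far as a reply would.
```

A weight based on views, likes, shares, or follower counts would follow the same pattern, with a different attribute feeding the function.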

[0105] In some embodiments, multiple weighting functions 120 may be used together as a "hybrid" weighting by applying different weighting functions to vectors belonging to the same facet 116. This may best be explained through a specific but non-limiting example.

[0106] According to cognitive science, the human mind tends to give "weight" on the basis of primacy and recency - people tend to assign both more memory and more credence to those things that happened first and those things that happened most recently, with the middle events having less emphasis. As a specific, non-limiting example, in one embodiment, a hybrid weighting based on the concepts of primacy and recency may be effected through the use of two vectors 114 within a facet 116, each the result of linear interpolations using different weighting functions 120. The first vector 114 would give greater weight to the newest data, and the second vector 114 would give greater weight to the oldest data. Any search within the CV database in this space would be more likely to hit the earliest or the latest data, with less emphasis given to what happened in between.

[0107] According to various embodiments, a hybrid weighting may be implemented through the use of multiple vectors within a facet, each implemented with a different weighting function 120. Each provides an opportunity for certain aspects of the data to "stand out" (e.g., the newest and oldest data in the example above).
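
The primacy and recency example can be sketched as a facet that holds two vectors, each updated with its own weighting function. In the Python sketch below, the class name `HybridFacet`, the constant recency weight, and the geometrically shrinking primacy weight are assumptions chosen for illustration. Any pair of functions that favor the newest and the oldest data, respectively, would serve the same purpose.

```python
from dataclasses import dataclass
from typing import List


def interpolate(target: List[float], update: List[float], w: float) -> List[float]:
    """Weighted linear interpolation: (1 - w) * target + w * update."""
    return [(1.0 - w) * t + w * u for t, u in zip(target, update)]


@dataclass
class HybridFacet:
    recency: List[float]  # vector that favors the newest data
    primacy: List[float]  # vector that favors the oldest data
    count: int = 1        # number of vectors blended into each facet vector

    def update(self, update_vector: List[float], alpha: float = 0.5) -> None:
        # Recency: constant weight, so older contributions decay with each update.
        self.recency = interpolate(self.recency, update_vector, alpha)
        # Primacy (assumed form): the weight shrinks as alpha ** count, so each
        # new vector has less impact than the one before it.
        self.primacy = interpolate(self.primacy, update_vector, alpha ** self.count)
        self.count += 1


# One-hot vectors make each component's share directly readable.
first, second, third = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
facet = HybridFacet(recency=list(first), primacy=list(first))
facet.update(second)
facet.update(third)
# facet.recency == [0.25, 0.25, 0.5]    (newest vector has the largest share)
# facet.primacy == [0.375, 0.375, 0.25] (newest vector has the smallest share)
```

A search over both vectors of the facet would then tend to match the earliest or the latest data, as described above.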

[0108] It should be noted that these examples of weighting functions 120 are not exhaustive. One of the advantages of the contemplated system and method is that different weights may be assigned for the linear interpolation on whatever basis desired. The use of a weighting function 120 allows the CV database to be designed to focus on just the important data (as defined by the user), resulting in significant advantages in terms of storage, performance, and accuracy, over conventional vector database technologies.

[0109] In some embodiments, the weighting function 120 may be simple, such as the average-based weighting function discussed above. In other embodiments, the weighting function 120 may be complex. For example, in one embodiment, the weighting function 120 may consider the content of what is being weighted using a large language model, and assign a weight 200 to the update vector 210 that is based upon a multi-step analysis performed by the LLM. Those skilled in the art will recognize that other weighting functions 120 may be used that can be tailored to the specific use case.

[0110] FIG. 3 is a flowchart of a non-limiting example of a method for updating a vector database through continuous vectorization. According to various embodiments, at step 300 new data 206 is received. In some embodiments, the new data 206 may be a vector 114 (i.e., the update vector 210). In other embodiments, the new data 206 may be raw data 208 that will be vectorized with an embedding model 118.

[0111] At step 302, a target facet 202 and target facet vector 204 are identified within the vector database 104. The target facet vector 204 belongs to, or is pointed at by, the target facet 202. According to various embodiments, the target facet 202 and target facet vector 204 are identified using at least one of a value 122 of the new data 206 and an attribute 124 of the new data 206.

[0112] At step 304, an updated facet vector 212 that reflects the new data 206 is generated by performing a weighted linear interpolation between the target facet vector 204 and an update vector 210.

[0113] The update vector 210 is the new data 206 in a vectorized form. According to various embodiments, before performing the linear interpolation the update vector 210 is multiplied by a weight 200 produced by a weighting function 120 and the target facet vector 204 is multiplied by one minus the weight 200.

[0114] Finally, at step 306, the updated facet vector 212 is stored within the vector database 104. In some embodiments, the target facet vector 204 is overwritten by the updated facet vector 212 within the vector database 104.
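
Steps 300 through 306 can be tied together in a short Python sketch. Everything named here is an assumption made for illustration: a plain dictionary stands in for the vector database 104, `toy_embed` stands in for the embedding model 118, the `user_id` attribute is used to identify the target facet, and a facet that does not yet exist is seeded with the update vector (a case the method description does not address).

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
FacetStore = Dict[str, Tuple[Vector, int]]  # facet id -> (facet vector, vector count)


def toy_embed(text: str) -> Vector:
    """Hypothetical stand-in for an embedding model: [letter count, space count]."""
    return [float(sum(ch.isalpha() for ch in text)), float(text.count(" "))]


def average_weight(vector_count: int) -> float:
    """Average-based weighting function: equal influence for every component."""
    return 1.0 / (vector_count + 1)


def continuous_update(
    store: FacetStore,
    new_data: dict,
    weighting_fn: Callable[[int], float] = average_weight,
    embed: Callable[[str], Vector] = toy_embed,
) -> Vector:
    # Step 300: receive new data; vectorize it if it arrived in raw form.
    update_vector = new_data["vector"] if "vector" in new_data else embed(new_data["text"])

    # Step 302: identify the target facet from an attribute of the new data.
    facet_id = new_data["user_id"]
    if facet_id not in store:
        store[facet_id] = (update_vector, 1)  # assumed behavior: seed a new facet
        return update_vector
    target_vector, count = store[facet_id]

    # Step 304: weighted linear interpolation, w * update + (1 - w) * target.
    w = weighting_fn(count)
    updated = [(1.0 - w) * t + w * u for t, u in zip(target_vector, update_vector)]

    # Step 306: store the updated facet vector, overwriting the target facet vector.
    store[facet_id] = (updated, count + 1)
    return updated


db: FacetStore = {}
continuous_update(db, {"user_id": "u1", "vector": [2.0, 0.0]})
result = continuous_update(db, {"user_id": "u1", "vector": [0.0, 2.0]})
# result == [1.0, 1.0]; the store still holds a single vector for facet "u1".
```

Swapping `weighting_fn` for an order-based, time-based, or content-based function changes the behavior of the facet without changing the rest of the flow.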

[0115] The following is a series of exemplary use cases for the contemplated system and method for continuous vectorization. These use cases are meant for illustrative purposes, demonstrating how CV improves upon the technology of vector databases in a number of significant ways. These examples are not presented as limitations, and should not be interpreted as the only way CV could be applied advantageously to a particular use case.

[0116] Additionally, continuous vectorization's ability to reduce a complicated collection of information to a single vector 114 in a facet 116 that is continuously updated is meant to illustrate a simple implementation, and is not meant to preclude the inclusion of additional vectors 114 in a facet 116, as has been discussed above.

[0117] As a specific, non-limiting example, continuous vectorization (CV) may be advantageously used in an advanced scientific literature search engine. Research institutions and universities increasingly rely on AI-powered systems to search and analyze the ever-growing body of scientific literature. These systems use vector embeddings to represent and process multidimensional data about research papers, including text, citations, and metadata. However, conventional vector storage systems face significant challenges in this domain. The need to store individual vector embeddings for each research paper and its components results in massive storage demands. Additionally, frequently re-vectorizing entire research corpora to accommodate newly published material is computationally expensive. Keeping search results relevant in real-time as the scientific landscape evolves rapidly is another hurdle. Furthermore, connecting research across various disciplines is a crucial yet difficult task with traditional search methods, as these methods often fail to identify cross-disciplinary connections effectively.

[0118] The application of continuous vectorization technology significantly improves the efficiency and effectiveness of scientific literature search engines. By maintaining a single, continuously updated vector 114 per research topic and author profile, CV substantially reduces storage requirements. This reduction allows for the retention of a more comprehensive body of literature without needing to invest in additional infrastructure.

[0119] CV eliminates the need for batch updates when new papers are published. Instead, it uses weighted linear interpolation to dynamically update existing vectors 114. This enables search results to immediately reflect the latest research trends and findings, providing users with near-instantaneous access to the most relevant information.

[0120] Another major improvement is CV’s ability to enhance semantic understanding. Continuous updates allow the system to better represent evolving scientific concepts and relationships. This results in more accurate and nuanced search outcomes, capturing new developments in scientific terminology and theories.

[0121] CV also excels at fostering cross-disciplinary discovery. By efficiently updating topic vectors 114, it improves the representation of emerging interdisciplinary connections. This makes it easier to surface research from adjacent fields, potentially leading to breakthrough discoveries at the intersection of disciplines.

[0122] In addition to this, CV can continuously update user profile vectors 114 based on researchers' evolving interests, ensuring that paper recommendations are relevant and timely. This personalized approach helps researchers stay up-to-date with minimal effort on their part.

[0123] These advancements are powered by several key components of continuous vectorization. A weighting function 120 tailored to balance new publications against established literature ensures that high-impact papers influence topic vector representations appropriately. Facets may be defined by topics, methodologies, and author networks, which enable targeted updates and more precise searches. Linear interpolation allows the smooth integration of new research into existing vectors 114, preserving the continuity of scientific knowledge while incorporating cutting-edge findings. The continuous updating mechanism ensures that vector embeddings reflect the latest state of scientific knowledge, keeping search indices and recommendation systems up to date.

[0124] This application of CV could significantly advance scientific discovery and collaboration. It addresses critical challenges in managing and searching large-scale, rapidly evolving scientific literature, making it highly relevant to research institutions and universities, scientific publishers and aggregators, AI companies focusing on knowledge management, as well as pharmaceutical and biotech companies for drug discovery research.

[0125] Continuous vectorization enables more efficient, accurate, and timely literature searches, which could accelerate the pace of discovery and encourage greater cross-disciplinary collaboration. By improving the identification of emerging trends and reducing the latency in updating search indices, CV could facilitate faster scientific breakthroughs and more effective knowledge sharing. Its reduced storage and computational requirements also lower operational costs and increase the scalability of AI systems to manage growing volumes of scientific data.

[0126] This contemplated CV-powered scientific literature search engine has the potential to become an essential tool for researchers, universities, and other scientific organizations, offering speed, accuracy, and insight in navigating the expanding sea of scientific knowledge.

[0127] As another specific, non-limiting example, continuous vectorization may be advantageously used in an e-commerce product discovery platform. E-commerce platforms use AI-powered systems to assist customers in discovering products from their vast and constantly changing inventories. These systems use vector embeddings to represent and process multidimensional data about products, user preferences, and shopping behaviors. However, traditional vector storage systems face significant problems. Storing individual vector embeddings for each product and user interaction quickly results in massive storage requirements. Additionally, re-vectorizing product catalogs to reflect inventory changes and emerging trends is computationally expensive. Another major issue is real-time personalization: user preferences can shift rapidly, requiring frequent updates to search and recommendation models. Lastly, traditional methods often struggle to help users discover long-tail products that match specific needs, particularly niche items.

[0128] Continuous vectorization technology addresses these problems by significantly enhancing the efficiency and effectiveness of product discovery on e-commerce platforms. One key advantage of CV is its ability to maintain a single, continuously updated vector 114 per product category and user profile. This approach drastically reduces storage requirements, allowing for more comprehensive product and user data retention without increasing infrastructure costs.

[0129] Rather than re-vectorizing entire catalogs when product details or popularity shift, CV uses weighted linear interpolation to dynamically update vectors 114. This method ensures that search results are almost instantly reflective of the latest product trends, availability, and customer preferences, eliminating the need for batch updates. Real-time personalization is another strength of CV, as user profile vectors 114 are continuously updated based on browsing and purchasing behavior. This results in more accurate and timely personalized search results that can adapt quickly to shifting user preferences.

[0130] CV also excels at enhancing the discovery of long-tail products. The efficient updating of product vectors 114 allows for a better representation of niche items and their relationships to broader categories. This increases the visibility of lesser-known products in search results, helping users find items that match their specific needs more effectively. Furthermore, CV enables trend-aware search rankings by continuously updating product vectors 114 to capture emerging market trends and seasonal patterns, improving the overall discovery experience and potentially boosting sales.

[0131] CV's efficient updating mechanism also supports scalability. As product catalogs and user bases expand, CV's ability to update vectors 114 dynamically without the need for large-scale batch processing allows for continued system growth without a corresponding increase in computational resources. This scalability ensures that even as platforms grow, recommendation quality and response times remain high.

[0132] Several core components of CV power these advancements. The weighting function 120 can be tailored to balance recent user interactions with long-term preferences, such as giving higher weight 200 to recent purchases to reflect current interests while maintaining stability in overall preferences. Facets may be defined based on product categories, attributes, and user interaction patterns, allowing for targeted updates and refined searches. Linear interpolation ensures smooth integration of new product information and user interactions into existing vectors 114, preserving the continuity of product relationships and user preferences while incorporating new data. The continuous updating mechanism keeps vector embeddings up to date with the latest state of the product catalog and user behavior, which is critical for maintaining relevant search indices and recommendation systems.

[0133] The impact of CV on the e-commerce industry could be significant. It enables more efficient, accurate, and personalized product search, which could improve customer satisfaction, increase conversion rates, and drive revenue growth. CV’s ability to handle vast amounts of product and user data more efficiently could lead to more engaging shopping experiences and better inventory management. By reducing operational costs through lower compute and storage requirements and increasing scalability to manage growing product catalogs and user bases, CV could offer businesses of all sizes a competitive edge in the e-commerce space.

[0134] This contemplated CV-powered e-commerce product discovery platform could become an essential tool for online retailers, offering a highly personalized and efficient shopping experience that rivals or exceeds the performance of e-commerce giants.

[0135] As yet another specific, non-limiting example, continuous vectorization may be advantageously used in a legal document search and analysis system. Law firms and legal departments manage vast repositories of legal documents, such as case law, contracts, and regulations, and are beginning to turn to AI-powered systems for efficient search and analysis. These systems rely on vector embeddings to process the multidimensional data within legal texts and their relationships. However, traditional vector storage systems present significant challenges. Storing individual vector embeddings for each legal document and its components leads to massive storage demands. Furthermore, re-vectorizing entire corpora to incorporate new laws, rulings, or interpretations is computationally expensive. Legal searches also require nuanced contextual understanding, which is difficult to achieve with conventional keyword-based systems. Finally, identifying relevant legal documents across different jurisdictions is a crucial yet complex task.

[0136] Continuous vectorization technology offers substantial improvements for legal document search and analysis. One of CV’s major advantages is its ability to maintain a single, continuously updated vector 114 per legal topic, jurisdiction, and document type. This approach dramatically reduces storage requirements, allowing legal systems to cover a more comprehensive range of legal documents without needing additional infrastructure.

[0137] CV reduces the need for computationally expensive batch updates by using weighted linear interpolation to dynamically update existing vectors 114 as new rulings or laws emerge. This allows search results to reflect the latest legal developments in near real-time, ensuring that legal professionals always have access to the most current and relevant information. Additionally, CV’s continuous updating mechanism enhances the system’s contextual understanding of legal concepts. This improvement results in more accurate search outcomes, capturing nuanced interpretations and applications of legal principles.

[0138] Another strength of CV is its ability to provide cross-jurisdictional insight. The efficient updating of jurisdiction-specific vectors 114 allows the system to better represent legal similarities and differences across regions. This makes it easier to surface relevant cases or regulations from different jurisdictions, supporting more comprehensive legal research. Additionally, CV can continuously track the evolving significance of legal precedents, helping legal professionals identify relevant precedents, including recent rulings that might affect case outcomes.

[0139] Several key components of CV enable these advancements. A tailored weighting function 120 can balance the importance of new rulings against established precedents, ensuring that landmark cases significantly influence legal concept vector representations. Legal facets 116 may be defined based on areas of law, jurisdictions, and document types, enabling more targeted searches and updates. Linear interpolation allows for the seamless integration of new legal developments into existing vector embeddings, preserving the continuity of legal knowledge while incorporating the latest rulings and interpretations. The continuous updating mechanism ensures that vector embeddings remain up to date with the current legal landscape.

[0140] CV could have a profound impact on the legal industry by enabling more efficient, accurate, and context-aware legal document search and analysis. It could improve the quality of legal research, enhance decision-making in cases, and ultimately contribute to more effective legal practice. CV’s ability to handle vast amounts of legal data more efficiently could also lead to the discovery of non-obvious legal connections and trends, potentially uncovering novel legal strategies or areas for policy reform.

[0141] This contemplated CV-powered legal document search and analysis system could become an indispensable tool for legal professionals, offering unparalleled speed, accuracy, and insight in navigating the complex and ever-changing legal landscape.

[0142] As another specific, non-limiting example, the contemplated continuous vectorization system may be used to enhance multimedia content searching for streaming platforms. Streaming platforms use AI-powered systems to help users discover relevant content from vast libraries of media such as videos, music, and podcasts. These systems rely on vector embeddings to represent and process multidimensional data about content, user preferences, and viewing / listening behaviors.

[0143] Approaching this task with conventional vector database technology presents a number of problems. Storing individual vector embeddings for each piece of content and user interaction leads to massive storage requirements. Additionally, frequent re-vectorization of content libraries to reflect new additions and changing popularity is computationally taxing. Also, effectively searching across different types of media (e.g., video, audio, text) is complex and often inaccurate. Providing personalized content recommendations for millions of users with diverse and changing tastes is challenging and will require significant storage and computing resources.

[0144] The continuous vectorization system and method contemplated herein addresses these challenges. Because continuous vectorization would be able to maintain a single, continuously updated vector 114 per content category and user profile, the storage requirements would be substantially reduced. This would also allow for more comprehensive content and user data retention without increased infrastructure costs.

[0145] Using continuous vectorization, re-vectorizing the entire library as content popularity or metadata changes is no longer necessary. Instead, weighted linear interpolation is used to update existing vectors 114 to reflect the changes. This means search results will reflect the latest content trends and additions almost instantaneously, without the need for batch updates.

[0146] Furthermore, CV's efficient updating mechanism allows for better representation of the relationships between different media types. This means the user can receive more accurate cross-media search results, enabling them to find relevant content regardless of media type.

[0147] Continuous vectorization makes incorporating new or updated information faster and more efficient. This means that user profile vectors 114 can be continuously updated based on viewing / listening behavior, resulting in more accurate and timely personalized search results that adapt quickly to changing user preferences across different media types.

[0148] Finally, since continuously updating is not an expensive endeavor with CV, having continuously updated content vectors 114 will capture emerging trends and seasonal patterns in viewing / listening habits. This makes it easier to provide search rankings that adapt dynamically to popular trends, improving content discovery and potentially increasing user engagement.

[0149] Continuous vectorization is able to provide these advantages through considered application of the weighting function 120 and definition of facets 116. The weighting function 120 can be tailored to balance the importance of recent user interactions against long-term preferences (e.g., recent views may be given higher weight 200 to capture current interests while maintaining overall taste profile stability, etc.). The facets 116 may be defined based on what information is of greatest use (e.g., genres, themes, creators, user interaction patterns, etc.). User facets 116 may be defined based on aspects of user behavior that have the greatest impact on their interactions / searches (e.g., viewing / listening history, preferences, demographic information, etc.). Defining facets 116 such as these enables targeted updates and searches across specific aspects of the content library and user base.

[0150] The linear interpolation at the core of continuous vectorization allows smooth integration of new content information and user interactions into existing vector embeddings. This preserves the continuity of content relationships and user preferences while still incorporating new data. This easy updating ensures vector embeddings always reflect the latest state of the content library and user behaviors. This can be critical for maintaining up-to-date search indices and recommendation systems in a dynamic streaming environment.

[0151] Continuous vectorization could significantly advance multimedia content discovery and personalization. It addresses critical challenges in managing and searching large-scale, diverse content libraries, making it highly relevant to streaming platforms (e.g., video, music, podcast, etc.), content production companies, AI companies focusing on recommendation systems, advertising networks targeting streaming audiences, and similar industries.

[0152] By enabling more efficient, accurate, and personalized content search, CV has the potential to improve user satisfaction, increase engagement time, and ultimately drive subscriber growth and retention. The system's ability to handle vast amounts of multi-modal content and user data more efficiently could lead to more engaging entertainment experiences and better content curation.

[0153] Not only can continuous vectorization do the job of a conventional vector database at lower storage / computational expense, but it can also do the job better. CV allows more relevant and personalized content search results across media types, faster adaptation to changing user preferences and content trends, and improved discovery of niche content that matches specific user interests. Other advantages include reduced latency in updating search indices with new content or popularity changes, lower operational costs due to reduced compute and storage requirements, and increased scalability to handle growing content libraries and user bases. And since CV can be implemented at any scale, these benefits would be available to streaming platforms of all sizes.

[0154] As a specific, non-limiting example, continuous vectorization may be advantageously used in an enterprise knowledge management system. Large corporations increasingly rely on AI-powered systems to manage and search through their vast repositories of internal documents, including reports, emails, presentations, and project documentation. These systems use vector embeddings to represent and process the multidimensional data about documents, their relationships, and user access patterns. However, conventional vector storage systems face significant problems. Storing individual vector embeddings for each document and its components leads to massive storage requirements, particularly for enterprises managing millions of documents. Frequent re-vectorization of the entire corpus to incorporate new documents and updates is computationally expensive. Additionally, cross-departmental relevance is often difficult to establish with traditional search methods, leaving valuable information siloed in specific departments. The dynamic and frequently changing access patterns of employees also necessitate constant updates to search relevance models.

[0155] Continuous vectorization offers a transformative solution for enterprise knowledge management systems, according to various embodiments. One of the core advantages of CV is its ability to maintain a single, continuously updated vector 114 per document type, project, and department. This dramatically reduces storage requirements, allowing organizations to cover a broader range of documents without increasing infrastructure costs.

[0156] CV also eliminates the need for batch re-vectorization when new documents are added or modified. Instead, it uses weighted linear interpolation to update existing vectors 114 in real-time. This ensures that search results reflect the most current organizational knowledge almost instantaneously, improving the efficiency and effectiveness of enterprise search functions.

[0157] In addition, CV enhances cross-departmental discovery by continuously updating relationships between documents across different organizational silos. This facilitates more accurate and relevant search results that surface valuable information from disparate parts of the organization. The system's ability to break down information silos fosters greater collaboration and innovation by enabling the discovery of connections between seemingly unrelated documents.

[0158] Furthermore, CV adapts quickly to the dynamic access patterns of employees. By continuously updating user interaction vectors 114 based on search and access behaviors, it provides more accurate and personalized search results that evolve in response to changing organizational needs and priorities.

[0159] Several key components of CV enable these advancements. The weighting function 120 can be tailored to balance document recency, user access patterns, and organizational hierarchy, ensuring that recent and frequently accessed documents are given higher weight 200 in search results while maintaining foundational knowledge. Facets can be defined based on document type, department, project, and user roles, enabling targeted updates and searches across the organization. Linear interpolation allows for the smooth integration of new document information and user interactions into existing vector embeddings, preserving continuity while incorporating new data. The continuous updating mechanism ensures that vector embeddings always reflect the latest state of organizational knowledge and user behavior, which is crucial for maintaining up-to-date search indices in a constantly evolving corporate environment.
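As a specific, non-limiting illustration of how a weighting function might balance the three signals named above, the following Python sketch blends recency, access frequency, and organizational hierarchy into a single interpolation weight. The function name `enterprise_weight`, the half-life, the blend coefficients, and the cap `max_w` are all assumptions made for illustration and are not taken from the disclosure; the cap is one possible way of ensuring that the existing facet vector always retains most of its influence.

```python
import math


def enterprise_weight(
    age_days: float,
    access_count: int,
    hierarchy_level: int,
    half_life_days: float = 30.0,  # assumed: recency halves every 30 days
    max_w: float = 0.5,            # assumed cap so (1 - w) never drops below 0.5
) -> float:
    """Hypothetical weighting function returning w in [0, max_w]."""
    # Recency: 1.0 for a brand-new document, decaying toward 0 with age.
    recency = 0.5 ** (age_days / half_life_days)
    # Access: 0.0 for a never-opened document, saturating toward 1.0.
    access = 1.0 - 1.0 / (1.0 + math.log1p(access_count))
    # Hierarchy: level 0 (organization-wide) counts most, deeper levels less.
    hierarchy = 1.0 / (1 + hierarchy_level)
    # Assumed blend coefficients; they sum to 1 so the blend stays in [0, 1].
    blended = 0.5 * recency + 0.3 * access + 0.2 * hierarchy
    return max_w * blended


fresh = enterprise_weight(age_days=0, access_count=0, hierarchy_level=0)
stale = enterprise_weight(age_days=365, access_count=0, hierarchy_level=0)
```

In this sketch a new organization-wide document receives a larger weight than a year-old one with the same access history, while neither can overwrite more than half of the existing facet vector in a single update.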

[0160] The application of CV in enterprise knowledge management could have a profound impact on productivity and decision-making within large organizations. By enabling faster and more accurate discovery of relevant information across departmental boundaries, CV helps surface expertise and knowledge that may otherwise remain hidden. The system's efficiency in handling vast amounts of corporate data reduces operational costs, while its scalability ensures that growing volumes of documents and users can be managed effectively. Moreover, the ability to break down information silos and foster collaboration across departments could lead to significant innovation and competitive advantage.

[0161] This contemplated CV-powered enterprise knowledge management system could become an indispensable tool for large corporations across all sectors, offering unparalleled speed, accuracy, and insight in navigating complex internal knowledge landscapes.

[0162] As still another specific, non-limiting example, continuous vectorization may be advantageously used in a real-time news analysis and recommendation system. News aggregators and media companies face the challenging task of analyzing, categorizing, and recommending news articles in real-time. These systems rely on vector embeddings to represent and process the multidimensional data of news content, including text, metadata, and user interactions. However, conventional vector storage systems face critical limitations. Storing individual vector embeddings for each news article and user interaction results in massive storage demands. Re-vectorizing news datasets to incorporate breaking stories and evolving narratives is computationally expensive, while maintaining real-time analysis and content relevance in a rapidly changing news cycle is another substantial hurdle.

[0163] Continuous vectorization significantly improves the efficiency and performance of news analysis and recommendation systems. One of CV’s primary advantages is its ability to maintain a single, continuously updated vector 114 per news topic and user interest profile, dramatically reducing storage requirements. This reduction allows for the retention of more comprehensive news content and user data without requiring additional infrastructure.

[0164] By using weighted linear interpolation, CV updates vectors 114 in real-time as new articles are published and user interactions occur. This eliminates the need for costly batch updates and enables near-instantaneous analysis of news trends and user interests. As a result, vector embeddings continuously reflect the most current information, allowing news categorization and recommendation systems to respond quickly to breaking news and shifting narratives.

[0165] In addition to improved real-time analysis, CV enhances the representation of evolving news topics and user interests. Its efficient updating mechanism allows for more nuanced content categorization and a better understanding of changing news landscapes, resulting in more accurate recommendations. Furthermore, CV-powered systems are able to continuously update user profile vectors 114 based on reading habits and interactions, ensuring that recommendations are relevant to users’ evolving interests.

[0166] Several CV components drive these improvements. The weighting function 120 can be tailored to prioritize breaking news over established narratives, ensuring that high-impact events significantly influence topic vector representations. Facets can be defined based on categories such as topics, geographic regions, and news sources, enabling targeted updates and searches within the news ecosystem. Linear interpolation allows for the smooth integration of new articles and user interactions into existing vectors 114, preserving the continuity of news narratives while incorporating new data. The continuous updating mechanism ensures that vector embeddings reflect the latest news content and user behavior, which is essential for maintaining up-to-date topic models and recommendation systems.

[0167] The impact of CV on the news and media industry could be substantial. By enabling more efficient, accurate, and timely analysis of news content, CV improves the ability to detect breaking news, categorize content, and provide personalized recommendations. This can lead to a more engaged and informed readership, with news platforms benefiting from increased user satisfaction and retention. Moreover, CV’s ability to handle large-scale, continuously changing data streams reduces operational costs and increases scalability, allowing news platforms to manage growing volumes of content and users more effectively.

[0168] This contemplated CV-powered real-time news analysis and recommendation system could become an essential tool for media companies seeking to stay competitive in the fast-paced digital news landscape, offering unmatched speed, accuracy, and efficiency in news analysis and recommendation.

[0169] As a specific, non-limiting example, continuous vectorization may be advantageously used in an environmental monitoring and climate change analysis system. Environmental agencies and research institutions use AI-powered systems to analyze large quantities of climate data to monitor and predict climate change patterns. These systems rely on vector embeddings to manage multidimensional data streams from a wide array of sources, such as sensors, satellite data, and historical climate records. Traditional vector storage systems, however, present several limitations in this domain. The need to store individual vector embeddings for each data point leads to exponential growth in storage demands. Additionally, frequent re-vectorization of datasets to incorporate new information is computationally expensive. Moreover, real-time analysis is crucial in this field, as climate patterns change rapidly. The high computational demands of constant re-vectorization also result in significant energy consumption, which poses a challenge, particularly for systems trying to align with sustainability goals.

[0170] Continuous vectorization technology offers a transformative solution for environmental monitoring and climate change analysis systems. CV’s ability to maintain a single, continuously updated vector 114 per climate facet 116 — such as geographical regions or climate variables — significantly reduces storage requirements. This allows organizations to retain more comprehensive historical and real-time climate data without the need for additional infrastructure.

[0171] Rather than re-vectorizing entire datasets, CV uses weighted linear interpolation to update existing vectors 114 as new climate data becomes available. This reduces the computational load and enables more frequent updates to the models, allowing for near real-time analysis of climate trends. The ability of CV to incorporate new data almost instantaneously ensures that environmental monitoring systems can respond quickly to emerging patterns or extreme weather events, which is critical for effective climate change mitigation and disaster preparedness.

[0172] In addition to enhancing real-time analysis, CV’s efficient updating mechanism allows for better representation of complex climate interactions and long-term trends. This results in more accurate climate models that can capture intricate relationships between various climate variables. Another major advantage of CV is its energy efficiency — by reducing the computational demands of vector processing, CV translates into lower energy consumption, which is aligned with the sustainability goals of environmental agencies and institutions.

[0173] Key components of CV enable these improvements. The weighting function 120 can be tailored to prioritize new and extreme climate events over established data, ensuring that significant weather occurrences influence climate vector representations. Facets may be defined based on relevant categories such as geographical regions and climate variables, allowing for targeted analysis of specific aspects of the climate system. Linear interpolation ensures the smooth integration of new data into existing vector embeddings, preserving the continuity of climate trends while incorporating new information. CV’s continuous updating mechanism keeps vector embeddings up to date with the latest climate data, which is essential for monitoring rapid environmental changes and providing accurate inputs for AI models.

[0174] The application of CV in environmental monitoring and climate change analysis could have a profound impact on the field. By enabling more efficient, accurate, and sustainable climate data analysis, CV improves early warning systems for extreme weather events, enhances long-term climate projections, and contributes to more effective climate change mitigation strategies. The system’s ability to handle vast amounts of climate data more efficiently could lead to breakthroughs in understanding complex climate systems and their interactions, potentially uncovering new patterns or correlations that were previously undetectable.

[0175] This contemplated CV-powered environmental monitoring system could become an essential tool for government agencies, research institutions, and tech companies developing AI solutions for climate analysis, offering unparalleled speed, accuracy, and sustainability in managing and interpreting climate data.

[0176] As a specific, non-limiting example, continuous vectorization may be advantageously used in a large-scale genetic analysis system for disease research. Research institutions and biotech companies increasingly rely on AI-powered systems to analyze vast amounts of genomic data for disease research and drug discovery. These systems use vector embeddings to represent and process multidimensional genetic data from millions of individuals and organisms. However, traditional vector storage systems face significant challenges in this domain. Storing individual vector embeddings for each genetic sequence leads to enormous storage demands. Additionally, frequent re-vectorization of genomic datasets to incorporate new discoveries is computationally expensive. Real-time analysis is critical in this field, as new genetic associations are constantly being discovered. Furthermore, the complexity and interconnectedness of genetic data make it difficult to represent and analyze efficiently.

[0177] Continuous vectorization technology provides a powerful solution to these challenges in genetic analysis and disease research. CV’s ability to maintain a single, continuously updated vector 114 per genetic facet 116, such as gene function or disease association, significantly reduces storage requirements. This reduction allows for the retention of more comprehensive genetic datasets without increasing infrastructure costs, making it easier to handle the massive volume of data generated in genomic research.

[0178] Instead of re-vectorizing entire genomic datasets when new genetic discoveries are made, CV uses weighted linear interpolation to update existing vectors 114. This reduces the computational cost of updates, enabling more frequent model adjustments and faster incorporation of new genetic information. CV’s real-time vector updates allow near-instantaneous analysis of genetic trends, ensuring that genetic models are always reflective of the most current discoveries. This capability is crucial for research in disease mechanisms, where new gene-disease associations must be quickly integrated into ongoing analyses.

[0179] CV also excels at handling the complexity of genetic data. Its efficient updating mechanism allows for better representation of complex genetic interactions and pathways, leading to more accurate models and potentially novel insights. The continuous integration of new genetic data preserves the continuity of genetic knowledge while still incorporating cutting-edge findings, enhancing the predictive power of AI-driven analyses.

[0180] Several key components of CV drive these advancements. The weighting function 120 balances the influence of new genetic discoveries with established knowledge, ensuring that newly identified gene-disease associations significantly impact the overall vector representations. Facets may be defined based on relevant categories such as gene functions, metabolic pathways, and disease associations, enabling targeted updates and more precise analysis of specific aspects of the genetic landscape. Linear interpolation allows for the smooth integration of new data into existing vectors 114, ensuring that the continuity of genetic knowledge is maintained. CV’s continuous updating mechanism ensures that vector embeddings reflect the latest genetic research, which is essential for maintaining accurate and up-to-date models in disease research.

[0181] The impact of CV on genomic research and drug discovery could be transformative. By enabling more efficient, accurate, and comprehensive analysis of genetic data, CV enhances the understanding of disease mechanisms, accelerates drug discovery, and optimizes personalized treatment approaches. The system’s ability to handle vast amounts of complex genetic data more efficiently could lead to breakthroughs in understanding intricate genetic interactions and their role in diseases, potentially uncovering novel therapeutic targets and treatment strategies.

[0182] This contemplated CV-powered genetic analysis system could become an indispensable tool for research institutions, biotech companies, and healthcare providers, offering unparalleled speed, accuracy, and insight in managing and interpreting large-scale genetic data.

[0183] As a specific, non-limiting example, continuous vectorization may be advantageously used in a smart city infrastructure management system. Smart cities leverage AI-powered systems to manage and optimize urban infrastructure, including traffic flow, energy usage, and public transportation. These systems rely on vector embeddings to represent and process data streams from millions of IoT devices and sensors scattered throughout the city. However, traditional vector storage approaches present significant challenges. Storing individual vector embeddings for each data point from these numerous sensors results in massive storage demands. Additionally, frequent re-vectorization of city-wide data to incorporate new information is computationally expensive. Real-time analysis is critical in smart city management, as urban conditions such as traffic congestion, energy consumption, and public transportation needs can change rapidly. Furthermore, the computational demands of constant re-vectorization result in high energy consumption, which is at odds with the sustainability goals of smart cities.

[0184] Continuous vectorization technology offers a transformative solution for smart city infrastructure management. One of CV's key advantages is its ability to maintain a single, continuously updated vector 114 per urban infrastructure facet 116, such as traffic patterns or energy grids. This approach dramatically reduces storage requirements, allowing for the retention of comprehensive historical and real-time urban data without increasing infrastructure costs.

[0185] Rather than re-vectorizing entire datasets when new data arrives from citywide sensors, CV uses weighted linear interpolation to update existing vectors 114. This approach significantly reduces the computational load, enabling more frequent updates to the models and allowing for real-time analysis of urban trends. By continuously incorporating new data almost instantaneously, CV ensures that urban management systems can respond quickly to changing conditions, whether it be a traffic incident or a spike in energy demand.

[0186] CV also contributes to smart cities' sustainability goals by reducing energy consumption. The reduced computational load translates directly into lower energy usage, which is critical for minimizing the environmental footprint of smart city operations.

[0187] Several key components of CV power these advancements. The weighting function 120 can prioritize new urban data over historical trends, ensuring that recent incidents such as traffic jams or power outages significantly influence vector representations. Facets may be defined based on relevant urban categories, such as traffic zones, energy grids, and public transportation routes, enabling targeted analysis and more accurate system responses. Linear interpolation allows for the smooth integration of new data into existing vector embeddings, preserving continuity in urban patterns while incorporating new information. CV's continuous updating mechanism ensures that vector embeddings reflect the latest city data, which is critical for monitoring rapid urban changes and providing up-to-date inputs to AI-driven city management models.

[0188] The application of CV in smart city infrastructure management could have a significant impact on urban planning and operations. By enabling more efficient, accurate, and sustainable data analysis, CV improves traffic management, optimizes energy distribution, and enhances public transportation efficiency. The system's ability to handle vast amounts of urban data more efficiently could also lead to breakthroughs in understanding complex urban dynamics, potentially uncovering new approaches to long-standing challenges in urban planning.

[0189] This contemplated CV-powered smart city management system could become an indispensable tool for city governments, urban planners, and IoT companies, offering unparalleled speed, accuracy, and sustainability in managing the vast and complex data streams generated by smart city infrastructure.

[0190] As a specific, non-limiting example, continuous vectorization may be advantageously used in a personalized medicine and treatment optimization system. Healthcare providers and research institutions increasingly rely on AI-powered systems to analyze patient data to develop personalized treatment plans and predict drug efficacy. These systems use vector embeddings to represent and process multidimensional data related to patients, including genomic information, medical histories, and real-time health metrics. Traditional vector storage systems face significant problems in this domain. Storing individual vector embeddings for each patient data point leads to massive storage requirements. Additionally, frequent re-vectorization of patient histories to accommodate new health data is computationally expensive. Real-time analysis is critical, as patient conditions can change rapidly, requiring timely adjustments to treatment plans. The storage and processing of numerous vectors 114 also increase the risk of data breaches, raising concerns about patient data privacy.

[0191] Continuous vectorization technology offers a powerful solution for personalized medicine and treatment optimization. CV's ability to maintain a single, continuously updated vector 114 per patient profile significantly reduces storage requirements. This approach allows healthcare providers to retain comprehensive patient histories without the need for additional infrastructure, thereby facilitating long-term tracking and analysis of patient data.

[0192] Rather than re-vectorizing entire patient histories when new data arrives, CV uses weighted linear interpolation to update existing vectors 114 dynamically. This reduces the computational burden of updates and enables more frequent model adjustments, allowing for real-time analysis of patient health trends. As new data streams in from wearable devices, health monitoring systems, and check-ups, CV-powered systems can immediately update patient profiles, ensuring that treatment recommendations remain current and accurate.

[0193] CV also contributes to enhanced data privacy. By reducing the number of stored vectors 114, CV lowers the attack surface for potential data breaches, improving data security and making it easier for healthcare providers to comply with privacy regulations such as HIPAA.

[0194] Key components of CV drive these advancements. The weighting function 120 can be tailored to prioritize acute health events over long-term trends, ensuring that recent and significant changes in a patient’s condition strongly influence their treatment plan. Facets may be defined based on relevant categories such as medical conditions, treatment plans, and demographic data, allowing for targeted analysis of specific aspects of patient health. Linear interpolation ensures the smooth integration of new health data into existing vector embeddings, preserving the continuity of patient health trends while incorporating new information. CV's continuous updating mechanism ensures that patient profiles are up to date, which is critical for monitoring rapid changes in health and providing accurate inputs for AI-driven treatment models.

[0195] The impact of CV on personalized medicine could be substantial. By enabling more efficient, accurate, and secure patient data analysis, CV enhances the ability of healthcare providers to deliver personalized treatments that are better aligned with individual patient needs. This could improve treatment outcomes, accelerate drug discovery processes, and ultimately contribute to more effective and tailored healthcare delivery. The system’s ability to handle vast amounts of patient data more efficiently could also lead to breakthroughs in understanding complex disease mechanisms, potentially uncovering novel therapeutic approaches and enabling more precise interventions.

[0196] This contemplated CV-powered personalized medicine and treatment optimization system could become an essential tool for healthcare providers, pharmaceutical companies, and research institutions, offering unmatched speed, accuracy, and security in managing patient data and optimizing treatment strategies.

[0197] The following is a discussion comparing the performance of the contemplated continuous vectorization system 100 with a conventional vectorization approach on a specific, non-limiting example of a use case for vector databases. The data set to be vectorized by the two approaches is made up of roughly 50,000 books in the public domain, obtained from Project Gutenberg. The data set is roughly 17 gigabytes of raw text data.

[0198] Two vector databases are formed, one using standard vectorization and the other using continuous vectorization. The weighted interpolation of continuous vectorization preserves relevance while reducing storage requirements. The storage advantages of the contemplated CV method are significant: the standard vectorization method requires roughly 1.5 terabytes, while the continuous vectorization method actually shrinks the storage requirements to roughly 0.5 gigabytes. In this specific, non-limiting example involving the vectorization of literature, the contemplated continuous vectorization method reduced the storage requirements of the vectorized data by a factor of 300x compared to conventional vectorization.

[0199] As previously discussed, the contemplated system and method is able to provide storage advantages without sacrificing performance or accuracy. Continuous vectorization essentially transforms facets from passive organizational units into active semantic "maintainers" that ensure the semantic meaning is preserved across continuous updates. This can be seen in a comparison between the accuracy of the standard and continuous vector data sets of the specific non-limiting example discussed above.

[0200] Both vector databases were given the query "Find authors whose writing style is most similar to Shakespeare". The top 10 results were returned from each database, ranked by L2 distance. The L2 distance measures how far a book's vector lies from the query vector and is expressed here as a number from 0 to 1. The smaller the distance, the more relevant the book's content is to the query.
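As a specific, non-limiting illustration of the ranking step described above, the following Python sketch returns the k stored vectors nearest to a query vector under L2 (Euclidean) distance. The function name `top_k` and the toy two-dimensional vectors are assumptions made for illustration; the sketch does not reproduce the scaling of distances to the 0 to 1 range used in the example, nor the embedding model that produced the vectors.

```python
import math
from typing import Dict, List, Sequence, Tuple


def top_k(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int = 10,
) -> List[Tuple[str, float]]:
    """Return the k (name, distance) pairs closest to the query, nearest first."""
    # math.dist computes the Euclidean (L2) distance between two points.
    scored = ((name, math.dist(query, vec)) for name, vec in vectors.items())
    # Smaller distance means a more relevant result, so sort ascending.
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy data: hypothetical titles mapped to hypothetical 2-D vectors.
library = {"a": [3.0, 4.0], "b": [1.0, 0.0], "c": [0.0, 2.0]}
results = top_k([0.0, 0.0], library, k=2)
```

The same routine can be run against either database, which allows the two ranked lists to be compared result by result.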

[0201] FIG. 4 shows a relevance plot of the L2 distances from query vectors to result vectors for the top ten results obtained through the specific non-limiting examples of standard and continuous vector databases. As suggested by the plot, both databases returned the same top result, "Shakespeare's First Folio". However, for the remaining nine books the continuous vectorization database returned results with better (i.e., shorter) L2 distances than the standard vector database. Not only does CV offer substantial storage savings, but it can also provide more accurate results in a semantic search.

[0202] It should be noted that this preservation of semantic meaning while also reducing storage requirements is an emergent capability that cannot be achieved through faceting or weighting alone. The interaction between these components enables dynamic semantic maintenance without the storage overhead typically associated with vector databases. In conventional systems, an improvement in one of these aspects comes at the expense of the other.

[0203] Vector databases 104 that implement continuous vectorization maintain facet vectors 114 by incrementally interpolating new update vectors 210 into existing facet representations. In such systems the processor 106 generates a weight 200 w, applies it to the incoming update vector 210, applies (1-w) to the existing target facet vector 204, and writes the interpolated result back to the vector database 104. As previously discussed, in some embodiments the weighting value may be produced by a weighting function 120 applied to information about the facet 116, the vectors 114, or the incoming data 206. This mechanism is a core control surface for facet behavior in continuous interpolation workflows.
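As a specific, non-limiting illustration, the interpolation step described above may be sketched in Python as follows. The function name `interpolate_facet` and the use of plain lists of floats are assumptions made for illustration; the weight w is taken as an input here, standing in for whatever value the weighting function 120 produces.

```python
from typing import List, Sequence


def interpolate_facet(
    target: Sequence[float],
    update: Sequence[float],
    w: float,
) -> List[float]:
    """Return (1 - w) * target + w * update, computed element by element."""
    if len(target) != len(update):
        raise ValueError("target and update vectors must have the same length")
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight w must lie in [0, 1]")
    return [(1.0 - w) * t + w * u for t, u in zip(target, update)]


# The updated facet vector would then be written back in place of the old one.
updated = interpolate_facet([1.0, 0.0], [0.0, 1.0], w=0.25)
```

With w = 0 the facet vector is left unchanged and with w = 1 it is replaced outright, so the weight directly controls how strongly each new item moves the facet.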

[0204] Despite the flexibility of these approaches, systems that rely on them encounter a persistent deficiency when facet meaning is driven by mixed-relevance streams. In typical deployments, not all incoming chunks contribute equally to a facet’s intended semantics. Average-style schemes tend to distribute influence uniformly regardless of content; order- or time-based schemes modulate by recency or position; and count-based schemes respond to accumulation rather than significance. None of these, as conventionally applied, automatically assess how semantically aligned the new content is with the facet’s core concept, which dilutes facet specificity over time and degrades retrieval quality for queries targeting that facet.
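The dilution described above can be made concrete with a short Python sketch. It assumes one common form of average-style weighting, w = 1/(n+1) after n chunks have been absorbed, which makes the facet vector the arithmetic mean of everything ingested. The function names and the toy two-dimensional vectors are hypothetical and are used only to show that an off-topic chunk moves the facet exactly as much as an on-topic one.

```python
from typing import List, Sequence


def interpolate(target: Sequence[float], update: Sequence[float], w: float) -> List[float]:
    """Weighted linear interpolation: (1 - w) * target + w * update."""
    return [(1.0 - w) * t + w * u for t, u in zip(target, update)]


def running_average(chunks: Sequence[Sequence[float]]) -> List[float]:
    """Fold chunks into one facet vector using the content-blind weight 1/(n+1)."""
    facet = list(chunks[0])
    for n, chunk in enumerate(chunks[1:], start=1):
        w = 1.0 / (n + 1)  # depends only on the count n, never on the content
        facet = interpolate(facet, chunk, w)
    return facet


on_topic = [1.0, 0.0]   # toy vector aligned with the facet's core concept
off_topic = [0.0, 1.0]  # toy vector for an incidental mention
facet = running_average([on_topic, on_topic, off_topic, off_topic])
```

After two on-topic and two off-topic chunks the facet vector sits halfway between the two directions, which is the loss of specificity that a content-aware weight is intended to avoid.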

[0205] This shortfall becomes evident in enterprise knowledge bases and similar corpora where facets reflect concrete topics. For example, a facet meant to represent “remote work policy” may see highly relevant remote work policy sections, moderately relevant mentions of remote work in benefits summaries, and marginal references to remote work in unrelated logistics documents (e.g., discussion regarding office supplies). Conventional weighting either treats these contributions equally or weights them by recency or ingestion order, which are metadata-centric rather than meaning-centric. The result is drift in the facet vector and reduced precision at retrieval time when disambiguation depends on semantically concentrated signals rather than incidental mentions.

[0206] Information retrieval techniques such as lexical statistics and probabilistic ranking are well known in the art and provide mature mechanisms to measure relevance within a corpus. However, in continuous vectorization workflows, a gap remains between these lexical relevance measures and the generation of the interpolation weight that governs how strongly a new update vector should influence a facet vector at ingestion time. In practice, systems therefore require a weighting approach that operates automatically, that evaluates semantic alignment between new content and existing facets, that avoids manual per-facet tuning, and that can be inserted into the standard interpolation pipeline without altering downstream retrieval or storage semantics.

[0207] Contemplated herein is a system and method for automatically determining interpolation weights for continuous vectorization. The contemplated method, also referred to as "auto weighting", computes a content-aware relevance measure between incoming text and existing facet representations. The approach integrates a linguistically grounded scoring pipeline into the weighted linear interpolation loop previously discussed, replacing uniform or metadata-driven weights with values derived from the true semantic alignment of new content to a target facet.

[0208] In operation, incoming text is normalized through a preprocessing pipeline that removes stop words and applies stemming to reduce variance while preserving meaning-bearing terms, according to various embodiments. The system maintains corpus-level stem statistics and facet-specific term profiles that characterize the semantic focus of each facet. A scoring engine computes a BM25-style relevance score for each candidate facet by comparing the stem composition of the incoming text against the facet’s term profile, with inverse document frequency weighting emphasizing discriminative terminology, according to various embodiments. The resulting relevance score is then transformed into an interpolation weight for continuous vectorization. In some embodiments, the score may be normalized across candidate facets, optionally combined with temporal decay, and used directly to govern how strongly the update vector influences the target facet vector.

[0209] Advantageous over conventional weighting families, the contemplated system and method directly addresses the dilution of facet meaning that arises from mixed-relevance streams. By elevating semantically concentrated signals and suppressing incidental mentions, it sustains facet specificity over time and improves retrieval precision without manual per-facet tuning. Integration is non-disruptive to the previously discussed continuous vectorization system 100 (e.g., the system 100 of FIGs. 1A-2B). The weighting function 120 is inserted in the same place as in systems 100 that do not employ auto weighting, preserving downstream storage and retrieval semantics while substituting a relevance-derived value for prior uniform or heuristic schemes. The result is an automated, content-aware weight selection mechanism that aligns the interpolation with established information retrieval principles, thereby resolving the identified need for a meaning-centric, low-maintenance weighting approach within continuous vectorization workflows.

[0210] FIG. 5 is a process view of a non-limiting example of the application of auto weighting with the system 100 of FIG. 1B, to the updating of a vector database 104 using continuous vectorization. First, the system 100 receives new data 206. See circle '1' of FIG. 5. As shown in this non-limiting example, the new data 206 may be received as raw data 208 comprising text. This is being shown in this non-limiting example for convenience, because auto weighting relies on access to the raw data 208. In embodiments where the new data 206 is received in vector form, auto weighting may still be used so long as the system 100 has access to the raw data 208 from which the vector 114 originated.

[0211] In some embodiments, prior to relevance scoring and generating weights 200, the continuous vectorization server 102 preprocesses the new data 206 to generate a term profile 500 (see, for example, circle “2” of FIG. 5). In the context of the present description and the claims that follow, a “term profile” 500 is a data structure that represents normalized lexical features of the new data 206, and may include, for example, stem frequencies, token counts, and other term-level statistics derived from the text after normalization.

[0212] In some embodiments, preprocessing comprises a linguistic normalization pipeline that includes at least stop word removal and stemming. In other embodiments, preprocessing may additionally include tokenization, case normalization, punctuation handling, language detection, lemmatization, phrase detection, or equivalent operations that facilitate consistent relevance scoring.

[0213] In some embodiments, preprocessing comprises removing stop words 508. In the context of the present description and the claims that follow, “stop words” 508 are common words that carry minimal discriminative meaning and appear with high frequency across many documents. Examples include, but are not limited to, articles (e.g., a, an, the), prepositions (e.g., in, on, at, by), conjunctions (e.g., and, or, but), and common verbs (e.g., is, are, was, were, have, has). In some embodiments, the system maintains a comprehensive stop word list for one or more applicable languages and removes tokens matching entries in the list after tokenization. Stop word removal may eliminate noise that would otherwise inflate term frequencies without contributing to semantic meaning. This improves the accuracy of subsequent relevance calculations and reduces computational overhead.

[0214] In some embodiments, preprocessing comprises applying stemming to generate stems 510 for a term profile 500. In the context of the present description and the claims that follow, “stems” 510 refer to normalized token forms representing word roots, which may be produced by a stemming algorithm (e.g., Porter stemmer, Snowball stemmer, language-specific stemmers, etc.). Stemming is the process of reducing words to their root or stem form by removing morphological affixes (prefixes and suffixes). For example, “running,” “runs,” and “ran” may all reduce to “run”. As another example, “policies” and “policy” may both reduce to “polic” (or “policy”, depending on the stemmer). By reducing morphological variants to a common form, documents discussing the same concept with different word forms are properly recognized as semantically related. This dramatically improves recall in relevance matching.
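By way of non-limiting illustration, the preprocessing described above may be sketched in Python as follows. The stop word list, the function names, the dictionary layout of the term profile, and the crude suffix-stripping rule are all assumptions introduced for this example; in particular, `toy_stem` is only a stand-in for a real stemmer such as a Porter or Snowball stemmer and does not reproduce their behavior.

```python
import re
from collections import Counter

# Illustrative stop word list (assumption); a deployment would likely
# maintain a fuller list per language.
STOP_WORDS = {"a", "an", "the", "in", "on", "at", "by", "and", "or", "but",
              "is", "are", "was", "were", "have", "has", "of", "to", "for"}


def toy_stem(token: str) -> str:
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("es", ""), ("s", "")):
        # Only strip when at least three characters of the word remain.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: len(token) - len(suffix)] + replacement
    return token


def build_term_profile(text: str) -> dict:
    """Tokenize, lowercase, drop stop words, stem, and count the stems."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    stems = [toy_stem(t) for t in tokens if t not in STOP_WORDS]
    return {"stem_frequencies": Counter(stems), "token_count": len(stems)}


profile = build_term_profile("The remote work policies are in the remote work policy.")
```

In this sketch "policies" and "policy" collapse to the same stem, so both occurrences are counted together in the resulting term profile.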

[0215] In some embodiments, the method further comprises updating corpus statistics 502 within the term statistics database 130 (see, for example, circle “3”). In the context of the present description and the claims that follow, “corpus statistics” 502 refer to aggregated statistics about stems or terms across a collection of ingested content, and may include global, per-chunk, per-document, and per-facet statistics.

[0216] In some embodiments, the term statistics database 130 maintains global stem statistics for each unique stem 510 encountered across all processed content, including document frequency (df), collection frequency (cf), and inverse document frequency (idf) values. In the context of the present description and the claims that follow, “document frequency” refers to the number of distinct chunks (or documents, depending on configuration) in which a stem 510 appears, and “collection frequency” refers to the total number of occurrences of the stem 510 across the corpus. Inverse document frequency (idf) may be calculated from these values to represent how rare or discriminative a stem 510 is within the corpus.

[0217] In the context of the present description and the claims that follow, a chunk is a segment of raw data 208, such as a portion of text. The concept of a chunk will be central to megachunking, which will be discussed further in the context of FIG. 6, below.

[0218] In some embodiments, the term statistics database 130 additionally maintains per-chunk stem statistics for each chunk of content processed, including term frequency (tf), chunk length, and chunk identifiers. In the context of the present description and the claims that follow, term frequency is the number of times each stem 510 appears within that specific chunk. Chunk length is the total number of stems 510 in the chunk after preprocessing. A chunk identifier is a unique reference to the source chunk that links the statistics to a source record or provenance information, according to various embodiments.

[0219] In some embodiments, the term statistics database 130 further maintains per-facet stem statistics for each facet 116 in the system 100, including facet term frequencies, facet document count, and facet centroid terms, according to various embodiments. In the context of the present description and the claims that follow, facet term frequencies are aggregated term frequencies for all stems associated with a particular facet. Facet document count is the number of chunks that have been interpolated into or contributed to that facet. Facet centroid terms are the stems that are most characteristic of that facet’s semantic meaning.

[0220] In some embodiments, updating corpus statistics 502 comprises, for each stem 510 in the preprocessed chunk, updating the global document frequency (e.g., incrementing the global document frequency if this is a new occurrence of the stem 510 in this chunk), updating the collection frequency (e.g., incrementing the collection frequency by the stem’s term frequency in this chunk), recalculating idf values as needed, and storing the chunk’s term frequencies and chunk length as per-chunk statistics.
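As a non-limiting sketch, the statistics update described above could be expressed in Python roughly as follows. The `TermStatistics` class is a hypothetical in-memory stand-in for the term statistics database 130, and its attribute and method names are assumptions made for illustration. Inverse document frequency is not stored here; it can be derived on demand from the document frequency and the chunk count.

```python
from collections import Counter, defaultdict


class TermStatistics:
    """Hypothetical in-memory stand-in for the term statistics database."""

    def __init__(self):
        self.df = defaultdict(int)  # document frequency: chunks containing the stem
        self.cf = defaultdict(int)  # collection frequency: total occurrences
        self.chunks = {}            # chunk_id -> (term frequencies, chunk length)

    def add_chunk(self, chunk_id, stems):
        """Fold one preprocessed chunk into the global and per-chunk statistics."""
        tf = Counter(stems)
        for stem, count in tf.items():
            self.df[stem] += 1      # each distinct stem counts once per chunk
            self.cf[stem] += count  # every occurrence counts toward cf
        self.chunks[chunk_id] = (tf, len(stems))

    @property
    def num_chunks(self):
        return len(self.chunks)

    @property
    def avg_chunk_length(self):
        if not self.chunks:
            return 0.0
        return sum(length for _, length in self.chunks.values()) / len(self.chunks)


stats = TermStatistics()
stats.add_chunk("c1", ["remote", "work", "policy", "remote"])
stats.add_chunk("c2", ["office", "policy"])
```

This sketch assumes each chunk identifier is added once; re-ingesting the same chunk would require subtracting its earlier contribution first.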

[0221] Next, a relevance score 504 for the new data 206 is calculated relative to each facet 116 within the vector database 104 using a relevance function 506. See circle "4". According to various embodiments, the relevance score 504 is a function of the term profile 500 and the corpus statistics 502.

[0222] In the context of the present description and the claims that follow, a relevance function 506 refers to a function that outputs a numeric value quantifying lexical / statistical relevance of the new data 206 to a facet 116, based on the terms present in the new data 206 and the term statistics associated with the corpus and, in some embodiments, the facet 116. Going forward, the following notation will be used to indicate, for example, the relevance score 504 for chunk C against facet 116 F provided by the BM25 relevance function 506: BM25(C,F).

[0223] One example of a relevance function 506 is "Best Matching 25", or BM25. BM25 is a probabilistic information retrieval ranking function developed through decades of academic research and proven highly effective in real-world search systems. For a given chunk C and facet F, the BM25 score is calculated using the following notation:

$$\mathrm{BM25}(C, F) = \sum_{s \in C} \mathrm{IDF}(s) \cdot \frac{tf(s, C) \cdot (k_1 + 1)}{tf(s, C) + k_1 \cdot \left(1 - b + b \cdot \frac{|C|}{avgdl}\right)}$$

Where:
s represents each stem in chunk C
tf(s, C) is the term frequency of stem s in chunk C
|C| is the length of chunk C (in stems)
avgdl is the average chunk length across the corpus
k1 is a tuning parameter controlling term frequency saturation (typically 1.2 to 2.0)
b is a tuning parameter controlling length normalization (typically 0.75)
IDF(s) is the inverse document frequency of stem s

[0224] According to various embodiments, the Inverse Document Frequency (IDF) component measures how discriminative a stem 510 is:

$$\mathrm{IDF}(s) = \ln\left(\frac{N - n(s) + 0.5}{n(s) + 0.5} + 1\right)$$

Where:
N is the total number of chunks in the corpus
n(s) is the number of chunks containing stem s
ln represents the natural logarithm

[0225] In some embodiments, the Okapi BM25 relevance function may be used. In other embodiments, other variations in the BM25 family including, but not limited to, BM25+ may be used as the relevance function. In still other embodiments, other ranking functions known in the art may be used as the relevance function in the contemplated system and method.

[0226] Stems 510 that appear in many documents have low IDF because they are common and less discriminative, whereas stems 510 that appear in few documents have high IDF because they are rare and highly discriminative. This ensures that the presence of rare, topic-specific terminology contributes more to relevance scoring than common words.
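For illustration only, a BM25-style score built from the quantities listed above (term frequency, chunk length, average chunk length, k1, b, and IDF) may be sketched in Python as follows. The sketch assumes the standard Okapi BM25 saturation term summed over the stems of the chunk, and the function names and argument layout are hypothetical. Facet-specific handling is left out of this block.

```python
import math
from collections import Counter


def idf(num_chunks: int, chunks_with_stem: int) -> float:
    """Inverse document frequency with the +0.5 smoothing and +1 inside the log."""
    return math.log((num_chunks - chunks_with_stem + 0.5) / (chunks_with_stem + 0.5) + 1.0)


def bm25(chunk_stems, df, num_chunks, avgdl, k1=1.2, b=0.75):
    """Sum IDF-weighted, length-normalized term frequencies over the chunk's stems.

    df maps each stem to the number of chunks containing it; avgdl must be > 0.
    """
    tf = Counter(chunk_stems)
    length = len(chunk_stems)
    score = 0.0
    for stem, freq in tf.items():
        denominator = freq + k1 * (1.0 - b + b * length / avgdl)
        score += idf(num_chunks, df.get(stem, 0)) * freq * (k1 + 1.0) / denominator
    return score


# Hypothetical corpus of 10 chunks: "policy" is rare, "the" is common.
df = {"policy": 1, "the": 9}
rare_score = bm25(["policy"], df, num_chunks=10, avgdl=1.0)
common_score = bm25(["the"], df, num_chunks=10, avgdl=1.0)
```

With these hypothetical counts the rare stem yields the higher score, which is the behavior the IDF weighting is intended to produce.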

[0227] In some embodiments, a facet-specific BM25 adaptation comprises maintaining, for each facet, a characteristic stem profile derived from term statistics associated with the facet, including stem frequencies and discriminative weights. In the context of the present description and the claims that follow, characteristic stems are the stems 510 that are most strongly associated with the facet 116, weighted by their frequency and discriminative power within the facet 116. A facet term profile is a weighted distribution of stems 510 representing the facet’s conceptual focus. When a new chunk arrives, the BM25 score quantifies how well the chunk’s stem composition aligns with the facet’s term profile.
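One possible reading of the facet-specific adaptation, offered as an assumption rather than as the disclosed implementation, treats the facet's characteristic stems as the query terms and the incoming chunk as the document being scored. In the Python sketch below, `facet_profile` is a hypothetical mapping from characteristic stems to weights, and each matching stem's BM25 contribution is scaled by that weight.

```python
import math
from collections import Counter


def facet_bm25(chunk_stems, facet_profile, df, num_chunks, avgdl, k1=1.2, b=0.75):
    """Score a chunk against a facet term profile (assumed: stem -> weight)."""
    tf = Counter(chunk_stems)
    length = len(chunk_stems)
    score = 0.0
    for stem, facet_weight in facet_profile.items():
        freq = tf.get(stem, 0)
        if freq == 0:
            continue  # characteristic stems absent from the chunk add nothing
        n = df.get(stem, 0)
        idf = math.log((num_chunks - n + 0.5) / (n + 0.5) + 1.0)
        saturation = freq * (k1 + 1.0) / (freq + k1 * (1.0 - b + b * length / avgdl))
        score += facet_weight * idf * saturation
    return score


# Hypothetical profiles and document frequencies for a 10-chunk corpus.
chunk = ["remote", "work", "policy"]
remote_work_facet = {"remote": 1.0, "work": 0.8, "policy": 0.5}
office_supplies_facet = {"office": 1.0, "supply": 1.0}
df = {"remote": 2, "work": 3, "policy": 4, "office": 5, "supply": 1}

aligned = facet_bm25(chunk, remote_work_facet, df, num_chunks=10, avgdl=3.0)
unrelated = facet_bm25(chunk, office_supplies_facet, df, num_chunks=10, avgdl=3.0)
```

Under this reading a chunk that shares no stems with a facet's profile scores zero against that facet, so it would receive little or no interpolation weight there.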

[0228] Next, the continuous vectorization server 102 identifies a target facet 202 within the vector database 104 that the new data 206 would be associated with. See circle “5” of FIG. 5. According to various embodiments, the target facet 202 may be identified using a value 122 of the new data 206 and / or an attribute 124 (e.g., metadata, category, tag, source identifier, document identifier, author identifier, timestamp, or other contextual label) of the new data 206.

[0229] In the context of the present description and the claims that follow, a facet 116 refers to a logical grouping or unit of organization in the vector database 104 that is associated with one or more vectors 114 representing the semantic content of that grouping. In some embodiments, each facet 116 corresponds to an entity, topic, document, section, user, thread, or other conceptual bucket, and the vector(s) of the facet 116 represent an embedding-based semantic summary of content assigned to that facet 116.

[0230] Once the target facet 202 is identified, a target facet vector 204 belonging to the target facet 202 is identified using the relevance score 504. See circle "6" of FIG. 5. It should be noted that while the following example will only include one facet 116 having one vector 114, in use there may be multiple facets 116 that would be affected by the new data 206, and some or all of them may have more than one vector 114 that should be updated with the new data 206.

[0231] Next, a weight 200 (w) is calculated using a weighting function 120 that is a function of the relevance scores 504. See circle "7" of FIG. 5. In some embodiments, the system 100 may compute relevance scores 504 for a chunk relative to multiple facets and then derive a normalized weight 200 for a target facet 202.

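[0232] In some embodiments, the weighting function 120 may comprise a normalized weighted average such as:

$$w_F = \frac{\mathrm{BM25}(C, F)}{\sum_{F'} \mathrm{BM25}(C, F') + \epsilon}$$

[0233] where ε is a small constant to prevent division by zero and the sum runs over the candidate facets F′. In other embodiments, the weighting function 120 may comprise a softmax-like normalization:

$$w_F = \frac{\exp\left(\beta \cdot \mathrm{BM25}(C, F)\right)}{\sum_{F'} \exp\left(\beta \cdot \mathrm{BM25}(C, F')\right)}$$

[0234] where β is a temperature parameter that controls how sharply the weight distribution concentrates on the highest-scoring facets.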
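The two normalization families described above may be sketched in Python as follows, strictly as an illustration. The sketch assumes that the normalized form divides each facet's relevance score by the sum of scores across candidate facets plus a small constant, and that the softmax-like form multiplies each score by a sharpness parameter before exponentiation; the names `eps` and `beta`, and the choice to multiply rather than divide by the temperature, are assumptions.

```python
import math


def normalized_weights(scores: dict, eps: float = 1e-9) -> dict:
    """Divide each score by the total; eps guards against an all-zero total."""
    total = sum(scores.values()) + eps
    return {facet: score / total for facet, score in scores.items()}


def softmax_weights(scores: dict, beta: float = 1.0) -> dict:
    """Softmax over scores; larger beta concentrates weight on the top facets.

    Expects at least one candidate facet.
    """
    peak = max(scores.values())  # subtracting the max avoids overflow in exp
    exps = {facet: math.exp(beta * (score - peak)) for facet, score in scores.items()}
    total = sum(exps.values())
    return {facet: value / total for facet, value in exps.items()}


# Hypothetical relevance scores of one chunk against three candidate facets.
scores = {"remote_work": 3.0, "benefits": 1.0, "office_supplies": 0.0}
w_norm = normalized_weights(scores)
w_soft = softmax_weights(scores, beta=1.0)
w_sharp = softmax_weights(scores, beta=5.0)
```

Both functions return weights in the range 0 to 1, so either result can be used directly as the weight w in the interpolation step.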

[0235] In some embodiments, the weighting function 120 may also be a function of a decay factor dependent on elapsed time since the target facet vector 204 was last updated. For example, the system 100 may reduce the influence of older facet vectors or increase the influence of new data when a facet has not been updated for a threshold period, thereby supporting recency-sensitive updates while maintaining bounded interpolation behavior.
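One way such a decay factor might be combined with a relevance-derived weight is sketched below in Python. The half-life form, the parameter names, and the `max_weight` ceiling are all assumptions chosen for illustration; a threshold-based rule would be an equally valid reading of the paragraph above.

```python
def decayed_weight(base_weight: float,
                   elapsed_seconds: float,
                   half_life_seconds: float = 30 * 86400.0,
                   max_weight: float = 0.9) -> float:
    """Raise the weight of new data as the target facet vector grows stale.

    The weight moves from base_weight toward a ceiling, covering half of the
    remaining distance every half_life_seconds, so it stays bounded.
    """
    staleness = 1.0 - 0.5 ** (elapsed_seconds / half_life_seconds)
    ceiling = max(base_weight, max_weight)
    return base_weight + (ceiling - base_weight) * staleness


fresh = decayed_weight(0.2, elapsed_seconds=0.0)
one_half_life = decayed_weight(0.2, elapsed_seconds=30 * 86400.0)
```

Because the result never exceeds the ceiling, the target facet vector always retains some influence on the updated facet vector in this sketch.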

[0236] After the target facet vector 204 is identified, it is retrieved from the vector database 104. See circle “8” of FIG. 5. In some embodiments, retrieving the target facet vector 204 comprises fetching an embedding representation stored in the vector database 104, along with any associated facet attributes 124 needed for subsequent operations. In other embodiments, retrieving may additionally comprise retrieving multiple vectors associated with the facet and selecting one as the target for interpolation.

[0237] In some embodiments, the continuous vectorization server 102 is configured to generate the update vector 210 by vectorizing the new data 206 with an embedding model 118 (see circle “9” of FIG. 5). In the context of the present description and the claims that follow, an embedding model 118 refers to a machine learning model that maps raw content (for example, text) into a vector space such that semantically related content maps to nearby vectors. In some embodiments, the embedding model 118 produces fixed-length numeric vectors. In other embodiments, multiple embedding models may be used depending on modality or source.

[0238] A weighted linear interpolation is performed on the resulting vectors to create an updated facet vector 212. See circle “10” of FIG. 5. In some embodiments, the method comprises generating an updated facet vector 212 that reflects the new data 206 by performing a weighted linear interpolation between the target facet vector 204 and the update vector 210, with the update vector 210 being multiplied by the weight 200 w and the target facet vector 204 being multiplied by (1-w). A linear interpolation may be expressed as:
Updated Facet Vector F = (1 - w_F) * Target Facet Vector F + w_F * Update Vector

[0239] A linear interpolation is performed on the resulting weighted vectors to create an updated facet vector 212. In some embodiments, this interpolation maintains semantic meaning of the facet vector while reducing noise that could result from unweighted or overly aggressive updates. In some embodiments, interpolation is computationally inexpensive relative to generating embeddings, enabling efficient continuous updates without requiring storage of a growing set of per-chunk vectors for each facet.
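The weighted linear interpolation may be illustrated with the following minimal Python sketch, which assumes that vectors are plain lists of floats of equal length and that the weight w lies between 0 and 1. The function name and the validation checks are assumptions added for the example.

```python
from typing import List


def interpolate(target_facet_vector: List[float],
                update_vector: List[float],
                w: float) -> List[float]:
    """Return (1 - w) * target + w * update, computed element by element."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight w must lie in [0, 1]")
    if len(target_facet_vector) != len(update_vector):
        raise ValueError("vectors must have the same dimensionality")
    return [(1.0 - w) * t + w * u
            for t, u in zip(target_facet_vector, update_vector)]


updated_facet_vector = interpolate([1.0, 0.0], [0.0, 1.0], w=0.25)
```

A weight of 0 leaves the target facet vector unchanged, and a weight of 1 replaces it with the update vector, so the relevance-derived weight controls how far the facet moves toward the new data.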

[0240] Next, the result of the linear interpolation, the updated facet vector 212, is stored in the vector database 104. See circle "11" of FIG. 5. In some embodiments, the target facet vector 204 may be overwritten by the updated facet vector 212, accomplishing an update without using up any additional storage space. In other embodiments, the target facet vector 204 may be replaced by the updated facet vector 212, but the previous facet vector may be retained to preserve a record of how the vector has evolved over time.
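The two storage behaviors described above, overwriting in place versus retaining prior vectors, can be contrasted with a short Python sketch. `FacetVectorStore` is a hypothetical in-memory stand-in for the vector database 104, and its interface is an assumption made for illustration.

```python
from typing import Dict, List


class FacetVectorStore:
    """Hypothetical in-memory stand-in for the vector database."""

    def __init__(self, keep_history: bool = False):
        self.keep_history = keep_history
        self.current: Dict[str, List[float]] = {}
        self.history: Dict[str, List[List[float]]] = {}

    def store(self, facet_id: str, updated_vector: List[float]) -> None:
        previous = self.current.get(facet_id)
        if self.keep_history and previous is not None:
            # Retain the superseded vector so the facet's evolution can be traced.
            self.history.setdefault(facet_id, []).append(previous)
        # In either mode the updated vector becomes the current facet vector.
        self.current[facet_id] = list(updated_vector)


store = FacetVectorStore(keep_history=True)
store.store("remote_work", [1.0, 0.0])
store.store("remote_work", [0.75, 0.25])
```

With `keep_history` left at its default of `False`, each update simply overwrites the stored vector and no additional storage is consumed.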

[0241] Finally, after the updated facet vector 212 has been stored, the vector database 104 may update the vector index 126 and / or the facet index 128 to reflect the update. See circle "12" of FIG. 5.

[0242] According to various embodiments, auto weighting is well adapted to integrate with an existing continuous vectorization architecture for a number of reasons. First, auto weighting may replace or supplement existing weighting functions. For example, auto weighting may be used as a default weighting function 120, or may be combined with other weighting approaches, such as multiplication by a time-decay factor to incorporate recency weighting as described above.

[0243] Second, auto weighting may operate during ingestion. In some embodiments, preprocessing, term statistics updates, and BM25 calculations occur when new data 206 is received and prior to the interpolation step that produces the updated facet vector 212. Third, auto weighting may maintain core interpolation mechanics. In some embodiments, the weighted linear interpolation described herein remains unchanged, and only the method for determining the weight 200 w is enhanced by content-aware relevance scoring.

[0244] Fourth, auto weighting may require no changes to retrieval. In some embodiments, query vectorization and vector similarity search remain identical. However, in such embodiments, facet vectors more accurately represent their intended semantic concepts because updates are weighted by relevance at ingestion time.

[0245] According to various embodiments, auto weighting provides one or more advantages. For example, auto weighting may provide content-aware relevance by automatically determining weights based on semantic content of new data relative to existing facets. Auto weighting may reduce human configuration by avoiding domain-specific hand-tuning of weighting parameters for each use case. Auto weighting may leverage proven information retrieval techniques, such as BM25, that have been extensively validated in academic and production contexts. Auto weighting may improve facet vector quality, leading to improved retrieval accuracy. Auto weighting may provide hybrid search benefits by combining lexical / statistical matching with vector similarity (a semantic technique) without incurring additional query-time computational overhead, because lexical / statistical assessment occurs during ingestion. Auto weighting may reduce or eliminate the need for post-retrieval re-ranking by front-loading relevance assessment into the ingestion weighting step.

[0246] Retrieval-Augmented Generation (RAG) systems, including those that employ continuous vectorization for storage efficiency and semantic maintenance, face a fundamental scaling challenge when ingesting and retrieving from large documents such as handbooks, legal contracts, technical manuals, regulatory filings, research papers, and similar lengthy texts. The prevailing “micro-chunking” approach partitions documents into small, fixed-size units for embedding and storage, commonly on the order of a few sentences to a few paragraphs (for example, 512 to 2048 tokens). While tractable, this practice fragments concepts across arbitrary boundaries, may cut across natural semantic boundaries such that chunks begin mid-sentence or end mid-thought, impairs the integrity of internal cross-references, and often returns insufficient context for queries that require synthesis beyond a few paragraphs. In production settings this increases retrieval noise and forces compensating heuristics that add latency and storage overhead without resolving the underlying fragmentation problem.

[0247] Micro-chunking further exhibits “cross-reference blindness” in large documents that contain internal citations such as “as discussed in Section 3.2.” When the referenced material is stored in a different chunk, the cross-reference becomes difficult to follow during retrieval, thereby degrading interpretability and answer completeness. Additionally, for complex queries that require coordinated reasoning across multiple sections, retrieving a small number of micro-chunks commonly provides too little surrounding context for an LLM to respond accurately and comprehensively.

[0248] At the opposite extreme, systems may attempt whole-document retrieval to ensure the necessary context is present. This alternative suffers from high token costs, increased latency from very large context windows, context window overflow for long documents, and signal dilution where relevant passages are surrounded by extensive irrelevant material. For example, sending an entire multi-hundred-page document as context can consume hundreds of thousands of tokens and materially increase inference time, which is unsuitable for interactive applications operating under strict time and cost budgets. Many large documents also exceed available context windows, making whole-document retrieval infeasible in the first instance. Further, when a correct answer is located in a small portion of a large document, surrounding that portion with large amounts of irrelevant content can degrade answer quality and increase the risk of hallucination by diluting the relevant signal. These effects also represent inefficient resource utilization, because the majority of transmitted tokens are unrelated to the user’s query.

[0249] Between these extremes there is a recognized need for document representations that preserve semantic coherence while enabling retrieval at an appropriate level of granularity. A practical solution should maintain a navigable structure that allows systems to traverse the document representation and descend to finer detail only when necessary, and it should permit configuration of granularity so that different use cases can target different levels of detail. Storage techniques should also align with this structure to avoid redundant vectors while retaining provenance and allowing efficient traversal during retrieval.

[0250] Existing hierarchical attempts in the art exhibit notable limitations. Summary-based hierarchies introduce generated content that can omit critical detail or diverge from the source. Graph-based approaches, including graph-centric RAG techniques, require entity extraction and relationship identification that are computationally expensive and can distort the source’s natural structure. Sliding-window chunking reduces hard boundaries but inflates storage through overlapping redundancy. Moreover, these approaches do not integrate cleanly with continuous vectorization or leverage a faceting architecture in which nodes can behave as addressable facets for interpolation and retrieval, including the faceting architecture of Green Vectors. The absence of a hierarchy that is both semantically faithful and operationally compatible with continuous interpolation impedes scalable, accurate retrieval from large documents.

[0251] Contemplated herein is a system and method for partitioning large documents into a navigable hierarchy of semantically coherent units and treating each unit as an addressable facet 116 within a continuous vectorization architecture. The method, hereinafter "Megachunking", reconciles the competing pressures of micro-chunking and whole-document retrieval by preserving context at appropriate scales while maintaining storage efficiency and retrieval tractability.

[0252] Advantageous over micro-chunking, the contemplated hierarchy respects semantic boundaries, maintains cross-references within coherent parents, and delivers sufficient context for synthesis-oriented queries without scattering relevant material across arbitrarily cut fragments. Advantageous over whole-document retrieval, it avoids token cost explosions, reduces latency, and prevents context-window overflow by admitting only the portion of the tree commensurate with the query, thereby reducing signal dilution and avoiding wasteful transmission of irrelevant tokens. In contrast to summary-based or graph-centric hierarchies, the method operates directly on the document’s native structure without introducing generated intermediates or expensive entity extraction, thereby preserving detail while remaining computationally practical.

[0253] Seamless integration with the continuous vectorization systems discussed above follows from treating nodes as facets: the same interpolation mechanics apply at each level, allowing updates to be incorporated incrementally as documents evolve. In this manner, the system furnishes a semantically faithful, navigable, and configurable representation that resolves fragmentation, cost, and integration limitations while remaining compatible with facet-based interpolation and retrieval.

[0254] Megachunking provides a hierarchical document representation strategy that maintains semantic coherence across multiple levels of granularity. In some embodiments, megachunking enables efficient context retrieval for large documents by representing the document as a hierarchy of chunks at different sizes, while preserving the storage and update advantages of a continuous vectorization architecture.

[0255] In the context of the present description and the claims that follow, “megachunking” is a method for processing large documents that creates a hierarchical, tree-structured representation in which each level of the hierarchy represents the document at a different granularity. In some embodiments, each node in the tree can be treated as a distinct facet within a vector database system. The tree structure enables navigation during retrieval to identify an appropriate context level for a query, and storage can be optimized by selectively storing vectors for only a subset of levels.

[0256] In some embodiments, megachunking may be applied within a continuous vectorization system 100. In other embodiments, megachunking may be applied outside continuous vectorization and may be used with conventional vector database storage methods, because the underlying problem addressed by megachunking, namely segmentation of large documents into coherent retrievable units, arises broadly in retrieval-augmented generation (RAG) systems.

[0257] FIG. 6 is a process view of a non-limiting example of a megachunking process for constructing a hierarchical tree representation 606 of a document 600 and defining facets 116 corresponding to nodes 610 of the tree 606. When ingesting new data 206, received as raw data 208 comprising a document 600 having natural boundaries 604, megachunking is selectively applied based on the size 622 of the document 600. Size 622 may be measured in tokens, characters, pages, bytes, or another unit suitable to the implementation.

[0258] According to various embodiments, a configurable size threshold 624 determines whether a document 600 is processed using standard chunking or megachunking. Thus, the first step is determining whether the size 622 of a new data 206 (i.e., a document 600) exceeds a threshold 624. See circle "1". Small documents 600 do not benefit from hierarchical representation since the overhead of building a tree structure is not justified when the entire document 600 fits comfortably in context. Megachunking provides value specifically for large documents 600 where the micro-chunking vs. whole-document trade-off is most acute.

[0259] In some embodiments, the threshold 624 may be set globally for the system 100, electing a specific size 622 such as 50,000 tokens, 100,000 characters, or 50 pages, as non-limiting examples of a threshold 624. The threshold 624 may be chosen to reflect practical context-window limits and cost constraints associated with downstream language model usage, according to various embodiments.

[0260] In some embodiments, threshold 624 selection varies based on document type (e.g., policy manuals, contracts, technical specifications, etc.) and / or source (e.g., a particular repository, customer, domain, ingestion pipeline, etc.), because different sources may have different typical document lengths and retrieval needs.

[0261] In some embodiments, the system 100 may dynamically adjust the threshold 624 based on available computing resources (e.g., CPU, GPU, memory, queue depth, etc.) and / or use case requirements (e.g., desired recall, latency targets, budget constraints, etc.). In such embodiments, the system 100 may choose standard chunking for moderately large documents 600 during resource-constrained periods, and may choose megachunking when resources permit or when retrieval granularity demands it.
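As a non-limiting illustration, the threshold logic described in the preceding paragraphs (a global default, per-type overrides, and adjustment under resource constraints) might be sketched in Python as follows. The function name, the parameter names, and the choice to double the threshold under constrained resources are assumptions; only the 50,000-token figure is taken from the examples above.

```python
from typing import Dict, Optional


def choose_chunking(size_tokens: int,
                    doc_type: Optional[str] = None,
                    resources_constrained: bool = False,
                    default_threshold: int = 50_000,
                    per_type_thresholds: Optional[Dict[str, int]] = None,
                    constrained_factor: float = 2.0) -> str:
    """Return "megachunking" when the document size exceeds the active threshold."""
    threshold = (per_type_thresholds or {}).get(doc_type, default_threshold)
    if resources_constrained:
        # Raise the bar so moderately large documents fall back to standard chunking.
        threshold *= constrained_factor
    return "megachunking" if size_tokens > threshold else "standard"


strategy = choose_chunking(60_000)
strategy_when_busy = choose_chunking(60_000, resources_constrained=True)
```

In this sketch a 60,000-token document is megachunked under normal conditions but handled with standard chunking while resources are constrained.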

[0262] Once it has been determined that the size 622 of the document 600 exceeds the threshold 624, the megachunking process creates a hierarchical tree structure 606, starting with defining a root node 608 having the entire document 600 as a chunk 602. See circle "2".

[0263] The root node 608 represents the entire document content at the coarsest granularity. In some embodiments, the document 600 may be vectorized as a single unit to create a root-level vector representing overall semantic content of the document 600. This root-level representation enables high-level matching to determine whether the document 600 is relevant to a query at all.

[0264] As shown, the root node 608 comprises metadata 612 in addition to the single chunk 602. Metadata 612 of the root node 608 may include, but is not limited to, document title, source, date, document identifier, and other attributes suitable for provenance and retrieval.

[0265] Once the root node 608 has been defined, the megachunking process begins to organize the root node 608 into a hierarchical tree of nodes 606. See circle "3". According to various embodiments, the system 100 organizes the root node 608 into a hierarchical tree of nodes 606 by recursively splitting the chunk 602 of each node 610 into smaller chunks 602 along natural boundaries 604, and defining a new node 610 for each resulting chunk 602. In the context of the present description and the claims that follow, a node 610 is a unit of the tree 606 that comprises a chunk 602 of content and metadata 612 describing the chunk 602 and its relationship to other nodes 610 in the hierarchy.

[0266] According to various embodiments, the document 600 may be recursively split into smaller chunks using the following algorithm:

FUNCTION MegachunkSplit(node, target_chunk_size, current_depth, max_depth):
    IF size(node) <= target_chunk_size OR current_depth >= max_depth:
        RETURN node  // Base case: node is small enough or max depth reached
    children = SemanticSplit(node, target_chunk_size)
    FOR EACH child IN children:
        child.parent = node
        node.children.append(child)
        MegachunkSplit(child, target_chunk_size, current_depth + 1, max_depth)
    RETURN node
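By way of non-limiting illustration, the recursive procedure above may be sketched in Python as follows. The `Node` class, the `semantic_split` stand-in (which splits on paragraph breaks and otherwise halves the text at a word boundary), and the use of character counts as the size measure are assumptions made for this sketch only and are not part of the described system.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical tree node: a chunk of text plus parent/child links."""
    text: str
    depth: int = 0
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)


def semantic_split(text: str) -> List[str]:
    """Stand-in splitter (assumption): prefer paragraph breaks, else halve at a word boundary."""
    parts = [p for p in text.split("\n\n") if p.strip()]
    if len(parts) > 1:
        return parts
    words = text.split()
    if len(words) < 2:
        return [text]  # cannot be split any further
    mid = len(words) // 2
    return [" ".join(words[:mid]), " ".join(words[mid:])]


def megachunk_split(node: Node, target_chunk_size: int, max_depth: int) -> Node:
    """Recursively split a node until its chunk is small enough or max depth is reached."""
    if len(node.text) <= target_chunk_size or node.depth >= max_depth:
        return node  # base case: treat as a leaf
    pieces = semantic_split(node.text)
    if len(pieces) < 2:
        return node  # nothing left to split; treat as a leaf
    for piece in pieces:
        child = Node(text=piece, depth=node.depth + 1, parent=node)
        node.children.append(child)
        megachunk_split(child, target_chunk_size, max_depth)
    return node


def leaves(node: Node) -> List[Node]:
    """Collect leaf nodes in document order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


doc = "Alpha beta gamma delta.\n\nEpsilon zeta eta theta iota kappa."
root = megachunk_split(Node(doc), target_chunk_size=25, max_depth=3)
```

In this toy run the first paragraph already fits the target size and becomes a leaf at depth 1, while the second paragraph is split once more into two leaves at depth 2.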

[0267] In some embodiments, recursive splitting along natural boundaries 604 into smaller chunks 602 continues until a base condition is met, such as when the size of a node’s chunk is at most equal to a target chunk size (i.e., the biggest size of leaf-level chunks that is acceptable), and / or when a maximum depth is reached. In some embodiments, the target chunk size is configurable based on at least one of context window limitations, desired retrieval granularity, and cost or latency trade-offs. If the base condition is met, the node is treated as a leaf node, according to various embodiments.

[0268] In some embodiments, the semantic split function selects split points according to a preferred order that respects natural boundaries 604. In the context of the present description and the claims that follow, a natural boundary 604 is a conceptual partition within a chunk 602 that serves an organizational purpose. Examples of natural boundaries 604 include, but are not limited to, section boundaries (e.g., chapters, major headings like H1, H2, etc.), subsection boundaries (e.g., subheadings like H3, H4, H5, etc.), paragraph boundaries (e.g., natural paragraph breaks, etc.), sentence boundaries (e.g., complete sentences, etc.), and, as a last resort, token boundaries (e.g., word boundaries, token boundaries, etc.).

[0269] In some embodiments, natural boundaries 604 are detected using anticipated sizes in combination with embedded metadata. In other embodiments, the natural boundaries 604 may be found using a trained machine learning model, natural language processing, or other techniques known in the art.

[0270] In some embodiments, the system 100 performs split balancing. For example, when a preferred boundary type would produce severely unbalanced chunks 602 within a level 620, the system 100 may select a less-preferred boundary type to produce more balanced children while still preserving semantic coherence where practical. Put differently, the recursive splitting of the chunk 602 of each node 610 along natural boundaries 604 may comprise selecting the natural boundaries 604 to split along based, at least in part, on the relative sizes of the resulting chunks 602. For example, in some embodiments, the splitting algorithm aims to create roughly equal-sized children while respecting semantic boundaries. If a natural boundary 604 would create severely unbalanced children, the algorithm may use a less-preferred boundary type to achieve better balance.
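By way of non-limiting illustration, split balancing may be sketched in Python as follows. The boundary regular expressions, the imbalance measure (largest chunk size divided by smallest), and the `max_imbalance` limit are assumptions chosen for this sketch; the description does not prescribe a particular balance metric.

```python
import re
from typing import List, Sequence, Tuple

# Boundary types in preferred order; the patterns are illustrative assumptions.
BOUNDARIES: Sequence[Tuple[str, str]] = (
    ("section", r"\n(?=# )"),
    ("paragraph", r"\n\n"),
    ("sentence", r"(?<=[.!?])\s+"),
)


def imbalance(chunks: List[str]) -> float:
    """Ratio of the largest chunk size to the smallest (1.0 is perfectly balanced)."""
    sizes = [len(c) for c in chunks]
    return max(sizes) / min(sizes)


def balanced_split(text: str, max_imbalance: float = 4.0) -> Tuple[str, List[str]]:
    """Return (boundary type, chunks) for the most-preferred boundary that is balanced enough."""
    best = None
    for name, pattern in BOUNDARIES:
        chunks = [c for c in re.split(pattern, text) if c.strip()]
        if len(chunks) < 2:
            continue  # this boundary type does not occur in the text
        if imbalance(chunks) <= max_imbalance:
            return name, chunks
        if best is None or imbalance(chunks) < imbalance(best[1]):
            best = (name, chunks)
    # No boundary met the limit: fall back to the least unbalanced split, or no split.
    return best if best is not None else ("none", [text])


text = "# A\nx\n# B\n" + "p" * 40 + "\n\n" + "q" * 40
boundary, chunks = balanced_split(text)
```

Here the section boundary would yield children of 5 and 86 characters, so the sketch falls through to the paragraph boundary, which yields children of 50 and 40 characters.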

[0271] According to various embodiments, the tree structure maintains explicit relationships between nodes 610. As shown, each node 610 in the hierarchical tree of nodes 606 comprises metadata 612 specifying at least one of a parent node relationship 614 pointing to a node 610 with a larger chunk 602 that comprises the current chunk 602 and a child node relationship 616 pointing to a node 610 with a smaller chunk 602 that is part of the current chunk 602. In some embodiments, sibling node relationships 618 are also stored, such as pointers to previous and next siblings at the same depth, enabling ordered traversal of content at a given level 620. In some embodiments, these parent / child / sibling relationships may be bidirectional, such that each child references its parent and each parent references its children.

[0272] In some embodiments, each node’s metadata 612 includes a depth or level 620 within the hierarchical tree 606, a position among siblings, a document reference identifier linking back to the source document, and a text span indicating the portion of the original document represented by the node (e.g., character offsets or token offsets). In some embodiments, this metadata 612 enables provenance preservation such that any node 610 can be traced back to its location in the original document 600 and the chain of parents provides hierarchical context for citation and explanation.

[0273] In some embodiments, for each node 610 in the tree 606, the system 100 defines a new facet 626 within the vector database 104. See circle "4". The new facet 626 comprises the chunk 602 of the node 610 and at least one attribute 124. In some embodiments, the attributes 124 include at least one of a parent facet relationship 628 mirroring the parent node relationship 614 of the node 610, a child facet relationship 630 mirroring the child node relationship 616 of the node 610, and optionally a sibling facet relationship 632 mirroring sibling node relationship 618. In this manner, the facet graph mirrors the node tree. In some embodiments, node metadata 612 may include the level 620 within the hierarchy, and the facet attributes 124 may store the level of the associated node, enabling level-based retrieval strategies and storage policies.

[0274] In some embodiments, the metadata 612 of each node 610 comprises a node identifier that maps to a facet identifier, while the node’s content maps to a facet vector representation and the node’s metadata 612 maps to facet attributes stored in the vector database 104 and / or in associated indices. Examples of node 610 metadata 612 include, but are not limited to, depth, position, document reference, and the like. In some embodiments, facet naming conventions encode document identity and level information to facilitate debugging, navigation, and administration.

[0275] According to various embodiments, the system 100 vectorizes the chunk 602 of at least a subset of the new facets 626 with an embedding model 118. See circle "5". In some embodiments, the subset comprises an update facet 634. In the context of the present description and the claims that follow, an “update facet” 634 refers to a facet 116 whose vector 114 is generated or updated for storage and retrieval operations, including by weighted linear interpolation with previously stored vectors 114, when applicable.

[0276] In some embodiments, all new facets 626 may be vectorized, providing maximum retrieval flexibility at the cost of increased storage and vectorization compute. In other embodiments, storage may be optimized by vectorizing only a subset of levels 620, such as the root level (the top M levels with M=1), the top M levels, the bottom N levels, or the leaf level (the bottom N levels with N=1), while omitting intermediate levels unless needed. In one embodiment, the root and leaf levels are stored, enabling top-down and bottom-up retrieval with minimal intermediate storage. In some embodiments where intermediate levels are not vectorized, they may retain text content in metadata 612 or a separate document store, and vectors 114 may be generated on-demand during retrieval if the traversal process requires an intermediate level.

[0277] Only vectorizing some of the levels results in beneficial storage savings. As a specific example, in one embodiment a document 600 creates a tree 606 having 1 root node, 4 level-2 nodes, 16 level-3 nodes, and 64 level-4 (leaf) nodes, for a total of 85 nodes. If only root and leaves are stored, this is reduced to 65 vectors, a 24% reduction. If only levels 1, 2, and 4 are stored the total is 69 vectors, a 19% reduction. These savings compound across many large documents 600.
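By way of non-limiting illustration, the arithmetic of the preceding example may be reproduced with a short Python sketch. The helper name `stored_vectors` and the 1-based level numbering (root as level 1) are assumptions made for this sketch.

```python
from typing import List


def stored_vectors(nodes_per_level: List[int], stored_levels: List[int]) -> int:
    """Count vectors when only the listed levels (1-based, root = 1) are vectorized."""
    return sum(nodes_per_level[level - 1] for level in stored_levels)


nodes_per_level = [1, 4, 16, 64]   # root, level 2, level 3, leaves
total = sum(nodes_per_level)       # 85 nodes in the full tree

for levels in ([1, 4], [1, 2, 4]):
    kept = stored_vectors(nodes_per_level, levels)
    saving = 100 * (total - kept) / total
    print(f"levels {levels}: {kept} vectors, {saving:.0f}% fewer")
# levels [1, 4]: 65 vectors, 24% fewer
# levels [1, 2, 4]: 69 vectors, 19% fewer
```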

[0278] In some embodiments, the hierarchical facet structure enables tree-navigated retrieval strategies that combine vector similarity search with navigational traversal along parent-child facet relationships. In some embodiments, for broad queries where intent spans a wide topic, retrieval proceeds top-down. The system 100 may first compare a query vector against root-level facet vectors to identify relevant documents. For each matched document, the system 100 may compare the query vector against child facets at level 1 and descend into the best-matching child facets. The system 100 may continue descending until reaching a level that provides an appropriate context size.

[0279] According to various embodiments, the criteria for an appropriate context size may depend on the specific embedding model being used and on its embedding segment size. The appropriate context size is the embedding segment size multiplied by a scaling factor. In some embodiments, the scaling factor may range from 10 to 100. In other embodiments, the scaling factor may be larger than 100.

[0280] As a specific example, in one embodiment the query “What are the company’s policies on time off?” is made. Taking the top-down approach, the search begins at the root level, giving a match: Employee Handbook (high similarity). Continuing on to the next level, the highest similarity among the siblings comes from a Benefits section (of the Employee Handbook). Moving on to the next level down, the highest similarity is found with a PTO subsection (of the Benefits section of the Employee Handbook). This retrieval query returns the PTO subsection content, roughly 12,000 tokens of focused context.

[0281] In some embodiments, for narrow queries that match specific details, retrieval proceeds bottom-up. The system may compare the query vector against leaf-level facets to identify highly specific matching chunks. If the retrieved leaf-level chunk is insufficient (e.g., the relevance score 504 of the parent is better than, or within a threshold from, the leaf relevance score 504), the system may retrieve the leaf’s parent facet to expand context, and may continue expanding upward until sufficient context is obtained or the root is reached.
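By way of non-limiting illustration, the upward expansion may be sketched in Python as follows. The `Facet` record, the precomputed `score` field, the `margin` value, and the choice to compare each ancestor against the original leaf score are assumptions made for this sketch; the scores shown are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Facet:
    """Hypothetical facet record: a name, a precomputed query relevance score, and a parent link."""
    name: str
    score: float
    parent: Optional["Facet"] = None


def expand_upward(leaf: Facet, margin: float = 0.05) -> Facet:
    """Climb from a matched leaf while the next ancestor scores better than,
    or within `margin` of, the leaf's own score; stop at the root otherwise."""
    current = leaf
    while current.parent is not None and current.parent.score >= leaf.score - margin:
        current = current.parent
    return current


root = Facet("Employee Handbook", 0.40)
leave_types = Facet("Leave Types", 0.78, parent=root)
paragraph = Facet("Bereavement Leave", 0.80, parent=leave_types)
selected = expand_upward(paragraph)
print(selected.name)  # Leave Types
```

With these hypothetical scores, the parent section is within the margin of the leaf and is therefore retrieved, while the root falls well below it and is not.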

[0282] As a specific example, in one embodiment, the query “What is the bereavement leave policy for loss of a grandparent?” is made. Taking the bottom-up approach, the search begins at the leaf level, which matches a specific paragraph about bereavement leave. If the paragraph does not specify grandparents, the system expands upward and retrieves a parent section on leave types. The combined context from the relevant leaf and its parent is now sufficient.

[0283] In some embodiments, a hybrid approach uses vector similarity at each level to guide navigation:

FUNCTION TreeTraversalRetrieval(query_vector, root_facet, similarity_threshold):
    current_node = root_facet
    best_match = root_facet
    best_score = similarity(query_vector, root_facet.vector)
    WHILE current_node HAS children:
        child_scores = []
        FOR EACH child IN current_node.children:
            IF child.vector EXISTS:
                score = similarity(query_vector, child.vector)
                child_scores.append((child, score))
        IF max(child_scores).score > best_score:
            best_match = max(child_scores).child
            best_score = max(child_scores).score
            current_node = best_match
        ELSE:
            BREAK  // Parent was better; stop descending
    RETURN best_match
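By way of non-limiting illustration, the traversal above may be sketched in Python as follows. The `FacetNode` class, the use of cosine similarity, the requirement that the root has a stored vector, and the early exit when no child at a level has a vector are assumptions made for this sketch; the toy vectors are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class FacetNode:
    """Hypothetical facet node; `vector` is None when the level was not vectorized."""
    name: str
    vector: Optional[Sequence[float]] = None
    children: List["FacetNode"] = field(default_factory=list)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, returning 0.0 for a zero-length vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def tree_traversal_retrieval(query: Sequence[float], root: FacetNode) -> FacetNode:
    """Descend only while the best child matches the query better than the current best."""
    best, best_score = root, cosine(query, root.vector)
    current = root
    while current.children:
        scored = [(cosine(query, c.vector), c) for c in current.children if c.vector is not None]
        if not scored:
            break  # no vectorized children at this level (assumption: stop here)
        score, child = max(scored, key=lambda pair: pair[0])
        if score <= best_score:
            break  # parent was better; stop descending
        best, best_score, current = child, score, child
    return best


pto = FacetNode("PTO", [1.0, 0.0])
insurance = FacetNode("Health Insurance", [1.0, 2.0])
benefits = FacetNode("Benefits", [2.0, 1.0], [pto, insurance])
policies = FacetNode("Policies", [0.0, 1.0])
handbook = FacetNode("Employee Handbook", [1.0, 1.0], [benefits, policies])

print(tree_traversal_retrieval([1.0, 0.0], handbook).name)  # PTO
print(tree_traversal_retrieval([1.0, 1.0], handbook).name)  # Employee Handbook
```

The second query matches the root more closely than any child, so the sketch stops at the root rather than descending.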

[0284] This algorithm descends the tree only as long as children provide better matches than parents, automatically finding the optimal granularity level.

[0285] In some embodiments, the system performs similarity-guided traversal. At each level, the system computes similarity between the query vector and available child facet vectors and descends only while children provide a better match than the current node. In this manner, the system automatically finds a granularity level that best matches the query without descending to overly specific or overly broad levels unnecessarily.

[0286] In some embodiments, for complex queries spanning multiple topics, the system identifies multiple high-scoring branches in the tree and retrieves content from each branch at an appropriate level. The retrieved content from multiple branches may then be combined to provide comprehensive context for downstream processing.

[0287] As a specific example, in one embodiment the query “How does the remote work policy affect health insurance eligibility?” is made. Two branches are identified: Policies and Benefits. Each branch is traversed, retrieving optimal-level content. In the first branch, the search progressed from Policies to Remote Work, finally landing on Eligibility Requirements. In the second branch, the search progressed from Benefits to Health Insurance, ending on Enrollment Criteria. These two results are then combined, providing context from both relevant sections.

[0288] According to various embodiments, megachunking integrates well with an existing continuous vectorization architecture. First, during ingestion, the system receives a new document and evaluates document size. If the document is below the threshold, the system performs standard chunking. If the document is above the threshold, the system performs megachunking to build a hierarchical tree, create facets for configured levels, and generate vectors for those facets. If a facet already exists, the system may perform weighted linear interpolation to update the facet vector; otherwise, the system may initialize a new facet vector.
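By way of non-limiting illustration, the ingestion decision and the update-or-initialize step may be sketched in Python as follows. The in-memory dictionary standing in for the vector database, the function names, and the fixed weight used in the example are assumptions made for this sketch.

```python
from typing import Dict, List


def choose_strategy(size: int, threshold: int) -> str:
    """Pick megachunking only when the document size exceeds the threshold."""
    return "megachunking" if size > threshold else "standard chunking"


def interpolate(target: List[float], update: List[float], w: float) -> List[float]:
    """Weighted linear interpolation: (1 - w) * target + w * update."""
    return [(1.0 - w) * t + w * u for t, u in zip(target, update)]


def upsert_facet(store: Dict[str, List[float]], facet_id: str,
                 update: List[float], w: float) -> None:
    """Interpolate into an existing facet vector, or initialize a new one."""
    if facet_id in store:
        store[facet_id] = interpolate(store[facet_id], update, w)
    else:
        store[facet_id] = list(update)


store: Dict[str, List[float]] = {}           # stand-in for the vector database
upsert_facet(store, "doc-1/root", [1.0, 0.0], w=0.25)   # facet is new: initialized
upsert_facet(store, "doc-1/root", [0.0, 1.0], w=0.25)   # facet exists: interpolated
print(store["doc-1/root"])  # [0.75, 0.25]
```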

[0289] Second, megachunking is compatible with auto weighting. In some embodiments, when megachunk nodes are interpolated into their corresponding facets, weights may be determined using auto weighting as described above, including preprocessing, term statistics updates, and BM25-based relevance scoring.

[0290] Third, when a large document is updated, the system may perform structural comparison between an existing tree and a new document structure, identify changed sections, and update only affected tree nodes. In such embodiments, unchanged portions of the tree may remain intact, avoiding reprocessing of the entire document when only a portion has changed.

[0291] In some embodiments, auto weighting and megachunking are implemented together within a continuous vectorization system 100 to provide synergistic improvements in ingestion quality, hierarchical representation, retrieval precision, and storage efficiency. In such embodiments, megachunking constructs a hierarchical tree representation of a document in which nodes at different levels correspond to different granularities of content, and auto weighting computes content-aware weights that control how newly received content influences facet vectors through weighted interpolation. When combined, the hierarchical organization provided by megachunking may be leveraged to apply relevance-based weighting at multiple levels of granularity, and to incorporate cross-level signals into weight determination.

[0292] In the context of the present description and the claims that follow, “integration” refers to a coordinated ingestion and update workflow in which (i) nodes of a megachunk hierarchy are mapped to corresponding facets within a vector database, and (ii) weights used to update facet vectors for those facets are computed using an auto weighting process based on term profiles, corpus statistics, and relevance scores (for example, BM25-based relevance scores).

[0293] In some embodiments, auto weighting is applied at each level of a megachunk hierarchy such that weights are computed separately for nodes at different granularities, and each node’s corresponding facet vector is updated using a weight tailored to that node’s lexical and statistical alignment with one or more facets. In such embodiments, the same underlying relevance function (e.g., a BM25-based relevance function) may be used across levels, while the candidate facets evaluated for relevance, and the semantic scope of the node content, may vary by level.

[0294] In some embodiments, hierarchical relevance weighting includes applying auto weighting at a root level, a section level, and a leaf level. At a root level, the system may treat an entire document as a root node and may define a corresponding document-level facet. In such embodiments, the system preprocesses the document text to generate a term profile and updates corpus statistics in a term statistics database. The system then calculates one or more relevance scores that quantify how strongly the document’s overall theme aligns with existing document-level facets. The resulting relevance scores may be transformed by a weighting function to produce a weight used to update a document-level facet vector through weighted interpolation. This may cause the document-level facet vector to track the evolving themes of ingested documents while avoiding disproportionate influence from documents that are lexically weakly aligned.

[0295] At a section level, the system may split the document into section chunks (for example, by major headings) and may define corresponding section-level facets. In such embodiments, the system generates section-level term profiles and computes relevance scores for each section chunk relative to section-level facets. Weights derived from those relevance scores may be used to update section-level facet vectors such that each section vector more accurately reflects the topical focus of that section and is less influenced by off-topic or generic language.

[0296] At a leaf level, the system may further split the document into fine-grained chunks, such as paragraphs or smaller semantically coherent units, and may define corresponding leaf-level facets. In such embodiments, auto weighting computes fine-grained relevance scores for each leaf chunk relative to leaf-level facets, and weights derived therefrom control interpolation updates at the paragraph level. This may improve the fidelity of detailed facets and may reduce the accumulation of noise that can arise when generic content is permitted to update highly specific facet vectors with excessive influence.

[0297] In some embodiments, the hierarchical structure created by megachunking enables cross-level relevance assessment, in which the relevance of a node’s content is evaluated against facets at multiple levels of the hierarchy. In the context of the present description and the claims that follow, “cross-level relevance assessment” refers to computing relevance scores for a given chunk relative to at least one facet that is not at the same hierarchical level as the chunk.

[0298] In some embodiments, a leaf chunk’s relevance is evaluated not only against leaf-level facets, but also against one or more parent-level facets. For example, the system may compute relevance scores for a leaf chunk relative to its parent section facet and relative to its root document facet, in addition to computing relevance scores relative to leaf-level facets. In such embodiments, multi-level relevance scores provide a richer signal for weight determination than a single-level assessment, because the parent facets capture broader thematic context that may disambiguate the leaf chunk’s content and reduce spurious alignment caused by isolated token overlap.

[0299] In some embodiments, the weighting function incorporates cross-level signals by combining, gating, or normalizing relevance scores computed at different levels. For example, a weight for updating a leaf-level facet vector may be adjusted based on whether the leaf chunk is also strongly relevant to its parent section facet, thereby reducing the influence of a leaf chunk that appears locally similar to a leaf facet but is thematically inconsistent with the containing section or document. In other embodiments, cross-level relevance assessment may be used to select which facets are candidates for update at all, such as restricting updates to leaf facets whose ancestors exceed a minimum relevance threshold.
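By way of non-limiting illustration, one way of gating and combining cross-level signals may be sketched in Python as follows. The function name, the assumption that relevance scores have been normalized to the range 0 to 1, the multiplicative combination, and the `min_parent_rel` threshold are all assumptions made for this sketch; other combinations are equally consistent with the description.

```python
def cross_level_weight(leaf_rel: float, parent_rel: float,
                       min_parent_rel: float = 0.2) -> float:
    """Combine leaf and parent relevance (both assumed in [0, 1]) into an interpolation weight."""
    if parent_rel < min_parent_rel:
        return 0.0  # gate: skip the update when the containing section is off-topic
    return leaf_rel * parent_rel  # combine: damp chunks that are only locally similar


print(cross_level_weight(0.8, 0.1))  # 0.0 (ancestor below the minimum relevance)
print(cross_level_weight(0.8, 0.5))  # 0.4 (leaf weight reduced by weaker parent alignment)
```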

[0300] In some embodiments, auto weighting and megachunking together provide combined storage efficiency beyond what either technique achieves alone. In the context of the present description and the claims that follow, “storage efficiency” refers to reducing the number of stored vectors, reducing redundant vector storage, reducing the need to store per-chunk embeddings, and / or improving the quality of stored vectors such that fewer vectors are required to achieve a target retrieval performance.

[0301] In some embodiments, a base continuous vectorization approach reduces stored vectors through semantic faceting, in which content is aggregated into facets represented by one or more facet vectors rather than storing a separate vector for every ingested chunk. In such embodiments, auto weighting improves facet vector quality by computing content-aware weights that reduce noisy updates and increase the influence of highly relevant content, thereby improving the signal-to-noise ratio of facet vectors. Improved facet vector quality may reduce the need to store additional compensating vectors or redundant facets that would otherwise be created to address drift or contamination.

[0302] In some embodiments, megachunking enables selective storage of hierarchy levels, such that only a subset of nodes in the hierarchical tree are vectorized and stored as facet vectors, while other nodes retain text and metadata without a stored vector unless on-demand vectorization is performed. This level-selective storage can reduce the number of vectors stored for large documents while preserving navigational capability through parent-child facet relationships.

[0303] In some embodiments, when auto weighting improves the quality of stored vectors at each stored level, and megachunking reduces the number of levels that must be stored, the techniques jointly yield substantial reductions in stored vectors and associated indexing overhead. In such embodiments, storage reductions can be achieved while maintaining or improving retrieval accuracy and hierarchical context selection, because ingestion-time relevance weighting and hierarchical organization work together to concentrate semantic signal into fewer, higher-quality vectors.

[0304] In some embodiments, megachunking and auto weighting are combined such that each update facet corresponding to a megachunk node is weighted based on content-aware relevance. In such embodiments, the system preprocesses the update facet chunk to generate a term profile 500, updates corpus statistics 502 within the term statistics database 130, computes relevance scores 504 for the chunk relative to facets within the vector database 104, and computes the weight 200 (w) as a function of the relevance scores 504. The computed weight may then be used in the interpolation step that updates the facet vector for the update facet.
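By way of non-limiting illustration, the scoring and weighting steps may be sketched in Python using the BM25 form and the normalized weighting function recited in the clauses. The function names, the representation of a facet as a set of stems, the restriction of the sum to stems shared between the chunk and the facet, the default values k1 = 1.2 and b = 0.75, and the toy corpus statistics are assumptions made for this sketch.

```python
import math
from collections import Counter
from typing import Dict, List, Set


def idf(stem: str, n_chunks: int, chunk_freq: Dict[str, int]) -> float:
    """IDF(s) = ln((N - n(s) + 0.5) / (n(s) + 0.5) + 1)."""
    n_s = chunk_freq.get(stem, 0)
    return math.log((n_chunks - n_s + 0.5) / (n_s + 0.5) + 1.0)


def bm25(chunk: List[str], facet_terms: Set[str], n_chunks: int,
         chunk_freq: Dict[str, int], avgdl: float,
         k1: float = 1.2, b: float = 0.75) -> float:
    """Score a chunk of stems against a facet.

    Assumption: only stems that also occur in the facet's term set contribute."""
    tf = Counter(chunk)
    length_norm = k1 * (1.0 - b + b * len(chunk) / avgdl)
    return sum(
        idf(s, n_chunks, chunk_freq) * (tf[s] * (k1 + 1.0)) / (tf[s] + length_norm)
        for s in tf if s in facet_terms
    )


def facet_weights(scores: Dict[str, float], epsilon: float = 1e-9) -> Dict[str, float]:
    """Normalize scores into weights: w_F = score_F / (sum of all scores + epsilon)."""
    total = sum(scores.values()) + epsilon
    return {facet: score / total for facet, score in scores.items()}


chunk = ["leav", "polici", "grandpar"]                       # stems of the new chunk
facets = {"hr": {"leav", "polici", "benefit"}, "it": {"laptop", "vpn"}}
chunk_freq = {"leav": 2, "polici": 5, "grandpar": 1}         # toy corpus statistics
scores = {name: bm25(chunk, terms, n_chunks=10, chunk_freq=chunk_freq, avgdl=3.0)
          for name, terms in facets.items()}
weights = facet_weights(scores)
```

In this toy run the chunk shares no stems with the "it" facet, so that facet receives a weight of zero and nearly all of the weight is assigned to the "hr" facet. The resulting weight w may then be used in the interpolation step, with the update vector multiplied by w and the facet vector multiplied by (1-w).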

[0305] In some embodiments, megachunking provides one or more advantages. For example, megachunking may enable optimal context retrieval by returning an amount of context appropriate to the query, reducing both missing context and wasted context. Megachunking may preserve semantic coherence by splitting along natural boundaries 604 rather than fixed-size boundaries. Megachunking may reduce cost by decreasing the tokens transmitted to downstream language models. Megachunking may reduce latency by enabling smaller, focused context windows. Megachunking may improve accuracy by providing more relevant context and reducing hallucination risk. Megachunking may enable navigable retrieval strategies via the tree structure. Megachunking may provide storage flexibility through configurable level selection and on-demand vectorization. Megachunking may preserve provenance and document structure for citation generation. Finally, megachunking may integrate seamlessly with continuous vectorization, including storage efficiency, continuous updates, and weighted interpolation.

[0306] It should be noted that, while this discussion of megachunking has been presented in the context of a continuous vectorization system 100, the method of megachunking may be advantageously applied in vector databases 104 using conventional storage methods. The examples discussed herein should not be taken as limiting the application of megachunking to a particular architecture, because megachunking addresses a problem felt in many RAG-based solutions, namely, how to break up a document into retrievable segments while preserving context and remaining efficient.

[0307] In some embodiments, auto weighting and megachunking are designed to integrate with an existing continuous vectorization system described in a parent patent application. In such embodiments, megachunking provides a hierarchical facet structure for large documents, and auto weighting provides content-aware, relevance-based interpolation weights for updating facet vectors at one or more levels of the hierarchy. In some embodiments, the combination yields improved facet vector quality, improved retrieval granularity, and improved ingestion-time optimization without requiring material changes to query-time retrieval mechanics.

[0308] Clauses

[0309] The following clauses describe non-limiting embodiments. Unless the context clearly requires otherwise, features of any clause may be combined with features of any other clause, and a clause that depends from another clause may depend from any earlier clause of the same statutory category.

[0310] Clause 1. A method for updating a vector database, comprising: receiving new data; identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet; generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and an update vector, the update vector being the new data in a vectorized form, with the update vector being multiplied by a weight w produced by a weighting function and the target facet vector being multiplied by (1-w); and storing the updated facet vector within the vector database.

[0311] Clause 2. The method of Clause 1, further comprising generating the update vector by vectorizing the new data with an embedding model, wherein the new data is received as raw data.

[0312] Clause 3. The method of Clause 1 or Clause 2, wherein the target facet comprises at most one vector.

[0313] Clause 4. The method of any of Clauses 1-3, wherein storing the updated facet vector within the vector database comprises overwriting the target facet vector with the updated facet vector.

[0314] Clause 5. The method of any of Clauses 1-4, further comprising storing the update vector within the vector database.

[0315] Clause 6. The method of any of Clauses 1-5, wherein the weighting function depends, at least in part, on a vector count, the vector count being a number of vectors that have been combined through linear interpolation to yield the target facet vector.

[0316] Clause 7. The method of Clause 6, wherein the weighting function is average-based, and wherein, if n is the vector count, the weight is 1 / (n+1).

[0317] Clause 8. The method of any of Clauses 1-7, wherein the weighting function is order-based, and wherein the weight is equal to a decay factor that is greater than 0 and less than 1.

[0318] Clause 9. The method of Clause 8, wherein the decay factor is a function and is dependent on an elapsed time since the target facet vector was last updated.

[0319] Clause 10. A continuous vectorization system, comprising: a vector database comprising a plurality of vectors and a plurality of facets, each facet describing at least one vector associated with the facet on the basis of at least one of a value and an attribute reflected by the vector; and a continuous vectorization server communicatively coupled to the vector database, the continuous vectorization server comprising a processor and a memory, the memory comprising a weighting function and the processor configured to: receive new data; identify a target facet within the vector database using at least one of a value of the new data and an attribute of the new data; identify a target facet vector belonging to the target facet using the new data; retrieve the target facet vector from the vector database; generate a weight w by applying the weighting function to at least a part of at least one of the target facet, the target facet vector, the new data in a raw data form, and the new data in a vectorized form; create an updated facet vector via a weighted linear interpolation between the target facet vector and an update vector by performing a linear interpolation between the update vector multiplied by the weight and the target facet vector multiplied by (1-w); and send the updated facet vector to the vector database for storage; wherein the update vector is the new data in a vectorized form.

[0320] Clause 11. The continuous vectorization system of Clause 10, wherein the processor of the continuous vectorization server is further configured to receive the new data from a client device communicatively coupled to the continuous vectorization server through a network.

[0321] Clause 12. The continuous vectorization system of Clause 10 or Clause 11, wherein the vector database is remote and is communicatively coupled to the continuous vectorization server through a network.

[0322] Clause 13. The continuous vectorization system of any of Clauses 10-12, wherein the new data is raw data, and wherein the processor of the continuous vectorization server is further configured to generate the update vector by vectorizing the new data with an embedding model.

[0323] Clause 14. The continuous vectorization system of any of Clauses 10-13, wherein the target facet comprises, at most, one vector.

[0324] Clause 15. The continuous vectorization system of any of Clauses 10-14, wherein sending the updated facet vector to the vector database for storage comprises instructing the vector database to overwrite the target facet vector with the updated facet vector.

[0325] Clause 16. The continuous vectorization system of any of Clauses 10-15, wherein the processor is further configured to send the update vector to the vector database for storage.

[0326] Clause 17. The continuous vectorization system of any of Clauses 10-16, wherein the weighting function depends, at least in part, on a vector count, the vector count being a number of vectors that have been combined through linear interpolation to yield the target facet vector.

[0327] Clause 18. The continuous vectorization system of Clause 17, wherein the weighting function is average-based, such that the update vector is weighted the same as any of n vectors previously interpolated to yield the target facet vector, and wherein, if n is the vector count, the weight is 1 / (n+1).

[0328] Clause 19. The continuous vectorization system of any of Clauses 10-18, wherein the weighting function is order-based, and wherein the weight is equal to a decay factor that is greater than 0 and less than 1.

[0329] Clause 20. The continuous vectorization system of Clause 19, wherein the decay factor is a function and is dependent on an elapsed time since the target facet vector was last updated.

[0330] Clause 21. A method for updating a vector database, comprising: receiving new data, wherein the new data is received as raw data comprising text; identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet; generating a weight w produced by a weighting function, wherein generating the weight w comprises: preprocessing the new data to generate a term profile; updating corpus statistics within a term statistics database; calculating, using a relevance function, for each facet within the vector database, a relevance score for the new data relative to the facet, the relevance score being a function of the term profile and the corpus statistics; and calculating the weight w using the weighting function, the weighting function being a function of the relevance scores; generating an update vector by vectorizing the new data with an embedding model; generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w); and storing the updated facet vector within the vector database.

[0331] Clause 22. The method of Clause 21, wherein the relevance function is:

[0332] BM25(C, F) = SUM(s in C) [ IDF(s) * (tf(s, C) * (k1 + 1)) / (tf(s, C) + k1 * (1 - b + b * (|C| / avgdl))) ]

[0333] where: s represents each stem in chunk C; tf(s, C) is the term frequency of stem s in chunk C; |C| is the length of chunk C (in stems); avgdl is the average chunk length across a corpus; k1 is a tuning parameter controlling term frequency saturation; b is a tuning parameter controlling length normalization; and IDF(s) is an inverse document frequency of stem s defined as IDF(s) = ln( (N - n(s) + 0.5) / (n(s) + 0.5) + 1 ), where N is a total number of chunks in the corpus and n(s) is a number of chunks containing stem s.
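By way of non-limiting illustration, the relevance function of Clause 22 may be sketched as follows. The clause writes the score as BM25(C, F) while the right-hand side refers only to the chunk C and to corpus statistics; the sketch therefore takes the statistics as arguments, on the assumption that a caller supplies whichever statistics are associated with the facet F being scored. The sum is taken over distinct stems, and the defaults 1.2 and 0.75 for the two tuning parameters are commonly used values; both choices are assumptions rather than values stated in this disclosure. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf(stem: str, doc_freq: Dict[str, int], num_chunks: int) -> float:
    """IDF(s) = ln((N - n(s) + 0.5) / (n(s) + 0.5) + 1)."""
    n_s = doc_freq.get(stem, 0)
    return math.log((num_chunks - n_s + 0.5) / (n_s + 0.5) + 1.0)


def bm25(
    chunk_stems: List[str],
    doc_freq: Dict[str, int],   # n(s): number of chunks containing each stem
    num_chunks: int,            # N: total number of chunks in the corpus
    avgdl: float,               # average chunk length, in stems
    k1: float = 1.2,            # assumed default for term frequency saturation
    b: float = 0.75,            # assumed default for length normalization
) -> float:
    """Score a chunk, given as a list of stems, against the supplied statistics."""
    if avgdl <= 0:
        raise ValueError("avgdl must be positive")
    tf = Counter(chunk_stems)
    length = len(chunk_stems)
    score = 0.0
    for stem, freq in tf.items():
        denom = freq + k1 * (1.0 - b + b * (length / avgdl))
        score += idf(stem, doc_freq, num_chunks) * (freq * (k1 + 1.0)) / denom
    return score
```

For a one-stem chunk whose length equals avgdl, the saturation terms cancel and the score reduces to IDF(s), which gives a quick sanity check on an implementation.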

[0334] Clause 23. The method of Clause 21 or Clause 22, wherein the relevance function comprises an inverse document frequency calculation.

[0335] Clause 24. The method of any of Clauses 21-23, wherein the weighting function is w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero.

[0336] Clause 25. The method of any of Clauses 21-24, wherein the weighting function is w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter.
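By way of non-limiting illustration, the two weighting functions of Clauses 24 and 25 may be sketched as follows. The sketch assumes the relevance scores have already been computed and are supplied as a mapping from a facet identifier to its score; the identifiers, the default value of epsilon, and the function names are hypothetical. Subtracting the largest exponent before exponentiating is a standard numerical safeguard that leaves the result of the Clause 25 formula unchanged.

```python
import math
from typing import Dict


def normalized_weights(scores: Dict[str, float], epsilon: float = 1e-9) -> Dict[str, float]:
    """w_F = score(F) / (sum of all scores + epsilon)."""
    total = sum(scores.values()) + epsilon
    return {facet: score / total for facet, score in scores.items()}


def softmax_weights(scores: Dict[str, float], beta: float = 1.0) -> Dict[str, float]:
    """w_F = exp(beta * score(F)) / sum over F' of exp(beta * score(F'))."""
    if not scores:
        return {}
    peak = max(beta * score for score in scores.values())
    exps = {facet: math.exp(beta * score - peak) for facet, score in scores.items()}
    total = sum(exps.values())
    return {facet: value / total for facet, value in exps.items()}
```

The normalized form assigns a weight of zero to a facet whose score is zero, whereas the exponential form always assigns every facet a positive weight, with beta controlling how sharply the weights concentrate on the highest-scoring facet.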

[0337] Clause 26. The method of any of Clauses 21-25, wherein the weighting function is also a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated.

[0338] Clause 27. The method of any of Clauses 21-26, wherein preprocessing comprises removing stop words from the new data.

[0339] Clause 28. The method of any of Clauses 21-27, wherein preprocessing comprises applying stemming to generate stems for the term profile.
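By way of non-limiting illustration, the preprocessing of Clauses 27 and 28 may be sketched as follows. The sketch assumes that a term profile can be represented as a count of stems; the short stop word list and the suffix-stripping function `naive_stem` are hypothetical stand-ins for whatever stop word list and stemming algorithm an implementation actually uses.

```python
import re
from collections import Counter

# Assumed, deliberately tiny stop word list for illustration only.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to", "in", "is", "are"})


def naive_stem(token: str) -> str:
    """Strip one common suffix; a placeholder for a real stemming algorithm."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def term_profile(text: str) -> Counter:
    """Lowercase, tokenize, drop stop words, stem, and count the stems."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(naive_stem(t) for t in tokens if t not in STOP_WORDS)
```

Stemming lets "vector" and "vectors" contribute to the same entry of the term profile, which is what allows term frequencies and chunk counts to be accumulated per stem.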

[0340] Clause 29. A method for updating a vector database, comprising: receiving new data, wherein the new data is received as raw data comprising a document having natural boundaries; determining that a size of the new data exceeds a threshold; defining a root node with the document as a chunk; organizing the root node into a hierarchical tree of nodes by recursively splitting the chunk of each node along natural boundaries into smaller chunks and defining a new node for each chunk, each node having metadata specifying at least one of a parent node relationship pointing to a node with a larger chunk that comprises the chunk and a child node relationship pointing to a node with a smaller chunk that is part of the chunk; defining, for each node, a new facet within the vector database comprising the chunk of the node and at least one attribute comprising at least one of a parent facet relationship mirroring the parent node relationship of the node and a child facet relationship mirroring the child node relationship of the node; vectorizing the chunk of at least a subset of the new facets with an embedding model, with the subset comprising an update facet; identifying within the vector database a target facet and a target facet vector, using at least one of a value of the update facet and an attribute of the update facet, the target facet vector belonging to the target facet; generating a weight w produced by a weighting function; generating an updated facet vector that reflects an update vector of the update facet by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w); and storing the updated facet vector within the vector database.

[0341] Clause 30. The method of Clause 29, wherein all new facets are vectorized.

[0342] Clause 31. The method of Clause 29 or Clause 30, wherein the metadata of each node further comprises a level within the hierarchical tree of nodes, and wherein the attributes of each new facet further comprise the level of the associated node.

[0343] Clause 32. The method of Clause 31, wherein the subset of the new facets that is vectorized is composed of new facets having at least one of the top N levels and the bottom M levels.

[0344] Clause 33. The method of Clause 32, wherein N and M equal 1.
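By way of non-limiting illustration, the level-based selection of Clauses 32 and 33 may be sketched as follows. The sketch assumes that the level attribute is an integer with 0 for the root and larger values for deeper nodes, and that the bottom levels are measured from the deepest level present in the tree; the function name and the facet identifiers in the usage are hypothetical.

```python
from typing import Dict, List


def select_facets_to_vectorize(
    levels: Dict[str, int], top_n: int = 1, bottom_m: int = 1
) -> List[str]:
    """Return facet identifiers in the top N levels or the bottom M levels."""
    if not levels:
        return []
    deepest = max(levels.values())
    return [
        facet_id
        for facet_id, level in levels.items()
        if level < top_n or level > deepest - bottom_m
    ]
```

With N and M equal to 1, a three-level tree of one document, two sections, and two paragraphs yields the document facet and the two paragraph facets, leaving the intermediate section facets unvectorized.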

[0345] Clause 34. The method of any of Clauses 29-33, wherein each parent facet relationship and each child facet relationship is bidirectional.

[0346] Clause 35. The method of any of Clauses 29-34, wherein natural boundaries comprise at least one of section boundaries, subsection boundaries, paragraph boundaries, sentence boundaries, and token boundaries.

[0347] Clause 36. The method of any of Clauses 29-35, wherein recursively splitting the new data along natural boundaries into smaller chunks continues until a size of the smaller chunks is at most equal to a target chunk size.

[0348] Clause 37. The method of any of Clauses 29-36, wherein recursively splitting the chunk of each node along natural boundaries comprises selecting the natural boundaries to split along based, at least in part, on relative sizes of resulting chunks.
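By way of non-limiting illustration, the recursive splitting of Clauses 29 and 36 may be sketched as follows. The sketch assumes that chunk size is measured in characters and that paragraph, sentence, and token boundaries can be approximated by the separators listed in `BOUNDARIES`; it always tries the coarsest boundary first and does not implement the size-based boundary selection of Clause 37. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed separators: paragraph, sentence, token (coarsest first).
BOUNDARIES = ["\n\n", ". ", " "]


@dataclass(eq=False)
class Node:
    chunk: str
    level: int = 0
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list)


def split_once(chunk: str) -> List[str]:
    """Split along the coarsest boundary that yields more than one piece."""
    for sep in BOUNDARIES:
        pieces = [p for p in chunk.split(sep) if p.strip()]
        if len(pieces) > 1:
            return pieces
    return [chunk]


def build_tree(node: Node, target_size: int) -> Node:
    """Recursively split until every leaf chunk is at most target_size characters."""
    if len(node.chunk) <= target_size:
        return node
    pieces = split_once(node.chunk)
    if len(pieces) == 1:
        return node  # no boundary left to split along
    for piece in pieces:
        child = Node(chunk=piece, level=node.level + 1, parent=node)
        node.children.append(child)
        build_tree(child, target_size)
    return node


def leaves(node: Node) -> List[Node]:
    """Collect leaf nodes in document order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]
```

Each node records both its parent and its children, so the parent and child relationships can be copied directly into the attributes of the corresponding facets.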

[0349] Clause 38. The method of any of Clauses 29-37, wherein the threshold is one of 50,000 tokens, 100,000 characters, and 50 pages.

[0350] Clause 39. The method of any of Clauses 29-38, wherein the threshold varies based on a document type.

[0351] Clause 40. The method of any of Clauses 29-39, wherein the threshold varies based on a source of the document.

[0352] Clause 41. The method of any of Clauses 29-40, further comprising dynamically adjusting the threshold based on at least one of available computing resources and use case requirements.

[0353] Clause 42. The method of any of Clauses 29-41, wherein generating the weight w comprises: preprocessing the chunk of the update facet to generate a term profile; updating corpus statistics within a term statistics database; calculating, using a relevance function, for each facet within the vector database, a relevance score for the chunk of the update facet relative to the facet, the relevance score being a function of the term profile and the corpus statistics; and calculating the weight w using the weighting function, the weighting function being a function of the relevance scores.

[0354] Clause 43. The method of Clause 42, wherein the relevance function is BM25(C, F) as defined in Clause 22, with C corresponding to the chunk of the update facet.

[0355] Clause 44. The method of Clause 42 or Clause 43, wherein the relevance function comprises an inverse document frequency calculation.

[0356] Clause 45. The method of any of Clauses 42-44, wherein the weighting function is w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero.

[0357] Clause 46. The method of any of Clauses 42-45, wherein the weighting function is w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter.

[0358] Clause 47. The method of any of Clauses 42-46, wherein the weighting function is also a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated.

[0359] Clause 48. The method of any of Clauses 42-47, wherein preprocessing comprises removing stop words from the new data.

[0360] Clause 49. The method of any of Clauses 42-48, wherein preprocessing comprises applying stemming to generate stems for the term profile.

[0361] Clause 50. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for updating a vector database, the method comprising: receiving a new data; identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet; generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and an update vector, the update vector being the new data in a vectorized form, with the update vector being multiplied by a weight w produced by a weighting function and the target facet vector being multiplied by (1-w); and storing the updated facet vector within the vector database.

[0362] Clause 51. The non-transitory computer-readable storage medium of Clause 50, wherein the method further comprises generating the update vector by vectorizing the new data with an embedding model, wherein the new data is received as raw data.

[0363] Clause 52. The non-transitory computer-readable storage medium of Clause 50 or Clause 51, wherein the target facet comprises at most one vector.

[0364] Clause 53. The non-transitory computer-readable storage medium of any of Clauses 50-52, wherein storing the updated facet vector within the vector database comprises overwriting the target facet vector with the updated facet vector.

[0365] Clause 54. The non-transitory computer-readable storage medium of any of Clauses 50-53, wherein the method further comprises storing the update vector within the vector database.

[0366] Clause 55. The non-transitory computer-readable storage medium of any of Clauses 50-54, wherein the weighting function depends, at least in part, on a vector count, the vector count being the number of vectors that have been combined through linear interpolation to yield the target facet vector.

[0367] Clause 56. The non-transitory computer-readable storage medium of Clause 55, wherein the weighting function is average-based, and wherein, if n is the vector count, the weight is 1 / (n+1).

[0368] Clause 57. The non-transitory computer-readable storage medium of any of Clauses 50-56, wherein the weighting function is order-based, and wherein the weight is equal to a decay factor that is greater than 0 and less than 1.

[0369] Clause 58. The non-transitory computer-readable storage medium of Clause 57, wherein the decay factor is a function and is dependent on an elapsed time since the target facet vector was last updated.

[0370] Clause 59. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for updating a vector database, the method comprising: receiving a new data, wherein the new data is received as raw data comprising text; identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet; generating a weight w produced by a weighting function, wherein generating the weight w comprises: preprocessing the new data to generate a term profile; updating corpus statistics within a term statistics database; calculating using a relevance function, for each facet within the vector database, a relevance score for the new data relative to the facet, the relevance score being a function of the term profile and the corpus statistics; and calculating the weight w using the weighting function, the weighting function being a function of the relevance scores; generating an update vector by vectorizing the new data with an embedding model; generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w); and storing the updated facet vector within the vector database.

[0371] Clause 60. The non-transitory computer-readable storage medium of Clause 59, wherein the relevance function is BM25(C, F) = SUM(s in C) [ IDF(s) * (tf(s, C) * (k1 + 1)) / (tf(s, C) + k1 * (1 - b + b * (|C| / avgdl))) ], where: s represents each stem in chunk C, tf(s, C) is the term frequency of stem s in chunk C, |C| is the length of chunk C (in stems), avgdl is the average chunk length across the corpus, k1 is a tuning parameter controlling term frequency saturation, b is a tuning parameter controlling length normalization, and IDF(s) is the inverse document frequency of stem s and defined as IDF(s) = ln( (N - n(s) + 0.5) / (n(s) + 0.5) + 1 ), where N is the total number of chunks in the corpus and n(s) is the number of chunks containing stem s.

[0372] Clause 61. The non-transitory computer-readable storage medium of Clause 59 or Clause 60, wherein the relevance function comprises an inverse document frequency calculation.

[0373] Clause 62. The non-transitory computer-readable storage medium of any of Clauses 59-61, wherein the weighting function is w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero.

[0374] Clause 63. The non-transitory computer-readable storage medium of any of Clauses 59-62, wherein the weighting function is w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter.

[0375] Clause 64. The non-transitory computer-readable storage medium of any of Clauses 59-63, wherein the weighting function is also a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated.

[0376] Clause 65. The non-transitory computer-readable storage medium of any of Clauses 59-64, wherein preprocessing comprises removing stop words from the new data.

[0377] Clause 66. The non-transitory computer-readable storage medium of any of Clauses 59-65, wherein preprocessing comprises applying stemming to generate stems for the term profile.

[0378] Clause 67. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform a method for updating a vector database, the method comprising: receiving a new data, wherein the new data is received as raw data comprising a document having natural boundaries; determining that a size of the new data exceeds a threshold; defining a root node with the document as a chunk; organizing the root node into a hierarchical tree of nodes by recursively splitting the chunk of each node along natural boundaries into smaller chunks and defining a new node for each chunk, each node having metadata specifying at least one of a parent node relationship pointing to a node with a larger chunk that comprises the chunk and a child node relationship pointing to a node with a smaller chunk that is part of the chunk; defining, for each node, a new facet within the vector database comprising the chunk of the node and at least one attribute comprising at least one of a parent facet relationship mirroring the parent node relationship of the node and a child facet relationship mirroring the child node relationship of the node; vectorizing the chunk of at least a subset of the new facets with an embedding model, with the subset comprising an update facet; identifying within the vector database a target facet and a target facet vector, using at least one of a value of the update facet and an attribute of the update facet, the target facet vector belonging to the target facet; generating a weight w produced by a weighting function; generating an updated facet vector that reflects an update vector of the update facet by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w); and storing the updated facet vector within the vector database.

[0379] Clause 68. The non-transitory computer-readable storage medium of Clause 67, wherein all new facets are vectorized.

[0380] Clause 69. The non-transitory computer-readable storage medium of Clause 67 or Clause 68, wherein the metadata of each node further comprises a level within the hierarchical tree of nodes, and wherein the attributes of each new facet further comprise the level of the associated node.

[0381] Clause 70. The non-transitory computer-readable storage medium of Clause 69, wherein the subset of the new facets that is vectorized is composed of new facets having at least one of the top N levels and the bottom M levels.

[0382] Clause 71. The non-transitory computer-readable storage medium of Clause 70, wherein N and M equal 1.

[0383] Clause 72. The non-transitory computer-readable storage medium of any of Clauses 67-71, wherein each parent facet relationship and child facet relationship is bidirectional.

[0384] Clause 73. The non-transitory computer-readable storage medium of any of Clauses 67-72, wherein natural boundaries comprise at least one of section boundaries, subsection boundaries, paragraph boundaries, sentence boundaries, and token boundaries.

[0385] Clause 74. The non-transitory computer-readable storage medium of any of Clauses 67-73, wherein recursively splitting the new data along natural boundaries into smaller chunks continues until the size of the smaller chunks is at most equal to a target chunk size.

[0386] Clause 75. The non-transitory computer-readable storage medium of any of Clauses 67-74, wherein recursively splitting the chunk of each node along natural boundaries comprises selecting the natural boundaries to split along based, at least in part, on the relative sizes of the resulting chunks.

[0387] Clause 76. The non-transitory computer-readable storage medium of any of Clauses 67-75, wherein the threshold is one of 50,000 tokens, 100,000 characters, and 50 pages.

[0388] Clause 77. The non-transitory computer-readable storage medium of any of Clauses 67-76, wherein the threshold varies based on a document type.

[0389] Clause 78. The non-transitory computer-readable storage medium of any of Clauses 67-77, wherein the threshold varies based on a source of the document.

[0390] Clause 79. The non-transitory computer-readable storage medium of any of Clauses 67-78, wherein the method further comprises dynamically adjusting the threshold based on at least one of available computing resources and use case requirements.

[0391] Clause 80. The non-transitory computer-readable storage medium of any of Clauses 67-79, wherein generating the weight w comprises: preprocessing the chunk of the update facet to generate a term profile; updating corpus statistics within a term statistics database; calculating using a relevance function, for each facet within the vector database, a relevance score for the chunk of the update facet relative to the facet, the relevance score being a function of the term profile and the corpus statistics; and calculating the weight w using the weighting function, the weighting function being a function of the relevance scores.

[0392] Clause 81. The non-transitory computer-readable storage medium of Clause 80, wherein the relevance function is BM25(C, F) = SUM(s in C) [ IDF(s) * (tf(s, C) * (k1 + 1)) / (tf(s, C) + k1 * (1 - b + b * (|C| / avgdl))) ], where: s represents each stem in chunk C, tf(s, C) is the term frequency of stem s in chunk C, |C| is the length of chunk C (in stems), avgdl is the average chunk length across the corpus, k1 is a tuning parameter controlling term frequency saturation, b is a tuning parameter controlling length normalization, and IDF(s) is the inverse document frequency of stem s and defined as IDF(s) = ln( (N - n(s) + 0.5) / (n(s) + 0.5) + 1 ), where N is the total number of chunks in the corpus and n(s) is the number of chunks containing stem s.

[0393] Clause 82. The non-transitory computer-readable storage medium of Clause 80 or Clause 81, wherein the relevance function comprises an inverse document frequency calculation.

[0394] Clause 83. The non-transitory computer-readable storage medium of any of Clauses 80-82, wherein the weighting function is w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero.

[0395] Clause 84. The non-transitory computer-readable storage medium of any of Clauses 80-83, wherein the weighting function is w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter.

[0396] Clause 85. The non-transitory computer-readable storage medium of any of Clauses 80-84, wherein the weighting function is also a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated.

[0397] Clause 86. The non-transitory computer-readable storage medium of any of Clauses 80-85, wherein preprocessing comprises removing stop words from the new data.

[0398] Clause 87. The non-transitory computer-readable storage medium of any of Clauses 80-86, wherein preprocessing comprises applying stemming to generate stems for the term profile.

[0399] Clause 88. A computer program comprising instructions that, when executed by one or more processors, cause the one or more processors to perform a method for updating a vector database, the method comprising: receiving a new data; identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet; generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and an update vector, the update vector being the new data in a vectorized form, with the update vector being multiplied by a weight w produced by a weighting function and the target facet vector being multiplied by (1-w); and storing the updated facet vector within the vector database.

[0400] Clause 89. The computer program of Clause 88, wherein the method further comprises generating the update vector by vectorizing the new data with an embedding model, wherein the new data is received as raw data.

[0401] Clause 90. The computer program of Clause 88 or Clause 89, wherein the target facet comprises at most one vector.

[0402] Clause 91. The computer program of any of Clauses 88-90, wherein storing the updated facet vector within the vector database comprises overwriting the target facet vector with the updated facet vector.

[0403] Clause 92. The computer program of any of Clauses 88-91, wherein the method further comprises storing the update vector within the vector database.

[0404] Clause 93. The computer program of any of Clauses 88-92, wherein the weighting function depends, at least in part, on a vector count, the vector count being the number of vectors that have been combined through linear interpolation to yield the target facet vector.

[0405] Clause 94. The computer program of Clause 93, wherein the weighting function is average-based, and wherein, if n is the vector count, the weight is 1 / (n+1).

[0406] Clause 95. The computer program of any of Clauses 88-94, wherein the weighting function is order-based, and wherein the weight is equal to a decay factor that is greater than 0 and less than 1.

[0407] Clause 96. The computer program of Clause 95, wherein the decay factor is a function and is dependent on an elapsed time since the target facet vector was last updated.

[0408] Clause 97. A computer program comprising instructions that, when executed by one or more processors, cause the one or more processors to perform a method for updating a vector database, the method comprising: receiving a new data, wherein the new data is received as raw data comprising text; identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet; generating a weight w produced by a weighting function, wherein generating the weight w comprises: preprocessing the new data to generate a term profile; updating corpus statistics within a term statistics database; calculating using a relevance function, for each facet within the vector database, a relevance score for the new data relative to the facet, the relevance score being a function of the term profile and the corpus statistics; and calculating the weight w using the weighting function, the weighting function being a function of the relevance scores; generating an update vector by vectorizing the new data with an embedding model; generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w); and storing the updated facet vector within the vector database.

[0409] Clause 98. The computer program of Clause 97, wherein the relevance function is BM25(C, F) = SUM(s in C) [ IDF(s) * (tf(s, C) * (k1 + 1)) / (tf(s, C) + k1 * (1 - b + b * (|C| / avgdl))) ], where: s represents each stem in chunk C, tf(s, C) is the term frequency of stem s in chunk C, |C| is the length of chunk C (in stems), avgdl is the average chunk length across the corpus, k1 is a tuning parameter controlling term frequency saturation, b is a tuning parameter controlling length normalization, and IDF(s) is the inverse document frequency of stem s and defined as IDF(s) = ln( (N - n(s) + 0.5) / (n(s) + 0.5) + 1 ), where N is the total number of chunks in the corpus and n(s) is the number of chunks containing stem s.

[0410] Clause 99. The computer program of Clause 97 or Clause 98, wherein the relevance function comprises an inverse document frequency calculation.

[0411] Clause 100. The computer program of any of Clauses 97-99, wherein the weighting function is w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero.

[0412] Clause 101. The computer program of any of Clauses 97-100, wherein the weighting function is w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter.

[0413] Clause 102. The computer program of any of Clauses 97-101, wherein the weighting function is also a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated.

[0414] Clause 103. The computer program of any of Clauses 97-102, wherein preprocessing comprises removing stop words from the new data.

[0415] Clause 104. The computer program of any of Clauses 97-103, wherein preprocessing comprises applying stemming to generate stems for the term profile.

[0416] Clause 105. A computer program comprising instructions that, when executed by one or more processors, cause the one or more processors to perform a method for updating a vector database, the method comprising: receiving a new data, wherein the new data is received as raw data comprising a document having natural boundaries; determining that a size of the new data exceeds a threshold; defining a root node with the document as a chunk; organizing the root node into a hierarchical tree of nodes by recursively splitting the chunk of each node along natural boundaries into smaller chunks and defining a new node for each chunk, each node having metadata specifying at least one of a parent node relationship pointing to a node with a larger chunk that comprises the chunk and a child node relationship pointing to a node with a smaller chunk that is part of the chunk; defining, for each node, a new facet within the vector database comprising the chunk of the node and at least one attribute comprising at least one of a parent facet relationship mirroring the parent node relationship of the node and a child facet relationship mirroring the child node relationship of the node; vectorizing the chunk of at least a subset of the new facets with an embedding model, with the subset comprising an update facet; identifying within the vector database a target facet and a target facet vector, using at least one of a value of the update facet and an attribute of the update facet, the target facet vector belonging to the target facet; generating a weight w produced by a weighting function; generating an updated facet vector that reflects an update vector of the update facet by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w); and storing the updated facet vector within the vector database.

[0417] Clause 106. The computer program of Clause 105, wherein all new facets are vectorized.

[0418] Clause 107. The computer program of Clause 105 or Clause 106, wherein the metadata of each node further comprises a level within the hierarchical tree of nodes, and wherein the attributes of each new facet further comprise the level of the associated node.

[0419] Clause 108. The computer program of Clause 107, wherein the subset of the new facets that is vectorized is composed of new facets having at least one of the top N levels and the bottom M levels.

[0420] Clause 109. The computer program of Clause 108, wherein N and M equal 1.

[0421] Clause 110. The computer program of any of Clauses 105-109, wherein each parent facet relationship and child facet relationship is bidirectional.

[0422] Clause 111. The computer program of any of Clauses 105-110, wherein natural boundaries comprise at least one of section boundaries, subsection boundaries, paragraph boundaries, sentence boundaries, and token boundaries.

[0423] Clause 112. The computer program of any of Clauses 105-111, wherein recursively splitting the new data along natural boundaries into smaller chunks continues until the size of the smaller chunks is at most equal to a target chunk size.

[0424] Clause 113. The computer program of any of Clauses 105-112, wherein recursively splitting the chunk of each node along natural boundaries comprises selecting the natural boundaries to split along based, at least in part, on the relative sizes of the resulting chunks.

[0425] Clause 114. The computer program of any of Clauses 105-113, wherein the threshold is one of 50,000 tokens, 100,000 characters, and 50 pages.

[0426] Clause 115. The computer program of any of Clauses 105-114, wherein the threshold varies based on a document type.

[0427] Clause 116. The computer program of any of Clauses 105-115, wherein the threshold varies based on a source of the document.

[0428] Clause 117. The computer program of any of Clauses 105-116, wherein the method further comprises dynamically adjusting the threshold based on at least one of available computing resources and use case requirements.

[0429] Clause 118. The computer program of any of Clauses 105-117, wherein generating the weight w comprises: preprocessing the chunk of the update facet to generate a term profile; updating corpus statistics within a term statistics database; calculating using a relevance function, for each facet within the vector database, a relevance score for the chunk of the update facet relative to the facet, the relevance score being a function of the term profile and the corpus statistics; and calculating the weight w using the weighting function, the weighting function being a function of the relevance scores.

[0430] Clause 119. The computer program of Clause 118, wherein the relevance function is BM25(C, F) = SUM(s in C) [ IDF(s) * (tf(s, C) * (k1 + 1)) / (tf(s, C) + k1 * (1 - b + b * (|C| / avgdl))) ], where: s represents each stem in chunk C, tf(s, C) is the term frequency of stem s in chunk C, |C| is the length of chunk C (in stems), avgdl is the average chunk length across the corpus, k1 is a tuning parameter controlling term frequency saturation, b is a tuning parameter controlling length normalization, and IDF(s) is the inverse document frequency of stem s and defined as IDF(s) = ln( (N - n(s) + 0.5) / (n(s) + 0.5) + 1 ), where N is the total number of chunks in the corpus and n(s) is the number of chunks containing stem s.

[0431] Clause 120. The computer program of Clause 118 or Clause 119, wherein the relevance function comprises an inverse document frequency calculation.

[0432] Clause 121. The computer program of any of Clauses 118-120, wherein the weighting function is w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero.

[0433] Clause 122. The computer program of any of Clauses 118-121, wherein the weighting function is w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter.
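For illustration, the two weighting functions recited in Clauses 121 and 122 could be sketched as below. The function names, the dictionary keyed by facet identifier, and the default values of `epsilon` and `beta` are assumptions of this sketch. Subtracting the maximum score before exponentiating is a standard numerical safeguard that leaves the softmax result mathematically unchanged; it is not recited in the clauses.

```python
import math


def normalized_weights(scores, epsilon=1e-9):
    """w_F = BM25(C, F) / (SUM over F' of BM25(C, F') + epsilon)."""
    total = sum(scores.values()) + epsilon
    return {facet: s / total for facet, s in scores.items()}


def softmax_weights(scores, beta=1.0):
    """w_F = exp(beta * BM25(C, F)) / SUM over F' of exp(beta * BM25(C, F'))."""
    # Shift by the maximum score so that exp() cannot overflow for beta > 0.
    m = max(scores.values())
    exps = {facet: math.exp(beta * (s - m)) for facet, s in scores.items()}
    total = sum(exps.values())
    return {facet: e / total for facet, e in exps.items()}


# Hypothetical relevance scores for two facets.
scores = {"facet_a": 3.0, "facet_b": 1.0}
proportional = normalized_weights(scores)
sharpened = softmax_weights(scores, beta=2.0)
```

A larger `beta` concentrates the weight on the highest-scoring facet, while `beta = 0` spreads it evenly across all facets.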

[0434] Clause 123. The computer program of any of Clauses 118-122, wherein the weighting function is also a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated.

[0435] Clause 124. The computer program of any of Clauses 118-123, wherein preprocessing comprises removing stop words from the new data.

[0436] Clause 125. The computer program of any of Clauses 118-124, wherein preprocessing comprises applying stemming to generate stems for the term profile.

[0437] It will be understood that implementations are not limited to the specific components disclosed herein, as virtually any components consistent with the intended operation of a system and method for updating a vector database through continuous vectorization may be utilized. Accordingly, for example, although particular systems, methods, and/or devices for vectorization, storage, and retrieval of vectors and facets may be disclosed, such components may comprise any shape, size, style, type, model, version, class, grade, measurement, concentration, material, weight, quantity, and/or the like consistent with the intended operation of a system and method for updating a vector database through continuous vectorization. In places where the description above refers to particular implementations of a system and method for updating a vector database through continuous vectorization, it should be readily apparent that a number of modifications may be made without departing from the spirit thereof and that these implementations may be applied to other vector storage systems and systems for embedding data streams.

Claims

What is claimed is:

1. A method for updating a vector database, comprising:
receiving a new data;
identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet;
generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and an update vector, the update vector being the new data in a vectorized form, with the update vector being multiplied by a weight w produced by a weighting function and the target facet vector being multiplied by (1-w); and
storing the updated facet vector within the vector database.
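By way of a non-limiting sketch, the weighted linear interpolation recited above can be written as updated = (1 - w) * target + w * update, applied element by element. In the Python below, the function names, the use of plain lists for vectors, and the in-memory dictionary standing in for the vector database are all assumptions made for illustration; overwriting the stored vector is only one of the storage options described.

```python
def interpolate(target, update, w):
    """Return (1 - w) * target + w * update, computed element-wise."""
    if len(target) != len(update):
        raise ValueError("vectors must have the same dimension")
    if not 0.0 <= w <= 1.0:
        raise ValueError("w is assumed to lie in [0, 1]")
    return [(1.0 - w) * t + w * u for t, u in zip(target, update)]


def update_facet(db, facet_id, update, w):
    """Interpolate the stored facet vector with the update and store the result."""
    updated = interpolate(db[facet_id], update, w)
    db[facet_id] = updated  # assumption: the updated vector overwrites the old one
    return updated


# Hypothetical two-dimensional facet vector held in a dictionary "database".
db = {"facet_1": [1.0, 0.0]}
result = update_facet(db, "facet_1", [0.0, 1.0], w=0.25)
```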

2. The method of claim 1, further comprising generating the update vector by vectorizing the new data with an embedding model, wherein the new data is received as raw data.

3. The method of claim 1, wherein the target facet comprises at most one vector.

4. The method of claim 1, wherein storing the updated facet vector within the vector database comprises overwriting the target facet vector with the updated facet vector.

5. The method of claim 1, further comprising storing the update vector within the vector database.

6. The method of claim 1, wherein the weighting function depends, at least in part, on a vector count, the vector count being the number of vectors that have been combined through linear interpolation to yield the target facet vector.

7. The method of claim 6:
wherein the weighting function is average-based, and
wherein, if n is the vector count, the weight is 1 / (n+1).
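As an illustrative sketch, an average-based weight of 1 / (n+1) makes the facet vector equal to the arithmetic mean of every vector folded into it so far, because (n / (n+1)) * mean_n + (1 / (n+1)) * v is the mean of n+1 vectors. The function name `average_update` and the convention of returning the new vector count alongside the vector are assumptions of this sketch.

```python
def average_update(target, n, update):
    """Fold `update` into a facet vector that already averages n vectors."""
    w = 1.0 / (n + 1)
    updated = [(1.0 - w) * t + w * u for t, u in zip(target, update)]
    return updated, n + 1


# Hypothetical vectors folded in one at a time.
vectors = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
facet, count = vectors[0], 1
for v in vectors[1:]:
    facet, count = average_update(facet, count, v)

# Direct arithmetic mean of the same vectors, for comparison.
mean = [sum(column) / len(vectors) for column in zip(*vectors)]
```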

8. The method of claim 1:
wherein the weighting function is order-based, and
wherein the weight is equal to a decay factor that is greater than 0 and less than 1.
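For illustration, when the weight is held at a constant decay factor d, repeated updates behave like an exponential moving average: the most recent update carries a share d of the facet vector, the one before it d * (1 - d), and so on, while the initial vector retains (1 - d) raised to the number of updates. The sketch below assumes a constant d and uses hypothetical function names; a time-dependent decay factor is not modeled here.

```python
def order_based_update(target, update, d):
    """Interpolate with the weight fixed at the decay factor d, 0 < d < 1."""
    if not 0.0 < d < 1.0:
        raise ValueError("decay factor must be strictly between 0 and 1")
    return [(1.0 - d) * t + d * u for t, u in zip(target, update)]


def effective_weights(d, num_updates):
    """Return the share of the initial vector and of each update, oldest first."""
    initial = (1.0 - d) ** num_updates
    updates = [d * (1.0 - d) ** (num_updates - 1 - i) for i in range(num_updates)]
    return initial, updates


# Hypothetical one-dimensional example with d = 0.5.
facet = [10.0]
history = [[1.0], [2.0], [3.0]]
for u in history:
    facet = order_based_update(facet, u, 0.5)

initial_share, update_shares = effective_weights(0.5, len(history))
```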

9. The method of claim 8, wherein the decay factor is a function and is dependent on an elapsed time since the target facet vector was last updated.

10. A continuous vectorization system, comprising:
a vector database comprising a plurality of vectors and a plurality of facets, each facet describing at least one vector associated with the facet on the basis of at least one of a value and an attribute reflected by the vector; and
a continuous vectorization server communicatively coupled to the vector database, the continuous vectorization server comprising a processor and a memory, the memory comprising a weighting function and the processor configured to:
receive a new data;
identify a target facet within the vector database using at least one of a value of the new data and an attribute of the new data;
identify a target facet vector belonging to the target facet using the new data;
retrieve the target facet vector from the vector database;
generate a weight w by applying the weighting function to at least a part of at least one of the target facet, the target facet vector, the new data in a raw data form, and the new data in a vectorized form;
create an updated facet vector via a weighted linear interpolation between the target facet vector and an update vector by performing a linear interpolation between the update vector multiplied by the weight and the target facet vector multiplied by (1-w); and
send the updated facet vector to the vector database for storage;
wherein the update vector is the new data in a vectorized form.

11. The continuous vectorization system of claim 10, wherein the processor of the continuous vectorization server is further configured to receive the new data from a client device communicatively coupled to the continuous vectorization server through a network.

12. The continuous vectorization system of claim 10, wherein the vector database is remote and is communicatively coupled to the continuous vectorization server through a network.

13. The continuous vectorization system of claim 10:
wherein the new data is raw data, and
wherein the processor of the continuous vectorization server is further configured to generate the update vector by vectorizing the new data with an embedding model.

14. The continuous vectorization system of claim 10, wherein the target facet comprises, at most, one vector.

15. The continuous vectorization system of claim 10, wherein sending the updated facet vector to the vector database for storage comprises instructing the vector database to overwrite the target facet vector with the updated facet vector.

16. The continuous vectorization system of claim 10, wherein the processor is further configured to send the update vector to the vector database for storage.

17. The continuous vectorization system of claim 10, wherein the weighting function depends, at least in part, on a vector count, the vector count being the number of vectors that have been combined through linear interpolation to yield the target facet vector.

18. The continuous vectorization system of claim 17:
wherein the weighting function is average-based, such that the update vector is weighted the same as any of the n vectors previously interpolated to yield the target facet vector, and
wherein, if n is the vector count, the weight is 1 / (n+1).

19. The continuous vectorization system of claim 10:
wherein the weighting function is order-based, and
wherein the weight is equal to a decay factor that is greater than 0 and less than 1.

20. The continuous vectorization system of claim 19, wherein the decay factor is a function and is dependent on an elapsed time since the target facet vector was last updated.

21. A method for updating a vector database, comprising:
receiving a new data, wherein the new data is received as raw data comprising text;
identifying within the vector database a target facet and a target facet vector, using at least one of a value of the new data and an attribute of the new data, the target facet vector belonging to the target facet;
generating a weight w produced by a weighting function, wherein generating the weight w comprises:
preprocessing the new data to generate a term profile;
updating corpus statistics within a term statistics database;
calculating using a relevance function, for each facet within the vector database, a relevance score for the new data relative to the facet, the relevance score being a function of the term profile and the corpus statistics; and
calculating the weight w using the weighting function, the weighting function being a function of the relevance scores;
generating an update vector by vectorizing the new data with an embedding model;
generating an updated facet vector that reflects the new data by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w); and
storing the updated facet vector within the vector database.

22. The method of claim 21, wherein the relevance function is
BM25(C, F) = SUM(s in C) [ IDF(s) * (tf(s, C) * (k1 + 1)) / (tf(s, C) + k1 * (1 - b + b * (|C| / avgdl))) ]
where:
s represents each stem in chunk C,
tf(s, C) is the term frequency of stem s in chunk C,
|C| is the length of chunk C (in stems),
avgdl is the average chunk length across the corpus,
k1 is a tuning parameter controlling term frequency saturation,
b is a tuning parameter controlling length normalization, and
IDF(s) is the inverse document frequency of stem s and defined as
IDF(s) = ln( (N - n(s) + 0.5) / (n(s) + 0.5) + 1 )
where:
N is the total number of chunks in the corpus, and
n(s) is the number of chunks containing stem s.

23. The method of claim 21, wherein the relevance function comprises an inverse document frequency calculation.

24. The method of claim 21, wherein the weighting function is w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero.

25. The method of claim 21, wherein the weighting function is w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter.

26. The method of claim 21, wherein the weighting function is also a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated.

27. The method of claim 21, wherein preprocessing comprises removing stop words from the new data.

28. The method of claim 21, wherein preprocessing comprises applying stemming to generate stems for the term profile.

29. A method for updating a vector database, comprising:
receiving a new data, wherein the new data is received as raw data comprising a document having natural boundaries;
determining that a size of the new data exceeds a threshold;
defining a root node with the document as a chunk;
organizing the root node into a hierarchical tree of nodes by recursively splitting the chunk of each node along natural boundaries into smaller chunks and defining a new node for each chunk, each node having metadata specifying at least one of a parent node relationship pointing to a node with a larger chunk that comprises the chunk and a child node relationship pointing to a node with a smaller chunk that is part of the chunk;
defining, for each node, a new facet within the vector database comprising the chunk of the node and at least one attribute comprising at least one of a parent facet relationship mirroring the parent node relationship of the node and a child facet relationship mirroring the child node relationship of the node;
vectorizing the chunk of at least a subset of the new facets with an embedding model, with the subset comprising an update facet;
identifying within the vector database a target facet and a target facet vector, using at least one of a value of the update facet and an attribute of the update facet, the target facet vector belonging to the target facet;
generating a weight w produced by a weighting function;
generating an updated facet vector that reflects an update vector of the update facet by performing a weighted linear interpolation between the target facet vector and the update vector, with the update vector being multiplied by the weight w and the target facet vector being multiplied by (1-w); and
storing the updated facet vector within the vector database.
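As a rough sketch of the tree-building step only, the recursive splitting along natural boundaries might look like the Python below. The `Node` class, the particular separator strings in `BOUNDARIES`, the character-based `target_size`, and the rule of moving to the next finer boundary at each level are all assumptions of this sketch; the threshold check, facet creation, vectorization, and interpolation steps are omitted. Splitting on a separator discards that separator from the resulting chunks.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Separators ordered from coarse to fine; these strings are assumed stand-ins
# for paragraph, line, sentence, and token boundaries.
BOUNDARIES = ["\n\n", "\n", ". ", " "]


@dataclass
class Node:
    chunk: str
    level: int = 0
    # Excluded from repr and comparison to avoid walking parent/child cycles.
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    children: List["Node"] = field(default_factory=list)


def split_node(node, target_size, boundary_index=0):
    """Split node.chunk recursively until it fits or no finer boundary remains."""
    if len(node.chunk) <= target_size or boundary_index >= len(BOUNDARIES):
        return node
    parts = [p for p in node.chunk.split(BOUNDARIES[boundary_index]) if p.strip()]
    if len(parts) < 2:
        # This boundary does not divide the chunk; try the next finer one.
        return split_node(node, target_size, boundary_index + 1)
    for part in parts:
        child = Node(chunk=part, level=node.level + 1, parent=node)
        node.children.append(child)
        split_node(child, target_size, boundary_index + 1)
    return node


def leaves(node):
    """Collect the leaf nodes of the tree in document order."""
    if not node.children:
        return [node]
    return [leaf for child in node.children for leaf in leaves(child)]


# Hypothetical document with two paragraphs of two sentences each.
root = split_node(Node("Alpha one. Alpha two.\n\nBeta one. Beta two."), target_size=12)
leaf_chunks = [leaf.chunk for leaf in leaves(root)]
```

Each node keeps both a parent reference and a list of children, which is one way to represent the parent and child node relationships that the new facets are described as mirroring.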

30. The method of claim 29, wherein all new facets are vectorized.

31. The method of claim 29, wherein the metadata of each node further comprises a level within the hierarchical tree of nodes, and wherein the attributes of each new facet further comprise the level of the associated node.

32. The method of claim 31, wherein the subset of the new facets that is vectorized is composed of new facets having a level that is at least one of the top N levels and the bottom M levels.

33. The method of claim 32, wherein N and M equal 1.

34. The method of claim 29, wherein each parent facet relationship and child facet relationship is bidirectional.

35. The method of claim 29, wherein natural boundaries comprise at least one of section boundaries, subsection boundaries, paragraph boundaries, sentence boundaries, and token boundaries.

36. The method of claim 29, wherein recursively splitting the new data along natural boundaries into smaller chunks continues until the size of the smaller chunks is at most equal to a target chunk size.

37. The method of claim 29, wherein recursively splitting the chunk of each node along natural boundaries comprises selecting the natural boundaries to split along based, at least in part, on the relative sizes of the resulting chunks.

38. The method of claim 29, wherein the threshold is one of 50,000 tokens, 100,000 characters, and 50 pages.

39. The method of claim 29, wherein the threshold varies based on a document type.

40. The method of claim 29, wherein the threshold varies based on a source of the document.

41. The method of claim 29, further comprising dynamically adjusting the threshold based on at least one of available computing resources and use case requirements.

42. The method of claim 29, wherein generating the weight w comprises:
preprocessing the chunk of the update facet to generate a term profile;
updating corpus statistics within a term statistics database;
calculating using a relevance function, for each facet within the vector database, a relevance score for the chunk of the update facet relative to the facet, the relevance score being a function of the term profile and the corpus statistics; and
calculating the weight w using the weighting function, the weighting function being a function of the relevance scores.

43. The method of claim 42, wherein the relevance function is
BM25(C, F) = SUM(s in C) [ IDF(s) * (tf(s, C) * (k1 + 1)) / (tf(s, C) + k1 * (1 - b + b * (|C| / avgdl))) ]
where:
s represents each stem in chunk C,
tf(s, C) is the term frequency of stem s in chunk C,
|C| is the length of chunk C (in stems),
avgdl is the average chunk length across the corpus,
k1 is a tuning parameter controlling term frequency saturation,
b is a tuning parameter controlling length normalization, and
IDF(s) is the inverse document frequency of stem s and defined as
IDF(s) = ln( (N - n(s) + 0.5) / (n(s) + 0.5) + 1 )
where:
N is the total number of chunks in the corpus, and
n(s) is the number of chunks containing stem s.

44. The method of claim 42, wherein the relevance function comprises an inverse document frequency calculation.

45. The method of claim 42, wherein the weighting function is w_F = BM25(C, F) / (SUM(F' in all facets) BM25(C, F') + epsilon), where epsilon is a small constant to prevent division by zero.

46. The method of claim 42, wherein the weighting function is w_F = exp(beta * BM25(C, F)) / (SUM(F') exp(beta * BM25(C, F'))), where beta is a temperature parameter.

47. The method of claim 42, wherein the weighting function is also a function of a decay factor that is dependent on an elapsed time since the target facet vector was last updated.

48. The method of claim 42, wherein preprocessing comprises removing stop words from the new data.

49. The method of claim 42, wherein preprocessing comprises applying stemming to generate stems for the term profile.