A knowledge hub-oriented intelligent retrieval generation method and system
By constructing an intelligent retrieval system oriented towards the knowledge hub, and utilizing semantic similarity vector indexing and hybrid retrieval strategies, the system solves the problems of insufficient semantic understanding and accuracy in traditional knowledge retrieval methods, and achieves efficient and controllable knowledge location and output.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHENZHEN BEIWEI TECH CO LTD
- Filing Date
- 2026-04-07
- Publication Date
- 2026-06-30
Abstract
Description
Technical Field
[0001] This invention relates to the field of data processing, and in particular to an intelligent retrieval and generation method and system for knowledge hubs.
Background Technology
[0002] Currently, enterprises have accumulated a large volume of scattered, heterogeneous, and primarily unstructured knowledge assets during their digital operations. Traditional knowledge retrieval methods based on keyword matching or manual classification struggle to capture semantic relationships, which limits retrieval efficiency and accuracy. Generative question answering that relies directly on large language models has strong language understanding and expression capabilities, but in enterprise scenarios it generally suffers from insufficient coverage of internal knowledge, poor content timeliness, untraceable results, and the risk of "hallucination", and so fails to meet enterprises' high requirements for knowledge accuracy, reliability, and security.
Summary of the Invention
[0003] This invention aims to solve the problem that current single-mode retrieval methods cannot provide both semantic understanding and precise matching, which results in imprecise knowledge location, and provides an intelligent retrieval and generation method and system oriented towards the knowledge hub.
[0004] The present invention employs the following technical means to solve the technical problem: This invention provides an intelligent retrieval and generation method oriented towards a knowledge hub, comprising: Based on the multi-source heterogeneous documents pre-received by the knowledge storage terminal, the original knowledge corresponding to the multi-source heterogeneous documents is identified, wherein the multi-source heterogeneous documents specifically include text, tables, images and scanned documents; Determine whether the original document can be converted into a preset knowledge fragment; If possible, then based on the keywords of the original knowledge, construct a semantic similarity-oriented vector index for the knowledge fragment, integrate the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal, and receive the user's knowledge query request according to the hybrid retrieval strategy, wherein the vector index specifically includes semantic vectors, structural information and metadata; Determine whether the knowledge query request matches the preset query rights of the knowledge storage terminal, wherein the query rights specifically include user identity, role permissions, and business scenario; If a match is found, the search scope of the hybrid retrieval strategy is dynamically constrained through the query rights. Multi-way retrieval is performed in the hybrid index, and the highly relevant knowledge fragments obtained are processed in a secondary manner to generate corresponding external knowledge output information. The traceability description of the external knowledge output information is dynamically marked in the knowledge storage terminal. The dynamic constraints specifically include content importance, semantic integrity, and context length limits, and the traceability description specifically includes the source document, paragraph position, and knowledge identifier.
[0005] Furthermore, before the step of constructing a semantic similarity-oriented vector index for the knowledge fragment based on the keywords of the original knowledge, and integrating the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal, the method further includes: Based on the semantic segmentation preset by the knowledge storage terminal, the knowledge fragment is split into segments to obtain the segmented data of the knowledge fragment; Determine whether the segment length of the segmented data exceeds a preset length threshold; If so, the semantic boundaries of the segmented data are identified, and keyword extraction processing is performed on the segmented data based on the semantic boundaries to obtain the enterprise's preset business semantic keywords. Based on the coverage of the business semantic keywords, the context fragments of the knowledge fragments are introduced and associated with the segmented data. Through the knowledge storage terminal, confidence samples of the segmented data are dynamically marked.
[0006] Furthermore, the step of receiving the user's knowledge query request according to the hybrid retrieval strategy further includes: Based on the business scenario corresponding to the knowledge query request, identify the user's identity verification content; Determine whether the identity verification content can pass the verification requirements of the knowledge storage terminal; If possible, the business semantics of the knowledge query request are obtained, the user's identity permissions are obtained based on the business semantics, the access knowledge domain of the knowledge storage terminal is dynamically restricted based on the identity permissions, a query vector corresponding to the knowledge query request is generated through the access knowledge domain, and the query result of the query vector is constructed.
[0007] Furthermore, the step of performing multi-way retrieval in the hybrid index, and further processing the retrieved highly relevant knowledge fragments to generate corresponding external knowledge output information, also includes: Based on the query attributes of the knowledge query request, the terminology information of the knowledge query request is identified, wherein the query attributes specifically include semantic complexity, number of keywords, and query type; Determine whether the terminology information reaches the preset retrieval weight; If so, then based on the weight bias of the terminology information, the weight configuration of the preset retrieval channel is obtained. According to the weight configuration, the preset parallel multi-path retrieval is dynamically triggered to generate corresponding candidate knowledge fragments and perform relevance scoring and ranking. Through the relevance scoring and ranking, the highly relevant knowledge fragments of the knowledge query request are selected. The weight bias specifically includes semantic description and natural language expression.
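The weight configuration described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the `QueryAttributes` class, the `term_threshold` parameter, the two channel names, and all numeric weights are assumptions chosen for illustration. The sketch picks a weight per retrieval channel from the query attributes, then fuses the per-channel scores into one ranking.

```python
from dataclasses import dataclass


@dataclass
class QueryAttributes:
    semantic_complexity: float  # hypothetical score in [0, 1]
    keyword_count: int          # number of terminology keywords found in the query
    query_type: str             # hypothetical labels: "terminology" or "natural_language"


def channel_weights(attrs: QueryAttributes, term_threshold: int = 3) -> dict[str, float]:
    """Choose channel weights (illustrative values, not taken from the source)."""
    if attrs.query_type == "terminology" or attrs.keyword_count >= term_threshold:
        # Terminology-heavy queries lean on exact keyword matching.
        return {"keyword": 0.7, "vector": 0.3}
    if attrs.semantic_complexity > 0.5:
        # Free-form, semantically complex queries lean on vector similarity.
        return {"keyword": 0.3, "vector": 0.7}
    return {"keyword": 0.5, "vector": 0.5}


def fuse(keyword_scores: dict[str, float],
         vector_scores: dict[str, float],
         weights: dict[str, float]) -> list[tuple[str, float]]:
    """Combine per-channel scores and rank fragment ids, highest score first."""
    ids = set(keyword_scores) | set(vector_scores)
    fused = {
        i: weights["keyword"] * keyword_scores.get(i, 0.0)
        + weights["vector"] * vector_scores.get(i, 0.0)
        for i in ids
    }
    # Ties are broken by fragment id so the ranking is deterministic.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

A fragment missing from one channel simply contributes a score of zero for that channel, so fragments found by both channels tend to rank higher.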
[0008] Furthermore, the step of determining whether the original document can be converted into a preset knowledge fragment also includes: Based on the document structure pre-detected by the knowledge storage terminal, the content length of the original document is identified, wherein the document structure specifically includes chapters, paragraphs, and semantic boundaries; Determine whether the length of the content has reached a preset length threshold; If so, the business knowledge domain of the original document is obtained, the business value of the original document is detected based on the business knowledge domain, and the document parameters of the original document are dynamically generated based on the business value. The document parameters specifically include source information, permission identifier, and compliance attributes.
[0009] Furthermore, the step of determining whether the knowledge query request matches the preset query rights of the knowledge storage terminal also includes: The user's access behavior is identified based on the frequency of the knowledge query requests initiated within a preset time period; Determine whether the access behavior is cross-domain; If so, the cross-domain degree of the access behavior is detected, and the sensitive fields of the knowledge query request are obtained according to the cross-domain degree. Based on the sensitive fields, the knowledge output content of the knowledge query request is dynamically desensitized. The cross-domain degree specifically includes the number of business domains, the sensitivity level, and the degree of deviation from responsibilities.
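As a rough illustration of the cross-domain check and desensitization described above, the following Python sketch scores how far a query reaches outside the user's own business domains and masks long digit runs when that score crosses a threshold. The scoring formula, the `threshold` value, and the choice of digit runs as the "sensitive field" are all assumptions made for this example; the source does not specify them.

```python
import re


def cross_domain_degree(user_domains: set[str],
                        queried_domains: set[str],
                        sensitivity: int) -> int:
    """Hypothetical score: domains outside the user's own, scaled by sensitivity level."""
    foreign = queried_domains - user_domains
    return len(foreign) * sensitivity


def desensitize(text: str, degree: int, threshold: int = 2) -> str:
    """Mask digit runs of six or more characters, keeping only the last four digits."""
    if degree < threshold:
        return text  # access stays within the user's responsibilities; output unchanged

    def mask(match: re.Match) -> str:
        digits = match.group()
        return "*" * (len(digits) - 4) + digits[-4:]

    return re.sub(r"\d{6,}", mask, text)
```

For example, a user assigned only to an "hr" domain who queries "hr" and "finance" content at sensitivity level 3 gets a degree of 3, which exceeds the default threshold and triggers masking.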
[0010] Furthermore, the step of identifying the original knowledge corresponding to the multi-source heterogeneous documents pre-received by the knowledge storage terminal further includes: Based on the historical knowledge pre-collected by the knowledge storage terminal, the associated content of the multi-source heterogeneous documents is collected, wherein the associated content specifically includes the same business object, the same process, and the same technical point; Determine whether the associated content has a reference relationship with the historical knowledge; If so, the text information of the multi-source heterogeneous document is obtained, the text quality of the multi-source heterogeneous document is identified based on the text information, and invalid content of the original knowledge is dynamically covered based on the text quality. The text information specifically includes structural markers, preceding and following paragraphs, and layout features, and the text quality specifically includes missing characters, garbled characters, and typos.
[0011] This invention also provides an intelligent retrieval and generation system oriented towards a knowledge hub, comprising: The identification module is used to identify the original knowledge corresponding to the multi-source heterogeneous documents pre-received by the knowledge storage terminal, wherein the multi-source heterogeneous documents specifically include text, tables, images and scanned documents; The judgment module is used to determine whether the original document can be converted into a preset knowledge fragment; The execution module is configured to, if possible, construct a semantic similarity-oriented vector index for the knowledge fragment based on the keywords of the original knowledge, integrate the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal, and receive the user's knowledge query request based on the hybrid retrieval strategy, wherein the vector index specifically includes semantic vectors, structural information and metadata; The second judgment module is used to determine whether the knowledge query request matches the preset query rights of the knowledge storage terminal, wherein the query rights specifically include user identity, role permissions and business scenarios; The second execution module is used to dynamically constrain the retrieval scope of the hybrid retrieval strategy if a match is found, by means of the query rights, to perform multi-way retrieval in the hybrid index, to perform secondary processing on the highly relevant knowledge fragments obtained from the retrieval, to generate corresponding external knowledge output information, and to dynamically mark the traceability description of the external knowledge output information in the knowledge storage terminal. 
The dynamic constraints specifically include content importance, semantic integrity and context length limits, and the traceability description specifically includes source document, paragraph position and knowledge identifier.
[0012] Furthermore, it also includes: The acquisition module is used to split the knowledge fragment into segments based on the semantic segmentation preset by the knowledge storage terminal, and acquire the segmented data of the knowledge fragment. The third judgment module is used to determine whether the segment length of the segmented data exceeds a preset length threshold. The third execution module is used to, if so, identify the semantic boundary of the segmented data, perform keyword extraction processing on the segmented data according to the semantic boundary, obtain the enterprise's preset business semantic keywords, introduce the context fragment of the knowledge fragment and associate it with the segmented data according to the coverage of the business semantic keywords, and dynamically mark the confidence samples of the segmented data through the knowledge storage terminal.
[0013] Furthermore, the execution module also includes: The identification unit is used to identify the user's identity verification content based on the business scenario corresponding to the knowledge query request; The judgment unit is used to determine whether the identity verification content can pass the verification requirements of the knowledge storage terminal; An execution unit is configured to, if possible, obtain the business semantics of the knowledge query request, obtain the user's identity permissions based on the business semantics, dynamically restrict the knowledge domain access of the knowledge storage terminal based on the identity permissions, generate a query vector corresponding to the knowledge query request through the access knowledge domain, and construct the query result of the query vector.
[0014] This invention provides an intelligent retrieval and generation method and system for knowledge hubs, which has the following beneficial effects: This invention significantly improves the comprehensiveness and accuracy of knowledge retrieval by introducing a hybrid retrieval strategy that integrates keywords and semantic vectors. On the one hand, by semantically segmenting and vectorizing the original knowledge in multi-source heterogeneous documents, a vector index oriented towards semantic similarity is constructed. This allows the system to capture the semantic intent behind user queries and avoids the omission of semantically relevant content that results from relying solely on literal matching. On the other hand, by incorporating original knowledge keywords into the retrieval system, it compensates for the shortcomings of pure vector retrieval in terms of professional terminology, numbering rules, and precise clause matching, achieving precise positioning of key details. Based on this, the retrieval scope is dynamically constrained in conjunction with user query rights, and the retrieval results are further processed and refined to effectively reduce the interference of irrelevant or noisy content. This ensures that the final external knowledge output is semantically relevant while remaining accurate and controllable, thereby improving the overall accuracy and usability of enterprise knowledge positioning.
Attached Figure Description
[0015] Figure 1 is a flowchart illustrating an embodiment of the intelligent retrieval and generation method for knowledge hubs according to the present invention. Figure 2 is a structural block diagram of an embodiment of the intelligent retrieval and generation system for knowledge hubs according to the present invention.
Detailed Implementation
[0016] It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. The realization of the purpose, functional features, and advantages of the invention will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
[0017] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0018] Referring to Figure 1, the present invention provides an intelligent retrieval and generation method for a knowledge hub, comprising:
S1: Based on the multi-source heterogeneous documents pre-received by the knowledge storage terminal, identify the original knowledge corresponding to the multi-source heterogeneous documents, wherein the multi-source heterogeneous documents specifically include text, tables, images and scanned documents;
S2: Determine whether the original document can be converted into a preset knowledge fragment;
S3: If possible, construct a semantic similarity-oriented vector index for the knowledge fragment based on the keywords of the original knowledge, integrate the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal, and receive the user's knowledge query request based on the hybrid retrieval strategy, wherein the vector index specifically includes semantic vectors, structural information and metadata;
S4: Determine whether the knowledge query request matches the preset query rights of the knowledge storage terminal, wherein the query rights specifically include user identity, role permissions and business scenarios;
S5: If a match is found, dynamically constrain the retrieval scope of the hybrid retrieval strategy through the query rights, perform multi-way retrieval in the hybrid index, perform secondary processing on the highly relevant knowledge fragments obtained to generate corresponding external knowledge output information, and dynamically mark the traceability description of the external knowledge output information in the knowledge storage terminal, wherein the dynamic constraints specifically include content importance, semantic integrity, and context length limits, and the traceability description specifically includes the source document, paragraph position, and knowledge identifier.
[0019] In this embodiment, the system receives, in advance through the knowledge storage terminal, multi-source heterogeneous documents uploaded by other users within the enterprise. These multi-source heterogeneous documents specifically include text, tables, images, and scanned documents. The system identifies the original knowledge corresponding to these documents and then determines whether this original knowledge can be converted into preset knowledge fragments, which decides the steps that follow. For example, when the system determines that the original knowledge cannot be converted into knowledge fragments, the system considers the document content to have low relevance to the system's preset business knowledge domain and to lack significant knowledge value. The system then marks the corresponding original knowledge as "cannot be automatically segmented" or "low-confidence knowledge" and temporarily stores it in a separate buffer or isolated area, so that it does not enter the vectorization and retrieval process and degrade overall retrieval quality. Simultaneously, appropriate follow-up actions are selected based on the reason for failure, such as triggering secondary parsing or content enhancement processing (e.g., OCR re-recognition, structural repair, semantic completion). The original knowledge is then pushed to the manual review or knowledge administrator processing queue, and only re-enters the knowledge fragment generation process after manual confirmation, supplementation, or adjustment. Conversely, when the system determines that the original knowledge corresponding to these multi-source heterogeneous documents can be converted into knowledge fragments, the system considers the document content to be closely related to the preset business knowledge domain. 
The system will then construct a semantic similarity-oriented vector index for the knowledge fragments based on the keywords of this original knowledge. The vector index includes semantic vectors, structural information, and metadata. The system integrates the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal and, based on this strategy, receives knowledge query requests from other users within the enterprise for these knowledge fragments. At this stage, the system extracts keywords and performs semantic modeling on the original knowledge, transforming scattered and heterogeneous document content into standardized, computable knowledge units. This provides a high-quality input foundation for subsequent retrieval and applications, ensuring the reliability and consistency of the enterprise's knowledge base content from the source. Simultaneously, the semantic vectors enable the system to understand the true semantic intent of user queries, improving its ability to identify synonyms and handle complex questions, while the introduction of keyword retrieval and structural information ensures accurate matching of professional terms, fixed expressions, and specific clauses. This effectively avoids the biases caused by a single retrieval method, thereby achieving precise positioning of knowledge fragments. Furthermore, with the support of a hybrid retrieval strategy, the system can stably and efficiently respond to query requests for knowledge fragments from other users within the enterprise, enabling users in different roles and business scenarios to quickly obtain the knowledge they need. The system then determines whether the knowledge query request matches the preset query rights of the knowledge storage terminal. 
These query rights specifically include user identity, role permissions, and business scenarios, and the result of the match decides the steps that follow. For example, when the system determines that a user's knowledge query request does not match the preset query rights of the knowledge storage terminal, the system considers that the query request does not meet the current knowledge access conditions in terms of permission boundaries, business adaptability, or security compliance. The system adopts a controlled processing strategy to ensure the security and manageability of the knowledge storage terminal: it immediately terminates the current query process, returns a prompt message to the user indicating insufficient permissions or restricted access, avoids exposing any specific knowledge content, and records and marks the unmatched query request, including user identity, query time, request content, and reason for failure. Further branch processing is executed according to the enterprise's management strategy, such as guiding the user through permission application or role upgrade processes, automatically recommending alternative knowledge content that falls within the user's permission scope, and triggering security alarms and access restriction mechanisms when frequent abnormal requests are detected. Conversely, when the system determines that a user's knowledge query request matches the preset query rights of the knowledge storage terminal, the system considers that the query request meets the current knowledge access conditions. The system dynamically constrains the retrieval scope of the hybrid retrieval strategy through the query rights, where the dynamic constraints include content importance, semantic completeness, and context length limits. Multi-path retrieval is performed within the hybrid index, and the highly relevant knowledge fragments obtained are further processed to generate corresponding external knowledge output information. 
The system dynamically marks traceability information for this external knowledge output information in the knowledge storage terminal, including the source document, paragraph position, and knowledge identifier. By dynamically constraining content importance, semantic completeness, and context length, the system can prioritize knowledge fragments highly relevant to current business decisions or operations, while ensuring compliance and security. This prevents irrelevant or unauthorized information from entering the search results, thereby improving the controllability and relevance of knowledge access. Furthermore, by refining, trimming, and semantically reorganizing the search results, the system generates more clearly structured and complete external knowledge output information, making the final output content more aligned with the user's query intent. This improves the accuracy and readability in intelligent question answering and knowledge recommendation scenarios. Simultaneously, while generating external knowledge output information, the system dynamically marks corresponding traceability information in the knowledge storage terminal, including the source document, paragraph position, and knowledge identifier, ensuring that each output result has a clear source basis.
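The query-rights check in this embodiment can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `QueryRequest` class, the `ALLOWED_ROLE_SCENARIOS` table, and the in-memory `audit_log` list are hypothetical names introduced here. A mismatched request is rejected without exposing any knowledge content, and the user identity, query time, request content, and failure reason are recorded.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class QueryRequest:
    user_id: str
    role: str
    scenario: str
    text: str


# Hypothetical rights table: (role, business scenario) pairs that may query.
ALLOWED_ROLE_SCENARIOS = {("engineer", "operations"), ("auditor", "compliance")}

# In-memory stand-in for the record of unmatched requests.
audit_log: list[dict] = []


def check_query_rights(request: QueryRequest, known_users: set[str]) -> bool:
    """Return True if the request matches the preset rights; otherwise log and reject."""
    if request.user_id not in known_users:
        reason = "unknown user identity"
    elif (request.role, request.scenario) not in ALLOWED_ROLE_SCENARIOS:
        reason = "role not permitted for business scenario"
    else:
        return True
    # Record identity, time, request content, and failure reason for later review.
    audit_log.append({
        "user_id": request.user_id,
        "time": datetime.now(timezone.utc).isoformat(),
        "request": request.text,
        "reason": reason,
    })
    return False
```

Follow-up branches such as permission application or alarms on frequent failures would read from the recorded entries; they are left out of this sketch.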
[0020] It should be noted that, based on the keywords of the original knowledge, a semantically similar vector index is constructed for the knowledge fragment. The keywords and the vector index are then integrated as a hybrid retrieval strategy for the knowledge storage terminal, specifically as follows: After the system determines that the original knowledge can be converted into knowledge fragments, it constructs a semantic similarity-oriented vector index based on the keywords and semantic features of the original knowledge. Specifically, the system first extracts keywords from the original knowledge that reflect business themes, technical points, or management rules to characterize the explicit semantic features of the knowledge fragments. At the same time, it maps the knowledge fragments as a whole into high-dimensional semantic vectors through an embedding model to express their implicit semantic relationships and contextual meanings. During the construction of the vector index, the system further incorporates the structural information of the knowledge fragments in the original document (including chapter levels, paragraph order, table or image relationships) and the corresponding metadata (including the business domain, applicable scenarios, version information, and update time) into the index system, thereby forming a composite vector index that can simultaneously reflect semantic similarity, structural position, and business context. On this basis, the system integrates the keyword-based precise matching mechanism with the semantic vector-based similarity calculation mechanism to form a unified hybrid retrieval strategy, enabling the knowledge storage terminal to understand the semantic intent of the user's query and accurately hit key terms and rule clauses when responding to queries. 
Specific examples are as follows: For example, in an enterprise R&D and operations knowledge base, a multi-source heterogeneous document originates from a system design specification and historical operations reports. Its original knowledge description is "a method to ensure system stability through multi-level caching and cache invalidation control mechanisms in high-concurrency business scenarios." After recognizing this original knowledge, the system extracts keywords such as "high concurrency," "multi-level caching," "cache invalidation control," and "system stability," and generates corresponding semantic vectors based on this knowledge fragment. Simultaneously, it records metadata such as the chapter level in the original document, the system module it belongs to, and the applicable business scenarios. When another user initiates a knowledge base query such as "How does the system handle cache invalidation issues when concurrent request volume suddenly increases?", the system first calculates semantic vector similarity to match the aforementioned knowledge fragment. Even if the query expression differs from the original text, the system can still identify the semantic relationship. Subsequently, the system combines keyword search results to accurately locate content containing key terms such as "cache invalidation" and "high concurrency," and prioritizes returning knowledge fragments that are structurally complete, of the latest version, and consistent with the current business scenario. Finally, the knowledge content output by the system not only covers specific technical processing strategies but also includes clear source documents, paragraph positions, and knowledge identifiers, facilitating further review and verification by users. This provides reliable and traceable knowledge support in actual business decision-making and technical support processes.
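The composite index entry and hybrid matching in this example can be sketched in Python as below. This is a minimal illustration, not the patent's implementation: `IndexEntry`, `hybrid_score`, and the mixing parameter `alpha` are hypothetical, and the short hand-written vectors stand in for the output of an embedding model, which the source does not name.

```python
import math
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    fragment_id: str
    keywords: set[str]        # explicit semantic features
    vector: list[float]       # stand-in for an embedding of the whole fragment
    metadata: dict = field(default_factory=dict)  # e.g. chapter level, module, version


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_score(entry: IndexEntry,
                 query_terms: set[str],
                 query_vector: list[float],
                 alpha: float = 0.5) -> float:
    """Blend keyword overlap with vector similarity (alpha is an assumed mixing weight)."""
    if query_terms:
        keyword_part = len(entry.keywords & query_terms) / len(query_terms)
    else:
        keyword_part = 0.0
    return alpha * keyword_part + (1 - alpha) * cosine(entry.vector, query_vector)
```

The keyword part rewards exact hits on terms such as "cache invalidation", while the cosine part lets a differently worded query still match the fragment.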
[0021] It should be added that the secondary processing specifically includes splicing, compression, or recombination. The steps of dynamically constraining the retrieval scope of the hybrid retrieval strategy through the query rights, performing multi-path retrieval in the hybrid index, performing secondary processing on the highly relevant knowledge fragments obtained to generate corresponding external knowledge output information, and dynamically marking the traceability description of the external knowledge output information in the knowledge storage terminal are specifically as follows: After the system determines that a user's knowledge query request matches the preset query rights, the system dynamically constrains the retrieval scope of the hybrid retrieval strategy based on these rights to ensure that the retrieval process meets both business needs and permission and compliance requirements. Specifically, the system maps user identity, role permissions, and business scenarios to retrieval constraints, filtering and limiting the knowledge fragments that can participate in the retrieval, such as limiting the accessible business domain, knowledge level, or document version range. Simultaneously, the system further combines content importance, semantic integrity, and context length limits to dynamically filter and prune candidate knowledge fragments. Under these constraints, the system performs parallel multi-path retrieval within the hybrid index, which comprises the keyword index and the semantic vector index, and aggregates, deduplicates, and sorts the results returned by the different retrieval paths to filter out knowledge fragments with high semantic relevance and strong business adaptability. 
Subsequently, the system performs secondary processing on the highly relevant knowledge fragments, including content integration, context completion, conflict resolution, and expression optimization, thereby generating external knowledge output information that is structurally clear, semantically complete, and directly usable. At the same time, the system dynamically marks traceability descriptions for the external knowledge output information in the knowledge storage terminal, associating it with the source document, paragraph position, and corresponding knowledge identifier to ensure that the generated results have a clear source and traceability. Specific examples are as follows: For example, in an enterprise's internal compliance and process management scenario, a business user initiates a query request for "approval key points of a certain business process" based on their identity and role permissions. The system first allows the user to access only the corresponding business line and public-level process documents according to their role, automatically excluding knowledge content involving other departments or higher levels of sensitivity. On this basis, the system quickly locates knowledge fragments containing terms such as "approval" and "process key points" through keyword search in the hybrid index. At the same time, it identifies content that is semantically similar to the approval requirements but expressed differently through semantic vector search, and comprehensively sorts the two types of results. Subsequently, the system performs secondary processing on multiple highly relevant knowledge fragments, extracts and integrates the key approval conditions and precautions, forming a complete and concise external knowledge output information. 
The output results are marked with the source document name, specific paragraph position, and corresponding knowledge identifier, enabling users to quickly obtain the information they need and also to trace back to the original document for verification based on the traceability instructions.
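By way of non-limiting illustration, the flow of paragraph [0021] can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of this embodiment: the `Fragment` fields, the equal 0.5/0.5 weighting of the two retrieval paths, the two-dimensional toy vectors, and the sample store are all hypothetical, and secondary processing is reduced to simple splicing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    knowledge_id: str
    source_document: str
    paragraph: int
    domain: str    # business domain the fragment belongs to
    level: str     # knowledge level, e.g. "public" or "restricted"
    text: str
    vector: tuple  # toy embedding standing in for a real semantic vector


def keyword_score(terms, fragment):
    # Share of query terms that appear as whole words in the fragment.
    words = fragment.text.lower().split()
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def retrieve(fragments, terms, query_vector, domains, levels, top_k=2):
    # Query rights constrain the retrieval scope before any scoring happens.
    visible = [f for f in fragments if f.domain in domains and f.level in levels]
    # Both paths score every visible fragment; keying by identifier deduplicates.
    scores = {
        f.knowledge_id: 0.5 * keyword_score(terms, f)
        + 0.5 * cosine(query_vector, f.vector)
        for f in visible
    }
    by_id = {f.knowledge_id: f for f in visible}
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [by_id[k] for k in ranked]


def build_output(fragments):
    # Secondary processing reduced to splicing; each source stays traceable.
    return {
        "answer": " ".join(f.text for f in fragments),
        "sources": [
            (f.source_document, f.paragraph, f.knowledge_id) for f in fragments
        ],
    }


# Hypothetical sample store: two sales fragments and one restricted finance fragment.
STORE = [
    Fragment("K1", "approval_process.docx", 3, "sales", "public",
             "approval requires manager sign-off", (1.0, 0.0)),
    Fragment("K2", "approval_process.docx", 7, "sales", "public",
             "sign-off conditions for contracts", (0.8, 0.6)),
    Fragment("K3", "finance_policy.docx", 1, "finance", "restricted",
             "approval of budgets", (1.0, 0.0)),
]

hits = retrieve(STORE, ["approval"], (1.0, 0.0), {"sales"}, {"public"})
output = build_output(hits)
```

In this sketch the finance fragment never reaches the scoring stage, because the permission filter is applied before retrieval rather than to the ranked results.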
[0022] In this embodiment, before step S3, which involves constructing a semantic similarity-oriented vector index for the knowledge fragment based on the keywords of the original knowledge, and integrating the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal, the method further includes: S301: Based on the semantic segmentation preset by the knowledge storage terminal, the knowledge fragment is segmented to obtain the segmented data of the knowledge fragment; S302: Determine whether the segment length of the block data exceeds a preset length threshold; S303: If so, identify the semantic boundary of the segmented data, perform keyword extraction processing on the segmented data according to the semantic boundary, obtain the enterprise's preset business semantic keywords, introduce the context fragment of the knowledge fragment and associate it with the segmented data according to the coverage of the business semantic keywords, and dynamically mark the confidence samples of the segmented data through the knowledge storage terminal.
[0023] In this embodiment, the system splits knowledge fragments into blocks based on the semantic segmentation preset in the knowledge storage terminal, obtaining segmented data of these knowledge fragments. The system then determines whether the segment length of the segmented data exceeds a preset length threshold and executes the corresponding steps. For example, if the system determines that the segment length of the segmented data does not exceed the preset length threshold, the system considers the knowledge fragment to have clear semantic boundaries and appropriate information density under the current semantic segmentation rules, and that it can be fully understood and utilized within a single context. The system then marks it as a standard knowledge fragment that can enter the subsequent processing flow and, based on this segmented data, performs operations such as semantic mapping, data generation, keyword extraction, and structural information binding, incorporating the segmented data into the semantic similarity-oriented vector index construction process. Simultaneously, the segmented data is stored in association with its corresponding original document, paragraph position, and knowledge identifier so that it can be called directly during subsequent hybrid retrieval, multi-path retrieval, and knowledge generation. Furthermore, the system records the result of the length-threshold judgment as a reference for subsequent retrieval ranking, context splicing, or dynamic constraints. This improves overall processing efficiency and knowledge utilization while ensuring semantic integrity. Conversely, when the system determines that the segment length of the segmented data exceeds the preset length threshold, the system determines that the knowledge fragment cannot be fully understood and utilized under the current semantic segmentation rules. 
The system then identifies the semantic boundaries of the segmented data, performs keyword extraction on the segmented data according to these boundaries to obtain the enterprise's preset business semantic keywords, introduces context fragments of the knowledge fragment and associates them with the segmented data according to the coverage of the business semantic keywords, and dynamically marks confidence samples of the segmented data through the knowledge storage terminal. Because context fragments are introduced selectively according to keyword coverage, only those highly relevant to the current segment are added, so the recombined knowledge fragment retains semantic continuity and logical integrity while its length stays controlled. This enhances the understandability and accuracy of knowledge fragments in subsequent retrieval, matching, and generation processes. Furthermore, selecting and associating context fragments by keyword coverage keeps the retained content aligned with core business scenarios and reduces interference from irrelevant information. This helps improve the relevance and hit rate of knowledge fragments in hybrid retrieval and semantic matching, thereby enhancing the overall business adaptability of knowledge services. Moreover, the confidence-marking mechanism not only provides a reference for subsequent retrieval ranking and generation calls but also provides a basis for manual review, model optimization, and knowledge governance, including identifying potentially ambiguous fragments and continuously optimizing the semantic segmentation rules.
[0024] It should be noted that the semantic boundaries of the segmented data are identified, and keyword extraction processing is performed on the segmented data based on the semantic boundaries to obtain the enterprise's preset business semantic keywords. Based on the coverage of the business semantic keywords, contextual fragments of the knowledge segment are introduced and associated with the segmented data. Confidence samples of the segmented data are dynamically marked through the knowledge storage terminal. Specifically: After the system determines that the length of the segmented data exceeds a preset threshold, the system first performs semantic boundary recognition on the segmented data to determine the natural segmentation position of the knowledge content at the semantic level. The semantic boundaries can be comprehensively judged based on topic changes, semantic similarity mutations, paragraph structure, or logical relationships, thereby avoiding mechanical truncation of knowledge content. On this basis, the system performs keyword extraction processing on the segmented data around each semantic boundary, extracting and matching the pre-defined business semantic keywords of the enterprise to determine the relevance of each semantic unit to the enterprise's core business knowledge domain. The system further evaluates the contextual information required for the current segment based on the coverage of business semantic keywords in the segmented data. When it is determined that there is semantic dependency or incomplete information, the system introduces necessary contextual fragments from adjacent knowledge fragments and associates or supplements them with the current segmented data, thereby restoring the semantic integrity of the knowledge fragments within a controlled length range. 
At the same time, the system dynamically marks confidence samples on the segmented data processed above through the knowledge storage terminal to characterize that the segmented data has undergone semantic reorganization and context completion processing. Its content credibility and stability need to be paid attention to and distinguished in subsequent retrieval, generation, or manual governance processes. Specific examples are as follows: For example, in a company's process and policy documents, a certain original data block exceeds a threshold in length due to containing multiple consecutive process descriptions. The system identifies different semantic boundaries in this block, such as "process initiation conditions," "approval node descriptions," and "abnormal handling rules." The system then extracts and matches the company's preset business semantic keywords around these semantic boundaries, such as "approval authority," "responsible department," and "abnormal handling." It finds that the "abnormal handling rules" part has semantic dependencies on the preceding and following processes. Based on the coverage judgment of business semantic keywords, the system introduces context fragments related to abnormal handling from adjacent blocks and associates them with the current block data, so that the processed knowledge fragments can fully reflect the business meaning of the process rules. Finally, the system marks this block data as a confidence sample in the knowledge storage terminal. When it is subsequently retrieved or used to generate external knowledge output information, the system can use this mark to adjust the ranking weight or provide manual review prompts, thereby enhancing the system's controllability and credibility while ensuring the availability of knowledge.
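Steps S301 to S303 can be illustrated with the following minimal sketch. It rests on several assumptions that are not part of the embodiment: length is measured in whitespace-separated words, the threshold of 12 words and the keyword set are hypothetical, and semantic boundary recognition is replaced by a plain sentence split as a stand-in.

```python
from dataclasses import dataclass, field

# Hypothetical preset business semantic keywords and length threshold (in words).
BUSINESS_KEYWORDS = {"approval authority", "responsible department", "abnormal handling"}
LENGTH_THRESHOLD = 12


@dataclass
class Block:
    block_id: str
    text: str
    context_ids: list = field(default_factory=list)  # associated context fragments
    confidence_sample: bool = False                   # set after reorganization


def semantic_units(text):
    # Stand-in for semantic boundary recognition: split on sentence ends.
    return [u.strip() for u in text.split(".") if u.strip()]


def keyword_coverage(text, keywords=BUSINESS_KEYWORDS):
    # Business keywords found in any semantic unit of the text.
    covered = set()
    for unit in semantic_units(text):
        lowered = unit.lower()
        covered |= {k for k in keywords if k in lowered}
    return covered


def process_block(block, neighbours, threshold=LENGTH_THRESHOLD):
    # S302: blocks within the threshold pass through as standard fragments.
    if len(block.text.split()) <= threshold:
        return block
    # S303: attach only neighbours that share a covered business keyword.
    covered = keyword_coverage(block.text)
    for other in neighbours:
        if covered & keyword_coverage(other.text):
            block.context_ids.append(other.block_id)
    block.confidence_sample = True
    return block


long_block = Block(
    "B1",
    "The approval node is reviewed by the responsible department. "
    "Abnormal handling follows the escalation rules defined for the "
    "preceding approval node.",
)
neighbours = [
    Block("B2", "Abnormal handling cases are escalated within one business day."),
    Block("B3", "Process initiation requires a completed request form."),
]
processed = process_block(long_block, neighbours)
short_block = process_block(Block("B4", "Submit the request form."), neighbours)
```

Here the confidence mark records only that the block was reorganized; how that mark influences ranking weight or manual review is left to the downstream stages described above.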
[0025] In this embodiment, step S3, which involves receiving a user's knowledge query request based on the hybrid retrieval strategy, further includes: S31: Based on the business scenario corresponding to the knowledge query request, identify the user's identity verification content; S32: Determine whether the identity verification content can pass the verification requirements of the knowledge storage terminal; S33: If so, obtain the business semantics of the knowledge query request, obtain the user's identity permissions based on the business semantics, dynamically restrict the knowledge domains accessible in the knowledge storage terminal based on the identity permissions, generate the query vector corresponding to the knowledge query request within the accessible knowledge domains, and construct the query result of the query vector.
[0026] In this embodiment, the system identifies the identity verification content of users based on the business scenario corresponding to the knowledge query request. Then, the system determines whether this identity verification content can pass the verification requirements of the knowledge storage terminal, and executes the corresponding steps accordingly. For example, when the system determines that a user's identity verification content cannot pass the verification requirements of the knowledge storage terminal, the system considers that the user does not have legitimate or complete access qualifications in the current business scenario. The system will take tiered handling and security protection measures, immediately terminating the current query process, returning an identity verification failure prompt message to the user, clearly informing them that they need to complete identity authentication, log in again, or switch to a legitimate business scenario before continuing access. Simultaneously, the system records and marks the identity verification failure event, including user identifier, business scenario, verification failure reason, and time information, for subsequent security auditing and abnormal behavior analysis. Furthermore, the system can trigger further security policies based on the risk level, such as limiting the user's access frequency within a certain period or requiring secondary verification. Conversely, when the system determines that a user's identity verification content can pass the verification requirements of the knowledge storage terminal, the system considers that the user has legitimate or complete access qualifications in the current business scenario. Upon verifying the access rights of the user, the system acquires the business semantics of the knowledge query request. 
Based on the different business semantics, it obtains the user's identity permissions and dynamically restricts the knowledge domains accessible to the knowledge storage terminal according to different identity permissions. By accessing the knowledge domains, it generates query vectors corresponding to the knowledge query request and constructs the query results of these query vectors. By dynamically restricting the accessible knowledge domains based on different identity permissions, the system can ensure that users can only retrieve and use knowledge content within their legitimate business scope, effectively avoiding cross-business domain or unauthorized access issues. This enhances system security while increasing the flexibility and adaptability of access control. Furthermore, since the query vector construction process excludes interference from irrelevant or inaccessible knowledge domains, the system can focus more on knowledge content highly relevant to the user's actual needs during subsequent retrieval and matching processes. This improves the semantic relevance and hit accuracy of query results and reduces the output of irrelevant results or noise information. By forming a closed-loop processing mechanism for identity verification, business semantic recognition, identity permission matching, and query vector construction, the system can provide efficient and intelligent knowledge services to legitimate users while ensuring the security and compliant use of enterprise knowledge assets.
[0027] It should be noted that the process involves obtaining the business semantics of the knowledge query request, obtaining the user's identity and permissions based on the business semantics, dynamically restricting the knowledge domain access of the knowledge storage terminal based on the identity and permissions, generating a query vector corresponding to the knowledge query request through the accessed knowledge domain, and constructing the query result of the query vector. Specifically: After the system completes user identity verification, it first performs semantic parsing on the knowledge query request to identify its corresponding business semantics, which is used to clarify the business scenario, business object, and focus of the query request. The business semantics can be parsed through a semantic understanding model and matched with the enterprise's preset business semantic system to map the user's natural language query to specific business semantic tags. On this basis, the system further obtains the user's identity permissions matching the business scenario based on the identified business semantics, including user role, scope of responsibility, and knowledge level accessible under the business semantics. Subsequently, the system dynamically restricts the knowledge domains that can participate in retrieval in the knowledge storage terminal based on the identity permissions, and filters out the knowledge set that matches the current business semantics and user permissions. Within the limited access knowledge domain, the system generates a query vector corresponding to the knowledge query request, so that the query vector is consistent with the target knowledge domain at the semantic representation level, thereby constructing the basis for query results used for subsequent similarity calculation and retrieval ranking. 
Specific examples are as follows: For example, in an enterprise's internal management and operations scenario, a user initiates a knowledge query request for "system fault handling process" using their identity. The system identifies the business semantics corresponding to this request as "system operations and maintenance - fault handling" through semantic parsing. Based on this business semantics, the system obtains the user's identity permissions in the operations and maintenance scenario, confirming that they only have the permission to access operations and maintenance process documents and publicly available technical specifications. Accordingly, the system dynamically restricts the knowledge domain accessed by the knowledge storage terminal, allowing only knowledge fragments related to operations and maintenance fault handling to participate in the retrieval, and excluding content involving other business lines or sensitive management decisions. Within this restricted knowledge domain, the system generates a query vector consistent with the query request semantics, and constructs the query results based on the similarity calculation between the query vector and the knowledge fragment vector, so that the final returned content not only meets the user's current business needs but is also strictly controlled within the scope of their identity permissions.
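The closed loop of steps S31 to S33 can be sketched as follows, as a minimal illustration only. The permission table, the cue-word rule used for semantic parsing, the bag-of-words `embed` function, and the sample store are all hypothetical stand-ins; a real system would use a semantic understanding model and a proper embedding model.

```python
# Hypothetical mapping: (role, business semantics) -> accessible knowledge domains.
ROLE_PERMISSIONS = {
    ("ops_engineer", "operations/fault-handling"): {"ops_process", "public_tech_spec"},
}
# Hypothetical cue words standing in for a semantic understanding model.
SEMANTIC_RULES = {"fault": "operations/fault-handling"}


def parse_business_semantics(query):
    lowered = query.lower()
    for cue, tag in SEMANTIC_RULES.items():
        if cue in lowered:
            return tag
    return None


def embed(text, vocabulary):
    # Bag-of-words counts over a fixed vocabulary (toy query/fragment vector).
    words = text.lower().split()
    return [float(words.count(term)) for term in vocabulary]


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def answer_query(user, query, store, vocabulary):
    # S32: stop immediately when identity verification fails.
    if not user.get("verified"):
        return {"status": "identity_verification_failed", "results": []}
    # S33: business semantics -> identity permissions -> accessible domains.
    tag = parse_business_semantics(query)
    domains = ROLE_PERMISSIONS.get((user["role"], tag), set())
    candidates = [f for f in store if f["domain"] in domains]
    # The query vector is compared only against fragments in the allowed domains.
    query_vector = embed(query, vocabulary)
    ranked = sorted(
        candidates,
        key=lambda f: cosine(query_vector, embed(f["text"], vocabulary)),
        reverse=True,
    )
    return {"status": "ok", "results": [f["id"] for f in ranked]}


STORE = [
    {"id": "O1", "domain": "ops_process", "text": "system fault handling process steps"},
    {"id": "O2", "domain": "public_tech_spec", "text": "network cabling specification"},
    {"id": "M1", "domain": "management_decisions", "text": "system fault budget decision"},
]
VOCABULARY = ["system", "fault", "handling", "process"]

ok = answer_query({"role": "ops_engineer", "verified": True},
                  "system fault handling process", STORE, VOCABULARY)
denied = answer_query({"role": "ops_engineer", "verified": False},
                      "system fault handling process", STORE, VOCABULARY)
```

The management fragment `M1` matches the query terms but is never scored, which mirrors the requirement that the returned content stays within the user's identity permissions.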
[0028] In this embodiment, step S5, which involves performing multi-way retrieval in a hybrid index, processing the retrieved highly relevant knowledge fragments, and generating corresponding external knowledge output information, further includes: S51: Based on the query attributes of the knowledge query request, identify the terminology information of the knowledge query request, wherein the query attributes specifically include semantic complexity, number of keywords, and query type; S52: Determine whether the terminology information reaches the preset retrieval weight; S53: If so, then according to the weight bias of the terminology information, obtain the weight configuration of the preset retrieval channel, dynamically trigger the preset parallel multi-path retrieval based on the weight configuration, generate the corresponding candidate knowledge fragments and perform relevance scoring and sorting, and filter out the highly relevant knowledge fragments of the knowledge query request through the relevance scoring and sorting, wherein the weight bias specifically includes semantic description and natural language expression.
[0029] In this embodiment, the system identifies the terminology information of a knowledge query request based on its query attributes, specifically semantic complexity, number of keywords, and query type. The system then determines whether this terminology information reaches a preset retrieval weight and executes the corresponding steps. For example, if the system determines that the terminology information of the knowledge query request does not reach the preset retrieval weight, the system considers the query request to have insufficiently targeted terminology or unclear semantic expression in its current state, making it difficult to support high-quality, controllable knowledge retrieval and matching. The system then adopts a combined guidance and compensation strategy: it applies semantic completion or terminology expansion, for example drawing on enterprise knowledge graphs, historical high-frequency queries, or a thesaurus, to expand and reconstruct the query terms and raise their retrieval weight. Simultaneously, it returns guiding prompts or recommended example queries to the user, encouraging them to supply more specific keywords, limit the business scope, or clarify the query type. The system also marks this query request as a low-weight query and records its relevant characteristics for subsequent model optimization and retrieval strategy adjustment. This gradually improves query quality and overall retrieval performance without compromising user experience. Conversely, when the system determines that the terminology information of the knowledge query request has reached the preset retrieval weight, the system considers the query request to have a clear semantic expression in its current state. Based on the weight bias of the terminology information, which specifically includes semantic description and natural language expression, the system obtains the weight configuration of the preset retrieval channels. 
According to the weight configuration, the system dynamically triggers the preset parallel multi-path retrieval, generates the corresponding candidate knowledge fragments, and performs relevance scoring and ranking on them. Through this relevance scoring and ranking, the system selects the highly relevant knowledge fragments for the query request. By distinguishing whether the terminology information is biased toward semantic description or natural language expression, and matching the preset retrieval channel weight configuration, the system can dynamically select the most suitable retrieval path or combine multiple retrieval methods for parallel execution. With this weight-driven channel triggering mechanism, the retrieval process no longer depends on a single strategy but adapts to the characteristics of each query, which improves the flexibility and adaptability of the retrieval strategy. At the same time, by dynamically triggering parallel multi-path retrieval, the system obtains candidate knowledge fragments from different retrieval channels in a single pass. For example, semantic vector retrieval focuses on understanding the query intent, while keyword or rule retrieval focuses on accurately matching key terms. Furthermore, by scoring and ranking the candidate knowledge fragments by relevance and selecting the highly relevant ones, the system can prioritize returning knowledge content with strong semantic consistency, high business matching degree, and complete structure.
[0030] It should be noted that, based on the weight bias of the terminology information, the weight configuration of the preset retrieval channel is obtained. According to the weight configuration, a preset parallel multi-path retrieval is dynamically triggered to generate corresponding candidate knowledge fragments and perform relevance scoring and ranking. Through the relevance scoring and ranking, highly relevant knowledge fragments for the knowledge query request are selected. Specifically: After the system determines that the terminology information in the knowledge query request has reached the preset retrieval weight, the system further analyzes the weight bias of the terminology information to distinguish whether the query request focuses more on semantic description and understanding or on natural language or precise terminology expression. Based on this weight bias, the system obtains a matching weight configuration scheme from the pre-configured retrieval channel strategy. Different weight configurations correspond to the participation ratio and priority of different retrieval channels in parallel retrieval. For example, when the semantic description weight is high, the system increases the weight of the semantic vector retrieval channel; when the natural language or precise terminology weight is high, the system correspondingly increases the weight of the keyword retrieval, rule matching, or structured index channel. On this basis, the system dynamically triggers multiple retrieval channels to execute in parallel, generating a set of candidate knowledge fragments from different index spaces. Subsequently, the system performs unified relevance scoring and ranking on the candidate knowledge fragments. 
The relevance scoring can comprehensively consider factors such as semantic similarity, keyword hit rate, structural matching degree, and business adaptability, and selects knowledge fragments that are highly relevant to the knowledge query request based on the ranking results, as the core input for subsequent knowledge processing or output. Specific examples are as follows: For example, in an enterprise technical support scenario, a user initiates a knowledge query request for "how to solve the timeout problem of service under high concurrency". The system identifies that the semantic description has a higher weight and the precise terminology has a relatively lower weight in the query. Based on this weight bias, the system obtains a weight configuration that prioritizes semantic vector retrieval and supplements it with keyword retrieval, and triggers the semantic similarity retrieval channel and the keyword matching channel in parallel to retrieve a batch of candidate knowledge fragments from the knowledge storage terminal. The semantic retrieval channel returns solutions that are semantically related to "high concurrency" and "service timeout" but with different expressions, while the keyword channel supplements the document fragments containing the explicit terms "timeout" and "concurrency". The system then scores and sorts all candidate knowledge fragments based on their relevance, prioritizing the retention of knowledge fragments with high semantic similarity, strong technical scenario matching, and complete content, and finally selects the knowledge content that is highly relevant to the query request, providing the user with accurate and comprehensive knowledge support.
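The weight-driven channel triggering of step S53 can be sketched as follows. This is an illustrative sketch only: the weight table, the glossary-ratio rule used to decide the weight bias, the toy vectors, and the sample corpus are hypothetical, and only two channels (keyword and vector) are shown. The threshold check of step S52 is omitted.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical preset weight configurations, one per weight bias.
CHANNEL_WEIGHTS = {
    "semantic": {"vector": 0.7, "keyword": 0.3},
    "terminology": {"vector": 0.3, "keyword": 0.7},
}


def weight_bias(query, glossary):
    # Assumed rule: a query made mostly of glossary terms is terminology-biased.
    terms = query.lower().split()
    exact = sum(t in glossary for t in terms)
    return "terminology" if terms and exact / len(terms) >= 0.5 else "semantic"


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) ** 0.5) * (sum(y * y for y in b) ** 0.5)
    return dot / norm if norm else 0.0


def keyword_channel(query, corpus):
    terms = set(query.lower().split())
    if not terms:
        return {fid: 0.0 for fid in corpus}
    return {
        fid: len(terms & set(text.lower().split())) / len(terms)
        for fid, (text, _) in corpus.items()
    }


def vector_channel(query_vector, corpus):
    return {fid: cosine(query_vector, vec) for fid, (_, vec) in corpus.items()}


def multi_path_retrieve(query, query_vector, corpus, glossary, top_k=2):
    weights = CHANNEL_WEIGHTS[weight_bias(query, glossary)]
    # Both channels run in parallel; their scores are fused afterwards.
    with ThreadPoolExecutor(max_workers=2) as pool:
        kw_future = pool.submit(keyword_channel, query, corpus)
        vec_future = pool.submit(vector_channel, query_vector, corpus)
        channel_scores = {"keyword": kw_future.result(), "vector": vec_future.result()}
    fused = {
        fid: sum(weights[ch] * scores[fid] for ch, scores in channel_scores.items())
        for fid in corpus
    }
    return sorted(fused, key=fused.get, reverse=True)[:top_k]


# Hypothetical corpus: fragment id -> (text, toy vector).
CORPUS = {
    "A": ("retry tuning for slow responses under heavy load", (0.9, 0.1)),
    "B": ("timeout configuration reference", (0.2, 0.8)),
    "C": ("office seating plan", (0.0, 1.0)),
}
GLOSSARY = {"timeout", "concurrency"}
QUERY = "how to solve service timeout under high concurrency"

top = multi_path_retrieve(QUERY, (1.0, 0.0), CORPUS, GLOSSARY)
```

With this sample data the query is classified as semantically biased, so fragment `A`, which is close in vector space but shares almost no wording with the query, is ranked ahead of the keyword match `B`.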
[0031] In this embodiment, step S2, which determines whether the original document can be converted into a preset knowledge fragment, further includes: S21: Based on the document structure pre-detected by the knowledge storage terminal of the original document, identify the content length of the original document, wherein the document structure specifically includes chapters, paragraphs and semantic boundaries; S22: Determine whether the length of the content has reached a preset length threshold; S23: If so, obtain the business knowledge domain of the original document, detect the business value of the original document based on the business knowledge domain, and dynamically generate document parameters of the original document based on the business value. The document parameters specifically include source information, permission identifier, and compliance attributes.
[0032] In this embodiment, the system identifies the content length of the original document based on the document structure pre-detected by the knowledge storage terminal, where the document structure specifically includes chapters, paragraphs, and semantic boundaries. The system then determines whether the content length reaches a preset length threshold and executes the corresponding steps. For example, if the system determines that the content length of the original document does not reach the preset length threshold, it considers the original document to have a moderate content size and concentrated semantics under the current document structure division rules, so that it can be fully understood and utilized within a single context and requires no further splitting or complex reorganization. The system directly marks the original document as a valid document object that can be processed as a whole and inputs it into subsequent processes as a complete piece of original knowledge or a small knowledge fragment. Based on this original document, the system directly performs keyword extraction, semantic vector generation, and structural information binding, incorporating the document as a whole into the knowledge fragment construction and vector indexing process. This avoids semantic fragmentation caused by excessive splitting. Simultaneously, the chapter, paragraph, and semantic boundary information of the original document is stored as structural metadata, providing a basis for subsequent retrieval, sorting, and tracing. The system also records the judgment result that the document did not reach the length threshold for use in subsequent index weight adjustment or context splicing decisions, thereby improving overall processing efficiency and knowledge utilization while ensuring semantic integrity. 
Conversely, when the system determines that the content length of the original document has reached the preset length threshold, it considers that the original document cannot be fully understood and utilized within a single context. The system then acquires the business knowledge domain of the original document, detects its business value based on the business knowledge domain, and dynamically generates document parameters based on the business value. These parameters specifically include source information, permission identifiers, and compliance attributes. By introducing business knowledge domain judgment and business value detection at this stage, the system avoids feeding excessively long and complex documents directly into subsequent semantic modeling or retrieval processes. This prevents misunderstandings and retrieval noise caused by context overload, semantic dilution, or structural complexity, improving the stability of the overall knowledge processing flow. For documents with high business value, the system can generate more refined document parameters and route them into the key processing flow. For documents with low business value or that are not currently applicable, strategies such as delayed processing, de-weighted indexing, or isolated storage can be adopted to achieve reasonable allocation and differentiated governance of knowledge resources. Furthermore, by dynamically generating document parameters that include source information, permission identifiers, and compliance attributes, the system can complete the necessary security and compliance marking before the document enters the subsequent knowledge splitting, indexing, or retrieval process.
[0033] It should be noted that the process involves obtaining the business knowledge domain of the original document, detecting the business value of the original document based on the business knowledge domain, and dynamically generating document parameters for the original document based on the business value. Specifically: After the system determines that the length of the original document content has reached a preset threshold, the system first identifies the business knowledge domain to which the original document belongs based on the document source, semantic features of the content, and the enterprise's existing business classification system, in order to clarify the document's position in the enterprise's overall business structure. Subsequently, the system combines the business knowledge domain to detect and evaluate the business value of the original document. The business value can be judged by comprehensively considering factors such as the degree of business criticality involved in the document, expected frequency of use, timeliness, and degree of support for core business processes or decisions. On this basis, the system dynamically generates corresponding document parameters for the original document according to different levels of business value, which are used to describe and constrain the processing method and scope of use of the document in subsequent knowledge processing processes. The document parameters may include the document's source information, permission identifier, compliance attributes, and priority markers, thereby achieving refined management and controllable utilization of ultra-long original documents. Specific examples are as follows: For example, in the enterprise management and R&D knowledge base, a very long original document is a compilation of annual technical specifications. The system identifies its business knowledge domain as "core technical specifications" through semantic analysis. 
Based on this business knowledge domain, the system further detects that the document has high supporting value for multiple product development and system operation and maintenance scenarios, and therefore assesses its business value as high-level. Accordingly, the system dynamically generates document parameters for the original document, including marking its source as an official technical department, setting access permissions that are restricted to specific R&D roles, and assigning it high compliance and priority processing attributes. In subsequent processing, the document can be broken down into multiple key knowledge fragments that are prioritized for indexing and retrieval, while being subject to corresponding permissions and compliance parameters when accessed and used, thereby fully releasing its business value while ensuring security.
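Steps S21 to S23 can be illustrated with the following minimal sketch. The length threshold, the equal-weight averaging of the value factors, the 0.7 cut-off for high value, and the parameter values are assumptions introduced for illustration; the embodiment does not fix how business value is computed.

```python
from dataclasses import dataclass

LENGTH_THRESHOLD = 1000  # hypothetical preset threshold, in words


@dataclass(frozen=True)
class DocumentParameters:
    source: str
    permission_id: str
    compliance: str
    priority: str


def business_value(criticality, usage_frequency, timeliness):
    # Assumed scoring: equal-weight mean of three factors, each in [0, 1].
    return (criticality + usage_frequency + timeliness) / 3


def generate_parameters(doc, high_value=0.7):
    # S22: documents below the threshold are processed as a whole.
    if doc["length"] < LENGTH_THRESHOLD:
        return None
    # S23: business value within the document's knowledge domain drives the parameters.
    value = business_value(doc["criticality"], doc["usage_frequency"], doc["timeliness"])
    if value >= high_value:
        return DocumentParameters(
            source=doc["source"],
            permission_id="restricted:" + doc["domain"],
            compliance="high",
            priority="priority",
        )
    # Low-value long documents are deferred rather than indexed immediately.
    return DocumentParameters(
        source=doc["source"],
        permission_id="default",
        compliance="standard",
        priority="deferred",
    )


spec = {
    "length": 48000, "source": "technical department",
    "domain": "core technical specifications",
    "criticality": 0.9, "usage_frequency": 0.8, "timeliness": 0.7,
}
memo = dict(spec, length=300)
archive = dict(spec, criticality=0.2, usage_frequency=0.1, timeliness=0.3)

spec_params = generate_parameters(spec)
memo_params = generate_parameters(memo)
archive_params = generate_parameters(archive)
```

Because the parameters are generated before splitting and indexing, every knowledge fragment later derived from the document can inherit the same permission and compliance marking.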
[0034] In this embodiment, step S4, which determines whether the knowledge query request matches the preset query rights of the knowledge storage terminal, further includes: S41: Identify the user's access behavior based on the frequency of the knowledge query requests initiated within a preset time period; S42: Determine whether the access behavior is cross-domain; S43: If so, detect the cross-domain degree of the access behavior, obtain the sensitive fields of the knowledge query request based on the cross-domain degree, and dynamically desensitize the knowledge output content of the knowledge query request based on the sensitive fields. The cross-domain degree specifically includes the number of business domains, sensitivity level, and degree of responsibility deviation.
[0035] In this embodiment, the system identifies user access behavior to the knowledge storage terminal based on the frequency of knowledge query requests initiated within a preset time period. The system then determines whether the access behavior involves cross-domain issues and executes corresponding steps accordingly. For example, if the system determines that the user's access behavior to the knowledge storage terminal does not involve cross-domain issues, the system considers the user's knowledge query behavior to comply with their identity permissions and preset usage guidelines in terms of access frequency, access scope, and business context. The system identifies this access behavior as normal and compliant, continues to process the user's query requests according to the established knowledge service process, maintains the current access policy and permission configuration, and allows the user to continue performing knowledge retrieval, query vector generation, and hybrid retrieval operations within their authorized scope. Simultaneously, the system records this access behavior normally for subsequent behavior statistics, usage analysis, or model optimization, without triggering additional security restrictions or alarm mechanisms. Furthermore, the system can appropriately increase the response priority or cache hit rate of the user's query requests based on the user's stable and compliant access behavior characteristics, thereby further improving the user experience and overall system efficiency while ensuring the security and stable operation of the knowledge storage terminal. Conversely, if the system detects that a user's access to the knowledge storage terminal involves cross-domain issues, it will consider the user's knowledge query behavior to be inconsistent with identity permissions and usage guidelines. 
The system will then detect the extent of the cross-domain behavior, specifically including the number of business domains, sensitivity level, and degree of deviation from responsibilities. Based on different levels of cross-domain issues, the system will obtain the sensitive fields of the knowledge query request and dynamically desensitize the knowledge output content of the request. By comprehensively evaluating the cross-domain issue from multiple dimensions such as the number of business domains, sensitivity level, and degree of deviation from responsibilities, the system can distinguish between minor business boundary deviations and high-risk unauthorized access behaviors, providing a reliable basis for subsequent differentiated processing strategies. This improves the accuracy and rationality of access control. At the same time, by masking, obscuring, or replacing key fields, values, identifiers, or sensitive descriptions, the system can effectively prevent the improper acquisition of sensitive information and avoid the business interruption caused by complete denial of service, achieving a balance between security and availability. Furthermore, by incorporating cross-domain identification, sensitive field extraction, and dynamic desensitization mechanisms into a unified process, the system can proactively protect sensitive information during the knowledge output stage, effectively reducing data leakage and compliance risks.
[0036] It should be noted that the detection of the cross-domain extent of the access behavior, the acquisition of sensitive fields in the knowledge query request based on the extent of the cross-domain extent, and the dynamic desensitization of the knowledge output content of the knowledge query request based on the sensitive fields are specifically as follows: After determining that a user's access behavior involves cross-domain issues, the system first detects and quantifies the degree of cross-domain involvement. This degree is not a single indicator but rather a comprehensive assessment considering factors such as the number of business domains involved, the sensitivity level of the accessed knowledge, and the degree of deviation between the user's current responsibilities and the target knowledge domain. This results in an objective assessment of the access risk. Based on this, the system further analyzes the semantic content of the knowledge query request according to different levels of cross-domain involvement, identifying fields that may contain sensitive information, such as key business indicators, core technical parameters, internal personnel information, or unpublished strategic conclusions. Subsequently, based on the identified sensitive fields, the system dynamically adjusts the presentation of the knowledge output content. Through desensitization strategies such as field masking, content blurring, numerical range representation, or summary expression, the system processes the generated knowledge results in real time, ensuring that the output information meets basic query requirements while not exposing sensitive details beyond the user's authorized scope. Specific examples are as follows: For example, a user belonging to the marketing department initiates multiple knowledge query requests targeting the R&D business domain within a short period. 
The system determines that this access behavior constitutes cross-domain access and further identifies its cross-domain level as "medium risk" because it involves cross-business domains but does not touch the highest level of sensitivity. The system then identifies several sensitive fields from the knowledge query requests, such as specific algorithm implementation details, core performance parameters, and internal version codes. Based on these sensitive fields, when generating knowledge output content, the system provides a general description of the algorithm implementation process, replaces precise performance parameters with range descriptions, and anonymizes the internal version codes. Ultimately, the user can still gain an understanding of the overall R&D solution's approach and business value, but cannot access critical sensitive information beyond their responsibilities, thus effectively reducing the security risks associated with cross-domain access.
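The cross-domain assessment and dynamic desensitization described above can be sketched roughly as below. This is a minimal illustration under stated assumptions: the thresholds in `cross_domain_level`, the two regular expressions standing in for sensitive-field detection, and the 10 ms range width are all hypothetical and do not come from the description.

```python
import re


def cross_domain_level(num_domains: int, sensitivity: int, deviation: float) -> str:
    """Combine the three dimensions into a coarse risk level.

    sensitivity: 1 (low) to 3 (highest); deviation: 0.0 to 1.0.
    All thresholds are illustrative assumptions.
    """
    if sensitivity >= 3 or (num_domains >= 3 and deviation >= 0.7):
        return "high"
    if num_domains >= 2 or deviation >= 0.4:
        return "medium"
    return "low"


# Hypothetical patterns for two kinds of sensitive fields.
VERSION_CODE = re.compile(r"\bv\d+\.\d+\.\d+-[a-z0-9]+\b")
LATENCY = re.compile(r"\b(\d+(?:\.\d+)?) ms\b")


def _to_range(match: re.Match) -> str:
    """Replace a precise value with the 10 ms range that contains it."""
    low = int(float(match.group(1)) // 10) * 10
    return f"{low}-{low + 10} ms"


def desensitize(text: str, level: str) -> str:
    """Leave low-risk output untouched; mask sensitive fields otherwise."""
    if level == "low":
        return text
    text = VERSION_CODE.sub("[internal version]", text)
    return LATENCY.sub(_to_range, text)


level = cross_domain_level(num_domains=2, sensitivity=2, deviation=0.5)
print(level, "|", desensitize("Build v2.3.1-rc7 reaches 42.5 ms latency.", level))
```

Masking in place, rather than refusing the request, mirrors the stated goal of balancing security and availability.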
[0037] In this embodiment, step S1, which identifies the original knowledge corresponding to the multi-source heterogeneous documents pre-received by the knowledge storage terminal, further includes: S11: Based on the historical knowledge pre-collected by the knowledge storage terminal, collect the associated content of the multi-source heterogeneous documents, wherein the associated content specifically includes the same business object, the same process, and the same technical point; S12: Determine whether the associated content has a reference relationship with the historical knowledge; S13: If so, obtain the text information of the multi-source heterogeneous document, identify the text quality of the multi-source heterogeneous document based on the text information, and dynamically cover the invalid content of the original knowledge based on the text quality. The text information specifically includes structural markers, preceding and following paragraphs, and layout features, and the text quality specifically includes missing characters, garbled characters, and typos.
[0038] In this embodiment, the system collects the associated content of these multi-source heterogeneous documents based on the historical knowledge pre-collected by the knowledge storage terminal. The associated content specifically includes the same business object, the same process, and the same technical point. The system then determines whether this associated content has a reference relationship with historical knowledge to execute corresponding steps. For example, when the system determines that the associated content of the multi-source heterogeneous documents does not have a reference relationship with historical knowledge, the system considers these uploaded multi-source heterogeneous documents to be new knowledge, independent experience summaries, or potential knowledge information that has not yet been systematically managed. The system will trigger a supplementary processing flow, independently marking the associated content, temporarily storing it as candidate new knowledge, and generating a unique knowledge identifier for subsequent tracking. In addition, the system can perform semantic clustering analysis based on business objects, processes, and technical points to assess whether the content should be included in an existing knowledge domain or managed as a new knowledge branch. Simultaneously, the system can submit this content to manual review or expert confirmation to determine its business validity, accuracy, and reusability. After confirmation, formal entry into the database, vectorization modeling, and index construction are completed, thereby continuously improving and expanding the overall knowledge system of the enterprise knowledge storage terminal. 
Conversely, when the system determines that the associated content of the multi-source heterogeneous documents has a reference relationship with historical knowledge, the system will consider that these multi-source heterogeneous documents inherit, reuse, or explicitly reference historical knowledge assets already accumulated in the knowledge storage terminal. The system will then obtain the text information of these multi-source heterogeneous documents, which includes structural markers, preceding and following paragraphs, and layout features. Based on this text information, the system identifies the text quality of the multi-source heterogeneous documents, specifically including missing text, garbled characters, and typos. Depending on the text quality, invalid content of the original knowledge is dynamically overwritten. Text information such as document structural markers, preceding and following paragraphs, and layout features supports refined analysis of the cited content, helping to accurately determine the authenticity and validity of citation relationships. Combined with the text quality detection results, the system identifies and dynamically overwrites invalid content in the original knowledge, such as missing text, garbled characters, or typos, preventing low-quality text from being incorrectly inherited or repeatedly stored. For missing or garbled references, the system can selectively ignore them or revert to high-quality historical versions. For content with typos or minor formatting errors, it can automatically correct them through rules or models. This continuously optimizes the quality of knowledge assets without disrupting the original knowledge structure. Furthermore, by dynamically overwriting invalid content based on text quality, the system can prevent low-value or noisy information from being amplified during repeated references and integrations. 
This ensures the credibility and traceability of core knowledge assets in the knowledge storage terminal. This mechanism enables the natural evolution of historical knowledge through "survival of the fittest" when it is inherited or reused by new documents, promoting the continuous expansion of the enterprise's knowledge system while maintaining a high-quality, maintainable, and sustainable state.
[0039] It should be noted that the process of obtaining the text information of the multi-source heterogeneous documents, identifying the text quality of the multi-source heterogeneous documents based on the text information, and dynamically overwriting invalid content of the original knowledge based on the text quality, specifically involves: After confirming that there is a reference or inheritance relationship between the multi-source heterogeneous document and historical knowledge, the system first extracts and parses the text information of the multi-source heterogeneous document in a structured manner. The text information includes not only the main text content itself, but also chapter titles, paragraph levels, list and table tags, contextual relationships between paragraphs, and layout features, which are used to restore the semantic structure and logical continuity of the document. On this basis, the system evaluates the text quality by combining rule detection and model recognition, focusing on identifying low-quality features such as missing content, garbled characters, broken format, or obvious typos. Subsequently, based on different text quality judgment results, the system dynamically overwrites the corresponding content in the original knowledge, such as blocking invalid segments that cannot be parsed, replacing erroneous references with high-quality text, or reverting to a semantically complete and reliable version in historical knowledge, thereby ensuring that the knowledge storage terminal always retains understandable and usable effective knowledge. Specific examples are as follows: For example, the system receives a scanned technical summary document and determines that it references a core technology description already existing in the knowledge storage terminal. 
While extracting text information, the system finds that some paragraphs in the document contain a large amount of garbled text due to OCR recognition issues, and that key steps are missing. After text quality assessment, the system marks these paragraphs as low-quality text and automatically ignores the corresponding content during knowledge integration. Simultaneously, it retrieves semantically complete and formatted original technical descriptions from historical knowledge to overwrite them. For paragraphs with only a few typos or minor formatting errors, the system retains the latest wording, automatically corrects the errors, and updates the original knowledge. Through this processing method, the knowledge storage terminal avoids low-quality documents contaminating core knowledge and achieves continuous optimization and updating of knowledge content.
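The quality check and overwrite step in the scanned-document example can be illustrated with the sketch below. It is a simplified stand-in, assuming that garbled OCR output shows up as Unicode replacement characters or non-printable characters and that typos are fixed from a small correction table; the names `garbled_ratio`, `classify_quality`, and `integrate` and the 10% threshold are hypothetical.

```python
def garbled_ratio(text: str) -> float:
    """Share of characters that are U+FFFD or otherwise non-printable.

    Empty text counts as fully invalid (missing content).
    """
    if not text:
        return 1.0
    bad = sum(
        1 for ch in text
        if ch == "\ufffd" or not (ch.isprintable() or ch.isspace())
    )
    return bad / len(text)


def classify_quality(text: str, typo_fixes: dict, max_garbled: float = 0.1) -> str:
    """Return 'invalid', 'fixable', or 'ok'; the threshold is illustrative."""
    if garbled_ratio(text) > max_garbled:
        return "invalid"
    if any(wrong in text for wrong in typo_fixes):
        return "fixable"
    return "ok"


def integrate(paragraph: str, historical: str, typo_fixes: dict) -> str:
    """Pick the text to keep when a new paragraph references historical knowledge."""
    quality = classify_quality(paragraph, typo_fixes)
    if quality == "invalid":
        # Revert to the semantically complete historical version.
        return historical
    if quality == "fixable":
        # Keep the latest wording but correct known typos.
        for wrong, right in typo_fixes.items():
            paragraph = paragraph.replace(wrong, right)
    return paragraph


fixes = {"srew": "screw"}
print(integrate("St\ufffd\ufffd 1: \ufffd\ufffd\ufffd", "Step 1: connect the probe.", fixes))
print(integrate("Tighten the srew to 5 Nm.", "Tighten the bolt.", fixes))
```

A production system would likely combine rules like these with model-based recognition, as the paragraph above notes.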
[0040] Referring to Figure 2, an intelligent retrieval and generation system for a knowledge hub, as described in one embodiment of the present invention, includes: The identification module 10 is used to identify the original knowledge corresponding to the multi-source heterogeneous documents pre-received by the knowledge storage terminal, wherein the multi-source heterogeneous documents specifically include text, tables, images and scanned documents; The judgment module 20 is used to determine whether the original document can be converted into a preset knowledge fragment; The execution module 30 is configured to, if so, construct a semantic similarity-oriented vector index for the knowledge fragment based on the keywords of the original knowledge, integrate the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal, and receive the user's knowledge query request based on the hybrid retrieval strategy, wherein the vector index specifically includes semantic vectors, structural information and metadata; The second judgment module 40 is used to determine whether the knowledge query request matches the preset query rights of the knowledge storage terminal, wherein the query rights specifically include user identity, role permissions and business scenarios; The second execution module 50 is used to, if a match is found, dynamically constrain the retrieval scope of the hybrid retrieval strategy by means of the query rights, perform multi-way retrieval in the hybrid index, perform secondary processing on the highly relevant knowledge fragments obtained from the retrieval, generate corresponding external knowledge output information, and dynamically mark the traceability description of the external knowledge output information in the knowledge storage terminal. 
The dynamic constraints specifically include content importance, semantic integrity and context length limits, and the traceability description specifically includes the source document, paragraph position and knowledge identifier.
[0041] In this embodiment, the identification module 10 receives, through the knowledge storage terminal, multi-source heterogeneous documents uploaded by other users within the enterprise. These multi-source heterogeneous documents specifically include text, tables, images, and scanned documents. The identification module 10 identifies the original knowledge corresponding to these multi-source heterogeneous documents. Then, the judgment module 20 determines whether this original knowledge can be converted into a preset knowledge fragment to execute the corresponding steps. For example, when the system determines that the original knowledge corresponding to these multi-source heterogeneous documents cannot be converted into a knowledge fragment, the system will consider that the document content has low relevance to the system's preset business knowledge domain and does not have obvious knowledge value. The system will mark the corresponding original knowledge as "cannot be automatically segmented" or "low-confidence knowledge" and temporarily store it in a separate buffer or isolated area to prevent it from entering the vectorization and retrieval process and affecting the overall retrieval quality. Simultaneously, based on the reason for failure, appropriate subsequent actions are selected, such as triggering secondary parsing or content enhancement processing (e.g., OCR re-recognition, structural repair, semantic completion). The original knowledge is then pushed to the manual review or knowledge administrator processing queue, awaiting manual confirmation, supplementation, or adjustment before re-entering the knowledge fragment generation process. Conversely, when the system determines that the original knowledge corresponding to these multi-source heterogeneous documents can be converted into knowledge fragments, the execution module 30 will consider that the document content is closely related to the preset business knowledge domain. 
The system will then construct, based on the keywords of this original knowledge, a semantic similarity-oriented vector index for the knowledge fragments; the vector index specifically includes semantic vectors, structural information, and metadata. The keywords and the vector index are integrated as the hybrid retrieval strategy of the knowledge storage terminal, which is used to receive knowledge query requests for these knowledge fragments from other users within the enterprise. At this stage, the system extracts keywords and performs semantic modeling on the original knowledge, transforming scattered and heterogeneous document content into standardized, computable knowledge units. This provides a high-quality input foundation for subsequent retrieval and applications, ensuring the reliability and consistency of the enterprise's knowledge base content from the source. Simultaneously, semantic vectors enable the system to understand the true semantic intent of user queries, improving its ability to recognize synonymous expressions and complex questions, while keyword retrieval and the introduction of structural information enable accurate matching of technical terms, fixed expressions, and specific clauses. Together these effectively avoid the bias caused by a single retrieval method and enable precise positioning of knowledge fragments. Furthermore, with the support of the hybrid retrieval strategy, the system can stably and efficiently respond to query requests for knowledge fragments from other users within the enterprise, allowing users in different roles and business scenarios to quickly obtain the knowledge they need. Then, the second judgment module 40 determines whether the knowledge query request matches the query rights preset by the knowledge storage terminal. 
The query rights specifically include user identity, role permissions, and business scenarios, and the result of this determination decides which steps are executed next. For example, when the system determines that a user's knowledge query request does not match the preset query rights of the knowledge storage terminal, the system will consider that the query request does not meet the current knowledge access conditions in terms of permission boundaries, business adaptability, or security compliance. The system will adopt a controlled processing strategy to ensure the security and manageability of the knowledge storage terminal: it immediately terminates the current query process, returns a prompt message to the user indicating insufficient permissions or restricted access, avoids exposing any specific knowledge content, and records and marks the unmatched query request, including user identity, query time, request content, and reason for failure. Further branch processing will be executed according to the enterprise's management strategy, such as guiding the user through permission application or role upgrade processes, automatically recommending alternative knowledge content that matches their permission scope, and triggering security alarms and access restriction mechanisms when frequent abnormal requests are detected. Conversely, when the system determines that a user's knowledge query request matches the preset query rights of the knowledge storage terminal, the second execution module 50 will consider that the query request meets the current knowledge access conditions. The system will dynamically constrain the retrieval scope of the hybrid retrieval strategy according to the query rights. The dynamic constraints specifically include content importance, semantic integrity, and context length limits. Multi-way retrieval is performed in the hybrid index, and the highly relevant knowledge fragments obtained undergo secondary processing to generate corresponding external knowledge output information. 
The traceability descriptions of these external knowledge outputs are dynamically marked in the knowledge storage terminal, including the source document, paragraph position, and knowledge identifier. By dynamically constraining content importance, semantic integrity, and context length, the system can prioritize knowledge fragments highly relevant to current business decisions or operations, while ensuring compliance and security. This prevents irrelevant or unauthorized information from entering the search results, thereby improving the controllability and relevance of knowledge access. Simultaneously, through fine-tuning, trimming, and semantic reorganization of the search results, the system can generate more clearly structured and complete external knowledge output information, making the final output content more aligned with the user's query intent. This improves the accuracy and readability in intelligent question answering and knowledge recommendation scenarios. Furthermore, while generating external knowledge output information, the system dynamically marks the corresponding traceability descriptions in the knowledge storage terminal, including the source document, paragraph position, and knowledge identifier, ensuring that each output result has a clear source basis.
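To make the flow of modules 30 to 50 concrete, the following sketch combines a permission constraint, a two-channel hybrid score, and a traceability record. It is an illustration under stated assumptions, not the claimed implementation: bag-of-words cosine similarity stands in for a real semantic vector model, the fusion weight `alpha` is arbitrary, and the `Fragment` fields and function names are hypothetical.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class Fragment:
    knowledge_id: str
    source_document: str
    paragraph: int
    domain: str
    text: str


def tokens(text: str) -> list:
    """Lowercase whitespace tokens with surrounding punctuation removed."""
    stripped = (t.strip(".,;:!?").lower() for t in text.split())
    return [t for t in stripped if t]


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two term-count vectors (stand-in for embeddings)."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def keyword_score(query_terms: list, fragment_terms: list) -> float:
    """Fraction of distinct query terms that appear verbatim in the fragment."""
    q = set(query_terms)
    return len(q & set(fragment_terms)) / len(q) if q else 0.0


def hybrid_search(query, fragments, allowed_domains, alpha=0.5, top_k=3):
    """Score permitted fragments on both channels and attach a trace to each hit."""
    q = tokens(query)
    hits = []
    for frag in fragments:
        if frag.domain not in allowed_domains:
            continue  # the permission constraint is applied before any scoring
        ft = tokens(frag.text)
        score = alpha * cosine(Counter(q), Counter(ft)) + (1 - alpha) * keyword_score(q, ft)
        if score > 0:
            hits.append({
                "text": frag.text,
                "score": round(score, 3),
                "trace": {
                    "source_document": frag.source_document,
                    "paragraph": frag.paragraph,
                    "knowledge_id": frag.knowledge_id,
                },
            })
    return sorted(hits, key=lambda h: h["score"], reverse=True)[:top_k]


corpus = [
    Fragment("K1", "spec.pdf", 3, "rd", "Reset the controller before calibration."),
    Fragment("K2", "hr.docx", 1, "hr", "Salary bands for controller roles."),
    Fragment("K3", "ops.md", 7, "ops", "Calibration schedule for pumps."),
]
for hit in hybrid_search("controller calibration", corpus, allowed_domains={"rd", "ops"}):
    print(hit["trace"]["knowledge_id"], hit["score"])
```

Filtering by domain before scoring means unauthorized fragments never enter the candidate set, so they cannot leak through ranking or post-processing.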
[0042] In this embodiment, the system also includes: The acquisition module is used to split the knowledge fragment into segments based on the semantic blocks preset by the knowledge storage terminal, and acquire the segmented data of the knowledge fragment. The third judgment module is used to determine whether the segment length of the segmented data exceeds a preset length threshold. The third execution module is used to, if so, identify the semantic boundary of the segmented data, perform keyword extraction processing on the segmented data according to the semantic boundary, obtain the enterprise's preset business semantic keywords, introduce the context fragment of the knowledge fragment and associate it with the segmented data according to the coverage of the business semantic keywords, and dynamically mark the confidence samples of the segmented data through the knowledge storage terminal.
[0043] In this embodiment, the system splits knowledge fragments into segments based on the semantic blocks preset in the knowledge storage terminal, obtaining segmented data of these knowledge fragments. The system then determines whether the segment length of the segmented data exceeds a preset length threshold to execute corresponding steps. For example, if the system determines that the segment length of the segmented data does not exceed the preset length threshold, the system considers the knowledge fragment to have clear semantic boundaries and appropriate information density under the current semantic block rules, and that it can be fully understood and utilized within a single context. The system then marks it as a standard knowledge fragment that can enter the subsequent processing flow, performs operations such as semantic mapping, data generation, keyword extraction, and structural information binding based on this segmented data, and incorporates it into the semantic similarity-oriented vector index construction process. Simultaneously, the segmented data is associated and stored with its corresponding original document, paragraph position, and knowledge identifier for direct retrieval during subsequent hybrid retrieval, multi-way retrieval, and knowledge generation processes. Furthermore, the system records the judgment result of whether the segmented data meets the length threshold, which serves as a reference for subsequent retrieval ranking, contextual concatenation, or dynamic constraints. This improves overall processing efficiency and knowledge utilization while ensuring semantic integrity. Conversely, when the system determines that the segment length of the segmented data exceeds the preset length threshold, the system determines that the knowledge fragment cannot be fully understood and utilized under the current semantic block rules. 
The system identifies the semantic boundaries of the segmented data and, based on these boundaries, performs keyword extraction processing and obtains the enterprise's preset business semantic keywords. Based on the coverage of these business semantic keywords, context fragments of the knowledge fragment are introduced and associated with the segmented data, and the confidence samples of the segmented data are dynamically marked through the knowledge storage terminal. Because context fragments are introduced selectively, according to the coverage of business semantic keywords, only those highly relevant to the current segment are added, so that the recombined knowledge fragment retains its semantic continuity and logical integrity while maintaining a controlled length. This enhances the understandability and accuracy of knowledge fragments in subsequent retrieval, matching, and generation processes. Furthermore, selecting and associating context fragments by keyword coverage ensures that the retained content better aligns with core business scenarios and reduces interference from irrelevant information. This helps improve the relevance and hit rate of knowledge fragments in hybrid retrieval and semantic matching processes, thereby enhancing the overall business adaptability of knowledge services. Moreover, the confidence-marking mechanism not only provides a reference for subsequent retrieval ranking and generation calls but also provides a basis for manual review, model optimization, and knowledge governance, including identifying potentially ambiguous fragments and continuously optimizing semantic segmentation rules.
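One possible reading of the length check, boundary splitting, context association, and confidence marking is sketched below. Several simplifications are assumptions of this sketch: sentence ends stand in for semantic boundaries, keyword coverage is computed by substring matching, the preceding segment is the only context candidate, and the resulting coverage is used directly as the confidence mark.

```python
import re


def split_on_boundaries(text: str) -> list:
    """Split at sentence ends, used here as stand-in semantic boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def coverage(text: str, business_keywords: list) -> float:
    """Fraction of business keywords that occur in the text."""
    if not business_keywords:
        return 0.0
    lowered = text.lower()
    return sum(1 for k in business_keywords if k in lowered) / len(business_keywords)


def process_chunk(chunk: str, business_keywords: list, max_len: int = 80, min_coverage: float = 0.5) -> list:
    """Return segments with optional context and a coverage-based confidence."""
    if len(chunk) <= max_len:
        # Within the threshold: keep as a standard fragment.
        return [{"text": chunk, "context": None, "confidence": coverage(chunk, business_keywords)}]
    segments = split_on_boundaries(chunk)
    out = []
    for i, seg in enumerate(segments):
        conf = coverage(seg, business_keywords)
        context = None
        if conf < min_coverage and i > 0:
            # Low coverage: associate the preceding segment as context.
            context = segments[i - 1]
            conf = coverage(context + " " + seg, business_keywords)
        out.append({"text": seg, "context": context, "confidence": conf})
    return out


long_chunk = (
    "The pump controller must be calibrated monthly. "
    "It should then be restarted. "
    "Record the calibration result in the maintenance log."
)
for item in process_chunk(long_chunk, ["pump", "controller"]):
    print(item)
```

Segments that still score low after context is attached are natural candidates for the manual review mentioned above.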
[0044] In this embodiment, the execution module further includes: The identification unit is used to identify the user's identity verification content based on the business scenario corresponding to the knowledge query request; The judgment unit is used to determine whether the identity verification content can pass the verification requirements of the knowledge storage terminal; The execution unit is configured to, if so, obtain the business semantics of the knowledge query request, obtain the user's identity permissions based on the business semantics, dynamically restrict the accessible knowledge domain of the knowledge storage terminal based on the identity permissions, generate a query vector corresponding to the knowledge query request within the accessible knowledge domain, and construct the query result of the query vector.
[0045] In this embodiment, the system identifies the identity verification content of users based on the business scenario corresponding to the knowledge query request. Then, the system determines whether this identity verification content can pass the verification requirements of the knowledge storage terminal, and executes the corresponding steps accordingly. For example, when the system determines that a user's identity verification content cannot pass the verification requirements of the knowledge storage terminal, the system considers that the user does not have legitimate or complete access qualifications in the current business scenario. The system will take tiered handling and security protection measures, immediately terminating the current query process, returning an identity verification failure prompt message to the user, clearly informing them that they need to complete identity authentication, log in again, or switch to a legitimate business scenario before continuing access. Simultaneously, the system records and marks the identity verification failure event, including user identifier, business scenario, verification failure reason, and time information, for subsequent security auditing and abnormal behavior analysis. Furthermore, the system can trigger further security policies based on the risk level, such as limiting the user's access frequency within a certain period or requiring secondary verification. Conversely, when the system determines that a user's identity verification content can pass the verification requirements of the knowledge storage terminal, the system considers that the user has legitimate or complete access qualifications in the current business scenario. Upon verifying the access rights of the user, the system acquires the business semantics of the knowledge query request. 
Based on the business semantics, it obtains the user's identity permissions and dynamically restricts the knowledge domains that are accessible in the knowledge storage terminal according to those identity permissions. Within the accessible knowledge domains, it generates the query vector corresponding to the knowledge query request and constructs the query result of that query vector. By dynamically restricting the accessible knowledge domains based on identity permissions, the system can ensure that users can only retrieve and use knowledge content within their legitimate business scope, effectively avoiding cross-business-domain or unauthorized access. This enhances system security while increasing the flexibility and adaptability of access control. Furthermore, since the query vector construction process excludes interference from irrelevant or inaccessible knowledge domains, the system can focus on knowledge content highly relevant to the user's actual needs during subsequent retrieval and matching. This improves the semantic relevance and hit accuracy of query results and reduces the output of irrelevant or noisy results. By forming a closed-loop processing mechanism of identity verification, business semantic recognition, identity permission matching, and query vector construction, the system can provide efficient and intelligent knowledge services to legitimate users while ensuring the security and compliant use of enterprise knowledge assets.
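The verification and domain-restriction sequence above might look like the following sketch. The role-to-domain table, the shape of the `user` dictionary, and the assumption that the candidate domains have already been inferred from the business semantics of the query are all hypothetical.

```python
# Hypothetical role-to-domain mapping; a real deployment would read this from its permission system.
ROLE_DOMAINS = {
    "rd_engineer": {"rd", "operations"},
    "marketing": {"marketing"},
}


def verify_identity(user: dict, scenario: str) -> bool:
    """Pass only authenticated users who are registered for the business scenario."""
    return bool(user.get("authenticated")) and scenario in user.get("scenarios", ())


def scoped_query(user: dict, scenario: str, query_text: str, candidate_domains: list) -> dict:
    """Restrict the query to domains the user's role may access.

    candidate_domains is assumed to come from business-semantics analysis of the query.
    """
    if not verify_identity(user, scenario):
        return {"status": "denied", "reason": "identity verification failed"}
    allowed = ROLE_DOMAINS.get(user.get("role"), set()) & set(candidate_domains)
    if not allowed:
        return {"status": "denied", "reason": "no accessible knowledge domain"}
    return {"status": "ok", "query": query_text, "domains": sorted(allowed)}


engineer = {"authenticated": True, "role": "rd_engineer", "scenarios": ["product_development"]}
print(scoped_query(engineer, "product_development", "pump firmware limits", ["rd", "marketing"]))
```

The returned `domains` list is what a downstream step would use to limit query vector generation and retrieval.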
[0046] In this embodiment, the second execution module further includes: The second identification unit is used to identify the terminology information of the knowledge query request based on the query attributes of the knowledge query request, wherein the query attributes specifically include semantic complexity, number of keywords and query type; The second judgment unit is used to determine whether the term information reaches the preset retrieval weight; The second execution unit is configured to, if so, obtain the weight configuration of the preset retrieval channel based on the weight bias of the term information, dynamically trigger the preset parallel multi-path retrieval based on the weight configuration, generate corresponding candidate knowledge fragments and perform relevance scoring and ranking, and filter out the highly relevant knowledge fragments of the knowledge query request through the relevance scoring and ranking, wherein the weight bias specifically includes semantic description and natural language expression.
[0047] In this embodiment, the system identifies the terminology information of a knowledge query request based on its query attributes, specifically semantic complexity, number of keywords, and query type. The system then determines whether this terminology information reaches a preset retrieval weight and executes the corresponding steps. For example, if the system determines that the terminology information of the knowledge query request does not reach the preset retrieval weight, the system considers the query request, in its current state, to have insufficient terminology targeting or unclear semantic expression, making it difficult to support high-quality, controllable knowledge retrieval and matching. The system then adopts a combined guidance and compensation strategy. It performs semantic completion or terminology expansion, for example by using enterprise knowledge graphs, historical high-frequency queries, or a thesaurus to expand and reconstruct the query terms and raise their retrieval weight. Simultaneously, it returns guiding prompts or recommended example queries to the user, encouraging them to add more specific keywords, limit the business scope, or clarify the query type. The system also marks the request as a low-weight query and records its characteristics for subsequent model optimization and retrieval strategy adjustment. This gradually improves query quality and overall retrieval performance without compromising user experience. Conversely, when the system determines that the terminology information in the knowledge query request has reached the preset retrieval weight, the system considers the query request to have a clear semantic expression in its current state. Based on the weight bias of the terminology information (including semantic description and natural language expression), the system obtains the preset weight configuration for the retrieval channels. 
According to different weight configurations, it dynamically triggers the preset parallel multi-path retrieval, generates the corresponding candidate knowledge fragments, and scores and ranks them by relevance. Through this relevance scoring and ranking, the system selects the highly relevant knowledge fragments for the query request. By distinguishing whether the terminology information is biased toward semantic description or natural language expression, and matching the preset retrieval channel weight configuration, the system can dynamically select the most suitable retrieval path or combine multiple retrieval methods for parallel execution. This weight-driven channel triggering mechanism means the retrieval process no longer depends on a single strategy but adapts to query characteristics, which improves the flexibility and adaptability of the retrieval strategy. In addition, by dynamically triggering parallel multi-path retrieval, the system obtains candidate knowledge fragments from different retrieval channels at the same time. For example, semantic vector retrieval focuses on understanding the query intent, while keyword or rule retrieval focuses on accurately matching key terms. Furthermore, by scoring and ranking the candidate knowledge fragments by relevance and selecting the highly relevant ones, the system can prioritize returning knowledge content with strong semantic consistency, a high degree of business matching, and complete structure.
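The weight-driven parallel retrieval described above can be sketched as follows. This is only an illustration under stated assumptions: the weight values in `CHANNEL_WEIGHTS`, the toy corpus, the fixed scores in `TOY_SEMANTIC_SCORES` (standing in for a real embedding model), and the weighted-sum fusion rule are all hypothetical, since the embodiment names the channels but not their scoring or fusion formulas.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical weight configurations keyed by the query's weight bias.
CHANNEL_WEIGHTS = {
    "semantic_description": {"semantic": 0.8, "keyword": 0.2},
    "natural_language_expression": {"semantic": 0.5, "keyword": 0.5},
}

CORPUS = {
    "kb-1": "reset vpn password steps",
    "kb-2": "annual leave policy",
    "kb-3": "vpn client install guide",
}

# Stand-in for embedding similarity; a real system would compute this per query.
TOY_SEMANTIC_SCORES = {"kb-1": 0.9, "kb-2": 0.1, "kb-3": 0.6}


def keyword_channel(query):
    """Score each fragment by the fraction of query terms it contains."""
    terms = query.lower().split()
    if not terms:
        return {}
    return {
        fid: sum(t in set(text.split()) for t in terms) / len(terms)
        for fid, text in CORPUS.items()
    }


def semantic_channel(query):
    """Placeholder semantic channel returning fixed toy scores (assumption)."""
    return dict(TOY_SEMANTIC_SCORES)


CHANNELS = {"keyword": keyword_channel, "semantic": semantic_channel}


def hybrid_retrieve(query, channels, weights, top_k=2):
    """Run all channels in parallel, fuse scores by weighted sum, and rank."""
    with ThreadPoolExecutor(max_workers=len(channels)) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in channels.items()}
        results = {name: fut.result() for name, fut in futures.items()}
    fused = {}
    for name, scores in results.items():
        weight = weights.get(name, 0.0)
        for fid, score in scores.items():
            fused[fid] = fused.get(fid, 0.0) + weight * score
    # Highest fused score first; ties broken by fragment id for determinism.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]
```

For instance, `hybrid_retrieve("vpn password", CHANNELS, CHANNEL_WEIGHTS["natural_language_expression"])` ranks `kb-1` ahead of `kb-3`, because `kb-1` scores well in both channels.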
[0048] In this embodiment, the determination module further includes: The third identification unit is used to identify the content length of the original document based on the document structure pre-detected by the knowledge storage terminal, wherein the document structure specifically includes chapters, paragraphs and semantic boundaries; The third judgment unit is used to determine whether the length of the content reaches a preset length threshold. The third execution unit is configured to, if so, obtain the business knowledge domain of the original document, detect the business value of the original document based on the business knowledge domain, and dynamically generate document parameters of the original document based on the business value, wherein the document parameters specifically include source information, permission identifier, and compliance attributes.
[0049] In this embodiment, the system identifies the content length of original documents based on the document structure pre-detected by the knowledge storage terminal. The document structure specifically includes chapters, paragraphs, and semantic boundaries. The system then determines whether the content length reaches a pre-set length threshold to execute corresponding steps. For example, if the system determines that the content length of the original documents does not reach the pre-set length threshold, it considers the original document to have a moderate content size and concentrated semantics under the current document structure division rules, capable of being fully understood and utilized within a single context. No further splitting or complex reorganization is required. The system directly marks the original document as a valid document object that can be processed as a whole, inputting it as a complete piece of original knowledge or a small fragment of knowledge into subsequent processes. Based on this original document, it directly performs keyword extraction, semantic vector generation, and structural information binding operations, incorporating it as a whole into the knowledge fragment construction and vector indexing process. This avoids semantic fragmentation caused by excessive splitting. Simultaneously, the chapter, paragraph, and semantic boundary information of the original document are stored as structural metadata, providing a basis for subsequent retrieval, sorting, and tracing. The system also records the judgment result that the document did not reach the length threshold for subsequent index weight adjustment or context splicing decisions, thereby ensuring semantic integrity. 
This also improves overall processing efficiency and knowledge utilization. Conversely, when the system determines that the content length of the original document has reached the preset length threshold, it considers the original document too large to be fully understood or utilized within a single context. The system then acquires the business knowledge domain of the original document, detects its business value based on that business knowledge domain, and dynamically generates document parameters based on the business value. These parameters specifically include source information, permission identifiers, and compliance attributes. By introducing business knowledge domain-based judgment and business value detection at this stage, the system avoids feeding excessively long and complex documents directly into subsequent semantic modeling or retrieval processes. This prevents misunderstandings and retrieval noise caused by context overload, semantic dilution, or structural complexity, and improves the stability of the overall knowledge processing flow. For documents with high business value, the system can generate more refined document parameters and route them into the key processing flow. For documents with low business value or that are not currently applicable, strategies such as delayed processing, de-weighted indexing, or isolated storage can be adopted to achieve reasonable allocation and differentiated governance of knowledge resources. Furthermore, by dynamically generating document parameters that include source information, permission identifiers, and compliance attributes, the system can complete the necessary security and compliance marking before the document enters the subsequent knowledge splitting, indexing, or retrieval process.
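The length-threshold branch and document parameter generation can be sketched as below. The threshold value, the `DOMAIN_VALUE` table, the action names, and the rule that maps business value to a compliance attribute are all hypothetical choices made for illustration; the embodiment does not specify concrete values or how business value is computed.

```python
from dataclasses import dataclass
from typing import Optional

LENGTH_THRESHOLD = 2000  # hypothetical threshold, measured in characters

# Hypothetical lookup from business knowledge domain to business value.
DOMAIN_VALUE = {"contracts": "high", "meeting_notes": "low"}


@dataclass
class DocumentParams:
    source: str          # source information
    permission_id: str   # permission identifier
    compliance: str      # compliance attribute


@dataclass
class Routing:
    action: str
    params: Optional[DocumentParams]


def route_document(text: str, domain: str, source: str) -> Routing:
    """Index short documents whole; tag and route long ones by business value."""
    if len(text) < LENGTH_THRESHOLD:
        # Below the threshold: treat the document as a single knowledge fragment.
        return Routing(action="index_whole", params=None)
    value = DOMAIN_VALUE.get(domain, "low")   # unknown domains default to low
    params = DocumentParams(
        source=source,
        permission_id=f"perm:{domain}",
        compliance="restricted" if value == "high" else "internal",
    )
    action = "priority_split" if value == "high" else "deferred"
    return Routing(action=action, params=params)
```

In this sketch the parameters are attached before any splitting or indexing happens, which mirrors the point made above about completing security and compliance marking early.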
[0050] In this embodiment, the second judgment module further includes: The fourth identification unit is used to identify the user's access behavior based on the frequency of knowledge query requests initiated within a preset time period; The fourth judgment unit is used to determine whether the access behavior is cross-domain; The fourth execution unit is used to, if so, detect the cross-domain degree of the access behavior, obtain the sensitive fields of the knowledge query request based on the cross-domain degree, and dynamically desensitize the knowledge output content of the knowledge query request based on the sensitive fields. The cross-domain degree specifically includes the number of business domains, the sensitivity level, and the degree of deviation from responsibilities.
[0051] In this embodiment, the system identifies user access behavior toward the knowledge storage terminal based on the frequency of knowledge query requests initiated within a preset time period. The system then determines whether the access behavior is cross-domain and executes the corresponding steps. For example, if the system determines that the user's access behavior toward the knowledge storage terminal is not cross-domain, the system considers the user's knowledge query behavior to comply with their identity permissions and preset usage guidelines in terms of access frequency, access scope, and business context. The system identifies this access behavior as normal and compliant, continues to process the user's query requests according to the established knowledge service process, maintains the current access policy and permission configuration, and allows the user to continue performing knowledge retrieval, query vector generation, and hybrid retrieval operations within their authorized scope. Simultaneously, the system records this access behavior for subsequent behavior statistics, usage analysis, or model optimization, without triggering additional security restrictions or alarm mechanisms. Furthermore, the system can appropriately increase the response priority or cache hit rate of the user's query requests based on the user's stable and compliant access behavior, further improving the user experience and overall system efficiency while ensuring the security and stable operation of the knowledge storage terminal. Conversely, if the system determines that the user's access behavior toward the knowledge storage terminal is cross-domain, it considers the user's knowledge query behavior to be inconsistent with their identity permissions and usage guidelines. 
The system will then detect the extent of the cross-domain behavior, specifically including the number of business domains, sensitivity level, and degree of deviation from responsibilities. Based on different levels of cross-domain issues, the system will obtain the sensitive fields of the knowledge query request and dynamically desensitize the knowledge output content of the request. By comprehensively evaluating the cross-domain issue from multiple dimensions such as the number of business domains, sensitivity level, and degree of deviation from responsibilities, the system can distinguish between minor business boundary deviations and high-risk unauthorized access behaviors, providing a reliable basis for subsequent differentiated processing strategies. This improves the accuracy and rationality of access control. At the same time, by masking, obscuring, or replacing key fields, values, identifiers, or sensitive descriptions, the system can effectively prevent the improper acquisition of sensitive information and avoid the business interruption caused by complete denial of service, achieving a balance between security and availability. Furthermore, by incorporating cross-domain identification, sensitive field extraction, and dynamic desensitization mechanisms into a unified process, the system can proactively protect sensitive information during the knowledge output stage, effectively reducing data leakage and compliance risks.
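A minimal sketch of the cross-domain evaluation and dynamic desensitization follows. The scoring weights, the level cut-offs, the field names in `SENSITIVE_BY_LEVEL`, and the masking rule are hypothetical; the embodiment lists the three dimensions (number of business domains, sensitivity level, degree of deviation from responsibilities) but not how they are combined.

```python
# Hypothetical sets of fields to mask at each cross-domain level.
SENSITIVE_BY_LEVEL = {
    "low": set(),
    "medium": {"salary"},
    "high": {"salary", "phone", "customer_name"},
}


def cross_domain_level(domain_count: int, sensitivity: int, deviation: float) -> str:
    """Combine the three dimensions into a coarse level.

    domain_count: business domains touched outside the user's own (capped at 5)
    sensitivity:  sensitivity level of the requested content, 0 to 3
    deviation:    degree of deviation from the user's responsibilities, 0.0 to 1.0
    The weights 0.3 / 0.4 / 0.3 are illustrative assumptions.
    """
    score = (0.3 * min(domain_count, 5) / 5
             + 0.4 * sensitivity / 3
             + 0.3 * deviation)
    if score < 0.34:
        return "low"
    if score < 0.67:
        return "medium"
    return "high"


def mask(value: str) -> str:
    """Hide all but the last two characters of a value."""
    if len(value) <= 2:
        return "*" * len(value)
    return "*" * (len(value) - 2) + value[-2:]


def desensitize(record: dict, level: str) -> dict:
    """Return a copy of the output record with level-dependent fields masked."""
    to_mask = SENSITIVE_BY_LEVEL[level]
    return {k: mask(str(v)) if k in to_mask else v for k, v in record.items()}
```

Because the output is masked rather than refused, a query judged only mildly cross-domain still returns usable content, which matches the balance between security and availability described above.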
[0052] In this embodiment, the identification module further includes: The acquisition unit is used to acquire the associated content of the multi-source heterogeneous documents based on the historical knowledge pre-acquired by the knowledge storage terminal. The associated content specifically includes the same business object, the same process, and the same technical point. The fifth judgment unit is used to determine whether the associated content has a reference relationship with the historical knowledge; The fifth execution unit is configured to, if so, acquire the text information of the multi-source heterogeneous document, identify the text quality of the multi-source heterogeneous document based on the text information, and dynamically overwrite invalid content of the original knowledge based on the text quality. The text information specifically includes structural markers, preceding and following paragraphs, and layout features, and the text quality specifically includes missing characters, garbled characters, and typos.
[0053] In this embodiment, the system collects the associated content of the multi-source heterogeneous documents based on the historical knowledge pre-collected by the knowledge storage terminal. The associated content specifically includes the same business object, the same process, and the same technical point. The system then determines whether this associated content has a reference relationship with the historical knowledge and executes the corresponding steps. For example, when the system determines that the associated content of the multi-source heterogeneous documents does not have a reference relationship with historical knowledge, the system considers these uploaded documents to be new knowledge, independent experience summaries, or potential knowledge that has not yet been systematically managed. The system then triggers a supplementary processing flow: it marks the associated content independently, stores it temporarily as candidate new knowledge, and generates a unique knowledge identifier for subsequent tracking. In addition, the system can perform semantic clustering analysis based on business objects, processes, and technical points to assess whether the content should be included in an existing knowledge domain or managed as a new knowledge branch. Simultaneously, the system can submit this content for manual review or expert confirmation to determine its business validity, accuracy, and reusability. After confirmation, formal entry into the database, vectorization modeling, and index construction are completed, thereby continuously improving and expanding the overall knowledge system of the enterprise knowledge storage terminal. 
Conversely, when the system determines that the associated content of the multi-source heterogeneous documents has a reference relationship with historical knowledge, the system considers that these documents inherit, reuse, or explicitly reference historical knowledge assets already accumulated in the knowledge storage terminal. The system then obtains the text information of these multi-source heterogeneous documents, specifically structural markers, preceding and following paragraphs, and layout features. Based on this text information, the system identifies the text quality of the documents, specifically missing characters, garbled characters, and typos. Depending on the text quality, the system dynamically overwrites invalid content when the original knowledge is integrated in the knowledge storage terminal. Using structural markers, preceding and following paragraphs, and layout features allows a refined analysis of the cited content and helps determine the authenticity and validity of the reference relationships. Combined with the text quality detection results, the system identifies and dynamically overwrites invalid content in the original knowledge, such as missing characters, garbled characters, or typos, preventing low-quality text from being incorrectly inherited or repeatedly stored. For missing or garbled references, the system can selectively ignore them or revert to high-quality historical versions. For content with typos or minor formatting errors, it can correct them automatically through rules or models. This continuously optimizes the quality of knowledge assets without disrupting the original knowledge structure. Furthermore, by dynamically overwriting invalid content based on text quality, the system can prevent low-value or noisy information from being amplified during repeated references and integrations. 
This ensures the credibility and traceability of core knowledge assets in the knowledge storage terminal. This mechanism enables the natural evolution of historical knowledge through "survival of the fittest" when it is inherited or reused by new documents, promoting the continuous expansion of the enterprise's knowledge system while maintaining a high-quality, maintainable, and sustainable state.
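The quality-based overwrite decision can be illustrated with a short sketch. It assumes that garbled text shows up as the Unicode replacement character U+FFFD or stray control characters, that a small hypothetical `TYPO_FIXES` table stands in for the "rules or models" mentioned above, and that a 5% garbled ratio is the cut-off; none of these specifics come from the embodiment.

```python
import re

REPLACEMENT_CHAR = "\ufffd"   # typical residue of a failed character decode

# Hypothetical rule table standing in for a typo-correction rule set or model.
TYPO_FIXES = {"retreival": "retrieval", "knowlege": "knowledge"}
_TYPO_PATTERN = re.compile(
    r"\b(" + "|".join(map(re.escape, TYPO_FIXES)) + r")\b"
)


def garbled_ratio(text: str) -> float:
    """Fraction of characters that look like decoding debris."""
    if not text:
        return 1.0
    bad = sum(
        1 for ch in text
        if ch == REPLACEMENT_CHAR or (ord(ch) < 32 and ch not in "\n\t")
    )
    return bad / len(text)


def fix_typos(text: str) -> str:
    """Replace known misspellings using the rule table."""
    return _TYPO_PATTERN.sub(lambda m: TYPO_FIXES[m.group(1)], text)


def merge_reference(new_text: str, historical_text: str, max_garbled: float = 0.05):
    """Decide which text to keep when a new document references old knowledge.

    Missing or heavily garbled text falls back to the historical version;
    otherwise typos are corrected and the new text is kept.
    """
    if not new_text.strip() or garbled_ratio(new_text) > max_garbled:
        return historical_text, "reverted_to_history"
    fixed = fix_typos(new_text)
    status = "typos_corrected" if fixed != new_text else "accepted"
    return fixed, status
```

Returning a status alongside the text keeps a record of why a given version was retained, which could feed the traceability metadata discussed elsewhere in the description.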
[0054] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A method for intelligent retrieval and generation oriented towards a knowledge hub, characterized in that, Includes the following steps: Based on the multi-source heterogeneous documents pre-received by the knowledge storage terminal, the original knowledge corresponding to the multi-source heterogeneous documents is identified, wherein the multi-source heterogeneous documents specifically include text, tables, images and scanned documents; Determine whether the original document can be converted into a preset knowledge fragment; If possible, then based on the keywords of the original knowledge, construct a semantic similarity-oriented vector index for the knowledge fragment, integrate the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal, and receive the user's knowledge query request according to the hybrid retrieval strategy, wherein the vector index specifically includes semantic vectors, structural information and metadata; Determine whether the knowledge query request matches the preset query rights of the knowledge storage terminal, wherein the query rights specifically include user identity, role permissions, and business scenario; If a match is found, the search scope of the hybrid retrieval strategy is dynamically constrained through the query rights. Multi-way retrieval is performed in the hybrid index, and the highly relevant knowledge fragments obtained are processed in a secondary manner to generate corresponding external knowledge output information. The traceability description of the external knowledge output information is dynamically marked in the knowledge storage terminal. The dynamic constraints specifically include content importance, semantic integrity, and context length limits, and the traceability description specifically includes the source document, paragraph position, and knowledge identifier.
2. The intelligent retrieval and generation method for knowledge hubs according to claim 1, characterized in that, Before the step of constructing a semantic similarity-oriented vector index for the knowledge fragment based on the keywords of the original knowledge, and integrating the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal, the method further includes: Based on the semantic segmentation preset by the knowledge storage terminal, the knowledge fragment is split into segments to obtain the segmented data of the knowledge fragment; Determine whether the segment length of the data block exceeds a preset length threshold; If so, the semantic boundaries of the segmented data are identified, and keyword extraction processing is performed on the segmented data based on the semantic boundaries to obtain the enterprise's preset business semantic keywords. Based on the coverage of the business semantic keywords, the context fragments of the knowledge fragments are introduced and associated with the segmented data. Through the knowledge storage terminal, confidence samples of the segmented data are dynamically marked.
3. The intelligent retrieval and generation method for knowledge hubs according to claim 1, characterized in that, The step of receiving the user's knowledge query request according to the hybrid retrieval strategy further includes: Based on the business scenario corresponding to the knowledge query request, identify the user's identity verification content; Determine whether the identity verification content can pass the verification requirements of the knowledge storage terminal; If possible, the business semantics of the knowledge query request are obtained, the user's identity permissions are obtained based on the business semantics, the access knowledge domain of the knowledge storage terminal is dynamically restricted based on the identity permissions, a query vector corresponding to the knowledge query request is generated through the access knowledge domain, and the query result of the query vector is constructed.
4. The intelligent retrieval and generation method for knowledge hubs according to claim 1, characterized in that, The step of performing multi-way retrieval in the hybrid index, and further processing the retrieved highly relevant knowledge fragments to generate corresponding external knowledge output information, also includes: Based on the query attributes of the knowledge query request, the terminology information of the knowledge query request is identified, wherein the query attributes specifically include semantic complexity, number of keywords, and query type; Determine whether the terminology information reaches the preset retrieval weight; If so, then based on the weight bias of the terminology information, the weight configuration of the preset retrieval channel is obtained. According to the weight configuration, the preset parallel multi-path retrieval is dynamically triggered to generate corresponding candidate knowledge fragments and perform relevance scoring and ranking. Through the relevance scoring and ranking, the highly relevant knowledge fragments of the knowledge query request are selected. The weight bias specifically includes semantic description and natural language expression.
5. The intelligent retrieval and generation method for knowledge hubs according to claim 1, characterized in that, The step of determining whether the original document can be converted into a preset knowledge fragment further includes: Based on the document structure pre-detected by the knowledge storage terminal, the content length of the original document is identified, wherein the document structure specifically includes chapters, paragraphs, and semantic boundaries; Determine whether the length of the content has reached a preset length threshold; If so, the business knowledge domain of the original document is obtained, the business value of the original document is detected based on the business knowledge domain, and the document parameters of the original document are dynamically generated based on the business value. The document parameters specifically include source information, permission identifier, and compliance attributes.
6. The intelligent retrieval and generation method for knowledge hubs according to claim 1, characterized in that, The step of determining whether the knowledge query request matches the preset query rights of the knowledge storage terminal further includes: The user's access behavior is identified based on the frequency of the knowledge query requests initiated within a preset time period; Determine whether the access behavior is cross-domain; If so, the cross-domain degree of the access behavior is detected, and the sensitive fields of the knowledge query request are obtained according to the cross-domain degree. Based on the sensitive fields, the knowledge output content of the knowledge query request is dynamically desensitized. The cross-domain degree specifically includes the number of business domains, the sensitivity level, and the degree of deviation from responsibilities.
7. The intelligent retrieval and generation method for knowledge hubs according to claim 1, characterized in that, The step of identifying the original knowledge corresponding to the multi-source heterogeneous documents pre-received by the knowledge storage terminal further includes: Based on the historical knowledge pre-collected by the knowledge storage terminal, the associated content of the multi-source heterogeneous documents is collected, wherein the associated content specifically includes the same business object, the same process, and the same technical point; Determine whether the associated content has a reference relationship with the historical knowledge; If so, the text information of the multi-source heterogeneous document is obtained, the text quality of the multi-source heterogeneous document is identified based on the text information, and invalid content of the original knowledge is dynamically covered based on the text quality. The text information specifically includes structural markers, preceding and following paragraphs, and layout features, and the text quality specifically includes missing characters, garbled characters, and typos.
8. An intelligent retrieval and generation system oriented towards a knowledge hub, characterized in that, include: The identification module is used to identify the original knowledge corresponding to the multi-source heterogeneous documents pre-received by the knowledge storage terminal, wherein the multi-source heterogeneous documents specifically include text, tables, images and scanned documents; The judgment module is used to determine whether the original document can be converted into a preset knowledge fragment; The execution module is configured to, if possible, construct a semantic similarity-oriented vector index for the knowledge fragment based on the keywords of the original knowledge, integrate the keywords and the vector index as a hybrid retrieval strategy for the knowledge storage terminal, and receive the user's knowledge query request based on the hybrid retrieval strategy, wherein the vector index specifically includes semantic vectors, structural information and metadata; The second judgment module is used to determine whether the knowledge query request matches the preset query rights of the knowledge storage terminal, wherein the query rights specifically include user identity, role permissions and business scenarios; The second execution module is used to dynamically constrain the retrieval scope of the hybrid retrieval strategy if a match is found, by means of the query rights, to perform multi-way retrieval in the hybrid index, to perform secondary processing on the highly relevant knowledge fragments obtained from the retrieval, to generate corresponding external knowledge output information, and to dynamically mark the traceability description of the external knowledge output information in the knowledge storage terminal. 
The dynamic constraints specifically include content importance, semantic integrity and context length limits, and the traceability description specifically includes source document, paragraph position and knowledge identifier.
9. The intelligent retrieval and generation system oriented towards a knowledge hub according to claim 8, characterized in that, Also includes: The acquisition module is used to split the knowledge fragment into segments based on the semantic segmentation preset by the knowledge storage terminal, and acquire the segmented data of the knowledge fragment; The third judgment module is used to determine whether the segment length of the segmented data exceeds a preset length threshold; The third execution module is used to, if so, identify the semantic boundaries of the segmented data, perform keyword extraction processing on the segmented data according to the semantic boundaries, obtain the enterprise's preset business semantic keywords, introduce the context fragments of the knowledge fragment and associate them with the segmented data according to the coverage of the business semantic keywords, and dynamically mark the confidence samples of the segmented data through the knowledge storage terminal.
10. The intelligent retrieval and generation system oriented towards a knowledge hub according to claim 8, characterized in that, The execution module further includes: The identification unit is used to identify the user's identity verification content based on the business scenario corresponding to the knowledge query request; The judgment unit is used to determine whether the identity verification content can pass the verification requirements of the knowledge storage terminal; An execution unit is configured to, if possible, obtain the business semantics of the knowledge query request, obtain the user's identity permissions based on the business semantics, dynamically restrict the knowledge domain access of the knowledge storage terminal based on the identity permissions, generate a query vector corresponding to the knowledge query request through the access knowledge domain, and construct the query result of the query vector.