A vectorized construction method and system for manufacturing equipment repair knowledge
By performing type identification and structural complexity assessment on manufacturing equipment maintenance documents, and utilizing multi-modal semantic boundary matching rules and multi-source weighted voting mechanisms, the problem of inaccurate knowledge retrieval for manufacturing equipment maintenance in existing technologies is solved, achieving efficient and scalable vectorized construction and retrieval.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: SICHUAN DONGJIU TECHNOLOGY CO LTD
- Filing Date: 2026-05-25
- Publication Date: 2026-06-19
AI Technical Summary
Existing RAG technology suffers from problems in manufacturing equipment maintenance scenarios, such as coarse document segmentation, lack of differentiated processing capabilities, non-isolated knowledge base storage, and non-specialized similarity calculation, resulting in insufficient retrieval accuracy and reliability.
By identifying the type and assessing the structural complexity of maintenance documents, knowledge units are extracted and hierarchical paths are preserved using multi-modal semantic boundary matching rules. A multi-source weighted voting mechanism is adopted to determine brand affiliation. The documents are stored in isolation by brand and retrieved using cosine similarity and TOP-K filtering mechanisms. A brand-level routing and progressive degradation strategy is implemented.
The method improves the semantic integrity of knowledge segmentation and the accuracy of cross-brand retrieval, reduces the risk of large-model hallucination, and achieves efficient and scalable vectorized construction of maintenance knowledge.
Figure CN122240677A_ABST
Description
Technical Field
[0001] This invention relates to the field of manufacturing knowledge processing technology, and in particular to a vectorized construction method and system for manufacturing equipment maintenance knowledge.
Background Technology
[0002] Against the backdrop of the rapid development of intelligent manufacturing and the Industrial Internet, manufacturing enterprises have accumulated a large number of equipment maintenance documents, including maintenance manuals, fault code tables, historical work orders, and drawings. This knowledge mostly exists in unstructured or semi-structured forms such as PDFs, images, and text, and its use relies on manual reading and experience-based judgment, which makes it difficult to support fault diagnosis and maintenance decisions efficiently and accurately.
[0003] In recent years, Retrieval-Augmented Generation (RAG) technology has begun to be introduced into the field of industrial knowledge management. Its core idea is to segment documents, vectorize them, and store them in a vector database, and then combine this with a large language model to achieve intelligent question answering. However, existing RAG technologies generally have the following shortcomings when applied to manufacturing equipment maintenance scenarios:
[0004] The document segmentation method is crude: documents are often segmented by a fixed number of pages or characters, which destroys the natural structured semantic boundaries of chapters, fault codes, part numbers, etc. in the maintenance manual.
[0005] A uniform processing method is used for different types of maintenance documents (such as manuals with simple structures and manuals with complex structures), and there is a lack of ability to differentiate the processing of elements such as images, tables, and special numbers.
[0006] Vector libraries typically use a single knowledge base to store all documents, ignoring the knowledge isolation between different devices and brands, which leads to severe semantic interference during retrieval.
[0007] Similarity calculations mostly use general text similarity metrics and are not specifically optimized for the "strong semantics and weak length" characteristics of maintenance knowledge.
[0008] Therefore, there is an urgent need for a high-quality vectorized construction method for manufacturing equipment maintenance knowledge that is practical to engineer, so as to improve the accuracy of knowledge retrieval and the reliability of responses in intelligent maintenance systems.
Summary of the Invention
[0009] The purpose of this invention is to provide a vectorized construction method and system for manufacturing equipment maintenance knowledge, thereby improving the accuracy of knowledge retrieval and the reliability of responses in intelligent maintenance systems.
[0010] To achieve the above objectives, in a first aspect, the present invention provides a vectorized construction method for manufacturing equipment maintenance knowledge, comprising the following steps:
[0011] The acquired maintenance documents are type-identified and their structural complexity is evaluated. Based on the complexity level, the text stream in the documents and the text information in the images are extracted in a differentiated manner, and structured chapter blocks are output.
[0012] The starting boundaries of knowledge units in a document are detected by using multi-modal semantic boundary matching rules. Complete knowledge units are extracted from the starting position of the current semantic boundary to the starting position of the next semantic boundary, and the hierarchical path information of the knowledge units is preserved.
[0013] The device brand to which each knowledge unit belongs is determined by a multi-source weighted voting mechanism. Knowledge units under the same brand are sent to the embedding vectorization engine in batches according to the original document order to generate corresponding semantic vectors. The vectors and metadata are then stored in the corresponding brand-level vector knowledge base.
[0014] During retrieval, cosine similarity is used to measure the semantic distance between the user's question vector and the knowledge unit vector. The non-negativity of vectors is used to limit the similarity to the range of 0 to 1, and a set of candidate knowledge units is obtained through a preset similarity threshold and TOP-K screening mechanism.
[0015] Based on the device brand identified in the user query, the corresponding brand-level vector knowledge base is selected for retrieval. When the retrieval results do not meet the quality requirements, a progressive degradation strategy is executed in sequence, which includes lowering the similarity threshold, switching to a general cross-brand knowledge base, and enabling the online retrieval channel. The final number of knowledge units output is dynamically determined based on the confidence level of the results.
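By way of non-limiting illustration, the similarity screening in the retrieval step may be sketched as follows. The function names, the threshold of 0.6, and the K value of 3 are assumptions made for this sketch and are not specified in this application. For vectors whose components are all non-negative, the computed cosine similarity falls in the range of 0 to 1.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def screen_candidates(
    query: List[float],
    units: Dict[str, List[float]],
    threshold: float = 0.6,  # hypothetical preset threshold
    top_k: int = 3,          # hypothetical K
) -> List[Tuple[str, float]]:
    """Keep units at or above the threshold, then return the TOP-K by score."""
    scored = [(uid, cosine_similarity(query, vec)) for uid, vec in units.items()]
    kept = [(uid, score) for uid, score in scored if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:top_k]
```

In this sketch the threshold is applied before the TOP-K cut, so fewer than K units may be returned when few units are sufficiently similar.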
[0016] Extracting the text stream in the documents and the text information in the images in a differentiated manner based on the complexity level includes:
[0017] By analyzing the chapter hierarchy depth, image and table density, and numbering pattern diversity of maintenance documents, the documents are classified into simple or complex structures.
[0018] For documents with simple structures, images with high information gain potential are retained and the optical character recognition unit is used to extract the text information; for documents with complex structures, all image extraction is skipped and only the text content is retained.
[0019] The multi-modal semantic boundary matching rules include Chinese chapter-level rules, number-level rules, device-specific code rules, and composite numbering rules. Each rule is configured with a hierarchical priority. The system scans the document in priority order, and the position of a successfully matched line is marked as the starting boundary of the knowledge unit, and the hierarchical depth of the boundary is recorded.
[0020] Preserving the hierarchical path information of the knowledge units includes:
[0021] Maintain a temporary stack structure. When the depth of a detected boundary is greater than the depth at the top of the stack, push the boundary title onto the stack. When the depth is less than or equal to the depth at the top of the stack, pop the stack until the top depth is less than the current depth, and then push the current boundary onto the stack. When each knowledge unit is generated, concatenate all the titles in the stack into a hierarchical path string and store it with the unit.
[0022] Determining the equipment brand of each knowledge unit through the multi-source weighted voting mechanism includes:
[0023] Acquire, in priority order, the document-level brand tags, the brand keywords in file names and metadata, and the brand mention frequency and position weight in the knowledge unit text, and calculate weighted scores for the candidate brands. When the scores of multiple brands fall within a set range of one another, the knowledge unit is marked as multi-brand and stored in the general cross-brand knowledge base. When more than half of the units in the same document have identified brands, the units that are not identified inherit the brand tags of adjacent units under the context propagation rule.
[0024] Before feeding the knowledge units into the embedding vectorization engine in batches, the method further includes:
[0025] First, group the knowledge units by brand tag, then sort the units within each brand group in ascending order of their starting position in the original document, and then divide them into batches of fixed length. For knowledge units whose length exceeds the maximum input limit of the embedding model, a two-end truncation strategy is adopted, which retains the tokens at the beginning and at the end of the text and discards the middle part.
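By way of non-limiting illustration, the grouping, ordering, batching, and two-end truncation described above may be sketched as follows. The dictionary keys `brand` and `start`, the even split between head and tail tokens, and the function names are assumptions of this sketch; this application does not specify how many tokens are retained at each end.

```python
from collections import defaultdict
from typing import Dict, List


def truncate_two_ends(tokens: List[str], max_len: int, head_ratio: float = 0.5) -> List[str]:
    """Keep tokens from the beginning and the end, discarding the middle part."""
    if len(tokens) <= max_len:
        return tokens
    head = int(max_len * head_ratio)  # assumed split; not specified in the source
    tail = max_len - head
    return tokens[:head] + (tokens[-tail:] if tail > 0 else [])


def make_batches(units: List[dict], batch_size: int) -> Dict[str, List[List[dict]]]:
    """Group units by brand, sort by original start offset, cut into fixed-size batches."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for unit in units:
        groups[unit["brand"]].append(unit)
    batches: Dict[str, List[List[dict]]] = {}
    for brand, members in groups.items():
        members.sort(key=lambda u: u["start"])
        batches[brand] = [
            members[i:i + batch_size] for i in range(0, len(members), batch_size)
        ]
    return batches
```

The last batch of each brand may be shorter than `batch_size`; units of different brands are never mixed in one batch.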
[0026] The method further includes:
[0027] The similarity threshold is increased or decreased depending on whether the query type is a fault code query or a maintenance step query, and is adaptively fine-tuned based on the quantity density of knowledge units in the knowledge base; the system maintains a threshold cache table to record the optimal threshold for historical queries to accelerate repeated retrieval.
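By way of non-limiting illustration, the threshold adjustment and the threshold cache table may be sketched as follows. All numeric values, the direction of each adjustment (raising the threshold for fault code queries and for dense knowledge bases), and the choice to key the cache by normalized query text are assumptions of this sketch; this application states only that the threshold is adjusted by query type and knowledge unit density and that historical thresholds are cached.

```python
from typing import Dict

BASE_THRESHOLD = 0.60  # hypothetical base value
# Hypothetical offsets: the source does not state which query type raises the threshold.
QUERY_TYPE_OFFSET = {"fault_code": 0.10, "maintenance_step": -0.05}


def density_offset(unit_count: int) -> float:
    """Assumed fine-tuning by the number of knowledge units in the knowledge base."""
    if unit_count >= 10_000:
        return 0.05
    if unit_count < 500:
        return -0.05
    return 0.0


def resolve_threshold(
    query: str, query_type: str, unit_count: int, cache: Dict[str, float]
) -> float:
    """Return the cached threshold for a repeated query, else compute and cache it."""
    key = query.strip().lower()
    if key in cache:
        return cache[key]
    value = BASE_THRESHOLD + QUERY_TYPE_OFFSET.get(query_type, 0.0) + density_offset(unit_count)
    value = min(max(value, 0.0), 1.0)  # keep the threshold inside the similarity range
    cache[key] = value
    return value
```

This sketch caches the computed threshold; how an "optimal" threshold is determined from historical queries is left open in the text and is not modeled here.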
[0028] The progressive degradation strategy includes:
[0029] First, keep the target brand unchanged and temporarily lower the similarity threshold by one step to search again; if sufficient results are still not obtained, switch to the general cross-brand knowledge base and use the original threshold for retrieval; if still ineffective, enable the online retrieval channel to obtain public information and convert it into temporary knowledge units; if all downgrade steps yield no results, return an empty result set and record the miss event.
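By way of non-limiting illustration, the ordering of the degradation steps may be sketched as follows. The callables `search_kb` and `search_online`, the knowledge base name `general`, the step size of 0.1, and the `miss_log` list are hypothetical stand-ins introduced for this sketch; the initial brand-level retrieval is included as the first stage for completeness.

```python
from typing import Callable, List, Optional, Tuple

GENERAL_KB = "general"  # hypothetical name of the general cross-brand knowledge base


def retrieve_with_degradation(
    search_kb: Callable[[str, float], List[str]],  # (knowledge base, threshold) -> units
    search_online: Callable[[], List[str]],        # online channel -> temporary units
    brand: str,
    threshold: float,
    step: float = 0.1,        # hypothetical size of one threshold step
    min_results: int = 1,     # hypothetical definition of "sufficient results"
    miss_log: Optional[List[str]] = None,
) -> Tuple[str, List[str]]:
    """Run each stage in order and return the first one that yields enough results."""
    stages = [
        ("brand", lambda: search_kb(brand, threshold)),
        ("lowered_threshold", lambda: search_kb(brand, threshold - step)),
        ("general_kb", lambda: search_kb(GENERAL_KB, threshold)),
        ("online", search_online),
    ]
    for name, run in stages:
        results = run()
        if len(results) >= min_results:
            return name, results
    if miss_log is not None:
        miss_log.append(brand)  # record the miss event
    return "miss", []
```

The stage name is returned together with the results so that a caller can tell which degradation step produced them.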
[0030] Dynamically determining the number of knowledge units in the final output based on the result confidence level includes:
[0031] Calculate the mean and standard deviation of the cosine similarity of the candidate knowledge units that pass the similarity threshold; when the mean is higher than the first preset value and the standard deviation is lower than the second preset value, take the first K value; when the mean is in the middle range and the standard deviation is greater than the set range, take the second K value; when the mean is lower than the third preset value, determine that the retrieval quality is low, output no results, and trigger the degradation route.
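By way of non-limiting illustration, the confidence-based choice of the output quantity may be sketched as follows. Every preset value and both K values are hypothetical, as is the fallback used for score distributions that the three stated cases do not cover.

```python
import statistics
from typing import List, Tuple


def choose_k(
    scores: List[float],
    high_mean: float = 0.80,  # hypothetical first preset value
    tight_std: float = 0.05,  # hypothetical second preset value
    low_mean: float = 0.50,   # hypothetical third preset value
    wide_std: float = 0.10,   # hypothetical "set range" for the standard deviation
    k_first: int = 3,         # hypothetical first K value
    k_second: int = 8,        # hypothetical second K value
) -> Tuple[int, bool]:
    """Return (number of units to output, whether to trigger the degradation route)."""
    if not scores:
        return 0, True
    mean = statistics.mean(scores)
    std = statistics.pstdev(scores)  # population standard deviation of the candidates
    if mean < low_mean:
        return 0, True               # low retrieval quality: output nothing, degrade
    if mean > high_mean and std < tight_std:
        return k_first, False
    if mean <= high_mean and std > wide_std:
        return k_second, False
    return k_second, False           # assumed fallback for cases the text leaves open
```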
[0032] In a second aspect, the present invention provides a vectorized construction system for manufacturing equipment maintenance knowledge, which applies the vectorized construction method for manufacturing equipment maintenance knowledge provided in the first aspect and comprises:
[0033] The maintenance document parsing module is used to identify the type and assess the structural complexity of the acquired maintenance documents. Based on the complexity level, it extracts the text stream and text information from the images in the documents and outputs structured chapter blocks.
[0034] The adaptive knowledge segmentation module is used to detect the starting boundary of knowledge units in a document using multi-modal semantic boundary matching rules. It extracts complete knowledge units from the starting position of the current semantic boundary to the starting position of the next semantic boundary, and preserves the hierarchical path information of the knowledge units.
[0035] The vectorized storage module is used to determine the device brand to which each knowledge unit belongs through a multi-source weighted voting mechanism. Knowledge units under the same brand are sent to the embedding vectorization engine in batches according to the original document order to generate corresponding semantic vectors. The vectors and metadata are then stored in the corresponding brand-level vector knowledge base.
[0036] The similarity calculation and filtering module is used to measure the semantic distance between the user's question vector and the knowledge unit vector during retrieval using cosine similarity. It utilizes the non-negativity of vectors to limit the similarity to the range of 0 to 1, and obtains a set of candidate knowledge units through a preset similarity threshold and TOP-K filtering mechanism.
[0037] The multi-level routing and threshold control module is used to select the corresponding brand-level vector knowledge base for retrieval based on the device brand identified in the user query. When the retrieval results do not meet the quality requirements, a progressive degradation strategy is executed in sequence, which includes lowering the similarity threshold, switching to a general cross-brand knowledge base, and enabling the online retrieval channel. The final number of knowledge units is dynamically determined based on the confidence level of the results.
[0038] This invention discloses a vectorized construction method and system for manufacturing equipment maintenance knowledge. It extracts document text and image text differentially through structural complexity assessment; extracts complete knowledge units from the current boundary to the next boundary using multi-modal semantic boundary matching rules while preserving hierarchical paths; determines brand affiliation through multi-source weighted voting, and vectorizes and stores knowledge in batches by brand; uses cosine similarity combined with thresholds and TOP-K filtering during retrieval; and implements brand-level routing and progressive degradation strategies, dynamically adjusting thresholds and output quantities. This invention improves the semantic completeness of knowledge segmentation, cross-brand retrieval accuracy, and retrieval reliability, reduces the risk of large-model hallucinations, and achieves efficient and scalable vectorized construction of maintenance knowledge.
Attached Figure Description
[0039] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below.
[0040] Figure 1 is a schematic diagram illustrating the steps of a vectorized construction method for manufacturing equipment maintenance knowledge according to the first embodiment of the present invention.
[0041] Figure 2 is a flowchart illustrating a vectorized construction method for manufacturing equipment maintenance knowledge provided by the present invention.
[0042] Figure 3 is a structural principle diagram of a vectorized construction system for manufacturing equipment maintenance knowledge according to the second embodiment of the present invention.
[0043] Figure 4 is a schematic diagram of the electronic device of the present invention.
[0044] In the figures: 101-Maintenance document parsing module, 102-Adaptive knowledge segmentation module, 103-Vectorized storage module, 104-Similarity calculation and filtering module, 105-Multi-level routing and threshold control module.
Detailed Implementation
[0045] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.
[0046] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any or all possible combinations of one or more of the associated listed items.
[0047] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."
[0048] The first embodiment of this application is as follows:
[0049] Referring to Figures 1-3, this invention provides a vectorized construction method for manufacturing equipment maintenance knowledge, comprising the following steps:
[0050] S1. Perform type identification and structural complexity assessment on the acquired maintenance documents, extract text streams from the documents and text information from the images according to the complexity level, and output structured chapter blocks.
[0051] Specifically, the system first receives maintenance knowledge documents uploaded by manufacturers in various formats. Common formats include portable document format files, scanned image packages, and technical manuals containing mixed text and images. Regardless of the original format, the system uniformly converts them into an internal, page-by-page accessible intermediate representation, ensuring that subsequent processing units can read the content in page order.
[0052] For portable document format files, the system further distinguishes between digitally generated documents composed of text layers and image documents composed of scanned images. For the former, the underlying text stream can be extracted directly; for the latter, optical character recognition technology must be used to reconstruct the text information from the image.
[0053] Before formally extracting information, the system performs a comprehensive analysis of each document. This analysis primarily examines the following three dimensions:
[0054] Chapter level depth: By scanning the directory structure or heading numbering patterns in the document, the maximum nesting depth of chapters is calculated. For example, documents containing only "Chapter 1, Chapter 2" are considered low complexity; documents containing multi-level numbering such as "1.1" and "1.1.1" are considered high complexity.
[0055] Image and Table Density: Count the number of image objects and table objects on each page and calculate the average density of the entire document. If the average number of images per page exceeds a set threshold (e.g., more than two images per page) and the image content is mainly photos or complex scenes, it is classified as an image-dense document.
[0056] Numbering pattern diversity: This detects special numbering types in the document, such as Chinese chapter numbers, numeric level numbers, product codes connected by hyphens, and alphanumeric numbering. A greater variety of patterns indicates a more complex document structure.
[0057] Based on the comprehensive score of the above three dimensions, the system will label each maintenance document as either simple or complex in structure.
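By way of non-limiting illustration, the three-dimension assessment may be sketched as a simple vote. The limits for each dimension and the rule that two exceeded dimensions make a document complex are assumptions of this sketch; this application refers only to a comprehensive score without giving its formula.

```python
from dataclasses import dataclass


@dataclass
class DocumentStats:
    max_heading_depth: int    # 1 for "Chapter 1", 3 for "1.1.1"
    objects_per_page: float   # average number of images and tables per page
    numbering_patterns: int   # count of distinct numbering types detected


def classify_structure(
    stats: DocumentStats,
    depth_limit: int = 1,        # hypothetical limit
    density_limit: float = 2.0,  # follows the "more than two images per page" example
    pattern_limit: int = 2,      # hypothetical limit
    votes_needed: int = 2,       # hypothetical decision rule
) -> str:
    """Label a document 'simple' or 'complex' by counting exceeded dimensions."""
    votes = sum([
        stats.max_heading_depth > depth_limit,
        stats.objects_per_page > density_limit,
        stats.numbering_patterns > pattern_limit,
    ])
    return "complex" if votes >= votes_needed else "simple"
```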
[0058] For documents deemed to have a simple structure, the system prioritizes the text stream extraction unit, directly extracting the complete text content and its position information on the page (such as paragraph boundaries and heading lines) from the document's digital underlying layer. Simultaneously, the system performs lightweight layout analysis, identifying basic areas such as plain text paragraphs, list items, and table cells, and establishing spatial relationships between these areas and potentially extracted image areas.
[0059] For documents with complex structures, the text stream extraction method remains the same, but the system reduces the granularity of the layout analysis: it retains only the boundaries of paragraphs and headings and ignores the internal structure of tables, to avoid introducing noise through over-parsing. This differentiated strategy limits the processing overhead of complex documents.
[0060] The system invokes the image analysis unit to traverse each page of the document, extracting all embedded image objects (including bitmaps and vector graphics). For each image, the system calculates its information gain potential, based on the following heuristic:
[0061] If the image is a circuit schematic, a component disassembly and assembly diagram, or a screenshot of a fault code display, then the information gain potential is set to high.
[0062] If the image is a photograph of the product's appearance, a photograph of personnel operating the product, or a decorative image that is not directly related to the maintenance action, then the information gain potential is set to low.
[0063] For documents with simple structures, the system retains all images with high information gain potential and skips images with low potential. For documents with complex structures, the system uniformly skips all image extraction and retains only the text content. This design is based on the following industrial practice: images in complex manuals are often numerous and repetitive, providing limited incremental information for fault diagnosis. Discarding images can significantly reduce noise and computational burden in subsequent vectorization.
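By way of non-limiting illustration, the image retention policy may be sketched as follows. The sketch assumes that each image already carries a `kind` label produced by some upstream classifier; how that label is obtained is not described here, and the label strings are hypothetical.

```python
from typing import Dict, List

# Hypothetical labels for the image categories that the heuristic treats as high gain.
HIGH_GAIN_KINDS = {"circuit_schematic", "disassembly_diagram", "fault_code_screenshot"}


def information_gain(kind: str) -> str:
    """Map an image category to 'high' or 'low' information gain potential."""
    return "high" if kind in HIGH_GAIN_KINDS else "low"


def select_images_for_ocr(images: List[Dict[str, str]], complexity: str) -> List[Dict[str, str]]:
    """Complex documents skip all images; simple documents keep high-gain images only."""
    if complexity == "complex":
        return []
    return [img for img in images if information_gain(img["kind"]) == "high"]
```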
[0064] For high-potential images that are retained, the system activates the optical character recognition unit. This unit is essentially a text recognition engine fine-tuned for industrial maintenance scenarios, and its core capability is to recognize text in images and transcribe it into editable form.
[0065] In this invention, the working process of the optical character recognition unit is as follows:
[0066] Image preprocessing: The original image is converted into a grayscale image, and an adaptive binarization algorithm is applied to enhance the contrast between text and background, while suppressing shadows and noise introduced by scanning or photography.
[0067] Text region detection: Instead of performing dense recognition on the entire image, a method based on connected component analysis is used to locate regions in the image that may contain text, thereby improving efficiency.
[0068] Character Recognition: For each detected text region, the recognition engine outputs the corresponding Unicode character sequence. This engine has been additionally trained using screenshots from numerous manufacturing equipment manuals (such as fault code display panels and parameter setting interfaces), thus achieving high recognition accuracy for numbers, letters, and special symbols (such as hyphens and periods).
[0069] Location restoration and semantic binding: The recognized text carries its coordinates in the original image. The system combines the recognized text into short text fragments according to the reading order (from left to right, from top to bottom) and establishes semantic associations with the document page to which the image belongs and with nearby paragraphs. For example, the words "Power module test point" recognized above a circuit diagram will be bound to the maintenance procedure instructions corresponding to that circuit diagram.
[0070] Through the above process, key textual information in the image is successfully extracted and incorporated into the text stream of the entire document, while the visual features of the image itself are not vectorized. This preserves the core semantics carried by the image and avoids the incorporation of high-dimensional, sparse image tensors into the vector library.
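By way of non-limiting illustration, the reading-order combination of recognized text may be sketched as follows. The fragment format `(x, y, text)` and the vertical tolerance used to decide that two fragments share a line are assumptions of this sketch; the recognition engine itself is not modeled.

```python
from typing import List, Tuple

Fragment = Tuple[float, float, str]  # (x, y, text): top-left coordinates within the image


def reading_order(fragments: List[Fragment], line_tol: float = 10.0) -> List[str]:
    """Merge fragments into lines ordered top to bottom, each read left to right."""
    ordered = sorted(fragments, key=lambda f: (f[1], f[0]))
    lines: List[Tuple[float, List[Tuple[float, str]]]] = []
    for x, y, text in ordered:
        # A fragment joins the current line when its y is close to the line's first y.
        if lines and abs(y - lines[-1][0]) <= line_tol:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return [" ".join(text for _, text in sorted(items)) for _, items in lines]
```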
[0071] After completing text extraction, image text recognition, and layout analysis, the maintenance document parsing module outputs each page as a structured document block. This block contains the following fields:
[0072] Original document identifiers: document name, version number, and device brand;
[0073] Page numbering: used to maintain the original order;
[0074] Text paragraph sequence: Plain text paragraphs organized in page order, incorporating text identified from nearby images;
[0075] Chapter level markers: The chapter number and title name corresponding to this page (if they can be identified from the text stream);
[0076] Complexity label: simple or complex, for reference in subsequent segmentation steps.
[0077] S2. Use the multi-modal semantic boundary matching rules to detect the starting boundary of knowledge units in the document, extract the complete knowledge units from the starting position of the current semantic boundary to the position before the starting position of the next semantic boundary, and retain the hierarchical path information of the knowledge units.
[0078] Specifically, a semantic boundary is a special pattern that naturally exists in the maintenance document and marks the start of a new knowledge point. This step maintains a configurable set of semantic boundary matching rules, which covers the common structured markings in manufacturing equipment maintenance documents. The rules are divided into the following categories:
[0079] Chinese chapter-level rules: used to match Chinese serial numbers starting with the character "第", such as "第一章", "第二节", "第三部分", etc. These rules can identify the top-level structural boundaries of the document.
[0080] Numeric hierarchy rules: used to match hierarchical numbers consisting of pure numbers and dots, supporting different depths, such as "1.1", "1.1.1", etc. These rules distinguish the hierarchical depth according to the number of digit segments.
[0081] Equipment-specific code rules: used to match the part numbers or fault codes unique to manufacturing enterprises, such as "016-930", "IQ-1", "E001", etc. These codes often serve as the starting identifier of an independent knowledge segment.
[0082] Composite numbering rules: used to match patterns of mixed numbers and dots whose hierarchy is uncertain, such as "6.16.1.1", as a supplement to the previous types of rules.
[0083] Each rule is associated with a hierarchical priority. For example, the priority of the Chinese chapter-level rule is higher than that of the numeric hierarchy rule, and the equipment-specific code rule has the highest priority in some documents. When scanning the document, the system tries the rules in priority order to ensure that the most structurally significant boundary is identified first.
[0084] The system scans the text content line by line in the original order of the structured text blocks. For each line, the semantic boundary recognizer applies the above matching rules in turn. Once a rule matches successfully, the system records the starting character position of the line (i.e., the absolute offset of the boundary in the document) and takes this position as a candidate starting point for a knowledge unit, while recording the type of the matched boundary and its hierarchical depth.
[0085] For example, when a line of content "3.2 Power board fault troubleshooting" is detected and the numeric hierarchy rule matches successfully, the system records the boundary type as "second-level numeric heading" and the hierarchical depth as 2.
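By way of non-limiting illustration, a prioritized rule set of this kind may be sketched with regular expressions. The specific patterns, the rule names, the depth of 1 assigned to every Chinese chapter-level match, and the fixed depth assigned to equipment-specific codes are assumptions of this sketch and would need tuning against real manuals.

```python
import re
from typing import Optional, Tuple

DEVICE_CODE_DEPTH = 4  # hypothetical: the source does not assign a depth to device codes


def _segment_count(match: "re.Match[str]") -> int:
    """Depth of a dotted number equals its number of digit segments ('3.2' -> 2)."""
    return match.group(1).count(".") + 1


# Rules are listed in priority order; the first rule that matches wins.
RULES = [
    ("chinese_chapter", re.compile(r"^第[一二三四五六七八九十百零0-9]+(?:章|节|部分)"), lambda m: 1),
    ("numeric_level", re.compile(r"^(\d+(?:\.\d+)+)\s"), _segment_count),
    ("device_code", re.compile(r"^(?:\d{3}-\d{3}|[A-Z]{1,3}-\d+|[A-Z]\d{3})\b"),
     lambda m: DEVICE_CODE_DEPTH),
    ("composite", re.compile(r"^(\d+(?:\.\d+)+)"), _segment_count),
]


def match_boundary(line: str) -> Optional[Tuple[str, int]]:
    """Return (rule name, hierarchical depth) for a boundary line, else None."""
    text = line.strip()
    for name, pattern, depth_of in RULES:
        match = pattern.match(text)
        if match:
            return name, depth_of(match)
    return None
```

Reordering the entries of `RULES` changes the priority, which is one way to give equipment-specific codes the highest priority for documents that call for it.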
[0086] Starting from the current detected semantic boundary start position and ending at the character preceding the next semantic boundary start position, a complete knowledge unit is extracted. Specifically:
[0087] View the document as a continuous stream of characters, with multiple semantic boundary start points distributed throughout. Assume boundary start points P1, P2, P3…Pn are detected sequentially. The first knowledge unit begins at P1 (the first boundary, typically the document title or first chapter) and ends one character before P2. The second knowledge unit begins at P2 and ends one character before P3. The last knowledge unit begins at Pn and ends at the end of the document.
[0088] The title line (i.e., the boundary line) of each knowledge unit is itself contained within that unit, serving as a header summary of that knowledge. The advantage of this is that the title of the knowledge unit can be directly used for vectorization during subsequent retrieval, improving recall.
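By way of non-limiting illustration, the extraction of units between consecutive boundary start points may be sketched as follows. The function name and the returned tuple layout are assumptions of this sketch. Python slices exclude the end index, which corresponds to ending one character before the next boundary.

```python
from typing import List, Tuple


def split_units(text: str, starts: List[int]) -> List[Tuple[int, int, str]]:
    """Cut text at the boundary offsets P1..Pn; returns (start, length, unit text)."""
    starts = sorted(set(starts))
    units = []
    for i, start in enumerate(starts):
        # The last unit runs to the end of the document.
        end = starts[i + 1] if i + 1 < len(starts) else len(text)
        units.append((start, end - start, text[start:end]))
    return units
```

Each unit begins with its own title line. Any text located before the first boundary P1 is not assigned to a unit in this sketch.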
[0089] To leverage the tree structure of documents in subsequent vectorization and retrieval, the system not only segments the documents into knowledge units but also preserves the hierarchical relationships between each unit. The specific steps are as follows:
[0090] When detecting each semantic boundary, its hierarchical depth is recorded. For example, the Chinese chapter "Chapter 1" has a depth of 1, "1.1" has a depth of 2, and "1.1.1" has a depth of 3. Each generated knowledge unit carries a hierarchical path string, which is composed of the names of all parent titles from the document root node to the current unit, separated by a specific delimiter. For example, a knowledge unit about "Power Board Fault Code E001" might have the hierarchical path "Chapter 1 > 1.2 Fault Code Table > 1.2.3 E001".
[0091] The system maintains a temporary stack structure to construct paths: whenever a boundary with a depth greater than the top depth of the stack is encountered, the boundary title is pushed onto the stack; when a boundary with a depth less than or equal to the top depth of the stack is encountered, the stack is continuously popped until the top depth of the stack is less than the current depth, and then the current boundary is pushed onto the stack.
[0092] When each knowledge unit is generated, all the titles in the current stack constitute its complete hierarchical path, which is stored along with the unit. Preserving hierarchical paths in this way allows subsequent searches not only to match by content but also to be restricted by document structure (e.g., "search only for content in Chapter 3").
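By way of non-limiting illustration, the stack-based path construction may be sketched as follows. The input format of `(depth, title)` pairs and the delimiter `" > "` are assumptions of this sketch.

```python
from typing import List, Tuple


def build_paths(boundaries: List[Tuple[int, str]], sep: str = " > ") -> List[str]:
    """Return one hierarchical path string per boundary, in document order."""
    stack: List[Tuple[int, str]] = []
    paths: List[str] = []
    for depth, title in boundaries:
        # Pop until the top of the stack is strictly shallower than the current boundary.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack.append((depth, title))
        paths.append(sep.join(t for _, t in stack))
    return paths
```

For the boundaries "Chapter 1" (depth 1), "1.2 Fault Code Table" (depth 2), and "1.2.3 E001" (depth 3), the last path is "Chapter 1 > 1.2 Fault Code Table > 1.2.3 E001".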
[0093] The knowledge unit generator packages each unit into the following structure:
[0094] Unit text: All text (including the title line) from the current boundary to the beginning of the next boundary.
[0095] Hierarchical path: such as "Chapter 1 > 1.2 Fault Code Table".
[0096] Boundary type: such as "numeric hierarchy heading" or "equipment-specific code".
[0097] Starting position and length: used for tracing the source.
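By way of non-limiting illustration, the packaged structure may be sketched as a small record type. The field names and the helper `make_unit` are assumptions of this sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeUnit:
    text: str            # unit text, including the title line
    hierarchy_path: str  # e.g. "Chapter 1 > 1.2 Fault Code Table"
    boundary_type: str   # type of the boundary that opened the unit
    start: int           # absolute character offset in the document
    length: int          # number of characters, used for tracing the source


def make_unit(document: str, start: int, end: int,
              hierarchy_path: str, boundary_type: str) -> KnowledgeUnit:
    """Package the span [start, end) of the document as a knowledge unit."""
    return KnowledgeUnit(
        text=document[start:end],
        hierarchy_path=hierarchy_path,
        boundary_type=boundary_type,
        start=start,
        length=end - start,
    )
```

Storing the start offset and length allows the unit text to be located again in the original document.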
[0098] S3. The device brand to which each knowledge unit belongs is determined through a multi-source weighted voting mechanism. Knowledge units under the same brand are sent to the embedding vectorization engine in batches according to the original document order to generate corresponding semantic vectors. The vectors and metadata are then stored in the corresponding brand-level vector knowledge base.
[0099] Specifically, a manufacturing company's maintenance document library may contain information on multiple equipment brands. For example, the same company may have both Canon and Fuji printers, and even the same comprehensive manual may mention different brands interchangeably. Therefore, the system must be able to accurately determine the equipment brand to which each knowledge unit belongs. The brand identifier of this invention adopts a multi-source weighted voting mechanism, rather than simply relying on a single signal. The specific judgment process is as follows:
[0100] First priority: Document-level brand tag (strong signal). Users can actively tag each repair document with its brand upon entry into the system. If the document's brand was explicitly specified during the upload phase (e.g., the folder name or file name contains "Canon"), all knowledge units under that document directly inherit this brand tag, without further brand resolution. This is the most reliable signal.
[0101] The second priority is given to brand keywords in the filename and metadata. If the document does not carry manual tags, the system extracts brand keywords from the filename and document attributes (such as the title field). For example, "Fuji" in "Fuji_ApeosPort_Repair Manual.pdf" will be identified as a brand candidate. The system maintains a brand keyword mapping table to normalize common brand aliases (such as "Canon" and "Fuji Xerox") into standard brand names.
[0102] The third priority: the frequency and positional weight of brand mentions in the main text of the knowledge unit. When the above signals are missing or conflicting (e.g., a document mentions multiple brands simultaneously), the system analyzes the main text of each knowledge unit:
[0103] Count the frequency of each brand keyword appearing in this unit.
[0104] Keywords appearing in headings, the first sentence of paragraphs, and the beginning of chapters are given higher positional weights (for example, a single appearance in a heading has three times the weight of an appearance in the body text).
[0105] Calculate the weighted score for each brand.
[0106] Fourth priority: Context consistency check. If the vast majority of knowledge units in the same document have been assigned to a specific brand by a strong brand signal or by high-frequency keywords, then for the few units with unclear scores the system applies a context propagation rule: the current unit inherits the brand label of the adjacent (previous or next) knowledge unit, on the assumption that consecutive segments in the same document usually belong to the same brand.
[0107] In the rare case where a document or knowledge unit contains multiple brands with similar scores (e.g., Canon 45 points, Fujifilm 43 points), the system adopts a conservative strategy for resolving multi-brand conflicts: the knowledge unit is marked as "multi-brand coexistence" and stored in a separate "general cross-brand knowledge base," rather than being forcibly categorized into a specific brand. Subsequent searches by users, without specifying a brand, can simultaneously search the general knowledge base; if a brand is specified, conflicting entries in the general knowledge base are excluded. This approach avoids the propagation of errors from forced categorization.
[0108] Through the aforementioned multi-level weighted voting and conflict resolution mechanism, the system is able to provide a high degree of confidence in brand attribution for the vast majority of knowledge units.
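As a rough illustration of the third-priority scoring and the conservative conflict rule, the following sketch may help. The alias table, the function names, and the 10% relative margin used to decide that two scores are "similar" are assumptions; the description gives only the 3x heading weight and the 45 vs. 43 example. Context propagation is not covered here.

```python
from typing import Dict, List, Optional

HEADING_WEIGHT = 3.0  # from the description: a heading mention counts three times a body mention
BODY_WEIGHT = 1.0
MULTI_BRAND = "multi-brand coexistence"

# Hypothetical alias table: standard brand name -> lowercase keywords (kept non-overlapping).
ALIASES: Dict[str, List[str]] = {"Canon": ["canon"], "Fujifilm": ["fuji"]}


def score_brands(heading: str, body: str, aliases: Dict[str, List[str]] = ALIASES) -> Dict[str, float]:
    """Weighted keyword frequency per brand for one knowledge unit."""
    scores: Dict[str, float] = {}
    h, b = heading.lower(), body.lower()
    for brand, keywords in aliases.items():
        score = sum(HEADING_WEIGHT * h.count(k) + BODY_WEIGHT * b.count(k) for k in keywords)
        if score > 0:
            scores[brand] = score
    return scores


def decide_brand(doc_tag: Optional[str], scores: Dict[str, float], margin: float = 0.1) -> Optional[str]:
    """Apply the priority order: document tag first, then weighted scores."""
    if doc_tag:
        return doc_tag  # strong signal: inherit the document-level tag
    if not scores:
        return None  # undecided; would be left to context propagation
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    # Assumed rule: scores within `margin` of the best score count as a conflict.
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= margin * ranked[0][1]:
        return MULTI_BRAND
    return ranked[0][0]


unit_scores = score_brands("Canon fuser unit", "Replace the Canon fuser. Fuji parts do not fit.")
```

With these assumptions, `decide_brand(None, {"Canon": 45, "Fujifilm": 43})` falls into the conflict branch, matching the example in the text.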
[0109] After confirming the brand tags for each knowledge unit, the system needs to send all knowledge units under the same brand into the embedding vectorization engine for conversion. The relative order between knowledge units does not need to be changed, but they need to be grouped by brand and organized in batches, while maintaining the original document order within each batch. The reasons are as follows:
[0110] Current mainstream embedding models (especially the Transformer architecture) treat each sample independently when processing a batch of text, regardless of the input order. Therefore, sorting does not affect the mathematical results of the generated vectors. However, maintaining the original order is beneficial for subsequent traceability and incremental updates. When a company adds new maintenance manuals, the system only needs to append the new knowledge units to the knowledge base in the original order, without rebuilding the entire database. If the order is disrupted, it will be difficult to locate the update position.
[0111] From an engineering efficiency perspective, arranging all knowledge units under the same brand into a sequence according to their order of appearance in the original document (i.e., document order, first chapter before second chapter, first page before second page) and then feeding them into the embedding model in a fixed batch size (e.g., 32 or 64 knowledge units per batch) helps keep GPU or CPU utilization high.
[0112] Therefore, the system's knowledge unit sorting and batch processor perform the following operations:
[0113] Group global knowledge units by brand tags.
[0114] Within each brand group, the knowledge units are sorted in ascending order according to their original starting position (i.e., the document offset recorded in step 2) to ensure that the document order before splitting is accurately restored.
[0115] The sorted sequence is divided into fixed-length batches (the last batch may be insufficient). The knowledge units within each batch maintain their internal order.
[0116] The batches are submitted sequentially to the embedding vectorization engine.
[0117] This approach does not compromise vectorization quality and provides significant convenience for subsequent knowledge base maintenance and tracing.
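The grouping, sorting, and batching operations above could look roughly like the sketch below. The dictionary keys `brand`, `doc`, and `start` are hypothetical field names, and sorting by document name before the starting offset is an assumption made so that units from several documents of the same brand do not interleave.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple


def batches_by_brand(units: List[dict], batch_size: int = 32) -> Iterator[Tuple[str, List[dict]]]:
    """Yield (brand, batch) pairs, preserving document order inside each brand."""
    groups: Dict[str, List[dict]] = defaultdict(list)
    for unit in units:
        groups[unit["brand"]].append(unit)
    for brand, group in groups.items():
        # Restore the original order using the recorded starting offset.
        group.sort(key=lambda u: (u["doc"], u["start"]))
        # Fixed-size batches; the last batch may be shorter.
        for i in range(0, len(group), batch_size):
            yield brand, group[i:i + batch_size]


units = [
    {"brand": "Canon", "doc": "a.pdf", "start": 300},
    {"brand": "Fujifilm", "doc": "b.pdf", "start": 0},
    {"brand": "Canon", "doc": "a.pdf", "start": 0},
    {"brand": "Canon", "doc": "a.pdf", "start": 120},
]
summary = [(brand, [u["start"] for u in batch]) for brand, batch in batches_by_brand(units, batch_size=2)]
```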
[0118] The embedding vectorization engine is a pre-trained deep learning model specifically fine-tuned for Chinese industrial maintenance text. This model maps input natural language text (typically no more than 512 tokens in length) into a fixed-dimensional dense numerical vector, which can be understood as a semantically compressed representation of the original text.
[0119] For each input knowledge unit, the vectorization process is as follows:
[0120] Text preprocessing: The engine receives the "body" portion of the knowledge unit (i.e., the complete text content including the header line). First, a lightweight cleaning process is performed: invisible characters are removed, punctuation is standardized (e.g., full-width to half-width conversion), and consecutive whitespace characters are compressed into single spaces. The cleaned text retains the original language information and does not undergo stemming or stop word filtering, because words like "not" and "non-" in maintenance terminology are semantically crucial.
[0121] Length Adaptation and Truncation: If the text length of a knowledge unit exceeds the maximum input length of the embedding model (e.g., 512 tokens), the engine employs a head-and-tail truncation strategy: it retains the first 400 tokens and the last 100 tokens and discards the middle portion. This strategy is based on observations of industrial maintenance documents: key information often appears at the beginning of a section (title, overview) and at the end of a section (summary, warning), while the lengthy step descriptions in the middle are highly redundant. Keeping only the beginning or only the end may result in the loss of fault codes or maintenance conclusions.
[0122] Batch inference: Preprocessed text is organized into batches and fed into the embedding model for forward computation. The model outputs a vector (e.g., a 768-dimensional or 1024-dimensional sequence of floating-point numbers) corresponding to each knowledge unit. This process does not involve gradient updates and is purely inference computation.
[0123] Vector quality verification: Each generated vector is quickly scanned to check for NaN values or for vectors that are entirely zero (indicating an inference anomaly). If an anomaly is found, the system automatically retries the vectorization of the knowledge unit (up to 3 times). If it still fails, the unit is marked as "vectorization failed" and logged, but the batch processing as a whole is not interrupted.
[0124] Vector and metadata binding: Successfully generated vectors are packaged together with other metadata of the knowledge unit, including: original text (used for final display to the user), hierarchical path, boundary type, brand tag, starting position, source document name, timestamp, etc. This packaged structure is called a vectorized knowledge entry.
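The lightweight cleaning, head-and-tail truncation, and vector check described above can be sketched as follows. This is an illustration under stated assumptions: Unicode NFKC normalization is used here as one possible way to perform full-width to half-width conversion (it also folds other compatibility characters), and the input to truncation is assumed to be an already tokenized list, since the tokenizer is not specified.

```python
import math
import re
import unicodedata
from typing import List, Sequence


def clean_text(text: str) -> str:
    """Lightweight cleaning: no stemming and no stop-word removal."""
    # NFKC folds full-width letters, digits and punctuation to their half-width forms.
    text = unicodedata.normalize("NFKC", text)
    # Drop invisible control/format characters, but keep whitespace for the next step.
    text = "".join(ch for ch in text
                   if ch.isspace() or unicodedata.category(ch) not in ("Cc", "Cf"))
    # Collapse consecutive whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def truncate_head_tail(tokens: Sequence[str], max_len: int = 512,
                       head: int = 400, tail: int = 100) -> List[str]:
    """Keep the first `head` and last `tail` tokens when the input is too long."""
    if len(tokens) <= max_len:
        return list(tokens)
    return list(tokens[:head]) + list(tokens[-tail:])


def is_valid_vector(vec: Sequence[float]) -> bool:
    """Reject empty vectors, vectors containing NaN, and all-zero vectors."""
    return (len(vec) > 0
            and not any(math.isnan(x) for x in vec)
            and any(x != 0.0 for x in vec))


cleaned = clean_text("Ｅ００１\u200b  error，\n retry")
truncated = truncate_head_tail([str(i) for i in range(600)])
```

A caller would wrap the embedding call in a retry loop (up to 3 attempts in the description) and use `is_valid_vector` as the acceptance check.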
[0125] Each brand has its own independent vector knowledge base (e.g., "Canon Knowledge Base," "Fujifilm Knowledge Base"). This invention selects a vector database system (such as ChromaDB) that supports fast cosine similarity retrieval as the underlying storage engine. Storage strategy:
[0126] For each brand, the system creates a collection in its proprietary knowledge base. The collection name is the brand's standard name. Each vectorized knowledge entry is inserted into the collection as a record, where the vector field is used for subsequent similarity calculations, and the metadata field stores all auxiliary information. During insertion, the system does not require any order between vectors—retrieval relies on vector indexing (such as the HNSW algorithm) for fast nearest neighbor search, regardless of the insertion order.
[0127] When a company receives a new version of its maintenance documentation, the system re-parses, segments, and vectorizes the new document, inserting the generated new knowledge units into the corresponding knowledge base with the same brand affiliation. For duplicate or conflicting knowledge units already existing in the old version, the system determines whether to skip, overwrite, or coexist by checking the uniqueness of the "starting position + document name" combination during insertion (user-configurable). The recommended strategy is to coexist and tag the version number, allowing the upper-layer application to select the latest version based on user needs during retrieval.
[0128] The final output is a set of ready-to-use brand-level vector knowledge bases. Each knowledge base stores high-quality vector representations and complete metadata of all maintenance knowledge units under the corresponding brand. These knowledge bases can be directly used by upper-layer retrieval modules (such as intelligent question-answering systems other than those described in this invention), which perform cosine similarity matching by inputting question vectors and return the top-K most relevant knowledge units.
[0129] S4. During retrieval, cosine similarity is used to measure the semantic distance between the user's question vector and the knowledge unit vector. The non-negativity of vectors is used to limit the similarity to the range of 0 to 1. A set of candidate knowledge units is obtained through a preset similarity threshold and TOP-K screening mechanism.
[0130] Specifically, the system uniformly adopts cosine similarity as the metric for measuring the semantic distance between two vectorized knowledge items. From a geometric perspective, the vector of each knowledge unit can be understood as a point in a high-dimensional space or a ray emanating from the origin; the maintenance questions raised by users, after being processed by the same embedding vectorization engine, will also be converted into a question vector of the same dimension.
[0131] Cosine similarity focuses not on the lengths of two vectors, but on whether their directions are consistent in a high-dimensional space. The smaller the angle between two vectors, the closer their semantics are. This invention chooses cosine similarity instead of Euclidean or Manhattan distance primarily based on the following two inherent characteristics of manufacturing equipment maintenance knowledge:
[0132] The need to match short fault codes with long repair procedures: User queries are often very short, such as "E001 fault" or "Canon printer error 016-930". However, the corresponding answers in the repair knowledge base may be hundreds of words long, detailing the fault analysis and repair process. The magnitudes (norms) of the two vectors may therefore differ significantly. If a metric relying on absolute distance is used, a short query may be misjudged as far from relevant content merely because of this difference in magnitude. Cosine similarity, on the other hand, only considers direction, allowing short queries to effectively match long documents with the same semantic direction.
[0133] Semantic direction determines relevance: In a maintenance scenario, the relevance of two knowledge units depends primarily on whether the fault phenomena, maintenance actions, or parts they describe are consistent, rather than on the amount of text used in the description. Cosine similarity naturally possesses this "direction-first" characteristic, making it more suitable for this scenario than other length-sensitive metrics.
[0134] The embedding vectorization engine used in this invention is specially designed so that the values of each dimension of its output vector are non-negative. This characteristic ensures that the vectors of all knowledge units reside in the non-negative quadrant of the high-dimensional space. Therefore, the cosine similarity calculation result of any two vectors is automatically limited to a closed interval between 0 and 1:
[0135] When the cosine similarity is 1, it means that the two vectors have the same direction (e.g., self-matching of the same knowledge unit).
[0136] When the cosine similarity is 0, it means that the two vectors are orthogonal to each other and are completely unrelated semantically.
[0137] In actual searches, the similarity value is usually between 0.5 and 0.9.
[0138] This 0-to-1 normalization characteristic provides an intuitive and stable basis for setting the similarity threshold in this step. The system can set a uniform, interpretable threshold (e.g., 0.7) without needing to dynamically adjust the threshold based on document length, as is the case with other metrics.
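The two properties relied on here, insensitivity to vector magnitude and the 0-to-1 range for non-negative vectors, can be checked with a small sketch. The three-dimensional vectors are toy values chosen for illustration, not outputs of the embedding engine.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """cos(a, b) = (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


query = [0.2, 0.1, 0.0]      # short query: small magnitude
long_doc = [2.0, 1.0, 0.0]   # same direction, ten times the magnitude
unrelated = [0.0, 0.0, 3.0]  # orthogonal to the query

same_direction = cosine_similarity(query, long_doc)
orthogonal = cosine_similarity(query, unrelated)
```

Because every component is non-negative, the dot product cannot be negative, so the result cannot fall below 0; the upper bound of 1 holds for any pair of vectors.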
[0139] When an upper-layer application (such as a repair intelligent question-and-answer system) receives a user's repair question, the system first vectorizes the question to generate a question vector. Subsequently, the similarity calculation engine selects the corresponding brand-level vector knowledge base for retrieval based on the device brand tags parsed from the question.
[0140] Within the selected knowledge base, the engine reads the vector field of each vectorized knowledge entry one by one and calculates its cosine similarity with the question vector. Since modern vector databases have built-in efficient approximate nearest neighbor indexes, the actual calculation process is not exhaustive scanning, but rather quickly locates potentially relevant subsets of candidate vectors through the index. However, for logical clarity, this step still describes its decision-making principle as "comparing one by one with the candidate set."
[0141] For each comparison pair, the engine calculates a specific similarity value (between 0 and 1) and attaches this value as a temporary attribute to the corresponding vectorized knowledge entry.
[0142] The threshold controller manages a system-level similarity threshold. This threshold is a configurable floating-point parameter, with an initial value determined by the system based on a large amount of historical maintenance Q&A data (e.g., set to 0.70). It can be dynamically adjusted based on actual usage or manually modified by the administrator.
[0143] The purpose of the similarity threshold is to filter out knowledge items that are not semantically relevant. The specific rules are as follows:
[0144] If the calculated cosine similarity is less than the threshold, the knowledge entry is considered "semantically irrelevant" and is discarded directly without proceeding to subsequent steps.
[0145] If the cosine similarity is greater than or equal to the threshold, the knowledge entry is considered a "candidate relevant" entry, retained, and proceeds to the next screening step.
[0146] By setting a threshold, the system can effectively curb the "hallucination" problem common in large language models: when there is no truly relevant information in the knowledge base, the system does not force an answer from irrelevant fragments but can instead return a "no relevant knowledge found" message. This directly addresses the decision reliability risk mentioned in the background technology.
[0147] After threshold filtering, multiple (e.g., dozens of) candidate knowledge items may still remain. However, large language models can accept only a limited context length when generating answers, and excessive input reduces response speed and accuracy. Therefore, the result ranking and filtering process performs a TOP-K filtering operation:
[0148] First, all candidate entries that pass the threshold screening are sorted in descending order of cosine similarity. Entries with higher similarity are ranked higher, indicating greater semantic relevance.
[0149] The system then selects the top K entries as the final search results. K is a preset positive integer, typically between 3 and 5. This range is based on practical experience in industrial scenarios: too few entries may miss key information, while too many are redundant and reduce the processing efficiency of large models. If the total number of candidate entries that pass the threshold screening is less than K, all of them are retained (but no additional irrelevant entries are added).
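A minimal sketch of the threshold rule followed by TOP-K selection is given below. The `(entry_id, score)` tuple format and the function name are assumptions for illustration; in practice the scores would come from the vector database query.

```python
from typing import List, Tuple


def filter_top_k(scored: List[Tuple[str, float]], threshold: float = 0.70,
                 k: int = 3) -> List[Tuple[str, float]]:
    """Keep entries at or above the threshold, then return the K best."""
    # Entries below the threshold are discarded as semantically irrelevant.
    candidates = [(entry_id, score) for entry_id, score in scored if score >= threshold]
    # Descending order of cosine similarity.
    candidates.sort(key=lambda item: item[1], reverse=True)
    # If fewer than K candidates remain, all of them are returned and nothing is padded.
    return candidates[:k]


scored = [("u1", 0.62), ("u2", 0.91), ("u3", 0.70), ("u4", 0.78), ("u5", 0.74)]
top = filter_top_k(scored)
```

An empty return value corresponds to the "no relevant knowledge" case and should not be padded with below-threshold entries.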
[0150] If, in a given search, no knowledge item's cosine similarity reaches the preset threshold (i.e., all are filtered out), the system determines that no repair knowledge related to the current problem exists in the knowledge base. In this case, the upper-layer application can choose one of the following two processing methods:
[0151] Returning an empty result: Directly inform the user that "no relevant repair information was found in the knowledge base" to prevent the large language model from generating hallucinated responses.
[0152] Degradation to online retrieval: If the system is configured with real-time online retrieval capabilities (e.g., via a search engine API), the original question is forwarded to a general search engine to obtain parts information or the latest maintenance notices from the Internet, and the online results are then provided as context to the large language model.
[0153] The second approach embodies the advantage of the "dual-terminal integration of dedicated knowledge base and online retrieval" in the present invention, which can both ensure the reliability of accurate knowledge within the enterprise and make up for knowledge blind spots by utilizing open resources.
[0154] Step 4 ultimately outputs a search result set containing up to K vectorized knowledge entries, each with a cosine similarity score, sorted in descending order of score. This result set will be fed into a large language model, which combines it with prompt engineering to generate natural language responses for maintenance personnel. At the same time, the similarity threshold and TOP-K parameters set in this step are persisted as system configuration items and can be adjusted online as business needs change, without requiring the vector knowledge base to be rebuilt.
[0155] S5. Select the corresponding brand-level vector knowledge base for retrieval based on the device brand identified in the user query. When the retrieval results do not meet the quality requirements, execute the progressive degradation strategy in sequence, which includes lowering the similarity threshold, switching to a general cross-brand knowledge base, and enabling the online retrieval channel. Dynamically determine the final number of knowledge units output based on the confidence level of the results.
[0156] Specifically, when the upper-layer application (such as a maintenance intelligent question-and-answer system) passes the user's natural language query into this step, the query parser first analyzes the query text and extracts the following key information:
[0157] Target device brand: Identify device brand keywords from the query, such as "Canon," "Fuji," etc. The identification results can be one of three possibilities: explicitly identifying a single brand, mentioning multiple brands simultaneously, or not mentioning any brand.
[0158] Core maintenance entities: such as printer models, fault codes, and component names. This information is used for subsequent similarity calculations but does not affect routing decisions.
[0159] Query Intent Type: Determine if it is a fault diagnosis, repair procedure query, component parameter query, or other type. This classification can help optimize the TOP-K value in subsequent steps (for example, if fault diagnosis requires higher accuracy, the K value can be increased).
[0160] Brand identification results will directly determine the initial target knowledge base for routing. If a single brand is explicitly specified in the query (e.g., "Canon IR-ADV C5051 error E001"), the routing target will be locked to that brand's dedicated vector knowledge base. If multiple brands are mentioned in the query (e.g., "Compare the disassembly steps for Canon and Fujifilm fuser components"), the routing target will be marked as "cross-brand mode". If no brand is mentioned, the routing target will be marked as "brand unknown mode".
[0161] The brand-level route selector determines which vector knowledge base(s) to access for this retrieval based on the query parsing results. This invention employs a multi-level routing strategy with decreasing priority, as detailed below:
[0162] Level 1: Single-brand dedicated database. When a specific brand is identified in the query, the router only uses the brand-level vector knowledge base corresponding to that brand as the search target. Knowledge bases of all other brands are completely excluded from this search. This strict isolation design fundamentally avoids cross-brand knowledge interference—for example, searching for "Canon" will not incorrectly return repair steps for "Fujifilm," because the fault code systems and part names of the two may coincidentally appear similar but are actually incompatible.
[0163] Level 2: General cross-brand knowledge base. When multiple brands are identified in the query (e.g., the user requests a comparison) or the brand cannot be identified, the router switches to the general cross-brand knowledge base. This knowledge base, established in step 3, stores knowledge units marked as "multi-brand coexistence." Furthermore, if a single-brand database search does not yield a sufficient number of relevant results (i.e., the hit count is below the preset minimum requirement), the system also allows downgrading to the general cross-brand knowledge base for supplementary searching.
[0164] Level 3: Online retrieval channel. If the first two levels of the knowledge base fail to return any knowledge units exceeding the similarity threshold (i.e., all are filtered out), or if the system configuration mandates priority for online retrieval (e.g., involving the latest component information), the router forwards the query to the online retrieval channel. This channel calls the application programming interface of a general search engine to obtain publicly available online information related to the query and packages the search results into temporary knowledge units for later use.
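The initial routing decision (single brand, cross-brand mode, or brand unknown mode) can be sketched as below. The alias table, the collection name `general`, and the simple substring matching are assumptions; substring matching can produce false positives and a real parser would likely be stricter. Level 3 (online retrieval) is reached through degradation and is not part of this initial decision.

```python
from typing import Dict, List, Tuple

GENERAL_KB = "general"  # hypothetical name of the general cross-brand knowledge base

# Hypothetical alias table: lowercase keyword -> standard brand name.
BRAND_ALIASES: Dict[str, str] = {"canon": "Canon", "fuji": "Fujifilm"}


def detect_brands(query: str, aliases: Dict[str, str] = BRAND_ALIASES) -> List[str]:
    """Return the standard names of all brands mentioned in the query."""
    lowered = query.lower()
    found: List[str] = []
    for alias, brand in aliases.items():
        if alias in lowered and brand not in found:
            found.append(brand)
    return found


def route(query: str) -> Tuple[str, List[str]]:
    """Return (routing mode, target knowledge bases) for the initial retrieval."""
    brands = detect_brands(query)
    if len(brands) == 1:
        return "single-brand", brands  # all other brand bases are excluded
    if len(brands) > 1:
        return "cross-brand mode", [GENERAL_KB]
    return "brand unknown mode", [GENERAL_KB]


single = route("Canon IR-ADV C5051 error E001")
```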
[0165] After the target knowledge base is determined, the similarity threshold controller manages the similarity threshold used in this retrieval. Although the system has a global default threshold (e.g., 0.70), the threshold controller supports the following two dynamic adjustment capabilities:
[0166] Threshold fine-tuning based on query type: For fault code queries (which are usually very short and require extremely high accuracy), the controller automatically increases the threshold by 0.05 (e.g., from 0.70 to 0.75) to reduce false recalls; for descriptive repair step queries (which are longer and allow for some ambiguity), the controller decreases the threshold by 0.05 to increase recall coverage.
[0167] Adaptive based on brand knowledge base density: If a brand's knowledge base has very few knowledge units (e.g., less than 100), the threshold controller will appropriately lower the retrieval threshold for that base to avoid no results due to sparse samples; conversely, for a large base, the threshold will be maintained or increased to obtain more accurate top K results.
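The two adjustment rules can be combined as in the following sketch. The query type labels and the 0.05 reduction for sparse knowledge bases are assumptions: the description only says the threshold is lowered "appropriately" for bases with very few units.

```python
def adjusted_threshold(query_type: str, kb_size: int, base: float = 0.70,
                       step: float = 0.05, small_kb: int = 100,
                       sparse_step: float = 0.05) -> float:
    """Return the similarity threshold to use for one retrieval."""
    threshold = base
    if query_type == "fault_code":
        threshold += step   # short, precise queries: reduce false recalls
    elif query_type == "repair_steps":
        threshold -= step   # descriptive queries: widen recall
    if kb_size < small_kb:
        threshold -= sparse_step  # assumed amount for sparse knowledge bases
    # Clamp to the valid cosine range and round away floating-point noise.
    return round(min(max(threshold, 0.0), 1.0), 2)


fault_code_threshold = adjusted_threshold("fault_code", kb_size=5000)
```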
[0168] In addition, the system maintains a threshold cache table that records the historical best thresholds for "query keyword combination + target brand". When the exact same query occurs again, the threshold controller directly uses the cached value, skipping recalculation, thereby reducing latency.
[0169] The degradation route manager monitors the quality of the current search results and automatically switches to the next level of knowledge source when the results do not meet the requirements. Degradation conditions include:
[0170] Condition 1: After searching in the selected knowledge base, the number of candidate knowledge units filtered by the similarity threshold is 0.
[0171] Condition 2: The number of candidates is greater than 0 but less than the minimum number required by the TOP-K parameter (e.g., K=3 but only 1 was found), and the similarity scores of these candidates are all below the warning line (e.g., below 0.75 in high-threshold mode).
[0172] When any of the degradation conditions is met, the degradation route manager tries the following steps in order:
[0173] Expand the search within the same brand: Keep the target brand unchanged, but temporarily lower the similarity threshold by one step (e.g., 0.05) and search again. If a sufficient number of results are obtained, stop the downgrade; otherwise, continue.
[0174] Switch to a generic cross-brand knowledge base: Retrieve from the generic base using the original threshold. The knowledge units in the generic base come from multi-brand coexistence scenarios and may contain non-brand-specific knowledge relevant to the current query (such as generic screw torque standards).
[0175] Enable online retrieval channel: Send the original query to the search engine to obtain the first N (e.g., 5) search results, concatenate the titles and summaries of these results into a temporary text block, vectorize it, and then use it for similarity calculation (at this point, threshold filtering is no longer needed, and the highest score is taken directly).
[0176] Return an empty result and log: If none of the above degradation steps produce any valid knowledge units, the system will eventually return an empty result set and log this "miss" event for future reference when expanding the knowledge base.
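The degradation order above can be sketched as a simple fallback chain. Everything named here is hypothetical: `search` stands in for a knowledge base query that returns hits at or above a threshold, `online_search` stands in for the online retrieval channel, and the degradation condition is simplified to a minimum hit count.

```python
from typing import Callable, List, Tuple

Hit = Tuple[str, float]  # (entry id, cosine similarity)
GENERAL_KB = "general"   # hypothetical name of the general cross-brand knowledge base


def retrieve_with_degradation(search: Callable[[str, float], List[Hit]],
                              online_search: Callable[[], List[Hit]],
                              brand: str, threshold: float = 0.70,
                              step: float = 0.05, min_hits: int = 1) -> Tuple[str, List[Hit]]:
    """Return (degradation path identifier, hits)."""
    hits = search(brand, threshold)
    if len(hits) >= min_hits:
        return "brand", hits
    # Step 1: same brand, threshold temporarily lowered by one step.
    hits = search(brand, threshold - step)
    if len(hits) >= min_hits:
        return "brand-lowered-threshold", hits
    # Step 2: general cross-brand base with the original threshold.
    hits = search(GENERAL_KB, threshold)
    if hits:
        return "general", hits
    # Step 3: online channel, no threshold filtering, highest score taken directly.
    hits = online_search()
    if hits:
        return "online", [max(hits, key=lambda h: h[1])]
    # Step 4: empty result; the caller would log this miss.
    return "missed", []


store = {"Canon": [("c1", 0.68)], GENERAL_KB: [("g1", 0.72)]}


def fake_search(kb: str, threshold: float) -> List[Hit]:
    # In-memory stand-in for a vector database query.
    return [hit for hit in store.get(kb, []) if hit[1] >= threshold]


canon_result = retrieve_with_degradation(fake_search, lambda: [], "Canon")
```

The returned path identifier is the kind of information the result assembler could record as routing metadata.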
[0177] This progressive degradation strategy ensures that the system will not force low-quality answers when knowledge coverage is insufficient, while fully utilizing the complementarity of multi-level resources. This invention supports dynamic K-value selection based on result confidence. Specific rules are as follows:
[0178] Calculate the mean and standard deviation of the cosine similarity of all candidate knowledge units that pass the threshold screening. If the mean is higher than 0.85 and the standard deviation is lower than 0.1 (indicating high confidence and clustered results), a smaller first K value (e.g., 2 or 3) is chosen, as a small number of highly similar units is sufficient. If the mean is between 0.65 and 0.85 and the standard deviation is large, a larger second K value (e.g., 5 or 6) is chosen to provide more context for cross-validation by the large language model. If the mean is lower than 0.65, the system determines that the overall retrieval quality is low and outputs no results (even if there are candidates), instead triggering the degradation route for re-retrieval.
[0179] In addition, the system allows administrators to preset different base K values for different device types or query intents (e.g., K=3 for fault codes, K=5 for maintenance procedures). Dynamic K value rules will be fine-tuned based on the base values.
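The confidence-based K selection can be sketched as follows. Two points are assumptions: a "large" standard deviation is taken to mean at least 0.1, and cases the description does not cover (for example a high mean with a wide spread) fall back to the administrator's base K. The population standard deviation is used here; the description does not say which variant is intended.

```python
import statistics
from typing import List, Optional


def choose_k(scores: List[float], base_k: int = 3, small_k: int = 2,
             large_k: int = 5, large_std: float = 0.1) -> Optional[int]:
    """Return K for the final result set, or None to trigger degradation."""
    if not scores:
        return None
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores)
    if mean < 0.65:
        return None      # overall quality too low: re-retrieve via the degradation route
    if mean > 0.85 and std < 0.1:
        return small_k   # confident, tightly clustered results
    if mean <= 0.85 and std >= large_std:
        return large_k   # moderate confidence, wide spread: give the model more context
    return base_k        # assumed fallback for cases the description leaves open


k_confident = choose_k([0.90, 0.88, 0.92])
```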
[0180] After completing all the routing, threshold control, and K-value selection above, the result assembler retrieves the final selected K vectorized knowledge entries from the knowledge base and appends the following routing metadata:
[0181] The actual source of the knowledge base that was hit (brand library, general library, or online channel);
[0182] Cosine similarity score for each item;
[0183] The similarity threshold used in this search (a dynamically adjusted value);
[0184] Degradation path identifier (records whether a degradation has occurred, and to which level it was downgraded).
[0185] This metadata is passed to the upper-level large language model along with the search results set, not only for generating the final answer, but also for subsequent search quality analysis and knowledge base optimization.
[0186] The final output is a search result set with route metadata, containing at most K vectorized knowledge entries, each with a similarity score and source tag. If no valid results are produced at any route level, an empty result set is output with a "missed" label. The upper-layer application uses this result to decide whether to call a large language model to generate an answer or directly return a "no relevant knowledge found" message.
[0187] Finally, during the vectorization construction process, this invention binds complete metadata information to each knowledge unit, including: source document name, brand tag, chapter path, segmentation timestamp, version number, etc. This metadata does not participate in similarity calculations, but can be used for result tracing, version comparison, and knowledge updates after the search results are returned.
[0188] When a company receives a new version of the maintenance manual, the system can re-parse, segment, vectorize, and replace the old version of the knowledge units in the same brand's knowledge base according to the above steps, so as to achieve continuous evolution of the knowledge base without rebuilding the entire vector base.
[0189] The second embodiment of this application is as follows:
[0190] Referring to Figure 3, this invention provides a vectorized construction system for manufacturing equipment maintenance knowledge, which applies the vectorized construction method for manufacturing equipment maintenance knowledge provided in the first embodiment and comprises:
[0191] The maintenance document parsing module 101 is used to perform type recognition and structural complexity assessment on the acquired maintenance documents, extract text streams and text information from images in the documents according to the complexity level, and output structured chapter blocks.
[0192] The adaptive knowledge segmentation module 102 is used to detect the starting boundary of knowledge units in a document using multi-modal semantic boundary matching rules, extract complete knowledge units from the starting position of the current semantic boundary to the starting position of the next semantic boundary, and retain the hierarchical path information of the knowledge units.
[0193] The vectorization storage module 103 is used to determine the device brand to which each knowledge unit belongs through a multi-source weighted voting mechanism, send the knowledge units under the same brand into the embedding vectorization engine in batches according to the original document order, generate corresponding semantic vectors, and store the vectors and metadata into the corresponding brand-level vector knowledge base.
[0194] The similarity calculation and filtering module 104 is used to measure the semantic distance between the user's question vector and the knowledge unit vector during retrieval using cosine similarity. It utilizes the non-negativity of vectors to limit the similarity to the range of 0 to 1, and obtains a set of candidate knowledge units through a preset similarity threshold and TOP-K filtering mechanism.
[0195] The multi-level routing and threshold control module 105 is used to select the corresponding brand-level vector knowledge base for retrieval based on the device brand identified in the user query. When the retrieval results do not meet the quality requirements, it sequentially executes a progressive degradation strategy of lowering the similarity threshold, switching to a general cross-brand knowledge base, and enabling the online retrieval channel. It also dynamically determines the number of knowledge units to be output based on the confidence level of the results.
[0196] Regarding the system in the above embodiments, the specific ways in which each module performs operations have been described in detail in the embodiments related to the method, and will not be elaborated here.
[0197] For the system embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this application according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0198] Accordingly, this application also provides an electronic device, including: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the vectorized construction method for manufacturing equipment maintenance knowledge as described above. Figure 4 is a hardware structure diagram of any device with data processing capabilities used in the vectorized construction system for manufacturing equipment maintenance knowledge provided by an embodiment of the present invention. In addition to the processor, memory, and network interface shown in Figure 4, any data processing device in the embodiment may also include other hardware depending on the actual function of the data processing device, which will not be described in detail here.
[0199] Accordingly, this application also provides a computer-readable storage medium storing computer instructions thereon, which, when executed by a processor, implement the vectorized construction method for manufacturing equipment maintenance knowledge as described above. The computer-readable storage medium can be an internal storage unit of any data-processing device as described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium can also be an external storage device, such as a plug-in hard disk, smart media card (SMC), SD card, flash card, etc., equipped on the device. Furthermore, the computer-readable storage medium can include both internal storage units of any data-processing device and external storage devices. The computer-readable storage medium is used to store the computer program and other programs and data required by the data-processing device, and can also be used to temporarily store data that has been output or will be output.
[0200] Other embodiments of this application will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of this application that follow the general principles of this application and include common knowledge or customary techniques in the art not disclosed herein.
[0201] It should be understood that this application is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope.
Claims
1. A vectorized construction method for manufacturing equipment maintenance knowledge, characterized in that it comprises the following steps:
performing type identification and structural complexity assessment on the acquired maintenance documents, extracting, in a differentiated manner according to the complexity level, the text stream in the documents and the text information in the images, and outputting structured chapter blocks;
detecting the starting boundaries of knowledge units in a document using multi-modal semantic boundary matching rules, extracting complete knowledge units from the starting position of the current semantic boundary to the starting position of the next semantic boundary, and preserving the hierarchical path information of the knowledge units;
determining the device brand to which each knowledge unit belongs through a multi-source weighted voting mechanism, sending the knowledge units under the same brand to the embedding vectorization engine in batches according to the original document order to generate corresponding semantic vectors, and storing the vectors and metadata in the corresponding brand-level vector knowledge base;
during retrieval, measuring the semantic distance between the user's question vector and the knowledge unit vector using cosine similarity, using the non-negativity of the vectors to limit the similarity to the range of 0 to 1, and obtaining a set of candidate knowledge units through a preset similarity threshold and a TOP-K filtering mechanism;
selecting the corresponding brand-level vector knowledge base for retrieval based on the device brand identified in the user query, and, when the retrieval results do not meet the quality requirements, sequentially executing a progressive degradation strategy of lowering the similarity threshold, switching to a general cross-brand knowledge base, and enabling the online retrieval channel; and
dynamically determining the final number of knowledge units to be output based on the confidence level of the results.
2. The vectorized construction method for manufacturing equipment maintenance knowledge as described in claim 1, characterized in that extracting, according to the complexity level, the text stream in the documents and the text information in the images comprises: classifying each maintenance document as having a simple structure or a complex structure by analyzing its chapter hierarchy depth, image and table density, and numbering pattern diversity; for documents with a simple structure, retaining images with high information gain potential and extracting their text information with an optical character recognition unit; and for documents with a complex structure, skipping all image extraction and retaining only the text content.
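The classification in claim 2 can be sketched as follows. This is an illustrative Python sketch only: the heading patterns, the three cut-off values, and the rule that any single exceeded cut-off makes a document "complex" are assumptions, since the claim names the three signals but not how they are measured or combined.

```python
import re


def classify_structure(lines, image_count, table_count,
                       max_depth=3, max_density=0.05, max_patterns=2):
    """Classify a document as 'simple' or 'complex' (all cut-offs are assumed)."""
    depth = 0
    patterns = set()
    for line in lines:
        text = line.strip()
        numeric = re.match(r"^(\d+(?:\.\d+)*)\s", text)
        if numeric:
            # "1.2.3 Title" has hierarchy depth 3.
            depth = max(depth, numeric.group(1).count(".") + 1)
            patterns.add("numeric")
        elif re.match(r"^第.+[章节]", text):
            depth = max(depth, 1)
            patterns.add("chinese")
        elif re.match(r"^\(\d+\)", text):
            patterns.add("parenthesized")
    # Images and tables per line of text, as a rough density measure.
    density = (image_count + table_count) / max(len(lines), 1)
    is_complex = depth > max_depth or density > max_density or len(patterns) > max_patterns
    return "complex" if is_complex else "simple"
```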
3. The vectorized construction method for manufacturing equipment maintenance knowledge as described in claim 1, characterized in that the multi-modal semantic boundary matching rules include Chinese chapter-level rules, number-level rules, device-specific code rules, and composite numbering rules; each rule is configured with a hierarchical priority; and the document is scanned in priority order, the position of each successfully matched line is marked as the starting boundary of a knowledge unit, and the hierarchical depth of the boundary is recorded.
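A minimal Python sketch of priority-ordered boundary matching and boundary-to-boundary unit extraction is given below. The regular expressions, the priorities, and the depth assigned to each rule are hypothetical examples; the composite numbering rule is omitted because its form is not specified here.

```python
import re

# Hypothetical rule table: (name, priority, pattern, function mapping a match to a depth).
RULES = sorted([
    ("chinese_chapter", 0, re.compile(r"^第[一二三四五六七八九十百0-9]+章"), lambda m: 1),
    ("chinese_section", 1, re.compile(r"^第[一二三四五六七八九十百0-9]+节"), lambda m: 2),
    ("numeric", 2, re.compile(r"^(\d+(?:\.\d+)*)[\s.、]"),
     lambda m: m.group(1).count(".") + 1),
    ("device_code", 3, re.compile(r"^[A-Z]{1,3}-?\d{2,4}\b"), lambda m: 3),
], key=lambda rule: rule[1])


def detect_boundaries(lines):
    """Mark each line matched by the highest-priority rule as a unit start."""
    boundaries = []
    for index, line in enumerate(lines):
        text = line.strip()
        for name, _priority, pattern, depth_of in RULES:
            match = pattern.match(text)
            if match:
                boundaries.append({"line": index, "rule": name,
                                   "depth": depth_of(match), "title": text})
                break  # first (highest-priority) matching rule wins
    return boundaries


def extract_units(lines):
    """Cut units from each boundary up to, but excluding, the next boundary."""
    bounds = detect_boundaries(lines)
    units = []
    for i, bound in enumerate(bounds):
        end = bounds[i + 1]["line"] if i + 1 < len(bounds) else len(lines)
        units.append({**bound, "text": "\n".join(lines[bound["line"]:end])})
    return units
```

Because each unit runs to the start of the next boundary, body lines that match no rule stay attached to the heading that precedes them.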
4. The vectorized construction method for manufacturing equipment maintenance knowledge as described in claim 1, characterized in that preserving the hierarchical path information of the knowledge units comprises: maintaining a temporary stack structure; when the hierarchical depth of a detected boundary is greater than the depth at the top of the stack, pushing the boundary title onto the stack; when the depth is less than or equal to the depth at the top of the stack, popping the stack until the depth at the top is less than the current depth, and then pushing the current boundary title onto the stack; and, when each knowledge unit is generated, concatenating all the titles in the stack into a hierarchical path string and storing it with the unit.
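The stack procedure of claim 4 can be sketched in a few lines of Python. The input format (a list of `(depth, title)` pairs) and the `" > "` separator are assumptions made for illustration.

```python
def build_paths(boundaries, separator=" > "):
    """Return one hierarchical path string per boundary.

    boundaries is a list of (depth, title) pairs in document order.
    """
    stack = []  # holds (depth, title) for the current chain of ancestors
    paths = []
    for depth, title in boundaries:
        # Pop siblings and deeper entries until the top is strictly shallower.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        stack.append((depth, title))
        paths.append(separator.join(t for _, t in stack))
    return paths
```

For the boundaries `[(1, "Chapter 1"), (2, "1.1 Pump"), (3, "E-101"), (2, "1.2 Valve")]`, the last unit receives the path `Chapter 1 > 1.2 Valve`, because `1.1 Pump` and `E-101` are popped before `1.2 Valve` is pushed.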
5. The vectorized construction method for manufacturing equipment maintenance knowledge as described in claim 1, characterized in that determining the device brand to which each knowledge unit belongs through a multi-source weighted voting mechanism comprises: obtaining, in order of priority, the document-level brand tag, the brand keywords in the file name and metadata, and the brand mention frequency and position weight in the knowledge unit text, and calculating a weighted score for each candidate brand; when the scores of multiple brands fall within a set range, marking the knowledge unit as multi-brand coexistence and storing it in the general cross-brand knowledge base; and, when more than half of the units in the same document have identified brands, applying a context propagation rule so that unidentified units inherit the brand tags of adjacent units.
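One possible reading of the weighted voting and context propagation in claim 5 is sketched below in Python. The weights, the score margin, the position weighting formula, and the choice to inherit from the nearest preceding identified unit (falling back to the nearest following one) are all assumptions; the claim names the evidence sources but not their numeric treatment.

```python
W_DOC_TAG, W_FILENAME, W_TEXT = 3.0, 2.0, 1.0  # assumed source weights
MARGIN = 0.5                                   # assumed "set range" between top scores
MULTI_BRAND = "MULTI_BRAND"                    # hypothetical label for the general base


def vote_brand(text, brands, doc_tag=None, filename=""):
    """Return the winning brand, MULTI_BRAND on a near tie, or None if no evidence."""
    lowered = text.lower()
    scores = {}
    for brand in brands:
        key = brand.lower()
        score = 0.0
        if doc_tag and doc_tag.lower() == key:
            score += W_DOC_TAG
        if key in filename.lower():
            score += W_FILENAME
        position = lowered.find(key)
        while position != -1:
            # Earlier mentions count more: weight falls from 1.0 toward 0.5.
            score += W_TEXT * (1.0 - 0.5 * position / max(len(lowered), 1))
            position = lowered.find(key, position + len(key))
        if score > 0:
            scores[brand] = score
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] <= MARGIN:
        return MULTI_BRAND
    return ranked[0][0]


def propagate(labels):
    """Fill None labels from adjacent units when more than half are identified."""
    identified = sum(1 for label in labels if label is not None)
    if identified * 2 <= len(labels):
        return list(labels)
    result = list(labels)
    for i, label in enumerate(labels):
        if label is None:
            before = next((labels[j] for j in range(i - 1, -1, -1)
                           if labels[j] is not None), None)
            after = next((labels[j] for j in range(i + 1, len(labels))
                          if labels[j] is not None), None)
            result[i] = before if before is not None else after
    return result
```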
6. The vectorized construction method for manufacturing equipment maintenance knowledge as described in claim 1, characterized in that, before the knowledge units are sent to the embedding vectorization engine in batches, the method further comprises: grouping the knowledge units by brand tag, sorting the units within each brand group in ascending order of their starting position in the original document, and dividing them into batches of fixed length; and, for knowledge units whose length exceeds the maximum input limit of the embedding model, adopting a two-end truncation strategy that retains the tokens at the beginning and at the end of the text and discards the middle part.
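The grouping, ordering, batching, and two-end truncation of claim 6 can be sketched as follows. The unit dictionary keys (`brand`, `start`, `tokens`) and the even split between head and tail tokens are illustrative assumptions; the claim does not state how many tokens are kept at each end.

```python
from collections import defaultdict


def truncate_both_ends(tokens, max_len, head_ratio=0.5):
    """Keep the first and last tokens and drop the middle when tokens exceed max_len."""
    if len(tokens) <= max_len:
        return list(tokens)
    head = int(max_len * head_ratio)
    tail = max_len - head
    kept_tail = list(tokens[-tail:]) if tail > 0 else []
    return list(tokens[:head]) + kept_tail


def make_batches(units, batch_size, max_len):
    """Group by brand, sort by original start position, then cut fixed-size batches."""
    by_brand = defaultdict(list)
    for unit in units:
        by_brand[unit["brand"]].append(unit)
    batches = []
    for brand in sorted(by_brand):
        ordered = sorted(by_brand[brand], key=lambda unit: unit["start"])
        prepared = [{**unit, "tokens": truncate_both_ends(unit["tokens"], max_len)}
                    for unit in ordered]
        for i in range(0, len(prepared), batch_size):
            batches.append((brand, prepared[i:i + batch_size]))
    return batches
```

The last batch of each brand group may be shorter than `batch_size`, and no batch mixes units from different brands.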
7. The vectorized construction method for manufacturing equipment maintenance knowledge as described in claim 1, characterized in that the method further comprises: raising or lowering the similarity threshold depending on whether the query type is a fault code query or a maintenance step query, and adaptively fine-tuning the threshold based on the density of knowledge units in the knowledge base; and maintaining a threshold cache table that records the optimal threshold for historical queries to accelerate repeated retrieval.
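A hedged Python sketch of the threshold adjustment and cache in claim 7 follows. The claim does not say which query type raises the threshold or by how much, so the direction (fault code queries raise it, maintenance step queries lower it), the step size, the density cut-off, and the cache key are all assumptions.

```python
class ThresholdPolicy:
    """Assumed policy: adjust a base threshold by query type and knowledge base size."""

    def __init__(self, base=0.6, step=0.05, dense_units=10000):
        self.base = base
        self.step = step
        self.dense_units = dense_units
        self.cache = {}  # threshold cache table: query key -> recorded threshold

    def threshold_for(self, query_key, query_type, unit_count):
        """Return the cached threshold if present, otherwise compute one."""
        if query_key in self.cache:
            return self.cache[query_key]
        threshold = self.base
        if query_type == "fault_code":
            threshold += self.step      # assumed: exact codes deserve stricter matching
        elif query_type == "maintenance_step":
            threshold -= self.step      # assumed: procedural text is phrased more loosely
        if unit_count >= self.dense_units:
            threshold += self.step / 2  # assumed: dense bases can afford a higher bar
        return min(max(threshold, 0.0), 1.0)

    def record(self, query_key, threshold):
        """Store the threshold that worked for a historical query."""
        self.cache[query_key] = threshold
```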
8. The vectorized construction method for manufacturing equipment maintenance knowledge as described in claim 1, characterized in that the progressive degradation strategy comprises: first, keeping the target brand unchanged and temporarily lowering the similarity threshold by one step to search again; if sufficient results are still not obtained, switching to the general cross-brand knowledge base and retrieving with the original threshold; if this is still ineffective, enabling the online retrieval channel to obtain public information and convert it into temporary knowledge units; and, if all degradation steps yield no results, returning an empty result set and recording the miss event.
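The ordering of the degradation steps in claim 8 can be sketched as a chain of fallbacks. In this Python sketch the three search callables are hypothetical stand-ins supplied by the caller, and the step size, the `min_results` criterion for "sufficient results", and the stage names are assumptions.

```python
def retrieve_with_fallback(query, brand, threshold, search_brand, search_general,
                           search_web, step=0.1, min_results=1, miss_log=None):
    """Try each stage in order and return (stage_name, results) for the first success.

    search_brand(query, brand, threshold), search_general(query, threshold) and
    search_web(query) are caller-supplied callables that return lists.
    """
    stages = [
        ("brand", lambda: search_brand(query, brand, threshold)),
        # Step 1: same brand, threshold temporarily lowered by one step.
        ("brand_lowered", lambda: search_brand(query, brand, max(threshold - step, 0.0))),
        # Step 2: general cross-brand base, original threshold.
        ("general", lambda: search_general(query, threshold)),
        # Step 3: online channel; results are treated as temporary knowledge units.
        ("web", lambda: search_web(query)),
    ]
    for name, run in stages:
        results = run()
        if len(results) >= min_results:
            return name, results
    if miss_log is not None:
        miss_log.append(query)  # record the miss event
    return "miss", []
```

Because the lowered threshold is passed only to the second stage, the general cross-brand search runs with the original threshold, as the claim requires.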
9. The vectorized construction method for manufacturing equipment maintenance knowledge as described in claim 1, characterized in that dynamically determining the final number of knowledge units to be output based on the confidence level of the results comprises: calculating the mean and standard deviation of the cosine similarity of the candidate knowledge units that pass the similarity threshold; when the mean is higher than a first preset value and the standard deviation is lower than a second preset value, taking the first K value; when the mean is in the middle range and the standard deviation is greater than the set range, taking the second K value; and, when the mean is lower than a third preset value, determining that the retrieval quality is low, outputting no results, and triggering the degradation route.
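The confidence-based choice of K in claim 9 can be sketched as below. All preset values and both K values are illustrative assumptions, and the final fallback branch covers combinations of mean and standard deviation that the claim leaves unspecified; returning the first K value there is a choice made for this sketch only.

```python
from statistics import mean, pstdev


def choose_output(scores, high_mean=0.8, low_mean=0.5, tight_std=0.05,
                  wide_std=0.1, k_first=3, k_second=5):
    """Return (selected_scores, trigger_degradation) for threshold-filtered scores."""
    if not scores:
        return [], True
    ranked = sorted(scores, reverse=True)
    mu = mean(ranked)
    sigma = pstdev(ranked)  # population standard deviation of the candidates
    if mu < low_mean:
        return [], True     # low retrieval quality: output nothing, degrade
    if mu > high_mean and sigma < tight_std:
        return ranked[:k_first], False
    if low_mean <= mu <= high_mean and sigma > wide_std:
        return ranked[:k_second], False
    # Not covered by the claim; assumed fallback to the first K value.
    return ranked[:k_first], False
```

A tight cluster of high scores yields a short, confident answer set, whereas a mid-range mean with widely spread scores yields the larger second K value.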
10. A vectorized construction system for manufacturing equipment maintenance knowledge, applied to the vectorized construction method for manufacturing equipment maintenance knowledge as described in claim 1, characterized in that it comprises:
a maintenance document parsing module, used to perform type identification and structural complexity assessment on the acquired maintenance documents, extract the text stream in the documents and the text information in the images according to the complexity level, and output structured chapter blocks;
an adaptive knowledge segmentation module, used to detect the starting boundaries of knowledge units in a document using multi-modal semantic boundary matching rules, extract complete knowledge units from the starting position of the current semantic boundary to the starting position of the next semantic boundary, and preserve the hierarchical path information of the knowledge units;
a vectorized storage module, used to determine the device brand to which each knowledge unit belongs through a multi-source weighted voting mechanism, send the knowledge units under the same brand to the embedding vectorization engine in batches according to the original document order to generate corresponding semantic vectors, and store the vectors and metadata in the corresponding brand-level vector knowledge base;
a similarity calculation and filtering module, used to measure the semantic distance between the user's question vector and the knowledge unit vector during retrieval using cosine similarity, use the non-negativity of the vectors to limit the similarity to the range of 0 to 1, and obtain a set of candidate knowledge units through a preset similarity threshold and a TOP-K filtering mechanism; and
a multi-level routing and threshold control module, used to select the corresponding brand-level vector knowledge base for retrieval based on the device brand identified in the user query, to sequentially execute, when the retrieval results do not meet the quality requirements, a progressive degradation strategy of lowering the similarity threshold, switching to a general cross-brand knowledge base, and enabling the online retrieval channel, and to dynamically determine the final number of knowledge units to be output based on the confidence level of the results.