A domain knowledge graph extraction method and system based on large model and small model cooperation
By employing a collaborative approach between large and small models, the problem of insufficient scalability and generalization in the automated construction of knowledge graphs is solved, achieving high-quality knowledge graph extraction that is applicable to vertical domains and reducing model parameter requirements and noise interference.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: SUZHOU AEROSPACE INFORMATION RES INST
- Filing Date: 2026-03-27
- Publication Date: 2026-06-30
AI Technical Summary
Existing automated knowledge graph construction methods have poor scalability and weak generalization ability in practical applications, making it difficult to handle real-world domain data. This results in inaccurate and redundant entity and relation extraction, failing to meet the actual needs of vertical domains.
Employing a collaborative approach of large and small models, this method involves main entity extraction, document preprocessing, document entity extraction, relation extraction and verification, and entity fusion. By combining BGE-m3 and BGE-Reranker models, it filters out noise information, performs fine-grained knowledge fusion, and provides plug-in-style fixes for common problems.
The method achieves high-quality knowledge graph construction with small-parameter models that can run on a single machine. It reduces noise and knowledge redundancy, improves extraction quality, and adapts to vertical-domain applications.
Figure CN122309760A_ABST
Description
Technical Field
[0001] This invention relates to knowledge graphs, specifically to a method and system for extracting domain knowledge graphs based on the collaboration of large and small models.
Background Technology
[0002] Knowledge graphs, as a form of knowledge representation, describe real-world concepts and objects, and the relationships between them, in structured form. A constructed knowledge graph can support upper-level applications such as semantic retrieval, semantic question answering, and deep relational queries. The term also covers a family of techniques and engineering practices, such as named entity recognition (NER), relation extraction (RE), attribute extraction (PE), event extraction (EE), and entity linking (EL), which together complete the construction of a knowledge graph.
[0003] Since the emergence of the Transformer model, a number of natural language processing (NLP) technologies have been developed specifically to address the knowledge graph problems above. These include models fine-tuned from pre-trained models such as BERT, RoBERTa, XLM, and ALBERT, which have achieved promising results on experimental datasets. In real-world applications, however, these models often struggle because of their poor scalability, weak generalization ability, and the scarcity of real-world training data. Consequently, automated knowledge graph construction remains an open challenge.
[0004] In recent years, the emergence of large models has brought major progress to knowledge graph technology. Prompt engineering and RAG-related techniques have largely addressed the scalability problems of traditional models and the sparsity of real-world data, providing a basic capability for automated knowledge graph construction. Although large models can understand complex natural language problems, the reliability and accuracy of their generated results are not guaranteed: hallucination and knowledge cutoff are inherent problems of large models. Furthermore, their stochastic, black-box nature is difficult to control, so the range within which a large model is allowed to operate must be defined carefully to obtain stable results in applications.
[0005] In common methods for automated knowledge graph construction, such as GraphRAG, KAG, and NodeRAG, entity and relation extraction is typically performed once per text. A single large model call aims to obtain all entities or relations from the input text, often resulting in poor accuracy of entity and relation extraction and an inability to fully cover the information described in the input text. Furthermore, these solutions do not perform deep fusion and disambiguation processing on entities, often leading to redundant and disordered knowledge. Ultimately, in real-world applications in vertical domains, the data obtained by such automated knowledge graph construction solutions is insufficient to meet actual needs.
Summary of the Invention
[0006] The purpose of this invention is to provide a method and system for extracting domain knowledge graphs based on the collaboration of large and small models. The main problem it solves is automatically constructing a high-quality knowledge graph from text using models whose parameter count is small enough to run on a single machine. The secondary problems it solves are filtering noise during extraction, integrating fine-grained knowledge fusion into the extraction process, and providing plug-in-style repair of common problems.
[0007] The technical solution to achieve the purpose of this invention is: a domain knowledge graph extraction method based on the collaboration of large and small models, comprising the following steps:
[0008] S1. Main Entity Extraction: After the input text is truncated, a large model extracts the central entity from it. The large model outputs a structured result containing the entity name, confidence level, and reason for extraction, and the result is then filtered by confidence level.
[0009] S2. Document Preprocessing and Vectorization: The original text is semantically segmented into multiple topic blocks. For each topic block, a large model is used to generate a summary of its association with the central entity, forming an enhanced semantic block. The BGE-m3 model is used to calculate the text vector of each enhanced semantic block and write it into the vector database.
[0010] S3. Document Entity Extraction: For each enhanced semantic block, candidate entity names are extracted in parallel by a large model and by a domain word segmenter built from the entity list of the existing knowledge base. After merging and deduplication, the large model determines the entity type and confidence level of each candidate entity name. Entities with confidence levels higher than the set threshold are retained as valid entities. The large model then generates a context-aware description for each valid entity.
[0011] S4. Relation Extraction and Validation: Calculate all candidate entity pairs based on the list of valid entities; for each candidate entity pair, construct a search query to retrieve supporting text from the vector database; use a large model to judge whether each supporting text is usable, filtering out noise; for the filtered entity pairs and supporting text, use the large model to determine the relation type; subsequently, use the large model to perform hallucination detection, relation strength detection, and common relation direction error detection on the extracted relations; for relations that pass all detections, use the large model to generate a relation summary and supporting text excerpt to obtain the final relation triple information.
[0012] S5. Entity Fusion: Taking the list of valid entities and the list of relationship information as input, a star graph of related relationships is constructed for each entity, and the summary information of those relationships is extracted to form the similarity support material for the entity; a standardized text is constructed for each entity, its semantic vector is calculated using the BGE-m3 model for coarse-grained similarity retrieval, and the BGE-Reranker model is used to perform fine-grained similarity calculation on candidate entity pairs; for candidate entity pairs with similarity higher than the threshold, the large model is used to determine whether they are entities with the same name; entities determined to have the same name are fused, clustered into one class, a standard entity name is selected, and all relationships are updated.
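The coarse-then-fine candidate selection in S5 can be illustrated with a minimal sketch. It assumes the entity vectors have already been computed (by BGE-m3 in the method) and that `rerank_score` is a caller-supplied function standing in for BGE-Reranker; the function names and the toy vectors are hypothetical and are not part of the original disclosure.

```python
import math
from typing import Callable


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coarse_candidates(vectors: dict[str, list[float]], top_k: int = 10) -> set[tuple[str, str]]:
    """Pair every entity with its top_k most similar entities (coarse stage)."""
    pairs: set[tuple[str, str]] = set()
    for name, vec in vectors.items():
        scored = sorted(
            ((cosine(vec, other_vec), other)
             for other, other_vec in vectors.items() if other != name),
            reverse=True,
        )
        for _, other in scored[:top_k]:
            # Store each pair in a canonical order so (a, b) and (b, a) collapse.
            pairs.add(tuple(sorted((name, other))))
    return pairs


def fine_filter(
    pairs: set[tuple[str, str]],
    texts: dict[str, str],
    rerank_score: Callable[[str, str], float],  # stand-in for BGE-Reranker
    threshold: float = 0.85,
) -> list[tuple[str, str]]:
    """Keep only pairs whose fine-grained score meets the threshold (fine stage)."""
    return [p for p in sorted(pairs) if rerank_score(texts[p[0]], texts[p[1]]) >= threshold]
```

Pairs that survive both stages would then go to the large model for the final same-name judgment, so the expensive model call is only made for a small number of plausible candidates.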
[0013] Furthermore, step S1 specifically includes:
[0014] S1-1 Input text truncation processing: Extract the first 4000 characters of the article as the actual input text for the main entity extraction;
[0015] S1-2, Main Entity Extraction by the Large Model: The prompt specifies the characteristics of a main entity and gives extraction examples, and constrains the large model to output a structured result containing the entity name, confidence level, and extraction reason. The confidence level is qualitatively set to three levels: high, medium, and low.
[0016] S1-3, Confidence Filtering: Based on the confidence of the main entity output by the large model, if the confidence is lower than the set threshold, the main entity is set to an empty string in subsequent processing; otherwise, the main entity name that meets the confidence threshold is used.
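Steps S1-1 and S1-3 can be sketched as follows. This is only an illustration: it assumes the large model returns its structured result as JSON with the hypothetical field names `entity_name`, `confidence`, and `reason`, that the default threshold is "medium", and that malformed output is treated as "no main entity". None of these details is fixed by the text above.

```python
import json

# Ordering of the three qualitative confidence levels.
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}
MAX_INPUT_CHARS = 4000  # S1-1: only the first 4000 characters are used


def truncate_input(text: str, limit: int = MAX_INPUT_CHARS) -> str:
    """S1-1: keep the first `limit` characters of the article."""
    return text[:limit]


def filter_main_entity(llm_output: str, threshold: str = "medium") -> str:
    """S1-3: return the entity name if its confidence meets the threshold, else ''."""
    try:
        result = json.loads(llm_output)
    except json.JSONDecodeError:
        return ""  # assumption: unparseable output means no main entity
    if not isinstance(result, dict):
        return ""
    level = str(result.get("confidence", "")).lower()
    if CONFIDENCE_RANK.get(level, -1) < CONFIDENCE_RANK[threshold]:
        return ""
    return str(result.get("entity_name", ""))
```

Returning an empty string rather than raising keeps later steps uniform: they can concatenate the main entity into templates without a separate "missing entity" branch.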
[0017] Furthermore, step S2 specifically includes:
[0018] S2-1, Coarse-grained, sentence-preserving fixed-length chunking: Using recursive chunking, the original input text is divided into coarse-grained text chunks that consist of complete sentences and are at most 800 characters long;
[0019] S2-2, Fine-grained semantic topic segmentation: The large model extracts each topic segment from each coarse-grained text segment, constraining the output of the large model to strictly match the input text, and outputting structured topic information containing topic, excerpt from the original text, and explanatory information;
[0020] S2-3, Semantic Block Summarization and Association Filling: After the central entity is concatenated with each topic text according to the template, the large model generates a summary. The large model is constrained to state explicitly the relationship between the central entity and each topic text, and to replace pronouns in the text with the names that best represent the entities they refer to. The output is a fused summary containing the topic summary, the summary chain of thought, and the confidence level. The confidence level is qualitatively set to three levels: high, medium, and low.
[0021] S2-4. Text Template Concatenation: Combine the topic, original text excerpt, explanatory information, topic summary, and summary chain of thought obtained for each topic text into a single text according to the template.
[0022] S2-5. Text Vector Library Construction: Use the BGE-m3 model to calculate the text vectors of the merged topic texts and write them into the vector database.
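The coarse-grained chunking of S2-1 can be approximated with a short sketch. It uses greedy sentence packing with a hard-cut fallback, which is a simplified stand-in for recursive chunking and not necessarily the splitter used in the method; the sentence-boundary regular expression is an assumption.

```python
import re

# Split after CJK sentence punctuation, or after ., !, ? followed by whitespace.
_SENTENCE_BOUNDARY = re.compile(r"(?<=[。!?])|(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    return [p.strip() for p in _SENTENCE_BOUNDARY.split(text) if p and p.strip()]


def chunk_text(text: str, max_chars: int = 800) -> list[str]:
    """Pack whole sentences into chunks of at most max_chars characters."""
    chunks: list[str] = []
    current = ""
    for sentence in split_sentences(text):
        # Fallback: a single sentence longer than the limit is cut hard.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk would then be passed to the large model for the fine-grained topic segmentation of S2-2.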
[0023] Furthermore, in step S3, the domain word segmenter is HMM-based. When the large model determines the entity type, the selectable range of target entity types is specified by parameters: the list of acceptable entity types can be explicitly limited for a given domain, or left unspecified so that the large model performs open-domain extraction. The confidence level is qualitatively set to three levels: high, medium, and low.
[0024] Furthermore, step S4 specifically includes:
[0025] S4-1, Calculation of candidate entity combination: Based on the list of valid entities obtained from entity extraction, calculate all permutations and combinations of two entities;
[0026] S4-2, Vector-based Supporting Document Retrieval: For each entity pair, construct the retrieval query "What is the relationship between entity A and entity B?", use the BGE-m3 model to calculate its vector representation, and use that vector to retrieve the most similar texts from the vector database;
[0027] S4-3, Large Model Document Validity Filtering: For each supporting text, use the large model to determine whether the text can effectively answer the question of whether there is a certain relationship between candidate entity pairs. Constrain the large model to only answer YES or NO, and provide the basis for the judgment and the confidence level. The confidence level is qualitatively set to three levels: high, medium and low. Only documents in which the large model judges the validity as YES and the confidence level is high are retained.
[0028] S4-4, Large Model Relationship Type Judgment: Combine each candidate entity pair and its supporting document into a prompt according to the template, use the large model to determine the relationship type, constrain the large model to select the relationship type only from the target relationship type list, and output the relationship type and relationship description;
[0029] S4-5, Large Model Relationship Hallucination Detection: For each entity pair and its relationship type, use the large model to determine whether the relationship strictly originates from the supporting document, constrain the large model to only answer YES or NO, and only retain the relationship information where the large model answers YES.
[0030] S4-6, Large Model Relationship Strength Detection: For each relation, use the large model to determine the relation strength, constrain the large model to only answer A, B, and C, which represent the relation strength as strong association, medium association, and weak association, respectively, and only retain the relation information of the large model answering A;
[0031] S4-7, Large Model Common Relationship Error Detection: For each relationship, use the large model to determine whether the relationship direction is correct, constraining the large model to only answer YES or NO; for relationship information judged to have the opposite direction, swap its head entity and tail entity;
[0032] S4-8. Large Model Relationship Summary Generation: For each relation that has passed the above multi-step detection, the large model generates a concise summary and extracts supporting statements from the original text. The final relation information includes the head entity name, relation type, tail entity name, relation summary, and relation supporting text excerpt.
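The three checks in S4-5 to S4-7 form a short pipeline that can be sketched as below. The `ask` callable is a hypothetical stand-in for a constrained large model call that returns the raw answer string; the task names and the `Relation` fields are illustrative assumptions.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class Relation:
    head: str
    relation_type: str
    tail: str
    support: str  # supporting text retrieved in S4-2


# (task_name, relation) -> raw answer from the large model
Judge = Callable[[str, Relation], str]


def verify_relation(rel: Relation, ask: Judge) -> Optional[Relation]:
    """Apply S4-5, S4-6 and S4-7 in order; return None if the relation is rejected."""
    # S4-5: keep only relations that strictly originate from the supporting text.
    if ask("hallucination", rel).strip().upper() != "YES":
        return None
    # S4-6: keep only strong associations (answer A).
    if ask("strength", rel).strip().upper() != "A":
        return None
    # S4-7: a NO answer means the direction is reversed, so swap head and tail.
    if ask("direction", rel).strip().upper() == "NO":
        rel = replace(rel, head=rel.tail, tail=rel.head)
    return rel
```

Restricting each call to a closed answer set (YES/NO or A/B/C) makes the result trivially parseable, which is why the checks can be chained without any free-text interpretation.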
[0033] Furthermore, in step S4-2, the vectorized retrieval of entity pairs uses batch processing and concurrency techniques; in steps S4-4 and S4-8, when using the large model to determine the relation type and generate the summary, the selectable range of the target relation type is specified by parameters. It is possible to explicitly limit the list of acceptable relation types for a certain domain, or not to specify the relation type so that the large model can perform open domain extraction.
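Candidate pair generation (S4-1) and concurrent retrieval (S4-2) might look like the following sketch. It assumes unordered pairs are sufficient, since direction is corrected later in S4-7, and it takes a hypothetical `retrieve` callable in place of BGE-m3 encoding plus a vector database query.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations
from typing import Callable


def candidate_pairs(entities: list[str]) -> list[tuple[str, str]]:
    """S4-1: all unordered pairs of distinct entity names, in first-seen order."""
    unique = list(dict.fromkeys(entities))
    return list(combinations(unique, 2))


def build_query(entity_a: str, entity_b: str) -> str:
    """S4-2: the fixed retrieval question for an entity pair."""
    return f"What is the relationship between {entity_a} and {entity_b}?"


def retrieve_all(
    pairs: list[tuple[str, str]],
    retrieve: Callable[[str], list[str]],  # stand-in: query -> supporting texts
    max_workers: int = 8,
) -> dict[tuple[str, str], list[str]]:
    """Run the retrieval for every pair concurrently and keep results per pair."""
    queries = [build_query(a, b) for a, b in pairs]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(retrieve, queries))  # map preserves input order
    return dict(zip(pairs, results))
```

Because the number of pairs grows quadratically with the number of entities, batching and concurrency matter most for long documents with many valid entities.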
[0034] Furthermore, step S5 specifically includes:
[0035] S5-1, Entity Star Graph and Feature Information Construction: Using the entity name as a parameter, find the relationships whose head or tail entity matches it among all relationship information, obtaining a star graph centered on that single entity; extract the summary information of each relationship from the list of related relationships for each entity to form a list of related relationship descriptions for that entity, which serves as the similarity support material for that entity;
[0036] S5-2, Coarse-grained vector similarity matching: For each entity and its related relationship description list, construct a standardized text containing the entity name, entity type, entity description, and entity relationship description list. Feed the standardized text of the entity into the BGE-m3 model to calculate the semantic vector of the entity. For each entity, find the top 10 entities most similar to its semantic vector and construct candidate entity pairs. This calculation is batched and run concurrently.
[0037] S5-3, Fine-grained similarity matching: For the standardized text pairs of candidate entity pairs, the BGE-Reranker model is used to calculate fine-grained similarity, and candidate entity pairs whose similarity meets the 0.85 threshold are retained; this calculation is batched and run concurrently.
[0038] S5-4, Large Model Same Name Detection: For each candidate entity pair with a fine-grained similarity higher than the threshold, use the large model to determine whether the two entities are entities with the same name. Constrain the large model to only output YES or NO, and provide an explicit explanation for its answer. Only candidate entity pairs that the large model judges to be YES are retained.
[0039] S5-5. Post-entity fusion processing: Perform fusion processing on the same-named entity pairs that have passed the above detection steps, group all entities with the same name into one class, select one entity in the class as the standardized entity, and use the name of this entity as the standard name of the class. Use the names of the remaining entities as aliases of the class. Then, query all relationships based on the original entity names and change the entity names corresponding to these relationships to the standard entity names of the class.
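The post-fusion step S5-5 amounts to clustering confirmed pairs and rewriting relationships. A minimal sketch using a union-find structure is shown below. The rule for choosing the standard name (the longest name in the class) is purely an assumption for illustration, since the text only says that one entity in the class is selected.

```python
def fuse_entities(
    entities: list[str],
    same_pairs: list[tuple[str, str]],
    relations: list[tuple[str, str, str]],
) -> tuple[dict[str, list[str]], list[tuple[str, str, str]]]:
    """Cluster same-name entities and rewrite relations to the standard names.

    Every name in same_pairs must also appear in entities.
    Returns (standard name -> aliases, rewritten relations).
    """
    names = list(dict.fromkeys(entities))
    parent = {name: name for name in names}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in same_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters: dict[str, list[str]] = {}
    for name in names:
        clusters.setdefault(find(name), []).append(name)

    aliases: dict[str, list[str]] = {}
    canonical: dict[str, str] = {}
    for members in clusters.values():
        standard = max(members, key=len)  # assumption: longest name is the standard
        aliases[standard] = [m for m in members if m != standard]
        for m in members:
            canonical[m] = standard

    fused = [(canonical.get(h, h), r, canonical.get(t, t)) for h, r, t in relations]
    return aliases, fused
```

Union-find handles transitive merges naturally: if A matches B and B matches C, all three end up in one class even though A and C were never compared directly.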
[0040] A domain knowledge graph extraction system based on the collaboration of large and small models is provided to implement the domain knowledge graph extraction method based on the collaboration of large and small models. The system includes: a main entity extraction module, a document preprocessing module, a document entity extraction module, a relation extraction module, and an entity fusion module, which respectively execute steps S1 to S5.
[0041] An electronic device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the domain knowledge graph extraction method based on the collaboration of large and small models.
[0042] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the domain knowledge graph extraction method based on the collaboration of large and small models.
[0043] Compared with the prior art, the significant advantages of this invention are:
[0044] (1) It lowers the hard requirement on model parameter count: application-grade automated knowledge graph construction can be achieved with 4B, 7B, and 32B parameter models that a single machine can host.
[0045] (2) Decompose the knowledge extraction and knowledge fusion tasks into basic task combinations that are more suitable for the large model according to their respective processing characteristics: judgment task, selection task, induction task, etc., and improve the extraction effect by reducing the complexity of the tasks in a targeted manner.
[0046] (3) Noise information is filtered during the extraction process. Relevant ideas and techniques in the RAG field are combined during the extraction process. Noise data with semantic inconsistencies is filtered out during the processing through vector indexing and retrieval capabilities.
[0047] (4) Integrate the fine-grained knowledge fusion process into the knowledge extraction process, and combine word vector technology with the task decomposition and judgment capabilities based on large models to automatically fuse entities with the same name and the same reference, thereby reducing the redundancy of knowledge.
[0048] (5) It supports plug-in repair of common problems in the knowledge extraction process: an interface for repairing common large model problems is reserved in the post-processing of extraction results, so the processing workflow can be extended as actual problems arise.
Attached Figure Description
[0049] Figure 1 is a flowchart of the domain knowledge graph extraction system of this invention.
[0050] Figure 2 is the processing flow of the main entity extraction module.
[0051] Figure 3 is the processing flow of the document preprocessing module.
[0052] Figure 4 is the processing flow of the document entity extraction module.
[0053] Figure 5 is the processing flow of the relation extraction module.
[0054] Figure 6 is the processing flow of the entity fusion module.
Detailed Implementation
[0055] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0056] This embodiment uses an encyclopedia overview text about "Harvard University" as an example to detail the technical process and its effects. The example text is as follows:
[0057] Harvard University, commonly known as Harvard, is a private research university located in Cambridge, Massachusetts, USA. It is an Ivy League school and a member of the Global University Presidents Forum, the Global Alliance of Institutes for Advanced Study, the Association of American Universities, the Association of Independent Colleges and Universities, and the American Council on Education. Founded in 1636 by the Massachusetts colonial legislature as "New Citizens College," it was renamed "Harvard College" in March 1639 in honor of Reverend John Harvard, who generously supported the college in its early years. In 1780, Harvard College officially became "Harvard University." In 2013, President Drew Faust launched the Harvard Campaign to raise funds for teaching and research programs and financial aid. On July 1, 2023, Claudine Gay... (Gay) has been appointed as the 30th president of Harvard University. Harvard University has three campuses: Cambridge, Alston, and Boston. It comprises Harvard College for undergraduates, 12 graduate schools, and the Radcliffe Institute for Advanced Study, and operates under a dual-committee system.
[0058] As shown in Figure 1, a domain knowledge graph extraction system based on the collaboration of large and small models includes five modules: main entity extraction, document preprocessing, document entity extraction, relation extraction and verification, and entity fusion. The implementation process of each module is as follows:
[0059] S1, Main Entity Extraction Module
[0060] This module extracts the central entity from the input text for use in subsequent modules.
[0061] As shown in Figure 2, the processing flow of the main entity extraction module in this embodiment includes:
[0062] S1-1: Input text truncation processing
[0063] To balance the maximum context length limit of the large model with the extraction response speed, the first 4000 characters of the article are extracted as the real input text for the main entity extraction.
[0064] The sample text length does not exceed the maximum character limit; the full text will be used as subsequent input.
[0065] S1-2: Extraction of Main Entities from Large Models
[0066] The prompt specifies the characteristics that the main entity should have, together with related extraction examples, and constrains the output of the large model to include the structured entity name, confidence level, and extraction reason. The confidence level is qualitatively set to three levels: high, medium, and low.
[0067] After this step, the sample text has the central entity "Harvard University" and a confidence level of "high".
[0068] S1-3: Confidence Filtering Process
[0069] Based on the confidence score of the main entity output by the large model, it is determined whether the main entity exists. If the confidence score is lower than the set threshold, the main entity is set to an empty string in subsequent processing; otherwise, the name of the main entity that meets the confidence score threshold is used.
[0070] The system default confidence threshold is set to medium. Main entities with a confidence level greater than or equal to medium meet the constraints. The central entity obtained in the previous step has a high confidence level, which meets the retention requirements.
[0071] S2, Document Preprocessing Module
[0072] This module segments the original text into topic blocks of appropriate length according to semantics. A large model then fills in the explicit relationship between each semantic block and the central entity. The BGE-m3 model is used to calculate the text vector of each filled semantic block and write it into the vector database to support the subsequent relation extraction module.
[0073] As shown in Figure 3, the document preprocessing module in this embodiment includes the following processing flow:
[0074] S2-1: Coarse-grained, sentence-preserving fixed-length chunking
[0075] The original input text is divided into coarse-grained text blocks, each consisting of complete sentences and a maximum length of 800 characters, using a recursive chunking technique.
[0076] S2-2: Fine-grained semantic topic segmentation
[0077] The large model is used to extract topic segments from each coarse-grained text block. The output text of the large model is constrained to strictly match the input text. The large model is also constrained to perform topic summarization for each topic segment. The final output structured topic information includes: topic, original text excerpt, and explanatory information.
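One way to enforce the "strictly match the input text" constraint of S2-2 after the fact is to discard any topic segment whose excerpt is not found verbatim in the chunk. The sketch below assumes each segment is a dictionary with a hypothetical `excerpt` field and tolerates differences in whitespace only; the original text does not say how the constraint is checked.

```python
import re


def normalize(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break matching."""
    return re.sub(r"\s+", " ", text).strip()


def keep_verbatim_segments(chunk: str, segments: list[dict]) -> list[dict]:
    """Keep only segments whose excerpt appears verbatim in the source chunk."""
    haystack = normalize(chunk)
    kept = []
    for segment in segments:
        excerpt = normalize(segment.get("excerpt", ""))
        if excerpt and excerpt in haystack:
            kept.append(segment)
    return kept
```

A check of this kind catches paraphrased or invented excerpts cheaply, before any further large model calls are spent on them.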
[0078] S2-3: Semantic Block Summarization Generation and Association Filling
[0079] After the central entity is concatenated with each topic text according to the template, a large model is used to generate a summary. The large model is constrained to state explicitly the relationship between the central entity and each topic text, to replace pronouns in the text with the names that best represent the entities they refer to, and to provide explicit explanations and confidence levels for its output. The final fused summary includes the topic summary, the summary chain of thought, and the confidence level.
[0080] S2-4: Text Template Concatenation
[0081] The topic, original text excerpt, explanatory information, topic summary, and summary chain of thought obtained for each topic text are combined into a single text according to the template.
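The template concatenation of S2-4 is essentially string formatting over the five fields. The sketch below is illustrative only: the field names and the exact template layout are assumptions, since the actual template is not given in the text.

```python
from dataclasses import asdict, dataclass


@dataclass
class TopicBlock:
    topic: str
    excerpt: str          # original text excerpt from S2-2
    explanation: str      # explanatory information from S2-2
    summary: str          # topic summary from S2-3
    chain_of_thought: str  # summary chain of thought from S2-3


# Hypothetical template; the real layout is not specified.
BLOCK_TEMPLATE = (
    "Topic: {topic}\n"
    "Chain of thought: {chain_of_thought}\n"
    "Content: {summary}\n"
    "Original excerpt: {excerpt}\n"
    "Explanation: {explanation}"
)


def render_block(block: TopicBlock) -> str:
    """Merge all fields of one topic block into a single text for embedding."""
    return BLOCK_TEMPLATE.format(**asdict(block))
```

Keeping the template as a single constant means the same layout is used when blocks are embedded and when they are later shown to the large model as supporting text.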
[0082] S2-5: Construction of Text Vector Library
[0083] The text vectors of the merged topic texts are calculated using the BGE-m3 model and written into the vector database.
[0084] After topic segmentation and semantic association filling, the example text is rendered through the template, yielding the following four semantic blocks:
[0085] Topic: Harvard University's Basic Positioning
[0086] Chain of thought: The original text begins by defining the school's name, location, and nature (private research institution), and lists several important international and domestic academic alliances it has joined. This information collectively constitutes Harvard University's basic identity profile, and semantically belongs to the same level.
[0087] Content: Harvard University (Harvard for short) is a private research university located in Cambridge, Massachusetts, USA. As a top global institution of higher learning, Harvard is not only a member of the Ivy League but also actively participates in international academic collaborations. It is a key member of the Global University Presidents Forum, the Global Alliance of Universities, the Association of American Universities, the Association of Independent Colleges and Universities, and the American Council on Education.
[0088] Topic: The Founding and Renaming of Harvard University
[0089] Chain of thought: This section narrates, in chronological order (1636, 1639, 1780), the school's development from its founding, through its renaming in honor of its donor, to the adoption of its official name. It is a complete historical narrative chain, focusing on the timeline development of the entity "Harvard University".
[0090] Content: Harvard University's history dates back to 1636, when it was founded by the Massachusetts colonial legislature as "New Citizens College." In March 1639, to commemorate Reverend John Harvard, who generously supported the university in its early years, it was officially renamed "Harvard College." With its expansion and rising status, Harvard College officially changed its name to its current name, "Harvard University," in 1780, marking a significant milestone in its development.
[0091] Topic: Recent Initiatives and Leadership at Harvard University
[0092] Chain of thought: The text mentions the fundraising campaign in 2013 and the inauguration of the president in 2023. These two events represent Harvard University's strategic actions in the modern era and the change of its top leadership, respectively, and are key milestones in the school's recent development.
[0093] Content: Entering the 21st century, Harvard University has continued to promote major development initiatives. In 2013, then-President Drew Faust launched the "Harvard Campaign" to raise funds for Harvard University's teaching, research programs, and financial aid. In terms of leadership, Claudine Gay officially took office on July 1, 2023, becoming Harvard University's 30th president, leading the university into a new phase of development.
[0094] Topic: Harvard University's Organizational Structure
[0095] Chain of thought: The last paragraph details the school's geographical distribution (three campuses) and internal academic / administrative structure (undergraduate schools, graduate schools, research institutes, and management system). This is a static description of Harvard University's physical structure and management system, forming a complete semantic unit on its own.
[0096] Content: Harvard University boasts a vast educational and research system, with physical campuses spanning Cambridge, Alston, and Boston. Organizationally, Harvard comprises the undergraduate school (Harvard College), 12 graduate schools, and the Radcliffe Institute for Advanced Study. Furthermore, Harvard employs a unique dual-committee corporate structure to ensure the efficient operation of its academic and administrative processes.
[0097] The reconstructed topic blocks are written into the vector library for use in subsequent steps.
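Writing blocks to a vector library and retrieving them by similarity can be illustrated with a small in-memory store. In this sketch a bag-of-words counter replaces the BGE-m3 embedding and a Python list replaces the vector database, so it only demonstrates the write-then-search flow, not the retrieval quality of the real components. All names are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for BGE-m3: a sparse bag-of-words vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class InMemoryVectorStore:
    """Stand-in for the vector database: stores (text, vector) rows in a list."""

    def __init__(self, embed=toy_embed):
        self._embed = embed
        self._rows: list[tuple[str, Counter]] = []

    def add(self, text: str) -> None:
        self._rows.append((text, self._embed(text)))

    def search(self, query: str, top_k: int = 3) -> list[str]:
        query_vec = self._embed(query)
        ranked = sorted(self._rows, key=lambda row: cosine(query_vec, row[1]), reverse=True)
        return [text for text, _ in ranked[:top_k]]
```

A relation extraction step could then call `search` with a question such as "What is the relationship between Harvard University and John Harvard?" to fetch the blocks most likely to support or refute a relation.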
[0098] S3, Document Entity Extraction Module
[0099] This module extracts entity names and entity types from the semantic blocks. Candidate entity names are extracted from each semantic block by both a large model and a small model built from the entity list of an existing knowledge base. Then, for each candidate entity name, the large model determines its entity type. Finally, the large model generates a context-aware description for each valid entity.
[0100] As shown in Figure 4, the processing flow of the document entity extraction module in this embodiment includes:
[0101] S3-1-1: Large Model Entity Noun Extraction
[0102] For each merged topic text output by the document preprocessing module, a large model is used to extract candidate entity names. The large model is constrained to ignore weak nouns such as dates, numbers, and pronouns, and the nouns it outputs are constrained to strictly match the input text.
[0103] The noun extraction results for the large model are as follows:
[0104] Theme Block 1: Harvard University, Harvard, United States, Massachusetts, Ivy League, Global University Presidents Forum, Global Alliance of Universities for Advanced Studies, Association of American Universities, Association of Independent Colleges and Universities, American Council on Education.
[0105] Topic Block 2: Harvard University, Massachusetts, John Harvard, Harvard College.
[0106] Topic Block 3: Harvard University, Drew Faust, Claudio Guy, Harvard Movement.
[0107] Topic Block 4: Harvard University, Cambridge Campus, Alston Campus, Boston Campus, Harvard College, Radcliffe Institute for Advanced Study.
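By way of illustration only, the exact-match constraint of S3-1-1 can also be checked after the large model responds. The following is a minimal sketch, assuming the model's candidates arrive as a list of strings. The function name `keep_verbatim_candidates`, the pronoun list, and the numeric pattern are hypothetical and are not part of the described method, which expresses these constraints in the prompt.

```python
import re

# Hypothetical weak-noun rules used only for this sketch.
PRONOUNS = {"it", "they", "he", "she", "this", "that", "its", "their"}
NUMERIC = re.compile(r"^[\d\s.,:/%-]+$")  # pure numbers and numeric dates


def keep_verbatim_candidates(text: str, candidates: list[str]) -> list[str]:
    """Keep candidates that appear verbatim in the text and are not weak nouns."""
    kept: list[str] = []
    for name in candidates:
        name = name.strip()
        if not name or name.lower() in PRONOUNS or NUMERIC.match(name):
            continue  # weak noun: pronoun, number, or numeric date
        if name not in text:
            continue  # not a verbatim span of the input text
        if name not in kept:
            kept.append(name)
    return kept
```

A candidate that the model paraphrased or invented fails the substring test and is dropped before it reaches the merge step.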
[0108] S3-1-2: Small Model Entity Noun Extraction
[0109] Extract all entity names from the existing knowledge base, construct a domain dictionary, and train an HMM-based domain name segmenter based on the dictionary. Use the domain name segmenter to extract candidate entity names from each merged topic text as a supplement to the extraction results of the large model.
[0110] The noun extraction results of the small model are as follows:
[0111] Topic Block 1: Harvard University, Harvard, United States, Massachusetts, Ivy League.
[0112] Topic Block 2: Harvard University, Massachusetts, John Harvard, Harvard.
[0113] Topic Block 3: Harvard University, Drew Faust, Claudio Guy, Harvard.
[0114] Topic Block 4: Harvard University, Cambridge, Alston, Boston, Harvard.
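A trained HMM segmenter does not fit in a short example, so the sketch below substitutes plain forward maximum matching over the domain dictionary. It only illustrates how entity names from the knowledge base can be located in a topic text; the function name and the sample dictionary are assumptions, and the embodiment itself uses an HMM-based segmenter.

```python
def match_dictionary_entities(text: str, dictionary: set[str]) -> list[str]:
    """Scan left to right, taking the longest dictionary entry at each position."""
    max_len = max((len(entry) for entry in dictionary), default=0)
    found: list[str] = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest window first so "Harvard University" wins over "Harvard".
        for length in range(min(max_len, len(text) - i), 0, -1):
            window = text[i:i + length]
            if window in dictionary:
                match = window
                break
        if match:
            if match not in found:
                found.append(match)
            i += len(match)  # continue after the matched entry
        else:
            i += 1
    return found
```

Unlike an HMM segmenter, this stand-in cannot recognize names that are absent from the dictionary.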
[0115] S3-2: Merging Entity Nouns
[0116] The candidate entity names extracted by the large model are merged with those extracted by the domain name segmenter, and simple deduplication is performed for entities with the same name.
[0117] After merging and deduplication, the entity noun results are as follows:
[0118] Topic Block 1: Harvard University, United States, Massachusetts, Ivy League Schools, Global University Presidents Forum, Global Alliance of Universities for Advanced Studies, Association of American Universities, Association of Independent Colleges and Universities, American Council on Education.
[0119] Topic Block 2: Harvard University, Massachusetts, John Harvard, Harvard College.
[0120] Topic Block 3: Harvard University, Drew Faust, Claudio Guy, Harvard Movement.
[0121] Topic Block 4: Harvard University, Cambridge Campus, Alston Campus, Boston Campus, Harvard College, Radcliffe Institute for Advanced Study.
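The merge in S3-2 can be sketched as an order-preserving union of the two candidate lists. This is a minimal illustration with a hypothetical function name, and it covers exact-name deduplication only. The results listed above also omit short forms such as "Harvard", and this sketch does not attempt to reproduce that.

```python
def merge_candidates(large_model_names: list[str], segmenter_names: list[str]) -> list[str]:
    """Union of both candidate lists with exact-name duplicates removed."""
    # dict.fromkeys keeps the first occurrence of each name and preserves order.
    merged = dict.fromkeys(name.strip() for name in large_model_names + segmenter_names)
    merged.pop("", None)  # drop empty strings left over from stripping
    return list(merged)
```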
[0122] S3-3: Large Model Entity Type Determination
[0123] For each candidate entity name, a large model is used to determine its entity type and confidence level. The large model is constrained to output entity types within a specified list of types and to output the confidence level of the determination. When using the large model to determine entity types, the range of possible target entity types is specified through parameters. This can be done by explicitly limiting the list of acceptable entity types for a specific domain, or by not specifying entity types, allowing the large model to perform open-domain extraction.
[0124] With open-domain entity type extraction configured, the large model's classification results for the above entity nouns are as follows. The three columns, from left to right, are noun, type, and confidence level:
[0125] Topic Block 1:
[0126] Harvard University, institution, high
[0127] United States, country, high
[0128] Massachusetts, location, high
[0129] Cambridge, location, high
[0130] Global University Presidents Forum, organization, high
[0131] Global Alliance of Universities for Advanced Studies, organization, high
[0132] Association of American Universities, organization, high
[0133] Association of Independent Colleges and Universities, organization, high
[0134] American Council on Education, organization, high
[0135] Topic Block 2:
[0136] Harvard University, institution, high
[0137] Massachusetts, location, high
[0138] John Harvard, person, high
[0139] Harvard College, institution, high
[0140] Topic Block 3:
[0141] Harvard University, institution, high
[0142] Drew Faust, person, high
[0143] Claudio Guy, person, high
[0144] Harvard Movement, event, high
[0145] Topic Block 4:
[0146] Harvard University, institution, high
[0147] Cambridge Campus, institution, medium
[0148] Alston Campus, institution, medium
[0149] Boston Campus, institution, medium
[0150] Harvard College, institution, high
[0151] Radcliffe Institute for Advanced Study, institution, high
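The two configurations described in S3-3, a restricted type list and open-domain extraction, can be illustrated with a small validation step applied to the type label returned by the large model. The function name and the lower-casing of labels are assumptions made for this sketch.

```python
from typing import Optional


def check_entity_type(predicted: str, allowed_types: Optional[list[str]] = None) -> Optional[str]:
    """Return the cleaned type label, or None when it is outside the allowed list.

    allowed_types=None models open-domain extraction: any non-empty label passes.
    """
    label = predicted.strip().lower()
    if not label:
        return None
    if allowed_types is None:
        return label
    allowed = {t.strip().lower() for t in allowed_types}
    return label if label in allowed else None
```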
[0152] S3-4: Valid Entity Confidence Filtering
[0153] The confidence level that the large model outputs for its type judgment is qualitatively set to three levels: high, medium, and low. In application, only entities and their types with a confidence level at or above the threshold are retained as valid entities.
[0154] The default confidence threshold is "medium", and all the above entities meet the retention criteria.
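The qualitative levels can be compared by mapping them to ranks. The sketch below is illustrative only: the dictionary keys `name`, `type`, and `confidence` are hypothetical field names, and the threshold is treated as inclusive, an assumption consistent with the "medium" entities above being retained under the default threshold.

```python
CONFIDENCE_RANK = {"low": 0, "medium": 1, "high": 2}


def filter_by_confidence(entities: list[dict], threshold: str = "medium") -> list[dict]:
    """Keep entities whose confidence is at or above the threshold level."""
    floor = CONFIDENCE_RANK[threshold]
    # Unknown confidence labels rank below "low" and are therefore dropped.
    return [
        e for e in entities
        if CONFIDENCE_RANK.get(e["confidence"].strip().lower(), -1) >= floor
    ]
```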
[0155] S3-5: Generation of Large Model Entity Descriptions
[0156] For each valid entity, a representative description is generated using a large model. The description output by the large model is constrained to be no more than 3 sentences long and to cover most of the information in the input text. The final entity information includes: entity name, entity type, confidence level, and entity description information.
[0157] Because there are many entities, listing the detailed description of each one would be too lengthy. Three entities are listed below as examples:
[0158] Entity 1
[0159] Entity Name: Harvard University
[0160] Entity type: Institution
[0161] Confidence level: High
[0162] Entity Description: This is a private research university located in Cambridge, Massachusetts, USA, commonly known as "Harvard." As a member of the Ivy League, it enjoys an extremely high reputation and influence in the global academic community.
[0163] Entity 2
[0164] Entity Name: Massachusetts
[0165] Entity type: Location
[0166] Confidence level: High
[0167] Entity Description: In the 17th century, the state's colonial legislature exercised its power to formally approve the establishment of this college, which later became Harvard University. This historical act laid the foundation for the region as the birthplace of American education.
[0168] Entity 3
[0169] Entity Name: Radcliffe Institute for Advanced Study
[0170] Entity type: Institution
[0171] Confidence level: High
[0172] Entity Description: This is a top-tier research institution affiliated with Harvard University, dedicated to promoting interdisciplinary academic exchange and cutting-edge research. It inherits the legacy of the former Radcliffe College and now provides research support to scholars across the university and globally.
[0173] S4, Relationship Extraction Module
[0174] This module is used to extract potentially valid relationships from the entity list. First, candidate entity pairs are calculated. Then, supporting text for each candidate entity pair is retrieved from the text vector database. Next, a large model is used to assess the usability of each supporting text, filtering out candidate entity pairs without supporting text. Based on the supporting text, the large model is used to determine the relationship between each candidate entity pair. The large model is then used to perform consistency checks on the extracted relationships. Finally, an extensible common-problem correction component is applied to obtain the required relationship information: head entity name, relationship type, tail entity name, relationship description, and a supporting text excerpt.
[0175] As shown in Figure 5, the processing flow of the relation extraction module in this embodiment includes:
[0176] S4-1: Calculation of Candidate Entity Combination
[0177] Based on the list of valid entities obtained from entity extraction, all pairwise combinations of two entities are calculated.
[0178] Due to the large number of entities, listing every possible combination of entities would be too lengthy. The following explanation uses "Harvard University" and "Massachusetts" as examples.
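The pair enumeration in S4-1 can be sketched with the standard library. The sketch assumes unordered pairs, on the grounds that relation direction is checked later in S4-7; the function name is hypothetical.

```python
from itertools import combinations


def candidate_pairs(entity_names: list[str]) -> list[tuple[str, str]]:
    """Every unordered pair of distinct entity names."""
    unique = list(dict.fromkeys(entity_names))  # drop repeats, keep order
    return list(combinations(unique, 2))
```

For n valid entities this yields n(n-1)/2 pairs, so the number of retrieval queries in the next step grows quadratically with the entity count.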
[0179] S4-2: Word Vector Supporting Document Retrieval
[0180] For each entity pair, a search query in the form of "What is the relationship between entity A and entity B?" is constructed. The BGE-m3 model is used to calculate its vector representation. Using the vector of the search query as a parameter, the 10 most similar texts that can support a certain relationship between the entity pairs are retrieved from the text vector library generated by the document preprocessing module. The vectorized retrieval of entity pairs uses batch processing and concurrency techniques to improve computational efficiency.
[0181] There are two texts that can support the existence of a certain connection between the entities “Harvard University” and “Massachusetts”: topic block 1 and topic block 2.
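The retrieval step can be illustrated with a query template and a top-k cosine search. In the sketch below the vectors are small hand-written lists standing in for BGE-m3 embeddings, and the text vector library is modeled as a plain dictionary; both are assumptions made for illustration, as are the function names.

```python
import math


def build_query(entity_a: str, entity_b: str) -> str:
    """Search query for one candidate entity pair, following the template in S4-2."""
    return f"What is the relationship between {entity_a} and {entity_b}?"


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], library: dict[str, list[float]], k: int = 10) -> list[tuple[str, float]]:
    """Return the k library entries most similar to the query vector."""
    scored = [(block_id, cosine(query_vec, vec)) for block_id, vec in library.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]
```

In practice the query vector and the library vectors would come from the same embedding model so that their similarities are comparable.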
[0182] S4-3: Large Model Document Validity Filtering
[0183] For each supporting text, a large model is used to determine whether the text can effectively answer the question of whether there is a certain relationship between candidate entity pairs. The large model is constrained to only answer YES or NO, and to provide clear judgment criteria and explanations for the judgment. The large model is also constrained to provide a judgment confidence level. The confidence level is qualitatively set to three levels: high, medium, and low. In application, only documents in which the large model judges the validity as YES and has a high confidence level are retained.
[0184] The association validity of topic block 1 and topic block 2 is YES, and the confidence level is "high", so both texts meet the retention condition.
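The retention rule in S4-3 can be sketched as a filter over the large model's structured judgments. The field names `text_id`, `answer`, and `confidence` are hypothetical; the description specifies only that the model answers YES or NO with a basis and a confidence level.

```python
def keep_supporting_texts(judgments: list[dict]) -> list[str]:
    """Keep texts judged YES with high confidence; all others are treated as noise."""
    kept: list[str] = []
    for judgment in judgments:
        answer = judgment.get("answer", "").strip().upper()
        confidence = judgment.get("confidence", "").strip().lower()
        if answer == "YES" and confidence == "high":
            kept.append(judgment["text_id"])
    return kept
```

A candidate entity pair whose list comes back empty has no usable supporting text and would be dropped before relation type determination.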
[0185] S4-4: Large Model Relationship Type Determination
[0186] Each candidate entity pair and its supporting document are combined into a prompt according to a template. The large model is then used to determine the relationship between the candidate entity pair. The large model is constrained to select the relationship type only from the target relationship type list, and to provide an explicit explanation of its judgment. The output fields of the large model include: relationship type and relationship description. When using the large model to determine the relationship type, the selectable range of target relationship types can be specified through parameters. This can be done by explicitly limiting the list of acceptable relationship types for a specific domain, or by not specifying the relationship type, allowing the large model to perform open-domain extraction.
[0187] Using open-domain relation extraction, without limiting the type, the judgment result for the large model is:
[0188] Harvard University - located in -> Massachusetts
[0189] S4-5: Large Model Relationship Hallucination Detection
[0190] For each entity pair and its relation type, in the form (head entity, relation type, tail entity), the large model is used to determine whether the relation strictly originates from the supporting document. That is, the large model reviews whether the relation is a hallucination. The large model is constrained to answer only YES or NO, and only relation information for which the large model answers YES is retained.
[0191] The large model confirms that the relationship "Harvard University - located in -> Massachusetts" has no anomalies and meets the retention criteria.
[0192] S4-6: Large Model Relationship Strength Detection
[0193] For each relation, the large model is used to determine the relation strength, that is, the large model is used to determine the value and validity of the relation. The large model is constrained to only answer A, B, and C, which represent the relation strength as strong association, medium association, and weak association, respectively. Only the relation information of the large model answering A is retained.
[0194] The large model confirms that the relationship "Harvard University - located in -> Massachusetts" has a strength of A, which is a strong association and meets the retention criteria.
[0195] S4-7: Common Error Detection in Large Model Relationships
[0196] For each relation, use the large model to determine whether the relation direction is correct and whether the relation direction conforms to common sense. Constrain the large model to only answer YES and NO, which represent the relation direction being correct and the relation direction being opposite, respectively. For relation information that is judged to have the relation direction being opposite, reverse the positions of its head entity and tail entity.
[0197] The large model confirms that the relationship "Harvard University - located in -> Massachusetts" is in the correct direction and there is no need to reverse the direction.
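Steps S4-5 to S4-7 form a chain of checks that either drop a relation, pass it through, or reverse it. The sketch below shows that control flow only. The three judge callables stand in for large-model calls and are hypothetical, as is the function name.

```python
from typing import Callable, Optional

Triple = tuple[str, str, str]  # (head entity, relation type, tail entity)


def verify_relation(
    triple: Triple,
    grounded: Callable[[Triple], str],      # S4-5 judge: "YES" or "NO"
    strength: Callable[[Triple], str],      # S4-6 judge: "A", "B", or "C"
    direction_ok: Callable[[Triple], str],  # S4-7 judge: "YES" or "NO"
) -> Optional[Triple]:
    """Return the verified (possibly reversed) triple, or None if it is rejected."""
    if grounded(triple) != "YES":
        return None  # not strictly supported by the document
    if strength(triple) != "A":
        return None  # only strong associations are retained
    if direction_ok(triple) == "NO":
        head, relation, tail = triple
        return (tail, relation, head)  # swap head and tail entities
    return triple
```

Placing the two rejecting checks first means the direction judge is only consulted for relations that will actually be kept.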
[0198] S4-8: Large Model Relationship Summary Generation
[0199] For each relation that passes the above multi-step detection, a concise summary is generated using the large model to semantically describe the triple. The large model's induction is constrained to strictly follow the supporting text, and the large model is also constrained to output supporting statements excerpted from the original text while outputting the summary. From the results of all the above relation extraction steps, the following fields are selected as the constituent fields of the final relation: head entity name, relation type, tail entity name, relation summary, and relation supporting text excerpt.
[0200] After generating a summary of the relations, the final relation result is as follows:
[0201] Head Entity Name: Harvard University
[0202] Relationship Type: located in
[0203] Tail Entity Name: Massachusetts
[0204] Relationship Summary: Harvard University, located in Cambridge, Massachusetts, is the state's most prestigious private research university.
[0205] Supporting text excerpt: Harvard University, commonly known as "Harvard," is located in Cambridge, Massachusetts, USA.
[0206] Other relationships are similar to this example.
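The five fields of the final relation can be held in a small record, and the requirement that supporting statements be excerpted from the original text can be checked mechanically. This is a minimal sketch: the class name, field names, and the whitespace normalization are assumptions, not part of the described method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    head: str
    relation_type: str
    tail: str
    summary: str
    supporting_excerpt: str


def excerpt_is_verbatim(relation: Relation, supporting_text: str) -> bool:
    """True when the excerpt occurs in the supporting text, ignoring whitespace runs."""
    def normalize(s: str) -> str:
        return " ".join(s.split())

    return normalize(relation.supporting_excerpt) in normalize(supporting_text)
```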
[0207] S5, Entity Fusion Module
[0208] This module is used to fuse the extracted entities and achieve entity disambiguation. A relational star diagram is constructed for each extracted entity. The first round of similar-entity screening is performed using the BGE-m3 model and the BGE-Reranker model. The second round of same-name entity judgment is performed using the large model. Based on the output results, the same-name entities are fused.
[0209] As shown in Figure 6, the processing flow of the entity fusion module in this embodiment includes:
[0210] S5-1: Entity star diagram and feature information construction. Taking the entity information list and relationship information list obtained after the above entity relationship extraction as input, a list of related relationships is constructed for each entity. Taking the entity name as a parameter, the relationship that matches the head entity or tail entity is found from all the relationship information to obtain a star diagram centered on a single entity. From the list of related relationships of each entity, the summary information of each relationship is extracted and the relationship triples are represented by semantic descriptions to form the related relationship description list of the entity, which serves as the similarity support material for the entity.
[0211] Taking the entity "Harvard University" as an example, its star diagram is described in words as follows:
[0212] Harvard University - located in -> Massachusetts
[0213] Harvard University - participation -> Global University Presidents Forum
[0214] Harvard University - participation -> Global Alliance of Universities for Advanced Studies
[0215] Harvard University - participation -> Association of American Universities
[0216] Harvard University - predecessor -> Harvard College
[0217] Harvard University - president -> Drew Faust
[0218] Harvard University - president -> Claudio Guy
[0219] Harvard University - including -> Cambridge Campus
[0220] Harvard University - including -> Radcliffe Institute for Advanced Study
[0221] In fact, the above associations should also include entity descriptions and relationship descriptions. For the sake of clarity, these lengthy textual details are omitted here.
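The star diagram construction in S5-1 amounts to grouping relations by the entities they touch. The sketch below is illustrative only: it represents each relation as a bare triple and omits the entity and relation descriptions, and the function name is hypothetical.

```python
from collections import defaultdict


def build_star_diagrams(relations: list[tuple[str, str, str]]) -> dict[str, list[str]]:
    """Map each entity name to descriptions of the relations in which it is head or tail."""
    stars: dict[str, list[str]] = defaultdict(list)
    for head, relation_type, tail in relations:
        description = f"{head} - {relation_type} -> {tail}"
        stars[head].append(description)
        if tail != head:  # avoid listing a self-relation twice
            stars[tail].append(description)
    return dict(stars)
```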
[0222] S5-2: Coarse-grained word vector similarity matching
[0223] For each entity information and its related relationship description list, a standardized text containing entity name, entity type, entity description, and entity relationship description list is constructed. This standardized text is then fed into the BGE-m3 model to calculate the entity's semantic vector. The process of calculating the entity's semantic vector through the BGE-m3 model is batch and concurrent. For each entity, the top 10 entities most similar to its semantic vector are identified, and then candidate entity pairs are constructed between this entity and each of the similar entities.
[0224] The current text does not have any entity pairs that need to be merged. Suppose that another text, after extraction, yields another "Harvard University" entity as follows:
[0225] Harvard University - located in -> Massachusetts
[0226] Harvard University - including -> Radcliffe Institute for Advanced Study
[0227] Therefore, “Harvard University” and “Harvard University” extracted from these two different texts will constitute a candidate entity pair to be merged because they have two identical relations and have a certain degree of similarity.
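The coarse matching in S5-2 can be sketched in two parts: building the standardized text for an entity, and pairing each entity with its nearest neighbours by vector similarity. The text layout, the entity identifiers, and the hand-written vectors (standing in for BGE-m3 output) are all assumptions made for illustration.

```python
import math


def standardized_text(name: str, entity_type: str, description: str, relation_descriptions: list[str]) -> str:
    """One text per entity combining name, type, description, and relation descriptions."""
    lines = [
        f"Entity name: {name}",
        f"Entity type: {entity_type}",
        f"Entity description: {description}",
        "Relationships:",
    ]
    lines += [f"- {r}" for r in relation_descriptions]
    return "\n".join(lines)


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def coarse_candidate_pairs(vectors: dict[str, list[float]], k: int = 10) -> set[tuple[str, ...]]:
    """Pair every entity with its k most similar entities, as unordered pairs."""
    pairs: set[tuple[str, ...]] = set()
    for entity_id, vec in vectors.items():
        neighbours = sorted(
            (other for other in vectors if other != entity_id),
            key=lambda other: cosine(vec, vectors[other]),
            reverse=True,
        )[:k]
        for other in neighbours:
            pairs.add(tuple(sorted((entity_id, other))))  # (a, b) and (b, a) collapse
    return pairs
```

This brute-force neighbour search compares every entity with every other; a vector index would typically replace it for large entity lists.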
[0228] S5-3: Fine-grained word vector similarity matching
[0229] For the standardized text pairs of candidate entity pairs, the BGE-Reranker model is used to further calculate fine-grained similarity, and candidate entity pairs whose similarity meets the threshold (0.85) are retained. The similarity calculation through the BGE-Reranker model is performed in batches and concurrently.
[0230] The vectorized similarity detection results with semantic information will indicate that the two Harvard Universities meet the similarity threshold and need to be merged.
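The fine-grained filter in S5-3 reduces to a threshold test on the reranker scores. In this sketch the scores are supplied directly and are assumed to be normalized to the range 0 to 1; the function name is hypothetical, and "meeting the threshold" is read as greater than or equal to 0.85.

```python
def keep_similar_pairs(scores: dict[tuple[str, str], float], threshold: float = 0.85) -> list[tuple[str, str]]:
    """Retain candidate pairs whose fine-grained similarity meets the threshold."""
    return [pair for pair, score in scores.items() if score >= threshold]
```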
[0231] S5-4: Large Model Same-Name Detection
[0232] For each candidate entity pair with a similarity higher than the threshold, the large model is used to determine whether the two entities have the same name. The large model is constrained to output only YES and NO, and to provide an explicit explanation of its answer. Only candidate entity pairs that the large model judges as YES are retained.
[0233] Based on the semantic description information, the large model determines that the two entities refer to the same entity and need to be merged.
[0234] S5-5: Post-processing for entity fusion
[0235] For entities with the same name that have passed the above detection steps, a fusion process is performed. First, all entities with the same name are grouped into one class. In the class, one entity is selected as the standardized entity, and the name of this entity is used as the standard name of the class. The names of the remaining entities are used as aliases of the class. Then, all relationships are queried based on the original entity names, and the entity names corresponding to these relationships are changed to the standard entity names of the class.
[0236] The two "Harvard University" entities extracted from the two texts are merged. One of them is selected at random as the standardized entity, its name "Harvard University" becomes the standard name of the class, and the relationships of the other entity are updated to use that standard name.
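The grouping and renaming in S5-5 can be sketched with a union-find structure over the confirmed same-name pairs. Two simplifications are assumed here: entities are identified by name alone, and the first-seen name of each class is used as the standard name instead of a random choice. The alias names in the usage below, such as "Harvard Univ.", are invented for illustration.

```python
def fuse_entities(
    same_pairs: list[tuple[str, str]],
    relations: list[tuple[str, str, str]],
) -> tuple[dict[str, list[str]], list[tuple[str, str, str]]]:
    """Group same-name entities and rewrite relations to use each class's standard name."""
    parent: dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in same_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # the earlier class keeps its standard name

    # Standard name -> aliases (the remaining names in the class).
    aliases: dict[str, list[str]] = {}
    for name in list(parent):
        root = find(name)
        aliases.setdefault(root, [])
        if name != root:
            aliases[root].append(name)

    def canonical(name: str) -> str:
        return find(name) if name in parent else name

    renamed = [(canonical(h), r, canonical(t)) for h, r, t in relations]
    return aliases, renamed
```

Chained pairs such as (A, B) and (B, C) end up in one class, which is why a grouping structure is used in place of pairwise renaming.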
[0237] Furthermore, the word vector model used in this invention is a BGE series model; in fact, any model capable of achieving the goal of word vector calculation can play the same role. This invention uses YES and NO indicators to perform right / wrong judgments; in fact, any binary classification standard capable of making such judgments can play the same role.
[0238] The present invention also proposes an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the domain knowledge graph extraction method based on the collaboration of large and small models.
[0239] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the domain knowledge graph extraction method based on the collaboration of large and small models.
[0240] In summary, the knowledge graph automatically constructed by this invention has advantages such as high accuracy, high coverage, and high availability.
[0241] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0242] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these modifications and improvements all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A domain knowledge graph extraction method based on the collaboration of large and small models, characterized in that it includes the following steps:
S1. Main Entity Extraction: After truncating the input text, the central entity is extracted from it using a large model. The large model outputs a structured result containing the entity name, confidence level, and reason for extraction, and the result is filtered according to the confidence level.
S2. Document Preprocessing and Vectorization: The original text is semantically segmented into multiple topic blocks. For each topic block, a large model is used to generate a summary of its association with the central entity, forming an enhanced semantic block. The BGE-m3 model is used to calculate the text vector of each enhanced semantic block and write it into the vector database.
S3. Document Entity Extraction: For each enhanced semantic block, candidate entity names are extracted in parallel using a large model and a domain name segmenter built from the entity list of an existing knowledge base. After merging and deduplication, the large model is used to determine the entity type and confidence level of each candidate entity name. Entities with confidence levels higher than the set threshold are retained as valid entities. The large model is then used to generate a context-bearing description for each valid entity.
S4. Relation Extraction and Validation: All candidate entity pairs are calculated from the list of valid entities. For each candidate entity pair, a retrieval statement is constructed and supporting text is retrieved from the vector database. A large model is used to assess the usability of the supporting text to filter out noise. For the filtered entity pairs and supporting text, the large model is used to determine the relationship type. The large model is then used to perform hallucination detection, relationship strength detection, and common relationship direction error detection on the extracted relationships in sequence. For all relations that pass the tests, a large model is used to generate relation summaries and supporting text excerpts to obtain the final relation triplet information.
S5. Entity Fusion: Taking the list of valid entities and the list of relationship information as input, a star diagram of related relationships is constructed for each entity, and the summary information of the relationships is extracted to form the similarity support material for the entity. For each entity, a standardized text is constructed. The semantic vector is calculated using the BGE-m3 model and coarse-grained similarity retrieval is performed. The BGE-Reranker model is used to calculate the fine-grained similarity of candidate entity pairs. For candidate entity pairs with similarity higher than the threshold, the large model is used to determine entities with the same name. Entities determined to have the same name are fused: they are clustered into one class, a standard entity name is selected, and all association relationships are updated.
2. The domain knowledge graph extraction method based on the collaboration of large and small models according to claim 1, characterized in that, Step S1 specifically includes: S1-1 Input text truncation processing: Extract the first 4000 characters of the article as the actual input text for the main entity extraction; S1-2, Main Entity Extraction from Large Model: Constrain the characteristics and extraction examples of the main entities in the prompt words, and constrain the structured entity names, confidence levels and extraction reasons output by the large model. The confidence level is qualitatively set to three levels: high, medium and low. S1-3, Confidence Filtering: Based on the confidence of the main entity output by the large model, if the confidence is lower than the set threshold, the main entity is set to an empty string in subsequent processing; otherwise, the main entity name that meets the confidence threshold is used.
3. The domain knowledge graph extraction method based on the collaboration of large and small models according to claim 1, characterized in that, Step S2 specifically includes: S2-1, Coarse-grained text whole sentence fixed-length chunking: The original input text is divided into coarse-grained text chunks with complete sentences and a maximum length of 800 characters using the technique of recursive chunking; S2-2, Fine-grained semantic topic segmentation: The large model extracts each topic segment from each coarse-grained text segment, constraining the output of the large model to strictly match the input text, and outputting structured topic information containing topic, excerpt from the original text, and explanatory information; S2-3, Semantic Block Summarization and Association Filling: After concatenating the central entity with each topic text according to the template, the large model is used to generate a summary. The large model is constrained to explicitly indicate the relationship between the central entity and each topic text. The pronouns in the text are replaced with the names that best represent the entity. A fused summary containing topic summary, summary thought chain, and confidence level is output. The confidence level is qualitatively set to three levels: high, medium, and low. S2-4. Text Template Concatenation: Combine the topic, original text excerpt, explanatory information, topic summary, and summary thought chain obtained for each topic text into a single text according to the template. S2-5. Text Vector Library Construction: Use the BGE-m3 model to calculate the text vectors of the merged topic texts and write them into the vector database.
4. The domain knowledge graph extraction method based on the collaboration of large and small models according to claim 1, characterized in that, In step S3, the domain name segmenter is an HMM-based domain name segmenter; when using a large model to determine the entity type, the selectable range of the target entity type is specified by parameters. It can explicitly limit the list of acceptable entity types for a certain domain, or it can leave the entity type unspecified so that the large model can perform open domain extraction; the confidence level is qualitatively set to three levels: high, medium, and low.
5. The domain knowledge graph extraction method based on the collaboration of large and small models according to claim 1, characterized in that step S4 specifically includes:
S4-1, Calculation of Candidate Entity Combinations: Based on the list of valid entities obtained from entity extraction, calculate all pairwise combinations of two entities.
S4-2, Word Vector Supporting Document Retrieval: For each entity pair, construct the retrieval statement "What is the relationship between entity A and entity B?", use the BGE-m3 model to calculate its vector representation, and use the vector of the retrieval statement as a parameter to retrieve the most similar texts from the vector database.
S4-3, Large Model Document Validity Filtering: For each supporting text, use the large model to determine whether the text can effectively answer the question of whether there is a certain relationship between the candidate entity pair. Constrain the large model to answer only YES or NO, and to provide the basis for the judgment and the confidence level. The confidence level is qualitatively set to three levels: high, medium, and low. Only documents for which the large model judges the validity as YES with a high confidence level are retained.
S4-4, Large Model Relationship Type Determination: Combine each candidate entity pair and its supporting document into a prompt according to the template, use the large model to determine the relationship type, constrain the large model to select the relationship type only from the target relationship type list, and output the relationship type and relationship description.
S4-5, Large Model Relationship Hallucination Detection: For each entity pair and its relationship type, use the large model to determine whether the relationship strictly originates from the supporting document, constrain the large model to answer only YES or NO, and retain only the relationship information for which the large model answers YES.
S4-6, Large Model Relationship Strength Detection: For each relation, use the large model to determine the relation strength, constrain the large model to answer only A, B, or C, which represent strong association, medium association, and weak association, respectively, and retain only the relation information for which the large model answers A.
S4-7, Large Model Relationship Common Error Detection: For each relationship, use the large model to determine whether the relationship direction is correct, constrain the large model to answer only YES or NO, and for relationship information judged to have the opposite direction, swap the positions of its head entity and tail entity.
S4-8, Large Model Relationship Summary Generation: For each relation that has passed the above multi-step detection, the large model is used to produce a concise summary and supporting statements excerpted from the original text. The final relation information includes the head entity name, relation type, tail entity name, relation summary, and relation supporting text excerpt.
6. The domain knowledge graph extraction method based on the collaboration of large and small models according to claim 1, characterized in that, In step S4-2, the vectorized retrieval of entity pairs uses batch processing and concurrency techniques; in steps S4-4 and S4-8, when using the large model to determine the relation type and generate the summary, the selectable range of the target relation type is specified by parameters. The acceptable relation type list can be explicitly limited for a certain domain, or the relation type can be left unspecified so that the large model can perform open domain extraction.
7. The domain knowledge graph extraction method based on the collaboration of large and small models according to claim 1, characterized in that step S5 specifically includes:
S5-1, Entity Star Diagram and Feature Information Construction: Using the entity name as a parameter, find the relationships that match the head or tail entity from all relationship information to obtain a star diagram centered on a single entity; extract the summary information of each relationship from the list of related relationships for each entity to form a list of related relationship descriptions for that entity, which serves as the similarity support material for that entity.
S5-2, Coarse-grained Word Vector Similarity Matching: For each entity information and its related relationship description list, construct a standardized text containing entity name, entity type, entity description, and entity relationship description list. Feed the standardized text of the entity into the BGE-m3 model to calculate the semantic vector of the entity. For each entity, find the top 10 entities that are most similar to its semantic vector and construct candidate entity pairs. This calculation process is performed in batches and concurrently.
S5-3, Fine-grained Word Vector Similarity Matching: For the standardized text pairs of candidate entity pairs, the BGE-Reranker model is used to calculate fine-grained similarity, and candidate entity pairs with similarity meeting the 0.85 threshold are retained. This calculation process is performed in batches and concurrently.
S5-4, Large Model Same-Name Detection: For each candidate entity pair with a fine-grained similarity higher than the threshold, use the large model to determine whether the two entities are entities with the same name. Constrain the large model to output only YES or NO, and to provide an explicit explanation for its answer. Only candidate entity pairs that the large model judges to be YES are retained.
S5-5, Entity Fusion Post-processing: Perform fusion processing on the same-name entity pairs that have passed the above detection steps, group all entities with the same name into one class, select one entity in the class as the standardized entity, and use the name of this entity as the standard name of the class. Use the names of the remaining entities as aliases of the class. Then, query all relationships based on the original entity names and change the entity names corresponding to these relationships to the standard entity name of the class.
8. A domain knowledge graph extraction system based on the collaboration of large and small models, characterized in that, The system is used to implement the domain knowledge graph extraction method based on the collaboration of large and small models as described in any one of claims 1-7. The system includes: a main entity extraction module, a document preprocessing module, a document entity extraction module, a relation extraction module, and an entity fusion module, which respectively execute steps S1 to S5.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the program, it implements the domain knowledge graph extraction method based on the collaboration of large and small models as described in any one of claims 1 to 8.
10. A computer-readable storage medium having a computer program stored thereon, characterized in that, When the computer program is executed by the processor, it implements the domain knowledge graph extraction method based on the collaboration of large and small models as described in any one of claims 1 to 8.