A Question Answering Method and System for Professional Industry Information Based on Knowledge Graphs Using Large Language Models

By correcting entity boundaries using a large language model and particle swarm optimization algorithm, and combining this with an external knowledge base for node type classification, the problems of unstable entity boundaries and biased node type labels in knowledge graphs are solved, thereby improving the accuracy of answers and retrieval capabilities of the question-answering system.

CN122309667A · Pending · Publication Date: 2026-06-30 · BEIJING SHUJIE TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING SHUJIE TECHNOLOGY CO LTD
Filing Date
2026-03-30
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

Existing intelligent question answering systems suffer from inaccuracies in entity boundary recognition and node type classification, leading to unstable node boundaries and biased node type labels in the knowledge graph, which affects the accuracy of answers and retrieval coverage of the question answering system.

Method used

A large language model identifies entity boundaries, and oscillation detection flags entities whose boundaries fluctuate across context windows. A particle swarm optimization algorithm corrects the flagged boundaries, and node types are classified with the help of an external industry knowledge base to construct an enhanced knowledge graph.

Benefits of technology

It improves the stability of entity boundaries and the accuracy of node type classification, enhances the semantic association of knowledge graphs, and improves the answer accuracy and retrieval coverage of question-answering systems.


Abstract

This invention discloses a question-answering method and system for professional industry information based on a knowledge graph using a large language model. It collects unstructured text data from professional industries, cleans and unifies it to obtain standardized text corpus; employs a large language model to identify entity boundaries, identifies entities with large boundary fluctuations through oscillation detection, and uses a particle swarm optimization algorithm to correct entity boundaries, ensuring consistency across different context windows and convergence with semantic matching in an industry terminology dictionary, outputting stable extraction results and solving the problem of unstable boundary identification; constructs an initial knowledge graph based on the stable extraction results; automatically classifies nodes in the initial knowledge graph by type, corrects node types using an external industry knowledge base, and establishes enhanced association edges between nodes of the same type, forming an enhanced knowledge graph whose structure closely resembles the industry knowledge system; receives natural language questions, performs semantic parsing and answer retrieval based on the enhanced knowledge graph, generates and returns industry information answers.

Description

Technical Field

[0001] This invention relates to the field of large language models, and in particular to a question-answering method and system for professional industry information based on a knowledge graph of a large language model.

Background Technology

[0002] Existing intelligent question-answering systems typically employ a "knowledge graph construction + question-answering retrieval" technical architecture. This architecture first extracts entities and relationships from unstructured industry text to construct a structured industry knowledge graph. Based on this, it receives natural language questions input by the user, performs semantic parsing to retrieve and reason within the knowledge graph, and ultimately returns the answer. The core premise of this technical approach is that the accuracy of entity boundaries, the rationality of node type classification, and the completeness of the association network in the knowledge graph directly determine the answer accuracy and retrieval coverage of the question-answering system.

[0003] However, existing technologies still have many shortcomings in the knowledge graph construction process, making it difficult for the constructed graphs to support high-quality industry information question answering.

[0004] First, in the entity extraction stage, traditional methods typically rely on large language models for sequence labeling. However, large language models are prone to oscillations in boundary positions when dealing with the boundary recognition of the same entity in different context windows. For example, the name of the same organization may be identified as "Company 1234" or "Company 12" in different sentences, causing frequent fluctuations in the start and end position indices. Traditional solutions often extract entities using only a single context window or employ a simple voting mechanism to merge multiple extraction results, failing to fundamentally solve the boundary oscillation problem. Furthermore, they do not quantify the fluctuation patterns of entity boundaries in different windows, resulting in a large number of entities with unstable boundaries mixed in the extraction results, directly affecting the accuracy of node boundaries in the knowledge graph.

[0005] Secondly, regarding the issue of entity boundary oscillation, existing technologies lack effective methods for quantifying the degree of entity boundary fluctuation. Traditional methods typically use the entity boundary results output by large language models directly, or simply filter entities by setting a fixed threshold, without considering how boundary offsets vary across context windows. When an entity appears in multiple windows, the offsets of its starting and ending positions may change direction frequently or fluctuate drastically in amplitude. Traditional methods cannot distinguish small oscillations from large jumps, nor can they identify abnormal offsets caused by unusual window boundary segmentation or text noise. As a result, entities with stable boundaries may be misjudged as unstable because of individual abnormal windows, while entities with severely fluctuating boundaries may be misjudged as stable because the extreme values of abnormal windows mask the true fluctuation pattern, further exacerbating the inconsistency in entity extraction results.

[0006] After entity extraction, the construction of a knowledge graph requires node classification. Traditional node classification methods typically rely solely on the node's own textual description, using semantic vectors for clustering or classification. However, nodes in industry-specific knowledge graphs are often context-dependent, meaning the same node may belong to different types depending on its relationships. For example, the node "apple" might be considered a brand type when its neighboring node contains "mobile phone," but a product type when its neighboring node contains "fruit." Traditional methods, which use only the node's own semantic vector, cannot capture this contextual information formed by its neighboring nodes. This leads to a discrepancy between the node classification results and actual industry understanding, so the node type labels in the knowledge graph fail to accurately reflect their true semantic meaning within the industry system.

[0007] Furthermore, in the traditional knowledge graph construction process, the relationships between nodes are usually derived solely from semantic relationships directly extracted from the original text, lacking the introduction and utilization of external industry knowledge systems. When the node type classification in the initial knowledge graph is inaccurate, the relationship network between nodes also fails to reflect the standard relationship patterns in the industry knowledge system. This makes it difficult for the knowledge graph to support complex question-and-answer requirements in subsequent semantic parsing and answer retrieval due to inaccurate node type labels and sparse relationship edges.

Summary of the Invention

[0008] The purpose of this invention is to provide a question-answering method and system for professional industry information based on a knowledge graph of a large language model, which solves the above-mentioned technical problems pointed out in the prior art.

[0009] This invention provides a question-answering method for professional industry information based on a knowledge graph using a large language model, comprising the following steps:

[0010] Collect unstructured text data from professional industries, preprocess the unstructured text data to obtain standardized text corpus;

[0011] Perform oscillation detection on the entity boundaries recognized by a large language model, and extract entities and relations from the standardized text corpus using a particle swarm optimization algorithm to obtain stable extraction results;

[0012] Based on the entities and relationships in the stable extraction results, construct an initial knowledge graph containing nodes and edges;

[0013] The nodes in the initial knowledge graph are automatically classified by type, and then linked and enhanced by external industry knowledge bases to obtain an enhanced knowledge graph;

[0014] It receives natural language questions input by users, performs semantic parsing and answer retrieval based on an enhanced knowledge graph, and generates and returns corresponding industry information answers.

[0015] Preferably, an oscillation detection process is adopted by combining a large language model with entity boundary recognition. Entity and relation extraction is performed on the standardized text corpus using a particle swarm optimization algorithm to obtain stable extraction results. The process includes the following steps:

[0016] Sequence labeling is performed on standardized text corpora using a large language model to obtain the label probability distribution of each token. Candidate entities and their start and end position indices in the text are extracted based on the label probability distribution to obtain an initial entity boundary set.

[0017] For each candidate entity in the initial entity boundary set, the oscillation coefficient is obtained by collecting the context window containing the candidate entity in the standardized text corpus and analyzing the offset of the start position index and the end position index with respect to the context window.

[0018] Candidate entities whose oscillation coefficients exceed a preset oscillation threshold are marked as oscillating entities, and a particle swarm is constructed for each oscillating entity;

[0019] Based on the initial boundary adjustment amount and the final boundary adjustment amount of each particle, the initial position index and the final position index of the oscillating entity are temporarily corrected, and the adjusted entity term corresponding to the particle is extracted from the standardized text corpus.

[0020] Word vectors are extracted for the adjusted entity terms, particle optimization is performed on them, and the results are encapsulated to generate stable extraction results.
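The first step above (turning per-token label probabilities into candidate entities with start and end position indices) can be sketched as follows. This is a minimal illustration only: the BIO label scheme, the function name `decode_spans`, and the exclusive end index are assumptions, since the text does not specify the labeling scheme.

```python
from typing import Dict, List, Optional, Tuple


def decode_spans(token_probs: List[Dict[str, float]]) -> List[Tuple[int, int, str]]:
    """Turn per-token label probabilities into (start, end, type) spans.

    Assumes a BIO scheme: "B-X" opens an entity of type X, "I-X" continues it,
    and "O" is outside any entity. The end index is exclusive.
    """
    # Pick the most probable label for every token.
    labels = [max(p, key=p.get) for p in token_probs]
    spans: List[Tuple[int, int, str]] = []
    start: Optional[int] = None
    kind: Optional[str] = None
    # A trailing sentinel "O" flushes an entity that runs to the end of the text.
    for i, label in enumerate(labels + ["O"]):
        continues = label.startswith("I-") and kind == label[2:]
        if start is not None and not continues:
            spans.append((start, i, kind))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = i, label[2:]
    return spans
```

Each returned span is one element of the initial entity boundary set; stray "I-" labels that do not follow a matching "B-" label are ignored in this sketch.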

[0021] Preferably, after performing particle optimization on the adjusted entity terms extracted through word vectors, a stable extraction result is generated by encapsulation, including the following steps:

[0022] Calculate the semantic matching degree between the adjusted entity term corresponding to each particle and the preset industry term dictionary. Update the particle speed and position by comparing the individual historical best and the group historical best. Iterate until the change in semantic matching degree is less than the convergence threshold or the preset maximum number of iterations is reached. Output the global optimal position vector that maximizes the semantic matching degree.

[0023] Based on the start and end boundary adjustment amounts in the global optimal position vector, the start and end position indices of the oscillating entity are corrected to obtain the corrected entity boundary. Based on the corrected entity boundary, the corrected entity entries are extracted from the standardized text corpus.

[0024] Using a large language model, semantic relationships between corrected entity terms and entities other than corrected entity terms are extracted from standardized text corpora. The corrected entity terms and their corresponding semantic relationships are then encapsulated in a structured manner to generate stable extraction results.
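The boundary correction loop described above can be sketched as a small particle swarm over two values, the start and end boundary adjustment amounts. This is a sketch under stated assumptions, not the patented implementation: `match_degree` uses string similarity from Python's `difflib` as a stand-in for the word-vector semantic matching degree, and the inertia and acceleration weights, `max_shift`, `patience`, and `seed` are hypothetical parameters not given in the text.

```python
import random
from difflib import SequenceMatcher
from typing import List, Tuple


def match_degree(term: str, dictionary: List[str]) -> float:
    """Stand-in fitness (assumption): best string similarity to any dictionary term."""
    if not term or not dictionary:
        return 0.0
    return max(SequenceMatcher(None, term, d).ratio() for d in dictionary)


def correct_boundary(text: str, start: int, end: int, dictionary: List[str],
                     n_particles: int = 8, max_iter: int = 30, max_shift: int = 3,
                     eps: float = 1e-6, patience: int = 5, seed: int = 0) -> Tuple[int, int]:
    """Search (start shift, end shift) pairs that maximize match_degree."""
    rng = random.Random(seed)

    def fitness(p: List[float]) -> float:
        s, e = start + round(p[0]), end + round(p[1])
        if s < 0 or e > len(text) or s >= e:
            return 0.0  # invalid span
        return match_degree(text[s:e], dictionary)

    # Particle 0 sits at zero adjustment, so a valid input span is never made worse.
    pos = [[0.0, 0.0]] + [[rng.uniform(-max_shift, max_shift) for _ in range(2)]
                          for _ in range(n_particles - 1)]
    vel = [[0.0, 0.0] for _ in pos]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = max(range(len(pos)), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration weights (assumed values)
    stall = 0
    for _ in range(max_iter):
        previous = gbest_fit
        for i in range(len(pos)):
            for d in range(2):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: pull toward the individual and the group best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] = max(-max_shift, min(max_shift, pos[i][d] + vel[i][d]))
            f = fitness(pos[i])
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
            if f > gbest_fit:
                gbest, gbest_fit = pos[i][:], f
        # Stop once the best matching degree has stopped changing for a few rounds.
        stall = stall + 1 if gbest_fit - previous < eps else 0
        if stall >= patience:
            break
    return start + round(gbest[0]), end + round(gbest[1])
```

For the "Company 12" versus "Company 1234" case mentioned in the background, a call such as `correct_boundary(text, start, end, ["Company 1234"])` searches shifts of up to `max_shift` characters on each side. The `patience` counter is a softened form of the convergence test in the text, which stops as soon as the change falls below the convergence threshold.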

[0025] Preferably, the semantic matching degree is calculated as follows: obtain the word vector corresponding to each character in the adjusted entity term, where the number of dimensions of the word vector is a fixed value and each dimension corresponds to a component value; and obtain the word vector corresponding to each character in a term of the preset industry terminology dictionary (the dictionary term);

[0026] For the adjusted entity term and the dictionary term, at the same character position and the same vector dimension, multiply the component of the adjusted entity term by the component of the dictionary term, and sum the products over all character positions and all vector dimensions to obtain the numerator;

[0027] For the adjusted entity term and the dictionary term respectively, sum the squares of the component values over all character positions and all vector dimensions, and take the square root to obtain the modulus of the adjusted entity term and the modulus of the dictionary term;

[0028] Divide the numerator by the product of the two moduli to obtain the semantic matching degree.
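The calculation above amounts to a cosine similarity over the per-character word vectors of the two terms, treated as one flattened vector each. A minimal sketch follows; the function name and the handling of terms with different character counts (positions present in only one term contribute to its modulus but not to the numerator, which is equivalent to zero padding) are assumptions, since the text does not cover that case.

```python
import math
from typing import List

Matrix = List[List[float]]  # one fixed-dimension word vector per character


def semantic_matching_degree(a: Matrix, b: Matrix) -> float:
    """Cosine similarity between two terms given as per-character word vectors."""
    # Numerator: products at the same character position and the same dimension.
    numerator = sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    # Modulus of each term: square root of the sum of all squared components.
    modulus_a = math.sqrt(sum(x * x for row in a for x in row))
    modulus_b = math.sqrt(sum(y * y for row in b for y in row))
    if modulus_a == 0.0 or modulus_b == 0.0:
        return 0.0  # assumption: an all-zero term matches nothing
    return numerator / (modulus_a * modulus_b)
```

Identical terms score 1 (up to floating-point rounding), and terms whose vectors are orthogonal at every character position score 0.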

[0029] Preferably, for each candidate entity in the initial entity boundary set, the oscillation coefficient is obtained by collecting the context window containing the candidate entity in the standardized text corpus, and by analyzing the offset of the start and end position indices with respect to the context window. This includes the following steps:

[0030] Obtain the occurrence positions of candidate entities in the standardized text corpus, construct a context window centered on each occurrence position, and record the starting position of each context window;

[0031] For each context window, extract the relative starting offset between the starting position index of the candidate entity within that context window and the starting position of the window. Arrange the relative starting offsets of all context windows according to the order in which the context windows appear in the text to form the first offset sequence.

[0032] For each context window, extract the relative end offset between the end position index of the candidate entity within that context window and the start position of the window. Arrange the relative end offsets of all context windows in the order in which the context windows appear in the text to form the second offset sequence.

[0033] Compare two adjacent relative starting offsets in the first offset sequence to obtain the starting offset direction sequence;

[0034] Compare two adjacent relative end offsets in the second offset sequence to obtain the end offset direction sequence;

[0035] Count the number of direction changes in the starting offset direction sequence to obtain the starting direction change frequency, and count the number of direction changes in the end offset direction sequence to obtain the end direction change frequency. Output the sum of the starting direction change frequency and the end direction change frequency as the oscillation coefficient.
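The direction-change count described in these steps can be sketched as follows. The function names are illustrative, and skipping pairs with no movement (equal adjacent offsets) when counting direction changes is an assumption; the text does not say how ties are treated.

```python
from typing import List


def direction_sequence(offsets: List[int]) -> List[int]:
    """Compare adjacent offsets: +1 if increasing, -1 if decreasing, 0 if equal."""
    return [(b > a) - (b < a) for a, b in zip(offsets, offsets[1:])]


def count_direction_changes(directions: List[int]) -> int:
    """Count sign flips, ignoring steps with no movement (assumption)."""
    moves = [d for d in directions if d != 0]
    return sum(1 for a, b in zip(moves, moves[1:]) if a != b)


def oscillation_coefficient(start_offsets: List[int], end_offsets: List[int]) -> int:
    """Sum of direction changes in the start and end offset sequences."""
    return (count_direction_changes(direction_sequence(start_offsets))
            + count_direction_changes(direction_sequence(end_offsets)))
```

With relative starting offsets `[5, 7, 5, 7]` and constant end offsets, the start sequence flips direction twice and the end sequence never moves, so the coefficient is 2.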

[0036] Preferably, the nodes in the initial knowledge graph are automatically classified by type, and then enhanced by combining them with external industry knowledge bases to obtain an enhanced knowledge graph, including the following steps:

[0037] Each node in the initial knowledge graph is obtained, the node description text corresponding to each node is extracted from the standardized text corpus, and industry knowledge entries related to the node type are obtained from the external industry knowledge base to obtain the industry knowledge entry set.

[0038] For each node description text, a large language model is used to extract the semantic vector of the node description text. At the same time, all neighboring nodes of the node in the initial knowledge graph and the semantic vector of each neighboring node are obtained. The arithmetic mean of the semantic vectors of all neighboring nodes is calculated as the context dependency vector of the node.

[0039] Based on the contextual dependency vectors of each node, clustering is performed, and semantic vectors of industry knowledge entries are extracted and processed. Then, enhanced association edges are established through similarity-based correction analysis to obtain an enhanced knowledge graph.
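The context dependency vector defined above is the arithmetic mean of the neighbors' semantic vectors. A minimal sketch, assuming the graph is given as an adjacency mapping and semantic vectors as plain lists (both hypothetical representations), and returning a zero vector for a node without neighbors, which the text does not specify:

```python
from typing import Dict, List


def context_dependency_vector(node: str,
                              adjacency: Dict[str, List[str]],
                              vectors: Dict[str, List[float]]) -> List[float]:
    """Arithmetic mean of the semantic vectors of all neighbors of `node`."""
    dim = len(vectors[node])
    neighbors = adjacency.get(node, [])
    if not neighbors:
        return [0.0] * dim  # assumption: isolated nodes get a zero vector
    return [sum(vectors[n][d] for n in neighbors) / len(neighbors) for d in range(dim)]
```

In the "apple" example from the background, the neighbors ("mobile phone" or "fruit") determine this vector, so the same node text can end up with different context dependency vectors in different graphs.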

[0040] Preferably, based on the contextual dependency vectors of each node, clustering is performed, and semantic vectors from the industry knowledge item set are extracted and processed. Then, enhanced association edges are established through similarity-based correction analysis to obtain an enhanced knowledge graph, including the following steps:

[0041] The semantic vector of each node is concatenated with its corresponding contextual dependency vector to obtain a node fusion vector. Based on the node fusion vectors of all nodes, a clustering algorithm is used to group the nodes to obtain an initial type cluster set, and the cluster identifier of the initial type cluster to which each node belongs is recorded. The arithmetic mean of the contextual dependency vectors of each node corresponding to the cluster identifier of the initial type cluster to which each node belongs is calculated to obtain the context prototype vector of the initial type cluster. For each industry knowledge item in the industry knowledge item set, a large language model is used to extract the semantic vector, and the semantic vector is used as the standard type template vector of the corresponding type of the item.

[0042] For each node, calculate the first similarity between the context dependency vector of the node and the context prototype vector of the initial type cluster to which it belongs, and the second similarity between the context dependency vector of the node and each standard type template vector in the standard type template vector set. If the second similarity is greater than the first similarity and exceeds the preset similarity threshold, then the type of the node is corrected to the standard type corresponding to the second similarity, and the corrected node type label is obtained.

[0043] Based on the corrected node type labels, the nodes in the initial knowledge graph are re-labeled, and combined with the association rules in the external industry knowledge base, enhanced association edges are established between nodes with the same corrected node type labels to obtain an enhanced knowledge graph.
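The type correction rule in these steps can be sketched as follows. The clustering step itself is omitted; the sketch starts from a node whose initial cluster is already known. Cosine similarity is an assumption (the text only says "similarity"), as are the function names and the choice to compare against the single best-matching standard type template.

```python
import math
from typing import Dict, List


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (nu * nv)


def fuse(semantic: List[float], context: List[float]) -> List[float]:
    """Node fusion vector: semantic vector concatenated with the context dependency vector."""
    return semantic + context


def prototype(context_vectors: List[List[float]]) -> List[float]:
    """Context prototype vector of a cluster: mean of its members' context vectors."""
    n = len(context_vectors)
    return [sum(col) / n for col in zip(*context_vectors)]


def correct_type(context_vec: List[float], cluster_label: str,
                 cluster_prototype: List[float],
                 templates: Dict[str, List[float]], threshold: float) -> str:
    """Replace the cluster label with a standard type when the template fits better."""
    if not templates:
        return cluster_label
    first = cosine(context_vec, cluster_prototype)  # first similarity
    best_type, second = max(((t, cosine(context_vec, v)) for t, v in templates.items()),
                            key=lambda item: item[1])  # best second similarity
    if second > first and second > threshold:
        return best_type
    return cluster_label
```

The fusion vectors from `fuse` would be the input to whichever clustering algorithm is chosen, while the correction step compares only context dependency vectors, as the text describes.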

[0044] Specifically, after obtaining the first offset sequence and the second offset sequence, the following steps are included:

[0045] The total number of abnormal windows is obtained by performing offset statistics on the first offset sequence and the second offset sequence in conjunction with the context window.

[0046] Calculate the absolute value of the difference between two adjacent relative starting offsets in the first cleaned offset sequence to obtain the starting fluctuation amplitude sequence; calculate the absolute value of the difference between two adjacent relative end offsets in the second cleaned offset sequence to obtain the end fluctuation amplitude sequence.

[0047] The arithmetic mean of all elements in the starting fluctuation amplitude sequence is added to the arithmetic mean of all elements in the end fluctuation amplitude sequence to obtain the average fluctuation amplitude. The product of the average fluctuation amplitude and the total number of abnormal windows is output as the oscillation coefficient.

[0048] Preferably, the total number of abnormal windows is calculated by combining the first offset sequence and the second offset sequence with the context window, including the following steps:

[0049] Obtain the maximum and minimum values of all relative starting offsets in the first offset sequence, calculate the difference between the maximum and minimum values, and obtain the starting offset range value;

[0050] Calculate the arithmetic mean of all relative starting offsets in the first offset sequence to obtain the starting offset mean, and calculate the absolute difference between each relative starting offset and the starting offset mean. Mark the relative starting offsets whose absolute difference exceeds the first offset threshold as starting abnormal offsets.

[0051] Remove the starting abnormal offsets from the first offset sequence to obtain the first cleaned offset sequence, and record the window identifiers of the context windows corresponding to the starting abnormal offsets to form the starting abnormal window identifier set;

[0052] Obtain the maximum and minimum values of all relative end offsets in the second offset sequence, calculate the difference between the maximum and minimum values, and obtain the end offset range value;

[0053] Calculate the arithmetic mean of all relative end offsets in the second offset sequence to obtain the mean end offset, and calculate the absolute difference between each relative end offset and the mean end offset. Mark the relative end offsets whose absolute difference exceeds the second offset threshold as abnormal end offsets.

[0054] Remove the abnormal end offsets from the second offset sequence to obtain the second cleaned offset sequence, and record the window identifiers of the context windows corresponding to the abnormal end offsets to form the ending abnormal window identifier set;

[0055] Perform a union operation on the starting abnormal window identifier set and the ending abnormal window identifier set to obtain the total abnormal window identifier set. Count the number of window identifiers in the total abnormal window identifier set to obtain the total number of abnormal windows.
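The abnormal-window variant of the oscillation coefficient can be sketched as follows. Using the position of a window in the sequence as its window identifier is an assumption, as are the function names. The offset range values (maximum minus minimum) are left out because this part of the text does not state how they enter the result.

```python
from statistics import mean
from typing import List, Set, Tuple


def split_outliers(offsets: List[float], threshold: float) -> Tuple[List[float], Set[int]]:
    """Drop offsets farther than `threshold` from the mean; return (cleaned, abnormal ids)."""
    if not offsets:
        return [], set()
    mu = mean(offsets)
    abnormal = {i for i, o in enumerate(offsets) if abs(o - mu) > threshold}
    cleaned = [o for i, o in enumerate(offsets) if i not in abnormal]
    return cleaned, abnormal


def mean_amplitude(sequence: List[float]) -> float:
    """Mean absolute difference between adjacent elements (0.0 if fewer than two)."""
    diffs = [abs(b - a) for a, b in zip(sequence, sequence[1:])]
    return mean(diffs) if diffs else 0.0


def oscillation_coefficient(start_offsets: List[float], end_offsets: List[float],
                            first_threshold: float, second_threshold: float) -> float:
    """Average fluctuation amplitude multiplied by the total number of abnormal windows."""
    start_clean, start_abnormal = split_outliers(start_offsets, first_threshold)
    end_clean, end_abnormal = split_outliers(end_offsets, second_threshold)
    total_abnormal = len(start_abnormal | end_abnormal)  # union of window identifier sets
    average_amplitude = mean_amplitude(start_clean) + mean_amplitude(end_clean)
    return average_amplitude * total_abnormal
```

Read literally, the product means an entity with no abnormal windows gets a coefficient of 0 regardless of its fluctuation amplitude; whether that is the intended behavior is not stated in the text.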

[0056] In another aspect, the present invention also provides a question-answering system for professional industry information based on a knowledge graph of a large language model, including a data acquisition and processing module, an entity extraction module, an initial construction module, an enhancement module, and an output module;

[0057] The data acquisition and processing module is used to collect unstructured text data from professional industries, preprocess the unstructured text data, and obtain standardized text corpus.

[0058] The entity extraction module is used to detect and process oscillations by combining a large language model with entity boundary recognition. It extracts entities and relations from standardized text corpora using a particle swarm optimization algorithm to obtain stable extraction results.

[0059] The initial construction module is used to build an initial knowledge graph containing nodes and edges based on the entities and relationships in the stable extraction results;

[0060] The enhancement module is used to automatically classify the nodes in the initial knowledge graph and combine them with external industry knowledge bases for association enhancement, resulting in an enhanced knowledge graph;

[0061] The output module receives natural language questions input by the user, performs semantic parsing and answer retrieval based on an enhanced knowledge graph, and generates and returns the corresponding industry information answers.

[0062] Compared with the prior art, the embodiments of the present invention have at least the following technical advantages:

[0063] In the question-answering method and system for professional industry information based on a knowledge graph of a large language model provided by this invention, unstructured text data of the professional industry is first collected. Through cleaning and format unification, the original data is converted into a standardized text corpus, providing a unified input format for subsequent entity and relation extraction and avoiding inaccurate extraction results caused by inconsistent data formats or noise interference. Then, a large language model is used to identify entity boundaries in the standardized text corpus, and oscillation detection is introduced to identify entities whose boundary positions fluctuate significantly across different context windows. For these oscillating entities, a particle swarm optimization algorithm is used to correct their entity boundaries, ensuring consistency across different context windows while the semantic matching degree with the preset industry terminology dictionary reaches a convergent state, and stable extraction results are output. This solves the problem of unstable entity boundary recognition under a single context window and provides a reliable foundation for building a knowledge graph. Next, based on the entities and relations in the stable extraction results, an initial knowledge graph containing nodes and edges is constructed, using the extracted stable entities as nodes and the semantic relationships between entities as edges to form a structured knowledge representation. This provides the graph structure foundation for subsequent type classification and knowledge enhancement. The nodes in the initial knowledge graph are then automatically classified into different type clusters. Based on this, an external industry knowledge base is used to correct the node types, and enhanced association edges are established between nodes of the same type. By introducing the standardized type system and association rules of the external knowledge base, the accuracy of node type classification is improved and the semantic relationships between nodes in the graph are enriched, forming an enhanced knowledge graph that more closely aligns with the industry knowledge system. Finally, the system receives natural language questions input by users, performs semantic parsing and answer retrieval based on the enhanced knowledge graph, and uses the more accurate node types and richer association edges in the enhanced knowledge graph to improve the semantic understanding of user questions and the coverage of answer retrieval, ultimately generating and returning the corresponding industry information answer.

Attached Figure Description

[0064] Figure 1 is a schematic diagram of the main process of a question-answering method for professional industry information based on a knowledge graph of a large language model.

[0065] Figure 2 is a schematic diagram of the initial knowledge graph simulation in a question-answering method for professional industry information based on a knowledge graph of a large language model.

[0066] Figure 3 is a schematic diagram simulating the label probability distribution in a question-answering method for professional industry information based on a knowledge graph of a large language model.

[0067] Figure 4 is a schematic diagram of an enhanced knowledge graph simulation in a question-answering method for professional industry information based on a knowledge graph of a large language model.

[0068] Figure 5 is a schematic diagram of the overall architecture of a question-answering system for professional industry information based on a knowledge graph of a large language model.

[0069] Reference numerals: acquisition and processing module 10, entity extraction module 20, initial construction module 30, enhancement module 40, output module 50.

Detailed Implementation

[0070] The technical solution of the present invention will now be clearly and completely described with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0071] The present invention will now be described in further detail with reference to specific embodiments and accompanying drawings.

[0072] Embodiment 1

[0073] As shown in Figure 1, Embodiment 1 of the present invention provides a question-answering method for professional industry information based on a knowledge graph of a large language model, including the following steps:

[0074] Step S10: Collect unstructured text data from professional industries, clean and unify the format of the unstructured text data to obtain standardized text corpus;

[0075] Step S20: Using a large language model combined with entity boundary recognition for oscillation detection processing, entity and relation extraction is performed on the standardized text corpus using the particle swarm optimization algorithm to obtain stable extraction results (the stable extraction results mentioned in this application refer to the extraction results in which entity boundaries maintain consistency in different context windows after oscillation detection and particle swarm correction, specifically manifested as the fluctuation amplitude of the start position index and end position index of the same entity in different context windows being lower than a preset threshold, and the semantic matching degree with the preset industry terminology dictionary reaching a convergent state after particle swarm optimization correction).

[0076] Step S30: Based on the entities and relations in the stable extraction results, construct an initial knowledge graph containing nodes and edges (as shown in Figure 2);

[0077] Step S40: Automatically classify the nodes in the initial knowledge graph by type, and enhance their association with external industry knowledge bases to obtain an enhanced knowledge graph;

[0078] Step S50: Receive the natural language question input by the user, perform semantic parsing and answer retrieval based on the enhanced knowledge graph, generate and return the corresponding industry information answer.

[0079] It should be noted that, based on the analysis of the above question-answering method for professional industry information based on knowledge graphs using large language models, the following can be observed:

[0080] First, the embodiments described above collect unstructured text data from professional industries. Through cleaning and format standardization, the raw data is converted into standardized text corpus, providing a unified input format for subsequent entity and relation extraction, avoiding inaccurate extraction results due to inconsistent data formats or noise interference. Then, a large language model is used to identify entity boundaries in the standardized text corpus, and oscillation detection is introduced to identify entities with significantly fluctuating boundary positions in different context windows. For these oscillating entities, a particle swarm optimization algorithm is used to correct their entity boundaries, ensuring consistency across different context windows. Simultaneously, the semantic matching degree with the preset industry terminology dictionary reaches convergence, outputting stable extraction results. This solves the problem of unstable entity boundary identification under a single context window, providing a reliable foundation for building a knowledge graph. Next, based on the entities and relations in the stable extraction results, an initial knowledge graph containing nodes and edges is constructed. The stable entities extracted in step S20 are used as nodes, and the semantic relationships between entities are used as edges, forming a structured knowledge representation, providing a graph structure foundation for subsequent type classification and knowledge enhancement.

[0081] Furthermore, the nodes in the initial knowledge graph are automatically classified into different type clusters. Based on this, the types of nodes are corrected by combining external industry knowledge bases, and enhanced association edges are established between nodes of the same type. By introducing the standardized type system and association rules of external knowledge bases, the accuracy of node type classification is improved, and the semantic associations between nodes in the graph are enriched, forming an enhanced knowledge graph that makes the graph structure closer to the industry knowledge system.

[0082] Finally, the system receives the natural language question input by the user, takes the text semantics of the natural language question as input, and then performs semantic parsing and answer retrieval based on the enhanced knowledge graph. By utilizing the more accurate node types and richer association edges in the enhanced knowledge graph, the system improves the semantic understanding of the user's question and the coverage of the answer retrieval, and finally generates and returns the corresponding industry information answer.

[0083] In practice, step S50 includes the following steps:

[0084] Step S51: Receive the natural language question input by the user, preprocess and semantically parse the natural language question, and extract the core entities and query intent in the question;

[0085] By receiving natural language question text input by users, the question text undergoes preprocessing operations such as word segmentation, stop word removal, and part-of-speech tagging to form standardized question semantic units. A large language model is then used to perform semantic parsing on the preprocessed question text, identifying the core entities mentioned in the question and extracting the query intent type; the query intent types include, but are not limited to, entity attribute queries, entity relationship queries, entity type statistical queries, and conditional filtering queries.

[0086] Step S52: Semantically match the extracted core entities with the nodes in the enhanced knowledge graph to locate the target node set associated with the question;

[0087] Using the core entity extracted in step S51 as input, and combining the semantic vectors and node type labels of the nodes in the enhanced knowledge graph, the semantic matching degree between the core entity and each node in the graph is calculated. Nodes with semantic matching degrees exceeding a preset matching threshold are selected as candidate target nodes. If there are multiple candidate target nodes, they are sorted according to the query intent type and the contextual dependency vector of the node in the enhanced knowledge graph. The node that best matches the question context is selected as the target node, and the node identifier and node type label of the node in the enhanced knowledge graph are recorded.

[0088] Step S53: Based on the query intent type and the target node, perform answer retrieval in the enhanced knowledge graph to obtain the candidate answer information corresponding to the target node;

[0089] Based on the query intent type, retrieve the answer information associated with the target node from the enhanced knowledge graph:

[0090] If the query intent type is an entity attribute query, the attribute field values of the target node stored in the enhanced knowledge graph are obtained as the candidate answer information.

[0091] If the query intent type is an entity relationship query, the associated nodes that have a semantic relationship with the target node, together with the associated edge type labels, are extracted based on the edges of the target node in the enhanced knowledge graph, and the description text of the associated nodes is combined with the associated edge types as the candidate answer information.

[0092] If the query intent type is an entity type statistical query, all nodes in the enhanced knowledge graph that have the same type label as the target node are counted, and the statistical result is used as the candidate answer information.

[0093] If the query intent type is a conditional filtering query, a subset of target nodes that satisfy the filtering conditions in the query intent is retrieved from the enhanced knowledge graph, and the descriptive information of each node in the subset is used as the candidate answer information.
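The four retrieval branches of step S53 can be sketched as a dispatch over the query intent type. The following Python sketch is illustrative only: the `Node` and `Edge` classes, the intent strings, and the function `retrieve_candidates` are hypothetical names that do not appear in this application, and the enhanced knowledge graph is assumed to be held in memory as plain lists.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    type_label: str
    description: str
    attributes: dict = field(default_factory=dict)


@dataclass
class Edge:
    source: str
    target: str
    type_label: str


def retrieve_candidates(intent, target, nodes, edges, condition=None):
    """Return candidate answer information for one target node (step S53)."""
    by_id = {n.node_id: n for n in nodes}
    if intent == "attribute":
        # Entity attribute query: attribute field values of the target node.
        return [f"{k}: {v}" for k, v in target.attributes.items()]
    if intent == "relationship":
        # Entity relationship query: neighbours plus the edge type label.
        out = []
        for e in edges:
            if e.source == target.node_id:
                out.append(f"{e.type_label} -> {by_id[e.target].description}")
            elif e.target == target.node_id:
                out.append(f"{e.type_label} <- {by_id[e.source].description}")
        return out
    if intent == "type_statistics":
        # Entity type statistical query: count nodes sharing the type label.
        count = sum(1 for n in nodes if n.type_label == target.type_label)
        return [f"{target.type_label}: {count}"]
    if intent == "conditional_filter":
        # Conditional filtering query: `condition` is a caller-supplied predicate.
        return [n.description for n in nodes if condition(n)]
    raise ValueError(f"unknown query intent type: {intent}")
```

Keeping each branch as a separate return path mirrors the four cases in the text; a production system would more likely issue graph database queries than scan lists.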

[0094] Step S54: Semantically sort the candidate answer information and generate the final industry information answer;

[0095] The candidate answer information obtained in step S53 is sorted according to its semantic relevance to the natural language question; the semantic relevance is obtained by calculating the cosine similarity between the semantic vector of the candidate answer information and the semantic vector of the question. Several candidate answer information pieces with the highest semantic relevance are selected, and a large language model is used in conjunction with node description text and associated edge type labels in the enhanced knowledge graph to generate semantically coherent and well-defined natural language answer text.
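The relevance ranking in step S54 can be illustrated with a short Python sketch. It assumes that the semantic vectors of the question and of each piece of candidate answer information have already been computed by some embedding model; the function names `cosine_similarity` and `top_candidates` and the parameter `k` are hypothetical, and the final answer generation by the large language model is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_candidates(question_vec, candidates, k=3):
    """candidates: list of (text, vector) pairs.

    Returns the k texts whose vectors are most similar to the question vector.
    """
    ranked = sorted(
        candidates,
        key=lambda c: cosine_similarity(question_vec, c[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]
```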

[0096] Step S55: Return the generated industry information answer to the user.

[0097] Specifically, step S20, in which a large language model and oscillation detection of entity boundaries are combined with a particle swarm optimization algorithm to extract entities and relations from the standardized text corpus and obtain stable extraction results, includes the following steps:

[0098] Step S21: Perform sequence labeling on the standardized text corpus using a large language model to obtain the label probability distribution of each token, and, according to the label probability distribution, extract candidate entities together with the start position index and end position index of each candidate entity in the text, to obtain an initial entity boundary set. A token is the smallest basic unit in text processing: in English, a token is a word or punctuation mark, such as "fun" or "."; in Chinese, it is a character or a word, such as "knowledge" or "了". The label probability distribution of each token is obtained for the purpose of entity recognition. Take the sentence "Company 1234 established a new headquarters in a certain place." as an example. In the first step (word segmentation), the large language model cuts the sentence into tokens; suppose the segmentation result is: Token1: "12 Company", Token2: "Baba", Token3: "in", Token4: "a certain place", Token5: "established", Token6: "了", Token7: "new", Token8: "headquarters". In the second step (tagging), the large language model determines the type of each token, for example B-ORG (start of an organization name), I-ORG (inside an organization name), B-LOC (start of a place name), or O (non-entity). For Token1 "12 Company", the model calculates a probability of 90% for B-ORG, 8% for O, and 2% for the other labels; these values constitute the label probability distribution of that token. For Token4 "a certain place", the model calculates a probability of 95% for B-LOC, 4% for O, and so on. In the third step (determining the entity boundary), the high-probability label of Token1 identifies the start of the entity "12 Company"; because Token2 "Baba" has a high probability of being I-ORG, the entity continues, and the entity "Company 1234" and its boundary are thereby determined, as shown in Figure 3.
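The third step of the example above (turning per-token label probabilities into entity boundaries) can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the function name `decode_entities` is hypothetical, each token simply takes its highest-probability label, and boundaries are reported as token indices rather than as position indices in the text.

```python
def decode_entities(label_probs):
    """label_probs: one dict per token mapping a label (B-ORG, I-ORG, O, ...)
    to its probability.

    Returns a list of (start_token_index, end_token_index, entity_type).
    """
    # Pick the most probable label for every token.
    labels = [max(p, key=p.get) for p in label_probs]
    spans, start, etype = [], None, None
    # A trailing "O" sentinel flushes an entity that ends at the last token.
    for i, label in enumerate(labels + ["O"]):
        if label.startswith("I-") and start is not None and label[2:] == etype:
            continue  # the current entity continues
        if start is not None:
            spans.append((start, i - 1, etype))
            start, etype = None, None
        if label.startswith("B-"):
            start, etype = i, label[2:]
    return spans
```

An I- label that does not follow a matching B- label is ignored in this sketch; a real system might handle it differently.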

[0099] Step S22: For each candidate entity in the initial entity boundary set, collect the context window containing the candidate entity in the standardized text corpus, and analyze the offsets of the start position index and end position index with respect to the context window to obtain an oscillation coefficient (characterizing the fluctuation degree of the candidate entity boundary);

[0100] It should be noted that, in the above embodiments of this application, after the oscillation coefficient is obtained, its calculation type identifier is recorded. The calculation type identifier is either a first type or a second type: the first type corresponds to an oscillation coefficient calculated from the frequency of direction changes, and the second type corresponds to an oscillation coefficient calculated from the fluctuation amplitude and the total number of abnormal windows. The corresponding preset oscillation threshold is then selected according to the calculation type identifier: the first type corresponds to the first oscillation threshold, and the second type corresponds to the second oscillation threshold. The values of the two thresholds are different. The first oscillation threshold ranges from 3 to 8 and is 5 in this embodiment; the second oscillation threshold ranges from 2 to 6 and is 4 in this embodiment. These values are determined by sampling and testing standardized text corpora from multiple industry fields, statistically analyzing the recall and precision of oscillating entity detection under different thresholds, and selecting the threshold with the optimal F1 value. In practical applications, the values can be adjusted within these ranges according to the degree of text standardization and the entity boundary stability of different industries.
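The threshold selection described above amounts to a lookup keyed by the calculation type identifier. A minimal Python sketch, assuming the embodiment values of 5 and 4; the constant and function names are hypothetical:

```python
FIRST_TYPE = "direction_change_frequency"
SECOND_TYPE = "amplitude_and_abnormal_windows"

# Embodiment values; the text allows 3 to 8 for the first and 2 to 6 for the second.
OSCILLATION_THRESHOLDS = {FIRST_TYPE: 5, SECOND_TYPE: 4}


def is_oscillating(coefficient, calc_type, thresholds=OSCILLATION_THRESHOLDS):
    """True if the coefficient exceeds the threshold for its calculation type."""
    return coefficient > thresholds[calc_type]
```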

[0101] Step S23: Mark the candidate entities whose oscillation coefficient exceeds the preset oscillation threshold as oscillating entities, and construct a particle swarm for each oscillating entity, wherein the position vector of each particle consists of a start boundary adjustment amount and an end boundary adjustment amount, and the velocity vector of each particle represents the step size of the adjustment amounts.

[0102] Step S24: Based on the start boundary adjustment amount and the end boundary adjustment amount of each particle, temporarily correct the start position index and the end position index of the oscillating entity to obtain a temporary boundary position, and extract the adjusted entity term corresponding to the particle from the standardized text corpus based on the temporary boundary position.
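Step S24 can be illustrated with the following Python sketch, which maps a particle position to a temporary boundary and the corresponding adjusted entity term. Several details are assumptions not stated in the text: the continuous adjustment amounts are rounded to whole characters, the end position index is treated as inclusive, and out-of-range indices are clamped to the text. The function name `apply_adjustment` is hypothetical.

```python
def apply_adjustment(text, start, end, position):
    """position: (start_adjustment, end_adjustment) of one particle.

    Returns (temp_start, temp_end, adjusted_term); temp_end is inclusive.
    """
    # Assumption: continuous adjustments are rounded to whole characters.
    new_start = start + round(position[0])
    new_end = end + round(position[1])
    # Assumption: clamp so that 0 <= new_start <= new_end <= len(text) - 1.
    new_start = max(0, min(new_start, len(text) - 1))
    new_end = max(new_start, min(new_end, len(text) - 1))
    return new_start, new_end, text[new_start:new_end + 1]
```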

[0103] Step S25: Perform character-level word vector extraction separately on each adjusted entity term and on each term in the preset industry terminology dictionary, obtaining the word vector component of each dimension at each character position of each adjusted entity term and the word vector component of each dimension at each character position of each dictionary term. Specifically, any adjusted entity term is segmented at the character level, and each character is input into a pre-trained Chinese character-level embedding model (in this embodiment, a BERT-based Chinese character embedding model with an output dimension m=768) to obtain the m-dimensional word vector of that character, which yields the j-th dimension word vector component at the i-th character position of the p-th adjusted entity term. The q-th term in the preset industry terminology dictionary is processed in the same way to obtain its j-th dimension word vector component at the i-th character position. The semantic matching degree between each adjusted entity term and the preset industry terminology dictionary is then calculated and used as the fitness value of the corresponding particle; the individual historical best of each particle and the group historical best are compared and updated, and the velocity and position of each particle are updated accordingly. The iteration continues until the change in semantic matching degree is less than the convergence threshold or the preset maximum number of iterations is reached, and the output is the global optimal position vector that maximizes the semantic matching degree. The convergence threshold is the parameter used to determine whether the particle swarm optimization algorithm has converged: when the absolute value of the change in fitness value between two consecutive iterations is less than the threshold, the algorithm is considered to have converged and the iteration stops. In this embodiment, the convergence threshold is 0.001 and the maximum number of iterations is 100. The value 0.001 is set with reference to the range of the semantic matching degree (0 to 1); a change of 0.001 is much smaller than the semantic difference perceptible in actual applications.

[0104] The semantic matching degree is calculated as follows:

[0105] $$\mathrm{Sim}(E_p, D_q) = \frac{\sum_{i=1}^{n}\sum_{j=1}^{m} e^{p}_{i,j}\, d^{q}_{i,j}}{\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m}\left(e^{p}_{i,j}\right)^{2}}\;\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{m}\left(d^{q}_{i,j}\right)^{2}}}$$;

[0106] In the formula, $\mathrm{Sim}(E_p, D_q)$ is the semantic matching degree between the p-th adjusted entity term (i.e., the adjusted entity term corresponding to the p-th particle) and the q-th term in the preset industry terminology dictionary; $E_p$ is the adjusted entity term corresponding to the p-th particle (i.e., the p-th adjusted entity term); $D_q$ is the q-th term in the preset industry terminology dictionary; $e^{p}_{i,j}$ is the j-th dimension word vector component of the p-th adjusted entity term at the i-th character position, where n is the total number of characters in the term and m is the dimension of the word vector; and $d^{q}_{i,j}$ is the j-th dimension word vector component of the q-th term in the preset industry terminology dictionary at the i-th character position. The numerator is the sum of the products of the two terms' components over all character positions and all vector dimensions, and the denominator is the product of the vector magnitudes of the two terms, which normalizes the result to obtain the cosine similarity.

[0107] In the above embodiments of the present application, decomposing each term into a character-level vector representation captures the semantic information of character combinations in Chinese terms; using the dot product with modulus normalization to calculate the cosine similarity eliminates the influence of term length differences on the similarity; and performing the composite operation over both character positions and vector dimensions measures the proximity of two terms in the semantic space more precisely. In this embodiment, the word vector dimension m is 768. Traditional word vector methods compute a single vector for the term as a whole and cannot handle the semantic information carried by the order and combination of characters within the term. Through the double indexing by the i-th character position and the j-th vector component, the formula represents each term as a character × dimension matrix and calculates the similarity through the matrix dot product. The numerator is the sum of the element-wise products of the two matrices, and the denominator is the product of the moduli of the two matrices, which normalizes the result to between 0 and 1 and makes it convenient to set a unified similarity threshold.

[0108] For example, suppose the adjusted entity term is "smartphone" and the term in the preset industry terminology dictionary is "intelligent phone". Both are represented by character-level word vectors, with a word vector dimension of m = 3 (a simplified example). "Smartphone" is decomposed into four characters, "intelligent", "capable", "hand", and "phone", whose word vectors are V1 = [0.1, 0.2, 0.3], V2 = [0.2, 0.3, 0.4], V3 = [0.5, 0.6, 0.7], V4 = [0.6, 0.7, 0.8]; "intelligent phone" is decomposed into four characters, "intelligent", "capable", "electric", and "phone", whose word vectors are W1 = [0.1, 0.2, 0.3], W2 = [0.2, 0.3, 0.4], W3 = [0.7, 0.8, 0.9], W4 = [0.8, 0.9, 1.0]. The numerator is the sum, over all characters, of the dot products of the corresponding character vectors: for the first character, (0.1×0.1 + 0.2×0.2 + 0.3×0.3) = 0.14; for the second character, (0.2×0.2 + 0.3×0.3 + 0.4×0.4) = 0.29; for the third character, (0.5×0.7 + 0.6×0.8 + 0.7×0.9) = 1.46; for the fourth character, (0.6×0.8 + 0.7×0.9 + 0.8×1.0) = 1.91; the numerator is therefore 3.8. The squared modulus of "smartphone" is the sum of the squared moduli of its character vectors, (0.14 + 0.29 + 1.10 + 1.49) = 3.02, giving a modulus of 1.74; the squared modulus of "intelligent phone" is (0.14 + 0.29 + 1.94 + 2.45) = 4.82, giving a modulus of 2.20. The denominator is 1.74 × 2.20 = 3.828, and the semantic matching degree is 3.8 ÷ 3.828 ≈ 0.993.

[0109] It should also be noted that when the adjusted entity term and the q-th term in the preset industry terminology dictionary have different character lengths, this application adopts a length alignment strategy: the two terms are aligned to a common length n, and if the actual number of characters in a term is less than n, zero vectors are appended to its character vector sequence to extend its length to n. On this basis, the upper limit n of the summation in the formula takes this common value. This alignment ensures that the semantic matching degree can be calculated for terms of any length, avoiding similarity calculation failures or deviations caused by length differences.
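The semantic matching degree and the zero-vector length alignment can be sketched in Python as follows. This is a minimal illustration, not the implementation of this application: the function name `semantic_matching_degree` is hypothetical, character vectors are passed in directly rather than produced by an embedding model, and the common length n is assumed to be the larger of the two character counts.

```python
import math


def semantic_matching_degree(term_vecs, dict_vecs):
    """Each argument is a list of per-character vectors (lists of m floats)."""
    m = len(term_vecs[0])
    # Assumption: align both terms to the length of the longer one.
    n = max(len(term_vecs), len(dict_vecs))

    def pad(vecs):
        # Append zero vectors so that the sequence has n character positions.
        return vecs + [[0.0] * m for _ in range(n - len(vecs))]

    a, b = pad(term_vecs), pad(dict_vecs)
    dot = sum(a[i][j] * b[i][j] for i in range(n) for j in range(m))
    norm_a = math.sqrt(sum(x * x for row in a for x in row))
    norm_b = math.sqrt(sum(x * x for row in b for x in row))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


V = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.5, 0.6, 0.7], [0.6, 0.7, 0.8]]
W = [[0.1, 0.2, 0.3], [0.2, 0.3, 0.4], [0.7, 0.8, 0.9], [0.8, 0.9, 1.0]]
score = semantic_matching_degree(V, W)
```

With the vectors of the worked example, `score` evaluates to roughly 0.996. The figure of 0.993 in the worked example results from rounding the two moduli to 1.74 and 2.20 before dividing.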

[0110] In the above embodiments of this application, the velocity and position of the particles are updated by comparison with the individual historical best and the group historical best, using the following formulas:

[0111] $$v_{i,d}^{k+1} = w\,v_{i,d}^{k} + c_1 r_1\left(p_{i,d}^{k} - x_{i,d}^{k}\right) + c_2 r_2\left(g_{d}^{k} - x_{i,d}^{k}\right)$$;

[0112] $$x_{i,d}^{k+1} = x_{i,d}^{k} + v_{i,d}^{k+1}$$;

[0113] In the formulas, $v_{i,d}^{k}$ is the d-th dimension velocity component of the i-th particle during the k-th iteration; $x_{i,d}^{k}$ is the corresponding position component; $p_{i,d}^{k}$ is the individual historical best position; $g_{d}^{k}$ is the group historical best position; w is the inertia weight, used to balance global search and local search capabilities, and is set to 0.8 in this embodiment; $c_1$ and $c_2$ are the learning factors, representing the weights of individual cognition and social cognition respectively, and are both set to 2.0 in this embodiment; $r_1$ and $r_2$ are random numbers in [0,1]. The iteration continues until the change in semantic matching degree is less than the convergence threshold or the preset maximum number of iterations is reached;

[0114] The above velocity and position update formulas combine three terms: the inertia term, weighted by w, preserves the particle's historical motion trend; the individual cognition term guides the particle toward its own historical best position; and the social cognition term guides the particle toward the group's historical best position. The weighted combination of these three terms enables the particles to perform fine-grained local search and global exploration within the search space at the same time.
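The update rule and stopping criteria described above correspond to the standard particle swarm optimization loop, which can be sketched in Python as follows. The sketch uses the embodiment values (w = 0.8, learning factors 2.0, convergence threshold 0.001, at most 100 iterations), while the names `pso_step` and `optimize`, the swarm size, the initial search span, and the seed are assumptions introduced for illustration. The `fitness` argument stands in for the semantic matching degree of the adjusted entity term produced by a particle position.

```python
import random


def pso_step(x, v, pbest, gbest, w=0.8, c1=2.0, c2=2.0, r1=None, r2=None, rng=random):
    """One velocity and position update for a single particle.

    r1 and r2 may be fixed for testing; otherwise they are drawn per dimension.
    """
    new_x, new_v = [], []
    for d in range(len(x)):
        a = rng.random() if r1 is None else r1
        b = rng.random() if r2 is None else r2
        vd = w * v[d] + c1 * a * (pbest[d] - x[d]) + c2 * b * (gbest[d] - x[d])
        new_v.append(vd)
        new_x.append(x[d] + vd)
    return new_x, new_v


def optimize(fitness, n_particles=10, span=5.0, eps=0.001, max_iter=100, seed=0):
    """Maximize fitness over 2-D positions (start adjustment, end adjustment)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-span, span), rng.uniform(-span, span)] for _ in range(n_particles)]
    vs = [[0.0, 0.0] for _ in range(n_particles)]
    pbest = [x[:] for x in xs]
    pfit = [fitness(x) for x in xs]
    g = max(range(n_particles), key=lambda i: pfit[i])
    gbest, gfit = pbest[g][:], pfit[g]
    for _ in range(max_iter):
        prev = gfit
        for i in range(n_particles):
            xs[i], vs[i] = pso_step(xs[i], vs[i], pbest[i], gbest, rng=rng)
            f = fitness(xs[i])
            if f > pfit[i]:
                pbest[i], pfit[i] = xs[i][:], f
            if f > gfit:
                gbest, gfit = xs[i][:], f
        # Stop when the best fitness changed by less than the convergence threshold.
        if abs(gfit - prev) < eps:
            break
    return gbest, gfit
```

Applied literally, the convergence rule also stops after any iteration in which the best fitness does not improve; implementations often require several stagnant iterations before stopping, which would be a deviation from the rule as stated.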

[0115] Step S26: Based on the start boundary adjustment amount and end boundary adjustment amount in the global optimal position vector, correct the start position index and end position index of the oscillating entity to obtain the corrected entity boundary, and extract the corrected entity term from the standardized text corpus based on the corrected entity boundary.

[0116] Step S27: Use a large language model to extract the semantic relationships between the corrected entity terms and entities other than the corrected entity terms from the standardized text corpus. Encapsulate the corrected entity terms and their corresponding semantic relationships in a structured manner to generate stable extraction results.

[0117] It should be noted that the above embodiments of this application use a large language model to perform sequence annotation on standardized text corpora, obtain the label probability distribution of each token, extract candidate entities and their start and end position indices in the text based on the distribution, obtain an initial entity boundary set, apply the sequence annotation capability of the large language model to entity recognition, output the boundary position information of candidate entities, and provide basic data for subsequent boundary stability analysis; then, by quantifying the position changes of entity boundaries in different context windows into oscillation coefficients, a quantitative basis is provided for distinguishing stable entities from oscillating entities. After obtaining the oscillation coefficients, their calculation type identifiers are recorded, and a corresponding preset oscillation threshold is selected according to the type identifier. Entities with unstable boundaries are identified through threshold filtering, and a particle swarm optimization model is established for each oscillating entity to provide a search space and optimization framework for subsequent boundary correction;

[0118] Furthermore, the position vectors of the particles are transformed into actual entity terms, giving the adjustment amount an evaluable textual form, providing an evaluation object for calculating semantic matching degree; then, by transforming the entity boundary correction problem into a continuous space optimization problem, the particle swarm optimization algorithm is used to search for the boundary adjustment amount that gives the entity term the highest semantic matching degree with the industry terminology dictionary, thereby achieving automatic correction of the oscillating entity boundary.

[0119] Next, the globally optimal adjustment obtained from particle swarm optimization is applied to the original entity boundaries, and the corrected entity terms are output, which improves the consistency of entity boundaries in different context windows. The corrected entities and their semantic relationships are combined and encapsulated to form structured extraction results, providing stable and consistent entity and relationship data for the subsequent construction of knowledge graphs.

[0120] Specifically, in step S22, for each candidate entity in the initial entity boundary set, the oscillation coefficient is obtained by collecting the context window containing the candidate entity in the standardized text corpus and analyzing the offset of the start and end position indices with respect to the context window, including the following steps:

[0121] Step S221: Obtain the occurrence positions of the candidate entity in the standardized text corpus, construct a context window centered on each occurrence position, collect all of the context windows to obtain the context window set, and record the window start position of each context window;

[0122] Step S222: For each context window in the context window set, extract the relative starting offset between the starting position index of the candidate entity in the context window and the starting position of the window, and arrange the relative starting offsets of all context windows in the order in which the context windows appear in the text to form the first offset sequence.

[0123] Step S223: For each context window, extract the relative end offset between the end position index of the candidate entity within the context window and the start position of the window, and arrange the relative end offsets of all context windows in the order in which the context windows appear in the text to form the second offset sequence.

[0124] Step S224: Compare two adjacent relative starting offsets in the first offset sequence (compare their magnitudes; if the latter relative starting offset is greater than the former, record the starting offset direction as positive; if the latter is less than the former, record the starting offset direction as negative; traverse the entire first offset sequence) to obtain the starting offset direction sequence.

[0125] Step S225: Compare two adjacent relative end offsets in the second offset sequence (compare their magnitudes; if the latter relative end offset is greater than the former, record the end offset direction as positive; if the latter is less than the former, record the end offset direction as negative; traverse the entire second offset sequence) to obtain the end offset direction sequence.

[0126] Step S226: Count the number of direction changes in the starting offset direction sequence to obtain the starting direction change frequency; count the number of direction changes in the end offset direction sequence to obtain the end direction change frequency; and output the sum of the starting direction change frequency and the end direction change frequency as the oscillation coefficient.
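Steps S222 to S226 can be sketched in Python as follows. The function names are hypothetical, each context window is assumed to be given as a tuple of the window start position and the entity start and end position indices recognized within it, and equal adjacent offsets are assumed to record no direction, since the text defines only the positive and negative cases.

```python
def relative_offsets(windows):
    """windows: list of (window_start, entity_start, entity_end) in text order.

    Returns the first (start) and second (end) offset sequences.
    """
    starts = [es - ws for ws, es, _ in windows]
    ends = [ee - ws for ws, _, ee in windows]
    return starts, ends


def direction_sequence(offsets):
    """+1 where an offset is larger than its predecessor, -1 where it is smaller."""
    dirs = []
    for prev, cur in zip(offsets, offsets[1:]):
        if cur > prev:
            dirs.append(1)
        elif cur < prev:
            dirs.append(-1)
        # Assumption: equal adjacent offsets record no direction.
    return dirs


def direction_changes(dirs):
    """Number of times consecutive directions differ."""
    return sum(1 for a, b in zip(dirs, dirs[1:]) if a != b)


def oscillation_coefficient(windows):
    """Sum of the start and end direction change frequencies (step S226)."""
    starts, ends = relative_offsets(windows)
    return (direction_changes(direction_sequence(starts))
            + direction_changes(direction_sequence(ends)))
```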

[0127] It should be noted that the above-described embodiments of this application obtain the occurrence positions of candidate entities in the standardized text corpus, construct a context window centered on each occurrence position to obtain a set of context windows, and record the start position of each context window. Each occurrence of a candidate entity serves as an analysis sample, and constructing context windows fixes the text range of each analysis, providing a unified reference for subsequent offset calculations. The starting boundary positions of candidate entities are then converted into offsets relative to the window start positions, which eliminates the absolute differences caused by the different start positions of different windows, makes the starting boundary positions of multiple windows comparable, and, once the offsets are arranged in order of occurrence, forms a sequence that provides the data foundation for the subsequent direction change analysis. The same conversion and arrangement are performed on the end boundaries to obtain the end boundary offset sequence, so that the starting and end boundaries each have independent sequence data. Next, the magnitude relationship between the starting boundary offsets of adjacent windows is converted into direction indicators, transforming the fluctuation trend of the starting boundary from numerical differences into a direction sequence whose changes can be counted. The same direction conversion is performed on the end boundary to obtain its fluctuation direction sequence, ensuring that both boundaries have independent fluctuation direction data for the subsequent statistics. Finally, by counting the number of direction reversals in each direction sequence, the frequency of boundary oscillation is quantified as a numerical value: a higher direction change frequency indicates that the boundary oscillates more frequently between different windows. This value is output as the oscillation coefficient and is used in subsequent steps to determine whether a candidate entity is an oscillating entity.

[0128] In the specific implementation of the above-described embodiments of this application, the technicians also found that, in the standardized text corpus, some context windows of a candidate entity may be special cases: the window boundary may cut exactly at special characters or punctuation marks, producing extreme values of the relative start offset or relative end offset, or the window may contain typesetting errors, garbled characters, or unconventional expressions that cause the entity boundary recognition result to deviate from the normal range. If an offset sequence containing these extreme values is used directly to calculate the oscillation coefficient, the coefficient will be dominated by the abnormal windows and will not reflect the actual fluctuation degree of the entity boundary. In addition, steps S221 to S226 characterize oscillation by counting how often the offset direction changes. This reflects how frequently the boundary swings back and forth, but it cannot distinguish between two essentially different cases: in case A, the entity boundary swings frequently within a range of 1 to 2 characters (small oscillation); in case B, the entity boundary jumps within a range of 10 to 15 characters (large oscillation). The direction change frequency may be similar in the two cases, yet the boundary instability in case B is much higher than in case A, which directly affects the convergence difficulty and correction effect of the subsequent particle swarm optimization algorithm.

[0129] In step S23 above, candidate entities whose oscillation coefficients exceed the preset oscillation threshold are marked as oscillating entities and enter the particle swarm optimization process. If the oscillation coefficient includes contributions from abnormal windows, an entity whose boundary is actually stable may be misjudged as an oscillating entity because of a few abnormal windows, wasting computational resources; conversely, an entity whose boundary actually fluctuates severely may be misjudged as stable because the extreme values of abnormal windows mask the true oscillation pattern, so that an object requiring correction is missed.

[0130] Steps S221 to S226 above provide a method for calculating the oscillation coefficient based on the frequency of changes in the offset direction. In practical applications, the researchers of this invention further found that, although this method reflects how frequently the entity boundary oscillates, it cannot distinguish small oscillations from large jumps and is easily affected by the extreme values of abnormal windows. Therefore, another preferred embodiment provides an improved method for calculating the oscillation coefficient. By eliminating abnormal windows, calculating the average fluctuation amplitude, and combining it with the total number of abnormal windows, this method characterizes the actual fluctuation degree of the entity boundary more accurately, as detailed below:

[0131] Specifically, in another embodiment, after the first offset sequence is obtained in step S222 and the second offset sequence is obtained in step S223, the following steps are performed (steps S2221 to S2229 of this embodiment are an alternative processing flow that replaces steps S224 to S226 after step S223 and is intended to remedy the aforementioned defects):

[0132] Step S2221: Obtain the maximum and minimum values of all relative starting offsets in the first offset sequence, and calculate the difference between them to obtain the starting offset range value;

[0133] Step S2222: Calculate the arithmetic mean of all relative starting offsets in the first offset sequence to obtain the starting offset mean, calculate the absolute difference between each relative starting offset and the starting offset mean, and mark the relative starting offsets whose absolute difference exceeds the first offset threshold as starting abnormal offsets. The first offset threshold is the starting offset range multiplied by a preset proportional coefficient; it is the threshold for judging anomalies of the starting boundary and is used to remove offset values that deviate significantly from the normal range because of special context windows, since such anomalies would distort the subsequent fluctuation amplitude calculation and prevent the oscillation coefficient from reflecting the actual fluctuation degree of the entity boundary.

[0134] It should be noted that the first offset threshold is the starting offset range multiplied by a preset proportional coefficient. The preset proportional coefficient is a real number greater than 0 and less than or equal to 1, which is used to control the strictness of the judgment of abnormal offset. The smaller the preset proportional coefficient, the stricter the judgment standard, and the easier it is to mark the offset as abnormal. In this embodiment, the preset proportional coefficient is 0.5.

[0135] The preset proportional coefficient in this application is a parameter that controls how strictly abnormal offsets are judged. Its value is greater than 0 and at most 1; the smaller the coefficient, the stricter the judgment standard and the more readily an offset is marked as abnormal. In this embodiment the value is 0.5. This value is determined from a standard deviation analysis of the distribution of normal window offsets in the standardized text corpus: the standard deviation σ of the first offset sequence is calculated first, and the preset proportional coefficient is then set to 2σ / range, so that the 95% confidence interval of normal window offsets falls within the threshold range and 95% of normal window offsets are not marked as abnormal. In practical applications, the coefficient can be fine-tuned according to the quality of the corpus and the tolerance for anomalies.
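The relationship between the coefficient and the standard deviation described above can be written out as a short worked equation. The symbols $R$ (offset range), $\sigma$ (standard deviation), $k$ (preset proportional coefficient), $\mu$ (offset mean), $o_i$ (offset in window $i$) and $T_1$ (first offset threshold) are notation introduced here for illustration only; the 95% figure assumes that the normal window offsets are approximately normally distributed.

```latex
T_1 = k \cdot R, \qquad k = \frac{2\sigma}{R} \;\Longrightarrow\; T_1 = 2\sigma

o_i \text{ is marked abnormal} \iff |o_i - \mu| > T_1

P\bigl(|o - \mu| \le 2\sigma\bigr) \approx 0.954 \quad \text{for normally distributed } o

k = 0.5 \iff \sigma = R/4
```

Under these assumptions, the embodiment's value of 0.5 corresponds to a corpus in which the standard deviation of the offsets is about one quarter of their range.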

[0136] Step S2223: Remove the starting abnormal offsets from the first offset sequence to obtain the first purified offset sequence, and record the window identifiers of the context windows corresponding to the starting abnormal offsets to form the starting abnormal window identifier set;

[0137] Step S2224: Obtain the maximum and minimum values of all relative ending offsets in the second offset sequence, and calculate the difference between the maximum and minimum values to obtain the ending offset range;

[0138] Step S2225: Calculate the arithmetic mean of all relative ending offsets in the second offset sequence to obtain the ending offset mean, and calculate the absolute difference between each relative ending offset and the ending offset mean. Mark each relative ending offset whose absolute difference exceeds the second offset threshold as an ending abnormal offset. The second offset threshold is the ending offset range multiplied by the preset proportional coefficient and serves as the threshold for judging anomalies of the ending boundary. It is used to remove offset values that deviate significantly from the normal range because of special context windows, since such anomalies would distort the subsequent fluctuation amplitude calculation and prevent the oscillation coefficient from reflecting the actual degree of fluctuation of the entity boundary. The preset proportional coefficient is the same as in step S2222, namely 0.5, so that the starting boundary and the ending boundary are judged by the same anomaly standard.

[0139] Step S2226: Remove the ending abnormal offsets from the second offset sequence to obtain the second purified offset sequence, and record the window identifiers of the context windows corresponding to the ending abnormal offsets to form the ending abnormal window identifier set;

[0140] Step S2227: Perform a union operation on the starting abnormal window identifier set and the ending abnormal window identifier set to obtain the total abnormal window identifier set. Count the number of window identifiers in the total abnormal window identifier set to obtain the total number of abnormal windows.

[0141] Step S2228: Calculate the absolute value of the difference between each pair of adjacent relative starting offsets in the first purified offset sequence to obtain the starting fluctuation amplitude sequence; calculate the absolute value of the difference between each pair of adjacent relative ending offsets in the second purified offset sequence to obtain the ending fluctuation amplitude sequence.

[0142] It should be noted that this processing first determines the number of elements in the first purified offset sequence. If the number is greater than 1, the absolute value of the difference between each pair of adjacent relative starting offsets in the sequence is calculated to obtain the starting fluctuation amplitude sequence. If the number is equal to 1, the starting fluctuation amplitude sequence is set to contain a single element with a value of 0. The same applies to the second purified offset sequence: if the number of elements is greater than 1, the absolute value of the difference between each pair of adjacent relative ending offsets is calculated to obtain the ending fluctuation amplitude sequence; if the number is equal to 1, the ending fluctuation amplitude sequence is set to contain a single element with a value of 0.

[0143] Step S2229: Add the arithmetic mean of all elements in the starting fluctuation amplitude sequence to the arithmetic mean of all elements in the ending fluctuation amplitude sequence to obtain the average fluctuation amplitude, and output the product of the average fluctuation amplitude and the total number of abnormal windows as the oscillation coefficient. (Multiplying the fluctuation intensity (the average fluctuation amplitude) by the degree of anomaly (the total number of abnormal windows) allows the oscillation coefficient to reflect both the fluctuation intensity within normal windows and the frequency of abnormal window occurrences. If only the average fluctuation amplitude were used, the extreme values of abnormal windows would be diluted by the average of the normal windows, so that entities with large fluctuations that appear in only a few windows would be misjudged as stable. If only the total number of abnormal windows were used, small, frequent oscillations could not be distinguished from large, occasional jumps. Multiplying the two comprehensively reflects the boundary instability of entities and provides an accurate screening basis for the subsequent particle swarm optimization.)

[0144] Based on steps S221 to S226 in the above embodiments of this application, the following processing is introduced for extracting the relative starting offset and forming the first offset sequence in step S222 and extracting the relative ending offset and forming the second offset sequence in step S223: First, by calculating the offset range and the offset mean, abnormal offsets that exceed the reasonable range are identified and eliminated to obtain a purified offset sequence, so that subsequent analysis is based on data that truly reflects the fluctuation law of the entity boundary; second, the absolute value of the difference between adjacent offsets is used to characterize the fluctuation amplitude, which makes up for the defect that the frequency of direction change cannot distinguish the strength of fluctuation; third, the total number of abnormal windows is used as a penalty factor and multiplied by the average fluctuation amplitude to obtain the final oscillation coefficient, so that the oscillation coefficient can reflect both the fluctuation intensity in the normal window and the frequency of abnormal window occurrence.

[0145] For example, a candidate entity has relative starting offsets of [5,6,7,8,25] and relative ending offsets of [12,13,14,15,32] across 5 context windows, with a preset proportional coefficient of 0.5. First, the starting offset range is calculated as 20, the first offset threshold is 10, and the starting offset mean is 10.2; the offset of 25 (with an absolute difference of 14.8) exceeds the threshold and is marked as abnormal and removed. The resulting first purified offset sequence is [5,6,7,8]; similarly, the second purified offset sequence is [12,13,14,15]. The first purified offset sequence has 4 elements, which is greater than 1, so the absolute values of adjacent differences are calculated to obtain the starting fluctuation amplitude sequence [1,1,1] with a mean of 1; the ending fluctuation amplitude sequence is also [1,1,1] with a mean of 1; the average fluctuation amplitude is 2. The starting and ending abnormal window identifier sets both contain the 5th window, so the total number of abnormal windows after the union is 1; the final oscillation coefficient is 2 × 1 = 2.
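The calculation in steps S2221 to S2229 and the worked example above can be sketched in a few lines of Python. This is a minimal illustration rather than the applicant's implementation: the function names are hypothetical, window identifiers are assumed to be the positions in the offset lists, and returning 0 for an empty purified sequence is an assumption, since the text only specifies the single-element case.

```python
from statistics import mean


def purge_abnormal(offsets, ratio=0.5):
    """Steps S2221-S2223 (or S2224-S2226): drop offsets far from the mean.

    Returns the purified sequence and the set of abnormal window indices.
    """
    threshold = (max(offsets) - min(offsets)) * ratio  # range x coefficient
    center = mean(offsets)
    abnormal = {i for i, v in enumerate(offsets) if abs(v - center) > threshold}
    purified = [v for i, v in enumerate(offsets) if i not in abnormal]
    return purified, abnormal


def mean_amplitude(purified):
    """Step S2228: mean absolute difference between adjacent offsets."""
    if len(purified) <= 1:
        # One element gives the single value 0 per the text; empty is an assumption.
        return 0.0
    return mean(abs(b - a) for a, b in zip(purified, purified[1:]))


def oscillation_coefficient(starts, ends, ratio=0.5):
    """Step S2229: (starting mean + ending mean) x total abnormal windows."""
    pure_starts, bad_starts = purge_abnormal(starts, ratio)
    pure_ends, bad_ends = purge_abnormal(ends, ratio)
    total_abnormal = len(bad_starts | bad_ends)  # step S2227: union of window ids
    return (mean_amplitude(pure_starts) + mean_amplitude(pure_ends)) * total_abnormal


coefficient = oscillation_coefficient([5, 6, 7, 8, 25], [12, 13, 14, 15, 32])
print(coefficient)  # 2, matching the worked example
```

Taking the union of the two abnormal sets means the 5th window is counted once even though it is abnormal at both boundaries.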

[0146] Specifically, in step S40, the nodes in the initial knowledge graph are automatically classified by type, and the association is enhanced by combining them with an external industry knowledge base to obtain an enhanced knowledge graph, including the following operation steps:

[0147] Step S41: Obtain each node in the initial knowledge graph, extract the node description text corresponding to each node from the standardized text corpus to obtain the node description text set, and obtain industry knowledge entries related to the node type from the external industry knowledge base to obtain the industry knowledge entry set;

[0148] Step S42: For each node description text in the node description text set, use a large language model to extract the semantic vector of the node description text. At the same time, obtain all the neighboring nodes of the node in the initial knowledge graph, obtain the semantic vector of each neighboring node, and calculate the arithmetic mean of the semantic vectors of all neighboring nodes as the context dependency vector of the node.
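The averaging in step S42 can be illustrated with a small sketch. The vectors and node names below are hypothetical toy data (real semantic vectors would come from the large language model), and returning a zero vector for a node without neighbors is an assumption that the text does not address.

```python
def context_dependency_vector(node, neighbors, semantic_vectors):
    """Arithmetic mean of the semantic vectors of a node's neighbors (step S42)."""
    vectors = [semantic_vectors[n] for n in neighbors[node]]
    if not vectors:
        # Assumption: an isolated node gets a zero vector; the source does not say.
        return [0.0] * len(semantic_vectors[node])
    return [sum(component) / len(vectors) for component in zip(*vectors)]


# Hypothetical toy data standing in for model-produced semantic vectors.
semantic_vectors = {
    "apple": [0.2, 0.8],
    "mobile phone": [1.0, 0.0],
    "chip": [0.5, 0.5],
}
neighbors = {
    "apple": ["mobile phone", "chip"],
    "mobile phone": ["apple"],
    "chip": ["apple"],
}

apple_context = context_dependency_vector("apple", neighbors, semantic_vectors)
print(apple_context)  # [0.75, 0.25]
```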

[0149] Step S43: Concatenate the semantic vector of each node with the corresponding context dependency vector to obtain the node fusion vector that integrates context information. Based on the node fusion vectors of all nodes, use a clustering algorithm to group the nodes to obtain an initial type cluster set, and record the cluster identifier of the initial type cluster to which each node belongs. According to the cluster identifier of the initial type cluster to which each node belongs, obtain the context dependency vector of all nodes in each initial type cluster, and use the context dependency vector as the basis for type discrimination of nodes in that cluster.
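Step S43 can be sketched as concatenation followed by clustering. The text does not name the clustering algorithm, so plain k-means with farthest-point seeding is used here purely as an assumption; the function names, the number of clusters, and the toy vectors are likewise hypothetical.

```python
def fuse(semantic, context):
    """Concatenate a node's semantic vector with its context dependency vector."""
    return list(semantic) + list(context)


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=20):
    """Plain k-means with deterministic farthest-point seeding (an assumed choice)."""
    centroids = [list(vectors[0])]
    while len(centroids) < k:
        farthest = max(
            vectors,
            key=lambda v: min(squared_distance(v, c) for c in centroids),
        )
        centroids.append(list(farthest))
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical (semantic vector, context dependency vector) pairs per node.
nodes = {
    "A": ([1.0, 0.0], [0.9, 0.1]),
    "B": ([0.9, 0.1], [1.0, 0.0]),
    "C": ([0.0, 1.0], [0.1, 0.9]),
    "D": ([0.1, 0.9], [0.0, 1.0]),
}
fused = [fuse(semantic, context) for semantic, context in nodes.values()]
labels = kmeans(fused, k=2)
cluster_of = dict(zip(nodes, labels))  # cluster identifier recorded per node
print(cluster_of)  # {'A': 0, 'B': 0, 'C': 1, 'D': 1}
```

Because the fusion vector contains both halves, two nodes with similar descriptions but different neighborhoods can still land in different clusters.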

[0150] Step S44: For each initial type cluster in the initial type cluster set, calculate the arithmetic mean of the context dependency vectors of all nodes in the initial type cluster, and use it as the context prototype vector of the initial type cluster; for each industry knowledge entry in the industry knowledge entry set, use a large language model to extract the semantic vector of the entry, and use the semantic vector as the standard type template vector of the type corresponding to the entry, where each standard type template vector is associated with a standard type label. (In a knowledge graph, adjacent nodes reflect the contextual semantic environment of the current node. Taking the arithmetic mean of the semantic vectors of all adjacent nodes fuses the type information of a node with its associations in the knowledge graph, so that the node representation contains contextual information; this solves the problem that a single semantic vector cannot handle context-dependent expressions. For example, the "apple" node tends toward a product type when its adjacent nodes include "mobile phone", and toward an agricultural product type when its adjacent nodes include "fruit".)
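The prototype calculation in step S44 amounts to grouping nodes by cluster identifier and averaging. A minimal sketch follows; the node names, cluster identifiers, and vectors are hypothetical, and the template vectors are omitted because they come directly from the language model.

```python
from collections import defaultdict


def context_prototypes(cluster_of, context_vectors):
    """Mean context dependency vector per initial type cluster (step S44)."""
    grouped = defaultdict(list)
    for node, cluster_id in cluster_of.items():
        grouped[cluster_id].append(context_vectors[node])
    return {
        cluster_id: [sum(col) / len(vectors) for col in zip(*vectors)]
        for cluster_id, vectors in grouped.items()
    }


# Hypothetical cluster assignments and context dependency vectors.
cluster_of = {"n1": 0, "n2": 0, "n3": 1}
context_vectors = {"n1": [1.0, 0.0], "n2": [0.5, 0.5], "n3": [0.0, 1.0]}

prototypes = context_prototypes(cluster_of, context_vectors)
print(prototypes)  # {0: [0.75, 0.25], 1: [0.0, 1.0]}
```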

[0151] Step S45: For each node, calculate the first similarity between the node's context dependency vector and the context prototype vector of its initial type cluster, and the second similarity between the node's context dependency vector and each standard type template vector in the standard type template vector set. If a second similarity is greater than the first similarity and exceeds a preset similarity threshold, the type of the node is corrected to the standard type corresponding to that second similarity, yielding the corrected node type label. The similarity threshold controls the strictness of type correction and is set to 0.95 in this embodiment. This value is determined from a statistical analysis of the similarity distribution between standard type template vectors and the context dependency vectors of nodes of the same type: on the validation set, a threshold of 0.95 gives a type correction accuracy of over 98% while maintaining reasonable correction coverage. In practical applications, the threshold can be adjusted between 0.90 and 0.99 according to the accuracy requirements of knowledge graph construction.

[0152] For example, with cosine similarity as the similarity measure, the context dependency vector of node X is [0.5, 0.6, 0.7] and the context prototype vector of its initial type cluster is [0.4, 0.5, 0.6], with a first similarity taken as 0.98. In the external industry knowledge base, the standard type template vectors for "product type" and "organization type" are [0.7, 0.8, 0.9] and [0.2, 0.3, 0.4], with second similarities taken as 0.96 and 0.85, respectively. The preset similarity threshold is 0.95. Although the larger second similarity of 0.96 exceeds the threshold, it is less than the first similarity of 0.98, so the correction condition is not met and node X retains its original type. For another node Y, the context dependency vector is [0.7, 0.8, 0.9], the context prototype vector of its initial type cluster is [0.4, 0.5, 0.6], the first similarity is taken as 0.96, and the second similarity with the standard type template vector of "product type" is taken as 0.99. Since the second similarity of 0.99 is greater than the first similarity of 0.96 and exceeds the threshold of 0.95, the type of node Y is corrected to "product type". (The similarity values in this example illustrate the decision rule only; they are not the exact cosine similarities of the listed vectors.)
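The decision rule of step S45 can be sketched as follows, using cosine similarity as in the example above. Picking the best-matching template when several are available is an assumption, and the function names, the placeholder cluster label, and the two-dimensional toy vectors are hypothetical; the toy vectors are chosen so that the outcome is easy to verify by hand rather than to reproduce the figures in the example.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def corrected_type(context_vec, cluster_type, prototype, templates, threshold=0.95):
    """Return a standard type label if it beats the cluster prototype, else keep."""
    first = cosine(context_vec, prototype)
    # Assumption: only the best-matching template is considered for correction.
    label, second = max(
        ((name, cosine(context_vec, vec)) for name, vec in templates.items()),
        key=lambda pair: pair[1],
    )
    if second > first and second > threshold:
        return label
    return cluster_type


# Hypothetical standard type template vectors.
templates = {"product type": [1.0, 0.0], "organization type": [0.0, 1.0]}

# Node close to the "product type" template but far from its cluster prototype.
type_y = corrected_type([1.0, 0.1], "cluster-0", [1.0, 1.0], templates)
# Node that matches its cluster prototype better than any template.
type_x = corrected_type([1.0, 1.0], "cluster-0", [1.0, 0.9], templates)
print(type_y, "|", type_x)  # product type | cluster-0
```

Requiring both conditions means a template that is merely close does not override a cluster that fits the node even better.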

[0153] Step S46: Based on the corrected node type labels, re-label the nodes in the initial knowledge graph, and, in combination with the association rules in the external industry knowledge base, establish enhanced association edges between nodes with the same corrected node type label, thereby obtaining the enhanced knowledge graph (as shown in Figure 4, where the thick black solid lines are the enhanced association edges). An enhanced association edge is an edge with a type label that is added between two nodes that were not directly connected in the initial knowledge graph but have been corrected to the same type label, provided that the external industry knowledge base presets association rules for nodes of that type (such as collaboration, upstream/downstream relationships, or hierarchical relationships between nodes of the same type). The weight of the edge can be initialized from the association strength coefficient in the external industry knowledge base; if no explicit strength coefficient is specified, the default value is 1. By establishing enhanced association edges, nodes of the same type in the knowledge graph form a denser semantic association network, improving the coverage and accuracy of subsequent question-answering retrieval.
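The edge construction in step S46 can be sketched as a pass over node pairs. The rule format used here (one rule per type, with a relation name and an optional strength) is a hypothetical simplification of the association rules in the external industry knowledge base, and applying a rule to every pair of nodes of that type is an assumption.

```python
from itertools import combinations


def enhanced_edges(node_types, existing_edges, rules, default_weight=1.0):
    """Step S46: link unconnected same-type nodes when a rule covers that type."""
    connected = {frozenset(edge) for edge in existing_edges}
    new_edges = []
    for a, b in combinations(sorted(node_types), 2):
        node_type = node_types[a]
        if node_type != node_types[b] or node_type not in rules:
            continue
        if frozenset((a, b)) in connected:
            continue  # only nodes not directly connected get an enhanced edge
        rule = rules[node_type]
        weight = rule.get("strength", default_weight)  # default weight is 1
        new_edges.append((a, b, rule["relation"], weight))
    return new_edges


# Hypothetical corrected type labels, initial edges, and knowledge-base rules.
node_types = {
    "A": "product type",
    "B": "product type",
    "C": "product type",
    "D": "organization type",
}
existing_edges = [("A", "B"), ("C", "D")]
rules = {"product type": {"relation": "upstream/downstream"}}

edges = enhanced_edges(node_types, existing_edges, rules)
print(edges)
# [('A', 'C', 'upstream/downstream', 1.0), ('B', 'C', 'upstream/downstream', 1.0)]
```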

[0154] It should be noted that the above-described embodiments of this application first collect information from two data sources: node description texts are extracted from the standardized text corpus for subsequent extraction of the semantic features of the nodes themselves, and industry knowledge entries are obtained from the external industry knowledge base to provide standardized type templates for subsequent classification and correction of node types. Then, the semantic features of each node are represented by a semantic vector, and the arithmetic mean of the semantic vectors of its adjacent nodes quantifies the node's association information in the graph as a context dependency vector, so that the node representation incorporates its association information in the knowledge graph. Further, the semantic features and contextual features of each node are merged into a node fusion vector by concatenation, enabling the clustering algorithm to group nodes based on both their own content and their association relationships in the graph. After clustering, the context dependency vectors of all nodes within each cluster are retained as intra-cluster references for subsequent type correction. Next, the contextual features of each initial type cluster are condensed into a context prototype vector that serves as the type representative of that cluster. At the same time, the industry knowledge entries from the external industry knowledge base are converted into standard type template vectors, providing an external standardized type reference for node type correction.

[0155] By comparing the similarity between nodes and clustering results, and the similarity between nodes and external standard types, it is determined whether the external standard type better represents the contextual features of the node than the clustering result. If the external standard type simultaneously meets the two conditions of higher similarity and exceeding the threshold, the type obtained by clustering is replaced by the external standard type, so that the type label of the node is aligned with the standard type system of the external industry knowledge base. Finally, the corrected type label is updated on the nodes of the knowledge graph. At the same time, according to the association rules defined in the external industry knowledge base, edges that did not exist before are added between nodes of the same type, so that the nodes in the knowledge graph form a denser association network on the basis of type consistency, resulting in an enhanced knowledge graph.

[0156] Embodiment 2

[0157] In another aspect, as shown in Figure 5, Embodiment 2 of this application, building on the question-answering method for professional industry information based on a knowledge graph using a large language model provided in Embodiment 1, further provides a question-answering system for professional industry information based on a knowledge graph using a large language model, including a data acquisition and processing module 10, an entity extraction module 20, an initial construction module 30, an enhancement module 40, and an output module 50.

[0158] The data acquisition and processing module 10 is used to acquire unstructured text data from professional industries, preprocess the unstructured text data, and obtain standardized text corpus.

[0159] The entity extraction module 20 is used to apply oscillation detection processing that combines a large language model with entity boundary recognition, and to extract entities and relations from the standardized text corpus using a particle swarm optimization algorithm to obtain stable extraction results.

[0160] The initial construction module 30 is used to construct an initial knowledge graph containing nodes and edges based on the entities and relationships in the stable extraction results.

[0161] The enhancement module 40 is used to automatically classify the nodes in the initial knowledge graph by type and combine them with external industry knowledge bases for association enhancement, so as to obtain an enhanced knowledge graph.

[0162] The output module 50 is used to receive natural language questions input by the user, perform semantic parsing and answer retrieval based on the enhanced knowledge graph, and generate and return the corresponding industry information answers.

[0163] In summary, the present invention presents a question-answering method and system for professional industry information based on a knowledge graph using a large language model. This method collects and cleans unstructured text data to obtain standardized corpora, combines oscillation detection and particle swarm optimization for stable entity relation extraction, constructs an initial knowledge graph, automatically classifies nodes by type, and introduces an external industry knowledge base for type correction and association enhancement. Finally, question answering is performed based on the enhanced knowledge graph. This completes the closed loop from data preprocessing to knowledge graph construction and question-answering application, solving the problems of unstable entity extraction, inaccurate node type classification, and disconnect between the knowledge graph and industry knowledge system in traditional methods. It provides stable, accurate, and semantically rich knowledge support for industry information question answering.

[0164] Furthermore, an initial entity boundary set is obtained by sequence labeling of standardized text corpora using a large language model. For each candidate entity, a context window is collected and the offset is analyzed to obtain the oscillation coefficient. Entities with oscillation coefficients exceeding a threshold are marked as oscillating entities. A particle swarm optimization is constructed with the boundary adjustment amount as the position vector. The semantic matching degree between the adjusted entity entries and the industry terminology dictionary is calculated using character-level word vectors. The particle swarm optimization algorithm iteratively searches for the boundary adjustment amount that maximizes the matching degree. Finally, the oscillating entity boundaries are corrected and semantic relations are extracted. The problem of entity boundary fluctuation in different context windows is transformed into a continuous space optimization problem. The boundary position is automatically corrected by the particle swarm optimization algorithm, so that the corrected entity entries and the industry terminology dictionary achieve semantic convergence. This solves the technical defects in the background technology where entity boundaries oscillate in different context windows and the boundary position is inconsistent, leading to unreliable extraction results.
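The semantic matching degree that the particle swarm maximizes, as worded in claim 4, can be written as a single formula. The symbols are notation introduced here for illustration: $a_{i,j}$ is component $j$ of the word vector of character $i$ in the adjusted entity term, $d_{i,j}$ is the corresponding component for the dictionary term, $m$ is the fixed number of vector dimensions, and $n$ is the number of character positions. That both terms share the same $n$ (for example after alignment or padding) is an assumption; the text does not say how terms of different lengths are handled.

```latex
M = \frac{\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{m} a_{i,j}\, d_{i,j}}
         {\sqrt{\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{m} a_{i,j}^{2}}\;
          \sqrt{\displaystyle\sum_{i=1}^{n}\sum_{j=1}^{m} d_{i,j}^{2}}}
```

Read this way, the matching degree is the cosine similarity between the two terms when each term's character vectors are flattened into one long vector.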

[0165] Furthermore, a context window is constructed centered on each occurrence position of the candidate entity. The relative starting offset between the entity's starting position and the window's starting position within each window is extracted to form a first offset sequence. Similarly, the relative ending offset of the ending position is extracted to form a second offset sequence. The offset magnitudes of adjacent windows are compared to obtain a direction sequence. The sum of the frequency of change of the starting direction and the frequency of change of the ending direction is used as the oscillation coefficient. The positional fluctuation of the entity boundary in different windows is transformed into a quantitative indicator of the frequency of direction changes, so that the frequency of boundary oscillation can be measured numerically. This solves the technical defects in the background technology, which lacks effective quantification of the degree of fluctuation of the entity boundary and cannot distinguish between stable entities and oscillating entities.
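The direction-change measure summarized above can be sketched as follows. The details are assumptions where the summary is silent: a direction is taken to be the sign of the difference between adjacent offsets, an unchanged offset counts as its own direction (0), and the "frequency" is taken to be the raw count of direction changes. The function names and toy offsets are hypothetical.

```python
def direction_sequence(offsets):
    """Sign of the change between adjacent offsets: 1 up, -1 down, 0 unchanged."""
    return [(b > a) - (b < a) for a, b in zip(offsets, offsets[1:])]


def direction_changes(offsets):
    """Count positions where the direction differs from the previous direction."""
    directions = direction_sequence(offsets)
    return sum(1 for prev, cur in zip(directions, directions[1:]) if prev != cur)


def direction_change_coefficient(starts, ends):
    """Sum of the starting and ending direction-change counts."""
    return direction_changes(starts) + direction_changes(ends)


starts = [5, 7, 4, 6, 6]        # directions: [1, -1, 1, 0] -> 3 changes
ends = [12, 13, 14, 15, 16]     # directions: [1, 1, 1, 1]  -> 0 changes
print(direction_change_coefficient(starts, ends))  # 3
```

A steadily drifting boundary, like the ending offsets here, scores 0 under this measure no matter how far it drifts.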

[0166] In addition, based on the first and second offset sequences, the offset range and offset mean are calculated, and abnormal offsets are identified using the preset proportional coefficient and removed to obtain the purified offset sequences. The absolute value of the difference between adjacent offsets in each purified offset sequence is calculated to obtain the fluctuation amplitude sequences. The average fluctuation amplitude is obtained by summing the mean of the starting fluctuation amplitude sequence and the mean of the ending fluctuation amplitude sequence, and the product of the average fluctuation amplitude and the total number of abnormal windows is used as the oscillation coefficient. Removing abnormal windows avoids the interference of special character segmentation or text noise with the oscillation coefficient. At the same time, multiplying the average fluctuation amplitude by the total number of abnormal windows allows the oscillation coefficient to reflect both the fluctuation intensity in normal windows and the frequency of abnormal window occurrences. This solves the technical defects in the background technology, in which a measure based on the frequency of direction changes cannot distinguish between small oscillations and large jumps and is easily affected by the extreme values of abnormal windows.

[0167] Moreover, semantic vectors are obtained by extracting the node description texts from the standardized text corpus, and the arithmetic mean of the semantic vectors of all adjacent nodes is calculated as the context dependency vector. The two vectors are concatenated and clustered to obtain the initial type clusters. The mean of the context dependency vectors within each cluster is calculated as the context prototype vector. Semantic vectors of industry knowledge entries from the external industry knowledge base are extracted as standard type template vectors. By comparing the first similarity between a node's context dependency vector and the context prototype vector with the second similarity between the node's context dependency vector and a standard type template vector, the node type is corrected when the second similarity is higher and exceeds the threshold. Finally, enhanced association edges are established between nodes with the same corrected type label. In this way, the semantic features of the node itself are fused with the contextual information formed by its adjacent nodes for type classification, and the clustering results are corrected using the standard type templates from the external industry knowledge base, with previously non-existent association edges added between nodes of the same type. This addresses the technical shortcomings of the background technology, in which node type classification relying solely on the node's own semantic vector cannot handle context-dependent expressions and the knowledge graph is not aligned with the standardized type system and association rules of external industry knowledge bases.

[0168] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; those skilled in the art can modify the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A question-answering method for professional industry information based on a knowledge graph using a large language model, characterized in that the method includes the following steps: collecting unstructured text data from professional industries, and preprocessing the unstructured text data to obtain a standardized text corpus; applying oscillation detection processing that combines a large language model with entity boundary recognition, and extracting entities and relations from the standardized text corpus using a particle swarm optimization algorithm to obtain stable extraction results; constructing an initial knowledge graph containing nodes and edges based on the entities and relations in the stable extraction results; automatically classifying the nodes in the initial knowledge graph by type, and performing association enhancement in combination with an external industry knowledge base to obtain an enhanced knowledge graph; and receiving natural language questions input by users, performing semantic parsing and answer retrieval based on the enhanced knowledge graph, and generating and returning the corresponding industry information answers.

2. The question-answering method for professional industry information based on a knowledge graph using a large language model according to claim 1, characterized in that applying the oscillation detection processing that combines a large language model with entity boundary recognition, and extracting entities and relations from the standardized text corpus using a particle swarm optimization algorithm to obtain stable extraction results, includes the following steps: performing sequence labeling on the standardized text corpus using a large language model to obtain the label probability distribution of each token; extracting candidate entities and their start and end position indices in the text based on the label probability distribution to obtain an initial entity boundary set; for each candidate entity in the initial entity boundary set, obtaining the oscillation coefficient by collecting the context windows containing the candidate entity in the standardized text corpus and analyzing the offsets of the start position index and the end position index relative to the context windows; marking candidate entities whose oscillation coefficients exceed a preset oscillation threshold as oscillating entities, and constructing a particle swarm for each oscillating entity; based on the starting boundary adjustment amount and the ending boundary adjustment amount of each particle, temporarily correcting the start position index and the end position index of the oscillating entity, and extracting the adjusted entity term corresponding to the particle from the standardized text corpus; and, after performing particle optimization on the adjusted entity terms extracted through word vector extraction, encapsulating the results to generate the stable extraction results.

3. The question-answering method for professional industry information based on a knowledge graph using a large language model according to claim 1, characterized in that, after performing particle optimization on the adjusted entity terms extracted through word vector extraction, encapsulating the results to generate the stable extraction results includes the following steps: calculating the semantic matching degree between the adjusted entity term corresponding to each particle and the preset industry term dictionary; updating the particle velocity and position by comparing the individual historical best and the group historical best; iterating until the change in semantic matching degree is less than the convergence threshold or the preset maximum number of iterations is reached; outputting the global optimal position vector that maximizes the semantic matching degree; based on the starting and ending boundary adjustment amounts in the global optimal position vector, correcting the start and end position indices of the oscillating entity to obtain the corrected entity boundary; extracting the corrected entity terms from the standardized text corpus based on the corrected entity boundary; using a large language model, extracting from the standardized text corpus the semantic relations between the corrected entity terms and entities other than the corrected entity terms; and encapsulating the corrected entity terms and their corresponding semantic relations in a structured manner to generate the stable extraction results.

4. The question-answering method for professional industry information based on a knowledge graph using a large language model according to claim 3, characterized in that the semantic matching degree is calculated as follows: obtaining the word vector corresponding to each character in the adjusted entity term, where the number of dimensions of the word vector is a fixed value and each dimension corresponds to a component value, and obtaining the word vector corresponding to each character in the term of the preset industry term dictionary; for the adjusted entity term and the term of the preset industry term dictionary, multiplying their component values at the same character position and the same vector dimension, and summing the products over all character positions and all vector dimensions to obtain the numerator; for each of the adjusted entity term and the term of the preset industry term dictionary, summing the squares of the component values over all character positions and all vector dimensions and then taking the square root, to obtain the modulus of the adjusted entity term and the modulus of the term of the preset industry term dictionary, respectively; and dividing the numerator by the product of the two moduli to obtain the semantic matching degree.

5. The question-answering method for professional industry information based on a knowledge graph using a large language model according to claim 4, characterized in that the step of obtaining the oscillation coefficient for each candidate entity in the initial entity boundary set, by collecting the context windows containing the candidate entity in the standardized text corpus and analyzing the offsets of the start and end position indices relative to the context window, comprises the following steps: obtaining the occurrence positions of the candidate entity in the standardized text corpus, constructing a context window centered on each occurrence position, and recording the starting position of each context window; for each context window, extracting the relative start offset between the start position index of the candidate entity within that context window and the starting position of the window; arranging the relative start offsets of all context windows in the order in which the context windows appear in the text to form the first offset sequence; for each context window, extracting the relative end offset between the end position index of the candidate entity within that context window and the starting position of the window; arranging the relative end offsets of all context windows in the order in which the context windows appear in the text to form the second offset sequence; comparing each pair of adjacent relative start offsets in the first offset sequence to obtain the start offset direction sequence; comparing each pair of adjacent relative end offsets in the second offset sequence to obtain the end offset direction sequence; counting the number of direction changes in the start offset direction sequence to obtain the start direction change frequency; counting the number of direction changes in the end offset direction sequence to obtain the end direction change frequency; and outputting the sum of the start direction change frequency and the end direction change frequency as the oscillation coefficient.
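
The direction-change count of claim 5 can be illustrated with a short sketch. The tuple layout `(window_start, entity_start, entity_end)` and the function names are hypothetical. The claim does not say how two equal adjacent offsets are treated; this sketch assumes they carry no direction and are skipped.

```python
def direction_changes(offsets):
    # Direction of each adjacent pair: +1 rising, -1 falling, 0 unchanged.
    directions = [(b > a) - (b < a) for a, b in zip(offsets, offsets[1:])]
    # Assumption: unchanged pairs carry no direction and are skipped.
    directions = [d for d in directions if d != 0]
    return sum(1 for x, y in zip(directions, directions[1:]) if x != y)


def oscillation_coefficient(windows):
    """windows: (window_start, entity_start, entity_end) tuples in text order."""
    start_offsets = [es - ws for ws, es, _ in windows]  # first offset sequence
    end_offsets = [ee - ws for ws, _, ee in windows]    # second offset sequence
    return direction_changes(start_offsets) + direction_changes(end_offsets)
```

An entity whose recognized start alternates between two offsets from window to window scores high, while an entity with the same relative boundary in every window scores zero.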

6. The question-answering method for professional industry information based on a knowledge graph using a large language model according to claim 5, characterized in that the step of automatically classifying the nodes in the initial knowledge graph by type and enhancing them in combination with an external industry knowledge base to obtain an enhanced knowledge graph comprises the following steps: obtaining each node in the initial knowledge graph, extracting the node description text corresponding to each node from the standardized text corpus, and obtaining the industry knowledge entries related to node types from the external industry knowledge base to form the industry knowledge entry set; for each node description text, using a large language model to extract the semantic vector of the node description text, while obtaining all neighboring nodes of the node in the initial knowledge graph and the semantic vector of each neighboring node; calculating the arithmetic mean of the semantic vectors of all neighboring nodes as the context dependency vector of the node; and performing clustering based on the context dependency vector of each node, extracting the semantic vectors of the industry knowledge entries, and then establishing enhanced association edges through similarity-based correction to obtain the enhanced knowledge graph.
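
As a hedged illustration of the context dependency vector in claim 6, the sketch below averages the semantic vectors of a node's neighbors. The adjacency-dictionary representation and the function name are hypothetical, the semantic vectors are assumed to have been produced beforehand by the large language model, and returning a zero vector for a node without neighbors is an assumption the claim does not address.

```python
def context_dependency_vectors(adjacency, semantic):
    """adjacency: node -> list of neighboring nodes; semantic: node -> vector."""
    dim = len(next(iter(semantic.values())))
    result = {}
    for node, neighbors in adjacency.items():
        if not neighbors:
            # Assumption: an isolated node gets a zero context vector.
            result[node] = [0.0] * dim
            continue
        # Arithmetic mean of the neighbors' semantic vectors, per dimension.
        result[node] = [
            sum(semantic[n][d] for n in neighbors) / len(neighbors)
            for d in range(dim)
        ]
    return result
```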

7. The question-answering method for professional industry information based on a knowledge graph using a large language model according to claim 6, characterized in that the step of performing clustering based on the context dependency vector of each node, extracting the semantic vectors of the industry knowledge entry set, and establishing enhanced association edges through similarity-based correction to obtain the enhanced knowledge graph comprises the following steps: concatenating the semantic vector of each node with the corresponding context dependency vector to obtain the node fusion vector; grouping the nodes with a clustering algorithm based on the node fusion vectors of all nodes to obtain an initial type cluster set, and recording the cluster identifier of the initial type cluster to which each node belongs; for each initial type cluster, calculating the arithmetic mean of the context dependency vectors of all nodes bearing the cluster identifier of that cluster to obtain the context prototype vector of the initial type cluster; for each industry knowledge entry in the industry knowledge entry set, using a large language model to extract its semantic vector and taking that semantic vector as the standard type template vector of the type corresponding to the entry; for each node, calculating the first similarity between the context dependency vector of the node and the context prototype vector of the initial type cluster to which it belongs, and the second similarity between the context dependency vector of the node and each standard type template vector in the standard type template vector set; if a second similarity is greater than the first similarity and exceeds the preset similarity threshold, correcting the type of the node to the standard type corresponding to that second similarity to obtain the corrected node type label; and re-labeling the nodes in the initial knowledge graph based on the corrected node type labels and, in combination with the association rules in the external industry knowledge base, establishing enhanced association edges between nodes with the same corrected node type label to obtain the enhanced knowledge graph.
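
The type-correction rule of claim 7 can be sketched as below. Several points are assumptions rather than claimed features: cosine similarity is used because the claim does not name a similarity measure, cluster identifiers are taken as given (the clustering step itself is omitted), the highest-scoring template is the one compared against the first similarity, and uncorrected nodes keep a placeholder label of the form `cluster:<id>`. The creation of enhanced association edges from the association rules is not covered.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def correct_node_types(context_vecs, cluster_of, templates, threshold=0.8):
    """context_vecs: node -> context dependency vector; cluster_of: node -> cluster id;
    templates: standard type name -> standard type template vector."""
    # Context prototype vector: mean context dependency vector of each cluster.
    members = {}
    for node, cluster_id in cluster_of.items():
        members.setdefault(cluster_id, []).append(context_vecs[node])
    prototypes = {
        cluster_id: [sum(column) / len(vectors) for column in zip(*vectors)]
        for cluster_id, vectors in members.items()
    }

    labels = {}
    for node, vec in context_vecs.items():
        first = cosine(vec, prototypes[cluster_of[node]])
        best_type, second = max(
            ((name, cosine(vec, template)) for name, template in templates.items()),
            key=lambda pair: pair[1],
        )
        if second > first and second > threshold:
            labels[node] = best_type  # corrected to the standard type
        else:
            labels[node] = f"cluster:{cluster_of[node]}"  # placeholder label
    return labels
```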

8. The question-answering method for professional industry information based on a knowledge graph using a large language model according to claim 5, characterized in that, after the first offset sequence and the second offset sequence are obtained, the method further comprises the following steps: obtaining the total number of abnormal windows by performing offset statistics on the first offset sequence and the second offset sequence in combination with the context windows; calculating the absolute value of the difference between each pair of adjacent relative start offsets in the first cleaned offset sequence to obtain the start fluctuation amplitude sequence; calculating the absolute value of the difference between each pair of adjacent relative end offsets in the second cleaned offset sequence to obtain the end fluctuation amplitude sequence; adding the arithmetic mean of all elements in the start fluctuation amplitude sequence to the arithmetic mean of all elements in the end fluctuation amplitude sequence to obtain the average fluctuation amplitude; and outputting the product of the average fluctuation amplitude and the total number of abnormal windows as the oscillation coefficient.

9. The question-answering method for professional industry information based on a knowledge graph using a large language model according to claim 8, characterized in that the step of obtaining the total number of abnormal windows by performing offset statistics on the first offset sequence and the second offset sequence in combination with the context windows comprises the following steps: obtaining the maximum and minimum values of all relative start offsets in the first offset sequence and calculating their difference to obtain the start offset range value; calculating the arithmetic mean of all relative start offsets in the first offset sequence to obtain the start offset mean, and calculating the absolute difference between each relative start offset and the start offset mean; marking each relative start offset whose absolute difference exceeds the first offset threshold as an abnormal start offset; removing the abnormal start offsets from the first offset sequence to obtain the first cleaned offset sequence, and recording the window identifiers of the context windows corresponding to the abnormal start offsets to form the start abnormal window identifier set; obtaining the maximum and minimum values of all relative end offsets in the second offset sequence and calculating their difference to obtain the end offset range value; calculating the arithmetic mean of all relative end offsets in the second offset sequence to obtain the end offset mean, and calculating the absolute difference between each relative end offset and the end offset mean; marking each relative end offset whose absolute difference exceeds the second offset threshold as an abnormal end offset; removing the abnormal end offsets from the second offset sequence to obtain the second cleaned offset sequence, and recording the window identifiers of the context windows corresponding to the abnormal end offsets to form the end abnormal window identifier set; performing a union operation on the start abnormal window identifier set and the end abnormal window identifier set to obtain the total abnormal window identifier set; and counting the number of window identifiers in the total abnormal window identifier set to obtain the total number of abnormal windows.
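
Claims 8 and 9 together describe an amplitude-based oscillation coefficient, which might be sketched as follows. The function names are hypothetical, and using the index of each offset as its window identifier is an assumption. The start and end offset range values named in claim 9 are not used by the later steps and are omitted here.

```python
def split_abnormal(offsets, threshold):
    """Return the cleaned sequence and the identifiers of abnormal windows."""
    mean = sum(offsets) / len(offsets)
    # Assumption: the list index serves as the context window identifier.
    abnormal = {i for i, x in enumerate(offsets) if abs(x - mean) > threshold}
    cleaned = [x for i, x in enumerate(offsets) if i not in abnormal]
    return cleaned, abnormal


def mean_fluctuation(sequence):
    # Arithmetic mean of absolute differences between adjacent offsets.
    amplitudes = [abs(b - a) for a, b in zip(sequence, sequence[1:])]
    return sum(amplitudes) / len(amplitudes) if amplitudes else 0.0


def amplitude_oscillation_coefficient(start_offsets, end_offsets,
                                      first_threshold, second_threshold):
    cleaned_start, abnormal_start = split_abnormal(start_offsets, first_threshold)
    cleaned_end, abnormal_end = split_abnormal(end_offsets, second_threshold)
    # Union of the two identifier sets gives the total abnormal windows.
    total_abnormal = len(abnormal_start | abnormal_end)
    average_amplitude = mean_fluctuation(cleaned_start) + mean_fluctuation(cleaned_end)
    return average_amplitude * total_abnormal
```

Because the coefficient is a product, an entity with no abnormal windows receives a coefficient of zero regardless of its remaining fluctuation.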

10. A question-answering system for professional industry information based on a knowledge graph using a large language model, characterized in that it comprises a data acquisition and processing module, an entity extraction module, an initial construction module, an enhancement module, and an output module; the data acquisition and processing module is used to collect unstructured text data from professional industries and preprocess the unstructured text data to obtain a standardized text corpus; the entity extraction module is used to perform entity boundary recognition with a large language model, detect and process boundary oscillations, and extract entities and relationships from the standardized text corpus using a particle swarm optimization algorithm to obtain stable extraction results; the initial construction module is used to build an initial knowledge graph containing nodes and edges based on the entities and relationships in the stable extraction results; the enhancement module is used to automatically classify the nodes in the initial knowledge graph by type and perform association enhancement in combination with an external industry knowledge base to obtain an enhanced knowledge graph; and the output module is used to receive natural language questions input by the user, perform semantic parsing and answer retrieval based on the enhanced knowledge graph, and generate and return the corresponding industry information answers.