A knowledge graph construction method for enterprise scientific research projects

By constructing a knowledge graph of enterprise research projects and utilizing a large language model and nearest neighbor propagation algorithm, the problems of low knowledge correlation and insufficient information utilization in traditional scientific research management systems have been solved, achieving efficient scientific research resource management and analysis.

CN121766409B · Active · Publication Date: 2026-06-16 · 中铁科学研究院集团有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
中铁科学研究院集团有限公司
Filing Date
2026-03-02
Publication Date
2026-06-16

Abstract

The application discloses a method for constructing a knowledge graph for enterprise scientific research projects, relating to the technical field of knowledge graph construction, comprising: S1, using an ontology editing tool with the Resource Description Framework (RDF) and the OWL language, performing knowledge modeling of the research project knowledge graph to determine entity types, entity attributes, and entity relationships; S2, collecting the enterprise's achievement data and performing data cleaning and preprocessing to obtain research project text; S3, extracting knowledge based on large language model knowledge self-distillation; S4, disambiguating entities based on text similarity and the nearest neighbor propagation algorithm; S5, setting rules on the constructed knowledge graph, inferring potential relationships, and constructing the knowledge graph for enterprise scientific research management. The application gives researchers intelligent research project management and search capabilities, improving the efficiency of research management. In addition, in-depth development and analysis of research project information provides strong support for the management decisions of research institutions.

Description

Technical Field

[0001] This invention relates to the field of knowledge graph construction technology, specifically to a method for constructing a knowledge graph for enterprise research projects.

Background Technology

[0002] As scientific research enters the fourth paradigm of data-intensive scientific discovery, the connections between various elements in scientific research activities are becoming increasingly close, and research methods have undergone tremendous changes. Large-scale, cross-regional, and cross-institutional research activities are becoming more and more frequent. This presents significant challenges to project management in terms of comprehensive data collection, in-depth processing, and scientific management. Traditional research project management has many shortcomings, such as poor semantic association of projects, a single method of resource organization and association, and low utilization rate of resource integration and analysis, making it difficult to meet the new demands. Using knowledge graphs for research project management is particularly necessary. This approach can fully mine and share data, leverage the value of research project resources, help improve enterprises' ability to efficiently analyze scientific and technological research results, thereby enhancing their technological innovation capabilities and supporting the innovative development of scientific research.

[0003] Faced with a large amount of heterogeneous scientific research project data, the difficulty of physical and digital management of the project has increased dramatically. Existing scientific research project systems and technologies have technical deficiencies in the following aspects, and cannot fully meet the needs of managers in various aspects such as data mining and correlation analysis of project data.

[0004] 1) Poor correlation of scientific research knowledge: Traditional scientific research record management systems lack effective connections between scientific research knowledge, making it difficult to establish dynamic correlation networks across topics and disciplines. Knowledge graphs, however, construct a knowledge-level correlation and fusion model to link entities such as enterprises, professionals, and research results with entities extracted from scientific research project process documents. This allows previously scattered scientific research knowledge to connect with each other, improving the correlation of scientific research knowledge and solving the problem of isolated storage of scientific research knowledge.

[0005] 2) Insufficient information utilization: Previous scientific research project management systems did not fully utilize information, storing much information but failing to mine and utilize it effectively. This led to problems such as duplicate project approvals and unreasonable resource allocation, hindering effective management decision-making. The construction of knowledge graphs enriches the semantic relationships of scientific research projects, enabling a more comprehensive presentation of the information contained within them. This provides stronger support for management decision-making, in-depth development and analysis of project information in research institutions.

[0006] 3) Insufficient training data for knowledge extraction models: Using traditional neural networks for knowledge extraction requires costly manual annotation of text data. With limited data, the model struggles to learn the specific knowledge of the professional field and is at risk of overfitting. Under these constraints, it is difficult to achieve good knowledge extraction results from a small amount of enterprise research project data.

Summary of the Invention

[0007] The purpose of this invention is to overcome the shortcomings of the prior art and provide a method for constructing knowledge graphs for enterprise research projects.

[0008] The objective of this invention is achieved through the following technical solution:

[0009] This application discloses a method for constructing a knowledge graph for enterprise research projects, including the following steps:

[0010] S1. Based on ontology editing tools, using the resource description framework RDF and OWL language, knowledge modeling is performed on the knowledge graph of scientific research projects to determine entity types, entity attributes, and entity relationships; the entity types include projects, enterprises, experts, teams, achievements, and professions.

[0011] S2. Collect the enterprise's achievement data, and perform data cleaning and preprocessing to obtain the research project text. The achievement data includes process data of internal research projects and external patent documents.

[0012] S3. Knowledge extraction based on large language model knowledge self-distillation;

[0013] S4. Entity disambiguation based on text similarity and nearest neighbor propagation algorithm;

[0014] S5. Based on the constructed knowledge graph, rules are set, potential relationships are reasoned, and a knowledge graph for enterprise scientific research management is constructed.

[0015] Preferably, the preprocessing in step S2 includes: recognizing scanned paper documents, extracting text information from unstructured Word documents, removing duplicate or erroneous data, and converting English in the resulting data into Chinese.

[0016] Preferably, in step S3, based on instruction learning technology, a large language model is used to complete the knowledge extraction task through text generation under zero-shot conditions, specifically including:

[0017] The task description is concatenated with the corresponding task input and fed into the large language model. Having understood the task description, the model produces the entity recognition result by text generation. The task is modeled as

$$P(Y \mid I, X) = \prod_{i=1}^{|Y|} P(y_i \mid I, X, y_{<i})$$

where $I$ denotes the task description, $X$ denotes the task input, $Y$ denotes the first entity recognition result, $P(Y \mid I, X)$ denotes the probability that the large language model assigns to $Y$ given the input, $|Y|$ denotes the length of $Y$, $y_i$ denotes the $i$-th output of $Y$, and $y_{<i}$ denotes all content generated before step $i$.

[0018] Preferably, knowledge extraction is broken down into two sub-tasks. The first sub-task performs joint extraction of entities and relations, and the second sub-task is responsible for data verification and structuring according to a preset format, specifically including:

[0019] The first subtask identifies entities such as topics, enterprises, experts, teams, achievements, and professions, as well as their entity attribute values. At the same time, based on the content of the topic text, it extracts the relationships between different entities in the entity information and returns triples.

[0020] The second subtask integrates and verifies the entity relationships output by the first subtask, so that the final entity relationship extraction result meets the condition constraints. The condition constraints include that the type of each entity and relationship is within the specified type range and there are no duplicate entities and relationships.

[0021] Finally, the final entity relationships are populated into the preset knowledge extraction template, and the structured information that can be parsed by JSON is output.

[0022] Preferably, the large language model is trained by self-distillation so that the two inference passes required for knowledge extraction are reduced to one. One-step inference is expressed as $Y = \mathrm{LLM}(I_1, X; \theta')$, where $X$ denotes the task input, $Y$ denotes the extraction result, $I_1$ denotes the first instruction description, $\theta'$ denotes the model parameters after self-distillation training, and $\mathrm{LLM}(\cdot)$ denotes the large language model call function. The training specifically includes:

[0023] Annotation results in the first entity recognition result $Y$ that cannot be parsed as JSON are filtered out, giving the second entity recognition result $Y'$. The task input $X$ and the second entity recognition result $Y'$ are merged into the training dataset $\mathcal{D}$, on which the large language model undergoes supervised fine-tuning by minimizing the cross-entropy loss

$$\mathcal{L} = -\sum_{i=1}^{|Y'|} \sum_{v \in V} q(v \mid y'_i)\, \log P(v \mid I_2, X, y'_{<i})$$

where $V$ denotes the vocabulary, $|Y'|$ denotes the length of the text $Y'$, $I_2$ denotes the second instruction description, $y'_{<i}$ denotes the text of the first $i-1$ words of $Y'$, $v$ is any word in $V$, and $P(\cdot)$ is the probability modeling function. The label smoothing function $q$ is

$$q(v \mid y'_i) = \begin{cases} 1 - \epsilon, & v = y'_i \\ \epsilon / (N - 1), & v \neq y'_i \end{cases}$$

where $y'_i$ denotes the $i$-th word of $Y'$, $N$ denotes the size of the vocabulary, and $\epsilon$ denotes a hyperparameter;

[0024] The LoRA method is used to fine-tune the parameters of the large language model: $h = W_0 x + \Delta W x = W_0 x + BAx$, where $h$ denotes the output of a linear layer of the large language model, $x$ denotes the input to that linear layer, $W_0$ denotes the parameters of the large language model, $\Delta W$ denotes the parameter change after fine-tuning, and $B$ and $A$ denote the additional fine-tuning parameters.
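The data-preparation half of the self-distillation step (keeping only model outputs that parse as JSON and pairing them with their task inputs) could look roughly like the sketch below. The record layout with `input` and `target` keys is an assumption made for illustration; the patent does not specify how the training dataset is stored.

```python
import json

def build_distillation_dataset(samples):
    """Keep (task input, model output) pairs whose output parses as JSON."""
    dataset = []
    for task_input, model_output in samples:
        try:
            json.loads(model_output)
        except json.JSONDecodeError:
            continue  # unparseable annotation: filtered out
        # "input"/"target" are hypothetical field names for the training record.
        dataset.append({"input": task_input, "target": model_output})
    return dataset

samples = [
    ("text A", '{"triples": []}'),
    ("text B", '{"triples": [}'),  # malformed JSON, should be dropped
]
dataset = build_distillation_dataset(samples)
```

Only the first sample survives, so the fine-tuning data contains exclusively outputs that already satisfy the structured format the single-step model is expected to produce.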

[0025] Preferably, in step S4, a similarity analysis is performed based on the experts' information, including their employer, research topic name, and research field. The nearest neighbor propagation algorithm is then used to cluster the features of experts with the same name, ultimately determining whether experts with the same name belong to the same named entity. Specifically, this includes the following steps:

[0026] S41. Word embedding is performed on the expert research text to obtain word vectors. The expert research text includes the title and abstract of the expert's research project and the results. Then, the expert's professional text is obtained based on the expert's collaborative relationships and employing companies.

[0027] S42. A text sentence $s$ is extracted from the expert research text, converted into token fragments, and input into the BERT model. The Transformer layers inside the BERT model process these token fragments and generate hidden vectors, from which the contextual embedding representation of the token fragments is constructed, i.e. $H = \mathrm{BERT}(s) \in \mathbb{R}^{M \times d}$, where $H$ denotes the word vectors of the text sentence $s$, $d$ denotes the dimension of the hidden vectors, and $M$ denotes the total number of tokens after a single input text sequence has been segmented.

[0028] S43. After the word vectors of the expert research text have been computed with the BERT model, the expert feature data $F = (T, A, E, C)$ are generated according to the rule of sequential correspondence, where $T$ denotes the first feature vector, $A$ denotes the second feature vector, $E$ denotes the first feature text string, and $C$ denotes the second feature text string;

[0029] S44. The similarity of the relevant feature data of the first same-name expert $p_1$ and the second same-name expert $p_2$ is calculated, i.e.

$$\mathrm{Sim}(p_1, p_2) = W \cdot \big[\cos(T_1, T_2),\ \cos(A_1, A_2),\ \mathrm{Lev}(E_1, E_2),\ \mathrm{Lev}(C_1, C_2)\big]^{\top}$$

where $\mathrm{Sim}(p_1, p_2)$ denotes the similarity of the relevant feature data of $p_1$ and $p_2$; $W$ denotes a four-dimensional vector; $T_1$, $A_1$, $E_1$, and $C_1$ denote the first feature vector, the second feature vector, the first feature text string, and the second feature text string of $p_1$; $T_2$, $A_2$, $E_2$, and $C_2$ denote the same four features of $p_2$; $\cos(T_1, T_2)$ denotes the cosine similarity between the first feature vectors of $p_1$ and $p_2$; $\cos(A_1, A_2)$ denotes the cosine similarity between their second feature vectors; $\mathrm{Lev}(E_1, E_2)$ denotes the normalized Levenshtein distance between their first feature text strings; and $\mathrm{Lev}(C_1, C_2)$ denotes the normalized Levenshtein distance between their second feature text strings;

[0030] S45. Based on the four features above, experts with the same name are clustered using the nearest neighbor propagation (affinity propagation) algorithm to determine whether they are the same named entity; if they are, entity disambiguation is performed. The nearest neighbor propagation algorithm specifically includes: using the similarity results obtained in step S44 to initialize the similarity matrix $S$, and initializing the attraction (responsibility) matrix $R$ and the availability matrix $A$;

[0031] The attraction matrix $R$ is calculated as $r(i, j) = s(i, j) - \max_{k \neq j,\ 1 \le k \le N} \{ a(i, k) + s(i, k) \}$, where $s(i, j)$ denotes the similarity between the first data point $i$ and the second data point $j$; $a(i, k)$ denotes the degree to which the first data point $i$ chooses the third data point $k$ as its cluster center; $r(i, j)$ denotes how well suited the second data point $j$ is to serve as the cluster center of the first data point $i$; $s(i, k)$ denotes the similarity between the first data point $i$ and the third data point $k$; and $N$ denotes the total number of data points participating in the nearest neighbor propagation clustering algorithm;

[0032] The availability matrix $A$ is calculated as $a(i, j) = \min\{0,\ r(j, j) + \sum_{k \notin \{i, j\}} \max\{0,\ r(k, j)\}\}$, where $a(i, j)$ denotes the degree to which the first data point $i$ selects the second data point $j$ as its cluster center, and $r(k, j)$ denotes how well suited the second data point $j$ is to serve as the cluster center of the third data point $k$.
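The weighted combination in step S44 can be illustrated with a small sketch. The weight values are placeholders (the text does not state the values of the four-dimensional weight vector), and the sketch assumes the two string scores have already been converted so that larger means more similar, which the source does not spell out.

```python
def weighted_similarity(weights, feature_similarities):
    """Dot product of the weight vector with the four per-feature similarities."""
    if len(weights) != 4 or len(feature_similarities) != 4:
        raise ValueError("expected four weights and four feature similarities")
    return sum(w * s for w, s in zip(weights, feature_similarities))

# Illustrative weights only; the patent does not give the values of W.
W = [0.25, 0.25, 0.25, 0.25]
# Order: first-vector cosine, second-vector cosine, first-string score, second-string score.
sims = [0.8, 0.6, 1.0, 0.4]
score = weighted_similarity(W, sims)
```

Computing this score for every pair of same-name experts yields the entries of the similarity matrix that the clustering step starts from.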

[0033] Preferably, in step S5, inference rules are set based on the constructed knowledge graph. An inference rule has the form $B \rightarrow r(x, y)$, where $B$ denotes the rule body and $r(x, y)$ denotes the rule's head relation triple.

[0034] The beneficial effects of this invention are:

[0035] 1) This application realizes the effective processing of unique text features and relationships in enterprise scientific research projects, and adopts a knowledge extraction algorithm based on machine learning to improve the accuracy of named entity recognition and relationship extraction in enterprise scientific research projects.

[0036] 2) This application uses a large model for knowledge extraction. By leveraging the powerful general language understanding capabilities of the large model, relevant knowledge is extracted from the enterprise research project text, and knowledge distillation is used to improve the efficiency of knowledge extraction.

[0037] 3) The entity disambiguation algorithm based on text similarity and nearest neighbor propagation in this application can effectively disambiguate person entities from multi-source data, thereby improving data quality. At the same time, it constructs features of knowledge graphs for scientific research projects and realizes knowledge graph reasoning based on rule-based methods, which can discover more potential relationships between entities, enrich the knowledge of knowledge graphs, and support subsequent application analysis of knowledge graphs.

[0038] 4) The knowledge graph constructed in this application provides researchers with intelligent research project management and search capabilities, enabling them to analyze research projects more quickly and improving the efficiency of research management. Simultaneously, through in-depth development and analysis of research project information, it provides strong support for the management decisions of research institutions.

Attached Figure Description

[0039] Figure 1 is a schematic diagram of the steps of a method for constructing a knowledge graph for enterprise research projects according to an embodiment of the present invention;

[0040] Figure 2 is a schematic diagram of the entity relationships in a method for constructing a knowledge graph for enterprise research projects according to an embodiment of the present invention;

[0041] Figure 3 is a schematic diagram of the knowledge extraction process in a method for constructing a knowledge graph for enterprise research projects according to an embodiment of the present invention.

Detailed Implementation

[0042] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0043] This application discloses a method for constructing a knowledge graph for enterprise research projects. As shown in Figure 1, the specific steps include:

[0044] S1. Based on ontology editing tools (such as Protégé), using the resource description framework RDF and OWL language, knowledge modeling is performed on the knowledge graph of the research topic to determine entity types, entity attributes, and entity relationships. The entity types include topics, enterprises, experts, teams, achievements, and specialties. A schematic diagram of the entity relationships is shown in Figure 2;
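As a rough illustration of what the S1 schema might look like before it is refined in an ontology editor, the sketch below emits the six entity types as OWL classes in Turtle syntax. The `ex:` namespace and the three object properties are hypothetical; the actual relationships are those shown in Figure 2, which is not reproduced here.

```python
# Hypothetical namespace, chosen for illustration only.
PREFIXES = (
    "@prefix ex: <http://example.org/research#> .\n"
    "@prefix owl: <http://www.w3.org/2002/07/owl#> .\n"
    "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .\n"
)

# The six entity types named in step S1.
ENTITY_TYPES = ["Topic", "Enterprise", "Expert", "Team", "Achievement", "Specialty"]

# (property, domain, range): assumed relationships, not taken from the patent figures.
OBJECT_PROPERTIES = [
    ("participatesIn", "Expert", "Topic"),
    ("memberOf", "Expert", "Team"),
    ("employedBy", "Expert", "Enterprise"),
]

def build_ontology_turtle() -> str:
    """Serialize the classes and object properties as Turtle text."""
    lines = [PREFIXES]
    for name in ENTITY_TYPES:
        lines.append(f"ex:{name} a owl:Class .")
    for prop, domain, range_ in OBJECT_PROPERTIES:
        lines.append(
            f"ex:{prop} a owl:ObjectProperty ; "
            f"rdfs:domain ex:{domain} ; rdfs:range ex:{range_} ."
        )
    return "\n".join(lines)

turtle = build_ontology_turtle()
```

The resulting text can be saved as a `.ttl` file and opened in an ontology editor for further modeling of attributes and constraints.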

[0045] S2. Collect the enterprise's achievement data, and perform data cleaning and preprocessing to obtain the research project text. The achievement data includes process data of internal research projects and external patent documents.

[0046] S3. Knowledge extraction based on large language model knowledge self-distillation;

[0047] S4. Entity disambiguation based on text similarity and nearest neighbor propagation algorithm;

[0048] S5. Based on the constructed knowledge graph, rules are set, potential relationships are reasoned, and a knowledge graph for enterprise scientific research management is constructed.

[0049] For example, the preprocessing described in step S2 includes: recognizing scanned paper documents, extracting text information from unstructured Word documents, removing duplicate or erroneous data, and converting English in the resulting data into Chinese.
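A minimal sketch of the deduplication part of this preprocessing is shown below. It assumes the text has already been extracted (OCR, Word parsing, and translation are out of scope here) and treats empty records as the only kind of "erroneous" data, which is a simplification.

```python
import re
import unicodedata

def clean_record(text: str) -> str:
    """Normalize Unicode compatibility forms and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()

def deduplicate(records):
    """Drop empty records and exact duplicates (after cleaning), keeping first-seen order."""
    seen = set()
    result = []
    for raw in records:
        cleaned = clean_record(raw)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result

docs = deduplicate(["隧道  监测 技术", "隧道 监测 技术", "   ", "Bridge\tdesign"])
```

Normalizing before comparing means that records differing only in spacing or character width are treated as duplicates.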

[0050] For example, in step S3, based on instruction learning technology, a large language model is used to complete the knowledge extraction task through text generation under zero-shot conditions, specifically including:

[0051] The task description is concatenated with the corresponding task input and fed into the large language model. Having understood the task description, the model produces the entity recognition result by text generation. The task is modeled as

$$P(Y \mid I, X) = \prod_{i=1}^{|Y|} P(y_i \mid I, X, y_{<i})$$

where $I$ denotes the task description, $X$ denotes the task input, $Y$ denotes the first entity recognition result (the model's output), $P(Y \mid I, X)$ denotes the probability that the large language model assigns to $Y$ given the input, $|Y|$ denotes the length of $Y$, $y_i$ denotes the $i$-th output of $Y$, and $y_{<i}$ denotes all content generated before step $i$.
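The concatenation of task description and task input can be sketched as plain string assembly. The instruction wording and the separator below are hypothetical; the patent does not disclose the actual prompt text.

```python
ENTITY_TYPES = ["topic", "enterprise", "expert", "team", "achievement", "specialty"]

def build_model_input(task_description: str, task_input: str) -> str:
    """Concatenate the task description with the task input (separator is an assumption)."""
    return f"{task_description}\n\nInput text:\n{task_input}"

# Hypothetical task description; only the entity types come from the source text.
task_description = (
    "Extract all entities of the following types and the relations between them: "
    + ", ".join(ENTITY_TYPES)
    + ". Return (head entity, relation, tail entity) triples."
)
prompt = build_model_input(
    task_description, "Expert Zhang San leads the tunnel monitoring topic."
)
```

Because no labeled examples are included in the prompt, this corresponds to the zero-shot setting described above.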

[0052] For example, knowledge extraction is broken down into two sub-tasks. The first sub-task performs joint extraction of entities and relations, and the second sub-task is responsible for data verification and structuring according to a preset format. The knowledge extraction flow is shown in Figure 3 and specifically includes:

[0053] The first subtask identifies entities such as topics, enterprises, experts, teams, achievements, and professions, as well as their entity attribute values. At the same time, based on the content of the topic text, it extracts the relationships between different entities in the entity information and returns a triple (head entity, relationship, tail entity).

[0054] The second subtask integrates and verifies the entity relationships output by the first subtask, so that the final entity relationship extraction result meets the condition constraints. The condition constraints include that the type of each entity and relationship is within the specified type range and there are no duplicate entities and relationships.

[0055] Finally, the final entity relationships are populated into the preset knowledge extraction template, and the structured information that can be parsed by JSON is output.
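The constraint check of the second subtask (types within the allowed range, no duplicates, JSON-parseable output) might be sketched as follows. The relation names and the dictionary layout of a triple are assumptions for illustration; the real ranges come from the ontology defined in S1.

```python
import json

ALLOWED_ENTITY_TYPES = {"topic", "enterprise", "expert", "team", "achievement", "specialty"}
# Hypothetical relation names; the actual set is defined by the ontology.
ALLOWED_RELATIONS = {"participates_in", "member_of", "employed_by"}

def validate_triples(triples):
    """Keep triples whose entity and relation types are allowed, dropping duplicates."""
    seen = set()
    valid = []
    for t in triples:
        key = (t["head"], t["relation"], t["tail"])
        if (
            t["head_type"] in ALLOWED_ENTITY_TYPES
            and t["tail_type"] in ALLOWED_ENTITY_TYPES
            and t["relation"] in ALLOWED_RELATIONS
            and key not in seen
        ):
            seen.add(key)
            valid.append(t)
    return valid

raw = [
    {"head": "Zhang San", "head_type": "expert", "relation": "member_of",
     "tail": "Team A", "tail_type": "team"},
    {"head": "Zhang San", "head_type": "expert", "relation": "member_of",
     "tail": "Team A", "tail_type": "team"},   # duplicate
    {"head": "Zhang San", "head_type": "person", "relation": "likes",
     "tail": "tea", "tail_type": "drink"},     # types outside the allowed range
]
output = json.dumps({"triples": validate_triples(raw)}, ensure_ascii=False)
```

Here the check is done in ordinary code for clarity; in the described method the same constraints are enforced by the model's second pass.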

[0056] For example, for a single research text, the large language model must run inference twice to complete knowledge extraction, which incurs a high time cost in practical applications. The two-step computation is $E = \mathrm{LLM}(I_3, X; \theta)$ followed by $Y = \mathrm{LLM}(I_4, E; \theta)$, where $X$ denotes the task input, $Y$ denotes the final extraction result, $\theta$ denotes the pre-trained parameters of the large language model, $\mathrm{LLM}(\cdot)$ denotes a large language model call, $I_3$ denotes the third instruction description, $I_4$ denotes the fourth instruction description, and $E$ denotes the extracted entity and relation information.
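The cost of the two-pass pipeline is easiest to see when the model call is abstracted as a function. In this sketch `fake_llm` is a stub that only counts calls; it stands in for a real model and is not part of the described method.

```python
from typing import Callable

LLM = Callable[[str], str]  # stand-in type for a model call: prompt in, text out

def extract_two_step(llm: LLM, extract_instruction: str, verify_instruction: str, text: str) -> str:
    """Two sequential model calls: joint extraction, then verification and structuring."""
    extracted = llm(f"{extract_instruction}\n{text}")
    return llm(f"{verify_instruction}\n{extracted}")

# Stub model used only to make the sketch runnable; it records each prompt it receives.
calls = []

def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"output-{len(calls)}"

result = extract_two_step(fake_llm, "EXTRACT", "VERIFY", "some topic text")
```

Every document triggers two generations, and the second cannot start until the first finishes, which is the latency that self-distillation is meant to remove.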

[0057] For example, the large language model is trained by self-distillation so that the two inference passes required for knowledge extraction are reduced to one. One-step inference is expressed as $Y = \mathrm{LLM}(I_1, X; \theta')$, where $X$ denotes the task input, $Y$ denotes the extraction result, $I_1$ denotes the first instruction description, $\theta'$ denotes the model parameters after self-distillation training, and $\mathrm{LLM}(\cdot)$ denotes the large language model call function. In this embodiment, the ChatGLM4-9B model is used as the base model. The training specifically includes:

[0058] Annotation results in the first entity recognition result $Y$ that cannot be parsed as JSON are filtered out, giving the second entity recognition result $Y'$. The task input $X$ and the second entity recognition result $Y'$ are merged into the training dataset $\mathcal{D}$, on which the large language model undergoes supervised fine-tuning by minimizing the cross-entropy loss

$$\mathcal{L} = -\sum_{i=1}^{|Y'|} \sum_{v \in V} q(v \mid y'_i)\, \log P(v \mid I_2, X, y'_{<i})$$

where $V$ denotes the vocabulary, $|Y'|$ denotes the length of the text $Y'$, $I_2$ denotes the second instruction description, $y'_{<i}$ denotes the text of the first $i-1$ words of $Y'$, $v$ is any word in $V$, and $P(\cdot)$ is the probability modeling function (the probability distribution over the vocabulary that the model generates under the given conditions). The label smoothing function $q$ is

$$q(v \mid y'_i) = \begin{cases} 1 - \epsilon, & v = y'_i \\ \epsilon / (N - 1), & v \neq y'_i \end{cases}$$

where $y'_i$ denotes the $i$-th word of $Y'$, $N$ denotes the size of the vocabulary, and $\epsilon$ denotes a hyperparameter whose value is fixed in this embodiment;
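A small numeric sketch of a label-smoothed cross-entropy loss follows. It assumes one common smoothing form (the true word gets 1 - epsilon and the remaining mass is spread evenly over the other words); the exact form and the epsilon value used in the embodiment are not recoverable from the text, so both are assumptions here.

```python
import math

def smoothed_targets(true_index: int, vocab_size: int, epsilon: float):
    """Label-smoothed target distribution (assumed form: epsilon spread over the other words)."""
    off_value = epsilon / (vocab_size - 1)
    return [1.0 - epsilon if v == true_index else off_value for v in range(vocab_size)]

def smoothed_cross_entropy(step_probs, target_ids, epsilon: float) -> float:
    """Sum over positions of -sum_v q(v) * log p(v)."""
    loss = 0.0
    for probs, target in zip(step_probs, target_ids):
        q = smoothed_targets(target, len(probs), epsilon)
        loss -= sum(qv * math.log(pv) for qv, pv in zip(q, probs))
    return loss

# Toy predicted distributions over a 3-word vocabulary for a 2-word target.
step_probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
target_ids = [0, 1]
plain = smoothed_cross_entropy(step_probs, target_ids, epsilon=0.0)
smoothed = smoothed_cross_entropy(step_probs, target_ids, epsilon=0.1)  # 0.1 is illustrative
```

With `epsilon=0.0` the function reduces to the ordinary negative log-likelihood; a positive epsilon penalizes over-confident predictions, which is the usual motivation when training data is scarce.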

[0059] Considering the large parameter size of a large language model, fine-tuning all parameters is too costly. Therefore, the LoRA method is used to fine-tune the parameters of the large language model: $h = W_0 x + \Delta W x = W_0 x + BAx$, where $h$ denotes the output of a linear layer of the large language model, $x$ denotes the input to that linear layer, $W_0$ denotes the parameters of the large language model, $\Delta W$ denotes the parameter change after fine-tuning, and $B$ and $A$ denote the additional fine-tuning parameters.
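The low-rank update can be shown on a toy linear layer with plain lists. The matrices below are invented for illustration; in practice the frozen weight is the model's own, and a common convention is to initialise one of the two low-rank factors to zero so that training starts from the unmodified model.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lora_forward(W0, A, B, x):
    """h = W0 x + B (A x); W0 stays frozen, only A and B would be trained."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    return [b + u for b, u in zip(base, update)]

W0 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # frozen 3x3 weight (toy values)
A = [[1.0, 1.0, 1.0]]                                      # rank r = 1, shape 1x3
B_zero = [[0.0], [0.0], [0.0]]                             # shape 3x1, zero-initialised
B_trained = [[0.5], [0.0], [-0.5]]                         # pretend values after training

x = [1.0, 2.0, 3.0]
h_initial = lora_forward(W0, A, B_zero, x)
h_adapted = lora_forward(W0, A, B_trained, x)
```

For a layer of size d by d and rank r, the trainable factors hold 2dr values instead of d squared, which is where the cost saving over full fine-tuning comes from.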

[0060] For example, in step S4, a similarity analysis is performed based on the experts' information, including their employer, research topic, and research field; and the nearest neighbor propagation algorithm is used to cluster the features of experts with the same name, ultimately determining whether experts with the same name belong to the same named entity; specifically, the following steps are included:

[0061] S41. Word embedding is performed on the expert research text to obtain word vectors. The expert research text includes the title and abstract of the expert's research project and the results. Then, the expert's professional text is obtained based on the expert's collaborative relationships and employing companies.

[0062] S42. A text sentence $s$ is extracted from the expert research text, converted into token fragments, and input into the BERT model. The Transformer layers inside the BERT model process these token fragments and generate hidden vectors, from which the contextual embedding representation of the token fragments is constructed, i.e. $H = \mathrm{BERT}(s) \in \mathbb{R}^{M \times d}$, where $H$ denotes the word vectors of the text sentence $s$, $d$ denotes the dimension of the hidden vectors, and $M$ denotes the total number of tokens after a single input text sequence has been segmented.

[0063] S43. After the word vectors of the expert research text have been computed with the BERT model, the expert feature data $F = (T, A, E, C)$ are generated according to the rule of sequential correspondence, where $T$ denotes the first feature vector (the word vectors corresponding to the title and abstract of the research project), $A$ denotes the second feature vector (the word vectors corresponding to the title and abstract of the research results), $E$ denotes the first feature text string (employment information), and $C$ denotes the second feature text string (the expert's collaborative relationships).

[0064] S44. The similarity of the relevant feature data of the first same-name expert $p_1$ and the second same-name expert $p_2$ is calculated, i.e.

$$\mathrm{Sim}(p_1, p_2) = W \cdot \big[\cos(T_1, T_2),\ \cos(A_1, A_2),\ \mathrm{Lev}(E_1, E_2),\ \mathrm{Lev}(C_1, C_2)\big]^{\top}$$

where $\mathrm{Sim}(p_1, p_2)$ denotes the similarity of the relevant feature data of $p_1$ and $p_2$; $W$ denotes a four-dimensional vector used to balance the weights of the four expert-related feature similarities; $T_1$, $A_1$, $E_1$, and $C_1$ denote the first feature vector, the second feature vector, the first feature text string, and the second feature text string of $p_1$; $T_2$, $A_2$, $E_2$, and $C_2$ denote the same four features of $p_2$; $\cos(T_1, T_2)$ denotes the cosine similarity between the first feature vectors of $p_1$ and $p_2$; $\cos(A_1, A_2)$ denotes the cosine similarity between their second feature vectors; $\mathrm{Lev}(E_1, E_2)$ denotes the normalized Levenshtein distance between their first feature text strings; and $\mathrm{Lev}(C_1, C_2)$ denotes the normalized Levenshtein distance between their second feature text strings;

[0065] S45. Based on the four features above, experts with the same name are clustered using the nearest neighbor propagation (affinity propagation) algorithm to determine whether they are the same named entity; if they are, entity disambiguation is performed. The nearest neighbor propagation algorithm specifically includes: using the similarity results obtained in step S44 to initialize the similarity matrix $S$, and initializing the attraction (responsibility) matrix $R$ and the availability matrix $A$;

[0066] The attraction matrix $R$ is calculated as $r(i, j) = s(i, j) - \max_{k \neq j,\ 1 \le k \le N} \{ a(i, k) + s(i, k) \}$, where $s(i, j)$ denotes the similarity between the first data point $i$ (a data point to be clustered) and the second data point $j$ (a candidate cluster center); $a(i, k)$ denotes the degree to which the first data point $i$ chooses the third data point $k$ as its cluster center; $r(i, j)$ denotes how well suited the second data point $j$ is to serve as the cluster center of the first data point $i$; $s(i, k)$ denotes the similarity between the first data point $i$ and the third data point $k$; and $N$ denotes the total number of data points participating in the nearest neighbor propagation clustering algorithm (the total number of expert feature vectors to be disambiguated).

[0067] The availability matrix $A$ is calculated as $a(i, j) = \min\{0,\ r(j, j) + \sum_{k \notin \{i, j\}} \max\{0,\ r(k, j)\}\}$, where $a(i, j)$ denotes the degree to which the first data point $i$ selects the second data point $j$ as its cluster center, and $r(k, j)$ denotes how well suited the second data point $j$ is to serve as the cluster center of the third data point $k$.
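The two message updates can be put together in a compact, dependency-free sketch of the clustering loop. Several details are assumptions not stated in the text: the damping factor, the fixed iteration count, the separate update for self-availability (taken from the standard formulation of the algorithm), and the use of the diagonal of the similarity matrix as each point's preference for being a center. The toy data are one-dimensional points rather than expert features.

```python
def affinity_propagation(S, damping=0.5, iterations=100):
    """Return, for each point, the index of its chosen cluster center (needs >= 2 points)."""
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # attraction (responsibility) matrix
    A = [[0.0] * n for _ in range(n)]  # availability matrix
    for _ in range(iterations):
        # r(i,j) = s(i,j) - max_{k != j} (a(i,k) + s(i,k)), with damping (assumed)
        for i in range(n):
            for j in range(n):
                best_other = max(A[i][k] + S[i][k] for k in range(n) if k != j)
                R[i][j] = damping * R[i][j] + (1 - damping) * (S[i][j] - best_other)
        # a(i,j) = min(0, r(j,j) + sum_{k not in {i,j}} max(0, r(k,j)))
        for i in range(n):
            for j in range(n):
                support = sum(max(0.0, R[k][j]) for k in range(n) if k not in (i, j))
                if i == j:
                    new_value = support  # self-availability, standard formulation
                else:
                    new_value = min(0.0, R[j][j] + support)
                A[i][j] = damping * A[i][j] + (1 - damping) * new_value
    # Each point picks the candidate that maximizes attraction plus availability.
    return [max(range(n), key=lambda j: R[i][j] + A[i][j]) for i in range(n)]

# Toy data: two well-separated groups on a line; similarity = negative squared distance.
points = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
preference = -1.0  # assumed self-similarity on the diagonal
S = [[preference if i == j else -(p - q) ** 2 for j, q in enumerate(points)]
     for i, p in enumerate(points)]
labels = affinity_propagation(S)
```

Points that end up with the same center index form one cluster; in the disambiguation setting, same-name experts sharing a center would be treated as one entity. The number of clusters is not fixed in advance but is influenced by the diagonal preference value.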

[0068] For example, the cosine similarity between the first feature vectors $T_1$ and $T_2$ of the first and second same-name experts is calculated as

$$\cos(T_1, T_2) = \frac{\sum_{n=1}^{d} T_{1,n}\, T_{2,n}}{\sqrt{\sum_{n=1}^{d} T_{1,n}^2}\ \sqrt{\sum_{n=1}^{d} T_{2,n}^2}}$$

where $T_{1,n}$ denotes the $n$-th component of the first feature vector of the first same-name expert, $T_{2,n}$ denotes the $n$-th component of the first feature vector of the second same-name expert, and $d$ denotes the number of dimensions of $T_1$ and $T_2$. The cosine similarity between the second feature vectors of the two experts is calculated in the same way and is not repeated here.
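The cosine similarity computation is short enough to show in full. Returning 0.0 for a zero vector is a convention chosen for this sketch, not something the text specifies.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because the measure ignores vector length, two experts whose texts embed in the same direction score close to 1 regardless of how much text each has.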

[0069] For example, the Levenshtein distance between the first feature text strings $E_1$ and $E_2$ of the first and second same-name experts is calculated recursively as

$$\mathrm{lev}(E_1, E_2) = \begin{cases} |E_1|, & |E_2| = 0 \\ |E_2|, & |E_1| = 0 \\ \min\{\mathrm{lev}(E_1', E_2) + 1,\ \mathrm{lev}(E_1, E_2') + 1,\ \mathrm{lev}(E_1', E_2') + c(e_1, e_2)\}, & \text{otherwise} \end{cases}$$

where $e_1$ denotes a character of the first feature text string $E_1$ of the first same-name expert, $e_2$ denotes a character of the first feature text string $E_2$ of the second same-name expert, $E_1'$ denotes the string that remains after deleting $e_1$ from $E_1$, and $E_2'$ denotes the string that remains after deleting $e_2$ from $E_2$. The cost function of the replacement operation is $c(e_1, e_2) = 0$ if $e_1 = e_2$ and $c(e_1, e_2) = 1$ otherwise. The value $\mathrm{lev}(E_1, E_2)$ is then normalized to obtain the normalized Levenshtein distance between the first feature text strings of the two experts. The normalized Levenshtein distance between the second feature text strings is calculated in the same way and is not elaborated further here.
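An iterative version of the edit-distance computation is sketched below. The normalization by the longer string's length is an assumption, since the normalization formula itself is not legible in the text.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]

def normalized_levenshtein(a: str, b: str) -> float:
    """Assumed normalisation: distance divided by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

distance = levenshtein("kitten", "sitting")
ratio = normalized_levenshtein("kitten", "sitting")
```

The row-by-row table avoids the exponential cost of evaluating the recursion directly, while producing the same distance.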

[0070] For example, in step S5, inference rules are set based on the constructed knowledge graph. An inference rule has the form $B \rightarrow r(x, y)$, where $B$ denotes the rule body and $r(x, y)$ denotes the rule's head relation triple. Assume the knowledge graph contains three fact triples: ParticipatesIn(expert, project), MemberOf(expert, team), and ParticipatesIn(team, project) (with instances replaced by variables). This embodiment represents the rule used for reasoning as a first-order Horn clause, for example: ParticipatesIn(expert, project) ∧ MemberOf(expert, team) → ParticipatesIn(team, project). Rules of this kind are summarized by domain experts and, during reasoning, automatically derive the implicit fact triples.
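Applying the example Horn clause to a set of fact triples can be sketched as a simple join over the facts. The instance names are invented for illustration, and a real rule engine would handle arbitrary rules rather than this single hard-coded one.

```python
def apply_team_participation_rule(triples):
    """ParticipatesIn(expert, project) AND MemberOf(expert, team) -> ParticipatesIn(team, project)."""
    facts = set(triples)
    inferred = set()
    for head, relation, project in facts:
        if relation != "ParticipatesIn":
            continue
        for head2, relation2, team in facts:
            if relation2 == "MemberOf" and head2 == head:
                candidate = (team, "ParticipatesIn", project)
                if candidate not in facts:
                    inferred.add(candidate)  # implicit fact not yet in the graph
    return inferred

# Hypothetical instances used only to exercise the rule.
facts = [
    ("Zhang San", "ParticipatesIn", "Tunnel Monitoring Project"),
    ("Zhang San", "MemberOf", "Geotechnical Team"),
]
new_facts = apply_team_participation_rule(facts)
```

The inferred triples can then be added to the graph, which is how rule-based reasoning enriches it with relationships that were never stated explicitly in the source documents.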

[0071] In summary, this application constructs a six-dimensional ontology model for scientific research management that models projects, experts, teams, enterprises, achievements, and professions across the entire process of enterprise scientific research management, from project initiation through mid-term evaluation and acceptance to transformation, and supports incremental addition of new entity types and relationships. Knowledge extraction is decomposed into a dual-task architecture of joint entity-relationship extraction and structured verification. Using the instruction learning capability of a large language model (LLM), high-quality knowledge is generated even without labeled data. The joint extraction task extracts the six entity types (project, enterprise, expert, team, achievement, profession) and their attributes end to end and generates (head entity, relationship, tail entity) triples, and type validation filtering ensures that the output conforms to the ontology model definition. To address duplicate expert names in multi-source data, an entity disambiguation algorithm based on text similarity and nearest neighbor propagation (affinity propagation) is proposed: similarity analysis is performed on information such as the expert's employing company, project name, and research field, and the nearest neighbor propagation algorithm then clusters the features of experts with the same name to determine whether they are the same named entity.
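The clustering step mentioned above can be illustrated with a compact NumPy sketch of nearest neighbor (affinity) propagation, which iterates an attraction (responsibility) matrix R and an availability matrix A over a similarity matrix S. This is a sketch under stated assumptions, not the application's implementation: the damping factor, the fixed iteration count, the use of the median similarity as the diagonal preference, and the toy one-dimensional data are all choices made for this example.

```python
import numpy as np


def affinity_propagation(S, damping=0.7, max_iter=300):
    """Return, for each point, the index of the exemplar (cluster center) it chooses.

    S is an N x N similarity matrix whose diagonal holds the "preference"
    of each point to become a cluster center.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # attraction / responsibility matrix
    A = np.zeros((n, n))  # availability matrix
    rows = np.arange(n)
    for _ in range(max_iter):
        # r(i, j) = s(i, j) - max over k != j of (a(i, k) + s(i, k))
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new

        # a(i, j) = min(0, r(j, j) + sum over k not in {i, j} of max(0, r(k, j)))
        # a(j, j) = sum over k != j of max(0, r(k, j))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())
        A_new = Rp.sum(axis=0)[None, :] - Rp
        self_availability = A_new.diagonal().copy()
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, self_availability)
        A = damping * A + (1 - damping) * A_new
    return (A + R).argmax(axis=1)


# Toy data: six records forming two well-separated groups on a line.
points = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
S = -(points[:, None] - points[None, :]) ** 2      # negative squared distance
off_diagonal = S[~np.eye(len(points), dtype=bool)]
np.fill_diagonal(S, np.median(off_diagonal))       # assumed preference value
labels = affinity_propagation(S)
```

In the disambiguation setting, S would instead hold the pairwise similarities of same-name expert records, and records that choose the same exemplar would be treated as the same named entity.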

[0072] The above description is merely a preferred embodiment of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein and should not be construed as excluding other embodiments. It can be used in various other combinations, modifications, and environments, and can be altered within the scope of the concept described herein through the above teachings or related technologies or knowledge. Modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention should be within the protection scope of the appended claims.

Claims

1. A method for constructing a knowledge graph for enterprise scientific research projects, characterized in that it includes the following steps:

S1. Based on an ontology editing tool, and using the Resource Description Framework (RDF) and the OWL language, knowledge modeling is performed on the scientific research project knowledge graph to determine entity types, entity attributes, and entity relationships; the entity types include projects, enterprises, experts, teams, achievements, and professions.

S2. The enterprise's achievement data are collected, cleaned, and preprocessed to obtain scientific research project text; the achievement data include process data of internal research projects and external patent documents.

S3. Knowledge extraction is performed based on large language model knowledge self-distillation.

S4. Entity disambiguation is performed based on text similarity and the nearest neighbor propagation algorithm.

S5. Based on the constructed knowledge graph, rules are set and potential relationships are inferred, and the knowledge graph for enterprise scientific research management is constructed.

In step S4, similarity analysis is performed on expert information including the employing company, project name, and research field, and the nearest neighbor propagation algorithm is then used to cluster the features of experts with the same name, ultimately determining whether experts with the same name belong to the same named entity. Step S4 includes the following steps:

S41. Word embedding is performed on the expert research text to obtain word vectors; the expert research text includes the titles and abstracts of the expert's research projects and achievements. The expert's professional text is then obtained based on the expert's collaborative relationships and employing company.

S42. A text sentence $X$ is extracted from the expert research text, converted into token fragments, and input into the BERT model. The token fragments are processed by the Transformer layers inside the BERT model to generate hidden vectors, from which the contextual embedding representation of the token fragments is constructed, i.e. $H = \mathrm{BERT}(X)$ with $H \in \mathbb{R}^{M \times d}$, where $H$ represents the word vectors of the text sentence $X$, $d$ represents the dimension of the hidden vectors, and $M$ represents the total number of tokens after a single input text sequence has been segmented.

S43. After the word vectors of the expert research text have been calculated with the BERT model, the expert feature data $F = (T, A, E, C)$ are generated according to the rule of sequential correspondence, where $T$ represents the first feature vector, $A$ represents the second feature vector, $E$ represents the first feature text string, and $C$ represents the second feature text string.

S44. Similarity calculation is performed on the feature data of the first same-name expert $u$ and the second same-name expert $v$, i.e.

$$\mathrm{Sim}(u, v) = W \cdot \big[\cos(T_u, T_v),\ \cos(A_u, A_v),\ \widetilde{\mathrm{lev}}(E_u, E_v),\ \widetilde{\mathrm{lev}}(C_u, C_v)\big]^{\top},$$

where $\mathrm{Sim}(u, v)$ represents the similarity of the feature data of the first same-name expert $u$ and the second same-name expert $v$; $W$ represents a four-dimensional vector; $T_u$, $A_u$, $E_u$, and $C_u$ represent the first feature vector, the second feature vector, the first feature text string, and the second feature text string of the first same-name expert $u$; $T_v$, $A_v$, $E_v$, and $C_v$ represent the first feature vector, the second feature vector, the first feature text string, and the second feature text string of the second same-name expert $v$; $\cos(T_u, T_v)$ represents the cosine similarity between the first feature vectors of the two experts; $\cos(A_u, A_v)$ represents the cosine similarity between the second feature vectors; $\widetilde{\mathrm{lev}}(E_u, E_v)$ represents the normalized Levenshtein distance between the first feature text strings; and $\widetilde{\mathrm{lev}}(C_u, C_v)$ represents the normalized Levenshtein distance between the second feature text strings.

S45. Based on the four features, experts with the same name are clustered using the nearest neighbor propagation algorithm to determine whether they belong to the same named entity; if they do, entity disambiguation is performed. The nearest neighbor propagation algorithm specifically includes: initializing the similarity matrix $S$ from the similarity calculation results obtained in step S44, and initializing the attraction matrix $R$ and the availability matrix $\mathcal{A}$. The attraction matrix $R$ is calculated by the formula

$$r(i, j) = s(i, j) - \max_{k \neq j,\ k \in \{1, \dots, N\}} \big\{ a(i, k) + s(i, k) \big\},$$

where $s(i, j)$ represents the similarity between the first data point $i$ and the second data point $j$; $a(i, k)$ represents the degree to which the first data point $i$ chooses the third data point $k$ as its cluster center; $r(i, j)$ represents the attractiveness of the second data point $j$ as the cluster center of the first data point $i$; $s(i, k)$ represents the similarity between the first data point $i$ and the third data point $k$; and $N$ represents the total number of data points participating in the nearest neighbor propagation clustering algorithm. The availability matrix $\mathcal{A}$ is calculated by the formula

$$a(i, j) = \min\Big\{ 0,\ r(j, j) + \sum_{k \notin \{i, j\}} \max\{0,\ r(k, j)\} \Big\},$$

where $a(i, j)$ represents the degree to which the first data point $i$ chooses the second data point $j$ as its cluster center, and $r(k, j)$ represents the attractiveness of the second data point $j$ as the cluster center of the third data point $k$.

2. The method for constructing a knowledge graph for enterprise research projects according to claim 1, characterized in that the preprocessing described in step S2 includes: performing text recognition on scanned paper documents, extracting text information from unstructured Word documents, removing duplicate or erroneous data, and converting English in the resulting data into Chinese.

3. The method for constructing a knowledge graph for enterprise research projects according to claim 2, characterized in that step S3, based on instruction learning technology, uses a large language model to complete the knowledge extraction task through text generation under zero-shot conditions, specifically including: the instruction description is concatenated with the corresponding task input to obtain the task description, which is input into the large language model; the large language model, having understood the task description, generates the entity recognition result through text generation. The task is modeled by the formula

$$P(Y \mid I, X) = \prod_{i=1}^{L} P\big(y_i \mid I, X, y_{<i}\big),$$

where $I$ represents the task description, $X$ represents the task input, $Y$ represents the first entity recognition result, $P(Y \mid I, X)$ represents the probability with which the large language model, given the input $I$ and $X$, produces the first entity recognition result, $L$ represents the length of the first entity recognition result, $y_i$ represents the $i$-th output of the first entity recognition result, and $y_{<i}$ represents all content generated in the steps preceding step $i$.

4. The method for constructing a knowledge graph for an enterprise research project according to claim 3, characterized in that knowledge extraction is decomposed into two subtasks, the first subtask performing joint extraction of entities and relations and the second subtask performing data verification and structuring according to a preset format, specifically including: the first subtask identifies entities of the types project, enterprise, expert, team, achievement, and profession together with their attribute values, and at the same time, based on the content of the project text, extracts the relationships between the entities and returns triples; the second subtask integrates and verifies the entity relationships output by the first subtask so that the final entity relationship extraction result satisfies the constraints, namely that the type of every entity and relationship is within the specified type range and that there are no duplicate entities or relationships; finally, the final entity relationships are filled into the preset knowledge extraction template, and structured information that can be parsed as JSON is output.
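By way of illustration only, the verification performed by the second subtask (type range check, duplicate removal, JSON-parsable output) could look like the following Python sketch. The JSON layout with `entities` and `triples` keys, the field names, the relation names, and the function name `validate_extraction` are all hypothetical and are not defined by the claims.

```python
import json

# The six entity types named in the claims; the relation types are assumed.
ENTITY_TYPES = {"project", "enterprise", "expert", "team", "achievement", "profession"}
RELATION_TYPES = {"ParticipatesIn", "MemberOf"}


def _as_list(value):
    """Treat anything that is not a JSON array as an empty list."""
    return value if isinstance(value, list) else []


def validate_extraction(raw_output):
    """Return the cleaned extraction result, or None if raw_output is not usable JSON."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None

    entities, seen_entities = [], set()
    for ent in _as_list(data.get("entities")):
        if not isinstance(ent, dict) or not isinstance(ent.get("name"), str):
            continue
        key = (ent.get("type"), ent["name"])
        # Keep only entities of an allowed type, and drop duplicates.
        if ent.get("type") in ENTITY_TYPES and key not in seen_entities:
            seen_entities.add(key)
            entities.append(ent)

    triples, seen_triples = [], set()
    for tri in _as_list(data.get("triples")):
        if not isinstance(tri, dict):
            continue
        head, relation, tail = tri.get("head"), tri.get("relation"), tri.get("tail")
        if not (isinstance(head, str) and isinstance(tail, str)):
            continue
        # Keep only triples with an allowed relation type, and drop duplicates.
        if relation in RELATION_TYPES and (head, relation, tail) not in seen_triples:
            seen_triples.add((head, relation, tail))
            triples.append(tri)

    return {"entities": entities, "triples": triples}


# Made-up model output containing a duplicate entity, an unknown entity type,
# a duplicate triple, and an unknown relation type.
sample_output = json.dumps({
    "entities": [
        {"type": "expert", "name": "Zhang"},
        {"type": "expert", "name": "Zhang"},
        {"type": "planet", "name": "Mars"},
        {"type": "project", "name": "P1"},
    ],
    "triples": [
        {"head": "Zhang", "relation": "ParticipatesIn", "tail": "P1"},
        {"head": "Zhang", "relation": "ParticipatesIn", "tail": "P1"},
        {"head": "Zhang", "relation": "Likes", "tail": "P1"},
    ],
})
cleaned = validate_extraction(sample_output)
```

Output that fails to parse is rejected outright (`None`), which mirrors the idea of discarding results that cannot be parsed as JSON.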

5. The method for constructing a knowledge graph for an enterprise research project according to claim 4, characterized in that a self-distillation method is used to train the large language model, reducing the two reasoning processes required for knowledge extraction to a single step. The one-step reasoning of the large language model is expressed by the formula

$$Y' = \mathrm{LLM}(I_1, X;\ \theta'),$$

where $I_1$ represents the first instruction description, $\theta'$ represents the parameters of the model after self-distillation training, and $\mathrm{LLM}(\cdot)$ represents the calling function of the large language model. This specifically includes: annotation results in the first entity recognition result $Y$ that cannot be successfully parsed as JSON are filtered out to obtain the second entity recognition result $Y'$; the task input $X$ and the second entity recognition result $Y'$ are merged into the training dataset $D$; and supervised fine-tuning of the large language model is performed on the training dataset $D$ by minimizing the cross-entropy loss function

$$\mathcal{L} = -\sum_{i=1}^{L'} \sum_{w \in V} q(w \mid y_i)\, \log P\big(w \mid I_2, X, y_{<i}\big),$$

where $V$ represents the vocabulary, $L'$ represents the text length of $Y'$, $I_2$ represents the second instruction description, $y_{<i}$ represents the text of the first $i-1$ words, $w$ represents any specific word in the vocabulary $V$, and $P(\cdot)$ represents the probability modeling function. The label smoothing function $q$ is calculated from $y_i$, the $i$-th word of $Y'$, the vocabulary size $K$, and a hyperparameter $\epsilon$. The LoRA method is used to fine-tune the parameters of the large language model:

$$h = W_0 x + \Delta W x = W_0 x + B A x,$$

where $h$ represents the output of a linear layer of the large language model, $x$ represents the input of that linear layer, $W_0$ represents the parameters of the large language model, $\Delta W$ represents the amount of parameter change after fine-tuning, and $B$ and $A$ represent the additional fine-tuning parameters.
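As a numerical illustration of the LoRA step recited above, in which the output of a linear layer is the frozen parameters applied to the input plus a low-rank parameter change, the following NumPy sketch may help. The layer sizes, the rank, the zero initialization of one low-rank factor, and the omission of any scaling factor are assumptions for this example and are not stated in the claims.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 4, 6, 2              # assumed layer sizes and LoRA rank

W0 = rng.normal(size=(d_out, d_in))      # frozen parameters of the linear layer
A = rng.normal(size=(rank, d_in))        # additional fine-tuning parameters
B = np.zeros((d_out, rank))              # assumed zero init: no change at the start


def lora_forward(x, W0, A, B):
    """Linear-layer output with a low-rank update: h = W0 x + B (A x)."""
    return W0 @ x + B @ (A @ x)


x = rng.normal(size=d_in)
h_init = lora_forward(x, W0, A, B)       # equals W0 @ x while B is all zeros

# Stand-in for a fine-tuned B; only A and B would be updated during training.
B_trained = rng.normal(size=(d_out, rank))
h_tuned = lora_forward(x, W0, A, B_trained)
```

Only `A` and `B` hold trainable values here, `rank * (d_in + d_out)` numbers in total, compared with `d_in * d_out` for the full weight matrix.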

6. The method for constructing a knowledge graph for an enterprise research project according to claim 5, characterized in that, in step S5, inference rules are set based on the constructed knowledge graph, the inference rules taking the form $B \rightarrow r(x, y)$, where $B$ represents the rule body and $r(x, y)$ represents the rule's head relation triple.