Intelligent literature question-answering method and system based on a pediatric oncology research large model
By using an intelligent question-answering method based on a large-scale pediatric oncology research model, the problem of traditional literature retrieval tools struggling to understand complex semantic issues in pediatric oncology research has been solved. This has enabled efficient and accurate literature retrieval and information integration, thereby improving research efficiency and accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHILDRENS HOSPITAL OF FUDAN UNIV
- Filing Date
- 2026-03-06
- Publication Date
- 2026-06-19
AI Technical Summary
Traditional literature retrieval tools struggle to accurately understand complex semantic issues in pediatric oncology research, fail to precisely locate relevant literature, and lack in-depth analysis and correlation analysis, leading to inefficiency and information omissions for researchers.
The intelligent question-answering method for literature based on a large-scale research model of pediatric tumors extracts core terms and relationships through semantic deconstruction, generates semantic knowledge paths, and combines them with a dynamic knowledge association module for targeted retrieval and knowledge association analysis to construct a dynamic knowledge association network and generate intelligent question-answering results.
It improves the targeting and accuracy of literature retrieval, greatly enhances the efficiency and accuracy of researchers in obtaining answers, and provides intuitive, comprehensive, and accurate intelligent question-and-answer services.
Figure CN122240768A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of large model technology, and more specifically to an intelligent literature question-answering method and system based on a pediatric oncology research large model.
Background Technology
[0002] In the field of pediatric oncology research, researchers need to extract valuable information from a vast amount of scientific literature to answer various questions they encounter during their research. As pediatric oncology research continues to deepen, the number of related research papers is exploding, and the content covers multiple professional areas and complex knowledge systems.
[0003] Currently, researchers primarily rely on traditional literature search tools and manual review to obtain answers from literature. Traditional literature search tools typically rely on keyword matching, a method with several limitations. Firstly, keyword matching struggles to accurately understand the semantics of the researcher's question. For questions with complex semantics and specialized backgrounds, the search results are often inaccurate, failing to pinpoint relevant literature. For example, if a researcher raises a complex question about the correlation between specific treatment mechanisms and prognosis in pediatric cancer, traditional search tools might return a large number of irrelevant documents based on only a few keywords. Secondly, traditional search tools lack the ability to deeply mine and analyze the knowledge and connections within the literature, failing to effectively integrate the retrieved content with the question. Researchers must spend considerable time and effort manually sifting and organizing relevant information from numerous documents, resulting in low efficiency. Furthermore, manual review is not only time-consuming and laborious but also prone to omissions or misinterpretations due to individual knowledge levels and misunderstandings, impacting the progress and accuracy of research.
Summary of the Invention
[0004] In view of the aforementioned problems, in a first aspect, embodiments of the present invention provide an intelligent literature question-answering method based on a pediatric oncology research large model, the method comprising:
Receiving a pediatric oncology research question input in natural language by a researcher, the research question including a professional research direction in the field of pediatric oncology, the scientific question to be explored, and a description of the relevant research background;
Invoking the semantic knowledge path construction module of the pediatric oncology research large model to perform semantic deconstruction on the research question, extracting the core terms, the relationships between terms, and the research objectives in the research question, and generating the semantic knowledge path of the research question;
Based on the semantic knowledge path of the research question, triggering the literature knowledge matching module of the pediatric oncology research large model to perform a targeted literature search in the preset pediatric oncology research literature database, and filtering out literature whose content matches the core terms and association relationships in the semantic knowledge path to obtain a set of matched literature, the set of matched literature including the full text of the literature, the professional term annotations in the literature, and the term association expression fragments;
Invoking the dynamic knowledge association module of the pediatric oncology research large model to perform knowledge association analysis on the full text of the literature, the professional term annotations, and the term association expression fragments in the matched literature set, and constructing a dynamic knowledge association network with the semantic knowledge path of the research question as its core, the dynamic knowledge association network including the association paths between literature terms and question terms, the descriptions of association strength, and the logical deduction relationships between knowledge nodes;
Integrating information from the knowledge nodes and association paths in the dynamic knowledge association network to extract knowledge content that can answer the research question, and generating the pediatric oncology research literature question-answering result in combination with the literature source information corresponding to the association paths;
Linking and integrating the question-answering result with the core structural diagram of the dynamic knowledge association network to form the final intelligent question-answering output content for pediatric oncology research literature, and sending the final output content to the researcher's interactive terminal.
[0005] Furthermore, embodiments of the present invention also provide an intelligent question-answering system for literature based on a large-scale research model of pediatric tumors, characterized in that it includes: A processor; a machine-readable storage medium for storing machine-executable instructions of the processor; wherein the processor is configured to execute the aforementioned intelligent question-answering method for literature based on a large-scale pediatric tumor research model by executing the machine-executable instructions.
[0006] In another aspect, embodiments of the present invention also provide a computer program product, the computer program product including machine-executable instructions, the machine-executable instructions being stored in a computer-readable storage medium, the processor of the intelligent question-answering system for literature based on a large model of pediatric oncology research reading the machine-executable instructions from the computer-readable storage medium, the processor executing the machine-executable instructions, causing the intelligent question-answering system for literature based on a large model of pediatric oncology research to execute the aforementioned intelligent question-answering method for literature based on a large model of pediatric oncology research.
[0007] Based on the above, this system receives natural-language research questions on pediatric oncology from researchers, covering professional research directions, scientific questions to be explored, and related research background descriptions. Then, a semantic knowledge path construction module is invoked to semantically deconstruct the question, accurately extracting core terms, inter-term relationships, and research objectives, and generating a semantic knowledge path for the research question. Based on this semantic knowledge path, a literature knowledge matching module is triggered to conduct a targeted search in a preset literature database, selecting a set of matching documents, thus improving the targeting and accuracy of literature retrieval. A dynamic knowledge association module is then invoked to perform knowledge association analysis on the matching document set, constructing a dynamic knowledge association network centered on the question's semantic knowledge path. This network presents the association paths, association strengths, and logical derivation relationships between the knowledge nodes in the literature and the question terms, deeply exploring the intrinsic connections between literature knowledge. Information from the knowledge nodes and association paths in the dynamic knowledge association network is integrated to extract the knowledge content that answers the question, and the question-and-answer result is generated by combining it with the literature source information. Finally, the question-and-answer result is integrated with the network's core structure diagram to form the final output content, which is sent to the researcher's interactive terminal. This provides researchers with an intuitive, comprehensive, and accurate intelligent question-and-answer service, greatly improving the efficiency and accuracy with which researchers obtain answers from the literature.
Attached Figure Description
[0008] Figure 1 is a schematic diagram of the execution flow of the intelligent literature question-answering method based on a pediatric oncology research large model provided in this embodiment of the invention.
[0009] Figure 2 is a schematic diagram of exemplary hardware and software components of the intelligent literature question-answering system based on a pediatric oncology research large model provided in this embodiment of the invention.
Detailed Implementation
[0010] The present invention will now be described in detail with reference to the accompanying drawings. Figure 1 is a flowchart of an intelligent literature question-answering method based on a pediatric oncology research large model, provided in one embodiment of the present invention. The method is described in detail below.
[0011] Step S110: Receive a pediatric oncology research question in natural language input from a researcher. The pediatric oncology research question includes the professional research direction in the field of pediatric oncology, the scientific question to be explored, and a description of the relevant research background.
[0012] In this embodiment, researchers input a research question about pediatric neuroblastoma via an interactive terminal. The specific question is: "Investigate the mechanism of MYCN gene amplification in the development and progression of pediatric neuroblastoma, and its correlation with clinical prognostic indicators. Current research shows that this gene amplification may affect tumor cell proliferation by regulating cell cycle-related proteins, but the specific molecular pathway is still unclear, and different studies have yielded different conclusions regarding its association with survival." This question includes the research focus (molecular mechanisms and clinical prognosis of pediatric neuroblastoma), the scientific question to be investigated (the mechanism of MYCN gene amplification and its correlation with clinical prognostic indicators), and a description of the relevant research background (the gene's role as shown in existing studies and existing controversies).
[0013] Step S120: Call the semantic knowledge path construction module of the pediatric oncology research big model to perform semantic deconstruction processing on the pediatric oncology research problem, extract the core terms, the relationships between terms and the research objectives in the pediatric oncology research problem, and generate the semantic knowledge path of the research problem.
[0014] Step S121: Input the pediatric oncology research question into the terminology extraction submodule of the semantic knowledge path construction module of the pediatric oncology research big model, perform sentence-by-sentence semantic analysis on the question text, identify the pediatric oncology disease name, pathological mechanism terminology, experimental technique terminology, research indicator terminology and conclusion expression terminology involved in the question text, and form an initial terminology list.
[0015] A sentence-by-sentence analysis of the input research question text reveals the following: In the first sentence, "Investigating the role of MYCN gene amplification in the development and progression of pediatric neuroblastoma," the disease name is identified as "pediatric neuroblastoma," and the pathological mechanism terms are "MYCN gene amplification," "development and progression," and "mechanism of action." In the second sentence, "and its correlation with clinical prognostic indicators," the research indicator terms are "clinical prognostic indicators" and "correlation." In the third sentence, "Current studies have shown that this gene amplification may affect tumor cell proliferation by regulating cell cycle-related proteins," the pathological mechanism terms are "cell cycle-related proteins" and "tumor cell proliferation," and the conclusion statement term is "may affect...". In the fourth sentence, "However, the specific molecular pathway is not yet clear, and differences exist in the conclusions regarding its association with survival rate in different studies," the pathological mechanism term is "molecular pathways," the research indicator term is "survival rate," and the conclusion statement terms are "not yet clear" and "differences exist." The identified terms were compiled to form an initial term list: "pediatric neuroblastoma", "MYCN gene amplification", "development and progression", "mechanism of action", "clinical prognostic indicators", "correlation", "cell cycle-related proteins", "tumor cell proliferation", "molecular pathways", "survival rate", "may affect...", "not yet clear", and "differences exist".
[0016] Step S122: Determine the domain affiliation of each term in the initial term list, distinguishing between basic medical terms, clinical research terms, experimental technology terms, and statistical analysis terms, and eliminating terms that are not directly related to the research topic of pediatric oncology to obtain the core term set.
[0017] Each term in the initial terminology list was categorized according to its domain: "Pediatric neuroblastoma" falls under clinical research; "MYCN gene amplification," "development and progression," "mechanism of action," "cell cycle-related proteins," "tumor cell proliferation," and "molecular pathways" are basic medical terms; "clinical prognostic indicators" and "survival rate" are clinical research terms; "correlation" is a statistical analysis term; and "may affect…" "not yet clear," and "differences exist" are conclusion-statement terms. However, "not yet clear" and "differences exist" primarily describe the current state of research and have low relevance to the core terms of pediatric oncology research, so these two terms were removed. After screening, the core term set was determined to be: "Pediatric neuroblastoma," "MYCN gene amplification," "development and progression," "mechanism of action," "clinical prognostic indicators," "correlation," "cell cycle-related proteins," "tumor cell proliferation," "molecular pathways," and "survival rate."
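Steps S121 and S122 can be pictured as tagging each extracted term with a domain and then filtering. The following is a minimal sketch, not the patented implementation: the dictionary `INITIAL_TERMS` and the function `core_term_set` are hypothetical names, and dropping the entire "conclusion statement" category is an assumption chosen because it reproduces the ten-term core set listed above.

```python
# Hypothetical domain tags for the initial term list of the example question.
INITIAL_TERMS = {
    "pediatric neuroblastoma": "clinical research",
    "MYCN gene amplification": "basic medicine",
    "development and progression": "basic medicine",
    "mechanism of action": "basic medicine",
    "cell cycle-related proteins": "basic medicine",
    "tumor cell proliferation": "basic medicine",
    "molecular pathways": "basic medicine",
    "clinical prognostic indicators": "clinical research",
    "survival rate": "clinical research",
    "correlation": "statistical analysis",
    "may affect...": "conclusion statement",
    "not yet clear": "conclusion statement",
    "differences exist": "conclusion statement",
}


def core_term_set(tagged_terms, excluded_domains=("conclusion statement",)):
    """Keep terms whose domain is not excluded (assumed filtering rule)."""
    return [term for term, domain in tagged_terms.items()
            if domain not in excluded_domains]


CORE_TERMS = core_term_set(INITIAL_TERMS)
```

Under these assumptions `CORE_TERMS` holds the ten terms named in the core term set; a real system would derive the domain tags from the model rather than from a hand-written table.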
[0018] Step S123: Call the relation definition submodule of the semantic knowledge path construction module of the pediatric oncology research large model, analyze the grammatical position, semantic collocation and logical connectors of each term in the core term set in the question text, determine the association type between terms, including causal association, subordinate association, comparative association and synergistic association, and form a list of term association relationships.
[0019] Analyzing the relationships between terms in the core term set yields the following: a causal association exists between "MYCN gene amplification" and "the development and progression of pediatric neuroblastoma," indicated by the logical connector "in..."; a subordinate association exists between "MYCN gene amplification" and "mechanism of action," because the "mechanism of action" in question is specifically that of "MYCN gene amplification"; a correlation exists between "MYCN gene amplification" and "clinical prognostic indicators," which is treated as a special case of comparative association because it compares how the two vary with each other; a causal association exists between "MYCN gene amplification" and "cell cycle-related proteins," indicated by "regulation"; a causal association exists between "cell cycle-related proteins" and "tumor cell proliferation," indicated by "influence"; a subordinate association exists between "MYCN gene amplification" and "molecular pathways," because "molecular pathways" are the concrete manifestation of the mechanism of action of "MYCN gene amplification"; and a correlation exists between "MYCN gene amplification" and "survival rate," which is likewise a comparative association. These relationships are then compiled into a term association relationship list, where each record contains two related terms and their corresponding association type.
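One way to hold the term association relationship list is as typed records, one per pair of terms. The sketch below is illustrative only: `AssociationType`, `TermAssociation`, and the `cue` field are hypothetical names, and the cue strings are taken from the connectors named in the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class AssociationType(Enum):
    CAUSAL = "causal"
    SUBORDINATE = "subordinate"
    COMPARATIVE = "comparative"
    SYNERGISTIC = "synergistic"


@dataclass(frozen=True)
class TermAssociation:
    first: str             # first related term
    second: str            # second related term
    kind: AssociationType  # association type between the two terms
    cue: str               # wording in the question that signals the link


M = "MYCN gene amplification"
ASSOCIATIONS = [
    TermAssociation(M, "development and progression", AssociationType.CAUSAL, "in..."),
    TermAssociation(M, "mechanism of action", AssociationType.SUBORDINATE, "targeting..."),
    TermAssociation(M, "clinical prognostic indicators", AssociationType.COMPARATIVE, "correlation"),
    TermAssociation(M, "cell cycle-related proteins", AssociationType.CAUSAL, "regulation"),
    TermAssociation("cell cycle-related proteins", "tumor cell proliferation",
                    AssociationType.CAUSAL, "influence"),
    TermAssociation(M, "molecular pathways", AssociationType.SUBORDINATE,
                    "manifestation of mechanism of action"),
    TermAssociation(M, "survival rate", AssociationType.COMPARATIVE, "correlation"),
]

# Tally of association types; synergistic does not occur in this example.
TYPE_COUNTS = Counter(a.kind for a in ASSOCIATIONS)
```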
[0020] Step S124: Extract the research objective statements from the pediatric oncology research question, analyze the action directions and expected results in the research objective statements, determine the research objective direction, and convert the research objective direction into a target node in the graph. The target node contains a target type description and target-related terms.
[0021] The research objective statement extracted from the question is "to investigate the mechanism of action of MYCN gene amplification in the development and progression of pediatric neuroblastoma, and its correlation with clinical prognostic indicators." The action direction is "investigation," and the expected outcome is to clarify the "mechanism of action" of "MYCN gene amplification" and its "correlation" with "clinical prognostic indicators." Therefore, the research objective is determined to be clarifying the mechanism of action of MYCN gene amplification in the development and progression of pediatric neuroblastoma and its correlation with clinical prognostic indicators. This research objective is converted into a target node in the graph, with the target type described as "mechanism and correlation investigation," and the target-related terms being "MYCN gene amplification," "pediatric neuroblastoma," and "clinical prognostic indicators."
[0022] Step S125: Treat each term in the core term set as a graph node, construct association edges between nodes according to the association types in the term association relationship list, label the association type description on the association edges, and connect the target node with the relevant core term nodes through the target association edges to form an initial semantic knowledge path.
[0023] Each term in the core term set, namely "pediatric neuroblastoma", "MYCN gene amplification", "development and progression", "mechanism of action", "clinical prognostic indicators", "correlation", "cell cycle-related proteins", "tumor cell proliferation", "molecular pathways", and "survival rate", is treated as a graph node. According to the term association relationship list, "MYCN gene amplification" and "the development and progression of pediatric neuroblastoma" are connected by a causal association edge, labeled "causal association: in..."; "MYCN gene amplification" and "mechanism of action" are connected by a subordinate association edge, labeled "subordinate association: targeting..."; "MYCN gene amplification" and "clinical prognostic indicators" are connected by a comparative association edge, labeled "comparative association: correlation"; "MYCN gene amplification" and "cell cycle-related proteins" are connected by a causal association edge, labeled "causal association: regulation"; "cell cycle-related proteins" and "tumor cell proliferation" are connected by a causal association edge, labeled "causal association: influence"; "MYCN gene amplification" and "molecular pathways" are connected by a subordinate association edge, labeled "subordinate association: manifestation of mechanism of action"; "MYCN gene amplification" and "survival rate" are connected by a comparative association edge, labeled "comparative association: correlation". Then, the target node "mechanism and correlation investigation" is connected to "MYCN gene amplification", "pediatric neuroblastoma" and "clinical prognostic indicators" through target association edges to form an initial semantic knowledge path.
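The initial semantic knowledge path described in step S125 is a small labeled graph. A minimal sketch follows, assuming a plain adjacency-list representation; the names `CORE_TERMS`, `EDGES`, `TARGET`, and `build_initial_path` are hypothetical, and the text does not specify how the graph is actually stored.

```python
CORE_TERMS = [
    "pediatric neuroblastoma", "MYCN gene amplification",
    "development and progression", "mechanism of action",
    "clinical prognostic indicators", "correlation",
    "cell cycle-related proteins", "tumor cell proliferation",
    "molecular pathways", "survival rate",
]

M = "MYCN gene amplification"
EDGES = [  # (source node, destination node, edge label)
    (M, "development and progression", "causal association: in..."),
    (M, "mechanism of action", "subordinate association: targeting..."),
    (M, "clinical prognostic indicators", "comparative association: correlation"),
    (M, "cell cycle-related proteins", "causal association: regulation"),
    ("cell cycle-related proteins", "tumor cell proliferation",
     "causal association: influence"),
    (M, "molecular pathways",
     "subordinate association: manifestation of mechanism of action"),
    (M, "survival rate", "comparative association: correlation"),
]

TARGET = {
    "type": "mechanism and correlation investigation",
    "related_terms": [M, "pediatric neuroblastoma", "clinical prognostic indicators"],
}


def build_initial_path(core_terms, edges, target):
    """Return an adjacency list: node -> list of (neighbor, edge label)."""
    graph = {term: [] for term in core_terms}
    for src, dst, label in edges:
        graph[src].append((dst, label))
    # The target node is linked to its related terms by target association edges.
    graph[target["type"]] = [(t, "target association") for t in target["related_terms"]]
    return graph


PATH = build_initial_path(CORE_TERMS, EDGES, TARGET)
```

In this representation "correlation" is a node with no outgoing edges, since in the example it appears only as an edge label.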
[0024] Step S126: Optimize the structure of the initial semantic knowledge path, merge semantically repetitive nodes, adjust the connection logic of related edges to conform to the knowledge system of the pediatric oncology field, supplement the domain attribute description of the nodes, and generate a semantic knowledge path for scientific research questions.
[0025] Inspecting the nodes in the initial semantic knowledge path revealed a semantic connection between "clinical prognostic indicators" and "survival rate." "Survival rate" is a type of "clinical prognostic indicator," and therefore can be treated as a child node of "clinical prognostic indicator" rather than an independent parallel node, merging semantically redundant associations. The connection logic of the associated edges was adjusted so that "MYCN gene amplification" is connected to "clinical prognostic indicators," and then "clinical prognostic indicators" are connected to "survival rate," aligning with the relationship between overall and specific indicators in the domain knowledge system. Domain attribute descriptions were added to each node, such as "pediatric neuroblastoma" having the domain attribute "clinical research - disease name," and "MYCN gene amplification" having the domain attribute "basic medicine - gene mutation," etc. After these optimizations, the final semantic knowledge path for the research question was generated.
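The re-parenting of "survival rate" under "clinical prognostic indicators" can be sketched as an edge rewrite. This is an illustration under stated assumptions: `nest_child` is a hypothetical helper, and the label "subordinate association: specific indicator" is invented here because the text does not name the label of the new edge.

```python
M = "MYCN gene amplification"
EDGES = [  # the two comparative edges before optimization
    (M, "clinical prognostic indicators", "comparative association: correlation"),
    (M, "survival rate", "comparative association: correlation"),
]


def nest_child(edges, hub, parent, child, label):
    """Drop the direct hub -> child edge and attach child under parent instead."""
    kept = [e for e in edges if (e[0], e[1]) != (hub, child)]
    kept.append((parent, child, label))
    return kept


OPTIMIZED = nest_child(
    EDGES, M, "clinical prognostic indicators", "survival rate",
    "subordinate association: specific indicator",  # hypothetical label
)

# Domain attribute descriptions supplied for nodes, as given in the example.
DOMAIN_ATTRIBUTES = {
    "pediatric neuroblastoma": "clinical research - disease name",
    "MYCN gene amplification": "basic medicine - gene mutation",
}
```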
[0026] Step S130: Based on the semantic knowledge path of the scientific research question, the literature knowledge matching module of the pediatric oncology scientific research model is triggered to perform targeted literature search in the preset pediatric oncology scientific research literature database, and to filter out literature containing content that matches the core terms and related relationships in the semantic knowledge path, so as to obtain a set of matching literature. The set of matching literature includes the full text of the literature, the professional term annotations in the literature, and the term related expression fragments.
[0027] Step S131: Extract all core term nodes and the types of related edges between nodes from the semantic knowledge path of scientific research questions, forming search keyword combinations and related relationship search conditions. The search keyword combinations include the full names of core terms and commonly used expressions in the field, and the related relationship search conditions include the logical search rules corresponding to the types of related edges.
[0028] From the semantic knowledge path of the research question generated above, the core term nodes are extracted as "pediatric neuroblastoma," "MYCN gene amplification," "development and progression," "mechanism of action," "clinical prognostic indicators," "cell cycle-related proteins," "tumor cell proliferation," and "molecular pathways." The full names of these core terms are the node names. Common expressions in the field are added as variants; for example, "pediatric neuroblastoma" can include "childhood neuroblastoma," and "MYCN gene amplification" can include "MYCN amplification," etc. The types of association edges between nodes include causal association, subordinate association, and comparative association. In the association relationship retrieval conditions, the logical retrieval rule corresponding to causal association is "term A causes / regulates / affects term B," the logical retrieval rule corresponding to subordinate association is "term A is a part / aspect / manifestation of term B," and the logical retrieval rule corresponding to comparative association is "term A is related to / has a correlation with term B."
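The search keyword combinations and the relationship retrieval rules of step S131 can be sketched as two lookup tables. The names `VARIANTS`, `RULES`, `keyword_combination`, and `relation_queries` are hypothetical, and expanding a rule into plain phrase strings is an assumption about how a rule might be turned into queries.

```python
# Field variants of core terms; only the variant named in the example is listed.
VARIANTS = {
    "MYCN gene amplification": ["MYCN amplification"],
}

# Connecting phrases for each association edge type, from the rules in the example.
RULES = {
    "causal": ["causes", "regulates", "affects"],
    "subordinate": ["is a part of", "is an aspect of", "is a manifestation of"],
    "comparative": ["is related to", "has a correlation with"],
}


def keyword_combination(term):
    """Full name of a core term followed by its common field expressions."""
    return [term] + VARIANTS.get(term, [])


def relation_queries(term_a, term_b, kind):
    """Expand one association edge into candidate query phrases."""
    return [f"{term_a} {phrase} {term_b}" for phrase in RULES[kind]]


QUERIES = relation_queries(
    "MYCN gene amplification", "cell cycle-related proteins", "causal"
)
```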
[0029] Step S132: Input the search keyword combination and the association relationship retrieval conditions into the search strategy generation submodule of the literature knowledge matching module of the pediatric oncology research large model to generate a targeted search strategy containing multi-round search logic. The multi-round search logic includes a first round of coarse search based on core terms, a second round of fine screening based on association relationships, and a final round of fine matching based on semantic similarity.
[0030] For example, step S1321: The full names of the core terms and commonly used expressions in the field in the search keyword combination are split into fields and hierarchical levels, distinguishing between basic medical terminology, clinical research terminology, experimental technology terminology and statistical analysis terminology, and outputting a hierarchical keyword set. Each level of the terminology in the hierarchical keyword set retains the description related to the research problem of pediatric oncology.
[0031] The core terms are broken down by domain level. The basic medical terminology level includes "MYCN gene amplification," "cell cycle-related proteins," "molecular pathways," and "tumor cell proliferation," with the association description being "basic medical terms related to the mechanism of action of MYCN gene amplification in the development and progression of pediatric neuroblastoma." The clinical research terminology level includes "pediatric neuroblastoma" and "clinical prognostic indicators," with the association description being "terminology related to clinical research on pediatric neuroblastoma." "Development and progression" and "mechanism of action" are cross-disciplinary terms between basic medicine and clinical research and can be assigned to the appropriate level based on the specific context; here, they are provisionally assigned to the basic medical terminology level. The statistical analysis terminology level plays only a small part in this question; "correlation" can be assigned to this level, with the association description being "statistical analysis terms used to analyze the relationships between terms." This forms a hierarchical keyword set, with each level containing its terms and the related descriptions.
[0032] Step S1322: Perform atomic decomposition on the logical retrieval rules corresponding to the association edge types in the association retrieval conditions. Decompose the logical retrieval rules corresponding to causal association into a one-way logical expression pointing from the existence of the cause term to the existence of the effect term. Decompose the logical retrieval rules corresponding to subordinate association into an inclusion logical expression pointing from the belonging of the subordinate term to the category of the main term. Decompose the logical retrieval rules corresponding to comparative association into a two-way logical expression pointing from the attribute of the first comparative term to the difference of the attribute of the second comparative term. Decompose the logical retrieval rules corresponding to collaborative association into a linked logical expression pointing from the action of the first collaborative term to the action of the second collaborative term and then to the joint action. Output the decomposed atomic logical condition set.
[0033] The causal relationship logical retrieval rule "Term A causes / regulates / affects Term B" is broken down into the unidirectional logical expression "If Term A exists, it may lead to the occurrence / change of Term B". The subordinate relationship "Term A is a part / aspect / manifestation of Term B" is broken down into the inclusion logical expression "Term A belongs to the scope of Term B / is an aspect of Term B". The comparative relationship "Term A is related to Term B / has a correlation" is broken down into the bidirectional logical expression "Changes in the attributes of Term A will cause changes in the attributes of Term B, and vice versa". Since collaborative relationships are not involved in this problem, the breakdown rules for collaborative relationships are only shown as examples here; in actual applications, they should be broken down according to the specific relationship. The above-described decomposed expressions form a set of atomic logical conditions.
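The atomic decomposition of step S1322 can be sketched as a table of templates keyed by association type. The names `ATOMIC_TEMPLATES` and `decompose` are hypothetical. Collaborative association is left out of the table because the example gives no concrete expression for it.

```python
# Direction and wording of each atomic logical expression, from the example.
ATOMIC_TEMPLATES = {
    "causal": (
        "one-way",
        "If {a} exists, it may lead to the occurrence / change of {b}",
    ),
    "subordinate": (
        "inclusion",
        "{a} belongs to the scope of {b} / is an aspect of {b}",
    ),
    "comparative": (
        "two-way",
        "Changes in the attributes of {a} will cause changes in the "
        "attributes of {b}, and vice versa",
    ),
}


def decompose(kind, term_a, term_b):
    """Turn one retrieval rule into an atomic logical condition record."""
    direction, template = ATOMIC_TEMPLATES[kind]
    return {
        "kind": kind,
        "direction": direction,
        "expression": template.format(a=term_a, b=term_b),
    }


CONDITION = decompose("causal", "MYCN gene amplification", "cell cycle-related proteins")
```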
[0034] Step S1323: Associate and bind the terms at each level in the hierarchical keyword set with the corresponding logical expressions in the atomic logical condition set, so that each atomic logical condition corresponds to at least one level of term, forming an associated retrieval unit. Each associated retrieval unit contains a term level identifier, atomic logical condition and term association description, and outputs a set of associated retrieval units.
[0035] In the basic medical terminology hierarchy, "MYCN gene amplification" is bound to the causal relationship atomic logic expression "If term A exists, it may lead to the occurrence / change of term B," forming an associated retrieval unit. The terminology hierarchy is identified as "Basic Medicine - Gene Regulation," the atomic logic condition is the aforementioned causal expression, and the term association description is "MYCN gene amplification regulates cell cycle-related proteins." Similarly, "pediatric neuroblastoma" is bound to a subordinate relationship atomic logic expression, the terminology hierarchy is identified as "Clinical Research - Disease Definition," the atomic logic condition is an inclusion expression, and the term association description is "The occurrence and development process of pediatric neuroblastoma," etc. This process continues, binding terms at each level with their corresponding atomic logic conditions to form multiple associated retrieval units, which together constitute an associated retrieval unit set.
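An associated retrieval unit, as described in step S1323, carries three things: a term level identifier, an atomic logical condition, and a term association description. The sketch below models that record and adds a check for the requirement that every atomic condition be bound to at least one term; `AssociatedRetrievalUnit` and `unbound_conditions` are hypothetical names.

```python
from dataclasses import dataclass

CAUSAL = "If term A exists, it may lead to the occurrence / change of term B"
INCLUSION = "Term A belongs to the scope of Term B / is an aspect of Term B"


@dataclass(frozen=True)
class AssociatedRetrievalUnit:
    term: str
    level_id: str                 # term level identifier
    atomic_condition: str         # bound atomic logical expression
    association_description: str  # how the term relates to the question


UNITS = [
    AssociatedRetrievalUnit(
        term="MYCN gene amplification",
        level_id="Basic Medicine - Gene Regulation",
        atomic_condition=CAUSAL,
        association_description="MYCN gene amplification regulates cell cycle-related proteins",
    ),
    AssociatedRetrievalUnit(
        term="pediatric neuroblastoma",
        level_id="Clinical Research - Disease Definition",
        atomic_condition=INCLUSION,
        association_description="The occurrence and development process of pediatric neuroblastoma",
    ),
]


def unbound_conditions(conditions, units):
    """Return atomic conditions that no retrieval unit is bound to."""
    bound = {unit.atomic_condition for unit in units}
    return [c for c in conditions if c not in bound]
```

An empty result from `unbound_conditions` indicates that the binding requirement is met for the conditions checked.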
[0036] Step S1324: Perform relevance evaluation on each related retrieval unit in the related retrieval unit set. Based on the relevance description of the terms in the related retrieval unit to the pediatric oncology research problem, and the degree of matching between the atomic logic conditions and the core requirements of the problem, prioritize all related retrieval units and output the sorted sequence of related retrieval units.
[0037] The relevance of each associated search unit to the research question was assessed. The associated search unit "MYCN gene amplification regulates cell cycle-related proteins" directly relates to the core term "MYCN gene amplification" in the question, and its atomic logic conditions highly match the core requirement of the question, "affecting tumor cell proliferation by regulating cell cycle-related proteins," thus having a higher priority. The associated search unit "the occurrence and development of neuroblastoma in children," while having important terminology, has a slightly lower degree of matching in its atomic logic conditions, thus having a lower priority. Following the above assessment method, all associated search units were prioritized, resulting in a ranked sequence of associated search units.
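The text does not specify how relevance is scored, so the following sketch swaps in a simple stand-in: Jaccard overlap between the words of a unit's description and the words of the question's core requirement. The functions `tokens`, `relevance`, and `prioritize` are hypothetical, and a real system would more plausibly use the model's own semantic scoring.

```python
def tokens(text):
    """Lowercased whitespace tokens of a string, as a set."""
    return set(text.lower().split())


def relevance(description, core_requirement):
    """Jaccard word overlap, used here only as a stand-in relevance score."""
    a, b = tokens(description), tokens(core_requirement)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def prioritize(units, core_requirement):
    """Sort retrieval units from most to least relevant."""
    return sorted(
        units,
        key=lambda unit: relevance(unit["description"], core_requirement),
        reverse=True,
    )


CORE_REQUIREMENT = (
    "affecting tumor cell proliferation by regulating cell cycle-related proteins"
)
UNITS = [
    {"description": "the occurrence and development of pediatric neuroblastoma"},
    {"description": "MYCN gene amplification regulates cell cycle-related proteins"},
]
RANKED = prioritize(UNITS, CORE_REQUIREMENT)
```

With this stand-in score the "MYCN gene amplification regulates cell cycle-related proteins" unit ranks first, which agrees with the ordering described in the example.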
[0038] Step S1325: Generate multi-round retrieval logic based on the sorted sequence of associated retrieval units. The first round of retrieval logic uses terms from the top-ranked associated retrieval units for coarse retrieval, retaining all documents containing those terms. The second round of retrieval logic uses the atomic logic conditions of the associated retrieval units to finely screen the documents obtained in the first round, retaining documents whose text satisfies the atomic logic conditions. The final round of retrieval logic performs fine matching on the hierarchical correspondence between the terminology level identifiers in the associated retrieval units and the semantic knowledge path of the pediatric oncology research question, retaining documents whose hierarchical correspondence meets the preset rules. Integrate the first, second, and final rounds of retrieval logic to output a targeted retrieval strategy that includes multi-round retrieval logic.
[0039] The first round of searching used top-ranked terms such as "MYCN gene amplification" and "pediatric neuroblastoma" to retain all literature containing these terms. The second round, based on the literature from the first round, used atomic logic conditions such as "if MYCN gene amplification exists, it may regulate cell cycle-related proteins" to filter and retain literature containing such expressions. The final round checked whether the hierarchical level of terms in the literature corresponded to the hierarchical level of the semantic knowledge path; for example, basic medical terms should appear in the section discussing mechanisms, and clinical research terms should appear in the section discussing prognosis, retaining literature with high hierarchical correspondence. Integrating these three rounds of search logic forms a targeted search strategy.
[0040] Step S133: Call the literature database retrieval submodule of the literature knowledge matching module of the pediatric oncology research big model, and search for literature containing any core terms in the preset pediatric oncology research literature database according to the first-round retrieval logic in the targeted retrieval strategy to obtain a coarse retrieval literature set.
[0041] The literature database retrieval submodule searches the pediatric oncology research literature database based on core terms such as "MYCN gene amplification" and "pediatric neuroblastoma" from the initial search logic. The system traverses the database by examining fields such as title, abstract, and keywords. Any document containing any of the aforementioned core terms is included in the coarse-search literature set. This process yields a coarse-search literature set containing hundreds of articles.
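As a non-limiting illustration, the coarse retrieval described above can be sketched in Python as follows. The record layout (a dictionary with `title`, `abstract`, and `keywords` strings), the function name `coarse_retrieve`, and the sample records are assumptions made for this sketch and are not specified by the method itself; plain case-insensitive substring matching stands in for whatever matching the retrieval submodule actually uses.

```python
from typing import Dict, List

# Assumption: fields examined during coarse retrieval, as named in the text.
SEARCH_FIELDS = ("title", "abstract", "keywords")


def coarse_retrieve(documents: List[Dict[str, str]],
                    core_terms: List[str]) -> List[Dict[str, str]]:
    """Keep every document in which ANY core term appears in ANY searched field."""
    lowered_terms = [term.lower() for term in core_terms]
    hits = []
    for doc in documents:
        # Concatenate the searched fields into one lowercase string.
        text = " ".join(doc.get(field, "") for field in SEARCH_FIELDS).lower()
        if any(term in text for term in lowered_terms):
            hits.append(doc)
    return hits


# Hypothetical records, for illustration only.
docs = [
    {"id": "A", "title": "MYCN gene amplification in high-risk disease",
     "abstract": "", "keywords": ""},
    {"id": "B", "title": "Wilms tumor imaging",
     "abstract": "", "keywords": "nephroblastoma"},
    {"id": "C", "title": "Outcomes",
     "abstract": "A cohort of pediatric neuroblastoma patients.", "keywords": ""},
]
coarse_set = coarse_retrieve(docs, ["MYCN gene amplification",
                                    "pediatric neuroblastoma"])
```

Because the first round keeps a document on any single term hit, it favors recall; precision is left to the later screening rounds.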
[0042] Step S134: Based on the second-round retrieval logic in the targeted retrieval strategy, perform text analysis on each document in the coarse retrieval literature set, check whether the document text contains term association expressions that satisfy the atomic logic conditions, retain the documents with matching expressions, and obtain the finely screened document set.
[0043] The full text of each document in the coarse retrieval literature set is analyzed. For example, a document stating that "MYCN gene amplification promotes the proliferation of neuroblastoma cells in children by upregulating the expression of cyclin D1" is retained, because the association among "MYCN gene amplification," "cyclin D1" (a cell cycle-related protein), and "proliferation of neuroblastoma cells in children" (an instance of tumor cell proliferation) satisfies the causal atomic logic condition. Documents that only mention "treatment progress of neuroblastoma in children" without mentioning MYCN gene amplification and its associations are removed. This process yields the finely screened document set.
[0044] Step S135: According to the final round retrieval logic in the targeted retrieval strategy, calculate the semantic similarity between the text content of each document in the finely screened document set and the semantic knowledge path of the research question. The semantic similarity is derived by combining the overlap between document terms and graph nodes, and the matching degree between document term associations and graph association edges.
[0045] For each document in the finely screened document set, the number of overlaps between the terms appearing in the document and the core term nodes in the semantic knowledge path of the research question is counted. The higher the overlap, the higher the base semantic similarity score. At the same time, the degree of matching between the term associations in the document and the association edges of the graph is analyzed. For example, if the causal relationships in the document match the causal association edges in the graph, the matching score is high. The overlap score and the matching score are then combined according to a preset weight ratio (e.g., a weight of 0.4 for the overlap score and 0.6 for the matching score) to obtain the semantic similarity value of each document.
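As a non-limiting illustration, the weighted combination described above can be sketched in Python as follows. The 0.4 and 0.6 weights are the example values given in the text. The normalization of each component (dividing by the number of path nodes and path edges so that each score lies between 0 and 1), the `Edge` tuple layout, and the sample data are assumptions made for this sketch.

```python
from typing import Iterable, Set, Tuple

# Assumption: an association is represented as (source term, relation type, target term).
Edge = Tuple[str, str, str]


def semantic_similarity(doc_terms: Iterable[str], doc_edges: Iterable[Edge],
                        path_nodes: Set[str], path_edges: Set[Edge],
                        w_overlap: float = 0.4, w_match: float = 0.6) -> float:
    """Combine term overlap and association matching into one similarity value."""
    # Assumption: each component is normalized by the size of the knowledge path.
    overlap = len(set(doc_terms) & path_nodes) / len(path_nodes) if path_nodes else 0.0
    match = len(set(doc_edges) & path_edges) / len(path_edges) if path_edges else 0.0
    return w_overlap * overlap + w_match * match


# Hypothetical knowledge path and document annotations, for illustration only.
path_nodes = {"MYCN gene amplification", "pediatric neuroblastoma",
              "cell cycle-related proteins", "tumor cell proliferation"}
path_edges = {
    ("MYCN gene amplification", "causal", "cell cycle-related proteins"),
    ("cell cycle-related proteins", "causal", "tumor cell proliferation"),
}
doc_terms = ["MYCN gene amplification", "pediatric neuroblastoma", "ALK gene mutation"]
doc_edges = [("MYCN gene amplification", "causal", "cell cycle-related proteins")]

score = semantic_similarity(doc_terms, doc_edges, path_nodes, path_edges)
```

In this sketch the document shares two of four nodes and one of two edges with the path, so both components equal 0.5 and the combined value is 0.5.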
[0046] Step S136: Retain documents whose semantic similarity meets the preset requirements, extract the full text of the retained documents whose semantic similarity meets the preset requirements, and use the terminology annotation submodule of the document knowledge matching module of the pediatric tumor research big model to annotate the full text of the documents with professional terms, identify and mark the core terms and term-related expression fragments in the documents, and form a set of matched documents containing the full text of the documents, professional term annotations and term-related expression fragments.
[0047] A preset semantic similarity threshold, such as 0.6, is set, and documents with a semantic similarity value greater than or equal to 0.6 are retained. For the retained documents, their full text is extracted. The terminology annotation submodule uses a BERT-based named entity recognition model that has been fine-tuned on pediatric oncology corpora and can recognize specialized terms in the documents. For example, for the sentence "MYCN amplification is a key event in the pathogenesis of neuroblastoma, and it can regulate the cell cycle by affecting the expression of cyclin-dependent kinases.", core terms such as "MYCN amplification" (MYCN gene amplification), "pathogenesis" (a pathological mechanism term), "neuroblastoma" (i.e., pediatric neuroblastoma), "cell cycle" (a cell cycle-related term), and "cyclin-dependent kinases" (cell cycle-related proteins) are annotated, and the term association expression fragment "regulate the cell cycle by affecting the expression of cyclin-dependent kinases" is marked. The full text of the documents, the professional term annotations, and the term association expression fragments are compiled to form the matching document set.
[0048] Step S140: Call the dynamic knowledge association module of the pediatric oncology research big data model to perform knowledge association analysis on the full text of the documents, professional term annotations and term association descriptions in the matching document set, and construct a dynamic knowledge association network with the semantic knowledge path of the research question as the core. The dynamic knowledge association network includes the association path between document terms and question terms, the description of the association strength, and the logical deduction relationship between knowledge nodes.
[0049] Step S141: Compare the technical term annotations of each document in the matching document set with the core term nodes in the semantic knowledge path of the research question, and identify the matching terms and unmatched new terms in the documents. Matching terms are document terms that are semantically consistent with the core term nodes, and new terms are technical terms in the documents that are related to the research topic but do not appear in the map.
[0050] Terms such as “MYCN amplification,” “neuroblastoma,” and “cell cycle proteins” in a document are compared with the core term nodes “MYCN gene amplification,” “pediatric neuroblastoma,” and “cell cycle-related proteins” in the semantic knowledge path of the research question. If they are semantically consistent, they are considered matching terms. If terms such as “ALK gene mutation” and “telomerase activity” appear in the document, and these terms are related to the research topic “the occurrence, development, mechanism, and prognosis of pediatric neuroblastoma” but do not appear among the core term nodes of the original semantic knowledge path, they are identified as new terms.
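As a non-limiting illustration, the separation of matching terms from new terms can be sketched in Python as follows. The synonym table `SYNONYMS` is a hypothetical stand-in for the semantic consistency judgment, which the text does not specify, and the sketch assumes that every annotated term passed in has already been judged relevant to the research topic.

```python
from typing import Iterable, List, Set, Tuple

# Hypothetical synonym table standing in for the semantic consistency judgment.
SYNONYMS = {
    "mycn amplification": "MYCN gene amplification",
    "neuroblastoma": "pediatric neuroblastoma",
    "cell cycle proteins": "cell cycle-related proteins",
}


def split_terms(doc_terms: Iterable[str],
                core_nodes: Set[str]) -> Tuple[List[str], List[str]]:
    """Return (matching terms, new terms) for one document."""
    matched, new = [], []
    for term in doc_terms:
        # Map the document term to its canonical node name when one is known.
        canonical = SYNONYMS.get(term.lower(), term)
        if canonical in core_nodes:
            matched.append(term)
        else:
            new.append(term)
    return matched, new


core_nodes = {"MYCN gene amplification", "pediatric neuroblastoma",
              "cell cycle-related proteins"}
matched_terms, new_terms = split_terms(
    ["MYCN amplification", "neuroblastoma", "ALK gene mutation", "telomerase activity"],
    core_nodes,
)
```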
[0051] Step S142: Perform semantic parsing on the term association description fragments of each document, extract the term association relationships in the term association description fragments, distinguish between matching association relationships that are consistent with the association edge type in the semantic knowledge path of the research question and newly added association relationships. The newly added association relationships are the inter-term associations that appear in the document but are not recorded in the graph.
[0052] Semantic parsing is performed on the term association description fragment "MYCN amplification leads to increased expression of telomerase, which in turn promotes the immortalization of neuroblastoma cells" in a document. The parsing extracts the association "leads to increased expression of" between "MYCN amplification" and "telomerase" (a new term), and the association "promotes" between "telomerase" and "immortalization of neuroblastoma cells" (also a new term). These associations are not recorded in the original graph and are therefore treated as newly added association relationships. By contrast, the association in the statement "MYCN amplification regulates cell cycle-related proteins" is consistent with the causal association between "MYCN gene amplification" and "cell cycle-related proteins" in the original graph and is therefore a matching association relationship.
[0053] Step S143: Add the newly added terms as new knowledge nodes and the newly added relationships as new relationship edges to the semantic knowledge path of the research question to form an extended semantic knowledge path. Mark the corresponding literature source identifier on the newly added knowledge nodes and mark the literature reference fragments of the relationship on the newly added relationship edges.
[0054] The newly identified terms such as "ALK gene mutation," "telomerase activity," "telomerase," and "neuroblastoma cell immortalization" are added as new knowledge nodes to the semantic knowledge path. New associations such as "MYCN amplification leads to increased telomerase activity" and "telomerase activity promotes neuroblastoma cell immortalization" are added as new association edges. The new knowledge node "telomerase activity" is labeled with a literature source identifier, such as the DOI or PMID number. The new association edge "MYCN amplification leads to increased telomerase activity" is labeled with the literature reference fragment "MYCN amplification leads to increased expression of telomerase."
[0055] Step S144: Calculate the degree of association between each node and other nodes in the extended semantic knowledge path. The degree of association is described as the attribute information of the association edges between nodes. The degree of association is calculated based on the number of direct association edges between nodes, the length of indirect association paths, and the frequency of association descriptions in the literature, according to preset rules.
[0056] Step S1441: Count the number of direct edges between any two nodes in the extended semantic knowledge path. A direct edge is an edge that connects two nodes without passing through other nodes. Record the results of the direct edge count.
[0057] A traversal analysis is performed on all node pairs in the extended semantic knowledge path. For each node pair, it is checked whether there is a direct connection edge that does not pass through any intermediate nodes. For the "MYCN gene amplification" node and the "cell cycle-related protein" node, if multiple documents are identified during the literature knowledge analysis process that explicitly describe the direct regulatory relationship between the two, then an independent direct connection edge is created for each document supporting this direct relationship during knowledge path construction. Therefore, the statistical result of the number of direct connection edges is not a binary statement of existence or non-existence, but rather reflects the number of independent sources of literature evidence supporting this direct connection. This number is recorded as an integer value, denoted as the direct connection edge count value E_direct.
[0058] Step S1442: Traverse all indirect association paths between any two nodes in the extended semantic knowledge path. An indirect association path is a sequence of associated edges that connect two nodes through one or more intermediate nodes. Record the number of associated edges contained in each indirect association path as the path length, and calculate the average path length of all indirect association paths between two nodes.
[0059] For node pairs without direct connections or requiring examination of multi-step transmission relationships, such as "MYCN gene amplification" and "neuroblastoma cell immortalization," all paths connecting these two nodes are searched within the extended semantic knowledge path. Each path consists of a series of continuous connecting edges and intermediate nodes. For example, path one is "MYCN gene amplification → telomerase activity → neuroblastoma cell immortalization," containing 2 connecting edges and a path length of 2; path two is "MYCN gene amplification → cell cycle-related proteins → tumor cell proliferation → cell cycle checkpoint malfunction → neuroblastoma cell immortalization," containing 4 connecting edges and a path length of 4. For all found indirect connecting paths, the arithmetic mean of their path lengths is calculated and denoted as the average path length, L_path_avg.
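As a non-limiting illustration, the enumeration of indirect association paths and the calculation of L_path_avg can be sketched in Python as follows. The adjacency-list representation, the restriction to simple paths (no node visited twice), and the function name `indirect_paths` are assumptions made for this sketch; the edges reproduce the two example paths given above.

```python
from typing import Dict, Iterator, List


def indirect_paths(graph: Dict[str, List[str]], start: str,
                   goal: str) -> Iterator[List[str]]:
    """Yield simple paths from start to goal that pass through at least one intermediate node."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        for nxt in graph.get(node, []):
            if nxt in path:  # assumption: a path never revisits a node
                continue
            new_path = path + [nxt]
            if nxt == goal:
                if len(new_path) > 2:  # skip the direct edge start -> goal
                    yield new_path
            else:
                stack.append((nxt, new_path))


# Edges taken from the two example paths in the text.
graph = {
    "MYCN gene amplification": ["telomerase activity", "cell cycle-related proteins"],
    "telomerase activity": ["neuroblastoma cell immortalization"],
    "cell cycle-related proteins": ["tumor cell proliferation"],
    "tumor cell proliferation": ["cell cycle checkpoint malfunction"],
    "cell cycle checkpoint malfunction": ["neuroblastoma cell immortalization"],
}
paths = list(indirect_paths(graph, "MYCN gene amplification",
                            "neuroblastoma cell immortalization"))
lengths = [len(p) - 1 for p in paths]  # path length = number of edges
l_path_avg = sum(lengths) / len(lengths) if lengths else None
```

For the two example paths, the lengths are 2 and 4, so the average path length is 3.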
[0060] Step S1443: Extract all fragments in the matching document set that involve the related terms of the two nodes, count the total frequency of the extracted fragments in the documents, and record the statistical results of the frequency of related terms.
[0061] A full-text scan of all documents in the matching document set is performed to locate all sentences or paragraphs that mention the terms corresponding to the two nodes together. For "MYCN gene amplification" and "cell cycle-related proteins," the total number of fragments in all documents that explicitly state a relationship between the two is counted, regardless of whether the relationship described by these fragments has been constructed as a direct association edge. It is important to note that this frequency has a different statistical granularity from the number of direct association edges. The number of direct association edges counts the relationships identified, after semantic parsing and knowledge extraction, as having clear and independent evidence; typically, a document contributes a single direct association edge even if it mentions the same pair of terms multiple times. The frequency of term association statements, on the other hand, counts the number of times the pair of terms is mentioned in the original document text; the same document may contain multiple statements about the pair, and each occurrence is counted. The frequency of term association statements therefore reflects how widely and how often the pair of terms is discussed in the literature. The statistical result is recorded as the frequency value F_mention.
[0062] Step S1444: According to the preset weight allocation rules, assign corresponding calculation weights to the statistical results of the number of directly related edges, the average path length, and the statistical results of the frequency of related terms. The weight allocation rules are set based on the importance characteristics of knowledge association in the field of pediatric oncology.
[0063] Based on the characteristics of knowledge association in the field of pediatric oncology, weights are assigned to three parameters. The frequency of term association (F_mention) reflects the popularity of the association in academic circles and has high reference value; it is assigned a weight coefficient w_f with a value of 0.5. The number of direct association edges (E_direct) reflects the number of associations supported by independent evidence after semantic refinement, representing the reliability of the association; it is assigned a weight coefficient w_e with a value of 0.4. The average path length (L_path_avg) reflects the indirect association distance between two nodes. The shorter the path, the fewer steps involved in the transmission through other knowledge nodes, and the more direct the association. Therefore, the reciprocal of the average path length is used in the calculation, and a weight coefficient w_l with a value of 0.1 is assigned. The sum of the above weight coefficients is 1 to ensure that the calculation results are within a unified dimension.
[0064] Step S1445: Based on the assigned calculation weights, perform a comprehensive calculation on the statistical results of the number of directly related edges, the average path length, and the statistical results of the frequency of related terms to obtain the numerical value of the degree of association between the two nodes.
[0065] Before performing the comprehensive calculation, the three parameters need to be standardized to eliminate the inconsistency in units that may be caused by differences in the order of magnitude of different parameters. For the number of directly related edges E_direct, the maximum number of directly related edges between all node pairs in the extended semantic knowledge path is counted and denoted as E_max. The standardized value of the number of directly related edges E_norm is obtained by dividing the E_direct of each node pair by E_max, so that its value is between 0 and 1. For the term association frequency F_mention, the maximum frequency between all node pairs is counted and denoted as F_max. The standardized frequency value F_norm is obtained by dividing the F_mention of each node pair by F_max. For the average path length L_path_avg, since it is itself a length value, its reciprocal is taken. The maximum value of the reciprocal of the average path length between all node pairs is counted and denoted as L_inv_max. The standardized path length contribution value L_norm is obtained by dividing the reciprocal 1 / L_path_avg of each node pair by L_inv_max, so that its value is between 0 and 1.
[0066] After standardization, a comprehensive calculation is performed using the formula: S = E_norm × w_e + F_norm × w_f + L_norm × w_l. The S value calculated using this formula is a dimensionless value, ranging from 0 to 1. It objectively reflects the degree of correlation between two nodes and avoids calculation distortion caused by differences in the dimensions and orders of magnitude of the original parameters.
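As a non-limiting illustration, the standardization and the formula S = E_norm × w_e + F_norm × w_f + L_norm × w_l can be sketched in Python as follows. The weights are the example values w_e = 0.4, w_f = 0.5, and w_l = 0.1 given above. The input layout, the sample statistics, and the treatment of a node pair with no indirect path (its path contribution is set to 0) are assumptions made for this sketch.

```python
from typing import Dict, Optional, Tuple

W_E, W_F, W_L = 0.4, 0.5, 0.1  # example weights from the text; they sum to 1

# Assumption: per-pair statistics are (E_direct, F_mention, L_path_avg or None).
Stats = Dict[Tuple[str, str], Tuple[int, int, Optional[float]]]


def association_scores(stats: Stats) -> Dict[Tuple[str, str], float]:
    """Normalize each parameter by its maximum over all pairs, then combine."""
    e_max = max(e for e, _, _ in stats.values()) or 1
    f_max = max(f for _, f, _ in stats.values()) or 1
    # Reciprocal of the average path length; 0 when no indirect path exists (assumption).
    inv = {pair: (1.0 / l if l else 0.0) for pair, (_, _, l) in stats.items()}
    l_inv_max = max(inv.values()) or 1.0
    return {
        pair: W_E * e / e_max + W_F * f / f_max + W_L * inv[pair] / l_inv_max
        for pair, (e, f, _) in stats.items()
    }


# Hypothetical statistics, for illustration only.
stats = {
    ("MYCN gene amplification", "cell cycle-related proteins"): (4, 10, 2.0),
    ("MYCN gene amplification", "neuroblastoma cell immortalization"): (1, 5, 3.0),
}
scores = association_scores(stats)
```

In this sketch the first pair holds the maximum of all three parameters and therefore scores 1.0, while the second pair scores 0.4 × 0.25 + 0.5 × 0.5 + 0.1 × (2/3), or about 0.417.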
[0067] Step S1446: Convert the numerical value of the degree of association into a textual description of the degree of association, and add the description of the degree of association to the attribute information of the associated edge between the two nodes to complete the attribute labeling of the associated edge.
[0068] Based on the calculated correlation strength value S, a hierarchical conversion rule is set: if the S value is greater than or equal to 0.8, it is converted to the text description "highly closely correlated"; if the S value is greater than or equal to 0.5 and less than 0.8, it is converted to the text description "moderately closely correlated"; if the S value is less than 0.5, it is converted to the text description "lowly closely correlated". These text descriptions are then added to the attribute information of the corresponding edges between nodes to complete the attribute labeling of the edges.
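As a non-limiting illustration, the hierarchical conversion rule above can be sketched in Python as follows. The thresholds and label strings are those given in the text; the function name `closeness_label` is an assumption made for this sketch.

```python
def closeness_label(s: float) -> str:
    """Convert the numerical degree of association S into its text description."""
    if s >= 0.8:
        return "highly closely correlated"
    if s >= 0.5:
        return "moderately closely correlated"
    return "lowly closely correlated"


# Attach the label to an edge's attribute information (hypothetical edge record).
edge = {"source": "MYCN gene amplification",
        "target": "cell cycle-related proteins",
        "type": "causal"}
edge["closeness"] = closeness_label(0.83)
```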
[0069] Step S145: Analyze the logical derivation relationship between knowledge nodes in the extended semantic knowledge path, identify multiple logical paths from the core term node of the problem to the research target node, each logical path contains sequentially connected knowledge nodes and associated edges, and record the literature support basis corresponding to each logical path.
[0070] Step S1451: Extract all knowledge nodes in the extended semantic knowledge path and the associated edges connecting those nodes, record the terminology content and domain attribute description of each knowledge node and the association type and degree of association description of each associated edge, and output the complete set of node edges.
[0071] All nodes in the extended semantic knowledge path, such as “MYCN gene amplification”, “childhood neuroblastoma”, “cell cycle-related protein”, “telomerase activity”, etc., as well as the associated edges connecting these nodes, such as “causal association (highly close)”, “subordinate association (moderately close)”, etc., along with the terminology content of the nodes, domain attribute descriptions, and association type and degree of association descriptions of the associated edges, are extracted to form a complete set of node edges.
[0072] Step S1452: Classify and organize the associated edges in the complete set of node edges according to the association type. Mark the associated edges of causal association as derivation direction edges, the associated edges of subordinate association as belonging direction edges, the associated edges of comparative association as bidirectional reference edges, and the associated edges of cooperative association as linkage direction edges. Output the set of node edges with direction labels.
[0073] Causal associations, such as "MYCN gene amplification → cell cycle-related proteins", are marked as derivation direction edges, with arrows pointing from cause to effect; subordinate associations, such as "mechanism of action → mechanism of action of MYCN gene amplification", are marked as belonging direction edges, with arrows pointing from child node to parent node; comparative associations, such as "MYCN gene amplification ↔ clinical prognostic indicators", are marked as bidirectional reference edges, with arrows pointing in both directions; cooperative associations, such as "MYCN gene amplification + ALK gene mutation → increased tumor malignancy", are marked as linkage direction edges, with arrows pointing from the two cooperating nodes to the combined result.
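As a non-limiting illustration, the marking of associated edges by association type can be sketched in Python as follows. The four association types and direction labels follow step S1452; the dictionary-based edge records and the function name `label_edges` are assumptions made for this sketch.

```python
from typing import Dict, List

# Association type -> direction label, following step S1452.
DIRECTION_LABELS = {
    "causal": "derivation direction edge",
    "subordinate": "belonging direction edge",
    "comparative": "bidirectional reference edge",
    "cooperative": "linkage direction edge",
}


def label_edges(edges: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the edges with a 'direction' attribute added."""
    return [{**edge, "direction": DIRECTION_LABELS[edge["type"]]} for edge in edges]


# Hypothetical edge records, for illustration only.
edges = [
    {"source": "MYCN gene amplification",
     "target": "cell cycle-related proteins", "type": "causal"},
    {"source": "MYCN gene amplification",
     "target": "clinical prognostic indicators", "type": "comparative"},
]
labeled_edges = label_edges(edges)
```

An association type outside the four listed raises a `KeyError` in this sketch, which makes unclassified edges visible instead of silently passing them through.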
[0074] Step S1453: Construct a logical deduction rule base based on the set of nodes and edges with directional labels. The rule base includes unidirectional deduction rules where the starting node points to the ending node corresponding to the deduction direction edge, subordinate deduction rules where the subordinate node points to the master node corresponding to the belonging direction edge, bidirectional deduction rules where the first node points to the second node and the second node points to the first node and the attribute comparison corresponding to the bidirectional reference edge, and linkage deduction rules where the first node's action points to the second node's action and then to the joint action corresponding to the linkage direction edge. Output the logical deduction rule base.
[0075] The unidirectional derivation rule is: "If there is a starting node A, and there is a derivation direction edge between A and B, then B can be deduced from A"; the attribution derivation rule is: "If there is a subordinate node A, and there is an attribution direction edge between A and the master node B, then A belongs to B"; the bidirectional derivation rule is: "If there is a bidirectional reference edge between A and B, then changes in the attributes of A can be referenced to changes in the attributes of B, and vice versa"; the linkage derivation rule is: "If there is a linkage direction edge between A and B, and A acts on C, and B acts on C, then A and B act together on C." Combining these rules forms a logical derivation rule base.
[0076] Step S1454: Starting from the core term node of the problem, based on the one-way deduction rules, attribution deduction rules, two-way deduction rules, and linkage deduction rules in the logical deduction rule base, traverse all reachable knowledge nodes in the set of node edges with direction labels, record the node sequence and corresponding associated edges traversed in each traversal, and form an initial path. Each initial path contains the associated edge information of the starting node, intermediate nodes, ending node, and connecting nodes, and output the initial path set.
[0077] Starting with "MYCN gene amplification," and following the unidirectional derivation rule, the "Cell Cycle-Related Proteins" node can be reached via the derivation direction edge "MYCN gene amplification → Cell Cycle-Related Proteins," and then the "Tumor Cell Proliferation" node can be reached via the derivation direction edge "Cell Cycle-Related Proteins → Tumor Cell Proliferation," forming an initial path: MYCN gene amplification → Cell Cycle-Related Proteins → Tumor Cell Proliferation. The node sequences and associated edge information in this path are recorded. Simultaneously, the "Telomerase Activity" node can be reached via the derivation direction edge "MYCN gene amplification → Telomerase Activity," and then the "Neuroblastoma Cell Immortality" node, forming another initial path. This process is repeated, traversing all reachable nodes to form a set of initial paths.
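As a non-limiting illustration, the traversal that forms the initial path set can be sketched in Python as follows. For brevity the sketch follows every edge only from its source to its target and does not model the reverse direction of bidirectional reference edges or the joint action of linkage direction edges; this simplification, the tuple-based edge layout, and the function name `initial_paths` are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, str]  # (source node, association type, target node)


def initial_paths(edges: List[Edge],
                  start: str) -> List[Tuple[List[str], List[Edge]]]:
    """Enumerate every simple path from start, recording the nodes and edges traversed."""
    outgoing: Dict[str, List[Tuple[str, str]]] = {}
    for src, edge_type, dst in edges:
        outgoing.setdefault(src, []).append((edge_type, dst))

    paths: List[Tuple[List[str], List[Edge]]] = []

    def walk(node: str, nodes: List[str], used: List[Edge]) -> None:
        for edge_type, dst in outgoing.get(node, []):
            if dst in nodes:  # assumption: a path never revisits a node
                continue
            next_nodes = nodes + [dst]
            next_used = used + [(node, edge_type, dst)]
            paths.append((next_nodes, next_used))  # every reachable prefix is a path
            walk(dst, next_nodes, next_used)

    walk(start, [start], [])
    return paths


edges = [
    ("MYCN gene amplification", "causal", "cell cycle-related proteins"),
    ("cell cycle-related proteins", "causal", "tumor cell proliferation"),
    ("MYCN gene amplification", "causal", "telomerase activity"),
    ("telomerase activity", "causal", "neuroblastoma cell immortalization"),
]
initial_path_set = initial_paths(edges, "MYCN gene amplification")
node_sequences = [nodes for nodes, _ in initial_path_set]
```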
[0078] Step S1455: Select paths from the initial path set whose endpoint nodes are research target nodes, retain paths that contain core terminology nodes, at least one intermediate node, and research target nodes, and eliminate direct paths that only contain core terminology nodes and research target nodes, and output a candidate path set.
[0079] The research target node is "mechanism and correlation investigation." Paths ending at this node are selected from the initial path set. For example, the path "MYCN gene amplification → cell cycle-related proteins → tumor cell proliferation → mechanism of neuroblastoma development in children → mechanism and correlation investigation" contains a core term node, intermediate nodes, and the research target node, and is retained. If a direct path like "MYCN gene amplification → mechanism and correlation investigation" exists, it is discarded due to the lack of intermediate derivation. After screening, a candidate path set is obtained.
[0080] Step S1456: Perform path integrity check processing on each candidate path in the candidate path set. Based on the description of the degree of association of the associated edges, retain the candidate paths whose descriptions of the degree of association of all associated edges conform to the preset rules, and remove the candidate paths whose descriptions of the degree of association of the associated edges do not conform to the preset rules. Output multiple logical paths from the core term node of the problem to the research target node. Each logical path contains the core term node of the problem, intermediate nodes, research target nodes and corresponding associated edges, association types and degree of association of the associated edges connected in sequence.
[0081] The preset rule requires that the degree of association of every associated edge in a path reach "moderately closely correlated" or higher. Each candidate path is checked: if any associated edge is labeled "lowly closely correlated," the path is discarded; if all associated edges are labeled "moderately closely correlated" or "highly closely correlated," the path is retained. This yields multiple logical paths that meet the requirement, such as Path 1: MYCN gene amplification → cell cycle-related proteins (causal association, highly closely correlated) → tumor cell proliferation (causal association, moderately closely correlated) → mechanism of neuroblastoma development in children (subordinate association, highly closely correlated) → mechanism and correlation investigation (target association). The literature support for each path is also recorded, for example that the associations in a path originate from literature A, literature B, and so on.
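As a non-limiting illustration, the candidate path screening of step S1455 and the integrity check of step S1456 can be sketched together in Python as follows. The path representation (a node list paired with one degree-of-association label per edge), the assumption that every edge carries such a label, and the sample paths are hypothetical choices made for this sketch.

```python
from typing import List, Tuple

Path = Tuple[List[str], List[str]]  # (node sequence, closeness label of each edge)

# Labels that satisfy the preset rule ("moderately closely correlated" or higher).
ACCEPTED = {"moderately closely correlated", "highly closely correlated"}


def select_logical_paths(initial_paths: List[Path], target: str) -> List[Path]:
    """Keep paths that end at the target, have an intermediate node, and have no weak edge."""
    kept = []
    for nodes, closeness in initial_paths:
        ends_at_target = nodes[-1] == target
        has_intermediate = len(nodes) >= 3  # rejects direct core-term -> target paths
        all_close = all(label in ACCEPTED for label in closeness)
        if ends_at_target and has_intermediate and all_close:
            kept.append((nodes, closeness))
    return kept


TARGET = "mechanism and correlation investigation"
candidates = [
    (["MYCN gene amplification", "cell cycle-related proteins",
      "tumor cell proliferation", TARGET],
     ["highly closely correlated", "moderately closely correlated",
      "highly closely correlated"]),
    (["MYCN gene amplification", TARGET],            # direct path: discarded
     ["highly closely correlated"]),
    (["MYCN gene amplification", "telomerase activity", TARGET],
     ["lowly closely correlated", "highly closely correlated"]),  # weak edge: discarded
    (["MYCN gene amplification", "cell cycle-related proteins"],  # wrong endpoint
     ["highly closely correlated"]),
]
logical_paths = select_logical_paths(candidates, TARGET)
```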
[0082] Step S146: Integrate the nodes, related edges, degree of association, and logical paths in the extended semantic knowledge path to construct a dynamic knowledge association network with the original research question semantic knowledge path as the core, which includes newly added knowledge from the literature and multi-dimensional related information. The dynamic knowledge association network can intuitively display the relationship between the question and the literature knowledge.
[0083] The expanded knowledge nodes, various related edges (including original and newly added related edges), the degree of association of related edges, and the identified multiple logical paths are integrated together. The original semantic knowledge path of the scientific research question is used as the central framework, and the newly added knowledge nodes and related edges are expanded around the core framework. The logical paths are highlighted with different colors or line styles, forming a dynamic knowledge association network that can intuitively show the knowledge association from the core term node to the research target node.
[0084] Step S150: Integrate the knowledge nodes and related paths in the dynamic knowledge association network, extract the knowledge content that can answer research questions in pediatric oncology, and generate pediatric oncology research literature question-and-answer results by combining the literature source information corresponding to the related paths.
[0085] Step S151: Extract all logical paths pointing to the research target node from the dynamic knowledge association network. Each logical path contains multiple knowledge nodes and associated edges connected in sequence.
[0086] All logical paths whose endpoint is "mechanism and correlation investigation," such as Path 1 mentioned above and other similar paths, are extracted from the dynamic knowledge association network. These paths are the key clues for answering the research question.
[0087] Step S152: Extract the content of knowledge nodes and related edges in each logical path, integrate the terminology explanations corresponding to the knowledge nodes, the descriptions of the relationships and the degree of connection corresponding to the related edges, and form path knowledge fragments. Each path knowledge fragment corresponds to the core content of a logical path.
[0088] For the knowledge node "MYCN gene amplification" in Path 1, the terminology explanation is extracted as follows: "MYCN gene amplification is a common gene abnormality event in childhood neuroblastoma and is closely related to the malignancy of the tumor." The association edge "MYCN gene amplification → cell cycle-related proteins" is described as "MYCN gene amplification regulates the expression of cell cycle-related proteins," with the degree of association described as "highly closely correlated." The node "cell cycle-related proteins" is explained as "a class of proteins involved in cell cycle regulation, such as cyclins and cyclin-dependent kinases." The association edge "cell cycle-related proteins → tumor cell proliferation" is described as "abnormal expression of cell cycle-related proteins can promote tumor cell proliferation," with the degree of association described as "moderately closely correlated." Integrating the above content forms the path knowledge fragment for this path.
[0089] Step S153: Analyze the relevance of all path knowledge fragments to the pediatric oncology research problem, retain the path knowledge fragments that are directly related to the scientific questions to be explored in the pediatric oncology research problem, and remove the path knowledge fragments that are irrelevant to the problem or whose relevance does not meet the preset relevance standard, so as to obtain the core knowledge fragment set.
[0090] Step S1531: Extract the scientific question statement to be explored from the research questions on pediatric tumors, and decompose the content of the scientific question statement to be explored into multiple key question elements. The key question elements include the object of the question, the perspective of the question, and the expected answer dimension.
[0091] The scientific question to be explored is stated as "investigating the mechanism of action of MYCN gene amplification in the development and progression of neuroblastoma in children, and its correlation with clinical prognostic indicators." It is decomposed as follows: the object of the question is "MYCN gene amplification"; the question angles are "the mechanism of action in the development and progression of neuroblastoma in children" and "the correlation with clinical prognostic indicators"; and the expected answer dimensions are the specific process of the mechanism of action and the specific manifestations and degree of the correlation.
[0092] Step S1532: Analyze the content of each path knowledge segment, extract the core knowledge points in the path knowledge segment. The core knowledge points include term definitions, relationship descriptions and logical conclusions. Determine whether each core knowledge point corresponds to the key question element.
[0093] Analyze a knowledge segment from a specific path and extract core knowledge points such as "MYCN gene amplification definition: a common gene abnormality in childhood neuroblastoma," "relationship explanation: MYCN gene amplification regulates the expression of cell cycle-related proteins," and "logical conclusion: leads to accelerated tumor cell proliferation and promotes tumor development." Determine that these key points correspond to the question object "MYCN gene amplification" and the question angle "mechanism of action."
[0094] Step S1533: Count the number of core knowledge points in each path knowledge segment that correspond to the key question elements, calculate the proportion of the number of corresponding points to the total number of points in the segment, and obtain the point correspondence ratio.
[0095] If a certain path knowledge segment has a total of 5 core knowledge points, and 4 of them correspond to key question elements, then the correspondence ratio of the points is 4 / 5 = 0.8.
[0096] Step S1534: Check whether the core knowledge points in the path knowledge fragment can directly respond to the question angle and expected answer dimension in the key question elements, record the number of elements that can be directly responded to, and obtain the number of element responses.
[0097] There are two question angles in the key question elements, and multiple expected answer dimensions. If the path knowledge fragment can directly respond to the question angle of "mechanism of action" and the expected answer dimension of "specific process of mechanism of action", then the number of element responses is 2.
[0098] Step S1535: Based on the corresponding ratio of key points and the number of element responses, conduct a comprehensive assessment of the relevance between the path knowledge fragments and the scientific question to be explored according to the preset assessment rules, and generate a relevance assessment result.
[0099] The preset assessment rule is: Relevance assessment result = Key point correspondence ratio × 0.6 + Number of element responses × 0.4 (The maximum number of element responses is 3, which is adjusted according to the actual number of questioning elements). If the key point correspondence ratio is 0.8 and the number of element responses is 2, then the relevance assessment result = 0.8 × 0.6 + 2 × 0.4 = 0.48 + 0.8 = 1.28 (This is only an example calculation method; specific score ranges and thresholds can be set during actual assessment).
[0100] Step S1536: Retain the path knowledge fragments whose relevance assessment results reach the relevance assessment threshold, remove the path knowledge fragments that do not reach the relevance assessment threshold, and form the retained path knowledge fragments into a core knowledge fragment set.
[0101] A relevance assessment threshold of 1.0 is set. If the assessment result of a certain path knowledge fragment is 1.28, reaching the threshold, it is retained; if the assessment result of another fragment is 0.8, not reaching the threshold, it is removed. All retained fragments are combined into a core knowledge fragment set.
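Steps S1533 to S1536 can be sketched as three small functions. The weights 0.6 and 0.4 and the threshold 1.0 come from the example above; the function names, the dictionary fields, and the second sample fragment are assumptions for illustration.

```python
def point_correspondence_ratio(matching_points, total_points):
    """Share of a fragment's core knowledge points that match a key question element."""
    return matching_points / total_points if total_points else 0.0


def relevance_score(ratio, element_responses, ratio_weight=0.6, response_weight=0.4):
    """Weighted sum from the example assessment rule."""
    return ratio * ratio_weight + element_responses * response_weight


def select_core_fragments(fragments, threshold=1.0):
    """Keep the fragments whose relevance score reaches the threshold."""
    kept = []
    for fragment in fragments:
        ratio = point_correspondence_ratio(fragment["matching"], fragment["total"])
        score = relevance_score(ratio, fragment["responses"])
        if score >= threshold:
            kept.append((fragment["name"], round(score, 2)))
    return kept


fragments = [
    {"name": "path 1", "matching": 4, "total": 5, "responses": 2},  # 0.8 * 0.6 + 2 * 0.4
    {"name": "path 2", "matching": 1, "total": 5, "responses": 1},  # 0.2 * 0.6 + 1 * 0.4
]
core_set = select_core_fragments(fragments)
```

Because the number of element responses is not normalized, the score can exceed 1; an implementation that prefers a bounded score could divide the response count by its maximum before weighting, with the threshold adjusted accordingly.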
[0102] Step S154: Logically sort the fragments in the core knowledge fragment set, arranging them according to the derivation order from the core terminology of the problem to the research target node.
[0103] The fragments in the core knowledge fragment set correspond to different logical paths. They are ordered according to these paths, starting from the core term "MYCN gene amplification" and gradually deducing to the research target node "mechanism and correlation exploration", so that the knowledge content presents a progressive logical relationship.
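One possible reading of the logical sorting in Step S154 is to order fragments by how far their first node lies from the core term. The sketch below assumes that interpretation; the breadth-first distance, the function names, and the representation of a fragment as a list of node names are illustrative choices, not details given in the text.

```python
from collections import deque


def distances_from(core_term, edges):
    """Breadth-first distance (in edges) from the core term to every reachable node."""
    adjacency = {}
    for source, destination in edges:
        adjacency.setdefault(source, []).append(destination)
    distance = {core_term: 0}
    queue = deque([core_term])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, []):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


def sort_fragments(fragments, core_term, edges):
    """Order fragments by the distance of their first node from the core term."""
    distance = distances_from(core_term, edges)
    unreachable = len(distance)  # larger than any real distance, so these sort last
    return sorted(fragments, key=lambda path: (distance.get(path[0], unreachable), len(path)))


CORE = "MYCN gene amplification"
TARGET_NODE = "mechanism and correlation exploration"
network_edges = [
    (CORE, "cell cycle-related proteins"),
    ("cell cycle-related proteins", "tumor cell proliferation"),
    ("tumor cell proliferation", TARGET_NODE),
]
unsorted_fragments = [
    ["tumor cell proliferation", TARGET_NODE],
    [CORE, "cell cycle-related proteins"],
    ["cell cycle-related proteins", "tumor cell proliferation"],
]
ordered = sort_fragments(unsorted_fragments, CORE, network_edges)
```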
[0104] Step S155: Extract the literature source identifier and literature basis fragment from the associated edge attribute information corresponding to each core knowledge fragment, and find the literature source details corresponding to the extracted literature source identifier. The literature source details include the journal in which the literature was published, the publication time, the author, and the link to obtain the literature.
[0105] The source identifier of the literature is obtained from the associated edge attribute of the core knowledge fragment, such as PMID:32456789. Based on this identifier, the corresponding source details of the literature are found in the literature database, including the publication in the Journal of Clinical Oncology, Volume 38, Issue 15, 2020, with the authors Smith A et al., and the DOI link of the literature.
[0106] Step S156: Associate the sorted core knowledge fragments with the corresponding literature reference fragments and literature source details, organize them according to the preset answer content format, and form a pediatric oncology research literature Q&A result that includes the question-and-answer logic, knowledge basis, and literature source.
[0107] The core knowledge segments are arranged sequentially, with each segment followed by a corresponding reference and source details, such as "MYCN gene amplification promotes the proliferation of neuroblastoma cells in children by regulating the expression of cell cycle-related proteins (reference: 'MYCN amplification up regulates cyclin D1 expression, leading to increased proliferation of neuroblastoma cells.' PMID: 32456789). This study was published in the *Journal of Clinical Oncology* in 2020, by Smith A et al. (link: https://doi.org/...)". The content is organized according to this format to form a Q&A result on pediatric oncology research literature.
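The answer format in Steps S155 and S156 can be illustrated with a small formatter. The `SourceDetails` class and `format_answer_entry` function are hypothetical names; the sample values repeat the example above, and the link is left elided exactly as it appears there.

```python
from dataclasses import dataclass


@dataclass
class SourceDetails:
    journal: str
    year: int
    authors: str
    link: str


def format_answer_entry(statement, evidence, source_id, details):
    """Attach the literature basis fragment and source details to one knowledge statement."""
    return (
        f"{statement} (reference: '{evidence}' {source_id}). "
        f"This study was published in {details.journal} in {details.year}, "
        f"by {details.authors} (link: {details.link})"
    )


details = SourceDetails(
    journal="Journal of Clinical Oncology",
    year=2020,
    authors="Smith A et al.",
    link="https://doi.org/...",  # elided in the example text; not a resolvable link
)
answer_entry = format_answer_entry(
    "MYCN gene amplification promotes the proliferation of neuroblastoma cells in children",
    "MYCN amplification up regulates cyclin D1 expression.",
    "PMID:32456789",
    details,
)
```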
[0108] Step S160: Integrate the results of the question-and-answer session on pediatric oncology research literature with the core structure diagram of the dynamic knowledge association network to form the final intelligent question-and-answer output content for pediatric oncology research literature, and send the final intelligent question-and-answer output content for pediatric oncology research literature to the researchers' interactive terminal.
[0109] Step S161: Call the network visualization module of the pediatric tumor research model to visualize the core structure of the dynamic knowledge association network, retain the core term nodes, research target nodes and main association paths in the dynamic knowledge association network, simplify secondary nodes and redundant association edges, and generate a schematic diagram of the core structure of the dynamic knowledge association network.
[0110] Step S1611: Extract the core term nodes and research target nodes contained in the semantic knowledge path of the original scientific research question from the dynamic knowledge association network, and determine the extracted core term nodes and research target nodes as the core node set for visualization processing.
[0111] The core term nodes, such as “MYCN gene amplification”, “childhood neuroblastoma”, and “clinical prognostic indicators”, as well as the research target node “mechanism and correlation exploration”, are extracted from the original semantic knowledge path in the dynamic knowledge association network to form a core node set.
[0112] Step S1612: Analyze the association edges between nodes in the core node set of the dynamic knowledge association network, count the degree of association tightness of each association edge and the corresponding number of document support, and retain the association edges whose degree of association tightness meets the preset standard and whose number of document support reaches the preset threshold as the main association paths.
[0113] The association edges between core nodes are analyzed statistically. For example, the association edge between "MYCN gene amplification" and "cell cycle-related proteins" has a tightness of "highly close" and is supported by 15 documents. With a preset standard of "moderately close" or higher and a threshold of 5 supporting documents, this association edge is retained as a main association path. Conversely, an association edge that is only "lowly close" and is supported by only 3 documents is removed.
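The edge filter in Step S1612 amounts to two comparisons per edge. In this sketch the three-level ranking, the dictionary fields, and the second sample edge are assumptions; the minimum tightness and the threshold of 5 supporting documents follow the example above.

```python
# Assumed ordering of the tightness labels; the text only names the levels.
TIGHTNESS_RANK = {"low": 1, "moderate": 2, "high": 3}


def main_association_edges(edges, min_tightness="moderate", min_support=5):
    """Keep edges that meet both the tightness standard and the support threshold."""
    floor = TIGHTNESS_RANK[min_tightness]
    return [
        edge for edge in edges
        if TIGHTNESS_RANK[edge["tightness"]] >= floor and edge["support"] >= min_support
    ]


candidate_edges = [
    {"source": "MYCN gene amplification", "target": "cell cycle-related proteins",
     "tightness": "high", "support": 15},
    # Hypothetical weak edge used only to show removal.
    {"source": "MYCN gene amplification", "target": "example secondary term",
     "tightness": "low", "support": 3},
]
main_edges = main_association_edges(candidate_edges)
```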
[0114] Step S1613: Identify secondary nodes in the dynamic knowledge association network. Secondary nodes are nodes in the newly added terms whose association with the core nodes does not reach the preset degree of association and do not appear in the main association path. Merge or hide the identified secondary nodes to reduce the number of nodes in the diagram.
[0115] Consider a newly added term such as "phosphorylation level of certain signaling pathways". If the association between this node and the core nodes is only "lowly close" and the node does not appear in any main association path, it is identified as a secondary node and is either merged into the relevant core node or hidden directly to simplify the diagram.
[0116] Step S1614: Clean up redundant association edges in the main association paths. Redundant association edges are two association edges with the same node and the same association type. Retain association edges with the number of documents supporting the data reaching the preset threshold, and delete duplicate redundant association edges.
[0117] Suppose there are two association edges that are both "MYCN gene amplification → cell cycle-related proteins" and both causal, one with 10 supporting documents and the other with 3. With a preset threshold of 5 supporting documents, the association edge with 10 supporting documents is retained and the other, redundant association edge is deleted.
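The redundancy cleanup in Step S1614 can be sketched by grouping edges on their endpoints and association type. The function name and dictionary fields are assumptions, and so is the tie-breaking rule: when several duplicates reach the threshold, this sketch keeps the best-supported one.

```python
def remove_redundant_edges(edges, min_support=5):
    """For each (source, target, type) group keep one edge that reaches the threshold."""
    best = {}
    for edge in edges:
        if edge["support"] < min_support:
            continue  # below the preset threshold, never retained
        key = (edge["source"], edge["target"], edge["type"])
        if key not in best or edge["support"] > best[key]["support"]:
            best[key] = edge
    return list(best.values())


duplicate_edges = [
    {"source": "MYCN gene amplification", "target": "cell cycle-related proteins",
     "type": "causal", "support": 10},
    {"source": "MYCN gene amplification", "target": "cell cycle-related proteins",
     "type": "causal", "support": 3},
]
cleaned_edges = remove_redundant_edges(duplicate_edges)
```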
[0118] Step S1615: Assign a unique visual identifier to each node in the core node set. The visual identifier includes the node shape and base color. Different types of core nodes use different shapes, and nodes of the same type use the same base color.
[0119] Assign a circular shape to "Childhood Neuroblastoma" (disease name node) with a base color of blue; assign a square shape to "MYCN gene amplification" (gene mutation node) with a base color of red; assign a triangle shape to "Clinical prognostic indicators" (research indicator node) with a base color of green; and assign a diamond shape to the research target node "Mechanism and Correlation Investigation" with a base color of yellow.
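The assignment in Step S1615 is essentially a lookup from node type to a shape and base color. The mapping below repeats the four pairs from the example; the type labels used as keys and the function name are assumptions.

```python
# Shape and base color per node type, as listed in the example.
VISUAL_STYLE = {
    "disease name": ("circle", "blue"),
    "gene mutation": ("square", "red"),
    "research indicator": ("triangle", "green"),
    "research target": ("diamond", "yellow"),
}


def assign_visual_identifiers(node_types):
    """Map each node name to the (shape, color) pair of its type."""
    return {name: VISUAL_STYLE[node_type] for name, node_type in node_types.items()}


core_nodes = {
    "Childhood neuroblastoma": "disease name",
    "MYCN gene amplification": "gene mutation",
    "Clinical prognostic indicators": "research indicator",
    "Mechanism and correlation exploration": "research target",
}
styles = assign_visual_identifiers(core_nodes)
```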
[0120] Step S1616: Arrange the node positions in the visualization interface according to the relationship between the core nodes, so that the main related paths are evenly distributed and without intersection or occlusion. Use arrow lines to represent the related edges, and the direction of the arrows reflects the logical direction of the relationship.
[0121] In the visualization interface, core nodes are arranged according to logical relationships. For example, "MYCN gene amplification" is on the left, connected to "cell cycle-related proteins" on its right, and then connected to "tumor cell proliferation" on the right, and so on, so that the main association paths are distributed from left to right to avoid intersections. The association edges are represented by arrow lines. For example, the arrow pointing from "MYCN gene amplification" to "cell cycle-related proteins" indicates the direction of the causal relationship.
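A left-to-right arrangement like the one in Step S1616 can be approximated by placing each node in a column equal to its longest distance from a start node. This is a simplified sketch that assumes the main association paths form an acyclic graph; it does not by itself guarantee that edges never cross, and the function name and coordinate scheme are illustrative.

```python
def left_to_right_layout(edges):
    """Map each node to (column, row); column is its longest distance from a start node."""
    predecessors = {}
    for source, destination in edges:
        predecessors.setdefault(source, [])
        predecessors.setdefault(destination, []).append(source)

    column = {}

    def depth(node):
        # Nodes without predecessors land in column 0.
        if node not in column:
            column[node] = 1 + max((depth(p) for p in predecessors[node]), default=-1)
        return column[node]

    positions = {}
    rows_used = {}
    for node in predecessors:  # insertion order keeps the result deterministic
        col = depth(node)
        row = rows_used.get(col, 0)
        rows_used[col] = row + 1
        positions[node] = (col, row)
    return positions


layout = left_to_right_layout([
    ("MYCN gene amplification", "cell cycle-related proteins"),
    ("cell cycle-related proteins", "tumor cell proliferation"),
])
```

Because every edge runs from a lower column to a higher one, all arrows point rightward, which matches the causal direction described above.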
[0122] Step S1617: Add association type description labels and association tightness description labels to the association edges. The labels are located in the middle of the association edges. Adjust the font size and color of the labels to make them visible. This completes the generation of the core structure diagram of the dynamic knowledge association network.
[0123] Add the label "Causal Relationship (Highly Tight)" to the middle of the association edge of "MYCN Gene Amplification → Cell Cycle Related Proteins," setting the font size to 10 and the color to black, ensuring it is clearly visible in the diagram. Perform the same processing on the association edges of all major association paths to complete the generation of the core structure diagram.
[0124] Step S162: In the schematic diagram of the core structure of the dynamic knowledge association network, the nodes and associated paths corresponding to the core knowledge fragments in the question-and-answer results of pediatric oncology research literature are specially marked. The marking method includes distinguishing nodes by color and bolding associated edges.
[0125] The core knowledge fragments in the pediatric oncology research literature Q&A results correspond to certain nodes and association paths in the dynamic knowledge association network. For example, for the core knowledge fragments involving the path "MYCN gene amplification → cell cycle-related proteins → tumor cell proliferation", the nodes "MYCN gene amplification", "cell cycle-related proteins" and "tumor cell proliferation" in this path are highlighted in a darker color and the association edges between them are bolded, to emphasize the parts of the network that are relevant to the answer content.
[0126] Step S163: Divide the results of the Q&A on pediatric oncology research literature into multiple answer chapters according to the answer logic. Each answer chapter corresponds to a set of related core knowledge fragments. Add a chapter title to each answer chapter. The title content should be able to summarize the core answer content of the chapter.
[0127] The Q&A results are divided into chapters such as "Mechanism of action of MYCN gene amplification in the occurrence and development of neuroblastoma in children" and "Correlation analysis of MYCN gene amplification and clinical prognostic indicators". Each chapter corresponds to a set of related core knowledge segments, and the chapter title accurately summarizes the core content of that part.
[0128] Step S164: At the end of each answer chapter, add a corresponding network structure guide that describes where the content of the answer chapter sits in the core structure diagram of the dynamic knowledge association network and which association path it follows, so that researchers can use the diagram to understand the knowledge association network.
[0129] At the end of the chapter “Mechanism of MYCN gene amplification in the development and progression of neuroblastoma in children”, add the following guidance: “The content of this chapter corresponds to the bolded path in the diagram of the core structure of the dynamic knowledge association network, where the red square node ‘MYCN gene amplification’ on the left points to the blue circular node ‘cell cycle-related proteins’, and then to the green triangle node ‘tumor cell proliferation’.”
Step S165: The question and answer results of pediatric oncology research literature after being divided into chapters are integrated with the core structure diagram of the dynamic knowledge association network after being marked. The question and answer results are located in the left area, and the diagram is located in the right area. The two are linked by chapter number and marking symbols.
[0130] When formatting, place the content of each chapter of the Q&A results in the left area, and place the core structure diagram after marking in the right area. Add numbers such as "1." and "2." before the chapter titles, and add the same numbers next to the nodes or paths of the corresponding chapter content in the diagram to establish a one-to-one correspondence.
[0131] Step S166: Standardize the text format of the integrated content, adjust the font style, paragraph spacing and heading level to conform to the format standards that meet the reading habits of researchers, and form the final intelligent question and answer output content of pediatric tumor research literature.
[0132] The font of the question-and-answer results is standardized to SimSun, size 12, with paragraph spacing set to 1.5 line spacing. Chapter titles are in bold, size 3, and first-level subheadings are in bold, size 4, to conform to the common literature reading format habits of researchers, thus forming the final intelligent question-and-answer output content.
[0133] Step S167: Send the final intelligent Q&A output of pediatric oncology research literature to the researcher's interactive terminal.
[0134] The final Q&A output, after being formatted and integrated, is sent to interactive terminals such as computers and tablets used by researchers via network transmission protocols, enabling them to access and view it in a timely manner.
[0135] Furthermore, the update and maintenance process of the pre-defined pediatric oncology research literature database is integrated into the iterative optimization phase of the pediatric oncology research model. The method may also include: Step S170: Update and maintain the document database.
[0136] Step S171: Regularly collect newly published research literature in the field of pediatric oncology, including recently published academic journal articles, papers from the latest scientific conferences, and newly released clinical trial reports.
[0137] We set a monthly collection cycle and automatically retrieved journal articles published in the field of pediatric oncology within the past month through academic database interfaces (such as PubMed, Web of Science, etc.). We also assigned dedicated personnel to collect the proceedings of the latest international pediatric oncology conferences (such as SIOP conferences) and paid attention to newly published clinical trial reports related to pediatric neuroblastoma on clinical trial registration platforms (such as ClinicalTrials.gov) to ensure that the collected literature is timely and relevant.
[0138] Step S172: Conduct a text quality review on the collected newly published research literature, check whether the content of the literature conforms to the academic norms of the field of pediatric oncology research, remove literature that does not conform to the academic norms of the field of pediatric oncology research, and obtain a qualified set of new literature.
[0139] Professional pediatric oncology researchers reviewed the collected literature, checking whether it included a complete abstract, introduction, methods, results, discussion, and conclusions; whether the research design was scientifically sound; whether the data was accurate and reliable; and whether the citations were standardized. Literature suspected of data fabrication, with obvious methodological flaws, or irrelevant to the field of pediatric oncology was removed, and only those conforming to academic standards were retained to form a qualified new literature collection.
[0140] Step S173: Perform text structuring on each document in the qualified new document set, extracting the document's title, author, publication date, journal name, abstract, keywords, main text paragraphs, and reference list to form structured document data.
[0141] The full-text PDF or HTML format of qualified new documents is parsed using a text parsing tool, automatically extracting the title, authors (including each author's name, affiliation, and email address), publication date (accurate to year, month, and day), journal name (including volume, issue, and page numbers), abstract (structured abstracts should distinguish between purpose, methods, results, and conclusions), keywords (including author keywords and standard keywords), main text paragraphs (broken by chapter, such as 1. Introduction, 2. Materials and Methods, etc.), and a list of references (each reference includes complete information such as author, title, journal, and publication date). The extracted information is then stored according to a preset data structure to form structured document data.
[0142] Step S174: Call the literature database update submodule of the pediatric oncology research large model, add the structured literature data to the preset pediatric oncology research literature database, assign a unique literature identifier to each newly added document, and establish a mapping between the literature identifier and the structured literature data.
[0143] After receiving structured document data, the document database update submodule assigns a unique document identifier to each new document according to the document entry rules, such as generating it in UUID format. A one-to-one mapping is established between the document identifier and the document's structured data (title, author, text, etc.), and this mapping is stored in a relational database table of the document database. Simultaneously, the full-text document is stored in a distributed file system, with the document identifier serving as an index linking the full-text document in the file system.
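The identifier assignment in Step S174 can be sketched with Python's standard `uuid` module. The in-memory dictionary stands in for the relational table mentioned above, and the function name and sample records are assumptions for illustration.

```python
import uuid


def register_documents(structured_docs, registry=None):
    """Assign a random UUID to each document and map it to the structured data."""
    registry = {} if registry is None else registry
    for document in structured_docs:
        document_id = str(uuid.uuid4())  # version 4 UUIDs are randomly generated
        registry[document_id] = document
    return registry


# Placeholder records; real entries would carry authors, text, references, and so on.
new_documents = [
    {"title": "Example title A", "journal": "Example Journal"},
    {"title": "Example title B", "journal": "Example Journal"},
]
document_registry = register_documents(new_documents)
```

The same identifier can then serve as the key under which the full text is stored in the file system.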
[0144] Step S175: Update the index of literature data in the updated pediatric oncology research literature database, adjust the keyword correspondence and association index rules of the literature retrieval index, so that newly added literature can be recognized by subsequent retrieval operations.
[0145] The newly added literature data is indexed using a full-text search engine (such as Elasticsearch), extracting keywords, terms, and term relationships from the literature. The keyword inverted index is then updated to ensure that new literature can be matched when relevant keywords are entered in subsequent searches. Simultaneously, the association indexing rules are adjusted to incorporate term relationships appearing in the new literature into the index system, so that the literature knowledge matching module can retrieve this new association information.
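The keyword inverted index updated in Step S175 can be illustrated with a plain in-memory structure. This sketch is not the Elasticsearch API; it only shows the underlying idea that each keyword maps to the set of documents containing it. The function names and sample identifiers are hypothetical.

```python
from collections import defaultdict


def add_to_index(index, document_id, keywords):
    """Register a document under each of its keywords (case-insensitive)."""
    for keyword in keywords:
        index[keyword.lower()].add(document_id)


def search(index, keyword):
    """Return the sorted identifiers of all documents indexed under the keyword."""
    return sorted(index.get(keyword.lower(), set()))


inverted_index = defaultdict(set)
add_to_index(inverted_index, "doc-001", ["MYCN", "neuroblastoma"])
# A newly added document becomes retrievable as soon as it is indexed.
add_to_index(inverted_index, "doc-002", ["neuroblastoma", "cyclin D1"])
hits = search(inverted_index, "Neuroblastoma")
```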
[0146] Step S176: Input the structured literature data of the newly added literature into the parameter optimization submodule of the pediatric oncology research model. Adjust the model parameters through semantic understanding and knowledge update tasks of the literature text so that the model can master the professional knowledge and terminology relationships in the new literature.
[0147] The parameter optimization submodule takes the structured data from new documents, including abstracts and paragraphs, as input for semantic understanding tasks, such as predicting missing terms or relationships in the text, and adjusting the model's weight parameters through backpropagation. Simultaneously, a dedicated knowledge update task is designed to address the specialized knowledge and terminological relationships in the new documents, enabling the model to learn and memorize this new knowledge and update its internal knowledge representation.
[0148] Step S177: Record the time of each literature database update, the number of new literatures, and the adjustments to model parameters to form an update and maintenance log.
[0149] Each time the literature database is updated, the system automatically records the specific time of the update (accurate to the hour, minute, and second), the number of newly added documents (including journal articles, conference papers, and clinical trial reports), the number of model layers whose parameters were adjusted, and the range and magnitude of the adjusted parameters. This information forms an update and maintenance log to facilitate the tracking and management of the iteration history of the literature database and the model.
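One entry of the update and maintenance log in Step S177 might be recorded as follows. The field names, the JSON-lines serialization, and the sample values are assumptions; the text only specifies which kinds of information are logged.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class UpdateLogEntry:
    updated_at: str                 # timestamp with second precision
    new_documents: dict             # count per document type
    layers_adjusted: int            # number of model layers whose parameters changed
    parameter_change_summary: str   # free-text range and magnitude of the adjustment


def make_log_entry(new_documents, layers_adjusted, parameter_change_summary, now=None):
    """Build a log entry, stamping it with the current UTC time unless `now` is given."""
    now = now or datetime.now(timezone.utc)
    return UpdateLogEntry(
        updated_at=now.isoformat(timespec="seconds"),
        new_documents=new_documents,
        layers_adjusted=layers_adjusted,
        parameter_change_summary=parameter_change_summary,
    )


def to_json_line(entry):
    """Serialize one entry as a single JSON line, ready to append to a log file."""
    return json.dumps(asdict(entry), ensure_ascii=False)


entry = make_log_entry(
    {"journal articles": 40, "conference papers": 6, "clinical trial reports": 2},
    layers_adjusted=4,
    parameter_change_summary="illustrative placeholder",
    now=datetime(2026, 1, 1, 12, 30, 45, tzinfo=timezone.utc),
)
log_line = to_json_line(entry)
```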
[0150] During the data collection phase, we strictly adhere to relevant laws and regulations. For potentially privacy-sensitive data in pediatric oncology research literature, such as clinical research literature containing patient personal information, we obtain legal data use authorization documents through the ethics review committee of the source institution before collection. For literature data obtained from public databases, we verify the database's license agreement to ensure that the collection complies with the agreement terms, clearly stating that the data is intended solely for research purposes and cannot be used for other commercial or unauthorized scenarios. During data transmission, we employ end-to-end encryption technology, using SSL/TLS encryption to prevent unauthorized interception or leakage of data during transmission.
[0151] Regarding data licensing management, a strict authorization registration system will be established. For document data involving personal privacy, informed consent or authorization forms from the data subject will be required, clearly recording the scope of authorization, usage period, and data processing methods. Authorization documents will be stored using blockchain technology to ensure the immutability and traceability of authorization records. A dedicated data compliance review team will be established to regularly verify the validity of data authorization documents. For data whose authorization period has expired or whose scope of authorization has changed, use will be promptly terminated or a new authorization application will be submitted.
[0152] In the data storage stage, standard-compliant encryption storage technology is used to encrypt privacy-sensitive data before storing it on servers that meet the Level 3 security protection requirements. Strict access control policies are implemented, employing a role-based access control model to assign different access permissions to personnel in different positions, authorizing only necessary personnel to access sensitive data. Servers are deployed in physically isolated data centers, equipped with 24-hour monitoring and intrusion detection systems to prevent unauthorized physical or network access.
[0153] During data usage, privacy-sensitive data undergoes de-identification processing, removing directly identifiable personal information such as names, ID numbers, and contact information, and anonymizing indirect identifiers. A data usage log is established, detailing the access time, access personnel, purpose of use, and operations performed; this log is retained for at least three years. Data sharing and transmission are strictly limited to authorized scope. If data needs to be provided to third parties, a data sharing agreement must be signed, clearly defining the third party's responsibilities and confidentiality obligations, and the shared data must be anonymized.
[0154] Regularly conduct data security and privacy protection training to raise the legal and security awareness of relevant personnel. Engage a third-party organization annually to conduct compliance assessments of data processing activities, promptly identifying and rectifying potential compliance risks. Establish a data breach emergency response mechanism, develop contingency plans, and in the event of a data breach, take timely measures to control the escalation of the situation and report to relevant departments as required by laws and regulations. Through the above technical and management measures, ensure that data collection and authorization comply with legal and regulatory requirements, protecting personal privacy and data security.
[0155] In one exemplary embodiment, a literature intelligent question-answering system based on a pediatric oncology research large model is provided. The system may be a terminal, a server, or the like, and its internal structure may be as shown in Figure 2. The system includes a processor, memory, an input/output interface, a communication interface, a display unit, and an input device. The processor, memory, and input/output interface are connected via a system bus, and the communication interface, display unit, and input device are connected to the system bus via the input/output interface. The processor provides computational and control capabilities. The memory includes a non-volatile storage medium and internal memory. The non-volatile storage medium stores the operating system and computer programs. The internal memory provides the environment in which the operating system and computer programs in the non-volatile storage medium run. The input/output interface is used for exchanging information between the processor and external devices. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, near-field communication, or other technologies. When the computer program is executed by the processor, it implements the literature intelligent question-answering method based on a pediatric oncology research large model. The display unit is used to produce a visible image and can be a display screen, projection device, or virtual reality imaging device. The display screen can be an LCD screen or an e-ink screen. The input device can be a touch layer covering the display screen; a button, trackball, or touchpad on the housing of the system; or an external keyboard, touchpad, or mouse.
[0156] It should be noted that, in order to simplify the description of the present invention and thus help to understand one or more embodiments of the invention, multiple features may sometimes be grouped into one embodiment, drawing or description thereof in the foregoing description of the embodiments of the present invention.
Claims
1. A literature intelligent question-answering method based on a pediatric oncology research large model, characterized in that the method includes:
Receiving a pediatric oncology research question input in natural language by a researcher. The research question includes a professional research direction in the field of pediatric oncology, the scientific question to be explored, and a description of the relevant research background.
Invoking the semantic knowledge path construction module of the pediatric oncology research large model to perform semantic deconstruction on the pediatric oncology research question, extracting the core terms, the relationships between terms, and the research objective in the question, and generating the semantic knowledge path of the research question.
Based on the semantic knowledge path of the research question, triggering the literature knowledge matching module of the pediatric oncology research large model to perform a targeted literature search in the preset pediatric oncology research literature database, and selecting the literature whose content matches the core terms and association relationships in the semantic knowledge path to obtain a matched literature set. The matched literature set includes the full text of the literature, the professional term annotations in the literature, and the term association expression fragments.
Invoking the dynamic knowledge association module of the pediatric oncology research large model to perform knowledge association analysis on the full text of the literature, the professional term annotations, and the term association expression fragments in the matched literature set, and constructing a dynamic knowledge association network with the semantic knowledge path of the research question as its core. The dynamic knowledge association network includes the association paths between literature terms and question terms, descriptions of the association strength, and the logical deduction relationships between knowledge nodes.
Integrating the knowledge nodes and association paths in the dynamic knowledge association network, extracting the knowledge content that can answer the pediatric oncology research question, and generating the pediatric oncology research literature question-and-answer result in combination with the literature source information corresponding to the association paths.
Linking and integrating the pediatric oncology research literature question-and-answer result with the core structure diagram of the dynamic knowledge association network to form the final intelligent question-and-answer output content for pediatric oncology research literature, and sending the final intelligent question-and-answer output content to the researcher's interactive terminal.
2. The literature intelligent question and answer method based on a pediatric oncology research large model according to claim 1, characterized in that, The semantic knowledge path construction module of the pediatric oncology research big data model performs semantic deconstruction processing on the pediatric oncology research question, extracts the core terms, inter-term relationships, and research objectives of the pediatric oncology research question, and generates a semantic knowledge path for the research question, including: The pediatric oncology research question is input into the terminology extraction submodule of the semantic knowledge path construction module of the pediatric oncology research big model. The question text is then subjected to sentence-by-sentence semantic analysis to identify the pediatric oncology disease name, pathological mechanism terminology, experimental technique terminology, research indicator terminology, and conclusion expression terminology involved in the question text, thereby forming an initial terminology list. Each term in the initial terminology list is assigned to a specific domain, distinguishing between basic medical terms, clinical research terms, experimental technology terms, and statistical analysis terms. Terms that are not directly related to the research topic of pediatric oncology are then removed, resulting in a core terminology set. The relation definition submodule of the semantic knowledge path construction module of the pediatric tumor research big data model is called to analyze the grammatical position, semantic collocation and logical connectors of each term in the core term set in the question text, and determine the association type between terms. The association type includes causal association, subordinate association, comparative association and synergistic association, forming a list of term association relationships. 
The research objective statements are extracted from the pediatric oncology research question, and the action orientation and expected result in the research objective statements are analyzed to determine the research objective orientation, which is converted into a target node in the graph. The target node contains a target type description and target-related terms. Each term in the core term set is used as a graph node. Based on the association types in the term association relationship list, association edges are constructed between nodes, and association type descriptions are marked on the association edges. The target node is connected to the relevant core term nodes through target association edges to form an initial semantic knowledge path. The initial semantic knowledge path is structurally optimized by merging semantically redundant nodes, adjusting the connection logic of the association edges to conform to the knowledge system of the pediatric oncology field, and supplementing the domain attribute descriptions of the nodes, thereby generating the semantic knowledge path of the research question.
3. The method of claim 1, wherein, based on the semantic knowledge path of the research question, triggering the literature knowledge matching module of the pediatric oncology research large model to perform a targeted literature search in the preset pediatric oncology research literature database, and filtering out literature containing content that matches the core terms and association relationships in the semantic knowledge path to obtain a set of matched literature, includes: All core term nodes and the types of association edges between nodes are extracted from the semantic knowledge path of the research question to form a combination of search keywords and search conditions for related relationships.
The combination of search keywords includes the full name of the core term and commonly used expressions in the field, and the search conditions for related relationships include the logical search rules corresponding to the types of associated edges. The search keyword combination and related search conditions are input into the search strategy generation submodule of the literature knowledge matching module of the pediatric tumor research big data model to generate a targeted search strategy containing multi-round search logic. The multi-round search logic includes a first round of coarse search based on core terms, a second round of fine screening based on related relationships, and a final round of fine matching based on semantic similarity. The literature database retrieval submodule of the literature knowledge matching module of the pediatric oncology research big model is invoked. According to the first-round retrieval logic in the targeted retrieval strategy, literature containing any core terms is retrieved in the preset pediatric oncology research literature database to obtain a coarse retrieval literature set. Based on the secondary search logic in the targeted search strategy, text analysis is performed on each document in the coarse search document set to check whether there are term association expressions in the document text that match the association search conditions. Documents with matching expressions are retained to obtain a finely screened document set. According to the final round retrieval logic in the targeted retrieval strategy, the semantic similarity between the text content of each document in the finely screened document set and the semantic knowledge path of the scientific research question is calculated. The semantic similarity is derived based on the overlap between document terms and graph nodes, and the matching degree between document term associations and graph association edges. 
Documents whose semantic similarity meets the preset requirements are retained, and their full text is extracted. The terminology annotation submodule of the literature knowledge matching module of the pediatric oncology research large model annotates the full text with professional terms, identifying and marking the core terms and term association expression fragments in each document, to form a set of matched literature containing the full text of the literature, professional term annotations, and term association expression fragments.
4. The method of claim 1, wherein invoking the dynamic knowledge association module of the pediatric oncology research large model to perform knowledge association analysis on the full text of the literature, the professional term annotations, and the term association expression fragments in the matched literature set, and constructing a dynamic knowledge association network with the semantic knowledge path of the research question as its core, includes: The professional term annotations of each document in the matched literature set are compared with the core term nodes in the semantic knowledge path of the research question to identify matching terms and newly added terms in the documents. The matching terms are document terms that are semantically consistent with the core term nodes, and the newly added terms are professional terms in the documents that are related to the research topic but do not appear in the graph. Semantic analysis is performed on the term association expression fragments of each document to extract the term association relationships they contain.
Among the extracted term association relationships, matching association relationships and newly added association relationships are identified. The matching association relationships are those consistent with the association edge types in the semantic knowledge path of the research question, and the newly added association relationships are term associations that appear in the literature but are not recorded in the graph. The newly added terms are added as new knowledge nodes, and the newly added association relationships are added as new association edges, to the semantic knowledge path of the research question to form an extended semantic knowledge path. The corresponding literature source identifiers are marked on the newly added knowledge nodes, and the literature basis fragments of each relationship are marked on the newly added association edges. The degree of association between each node and other nodes in the extended semantic knowledge path is calculated, and the degree of association is described as attribute information of the association edges between nodes. The degree of association is calculated according to a preset rule from the number of direct association edges between nodes, the length of indirect association paths, and the frequency of association descriptions in the literature. The logical derivation relationships between knowledge nodes in the extended semantic knowledge path are analyzed to identify multiple logical paths from the core term nodes of the question to the research target node. Each logical path contains sequentially connected knowledge nodes and association edges, and the literature support basis corresponding to each logical path is recorded. By integrating the nodes, association edges, degrees of association, and logical paths in the extended semantic knowledge path, a dynamic knowledge association network is constructed with the semantic knowledge path of the original research question as the core, including newly added knowledge from the literature and multi-dimensional association information.
The dynamic knowledge association network can intuitively display the connection between the question and the knowledge in the literature.
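The graph-extension step of claim 4 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the `KnowledgeGraph` class, the `extend_path` function, and the example terms are hypothetical, and "semantically consistent" matching is simplified to exact string equality.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeGraph:
    # term -> set of literature source identifiers (empty for question terms)
    nodes: dict = field(default_factory=dict)
    # (head, association_type, tail) -> list of (doc_id, basis fragment)
    edges: dict = field(default_factory=dict)


def extend_path(graph, doc_id, terms, relations):
    """Merge one annotated document into the graph.

    terms: professional terms annotated in the document.
    relations: (head, association_type, tail, fragment) tuples.
    Returns the newly added terms and the newly added relation keys.
    Assumption: a term "matches" only when the strings are identical.
    """
    new_terms, new_relations = [], []
    for term in terms:
        if term not in graph.nodes:
            graph.nodes[term] = set()      # newly added knowledge node
            new_terms.append(term)
        graph.nodes[term].add(doc_id)      # record the literature source
    for head, rel, tail, fragment in relations:
        key = (head, rel, tail)
        if key not in graph.edges:
            graph.edges[key] = []          # newly added association edge
            new_relations.append(key)
        graph.edges[key].append((doc_id, fragment))
    return new_terms, new_relations
```

In this sketch the source identifier is recorded for matching terms as well as new ones, which is a simplification; the claim only requires it for newly added nodes.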
5. The literature intelligent question answering method based on the pediatric oncology research large model according to claim 4, characterized in that calculating the degree of association between each node and other nodes in the extended semantic knowledge path, and describing the degree of association as attribute information of the association edges between nodes, includes: The number of direct association edges between any two nodes in the extended semantic knowledge path is counted, and the count is recorded. A direct association edge is an edge that connects two nodes without passing through any other node. All indirect association paths between the two nodes in the extended semantic knowledge path are traversed. An indirect association path is a sequence of association edges connecting two nodes through one or more intermediate nodes. The number of association edges contained in each indirect association path is recorded as the path length, and the average path length of all indirect association paths between the two nodes is calculated. All fragments in the matched literature set that express an association between the terms of the two nodes are extracted, the total frequency of the extracted fragments in the documents is counted, and the statistical results of the frequency of related terms are recorded. According to the preset weight allocation rules, corresponding calculation weights are assigned to the statistical results of the number of direct association edges, the average path length, and the statistical results of the frequency of related terms. The weight allocation rules are set based on the importance characteristics of knowledge association in the field of pediatric oncology.
Based on the assigned calculation weights, the statistical results of the number of directly related edges, the average path length, and the statistical results of the frequency of related terms are comprehensively calculated to obtain the numerical value of the degree of association between the two nodes. The numerical value of the degree of association is converted into a textual description of the degree of association, and this description is added to the association edge attribute information between the two nodes to complete the association edge attribute annotation.
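As an illustration of the calculation in claim 5, the following Python sketch combines the three statistics into one score. The claim leaves the weight allocation rules and the conversion to text unspecified, so the weights `(0.5, 0.3, 0.2)`, the use of the inverse average path length, the label thresholds, and the function names are all assumptions made for this example.

```python
from collections import defaultdict


def simple_paths(adj, src, dst, path=None):
    """Yield every simple path (no repeated node) from src to dst."""
    path = path or [src]
    if src == dst:
        yield path
        return
    for nxt in adj[src]:
        if nxt not in path:
            yield from simple_paths(adj, nxt, dst, path + [nxt])


def association_degree(edges, a, b, fragment_count, weights=(0.5, 0.3, 0.2)):
    """Score the association between nodes a and b.

    edges: (node, node, association_type) tuples, treated as undirected.
    fragment_count: frequency of literature fragments relating a and b.
    """
    # Number of direct association edges between a and b
    direct = sum(1 for u, v, _ in edges if {u, v} == {a, b})

    adj = defaultdict(set)
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)

    # Indirect paths pass through at least one intermediate node
    lengths = [len(p) - 1 for p in simple_paths(adj, a, b) if len(p) > 2]
    avg_len = sum(lengths) / len(lengths) if lengths else 0.0

    # Assumption: shorter indirect paths mean closer association
    closeness = 1.0 / avg_len if avg_len else 0.0

    w_direct, w_path, w_freq = weights
    score = w_direct * direct + w_path * closeness + w_freq * fragment_count

    # Assumed thresholds for the textual description
    if score >= 1.0:
        label = "strong"
    elif score >= 0.5:
        label = "moderate"
    else:
        label = "weak"
    return score, label
```

Enumerating all simple paths grows quickly with graph size, so a practical system would likely cap the path length; the claim does not state such a limit.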
6. The intelligent question-answering method for literature based on a large-scale pediatric tumor research model according to claim 1, characterized in that, The process involves integrating knowledge nodes and association paths within the dynamic knowledge network, extracting knowledge content capable of answering the pediatric oncology research questions, and combining this with literature source information corresponding to the association paths to generate pediatric oncology research literature question-and-answer results, including: Extract all logical paths pointing to the research target node from the dynamic knowledge association network. Each logical path contains multiple sequentially connected knowledge nodes and associated edges. Content is extracted from the knowledge nodes and related edges in each logical path. The terminology explanations corresponding to the knowledge nodes, the descriptions of the relationships and the degree of connection corresponding to the related edges are integrated to form path knowledge fragments. Each path knowledge fragment corresponds to the core content of a logical path. Analyze the relevance of all path knowledge fragments to the pediatric oncology research problem, retain the path knowledge fragments that are directly related to the scientific questions to be explored in the pediatric oncology research problem, and remove the path knowledge fragments that are irrelevant to the problem or whose relevance does not meet the preset relevance standard to obtain the core knowledge fragment set. 
The fragments in the core knowledge fragment set are logically sorted according to the derivation order from the core terminology of the problem to the research target node; Extract the literature source identifier and literature basis fragment from the associated edge attribute information corresponding to each core knowledge fragment, and find the literature source details corresponding to the extracted literature source identifier. The literature source details include the journal in which the literature was published, the publication time, the author, and the link to obtain the literature. The sorted core knowledge segments are associated with the corresponding literature reference segments and literature source details, and organized according to the preset answer content format to form a question and answer result for pediatric oncology research literature that includes the question-and-answer logic, knowledge basis, and literature source.
7. The intelligent question-answering method for literature based on a large-scale pediatric tumor research model according to claim 6, characterized in that, The analysis examines the relevance of all path knowledge fragments to the pediatric oncology research question, retaining path knowledge fragments directly related to the scientific questions to be explored within the pediatric oncology research question, and removing path knowledge fragments irrelevant to the question or whose relevance does not meet the preset relevance criteria, resulting in a core knowledge fragment set, including: Extract the scientific question statement to be explored from the pediatric tumor research problem, and decompose the content of the scientific question statement to be explored into multiple key question elements, which include the question object, question angle and expected answer dimension; Each path knowledge segment is analyzed to extract the core knowledge points, which include term definitions, relationship descriptions and logical conclusions. It is then determined whether each core knowledge point corresponds to a key question element. Count the number of core knowledge points in each path knowledge segment that correspond to the key question elements, calculate the proportion of the corresponding key points to the total number of key points in the segment, and obtain the key point correspondence ratio. 
Check whether the core knowledge points in the path knowledge segment can directly respond to the question angles and expected answer dimensions in the key question elements, record the number of elements that can be directly responded to, and obtain the number of element responses; Based on the corresponding proportions of the key points and the number of element responses, the relevance between the path knowledge fragments and the scientific questions to be explored is comprehensively evaluated according to the preset evaluation rules, and a relevance evaluation result is generated. Path knowledge fragments that reach the relevance assessment threshold are retained, while path knowledge fragments that do not reach the relevance assessment threshold are removed. The retained path knowledge fragments are then combined into a core knowledge fragment set.
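The relevance assessment of claim 7 can be sketched as follows. This is a simplified illustration only: correspondence between a knowledge point and a question element is reduced to a case-insensitive keyword test, and the thresholds `min_ratio` and `min_responses`, the element keys, and the function name are assumed rather than taken from the claims.

```python
def assess_fragment(points, elements, min_ratio=0.5, min_responses=1):
    """Evaluate one path knowledge fragment against the question elements.

    points: core knowledge points of the fragment, as strings.
    elements: dict with keys "object", "angle", "dimension" (keywords).
    Returns (key point correspondence ratio, element response count, keep).
    """
    def mentions(point, keyword):
        # Assumption: a keyword hit stands in for semantic correspondence
        return keyword.lower() in point.lower()

    # Knowledge points that correspond to at least one question element
    matched = sum(
        1 for p in points if any(mentions(p, k) for k in elements.values())
    )
    ratio = matched / len(points) if points else 0.0

    # Question angles and answer dimensions that some point responds to
    responses = sum(
        1
        for name in ("angle", "dimension")
        if any(mentions(p, elements[name]) for p in points)
    )

    keep = ratio >= min_ratio and responses >= min_responses
    return ratio, responses, keep
```

Fragments for which `keep` is true would form the core knowledge fragment set; requiring both thresholds at once is one possible reading of the "preset evaluation rules".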
8. The intelligent question-answering method for literature based on a large-scale pediatric tumor research model according to claim 1, characterized in that, The process of linking and integrating the results of the question-and-answer session on pediatric oncology research literature with the core structural diagram of the dynamic knowledge association network to form the final intelligent question-and-answer output content for pediatric oncology research literature includes: The network visualization module of the large-scale pediatric tumor research model is invoked to visualize the core structure of the dynamic knowledge association network. The core term nodes, research target nodes and main association paths in the dynamic knowledge association network are retained, while secondary nodes and redundant association edges are simplified to generate a schematic diagram of the core structure of the dynamic knowledge association network. In the schematic diagram of the core structure of the dynamic knowledge association network, the nodes and associated paths corresponding to the core knowledge fragments in the question-and-answer results of the pediatric oncology research literature are specially marked. The marking method includes distinguishing nodes by color and bolding the associated edges. The results of the Q&A on pediatric oncology research literature are divided into multiple answer chapters according to the answer logic. Each answer chapter corresponds to a set of related core knowledge fragments. A chapter title is added to each answer chapter, and the title content can summarize the core answer content of the chapter. 
At the end of each solution chapter, add corresponding network structure guidance, describing the position of the content of the solution chapter in the core structure diagram of the dynamic knowledge association network and the associated path, to guide researchers to understand the knowledge association context through the diagram; The results of the question and answer questions on pediatric oncology research literature after being divided into chapters are integrated with the core structure diagram of the dynamic knowledge association network after being marked. The question and answer results are located in the left area, and the diagram is located in the right area. The two are linked by chapter numbers and marking symbols. The integrated content undergoes standardized text formatting, adjusting font styles, paragraph spacing, and heading levels to conform to the reading habits of researchers, ultimately forming the final intelligent Q&A output content for pediatric oncology research literature.
9. The intelligent question-answering method for literature based on a large-scale pediatric tumor research model according to claim 8, characterized in that, The network visualization module of the large-scale pediatric oncology research model visualizes the core structure of the dynamic knowledge association network, retaining the core term nodes, research target nodes, and main association paths, simplifying secondary nodes and redundant association edges, and generating a schematic diagram of the core structure of the dynamic knowledge association network, including: Extract the core term nodes and research target nodes contained in the semantic knowledge path of the original scientific research question from the dynamic knowledge association network, and determine the extracted core term nodes and research target nodes as the core node set for visualization processing. Analyze the association edges between nodes in the core node set of the dynamic knowledge association network, and statistically analyze the association tightness description and the corresponding number of documents supporting each association edge. Retain the association edges whose association tightness meets the preset standard and whose number of documents supporting reaches the preset threshold as the main association paths. Secondary nodes in the dynamic knowledge association network are identified. These secondary nodes are nodes in the newly added terms whose association with the core nodes does not reach the preset degree of association and do not appear in the main association path. The identified secondary nodes are merged or hidden to reduce the number of nodes in the diagram. Redundant association edges in the main association paths are cleaned up. Redundant association edges are two association edges with the same node and the same association type. 
Among the redundant association edges, the association edge whose number of supporting documents reaches the preset threshold is retained, and the duplicate association edges are deleted. Each node in the core node set is assigned a unique visual identifier, which includes the node shape and base color. Different types of core nodes use different shapes, and nodes of the same type use the same base color. Based on the relationships between core nodes, the node positions are arranged in the visualization interface so that the main association paths are evenly distributed and do not cross or obstruct one another. Arrow lines are used to represent association edges, and the direction of the arrows reflects the logical direction of the relationship. Association type description labels and association tightness description labels are added to the association edges. The labels are located in the middle of the association edges, and their font size and color are adjusted so that they are clearly visible. This completes the generation of the core structure diagram of the dynamic knowledge association network.
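The redundant-edge cleanup of claim 9 can be illustrated with a short Python sketch. The tuple layout, the `min_support` threshold, and the rule of keeping the best-supported duplicate are assumptions for this example; the claim states only that edges reaching the preset threshold are retained and duplicates deleted.

```python
def clean_edges(edges, min_support=2):
    """Remove redundant association edges.

    edges: (head, tail, association_type, supporting_document_count).
    Edges sharing head, tail and type are treated as redundant; the one
    with the most supporting documents is kept if it meets min_support.
    """
    best = {}
    for head, tail, rel, support in edges:
        if support < min_support:
            continue                      # below the preset threshold
        key = (head, tail, rel)
        if key not in best or support > best[key]:
            best[key] = support           # keep the best-supported copy
    return [(h, t, r, s) for (h, t, r), s in best.items()]
```

Edges that differ only in association type are kept as separate edges here, since the claim defines redundancy by both the shared nodes and the shared type.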
10. A literature intelligent question-answering system based on a pediatric oncology research large model, characterized in that it includes: a processor; and a machine-readable storage medium for storing machine-executable instructions of the processor; wherein the processor is configured to execute, by executing the machine-executable instructions, the literature intelligent question-answering method based on a pediatric oncology research large model as described in any one of claims 1 to 9.