A scientific and technological literature innovation evaluation method based on large model agent technology

By constructing a large model Agent, the citation record hierarchy of scientific and technological literature is classified and the span is calculated, which solves the problem that existing evaluation methods cannot measure the degree of citation dispersion and realizes the accurate identification and evaluation of cross-domain innovative literature.

CN122019780B · Active · Publication Date: 2026-06-23 · 广州市奇之信息技术有限公司


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: 广州市奇之信息技术有限公司
Filing Date: 2026-04-09
Publication Date: 2026-06-23

Application Information


IPC: G06F16/353; G06F16/31; G06F16/38; G06F40/216; G06F40/30


AI Technical Summary

Technical Problem

Existing methods for evaluating the innovativeness of scientific literature cannot accurately measure the dispersion of citation sources across different research branches, resulting in a systematic bias between the evaluation results and the actual innovative value, making it difficult to identify groundbreaking literature across disciplines.

Method used

A large-scale model agent is constructed to obtain citation records of target documents through a literature database interface, classify them hierarchically by discipline and research branch, calculate the span and weight, generate an innovation value score dataset, and identify groundbreaking innovative documents through cross-validation.

Benefits of technology

It has enabled automated collection of citation data and intelligent evaluation of interdisciplinary innovation value, improving the accuracy and efficiency of discovering groundbreaking innovative literature.



Abstract

The application provides a method for evaluating the innovativeness of scientific and technological literature based on large model Agent technology, comprising: hierarchically classifying each citation source in a citation dataset by discipline and research branch, determining its discipline field code and specific research branch code according to a discipline classification system, and obtaining citation distribution data containing complete classification labels; counting the number and proportion of citations of each research branch in the citation distribution data, and calculating the span of the literature across different research branches; comprehensively evaluating, by the large model Agent, the innovation value of the literature based on the span across branches and the total number of citations, adjusting the weights of different literature according to that span, and generating an innovation value score dataset; and cross-validating, by the large model Agent, the candidate literature list, confirming the validity of the cross-disciplinary breakthrough characteristics using the citation proportion of each branch, completing the final labeling, and identifying the breakthrough innovative literature.


Description

Technical Field

[0001] This invention relates to the field of information technology, and in particular to a method for evaluating the innovativeness of scientific and technological literature based on large-scale model agent technology.

Background Technology

[0002] In the field of scientific literature innovation assessment, accurately judging the true innovative value of literature directly impacts research direction selection, resource allocation, and talent evaluation, and is therefore of crucial significance. Traditionally, the number of citations is often considered a core indicator for measuring innovation; more citations usually mean greater influence and more prominent innovative contributions. However, this assessment method has significant flaws in practical application because it is easily influenced by mutual citations within specific research groups. This can lead to some literature that is widely disseminated only within narrow branches receiving excessively high scores, while those that truly drive progress in multiple fields may be underestimated. A deeper problem is that while the number of citations reflects overall influence, it fails to reveal the distribution structure of citation sources. When a large number of citations are concentrated in the same research branch, even with a high total number of citations, it often only represents a deepening and continuation within that branch, making it difficult to reflect groundbreaking innovative contributions. Conversely, if citation sources are distributed across multiple different research branches, even a relatively small total number of citations may indicate that the literature provides general ideas or methods across branches, thus possessing higher innovative value. Most existing methods focus solely on the total number of citations, neglecting the dispersion of citation relationships across different research branches. This leads to a systematic bias between the evaluation results and the actual innovative value. This bias is particularly pronounced in specific evaluation processes. 
For example, a paper might be cited 200 times, with 180 of those citations coming from follow-up work within the same research branch and only 20 scattered across other branches. In this case, the total citation count easily overstates its innovativeness. Conversely, a paper might be cited only 80 times, but with the citations evenly distributed across five unrelated research branches, approximately 16 per branch. Its cross-branch impact clearly points to a significant breakthrough. However, existing evaluation systems struggle to capture this difference in the dispersion of citation sources and therefore cannot differentiate innovativeness scores accordingly. This poses a serious challenge in identifying truly groundbreaking papers with cross-disciplinary impact. Therefore, the key to accurately assessing the innovative value of scientific literature is to consider both the total number of citations and an effective measure of the dispersion of citation sources across different research branches, and to dynamically adjust the scoring weights accordingly.

Summary of the Invention

[0003] To address the technical problems mentioned above, this invention discloses a method for evaluating the innovativeness of scientific literature based on large-scale model agent technology, comprising:

[0004] A large model agent is constructed to obtain the complete citation records of the target document through the literature database interface, including the total number of citations and detailed information of each citation source. The collected citation records are stored in the long-term memory of the large model agent to form a citation dataset.

[0005] Each citation source in the citation dataset is hierarchically classified by discipline and research branch. Its discipline domain code and specific research branch code are determined according to the discipline classification system to obtain citation distribution data containing complete classification labels.

[0006] From the citation distribution data, the number and proportion of citations in each research branch are counted, and the span of the literature across different research branches is calculated.

[0007] The large model Agent comprehensively evaluates the innovative value of literature based on the span between branches and the total number of citations. At the same time, it adjusts the weight of different literature according to the span between branches to generate an innovative value score dataset.

[0008] Based on the innovation value score and a preset span threshold, the large model agent selects from the innovation value score dataset the documents that meet both conditions, high score and high span, thus obtaining a list of candidate breakthrough innovation documents.

[0009] The large model agent cross-validates the candidate literature list, combines the citation ratio information of each branch to confirm the effectiveness of its cross-domain breakthrough features, completes the final annotation, and identifies groundbreaking innovative literature.

[0010] Furthermore, the constructed large-scale model agent obtains the complete citation records of the target document through a literature database interface, including the total number of citations and detailed information for each citation source. The collected citation records are stored in the long-term memory of the large-scale model agent to form a citation dataset, including:

[0011] The large model agent calls the API of the Web of Science or Scopus database, initiates a query request based on the DOI of the target document, obtains the returned citation data package in JSON format, parses the total citations field and the citation details array in the data package, and extracts the author name, publication journal, publication year, and document title of each citation record to obtain the original citation record set;

[0012] The original citation record set is cleaned to remove duplicate citation records and self-citation entries. Valid citation records after the target time are filtered based on the timestamp. Missing information is supplemented through the CrossRef database to obtain a standardized citation record set.

[0013] The large model agent uses a hash algorithm to generate unique index keys based on the standardized citation record set, establishes a mapping table between the cited source documents and the cited documents, and stores it in the database of the long-term memory module to form a traceable citation dataset.

[0014] Furthermore, the process involves hierarchically classifying each citation source in the citation dataset by discipline and research branch, determining its subject area code and specific research branch code according to the discipline classification system, and obtaining citation distribution data containing complete classification labels, including:

[0015] The large model agent extracts the author's name, journal, publication year, and document title for each citation record from the citation dataset. It also queries the ISSN number of the journal based on the publication journal and extracts a set of document keywords based on the document title. It queries the first-level discipline code to which the journal belongs through the subject classification table of the Documentation and Information Center of the Chinese Academy of Sciences. The TF-IDF algorithm is used to vectorize the keywords, and the dot product operation is performed with the pre-constructed feature vectors of each research branch. The branch with the highest similarity is selected as the initial branch affiliation.

[0016] Regarding the preliminary branch attribution, when the similarity between the keyword vector of a citation record and the feature vectors of multiple research branches exceeds the preset interdisciplinary judgment threshold, it is judged as an interdisciplinary document and assigned a main branch code and a sub-branch code; otherwise, a single branch code is maintained, and a hierarchical classification label for each citation record is obtained.

[0017] Based on the hierarchical classification labels, citation records are organized into three levels: subject category, first-level discipline, and research branch. The number and percentage of citations under each level node are counted. For interdisciplinary literature, different statistical weights are assigned to the main branch and the sub-branch. A structured data table containing fields such as subject code, branch code, number of citations, and citation percentage is generated to form citation distribution data with complete classification labels.

[0018] Furthermore, the hierarchical categorization of each citation source in the citation dataset by discipline and research branch also includes:

[0019] The full journal name and abstract text of each citation record are extracted from the citation dataset. Noun phrases are extracted from the abstract text using natural language processing tools to form a set of research topic vocabulary. The subject classification code of the journal is queried according to the SCI journal partition table to obtain the initial classification information of the journal.

[0020] Using a predefined vocabulary list for the subject classification system, the number of words shared by the research topic vocabulary set and the keyword set of each subject is counted and divided by the number of words in the union of the two sets (that is, the Jaccard similarity). If the result exceeds a predefined threshold, the citation source is determined to belong to that subject field, and a subject field label is obtained.

[0021] Based on the subject area label, locate the corresponding set of secondary branches, perform string matching between the research topic vocabulary and the feature words of each secondary branch, count the number of successfully matched feature words, select the branch with the most matches as the research branch label, and determine the subject area label and research branch label of the cited source.
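The two-stage labeling in [0019]-[0021] can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names, the 0.2 threshold, and the use of exact string matching for feature words are hypothetical, since the text leaves the threshold and the matching rule unspecified.

```python
def jaccard(a: set, b: set) -> float:
    """Shared words divided by the size of the union of both sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def subject_labels(topic_terms: set, subject_vocab: dict, threshold: float = 0.2) -> list:
    """Return every subject field whose overlap ratio exceeds the threshold.

    The threshold value is an assumption; the source only calls it predefined.
    """
    return [s for s, vocab in subject_vocab.items()
            if jaccard(topic_terms, vocab) > threshold]


def branch_label(topic_terms: set, branch_features: dict):
    """Pick the secondary branch with the most matched feature words.

    Exact string matching is assumed here. Returns None if nothing matches.
    """
    counts = {b: sum(1 for w in feats if w in topic_terms)
              for b, feats in branch_features.items()}
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None
```

For instance, a topic vocabulary of {"neural network", "image segmentation", "tumor"} compared with a subject vocabulary of {"neural network", "image segmentation", "compiler"} shares 2 of 4 distinct words, giving a ratio of 0.5.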

[0022] Furthermore, counting the number and proportion of citations in each research branch from the citation distribution data and calculating the span of the literature across different research branches includes:

[0023] Extract the citation record list for each research branch from the citation distribution data, count the number of citations contained in each branch, calculate the percentage of the number of citations in each branch to the total number of citations, and sort the branches by the number of citations from high to low to form a branch citation ranking table.

[0024] Branches whose citation count percentage is greater than a preset minimum percentage threshold are identified as valid branches. The standard deviation and mean of the citation count of all valid branches are calculated. When the standard deviation is less than a preset percentage threshold of the mean, it is determined to be a balanced distribution; otherwise, it is determined to be a concentrated distribution.

[0025] Based on the distribution type identifier and the total number of valid branches, if the distribution type is a balanced distribution, the span is taken as the number of valid branches divided by the total number of research branches; otherwise, it is taken as the number of valid branches minus one divided by the total number of research branches, thus determining the span of the literature between different research branches.
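The span rule in [0023]-[0025] can be illustrated with a short Python sketch. The threshold defaults (5% minimum share, standard deviation below 50% of the mean) and the use of the population standard deviation are assumptions made for illustration; the text only says these values are preset.

```python
from statistics import mean, pstdev


def branch_span(branch_counts: dict, total_branches: int,
                min_share: float = 0.05, balance_ratio: float = 0.5):
    """Return (span, distribution_type) for one document.

    branch_counts maps a branch code to its citation count.
    total_branches is the number of research branches in the classification.
    """
    total = sum(branch_counts.values())
    if total == 0 or total_branches <= 0:
        return 0.0, "none"
    # Valid branches: citation share above the minimum percentage threshold.
    valid = [c for c in branch_counts.values() if c / total > min_share]
    if not valid:
        return 0.0, "none"
    # Balanced if the standard deviation is below a fraction of the mean.
    balanced = pstdev(valid) < balance_ratio * mean(valid)
    effective = len(valid) if balanced else len(valid) - 1
    return effective / total_branches, "balanced" if balanced else "concentrated"
```

Applied to the two papers from the background section with a hypothetical total of 10 research branches, 180 and 20 citations in two branches yield a concentrated distribution with span 0.1, while 16 citations in each of five branches yield a balanced distribution with span 0.5.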

[0026] Furthermore, the calculation of the breadth of literature across different research branches also includes:

[0027] Extract complete citation records for each research branch from the citation distribution data, obtain the journals, author institutions, and keyword sets of the cited literature, determine the first-level discipline code to which each citation belongs through the discipline classification mapping table, and count the number of disciplines spanned.

[0028] Extract the core keyword set of each branch, use the word vector algorithm to convert the keywords into high-dimensional vector representations, calculate the center vector of the keyword vector set of each branch, and calculate the topic similarity between branches using the cosine similarity algorithm;

[0029] The citation density is obtained by dividing the number of citations for each branch by the number of subdomains contained in that branch. Branches whose citation density exceeds the average density are identified as core influential branches. The number of core influential branches is taken as the influence breadth value. The coefficient of variation of the citation counts of the core influential branches is taken as the influence depth value. Normalization is used to map the influence breadth and influence depth values to the interval from 0 to 1. The overall span equals the weighted sum of the influence breadth and influence depth values according to preset weight coefficients.
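A rough Python sketch of the breadth and depth calculation in [0029] is given below. Several details are assumptions, because the text does not fix them: breadth is normalized by dividing by the number of branches, depth is normalized by clamping the coefficient of variation to at most 1, the weights 0.6 and 0.4 are placeholders, and every branch is assumed to contain at least one subdomain.

```python
from statistics import mean, pstdev


def overall_span(citations: dict, subdomains: dict,
                 w_breadth: float = 0.6, w_depth: float = 0.4) -> float:
    """Weighted sum of normalized influence breadth and influence depth."""
    # Citation density: citations per subdomain in each branch.
    density = {b: citations[b] / subdomains[b] for b in citations}
    avg_density = mean(list(density.values()))
    # Core influential branches: density above the average density.
    core = [b for b, d in density.items() if d > avg_density]
    breadth = len(core) / len(citations)           # assumed normalization
    core_counts = [citations[b] for b in core]
    cv = pstdev(core_counts) / mean(core_counts) if core_counts else 0.0
    depth = min(cv, 1.0)                           # assumed normalization
    return w_breadth * breadth + w_depth * depth
```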

[0030] Furthermore, the large model Agent comprehensively evaluates the innovation value of documents based on the span between branches and the total number of citations. Simultaneously, it adjusts the weights of different documents according to the span between branches to generate an innovation value scoring dataset, including:

[0031] The large model agent obtains the branch span value and the total number of citations for each document from the literature database. After adding one to the total number of citations, it takes the natural logarithm to obtain the citation base. The span value is multiplied by a preset amplification factor and then limited to the range of 0 to 1 as the adjustment factor. When the span exceeds the preset high span threshold, the adjustment factor is taken at the maximum value; otherwise, the adjustment factor is calculated proportionally.

[0032] The publication year of each article is extracted, the difference between the current year and the publication year is calculated, the timeliness weight is calculated using the negative exponential decay formula, the importance value of the subject to which the article belongs is retrieved from the preset subject weight configuration table as the domain weight, and the citation base, adjustment coefficient, timeliness weight and domain weight are multiplied together to obtain the original innovation score;

[0033] The average and standard deviation of the scores for literature in the same subject are calculated. The original scores are subtracted from the average and then divided by the standard deviation for standardization. Based on the standardized scores, the literature is divided into three levels: those with scores greater than the preset high innovation threshold are marked as high innovation, those between the preset high innovation threshold and the low innovation threshold are marked as medium innovation, and those less than the preset low innovation threshold are marked as regular innovation. A structured data table containing five fields, including document identification code, span value, total number of citations, innovation score and innovation level, is constructed and sorted in descending order of innovation score to generate an innovation value scoring dataset.
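The scoring steps in [0031]-[0033] can be summarized in a small Python sketch. All default parameter values (amplification factor, high span threshold, decay rate, and level thresholds) are hypothetical placeholders for the preset values the text refers to, and the population standard deviation is assumed for standardization.

```python
import math
from statistics import mean, pstdev


def raw_score(citations: int, span: float, years_since_pub: float,
              domain_weight: float, amplification: float = 1.5,
              high_span: float = 0.6, decay: float = 0.1) -> float:
    """Citation base x adjustment factor x timeliness weight x domain weight."""
    base = math.log(1 + citations)                 # ln(citations + 1)
    if span > high_span:
        adjustment = 1.0                           # maximum above the high span threshold
    else:
        adjustment = min(max(span * amplification, 0.0), 1.0)
    timeliness = math.exp(-decay * years_since_pub)  # negative exponential decay
    return base * adjustment * timeliness * domain_weight


def standardize(scores: list) -> list:
    """Z-score the raw scores of documents in the same subject."""
    mu, sigma = mean(scores), pstdev(scores)
    return [(s - mu) / sigma if sigma else 0.0 for s in scores]


def level(z: float, high: float = 1.0, low: float = -1.0) -> str:
    """Map a standardized score to one of the three innovation levels."""
    if z > high:
        return "high"
    if z < low:
        return "regular"
    return "medium"
```

The logarithm dampens very large citation counts, so the span-based adjustment factor has a proportionally larger effect on the final ranking than it would with raw counts.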

[0034] Furthermore, the step in which the large model agent, based on the innovation value score and a preset span threshold, selects from the innovation value score dataset the documents that meet both conditions, high score and high span, thus obtaining a list of candidate breakthrough innovation documents, includes:

[0035] The large model agent reads the innovation score and span value of each document from the innovation value scoring dataset, and filters out the document set with a score higher than the preset innovation score threshold, and at the same time filters out the document set with a span higher than the preset span threshold.

[0036] Perform an intersection operation on the document set whose score is higher than the threshold and the document set whose span is higher than the threshold, and extract the identifier, innovation score and span value of the document records that exist in both sets simultaneously;

[0037] Calculate the weighted sum of the innovation score and the span value for each document, sort the documents in descending order of this comprehensive score, mark the top-ranked documents as core breakthrough candidates, and mark the rest as general breakthrough candidates, thus obtaining a list of candidate breakthrough innovation documents.
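The filtering and ranking in [0035]-[0037] amount to a set intersection followed by a weighted sort. The Python sketch below assumes each record is a dict with hypothetical keys "id", "score", and "span"; the equal weights and the top_k cutoff are placeholders for values the text leaves open.

```python
def select_candidates(records: list, score_threshold: float, span_threshold: float,
                      w_score: float = 0.5, w_span: float = 0.5, top_k: int = 1) -> list:
    """Return candidates that pass both thresholds, ranked by combined score."""
    high_score = {r["id"] for r in records if r["score"] > score_threshold}
    high_span = {r["id"] for r in records if r["span"] > span_threshold}
    both = high_score & high_span                  # documents present in both sets

    def combined(r):
        return w_score * r["score"] + w_span * r["span"]

    ranked = sorted((r for r in records if r["id"] in both),
                    key=combined, reverse=True)
    return [{"id": r["id"], "combined": combined(r),
             "label": "core" if i < top_k else "general"}
            for i, r in enumerate(ranked)]
```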

[0038] Furthermore, the large model agent performs cross-validation on the candidate literature list, combining the citation ratio information of each branch to confirm the effectiveness of its cross-domain breakthrough features, completing the final annotation, and identifying groundbreaking innovative literature, including:

[0039] The large model agent extracts the branch citation distribution data of each paper from the candidate list of groundbreaking innovative papers, calculates the percentage of citations of each research branch to the total number of citations, counts the number of valid branches whose percentage exceeds the preset minimum threshold, and judges the validation as unsuccessful if the number of valid branches is less than the preset number.

[0040] Based on the preliminary verification results, the Gini coefficient of the citation distribution of the literature that passed the preliminary verification is calculated. Literature with a Gini coefficient less than the preset equilibrium threshold is marked as groundbreaking innovative literature, and literature with a Gini coefficient greater than the threshold is marked as non-groundbreaking innovative literature. The final labeling and identification of groundbreaking innovative literature are completed.
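The two-stage validation in [0039]-[0040] can be sketched in Python as follows. The Gini coefficient is computed with the standard sorted-rank formula over all branch counts of a document; whether the patent computes it over all branches or only the valid ones is not stated, so that choice, together with the default thresholds, is an assumption.

```python
def gini(values: list) -> float:
    """Gini coefficient of non-negative values (0 means perfectly even)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


def is_breakthrough(branch_counts: dict, min_share: float = 0.05,
                    min_valid: int = 3, gini_threshold: float = 0.3) -> bool:
    """Preliminary check on valid branch count, then a Gini balance check."""
    total = sum(branch_counts.values())
    if total == 0:
        return False
    valid = [c for c in branch_counts.values() if c / total > min_share]
    if len(valid) < min_valid:
        return False                               # fails preliminary verification
    return gini(list(branch_counts.values())) < gini_threshold
```

For the 200-citation paper from the background section (180 and 20 citations in two branches) the Gini coefficient is 0.4, while the evenly spread 80-citation paper has a Gini coefficient of 0.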

[0041] The technical solutions provided by the embodiments of the present invention may include the following beneficial effects:

[0042] This invention discloses a method for evaluating the innovativeness of scientific literature based on large-scale model agent technology. It obtains complete citation records of target literature through a literature database interface, including the total number of citations and detailed information for each citation source, and stores this information in the long-term memory of the agent to form a citation dataset. This invention further categorizes citation sources hierarchically by subject, combining journal name and research topic, and determining domain codes and branch codes according to the subject classification system to generate citation distribution data with classification labels. Based on this, it statistically analyzes the number and proportion of citations in each research branch, calculates the cross-branch span of literature, and assesses the breadth and depth of influence by analyzing the topic distance and domain interval between branches. This invention utilizes a large-scale model agent to comprehensively evaluate the innovative value of literature based on the total number of citations and the span, generates a scoring dataset, filters candidate literature with high scores and high span, and then confirms cross-disciplinary breakthrough characteristics through cross-validation, ultimately accurately identifying groundbreaking innovative literature. This method realizes automated collection of citation data and intelligent evaluation of cross-disciplinary innovative value, improving the accuracy and efficiency of discovering groundbreaking innovative literature.

Attached Figure Description

[0043] Figure 1 is a flowchart of a scientific literature innovation evaluation method based on large model agent technology according to the present invention.

[0044] Figure 2 is a schematic diagram of a scientific literature innovation evaluation method based on large model agent technology according to the present invention.

[0045] Figure 3 is another schematic diagram of a scientific literature innovation evaluation method based on large model agent technology according to the present invention.

Detailed Implementation

[0046] To further understand the content of this invention, a detailed description of the invention is provided in conjunction with the accompanying drawings and embodiments. The specific embodiments described herein are for illustrative purposes only and are not intended to limit the invention. It should also be noted that, for ease of description, only the parts relevant to the invention are shown in the accompanying drawings.

[0047] As shown in Figures 1-3, this embodiment of the method for evaluating the innovativeness of scientific literature based on large model agent technology may specifically include:

[0048] S101. Construct a large model agent, obtain the complete citation records of the target document through the document database interface, including the total number of citations and detailed information of each citation source, and store the collected citation records in the long-term memory of the large model agent to form a citation dataset.

[0049] The large-scale model agent initiates a query request based on the DOI number of the target document by calling the API interface, obtains the returned JSON format citation data package, parses the total citation count field and citation details array in the data package, and extracts the author name, journal, publication year, and document title information for each citation record to obtain the original citation record set. The original citation record set is then cleaned to remove duplicate citations and self-citations, and valid citation records after the target time are filtered based on timestamps. When the journal source field of a citation record is empty or the number of characters in the author name field does not meet the preset completeness condition, the missing information is supplemented through the CrossRef database to obtain a standardized citation record set. Based on each record in the standardized citation record set, the large-scale model agent uses a hash algorithm to generate a unique index key value, establishes a mapping table between the cited source document and the cited document, and stores it in the database of the long-term memory module, forming a traceable and queryable citation dataset.
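The hash-based index key and mapping table at the end of [0049] could look roughly like the Python sketch below. The choice of SHA-256, the fields that go into the key, and the record layout are assumptions for illustration; the text only states that a hash algorithm generates a unique index key per record.

```python
import hashlib


def citation_key(record: dict) -> str:
    """Build a deterministic index key from normalized record fields."""
    parts = [
        str(record.get("title", "")).strip().lower(),
        str(record.get("journal", "")).strip().lower(),
        str(record.get("year", "")),
        str(record.get("first_author", "")).strip().lower(),
    ]
    return hashlib.sha256("|".join(parts).encode("utf-8")).hexdigest()


def build_mapping(target_doi: str, records: list) -> dict:
    """Map each index key to its citing record and the cited target document."""
    return {citation_key(r): {"target_doi": target_doi, "record": r}
            for r in records}
```

Normalizing case and whitespace before hashing means that trivially different copies of the same record collapse to one key, which also supports deduplication.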

[0050] In one implementation, the large model agent establishes a connection with the Web of Science database via a RESTful API interface and retrieves citation data based on the DOI number.

[0051] Specifically, the data cleaning process first identifies self-citation entries by comparing the first author of the cited record with the author list of the target document. If the match exceeds 80%, it is considered a self-citation and marked for deletion. Deduplication is performed using title similarity calculations and the Jaccard similarity coefficient algorithm. If the title similarity between two records exceeds 0.9, the earlier record is retained. Timestamp filtering is based on the timeliness requirements of scientific literature innovation assessment, setting 2015 as the time threshold to filter out earlier cited records.
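The cleaning rules in [0051] can be sketched in Python as shown below. Some choices are assumptions: the author "match" is approximated with `difflib.SequenceMatcher`, title similarity is a Jaccard coefficient over lowercase word sets, records from 2015 itself are kept, and the record keys ("title", "first_author", "year") are hypothetical.

```python
from difflib import SequenceMatcher


def title_jaccard(t1: str, t2: str) -> float:
    """Jaccard coefficient over the lowercase word sets of two titles."""
    a, b = set(t1.lower().split()), set(t2.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def is_self_citation(first_author: str, target_authors: list,
                     threshold: float = 0.8) -> bool:
    """True if the first author matches any target author above the threshold."""
    return any(SequenceMatcher(None, first_author.lower(), a.lower()).ratio() > threshold
               for a in target_authors)


def clean(records: list, target_authors: list,
          min_year: int = 2015, dup_threshold: float = 0.9) -> list:
    """Drop old records, self-citations, and near-duplicate titles."""
    kept = []
    # Process earliest first so the earlier of two duplicates is retained.
    for rec in sorted(records, key=lambda r: r["year"]):
        if rec["year"] < min_year:
            continue
        if is_self_citation(rec["first_author"], target_authors):
            continue
        if any(title_jaccard(rec["title"], k["title"]) > dup_threshold for k in kept):
            continue
        kept.append(rec)
    return kept
```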

[0052] In one possible implementation, when a null value is detected in the journal source field, the large model agent extracts the document title of the record and performs a fuzzy matching query through the works interface of the CrossRef API. The CrossRef database returns a list of candidate documents, and the system calculates the similarity score between each candidate and the original record. The candidate with the highest score exceeding the 0.85 threshold is selected, and its container-title field is extracted to supplement the journal information. The integrity judgment of the author name field is based on character length detection; if the Western author name is less than 5 characters or the Chinese author name is less than 2 characters, the supplementation mechanism is triggered.
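The completeness check and candidate selection in [0052] can be sketched without any network access. In the Python sketch below, the candidate list stands in for whatever the CrossRef query returns and is simplified to dicts with plain string fields "title" and "container-title"; a real response is structured differently and would need adapting. The similarity measure (`difflib.SequenceMatcher`) and the CJK range test are also assumptions.

```python
from difflib import SequenceMatcher


def needs_supplement(record: dict) -> bool:
    """True if the journal field is empty or the author name looks incomplete."""
    if not record.get("journal"):
        return True
    name = record.get("author", "")
    # Treat names containing CJK ideographs as Chinese names (assumed heuristic).
    is_chinese = any("\u4e00" <= ch <= "\u9fff" for ch in name)
    return len(name) < (2 if is_chinese else 5)


def pick_journal(title: str, candidates: list, threshold: float = 0.85):
    """Return the journal of the best-matching candidate above the threshold."""
    best, best_score = None, 0.0
    for cand in candidates:
        score = SequenceMatcher(None, title.lower(), cand["title"].lower()).ratio()
        if score > best_score:
            best, best_score = cand, score
    if best is not None and best_score > threshold:
        return best["container-title"]
    return None
```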

[0053] S102. Classify each citation source in the citation dataset by discipline and research branch, determine its discipline domain code and specific research branch code according to the discipline classification system, and obtain citation distribution data containing complete classification labels.

[0054] The large-scale model agent extracts the author's name, journal, publication year, and document title for each citation record from the citation dataset. It then queries the journal's ISSN based on the journal and extracts a set of keywords based on the document title. Using the subject classification table from the Documentation and Information Center of the Chinese Academy of Sciences, it queries the first-level subject code to which the journal belongs. The TF-IDF algorithm is used to vectorize the keywords, and a dot product operation is performed with the pre-constructed feature vectors of each research branch. The branch with the highest similarity is selected as the initial branch assignment. For this initial branch assignment, if the similarity between the keyword vector of a citation record and the feature vectors of multiple research branches exceeds a preset interdisciplinary threshold, it is determined to be an interdisciplinary document and assigned a main branch code and a sub-branch code. If the similarity does not reach the threshold, a single branch code is maintained, thus obtaining a hierarchical classification label for each citation record. Based on the hierarchical classification labels, citation records are organized into three levels: subject category, first-level discipline, and research branch. The number and percentage of citations under each level node are counted. For interdisciplinary literature, different statistical weights are assigned to the main branch and the sub-branch. A structured data table containing fields such as subject code, branch code, number of citations, and citation percentage is generated to form citation distribution data with complete classification labels.

[0055] In one implementation, the TF-IDF algorithm performs word frequency statistics on the keyword set of each citation record, calculating the frequency of each keyword in the document as the TF value, and simultaneously calculating the inverse document frequency (IDF) of the keyword in the entire citation dataset as the IDF value. The weight value of each keyword is obtained by multiplying the TF and IDF values, constructing an n-dimensional feature vector, where n is the total number of deduplicated keywords. The branch feature vectors are constructed by pre-collecting representative documents from each branch, selecting keywords from the top 100 highly cited documents in each branch, and generating branch feature vectors using the same TF-IDF method.

[0056] Specifically, the dot product operation first normalizes the keyword vector of the citation record and the feature vector of each research branch to unit length, and then sums the products of the corresponding dimensions of the two vectors, which is equivalent to cosine similarity. Similarity values range from 0 to 1. When the similarity value reaches 0.7 or higher, the citation record is considered to have a strong correlation with the corresponding research branch. This threshold was determined based on ROC curve analysis of 1000 sample citation records, with the balance point between sensitivity and specificity at 0.7. The subject classification table of the Documentation and Information Center of the Chinese Academy of Sciences includes 13 subject categories, 110 first-level disciplines, and over 600 research branches, each with a unique six-digit code.
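The TF-IDF weighting and normalized dot product in [0055]-[0056] can be sketched in plain Python. This is a simplified illustration: the smoothed IDF variant is an assumption (the text only says TF multiplied by IDF), and the record and branch vectors are assumed to share one vocabulary and dimension order.

```python
import math
from collections import Counter


def tfidf_vectors(docs: list):
    """docs is a list of keyword lists; returns (vocabulary, list of vectors)."""
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency per keyword
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([
            (tf[w] / len(d)) * (math.log((1 + n) / (1 + df[w])) + 1) if d else 0.0
            for w in vocab
        ])
    return vocab, vectors


def unit(v: list) -> list:
    """Scale a vector to length 1 (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def similarity(a: list, b: list) -> float:
    """Dot product of the two unit-length vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))


def best_branch(vec: list, branch_vectors: dict, threshold: float = 0.7):
    """Return (branch, similarity, strongly_related) for the closest branch."""
    sims = {b: similarity(vec, bv) for b, bv in branch_vectors.items()}
    best = max(sims, key=sims.get)
    return best, sims[best], sims[best] >= threshold
```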

[0057] It should be noted that the identification of interdisciplinary literature employs a multi-threshold determination mechanism. When a citation record has a similarity of 0.75 with the first research branch and a similarity of 0.72 with the second research branch, and the difference between the two is less than 0.1 and both exceed the threshold of 0.7, the system determines that the literature belongs to an interdisciplinary field. The main branch code is assigned to the branch with the highest similarity, and the secondary branch code is assigned to the branch with the second highest similarity. This dual coding mechanism can accurately reflect the distribution of the literature's influence across different research fields.
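The dual-coding rule can be sketched as below. The function name `assign_branch_codes` and its parameters are hypothetical; the sketch treats a similarity equal to 0.7 as meeting the threshold, following paragraph [0056], and uses the 0.1 gap from the example above.

```python
def assign_branch_codes(similarities, threshold=0.7, max_gap=0.1):
    """Return (main_code, sub_code); sub_code is None for single-branch records.

    similarities maps a research branch code to its similarity with the record.
    """
    if not similarities:
        raise ValueError("at least one branch similarity is required")
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    main_code, main_sim = ranked[0]
    if len(ranked) > 1:
        sub_code, sub_sim = ranked[1]
        both_strong = main_sim >= threshold and sub_sim >= threshold
        if both_strong and (main_sim - sub_sim) < max_gap:
            return main_code, sub_code  # interdisciplinary: main + sub-branch
    return main_code, None  # single branch code retained

codes = assign_branch_codes({"F081203": 0.75, "F100207": 0.72, "F020101": 0.30})
```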

[0058] Preferably, the three-tier organizational structure is implemented using a tree data structure. The root node represents the subject category level, including major categories such as engineering, science, and medicine; the intermediate nodes represent the first-level discipline level, such as computer science and technology and electronic science and technology; and the leaf nodes represent the research branch level, such as artificial intelligence, machine learning, and computer vision. Each node records the cumulative number of citations at that level and its percentage of the total number of citations.
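One way to picture the three-level tree is the following sketch. The class name `ClassificationNode` and the synthetic top node that holds the grand total are assumptions made for illustration; the text itself places the subject categories at the root level.

```python
from dataclasses import dataclass, field

@dataclass
class ClassificationNode:
    name: str
    citations: float = 0.0
    children: dict = field(default_factory=dict)

    def add(self, path, weight=1.0):
        """Add one (possibly weighted) citation along category > discipline > branch."""
        self.citations += weight
        if path:
            child = self.children.setdefault(path[0], ClassificationNode(path[0]))
            child.add(path[1:], weight)

    def share(self, total):
        """Fraction of the total citation count accumulated at this node."""
        return self.citations / total if total else 0.0

root = ClassificationNode("all citations")  # hypothetical holder of the grand total
root.add(["Engineering", "Computer Science and Technology", "Machine learning"])
root.add(["Engineering", "Computer Science and Technology", "Computer vision"])
# A sub-branch assignment counted with weight 0.5, as in the dual-coding scheme.
root.add(["Medicine", "Clinical Medicine", "Medical imaging"], weight=0.5)
```

Each node accumulates its own count on insertion, so the per-level percentages are available without a second traversal.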

[0059] For example, a citation record of a paper on the application of deep learning in medical image recognition may be assigned the main branch code F081203 to represent the direction of artificial intelligence and the sub-branch code F100207 to represent the direction of medical imaging. In the final citation distribution data table, this record is included in the statistics of both branches, but the main branch has a weight of 1.0 and the sub-branch has a weight of 0.5, which realizes the differentiated statistics of cross-disciplinary citations.

[0060] The journal name and research topic of each citation source are obtained from the citation dataset. The scope of the first-level discipline and the topic boundaries of the second-level branches are extracted from the subject classification system. The subject field category to which the journal name belongs is analyzed. The specific branch direction to which the research topic belongs is evaluated. The subject field label and research branch label of the citation source are determined.

[0061] The full journal name and abstract text of each citation record are extracted from the citation dataset. Noun phrases are extracted from the abstract text using natural language processing tools to form a research topic vocabulary set, which gives the research topic of the citation source. The journal's subject classification code is retrieved from the SCI journal partition table to obtain initial journal classification information. Using a pre-defined subject classification system vocabulary list, the keyword set of each first-level discipline and the feature word boundaries of each second-level branch are extracted. The number of words shared by the research topic vocabulary set and the keywords of each discipline is calculated and divided by the number of words in the union of the two sets. If the result exceeds a preset threshold, the citation source is determined to belong to that discipline, which gives the discipline domain label. Based on the discipline domain label, the corresponding set of second-level branches is located. String matching is performed between the research topic vocabulary and the feature words of each second-level branch. The number of successfully matched feature words is counted, and the branch with the most matches is selected as the research branch label. This determines the discipline domain label and the research branch label of the citation source.

[0062] It should be noted that the subject classification system thesaurus is constructed based on the subject classification table of the Documentation and Information Center of the Chinese Academy of Sciences, containing 110 first-level disciplines and more than 600 second-level branches. Each first-level discipline corresponds to 500 to 2000 keywords, which are obtained by statistically analyzing the high-frequency words of highly cited literature in that discipline. The feature word boundaries of the second-level branches are determined by analyzing the unique words of representative literature in each branch, forming a thesaurus of 100 to 300 feature words for each branch. Similarity calculation adopts set operation method, with the numerator being the number of elements in the intersection of two word sets and the denominator being the number of elements in the union. This calculation method can effectively avoid the bias caused by the difference in the size of the word sets.
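The set-based similarity described here is the Jaccard index, and the two labeling steps can be sketched as follows. The function names and toy word sets are hypothetical, and exact word equality stands in for the string matching mentioned in paragraph [0061].

```python
def jaccard_similarity(topic_words, discipline_keywords):
    """Size of the intersection divided by size of the union of two word sets."""
    a, b = set(topic_words), set(discipline_keywords)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_branch(topic_words, branch_feature_words):
    """Pick the second-level branch whose feature words match the most topic words."""
    topics = set(topic_words)
    matches = {
        branch: len(topics & set(words))
        for branch, words in branch_feature_words.items()
    }
    return max(matches, key=matches.get)

similarity = jaccard_similarity({"cnn", "mri", "tumor"}, {"mri", "tumor", "ct"})
branch = best_branch(
    {"cnn", "mri"},
    {"artificial intelligence": ["cnn", "gan"], "medical imaging": ["mri", "ct", "cnn"]},
)
```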

[0063] S103. In the statistical citation distribution data, the number and proportion of citations in each research branch are calculated to determine the span of literature across different research branches.

[0064] The citation record list for each research branch is extracted from the citation distribution data. The number of citations in each branch is counted, and the percentage of branch citations to the total citations is calculated. A branch citation ranking table is formed by sorting the branches from highest to lowest citation count, thus obtaining citation statistics for each branch. For each branch's citation statistics, branches with a citation percentage greater than a preset minimum percentage threshold are identified as valid branches. The standard deviation and mean of the citation counts for all valid branches are calculated. If the standard deviation is less than a preset percentage threshold of the mean, it is considered a balanced distribution; otherwise, it is considered a concentrated distribution, thus obtaining a distribution type identifier. Based on the distribution type identifier and the total number of valid branches, if the distribution type is balanced, the span is calculated as the number of valid branches divided by the total number of research branches; if the distribution type is concentrated, the span is calculated as the number of valid branches minus one divided by the total number of research branches, thus determining the span of the literature across different research branches.

[0065] In one implementation, the citation distribution data is stored in a two-dimensional table structure. Rows represent research branches, and columns include fields such as branch code, branch name, citation count, and a list of cited references. The statistical process iterates through the citation list of each branch, sums the total citations for that branch, and divides this sum by the sum of citation counts for all branches to obtain the percentage. The sorting algorithm uses quicksort to reorganize the branches in descending order of citation count, forming a branch citation ranking table.

[0066] It should be noted that the threshold for identifying effective branches is based on the long-tail characteristic of citation distribution. Citations in scientific literature typically exhibit a power-law distribution, with a few branches accounting for the majority of citations, while most branches receive only sporadic citations. Setting a threshold of 5% filters out marginal branches with very few citations, retaining research directions with substantial influence. When the number of citations for a particular branch reaches 5% of the total citations, it indicates that the literature has had a significant academic impact in that branch.

[0067] Specifically, the distribution type is determined using the coefficient of variation method. The standard deviation reflects the dispersion of citation counts across the effective branches, while the mean represents their central tendency. With the preset proportion threshold set to 0.5, a standard deviation less than or equal to 0.5 times the mean indicates that the number of citations in each branch is relatively uniform, and the distribution is judged to be balanced. A standard deviation greater than 0.5 times the mean indicates that citations are concentrated in some branches, and the distribution is judged to be concentrated. This method quantitatively distinguishes the interdisciplinary influence patterns of literature.

[0068] Preferably, the span is normalized so that its value lies between 0 and 1. For literature with a balanced distribution, which has considerable influence across multiple branches, the span equals the number of effective branches divided by the total number of branches. For literature with a concentrated distribution, whose influence is focused on a few branches, the span is reduced: it is calculated as (number of effective branches minus one) divided by the total number of branches, reflecting the limits of its cross-branch influence.

[0069] For example, a paper on the application of machine learning in medical imaging diagnosis was cited 300 times, distributed across 15 research branches. The artificial intelligence branch had 150 citations (50%), the medical imaging branch had 90 citations (30%), the clinical diagnosis branch had 30 citations (10%), and the remaining 30 citations were distributed among the other branches. These three branches exceeded the 5% threshold and were treated as the valid branches. Their mean citation count was 90, and the sample standard deviation was 60. Since the standard deviation is greater than half the mean (45), the distribution is considered concentrated, with a span of (3-1) / 15, or approximately 0.133. This reflects that although the paper has interdisciplinary characteristics, its influence is still relatively concentrated.
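The worked example can be reproduced with the short sketch below. The function name, its parameters, and the way the remaining 30 citations are split over the other 12 branches are assumptions; the sample standard deviation is used because it yields the value 60 quoted in the example.

```python
import statistics

def branch_span(branch_counts, total_branches, min_share=0.05, ratio_threshold=0.5):
    """Return (span, distribution_type) following paragraphs [0064] to [0068]."""
    total = sum(branch_counts)
    valid = [c for c in branch_counts if c / total > min_share]
    if not valid:
        return 0.0, "none"
    mean = statistics.mean(valid)
    # Sample standard deviation (n - 1 denominator), an assumption that matches
    # the figure of 60 in the example.
    std = statistics.stdev(valid) if len(valid) > 1 else 0.0
    if std <= ratio_threshold * mean:
        return len(valid) / total_branches, "balanced"
    return (len(valid) - 1) / total_branches, "concentrated"

# 150 + 90 + 30 citations in three branches; the other 30 citations are spread
# over 12 minor branches (hypothetical split, each below the 5% threshold).
counts = [150, 90, 30] + [3] * 6 + [2] * 6
span, kind = branch_span(counts, total_branches=15)
print(kind, round(span, 3))  # concentrated 0.133
```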

[0070] By obtaining the research branch categories cited by the literature and the citation sources of each branch from the citation distribution data, we can identify the types of disciplines and branch directions that the literature crosses, analyze the topic distance and domain interval between different research branches, assess the scope of the literature's influence in multiple research branches, and determine the breadth and depth of the literature's cross-research branches.

[0071] Complete citation records for each research branch are extracted from citation distribution data. The journals, author institutions, and keyword sets of cited literature are obtained. The first-level discipline code for each citation is determined using a discipline classification mapping table. The number of disciplines crossed is counted. The main branch direction is identified based on the first three digits of the branch code, and the number of different main branch directions is calculated to obtain a discipline-crossing feature set. For each research branch in the discipline-crossing feature set, a core keyword set for each branch is extracted. The keywords are converted into high-dimensional vector representations using a word vector algorithm. The center vector of each branch's keyword vector set is calculated. The topic similarity between branches is calculated using a cosine similarity algorithm. The topic distance between branches is determined based on whether the similarity is below a preset distance threshold, resulting in a topic distance matrix between branches. Based on the topic distance matrix between branches, the citation density of each branch is calculated by dividing the number of citations by the number of subfields it contains. Branches with citation densities exceeding the average density are identified as core influential branches. The number of core influential branches is counted as the influence breadth value, and the coefficient of variation of the citation count of each core influential branch is calculated as the influence depth value, resulting in a breadth-depth numerical pair. Based on the breadth and depth numerical pairs, normalization is used to map the breadth and depth values to the 0-1 range. The overall span is equal to the weighted sum of the breadth and depth values according to a preset weight coefficient. 
The breadth and depth of the literature spanning research branches are determined, and the overall span value is calculated.

[0072] In one implementation, the construction of the subject-crossing feature set begins with the structured storage of citation distribution data. This citation distribution data is stored in a relational database and includes three core data tables: a citation record table, a journal information table, and a subject classification table. The citation record table records fields such as the unique identifier of each citation, the title of the cited document, the publishing journal, a list of author institutions, a set of keywords, and the year of publication. The journal information table is linked to the journal's ISSN number to obtain the first-level subject code to which the journal belongs.

[0073] It should be noted that the Word2Vec algorithm is used to convert keywords into vector representations, calculate the center vector of each branch, and measure the topic similarity between branches using cosine similarity.

[0074] Specifically, the calculation of citation density takes into account the internal structural complexity of research branches. Each research branch contains several subfields, and the number of subfields is identified by analyzing the last three digits of the branch code. Citation density is defined as the total number of citations for that branch divided by the number of subfields, reflecting the degree of concentration of citations within the branch. A high citation density indicates that although the branch contains multiple subfields, the literature still has a strong influence, demonstrating the universality of the literature content across subfields.
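A minimal sketch of the citation density step follows. The function name and the input layout (a mapping from branch code to citation count and subfield count) are hypothetical, and the figures are made up for illustration.

```python
def core_influence_branches(branches):
    """Return the branches whose citation density exceeds the average density.

    branches maps a branch code to (citation_count, number_of_subfields).
    """
    density = {code: count / subfields for code, (count, subfields) in branches.items()}
    average = sum(density.values()) / len(density)
    return [code for code, value in density.items() if value > average]

# Densities are 50, 15 and 10, so the average is 25 and only branch "A" is core.
core = core_influence_branches({"A": (150, 3), "B": (90, 6), "C": (30, 3)})
```

The length of the returned list is the influence breadth value used in the breadth-depth pair.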

[0075] Preferably, the calculation of the coefficient of variation involves two statistics: standard deviation and mean. First, the average number of citations across all core influence branches is calculated. Then, the sum of squared deviations of the number of citations in each branch from the mean is calculated, divided by the number of branches, and the square root is taken to obtain the standard deviation. The standard deviation divided by the mean is the coefficient of variation. The coefficient of variation eliminates the influence of dimensions and objectively reflects the dispersion of citation distribution. When the coefficient of variation is less than 0.3, it indicates that the citations across the core branches are relatively balanced, and the cross-branch influence of the literature is relatively uniform. When the coefficient of variation is greater than 0.7, it indicates that the citations are highly concentrated in a few branches, and the cross-branch influence is significantly biased.
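The calculation described in this paragraph divides by the number of branches, which corresponds to the population standard deviation. A sketch under that reading, with hypothetical function names and labels:

```python
import math

def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean."""
    mean = sum(counts) / len(counts)
    if mean == 0:
        return 0.0
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return math.sqrt(variance) / mean

def balance_label(cv, low=0.3, high=0.7):
    """Map a coefficient of variation to the qualitative bands given in the text."""
    if cv < low:
        return "balanced"
    if cv > high:
        return "concentrated"
    return "intermediate"  # label for the middle band is an assumption

cv = coefficient_of_variation([150, 90, 30])
label = balance_label(cv)
```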

[0076] In one possible implementation, the normalization process employs a min-max normalization method. The influence breadth value equals the number of core influence branches divided by the total number of research branches, naturally falling within the 0-1 range. The influence depth value is mapped to the 0-1 range by linearly transforming the reciprocal of the coefficient of variation; the smaller the coefficient of variation, the larger the influence depth value, indicating a deeper influence. Furthermore, the weighting coefficients of 0.6 and 0.4 are based on large-scale bibliographic statistical analysis. By analyzing the citation distribution characteristics of 10,000 highly cited documents, it was found that influence breadth contributes approximately 60% to the innovative value of the documents, while influence depth contributes approximately 40%. This weighting allocation emphasizes both the interdisciplinary breadth of the documents and the depth of penetration within each discipline.

[0077] Understandably, this quantitative evaluation method can distinguish between different types of interdisciplinary literature. Purely applied literature may be cited in multiple branches but has limited depth, resulting in a high breadth value but a low depth value; theoretically groundbreaking literature may be concentrated in a few branches but has a far-reaching impact, resulting in a high depth value but a limited breadth value. By comprehensively considering both breadth and depth, truly innovative literature with groundbreaking cross-disciplinary contributions can be identified more accurately.

[0078] S104. The large model Agent comprehensively evaluates the innovative value of literature based on the span between branches and the total number of citations. At the same time, it adjusts the weight of different literature according to the span between branches to generate an innovative value score dataset.

[0079] The large model agent retrieves the branch span value and total number of citations for each document from the literature database. It increments the total number of citations by one and takes the natural logarithm to obtain the citation base. The span value is multiplied by a preset amplification factor and then limited to the range of 0 to 1 as an adjustment coefficient. When the span exceeds a preset high span threshold, the adjustment coefficient reaches its maximum value; otherwise, it is calculated proportionally, resulting in a span adjustment coefficient table. Based on this table, the publication year of each document is extracted, and the difference between the current year and the publication year is calculated. A negative exponential decay formula is used to calculate the timeliness weight. The importance value of the subject to which the document belongs is retrieved from a preset subject weight configuration table as the domain weight. The citation base, adjustment coefficient, timeliness weight, and domain weight are multiplied together to obtain the original innovation score. For the original innovation scores, the large-scale model agent calculates the mean and standard deviation of the scores for literature in the same discipline. The original scores are then standardized by subtracting the mean and dividing by the standard deviation. Based on the standardized scores, the literature is categorized into three levels: scores above a preset high innovation threshold are labeled as high innovation; scores between the preset high and low innovation thresholds are labeled as medium innovation; and scores below the preset low innovation threshold are labeled as conventional innovation. Based on these innovation level labels and the standardized innovation scores, the large-scale model agent constructs a structured data table containing five fields: document identifier, span value, total citations, innovation score, and innovation level. 
This table is then sorted in descending order of innovation score to generate an innovation value scoring dataset.

[0080] In one implementation, the citation base is calculated with a logarithmic transformation to smooth the extreme distribution of citation counts. Citation counts in scientific literature typically exhibit a power-law distribution, with a few documents receiving a large number of citations while most receive few. Using the raw citation count directly would bias the scoring system too strongly towards highly cited documents. Adding one to the total number of citations avoids the undefined logarithm for documents with zero citations, and the natural logarithm maps the citation count to a more uniform numerical range. When a document is cited 10 times, the citation base is approximately 2.4; when cited 100 times, approximately 4.6; and when cited 1000 times, approximately 6.9. This non-linear compression ensures that citation counts of different magnitudes are reasonably represented.
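The three figures above follow directly from ln(1 + n), as this small sketch shows. The function name `citation_base` is hypothetical.

```python
import math

def citation_base(total_citations):
    """Natural logarithm of (citations + 1); defined for zero citations."""
    return math.log1p(total_citations)

for n in (0, 10, 100, 1000):
    print(n, round(citation_base(n), 1))  # 0 0.0, 10 2.4, 100 4.6, 1000 6.9
```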

[0081] It should be noted that the design of the span adjustment coefficient reflects an incentive mechanism for interdisciplinary innovation. The span value itself is between 0 and 1, reflecting the degree to which the literature is interdisciplinary. With 0.8 set as the threshold, a span above 0.8 indicates that the literature has extremely strong interdisciplinary characteristics, and the adjustment coefficient is set directly to the maximum value of 1, giving it sufficient weight enhancement. When the span is below 0.8, the adjustment coefficient is calculated linearly, giving 0.5 for a span of 0.4 and 0.75 for a span of 0.6. This piecewise function design ensures that literature with a high span is given sufficient attention while avoiding excessive penalty for literature with a low span. The amplification factor is usually set to 1.25, so that a span of 0.8 maps exactly to the maximum coefficient of 1.
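The piecewise rule can be written compactly as follows. The function and parameter names are hypothetical; the default values are the ones stated in this paragraph.

```python
def span_adjustment(span, amplification=1.25, high_span_threshold=0.8):
    """Span adjustment coefficient, clamped to the range 0 to 1."""
    if span > high_span_threshold:
        return 1.0  # maximum coefficient for strongly interdisciplinary literature
    return min(1.0, max(0.0, span * amplification))

coefficients = {s: span_adjustment(s) for s in (0.4, 0.6, 0.8, 0.9)}
```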

[0082] Specifically, the negative exponential decay formula for timeliness weighting is based on the theory of knowledge aging. The influence of scientific and technological literature gradually decreases over time, but the rate of decay varies across disciplines. Basic disciplines such as mathematics and physics have longer half-lives, while applied disciplines such as computer science have shorter half-lives. The decay formula is exp(-λ×t), where λ is the decay coefficient and t is the number of years since publication. For computer science, λ is typically set to 0.2, so the timeliness weight of a document drops to about 0.37 after 5 years; for mathematics, λ is set to 0.05, so the timeliness weight of a document is still about 0.37 after 20 years. This differentiated treatment of timeliness ensures that literature from different disciplines is evaluated fairly.
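Both examples in this paragraph reduce to exp(-1), which is about 0.37. A minimal sketch, with a hypothetical function name:

```python
import math

def timeliness_weight(years_since_publication, decay):
    """Negative exponential decay exp(-decay * t)."""
    return math.exp(-decay * years_since_publication)

cs_weight = timeliness_weight(5, decay=0.2)      # computer science example
math_weight = timeliness_weight(20, decay=0.05)  # mathematics example
print(round(cs_weight, 2), round(math_weight, 2))  # 0.37 0.37
```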

[0083] In one possible implementation, the original innovation score is calculated by multiplying four factors. The citation base provides the initial score, the span adjustment coefficient reflects the interdisciplinary bonus, the timeliness weight reflects the effect of time, and the domain weight represents disciplinary importance. The original score obtained by multiplying these four factors reflects the innovative value of a document across multiple dimensions. For example, for a document cited 50 times with a span of 0.6, the citation base is ln(51) ≈ 3.93 and the span adjustment coefficient is 0.6 × 1.25 = 0.75; their product is then multiplied by the timeliness weight and the domain weight to give the original innovation score. Furthermore, the standardization process employs Z-score standardization to transform document scores from different disciplines to a unified, comparable scale. First, the mean μ and standard deviation σ of the original scores for all documents within the same discipline are calculated. Then, the score x of each document is transformed using (x-μ) / σ. The standardized scores have a mean of 0 and a standard deviation of 1 and, where the original scores are approximately normally distributed, follow the standard normal distribution. This process eliminates differences in citation habits across disciplines, making interdisciplinary comparisons possible.
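The four-factor product and the per-discipline Z-score can be sketched as below. The function names are hypothetical, the timeliness and domain weights of 1.0 in the example are placeholders rather than values from the text, and the population standard deviation is an assumption because the text does not say which estimator is used.

```python
import math
import statistics

def raw_innovation_score(citation_base, adjustment, timeliness, domain_weight):
    """Product of the four factors described in the text."""
    return citation_base * adjustment * timeliness * domain_weight

def z_scores(raw_scores):
    """Standardize the scores of documents that share one discipline."""
    mu = statistics.mean(raw_scores)
    sigma = statistics.pstdev(raw_scores)  # population std, an assumption
    if sigma == 0:
        return [0.0 for _ in raw_scores]
    return [(x - mu) / sigma for x in raw_scores]

# Document cited 50 times with span 0.6; the last two weights are placeholders.
score = raw_innovation_score(math.log1p(50), 0.6 * 1.25, 1.0, 1.0)
standardized = z_scores([1.0, 2.0, 3.0, score])
```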

[0084] For example, the three-level classification of innovation level is based on the statistical normal distribution characteristics. A standardized score greater than 1 means that the innovation value of the literature exceeds that of 84% of the literature in the same discipline, belonging to the high innovation category; literature with a score between -1 and 1 accounts for 68% of the total, representing the mainstream innovation level; literature with a score less than -1 has relatively low innovation value, but may still have its value in specific sub-fields.
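With thresholds of 1 and -1 as in the example above, the three-level labeling reduces to a pair of comparisons. The function name and the English label strings are illustrative choices.

```python
def innovation_level(z_score, high=1.0, low=-1.0):
    """Map a standardized score to one of the three innovation levels."""
    if z_score > high:
        return "high innovation"
    if z_score < low:
        return "conventional innovation"
    return "medium innovation"

levels = [innovation_level(z) for z in (1.8, 0.2, -1.4)]
```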

[0085] Understandably, the structured data table design facilitates subsequent data analysis and application. Document identification codes use DOI or other unique identifiers to ensure data traceability; span values and total citation counts retain the original data for easy verification and recalculation; innovation scores are represented as floating-point numbers with two decimal places; and innovation levels use classification labels. Sorting by innovation score in descending order prioritizes high-value documents, enabling research management departments to quickly identify important innovative achievements.

[0086] S105. Based on the innovation value score and a preset span threshold, the large model agent selects from the innovation value score dataset the candidate documents that satisfy both conditions, a high score and a high span, and obtains a list of candidate breakthrough innovation documents.

[0087] The large model agent reads the innovation score and span value of each document from the innovation value scoring dataset. Based on a preset innovation score threshold, it filters out the set of documents whose scores are higher than that threshold, and based on the preset span threshold it filters out the set of documents whose span values are higher than that threshold, resulting in two preliminary selection sets. The intersection of these two preliminary selection sets is taken to identify the document records present in both. The identifier, innovation score, and span value of these documents are extracted to obtain a candidate document set that meets both conditions. For this candidate document set, a weighted sum of the innovation score and span value is calculated for each document: the innovation score is multiplied by a preset innovation weight coefficient, the span value is multiplied by a preset span weight coefficient, and the two products are added together to obtain a comprehensive score. Documents are sorted in descending order of comprehensive score. The top-ranked documents, up to a preset percentage, are marked as core breakthrough candidates, and the rest are marked as general breakthrough candidates, resulting in a list of candidate breakthrough innovation documents.

[0088] In one implementation, the innovation value scoring dataset is stored in a relational database table structure, including fields such as document ID, DOI, title, author, publishing journal, innovation score, span value, innovation level, and branch citation distribution. The branch citation distribution data is derived from document citation network analysis. The large model agent performs filtering operations using SQL queries. The innovation score threshold is typically set as the mean score of documents within the same discipline plus one standard deviation, ensuring that the selected documents have significant innovation value. Within the same discipline, the mean μ and standard deviation σ are calculated from the entire dataset, with the threshold being μ + σ. The span threshold is set to 0.6, meaning that a document is considered to have groundbreaking interdisciplinary characteristics only if the ratio of the number of effective branches to the total number of research branches is at least 0.6.

[0089] It's worth noting that the intersection operation uses a hash table data structure to improve efficiency. The document IDs in the first filter set are stored in the hash table. When traversing the second filter set, each document ID is checked to see if it exists in the hash table; if it does, it is added to the intersection result. This method has a time complexity of O(n), which is a significant improvement compared to the O(n²) complexity of nested loops, especially when dealing with large datasets containing tens of thousands of documents.
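In Python the hash table is available as the built-in `set`, so the intersection step can be sketched in a few lines. The function and argument names are hypothetical.

```python
def intersect_ids(high_score_ids, high_span_ids):
    """Return the IDs present in both filter sets, in the order of the second one."""
    seen = set(high_score_ids)  # hash table built from the first filter set
    return [doc_id for doc_id in high_span_ids if doc_id in seen]

candidates = intersect_ids(["d1", "d2", "d3"], ["d3", "d4", "d1"])
```

Building the set costs O(n) and each membership test is O(1) on average, which gives the linear overall cost mentioned above.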

[0090] Specifically, the weighted sum calculation considers the varying importance of innovation value and interdisciplinary degree. The literature domain judgment is based on keyword analysis, journal type, or abstract content classification: in basic research, interdisciplinary characteristics are more emphasized, with a weight of 0.6 for cross-disciplinary scope and 0.4 for innovation score; in applied research, innovation value is more critical, with a weight of 0.7 for innovation score and 0.3 for cross-disciplinary scope. The weighting coefficients are determined based on expert evaluation and historical data analysis. The rationality of the weighting settings is verified through retrospective analysis of literature already identified as groundbreaking achievements.

[0091] Preferably, the classification of candidate documents adopts a dynamic proportional division. The proportion of core breakthrough candidates is dynamically adjusted according to the size of the candidate document set: when the total number of candidate documents is less than 50, the top 30% are marked as core; when the total number is between 50 and 200, the top 20% are marked as core; and when the total number exceeds 200, the top 10% are marked as core. This dynamic adjustment mechanism avoids the problem of too many or too few core candidates.

[0092] For example, in the 2023 annual evaluation of a research institution, from a scoring dataset of 8,000 documents, 1,200 documents had innovation scores exceeding the threshold, and 800 documents had span values exceeding the threshold. Taking the intersection yielded 320 candidate documents. A comprehensive score was calculated using weights of 0.7 and 0.3, and because the candidate set exceeded 200 documents, the top 10%, that is, 32 documents, were marked as core breakthrough candidates, while the remaining 288 documents were marked as general breakthrough candidates. The resulting list provided an objective basis for the institution's annual selection of major achievements and allocation of research awards.
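The dynamic proportion from paragraph [0091] and the 320-document example can be sketched as follows. The function names, the dictionary layout of a candidate, the synthetic scores, and the use of `round` to turn a proportion into a count are all assumptions; the text does not say how fractional counts are handled.

```python
def core_fraction(n_candidates):
    """Proportion of candidates marked as core, by candidate set size."""
    if n_candidates < 50:
        return 0.30
    if n_candidates <= 200:
        return 0.20
    return 0.10

def split_candidates(candidates, w_score=0.7, w_span=0.3):
    """Rank by weighted sum of score and span, then split into core and general."""
    ranked = sorted(
        candidates,
        key=lambda c: w_score * c["score"] + w_span * c["span"],
        reverse=True,
    )
    n_core = round(len(ranked) * core_fraction(len(ranked)))
    return ranked[:n_core], ranked[n_core:]

# 320 synthetic candidates with increasing scores and a constant span.
candidates = [{"id": i, "score": 1.0 + i / 320, "span": 0.6} for i in range(320)]
core, general = split_candidates(candidates)
print(len(core), len(general))  # 32 288
```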

[0093] S106. The large model agent cross-validates the candidate literature list, combines the citation ratio information of each branch to confirm the effectiveness of its cross-domain breakthrough features, completes the final annotation, and identifies groundbreaking innovative literature.

[0094] The large-scale model agent extracts the branch citation distribution data for each document from the candidate list of groundbreaking innovation documents, calculates the percentage of citations for each research branch relative to the total citations, and counts the number of valid branches whose percentage exceeds a preset minimum threshold. If the number of valid branches is less than the preset number, the verification is deemed unsuccessful, yielding preliminary verification results. Based on these preliminary verification results, the Gini coefficient of the citation distribution is calculated for documents that pass the preliminary verification. Documents with a Gini coefficient less than a preset equilibrium threshold are labeled as groundbreaking innovation documents, while those with a Gini coefficient greater than the threshold are labeled as non-groundbreaking innovation documents. This completes the final labeling and identification of groundbreaking innovation documents.

[0095] In one implementation, a minimum threshold of 5% is used to determine the validity of a research branch: a branch must account for more than 5% of the total citations to be considered valid. The preset number of valid branches is typically set to three, meaning that a document must have at least three research branches that each account for more than 5% of its citations in order to pass preliminary verification. This threshold is based on the Pareto principle in bibliometrics; it filters out marginal branches with only sporadic citations and retains research directions with substantial impact.

[0096] It should be noted that the Gini coefficient, originally used to measure income inequality, is used in literature innovation assessment to quantify the concentration of citation distribution. The calculation process sorts each branch by its citation share from smallest to largest, and then accumulates the area under the Lorenz curve. The Gini coefficient is equal to 1 minus twice that area. The coefficient value ranges from 0 to 1; a value close to 0 indicates a uniform citation distribution, while a value close to 1 indicates a high degree of concentration.

[0097] Preferably, the equilibrium threshold is set to 0.4. A Gini coefficient below 0.4 indicates that the document's citation sources are relatively dispersed and that it has cross-disciplinary influence; a value above 0.4 indicates that citations are overly concentrated in a few branches and that cross-disciplinary characteristics are not evident. This dual verification mechanism ensures that the identified groundbreaking innovative literature has both sufficient breadth of influence and a reasonably balanced distribution.
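
The two verification stages can be combined in a short sketch. The defaults below (5% minimum share, three valid branches, 0.4 threshold) are the values given in the text. The function names, the returned labels, the choice to compute the Gini coefficient over all branches rather than only valid ones, and the treatment of a coefficient exactly equal to the threshold as non-groundbreaking are assumptions that the text does not settle.

```python
def _gini(counts):
    # Lorenz-curve Gini coefficient (trapezoid rule); assumes sum(counts) > 0.
    total = sum(counts)
    shares = sorted(c / total for c in counts)
    n = len(shares)
    area, cumulative = 0.0, 0.0
    for share in shares:
        area += (2 * cumulative + share) / (2 * n)
        cumulative += share
    return 1.0 - 2.0 * area


def verify_candidate(branch_counts, min_share=0.05, min_valid=3,
                     gini_threshold=0.4):
    """Label one candidate document from its per-branch citation counts.

    branch_counts: dict mapping a branch code to its citation count.
    Returns one of three illustrative labels (assumed, not from the source).
    """
    total = sum(branch_counts.values())
    if total == 0:
        return "failed preliminary verification"

    # Stage 1: count branches whose share exceeds the minimum threshold.
    valid = [b for b, c in branch_counts.items() if c / total > min_share]
    if len(valid) < min_valid:
        return "failed preliminary verification"

    # Stage 2: require a sufficiently even citation distribution.
    if _gini(list(branch_counts.values())) < gini_threshold:
        return "groundbreaking"
    return "non-groundbreaking"
```

For example, counts of 30, 25, 25 and 20 across four branches pass both stages, while a document with 70% of its citations in one branch and the rest spread thinly over seven others passes the branch count but fails the Gini check.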

[0098] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A method for evaluating the innovativeness of scientific literature based on large model agent technology, characterized in that it comprises:
A large model agent is constructed to obtain the complete citation records of the target document through the literature database interface, including the total number of citations and detailed information on each citation source. The collected citation records are stored in the long-term memory of the large model agent to form a citation dataset.
Each citation source in the citation dataset is hierarchically classified by discipline and research branch. Its discipline domain code and specific research branch code are determined according to the discipline classification system to obtain citation distribution data containing complete classification labels.
From the citation distribution data, the number and proportion of citations in each research branch are counted, and the span of the document across different research branches is calculated, specifically including:
Extract the citation record list for each research branch from the citation distribution data, count the number of citations contained in each branch, calculate the percentage of each branch's citations relative to the total number of citations, and sort the branches by number of citations from high to low to form a branch citation ranking table.
Branches whose citation percentage is greater than a preset minimum percentage threshold are identified as valid branches. The standard deviation and mean of the citation counts of all valid branches are calculated. When the standard deviation is less than a preset percentage of the mean, the distribution is determined to be balanced; otherwise, it is determined to be concentrated.
Based on the distribution type and the total number of valid branches, if the distribution is balanced, the span is taken as the number of valid branches divided by the total number of research branches; otherwise, it is taken as the number of valid branches minus one, divided by the total number of research branches. This determines the span of the document across different research branches.
The calculation of the span of the document across different research branches also includes:
Extract the complete citation records for each research branch from the citation distribution data, obtain the journals, author institutions, and keyword sets of the citing literature, determine the first-level discipline code to which each citation belongs through the discipline classification mapping table, and count the number of disciplines spanned.
Extract the core keyword set of each branch, use a word vector algorithm to convert the keywords into high-dimensional vector representations, calculate the center vector of each branch's keyword vector set, and calculate the topic similarity between branches using the cosine similarity algorithm;
The citation density is obtained by dividing the number of citations for each branch by the number of subdomains contained in that branch. Branches with citation densities exceeding the average density are identified as core influential branches. The number of core influential branches is counted as the influence breadth value. The coefficient of variation of the number of citations across the core influential branches is calculated as the influence depth value. Normalization is used to map the influence breadth and influence depth values to the interval between 0 and 1. The overall span is equal to the weighted sum of the influence breadth and influence depth values according to the preset weight coefficients.
The large model agent comprehensively evaluates the innovative value of each document based on the span between branches and the total number of citations. At the same time, it adjusts the weight of different documents according to the span between branches to generate an innovation value score dataset.
The large model agent selects candidate documents from the innovation value score dataset that simultaneously meet the dual conditions of high score and high span, based on the innovation value score and the preset span threshold, thus obtaining a list of candidate breakthrough innovation documents.
The large model agent cross-validates the candidate document list, combines the citation ratio information of each branch to confirm the validity of its cross-domain breakthrough features, completes the final annotation, and identifies groundbreaking innovative literature.
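
The valid-branch span rule in claim 1 can be illustrated with a small sketch. The claim leaves the thresholds as presets, so the default values here (`min_share=0.05`, `balance_ratio=0.5`), the use of the population standard deviation, and the function name `branch_span` are all assumptions made for illustration only.

```python
from statistics import mean, pstdev


def branch_span(branch_counts, total_branches, min_share=0.05,
                balance_ratio=0.5):
    """Span of a document across research branches (illustrative sketch).

    branch_counts:  dict of branch code -> citation count for the document.
    total_branches: total number of research branches in the classification.
    balance_ratio:  assumed value for the "preset percentage of the mean".
    """
    total = sum(branch_counts.values())
    if total == 0 or total_branches <= 0:
        return 0.0

    # Valid branches: citation share above the minimum percentage threshold.
    valid = [c for c in branch_counts.values() if c / total > min_share]
    if not valid:
        return 0.0

    # Balanced if the standard deviation is below a fraction of the mean.
    balanced = pstdev(valid) < balance_ratio * mean(valid)

    # Balanced: valid / total; concentrated: (valid - 1) / total.
    numerator = len(valid) if balanced else len(valid) - 1
    return numerator / total_branches
```

With ten research branches in total, counts of 30, 30 and 40 give three valid branches and a balanced distribution, hence a span of 0.3. Counts of 90, 6 and 4 give two valid branches and a concentrated distribution, hence a span of 0.1.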

2. The method for evaluating the innovativeness of scientific literature based on large model agent technology according to claim 1, characterized in that obtaining, by the large model agent, the complete citation records of the target document through the literature database interface, including the total number of citations and detailed information on each citation source, and storing the collected citation records in the long-term memory of the large model agent to form a citation dataset, includes: The large model agent calls the API of the Web of Science or Scopus database, initiates a query request based on the DOI of the target document, obtains the returned citation data package in JSON format, parses the total-citations field and the citation-details array in the package, and extracts the author name, publication journal, publication year and document title of each citation record to obtain the original citation record set; The original citation record set is cleaned to remove duplicate citation records and self-citation entries. Valid citation records dated after the target time are retained based on their timestamps. Missing information is supplemented through the CrossRef database to obtain a standardized citation record set. The large model agent uses a hash algorithm to generate unique index keys for the standardized citation record set, establishes a mapping table between the citing source documents and the cited document, and stores it in the database of the long-term memory module to form a traceable citation dataset.
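
The cleaning and indexing steps of claim 2 might look like the sketch below. It is offered under several assumptions that the claim does not specify: the record field names, SHA-256 as the hash algorithm, author-name overlap as the self-citation test, and a publication-year cutoff standing in for the timestamp filter. The database query and the CrossRef supplementation step are deliberately left out.

```python
import hashlib


def index_key(record):
    """Stable index key for a citation record (SHA-256 is an assumed choice)."""
    raw = "|".join([
        record["title"].strip().lower(),
        record["journal"].strip().lower(),
        str(record["year"]),
    ])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def build_citation_dataset(target_doi, target_authors, raw_records, min_year):
    """Clean raw citation records and map each citing record to the target.

    raw_records: list of dicts with hypothetical keys
                 "authors" (list of str), "journal", "year", "title".
    """
    target = {name.strip().lower() for name in target_authors}
    dataset = {}
    for record in raw_records:
        # Keep only records from the target time onward (assumed year cutoff).
        if record["year"] < min_year:
            continue
        # Drop self-citations: any shared author name (assumed heuristic).
        authors = {name.strip().lower() for name in record["authors"]}
        if authors & target:
            continue
        # Drop duplicates: identical normalized title, journal and year.
        key = index_key(record)
        if key in dataset:
            continue
        dataset[key] = {"cited": target_doi, "citing": record}
    return dataset
```

Keying the mapping table on a content hash keeps each citing record traceable to the target document and makes repeated ingestion of the same record a no-op.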

3. The method for evaluating the innovativeness of scientific literature based on large model agent technology according to claim 1, characterized in that hierarchically classifying each citation source in the citation dataset by discipline and research branch, determining its discipline domain code and specific research branch code according to the discipline classification system, and obtaining citation distribution data containing complete classification labels, includes: The large model agent extracts the author name, publication journal, publication year, and document title of each citation record from the citation dataset. It queries the ISSN of the journal based on the publication journal and extracts a set of document keywords from the document title. It queries the first-level discipline code to which the journal belongs through the subject classification table of the Documentation and Information Center of the Chinese Academy of Sciences. The TF-IDF algorithm is used to vectorize the keywords, a dot product is computed with the pre-constructed feature vector of each research branch, and the branch with the highest similarity is selected as the preliminary branch attribution. Starting from the preliminary branch attribution, when the similarity between the keyword vector of a citation record and the feature vectors of multiple research branches exceeds the preset interdisciplinary judgment threshold, the record is judged to be an interdisciplinary document and is assigned a main branch code and a sub-branch code; otherwise, a single branch code is kept. This yields a hierarchical classification label for each citation record. Based on the hierarchical classification labels, citation records are organized into three levels: subject category, first-level discipline, and research branch. The number and percentage of citations under each level node are counted. 
For interdisciplinary literature, different statistical weights are assigned to the main branch and sub-branch respectively. A structured data table containing fields such as subject code, branch code, number of citations, and citation percentage is generated to form citation distribution data with complete classification labels.
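
The TF-IDF dot-product assignment in claim 3 can be sketched with sparse vectors held in plain dictionaries. The IDF table, the branch feature vectors, the branch codes and the threshold value are all hypothetical inputs; the claim only states that they are pre-constructed or preset.

```python
from collections import Counter


def tfidf_vector(keywords, idf):
    """Sparse TF-IDF vector for one record's keywords (idf is assumed given)."""
    counts = Counter(keywords)
    n = len(keywords)
    return {word: (c / n) * idf.get(word, 0.0) for word, c in counts.items()}


def dot(u, v):
    """Dot product of two sparse vectors stored as dicts."""
    return sum(x * v.get(word, 0.0) for word, x in u.items())


def assign_branches(keywords, idf, branch_vectors, cross_threshold):
    """Return the main branch and, for interdisciplinary records, a sub-branch."""
    if not keywords or not branch_vectors:
        raise ValueError("keywords and branch_vectors must be non-empty")
    vector = tfidf_vector(keywords, idf)

    # Score every branch and rank from most to least similar.
    ranked = sorted(
        ((dot(vector, features), code)
         for code, features in branch_vectors.items()),
        reverse=True,
    )
    main = ranked[0][1]

    # Interdisciplinary if more than one branch exceeds the threshold.
    above = [code for score, code in ranked if score > cross_threshold]
    sub = above[1] if len(above) >= 2 else None
    return {"main": main, "sub": sub}
```

A raw dot product favors branches whose feature vectors have large magnitudes; normalizing both vectors first (cosine similarity) would be a reasonable variation, but the claim as written specifies the dot product.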

4. The method for evaluating the innovativeness of scientific literature based on large model agent technology according to claim 3, characterized in that hierarchically classifying each citation source in the citation dataset by discipline and research branch also includes: The full journal name and abstract text of each citation record are extracted from the citation dataset. Noun phrases are extracted from the abstract text using natural language processing tools to form a research topic vocabulary set. The subject classification code of the journal is queried according to the SCI journal partition table to obtain the initial classification information of the journal. Using a predefined vocabulary list for the subject classification system, the number of words shared by the research topic vocabulary set and each subject's keyword set is counted and divided by the number of words in the union of the two sets. If the result exceeds a predefined threshold, the citation is determined to belong to that subject field, and a subject field label is obtained. Based on the subject field label, the corresponding set of secondary branches is located, string matching is performed between the research topic vocabulary and the feature words of each secondary branch, the number of successfully matched feature words is counted, and the branch with the most matches is selected as the research branch label, thereby determining the subject field label and research branch label of the citation source.
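
The overlap ratio in claim 4 (shared words divided by the size of the union) is the Jaccard index. A minimal sketch of the two-stage labeling follows. It assumes that the highest-scoring subject above the threshold is chosen when several qualify and that "string matching" means a feature word occurring as a substring of a topic phrase; neither detail is fixed by the claim, and all names are illustrative.

```python
def jaccard(a, b):
    """Shared words divided by the size of the union of the two sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def label_citation(topic_words, subject_vocab, branch_features, threshold):
    """Return (subject label, branch label) for one citation record.

    subject_vocab:   dict of subject -> keyword set (hypothetical input).
    branch_features: dict of subject -> {branch -> list of feature words}.
    """
    topics = set(topic_words)

    # Stage 1: subject whose vocabulary overlap exceeds the threshold.
    scores = {s: jaccard(topics, vocab) for s, vocab in subject_vocab.items()}
    qualifying = {s: v for s, v in scores.items() if v > threshold}
    if not qualifying:
        return (None, None)
    subject = max(qualifying, key=qualifying.get)

    # Stage 2: branch with the most feature words matched in topic phrases.
    branches = branch_features.get(subject, {})
    if not branches:
        return (subject, None)

    def matches(features):
        return sum(1 for f in features if any(f in t for t in topics))

    branch = max(branches, key=lambda code: matches(branches[code]))
    return (subject, branch)
```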

5. The method for evaluating the innovativeness of scientific literature based on large model agent technology according to claim 1, characterized in that the large model agent comprehensively evaluating the innovation value of each document based on the span between branches and the total number of citations, and adjusting the weights of different documents according to the span between branches to generate an innovation value score dataset, includes: The large model agent obtains the branch span value and the total number of citations for each document from the literature database. One is added to the total number of citations and the natural logarithm of the result is taken to obtain the citation base. The span value is multiplied by a preset amplification factor and then limited to the range of 0 to 1 to serve as the adjustment coefficient. When the span exceeds the preset high span threshold, the adjustment coefficient takes its maximum value; otherwise, it is calculated proportionally. The publication year of each document is extracted, the difference between the current year and the publication year is calculated, and the timeliness weight is calculated using a negative exponential decay formula. The importance value of the subject to which the document belongs is retrieved from the preset subject weight configuration table as the domain weight. The citation base, adjustment coefficient, timeliness weight and domain weight are multiplied together to obtain the original innovation score; The mean and standard deviation of the scores for documents in the same subject are calculated. Each original score is standardized by subtracting the mean and dividing the result by the standard deviation. 
Based on the standardized scores, the literature is divided into three levels: those with scores greater than the preset high innovation threshold are marked as high innovation, those between the preset high innovation threshold and the low innovation threshold are marked as medium innovation, and those less than the preset low innovation threshold are marked as regular innovation. A structured data table containing five fields, including document identification code, span value, total number of citations, innovation score and innovation level, is constructed and sorted in descending order of innovation score to generate an innovation value scoring dataset.
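
The scoring chain in claim 5 can be written out as a short sketch. Every numeric default below (amplification factor, high span threshold, decay rate, level thresholds) is an assumed placeholder for a value the claim calls "preset", and the function names are illustrative.

```python
import math
from statistics import mean, pstdev


def raw_innovation_score(citations, span, age_years, domain_weight,
                         amplification=2.0, high_span=0.5, decay=0.1):
    """Original innovation score for one document (assumed default presets)."""
    # Citation base: natural logarithm of (total citations + 1).
    base = math.log(citations + 1)
    # Adjustment coefficient: maximum above the high span threshold,
    # otherwise proportional to the span and clipped to [0, 1].
    if span > high_span:
        adjustment = 1.0
    else:
        adjustment = min(max(span * amplification, 0.0), 1.0)
    # Timeliness weight: negative exponential decay in document age.
    timeliness = math.exp(-decay * age_years)
    return base * adjustment * timeliness * domain_weight


def standardize(scores):
    """Z-scores within one subject: (score - mean) / standard deviation."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def innovation_level(z, high=1.0, low=-1.0):
    """Map a standardized score to one of the three levels in the claim."""
    if z > high:
        return "high innovation"
    if z < low:
        return "regular innovation"
    return "medium innovation"
```

Taking the logarithm of the citation count damps the advantage of very highly cited documents, so the span-based adjustment coefficient has a visible effect on the final ranking.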

6. The method for evaluating the innovativeness of scientific literature based on large model agent technology according to claim 1, characterized in that the large model agent selecting candidate documents from the innovation value score dataset that simultaneously meet the dual conditions of high score and high span, based on the innovation value score and the preset span threshold, to obtain a list of candidate breakthrough innovation documents, includes: The large model agent reads the innovation score and span value of each document from the innovation value score dataset, filters out the set of documents with a score higher than the preset innovation score threshold, and at the same time filters out the set of documents with a span higher than the preset span threshold. An intersection operation is performed on the two sets, and the identifier, innovation score and span value of the document records that exist in both sets are extracted; The weighted sum of the innovation score and the span value is calculated for each document, the documents are sorted in descending order of this comprehensive score, the top-ranked documents are marked as core breakthrough candidates, and the rest are marked as general breakthrough candidates, thus obtaining a list of candidate breakthrough innovation documents.
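
The candidate selection in claim 6 reduces to a set intersection followed by a weighted ranking. In the sketch below the row field names, the equal weights and the `top_k` cutoff for "top-ranked" are assumptions, since the claim leaves them as presets.

```python
def select_candidates(rows, score_threshold, span_threshold,
                      w_score=0.5, w_span=0.5, top_k=1):
    """Build the candidate list from score-dataset rows.

    rows: list of dicts with hypothetical keys "id", "score" and "span".
    """
    # Two independent filters, then their intersection.
    high_score = {r["id"] for r in rows if r["score"] > score_threshold}
    high_span = {r["id"] for r in rows if r["span"] > span_threshold}
    both = high_score & high_span

    # Weighted sum of innovation score and span value as the composite score.
    selected = [
        {
            "id": r["id"],
            "score": r["score"],
            "span": r["span"],
            "composite": w_score * r["score"] + w_span * r["span"],
        }
        for r in rows if r["id"] in both
    ]
    selected.sort(key=lambda c: c["composite"], reverse=True)

    # Top-ranked documents become core candidates; the rest are general.
    for rank, candidate in enumerate(selected):
        candidate["label"] = "core" if rank < top_k else "general"
    return selected
```

Because standardized innovation scores and span values lie on different scales, the relative weights strongly influence the ordering; rescaling both to a common range before summing is one option the claim does not rule out.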

7. The method for evaluating the innovativeness of scientific literature based on large-scale model agent technology according to claim 1, characterized in that, The large model agent cross-validates the candidate literature list, combines the citation ratio information of each branch to confirm the effectiveness of its cross-domain breakthrough features, completes the final annotation, and identifies groundbreaking innovative literature, including: The large model agent extracts the branch citation distribution data of each paper from the candidate list of groundbreaking innovative papers, calculates the percentage of citations of each research branch to the total number of citations, counts the number of valid branches whose percentage exceeds the preset minimum threshold, and judges the validation as unsuccessful if the number of valid branches is less than the preset number. Based on the preliminary verification results, the Gini coefficient of the citation distribution of the literature that passed the preliminary verification is calculated. Literature with a Gini coefficient less than the preset equilibrium threshold is marked as groundbreaking innovative literature, and literature with a Gini coefficient greater than the threshold is marked as non-groundbreaking innovative literature. The final labeling and identification of groundbreaking innovative literature are completed.

Citation Information

Patent Citations

Academic research analysis method and device based on large language model and medium
CN121327108A
College scientific research topic selection recommendation system based on academic literature map and implementation method
CN121681894A
