A literature reading method, system, device and medium
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INNER MONGOLIA UNIVERSITY
- Filing Date
- 2025-05-28
- Publication Date
- 2026-06-12
AI Technical Summary
Existing literature management and analysis systems fail to delve deeply into the content of literature, making it impossible to achieve efficient reading and summarizing comprehension, especially in extracting key information and identifying research directions related to citations.
The YOLO++ algorithm extracts the literature structure, from which the hierarchical structure of the literature is constructed. The prefix-tuned PT-GLM-6B model extracts the main content, and the sparse soft mixture-of-experts model SoftMoe-SM-BERT identifies citation intent.
It enables efficient reading and understanding of literature, and can quickly extract key information and citation-related information, thereby improving research efficiency and quality.
Figure CN120670529B_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This invention belongs to the field of retrieval technology, specifically relating to a document guidance method, system, device, and medium.
Background Technology
[0002] Scientific research is a crucial means of exploring the future and innovating in unknown fields. It is not only the cornerstone of scientific development but also a vital driving force for human civilization and national economic and social development. Breakthroughs and innovations in science and technology are all built upon the foundation of learning from, summarizing, absorbing, and transforming the experiences of predecessors. Literature serves as the carrier of scientific inheritance and scientific ethics, the basis and foundation for topic selection and decision-making, and the raw material for writing papers. Literature review can quickly obtain key information and core content from literature, grasp the breadth and depth of research, avoid duplication of work, prevent intellectual property disputes, find breakthroughs in research, and improve research efficiency and quality. Existing literature reading tools are designed to help users manage, read, and organize literature more effectively, improving the efficiency and experience of academic research. These commonalities reflect the basic functional requirements of literature reading tools, which can be categorized into the following two types based on function:
[0003] The first category consists of literature management, retrieval, and reading tools. For example, AMiner, developed by the Institute of Computing Technology, Chinese Academy of Sciences, is a platform dedicated to searching and analyzing academic literature. It provides intelligent retrieval, author information analysis, citation analysis, and other functions to help users quickly discover relevant literature and understand academic trends. ReadCube is a literature management and reading application that offers paper discovery, PDF organization, annotation, and sharing functions to help users efficiently manage their academic literature.
[0004] The second category comprises systems focused on generating and deconstructing document content. Examples include Semantic Scholar, a free academic search engine developed by the Allen Institute for AI, which provides intelligent functions such as paper abstracts, key concepts, and citation analysis to enhance users' understanding of document content. CORE, an open-access paper search engine developed by a university, offers intelligent document recommendations, relevance analysis, and knowledge graphs to help users discover valuable academic findings. Scholarcy, an AI-based document abstracting tool developed by Scholarcy Ltd., uses natural language processing to automatically analyze PDF files, extract key sentences and paragraphs, and generate concise paper abstracts.
[0005] In summary, existing literature management and analysis systems primarily focus on literature retrieval, organization, and reading assistance, offering some automatic summary generation functions. However, these systems are not specifically designed for rapid comprehension of the literature content itself. They fail to extract key information frameworks from the main text, thus failing to delve into the potential information contained within the literature and unable to achieve the goals of efficient reading and summarizing. Furthermore, readers often consider the relevant research directions of cited texts during the reading process; this background information also has value and cannot be ignored. Existing literature guidance systems fail to uncover this type of information, therefore failing to truly achieve rapid and efficient literature comprehension and information extraction.
Summary of the Invention
[0006] To overcome the shortcomings of the existing technology, the present invention provides a document guidance method, comprising the following steps:
[0007] Obtain the literature to be retrieved;
[0008] The YOLO++ algorithm is used to extract the literature structure of the literature to be retrieved, and the literature to be retrieved is decomposed into multiple structural units of different granularities according to chapters; multiple different structural units are divided into different categories according to subheadings; a prefix vector is added before each attention layer of the encoder of the GLM-6B generative language model to obtain the prefix-tuned PT-GLM-6B model; the prefix vector of the PT-GLM-6B model is used to extract the main content of the paragraphs corresponding to the subheadings of each category.
[0009] Taking the BERT model as the base model, a sparse matrix SM and a soft mixture-of-experts model SoftMoe are introduced into BERT to obtain the sparse soft mixture-of-experts model SoftMoe-SM-BERT. SoftMoe-SM-BERT is used to identify the citation intent of the literature to be retrieved.
[0010] Based on the extracted core content and the citation intent, the user is guided through the literature.
[0011] Preferably, after using the YOLO++ algorithm to extract the literature structure of the literature to be retrieved and decomposing the literature to be retrieved into multiple structural units of different granularities according to chapters, the method further includes extracting the text, images and tables of each structural unit as leaf nodes, constructing a hierarchical structure of the literature text based on the extracted leaf nodes, classifying the subheadings of each structural unit according to the hierarchical structure, and replacing the subheadings that do not clearly reflect the structural unit.
[0012] Preferably, using SoftMoe-SM-BERT to identify the citation intent of the literature to be retrieved specifically includes the following steps: the input sample sequence of the literature to be retrieved is converted into multiple token sequences, which are then marked; each token is mapped to its corresponding vector representation in the vocabulary;
[0013] The BERT component of SoftMoe-SM-BERT adds a type embedding to each token and generates a position embedding for each token. The vector representation of each token is summed with its type embedding and position embedding to obtain the final embedding representation of the token.
[0014] The final embedding representation of each token is input into SoftMoe, which captures contextual relationships and semantic representations and dynamically weights and combines them across the different experts. At the same time, the weight matrix of each expert is sparsified, and the citation intent is identified.
[0015] Preferably, each token sequence includes multiple tokens, where a token is the smallest unit processed by SoftMoe-SM-BERT; tokens include words and punctuation marks.
[0016] Preferably, SoftMoe is located between the self-attention mechanism and the feedforward neural network of the BERT encoding layer.
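The embedding sum and the soft expert mixing described in the steps above can be illustrated with a minimal NumPy sketch. It assumes the widely used Soft MoE formulation, in which dispatch weights are a softmax over tokens and combine weights are a softmax over slots. The function names, the tanh expert, and the magnitude-based pruning rule are hypothetical choices made for illustration; this section does not specify the sparsification procedure.

```python
import numpy as np

def softmax(x, axis):
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def embed_tokens(token_ids, type_ids, token_table, type_table, position_table):
    # Final embedding = token vector + type embedding + position embedding.
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + type_table[type_ids] + position_table[positions]

def prune_by_magnitude(w, sparsity):
    # ASSUMPTION: sparsify an expert weight matrix by zeroing its
    # smallest-magnitude entries; the patent's own rule may differ.
    k = int(w.size * sparsity)
    if k == 0:
        return w.copy()
    threshold = np.sort(np.abs(w), axis=None)[k - 1]
    return np.where(np.abs(w) > threshold, w, 0.0)

def soft_moe(x, phi, expert_weights):
    # x: (tokens, d); phi: (d, total_slots); expert_weights: list of (d, d).
    # In the described model this layer would sit between self-attention
    # and the feedforward network of a BERT encoding layer.
    n_experts = len(expert_weights)
    logits = x @ phi
    dispatch = softmax(logits, axis=0)   # normalise over tokens, per slot
    combine = softmax(logits, axis=1)    # normalise over slots, per token
    slots = dispatch.T @ x               # each slot is a weighted mix of tokens
    per_expert = np.split(slots, n_experts, axis=0)
    # Hypothetical expert: one sparse linear map followed by tanh.
    outputs = np.concatenate(
        [np.tanh(s @ w) for s, w in zip(per_expert, expert_weights)], axis=0
    )
    return combine @ outputs             # (tokens, d)
```

Because every token contributes to every slot with a soft weight, no token is dropped by a hard routing decision, which is one plausible reading of "dynamically weights and combines them across the different experts".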
[0017] This invention also provides a document guidance system, comprising:
[0018] The data acquisition module is used to acquire the literature to be retrieved;
[0019] The core content extraction module is used to extract the bibliographic structure of the literature to be retrieved using the YOLO++ algorithm, and decompose the literature to be retrieved into multiple structural units of different granularities according to chapters; divide the multiple different structural units into different categories according to subheadings; add a prefix vector before each attention layer of the encoder of the GLM-6B generative language model to obtain a prefix-tuned PT-GLM-6B model, and use the prefix vector of the PT-GLM-6B model to extract the core content of the paragraphs corresponding to the subheadings of each category;
[0020] The citation intent recognition module is used to take the BERT model as the base model, introduce a sparse matrix SM and a soft mixture-of-experts model SoftMoe into it to obtain SoftMoe-SM-BERT, and use SoftMoe-SM-BERT to identify the citation intent of the literature to be retrieved.
[0021] The guide module is used to guide the user through the literature based on the extracted core content and the citation intent.
[0022] The present invention also provides a computer-readable storage medium storing a computer program adapted for loading by a processor to execute the document guidance method.
[0023] The document guidance method provided by this invention has the following beneficial effects:
[0024] This invention uses the YOLO++ algorithm to extract the structure of a document and decomposes the document to be retrieved into multiple structural units of different granularities according to chapters. These structural units are then categorized according to subheadings, creating a complete hierarchical structure. A prefix-tuned PT-GLM-6B model extracts the core content of the paragraphs corresponding to each category's subheading. Only a small number of prefix parameters need to be adjusted to adapt to different tasks, and the prefix parameters of PT-GLM-6B can be customized for each task category to better match task requirements. This gives PT-GLM-6B stronger cross-task generalization, so that it performs well across task categories and can extract core content under different subheadings. The sparse matrix in the constructed SoftMoe-SM-BERT model accelerates computation, and the introduction of SoftMoe better captures long-distance semantic dependencies, thereby identifying citation intent. By combining the extracted core content and the citation intent of the document to be retrieved, this invention enables users to read and understand document information efficiently.
Attached Figure Description
[0025] To more clearly illustrate the embodiments and design schemes of the present invention, the accompanying drawings required for this embodiment will be briefly described below. The drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0026] Figure 1 is a flowchart of the present invention;
[0027] Figure 2 is a diagram of the PT-GLM-6B architecture;
[0028] Figure 3 is a diagram of the GLM-6B architecture;
[0029] Figure 4 shows the encoding of 2D position information;
[0030] Figure 5 shows the self-attention mask;
[0031] Figure 6 is an annotated example of prefix tuning for an encoder-decoder model;
[0032] Figure 7 shows comparison charts of the results for different categories, where Figure 7(a) is UniLM RG-1, Figure 7(b) is UniLM RG-2, Figure 7(c) is UniLM RG-L, Figure 7(d) is PT-GLM-6B RG-1, Figure 7(e) is PT-GLM-6B RG-2, Figure 7(f) is PT-GLM-6B RG-L, Figure 7(g) is LLaMA RG-1, Figure 7(h) is LLaMA RG-2, and Figure 7(i) is LLaMA RG-L;
[0033] Figure 8 is a comparison chart of extraction results, where Figure 8(a) is RG-1, Figure 8(b) is RG-2, and Figure 8(c) is RG-L;
[0034] Figure 9 shows the ablation comparison experiments for different categories, where Figure 9(a) is GLM-6B RG-1, Figure 9(b) is GLM-6B RG-2, Figure 9(c) is GLM-6B RG-L, Figure 9(d) is PT-GLM-6B RG-1, Figure 9(e) is PT-GLM-6B RG-2, and Figure 9(f) is PT-GLM-6B RG-L;
[0035] Figure 10 shows the comparison experiment on the CAC dataset, where Figure 10(a) is RG-1, Figure 10(b) is RG-2, and Figure 10(c) is RG-L;
[0036] Figure 11 is a diagram of the SoftMoe-SM-BERT model structure;
[0037] Figure 12 shows the matrix sparsification process, where Figure 12(a) is a schematic diagram of filtering the first position, Figure 12(b) is a schematic diagram of filtering the second position, and Figure 12(c) is a schematic diagram of filtering the nth position;
[0038] Figure 13 shows the matrix used to compute self-attention;
[0039] Figure 14 is a diagram of the SoftMoe structure;
[0040] Figure 15 is a distribution diagram of the CCCF citation data;
[0041] Figure 16 shows the model comparison experiment results, where Figure 16(a) is the accuracy, Figure 16(b) is the precision, Figure 16(c) is the recall, and Figure 16(d) is the F1 value;
[0042] Figure 17 is a comparison of the loss of the citation intent classification models;
[0043] Figure 18 is an iteration graph of the multi-label classification metrics, where Figure 18(a) is the accuracy, Figure 18(b) is the precision, Figure 18(c) is the recall, and Figure 18(d) is the F1 value;
[0044] Figure 19 shows the Ren-CECps data distribution;
[0045] Figure 20 shows the comparison results on the Ren-CECps dataset, where Figure 20(a) is the accuracy, Figure 20(b) is Micro F1, and Figure 20(c) is Macro F1;
[0046] Figure 21 shows the ablation experiment results on the CCCF citation dataset, where Figure 21(a) is the accuracy, Figure 21(b) is the precision, Figure 21(c) is the recall, and Figure 21(d) is the F1 value;
[0047] Figure 22 is an iteration graph of the ablation experiment;
[0048] Figure 23 shows the sparsity comparison results, where Figure 23(a) is the accuracy, Figure 23(b) is the precision, Figure 23(c) is the recall, and Figure 23(d) is the F1 value.
Detailed Implementation
[0049] To enable those skilled in the art to better understand and implement the technical solutions of the present invention, the present invention will be described in detail below with reference to the accompanying drawings and specific embodiments. The following embodiments are only used to more clearly illustrate the technical solutions of the present invention and should not be construed as limiting the scope of protection of the present invention.
[0050] Example
[0051] This invention provides a document guidance method which, as shown in Figure 1, includes the following steps:
[0052] Step 1: Obtain the literature to be retrieved.
[0053] Step 2: The YOLO++ algorithm is used to extract the literature structure, and the literature to be retrieved is decomposed into multiple structural units of different granularities according to chapters; the structural units are divided into different categories according to subheadings; a prefix vector is added before each attention layer of the encoder of the GLM-6B generative language model to obtain the PT-GLM-6B model; the prefix-tuned PT-GLM-6B model is used to extract the main content of the paragraphs corresponding to the subheadings of each category.
[0054] This invention primarily introduces a method for extracting the main body of an article based on extracting subheadings and their corresponding text content. Compared to methods that directly extract abstracts, this method can better preserve the main content and semantic structure of the original text. Specifically, the method includes two steps: First, the original document structure is extracted using the YOLO++ algorithm (Document Hierarchy Extraction), and then the subheadings are classified to facilitate subsequent extraction of the main body with a focus, and useless subheadings are replaced with other subtopics to fit the document theme; then, from the paragraph content under each subheading, the sentences that best summarize the information of that paragraph are extracted. These sentences can preserve the semantic information of the original text to the greatest extent and are displayed as the main body structure of the article. This invention uses a general language model (PT-GLM-6B) optimized by prefix-tuning (PT) to achieve the above-mentioned main body extraction task.
[0055] Visual elements such as images and tables in the text are identified and extracted as leaf nodes. Plain text paragraphs are also treated as leaf nodes. Through this division of structural units and extraction of leaf nodes, the algorithm constructs a complete hierarchical structure of the document content. This representation method based on the document structure avoids directly processing entire lengthy article sequences. Instead, it utilizes more granular paragraph-level semantic information, which helps improve the performance of main body extraction. When finally displaying the main body of the document, it can clearly present the hierarchical structure of the original text, making it easier for readers to quickly grasp the main content. When extracting the document structure, it is important to note that not all headings need to be extracted. This invention only extracts important headings that are clearly related to the document's topic as nodes, excluding headings with weak relevance such as related work, introduction, summary, and methods, because these headings do not truly reflect the core information of the document. To address this issue, this invention classifies headings based on rules and sets differentiated extraction tendencies according to different categories of headings. Specifically, this invention categorizes the introduction as background information and extracts sentences describing existing problems and the solutions presented in this paper. These two terms are then used as nodes to replace the original introduction title. This not only avoids the appearance of irrelevant nodes, making the final document hierarchy more closely aligned with the document content, but also simplifies the task of extracting the main body of the document, allowing for targeted extraction of key information from different categories of titles.
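The hierarchy with leaf nodes and the rule-based heading classification described above can be sketched as follows. This is a minimal illustration only: the keyword rules, the `Node` class, and the replacement titles are hypothetical, since the description states that headings are classified "based on rules" without listing them, and the layout detector that would supply the blocks is not modelled.

```python
from dataclasses import dataclass, field
from typing import List

# ASSUMPTION: hypothetical keyword rules standing in for the unspecified rules.
CATEGORY_RULES = {
    "background": ("introduction", "background", "related work"),
    "technical framework": ("method", "model", "architecture", "framework"),
    "outlook": ("future", "outlook"),
}

# The introduction heading is replaced by two nodes, as the text describes.
REPLACEMENT_TITLES = {"background": ["Existing problems", "Proposed solution"]}

def classify_heading(title: str) -> str:
    lowered = title.lower()
    for category, keywords in CATEGORY_RULES.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return "other"

@dataclass
class Node:
    kind: str                      # "section", "text", "image" or "table"
    title: str = ""
    content: str = ""
    category: str = ""
    children: List["Node"] = field(default_factory=list)

    def leaves(self) -> List["Node"]:
        # Text, image and table blocks are the leaf nodes of the hierarchy.
        if self.kind != "section":
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]

def build_section(title, blocks):
    # blocks: (kind, content) pairs, e.g. as produced by a layout detector.
    node = Node(kind="section", title=title, category=classify_heading(title))
    node.children = [Node(kind=kind, content=content) for kind, content in blocks]
    return node

def display_titles(section: Node) -> List[str]:
    # Headings that do not reflect the content are swapped for topic nodes.
    return REPLACEMENT_TITLES.get(section.category, [section.title])
```

Keeping the category on each section node lets a later extraction step pick the matching extraction tendency without re-reading the heading.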
[0056] In autoregressive models, the input text is segmented into fixed-length sequences for training. However, for document-level long text tasks, using fixed-length input sequences makes it difficult to capture the overall semantic structure and contextual information of the text, which may affect the model's performance on long text generation tasks. Therefore, it is necessary to segment the document into different parts according to subheadings, retaining the complete paragraphs of each part to shorten the text length. After segmenting the document structure, the question categories can be further divided according to subheading categories. The GLM-6B generative language model is then used to extract the core of each paragraph. GLM-6B is pre-trained with autoregressive blank infilling and introduces two-dimensional positional encoding to better capture the dependencies between different positions in the text. The PT-GLM-6B architecture is shown in Figure 2.
[0057] PT-GLM-6B also improves the accuracy and coherence of generation by allowing arbitrary prediction order, thus better modeling the contextual relationships in the long texts of subheading paragraphs. Traditional fine-tuning requires modifying and storing all parameters of the entire pre-trained language model, resulting in high storage costs and memory burdens. Furthermore, fine-tuning alters the overall parameter distribution of the language model, disrupting original knowledge and easily leading to overfitting to specific tasks. To address the issues of high fine-tuning costs and overfitting, this invention introduces prefix tuning. Compared to traditional fine-tuning methods, prefix tuning only requires optimizing a small number of prefix parameters, significantly reducing storage and computational overhead. Moreover, prefix tuning performs better when extrapolating to examples of different topics. The question-answering task of this invention relies on this generalization ability to handle different questions.
[0058] This invention addresses the problem of extracting the core sentences of paragraphs under subheadings. Because this text extraction has characteristics of both Natural Language Understanding (NLU) and Natural Language Generation (NLG), it can be considered a typical comprehensive NLP task: it requires understanding the semantic information of the original text while generating the core sentences. This invention uses the Transformer-based pre-trained model GLM to solve this problem because such models possess both semantic understanding and text generation capabilities. The architecture is shown in Figure 3. During the inference phase, the model mainly consists of two loops. The first loop runs continuously, generating one new token in each iteration, and terminates when the model generates the specific end marker <eos>. The second loop is a fixed-count loop in which the GLMBlock layers run sequentially. In each pass, the most likely token ID is determined from the resulting scores and used to generate the next token.
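The two inference loops can be sketched with a toy greedy decoder. This is an illustration under stated assumptions rather than the GLM implementation: `EOS_ID`, `run_blocks`, `greedy_generate`, the embedding table, and the output projection are hypothetical stand-ins, and each block is simply a callable on the hidden states.

```python
import numpy as np

EOS_ID = 0  # ASSUMPTION: hypothetical id of the <eos> end marker

def run_blocks(hidden, blocks):
    # Inner loop: a fixed number of blocks applied one after another.
    for block in blocks:
        hidden = block(hidden)
    return hidden

def greedy_generate(prompt_ids, embed, blocks, lm_head, max_new_tokens=20):
    ids = list(prompt_ids)
    # Outer loop: one new token per pass, until <eos> or the length limit.
    for _ in range(max_new_tokens):
        hidden = run_blocks(embed[ids], blocks)
        logits = hidden[-1] @ lm_head      # scores for the next token
        next_id = int(np.argmax(logits))   # most likely token id
        ids.append(next_id)
        if next_id == EOS_ID:
            break
    return ids
```

The `max_new_tokens` limit is an added safeguard so that the outer loop also terminates when the end marker is never produced.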
[0059] (1) Pre-training objectives.
[0060] In the GLM pre-training framework, autoregressive blank infilling is used as the pre-training objective. Specifically, the blank infilling task is implemented using the following methods: input text composition; 2D positional encoding; and the model's attention mask matrix.
[0061] Input text composition: Autoregressive blank infilling randomly selects consecutive text spans from the input text and replaces each of them with a special "[MASK]" marker to form a corrupted text. The model needs to predict these replaced text spans in an autoregressive manner based on the corrupted text sequence. When predicting each span, the model can access the corrupted text as well as the previously predicted spans, so that it fully captures the interdependencies between different spans.
[0062] 2D position encoding: As shown in Figure 4, Position 1 and Position 2 form a two-dimensional position encoding for the tokens of the input text. The first records the position in Part A (the corrupted text); for a token in Part B, it is the position of the corresponding [MASK]. The second records the relative position within the masked text span in Part B. All tokens in Part A have a second position of 0, indicating that they do not belong to any masked span. Tokens in Part B have second positions ranging from 1 to the span length, representing their relative positions within the masked span. The advantage of this 2D positional encoding is that tokens in Part B do not need to know the length of the masked span; the model can perform autoregressive generation directly from the relative position information, without the number of [MASK] markers having to be set manually as in BERT. 2D encoding better captures the position of a token both within the entire sequence and within its local span, which is very helpful for autoregressive generation.
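The input composition and the two position sequences can be made concrete with a short sketch. The function name `build_glm_input` is hypothetical, and one simplification is assumed: the masked spans are appended to Part B in their original order, whereas GLM pre-training shuffles the span order.

```python
def build_glm_input(tokens, spans):
    """Build Part A + Part B and the two position id sequences.

    tokens: list of str; spans: sorted, non-overlapping (start, end) index
    pairs with end exclusive, marking the text spans to mask.
    """
    part_a, mask_positions = [], []
    cursor = 0
    for start, end in spans:
        part_a.extend(tokens[cursor:start])
        part_a.append("[MASK]")
        mask_positions.append(len(part_a))   # 1-based position of this [MASK]
        cursor = end
    part_a.extend(tokens[cursor:])

    position_1 = list(range(1, len(part_a) + 1))  # position in corrupted text
    position_2 = [0] * len(part_a)                # Part A tokens: always 0
    part_b = []
    for (start, end), mask_position in zip(spans, mask_positions):
        span = ["[START]"] + tokens[start:end]
        part_b.extend(span)
        position_1.extend([mask_position] * len(span))  # points at its [MASK]
        position_2.extend(range(1, len(span) + 1))      # 1..span length
    return part_a + part_b, position_1, position_2
```

For a six-token text in which the second and third tokens and the sixth token are masked, Part A holds five tokens including two [MASK] markers, and every Part B token reuses the first position of the marker it fills in.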
[0063] The model's attention mask matrix: from the first [MASK], x2 and x3 are decoded in sequence; from the second [MASK], x6 is decoded. Here, x2, x3, and x6 are text tokens, and 2, 3, and 6 are their indices. To decode a variable-length sequence from a single [MASK], GLM uses a special [START] marker and an end marker: the model autoregressively generates the masked text starting from [START] until the end marker is generated, and can thus adapt to masked spans of arbitrary length. Furthermore, GLM uses a custom attention mask that unifies the bidirectional encoder and the unidirectional decoder.
[0064] As shown in Figure 5, a custom self-attention mask is needed to achieve the following: words in the bidirectional encoder Part A are visible to each other (the area within the green box in the diagram); words in the unidirectional decoder Part B are unidirectionally visible (the area within the yellow box in the diagram); Part B can see Part A, and everything else is masked (the gray area in the diagram). In the task of generating sentence sets from text, the input text is divided into two parts, Part A and Part B. Part A constitutes the model's input, with a mask symbol added at the end. Part A words are visible to each other, which is equivalent to the encoder (BERT) and focuses on information extraction. The model generates the text content of Part B autoregressively. Part B can only see what precedes it, which is equivalent to the decoder (GPT) and focuses on generation. Combined, the two parts generate text conditioned on the extracted information. This achieves understanding of the original text in the dataset and generation of sentence sets based on that understanding. In unconditional generation, the model only needs an initial context to generate the subsequent text. For example, if the input is "representative is Google", a mask marker is appended to the end of the input and the GLM autoregressive model generates the next sentence. In conditional generation tasks, the model can be fine-tuned to generate text for specific conditions, such as generating an answer to a given question or generating the main body of an article from a specific context.
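The visibility rules of this mask can be written down directly. The sketch below builds a boolean matrix in which entry (i, j) states whether query position i may attend to key position j; the function name `glm_attention_mask` is hypothetical, and Part A is assumed to occupy the first `len_a` positions of the sequence.

```python
import numpy as np

def glm_attention_mask(len_a, len_b):
    """Return a (len_a + len_b) square boolean mask; True means visible."""
    n = len_a + len_b
    mask = np.zeros((n, n), dtype=bool)
    # Every position, in Part A or Part B, can see all of Part A.
    mask[:, :len_a] = True
    # Within Part B, a position sees only itself and earlier Part B positions.
    mask[len_a:, len_a:] = np.tril(np.ones((len_b, len_b), dtype=bool))
    # Part A rows keep False in the Part B columns: A never sees B.
    return mask
```

In an attention implementation, the positions where the mask is False would typically receive a large negative score before the softmax so that they get zero weight.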
[0065] (2) Model fine-tuning.
[0066] Models without prompts suffer from several key drawbacks during training and application, such as a lack of targeted guidance and an inability to provide clear generation goals; a lack of overall purpose and logic in the generated content, making it difficult to achieve human-level coherence; and the need to completely fine-tune a model copy for each task, which is detrimental to multi-tasking and increases deployment and storage costs. This invention proposes a lightweight fine-tuning method called Prefix-Tuning. Prefix-Tuning introduces trainable prefix vectors, providing clear task guidance to the language model, resulting in more cohesive generated content with good modularity and space efficiency, thus overcoming the limitations of traditional fine-tuning methods. Furthermore, Prefix-Tuning differs from Prompt-Tuning in the location of prompt injection. Prefix-Tuning injects prompts into the input of every attention layer, while Prompt-Tuning only injects them at the word embedding layer. Therefore, Prefix-Tuning can control the model's behavior with finer granularity because it provides prompts for each attention layer, better capturing task-related information and providing stronger expressive power. Therefore, this invention chooses Prefix-Tuning for fine-tuning optimization and improvement.
[0067] Differentiated prefix tuning: For inputs containing subheadings and paragraphs of different categories, prefix vectors can steer the language model's extraction focus towards the extraction target of the paragraph's category, making the extraction more targeted. For example, for background paragraphs it is necessary to extract the problems of current methods, whereas this extraction mode is not suitable for outlook paragraphs, so paragraphs of different categories require differentiated extraction. This invention targets the task of extracting the main body of the paragraphs under subheadings, divides the subheadings into five categories according to their nature, and uses five corresponding prefix tunings. Before the extraction task, the prefix-tuning type is selected so that the extraction tendency suits the subheading. Apart from this selection, every prefix is inserted in the same way, as shown in Figure 6.
[0068] The prefix is inserted into the model's input and interacts with the original input through the self-attention mechanism, thus influencing the final generated output. Compared with directly fine-tuning the pre-trained model, tuning only the prefix significantly reduces the number of trainable parameters while achieving similar or even better performance. In the Transformer encoder of GLM, a learnable prefix vector is inserted before the original input. Figure 6 shows an example of introducing prefix tuning in the encoder and decoder: the left side covers prefix adjustment and input processing in the encoder, while the right side covers prefix update and output generation in the decoder; together they reflect the phased optimization process of the model.
[0069] The principle of prefix tuning is that providing appropriate contextual information can guide a language model to complete a task without modifying its parameters. For example, if you want the language model to generate a word (e.g., "Yao Ming"), you can add common collocations of that word (e.g., "basketball star") as context, making the language model more likely to generate the desired word. Extending this beyond generating single words or sentences, the goal is to find a context that guides the language model to solve natural language generation tasks. Intuitively, this context can influence the encoding of input x by guiding which information to extract, and it can also influence the generation of output y by guiding the distribution of the next generated token. Natural language task instructions (e.g., "Summarize this table in one sentence") might guide an expert to complete the task, but they fail for most pre-trained language models.
[0070] Optimizing the instructions as continuous word embeddings lets their effect propagate through all activation layers of the GLM and on to the generation of subsequent tokens. This is more expressive than discrete prompts, which must match the embeddings of real words. At the same time, it is less expressive than intervening on the activations of all layers, which avoids long-range dependencies and offers more tunable parameters. Therefore, prefix tuning optimizes the prefix parameters at every layer.
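How a category-specific prefix enters an attention layer can be sketched in NumPy. This is a single-head, single-layer illustration under assumptions: the prefix is modelled as extra key and value vectors placed before those of the input, the projection matrices stand for frozen pre-trained weights, and `make_prefixes` and `attention_with_prefix` are hypothetical names. The five category labels follow the content types named in this description.

```python
import numpy as np

CATEGORIES = ("background", "technical framework", "outlook",
              "parallel structure", "other")

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def make_prefixes(categories, prefix_len, d, rng):
    # One trainable (key, value) prefix pair per subheading category.
    return {
        category: (rng.normal(size=(prefix_len, d)),
                   rng.normal(size=(prefix_len, d)))
        for category in categories
    }

def attention_with_prefix(x, wq, wk, wv, prefix_k, prefix_v):
    # wq, wk, wv are frozen; only prefix_k and prefix_v would be trained.
    q = x @ wq
    k = np.concatenate([prefix_k, x @ wk], axis=0)
    v = np.concatenate([prefix_v, x @ wv], axis=0)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v
```

Switching the extraction tendency then amounts to selecting a different entry of the prefix dictionary while the frozen weights stay shared, which is why only a small number of parameters has to be stored per category.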
[0071] The PT-GLM-6B of this invention extracts content, according to prescribed rules, from the paragraphs delimited by the original headings. The extracted content covers the background, technical framework, outlook, parallel structures, and other key content, giving a deeper understanding of the core research information. A total of 4096 data entries were extracted and corrected. The training set, test set, and validation set of this invention consist of 3276, 409, and 409 data points, respectively.
[0072] As shown in Table 1, the GPT model successfully extracted original sentences with negative sentiment and those describing the main content of the solution from the background section. This helps to comprehensively understand the research motivation and key elements of the solution. In the technical framework section, original sentences describing the most significant technological innovations and implementation methods were extracted, along with sentences identifying the shortcomings of the technology and solutions. In the outlook section, original sentences outlining the future of the technology were extracted. This is crucial for understanding the long-term impact of the research and its potential development directions. Furthermore, original sentences containing parallel terms were extracted to capture the parallel relationships between various aspects of the research. For other key information, GPT was used to extract original sentences from the text related to original methods and technologies, representing unique highlights and innovations in the research. Through the GPT-assisted content extraction process, this invention successfully delved into the various paragraphs divided in the research title, extracting key information. This information not only helps this invention understand the core of the research but also provides rich material for further analysis and research. GPT's intelligent analysis provides an efficient and accurate tool for the information extraction process, making the entire research more in-depth and comprehensive.
[0073] Table 1 Title Category Settings
[0074]
[0075] Evaluation metrics: RG-1, RG-2, and RG-L are different variations of the ROUGE evaluation metrics used to assess the performance of automatically extracted sentences. They measure the overlap between the extracted sentence and the reference sentence.
[0076] ROUGE-1: Measures the overlap between words (unigrams) in the extracted sentence and words in the reference sentence set.
[0077] ROUGE-2: Measures the overlap between two consecutive words (bigrams) in an extracted sentence and two consecutive words in a reference sentence set.
[0078] ROUGE-L: measures the length of the longest common subsequence between the extracted sentence and the reference sentence, where the subsequence can be discontinuous, representing their semantic similarity.
[0079] These ROUGE metrics range from 0 to 1, where 1 indicates that the extracted sentence and the reference sentence are completely identical in terms of their respective overlap rates. These metrics are commonly used in automated evaluation to measure the performance of sentence set generation systems. Using multiple ROUGE metrics simultaneously can provide a more comprehensive assessment of different aspects of the overlap between the extracted and reference sentences.
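The three metrics above can be illustrated with a minimal sketch. It assumes whitespace tokenization and reports the F1 variant of each score; the text does not state whether recall, precision, or F1 was used, so that choice and all function names are assumptions made for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    """F1 score from an overlap count and the candidate/reference totals."""
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision = overlap / cand_total
    recall = overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate, reference, n):
    """ROUGE-N (F1): clipped n-gram overlap between candidate and reference."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common (possibly discontinuous) subsequence."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """ROUGE-L (F1) based on the longest common subsequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


# Hypothetical sentences, used only to exercise the functions.
reference = "dram has limited scalability and high refresh energy".split()
candidate = "dram has high refresh energy".split()
rg1 = rouge_n(candidate, reference, 1)
rg2 = rouge_n(candidate, reference, 2)
rgl = rouge_l(candidate, reference)
```

In this example every candidate word appears in the reference in the same order, so RG-1 and RG-L coincide, while RG-2 is lower because only three of the four candidate bigrams occur in the reference.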
[0080] Experimental environment: The specific details of the hardware and software used in the experiment are shown in Table 2.
[0081] Table 2 Hardware and Software Information
[0082]
Software and hardware | Property
Operating system | Windows 11
CPU model | Intel(R) Core(TM) i5-11700 @ 3.00GHz
GPU model | GeForce RTX 3060 12G
Memory | 16G
Python | 3.8.10
[0083] Experimental Design: The effectiveness of the PT-GLM-6B model will be verified through experiments. These experiments aim to demonstrate the following:
[0084] Experiment 1: Model comparison experiment, comparing the model with other large models to explore the optimization effect.
[0085] Experiment 2: An ablation experiment was conducted to investigate the impact of the Prefix-Tuning module in this invention on the overall model performance.
[0086] Experiment 3: The effect of different prefix lengths on the model.
[0087] Experiment 4: Compare the effectiveness of the model on the public Chinese dataset CAC (Chinese Abstractive Corpus).
[0088] Experiment 5: Analyze the predictive performance of different models through case studies.
[0089] To verify the effectiveness of the proposed PT-GLM-6B model, this invention compares it with the classic large language model UniLM and the latest LLaMA model. UniLM (Unified Language Model) is a large language model based on the Transformer architecture that uses a bidirectional Transformer encoder-decoder structure. LLaMA is a large language model based on the Transformer architecture that incorporates improvements such as pre-normalization, the SwiGLU activation function, and rotary position embedding (RoPE).
[0090] Experiment 1: Model Comparison Experiment.
[0091] Figure 7 compares the extraction performance across categories: Figures 7(a) to 7(c) show RG-1, RG-2, and RG-L for UniLM; Figures 7(d) to 7(f) show RG-1, RG-2, and RG-L for PT-GLM-6B; and Figures 7(g) to 7(i) show RG-1, RG-2, and RG-L for LLaMA. From a category perspective, the background category scores relatively high for all models, thanks to its relatively clear extraction rule: sentences with negative-impact wording are extracted as shortcomings, together with the related solutions. This category contains less complex information than the semantically complex technical category, which requires extracting more relevant core information. Furthermore, compared with other categories, its extraction requirements are relatively clear, resulting in higher RG-1, RG-2, and RG-L values. From a trend perspective, the per-category indicators of UniLM and LLaMA show similar patterns, indicating no significant difference between the two models across data categories. In contrast, PT-GLM-6B behaves differently in each category, with significant differences between indicators. This suggests that prefix tuning gives the model a clearly different extraction tendency for each subheading category, confirming the effectiveness of category-specific prefixes. Overall, both LLaMA and PT-GLM-6B score higher than UniLM on RG-1, with little difference between them. On RG-2, LLaMA scores higher than PT-GLM-6B, while UniLM scores lowest. PT-GLM-6B outperforms LLaMA on RG-L, indicating that this targeted fine-tuning approach is more beneficial for the backbone extraction task. Furthermore, the GLM model possesses both semantic understanding and text generation capabilities, making it more suitable for the tasks in this chapter.
[0092] Figure 8 compares the overall extraction results: Figure 8(a) shows RG-1, Figure 8(b) shows RG-2, and Figure 8(c) shows RG-L. Averaged over all categories, UniLM performs worst on RG-1, RG-2, and RG-L. PT-GLM-6B has 6 billion parameters, while UniLM has fewer. With its larger number of parameters, PT-GLM-6B captures key information in the text extraction task better than UniLM. UniLM requires different attention masks to switch between pre-training objectives, while GLM uses a single autoregressive Transformer network. GLM combines Natural Language Understanding (NLU) and Natural Language Generation (NLG) tasks by using an encoder and decoder to process different pre-training objectives. This structure allows GLM to adapt better to sentence extraction tasks, and thus to outperform UniLM. The LLaMA model is slightly inferior to the PT-GLM-6B model on RG-1 and RG-L. LLaMA is a collection of foundation language models. In contrast, the autoregressive pre-training objective of PT-GLM-6B better captures semantic and contextual information and shares parameters effectively across tasks, whereas LLaMA cannot fully exploit parameter sharing across tasks. Therefore, PT-GLM-6B outperforms LLaMA in word overlap and in longest common subsequence overlap. LLaMA slightly outperforms PT-GLM-6B on RG-2. LLaMA has more parameters than PT-GLM-6B, which gives it an advantage in the overlap of consecutive words, but the semantic connections between its individual words are weaker. This indicates only that its advantage on RG-2 comes from the accumulation of more model parameters.
[0093] Experiment 2: Ablation Experiment.
[0094] Figure 9 shows the ablation comparison across categories: Figures 9(a) to 9(c) show RG-1, RG-2, and RG-L for GLM-6B, and Figures 9(d) to 9(f) show RG-1, RG-2, and RG-L for PT-GLM-6B. The background category is extracted relatively well, which is related to its precise extraction rule, namely extracting negative wording as the shortcomings described in the literature, so that the model clearly understands the intent. The technical framework category, however, performs poorly on all metrics. The text in this category contains a high proportion of technical terms with relatively complex semantics, leading to inferior performance compared to other categories. In the bar charts, the trend for GLM-6B is relatively flat, indicating that its extraction results lack specificity for different categories. In contrast, PT-GLM-6B introduces prefix optimization for each category, which produces category-specific extraction behavior, and its performance is mostly better than that of GLM-6B without prefix optimization. This further demonstrates that prefix optimization is beneficial for extracting subheadings of different categories.
[0095] As shown in Table 3, it can be seen that GLM-6B is slightly worse than PT-GLM-6B after averaging all indicators.
[0096] Table 3 GLM Ablation Experiment
[0097]
[0098] From the perspective of prefix tuning advantages, PT-GLM-6B employs a prefix tuning method that requires only minor adjustments to a small number of prefix parameters to adapt to different tasks. This method allows PT-GLM-6B to customize parameter adjustments for different task categories, thus better matching task requirements. This gives PT-GLM-6B stronger cross-task generalization capabilities, achieving excellent performance across various task categories. In terms of the backbone extraction task, PT-GLM-6B, by tuning consecutive prefixes, helps GLM better understand the input text and extract more accurate backbones. The tuning process can be performed by optimizing the embedding vectors of consecutive prompts to better represent task-related features. In this way, GLM can extract a more accurate and relevant set of sentences based on these optimized prompts.
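The idea of tuning only a small set of continuous prefix vectors can be sketched as follows. This is an illustrative single-head attention layer, not the PT-GLM-6B implementation: the projection matrices stand in for frozen pre-trained weights, the prefix key and value vectors are the only parameters that would be trained, and all sizes and names are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def prefix_attention(x, w_q, w_k, w_v, prefix_k, prefix_v):
    """Single-head attention whose keys and values are extended by a prefix.

    x: (n, d) token representations; w_q, w_k, w_v: frozen (d, d) projections;
    prefix_k, prefix_v: (p, d) prefix vectors, the only tuned parameters.
    """
    q = x @ w_q
    k = np.concatenate([prefix_k, x @ w_k], axis=0)    # (p + n, d)
    v = np.concatenate([prefix_v, x @ w_v], axis=0)    # (p + n, d)
    weights = softmax(q @ k.T / np.sqrt(x.shape[1]))   # (n, p + n)
    return weights @ v, weights


rng = np.random.default_rng(0)
n, d, p = 5, 8, 3                       # hypothetical sizes for illustration
x = rng.normal(size=(n, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

# One prefix per paragraph category, so each category can steer extraction differently.
prefixes = {name: (rng.normal(size=(p, d)), rng.normal(size=(p, d)))
            for name in ("background", "technical framework", "outlook")}

out, weights = prefix_attention(x, w_q, w_k, w_v, *prefixes["background"])
```

Every token attends to the prefix positions as well as to the real tokens, so switching the prefix changes the output while the pre-trained projections stay untouched.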
[0099] Experiment 3: Comparison of different prefix lengths.
[0100] To verify the impact of prefix length on task performance, the average RG value over all categories is used as the result for each prefix length. As shown in Table 4, a prefix length of 110 yields the best results in this task, and performance decreases at 128. A longer prefix means more trainable parameters and thus greater expressiveness, so performance improves as the prefix length increases up to the threshold of 110 and then decreases slightly. The reason is that a longer prefix provides richer contextual information and better guides the model toward relevant outputs, whereas a prefix that is too short may not fully convey the task requirements. However, a prefix that is too long is overly flexible relative to the overall model parameters and may become overly reliant on the training data, making it difficult to generalize to other categories.
[0101] Table 4 Prefix Length Experiment
[0102]
Prefix length | RG-1 (%) | RG-2 (%) | RG-L (%)
64 | 30.2 | 12.2 | 28.1
110 | 37.9 | 17.7 | 35.8
128 | 35.4 | 15.6 | 34.2
[0103] Experiment 4: Comparison experiment with CAC dataset.
[0104] Most existing Chinese datasets are generative summarization datasets, so this experiment uses the Chinese Abstractive Corpus (CAC), which collects 24,526 historical articles from mainstream vertical media in the education and training industry. It was compiled primarily for training a summarization model, and each entry has two fields: summary and text. All fields are annotated with author information.
[0105] This dataset is similar to the dataset used in this invention and includes manually corrected data. To obtain a meaningful performance comparison on a public dataset, prefix tuning is applied directly on this dataset, without dividing it into categories.
[0106] Figure 10 shows the comparison experiment on the CAC dataset: Figure 10(a) shows RG-1, Figure 10(b) shows RG-2, and Figure 10(c) shows RG-L. PT-GLM-6B clearly outperforms the other two models on RG-1 and RG-L, indicating higher overlap in individual words and higher semantic and syntactic similarity. Prefixes effectively capture task-related information and adapt quickly to specific tasks without compromising pre-training capabilities. GLM's bidirectional attention mechanism lets the model understand the text context more comprehensively, thereby extracting higher-quality sentences. This combination allows PT-GLM-6B to achieve the best performance in both word overlap (RG-1) and semantic similarity (RG-L). The LLaMA model has a higher phrase overlap rate (RG-2), thanks to its pre-normalization design, which normalizes the input of each Transformer sublayer instead of the output. This mechanism learns the semantic structure of the input text better, and thus captures key phrases better when generating sentence sets. This is the most significant difference between GLM-6B and other optimization methods.
[0107] Experiment 5: Case Analysis and Comparison Experiment, analyzing typical cases.
[0108] Paragraph: Traditional Dynamic Random Access Memory (DRAM) has served as the main memory of computer systems for decades. However, due to the limited scalability and high refresh energy consumption of DRAM, it is difficult to meet the needs of future computer systems. Therefore, new types of non-volatile memory, such as Phase-Change Memory (PCM) and Resistive Random Access Memory (ReRAM), have been developed. Due to their high scalability and low energy consumption, they have been proposed as the next-generation main memory storage medium to replace or supplement DRAM. The non-volatility of NVM also allows data to be persistently stored in main memory, enabling instantaneous system fault recovery. The byte-addressable nature of NVM, along with access latency close to that of DRAM, allows NVM to be directly connected to the memory bus and accessed via CPU read / write instructions, avoiding the overhead of traditional block-based interfaces.
[0109] To ensure the random validity of this experiment, random sampling was performed from all datasets. Table 5 shows that, in the case study, UniLM's sampling results, due to the lack of prefix tuning, missed extracting the necessary negative sentiment sentences. These sentences indicated deficiencies in the background paragraphs, and other sentences that needed extraction were not actually included. In the LLaMA model, the sampling results were closer to the correct results, indicating effective fine-tuning and benefiting from the model's strong understanding capabilities. However, the extracted sentences for negative sentiment were incomplete, omitting other sentences with negative sentiment and lacking sentences describing the specific content of the solution. This is because prefix tuning was not introduced during fine-tuning.
[0110] Table 5. Case Study Comparison Experiment Table
[0111]
[0112] The initial stage lost its extraction bias for the corresponding paragraph categories, and relying solely on manual prompts to change the extraction bias was not ideal. GLM-6B performed well in extracting negative sentiment sentences and extracted more complete content than LLaMA. However, it extracted the solution sentence incorrectly, and because prefix tuning was absent, its extraction bias was unclear, indicating that the bias was still limited by manual prompts. With PT-GLM-6B, the model extracted relatively complete content and structure, demonstrating the effectiveness of prefix tuning for extraction bias. Moreover, the extraction performance was slightly improved compared to the original model, refining the extraction results.
[0113] This invention primarily introduces the task of extracting the main body of an article, along with related models and methods. Traditional main body extraction methods mainly employ sentence extraction, but these are not suitable for the task described in this invention. Therefore, a Yolo++ method for extracting the main body of an article based on subheading structure and paragraphs is proposed. A prefix-optimized PT-GLM-6B model is also presented to solve the main body extraction task. In the PT-GLM-6B model, autoregressive blank filling is used as the pre-training objective, and the model architecture and fine-tuning process are described. In the experimental and results analysis section, comparative experiments and ablation experiments were conducted to evaluate the model's performance. The results show that the PT-GLM-6B model performs excellently on the key original sentence extraction task, and its performance is further improved after prefix optimization.
[0114] Step 3: Based on the BERT model, introduce the sparse matrix SM and the soft hybrid expert model Soft Moe into the BERT model to obtain the sparse soft hybrid expert model Soft Moe-SM-BERT. Use Soft Moe-SM-BERT to identify citation intent.
[0115] In the task of citation intent recognition, citation intent is often hidden deep within the text, making it difficult to establish a unified standard and feature system. This invention uses Multi-Label Text Classification (MLTC) to address the citation intent problem and establish a clear label range. This invention proposes the SoftMoe-SM-BERT model to identify citation intent from extracted citation data. To address the issue that increased training time can lead to a decline in contextual understanding, this invention optimizes training speed and accelerates convergence by adding sparsity to BERT; furthermore, it introduces SoftMoe to perform weighted averaging of tokens, better capturing long-distance semantic dependencies. Experiments on multi-label citation intent classification and control experiments are conducted, and the performance is compared with other classic and state-of-the-art models, demonstrating the superior performance of this model.
[0116] The citation intent recognition task is defined as follows. There exists a sample space $X \in \mathbb{R}^{d}$, where $d$ is the word vector dimension and $\mathbb{R}$ is the real number field. There exist $n$ sample sequences, and the set of these sequences is denoted as $X$, with $x_i \in X$ and $y_i \in Y$; $N$ is the total number of sentences with citation tags. Citations involve multiple intents or topics, so intents are modeled as multiple labels, transforming the citation intent recognition task into a multi-label classification task. Let the label set be $T = \{t_1, t_2, \dots, t_h\}$, where $h$ represents the total number of labels. Assume the label space contains $m$ labeled samples: $S = \{(s_1, s_2, \dots, s_m)\}$, $s_i \in T$. The meaning of each set is shown in Table 6.
[0117] Table 6 Set Description
[0118]
Set | Description
X | The sequence set: a collection of text sequences.
T | The label set: the collection of labels defined for a given problem or task.
S | The label space: an abstract space describing the possible labels in a problem.
Y | The result set: the set of correct labels for X.
[0119] The goal of this task is to train a prediction model, which requires $s_i$ to converge toward the real label $y_i$ during training in order to produce the prediction results. The model corresponds to a function $f(x, y)$, which can be interpreted as the confidence that $y \in Y$ is the correct prediction for $x$. This confidence can be transformed into a ranking $\mathrm{rank}(x, y)$, where $x$ is a sentence with a citation tag and $y$ is an intent tag; a higher score corresponds to a higher ranking.
[0120] This invention primarily addresses the task of multi-label text classification, dealing with long, semantically rich text. Considering contextual understanding and semantic learning capabilities, this invention uses BERT as the base model. However, the BERT model requires significant computational resources and training time when processing semantically rich long text data. Furthermore, excessively long texts may contain noise or irrelevant information, making it impossible to capture all contextual information, thus reducing the BERT model's ability to understand the text.
[0121] To this end, this invention proposes the SoftMoe-SM-BERT model, whose architecture is shown in Figure 11. To address low training efficiency, a technique based on sparse matrices (SM) is used. It leverages the sparsity of neural network activation patterns to improve computational efficiency and speed during training without sacrificing, and possibly even improving, model performance. To address contextual semantic understanding, this invention proposes a fully trainable hybrid expert model, namely a soft hybrid expert model. In SoftMoe (Soft Mixture of Experts), weighted averages of all tokens are calculated, and these weighted combinations are distributed to the experts. This soft allocation lets each expert process a subset of the weighted token combinations according to the relevance or importance of the context. By considering multiple weighted combinations of tokens, SoftMoe can capture richer contextual information, thereby better modeling citation context relationships.
[0122] Figure 11 shows the basic architecture of the Soft Moe-SM-BERT model. The architecture consists of three main modules: a vector input transformation module, an SM-BERT module, and an SM-Soft Moe module.
[0123] Vector Input Module: The input is a sample sequence, such as the sentence sample "represents Google" from the CCF paper dataset. The sentence is broken down into a sequence of individual tokens, which are the smallest units processed by Soft Moe-SM-BERT. These tokens may be words, punctuation marks, etc., and each token corresponds to an index in the vocabulary. After tokenization, a special [CLS] marker is added to the beginning of the sentence to indicate the start of the classification task, and a special [SEP] marker is added to the end of the sentence to indicate its end.
[0124] Based on the content of the citation sentence, each token is mapped to its corresponding vector representation in the vocabulary, i.e., its word embedding. For sentence pair tasks (such as text classification, sentence relationship judgment, etc.), BERT adds a segment embedding to each token to distinguish between sentences. To encode the positional information of each token, BERT generates a position embedding for each token. The dimension of the position embedding is the same as that of the word embedding, but each position corresponds to a different position code. The position code of each word is calculated from the dimension of the embedding vector using formulas (1) and (2), where $pos$ is the position of the word; $i$ is the dimension index of the word embedding vector; $d_{model}$ is the dimension of the word embedding vector; $PE(pos, 2i)$ is the positional encoding at even-indexed dimensions; and $PE(pos, 2i+1)$ is the positional encoding at odd-indexed dimensions.
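[0125]
$$PE(pos, 2i) = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (1)$$
$$PE(pos, 2i+1) = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right) \quad (2)$$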
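The position code described above can be sketched as follows, assuming the standard sinusoidal form with base 10000 and an even embedding dimension. The function name and the sizes are illustrative.

```python
import numpy as np


def positional_encoding(max_len, d_model):
    """Sinusoidal position codes: sine on even dimensions, cosine on odd ones.

    Assumes d_model is even.
    """
    pos = np.arange(max_len)[:, None]          # (max_len, 1) token positions
    i = np.arange(d_model // 2)[None, :]       # (1, d_model / 2) dimension-pair index
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # PE(pos, 2i)
    pe[:, 1::2] = np.cos(angle)                # PE(pos, 2i + 1)
    return pe


pe = positional_encoding(max_len=7, d_model=8)
```

Each row of `pe` has the same width as a word embedding, so it can be added to the word and segment embeddings of the token at that position.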
[0126] For BERT, long texts increase the computation and storage requirements of all weight matrices in the model, increasing memory pressure and computational resource consumption and thus affecting training efficiency and model performance. Sparse matrices, in contrast, store only the non-zero elements and their indices, which significantly reduces storage space. Furthermore, since most elements in a sparse matrix are zero, the sparsity can be exploited to accelerate computations such as matrix multiplication, thereby reducing computational costs and better capturing semantics. This invention applies sparse matrices to BERT. The sparsification process is shown in Figure 12, where Figure 12(a) is a schematic diagram of the structure for filtering the first position, Figure 12(b) is a schematic diagram of the structure for filtering the second position, and Figure 12(c) is a schematic diagram of the structure for filtering the nth position.
[0127] The specific transformation process considers a neural network with an input layer X containing N neurons, a hidden layer H containing M neurons, and an output layer Y containing K neurons. Formula (3) represents the feedforward process; here, only the gradient of the hidden layer is derived. $u_j$ is the calculation result of the j-th neuron, $x_i$ is a neuron of input layer X, and $w_{ij}$ is the weight matrix, where $i \in N$, $i = \{1, \dots, n\}$, $j \in M$, $j = \{1, \dots, m\}$. Formula (3) is as follows:
[0128] $$u_j = \sum_{i=1}^{n} w_{ij} x_i \quad (3)$$
[0129] Backpropagation updates the weight matrix by calculating the partial derivative of the loss function E with respect to w, where $\Delta w_{ij}$ denotes the update of the weight at that position. The update formula for the weight matrix is given by (4):
[0130] $$\Delta w_{ij} = \eta \, \delta_j \cdot x_i \quad (4)$$
[0131] where $\eta$ is the learning rate and $\delta_j$ is the error term, i.e. the partial derivative of the loss function with respect to the activation value of the neuron, as shown in Equation (5):
[0132]
[0133] where $\delta_q$ is the error term of the output layer, $x_j$ is the output of neuron j, and $u_j$ is the weighted sum computed at neuron j; the output $x_j$ can be represented as the activation function $f$ acting on its input $u_j$. If the input matrix X or the error term matrix δ is sparse, the amount of matrix multiplication is greatly reduced, saving computational resources. Since formula (3) computes the hidden layer of the feedforward neural network by multiplying $x_i$ with $w_{ij}$, converting $w_{ij}$ into a sparse matrix greatly reduces the computation of $\Delta w_{ij}$, allowing gradient descent to proceed quickly.
[0134] In the hidden layer, although many neurons are not exactly equal to 0 or 1, they are still very close to 0 or 1. Two quantities are considered very close if their difference is less than a predefined threshold $\gamma_h$. The present invention defines $T_h$ as the subset of activated neurons in all hidden layers, as shown in Equation (6).
[0135] $$T_h = \{\, j \mid 1 - y_{cj} > \gamma_h \ \text{and} \ y_{cj} > \gamma_h \,\} \quad (6)$$
[0136]
[0137] where $y_{cl}$ denotes the weights of the neurons.
[0138] According to the update formula (7), only the neurons belonging to $T_h$, i.e. the active neurons, participate in the computation; those that are not active are recorded as zero.
[0139] This yields a sparse update matrix that satisfies the threshold condition. Then, according to the zero-update rule, the original weight matrix $w_{ij}$ is converted into a sparse matrix, which completes the sparsification.
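The thresholding and zero-update rule can be sketched as follows. The sketch assumes the active set is selected exactly as in Equation (6) and that the update of Equation (4) is kept only for active neurons; the function names and the numeric values are hypothetical.

```python
import numpy as np


def active_set(y, gamma):
    """Indices j with y_j > gamma and 1 - y_j > gamma, as in Equation (6)."""
    return np.flatnonzero((1.0 - y > gamma) & (y > gamma))


def sparse_update(x, delta, y, gamma, eta):
    """Weight update eta * delta_j * x_i, kept only for neurons j in the active set."""
    t_h = active_set(y, gamma)
    dw = np.zeros((x.shape[0], delta.shape[0]))
    dw[:, t_h] = eta * np.outer(x, delta[t_h])   # all other columns stay zero
    return dw, t_h


x = np.array([0.5, -1.0, 2.0])                   # input-layer values x_i
y = np.array([0.02, 0.60, 0.99, 0.35])           # hidden values y_cj
delta = np.array([0.3, -0.2, 0.1, 0.4])          # error terms delta_j
dw, t_h = sparse_update(x, delta, y, gamma=0.05, eta=0.1)
```

Here the first and third hidden neurons lie within the threshold of 0 and 1 respectively, so their columns of the update are zero and only the remaining two columns need to be computed and stored.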
[0140] For each token, its word embedding, position embedding, and token type embedding are summed to obtain the final representation of the token. The tokens are then fed into the encoder, where an attention mechanism captures the dependencies and importance between different parts of the input sequence. The matrix T composed of the tokens is multiplied by the weight matrices $W^{Q}$, $W^{K}$, and $W^{V}$, which have been transformed into sparse matrices; the effect of the transformation is shown in Figure 13. After the corresponding query matrix Q, key matrix K, and value matrix V are obtained, attention scores are calculated from them. These scores are used to perform a weighted summation over the value matrix V, yielding the final attention-weighted vector representation. Taking the target word "Google" as $x_i$, the similarity between its query and each key is calculated: if the query is at position i in the sentence, its attention score at position j is $s_{ij}$, and the overall score is shown in formula (8). In formula (9), the similarity scores are scaled using $d_k$, the dimension of the key vector, converted into a probability distribution by the softmax function, and multiplied by the value representation V. As shown in formula (10), multi-head attention extends the attention mechanism by concatenating the outputs of the attention heads; the weighted sum of the values gives the representation of the current query.
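[0141] $$S = QK^{T} \quad (8)$$
[0142] $$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \quad (9)$$
[0143] $$MH(Q, K, V) = \mathrm{Concat}(h_1, h_2, \dots, h_n) \quad (10)$$
[0144] where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the weight matrices that produce the query, key, and value vectors of the i-th head, respectively.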
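The attention computation can be sketched as follows, assuming the standard scaled dot-product form in which the scores are divided by the square root of the key dimension before the softmax. The random matrices stand in for trained (and, in this invention, sparsified) weight matrices; all names and sizes are illustrative.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Scaled dot-product attention; returns the output and the probability matrix."""
    scores = q @ k.T                                   # S = Q K^T
    weights = softmax(scores / np.sqrt(k.shape[-1]))   # each row sums to 1
    return weights @ v, weights


def multi_head(t, heads):
    """Concatenate the outputs of all heads; `heads` holds (W_q, W_k, W_v) triples."""
    outputs = [attention(t @ w_q, t @ w_k, t @ w_v)[0] for w_q, w_k, w_v in heads]
    return np.concatenate(outputs, axis=-1)


rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 8, 2                   # hypothetical sizes
d_head = d_model // n_heads
t = rng.normal(size=(n_tokens, d_model))               # token matrix T
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]

w_q, w_k, w_v = heads[0]
_, weights = attention(t @ w_q, t @ w_k, t @ w_v)
mh = multi_head(t, heads)
```

`weights` is a 5 by 5 probability score matrix of the kind shown for the five-token sample sentence, and `mh` has the same width as the input after the heads are concatenated.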
[0145] In an attention matrix, the value at each position represents the degree of attention or importance the model assigns to each position in the input sequence. Table 7 shows the resulting probability score matrix. After it is connected to a two-layer feedforward neural network, the result is denoted as $x_{MH}$, which then passes through the Add & Layer Norm module, i.e. a residual connection followed by layer normalization.
[0146] Table 7 Example of a probability score matrix for self-attention
[0147]
Token | generation | surface | yes | valley | Song
generation | 0.5 | 0.3 | 0.1 | 0.05 | 0.05
surface | 0.3 | 0.4 | 0.1 | 0.14 | 0.16
yes | 0.1 | 0.1 | 0.5 | 0.12 | 0.1
valley | 0.05 | 0.14 | 0.12 | 0.4 | 0.29
Song | 0.05 | 0.06 | 0.1 | 0.29 | 0.5
[0148] As shown in the normalization formula (11), $\mu$ is the mean of the input, $\sigma$ is the standard deviation of the input, $\gamma$ and $\beta$ are learnable parameters, and $\epsilon$ is a very small number that prevents the denominator from being zero. These 12 encoder layers constitute the complete BERT structure.
[0149] $$\mathrm{LN}(x) = \gamma \cdot \frac{x - \mu}{\sqrt{\sigma^{2} + \epsilon}} + \beta \quad (11)$$
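The Add & Layer Norm step can be sketched as follows. The sketch assumes normalization over the feature axis with the small constant added to the variance under the square root; the exact placement of that constant is an assumption, as are the function name and the sample values.

```python
import numpy as np


def add_and_layer_norm(x, sublayer_out, gamma, beta, eps=1e-12):
    """Residual connection followed by layer normalization over the last axis."""
    z = x + sublayer_out                         # residual connection (Add)
    mu = z.mean(axis=-1, keepdims=True)          # mean of the input
    sigma = z.std(axis=-1, keepdims=True)        # standard deviation of the input
    return gamma * (z - mu) / np.sqrt(sigma ** 2 + eps) + beta


x = np.array([[1.0, 2.0, 3.0, 4.0]])
sublayer_out = np.array([[0.5, -0.5, 0.5, -0.5]])
out = add_and_layer_norm(x, sublayer_out, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma` set to ones and `beta` to zeros, each output row has approximately zero mean and unit standard deviation; the learnable parameters then rescale and shift that normalized vector.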
[0150] The final loss function is shown in equation (12):
[0151] $$L = -\frac{1}{n}\sum_{i=1}^{n}\left[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\right] \quad (12)$$
[0152] where n is the number of samples and $p_i$ is the predicted probability. The term $y_i \log p_i$ is the loss when the real label $y_i$ is 1; it penalizes cases where the model predicts a low probability for the positive class (y=1). The term $(1 - y_i)\log(1 - p_i)$ is the loss when the real label $y_i$ is 0; it penalizes cases where the model assigns a high probability to the positive class for a negative sample (y=0). This formula measures the difference between the true label and the predicted probability, encouraging the model to improve its predictions to minimize the overall loss.
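The behavior of this loss can be illustrated with a short sketch, assuming it is the usual binary cross-entropy averaged over all label entries. The function name, the clipping constant, and the sample values are illustrative.

```python
import math


def bce_loss(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy over n (sample, label) entries."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)          # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


labels = [1, 0, 1]                               # real labels y_i
confident = bce_loss(labels, [0.9, 0.1, 0.8])    # predictions close to the labels
mistaken = bce_loss(labels, [0.2, 0.7, 0.3])     # predictions far from the labels
```

Predictions that agree with the labels give a smaller loss than predictions that contradict them, which is the penalty behavior described above.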
[0153] To address the potential contextual ambiguity issue that BERT may encounter when processing long text data, this invention introduces the Soft Mixture of Experts (Soft Moe) method. Soft Moe is a combined model based on expert networks, better adaptable to complex data structures. In text classification tasks, especially for semantically rich long text data, Soft Moe can more comprehensively consider the training results of various expert models, thereby better capturing the multifaceted features of the data. Since the dataset used in this invention consists of long texts, introducing Soft Moe can better adapt to this complex data structure and improve the model's ability to understand citation contextual information.
[0154] SM-Soft Moe Module: In this invention, a soft hybrid expert module (SoftMoe) is added before the Feed Forward of the BERT encoding layer. Each token passes through the multi-head attention mechanism, the residual connection, and the normalization of the Transformer layer before entering the SoftMoe. This integrates the characteristics of the soft hybrid experts into the BERT model. In the BERT encoder, the self-attention mechanism calculates the relevance between each word in the input sequence and the other words to obtain a better citation context representation, and the feedforward neural network performs nonlinear transformation and feature extraction on the output of the self-attention mechanism. Placing the SoftMoe between the self-attention mechanism and the feedforward neural network allows it to receive the output of the self-attention mechanism as input and to make inferences and combinations based on different experts. In this position the SoftMoe can better exploit contextual relationships and semantic representations, dynamically weigh and integrate different experts, and sparsify the weight matrices of each expert, thus accelerating training. As shown in Figure 14, the expert layer consists of multiple experts, each of which is a simple feedforward neural network, and a sparse combination of experts is selected to process each input. All parts of the network are jointly trained via backpropagation. The sample sequence "representing Google" is fed into the SoftMoe module as the input $X \in \mathbb{R}^{n \times d}$, where n is the number of tokens and d is the dimension of the tokens. In the example of Figure 14, the Soft Moe module uses a set of expert networks for the tokens, and each slot learns a parameter matrix containing weights. This parameter matrix is linearly transformed with the input token vector to generate a weight logits vector whose length is the number of slots. This logits vector represents the weight of each slot for the input token, and these weight assignments determine which slot processes each input in the SoftMoe model. The weight logits are calculated by a feedforward pass. After the weight logit matrix is obtained, it is normalized column-wise, since each column corresponds to a slot. For each slot, a linear combination of all input tokens is calculated from these weights. Each expert is responsible for processing two slots, and each slot has a corresponding d-dimensional parameter vector; together these are denoted as $\Theta \in \mathbb{R}^{d \times (n \cdot c)}$. The input slots are the result of a linear combination of the n input tokens, as shown in formula (13):
[0155]
[0156] In the formula, X is the input feature, Θ is the weight, and m is the sequence length.
[0157] Here, the dispatch weight $E_{i,j}$ is obtained by applying softmax over the columns of XΘ only, not over the rows. After the input slots are formed, they are sent to the experts to compute the output slots, as shown in formula (14).
[0158]
[0159] Where c is the slot number, i represents the expert number, and i / c means that the i-th expert is applied in the c-th slot.
[0160] Finally, the same original logits as in Equation (21) are normalized per token (i.e., row-wise), and the output tokens Y are computed as a convex combination of the (n·p) output slots. The weights of this convex combination are calculated as shown in formula (15).
[0161]
[0162] In the formula, Q is the combine weight, obtained by applying softmax over the rows of XΘ. The final output is obtained by weighting and combining the representations of the slots according to the combine weights. Each column of the combine-weight matrix corresponds to a slot and determines that slot's contribution to the final result.
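The dispatch, expert, and combine steps described above can be illustrated with a minimal sketch in plain Python. It is not the implementation of this invention: the toy experts returned by `make_expert`, the random inputs, and all function names are hypothetical, and each expert here is a simple elementwise function rather than a trained feed-forward network. The sketch only shows that the dispatch weights are normalized per slot (column-wise) and the combine weights per token (row-wise).

```python
import math
import random


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def soft_moe(x, theta, experts, slots_per_expert):
    """Soft MoE layer: dispatch tokens to slots, run experts, combine back.

    x:     n_tokens x d token matrix
    theta: d x n_slots slot parameters, n_slots = len(experts) * slots_per_expert
    """
    logits = matmul(x, theta)  # n_tokens x n_slots
    # Dispatch weights: softmax over each column (normalized over tokens).
    dispatch = transpose([softmax(col) for col in transpose(logits)])
    # Combine weights: softmax over each row (normalized over slots).
    combine = [softmax(row) for row in logits]
    # Input slots: every slot is a convex combination of all tokens.
    slots_in = matmul(transpose(dispatch), x)  # n_slots x d
    # Each expert processes its own consecutive group of slots.
    slots_out = [experts[i // slots_per_expert](slot)
                 for i, slot in enumerate(slots_in)]
    # Output tokens: every token is a convex combination of all output slots.
    return matmul(combine, slots_out), dispatch, combine


def make_expert(scale):
    """Toy stand-in for an expert feed-forward network (hypothetical)."""
    return lambda slot: [max(0.0, scale * v) for v in slot]


random.seed(0)
n_tokens, d, n_experts, slots_per_expert = 4, 3, 2, 2
x = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n_tokens)]
theta = [[random.gauss(0, 1) for _ in range(n_experts * slots_per_expert)]
         for _ in range(d)]
experts = [make_expert(s) for s in (1.0, 0.5)]
y, dispatch, combine = soft_moe(x, theta, experts, slots_per_expert)
```

With two slots per expert, as in the description, slot index `i` is handled by expert `i // slots_per_expert`, and the output `y` has the same shape as the input `x`.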
[0163] The test dataset used in this invention comes from the Communications of the China Computer Federation (CCCF) and covers all papers published in the journal since 2005, 3273 papers in total. The YOLO v8 model was used to parse the PDF papers: the coordinates of each part of the document, including the main text, headings, and citations, were first marked manually, and the pdfplumber tool was used to extract the citations. The raw data was then screened, cleaned, and filtered through manual proofreading and review. In total, this invention extracted 16845 citations. The data statistics are shown in Table 8.
[0164] Table 8. CCCF Citation Data Statistics
[0165]
| Title | Subheading | Citation | References |
| --- | --- | --- | --- |
| 1341 | 3321 | 16845 | 18885 |
[0166] By analyzing existing classifications of citation intent and considering the structure and characteristics of the dataset, the most suitable classification strategy for this dataset is proposed, as shown in Table 9:
[0167] Table 9 CCCF Citation Classification Patterns
[0168]
| Pattern | Explanation |
| --- | --- |
| Phenomenon | Describes the occurrence of a simple phenomenon in the citation. |
| Application | The content mentions applications or applies the cited literature. |
| Question | Describes and explains the source of the problem. |
| Evidence | Proof relationships in the description. |
| Example | Explanations that serve no other purpose, citing relevant quotations. |
| Ordinary | Vague results. |
| Important | Detailed, effective, and meritorious results. |
| Inspiration | Inspiring conclusions drawn from this article. |
[0169] Based on the classification and prediction results for the 16,845 data points, the data volume of each category is shown in Figure 15. The ordinary and important categories of the outcome class account for a relatively large proportion of the data, while the supporting and heuristic categories account for relatively little. As a result, training and prediction may perform better for the ordinary and important categories than for the other categories. The training, test, and validation sets in this invention contain 13476, 1685, and 1685 samples, respectively.
[0170] Table 10 Confusion Matrix of Evaluation Indicators
[0171]
[0172] As shown in Table 10, the evaluation metrics used in this invention are accuracy, precision, recall, training time (per-epoch time), and Micro F1 score (F-Measure). Here, TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives, and FN is the number of false negatives.
[0173] As shown in formula (22), this invention uses the average Jaccard coefficient as the accuracy. The Jaccard coefficient is mainly used to measure the similarity between sets; here D is the test set, s denotes a sample in D, $G_s$ is the actual output for sample s, and $P_s$ is the model's predicted output for sample s. This metric reflects the overall similarity between the predicted output and the reference output. The F1 score takes both precision and recall into account, making it a relatively balanced evaluation metric. Because the samples in this invention are unevenly distributed, the Micro F1 score is used as the evaluation metric: it accounts for the contribution of every sample and is better suited to unbalanced data. The metrics are calculated as shown in formulas (16)-(19), where $\mathrm{Recall}_{micro}$ and $\mathrm{Precision}_{micro}$ denote the recall and precision aggregated over all categories, and i denotes the i-th category. In the following experiments, these are referred to as Recall, Precision, and F1.
[0174]
[0175] TP represents the number of positive classes predicted as positive, FP represents the number of negative classes predicted as positive, FN represents the number of positive classes predicted as negative, and n represents the total number of classes.
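As an illustration of the two metrics described above, the following minimal sketch computes the average Jaccard accuracy and the micro-averaged precision, recall, and F1 for multi-label predictions. The function names and the three toy samples are hypothetical and are not taken from the CCCF dataset; treating an empty gold and predicted set as a perfect match is also an assumption of this sketch.

```python
def jaccard_accuracy(gold, pred):
    """Mean Jaccard coefficient |G_s & P_s| / |G_s | P_s| over all samples."""
    scores = []
    for g, p in zip(gold, pred):
        union = g | p
        # Assumption: two empty label sets count as a perfect match.
        scores.append(len(g & p) / len(union) if union else 1.0)
    return sum(scores) / len(scores)


def micro_scores(gold, pred, labels):
    """Micro precision, recall, F1: pool TP/FP/FN over all labels first."""
    tp = fp = fn = 0
    for label in labels:
        for g, p in zip(gold, pred):
            tp += (label in g) and (label in p)
            fp += (label not in g) and (label in p)
            fn += (label in g) and (label not in p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy multi-label data (hypothetical): one set of intent labels per citation.
gold = [{"ordinary", "evidence"}, {"important"}, {"application"}]
pred = [{"ordinary"}, {"important", "evidence"}, {"question"}]
labels = sorted(set().union(*gold, *pred))

accuracy = jaccard_accuracy(gold, pred)
precision, recall, f1 = micro_scores(gold, pred, labels)
```

In this toy example there are 2 true positives, 2 false positives, and 2 false negatives in total, so the micro precision, recall, and F1 are all 0.5, while the average Jaccard accuracy is 1/3.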
[0176] The Macro F1 score used in Experiment 3 is shown in equations (20)-(22). In multi-label classification tasks, labels may differ in importance: some appear frequently while others are very rare. Macro-F1 simply averages the F1 scores of the individual labels and thus ignores label imbalance, which may lead to an insufficient evaluation of rare labels. Therefore, the Micro F1 score is used as the global evaluation metric, and the Macro F1 score is used only as a reference when testing on the public dataset in Experiment 3.
[0177]
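For reference, the macro-averaged F1 described above can be sketched as follows: an F1 score is computed for each label separately and the unweighted mean is taken. The function name and the toy samples are hypothetical; assigning an F1 of 0 to a label with no true positives, false positives, or false negatives is an assumption of this sketch.

```python
def macro_f1(gold, pred, labels):
    """Macro F1: compute F1 per label, then take the unweighted mean."""
    per_label = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        # F1 = 2TP / (2TP + FP + FN); assumed 0 when the label never occurs.
        denom = 2 * tp + fp + fn
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / len(per_label)


# Toy multi-label data (hypothetical): one set of labels per sample.
gold = [{"ordinary", "evidence"}, {"important"}, {"application"}]
pred = [{"ordinary"}, {"important", "evidence"}, {"question"}]
labels = sorted(set().union(*gold, *pred))

score = macro_f1(gold, pred, labels)
```

Here two of the five labels reach an F1 of 1 and the other three an F1 of 0, so the macro F1 is 0.4. Every label carries the same weight in this average regardless of how often it occurs.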
[0178] Experimental scheme: The effectiveness of the model optimization of this invention is verified through experiments. These experiments aim to answer the following research questions:
[0179] Experiment 1: How does the proposed model compare with classic and state-of-the-art models?
[0180] Experiment 2: Convergence performance of four indicators on different models.
[0181] Experiment 3: How well does Soft Moe-SM-BERT perform on public datasets?
[0182] Experiment 4: Verify the effectiveness of the module through ablation experiments.
[0183] Experiment 5: Investigate the effect of different sparsity on the model's performance.
[0184] This invention uses TextCnn, BERT+TextCnn, Soft Moe-SM-BERT, and RoBERTa-MA models for comparative experiments.
[0185] Experiment 1: Model Comparison Experiment.
[0186] The results of the multi-label classification experiments for each model on the citation dataset of this invention are shown in Figure 16, where Figure 16(a) shows the accuracy, Figure 16(b) the precision, Figure 16(c) the recall, and Figure 16(d) the F1 score.
[0187] In the test results, the F1 score of the classic classification model TextCNN differs from that of SoftMoe-SM-BERT by 5%. Because the dataset consists of long texts, TextCNN's contextual understanding is far inferior to that of the other models: its fixed-length input truncates long texts, causing information loss and limiting its ability to capture semantics. SoftMoe-SM-BERT outperforms BERT+TextCNN by 3.9%, a significant improvement. Comparing Soft Moe with TextCNN shows that Soft Moe obtains its predictions by combining different experts with weighted calculations, which reduces model bias and variance and extracts more comprehensive and diverse features, thereby improving performance. TextCNN focuses primarily on local features and handles global information relatively weakly. In the classification task of this invention, global information is crucial for correctly classifying long texts, which explains TextCNN's weaker performance compared with SoftMoe-SM-BERT. Figure 16 also shows that the precision of BERT+TextCNN is 3.2% higher than that of SoftMoe-SM-BERT. Because BERT+TextCNN has limited context understanding, it may be more conservative in its predictions and tend to predict samples as negative, which yields higher precision on the samples it does predict as positive. SoftMoe-SM-BERT is slightly better than RoBERTa-MA, by 0.3%. RoBERTa-MA adds a Multi-Attention layer on top of the RoBERTa output, introducing separate attention mechanisms for different labels to capture the relationships between labels. In some cases, however, it may overemphasize these relationships and neglect the importance of predicting each label independently. SoftMoe-SM-BERT uses a trainable soft routing mechanism that avoids hard label assignment, so it makes better use of all input information and avoids label omission or imbalance. The slightly better F1 score of SoftMoe-SM-BERT compared with the recent RoBERTa-MA supports this point.
[0188] As can be seen from Figure 17, the loss of the Soft Moe-SM-BERT model decreases faster than that of RoBERTa-MA. This indicates that Soft Moe-SM-BERT has an advantage in training efficiency: it reduces the loss more quickly and learns and adjusts the model parameters more effectively during training.
[0189] Experiment 2: Convergence experiment of multi-index model.
[0190] The convergence of each model on the multi-label classification task on the citation dataset of this invention is shown in Figure 18, where Figure 18(a) shows the accuracy, Figure 18(b) the precision, Figure 18(c) the recall, and Figure 18(d) the F1 score.
[0191] As can be seen from Figure 18, the accuracy levels off after 12 epochs, with RoBERTa-MA slightly higher. BERT+TextCNN has the highest precision. In recall, Soft Moe-SM-BERT converges faster than RoBERTa-MA; RoBERTa-MA has the highest recall and TextCNN the lowest. In F1 score, Soft Moe-SM-BERT is slightly higher than RoBERTa-MA. Taking accuracy, recall, and F1 score together, Soft Moe-SM-BERT converges the fastest.
[0192] Experiment 3: Validation experiment on the effect of public dataset.
[0193] The models from Experiment 1 were used for comparison and tested on a public dataset, Ren-CECps, which consists of sentences selected from Chinese blogs. The sentences are manually annotated with eight basic emotions: anger, anxiety, expectation, hatred, joy, love, sadness, and surprise, so this Chinese blog corpus contains rich sentiment annotation. Its data distribution and number of categories are similar to those of the data used in this invention, making it a good reference for model optimization. The data distribution is shown in Figure 19.
[0194] Figure 20 compares the results on the Ren-CECps dataset, where Figure 20(a) shows the accuracy, Figure 20(b) the Micro F1, and Figure 20(c) the Macro F1. As can be seen from Figure 20, Soft Moe-SM-BERT achieves an accuracy 2.2 percentage points higher than RoBERTa-MA. In the Micro F1 score, the latter is 0.4 percentage points higher than the former, while in the Macro F1 score, Soft Moe-SM-BERT outperforms RoBERTa-MA. Micro F1 is calculated by pooling the true and predicted labels of all samples, so it gives more weight to frequently occurring labels in the dataset. This indicates that Soft Moe-SM-BERT performs better in identifying common labels in the dataset, resulting in a higher Micro F1 score than RoBERTa-MA. Macro F1 focuses more on the model's performance on each individual label, showing that RoBERTa-MA performs worse than Soft Moe-SM-BERT on the rarer or more challenging labels in the dataset.
[0195] Experiment 4: Ablation Experiment.
[0196] To verify the effectiveness of the optimization, ablation experiments were conducted on the CCCF citation dataset of this invention. The experimental results are shown in Figure 21, where Figure 21(a) shows the accuracy, Figure 21(b) the precision, Figure 21(c) the recall, and Figure 21(d) the F1 score.
[0197] It can be seen that Soft Moe-SM-BERT has the highest accuracy, 0.5 percentage points higher than SM-BERT. Soft Moe-BERT achieves the highest precision, 57.2%, while BERT-base has the highest recall, 49.2%. The reason is that in BERT's pre-training task, BERT-base is trained to predict masked tokens, which may have multiple possible answers. To improve accuracy, BERT-base tends to select the most common or most plausible answer and ignore the other possibilities. The Soft Moe layer, by contrast, weights all input tokens for each expert to generate the output tokens. This trainable assignment mechanism allows Soft Moe-BERT to consider the different possibilities more comprehensively and to make predictions more flexibly.
[0198] SM-BERT shows a slight improvement over BERT, primarily by reducing computational load and accelerating the inference process, as shown in Table 11.
[0199] Table 11 shows the training speed results for the CCCF citation dataset.
[0200]
[0201] As can be seen from the Soft Moe-SM-BERT results in the figure, Soft Moe, as a soft hybrid expert model, addresses this problem by training on more key information. The high accuracy of Soft Moe-SM-BERT indicates that the model performs well globally and achieves a balance in the classification process. Its F1 score is also among the best, demonstrating that the experts in the soft hybrid expert model enhance the model's decision-making ability. Hybrid experts can make decisions through methods such as voting and weighted averaging, thereby improving the model's accuracy and robustness. The iterative process is shown in Figure 22.
[0202] Experiment 5: Sparsity Comparison Experiment
[0203] As can be seen from Figure 22, SoftMoe-SM-BERT has the fastest iteration speed. Introducing sparse soft hybrid experts allows the model to select experts that learn data features in a targeted manner, which accelerates convergence considerably.
[0204] Figure 23 shows the sparsity comparison, where Figure 23(a) shows the accuracy, Figure 23(b) the precision, Figure 23(c) the recall, and Figure 23(d) the F1 score. As can be seen from Figure 23, sparsity values of 0.25 and 0.5 give relatively poor performance. Between 0.5 and 0.8, performance gradually improves as the sparsity value increases, with the best F1 score at 0.8. When the sparsity value is 1, however, performance drops significantly. The reason is that when the model is too sparse, as at 0.25, connections between some neurons are cut or weakened and can no longer convey important information, leading to information loss. When the model is not sparse enough, as at 1, redundant information and overfitting to the training data cause key information to be lost and reduce generalization ability. Choosing the sparsity therefore requires a balance between information preservation and model performance. In this model, a sparsity value of 0.8 yields the best results.
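The description does not state how the expert weight matrices are sparsified, so the following is only a hypothetical sketch. It assumes that the sparsity value is the fraction of weights that is kept (so that 1 keeps every weight and 0.25 keeps a quarter of them) and that weights are selected by magnitude; both assumptions, and the function name `sparsify`, are illustrative and are not taken from this invention.

```python
def sparsify(weights, keep_ratio):
    """Zero all but the largest-magnitude `keep_ratio` fraction of entries.

    Assumption: `keep_ratio` plays the role of the sparsity value in the
    text (1.0 keeps everything). Ties at the threshold are all kept.
    """
    flat = sorted((abs(w) for row in weights for w in row), reverse=True)
    n_keep = round(keep_ratio * len(flat))
    if n_keep == 0:
        return [[0.0 for _ in row] for row in weights]
    threshold = flat[n_keep - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


# Toy expert weight matrix (hypothetical values).
weights = [[0.9, -0.1, 0.4, 0.05],
           [-0.7, 0.2, -0.3, 0.6]]

half = sparsify(weights, 0.5)      # keeps the 4 largest-magnitude weights
quarter = sparsify(weights, 0.25)  # keeps the 2 largest-magnitude weights
full = sparsify(weights, 1.0)      # keeps every weight
```

Under these assumptions, a small value removes most connections, which matches the information loss described for 0.25, while a value of 1 leaves the matrix dense.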
[0205] This invention primarily introduces the identification of citation intent in extracted citations and the extraction of labels from references. The application of cited references in an article has multiple functions, including increasing the credibility of the article, supporting arguments and viewpoints, emphasizing research innovation, and increasing verifiability. This invention proposes the Soft Moe-SM-BERT model for multi-label citation intent classification experiments. Experimental results show that the Soft Moe-SM-BERT model exhibits advantages in accuracy and F1 score, especially in loss descent speed and training speed, significantly outperforming other models. Ablation experiments and instance-based case learning further validate the model's optimization effect and performance improvement, indicating that the Soft Moe-SM-BERT model can better handle multi-label classification tasks.
[0206] This invention utilizes the YOLO++ algorithm to extract the structure of a document and decomposes the document to be retrieved into multiple structural units of different granularities according to chapters. These structural units are then categorized according to subheadings, creating a complete hierarchical structure. A prefix-tuned PT-GLM-6B model is used to extract the core content of paragraphs corresponding to subheadings in each category. This process requires only minor adjustments to a few prefix parameters to adapt to different tasks. Furthermore, this method allows for customized parameter adjustments of PT-GLM-6B for different task categories, better matching task requirements. This gives PT-GLM-6B stronger cross-task generalization capabilities, achieving excellent performance across various task categories, and enabling the extraction of core content from different subheadings. The sparse matrix in the constructed Soft Moe-SM-BERT model accelerates computation. Introducing Soft Moe for weighted averaging of tokens better captures long-distance semantic dependencies. Finally, this invention constructs a hyponym / hypernym network, capable of querying existing hypernyms and providing auxiliary prediction for new words, extracting the research directions and methods of the document to be retrieved. Based on the extracted main content, citation intent, research direction, and methodology of the literature to be retrieved, the system recommends matching literature to the user.
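The prefix-tuning idea mentioned above can be illustrated with a minimal sketch, assuming the common formulation in which trainable prefix vectors are prepended to the keys and values of an attention layer while the base model stays frozen. This is not the PT-GLM-6B implementation: the function names, the toy vectors, and the use of identity query/key/value projections are all simplifying assumptions.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs


def prefix_attention(tokens, prefix_keys, prefix_values):
    """Attention with prefix vectors prepended to the keys and values.

    Assumption: query/key/value projections are the identity, to keep the
    sketch short. Only the prefix vectors would be trained.
    """
    return attention(tokens, prefix_keys + tokens, prefix_values + tokens)


# Toy inputs (hypothetical): two 2-dimensional tokens and one prefix vector.
tokens = [[1.0, 0.0], [0.0, 1.0]]
prefix_keys = [[2.0, 2.0]]
prefix_values = [[5.0, -5.0]]

without_prefix = attention(tokens, tokens, tokens)
with_prefix = prefix_attention(tokens, prefix_keys, prefix_values)
```

The number of output vectors is unchanged by the prefix; only the attention distribution, and therefore the output values, shift. This is why adjusting a small number of prefix parameters can adapt the layer to a task without changing the base weights.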
[0207] This invention also provides a document guidance system, comprising:
[0208] The data acquisition module is used to acquire the literature to be retrieved;
[0209] The core content extraction module is used to extract the bibliographic structure of the literature to be retrieved using the YOLO++ algorithm, and decompose the literature to be retrieved into multiple structural units of different granularities according to chapters; divide the multiple different structural units into different categories according to subheadings; add a prefix vector before each attention layer of the encoder of the GLM-6B generative language model to obtain a prefix-tuned PT-GLM-6B model, and use the prefix vector of the PT-GLM-6B model to extract the core content of the paragraphs corresponding to the subheadings of each category;
[0210] The citation intent recognition module is used to take the BERT model as the base model, introduce a sparse matrix SM and a soft hybrid expert model Soft Moe into it to obtain the sparse soft hybrid expert model Soft Moe-SM-BERT, and use Soft Moe-SM-BERT to identify the citation intent of the literature to be retrieved.
[0211] The reading guidance module is used to provide literature guidance for users based on the extracted core content and the citation intent.
[0212] The present invention also provides a computer device including a memory and a processor; the memory stores a computer program, and the processor is used to run the computer program in the memory to perform a document reading method.
[0213] The present invention also provides a computer-readable storage medium storing a computer program adapted for loading by a processor to execute a document reading method.
[0214] The above-described embodiments are merely preferred embodiments of the present invention, and the scope of protection of the present invention is not limited thereto. Any simple changes or equivalent substitutions of the technical solutions that can be obviously obtained by those skilled in the art within the scope of the technology disclosed in the present invention shall fall within the scope of protection of the present invention.
Claims
1. A method for guiding the reading of literature, characterized in that it comprises:
obtaining the literature to be retrieved;
using the YOLO++ algorithm to extract the literature structure of the literature to be retrieved, and decomposing the literature to be retrieved into multiple structural units of different granularities according to chapters; dividing the multiple different structural units into different categories according to subheadings; adding a prefix vector before each attention layer of the encoder of the GLM-6B generative language model to obtain the prefix-tuned PT-GLM-6B model; and using the prefix vector of the PT-GLM-6B model to extract the core content of the paragraphs corresponding to the subheadings of each category;
taking the BERT model as the base model, introducing a sparse matrix SM and a soft hybrid expert model Soft Moe into the BERT model to obtain the sparse soft hybrid expert model Soft Moe-SM-BERT, and using Soft Moe-SM-BERT to identify the citation intent of the literature to be retrieved;
providing literature guidance for users based on the extracted core content and the citation intent;
wherein using Soft Moe-SM-BERT to identify the citation intent of the literature to be retrieved specifically comprises the following steps:
converting the input sample sequence of the literature to be retrieved into multiple token sequences and labeling the multiple token sequences, each token being mapped to a corresponding vector representation in the vocabulary;
using the BERT in Soft Moe-SM-BERT to add a type embedding to each token and generate a position embedding for each token, and adding the vector representation of each token to its type embedding and position embedding to obtain the final embedding representation of each token;
inputting the final embedding representation of each token into Soft Moe, which captures contextual relationships and semantic representations and dynamically weighs and integrates them among different experts while sparsifying the weight matrices of each expert, thereby identifying the citation intent.
2. The document guidance method according to claim 1, characterized in that the process of extracting the literature structure of the literature to be retrieved using the YOLO++ algorithm and decomposing the literature to be retrieved into multiple structural units of different granularities according to chapters further includes extracting the text, images, and tables of each structural unit as leaf nodes, and constructing the hierarchical structure of the literature text based on the extracted leaf nodes; the subheadings of each structural unit are categorized according to the hierarchical structure, and subheadings that do not clearly reflect the structural unit are replaced.
3. The document guidance method according to claim 1, characterized in that the token sequence includes multiple tokens, which are the smallest units processed by Soft Moe-SM-BERT; tokens include words and punctuation marks.
4. The document guidance method according to claim 1, characterized in that the Soft Moe is located between the self-attention mechanism of the BERT encoding layer and the feedforward neural network.
5. A document guidance system, characterized in that it comprises:
a data acquisition module, used to acquire the literature to be retrieved;
a core content extraction module, used to extract the literature structure of the literature to be retrieved using the YOLO++ algorithm, and decompose the literature to be retrieved into multiple structural units of different granularities according to chapters; divide the multiple different structural units into different categories according to subheadings; add a prefix vector before each attention layer of the encoder of the GLM-6B generative language model to obtain a prefix-tuned PT-GLM-6B model; and use the prefix vector of the PT-GLM-6B model to extract the core content of the paragraphs corresponding to the subheadings of each category;
a citation intent recognition module, used to take the BERT model as the base model, introduce a sparse matrix SM and a soft hybrid expert model Soft Moe into the BERT model to obtain the sparse soft hybrid expert model Soft Moe-SM-BERT, and use Soft Moe-SM-BERT to identify the citation intent of the literature to be retrieved;
a reading guidance module, used to provide literature guidance for users based on the extracted core content and the citation intent;
wherein using Soft Moe-SM-BERT to identify the citation intent of the literature to be retrieved specifically comprises the following steps:
converting the input sample sequence of the literature to be retrieved into multiple token sequences and labeling the multiple token sequences, each token being mapped to a corresponding vector representation in the vocabulary;
using the BERT in Soft Moe-SM-BERT to add a type embedding to each token and generate a position embedding for each token, and adding the vector representation of each token to its type embedding and position embedding to obtain the final embedding representation of each token;
inputting the final embedding representation of each token into Soft Moe, which captures contextual relationships and semantic representations and dynamically weighs and integrates them among different experts while sparsifying the weight matrices of each expert, thereby identifying the citation intent.
6. A computer device, characterized in that it includes a memory and a processor; the memory stores a computer program, and the processor is used to run the computer program in the memory to perform the document reading method according to any one of claims 1-4.
7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program adapted for loading by a processor to execute the document guidance method according to any one of claims 1-4.