A method for generating article gists based on mT5
By combining structured parsing of laws and regulations with word segmentation for the mT5 model, the method addresses the inaccurate gist generation of existing technologies. It achieves intelligent decomposition of legal and regulatory information and improves the accuracy of Chinese word segmentation, resulting in more accurate and fluent article gists.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INST OF MILITARY LEGAL AFFAIRS ACAD OF MILITARY SCI OF THE CHINESE PEOPLES LIBERATION ARMY
- Filing Date
- 2022-12-19
- Publication Date
- 2026-06-23
AI Technical Summary
Existing technologies suffer from inaccurate text generation, insufficient retention of Chinese word information, and text length limitations when generating gists of legal provisions. In particular, the fluency and accuracy of the generated gists are poor for Chinese text.
Laws and regulations are parsed into a structured form, the text is segmented with jieba and sentencepiece, and the text, position, and paragraph encodings are combined. The mT5 encoder and decoder then generate the article gist from this accurately segmented Chinese input.
The method achieves intelligent decomposition of legal and regulatory information and improves the accuracy of gist generation, better preserving Chinese word information and producing more accurate and fluent article gists.
Description
Technical Field
[0001] This invention relates to the field of legal provision gist generation technology, and more specifically to a gist generation method based on mT5.
Background Technology
[0002] The gist of a legal provision, or the essence of a law or regulation, serves to facilitate understanding and review of the provisions and provides legislative guidance. In legislative work, it is a tool and method for legislative focus. From a legislative perspective, the gist of a legal provision must be non-repetitive, non-overlapping, unambiguous, and non-redundant.
[0003] Currently, gists are mainly generated by one of two methods: extractive text generation and generative text generation. Extractive text generation, as the name suggests, extracts passages from the existing text to form the output. Although it can handle some text generation tasks, it has the following disadvantages: (1) all of the generated text already exists in the original, so nothing is actually summarized; (2) the generated text is not fluent and is prone to ambiguity. Generative text generation, as the name suggests, produces new text from the existing text through the model's own summarization. It performs better than extractive generation, but still has the following disadvantages: (1) it is usually limited by the length of the text being processed and works well only on short texts; (2) it usually splits the text into individual characters, whereas Chinese expresses meaning mainly through words, so word-level information is poorly preserved and the results suffer. Therefore, how to generate gists accurately is a problem that those skilled in the art need to solve.
Summary of the Invention
[0004] In view of this, the present invention provides a gist generation method based on mT5 to overcome the above-mentioned defects.
[0005] To achieve the above objectives, the present invention provides the following technical solution:
[0006] A gist generation method based on mT5, the specific steps of which are as follows:
[0007] Information gathering: obtaining the content of laws and regulations;
[0008] Document structure decomposition: structured parsing of the laws and regulations;
[0009] Gist generation: an mT5 gist extraction model is used to generate gists from the parsed laws and regulations.
[0010] Optionally, the decomposition of the document structure includes the decomposition of the overall content and the decomposition of the clauses.
[0011] Optionally, the overall content can be broken down into its name, body text, publication date, and effectiveness level.
[0012] Optionally, clause decomposition involves breaking down the content of legal provisions into a hierarchical structure of sections, chapters, subsections, articles, clauses, items, and sub-items.
[0013] Optionally, the steps for generating a gist are as follows:
[0014] Encode the input legal provisions using text tokens;
[0015] The text token encoding (which yields token embeddings), the position encoding (which yields position embeddings), and the segmentation encoding (which yields segmentation embeddings) are combined and then fed into the mT5 encoder to generate an encoded file;
[0016] The encoded file is fed into the mT5 decoder, which outputs the gist.
[0017] Optionally, the steps for obtaining the text token encoding are as follows:
[0018] The input legal provisions are segmented using jieba.
[0019] For each word, determine whether it exists in the built-in dictionary of mT5. If it does, encode it using the built-in dictionary of mT5. If not, use sentencepiece to split the word into characters, and then encode each resulting character using the built-in dictionary of mT5.
[0020] Arranging the codes of the words in order yields the text token encoding.
[0021] Optionally, if any character obtained after splitting is not in the built-in dictionary of mT5, it is handled as UNK.
[0022] Optionally, the steps for obtaining the encoded file are as follows:
[0023] Step 321: Obtain the attention level of words in the input text to other words through the attention mechanism;
[0024] Step 322: The output of the attention mechanism is passed through the feedforward network and the linear layer before being output.
[0025] Step 323: Combine the output of step 322 with the original encoding vector to obtain the output text;
Step 324: Repeat steps 321-323 until all input text is encoded.
[0026] As can be seen from the above technical solution, compared with the prior art, the present invention discloses a gist generation method based on mT5, which not only realizes the intelligent decomposition of regulatory information but also improves the accuracy of gist generation through precise Chinese word segmentation.
Attached Figure Description
[0027] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.
[0028] Figure 1 shows the gist extraction steps of the mT5 gist extraction model of the present invention;
[0029] Figure 2 is a schematic diagram of the method flow of Embodiment 2 of the present invention.
Detailed Implementation
[0030] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
Example 1
[0031] This invention discloses a gist generation method based on mT5, comprising the following steps:
[0032] Step 1: Collection of legal information: Collect publicly available laws at the national level, such as the Constitution, Civil Law, and Criminal Law, or some publicly available local regulations and some classified and non-public regulations.
[0033] Step 2: Deconstructing the structure of regulations: The structure of regulations from various sources is analyzed so that different parts of each regulation can be processed separately in later steps. Structured information is extracted from each regulation, such as its name, main text, publication date, and level of legal force; the level of legal force defines the hierarchical relationship between regulations. During deconstruction, the legal provisions are broken down into a hierarchy of sections, chapters, subsections, articles, clauses, items, and sub-items, which facilitates gist generation and the extraction of legal elements.
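The decomposition in Step 2 can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the patent: the class names, the `parse_regulation` helper, the sample text, and the use of regular expressions that recognize headings of the form "第…章" (chapter) and "第…条" (article). Only two levels of the hierarchy are handled.

```python
import re
from dataclasses import dataclass, field

# Hypothetical heading patterns: "第N章" opens a chapter, "第N条" opens an article.
NUM = r"[一二三四五六七八九十百千零〇0-9]+"
CHAPTER_RE = re.compile(rf"^第{NUM}章")
ARTICLE_RE = re.compile(rf"^(第{NUM}条)\s*(.*)$")


@dataclass
class Article:
    label: str
    text: str


@dataclass
class Chapter:
    title: str
    articles: list = field(default_factory=list)


@dataclass
class Regulation:
    # Overall content: name, publication date, level of legal force, body.
    name: str
    publication_date: str
    legal_force_level: str
    chapters: list = field(default_factory=list)


def parse_regulation(name, publication_date, legal_force_level, body):
    """Split a regulation body into chapters and articles (illustrative only)."""
    regulation = Regulation(name, publication_date, legal_force_level)
    current = None
    for raw_line in body.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if CHAPTER_RE.match(line):
            current = Chapter(title=line)
            regulation.chapters.append(current)
            continue
        match = ARTICLE_RE.match(line)
        if match:
            if current is None:
                # Articles that appear before any chapter heading.
                current = Chapter(title="")
                regulation.chapters.append(current)
            current.articles.append(Article(match.group(1), match.group(2)))
        elif current is not None and current.articles:
            # Continuation line: append to the most recent article.
            current.articles[-1].text += line
    return regulation


# Made-up sample text, used only to exercise the parser.
sample_body = """第一章 总则
第一条 为了规范考勤管理,制定本制度。
第二条 本制度适用于全体员工。
第二章 考勤
第三条 员工应当按时上下班。"""

regulation = parse_regulation("示例制度", "2022-12-19", "internal rule", sample_body)
for chapter in regulation.chapters:
    print(chapter.title, [article.label for article in chapter.articles])
```

Each `Article` produced this way corresponds to the kind of single provision that Step 3 passes to the gist extraction model; deeper levels (clauses, items, sub-items) would need additional patterns.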
[0034] Step 3: Input a single legal provision into the mT5 gist extraction model to generate its gist.
[0035] It is worth noting that a single legal provision is a clause, item, or sub-item under that provision.
[0036] The detailed gist extraction process of mT5 is shown in Figure 1, specifically:
[0037] First, perform text token encoding on the input legal provision. The input is first segmented using jieba together with a custom dictionary and a stop-word dictionary. The segmented words are then encoded according to the dictionary built into mT5. Words that are not in the mT5 dictionary are processed with the sentencepiece tokenizer built into mT5 and then encoded. The specific steps are as follows:
[0038] Step 311: Segment the input text using jieba.
[0039] Step 312: Determine whether each segmented word is in the dictionary built into mT5. If it is, encode the word according to that dictionary. Entries in the mT5 dictionary take the form (index, word), where word is the word in the dictionary and index is its position in the dictionary. For example, if "Beijing" (北京) is the 99th word in the dictionary, its encoding is (99, Beijing). In other words, each word of the segmented text is looked up in the mT5 dictionary one by one. If a word is not in the dictionary, sentencepiece is used to split it into characters. Suppose "Beijing" is not in the dictionary: it is then split into the two characters "Bei" (北) and "Jing" (京), and each character is looked up in the same way. In the extreme case where neither character is in the dictionary, both receive a unified UNK (unknown) encoding, that is, a single shared code such as (100, UNK) is assigned to everything absent from the dictionary.
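Steps 311 and 312 amount to a dictionary lookup with two fallbacks, which the following sketch illustrates. The toy vocabulary, the `encode_words` helper, and the ID values are assumptions made for illustration; in particular, the per-character split is only a stand-in for sentencepiece, and the word list is written by hand where a real pipeline might obtain it from `jieba.lcut(text)`.

```python
def encode_words(words, vocab, unk_id):
    """Encode segmented words as (index, token) pairs.

    Lookup order: whole word, then single characters, then UNK.
    The character split is a simplified stand-in for sentencepiece.
    """
    encoded = []
    for word in words:
        if word in vocab:
            encoded.append((vocab[word], word))
            continue
        for char in word:
            if char in vocab:
                encoded.append((vocab[char], char))
            else:
                encoded.append((unk_id, "UNK"))
    return encoded


# Toy dictionary; the indices are invented, apart from 99 and 100,
# which mirror the example values used in the description.
toy_vocab = {"北京": 99, "是": 7, "首": 21, "都": 22}
UNK_ID = 100

# Hand-written segmentation standing in for jieba output.
segmented = ["北京", "是", "首都", "鑫"]
token_encoding = encode_words(segmented, toy_vocab, UNK_ID)
print(token_encoding)
```

Here "首都" is missing from the toy dictionary and falls back to its two characters, while "鑫" is missing at both levels and is mapped to the shared UNK code.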
[0040] Secondly, send the encoded vector into the encoder of mT5. The specific steps are as follows:
[0041] Combine the text token encoding, position encoding, and segmentation encoding and send them into the encoder of mT5. Among them, the position encoding is the position encoding of the words in the text, and the segmentation encoding is the paragraph encoding of the text. The execution steps of the mT5 encoder are as follows:
[0042] Step 321: Pass the input vector through the attention mechanism to obtain the attention of words in the text to other words.
[0043] Step 322: Pass the output of the attention mechanism through the feedforward network and the linear layer.
[0044] Step 323: Combine the output of step 322 with the original encoding vector to obtain the output text;
Step 324: Repeat steps 321-323 until all text is encoded.
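Steps 321 to 324 can be sketched as one generic Transformer-style encoder layer in plain Python. This is an illustration under stated assumptions rather than mT5's actual implementation: the three encodings are combined by element-wise addition, attention is the textbook scaled dot-product form, layer normalization and multiple heads are omitted, "repeat" in step 324 is read as stacking layers, and all function names and weights are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b):
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(a, b)]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def combine_embeddings(token_emb, position_emb, segment_emb):
    """Assumed combination rule: element-wise sum of the three encodings."""
    return add(add(token_emb, position_emb), segment_emb)


def self_attention(x, wq, wk, wv):
    """Step 321: how strongly each token attends to every other token."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def feed_forward(x, w1, w2):
    """Step 322: ReLU feedforward network followed by a linear layer."""
    hidden = [[max(0.0, v) for v in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)


def encoder_layer(x, params):
    attended, weights = self_attention(x, params["wq"], params["wk"], params["wv"])
    transformed = feed_forward(attended, params["w1"], params["w2"])
    # Step 323: combine with the original encoding vector (residual connection).
    return add(transformed, x), weights


def encode(x, layers):
    """Step 324, read here as applying the same steps layer after layer."""
    for params in layers:
        x, _ = encoder_layer(x, params)
    return x


# Tiny made-up example: 3 tokens, 2-dimensional embeddings, identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
params = {"wq": identity, "wk": identity, "wv": identity, "w1": identity, "w2": identity}
token_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
position_emb = [[0.0, 0.1], [0.1, 0.0], [0.1, 0.1]]
segment_emb = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]

x = combine_embeddings(token_emb, position_emb, segment_emb)
layer_output, attention_weights = encoder_layer(x, params)
encoded = encode(x, [params, params])
print(attention_weights)
```

Each row of `attention_weights` is a probability distribution over the input tokens, which is the "attention of words in the text to other words" that step 321 refers to.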
[0045] Finally, the encoder's result is fed into the mT5 decoder for text output. The decoder is the inverse operation of the encoder.
[0046] In this embodiment, jieba's word segmentation is based on finding the maximum probability path by word frequency, and its algorithm is as follows:
[0047] 1. Based on a prefix dictionary, achieve efficient word graph scanning to generate a directed acyclic graph consisting of all possible word combinations of Chinese characters in a sentence;
[0048] 2. Use dynamic programming to find the path with the highest probability, identifying the most probable segmentation based on word frequency;
[0049] 3. For out-of-vocabulary words, an HMM model based on the word-forming ability of Chinese characters is used, with the Viterbi algorithm performing the decoding.
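The first two steps of this algorithm can be sketched in a few lines of plain Python. This is a simplified illustration, not jieba's source: the frequency dictionary and its counts are invented, a direct dictionary lookup replaces the prefix-dictionary scan, and the HMM/Viterbi handling of out-of-vocabulary words in step 3 is left out.

```python
import math


def build_dag(sentence, freq):
    """For each start index, list every end index that forms a known word."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in freq]
        # Unknown single characters still need an edge so the path is connected.
        dag[start] = ends or [start]
    return dag


def best_route(sentence, freq):
    """Right-to-left dynamic programming over log word probabilities."""
    log_total = math.log(sum(freq.values()))
    dag = build_dag(sentence, freq)
    n = len(sentence)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(freq.get(sentence[start:end + 1], 1)) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    return route


def segment(sentence, freq):
    """Follow the highest-probability path to recover the words."""
    route = best_route(sentence, freq)
    words, start = [], 0
    while start < len(sentence):
        end = route[start][1]
        words.append(sentence[start:end + 1])
        start = end + 1
    return words


# Invented word frequencies, for illustration only.
toy_freq = {
    "遵守": 40, "考勤": 50, "制度": 60,
    "遵": 2, "守": 3, "考": 5, "勤": 3, "制": 4, "度": 6,
}

dag = build_dag("遵守考勤制度", toy_freq)
words = segment("遵守考勤制度", toy_freq)
print(words)
```

Because the two-character words have much higher frequencies than their individual characters, the path through them has the largest summed log probability and is the one selected.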
[0050] In this embodiment, PEGASUS is used as the generative pre-training task. Its general idea is to construct summary-like data pairs using the longest common subsequence. It is applied during pre-training as follows:
[0051] Suppose a document has n sentences. Select approximately n / 4 sentences (which need not be consecutive) such that the longest common subsequence between the text concatenated from these n / 4 sentences and the text concatenated from the remaining 3n / 4 sentences is as long as possible. Then treat the concatenated 3n / 4 sentences as the original text and the concatenated n / 4 sentences as the summary. This creates a pseudo-summary data pair of the form "(original text, summary)". These data pairs are used to train a Seq2Seq model.
Example 2
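The pseudo-summary construction of paragraph [0051] can be sketched as follows. The description does not say how the n / 4 sentences are searched for, so the greedy one-sentence-at-a-time selection below is an assumption, as are the helper names and the placeholder sentences.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def build_pseudo_pair(sentences):
    """Return (original_text, summary) built from roughly n / 4 sentences.

    Greedy assumption: repeatedly add the sentence that maximizes the LCS
    between the selected text and the remaining text.
    """
    if not sentences:
        return "", ""
    n = len(sentences)
    target = max(1, n // 4)
    selected = []
    while len(selected) < target:
        best_index, best_score = None, -1
        for index in range(n):
            if index in selected:
                continue
            candidate = sorted(selected + [index])
            summary = "".join(sentences[i] for i in candidate)
            source = "".join(s for i, s in enumerate(sentences) if i not in candidate)
            score = lcs_length(summary, source)
            if score > best_score:
                best_index, best_score = index, score
        selected.append(best_index)
    selected.sort()
    summary = "".join(sentences[i] for i in selected)
    source = "".join(s for i, s in enumerate(sentences) if i not in selected)
    return source, summary


# Placeholder "sentences" chosen so the overlap is easy to check by hand.
toy_sentences = ["abc", "xyz", "abd", "qrs"]
original_text, pseudo_summary = build_pseudo_pair(toy_sentences)
print(original_text, pseudo_summary)
```

With four sentences, one is selected; "abc" shares the subsequence "ab" with the remaining text, so it becomes the summary and the other three sentences form the original text.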
[0052] As shown in Figure 2, take "Article 35 of the Company Rules and Regulations" as an example, which states: "Strictly abide by the company's attendance system, arrive and leave work on time, and do not have someone else take attendance for you. If someone is found to have taken attendance for you, a fine of 20 yuan will be imposed for each instance." Segmenting this text with jieba yields: "Article 35 / strictly / abide / company / attendance / system / , / arrive / leave / work / on time / , / do not / take / attendance / for you / , / find / those / who / take / attendance / for you / , / one / penalty / 20 / yuan / ." Since "Article 35" is not in the dictionary, sentencepiece is used for further segmentation, yielding: "Article 35 / strictly / abide / company / attendance / system / , / arrive / leave / work / on time / , / do not / take / attendance / for you / , / find / those / who / take / attendance / for you / , / one / penalty / 20 / yuan / ." The segmented text is encoded using the mT5 built-in dictionary to obtain the token encoding, which is combined with the position encoding and segmentation encoding. After passing through multiple encoder and decoder layers, the article gist "Attendance Management" is output.
[0053] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.
[0054] The above description of the disclosed embodiments enables those skilled in the art to make or use the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the invention is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims
1. A method for generating article gists based on mT5, characterized in that the specific steps are: information collection: obtaining the content of laws and regulations; text structure decomposition: performing structured parsing of the laws and regulations; gist generation: using an mT5 gist extraction model to generate gists for the parsed laws and regulations; wherein the steps of gist generation are: performing text token encoding on the input legal provision; combining the text token encoding, position encoding, and segmentation encoding, and sending them to the mT5 encoder to generate an encoded file; sending the encoded file to the mT5 decoder to output the gist; wherein the steps for obtaining the text token encoding are: using jieba to segment the input legal provision into words; determining whether each word exists in the mT5 dictionary, and if so, encoding it with the mT5 dictionary; if not, using sentencepiece to split the word into characters, and then encoding the split characters with the mT5 dictionary; arranging the encodings of the words in order to obtain the text token encoding; wherein the text structure decomposition includes overall content decomposition and article decomposition; the overall content decomposition extracts the name, body text, publication date, and level of legal force; and the article decomposition breaks the content of the law down into a hierarchy of parts, chapters, sections, articles, clauses, items, and sub-items.
2. The method of claim 1, wherein, if any character obtained after splitting is not in the mT5 dictionary, it is handled as UNK.
3. The method of claim 1, wherein the steps for obtaining the encoded file are: Step 321, obtaining the attention of the words in the input text to other words through an attention mechanism; Step 322, passing the output of the attention mechanism through a feedforward network and a linear layer; Step 323, combining the output of step 322 with the original encoding vector to obtain the output text; Step 324, repeating steps 321-323 until all input text has been encoded.