A format preservation method in a document translation process, and a storage medium

By parsing and merging the XML document object model of docx documents, special formats are identified and restored, solving the problem of format loss in existing translation tools and realizing an efficient format preservation method suitable for multilingual document translation.

CN122242441A · Pending · Publication Date: 2026-06-19 · 北京领初医药科技有限公司


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: 北京领初医药科技有限公司
Filing Date: 2026-03-18
Publication Date: 2026-06-19

Application Information

Patent Timeline

18 Mar 2026: Application
19 Jun 2026: Publication (CN122242441A)

IPC: G06F40/103; G06F40/111; G06F40/205; G06F40/14; G06F40/40

AI Tagging

Application Domain

Natural language translation; Text processing


AI Technical Summary

Technical Problem

Existing translation tools often lose formatting information when processing docx documents, such as headings, spaces, paragraph formatting, subscripts in mathematical formulas, table continuity across pages, reference markers and cross-reference hyperlinks generated by EndNote, etc. They cannot effectively parse the XML hierarchical structure of docx documents, resulting in inconsistent formatting after translation.

Method used

The method uses the python-lxml library to parse the word/document.xml file and generate an XML Document Object Model (DOM), collects and merges paragraph text, and then performs special-format translation and format supplementation, covering figure and table references, subscripts and superscripts, and table of contents hyperlinks. With paragraph nodes (<w:p>) as the basic processing unit, it recursively traverses the <w:p> and <w:r> nodes to identify and restore special formats.

Benefits of technology

It achieves accurate reconstruction of formatting at four levels: paragraph, sentence, word, and character. It improves the retention rate of table borders, subscripts and superscripts, and bibliographic citations. It can preserve almost all original text formatting and is suitable for translation between Chinese, English, Japanese, and Russian.



Patent Text Reader

Abstract

This invention provides a format preservation method in a document translation process and a storage medium, relating to the field of information processing technology. The method includes: S1, parsing the read word/document.xml file to generate an XML document object model; S2, collecting and merging paragraph text based on the generated XML document object model and, at the same time, collecting the relationships between special formats; S3, overall translation: translating paragraph text using the paragraph tree in the XML file as the translation unit; secondary translation: translating special formats, including figure and table references, subscripts and superscripts, and table of contents hyperlinks; S4, format supplementation: supplementing the reference citation format to restore the original document format. This invention creates a four-level combined processing system of paragraphs, sentences, words, and characters, enabling accurate reconstruction of the original format elements at all four levels.


Description

Technical Field

[0001] This invention relates to the field of information processing technology, specifically to a format preservation method in a document translation process and a storage medium.

Background Technology

[0002] Currently, conventional translation tools (such as Trados and MemoQ) only retain the basic paragraph structure when processing docx documents and cannot correctly handle the original formatting in the source text. For example, some common formatting issues may occur:

[0003] 1. Document format: After translating the entire document, only the translated text is usually retained, but formatting elements such as headings, spaces, and paragraph breaks are often lost.

[0004] 2. Subscripts and superscripts in mathematical formulas: for example, H₂O is rendered as H2O after translation, with the subscript converted to normal-size text.

[0005] 3. Continuous formatting of tables spanning multiple pages: for some tables that span two pages, translation splits the parts on the two pages into two independent tables with different widths, so the table formatting is lost.

[0006] 4. Bibliographic citation markers generated by EndNote: citations are marked in superscript, for example "Results[1-3]" with "[1-3]" in superscript. The translation might be "Results 1-3", meaning the superscript is changed to normal text and the citation marks are lost.

[0007] 5. Cross-references: for example, the phrase "as shown in Figure 1" appears in the text. In the original, "Figure 1" carries a field variable, that is, a hyperlink; clicking "Figure 1" jumps automatically to the location of the corresponding figure. The hyperlink is lost after translation.

[0008] In addition, there are some technical bottlenecks. For example, existing solutions are mostly based on regular expression matching, which cannot parse the XML hierarchical structure of docx files, leading to the following problems: 1. Format control tags such as <w:vertAlign w:val="superscript"/> are ignored; 2. The association between field characters such as <w:fldChar w:fldCharType="begin"/> is broken; 3. Parsing methods based on paragraphs and characters cannot effectively handle complex nested structures, resulting in inconsistent formatting in the translated documents; 4. For documents containing references, images, and tables, conventional software often cannot dynamically update the citation format, leading to inconsistent or incorrect citation information, or to the loss of the original hyperlink functionality.

[0009] Therefore, although existing technologies have solved the problem of preserving Word document formatting to some extent, they still have many limitations. There is an urgent need for a format preservation method in the document translation process, and a storage medium, that improve translation quality and efficiency.

Summary of the Invention

[0010] The purpose of this invention is to provide a format preservation method in a document translation process and a storage medium, aiming to solve the technical problem of format loss after complex documents are translated with existing translation methods.

[0011] Terminology Explanation: Unless otherwise defined, all technical terms used herein have the same meanings as commonly understood by one of ordinary skill in the art. Unless otherwise stated, all patents, patent inventions, and disclosures cited throughout this document are incorporated herein by reference in their entirety. If multiple definitions exist for terms herein, the definitions provided in this chapter shall prevail.

[0012] It should be understood that the above brief description and the following detailed description are exemplary and for illustrative purposes only, and do not limit the subject matter of the invention in any way. In this invention, the singular includes the plural unless otherwise specifically stated. It should also be noted that, unless otherwise stated, the use of “or” means “and/or”. Furthermore, the use of the term “comprising” and other forms such as “including,” “containing,” and “contains” is not limiting.

[0013] To achieve the above objectives, the technical solution adopted by the present invention is as follows. Firstly, a format preservation method in a document translation process includes:
S1. File parsing: the python-lxml library is used to parse the read word/document.xml file, and the result of the parsing is an XML Document Object Model (DOM).
S2. Paragraph processing and format association: based on the parsed XML Document Object Model (DOM), paragraph text is collected and merged; at the same time, special format relationships are collected, including figure and table references, superscripts, subscripts, and reference citations.
S3. Document translation and special format translation:
S31. Overall translation: paragraph text is translated using the paragraph tree in the XML file as the translation unit.
S32. Secondary translation: special formats are translated, including figure and table references, subscripts and superscripts, and table of contents hyperlinks.
S4. Format supplementation: the reference citation format is supplemented and the document format of the citations is restored.

[0014] Based on further solutions to the technical problems of the present invention, or simultaneous solutions to multiple technical problems, the preferred solutions provided by the present invention include: Furthermore, before file parsing, the following step is also included: File retrieval: the word/document.xml file is read by unzipping the docx document package.

[0015] Furthermore, the Python zipfile library is used to decompress the docx document, and then the contents of the word/document.xml file are read; the content read is in XML document format and contains the document's text and formatting information.

[0016] Furthermore, in S1, the word/document.xml file is parsed using the xml.etree.ElementTree library or another XML parsing library; the parsing process includes: parsing each <w:p> paragraph tree and determining each <w:r> minimal format unit tree; then parsing each <w:r> minimal format unit tree and determining its type from the tag name of its last subtree; and reading the style values from the content of the <w:rPr> node tree.

[0017] Furthermore, in S2, the process of collecting and merging paragraph text is as follows: a <w:p> element represents a paragraph and a <w:t> element contains the specific text content; the text within the <w:t> elements is extracted and, based on the hierarchical relationship of the <w:p> elements, grouped into paragraphs.

[0018] Furthermore, with the paragraph node <w:p> as the basic processing unit, the <w:p> and <w:r> nodes are traversed recursively, where <w:r> is treated as a minimal format unit tree; the text content of the <w:r> nodes is aggregated to obtain the paragraph text as it appears in Word, which is then used as the source text for translation.

[0019] Furthermore, <w:r> and <w:t> are in a parent-child containment relationship, and the two work together to define the text and formatting in the document.

[0020] Furthermore, the specific method of S31 includes: based on the XML framework of the original text and taking the <w:p> paragraph tree as the unit, removing the text content of the <w:t> nodes in the original text's XML framework while keeping the formatting in each <w:r> node; then translating the paragraph text merged in S2 and placing the translation into the first <w:r> node of the <w:p> paragraph tree.

[0021] Furthermore, before the overall translation is performed, the content of the <w:t> nodes holding reference citation marks in the original text's XML framework is skipped, and the translation is then filled in.

[0022] Furthermore, the specific method of S32 includes: Translation of figure and table references with hyperlinks: after the paragraph text is translated, traversing the first <w:r> node of each paragraph, identifying and sorting the content whose figure and table reference format needs to be restored, and, based on the relationships between special formats, inserting the corresponding special format into that content to form a new <w:r> node with special formatting. Translation of text with superscript and subscript format: after the paragraph text is translated, traversing the first <w:r> node of each paragraph, identifying and sorting the content whose subscript or superscript format needs to be restored, and, based on the relationships between special formats, inserting the corresponding special format into that content to form a new <w:r> node with special formatting.

[0023] Furthermore, in S4, the format restoration methods include: uniformly placing the text of the cited references at the end of the sentence.

[0024] Furthermore, in S4, the format restoration method also includes: based on the phrase preceding the reference, finding the translated phrase, and then restoring the original position.

[0025] Furthermore, the overall translation in S3 specifically includes the conversion between Chinese, English, Japanese, and Russian.

[0026] Secondly, the present invention also provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the above-described format preservation method in the document translation process.

[0027] Compared with the prior art, the present invention has the following beneficial effects: This invention creates a four-level combined processing system for paragraphs, sentences, words, and characters, enabling precise reconstruction of the original format elements at all four levels. Specifically, at the paragraph level, only the text storage nodes are modified, without altering the paragraph-level XML structure, and the internal formatting of the paragraph is restored to the required granularity. At the word level, mathematical symbols with subscripts and superscripts indicating references and cross-reference fields are uniquely escaped, restoring their accurate position in the translation, further translating and replacing local text without affecting the overall semantic translation. At the character level, the precise replacement algorithm preserves subscripts and superscripts.

[0028] When the format preservation method of this invention is used to translate documents, for example from Chinese to English, compared with the prior art the retention rate of table borders increases from 40% to 98%, the retention rate of subscripts and superscripts increases from 12% to 100%, and the retention rate of citation relationships increases from 35% to 99%. Thus, this invention can preserve almost all of the original text formatting.

[0029] Furthermore, the format preservation method of the present invention is not limited to a particular language and can be applied to conversion between multiple languages, such as between Chinese, English, Japanese, and Russian, making it more convenient for users.

Attached Figure Description

[0030] Figure 1 is a flowchart of the format preservation method of the present invention.

Detailed Implementation

[0031] The technical solution of the present invention will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art without creative effort are within the protection scope of the present invention.

[0032] The following description of exemplary embodiments is merely illustrative and is not intended to limit the invention or its application or use in any way. Techniques, methods, and apparatus known to those skilled in the art may not be discussed in detail herein, but where applicable, such techniques, methods, and apparatus should be considered part of this specification.

[0033] Example 1. This embodiment provides a format preservation method in a document translation process which, as shown in Figure 1, includes: S1. File parsing: First, the file is retrieved: the word/document.xml file is read by unzipping the docx document package. In this embodiment, Python's zipfile library is used to unzip the docx document, and then the contents of the word/document.xml file are read; the content read is in XML document format and contains the document's text and formatting information.

[0034] Then, file parsing is performed: the python-lxml library is used to parse the read word/document.xml file, and the result is an XML Document Object Model (DOM); in this embodiment, the xml.etree.ElementTree library or another XML parsing library may be used to parse the word/document.xml file. The XML tree must be traversed to find the elements containing the file content, for example <w:p> (paragraph) and <w:t> (text).

[0035] The following is a partial content of the XML file:

<w:body>
  <w:p w:rsidR="00C752D3" w:rsidRDefault="00C752D3" w:rsidP="00C752D3">
    <w:pPr>
      <w:spacing w:before="120" w:after="120"/>
      <w:rPr>
        <w:rFonts w:hint="eastAsia"/>
      </w:rPr>
    </w:pPr>
    <w:bookmarkStart w:id="0" w:name="_GoBack"/>
    <w:bookmarkEnd w:id="0"/>
  </w:p>
  <w:tbl>
    <w:tblPr>
      <w:tblW w:w="5000" w:type="pct"/>
      <w:tblBorders>
        <w:top w:val="single" w:sz="4" w:space="0" w:color="auto"/>
        <w:left w:val="single" w:sz="4" w:space="0" w:color="auto"/>
        <w:bottom w:val="single" w:sz="4" w:space="0" w:color="auto"/>
        <w:right w:val="single" w:sz="4" w:space="0" w:color="auto"/>

A <w:p> element represents a paragraph tree and serves as the parent node; <w:r> represents the minimal format unit tree and serves as a child node. A paragraph is divided into different minimal format unit trees according to its different formats, so each <w:p> paragraph tree contains multiple <w:r> child nodes. The <w:t> element contains the specific text content and acts as a grandchild node.

[0036] The parsing order is as follows: each <w:p> paragraph tree is parsed to determine each <w:r> minimal format unit tree; then each <w:r> minimal format unit tree is parsed and its type is determined from the tag name of its last subtree. Specifically, <w:t> indicates text, <w:tab> indicates a tab (whitespace) character, and <w:instrText> indicates a field instruction. There are other types as well, including but not limited to: field start, separator, and end, page break, graphic, line break, non-breaking line break, and comment.

[0037] The style values are read from the content of the <w:rPr> node tree, including <w:vertAlign> (superscript or subscript), color, italics, bold, quotation styles, and so on. For <w:vertAlign>, the value of the w:val attribute of the node is also read.
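The file retrieval step can be sketched with the standard zipfile module named above. This is a minimal illustration, not the applicant's code: the helper names `read_document_xml` and `build_demo_docx` are hypothetical, and the demo package holds only word/document.xml, so it is not a complete docx that Word would open.

```python
import io
import zipfile


def read_document_xml(docx_source):
    """Return the raw bytes of word/document.xml from a .docx package.

    docx_source may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_source) as package:
        return package.read("word/document.xml")


def build_demo_docx(document_xml: bytes) -> io.BytesIO:
    """Build an in-memory zip holding only word/document.xml (demo only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


demo_xml = (
    b'<w:document xmlns:w="http://schemas.openxmlformats.org/'
    b'wordprocessingml/2006/main"><w:body/></w:document>'
)
xml_bytes = read_document_xml(build_demo_docx(demo_xml))
```

With a real file, `read_document_xml("report.docx")` would return the same kind of byte string, ready to hand to an XML parser.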

[0038] In summary, we are now able to read the type, style, and content of each minimum format unit tree for each paragraph.
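A rough sketch of the parsing order in S1, assuming xml.etree.ElementTree as the parser: each run is typed by the tag name of its last child, and its style values are read from <w:rPr>. The function `describe_run` and the sample fragment are illustrative assumptions, not taken from the patent.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical sample: "H2O" with a subscript 2, followed by a tab run.
SAMPLE = f"""<w:document xmlns:w="{W}"><w:body><w:p>
<w:r><w:t>H</w:t></w:r>
<w:r><w:rPr><w:vertAlign w:val="subscript"/></w:rPr><w:t>2</w:t></w:r>
<w:r><w:t>O</w:t></w:r>
<w:r><w:tab/></w:r>
</w:p></w:body></w:document>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def describe_run(run):
    """Return (type, style, text) for one <w:r> minimal format unit."""
    children = list(run)
    # The run type is taken from the tag name of the last child element.
    kind = local_name(children[-1].tag) if children else None
    style = {}
    rpr = run.find("w:rPr", NS)
    if rpr is not None:
        for prop in rpr:
            # Toggle properties such as <w:b/> have no w:val; they map to None.
            style[local_name(prop.tag)] = prop.get(f"{{{W}}}val")
    text = "".join(t.text or "" for t in run.findall("w:t", NS))
    return kind, style, text


root = ET.fromstring(SAMPLE)
runs = [
    describe_run(run)
    for paragraph in root.iter(f"{{{W}}}p")
    for run in paragraph.findall("w:r", NS)
]
```

Here `runs[1]` carries the subscript style that a plain text extraction would discard.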

[0039] S2. Paragraph processing and format association: based on the parsed XML Document Object Model (DOM), paragraph text is collected and merged; at the same time, special format relationships are collected, including figure and table references, superscripts, subscripts, and reference citations. The process of collecting and merging paragraph text is as follows: the text within the <w:t> elements is extracted and, based on the hierarchical relationship of the <w:p> elements, combined into paragraphs. The resulting text retains the boundaries between paragraphs, so it can be translated paragraph by paragraph, which avoids losing the formatting of paragraphs that span pages and also preserves paragraph formatting.

[0040] After the paragraph text is merged, with the paragraph node <w:p> as the basic processing unit, the <w:p> (paragraph) and <w:r> (run) nodes are traversed recursively, where the <w:r> (run) node is treated as the minimal format unit tree. Aggregating the text content of the <w:r> (run) nodes yields the paragraph text that is visible in Word, which is used as the source text for translation.

[0041] The relationship between the <w:r> (run) node and the <w:t> element is as follows: in the Office Open XML (OOXML) format, <w:r> (run, a block of text) and <w:t> (text, the text content) are in a parent-child containment relationship, and the two work together to define the text and its format in the document.

[0042] This can be understood as follows: <w:r> is a container that encloses a segment of text with uniform formatting together with its attributes; that is, it is defined as a minimal format unit tree. Take "All Participants: All participants who had signed informed consent form were included in this analysis set" as an example. This sentence has at least two formats, a bold format (for "All Participants:") and a non-bold format (for "All participants who had signed informed consent form were included in this analysis set"), and therefore at least two <w:r> nodes. <w:t> is the content carrier; it must be nested within a <w:r> and stores the actual string to be displayed. A paragraph can include multiple <w:r> nodes, but one <w:r> normally contains only one <w:t>.

[0043] While each minimal format unit tree is collected, the special formats within each <w:r> node are collected at the same time, forming a relationship between the specially formatted source text and its special format, so that the format can be restored later.

[0044] S3. Text translation and special format translation: S31. Overall translation: First, the paragraph text is translated. Specifically, the translation is inserted into the overall XML framework of the original text; that is, based on the XML framework of the original text and taking the <w:p> paragraph tree as the unit, the text content of the <w:t> nodes in the original text's XML framework is removed while the formatting in each <w:r> node is kept; the paragraph text merged in S2 is then translated, and the translation is placed into the first <w:r> node of the <w:p> paragraph tree. This ensures the semantic continuity of the entire paragraph; it also avoids deleting other functions carried by the nodes, such as annotations.

[0045] Since titles, tables of contents, tables, formulas, headers, and footers in a document are usually expressed as different paragraphs, they fall into different <w:p> paragraph trees. Translating according to the underlying logic of the XML preserves the paragraph formatting of the original text, including headings, tables of contents, tables, formulas, headers, and footers. This avoids most formatting loss in the translation, such as missing table borders, changed heading fonts, and untranslated headers and footers.

[0046] The initial translation may change the formatting of some <w:r> nodes, and the changed parts are then adjusted. For most documents, the parts with changed formatting are usually those with special formats, such as figure and table references, subscripts and superscripts, and table of contents hyperlinks. Their format needs to be restored later.

[0047] Before the entire paragraph is translated, the citation marks of the references need to be processed so that they do not affect the semantics of the original text. References are usually marked with numbers, and their insertion position is relatively arbitrary, which often results in awkward phrasing after translation and can even affect the translation result. Therefore, before the entire paragraph is translated, the content of the <w:t> nodes holding citation marks in the original text's XML framework is skipped, so that the references do not affect the semantic translation of the original text.
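The aggregation described above can be sketched as follows, again assuming xml.etree.ElementTree. The function name `paragraph_texts` and the sample are hypothetical; the sample reuses the bold and non-bold example sentence from this description, split across two runs.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

SAMPLE = f"""<w:document xmlns:w="{W}"><w:body>
<w:p>
<w:r><w:rPr><w:b/></w:rPr><w:t>All Participants:</w:t></w:r>
<w:r><w:t xml:space="preserve"> All participants who had signed informed consent form were included in this analysis set</w:t></w:r>
</w:p>
<w:p><w:r><w:t>Second paragraph.</w:t></w:r></w:p>
</w:body></w:document>"""


def paragraph_texts(root):
    """Merge the <w:t> text of every run, one string per <w:p>."""
    paragraphs = []
    for paragraph in root.iter(f"{{{W}}}p"):
        parts = []
        for run in paragraph.iter(f"{{{W}}}r"):
            for text_node in run.findall(f"{{{W}}}t"):
                parts.append(text_node.text or "")
        paragraphs.append("".join(parts))
    return paragraphs


merged = paragraph_texts(ET.fromstring(SAMPLE))
```

Each entry of `merged` is one coherent source string for translation, even though Word stored it as several runs.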
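A minimal sketch of the S31 placement step, under the simplifying assumption that the target is the first run that holds a <w:t>. The function `place_translation` and the placeholder translation string are hypothetical; no translation engine is called.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

SAMPLE = f"""<w:p xmlns:w="{W}">
<w:r><w:rPr><w:b/></w:rPr><w:t>All Participants:</w:t></w:r>
<w:r><w:t xml:space="preserve"> All participants were included.</w:t></w:r>
</w:p>"""


def place_translation(paragraph, translation):
    """Empty every <w:t> in the paragraph, then store the translation in
    the first one. Run nodes and their <w:rPr> formatting stay in place."""
    text_nodes = [
        text_node
        for run in paragraph.iter(f"{{{W}}}r")
        for text_node in run.findall(f"{{{W}}}t")
    ]
    if not text_nodes:
        return False  # nothing to translate (e.g. an empty paragraph)
    for text_node in text_nodes:
        text_node.text = ""
    text_nodes[0].text = translation
    return True


paragraph = ET.fromstring(SAMPLE)
translation = "[translated paragraph text]"  # placeholder, not real output
placed = place_translation(paragraph, translation)
runs_after = paragraph.findall(f"{{{W}}}r")
```

Only text is moved; the number of runs and their properties are unchanged, which is why paragraph-level structure survives this step.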

[0048] Similarly, reference citation marks generated by EndNote are treated the same way as other reference citation marks and are skipped before the entire paragraph text is translated.

[0049] S32. Secondary translation: Then, special formats are translated, including figure and table references, subscripts and superscripts, and table of contents hyperlinks.

[0050] For example, the translation steps for figure and table references with hyperlinks include: after the paragraph text is translated, traversing the first <w:r> node of each paragraph, identifying and sorting the content whose figure and table reference format needs to be restored, and, based on the special format relationships collected in S2, inserting the corresponding special format into that content to form a new <w:r> node with special formatting. The translation steps for text with superscript and subscript format include: after the paragraph text is translated, traversing the first <w:r> node of each paragraph, identifying and sorting the content whose subscript or superscript format needs to be restored, and, based on the relationships between special formats, inserting the corresponding special format into that content to form a new <w:r> node with special formatting.

[0051] All special formats to be restored, including figure and table references, subscripts and superscripts, and table of contents hyperlinks, are merged into a single list and restored in sorted order.

[0052] S4. Format restoration: based on the translation results, plain text is replaced with specially formatted text to restore the document format of the reference citations.
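One way to skip citation marks while collecting source text is sketched below. The detection rule is an assumption made for illustration: a run counts as a citation mark when it is superscript and its text looks like a bracketed number list such as [1-3]. The names `is_citation_run` and `split_citations` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Assumed marker shape: [1], [1-3], [1,2]; real documents may differ.
CITATION = re.compile(r"^\[\d+(?:\s*[-,]\s*\d+)*\]$")

SAMPLE = f"""<w:p xmlns:w="{W}">
<w:r><w:t>Results</w:t></w:r>
<w:r><w:rPr><w:vertAlign w:val="superscript"/></w:rPr><w:t>[1-3]</w:t></w:r>
<w:r><w:t xml:space="preserve"> were consistent.</w:t></w:r>
</w:p>"""


def w(tag):
    """Qualify a local tag or attribute name with the w: namespace."""
    return f"{{{W}}}{tag}"


def run_text(run):
    return "".join(t.text or "" for t in run.findall(w("t")))


def is_citation_run(run):
    vert = run.find(f"{w('rPr')}/{w('vertAlign')}")
    return (
        vert is not None
        and vert.get(w("val")) == "superscript"
        and bool(CITATION.match(run_text(run).strip()))
    )


def split_citations(paragraph):
    """Return (text without citation marks, [(preceding text, marker)])."""
    kept, citations = [], []
    for run in paragraph.iter(w("r")):
        text = run_text(run)
        if is_citation_run(run):
            # Remember what came before the marker so it can be re-placed.
            citations.append(("".join(kept), text))
        else:
            kept.append(text)
    return "".join(kept), citations


source_text, citations = split_citations(ET.fromstring(SAMPLE))
```

The translator then sees "Results were consistent." without the marker, and the recorded preceding text gives a later step something to search for.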
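Restoring a subscript inside the translated run can be sketched as a local split of that run into three: text before, the specially formatted text, and text after. The helpers `make_run` and `restore_vert_align` are hypothetical, and locating the span by searching for "H2" is a stand-in for the special format relationships collected in S2.

```python
import copy
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def w(tag):
    return f"{{{W}}}{tag}"


def make_run(template_run, text, vert_align=None):
    """Build a new <w:r> that copies the template's <w:rPr>."""
    run = ET.Element(w("r"))
    rpr = template_run.find(w("rPr"))
    rpr = copy.deepcopy(rpr) if rpr is not None else None
    if vert_align:
        if rpr is None:
            rpr = ET.Element(w("rPr"))
        for old in rpr.findall(w("vertAlign")):
            rpr.remove(old)  # avoid duplicate vertAlign entries
        ET.SubElement(rpr, w("vertAlign")).set(w("val"), vert_align)
    if rpr is not None:
        run.append(rpr)
    text_node = ET.SubElement(run, w("t"))
    text_node.text = text
    text_node.set(XML_SPACE, "preserve")  # keep leading/trailing spaces
    return run


def restore_vert_align(paragraph, run, start, end, vert_align):
    """Replace `run` with up to three runs; text[start:end] gets vertAlign."""
    text = run.find(w("t")).text or ""
    pieces = [(text[:start], None), (text[start:end], vert_align), (text[end:], None)]
    new_runs = [make_run(run, piece, va) for piece, va in pieces if piece]
    index = list(paragraph).index(run)
    paragraph.remove(run)
    for offset, new_run in enumerate(new_runs):
        paragraph.insert(index + offset, new_run)
    return new_runs


paragraph = ET.fromstring(
    f'<w:p xmlns:w="{W}"><w:r><w:rPr><w:b/></w:rPr>'
    f"<w:t>Water is H2O.</w:t></w:r></w:p>"
)
run = paragraph.find(w("r"))
start = run.find(w("t")).text.find("H2") + 1  # position of the "2"
new_runs = restore_vert_align(paragraph, run, start, start + 1, "subscript")
texts = [r.find(w("t")).text for r in paragraph.findall(w("r"))]
```

Processing several restorations from the end of the text backward would keep earlier offsets valid; that ordering is a design choice of this sketch, not something the description specifies.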

[0053] In this embodiment, reference citation text is usually marked in superscript and is not included in the translation of the original text, because it carries no meaning for semantic understanding. Therefore, the replacement and restoration methods may include: 1. Always placing the citation text at the end of the sentence; 2. Recording the phrase preceding the reference, finding the translated phrase, and then restoring the citation to its original position.

[0054] In this embodiment, the format preservation method works on the original Word document itself, which is fundamentally different from other processing methods: the invention identifies the minimal format unit trees of the original text, that is, the <w:r> nodes, and extracts the visible text of the Word document from them, avoiding disruption of the original document structure. Office software, for its own purposes, often splits a single sentence across multiple <w:r> nodes, whereas translation requires coherent sentences. Therefore, to ensure semantic coherence, the method collects and merges paragraph text while maintaining the original document structure. Text is simply moved from one node to another in the DOM tree, without disrupting the tree's overall structure. This process preserves paragraph-level formatting and identifies the relevant nodes automatically.

[0055] The specific principle is: traverse the <w:p> tree, find the <w:r> nodes, and read their formatting definitions, specifically including vertical alignment, superscript, subscript, color, bold, italics, font size, and font. Nodes with special formatting definitions are usually excluded, because most text in a normal paragraph has no special formatting definition. After nodes with special formatting definitions are excluded, nodes within a field are also excluded, and the first of the remaining <w:r> nodes is used.
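The two restoration strategies can be illustrated at the string level. This is only a sketch: in the method itself the marker would go back as a superscript run, and placing it before the final punctuation mark is an assumption of this example. The function names and the translated phrase passed in are hypothetical.

```python
SENTENCE_END = ".!?。！？"


def insert_at_sentence_end(translation, marker):
    """Strategy 1: put the citation marker at the end of the sentence,
    just before the closing punctuation mark if there is one."""
    stripped = translation.rstrip()
    if stripped and stripped[-1] in SENTENCE_END:
        return stripped[:-1] + marker + stripped[-1]
    return stripped + marker


def insert_after_phrase(translation, translated_phrase, marker):
    """Strategy 2: put the marker right after the translated phrase that
    preceded it in the source; fall back to strategy 1 if not found."""
    position = translation.find(translated_phrase)
    if position < 0:
        return insert_at_sentence_end(translation, marker)
    cut = position + len(translated_phrase)
    return translation[:cut] + marker + translation[cut:]


at_end = insert_at_sentence_end("The results were consistent.", "[1-3]")
after_phrase = insert_after_phrase("The results were consistent.", "results", "[1-3]")
fallback = insert_after_phrase("The results were consistent.", "findings", "[1-3]")
```

Strategy 2 depends on the preceding phrase being translated the same way inside the full sentence, so the fallback to strategy 1 covers the case where it is not.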
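The selection rule above can be sketched as a single pass over the runs of a paragraph. Field membership is tracked with the begin and end markers of <w:fldChar>; the set of properties treated as special follows the list in the paragraph and is an assumption that would need tuning on real documents (for instance, <w:rFonts> appears on many ordinary runs). The name `pick_target_run` is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Assumed mapping of "special formatting" to run property tag names.
SPECIAL = {"vertAlign", "color", "b", "i", "sz", "rFonts"}

SAMPLE = f"""<w:p xmlns:w="{W}">
<w:r><w:rPr><w:b/></w:rPr><w:t>Note:</w:t></w:r>
<w:r><w:fldChar w:fldCharType="begin"/></w:r>
<w:r><w:instrText> REF _Ref1 </w:instrText></w:r>
<w:r><w:fldChar w:fldCharType="separate"/></w:r>
<w:r><w:t>Figure 1</w:t></w:r>
<w:r><w:fldChar w:fldCharType="end"/></w:r>
<w:r><w:t xml:space="preserve"> shows the result.</w:t></w:r>
</w:p>"""


def w(tag):
    return f"{{{W}}}{tag}"


def pick_target_run(paragraph):
    """Return the first text run that has no special formatting and does
    not sit inside a field, or None if there is no such run."""
    depth = 0  # nesting depth of open fields
    for run in paragraph.iter(w("r")):
        marker = run.find(w("fldChar"))
        if marker is not None:
            kind = marker.get(w("fldCharType"))
            if kind == "begin":
                depth += 1
            elif kind == "end":
                depth -= 1
            continue  # the marker run itself never holds the translation
        if depth > 0 or run.find(w("t")) is None:
            continue
        rpr = run.find(w("rPr"))
        if rpr is not None and any(
            child.tag.rsplit("}", 1)[-1] in SPECIAL for child in rpr
        ):
            continue
        return run
    return None


target = pick_target_run(ET.fromstring(SAMPLE))
target_text = target.find(w("t")).text
```

In the sample, the bold label and the cross-reference result "Figure 1" are both passed over, so the translation would land in the plain run that follows the field.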

[0056] It should be noted that the DOM tree (Document Object Model tree) is a structured representation of a document (such as HTML or XML). It transforms the document into a tree structure composed of nodes and objects, so that programs or scripts (such as JavaScript) can dynamically access and manipulate the document's content, structure, and style. XML documents can be processed in multiple programming languages (such as JavaScript and Python), and the DOM tree allows content, attributes, styles, and other properties to be accessed and modified.

[0057] The DOM tree presents document content in a hierarchical structure, with each node representing a part of the document: 1. Root node: usually the outermost tag, located at the top level of the tree; 2. Child nodes: elements (such as <div>), attributes, text, comments, and so on; 3. Parent-child relationship: an element is the parent node of the elements nested directly inside it; 4. Sibling relationship: child nodes of the same parent node are sibling nodes.

[0058] The DOM tree contains various node types: 1. Element node: corresponds to a tag; 2. Attribute node: an attribute of an element (e.g., class="example"); 3. Text node: the plain text content within an element; 4. Comment node: a comment in the markup; 5. Document node: the root node of the entire document (the document object).

[0059] Restoring the formatting of text within a paragraph mainly covers cross-references to figures and tables, superscripts, subscripts, and the like. The algorithm traverses the DOM tree to identify the special-format DOM subtrees of each paragraph and their visible text. In the translation result, it finds the translated text corresponding to each special-format DOM subtree and performs a local DOM tree reconstruction to restore the formatting.

[0060] An example based on the aforementioned preservation method is as follows: by traversing the word/document.xml file and using <w:p> paragraph nodes as the processing unit for translation, together with the <w:r> node-based text aggregation algorithm, the method automatically preserves 100% of the original formatting elements at the paragraph level, achieving zero loss of paragraph-level formatting attributes. It therefore has unique advantages in preserving complex tables, images, and mathematical formulas.

[0061] In terms of translation effect, because <w:p> paragraph nodes are the processing units and a paragraph of the main text is typically located within a single <w:p> node, the complete semantics are preserved, which is crucial for translation and allows for the highest possible translation efficiency.

[0062] For preserving formatting within paragraphs, the algorithm can identify mathematical symbols, superscripts, and subscripts. These parts are returned as is after translation, without being converted into other text. The algorithm then restores the formatting, reusing the XML node tree that was read for the replacement so as to preserve as much of the original style as possible. For cross-references, such as a figure reference whose translation is "Figure 1", the algorithm likewise reuses the original XML node tree and replaces only its text.

[0063] Example 2 This embodiment provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the format preservation method described above in the document translation process.

[0064] Finally, it should be noted that the above content is only used to illustrate the technical solution of the present invention, and is not intended to limit the scope of protection of the present invention. Simple modifications or equivalent substitutions made by those skilled in the art to the technical solution of the present invention do not depart from the essence and scope of the technical solution of the present invention.

Claims

1. A method for preserving formatting in a document translation process, the method comprising: S1. File parsing: the python-lxml library is used to parse the word/document.xml file that has been read, producing an XML document object model; S2. Paragraph processing and format association: based on the parsed XML document object model, paragraph text is collected and merged, and at the same time the relationships with special formats are collected, the special formats including figure and table references, superscripts, subscripts, and literature citations; S3. Document translation and special format translation: S31. Overall translation: paragraph text is translated using the paragraph tree in the XML file as the translation unit; S32. Secondary translation: content in special formats is translated, including figure and table references, subscripts and superscripts, and table of contents hyperlinks; S4. Format supplement: the literature citation format is supplemented and the document format of the citations is restored.

2. The format preservation method according to claim 1, characterized in that, before file parsing, the method further comprises: File retrieval: the word/document.xml file is read by unzipping the docx document package.

3. The format preservation method according to claim 1, characterized in that, in S1, the word/document.xml file is parsed using the xml.etree.ElementTree library or another XML parsing library; the parsing process includes: parsing each <w:p> paragraph tree and determining each <w:r> minimal format unit tree; then parsing each <w:r> minimal format unit tree, the type of each minimal format unit tree being determined from the tag name of its last subtree; and reading the style value from the content of the <w:rPr> node tree.

4. The format preservation method according to claim 1, characterized in that, in S2, the process of collecting and merging paragraph text is as follows: taking the paragraph node <w:p> as the basic processing unit, the <w:p> and <w:r> nodes are traversed recursively, where <w:r> is treated as a minimal format unit tree; the text content of the <w:r> nodes is aggregated to obtain the paragraph text as it appears in Word, which is then used as the source text for translation.

5. The format preservation method according to claim 4, characterized in that the specific method of S31 includes: based on the original XML framework and taking <w:p> paragraph trees as units, the text content of the <w:t> nodes in the original XML framework is removed while the formatting in each <w:r> node is kept; the merged paragraph text from S2 is then translated, and the translation is placed into the first <w:r> node of the <w:p> paragraph tree.

6. The format preservation method according to claim 5, characterized in that, before the overall translation is performed, the literature citation markers within the original XML framework are skipped, and the <w:t> node content is then filled with the translation.

7. The format preservation method according to claim 5, characterized in that the specific method of S32 includes: Translation of figure and table references with hyperlinks: after the paragraph text has been translated, the first <w:r> node of each paragraph is traversed, the content whose figure or table reference format needs to be restored is identified and sorted, and, based on the relationships between special formats, the corresponding special format is inserted into that content, forming new <w:r> nodes with the special format; Translation of text with superscript and subscript formats: after the paragraph text has been translated, the first <w:r> node of each paragraph is traversed, the content whose superscript or subscript format needs to be restored is identified and sorted, and, based on the relationships between special formats, the corresponding special format is inserted into that content, forming new <w:r> nodes with the special format.

8. The format preservation method according to claim 1, characterized in that, in S4, the format is restored in one of two ways: the literature citation is placed at the end of the sentence; or the translation of the phrase that precedes the citation is located and the citation is restored to its original position after that phrase.

9. The format preservation method according to claim 1, characterized in that the overall translation in S3 specifically includes translation between Chinese, English, Japanese, and Russian.

10. A computer-readable storage medium storing a computer program thereon, characterized in that, when the computer program is executed by a processor, it implements the format preservation method in the document translation process according to any one of claims 1-9.