A method for character recognition and structured segmentation of picture book documents

By parsing and grouping the styles and content of .docx documents, the system automatically identifies and structures the dialogue, solving the problem of data loss after export from online picture book platforms. This converts visual formatting into structured data and improves the automation of recording and track matching.

CN122240842A · Pending · Publication Date: 2026-06-19 · TAIYUAN UNIVERSITY OF TECHNOLOGY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: TAIYUAN UNIVERSITY OF TECHNOLOGY
Filing Date: 2026-04-22
Publication Date: 2026-06-19

Abstract

This invention relates to the field of document information processing technology, specifically to a method for character recognition and structured segmentation of picture book documents. It addresses the technical problem of accurately and robustly inferring a complete character system and dialogue segmentation logic from a set of visual formats that may lack a unified standard. Based on the styles in a .docx file, the method automatically identifies special content, deleted content, and marked content, and organizes scattered text fragments into structured dialogue lines. This provides a clear data foundation for subsequent dialogue processing or read-aloud applications, and allows picture book content from different platforms to be embedded directly into professional software, which facilitates professional audio content recording.

Description

Technical Field

[0001] This invention relates to the field of document information processing technology, specifically to a method for character recognition and structured segmentation of picture book documents.

Background Technology

[0002] Currently, the audio content industry (such as audiobooks and radio drama production) has formed a standardized digital production process. This process begins with "annotating" the original text (usually a novel or script), where editors or directors annotate the text with dialogues, narration, sound effects, etc., for different characters. Mature online annotation platforms such as "Youshengmiao" and "Huaben Yaoji" already offer online annotation functions. The core advantage of these platforms lies in their integrated process: after annotation, voice actors (CVs) can log in to their personal accounts, directly select and read their assigned lines, and the platform can automatically or semi-automatically complete the recording collection and initial alignment, greatly simplifying team collaboration.

[0003] However, this model suffers from a critical bottleneck in quality and flexibility: limited by the browser environment and web audio interfaces, the platform's built-in recording function lags significantly behind professional digital audio workstations (DAWs, such as Adobe Audition, Pro Tools, and Reaper) in sound quality, real-time processing capability, and plugin support. Professional teams pursuing publication-grade sound quality therefore typically adopt a compromise, fragmented workflow whose main steps are: 1) completing picture book annotation and task assignment on a platform such as "Youshengmiao"; 2) exporting the annotation results as a common .docx file; 3) voice actors recording high-quality audio in a professional DAW based on this .docx file; and 4) post-production staff manually aligning the scattered recording files with the document content, then editing and mixing.

[0004] The problem with this workflow is that online picture book platforms (such as Youshengmiao and Picture Book Fairy) use a structured data model that contains rich metadata such as characters, dialogue sequences, and time points. When the content is exported as a .docx file, however, most of this machine-readable structured information with clear business semantics is lost or is flattened into visual formatting (such as specific colors and indentation). To professional audio production tools, the generated .docx file is just "dumb" text that cannot be interpreted automatically. This reduction from "intelligent data" to "formatted text" is the root cause of why all subsequent steps must rely on manual labor.

[0005] Because of this data gap, the professional production process can no longer run automatically, which manifests in the following ways:

[0006] For voice actors (CVs), the lack of intelligent prompting and recording tools based on the current character's lines means they must manually search for their lines throughout the entire document and manually control the start, stop, and marking of recording in professional recording software (DAW). This process is tedious and error-prone, failing to achieve the efficient experience of "what you see is what you record." For post-production staff, the inability to utilize the established line order and character structure from the scriptwriting stage makes fully automatic alignment impossible. Post-production staff must manually align scattered audio files with .docx format text files using a "listen-align-drag" process, which is time-consuming and difficult to guarantee accuracy.

[0007] Essentially, existing technical solutions create a data and process breakpoint between "cloud-based structured annotation" and "local high-quality audio recording production." Online picture book platforms lock in the data and production stages, while professional production tools cannot natively understand the business logic embedded in the .docx format from these platforms, leading to cumbersome subsequent processing. How to accurately and robustly infer a complete character system and dialogue division logic automatically from a set of visual formats that may not have a unified standard is a technical problem in this field that has not yet been effectively solved.

Summary of the Invention

[0008] In order to solve the technical problem of how to accurately and robustly infer a complete character system and dialogue segmentation logic from a set of visual formats that may not have a unified standard, this invention provides a method for character recognition and structured segmentation of picture book documents.

[0009] This invention is achieved using the following technical solution:

[0010] A method for character recognition and structured segmentation of picture book documents includes the following steps:

[0011] 1. Parse the style definition file of the .docx document exported from the online picture book platform, extract each style identifier together with the default color and shading that the style assigns to text fragments, and store them in a style mapping table;

[0012] 2. Parse the main content of the .docx document and obtain all paragraph tags in the document;

[0013] 3. Create an original fragment list;

[0014] 4. Process each paragraph:

[0015] 1) Based on the style identifier used by the current paragraph, look up the corresponding default color and shading in the style mapping table;

[0016] 2) Traverse all fragments of the paragraph and process them one by one, as follows:

[0017] a. Obtain the text content of each fragment in turn and check the fragment's own formatting, that is, whether it carries an individually specified color, shading, or strikethrough. If an individually specified color or shading is present, override the fragment's default color or shading with it, mark the fragment as special content, and generate a format combination identifier composed of that color or shading. If a strikethrough is present, mark the fragment as deleted content.

[0018] b. For each fragment that contains text, store its text, whether it is deleted content, whether it is special content, and its format combination identifier in the original fragment list created in step 3;

[0019] 5. Merge adjacent fragments with the same format: Traverse the original fragment list. If the current fragment and the previous fragment agree on all three attributes (whether it is deleted content, whether it is special content, and the format combination identifier), append the text of the current fragment to the previous fragment, thereby merging consecutive text with identical formatting into a whole.

[0020] 6. Split bracket annotations in special fragments: For each fragment marked as special content, use a regular expression to match each bracketed annotation together with the text immediately following it. Split each match into a new fragment whose text is "bracket content + immediately following text", and set that fragment's format combination identifier to the bracket content itself. Fragments that are not special content remain unchanged.

[0021] 7. Filter out fragments with no actual content: Count the valid characters (Chinese characters, complete English words, numbers) in each fragment. If the count is 0 (i.e., the fragment contains only punctuation, spaces, or parentheses), discard the fragment.

[0022] 8. First-stage grouping:

[0023] Initialize an array of group IDs, assign each fragment an initial group ID equal to its index, then iterate through all fragments and merge them according to fragment type:

[0024] Ordinary fragment: Starting from the current fragment, scan forward and set the group ID of every consecutive following fragment that is neither special nor deleted content to the group ID of the current fragment;

[0025] Special fragment with content: Starting from the current fragment, scan forward and set the group ID of every following fragment that has the same format combination identifier or is unimportant content to the group ID of the current fragment. Unimportant content is content with a strikethrough, or special content that contains only marked content with no unmarked content after it. Marked content is content enclosed in square brackets or parentheses within special content.

[0026] Unimportant fragment: Starting from the current fragment, find the first fragment that is not unimportant, and set the group ID of the current fragment and of all consecutive unimportant fragments in between to the group ID of that fragment; if all following fragments are unimportant, mark the group IDs of these fragments as invalid.

[0027] 9. Define group types: Groups are divided into special groups and ordinary groups. If the last fragment in a group is an ordinary fragment, the group is an ordinary group; otherwise, it is a special group.

[0028] 10. Second-stage grouping: Merge consecutive ordinary groups and set the group ID of the merged group to the group ID of the first ordinary group in that run, so that all adjacent ordinary groups end up with the same group ID;

[0029] 11. Construct dialogue lines from the grouping results: Based on the final group IDs, collect the fragments with the same group ID to construct a "dialogue line". Each line contains a list of fragments and a boolean flag. If the line contains at least one ordinary fragment, it is marked as an ordinary line (narration); otherwise, it is a special line (such as a pure character annotation line).

[0030] 12. Output the list of dialogue lines:

[0031] All constructed dialogue lines are combined into a list of dialogue lines as the final result.

[0032] The beneficial effects of this invention are as follows: Based on the styles in a .docx file, the method described in this invention automatically identifies special content, deleted content, and marked content, and organizes scattered text fragments into structured dialogue lines, providing a clear data foundation for subsequent dialogue processing or read-aloud applications. Furthermore, the method does not attempt to change existing online picture book platforms or professional recording software. Instead, it acts as an intelligent "translator" or "decoder" that bridges the data gap between them and releases the structured value locked within formatted documents. It automatically parses the format styles in a document and identifies and segments the text belonging to different readers. It is particularly suitable for converting electronic picture books and scripts that contain narration and multi-character dialogue into structured data usable for speech synthesis, track matching, recording prompting, and automatic tagging. Picture book content from different platforms can be embedded directly into professional software, which facilitates professional audio content recording.

Description of the Drawings

[0033] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.

[0034] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0035] Figure 1 is a flowchart of the method described in this invention;

[0036] Figure 2 is a schematic diagram of a plug-in developed based on the method described in this invention;

[0037] Figure 3 shows the online picture book content described in Example 1;

[0038] Figure 4 shows the result after processing by the method described in this invention in Example 1;

[0039] Figure 5 shows the online picture book content described in Example 2;

[0040] Figure 6 shows the result after processing by the method described in this invention in Example 2;

[0041] Figure 7 shows the online picture book content described in Example 3;

[0042] Figure 8 shows the result after processing by the method described in this invention in Example 3;

[0043] Figure 9 shows the online picture book content described in Example 4;

[0044] Figure 10 shows the result after processing by the method described in this invention in Example 4;

[0045] Figure 11 shows the online picture book content described in Example 5;

[0046] Figure 12 shows the result after processing by the method described in this invention in Example 5.

Detailed Implementation

[0047] To better understand the above-mentioned objectives, features, and advantages of the present invention, the solutions of the present invention will be further described below. It should be noted that, unless otherwise specified, the embodiments of the present invention and the features thereof can be combined with each other.

[0048] Many specific details are set forth in the following description in order to provide a full understanding of the invention, but the invention may also be practiced in other ways different from those described herein; obviously, the embodiments in the specification are only some embodiments of the invention, and not all embodiments.

[0049] The specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0050] As shown in Figure 1, a method for character recognition and structured segmentation of picture book documents includes the following steps:

[0051] 1. Parse the style definition file of the .docx document exported from the online picture book platform, extract each style identifier together with the default color and shading that the style assigns to text fragments, and store them in a style mapping table;
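As an illustration only, the style mapping table could be built along the following lines. This is a minimal sketch, not the implementation of the invention: the function name `build_style_map`, the dictionary layout, and the sample style names are assumptions. It relies only on the fact that a .docx package stores its style definitions as WordprocessingML, with the color in `w:color/@w:val` and the shading fill in `w:shd/@w:fill`.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by .docx style and document parts.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


def build_style_map(styles_xml: str) -> dict:
    """Map each style identifier to its default color and shading (hypothetical layout)."""
    root = ET.fromstring(styles_xml)
    style_map = {}
    for style in root.findall("w:style", NS):
        style_id = style.get(f"{{{W}}}styleId")
        color = style.find("w:rPr/w:color", NS)
        shd = style.find("w:rPr/w:shd", NS)
        style_map[style_id] = {
            "color": color.get(f"{{{W}}}val") if color is not None else None,
            "shading": shd.get(f"{{{W}}}fill") if shd is not None else None,
        }
    return style_map


# Invented sample standing in for the styles part of an exported document.
SAMPLE_STYLES = f"""<w:styles xmlns:w="{W}">
  <w:style w:type="paragraph" w:styleId="Narration">
    <w:rPr><w:color w:val="000000"/></w:rPr>
  </w:style>
  <w:style w:type="paragraph" w:styleId="RoleA">
    <w:rPr><w:color w:val="FF0000"/><w:shd w:val="clear" w:fill="FFFF00"/></w:rPr>
  </w:style>
</w:styles>"""

style_map = build_style_map(SAMPLE_STYLES)
```

For a real file, the XML string would typically be read from the `word/styles.xml` entry of the .docx archive, for example with `zipfile.ZipFile(path).read("word/styles.xml")`.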

[0052] 2. Parse the main content of the .docx document and obtain all paragraph tags in the document;

[0053] 3. Create an original fragment list;

[0054] 4. Process each paragraph:

[0055] 1) Based on the style identifier used by the current paragraph, look up the corresponding default color and shading in the style mapping table;

[0056] 2) Traverse all fragments of the paragraph and process them one by one, as follows:

[0057] a. Obtain the text content of each fragment in turn and check the fragment's own formatting, that is, whether it carries an individually specified color, shading, or strikethrough. If an individually specified color or shading is present, override the fragment's default color or shading with it, mark the fragment as special content, and generate a format combination identifier composed of that color or shading. If a strikethrough is present, mark the fragment as deleted content.

[0058] b. For each fragment that contains text, store its text, whether it is deleted content, whether it is special content, and its format combination identifier in the original fragment list created in step 3;
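The paragraph and fragment processing described above might look roughly like the sketch below. It is written under stated assumptions: a fragment is taken to be a WordprocessingML run (`w:r`), the `Fragment` class and `extract_fragments` function are hypothetical names, the format combination identifier is encoded as `"color|shading"`, and fragments without individual formatting get no identifier.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}


@dataclass
class Fragment:  # hypothetical record for one entry of the original fragment list
    text: str
    deleted: bool
    special: bool
    fmt_id: Optional[str]


def _attr(elem, name):
    """Read a w:-namespaced attribute, tolerating a missing element."""
    return elem.get(f"{{{W}}}{name}") if elem is not None else None


def extract_fragments(document_xml: str, style_map: dict) -> list:
    fragments = []
    root = ET.fromstring(document_xml)
    for para in root.iter(f"{{{W}}}p"):
        style_id = _attr(para.find("w:pPr/w:pStyle", NS), "val")
        defaults = style_map.get(style_id, {})
        for run in para.findall("w:r", NS):
            text = "".join(t.text or "" for t in run.findall("w:t", NS))
            if not text:
                continue  # only fragments that contain text are stored
            color = _attr(run.find("w:rPr/w:color", NS), "val")
            shading = _attr(run.find("w:rPr/w:shd", NS), "fill")
            strike = run.find("w:rPr/w:strike", NS)
            deleted = strike is not None and _attr(strike, "val") not in ("0", "false")
            special = color is not None or shading is not None
            fmt_id = None
            if special:
                # The run's own formatting overrides the paragraph style defaults.
                eff_color = color or defaults.get("color")
                eff_shading = shading or defaults.get("shading")
                fmt_id = f"{eff_color}|{eff_shading}"
            fragments.append(Fragment(text, deleted, special, fmt_id))
    return fragments


# Invented sample standing in for the main document part.
SAMPLE_DOC = f"""<w:document xmlns:w="{W}"><w:body>
<w:p><w:pPr><w:pStyle w:val="Narration"/></w:pPr>
<w:r><w:t>Once upon a time. </w:t></w:r>
<w:r><w:rPr><w:color w:val="FF0000"/></w:rPr><w:t>[Fox] Hello!</w:t></w:r>
<w:r><w:rPr><w:strike/></w:rPr><w:t>(pause)</w:t></w:r>
</w:p></w:body></w:document>"""

raw_fragments = extract_fragments(
    SAMPLE_DOC, {"Narration": {"color": "000000", "shading": None}}
)
```

In this sample the first run becomes an ordinary fragment, the red run becomes special content with identifier `FF0000|None`, and the struck-through run becomes deleted content.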

[0059] 5. Merge adjacent fragments with the same format: Traverse the original fragment list. If the current fragment and the previous fragment agree on all three attributes (whether it is deleted content, whether it is special content, and the format combination identifier), append the text of the current fragment to the previous fragment, thereby merging consecutive text with identical formatting into a whole.
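The merging rule above can be sketched as a single pass over the list. The `Fragment` class and `merge_adjacent` function are hypothetical names introduced for illustration; the comparison uses exactly the three attributes named in the step.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fragment:  # hypothetical fragment record
    text: str
    deleted: bool = False
    special: bool = False
    fmt_id: Optional[str] = None


def merge_adjacent(fragments: List[Fragment]) -> List[Fragment]:
    """Append a fragment's text to its predecessor when all three attributes match."""
    merged: List[Fragment] = []
    for frag in fragments:
        prev = merged[-1] if merged else None
        same_format = prev is not None and (
            (prev.deleted, prev.special, prev.fmt_id)
            == (frag.deleted, frag.special, frag.fmt_id)
        )
        if same_format:
            prev.text += frag.text
        else:
            # Copy so the caller's list is not mutated.
            merged.append(Fragment(frag.text, frag.deleted, frag.special, frag.fmt_id))
    return merged


merged_fragments = merge_adjacent([
    Fragment("Once "),
    Fragment("upon a time."),
    Fragment("[Fox] Hi", special=True, fmt_id="FF0000|None"),
    Fragment("(aside)", deleted=True),
])
```

Word processors often split visually uniform text into several runs, so a pass like this restores one fragment per formatting span before any further analysis.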

[0060] 6. Split bracket annotations in special fragments: For each fragment marked as special content, use a regular expression to match each bracketed annotation together with the text immediately following it. Split each match into a new fragment whose text is "bracket content + immediately following text", and set that fragment's format combination identifier to the bracket content itself. Fragments that are not special content remain unchanged.
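One possible regular expression for this split is shown below. The pattern, the set of bracket characters (ASCII and full-width square brackets and parentheses), the choice to use the text inside the brackets as the new identifier, and the handling of text that precedes the first bracket are all assumptions; the source does not specify them.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fragment:  # hypothetical fragment record
    text: str
    deleted: bool = False
    special: bool = False
    fmt_id: Optional[str] = None


# Groups: opening bracket, inner text, closing bracket, text up to the next opening bracket.
BRACKET_RE = re.compile(r"([\[【(（])([^\]】)）]*)([\]】)）])([^\[【(（]*)")


def split_brackets(frag: Fragment) -> List[Fragment]:
    if not frag.special:
        return [frag]  # non-special fragments remain unchanged
    pieces: List[Fragment] = []
    pos = 0
    for m in BRACKET_RE.finditer(frag.text):
        if m.start() > pos:
            # Assumption: text not covered by a match keeps the original identifier.
            pieces.append(Fragment(frag.text[pos:m.start()], frag.deleted, True, frag.fmt_id))
        # New fragment: bracket content + immediately following text.
        pieces.append(Fragment(m.group(0), frag.deleted, True, m.group(2).strip()))
        pos = m.end()
    if pos < len(frag.text):
        pieces.append(Fragment(frag.text[pos:], frag.deleted, True, frag.fmt_id))
    return pieces


special = Fragment("[Fox] Hello! [Rabbit] Hi.", special=True, fmt_id="FF0000|None")
parts = split_brackets(special)
plain = Fragment("Narration (aside)")
```

Here `parts` holds two fragments, one per bracketed annotation, while `split_brackets(plain)` returns the ordinary fragment untouched.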

[0061] 7. Filter out fragments with no actual content: Count the valid characters (Chinese characters, complete English words, numbers) in each fragment. If the count is 0 (i.e., the fragment contains only punctuation, spaces, or parentheses), discard the fragment.
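A counting rule of this kind can be approximated with one regular expression. In the sketch below, the Unicode range used for Chinese characters (the basic CJK Unified Ideographs block) and the names `count_valid` and `drop_empty` are assumptions made for illustration.

```python
import re

# One match per Chinese character, per English word, or per run of digits.
VALID_RE = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|\d+")


def count_valid(text: str) -> int:
    """Count valid units: Chinese characters, whole English words, numbers."""
    return len(VALID_RE.findall(text))


def drop_empty(texts: list) -> list:
    """Keep only texts that contain at least one valid unit."""
    return [t for t in texts if count_valid(t) > 0]


kept = drop_empty(["Hello", "……", "( )", "第3页"])
```

With this rule, a fragment such as `( )` or a lone ellipsis is discarded, while any fragment containing a word, a digit, or a Chinese character is kept.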

[0062] 8. First-stage grouping:

[0063] Initialize an array of group IDs, assign each fragment an initial group ID equal to its index, then iterate through all fragments and merge them according to fragment type:

[0064] Ordinary fragment: Starting from the current fragment, scan forward and set the group ID of every consecutive following fragment that is neither special nor deleted content to the group ID of the current fragment;

[0065] Special fragment with content: Starting from the current fragment, scan forward and set the group ID of every following fragment that has the same format combination identifier or is unimportant content to the group ID of the current fragment. Unimportant content is content with a strikethrough, or special content that contains only marked content with no unmarked content after it. Marked content is content enclosed in square brackets or parentheses within special content.

[0066] Unimportant fragment: Starting from the current fragment, find the first fragment that is not unimportant, and set the group ID of the current fragment and of all consecutive unimportant fragments in between to the group ID of that fragment; if all following fragments are unimportant, mark the group IDs of these fragments as invalid.
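The three first-stage rules can be sketched as follows. This is one reading of the description rather than a definitive implementation: it assumes the forward scan for a special fragment stops at the first fragment that neither shares its identifier nor is unimportant, it uses `-1` as the invalid group ID, and the helper names and bracket and valid-character patterns are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

BRACKETED = re.compile(r"[\[【(（][^\]】)）]*[\]】)）]")
VALID = re.compile(r"[\u4e00-\u9fff]|[A-Za-z]+|\d+")
INVALID = -1  # assumed marker for an invalid group ID


@dataclass
class Fragment:  # hypothetical fragment record
    text: str
    deleted: bool = False
    special: bool = False
    fmt_id: Optional[str] = None


def is_unimportant(f: Fragment) -> bool:
    """Struck-through content, or special content with nothing outside its brackets."""
    if f.deleted:
        return True
    return f.special and not VALID.search(BRACKETED.sub("", f.text))


def is_ordinary(f: Fragment) -> bool:
    return not f.special and not f.deleted


def first_stage(frags: List[Fragment]) -> List[int]:
    n = len(frags)
    gid = list(range(n))  # initial group ID equals the fragment index
    for i, f in enumerate(frags):
        if is_unimportant(f):
            j = i
            while j < n and is_unimportant(frags[j]):
                j += 1
            target = gid[j] if j < n else INVALID
            for k in range(i, j):
                gid[k] = target  # attach to the next meaningful fragment
        elif is_ordinary(f):
            j = i + 1
            while j < n and is_ordinary(frags[j]):
                gid[j] = gid[i]
                j += 1
        else:  # special fragment with content
            j = i + 1
            while j < n and (
                (frags[j].special and frags[j].fmt_id == f.fmt_id)
                or is_unimportant(frags[j])
            ):
                gid[j] = gid[i]
                j += 1
    return gid


fox_hi = Fragment("[Fox] Hi", special=True, fmt_id="Fox")
struck = Fragment("he paused", deleted=True)
fox_bye = Fragment("[Fox] Bye", special=True, fmt_id="Fox")
rabbit = Fragment("[Rabbit] Yo", special=True, fmt_id="Rabbit")
note = Fragment("(a sly fox)", special=True, fmt_id="a sly fox")

same_role = first_stage([fox_hi, struck, fox_bye])
next_role = first_stage([fox_hi, struck, rabbit])
trailing = first_stage([Fragment("Narration"), note])
```

Under these assumptions, a struck-through fragment between two lines of the same character stays in that character's group, one between two different characters joins the following line, and a trailing bracket-only note receives the invalid ID.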

[0067] 9. Define group types: Groups are divided into special groups and ordinary groups. If the last fragment in a group is an ordinary fragment, the group is an ordinary group; otherwise, it is a special group.

[0068] 10. Second-stage grouping: Merge consecutive ordinary groups and set the group ID of the merged group to the group ID of the first ordinary group in that run, so that all adjacent ordinary groups end up with the same group ID;
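Group typing and the second-stage merge can be sketched together. The function below takes the first-stage group IDs and a parallel list of flags saying whether each fragment is ordinary; the name `second_stage`, the `-1` invalid marker, and the assumption that groups are visited in order of first appearance are illustrative choices, not taken from the source.

```python
from typing import List

INVALID = -1  # assumed marker for an invalid group ID


def second_stage(gid: List[int], ordinary_flags: List[bool]) -> List[int]:
    """Merge each run of consecutive ordinary groups into the first group of the run."""
    # A group's type is decided by its last fragment.
    last_index = {}
    for i, g in enumerate(gid):
        if g != INVALID:
            last_index[g] = i
    is_ordinary_group = {g: ordinary_flags[i] for g, i in last_index.items()}

    remap = {}
    run_head = None  # ID of the first ordinary group in the current run
    for g in dict.fromkeys(x for x in gid if x != INVALID):  # ordered, unique
        if is_ordinary_group[g]:
            if run_head is None:
                run_head = g
            remap[g] = run_head
        else:
            run_head = None  # a special group ends the run
            remap[g] = g
    return [remap.get(g, INVALID) for g in gid]


# Fragments: narration, struck-through note, narration, two special fragments, narration.
first_stage_ids = [0, 2, 2, 3, 3, 5]
ordinary = [True, False, True, False, False, True]
final_ids = second_stage(first_stage_ids, ordinary)
```

In this example groups 0 and 2 both end in an ordinary fragment and are adjacent, so they merge under ID 0, while the narration after the special group keeps its own ID 5.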

[0069] 11. Construct dialogue lines from the grouping results: Based on the final group IDs, collect the fragments with the same group ID to construct a "dialogue line". Each line contains a list of fragments and a boolean flag. If the line contains at least one ordinary fragment, it is marked as an ordinary line (narration); otherwise, it is a special line (such as a pure character annotation line).

[0070] 12. Output the list of dialogue lines:

[0071] All constructed dialogue lines are combined into a list of dialogue lines as the final result.
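The final assembly can be sketched as a grouping by final group ID. The `DialogueLine` class, its field names, and the decision to skip fragments with the invalid ID `-1` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

INVALID = -1  # assumed marker for an invalid group ID


@dataclass
class Fragment:  # hypothetical fragment record
    text: str
    deleted: bool = False
    special: bool = False
    fmt_id: Optional[str] = None


@dataclass
class DialogueLine:  # hypothetical output record
    fragments: List[Fragment]
    is_ordinary: bool  # True for narration, False for a special line


def build_lines(frags: List[Fragment], gid: List[int]) -> List[DialogueLine]:
    groups = {}  # dict keeps first-appearance order of group IDs
    for frag, g in zip(frags, gid):
        if g == INVALID:
            continue  # fragments with an invalid group ID are dropped
        groups.setdefault(g, []).append(frag)
    return [
        DialogueLine(fs, any(not f.special and not f.deleted for f in fs))
        for fs in groups.values()
    ]


lines = build_lines(
    [
        Fragment("Once upon a time."),
        Fragment("[Fox] Hi", special=True, fmt_id="Fox"),
        Fragment("(a sly fox)", special=True, fmt_id="a sly fox"),
    ],
    [0, 1, INVALID],
)
```

The resulting list has one narration line and one special line, and the trailing bracket-only note is absent because its group ID was marked invalid.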

[0072] Using the above method, documents from different online picture book platforms can be embedded directly into professional software to achieve automatic tagging. The corresponding content can be located directly in the document, which enables fully automatic alignment and facilitates professional audio content recording.

[0073] Figure 2 shows a plugin developed based on this method, suitable for embedding in professional audio production software. When recorded audio clips are segmented (i.e., tagged), the plugin automatically fills in the dialogue content within the labeled areas. The advantage is that the position of each audio clip in the script can be located directly, allowing automatic track alignment (i.e., splicing the audio clips in their sequential order). The automatic track alignment scheme based on this algorithm has undergone some experimental verification. Because the audio clips carry information from the original dialogue content, the accuracy is significantly improved compared with existing commercial solutions.

[0074] The experimental verification is as follows:

[0075] A. The online picture book content consists of picture books exported from the Picture Book Fairy Chicken platform;

[0076] Example 1: The picture book content is shown in Figure 3, and the result after processing by the method described in this invention is shown in Figure 4. The strikethrough section is automatically grouped with the content that follows it, making it easier to view the relevant description when reading aloud;

[0077] Example 2: The picture book content is shown in Figure 5, and the result after processing by the method described in this invention is shown in Figure 6. Supplementary explanations of other content within the narration do not interrupt the segmentation (for example, the explanation of the sound effects here is intended for post-production and does not need to be assigned to a character's lines).

[0078] Example 3: The picture book content is shown in Figure 7, and the result after processing by the method described in this invention is shown in Figure 8. The content at the end of the chapter only describes the characteristics of the characters. Because it contains only content enclosed in parentheses, it is judged to be an unimportant fragment, and because no other content follows it, it is deleted.

[0079] B. The online picture book content is a picture book exported from the Guagua Audio platform;

[0080] Example 4: The picture book content is shown in Figure 9, and the result after processing by the method described in this invention is shown in Figure 10. After the chapter title division rules are set, "Episode 10" is treated as a new chapter, while the description before the main text is judged to be "empty content" or "unimportant content" because of its special style and is deleted;

[0081] Example 5: The online picture book content is a picture book exported from the YouShengMiao platform. The picture book content is shown in Figure 11, and the result after processing by the method described in this invention is shown in Figure 12. If the narration between two lines of the same character has a strikethrough, the two lines and the struck-through narration between them are placed in one group. This is because the strikethrough fragment is judged to be an unimportant fragment and is grouped with the content that follows it. In the final pass, consecutive groups with the same identifier are merged.

[0082] The above description is merely a specific embodiment of the present invention, enabling those skilled in the art to understand or implement the present invention. Although detailed descriptions have been provided with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments, and they should all be covered within the protection scope of the claims.

Claims

1. A method for character recognition and structured segmentation of picture book documents, characterized in that it includes the following steps:

1. Parse the style definition file of the .docx document exported from the online picture book platform, extract each style identifier together with the default color and shading that the style assigns to text fragments, and store them in a style mapping table;

2. Parse the main content of the .docx document and obtain all paragraph tags in the document;

3. Create an original fragment list;

4. Process each paragraph:

1) Based on the style identifier used by the current paragraph, look up the corresponding default color and shading in the style mapping table;

2) Traverse all fragments of the paragraph and process them one by one, as follows:

a. Obtain the text content of each fragment in turn and check the fragment's own formatting, that is, whether it carries an individually specified color, shading, or strikethrough. If an individually specified color or shading is present, override the fragment's default color or shading with it, mark the fragment as special content, and generate a format combination identifier composed of that color or shading. If a strikethrough is present, mark the fragment as deleted content.

b. For each fragment that contains text, store its text, whether it is deleted content, whether it is special content, and its format combination identifier in the original fragment list created in step 3;

5. Merge adjacent fragments with the same format: Traverse the original fragment list. If the current fragment and the previous fragment agree on all three attributes (whether it is deleted content, whether it is special content, and the format combination identifier), append the text of the current fragment to the previous fragment, thereby merging consecutive text with identical formatting into a whole.

6. Split bracket annotations in special fragments: For each fragment marked as special content, use a regular expression to match each bracketed annotation together with the text immediately following it. Split each match into a new fragment whose text is "bracket content + immediately following text", and set that fragment's format combination identifier to the bracket content itself. Fragments that are not special content remain unchanged.

7. Filter out fragments with no actual content: Count the valid characters in the text of each fragment. If the count is 0, discard the fragment.

8. First-stage grouping: Initialize an array of group IDs, assign an initial group ID to each fragment, then iterate through all fragments and merge them according to fragment type:

Ordinary fragment: Starting from the current fragment, scan forward and set the group ID of every consecutive following fragment that is neither special nor deleted content to the group ID of the current fragment;

Special fragment with content: Starting from the current fragment, scan forward and set the group ID of every following fragment that has the same format combination identifier or is unimportant content to the group ID of the current fragment. Unimportant content is content with a strikethrough, or special content that contains only marked content with no unmarked content after it. Marked content is content enclosed in square brackets or parentheses within special content.

Unimportant fragment: Starting from the current fragment, find the first fragment that is not unimportant, and set the group ID of the current fragment and of all consecutive unimportant fragments in between to the group ID of that fragment; if all following fragments are unimportant, mark the group IDs of these fragments as invalid.

9. Define group types: Groups are divided into special groups and ordinary groups. If the last fragment in a group is an ordinary fragment, the group is an ordinary group; otherwise, it is a special group.

10. Second-stage grouping: Merge consecutive ordinary groups and set the group ID of the merged group to the group ID of the first ordinary group in that run, so that all adjacent ordinary groups end up with the same group ID;

11. Construct dialogue lines from the grouping results: Based on the final group IDs, collect the fragments with the same group ID to construct a dialogue line. Each line contains a list of fragments and a boolean flag. If the line contains at least one ordinary fragment, it is marked as an ordinary line; otherwise, it is a special line.

12. Output the list of dialogue lines: All constructed dialogue lines are combined into a list of dialogue lines as the final result.