A robust text watermarking method based on Chinese character feature modification and grouping

By adopting a robust text watermarking method based on Chinese character feature modification and grouping, the problems of complex character library generation and low robustness in existing technologies are solved, achieving efficient text watermark embedding and extraction, and improving the ability to trace the source of text content leaks.

CN115689853B · Active · Publication Date: 2026-06-23 · HANGZHOU DIANZI UNIV


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HANGZHOU DIANZI UNIV
Filing Date: 2022-11-22
Publication Date: 2026-06-23



Abstract

The application discloses a robust text watermarking method based on Chinese character feature modification and grouping. The watermarking process includes: cutting the watermark sequence into bit strings; grouping commonly used high-frequency Chinese characters based on Chinese character correlation features; generating a watermark-containing variant glyph based on Chinese character structure features; establishing an index table for mutual mapping among the groups, Chinese characters, watermark bit strings and variant glyphs; searching the index table according to the watermark bit string, selecting the variant glyph to generate a deformed character library file and installing it to a computer terminal. The watermark extraction process includes: processing a text image and recognizing Chinese characters in the text by using an OCR technology, obtaining a single Chinese character image block by using an image segmentation technology; matching the variant glyph to which the segmented Chinese character belongs and extracting a watermark bit string; classifying the watermark bit string by using a grouping index table and adopting a voting strategy to perform in-group error correction to extract a correct watermark sequence. The application can effectively improve the robustness of the watermarking method and the extraction accuracy.


Description

Technical Field

[0001] This invention belongs to the field of text content protection technology and relates to grouping algorithms and digital watermarking methods, specifically a robust text watermarking method based on Chinese character feature modification and grouping.

Background Technology

[0002] In the context of information digitization, multimedia information is widely used in people's work and life. While this has greatly improved the efficiency of information processing and dissemination, it has also brought about security problems such as information leakage and illegal dissemination. Digital watermarking technology provides an effective way to solve the problem of secure dissemination of multimedia information. Currently, digital watermarking technology for images, videos, and audio is relatively mature because these types of digital information contain a large amount of redundant information to embed watermark information. However, text documents contain relatively less redundant information, making it relatively difficult to effectively add digital watermarks to document information.

[0003] In daily life, many important text documents are still frequently transmitted in paper form through printing and scanning. In some sensitive departments, confidential documents are highly susceptible to leakage through printing and scanning. Furthermore, with the widespread use of smartphones, leakers can easily evade monitoring system logs and leak text information simply by taking photos of paper documents or display screens with their phones. Due to the complexity of screen photography and the uncertainty of the content displayed on the screen, tracing the source of leaked text documents and verifying their content has become extremely difficult.

[0004] Traditional text watermarking technology effectively solves the problem of distributing paper-based text documents, but it cannot effectively address the threat of leakage of information displayed on screen. Because traditional text watermarking techniques typically apply only to text files with fixed formats and suffer from poor robustness, font-based text watermarking methods have emerged. These methods usually modify the geometric features of glyphs in a font library to create deformations or perturbations that are almost imperceptible to the human eye. Different deformations or perturbations encode different watermark information, producing a watermarked font file. The watermark information is then embedded by statically or dynamically replacing the original system font file. Currently, methods for generating watermarked fonts by modifying characters fall into two main categories: methods that manually modify the geometric features of glyphs, and methods that generate glyph perturbations using deep learning.

[0005] Existing text watermarking methods based on font libraries can effectively address the issue of information leakage and tracing when text documents are displayed on terminal screens or printed as paper documents, but they still have the following problems:

[0006] 1) Due to the sheer number of Chinese characters and their diverse and complex forms, generating a single Chinese font often requires designing at least six thousand characters, and sometimes even ten thousand, to ensure the watermark capacity and applicability of the method. Furthermore, the manual design and modification of glyphs, as well as the inference and optimization of deep learning models, all require extensive professional knowledge and consume significant human and material resources. If the number of characters requiring design and modification could be reduced without affecting the watermark capacity and applicability of the method, the workload of such methods would be greatly reduced.

[0007] 2) Low robustness of the method. In text watermarking methods based on font libraries, a common approach to watermark extraction is to segment the watermarked text image into characters and then perform image matching between the segmented characters and template characters to obtain the watermark information. However, photographing introduces additional distortion and interference, easily causing glyph deformation in the document image. Furthermore, photographing a screen often produces moiré patterns, which affect the accuracy of watermark extraction. Currently, most methods require a long watermark sequence to be embedded continuously. During watermark extraction, if noise interference prevents the watermark from being extracted from even a few Chinese characters, the extraction of the entire watermark sequence can be affected. Therefore, an improved method is needed to enhance the robustness of such methods.

[0008] To address this, the present invention provides a robust text watermarking method based on Chinese character feature modification and grouping. While reducing the difficulty of font generation, it can effectively improve the robustness of the method. It can solve not only the problem of tracing the source of printed text output, but also the problem of leak prevention when streaming document content is browsed on screen.

Summary of the Invention

[0009] This invention provides a robust text watermarking method based on Chinese character feature modification and grouping, which solves the problems of complex and difficult character library generation and low robustness in the prior art. While ensuring the watermark capacity and adaptability of the method, it further improves the watermark extraction accuracy and robustness of the method, and solves the problems of text content authentication and leakage tracing of screen display.

[0010] This invention provides a robust text watermarking method based on Chinese character feature modification and grouping, characterized by including a watermark embedding process and a watermark extraction process;

[0011] The specific steps of the watermark embedding process include:

[0012] (1) Convert the watermark information into a binary sequence and divide the sequence into binary bit strings evenly;

[0013] (2) Statistically analyze the Chinese characters and words in the Chinese text corpus and their frequency of occurrence. Based on the statistical frequency characteristics and a grouping algorithm that incorporates Chinese character correlation characteristics, the high-frequency commonly used Chinese characters are divided into the same number of groups as there are binary bit strings after segmentation.

[0014] (3) Based on the structural features of Chinese characters, the displacement features of strokes in high-frequency and commonly used Chinese characters are modified by font tools to generate a variety of different variant glyphs. Each variant glyph is encoded using a different binary bit string to carry different watermark information.

[0015] (4) Establish a group index table that maps groups, Chinese characters, binary bit strings and variant glyphs to each other;

[0016] (5) According to the segmented binary bit strings, select in turn a variant glyph for each high-frequency Chinese character in the group index table, merge all selected variant glyphs with the unmodified glyphs of the remaining non-high-frequency Chinese characters, generate a font library file, and install it on the terminal to embed the watermark;

[0017] The specific steps of the watermark extraction process include:

[0018] (6) Obtain the watermarked text image after cross-media transmission, identify each Chinese character in the watermarked text image using OCR technology, and extract the character images corresponding to all identified Chinese characters using character image segmentation methods.

[0019] (7) For each Chinese character and its corresponding character image, determine whether the Chinese character is in the index table according to the index table, that is, determine whether the Chinese character is a high-frequency commonly used Chinese character. If it is a high-frequency commonly used Chinese character, use the image matching method to match the segmented character image with multiple variant characters corresponding to the Chinese character to calculate the similarity, and extract the watermark bit string carried by the variant character with the highest similarity.

[0020] (8) Based on the index table, the extracted watermark bit strings are classified into the corresponding groups, and the voting strategy is used to correct the watermark bit strings in the groups, thereby extracting the correct watermark sequence and completing the extraction of watermark information.

[0021] As a preferred approach, in step (1), the watermark information is converted into a binary sequence, and the binary sequence is evenly divided into binary bit strings of length [length missing], where n can be 0, 1, 2, or 3.

[0022] As a preferred option, step (2) includes:

[0023] A word frequency statistics tool is used to count all Chinese characters and words in the Chinese text corpus and their frequency of occurrence. The Chinese characters and two-character words whose frequency of occurrence exceeds a set threshold are obtained and recorded, along with their frequency characteristics.

[0024] Based on the statistical frequency characteristics of Chinese characters, commonly used high-frequency Chinese characters are initially grouped, and the number of groups is the same as the number of binary bit strings after segmentation.

[0025] On the basis of this simple grouping, a secondary precise grouping is performed using a grouping algorithm based on the correlation features of Chinese characters, utilizing the counted two-character words and their frequency characteristics.

[0026] As a preferred approach, the specific implementation of the grouping algorithm for Chinese character relevance features includes:

[0027] After the initial simple grouping, the pointwise mutual information (PMI) value between any two Chinese characters in each group is calculated sequentially, and the correlation between Chinese characters is determined by the PMI value. Based on the correlation between Chinese characters in each group, the characters are subjected to secondary precise grouping: if the correlation between two Chinese characters in a group is higher than a set threshold, the Chinese character with the lower frequency is removed from the original group and added to the group with the lowest cumulative frequency among the remaining groups. This process is repeated until no two Chinese characters in any group are correlated, at which point the precise grouping is considered complete.

[0028] As a preferred option, in step (3), based on the length of the binary bit string segmented in step (1), the high-frequency commonly used Chinese characters are manually modified with various different stroke features using a font modification tool, thereby generating a variety of different variant characters. Each variant character is encoded using a binary number with the same length as the binary bit string, and different variant characters carry different watermark information.

[0029] As a preferred option, step (4) establishes a logical index table that records the mapping relationships between groups, Chinese characters, binary bit strings and variant glyphs. This index table can quickly find the connections between each table item, which will simplify the process of embedding and extracting watermark information.

[0030] As a preferred option, step (5) is implemented as follows: according to the segmented binary watermark bit strings, in each group of the index table, select for each Chinese character the variant glyph encoded by the corresponding bit string; merge all selected variant glyphs with the unmodified glyphs of the remaining non-high-frequency Chinese characters; generate a font library file with the font tool; and install the generated font library file on the computer terminal. The text content output by the computer terminal then contains the watermark information, which completes the watermark embedding process.

[0031] As a preferred solution, step (6) includes:

[0032] Obtain the watermarked text image after cross-media transmission. That is, on a computer terminal equipped with a watermarked font library, when a text file is output through printing or display, it is converted into image data by a screenshot or photographing device.

[0033] Identify Chinese characters in the watermarked text image through OCR technology, and use a character image segmentation method to sequentially segment all the recognized Chinese characters and their corresponding character image blocks from the text image.

[0034] As a preferred solution, the specific implementation in step (7) is as follows: Traverse each recognized Chinese character in turn, check whether it belongs to high-frequency common Chinese characters in the index table, perform image matching on the segmented character image with variant glyphs, and extract the watermark bit string carried by the matching variant glyph.

[0035] As a preferred solution, a voting strategy is adopted in step (8) to correct the watermark bit strings within the group. That is, within each group in the index table, the binary bit string with the most occurrences is used as the correctly extracted watermark information bit string, so as to obtain the correct binary watermark sequence.

[0036] The beneficial effects of the present invention are as follows:

[0037] In the present invention, the usage frequencies of Chinese characters and words are statistically analyzed. When generating the watermarked deformed font library, high-frequency common Chinese characters are selected for character modification. According to the statistical results, the cumulative occurrence frequency of the top 1,000 Chinese characters with high frequencies is 91.40%, the cumulative occurrence frequency of 1,500 characters reaches 96.50%, and the cumulative occurrence frequency of 3,500 characters can reach 99.90%. Therefore, only modifying high-frequency common Chinese characters to generate the watermarked deformed font library can fully meet the daily application scope, and can reduce the work complexity and difficulty of character modification and generation.
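The coverage figures above are cumulative frequencies of the top-ranked characters. As a minimal sketch of how such a figure could be computed from raw character counts (the function name and the toy counts are hypothetical, not taken from the patent or its corpus):

```python
def cumulative_coverage(char_counts, top_k):
    """Share of all character occurrences covered by the top_k most frequent characters."""
    total = sum(char_counts.values())
    top = sorted(char_counts.values(), reverse=True)[:top_k]
    return sum(top) / total


# Toy counts, invented for illustration (not the corpus statistics cited above).
toy_counts = {"的": 50, "一": 30, "是": 15, "龘": 5}
print(cumulative_coverage(toy_counts, 2))
```

Applied to real corpus counts, the same calculation with top_k set to 1,000, 1,500, or 3,500 would yield figures of the kind reported in this paragraph.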

[0038] In the present invention, a grouping algorithm based on the correlation characteristics between Chinese characters is provided, so that the Chinese characters within each group are as independent and uncorrelated as possible. When the watermark is embedded, the watermark information is therefore distributed as discretely and evenly as possible across different Chinese characters. For example, in the high-frequency word "我们" ("we"), the two Chinese characters do not carry the same watermark information. Using the grouping algorithm not only increases the relative capacity of the watermark, but also achieves a high watermark embedding rate and extraction rate when the text content is short or limited.

[0039] Because this invention employs an index table lookup mechanism and a voting error correction strategy during watermark extraction, it significantly improves the efficiency of watermark embedding and extraction. Furthermore, the error correction strategy eliminates errors occurring during character image segmentation and matching, thereby increasing the accuracy of watermark extraction. This combined mechanism of grouping and error correction ensures the robustness of the watermarking method, enabling successful extraction of watermark information even when the leaked image content is partially damaged or the image is incomplete.

Attached Figure Description

[0040] Figure 1 is a flowchart of the watermark information embedding process of the present invention;

[0041] Figure 2 is a schematic diagram illustrating the character feature modification of the present invention;

[0042] Figure 3 is a schematic diagram of a Chinese character grouping index table according to the present invention;

[0043] Figure 4 is a flowchart illustrating the watermark information extraction process of the present invention;

[0044] Figure 5 is a graph showing the effectiveness test results of the Chinese character relevance grouping method of the present invention;

[0045] Figure 6 shows the test results in the embodiment of photographing, from different angles, text files of different font sizes printed as paper documents;

[0046] Figure 7 shows the test results in the embodiment of photographing the screen at different font sizes and angles while streaming files are browsed.

Detailed Implementation

[0047] The technical solutions of the present invention will now be clearly and completely described with reference to the accompanying drawings of the embodiments of the present invention. It should be understood that the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0048] Figure 1 shows the watermark embedding flowchart of the robust text watermarking method based on Chinese character feature modification and grouping, which includes the following steps:

[0049] Step S101: Convert the watermark information into a binary sequence and divide it evenly. That is, convert the watermark information to be embedded into a binary 01 bit sequence and divide the sequence evenly into binary bit strings.

[0050] Specifically, in this embodiment of the invention, the watermark information is converted into a 32-bit binary sequence, and the sequence is divided into groups of 2 bits each to obtain 16 groups of binary bit strings.
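The splitting in step S101 can be sketched as follows. This is only an illustration under stated assumptions: the watermark is taken to be an unsigned 32-bit integer, and the function name and the sample value are hypothetical.

```python
from typing import List


def watermark_to_bit_strings(watermark: int, total_bits: int = 32,
                             chunk_bits: int = 2) -> List[str]:
    """Convert an integer watermark into equal-length binary bit strings."""
    if total_bits % chunk_bits != 0:
        raise ValueError("total_bits must be a multiple of chunk_bits")
    if not 0 <= watermark < (1 << total_bits):
        raise ValueError("watermark does not fit in total_bits")
    bits = format(watermark, f"0{total_bits}b")  # zero-padded binary text
    return [bits[i:i + chunk_bits] for i in range(0, total_bits, chunk_bits)]


# Hypothetical 32-bit terminal identifier used as the watermark.
chunks = watermark_to_bit_strings(0xC0A80001)
print(len(chunks), chunks[:4])
```

With 32 bits and 2-bit chunks this gives the 16 bit strings of the embodiment, one per Chinese character group.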

[0051] Step S102: Statistically analyze the frequency features of Chinese characters and words, and use a grouping algorithm based on the correlation features of Chinese characters to group frequently used Chinese characters.

[0052] Specifically, the THUCNews Chinese corpus dataset was processed using a word frequency statistics tool to count all Chinese characters and words in the corpus, as well as their frequency of occurrence. One thousand frequently occurring Chinese characters and one thousand frequently occurring two-character words were selected from the statistical results as high-frequency common characters, and their corresponding frequencies were retained.
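The patent does not name the word frequency statistics tool. As a simplified stand-in, the sketch below counts single Chinese characters and adjacent two-character sequences; treating every adjacent pair as a candidate two-character word is an assumption made here for illustration (a real pipeline would likely use a word segmenter), and the function names are hypothetical.

```python
from collections import Counter


def is_hanzi(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def count_chars_and_bigrams(texts):
    """Count Chinese characters and adjacent two-character sequences."""
    chars, bigrams = Counter(), Counter()
    for text in texts:
        prev = None
        for ch in text:
            if is_hanzi(ch):
                chars[ch] += 1
                if prev is not None:
                    bigrams[prev + ch] += 1
                prev = ch
            else:
                prev = None  # punctuation or other symbols break a pair
    return chars, bigrams


chars, bigrams = count_chars_and_bigrams(["我们是我们,你们"])
print(chars.most_common(2), bigrams.most_common(1))
```

Dividing each count by the corpus total gives the frequencies used later for grouping and for the correlation measure.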

[0053] Using a grouping algorithm based on the correlation features between Chinese characters, the one thousand frequently used Chinese characters are divided into 16 groups, matching the number of bit strings in the segmented watermark sequence.

[0054] First, the Chinese characters are initially grouped based on their frequency characteristics. The frequently used characters are sorted from highest to lowest frequency, and the top 16 characters are assigned one each to 16 groups, with the cumulative frequency of the characters in each group recorded. Each remaining character is then added, in turn, to the group that currently has the lowest cumulative frequency. This approach aims to make the cumulative frequency of characters in the 16 groups as even as possible.
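A minimal sketch of this initial frequency-balanced grouping is given below. The function name and the toy frequencies are hypothetical; the greedy rule itself (always add the next character to the group with the lowest cumulative frequency) follows the description above.

```python
def initial_grouping(char_freq, num_groups):
    """Greedy frequency-balanced grouping of characters."""
    ordered = sorted(char_freq, key=char_freq.get, reverse=True)
    groups = [[] for _ in range(num_groups)]
    totals = [0.0] * num_groups
    for ch in ordered:
        # Pick the group with the lowest cumulative frequency so far.
        # While groups are still empty, this seeds one top character per group.
        target = min(range(num_groups), key=totals.__getitem__)
        groups[target].append(ch)
        totals[target] += char_freq[ch]
    return groups, totals


# Toy frequencies, invented for illustration only.
toy_freq = {"的": 0.040, "一": 0.015, "是": 0.013,
            "不": 0.011, "了": 0.010, "人": 0.009}
groups, totals = initial_grouping(toy_freq, 3)
print(groups)
```

In the embodiment, num_groups would be 16 and char_freq would hold the one thousand high-frequency characters.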

[0055] Based on the initial simple grouping of Chinese characters, a second, precise grouping is performed. Using the counted two-character words and their frequency characteristics, the pointwise mutual information (PMI) value between any two characters in each group is calculated sequentially. The magnitude of the PMI value is used to determine the correlation between characters. Based on the correlation between the characters in each group, the initial simple grouping is further refined so that the characters within each group are as uncorrelated as possible.

[0056] Point-wise mutual information is commonly used to measure the relevance between things, and in the field of text analysis in natural language processing, it can be used to calculate the semantic similarity between words. The basic idea of point-wise mutual information is to statistically analyze the probability of two words appearing simultaneously in a text; the higher the probability, the greater the relevance and the higher the degree of association. Borrowing this idea to calculate the relevance between Chinese characters, the formula is as follows:

[0057] $$\mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\,p(y)}$$

[0058] where $p(x)$ and $p(y)$ are the usage frequencies of Chinese characters $x$ and $y$, and $p(x, y)$ is the usage frequency of the word formed by Chinese characters $x$ and $y$. If Chinese characters $x$ and $y$ cannot form a word, then [expression missing]. The more relevant $x$ is to $y$, the greater the value of $\mathrm{PMI}(x, y)$, and vice versa. If the two are not relevant, then $\mathrm{PMI}(x, y)$ is equal to 0.
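As an illustration, pointwise mutual information in its standard form, log(p(x, y) / (p(x) p(y))), can be computed as below. The function name is hypothetical, the natural logarithm is an arbitrary choice, and returning 0 for a pair that never forms a word is an assumption made for this sketch, since the source text does not preserve how that case is handled.

```python
import math


def pmi(p_x, p_y, p_xy):
    """Pointwise mutual information between two characters.

    p_x, p_y: usage frequencies of the two characters.
    p_xy: usage frequency of the two-character word they form.
    """
    if p_xy <= 0:
        # Assumption: a pair that never forms a word is treated as unrelated.
        return 0.0
    return math.log(p_xy / (p_x * p_y))


print(pmi(0.02, 0.01, 0.005))  # positive: the pair co-occurs more than chance
print(pmi(0.02, 0.01, 0.0))    # pair never forms a word
```

A value near 0 means the two characters co-occur about as often as independent characters would.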

[0059] If the relevance between two Chinese characters within a group is higher than the set threshold, then the Chinese character with a smaller frequency is removed from the original group, and then this Chinese character is added to the group with the smallest cumulative frequency among the remaining groups. By iterating this process multiple times until the Chinese characters within each of the 16 groups are not relevant to each other pairwise, it is considered that the precise grouping is completed.
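The iterative refinement can be sketched as follows, assuming the pairwise correlation values have already been computed and stored in a dictionary keyed by character pairs. The function name, the data layout, the toy values, and the max_rounds safety limit are all hypothetical additions for illustration.

```python
from itertools import combinations


def refine_groups(groups, char_freq, pair_pmi, threshold, max_rounds=1000):
    """Move the rarer character of each correlated pair to another group."""
    groups = [list(g) for g in groups]  # work on a copy
    for _ in range(max_rounds):
        moved = False
        for gi, group in enumerate(groups):
            for a, b in combinations(group, 2):
                if pair_pmi.get(frozenset((a, b)), 0.0) > threshold:
                    low = a if char_freq[a] < char_freq[b] else b
                    group.remove(low)
                    # Choose among the other groups the one with the
                    # lowest cumulative frequency.
                    others = [i for i in range(len(groups)) if i != gi]
                    target = min(
                        others,
                        key=lambda i: sum(char_freq[c] for c in groups[i]),
                    )
                    groups[target].append(low)
                    moved = True
                    break  # the group changed, so rescan from the start
            if moved:
                break
        if not moved:
            return groups  # no correlated pair is left in any group
    raise RuntimeError("grouping did not converge within max_rounds")


# Toy data, invented for illustration only.
toy_freq = {"我": 0.010, "们": 0.005, "的": 0.040, "是": 0.013, "不": 0.011}
toy_pmi = {frozenset(("我", "们")): 3.2}
refined = refine_groups([["我", "们", "的"], ["是"], ["不"]],
                        toy_freq, toy_pmi, threshold=1.0)
print(refined)
```

The source does not state what happens if the process fails to settle, so the sketch simply stops with an error after max_rounds passes.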

[0060] Step S103: Based on the structural characteristics of Chinese characters, modify the displacement characteristics of the strokes in the characters to generate variant glyphs, and use binary bit strings for encoding to carry watermark information.

[0061] Specifically, since the length of the segmented watermark bit string is 2 bits, a font modification tool is used to manually perform four different stroke feature modification operations on each of the one thousand high-frequency commonly used Chinese characters, thereby generating four different variant glyphs for each Chinese character. Each variant glyph is encoded with a two-bit binary bit string, so the four variant glyphs respectively carry the four different 2-bit watermark information bit strings "00", "01", "10", and "11", thereby establishing the relationship between the variant glyphs and the watermark information. As shown in Figure 2, the strokes of the character "学" are modified with different displacement amounts to generate four variant glyphs, which are encoded with binary bit strings.

[0062] Step S104: Establish an index table between the groups, Chinese characters, binary bit strings, and variant glyphs.

[0063] Specifically, as shown in Figure 3, establish a logical index table. The table records the mapping relationships between the groups, Chinese characters, binary bit strings, and variant glyphs. Through this index table, the connections between table entries can be found quickly, simplifying the subsequent watermark information embedding and extraction processes.
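One possible in-memory layout for such an index table is sketched below. The dictionary structure, the function name, and the glyph naming scheme (character plus ".v" plus bit string) are assumptions for illustration; the actual table layout of Figure 3 is not reproduced here.

```python
from itertools import product


def build_index_table(groups, chunk_bits=2):
    """Map each character to its group and to one variant glyph per bit string."""
    codes = ["".join(bits) for bits in product("01", repeat=chunk_bits)]
    table = {}
    for group_id, chars in enumerate(groups):
        for ch in chars:
            table[ch] = {
                "group": group_id,
                # Hypothetical glyph identifiers, one per bit string.
                "variants": {code: f"{ch}.v{code}" for code in codes},
            }
    return table


table = build_index_table([["的"], ["一", "了"]])
print(table["了"]["group"], table["一"]["variants"]["10"])
```

A lookup by character answers both questions needed later: which group the character votes in, and which variant glyph corresponds to which bit string.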

[0064] Step S105: Select variant glyphs according to the segmented watermark bit string, fuse them with the non-variant glyphs to generate a font library file, and perform watermark embedding according to the font library file.

[0065] Specifically, for each 2-bit binary watermark bit string, the variant glyphs of the Chinese characters in the corresponding group that are encoded by that bit string are retrieved from the index table, producing a set of variant glyphs for the high-frequency Chinese characters. This set of variant glyphs is then merged with the unmodified glyphs of the other, less frequently used Chinese characters, and a font library file is generated using a font tool. The generated font library file is installed on a computer terminal, and all text files output from that terminal will contain the watermark information, thus achieving streaming watermark embedding.
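The selection logic of step S105 can be sketched as follows, leaving out the actual font file generation. Group i is assumed to carry the i-th watermark bit string; the function names and the glyph naming scheme are hypothetical.

```python
def select_variants(groups, bit_strings):
    """Assign to every character in group i the i-th watermark bit string."""
    if len(groups) != len(bit_strings):
        raise ValueError("need exactly one bit string per group")
    return {ch: bits
            for chars, bits in zip(groups, bit_strings)
            for ch in chars}


def glyph_name(ch, selection):
    """Variant glyph for high-frequency characters, original glyph otherwise."""
    bits = selection.get(ch)
    return ch if bits is None else f"{ch}.v{bits}"  # hypothetical naming


selection = select_variants([["的"], ["一", "了"]], ["11", "00"])
print([glyph_name(ch, selection) for ch in "的一了龘"])
```

Characters outside the index table keep their original glyph, which mirrors the merge of variant and unmodified glyphs described above.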

[0066] The flowchart of watermark information extraction in this embodiment of the invention is shown in Figure 4:

[0067] Step S201: Obtain a screenshot or photo of the watermarked text and preprocess it. Use OCR technology to recognize Chinese characters in the image and extract the character images of all recognized Chinese characters through character segmentation.

[0068] Specifically, on a computer terminal with a watermarked font file installed, the watermarked text image is obtained by taking a picture of the printed document or by taking a screenshot or picture of the streaming text content being viewed on the screen. The image undergoes preprocessing operations such as cropping, scaling, affine transformation, brightness equalization, and sharpening, and is then converted into a white-background black-text image to make the characters in the text image as clear and recognizable as possible.

[0069] OCR technology is used to identify each character in the black-on-white image and obtain the position coordinates of each character in the image. The image projection method is used to segment the black-on-white image into individual character image blocks. That is, the image is first projected horizontally to obtain each line of text, and then each line of text is projected vertically to obtain each character image block.
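A minimal sketch of projection-based segmentation on a binarized image is shown below. The image is assumed to be a list of rows in which 1 marks an ink (text) pixel and 0 marks background; the function names and the toy image are hypothetical.

```python
def runs(profile):
    """Return (start, end) spans of consecutive non-zero entries; end is exclusive."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_characters(img):
    """Split a binary text image into (top, bottom, left, right) boxes."""
    boxes = []
    row_profile = [sum(row) for row in img]           # horizontal projection
    for top, bottom in runs(row_profile):
        line = img[top:bottom]
        col_profile = [sum(col) for col in zip(*line)]  # vertical projection
        for left, right in runs(col_profile):
            boxes.append((top, bottom, left, right))
    return boxes


toy_img = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
boxes = segment_characters(toy_img)
print(boxes)
```

Pure projection can split a character whose components are separated by a vertical gap, so in practice the character coordinates returned by OCR would presumably be used to merge or validate the boxes.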

[0070] Step S202: Sequentially search for frequently used Chinese characters among the identified Chinese characters, use image matching method to determine the variant glyph of the segmented Chinese character image, and extract the watermark information it carries.

[0071] Specifically, the identified Chinese characters are searched in the index table in turn. If the Chinese character is not a high-frequency commonly used Chinese character, it will not exist in the index table and there will be no variant glyphs. Therefore, the segmented character image does not contain watermark information and no operation is performed on the character image. If the identified Chinese character is a high-frequency commonly used Chinese character, the four different variant glyphs corresponding to the Chinese character are found according to the index table. The similarity between the segmented character image and the four variant glyphs is calculated using the image matching method. The most similar variant character is matched, and the binary bit string carried by the variant character is extracted according to the binary code of the variant glyph in the index table. That is, the watermark information bit string contained in the segmented character image.
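The patent does not specify which image matching method computes the similarity. The sketch below uses the fraction of agreeing pixels between equally sized binary images purely as a stand-in; the function names and the 3x3 toy templates are hypothetical.

```python
def similarity(a, b):
    """Fraction of pixels that agree between two equally sized binary images."""
    total = same = 0
    for row_a, row_b in zip(a, b):
        for pa, pb in zip(row_a, row_b):
            total += 1
            same += (pa == pb)
    return same / total


def extract_bits(char_img, templates):
    """Return the bit string of the variant template most similar to char_img."""
    return max(templates, key=lambda bits: similarity(char_img, templates[bits]))


# Toy 3x3 "variant glyph" templates, invented for illustration.
toy_templates = {
    "00": [[1, 0, 0], [1, 0, 0], [1, 0, 0]],
    "01": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "10": [[0, 0, 1], [0, 0, 1], [0, 0, 1]],
    "11": [[1, 1, 1], [0, 0, 0], [0, 0, 0]],
}
observed = [[0, 1, 0], [0, 1, 0], [0, 0, 0]]  # noisy copy of variant "01"
print(extract_bits(observed, toy_templates))
```

The segmented character image would need to be scaled to the template size before such a comparison.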

[0072] Step S203: Use the Chinese character grouping index table to classify the watermark information bit strings extracted from the segmented characters into the corresponding Chinese character groups, and use a voting strategy to correct the grouped watermark bit strings to extract the correct watermark sequence.

[0073] Specifically, because errors in character segmentation and image matching may produce incorrect watermark information bit strings, the bit strings extracted within a group may be inconsistent, so a voting strategy is employed for error correction. For each watermark information group, the 2-bit string that occurs most frequently is selected as the final correctly extracted watermark information bit string. Finally, the 16 groups of watermark information bit strings are concatenated in order and converted into a plaintext sequence, completing the watermark information extraction. Based on this watermark information, the source of the leaked text image can be identified, thereby pinpointing the leaking terminal.
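The in-group voting and final concatenation can be sketched as follows. The input is assumed to be a list of (group index, extracted bit string) pairs, one per recognized high-frequency character; the function name and the sample data are hypothetical.

```python
from collections import Counter


def vote_and_decode(extracted, num_groups):
    """Majority-vote the bit string of each group and concatenate the winners."""
    ballots = [Counter() for _ in range(num_groups)]
    for group_id, bits in extracted:
        ballots[group_id][bits] += 1
    chunks = []
    for group_id, counter in enumerate(ballots):
        if not counter:
            raise ValueError(f"no characters observed for group {group_id}")
        chunks.append(counter.most_common(1)[0][0])
    return "".join(chunks)


# Toy extraction result: one wrong vote ("01") in group 0.
toy_extracted = [(0, "11"), (1, "00"), (0, "11"), (0, "01"), (1, "00"), (2, "10")]
sequence = vote_and_decode(toy_extracted, 3)
print(sequence, int(sequence, 2))
```

A group with no observed characters cannot be decoded, which is why balanced, uncorrelated groups matter when the leaked text is short.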

[0074] In this embodiment, the effectiveness of the grouping algorithm of this invention was tested. As shown in Figure 5, compared to random grouping and simple grouping, precise grouping based on the relevance features of Chinese characters significantly improves the accuracy of watermark extraction, and this improvement is particularly noticeable when the text content is limited. This indicates that even when the text content in the leaked image is limited, this invention can still accurately extract the embedded watermark information, thus solving the problem of tracing the source of the leak.

[0075] In this embodiment, the embedding and extraction of watermarks are performed using a Chinese character grouping algorithm and a voting error correction strategy, which enhances the robustness of the method of the present invention. Even when the content of leaked text images is partially maliciously damaged or partially missing, the present invention can still extract watermark information to trace the source of the document.

[0076] In this embodiment, the watermarking method of this invention still exhibits strong robustness even in complex shooting scenarios. For example, Figure 6 shows the case in this embodiment where the watermarked text is printed as a paper document, with the text content displayed in different font sizes, and is then photographed from different angles using a smartphone, thereby converting it into a watermarked text image. Figure 7 shows photos of the screen display taken with a mobile phone from three different perspectives in this embodiment, while the screen is displaying streaming files with different font sizes. It can be observed that although the captured images contain varying degrees of noise, uneven lighting, and moiré patterns, the watermark information can still be extracted, demonstrating the robustness and effectiveness of this embodiment and fulfilling the requirement for highly robust leak tracing.

[0077] The above description is merely a detailed explanation of preferred embodiments and principles of the present invention. For those skilled in the art, there may be changes in specific implementation methods based on the ideas provided by the present invention, and these changes should also be considered within the scope of protection of the present invention.

Claims

1. A robust text watermarking method based on Chinese character feature modification and grouping, characterized in that it comprises a watermark embedding process and a watermark extraction process.
The specific steps of the watermark embedding process include:
(1) Convert the watermark information into a binary sequence and divide the sequence evenly into binary bit strings.
(2) Statistically analyze the Chinese characters and words in a Chinese text corpus and their frequencies of occurrence. Based on the statistical frequency features and a grouping algorithm that incorporates Chinese character correlation features, divide the high-frequency commonly used Chinese characters into groups with the same length as the segmented binary sequence. Step (2) includes: using a word frequency statistics tool to count all Chinese characters and words in the Chinese text corpus and their frequencies of occurrence, and obtaining and recording the Chinese characters and two-character words whose frequency exceeds a set threshold, together with their frequency features; performing an initial simple grouping of the commonly used high-frequency Chinese characters based on their statistical frequency features, the length of each group being the same as the length of the segmented binary sequence; and, on the basis of the simple grouping, performing a secondary precise grouping with a grouping algorithm based on Chinese character correlation features, using the counted two-character words and their frequency features. The grouping algorithm based on Chinese character correlation features is implemented as follows: after the initial simple grouping, the pointwise mutual information value between any two Chinese characters in each group is calculated in turn, and the correlation between Chinese characters is determined by this value. Based on the correlation between the Chinese characters in each group, each Chinese character is subjected to secondary precise grouping: if the correlation between two Chinese characters in a group is higher than a set threshold, the Chinese character with the lower frequency is removed from its original group and added to the group with the lowest cumulative frequency among the remaining groups. This process is repeated until no two Chinese characters in any group are correlated, at which point the precise grouping is complete.
(3) Based on the structural features of Chinese characters, modify the displacement features of strokes in the high-frequency commonly used Chinese characters with font tools to generate multiple different variant glyphs. Each variant glyph is encoded with a different binary bit string so as to carry different watermark information.
(4) Establish a group index table that maps groups, Chinese characters, binary bit strings and variant glyphs to one another.
(5) Based on the segmented binary bit strings, select in turn a variant glyph of each high-frequency Chinese character in the group index table, merge all selected variant glyphs with the unmodified glyphs of the remaining Chinese characters that are not high-frequency commonly used characters, generate a font library file, and install it on the terminal to embed the watermark.
The specific steps of the watermark extraction process include:
(6) Obtain the watermarked text image after cross-media transmission, identify each Chinese character in the watermarked text image using OCR technology, and extract the character images corresponding to all identified Chinese characters using a character image segmentation method.
(7) For each Chinese character and its corresponding character image, look up the index table to determine whether the Chinese character is in the table, that is, whether it is a high-frequency commonly used Chinese character. If it is, use an image matching method to compute the similarity between the segmented character image and each of the variant glyphs of that Chinese character, and extract the watermark bit string carried by the variant glyph with the highest similarity.
(8) Based on the index table, classify the extracted watermark bit strings into their corresponding groups, and use a voting strategy to correct the watermark bit strings within each group, thereby extracting the correct watermark sequence and completing the extraction of watermark information.

2. The robust text watermarking method based on Chinese character feature modification and grouping as described in claim 1, characterized in that, in step (1), the watermark information is converted into a binary sequence, and the binary sequence is evenly divided into binary bit strings of length [length missing], where n can be 0, 1, 2 or 3.

3. The robust text watermarking method based on Chinese character feature modification and grouping as described in claim 1, characterized in that, in step (3), based on the length of the binary bit strings segmented in step (1), the high-frequency commonly used Chinese characters are manually modified in various ways using a font modification tool, thereby generating multiple different variant glyphs. Each variant glyph is encoded using a binary number with the same length as the binary bit string, and different variant glyphs carry different watermark information.

4. The robust text watermarking method based on Chinese character feature modification and grouping as described in claim 1 or 3, characterized in that, in step (4), a logical index table is established that records the mapping relationships among groups, Chinese characters, binary bit strings and variant glyphs. The index table allows the relationships between table entries to be looked up quickly, which simplifies the embedding and extraction of watermark information.

5. The robust text watermarking method based on Chinese character feature modification and grouping as described in claim 4, characterized in that step (5) is implemented as follows: based on the segmented binary watermark information bit strings, in each group of the index table, the variant glyph of each Chinese character that is encoded by the corresponding bit string is selected; all selected variant glyphs are merged with the unmodified glyphs of the other Chinese characters that are not high-frequency commonly used characters; a font library file is generated with the font tool and installed on the computer terminal. The text content output by the computer terminal then contains the watermark information, which completes the watermark embedding process.

6. The robust text watermarking method based on Chinese character feature modification and grouping as described in claim 5, characterized in that step (6) includes: obtaining the watermarked text image after cross-media transmission, that is, when a text file is printed from or displayed on a computer terminal on which the watermarked font is installed, the text file is converted into image data by taking a screenshot or by a photographing device; identifying the Chinese characters in the watermarked text image using OCR technology; and using a character image segmentation method to segment, in sequence, all identified Chinese characters and their corresponding character image blocks from the text image.

7. The robust text watermarking method based on Chinese character feature modification and grouping as described in claim 6, characterized in that step (7) is implemented as follows: each recognized Chinese character is traversed in turn and looked up in the index table to determine whether it is a high-frequency commonly used Chinese character; if so, the segmented character image is matched against the variant glyphs of that character, and the watermark bit string carried by the matching variant glyph is extracted.

8. The robust text watermarking method based on Chinese character feature modification and grouping as described in claim 7, characterized in that, in step (8), a voting strategy is used to correct the watermark bit strings within each group. That is, in each group of the index table, the binary bit string that appears most frequently is taken as the correct watermark information bit string, so as to obtain the correct binary watermark sequence.
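The matching in step (7) selects the variant glyph most similar to the segmented character image. The claims do not name a similarity measure. The Python sketch below uses zero-mean normalized cross-correlation as one plausible choice, on tiny grayscale grids given as lists of rows. The metric, the function names ncc and match_variant, and the 3x3 toy templates are all assumptions for illustration. Real character images would first need to be scaled to the size of the templates.

```python
import math


def ncc(img_a, img_b):
    """Zero-mean normalized cross-correlation of two equally sized images."""
    a = [p for row in img_a for p in row]
    b = [p for row in img_b for p in row]
    if len(a) != len(b):
        raise ValueError("images must have the same number of pixels")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0:
        return 0.0  # a flat image carries no shape information
    return sum(x * y for x, y in zip(da, db)) / denom


def match_variant(char_image, templates):
    """Return (bit_string, score) of the template most similar to char_image.

    templates: dict mapping a watermark bit string to its variant glyph image.
    """
    scored = ((bits, ncc(char_image, tpl)) for bits, tpl in templates.items())
    return max(scored, key=lambda item: item[1])


# Toy templates: one vertical stroke placed in three different columns.
toy_templates = {
    "00": [[0, 255, 0], [0, 255, 0], [0, 255, 0]],
    "01": [[255, 0, 0], [255, 0, 0], [255, 0, 0]],
    "10": [[0, 0, 255], [0, 0, 255], [0, 0, 255]],
}
# A noisy capture of the centre-stroke variant.
toy_capture = [[10, 240, 20], [0, 250, 5], [15, 235, 0]]
best_bits, best_score = match_variant(toy_capture, toy_templates)
```

The bit string returned for each character would then be filed under that character's group in the index table and passed to the voting step of claim 8.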