Methods, computer programs, and computer systems for encoding character data using encoding schemes (management of metadata at the binary level).

The encoding scheme integrates metadata with character data encoding to securely manage and process sensitive information, addressing the inefficiencies in existing data protection methods by ensuring compatibility and enhanced security.

JP2026105815A · Pending · Publication Date: 2026-06-26 · INTERNATIONAL BUSINESS MACHINE CORPORATION

Patent Information

Authority / Receiving Office
JP · JP
Patent Type
Applications
Current Assignee / Owner
INTERNATIONAL BUSINESS MACHINE CORPORATION
Filing Date
2025-09-17
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing data protection methods struggle to efficiently manage and secure confidential information, particularly personal identification data, by accurately identifying and handling sensitive data while ensuring compatibility with existing systems.

Method used

A method and system for encoding character data using a predefined encoding scheme that represents each character's binary value in a unique set of code units, incorporating metadata for enhanced security and flexibility, allowing for secure storage and processing of sensitive information.

Benefits of technology

This approach ensures secure storage and processing of sensitive data by embedding metadata within the character encoding, maintaining compatibility with existing systems and enhancing data security and privacy.

✦ Generated by Eureka AI based on patent content.

Abstract

[Problem] To provide a method for encoding character data using an encoding scheme for sensitive data. [Solution] The encoding scheme is configured to represent the original binary value of each character in a predefined character set in a unique set of one or more code units of the encoding scheme, according to a formatting scheme. The method includes the steps of: representing metadata describing a particular token as a binary value called a metadata binary value; representing each character in the particular token as a binary value called a character binary value; creating, according to the formatting scheme, a unique set of code units containing at least a portion of the metadata binary value and the character binary value; and storing the particular token by storing the resulting one or more sets of code units.

Description

Technical Field

[0001] The present invention relates to the field of digital computer systems, and more particularly, to a method for encoding character data using an encoding scheme.

Summary of the Invention

Problems to be Solved by the Invention

[0002] Data protection defines the implementation of measures to safeguard confidential information such as personal identification data. To achieve compliance, systems often have to establish processes to identify the location of confidential data and ensure its proper handling, often through techniques such as data masking. Various solutions are available to assist in identifying confidential data, applying appropriate classification labels, and implementing the necessary protection measures. However, there is still room for further improvement in these solutions.

Means for Solving the Problems

[0003] Various embodiments provide a method, a computer program product, and a computer system for encoding character data using an encoding scheme, as described by the subject matter of the independent claims. Advantageous embodiments are described in the dependent claims. Embodiments of the present invention can be freely combined with each other if they are not mutually exclusive.

[0004] In one embodiment, the present invention relates to a method for encoding character data using an encoding scheme, wherein the encoding scheme is configured to represent the original binary value of each character in a predefined character set in a unique set of one or more code units of the formatting scheme, wherein the original binary value is obtained by a binary encoding technique, and the method comprises: with respect to a particular token: representing metadata describing the particular token as a binary value called a metadata binary value; for each character in the particular token: representing the character as a binary value called a character binary value; creating a unique set of code units containing the metadata binary value and the character binary value from at least a portion of the metadata binary value and the character binary value in accordance with the formatting scheme; and storing the particular token by storing one or more sets of the resulting code units.

[0005] In one embodiment, the present invention relates to a computer program product comprising a computer-readable storage medium in which computer-readable program code is embodied, wherein the computer-readable program code is configured to implement the method of the above embodiment.

[0006] In one embodiment, the present invention relates to a computer system for encoding character data using an encoding scheme, wherein the encoding scheme is configured to represent the original binary value of each character in a predefined character set in a unique set of one or more code units of the formatting scheme, wherein the original binary value is obtained by a binary encoding technique, and the computer system is configured to store the particular token by: representing metadata describing the particular token as a binary value called a metadata binary value; representing each character in the particular token as a binary value called a character binary value; creating a unique set of code units containing the metadata binary value and the character binary value from at least a portion of the metadata binary value and the character binary value in accordance with the formatting scheme; and storing one or more sets of the resulting code units.

Brief Description of the Drawings

[0007] Hereinafter, embodiments of the present invention will be described in more detail with reference to the drawings, merely as examples.

[0008] [Figure 1] A flowchart of a method for encoding character data using an encoding scheme, according to an example of the present subject matter.

[0009] [Figure 2] A flowchart of a method for encoding an unstructured document using an encoding scheme, according to an example of the present subject matter.

[0010] [Figure 3] A flowchart of a method for decoding an encoded unstructured document, according to an example of the present subject matter.

[0011] [Figure 4] A diagram of a method for encoding a specific token, according to an example of the present subject matter.

[0012] [Figure 5] A computing environment, according to an example of the present subject matter.

Modes for Carrying Out the Invention

[0013] The descriptions of various embodiments of the present invention are presented for illustrative purposes only and are not intended to be exhaustive or to limit the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terms used herein have been selected to best describe the principles of the embodiments, their practical applications, or their technological improvements over technologies available on the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

[0014] Storing metadata related to data classification alongside the data itself provides greater flexibility in managing sensitive information, allowing changes to data to be tracked independently of its classification, thereby ensuring more efficient updates across multiple systems. Access to tokens using the present subject matter may be more secure because access may require the use of a catalog or tool that leverages the stored metadata. For example, the present subject matter may allow metadata applicable to tokens such as single words, sentences, or values in a document to be stored within the document itself, rather than externally. This ensures that the metadata remains synchronized when the document is updated, thereby preventing misalignment due to inaccurate offsets. Additionally, even if the document is copied, the metadata cannot be lost because it is embedded within the data. Even if the document is damaged, sensitive data may remain protected because masking mechanisms can be enabled through strict catalog access controls.

[0015] This subject may provide methods for encoding character data using encoding schemes. Character data refers to any data composed of individual characters, which may include letters, numbers, punctuation marks, symbols, and control characters. In digital systems, these characters can be represented using a specific encoding scheme that can convert each character into a unique sequence of binary numbers. This binary sequence can be a way in which character data is stored, enabling a computer to process and interpret the information. Binary sequences can be stored in a variety of formats, including structured documents or text files, databases, or other unstructured data formats such as data repositories.

[0016] An encoding scheme can be a method used to convert characters into binary values so that the characters can be stored and processed by a computer. An encoding scheme can be associated with, or used for, the encoding of characters in a particular character set. An encoding scheme is configured such that the original binary value of each character in a character set is represented by a unique set of one or more code units of the encoding scheme, according to a formatting scheme. The formatting scheme can define how the original binary value is structured in code units. The representation of the original binary value can be implemented such that each character in the character set is assigned a specific sequence of code units that uniquely identifies it in the encoding scheme. A code unit can be a bit sequence. A code unit can be the smallest bit sequence used to encode a character using its encoding scheme. The original binary value can refer to the raw binary representation of a given character based on a binary encoding technique. For example, a given character may be mapped to or associated with a unique decimal value, where the original binary value can be obtained by applying the binary encoding technique to that decimal value. For example, given the letter "M", the corresponding decimal value could be 77. The conversion from decimal to binary using the binary encoding technique can be performed in the following steps: 77 ÷ 2 = 38, remainder 1; 38 ÷ 2 = 19, remainder 0; 19 ÷ 2 = 9, remainder 1; 9 ÷ 2 = 4, remainder 1; 4 ÷ 2 = 2, remainder 0; 2 ÷ 2 = 1, remainder 0; and 1 ÷ 2 = 0, remainder 1. Reading the remainders from last to first gives the original binary value "1001101" representing the letter "M".
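The repeated-division steps above can be sketched in Python as follows. This is a minimal illustration only; the function name `to_binary` is hypothetical and does not appear in the application, and the decimal value 77 for "M" is taken from Python's `ord`, which matches the example in the text.

```python
def to_binary(value: int) -> str:
    """Convert a non-negative integer to a binary string by repeated division by 2."""
    if value == 0:
        return "0"
    remainders = []
    while value > 0:
        value, remainder = divmod(value, 2)
        remainders.append(str(remainder))  # least-significant bit is produced first
    # Read the remainders from last to first to obtain the binary value.
    return "".join(reversed(remainders))


original_binary_value = to_binary(ord("M"))  # ord("M") is 77
print(original_binary_value)  # 1001101
```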

[0017] Therefore, given the encoding scheme described above, this method can be implemented for a specific token. The specific token can be any token, where a token can be the smallest character sequence that has semantic meaning. The token may be, for example, part of an electronic document or a database. For example, in natural language processing (NLP), a token may refer to a word or punctuation mark. For example, in the sentence "Hello, world!", the tokens may be "Hello", ",", "world", and "!". A specific token may consist of one or more characters. This method can be implemented for a specific token as follows. Metadata describing the specific token can be determined. The metadata can be represented by a binary value called a metadata binary value, for example, using a binary encoding technique. For each character in the specific token, the character can be represented by a binary value called a character binary value, and a unique set of code units can be created according to the formatting scheme such that the set of code units includes at least a portion of the metadata binary value and the character binary value. This can result in one or more sets of code units depending on the number of characters in the specific token. For example, if the specific token has n characters, it can result in n sets of code units. Characters can be represented as character binary values using the binary encoding technique. The specific token can be stored by storing the resulting one or more sets of code units. Storing metadata along with the actual data can maintain important contextual information (e.g., classification) about the specific token, which can be used for data protection, verification, or other purposes.

[0018] For example, at least a portion of the metadata binary value is a unique part of the metadata binary value associated with the character. For instance, the metadata binary value may be divided into multiple parts, each part associated with a specific character of a given token. Dividing the metadata binary value into unique parts, each associated with a specific character of the token, allows for more detailed and specific metadata control, providing greater granularity during processing or analysis. This method can lead to more efficient storage, especially when the metadata associated with each character is relatively small, thereby reducing redundancy. Additionally, it can offer flexible processing, as it can accommodate scenarios where only a subset of the metadata is associated with a particular character within the token.

[0019] For example, at least a portion of the metadata binary value is the metadata binary value itself. That is, each character of a particular token can be associated with the complete content of the metadata. Associating each character of a particular token with the entire metadata binary value simplifies processing because it eliminates the need to divide or manage unique parts. This approach provides a holistic view and eliminates the risk of information loss during processing, ensuring that all characters have access to the complete metadata. Additionally, direct access to the complete metadata for each character can speed up the retrieval process without the need to reconstruct the metadata from individual parts, thus facilitating ease of retrieval.
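The two options described above (a unique part of the metadata per character, or the whole metadata binary value per character) can be contrasted with a short sketch. The function names, the equal-width split, and the left zero-padding are assumptions made for illustration; the application does not specify how the metadata binary value is divided.

```python
def split_metadata(metadata_bits: str, n_chars: int) -> list[str]:
    """Divide the metadata bits into n_chars equal-width parts, one per character.

    Assumption: the bit string is left-padded with zeros so that it divides evenly.
    """
    width = -(-len(metadata_bits) // n_chars)  # ceiling division
    padded = metadata_bits.zfill(width * n_chars)
    return [padded[i * width:(i + 1) * width] for i in range(n_chars)]


def replicate_metadata(metadata_bits: str, n_chars: int) -> list[str]:
    """Associate every character with the complete metadata binary value."""
    return [metadata_bits] * n_chars


token = "Miami"
metadata_bits = "1011000110"  # hypothetical 10-bit metadata binary value

per_char_parts = split_metadata(metadata_bits, len(token))
per_char_full = replicate_metadata(metadata_bits, len(token))
print(per_char_parts)  # ['10', '11', '00', '01', '10']
```

Splitting stores fewer metadata bits per character but requires all parts to be reassembled on reading; replication makes every set of code units self-describing at the cost of redundancy.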

[0020] For example, the encoding scheme is a predefined multibyte encoding scheme, and the representation of characters and/or metadata is performed such that the concatenation of at least a portion of the metadata binary value and the character binary value represents an existing character in the multibyte region of the character set according to the binary encoding technique. The existing character can thus be replaced by the character and the associated at least a portion of the metadata.

[0021] In fact, the resulting concatenated binary sequence can correspond to existing characters within the multibyte character region. This means that the process can effectively replace or represent characters within a predefined encoding scheme using the combined metadata and character binary values. For example, emojis that may not be necessary to encode database data can be replaced in this manner.

[0022] This approach can enhance character representation by directly embedding metadata into the character encoding, thereby providing additional context or information for each character. By utilizing established multibyte encodings such as UTF-8 or UTF-16, it is possible to ensure compatibility with existing systems while efficiently expanding character functionality.

[0023] According to one example, the encoding scheme is a predefined multibyte encoding scheme, and the representation is implemented such that the concatenation of at least a portion of the metadata binary value and the character binary value represents unused characters outside the multibyte region of the character set according to binary encoding techniques.

[0024] The encoding scheme may be a predefined multibyte encoding such as UTF-8 or UTF-16, where characters are typically represented using one or more bytes. In this approach, the representation may involve concatenating at least a portion of the metadata binary value with the character binary value to create a new binary sequence. This sequence is designed to correspond to unused characters outside the standard multibyte region of the character set. Unused characters are those that are not currently assigned within the multibyte region of the character set and thus can provide a free area that can be repurposed to meet the needs of custom encoding.

[0025] This scheme can maintain the integrity of the original encoding by using characters outside the existing multi-byte region to ensure that additional metadata does not interfere with the standard characters. It enables seamless integration of custom metadata and thus may allow additional information such as annotations or formatting details to be directly added within the encoding without relying on external storage.

[0026] According to one example, the step of creating a set of code units includes the steps of padding zero or more bits between the metadata binary value and the character binary value to integrate the metadata binary value and the character binary value into a combined binary value, and determining a specific set of code units including the combined binary value according to the formatting scheme.

[0027] The character binary value may be a first bit set, and the metadata binary value may be a second bit set. Integrating the metadata binary value and the character binary value into a combined binary value can be implemented, for example, by placing the first bit set adjacent to the second bit set with zero or more padding bits therebetween. Zero or more padding bits can be inserted between the metadata and the character binary value to ensure that the combined binary value conforms to a predefined format (e.g., fixed length). The padding can be implemented such that the total number of bits in the combined binary value enables obtaining a unique set of code units according to the formatting scheme. That is, if the total number of bits in the first bit set and the second bit set enables obtaining a unique set of code units, the padding may not exist (be zero).

[0028] For example, padding is implemented such that the combined binary value has a number of bits greater than the maximum number of bits that can represent the character set. For instance, the character set may include a set of characters such that each character can be represented by a set of bits less than or equal to the maximum number of bits. This ensures that the combined binary value corresponds to a character that does not belong to the character set, thus avoiding the use of the same code unit.
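The padding described in the two preceding paragraphs can be sketched as follows. The layout (metadata bits first, then zero padding, then character bits), the 26-bit target length, and the 9-bit metadata value are assumptions chosen for illustration; the only figure taken from the character set is that the largest Unicode code point (0x10FFFF) needs 21 bits.

```python
UNICODE_MAX_BITS = 21  # 0x10FFFF, the largest Unicode code point, needs 21 bits


def combine(metadata_bits: str, char_bits: str, total_bits: int) -> str:
    """Place zero or more padding bits between the metadata and character bits."""
    padding = total_bits - len(metadata_bits) - len(char_bits)
    if padding < 0:
        raise ValueError("metadata and character bits do not fit in total_bits")
    return metadata_bits + "0" * padding + char_bits


# Hypothetical 9-bit metadata value combined with the 7-bit value of "M".
combined = combine("100000001", format(ord("M"), "b"), 26)

# The combined value is longer than any value in the character set,
# so it cannot collide with an existing character's code units.
exceeds_charset = len(combined) > UNICODE_MAX_BITS
```

When the metadata and character bits already add up to the target length, the padding is empty, for example `combine("11", "01", 4)` yields `"1101"`.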

[0029] For example, if the formatting scheme is the UTF-8 formatting scheme, metadata can be stored at the code point level, within the associated data itself, by using previously unused UTF-8 multibyte code points. Standard UTF-8 limits multibyte sequences to 4 bytes, and its 4-byte sequences cover 1,048,576 code points (U+10000 through U+10FFFF). Sequences longer than 4 bytes remain technically possible within the UTF-8 byte structure.

[0030] In one example, the method further comprises the steps of parsing an electronic document to identify sensitive tokens in the electronic document, and repeating the method for each of the sensitive tokens as the specific token, wherein the sensitive tokens are identified based on security access criteria, and the storage of each sensitive token includes replacing one or more existing code units of the sensitive token in the electronic document with the one or more sets of code units created for the sensitive token. The method in this example may involve scanning an electronic document to identify sensitive tokens according to security access criteria. For each sensitive token found, the method is applied to that sensitive token, where the term "specific token" in this method is understood to mean "sensitive token."

[0031] A sensitive token may be a token that satisfies the security access criteria either alone or in combination with one or more additional tokens within the electronic document, in which case the one or more additional tokens are also classified as sensitive tokens. The security access criteria may require, for example, that the token has certain attributes, is part of a specific token category, or is linked to one or more additional tokens in a way that collectively satisfies the security access criteria. Sensitive tokens may, for example, represent a financial data category or be personally identifiable information (PII) tokens. This can enhance data security and privacy by replacing sensitive tokens within a document with a different set of code units, effectively masking the original information while maintaining its integrity. It can enable automated and scalable identification of sensitive data, simplifying storage and transmission without compromising the document structure. Furthermore, it can ensure compatibility with existing encoding schemes and protect sensitive information from unauthorized access.

[0032] For example, the character set includes one of the following: UNICODE, ASCII, or ISO-8859-1.

[0033] For example, the formatting scheme is a formatting scheme that is either a UTF encoding scheme or an ASCII encoding scheme.

[0034] For example, the formatting scheme is a UTF-8 encoding scheme, and the set of code units is longer than 4 code units.

[0035] For example, the set of code units may consist of 5 or 6 code units. By using 5 or 6 code units, the encoding scheme may be able to represent a wider range of characters than existing multibyte encoding schemes.

[0036] For example, each code unit is a byte.

[0037] In one example, the method further comprises the step of encrypting the particular token before the representation of the characters of the particular token, where the characters of the particular token are the encrypted characters of the encrypted particular token. The method in this example specifies that the particular token, for which metadata is determined and one or more sets of code units are created and then stored, is an encrypted version of the original token. Physically, the encrypted token and the original token may be similar in format or structure; however, they are semantically different, as the encrypted token may conceal the meaning of the original token, making it unreadable without decryption.

[0038] For example, before representing the characters of a particular token, the token is first encrypted. Encryption may involve converting the particular token into an encoded format that is unreadable without a decryption key. As a result, the characters of the particular token that are ultimately represented will be the encrypted form of the original characters. Essentially, the particular token is encrypted into a new, secure version, and these encrypted token characters will be used in subsequent operations.

[0039] Encrypting specific tokens before their representation ensures enhanced security by protecting sensitive information, as intercepted data cannot be accessed without the decryption key. This method can maintain data protection during transmission, storage, or processing and can be applied across various domains, including secure communications and databases.

[0040] For example, a specific token is a database attribute value stored within the database.

[0041] A specific token can refer to a database attribute value, which is a single piece of data stored within the database. In a database system, data can be organized into tables consisting of rows and columns. Each column represents an attribute such as name, age, or product ID, and each row holds the corresponding value for these attributes. A specific token can be one of these values.

[0042] By treating database attribute values as tokens, granular data management becomes possible, enabling fine-grained actions such as encryption or metadata tagging for individual data items, thereby enhancing security without affecting the entire database.

[0043] Figure 1 is a flowchart of a method for encoding character data using an encoding scheme, according to an example of the present subject matter. For illustrative purposes, the method described in Figure 1 may be implemented in, but is not limited to, the system shown in Figure 5. The encoding scheme is configured to represent the original binary value of each character in a predefined character set within a unique set of one or more code units of the encoding scheme, according to a formatting scheme. The original binary value is obtained from the character by a binary encoding technique.

[0044] In step 101, metadata describing a particular token may be represented by a binary value called a metadata binary value. Steps 103 and 105 may be performed for each character within the particular token. In step 103, a character may be represented by a binary value called a character binary value. In step 105, a unique set of code units can be created according to the formatting scheme such that the set of code units includes at least a portion of the metadata binary value and the character binary value.

[0045] In step 107, the particular token may be stored by storing the resulting one or more sets of code units.

[0046] Figure 2 is a flowchart of a method for encoding data using an encoding scheme, according to an example of the present subject matter. For illustrative purposes, the method described in Figure 2 may be implemented in the system shown in Figure 5, but is not limited to this implementation.

[0047] The method begins in step 201 by receiving an unstructured document that is likely to contain various types of data, including sensitive information. The next step, 202, involves scanning the document and identifying sensitive tokens, where the system detects sensitive data such as personally identifiable information (PII). Following this, in step 203, metadata about the tokens is generated, capturing details about the tokens, such as their type, location, and level of confidentiality. An optional step 204 is included, which encrypts the original characters within the sensitive tokens, thus adding an additional layer of security by ensuring that the original data remains unreadable without proper decryption, even if exposed. The process concludes in step 205 by re-encoding the sensitive tokens using a UTF multibyte encoding scheme, thereby ensuring that the tokens are stored or transmitted in a secure and standardized format compatible with various systems and platforms. This method may represent a specific encoding called encoding 200.
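The flow of steps 201 to 205 can be sketched as below, with the optional encryption step 204 omitted. Everything specific here is an assumption for illustration: the regular expression standing in for the security access criteria, the 9-bit metadata label, the 26-bit payload layout (9 metadata bits followed by a 17-bit character field), and the helper names `encode_char` and `encode_document`.

```python
import re

# Hypothetical security access criterion: a US-style social security number.
SENSITIVE_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
PII_METADATA = 0b000000001  # hypothetical 9-bit classification label


def encode_char(ch: str, metadata: int) -> bytes:
    """Pack metadata and one character into a 5-byte, UTF-8-style sequence."""
    if not 0 <= metadata < (1 << 9) or ord(ch) >= (1 << 17):
        raise ValueError("value does not fit the assumed 26-bit layout")
    combined = (metadata << 17) | ord(ch)
    lead = 0b11111000 | (combined >> 24)  # prefix 111110 + top 2 payload bits
    tail = [0b10000000 | ((combined >> s) & 0b111111) for s in (18, 12, 6, 0)]
    return bytes([lead, *tail])


def encode_document(text: str) -> bytes:
    """Re-encode sensitive tokens; leave all other text as standard UTF-8."""
    out = bytearray()
    pos = 0
    for match in SENSITIVE_PATTERN.finditer(text):        # step 202: find tokens
        out += text[pos:match.start()].encode("utf-8")
        for ch in match.group():                           # steps 203 and 205
            out += encode_char(ch, PII_METADATA)
        pos = match.end()
    out += text[pos:].encode("utf-8")
    return bytes(out)


encoded = encode_document("SSN: 123-45-6789.")
```

In this sketch the surrounding text stays readable to any UTF-8 consumer, while the bytes of the sensitive token no longer contain its original characters.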

[0048] Figure 3 is a flowchart of a method for decoding data encoded using an encoding scheme, according to an example of the present subject matter. For illustrative purposes, the method described in Figure 3 may be implemented in the system shown in Figure 5, but is not limited to this implementation.

[0049] The process begins in step 301 by verifying whether the application being used recognizes the encoding described in Figure 2. If the application does not recognize it, the sensitive token appears as an undefined character in step 302, meaning that the application cannot accurately interpret the data. If the application does recognize the encoding, it scans for blocks of UTF multibyte characters in step 303 to identify potentially encoded data. Once a block is identified, the next step 304 checks whether the identified block belongs to the specific encoding (encoding 200) used for the token. If the block belongs to encoding 200, the system identifies it as part of the same sensitive token in step 305 and proceeds with decoding. If the block is not part of encoding 200, the process proceeds to the next block in step 306. For blocks identified as part of encoding 200, the next step 307 decodes the UTF data to reconstruct the original data and metadata. The process then verifies whether the encoded bytes are encrypted. If they are encrypted, the system decrypts the encrypted bytes in step 308; if they are not encrypted, this step is skipped. Finally, in step 309, the system completes the decoding process by replacing the encoded tokens with the original tokens and/or displaying additional metadata.
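The decoding flow above can be sketched as follows, without the decryption of step 308. The sketch assumes a hypothetical layout in which a sensitive character is a 5-byte sequence whose 26 payload bits hold 9 metadata bits followed by a 17-bit character field; the function name `decode_stream` and the sample bytes are likewise illustrative, not taken from the application.

```python
CHAR_FIELD_BITS = 17  # assumed: low 17 payload bits hold the character value


def decode_stream(data: bytes) -> tuple[str, list[tuple[str, int]]]:
    """Return the recovered text and a list of (token, metadata) pairs."""
    items = []  # (text piece, metadata or None for plain UTF-8)
    i = plain_start = 0
    while i < len(data):
        if data[i] >> 2 == 0b111110:  # lead byte 111110xx marks an encoded block
            if i + 5 > len(data):
                raise ValueError("truncated 5-byte sequence")
            if plain_start < i:
                items.append((data[plain_start:i].decode("utf-8"), None))
            value = data[i] & 0b11
            for byte in data[i + 1:i + 5]:
                value = (value << 6) | (byte & 0b111111)
            char = chr(value & ((1 << CHAR_FIELD_BITS) - 1))
            items.append((char, value >> CHAR_FIELD_BITS))
            i += 5
            plain_start = i
        else:
            i += 1
    if plain_start < len(data):
        items.append((data[plain_start:].decode("utf-8"), None))

    # Group adjacent encoded characters with equal metadata into one token.
    text, tokens, prev_encoded = [], [], False
    for piece, meta in items:
        text.append(piece)
        if meta is not None:
            if prev_encoded and tokens[-1][1] == meta:
                tokens[-1] = (tokens[-1][0] + piece, meta)
            else:
                tokens.append((piece, meta))
        prev_encoded = meta is not None
    return "".join(text), tokens


# "M" with hypothetical metadata value 257, embedded in ordinary UTF-8 text.
sample = b"Hi " + bytes.fromhex("fa80a0818d") + b"!"
text, tokens = decode_stream(sample)

# An application that does not recognize the encoding (step 302) sees
# replacement characters instead of the token.
unaware_view = sample.decode("utf-8", errors="replace")
```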

[0050] Figure 4 is a block diagram illustrating a method for encoding a specific token, "Miami," according to an example of the present subject matter. For illustrative purposes, the method described in Figure 4 may be implemented in, but is not limited to, the system shown in Figure 5. In this example, the formatting scheme is that of the UTF-8 encoding scheme.

[0051] Each character of a particular token is represented by its respective binary value. This is shown in Figure 4, where the character "M" is represented by binary value 401.1, the character "i" by binary value 401.2, the character "a" by binary value 401.3, the character "m" by binary value 401.4, and the character "i" by binary value 401.5.

[0052] Metadata for a specific token, "Miami," is defined or determined. The metadata is represented by a metadata binary value of 402.

[0053] For each character of the specific token "Miami," the metadata binary value and the character binary value can be combined into a combined binary value. This may be done, for example, by inserting 10 padding bits between the metadata binary value and the character binary value. Alternatively, the binary encoding technique may be configured to provide a character binary value for each character in the token that already includes the padding zeros. This is shown in Figure 4, where the character "M" is associated with a combined binary value 403.1 obtained from metadata binary value 402 and character binary value 401.1, the character "i" is associated with a combined binary value 403.2 obtained from metadata binary value 402 and character binary value 401.2, the character "a" is associated with a combined binary value 403.3 obtained from metadata binary value 402 and character binary value 401.3, the character "m" is associated with a combined binary value 403.4 obtained from metadata binary value 402 and character binary value 401.4, and the character "i" is associated with a combined binary value 403.5 obtained from metadata binary value 402 and character binary value 401.5.

[0054] Using the formatting scheme, a set of code units can be created for each character of the specific token "Miami" such that the set of code units contains the metadata binary value and the character binary value for that character. Since the formatting scheme is that of the UTF-8 encoding scheme, the code units are bytes, and the formatting scheme indicates the number of bytes in a multibyte sequence with a specific prefix in the first byte. For example, in a 2-byte sequence, the first byte starts with the 3-bit prefix "110" and the second byte starts with "10". In a 3-byte sequence, the first byte starts with the 4-bit prefix "1110", followed by two bytes starting with "10". In a 4-byte sequence, the first byte starts with the 5-bit prefix "11110", followed by three bytes starting with "10". However, the combined binary value is longer than 21 bits, which is the maximum number of bits needed to represent the character set associated with the UTF-8 encoding scheme, so the 4-byte maximum of the UTF-8 encoding scheme may not be sufficient to encode a combined binary value. For this reason, the same logic of the formatting scheme can be followed such that the first byte begins with the 6-bit prefix "111110", followed by four bytes beginning with "10". This is shown in Figure 4, where each character of the specific token "Miami" is associated with its own set of five code units (i.e., 5 bytes). The character "M" is associated with set 404.1 of five code units, the character "i" is associated with set 404.2 of five code units, the character "a" is associated with set 404.3 of five code units, the character "m" is associated with set 404.4 of five code units, and the character "i" is associated with set 404.5 of five code units.
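The 5-byte formatting described for "Miami" can be sketched as follows. The actual bit values of Figure 4 are not reproduced in the text, so the metadata value, its 9-bit width, and the 17-bit character field (10 padding bits plus 7 character bits) are assumptions; only the byte prefixes "111110" and "10" come from the description. The names `encode_char` and `encode_token` are hypothetical.

```python
METADATA_BITS = 9     # assumed width of the metadata binary value
CHAR_FIELD_BITS = 17  # assumed: 10 padding bits + 7 character bits


def encode_char(ch: str, metadata: int) -> bytes:
    """Format one combined binary value as a set of five code units (bytes)."""
    if not 0 <= metadata < (1 << METADATA_BITS):
        raise ValueError("metadata does not fit in the assumed 9 bits")
    if ord(ch) >= (1 << CHAR_FIELD_BITS):
        raise ValueError("character does not fit in the assumed 17 bits")
    combined = (metadata << CHAR_FIELD_BITS) | ord(ch)  # 26-bit combined value
    lead = 0b11111000 | (combined >> 24)                # "111110" + 2 payload bits
    tail = [0b10000000 | ((combined >> shift) & 0b111111)  # "10" + 6 payload bits
            for shift in (18, 12, 6, 0)]
    return bytes([lead, *tail])


def encode_token(token: str, metadata: int) -> list[bytes]:
    """Create one set of five code units per character of the token."""
    return [encode_char(ch, metadata) for ch in token]


sets = encode_token("Miami", 0b100000001)  # hypothetical metadata value
print(sets[0].hex())  # fa80a0818d
```

Because the whole metadata value is repeated for every character in this sketch, the two occurrences of "i" produce identical sets of code units.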

[0055] This subject may include the following clauses.

[0056] [Clause 1] A method for encoding character data using an encoding scheme, wherein the encoding scheme is configured to represent the original binary value of each character in a predefined character set in a unique set of one or more code units of the formatting scheme, wherein the original binary value is obtained by a binary encoding technique, and the method comprises: with respect to a particular token: representing the metadata describing the particular token as a binary value called a metadata binary value; for each character in the particular token: representing the character as a binary value called a character binary value; creating a unique set of code units containing the metadata binary value and the character binary value from at least a portion of the metadata binary value and the character binary value in accordance with the formatting scheme; and storing the particular token by storing one or more sets of the resulting code units.

[0057] [Clause 2] The method according to Clause 1, wherein at least a portion of the metadata binary value is a unique part of the metadata binary value associated with the character.

[0058] [Clause 3] The method according to Clause 1, wherein at least a portion of the metadata binary value is the metadata binary value.

[0059] [Clause 4] The method according to any one of the preceding clauses 1 to 3, wherein the encoding scheme is a predefined multibyte encoding scheme, and the representation is carried out such that the concatenation of at least a portion of the metadata binary value and the character binary value represents an existing character in the multibyte area of the character set according to the binary encoding technique, thereby replacing the existing character with the associated at least a portion of the character and the metadata.

[0060] [Clause 5] The method according to any one of the preceding clauses 1 to 3, wherein the encoding scheme is a predefined multibyte encoding scheme, and the representation is carried out such that the concatenation of at least a portion of the metadata binary value and the character binary value represents unused characters outside the multibyte area of the character set, in accordance with the binary encoding technique.

[0061] [Clause 6] The method according to any one of the preceding clauses 1 to 5, wherein the step of creating the set of code units comprises the steps of combining the metadata binary value and the character binary value into a combined binary value by padding zero or more bits between the metadata binary value and the character binary value, and determining the set of unique code units including the combined binary value according to the formatting scheme.

[0062] [Clause 7] The method according to clause 6, wherein the padding is performed such that the combined binary value has a number of bits greater than the maximum number of bits that represent the character set.

[0063] [Clause 8] The method of any of the preceding clauses 1 to 7, further comprising the steps of parsing an electronic document to identify sensitive tokens in the electronic document, and repeating the method for each of the sensitive tokens as the specific token, wherein the sensitive tokens are identified based on security access criteria, and the storage of each sensitive token includes replacing one or more existing code units of the sensitive token in the electronic document with one or more sets of the code units of the sensitive token. For example, the method of clause 8 may involve scanning an electronic document to identify sensitive tokens according to security access criteria. For each sensitive token found, the method of clause 1 is applied to that sensitive token, wherein the term “specific token” in clause 1 is understood to mean “sensitive token.”

[0064] [Clause 9] The method according to any of the preceding clauses 1 to 7, wherein the aforementioned specific token is a database attribute value stored in the database.

[0065] [Clause 10] The method according to any of the preceding clauses 1 to 9, wherein the character set is one of UNICODE, ASCII, or ISO-8859-1.

[0066] [Clause 11] The method according to any of the preceding clauses 1 to 10, wherein the formatting scheme is a formatting scheme which is either a UTF encoding scheme or an ASCII encoding scheme.

[0067] [Clause 12] The method according to any of the preceding clauses 1 to 11, wherein the formatting scheme is a formatting scheme that is a UTF-8 encoding scheme, and the set of code units is longer than 4 code units.

[0068] [Clause 13] The method according to Clause 12, wherein the set of code units is 5 or 6 code units.

[0069] [Clause 14] The method according to any of the preceding clauses 1 to 13, wherein each code unit is a byte.

[0070] [Clause 15] The method according to any of the preceding clauses 1 to 14, further comprising the step of encrypting the particular token, wherein the characters of the particular token are the encrypted characters of the encrypted particular token. For example, the method of clause 15 may specify that the particular token encoded in clause 1 is an encrypted version of the original token. Physically, the encrypted token and the original token may be similar in format or structure; however, they are semantically different, since the encrypted token may conceal the meaning of the original token, making it unreadable without decryption.
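As an illustration of the document-level flow of clause 8, the following sketch scans a text, treats tokens found in a fixed word list as sensitive, and replaces the bytes of each sensitive token with 5-byte sets carrying metadata. The word list standing in for the security access criteria, the metadata value, the bit widths, and the names `encode5` and `protect` are all assumptions made for the example.

```python
import re

SENSITIVE = {"Miami"}   # hypothetical stand-in for security access criteria
METADATA = 0b10110      # hypothetical 5-bit metadata binary value
CHAR_BITS = 7           # assumed character width (ASCII range)
PAD_BITS = 10           # zero bits between metadata and character


def encode5(value: int) -> bytes:
    """Five-byte set: first byte 111110xx, then four bytes 10xxxxxx (26 payload bits)."""
    if not 0 <= value < (1 << 26):
        raise ValueError("value does not fit in 26 bits")
    tail = [0x80 | ((value >> shift) & 0x3F) for shift in (18, 12, 6, 0)]
    return bytes([0xF8 | (value >> 24)] + tail)


def encode_sensitive_char(ch: str) -> bytes:
    """Combine the metadata with one character and format it as a 5-byte set."""
    code = ord(ch)
    if code >> CHAR_BITS:
        raise ValueError("this sketch only handles 7-bit characters")
    return encode5((METADATA << (PAD_BITS + CHAR_BITS)) | code)


def protect(text: str) -> bytes:
    """Replace the code units of each sensitive token; keep everything else as UTF-8."""
    out = bytearray()
    for piece in re.findall(r"\w+|\W+", text):
        if piece in SENSITIVE:
            for ch in piece:
                out += encode_sensitive_char(ch)
        else:
            out += piece.encode("utf-8")
    return bytes(out)


stored = protect("Lives in Miami now")
print(len(stored), stored[:9], stored[-4:])
```

In this sketch the non-sensitive text remains ordinary UTF-8, while each character of the sensitive token occupies five bytes in place of its original code unit.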

[0071] The computing environment 800 includes an example of an environment for executing at least some of the computer code involved in carrying out the method of the present invention, such as code 900 for encoding data using an encoding scheme. In addition to block 900, the computing environment 800 includes, for example, a computer 801, a wide area network (WAN) 802, an end user device (EUD) 803, a remote server 804, a public cloud 805, and a private cloud 806. In this embodiment, the computer 801 includes a processor set 810 (including processing circuitry 820 and a cache 821), a communication fabric 811, volatile memory 812, persistent storage 813 (including an operating system 822 and block 900 identified above), a peripheral device set 814 (including a user interface (UI) device set 823, storage 824, and an Internet of Things (IoT) sensor set 825), and a network module 815. The remote server 804 includes a remote database 830. The public cloud 805 includes a gateway 840, a cloud orchestration module 841, a host physical machine set 842, a virtual machine set 843, and a container set 844.

[0072] Computer 801 may take the form of a desktop computer, laptop computer, tablet computer, smartphone, smartwatch or other wearable computer, mainframe computer, quantum computer, or any other form of computer or mobile device currently known or to be developed in the future that can run programs, access networks, or query databases such as the remote database 830. As is well understood in the field of computer technology, and depending on the technology, the performance of a computer-implemented method may be distributed among multiple computers and / or multiple locations. In this description of the computing environment 800, however, in order to simplify the explanation as much as possible, the detailed discussion will focus on a single computer, specifically computer 801. Although computer 801 is not shown in a cloud in Figure 5, it may be located in a cloud. Conversely, computer 801 is not required to be located in a cloud, except to any extent that may be affirmatively indicated.

[0073] The processor set 810 includes one or more computer processors of any type currently known or to be developed in the future. The processing circuitry 820 may be distributed across multiple packages, for example, multiple interconnected integrated circuit chips. The processing circuitry 820 may implement multiple processor threads and / or multiple processor cores. The cache 821 is memory located within the processor chip package and is typically used for data or code that should be available for high-speed access by threads or cores running on the processor set 810. The cache memory is typically organized into multiple levels depending on its relative proximity to the processing circuitry. Alternatively, some or all of the cache for the processor set may be located "off-chip". In some computing environments, the processor set 810 may operate using qubits and be designed to perform quantum computing.

[0074] Computer-readable program instructions are typically loaded onto computer 801, causing the processor set 810 of computer 801 to perform a series of operational steps, thereby executing a computer-implemented method. Instructions thus executed will instantiate the methods specified in the flowcharts and / or descriptions of the computer-implemented methods contained in this document (collectively referred to as the "Methods of the Invention"). These computer-readable program instructions are stored in various types of computer-readable storage media, such as cache 821 and other storage media discussed below. The program instructions and associated data are accessed by the processor set 810 to control and direct the implementation of the Methods of the Invention. In computing environment 800, at least some of the instructions for implementing the Methods of the Invention may be stored in block 900 within persistent storage 813.

[0075] The communication fabric 811 is a signal conduction path that enables various components of the computer 801 to communicate with one another. Typically, this fabric is made up of switches and conductive paths, such as buses, bridges, physical input / output ports, and similar components. Other types of signal communication paths, such as fiber optic communication paths and / or wireless communication paths, may be used.

[0076] Volatile memory 812 is any type of volatile memory currently known or to be developed in the future. Examples include dynamic random-access memory (RAM) or static RAM. Typically, volatile memory 812 is characterized by random access, but this is not mandatory unless otherwise indicated. In computer 801, volatile memory 812 is located within a single package and resides inside computer 801, but alternatively or additionally, volatile memory may be distributed across multiple packages and / or located externally to computer 801.

[0077] Persistent storage 813 is any form of non-volatile storage for a computer that is currently known or may be developed in the future. The non-volatility of this storage means that the stored data is maintained regardless of whether power is supplied to computer 801 and / or directly to the persistent storage 813. Persistent storage 813 may be read-only memory (ROM), but typically at least a portion of the persistent storage allows for writing, deleting, and rewriting of data. Some well-known forms of persistent storage include magnetic disks and solid-state storage devices. Operating system 822 can take several forms, such as various known proprietary operating systems employing a kernel or open-source portable operating system interface type operating systems. The code contained in block 900 typically includes at least some of the computer code involved in carrying out the methods of the present invention.

[0078] The peripheral device set 814 includes a set of peripheral devices for computer 801. Data communication connections between computer 801's peripheral devices and other components can be implemented in various ways, such as Bluetooth® connections, Near-Field Communication (NFC) connections, connections made by cables (such as universal serial bus (USB) type cables), insert-type connections (e.g., secure digital (SD) cards), connections made through local area communication networks, and even connections made through wide area networks such as the Internet. In various embodiments, the UI device set 823 may include components such as a display screen, speakers, microphones, wearable devices (such as goggles and smartwatches), keyboards, mice, printers, touchpads, game controllers, and haptic devices. Storage 824 is external storage such as an external hard drive, or insertable storage such as an SD card. Storage 824 may be persistent and / or volatile. In some embodiments, storage 824 may take the form of a quantum computing storage device for storing data in the form of qubits. In embodiments where computer 801 is required to have a large amount of storage (for example, computer 801 locally stores and manages a large database), this storage may be provided by peripheral storage devices designed to store very large amounts of data, such as a storage area network (SAN) shared by multiple geographically distributed computers. The IoT sensor set 825 consists of sensors that may be used in Internet of Things applications. For example, one sensor may be a thermometer and another may be a motion detector.

[0079] The network module 815 is a collection of computer software, hardware, and firmware that enables computer 801 to communicate with other computers via the WAN 802. The network module 815 may include hardware such as a modem or Wi-Fi® signal transceiver, software for packetizing and / or depacketizing data for communication network transmission, and / or web browser software for communicating data over the Internet. In some embodiments, the network control and network forwarding functions of the network module 815 are performed on the same physical hardware device. In other embodiments (e.g., embodiments utilizing software-defined networking (SDN)), the control and forwarding functions of the network module 815 are performed on physically separate devices, so that the control function manages several different network hardware devices. Computer-readable program instructions for carrying out the method of the present invention can typically be downloaded from an external computer or external storage device to computer 801 via a network adapter card or network interface included in the network module 815.

[0080] WAN 802 is any wide area network (e.g., the Internet) that can transmit computer data over non-local distances using any technology currently known or to be developed for transmitting computer data. In some embodiments, WAN 802 may be replaced and / or complemented by a local area network (LAN), such as a Wi-Fi® network, designed to transmit data between devices located in a local area. WANs and / or LANs typically include computer hardware such as copper transmission cables, optical transmission fibers, wireless transmissions, routers, firewalls, switches, gateway computers, and edge servers.

[0081] The end-user device (EUD) 803 is any computer system used and controlled by an end-user (e.g., a customer of the company operating computer 801) and can take any of the forms discussed above in relation to computer 801. EUD 803 typically receives useful and valuable data from the operation of computer 801. For example, in a hypothetical case where computer 801 is designed to provide recommendations to the end-user, these recommendations would typically be communicated from the network module 815 of computer 801 to EUD 803 via WAN 802. Thus, EUD 803 can display or otherwise present recommendations to the end-user. In some embodiments, EUD 803 may be a client device such as a thin client, heavy client, mainframe computer, or desktop computer.

[0082] The remote server 804 is any computer system that provides at least some data and / or functionality to computer 801. The remote server 804 may be controlled and used by the same entity that operates computer 801. The remote server 804 represents a machine that collects and stores useful and valuable data for use by other computers, such as computer 801. For example, in a hypothetical case where computer 801 is designed and programmed to provide recommendations based on historical data, this historical data may be provided to computer 801 from the remote database 830 of the remote server 804.

[0083] Public Cloud 805 is any computer system available for use by multiple entities, providing on-demand availability of computer system resources and / or other computing capabilities, particularly data storage (cloud storage) and computing power, without direct active management by the user. Cloud computing typically leverages resource sharing to achieve coherence and economies of scale. Direct active management of computing resources in Public Cloud 805 is performed by the computer hardware and / or software of the cloud orchestration module 841. The computing resources provided by Public Cloud 805 are typically implemented by virtual computing environments running on various computers that make up the host physical machine set 842, which is a collection of physical computers located within and / or available to Public Cloud 805. A virtual computing environment (VCE) typically takes the form of virtual machines from the virtual machine set 843 and / or containers from the container set 844. These VCEs can be stored as images and transferred among and between various physical machine hosts, either as images or after VCE instantiation. The cloud orchestration module 841 manages the transfer and storage of images, deploys new VCE instantiations, and manages active instantiations of VCE deployments. The gateway 840 is a collection of computer software, hardware, and firmware that enables Public Cloud 805 to communicate over the WAN 802.

[0084] Here, some further explanation of virtualized computing environments (VCEs) is provided. A VCE can be stored as an "image." A new active instance of a VCE can be instantiated from an image. Two well-known types of VCEs are virtual machines and containers. A container is a VCE that uses operating system-level virtualization. This refers to an operating system feature where the kernel allows for the existence of multiple isolated user-space instances called containers. These isolated user-space instances typically behave like actual computers from the perspective of the programs running within them. Computer programs running on a typical operating system can utilize all the resources of that computer, such as connected devices, files and folders, network shares, CPU power, and quantifiable hardware capabilities. However, programs running inside a container can only use the contents of the container and the devices allocated to that container; this feature is known as containerization.

[0085] Private Cloud 806 is similar to Public Cloud 805, except that its computing resources are available only for use by a single enterprise. While Private Cloud 806 is shown as being in communication with WAN 802, in other embodiments, a private cloud may be completely isolated from the internet and accessible only through a local / private network. A hybrid cloud is a combination of multiple clouds of different types (e.g., private, community, or public cloud types), often implemented by different vendors. Each of the multiple clouds remains a separate, discrete entity, but the larger hybrid cloud architecture is coupled together by standardized or proprietary technologies that enable orchestration, management, and / or data / application portability between the multiple constituent clouds. In this embodiment, both Public Cloud 805 and Private Cloud 806 are part of a larger hybrid cloud.

[0086] Cloud computing services and / or microservices (not shown individually in Figure 5): Private and public clouds are programmed and configured to provide cloud computing services and / or microservices (unless otherwise indicated, the term “microservices” shall be interpreted as including larger “services,” regardless of size). Cloud services are typically infrastructure, platforms, or software hosted by a third-party provider and made available to users over the internet. Cloud services facilitate the flow of user data from front-end clients (e.g., user-side servers, tablets, desktops, laptops) to the provider’s systems and back over the internet. In some embodiments, cloud services may be configured and orchestrated according to the “as a service” technology paradigm, where something is presented to internal or external customers in the form of cloud computing services. An as-a-service offering typically provides endpoints that various customers interface with. These endpoints are typically based on a set of APIs. One category of as-a-service offerings is Platform as a Service (PaaS), where a service provider provisions, instantiates, runs, and manages modular bundles of code that customers can use to instantiate a computing platform and one or more applications without the complexity of building and maintaining the infrastructure typically associated with them. Another category is Software as a Service (SaaS), where software is centrally hosted and allocated on a subscription basis. SaaS is also known as on-demand software, web-based software, or web-hosted software. The four technical sub-areas involved in cloud services are: deployment, integration, on-demand, and virtual private networks.

[0087] Various aspects of this disclosure are described by explanatory text, flowcharts, block diagrams of computer systems, and / or block diagrams of machine logic included in embodiments of computer program products (CPPs). With respect to any flowchart, depending on the technology involved, operations may be performed in a different order than those shown in a given flowchart. For example, also depending on the technology involved, two operations shown in consecutive blocks of a flowchart may be performed in reverse order, as a single integrated stage, simultaneously, or with at least partial time overlap.

[0088] Embodiments of a computer program product ("CPP Embodiment" or "CPP") are terms used in this disclosure to describe any set of one or more storage media (also called "mediums") that are collectively comprised of a set of one or more storage devices that collectively contain machine-readable code corresponding to instructions and / or data for performing computer operations specified in a given CPP claim. A "storage device" is any tangible device capable of holding and storing instructions for use by a computer processor. Computer-readable storage media may be, but are not limited to, electronic storage media, magnetic storage media, optical storage media, electromagnetic storage media, semiconductor storage media, mechanical storage media, or any suitable combination of those described above. Some known types of storage devices, including these media, include diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile discs (DVDs), memory sticks, floppy disks, mechanically encoded devices (such as pits / lands formed on the main surface of a punch card or disk), or any suitable combination of the foregoing. When the term "computer-readable storage medium" is used in this disclosure, it shall not be construed as storage in the form of a transient signal itself, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides, optical pulses passing through optical fiber cables, electrical signals communicated through wires, and / or other transmission media. As will be understood by those skilled in the art, data is typically moved at several intermittent points during the normal operation of a storage device, such as during access, defragmentation, or garbage collection. 
However, since data is not transient while it is stored, this does not mean that the storage device is transient.

[Item 1] A method for encoding character data using an encoding scheme, wherein the encoding scheme is configured to represent the original binary value of each character in a predefined character set in a unique set of one or more code units of the formatting scheme, according to a formatting scheme, the original binary value is obtained by a binary encoding technique, and the method comprises, for a particular token: representing metadata describing the particular token as a binary value called a metadata binary value; for each character within the particular token: representing the character as a binary value called a character binary value; creating a set of unique code units containing at least a portion of the metadata binary value and the character binary value, from the at least a portion of the metadata binary value and the character binary value, in accordance with the formatting scheme; and storing the particular token by storing one or more sets of the resulting code units.

[Item 2] The method according to item 1, wherein at least a portion of the metadata binary value is a unique part of the metadata binary value associated with the character.

[Item 3] The method according to item 1, wherein at least a portion of the metadata binary value is the metadata binary value.

[Item 4] The method according to item 1, wherein the encoding scheme is a predefined multibyte encoding scheme, and the representation is performed such that the concatenation of at least a portion of the metadata binary value and the character binary value represents an existing character in the multibyte area of the character set according to the binary encoding technique, thereby replacing the existing character with the associated at least a portion of the character and the metadata.

[Item 5] The method according to item 1, wherein the encoding scheme is a predefined multibyte encoding scheme, and the representation is performed such that the concatenation of at least a portion of the metadata binary value and the character binary value represents unused characters outside the multibyte area of the character set, in accordance with the binary encoding technique.

[Item 6] The method according to item 1, wherein the step of creating the set of code units comprises the steps of combining the metadata binary value and the character binary value into a combined binary value by padding zero or more bits between the metadata binary value and the character binary value, and determining the set of unique code units including the combined binary value according to the formatting scheme.

[Item 7] The method according to item 6, wherein the padding is performed such that the combined binary value has a number of bits greater than the maximum number of bits that represent the character set.

[Item 8] The method according to item 1, further comprising the steps of parsing an electronic document to identify sensitive tokens in the electronic document, and repeating the method for each of the sensitive tokens as the specific token, wherein the sensitive tokens are identified based on security access criteria, and the storage of each sensitive token includes replacing one or more existing code units of the sensitive token in the electronic document with one or more sets of the code units of the sensitive token.

[Item 9] The method according to item 1, wherein the character set includes one of UNICODE (trademark), ASCII, or ISO-8859-1.

[Item 10] The method according to item 1, wherein the formatting scheme is a formatting scheme which is either a UTF encoding scheme or an ASCII encoding scheme.

[Item 11] The method according to item 1, wherein the formatting scheme is a formatting scheme that is a UTF-8 encoding scheme, and the set of code units is longer than 4 code units.

[Item 12] The method according to item 11, wherein the set of code units is 5 or 6 code units.

[Item 13] The method according to item 1, wherein each of the code units is a byte.

[Item 14] The method according to item 1, further comprising the step of encrypting the particular token before the representation of the characters of the particular token, wherein the characters of the particular token are the encrypted characters of the encrypted particular token.

[Item 15] The method according to item 1, wherein the specific token is a database attribute value stored in the database.

[Item 16] A computer program product for encoding character data using an encoding scheme, wherein the encoding scheme is configured to represent the original binary value of each character in a predefined character set in a unique set of one or more code units of the formatting scheme, according to a formatting scheme, the original binary value is obtained by a binary encoding technique, and the computer program product comprises a computer-readable storage medium in which computer-readable program code is embodied, the computer-readable program code being configured to perform, for a particular token, actions including: representing metadata describing the particular token as a binary value called a metadata binary value; for each character within the particular token: representing the character as a binary value called a character binary value; creating a set of unique code units containing at least a portion of the metadata binary value and the character binary value, from the at least a portion of the metadata binary value and the character binary value, in accordance with the formatting scheme; and storing the particular token by storing one or more sets of the resulting code units.

[Item 17] A computer system for encoding character data using an encoding scheme, wherein the encoding scheme is configured to represent the original binary value of each character in a predefined character set in a unique set of one or more code units of the formatting scheme, according to a formatting scheme, the original binary value is obtained by a binary encoding technique, and the computer system comprises: a processor set; one or more computer-readable storage media; and program instructions stored on the one or more computer-readable storage media that cause the processor set to perform, for a particular token, actions including: representing metadata describing the particular token as a binary value called a metadata binary value; for each character within the particular token: representing the character as a binary value called a character binary value; creating a set of unique code units containing at least a portion of the metadata binary value and the character binary value, from the at least a portion of the metadata binary value and the character binary value, in accordance with the formatting scheme; and storing the particular token by storing one or more sets of the resulting code units.

[Item 18] The computer system according to item 17, wherein at least a portion of the metadata binary value is a unique part of the metadata binary value associated with the character.

[Item 19] The computer system according to item 17, wherein at least a portion of the metadata binary value is the metadata binary value.

[Item 20] The computer system according to item 17, wherein the encoding scheme is a predefined multibyte encoding scheme, and the representation is performed such that the concatenation of at least a portion of the metadata binary value and the character binary value represents an existing character in the multibyte area of the character set according to the binary encoding technique, thereby replacing the existing character with the associated at least a portion of the character and the metadata.

Claims

1. A method for encoding character data using an encoding scheme, wherein the encoding scheme is configured to represent the original binary value of each character of a predefined character set in a unique set of one or more code units of the encoding scheme according to a formatting scheme, the original binary value being obtained by a binary encoding technique, the method comprising, for a particular token: a step of representing metadata describing the particular token as a binary value, called a metadata binary value; for each character in the particular token: a step of representing the character as a binary value, called a character binary value; and a step of creating, from at least a portion of the metadata binary value and the character binary value, a unique set of code units containing the at least a portion of the metadata binary value and the character binary value, in accordance with the formatting scheme; and a step of storing the particular token by storing the one or more resulting sets of code units.
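
The steps of claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the fixed-width, four-byte formatting scheme, the bit widths `CHAR_BITS` and `META_BITS`, and the function names `encode_token` and `decode_token` are all assumptions made for illustration. The character binary value is taken to be the Unicode code point, and the full metadata binary value is attached to every character.

```python
import struct

CHAR_BITS = 24  # assumed width reserved for the character binary value
META_BITS = 8   # assumed width reserved for the metadata binary value


def encode_token(token: str, metadata: int) -> bytes:
    """Store `token` as one four-byte set of code units per character."""
    if not 0 <= metadata < (1 << META_BITS):
        raise ValueError("metadata does not fit in META_BITS")
    out = bytearray()
    for ch in token:
        char_value = ord(ch)                             # character binary value
        combined = (metadata << CHAR_BITS) | char_value  # metadata + character
        out += struct.pack(">I", combined)               # four code units (bytes)
    return bytes(out)


def decode_token(data: bytes) -> tuple[str, int]:
    """Recover the token and its metadata from the stored code units."""
    chars = []
    metadata = 0
    for (combined,) in struct.iter_unpack(">I", data):
        metadata = combined >> CHAR_BITS
        chars.append(chr(combined & ((1 << CHAR_BITS) - 1)))
    return "".join(chars), metadata


stored = encode_token("Bob", metadata=0x05)
print(stored.hex(" ", 4))    # 05000042 0500006f 05000062
print(decode_token(stored))  # ('Bob', 5)
```

Because the metadata travels inside every set of code units, a reader of the stored bytes can recover both the token and its description without a separate lookup table.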

2. The method according to claim 1, wherein the at least a portion of the metadata binary value is a unique portion of the metadata binary value that is associated with the character.

3. The method according to claim 1, wherein the at least a portion of the metadata binary value is the entire metadata binary value.

4. The method according to claim 1, wherein the encoding scheme is a predefined multibyte encoding scheme, and the representation is performed such that the concatenation of the at least a portion of the metadata binary value and the character binary value represents an existing character in the multibyte area of the character set according to the binary encoding technique, thereby replacing the existing character with the character and the associated at least a portion of the metadata.

5. The method according to claim 1, wherein the encoding scheme is a predefined multibyte encoding scheme, and the representation is performed such that the concatenation of the at least a portion of the metadata binary value and the character binary value represents an unused character outside the multibyte area of the character set, in accordance with the binary encoding technique.
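
The difference between claims 4 and 5 is where the concatenated value lands relative to the character set. The following Python sketch assumes Unicode as the character set, where code points above U+007F need more than one UTF-8 byte and U+10FFFF is the highest code point. The helper names `combine` and `classify` and the chosen bit widths are hypothetical, and the check is simplified: it ignores the surrogate range and unassigned code points.

```python
UNICODE_MAX = 0x10FFFF   # highest code point of the Unicode character set
SINGLE_BYTE_MAX = 0x7F   # code points above this need several UTF-8 bytes


def combine(metadata: int, ch: str, char_bits: int) -> int:
    """Concatenate metadata bits in front of a character value of `char_bits` bits."""
    value = ord(ch)
    if value >= (1 << char_bits):
        raise ValueError("character does not fit in char_bits")
    return (metadata << char_bits) | value


def classify(combined: int) -> str:
    """Report which area of the character set the concatenation falls in."""
    if combined <= SINGLE_BYTE_MAX:
        return "single-byte"
    if combined <= UNICODE_MAX:
        return "multibyte"   # claim 4: an existing character is replaced
    return "outside"         # claim 5: value unused by the character set


# 7-bit ASCII character with 3 metadata bits: lands on U+02C1 (claim 4 case).
print(hex(combine(0b101, "A", 7)), classify(combine(0b101, "A", 7)))
# Same metadata placed above 21 character bits: beyond U+10FFFF (claim 5 case).
print(hex(combine(0b101, "A", 21)), classify(combine(0b101, "A", 21)))
```

In the first case the stored value collides with a character that already exists, so that character can no longer be used for its original meaning. In the second case no existing character is displaced, at the cost of values that a standard decoder for the character set would not accept.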

6. The method according to claim 1, wherein the step of creating the set of code units comprises the steps of combining the metadata binary value and the character binary value into a combined binary value by padding zero or more bits between the metadata binary value and the character binary value, and determining the set of unique code units including the combined binary value according to the formatting scheme.

7. The method according to claim 6, wherein the padding is performed such that the combined binary value has a number of bits greater than the maximum number of bits that represent the character set.
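
The padding of claims 6 and 7 can be sketched in Python as follows. The sketch assumes Unicode, whose code points fit in 21 bits, and assumes a non-zero metadata value so that the combined value is guaranteed to exceed that width. The name `pad_and_combine` is hypothetical.

```python
CHARSET_MAX_BITS = 21  # Unicode code points (up to U+10FFFF) fit in 21 bits


def pad_and_combine(metadata: int, char_value: int) -> tuple[int, int]:
    """Return (combined value, number of zero bits padded between the parts)."""
    if metadata < 1:
        raise ValueError("metadata must be non-zero to exceed the character set width")
    char_len = char_value.bit_length()
    if char_len > CHARSET_MAX_BITS:
        raise ValueError("character value is outside the character set")
    pad = CHARSET_MAX_BITS - char_len            # zero or more padding bits
    combined = (metadata << (char_len + pad)) | char_value
    return combined, pad


combined, pad = pad_and_combine(metadata=0b11, char_value=ord("A"))
print(pad, hex(combined), combined.bit_length())  # 14 0x600041 23
```

With the metadata always starting at bit 21, a decoder can split the two parts with a fixed shift and mask, regardless of how many padding bits a given character needed.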

8. The method according to claim 1, further comprising the steps of parsing an electronic document to identify confidential tokens in the electronic document, and repeating the method for each of the confidential tokens as the particular token, wherein the confidential tokens are identified based on security access criteria, and the storing of each confidential token includes replacing one or more existing code units of the confidential token in the electronic document with the one or more sets of code units of the confidential token.
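
A minimal Python sketch of the document-level procedure of claim 8 is given below. Everything specific in it is an assumption: the security access criterion is reduced to a regular expression for digit groups shaped like 123-45-6789, `META_CONFIDENTIAL` is a made-up metadata value, and `encode_token` is a stand-in encoder that writes four bytes per character with the metadata in the top byte. A real formatting scheme would also need to make the replaced spans distinguishable from the surrounding text, which this stand-in does not attempt.

```python
import re
import struct

# Hypothetical security access criterion: tokens shaped like 123-45-6789.
CONFIDENTIAL = re.compile(rb"\b\d{3}-\d{2}-\d{4}\b")
META_CONFIDENTIAL = 0x01  # hypothetical metadata binary value


def encode_token(token: bytes, metadata: int) -> bytes:
    """Stand-in encoder: four code units per character, metadata in the top byte."""
    return b"".join(
        struct.pack(">I", (metadata << 24) | ord(ch))
        for ch in token.decode("utf-8")
    )


def protect(document: bytes) -> bytes:
    """Replace the code units of every confidential token with encoded sets."""
    return CONFIDENTIAL.sub(
        lambda match: encode_token(match.group(0), META_CONFIDENTIAL),
        document,
    )


doc = "SSN: 123-45-6789.".encode("utf-8")
protected = protect(doc)
print(len(doc), len(protected))  # 17 50
```

Only the matched spans change; the rest of the document keeps its original code units.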

9. The method according to claim 1, wherein the character set includes one of UNICODE®, ASCII, or ISO-8859-1.

10. The method according to claim 1, wherein the formatting scheme is either a UTF encoding scheme or an ASCII encoding scheme.

11. The method according to claim 1, wherein the formatting scheme is a UTF-8 encoding scheme, and the set of code units is longer than four code units.

12. The method according to claim 11, wherein the set of code units is 5 or 6 code units.
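
Claims 11 and 12 refer to sets of five or six UTF-8 code units. The original UTF-8 definition (RFC 2279) included such forms: a five-byte form with lead byte `111110xx` carrying 26 bits, and a six-byte form with lead byte `1111110x` carrying 31 bits. The current definition (RFC 3629) stops at four bytes, so these forms are unused by standard text. The Python sketch below follows that historical byte layout; whether the patent uses exactly this layout is an assumption, and the function names are hypothetical.

```python
def encode_extended_utf8(value: int) -> bytes:
    """Encode `value` in the five- or six-byte form of the original UTF-8 layout."""
    if 0x200000 <= value <= 0x3FFFFFF:
        length, lead = 5, 0xF8       # 111110xx + four continuation bytes
    elif 0x4000000 <= value <= 0x7FFFFFFF:
        length, lead = 6, 0xFC       # 1111110x + five continuation bytes
    else:
        raise ValueError("value does not need a five- or six-byte form")
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (value & 0x3F))  # continuation byte 10xxxxxx
        value >>= 6
    return bytes([lead | value] + tail[::-1])


def decode_extended_utf8(data: bytes) -> int:
    """Inverse of encode_extended_utf8 for one set of code units."""
    lead = data[0]
    if len(data) == 5 and lead & 0xFC == 0xF8:
        value = lead & 0x03
    elif len(data) == 6 and lead & 0xFE == 0xFC:
        value = lead & 0x01
    else:
        raise ValueError("not a five- or six-byte sequence")
    for byte in data[1:]:
        if byte & 0xC0 != 0x80:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


combined = (0b11 << 21) | ord("A")      # metadata placed above 21 character bits
units = encode_extended_utf8(combined)
print(units.hex(" "))                   # f8 98 80 81 81
print(decode_extended_utf8(units) & 0x1FFFFF == ord("A"))  # True
```

Python's built-in UTF-8 codec rejects these sequences, so a system relying on them would need its own encoder and decoder along these lines.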

13. The method according to claim 1, wherein each of the code units is a byte.

14. The method according to claim 1, further comprising the step of encrypting the particular token before the representation of the characters of the particular token, wherein the characters of the particular token are the encrypted characters of the encrypted particular token.

15. The method according to any one of claims 1 to 14, wherein the particular token is a database attribute value stored in a database.

16. A computer program for encoding character data using an encoding scheme, wherein the encoding scheme is configured to represent the original binary value of each character of a predefined character set in a unique set of one or more code units of the encoding scheme according to a formatting scheme, the original binary value being obtained by a binary encoding technique, the computer program causing a computer to execute, for a particular token: a procedure for representing metadata describing the particular token as a binary value, called a metadata binary value; for each character in the particular token: a procedure for representing the character as a binary value, called a character binary value; and a procedure for creating, from at least a portion of the metadata binary value and the character binary value, a unique set of code units containing the at least a portion of the metadata binary value and the character binary value, in accordance with the formatting scheme; and a procedure for storing the particular token by storing the one or more resulting sets of code units.

17. A computer system for encoding character data using an encoding scheme, wherein the encoding scheme is configured to represent the original binary value of each character of a predefined character set in a unique set of one or more code units of the encoding scheme according to a formatting scheme, the original binary value being obtained by a binary encoding technique, the computer system comprising: a processor set; one or more computer-readable storage media; and program instructions stored on the one or more computer-readable storage media that cause the processor set to perform, for a particular token, operations including: a procedure for representing metadata describing the particular token as a binary value, called a metadata binary value; for each character in the particular token: a procedure for representing the character as a binary value, called a character binary value; and a procedure for creating, from at least a portion of the metadata binary value and the character binary value, a unique set of code units containing the at least a portion of the metadata binary value and the character binary value, in accordance with the formatting scheme; and a procedure for storing the particular token by storing the one or more resulting sets of code units.

18. The computer system according to claim 17, wherein the at least a portion of the metadata binary value is a unique portion of the metadata binary value that is associated with the character.

19. The computer system according to claim 17, wherein the at least a portion of the metadata binary value is the entire metadata binary value.

20. The computer system according to any one of claims 17 to 19, wherein the encoding scheme is a predefined multibyte encoding scheme, and the representation is performed such that the concatenation of the at least a portion of the metadata binary value and the character binary value represents an existing character in the multibyte area of the character set according to the binary encoding technique, thereby replacing the existing character with the character and the associated at least a portion of the metadata.