A method for low-resource language word segmentation model training and cross-model vocabulary transfer injection

By combining self-training and cross-model vocabulary transfer injection methods, an injectable list is generated and the base segmenter vocabulary is expanded, which solves the encoding fragmentation problem in low-resource language vocabulary expansion and improves encoding efficiency and generation quality.

CN122197877A · Pending · Publication Date: 2026-06-12 · NANDA SHUAN (TIANJIN) TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
NANDA SHUAN (TIANJIN) TECHNOLOGY CO LTD
Filing Date
2026-05-09
Publication Date
2026-06-12

Figure CN122197877A_ABST

Abstract

The application relates to the technical field of natural language processing and large language models, and in particular to a method for low-resource language word segmentation model training and cross-model vocabulary transfer injection. The method trains a low-resource language word segmentation model on a corpus obtained by preprocessing target low-resource language text, yielding a first candidate vocabulary; extracts the vocabulary of an external low-resource language model or word segmenter as a second candidate vocabulary; merges and deduplicates the candidate tokens from these sources and filters them for injectability to form an injectable list; and appends the injectable list to the base model segmenter while synchronously updating the segmenter configuration. The application thereby achieves low-cost encoding of the low-resource language without replacing the base model's segmentation system as a whole, and improves the word segmentation effect.

Description

Technical Field

[0001] This invention relates to the fields of natural language processing and large language model technology, and in particular to a method for training a low-resource language segmentation model and cross-model vocabulary transfer injection.

Background Technology

[0002] Large language models typically use a base segmenter as the fundamental component for input encoding and output decoding. The base segmenter decomposes continuous text into a sequence of tokens using its vocabulary, and uses the token indices as discrete input to the model. During the generation phase, the model outputs the probability distribution of the tokens, which is then reconstructed into text by the base segmenter. The coverage, granularity, and stability of the base segmenter and its vocabulary directly determine the model's encoding efficiency, semantic capacity, and inference speed for a specific language.

[0003] For low-resource languages like Tibetan, the initial vocabulary of a general-purpose large language model is usually constructed primarily from high-resource languages, so it often lacks sufficient reusable tokens for the target language's characters, syllables, or lexical variations. This leads to fragmentation during encoding: a word or syllable structure at the natural language level is broken into numerous fragment tokens, significantly increasing the length of the token sequence. With many fragment tokens, the fixed context length squeezes out effective content. The longer token sequence also increases the number of self-attention computations and decoding steps, and therefore inference latency and memory usage, and it weakens the expression of word formation and lexical rules, hindering the large language model from learning stable semantics and ultimately degrading downstream generation quality.

[0004] Therefore, it is urgent to design a low-resource language segmentation model training and cross-model vocabulary transfer injection method to solve the above-mentioned technical defects.

Summary of the Invention

[0005] The purpose of this invention is to provide, as an improvement over existing technical solutions, a cross-model vocabulary transfer injection method. The method enables the joint use of a vocabulary obtained by self-training on target low-resource language text and a vocabulary extracted from an external low-resource language model or word segmenter, without replacing the base segmenter's segmentation system as a whole. Furthermore, it allows injectable tokens to be added to the base model vocabulary while ensuring closed-loop consistency of encoding and decoding as well as system compatibility.

[0006] To achieve the above objectives, the present invention provides the following technical solution: a method for training a low-resource language segmentation model and performing cross-model vocabulary transfer injection, including the following steps:
Step S1: Collect and preprocess the target low-resource language texts to obtain a training corpus;
Step S2: Train a low-resource language segmentation model based on the training corpus to obtain the first candidate word list;
Step S3: Determine the vocabulary list exported by a low-resource language model or word segmenter that is in the same language as the target low-resource language text as the second candidate word list;
Step S4: Merge the first candidate word list and the second candidate word list to obtain the total candidate set, and filter the total candidate set to generate an injectable list;
Step S5: Append the injectable list to the base segmenter of the large language model that is in the same language as the target low-resource language text.
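The merge in step S4 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `merge_candidates`, the source labels, and the use of Unicode NFC normalization before deduplication are all assumptions introduced here.

```python
import unicodedata
from collections import OrderedDict


def merge_candidates(first_list, second_list):
    """Merge two candidate word lists, dropping duplicates and recording provenance."""
    merged = OrderedDict()  # token -> set of source labels, in first-seen order
    sources = (("self_trained", first_list), ("external", second_list))
    for label, tokens in sources:
        for token in tokens:
            # Assumption: NFC-normalize so canonically equivalent strings deduplicate.
            key = unicodedata.normalize("NFC", token)
            if key:  # skip empty strings
                merged.setdefault(key, set()).add(label)
    return merged


# Tibetan sample tokens written as escapes to avoid encoding ambiguity.
BOD, SKAD, YIG, DEB = "\u0f56\u0f7c\u0f51", "\u0f66\u0f90\u0f51", "\u0f61\u0f72\u0f42", "\u0f51\u0f7a\u0f56"
total = merge_candidates([BOD, SKAD, YIG], [SKAD, DEB])
print(len(total), "unique candidates")
```

Keeping the source labels is optional; it is included because the description later mentions setting different weights and priorities for the two paths.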

[0007] The above technical solution produces the following technical effects: it combines a self-trained low-resource language segmentation model with cross-model vocabulary transfer extraction. When data is abundant, this yields a list of injectable tokens that closely matches the target low-resource language text; when data is insufficient or reuse is desired, it quickly introduces high-quality external candidates. This achieves a better balance among coverage, system compatibility, and cost.

As a further improvement to the method of low-resource language segmentation model training and cross-model vocabulary transfer injection in this application, in step S1, for target low-resource language text with explicit syllable delimiters, the explicit syllable delimiters are retained during data preprocessing and used in the low-resource language segmentation model training in step S2. In step S2, the low-resource language segmentation model generates a candidate word set from the training corpus using the BPE algorithm and / or the Unigram algorithm. If the candidate word set does not meet the training parameters, the candidate word set is regenerated; if it meets the training parameters, it is determined as the first candidate word list.

[0008] As a further improvement to the low-resource language segmentation model training and cross-model vocabulary transfer injection method of this application, the calculation method of the BPE algorithm includes the following steps:
Step S211: Initialize the training corpus into a pre-segmented unit-level sequence;
Step S212: Count the frequency of occurrence of adjacent unit pairs in the pre-segmented unit-level sequence, sort the adjacent unit pairs by frequency, and identify the high-frequency adjacent unit pairs;
Step S213: Merge the high-frequency adjacent unit pairs to update the pre-segmented unit-level sequence;
Step S214: Check the pre-segmented unit-level sequence from step S213 against the training parameters. If it satisfies the training parameters, the pre-segmented unit-level sequence is determined as the first candidate vocabulary; if it does not, step S212 is re-executed on the pre-segmented unit-level sequence.

[0009] As a further improvement to the low-resource language segmentation model training and cross-model vocabulary transfer injection method of this application, the Unigram algorithm includes the following steps:
Step S221: Based on the language structure of the target low-resource language text, construct a candidate word set from the training corpus and assign initial probability values to the words in the candidate word set;
Step S222: Calculate the probability of each subword in the training corpus, sort the subwords by probability, retain the subwords whose probability meets the probability threshold, and update the candidate subword set;
Step S223: Check the candidate word set updated in step S222 against the training parameters. If it satisfies the training parameters, the candidate word set is determined as the first candidate word list; if it does not, step S222 is repeated on the candidate word set.

[0010] As a further improvement to the method of training a low-resource language segmentation model and cross-model vocabulary transfer injection proposed in this application, the sub-word probability is the proportion of the sub-word frequency to the total sub-word frequency; the sub-word frequency is the frequency of occurrence of the sub-word in the training corpus, and the total sub-word frequency is the sum of the frequencies of all sub-words in the candidate sub-word set.
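Written as a formula, with the symbols f(s) for the frequency of subword s in the training corpus and V for the candidate subword set (both symbols are introduced here for illustration), the definition above reads:

```latex
P(s) = \frac{f(s)}{\sum_{s' \in V} f(s')}, \qquad s \in V
```

As a worked example under assumed counts, a candidate set of three subwords with frequencies 6, 3, and 1 has a total subword frequency of 10, giving probabilities 0.6, 0.3, and 0.1, which sum to 1.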

[0011] As a further improvement to the method of training a low-resource language segmentation model and cross-model vocabulary transfer injection in this application, in step S4, the total candidate set is filtered by at least one of the following: a character set determination rule, a length constraint, and conflict avoidance. Under the character set determination rule, each candidate token in the total candidate set is checked against the pure low-resource language pattern of the target low-resource language text, and tokens that do not satisfy the pattern are filtered out. Under the length constraint, each candidate token is checked against a preset length range, and tokens outside the range are filtered out. Under conflict avoidance, each candidate token is checked for conflicts with the tokens already in the base segmenter, and conflicting tokens are filtered out.

[0012] As a further improvement to the method of training a low-resource language word segmentation model and cross-model vocabulary transfer injection proposed in this application, under the character set determination rule, a candidate token is determined to satisfy the pure low-resource language pattern when all of its characters belong to the character set of the target low-resource language text or to a whitelisted punctuation set. When the number of candidate tokens remaining after filtering by the character set determination rule is lower than the injection threshold, candidate tokens that fail the character set determination rule but pass the other filters are added to the injectable list.
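The three filters and the backfill rule can be sketched together as below. Several details are assumptions made for illustration only: the target character set is taken to be the Unicode Tibetan block (U+0F00 to U+0FFF), the whitelist contains only the space character, length is measured in code points, a "conflict" is treated as an exact match with an existing base token, and backfill stops once the injection threshold is reached.

```python
TIBETAN_BLOCK = (0x0F00, 0x0FFF)   # Unicode Tibetan block
WHITELIST_PUNCT = {" "}            # hypothetical whitelisted punctuation set


def is_pure_target(token):
    """Character set rule: every character is Tibetan or whitelisted."""
    low, high = TIBETAN_BLOCK
    return all(low <= ord(ch) <= high or ch in WHITELIST_PUNCT for ch in token)


def build_injectable(candidates, base_vocab, min_len=2, max_len=12,
                     injection_threshold=0):
    """Apply length constraint, conflict avoidance, then the character set rule."""
    passes_other = [
        t for t in candidates
        if min_len <= len(t) <= max_len   # length constraint (code points)
        and t not in base_vocab           # conflict avoidance (exact match assumed)
    ]
    injectable = [t for t in passes_other if is_pure_target(t)]
    if len(injectable) < injection_threshold:
        # Backfill with tokens that failed only the character set rule.
        backfill = [t for t in passes_other if not is_pure_target(t)]
        injectable += backfill[: injection_threshold - len(injectable)]
    return injectable


BOD, SKAD = "\u0f56\u0f7c\u0f51", "\u0f66\u0f90\u0f51"
candidates = [BOD, SKAD, "a", BOD + "abc", "hello", "<s>"]
base_vocab = {"hello", "<s>"}
strict = build_injectable(candidates, base_vocab)
relaxed = build_injectable(candidates, base_vocab, injection_threshold=3)
```

With these inputs, `strict` keeps only the two pure Tibetan tokens, while `relaxed` additionally admits the mixed-script token because the strict result falls below the threshold of 3.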

[0013] As a further improvement to the method of training a low-resource language segmentation model and cross-model vocabulary transfer injection in this application, in step S5, the injectable list is appended to the end of the vocabulary of the base segmenter, and the word embedding matrix and the output layer of the large language model are expanded so that their dimensions stay consistent with the size of the base segmenter's vocabulary.
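A dependency-free sketch of the append-and-resize step follows. The data layout (a token-to-id dictionary with contiguous ids, and list-of-lists weight matrices) and the initialization of new rows to the mean of the existing rows plus small Gaussian noise are assumptions; the application only states that the initialization strategy is configurable. In Hugging Face Transformers, the analogous operations are `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`.

```python
import random


def append_tokens(vocab, embeddings, output_weights, injectable, seed=0):
    """Append new tokens at the end of the vocabulary and grow both matrices.

    vocab: dict token -> id with contiguous ids 0..n-1 (assumed layout).
    embeddings / output_weights: lists of n rows, one row per token id.
    """
    rng = random.Random(seed)
    dim = len(embeddings[0])
    # Assumed initialization: mean of existing embedding rows plus small noise.
    mean = [sum(row[j] for row in embeddings) / len(embeddings) for j in range(dim)]
    for token in injectable:
        if token in vocab:
            continue  # never reassign an existing id
        vocab[token] = len(vocab)  # new id goes at the end
        new_row = [m + rng.gauss(0.0, 0.01) for m in mean]
        embeddings.append(new_row)
        output_weights.append(list(new_row))
    # Vocabulary, embedding matrix, and output layer must stay the same size.
    assert len(vocab) == len(embeddings) == len(output_weights)
    return vocab, embeddings, output_weights


vocab = {"a": 0, "b": 1}
emb = [[1.0, 2.0], [3.0, 4.0]]
out = [[1.0, 2.0], [3.0, 4.0]]
append_tokens(vocab, emb, out, ["x", "a", "y"])
```

Appending at the end leaves every existing token id untouched, which is why previously trained weights remain valid after the expansion.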

[0014] As a further improvement to the method of training a low-resource language segmentation model and cross-model vocabulary transfer injection in this application, in step S1, data preprocessing includes at least one of the following processing methods: format cleaning, information density filtering, data deduplication, sensitive information removal, and advertisement removal.

[0015] Compared with the prior art, the overall technical solution of the present invention also has the following beneficial effects: (1) By using injectability filtering, tokens that can be directly injected into the base segmenter are explicitly screened out from candidates from multiple sources, reducing the risk of encoding / decoding anomalies, special symbol conflicts and irreversible loops caused by rule mismatch.

[0016] (2) The base segmenter extension and alignment with the large language model structure are incorporated into a unified and automated process, and a configurable initialization strategy is provided, which significantly reduces the probability of loading failure and hidden errors caused by inconsistencies among multiple components.

Attached Figure Description

[0017] Figure 1 is the first flowchart of the low-resource language segmentation model training and cross-model vocabulary transfer injection method of the present invention;
Figure 2 is the second flowchart of the low-resource language segmentation model training and cross-model vocabulary transfer injection method of the present invention;
Figure 3 shows example images comparing word segmentation results;
Figure 4 is a statistical chart showing the token compression effect.

Detailed Implementation

[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0019] To facilitate understanding of the solutions provided in the following embodiments of the present invention, the terms involved in the present invention will be explained as follows before describing the technical solutions provided by the present invention: Large language models are artificial intelligence models trained on large-scale text data, belonging to the field of natural language processing (NLP) technology, and possess the ability to understand and generate human language. Their core workflow relies on a base segmenter to decompose continuous text into a sequence of tokens as input, and the model calculates the probability distribution of the output tokens, which are then ultimately restored to natural language text by the segmenter. In low-resource language processing such as Tibetan, general-purpose large language models often suffer from insufficient target language token units because the original vocabulary is dominated by high-resource languages, leading to encoding fragmentation and performance degradation in downstream tasks.

[0020] Base segmenter: This is a fundamental component in large language models responsible for input encoding and output decoding. It decomposes continuous text into a sequence of tokens using its vocabulary, and uses the token indices as discrete inputs to the model. Its coverage, granularity, and stability directly determine the encoding efficiency, semantic capacity, and inference speed of large language models on specific languages.

[0021] In natural language processing and large language modeling, a token refers to the basic sequence unit obtained by a base segmenter after decomposing continuous text. It is the fundamental discrete unit for input encoding and output decoding in the model. Specifically, the base segmenter splits text into a token sequence using its vocabulary, using the token index as model input; in the generation stage, the model outputs the probability distribution of the tokens, which is then restored to text by the segmenter. Its coverage, granularity, and stability directly determine the model's encoding efficiency, semantic capacity, and inference speed in a specific language.

[0022] Subwords are the smallest semantic or grammatical units learned from training corpora by algorithms such as BPE and Unigram, with a granularity between characters and complete words. For example, "unhappiness" might be broken down into subwords such as "un", "happi", and "ness", which preserves word formation rules while avoiding the out-of-vocabulary problem for uncommon words.

[0023] A pre-segmented unit-level sequence is the data format used to initialize the training corpus in low-resource language segmentation model training (especially with the BPE algorithm). It is obtained by pre-segmenting the original text into a sequence of basic units with certain semantic or syntactic relationships, based on the inherent structure of the target language (such as syllables, affixes, and explicit delimiters). The smallest admissible pre-segmentation unit is the character level.

[0024] As mentioned in the background technology above, the problems caused by fragmented word segmentation include at least the following: effective content is squeezed under the fixed context length limit; the increase in sequence length increases the number of self-attention calculation and decoding steps, and increases inference latency and memory usage; at the same time, excessively fine segmentation weakens the expression of word formation and word form rules, which is not conducive to the model learning stable semantics and affects the quality of downstream generation.
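The effect of fragmentation on sequence length can be illustrated with a toy segmenter. The greedy longest-match encoder, the character-only base vocabulary, and the sample text below are all assumptions chosen for illustration; a production base segmenter uses its own trained merge rules.

```python
def encode(text, vocab):
    """Greedy longest-match encoding of text into token ids (toy segmenter)."""
    ids, i = [], 0
    max_len = max(len(token) for token in vocab)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                ids.append(vocab[piece])
                i += size
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return ids


def decode(ids, vocab):
    """Map token ids back to text."""
    inverse = {idx: token for token, idx in vocab.items()}
    return "".join(inverse[i] for i in ids)


# Two Tibetan syllables separated by a syllable delimiter, written as escapes.
BOD, SKAD, TSHEG = "\u0f56\u0f7c\u0f51", "\u0f66\u0f90\u0f51", "\u0f0b"
text = BOD + TSHEG + SKAD

# Base vocabulary that only knows single characters: every code point is a token.
base_vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
fragmented = encode(text, base_vocab)

# Expanded vocabulary: the two syllables are appended as whole tokens.
expanded_vocab = dict(base_vocab)
for token in (BOD, SKAD):
    expanded_vocab[token] = len(expanded_vocab)
compact = encode(text, expanded_vocab)

print(len(fragmented), "tokens before,", len(compact), "tokens after")
```

In this toy setting the same text drops from 7 tokens to 3 while still decoding back to the original string, which is the kind of sequence-length reduction the vocabulary expansion targets.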

[0025] To address the aforementioned issues, common adaptation approaches in existing technologies fall into two main categories. The first trains a new segmenter for the low-resource language and replaces or reconstructs the base segmenter's segmentation system accordingly. The second expands the vocabulary while retaining the base segmenter's structure, and simultaneously adjusts the word embedding and output layer dimensions of the large language model. In practice, each of these single paths has limitations that are hard to overcome. The first approach can improve vocabulary coverage for low-resource languages, but replacing the entire segmentation system often creates compatibility risks with existing inference services, fine-tuning frameworks, and distributed training configurations, resulting in high migration costs and significant engineering risk. The second approach keeps the scope of modification relatively controllable, but relying solely on directly merging an external vocabulary or single-source candidate tokens can easily introduce tokens that do not match the rules of the base segmenter's segmentation system, leading to encoding / decoding anomalies, special symbol conflicts, or irreversible loops in the large language model. The vocabulary expanded through the second approach also faces the challenge of balancing coverage and injectability. In addition, after the vocabulary expansion is completed, strict consistency must still be ensured among multiple components, such as the vocabulary, the configuration file, the model embedding, and the output layer; inconsistency in any of them may lead to loading failures or hidden errors. Based on this, this application finds that existing solutions generally lack a unified and quantifiable testing and acceptance system, which makes it difficult to compare and reproduce expansion results across solutions and to establish clear acceptance criteria in project delivery.

[0026] Therefore, there is an urgent need for a joint method that can simultaneously utilize the vocabulary obtained from a self-trained low-resource language segmentation model and the vocabulary extracted from external low-resource language models / segmenters. This method should achieve injectability filtering of candidate tokens, cross-model vocabulary transfer and injection, and automatic alignment of model structure without replacing the base model's segmentation system as a whole. Furthermore, the method should be stable and usable through a verifiable testing and acceptance metric system.

[0027] The following detailed description is exemplary and intended to provide further detailed explanation of the invention. Unless otherwise specified, all technical terms used in this invention have the same meaning as commonly understood by one of ordinary skill in the art. The terminology used in this invention is for describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention.

[0028] Example 1. In this embodiment, the target low-resource language is Tibetan, and the base model is an autoregressive large language model with an existing segmenter and weight files, for example Qwen3-32B, although the invention is not limited thereto. This embodiment provides a joint method implementation flow for Tibetan. To highlight the technical points of this invention, and with reference to Figure 2 and Figure 3, the following description takes the simultaneous activation of "candidate term list A" and "candidate term list B" as an example. Candidate term list A is the specific implementation of the first candidate word list of this application, and candidate term list B is the specific implementation of the second candidate word list. Path A refers to step S2 of this application, and path B refers to step S3. In other embodiments, only one of path A and path B may be executed, or different weights and priorities can be set for the two paths according to resource conditions.

[0029] Referring to Figures 1-2, to achieve the above-mentioned objectives, this application provides a method for training a low-resource language segmentation model and performing cross-model vocabulary transfer injection, comprising the following steps:
Step S1: Collect and preprocess the target low-resource language texts to obtain a training corpus. Data sources for the target low-resource language texts can include news sites, official documents, encyclopedic knowledge, educational materials, religious literature, social media and forum texts, question-and-answer dialogue data, and so on. To avoid an imbalanced token distribution caused by source bias, it is preferable to perform stratified sampling across the sources and to record source labels for subsequent analysis.

[0030] Specifically, data preprocessing includes at least one of the following methods: format cleaning, information density filtering, data deduplication, sensitive information removal, and advertisement removal. In some embodiments, format cleaning includes removing HTML tags and scripts, standardizing line breaks and whitespace, normalizing punctuation, cleaning invisible control characters, and mapping full-width / half-width characters to a uniform format. Information density filtering includes removing low-information text based on indicators such as character type ratio, repetitive n-gram ratio, punctuation ratio, number ratio, and extremely short line ratio. Deduplication can use local hash fingerprinting or similarity clustering to remove highly repetitive paragraphs, templated content, and reprinted / mirrored content. Sensitive information removal can include regular expression matching and desensitizing replacement of patterns such as phone numbers, ID card numbers, email addresses, and precise addresses. Advertisement removal can include rule-based removal of promotional text, link sets, and QR code prompts.
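A few of these preprocessing steps can be sketched with the standard library alone. Everything specific below is an assumption for illustration: the regular expressions (including the 11-digit phone pattern), the placeholder strings `<EMAIL>` and `<PHONE>`, the digit-ratio threshold, and the use of exact SHA-1 hashes in place of the fingerprinting or clustering the text mentions.

```python
import hashlib
import re

SCRIPT_RE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.DOTALL | re.IGNORECASE)
TAG_RE = re.compile(r"<[^>]+>")
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\u200b\ufeff]")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?<!\d)1\d{10}(?!\d)")  # hypothetical 11-digit pattern

# Map full-width ASCII variants (U+FF01..U+FF5E) and the ideographic space
# to their half-width counterparts.
FULLWIDTH = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
FULLWIDTH[0x3000] = 0x20


def clean(text):
    """Format cleaning plus desensitization of emails and phone numbers."""
    text = SCRIPT_RE.sub(" ", text)       # drop script/style blocks
    text = TAG_RE.sub(" ", text)          # drop remaining HTML tags
    text = text.translate(FULLWIDTH)      # unify full-width / half-width
    text = CONTROL_RE.sub("", text)       # drop invisible control characters
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = PHONE_RE.sub("<PHONE>", text)
    return re.sub(r"\s+", " ", text).strip()


def dense_enough(text, min_chars=5, max_digit_ratio=0.5):
    """Information density filter: reject very short or digit-heavy lines."""
    if len(text) < min_chars:
        return False
    digits = sum(ch.isdigit() for ch in text)
    return digits / len(text) <= max_digit_ratio


def deduplicate(paragraphs):
    """Exact-match deduplication by content hash, keeping first occurrences."""
    seen, kept = set(), []
    for paragraph in paragraphs:
        digest = hashlib.sha1(paragraph.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(paragraph)
    return kept


raw = ["<p>Hello\u3000world</p>", "<p>Hello world</p>",
       "call 13800138000 or a@b.com now", "12345 678"]
corpus = deduplicate([t for t in map(clean, raw) if dense_enough(t)])
```

The order matters in this sketch: tags are stripped before the placeholders are inserted, so `<EMAIL>` and `<PHONE>` are not themselves mistaken for HTML tags.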

[0031] Furthermore, this application also establishes a data quality check within the aforementioned data preprocessing process. Preferably, a quality score is computed for the preprocessed corpus, and when the score is lower than a threshold, the process returns to the cleaning and filtering steps. This forms a loop that ensures the final corpus meets the training requirements. It is worth noting that for target low-resource language texts with explicit syllable delimiters, the delimiters are retained during data preprocessing and used in the low-resource language word segmentation model training in step S2, rather than simply being deleted during preprocessing. Taking Tibetan as an example, the training corpus contains delimiters that mark syllable boundaries. For this type of language, when the BPE algorithm is used, the syllable delimiters can be retained in the initial pre-segmented unit-level sequence so that merging occurs preferentially within syllables; if necessary, the syllable delimiters can also be set as reserved boundary symbols to constrain cross-syllable merging. When the Unigram algorithm is used, substrings containing syllable delimiters can participate in probability modeling as part of the candidate subwords, so that the model learns the statistical regularities of high-frequency combinations within syllables and the positions of syllable boundaries. In some embodiments, syllable delimiters can exist as independent learnable units, or participate in candidate generation and pruning as boundary constraint symbols. Incorporating syllable delimiters into the training process reduces erroneous cross-syllable merging and meaningless fragmentation, making the generated candidate tokens more consistent with the syllable structure and word formation rules of the target language.
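Retaining the delimiter during pre-segmentation can be sketched as follows. The sketch assumes the delimiter is the Tibetan tsheg (U+0F0B) and treats each delimiter as its own unit; the function name `pre_segment` and the `keep_delimiter` switch are hypothetical.

```python
TSHEG = "\u0f0b"  # Tibetan syllable delimiter (tsheg), assumed here


def pre_segment(text, keep_delimiter=True):
    """Split text into syllable-level units, optionally keeping each delimiter."""
    units, current = [], ""
    for ch in text:
        if ch == TSHEG:
            if current:
                units.append(current)
                current = ""
            if keep_delimiter:
                units.append(TSHEG)  # delimiter survives as an independent unit
        else:
            current += ch
    if current:
        units.append(current)
    return units


BOD, SKAD = "\u0f56\u0f7c\u0f51", "\u0f66\u0f90\u0f51"
text = BOD + TSHEG + SKAD + TSHEG
with_delims = pre_segment(text)
without_delims = pre_segment(text, keep_delimiter=False)
```

Because the delimiters are kept as units, joining `with_delims` reproduces the input exactly, and a downstream trainer can treat each delimiter as a boundary that merges must not cross.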

[0032] Step S2: Train a low-resource language segmentation model based on the training corpus to obtain the first candidate vocabulary. In specific implementation, based on the training corpus obtained in step S1, train the low-resource language segmentation model to obtain candidate vocabulary A as the first candidate vocabulary. The segmentation model preferably adopts the BPE algorithm or the Unigram algorithm to reduce the encoding length and improve the reusability of token units under the constraint of a given vocabulary size.

[0033] Furthermore, the calculation method of the BPE algorithm includes the following steps:
Step S211: Initialize the training corpus into a pre-segmented unit-level sequence. The pre-segmented unit-level sequence is the data form used to initialize the training corpus in low-resource language segmentation model training (especially with the BPE algorithm); it is obtained by pre-segmenting the training corpus into a sequence of basic units with certain semantic or syntactic relevance, based on the inherent structure of the target language (such as syllables, affixes, and explicit delimiters). For example, when processing languages with explicit syllable delimiters, such as Tibetan, the syllable delimiters are retained in the pre-segmented unit-level sequence so that merging occurs primarily within syllables. Pre-segmenting into larger-granularity units (such as syllables and stems) reduces the complexity of the subsequent iterative merging and matches the word formation rules of low-resource languages (e.g., it avoids erroneous cross-syllable merging). In this application, the character level is the smallest unit of the pre-segmented unit-level sequence. When pre-segmentation is performed at the character level, the training corpus is split into the smallest character units (such as single letters and symbols), which is a basic segmentation that carries no linguistic structure. This application preferably pre-segments the training corpus into a sequence of basic units with certain semantic or grammatical relationships.

[0034] Step S212: Count the frequency of occurrence of adjacent unit pairs in the pre-segmented unit-level sequence, sort the adjacent unit pairs by frequency, and identify the high-frequency adjacent unit pairs;
Step S213: Merge the high-frequency adjacent unit pairs to update the pre-segmented unit-level sequence;
Step S214: Check the pre-segmented unit-level sequence from step S213 against the training parameters. If it satisfies the training parameters, the pre-segmented unit-level sequence is determined as the first candidate vocabulary; if it does not, step S212 is re-executed on the pre-segmented unit-level sequence.
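Steps S211 to S214 can be sketched as a small training loop. This is a textbook-style BPE sketch under stated assumptions rather than the applicant's code: the input is a list of syllables already produced by pre-segmentation, one pair is merged per iteration, ties are broken lexicographically, and the "training parameters" are reduced to a merge budget (`num_merges`) and a minimum pair frequency (`min_frequency`).

```python
from collections import Counter


def train_bpe(syllables, num_merges, min_frequency=2):
    """Learn BPE merges inside syllables; returns the merged units in order."""
    # S211: each syllable starts as a sequence of character-level units.
    corpus = Counter(tuple(s) for s in syllables)
    merges = []
    for _ in range(num_merges):
        # S212: count adjacent unit pairs, weighted by syllable frequency.
        pairs = Counter()
        for units, freq in corpus.items():
            for a, b in zip(units, units[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), best = max(pairs.items(), key=lambda kv: (kv[1], kv[0]))
        # S214 (stop condition): remaining pairs are below the frequency floor.
        if best < min_frequency:
            break
        # S213: merge the chosen pair everywhere it occurs.
        merged = Counter()
        for units, freq in corpus.items():
            out, i = [], 0
            while i < len(units):
                if i + 1 < len(units) and units[i] == a and units[i + 1] == b:
                    out.append(a + b)
                    i += 2
                else:
                    out.append(units[i])
                    i += 1
            merged[tuple(out)] += freq
        corpus = merged
        merges.append(a + b)
    return merges


# Latin letters are used only to keep the example readable.
sample = ["low", "low", "low", "lot", "lot", "new"]
learned = train_bpe(sample, num_merges=10)
```

Since every syllable is processed as a separate sequence, no merge can span a syllable boundary, which mirrors the constraint described for languages with explicit syllable delimiters.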

[0035] Furthermore, the calculation method of the Unigram algorithm includes the following steps:
Step S221: Based on the linguistic structure of the target low-resource language text, construct a candidate word set from the training corpus and assign initial probability values to the words in the candidate word set. Specifically, the candidate word set is preferably constructed as follows. First, the corpus is split into the smallest character units (such as single letters and syllable symbols) as initial candidates. Next, frequently occurring character segments are extracted from the corpus using a sliding window (e.g., 2 to 5 characters long), or potential words are generated from dictionaries and affix rules. Finally, starting from the initial character set, longer or shorter word variants are generated through merging or splitting operations to ensure coverage of the language's word formation rules (such as root and affix combinations). Initial probability values are preferably assigned by any one of uniform distribution, frequency statistics, or smoothing. The uniform distribution method assumes that all words have equal initial probabilities and suits scenarios without prior knowledge. The frequency statistics method assigns probabilities according to how often each subword occurs in the training corpus, so that higher-frequency subwords receive higher initial probabilities. The smoothing method assigns a very small probability to low-frequency or unseen subwords (e.g., by add-one smoothing) so that no probability is zero, which would cause subsequent iterations to fail.

[0036] Step S222: Calculate the probability of each subword in the training corpus, sort the subwords by probability, retain those whose probability meets the probability threshold, and update the candidate subword set. Here, the subword probability is the ratio of a subword's frequency to the total subword frequency; the subword frequency is the number of times the subword occurs in the training corpus; and the total subword frequency is the sum of the frequencies of all subwords in the candidate subword set. The candidate subword set is optimized over multiple rounds of iteration that retain high-value subwords, each round consisting of sorting, pruning, and probability re-estimation. In the sorting and pruning steps, candidate words are sorted from high to low probability, and words with probabilities below a threshold are pruned (e.g., retaining a fixed number of words, or the words whose cumulative probability reaches a preset threshold). In the probability re-estimation step, the probability distribution of each word in the training corpus is recalculated over the pruned candidate word set (because some words were removed, the probabilities of the remaining words are renormalized). This process repeats until the size of the candidate word set reaches the preset vocabulary size (e.g., 10,000 words) or the probability distribution converges (the change in probability stays below a threshold over consecutive iterations).

[0037] Step S223: Based on the updated candidate word set in step S222, make a judgment according to the training parameters; If the candidate word set satisfies the training parameters, then the candidate word set is determined as the first candidate word list; If the candidate word set does not meet the training parameters, then step S222 is repeated for the candidate word set.
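The sort, prune, and re-estimate loop of steps S221 to S223 can be sketched as below. It follows the frequency-ratio definition of subword probability given in the text, which is simpler than the expectation-maximization training used by full Unigram language model tokenizers. The function name, the sliding-window length `max_len`, the per-round `keep_ratio`, and the rule that single characters are never pruned are assumptions.

```python
from collections import Counter


def train_unigram_like(syllables, target_size, max_len=4, keep_ratio=0.8):
    """Prune a substring candidate set by frequency-ratio probability."""
    # S221: candidates are all substrings up to max_len characters,
    # initialized by frequency statistics.
    freq = Counter()
    for s in syllables:
        for i in range(len(s)):
            for j in range(i + 1, min(len(s), i + max_len) + 1):
                freq[s[i:j]] += 1
    chars = {c for c in freq if len(c) == 1}  # always kept for coverage
    while True:
        # Re-estimate: probability = frequency / total frequency of the set.
        total = sum(freq.values())
        probs = {c: f / total for c, f in freq.items()}
        # S223: stop once the vocabulary size parameter is satisfied.
        if len(probs) <= target_size:
            return probs
        # S222: sort by probability (ties broken alphabetically), prune the tail.
        ranked = sorted(probs, key=lambda c: (-probs[c], c))
        keep_n = max(target_size, int(len(ranked) * keep_ratio))
        kept = set(ranked[:keep_n]) | chars
        if len(kept) == len(freq):
            return probs  # only protected single characters remain prunable
        freq = Counter({c: freq[c] for c in kept})


result = train_unigram_like(["ab", "ab", "ab", "ac"], target_size=3,
                            max_len=2, keep_ratio=0.5)
```

In this usage example the rare substring "ac" is pruned, and the final set has four entries rather than three because the single characters are protected; the returned probabilities are renormalized over the surviving candidates.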

[0038] It is worth noting that in step S2 of this application, the low-resource language segmentation model generates a candidate word set (specifically, a pre-segmented unit-level sequence and a candidate sub-word set) based on the training corpus using the BPE algorithm and / or the Unigram algorithm. If the candidate word set does not meet the training parameters, a new candidate word set is generated. If the candidate word set meets the training parameters, the candidate word set is determined as the first candidate word list. The training parameters include at least character coverage, minimum frequency threshold, word list size, and a backoff strategy. Specifically, character coverage is used to determine the target character range to be included in the initial character set, i.e., prioritizing characters whose cumulative coverage in the training corpus reaches a preset threshold; the minimum frequency threshold is used to filter low-frequency character fragments or low-frequency candidate sub-words (specifically, low-frequency adjacent units and low-frequency sub-words), i.e., only candidate units with a frequency not lower than the threshold are included in the subsequent merging or probability modeling process; in the BPE algorithm, filtering low-frequency adjacent unit pairs avoids merging meaningless character combinations and improves lexical quality; in the Unigram algorithm, filtering low-frequency sub-words reduces the computational complexity of probability estimation. The backoff strategy is used to handle character sequences that are not fully covered by the current vocabulary. That is, when an input segment cannot be fully represented by existing words, it is further split into smaller units or preset backoff units to ensure the integrity and executability of the word segmentation process.
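The backoff strategy can be sketched as follows. This is a minimal illustration that assumes greedy longest-match segmentation and uses UTF-8 byte placeholders of the form `<0xNN>` as the preset backoff units; both choices, and the name `segment_with_backoff`, are assumptions made for the example and are not specified by the application.

```python
def segment_with_backoff(text, vocab, max_len=12):
    """Greedy longest-match segmentation; uncovered characters back off to byte units."""
    pieces = []
    i = 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                pieces.append(piece)
                i += n
                break
        else:
            # No vocabulary entry covers this character: emit its UTF-8 bytes.
            pieces.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return pieces


if __name__ == "__main__":
    print(segment_with_backoff("abcd", {"ab", "a", "c"}))
```

Whatever backoff unit is chosen, the point is that segmentation always completes, even for input the vocabulary does not fully cover.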

[0039] Furthermore, since the functionality of the Unigram algorithm is limited by the quality of the training corpus, this application also includes a method for using the first candidate word list output by the BPE algorithm as the training corpus for the Unigram algorithm, which will not be elaborated here.

[0040] Step S3: Determine the vocabulary derived from a low-resource language model or word segmenter that is in the same language as the target low-resource language text as the second candidate vocabulary; wherein, the low-resource language model is an external model, which refers to other existing models or word segmenters that are different from the low-resource language word segmentation model trained in step S2 of this application. Its source can be a published low-resource language model, a word segmenter generated by an existing training project, a domain-specific vocabulary, or other historical model assets.

[0041] In some embodiments, the external low-resource language model can employ either the same word segmentation training rules as in step S2 or different training rules. For example, it can use BPE, Unigram, or other sub-word modeling methods; its training can incorporate syllable delimiter mechanisms or not. Because external source models may differ in word segmentation rules, character processing methods, boundary processing methods, and vocabulary structure, this invention does not directly inject its output tokens into the base word segmenter in full. Instead, it treats them as one of the candidate sources and uniformly performs injectability filtering and compatibility screening in subsequent step S4.

[0042] In some embodiments, exporting candidate word list B (the second candidate word list) may include: reading the vocabulary file of the external word segmenter; parsing the serialized configuration of the external word segmenter; or exporting all tokens (lexical units) through the word segmenter interface exposed by the external model. To control the migration scale, a top-k truncation parameter can optionally be set, for example extracting the top 32,000 candidate tokens (lexical units), or the candidates can be filtered by frequency of occurrence or by a score threshold.
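As an illustration of top-k truncation, the sketch below assumes the external vocabulary has been exported as text lines of the form `token<TAB>score`; this file layout, the function name `export_top_k`, and the `min_score` parameter are assumptions made for the example.

```python
def export_top_k(vocab_lines, k=32000, min_score=None):
    """Parse 'token<TAB>score' lines and return the k highest-scoring tokens."""
    entries = []
    for line in vocab_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        token, _, score = line.partition("\t")
        entries.append((token, float(score) if score else 0.0))
    if min_score is not None:
        entries = [e for e in entries if e[1] >= min_score]
    # Stable sort: tokens with equal scores keep their original file order.
    entries.sort(key=lambda e: -e[1])
    return [token for token, _ in entries[:k]]


if __name__ == "__main__":
    lines = ["a\t-1.0\n", "b\t-0.5\n", "c\t-3.0\n"]
    print(export_top_k(lines, k=2))
```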

[0043] Step S4: Merge the first candidate vocabulary and the second candidate vocabulary to obtain the total candidate set, then filter the total candidate set to generate an injectable list. Specifically, the filtering process includes a deduplication step. In a specific implementation, deduplication is preferably performed using the string form of the token as the primary key, and normalization rules can be applied so that variants of the same token that differ only in written form are not retained as separate entries.
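The merge and deduplication step can be sketched as below. Unicode NFC normalization is used here as one possible normalization rule; the application does not name a specific rule, so this choice and the function name `merge_candidates` are assumptions.

```python
import unicodedata


def merge_candidates(first, second, form="NFC"):
    """Merge two candidate lists, deduplicating on the normalized token string."""
    seen = set()
    merged = []
    for token in list(first) + list(second):
        key = unicodedata.normalize(form, token)
        if key and key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


if __name__ == "__main__":
    # "\u00e9" and "e\u0301" render identically and collapse to one entry.
    print(merge_candidates(["\u00e9", "a"], ["e\u0301", "b", "a"]))
```

Order is preserved, so tokens from the first candidate list take precedence when both sources contain the same entry.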

[0044] Furthermore, suitable tokens (lexical units) for direct injection into the base segmenter are selected from the total candidate set. To this end, this application proposes multi-level constraint rules. Specifically, the filtering method for the total candidate set includes at least one of the following: character set determination rules, length constraints, and conflict avoidance. The character set determination rule is as follows: it determines whether candidate tokens in the total candidate set satisfy the pure low-resource language mode of the target low-resource language text; if not, they are filtered. Specifically, the character set determination rule is used to distinguish whether a token (word) satisfies the pure low-resource language mode. Preferably, this determination is performed automatically by the program, rather than manually confirming each item. Specifically, it can be based on Unicode character ranges, a preset target character set, and a whitelisted punctuation set: when all characters in a candidate token belong to the target low-resource language character set or whitelisted punctuation, it is determined to satisfy the pure low-resource language mode; when there are non-target characters in a candidate token, the mixed mode determination process is entered.
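A minimal sketch of the pure-mode check based on Unicode character ranges follows. It uses the Unicode Tibetan block (U+0F00 to U+0FFF) as the example target character set, since Figure 3 uses a Tibetan sentence; the names `in_ranges` and `is_pure_target` and the default arguments are illustrative assumptions.

```python
# Unicode "Tibetan" block, used here only as an example target character set.
TIBETAN = [(0x0F00, 0x0FFF)]


def in_ranges(ch, ranges):
    """True if the code point of ch falls inside any (low, high) range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_pure_target(token, ranges=TIBETAN, punctuation=frozenset()):
    """True when every character is a target-language character or whitelisted punctuation."""
    return bool(token) and all(in_ranges(c, ranges) or c in punctuation for c in token)


if __name__ == "__main__":
    print(is_pure_target("\u0F56\u0F7C\u0F51"))   # Tibetan-only token
    print(is_pure_target("\u0F56a"))              # contains a Latin letter
```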

[0045] Furthermore, the mixed mode determination process is as follows: when the number of candidate tokens after filtering by the character set determination rules is lower than the injection threshold, candidate tokens that do not meet the character set determination rules but meet other filtering methods are added to the injectable list. That is, when the number of injectable tokens (terms) is insufficient in the pure low-resource language mode, the mixed mode can be enabled. The mixed mode allows tokens (terms) to contain a small number of non-target characters, but requires that the proportion of target low-resource language characters is not lower than a preset threshold; when this threshold is met, the candidate token (term) can be determined to meet the mixed mode retention conditions; otherwise, it is discarded. The mixed mode is an optional configuration and does not affect the security in the pure low-resource language mode.
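The fallback from pure mode to mixed mode can be sketched as follows. The sketch assumes that a caller-supplied predicate `is_target_char` decides whether a character belongs to the target language, and that all mixed tokens meeting the ratio threshold are added once mixed mode is enabled; the names `target_ratio` and `select_tokens` and the default `min_ratio` are hypothetical.

```python
def target_ratio(token, is_target_char):
    """Share of characters in the token that belong to the target language."""
    if not token:
        return 0.0
    return sum(1 for c in token if is_target_char(c)) / len(token)


def select_tokens(candidates, is_target_char, injection_threshold, min_ratio=0.8):
    """Return pure tokens; if too few, also admit mixed tokens above min_ratio."""
    pure = [t for t in candidates if t and all(is_target_char(c) for c in t)]
    if len(pure) >= injection_threshold:
        return pure
    pure_set = set(pure)
    mixed = [
        t for t in candidates
        if t not in pure_set and target_ratio(t, is_target_char) >= min_ratio
    ]
    return pure + mixed


if __name__ == "__main__":
    is_tibetan = lambda c: 0x0F00 <= ord(c) <= 0x0FFF
    cands = ["\u0F56\u0F7C", "\u0F51", "\u0F56\u0F7C\u0F51a", "ab"]
    print(len(select_tokens(cands, is_tibetan, injection_threshold=3, min_ratio=0.7)))
```

In practice the mixed tokens would also pass the length and conflict filters before being admitted, as the paragraph above requires.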

[0046] Preferably, the length constraint is determined by checking whether the candidate tokens in the total candidate set fall within the preset length range; tokens that do not are filtered out. Specifically, the length constraint is preferably executed as the default filtering rule, that is, a preset length range [L_min, L_max] is set for the candidate tokens, for example [1, 12], and only candidate tokens within this range are retained. The purpose of the length constraint is to avoid the insufficient information carrying capacity of excessively short tokens, and to avoid the sample sparsity, decreased generalization ability, and reduced vocabulary utilization caused by excessively long tokens. In some embodiments, the upper length limit can be adjusted according to the characteristics of the target corpus; when a specific domain does contain long tokens of high value, they can be retained by raising the upper limit or by introducing a whitelist retention mechanism, which reduces the risk of accidental deletion.

[0047] Preferably, the conflict avoidance judgment method is as follows: determine whether the candidate tokens in the total candidate set conflict with the tokens in the base segmenter, and if there is a conflict, filter them. That is, if the candidate token already exists in the vocabulary of the base segmenter, or conflicts with special tokens, control symbols, or configuration reserved items at the string level, it will not be injected again or will be removed.
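The length constraint and conflict avoidance rules can be combined in one filtering pass, as in the sketch below. It assumes token length is measured in Unicode code points and that the base vocabulary is any container supporting iteration over its tokens; the function name `filter_candidates` and its parameters are illustrative.

```python
def filter_candidates(candidates, base_vocab, special_tokens=(),
                      min_len=1, max_len=12, whitelist=frozenset()):
    """Drop tokens that collide with the base segmenter or fall outside the length range."""
    reserved = set(base_vocab) | set(special_tokens)
    kept = []
    for token in candidates:
        if token in reserved:
            continue  # already present or reserved: do not inject again
        if token not in whitelist and not (min_len <= len(token) <= max_len):
            continue  # outside [min_len, max_len] and not explicitly whitelisted
        kept.append(token)
    return kept


if __name__ == "__main__":
    long_token = "y" * 13
    print(filter_candidates(["a", "bb", "<pad>", "x" * 13, long_token],
                            base_vocab={"a": 0},
                            special_tokens=["<pad>"],
                            whitelist={long_token}))
```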

[0048] Step S5: Append the injectable list to the base segmenter of the large language model that is in the same language as the target low-resource language text to expand the base segmenter's vocabulary. Specifically, add new tokens to the end of the base segmenter's vocabulary in an append-only manner to ensure that the indexes of the original tokens remain unchanged, thereby avoiding damage to the semantic space already learned by the base model.
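The append-only behaviour can be sketched with a plain token-to-index mapping. The sketch assumes the base vocabulary is a dictionary with contiguous indexes starting at 0; `append_tokens` is a hypothetical helper, not the interface of any particular tokenizer library.

```python
def append_tokens(vocab, new_tokens):
    """Return a copy of vocab with new tokens added after the current highest index."""
    extended = dict(vocab)
    next_id = max(extended.values(), default=-1) + 1
    for token in new_tokens:
        if token not in extended:
            extended[token] = next_id
            next_id += 1
    return extended


if __name__ == "__main__":
    base = {"a": 0, "b": 1}
    extended = append_tokens(base, ["c", "a", "d"])
    print(extended)
    # Every original token keeps its original index.
    print(all(extended[t] == i for t, i in base.items()))
```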

[0049] In some embodiments, the expansion of the base segmenter involves not only the vocabulary but also the base segmenter word configuration file, serialization file, and mapping information associated with specific tokens. To ensure compatibility between the training and inference frameworks, the relevant configuration of the base segmenter vocabulary should be updated and saved synchronously so that the expanded state can be fully restored upon reloading.

[0050] Furthermore, step S5 of this application also includes structural expansion and alignment on the model side. The base model that uses the base segmenter typically includes a word embedding matrix and a language model output layer whose dimensions are consistent with the vocabulary size. When the base segmenter vocabulary is expanded, the word embedding matrix and output layer need to be expanded synchronously so that their row counts match the new vocabulary size. The expansion process should keep the original row vectors unchanged and initialize the row vectors for the newly added tokens.

[0051] In some embodiments, the vector initialization of new tokens can employ one or a combination of the following strategies: (a) mean initialization: using the mean of the original vocabulary vectors as the initial value of the new vector; (b) zero initialization: initializing the new vector as a zero vector; (c) random initialization: sampling from a distribution consistent with the statistical properties of the original vectors, for example, using a multivariate normal distribution constructed from the mean and covariance of the original vectors for sampling. These initialization strategies are configurable to suit different training stability and convergence speed requirements.
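The three initialization strategies can be sketched in pure Python on a matrix stored as a list of rows. For the random strategy the sketch samples each dimension independently from a normal distribution with the per-dimension mean and standard deviation of the original rows, which is a simplification of the full-covariance sampling mentioned above. The function name `expand_rows` and its parameters are assumptions.

```python
import random


def expand_rows(matrix, n_new, strategy="mean", seed=0):
    """Append n_new rows to matrix without modifying the original rows."""
    n, dim = len(matrix), len(matrix[0])
    mean = [sum(row[j] for row in matrix) / n for j in range(dim)]
    if strategy == "mean":
        new_rows = [list(mean) for _ in range(n_new)]
    elif strategy == "zero":
        new_rows = [[0.0] * dim for _ in range(n_new)]
    elif strategy == "random":
        # Diagonal approximation: independent normal per dimension.
        rng = random.Random(seed)
        std = [(sum((row[j] - mean[j]) ** 2 for row in matrix) / n) ** 0.5
               for j in range(dim)]
        new_rows = [[rng.gauss(mean[j], std[j]) for j in range(dim)]
                    for _ in range(n_new)]
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [list(row) for row in matrix] + new_rows


if __name__ == "__main__":
    embeddings = [[1.0, 2.0], [3.0, 4.0]]
    print(expand_rows(embeddings, 2, "mean"))
```

The same helper would be applied to both the word embedding matrix and the output layer so that their row counts stay equal to the expanded vocabulary size.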

[0052] Furthermore, this application also includes a testing and acceptance process. This step ensures that the expanded base segmenter and the large language model can be stably deployed without completely replacing the base segmenter's segmentation system. This implementation establishes a standardized testing and acceptance process to verify the structural correctness, functional usability, and improved segmentation performance of the expanded vocabulary results.

[0053] In some implementations, the tests include at least structural consistency testing, functional consistency testing, and performance verification testing. Structural consistency testing checks whether the vocabulary size of the extended base segmenter, the vocabulary size in the model configuration, the number of rows in the word embedding matrix, and the number of rows in the output layer are consistent, and verifies that the corresponding configuration files, vocabulary files, and model weight files can be loaded uniformly. Functional consistency testing checks whether the extended base segmenter can stably perform encoding and decoding on the test text, including encoding-decoding loopback consistency checks, special symbol compatibility checks, and abnormal token checks. Performance verification testing evaluates whether the extended segmentation results have better low-resource language expression capabilities compared to before the extension, and includes at least a comparison of segmentation performance and statistical analysis of token compression performance.
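The structural and functional consistency checks can be sketched as two small helpers. They assume the vocabulary is a token-to-index dictionary and that `encode` and `decode` are callables supplied by the caller; the names `check_structure` and `check_round_trip` are hypothetical.

```python
def check_structure(vocab, config_vocab_size, embedding_rows, output_rows):
    """Return a list of problems; an empty list means the sizes are consistent."""
    sizes = {
        "tokenizer": len(vocab),
        "config": config_vocab_size,
        "embedding": embedding_rows,
        "output": output_rows,
    }
    problems = []
    if len(set(sizes.values())) != 1:
        problems.append(f"size mismatch: {sizes}")
    if sorted(vocab.values()) != list(range(len(vocab))):
        problems.append("token ids are not contiguous from 0")
    return problems


def check_round_trip(encode, decode, texts):
    """Return the texts that are not restored exactly by decode(encode(text))."""
    return [t for t in texts if decode(encode(t)) != t]


if __name__ == "__main__":
    print(check_structure({"a": 0, "b": 1}, 2, 3, 2))
    print(check_round_trip(list, "".join, ["abc"]))
```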

[0054] As shown in Figure 3, for the same Tibetan sentence, the results of manual segmentation, segmentation by the base model before expansion, segmentation by path A only, segmentation by path B only, and segmentation by the combined method of this invention are presented separately. This comparison shows how the schemes differ in segmentation granularity, syllable preservation, degree of fragmented fallback, and number of readable sub-words. The manual segmentation results serve as the reference benchmark; the results of the base model before expansion reflect the fragmentation that occurs when low-resource language coverage is insufficient; the results of path A only and path B only illustrate the contribution of a single path; and the results of the combined method of this invention demonstrate the overall improvement in segmentation quality after the two paths are combined.

[0055] In some embodiments, to further verify at the sample-set level the token compression effect brought about by the present invention, the original target low-resource language text sample set can be segmented before and after expansion and the results analyzed statistically, producing the token compression statistics graph shown in Figure 4. The sample set is preferably a collection of test texts from real corpora of the target low-resource language, whose text length, domain distribution, and language style cover different types such as short sentences, general sentences, and long sentences. In Figure 4, the horizontal axis represents the sample number of the original target low-resource language text, and the vertical axis represents the number of tokens (lexical units) in the corresponding text after word segmentation; one curve represents the result of segmenting the samples with the base segmenter before expansion, and the other curve represents the result of segmenting the same sample set after processing by steps S1-S5 of this application.

[0056] Figure 4 shows that, in most samples, the expanded curve lies below the unexpanded curve, indicating that the overall number of tokens in the expanded segmentation results is reduced. At the same time, in the long-tail sample interval, the token count grows more slowly after expansion than before, indicating that fragmented segmentation is suppressed and that the token sequence length of long or complex texts converges. These results demonstrate that this invention can improve the efficiency of low-resource language encoding and enhance context utilization by achieving low-cost vocabulary expansion and structural alignment without completely replacing the base segmenter system.
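The statistics behind such a comparison can be computed from per-sample token counts, as in the following sketch. The input lists are assumed to hold the token count of each sample before and after expansion, in the same sample order; the function name `compression_stats` and the example numbers are made up for illustration and are not results from the application.

```python
def compression_stats(before, after):
    """Summarize per-sample token counts before and after vocabulary expansion."""
    if not before or len(before) != len(after):
        raise ValueError("need two non-empty lists of equal length")
    improved = sum(1 for b, a in zip(before, after) if a < b)
    return {
        "total_before": sum(before),
        "total_after": sum(after),
        "reduction": 1 - sum(after) / sum(before),
        "share_improved": improved / len(before),
    }


if __name__ == "__main__":
    # Illustrative counts only.
    print(compression_stats([10, 20, 30], [5, 20, 15]))
```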

[0057] It is noteworthy that those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0058] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and / or block diagrams, as well as combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.

[0059] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.

[0060] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more boxes of the block diagram.

[0061] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit it. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that modifications or equivalent substitutions can still be made to the specific implementation of the present invention. Any modifications or equivalent substitutions that do not depart from the spirit and scope of the present invention should be covered within the scope of protection of the claims of the present invention.

[0062] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus.

[0063] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.

Claims

1. A method for training a low-resource language segmentation model and performing cross-model vocabulary transfer injection, characterized in that, Includes the following steps: Step S1: Collect and preprocess the target low-resource language texts to obtain training corpus; Step S2: Train a low-resource language segmentation model based on the training corpus to obtain the first candidate word list; Step S3: Determine the vocabulary list derived by the low-resource language model or word segmenter that is in the same language as the target low-resource language text as the second candidate vocabulary list; Step S4: Merge the first candidate word list and the second candidate word list to obtain the total candidate set; The total candidate set is filtered to generate an injectable list; Step S5: Append the injectable list to the base segmenter of the large language model that is the same language as the target low-resource language text.

2. The method for training a low-resource language segmentation model and cross-model vocabulary transfer injection according to claim 1, characterized in that, In step S1, for the target low-resource language text with explicit syllable delimiters, the explicit syllable delimiters are retained during data preprocessing and used in the training of the low-resource language word segmentation model in step S2. In step S2, the low-resource language segmentation model calculates and generates a candidate word set based on the training corpus using the BPE algorithm and / or the Unigram algorithm. If the candidate word set does not meet the training parameters, the candidate word set is regenerated. When the candidate word set satisfies the training parameters, the candidate word set is determined as the first candidate word list.

3. The method for training a low-resource language segmentation model and cross-model vocabulary transfer injection according to claim 2, characterized in that, The calculation method of the BPE algorithm includes the following steps: Step S211: Initialize the training corpus into a pre-segmented unit-level sequence; Step S212: Calculate the frequency of occurrence of adjacent unit pairs in the pre-segmented unit-level sequence; sort the adjacent unit pairs according to their frequency of occurrence and identify high-frequency adjacent unit pairs; Step S213: Merge the high-frequency adjacent unit pairs into the pre-segmented unit-level sequence; Step S214: Determine the pre-segmented unit-level sequence from step S213 based on the training parameters; If the pre-segmented unit-level sequence satisfies the training parameters, then the pre-segmented unit-level sequence is determined as the first candidate word list; If the pre-segmented unit-level sequence does not meet the training parameters, then step S212 is re-executed on the pre-segmented unit-level sequence.

4. The method for training a low-resource language segmentation model and cross-model vocabulary transfer injection according to claim 2, characterized in that, The calculation method of the Unigram algorithm includes the following steps: Step S221: Based on the language structure of the target low-resource language text, construct a candidate word set using the training corpus and assign initial probability values to the words in the candidate word set; Step S222: Calculate the probability of the sub-word in the training corpus based on the training corpus, sort the sub-words according to the sub-word probabilities, retain the sub-words whose probabilities meet the probability threshold, and update the candidate sub-word set; Step S223: Based on the updated candidate word set from step S222, determine according to the training parameters; If the candidate word set satisfies the training parameters, then the candidate word set is determined as the first candidate word list; If the candidate word set does not meet the training parameters, then step S222 is re-executed on the candidate word set.

5. The method for training a low-resource language segmentation model and cross-model vocabulary transfer injection according to claim 4, characterized in that, The sub-word probability is the proportion of the sub-word frequency to the total sub-word frequency; the sub-word frequency is the frequency of the sub-word in the training corpus; and the total sub-word frequency is the sum of the frequencies of all the sub-words in the candidate sub-word set.

6. The method for training a low-resource language segmentation model and cross-model vocabulary transfer injection according to claim 1, characterized in that, In step S4, the filtering method for the total candidate set is at least one of the following: character set determination rules, length constraints, and conflict avoidance. The character set determination rule is determined by judging whether the candidate words in the total candidate set meet the pure low-resource language pattern of the target low-resource language text. If not, they are filtered. The length constraint is determined by: determining whether the candidate words in the total candidate set meet the preset length range; if not, filtering is performed. The method for determining conflict avoidance is as follows: determine whether a candidate word in the total candidate set conflicts with a word in the base segmenter; if there is a conflict, then filter it.

7. The method for training a low-resource language segmentation model and cross-model vocabulary transfer injection according to claim 6, characterized in that, In the character set determination rule, when all characters in the candidate word belong to the character set or whitelist punctuation set of the target low-resource language text, the candidate word is determined to satisfy the pure low-resource language mode; when the number of candidate words after filtering by the character set determination rule is lower than the injection threshold, the candidate words that do not satisfy the character set determination rule but satisfy other filtering methods are filled into the injectable list.

8. The method for training a low-resource language segmentation model and cross-model vocabulary transfer injection according to claim 1, characterized in that, In step S5, the injectable list is appended to the end of the base segmenter vocabulary in the base segmenter, and the dimensions of the word embedding matrix and the output layer of the base segmenter are expanded to match the dimensions of the base segmenter vocabulary.

9. The method for training a low-resource language segmentation model and cross-model vocabulary transfer injection according to claim 1, characterized in that, In step S1, the data preprocessing includes at least one of the following processing methods: format cleaning, information density filtering, data deduplication, sensitive information removal, and advertisement removal.