A Chinese-English Machine Translation Method for the Vertical Field of Traditional Chinese Medicine

By employing transfer learning and a remotely supervised knowledge base strategy, a translation model for the vertical field of traditional Chinese medicine (TCM) is constructed, which addresses the scarcity of TCM Chinese-English parallel data and achieves efficient and accurate Chinese-English translation.

CN115660000B · Active · Publication Date: 2026-06-30 · INST OF INFORMATION ON TRADITIONAL CHINESE MEDICINE CACMS

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
INST OF INFORMATION ON TRADITIONAL CHINESE MEDICINE CACMS
Filing Date
2022-07-04
Publication Date
2026-06-30

AI Technical Summary

Technical Problem

The lack of high-quality parallel corpora in the Chinese-English translation of traditional Chinese medicine literature has resulted in unsatisfactory performance of existing neural machine translation, making it difficult to achieve efficient and accurate translation of TCM literature.

Method used

Using transfer learning and a remotely supervised knowledge base strategy, a Chinese-English translation model for the vertical domain of traditional Chinese medicine is constructed: a TCM parallel corpus is built, an M2M_100 pre-trained model and a BERT model are used, and the result is combined with a remotely supervised knowledge base for Chinese-English translation. The model parameters and structure are also optimized.

Benefits of technology

It improves the accuracy and efficiency of Chinese-English translation, enabling high-quality translation even with few training samples, in particular accurate translation of TCM terminology and entities.

Abstract

This invention discloses a Chinese-English machine translation method in the vertical domain of Traditional Chinese Medicine (TCM), comprising the following steps: 1. Construction of a parallel TCM corpus; 2. Building a neural machine translation model using transfer learning; 3. Processing a TCM terminology database; 4. Construction of a remotely supervised knowledge base; 5. Comprehensive utilization. Compared with existing technologies, the invention makes better use of transfer learning strategies, optimizes model parameters, and improves the model structure. This significantly improves model training accuracy and efficiency while fully inheriting the advantages of the original pre-trained model and its massive parameters, resulting in a Chinese-English translation model with TCM linguistic characteristics. It uses remote supervision to integrate high-quality TCM Chinese-English parallel corpus resources, professional Chinese-English terminology resources, and preferred-name / alias (synonym) resources into a knowledge base. Qualifying source text can be translated into the target language using only the knowledge base, with extremely high accuracy, and the knowledge base also merges preferred names and aliases well.

Description

Technical Field

[0001] This invention relates to the field of machine translation in natural language processing technology, specifically to a Chinese-English machine translation method for the vertical field of traditional Chinese medicine based on transfer learning and remote supervised knowledge base strategies.

Background Technology

[0002] Traditional Chinese medicine (TCM) is a treasure of ancient Chinese science and a key to unlocking the treasure trove of Chinese civilization. To accelerate the development and internationalization of TCM under the Belt and Road Initiative, it is necessary to strengthen TCM translation, especially by utilizing machine translation to efficiently, quickly, and accurately translate large volumes of TCM literature. However, as a part of traditional Chinese culture, TCM has the following characteristics: the information content requiring translation is relatively large, reflected in the fact that TCM terminology and prescription names not only have a referential function but also serve an explanatory role. TCM has a long history, and many terms are written in classical Chinese, characterized by concise and succinct descriptions, using few words but conveying profound meaning.

[0003] Currently, mainstream Chinese-English neural machine translation (NMT) is a data-driven approach to building translation models. It heavily relies on the quality, structure, and scale of parallel corpora. Furthermore, due to the numerous parameters and hyperparameters involved in building the neural network, the effectiveness of NMT only significantly improves and becomes applicable when the parallel corpus reaches a very large scale. Moreover, the characteristics of Traditional Chinese Medicine (TCM), including classical Chinese, pronouns, and domain terminology, which are often interspersed with natural language, lead to significant discrepancies between the predicted and translated texts in NMT models. Additionally, the scarcity of high-quality TCM translation data and the lack of a sufficiently large parallel corpus for NMT construction result in unsatisfactory NMT translation performance in TCM. Therefore, improving TCM NMT has significant application prospects and importance.

[0004] Neural Machine Translation

[0005] Neural machine translation is a method of automatic end-to-end translation between natural languages using neural networks. It typically employs an encoder-decoder framework to achieve sequence-to-sequence conversion. Neural machine translation based on the encoder-decoder framework has two key characteristics:

[0006] The encoder-decoder framework learns sentence vector representations that can group sentences with different syntax but the same semantics together, while also distinguishing sentences with the same syntax but different semantics generated by swapping the subject and object.

[0007] Neural machine translation can effectively capture long-distance dependencies through recurrent neural networks based on long short-term memory, while alleviating the data sparsity problem through vector representation, thereby improving the fluency and readability of the translation.

[0008] Transfer learning

[0009] Transfer learning is a machine learning method that uses a model developed for task A as a starting point and reuses it in the process of developing a model for task B. Simply put, transfer learning transfers knowledge from one domain (the source domain) to another domain (the target domain), enabling better learning outcomes in the target domain.

[0010] Transfer learning is a research field within machine learning. It focuses on storing solutions to existing problems and applying them to other different but related problems. In simpler terms, transfer learning uses existing prior knowledge to enable an algorithm to learn new knowledge; that is, it seeks similarities between prior and new knowledge. Domain adaptation is a primary approach in current transfer learning. In transfer learning and domain adaptation, the dataset containing the existing prior knowledge is called the source domain, and the dataset containing the new knowledge the algorithm needs to learn is called the target domain. Typically, there are significant differences between the source and target domains—that is, the data distributions are not entirely identical—but there is always some correlation.

[0011] Remote supervision

[0012] Remote supervision (known in the English-language literature as distant supervision) is widely used in mainstream relation extraction systems and is a research hotspot in this field. It addresses the scale problem of data annotation: to overcome the cost of manual annotation in supervised learning, text is aligned with a large-scale knowledge graph, and the entity relationships already recorded in the knowledge graph are used to annotate the text. To obtain richer training samples, a multi-instance multi-label method has been proposed. This method assumes that within the same bag, a single sentence expresses only one relation for the entity pair (E1, E2) and therefore receives only one label, while different sentences may express different relations for (E1, E2) and thus receive different labels. A label in multi-label annotation is not simply positive or negative but denotes a specific relation. This makes it possible to mine multiple relations of an entity pair simultaneously, providing the basic data for building a knowledge base.
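
The alignment idea described above can be illustrated with a minimal sketch. The knowledge-base triples, relation names, and sentences below are invented for illustration and are not taken from the patent; matching entities by simple substring search is also an assumption made to keep the example short.

```python
from collections import defaultdict

# Hypothetical knowledge-base triples: (entity_1, entity_2) -> relation.
KB = {
    ("ginseng", "qi deficiency"): "treats",
    ("ginseng", "Araliaceae"): "belongs_to",
}

def distant_label(sentences, kb):
    """Group sentences into bags keyed by entity pair and label each bag
    with the relation that the knowledge base records for that pair."""
    bags = defaultdict(list)
    for sentence in sentences:
        lowered = sentence.lower()
        for (e1, e2), relation in kb.items():
            # A sentence is aligned to a pair when it mentions both entities.
            if e1.lower() in lowered and e2.lower() in lowered:
                bags[(e1, e2, relation)].append(sentence)
    return dict(bags)

sentences = [
    "Ginseng is commonly prescribed for qi deficiency.",
    "Ginseng is a plant of the family Araliaceae.",
    "Licorice harmonizes other herbs.",
]
bags = distant_label(sentences, KB)
```

No sentence is labeled by hand here: the labels come entirely from the existing triples, which is what lets the approach scale to large corpora.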

[0013] Knowledge base

[0014] A knowledge base is a structured, easy-to-operate, easy-to-use, comprehensive, and organized cluster of knowledge in knowledge engineering. It is a collection of interconnected knowledge fragments stored, organized, managed, and used in computer memory, employing one or more knowledge representation methods to address the needs of solving problems in a specific domain (or related domains). These knowledge fragments include domain-related theoretical knowledge, factual data, heuristic knowledge derived from expert experience, such as relevant definitions, theorems, and operational rules within the domain, as well as common-sense knowledge.

[0015] Full-text search engine

[0016] Full-text search technology targets various types of data, such as text, audio, and images, providing information retrieval based on content rather than external features. Its key characteristic is the effective management and rapid retrieval of massive amounts of data. It is the core technology of search engines and a supporting technology for e-commerce websites. Full-text search technology can be applied to corporate websites, media websites, government websites, commercial websites, digital libraries, and search engines. We know that enterprise informatization is the foundation of e-commerce. Enterprises rely on full-text search for establishing their own business websites, building internal information publishing platforms, establishing secure information publishing and exchange channels with other websites, developing e-commerce applications, and building application platforms centered on data. This search technology can span all data sources, support various data and information formats, arrange search results according to business classification rules, and also meet specific knowledge retrieval requests from users. It arranges the results from all different information queries according to relevance or classification, providing information browsing functions in different formats.

[0017] BERT pre-trained models

[0018] BERT stands for Bidirectional Encoder Representations from Transformers, and it is a pre-trained language representation model. A pre-trained language representation model is one that is first trained on a large dataset independent of the final task to generate language representations, after which the learned knowledge (representations) is applied to task-related language representations. Its emphasis is that, instead of using traditional unidirectional language models or a shallow concatenation of two unidirectional language models for pre-training, it employs a masked language model (MLM) to generate deep bidirectional language representations.

[0019] Its key features include: using MLM to pre-train bidirectional Transformers to generate deep bidirectional language representations. After pre-training, only an additional output layer needs to be added for fine-tuning to achieve state-of-the-art performance on a wide variety of downstream tasks. No task-specific structural modifications to BERT are required during this process.

[0020] M2M pre-trained model

[0021] Facebook's publicly released M2M-100 model, which builds on Facebook's multilingual model XLM-R, used open-source data mining tools such as ccAligned, ccMatrix, and LASER to collect over 7.5 billion sentences in more than 100 languages. The languages were divided into 14 language groups based on criteria such as language classification and geographical and cultural similarity. Within each of the 14 groups, one to three "transitional languages" (bridge languages) were identified, which were then used as the basis for translation between different language groups.

[0022] M2M-100 leverages model parallelism to train a model two orders of magnitude larger than current bilingual models. Using Fairscale (a PyTorch tool for training large models), the model is split across hundreds of GPUs during training; all GPUs see the same underlying data, so each GPU trains a portion of the model rather than a portion of the data. To ensure that M2M-100 can scale without performance loss, Facebook researchers partitioned the model's parameters (the variables that affect the model's predictions, in this case when translating) into non-overlapping language groups. This combination of strategies increased the model's capacity by 100 times.

Summary of the Invention

[0023] The technical problem to be solved by this invention is to provide a Chinese-English machine translation method for the vertical field of traditional Chinese medicine based on transfer learning and remote supervised knowledge base strategies.

[0024] To solve the above-mentioned technical problems, the technical solution provided by this invention is: a Chinese-English machine translation method in the vertical field of traditional Chinese medicine, comprising the following steps:

[0025] 1. Construction of a parallel corpus of traditional Chinese medicine;

[0026] 2. Build a neural machine translation model using transfer learning;

[0027] 3. Processing of a terminology database in the field of traditional Chinese medicine;

[0028] 4. Construction of a remote supervision knowledge base;

[0029] 5. Comprehensive utilization.

[0030] As an improvement, step 1 is divided into the following sub-steps:

[0031] 1.1 The collected Chinese-English parallel corpora in the field of traditional Chinese medicine are preprocessed, including garbled character filtering for Chinese and English, special character handling, deduplication, and reasonableness checks.

[0032] 1.2 Perform word segmentation and bilingual alignment on the preprocessed Chinese data;

[0033] 1.3 Dataset partitioning: The preprocessed data is divided into training set, validation set, and test set in a ratio of 7:2:1.

[0034] As an improvement, step 2 is divided into the following sub-steps:

[0035] 2.1 The M2M_100 model was selected as the pre-trained model for transfer learning;

[0036] 2.2 Expanding the bilingual parallel corpus dataset of traditional Chinese medicine based on the pre-trained model;

[0037] 2.3 Install the M2M100 Tokenizer;

[0038] 2.4 Configure and construct model hyperparameters and parameters according to the characteristics of traditional Chinese medicine language;

[0039] 2.5 Reduce Transformer depth by pruning it using structured Dropout;

[0040] 2.6 Binarize the training, validation, and test sets to prepare for model training;

[0041] 2.7 Perform model training;

[0042] 2.8 A new TCM vertical-domain translation model based on transfer learning is obtained by training the pre-trained model on the TCM Chinese-English parallel corpus data.

[0043] As an improvement, step 3 is divided into the following sub-steps:

[0044] 3.1 The collected Chinese-English terminology in the field of traditional Chinese medicine is subjected to garbled character filtering, special character processing, deduplication, and rationality verification.

[0045] 3.2 Classify the collected Chinese-English terminology in the field of traditional Chinese medicine;

[0046] 3.3 The lengths of different categories of TCM-specific English-Chinese terms are sorted and assigned numerical weights.

[0047] As an improvement, step 4 is divided into the following sub-steps:

[0048] 4.1 Construct a classified Chinese-English glossary of domain preferred names and aliases;

[0049] 4.1.1 For the processed Chinese-English TCM terminology, add the corresponding preferred names and aliases for terms that have a preferred-name / alias relationship;

[0050] 4.1.2 Chinese and English TCM terms are cached in Redis according to different categories to form different Redis terminology vocabulary classification databases. The order of entry into the database is based on the priority of the vocabulary and the length of the terms.

[0051] 4.1.3 Publish the business interface for other functions to call;

[0052] 4.2 Index the TCM Chinese-English parallel corpus data;

[0053] 4.2.1 The high-quality Chinese-English parallel corpus data collected is synchronized into a one-to-one Chinese-English index dataset using ESearch full-text search service;

[0054] 4.2.2 Build and configure the ESearch full-text search server cluster, and publish the search interface for other functions to call;

[0055] 4.3 Extract sentence patterns from parallel Chinese and English corpora to form a sentence pattern pool;

[0056] 4.3.1 A BERT pre-trained model is loaded with TCM Chinese vocabulary data, and its NER function is used to identify domain entities in the TCM Chinese-English parallel corpus;

[0057] 4.3.2 After identification, the entities present in the parallel corpus are masked (both Chinese and their corresponding English are processed) and then processed into regular sentence patterns for the parallel corpus.

[0058] 4.3.3 Input the sentence structure into the database and publish the search interface for other functions to call.

[0059] As an improvement, step 5 is divided into the following sub-steps:

[0060] 5.1 When the user client inputs data, the system first segments the data into sentences, performs NER recognition on each sentence, replaces entities with masks, and generates corresponding regular expression sentences.

[0061] 5.2 Each sentence is compared against and retrieved from the ESearch full-text search database. If the sentence exists, it is directly replaced with its stored translation. If the original sentence does not exist but its regular-expression pattern does, the Redis dictionary is called to translate the masked entities, the pattern is restored to standard output, and then the replacement is performed.

[0062] 5.3 If the content of this sentence does not exist in the remote supervision knowledge base, the model is invoked for translation;

[0063] 5.4 Combine the sentence data and merge it into output data to provide services to external parties.

[0064] The advantages of this invention compared to existing technologies are as follows: Compared to existing TCM neural machine translation, this invention better utilizes transfer learning strategies. Based on the M2M_100 pre-trained model, it extends model training to low-sample, high-quality parallel TCM corpora. Furthermore, it optimizes model parameters and improves model structure, significantly improving the accuracy and efficiency of model training while fully inheriting the advantages of the original pre-trained model and its massive parameters, thus forming a Chinese-English translation model with TCM linguistic characteristics.

[0065] This invention utilizes remote supervision to integrate high-quality parallel Chinese-English corpora of traditional Chinese medicine, professional Chinese-English terminology resources, and preferred-name / alias resources into a knowledge base. This knowledge base is then used to intervene in the translation process, fully leveraging the advantages of high-quality existing data resources. This allows source data that meets certain conditions to be translated into the target language without model prediction, solely through the knowledge base, with extremely high accuracy. It also merges preferred names and aliases well.

Attached Figure Description

[0066] Figure 1 is a comparison chart of Chinese and English translations of literature in the field of traditional Chinese medicine.

[0067] Figure 2 is a comparison chart of Chinese and English parallel corpora.

[0068] Figure 3 is a structure diagram based on Transformer networks.

[0069] Figure 4 is a glossary of professional terms in Traditional Chinese Medicine.

[0070] Figure 5 is a diagram of sentence language structure.

[0071] Figure 6 is a sentence lexical analysis diagram.

[0072] Figure 7 is a sentence structure analysis diagram.

[0073] Figure 8 is a semantic analysis diagram of the sentence.

[0074] Figure 9 is a flowchart of the overall structure of the present invention.

[0075] Figure 10 shows the process of generating regular sentence patterns from parallel corpora.

[0076] Figure 11 is a diagram of the construction of the remote supervision knowledge base.

Detailed Implementation

[0077] The present invention will now be described in further detail with reference to the accompanying drawings.

[0078] In a specific implementation, the technical solution of this invention is to perform Chinese-English machine translation in the vertical field of traditional Chinese medicine based on transfer learning and remote supervised knowledge base strategies. The specific implementation scheme of the method is as follows:

[0079] 6. Construction of a parallel corpus of traditional Chinese medicine

[0080] 6.1 The collected Chinese-English parallel corpora in the field of Traditional Chinese Medicine (TCM) were preprocessed, including garbled character filtering, special character handling, deduplication, and reasonableness validation. The parallel corpora included data from the "Multi-Database Fusion Retrieval System" of the Institute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, bilingual abstracts of acupuncture literature, and other sources, totaling 200,000 parallel entries. Garbled character filtering removed all Chinese characters outside the GB2312 encoding and all English characters outside ASCII. Special character handling removed symbols that have no English counterpart, such as the enumeration comma (、) and book title marks (《》). Deduplication was performed manually to remove duplicate Chinese or English entries in the parallel corpora. Reasonableness validation included length checks (Chinese text with fewer than 1 or more than 512 characters is considered invalid) and checks for missing or blank lines in the Chinese or English text.

[0081] 6.2 Word segmentation and bilingual alignment are performed on the preprocessed Chinese data. As shown in Figure 2, Chinese word segmentation uses a HanLP-based AI segmentation mechanism with a domain terminology vocabulary.

[0082] 6.3 Dataset partitioning: The preprocessed data is divided into training set, validation set, and test set in a ratio of 7:2:1.
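
The corpus construction steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function names are hypothetical, deduplication is automated here although the text describes it as manual, and the split is done by random shuffling with a fixed seed, which the text does not specify.

```python
import random

SPECIAL_CHARS = "、《》"  # marks named in the text as having no English counterpart

def clean_zh(text):
    """Drop the special marks, then any character not encodable as GB2312."""
    kept = []
    for ch in text:
        if ch in SPECIAL_CHARS:
            continue
        try:
            ch.encode("gb2312")
        except UnicodeEncodeError:
            continue  # garbled or out-of-range character
        kept.append(ch)
    return "".join(kept).strip()

def clean_en(text):
    """Keep ASCII characters only."""
    return "".join(ch for ch in text if ord(ch) < 128).strip()

def preprocess(pairs, max_zh_len=512):
    """Clean, validate, and deduplicate (zh, en) sentence pairs."""
    seen_zh, seen_en, result = set(), set(), []
    for zh, en in pairs:
        zh, en = clean_zh(zh), clean_en(en)
        # Reasonableness checks: no blank side, Chinese length within 1..512.
        if not zh or not en or len(zh) > max_zh_len:
            continue
        # A pair is a duplicate if either side has been seen before.
        if zh in seen_zh or en in seen_en:
            continue
        seen_zh.add(zh)
        seen_en.add(en)
        result.append((zh, en))
    return result

def split_7_2_1(pairs, seed=0):
    """Shuffle and split into training, validation, and test sets (7:2:1)."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 7 // 10
    n_valid = len(shuffled) * 2 // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])
```

Filtering before deduplication matters: two entries that differ only in garbled characters or removed punctuation collapse into one after cleaning.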

[0083] 7. Build a neural machine translation model using transfer learning.

[0084] 7.1 The M2M_100 model was selected as the pre-trained model for transfer learning. Its structure is shown in Figure 3.

[0085] 7.2 Expand the bilingual parallel corpus dataset of traditional Chinese medicine based on the pre-trained model.

[0086] 7.3 Install the M2M100Tokenizer.

[0087] 7.4 Configure and construct model hyperparameters and parameters according to the characteristics of TCM language, including d_model (dimension of layers and pooling layers), encoder_layers (number of encoder layers), decoder_layers (number of decoder layers), encoder_attention_heads (number of attention heads per attention layer in the Transformer encoder), decoder_attention_heads (number of attention heads per attention layer in the Transformer decoder), dropout (dropout probability of all fully connected layers in the embedding, encoder, and pooler), attention_dropout (dropout rate of attention probabilities), activation_dropout (dropout rate of activation within fully connected layers), max_position_embeddings (maximum sequence length that this model may use), and init_std (standard deviation of truncated_normal_initializer used to initialize all weight matrices).
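
As a rough illustration of step 7.4, the listed hyperparameters can be collected in a single configuration object. The class name and all numeric values below are assumptions chosen for the example; the patent does not disclose the values it uses. The validation rules (model dimension divisible by the number of heads, dropout rates in [0, 1)) are generic Transformer constraints, not requirements stated in the text.

```python
from dataclasses import dataclass

@dataclass
class TranslationModelConfig:
    # Illustrative values only; the patent does not state its settings.
    d_model: int = 1024
    encoder_layers: int = 12
    decoder_layers: int = 12
    encoder_attention_heads: int = 16
    decoder_attention_heads: int = 16
    dropout: float = 0.1
    attention_dropout: float = 0.1
    activation_dropout: float = 0.0
    max_position_embeddings: int = 1024
    init_std: float = 0.02

    def validate(self):
        """Check basic consistency constraints of a Transformer configuration."""
        for heads in (self.encoder_attention_heads, self.decoder_attention_heads):
            if self.d_model % heads != 0:
                raise ValueError("d_model must be divisible by the number of attention heads")
        for name in ("dropout", "attention_dropout", "activation_dropout"):
            rate = getattr(self, name)
            if not 0.0 <= rate < 1.0:
                raise ValueError(f"{name} must be in [0, 1)")
        return self

# Example: a shallower model, as might result from reducing Transformer depth.
config = TranslationModelConfig(encoder_layers=6, decoder_layers=6).validate()
head_dim = config.d_model // config.encoder_attention_heads
```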

[0088] 7.5 Reduce Transformer depth by pruning it with structured Dropout.

[0089] 7.5.1 The traditional Transformer consists of multiple stacked layers, each consisting of two sub-layers: a multi-head self-attention sub-layer followed by a feedforward sub-layer. The multi-head self-attention sub-layer consists of multiple attention heads applied in parallel. Each attention head takes a matrix X, where each row represents an element of the input sequence, and updates its representation by gathering information from its context using an attention mechanism: $Y = \mathrm{Softmax}(X^{T} K (QX + P))\,VX$, where K, V, Q, and P are parameter matrices. The outputs of the heads are then concatenated into a sequence of vectors. The second sub-layer independently applies a fully connected feedforward network to each element of the sequence, $\mathrm{FFN}(x) = U\,\mathrm{ReLU}(Vx)$, where V and U are parameter matrices.

[0090] 7.5.2 Because Transformers are typically computed in parallel, pruning can focus on removing whole layers, using data-driven pruning in which the drop rate of each layer is learned. Given a target drop rate p, a separate drop rate p_d is learned for the layer at depth d, such that the average drop rate across all layers equals p. p_d is parameterized as a non-linear function of the layer's activations, and a softmax function is applied. During inference, only a fixed set of the top-k highest-scoring layers is used in the forward pass.
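
The layer-level structured dropout described in step 7.5 can be sketched with toy scalar "layers". This is a simplified illustration, not the patent's implementation: the layers are stand-in functions rather than Transformer blocks, training uses one uniform drop rate, and the per-layer scores are given as fixed numbers instead of being learned from activations.

```python
import math
import random

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def forward_train(x, layers, drop_rate, rng):
    """Training pass: each residual layer is skipped with probability drop_rate."""
    for layer in layers:
        if rng.random() < drop_rate:
            continue  # the whole layer is dropped; x passes through unchanged
        x = x + layer(x)
    return x

def top_k_layers(layer_logits, k):
    """Return the indices of the k highest-scoring layers, in network order."""
    scores = softmax(layer_logits)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def forward_pruned(x, layers, keep):
    """Inference pass that runs only the retained layers."""
    for i in keep:
        x = x + layers[i](x)
    return x

# Toy residual layers that each add a constant to the running value.
layers = [lambda x: 1.0, lambda x: 2.0, lambda x: 4.0, lambda x: 8.0]
keep = top_k_layers([0.2, 2.0, -1.0, 1.5], k=2)
pruned_output = forward_pruned(0.0, layers, keep)
```

Because layers are dropped at random during training, the network learns not to depend on any single layer, which is what makes it possible to remove layers at inference without retraining.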

[0091] 7.6 Binarize the training, validation, and test sets to prepare for model training.

[0092] 7.7 Perform model training.

[0093] 7.8 A new TCM vertical-domain translation model based on transfer learning is obtained by training the pre-trained model on the TCM Chinese-English parallel corpus data.

[0094] 8. Processing of the terminology database in the field of Traditional Chinese Medicine

[0095] 8.1 The collected Chinese-English terminology in the field of Traditional Chinese Medicine (TCM) undergoes garbled character filtering, special character processing, deduplication, and rationality verification. The TCM-English database includes data from the National Committee for Terminology in Science and Technology, the TCM MESH thesaurus, the WHO standard thesaurus, terms from the *Huangdi Neijing Lingshu*, diagnostic terms from the Eight Principles of Differentiation, diagnostic terms from the Six Channels Differentiation (Wei Qi Ying Xue differentiation), treatment principles and methods, a new translation of the *Huangdi Neijing Suwen*, a glossary of ophthalmological terms, a glossary of institutional terms, and English translations of commonly used modern TCM terms by Wang Xi, Bao Yuhui, and others.

[0096] 8.2 Classify the collected Chinese-English terminology in the field of traditional Chinese medicine. Classify them according to their source and the category of traditional Chinese medicine to which they belong.

[0097] 8.3 Sort the Chinese and English terms with distinctive characteristics of traditional Chinese medicine in different categories by length and assign them numerical weights.
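
Steps 8.2 and 8.3 can be illustrated with a short sketch. The weighting scheme (weight equals the character length of the Chinese term), the category names, and the sample terms are assumptions made for the example; the text states only that terms are sorted by length and given numerical weights.

```python
from collections import defaultdict

def rank_terms(terms):
    """Group (category, zh, en) terms by category, sort each group
    longest-first, and attach a numerical weight to every term."""
    by_category = defaultdict(list)
    for category, zh, en in terms:
        by_category[category].append((zh, en))
    ranked = {}
    for category, entries in by_category.items():
        # Longest Chinese term first; ties are broken by the term itself
        # so that the ordering is deterministic.
        entries.sort(key=lambda e: (-len(e[0]), e[0]))
        # Assumed weighting: the weight is the term's character length.
        ranked[category] = [(zh, en, len(zh)) for zh, en in entries]
    return ranked

ranked = rank_terms([
    ("herb", "甘草", "licorice"),
    ("herb", "炙甘草", "honey-fried licorice"),
    ("disease", "感冒", "common cold"),
])
```

Ordering longer terms first means that a lookup tries 炙甘草 before 甘草, so a longer term is not broken up by a shorter term it contains.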

[0098] 9. Construction of a remote supervision knowledge base

[0099] 9.1 Construct a classified Chinese-English glossary of domain preferred names and aliases, as shown in Figure 4:

[0100] 9.1.1 For the processed Chinese-English TCM terminology, add the corresponding preferred names and aliases for terms that have a preferred-name / alias relationship;

[0101] 9.1.2 Chinese and English TCM terms are cached to Redis according to different categories to form different Redis terminology vocabulary classification databases. The order of entry into the database is based on the priority of the vocabulary and the length of the terms.

[0102] 9.1.3 Publish the business interface for other functions to call.

[0103] 9.2 Index the TCM Chinese-English parallel corpus data

[0104] 9.2.1 The high-quality Chinese-English parallel corpus data collected is synchronized into a one-to-one Chinese-English index dataset using ESearch full-text search service;

[0105] 9.2.2 Build and configure the ESearch full-text search server cluster, and publish the search interface for other functions to call.

[0106] 9.3 Extract sentence patterns from parallel Chinese and English corpora to form a sentence pattern pool.

[0107] 9.3.1 A BERT pre-trained model is loaded with TCM Chinese vocabulary data, and its NER function is used to identify domain entities in the TCM Chinese-English parallel corpus; for example, the original sentence: Observe the clinical efficacy of self-prepared cold remedies in treating colds.

[0108] Its linguistic structure is as follows: Figure 5

[0109] The lexical analysis results are as follows: Figure 6

[0110] The sentence structure analysis results are as follows: Figure 7

[0111] The semantic analysis results are shown in Figure 8.

[0112] 9.3.2 After identification, the entities present in the parallel corpus are masked (both Chinese and their corresponding English are processed) and processed into parallel corpus regular sentence patterns.

[0113] For example, the original sentence: Observe the clinical efficacy of self-prepared cold remedies in treating colds.

[0114] After term masking: Observe the clinical efficacy of self-prepared [1] in treating [2].

[0115] After regularization: Observe the clinical efficacy of self-prepared .+ in treating .+

[0116] 9.3.3 Input sentence patterns into the database and publish the search interface for other functions to call.
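
The masking and regularization of step 9.3 can be sketched as below. The entity list stands in for the output of the BERT-based NER step, the `[ENTn]` placeholder format and function names are hypothetical, and only the first occurrence of each entity is masked to keep the example short.

```python
import re

def mask_entities(sentence, entities):
    """Replace each entity with a numbered slot; return (template, slots)."""
    found = []
    # Try longer entities first so a short entity cannot split a longer one.
    for ent in sorted(entities, key=len, reverse=True):
        start = sentence.find(ent)
        end = start + len(ent)
        overlaps = any(s < end and start < e for s, e, _ in found)
        if start != -1 and not overlaps:
            found.append((start, end, ent))
    found.sort()  # number the slots in order of appearance
    pieces, slots, cursor = [], [], 0
    for n, (start, end, ent) in enumerate(found, 1):
        pieces.append(sentence[cursor:start])
        pieces.append(f"[ENT{n}]")
        slots.append(ent)
        cursor = end
    pieces.append(sentence[cursor:])
    return "".join(pieces), slots

def template_to_regex(template):
    """Turn a masked template into a regex whose groups capture the slots."""
    parts = re.split(r"\[ENT\d+\]", template)
    return re.compile("^" + "(.+?)".join(re.escape(p) for p in parts) + "$")

sentence = "Observe the clinical efficacy of self-prepared cold remedies in treating colds."
template, slots = mask_entities(sentence, ["colds", "cold remedies"])
pattern = template_to_regex(template)
```

Once stored, the pattern also matches new sentences that share the structure but contain different entities, which is what lets the sentence pattern pool generalize beyond the exact sentences in the corpus.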

[0117] 10. Comprehensive utilization method, as shown in Figure 9.

[0118] 10.1 When the user client enters data, the system first segments the data into sentences, performs NER recognition on each sentence, replaces entities with masks, and generates corresponding regular expression sentences.

[0119] 10.2 Each sentence is compared against and retrieved from the ESearch full-text search database. If the sentence exists, it is directly replaced with its stored translation. If the original sentence does not exist but its regular-expression pattern does, the Redis dictionary is called to translate the masked entities, the pattern is restored to standard output, and then the replacement is performed.

[0120] 10.3 If the content of this sentence does not exist in the remote supervision knowledge base, the model is invoked for translation.

[0121] 10.4 Combine the sentence data and merge it into output data to provide services to external parties.
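
The lookup order of steps 10.1 to 10.4 can be sketched as a three-tier fallback. Everything below is a stand-in under stated assumptions: plain dictionaries replace the ESearch index and the Redis term dictionary, regex capture groups replace NER-based masking, `stub_model` replaces the fine-tuned translation model, and the sample sentences and translations are invented.

```python
import re

# Stand-ins for the knowledge sources (all contents are invented examples).
EXACT_INDEX = {"针灸可以缓解疼痛。": "Acupuncture can relieve pain."}
PATTERNS = [
    (re.compile(r"(.+?)治疗(.+?)的临床疗效。"),
     "Clinical efficacy of {0} in treating {1}."),
]
TERM_DICT = {"人参": "ginseng", "气虚": "qi deficiency"}

def stub_model(sentence):
    """Placeholder for the fine-tuned translation model."""
    return f"[MT] {sentence}"

def split_sentences(text):
    """Split text after Chinese sentence-final punctuation."""
    return re.findall(r"[^。！？]+[。！？]?", text)

def translate_sentence(sentence, model=stub_model):
    """Return (translation, source) using the three-tier lookup."""
    # Tier 1: the whole sentence already exists in the parallel-corpus index.
    if sentence in EXACT_INDEX:
        return EXACT_INDEX[sentence], "corpus"
    # Tier 2: a stored pattern matches and every entity is in the dictionary.
    for zh_regex, en_template in PATTERNS:
        match = zh_regex.fullmatch(sentence)
        if match and all(g in TERM_DICT for g in match.groups()):
            terms = [TERM_DICT[g] for g in match.groups()]
            return en_template.format(*terms), "pattern"
    # Tier 3: nothing in the knowledge base applies, so call the model.
    return model(sentence), "model"

def translate_document(text, model=stub_model):
    """Translate sentence by sentence and merge the results."""
    results = [translate_sentence(s, model) for s in split_sentences(text)]
    return " ".join(t for t, _ in results), [src for _, src in results]

output, sources = translate_document(
    "针灸可以缓解疼痛。人参治疗气虚的临床疗效。今天天气很好。"
)
```

Checking the knowledge base before the model means that sentences and terms with vetted translations never depend on model prediction, while the model still covers everything else.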

[0122] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature, and in the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0123] In this invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," "linking," and "fixing," etc., should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this invention according to the specific circumstances.

[0124] In this invention, unless otherwise explicitly specified and limited, a first feature being "above" or "below" a second feature can include direct contact between the first and second features, or contact between the first and second features through another feature between them. Furthermore, the first feature being "above," "over," or "on top of" the second feature includes the first feature being directly above or diagonally above the second feature, or simply indicates that the first feature is at a higher horizontal level than the second feature. The first feature being "below," "beneath," or "under" the second feature includes the first feature being directly below or diagonally below the second feature, or simply indicates that the first feature is at a lower horizontal level than the second feature.

[0125] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0126] Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Those skilled in the art can make changes, modifications, substitutions and variations to the above embodiments within the scope of the present invention without departing from the principles and spirit of the present invention.

Claims

1. A Chinese-English machine translation method for the vertical field of traditional Chinese medicine, characterized in that it includes the following steps:

Step 1. Construction of a parallel corpus of traditional Chinese medicine;

Step 2. Building a neural machine translation model using transfer learning; Step 2 is divided into the following sub-steps:
2.1 Select the M2M_100 model as the pre-trained model for transfer learning;
2.2 Based on the pre-trained model, extend the dataset with the bilingual parallel corpus of traditional Chinese medicine;
2.3 Install the M2M100 Tokenizer;
2.4 Configure and construct the model hyperparameters and parameters according to the linguistic characteristics of traditional Chinese medicine;
2.5 Reduce the Transformer depth by pruning it using structured Dropout;
2.6 Binarize the training, validation, and test sets to prepare for model training;
2.7 Perform model training;
2.8 Obtain a new Chinese-English translation model for the vertical field of traditional Chinese medicine by training the pre-trained model on the TCM Chinese-English parallel corpus data;

Step 3. Processing of a terminology database in the field of traditional Chinese medicine;

Step 4. Construction of a remote supervision knowledge base; Step 4 is divided into the following sub-steps:
4.1 Construction of classified Chinese-English domain vocabularies;
4.1.1 For the processed Chinese-English TCM terms, add the corresponding synonyms in both directions for terms that have an alias / synonym relationship;
4.1.2 Cache the Chinese and English TCM terms in Redis by category to form separate Redis terminology vocabulary classification databases, with the order of entry into the database determined by the priority of the terms and their length;
4.1.3 Publish the business interface for other functions to call;
4.2 Indexing of the TCM Chinese-English parallel corpus data;
4.2.1 Synchronize the collected high-quality Chinese-English parallel corpus data into a one-to-one Chinese-English index dataset using the ESearch full-text search service;
4.2.2 Build and configure the ESearch full-text search server cluster, and publish the search interface for other functions to call;
4.3 Extract sentence patterns from the Chinese-English parallel corpus to form a sentence pattern pool;
4.3.1 Use a BERT pre-trained model loaded with Chinese TCM vocabulary data, and use its NER function to identify domain entity data in the TCM Chinese-English parallel corpus;
4.3.2 After identification, mask the entities present in the parallel corpus to form regular sentence patterns for the parallel corpus;
4.3.3 Enter the sentence patterns into the database and publish the search interface for other functions to call;

Step 5. Comprehensive utilization; Step 5 is divided into the following sub-steps:
5.1 When the user client inputs data, the system first segments the data into sentences, performs NER recognition on each sentence, replaces the entities with masks, and generates the corresponding regular sentence patterns;
5.2 Compare each sentence against the ESearch full-text search database. If the sentence exists, it is directly replaced with its translation. If the original sentence does not exist but its regular sentence pattern does, the Redis dictionary is called to translate the masked entity data, the regular sentence pattern is restored to standard output, and the replacement is then performed;
5.3 If the content of the sentence does not exist in the remote supervision knowledge base, the model is invoked for translation;
5.4 Combine the sentence data and merge it into output data to provide services to external parties.

2. The method for Chinese-English machine translation in the vertical field of traditional Chinese medicine according to claim 1, characterized in that Step 1 is divided into the following sub-steps:
1.1 Preprocess the collected Chinese-English parallel corpora in the field of traditional Chinese medicine, including filtering out garbled Chinese and English characters, handling special characters, deduplication, and checking for reasonableness;
1.2 Perform word segmentation and bilingual alignment on the preprocessed data;
1.3 Dataset partitioning: divide the preprocessed data into a training set, a validation set, and a test set in a ratio of 7:2:1.

3. The method for Chinese-English machine translation in the vertical field of traditional Chinese medicine according to claim 1, characterized in that Step 3 is divided into the following sub-steps:
3.1 Subject the collected Chinese-English terminology in the field of traditional Chinese medicine to garbled character filtering, special character processing, deduplication, and rationality verification;
3.2 Classify the collected Chinese-English terminology in the field of traditional Chinese medicine;
3.3 Sort the Chinese-English TCM terms in each category by length and assign them numerical weights.