An unsupervised bilingual lexicon extraction method based on nonlinear mapping

Nonlinear mapping is introduced to address the shortcomings of linear mapping in existing unsupervised bilingual dictionary extraction methods, thereby improving the accuracy and update efficiency of the bilingual dictionary.

CN116050437B · Active · Publication Date: 2026-06-19 · NANKAI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: NANKAI UNIV
Filing Date: 2022-12-26
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing unsupervised bilingual dictionary extraction methods rely on linear mapping matrices, which makes bilingual dictionaries for low-resource languages difficult to construct and slow to update. Furthermore, traditional linear mapping methods generally perform poorly because the isomorphism assumption between different languages is not satisfied.

Method used

A nonlinear mapping method is adopted, which optimizes the alignment of the bilingual vector space by initializing a seed dictionary, rotating the vector space with the Kabsch algorithm, translating vectors according to word-pair confidence and similarity, and re-extracting the dictionary with the cross-domain similarity local scaling method. Finally, the nearest neighbor method is used to extract the bilingual dictionary.

Benefits of technology

It improves the accuracy of bilingual dictionary extraction, alleviates the overfitting problem caused by linear mapping, is suitable for dictionary construction and updating of low-resource languages, and improves the efficiency of dictionary generation under unsupervised conditions.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides an unsupervised bilingual dictionary extraction method based on nonlinear mapping. The method firstly extracts seed dictionaries with high accuracy from a bilingual vector space which is preliminarily aligned through linear mapping, and then rotates and translates the language vectors in the vector space based on the seed dictionaries, so that the bilingual dictionary is extracted in the bilingual vector space with higher alignment. The whole model relieves the overfitting problem caused by the traditional linear mapping method by rotating the vector space as a whole, and realizes the nonlinear mapping of the bilingual vector space through the weighted translation method, so that the bilingual vector space is aligned in a fine-grained manner. The method of the application improves the extraction accuracy by introducing the nonlinear mapping method into the unsupervised bilingual dictionary extraction method, and effectively solves the overfitting problem caused by the traditional linear mapping.

Description

Technical Field

[0001] This invention belongs to the field of big data technology, specifically relating to an unsupervised bilingual dictionary extraction method based on nonlinear mapping.

Background Technology

[0002] Bilingual dictionaries are widely used in various scenarios in daily life and production. They are tools for looking up word-level relationships between two languages, aiding in language learning or translation. In natural language processing, bilingual dictionaries are fundamental to further research on relationships between texts; they provide significant assistance in machine translation, multimodal relation extraction, and other fields. They also offer technical support for further research on bilingual translation and provide a research foundation for bilingual text understanding. Therefore, high-quality bilingual dictionaries are crucial for in-depth research on relationships between two languages.

[0003] However, current bilingual dictionary extraction methods face two main problems. First, annotation is costly because it requires language experts. This disadvantage is amplified for low-resource languages (languages with fewer speakers and less available data), making the construction of bilingual dictionaries between low-resource languages and other languages extremely difficult and costly. Second, iteration and updates are slow. Because traditional bilingual dictionary generation relies on manual annotation by experts, updates are slow, making it difficult to keep word correspondences accurate as the meanings of popular vocabulary drift. Therefore, automated bilingual dictionary generation, especially the automatic extraction of bilingual dictionaries from monolingual corpora without supervision signals, is crucial to overcoming the shortcomings of current bilingual dictionaries.

[0004] In existing methods, unsupervised bilingual dictionary extraction takes as input two monolingual corpora, one for the source language and one for the target language. Word2Vec is used to obtain the vector spaces of the two languages from large-scale corpora. Based on these vector spaces, generative adversarial networks or iteratively updated seed dictionaries are used to calculate a linear (or orthogonal) matrix that maps the vectors of one language into the vector space of the other. After the vectors of the two languages are mapped into a shared vector space, dictionary extraction is performed according to the nearest neighbor principle to generate a bilingual dictionary as output.
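The last stage of this linear pipeline can be sketched as follows. This is a minimal illustration, assuming the mapping matrix W has already been learned and that cosine similarity defines the nearest neighbor; the function name `nearest_neighbor_dictionary` and the array layout (one word vector per row) are assumptions for illustration, not part of the source.

```python
import numpy as np

def nearest_neighbor_dictionary(X, Y, W):
    """Map source vectors with W, then pair each source row with the
    most cosine-similar target row. Returns target indices per source row."""
    mapped = X @ W
    # Normalise rows so that the dot product equals cosine similarity.
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sims = mapped @ Yn.T
    return sims.argmax(axis=1)
```

Because a single matrix W is applied to every source vector, any structural mismatch between the two spaces cannot be corrected locally, which is the limitation the following paragraphs discuss.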

[0005] However, these unsupervised dictionary extraction methods are all built on the same assumption: that the vector spaces of different languages have nearly identical structures, i.e., that languages are isomorphic, so a linear mapping matrix can map one vector space onto the other. However, the isomorphism assumption has been shown not to hold even for linguistically close language pairs such as English and French, both of which belong to the Indo-European language family. Therefore, existing dictionary extraction methods using linear mapping matrices generally perform poorly and have significant limitations in their application.

[0006] Using nonlinear mappings avoids operating on the language vector space as a whole with only a single mapping matrix. Instead, it allows fine-grained adjustments to language vectors at different locations within the vector space, overcoming the limitations of traditional linear mapping methods. Furthermore, nonlinear mappings can incorporate different dimensions of correlation between two languages to adjust the language vector space more precisely, for example by adjusting the mapping distance using information such as word frequency and word morphology.

[0007] In conclusion, unsupervised bilingual dictionary extraction based on nonlinear mapping is an innovative research approach with significant research implications and application value.

Summary of the Invention

[0008] The purpose of this invention is to address the problem that existing unsupervised bilingual dictionary extraction methods, which rely solely on linear mapping matrices to operate on the vector space, lead to decreased performance. This invention provides an unsupervised bilingual dictionary extraction method based on nonlinear mapping. This method initializes a seed dictionary in an unsupervised manner, building upon traditional linear mapping methods. Using the seed dictionary, supplemented by information such as linguistic morphology, word frequency, and term credibility, the bilingual vector space is rotated and translated, ultimately resulting in bilingual dictionary extraction. This method produces a dictionary with higher accuracy and can be effectively used for unsupervised bilingual dictionary extraction tasks.

[0009] This invention is achieved through the following technical solution:

[0010] An unsupervised bilingual dictionary extraction method based on nonlinear mapping includes the following steps:

[0011] Step 1: Using a bilingual unsupervised dictionary extraction method based on linear mapping, input the bilingual language vector space, learn to obtain the linear mapping matrix, and use the matrix to initially align the bilingual vector space;

[0012] Step 2: Based on the preliminarily aligned bilingual vector space obtained in Step 1, a cosine similarity and word frequency weighting mechanism between language vectors is introduced. The words are sorted according to their confidence scores, and K-means clustering is used to extract word pairs with high confidence scores to obtain the initial seed dictionary.

[0013] Step 3: Based on the seed dictionary obtained in Step 2, calculate the optimal rotation matrix that aligns the bilingual vector spaces as much as possible using the Kabsch algorithm, and then use this matrix to rotate the vector space.

[0014] Step 4: Based on the seed dictionary obtained in Step 2, the confidence and similarity of word pairs are introduced to calculate the weights of different word pairs. The language vectors in the rotated vector space obtained in Step 3 are translated according to the calculated weights.

[0015] Step 5: Based on the translated vector space obtained in Step 4, a new seed dictionary is extracted using the cross-domain similarity local scaling method, and the new bilingual dictionary is used to iterate through Steps 3-5 to optimize the alignment of the bilingual vector space.

[0016] Step 6: After iterating through steps 3-5 until the bilingual vector space converges, use the cross-domain similarity local scaling method and the nearest neighbor method to extract the bilingual dictionary in the optimized bilingual vector space, and merge them as the final output.

[0017] In the above technical solution, in step 1, the Vecmap model is used to perform preliminary alignment of the vector space. The input is the source language monolingual vector space X and the target language monolingual vector space Y. A linear mapping matrix $W_o$ is learned, and the mapped vector space $X_D = XW_o$ is computed;

[0018] The monolingual vector space is a set of low-dimensional representation vectors of all words in a monolingual corpus. Rows $i_1, \dots, i_n$ of the matrix are the representation vectors of words $w_1, \dots, w_n$, where X is composed of the representation vectors of all source-language words and Y is composed of the representation vectors of all target-language words.

[0019] In the above technical solution, in step 2, the seed dictionary is initialized on $X_D$ and $Y$, where the similarity between source word $w_x \in W_x$ and target word $w_y \in W_y$ is calculated as follows:

[0020]

[0021] In this context, function d is used to calculate the cosine similarity between the source word vector and the target word vector, and function F is used to calculate the word frequency weight. The derivation of the word frequency weight is as follows:

[0022] $F(w_*) = 1 + 1/f(w_*)$

[0023] where $f(w_*)$ is the rank of word $w_*$ when the monolingual vocabulary is sorted by frequency from highest to lowest, and $w_*$ stands for $w_x$ or $w_y$ in the formula above;

[0024] For each source word, the target word with the highest confidence is calculated according to the confidence calculation formula, and these word pairs form a candidate seed dictionary. In order to select word pairs with high accuracy to form a seed dictionary, the K-means clustering algorithm is used to divide the word pairs in the candidate seed dictionary into N classes according to weight, and the class with the highest confidence is selected as the seed dictionary D output.

[0025] In the above technical solution, in step 3, the optimal rotation matrix R is calculated using the Kabsch algorithm, and the derivation formula is as follows:

[0026]

[0027] $R = V I_R U^T$

[0028] where $I_R$ is a special identity matrix whose last element is $\det(VU^T)$, with all other diagonal values being 1. The source language vector space is rotated using the rotation matrix $R$ to obtain the rotated vector space $X_R$, derived as follows:

[0029] $X_R = X_D R$.

[0030] In the above technical solution, in step 4, the confidence of the word pair is composed of the cosine similarity of the word pair, the word frequency weight, and the number of word pair iterations, and the similarity is composed of the cosine similarity and the edit distance.

[0031] In the above technical solution, in step 4, for each source word $w_x$, the weights between it and the source words in the seed dictionary are calculated according to the weighting formula, and the K seed-dictionary words with the highest weights are selected. Let $x_1, \dots, x_K$ denote the language vectors of these words in $X_R$, and $y_1, \dots, y_K$ the language vectors of their corresponding translations in the seed dictionary. The translated language vector is then derived, and the source language vector space X1 is obtained, using the following formula:

[0032]

[0033] Similarly, for each target word $w_y$, after its weights with the word pairs in the seed dictionary are calculated, the K words with the highest weights are selected, and the translated language vectors are derived using the following formula to obtain the target language vector space Y1:

[0034]

[0035] In the above technical solution, in step 6, on the source language vector space $X_n$ and the target language vector space $Y_n$ obtained after the nonlinear mapping is completed, the bilingual dictionary $D_{n1}$ is extracted using the nearest neighbor method and the bilingual dictionary $D_{n2}$ is extracted using the cross-domain similarity local scaling method. The intersection $D_f$ of the word pairs in $D_{n1}$ and $D_{n2}$ is output as the result.

[0036] The advantages and beneficial effects of this invention are as follows:

[0037] This invention proposes an unsupervised bilingual dictionary extraction method based on nonlinear mapping. After initial alignment of the bilingual vector space using linear mapping, a seed dictionary with high accuracy is extracted. Based on this seed dictionary, the language vectors in the vector space are rotated as a whole and then translated individually according to weights, so that the bilingual dictionary is ultimately extracted from a better-aligned bilingual vector space. The entire model mitigates the overfitting of traditional linear mapping methods by rotating the vector space as a whole, and achieves fine-grained alignment of the bilingual vector space through weighted translation. This invention focuses on the benefits of nonlinear mapping in unsupervised bilingual dictionary extraction tasks, improving extraction accuracy by introducing nonlinear mapping into the unsupervised bilingual dictionary extraction method, and effectively solving the overfitting problem caused by traditional linear mapping.

Attached Figure Description

[0038] Figure 1 is a diagram defining the unsupervised bilingual dictionary extraction problem.

[0039] Figure 2 is a schematic diagram of the unsupervised bilingual dictionary extraction process based on nonlinear mapping.

[0040] Figure 3 is a schematic diagram of the unsupervised bilingual dictionary extraction method based on nonlinear mapping.

[0041] Figure 4 is a schematic diagram of the nonlinear mapping method.

[0042] Figure 5 is a diagram showing the dictionary extraction results for commonly used languages.

[0043] Figure 6 is a diagram showing the dictionary extraction results for low-resource languages.

[0044] For those skilled in the art, other related figures can be obtained from the above figures without any creative effort.

Detailed Implementation

[0045] To enable those skilled in the art to better understand the present invention, the technical solution of the present invention will be further described below with reference to specific embodiments.

[0046] This invention proposes an unsupervised bilingual dictionary extraction method based on nonlinear mapping. The main process is shown in Figure 2, and Figure 3 shows the method used in this invention.

[0047] The specific implementation process of this invention consists of six steps, of which the nonlinear mapping method in steps 3, 4, and 5 is shown in Figure 4. The specific embodiments of the present invention will be described in detail below with reference to the accompanying drawings.

[0048] This invention addresses the research problem of automatically extracting bilingual dictionaries from non-parallel corpora of two languages. Figure 1 defines the problem. On the left are two non-parallel corpora of two languages. The goal is to extract word alignment relationships between the two languages from these non-parallel corpora to form a bilingual dictionary, for example by pairing the term for "deep learning" in one language with "deep learning" in the other.

[0049] In this embodiment, the unsupervised bilingual dictionary extraction method based on nonlinear mapping proposed in this invention is evaluated on two types of datasets. The first is MUSE, a widely used large-scale dataset for the bilingual dictionary extraction task, published at the International Conference on Learning Representations (ICLR). This dataset includes 28 interconnected bilingual dictionary extraction task datasets for six languages (English, German, French, Italian, Spanish, and Portuguese), as well as multiple sets of bilingual dictionary extraction task datasets for other languages and English. For each language pair, the dataset is divided into two parts: TRAIN (training set) and TEST (test set), both of which contain a large number of accurate bilingual aligned word pairs. Because the method is unsupervised, this invention uses only the TEST part, to verify the effectiveness of bilingual dictionary extraction.

[0050] Another type is publicly available Uyghur-Chinese bilingual aligned corpora, which addresses the problem of the first type of datasets consisting entirely of commonly used languages. Uyghur, as a minority language, has a relatively small corpus size, making it a low-resource language. For Uyghur-Chinese language pairs, manually annotated bilingual dictionaries are also used as the test set.

[0051] After obtaining the bilingual dictionary extraction task dataset, for a given language, it is necessary to extract low-dimensional lexical representations from its monolingual corpus as input for this invention. Following consistent research settings in bilingual dictionary extraction tasks, this invention uses the Word2Vec method to convert the large-scale corpus into low-dimensional lexical representations, i.e., word vectors, as input. This yields the language vectors used as input to the method and the TEST dataset used for testing results.

[0052] Specifically, the unsupervised bilingual dictionary extraction method based on nonlinear mapping includes the following steps:

[0053] Step 1: Initially align the bilingual vector space using a linear mapping method.

[0054] To achieve a nonlinear mapping of the bilingual vector space, a seed dictionary with high accuracy must first be initialized. A traditional linear mapping method is used here because it provides a favorable initial condition for initializing the seed dictionary in the second step.

[0055] This invention uses the Vecmap model for initial alignment of the vector space. The Vecmap model is a typical linear mapping-based method in the field of bilingual dictionary extraction, and it is currently considered the best-performing unsupervised model. The input to this method is the source language monolingual vector space X and the target language monolingual vector space Y. The monolingual vector space is a set of low-dimensional representation vectors of all words in the monolingual corpus. Rows $i_1, \dots, i_n$ of the matrix are the representation vectors of words $w_1, \dots, w_n$, where X is composed of the representation vectors of all source-language words and Y is composed of the representation vectors of all target-language words. This method obtains the linear mapping matrix $W_o$ by calculating the similarity of the monolingual vector space matrices, and the mapped vector space is computed as $X_D = XW_o$.

[0056] Step 2: Initialize the seed dictionary

[0057] After initial alignment of the vector space, to further improve the accuracy of the seed dictionary, the word alignment degree and word frequency are chosen as the confidence scores for word pairs to initialize the seed dictionary. The higher the word alignment degree or the higher the word frequency, the higher the confidence score of the entry.

[0058] The seed dictionary is initialized on $X_D$ and $Y$, where the similarity between source word $w_x \in W_x$ and target word $w_y \in W_y$ is calculated as follows:

[0059]

[0060] In this context, function d is used to calculate the cosine similarity between the source word vector and the target word vector, and function F is used to calculate the word frequency weight. The derivation of the word frequency weight is as follows:

[0061] $F(w_*) = 1 + 1/f(w_*)$

[0062] where $f(w_*)$ is the rank of word $w_*$ when the monolingual vocabulary is sorted by frequency from highest to lowest, and $w_*$ stands for $w_x$ or $w_y$ in the formula above.
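The word frequency weight can be illustrated with a short sketch. It assumes the monolingual corpus is available as a flat list of tokens and that ties in frequency are broken alphabetically; both the tie-breaking rule and the function name `frequency_weights` are assumptions for illustration, not stated in the source.

```python
from collections import Counter

def frequency_weights(tokens):
    """Return F(w) = 1 + 1 / f(w), where f(w) is the 1-based rank of w
    when the vocabulary is sorted from most to least frequent."""
    counts = Counter(tokens)
    # Sort by descending count; alphabetical order breaks ties (assumption).
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: 1.0 + 1.0 / rank for rank, w in enumerate(ranked, start=1)}
```

Under this definition the most frequent word receives weight 2, and the weight approaches 1 for rare words, so frequent words contribute a higher confidence.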

[0063] For each source word, the most similar words are calculated based on their confidence scores, forming a large candidate seed dictionary. To select word pairs with high accuracy to form the seed dictionary, the K-means clustering algorithm is used to divide the word pairs in the candidate seed dictionary into N classes based on their confidence scores, and the class with the highest confidence score is selected as the seed dictionary D output.
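The cluster-based selection can be sketched as a one-dimensional K-means over the confidence scores. This is only an illustration under assumptions: the quantile-based centroid initialization, the iteration cap, and the name `select_seed_pairs` are hypothetical, and the confidence scores are taken as given because the scoring formula is not reproduced in the text.

```python
import numpy as np

def select_seed_pairs(pairs, scores, n_clusters=3, n_iter=50):
    """Cluster confidence scores into n_clusters groups with 1-D K-means
    and keep the word pairs that fall in the highest-scoring cluster."""
    scores = np.asarray(scores, dtype=float)
    # Deterministic initialisation at evenly spaced quantiles (assumption).
    centroids = np.quantile(scores, np.linspace(0.0, 1.0, n_clusters))
    for _ in range(n_iter):
        labels = np.abs(scores[:, None] - centroids[None, :]).argmin(axis=1)
        updated = centroids.copy()
        for k in range(n_clusters):
            if np.any(labels == k):
                updated[k] = scores[labels == k].mean()
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # Final assignment against the final centroids.
    labels = np.abs(scores[:, None] - centroids[None, :]).argmin(axis=1)
    best = centroids.argmax()
    return [p for p, lab in zip(pairs, labels) if lab == best]
```

Clustering avoids choosing a fixed confidence threshold by hand: the cut-off adapts to the score distribution of each language pair.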

[0064] During the experiments, because the source and target vocabularies differ in size, the language with the smaller vocabulary is used as the source language and the language with the larger vocabulary as the target language, to avoid one word corresponding to multiple words in the seed dictionary.

[0065] Step 3: Rotation of the vector space

[0066] To mitigate overfitting caused by the linear mapping method, based on the seed dictionary D, the optimal rotation matrix that aligns the bilingual vector spaces as closely as possible is calculated, and the language vector space is rotated using this matrix.

[0067] According to the Kabsch algorithm, the formula for calculating the optimal rotation matrix R is as follows:

[0068]

[0069] $R = V I_R U^T$

[0070] where SVD is the singular value decomposition algorithm and $I_R$ is a special identity matrix whose last element is $\det(VU^T)$, with all other diagonal values being 1. After the optimal rotation matrix is obtained, the word vectors in the source language vector space are rotated to fit the word vectors in the target language vector space as closely as possible. The rotated source language vector space is $X_R = X_D R$.
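A minimal NumPy sketch of the rotation step follows. The matrix passed to the SVD is not reproduced in the text, so the choice below (decomposing the product of the transposed seed target vectors and the seed source vectors) is an assumption, picked so that $R = V I_R U^T$ is the optimal rotation when vectors are stored as rows and applied as $X_D R$. The function name `kabsch_rotation` is hypothetical, and no centering is applied.

```python
import numpy as np

def kabsch_rotation(X_seed, Y_seed):
    """Return the proper rotation R minimising ||X_seed @ R - Y_seed||_F.
    Row i of X_seed and row i of Y_seed form one seed-dictionary pair."""
    # Assumed decomposition: U S V^T = SVD(Y_seed^T X_seed).
    U, _, Vt = np.linalg.svd(Y_seed.T @ X_seed)
    V = Vt.T
    # I_R: identity whose last diagonal entry is det(V U^T), i.e. +1 or -1,
    # which prevents the result from being a reflection.
    I_R = np.eye(X_seed.shape[1])
    I_R[-1, -1] = np.sign(np.linalg.det(V @ U.T))
    return V @ I_R @ U.T
```

Because R is constrained to be a rotation, it has far fewer degrees of freedom than an unconstrained linear map, which is consistent with the stated goal of mitigating overfitting.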

[0071] Step 4: Translation of the vector space

[0072] Using word pairs from the seed dictionary D, the words in the vector spaces $X_R$ and $Y$ are translated to achieve alignment under nonlinear mapping. For a word $w_x \in W_x$ and its corresponding word vector, the translation weight between it and a word in the seed dictionary D is calculated as follows:

[0073]

[0074] This weight takes into account both the confidence of the word pair in the seed dictionary D and the similarity between $w_x$ and the seed-dictionary word, which improves the accuracy of the weights. One function calculates the confidence of a word pair, and another calculates the similarity between $w_x$ and the seed-dictionary word. The confidence function ensures that seed dictionary entries with high confidence have higher weights, and is derived as follows:

[0075]

[0076] Here, the confidence score is the one used in step 2, and the function remain calculates the number of consecutive iterations in which a word pair has appeared in the seed dictionary.

[0077] The similarity function ensures that words with higher similarity have higher weights, and is derived as follows:

[0078]

[0079] The function d calculates the cosine similarity between the vectors of the two languages. The function L calculates the edit distance (Levenshtein distance) between the two words, which measures the morphological similarity between words and provides better alignment for alphabetic languages.
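For reference, the edit distance L can be computed with the standard dynamic-programming recurrence, sketched below. How L is combined with the cosine similarity d is defined by the similarity formula, which is not reproduced in the text, so this sketch covers only the distance itself; the function name `levenshtein` is an assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    prev = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]
```

A small distance between, say, a Spanish and a Portuguese word suggests a cognate, which is why the signal helps most for languages that share an alphabet.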

[0080] Based on the calculated weights, all words in the monolingual vector space are translated. For each source word $w_x$, the weights between it and the source words in the seed dictionary are calculated according to the weighting formula, and the K seed-dictionary words with the highest weights are selected. Let $x_1, \dots, x_K$ denote the language vectors of these words in $X_R$, and $y_1, \dots, y_K$ the language vectors of their corresponding translations in the seed dictionary. The translated language vector is then derived, and the source language vector space X1 is obtained, using the following formula:

[0081]

[0082] Similarly, for each target word $w_y$, after its weights with the word pairs in the seed dictionary are calculated, the K words with the highest weights are selected, and the translated language vectors are derived using the following formula to obtain the target language vector space Y1:

[0083]
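The translation formulas are not reproduced in the text, so the sketch below shows only one plausible form of a weighted translation and should not be read as the patented formula. It assumes each vector is shifted by the weight-normalized average of the offsets between its top-K seed pairs, and that all weights are positive; the function name `weighted_translate` and its parameters are hypothetical.

```python
import numpy as np

def weighted_translate(x, seed_src, seed_tgt, weights, k=2):
    """Shift vector x by a weighted average of the offsets (y_i - x_i)
    of the k seed pairs with the largest weights (assumed form)."""
    weights = np.asarray(weights, dtype=float)
    top = np.argsort(weights)[::-1][:k]        # indices of the k largest weights
    w = weights[top]
    offsets = seed_tgt[top] - seed_src[top]    # per-pair displacement
    return x + (w[:, None] * offsets).sum(axis=0) / w.sum()
```

Since each word selects its own top-K seed pairs and weights, different regions of the space move by different amounts, which is what makes the overall mapping nonlinear.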

[0084] Step 5: Iterative optimization of the vector space

[0085] Based on the source language vector space X1 and target language vector space Y1 obtained in step 4, a new bilingual dictionary D1 is extracted using the cross-domain similarity local scaling method. This method alleviates the problem of a point being the nearest neighbor of multiple other points in a high-dimensional space by calculating the weighted scaling distance between two points. Based on this distance, for each source word, the target word with the closest language vector is extracted to obtain the new bilingual dictionary D1.
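Cross-domain similarity local scaling (CSLS) is commonly defined as twice the cosine similarity of a pair minus the mean similarity of each word to its k nearest neighbors in the other language. The sketch below follows that common definition; the default k = 10 and the function name `csls_dictionary` are assumptions, and the text does not state which neighborhood size is used.

```python
import numpy as np

def csls_dictionary(X, Y, k=10):
    """For each source row of X, return the index of the target row of Y
    with the highest CSLS score."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    sims = Xn @ Yn.T                                    # cosine similarities
    k_x = min(k, sims.shape[1])
    k_y = min(k, sims.shape[0])
    # Mean similarity of each source word to its k nearest target words.
    r_x = np.sort(sims, axis=1)[:, -k_x:].mean(axis=1)
    # Mean similarity of each target word to its k nearest source words.
    r_y = np.sort(sims, axis=0)[-k_y:, :].mean(axis=0)
    csls = 2.0 * sims - r_x[:, None] - r_y[None, :]
    return csls.argmax(axis=1)
```

Subtracting the neighborhood means penalizes "hub" vectors that are close to many words at once, which is the nearest-neighbor problem the paragraph above describes.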

[0086] The new bilingual dictionary D1 is used as the seed dictionary for steps 3 and 4, and steps 3, 4, and 5 are iterated to obtain a new bilingual dictionary $D_n$. (That is, with D1 as the seed dictionary for step 3, the vector spaces X1 and Y1 pass through steps 3, 4, and 5 again to yield vector spaces X2 and Y2 and bilingual dictionary D2; then, with D2 as the seed dictionary for step 3, X2 and Y2 pass through steps 3, 4, and 5 again to yield X3, Y3, and D3; and so on.) When $D_n$ no longer differs from $D_{n-1}$, the nonlinear mapping of the vector space is complete, the loop ends, and the alignment of the bilingual vector space has been optimized.
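The control flow of this iteration can be sketched independently of how each step is implemented. In the sketch below, `rotate`, `translate`, and `extract` are hypothetical callables standing in for steps 3, 4, and 5, and the `max_iter` safety cap is an assumption that does not appear in the source.

```python
def iterate_until_stable(X, Y, seed, rotate, translate, extract, max_iter=100):
    """Repeat steps 3-5 until the extracted dictionary stops changing.
    seed must support equality comparison, e.g. a frozenset of word pairs."""
    for _ in range(max_iter):
        X = rotate(X, Y, seed)            # step 3: rotate the source space
        X, Y = translate(X, Y, seed)      # step 4: weighted translation
        new_seed = extract(X, Y)          # step 5: re-extract the dictionary
        if new_seed == seed:              # D_n equals D_{n-1}: converged
            break
        seed = new_seed
    return X, Y, seed
```

Representing each dictionary as a set of word pairs makes the stopping test a plain equality check.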

[0087] Step 6: Extract the bilingual dictionary from the optimized vector space.

[0088] On the source language vector space $X_n$ and the target language vector space $Y_n$ obtained after the nonlinear mapping is completed, the bilingual dictionary $D_{n1}$ is extracted using the nearest neighbor method, which is based on the Euclidean distance between word vectors, and the bilingual dictionary $D_{n2}$ is extracted using the cross-domain similarity local scaling method described in step 5, according to the weighted scaling distance between word vectors. The intersection of the word pairs in $D_{n1}$ and $D_{n2}$ serves as the final extracted bilingual dictionary $D_f$.
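The final merge can be sketched as a set intersection. The sketch assumes dictionaries are represented as sets of (source index, target index) pairs; the function names `euclidean_nn_pairs` and `intersect_dictionaries` are hypothetical.

```python
import numpy as np

def euclidean_nn_pairs(X, Y):
    """Pair each source row with the target row at the smallest
    Euclidean distance; returns a set of (source, target) index pairs."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)  # squared distances
    return {(i, int(j)) for i, j in enumerate(d2.argmin(axis=1))}

def intersect_dictionaries(d_n1, d_n2):
    """Keep only the word pairs on which both extraction methods agree."""
    return d_n1 & d_n2
```

Keeping only pairs proposed by both methods trades coverage for precision: a pair survives only if two different distance measures select it.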

[0089] The effectiveness of the proposed method is verified by performing bilingual dictionary extraction tasks on two datasets: MUSE and Uyghur-Chinese. The proposed method is compared with four classic linear mapping-based methods: Adversarial Autoencoder, MUSE, ICP, and Vecmap. The first two are adversarial training methods, while the latter two both use a seed dictionary construction followed by iterative generation of the final result. The effectiveness of the method is measured by the accuracy of the extracted seed dictionary, i.e., the percentage of correct word pairs.
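The accuracy metric can be illustrated as follows. The sketch assumes the test set maps each source word to a set of acceptable translations and that only source words present in the test set are scored; both conventions, and the name `dictionary_accuracy`, are assumptions rather than details given in the text.

```python
def dictionary_accuracy(predicted, gold):
    """Percentage of evaluated source words whose predicted translation
    appears among the gold translations."""
    evaluated = [s for s in predicted if s in gold]
    if not evaluated:
        return 0.0
    correct = sum(predicted[s] in gold[s] for s in evaluated)
    return 100.0 * correct / len(evaluated)
```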

[0090] Figure 5 shows the results of this method on commonly used language pairs in the MUSE dataset. For language pairs with high usage frequency and large corpora, this invention effectively improves the accuracy of dictionary extraction.

[0091] Figure 6 shows the results of this method on the Uyghur-Chinese language pair. Uyghur is a low-resource language, so traditional linear mapping models may fail to train (accuracy less than 5%). In the experiments, each model was run 10 times. "Best" in the table corresponds to the best accuracy achieved in the experiments, "Avg" corresponds to the average accuracy across the 10 experiments, and "S" corresponds to the number of times word vectors were successfully aligned (accuracy greater than 5%) across the 10 experiments. This invention achieves the best results on all metrics, improving upon the best results of the baseline model by 1.57% and 2.5%, respectively. These experimental results demonstrate the effectiveness of the proposed invention.
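The three reported statistics can be computed from a list of per-run accuracies as sketched below. The sketch assumes that "Avg" is taken over all runs, including failed ones; the function name `summarize_runs` and the sample values in the usage note are illustrative and are not results from the patent.

```python
def summarize_runs(accuracies, success_threshold=5.0):
    """Summarise repeated runs: best accuracy, mean accuracy over all runs,
    and the number of runs whose accuracy exceeds the success threshold."""
    best = max(accuracies)
    avg = sum(accuracies) / len(accuracies)
    successes = sum(a > success_threshold for a in accuracies)
    return {"Best": best, "Avg": avg, "S": successes}
```

For example, `summarize_runs([30.0, 2.0, 28.0, 1.0])` counts two successful runs, since only two accuracies exceed 5%.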

[0092] In summary, this invention outperforms other comparative methods in the unsupervised bilingual dictionary extraction task, effectively demonstrating the rationality and effectiveness of the proposed nonlinear mapping-based unsupervised bilingual dictionary extraction method.

[0093] The present invention has been described above by way of example. It should be noted that any simple modifications, alterations or other equivalent substitutions that can be made by those skilled in the art without creative effort without departing from the core of the present invention fall within the protection scope of the present invention.

Claims

1. A nonlinear mapping based unsupervised bilingual lexicon extraction method, characterized in that it includes the following steps:
Step 1: Using a bilingual unsupervised dictionary extraction method based on linear mapping, input the bilingual language vector space, learn to obtain the linear mapping matrix, and use the matrix to initially align the bilingual vector space;
Step 2: Based on the preliminarily aligned bilingual vector space obtained in Step 1, a cosine similarity and word frequency weighting mechanism between language vectors is introduced. The words are sorted according to their confidence scores, and K-means clustering is used to extract word pairs with high confidence scores to obtain the initial seed dictionary.
Step 3: Based on the seed dictionary obtained in Step 2, calculate the optimal rotation matrix that aligns the bilingual vector spaces as much as possible using the Kabsch algorithm, and then use this matrix to rotate the vector space.
Step 4: Based on the seed dictionary obtained in Step 2, the confidence and similarity of word pairs are introduced to calculate the weights of different word pairs. The language vectors in the rotated vector space obtained in Step 3 are translated according to the calculated weights.
Step 5: Based on the translated vector space obtained in Step 4, a new seed dictionary is extracted using the cross-domain similarity local scaling method, and the new bilingual dictionary is used to iterate through Steps 3-5 to optimize the alignment of the bilingual vector space.
Step 6: After iterating through steps 3-5 until the bilingual vector space converges, use the cross-domain similarity local scaling method and the nearest neighbor method to extract the bilingual dictionary in the optimized bilingual vector space, and merge them as the final output.

2. The unsupervised bilingual lexicon extraction method based on nonlinear mapping according to claim 1, characterized in that: in step 1, the Vecmap model is used to perform initial alignment of the vector spaces. The inputs are the source language monolingual vector space X and the target language monolingual vector space Y. A linear mapping matrix $W_o$ is learned, and the mapped vector space $X_D = XW_o$ is computed. The monolingual vector space is a set of low-dimensional representation vectors of all words in a monolingual corpus. Rows $i_1, \dots, i_n$ of the matrix are the representation vectors of words $w_1, \dots, w_n$, where X is composed of the representation vectors of all source-language words and Y is composed of the representation vectors of all target-language words.

3. The unsupervised bilingual lexicon extraction method based on nonlinear mapping according to claim 2, characterized in that: in step 2, the seed dictionary is initialized on $X_D$ and $Y$, where the confidence between a source word $w_x \in W_x$ and a target word $w_y \in W_y$ is calculated by the following formula: [formula not reproduced in this text]. Here, the function $d$ calculates the cosine similarity between the source word vector and the target word vector, and the function $F$ calculates the word-frequency weight, which is derived as follows:

$F(w_*) = 1 + \frac{1}{f(w_*)}$

where $f(w_*)$ is the rank of the word $w_*$ when the monolingual vocabulary is sorted by frequency from highest to lowest, and $w_*$ stands for $w_x$ or $w_y$ in the confidence formula. For each source word, the target word with the highest confidence is found according to the confidence formula, and these word pairs form a candidate seed dictionary. To select word pairs with high accuracy for the seed dictionary, the K-means clustering algorithm is used to divide the word pairs in the candidate seed dictionary into $N$ classes according to weight, and the class with the highest confidence is output as the seed dictionary $D$.
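The seed-dictionary initialization of claim 3 can be sketched roughly as below. Only $F(w_*) = 1 + 1/f(w_*)$ is taken from the claim. Because the confidence formula itself is not reproduced in this text, the sketch assumes, purely for illustration, that confidence is the cosine similarity multiplied by both frequency weights. It also assumes that the rows of each matrix are already sorted by descending word frequency, so row `i` has rank `i + 1`. All function names are hypothetical, and the small one-dimensional K-means is a hand-written stand-in for whichever implementation is actually used.

```python
import numpy as np

def frequency_weight(rank):
    """F(w) = 1 + 1 / f(w), with f(w) the 1-based frequency rank of w."""
    return 1.0 + 1.0 / np.asarray(rank, dtype=float)

def candidate_seed_pairs(X_D, Y):
    """For each source row, return the best target row and its confidence.

    ASSUMPTION (not from the claims): confidence = cos * F(w_x) * F(w_y),
    and row i of each matrix has frequency rank i + 1.
    """
    Xn = X_D / np.linalg.norm(X_D, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Xn @ Yn.T
    Fx = frequency_weight(np.arange(1, X_D.shape[0] + 1))
    Fy = frequency_weight(np.arange(1, Y.shape[0] + 1))
    conf = cos * Fx[:, None] * Fy[None, :]
    best = conf.argmax(axis=1)
    return best, conf[np.arange(X_D.shape[0]), best]

def kmeans_1d(values, n_classes, n_iter=50):
    """Lloyd's k-means on scalars; returns (labels, centers)."""
    values = np.asarray(values, dtype=float)
    centers = np.quantile(values, np.linspace(0.0, 1.0, n_classes))
    for _ in range(n_iter):
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        for c in range(n_classes):
            if np.any(labels == c):
                centers[c] = values[labels == c].mean()
    labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
    return labels, centers

def select_seed_dictionary(best, conf, n_classes=3):
    """Keep the candidate pairs in the cluster with the highest center."""
    labels, centers = kmeans_1d(conf, n_classes)
    top = centers.argmax()
    return [(int(i), int(best[i])) for i in np.where(labels == top)[0]]
```

A call such as `select_seed_dictionary(*candidate_seed_pairs(X_D, Y), n_classes=3)` returns `(source_index, target_index)` pairs. The number of classes is a free parameter here, since the claim only names it $N$.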

4. The unsupervised bilingual lexicon extraction method based on nonlinear mapping according to claim 1, characterized in that: in step 3, the optimal rotation matrix $R$ is calculated using the Kabsch algorithm, and is derived as follows:

$R = V I_R U^T$

where $I_R$ is a special identity matrix whose last element is $\det(VU^T)$ and whose other diagonal values are all 1. The source-language vector space is rotated using the rotation matrix $R$ to obtain the rotated vector space $X_R$, derived as follows:

$X_R = X_D R$

5. The unsupervised bilingual lexicon extraction method based on nonlinear mapping according to claim 1, characterized in that: in step 4, the confidence of a word pair is composed of the cosine similarity of the word pair, the word-frequency weight, and the number of iterations of the word pair; the similarity is composed of the cosine similarity and the edit distance.
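The rotation of claim 4 can be illustrated with a short NumPy sketch. The claim does not state which matrix is decomposed to obtain $U$ and $V$. The sketch assumes they are the singular vectors of $Y_{seed}^T X_{seed}$, built from the vectors of the seed-dictionary pairs, because that choice makes $R = V I_R U^T$ the rotation that minimizes $\lVert X_{seed} R - Y_{seed} \rVert_F$ in the row-vector form $X_R = X_D R$. The function name and the absence of a centering step are also assumptions.

```python
import numpy as np

def kabsch_rotation(X_seed, Y_seed):
    """Proper rotation R (det = +1) minimizing ||X_seed @ R - Y_seed||_F.

    X_seed, Y_seed: (n_pairs, dim) arrays; row i of each holds the vectors
    of the i-th seed-dictionary word pair.
    ASSUMPTION: U, V come from the SVD of Y_seed.T @ X_seed.
    """
    U, _, Vt = np.linalg.svd(Y_seed.T @ X_seed)
    V = Vt.T
    # I_R: identity whose last diagonal entry is det(V U^T), which is +1 or
    # -1 up to rounding; this rules out reflections.
    I_R = np.eye(X_seed.shape[1])
    I_R[-1, -1] = np.sign(np.linalg.det(V @ U.T))
    return V @ I_R @ U.T
```

With `R = kabsch_rotation(X_D[src_idx], Y[tgt_idx])`, the whole source space is then rotated as `X_R = X_D @ R`; `src_idx` and `tgt_idx` are hypothetical index arrays taken from the seed dictionary.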

6. The unsupervised bilingual lexicon extraction method based on nonlinear mapping according to claim 1, characterized in that: in step 4, for each source word $w_x$, its weights with respect to the word pairs in the seed dictionary ($i = 1, \ldots, N$) are calculated according to the weighting formula, and the $K$ words with the highest weights are selected. $x_1, \ldots, x_K$ denote the language vectors of these words in $X_R$, and $y_1, \ldots, y_K$ denote the language vectors of their corresponding translations in the seed dictionary. The translated language vectors are derived by the following formula to obtain the source-language vector space $X_1$: [formula not reproduced in this text]. Similarly, for each target word $w_y$, after its weights with respect to the word pairs in the seed dictionary are calculated, the $K$ words with the highest weights are selected, and the translated language vectors are derived by the following formula to obtain the target-language vector space $Y_1$: [formula not reproduced in this text].
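The translation formulas of claim 6 are not reproduced in this text, so the following is only one plausible reading, not the claimed formula. It assumes that each source vector is shifted by a normalized weighted mean of the offsets $y_k - x_k$ of its $K$ highest-weighted seed pairs, and it replaces the claimed weight (built from confidence, cosine similarity and edit distance) with plain cosine similarity clipped at zero. The function and parameter names are hypothetical.

```python
import numpy as np

def weighted_translation(X_R, seed_src, seed_tgt, k=3):
    """Shift each row of X_R by a weighted mean of seed-pair offsets.

    X_R:      (n_words, dim) rotated source vectors.
    seed_src: (n_seeds, dim) vectors x_i of the seed source words in X_R.
    seed_tgt: (n_seeds, dim) vectors y_i of their seed translations.
    ASSUMPTIONS (not from the claims): weight = max(cosine, 0); the shift is
    sum_k w_k * (y_k - x_k) / sum_k w_k over the K best-weighted seed pairs.
    """
    Xn = X_R / np.linalg.norm(X_R, axis=1, keepdims=True)
    Sn = seed_src / np.linalg.norm(seed_src, axis=1, keepdims=True)
    weights = np.clip(Xn @ Sn.T, 0.0, None)            # (n_words, n_seeds)
    k = min(k, seed_src.shape[0])
    top = np.argsort(-weights, axis=1)[:, :k]          # K best seeds per word
    w = np.take_along_axis(weights, top, axis=1)       # (n_words, k)
    w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-12)
    offsets = seed_tgt[top] - seed_src[top]            # (n_words, k, dim)
    return X_R + (w[:, :, None] * offsets).sum(axis=1)
```

Because every word receives its own shift, determined by its nearest seed pairs, the combined mapping is no longer a single linear transformation. This matches the fine-grained, nonlinear alignment that step 4 aims for. The target-side space would be handled symmetrically.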

7. The unsupervised bilingual lexicon extraction method based on nonlinear mapping according to claim 1, characterized in that: in step 6, on the source-language vector space $X_n$ and the target-language vector space $Y_n$ obtained after the nonlinear mapping is completed, a bilingual dictionary $D_{n1}$ is extracted using the nearest neighbor method and a bilingual dictionary $D_{n2}$ is extracted using the cross-domain similarity local scaling method; the intersection $D_f$ of the word pairs in $D_{n1}$ and $D_{n2}$ is output as the result.
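The final extraction of claim 7 can be sketched as follows. The sketch uses the commonly published definition of cross-domain similarity local scaling, $\mathrm{CSLS}(x, y) = 2\cos(x, y) - r_T(x) - r_S(y)$, where $r_T(x)$ is the mean cosine similarity of $x$ to its $k$ nearest target neighbors and $r_S(y)$ is the corresponding mean for $y$ over source neighbors. The claims do not fix $k$ or the retrieval direction, so `k=10`, source-to-target retrieval, and all function names are assumptions.

```python
import numpy as np

def _cosine_matrix(X, Y):
    """Pairwise cosine similarity between rows of X and rows of Y."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    return Xn @ Yn.T

def nearest_neighbor_dictionary(X, Y):
    """D_n1: each source row paired with its highest-cosine target row."""
    cos = _cosine_matrix(X, Y)
    return {(i, int(j)) for i, j in enumerate(cos.argmax(axis=1))}

def csls_dictionary(X, Y, k=10):
    """D_n2: each source row paired with its highest-CSLS target row."""
    cos = _cosine_matrix(X, Y)
    k_t = min(k, cos.shape[1])
    k_s = min(k, cos.shape[0])
    # Mean similarity of each source word to its k nearest target words.
    r_src = np.sort(cos, axis=1)[:, -k_t:].mean(axis=1)
    # Mean similarity of each target word to its k nearest source words.
    r_tgt = np.sort(cos, axis=0)[-k_s:, :].mean(axis=0)
    csls = 2.0 * cos - r_src[:, None] - r_tgt[None, :]
    return {(i, int(j)) for i, j in enumerate(csls.argmax(axis=1))}

def final_dictionary(X_n, Y_n, k=10):
    """D_f: word pairs found by both the nearest neighbor and CSLS methods."""
    return nearest_neighbor_dictionary(X_n, Y_n) & csls_dictionary(X_n, Y_n, k)
```

Taking the intersection keeps only pairs on which both retrieval criteria agree, which trades coverage for precision.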

Citation Information

Patent Citations

  • Non-linear PLS intermittent process monitoring method of semi-supervised RSDAE

    CN113420815A

  • Bilingual word alignment method and system

    CN113591496A