111 results about "Bilingual corpus" patented technology

Synonymous collocation extraction using translation information

Inactive | US7689412B2 | Natural language translation | Speech analysis | Collocation extraction | Human language
A method of automatically extracting synonymous collocations from monolingual corpora and a small bilingual corpus is proposed. The methodology includes generating candidate synonymous collocations and selecting synonymous collocations as a function of translation information, including collocation translations and probabilities. Candidate synonymous collocations with similarity scores that exceed a threshold are extracted as synonymous collocations. The extracted collocations can be used later in language generation by substituting synonymous collocations for applications such as writing assistance programs.
Owner:MICROSOFT TECH LICENSING LLC

Synonymous collocation extraction using translation information

Inactive | US20050125215A1 | Natural language translation | Speech analysis | Collocation extraction | Human language
A method of automatically extracting synonymous collocations from monolingual corpora and a small bilingual corpus is proposed. The methodology includes generating candidate synonymous collocations and selecting synonymous collocations as a function of translation information, including collocation translations and probabilities. Candidate synonymous collocations with similarity scores that exceed a threshold are extracted as synonymous collocations. The extracted collocations can be used later in language generation by substituting synonymous collocations for applications such as writing assistance programs.
Owner:MICROSOFT TECH LICENSING LLC

Method and apparatus for improving translation knowledge of machine translation

A method of improving translation knowledge includes the steps of preparing a set of translation knowledge, preparing a bilingual corpus of a source language and a target language, machine-translating sentences of the source language in the bilingual corpus to the target language using a set of translation knowledge, evaluating translation quality of the resulting translations in accordance with a prescribed evaluation standard, calculating degree of contribution to translation quality of a part of the translation knowledge, and removing the corresponding part of the translation knowledge when the calculated degree of contribution of the part is negative.
Owner:ATR ADVANCED TELECOMM RES INST INT

System and method for automatic detection of collocation mistakes in documents

A method and computer-readable medium are provided that construct a collocation mistake pattern database for use in writing in a first language by a person whose native language is a second language. The method includes obtaining a bilingual corpus having sentences in first and second languages and extracting second language word pairs from the second language sentences in the corpus. For each second language word pair extracted from the corpus, a corresponding first language word pair is extracted from the corresponding first language sentence in the corpus to determine a correct first language translation for the second language word pair. Also, for each second language word pair extracted from the corpus, a set of combinations of first language translation words corresponding to the second language word pair is created. Finally, for each second language word pair extracted from the corpus, the correct first language translation is removed from the set of combinations of first language translation words such that the set of combinations represent a set of collocation mistake first language word pairs corresponding to the second language word pair.
Owner:MICROSOFT TECH LICENSING LLC
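
The final two steps of the abstract above (build every combination of candidate translation words, then remove the correct translation) can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name `collocation_mistakes` and the example word lists are hypothetical.

```python
from itertools import product


def collocation_mistakes(translations_a, translations_b, correct_pair):
    """Return first-language word pairs that are plausible but wrong.

    translations_a / translations_b: candidate first-language translations
    of the two words in a second-language word pair (hypothetical inputs).
    correct_pair: the translation actually attested in the bilingual corpus.
    """
    # Every combination of candidate translation words.
    candidates = set(product(translations_a, translations_b))
    # Removing the attested translation leaves the mistake patterns.
    candidates.discard(correct_pair)
    return candidates


# Illustrative data only: two words with three and one candidate translations.
mistakes = collocation_mistakes(
    ["watch", "see", "look"], ["television"], ("watch", "television")
)
print(sorted(mistakes))
```

A real database would store these sets keyed by the second-language word pair, so that a writer's incorrect pair can be looked up and mapped back to the correct one.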

Machine translation apparatus and machine translation computer program

A method of machine translation, using a bilingual corpus containing translation pairs each consisting of a sentence of a first language and a sentence of a second language, for translating an input sentence of the first language to the second language, including the steps of: receiving the input sentence of the first language and extracting, from the bilingual corpus, a sentence of the second language forming a pair with a sentence of the first language with highest similarity to the input sentence; applying an arbitrary modification among a plurality of predetermined modifications to the extracted sentence of the second language, and computing likelihood of sentences resulting from the modification; selecting a prescribed number of sentences having high likelihood from among the sentences resulting from the modification; repeating, on each of the sentences selected in the step of selecting, the steps of extracting, computing and selecting, until the likelihood no longer improves; and outputting, as a translation of the input sentence, a sentence having the highest likelihood among the sentences of the second language left at the end of the step of repeating.
Owner:ATR ADVANCED TELECOMM RES INST INT

Collocation translation from monolingual and available bilingual corpora

A system and method of extracting collocation translations is presented. The methods include constructing a collocation translation model using monolingual source and target language corpora as well as a bilingual corpus, if available. The collocation translation model employs an expectation maximization algorithm with respect to contextual words surrounding collocations. The collocation translation model can be used later to extract a collocation translation dictionary. Optional filters based on context redundancy and/or a bi-directional translation constraint can be used to ensure that only highly reliable collocation translations are included in the dictionary. The constructed collocation translation model and the extracted collocation translation dictionary can be used later for further natural language processing, such as sentence translation.
Owner:MICROSOFT TECH LICENSING LLC

A bilingual word embedding-based cross-language text similarity assessment technique

The invention belongs to the field of language processing, and in particular relates to a cross-language text similarity evaluation technique based on bilingual word embedding. The technical route and workflow can be divided into three stages: construction of a bilingual word embedding model, construction of a text similarity calculation framework based on multiple neural networks, and cross-language similarity calculation. The model generates a bilingual shared word embedding representation: based on word vector correlation theory, the Skip-Gram model is used to train word vectors on an artificially constructed pseudo-bilingual corpus. To make the generated word embedding space as complete as possible, a monolingual corpus is used as a supplement to learn additional word embedding knowledge. The similarity score of sentences is obtained by combining several neural network structures to learn the semantic representation of sentences. By dividing short text into paragraphs and treating paragraphs as long sentences for sequence input, similarity calculation on a larger scale can be realized.
Owner:HARBIN ENG UNIV

Method and apparatus for aligning bilingual corpora

A method is provided for aligning sentences in a first corpus to sentences in a second corpus. The method includes applying a length-based alignment model to align sentence boundaries of a sentence in the first corpus with sentence boundaries of a sentence in the second corpus to form an aligned sentence pair. The aligned sentence pair is then used to train a translation model. Once trained, the translation model is used to align sentences in the first corpus to sentences in the second corpus. Under aspects of the invention, pruning is used to reduce the number of sentence boundary alignments considered by the length-based alignment model and by the translation model. In further aspects of the invention, the length-based model utilizes a Poisson distribution.
Owner:MICROSOFT TECH LICENSING LLC
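
The abstract above states only that the length-based model uses a Poisson distribution. As a rough sketch of what such a score might look like, the snippet below assumes the target sentence length is Poisson-distributed with a mean proportional to the source sentence length; the function name and the `ratio` parameter are assumptions for illustration, not details from the patent.

```python
import math


def poisson_log_prob(target_len, source_len, ratio=1.0):
    """Log-probability of a target length given a source length.

    Assumption: target_len ~ Poisson(ratio * source_len), where ratio is
    the average target/source length ratio estimated from the corpus.
    source_len must be positive.
    """
    mean = ratio * source_len
    # log of mean^k * exp(-mean) / k!, with lgamma(k + 1) == log(k!)
    return target_len * math.log(mean) - mean - math.lgamma(target_len + 1)


# A 10-word source sentence is a better length match for a 10-word
# target sentence than for a 20-word one.
close = poisson_log_prob(10, 10)
far = poisson_log_prob(20, 10)
print(close > far)
```

Scores like this can rank candidate sentence pairings cheaply, which is why a length-based pass is a natural first stage before training the translation model on the resulting pairs.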

Method and device for aligning sentences in bilingual corpus

The embodiment of the invention discloses a method and a device for aligning sentences in a bilingual corpus. A source language corpus and a target language corpus in the bilingual corpus are in block alignment. The method comprises the following steps of: aiming at each alignment block in a source language and a target language, generating a candidate translation pair list according to a source keyword list and a target keyword list which are extracted from a source block and a target block respectively; generating a bilingual dictionary according to the translation probability of each translation pair in the candidate translation pair list; expanding the bilingual dictionary by taking a source-target keyword pair in each item in the bilingual dictionary as a seed translation pair in reference to contents of a text of the seed translation pair; translating a source sentence in the source block into a target language, and calculating the similarity between a translation result and a target sentence in the target block; and aligning the source sentence to the target sentence according to the similarity. By the embodiment of the invention, the flow of aligning the sentences can be simplified and the sentence alignment efficiency is improved.
Owner:FUJITSU LTD

Automatic extraction of transfer mappings from bilingual corpora

A method of aligning nodes of dependency structures obtained from a bilingual corpus includes a two-phase approach wherein a first phase comprises associating nodes of the dependency structures to form tentative correspondences. The nodes of the dependency structures are then aligned as a function of the tentative correspondences and structural considerations. Mappings are obtained from the aligned dependency structures. The mappings can be expanded with varying types and amounts of local context in order that a more fluent translation can be obtained when translation is performed.
Owner:MICROSOFT TECH LICENSING LLC

A Neural Network Mongolian-Chinese Machine Translation Method Based on Encoder-Decoder

A neural network Mongolian-Chinese machine translation method based on an encoder-decoder is provided. The method uses an encoder E and two-layer decoders D1 and D2. The encoder E encodes the Mongolian source language into a vector list, and a retrospective step with an attention mechanism is then applied at the hidden layer of the encoder. During decoding, the decoder D1 obtains the hidden state before softmax and a draft sentence; the hidden states of the encoder E and the decoder D1 are then taken as the input of the decoder D2 to obtain the second-channel sequence, i.e., the final translation. In the preprocessing stage, the Chinese corpus is first segmented into words, and the Mongolian-Chinese bilingual corpus is segmented into stems, affixes and cases and then into subword segments (BPE), which can effectively refine the translation granularity and reduce the number of unknown words; the Mongolian-Chinese word vectors are then constructed with Word2vec. For unknown words, a Mongolian-Chinese dictionary of proprietary vocabulary is also constructed, which can effectively improve the quality of translation.
Owner:INNER MONGOLIA UNIV OF TECH

Neural machine translation method and apparatus

The present invention provides a method of generating training data to which explicit word-alignment information is added without impairing sub-word tokens, and a neural machine translation method and apparatus including the method. The method of generating training data includes the steps of: (1) separating basic word boundaries through morphological analysis or named entity recognition of a sentence of a bilingual corpus used for learning; (2) extracting explicit word-alignment information from the sentence of the bilingual corpus used for learning; (3) further dividing the word boundaries separated in step (1) into sub-word tokens; (4) generating new source language training data by using an output from the step (1) and an output from the step (3); and (5) generating new target language training data by using the explicit word-alignment information generated in the step (2) and the target language outputs from the steps (1) and (3).
Owner:ELECTRONICS & TELECOMM RES INST

Statistical machine translation apparatus and method

A statistical machine translation apparatus and method reflecting linguistic information are provided. In the process of generating a translation model based on statistical information on source language sentences and target language sentences during word alignment, the translation model is generated using word alignment results that are amended based on a bilingual dictionary. Further, instead of using the source language sentence and the target language sentence (i.e., their bilingual corpora) as materials to generate the translation model, it is determined whether or not the morphemes are meaningful content words in the source and target language sentences. Based on the determination, pre-processing is performed on the source language sentence and the target language sentence.
Owner:SAMSUNG ELECTRONICS CO LTD

Auxiliary translation searching engine system and method thereof

The present invention discloses an auxiliary translation search engine system and method, and relates to multilingual translation systems and methods on the Internet. The method includes the steps of: 1. having a web robot fetch web pages and store them in a source information library; 2. building a web page index library with a web page index module; 3. finding single web pages or bilingual web page pairs in the web page index library with a web page distinguishing and pre-processing module, and pre-processing them; 4. performing sentence matching; 5. storing the results in a bilingual library; 6. building an index for the matched and stored bilingual pairs; 7. responding to a user's request by searching for nearby bilingual results and their source URLs; and 8. displaying the bilingual results and source URLs on the client. The present invention is applied to automatic translation in network search.
Owner:贺方升

Method and system for filtering bilingualism corpora

The invention discloses a method for filtering a bilingual corpus, comprising the following steps: A. determining a sentence-length ratio feature value for an English-Chinese bilingual sentence pair; B. counting the number of words of each part of speech in the English-Chinese bilingual sentence pair, calculating how many of those words match corresponding entries in a bilingual translation dictionary, and determining a translation feature value from the part-of-speech counts and the match counts; C. filtering and classifying the sentence pair by its sentence-length ratio feature value and translation feature value, using a classification model built in advance from a training set. The invention also discloses a corresponding bilingual corpus filtering system. The method and system improve the universality, accuracy and recall rate of corpus filtering.
Owner:BEIJING KINGSOFT SOFTWARE +2

Bilingual corpus resource acquisition method and bilingual corpus resource acquisition system

An embodiment of the invention discloses a bilingual corpus resource acquisition method and a bilingual corpus resource acquisition system. The bilingual corpus resource acquisition method includes the steps: acquiring a matched intermediate language common word string between a first language database and a second language database; and forming a mutually-translated text pair of a first language and a second language, wherein the mutually-translated text pair is used for forming bilingual corpus resources of the first language and the second language. The first language database comprises bilingual corpora of the first language and an intermediate language, and the second language database comprises bilingual corpora of the second language and the intermediate language. By means of applying the scheme provided by the embodiment, the bilingual corpora of the two languages are acquired by the aid of the third-party language, so that the problem of corpus resource scarcity between the languages is solved, and a high-quality translation rule can be acquired to construct a statistical machine translation system.
Owner:FUJITSU LTD
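
The pivot idea in the abstract above amounts to joining two bilingual corpora on a shared intermediate-language string. A minimal sketch follows, assuming exact string matching on the intermediate text (the abstract does not specify how matching is done); `pivot_pairs` and the sample sentences are hypothetical.

```python
from collections import defaultdict


def pivot_pairs(first_pivot, second_pivot):
    """Pair first- and second-language texts that share a pivot string.

    first_pivot:  list of (first_language_text, pivot_text)
    second_pivot: list of (second_language_text, pivot_text)
    Assumption: pivot texts match only when they are identical strings.
    """
    by_pivot = defaultdict(list)
    for text, pivot in second_pivot:
        by_pivot[pivot].append(text)
    # Emit one pair for every second-language text sharing the pivot string.
    return [
        (first, second)
        for first, pivot in first_pivot
        for second in by_pivot.get(pivot, [])
    ]


# Illustrative data: German-English and Japanese-English corpora.
german_english = [("Guten Morgen", "good morning"), ("Danke", "thank you")]
japanese_english = [("おはよう", "good morning"), ("さようなら", "goodbye")]
pairs = pivot_pairs(german_english, japanese_english)
print(pairs)
```

Exact matching is the strictest possible choice; a practical system would likely normalize case and punctuation first, at the cost of noisier pairs.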

Phrase division model establishing method, statistical machine translation method and decoder

The invention discloses a phrase division model establishing method, a statistical machine translation method and a decoder. The phrase model establishing method comprises the following steps of: acquiring a training sample from a bilingual corpus; inputting the acquired training sample to a parameter training tool of a maximum entropy model, and performing parameter training to acquire a weight parameter of the maximum entropy model; and substituting the weight parameter into the maximum entropy model to generate a phrase division model.
Owner:FUJITSU LTD

Carapace bone script explanation machine translation method based on example

Inactive | CN102693222A | Realize vernacular interpretation | Lowering the barriers to research | Special data processing applications | Sentence pair | Display device
The invention discloses an example-based machine translation method for carapace bone script explanations, comprising the steps of: (a) building a carapace bone script explanation-modern Chinese bilingual corpus; (b) completing the sentence alignment, phrase alignment and word alignment of the bilingual corpus, and building a translation example library; (c) inputting a carapace bone script explanation to be translated; (d) on the basis of the translation example library built in step (b), performing full-example or partial-example matching retrieval on the input explanation; (e) displaying the final translation result to the user via a display; and (f) evaluating the translation result, and adding bilingual sentence pairs that satisfy the paraphrasing requirement to the translation example library. By exploiting the storage and query advantages of the computer, the method lightens the burden on carapace bone script experts and lowers the threshold for carapace bone script research.
Owner:熊晶 +4

Method and system for determining intertranslation relationship of bilingual sentence pairs

The invention discloses a method and system for determining an intertranslation relationship of bilingual sentence pairs. The method comprises a step of determining matching feature values of the bilingual sentence pairs, performing filtering and classification on the bilingual sentence pairs according to the weights of the matching feature values in the intertranslation relationship according to a pre-established training classification model, and determining whether the bilingual sentence pairs are bilingual sentence pairs satisfying the requirements of the intertranslation relationship. Therefore, by adoption of the method for determining the intertranslation relationship of the bilingual sentence pairs provided by the embodiment of the invention, a bilingual corpus with a huge data size can be processed quickly and conveniently. The problem of determining the intertranslation relationship of the bilingual sentence pairs is converted into a binary classification problem by using the classification idea of the training classification model, so that the weights of the matching features of the bilingual corpus can be determined more scientifically and reasonably, and compared with the existing experience method, the universality is better, and the accuracy and the recall rate are improved accordingly.
Owner:BEIJING KINGSOFT OFFICE SOFTWARE INC +1

Method for improving bilingual corpus, device for improving bilingual corpus, machine translation method and machine translation device

According to one aspect, there is provided an apparatus for improving a bilingual corpus including a plurality of sentence pairs of a first language and a second language and word alignment information of each of the sentence pairs, the apparatus comprises: an extracting unit for extracting a split candidate from word alignment information of a given sentence pair; a calculating unit for calculating split confidence of said split candidate; a comparing unit for comparing said split confidence and a pre-set threshold; and a splitting unit for splitting said given sentence pair at said split candidate in a case that said split confidence is larger than said pre-set threshold.
Owner:KK TOSHIBA
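
The abstract above does not say how split candidates or split confidence are computed. One plausible reading, sketched below purely as an assumption, is that a split point is a candidate when no word-alignment link crosses it, and that a caller-supplied confidence function is compared with the threshold. All names here (`split_candidates`, `maybe_split`, `confidence`) are hypothetical.

```python
def split_candidates(alignment, src_len, tgt_len):
    """Return (i, j) cut points that no alignment link crosses.

    alignment: set of (source_index, target_index) links.
    A cut (i, j) is valid when every link stays on one side of it:
    source positions before i align only to target positions before j.
    """
    candidates = []
    for i in range(1, src_len):
        for j in range(1, tgt_len):
            if all((s < i) == (t < j) for s, t in alignment):
                candidates.append((i, j))
    return candidates


def maybe_split(src, tgt, alignment, confidence, threshold):
    """Split at the first candidate whose confidence exceeds the threshold."""
    for i, j in split_candidates(alignment, len(src), len(tgt)):
        if confidence(i, j) > threshold:
            return (src[:i], tgt[:j]), (src[i:], tgt[j:])
    return None  # keep the sentence pair whole


# Illustrative data: the last two words are aligned in swapped order.
src = ["a", "b", "c", "d"]
tgt = ["w", "x", "y", "z"]
links = {(0, 0), (1, 1), (2, 3), (3, 2)}
cuts = split_candidates(links, len(src), len(tgt))
# Placeholder confidence function; a real one would come from the model.
result = maybe_split(src, tgt, links, lambda i, j: 0.9 if (i, j) == (2, 2) else 0.1, 0.5)
print(cuts, result)
```

Splitting long pairs into shorter ones gives a translation model more, cleaner training units, which is the stated goal of improving the corpus.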

Method and apparatus for training bilingual word alignment model, method and apparatus for bilingual word alignment

The present invention provides method and apparatus for bilingual word alignment, method and apparatus for training bilingual word alignment model. The method for training bilingual word alignment model, comprising: training a bilingual word alignment model for a first language and a second language, using a bilingual corpus of the first and second languages; training a bilingual word alignment model for the second language and a third language, using a bilingual corpus of the second and third languages; and estimating a bilingual word alignment model for the first language and the third language, based on said bilingual word alignment model for the first and second languages and said bilingual word alignment model for the second and third languages.
Owner:TOSHIBA DIGITAL SOLUTIONS CORP
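
Estimating a first-to-third-language model from two models that share the second language can be illustrated by marginalizing over the shared language, p(c | a) = Σ_b p(b | a) · p(c | b). The abstract does not give the estimation formula, so the sketch below is an assumption; `compose` and the toy probability tables are hypothetical.

```python
def compose(first_to_second, second_to_third):
    """Estimate p(third | first) by summing over second-language words.

    Both inputs map a word to a dict of {translation: probability}.
    Assumption: p(c | a) = sum over b of p(b | a) * p(c | b).
    """
    result = {}
    for a, dist_b in first_to_second.items():
        out = {}
        for b, p_b_given_a in dist_b.items():
            for c, p_c_given_b in second_to_third.get(b, {}).items():
                out[c] = out.get(c, 0.0) + p_b_given_a * p_c_given_b
        result[a] = out
    return result


# Illustrative tables: French -> English and English -> Spanish.
fr_en = {"chat": {"cat": 1.0}}
en_es = {"cat": {"gato": 0.75, "gata": 0.25}}
fr_es = compose(fr_en, en_es)
print(fr_es)
```

This is useful when no bilingual corpus exists for the first and third languages directly but each has a corpus paired with the second language.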

Dependence mapping method and system

The invention provides a dependence mapping method. The method comprises the following steps of: firstly based on a bilingual language database of a source language and a target language, acquiring dependence syntactical information of a target language through dependence mapping, and establishing a dependence syntactical analysis model and a dependence syntactical analyzer of the current target language; and then based on a mapping dependence feature instance set and a supervision-free feature instance set, training the dependence syntactical model of the target language so as to obtain an optimal dependence syntactical analysis model, and constructing a final target dependence syntactic analyzer through the optimal dependence syntactical analysis model, wherein the mapping dependence feature instance set is extracted from dependence syntactical information of the target language after the dependence mapping, and the supervision-free feature instance set is extracted from a dependence tree obtained from the syntactical analysis of a target language database by the dependence syntactical analyzer of the current target language. The dependence mapping method can keep the mapping dependence information to the greatest extent, and can process noise information in a robust mode.
Owner:INST OF COMPUTING TECH CHINESE ACAD OF SCI

Method and apparatus for bilingual word alignment, method and apparatus for training bilingual word alignment model

The present invention provides a method and apparatus for bilingual word alignment, and a method and apparatus for training a bilingual word alignment model. The method for bilingual word alignment comprises: training a bilingual word alignment model using a word-aligned labeled bilingual corpus; word-aligning a plurality of bilingual sentence pairs in an unlabeled bilingual corpus using said bilingual word alignment model; determining whether the word alignment of each of said plurality of bilingual sentence pairs is correct, and if it is correct, adding the bilingual sentence pair into the labeled bilingual corpus and removing the bilingual sentence pair from the unlabeled bilingual corpus; retraining the bilingual word alignment model using the expanded labeled bilingual corpus; and re-word-aligning the remaining bilingual sentence pairs in the unlabeled bilingual corpus using the retrained bilingual word alignment model.
Owner:KK TOSHIBA

Statistical machine translation method and system based on dependency tree

The invention provides a statistical machine translation method based on a dependency tree. According to transformation rules extracted from a bilingual corpus, each dependency edge of the dependency tree of a source language sentence is transformed into corresponding target language phrase dependency edges, and the resulting target language phrase dependency edges are spliced to generate a target language translation. The method combines the advantages of a dependency syntax model and adopts an analysis-transformation-generation scheme that divides the translation process into three stages; the three stages can be modeled independently, which makes more accurate control of target language sentence generation possible. Transformation based on dependency edges preserves more knowledge, tolerates a higher degree of syntactic non-isomorphism, and can achieve better performance than the current mainstream translation methods based on phrase models.
Owner:INST OF COMPUTING TECH CHINESE ACAD OF SCI

A Chinese-blind automatic conversion method and system based on depth neural network

The invention relates to a Chinese-Braille automatic conversion method and system based on a deep neural network. The method includes: obtaining a Chinese-Braille bilingual corpus aligned at the sentence and word level; training a deep neural network with the Chinese-Braille bilingual corpus to obtain a word segmentation model for segmenting Chinese character strings; using the Chinese-Braille bilingual corpus to obtain a tone-marking model for marking the tones of Chinese characters; obtaining the Chinese character text to be converted; segmenting the Chinese character text according to Braille rules using the word segmentation model to obtain a plurality of characters and words; and performing tone marking on the characters and words using the tone-marking model, so that the tone-marked characters and words can be converted into Braille. The invention uses the trained model to segment Chinese character strings directly according to Braille rules. The Chinese character information can therefore be fully utilized, which avoids the loss of Chinese character information and the confusion of homophones that degrade segmentation when the Braille string is segmented instead. By using the deep neural network model and the tone-marking model, higher conversion accuracy can be obtained.
Owner:INST OF COMPUTING TECH CHINESE ACAD OF SCI

Mongolian-Chinese translation method based on transfer learning

The method is provided for solving the problems of low translation quality and poor translation effect in existing Mongolian-Chinese machine translation. Mongolian is a low-resource language, and collecting a large number of Mongolian-Chinese parallel bilingual corpora is extremely difficult; the method addresses this problem by integrating transfer learning with prior knowledge. Transfer learning is a method for solving problems in different but related fields by using existing knowledge. The method comprises the following steps: firstly, large-scale English-Chinese parallel corpora are used for training based on a neural machine translation framework; secondly, the translation model parameter weights trained on the large-scale English-Chinese parallel corpora are migrated into a Mongolian-Chinese neural machine translation framework; thirdly, rich vocabulary, syntax and other related knowledge representation information obtained through large-scale corpus training is fused into a Mongolian-Chinese neural machine translation model; and finally, a neural machine translation model is trained using the existing Mongolian-Chinese parallel corpus.
Owner:INNER MONGOLIA UNIV OF TECH

Method and device for extension of data in bilingual corpuses

The invention discloses a method and device for extending data in bilingual corpora. The method includes the following steps: the source language-pivot language corpus is searched for at least one first pivot language phrase semantically matching a first source language phrase; the source language-pivot language corpus is searched for at least one second source language phrase semantically matching each first pivot language phrase; the pivot language-target language corpus is searched for at least one first target language phrase semantically matching each first pivot language phrase; the second source language phrases in a source language phrase set are combined with the first target language phrases in a target language phrase set; and the combined phrase pairs between the source language phrases and the target language phrases are stored in the source language-target language corpus. The method extends the data in the bilingual corpora, thereby addressing the problem of data sparseness in bilingual corpora.
Owner:BEIJING BAIDU NETCOM SCI & TECH CO LTD