Cross-language entity word retrieval method, device and equipment and storage medium

CN115858733B · Active · Publication Date: 2026-06-26 · JILIN KEXUN INFORMATION TECH CO LTD +2

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JILIN KEXUN INFORMATION TECH CO LTD
Filing Date
2022-12-27
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing cross-language entity retrieval methods suffer from insufficient accuracy during the translation process. This is mainly because the entity to be retrieved in the source language needs to be translated into the target language, which leads to translation errors that affect the final entity matching results.

Method used

An end-to-end cross-language entity retrieval model is adopted, which directly takes the entity words to be retrieved in the source language and the text to be retrieved as input, and uses a neural network to annotate parallel entity words, avoiding errors in the translation process, and directly retrieving entity words that are parallel to the entity words in the source language in the target language text.

Benefits of technology

It improves the accuracy of entity word retrieval, simplifies the processing flow, avoids errors caused by translation engines, and enhances the accuracy and efficiency of retrieval results.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a cross-language entity word retrieval method and device, equipment and a storage medium. A cross-language entity word retrieval model is preconfigured. For an obtained source language entity word to be retrieved and target language text to be retrieved, the two are combined and input into the cross-language entity word retrieval model. After processing of the model, a parallel entity word annotation result in the text to be retrieved is predicted and output, that is, an entity word retrieval result is obtained. The end-to-end cross-language entity word retrieval model configured by the application has a simpler processing flow and does not need to perform two-stage processing as in the prior art. The source language entity word to be retrieved does not need to be translated into the target language, and a matching operation of the entity word is not needed. Translation errors caused by a translation engine can be avoided, and the accuracy of the entity word retrieval result is improved.

Description

Technical Field

[0001] This application relates to the field of entity word retrieval technology, and more specifically, to a cross-language entity word retrieval method, apparatus, device, and storage medium.

Background Technology

[0002] Cross-language entity retrieval refers to retrieving entity words from a target language text based on entity word information in the source language. For example, if the source language is Chinese, the entity word is the Chinese word for "Turkey", and the target language text is the English sentence "I want to take you to the romantic Turkey, and then go to Tokyo and Paris together," the cross-language entity retrieval algorithm needs to use the Chinese entity word to retrieve "Turkey" from the English target sentence.

[0003] Cross-language entity retrieval methods can be effectively used in fields such as Cross-Language Information Retrieval (CLIR), cross-language entity labeling, and translation engine-based natural language understanding. Most current cross-language entity retrieval methods consist of two stages. The first stage involves training a named entity recognition model using expert-annotated target language data, and then using this model to identify entity words in the target language text. The second stage translates the entity words from the source language into the target language and compares these target language entity words with those identified by the named entity recognition model to determine the successfully matched entity words, which are then used as the retrieval results. Because the second stage requires translating the entity words from the source language to the target language, if the translation engine does not translate the entity words correctly, it will lead to deviations in the subsequent entity matching process, significantly reducing the accuracy of the final retrieval results.

Summary of the Invention

[0004] In view of the above problems, this application is proposed to provide a cross-language entity word retrieval method, apparatus, device, and storage medium to improve the accuracy of cross-language entity word retrieval. The specific solution is as follows:

[0005] Firstly, a cross-language entity word retrieval method is provided, including:

[0006] Obtain the entity words to be retrieved in the source language and the text to be retrieved in the target language;

[0007] Input the entity word to be retrieved and the text to be retrieved into a pre-configured cross-language entity word retrieval model to obtain the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0008] The cross-language entity word retrieval model is configured to predict the internal state representation of the annotation results of entity words parallel to the entity words to be retrieved in the text to be retrieved, based on the input entity word to be retrieved and the text to be retrieved.

[0009] Secondly, a cross-language entity word retrieval apparatus is provided, including:

[0010] The data acquisition unit is used to acquire the entity words to be retrieved in the source language and the text to be retrieved in the target language.

[0011] The model prediction unit is used to input the entity word to be retrieved and the text to be retrieved into a pre-trained cross-language entity word retrieval model to obtain the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0012] The cross-language entity word retrieval model is configured to predict the internal state representation of the annotation results of entity words parallel to the entity words to be retrieved in the text to be retrieved, based on the input entity word to be retrieved and the text to be retrieved.

[0013] Thirdly, a cross-language entity word retrieval device is provided, including: a memory and a processor;

[0014] The memory is used to store programs;

[0015] The processor is used to execute the program to implement the various steps of the cross-language entity word retrieval method described above.

[0016] Fourthly, a storage medium is provided on which a computer program is stored, which, when executed by a processor, implements the various steps of the cross-language entity word retrieval method as described above.

[0017] By employing the aforementioned technical solution, this application pre-configures a cross-language entity retrieval model. This model, through training, is configured to take a combination of the entity to be retrieved and the text to be retrieved as input, and based on the input, perform end-to-end prediction of the annotation results of entity words parallel to the entity to be retrieved in the text. On this basis, for the obtained entity to be retrieved in the source language and the text to be retrieved in the target language, the combination of these two is input into the cross-language entity retrieval model, thus obtaining the annotation results of entity words parallel to the entity to be retrieved in the text output by the model, i.e., obtaining the entity retrieval results. Therefore, it is evident that this application configures an end-to-end cross-language entity retrieval model, which simplifies the processing flow and eliminates the need for two-stage processing as in existing technologies. It avoids translating the entity to be retrieved in the source language into the target language and performing entity word matching operations, thus avoiding translation errors caused by translation engines and improving the accuracy of entity retrieval results.

[0018] Furthermore, the cross-language entity word retrieval model designed in this application embodiment includes both the entity word to be retrieved and the text to be retrieved as input. This allows the model to effectively utilize the entity word information in the source language to accurately retrieve parallel entity words in the text to be retrieved in the target language, thereby further improving the accuracy of the entity word retrieval results.

Attached Figure Description

[0019] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the scope of this application. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:

[0020] Figure 1 is a flowchart of the cross-language entity word retrieval method provided in an embodiment of this application;

[0021] Figure 2 illustrates an example structure of an end-to-end cross-language entity word retrieval model;

[0022] Figure 3 illustrates an example output probability distribution of a cross-language entity word retrieval model;

[0023] Figure 4 is a schematic diagram of a cross-language entity word retrieval apparatus provided in an embodiment of this application;

[0024] Figure 5 is a schematic structural diagram of the cross-language entity word retrieval device provided in an embodiment of this application.

Detailed Implementation

[0025] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0026] This application provides a cross-language entity word retrieval scheme, which can realize cross-language entity word retrieval tasks. The cross-language entity word retrieval method can be effectively used in fields such as Cross-Language Information Retrieval (CLIR), cross-language entity word annotation, and translation engine-based natural language understanding. Specifically, cross-language information retrieval allows a query in one language to retrieve text information written in another language, utilizing information retrieval, text processing, and machine translation technologies. Cross-language entity word annotation methods are commonly used in named entity recognition tasks, typically converting Chinese or English corpora with large amounts of entity annotation information into languages with smaller corpora, thus optimizing the performance of named entity recognition networks even with limited data.

[0027] The proposed solution can be implemented based on a terminal with data processing capabilities, such as a mobile phone, computer, server, or cloud platform.

[0028] Next, with reference to Figure 1, the cross-language entity word retrieval method of this application may include the following steps:

[0029] Step S100: Obtain the entity words to be retrieved in the source language and the text to be retrieved in the target language.

[0030] Specifically, there can be various combinations of source and target languages. For example, the source language can be Chinese, and the target language can be a language other than Chinese, such as English, Japanese, German, or French. Of course, the source language can also be a language other than Chinese, which will not be elaborated on here.

[0031] The entity words to be retrieved are the entity words that need to be searched across languages. The process of obtaining the entity words to be retrieved in the source language in this step can be either by directly obtaining the entity words input by the user or specified by the user, or by obtaining the text containing the entity words to be retrieved and using the specified or automatically recognized entity words in the text as the entity words to be retrieved.

[0032] The text to be retrieved is the text containing entity words that have the same or similar meanings as the entity word to be retrieved. It can be text entered by the user or specified by the user.

[0033] Step S110: Input the entity word to be retrieved and the text to be retrieved into a pre-configured cross-language entity word retrieval model to obtain the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0034] The cross-language entity word retrieval model is configured to predict the internal state representation of the annotation results of entity words parallel to the entity words to be retrieved in the text to be retrieved, based on the input entity word to be retrieved and the text to be retrieved.

[0035] The aforementioned cross-language entity word retrieval model is an end-to-end neural network structure. It can take the entity word to be retrieved and the text to be retrieved as input to the cross-language entity word retrieval model. The model uses an end-to-end approach to achieve cross-language entity word retrieval, that is, to obtain the annotation results of entity words in the text to be retrieved that are parallel to the input entity word to be retrieved.

[0036] Specifically, in this embodiment, the cross-language entity word retrieval model can use sequence labeling to annotate parallel entity words in the text to be retrieved, for example in the B, I, O format or another labeling format. Here, B marks the beginning token of an entity word parallel to the entity word to be retrieved, I marks a non-initial (inside) token of such a parallel entity word, and O marks a token that does not belong to any entity word parallel to the entity word to be retrieved.
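The B, I, O labeling described above can be turned back into entity word spans with a short decoding routine. The following is a minimal sketch, not the application's own implementation; the function name `decode_bio`, the whitespace-joined span text, and the tolerance for an "I" tag without a preceding "B" are all assumptions made for illustration.

```python
from typing import List, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, text) spans from per-token B/I/O tags."""
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")
    spans = []
    start = None  # index where the currently open span began, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new span begins; close any span that was still open.
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "I":
            # Assumption: an "I" with no open span starts a span on its own.
            if start is None:
                start = i
        else:  # "O" closes any open span
            if start is not None:
                spans.append((start, i, " ".join(tokens[start:i])))
                start = None
    if start is not None:
        spans.append((start, len(tokens), " ".join(tokens[start:])))
    return spans


# Hypothetical example sentence and tags.
tokens = ["I", "want", "to", "visit", "New", "York"]
tags = ["O", "O", "O", "O", "B", "I"]
spans = decode_bio(tokens, tags)
```

Here `spans` holds a single span covering the last two tokens, which is the retrieved parallel entity word.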

[0037] The above expression "entity words parallel to the entity word to be retrieved" means that the same entity object is represented in different languages. Therefore, "entity words parallel to the entity word to be retrieved" means entity words in the target language of the retrieved text that have the same or similar meaning as the entity word to be retrieved in the source language.

[0038] The cross-language entity word retrieval method provided in this application pre-configures a cross-language entity word retrieval model. This model is trained and configured to take a combination of the entity word to be retrieved and the text to be retrieved as input, and perform end-to-end prediction of the annotation results of entity words parallel to the entity word to be retrieved in the text based on the input. Based on this, for the obtained entity word to be retrieved in the source language and the text to be retrieved in the target language, the combination of the two is input into the cross-language entity word retrieval model, and the annotation results of entity words parallel to the entity word to be retrieved in the text output by the model are obtained, that is, the entity word retrieval result is obtained. Therefore, this application configures an end-to-end cross-language entity word retrieval model, which simplifies the processing flow and eliminates the need for two-stage processing as in existing technologies. It does not require translating the entity word to be retrieved in the source language into the target language or performing entity word matching operations, thus avoiding translation errors caused by translation engines and improving the accuracy of entity word retrieval results.

[0039] Furthermore, the cross-language entity word retrieval model designed in this application embodiment includes both the entity word to be retrieved and the text to be retrieved as input. This allows the model to effectively utilize the entity word information in the source language to accurately retrieve parallel entity words in the text to be retrieved in the target language, thereby further improving the accuracy of the entity word retrieval results.

[0040] The cross-language entity retrieval method provided in this embodiment can be applied to cross-language information retrieval, cross-language entity annotation, and natural language understanding based on translation engines. Cross-language entity annotation will be used as an example for illustration:

[0041] For some low-resource languages, training corpora with entity annotations are scarce, so natural language processing models, such as semantic understanding models, cannot be trained on sufficient data. The cross-language entity word retrieval method described in this application can therefore be used to automatically generate a large entity-annotated training corpus for such languages. Specific methods may include:

[0042] First, a large amount of source language corpus with entity annotations is obtained (the source language can be Chinese, English, or another language for which a large amount of entity-annotated corpus is readily available). The source language corpus is then translated using a translation engine to obtain the translated corpus in the low-resource language. Further, to align entity words from the source language corpus in the translated corpus, the scheme described in this application can be adopted: entity words from the source language corpus are used as the entity words to be retrieved, and the translated corpus is used as the text to be retrieved. Both are input into the cross-language entity word retrieval model, and the model outputs the annotation results of entity words parallel to the entity words to be retrieved in the text to be retrieved. That is, the entity words parallel to the entity words in the source language corpus are located in the translated corpus. This yields a training corpus in the low-resource language with entity word annotations. Subsequently, this training corpus can be used to train a natural language task model.

[0043] In addition, the cross-language entity word retrieval method of this application can also be applied to other scenarios involving cross-language retrieval, such as cross-language paper plagiarism checking and cross-language literature retrieval. It can receive the entity words to be retrieved in the source language input by the user, as well as the text to be retrieved specified by the user, and then use the cross-language entity word retrieval method of this application to obtain the entity word annotation results that are parallel to the entity words to be retrieved in the text to be retrieved.

[0044] Furthermore, to make it easier for users to see the entity word retrieval results, this application can also mark and display the parallel entity words, using a set marking method, based on the annotation results of the entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0045] Specifically, parallel entity words can be marked and displayed within the text to be retrieved, or only the parallel entity words can be marked and displayed on their own. The marking methods include, but are not limited to, bolding, underlining, color marking, etc., as long as they can attract the user's visual attention.
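As an illustration of marking parallel entity words inside the text to be retrieved, the sketch below wraps each labeled span in a pair of delimiters. The function name `highlight` and the default `**` delimiters (a plain-text stand-in for bolding) are assumptions, not part of the application.

```python
from typing import List


def highlight(tokens: List[str], tags: List[str],
              open_mark: str = "**", close_mark: str = "**") -> str:
    """Wrap every B/I-labeled span in open_mark ... close_mark."""
    out: List[str] = []
    inside = False  # True while a highlighted span is open
    for token, tag in zip(tokens, tags):
        starts_span = tag == "B" or (tag == "I" and not inside)
        # Close the open span when it ends ("O") or a new one begins ("B").
        if inside and tag in ("O", "B"):
            out[-1] += close_mark
            inside = False
        if starts_span:
            token = open_mark + token
            inside = True
        out.append(token)
    if inside:
        out[-1] += close_mark
    return " ".join(out)


rendered = highlight(["I", "want", "to", "go", "to", "Paris"],
                     ["O", "O", "O", "O", "O", "B"])
```

Passing different delimiters (for example HTML underline tags) would give the other marking styles mentioned above.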

[0046] In some embodiments of this application, the aforementioned cross-language entity word retrieval model is described.

[0047] This application provides an end-to-end cross-language entity word retrieval model which, as shown in Figure 2, can include: an embedding layer, a feature extraction layer, and an output layer.

[0048] The input to the embedding layer includes the entity words to be retrieved and the text to be retrieved.

[0049] Specifically, the entity word to be retrieved and the text to be retrieved can be concatenated in a set manner and then input into the embedding layer. During concatenation, model identifiers can be used to delimit the entity word to be retrieved. As shown in Figure 2, the identifiers "CLS" and "SEP" are attached to the two ends of the entity word to be retrieved, and the text to be retrieved is then appended.

[0050] Taking the entity word to be retrieved as "Paris" and the text to be retrieved as "I want to go to Paris" as an example, the concatenated input to the embedding layer is "CLS Paris SEP I want to go to Paris".
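The concatenation step can be sketched as follows. This is a simplified illustration: whitespace tokenization stands in for whatever tokenizer the model actually uses, and the function name `build_model_input` and the segment ids (0 for the entity part, 1 for the text part) are assumptions.

```python
from typing import List, Tuple


def build_model_input(entity_word: str, text: str) -> Tuple[List[str], List[int]]:
    """Return (tokens, segment_ids) for the sequence 'CLS entity SEP text'."""
    entity_tokens = entity_word.split()  # assumption: whitespace tokenization
    text_tokens = text.split()
    tokens = ["CLS"] + entity_tokens + ["SEP"] + text_tokens
    # Segment 0 covers CLS, the entity word, and SEP; segment 1 covers the text.
    segment_ids = [0] * (len(entity_tokens) + 2) + [1] * len(text_tokens)
    return tokens, segment_ids


tokens, segment_ids = build_model_input("Paris", "I want to go to Paris")
```

The segment ids let a later encoding step tell the entity word apart from the text even when the same token appears in both.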

[0051] The input sentence is encoded using an embedding layer to obtain encoded features.

[0052] In this embodiment, when encoding the input sentence through the embedding layer, position encoding, token encoding, and segmentation encoding of the sentence can be performed separately. The encoding features are obtained by combining the three encodings, which can enrich the meaning of the encoding features.
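One common way to combine the three encodings is an element-wise sum, as in BERT-style models; the text above does not state how they are combined, so the sum is an assumption. The sketch below uses tiny random lookup tables in place of learned embeddings, and all names (`DIM`, `random_vector`, `embed`) are hypothetical.

```python
import random
from typing import Dict, List

DIM = 4  # toy embedding width, chosen only for illustration
rng = random.Random(0)


def random_vector() -> List[float]:
    """Stand-in for a learned embedding vector."""
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


def embed(tokens: List[str], segment_ids: List[int],
          token_table: Dict[str, List[float]],
          position_table: List[List[float]],
          segment_table: List[List[float]]) -> List[List[float]]:
    """Sum token, position, and segment vectors for every input position."""
    features = []
    for position, (token, segment) in enumerate(zip(tokens, segment_ids)):
        parts = (token_table[token], position_table[position], segment_table[segment])
        features.append([t + p + s for t, p, s in zip(*parts)])
    return features


tokens = ["CLS", "Paris", "SEP", "I", "want", "to", "go", "to", "Paris"]
segment_ids = [0, 0, 0, 1, 1, 1, 1, 1, 1]
token_table = {tok: random_vector() for tok in sorted(set(tokens))}
position_table = [random_vector() for _ in tokens]
segment_table = [random_vector(), random_vector()]
features = embed(tokens, segment_ids, token_table, position_table, segment_table)
```

Because position and segment vectors differ, the two occurrences of "Paris" receive different encoded features even though they share one token vector.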

[0053] The feature extraction layer is used to perform deep encoding on the encoded features output by the embedding layer, resulting in deep encoded features. The feature extraction layer can have various network structures. In the example of Figure 2, the feature extraction layer consists of several stacked Transformer encoders and fully connected layers.

[0054] Optionally, the embedding layer and feature extraction layer can be initialized with network parameters of a pre-trained language model trained on large-scale multilingual training corpora. This pre-trained language model includes, but is not limited to, mBERT. mBERT uses text corpora from multiple languages during pre-training and possesses cross-lingual knowledge transfer and zero-shot learning capabilities. By transferring the network parameters of mBERT, the embedding layer and feature extraction layer can retain mBERT's knowledge transfer and zero-shot learning capabilities, enabling efficient cross-lingual entity word retrieval using the same network.

[0055] The output layer is used to predict the annotation results of entity words that are parallel to the entity words to be retrieved in the text based on deep encoding features.

[0056] In this embodiment, the softmax function can be used as the output layer, and its output is the probability distribution of each token in the input sentence corresponding to the "B", "I", and "O" labels.
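The softmax output can be illustrated with plain Python. The sketch assumes the feature extraction layer has already produced three raw scores (logits) per token in the order B, I, O, and it picks the most probable label per token; greedy selection and the names `softmax` and `predict_tags` are assumptions.

```python
import math
from typing import List

LABELS = ("B", "I", "O")


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one token's label scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_tags(logits_per_token: List[List[float]]) -> List[str]:
    """Pick the most probable label for each token."""
    tags = []
    for logits in logits_per_token:
        probs = softmax(logits)
        best = max(range(len(LABELS)), key=probs.__getitem__)
        tags.append(LABELS[best])
    return tags


# Hypothetical logits for three tokens.
logits = [[0.1, 0.2, 3.0], [2.5, 0.3, 0.1], [0.2, 2.2, 0.4]]
tags = predict_tags(logits)
```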

[0057] As shown in Figure 2, for the input entity word "Paris" and the text to be retrieved "I want to go to Paris", the model's final output annotation is "OOOOOOOOOB", which means that the last token in the text to be retrieved is the retrieval result: Paris.

[0058] The cross-language entity word retrieval model provided in this embodiment can first load the network parameters of mBERT using transfer learning technology during training, and then use the cross-entropy function as the loss function for network training.
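For reference, the cross-entropy loss over B/I/O labels can be written out directly. The sketch below averages the negative log-probability of the gold label over all tokens; averaging (rather than summing) and the function name `token_cross_entropy` are assumptions.

```python
import math
from typing import List

LABELS = ("B", "I", "O")


def token_cross_entropy(probs_per_token: List[List[float]], gold_tags: List[str]) -> float:
    """Mean negative log-probability assigned to the gold label of each token."""
    if len(probs_per_token) != len(gold_tags):
        raise ValueError("one probability row is required per gold tag")
    total = 0.0
    for probs, gold in zip(probs_per_token, gold_tags):
        total -= math.log(probs[LABELS.index(gold)])
    return total / len(gold_tags)


# Hypothetical softmax outputs for two tokens, in (B, I, O) order.
probs = [[0.1, 0.1, 0.8], [0.7, 0.2, 0.1]]
gold = ["O", "B"]
loss = token_cross_entropy(probs, gold)
```

The loss shrinks toward zero as the probability assigned to each gold label approaches 1.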

[0059] Next, we will further introduce the training process of the cross-language entity word retrieval model.

[0060] The cross-language entity word retrieval model uses training entity words in the source language and training text in the target language as training samples, and uses the annotation results of entity words parallel to the training entity words in the training text as sample labels for training.

[0061] Further, optionally, in order to make the model pay more attention to the entity words to be retrieved in the source language during the processing, this embodiment constructs positive training samples and negative training samples in a certain proportion to form the overall training samples.

[0062] In the positive training samples, the target language training text contains entity words parallel to the training entity words, while the target language training text in the negative training samples does not contain entity words parallel to the training entity words. That is, the meaning represented by the entity word to be retrieved in the source language does not appear in the training text of the negative training samples.

[0063] The ratio of positive to negative training samples can be adjusted, for example, 1:3.
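A training set with a fixed positive-to-negative ratio could be assembled along the following lines. The sample layout (a dict with `entity`, `tokens`, and `tags`), the helper names, and the English placeholder data are assumptions; the only property taken from the text is that a negative sample's text contains no parallel entity word, so all of its labels are "O".

```python
import random
from typing import Dict, List


def make_negative(entity: str, tokens: List[str]) -> Dict:
    """A negative sample: the text holds no parallel entity word, so every tag is O."""
    return {"entity": entity, "tokens": tokens, "tags": ["O"] * len(tokens)}


def mix_samples(positives: List[Dict], negatives: List[Dict],
                negatives_per_positive: int = 3, seed: int = 0) -> List[Dict]:
    """Combine positives with a sampled subset of negatives at the requested ratio."""
    needed = len(positives) * negatives_per_positive
    if needed > len(negatives):
        raise ValueError("not enough negative samples for the requested ratio")
    rng = random.Random(seed)
    mixed = list(positives) + rng.sample(negatives, needed)
    rng.shuffle(mixed)
    return mixed


positives = [{"entity": "Paris",
              "tokens": ["I", "want", "to", "go", "to", "Paris"],
              "tags": ["O"] * 5 + ["B"]}]
negatives = [make_negative("Paris", sentence.split())
             for sentence in ["I like tea", "It is raining today",
                              "See you tomorrow", "Tokyo is far away"]]
training_set = mix_samples(positives, negatives)
```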

[0064] Table 1 below shows the entity word annotation results for several training samples. The training samples include some positive training samples and some negative training samples.

[0065] Table 1

[0066]

[0067] As can be seen from Table 1 above, the third training sample is a positive training sample, while the rest of the training samples are negative training samples.

[0068] In this application embodiment, several different methods for obtaining the above-mentioned positive training samples are provided.

[0069] The first acquisition method may include the following steps:

[0070] S10. Obtain the text corpus in the source language.

[0071] Specifically, open-source text corpora in the source language can be collected, such as dialogue stream data. Most sentences in dialogue stream data are relatively independent and have complete semantic information.

[0072] Optionally, this step may involve obtaining the original text corpus in the source language and then performing data cleaning on it. The data cleaning process may include: deleting sentences in the original text corpus whose length is less than a length threshold, and deleting sentences containing information from non-source languages, to obtain the cleaned text corpus in the source language.

[0073] The length threshold L can be set according to actual needs. Non-source language information can be content in languages other than the source language, as well as emoticons, etc.

[0074] S11. Determine the proper nouns and non-proper nouns in the text corpus, and use the proper nouns and non-proper nouns as training entity words.

[0075] In this embodiment, it should be noted that entity words are defined manually based on the application scenario. For example, in the sentence "I want to listen to Zhang San's songs," Zhang San should be an entity word. This means that from a large sample perspective, entity words are strongly correlated with the core semantics of a sentence. Therefore, entity words can also be considered a type of keyword in a narrow sense. Furthermore, in this embodiment, keywords are divided into two types: proper nouns and non-proper nouns. Proper nouns refer to specific keywords, such as "singer," "song title," "director," and "film / television work," while non-proper nouns refer to all other keywords besides proper nouns.

[0076] In this embodiment, proper nouns and non-proper nouns are identified from the text corpus, and training entity words are composed of proper nouns and non-proper nouns.

[0077] The methods for identifying proper nouns and non-proper nouns from text corpora can be different, such as named entity recognition technology or based on proper noun dictionaries and non-proper noun dictionaries. This embodiment does not impose any restrictions.

[0078] S12. The proper nouns in the text corpus are marked using a first marker that matches the proper nouns, and the non-proper nouns in the text corpus are marked using a second marker that matches the non-proper nouns, to obtain the marked text corpus.

[0079] In this embodiment, considering the difference between proper nouns and non-proper nouns, and in order to emphasize the importance and distinctiveness of the two types of nouns in the translation process and improve the effect of cross-language entity word translation and alignment, two different markers are designed: a first marker that matches proper nouns and a second marker that matches non-proper nouns.

[0080] The proper nouns in the text corpus are marked using a first marker, and the non-proper nouns in the text corpus are marked using a second marker, resulting in a marked text corpus.

[0081] By using tags to mark nouns in the text corpus, we can both emphasize the importance of nouns to the translation engine and achieve cross-language word alignment through tags, that is, align entity words with the same meaning before and after translation through tags.

[0082] Furthermore, by using different markers to distinguish proper nouns and non-proper nouns in the text corpus, it is possible to adapt to the linguistic characteristics of proper nouns and non-proper nouns and achieve more accurate translation results.

[0083] In this embodiment, specific forms of the first and second markers are proposed: the first marker is a pair of quotation marks, written "*", and the second marker is #[*]#. Here, * stands for the keyword being marked (a proper noun for the first marker, a non-proper noun for the second).

[0084] The first marker "*" emphasizes the importance of the marked proper noun and lets the translation engine take the context into account during translation, thereby enabling cross-language phrase alignment.

[0085] The second marker #[*]# enables cross-language phrase alignment during translation, while ensuring that the translation engine takes the marked non-proper noun into account.

[0086] Examples are given below:

[0087] The source language text corpus is "The furthest distance in the world is not love, nor hate, but familiar people gradually becoming strange."

[0088] The entity word contained in this text corpus is the non-proper noun "strange". Therefore, the second marker #[*]# is used to mark it, resulting in the marked text corpus "The furthest distance in the world is not love, nor hate, but familiar people gradually becoming #[strange]#".

[0089] As another example, the source language text corpus is "Taking the Train to Lhasa is a good song".

[0090] The entity word contained in this text corpus is the proper noun "Taking the Train to Lhasa", so the first marker "*" is used to mark it, and the marked text corpus is obtained: ""Taking the Train to Lhasa" is a good song".
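The marking step S12 can be sketched as a small helper that wraps an entity word in the marker matching its type. The sketch assumes the first marker is a pair of plain double quotation marks, as in the example above, and that only the first occurrence of the entity word is marked; the function name `mark_entity` is hypothetical.

```python
def mark_entity(sentence: str, entity: str, is_proper_noun: bool) -> str:
    """Wrap the first occurrence of entity in the marker that matches its type."""
    if entity not in sentence:
        raise ValueError("entity word does not occur in the sentence")
    if is_proper_noun:
        marked = f'"{entity}"'    # first marker (assumed to be plain double quotes)
    else:
        marked = f"#[{entity}]#"  # second marker
    return sentence.replace(entity, marked, 1)


proper = mark_entity("Taking the Train to Lhasa is a good song",
                     "Taking the Train to Lhasa", is_proper_noun=True)
common = mark_entity("familiar people gradually becoming strange",
                     "strange", is_proper_noun=False)
```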

[0091] S13. Use a translation engine to translate the marked text corpus into the target language, and obtain the text corpus in the target language as the training text in the target language.

[0092] Specifically, input the above-marked text corpus into the translation engine to translate it into the target language, and the text corpus in the target language can be obtained as the training text in the target language.

[0093] Various types of translation engines can be used, and the embodiments of this application do not strictly limit this.

[0094] Moreover, through a large number of experimental verifications, after using the first and second markers introduced in this embodiment to mark proper nouns and non-proper nouns in the text corpus, the translation accuracy of the text corpus in the target language obtained through translation by the translation engine has been greatly improved. And through the first and second markers, the alignment of cross-language phrases can be well achieved, effectively avoiding the problem that cross-language phrases cannot be aligned due to reasons such as word order changes and uncertain entity spans existing in the prior art.

[0095] Taking the marked text corpus above, "The furthest distance in the world is not love, nor hate, but familiar people gradually becoming #[strange]#", as an example, the result after translation into English by the translation engine is as follows:

[0096] The furthest distance in the world is not love, not hate, but the familiar person, gradually becoming #[strange]#.

[0097] It can be seen that the source-language non-proper noun is translated as "strange". Because the translated text corpus in the target language also carries the marker, cross-language phrase alignment can be achieved through the marker.
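Once the translated text carries the marker, the sample labels for a positive training sample can be read off directly. The sketch below handles only the second marker #[*]#, uses whitespace tokenization, and emits B/I/O tags; these simplifications and the name `labels_from_marked_translation` are assumptions.

```python
import re
from typing import List, Tuple

MARKED = re.compile(r"#\[(.+?)\]#")  # matches the second marker #[...]#


def labels_from_marked_translation(translated: str) -> Tuple[List[str], List[str]]:
    """Strip #[...]# markers and emit whitespace tokens with B/I/O tags."""
    tokens: List[str] = []
    tags: List[str] = []
    position = 0
    for match in MARKED.finditer(translated):
        # Text before the marker is outside any entity word.
        for token in translated[position:match.start()].split():
            tokens.append(token)
            tags.append("O")
        # Text inside the marker is the aligned entity word.
        for k, token in enumerate(match.group(1).split()):
            tokens.append(token)
            tags.append("B" if k == 0 else "I")
        position = match.end()
    for token in translated[position:].split():
        tokens.append(token)
        tags.append("O")
    return tokens, tags


tokens, tags = labels_from_marked_translation(
    "familiar people gradually becoming #[strange]#.")
```

The marker itself is removed from the token list, so the resulting text and tags can serve as the training text and sample label.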

[0098] The method provided in this embodiment determines proper nouns and non-proper nouns in the text corpus of the source language to form training entity words, further uses different markers to mark proper nouns and non-proper nouns respectively, translates the marked text corpus into the target language through a translation engine, obtains the text corpus in the target language as the training text in the target language, and forms a positive example training sample from the training entity words in the source language and the training text in the target language.

[0099] In this embodiment, considering the distinction between proper nouns and non-proper nouns, proper nouns and non-proper nouns are distinguished by the first and second markers, which not only emphasizes the importance and distinctiveness of the two types of nouns in the translation process and improves the accuracy of cross-language entity word translation, but also effectively achieves the alignment of cross-language phrases through the first and second markers, thus avoiding the problem of cross-language phrases not being aligned due to word order changes, uncertain entity span, etc. in the prior art.

[0100] This application also provides another method for obtaining positive training samples, as follows:

[0101] The second method of obtaining positive training samples may include the following steps:

[0102] S20. Obtain the text corpus in the source language.

[0103] Step S20 is the same as step S10 in the previous text. Please refer to the previous text for details. It will not be repeated here.

[0104] S21. Determine the entity words in the text corpus as training entity words.

[0105] There are several ways to determine entity words in the text corpus in this step. For example, you can use entity word dictionary matching, or use a pre-trained entity word extraction model to extract entity words from the text corpus, or use a named entity recognition model to perform named entity recognition on the text corpus to obtain entity words, and so on.
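Of the options listed above, entity word dictionary matching is the simplest to illustrate. The sketch below is one possible reading, assuming a plain longest-match scan; the function name `match_entities` and the dictionary contents are hypothetical, and English strings stand in for the source-language text.

```python
def match_entities(text, entity_dictionary):
    """Return (position, entry) pairs for dictionary entries found in text."""
    # Longest entries first, so a long entry wins over an entry it contains.
    entries = sorted((e for e in entity_dictionary if e), key=len, reverse=True)
    found = []
    position = 0
    while position < len(text):
        for entry in entries:
            if text.startswith(entry, position):
                found.append((position, entry))
                position += len(entry)
                break
        else:
            # No entry starts here; move one character forward.
            position += 1
    return found


# Hypothetical dictionary contents, for illustration only.
entity_dictionary = {"Lhasa", "Going to Lhasa by Train"}
matches = match_entities("Going to Lhasa by Train is a good song.", entity_dictionary)
print(matches)
```

A pre-trained extraction model or a named entity recognition model would replace `match_entities` without changing the rest of the flow.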

[0106] S22. Use a translation engine to translate the text corpus into the target language to obtain the text corpus in the target language, which is then used as the training text for the target language.

[0107] Specifically, the translation engine can employ various different types of translation engines. By inputting the source language text corpus into the translation engine, the target language text corpus can be obtained after translation. Because the translation process is performed on sentence-level text corpus, it avoids the problem of inaccurate translation due to lack of context information that occurs when translating individual word segments.

[0108] S23. Positive training samples are composed of training entity words in the source language and training text in the target language.

[0109] The method for obtaining positive training samples provided in this embodiment involves identifying entity words in the source language text corpus as training entity words, and then using a translation engine to translate the source language text corpus to obtain the target language text corpus, which serves as the target language training text. The positive training samples consist of the training entity words and the training text. The entire process can be automated, requiring no manual annotation or translation.

[0110] Furthermore, in order to improve the diversity of training samples, the training entity words obtained in the above embodiments can also be expanded with synonyms, that is, synonyms of training entity words are obtained, and positive training samples are composed of the synonyms and training texts in the target language.

[0111] The synonym expansion process can utilize a synonym tool to obtain the synonyms of the training entity words, as well as the matching degree of each synonym. Furthermore, synonyms that meet the matching degree requirements can be selected for retention, such as retaining synonyms whose matching degree exceeds a threshold.

[0112] Taking the training entity word "strange" in the text corpus of the source language "The farthest distance in the world is not love, not hate, but familiar people gradually becoming strangers" as an example, the synonyms obtained through the synonym tool can include "unfamiliar".

[0113] Then, "unfamiliar" can be used as an extended training entity word, and together with the training text in the target language "The furthest distance in the world is not love, not hate, but the familiar person, gradually becoming strange", it can form a positive training sample.

[0114] In this embodiment, by expanding the synonyms of the training entity words, positive training samples can be composed of synonyms and training texts in the target language, thereby improving the diversity of training samples.
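The synonym expansion described above can be sketched as a threshold filter. The matching degrees and the threshold below are invented for illustration, and `expand_with_synonyms` is a hypothetical helper; the application does not name a particular synonym tool.

```python
def expand_with_synonyms(entity_word, training_text, synonym_scores, threshold):
    """Pair the entity word and each retained synonym with the same training text."""
    samples = [(entity_word, training_text)]
    for synonym, matching_degree in synonym_scores.items():
        # Keep only synonyms whose matching degree exceeds the threshold.
        if matching_degree > threshold:
            samples.append((synonym, training_text))
    return samples


training_text = ("The furthest distance in the world is not love, not hate, "
                 "but the familiar person, gradually becoming strange")
# Hypothetical output of a synonym tool: synonym -> matching degree.
synonym_scores = {"unfamiliar": 0.82, "odd": 0.31}
positive_samples = expand_with_synonyms("strange", training_text, synonym_scores, 0.5)
for word, _ in positive_samples:
    print(word)
```

Each retained synonym reuses the same target-language training text and sample label, so the expansion adds diversity on the entity word side only.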

[0115] Some embodiments of this application describe an optional implementation of step S21 above, in which entity words in the text corpus are determined as training entity words.

[0116] In this embodiment, a word segmentation dictionary can be used to segment the text corpus to obtain the segmentation results.

[0117] The word segmentation dictionary contains several pre-collected proper nouns.

[0118] In this embodiment, open-source proper nouns can be obtained to form a proper noun dictionary D1. The proper noun dictionary D1 is then used to supplement the existing word segmentation dictionary, and the supplemented word segmentation dictionary is used to segment the text corpus to obtain the word segmentation results.

[0119] Taking the text "The furthest distance in the world is not love, not hate, but familiar people gradually becoming strangers" as an example, after word segmentation, the word segmentation result is "The furthest distance in the world is not love, not hate, but familiar people gradually becoming strangers".

[0120] After obtaining the word segmentation results of the text corpus, keywords can be further extracted from the word segmentation results to obtain a keyword set, which can be used as training entity words.

[0121] Specifically, a keyword extraction algorithm based on a pre-trained model can be used to extract keyword information from the word segmentation results. During extraction, each keyword and its confidence score can be obtained, and keywords with confidence scores exceeding a threshold can be selected to form a keyword set.

[0122] Taking the above word segmentation results as an example, the extracted keyword information can include "[distance, 0.6037], [farthest, 0.5953], [unfamiliar, 0.465], [love, 0.4386], [hate, 0.3985]". The values in [] are the confidence scores of the corresponding keywords. Keywords with confidence scores exceeding the threshold can be filtered to form a keyword set. For example, when the threshold is set to 0.45, the keyword set contains the keywords: "distance, farthest, unfamiliar".
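The threshold filtering in this example can be reproduced with a few lines. The scores are the ones quoted above; the function name `select_keywords` is a hypothetical helper, and the keyword extraction algorithm that produces the scores is not shown.

```python
def select_keywords(scored_keywords, threshold):
    """Keep keywords whose confidence score exceeds the threshold."""
    return [keyword for keyword, score in scored_keywords if score > threshold]


scored_keywords = [
    ("distance", 0.6037),
    ("farthest", 0.5953),
    ("unfamiliar", 0.465),
    ("love", 0.4386),
    ("hate", 0.3985),
]
keyword_set = select_keywords(scored_keywords, threshold=0.45)
print(keyword_set)  # ['distance', 'farthest', 'unfamiliar']
```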

[0123] Building upon the aforementioned method for determining entity words in the text corpus, this embodiment further describes an optional implementation of step S22, in which a translation engine translates the text corpus into the target language to obtain the target-language text corpus as the training text in the target language. Specifically, this may include the following steps:

[0124] S30. Determine the type of each keyword in the keyword set, wherein the type includes proper nouns and non-proper nouns.

[0125] Optionally, the keyword type can be determined based on the proper noun dictionary D1 created in the aforementioned scheme. Specifically, it can be determined whether each keyword in the keyword set is in the proper noun dictionary D1. If it is, the keyword belongs to the proper noun type; otherwise, the keyword belongs to the non-proper noun type.
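The type decision in step S30 reduces to a dictionary membership test. A minimal sketch, assuming D1 is held as a set; the contents of `proper_noun_dictionary` and the type names are hypothetical.

```python
PROPER = "proper"
NON_PROPER = "non_proper"


def classify_keywords(keyword_set, proper_noun_dictionary):
    """Map each keyword to its type by checking membership in dictionary D1."""
    return {
        keyword: PROPER if keyword in proper_noun_dictionary else NON_PROPER
        for keyword in keyword_set
    }


# Hypothetical contents of the proper noun dictionary D1.
proper_noun_dictionary = {"Going to Lhasa by Train"}
keyword_types = classify_keywords(
    ["Going to Lhasa by Train", "strange"], proper_noun_dictionary
)
print(keyword_types)
```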

[0126] S31. The first marker matching proper nouns is used to mark the keywords in the text corpus that belong to the proper noun type, and the second marker matching non-proper nouns is used to mark the keywords in the text corpus that belong to the non-proper noun type, to obtain the marked text corpus.

[0127] S32. The tagged text corpus is translated into the target language using a translation engine to obtain the text corpus in the target language, which is used as the training text for the target language.

[0128] In this embodiment, steps S31-S32 correspond one-to-one with steps S12-S13 in the previous embodiment. Please refer to the previous description for details, which will not be repeated here.
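Step S31 can be sketched as a single-pass substitution. This is an illustration under stated assumptions: the first marker is taken to be ASCII double quotes and the second marker `#[ ]#`, the helper `mark_corpus` is hypothetical, and the translation engine call of step S32 is left out because the application does not fix a particular engine.

```python
import re


def mark_corpus(text, keyword_types):
    """Wrap each keyword occurrence with the marker that matches its type."""
    # Longest keywords first, so the alternation prefers the longest match
    # at each position and a short keyword cannot split a longer one.
    keywords = sorted(keyword_types, key=len, reverse=True)
    if not keywords:
        return text
    pattern = re.compile("|".join(re.escape(keyword) for keyword in keywords))

    def wrap(match):
        keyword = match.group(0)
        if keyword_types[keyword] == "proper":
            return '"' + keyword + '"'      # first marker (assumed shape)
        return "#[" + keyword + "]#"        # second marker

    return pattern.sub(wrap, text)


keyword_types = {
    "Going to Lhasa by Train": "proper",
    "Lhasa": "proper",
    "strange": "non_proper",
}
marked = mark_corpus("Going to Lhasa by Train is a strange song.", keyword_types)
print(marked)  # "Going to Lhasa by Train" is a #[strange]# song.
```

The marked text would then be passed to the translation engine, as in step S32.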

[0129] In this embodiment, considering the distinction between proper nouns and non-proper nouns, proper nouns and non-proper nouns are distinguished by the first and second markers, which not only emphasizes the importance and distinctiveness of the two types of nouns in the translation process and improves the accuracy of cross-language entity word translation, but also effectively achieves the alignment of cross-language phrases through the first and second markers, thus avoiding the problem of cross-language phrases not being aligned due to word order changes, uncertain entity span, etc. in the prior art.

[0130] The training samples obtained in the above embodiments of this application distinguish between proper nouns and non-proper nouns, improving the accuracy of cross-linguistic entity word translation and the alignment of cross-linguistic phrases, thus ensuring the accuracy of the training samples. Based on this, training the cross-linguistic entity word retrieval model using these training samples and sample labels can improve the model's generalization ability and robustness.

[0131] To verify the performance of the cross-language entity retrieval model trained in the embodiments of this application, the model was tested on the validation set, and the final test results are shown in Table 2 below:

[0132] Table 2

[0133]
Data            Sentence accuracy   Negative sample accuracy
Validation set  0.930               0.965

[0134] Sentence accuracy refers to the probability of correctly predicting the entity word labels for all sample sentences in the validation set, while negative sample accuracy refers to the probability of correctly predicting the entity word labels for negative sample sentences in the validation set. As shown in Table 2 above, both sentence accuracy and negative sample accuracy achieve high values, indicating the excellent performance of the cross-lingual entity word retrieval model trained in this application.
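The two metrics can be computed as below. The sketch assumes that a sentence counts as correct only when every predicted label matches the gold label, and that a negative sample is one whose gold labels are all "O"; the toy labels are invented and unrelated to the figures in Table 2.

```python
def sentence_accuracy(gold, predicted):
    """Fraction of sentences whose whole label sequence is predicted correctly."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def negative_sample_accuracy(gold, predicted):
    """Sentence accuracy restricted to samples whose gold labels are all 'O'."""
    negatives = [(g, p) for g, p in zip(gold, predicted)
                 if all(label == "O" for label in g)]
    if not negatives:
        raise ValueError("no negative samples in the evaluation set")
    correct = sum(1 for g, p in negatives if g == p)
    return correct / len(negatives)


# Invented toy labels, for illustration only.
gold = [["O", "B", "I", "O"], ["O", "O", "O"], ["B", "O"], ["O", "O"]]
predicted = [["O", "B", "I", "O"], ["O", "O", "O"], ["O", "O"], ["O", "B"]]
print(sentence_accuracy(gold, predicted))         # 0.5
print(negative_sample_accuracy(gold, predicted))  # 0.5
```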

[0135] Furthermore, to analyze the working principle of the model, the output probability distribution of the cross-language entity word retrieval model is visualized in the embodiments of this application, as shown in the example in Figure 3.

[0136] In Figure 3, the horizontal axis represents each word segment in the input text to be retrieved, and the vertical axis uses a logarithmic scale and represents the predicted probability value.

[0137] The three bars corresponding to each word segment represent, from left to right, the probabilities of the model predicting the three labeling results as "B", "I", and "O" at that word segmentation position. "B", "I", and "O" are sequence labeling methods, the meaning of which has been introduced earlier: "B" and "I" indicate that the prediction is an entity word, and "O" indicates that the prediction is a non-entity word.

[0138] The search term corresponding to Figure 3 is "treatment", and the search text is "Ah,then hur##ry up and go to treatment".

[0139] As can be seen from Figure 3, for the entity word "treatment" in the text to be retrieved that is parallel to the search term "treatment", the cross-language entity word retrieval model of this application predicts the label "B" with a high probability score, that is, it can accurately predict the entity word in the text to be retrieved that is parallel to the search term.

[0140] For other word segments of non-parallel entity words in the text to be retrieved, the probability score of predicting them as "O" is also very high, indicating that non-parallel entity words in the text to be retrieved can be accurately identified.

[0141] Furthermore, the probability score difference between "B" and "O" predicted for each word segment is large, meaning the confidence level of the prediction results is very high.

[0142] Figure 3 above further demonstrates the accuracy of the cross-language entity word retrieval results produced by the cross-language entity word retrieval model trained in this application.
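How a prediction is read off such a probability distribution can be sketched as follows. The probability values are invented and are not taken from Figure 3; the function name `decode` and the use of the B-versus-O gap as a confidence margin are assumptions made for illustration.

```python
LABELS = ("B", "I", "O")


def decode(tokens, probabilities):
    """Pick the most probable label per token and report the B-versus-O gap."""
    results = []
    for token, probs in zip(tokens, probabilities):
        best = max(range(len(LABELS)), key=lambda index: probs[index])
        margin = abs(probs[0] - probs[2])  # gap between "B" and "O"
        results.append((token, LABELS[best], margin))
    return results


# Invented probabilities in the order (B, I, O), for illustration only.
tokens = ["go", "to", "treatment"]
probabilities = [
    [0.01, 0.01, 0.98],
    [0.02, 0.01, 0.97],
    [0.95, 0.02, 0.03],
]
decoded = decode(tokens, probabilities)
entity_tokens = [token for token, label, _ in decoded if label in ("B", "I")]
print(entity_tokens)  # ['treatment']
```

A large margin at every position corresponds to the high-confidence behaviour described for Figure 3.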

[0143] For the cross-language entity retrieval model proposed in this application, a large number of training samples are constructed from open-source data in the source language, effectively overcoming the problems of insufficient data and high manual annotation costs. Furthermore, during the construction of training samples, entity words are divided into proper nouns and non-proper nouns. This emphasizes the importance and differentiation of the two types of nouns in the translation process, improving the accuracy of cross-language entity word translation. Additionally, the first and second markers effectively achieve the alignment of cross-language phrases, avoiding the problem of misalignment of cross-language phrases caused by word order changes and uncertain entity spans in existing technologies.

[0144] Furthermore, the cross-language entity retrieval model provided in this application adopts an end-to-end structure, resulting in faster retrieval speed and higher retrieval accuracy. Training with the aforementioned training samples improves the model's robustness and generalization ability. Moreover, a negative example strategy is employed during model training, allowing the model to focus more on the source-language entity word to be retrieved and achieve better retrieval results. In addition, during model training, transfer learning is used to load the network parameters of mBERT, preserving the knowledge transfer capability and zero-shot learning capability of the pre-trained language model mBERT, so that the same cross-language entity retrieval model can effectively perform cross-language entity retrieval in multiple languages.

[0145] The cross-language entity word retrieval device provided in the embodiments of this application is described below. The cross-language entity word retrieval device described below can be referred to in correspondence with the cross-language entity word retrieval method described above.

[0146] Refer to Figure 4, which is a schematic diagram of the structure of a cross-language entity word retrieval device disclosed in an embodiment of this application.

[0147] As shown in Figure 4, the device may include:

[0148] The data acquisition unit 11 is used to acquire the entity words to be retrieved in the source language and the text to be retrieved in the target language;

[0149] The model prediction unit 12 is used to input the entity word to be retrieved and the text to be retrieved into a pre-trained cross-language entity word retrieval model to obtain the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0150] The cross-language entity word retrieval model is configured as an internal state representation for predicting, based on the input entity word to be retrieved and the input text to be retrieved, the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0151] The cross-language entity word retrieval model provided in this application embodiment may include an embedding layer, a feature extraction layer, and an output layer. Based on this, the process by which the above-mentioned model prediction unit inputs the entity word to be retrieved and the text to be retrieved into a pre-trained cross-language entity word retrieval model to obtain the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved, as output by the model, includes:

[0152] The entity words to be retrieved and the text to be retrieved are input into the embedding layer to obtain the encoded features of the input sentence;

[0153] The coded features are deep-encoded using the feature extraction layer to obtain deep-encoded features;

[0154] The output layer uses the deep encoding features to predict the annotation results of entity words in the text to be retrieved that are parallel to the entity words to be retrieved.
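The three-stage flow above (embedding layer, feature extraction layer, output layer) can be illustrated with a toy forward pass. This is only a structural sketch with random, untrained weights in pure Python: the class name `ToyRetrievalModel`, the `[CLS]`/`[SEP]` input packing, the mean-mixing feature step and the token `treatment_src` (a stand-in for a source-language entity word) are all assumptions and do not describe the actual network of the application.

```python
import math
import random

LABELS = ("B", "I", "O")


class ToyRetrievalModel:
    def __init__(self, vocabulary, hidden_size=8, seed=0):
        rng = random.Random(seed)
        self.vocabulary = {token: index for index, token in enumerate(vocabulary)}

        def matrix(rows, cols):
            return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

        self.embedding = matrix(len(vocabulary), hidden_size)
        self.feature_weights = matrix(hidden_size, hidden_size)
        self.output_weights = matrix(hidden_size, len(LABELS))

    def embed(self, tokens):
        """Embedding layer: look up one vector per input token."""
        return [self.embedding[self.vocabulary[token]] for token in tokens]

    def extract_features(self, encoded):
        """Feature extraction layer: mix each position with the sequence mean."""
        size = len(encoded[0])
        mean = [sum(vec[j] for vec in encoded) / len(encoded) for j in range(size)]
        features = []
        for vec in encoded:
            mixed = [vec[j] + mean[j] for j in range(size)]
            features.append([
                math.tanh(sum(mixed[k] * self.feature_weights[k][j] for k in range(size)))
                for j in range(size)
            ])
        return features

    def predict(self, features):
        """Output layer: softmax over the B/I/O labels at every position."""
        rows = []
        for vec in features:
            logits = [sum(vec[k] * self.output_weights[k][j] for k in range(len(vec)))
                      for j in range(len(LABELS))]
            peak = max(logits)
            exps = [math.exp(value - peak) for value in logits]
            total = sum(exps)
            rows.append([value / total for value in exps])
        return rows

    def forward(self, entity_tokens, text_tokens):
        # Assumed packing: entity word and text joined into one input sequence.
        tokens = ["[CLS]"] + entity_tokens + ["[SEP]"] + text_tokens + ["[SEP]"]
        probs = self.predict(self.extract_features(self.embed(tokens)))
        offset = len(entity_tokens) + 2
        # Return only the rows that belong to the text to be retrieved.
        return probs[offset:offset + len(text_tokens)]


vocabulary = ["[CLS]", "[SEP]", "treatment_src", "go", "to", "treatment"]
entity_tokens = ["treatment_src"]
text_tokens = ["go", "to", "treatment"]
model = ToyRetrievalModel(vocabulary)
label_probabilities = model.forward(entity_tokens, text_tokens)
print(len(label_probabilities), "rows of (B, I, O) probabilities")
```

In a real system the first two stages would be a multilingual pre-trained encoder rather than random matrices; the sketch only shows how the entity word and the text share one input and how one label distribution is produced per text token.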

[0155] Optionally, the embedding layer and feature extraction layer of the aforementioned cross-lingual entity word retrieval model can be initialized using the network parameters of a multilingual pre-trained language model. The multilingual pre-trained language model can be mBERT or another model.

[0156] Optionally, the above cross-language entity word retrieval model uses training entity words in the source language and training text in the target language as training samples, and uses the entity word annotation results that are parallel to the training entity words in the training text as sample labels for training.

[0157] The training samples include positive training samples and negative training samples. The training text of the target language in the positive training samples contains entity words that are parallel to the training entity words, while the training text of the target language in the negative training samples does not contain entity words that are parallel to the training entity words.
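The difference between positive and negative training samples shows up directly in the sample labels. A minimal sketch, assuming the parallel entity word in the target language is already known (for example from the markers) and that a negative sample pairs the entity word with a text that lacks it; the helper `bio_labels` is hypothetical.

```python
def bio_labels(text_tokens, parallel_entity_tokens):
    """Label the first occurrence of the parallel entity with B/I, the rest with O."""
    labels = ["O"] * len(text_tokens)
    width = len(parallel_entity_tokens)
    if width == 0:
        return labels
    for start in range(len(text_tokens) - width + 1):
        if text_tokens[start:start + width] == parallel_entity_tokens:
            labels[start] = "B"
            for offset in range(1, width):
                labels[start + offset] = "I"
            break
    return labels


# Positive sample: the training text contains the parallel entity word.
positive_labels = bio_labels(["gradually", "becoming", "strange"], ["strange"])
# Negative sample: the training text does not contain it, so every label is "O".
negative_labels = bio_labels(["is", "a", "good", "song"], ["strange"])
print(positive_labels)  # ['O', 'O', 'B']
print(negative_labels)  # ['O', 'O', 'O', 'O']
```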

[0158] Optionally, the apparatus of this application may further include: a first positive example training sample acquisition unit, the process of which acquiring positive example training samples may include:

[0159] Obtain text corpus in the source language;

[0160] The proper nouns and non-proper nouns in the text corpus are identified, and the proper nouns and non-proper nouns are used as training entity words;

[0161] The proper nouns in the text corpus are marked using a first marker that matches proper nouns, and the non-proper nouns in the text corpus are marked using a second marker that matches non-proper nouns, to obtain the marked text corpus;

[0162] The tagged text corpus is translated into the target language using a translation engine to obtain the target language text corpus, which is used as the training text for the target language.

[0163] Optionally, the apparatus of this application may further include: a second positive example training sample acquisition unit, the process of which acquiring positive example training samples may include:

[0164] Obtain text corpus in the source language;

[0165] Identify the entity words in the text corpus as training entity words;

[0166] The text corpus is translated into the target language using a translation engine to obtain the target language text corpus, which is used as the training text for the target language.

[0167] The positive training samples consist of training entity words in the source language and training text in the target language.

[0168] Optionally, the second positive example training sample acquisition unit described above can also be used for:

[0169] Obtain the synonyms of the training entity words, and form positive training samples from the synonyms of the training entity words and the training text of the target language.

[0170] Optionally, the process of the second positive example training sample acquisition unit determining entity words in the text corpus as training entity words may include:

[0171] The text corpus is segmented using a word segmentation dictionary to obtain segmentation results, wherein the word segmentation dictionary contains pre-collected proper nouns;

[0172] Keywords are extracted from the word segmentation results to obtain a keyword set, which is then used as training entity words.

[0173] Further optionally, the process by which the second positive example training sample acquisition unit uses a translation engine to translate the text corpus into the target language to obtain text corpus in the target language, and uses it as training text in the target language, may include:

[0174] Determine the type of each keyword in the keyword set, whereby the type includes proper nouns and non-proper nouns;

[0175] Keywords belonging to the proper noun type in the text corpus are marked using a first marker that matches proper nouns, and keywords belonging to the non-proper noun type in the text corpus are marked using a second marker that matches non-proper nouns, to obtain a marked text corpus;

[0176] The tagged text corpus is translated into the target language using a translation engine to obtain the target language text corpus, which is used as the training text for the target language.

[0177] The first marker may include "*", where "*" represents the keyword to be marked;

[0178] The second marker may include: #[*]#.

[0179] Optionally, the process by which the first and second positive example training sample acquisition units acquire the text corpus of the source language may include:

[0180] Obtain the original text corpus in the source language;

[0181] Sentences in the original text corpus that are shorter than the length threshold are deleted, as are sentences containing information from non-source languages, resulting in the cleaned text corpus in the source language.
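The cleaning step can be sketched as a two-condition filter. The sketch assumes, for illustration only, that the source language is Chinese and that any Latin letter counts as non-source-language information; the threshold value and the helper `clean_corpus` are hypothetical.

```python
import re

# Assumption: Chinese source text, so Latin letters indicate another language.
NON_SOURCE_PATTERN = re.compile(r"[A-Za-z]")


def clean_corpus(sentences, length_threshold):
    """Drop sentences that are too short or contain non-source-language text."""
    cleaned = []
    for sentence in sentences:
        sentence = sentence.strip()
        if len(sentence) < length_threshold:
            continue  # shorter than the length threshold
        if NON_SOURCE_PATTERN.search(sentence):
            continue  # contains non-source-language information
        cleaned.append(sentence)
    return cleaned


raw_sentences = ["你好", "世界上最远的距离不是爱", "这是一首好歌 hello"]
cleaned_sentences = clean_corpus(raw_sentences, length_threshold=5)
print(cleaned_sentences)
```

A production pipeline would more likely use a language identification tool than a character class, but the filtering logic stays the same.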

[0182] The cross-language entity word retrieval device provided in this application embodiment can be applied to cross-language entity word retrieval equipment, such as terminals (mobile phones, computers, etc.). Optionally, Figure 5 shows a hardware structure block diagram of the cross-language entity word retrieval equipment. Referring to Figure 5, the hardware structure of the cross-language entity word retrieval equipment may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;

[0183] In this embodiment of the application, the number of processor 1, communication interface 2, memory 3, and communication bus 4 is at least one, and processor 1, communication interface 2, and memory 3 communicate with each other through communication bus 4;

[0184] Processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention.

[0185] Memory 3 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk storage device;

[0186] The memory stores a program, which the processor can call. The program is used for:

[0187] Obtain the entity words to be retrieved in the source language and the text to be retrieved in the target language;

[0188] Input the entity word to be retrieved and the text to be retrieved into a pre-configured cross-language entity word retrieval model to obtain the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0189] The cross-language entity word retrieval model is configured as an internal state representation for predicting, based on the input entity word to be retrieved and the input text to be retrieved, the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0190] Optionally, the refined and extended functions of the program can be found in the description above.

[0191] This application embodiment also provides a storage medium that can store a program suitable for execution by a processor, the program being used for:

[0192] Obtain the entity words to be retrieved in the source language and the text to be retrieved in the target language;

[0193] Input the entity word to be retrieved and the text to be retrieved into a pre-configured cross-language entity word retrieval model to obtain the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0194] The cross-language entity word retrieval model is configured as an internal state representation for predicting, based on the input entity word to be retrieved and the input text to be retrieved, the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

[0195] Optionally, the refined and extended functions of the program can be found in the description above.

[0196] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0197] The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments. The various embodiments can be combined as needed, and the same or similar parts can be referred to each other.

[0198] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A cross-language entity word retrieval method, characterized in that it includes: obtaining the entity word to be retrieved in the source language and the text to be retrieved in the target language; inputting the entity word to be retrieved and the text to be retrieved into a pre-configured cross-language entity word retrieval model, wherein the cross-language entity word retrieval model performs cross-language entity word retrieval in an end-to-end manner to obtain the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved. The cross-language entity word retrieval model is configured to predict the internal state representation of the annotation results of entity words parallel to the entity words to be retrieved in the text to be retrieved based on the input entity word to be retrieved and the text to be retrieved. The cross-language entity word retrieval model uses training entity words in the source language and training text in the target language as training samples, and uses the annotation results of entity words parallel to the training entity words in the training text as sample labels for training. The training samples include positive training samples and negative training samples. The training text of the target language in the positive training samples contains entity words that are parallel to the training entity words, while the training text of the target language in the negative training samples does not contain entity words that are parallel to the training entity words. The process of obtaining the positive training samples includes: obtaining text corpus in the source language; identifying the proper nouns and non-proper nouns in the text corpus, and using the proper nouns and non-proper nouns as training entity words; marking the proper nouns in the text corpus using a first marker that matches proper nouns, and marking the non-proper nouns in the text corpus using a second marker that matches non-proper nouns, to obtain the marked text corpus; translating the marked text corpus into the target language using a translation engine to obtain the target language text corpus, which is used as the training text for the target language.

2. The method according to claim 1, characterized in that the cross-language entity word retrieval model includes an embedding layer, a feature extraction layer, and an output layer; the process of inputting the entity word to be retrieved and the text to be retrieved into the pre-configured cross-language entity word retrieval model, and obtaining the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved, as output by the model, includes: inputting the entity word to be retrieved and the text to be retrieved into the embedding layer to obtain the encoded features of the input sentence; deep-encoding the encoded features using the feature extraction layer to obtain deep-encoded features; and using the output layer to predict, from the deep-encoded features, the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved.

3. The method according to claim 2, characterized in that the embedding layer and the feature extraction layer are initialized using the network parameters of a multilingual pre-trained language model.

4. The method according to claim 1, characterized in that the first marker includes: "*", where * represents the keyword to be marked; and the second marker includes: #[*]#.

5. The method according to claim 1, characterized in that obtaining the text corpus in the source language includes: obtaining the original text corpus in the source language; deleting sentences in the original text corpus that are shorter than the length threshold, as well as sentences containing information from non-source languages, to obtain the cleaned text corpus in the source language.

6. The method according to any one of claims 1-5, characterized in that it further includes: based on the annotation results of entity words in the text to be retrieved that are parallel to the entity word to be retrieved, marking and displaying the parallel entity words using a set marking method.

7. A cross-language entity word retrieval method, characterized in that it comprises: obtaining an entity word to be retrieved in a source language and a text to be retrieved in a target language; inputting the entity word to be retrieved and the text to be retrieved into a pre-configured cross-language entity word retrieval model, wherein the cross-language entity word retrieval model performs cross-language entity word retrieval in an end-to-end manner to obtain the annotation results of the entity words in the text to be retrieved that are parallel to the entity word to be retrieved; the cross-language entity word retrieval model is configured to predict, based on the input entity word to be retrieved and the input text to be retrieved, the internal state representation of the annotation results of the entity words in the text to be retrieved that are parallel to the entity word to be retrieved; the cross-language entity word retrieval model is trained using training entity words in the source language and training text in the target language as training samples, and using the annotation results of the entity words in the training text that are parallel to the training entity words as sample labels; the training samples include positive training samples and negative training samples, wherein the target-language training text in a positive training sample contains an entity word parallel to the training entity word, and the target-language training text in a negative training sample does not contain an entity word parallel to the training entity word; the process of obtaining the positive training samples includes: obtaining a text corpus in the source language; segmenting the text corpus using a word segmentation dictionary to obtain word segmentation results, wherein the word segmentation dictionary contains pre-collected proper nouns; extracting keywords from the word segmentation results to obtain a keyword set, the keywords being used as the training entity words; determining the type of each keyword in the keyword set, the types including proper noun and non-proper noun; marking the keywords of the proper noun type in the text corpus with a first marker matched to proper nouns, and marking the keywords of the non-proper noun type in the text corpus with a second marker matched to non-proper nouns, to obtain a marked text corpus; translating the marked text corpus into the target language using a translation engine to obtain a text corpus in the target language, which is used as the training text in the target language; and forming the positive training samples from the training entity words in the source language and the training text in the target language.
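The positive-sample steps of claim 7 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: forward maximum matching stands in for the unspecified segmentation algorithm, keyword extraction is reduced to a length and stop-word filter, `toy_translate` is a hard-coded stand-in for a translation engine, and every function name and the sample sentence are illustrative. The marker templates passed in follow the forms listed in claim 9.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PositiveSample:
    entity_word: str  # training entity word in the source language
    target_text: str  # training text in the target language


def segment(text: str, dictionary: set[str]) -> list[str]:
    """Forward maximum matching (an assumed algorithm): take the longest
    dictionary entry at each position, else fall back to one character."""
    max_len = max([len(word) for word in dictionary] + [1])
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def extract_keywords(tokens: list[str], stop_words: set[str]) -> list[str]:
    """Assumed heuristic: keep multi-character tokens that are not stop
    words, in first-seen order and without duplicates."""
    return list(dict.fromkeys(
        t for t in tokens if len(t) >= 2 and t not in stop_words))


def build_positive_samples(corpus: str, dictionary: set[str],
                           proper_nouns: set[str], stop_words: set[str],
                           translate: Callable[[str], str],
                           first_marker: str, second_marker: str
                           ) -> list[PositiveSample]:
    tokens = segment(corpus, dictionary)
    keywords = extract_keywords(tokens, stop_words)
    keyword_set = set(keywords)
    # Mark proper-noun keywords with the first marker, others with the second.
    marked = "".join(
        (first_marker if t in proper_nouns else second_marker).format(t)
        if t in keyword_set else t
        for t in tokens
    )
    target_text = translate(marked)  # marked corpus -> target language
    return [PositiveSample(k, target_text) for k in keywords]


def toy_translate(marked_text: str) -> str:
    """Hypothetical stand-in for a translation engine; no real API is called."""
    table = {
        '"长白山"的"天池"是#[火山湖]#':
            '"Heaven Lake" on "Changbai Mountain" is a #[volcanic lake]#',
    }
    return table[marked_text]


samples = build_positive_samples(
    corpus="长白山的天池是火山湖",
    dictionary={"长白山", "天池", "火山湖"},
    proper_nouns={"长白山", "天池"},
    stop_words=set(),
    translate=toy_translate,
    first_marker='"{}"',
    second_marker="#[{}]#",
)
for sample in samples:
    print(sample.entity_word, "->", sample.target_text)
```

Claim 7 does not state whether the markers are removed from the target-language text before training, so the sketch leaves them in place.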

8. The method according to claim 7, characterized in that it further comprises: obtaining synonyms of the training entity words, and forming positive training samples from the synonyms of the training entity words and the training text in the target language.
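A minimal Python sketch of this synonym expansion is given below. The claim does not say where the synonyms come from; the lookup table `synonyms`, the function name, and the sample data are hypothetical.

```python
from __future__ import annotations


def synonym_positive_samples(entity_word: str, target_text: str,
                             synonyms: dict[str, list[str]]
                             ) -> list[tuple[str, str]]:
    """Pair each synonym of the training entity word with the same
    target-language training text, skipping the entity word itself."""
    return [(syn, target_text)
            for syn in synonyms.get(entity_word, [])
            if syn != entity_word]


# Usage with an illustrative synonym table (assumed data).
extra_samples = synonym_positive_samples(
    "电脑",
    "The computer is on the desk.",
    {"电脑": ["计算机", "电脑"]},
)
print(extra_samples)
```

Each added sample reuses the target-language text of the original positive sample, so only the source-language side varies.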

9. The method according to claim 7, characterized in that the first marker includes: "*", where * represents the keyword to be marked; and the second marker includes: #[*]#.
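The two marker forms can be illustrated with a short Python sketch that applies them to a keyword and finds them again in a string. It assumes the ASCII double quote as rendered in this text, and it assumes that the markers survive translation and are later read back from the target-language text; the claims shown here do not state how the markers are consumed, and the function names are hypothetical.

```python
from __future__ import annotations

import re

PROPER_RE = re.compile(r'"([^"]+)"')   # first marker: "keyword"
OTHER_RE = re.compile(r'#\[(.+?)\]#')  # second marker: #[keyword]#


def mark(keyword: str, is_proper_noun: bool) -> str:
    """Wrap a keyword in the marker that matches its type."""
    return f'"{keyword}"' if is_proper_noun else f'#[{keyword}]#'


def find_marked(text: str) -> list[tuple[str, str]]:
    """Return (keyword, kind) pairs in order of appearance."""
    hits = [(m.start(), m.group(1), "proper") for m in PROPER_RE.finditer(text)]
    hits += [(m.start(), m.group(1), "other") for m in OTHER_RE.finditer(text)]
    return [(word, kind) for _, word, kind in sorted(hits)]


def strip_markers(text: str) -> str:
    """Remove both marker forms, keeping the keywords themselves."""
    return OTHER_RE.sub(r"\1", PROPER_RE.sub(r"\1", text))


# Usage on an illustrative marked target-language sentence.
translated = '"Heaven Lake" on "Changbai Mountain" is a #[volcanic lake]#'
print(find_marked(translated))
print(strip_markers(translated))
```

Plain double quotes can collide with ordinary quotations in the corpus, so a real pipeline would need to handle that ambiguity.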

10. The method according to any one of claims 7-9, characterized in that it further comprises: based on the annotation results of the entity words in the text to be retrieved that are parallel to the entity word to be retrieved, marking and displaying the parallel entity words using a preset marking method.

11. A cross-language entity word retrieval device, characterized in that it comprises: a data acquisition unit, used to acquire an entity word to be retrieved in a source language and a text to be retrieved in a target language; and a model prediction unit, used to input the entity word to be retrieved and the text to be retrieved into a pre-trained cross-language entity word retrieval model, wherein the cross-language entity word retrieval model performs cross-language entity word retrieval in an end-to-end manner to obtain the annotation results of the entity words in the text to be retrieved that are parallel to the entity word to be retrieved; the cross-language entity word retrieval model is configured to predict, based on the input entity word to be retrieved and the input text to be retrieved, the internal state representation of the annotation results of the entity words in the text to be retrieved that are parallel to the entity word to be retrieved; the cross-language entity word retrieval model is trained using training entity words in the source language and training text in the target language as training samples, and using the annotation results of the entity words in the training text that are parallel to the training entity words as sample labels; the training samples include positive training samples and negative training samples, wherein the target-language training text in a positive training sample contains an entity word parallel to the training entity word, and the target-language training text in a negative training sample does not contain an entity word parallel to the training entity word; the process by which a first positive training sample acquisition unit acquires the positive training samples includes: obtaining a text corpus in the source language; identifying the proper nouns and non-proper nouns in the text corpus, and using the proper nouns and non-proper nouns as the training entity words; marking the proper nouns in the text corpus with a first marker matched to proper nouns, and marking the non-proper nouns in the text corpus with a second marker matched to non-proper nouns, to obtain a marked text corpus; and translating the marked text corpus into the target language using a translation engine to obtain a text corpus in the target language, which is used as the training text in the target language.

12. Cross-language entity word retrieval equipment, characterized in that it comprises: a memory and a processor; the memory is used to store a program; and the processor is configured to execute the program to implement each step of the cross-language entity word retrieval method according to any one of claims 1 to 6 or 7 to 10.

13. A storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, each step of the cross-language entity word retrieval method according to any one of claims 1 to 6 or 7 to 10 is implemented.