Word vector incremental generation method and device, and electronic equipment

By acquiring new corpora of new words and specialized terms, converting them to the same vector space, and integrating them with existing word vectors, the problem of slow expansion of new words and specialized terms in incremental word vector generation methods is solved, achieving fast and effective word vector expansion and resource optimization.

CN113962220B · Active · Publication Date: 2026-06-23 · MIDEA GRP (SHANGHAI) CO LTD +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: MIDEA GRP (SHANGHAI) CO LTD
Filing Date: 2021-10-20
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing incremental word vector generation methods cannot quickly and effectively expand new words and specialized terms, nor can they effectively utilize existing word vector information, resulting in excessive training time and resource consumption.

Method used

New corpora containing new words and/or specialized terms are acquired and trained with a word vector training method. The resulting vectors and the original word vectors are transformed into the same vector space, then selected and fused to obtain incremental word vectors, so that the original word vectors are expanded quickly.

Benefits of technology

Based on the existing word vectors, we can quickly expand the word vectors of new words and professional terms, reduce training time and resource consumption, improve generation efficiency, and effectively utilize the original word vector information.



Abstract

The application provides an incremental generation method, device and electronic equipment for word vectors of the same language, comprising: obtaining new corpus corresponding to new words and / or professional words, and training the new corpus by using a word vector training method to obtain specific word vectors containing the new words and / or professional words; converting the specific word vectors and original word vectors to the same vector space to obtain converted specific word vectors and converted original word vectors; and selecting and fusing the converted specific word vectors and the converted original word vectors to obtain incremental word vectors. The method can effectively utilize the original word vectors, quickly expand the original word vectors to obtain word vectors containing new words and / or professional words, greatly reduce the training time and resource consumption, and improve the word vector generation efficiency of new words and professional words.

Description

Technical Field

[0001] This invention relates to the field of natural language processing, and in particular to an incremental generation method, apparatus, and electronic device for word vectors of the same language.

Background Technology

[0002] Word vectors are vectors used to represent the feature information of words. They contain important semantic features and are an important foundation for text mining tasks such as text classification and sentiment analysis. They are also an important component of the field of natural language processing.

[0003] A complete, high-quality set of word vectors plays a crucial role in text mining. However, new words emerge continuously (e.g., "Lolita," "chicken parenting," etc.), and without vectors for them their feature information cannot be used effectively, so word vectors must be expanded continuously and rapidly. In addition, the specialized terms of a professional field appear only in corpora from that field, so word vectors for specialized terms can only be obtained by training on specialized corpora. In practice, obtaining word vectors for new words and specialized terms requires collecting a large amount of general and specialized corpora for training. Taking word2vec training as an example, all corpora are input into the word vector model at once to obtain the corresponding static word vectors. If new corpora are added after training is complete, all corpora, including the new ones, must be input into the model again and retrained. This process requires significant time and machine resources and hinders rapid iteration of word vector models. A more typical scenario is that large high-tech companies train massive word vector libraries (call one such library word vector library A) on general-purpose corpora and then open-source them. Because they are trained on massive general-domain corpora, these libraries have a large vocabulary, wide coverage, and good performance in the general domain. However, these companies open-source only the final training results, not the corpus used, the specific parameters of the training model, or the auxiliary matrices generated during training. Therefore, when we want to add vectors for specialized or new words to these libraries, we cannot train a library containing a large amount of general and specialized vocabulary by merging new or specialized corpora with the massive general-purpose corpora. Instead, the usual approach is to use a relatively limited collection of specialized corpora to train a word vector library containing specialized vocabulary and a certain number of general-purpose words (call it word vector library B). Because different algorithms, corpora, and parameters are used, libraries A and B differ severely in their vector spaces (i.e., the same word has completely different vector representations in A and B). Therefore, A and B cannot simply be merged and used together to produce a word vector library that contains both a large number of general words and the specialized words of the field of interest.

[0004] In summary, existing incremental word vector generation methods suffer from technical problems, such as the inability to quickly and effectively expand new words and specialized terms, and the inability to effectively utilize existing word vector information.

Summary of the Invention

[0005] In view of this, the purpose of the present invention is to provide an incremental generation method, apparatus and electronic device for word vectors of the same language, so as to alleviate the technical problems of existing incremental generation methods of word vectors being unable to quickly and effectively expand new words and professional terms, and unable to effectively utilize the original word vector information.

[0006] In a first aspect, embodiments of the present invention provide an incremental generation method for word vectors of the same language, comprising:

[0007] Obtain new corpus corresponding to new words and / or professional terms, and train the new corpus using a word vector training method to obtain specific word vectors containing new words and / or professional terms;

[0008] The specific word vector and the original word vector are transformed into the same vector space to obtain the transformed specific word vector and the transformed original word vector;

[0009] The specific word vectors after transformation and the original word vectors after transformation are selected and fused to obtain incremental word vectors, wherein the incremental word vectors include word vectors of new words and / or professional words, and word vectors of original words.

[0010] Furthermore, the new corpus is trained using a word vector training method, including:

[0011] The new corpus is segmented to obtain new corpus word segments;

[0012] The new corpus word segments are trained using the aforementioned word vector training method to obtain specific word vectors containing new words and / or specialized terms.

[0013] Furthermore, transforming the specific word vector and the original word vector to the same vector space includes:

[0014] Based on the intersection words in the specific word vector and the original word vector, extract the specific word vector corresponding to the intersection word in the specific word vector, and extract the original word vector corresponding to the intersection word in the original word vector to obtain the intersection specific word vector set and the intersection original word vector set;

[0015] Construct a target matrix based on the specific word vector set of the intersection and the original word vector set of the intersection;

[0016] Perform singular value decomposition on the target matrix to obtain an orthogonal transformation matrix;

[0017] The specific word vector and the original word vector are transformed to the same vector space according to the orthogonal transformation matrix, so as to obtain the transformed specific word vector and the transformed original word vector.

[0018] Furthermore, transforming the specific word vector and the original word vector to the same vector space according to the orthogonal transformation matrix includes:

[0019] The specific word vector is transformed into the vector space of the original word vector according to the orthogonal transformation matrix;

[0020] or,

[0021] The original word vectors are transformed into the vector space of the specific word vectors according to the orthogonal transformation matrix;

[0022] or,

[0023] The specific word vector and the original word vector are transformed to the third vector space according to the orthogonal transformation matrix.

[0024] Furthermore, the transformed specific word vectors include: the transformed intersection specific word vectors and the transformed newly added word vectors, and the transformed original word vectors include: the transformed intersection original word vectors and the transformed unique word vectors, wherein the words corresponding to the transformed intersection specific word vectors and the transformed intersection original word vectors are the same;

[0025] The selection and fusion of the transformed specific word vectors and the transformed original word vectors includes:

[0026] The transformed intersection specific word vectors and the transformed intersection original word vectors are selected and combined to obtain the summarized intersection word vectors.

[0027] The summarized intersection word vectors are merged with the transformed newly added word vectors and the transformed unique word vectors to obtain the incremental word vectors.

[0028] Furthermore, the selection and combination of the transformed intersection-specific word vectors and the transformed intersection-original word vectors includes:

[0029] If the word represented by the transformed intersection-specific word vector and the corresponding transformed intersection-original word vector is a first-class word, then the transformed intersection-original word vector is selected as the summarized intersection word vector;

[0030] If the word represented by the transformed intersection-specific word vector and the corresponding transformed intersection-original word vector is a second-class word, then the transformed intersection-specific word vector is selected as the summarized intersection word vector.

[0031] If the word represented by the transformed intersection-specific word vector and the corresponding transformed intersection-original word vector is a third-class word, then the weighted average of the transformed intersection-specific word vector and the corresponding transformed intersection-original word vector is calculated, and the resulting weighted average vector is used as the summarized intersection word vector.

[0032] Furthermore, the word vector training method includes any of the following: word2vec algorithm, GloVe algorithm, ELMo algorithm, and BERT algorithm.

[0033] In a second aspect, embodiments of the present invention also provide an incremental word vector generation apparatus for the same language, comprising:

[0034] The acquisition and training unit is used to acquire new corpus corresponding to new words and / or professional terms, and to train the new corpus using a word vector training method to obtain specific word vectors containing new words and / or professional terms.

[0035] The word vector conversion unit is used to convert the specific word vector and the original word vector to the same vector space to obtain the converted specific word vector and the converted original word vector.

[0036] The selection and fusion unit is used to select and fuse the transformed specific word vector and the transformed original word vector to obtain the incremental word vector, wherein the incremental word vector includes: word vectors of new words and / or professional words, and word vectors of original words.

[0037] In a third aspect, embodiments of the present invention also provide an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method described in any of the first aspects above.

[0038] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing machine-executable instructions, which, when invoked and executed by a processor, cause the processor to perform the method described in any of the first aspects above.

[0039] In this embodiment of the invention, an incremental generation method for word vectors of the same language is provided, comprising: first, acquiring new corpus corresponding to new words and / or specialized terms, and training the new corpus using a word vector training method to obtain specific word vectors containing new words and / or specialized terms; then, converting the specific word vectors and the original word vectors to the same vector space to obtain the converted specific word vectors and the converted original word vectors; finally, selecting and fusing the converted specific word vectors and the converted original word vectors to obtain the incremental word vectors. As described above, the incremental word vector generation method of the present invention does not rely on old corpora but directly trains on new corpora to obtain specific word vectors containing new words and / or professional words. Then, based on the specific word vectors and the original word vectors, a selection and fusion is performed to obtain word vectors containing new words and / or professional words as well as the original words. It can be seen that this method can effectively utilize the original word vectors and can quickly expand the original word vectors to obtain word vectors containing new words and / or professional words. This greatly reduces the training time and resource consumption, improves the efficiency of word vector generation for new words and professional words, and alleviates the technical problem that existing incremental word vector generation methods cannot quickly and effectively expand new words and professional words and cannot effectively utilize the original word vector information.

Attached Figure Description

[0040] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0041] Figure 1 is a flowchart of an incremental word vector generation method for the same language, provided in an embodiment of the present invention;

[0042] Figure 2 is a flowchart of a method for converting the specific word vectors and the original word vectors to the same vector space, provided in an embodiment of the present invention;

[0043] Figure 3 is a flowchart of a method for selecting and fusing the converted specific word vectors and the converted original word vectors, provided in an embodiment of the present invention;

[0044] Figure 4 is a schematic diagram comparing two sets of word vectors independently trained on the same corpus, before and after mapping, provided in an embodiment of the present invention;

[0045] Figure 5 is a schematic diagram showing the effect on intersection words of mapping the proprietary word vectors to the comparative word vectors, provided in an embodiment of the present invention;

[0046] Figure 6 is a schematic diagram showing the effect on new words and specialized terms before and after mapping the proprietary word vectors to the comparative word vectors, provided in an embodiment of the present invention;

[0047] Figure 7 is a schematic diagram of an incremental word vector generation device for the same language, provided in an embodiment of the present invention;

[0048] Figure 8 is a schematic diagram of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0049] The technical solution of the present invention will be clearly and completely described below with reference to the embodiments. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0050] Currently, in order to obtain word vectors for new words and specialized terms, it is necessary to retrain all historical and new corpora, which requires a lot of time and machine resources, hindering the rapid iteration of word vector models and wasting the original word vector information.

[0051] Based on this, this embodiment provides an incremental generation method for word vectors of the same language. This method does not rely on old corpora, but directly trains on new corpora. It can effectively utilize the original word vectors and can quickly expand the original word vectors to obtain word vectors containing new words and / or professional words. This greatly reduces the training time and resource consumption and improves the efficiency of word vector generation for new words and professional words.

[0052] To facilitate understanding of this embodiment, a detailed description of an incremental word vector generation method for the same language, as disclosed in this embodiment of the invention, will be provided first.

[0053] Example 1:

[0054] According to an embodiment of the present invention, an embodiment of an incremental generation method for word vectors of the same language is provided. It should be noted that the steps shown in the flowchart in the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions. Furthermore, although a logical order is shown in the flowchart, in some cases, the steps shown or described may be executed in a different order than that shown here.

[0055] Figure 1 is a flowchart of an incremental word vector generation method for the same language according to an embodiment of the present invention. As shown in Figure 1, the method includes the following steps:

[0056] Step S102: Obtain new corpus corresponding to new words and / or professional terms, and train the new corpus using a word vector training method to obtain specific word vectors containing new words and / or professional terms;

[0057] In this embodiment of the invention, the word vector training method includes any of the following: word2vec algorithm, GloVe algorithm, ELMo algorithm, BERT algorithm. This embodiment of the invention does not impose specific limitations on the above word vector training methods, and other arbitrary word vector training methods are also allowed.

[0058] Step S104: Convert the specific word vector and the original word vector to the same vector space to obtain the converted specific word vector and the converted original word vector;

[0059] Specifically, the original word vectors are the historical, previously trained word vectors. Since the specific word vectors and the original word vectors lie in two different vector spaces, the specific word vectors containing new words and / or specialized terms and the original word vectors must first be transformed to the same vector space before the original word vectors can be quickly expanded to include the new words and / or specialized terms. The specific transformation process will be described in detail later.

[0060] Step S106: Select and fuse the converted specific word vectors and the converted original word vectors to obtain the incremental word vectors, wherein the incremental word vectors include: word vectors of new words and / or professional words, and word vectors of original words.

[0061] Specifically, the transformed specific word vector and the transformed original word vector are in the same vector space. The transformed specific word vector contains information about new words and / or specialized words, while the transformed original word vector contains information about the original words. Therefore, by selecting and fusing the two, we can obtain an incremental word vector that contains both new words and / or specialized words as well as original words. The above process makes full use of the original word vector and can quickly expand the original word vector to obtain a word vector that contains new words and / or specialized words.

[0062] In this embodiment of the invention, an incremental generation method for word vectors of the same language is provided, comprising: first, acquiring new corpus corresponding to new words and / or specialized terms, and training the new corpus using a word vector training method to obtain specific word vectors containing new words and / or specialized terms; then, converting the specific word vectors and the original word vectors to the same vector space to obtain the converted specific word vectors and the converted original word vectors; finally, selecting and fusing the converted specific word vectors and the converted original word vectors to obtain the incremental word vectors. As described above, the incremental word vector generation method of the present invention does not rely on old corpora but directly trains on new corpora to obtain specific word vectors containing new words and / or professional words. Then, based on the specific word vectors and the original word vectors, a selection and fusion is performed to obtain word vectors containing new words and / or professional words as well as the original words. It can be seen that this method can effectively utilize the original word vectors and can quickly expand the original word vectors to obtain word vectors containing new words and / or professional words. This greatly reduces the training time and resource consumption, improves the efficiency of word vector generation for new words and professional words, and alleviates the technical problem that existing incremental word vector generation methods cannot quickly and effectively expand new words and professional words and cannot effectively utilize the original word vector information.

[0063] In an optional embodiment of the present invention, step S102 above, which uses a word vector training method to train the new corpus, specifically includes the following steps (1) and (2):

[0064] (1) Perform word segmentation on the new corpus to obtain the new corpus word segments;

[0065] The following section uses a specific new corpus as an example to illustrate this process:

[0066] Suppose the following new corpus related to the home appliance field is obtained: A robotic vacuum cleaner, also known as an automatic cleaning machine, intelligent vacuum cleaner, or robot vacuum cleaner, is a type of intelligent home appliance that can automatically clean the floor in a room using a certain level of artificial intelligence.

[0067] The new corpus was segmented into words, and the resulting words are: "A sweeping robot, also known as an automatic cleaning machine, is a type of smart home appliance that can automatically clean the floor in a room with a certain degree of artificial intelligence."

[0068] (2) The word vector training method is used to train the new corpus word segments to obtain specific word vectors containing new words and / or professional words.

[0069] For example, after training with the word2vec algorithm on the new corpus word segments, the specific word vector of the word "sweeping robot" is obtained as [0.123, 0.345, 0.234, 0.879, ……]. Its dimension is configurable and can be kept consistent with the word vectors released by Tencent; here it can be set to 200 dimensions (note that the numbers above are fictitious).

[0070] A word vector is a word and its corresponding multidimensional vector representation.

[0071] In an optional embodiment of the present invention, referring to Figure 2, step S104 above, which transforms the specific word vector and the original word vector into the same vector space, specifically includes the following steps:

[0072] Step S201: Based on the intersection words in the specific word vector and the original word vector, extract the specific word vector corresponding to the intersection word in the specific word vector, and extract the original word vector corresponding to the intersection word in the original word vector to obtain the intersection specific word vector set and the intersection original word vector set;

[0073] For example, if both the specific word vector and the original word vector contain "sweeping robot," then "sweeping robot" is an intersection word. In this case, extract the specific word vector corresponding to "sweeping robot" from the specific word vector, and extract the original word vector corresponding to "sweeping robot" from the original word vector. If there are multiple intersection words, extract the specific word vectors corresponding to multiple intersection words from the specific word vector to form an intersection set of specific word vectors; simultaneously, extract the original word vectors corresponding to multiple intersection words from the original word vector to form an intersection set of original word vectors.
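Step S201 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes each word vector library is held as a Python dict mapping a word to its vector, and the function name `extract_intersection` and the toy 2-dimensional numbers are hypothetical.

```python
import numpy as np

def extract_intersection(specific, original):
    """Collect the vectors of words present in both vocabularies.

    `specific` and `original` are assumed to be dicts mapping word -> vector.
    The two returned matrices are row-aligned: row k of each matrix
    belongs to words[k].
    """
    words = sorted(set(specific) & set(original))
    W_Ai = np.array([specific[w] for w in words], dtype=float)
    W_Bi = np.array([original[w] for w in words], dtype=float)
    return words, W_Ai, W_Bi

# Toy 2-dimensional vectors (made-up numbers, for illustration only).
specific = {"sweeping robot": [0.1, 0.9], "floor": [0.4, 0.2], "mopping pad": [0.7, 0.3]}
original = {"sweeping robot": [0.8, 0.2], "floor": [0.3, 0.5], "weather": [0.6, 0.6]}

words, W_Ai, W_Bi = extract_intersection(specific, original)
```

Sorting the intersection words is only a convenience that fixes a deterministic row order; what matters is that both matrices list the same words in the same order.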

[0074] Step S202: Construct the target matrix based on the intersection specific word vector set and the intersection original word vector set;

[0075] Specifically, suppose the intersection specific word vector set is denoted $W_{Ai}$ and the intersection original word vector set is denoted $W_{Bi}$. The target matrix can then be $W_{Ai}^T W_{Bi}$.

[0076] Step S203: Perform singular value decomposition on the target matrix to obtain the orthogonal transformation matrix;

[0077] Specifically, SVD is performed on the aforementioned target matrix $W_{Ai}^T W_{Bi}$ to obtain the orthogonal transformation matrix $UV^T$; that is, $\mathrm{SVD}(W_{Ai}^T W_{Bi}) = USV^T$.
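Steps S202 and S203 can be sketched with NumPy as below. The function name `orthogonal_transform` is hypothetical, and the data are synthetic: the "original" intersection vectors are built as an exact rotation `Q` of the "specific" ones, an idealized assumption that makes it easy to see the SVD recover that rotation.

```python
import numpy as np

def orthogonal_transform(W_Ai, W_Bi):
    """Return (U, Vt) from the SVD of the target matrix W_Ai^T W_Bi."""
    target = W_Ai.T @ W_Bi             # target matrix, shape (d, d)
    U, S, Vt = np.linalg.svd(target)   # target = U @ diag(S) @ Vt
    return U, Vt

# Synthetic intersection: 50 shared words, 4 dimensions.
rng = np.random.default_rng(0)
W_Ai = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))   # a random orthogonal matrix
W_Bi = W_Ai @ Q                                # same words, rotated space

U, Vt = orthogonal_transform(W_Ai, W_Bi)
O = U @ Vt                                     # orthogonal transformation matrix UV^T
```

On real word vector libraries the two spaces are not exact rotations of each other, so `W_Ai @ O` only approximates `W_Bi`.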

[0078] Step S204: Transform the specific word vector and the original word vector to the same vector space according to the orthogonal transformation matrix to obtain the transformed specific word vector and the transformed original word vector.

[0079] This invention provides three optional methods for transforming specific word vectors and existing word vectors into the same vector space:

[0080] Method 1: Transform the specific word vector into the vector space of the original word vector using the orthogonal transformation matrix; in this way, the specific word vector in the vector space of the original word vector is obtained (i.e., the transformed specific word vector). At this time, the original word vector is used as the transformed original word vector.

[0081] The specific conversion process can be: $W_A' = W_A UV^T$, where $W_A'$ represents the specific word vectors after conversion, $W_A$ represents the specific word vectors, and $UV^T$ represents the orthogonal transformation matrix.

[0082] Method 2: Transform the original word vectors into the vector space of the specific word vectors using the orthogonal transformation matrix; in this way, the original word vectors in the vector space of the specific word vectors are obtained (i.e., the original word vectors after transformation). At this time, the specific word vectors are used as the specific word vectors after transformation.

[0083] The specific conversion process can be: $W_B' = W_B VU^T$, where $W_B'$ represents the original word vectors after conversion, $W_B$ represents the original word vectors, and $VU^T$ represents the orthogonal transformation matrix.

[0084] Method 3: Transform the specific word vector and the original word vector to the third vector space according to the orthogonal transformation matrix; in this way, we obtain the specific word vector in the third vector space (i.e., the transformed specific word vector) and the original word vector in the third vector space (i.e., the transformed original word vector).

[0085] The specific conversion process can be: $W_A' = W_A U$, $W_B' = W_B V$, where $W_A'$ represents the specific word vectors after conversion, $W_A$ represents the specific word vectors, $W_B'$ represents the original word vectors after conversion, $W_B$ represents the original word vectors, and $U$ and $V$ represent the orthogonal transformation matrices.

[0086] It should be noted that the third vector space is a different vector space from both the original word vector space and the vector space of a specific word vector.
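The three optional transformations can be compared side by side in a short sketch. The function name `to_same_space` is hypothetical. For simplicity the toy data assume that both libraries contain exactly the same words and differ by an exact rotation; in practice the SVD is computed on the intersection words only and then applied to the full matrices.

```python
import numpy as np

def to_same_space(W_A, W_B, U, Vt, method):
    """Apply one of the three optional transformations from step S204."""
    if method == 1:   # specific vectors -> space of the original vectors
        return W_A @ U @ Vt, W_B
    if method == 2:   # original vectors -> space of the specific vectors
        return W_A, W_B @ Vt.T @ U.T
    if method == 3:   # both -> a third space
        return W_A @ U, W_B @ Vt.T
    raise ValueError("method must be 1, 2 or 3")

# Synthetic data: W_B is an exact rotation of W_A (idealized assumption).
rng = np.random.default_rng(1)
W_A = rng.normal(size=(30, 3))
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
W_B = W_A @ Q

U, _, Vt = np.linalg.svd(W_A.T @ W_B)   # SVD of the target matrix
results = {m: to_same_space(W_A, W_B, U, Vt, m) for m in (1, 2, 3)}
```

In this idealized setting each method returns a pair of matrices that coincide row by row; the methods differ only in which space the aligned vectors end up in.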

[0087] Regarding the orthogonal transformation: the goal is to map (i.e., transform) the specific word vectors $W_A$ into another space, $W_A' \sim W_A O$, and this mapping should also be reversible, i.e., $W_A'$ should be mappable back to the word vector space of $W_A$ by the inverse operation, $W_A \sim W_A' O^T$, that is, $W_A \sim W_A OO^T$. Therefore, under the premise of a linear transformation, the mapping matrix is required to satisfy $OO^T = I$, meaning the mapping matrix must be an orthogonal transformation matrix. Furthermore, according to Procrustes analysis, it can be proven that the orthogonal transformation matrix $UV^T$ based on SVD (i.e., singular value decomposition) is the optimal solution that minimizes the Euclidean (L2) norm of the difference between the transformed vectors and the target vectors.
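The optimality claim can be made concrete with the standard orthogonal Procrustes argument. The sketch below assumes that the quantity being minimized is the Frobenius norm over the intersection words, which the text does not spell out. $Z$ and $s_k$ (the diagonal entries of $S$) are working symbols introduced here.

```latex
\begin{aligned}
\|W_{Ai} O - W_{Bi}\|_F^2
  &= \|W_{Ai}\|_F^2 + \|W_{Bi}\|_F^2
     - 2\,\mathrm{tr}\!\left(O^T W_{Ai}^T W_{Bi}\right)
  && \text{(using } OO^T = I\text{)} \\
\mathrm{tr}\!\left(O^T W_{Ai}^T W_{Bi}\right)
  &= \mathrm{tr}\!\left(O^T U S V^T\right)
   = \mathrm{tr}(Z S)
   = \sum_k Z_{kk}\, s_k \;\le\; \sum_k s_k,
  && Z = V^T O^T U \\
Z = I \;&\Longrightarrow\; O = U V^T
\end{aligned}
```

Since $Z$ is orthogonal, each $Z_{kk} \le 1$, and since $s_k \ge 0$ the trace is largest when $Z = I$. Maximizing the trace minimizes the norm, which gives $O = UV^T$.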

[0088] In an optional embodiment of the present invention, the converted specific word vector includes: the converted intersection specific word vector and the converted newly added word vector, and the converted original word vector includes: the converted intersection original word vector and the converted unique word vector, and the words corresponding to the converted intersection specific word vector and the converted intersection original word vector are the same.

[0089] Referring to Figure 3, step S106 above, which selects and fuses the converted specific word vectors and the converted original word vectors, specifically includes the following steps:

[0090] Step S301: Select and combine the specific word vectors of the transformed intersection and the original word vectors of the transformed intersection to obtain the summarized intersection word vectors;

[0091] The specific process for selecting a combination is as follows:

[0092] A. If the word represented by a transformed intersection specific word vector and the corresponding transformed intersection original word vector is a first-category word, then the transformed intersection original word vector is selected as the summarized intersection word vector.

[0093] The first category of words mentioned above can be general words, that is, vocabulary of the general domain.

[0094] B. If the word represented by a transformed intersection specific word vector and the corresponding transformed intersection original word vector is a second-category word, then the transformed intersection specific word vector is selected as the summarized intersection word vector.

[0095] The second category of words mentioned above can be professional terms, that is, vocabulary of a specific field.

[0096] C. If the word represented by a transformed intersection specific word vector and the corresponding transformed intersection original word vector is a third-category word, then the weighted average of the two vectors is calculated, and the resulting weighted average vector is used as the summarized intersection word vector.

[0097] The third category of words mentioned above refers to words that cannot be directly and clearly defined as general terms or professional terms.

[0098] The above process can be simply explained as follows. The converted specific word vectors $W_A'$ include the converted intersection specific word vectors $W_{Ai}'$ and the converted newly added word vectors $W_{AO}'$. The converted original word vectors $W_B'$ include the converted intersection original word vectors $W_{Bi}'$ and the converted unique word vectors $W_{BO}'$. When selecting and combining $W_{Ai}'$ and $W_{Bi}'$, the more general words in the intersection use their word vector representation in $W_{Bi}'$, the more specialized words in the intersection use their word vector representation in $W_{Ai}'$, and the remaining words are represented by the weighted average of their two vectors, resulting in the summarized intersection word vectors $W_i'$.
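The three selection rules can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `combine_intersection`, the `category` lookup, the labels "general" and "professional", and the `weight` parameter are all hypothetical, since the text does not say how words are classified or which weights are used.

```python
import numpy as np

def combine_intersection(spec_i, orig_i, category, weight=0.5):
    """Apply rules A/B/C to intersection words.

    spec_i, orig_i: dicts {word: vector} over the same intersection words
                    (converted specific and converted original vectors).
    category:       dict {word: "general" | "professional"}; any other or
                    missing label is treated as the third category (assumed).
    weight:         share of the specific vector in the weighted average
                    (assumed; the text does not give the weights).
    """
    combined = {}
    for word in spec_i:
        kind = category.get(word)
        if kind == "general":            # rule A: keep the original vector
            combined[word] = orig_i[word]
        elif kind == "professional":     # rule B: keep the specific vector
            combined[word] = spec_i[word]
        else:                            # rule C: weighted average of both
            combined[word] = weight * spec_i[word] + (1 - weight) * orig_i[word]
    return combined

# Toy data: three intersection words in a 2-dimensional space.
spec_i = {"water": np.array([1.0, 0.0]),
          "compressor": np.array([0.0, 2.0]),
          "silent": np.array([2.0, 2.0])}
orig_i = {"water": np.array([0.0, 1.0]),
          "compressor": np.array([4.0, 0.0]),
          "silent": np.array([0.0, 0.0])}
category = {"water": "general", "compressor": "professional"}

W_i = combine_intersection(spec_i, orig_i, category, weight=0.25)
```

Here "silent" has no label, so it falls under rule C and is blended from both vectors.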

[0099] Step S302: The summed intersection word vectors are merged with the transformed new word vectors and the transformed unique word vectors to obtain the incremental word vectors.

[0100] Specifically, the summarized intersection word vectors $W_i'$ are merged with the converted newly added word vectors $W_{AO}'$ and the converted unique word vectors $W_{BO}'$ to obtain the incremental word vectors $W_{new}$.
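Because the three groups cover disjoint sets of words, the merge in step S302 amounts to a union of vocabularies. The sketch below assumes each group is stored as a `{word: vector}` dictionary. The helper name `merge_incremental` and the overlap check are illustrative additions, not part of the source.

```python
import numpy as np

def merge_incremental(W_i, W_AO, W_BO):
    """Merge three disjoint {word: vector} dicts into the incremental set."""
    parts = {"intersection": W_i, "newly added": W_AO, "unique": W_BO}
    seen = {}
    W_new = {}
    for name, part in parts.items():
        for word, vec in part.items():
            if word in seen:
                # The groups are disjoint by construction; fail loudly if not.
                raise ValueError(f"{word!r} appears in both {seen[word]} and {name}")
            seen[word] = name
            W_new[word] = np.asarray(vec, dtype=float)
    return W_new

W_i = {"fan": np.array([0.1, 0.9])}
W_AO = {"rechargeable handheld vacuum cleaner": np.array([0.7, 0.2])}
W_BO = {"cordless vacuum cleaner": np.array([0.6, 0.3])}

W_new = merge_incremental(W_i, W_AO, W_BO)

# Optional: stack into a matrix with a fixed word order for downstream use.
vocab = sorted(W_new)
matrix = np.stack([W_new[w] for w in vocab])
```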

[0101] The incremental word vector generation method of this invention does not rely on old corpora. It rapidly expands the word vectors of new words and specialized terms using new corpora and existing word vectors, and then spatially merges these word vectors with the original word vectors to obtain spatially consistent word vectors. This achieves the purpose of supplementing the original word vectors with missing new words and specialized terms, effectively utilizing the information of the original word vectors, making up for the shortcomings of the original word vectors in rapidly expanding new words and specialized terms, greatly reducing training time and resource consumption. The incrementally generated word vectors can effectively improve the accuracy of downstream tasks (such as classification, clustering, etc.).

[0102] Figure 4 compares two sets of word vectors trained independently on the same corpus, before and after mapping (the left image shows the effect before mapping, and the right image shows the effect after mapping). PCA is used to project the word vectors into a two-dimensional space for visualization. The left image shows the two different sets of word vectors generated from the same corpus (the words on the left belong to one independently trained set, and the words on the right belong to the other). The word relationships within each space are basically consistent: the vector of a single word is not very meaningful on its own, and the main focus is on the correlation between two words. For example, in the left image the relative positions of the word pair "massage instrument - no wind" are basically the same in both sets, indicating that the two sets are essentially consistent internally. However, there are spatial differences between the two word vector spaces (the two sets cluster in different positions, far apart, representing different vector spaces). After mapping and fusion, the two vector spaces are merged and the vectors of the same word are very close (for example, the two "no wind" vectors almost overlap).

[0103] Figure 5 compares the overlapping words of the proprietary word vectors and the comparative word vectors before and after the proprietary word vectors are mapped to the comparative word vectors. The left image shows the overlapping words before mapping (proprietary word vectors on the right of the image, comparative word vectors on the left), and the right image shows them after mapping. After mapping, the distance between the two vectors of the same word is shortened, indicating that the two representations are basically equal.

[0104] Figure 6 compares the new words and specialized vocabulary before and after the proprietary word vectors are mapped to the comparative word vectors. The left image shows the effect before mapping (proprietary word vectors on the right of the image, comparative word vectors on the left), and the right image shows the effect after mapping. From Figures 5 and 6 it can be seen that before the orthogonal mapping the two vector spaces have spatial differences, and after the orthogonal mapping the two vector spaces are merged. Specifically, in the left image of Figure 6, the word vectors on the right are the new words and specialized terms from the newly trained word vectors, while the word vectors on the left are words unique to the comparative word vectors (i.e., non-intersecting words, which do not participate in the construction of the orthogonal mapping matrix). The right image of Figure 6 shows that, even after the orthogonal mapping, the internal relationship between new words and words unique to the comparative word vectors is maintained. Taking the word pair "rechargeable handheld vacuum cleaner - cordless vacuum cleaner" as an example, "rechargeable handheld vacuum cleaner" is a newly added word, while "cordless vacuum cleaner" is a word unique to the comparative word vectors, and this invention still reflects the correlation between the two. (The comparative word vectors are a general-purpose word vector matrix of over 8 million words trained on a massive Chinese corpus.)

[0105] Embodiment 2:

[0106] This invention also provides an incremental word vector generation device for the same language. This incremental word vector generation device is mainly used to execute the incremental word vector generation method provided in Embodiment 1 of this invention. The incremental word vector generation device provided in this invention will be described in detail below.

[0107] Figure 7 is a schematic diagram of an incremental word vector generation device for the same language according to an embodiment of the present invention. As shown in Figure 7, the device mainly includes: an acquisition and training unit 10, a word vector conversion unit 20, and a selection and fusion unit 30, wherein:

[0108] The acquisition and training unit is used to acquire new corpus corresponding to new words and / or professional terms, and to train the new corpus using a word vector training method to obtain specific word vectors containing the new words and / or professional terms.

[0109] The word vector transformation unit is used to transform a specific word vector and the original word vector to the same vector space, resulting in the transformed specific word vector and the transformed original word vector.

[0110] The selection and fusion unit is used to select and fuse the transformed specific word vectors and the transformed original word vectors to obtain the incremental word vectors. The incremental word vectors include the word vectors of the new words and / or professional terms, and the word vectors of the original words.

[0111] In this embodiment of the invention, an incremental word vector generation device for the same language is provided, comprising: firstly acquiring new corpus corresponding to new words and / or specialized terms, and training the new corpus using a word vector training method to obtain specific word vectors containing new words and / or specialized terms; then, converting the specific word vectors and the original word vectors to the same vector space to obtain converted specific word vectors and converted original word vectors; finally, selecting and fusing the converted specific word vectors and the converted original word vectors to obtain incremental word vectors. As described above, the incremental word vector generation device of the present invention does not rely on old corpora but directly trains on new corpora to obtain specific word vectors containing new words and / or professional words. Then, it selects and fuses the specific word vectors and the original word vectors to obtain word vectors containing new words and / or professional words as well as the original words. It can be seen that the device can effectively utilize the original word vectors and can quickly expand the original word vectors to obtain word vectors containing new words and / or professional words. This greatly reduces the training time and resource consumption, improves the efficiency of word vector generation for new words and professional words, and alleviates the technical problem that existing incremental word vector generation methods cannot quickly and effectively expand new words and professional words and cannot effectively utilize the original word vector information.

[0112] Optionally, the acquisition and training unit is also used to: perform word segmentation on the new corpus to obtain new corpus word segments; and train the new corpus word segments using a word vector training method to obtain specific word vectors containing new words and / or professional terms.

[0113] Optionally, the word vector transformation unit is also used to: extract the specific word vectors corresponding to the intersection words in the specific word vectors and the original word vectors, and extract the original word vectors corresponding to the intersection words in the original word vectors, to obtain the intersection specific word vector set and the intersection original word vector set; construct a target matrix based on the intersection specific word vector set and the intersection original word vector set; perform singular value decomposition on the target matrix to obtain an orthogonal transformation matrix; and transform the specific word vectors and the original word vectors to the same vector space based on the orthogonal transformation matrix to obtain the transformed specific word vectors and the transformed original word vectors.
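The four operations of this unit (extract the intersection sets, build the target matrix, run SVD, apply the orthogonal matrix) can be sketched with NumPy as follows. The sketch rests on assumptions not stated in this paragraph: word vectors are stored as `{word: vector}` dictionaries of equal dimension, the target matrix is taken to be the cross-product `A.T @ B` of the two intersection sets, and the specific vectors are mapped into the original space. The function name `align_to_original` is hypothetical.

```python
import numpy as np

def align_to_original(specific, original):
    """Map `specific` word vectors into the space of `original`.

    specific, original: dicts {word: 1-D np.ndarray}, same dimension.
    Returns (transformed_specific, O), where O is the orthogonal matrix.
    """
    # Intersection words: present in both vocabularies (sorted for determinism).
    shared = sorted(set(specific) & set(original))
    if not shared:
        raise ValueError("no intersection words to align on")
    A = np.stack([specific[w] for w in shared])  # intersection specific set
    B = np.stack([original[w] for w in shared])  # intersection original set
    M = A.T @ B                                  # assumed target matrix
    U, _, Vt = np.linalg.svd(M)
    O = U @ Vt                                   # orthogonal transformation matrix
    # Apply O to every specific vector, including the newly added words.
    transformed = {w: v @ O for w, v in specific.items()}
    return transformed, O

# Toy check: the specific space is the original space rotated by a known angle.
theta = 0.7
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
original = {"fan": np.array([1.0, 0.0]),
            "wind": np.array([0.0, 1.0]),
            "motor": np.array([1.0, 1.0]),
            "kettle": np.array([2.0, -1.0])}           # unique to original
specific = {w: original[w] @ R.T for w in ("fan", "wind", "motor")}
specific["cordless vacuum"] = np.array([0.5, 0.5]) @ R.T  # newly added word

transformed, O = align_to_original(specific, original)
```

In this toy setup the recovered matrix undoes the known rotation, so the intersection words land on their original vectors and the newly added word is carried into the same space without having taken part in the fit.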

[0114] Optionally, the word vector transformation unit is also used to: transform a specific word vector to the vector space of the original word vector according to the orthogonal transformation matrix; or, transform the original word vector to the vector space of the specific word vector according to the orthogonal transformation matrix; or, transform the specific word vector and the original word vector to a third vector space respectively according to the orthogonal transformation matrix.
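The first two options are inverses of one another: if an orthogonal matrix maps the specific space onto the original space, its transpose maps in the opposite direction. The short sketch below illustrates this with a stand-in rotation matrix and made-up vectors. The third option is not shown because the text does not describe how the third vector space is chosen.

```python
import numpy as np

# Stand-in orthogonal matrix (a 2-D rotation); in practice it would come
# from the SVD step. All values here are illustrative.
theta = np.pi / 6
O = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

W_A = np.array([[1.0, 2.0],
                [3.0, -1.0]])   # specific word vectors, one per row
W_B = W_A @ O                   # original vectors, built so that W_A @ O == W_B

A_in_B = W_A @ O      # option 1: specific vectors moved into the original space
B_in_A = W_B @ O.T    # option 2: original vectors moved into the specific space

# An orthogonal map preserves vector lengths, so within-space geometry is kept.
norms_before = np.linalg.norm(W_A, axis=1)
norms_after = np.linalg.norm(A_in_B, axis=1)
```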

[0115] Optionally, the transformed specific word vectors include: the transformed intersection specific word vectors and the transformed newly added word vectors; the transformed original word vectors include: the transformed intersection original word vectors and the transformed unique word vectors; the words corresponding to the transformed intersection specific word vectors and the transformed intersection original word vectors are the same; the selection and fusion unit is also used to: select and combine the transformed intersection specific word vectors and the transformed intersection original word vectors to obtain the summarized intersection word vectors; and merge the summarized intersection word vectors with the transformed newly added word vectors and the transformed unique word vectors to obtain the incremental word vectors.

[0116] Optionally, the selection fusion unit is further used to: if the words represented by the transformed intersection-specific word vector and the corresponding transformed intersection-original word vector are first-class words, then select the transformed intersection-original word vector as the summarized intersection word vector; if the words represented by the transformed intersection-specific word vector and the corresponding transformed intersection-original word vector are second-class words, then select the transformed intersection-specific word vector as the summarized intersection word vector; if the words represented by the transformed intersection-specific word vector and the corresponding transformed intersection-original word vector are third-class words, then calculate the weighted average of the transformed intersection-specific word vector and the corresponding transformed intersection-original word vector, and use the obtained weighted average vector as the summarized intersection word vector.

[0117] Optionally, word vector training methods include any of the following: word2vec algorithm, GloVe algorithm, ELMo algorithm, and BERT algorithm.

[0118] The device provided in this embodiment of the invention has the same implementation principle and technical effect as the aforementioned method embodiment. For the sake of brevity, any parts not mentioned in the device embodiment can be referred to the corresponding content in the aforementioned method embodiment.

[0119] As shown in Figure 8, an electronic device 600 provided by an embodiment of this application includes a processor 601, a memory 602, and a bus. The memory 602 stores machine-readable instructions executable by the processor 601. When the electronic device is running, the processor 601 communicates with the memory 602 via the bus. The processor 601 executes the machine-readable instructions to perform the steps of the incremental word vector generation method for the same language as described above.

[0120] Specifically, the memory 602 and processor 601 mentioned above can be general-purpose memory and processor, without any specific limitations. When the processor 601 runs the computer program stored in the memory 602, it can execute the above-mentioned incremental generation method for word vectors of the same language.

[0121] The processor 601 may be an integrated circuit chip with signal processing capabilities. In implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 601 or by instructions in software form. The processor 601 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly manifested as execution by a hardware decoding processor, or execution by a combination of hardware and software modules in the decoding processor. The software module can reside in a mature storage medium in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 602, and processor 601 reads the information from memory 602 and, in conjunction with its hardware, completes the steps of the above method.

[0122] Corresponding to the above-described incremental generation method for word vectors of the same language, this application embodiment also provides a computer-readable storage medium storing machine-executable instructions. When the machine-executable instructions are invoked and executed by a processor, the machine-executable instructions cause the processor to perform the steps of the above-described incremental generation method for word vectors of the same language.

[0123] The incremental word vector generation device for the same language provided in this application embodiment can be specific hardware on a device or software or firmware installed on the device. The implementation principle and technical effects of the device provided in this application embodiment are the same as those in the foregoing method embodiments. For the sake of brevity, any parts not mentioned in the device embodiment can be referred to the corresponding content in the foregoing method embodiments. Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can all be referred to the corresponding processes in the above method embodiments, and will not be repeated here.

[0124] In the embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. Furthermore, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Additionally, the displayed or discussed mutual couplings, direct couplings, or communication connections may be through some communication interfaces; indirect couplings or communication connections between devices or units may be electrical, mechanical, or other forms.

[0125] For example, the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that each block in a block diagram and / or flowchart, and combinations of blocks in block diagrams and / or flowcharts, may be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.

[0126] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0127] In addition, the functional units in the embodiments provided in this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0128] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause an electronic device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the incremental word vector generation method described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0129] It should be noted that similar labels and letters in the following figures indicate similar items. Therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. In addition, the terms "first", "second", "third", etc. are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0130] Finally, it should be noted that the above-described embodiments are merely specific implementations of this application, used to illustrate the technical solutions of this application and not to limit them. The protection scope of this application is not limited thereto. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in this application, they can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of the technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application, and all should be covered within the protection scope of this application. Therefore, the protection scope of this application should be determined by the protection scope of the claims.

Claims

1. A method for incrementally generating word vectors for the same language, characterized in that it comprises:
obtaining new corpus corresponding to new words and / or professional terms, and training the new corpus using a word vector training method to obtain specific word vectors containing the new words and / or professional terms;
transforming the specific word vectors and the original word vectors to the same vector space to obtain transformed specific word vectors and transformed original word vectors;
selecting and fusing the transformed specific word vectors and the transformed original word vectors to obtain incremental word vectors, wherein the incremental word vectors include: word vectors of the new words and / or professional terms, and word vectors of original words;
wherein transforming the specific word vectors and the original word vectors to the same vector space includes:
based on the intersection words of the specific word vectors and the original word vectors, extracting the specific word vectors corresponding to the intersection words from the specific word vectors, and extracting the original word vectors corresponding to the intersection words from the original word vectors, to obtain an intersection specific word vector set and an intersection original word vector set;
constructing a target matrix based on the intersection specific word vector set and the intersection original word vector set;
performing singular value decomposition on the target matrix to obtain an orthogonal transformation matrix;
transforming the specific word vectors and the original word vectors to the same vector space according to the orthogonal transformation matrix to obtain the transformed specific word vectors and the transformed original word vectors;
wherein the transformed specific word vectors include: transformed intersection specific word vectors and transformed newly added word vectors; the transformed original word vectors include: transformed intersection original word vectors and transformed unique word vectors; and the words corresponding to the transformed intersection specific word vectors and the transformed intersection original word vectors are the same;
wherein selecting and fusing the transformed specific word vectors and the transformed original word vectors includes:
selecting and combining the transformed intersection specific word vectors and the transformed intersection original word vectors to obtain summarized intersection word vectors;
merging the summarized intersection word vectors with the transformed newly added word vectors and the transformed unique word vectors to obtain the incremental word vectors;
wherein selecting and combining the transformed intersection specific word vectors and the transformed intersection original word vectors includes:
if the word represented by a transformed intersection specific word vector and the corresponding transformed intersection original word vector is a general word, selecting the transformed intersection original word vector as the summarized intersection word vector;
if the word represented by a transformed intersection specific word vector and the corresponding transformed intersection original word vector is a professional term, selecting the transformed intersection specific word vector as the summarized intersection word vector;
if the word represented by a transformed intersection specific word vector and the corresponding transformed intersection original word vector cannot be clearly defined as a general word or a professional term, calculating the weighted average of the transformed intersection specific word vector and the corresponding transformed intersection original word vector, and using the resulting weighted average vector as the summarized intersection word vector.

2. The method according to claim 1, characterized in that training the new corpus using the word vector training method includes: performing word segmentation on the new corpus to obtain new corpus word segments; and training the new corpus word segments using the word vector training method to obtain the specific word vectors containing the new words and / or professional terms.

3. The method according to claim 1, characterized in that transforming the specific word vectors and the original word vectors to the same vector space according to the orthogonal transformation matrix includes: transforming the specific word vectors to the vector space of the original word vectors according to the orthogonal transformation matrix; or, transforming the original word vectors to the vector space of the specific word vectors according to the orthogonal transformation matrix; or, transforming the specific word vectors and the original word vectors respectively to a third vector space according to the orthogonal transformation matrix.

4. The method according to claim 1, characterized in that the word vector training method includes any of the following: word2vec algorithm, GloVe algorithm, ELMo algorithm, and BERT algorithm.

5. An incremental word vector generation device for the same language, characterized in that it comprises:
an acquisition and training unit, used to acquire new corpus corresponding to new words and / or professional terms, and to train the new corpus using a word vector training method to obtain specific word vectors containing the new words and / or professional terms;
a word vector conversion unit, used to convert the specific word vectors and the original word vectors to the same vector space to obtain transformed specific word vectors and transformed original word vectors;
a selection and fusion unit, used to select and fuse the transformed specific word vectors and the transformed original word vectors to obtain incremental word vectors, wherein the incremental word vectors include: word vectors of the new words and / or professional terms, and word vectors of original words;
wherein the word vector conversion unit is further configured to: extract the specific word vectors corresponding to the intersection words of the specific word vectors and the original word vectors, and extract the original word vectors corresponding to the intersection words from the original word vectors, to obtain an intersection specific word vector set and an intersection original word vector set; construct a target matrix based on the intersection specific word vector set and the intersection original word vector set; perform singular value decomposition on the target matrix to obtain an orthogonal transformation matrix; and transform the specific word vectors and the original word vectors to the same vector space based on the orthogonal transformation matrix to obtain the transformed specific word vectors and the transformed original word vectors;
wherein the transformed specific word vectors include: transformed intersection specific word vectors and transformed newly added word vectors; the transformed original word vectors include: transformed intersection original word vectors and transformed unique word vectors; and the words corresponding to the transformed intersection specific word vectors and the transformed intersection original word vectors are the same;
wherein the selection and fusion unit is further configured to: select and combine the transformed intersection specific word vectors and the transformed intersection original word vectors to obtain summarized intersection word vectors; and merge the summarized intersection word vectors with the transformed newly added word vectors and the transformed unique word vectors to obtain the incremental word vectors;
wherein the selection and fusion unit is further configured to: if the word represented by a transformed intersection specific word vector and the corresponding transformed intersection original word vector is a general word, select the transformed intersection original word vector as the summarized intersection word vector; if the word represented by a transformed intersection specific word vector and the corresponding transformed intersection original word vector is a professional term, select the transformed intersection specific word vector as the summarized intersection word vector; if the word represented by a transformed intersection specific word vector and the corresponding transformed intersection original word vector cannot be clearly defined as a general word or a professional term, calculate the weighted average of the transformed intersection specific word vector and the corresponding transformed intersection original word vector, and use the obtained weighted average vector as the summarized intersection word vector.

6. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 4.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores machine-executable instructions that, when invoked and executed by a processor, cause the processor to perform the method according to any one of claims 1 to 4.