Apparatus and method for evaluating machine translation on basis of residual score, and apparatus for evaluating language model response
The residual score-based evaluation method addresses the bias in existing metrics by comparing candidate sentences to references, ensuring that more natural translations receive higher scores, enhancing translation quality assessment.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- IND UNIV COOP FOUND HANYANG UNIV ERICA CAMPUS
- Filing Date
- 2025-10-30
- Publication Date
- 2026-07-02
AI Technical Summary
Existing machine translation quality evaluation methods, such as BLEU, chrF++, and BERTScore, are biased towards reference sentences, preventing large language models from achieving higher scores even when their outputs are more natural and communicative, necessitating new evaluation metrics.
A residual score-based evaluation method that calculates a superiority relationship value by encoding source, candidate, and reference sentences into embedding vectors and applying an activation function to determine the quality of candidate sentences relative to the reference, allowing higher scores for more natural translations.
Enables fair evaluation of large language model translations by outputting the highest quality sentence, improving translation quality assessment beyond traditional reference-based metrics.
Abstract
Description
Residual score-based machine translation evaluation device and method and language model response evaluation device
[0001] The specification of the present disclosure relates to the performance evaluation of machine translation, and more specifically, is characterized by calculating residual scores of candidate sentences based on a reference sentence and a source text, and quantitatively evaluating the quality of the translated sentences based thereon.
[0002]
[0003] Most existing machine translation quality evaluation methods measure translation quality by assessing the similarity between a candidate translation and a reference translation written by a human.
[0004] Representative conventional evaluation metrics include BLEU, chrF++, BERTScore, and COMET. These metrics calculate scores based on grammatical, lexical, and semantic similarity to the reference sentence.
[0005] However, because this evaluation approach treats the reference sentence as an absolute standard, it has a structural limitation (reference bias): candidate sentences generated by recent Large Language Models (LLMs) cannot receive higher scores than the reference sentence, even when they are more natural and communicative than the reference sentence.
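The reference bias described in [0005] can be illustrated with a toy similarity metric. The sketch below is not BLEU, chrF++, BERTScore, or COMET; `unigram_f1` is a hypothetical stand-in and the example sentences are invented. It only shows that a similarity-based score is maximal when the candidate is identical to the reference, so a candidate that differs from the reference can at best tie it, however natural it reads.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Toy reference-based metric: F1 over lowercase unigram overlap."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # tokens shared by both sentences
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented example: the candidate is a fluent paraphrase of the reference.
reference = "he did not come to the meeting yesterday"
candidate = "he missed yesterday's meeting"

score_reference = unigram_f1(reference, reference)  # reference scored against itself
score_candidate = unigram_f1(candidate, reference)  # paraphrase scored against reference
```

Under this kind of metric the paraphrase is penalized for every token it does not share with the reference, regardless of whether it is the better translation.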
[0006] Due to these structural limitations, there is a growing need for new evaluation metrics that can fairly assess the capabilities of large-scale language model-based translation models.
[0007]
[0008] The embodiments disclosed in this specification provide an apparatus and method for finally outputting the highest quality sentence as a translation by calculating a residual score that evaluates the quality of candidate sentences generated based on a machine translation model.
[0009]
[0010] A residual score-based machine translation evaluation device according to an embodiment of the present invention may include: an input unit that receives a source text; an evaluation unit that inputs the source text into a machine translation module based on a large-scale language model to generate one or more candidate sentences, searches for a reference sentence corresponding to the source text, and inputs the source text, candidate sentence, and reference sentence into a residual score calculation module to calculate a superiority relationship value regarding the quality of the candidate sentence; an output unit that outputs the candidate sentence as a translation when the superiority relationship value indicates that the quality of the candidate sentence is higher than that of the reference sentence, and outputs the reference sentence as a translation when the superiority relationship value indicates that the quality of the candidate sentence is lower than that of the reference sentence; and a residual score calculation module that encodes the source text, candidate sentence, and reference sentence, converts each into an embedding vector, inputs them into a predetermined layer, and applies an activation function to calculate a superiority relationship value.
[0011] The residual score calculation module outputs the superiority relationship value, which indicates whether the quality of the candidate sentence is higher or lower than that of the reference sentence.
[0012] The reference sentence can be retrieved from among a plurality of previously stored reference sentences for the source text.
[0013] The residual score calculation module may be a machine learning-based model trained on a training dataset that includes source texts, candidate sentences, reference sentences, and superiority relationship values.
[0014] The residual score calculation module can encode the source text, candidate sentence, and reference sentence into embedding vectors, input the embedding vectors into a fully connected layer, and calculate the superiority relationship value by applying a sigmoid activation function.
[0015] The superiority relationship values included in the training dataset may be pre-labeled values.
[0016] The residual score calculation module can calculate a residual score of -1 as the superiority relationship value when the candidate sentence is of lower quality than the reference sentence, and a residual score of +1 as the superiority relationship value when the candidate sentence is of higher quality than the reference sentence.
[0017] In addition, a residual score-based machine translation evaluation method according to an embodiment of the present invention may include: a step in which a residual score-based machine translation evaluation device receives a source text; a step in which the residual score-based machine translation evaluation device inputs the source text into a machine translation module based on a large-scale language model to generate one or more candidate sentences; a step in which the residual score-based machine translation evaluation device searches for a reference sentence corresponding to the source text; a step in which the residual score-based machine translation evaluation device inputs the source text, candidate sentence, and reference sentence into a residual score calculation model to calculate a superiority relationship value indicating whether the quality of the candidate sentence is higher than that of the reference sentence; and a step in which the residual score-based machine translation evaluation device outputs the candidate sentence as a translation for the source text if the quality of the candidate sentence is higher, and the reference sentence if the quality of the candidate sentence is lower.
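The steps of the method in [0017] can be sketched as a single function. This is a minimal illustration, not the disclosed implementation: the function names, the dictionary-based reference store, the stub translator and scorer, and the 0.5 decision threshold (chosen on the assumption that the score comes from a sigmoid) are all hypothetical. For brevity the sketch evaluates only the first generated candidate.

```python
from typing import Callable, Dict, List

# All names below are hypothetical; the disclosure does not define an API.
ScoreFn = Callable[[str, str, str], float]


def evaluate(
    source: str,
    translate: Callable[[str], List[str]],
    references: Dict[str, str],
    score: ScoreFn,
    threshold: float = 0.5,  # assumed decision boundary for a sigmoid output
) -> str:
    candidates = translate(source)   # generate one or more candidate sentences
    candidate = candidates[0]        # simplification: evaluate the first one only
    reference = references[source]   # search the stored reference sentences
    value = score(source, candidate, reference)  # superiority relationship value
    # Output the candidate only if it is judged better than the reference.
    return candidate if value > threshold else reference


# Stub components, for illustration only.
def stub_translate(source: str) -> List[str]:
    return ["It is pouring."]


def stub_score(source: str, candidate: str, reference: str) -> float:
    return 0.8  # pretend the scorer prefers the candidate


references = {"Il pleut des cordes.": "It rains ropes."}
output = evaluate("Il pleut des cordes.", stub_translate, references, stub_score)
```

Swapping `stub_score` for a scorer that returns a value below the threshold makes the same call return the stored reference sentence instead.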
[0018] In addition, a computer-readable storage medium according to an embodiment of the present invention may store a computer program to execute the method using a computer.
[0019] In addition, a residual score-based language model response evaluation device according to an embodiment of the present invention may include: an input unit that receives a source text; an evaluation unit that inputs the source text into a large-scale language model-based response generation module to generate one or more candidate responses, searches for a reference response corresponding to the source text, and inputs the source text, candidate response, and reference response into a residual score calculation module to calculate a superiority-subordinate relationship value regarding the quality of the candidate response; an output unit that outputs the candidate response as an output response when the superiority-subordinate relationship value indicates that the quality of the candidate response is higher than that of the reference response, and outputs the reference response as an output response when the superiority-subordinate relationship value indicates that the quality of the candidate response is lower than that of the reference response; and a residual score calculation module that encodes the source text, candidate response, and reference response, converts them into embedding vectors respectively, inputs them into a predetermined layer, and applies an activation function to calculate a superiority-subordinate relationship value.
[0020] A computer program according to an embodiment of the present invention may be stored on a medium to execute any one of the methods according to an embodiment of the present invention using a computer.
[0021] In addition to the above, other methods for implementing the present invention, other systems, and computer-readable recording media for recording a computer program for executing said methods are further provided.
[0022] Other aspects, features, and advantages other than those described above will become apparent from the following drawings, claims, and detailed description of the invention.
[0023]
[0024] According to any one of the aforementioned means for solving the problem, by calculating residual scores that evaluate the quality of candidate sentences generated based on a machine translation model, the sentence of the highest quality is finally output as the translation.
[0025] Furthermore, residual scores can potentially be used to evaluate not only the quality of sentence translations but also the quality of LLM responses.
[0026]
[0027] FIG. 1 is a diagram of the network environment of an evaluation system according to embodiments of the present invention.
[0028] FIG. 2 is a block diagram of an evaluation apparatus according to embodiments of the present disclosure.
[0029] FIG. 3 is a block diagram of the translation output unit.
[0030] FIG. 4 is a flowchart of a residual score-based machine translation quality evaluation method according to an embodiment of the present disclosure.
[0031] FIG. 5 is a diagram illustrating the data flow in the case where a candidate sentence has lower quality than a reference sentence during the operation of a residual score-based evaluation model according to one embodiment of the present invention.
[0032] FIG. 6 is a diagram illustrating the data flow in the case where a candidate sentence has a higher quality than a reference sentence during the operation of a residual score calculation model (M) according to one embodiment of the present invention.
[0033]
[0034] The structure and operation of the present invention will be described in detail below with reference to embodiments of the present invention illustrated in the attached drawings.
[0035] The present invention is capable of various modifications and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. The effects and features of the present invention, and the methods for achieving them, will become clear by referring to the embodiments described below in detail together with the drawings. However, the present invention is not limited to the embodiments disclosed below but can be implemented in various forms.
[0036] Hereinafter, embodiments of the present invention will be described in detail with reference to the attached drawings. When describing with reference to the drawings, identical or corresponding components are given the same reference numerals, and redundant descriptions thereof will be omitted.
[0037] In the following, terms described as "upper" or "on" may include not only those directly above in contact, but also those above without contact.
[0038] In the following embodiments, terms such as first, second, etc. are used not in a limiting sense, but for the purpose of distinguishing one component from another component.
[0039] In the following embodiments, singular expressions include plural expressions unless the context clearly indicates otherwise.
[0040] In the following embodiments, terms such as "include" or "have" mean that the features or components described in the specification are present, and do not preclude the possibility that one or more other features or components may be added.
[0041] In the drawings, the size of components may be exaggerated or reduced for convenience of explanation. For example, the size and thickness of each component shown in the drawings are depicted arbitrarily for convenience of explanation, so the present invention is not necessarily limited to what is illustrated.
[0042] Additionally, terms such as “…part,” “…area,” etc., as described in this specification may refer to a unit that processes at least one function or operation.
[0043] FIG. 1 illustrates a network system including a server and a user terminal according to one embodiment of the present disclosure.
[0044] The residual score-based machine translation sentence evaluation system (1, hereinafter, evaluation system) of the present disclosure may include a server (20) and at least one user terminal (11 to 16). The server (20) may provide various online activities through a network. The server (20) may provide online activities to at least one user terminal (11 to 16) simultaneously. Here, the online activities may include providing a residual score-based machine translation service.
[0045] A residual score-based machine translation service can evaluate the optimal translation for an input text by comparing a reference sentence and a candidate sentence for the original text, and output the optimal translation.
[0046] In this specification, a reference sentence refers to a sentence translated from the original text by a human translator or a reliable system, which serves as a standard for evaluating machine translation quality.
[0047] In this specification, a candidate sentence refers to a translated sentence generated by a machine translation module. The candidate sentence can be generated through services such as Moses, Google Translate, DeepL, Amazon, and Microsoft.
[0048] According to one embodiment of the present disclosure, the term server (20) may include a single server, a collection of servers, a cloud server, etc., but is not limited to the above examples. The server (20) may provide various online activities and may include a database that stores data for online activities. Additionally, the server (20) may include a payment server that generates and processes payment events. As previously described, the server (20) may be a residual score-based machine translation sentence evaluation device.
[0049] According to one embodiment of the present disclosure, a network refers to a connection established (or formed) using any communication method, and may refer to a communication network connected through any communication method that transmits and receives data between terminals or between a terminal and a server.
[0050] The term "any communication method" covers all communication methods, such as communication through a specified communication standard, a specified frequency band, a specified protocol, or a specified channel. For example, it may include communication via Bluetooth, BLE, Wi-Fi, Zigbee, 3G, 4G, 5G, 6G, LTE, and ultrasound, and may include short-range communication, long-range communication, wireless communication, and wired communication. It is not limited to these examples.
[0051] According to one embodiment of the present disclosure, a short-range communication method may mean a communication method in which communication is possible only when the communicating devices (terminals or servers) are within a predetermined range of each other, and may include, for example, Bluetooth and NFC. A long-range communication method may mean a communication method in which the communicating devices can communicate regardless of distance. For example, a long-range communication method may mean a method in which two devices can communicate through a repeater such as an AP even when they are more than a predetermined distance apart, and may include communication over cellular networks (3G, 4G, 5G, LTE), such as SMS and telephone calls. It is not limited to these examples. Receiving a service over a network means that communication between a server and a terminal can be performed through any communication method.
[0052] Throughout the specification, the term "at least one user terminal (11 to 16)" may include a personal computer (11), a tablet (12), a cellular phone (13), a laptop (14), a smartphone (15), a TV (16), as well as various electronic devices such as personal digital assistants (PDA), portable multimedia players (PMP), navigation devices, and MP3 players, and is not limited to the examples above. As previously described, at least one user terminal (11 to 16) may be an evaluation device.
[0053] According to one embodiment of the present disclosure, the server (20) can input source data into a machine translation module based on a large-scale language model to output candidate sentences. The server (20) can calculate a residual score for the output candidate sentences and compare the translation quality of the reference sentence and the candidate sentences based on the residual score. The server (20) can output a candidate sentence whose residual score indicates higher quality than the reference sentence as a translation of the source text.
[0054] According to one embodiment of the present disclosure, at least one user terminal (11 to 16) can input source data into a machine translation module based on a large-scale language model to output candidate sentences. At least one user terminal (11 to 16) can calculate a residual score for the output candidate sentences and compare the quality of the reference sentence and the candidate sentences based on the residual score. At least one user terminal (11 to 16) can output a candidate sentence whose residual score indicates higher quality than the reference sentence as a translation of the source text.
[0055] In addition, according to one embodiment of the present disclosure, the evaluation system (1) can input source data into a machine translation module based on a large-scale language model to output candidate sentences. The evaluation system (1) can calculate a residual score for the output candidate sentences and compare the quality of the reference sentence and the candidate sentences based on the residual score. The evaluation system (1) can output a candidate sentence whose residual score indicates higher quality than the reference sentence as a translation of the source text.
[0056] This is explained in more detail below.
[0057] FIG. 2 is a drawing for explaining the detailed configuration of an evaluation device according to embodiments of the present disclosure.
[0058] As illustrated in FIG. 2, a residual score-based machine translation sentence evaluation device (100, hereinafter referred to as the evaluation device) according to some embodiments may include a processor (110), an input / output unit (130), a memory (140), a communication unit (150), and a translation output unit (200). However, not all components illustrated in FIG. 2 are essential components of the evaluation device (100). The evaluation device (100) may be implemented with more components than those illustrated in FIG. 2, or with fewer components than those illustrated in FIG. 2. The evaluation device (100) may be a user terminal, a server, a residual score-based machine translation network system, or a separate device.
[0059] According to one embodiment of the present disclosure, the processor (110) typically controls the overall operation of the evaluation device (100). For example, the processor (110) can control the components included in the evaluation device (100) by executing a program stored in the evaluation device (100).
[0060] According to one embodiment of the present disclosure, the processor (110) can output candidate sentences through a machine translation module based on a large-scale language model by executing instructions stored in the translation output unit (200). The processor (110) can calculate a residual score for the output candidate sentences and compare the quality of the reference sentence and the candidate sentence based on the residual score. The processor (110) can output a candidate sentence whose residual score indicates higher quality than the reference sentence as a translation of the original text. The processor (110) controls the overall operation of the evaluation device (100) using various programs stored in the memory (140) of the evaluation device (100). For example, the processor (110) may include a CPU, RAM, ROM, and a system bus. Here, the ROM stores an instruction set for system booting; the CPU copies the operating system (OS) stored in the evaluation device (100) to RAM according to the instructions stored in the ROM, and executes the OS to boot the system. Once booting is complete, the CPU can copy various stored applications to RAM and execute them to perform various operations. Although the evaluation device (100) has been described above as including only one CPU, it may be implemented with multiple CPUs (or DSPs, SoCs, etc.).
[0061] According to one embodiment of the present invention, the processor (110) may be implemented as a digital signal processor (DSP) that processes digital signals, a microprocessor, or a timing controller (TCON). However, it is not limited thereto, and may include or be defined by one or more of a central processing unit (CPU), a micro controller unit (MCU), a micro processing unit (MPU), a controller, an application processor (AP), a communication processor (CP), or an ARM processor. Additionally, the processor (110) may be implemented as a System on Chip (SoC) or Large Scale Integration (LSI) with a built-in processing algorithm, or may be implemented in the form of a Field Programmable Gate Array (FPGA).
[0062] According to one embodiment of the present disclosure, the input / output unit (130) can display an interface generated by the memory (140). According to one embodiment of the present invention, the input / output unit (130) can display a user interface for receiving user input. The input / output unit (130) can output stored graphic data, visual data, auditory data, and vibration data under the control of the memory (140).
[0063] The input / output unit (130) can be implemented as a display panel of various forms. For example, the display panel can be implemented using a display technology such as LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), AM-OLED (Active-Matrix Organic Light-Emitting Diode), LCoS (Liquid Crystal on Silicon), or DLP (Digital Light Processing). Additionally, the input / output unit (130) may be coupled, in the form of a flexible display, to at least one of the front area, side area, and rear area of the display panel.
[0064] The input / output unit (130) can be implemented as a touch screen with a layered structure. In addition to the display function, the touch screen can detect the touch input location, the touched area, and the touch input pressure, and can detect proximity touch as well as real touch.
[0065] The input / output unit (130) may include a user interface for inputting various information to the evaluation device (100).
[0066] According to one embodiment of the present disclosure, the memory (140) may store a program for processing and controlling the processor (110) and / or the translation output unit (200), and may also store data input to or output from the evaluation device (100). According to one embodiment of the present disclosure, the memory (140) may store information regarding a user account and may store information necessary to calculate the residual score of a candidate sentence. The memory (140) may include a database that stores the above information.
[0067] According to one embodiment of the present disclosure, the memory (140) may include at least one type of storage medium among a flash memory type, a hard disk type, a multimedia card micro type, a card type memory (e.g., SD or XD memory, etc.), RAM (Random Access Memory), SRAM (Static Random Access Memory), ROM (Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), PROM (Programmable Read-Only Memory), magnetic memory, a magnetic disk, and an optical disk. Additionally, according to one embodiment of the present disclosure, programs stored in the memory (140) may be classified into a plurality of modules according to their functions.
[0068] According to one embodiment of the present disclosure, the communication unit (150) can communicate with an external device under the control of the processor (110). For example, the communication unit (150) can communicate with an external device such as a payment server or an authentication server. Additionally, the communication unit (150) may obtain user information or user input through communication with an external interface.
[0069] FIG. 3 is a diagram illustrating the detailed structure and operation of the translation output unit (200).
[0070] The translation output unit (200) inputs the received original text into a machine translation module based on a large-scale language model to output candidate sentences, and inputs the candidate sentences and the original text into a residual score calculation module (230) to obtain a superiority relationship value regarding the quality of the candidate sentences. The translation output unit (200) outputs the candidate sentence as a translation of the original text if the quality of the candidate sentence is higher, and outputs the reference sentence as a translation of the original text if the quality of the candidate sentence is lower.
[0071] The input unit (221) can receive a first original text to be translated. The first original text may be a sentence in a first language. The input unit (221) can receive additional information about a second language to which the first original text is to be translated.
[0072] The evaluation unit (222) can input the first original text into a machine translation module based on a large-scale language model and output a first candidate sentence for the first original text.
[0073] Here, the machine translation module based on a large-scale language model performs the function of converting an input source text into a translation of the target language based on the latest language generation technology in the field of Natural Language Processing.
[0074] A machine translation module based on a large-scale language model can tokenize the source text and transform each token into a high-dimensional embedding vector, thereby converting the source text into vectors in a representation space that reflects the meaning of the sentence. The module captures the contextual characteristics of the source text through an encoder network and generates translated sentences that conform to the grammar and word order of the target language through a decoder network. The module can improve translation performance by weighting contextually important parts through an attention module. The module can generate not only a single sentence but also multiple candidate sentences as translation results corresponding to the source text. The module can enhance semantic agreement between the source text and the translation by utilizing a multilingual corpus. Here, a multilingual corpus refers to a collection of text data in two or more languages, in which sentences with the same meaning in two or more languages are stored as sets. Multilingual corpora may include the multilingual minutes of the European Parliament, open multilingual corpus repositories, multilingual translations of official UN documents, and corpora for machine translation evaluation.
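The idea of weighting contextually important parts through an attention module can be illustrated with generic scaled dot-product attention for a single query vector. This is a textbook formulation used here as an assumption; the disclosure does not specify which attention mechanism the translation module uses, and the vectors below are invented toy values.

```python
import math
from typing import List, Tuple


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: List[float], keys: List[List[float]], values: List[List[float]]
) -> Tuple[List[float], List[float]]:
    """Scaled dot-product attention for one query; returns (context, weights)."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)  # how much each token contributes to the context
    dim = len(values[0])
    context = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return context, weights


# Toy token representations: the query matches the first key more closely.
context, weights = attend(
    query=[1.0, 0.0],
    keys=[[1.0, 0.0], [0.0, 1.0]],
    values=[[10.0, 0.0], [0.0, 10.0]],
)
```

The token whose key is most similar to the query receives the larger weight, so its value dominates the resulting context vector.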
[0075] The evaluation unit (222) can obtain a first reference sentence for the first original text. The evaluation unit (222) can search for a first reference sentence for the first original text among the previously stored reference sentences.
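Searching the previously stored reference sentences can be as simple as a keyed lookup. The sketch below assumes, hypothetically, that references are stored per source text and target language and that matching is done on whitespace- and case-normalized text; the disclosure does not specify the storage or search scheme, and the stored sentence is an invented example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical store: (normalized source text, target language) -> reference.
ReferenceStore = Dict[Tuple[str, str], str]


def normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivially different inputs match."""
    return " ".join(text.split()).lower()


def add_reference(store: ReferenceStore, source: str, target_lang: str, reference: str) -> None:
    store[(normalize(source), target_lang)] = reference


def find_reference(store: ReferenceStore, source: str, target_lang: str) -> Optional[str]:
    """Return the stored reference sentence, or None if there is none."""
    return store.get((normalize(source), target_lang))


store: ReferenceStore = {}
add_reference(store, "Il pleut des cordes.", "en", "It is raining cats and dogs.")
found = find_reference(store, "  Il pleut   des cordes. ", "en")
missing = find_reference(store, "Il fait beau.", "en")
```

A production system would more plausibly use a database or a similarity search, but the contract is the same: a source text goes in, and a stored reference sentence or nothing comes out.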
[0076] The evaluation unit (222) can input the first original text, the first candidate sentence, and the first reference sentence into the residual score calculation module and output the superiority relationship value of the first candidate sentence.
[0077] The output unit (223) can output the first candidate sentence as a translation of the first original text if the superiority relationship value of the first candidate sentence indicates that the quality of the first candidate sentence is higher. The output unit (223) can output the first reference sentence as a translation of the first original text if the superiority relationship value of the first candidate sentence indicates that the quality of the first candidate sentence is lower.
[0078] The residual score calculation module is a model trained on a training dataset that includes inputs of source text, candidate sentence, and reference sentence, and outputs a superiority-inferiority relationship value. Depending on the inputs of source text, candidate sentence, and reference sentence, the module can output a superiority-inferiority relationship value. The superiority-inferiority relationship value may be a value indicating whether the quality of the candidate sentence is higher or lower than that of the reference sentence.
[0079] The residual score calculation module refers to a model that outputs a superiority relationship value indicating whether a candidate sentence has higher or lower quality than a reference sentence. Here, quality means quality as a translation of the source text: a higher-quality sentence is more suitable as a translation of the source text.
[0080] The residual score calculation module can be trained with a training dataset of pairs of candidate sentences and reference sentences for a single source text and superiority relationship values. The training dataset of the residual score calculation module may be data that includes a reference sentence and a candidate sentence for a source text, and is labeled with a superiority relationship value indicating whether the candidate sentence is superior.
[0081] Here, the superiority relationship value may be a label that classifies the quality of the candidate sentence as either higher or lower than that of the reference sentence.
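The labeled training data described in [0080] and [0081] can be pictured as records of source text, candidate sentence, reference sentence, and a binary label. The sketch below is an assumption-laden illustration: the `TrainingExample` type, the 1/0 label encoding, the placeholder sentences, and the use of binary cross-entropy as the training loss are all hypothetical, since the disclosure does not name a loss function or a data format.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class TrainingExample:
    source: str
    candidate: str
    reference: str
    label: int  # assumed encoding: 1 = candidate better than reference, 0 = worse


def binary_cross_entropy(predicted: float, label: int, eps: float = 1e-12) -> float:
    """Loss for one example, given the model's predicted probability of label 1."""
    p = min(max(predicted, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


# Placeholder records; real data would hold actual sentences and human labels.
dataset: List[TrainingExample] = [
    TrainingExample("source sentence 1", "candidate 1", "reference 1", 1),
    TrainingExample("source sentence 2", "candidate 2", "reference 2", 0),
]

# Loss for a model that predicts 0.9 ("candidate is better") on every example.
losses = [binary_cross_entropy(0.9, ex.label) for ex in dataset]
```

The loss is small where the prediction agrees with the label and large where it does not, which is what drives the model toward the pre-labeled judgments during training.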
[0082] The residual score calculation module takes the source text, candidate sentence, and reference sentence as input data, and can output whether the candidate sentence is of higher quality than the reference sentence.
[0083] The residual score calculation module can encode the source text, candidate sentence, and reference sentence included in the input data and convert each into an embedding vector. The residual score calculation module can input the embedding vector of the source text, the embedding vector of the candidate sentence, and the embedding vector of the reference sentence into a predetermined layer (a fully connected layer) and output the superiority relationship value of the candidate sentence by passing the result through a sigmoid function. Instead of the sigmoid function, the residual score calculation module may use at least one of the tanh function, ReLU function, Leaky ReLU function, Swish function, Softsign function, and Hard sigmoid function.
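The forward pass described in [0083] can be sketched with the standard library alone. Several parts are assumptions made for illustration: `toy_encode` is a hypothetical hashed bag-of-words encoder standing in for a real sentence encoder, the three embeddings are combined by concatenation (the disclosure does not say how they are combined), and the weights are untrained placeholders.

```python
import hashlib
import math
from typing import List

DIM = 8  # toy embedding size (assumption)


def toy_encode(text: str) -> List[float]:
    """Hypothetical stand-in encoder: hashed bag-of-words, averaged over tokens."""
    vec = [0.0] * DIM
    tokens = text.lower().split()
    for tok in tokens:
        digest = hashlib.sha256(tok.encode("utf-8")).digest()
        vec[digest[0] % DIM] += 1.0  # bucket each token by its hash
    n = max(len(tokens), 1)
    return [v / n for v in vec]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def residual_score(
    source: str, candidate: str, reference: str, weights: List[float], bias: float
) -> float:
    # Encode each input separately, then concatenate the three embeddings.
    features = toy_encode(source) + toy_encode(candidate) + toy_encode(reference)
    # One fully connected layer with a single output unit, followed by a sigmoid.
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)


weights = [0.0] * (3 * DIM)  # untrained placeholder weights
score = residual_score("Il pleut des cordes.", "It is pouring.", "It rains ropes.", weights, 0.0)
```

With untrained zero weights the score carries no information; a trained model would push it toward one end of the sigmoid range depending on whether the candidate or the reference is the better translation.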
[0084] The residual score calculation module may be a learning-based model generated through a learning process as described above, but is not limited thereto, and may be a rule-based, statistics-based, or non-learning-based model.
[0085] Through this, the evaluation device according to the embodiments of the present disclosure can compare quality more precisely by interpreting the semantic differences among the source text, the candidate sentences, and the reference sentences using a deep learning model.
[0086] An evaluation device according to embodiments of the present disclosure can generate a plurality of candidate sentences for an input text and compare the quality of each candidate sentence with a reference sentence to select and output the candidate sentence having the highest quality among the candidate sentences.
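As a non-limiting illustration, selecting the highest-quality candidate among several could be sketched as follows. The function name `select_best_candidate`, the `Scorer` type, and the dictionary of scores are hypothetical; the scores are made-up placeholders standing in for the outputs of a residual score calculation module.

```python
from typing import Callable, Sequence, Tuple

# A scorer maps (source, candidate, reference) to a superiority relationship value.
Scorer = Callable[[str, str, str], float]


def select_best_candidate(source: str, candidates: Sequence[str],
                          reference: str, scorer: Scorer) -> Tuple[str, float]:
    """Score every candidate against the same reference and keep the best."""
    if not candidates:
        raise ValueError("at least one candidate sentence is required")
    scored = [(scorer(source, c, reference), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


# Hypothetical scores, for illustration only.
fake_scores = {"draft A": 0.35, "draft B": 0.80, "draft C": 0.55}
best, score = select_best_candidate(
    "source", list(fake_scores), "reference",
    lambda s, c, r: fake_scores[c],
)
# best is "draft B", the candidate with the largest placeholder score
```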
[0087] The residual score calculation module can learn from pre-labeled data so that the optimal translation for the source text is output. By feeding the output optimal translation and the source text back in as input, its ability to calculate evaluation scores for candidate sentences can be further improved through reinforcement learning.
[0088] Through the residual score calculation module, the evaluation device according to the embodiments of the present disclosure can make use of various translated candidate sentences rather than being limited to the reference sentence, which enables a variety of translations of the source text to be output.
[0089] FIG. 4 is a flowchart of a residual score-based machine translation quality evaluation method according to an embodiment of the present disclosure.
[0090] In S110, the residual score-based machine translation evaluation device can receive a first source text to be translated.
[0091] In S120, the residual score-based machine translation evaluation device can input the first source text into a large-scale language model-based machine translation module to generate a first candidate sentence for the first source text.
[0092] In S130, the residual score-based machine translation evaluation device can obtain a first reference sentence for the first source text. The residual score-based machine translation evaluation device can search for the first reference sentence for the first source text among previously stored reference sentences.
[0093] In S140, the residual score-based machine translation evaluation device inputs the first source text, the first candidate sentence, and the first reference sentence into the residual score calculation module and outputs the superiority relationship value of the first candidate sentence. The operation of the residual score calculation module is the same as described with reference to FIG. 3, so a repeated description is omitted.
[0094] In S150, the residual score-based machine translation evaluation device can output the first candidate sentence as the translation of the first source text if the superiority relationship value of the first candidate sentence indicates that the quality of the first candidate sentence is higher.
[0095] In S160, the residual score-based machine translation evaluation device can output the first reference sentence as the translation of the first source text if the superiority relationship value of the first candidate sentence indicates that the quality of the first candidate sentence is lower.
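Steps S110 to S160 can be sketched as a single function, as a non-limiting illustration. The names `evaluate_translation`, `translate`, `reference_store`, and `score` are hypothetical. The 0.5 decision threshold and the error raised when no reference sentence is stored are assumptions; the present disclosure does not specify either.

```python
from typing import Callable, Mapping


def evaluate_translation(
    source: str,                                # S110: received source text
    translate: Callable[[str], str],            # stands in for the LLM-based module
    reference_store: Mapping[str, str],         # previously stored references
    score: Callable[[str, str, str], float],    # residual score calculation module
    threshold: float = 0.5,                     # assumed decision boundary
) -> str:
    candidate = translate(source)               # S120: generate the candidate
    reference = reference_store.get(source)     # S130: look up the reference
    if reference is None:
        # Assumption: the behavior without a stored reference is unspecified.
        raise LookupError("no reference sentence stored for this source text")
    y = score(source, candidate, reference)     # S140: superiority relationship value
    # S150 outputs the candidate if it is judged higher; S160 outputs the reference.
    return candidate if y > threshold else reference


# Hypothetical stubs, for illustration only.
store = {"Bonjour": "Hello"}
chosen_high = evaluate_translation("Bonjour", lambda s: "Hi there", store,
                                   lambda s, c, r: 0.9)
chosen_low = evaluate_translation("Bonjour", lambda s: "Hi there", store,
                                  lambda s, c, r: 0.1)
```

With the stub scores above, `chosen_high` is the candidate sentence and `chosen_low` is the stored reference sentence.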
[0096] FIG. 5 is a diagram illustrating the data flow in the case where a candidate sentence has lower quality than a reference sentence during the learning operation of a residual score-based evaluation model according to one embodiment of the present invention.
[0097] As illustrated in FIG. 5, a residual score-based machine translation evaluation device according to an embodiment of the present invention may include a residual score calculation model (M) that takes a source, a reference sentence, and a candidate sentence as inputs and calculates the quality of a candidate sentence based on the input sentence pairs.
[0098] Specifically, the residual score calculation model (M) receives the source, the reference sentence, and the candidate sentence as input data. The source (s) refers to a sentence in a first language that is to be translated. The reference sentence (r) refers to a high-quality translation or standard sentence translated by a human. The candidate sentence (c) refers to a translation generated by a machine translation system.
[0099] The residual score calculation model (M) can take the source text, the reference sentence, and the candidate sentence as inputs and output a superiority relationship value of the candidate sentence. Based on the superiority relationship value, it can be determined whether the translation quality of the candidate sentence is lower than that of the reference sentence; in that case, a value of y - 1 is calculated as the target output value (target Δy). Here, y is the output value of the residual score calculation model (M) and is a superiority relationship value representing the relative quality of the candidate sentence. The superiority relationship value may be a scalar value.
[0100] FIG. 6 is a diagram illustrating the data flow in the case where a candidate sentence has a higher quality than a reference sentence during the operation of a residual score calculation model (M) according to one embodiment of the present invention.
[0101] As illustrated in FIG. 6, a residual score-based machine translation evaluation device according to an embodiment of the present invention may include a residual score calculation model (M) that takes a source, a candidate sentence, and a reference sentence as inputs and calculates the quality of the candidate sentence based on the input sentence pairs.
[0102] The residual score calculation model (M) can be trained to output a target value Δy = 1 - y, which corresponds to the case where the translation quality of the candidate sentence (c) is higher than that of the reference sentence (r).
[0103] At this time, if it is determined that the candidate sentence has a higher translation quality than the reference sentence, a positive (+) residual score (target delta score) is assigned. For example, if it is determined that the candidate sentence is more natural or more faithful to the original meaning than the reference sentence, that is, if the quality of the candidate sentence is higher than that of the reference sentence, the residual score calculation model (M) can adjust its weights so that Δy moves in the positive (+) direction to reflect this.
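The two target values described with reference to FIG. 5 and FIG. 6 can be written as one function, as a non-limiting illustration. The function name `target_delta` is hypothetical, and the restriction of y to the range 0 to 1 is an assumption that follows from a sigmoid output.

```python
def target_delta(y: float, candidate_is_better: bool) -> float:
    """Target residual: 1 - y if the candidate is higher quality, else y - 1."""
    if not 0.0 <= y <= 1.0:
        # Assumption: y is a sigmoid output and therefore lies in [0, 1].
        raise ValueError("y is expected to lie in [0, 1]")
    return 1.0 - y if candidate_is_better else y - 1.0


delta_higher = target_delta(0.25, candidate_is_better=True)   # 0.75
delta_lower = target_delta(0.25, candidate_is_better=False)   # -0.75
```

Under this assumption the target is never negative when the candidate sentence is of higher quality and never positive when it is of lower quality, which matches the sign convention described above.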
[0104] The residual score calculation model (M) is a learning-based evaluation model that numerically expresses the translation quality of candidate sentences by embedding and encoding the three input sentences (s, c, r) to extract semantic features and calculating the superiority relationship values of the candidate sentences.
[0105] Although the above description focused on an example of evaluating the quality of candidate sentences generated by a machine translation model, the residual score according to one embodiment can also be used to evaluate the quality of LLM response results. That is, by mapping the aforementioned machine translation module to a response generation module, candidate sentences to candidate responses, and reference sentences to reference responses, the residual score can be extended and applied to evaluate the quality of LLM response results in the same manner as described above. Redundant explanations are omitted.
[0106] The embodiments of the present invention described so far may be implemented as hardware components, software components, and / or combinations of hardware components and software components. For example, the devices, methods, and components described in the embodiments may be implemented using a general-purpose computer or a special-purpose computer, such as, for example, a processor, a controller, an arithmetic logic unit (ALU), a digital signal processor, a microcomputer, a field programmable gate array (FPGA), a programmable logic unit (PLU), a microprocessor, or any other device capable of executing and responding to instructions. The processing unit may execute an operating system (OS) and software applications executed on said operating system. Additionally, the processing unit may access, store, manipulate, process, and generate data in response to the execution of the software. For ease of understanding, the processing unit may be described as being used as a single unit, but those skilled in the art will understand that the processing unit may include a plurality of processing elements and / or a plurality of types of processing elements. For example, the processing unit may include a plurality of processors or one processor and one controller. In addition, other processing configurations, such as parallel processors, are also possible.
[0107] Software may include computer programs, code, instructions, or a combination of one or more of these, and may configure a processing unit to operate as desired or command the processing unit independently or collectively. Software and / or data may be permanently or temporarily embodied in any type of machine, component, physical device, virtual equipment, computer storage medium or device, or transmitted signal wave in order to be interpreted by the processing unit or to provide instructions or data to the processing unit. Software may be distributed over networked computer systems and may be stored or executed in a distributed manner. Software and data may be stored on computer-readable recording media.
[0108] Furthermore, embodiments of the present invention may be implemented in the form of program instructions that can be executed through various computer means and recorded on a computer-readable medium. The computer-readable medium may include program instructions, data files, data structures, etc., either alone or in combination. The program instructions recorded on the medium may be those specifically designed and configured for the present invention, or they may be those known and available to those skilled in the art of computer software. Examples of computer-readable recording media include magnetic media such as hard disks, floppy disks, and magnetic tapes; optical recording media such as CD-ROMs and DVDs; magneto-optical media such as floptical disks; and hardware devices configured to store and execute program instructions, such as ROM, RAM, and flash memory. Examples of program instructions include machine code, such as that generated by a compiler, as well as high-level language code that can be executed by a computer using an interpreter, etc. The hardware devices described above may be configured to operate as at least one software module to perform the operation of one embodiment of the present invention, and vice versa.
[0109] As described above, the present invention has been explained by specific details such as specific components, limited embodiments, and drawings; however, this is provided merely to aid in a more comprehensive understanding of the invention, and the invention is not limited to the above embodiments. A person skilled in the art to which the invention pertains can make various modifications and variations from this description. Therefore, the scope of the invention should not be limited to the described embodiments, and all things equivalent to or having equivalent variations to the claims set forth below, as well as the claims themselves, shall be considered to fall within the scope of the concept of the invention.
[0110]
[0111] [Explanation of Reference Numerals]
[0112] 100: Residual score-based machine translation sentence evaluation device (evaluation device)
[0113] 110: Processor
[0114] 130: Input/output unit
[0115] 140: Memory
[0116] 150: Communication unit
[0117] 200: Translation output unit
[0118] 210: Large-scale language model-based machine translation module
[0119] 221: Input unit
[0120] 222: Evaluation unit
[0121] 223: Output unit
[0122] 230: Residual score calculation module
[0123] M: Residual score calculation model
Claims
1. A residual score-based machine translation evaluation device comprising: an input unit that receives a source text; an evaluation unit that inputs the source text into a large-scale language model-based machine translation module to generate one or more candidate sentences, searches for a reference sentence corresponding to the source text, and inputs the source text, the candidate sentence, and the reference sentence into a residual score calculation module to calculate a superiority relationship value regarding the quality of the candidate sentence; an output unit that outputs the candidate sentence as a translation when, based on the superiority relationship value, the quality of the candidate sentence is higher than that of the reference sentence, and outputs the reference sentence as a translation when, based on the superiority relationship value, the quality of the candidate sentence is lower than that of the reference sentence; and the residual score calculation module, which encodes the source text, the candidate sentence, and the reference sentence, converts each into an embedding vector, inputs the embedding vectors into a predetermined layer, and applies an activation function to calculate the superiority relationship value.
2. The residual score-based machine translation evaluation device of claim 1, wherein the residual score calculation module outputs the superiority relationship value and outputs whether the quality of the candidate sentence is higher or lower.
3. The residual score-based machine translation evaluation device of claim 1, wherein the reference sentence is retrieved from among a plurality of previously stored reference sentences for the source text.
4. The residual score-based machine translation evaluation device of claim 1, wherein the residual score calculation module is a machine learning-based model trained using a training dataset including source texts, candidate sentences, reference sentences, and superiority relationship values.
5. The residual score-based machine translation evaluation device of claim 1, wherein the residual score calculation module encodes the source text, the candidate sentence, and the reference sentence to convert them into embedding vectors, inputs the embedding vectors into a fully connected layer, and calculates the superiority relationship value through a sigmoid activation function.
6. The residual score-based machine translation evaluation device of claim 5, wherein the superiority relationship values included in the training dataset are pre-labeled values.
7. The residual score-based machine translation evaluation device of claim 1, wherein the residual score calculation module calculates the residual score minus 1 as the superiority relationship value when the candidate sentence is of lower quality than the reference sentence, and calculates 1 minus the residual score as the superiority relationship value when the candidate sentence is of higher quality than the reference sentence.
8. A residual score-based machine translation evaluation method comprising the steps of: receiving, by a residual score-based machine translation evaluation device, a source text; inputting, by the residual score-based machine translation evaluation device, the source text into a large-scale language model-based machine translation module to generate one or more candidate sentences; searching, by the residual score-based machine translation evaluation device, for a reference sentence corresponding to the source text; inputting, by the residual score-based machine translation evaluation device, the source text, the candidate sentence, and the reference sentence into a residual score calculation model to calculate a superiority relationship value indicating whether the quality of the candidate sentence is higher than that of the reference sentence; and outputting, by the residual score-based machine translation evaluation device, as a translation of the source text, the candidate sentence when the quality of the candidate sentence is higher and the reference sentence when the quality of the candidate sentence is lower.
9. A computer program stored on a computer-readable storage medium to execute the method of claim 8 using a computer.
10. A residual score-based language model response evaluation device comprising: an input unit that receives a source text; an evaluation unit that inputs the source text into a large-scale language model-based response generation module to generate one or more candidate responses, searches for a reference response corresponding to the source text, and inputs the source text, the candidate response, and the reference response into a residual score calculation module to calculate a superiority relationship value regarding the quality of the candidate response; an output unit that outputs the candidate response as an output response when, based on the superiority relationship value, the quality of the candidate response is higher than that of the reference response, and outputs the reference response as an output response when, based on the superiority relationship value, the quality of the candidate response is lower than that of the reference response; and the residual score calculation module, which encodes the source text, the candidate response, and the reference response, converts each into an embedding vector, inputs the embedding vectors into a predetermined layer, and applies an activation function to calculate the superiority relationship value.