Training method and device of corpus classification model and corpus classification method and device
By using a pre-trained named entity recognition model to identify and replace entity phrases in similar question corpora, and training a corpus classification model, the problem of low corpus classification efficiency in FAQ systems is solved, achieving efficient and accurate corpus classification and reducing manpower costs and time investment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING YOUZHUJU NETWORK TECH CO LTD
- Filing Date
- 2021-11-26
- Publication Date
- 2026-06-12
AI Technical Summary
Existing FAQ systems are inefficient in corpus classification, failing to efficiently and accurately cover common user questions, resulting in poor user experience and high labor costs.
The pre-trained named entity recognition model identifies entity phrases in the sample data and replaces similar business entities with the same type in similar question corpora. The corpora with the replaced similar questions and standard question corpora are used to train the corpus classification model, reducing the need for manual annotation and improving the model training efficiency.
It reduced labor costs, shortened model training time, and improved the model's learning ability and classification accuracy.
Figure CN116186247B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to natural language processing using machine learning, and more specifically, to a method and apparatus for training a corpus classification model, as well as a corpus classification method and apparatus.
Background Technology
[0002] Frequently Asked Questions (FAQ), also known as "frequently asked questions and answers," are a primary means of providing online help. Frequently asked questions and their answers are organized in advance and published on web pages to offer consultation services to users.
[0003] FAQs are a common feature on many websites: they list frequently asked questions and serve as a form of online help. When using a website's functions or services, users often encounter seemingly simple problems that are difficult to resolve without explanation, and such overlooked details can even cause a site to lose customers. In many cases a simple explanation resolves these issues, which is the value of an FAQ. In online marketing, FAQs are regarded as a commonly used online customer service tool. A good FAQ system should answer at least 80% of users' general and frequently asked questions. This not only benefits users but also significantly reduces the workload of website staff, saves substantial customer service costs, and increases customer satisfaction. Therefore, an excellent website should prioritize the design of its FAQs.
[0004] Based on the above, there is an urgent need for a more efficient and accurate corpus classification method to improve the coverage of FAQs.
Summary of the Invention
[0005] This summary section is provided to briefly introduce the concepts, which will be described in detail in the detailed description section below. This summary section is not intended to identify key or essential features of the claimed technical solution, nor is it intended to limit the scope of the claimed technical solution.
[0006] In a first aspect, this disclosure provides a method for training a corpus classification model, comprising: acquiring sample data; the sample data including a standard question corpus and similar question corpus matching the standard question corpus, wherein the similar question corpus and the standard question corpus include at least one identical entity phrase; replacing the entity phrase in the similar question corpus in the sample data to obtain a replaced similar question corpus; obtaining target sample data based on the replaced similar question corpus and the standard question corpus; and training a corpus classification model based on the target sample data.
[0007] In a second aspect, this disclosure provides a corpus classification method, comprising: acquiring a corpus to be classified; and calling a corpus classification model to classify the corpus to be classified and obtain a classification result; wherein the corpus classification model is trained using the training method for the corpus classification model described in the first aspect.
[0008] In a third aspect, this disclosure provides a training apparatus for a corpus classification model, comprising: an acquisition module for acquiring sample data, the sample data including a standard question corpus and a similar question corpus matching the standard question corpus, wherein the similar question corpus and the standard question corpus include at least one identical entity phrase; a replacement module for replacing the entity phrase in the similar question corpus in the sample data to obtain a replaced similar question corpus; and a processing module for obtaining target sample data based on the replaced similar question corpus and the standard question corpus; the processing module is further configured to train a corpus classification model based on the target sample data.
[0009] In a fourth aspect, this disclosure provides a corpus classification device, comprising: a storage module for storing a corpus classification model, wherein the corpus classification model is obtained by the training method for the corpus classification model described in the first aspect; a corpus acquisition module for acquiring a corpus to be classified; and a calling module for calling the corpus classification model to classify the corpus to be classified and obtain a classification result.
[0010] In a fifth aspect, this disclosure provides a computer-readable medium having a computer program stored thereon, which, when executed by a processing device, implements the steps of the training method for the corpus classification model described in the first aspect.
[0011] In a sixth aspect, this disclosure provides a computer device, comprising: a storage device having a computer program stored thereon; and a processing device for executing the computer program in the storage device to implement the steps of the training method for the corpus classification model described in the first aspect.
[0012] In a seventh aspect, this disclosure provides a computer-readable medium having a computer program stored thereon, which, when executed by a processing device, implements the steps of the corpus classification method described in the second aspect.
[0013] In an eighth aspect, this disclosure provides a computer device, comprising: a storage device having a computer program stored thereon; and a processing device for executing the computer program in the storage device to implement the steps of the corpus classification method described in the second aspect.
[0014] In the above technical solution, a pre-trained named entity recognition model is used to identify entity phrases in the standard question corpora in the sample data, as well as entity phrases in the similar question corpora that match the standard question corpora. The entity phrases in the similar question corpora are replaced with business entities of the same type. The replaced similar question corpora and the standard question corpora are used to train and update the corpus classification model. This eliminates the need for manual annotation of large amounts of data, reduces labor costs, shortens the time required for model training, improves the efficiency of model training, and enhances the model's learning ability.
[0015] Other features and advantages of this disclosure will be described in detail in the following detailed description section.
Description of the Drawings
[0016] The above and other features, advantages, and aspects of the embodiments of this disclosure will become more apparent from the accompanying drawings and the following detailed description. Throughout the drawings, the same or similar reference numerals denote the same or similar elements. It should be understood that the drawings are schematic, and the components and elements are not necessarily drawn to scale. In the drawings:
[0017] Figure 1 is a schematic diagram of the structure of a computer system provided in an exemplary embodiment of this disclosure.
[0018] Figure 2 is a flowchart of a training method for a corpus classification model provided in an exemplary embodiment of this disclosure.
[0019] Figure 3 is a flowchart illustrating the sub-steps of step S102 as shown in an exemplary embodiment of this disclosure.
[0020] Figure 4 is a flowchart illustrating the sub-steps of step S103 as shown in an exemplary embodiment of this disclosure.
[0021] Figure 5 is a schematic diagram of an encoder-decoder network structure illustrated in an exemplary embodiment of this disclosure.
[0022] Figure 6 is a schematic diagram illustrating the structure of a corpus classification model according to an exemplary embodiment of this disclosure.
[0023] Figure 7 is a block diagram of a training apparatus for a corpus classification model, as illustrated in an exemplary embodiment of this disclosure.
[0024] Figure 8 is a block diagram of a corpus classification device illustrated in an exemplary embodiment of the present disclosure.
[0025] Figure 9 is a schematic diagram illustrating the structure of a computer device according to an exemplary embodiment of the present disclosure.
[0026] Explanation of reference numerals in the attached figures
[0027] 120 - Terminal; 140 - Server; 20 - Training device for corpus classification model; 201 - Acquisition module; 203 - Replacement module; 205 - Processing module; 30 - Corpus classification device; 301 - Storage module; 303 - Corpus acquisition module; 305 - Calling module; 600 - Computer equipment; 601 - Processing device; 602 - ROM; 603 - RAM; 604 - Bus; 605 - I/O interface; 606 - Input device; 607 - Output device; 608 - Storage device; 609 - Communication device.
Detailed Implementation
[0028] Embodiments of this disclosure will now be described in more detail with reference to the accompanying drawings. While some embodiments of this disclosure are shown in the drawings, it should be understood that this disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided to provide a more thorough and complete understanding of this disclosure. It should be understood that the accompanying drawings and embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of protection of this disclosure.
[0029] It should be understood that the steps described in the method embodiments of this disclosure may be performed in different orders and/or in parallel. Furthermore, the method embodiments may include additional steps and/or omit the steps shown. The scope of this disclosure is not limited in this respect.
[0030] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on".
[0031] It should be noted that the concepts of "first" and "second" mentioned in this disclosure are used only to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.
[0032] It should be noted that the terms "a" and "a plurality of" used in this disclosure are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".
[0033] The names of messages or information exchanged between multiple devices in the embodiments of this disclosure are for illustrative purposes only and are not intended to limit the scope of such messages or information.
[0034] Natural Language Processing (NLP) is an important field within computer science and artificial intelligence. It studies the theories and methods for enabling effective communication between humans and computers using natural language. NLP is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field involves natural language—the language people use in daily life—and thus it has a close relationship with linguistic research. NLP techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graphs.
[0035] Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithmic complexity theory. It specifically studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
[0036] Named Entity Recognition (NER) aims to identify named entities such as names of people, places, organizations, times, and proper nouns in a corpus, recognizing entities with specific meanings in text. It is a common task in Natural Language Processing (NLP) with a wide range of applications, playing a crucial foundational role in information extraction, syntactic analysis, and machine translation. NER requires identifying both entity boundaries and entity categories, such as names of people, places, organizations, or others. Therefore, the concept of an entity can be very broad; any specific text fragment required for business purposes can be considered an entity.
[0037] Figure 1 shows a schematic diagram of the structure of a computer system provided in an exemplary embodiment of the present disclosure. The computer system includes a terminal 120 and a server 140.
[0038] Terminal 120 and server 140 are connected to each other via wired or wireless network.
[0039] Terminal 120 may include at least one of smartphones, laptops, desktop computers, tablets, smart speakers, and smart robots.
[0040] Terminal 120 includes a display; the display is used to show the named entity recognition results.
[0041] Terminal 120 includes a first memory and a first processor. The first memory stores a first program; the first program is invoked and executed by the first processor to implement the training method for the corpus classification model or the corpus classification method. The first memory may include, but is not limited to, the following: Random Access Memory (RAM), Read Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM).
[0042] The first processor can consist of one or more integrated circuit chips. Optionally, the first processor can be a general-purpose processor, such as a Central Processing Unit (CPU) or a Network Processor (NP). Optionally, the first processor can implement the training method or corpus classification method of the corpus classification model provided in this disclosure by calling a pre-trained NER model. For example, the trained NER model in the terminal can be trained by the terminal itself; or it can be trained by the server and obtained by the terminal from the server.
[0043] Server 140 includes a second memory and a second processor. The second memory stores a second program, which is invoked by the second processor to implement the training method or corpus classification method of the corpus classification model provided in this disclosure. For example, the second memory stores a pre-trained NER model, which is invoked by the second processor to implement the training method or corpus classification method of the corpus classification model. Optionally, the second memory may include, but is not limited to, RAM, ROM, PROM, EPROM, and EEPROM. Optionally, the second processor may be a general-purpose processor, such as a CPU or NP.
[0044] The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. The terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, etc., but is not limited to these. The terminal and server can be connected directly or indirectly through wired or wireless communication, and this disclosure does not impose any restrictions.
[0045] As an illustration, the training method or classification method for the corpus classification model provided in this disclosure can be used for basic entity recognition tasks in various natural language processing tasks in fields such as healthcare, home management, current affairs, shopping, and recommendation. For example, in the healthcare field, it can be used for multiple basic tasks such as symptom recognition, body part recognition, disease recognition, and drug recognition.
[0046] Referring to Figure 2, Figure 2 is a flowchart illustrating a training method for a corpus classification model provided as an exemplary embodiment of the present disclosure. The method is performed by a computer device, for example, by the terminal or server in the computer system shown in Figure 1. The training method for the corpus classification model shown in Figure 2 includes the following steps:
[0047] In step S101, sample data is obtained.
[0048] The sample data includes multiple standard question corpora and similar question corpora matching the standard question corpora. Each similar question corpus shares at least one entity phrase with its standard question corpus. For example, if the standard question corpus is "Can a broken corner of my phone be repaired?", the matching similar question corpora could be "How to repair a broken corner of my phone", "How to fix a broken corner of my phone", "What to do if a corner of my phone is broken", or "How to repair a chipped corner of my phone", etc. The similar question corpora and the standard question corpus clearly share entity phrases such as "phone" and "repair".
[0049] For example, the standard question corpus Qstd is combined with at least one similar question corpus {Qsim_1, Qsim_2, Qsim_3…Qsim_n} to form a similar question set (Qstd, Qsim_n), and (Qstd, Qsim_n) is used as sample data. The similar question corpus is selected from the user input data.
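As a minimal sketch of how such sample data might be organized, the following Python snippet pairs one standard question with each of its similar questions. The function name `build_sample_data` and the tuple layout are illustrative assumptions and are not specified in this disclosure.

```python
from typing import List, Tuple


def build_sample_data(q_std: str, q_sims: List[str]) -> List[Tuple[str, str]]:
    """Pair one standard question with each matching similar question.

    Hypothetical helper: returns one (Qstd, Qsim_i) tuple per similar question.
    """
    return [(q_std, q_sim) for q_sim in q_sims]


q_std = "Can a broken corner of my phone be repaired?"
q_sims = [
    "How to repair a broken corner of my phone",
    "What to do if a corner of my phone is broken",
]
sample_data = build_sample_data(q_std, q_sims)
```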
[0050] In step S102, entity phrases in the similar question corpus in the sample data are replaced to obtain the replaced similar question corpus.
[0051] It should be noted that step S102 includes sub-steps S1021, S1022, and S1023. The specific replacement method for the similar question corpus is described in detail in these sub-steps. Referring to Figure 3, Figure 3 is a flowchart illustrating the sub-steps of step S102 as shown in an exemplary embodiment of this disclosure.
[0052] In sub-step S1021, a pre-trained named entity recognition model is invoked to identify the first entity phrase in the similar question corpus in the sample data and its corresponding business type.
[0053] The pre-trained named entity recognition model is invoked to identify the first entity phrase and its corresponding business type in similar question corpora. For example, for the input corpus "the corner glass of a mobile phone is broken and needs repair", the pre-trained named entity recognition model can identify that its first entity phrases include "mobile phone", "corner", "glass", "broken", and "repair". The business type corresponding to the first entity phrase can be the attribute of the entity phrase. For example, the attribute of "mobile phone" is a small handheld smart terminal, "corner" refers to the corner position, and "broken" and "repair" represent the actions of breaking and repairing, respectively.
[0054] It should be noted that the pre-trained named entity recognition model mentioned above can be designed using network structures such as LSTM-CRF network, BERT-CRF network, and BERT-fc-Softmax network.
[0055] In sub-step S1022, a second entity phrase with the same business type as the first entity phrase is obtained.
[0056] For example, for the user-input corpus "the corner glass of the mobile phone is broken and needs repair", the first entity phrases include "mobile phone", "corner", "glass", "broken" and "repair". The second entity phrases of the same business type as "mobile phone" can be "smartphone" or "tablet", etc., and the second entity phrases of the same business type as "repair" can be "fix", etc.
[0057] In sub-step S1023, the first entity phrase is replaced with the second entity phrase to obtain the similar question corpus after replacement.
[0058] The first entity phrase is replaced by the second entity phrase of the same business type obtained in sub-step S1022 to obtain the similar question corpus after replacement.
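Sub-steps S1021 to S1023 can be sketched roughly as follows. This is only an illustration under stated assumptions: a real pre-trained named entity recognition model is replaced here by a hypothetical lookup table (`ENTITY_LEXICON`), and the business type names and the candidate pool (`SAME_TYPE_PHRASES`) are invented for the example.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical stand-in for a pre-trained NER model: entity phrase -> business type.
ENTITY_LEXICON: Dict[str, str] = {
    "phone": "device",
    "repair": "service_action",
}

# Hypothetical pool of entity phrases grouped by business type.
SAME_TYPE_PHRASES: Dict[str, List[str]] = {
    "device": ["phone", "smartphone", "tablet"],
    "service_action": ["repair", "fix"],
}


def recognize_entities(text: str) -> List[Tuple[str, str]]:
    """S1021: return (first entity phrase, business type) pairs found in text."""
    found = []
    for phrase, business_type in ENTITY_LEXICON.items():
        if re.search(rf"\b{re.escape(phrase)}\b", text):
            found.append((phrase, business_type))
    return found


def replace_entities(text: str) -> List[str]:
    """S1022 and S1023: swap each first entity phrase for same-type phrases."""
    replaced = []
    for phrase, business_type in recognize_entities(text):
        for candidate in SAME_TYPE_PHRASES[business_type]:
            if candidate == phrase:
                continue  # the phrase itself would leave the text unchanged
            replaced.append(re.sub(rf"\b{re.escape(phrase)}\b", candidate, text))
    return replaced


similar_question = "How to repair a broken corner of my phone"
replaced_questions = replace_entities(similar_question)
```

Each replaced question differs from the original in exactly one entity phrase, which keeps the business type of every slot unchanged.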
[0059] In step S103, target sample data is obtained based on the replaced similar question corpus and the standard question corpus.
[0060] It should be noted that step S103 includes sub-steps S1031, S1032, and S1033. The specific method for obtaining the target sample data is described in detail in these sub-steps. Referring to Figure 4, Figure 4 is a flowchart illustrating the sub-steps of step S103 as shown in an exemplary embodiment of this disclosure.
[0061] In sub-step S1031, the replaced similar question corpus is concatenated with the standard question corpus to obtain negative sample data.
[0062] For example, the standard question corpus is "Can a broken corner of my phone be repaired?", and the matching similar question corpus is "How do I repair a broken corner of my phone?". The replaced similar question corpus could be "What should I do if a corner of my smartphone is broken?", "How do I repair a broken corner of my tablet?", "How do I repair a broken corner of my smartphone screen?", "How do I repair a broken corner of my smartphone?", "Repairing a broken corner of my phone screen", etc. The replaced similar question corpus is then concatenated with the corresponding standard question corpus to form negative sample data.
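A minimal sketch of the concatenation in sub-step S1031 is given below. The disclosure does not specify the concatenation order or any delimiter, so the `[SEP]` separator, the standard-question-first order, and the function name `build_negative_samples` are all assumptions made for illustration.

```python
from typing import List

SEPARATOR = " [SEP] "  # assumed delimiter; the disclosure does not specify one


def build_negative_samples(q_std: str, replaced_sims: List[str]) -> List[str]:
    """Concatenate each replaced similar question with its standard question."""
    return [q_std + SEPARATOR + q_sim for q_sim in replaced_sims]


standard_question = "Can a broken corner of my phone be repaired?"
negative_samples = build_negative_samples(
    standard_question,
    [
        "What should I do if a corner of my smartphone is broken?",
        "How do I repair a broken corner of my tablet?",
    ],
)
```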
[0063] In sub-step S1032, the encoder-decoder generation model is trained using the sample data and negative sample data to obtain the sample data to be labeled.
[0064] The sample data obtained in step S101 is used as positive sample data and, together with the negative sample data obtained in sub-step S1031, is used to train the encoder-decoder generation model.
[0065] Referring to Figure 5, Figure 5 is a schematic diagram of an encoder-decoder network structure illustrated in an exemplary embodiment of this disclosure.
[0066] The encoder-decoder generation model can adopt a seq2seq network structure, which is a type of encoder-decoder network structure. The encoder-decoder network structure shown in Figure 5 is based on the idea of using two recurrent neural networks (RNNs): one RNN as the encoder and the other as the decoder. The encoder is responsible for compressing the input corpus sequence X_1 X_2 ... X_T into a vector of a specified length, which can be regarded as the semantics of the corpus sequence. This process is called encoding. The simplest way to obtain the semantic vector is to directly use the hidden state of the last input as the semantic vector C. Alternatively, a transformation can be applied to the last hidden state to obtain the semantic vector, or a transformation can be applied to all the hidden states of the input corpus sequence to obtain the semantic vector. The decoder is responsible for generating the specified corpus sequence based on the semantic vector. This process is called decoding. As shown in the figure, the simplest way is to input the semantic vector obtained by the encoder into the decoder's RNN as the initial state to obtain the output corpus sequence Y_1 Y_2 ... Y_T. It can be seen that the output of the previous time step is used as the input of the current time step, and the semantic vector C participates in the calculation only as the initial state; subsequent calculations do not involve the semantic vector C.
[0067] The output corpus sequence Y_1 Y_2 ... Y_T is the sample data to be labeled.
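The encoding and decoding flow described above can be sketched in plain Python as follows. This is an untrained toy forward pass, not the model of this disclosure: the weights are random, the vocabulary size, hidden size, start token, and greedy decoding are assumptions, and all function names are hypothetical. It only illustrates that the last encoder hidden state serves as the semantic vector C, that C is used solely as the decoder's initial state, and that each output is fed back as the next input.

```python
import math
import random
from typing import Dict, List

Matrix = List[List[float]]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def one_hot(index: int, size: int) -> List[float]:
    return [1.0 if i == index else 0.0 for i in range(size)]


def rnn_step(w_in: Matrix, w_h: Matrix, x: List[float], h: List[float]) -> List[float]:
    """One vanilla RNN update: h' = tanh(W_in x + W_h h)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_in, x), matvec(w_h, h))]


def init_params(vocab: int, hidden: int, seed: int = 0) -> Dict[str, Matrix]:
    rng = random.Random(seed)
    return {
        "enc_in": rand_matrix(hidden, vocab, rng),
        "enc_h": rand_matrix(hidden, hidden, rng),
        "dec_in": rand_matrix(hidden, vocab, rng),
        "dec_h": rand_matrix(hidden, hidden, rng),
        "out": rand_matrix(vocab, hidden, rng),
    }


def encode(tokens: List[int], p: Dict[str, Matrix], vocab: int, hidden: int) -> List[float]:
    """Compress X_1 ... X_T into the last hidden state, used as semantic vector C."""
    h = [0.0] * hidden
    for t in tokens:
        h = rnn_step(p["enc_in"], p["enc_h"], one_hot(t, vocab), h)
    return h


def decode(c: List[float], p: Dict[str, Matrix], vocab: int, steps: int, start: int = 0) -> List[int]:
    """Generate Y_1 ... Y_T; C is only the initial state, each output feeds the next step."""
    h, prev, outputs = c, start, []
    for _ in range(steps):
        h = rnn_step(p["dec_in"], p["dec_h"], one_hot(prev, vocab), h)
        logits = matvec(p["out"], h)
        prev = max(range(vocab), key=lambda i: logits[i])  # greedy choice
        outputs.append(prev)
    return outputs


VOCAB, HIDDEN = 6, 4
params = init_params(VOCAB, HIDDEN)
semantic_vector = encode([1, 3, 2, 5], params, VOCAB, HIDDEN)
generated = decode(semantic_vector, params, VOCAB, steps=4)
```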
[0068] In sub-step S1033, target sample data is obtained based on the sample data to be labeled.
[0069] For example, the attributes of the sample data to be labeled can be determined based on manually added evaluation annotations. First, the manually added evaluation annotations of the sample data to be labeled are obtained. These annotations characterize the attributes of the sample data, such as whether the sentences are fluent, incoherent, or contain complete or incomplete business terms. Based on these annotations, the sample data to be labeled is classified, resulting in categorized labeled sample data. This categorized labeled sample data is then used as the target sample data.
[0070] In step S104, a corpus classification model is trained based on the target sample data.
[0071] The target sample data that is labeled as grammatically correct and contains complete business terms will be used as positive samples, and the remaining target sample data will be used as negative samples to train the corpus classification model.
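The split into positive and negative samples might look like the following sketch. The `LabeledSample` fields are hypothetical names for the manually added evaluation annotations (fluency and completeness of business terms); the disclosure does not prescribe a data layout.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LabeledSample:
    text: str
    is_fluent: bool            # manual annotation: the sentence reads fluently
    has_complete_terms: bool   # manual annotation: business terms are complete


def split_target_samples(samples: List[LabeledSample]) -> Tuple[List[str], List[str]]:
    """Positive: fluent AND complete business terms; everything else is negative."""
    positives = [s.text for s in samples if s.is_fluent and s.has_complete_terms]
    negatives = [s.text for s in samples if not (s.is_fluent and s.has_complete_terms)]
    return positives, negatives


labeled = [
    LabeledSample("How do I repair a broken corner of my tablet?", True, True),
    LabeledSample("corner repair tablet broken how", False, True),
    LabeledSample("How do I repair a broken corner of my?", True, False),
]
positive_samples, negative_samples = split_target_samples(labeled)
```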
[0072] The parameters of the corpus classification model are updated based on the positive and negative sample data mentioned above. The updated parameters are then used to perform the step of identifying the target sample data again until the output of the corpus classification model meets the preset training conditions, thus obtaining the corpus classification model. The preset training conditions can be that the accuracy of the corpus classification results reaches a certain threshold, which can be obtained based on experience.
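The iterate-until-threshold training loop can be illustrated with the sketch below. The disclosure does not fix the classifier architecture in this passage, so a simple bag-of-words perceptron is swapped in purely as a stand-in; the threshold value, the epoch limit, the toy data, and all function names are assumptions.

```python
from typing import Dict, List, Tuple

Dataset = List[Tuple[str, int]]


def featurize(text: str) -> List[str]:
    """Bag-of-words features; an assumed, deliberately simple representation."""
    return text.lower().split()


def predict(weights: Dict[str, float], bias: float, text: str) -> int:
    score = bias + sum(weights.get(tok, 0.0) for tok in featurize(text))
    return 1 if score > 0 else 0


def accuracy(weights: Dict[str, float], bias: float, data: Dataset) -> float:
    return sum(predict(weights, bias, t) == y for t, y in data) / len(data)


def train(data: Dataset, threshold: float = 0.95, max_epochs: int = 100):
    """Update parameters until accuracy reaches the preset threshold."""
    weights: Dict[str, float] = {}
    bias = 0.0
    for _ in range(max_epochs):
        if accuracy(weights, bias, data) >= threshold:
            break  # preset training condition met
        for text, label in data:
            error = label - predict(weights, bias, text)
            if error:
                bias += error
                for tok in featurize(text):
                    weights[tok] = weights.get(tok, 0.0) + error
    return weights, bias


# Toy data: 1 = positive sample, 0 = negative sample.
training_data: Dataset = [
    ("how do i repair a broken corner of my tablet", 1),
    ("what should i do if a corner of my smartphone is broken", 1),
    ("corner repair tablet broken how", 0),
    ("smartphone of corner the", 0),
]
weights, bias = train(training_data)
result = predict(weights, bias, "how do i repair my tablet")
```

In this sketch the accuracy is measured on the training data itself for brevity; a held-out evaluation set would normally be used to decide when the preset condition is met.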
[0073] For example, this disclosure also provides a corpus classification method, which can be performed by a computer device, such as the terminal or server shown in Figure 1. The method includes the following steps:
[0074] Obtain the corpus to be classified;
[0075] The corpus classification model is invoked to classify the corpus to be classified, and the classification result is obtained. The corpus classification model is trained using the training method of the corpus classification model described above.
[0076] In summary, this disclosure uses a pre-trained named entity recognition model to identify entity phrases in the standard question corpora in the sample data, as well as entity phrases in the similar question corpora that match the standard question corpora. The entity phrases in the similar question corpora are then replaced with business entities of the same type. The replaced similar question corpora and the standard question corpora are used to train and update the corpus classification model. This eliminates the need for manual annotation of large amounts of data, reducing labor costs, shortening the time required for model training, improving model training efficiency, and enhancing the model's learning ability.
[0077] Figure 7 is a block diagram of a training apparatus for a corpus classification model according to an exemplary embodiment of this disclosure. Referring to Figure 7, the device 20 includes an acquisition module 201, a replacement module 203, and a processing module 205.
[0078] The acquisition module 201 is used to acquire sample data; the sample data includes standard question corpus and similar question corpus matching the standard question corpus, wherein the similar question corpus and the standard question corpus include at least one identical entity phrase.
[0079] The replacement module 203 is used to replace the entity phrases in the similar question corpus in the sample data to obtain the replaced similar question corpus.
[0080] Processing module 205 is used to obtain target sample data based on the replaced similar question corpus and the standard question corpus.
[0081] The processing module 205 is also used to train a corpus classification model based on the target sample data.
[0082] Optionally, the replacement module 203 further includes:
[0083] A calling submodule, used to call a pre-trained named entity recognition model to identify the first entity phrase in the similar question corpus in the sample data and its corresponding business type.
[0084] The entity acquisition submodule is used to acquire a second entity phrase that has the same business type as the first entity phrase.
[0085] The entity replacement submodule is used to replace the first entity phrase with the second entity phrase to obtain a similar question corpus after replacement.
[0086] Optionally, the processing module 205 is further configured to concatenate the replaced similar question corpus with the standard question corpus to obtain negative sample data;
[0087] It is also used to train an encoder-decoder generation model using the sample data and the negative sample data to obtain the sample data to be labeled; wherein the sample data is used as positive sample data;
[0088] It is also used to obtain the target sample data based on the sample data to be labeled.
[0089] Optionally, the processing module 205 is further configured to obtain manually evaluated annotations of the sample data to be labeled; the manually evaluated annotations are used to characterize the attributes of the sample data to be labeled.
[0090] It is also used to classify the sample data to be labeled based on the manual evaluation and annotation, so as to obtain the labeled sample data after classification;
[0091] It is also used to use the classified labeled sample data as the target sample data.
[0092] Optionally, the processing module 205 is further configured to use the target sample data labeled as grammatically correct and containing complete business terms as positive samples, and the remaining target sample data as negative samples, to train the corpus classification model.
[0093] Figure 8 is a block diagram illustrating a corpus classification apparatus according to an exemplary embodiment of this disclosure. Referring to Figure 8, the device 30 includes a storage module 301, a corpus acquisition module 303, and a calling module 305.
[0094] The storage module 301 is used to store the corpus classification model; the corpus classification model is trained using the aforementioned training method for the corpus classification model.
[0095] Corpus acquisition module 303 is used to acquire corpus to be classified;
[0096] The calling module 305 is used to call the corpus classification model to classify the corpus to be classified and obtain the classification result.
[0097] The following is for reference. Figure 9 It illustrates a computer device suitable for implementing embodiments of the present disclosure (e.g., Figure 1 The diagram below shows the structure of the terminal device or server 600. The terminal device in this embodiment may include, but is not limited to, mobile terminals such as mobile phones, laptops, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle terminals (e.g., vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. Figure 9 The computer device shown is merely an example and should not be construed as limiting the functionality and scope of the embodiments disclosed herein.
[0098] As shown in Figure 9, computer device 600 may include a processing device (e.g., a central processing unit, a graphics processing unit, etc.) 601, which can perform various appropriate actions and processes according to a program stored in read-only memory (ROM) 602 or a program loaded from storage device 608 into random access memory (RAM) 603. The RAM 603 also stores various programs and data required for the operation of computer device 600. The processing device 601, ROM 602, and RAM 603 are interconnected via bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
[0099] Typically, the following devices can be connected to I/O interface 605: input devices 606 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 607 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 608 including, for example, magnetic tapes, hard disks, etc.; and communication devices 609. Communication device 609 allows computer device 600 to communicate wirelessly or by wire with other devices to exchange data. Although Figure 9 shows a computer device 600 with various devices, it should be understood that not all of the devices shown are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
[0100] In particular, according to embodiments of this disclosure, the processes described above with reference to the flowcharts can be implemented as computer software programs. For example, embodiments of this disclosure include a computer program product comprising a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program can be downloaded and installed from a network via a communication device 609, or installed from a storage device 608, or installed from a ROM 602. When the computer program is executed by the processing device 601, it performs the functions defined in the methods of embodiments of this disclosure.
[0101] It should be noted that the computer-readable medium described in this disclosure can be a computer-readable signal medium or a computer-readable storage medium, or any combination thereof. A computer-readable storage medium can be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of a computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this disclosure, a computer-readable storage medium can be any tangible medium containing or storing a program that can be used by or in connection with an instruction execution system, apparatus, or device. In this disclosure, a computer-readable signal medium can include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals can take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. A computer-readable signal medium can be any computer-readable medium other than a computer-readable storage medium, which can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium can be transmitted using any suitable medium, including but not limited to: wires, optical fibers, RF (radio frequency), etc., or any suitable combination thereof.
[0102] In some implementations, terminals and servers can communicate using any currently known or future-developed network protocol, such as HTTP (Hypertext Transfer Protocol), and can be interconnected with digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed networks.
[0103] The aforementioned computer-readable medium may be included in the aforementioned computer device, or it may exist independently without being assembled into the computer device.
[0104] The aforementioned computer-readable medium carries one or more programs, which, when executed by the computer device, cause the computer device to: acquire sample data; the sample data includes a standard question corpus and similar question corpus matching the standard question corpus, wherein the similar question corpus and the standard question corpus include at least one identical entity phrase; replace the entity phrase in the similar question corpus in the sample data to obtain a replaced similar question corpus; obtain target sample data based on the replaced similar question corpus and the standard question corpus; and train a corpus classification model based on the target sample data.
[0105] Alternatively, the aforementioned computer-readable medium carries one or more programs that, when executed by the computer device, cause the computer device to: invoke a corpus classification model to classify the corpus to be classified; the corpus classification model is trained using the aforementioned training method for the corpus classification model.
[0106] Computer program code for performing the operations of this disclosure can be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0107] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0108] The modules described in the embodiments of this disclosure can be implemented in software or hardware. The names of the modules are not, in some cases, intended to limit the functionality of the module itself.
[0109] The functions described above in this document can be performed, at least in part, by one or more hardware logic components. For example, exemplary types of hardware logic components that can be used, without limitation, include: Field Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems-on-Chip (SoCs), Complex Programmable Logic Devices (CPLDs), and so on.
[0110] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
[0111] According to one or more embodiments of this disclosure, Example 1 provides a method for training a corpus classification model, including:
[0112] Obtain sample data; the sample data includes standard question corpus and similar question corpus matching the standard question corpus, wherein the similar question corpus and the standard question corpus include at least one identical entity phrase;
[0113] Replace the entity phrases in the similar question corpus in the sample data to obtain the replaced similar question corpus;
[0114] Target sample data is obtained based on the replaced similar question corpus and the standard question corpus;
[0115] A corpus classification model is trained based on the target sample data.
[0116] According to one or more embodiments of this disclosure, Example 2 provides the method of Example 1, wherein the step of replacing the entity phrase in the similar question corpus in the sample data to obtain the replaced similar question corpus includes:
[0117] The pre-trained named entity recognition model is invoked to identify the first entity phrase in the similar question corpus in the sample data and the business type corresponding to the first entity phrase;
[0118] Obtain a second entity phrase with the same business type as the first entity phrase;
[0119] The first entity phrase is replaced by the second entity phrase to obtain the replaced similar question corpus.
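The replacement steps of Example 2 can be sketched as follows. This is only an illustration under stated assumptions: the lexicon `ENTITY_LEXICON`, the business type names, and the dictionary lookup in `recognize_entity` are hypothetical stand-ins for the pre-trained named entity recognition model, whose architecture the disclosure does not tie to this form.

```python
import random
from typing import Dict, List, Optional, Tuple

# Hypothetical lexicon: business type -> entity phrases of that type.
ENTITY_LEXICON: Dict[str, List[str]] = {
    "payment_product": ["credit card", "debit card", "gift card"],
    "account_feature": ["password", "username"],
}


def recognize_entity(text: str) -> Optional[Tuple[str, str]]:
    """Stand-in for the NER model: returns (entity phrase, business type)."""
    for business_type, phrases in ENTITY_LEXICON.items():
        for phrase in phrases:
            if phrase in text:
                return phrase, business_type
    return None


def replace_entity(similar_question: str, rng: random.Random) -> str:
    """Replace the first entity phrase with a second one of the same business type."""
    found = recognize_entity(similar_question)
    if found is None:
        return similar_question  # nothing recognized, leave the text unchanged
    first_entity, business_type = found
    candidates = [p for p in ENTITY_LEXICON[business_type] if p != first_entity]
    if not candidates:
        return similar_question
    second_entity = rng.choice(candidates)
    return similar_question.replace(first_entity, second_entity, 1)


replaced = replace_entity("How do I cancel my credit card?", random.Random(0))
```

Because the second entity phrase shares the business type of the first, the replaced sentence stays well formed while no longer matching the standard question, which is what makes it usable as negative material in the later steps.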
[0120] According to one or more embodiments of this disclosure, Example 3 provides the method of Example 1, wherein the step of obtaining target sample data based on the replaced similar question corpus and the standard question corpus includes:
[0121] The replaced similar question corpus is concatenated with the standard question corpus to obtain negative sample data;
[0122] The encoder-decoder generation model is trained using the sample data and the negative sample data to obtain the sample data to be labeled; wherein, the sample data is used as positive sample data.
[0123] The target sample data is obtained based on the sample data to be labeled.
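The construction of positive and negative sample data in Example 3 can be sketched as below. The separator `SEP`, the record layout, and the choice to also concatenate the original pair for the positive samples are assumptions made for illustration; the disclosure states only that the replaced similar question corpus is concatenated with the standard question corpus. Training of the encoder-decoder generation model itself is omitted.

```python
from typing import List, NamedTuple, Tuple

SEP = " [SEP] "  # assumed separator; the disclosure does not specify one


class LabeledText(NamedTuple):
    text: str
    label: int  # 1 = positive sample, 0 = negative sample


def build_generation_samples(
    records: List[Tuple[str, str, str]],
) -> List[LabeledText]:
    """Each record is (standard question, similar question, replaced similar question)."""
    samples: List[LabeledText] = []
    for standard, similar, replaced in records:
        # The original sample data serves as positive sample data.
        samples.append(LabeledText(similar + SEP + standard, 1))
        # The replaced similar question concatenated with the standard question
        # serves as negative sample data; skip it if nothing was replaced.
        if replaced != similar:
            samples.append(LabeledText(replaced + SEP + standard, 0))
    return samples


samples = build_generation_samples([
    (
        "How do I cancel a credit card?",
        "I want to cancel my credit card",
        "I want to cancel my gift card",
    ),
])
```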
[0124] According to one or more embodiments of this disclosure, Example 4 provides the method of Example 3, wherein the step of obtaining the target sample data based on the sample data to be labeled includes:
[0125] Obtain manually evaluated annotations for the sample data to be labeled; the manually evaluated annotations are used to characterize the attributes of the sample data to be labeled.
[0126] The sample data to be labeled is classified based on the manually evaluated annotations to obtain classified labeled sample data.
[0127] The classified labeled sample data is used as the target sample data.
[0128] According to one or more embodiments of this disclosure, Example 5 provides the method of Example 4, wherein the manually evaluated annotations include a marker indicating whether the sentence is fluent and a marker indicating whether the business terms are complete;
[0129] The step of training the corpus classification model based on the target sample data includes:
[0130] The target sample data labeled as grammatically correct and containing complete business terms are used as positive samples, and the remaining target sample data are used as negative samples to train the corpus classification model.
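The labeling rule of Example 5 can be sketched as follows. The `AnnotatedSample` type and its field names are hypothetical; they simply carry the two markers described above, and the classifier training step is left out.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AnnotatedSample:
    text: str
    is_fluent: bool        # marker: is the sentence fluent?
    terms_complete: bool   # marker: are the business terms complete?


def to_training_examples(samples: List[AnnotatedSample]) -> List[Tuple[str, int]]:
    """Positive (1) only when both markers are true; everything else is negative (0)."""
    return [
        (s.text, 1 if (s.is_fluent and s.terms_complete) else 0)
        for s in samples
    ]


examples = to_training_examples([
    AnnotatedSample("How can I cancel my credit card?", True, True),
    AnnotatedSample("cancel card how credit", False, True),
    AnnotatedSample("How can I cancel my?", True, False),
])
```

Requiring both markers to hold keeps the positive class limited to questions that read naturally and still name the full business term.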
[0131] According to one or more embodiments of this disclosure, Example 6 provides a corpus classification method, including:
[0132] Obtain the corpus to be classified;
[0133] The corpus classification model is invoked to classify the corpus to be classified, and the classification result is obtained; the corpus classification model is trained using any of the methods described in Examples 1 to 5.
[0134] The above description is merely a preferred embodiment of this disclosure and an explanation of the technical principles employed. Those skilled in the art should understand that the scope of this disclosure is not limited to technical solutions formed by specific combinations of the above-described technical features, but should also cover other technical solutions formed by arbitrary combinations of the above-described technical features or their equivalents without departing from the above-described concept. For example, this includes technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) this disclosure.
[0135] Furthermore, while the operations are described in a specific order, this should not be construed as requiring these operations to be performed in the specific order shown or in a sequential order. In certain environments, multitasking and parallel processing may be advantageous. Similarly, while several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of this disclosure. Certain features described in the context of individual embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented individually or in any suitable sub-combination in multiple embodiments.
[0136] Although the subject matter has been described using language specific to structural features and / or methodological logic, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely illustrative forms of implementing the claims. Regarding the apparatus in the above embodiments, the specific manner in which the various modules perform their operations has been described in detail in the embodiments relating to the method, and will not be elaborated upon here.
Claims
1. A training method for a corpus classification model, the method being applied to question answering, characterized in that the method comprises: obtaining sample data, wherein the sample data includes a standard question corpus and a similar question corpus matching the standard question corpus, and the similar question corpus and the standard question corpus include at least one identical entity phrase; invoking a pre-trained named entity recognition model to identify a first entity phrase in the similar question corpus in the sample data and the business type corresponding to the first entity phrase, wherein the business type is the attribute corresponding to the first entity phrase; obtaining a second entity phrase with the same business type as the first entity phrase; replacing the first entity phrase with the second entity phrase to obtain a replaced similar question corpus; concatenating the replaced similar question corpus with the standard question corpus to obtain negative sample data; training an encoder-decoder generation model using the sample data and the negative sample data to obtain sample data to be labeled, wherein the sample data is used as positive sample data; obtaining target sample data based on the sample data to be labeled; and training a corpus classification model based on the target sample data.
2. The method according to claim 1, characterized in that the step of obtaining the target sample data based on the sample data to be labeled includes: obtaining manually evaluated annotations for the sample data to be labeled, wherein the manually evaluated annotations are used to characterize the attributes of the sample data to be labeled; classifying the sample data to be labeled based on the manually evaluated annotations to obtain classified labeled sample data; and using the classified labeled sample data as the target sample data.
3. The method according to claim 2, characterized in that the manually evaluated annotations include a marker indicating whether the sentence is fluent and a marker indicating whether the business terms are complete; and the step of training the corpus classification model based on the target sample data includes: using the target sample data labeled as grammatically correct and containing complete business terms as positive samples, and the remaining target sample data as negative samples, to train the corpus classification model.
4. A corpus classification method, characterized in that the method comprises: obtaining the corpus to be classified; and invoking a corpus classification model to classify the corpus to be classified and obtain the classification result, wherein the corpus classification model is trained using the method described in any one of claims 1 to 3.
5. A training device for a corpus classification model, characterized in that the device comprises: an acquisition module used to acquire sample data, wherein the sample data includes a standard question corpus and a similar question corpus matching the standard question corpus, and the similar question corpus and the standard question corpus include at least one identical entity phrase; a replacement module used to call a pre-trained named entity recognition model to identify a first entity phrase in the similar question corpus in the sample data and the business type corresponding to the first entity phrase, wherein the business type is the attribute corresponding to the first entity phrase, to obtain a second entity phrase with the same business type as the first entity phrase, and to replace the first entity phrase with the second entity phrase to obtain a replaced similar question corpus; and a processing module used to concatenate the replaced similar question corpus with the standard question corpus to obtain negative sample data, to train an encoder-decoder generation model using the sample data and the negative sample data to obtain sample data to be labeled, wherein the sample data is used as positive sample data, and to obtain target sample data based on the sample data to be labeled; wherein the processing module is also used to train a corpus classification model based on the target sample data.
6. A corpus classification device, characterized in that the device comprises: a storage module used to store a corpus classification model, wherein the corpus classification model is trained using the method described in any one of claims 1 to 3; a corpus acquisition module used to acquire the corpus to be classified; and a calling module used to call the corpus classification model to classify the corpus to be classified and obtain the classification result.
7. A computer-readable medium having a computer program stored thereon, characterized in that, when the program is executed by a processing device, it implements the steps of the method according to any one of claims 1 to 3.
8. A computer device, characterized in that the computer device comprises: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device to implement the steps of the method according to any one of claims 1 to 3.
9. A computer-readable medium having a computer program stored thereon, characterized in that, when the program is executed by a processing device, it implements the steps of the method of claim 4.
10. A computer device, characterized in that the computer device comprises: a storage device on which a computer program is stored; and a processing device for executing the computer program in the storage device to implement the steps of the method of claim 4.