Entity word recognition method and device, electronic equipment, medium and program product
A label determination model generates, for each character, a label consisting of a marker and an encoding, and the characters are combined directly according to these labels to identify entity words. This addresses the high complexity of recognizing nested and discontinuous entity words in existing technologies and achieves efficient recognition with low complexity.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING DAJIA INTERNET INFORMATION TECH CO LTD
- Filing Date
- 2022-08-19
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies have high computational complexity when identifying nested and discontinuous entity words, making it difficult to accurately identify various forms of entity words with lower complexity.
A label determination model is used to generate a label for each character in the text to be recognized. The label includes a marker and an encoding. Characters are directly combined by the marker and the encoding to identify entity words, reducing computational complexity.
It achieves accurate recognition of flat, nested, and discontinuous entity words with low computational complexity, reducing computational load and improving recognition efficiency.
Figure CN115481634B_ABST
Description
Technical Field
[0001] This disclosure relates to the field of computer application technology, and in particular to a method, apparatus, electronic device, medium, and program product for recognizing entity words.
Background Technology
[0002] Named entity recognition is an important task in natural language processing, which refers to the identification of entity words such as names of people, places, organizations, and times in text. Entity words exist in three forms: flat entity words, nested entity words, and discontinuous entity words.
[0003] To identify nested and discontinuous entity words, related entity word recognition technologies must, after classifying each character in the text, also determine the relationship between each character and all other characters, which is highly complex. For example, they must determine whether a following character is the next character of the entity word to which a preceding character belongs, and whether two characters are the beginning and end of the same entity word. Therefore, how to identify various forms of entity words with lower computational complexity is a pressing technical problem.
Summary of the Invention
[0004] To overcome the problems existing in related technologies, this disclosure provides a method, apparatus, electronic device, medium, and program product for entity word recognition. The technical solution of this disclosure is as follows:
[0005] According to a first aspect of the present disclosure, a method for recognizing entity words is provided, comprising:
[0006] The text to be identified is input into the label determination model to obtain a label for each character in the text to be identified. The label includes a marker and an encoding. The marker represents the position of the character in the entity word. Characters with the same encoding belong to the same entity word.
[0007] The characters are combined according to their labels to obtain the target entity word.
[0008] Optionally, the step of combining the characters according to their labels to obtain the target entity word includes:
[0009] Characters with the same encoding in the text to be identified are identified as characters of the same entity word;
[0010] The characters are combined into the same entity word according to the positions represented by the markers of the characters.
[0011] Optionally, the marker includes an intermediate marker that represents the middle position of the character in the entity word;
[0012] If, among the characters of the same entity word, there is more than one middle character with the intermediate marker, the method further includes:
[0013] Obtain the sequential order of the multiple intermediate characters in the text to be recognized;
[0014] The order of the multiple middle characters in the text to be identified is determined as the order of the multiple middle characters in the same entity word;
[0015] The positions of the multiple middle characters in the same entity word are obtained based on their order within the word.
[0016] Optionally, the marker includes a non-entity marker, which indicates that the character is not located at any position of any entity word.
[0017] Optionally, the label determination model is a model trained based on text samples, wherein the text samples contain entity word samples, each character sample in the text samples carries the marker and the encoding, and the entity word samples include nested entity word samples and discontinuous entity word samples.
[0018] Optionally, the label determination model is trained according to the following steps:
[0019] Obtain the text sample;
[0020] The text samples are input into the base model to obtain the predicted marker and predicted encoding for each character sample;
[0021] The cross-entropy loss function value is determined based on the difference between the marker carried by each character sample and the predicted marker, and the difference between the encoding carried by each character sample and the predicted encoding.
[0022] The base model is trained based on the cross-entropy loss function value until the preset training termination condition is met, thus obtaining the label determination model.
[0023] According to a second aspect of the present disclosure, an entity word recognition device is provided, comprising:
[0024] The tag acquisition module is configured to input the text to be identified into the label determination model to obtain the label of each character in the text to be identified. The label includes a marker and an encoding. The marker represents the position of the character in the entity word. Characters with the same encoding belong to the same entity word.
[0025] The entity word acquisition module is configured to combine the characters according to their labels to obtain the target entity word.
[0026] According to a third aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the entity word recognition method as described in the first aspect.
[0027] According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided, wherein when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the entity word recognition method as described in the first aspect.
[0028] According to a fifth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the entity word recognition method as described in the first aspect.
[0029] The technical solutions provided by the embodiments of this disclosure may include the following beneficial effects:
[0030] In this disclosure, the label determination model generates a label for each character in the text to be identified, which includes a marker and an encoding. The marker can represent the position of the character in an entity word, and characters with the same encoding belong to the same entity word. Therefore, characters can be directly combined based on their labels to obtain various forms of target entity words. Thus, there is no need to determine the relationship between each character and all other characters; the character's own label is sufficient to represent whether each character belongs to the same entity word and to represent the character's position in the entity word. Therefore, the computational complexity is low.
[0031] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0032] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with the invention and, together with the description, serve to explain the principles of the invention.
[0033] Figure 1 is a flowchart illustrating an entity word recognition method according to an exemplary embodiment;
[0034] Figure 2 is a schematic diagram illustrating the labels of multiple characters according to an exemplary embodiment;
[0035] Figure 3 is a block diagram illustrating an entity word recognition device according to an exemplary embodiment;
[0036] Figure 4 is a block diagram illustrating an entity word recognition device according to an exemplary embodiment;
[0037] Figure 5 is a block diagram illustrating an entity word recognition device according to an exemplary embodiment.
Detailed Implementation
[0038] To enable those skilled in the art to better understand the technical solutions of this disclosure, the technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings.
[0039] It should be noted that the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.
[0040] Flat entity words are entity words composed of adjacent characters, and each flat entity word contains only one entity word. Nested entity words are entity words composed of adjacent characters, and each nested entity word contains more than one entity word. For example, in the nested entity word "strawberry milk," "strawberry" can be a single entity word, and "milk" can also be a single entity word. However, in practical applications, "strawberry milk" is treated as an indivisible nested entity word. Splitting it into "strawberry" and "milk" for separate processing may lead to inaccurate results. Discontinuous entity words are entity words composed of non-adjacent characters. For example, "muscle soreness" in the text "muscle fatigue and soreness" is a discontinuous entity word.
[0041] There are three main entity word recognition methods in related technologies: sequence labeling, span-based methods, and word-word relation classification. The encoders of these three methods are basically the same; the difference lies in the decoding method. For a sentence of length n, decoding with the sequence labeling method has a computational complexity of n, but it cannot recognize nested or discontinuous entity words. Decoding with the span-based method has a computational complexity of 2n, but it cannot recognize discontinuous entity words. The word-word relation classification method can recognize nested and discontinuous entity words, but it requires determining the relationship between each character and all other characters in the text, so its decoding complexity is n squared. This disclosure proposes a new labeling form (marker + encoding) that recognizes entity words of various forms without increasing decoding complexity.
[0042] Figure 1 is a flowchart illustrating an entity word recognition method according to an exemplary embodiment. As shown in Figure 1, this entity word recognition method can be used in electronic devices such as computers, mobile phones, tablets, and wearable devices, and includes the following steps.
[0043] In step S11, the text to be identified is input into the label determination model to obtain the label of each character in the text to be identified. The label includes a marker and an encoding. The marker represents the position of the character in the entity word. Characters with the same encoding belong to the same entity word.
[0044] In step S12, the characters are combined according to their labels to obtain the target entity word.
[0045] The label determination model is trained to extract the semantic representation vector of the input text to be identified and output a label for each character in the text. Characters can be separated by spaces. For example, in Chinese, a character refers to a single Chinese character or a punctuation mark; in English, a character refers to a word or a punctuation mark. Although a word contains multiple letters, it is not segmented.
[0046] A label can include only a marker, or it can include both a marker and an encoding. Characters with the same encoding belong to the same entity word. A character may carry multiple encodings, and such a character may belong to multiple entity words. If multiple characters carry the same encoding, then those characters belong to the same entity word. For example, if character A carries both encoding 1 and encoding 2, character B carries encoding 1, and character C carries encoding 2, then characters A and B are characters of the same entity word (the entity word corresponding to encoding 1), and characters A and C are also characters of the same entity word (the entity word corresponding to encoding 2).
[0047] A marker represents the position of a character within an entity word. Markers can include start markers, middle markers, end markers, and non-entity markers. Specifically, a character carrying a start marker indicates that it is at the beginning of the entity word it forms; a character carrying a middle marker indicates that it is in the middle of the entity word; a character carrying an end marker indicates that it is at the end of the entity word; and a character carrying a non-entity marker indicates that it is not in any position belonging to any entity word. In the embodiments of this disclosure, "B" is used as the start marker, "I" as the middle marker, "E" as the end marker, "O" as the non-entity marker, and numbers as the encoding. Optionally, various markers and encodings may have other forms, which this disclosure does not limit.
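As a rough illustration of the marker-plus-encoding notation, the following sketch splits a label string such as "B1E2" into (marker, encoding) pairs. The string serialization and the name `parse_label` are assumptions made for illustration; the disclosure does not fix a concrete representation.

```python
import re

# Assumed serialization: one or more marker letters (B, I, E), each followed by a
# numeric encoding, or a bare "O" for a non-entity character.
_VALID = re.compile(r"(?:[BIE]\d+)+")
_PAIR = re.compile(r"([BIE])(\d+)")


def parse_label(label: str) -> list[tuple[str, int]]:
    """Split a label such as "B1E2" into [("B", 1), ("E", 2)]; "O" yields []."""
    if label == "O":
        return []  # non-entity characters carry no encoding
    if not _VALID.fullmatch(label):
        raise ValueError(f"malformed label: {label!r}")
    return [(marker, int(code)) for marker, code in _PAIR.findall(label)]
```

Returning an empty list for "O" lets later steps skip non-entity characters without a special case.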
[0048] Characters carrying the non-entity marker do not belong to any entity word, and their labels may omit the encoding (equivalently, the encoding they carry can be regarded as empty). Such characters can therefore be excluded directly when generating entity words, so that only characters carrying other markers need to be processed, which reduces the computational load.
[0049] A character may belong to multiple entity words at the same time. For example, in "muscle fatigue and soreness", the word "muscle" belongs to both the nested entity word "muscle fatigue" and the discontinuous entity word "muscle soreness". Therefore, a character can carry multiple labels, each of which represents the character's relationship to one entity word.
[0050] For example, if a character includes both label B1 and label E2, then that character is both the first character in the entity word corresponding to code 1 and the last character in the entity word corresponding to code 2.
[0051] After the label of each character is obtained, characters with the same encoding can be combined according to the positions indicated by their markers to obtain the target entity word. The target entity word can be a flat entity word, a nested entity word, or a discontinuous entity word.
[0052] For example, if the labels of five characters are B1B2, I1, E1, I2, and E2, then the characters carrying B1, I1, and E1 can be combined into one entity word, and the characters carrying B2, I2, and E2 can be combined into another entity word.
[0053] In this way, entity words can be obtained simply by using the character's own label, without having to determine the relationship between each character and all other characters. Therefore, if the length of the text to be recognized is n, the computational complexity of combining characters into entity words is n, which is relatively low.
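The single-pass nature of this combination can be sketched as follows. The function name `group_by_encoding` and the list-of-pairs input format are assumptions for illustration, not the disclosure's implementation; the point is that each character's own label is read once and no character is compared with another.

```python
from collections import defaultdict


def group_by_encoding(labels):
    """Map each encoding to the text positions of the characters that carry it.

    `labels[i]` is a list of (marker, encoding) pairs for character i; an empty
    list stands for the non-entity marker "O". Each pair is visited exactly once,
    so no character is ever compared with another character.
    """
    groups = defaultdict(list)
    for position, pairs in enumerate(labels):
        for marker, encoding in pairs:
            groups[encoding].append((position, marker))
    return dict(groups)


# Labels from the five-character example: B1B2, I1, E1, I2, E2.
labels = [[("B", 1), ("B", 2)], [("I", 1)], [("E", 1)], [("I", 2)], [("E", 2)]]
groups = group_by_encoding(labels)
# groups[1] -> [(0, "B"), (1, "I"), (2, "E")]
# groups[2] -> [(0, "B"), (3, "I"), (4, "E")]
```

The work done is proportional to the total number of (marker, encoding) pairs, which grows linearly with the text length as long as each character carries a bounded number of labels.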
[0054] Furthermore, even if the characters belonging to the same entity word are not consecutive, they can be combined directly through their shared encoding when entity words are generated from the character labels, which enables discontinuous entity words in the text to be identified.
[0055] Because the label determination model can generate accurate labels, it does not generate different encodings for the multiple entity words contained in a nested entity word, but generates the same encoding for the nested entity word as a whole, thus avoiding the situation where a nested entity word is identified as multiple separate entity words.
[0056] Using the technical solution disclosed herein, the label determination model generates a label for each character in the text to be recognized, including a marker and an encoding. The marker can represent the position of the character in the entity word, and the encoding can represent characters belonging to the same entity word. Therefore, characters can be directly combined based on their labels to obtain various forms of target entity words. Thus, there is no need to determine the relationship between each character and all other characters; the character's own label is sufficient to represent whether each character belongs to the same entity word and to represent the character's position in the entity word. Therefore, the computational complexity is low.
[0057] Optionally, based on the above technical solution, combining characters according to their labels to obtain entity words may include: identifying characters with the same encoding in the text to be identified as characters of the same entity word; and combining the characters into the same entity word according to the positions represented by the markers of the characters of the same entity word.
[0058] Since characters with the same encoding belong to the same entity word, characters with the same encoding can first be grouped together and identified as characters of the same entity word. Then, based on the position of each character within that entity word as indicated by its marker, these characters can be combined to obtain the entity word.
[0059] For example, suppose the labels of five characters are I1, B1B2, E2, E1, and I2. According to the encodings, it can first be determined that the five characters belong to two entity words: the first entity word includes the 1st, 2nd, and 4th characters, and the second entity word includes the 2nd, 3rd, and 5th characters. According to the markers of the 1st, 2nd, and 4th characters, in the first entity word the 2nd character is at the start position, the 1st character is at the middle position, and the 4th character is at the end position. According to the markers of the 2nd, 3rd, and 5th characters, in the second entity word the 2nd character is at the start position, the 5th character is at the middle position, and the 3rd character is at the end position.
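The example above can be traced with a short script. This is only a sketch: the `SLOT` table and the three-slot list are illustrative choices, and the code assumes at most one middle character per entity word, which holds for this example.

```python
# Labels of the five characters, as (marker, encoding) pairs per character.
labels = [[("I", 1)], [("B", 1), ("B", 2)], [("E", 2)], [("E", 1)], [("I", 2)]]

SLOT = {"B": 0, "I": 1, "E": 2}  # start, middle, end

entities = {}
for position, pairs in enumerate(labels, start=1):  # 1-based, as in the text
    for marker, encoding in pairs:
        slots = entities.setdefault(encoding, [None, None, None])
        slots[SLOT[marker]] = position

# entities[1] == [2, 1, 4]: 2nd character starts, 1st is the middle, 4th ends
# entities[2] == [2, 5, 3]: 2nd character starts, 5th is the middle, 3rd ends
```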
[0060] Optionally, it is also possible to first determine the position of each character in its respective entity word, and then determine the multiple characters belonging to the same entity word.
[0061] By adopting the technical solution of the present disclosure, whether the entity word is a flat entity word, a discontinuous entity word, or a nested entity word, it can be accurately generated from the labels of the characters, with no need to judge the relationship between each character and all other characters.
[0062] An entity word has only one start position and one end position. Therefore, the positions of the characters carrying the start marker and the end marker can be determined directly. However, an entity word may have multiple middle-position characters (characters that are in neither the start position nor the end position of the entity word). For example, the entity word "tensor fasciae latae" has 3 middle-position characters. Assuming that the encoding corresponding to the entity word "tensor fasciae latae" is 1, the labels of the characters "muscle", "fascia", and "tensor" are all I1.
[0063] The entity word is an entity word extracted from the text to be recognized. Therefore, the order of the characters in the entity word is the same as the order of the characters in the text to be recognized.
[0064] Therefore, to accurately determine the positions of middle-position characters that carry the same label, the order of these characters in the text to be recognized can be obtained and taken as their order within the entity word, which gives the position of each middle-position character in the entity word. Combining the ordered middle-position characters with the start-position character and the end-position character then yields the entity word.
[0065] In this way, the problem of the uncertain positions of the multiple middle-position characters in the same entity word is solved, and thus the entity word can be accurately generated.
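A minimal sketch of this ordering rule follows. The function name `order_entity` and the (text position, marker) input format are assumptions; single-character entity words, which the passage does not discuss, are not handled here.

```python
def order_entity(members):
    """Return the text positions of one entity word in entity-word order.

    `members` is a list of (text_position, marker) pairs that share one encoding.
    The start character comes first, the end character last, and the middle
    characters keep the order in which they appear in the text.
    """
    start = [pos for pos, marker in members if marker == "B"]
    middle = sorted(pos for pos, marker in members if marker == "I")
    end = [pos for pos, marker in members if marker == "E"]
    if len(start) != 1 or len(end) != 1:
        raise ValueError("expected exactly one start and one end character")
    return start + middle + end


# A five-character entity word with three middle characters, listed out of order.
members = [(3, "I"), (0, "B"), (1, "I"), (4, "E"), (2, "I")]
print(order_entity(members))  # [0, 1, 2, 3, 4]
```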
[0066] Optionally, based on the above technical solution, the label determination model is a model obtained by supervised training based on text samples. The text samples include entity word samples, and each character sample in the text samples carries the marker and the encoding. The entity word samples include nested entity word samples and discontinuous entity word samples.
[0067] The marker of a character sample indicates its position within the entity word sample, and character samples with identical encodings belong to the same entity word sample. The character samples of a nested entity word sample share one encoding, and so do the character samples of a discontinuous entity word sample.
[0068] The text samples can be open-source Chinese and English annotation datasets, such as ACE2004 and ACE2005, or other datasets. Each sentence is then segmented by spaces to obtain multiple character samples. Based on the entity word samples in the text sample to which each character sample belongs, and the position of each character sample within those entity word samples, each character sample is annotated with a marker and an encoding.
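The annotation step might look like the following sketch. The function `annotate`, the numbering of encodings by entity order, and the use of English word tokens for "muscle fatigue and soreness" are all assumptions for illustration; the disclosure does not state how encodings are assigned.

```python
def annotate(num_chars, entities):
    """Build per-character labels from entity annotations.

    `entities` is a list of entity words, each given as the text positions of its
    characters in order (at least two positions in this sketch). The i-th entity
    word receives encoding i + 1.
    """
    labels = [[] for _ in range(num_chars)]
    for encoding, positions in enumerate(entities, start=1):
        for rank, pos in enumerate(positions):
            if rank == 0:
                marker = "B"
            elif rank == len(positions) - 1:
                marker = "E"
            else:
                marker = "I"
            labels[pos].append(f"{marker}{encoding}")
    return ["".join(tags) or "O" for tags in labels]


# English tokens split by spaces; "muscle fatigue" and "muscle soreness" are the
# entity word samples, given as token positions.
tokens = "muscle fatigue and soreness".split()
print(annotate(len(tokens), [[0, 1], [0, 3]]))
# ['B1B2', 'E1', 'O', 'E2']
```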
[0069] In this way, when faced with nested entity words and discontinuous entity words, the trained label determination model can assign the same encoding to every character of a nested entity word and the same encoding to every character of a discontinuous entity word.
[0070] Optionally, the label determination model can be trained according to the following steps: acquiring the text samples; inputting the text samples into a base model to obtain the predicted marker and predicted encoding for each character sample; determining the cross-entropy loss function value based on the difference between the marker carried by each character sample and the predicted marker, and the difference between the encoding carried by each character sample and the predicted encoding; and training the base model based on the cross-entropy loss function value until a preset training termination condition is met, thus obtaining the label determination model. The preset training termination condition can be model convergence or the number of training iterations reaching a fixed value.
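The two termination conditions named above can be expressed as a small check. This is a sketch only: the function `should_stop`, the default values of `max_steps` and `tolerance`, and the use of the change between consecutive loss values as a convergence test are assumptions, not values from the disclosure.

```python
def should_stop(losses, max_steps=10_000, tolerance=1e-4):
    """Decide whether training should end.

    `losses` holds the loss value recorded after each training step so far.
    Training stops when the step budget is used up, or when the last two loss
    values differ by less than `tolerance` (a simple stand-in for convergence).
    """
    if len(losses) >= max_steps:
        return True
    if len(losses) >= 2 and abs(losses[-1] - losses[-2]) < tolerance:
        return True
    return False
```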
[0071] The base model is an untrained label determination model. By training the base model, its parameters can be adjusted to gradually improve the accuracy of its predicted character labels, ultimately resulting in a label determination model capable of accurately identifying character labels. The label determination model / base model can include a semantic representation module and a classifier. The semantic representation module can be a BERT (Bidirectional Encoder Representations from Transformers) model or other NLP (Natural Language Processing) models, and the classifier can be a softmax (normalized exponential function) classifier.
[0072] Text samples are input into the base model, where the semantic representation module extracts the semantic representation vectors of the text samples and the semantic representation vectors of each character sample. Based on the semantic representation vectors determined by the semantic representation module, the classifier outputs a predicted marker and a predicted encoding for each character sample.
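To make the classifier step concrete, here is a toy softmax classifier over a character's semantic vector, written in plain Python. The weights, the 2-dimensional vector, and the helper `classify` are made-up illustrations; how the disclosure's classifier emits several labels for one character is not specified, so this sketch predicts a single marker only.

```python
import math


def softmax(logits):
    """Normalized exponential function: turn raw scores into probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(vector, weights, bias):
    """Score each class as a dot product plus bias, then apply softmax."""
    logits = [sum(w * v for w, v in zip(row, vector)) + b
              for row, b in zip(weights, bias)]
    return softmax(logits)


MARKERS = ["B", "I", "E", "O"]
# Toy 2-dimensional semantic vector and made-up classifier parameters.
probs = classify([1.0, 0.0],
                 [[2.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.5, 0.5]],
                 [0.0] * 4)
predicted = MARKERS[probs.index(max(probs))]  # "B" for these toy numbers
```

An analogous head over the possible encodings would give the predicted encoding.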
[0073] The cross-entropy loss function can be established based on the difference between the marker carried by each character sample and the predicted marker, as well as the difference between the encoding carried by each character sample and the predicted encoding. The learning rate of the base model is 1e-5.
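One way to read this loss is as the sum of a marker term and an encoding term per character sample. The sketch below assumes that additive combination and a single marker and encoding per character; the disclosure does not state how the two differences are weighted, and the names `cross_entropy` and `character_loss` are illustrative.

```python
import math


def cross_entropy(probs, target_index):
    """Negative log-probability assigned to the annotated class."""
    return -math.log(probs[target_index])


def character_loss(marker_probs, marker_target, encoding_probs, encoding_target):
    """Sum of the marker term and the encoding term for one character sample."""
    return (cross_entropy(marker_probs, marker_target)
            + cross_entropy(encoding_probs, encoding_target))


# Toy predictions: marker classes [B, I, E, O], encoding classes [1, 2].
loss = character_loss([0.7, 0.1, 0.1, 0.1], 0, [0.8, 0.2], 0)
# loss = -ln(0.7) - ln(0.8), roughly 0.58
```

The loss shrinks toward zero as the predicted probabilities of the annotated marker and encoding approach 1.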
[0074] Thus, by simply performing supervised training on the base model, a label determination model can be obtained. This model can then accurately output the label of each character in the text to be identified, thereby generating entity words in various forms.
[0075] Figure 2 is a schematic diagram illustrating the labels of multiple characters according to an exemplary embodiment, wherein the multiple characters are the characters of the text to be recognized, "muscle fatigue and soreness", which includes the nested entity word "muscle fatigue" and the discontinuous entity word "muscle soreness." Because the word "muscle" belongs to both of these entity words, "muscle" carries both label "B1" and label "B2," and "flesh" carries both label "I1" and label "I2." The characters carrying B1, I1, and E1 are combined into the entity word "muscle fatigue," and the characters carrying B2, I2, and E2 are combined into the entity word "muscle soreness." Characters carrying the non-entity marker O do not form part of any entity word.
[0076] Accurately identifying entity words in the text to be identified has multiple uses. For example, when searching for text to be identified, only entity words in the text can be searched. When pushing information, information with the same entity words can be pushed in the same information stream.
[0077] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
[0078] Figure 3 is a block diagram illustrating an entity word recognition device according to an exemplary embodiment. Referring to Figure 3, the device includes a tag acquisition module 31 and an entity word acquisition module 32, wherein:
[0079] The tag acquisition module 31 is configured to input the text to be identified into the label determination model to obtain the label of each character in the text to be identified. The label includes a marker and an encoding. The marker represents the position of the character in the entity word. Characters with the same encoding belong to the same entity word.
[0080] The entity word acquisition module 32 is configured to combine the characters according to the labels of the characters to obtain the target entity word.
[0081] Optionally, the entity word acquisition module 32 includes:
[0082] The same entity word determination unit is configured to determine each character with the same encoding in the text to be identified as a character of the same entity word;
[0083] The combination unit is configured to combine the characters into the same entity word according to the positions represented by the markers of the characters of the same entity word.
[0084] Optionally, the marker includes an intermediate marker representing the middle position of the character in the entity word; if there is more than one middle character with the intermediate marker among the characters of the same entity word, the device further includes:
[0085] The sequence acquisition module is configured to acquire the sequential order of the multiple intermediate characters in the text to be recognized.
[0086] The sequence determination module is configured to determine the order of multiple middle characters in the text to be identified as the order of multiple middle characters in the same entity word.
[0087] The position determination module is configured to determine the position of the multiple middle characters in the same entity word based on their sequential order within the same entity word.
[0088] Optionally, the marker includes a non-entity marker, which indicates that the character is not located at any position of any entity word.
[0089] Optionally, the label determination model is a model trained based on text samples, wherein the text samples contain entity word samples, each character sample in the text samples carries the marker and the encoding, and the entity word samples include nested entity word samples and discontinuous entity word samples.
[0090] Optionally, the label determination model is trained according to the following steps:
[0091] Obtain the text sample;
[0092] The text samples are input into the base model to obtain the predicted marker and predicted encoding for each character sample;
[0093] The cross-entropy loss function value is determined based on the difference between the marker carried by each character sample and the predicted marker, and the difference between the encoding carried by each character sample and the predicted encoding.
[0094] The base model is trained based on the cross-entropy loss function value until the preset training termination condition is met, thus obtaining the label determination model.
[0095] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
[0096] Figure 4 is a block diagram illustrating an entity word recognition device 400 according to an exemplary embodiment. For example, device 400 may be a mobile phone, computer, digital broadcasting terminal, messaging device, game console, tablet device, medical device, fitness equipment, personal digital assistant, etc.
[0097] Referring to Figure 4, the device 400 may include one or more of the following components: a processing component 402, a memory 404, a power component 406, a multimedia component 408, an audio component 410, an input / output (I / O) interface 412, a sensor component 414, and a communication component 416.
[0098] Processing component 402 typically controls the overall operation of device 400, such as operations associated with display, telephone calls, data communication, camera operation, and recording. Processing component 402 may include one or more processors 420 to execute instructions to complete all or part of the steps of the entity word recognition method described above. Furthermore, processing component 402 may include one or more modules to facilitate interaction between processing component 402 and other components. For example, processing component 402 may include a multimedia module to facilitate interaction between multimedia component 408 and processing component 402.
[0099] Memory 404 is configured to store various types of data to support the operation of device 400. Examples of such data include instructions for any application or method operating on device 400, contact data, phonebook data, messages, pictures, videos, etc. Memory 404 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
[0100] Power supply component 406 provides power to various components of device 400. Power supply component 406 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 400.
[0101] Multimedia component 408 includes a screen that provides an output interface between the device 400 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundaries of the touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 408 includes a front-facing camera and/or a rear-facing camera. When the device 400 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and/or the rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
[0102] Audio component 410 is configured to output and/or input audio signals. For example, audio component 410 includes a microphone (MIC) configured to receive external audio signals when device 400 is in an operating mode, such as call mode, recording mode, and voice recognition mode. The received audio signals may be further stored in memory 404 or transmitted via communication component 416. In some embodiments, audio component 410 also includes a speaker for outputting audio signals.
[0103] I/O interface 412 provides an interface between processing component 402 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, power buttons, and lock buttons.
[0104] Sensor component 414 includes one or more sensors for providing status assessments of various aspects of device 400. For example, sensor component 414 may detect the on/off state of device 400, the relative positioning of components such as the display and keypad of device 400, changes in the position of device 400 or a component of device 400, the presence or absence of user contact with device 400, the orientation or acceleration/deceleration of device 400, and temperature changes of device 400. Sensor component 414 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor component 414 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor component 414 may also include an accelerometer, a gyroscope, a magnetometer, a pressure sensor, or a temperature sensor.
[0105] Communication component 416 is configured to facilitate wired or wireless communication between device 400 and other devices. Device 400 can access wireless networks based on communication standards, such as WiFi, carrier networks (such as 2G, 3G, 4G, or 5G), or combinations thereof. In one exemplary embodiment, communication component 416 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, communication component 416 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
[0106] In an exemplary embodiment, the device 400 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the aforementioned entity word recognition method.
[0107] In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, such as a memory 404 including instructions, which can be executed by a processor 420 of the device 400 to complete the aforementioned entity word recognition method. For example, the non-transitory computer-readable storage medium may be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, and optical data storage device, etc.
[0108] Figure 5 is a block diagram illustrating an entity word recognition device 500 according to an exemplary embodiment. For example, device 500 may be provided as a server. Referring to Figure 5, the device 500 includes a processing component 522, which further includes one or more processors, and memory resources represented by memory 532 for storing instructions executable by the processing component 522, such as computer program products. The computer program products stored in memory 532 may include one or more modules, each corresponding to a set of instructions. Furthermore, the processing component 522 is configured to execute instructions to perform the aforementioned entity word recognition method.
[0109] Device 500 may also include a power supply component 526 configured to perform power management of device 500, a wired or wireless network interface 550 configured to connect device 500 to a network, and an input/output (I/O) interface 558. Device 500 may operate on an operating system stored in memory 532, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or similar.
[0110] Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of the invention are indicated by the following claims.
[0111] It should be understood that the present invention is not limited to the precise structure described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims
1. A method for recognizing entity words, characterized in that it comprises:
inputting the text to be identified into a label determination model to obtain a label for each character in the text to be identified, wherein the label includes a marker and an encoding, the marker represents the position of the character in an entity word, characters with the same encoding belong to the same entity word, a character has at least one encoding, and a character with multiple encodings belongs to multiple entity words;
combining the characters according to their labels to obtain the target entity word;
wherein the label determination model is a model trained on text samples, the text samples contain entity word samples, each character sample in the text samples carries the marker and the encoding, and the entity word samples include nested entity word samples and discontinuous entity word samples;
the label determination model is trained according to the following steps:
obtaining the text samples;
inputting the text samples into a base model to obtain a predicted marker and a predicted encoding for each character sample;
determining a cross-entropy loss function value based on the difference between the marker carried by each character sample and the predicted marker, and the difference between the encoding carried by each character sample and the predicted encoding;
training the base model based on the cross-entropy loss function value until a preset training termination condition is met, thereby obtaining the label determination model;
the step of combining the characters according to their labels to obtain the target entity word includes:
determining the characters with the same encoding in the text to be identified as characters of the same entity word;
combining the characters of the same entity word into that entity word according to the positions represented by their markers.
2. The method according to claim 1, characterized in that the marker includes an intermediate marker that represents a middle position of the character in the entity word;
if, among the characters of the same entity word, there are multiple middle characters carrying the intermediate marker, the method further includes:
obtaining the sequential order of the multiple middle characters in the text to be identified;
determining the order of the multiple middle characters in the text to be identified as the order of the multiple middle characters in the same entity word;
obtaining the positions of the multiple middle characters in the same entity word based on their order within that entity word.
3. The method according to claim 1, characterized in that the marker includes a non-entity marker, which indicates that the character is not located at any position of any entity word.
4. A device for recognizing entity words, characterized in that it comprises:
a tag acquisition module configured to input the text to be identified into a label determination model to obtain a label for each character in the text to be identified, wherein the label includes a marker and an encoding, the marker represents the position of the character in an entity word, characters with the same encoding belong to the same entity word, a character has at least one encoding, and a character with multiple encodings belongs to multiple entity words;
an entity word acquisition module configured to combine the characters according to the labels of the characters to obtain the target entity word;
wherein the label determination model is a model trained on text samples, the text samples contain entity word samples, each character sample in the text samples carries the marker and the encoding, and the entity word samples include nested entity word samples and discontinuous entity word samples;
the label determination model is trained according to the following steps:
obtaining the text samples;
inputting the text samples into a base model to obtain a predicted marker and a predicted encoding for each character sample;
determining a cross-entropy loss function value based on the difference between the marker carried by each character sample and the predicted marker, and the difference between the encoding carried by each character sample and the predicted encoding;
training the base model based on the cross-entropy loss function value until a preset training termination condition is met, thereby obtaining the label determination model;
the tag acquisition module includes:
a same entity word determination unit configured to determine the characters with the same encoding in the text to be identified as characters of the same entity word;
a combination unit configured to combine the characters of the same entity word into that entity word according to the positions represented by their markers.
5. An electronic device, characterized in that it comprises: a processor; and a memory configured to store instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the entity word recognition method as described in any one of claims 1 to 3.
6. A computer-readable storage medium, wherein when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the entity word recognition method as described in any one of claims 1 to 3.
7. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the entity word recognition method as described in any one of claims 1 to 3.