Text processing method, apparatus, device, and medium
By acquiring query text pairs associated with information query behavior, performing character alignment processing and information matching degree analysis, and constructing a dictionary of similar characters, the problem of long time consumption and inaccuracy in the construction of dictionaries of similar-looking characters in the existing technology is solved, and the error correction effect is improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIAOHONGSHU TECH CO LTD
- Filing Date
- 2023-09-27
- Publication Date
- 2026-06-26
AI Technical Summary
Constructing existing dictionaries of similar-looking characters is time-consuming and inaccurate, which lowers the efficiency of similar-looking character correction and degrades the correction effect.
By acquiring query text pairs associated with information query behavior, character alignment processing is performed to determine the character alignment interval, and a dictionary of similar characters is constructed based on the information matching degree between the aligned characters.
It improves the accuracy of the similar character dictionary and its relevance to the query scenario, reduces the omission of similar-looking characters, and enhances the error correction effect of query terms.
Figure CN117725155B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a text processing method, apparatus, device, and medium.
Background Technology
[0002] Automatically correcting incorrect user queries into correct ones is a crucial step in the search process. Common correction methods include correcting similar-looking characters (such as those with similar forms). This requires relying on a dictionary of similar-looking characters. Search engines can use this dictionary to correct incorrect queries containing similar-looking characters, thus obtaining the correct query.
[0003] Currently, dictionaries of similar-looking characters are usually built manually by compiling lists of such characters. For example, relevant personnel determine a second set of characters that share the same radical as a first character, and then select similar-looking characters from that second set. However, this method is time-consuming and prone to overlooking similar-looking characters, which makes the dictionary inaccurate. It also makes it difficult to find the similar-looking characters present in incorrect search terms, resulting in low error-correction efficiency. In other words, the quality of the similar-looking character dictionary directly affects the effectiveness of similar-looking character correction.
Summary of the Invention
[0004] This application provides a text processing method, apparatus, device, and medium that can improve the quality of the constructed similar character dictionary.
[0005] In one aspect, embodiments of this application provide a text processing method, which includes:
[0006] Obtaining a query text pair associated with an information query behavior; the query text pair includes a first query text and a second query text; the characters included in the first query text form a first character set; the characters included in the second query text form a second character set;
[0007] The first and second character sets are aligned to obtain the character alignment interval between them. The character set corresponding to the character alignment interval in the first character set is the first character subset, and the character set corresponding to the character alignment interval in the second character set is the second character subset. The first characters included in the first character subset are aligned with the second characters included in the second character subset.
[0008] Obtain the first text information of the first character and the second text information of the second character, and determine the information matching degree between the first character and the second character based on the first text information and the second text information;
[0009] When determining that the first and second characters are similar characters based on the information matching degree between the first and second characters, a dictionary of similar characters associated with the information query behavior is constructed using the first and second characters.
[0010] In another aspect, embodiments of this application provide a text processing apparatus, the apparatus comprising:
[0011] The acquisition module is used to acquire query text pairs associated with information query behavior; a query text pair includes a first query text and a second query text; the characters included in the first query text form a first character set; the characters included in the second query text form a second character set.
[0012] The processing module is used to perform character alignment processing on the first character set and the second character set to obtain the character alignment interval between the first character set and the second character set; the character set corresponding to the character alignment interval in the first character set is the first character subset, the character set corresponding to the character alignment interval in the second character set is the second character subset, and the first characters included in the first character subset are aligned with the second characters included in the second character subset.
[0013] The processing module is also used to obtain the first text information of the first character and the second text information of the second character, and to determine the information matching degree between the first character and the second character based on the first text information and the second text information;
[0014] The processing module is also used to construct a dictionary of similar characters associated with the information query behavior when it is determined that the first character and the second character are similar characters based on the information matching degree between the first character and the second character.
[0015] In another aspect, embodiments of this application provide an electronic device including a processor and a memory, wherein the memory is used to store a computer program, the computer program including program instructions, and the processor is configured to invoke the program instructions to execute some or all of the steps in the above method.
[0016] In another aspect, embodiments of this application provide a computer-readable storage medium storing a computer program, the computer program including program instructions, which, when executed by a processor, are used to perform some or all of the steps in the above-described method.
[0017] Accordingly, according to one aspect of this application, a computer program product or computer program is provided, which includes computer instructions that, when executed by a processor, can implement some or all of the steps in the above-described method.
[0018] In the embodiments of this application, query text pairs associated with information query behavior can be obtained. Character alignment processing is performed on the first and second character sets to obtain the character alignment interval between them, and this interval determines the first and second character subsets. The first characters in the first character subset are aligned with the second characters in the second character subset, meaning they may be similar characters. Therefore, the information matching degree between a first character and a second character can be determined from the first text information and the second text information. When the information matching degree indicates that the first and second characters are similar characters, a dictionary of similar characters associated with the information query behavior is constructed using the first and second characters. In this way, the query text pairs associated with the information query behavior serve as a data source for quickly finding first and second character sets that may contain similar-looking characters, and the text information of the first and second characters allows similar characters in those sets to be identified more accurately. At the same time, this dictionary of similar characters is strongly correlated with the user's query intent. While reducing the omission of similar-looking characters, it is more likely to include the similar characters that users encounter during queries and to exclude most similar characters that are irrelevant to the information query behavior. This helps ensure the quality and accuracy of the constructed dictionary of similar characters.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a schematic diagram of a text processing scenario provided in an embodiment of this application;
[0021] Figure 2 is a first flowchart of a text processing method provided in an embodiment of this application;
[0022] Figure 3 is a first schematic diagram of an alignment processing scenario provided in an embodiment of this application;
[0023] Figure 4 is a second schematic diagram of an alignment processing scenario provided in an embodiment of this application;
[0024] Figure 5 is a second flowchart of a text processing method provided in an embodiment of this application;
[0025] Figure 6 is a schematic diagram of the structure of a text processing device provided in an embodiment of this application;
[0026] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0027] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.
[0028] The text processing method proposed in this application is implemented in an electronic device, which can be a server or a terminal. The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, and big data and artificial intelligence platforms. The terminal can be a smartphone, tablet computer, laptop computer, desktop computer, etc., but is not limited to these.
[0029] A schematic diagram of a text processing scenario based on this text processing method is shown in Figure 1. Figure 1 also presents a network architecture, which may include a server and a user terminal cluster. The user terminal cluster may include one or more user terminals; the number of user terminals in the cluster is not limited here. Communication connections can exist between the user terminals in the cluster. In addition, any user terminal in the cluster can have a communication connection with the server, so that each user terminal in the cluster can interact with the server through this connection. The communication connection method is not limited; it can be a direct or indirect connection via wired communication, a direct or indirect connection via wireless communication, or another method. Furthermore, it is understood that the electronic device involved in the embodiments of this application can be the server shown in Figure 1, or any user terminal in the user terminal cluster shown in Figure 1.
[0030] For example, in an embodiment of this application, an electronic device (such as a server or user terminal) can obtain a query text pair associated with the information query behavior and, using the text processing method proposed in this application, construct a similar character dictionary associated with that behavior from the query text pair. Taking the server as an example: the server obtains query text pairs associated with information query behavior. A query text pair may include a first query text and a second query text; the characters in the first query text form a first character set, and the characters in the second query text form a second character set. The first and second character sets are then aligned to obtain the character alignment interval between them. The characters of the first character set within the character alignment interval form the first character subset, and the characters of the second character set within that interval form the second character subset. The first characters in the first character subset are aligned with the second characters in the second character subset, meaning that they may be similar characters. Therefore, the server can obtain the first text information of a first character and the second text information of a second character, and determine the information matching degree between the two characters from that information. When the information matching degree indicates that the first and second characters are similar characters, a similar character dictionary is constructed using the first and second characters.
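The final step of this flow, turning aligned character pairs into a dictionary once their matching degree is known, can be sketched as follows. This is a minimal illustration and not the application's implementation: the function name `build_similar_dict`, the `matching_degree` callable, and the `threshold` value are all assumptions, since this passage does not specify how the information matching degree is computed or what cutoff is used.

```python
from collections import defaultdict


def build_similar_dict(candidate_pairs, matching_degree, threshold=0.5):
    """Collect character pairs whose matching degree reaches the threshold.

    candidate_pairs: iterable of (first_char, second_char) tuples taken from
        character alignment intervals.
    matching_degree: hypothetical callable returning a score for two characters.
    threshold: hypothetical cutoff; the source text does not give a value.
    """
    similar = defaultdict(set)
    for first_char, second_char in candidate_pairs:
        if first_char == second_char:
            continue  # identical characters carry no correction information
        if matching_degree(first_char, second_char) >= threshold:
            # Similarity is treated as symmetric, so record both directions.
            similar[first_char].add(second_char)
            similar[second_char].add(first_char)
    return dict(similar)


# Toy scores standing in for a real matching-degree computation.
toy_scores = {frozenset("肤夫"): 0.8, frozenset("果机"): 0.1}


def toy_degree(a, b):
    return toy_scores.get(frozenset((a, b)), 0.0)


similar_dict = build_similar_dict([("肤", "夫"), ("果", "机")], toy_degree)
```

Storing both directions lets a lookup start from either the mistyped or the intended character.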
[0031] It is understandable that the similar character dictionary constructed at this point is determined by query text pairs associated with the information query behavior. Such a similar character dictionary can reduce the omission of similar-looking characters and is strongly correlated with the information query behavior. That is, the similar characters in the similar character dictionary are similar characters that users are likely to encounter during the search query process, ensuring the accuracy of the similar character dictionary, its relevance to the search query scenario, and the quality of dictionary construction. Therefore, when performing query term correction, similar characters involved in the incorrect query term can be quickly found from the similar character dictionary, and the incorrect query term can be corrected based on the found similar characters to obtain the correct query term, thereby improving the similar character correction effect.
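As a rough illustration of how such a dictionary might be used for query term correction, the sketch below substitutes similar characters into a query and keeps the first variant that appears in a set of known queries. The helper names and the `known_queries` set are assumptions introduced for this example; the text above does not describe how a candidate correction is validated.

```python
from itertools import product


def correction_candidates(query, similar_dict):
    """Return every variant of `query` obtained by swapping characters for
    their listed similar characters, excluding the original query."""
    # At each position the options are the character itself plus its similar characters.
    options = [[ch] + sorted(similar_dict.get(ch, ())) for ch in query]
    variants = ("".join(chars) for chars in product(*options))
    return [v for v in variants if v != query]


def correct_query(query, similar_dict, known_queries):
    """Return the first candidate found in `known_queries`, else the query unchanged.

    `known_queries` is a hypothetical set of queries assumed to be correct.
    """
    if query in known_queries:
        return query
    for candidate in correction_candidates(query, similar_dict):
        if candidate in known_queries:
            return candidate
    return query


similar_chars = {"平": {"苹"}, "苹": {"平"}}
corrected = correct_query("平果", similar_chars, {"苹果"})
```

The number of variants grows multiplicatively with query length, so this enumeration is only reasonable for short queries.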
[0032] Optionally, in some embodiments, the electronic device can execute the text processing method according to actual business needs to improve the construction of the similar character dictionary. The technical solution of this application can be applied to any query scenario, and the query text pair can be determined from the text extracted from the query information and/or recall information in that scenario. For example, from multimedia information query behavior in a multimedia query scenario, query text pairs can be obtained (e.g., the first query text is text related to multimedia, such as images, entered by the user, and the second query text is determined from the multimedia-related recall information returned for that input) in order to construct a similar character dictionary associated with the multimedia query scenario. Similarly, from product information query behavior in a product query scenario, query text pairs can be obtained (e.g., the first query text is product-related text entered by the user, and the second query text is determined from the product-related recall information returned for that input) in order to construct a similar character dictionary associated with the product query scenario. No limitation is imposed here. The electronic device can use the technical solution of this application to determine a similar character dictionary for one or more query scenarios, so that the similar characters involved in an erroneous query term in the current scenario can be found efficiently from the dictionary, thereby improving similar-character error correction in that scenario.
[0033] In this context, "similar characters" can refer to characters with similar forms or homophones, etc. Therefore, the resulting dictionary of similar characters could be a dictionary of similar-form characters, a dictionary of homophones, etc. No specific limitations are imposed here.
[0034] Optionally, the data involved in this application, such as query text pairs and similar character dictionaries, can be stored in a database or in a blockchain. This application does not limit the storage of such data through a blockchain distributed system.
[0035] It should be noted that in specific embodiments of this application, scenarios that involve acquiring user information and related data, such as obtaining user-entered query information, require user permission or consent. That is, when the embodiments of this application are applied to specific products or technologies, the collection, use, and processing of relevant user data comply with the laws, regulations, and standards of the relevant regions. For example, prompts can be issued through an interactive interface to indicate what data will be collected or acquired. Specifically, the types and content of this data can be presented to the user through lists or other means. Data collection and processing proceed only after a confirmation or instruction allowing data collection is received on the interactive interface.
[0036] It is understood that the above scenarios are merely examples and do not constitute a limitation on the application scenarios of the technical solutions provided in the embodiments of this application. The technical solutions of this application can also be applied to other scenarios. For example, as those skilled in the art will know, with the evolution of system architecture and the emergence of new business scenarios, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
[0037] Based on the foregoing description, this application proposes a text processing method that can be executed by the aforementioned electronic device. Please refer to Figure 2, which is a flowchart of a text processing method provided in an embodiment of this application.
[0038] As shown in Figure 2, the flow of the text processing method in this embodiment may include the following:
[0039] S101. Obtain the query text pair associated with the information query behavior.
[0040] The query text pair includes a first query text and a second query text. The characters in the first query text form a first character set, and the characters in the second query text form a second character set.
[0041] Information query behavior refers to a business object entering query information in an information query scenario and receiving recall information that matches it. It can be understood that the returned recall information varies with the information query scenario. For example, in a multimedia query scenario, the returned recall information could be video data. Similarly, in a product query scenario, the returned recall information could be product data.
[0042] The query text pair can be obtained as follows: obtain the query information entered by the business object in two information query behaviors, where the two behaviors are consecutive and the time interval between them is less than a time interval threshold; then obtain the query text pair from the query information entered in the two behaviors. The time interval threshold can be set based on experience, for example 10 seconds.
[0043] In other words, the system obtains the query information entered by the business object (i.e., the user) in two consecutive information query behaviors (i.e., from the user's historical query behavior log, which records the query information the user entered in succession during the query period) and uses the two pieces of query information as a query text pair. That is, the first query text in the pair is determined by one of the two pieces of query information, and the second query text is determined by the other.
[0044] For example, if the query information is text entered by the user, one of the two pieces of query information (or the Chinese text in it) can be used as the first query text, and the other (or the Chinese text in it) can be used as the second query text.
[0045] As another example, if the query information is image information input by the user, the text extracted from one of the two pieces of query information (or the Chinese text in that extracted text) can be used as the first query text, and the text extracted from the other piece of query information (or the Chinese text in that extracted text) can be used as the second query text.
[0046] As another example, if the query information is voice information input by the user, the text extracted from one of the two pieces of query information (or the Chinese text in that extracted text) can be used as the first query text, and the text extracted from the other piece of query information (or the Chinese text in that extracted text) can be used as the second query text.
[0047] It can be understood that when a user searches for information, the query information entered the first time may be incorrect and then be corrected the second time. For example, the user enters incorrect query information (such as "平果", a miswriting of "苹果") in the first information query behavior and, after realizing the error, enters the corrected query information (such as "苹果", meaning "apple") in the second information query behavior. That is, the query information entered in two consecutive information query behaviors may contain an error correction, and the original error may have been caused by similar characters, specifically by similar-looking characters. Therefore, a query text pair associated with the information query behavior can be determined from the query information entered in two consecutive information query behaviors.
[0048] In addition, when the obtained information query behaviors are consecutive and occur close together in time, the two behaviors can be understood as related. For example, the user notices in the first information query behavior that the entered query information is wrong and quickly corrects it in the second. That is, two consecutive query behaviors that are close in time are more likely to contain a correction of the query information.
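The pairing of consecutive queries described above can be sketched as follows. This is an illustrative reading only: the `QueryEvent` record, its field names, and the helper functions are assumptions made for the example, and the 10-second default simply reuses the empirical value mentioned in the text. The regular expression covers the basic CJK Unified Ideographs block, which is one simple way to keep only the Chinese text.

```python
import re
from dataclasses import dataclass

_CJK = re.compile(r"[\u4e00-\u9fff]+")


@dataclass
class QueryEvent:
    user_id: str      # hypothetical identifier of the business object
    timestamp: float  # time of the query behavior, in seconds
    text: str         # query text (already extracted if the input was image or voice)


def chinese_only(text):
    """Keep only the Chinese characters of a query text."""
    return "".join(_CJK.findall(text))


def consecutive_query_pairs(events, max_gap_seconds=10.0):
    """Pair each query with the same user's previous query when the two are
    consecutive and less than `max_gap_seconds` apart."""
    pairs = []
    previous = {}  # user_id -> that user's most recent event
    for event in sorted(events, key=lambda e: (e.user_id, e.timestamp)):
        prev = previous.get(event.user_id)
        if (prev is not None
                and event.timestamp - prev.timestamp < max_gap_seconds
                and event.text != prev.text):
            pairs.append((prev.text, event.text))
        previous[event.user_id] = event
    return pairs


log = [
    QueryEvent("u1", 0.0, "平果"),
    QueryEvent("u1", 4.0, "苹果"),
    QueryEvent("u1", 60.0, "手机"),
    QueryEvent("u2", 1.0, "苹果"),
]
pairs = consecutive_query_pairs(log)
```

Skipping pairs whose two texts are identical is a choice made for this sketch, since a repeated query contains no correction.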
[0049] Optionally, obtaining the query text pair can also involve: obtaining the information interaction behavior of the business object in response to the entered query information during the information query behavior; and obtaining the query text pair based on the query information entered during the information query behavior and the recall information generated during the information interaction behavior.
[0050] Information interaction behavior refers to interaction with the recall information returned for the entered query information, such as clicking, saving, or liking that recall information. In other words, when the information interaction behavior indicates that a user clicked a piece of recall information, that recall information is treated as having been interacted with.
[0051] In other words, the query information entered by the business object (i.e., the user) in an information query behavior is obtained, together with the recall information that was interacted with in response to that behavior, and the two are used as a query text pair. That is, the first query text (or the second query text) in the query text pair is determined by the entered query information, and the second query text (or the first query text) is determined by the recall information that was interacted with.
[0052] For example, if the query information is text information entered by the user, then the entered query information (or the Chinese text within that query information) can be used as the first query text, and the text information extracted from the recall information generated by the interaction (or the Chinese text within that text information) can be used as the second query text. For instance, when the recall information is video data, the extracted text information could be the video title, description, or other information from the video itself. Or, when the recall information is note data, the extracted text information could be the note title, note text, or other information contained within the note. Or, when the recall information is product data, the extracted text information could be the product title, details, or other information. No further limitations are imposed here.
[0053] For example, if the query information is an image input by the user, then the text information (or the Chinese text within that text information) extracted from the entered query information can be used as the first query text. Similarly, if the query information is voice information input by the user, then the text information (or the Chinese text within that text information) extracted from the entered query information can be used as the first query text.
[0054] It is understandable that when a user searches for information, the resulting interactive recall information is information associated with the entered query information. When a user enters incorrect query information, the returned recall information may be related to the correct query information. Therefore, the interactive recall information generated when entering incorrect query information can be considered as recall information associated with the correct query information. Thus, the error in the incorrect query information can be determined based on the recall information associated with the correct query information. This error may be due to input errors caused by similar characters, such as homophones. In other words, the entered query information and the resulting interactive recall information can be used to determine the query text pair associated with the information query behavior.
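A minimal sketch of this second way of forming query text pairs is shown below. The `RecallItem` record and its fields are assumptions for illustration; in practice the text could come from a video title, note body, product details, or other fields, as noted earlier.

```python
from dataclasses import dataclass


@dataclass
class RecallItem:
    title: str        # text extracted from the recall information (hypothetical field)
    interacted: bool  # True if the user clicked, saved, or liked the item


def query_recall_pairs(query_text, recall_items):
    """Pair the entered query text with the text of each recall item the user
    interacted with; items without interaction or without text are skipped."""
    return [
        (query_text, item.title)
        for item in recall_items
        if item.interacted and item.title
    ]


items = [
    RecallItem("苹果手机壳", interacted=True),
    RecallItem("平板电脑", interacted=False),
    RecallItem("", interacted=True),
]
recall_pairs = query_recall_pairs("平果手机壳", items)
```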
[0055] It is understandable that the query texts in a query text pair are obtained through the information query behavior and/or information interaction behavior of a given information query scenario, so they are related to that scenario. The similar character dictionary constructed from the query text pairs is therefore also related to the scenario. That is, many of the similar characters in the dictionary are ones that users may mistype in that information query scenario, while similar characters unrelated to the scenario are reduced. As a result, the constructed similar character dictionary can contain similar characters that match the user's search intent in a specific query scenario.
[0056] S102. Perform character alignment processing on the first character set and the second character set to obtain the character alignment interval between the first character set and the second character set.
[0057] Here, the characters of the first character set that fall within the character alignment interval form the first character subset, and the characters of the second character set that fall within the character alignment interval form the second character subset. Furthermore, the first characters included in the first character subset are aligned with the second characters included in the second character subset. It can be understood that the first characters in the first character subset and the second characters in the second character subset may be similar characters.
[0058] The first character set consists of N1 characters. These N1 characters include the i1th character and the j1th character; i1 is less than j1, and i1 is a positive integer less than N1, while j1 is a positive integer less than or equal to N1. The second character set consists of N2 characters; these N2 characters include the i2th character and the j2th character; i2 is less than j2, and i2 is a positive integer less than N2, while j2 is a positive integer less than or equal to N2.
[0059] In some embodiments, the character alignment processing of the first character set and the second character set may specifically include: searching the N2 characters for a character identical to the i1-th character; if the character found is the i2-th character, determining the i1-th character as a first aligned character, and obtaining from the N2 characters the at least one character after the i2-th character; searching the at least one character for a character identical to the j1-th character; if the character found is the j2-th character, determining the j1-th character as a second aligned character; and determining an aligned character set based on the first aligned character and the second aligned character. The aligned character set includes B aligned characters, among them the b-th aligned character and the (b+1)-th aligned character; B is a positive integer, and b is a positive integer less than B. A first interval, formed by the characters between the b-th and (b+1)-th aligned characters, is obtained from the N1 characters, and a second interval, formed by the characters between the b-th and (b+1)-th aligned characters, is obtained from the N2 characters. When the character alignment interval is obtained based on the first and second intervals, the characters corresponding to the first interval are taken as the first character subset, and the characters corresponding to the second interval are taken as the second character subset. The characters in the first interval are different from the characters in the second interval.
[0060] In other words, this step identifies characters in the first and second character sets that differ from each other but are potentially similar. For the first and second query texts, the characters lying between two identical characters may be erroneous, or may be similar to each other; that is, characters that occupy the same position in the two query texts but differ in content are treated as potentially similar. The first and second query texts are therefore aligned, so that the characters occupying the same position in both can be identified.
[0061] It can be understood that when a character identical to a character in the first query text (character A) is found in the second query text, the search for the characters that follow character A in the first query text starts from the characters that follow character A in the second query text. It can also be understood that the characters of the first character set for which a character with the same content is found in the second character set are taken as aligned characters, yielding the aligned character set. That is, the aligned character set consists of the characters that the first and second character sets have in common, i.e., the aligned characters.
[0062] It is understandable that identical texts are identified sequentially between the first query text and the second query text, and each identified text is used as an aligned text. An identified text occupies corresponding positions in the first query text and in the second query text, that is, it is aligned. In addition, the texts lying between any two identified texts in the first query text and the texts lying between the same two identified texts in the second query text also occupy corresponding positions, that is, they are aligned.
[0063] For example, suppose the first query text is "skincare products" and the second query text is "husband care products" (each a three-character query in the original language). The first character in the first query text is the same as the first character in the second query text, and the third character in the first query text is the same as the third character in the second query text. Therefore, the interval formed by the characters between the first and third characters in the first query text is the first interval, and the interval formed by the characters between the first and third characters in the second query text is the second interval. The first interval and the second interval together are the character alignment interval. The first interval includes the second character in the first query text, and the second interval includes the second character in the second query text. Therefore, it can be determined whether the second character of the first query text ("skin") and the second character of the second query text ("husband") are similar characters.
[0064] It is understandable that the characters in the character set between the i1th and j1th characters are all different from the characters in the character set between the i2th and j2th characters. It is also understandable that the character alignment interval includes the first interval determined in the first character set and the corresponding second interval determined in the second character set.
[0065] It should be noted that, at this point, the first text set can be considered to include not only the text in the first query text but also default text. The default text serves as the first and last characters of the first text set, and likewise as the first and last characters of the second text set. Thus the first character in the first text set is the same as the first character in the second text set, and the last character in the first text set is the same as the last character in the second text set. Accordingly, the first text set sequentially includes: the default text as the first character, the text in the first query text, and the default text as the last character. The second text set sequentially includes: the default text as the first character, the text in the second query text, and the default text as the last character.
[0066] It is understandable that the first text subset may include one or more first texts, and the second text subset may include one or more second texts. The information matching degree between each first text and each second text can be determined in turn to determine whether they are similar texts.
[0067] It is understandable that the search can start from the first character in the first character set (e.g., character 11) and look in the second character set for a character identical to character 11. If the character identical to character 11 is the second character in the second character set (e.g., character 22), the search continues from the third character in the second character set for a character identical to the second character in the first character set (e.g., character 12). If none is found, the search looks for a character identical to the third character in the first character set (e.g., character 13). If the character identical to character 13 is the fourth character in the second character set (e.g., character 24), then character 12 (the second character in the first character set), which lies between character 11 and character 13, is taken as the first character subset, and character 23 (the third character in the second character set), which lies between character 22 and character 24, is taken as the second character subset. The interval formed by character 12 is the first interval and the interval formed by character 23 is the second interval, which together give the character alignment interval. Subsequently, for the fourth character in the first character set (e.g., character 14), the search for an identical character continues after the fourth character in the second character set, so as to determine a new character alignment interval.
[0068] At this point, the characters of the first character set for which a character with the same content is found in the second character set are used as aligned characters to obtain an aligned character set. The corresponding character alignment interval can then be determined from any two adjacent aligned characters in the aligned character set. The character subsets that lie between two adjacent aligned characters in the first and second character sets are aligned in position but different in content, so similar characters are more likely to appear in these two subsets. It can be understood that there can be one or more character alignment intervals between the first and second character sets, and that each character alignment interval includes a first interval and a second interval.
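The alignment procedure described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `align_intervals`, the marker `PAD` standing in for the default character, and the choice to keep only intervals that contain characters on both sides are assumptions made for illustration.

```python
PAD = "\0"  # assumed stand-in for the default first/last character


def align_intervals(first_query, second_query):
    """Greedy left-to-right alignment; returns (first_subset, second_subset) pairs."""
    first = [PAD, *first_query, PAD]
    second = [PAD, *second_query, PAD]

    aligned = []  # index pairs (position in first, position in second)
    start = 0     # the search in `second` resumes after the last match
    for i, ch in enumerate(first):
        try:
            j = second.index(ch, start)
        except ValueError:
            continue  # no identical character: ch falls inside an interval
        aligned.append((i, j))
        start = j + 1

    intervals = []
    for (i1, i2), (j1, j2) in zip(aligned, aligned[1:]):
        first_subset = first[i1 + 1:j1]
        second_subset = second[i2 + 1:j2]
        # Assumption: only intervals with characters on both sides are kept.
        if first_subset and second_subset:
            intervals.append((first_subset, second_subset))
    return intervals


# Latin letters stand in for the query characters.
print(align_intervals("abc", "aXc"))  # [(['b'], ['X'])]
```

Because each search resumes after the previous match, the aligned characters keep the same relative order in both queries, which is what allows two adjacent aligned characters to delimit an interval.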
[0069] It can be understood that similar characters may exist among the characters that differ between the first query text and the second query text. For example, suppose the user enters the incorrect query information "Pingguo efficacy" in the first information query behavior and the correct query information "Apple efficacy" in the second information query behavior; character alignment processing can then be performed. For the first and second query texts, characters with the same content and aligned positions can be regarded as correctly input, while characters with different content and aligned positions can be regarded as incorrectly input. Similar characters may exist in the incorrectly input parts and can therefore be determined from them. In this example, the first character subset "Ping" and the second character subset "Ping" (two different characters that are romanized identically here) are obtained, and the similarity of these two characters is judged. Compared with pairing every character of the first character set with every character of the second character set to determine whether they are similar, this approach first screens out the first and second character subsets that may contain similar characters, and only then pairs the characters of the first character subset with those of the second character subset, thereby reducing the workload and improving the efficiency of determining similar characters.
[0070] For example, Figures 3-4 are schematic diagrams of an alignment processing scenario provided by an embodiment of the present application. As shown in Figures 3-4, the first query text is "Which dermatology hospital is better", and the second query text is "Which top-three dermatology hospital is better". The first word set constructed from the words in the first query text includes words a1-a12 (default word A, which, family, skin, department, hospital, more, good, default word B); the second word set constructed from the words in the second query text includes words b1-b14 (default word A, which, family, skin, husband, department, three, first, class, hospital, more, good, default word B). The search proceeds as follows: the word found to be the same as word a1 in words b1-b14 is word b1; the word found to be the same as word a2 in words b2-b14 is word b2; the word found to be the same as word a3 in words b3-b14 is word b3; the word found to be the same as word a4 in words b4-b14 is word b4; no word the same as word a5 is found in words b5-b14; the word found to be the same as word a6 in words b5-b14 is word b6; no word the same as word a7 is found in words b7-b14; the word found to be the same as word a8 in words b7-b14 is word b10; the word found to be the same as word a9 in words b11-b14 is word b11; the word found to be the same as word a10 in words b12-b14 is word b12; no word the same as word a11 is found in words b13-b14; and the word found to be the same as word a12 in words b13-b14 is word b14. Therefore, words a1, a2, a3, a4, a6, a8, a9, a10 and a12 are determined as the aligned word set. There is no word set between words a1 and a2, between words a2 and a3, or between words a3 and a4 in the first word set. Between aligned words a4 and a6 there is a word set c11 ("skin") in the first word set and a word set c12 ("husband") in the second word set. Between aligned words a6 and a8 there is a word set c21 ("hospital") in the first word set and a word set c22 ("three, first, class, hospital") in the second word set. There is no word set between words a8 and a9 or between words a9 and a10 in the first word set. Between aligned words a10 and a12 there is a word set c31 ("good") in the first word set and a word set c32 ("good") in the second word set.
[0071] Therefore, as shown in Figure 4, the first interval d11 can be determined from the text set c11 and the second interval d12 can be determined from the text set c12; the first interval d11 and the second interval d12 together form the character alignment interval e1. The text set c11 corresponding to the first interval d11 is the first text subset h11, and the text set c12 corresponding to the second interval d12 is the second text subset h12, so the information matching degree between text a5 in the first text subset h11 and text b5 in the second text subset h12 can be determined. Similarly, the first interval d21 can be determined from the text set c21 and the second interval d22 from the text set c22; together they form the character alignment interval e2. The text set c21 corresponding to the first interval d21 is the first text subset h21, and the text set c22 corresponding to the second interval d22 is the second text subset h22, so the information matching degrees between text a7 in the first text subset h21 and each of texts b7, b8 and b9 in the second text subset h22 can be determined in turn. Likewise, the first interval d31 can be determined from the text set c31 and the second interval d32 from the text set c32; together they form the character alignment interval e3. The text set c31 corresponding to the first interval d31 is the first text subset h31, and the text set c32 corresponding to the second interval d32 is the second text subset h32, so the information matching degree between text a11 in the first text subset h31 and text b13 in the second text subset h32 can be determined.
[0072] It can be understood that the texts of the two query texts relate to each other as follows: text a1 and text b1 have the same content and are aligned; text a2 and text b2 have the same content and are aligned; text a3 and text b3 have the same content and are aligned; text a4 and text b4 have the same content and are aligned; text a5 and text b5 have different content and are aligned (so they can be checked for similarity); text a6 and text b6 have the same content and are aligned; text a7 and texts b7-b9 have different content and are aligned (so they can be checked for similarity); text a8 and text b10 have the same content and are aligned; text a9 and text b11 have the same content and are aligned; text a10 and text b12 have the same content and are aligned; text a11 and text b13 have different content and are aligned (so they can be checked for similarity); and text a12 and text b14 have the same content and are aligned.
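As a hedged illustration of how the aligned-but-different texts can be paired for the matching-degree check, the sketch below takes a list of (first text subset, second text subset) intervals and pairs every text of one subset with every text of the other. The function name `candidate_pairs` and the use of labels such as "a5" in place of actual characters are assumptions for illustration only.

```python
from itertools import product


def candidate_pairs(intervals):
    """Pair each text in a first subset with each text in its second subset."""
    pairs = []
    for first_subset, second_subset in intervals:
        pairs.extend(product(first_subset, second_subset))
    return pairs


# Labels mirror the three alignment intervals of the example above.
example = [(["a5"], ["b5"]), (["a7"], ["b7", "b8", "b9"]), (["a11"], ["b13"])]
for pair in candidate_pairs(example):
    print(pair)
```

Only five pairs are produced for this example, compared with the 120 pairs that would result from pairing all 10 query characters of the first text with all 12 of the second.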
[0073] Optionally, when both N1 and N2 are 1 and the first query text is different from the second query text, it can be determined whether the first query text and the second query text are similar characters.
[0074] Optionally, when N1 is 1, N2 is a positive integer greater than 1, and the character in the first query text is different from each character in the second query text, it can be determined, for each character in the second query text in turn, whether that character and the character in the first query text are similar characters.
[0075] Optionally, when N1 is a positive integer greater than 1, N2 is 1, and each character in the first query text is different from the character in the second query text, it can be determined, for each character in the first query text in turn, whether that character and the character in the second query text are similar characters.
[0076] S103. Obtain the first character information of the first character and the second character information of the second character, and determine the information matching degree between the first character and the second character based on the first character information and the second character information.
[0077] When the similar characters to be identified are characters of similar form, the first character information of the first character and the second character information of the second character can be the glyph structure information of the characters. The glyph structure information can refer to the set of character components obtained by splitting a character, where the components in the set together form the character. For example, for the character "维", the set of character components obtained after splitting is "纟、隹".
[0078] Therefore, determining the information matching degree based on the first character information and the second character information can proceed as follows: determine the character components shared by the set of character components of the first character and the set of character components of the second character; take the smaller of the number of character components in the first character's set and the number in the second character's set as the minimum number; and use the ratio of the number of shared character components to the minimum number as the first reference value. If the first reference value is greater than the reference threshold, the information matching degree between the first character and the second character is determined as the first matching degree; if the first reference value is less than or equal to the reference threshold, it is determined as the second matching degree. The first matching degree (for example, 1) indicates that the first character and the second character are similar in form. The second matching degree (for example, 0) indicates that they are not similar in form.
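The first reference value can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: the function name `shape_matching_degree` is hypothetical, the threshold values are arbitrary, and the component set "忄、隹" for a second character is supplied here only as sample input.

```python
def shape_matching_degree(first_components, second_components, reference_threshold):
    """Return 1 (similar in form) or 0 (not similar) from shared components."""
    first_set, second_set = set(first_components), set(second_components)
    shared = len(first_set & second_set)
    minimum = min(len(first_set), len(second_set))
    # Guard against an empty component set (an assumption; the text does
    # not discuss this case).
    ratio = shared / minimum if minimum else 0.0
    return 1 if ratio > reference_threshold else 0


# "纟、隹" is the split given above; "忄、隹" is an assumed sample input.
print(shape_matching_degree(["纟", "隹"], ["忄", "隹"], 0.4))  # 1 (ratio 0.5 > 0.4)
print(shape_matching_degree(["纟", "隹"], ["忄", "隹"], 0.5))  # 0 (ratio 0.5 is not > 0.5)
```

Dividing by the smaller set size means that a character fully contained in a larger one reaches a ratio of 1 regardless of how many extra components the larger character has.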
[0079] When the similar characters to be identified are homophones, the first character information of the first character and the second character information of the second character can be the phonetic structure information of the characters. The phonetic structure information can refer to the pinyin information of the characters. For example, for the character "维", the pinyin information is "wei".
[0080] Therefore, determining the information matching degree based on the first and second character information can be achieved by determining the pinyin edit distance between the pinyin information of the first character and that of the second character. Since a smaller edit distance means closer pronunciation, if the pinyin edit distance is less than or equal to an edit distance threshold, the information matching degree between the first and second characters is determined as the first matching degree; if the pinyin edit distance is greater than the edit distance threshold, it is determined as the second matching degree. The first matching degree (for example, 1) indicates that the first and second characters are homophones. The second matching degree (for example, 0) indicates that they are not homophones.
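A minimal Python sketch of the pinyin comparison follows. It uses the standard Levenshtein edit distance, which is one common choice and an assumption here, since the text does not name a specific distance. The sketch also assumes that a distance at or below the threshold is what indicates a homophone; the names `edit_distance` and `homophone_matching_degree` are hypothetical.

```python
def edit_distance(source, target):
    """Levenshtein distance computed row by row."""
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]
        for j, target_char in enumerate(target, start=1):
            current.append(min(
                previous[j] + 1,                                  # deletion
                current[j - 1] + 1,                               # insertion
                previous[j - 1] + (source_char != target_char),   # substitution
            ))
        previous = current
    return previous[-1]


def homophone_matching_degree(first_pinyin, second_pinyin, distance_threshold):
    """Return 1 when the pinyin strings are within the threshold, else 0."""
    distance = edit_distance(first_pinyin, second_pinyin)
    return 1 if distance <= distance_threshold else 0


print(edit_distance("wei", "wen"))                 # 1
print(homophone_matching_degree("wei", "wei", 0))  # 1
```

A threshold of 0 accepts only identical pinyin, while a threshold of 1 would also accept near-homophones such as "wei" and "wen".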
[0081] S104. When determining that the first and second characters are similar characters based on the information matching degree between the first and second characters, construct a similar character dictionary associated with the information query behavior using the first and second characters.
[0082] It can be understood that when the information matching degree between the first and second characters indicates that they are similar characters, both characters can be added to the similar character dictionary. For example, when determining whether the first and second characters are similar in form, if the information matching degree is the first matching degree, the first and second characters are determined to be similar in form; if it is the second matching degree, they are determined not to be similar in form. Similarly, when determining whether the first and second characters are homophones, if the information matching degree is the first matching degree, the first and second characters are determined to be homophones; if it is the second matching degree, they are determined not to be homophones.
[0083] The similar character dictionary records the correspondence between the first and second characters. It can be understood that when the first and second characters are similar in form, the resulting dictionary is a dictionary of similar-looking characters; when the first and second characters are homophones, the resulting dictionary is a dictionary of homophones. Either dictionary can be used for error correction. Subsequently, in search scenarios, the dictionary can be used to correct errors in query text, so that more accurate search results are returned.
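One possible way to record these correspondences is sketched below in Python. The function name `build_similar_dictionary`, the `matching_degree` callable, and the decision to store each pair in both directions are assumptions for illustration; the text only requires that the correspondence between the two characters be recorded.

```python
from collections import defaultdict


def build_similar_dictionary(candidate_pairs, matching_degree):
    """Map each character to the set of characters judged similar to it."""
    dictionary = defaultdict(set)
    for first_char, second_char in candidate_pairs:
        if matching_degree(first_char, second_char) == 1:  # first matching degree
            dictionary[first_char].add(second_char)
            dictionary[second_char].add(first_char)        # assumed symmetric
    return dict(dictionary)


# Toy matching function standing in for the form or homophone check.
toy_degree = lambda a, b: 1 if (a, b) == ("b", "X") else 0
print(build_similar_dictionary([("b", "X"), ("y", "1")], toy_degree))
```

Storing both directions lets a lookup start from either the mistyped or the intended character, at the cost of also proposing the erroneous character as a candidate for the correct one.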
[0084] Optionally, when a user performs an information query, they can input query information (such as target query text). When the electronic device performs an information query based on the target query text, it can first perform text correction processing on the target query text before performing the information query. Alternatively, when performing an information query on the target query text, if the retrieved recall information is less than a specified number, it indicates that there may be an information input error, and text correction processing can be performed on it before performing the information query.
[0085] Therefore, the electronic device can acquire the target query text, perform text correction processing on it based on the dictionary of similar characters to obtain a corrected query text, and then perform the information query using the corrected query text in place of the target query text. This improves the effectiveness of information retrieval. For example, in product searches, product information related to the target query text can be retrieved more accurately, thereby improving product conversion rates and the overall search experience.
[0086] For example, when obtaining the query information input by the user, the system performs error character recognition on the query information. For instance, it uses a pre-trained similar-character recognition model to identify erroneous characters contained in the query information. These erroneous characters are those that may contain similar-character errors. The system then searches for the similar-characters corresponding to the identified erroneous characters in the similar-character dictionary as candidate correction characters. Based on these candidate correction characters, the system corrects the erroneous characters in the query information to obtain the corrected query information. The system then performs error character recognition on the corrected query information. If it is determined that the corrected query information does not contain erroneous characters, the corrected query information is identified as the correct query information (corrected query text), and the query search service is then performed again based on this correct query information.
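The correction flow described above can be sketched as follows. This is a simplified illustration, not the claimed implementation: `find_errors` is a hypothetical callable standing in for the pre-trained recognition model and is assumed to return the positions of suspected erroneous characters, and the sketch replaces only one character per query.

```python
def correct_query(query, similar_dictionary, find_errors):
    """Try dictionary candidates for each suspected character; keep the first
    replacement that leaves no suspected error."""
    for position in find_errors(query):
        candidates = sorted(similar_dictionary.get(query[position], ()))
        for candidate in candidates:
            corrected = query[:position] + candidate + query[position + 1:]
            if not find_errors(corrected):  # re-check the corrected query
                return corrected
    return query  # nothing to fix, or no candidate passed the re-check


# Toy detector: any query outside a known vocabulary is fully suspect.
known_queries = {"cat"}


def toy_find_errors(text):
    return [] if text in known_queries else list(range(len(text)))


print(correct_query("cot", {"o": {"a", "e"}}, toy_find_errors))  # cat
```

Returning the original query when no candidate passes the re-check keeps the search service usable even when the dictionary has no suitable entry.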
[0087] In this embodiment, query text pairs associated with information query behavior can be obtained. Character alignment processing is performed on the first and second text sets to obtain the character alignment interval between them, and this interval is used to determine the first and second text subsets. The first characters in the first text subset are aligned with the second characters in the second text subset, meaning they may be similar characters. Therefore, the information matching degree between the first and second characters can be determined based on the first character information and the second character information. When the information matching degree indicates that the first and second characters are similar characters, a dictionary of similar characters associated with the information query behavior is constructed using the first and second characters. In this way, the query text pairs associated with the information query behavior serve as a data source from which the first and second character sets that may contain similar-looking characters can be found quickly, and the information of the first and second characters allows similar characters in those sets to be identified more accurately. At the same time, this dictionary of similar characters is strongly correlated with the user's query intent. While reducing the omission of similar-looking characters, it is more likely to include the similar characters that users actually encounter during the query process, and it excludes most similar characters that are irrelevant to the information query behavior. This ensures the quality and accuracy of the constructed dictionary of similar characters.
[0088] Based on the foregoing description, this application proposes a text processing method that can be executed by the aforementioned electronic device. Please refer to Figure 5, which is a flowchart illustrating a text processing method provided in an embodiment of this application.
[0089] As shown in Figure 5, the flow of the text processing method in this application embodiment may include the following:
[0090] S201. Obtain the query text pair associated with the information query behavior.
[0091] S202. Perform character alignment processing on the first character set and the second character set to obtain the character alignment interval between the first character set and the second character set. The specific implementation methods of steps S201-S202 can be found in the relevant descriptions of the above embodiments, and will not be repeated here.
[0092] S203. Obtain the first text information of the first character and the second text information of the second character, and determine the text association information between the first character and the second character based on the first text information and the second text information.
[0093] The first text information may include at least one of the following: first glyph structure information, first phonetic structure information, and first converted image information of the first character. The second text information may include at least one of the following: second glyph structure information, second phonetic structure information, and second converted image information of the second character.
[0094] Optionally, taking the first glyph structure information as an example, it may include at least one of the following: the set of character elements obtained after splitting the first character, the font structure of the first character (such as top-bottom structure, left-right structure, etc.), the radical structure of the first character, the four-corner encoding information of the first character, the number of strokes of the first character, etc.
[0095] Optionally, taking the first phonetic structure information as an example, it may include at least one of the following: the pinyin information of the first character, the tone information of the first character, etc.
[0096] Optionally, taking the first converted image information as an example, it can refer to the character image obtained by converting the first character. For example, a k*k two-dimensional pixel matrix (e.g., with k being 64) is generated for the first character, and this two-dimensional pixel matrix is determined as the first converted image information.
[0097] When the first text information includes the first glyph structure information and the second text information includes the second glyph structure information, the text association information may include the associated glyph structure information determined from the first glyph structure information and the second glyph structure information.
[0098] Optionally, when the first glyph structure information includes the set of constituent elements of the first character and the second glyph structure information includes the set of constituent elements of the second character, the associated glyph structure information can be determined from the constituent elements shared by the two sets. For example, the smaller of the number of constituent elements in the first character's set and the number in the second character's set is taken as the minimum number, and the ratio of the number of shared constituent elements to the minimum number is determined as the associated glyph structure information corresponding to the set of constituent elements. Alternatively, when this ratio is greater than a ratio threshold, a first value (e.g., 1) is used as the associated glyph structure information corresponding to the set of constituent elements; when this ratio is less than or equal to the ratio threshold, a second value (e.g., 0) is used.
[0099] Optionally, when the first glyph structure information includes the font structure of the first character and the second glyph structure information includes the font structure of the second character: if the font structure of the first character is the same as that of the second character, the first value (e.g., 1) is used as the associated glyph structure information corresponding to the font structure; if the two font structures differ, the second value (e.g., 0) is used.
[0100] Optionally, when the first glyph structure information includes the radical structure of the first character and the second glyph structure information includes the radical structure of the second character: if the radical structure of the first character is the same as that of the second character, the first value (e.g., 1) is used as the associated glyph structure information corresponding to the radical structure; if the two radical structures differ, the second value (e.g., 0) is used.
[0101] Optionally, when the first glyph structure information includes the four-corner encoding information of the first character and the second glyph structure information includes the four-corner encoding information of the second character: if the four-corner encoding information of the first character is the same as that of the second character, the first value (e.g., 1) is used as the associated glyph structure information corresponding to the four-corner encoding information; if the two encodings differ, the second value (e.g., 0) is used.
[0102] Optionally, when the first glyph structure information includes the number of strokes of the first character and the second glyph structure information includes the number of strokes of the second character, the character association information can be determined from the difference between the number of strokes of the first character and the number of strokes of the second character. For example, the stroke-count difference itself can be determined as the associated glyph structure information corresponding to the number of strokes. Alternatively, when the stroke-count difference is less than a difference threshold, the first value (e.g., 1) is used as the associated glyph structure information corresponding to the number of strokes; when the stroke-count difference is greater than or equal to the difference threshold, the second value (e.g., 0) is used.
[0103] Optionally, when the first glyph structure information includes the set of character-forming elements of the first character and the second glyph structure information includes the set of character-forming elements of the second character, the character association information can be determined from these two sets. For example, when the set of character-forming elements of the first character is contained in the set of character-forming elements of the second character, or the set of the second character is contained in the set of the first character, a character inclusion relationship exists between the first character and the second character. For instance, if the set of character-forming elements of the first character includes "bian" and the set of character-forming elements of the second character includes "纟, bian", the set of the first character is contained in the set of the second character; an inclusion relationship therefore exists between the two characters, that is, the second character includes the first character. The indication information (such as 1) indicating that the character inclusion relationship exists can be determined as the associated glyph structure information corresponding to the character relationship. Conversely, when neither set is contained in the other, no character inclusion relationship exists between the first character and the second character, and the indication information (such as 0) indicating that no character inclusion relationship exists can be determined as the associated glyph structure information corresponding to the character relationship.
[0104] Thus, associated glyph structure information in one or more dimensions can be obtained through the above methods.
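The two glyph-level association features described above (stroke-count difference and component-set inclusion) can be sketched as follows. This is only an illustrative sketch: the function names are hypothetical, taking the absolute value of the difference is an assumption, and the source does not specify how stroke counts or character-forming elements are looked up.

```python
from typing import Optional, Set


def stroke_difference_feature(strokes_a: int, strokes_b: int,
                              threshold: Optional[int] = None) -> int:
    """Return the stroke-count difference, or a 1/0 flag when a threshold is given.

    Assumption: the difference is taken as an absolute value.
    """
    diff = abs(strokes_a - strokes_b)
    if threshold is None:
        return diff
    # First value (1) when the difference is below the threshold,
    # second value (0) when it is greater than or equal to the threshold.
    return 1 if diff < threshold else 0


def inclusion_feature(components_a: Set[str], components_b: Set[str]) -> int:
    """Return 1 if either component set is contained in the other, else 0."""
    if components_a <= components_b or components_b <= components_a:
        return 1
    return 0


# Mirrors the example in the text: {"bian"} is contained in {"纟", "bian"}.
print(inclusion_feature({"bian"}, {"纟", "bian"}))
print(stroke_difference_feature(9, 12, threshold=4))
```

Returning plain integers keeps both features directly usable as entries of a numeric feature sequence.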
[0105] When the first character information includes first phonetic-glyph structure information and the second character information includes second phonetic-glyph structure information, the character association information can include associated phonetic-glyph structure information determined from the first phonetic-glyph structure information and the second phonetic-glyph structure information.
[0106] Optionally, when the first phonetic-glyph structure information includes the pinyin information of the first character and the second phonetic-glyph structure information includes the pinyin information of the second character, the associated phonetic-glyph structure information can be determined from the pinyin information of the two characters. For example, the pinyin edit distance between the pinyin information of the first character and the pinyin information of the second character can be obtained and used as the associated phonetic-glyph structure information corresponding to the pinyin information. Alternatively, when the pinyin edit distance is greater than an edit distance threshold, a first value (e.g., 1) is used as the associated phonetic-glyph structure information corresponding to the pinyin information; when the pinyin edit distance is less than or equal to the edit distance threshold, a second value (e.g., 0) is used as the associated phonetic-glyph structure information corresponding to the pinyin information.
[0107] Optionally, when the first phonetic-glyph structure information includes the tone information of the first character and the second phonetic-glyph structure information includes the tone information of the second character: when the tone information of the first character is the same as the tone information of the second character, a first value (e.g., 1) is used as the associated phonetic-glyph structure information corresponding to the tone information; when the tone information of the first character is different from the tone information of the second character, a second value (e.g., 0) is used as the associated phonetic-glyph structure information corresponding to the tone information.
[0108] Thus, associated phonetic-glyph structure information in one or more dimensions can be obtained through the above methods.
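As an illustration of the pinyin and tone features, the sketch below computes an edit distance between two pinyin strings and a tone-equality flag. It assumes the "pinyin edit distance" is the standard Levenshtein distance over toneless pinyin letters, which the source does not state explicitly; the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings, using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def tone_feature(tone_a: int, tone_b: int) -> int:
    """Return 1 when the two tones are the same, otherwise 0."""
    return 1 if tone_a == tone_b else 0


print(edit_distance("bian", "pian"))   # one substitution
print(tone_feature(1, 4))
```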
[0109] When the first text information includes first converted image information and the second text information includes second converted image information, the text association information may include associated converted image information determined from the first converted image information and the second converted image information. For example, first image features can be extracted from the text image of the first text and second image features from the text image of the second text, and the feature similarity between the first image features and the second image features can be used as the associated converted image information.
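The source does not say how the image features are extracted or which similarity measure is used. As one possible reading, the sketch below assumes the features are already available as equal-length numeric vectors and uses cosine similarity; both choices are assumptions made for illustration only.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Convention chosen here: similarity with a zero vector is 0.
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical feature vectors for the rendered images of two characters.
print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.85, 0.05]))
```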
[0110] S204. Based on the first text information, the second text information, and the text association information, determine the information matching degree between the first text and the second text.
[0111] Determining the information matching degree between the first character and the second character can be achieved by: determining the text integration features between the first character and the second character based on the first character information, the second character information, and the text association information; and determining the information matching degree between the first character and the second character based on the text integration features.
[0112] Specifically, an information feature sequence consisting of the first text information, the second text information, and the text association information can be obtained, and the text integration features can be determined based on this information feature sequence. For example, the information feature sequence itself can be used as the text integration features. Alternatively, a normalized version of the information feature sequence can be used as the text integration features. No limitation is made here.
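A minimal sketch of building the information feature sequence and an optional normalized version is given below. It assumes every piece of information has already been encoded as numbers, and it uses min-max scaling as one possible normalization; the source does not name a normalization method, and the function names are hypothetical.

```python
from typing import List, Sequence


def build_feature_sequence(first_info: Sequence[float],
                           second_info: Sequence[float],
                           association_info: Sequence[float]) -> List[float]:
    """Concatenate the three groups of information into one sequence."""
    return [float(x) for x in (*first_info, *second_info, *association_info)]


def min_max_normalize(seq: Sequence[float]) -> List[float]:
    """Scale a sequence into [0, 1]; a constant sequence maps to all zeros."""
    if not seq:
        return []
    lo, hi = min(seq), max(seq)
    if hi == lo:
        return [0.0 for _ in seq]
    return [(x - lo) / (hi - lo) for x in seq]


sequence = build_feature_sequence([9, 1], [12, 1], [3, 1, 0])
print(sequence)
print(min_max_normalize(sequence))
```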
[0113] For example, the first text information includes first character shape structure information, first phonetic-glyph structure information, and first converted image information; the second text information includes second character shape structure information, second phonetic-glyph structure information, and second converted image information; and the text association information includes associated character shape structure information, associated phonetic-glyph structure information, and associated converted image information. In this case, the information feature sequence determined from these nine items of information can be obtained, and the text integration features between the first character and the second character can be determined based on it.
[0114] For example, target information can be selected from all or part of the following dimensions: the first character shape structure information, the first phonetic-glyph structure information, the first converted image information, the second character shape structure information, the second phonetic-glyph structure information, the second converted image information, the associated character shape structure information, the associated phonetic-glyph structure information, and the associated converted image information. For any one item, such as the first character shape structure information, the selection may take all of its dimensions, only some specified dimensions, or none of them. For example, when determining whether the first character and the second character are similar in form, none of the first phonetic-glyph structure information needs to be selected; when determining whether the first character and the second character are homophones, none of the first character shape structure information needs to be selected; the same applies to the other items. The sequence formed by the target information is then used as the information feature sequence. The specific selection can be set by relevant business personnel and is not limited here.
[0115] For example, the first character structure information includes the set of character-forming elements, font structure, radical structure, four-corner coding information, and number of strokes of the first character, and the second character structure information includes the same items for the second character. The first phonetic-glyph structure information includes the pinyin information and tone information of the first character, and the second phonetic-glyph structure information includes the pinyin information and tone information of the second character. The first character information also includes the first converted image information, and the second character information also includes the second converted image information. The associated character structure information in the character association information includes the associated character structure information corresponding to the set of character-forming elements, to the font structure, to the radical structure, to the four-corner coding information, to the number of strokes, and to the character relationship. The associated phonetic-glyph structure information in the character association information includes the associated phonetic-glyph structure information corresponding to the pinyin information and to the tone information. The character association information also includes the associated converted image information. In this case, target information 1 obtained from the first character structure information can be the font structure, radical structure, and four-corner coding information of the first character; target information 2 obtained from the second character structure information can be the font structure, radical structure, and four-corner coding information of the second character; and target information 3 obtained from the character association information can be the associated character structure information corresponding to the set of character-forming elements, to the number of strokes, and to the character relationship, the associated phonetic-glyph structure information corresponding to the pinyin information and to the tone information, and the associated converted image information. The sequence formed by target information 1, target information 2, and target information 3 can then be used as the information feature sequence.
[0116] One approach is to use a trained model to predict the information matching degree between the first and second characters based on the text integration features. For example, this could involve acquiring a text information processing model associated with the information matching degree; inputting the text integration features into the text information processing model; and then having the model predict the information matching degree between the first and second characters based on the text integration features.
[0117] The text information processing model can be any suitable machine learning model, such as an XGBoost model (extreme gradient boosting tree model). This model includes at least one decision tree used as a classifier. When making predictions based on the text integration features, the text information processing model can perform feature partitioning on the text integration features to predict the leaf node reached on each of the at least one decision tree. Based on the node parameters corresponding to those leaf nodes, the information matching degree between the first and second characters can be determined. For example, the sum (or average) of the parameter values of the node parameters corresponding to the reached leaf nodes can be used as the information matching degree.
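The leaf-based scoring described above can be illustrated with a small hand-written tree ensemble. This is a toy sketch, not the XGBoost library: the `Node` class, the split rule (go left when the feature value is below the threshold), and all numeric values are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    feature: Optional[int] = None    # index into the feature vector; None marks a leaf
    threshold: float = 0.0           # split point for internal nodes
    left: Optional["Node"] = None    # branch taken when value < threshold
    right: Optional["Node"] = None   # branch taken otherwise
    value: float = 0.0               # node parameter stored on a leaf


def leaf_value(tree: Node, features: Sequence[float]) -> float:
    """Walk one tree from the root to the leaf selected by the features."""
    node = tree
    while node.feature is not None:
        node = node.left if features[node.feature] < node.threshold else node.right
    return node.value


def matching_degree(trees: List[Node], features: Sequence[float],
                    average: bool = False) -> float:
    """Sum (or average) the leaf parameters reached on every tree."""
    values = [leaf_value(tree, features) for tree in trees]
    total = sum(values)
    return total / len(values) if average else total


tree_1 = Node(feature=0, threshold=0.5, left=Node(value=0.1), right=Node(value=0.6))
tree_2 = Node(feature=1, threshold=2.0, left=Node(value=0.3), right=Node(value=-0.2))
print(matching_degree([tree_1, tree_2], [1.0, 1.0]))
```

A trained gradient boosting library would learn the split points and leaf parameters from labeled pairs; the traversal and aggregation shown here only mirror the sum-or-average step named in the text.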
[0118] When the text information processing model is an extreme gradient boosting tree model, it can be obtained as follows. An initial processing model to be trained and training sample data pairs for training it are obtained; the initial processing model includes at least one decision tree to be trained, and each training sample data pair includes a first sample text and a second sample text. The first sample text information of the first sample text and the second sample text information of the second sample text are obtained, and the sample text association information between the first sample text and the second sample text is determined based on the first sample text information and the second sample text information. The sample text integration features between the first sample text and the second sample text are determined based on the first sample text information, the second sample text information, and the sample text association information. The sample text integration features are input into the initial processing model, which performs feature partitioning on them to predict the leaf nodes reached on the at least one decision tree to be trained. The sample information matching degree between the first sample text and the second sample text is determined based on the node parameters corresponding to those leaf nodes. The initial processing model is trained based on the sample information matching degree to obtain a trained target processing model, which includes at least one trained decision tree and is used as the text information processing model. The training sample data pairs are labeled with information matching degree tags. The prediction deviation of the initial processing model can be determined from the sample information matching degree and the information matching degree tags, and the initial processing model is trained using this prediction deviation until it converges.
[0119] Alternatively, some or all of the information from the first character structure information, the first phonetic structure information, the first converted image information, the second character structure information, the second phonetic structure information, the second converted image information, the associated character structure information, and the associated phonetic structure information (such as the target information 1-3 in the example above) can be directly input into the text information processing model to obtain the information matching degree.
[0120] Therefore, similar-looking characters can be identified and mined by comparing the phonetic and glyphic structures of the first and second characters. By using query records as the data source for similar-looking character mining, a precise, comprehensive dictionary of similar-looking characters that conforms to downstream application scenarios can be constructed.
[0121] S205. When determining that the first character and the second character are similar characters based on the information matching degree between the first character and the second character, construct a similar character dictionary associated with the information query behavior using the first character and the second character.
[0122] It can be understood that when the information matching degree between the first and second characters indicates that they are similar characters, both characters can be added to the similar character dictionary. For example, when determining whether the first and second characters are similar in form, if the information matching degree is greater than or equal to a first matching degree threshold, the first and second characters are determined to be similar in form; if the information matching degree is less than the first matching degree threshold, they are determined not to be similar in form. Similarly, when determining whether the first and second characters are homophones, if the information matching degree is greater than or equal to a second matching degree threshold, the first and second characters are determined to be homophones; if the information matching degree is less than the second matching degree threshold, they are determined not to be homophones.
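The thresholding step can be sketched as below. The dictionary layout (each character mapped to the set of its similar characters, stored in both directions) and the example scores are assumptions; the source only states that both characters are added to the dictionary when the matching degree reaches the threshold.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def build_similar_dictionary(scored_pairs: Iterable[Tuple[str, str, float]],
                             threshold: float) -> Dict[str, Set[str]]:
    """Keep pairs whose matching degree is greater than or equal to the threshold."""
    dictionary: Dict[str, Set[str]] = defaultdict(set)
    for first, second, degree in scored_pairs:
        if degree >= threshold:
            # Assumption: similarity is stored symmetrically.
            dictionary[first].add(second)
            dictionary[second].add(first)
    return dict(dictionary)


# Hypothetical matching degrees for three candidate pairs.
pairs = [("己", "已", 0.92), ("未", "末", 0.80), ("天", "地", 0.10)]
print(build_similar_dictionary(pairs, threshold=0.8))
```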
[0123] Optionally, the above method uses query text pairs associated with information query behavior as the data source to determine the similar character dictionary. Alternatively, a Chinese character dictionary can be used as the data source to construct a similar character dictionary, which serves as the corresponding extended similar character dictionary. Thus, after performing similar character matching through the similar character dictionary, similar character matching can also be performed through the extended similar character dictionary. For example, if no similar-looking characters in the query term are matched from the similar-looking character dictionary corresponding to the query text pair, matching can continue from the extended similar-looking character dictionary to determine possible similar-looking characters in the query term. For example, two characters to be judged for similarity can be obtained from the Chinese character dictionary, and their similarity can be determined as described above. For example, characters with the same character structure (such as radicals) can be obtained from the Chinese character dictionary and paired to form similar character pairs; or, characters with the same phonetic structure (such as initials or finals) can be obtained from the Chinese character dictionary and paired to form similar character pairs. The information matching degree between the two characters in the similar character pair can be determined in the above manner to further determine whether the two characters are similar characters, thereby constructing the extended similar character dictionary.
[0124] Therefore, the process could be as follows: obtain a Chinese character dictionary; retrieve from the dictionary a third character and a fourth character that is suspected to be similar to the third character; obtain the third character information of the third character and the fourth character information of the fourth character; determine the information matching degree between the third and fourth characters based on the third and fourth character information; and when the third and fourth characters are determined to be similar characters based on the information matching degree, construct a similar character extended dictionary corresponding to the similar character dictionary using the third and fourth characters. Optionally, after the similar character extended dictionary is constructed, it can be deduplicated against the similar character dictionary; that is, similar characters already contained in the similar character dictionary can be deleted from the similar character extended dictionary, so that the similar characters remaining in the similar character extended dictionary serve as supplementary similar characters.
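The deduplication of the extended dictionary, and the fallback lookup in which the extended dictionary is consulted only when the query-derived dictionary has no match, can be sketched as follows. The dictionary layout (character mapped to a set of similar characters) and the function names are assumptions for illustration.

```python
from typing import Dict, Set


def deduplicate_extended(similar: Dict[str, Set[str]],
                         extended: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop from the extended dictionary every entry already in the main one."""
    result: Dict[str, Set[str]] = {}
    for char, candidates in extended.items():
        remaining = candidates - similar.get(char, set())
        if remaining:
            result[char] = remaining
    return result


def lookup(char: str, similar: Dict[str, Set[str]],
           extended: Dict[str, Set[str]]) -> Set[str]:
    """Match in the main dictionary first, then fall back to the extended one."""
    return similar.get(char) or extended.get(char, set())


similar = {"己": {"已"}}
extended = {"己": {"已", "巳"}, "未": {"末"}}
supplement = deduplicate_extended(similar, extended)
print(supplement)
print(lookup("未", similar, supplement))
```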
[0125] This can be achieved by identifying similar character pairs from a Chinese character dictionary and determining whether the two characters in the pair are indeed similar. The two characters in this pair are the third and fourth characters, which are suspected to be similar to each other. The specific method for determining whether the third and fourth characters are similar is the same as the method for determining whether the first and second characters are similar; please refer to the relevant descriptions above for details, which will not be repeated here.
[0126] In this embodiment, query text pairs associated with information query behavior can be obtained, and character alignment processing is performed on the first and second text sets to obtain the character alignment interval between them. This interval is used to determine the first and second text subsets. The first characters in the first text subset are aligned with the second characters in the second text subset, meaning they may be similar characters. Therefore, the information matching degree between the first and second characters can be determined based on the first character information and the second character information, and when the first and second characters are determined to be similar characters based on the information matching degree, a dictionary of similar characters associated with the information query behavior is constructed using the first and second characters. In this way, the query text pairs associated with the information query behavior serve as a data source from which the first and second character sets that may contain similar characters can be found quickly, and, based on the first and second character information, similar characters in those sets can be identified more accurately. At the same time, this dictionary of similar characters is strongly correlated with the user's query intent: it reduces the omission of similar characters, is more likely to include the similar characters that users actually encounter when querying, and excludes most similar characters that are irrelevant to the information query behavior. This helps ensure the quality and accuracy of the constructed dictionary of similar characters.
[0127] Please refer to Figure 6, which is a schematic diagram of a text processing apparatus provided in an embodiment of this application. It should be noted that the text processing apparatus shown in Figure 6 is used to execute the methods of the embodiments shown in Figure 2 and Figure 5 of this application. For ease of explanation, only the parts relevant to the embodiments of this application are shown; for specific technical details that are not disclosed, refer to the embodiments shown in Figure 2 and Figure 5 of this application. The text processing device 600 may include: an acquisition module 601 and a processing module 602. Wherein:
[0128] The acquisition module 601 is used to acquire query text pairs associated with information query behavior; the query text pairs include a first query text and a second query text; the text included in the first query text is used to form a first text set; the text included in the second query text is used to form a second text set.
[0129] The processing module 602 is used to perform character alignment processing on the first character set and the second character set to obtain the character alignment interval between the first character set and the second character set; the character set corresponding to the character alignment interval in the first character set is the first character subset, the character set corresponding to the character alignment interval in the second character set is the second character subset, and the first characters included in the first character subset are aligned with the second characters included in the second character subset.
[0130] The processing module 602 is also used to obtain the first text information of the first text and the second text information of the second text, and to determine the information matching degree between the first text and the second text based on the first text information and the second text information;
[0131] The processing module 602 is also used to construct a dictionary of similar characters associated with the information query behavior when it is determined that the first character and the second character are similar characters based on the information matching degree between the first character and the second character.
[0132] Specifically, when the acquisition module 601 acquires query text pairs associated with the information query behavior, it is used for:
[0133] Obtain the query information entered by the business object in two information query actions; the two information query actions are consecutive, and the time interval between them is less than a time interval threshold.
[0134] Based on the query information entered in the two information query actions, query text pairs are obtained.
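A minimal sketch of deriving query text pairs from consecutive queries is shown below. It assumes a per-user log of `(timestamp_in_seconds, query)` entries sorted by time; the log format, the function name, and the choice to skip identical consecutive queries are assumptions not taken from the source.

```python
from typing import List, Tuple


def consecutive_query_pairs(log: List[Tuple[float, str]],
                            max_gap: float) -> List[Tuple[str, str]]:
    """Pair each query with the next one when they are close enough in time."""
    pairs = []
    for (t1, q1), (t2, q2) in zip(log, log[1:]):
        # Keep the pair only if the interval is below the threshold.
        # Assumption: identical consecutive queries carry no correction signal.
        if t2 - t1 < max_gap and q1 != q2:
            pairs.append((q1, q2))
    return pairs


# Hypothetical log: a mistyped query is retyped 5 seconds later.
log = [(0.0, "自已动手"), (5.0, "自己动手"), (600.0, "口红推荐")]
print(consecutive_query_pairs(log, max_gap=30.0))
```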
[0135] Specifically, when the acquisition module 601 acquires query text pairs associated with the information query behavior, it is used for:
[0136] Retrieve the information interaction behavior of the business object in response to the entered query information during the information query process;
[0137] Based on the query information entered in the information query behavior and the recall information generated in the information interaction behavior, query text pairs are obtained.
[0138] The first character set includes N1 characters; the N1 characters include the i1th character and the j1th character; N1 is a positive integer greater than 1; i1 is less than j1, and i1 is a positive integer less than N1, and j1 is a positive integer less than or equal to N1; the second character set includes N2 characters; the N2 characters include the i2th character and the j2th character; i2 is less than j2, and i2 is a positive integer less than N2, and j2 is a positive integer less than or equal to N2; N2 is a positive integer greater than 1.
[0139] When processing module 602 performs character alignment processing on the first character set and the second character set to obtain the character alignment interval between the first character set and the second character set, it specifically performs the following:
[0140] Find the character that is the same as the i1th character among the N2 characters;
[0141] If the character found to be the same as the i1th character is the i2th character, then the i1th character is determined as the first aligned character, and at least one character after the i2th character is obtained from the N2 characters.
[0142] Find the character that is the same as the j1th character in at least one character;
[0143] If the character found to be the same as the j1th character is the j2th character, then the j1th character is determined as the second aligned character.
[0144] The aligned character set is determined based on the first aligned character and the second aligned character; the aligned character set includes B aligned characters, and the B aligned characters include the b-th aligned character and the (b+1)-th aligned character; B is a positive integer; b is a positive integer less than B;
[0145] Obtain a first interval consisting of the characters, among the N1 characters, located between the b-th aligned character and the (b+1)-th aligned character, and obtain a second interval consisting of the characters, among the N2 characters, located between the b-th aligned character and the (b+1)-th aligned character; the characters in the character set corresponding to the first interval are different from the characters in the character set corresponding to the second interval.
[0146] Obtain the character alignment interval based on the first interval and the second interval; the character set corresponding to the first interval is taken as the first character subset, and the character set corresponding to the second interval is taken as the second character subset.
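The alignment steps above can be sketched as a greedy, left-to-right anchor search. This is one reading of the described procedure rather than a definitive implementation: the function names are hypothetical, intervals where either side is empty are skipped by assumption, and text before the first or after the last aligned character is not examined because the steps only describe intervals between consecutive aligned characters.

```python
from typing import List, Tuple


def find_anchors(first: str, second: str) -> List[Tuple[int, int]]:
    """Greedily match characters of `first` to identical characters of `second`.

    After a match at position j in `second`, later characters are searched
    only after j, so the matched positions increase on both sides.
    """
    anchors = []
    start = 0
    for i, ch in enumerate(first):
        j = second.find(ch, start)
        if j != -1:
            anchors.append((i, j))
            start = j + 1
    return anchors


def alignment_intervals(first: str, second: str) -> List[Tuple[str, str]]:
    """Return the differing substrings between consecutive aligned characters."""
    anchors = find_anchors(first, second)
    intervals = []
    for (i_a, j_a), (i_b, j_b) in zip(anchors, anchors[1:]):
        sub_first = first[i_a + 1:i_b]
        sub_second = second[j_a + 1:j_b]
        if sub_first and sub_second:   # assumption: skip one-sided gaps
            intervals.append((sub_first, sub_second))
    return intervals


print(alignment_intervals("自已动手", "自己动手"))
```

In this example the characters 自, 动, and 手 are aligned, leaving 已 and 己 as the first and second character subsets whose similarity would then be scored.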
[0147] Specifically, when processing module 602 determines the information matching degree between the first text information and the second text information, it is used for:
[0148] Determine the text association information between the first text and the second text based on the first text information and the second text information;
[0149] Based on the first text information, the second text information, and the text association information, the text integration features between the first text and the second text are determined;
[0150] The degree of information matching between the first and second characters is determined based on text integration features.
[0151] The first text information includes: the first character shape structure information, the first phonetic-glyph structure information, and the first converted image information of the first character; the second text information includes: the second character shape structure information, the second phonetic-glyph structure information, and the second converted image information of the second character; the text association information includes: the associated character shape structure information determined by the first character shape structure information and the second character shape structure information, the associated phonetic-glyph structure information determined by the first phonetic-glyph structure information and the second phonetic-glyph structure information, and the associated converted image information determined by the first converted image information and the second converted image information;
[0152] When processing module 602 determines the text integration features between the first text and the second text based on the first text information, the second text information, and the text association information, it specifically performs the following:
[0153] Obtain the information feature sequence determined by the first character shape structure information, the first phonetic-glyph structure information, the first converted image information, the second character shape structure information, the second phonetic-glyph structure information, the second converted image information, the associated character shape structure information, the associated phonetic-glyph structure information, and the associated converted image information;
[0154] Based on the information feature sequence, the text integration features between the first and second characters are determined.
[0155] Specifically, when processing module 602 is used to determine the information matching degree between the first character and the second character based on text integration features, it is used for:
[0156] Obtain a text information processing model; the text information processing model includes at least one decision tree;
[0157] The text integration features are input into the text information processing model, which then performs feature segmentation on the text integration features and predicts the leaf nodes that will be segmented on at least one decision tree.
[0158] Based on the node parameters corresponding to the leaf nodes divided on at least one decision tree, the information matching degree between the first text and the second text is determined.
[0159] The processing module 602 is also used for:
[0160] Obtain a Chinese character dictionary, and from the dictionary, obtain the third character and a fourth character that is suspected to be similar to the third character;
[0161] Obtain the information of the third character and the information of the fourth character, and determine the degree of information matching between the third character and the fourth character based on the information of the third character and the fourth character.
[0162] When determining that the third and fourth characters are similar characters based on the information matching degree between them, an extended dictionary of similar characters is constructed using the third and fourth characters.
[0163] The processing module 602 is also used for:
[0164] Retrieve the target query text;
[0165] Based on a dictionary of similar characters, text correction processing is performed on the target query text to obtain the corrected query text; the corrected query text is used to perform information retrieval based on the target query text.
[0166] The specific implementation methods of the acquisition module and the processing module can be found in the description of the above embodiments, and will not be repeated here. It should be understood that the beneficial effects obtained by using the same method will also not be repeated here.
[0167] Please refer to Figure 7, which is a schematic diagram of the structure of an electronic device provided in an embodiment of this application. As shown in Figure 7, the electronic device 700 includes at least one processor 701 and a memory 702. Optionally, the electronic device may also include a network interface. The processor 701, memory 702, and network interface can exchange data. The network interface, controlled by the processor 701, is used to send and receive messages. The memory 702 stores computer programs, including program instructions. The processor 701 executes the program instructions stored in the memory 702. The processor 701 is configured to invoke the program instructions to execute the aforementioned method.
[0168] The memory 702 may include volatile memory, such as random-access memory (RAM); the memory 702 may also include non-volatile memory, such as flash memory, solid-state drive (SSD), etc.; the memory 702 may also include a combination of the above types of memory.
[0169] Processor 701 may be a central processing unit (CPU). In one embodiment, processor 701 may also be a graphics processing unit (GPU). Processor 701 may also be a combination of a CPU and a GPU. Processor 701 can invoke the device control application stored in memory 702 to perform the text processing method described in the embodiments corresponding to Figure 2 and Figure 5 above, and can also carry out the functions of the text processing device described in the embodiment corresponding to Figure 6 above, which will not be repeated here. Furthermore, the beneficial effects of using the same method will also not be repeated.
[0170] In specific implementations, the devices, processors, memory, etc., described in the embodiments of this application can execute the implementation methods described in the above method embodiments, or they can execute the implementation methods described in the embodiments of this application, which will not be repeated here.
[0171] This application also provides a computer-readable storage medium storing a computer program. The computer program includes program instructions, which, when executed by a processor, enable the processor to perform some or all of the steps described in the above method embodiments. Optionally, the computer storage medium can be volatile or non-volatile. The computer-readable storage medium may primarily include a program storage area and a data storage area. The program storage area may store an operating system, at least one application program required for a given function, etc.; the data storage area may store data created based on the use of blockchain nodes, etc.
[0172] This application provides a computer program product, which may include a computer program. When the computer program is executed by a processor, it can implement some or all of the steps in the above method, which will not be elaborated here.
[0173] In this application, "multiple" refers to two or more. "And / or" describes the relationship between related objects and indicates that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, or B alone. The character " / " generally indicates that the preceding and following related objects have an "or" relationship.
[0174] Those skilled in the art will understand that all or part of the processes in the above embodiments can be implemented by a computer program instructing related hardware. This computer program can be stored in a computer storage medium, which can be a computer-readable storage medium. When executed, the program can include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, optical disk, read-only memory (ROM), or random access memory (RAM), etc.
[0175] The above-disclosed embodiments are merely some of the embodiments of this application, and should not be construed as limiting the scope of this application. Those skilled in the art can understand that all or part of the processes for implementing the above embodiments, and equivalent changes made in accordance with the claims of this application, still fall within the scope of this application.
Claims
1. A text processing method, characterized in that, The method includes: Obtain query text pairs associated with information query behavior; the query text pairs include a first query text and a second query text; the characters included in the first query text are used to form a first character set; the characters included in the second query text are used to form a second character set; The first character set and the second character set are aligned to obtain the character alignment interval between the first character set and the second character set; the character set corresponding to the character alignment interval in the first character set is the first character subset, the character set corresponding to the character alignment interval in the second character set is the second character subset, and the first characters included in the first character subset are aligned with the second characters included in the second character subset; Obtain the first text information of the first character and the second text information of the second character, and determine the information matching degree between the first character and the second character based on the first text information and the second text information; When the first character and the second character are determined to be similar characters based on the information matching degree between them, a similar character dictionary associated with the information query behavior is constructed using the first character and the second character.
2. The method according to claim 1, characterized in that, The acquisition of query text pairs associated with the information query behavior includes: Obtain the query information entered by the business object in the two information query behaviors; the two information query behaviors are consecutive query behaviors, and the time interval between the occurrence of the behaviors is less than the time interval threshold; The query text pair is obtained based on the query information entered in the two information query actions.
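By way of illustration only, the pairing of consecutive queries under a time interval threshold could be sketched in Python as follows. The log format (a time-ordered list of `(timestamp_seconds, query)` tuples for a single business object), the function name, and the 30-second default are assumptions introduced for this example.

```python
from typing import List, Tuple

def consecutive_query_pairs(
    log: List[Tuple[float, str]],
    max_gap_seconds: float = 30.0,  # assumed threshold; no value is given in the text
) -> List[Tuple[str, str]]:
    """Pair each query with the next one when both occur within the time threshold."""
    pairs: List[Tuple[str, str]] = []
    for (t1, q1), (t2, q2) in zip(log, log[1:]):
        # Keep only distinct consecutive queries entered close together in time.
        if q1 != q2 and 0 <= t2 - t1 < max_gap_seconds:
            pairs.append((q1, q2))
    return pairs

session = [(0.0, "末来"), (5.0, "未来"), (100.0, "天气")]
query_pairs = consecutive_query_pairs(session)
```

Here the first two queries are five seconds apart and form a pair, while the third query falls outside the threshold and is ignored.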
3. The method according to claim 1, characterized in that, The acquisition of query text pairs associated with the information query behavior includes: Acquire the information interaction behavior of the business object in response to the entered query information during the information query behavior; The query text pair is obtained based on the query information entered in the information query behavior and the recall information generated in the information interaction behavior.
4. The method according to claim 1, characterized in that, The first character set includes N1 characters; the N1 characters include the i1-th character and the j1-th character; N1 is a positive integer greater than 1; i1 is less than j1, and i1 is a positive integer less than N1, and j1 is a positive integer less than or equal to N1; the second character set includes N2 characters; the N2 characters include the i2-th character and the j2-th character; i2 is less than j2, and i2 is a positive integer less than N2, and j2 is a positive integer less than or equal to N2; N2 is a positive integer greater than 1; The step of performing character alignment processing on the first character set and the second character set to obtain the character alignment interval between the first character set and the second character set includes: Find the character that is identical to the i1-th character among the N2 characters; If the character found to be identical to the i1-th character is the i2-th character, then the i1-th character is determined as the first aligned character, and at least one character following the i2-th character is obtained from the N2 characters; Find the character that is identical to the j1-th character among the at least one character; If the character found to be identical to the j1-th character is the j2-th character, then the j1-th character is determined as the second aligned character; A set of aligned characters is determined based on the first aligned character and the second aligned character; the set of aligned characters includes B aligned characters, and the B aligned characters include the b-th aligned character and the (b+1)-th aligned character; B is a positive integer; b is a positive integer less than B; Obtain a first interval consisting of the set of characters located between the b-th aligned character and the (b+1)-th aligned character from the N1 characters, and obtain a second interval consisting of the set of characters located between the b-th aligned character and the (b+1)-th aligned character from the N2 characters; the characters in the set of characters corresponding to the first interval are different from the characters in the set of characters corresponding to the second interval; When obtaining the character alignment interval based on the first interval and the second interval, the character set corresponding to the first interval is taken as the first character subset, and the character set corresponding to the second interval is taken as the second character subset.
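The alignment procedure recited above can be sketched in Python under a few stated assumptions. Identical characters are matched greedily from left to right to form the aligned characters, and the spans between consecutive aligned characters become candidate intervals. The requirement that both spans have equal length is an assumption added here so that characters can be paired position by position; the function names are hypothetical.

```python
from typing import List, Tuple

def find_anchors(first: str, second: str) -> List[Tuple[int, int]]:
    """Greedily match identical characters, searching only after the previous match."""
    anchors: List[Tuple[int, int]] = []
    start = 0
    for i, ch in enumerate(first):
        j = second.find(ch, start)
        if j != -1:
            anchors.append((i, j))
            start = j + 1  # later matches must follow this position
    return anchors

def alignment_intervals(first: str, second: str) -> List[Tuple[str, str]]:
    """Return (first_subset, second_subset) spans lying between consecutive anchors."""
    anchors = find_anchors(first, second)
    intervals: List[Tuple[str, str]] = []
    for (i1, i2), (j1, j2) in zip(anchors, anchors[1:]):
        gap1 = first[i1 + 1:j1]
        gap2 = second[i2 + 1:j2]
        # Keep spans that are non-empty, equally long (assumption), and share no character.
        if gap1 and len(gap1) == len(gap2) and not set(gap1) & set(gap2):
            intervals.append((gap1, gap2))
    return intervals

intervals = alignment_intervals("小红末来书", "小红未来书")
```

Because only spans between two aligned characters are considered, a differing character at the very start or end of a query is not captured by this sketch.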
5. The method according to claim 1, characterized in that, The step of determining the information matching degree between the first character and the second character based on the first text information and the second text information includes: Based on the first text information and the second text information, determine the text association information between the first character and the second character; Based on the first text information, the second text information, and the text association information, determine the text integration features between the first character and the second character; The degree of information matching between the first character and the second character is determined based on the text integration features.
6. The method according to claim 5, characterized in that, The first text information includes: first glyph structure information, first phonetic structure information, and first converted image information of the first character; the second text information includes: second glyph structure information, second phonetic structure information, and second converted image information of the second character; the text association information includes: associated glyph structure information determined by the first glyph structure information and the second glyph structure information, associated phonetic structure information determined by the first phonetic structure information and the second phonetic structure information, and associated converted image information determined by the first converted image information and the second converted image information; The step of determining the text integration features between the first character and the second character based on the first text information, the second text information, and the text association information includes: Obtain the information feature sequence determined by the first glyph structure information, the first phonetic structure information, the first converted image information, the second glyph structure information, the second phonetic structure information, the second converted image information, the associated glyph structure information, the associated phonetic structure information, and the associated converted image information; Based on the information feature sequence, the text integration features between the first character and the second character are determined.
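One possible way to picture the nine-part information feature sequence is the Python sketch below. The concrete encodings are invented for illustration: a set of components stands in for glyph structure information, a pinyin string for phonetic structure information, and a flattened 0/1 bitmap for converted image information. The application does not prescribe these encodings or the particular association measures used here.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple

@dataclass(frozen=True)
class CharInfo:
    components: FrozenSet[str]  # stand-in for glyph structure information
    pinyin: str                 # stand-in for phonetic structure information
    bitmap: Tuple[int, ...]     # stand-in for converted image information

def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pixel_overlap(a: Tuple[int, ...], b: Tuple[int, ...]) -> float:
    if len(a) != len(b) or not a:
        return 0.0
    return sum(1 for x, y in zip(a, b) if x == y) / len(a)

def integration_features(first: CharInfo, second: CharInfo) -> List[float]:
    """Concatenate per-character values and pairwise association values."""
    return [
        # Information of the first character.
        float(len(first.components)), float(len(first.pinyin)), float(sum(first.bitmap)),
        # Information of the second character.
        float(len(second.components)), float(len(second.pinyin)), float(sum(second.bitmap)),
        # Association information between the two characters.
        jaccard(first.components, second.components),
        1.0 if first.pinyin == second.pinyin else 0.0,
        pixel_overlap(first.bitmap, second.bitmap),
    ]

wei = CharInfo(frozenset({"一", "木"}), "wei", (1, 0, 1, 1))
mo = CharInfo(frozenset({"一", "木"}), "mo", (1, 1, 1, 1))
features = integration_features(wei, mo)
```

The resulting list is a fixed-length vector that a downstream model could consume as the text integration features.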
7. The method according to claim 5, characterized in that, Determining the information matching degree between the first character and the second character based on the text integration features includes: A text information processing model is obtained; the text information processing model includes at least one decision tree; The text integration features are input into the text information processing model, and the text information processing model performs feature splitting on the text integration features to determine the leaf node reached on each of the at least one decision tree; Based on the node parameters corresponding to the leaf nodes reached on the at least one decision tree, the information matching degree between the first character and the second character is determined.
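The tree-based scoring can be illustrated with a small hand-built ensemble in Python. Each tree routes the feature vector to a leaf, and the leaf parameters are combined into a matching degree. Summing the leaf values and applying a sigmoid is one common convention for boosted trees and is an assumption here; the application states only that the matching degree is determined from the node parameters of the leaf nodes. The `Node` class and the example trees are hypothetical.

```python
import math
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class Node:
    feature: int = 0                # index of the feature tested at this node
    threshold: float = 0.0          # go left if the feature value is below this
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0              # leaf parameter, used only at leaf nodes

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def leaf_value(tree: Node, features: Sequence[float]) -> float:
    """Walk from the root to a leaf and return the leaf parameter."""
    node = tree
    while not node.is_leaf():
        node = node.left if features[node.feature] < node.threshold else node.right
    return node.value

def matching_degree(trees: Sequence[Node], features: Sequence[float]) -> float:
    """Sum the leaf parameters over all trees and squash the sum into (0, 1)."""
    raw = sum(leaf_value(tree, features) for tree in trees)
    return 1.0 / (1.0 + math.exp(-raw))

# Two toy trees, each an internal node with two leaves.
trees = [
    Node(feature=0, threshold=0.5, left=Node(value=-1.0), right=Node(value=1.0)),
    Node(feature=1, threshold=0.3, left=Node(value=-0.5), right=Node(value=0.5)),
]
score = matching_degree(trees, [0.9, 0.8])
```

A pair would then be treated as similar characters when the score exceeds a chosen threshold.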
8. The method according to claim 1, characterized in that, The method further includes: Obtain a Chinese character dictionary, and from the Chinese character dictionary, obtain a third character and a fourth character that is suspected to be similar to the third character; Obtain the third character information of the third character and the fourth character information of the fourth character, and determine the information matching degree between the third character and the fourth character based on the third character information and the fourth character information; When the third and fourth characters are determined to be similar characters based on the information matching degree between them, an extended dictionary of similar characters corresponding to the dictionary of similar characters is constructed using the third and fourth characters.
9. The method according to claim 1, characterized in that, The method further includes: Retrieve the target query text; Based on the dictionary of similar characters, text correction processing is performed on the target query text to obtain the corrected query text; the corrected query text is used to perform information retrieval on the target query text.
10. A text processing device, characterized in that, The device includes: The acquisition module is used to acquire query text pairs associated with information query behavior; the query text pairs include a first query text and a second query text; the characters included in the first query text are used to form a first character set; the characters included in the second query text are used to form a second character set; The processing module is used to perform character alignment processing on the first character set and the second character set to obtain the character alignment interval between the first character set and the second character set; the character set corresponding to the character alignment interval in the first character set is a first character subset, the character set corresponding to the character alignment interval in the second character set is a second character subset, and the first characters included in the first character subset are aligned with the second characters included in the second character subset; The processing module is further configured to acquire first text information of the first character and second text information of the second character, and determine the information matching degree between the first character and the second character based on the first text information and the second text information; The processing module is further configured to construct a dictionary of similar characters associated with the information query behavior based on the first character and the second character when it is determined that the first character and the second character are similar characters based on the information matching degree between the first character and the second character.
11. An electronic device, characterized in that, The electronic device includes a processor and a memory, wherein the memory is used to store a computer program, the computer program including program instructions, and the processor is configured to invoke the program instructions to perform the method as described in any one of claims 1-9.
12. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the method as described in any one of claims 1-9.
13. A computer program product, characterized in that, The computer program product includes computer instructions that, when executed by a processor, implement the method as described in any one of claims 1-9.