[0039] The embodiments of the present application provide a data processing method and device to reduce the repetition rate between stored data.
[0040] The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only a part of the embodiments of the present application, rather than all the embodiments. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of this application.
[0041] Referring to Figure 1, which shows a schematic flowchart of an embodiment of a data processing method of the present application, the method of this embodiment may include:
[0042] 101. Obtain information to be stored.
[0043] Wherein, the information to be stored is composed of at least one set of character string sequences to be stored, and the character string sequence includes at least one character.
[0044] The characters in the string sequence may be Chinese characters, letters, or symbols.
[0045] The information to be stored may contain multiple items of information representing different categories. For example, when the information to be stored is contact information, it may include one or more items such as a contact name, a contact phone number, and a contact work phone number, each of which corresponds to a different string sequence. For example, the contact name may be Zhang San, in which case the string sequence "Zhang San" is a phrase composed of two Chinese characters.
[0046] 102. Perform word segmentation on each group of string sequences to be stored separately, to obtain multiple to-be-matched character strings segmented from the at least one group of string sequences.
[0047] In the embodiment of the present application, before the to-be-stored character string sequence contained in the to-be-stored information is stored, the to-be-stored character string sequence is segmented to obtain character strings derived from it.
[0048] For ease of distinction and description, a character string obtained by segmenting a character string sequence to be stored is called a character string to be matched.
[0049] 103. Respectively match each group of the string sequence to be stored and each string to be matched with the target information stored in the information database.
[0050] Wherein, the target information includes at least one set of target string sequences. Here, in order to distinguish it from the string sequence to be stored, the string sequence contained in the target information stored in the information database is called the target string sequence.
[0051] Before storing the information to be stored, the embodiment of the present application in effect performs a deduplication operation to avoid storing the same information repeatedly.
[0052] For deduplication, this application uses not only the entire information to be stored as a search keyword, but also each character string sequence to be stored contained in that information and each character string to be matched obtained by segmenting those sequences. All of these are used as search keywords, which makes the search matching finer-grained.
[0053] For example, when the information to be stored is ABC, and assuming that word segmentation yields A, AB, C, BC, and AC, it is necessary to check whether the information database contains target information whose matching degree with ABC, A, AB, C, BC, or AC meets the requirements.
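The keyword set described above can be sketched as follows. This is a minimal illustration rather than the application's implementation: the names `build_keywords` and `toy_segment` are hypothetical, and the toy segmenter simply returns the fixed segments of the ABC example instead of running a real word segmentation method.

```python
from typing import Callable, Iterable, List


def toy_segment(sequence: str) -> List[str]:
    """Hypothetical segmenter: returns the segments from the ABC example.

    A real system would call an actual word segmentation method here.
    """
    table = {"ABC": ["A", "AB", "C", "BC", "AC"]}
    return table.get(sequence, [])


def build_keywords(
    sequences: Iterable[str],
    segment: Callable[[str], List[str]],
) -> List[str]:
    """Collect each to-be-stored string sequence plus its segments.

    Order is preserved and duplicates are dropped, so every keyword is
    searched only once.
    """
    keywords: List[str] = []
    seen = set()
    for sequence in sequences:
        for keyword in [sequence, *segment(sequence)]:
            if keyword not in seen:
                seen.add(keyword)
                keywords.append(keyword)
    return keywords


print(build_keywords(["ABC"], toy_segment))
# ['ABC', 'A', 'AB', 'C', 'BC', 'AC']
```

Passing the segmenter in as a parameter keeps the keyword construction independent of whichever segmentation method is actually used.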
[0054] 104. When the information database contains no target information matching any of the at least one set of character string sequences to be stored or any of the multiple character strings to be matched, store the information to be stored in the information database.
[0055] If no target information that matches any string sequence to be stored or any string to be matched can be retrieved from the information database, this indicates that the information database holds no information duplicating the information to be stored. Storing the information to be stored in the information database in this case helps reduce repeated storage.
[0056] It is understandable that retrieving from the information database, based on the to-be-stored character string sequences and the to-be-matched character strings, the target information whose matching degree meets the requirements actually amounts to comparing the characters in the to-be-stored character string sequences and the to-be-matched character strings with the characters contained in the string sequences of the target information. The specific matching process can use any existing matching technology, which is not limited here.
[0057] In the embodiment of the present application, before the information to be stored is stored, the to-be-stored character string sequence contained in the to-be-stored information is segmented, and both the to-be-stored string sequence and the to-be-matched character strings obtained by segmentation are used as search keywords for searching and matching in the information database. This helps improve search accuracy and accurately locate target information similar to the information to be stored, and thereby helps reduce repeated storage of the same information.
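Steps 101 to 104 can be sketched end to end as below. This is an illustrative sketch under stated assumptions: the source leaves both the segmentation method and the matching technology open, so the whitespace segmenter, the substring matcher, and the list-of-lists database used here are stand-ins, and all function names are hypothetical.

```python
from typing import Callable, List


def whitespace_segment(sequence: str) -> List[str]:
    """Stand-in segmenter (assumption): split on whitespace."""
    parts = sequence.split()
    return parts if len(parts) > 1 else []


def matches(keyword: str, target_sequence: str) -> bool:
    """Stand-in matcher (assumption): keyword occurs in the target sequence."""
    return keyword in target_sequence


def store_if_new(
    info: List[str],
    database: List[List[str]],
    segment: Callable[[str], List[str]] = whitespace_segment,
) -> bool:
    """Store `info` only if no keyword matches stored target information."""
    # Step 102: segment each to-be-stored string sequence.
    keywords = list(info)
    for sequence in info:
        keywords.extend(segment(sequence))
    # Step 103: match every keyword against every stored target sequence.
    for target in database:
        for target_sequence in target:
            if any(matches(k, target_sequence) for k in keywords):
                return False  # a match exists, so nothing is stored
    # Step 104: no match found, store the information.
    database.append(list(info))
    return True


db = [["Zhang San", "555-0100"]]
print(store_if_new(["Li Si", "555-0199"], db))      # True
print(store_if_new(["Zhang Wei", "555-0123"], db))  # False
```

The second call is rejected because the segment "Zhang" matches the stored name, even though the full name "Zhang Wei" does not; this is the finer-grained matching that segmentation is meant to provide.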
[0058] It should be noted that in the embodiments of the present application, any existing word segmentation method, such as string-matching segmentation, intelligent word segmentation, or finest-grained word segmentation, can be used to segment each group of string sequences to be stored. For example, if the information to be stored is the string "What Zhang San said is indeed reasonable", the result of intelligent segmentation is "Zhang San | said | indeed | reasonable", whereas finest-grained segmentation yields eight tokens, because it additionally emits overlapping sub-words such as "San" (三).
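The difference between a coarse and a finest-grained segmentation can be illustrated with a toy dictionary. The source names the methods but does not specify them, so this sketch swaps in forward maximum matching as one common string-matching technique and a simple all-dictionary-words scan for the finest-grained case; the function names and the dictionary are assumptions.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str]) -> List[str]:
    """At each position, take the longest dictionary word that fits."""
    max_len = max([len(w) for w in dictionary] + [1])
    result: List[str] = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Fall back to a single character when no dictionary word fits.
            if candidate in dictionary or length == 1:
                result.append(candidate)
                i += length
                break
    return result


def finest_grained(text: str, dictionary: Set[str]) -> List[str]:
    """Emit every dictionary word found in the text, overlaps included.

    Results are ordered by start position, then by length.
    """
    found: List[str] = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in dictionary:
                found.append(text[start:end])
    return found


words = {"AB", "ABC", "BC", "CD", "D"}
print(forward_max_match("ABCD", words))  # ['ABC', 'D']
print(finest_grained("ABCD", words))     # ['AB', 'ABC', 'BC', 'CD', 'D']
```

The finest-grained result produces more keywords and therefore more chances to find a partial duplicate, at the cost of more lookups.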
[0059] Optionally, when the existing word segmentation method produces an ambiguity, a combined traversal method can be used: a set of disjoint character strings is selected from the segmented character strings as the strings to be matched. Here, disjoint means that the matched character string and the dictionary character string have no common part. For example, suppose the character string to be segmented is abcd, where the candidate words a, b, c, and d are sorted in the order in which they appear in the text. If a and b intersect, b and c intersect, and c and d do not intersect, then the text is cut into two segments, abc and d.
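One possible reading of the abcd example is an interval merge: candidate words that share characters are fused into a single span, and spans that share nothing stay separate. The source does not specify the algorithm, so the sketch below is an assumption; the `Span` representation, the function name, and the sample offsets are all hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open [start, end) character offsets


def merge_intersecting(spans: List[Span]) -> List[Span]:
    """Merge candidate words that share characters into one disjoint span."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlaps the previous group: extend that group.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


text = "0123456"
# Hypothetical candidates a, b, c, d: a/b overlap, b/c overlap, c/d do not.
candidates = [(0, 2), (1, 4), (3, 5), (5, 7)]
print([text[s:e] for s, e in merge_intersecting(candidates)])
# ['01234', '56']
```

The first three candidates collapse into one segment (the "abc" of the example) and the fourth remains on its own (the "d").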
[0060] It is understandable that, in the embodiments of the present application, information matching means matching the to-be-stored string sequences and the to-be-matched strings obtained by word segmentation against the target information stored in the information database. Because the target information consists of one or more string sequences, if the strings to be matched and the string sequences to be stored are matched directly against the string sequences of the target information, the granularity of those target string sequences may be relatively coarse, so the accuracy of search matching is not improved.
[0061] Therefore, optionally, while the target information is stored in the information database, target character strings associated with the target information can also be stored, where a target character string is a character string obtained by word segmentation of the target information. When searching and matching, each string sequence to be stored and each string to be matched can be matched in turn against the target string sequences and the target character strings corresponding to the target information, to determine whether there is target information or a target character string whose matching degree meets the requirements.
[0062] The target character strings corresponding to the target information may be obtained by segmenting the target information after it has been stored. However, since this application already segments the information to be stored before storing it, optionally, while the information to be stored is stored in the information database, the character strings to be matched can also be stored in the information database as associated information of the information to be stored. In this way, both the stored target information and the target character strings segmented from it are maintained in the information database.
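A database that keeps the segmented strings alongside each stored item might look like the following sketch. The class names `Record` and `InfoDatabase`, the in-memory list storage, and the use of exact equality as the matching rule are assumptions made for illustration; the source does not prescribe a storage layout or a matching technology.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """One piece of target information plus the strings segmented from it."""
    sequences: List[str]
    segments: List[str] = field(default_factory=list)


@dataclass
class InfoDatabase:
    records: List[Record] = field(default_factory=list)

    def add(self, sequences: List[str], segments: List[str]) -> None:
        # Store the segments as associated information of the stored record.
        self.records.append(Record(list(sequences), list(segments)))

    def find(self, keywords: List[str]) -> List[Record]:
        """Return records whose sequences or segments equal any keyword."""
        wanted = set(keywords)
        return [
            r for r in self.records
            if wanted & (set(r.sequences) | set(r.segments))
        ]


db = InfoDatabase()
db.add(["Zhang San", "555-0100"], ["Zhang", "San"])
print(len(db.find(["Zhang Wei", "Zhang", "Wei"])))  # 1
print(len(db.find(["Li Si", "Li", "Si"])))          # 0
```

In the first lookup the record is found only through the stored segment "Zhang". Because the segments were saved at insertion time, nothing has to be segmented again during the search.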
[0063] Referring to Figure 2, which shows a schematic flowchart of another embodiment of a data processing method of the present application, the method of this embodiment may include:
[0064] 201. Obtain information to be stored.
[0065] Wherein, the information to be stored is composed of at least one set of character string sequences to be stored, and the character string sequence includes at least one character.
[0066] 202. Perform word segmentation on each group of string sequences to be stored separately, to obtain a plurality of to-be-matched character strings segmented from the at least one set of character string sequences.
[0067] 203. Match each group of the string sequence to be stored and each string to be matched with the target information stored in the information database.
[0068] Wherein, the target information includes at least one set of target string sequences.
[0069] 204. When there is no target information matching at least one set of character string sequences to be stored and multiple character strings to be matched in the information database, store the information to be stored in the information database.
[0070] 205. When the information database contains at least one piece of target information matching the string sequence to be stored and/or the string to be matched, output prompt information.
[0071] The prompt information is used to indicate that target information whose matching degree with the information to be stored meets the requirements has been retrieved.
[0072] The prompt information can be output in the form of a dialog box, or it can be displayed directly on the information input page.
[0073] When the information database contains target information that matches a character string sequence in the at least one set of character string sequences to be stored, and/or matches one or more of the plurality of character strings to be matched, this means that the information database has already stored information that is partly or wholly the same as the content of the information to be stored. If the information to be stored were then stored, repeated storage could occur.
[0074] In this embodiment of the application, when it is detected that the information database contains target information that matches the information to be stored, prompt information is output to the user, so that the user can decide, according to the prompt information, whether to continue storing the information to be stored, which makes the data processing more user-friendly.
[0075] Further, after the prompt information is output, when a cancel instruction for the prompt information input by the user is received, the prompt information is cancelled.
[0076] Of course, after the prompt information is output, or after the prompt information is cancelled, if a storage instruction from the user for the information to be stored is received, the information to be stored is stored in the information database.
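The branch between steps 204 and 205 can be sketched as a single function. This is a simplification under stated assumptions: the `confirm` callback is hypothetical and folds "cancel the prompt" and "issue a storage instruction" into one yes/no answer, the database is a plain list of strings, and substring containment stands in for the unspecified matching technology.

```python
from typing import Callable, List


def store_with_prompt(
    info: str,
    keywords: List[str],
    database: List[str],
    confirm: Callable[[str], bool],
) -> bool:
    """Store `info`; if a keyword matches stored data, ask the user first."""
    hits = [t for t in database if any(k in t for k in keywords)]
    if hits:
        # Step 205: output prompt information describing the matches.
        prompt = f"Similar information already stored: {hits}. Store anyway?"
        if not confirm(prompt):
            return False  # user declined; nothing is stored
    # Step 204, or an explicit storage instruction from the user.
    database.append(info)
    return True


db = ["Chaoyang District Trading Company"]
new_info = "Beijing Chaoyang District No. 1 Trading Company"
print(store_with_prompt(new_info, ["Chaoyang", "Trading"], db, lambda p: False))
# False
print(store_with_prompt(new_info, ["Chaoyang", "Trading"], db, lambda p: True))
# True
```

In an interactive application the callback would open the dialog box or update the input page; in a batch import it could log the prompt and always decline.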
[0077] It can be understood that, depending on the storage requirements and on the system in which the information is stored, the information to be stored in the embodiments of the present application may take multiple forms. For example, the information to be stored may be customer information to be stored. The customer information to be stored may include string sequences to be stored that characterize one or more of the company name, the company industry, the person in charge of the company, and the contact number, where the company name, the company industry, and the person in charge of the company each correspond to a different string sequence.
[0078] To facilitate understanding of the embodiments of the present application, the following takes customer information as an example of the information to be stored. For example, if the customer information to be stored includes the company name "Beijing Chaoyang District No. 1 Trading Company", segmenting the customer information can yield the following word segments: "Beijing", "Chaoyang", "Chaoyang District", "No. 1", "Trading", "No. 1 Trading", "Company", and "Trading Company".
[0079] When searching, these word segments and "Beijing Chaoyang District No. 1 Trading Company" are used as keywords. Information matching the keywords is searched for in the customer information stored in the information database and in the word segmentation information derived from that customer information.
[0080] If no information related to any of these keywords is matched in the information database, the customer information is stored.
[0081] Assume that the information database stores "Chaoyang District Trading Company" together with the segments "Chaoyang", "Chaoyang District", "Trading", and "Trading Company" obtained by segmenting "Chaoyang District Trading Company". Then searching with "Beijing Chaoyang District No. 1 Trading Company" and its word segments retrieves "Chaoyang District Trading Company", "Trading Company", "Trading", and so on, which match the above word segments.
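The customer example can be traced in a few lines. This is a sketch under stated assumptions: the keyword list is written out by hand rather than produced by a segmenter, matching is exact string equality, and the function name `matching_records` and the dictionary layout are hypothetical.

```python
from typing import Dict, List

# Stored target information mapped to the segments stored with it.
stored: Dict[str, List[str]] = {
    "Chaoyang District Trading Company": [
        "Chaoyang", "Chaoyang District", "Trading", "Trading Company",
    ],
}

# Keywords: the full company name plus its word segments.
keywords = [
    "Beijing Chaoyang District No. 1 Trading Company",
    "Beijing", "Chaoyang", "Chaoyang District", "No. 1",
    "Trading", "No. 1 Trading", "Company", "Trading Company",
]


def matching_records(
    keywords: List[str], stored: Dict[str, List[str]]
) -> Dict[str, List[str]]:
    """For each stored record, list the keywords equal to its name or a segment."""
    result: Dict[str, List[str]] = {}
    for name, segments in stored.items():
        targets = {name, *segments}
        matched = [k for k in keywords if k in targets]
        if matched:
            result[name] = matched
    return result


print(matching_records(keywords, stored))
# {'Chaoyang District Trading Company': ['Chaoyang', 'Chaoyang District', 'Trading', 'Trading Company']}
```

Because the stored record is reached through four of the keywords, the new company name would be flagged as a likely duplicate instead of being stored silently.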
[0082] Corresponding to the data processing method of this application, this application also provides a data processing device. Referring to Figure 3, which shows a schematic structural diagram of an embodiment of a data processing device of the present application, the device of this embodiment includes: an information acquisition unit 301, a word segmentation unit 302, a matching unit 303, and a first storage unit 304.
[0083] The information acquisition unit 301 is configured to obtain information to be stored, where the information to be stored is composed of at least one set of character string sequences to be stored, and the character string sequence to be stored includes at least one character;
[0084] The word segmentation unit 302 is configured to perform word segmentation on each group of to-be-stored character string sequences separately, to obtain multiple to-be-matched character strings segmented from the at least one group of to-be-stored string sequences;
[0085] The matching unit 303 is configured to match each set of string sequences to be stored and each string to be matched, respectively, with target information stored in an information database, where the target information includes at least one set of target string sequences;
[0086] The first storage unit 304 is configured to store the information to be stored in the information database when the information database contains no target information matching the at least one set of character string sequences to be stored or the multiple character strings to be matched.
[0087] In the embodiment of the present application, before the information to be stored is stored, the to-be-stored character string sequence contained in the to-be-stored information is segmented, and both the to-be-stored string sequence and the to-be-matched character strings obtained by segmentation are used as search keywords for searching and matching in the information database. This helps improve search accuracy and accurately locate target information similar to the information to be stored, and thereby helps reduce repeated storage of the same information.
[0088] Referring to Figure 4, which shows a schematic structural diagram of an embodiment of a data processing device of the present application, the device of this embodiment differs from the device of the embodiment shown in Figure 3 in that:
[0089] In addition to the information acquisition unit 301, the word segmentation unit 302, the matching unit 303, and the first storage unit 304, the device of this embodiment also includes:
[0090] The second storage unit 305 is configured to store the to-be-matched character strings in the information database as associated information of the to-be-stored information, while the first storage unit stores the to-be-stored information in the information database.
[0091] For the information acquisition unit 301, the word segmentation unit 302, the matching unit 303, and the first storage unit 304, refer to the relevant description of the embodiment shown in Figure 3, which is not repeated here.
[0092] Optionally, in the embodiment of any one of the above devices of this application, the information database may include: stored target information and a target character string obtained by segmenting the target information;
[0093] Then the matching unit includes:
[0094] The matching subunit is used to respectively match each of the to-be-stored character string sequence and the to-be-matched character string with the target character string sequence and the target character string corresponding to the target information in the information database.
[0095] Optionally, in any one of the above device embodiments, the information acquisition unit includes:
[0096] The information acquisition subunit is used for acquiring customer information to be stored, and the customer information to be stored includes: a string sequence to be stored for characterizing one or more of the company name, the company industry, the person in charge of the company, and the contact number.
[0097] Referring to Figure 5, which shows a schematic structural diagram of another embodiment of a data processing device of the present application, the device of this embodiment differs from the devices of the previous embodiments in that:
[0098] The device of this embodiment may further include:
[0099] The prompt unit 306 is configured to output prompt information when the matching unit determines that the information database contains at least one piece of target information that matches the string sequence to be stored and/or the character string to be matched, where the prompt information is used to indicate that target information whose matching degree with the information to be stored meets the requirements has been retrieved.
[0100] The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments can be referred to one another. Since the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method part.
[0101] The above description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be obvious to those skilled in the art, and the general principles defined in this document can be implemented in other embodiments without departing from the spirit or scope of the application. Therefore, this application will not be limited to the embodiments shown in this document, but should conform to the widest scope consistent with the principles and novel features disclosed in this document.